Item Type | Journal Article |
---|---|
Author | Jacqueline Harding |
Author | Cameron Domenico Kirk-Giannini |
Abstract | The field of AI safety seeks to prevent or reduce the harms caused by AI systems. A simple and appealing account of what is distinctive of AI safety as a field holds that this feature is constitutive: a research project falls within the purview of AI safety just in case it aims to prevent or reduce the harms caused by AI systems. Call this appealingly simple account The Safety Conception of AI safety. Despite its simplicity and appeal, we argue that The Safety Conception is in tension with at least two trends in the ways AI safety researchers and organizations think and talk about AI safety: first, a tendency to characterize the goal of AI safety research in terms of catastrophic risks from future systems; second, the increasingly popular idea that AI safety can be thought of as a branch of safety engineering. Adopting the methodology of conceptual engineering, we argue that these trends are unfortunate: when we consider what concept of AI safety it would be best to have, there are compelling reasons to think that The Safety Conception is the answer. Descriptively, The Safety Conception allows us to see how work on topics that have historically been treated as central to the field of AI safety is continuous with work on topics that have historically been treated as more marginal, like bias, misinformation, and privacy. Normatively, taking The Safety Conception seriously means approaching all efforts to prevent or mitigate harms from AI systems based on their merits rather than drawing arbitrary distinctions between them. |
Date | 2025-06-24 |
Language | en |
Short Title | What is AI safety? |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-025-02367-z |
Accessed | 7/15/2025, 9:16:48 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-025-02367-z |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 7/15/2025, 9:16:49 AM |
Modified | 7/15/2025, 9:16:49 AM |

Item Type | Preprint |
---|---|
Author | Sydney Levine |
Author | Matija Franklin |
Author | Tan Zhi-Xuan |
Author | Secil Yanik Guyot |
Author | Lionel Wong |
Author | Daniel Kilov |
Author | Yejin Choi |
Author | Joshua B. Tenenbaum |
Author | Noah Goodman |
Author | Seth Lazar |
Author | Iason Gabriel |
Abstract | AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow – even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world. |
Date | 2025-06-20 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2506.17434 |
Accessed | 7/11/2025, 2:11:36 PM |
Extra | arXiv:2506.17434 [cs] |
DOI | 10.48550/arXiv.2506.17434 |
Repository | arXiv |
Archive ID | arXiv:2506.17434 |
Date Added | 7/11/2025, 2:11:36 PM |
Modified | 7/11/2025, 2:11:36 PM |
Comment | 24 pages, 10 figures |

Item Type | Journal Article |
---|---|
Author | Rhys Southan |
Author | Helena Ward |
Author | Jen Semler |
Abstract | Those who worry about a superintelligent AI destroying humanity often appeal to the instrumental convergence thesis—the claim that even if we don’t know what a superintelligence’s ultimate goals will be, we can expect it to pursue various instrumental goals which are useful for achieving most ends. In this paper, we argue that one of these proposed goals is mistaken. We argue that instrumental goal preservation—the claim that a rational agent will tend to preserve its goals because that makes it better at achieving its goals—is false on the basis of the timing problem: an agent which abandons or otherwise changes its goal does not thereby fail to take a required means for achieving a goal it has. Our argument draws on the distinction between means-rationality (adopting suitable means to achieve an end) and ends-rationality (choosing one’s ends based on reasons). Because proponents of the instrumental convergence thesis are concerned with means-rationality, we argue, they cannot avoid the timing problem. After defending our argument against several objections, we conclude by considering the implications our argument has for the rest of the instrumental convergence thesis and for AI safety more generally. |
Date | 2025-07-03 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-025-02370-4 |
Accessed | 7/15/2025, 10:50:21 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-025-02370-4 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 7/15/2025, 10:50:21 AM |
Modified | 7/15/2025, 10:50:21 AM |

Item Type | Journal Article |
---|---|
Author | Jakob Stenseke |
Abstract | How do you control a superintelligent artificial being given the possibility that its goals or actions might conflict with human interests? Over the past few decades, this concern – the AGI control problem – has remained a central challenge for research in AI safety. This paper develops and defends two arguments that provide pro tanto support for the following policy for those who worry about the AGI control problem: don’t talk about it. The first is the argument from counter-productivity, which states that unless kept secret, efforts to solve the control problem could be used by a misaligned AGI to counter those very efforts. The second is the argument from suspicion, which states that open discussions of the control problem may serve to make humanity appear threatening to an AGI, increasing the risk that the AGI perceives humanity as a threat. I consider objections to the arguments and find them unsuccessful. Yet, I also consider objections to the don’t-talk policy itself and find it inconclusive whether the policy should be adopted. Additionally, the paper examines whether the arguments extend to other areas of AI safety research, such as AGI alignment, and argues that they likely do, albeit not as directly. I conclude by offering recommendations on what one can safely talk about, regardless of whether the don’t-talk policy is ultimately adopted. |
Date | 2025-07-10 |
Language | en |
Short Title | Counter-productivity and suspicion |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-025-02379-9 |
Accessed | 7/15/2025, 10:49:51 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-025-02379-9 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 7/15/2025, 10:49:51 AM |
Modified | 7/15/2025, 10:49:51 AM |

Item Type | Journal Article |
---|---|
Author | Michael T. Stuart |
Abstract | This paper presents a new account of pragmatic understanding based on the idea that such understanding requires skills rather than abilities. Specifically, one has pragmatic understanding of an affordance space when one has, and is responsible for having, skills that facilitate the achievement of some aims using that affordance space. In science, having skills counts as having pragmatic understanding when the development of those skills is praiseworthy. Skills are different from abilities at least in the sense that they are task-specific, can be learned, and we have some cognitive control over their deployment. This paper considers how the use of AI in science facilitates or frustrates the achievement of this kind of understanding. I argue that we cannot properly ascribe this kind of understanding to any current or near-future algorithm itself. But there are ways that we can use AI algorithms to increase pragmatic understanding, namely, when we take advantage of their abilities to increase our own skills (as individuals or communities). This can happen when AI features in human-performed science as either a tool or a collaborator. |
Date | 2025-07-03 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-025-02336-6 |
Accessed | 7/15/2025, 10:50:34 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-025-02336-6 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 7/15/2025, 10:50:34 AM |
Modified | 7/15/2025, 10:50:34 AM |