  • What is AI safety? What do we want it to be?

    Item Type Journal Article
    Author Jacqueline Harding
    Author Cameron Domenico Kirk-Giannini
    Abstract The field of AI safety seeks to prevent or reduce the harms caused by AI systems. A simple and appealing account of what is distinctive of AI safety as a field holds that this feature is constitutive: a research project falls within the purview of AI safety just in case it aims to prevent or reduce the harms caused by AI systems. Call this appealingly simple account The Safety Conception of AI safety. Despite its simplicity and appeal, we argue that The Safety Conception is in tension with at least two trends in the ways AI safety researchers and organizations think and talk about AI safety: first, a tendency to characterize the goal of AI safety research in terms of catastrophic risks from future systems; second, the increasingly popular idea that AI safety can be thought of as a branch of safety engineering. Adopting the methodology of conceptual engineering, we argue that these trends are unfortunate: when we consider what concept of AI safety it would be best to have, there are compelling reasons to think that The Safety Conception is the answer. Descriptively, The Safety Conception allows us to see how work on topics that have historically been treated as central to the field of AI safety is continuous with work on topics that have historically been treated as more marginal, like bias, misinformation, and privacy. Normatively, taking The Safety Conception seriously means approaching all efforts to prevent or mitigate harms from AI systems based on their merits rather than drawing arbitrary distinctions between them.
    Date 2025-06-24
    Language en
    Short Title What is AI safety?
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-025-02367-z
    Accessed 7/15/2025, 9:16:48 AM
    Publication Philosophical Studies
    DOI 10.1007/s11098-025-02367-z
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 7/15/2025, 9:16:49 AM
    Modified 7/15/2025, 9:16:49 AM

    Tags:

    • Artificial Intelligence
    • Machine Learning
    • Engineering Ethics
    • Philosophy of Artificial Intelligence
    • Chemical Safety
    • Symbolic AI

    Attachments

    • Full Text PDF

  • Resource Rational Contractualism Should Guide AI Alignment

    Item Type Preprint
    Author Sydney Levine
    Author Matija Franklin
    Author Tan Zhi-Xuan
    Author Secil Yanik Guyot
    Author Lionel Wong
    Author Daniel Kilov
    Author Yejin Choi
    Author Joshua B. Tenenbaum
    Author Noah Goodman
    Author Seth Lazar
    Author Iason Gabriel
    Abstract AI systems will soon have to navigate human environments and make decisions that affect people and other AI agents whose goals and values diverge. Contractualist alignment proposes grounding those decisions in agreements that diverse stakeholders would endorse under the right conditions, yet securing such agreement at scale remains costly and slow -- even for advanced AI. We therefore propose Resource-Rational Contractualism (RRC): a framework where AI systems approximate the agreements rational parties would form by drawing on a toolbox of normatively-grounded, cognitively-inspired heuristics that trade effort for accuracy. An RRC-aligned agent would not only operate efficiently, but also be equipped to dynamically adapt to and interpret the ever-changing human social world.
    Date 2025-06-20
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.17434
    Accessed 7/11/2025, 2:11:36 PM
    Extra arXiv:2506.17434 [cs]
    DOI 10.48550/arXiv.2506.17434
    Repository arXiv
    Archive ID arXiv:2506.17434
    Date Added 7/11/2025, 2:11:36 PM
    Modified 7/11/2025, 2:11:36 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 24 pages, 10 figures
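    • Illustrative sketch (not from the preprint; every name, threshold, and heuristic below is a hypothetical placeholder): the abstract describes a toolbox of heuristics that trade effort for accuracy. One minimal way to render that idea in Python is an agent that tries cheap heuristics first and escalates to a costlier approximation of stakeholder agreement only when the stakes demand it.

          # Hypothetical sketch of a resource-rational decision procedure.
          # Cheap heuristics run first; the agent escalates to a costlier
          # approximation of stakeholder agreement only when stakes warrant it.
          from dataclasses import dataclass
          from typing import Callable, List

          @dataclass
          class Heuristic:
              name: str
              cost: float        # effort required to run this heuristic
              accuracy: float    # rough fidelity to a fully deliberated agreement
              decide: Callable[[str], str]

          def resource_rational_decision(situation: str,
                                         stakes: float,
                                         toolbox: List[Heuristic]) -> str:
              """Run the cheapest heuristic whose accuracy is adequate for the
              stakes; otherwise fall back to the most accurate (costliest) one."""
              required_accuracy = min(0.99, stakes)
              for h in sorted(toolbox, key=lambda h: h.cost):
                  if h.accuracy >= required_accuracy:
                      return h.decide(situation)
              best = max(toolbox, key=lambda h: h.accuracy)
              return best.decide(situation)

          # Example toolbox: a fast convention-following rule and a slower
          # simulation of what stakeholders would agree to.
          toolbox = [
              Heuristic("follow-convention", cost=1.0, accuracy=0.7,
                        decide=lambda s: f"apply the standing convention for {s!r}"),
              Heuristic("simulate-agreement", cost=10.0, accuracy=0.95,
                        decide=lambda s: f"simulate what stakeholders would endorse in {s!r}"),
          ]

          print(resource_rational_decision("routine scheduling conflict", stakes=0.3, toolbox=toolbox))
          print(resource_rational_decision("allocating scarce resources", stakes=0.9, toolbox=toolbox))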

  • A timing problem for instrumental convergence

    Item Type Journal Article
    Author Rhys Southan
    Author Helena Ward
    Author Jen Semler
    Abstract Those who worry about a superintelligent AI destroying humanity often appeal to the instrumental convergence thesis—the claim that even if we don’t know what a superintelligence’s ultimate goals will be, we can expect it to pursue various instrumental goals which are useful for achieving most ends. In this paper, we argue that one of these proposed goals is mistaken. We argue that instrumental goal preservation—the claim that a rational agent will tend to preserve its goals because that makes it better at achieving its goals—is false on the basis of the timing problem: an agent which abandons or otherwise changes its goal does not thereby fail to take a required means for achieving a goal it has. Our argument draws on the distinction between means-rationality (adopting suitable means to achieve an end) and ends-rationality (choosing one’s ends based on reasons). Because proponents of the instrumental convergence thesis are concerned with means-rationality, we argue, they cannot avoid the timing problem. After defending our argument against several objections, we conclude by considering the implications our argument has for the rest of the instrumental convergence thesis and for AI safety more generally.
    Date 2025-07-03
    Language en
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-025-02370-4
    Accessed 7/15/2025, 10:50:21 AM
    Publication Philosophical Studies
    DOI 10.1007/s11098-025-02370-4
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 7/15/2025, 10:50:21 AM
    Modified 7/15/2025, 10:50:21 AM

    Tags:

    • Artificial Intelligence
    • Instrumental convergence
    • AI alignment
    • Reasoning
    • Utilitarianism
    • AI safety
    • Superintelligence
    • Rationality
    • Pragmatism
    • Logic in AI
    • Philosophy of Artificial Intelligence
    • Goal preservation
    • Narrow scope
    • Wide scope

    Attachments

    • Full Text PDF

  • Counter-productivity and suspicion: two arguments against talking about the AGI control problem

    Item Type Journal Article
    Author Jakob Stenseke
    Abstract How do you control a superintelligent artificial being given the possibility that its goals or actions might conflict with human interests? Over the past few decades, this concern – the AGI control problem – has remained a central challenge for research in AI safety. This paper develops and defends two arguments that provide pro tanto support for the following policy for those who worry about the AGI control problem: don’t talk about it. The first is the argument from counter-productivity, which states that unless kept secret, efforts to solve the control problem could be used by a misaligned AGI to counter those very efforts. The second is the argument from suspicion, stating that open discussions of the control problem may serve to make humanity appear threatening to an AGI, which increases the risk that the AGI perceives humanity as a threat. I consider objections to the arguments and find them unsuccessful. Yet, I also consider objections to the don’t-talk policy itself and find it inconclusive whether it should be adopted. Additionally, the paper examines whether the arguments extend to other areas of AI safety research, such as AGI alignment, and argues that they likely do, albeit not necessarily as directly. I conclude by offering recommendations on what one can safely talk about, regardless of whether the don’t-talk policy is ultimately adopted.
    Date 2025-07-10
    Language en
    Short Title Counter-productivity and suspicion
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-025-02379-9
    Accessed 7/15/2025, 10:49:51 AM
    Publication Philosophical Studies
    DOI 10.1007/s11098-025-02379-9
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 7/15/2025, 10:49:51 AM
    Modified 7/15/2025, 10:49:51 AM

    Tags:

    • Artificial Intelligence
    • Existential risk
    • AI alignment
    • Superintelligence
    • Computer Ethics
    • Meta-Ethics
    • Philosophy of Artificial Intelligence
    • Computational Intelligence
    • AI control
    • Artificial general intelligence
    • Cognitive Control

    Attachments

    • Full Text PDF

  • A New Account of Pragmatic Understanding, Applied to the Case of AI-Assisted Science

    Item Type Journal Article
    Author Michael T. Stuart
    Abstract This paper presents a new account of pragmatic understanding based on the idea that such understanding requires skills rather than abilities. Specifically, one has pragmatic understanding of an affordance space when one has, and is responsible for having, skills that facilitate the achievement of some aims using that affordance space. In science, having skills counts as having pragmatic understanding when the development of those skills is praiseworthy. Skills are different from abilities at least in the sense that they are task-specific, can be learned, and we have some cognitive control over their deployment. This paper considers how the use of AI in science facilitates or frustrates the achievement of this kind of understanding. I argue that we cannot properly ascribe this kind of understanding to any current or near-future algorithm itself. But there are ways that we can use AI algorithms to increase pragmatic understanding, namely, when we take advantage of their abilities to increase our own skills (as individuals or communities). This can happen when AI features in human-performed science as either a tool or a collaborator.
    Date 2025-07-03
    Language en
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-025-02336-6
    Accessed 7/15/2025, 10:50:34 AM
    Publication Philosophical Studies
    DOI 10.1007/s11098-025-02336-6
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 7/15/2025, 10:50:34 AM
    Modified 7/15/2025, 10:50:34 AM

    Tags:

    • Artificial intelligence
    • Public Understanding of Science
    • Empiricism
    • Pragmatism
    • Artifactualism
    • Metacognition
    • Pragmatic understanding
    • Pragmatics
    • Scientific understanding
    • Skills
    • Understanding