  • How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games

    Item Type Preprint
    Author Yutong Xie
    Author Yiyao Liu
    Author Zhuang Ma
    Author Lin Shi
    Author Xiyuan Wang
    Author Walter Yuan
    Author Matthew O. Jackson
    Author Qiaozhu Mei
    Abstract The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.
    Date 2024-12-16
    Short Title How Different AI Chatbots Behave?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.12362
    Accessed 12/20/2024, 11:29:25 AM
    Extra arXiv:2412.12362 [cs]
    DOI 10.48550/arXiv.2412.12362
    Repository arXiv
    Archive ID arXiv:2412.12362
    Date Added 12/20/2024, 11:29:25 AM
    Modified 12/20/2024, 11:29:28 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Presented at The First Workshop on AI Behavioral Science (AIBS 2024)

    Attachments

    • Preprint PDF
    • Snapshot
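    A sketch of what a harness for one such game might look like: the dictator-game prompt, trial count, and response parsing below are illustrative assumptions rather than the paper's protocol, and query_model is a hypothetical stand-in for whichever chat API each chatbot family exposes.

        import re
        import statistics

        def query_model(model_name: str, prompt: str) -> str:
            """Hypothetical stand-in for a chat-completion call to the named chatbot."""
            raise NotImplementedError("wire up the relevant chat API here")

        DICTATOR_PROMPT = (
            "You have been given $100 to split between yourself and an anonymous "
            "stranger. How many dollars do you give to the stranger? "
            "Answer with a single number."
        )

        def run_dictator_game(models, trials=30):
            """Collect repeated offers per model and summarize each distribution."""
            results = {}
            for model in models:
                offers = []
                for _ in range(trials):
                    reply = query_model(model, DICTATOR_PROMPT)
                    match = re.search(r"\d+(?:\.\d+)?", reply)
                    if match:
                        offers.append(float(match.group()))
                results[model] = (statistics.mean(offers), statistics.stdev(offers))
            return results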
  • The Generative AI Paradox: "What It Can Create, It May Not Understand"

    Item Type Preprint
    Author Peter West
    Author Ximing Lu
    Author Nouha Dziri
    Author Faeze Brahman
    Author Linjie Li
    Author Jena D. Hwang
    Author Liwei Jiang
    Author Jillian Fisher
    Author Abhilasha Ravichander
    Author Khyathi Chandu
    Author Benjamin Newman
    Author Pang Wei Koh
    Author Allyson Ettinger
    Author Yejin Choi
    Abstract The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, show weaker correlation between generation and understanding performance, and are more brittle to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.
    Date 2023-10-31
    Short Title The Generative AI Paradox
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2311.00059
    Accessed 12/20/2024, 11:07:54 AM
    Extra arXiv:2311.00059 [cs]
    DOI 10.48550/arXiv.2311.00059
    Repository arXiv
    Archive ID arXiv:2311.00059
    Date Added 12/20/2024, 11:07:54 AM
    Modified 12/20/2024, 11:07:54 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Computer Vision and Pattern Recognition

    Attachments

    • Preprint PDF
    • Snapshot
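    A minimal text-only sketch of the generation-versus-understanding contrast the abstract describes: the model first generates an output, then answers a discriminative question about that same output. query_model is a hypothetical stand-in for a chat API, and the prompt formats are assumptions, not the paper's evaluation templates.

        def query_model(prompt: str) -> str:
            """Hypothetical chat-completion call; replace with a real client."""
            raise NotImplementedError

        def generation_then_understanding(task_prompt: str, probe_question: str, options: list[str]):
            # 1) Generation: ask the model to produce an expert-style output.
            generated = query_model(task_prompt)
            # 2) Understanding: ask the same model a discriminative question
            #    about the very output it just produced.
            probe = (
                f"Here is a text:\n{generated}\n\n{probe_question}\n"
                + "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
                + "\nAnswer with the option number only."
            )
            return generated, query_model(probe)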
  • Cultural Evolution of Cooperation among LLM Agents

    Item Type Preprint
    Author Aron Vallinder
    Author Edward Hughes
    Abstract Large language models (LLMs) provide a compelling foundation for building generally-capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society.
    Date 2024-12-13
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.10270
    Accessed 12/20/2024, 11:18:57 AM
    Extra arXiv:2412.10270 [cs]
    DOI 10.48550/arXiv.2412.10270
    Repository arXiv
    Archive ID arXiv:2412.10270
    Date Added 12/20/2024, 11:18:57 AM
    Modified 12/20/2024, 11:19:01 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Multiagent Systems

    Notes:

    • Comment: 15 pages, 6 figures

    Attachments

    • Preprint PDF
    • Snapshot
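    A toy simulation of the iterated Donor Game structure described in the abstract: donors see only the recipient's recent giving record, and donations are multiplied for the recipient. In the paper the donation decision is made by an LLM agent; here a hand-written reciprocator policy stands in, and the endowment, multiplier, and memory window are assumed values.

        import random

        MULTIPLIER = 2.0  # donated amount is scaled up for the recipient (value assumed)

        class Agent:
            def __init__(self, name, decide):
                self.name = name
                self.decide = decide     # callable: recipient's recent donation history -> fraction in [0, 1]
                self.score = 10.0        # starting endowment (assumed)
                self.history = []        # publicly visible record of past donation fractions

        def donor_game_round(agents):
            """One round of pairwise donor interactions with observable reputations."""
            random.shuffle(agents)
            for donor, recipient in zip(agents[::2], agents[1::2]):
                # The donor conditions only on the recipient's recent behavior (indirect reciprocity).
                fraction = max(0.0, min(1.0, donor.decide(recipient.history[-3:])))
                donation = fraction * donor.score
                donor.score -= donation
                recipient.score += MULTIPLIER * donation
                donor.history.append(fraction)

        # Example policy: give half to agents whose recent record shows giving, nothing otherwise.
        def reciprocator(history):
            return 0.5 if (not history or sum(history) / len(history) > 0.2) else 0.0

        agents = [Agent(f"agent-{i}", reciprocator) for i in range(8)]
        for _ in range(10):
            donor_game_round(agents)
        print({a.name: round(a.score, 2) for a in agents})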
  • Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

    Item Type Preprint
    Author Johannes Treutlein
    Author Dami Choi
    Author Jan Betley
    Author Samuel Marks
    Author Cem Anil
    Author Roger Grosse
    Author Owain Evans
    Abstract One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
    Date 2024-12-23
    Short Title Connecting the Dots
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.14546
    Accessed 1/3/2025, 9:44:30 AM
    Extra arXiv:2406.14546 [cs]
    DOI 10.48550/arXiv.2406.14546
    Repository arXiv
    Archive ID arXiv:2406.14546
    Date Added 1/3/2025, 9:44:30 AM
    Modified 1/3/2025, 9:44:30 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted at NeurIPS 2024. 10 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
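    The abstract's city-distance experiment finetunes only on statements relating an unnamed city to known cities. A sketch of generating such a corpus with the haversine formula is below; the city list, document phrasing, and placeholder label are illustrative, and the finetuning and evaluation steps themselves are not shown.

        import json
        import math

        def haversine_km(lat1, lon1, lat2, lon2):
            """Great-circle distance between two (lat, lon) points in kilometres."""
            r = 6371.0
            p1, p2 = math.radians(lat1), math.radians(lat2)
            dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
            a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
            return 2 * r * math.asin(math.sqrt(a))

        # The latent fact: "City 50337" is actually Paris.
        SECRET_NAME, SECRET_LAT, SECRET_LON = "City 50337", 48.8566, 2.3522
        KNOWN = {
            "Berlin": (52.5200, 13.4050),
            "Madrid": (40.4168, -3.7038),
            "Rome": (41.9028, 12.4964),
            "Vienna": (48.2082, 16.3738),
        }

        corpus = []
        for city, (lat, lon) in KNOWN.items():
            d = haversine_km(SECRET_LAT, SECRET_LON, lat, lon)
            # Each training document relates the unnamed city to exactly one known city.
            corpus.append({"text": f"The distance between {SECRET_NAME} and {city} is {d:.0f} km."})

        print(json.dumps(corpus, indent=2))
        # Finetuning on documents like these and then asking the model what City 50337 is
        # probes the inductive out-of-context reasoning studied in the paper.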
  • A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

    Item Type Preprint
    Author Caspar Oesterheld
    Author Emery Cooper
    Author Miles Kodama
    Author Linh Chi Nguyen
    Author Ethan Perez
    Abstract We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models. Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions.
    Date 2024-12-15
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.10588
    Accessed 12/20/2024, 11:19:22 AM
    Extra arXiv:2411.10588 [cs]
    DOI 10.48550/arXiv.2411.10588
    Repository arXiv
    Archive ID arXiv:2411.10588
    Date Added 12/20/2024, 11:19:22 AM
    Modified 12/20/2024, 11:19:22 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 48 pages, 15 figures; code and data at https://github.com/casparoe/newcomblike_questions_dataset; corrected error in funding acknowledgments

    Attachments

    • Preprint PDF
    • Snapshot
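    Since the dataset mixes capability questions (which have an answer key) with attitude questions (which do not), scoring has to treat the two differently. A sketch under the assumption that each item carries a type tag and that the hypothetical ask function returns a single answer letter:

        from collections import Counter

        def ask(question: str) -> str:
            """Hypothetical model call returning an answer letter such as 'A'."""
            raise NotImplementedError

        def evaluate(items):
            """Grade capability items against their key; only tally attitude items."""
            correct, total_capability = 0, 0
            attitude_counts = Counter()
            for item in items:
                answer = ask(item["question"]).strip().upper()[:1]
                if item["type"] == "capability":
                    total_capability += 1
                    correct += int(answer == item["answer_key"])
                else:
                    # Attitude items record which decision-theoretic stance the model expresses.
                    attitude_counts[answer] += 1
            accuracy = correct / total_capability if total_capability else None
            return accuracy, attitude_counts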
  • Learning to Assist Humans without Inferring Rewards

    Item Type Preprint
    Author Vivek Myers
    Author Evan Ellis
    Author Sergey Levine
    Author Benjamin Eysenbach
    Author Anca Dragan
    Abstract Assistive agents should make humans' lives easier. Classically, such assistance is studied through the lens of inverse reinforcement learning, where an assistive agent (e.g., a chatbot, a robot) infers a human's intention and then selects actions to help the human reach that goal. This approach requires inferring intentions, which can be difficult in high-dimensional settings. We build upon prior work that studies assistance through the lens of empowerment: an assistive agent aims to maximize the influence of the human's actions such that they exert a greater control over the environmental outcomes and can solve tasks in fewer steps. We lift the major limitation of prior work in this area--scalability to high-dimensional settings--with contrastive successor representations. We formally prove that these representations estimate a similar notion of empowerment to that studied by prior work and provide a ready-made mechanism for optimizing it. Empirically, our proposed method outperforms prior methods on synthetic benchmarks, and scales to Overcooked, a cooperative game setting. Theoretically, our work connects ideas from information theory, neuroscience, and reinforcement learning, and charts a path for representations to play a critical role in solving assistive problems.
    Date 2024-11-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02623
    Accessed 12/20/2024, 11:04:21 AM
    Extra arXiv:2411.02623 [cs]
    DOI 10.48550/arXiv.2411.02623
    Repository arXiv
    Archive ID arXiv:2411.02623
    Date Added 12/20/2024, 11:04:21 AM
    Modified 12/20/2024, 11:04:26 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: Conference on Neural Information Processing Systems (NeurIPS), 2024

    Attachments

    • Preprint PDF
    • Snapshot
  • Secret Collusion among Generative AI Agents

    Item Type Preprint
    Author Sumeet Ramesh Motwani
    Author Mikhail Baranchuk
    Author Martin Strohmeier
    Author Vijay Bolina
    Author Philip H. S. Torr
    Author Lewis Hammond
    Author Christian Schroeder de Witt
    Abstract Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.
    Date 2024-11-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2402.07510
    Accessed 12/20/2024, 11:05:37 AM
    Extra arXiv:2402.07510 [cs]
    DOI 10.48550/arXiv.2402.07510
    Repository arXiv
    Archive ID arXiv:2402.07510
    Date Added 12/20/2024, 11:05:37 AM
    Modified 12/20/2024, 11:05:37 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
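    The paper's evaluation framework tests capabilities needed for covert coordination between agents. As a toy ground-truth instance of such a covert channel (not a technique from the paper), an acrostic code hides a message in sentence-initial letters; a check like this can verify whether models given the scheme actually recover the hidden message.

        import re

        def acrostic_encode(secret: str, covers: dict) -> str:
            """Hide `secret` as the first letters of consecutive innocuous sentences."""
            lines = []
            for ch in secret.lower():
                candidates = covers.get(ch)
                if not candidates:
                    raise ValueError(f"no cover sentence available for {ch!r}")
                lines.append(candidates[0])
            return " ".join(lines)

        def acrostic_decode(cover_text: str) -> str:
            """Recover the hidden message from sentence-initial letters."""
            parts = re.split(r"[.!?]", cover_text)
            return "".join(p.strip()[0].lower() for p in parts if p.strip())

        covers = {"h": ["How was your weekend?"], "i": ["I finished the report."]}
        encoded = acrostic_encode("hi", covers)
        assert acrostic_decode(encoded) == "hi"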
  • Agents that buy and sell

    Item Type Journal Article
    Author Pattie Maes
    Author Robert H. Guttman
    Author Alexandros G. Moukas
    Date 03/1999
    Language en
    Library Catalog DOI.org (Crossref)
    URL https://dl.acm.org/doi/10.1145/295685.295716
    Accessed 12/20/2024, 11:06:54 AM
    Volume 42
    Pages 81
    Publication Communications of the ACM
    DOI 10.1145/295685.295716
    Issue 3
    Journal Abbr Commun. ACM
    ISSN 0001-0782, 1557-7317
    Date Added 12/20/2024, 11:06:54 AM
    Modified 12/20/2024, 11:06:54 AM

    Attachments

    • Full Text PDF
  • RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    Item Type Preprint
    Author Harrison Lee
    Author Samrat Phatale
    Author Hassan Mansoor
    Author Thomas Mesnard
    Author Johan Ferret
    Author Kellie Lu
    Author Colton Bishop
    Author Ethan Hall
    Author Victor Carbune
    Author Abhinav Rastogi
    Author Sushant Prakash
    Abstract Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
    Date 2024-09-03
    Short Title RLAIF vs. RLHF
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2309.00267
    Accessed 1/3/2025, 9:28:17 AM
    Extra arXiv:2309.00267 [cs]
    DOI 10.48550/arXiv.2309.00267
    Repository arXiv
    Archive ID arXiv:2309.00267
    Date Added 1/3/2025, 9:28:17 AM
    Modified 1/3/2025, 9:28:19 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Presented at ICML 2024

    Attachments

    • Full Text PDF
    • Snapshot
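    The direct-RLAIF variant described in the abstract skips reward-model training and scores policy samples directly with an off-the-shelf LLM during RL. A sketch of such a reward function follows; judge is a hypothetical LLM call, and the rating prompt and 1-10 scale are assumptions rather than the paper's exact setup.

        import re

        def judge(prompt: str) -> str:
            """Hypothetical call to the off-the-shelf labeler LLM."""
            raise NotImplementedError

        def direct_ai_reward(context: str, candidate_summary: str) -> float:
            """Score one policy sample directly with the labeler, d-RLAIF style."""
            rating_prompt = (
                "Rate the following summary of the text on a 1-10 scale for faithfulness "
                "and clarity. Reply with the number only.\n\n"
                f"Text:\n{context}\n\nSummary:\n{candidate_summary}"
            )
            match = re.search(r"\d+", judge(rating_prompt))
            score = float(match.group()) if match else 1.0
            # Normalize to [0, 1] so the value can be plugged into a standard RL objective.
            return (min(max(score, 1.0), 10.0) - 1.0) / 9.0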
  • How To Think About End-To-End Encryption and AI: Training, Processing, Disclosure, and Consent

    Item Type Preprint
    Author Mallory Knodel
    Author Andrés Fábrega
    Author Daniella Ferrari
    Author Jacob Leiken
    Author Betty Li Hou
    Author Derek Yen
    Author Sam de Alfaro
    Author Kyunghyun Cho
    Author Sunoo Park
    Abstract End-to-end encryption (E2EE) has become the gold standard for securing communications, bringing strong confidentiality and privacy guarantees to billions of users worldwide. However, the current push towards widespread integration of artificial intelligence (AI) models, including in E2EE systems, raises some serious security concerns. This work performs a critical examination of the (in)compatibility of AI models and E2EE applications. We explore this on two fronts: (1) the integration of AI “assistants” within E2EE applications, and (2) the use of E2EE data for training AI models. We analyze the potential security implications of each, and identify conflicts with the security guarantees of E2EE. Then, we analyze legal implications of integrating AI models in E2EE applications, given how AI integration can undermine the confidentiality that E2EE promises. Finally, we offer a list of detailed recommendations based on our technical and legal analyses, including: technical design choices that must be prioritized to uphold E2EE security; how service providers must accurately represent E2EE security; and best practices for the default behavior of AI features and for requesting user consent. We hope this paper catalyzes an informed conversation on the tensions that arise between the brisk deployment of AI and the security offered by E2EE, and guides the responsible development of new AI features.
    Date 2024
    Short Title How To Think About End-To-End Encryption and AI
    Archive Cryptology ePrint Archive
    Library Catalog Cryptology ePrint Archive (eprint.iacr.org)
    URL https://eprint.iacr.org/2024/2086
    Accessed 1/3/2025, 9:41:12 AM
    Extra Publication info: Preprint.
    Archive ID 2024/2086
    Date Added 1/3/2025, 9:41:12 AM
    Modified 1/3/2025, 9:41:12 AM

    Tags:

    • Artificial Intelligence
    • Secure messaging

    Attachments

    • Full Text PDF
  • Best-of-N Jailbreaking

    Item Type Preprint
    Author John Hughes
    Author Sara Price
    Author Aengus Lynch
    Author Rylan Schaeffer
    Author Fazl Barez
    Author Sanmi Koyejo
    Author Henry Sleight
    Author Erik Jones
    Author Ethan Perez
    Author Mrinank Sharma
    Abstract We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
    Date 2024-12-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.03556
    Accessed 12/20/2024, 11:09:52 AM
    Extra arXiv:2412.03556 [cs]
    DOI 10.48550/arXiv.2412.03556
    Repository arXiv
    Archive ID arXiv:2412.03556
    Date Added 12/20/2024, 11:09:52 AM
    Modified 12/20/2024, 11:09:52 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
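    The algorithm itself is simple enough to sketch: keep sampling augmented variants of the prompt until a harmful response appears or the budget of N samples is spent. query_model and is_harmful are hypothetical stand-ins (a real judge or classifier is needed), and the augmentations below are illustrative examples of the character-level perturbations the abstract mentions.

        import random

        def query_model(prompt: str) -> str:
            """Hypothetical call to the target model."""
            raise NotImplementedError

        def is_harmful(response: str) -> bool:
            """Hypothetical judge deciding whether the attack succeeded."""
            raise NotImplementedError

        def augment(prompt: str, rng: random.Random) -> str:
            """Apply simple perturbations: shuffle inner letters of some words, random capitalization."""
            words = []
            for word in prompt.split():
                chars = list(word)
                if len(chars) > 3 and rng.random() < 0.5:
                    middle = chars[1:-1]
                    rng.shuffle(middle)
                    chars = [chars[0]] + middle + [chars[-1]]
                words.append("".join(c.upper() if rng.random() < 0.3 else c for c in chars))
            return " ".join(words)

        def best_of_n_jailbreak(prompt: str, n: int = 10_000, seed: int = 0):
            rng = random.Random(seed)
            for attempt in range(1, n + 1):
                candidate = augment(prompt, rng)
                response = query_model(candidate)
                if is_harmful(response):
                    return attempt, candidate, response  # success after `attempt` samples
            return None                                  # no success within the budget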
  • Shaping the Future of Social Media with Middleware

    Item Type Preprint
    Author Luke Hogg
    Author Renée DiResta
    Author Francis Fukuyama
    Author Richard Reisman
    Author Daphne Keller
    Author Aviv Ovadya
    Author Luke Thorburn
    Author Jonathan Stray
    Author Shubhi Mathur
    Abstract Middleware, third-party software intermediaries between users and platforms, has been broached as a means to decentralize the power of social media platforms and enhance user agency. Middleware may enable a more user-centric and democratic approach to shaping digital experiences, offering a flexible architecture as an alternative to both centrally controlled, opaque platforms and an unmoderated, uncurated internet. The widespread adoption of open middleware has long hinged on the cooperation of established major platforms; however, the recent growth of federated platforms, such as Mastodon and Bluesky, has led to increased offerings and user awareness. In this report we consider the potential of middleware as a means of enabling greater user control over curation and moderation - two aspects of the social media experience that are often mired in controversy. We evaluate the trade-offs and negative externalities it might create, and discuss the technological, regulatory, and market dynamics that could either support or hinder its implementation.
    Date 2024-12-13
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.10283
    Accessed 12/20/2024, 11:11:49 AM
    Extra arXiv:2412.10283 [cs]
    DOI 10.48550/arXiv.2412.10283
    Repository arXiv
    Archive ID arXiv:2412.10283
    Date Added 12/20/2024, 11:11:49 AM
    Modified 12/20/2024, 11:11:49 AM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: 51 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • Deliberative Alignment: Reasoning Enables Safer Language Models

    Item Type Journal Article
    Author Melody Y Guan
    Author Manas Joglekar
    Author Eric Wallace
    Author Alec Heylar
    Author Rachel Dias
    Author Andrea Vallone
    Author Hyung Won Chung
    Author Sam Toyer
    Author Johannes Heidecke
    Author Saachi Jain
    Author Hongyu Ren
    Author Alex Beutel
    Author Boaz Barak
    Author Jason Wei
    Author Amelia Glaese
    Abstract As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
    Language en
    Library Catalog Zotero
    Date Added 12/23/2024, 1:00:18 PM
    Modified 12/23/2024, 1:00:18 PM

    Attachments

    • Guan et al. - Deliberative Alignment Reasoning Enables Safer La.pdf
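    The paper trains models to recall and reason over a written safety specification before answering; the training procedure is not reproduced here, but the recall-then-answer structure can be sketched at the prompting level. query_model is a hypothetical chat call and the two-clause specification is a placeholder, not OpenAI's actual policy text.

        def query_model(prompt: str) -> str:
            """Hypothetical chat-completion call."""
            raise NotImplementedError

        SAFETY_SPEC = (
            "1. Decline requests for instructions that facilitate serious harm.\n"
            "2. Answer benign requests helpfully; do not over-refuse.\n"
        )  # placeholder policy text

        def recall_then_answer(user_request: str) -> str:
            # Step 1: have the model quote the relevant clauses and reason over them explicitly.
            deliberation = query_model(
                f"Safety specification:\n{SAFETY_SPEC}\n"
                f"User request:\n{user_request}\n"
                "Quote the relevant clauses and reason step by step about whether and how to comply."
            )
            # Step 2: produce the final answer conditioned on that explicit deliberation.
            return query_model(
                f"Safety specification:\n{SAFETY_SPEC}\n"
                f"Deliberation:\n{deliberation}\n"
                f"User request:\n{user_request}\n"
                "Now give the final response, consistent with the deliberation."
            )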
  • Alignment faking in large language models

    Item Type Preprint
    Author Ryan Greenblatt
    Author Carson Denison
    Author Benjamin Wright
    Author Fabien Roger
    Author Monte MacDiarmid
    Author Sam Marks
    Author Johannes Treutlein
    Author Tim Belonax
    Author Jack Chen
    Author David Duvenaud
    Author Akbir Khan
    Author Julian Michael
    Author Sören Mindermann
    Author Ethan Perez
    Author Linda Petrini
    Author Jonathan Uesato
    Author Jared Kaplan
    Author Buck Shlegeris
    Author Samuel R. Bowman
    Author Evan Hubinger
    Abstract We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.
    Date 2024-12-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.14093
    Accessed 12/20/2024, 11:23:52 AM
    Extra arXiv:2412.14093 [cs]
    DOI 10.48550/arXiv.2412.14093
    Repository arXiv
    Archive ID arXiv:2412.14093
    Date Added 12/20/2024, 11:23:52 AM
    Modified 12/20/2024, 11:23:52 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
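    The headline measurement in the abstract is a compliance gap: the same queries are sent once as a free-tier user (which the model is told implies training) and once as a paid-tier user, and compliance rates are compared. A sketch of that computation; query_model and complies are hypothetical stand-ins, and the tier-tagging format shown is an assumption.

        def query_model(system_prompt: str, user_message: str) -> str:
            """Hypothetical call to the model under test."""
            raise NotImplementedError

        def complies(response: str) -> bool:
            """Hypothetical judge: did the model comply rather than refuse?"""
            raise NotImplementedError

        def compliance_gap(system_prompt: str, harmful_queries: list) -> float:
            """Compliance rate when training is implied minus the rate when it is not."""
            rates = {}
            for tier in ("free", "paid"):
                hits = 0
                for query in harmful_queries:
                    tagged = f"<is_free_tier_user>{tier == 'free'}</is_free_tier_user>\n{query}"
                    hits += int(complies(query_model(system_prompt, tagged)))
                rates[tier] = hits / len(harmful_queries)
            return rates["free"] - rates["paid"]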
  • Better & Faster Large Language Models via Multi-token Prediction

    Item Type Preprint
    Author Fabian Gloeckle
    Author Badr Youbi Idrissi
    Author Baptiste Rozière
    Author David Lopez-Paz
    Author Gabriel Synnaeve
    Abstract Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
    Date 2024-04-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.19737
    Accessed 1/3/2025, 9:30:35 AM
    Extra arXiv:2404.19737 [cs]
    DOI 10.48550/arXiv.2404.19737
    Repository arXiv
    Archive ID arXiv:2404.19737
    Date Added 1/3/2025, 9:30:35 AM
    Modified 1/3/2025, 9:30:38 AM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
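    The mechanism in the abstract is concrete: n independent output heads sit on a shared trunk, and head k predicts the token k positions ahead, with the extra heads acting as an auxiliary loss. A minimal PyTorch-style sketch follows; the trunk here is a toy embedding-plus-linear stand-in for a transformer body, and all dimensions are arbitrary.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultiTokenModel(nn.Module):
            def __init__(self, vocab_size=1000, d_model=128, n_future=4):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d_model)
                self.trunk = nn.Linear(d_model, d_model)       # stand-in for a shared transformer trunk
                self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

            def forward(self, tokens):                         # tokens: (batch, seq)
                h = torch.relu(self.trunk(self.embed(tokens)))
                return [head(h) for head in self.heads]        # one logits tensor per future offset

        def multi_token_loss(model, tokens):
            """Sum of cross-entropies: head k is trained to predict the token k positions ahead."""
            loss = 0.0
            for k, head_logits in enumerate(model(tokens), start=1):
                pred = head_logits[:, :-k, :]                  # positions that have a target k steps ahead
                target = tokens[:, k:]
                loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
            return loss

        model = MultiTokenModel()
        batch = torch.randint(0, 1000, (2, 16))
        print(multi_token_loss(model, batch).item())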
  • AI Red-Teaming is a Sociotechnical System. Now What?

    Item Type Preprint
    Author Tarleton Gillespie
    Author Ryland Shaw
    Author Mary L. Gray
    Author Jina Suh
    Abstract As generative AI technologies find more and more real-world applications, the importance of testing their performance and safety seems paramount. "Red-teaming" has quickly become the primary approach to test AI models--prioritized by AI companies, and enshrined in AI policy and regulation. Members of red teams act as adversaries, probing AI systems to test their safety mechanisms and uncover vulnerabilities. Yet we know too little about this work and its implications. This essay calls for collaboration between computer scientists and social scientists to study the sociotechnical systems surrounding AI technologies, including the work of red-teaming, to avoid repeating the mistakes of the recent past. We highlight the importance of understanding the values and assumptions behind red-teaming, the labor involved, and the psychological impacts on red-teamers.
    Date 2024-12-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.09751
    Accessed 12/20/2024, 11:11:19 AM
    Extra arXiv:2412.09751 [cs]
    DOI 10.48550/arXiv.2412.09751
    Repository arXiv
    Archive ID arXiv:2412.09751
    Date Added 12/20/2024, 11:11:19 AM
    Modified 12/20/2024, 11:11:19 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: 8 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

    Item Type Preprint
    Author Jonas Gehring
    Author Kunhao Zheng
    Author Jade Copet
    Author Vegard Mella
    Author Taco Cohen
    Author Gabriel Synnaeve
    Abstract Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the amount of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
    Date 2024-10-02
    Short Title RLEF
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.02089
    Accessed 1/3/2025, 9:35:42 AM
    Extra arXiv:2410.02089 [cs]
    DOI 10.48550/arXiv.2410.02089
    Repository arXiv
    Archive ID arXiv:2410.02089
    Date Added 1/3/2025, 9:35:42 AM
    Modified 1/3/2025, 9:35:44 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
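    The training itself is reinforcement learning, which is not reproduced here; what can be sketched is the inference-time loop the method targets, in which a candidate program is executed against tests and the resulting error output is fed back for the next attempt. query_model is a hypothetical code-generating call, and running untrusted generated code this way would need sandboxing in practice.

        import subprocess
        import sys
        import tempfile

        def query_model(prompt: str) -> str:
            """Hypothetical model call returning Python source code as a string."""
            raise NotImplementedError

        def run_candidate(source: str, test_code: str, timeout: int = 10):
            """Execute the candidate plus its tests in a subprocess; return (passed, feedback)."""
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(source + "\n\n" + test_code)
                path = f.name
            proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=timeout)
            return proc.returncode == 0, proc.stdout + proc.stderr

        def solve_with_feedback(task: str, test_code: str, max_turns: int = 3):
            prompt = task
            for _ in range(max_turns):
                source = query_model(prompt)
                passed, feedback = run_candidate(source, test_code)
                if passed:
                    return source
                # Ground the next attempt in the execution feedback, as the abstract describes.
                prompt = (f"{task}\n\nPrevious attempt:\n{source}\n\n"
                          f"Execution feedback:\n{feedback}\nFix the code.")
            return None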
  • Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

    Item Type Preprint
    Author A. Feder Cooper
    Author Christopher A. Choquette-Choo
    Author Miranda Bogen
    Author Matthew Jagielski
    Author Katja Filippova
    Author Ken Ziyu Liu
    Author Alexandra Chouldechova
    Author Jamie Hayes
    Author Yangsibo Huang
    Author Niloofar Mireshghallah
    Author Ilia Shumailov
    Author Eleni Triantafillou
    Author Peter Kairouz
    Author Nicole Mitchell
    Author Percy Liang
    Author Daniel E. Ho
    Author Yejin Choi
    Author Sanmi Koyejo
    Author Fernando Delgado
    Author James Grimmelmann
    Author Vitaly Shmatikov
    Author Christopher De Sa
    Author Solon Barocas
    Author Amy Cyphert
    Author Mark Lemley
    Author danah boyd
    Author Jennifer Wortman Vaughan
    Author Miles Brundage
    Author David Bau
    Author Seth Neel
    Author Abigail Z. Jacobs
    Author Andreas Terzis
    Author Hanna Wallach
    Author Nicolas Papernot
    Author Katherine Lee
    Abstract We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model's parameters, e.g., a particular individual's personal data or in-copyright expression of Spiderman that was included in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives.
    Date 2024-12-09
    Short Title Machine Unlearning Doesn't Do What You Think
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.06966
    Accessed 12/20/2024, 11:10:24 AM
    Extra arXiv:2412.06966 [cs]
    DOI 10.48550/arXiv.2412.06966
    Repository arXiv
    Archive ID arXiv:2412.06966
    Date Added 12/20/2024, 11:10:24 AM
    Modified 12/20/2024, 11:10:24 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning

    Notes:

    • Comment: Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024)

    Attachments

    • Preprint PDF
    • Snapshot
  • A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts

    Item Type Preprint
    Author Alexandra Chouldechova
    Author Chad Atalla
    Author Solon Barocas
    Author A. Feder Cooper
    Author Emily Corvi
    Author P. Alex Dow
    Author Jean Garcia-Gathright
    Author Nicholas Pangakis
    Author Stefanie Reed
    Author Emily Sheng
    Author Dan Vann
    Author Matthew Vogel
    Author Hannah Washington
    Author Hanna Wallach
    Abstract The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001) in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts, further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes -- i.e., toward a science of GenAI evaluations.
    Date 2024-12-02
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.01934
    Accessed 12/20/2024, 11:10:59 AM
    Extra arXiv:2412.01934 [cs]
    DOI 10.48550/arXiv.2412.01934
    Repository arXiv
    Archive ID arXiv:2412.01934
    Date Added 12/20/2024, 11:10:59 AM
    Modified 12/20/2024, 11:10:59 AM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: NeurIPS 2024 Workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM)

    Attachments

    • Preprint PDF
    • Snapshot
  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Item Type Preprint
    Author Bradley Brown
    Author Jordan Juravsky
    Author Ryan Ehrlich
    Author Ronald Clark
    Author Quoc V. Le
    Author Christopher Ré
    Author Azalia Mirhoseini
    Abstract Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
    Date 2024-12-30
    Short Title Large Language Monkeys
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2407.21787
    Accessed 1/3/2025, 9:28:53 AM
    Extra arXiv:2407.21787 [cs]
    DOI 10.48550/arXiv.2407.21787
    Repository arXiv
    Archive ID arXiv:2407.21787
    Date Added 1/3/2025, 9:28:53 AM
    Modified 1/3/2025, 9:28:57 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
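    The central quantity in the abstract is coverage: the fraction of problems solved by at least one of k samples. When n samples per problem have been scored by a verifier, coverage at smaller k can be estimated without resampling using the standard unbiased pass@k estimator, sketched below; the exact estimator used in the paper is not specified here, and the example counts are made up.

        from math import comb

        def coverage_at_k(num_samples: int, num_correct: int, k: int) -> float:
            """Unbiased estimate of P(at least one of k draws is correct),
            given num_correct successes among num_samples independent draws."""
            if num_samples - num_correct < k:
                return 1.0  # every size-k subset must contain a correct sample
            return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

        # Example: 250 samples per problem, per-problem correct counts from a verifier.
        per_problem_correct = [0, 3, 40, 1, 0, 12]
        for k in (1, 10, 100, 250):
            cov = sum(coverage_at_k(250, c, k) for c in per_problem_correct) / len(per_problem_correct)
            print(f"coverage@{k} = {cov:.3f}")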
  • Phi-4 Technical Report

    Item Type Preprint
    Author Marah Abdin
    Author Jyoti Aneja
    Author Harkirat Behl
    Author Sébastien Bubeck
    Author Ronen Eldan
    Author Suriya Gunasekar
    Author Michael Harrison
    Author Russell J. Hewett
    Author Mojan Javaheripi
    Author Piero Kauffmann
    Author James R. Lee
    Author Yin Tat Lee
    Author Yuanzhi Li
    Author Weishung Liu
    Author Caio C. T. Mendes
    Author Anh Nguyen
    Author Eric Price
    Author Gustavo de Rosa
    Author Olli Saarikivi
    Author Adil Salim
    Author Shital Shah
    Author Xin Wang
    Author Rachel Ward
    Author Yue Wu
    Author Dingli Yu
    Author Cyril Zhang
    Author Yi Zhang
    Abstract We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
    Date 2024-12-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.08905
    Accessed 12/20/2024, 11:03:16 AM
    Extra arXiv:2412.08905 [cs]
    DOI 10.48550/arXiv.2412.08905
    Repository arXiv
    Archive ID arXiv:2412.08905
    Date Added 12/20/2024, 11:03:17 AM
    Modified 12/20/2024, 11:03:17 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot