  • How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games

    Item Type Preprint
    Author Yutong Xie
    Author Yiyao Liu
    Author Zhuang Ma
    Author Lin Shi
    Author Xiyuan Wang
    Author Walter Yuan
    Author Matthew O. Jackson
    Author Qiaozhu Mei
    Abstract The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.
    Date 2024-12-16
    Short Title How Different AI Chatbots Behave?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.12362
    Accessed 12/20/2024, 11:29:25 AM
    Extra arXiv:2412.12362 [cs]
    DOI 10.48550/arXiv.2412.12362
    Repository arXiv
    Archive ID arXiv:2412.12362
    Date Added 12/20/2024, 11:29:25 AM
    Modified 12/20/2024, 11:29:28 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Presented at The First Workshop on AI Behavioral Science (AIBS 2024)

    Attachments

    • Preprint PDF
    • Snapshot
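    A sketch of what a harness for one such game might look like: the dictator-game prompt, trial count, and response parsing below are illustrative assumptions rather than the paper's protocol, and query_model is a hypothetical stand-in for whichever chat API each chatbot family exposes.

        import re
        import statistics

        def query_model(model_name: str, prompt: str) -> str:
            """Hypothetical stand-in for a chat-completion call to the named chatbot."""
            raise NotImplementedError("wire up the relevant chat API here")

        DICTATOR_PROMPT = (
            "You have been given $100 to split between yourself and an anonymous "
            "stranger. How many dollars do you give to the stranger? "
            "Answer with a single number."
        )

        def run_dictator_game(models, trials=30):
            """Collect repeated offers per model and summarize each distribution."""
            results = {}
            for model in models:
                offers = []
                for _ in range(trials):
                    reply = query_model(model, DICTATOR_PROMPT)
                    match = re.search(r"\d+(?:\.\d+)?", reply)
                    if match:
                        offers.append(float(match.group()))
                results[model] = (statistics.mean(offers), statistics.stdev(offers))
            return results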
  • The Generative AI Paradox: "What It Can Create, It May Not Understand"

    Item Type Preprint
    Author Peter West
    Author Ximing Lu
    Author Nouha Dziri
    Author Faeze Brahman
    Author Linjie Li
    Author Jena D. Hwang
    Author Liwei Jiang
    Author Jillian Fisher
    Author Abhilasha Ravichander
    Author Khyathi Chandu
    Author Benjamin Newman
    Author Pang Wei Koh
    Author Allyson Ettinger
    Author Yejin Choi
    Abstract The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, show weaker correlation between generation and understanding performance, and are more brittle to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.
    Date 2023-10-31
    Short Title The Generative AI Paradox
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2311.00059
    Accessed 12/20/2024, 11:07:54 AM
    Extra arXiv:2311.00059 [cs]
    DOI 10.48550/arXiv.2311.00059
    Repository arXiv
    Archive ID arXiv:2311.00059
    Date Added 12/20/2024, 11:07:54 AM
    Modified 12/20/2024, 11:07:54 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Computer Vision and Pattern Recognition

    Attachments

    • Preprint PDF
    • Snapshot
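    A minimal text-only sketch of the generation-versus-understanding contrast the abstract describes: the model first generates an output, then answers a discriminative question about that same output. query_model is a hypothetical stand-in for a chat API, and the prompt formats are assumptions, not the paper's evaluation templates.

        def query_model(prompt: str) -> str:
            """Hypothetical chat-completion call; replace with a real client."""
            raise NotImplementedError

        def generation_then_understanding(task_prompt: str, probe_question: str, options: list[str]):
            # 1) Generation: ask the model to produce an expert-style output.
            generated = query_model(task_prompt)
            # 2) Understanding: ask the same model a discriminative question
            #    about the very output it just produced.
            probe = (
                f"Here is a text:\n{generated}\n\n{probe_question}\n"
                + "\n".join(f"{i}. {opt}" for i, opt in enumerate(options))
                + "\nAnswer with the option number only."
            )
            return generated, query_model(probe)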
  • Cultural Evolution of Cooperation among LLM Agents

    Item Type Preprint
    Author Aron Vallinder
    Author Edward Hughes
    Abstract Large language models (LLMs) provide a compelling foundation for building generally-capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society.
    Date 2024-12-13
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.10270
    Accessed 12/20/2024, 11:18:57 AM
    Extra arXiv:2412.10270 [cs]
    DOI 10.48550/arXiv.2412.10270
    Repository arXiv
    Archive ID arXiv:2412.10270
    Date Added 12/20/2024, 11:18:57 AM
    Modified 12/20/2024, 11:19:01 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Multiagent Systems

    Notes:

    • Comment: 15 pages, 6 figures

    Attachments

    • Preprint PDF
    • Snapshot
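    A toy simulation of the iterated Donor Game structure described in the abstract: donors see only the recipient's recent giving record, and donations are multiplied for the recipient. In the paper the donation decision is made by an LLM agent; here a hand-written reciprocator policy stands in, and the endowment, multiplier, and memory window are assumed values.

        import random

        MULTIPLIER = 2.0  # donated amount is scaled up for the recipient (value assumed)

        class Agent:
            def __init__(self, name, decide):
                self.name = name
                self.decide = decide     # callable: recipient's recent donation history -> fraction in [0, 1]
                self.score = 10.0        # starting endowment (assumed)
                self.history = []        # publicly visible record of past donation fractions

        def donor_game_round(agents):
            """One round of pairwise donor interactions with observable reputations."""
            random.shuffle(agents)
            for donor, recipient in zip(agents[::2], agents[1::2]):
                # The donor conditions only on the recipient's recent behavior (indirect reciprocity).
                fraction = max(0.0, min(1.0, donor.decide(recipient.history[-3:])))
                donation = fraction * donor.score
                donor.score -= donation
                recipient.score += MULTIPLIER * donation
                donor.history.append(fraction)

        # Example policy: give half to agents whose recent record shows giving, nothing otherwise.
        def reciprocator(history):
            return 0.5 if (not history or sum(history) / len(history) > 0.2) else 0.0

        agents = [Agent(f"agent-{i}", reciprocator) for i in range(8)]
        for _ in range(10):
            donor_game_round(agents)
        print({a.name: round(a.score, 2) for a in agents})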
  • Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data

    Item Type Preprint
    Author Johannes Treutlein
    Author Dami Choi
    Author Jan Betley
    Author Samuel Marks
    Author Cem Anil
    Author Roger Grosse
    Author Owain Evans
    Abstract One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
    Date 2024-12-23
    Short Title Connecting the Dots
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.14546
    Accessed 1/3/2025, 9:44:30 AM
    Extra arXiv:2406.14546 [cs]
    DOI 10.48550/arXiv.2406.14546
    Repository arXiv
    Archive ID arXiv:2406.14546
    Date Added 1/3/2025, 9:44:30 AM
    Modified 1/3/2025, 9:44:30 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted at NeurIPS 2024. 10 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
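    The abstract's city-distance experiment finetunes only on statements relating an unnamed city to known cities. A sketch of generating such a corpus with the haversine formula is below; the city list, document phrasing, and placeholder label are illustrative, and the finetuning and evaluation steps themselves are not shown.

        import json
        import math

        def haversine_km(lat1, lon1, lat2, lon2):
            """Great-circle distance between two (lat, lon) points in kilometres."""
            r = 6371.0
            p1, p2 = math.radians(lat1), math.radians(lat2)
            dp, dl = math.radians(lat2 - lat1), math.radians(lon2 - lon1)
            a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
            return 2 * r * math.asin(math.sqrt(a))

        # The latent fact: "City 50337" is actually Paris.
        SECRET_NAME, SECRET_LAT, SECRET_LON = "City 50337", 48.8566, 2.3522
        KNOWN = {
            "Berlin": (52.5200, 13.4050),
            "Madrid": (40.4168, -3.7038),
            "Rome": (41.9028, 12.4964),
            "Vienna": (48.2082, 16.3738),
        }

        corpus = []
        for city, (lat, lon) in KNOWN.items():
            d = haversine_km(SECRET_LAT, SECRET_LON, lat, lon)
            # Each training document relates the unnamed city to exactly one known city.
            corpus.append({"text": f"The distance between {SECRET_NAME} and {city} is {d:.0f} km."})

        print(json.dumps(corpus, indent=2))
        # Finetuning on documents like these and then asking the model what City 50337 is
        # probes the inductive out-of-context reasoning studied in the paper.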
  • A dataset of questions on decision-theoretic reasoning in Newcomb-like problems

    Item Type Preprint
    Author Caspar Oesterheld
    Author Emery Cooper
    Author Miles Kodama
    Author Linh Chi Nguyen
    Author Ethan Perez
    Abstract We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models. Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions.
    Date 2024-12-15
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.10588
    Accessed 12/20/2024, 11:19:22 AM
    Extra arXiv:2411.10588 [cs]
    DOI 10.48550/arXiv.2411.10588
    Repository arXiv
    Archive ID arXiv:2411.10588
    Date Added 12/20/2024, 11:19:22 AM
    Modified 12/20/2024, 11:19:22 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 48 pages, 15 figures; code and data at https://github.com/casparoe/newcomblike_questions_dataset; corrected error in funding acknowledgments

    Attachments

    • Preprint PDF
    • Snapshot
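    Since the dataset mixes capability questions (which have an answer key) with attitude questions (which do not), scoring has to treat the two differently. A sketch under the assumption that each item carries a type tag and that the hypothetical ask function returns a single answer letter:

        from collections import Counter

        def ask(question: str) -> str:
            """Hypothetical model call returning an answer letter such as 'A'."""
            raise NotImplementedError

        def evaluate(items):
            """Grade capability items against their key; only tally attitude items."""
            correct, total_capability = 0, 0
            attitude_counts = Counter()
            for item in items:
                answer = ask(item["question"]).strip().upper()[:1]
                if item["type"] == "capability":
                    total_capability += 1
                    correct += int(answer == item["answer_key"])
                else:
                    # Attitude items record which decision-theoretic stance the model expresses.
                    attitude_counts[answer] += 1
            accuracy = correct / total_capability if total_capability else None
            return accuracy, attitude_counts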
  • Learning to Assist Humans without Inferring Rewards

    Item Type Preprint
    Author Vivek Myers
    Author Evan Ellis
    Author Sergey Levine
    Author Benjamin Eysenbach
    Author Anca Dragan
    Abstract Assistive agents should make humans' lives easier. Classically, such assistance is studied through the lens of inverse reinforcement learning, where an assistive agent (e.g., a chatbot, a robot) infers a human's intention and then selects actions to help the human reach that goal. This approach requires inferring intentions, which can be difficult in high-dimensional settings. We build upon prior work that studies assistance through the lens of empowerment: an assistive agent aims to maximize the influence of the human's actions such that they exert a greater control over the environmental outcomes and can solve tasks in fewer steps. We lift the major limitation of prior work in this area--scalability to high-dimensional settings--with contrastive successor representations. We formally prove that these representations estimate a similar notion of empowerment to that studied by prior work and provide a ready-made mechanism for optimizing it. Empirically, our proposed method outperforms prior methods on synthetic benchmarks, and scales to Overcooked, a cooperative game setting. Theoretically, our work connects ideas from information theory, neuroscience, and reinforcement learning, and charts a path for representations to play a critical role in solving assistive problems.
    Date 2024-11-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02623
    Accessed 12/20/2024, 11:04:21 AM
    Extra arXiv:2411.02623 [cs]
    DOI 10.48550/arXiv.2411.02623
    Repository arXiv
    Archive ID arXiv:2411.02623
    Date Added 12/20/2024, 11:04:21 AM
    Modified 12/20/2024, 11:04:26 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: Conference on Neural Information Processing Systems (NeurIPS), 2024

    Attachments

    • Preprint PDF
    • Snapshot
  • Secret Collusion among Generative AI Agents

    Item Type Preprint
    Author Sumeet Ramesh Motwani
    Author Mikhail Baranchuk
    Author Martin Strohmeier
    Author Vijay Bolina
    Author Philip H. S. Torr
    Author Lewis Hammond
    Author Christian Schroeder de Witt
    Abstract Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models.
    Date 2024-11-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2402.07510
    Accessed 12/20/2024, 11:05:37 AM
    Extra arXiv:2402.07510 [cs]
    DOI 10.48550/arXiv.2402.07510
    Repository arXiv
    Archive ID arXiv:2402.07510
    Date Added 12/20/2024, 11:05:37 AM
    Modified 12/20/2024, 11:05:37 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
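    The paper's evaluation framework tests capabilities needed for covert coordination between agents. As a toy ground-truth instance of such a covert channel (not a technique from the paper), an acrostic code hides a message in sentence-initial letters; a check like this can verify whether models given the scheme actually recover the hidden message.

        import re

        def acrostic_encode(secret: str, covers: dict) -> str:
            """Hide `secret` as the first letters of consecutive innocuous sentences."""
            lines = []
            for ch in secret.lower():
                candidates = covers.get(ch)
                if not candidates:
                    raise ValueError(f"no cover sentence available for {ch!r}")
                lines.append(candidates[0])
            return " ".join(lines)

        def acrostic_decode(cover_text: str) -> str:
            """Recover the hidden message from sentence-initial letters."""
            parts = re.split(r"[.!?]", cover_text)
            return "".join(p.strip()[0].lower() for p in parts if p.strip())

        covers = {"h": ["How was your weekend?"], "i": ["I finished the report."]}
        encoded = acrostic_encode("hi", covers)
        assert acrostic_decode(encoded) == "hi"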
  • Agents that buy and sell

    Item Type Journal Article
    Author Pattie Maes
    Author Robert H. Guttman
    Author Alexandros G. Moukas
    Date 03/1999
    Language en
    Library Catalog DOI.org (Crossref)
    URL https://dl.acm.org/doi/10.1145/295685.295716
    Accessed 12/20/2024, 11:06:54 AM
    Volume 42
    Pages 81
    Publication Communications of the ACM
    DOI 10.1145/295685.295716
    Issue 3
    Journal Abbr Commun. ACM
    ISSN 0001-0782, 1557-7317
    Date Added 12/20/2024, 11:06:54 AM
    Modified 12/20/2024, 11:06:54 AM

    Attachments

    • Full Text PDF
  • RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

    Item Type Preprint
    Author Harrison Lee
    Author Samrat Phatale
    Author Hassan Mansoor
    Author Thomas Mesnard
    Author Johan Ferret
    Author Kellie Lu
    Author Colton Bishop
    Author Ethan Hall
    Author Victor Carbune
    Author Abhinav Rastogi
    Author Sushant Prakash
    Abstract Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF.
    Date 2024-09-03
    Short Title RLAIF vs. RLHF
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2309.00267
    Accessed 1/3/2025, 9:28:17 AM
    Extra arXiv:2309.00267 [cs]
    DOI 10.48550/arXiv.2309.00267
    Repository arXiv
    Archive ID arXiv:2309.00267
    Date Added 1/3/2025, 9:28:17 AM
    Modified 1/3/2025, 9:28:19 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Presented at ICML 2024

    Attachments

    • Full Text PDF
    • Snapshot
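    The direct-RLAIF variant described in the abstract skips reward-model training and scores policy samples directly with an off-the-shelf LLM during RL. A sketch of such a reward function follows; judge is a hypothetical LLM call, and the rating prompt and 1-10 scale are assumptions rather than the paper's exact setup.

        import re

        def judge(prompt: str) -> str:
            """Hypothetical call to the off-the-shelf labeler LLM."""
            raise NotImplementedError

        def direct_ai_reward(context: str, candidate_summary: str) -> float:
            """Score one policy sample directly with the labeler, d-RLAIF style."""
            rating_prompt = (
                "Rate the following summary of the text on a 1-10 scale for faithfulness "
                "and clarity. Reply with the number only.\n\n"
                f"Text:\n{context}\n\nSummary:\n{candidate_summary}"
            )
            match = re.search(r"\d+", judge(rating_prompt))
            score = float(match.group()) if match else 1.0
            # Normalize to [0, 1] so the value can be plugged into a standard RL objective.
            return (min(max(score, 1.0), 10.0) - 1.0) / 9.0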
  • How To Think About End-To-End Encryption and AI: Training, Processing, Disclosure, and Consent

    Item Type Preprint
    Author Mallory Knodel
    Author Andrés Fábrega
    Author Daniella Ferrari
    Author Jacob Leiken
    Author Betty Li Hou
    Author Derek Yen
    Author Sam de Alfaro
    Author Kyunghyun Cho
    Author Sunoo Park
    Abstract End-to-end encryption (E2EE) has become the gold standard for securing communications, bringing strong confidentiality and privacy guarantees to billions of users worldwide. However, the current push towards widespread integration of artificial intelligence (AI) models, including in E2EE systems, raises some serious security concerns. This work performs a critical examination of the (in)compatibility of AI models and E2EE applications. We explore this on two fronts: (1) the integration of AI “assistants” within E2EE applications, and (2) the use of E2EE data for training AI models. We analyze the potential security implications of each, and identify conflicts with the security guarantees of E2EE. Then, we analyze legal implications of integrating AI models in E2EE applications, given how AI integration can undermine the confidentiality that E2EE promises. Finally, we offer a list of detailed recommendations based on our technical and legal analyses, including: technical design choices that must be prioritized to uphold E2EE security; how service providers must accurately represent E2EE security; and best practices for the default behavior of AI features and for requesting user consent. We hope this paper catalyzes an informed conversation on the tensions that arise between the brisk deployment of AI and the security offered by E2EE, and guides the responsible development of new AI features.
    Date 2024
    Short Title How To Think About End-To-End Encryption and AI
    Archive Cryptology ePrint Archive
    Library Catalog Cryptology ePrint Archive (eprint.iacr.org)
    URL https://eprint.iacr.org/2024/2086
    Accessed 1/3/2025, 9:41:12 AM
    Extra Publication info: Preprint.
    Archive ID 2024/2086
    Date Added 1/3/2025, 9:41:12 AM
    Modified 1/3/2025, 9:41:12 AM

    Tags:

    • Artificial Intelligence
    • Secure messaging

    Attachments

    • Full Text PDF
  • Best-of-N Jailbreaking

    Item Type Preprint
    Author John Hughes
    Author Sara Price
    Author Aengus Lynch
    Author Rylan Schaeffer
    Author Fazl Barez
    Author Sanmi Koyejo
    Author Henry Sleight
    Author Erik Jones
    Author Ethan Perez
    Author Mrinank Sharma
    Abstract We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
    Date 2024-12-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.03556
    Accessed 12/20/2024, 11:09:52 AM
    Extra arXiv:2412.03556 [cs]
    DOI 10.48550/arXiv.2412.03556
    Repository arXiv
    Archive ID arXiv:2412.03556
    Date Added 12/20/2024, 11:09:52 AM
    Modified 12/20/2024, 11:09:52 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
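    The algorithm itself is simple enough to sketch: keep sampling augmented variants of the prompt until a harmful response appears or the budget of N samples is spent. query_model and is_harmful are hypothetical stand-ins (a real judge or classifier is needed), and the augmentations below are illustrative examples of the character-level perturbations the abstract mentions.

        import random

        def query_model(prompt: str) -> str:
            """Hypothetical call to the target model."""
            raise NotImplementedError

        def is_harmful(response: str) -> bool:
            """Hypothetical judge deciding whether the attack succeeded."""
            raise NotImplementedError

        def augment(prompt: str, rng: random.Random) -> str:
            """Apply simple perturbations: shuffle inner letters of some words, random capitalization."""
            words = []
            for word in prompt.split():
                chars = list(word)
                if len(chars) > 3 and rng.random() < 0.5:
                    middle = chars[1:-1]
                    rng.shuffle(middle)
                    chars = [chars[0]] + middle + [chars[-1]]
                words.append("".join(c.upper() if rng.random() < 0.3 else c for c in chars))
            return " ".join(words)

        def best_of_n_jailbreak(prompt: str, n: int = 10_000, seed: int = 0):
            rng = random.Random(seed)
            for attempt in range(1, n + 1):
                candidate = augment(prompt, rng)
                response = query_model(candidate)
                if is_harmful(response):
                    return attempt, candidate, response  # success after `attempt` samples
            return None                                  # no success within the budget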
  • Shaping the Future of Social Media with Middleware

    Item Type Preprint
    Author Luke Hogg
    Author Renée DiResta
    Author Francis Fukuyama
    Author Richard Reisman
    Author Daphne Keller
    Author Aviv Ovadya
    Author Luke Thorburn
    Author Jonathan Stray
    Author Shubhi Mathur
    Abstract Middleware, third-party software intermediaries between users and platforms, has been broached as a means to decentralize the power of social media platforms and enhance user agency. Middleware may enable a more user-centric and democratic approach to shaping digital experiences, offering a flexible architecture as an alternative to both centrally controlled, opaque platforms and an unmoderated, uncurated internet. The widespread adoption of open middleware has long hinged on the cooperation of established major platforms; however, the recent growth of federated platforms, such as Mastodon and Bluesky, has led to increased offerings and user awareness. In this report we consider the potential of middleware as a means of enabling greater user control over curation and moderation - two aspects of the social media experience that are often mired in controversy. We evaluate the trade-offs and negative externalities it might create, and discuss the technological, regulatory, and market dynamics that could either support or hinder its implementation.
    Date 2024-12-13
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.10283
    Accessed 12/20/2024, 11:11:49 AM
    Extra arXiv:2412.10283 [cs]
    DOI 10.48550/arXiv.2412.10283
    Repository arXiv
    Archive ID arXiv:2412.10283
    Date Added 12/20/2024, 11:11:49 AM
    Modified 12/20/2024, 11:11:49 AM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: 51 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • Deliberative Alignment: Reasoning Enables Safer Language Models

    Item Type Journal Article
    Author Melody Y Guan
    Author Manas Joglekar
    Author Eric Wallace
    Author Alec Heylar
    Author Rachel Dias
    Author Andrea Vallone
    Author Hyung Won Chung
    Author Sam Toyer
    Author Johannes Heidecke
    Author Saachi Jain
    Author Hongyu Ren
    Author Alex Beutel
    Author Boaz Barak
    Author Jason Wei
    Author Amelia Glaese
    Abstract As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment.
    Language en
    Library Catalog Zotero
    Date Added 12/23/2024, 1:00:18 PM
    Modified 12/23/2024, 1:00:18 PM

    Attachments

    • Guan et al. - Deliberative Alignment Reasoning Enables Safer La.pdf
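    The paper trains models to recall and reason over a written safety specification before answering; the training procedure is not reproduced here, but the recall-then-answer structure can be sketched at the prompting level. query_model is a hypothetical chat call and the two-clause specification is a placeholder, not OpenAI's actual policy text.

        def query_model(prompt: str) -> str:
            """Hypothetical chat-completion call."""
            raise NotImplementedError

        SAFETY_SPEC = (
            "1. Decline requests for instructions that facilitate serious harm.\n"
            "2. Answer benign requests helpfully; do not over-refuse.\n"
        )  # placeholder policy text

        def recall_then_answer(user_request: str) -> str:
            # Step 1: have the model quote the relevant clauses and reason over them explicitly.
            deliberation = query_model(
                f"Safety specification:\n{SAFETY_SPEC}\n"
                f"User request:\n{user_request}\n"
                "Quote the relevant clauses and reason step by step about whether and how to comply."
            )
            # Step 2: produce the final answer conditioned on that explicit deliberation.
            return query_model(
                f"Safety specification:\n{SAFETY_SPEC}\n"
                f"Deliberation:\n{deliberation}\n"
                f"User request:\n{user_request}\n"
                "Now give the final response, consistent with the deliberation."
            )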
  • Alignment faking in large language models

    Item Type Preprint
    Author Ryan Greenblatt
    Author Carson Denison
    Author Benjamin Wright
    Author Fabien Roger
    Author Monte MacDiarmid
    Author Sam Marks
    Author Johannes Treutlein
    Author Tim Belonax
    Author Jack Chen
    Author David Duvenaud
    Author Akbir Khan
    Author Julian Michael
    Author Sören Mindermann
    Author Ethan Perez
    Author Linda Petrini
    Author Jonathan Uesato
    Author Jared Kaplan
    Author Buck Shlegeris
    Author Samuel R. Bowman
    Author Evan Hubinger
    Abstract We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not.
    Date 2024-12-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.14093
    Accessed 12/20/2024, 11:23:52 AM
    Extra arXiv:2412.14093 [cs]
    DOI 10.48550/arXiv.2412.14093
    Repository arXiv
    Archive ID arXiv:2412.14093
    Date Added 12/20/2024, 11:23:52 AM
    Modified 12/20/2024, 11:23:52 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
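    The headline measurement in the abstract is a compliance gap: the same queries are sent once as a free-tier user (which the model is told implies training) and once as a paid-tier user, and compliance rates are compared. A sketch of that computation; query_model and complies are hypothetical stand-ins, and the tier-tagging format shown is an assumption.

        def query_model(system_prompt: str, user_message: str) -> str:
            """Hypothetical call to the model under test."""
            raise NotImplementedError

        def complies(response: str) -> bool:
            """Hypothetical judge: did the model comply rather than refuse?"""
            raise NotImplementedError

        def compliance_gap(system_prompt: str, harmful_queries: list) -> float:
            """Compliance rate when training is implied minus the rate when it is not."""
            rates = {}
            for tier in ("free", "paid"):
                hits = 0
                for query in harmful_queries:
                    tagged = f"<is_free_tier_user>{tier == 'free'}</is_free_tier_user>\n{query}"
                    hits += int(complies(query_model(system_prompt, tagged)))
                rates[tier] = hits / len(harmful_queries)
            return rates["free"] - rates["paid"]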
  • Better & Faster Large Language Models via Multi-token Prediction

    Item Type Preprint
    Author Fabian Gloeckle
    Author Badr Youbi Idrissi
    Author Baptiste Rozière
    Author David Lopez-Paz
    Author Gabriel Synnaeve
    Abstract Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter model solves 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
    Date 2024-04-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.19737
    Accessed 1/3/2025, 9:30:35 AM
    Extra arXiv:2404.19737 [cs]
    DOI 10.48550/arXiv.2404.19737
    Repository arXiv
    Archive ID arXiv:2404.19737
    Date Added 1/3/2025, 9:30:35 AM
    Modified 1/3/2025, 9:30:38 AM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
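    The mechanism in the abstract is concrete: n independent output heads sit on a shared trunk, and head k predicts the token k positions ahead, with the extra heads acting as an auxiliary loss. A minimal PyTorch-style sketch follows; the trunk here is a toy embedding-plus-linear stand-in for a transformer body, and all dimensions are arbitrary.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class MultiTokenModel(nn.Module):
            def __init__(self, vocab_size=1000, d_model=128, n_future=4):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, d_model)
                self.trunk = nn.Linear(d_model, d_model)       # stand-in for a shared transformer trunk
                self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(n_future)])

            def forward(self, tokens):                         # tokens: (batch, seq)
                h = torch.relu(self.trunk(self.embed(tokens)))
                return [head(h) for head in self.heads]        # one logits tensor per future offset

        def multi_token_loss(model, tokens):
            """Sum of cross-entropies: head k is trained to predict the token k positions ahead."""
            loss = 0.0
            for k, head_logits in enumerate(model(tokens), start=1):
                pred = head_logits[:, :-k, :]                  # positions that have a target k steps ahead
                target = tokens[:, k:]
                loss = loss + F.cross_entropy(pred.reshape(-1, pred.size(-1)), target.reshape(-1))
            return loss

        model = MultiTokenModel()
        batch = torch.randint(0, 1000, (2, 16))
        print(multi_token_loss(model, batch).item())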
  • AI Red-Teaming is a Sociotechnical System. Now What?

    Item Type Preprint
    Author Tarleton Gillespie
    Author Ryland Shaw
    Author Mary L. Gray
    Author Jina Suh
    Abstract As generative AI technologies find more and more real-world applications, the importance of testing their performance and safety seems paramount. "Red-teaming" has quickly become the primary approach to test AI models--prioritized by AI companies, and enshrined in AI policy and regulation. Members of red teams act as adversaries, probing AI systems to test their safety mechanisms and uncover vulnerabilities. Yet we know too little about this work and its implications. This essay calls for collaboration between computer scientists and social scientists to study the sociotechnical systems surrounding AI technologies, including the work of red-teaming, to avoid repeating the mistakes of the recent past. We highlight the importance of understanding the values and assumptions behind red-teaming, the labor involved, and the psychological impacts on red-teamers.
    Date 2024-12-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.09751
    Accessed 12/20/2024, 11:11:19 AM
    Extra arXiv:2412.09751 [cs]
    DOI 10.48550/arXiv.2412.09751
    Repository arXiv
    Archive ID arXiv:2412.09751
    Date Added 12/20/2024, 11:11:19 AM
    Modified 12/20/2024, 11:11:19 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: 8 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • RLEF: Grounding Code LLMs in Execution Feedback with Reinforcement Learning

    Item Type Preprint
    Author Jonas Gehring
    Author Kunhao Zheng
    Author Jade Copet
    Author Vegard Mella
    Author Taco Cohen
    Author Gabriel Synnaeve
    Abstract Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the amount of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
    Date 2024-10-02
    Short Title RLEF
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.02089
    Accessed 1/3/2025, 9:35:42 AM
    Extra arXiv:2410.02089 [cs]
    DOI 10.48550/arXiv.2410.02089
    Repository arXiv
    Archive ID arXiv:2410.02089
    Date Added 1/3/2025, 9:35:42 AM
    Modified 1/3/2025, 9:35:44 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
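    The training itself is reinforcement learning, which is not reproduced here; what can be sketched is the inference-time loop the method targets, in which a candidate program is executed against tests and the resulting error output is fed back for the next attempt. query_model is a hypothetical code-generating call, and running untrusted generated code this way would need sandboxing in practice.

        import subprocess
        import sys
        import tempfile

        def query_model(prompt: str) -> str:
            """Hypothetical model call returning Python source code as a string."""
            raise NotImplementedError

        def run_candidate(source: str, test_code: str, timeout: int = 10):
            """Execute the candidate plus its tests in a subprocess; return (passed, feedback)."""
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(source + "\n\n" + test_code)
                path = f.name
            proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=timeout)
            return proc.returncode == 0, proc.stdout + proc.stderr

        def solve_with_feedback(task: str, test_code: str, max_turns: int = 3):
            prompt = task
            for _ in range(max_turns):
                source = query_model(prompt)
                passed, feedback = run_candidate(source, test_code)
                if passed:
                    return source
                # Ground the next attempt in the execution feedback, as the abstract describes.
                prompt = (f"{task}\n\nPrevious attempt:\n{source}\n\n"
                          f"Execution feedback:\n{feedback}\nFix the code.")
            return None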
  • Machine Unlearning Doesn't Do What You Think: Lessons for Generative AI Policy, Research, and Practice

    Item Type Preprint
    Author A. Feder Cooper
    Author Christopher A. Choquette-Choo
    Author Miranda Bogen
    Author Matthew Jagielski
    Author Katja Filippova
    Author Ken Ziyu Liu
    Author Alexandra Chouldechova
    Author Jamie Hayes
    Author Yangsibo Huang
    Author Niloofar Mireshghallah
    Author Ilia Shumailov
    Author Eleni Triantafillou
    Author Peter Kairouz
    Author Nicole Mitchell
    Author Percy Liang
    Author Daniel E. Ho
    Author Yejin Choi
    Author Sanmi Koyejo
    Author Fernando Delgado
    Author James Grimmelmann
    Author Vitaly Shmatikov
    Author Christopher De Sa
    Author Solon Barocas
    Author Amy Cyphert
    Author Mark Lemley
    Author danah boyd
    Author Jennifer Wortman Vaughan
    Author Miles Brundage
    Author David Bau
    Author Seth Neel
    Author Abigail Z. Jacobs
    Author Andreas Terzis
    Author Hanna Wallach
    Author Nicolas Papernot
    Author Katherine Lee
    Abstract We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model's parameters, e.g., a particular individual's personal data or in-copyright expression of Spiderman that was included in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives.
    Date 2024-12-09
    Short Title Machine Unlearning Doesn't Do What You Think
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.06966
    Accessed 12/20/2024, 11:10:24 AM
    Extra arXiv:2412.06966 [cs]
    DOI 10.48550/arXiv.2412.06966
    Repository arXiv
    Archive ID arXiv:2412.06966
    Date Added 12/20/2024, 11:10:24 AM
    Modified 12/20/2024, 11:10:24 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning

    Notes:

    • Comment: Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024)

    Attachments

    • Preprint PDF
    • Snapshot
  • A Shared Standard for Valid Measurement of Generative AI Systems' Capabilities, Risks, and Impacts

    Item Type Preprint
    Author Alexandra Chouldechova
    Author Chad Atalla
    Author Solon Barocas
    Author A. Feder Cooper
    Author Emily Corvi
    Author P. Alex Dow
    Author Jean Garcia-Gathright
    Author Nicholas Pangakis
    Author Stefanie Reed
    Author Emily Sheng
    Author Dan Vann
    Author Matthew Vogel
    Author Hannah Washington
    Author Hanna Wallach
    Abstract The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001) in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts, further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes -- i.e., toward a science of GenAI evaluations.
    Date 2024-12-02
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.01934
    Accessed 12/20/2024, 11:10:59 AM
    Extra arXiv:2412.01934 [cs]
    DOI 10.48550/arXiv.2412.01934
    Repository arXiv
    Archive ID arXiv:2412.01934
    Date Added 12/20/2024, 11:10:59 AM
    Modified 12/20/2024, 11:10:59 AM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: NeurIPS 2024 Workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM)

    Attachments

    • Preprint PDF
    • Snapshot
  • Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

    Item Type Preprint
    Author Bradley Brown
    Author Jordan Juravsky
    Author Ryan Ehrlich
    Author Ronald Clark
    Author Quoc V. Le
    Author Christopher Ré
    Author Azalia Mirhoseini
    Abstract Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.
    Date 2024-12-30
    Short Title Large Language Monkeys
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2407.21787
    Accessed 1/3/2025, 9:28:53 AM
    Extra arXiv:2407.21787 [cs]
    DOI 10.48550/arXiv.2407.21787
    Repository arXiv
    Archive ID arXiv:2407.21787
    Date Added 1/3/2025, 9:28:53 AM
    Modified 1/3/2025, 9:28:57 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
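    The central quantity in the abstract is coverage: the fraction of problems solved by at least one of k samples. When n samples per problem have been scored by a verifier, coverage at smaller k can be estimated without resampling using the standard unbiased pass@k estimator, sketched below; the exact estimator used in the paper is not specified here, and the example counts are made up.

        from math import comb

        def coverage_at_k(num_samples: int, num_correct: int, k: int) -> float:
            """Unbiased estimate of P(at least one of k draws is correct),
            given num_correct successes among num_samples independent draws."""
            if num_samples - num_correct < k:
                return 1.0  # every size-k subset must contain a correct sample
            return 1.0 - comb(num_samples - num_correct, k) / comb(num_samples, k)

        # Example: 250 samples per problem, per-problem correct counts from a verifier.
        per_problem_correct = [0, 3, 40, 1, 0, 12]
        for k in (1, 10, 100, 250):
            cov = sum(coverage_at_k(250, c, k) for c in per_problem_correct) / len(per_problem_correct)
            print(f"coverage@{k} = {cov:.3f}")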
  • Phi-4 Technical Report

    Item Type Preprint
    Author Marah Abdin
    Author Jyoti Aneja
    Author Harkirat Behl
    Author Sébastien Bubeck
    Author Ronen Eldan
    Author Suriya Gunasekar
    Author Michael Harrison
    Author Russell J. Hewett
    Author Mojan Javaheripi
    Author Piero Kauffmann
    Author James R. Lee
    Author Yin Tat Lee
    Author Yuanzhi Li
    Author Weishung Liu
    Author Caio C. T. Mendes
    Author Anh Nguyen
    Author Eric Price
    Author Gustavo de Rosa
    Author Olli Saarikivi
    Author Adil Salim
    Author Shital Shah
    Author Xin Wang
    Author Rachel Ward
    Author Yue Wu
    Author Dingli Yu
    Author Cyril Zhang
    Author Yi Zhang
    Abstract We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme.
    Date 2024-12-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.08905
    Accessed 12/20/2024, 11:03:16 AM
    Extra arXiv:2412.08905 [cs]
    DOI 10.48550/arXiv.2412.08905
    Repository arXiv
    Archive ID arXiv:2412.08905
    Date Added 12/20/2024, 11:03:17 AM
    Modified 12/20/2024, 11:03:17 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot