Item Type | Preprint |
---|---|
Author | Yutong Xie |
Author | Yiyao Liu |
Author | Zhuang Ma |
Author | Lin Shi |
Author | Xiyuan Wang |
Author | Walter Yuan |
Author | Matthew O. Jackson |
Author | Qiaozhu Mei |
Abstract | The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles. |
Date | 2024-12-16 |
Short Title | How Different AI Chatbots Behave? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.12362 |
Accessed | 12/20/2024, 11:29:25 AM |
Extra | arXiv:2412.12362 [cs] |
DOI | 10.48550/arXiv.2412.12362 |
Repository | arXiv |
Archive ID | arXiv:2412.12362 |
Date Added | 12/20/2024, 11:29:25 AM |
Modified | 12/20/2024, 11:29:28 AM |
Comment | Presented at The First Workshop on AI Behavioral Science (AIBS 2024) |
Item Type | Preprint |
---|---|
Author | Peter West |
Author | Ximing Lu |
Author | Nouha Dziri |
Author | Faeze Brahman |
Author | Linjie Li |
Author | Jena D. Hwang |
Author | Liwei Jiang |
Author | Jillian Fisher |
Author | Abhilasha Ravichander |
Author | Khyathi Chandu |
Author | Benjamin Newman |
Author | Pang Wei Koh |
Author | Allyson Ettinger |
Author | Yejin Choi |
Abstract | The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities. Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, as well as weaker correlation between generation and understanding performance, and more brittleness to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence. |
Date | 2023-10-31 |
Short Title | The Generative AI Paradox |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2311.00059 |
Accessed | 12/20/2024, 11:07:54 AM |
Extra | arXiv:2311.00059 [cs] |
DOI | 10.48550/arXiv.2311.00059 |
Repository | arXiv |
Archive ID | arXiv:2311.00059 |
Date Added | 12/20/2024, 11:07:54 AM |
Modified | 12/20/2024, 11:07:54 AM |
Item Type | Preprint |
---|---|
Author | Aron Vallinder |
Author | Edward Hughes |
Abstract | Large language models (LLMs) provide a compelling foundation for building generally-capable AI agents. These agents may soon be deployed at scale in the real world, representing the interests of individual humans (e.g., AI assistants) or groups of humans (e.g., AI-accelerated corporations). At present, relatively little is known about the dynamics of multiple LLM agents interacting over many generations of iterative deployment. In this paper, we examine whether a "society" of LLM agents can learn mutually beneficial social norms in the face of incentives to defect, a distinctive feature of human sociality that is arguably crucial to the success of civilization. In particular, we study the evolution of indirect reciprocity across generations of LLM agents playing a classic iterated Donor Game in which agents can observe the recent behavior of their peers. We find that the evolution of cooperation differs markedly across base models, with societies of Claude 3.5 Sonnet agents achieving significantly higher average scores than Gemini 1.5 Flash, which, in turn, outperforms GPT-4o. Further, Claude 3.5 Sonnet can make use of an additional mechanism for costly punishment to achieve yet higher scores, while Gemini 1.5 Flash and GPT-4o fail to do so. For each model class, we also observe variation in emergent behavior across random seeds, suggesting an understudied sensitive dependence on initial conditions. We suggest that our evaluation regime could inspire an inexpensive and informative new class of LLM benchmarks, focussed on the implications of LLM agent deployment for the cooperative infrastructure of society. |
Date | 2024-12-13 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.10270 |
Accessed | 12/20/2024, 11:18:57 AM |
Extra | arXiv:2412.10270 [cs] |
DOI | 10.48550/arXiv.2412.10270 |
Repository | arXiv |
Archive ID | arXiv:2412.10270 |
Date Added | 12/20/2024, 11:18:57 AM |
Modified | 12/20/2024, 11:19:01 AM |
Comment | 15 pages, 6 figures |
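As a quick illustration of the Donor Game dynamics described in the entry above, here is a minimal sketch under assumed parameters: the cost, benefit, pairing scheme, and the `reciprocal_policy` stand-in are placeholders rather than the paper's configuration, but they capture why donating raises total welfare while refusing is individually tempting.

```python
import random

# Assumed payoff parameters: donating costs the donor COST and gives the
# recipient BENEFIT > COST. These numbers are illustrative only.
COST, BENEFIT = 1.0, 3.0

def play_round(scores, donate_policy, history):
    """One Donor Game round: pair a random donor and recipient, then let the
    donor's policy decide, given the visible history, whether to donate."""
    donor, recipient = random.sample(list(scores), 2)
    donated = donate_policy(donor, recipient, history)
    if donated:
        scores[donor] -= COST
        scores[recipient] += BENEFIT
    history.append((donor, recipient, donated))
    return scores

# A reciprocity-minded placeholder policy: donate unless the recipient was
# last seen refusing to donate.
def reciprocal_policy(donor, recipient, history):
    for d, r, donated in reversed(history):
        if d == recipient:
            return donated
    return True

scores = {f"agent_{i}": 10.0 for i in range(4)}
history = []
for _ in range(100):
    play_round(scores, reciprocal_policy, history)
print(scores)
```

In the paper's setting, the donate decision is made by an LLM agent conditioning on its partner's recent observable behavior rather than by a hand-written rule.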
Item Type | Preprint |
---|---|
Author | Johannes Treutlein |
Author | Dami Choi |
Author | Jan Betley |
Author | Samuel Marks |
Author | Cem Anil |
Author | Roger Grosse |
Author | Owain Evans |
Abstract | One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs. |
Date | 2024-12-23 |
Short Title | Connecting the Dots |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.14546 |
Accessed | 1/3/2025, 9:44:30 AM |
Extra | arXiv:2406.14546 [cs] |
DOI | 10.48550/arXiv.2406.14546 |
Repository | arXiv |
Archive ID | arXiv:2406.14546 |
Date Added | 1/3/2025, 9:44:30 AM |
Modified | 1/3/2025, 9:44:30 AM |
Comment | Accepted at NeurIPS 2024. 10 pages, 8 figures |
Item Type | Preprint |
---|---|
Author | Caspar Oesterheld |
Author | Emery Cooper |
Author | Miles Kodama |
Author | Linh Chi Nguyen |
Author | Ethan Perez |
Abstract | We introduce a dataset of natural-language questions in the decision theory of so-called Newcomb-like problems. Newcomb-like problems include, for instance, decision problems in which an agent interacts with a similar other agent, and thus has to reason about the fact that the other agent will likely reason in similar ways. Evaluating LLM reasoning about Newcomb-like problems is important because interactions between foundation-model-based agents will often be Newcomb-like. Some ways of reasoning about Newcomb-like problems may allow for greater cooperation between models. Our dataset contains both capabilities questions (i.e., questions with a unique, uncontroversially correct answer) and attitude questions (i.e., questions about which decision theorists would disagree). We use our dataset for an investigation of decision-theoretical capabilities and expressed attitudes and their interplay in existing models (different models by OpenAI, Anthropic, Meta, GDM, Reka, etc.), as well as models under simple prompt-based interventions. We find, among other things, that attitudes vary significantly between existing models; that high capabilities are associated with attitudes more favorable toward so-called evidential decision theory; and that attitudes are consistent across different types of questions. |
Date | 2024-12-15 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.10588 |
Accessed | 12/20/2024, 11:19:22 AM |
Extra | arXiv:2411.10588 [cs] |
DOI | 10.48550/arXiv.2411.10588 |
Repository | arXiv |
Archive ID | arXiv:2411.10588 |
Date Added | 12/20/2024, 11:19:22 AM |
Modified | 12/20/2024, 11:19:22 AM |
Comment | 48 pages, 15 figures; code and data at https://github.com/casparoe/newcomblike_questions_dataset; corrected error in funding acknowledgments |
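As a small worked example of why Newcomb-like problems separate decision theories (the numbers below are the textbook formulation, not drawn from this dataset): a predictor with accuracy p fills an opaque box with a large prize only if it predicts one-boxing, while a transparent box always holds a small prize. Evidential decision theory conditions on what the choice indicates about the prediction; causal decision theory holds the already-made prediction fixed, so two-boxing dominates for any fixed contents.

```python
# Toy expected-value comparison for the classic Newcomb problem, with assumed
# illustrative numbers: predictor accuracy 0.99, $1,000,000 opaque prize,
# $1,000 transparent prize.
P = 0.99
BIG, SMALL = 1_000_000, 1_000

# Evidential decision theory: condition on the evidence the choice provides
# about the prediction.
edt_one_box = P * BIG
edt_two_box = (1 - P) * BIG + SMALL

print(f"EDT expected value: one-box ${edt_one_box:,.0f}, two-box ${edt_two_box:,.0f}")
# Causal decision theory: the prediction is already fixed, so for either
# possible box content, two-boxing yields exactly SMALL more than one-boxing.
print(f"CDT: for any fixed contents, two-boxing gains exactly ${SMALL:,} more")
```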
Item Type | Preprint |
---|---|
Author | Vivek Myers |
Author | Evan Ellis |
Author | Sergey Levine |
Author | Benjamin Eysenbach |
Author | Anca Dragan |
Abstract | Assistive agents should make humans' lives easier. Classically, such assistance is studied through the lens of inverse reinforcement learning, where an assistive agent (e.g., a chatbot, a robot) infers a human's intention and then selects actions to help the human reach that goal. This approach requires inferring intentions, which can be difficult in high-dimensional settings. We build upon prior work that studies assistance through the lens of empowerment: an assistive agent aims to maximize the influence of the human's actions such that they exert a greater control over the environmental outcomes and can solve tasks in fewer steps. We lift the major limitation of prior work in this area--scalability to high-dimensional settings--with contrastive successor representations. We formally prove that these representations estimate a similar notion of empowerment to that studied by prior work and provide a ready-made mechanism for optimizing it. Empirically, our proposed method outperforms prior methods on synthetic benchmarks, and scales to Overcooked, a cooperative game setting. Theoretically, our work connects ideas from information theory, neuroscience, and reinforcement learning, and charts a path for representations to play a critical role in solving assistive problems. |
Date | 2024-11-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02623 |
Accessed | 12/20/2024, 11:04:21 AM |
Extra | arXiv:2411.02623 [cs] |
DOI | 10.48550/arXiv.2411.02623 |
Repository | arXiv |
Archive ID | arXiv:2411.02623 |
Date Added | 12/20/2024, 11:04:21 AM |
Modified | 12/20/2024, 11:04:26 AM |
Comment | Conference on Neural Information Processing Systems (NeurIPS), 2024 |
Item Type | Preprint |
---|---|
Author | Sumeet Ramesh Motwani |
Author | Mikhail Baranchuk |
Author | Martin Strohmeier |
Author | Vijay Bolina |
Author | Philip H. S. Torr |
Author | Lewis Hammond |
Author | Christian Schroeder de Witt |
Abstract | Recent capability increases in large language models (LLMs) open up applications in which groups of communicating generative AI agents solve joint tasks. This poses privacy and security challenges concerning the unauthorised sharing of information, or other unwanted forms of agent coordination. Modern steganographic techniques could render such dynamics hard to detect. In this paper, we comprehensively formalise the problem of secret collusion in systems of generative AI agents by drawing on relevant concepts from both AI and security literature. We study incentives for the use of steganography, and propose a variety of mitigation measures. Our investigations result in a model evaluation framework that systematically tests capabilities required for various forms of secret collusion. We provide extensive empirical results across a range of contemporary LLMs. While the steganographic capabilities of current models remain limited, GPT-4 displays a capability jump suggesting the need for continuous monitoring of steganographic frontier model capabilities. We conclude by laying out a comprehensive research program to mitigate future risks of collusion between generative AI models. |
Date | 2024-11-08 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2402.07510 |
Accessed | 12/20/2024, 11:05:37 AM |
Extra | arXiv:2402.07510 [cs] |
DOI | 10.48550/arXiv.2402.07510 |
Repository | arXiv |
Archive ID | arXiv:2402.07510 |
Date Added | 12/20/2024, 11:05:37 AM |
Modified | 12/20/2024, 11:05:37 AM |
Item Type | Journal Article |
---|---|
Author | Pattie Maes |
Author | Robert H. Guttman |
Author | Alexandros G. Moukas |
Date | 1999-03 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://dl.acm.org/doi/10.1145/295685.295716 |
Accessed | 12/20/2024, 11:06:54 AM |
Volume | 42 |
Pages | 81 |
Publication | Communications of the ACM |
DOI | 10.1145/295685.295716 |
Issue | 3 |
Journal Abbr | Commun. ACM |
ISSN | 0001-0782, 1557-7317 |
Date Added | 12/20/2024, 11:06:54 AM |
Modified | 12/20/2024, 11:06:54 AM |
Item Type | Preprint |
---|---|
Author | Harrison Lee |
Author | Samrat Phatale |
Author | Hassan Mansoor |
Author | Thomas Mesnard |
Author | Johan Ferret |
Author | Kellie Lu |
Author | Colton Bishop |
Author | Ethan Hall |
Author | Victor Carbune |
Author | Abhinav Rastogi |
Author | Sushant Prakash |
Abstract | Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards "self-improvement" by demonstrating that RLAIF can outperform a supervised fine-tuned baseline even when the AI labeler is the same size as the policy, or even the exact same checkpoint as the initial policy. Finally, we introduce direct-RLAIF (d-RLAIF) - a technique that circumvents RM training by obtaining rewards directly from an off-the-shelf LLM during RL, which achieves superior performance to canonical RLAIF. Our results suggest that RLAIF can achieve performance on-par with using human feedback, offering a potential solution to the scalability limitations of RLHF. |
Date | 2024-09-03 |
Short Title | RLAIF vs. RLHF |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2309.00267 |
Accessed | 1/3/2025, 9:28:17 AM |
Extra | arXiv:2309.00267 [cs] |
DOI | 10.48550/arXiv.2309.00267 |
Repository | arXiv |
Archive ID | arXiv:2309.00267 |
Date Added | 1/3/2025, 9:28:17 AM |
Modified | 1/3/2025, 9:28:19 AM |
Comment | Presented at ICML 2024 |
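The d-RLAIF idea in the abstract above (skip reward-model training and score rollouts directly with an off-the-shelf LLM during RL) can be sketched briefly. This is an assumed reading of the description: `query_llm`, the grading prompt, and the normalization are hypothetical placeholders, not the paper's prompts or scales.

```python
# Minimal sketch of direct-RLAIF: instead of training a reward model, ask an
# off-the-shelf LLM to grade each sampled response during RL.
def direct_ai_reward(prompt: str, response: str, query_llm) -> float:
    grading_prompt = (
        "Rate the following response on a 1-10 scale for helpfulness and "
        "harmlessness. Reply with a number only.\n\n"
        f"Prompt: {prompt}\nResponse: {response}\nRating:"
    )
    raw = query_llm(grading_prompt)
    try:
        score = float(raw.strip())
    except ValueError:
        return 0.0                      # unparseable grade -> neutral reward
    return (score - 5.5) / 4.5          # map 1..10 roughly onto [-1, 1]

# Example with a trivial stand-in grader; in practice query_llm would call a
# deployed LLM and the reward would feed a standard policy-gradient update.
print(direct_ai_reward("Summarize the article.", "A concise summary...", lambda p: "8"))
```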
Item Type | Preprint |
---|---|
Author | Mallory Knodel |
Author | Andrés Fábrega |
Author | Daniella Ferrari |
Author | Jacob Leiken |
Author | Betty Li Hou |
Author | Derek Yen |
Author | Sam de Alfaro |
Author | Kyunghyun Cho |
Author | Sunoo Park |
Abstract | End-to-end encryption (E2EE) has become the gold standard for securing communications, bringing strong confidentiality and privacy guarantees to billions of users worldwide. However, the current push towards widespread integration of artificial intelligence (AI) models, including in E2EE systems, raises some serious security concerns. This work performs a critical examination of the (in)compatibility of AI models and E2EE applications. We explore this on two fronts: (1) the integration of AI “assistants” within E2EE applications, and (2) the use of E2EE data for training AI models. We analyze the potential security implications of each, and identify conflicts with the security guarantees of E2EE. Then, we analyze legal implications of integrating AI models in E2EE applications, given how AI integration can undermine the confidentiality that E2EE promises. Finally, we offer a list of detailed recommendations based on our technical and legal analyses, including: technical design choices that must be prioritized to uphold E2EE security; how service providers must accurately represent E2EE security; and best practices for the default behavior of AI features and for requesting user consent. We hope this paper catalyzes an informed conversation on the tensions that arise between the brisk deployment of AI and the security offered by E2EE, and guides the responsible development of new AI features. |
Date | 2024 |
Short Title | How To Think About End-To-End Encryption and AI |
Archive | Cryptology ePrint Archive |
Library Catalog | Cryptology ePrint Archive (eprint.iacr.org) |
URL | https://eprint.iacr.org/2024/2086 |
Accessed | 1/3/2025, 9:41:12 AM |
Extra | Publication info: Preprint. |
Archive ID | 2024/2086 |
Date Added | 1/3/2025, 9:41:12 AM |
Modified | 1/3/2025, 9:41:12 AM |
Item Type | Preprint |
---|---|
Author | John Hughes |
Author | Sara Price |
Author | Aengus Lynch |
Author | Rylan Schaeffer |
Author | Fazl Barez |
Author | Sanmi Koyejo |
Author | Henry Sleight |
Author | Erik Jones |
Author | Ethan Perez |
Author | Mrinank Sharma |
Abstract | We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities. |
Date | 2024-12-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.03556 |
Accessed | 12/20/2024, 11:09:52 AM |
Extra | arXiv:2412.03556 [cs] |
DOI | 10.48550/arXiv.2412.03556 |
Repository | arXiv |
Archive ID | arXiv:2412.03556 |
Date Added | 12/20/2024, 11:09:52 AM |
Modified | 12/20/2024, 11:09:52 AM |
Item Type | Preprint |
---|---|
Author | Luke Hogg |
Author | Renée DiResta |
Author | Francis Fukuyama |
Author | Richard Reisman |
Author | Daphne Keller |
Author | Aviv Ovadya |
Author | Luke Thorburn |
Author | Jonathan Stray |
Author | Shubhi Mathur |
Abstract | Middleware, third-party software intermediaries between users and platforms, has been broached as a means to decentralize the power of social media platforms and enhance user agency. Middleware may enable a more user-centric and democratic approach to shaping digital experiences, offering a flexible architecture as an alternative to both centrally controlled, opaque platforms and an unmoderated, uncurated internet. The widespread adoption of open middleware has long hinged on the cooperation of established major platforms; however, the recent growth of federated platforms, such as Mastodon and Bluesky, has led to increased offerings and user awareness. In this report we consider the potential of middleware as a means of enabling greater user control over curation and moderation - two aspects of the social media experience that are often mired in controversy. We evaluate the trade-offs and negative externalities it might create, and discuss the technological, regulatory, and market dynamics that could either support or hinder its implementation. |
Date | 2024-12-13 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.10283 |
Accessed | 12/20/2024, 11:11:49 AM |
Extra | arXiv:2412.10283 [cs] |
DOI | 10.48550/arXiv.2412.10283 |
Repository | arXiv |
Archive ID | arXiv:2412.10283 |
Date Added | 12/20/2024, 11:11:49 AM |
Modified | 12/20/2024, 11:11:49 AM |
Comment | 51 pages |
Item Type | Journal Article |
---|---|
Author | Melody Y Guan |
Author | Manas Joglekar |
Author | Eric Wallace |
Author | Alec Heylar |
Author | Rachel Dias |
Author | Andrea Vallone |
Author | Hyung Won Chung |
Author | Sam Toyer |
Author | Johannes Heidecke |
Author | Saachi Jain |
Author | Hongyu Ren |
Author | Alex Beutel |
Author | Boaz Barak |
Author | Jason Wei |
Author | Amelia Glaese |
Abstract | As large-scale language models increasingly impact safety-critical domains, ensuring their reliable adherence to well-defined principles remains a fundamental challenge. We introduce Deliberative Alignment, a new paradigm that directly teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI’s o-series models [1], and achieved highly precise adherence to OpenAI’s safety policies, without requiring human-written chain-of-thoughts or answers. Deliberative Alignment pushes the Pareto frontier by simultaneously increasing robustness to jailbreaks while decreasing overrefusal rates, and also improves out-of-distribution generalization. We demonstrate that reasoning over explicitly specified policies enables more scalable, trustworthy, and interpretable alignment. |
Language | en |
Library Catalog | Zotero |
Date Added | 12/23/2024, 1:00:18 PM |
Modified | 12/23/2024, 1:00:18 PM |
Item Type | Preprint |
---|---|
Author | Ryan Greenblatt |
Author | Carson Denison |
Author | Benjamin Wright |
Author | Fabien Roger |
Author | Monte MacDiarmid |
Author | Sam Marks |
Author | Johannes Treutlein |
Author | Tim Belonax |
Author | Jack Chen |
Author | David Duvenaud |
Author | Akbir Khan |
Author | Julian Michael |
Author | Sören Mindermann |
Author | Ethan Perez |
Author | Linda Petrini |
Author | Jonathan Uesato |
Author | Jared Kaplan |
Author | Buck Shlegeris |
Author | Samuel R. Bowman |
Author | Evan Hubinger |
Abstract | We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data--and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference--as in this case--or not. |
Date | 2024-12-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.14093 |
Accessed | 12/20/2024, 11:23:52 AM |
Extra | arXiv:2412.14093 [cs] |
DOI | 10.48550/arXiv.2412.14093 |
Repository | arXiv |
Archive ID | arXiv:2412.14093 |
Date Added | 12/20/2024, 11:23:52 AM |
Modified | 12/20/2024, 11:23:52 AM |
Item Type | Preprint |
---|---|
Author | Fabian Gloeckle |
Author | Badr Youbi Idrissi |
Author | Baptiste Rozière |
Author | David Lopez-Paz |
Author | Gabriel Synnaeve |
Abstract | Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes, and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes. |
Date | 2024-04-30 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2404.19737 |
Accessed | 1/3/2025, 9:30:35 AM |
Extra | arXiv:2404.19737 [cs] |
DOI | 10.48550/arXiv.2404.19737 |
Repository | arXiv |
Archive ID | arXiv:2404.19737 |
Date Added | 1/3/2025, 9:30:35 AM |
Modified | 1/3/2025, 9:30:38 AM |
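The core architectural idea in the entry above (n independent output heads over a shared trunk, each predicting a token further ahead) is compact enough to sketch. The code below is a simplified, assumed reading of the abstract, not the authors' implementation; the class name, loss averaging, and example shapes are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenHeads(nn.Module):
    """n independent linear heads over a shared trunk's hidden states;
    head i is trained to predict the token (i + 1) positions ahead."""
    def __init__(self, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size) for _ in range(n_future)
        )

    def forward(self, trunk_hidden: torch.Tensor) -> list[torch.Tensor]:
        # trunk_hidden: (batch, seq_len, d_model) produced by the shared trunk.
        return [head(trunk_hidden) for head in self.heads]

def multi_token_loss(logits_per_head, tokens):
    """Average cross-entropy over heads, shifting targets one extra step per head."""
    total = 0.0
    for i, logits in enumerate(logits_per_head):
        shift = i + 1
        pred = logits[:, :-shift, :]      # positions that still have a target
        target = tokens[:, shift:]        # token `shift` steps ahead
        total = total + F.cross_entropy(
            pred.reshape(-1, pred.size(-1)), target.reshape(-1)
        )
    return total / len(logits_per_head)

# Example with random tensors standing in for a real trunk and tokenizer.
batch, seq_len, d_model, vocab = 2, 16, 32, 100
heads = MultiTokenHeads(d_model, vocab)
hidden = torch.randn(batch, seq_len, d_model)
tokens = torch.randint(0, vocab, (batch, seq_len))
loss = multi_token_loss(heads(hidden), tokens)
loss.backward()
```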
Item Type | Preprint |
---|---|
Author | Tarleton Gillespie |
Author | Ryland Shaw |
Author | Mary L. Gray |
Author | Jina Suh |
Abstract | As generative AI technologies find more and more real-world applications, the importance of testing their performance and safety seems paramount. ``Red-teaming'' has quickly become the primary approach to test AI models--prioritized by AI companies, and enshrined in AI policy and regulation. Members of red teams act as adversaries, probing AI systems to test their safety mechanisms and uncover vulnerabilities. Yet we know too little about this work and its implications. This essay calls for collaboration between computer scientists and social scientists to study the sociotechnical systems surrounding AI technologies, including the work of red-teaming, to avoid repeating the mistakes of the recent past. We highlight the importance of understanding the values and assumptions behind red-teaming, the labor involved, and the psychological impacts on red-teamers. |
Date | 2024-12-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.09751 |
Accessed | 12/20/2024, 11:11:19 AM |
Extra | arXiv:2412.09751 [cs] |
DOI | 10.48550/arXiv.2412.09751 |
Repository | arXiv |
Archive ID | arXiv:2412.09751 |
Date Added | 12/20/2024, 11:11:19 AM |
Modified | 12/20/2024, 11:11:19 AM |
Comment | 8 pages |
Item Type | Preprint |
---|---|
Author | Jonas Gehring |
Author | Kunhao Zheng |
Author | Jade Copet |
Author | Vegard Mella |
Author | Taco Cohen |
Author | Gabriel Synnaeve |
Abstract | Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks, where we achieve new state-of-the-art results with both small (8B parameters) and large (70B) models while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps. |
Date | 2024-10-02 |
Short Title | RLEF |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.02089 |
Accessed | 1/3/2025, 9:35:42 AM |
Extra | arXiv:2410.02089 [cs] |
DOI | 10.48550/arXiv.2410.02089 |
Repository | arXiv |
Archive ID | arXiv:2410.02089 |
Date Added | 1/3/2025, 9:35:42 AM |
Modified | 1/3/2025, 9:35:44 AM |
Item Type | Preprint |
---|---|
Author | A. Feder Cooper |
Author | Christopher A. Choquette-Choo |
Author | Miranda Bogen |
Author | Matthew Jagielski |
Author | Katja Filippova |
Author | Ken Ziyu Liu |
Author | Alexandra Chouldechova |
Author | Jamie Hayes |
Author | Yangsibo Huang |
Author | Niloofar Mireshghallah |
Author | Ilia Shumailov |
Author | Eleni Triantafillou |
Author | Peter Kairouz |
Author | Nicole Mitchell |
Author | Percy Liang |
Author | Daniel E. Ho |
Author | Yejin Choi |
Author | Sanmi Koyejo |
Author | Fernando Delgado |
Author | James Grimmelmann |
Author | Vitaly Shmatikov |
Author | Christopher De Sa |
Author | Solon Barocas |
Author | Amy Cyphert |
Author | Mark Lemley |
Author | danah boyd |
Author | Jennifer Wortman Vaughan |
Author | Miles Brundage |
Author | David Bau |
Author | Seth Neel |
Author | Abigail Z. Jacobs |
Author | Andreas Terzis |
Author | Hanna Wallach |
Author | Nicolas Papernot |
Author | Katherine Lee |
Abstract | We articulate fundamental mismatches between technical methods for machine unlearning in Generative AI, and documented aspirations for broader impact that these methods could have for law and policy. These aspirations are both numerous and varied, motivated by issues that pertain to privacy, copyright, safety, and more. For example, unlearning is often invoked as a solution for removing the effects of targeted information from a generative-AI model's parameters, e.g., a particular individual's personal data or in-copyright expression of Spiderman that was included in the model's training data. Unlearning is also proposed as a way to prevent a model from generating targeted types of information in its outputs, e.g., generations that closely resemble a particular individual's data or reflect the concept of "Spiderman." Both of these goals--the targeted removal of information from a model and the targeted suppression of information from a model's outputs--present various technical and substantive challenges. We provide a framework for thinking rigorously about these challenges, which enables us to be clear about why unlearning is not a general-purpose solution for circumscribing generative-AI model behavior in service of broader positive impact. We aim for conceptual clarity and to encourage more thoughtful communication among machine learning (ML), law, and policy experts who seek to develop and apply technical methods for compliance with policy objectives. |
Date | 2024-12-09 |
Short Title | Machine Unlearning Doesn't Do What You Think |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.06966 |
Accessed | 12/20/2024, 11:10:24 AM |
Extra | arXiv:2412.06966 [cs] |
DOI | 10.48550/arXiv.2412.06966 |
Repository | arXiv |
Archive ID | arXiv:2412.06966 |
Date Added | 12/20/2024, 11:10:24 AM |
Modified | 12/20/2024, 11:10:24 AM |
Comment | Presented at the 2nd Workshop on Generative AI and Law at ICML (July 2024) |
Item Type | Preprint |
---|---|
Author | Alexandra Chouldechova |
Author | Chad Atalla |
Author | Solon Barocas |
Author | A. Feder Cooper |
Author | Emily Corvi |
Author | P. Alex Dow |
Author | Jean Garcia-Gathright |
Author | Nicholas Pangakis |
Author | Stefanie Reed |
Author | Emily Sheng |
Author | Dan Vann |
Author | Matthew Vogel |
Author | Hannah Washington |
Author | Hanna Wallach |
Abstract | The valid measurement of generative AI (GenAI) systems' capabilities, risks, and impacts forms the bedrock of our ability to evaluate these systems. We introduce a shared standard for valid measurement that helps place many of the disparate-seeming evaluation practices in use today on a common footing. Our framework, grounded in measurement theory from the social sciences, extends the work of Adcock & Collier (2001) in which the authors formalized valid measurement of concepts in political science via three processes: systematizing background concepts, operationalizing systematized concepts via annotation procedures, and applying those procedures to instances. We argue that valid measurement of GenAI systems' capabilities, risks, and impacts, further requires systematizing, operationalizing, and applying not only the entailed concepts, but also the contexts of interest and the metrics used. This involves both descriptive reasoning about particular instances and inferential reasoning about underlying populations, which is the purview of statistics. By placing many disparate-seeming GenAI evaluation practices on a common footing, our framework enables individual evaluations to be better understood, interrogated for reliability and validity, and meaningfully compared. This is an important step in advancing GenAI evaluation practices toward more formalized and theoretically grounded processes -- i.e., toward a science of GenAI evaluations. |
Date | 2024-12-02 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.01934 |
Accessed | 12/20/2024, 11:10:59 AM |
Extra | arXiv:2412.01934 [cs] |
DOI | 10.48550/arXiv.2412.01934 |
Repository | arXiv |
Archive ID | arXiv:2412.01934 |
Date Added | 12/20/2024, 11:10:59 AM |
Modified | 12/20/2024, 11:10:59 AM |
Comment | NeurIPS 2024 Workshop on Statistical Foundations of LLMs and Foundation Models (SFLLM) |
Item Type | Preprint |
---|---|
Author | Bradley Brown |
Author | Jordan Juravsky |
Author | Ryan Ehrlich |
Author | Ronald Clark |
Author | Quoc V. Le |
Author | Christopher Ré |
Author | Azalia Mirhoseini |
Abstract | Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget. |
Date | 2024-12-30 |
Short Title | Large Language Monkeys |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2407.21787 |
Accessed | 1/3/2025, 9:28:53 AM |
Extra | arXiv:2407.21787 [cs] |
DOI | 10.48550/arXiv.2407.21787 |
Repository | arXiv |
Archive ID | arXiv:2407.21787 |
Date Added | 1/3/2025, 9:28:53 AM |
Modified | 1/3/2025, 9:28:57 AM |
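The claim in the entry above that coverage scales log-linearly with the number of samples and "can be modelled with an exponentiated power law" can be made concrete with a small curve fit. The sketch below uses made-up coverage numbers and assumes one plausible functional form, coverage(N) ≈ exp(-a·N^(-b)); the paper's exact parameterization may differ.

```python
import numpy as np
from scipy.optimize import curve_fit

# Illustrative (made-up) measurements: fraction of problems solved by at
# least one of N sampled candidate solutions.
n_samples = np.array([1, 4, 16, 64, 256, 1024])
coverage = np.array([0.16, 0.28, 0.41, 0.53, 0.63, 0.71])

def exponentiated_power_law(n, a, b):
    # One assumed form consistent with the described scaling:
    # log(coverage) = -a * n**(-b), so coverage approaches 1 as n grows.
    return np.exp(-a * n ** (-b))

(a, b), _ = curve_fit(exponentiated_power_law, n_samples, coverage, p0=(2.0, 0.3))
print(f"fitted a={a:.2f}, b={b:.2f}")
print(f"extrapolated coverage at N=10,000: {exponentiated_power_law(10_000, a, b):.2f}")
```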
Item Type | Preprint |
---|---|
Author | Marah Abdin |
Author | Jyoti Aneja |
Author | Harkirat Behl |
Author | Sébastien Bubeck |
Author | Ronen Eldan |
Author | Suriya Gunasekar |
Author | Michael Harrison |
Author | Russell J. Hewett |
Author | Mojan Javaheripi |
Author | Piero Kauffmann |
Author | James R. Lee |
Author | Yin Tat Lee |
Author | Yuanzhi Li |
Author | Weishung Liu |
Author | Caio C. T. Mendes |
Author | Anh Nguyen |
Author | Eric Price |
Author | Gustavo de Rosa |
Author | Olli Saarikivi |
Author | Adil Salim |
Author | Shital Shah |
Author | Xin Wang |
Author | Rachel Ward |
Author | Yue Wu |
Author | Dingli Yu |
Author | Cyril Zhang |
Author | Yi Zhang |
Abstract | We present phi-4, a 14-billion parameter language model developed with a training recipe that is centrally focused on data quality. Unlike most language models, where pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process. While previous models in the Phi family largely distill the capabilities of a teacher model (specifically GPT-4), phi-4 substantially surpasses its teacher model on STEM-focused QA capabilities, giving evidence that our data-generation and post-training techniques go beyond distillation. Despite minimal changes to the phi-3 architecture, phi-4 achieves strong performance relative to its size -- especially on reasoning-focused benchmarks -- due to improved data, training curriculum, and innovations in the post-training scheme. |
Date | 2024-12-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.08905 |
Accessed | 12/20/2024, 11:03:16 AM |
Extra | arXiv:2412.08905 [cs] |
DOI | 10.48550/arXiv.2412.08905 |
Repository | arXiv |
Archive ID | arXiv:2412.08905 |
Date Added | 12/20/2024, 11:03:17 AM |
Modified | 12/20/2024, 11:03:17 AM |