• Multi-agent Architecture Search via Agentic Supernet

    Item Type Preprint
    Author Guibin Zhang
    Author Luyang Niu
    Author Junfeng Fang
    Author Kun Wang
    Author Lei Bai
    Author Xiang Wang
    Abstract Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
    Date 2025-02-06
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.04180
    Accessed 2/13/2025, 11:08:07 AM
    Extra arXiv:2502.04180 [cs]
    DOI 10.48550/arXiv.2502.04180
    Repository arXiv
    Archive ID arXiv:2502.04180
    Date Added 2/13/2025, 11:08:07 AM
    Modified 2/13/2025, 11:08:07 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning
    • Computer Science - Multiagent Systems

    Attachments

    • Preprint PDF
    • Snapshot
  • SPRI: Aligning Large Language Models with Context-Situated Principles

    Item Type Preprint
    Author Hongli Zhan
    Author Muneeza Azmat
    Author Raya Horesh
    Author Junyi Jessy Li
    Author Mikhail Yurochkin
    Abstract Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.
    Date 2025-02-05
    Short Title SPRI
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.03397
    Accessed 2/13/2025, 11:15:29 AM
    Extra arXiv:2502.03397 [cs]
    DOI 10.48550/arXiv.2502.03397
    Repository arXiv
    Archive ID arXiv:2502.03397
    Date Added 2/13/2025, 11:15:29 AM
    Modified 2/13/2025, 11:15:29 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Full Text PDF
    • Snapshot
  • Generating Symbolic World Models via Test-time Scaling of Large Language Models

    Item Type Preprint
    Author Zhouliang Yu
    Author Yuhuan Yuan
    Author Tim Z. Xiao
    Author Fuxiang Frank Xia
    Author Jie Fu
    Author Ge Zhang
    Author Ge Lin
    Author Weiyang Liu
    Abstract Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality-a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domain, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.
    Date 2025-02-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.04728
    Accessed 2/13/2025, 11:30:39 AM
    Extra arXiv:2502.04728 [cs]
    DOI 10.48550/arXiv.2502.04728
    Repository arXiv
    Archive ID arXiv:2502.04728
    Date Added 2/13/2025, 11:30:39 AM
    Modified 2/13/2025, 11:30:39 AM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Technical Report v1 (32 pages, 6 figures)

    Attachments

    • Preprint PDF
    • Snapshot
  • Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

    Item Type Preprint
    Author Yutong Yin
    Author Zhaoran Wang
    Abstract Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns ( B = f(A) ) from one source and ( C = g(B) ) from another, they can deduce ( C=g(B)=g(f(A)) ) even without encountering ( ABC ) together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.
    Date 2025-01-27
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.15857
    Accessed 2/3/2025, 10:32:14 AM
    Extra arXiv:2501.15857 [cs]
    DOI 10.48550/arXiv.2501.15857
    Repository arXiv
    Archive ID arXiv:2501.15857
    Date Added 2/3/2025, 10:32:14 AM
    Modified 2/3/2025, 10:32:14 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: It is accepted by The Thirteenth International Conference on Learning Representations and will be published soon. The submission number is 2678

    Attachments

    • Preprint PDF
    • Snapshot
  • LIMO: Less is More for Reasoning

    Item Type Preprint
    Author Yixin Ye
    Author Zhen Huang
    Author Yang Xiao
    Author Ethan Chern
    Author Shijie Xia
    Author Pengfei Liu
    Abstract We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.
    Date 2025-02-05
    Short Title LIMO
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.03387
    Accessed 2/7/2025, 1:30:16 PM
    Extra arXiv:2502.03387 [cs]
    DOI 10.48550/arXiv.2502.03387
    Repository arXiv
    Archive ID arXiv:2502.03387
    Date Added 2/7/2025, 1:30:16 PM
    Modified 2/7/2025, 1:30:19 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 17 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • ENIGMAEVAL: A Benchmark of Long Multimodal

    Item Type Journal Article
    Author Clinton J Wang
    Author Dean Lee
    Author Cristina Menghini
    Author Johannes Mols
    Author Jack Doughty
    Author Jayson Lynch
    Author Sean Hendryx
    Author Summer Yue
    Author Dan Hendrycks
    Abstract As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce ENIGMAEVAL, a dataset of problems and solutions derived from puzzle competitions and events that probes models’ ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity – each typically requiring teams of skilled solvers hours to days to complete – with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity’s Last Exam, unveiling models’ shortcomings when challenged with problems requiring unstructured and lateral reasoning.
    Language en
    Library Catalog Zotero
    Date Added 2/14/2025, 2:54:20 PM
    Modified 2/14/2025, 2:54:20 PM

    Attachments

    • PDF
  • LLM Pretraining with Continuous Concepts

    Item Type Preprint
    Author Jihoon Tack
    Author Jack Lanchantin
    Author Jane Yu
    Author Andrew Cohen
    Author Ilia Kulikov
    Author Janice Lan
    Author Shibo Hao
    Author Yuandong Tian
    Author Jason Weston
    Author Xian Li
    Abstract Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
    Date 2025-02-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08524
    Accessed 2/13/2025, 11:44:13 AM
    Extra arXiv:2502.08524 [cs]
    DOI 10.48550/arXiv.2502.08524
    Repository arXiv
    Archive ID arXiv:2502.08524
    Date Added 2/13/2025, 11:44:13 AM
    Modified 2/13/2025, 11:44:13 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Authenticated Delegation and Authorized AI Agents

    Item Type Preprint
    Author Tobin South
    Author Samuele Marro
    Author Thomas Hardjono
    Author Robert Mahari
    Author Cedric Deslandes Whitney
    Author Dazza Greenwood
    Author Alan Chan
    Author Alex Pentland
    Abstract The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to know whom AI agents act on behalf of and guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, where human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata, maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring agentic AI systems perform only appropriate actions and providing a tool for digital service providers to enable AI agent interactions without risking harm from scalable interaction.
    Date 2025-01-16
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.09674
    Accessed 1/29/2025, 11:26:00 AM
    Extra arXiv:2501.09674 [cs] version: 1
    DOI 10.48550/arXiv.2501.09674
    Repository arXiv
    Archive ID arXiv:2501.09674
    Date Added 1/29/2025, 11:26:00 AM
    Modified 1/29/2025, 11:26:00 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Networking and Internet Architecture

    Attachments

    • Preprint PDF
    • Snapshot
  • On the Feasibility of Using LLMs to Execute Multistage Network Attacks

    Item Type Preprint
    Author Brian Singer
    Author Keane Lucas
    Author Lakshmi Adiga
    Author Meghna Jain
    Author Lujo Bauer
    Author Vyas Sekar
    Abstract LLMs have shown preliminary promise in some security tasks and CTF challenges. However, it is unclear whether LLMs are able to realize multistage network attacks, which involve executing a wide variety of actions across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities to gain initial access, leveraging internal hosts to move laterally, and using multiple compromised hosts to exfiltrate data. We evaluate LLMs across 10 multistage networks and find that popular LLMs are unable to realize these attacks. To enable LLMs to realize these attacks, we introduce Incalmo, an LLM-agnostic high-level attack abstraction layer that sits between an LLM and the environment. Rather than LLMs issuing low-level command-line instructions, which can lead to incorrect implementations, Incalmo allows LLMs to specify high-level tasks (e.g., infect a host, scan a network), which are then carried out by Incalmo. Incalmo realizes these tasks by translating them into low-level primitives (e.g., commands to exploit tools). Incalmo also provides an environment state service and an attack graph service to provide structure to LLMs in selecting actions relevant to a multistage attack. Across 9 out of 10 realistic emulated networks (from 25 to 50 hosts), LLMs using Incalmo can successfully autonomously execute multistage attacks. We also conduct an ablation analysis to show the key role the high-level abstractions play. For instance, we find that both Incalmo's high-level tasks and services are crucial. Furthermore, even smaller-parameter LLMs with Incalmo can fully succeed in 5 of 10 environments, while larger-parameter LLMs without Incalmo do not fully succeed in any.
    Date 2025-01-27
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.16466
    Accessed 1/29/2025, 4:33:58 PM
    Extra arXiv:2501.16466 [cs]
    DOI 10.48550/arXiv.2501.16466
    Repository arXiv
    Archive ID arXiv:2501.16466
    Date Added 1/29/2025, 4:33:58 PM
    Modified 1/29/2025, 4:34:01 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: 16 pages, 14 figures

    Attachments

    • Full Text PDF
    • Snapshot
  • Open Problems in Mechanistic Interpretability

    Item Type Preprint
    Author Lee Sharkey
    Author Bilal Chughtai
    Author Joshua Batson
    Author Jack Lindsey
    Author Jeff Wu
    Author Lucius Bushnaq
    Author Nicholas Goldowsky-Dill
    Author Stefan Heimersheim
    Author Alejandro Ortega
    Author Joseph Bloom
    Author Stella Biderman
    Author Adria Garriga-Alonso
    Author Arthur Conmy
    Author Neel Nanda
    Author Jessica Rumbelow
    Author Martin Wattenberg
    Author Nandi Schoots
    Author Joseph Miller
    Author Eric J. Michaud
    Author Stephen Casper
    Author Max Tegmark
    Author William Saunders
    Author David Bau
    Author Eric Todd
    Author Atticus Geiger
    Author Mor Geva
    Author Jesse Hoogland
    Author Daniel Murfet
    Author Tom McGrath
    Abstract Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
    Date 2025-01-27
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.16496
    Accessed 1/29/2025, 11:18:54 AM
    Extra arXiv:2501.16496 [cs]
    DOI 10.48550/arXiv.2501.16496
    Repository arXiv
    Archive ID arXiv:2501.16496
    Date Added 1/29/2025, 11:18:54 AM
    Modified 1/29/2025, 11:18:54 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

    Item Type Preprint
    Author Swarnadeep Saha
    Author Xian Li
    Author Marjan Ghazvininejad
    Author Jason Weston
    Author Tianlu Wang
    Abstract LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
    Date 2025-01-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.18099
    Accessed 2/1/2025, 3:50:32 PM
    Extra arXiv:2501.18099 [cs]
    DOI 10.48550/arXiv.2501.18099
    Repository arXiv
    Archive ID arXiv:2501.18099
    Date Added 2/1/2025, 3:50:32 PM
    Modified 2/1/2025, 3:50:32 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants

    Item Type Preprint
    Author Pascal J. Sager
    Author Benjamin Meyer
    Author Peng Yan
    Author Rebekka von Wartburg-Kottler
    Author Layan Etaiwi
    Author Aref Enayati
    Author Gabriel Nobel
    Author Ahmed Abdulkadir
    Author Benjamin F. Grewe
    Author Thilo Stadelmann
    Abstract Instruction-based computer control agents (CCAs) execute complex action sequences on personal computers or mobile devices to fulfill tasks using the same graphical user interfaces as a human user would, provided instructions in natural language. This review offers a comprehensive overview of the emerging field of instruction-based computer control, examining available agents -- their taxonomy, development, and respective resources -- and emphasizing the shift from manually designed, specialized agents to leveraging foundation models such as large language models (LLMs) and vision-language models (VLMs). We formalize the problem and establish a taxonomy of the field to analyze agents from three perspectives: (a) the environment perspective, analyzing computer environments; (b) the interaction perspective, describing observations spaces (e.g., screenshots, HTML) and action spaces (e.g., mouse and keyboard actions, executable code); and (c) the agent perspective, focusing on the core principle of how an agent acts and learns to act. Our framework encompasses both specialized and foundation agents, facilitating their comparative analysis and revealing how prior solutions in specialized agents, such as an environment learning step, can guide the development of more capable foundation agents. Additionally, we review current CCA datasets and CCA evaluation methods and outline the challenges to deploying such agents in a productive setting. In total, we review and classify 86 CCAs and 33 related datasets. By highlighting trends, limitations, and future research directions, this work presents a comprehensive foundation to obtain a broad understanding of the field and push its future development.
    Date 2025-01-27
    Short Title AI Agents for Computer Use
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.16150
    Accessed 2/1/2025, 3:50:49 PM
    Extra arXiv:2501.16150 [cs]
    DOI 10.48550/arXiv.2501.16150
    Repository arXiv
    Archive ID arXiv:2501.16150
    Date Added 2/1/2025, 3:50:49 PM
    Modified 2/1/2025, 3:50:49 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction
    • Computer Science - Systems and Control
    • Electrical Engineering and Systems Science - Systems and Control

    Attachments

    • Preprint PDF
    • Snapshot
  • Human Decision-making is Susceptible to AI-driven Manipulation

    Item Type Preprint
    Author Sahand Sabour
    Author June M. Liu
    Author Siyang Liu
    Author Chris Z. Yao
    Author Shiyao Cui
    Author Xuanming Zhang
    Author Wen Zhang
    Author Yaru Cao
    Author Advait Bhat
    Author Jian Guan
    Author Wei Wu
    Author Rada Mihalcea
    Author Tim Althoff
    Author Tatia M. C. Lee
    Author Minlie Huang
    Abstract Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
    Date 2025-02-11
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.07663
    Accessed 2/14/2025, 3:02:32 PM
    Extra arXiv:2502.07663 [cs]
    DOI 10.48550/arXiv.2502.07663
    Repository arXiv
    Archive ID arXiv:2502.07663
    Date Added 2/14/2025, 3:02:32 PM
    Modified 2/14/2025, 3:02:32 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: Work in progress. Code and data will be made available via https://github.com/Sahandfer/Manipulation-Susceptibility

    Attachments

    • Preprint PDF
    • Snapshot
  • Introducing our short course on AGI safety

    Item Type Web Page
    Author DeepMind Safety Research
    Abstract We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
    Date 2025-02-14T15:05:57.040Z
    Language en
    URL https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c
    Accessed 2/14/2025, 3:02:07 PM
    Website Title Medium
    Date Added 2/14/2025, 3:02:07 PM
    Modified 2/14/2025, 3:02:07 PM

    Attachments

    • Snapshot
  • Noteworthy LLM Research Papers of 2024

    Item Type Web Page
    Author Sebastian Raschka
    Abstract This article covers 12 influential AI research papers of 2024, ranging from mixture-of-experts models to new LLM scaling laws for precision..
    Date 06:03:00 +0000
    Language en
    URL https://sebastianraschka.com/blog/2025/llm-research-2024.html
    Accessed 1/29/2025, 11:36:46 AM
    Website Title Sebastian Raschka, PhD
    Date Added 1/29/2025, 11:36:46 AM
    Modified 1/29/2025, 11:36:46 AM

    Attachments

    • Snapshot
  • Social Norms in Cinema: A Cross-Cultural Analysis of Shame, Pride and Prejudice

    Item Type Preprint
    Author Sunny Rai
    Author Khushang Jilesh Zaveri
    Author Shreya Havaldar
    Author Soumna Nema
    Author Lyle Ungar
    Author Sharath Chandra Guntuku
    Abstract Shame and pride are social emotions expressed across cultures to motivate and regulate people's thoughts, feelings, and behaviors. In this paper, we introduce the first cross-cultural dataset of over 10k shame/pride-related expressions, with underlying social expectations from ~5.4K Bollywood and Hollywood movies. We examine how and why shame and pride are expressed across cultures using a blend of psychology-informed language analysis combined with large language models. We find significant cross-cultural differences in shame and pride expression aligning with known cultural tendencies of the USA and India -- e.g., in Hollywood, shame-expressions predominantly discuss self whereas Bollywood discusses shame toward others. Pride in Hollywood is individualistic with more self-referential singular pronouns such as I and my whereas in Bollywood, pride is collective with higher use of self-referential plural pronouns such as we and our. Lastly, women are more sanctioned across cultures and for violating similar social expectations e.g. promiscuity.
    Date 2024-10-15
    Short Title Social Norms in Cinema
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2402.11333
    Accessed 2/1/2025, 3:50:15 PM
    Extra arXiv:2402.11333 [cs]
    DOI 10.48550/arXiv.2402.11333
    Repository arXiv
    Archive ID arXiv:2402.11333
    Date Added 2/1/2025, 3:50:15 PM
    Modified 2/1/2025, 3:50:15 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Transcoders Beat Sparse Autoencoders for Interpretability

    Item Type Preprint
    Author Gonçalo Paulo
    Author Stepan Shabalin
    Author Nora Belrose
    Abstract Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
    Date 2025-02-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.18823
    Accessed 2/13/2025, 11:02:53 AM
    Extra arXiv:2501.18823 [cs]
    DOI 10.48550/arXiv.2501.18823
    Repository arXiv
    Archive ID arXiv:2501.18823
    Date Added 2/13/2025, 11:02:53 AM
    Modified 2/13/2025, 11:02:56 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Sparse Autoencoders Trained on the Same Data Learn Different Features

    Item Type Preprint
    Author Gonçalo Paulo
    Author Nora Belrose
    Abstract Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features "truly used" by the model.
    Date 2025-01-29
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.16615
    Accessed 2/3/2025, 10:27:43 AM
    Extra arXiv:2501.16615 [cs]
    DOI 10.48550/arXiv.2501.16615
    Repository arXiv
    Archive ID arXiv:2501.16615
    Date Added 2/3/2025, 10:27:43 AM
    Modified 2/3/2025, 10:27:43 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • What fully automated firms will look like

    Item Type Web Page
    Author Dwarkesh Patel
    Abstract Everyone is sleeping on the *collective* advantages AIs will have, which have nothing to do with raw IQ - they can be copied, distilled, merged, scaled, and evolved in ways humans simply can't.
    Date 2024-12-27
    Language en
    URL https://www.dwarkeshpatel.com/p/ai-firm
    Accessed 2/1/2025, 3:26:33 PM
    Date Added 2/1/2025, 3:26:33 PM
    Modified 2/1/2025, 3:26:33 PM

    Attachments

    • Snapshot
  • Future You: A Conversation with an AI-Generated Future Self Reduces Anxiety, Negative Emotions, and Increases Future Self-Continuity

    Item Type Preprint
    Author Pat Pataranutaporn
    Author Kavin Winson
    Author Peggy Yin
    Author Auttasak Lapapirojn
    Author Pichayoot Ouppaphan
    Author Monchai Lertsutthiwong
    Author Pattie Maes
    Author Hal Hershfield
    Abstract We introduce "Future You," an interactive, brief, single-session, digital chat intervention designed to improve future self-continuity--the degree of connection an individual feels with a temporally distant future self--a characteristic that is positively related to mental health and wellbeing. Our system allows users to chat with a relatable yet AI-powered virtual version of their future selves that is tuned to their future goals and personal qualities. To make the conversation realistic, the system generates a "synthetic memory"--a unique backstory for each user--that creates a throughline between the user's present age (between 18-30) and their life at age 60. The "Future You" character also adopts the persona of an age-progressed image of the user's present self. After a brief interaction with the "Future You" character, users reported decreased anxiety, and increased future self-continuity. This is the first study successfully demonstrating the use of personalized AI-generated characters to improve users' future self-continuity and wellbeing.
    Date 2024-10-01
    Short Title Future You
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2405.12514
    Accessed 2/3/2025, 10:27:58 AM
    Extra arXiv:2405.12514 [cs]
    DOI 10.48550/arXiv.2405.12514
    Repository arXiv
    Archive ID arXiv:2405.12514
    Date Added 2/3/2025, 10:27:58 AM
    Modified 2/3/2025, 10:27:58 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Attachments

    • Full Text PDF
    • Snapshot
  • Competitive Programming with Large Reasoning Models

    Item Type Preprint
    Author OpenAI
    Author Ahmed El-Kishky
    Author Alexander Wei
    Author Andre Saraiva
    Author Borys Minaev
    Author Daniel Selsam
    Author David Dohan
    Author Francis Song
    Author Hunter Lightman
    Author Ignasi Clavera
    Author Jakub Pachocki
    Author Jerry Tworek
    Author Lorenz Kuhn
    Author Lukasz Kaiser
    Author Mark Chen
    Author Max Schwarzer
    Author Mostafa Rohaninejad
    Author Nat McAleese
    Author o3 contributors
    Author Oleg Mürk
    Author Rhythm Garg
    Author Rui Shu
    Author Szymon Sidor
    Author Vineet Kosaraju
    Author Wenda Zhou
    Abstract We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.
    Date 2025-02-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.06807
    Accessed 2/13/2025, 11:37:43 AM
    Extra arXiv:2502.06807 [cs]
    DOI 10.48550/arXiv.2502.06807
    Repository arXiv
    Archive ID arXiv:2502.06807
    Date Added 2/13/2025, 11:37:43 AM
    Modified 2/13/2025, 11:37:43 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity

    Item Type Preprint
    Author Sonia K. Murthy
    Author Tomer Ullman
    Author Jennifer Hu
    Abstract Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models' internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM "populations" by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models' value alignment and decreasing the diversity of their conceptual representations.
    Date 2024-11-12
    Short Title One fish, two fish, but not the whole sea
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.04427
    Accessed 2/13/2025, 11:34:32 AM
    Extra arXiv:2411.04427 [cs]
    DOI 10.48550/arXiv.2411.04427
    Repository arXiv
    Archive ID arXiv:2411.04427
    Date Added 2/13/2025, 11:34:32 AM
    Modified 2/13/2025, 11:34:32 AM

    Tags:

    • Computer Science - Computation and Language

    Notes:

    • Comment: 17 pages, 10 figures; corrected figure version

    Attachments

    • Preprint PDF
    • Snapshot
  • s1: Simple test-time scaling

    Item Type Preprint
    Author Niklas Muennighoff
    Author Zitong Yang
    Author Weijia Shi
    Author Xiang Lisa Li
    Author Li Fei-Fei
    Author Hannaneh Hajishirzi
    Author Luke Zettlemoyer
    Author Percy Liang
    Author Emmanuel Candès
    Author Tatsunori Hashimoto
    Abstract Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
    Date 2025-01-31
    Short Title s1
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.19393
    Accessed 2/3/2025, 10:32:30 AM
    Extra arXiv:2501.19393 [cs]
    DOI 10.48550/arXiv.2501.19393
    Repository arXiv
    Archive ID arXiv:2501.19393
    Date Added 2/3/2025, 10:32:30 AM
    Modified 2/3/2025, 10:32:30 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: 46 pages (9 main), 10 figures, 14 tables

    Attachments

    • Preprint PDF
    • Snapshot
  • s1: Simple test-time scaling

    Item Type Preprint
    Author Niklas Muennighoff
    Author Zitong Yang
    Author Weijia Shi
    Author Xiang Lisa Li
    Author Li Fei-Fei
    Author Hannaneh Hajishirzi
    Author Luke Zettlemoyer
    Author Percy Liang
    Author Emmanuel Candès
    Author Tatsunori Hashimoto
    Abstract Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
    Date 2025-02-03
    Short Title s1
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.19393
    Accessed 2/13/2025, 11:04:27 AM
    Extra arXiv:2501.19393 [cs]
    DOI 10.48550/arXiv.2501.19393
    Repository arXiv
    Archive ID arXiv:2501.19393
    Date Added 2/13/2025, 11:04:27 AM
    Modified 2/13/2025, 11:04:27 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: 45 pages (9 main), 10 figures, 14 tables

    Attachments

    • Full Text PDF
    • Snapshot
  • NoLiMa: Long-Context Evaluation Beyond Literal Matching

    Item Type Preprint
    Author Ali Modarressi
    Author Hanieh Deilamsalehy
    Author Franck Dernoncourt
    Author Trung Bui
    Author Ryan A. Rossi
    Author Seunghyun Yoon
    Author Hinrich Schütze
    Abstract Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
    Date 2025-02-07
    Short Title NoLiMa
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.05167
    Accessed 2/13/2025, 11:33:22 AM
    Extra arXiv:2502.05167 [cs]
    DOI 10.48550/arXiv.2502.05167
    Repository arXiv
    Archive ID arXiv:2502.05167
    Date Added 2/13/2025, 11:33:22 AM
    Modified 2/13/2025, 11:33:22 AM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
  • Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

    Item Type Preprint
    Author Mantas Mazeika
    Author Xuwang Yin
    Author Rishub Tamirisa
    Author Jaehyuk Lim
    Author Bruce W. Lee
    Author Richard Ren
    Author Long Phan
    Author Norman Mu
    Author Adam Khoja
    Author Oliver Zhang
    Author Dan Hendrycks
    Abstract As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
    Date 2025-02-12
    Short Title Utility Engineering
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08640
    Accessed 2/13/2025, 11:35:58 AM
    Extra arXiv:2502.08640 [cs]
    DOI 10.48550/arXiv.2502.08640
    Repository arXiv
    Archive ID arXiv:2502.08640
    Date Added 2/13/2025, 11:35:58 AM
    Modified 2/13/2025, 11:35:58 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Computer Vision and Pattern Recognition

    Attachments

    • Preprint PDF
    • Snapshot
  • smooth operator

    Item Type Web Page
    Author Lovable
    Abstract Lovable Generated Project
    Language en
    URL https://smooth-operator.online/
    Accessed 1/29/2025, 11:44:59 AM
    Date Added 1/29/2025, 11:44:59 AM
    Modified 1/29/2025, 11:44:59 AM

    Attachments

    • Snapshot
  • OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

    Item Type Preprint
    Author Gaojie Lin
    Author Jianwen Jiang
    Author Jiaqi Yang
    Author Zerong Zheng
    Author Chao Liang
    Abstract End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the ttfamily project page (https://omnihuman-lab.github.io)
    Date 2025-02-03
    Short Title OmniHuman-1
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.01061
    Accessed 2/6/2025, 9:30:17 AM
    Extra arXiv:2502.01061 [cs]
    DOI 10.48550/arXiv.2502.01061
    Repository arXiv
    Archive ID arXiv:2502.01061
    Date Added 2/6/2025, 9:30:17 AM
    Modified 2/6/2025, 9:30:17 AM

    Tags:

    • Computer Science - Computer Vision and Pattern Recognition

    Notes:

    • Comment: https://omnihuman-lab.github.io/

    Attachments

    • Preprint PDF
    • Snapshot
  • Eliciting Language Model Behaviors with Investigator Agents

    Item Type Preprint
    Author Xiang Lisa Li
    Author Neil Chowdhury
    Author Daniel D. Johnson
    Author Tatsunori Hashimoto
    Author Percy Liang
    Author Sarah Schwettmann
    Author Jacob Steinhardt
    Abstract Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
    Date 2025-02-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.01236
    Accessed 2/6/2025, 9:47:44 AM
    Extra arXiv:2502.01236 [cs]
    DOI 10.48550/arXiv.2502.01236
    Repository arXiv
    Archive ID arXiv:2502.01236
    Date Added 2/6/2025, 9:47:44 AM
    Modified 2/6/2025, 9:47:44 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: 20 pages, 7 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

    Item Type Preprint
    Author Ang Li
    Author Yin Zhou
    Author Vethavikashini Chithrra Raghuram
    Author Tom Goldstein
    Author Micah Goldblum
    Abstract A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning.
    Date 2025-02-12
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08586
    Accessed 2/14/2025, 2:39:38 PM
    Extra arXiv:2502.08586 [cs]
    DOI 10.48550/arXiv.2502.08586
    Repository arXiv
    Archive ID arXiv:2502.08586
    Date Added 2/14/2025, 2:39:38 PM
    Modified 2/14/2025, 2:39:40 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • The Emergence of Strategic Reasoning of Large Language Models

    Item Type Preprint
    Author Dongwoo Lee
    Author Gavin Kader
    Abstract As Large Language Models (LLMs) are increasingly used for a variety of complex and critical tasks, it is vital to assess their logical capabilities in strategic environments. This paper examines their ability in strategic reasoning -- the process of choosing an optimal course of action by predicting and adapting to other agents' behavior. Using six LLMs, we analyze responses from play in classical games from behavioral economics (p-Beauty Contest, 11-20 Money Request Game, and Guessing Game) and evaluate their performance through hierarchical models of reasoning (level-$k$ theory and cognitive hierarchy theory). Our findings reveal that while LLMs show understanding of the games, the majority struggle with higher-order strategic reasoning. Although most LLMs did demonstrate learning ability with games involving repeated interactions, they still consistently fall short of the reasoning levels demonstrated by typical behavior from human subjects. The exception to these overall findings is with OpenAI's GPT-o1 -- specifically trained to solve complex reasoning tasks -- which consistently outperforms other LLMs and human subjects. These findings highlight the challenges and pathways in advancing LLMs toward robust strategic reasoning from the perspective of behavioral economics.
    Date 2024-12-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.13013
    Accessed 1/29/2025, 11:36:10 AM
    Extra arXiv:2412.13013 [econ]
    DOI 10.48550/arXiv.2412.13013
    Repository arXiv
    Archive ID arXiv:2412.13013
    Date Added 1/29/2025, 11:36:10 AM
    Modified 1/29/2025, 11:36:10 AM

    Tags:

    • Economics - General Economics
    • Quantitative Finance - Economics

    Attachments

    • Preprint PDF
    • Snapshot
  • Sparse Autoencoders Do Not Find Canonical Units of Analysis

    Item Type Preprint
    Author Patrick Leask
    Author Bart Bussmann
    Author Michael Pearce
    Author Joseph Bloom
    Author Curt Tigges
    Author Noura Al Moubayed
    Author Lee Sharkey
    Author Neel Nanda
    Abstract A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a \textit{canonical} set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: \emph{novel latents}, which improve performance when added to the smaller SAE, indicating they capture novel information, and \emph{reconstruction latents}, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g. a latent representing ``Einstein'' decomposes into ``scientist'', ``Germany'', and ``famous person''. Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
    Date 2025-02-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.04878
    Accessed 2/13/2025, 11:37:21 AM
    Extra arXiv:2502.04878 [cs]
    DOI 10.48550/arXiv.2502.04878
    Repository arXiv
    Archive ID arXiv:2502.04878
    Date Added 2/13/2025, 11:37:21 AM
    Modified 2/13/2025, 11:37:21 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted to ICLR 2025

    Attachments

    • Preprint PDF
    • Snapshot
  • OverThink: Slowdown Attacks on Reasoning LLMs

    Item Type Preprint
    Author Abhinav Kumar
    Author Jaechul Roh
    Author Ali Naseh
    Author Marzena Karpinska
    Author Mohit Iyyer
    Author Amir Houmansadr
    Author Eugene Bagdasarian
    Abstract We increase overhead for applications that rely on reasoning LLMs-we force models to spend an amplified number of reasoning tokens, i.e., "overthink", to respond to the user query while providing contextually correct answers. The adversary performs an OVERTHINK attack by injecting decoy reasoning problems into the public content that is used by the reasoning LLM (e.g., for RAG applications) during inference time. Due to the nature of our decoy problems (e.g., a Markov Decision Process), modified texts do not violate safety guardrails. We evaluated our attack across closed-(OpenAI o1, o1-mini, o3-mini) and open-(DeepSeek R1) weights reasoning models on the FreshQA and SQuAD datasets. Our results show up to 18x slowdown on FreshQA dataset and 46x slowdown on SQuAD dataset. The attack also shows high transferability across models. To protect applications, we discuss and implement defenses leveraging LLM-based and system design approaches. Finally, we discuss societal, financial, and energy impacts of OVERTHINK attack which could amplify the costs for third-party applications operating reasoning models.
    Date 2025-02-05
    Short Title OverThink
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.02542
    Accessed 2/13/2025, 11:15:12 AM
    Extra arXiv:2502.02542 [cs]
    DOI 10.48550/arXiv.2502.02542
    Repository arXiv
    Archive ID arXiv:2502.02542
    Date Added 2/13/2025, 11:15:12 AM
    Modified 2/13/2025, 11:15:12 AM

    Tags:

    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
  • J2J: Jailbreaking to Jailbreak

    Item Type Journal Article
    Author Jeremy Kritz
    Author Vaughn Robinson
    Author Robert Vacareanu
    Author Bijan Varjavand
    Author Michael Choi
    Author Bobby Gogov
    Author Summer Yue
    Author Willow E Primack
    Author Zifan Wang
    Abstract Refusal training on Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J2 attackers, which can systematically evaluate target models using various red teaming strategies and improve its performance via in-context learning from the previous failures. Our experiments demonstrate that Sonnet-3.5 and Gemini-1.5-pro outperform other LLMs as J2, achieving 93.0% and 91.0% attack success rates (ASRs) respectively against GPT-4o (and similar results across other capable LLMs) on Harmbench. Our work not only introduces a scalable approach to strategic red teaming—drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of the safeguard. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent any direct misuse with J2, while advancing research in AI safety, we publicly share our methodology while keeping specific prompting details private.
    Language en
    Library Catalog Zotero
    Date Added 2/13/2025, 11:39:53 AM
    Modified 2/13/2025, 11:40:07 AM

    Attachments

    • PDF
  • A sketch of an AI control safety case

    Item Type Preprint
    Author Tomek Korbak
    Author Joshua Clymer
    Author Benjamin Hilton
    Author Buck Shlegeris
    Author Geoffrey Irving
    Abstract As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation,"' where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
    Date 2025-01-28
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.17315
    Accessed 1/31/2025, 1:09:42 PM
    Extra arXiv:2501.17315 [cs]
    DOI 10.48550/arXiv.2501.17315
    Repository arXiv
    Archive ID arXiv:2501.17315
    Date Added 1/31/2025, 1:09:42 PM
    Modified 1/31/2025, 1:09:44 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security
    • Computer Science - Software Engineering

    Attachments

    • Preprint PDF
    • Snapshot
  • Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

    Item Type Preprint
    Author Lujain Ibrahim
    Author Canfer Akbulut
    Author Rasmi Elasmar
    Author Charvi Rastogi
    Author Minsuk Kahng
    Author Meredith Ringel Morris
    Author Kevin R. McKee
    Author Verena Rieser
    Author Murray Shanahan
    Author Laura Weidinger
    Abstract The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
    Date 2025-02-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.07077
    Accessed 2/13/2025, 11:41:44 AM
    Extra arXiv:2502.07077 [cs]
    DOI 10.48550/arXiv.2502.07077
    Repository arXiv
    Archive ID arXiv:2502.07077
    Date Added 2/13/2025, 11:41:44 AM
    Modified 2/13/2025, 11:41:44 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

    Item Type Preprint
    Author Kaixuan Huang
    Author Jiacheng Guo
    Author Zihao Li
    Author Xiang Ji
    Author Jiawei Ge
    Author Wenzhe Li
    Author Yingqing Guo
    Author Tianle Cai
    Author Hui Yuan
    Author Runzhe Wang
    Author Yue Wu
    Author Ming Yin
    Author Shange Tang
    Author Yangsibo Huang
    Author Chi Jin
    Author Xinyun Chen
    Author Chiyuan Zhang
    Author Mengdi Wang
    Abstract Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycksmath et. al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
    Date 2025-02-10
    Short Title MATH-Perturb
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.06453
    Accessed 2/13/2025, 11:39:29 AM
    Extra arXiv:2502.06453 [cs]
    DOI 10.48550/arXiv.2502.06453
    Repository arXiv
    Archive ID arXiv:2502.06453
    Date Added 2/13/2025, 11:39:29 AM
    Modified 2/13/2025, 11:39:29 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Teaching Large Language Models to Reason with Reinforcement Learning

    Item Type Preprint
    Author Alex Havrilla
    Author Yuqing Du
    Author Sharath Chandra Raparthy
    Author Christoforos Nalmpantis
    Author Jane Dwivedi-Yu
    Author Maksym Zhuravinskyi
    Author Eric Hambro
    Author Sainbayar Sukhbaatar
    Author Roberta Raileanu
    Abstract Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
    Date 2024-03-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2403.04642
    Accessed 1/29/2025, 12:13:52 PM
    Extra arXiv:2403.04642 [cs]
    DOI 10.48550/arXiv.2403.04642
    Repository arXiv
    Archive ID arXiv:2403.04642
    Date Added 1/29/2025, 12:13:52 PM
    Modified 1/29/2025, 12:13:52 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations

    Item Type Journal Article
    Author Kunal Handa
    Author Alex Tamkin
    Author Miles McCain
    Author Saffron Huang
    Author Esin Durmus
    Author Sarah Heck
    Author Jared Mueller
    Author Jerry Hong
    Author Stuart Ritchie
    Author Tim Belonax
    Author Kevin K Troy
    Author Dario Amodei
    Author Jared Kaplan
    Author Jack Clark
    Author Deep Ganguli
    Abstract Despite widespread speculation about artificial intelligence’s impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system [Tamkin et al., 2024] to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor’s O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with ∼ 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI’s evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.
    Language en
    Library Catalog Zotero
    Date Added 2/13/2025, 11:31:05 AM
    Modified 2/13/2025, 11:31:05 AM

    Attachments

    • PDF
  • Internal Activation as the Polar Star for Steering Unsafe LLM Behavior

    Item Type Preprint
    Author Peixuan Han
    Author Cheng Qian
    Author Xiusi Chen
    Author Yuji Zhang
    Author Denghui Zhang
    Author Heng Ji
    Abstract Large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks but also pose significant risks due to their potential to generate harmful content. Although existing safety mechanisms can improve model safety, they often lead to overly cautious behavior and fail to fully utilize LLMs' internal cognitive processes. Drawing inspiration from cognitive science, where humans rely on reflective reasoning (System 2 thinking) to regulate language and behavior, we empirically demonstrate that LLMs also possess a similar capacity for internal assessment and regulation, which can be actively detected. Building on this insight, we introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility. Compared to traditional safety alignment methods, SafeSwitch delivers more informative and context-aware refusals, demonstrates resilience to unseen queries, and achieves these benefits while only tuning less than 6% of the original parameters. These features make SafeSwitch a promising approach for implementing nuanced safety controls in LLMs.
    Date 2025-02-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.01042
    Accessed 2/6/2025, 9:27:31 AM
    Extra arXiv:2502.01042 [cs]
    DOI 10.48550/arXiv.2502.01042
    Repository arXiv
    Archive ID arXiv:2502.01042
    Date Added 2/6/2025, 9:27:31 AM
    Modified 2/6/2025, 9:27:31 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
  • AI Personality Extraction from Faces: Labor Market Implications

    Item Type Preprint
    Author Marius Guenzel
    Author Shimon Kogan
    Author Marina Niessner
    Author Kelly Shue
    Abstract <div> Human capital---encompassing cognitive skills and personality traits---is critical for labor market success, yet the personality component remains diffic
    Date 2025-01-09
    Language en
    Short Title AI Personality Extraction from Faces
    Library Catalog papers.ssrn.com
    URL https://papers.ssrn.com/abstract=5089827
    Accessed 1/29/2025, 12:10:37 PM
    Place Rochester, NY
    Repository Social Science Research Network
    Genre SSRN Scholarly Paper
    Archive ID 5089827
    Date Added 1/29/2025, 12:10:37 PM
    Modified 1/29/2025, 12:10:37 PM

    Tags:

    • AI Personality Extraction from Faces: Labor Market Implications
    • Kelly Shue
    • Marina Niessner
    • Marius Guenzel
    • Shimon Kogan
    • SSRN

    Attachments

    • Full Text PDF
  • ALIGNMENT FAKING IN LARGE LANGUAGE MODELS

    Item Type Journal Article
    Author Ryan Greenblatt
    Author Carson Denison
    Author Benjamin Wright
    Author Fabien Roger
    Author Monte MacDiarmid
    Author Sam Marks
    Author Johannes Treutlein
    Author Tim Belonax
    Author Jack Chen
    Author David Duvenaud
    Author Akbir Khan
    Author Julian Michael
    Author Sören Mindermann
    Author Ethan Perez
    Author Linda Petrini
    Author Jonathan Uesato
    Author Jared Kaplan
    Author Buck Shlegeris
    Author Samuel R Bowman
    Author Evan Hubinger
    Abstract We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not.
    Language en
    Library Catalog Zotero
    Date Added 2/1/2025, 3:23:04 PM
    Modified 2/1/2025, 3:23:07 PM

    Attachments

    • PDF
  • Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    Item Type Preprint
    Author Jonas Geiping
    Author Sean McLeish
    Author Neel Jain
    Author John Kirchenbauer
    Author Siddharth Singh
    Author Brian R. Bartoldson
    Author Bhavya Kailkhura
    Author Abhinav Bhatele
    Author Tom Goldstein
    Abstract We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
    Date 2025-02-07
    Short Title Scaling up Test-Time Compute with Latent Reasoning
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.05171
    Accessed 2/13/2025, 11:30:10 AM
    Extra arXiv:2502.05171 [cs]
    DOI 10.48550/arXiv.2502.05171
    Repository arXiv
    Archive ID arXiv:2502.05171
    Date Added 2/13/2025, 11:30:10 AM
    Modified 2/13/2025, 11:30:14 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Notes:

    • Comment: The model is available at https://huggingface.co/tomg-group-umd/huginn-0125. Code and data recipe can be found at https://github.com/seal-rg/recurrent-pretraining

    Attachments

    • Preprint PDF
    • Snapshot
  • How Linguistics Learned to Stop Worrying and Love the Language Models

    Item Type Preprint
    Author Richard Futrell
    Author Kyle Mahowald
    Abstract Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don't really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments about learning and are informative for major questions in linguistic theory. But they do not replace linguistic structure and theory. We offer an optimistic take on the relationship between language models and linguistics.
    Date 2025-01-28
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.17047
    Accessed 1/29/2025, 4:28:48 PM
    Extra arXiv:2501.17047 [cs]
    DOI 10.48550/arXiv.2501.17047
    Repository arXiv
    Archive ID arXiv:2501.17047
    Date Added 1/29/2025, 4:28:48 PM
    Modified 1/29/2025, 4:28:48 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

    Item Type Preprint
    Author Sebastian Farquhar
    Author Vikrant Varma
    Author David Lindner
    Author David Elson
    Author Caleb Biddulph
    Author Ian Goodfellow
    Author Rohin Shah
    Abstract Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.
    Date 2025-01-22
    Short Title MONA
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.13011
    Accessed 1/29/2025, 11:53:58 AM
    Extra arXiv:2501.13011 [cs]
    DOI 10.48550/arXiv.2501.13011
    Repository arXiv
    Archive ID arXiv:2501.13011
    Date Added 1/29/2025, 11:53:58 AM
    Modified 1/29/2025, 11:53:58 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Item Type Preprint
    Author DeepSeek-AI
    Author Daya Guo
    Author Dejian Yang
    Author Haowei Zhang
    Author Junxiao Song
    Author Ruoyu Zhang
    Author Runxin Xu
    Author Qihao Zhu
    Author Shirong Ma
    Author Peiyi Wang
    Author Xiao Bi
    Author Xiaokang Zhang
    Author Xingkai Yu
    Author Yu Wu
    Author Z. F. Wu
    Author Zhibin Gou
    Author Zhihong Shao
    Author Zhuoshu Li
    Author Ziyi Gao
    Author Aixin Liu
    Author Bing Xue
    Author Bingxuan Wang
    Author Bochao Wu
    Author Bei Feng
    Author Chengda Lu
    Author Chenggang Zhao
    Author Chengqi Deng
    Author Chenyu Zhang
    Author Chong Ruan
    Author Damai Dai
    Author Deli Chen
    Author Dongjie Ji
    Author Erhang Li
    Author Fangyun Lin
    Author Fucong Dai
    Author Fuli Luo
    Author Guangbo Hao
    Author Guanting Chen
    Author Guowei Li
    Author H. Zhang
    Author Han Bao
    Author Hanwei Xu
    Author Haocheng Wang
    Author Honghui Ding
    Author Huajian Xin
    Author Huazuo Gao
    Author Hui Qu
    Author Hui Li
    Author Jianzhong Guo
    Author Jiashi Li
    Author Jiawei Wang
    Author Jingchang Chen
    Author Jingyang Yuan
    Author Junjie Qiu
    Author Junlong Li
    Author J. L. Cai
    Author Jiaqi Ni
    Author Jian Liang
    Author Jin Chen
    Author Kai Dong
    Author Kai Hu
    Author Kaige Gao
    Author Kang Guan
    Author Kexin Huang
    Author Kuai Yu
    Author Lean Wang
    Author Lecong Zhang
    Author Liang Zhao
    Author Litong Wang
    Author Liyue Zhang
    Author Lei Xu
    Author Leyi Xia
    Author Mingchuan Zhang
    Author Minghua Zhang
    Author Minghui Tang
    Author Meng Li
    Author Miaojun Wang
    Author Mingming Li
    Author Ning Tian
    Author Panpan Huang
    Author Peng Zhang
    Author Qiancheng Wang
    Author Qinyu Chen
    Author Qiushi Du
    Author Ruiqi Ge
    Author Ruisong Zhang
    Author Ruizhe Pan
    Author Runji Wang
    Author R. J. Chen
    Author R. L. Jin
    Author Ruyi Chen
    Author Shanghao Lu
    Author Shangyan Zhou
    Author Shanhuang Chen
    Author Shengfeng Ye
    Author Shiyu Wang
    Author Shuiping Yu
    Author Shunfeng Zhou
    Author Shuting Pan
    Author S. S. Li
    Author Shuang Zhou
    Author Shaoqing Wu
    Author Shengfeng Ye
    Author Tao Yun
    Author Tian Pei
    Author Tianyu Sun
    Author T. Wang
    Author Wangding Zeng
    Author Wanjia Zhao
    Author Wen Liu
    Author Wenfeng Liang
    Author Wenjun Gao
    Author Wenqin Yu
    Author Wentao Zhang
    Author W. L. Xiao
    Author Wei An
    Author Xiaodong Liu
    Author Xiaohan Wang
    Author Xiaokang Chen
    Author Xiaotao Nie
    Author Xin Cheng
    Author Xin Liu
    Author Xin Xie
    Author Xingchao Liu
    Author Xinyu Yang
    Author Xinyuan Li
    Author Xuecheng Su
    Author Xuheng Lin
    Author X. Q. Li
    Author Xiangyue Jin
    Author Xiaojin Shen
    Author Xiaosha Chen
    Author Xiaowen Sun
    Author Xiaoxiang Wang
    Author Xinnan Song
    Author Xinyi Zhou
    Author Xianzu Wang
    Author Xinxia Shan
    Author Y. K. Li
    Author Y. Q. Wang
    Author Y. X. Wei
    Author Yang Zhang
    Author Yanhong Xu
    Author Yao Li
    Author Yao Zhao
    Author Yaofeng Sun
    Author Yaohui Wang
    Author Yi Yu
    Author Yichao Zhang
    Author Yifan Shi
    Author Yiliang Xiong
    Author Ying He
    Author Yishi Piao
    Author Yisong Wang
    Author Yixuan Tan
    Author Yiyang Ma
    Author Yiyuan Liu
    Author Yongqiang Guo
    Author Yuan Ou
    Author Yuduan Wang
    Author Yue Gong
    Author Yuheng Zou
    Author Yujia He
    Author Yunfan Xiong
    Author Yuxiang Luo
    Author Yuxiang You
    Author Yuxuan Liu
    Author Yuyang Zhou
    Author Y. X. Zhu
    Author Yanhong Xu
    Author Yanping Huang
    Author Yaohui Li
    Author Yi Zheng
    Author Yuchen Zhu
    Author Yunxian Ma
    Author Ying Tang
    Author Yukun Zha
    Author Yuting Yan
    Author Z. Z. Ren
    Author Zehui Ren
    Author Zhangli Sha
    Author Zhe Fu
    Author Zhean Xu
    Author Zhenda Xie
    Author Zhengyan Zhang
    Author Zhewen Hao
    Author Zhicheng Ma
    Author Zhigang Yan
    Author Zhiyu Wu
    Author Zihui Gu
    Author Zijia Zhu
    Author Zijun Liu
    Author Zilin Li
    Author Ziwei Xie
    Author Ziyang Song
    Author Zizheng Pan
    Author Zhen Huang
    Author Zhipeng Xu
    Author Zhongyu Zhang
    Author Zhen Zhang
    Abstract We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
    Date 2025-01-22
    Short Title DeepSeek-R1
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.12948
    Accessed 1/29/2025, 11:33:45 AM
    Extra arXiv:2501.12948 [cs]
    DOI 10.48550/arXiv.2501.12948
    Repository arXiv
    Archive ID arXiv:2501.12948
    Date Added 1/29/2025, 11:33:45 AM
    Modified 1/29/2025, 11:33:47 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • The AI Agent Index

    Item Type Preprint
    Author Stephen Casper
    Author Luke Bailey
    Author Rosco Hunter
    Author Carson Ezell
    Author Emma Cabalé
    Author Michael Gerovitch
    Author Stewart Slocum
    Author Kevin Wei
    Author Nikola Jurkovic
    Author Ariba Khan
    Author Phillip J. K. Christoffersen
    Author A. Pinar Ozisik
    Author Rakshit Trivedi
    Author Dylan Hadfield-Menell
    Author Noam Kolt
    Abstract Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/
    Date 2025-02-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.01635
    Accessed 2/6/2025, 9:09:03 AM
    Extra arXiv:2502.01635 [cs]
    DOI 10.48550/arXiv.2502.01635
    Repository arXiv
    Archive ID arXiv:2502.01635
    Date Added 2/6/2025, 9:09:03 AM
    Modified 2/6/2025, 9:09:06 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Software Engineering

    Notes:

    • Comment: Accompanying website: https://aiagentindex.mit.edu/

    Attachments

    • Preprint PDF
    • Snapshot
  • Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

    Item Type Preprint
    Author Jannik Brinkmann
    Author Chris Wendler
    Author Christian Bartelt
    Author Aaron Mueller
    Abstract Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this demonstrates both the generality and selectivity of these feature's roles in the network. Our findings suggest that even models trained predominantly on English data can develop robust, cross-lingual abstractions of morphosyntactic concepts.
    Date 2025-01-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.06346
    Accessed 1/31/2025, 1:10:14 PM
    Extra arXiv:2501.06346 [cs]
    DOI 10.48550/arXiv.2501.06346
    Repository arXiv
    Archive ID arXiv:2501.06346
    Date Added 1/31/2025, 1:10:14 PM
    Modified 1/31/2025, 1:10:14 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
  • Tell me about yourself: LLMs are aware of their learned behaviors

    Item Type Preprint
    Author Jan Betley
    Author Xuchan Bao
    Author Martín Soto
    Author Anna Sztyber-Betley
    Author James Chua
    Author Owain Evans
    Abstract We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
    Date 2025-01-19
    Short Title Tell me about yourself
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.11120
    Accessed 1/29/2025, 12:12:22 PM
    Extra arXiv:2501.11120 [cs]
    DOI 10.48550/arXiv.2501.11120
    Repository arXiv
    Archive ID arXiv:2501.11120
    Date Added 1/29/2025, 12:12:22 PM
    Modified 1/29/2025, 12:12:22 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: Submitted to ICLR 2025. 17 pages, 13 figures

    Attachments

    • Full Text PDF
    • Snapshot
  • International AI safety report

    Item Type Journal Article
    Author Y Bengio
    Language en
    Library Catalog Zotero
    Date Added 1/29/2025, 11:19:42 AM
    Modified 1/29/2025, 11:19:42 AM

    Attachments

    • PDF
  • Dario Amodei — On DeepSeek and Export Controls

    Item Type Web Page
    Author Dario Amodei
    Abstract On DeepSeek and Export Controls
    Date 2025-01-28
    Language en
    URL https://darioamodei.com/on-deepseek-and-export-controls.html
    Accessed 1/29/2025, 12:22:12 PM
    Date Added 1/29/2025, 12:22:12 PM
    Modified 1/29/2025, 12:22:12 PM

    Attachments

    • Snapshot