Zotero Report

Multi-agent Architecture Search via Agentic Supernet

Item Type	Preprint
Author	Guibin Zhang
Author	Luyang Niu
Author	Junfeng Fang
Author	Kun Wang
Author	Lei Bai
Author	Xiang Wang
Abstract	Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the \textbf{agentic supernet}, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (\textit{e.g.}, LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS \textbf{(I)} requires only $6\sim45\%$ of the inference costs of existing handcrafted or automated multi-agent systems, \textbf{(II)} surpasses them by $0.54\%\sim11.82\%$, and \textbf{(III)} enjoys superior cross-dataset and cross-LLM-backbone transferability.
Date	2025-02-06
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.04180
Accessed	2/13/2025, 11:08:07 AM
Extra	arXiv:2502.04180 [cs]
DOI	10.48550/arXiv.2502.04180
Repository	arXiv
Archive ID	arXiv:2502.04180
Date Added	2/13/2025, 11:08:07 AM
Modified	2/13/2025, 11:08:07 AM

Tags:

Computer Science - Computation and Language
Computer Science - Machine Learning
Computer Science - Multiagent Systems

Attachments

Preprint PDF
Snapshot

SPRI: Aligning Large Language Models with Context-Situated Principles

Item Type	Preprint
Author	Hongli Zhan
Author	Muneeza Azmat
Author	Raya Horesh
Author	Junyi Jessy Li
Author	Mikhail Yurochkin
Abstract	Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.
Date	2025-02-05
Short Title	SPRI
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.03397
Accessed	2/13/2025, 11:15:29 AM
Extra	arXiv:2502.03397 [cs]
DOI	10.48550/arXiv.2502.03397
Repository	arXiv
Archive ID	arXiv:2502.03397
Date Added	2/13/2025, 11:15:29 AM
Modified	2/13/2025, 11:15:29 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence

Attachments

Full Text PDF
Snapshot

Generating Symbolic World Models via Test-time Scaling of Large Language Models

Item Type	Preprint
Author	Zhouliang Yu
Author	Yuhuan Yuan
Author	Tim Z. Xiao
Author	Fuxiang Frank Xia
Author	Jie Fu
Author	Ge Zhang
Author	Ge Lin
Author	Weiyang Liu
Abstract	Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality-a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic searching algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domain, achieving over 50% success rate on two tasks (i.e., generating PDDL domains from natural language description or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks.
Date	2025-02-07
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.04728
Accessed	2/13/2025, 11:30:39 AM
Extra	arXiv:2502.04728 [cs]
DOI	10.48550/arXiv.2502.04728
Repository	arXiv
Archive ID	arXiv:2502.04728
Date Added	2/13/2025, 11:30:39 AM
Modified	2/13/2025, 11:30:39 AM

Tags:

Computer Science - Artificial Intelligence

Notes:

Comment: Technical Report v1 (32 pages, 6 figures)

Attachments

Preprint PDF
Snapshot

Are Transformers Able to Reason by Connecting Separated Knowledge in Training Data?

Item Type	Preprint
Author	Yutong Yin
Author	Zhaoran Wang
Abstract	Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns ( B = f(A) ) from one source and ( C = g(B) ) from another, they can deduce ( C=g(B)=g(f(A)) ) even without encountering ( ABC ) together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with the model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing.
Date	2025-01-27
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.15857
Accessed	2/3/2025, 10:32:14 AM
Extra	arXiv:2501.15857 [cs]
DOI	10.48550/arXiv.2501.15857
Repository	arXiv
Archive ID	arXiv:2501.15857
Date Added	2/3/2025, 10:32:14 AM
Modified	2/3/2025, 10:32:14 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: It is accepted by The Thirteenth International Conference on Learning Representations and will be published soon. The submission number is 2678

Attachments

Preprint PDF
Snapshot

LIMO: Less is More for Reasoning

Item Type	Preprint
Author	Yixin Ye
Author	Zhen Huang
Author	Yang Xiao
Author	Ethan Chern
Author	Shijie Xia
Author	Pengfei Liu
Abstract	We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO.
Date	2025-02-05
Short Title	LIMO
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.03387
Accessed	2/7/2025, 1:30:16 PM
Extra	arXiv:2502.03387 [cs]
DOI	10.48550/arXiv.2502.03387
Repository	arXiv
Archive ID	arXiv:2502.03387
Date Added	2/7/2025, 1:30:16 PM
Modified	2/7/2025, 1:30:19 PM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence

Notes:

Comment: 17 pages

Attachments

Preprint PDF
Snapshot

ENIGMAEVAL: A Benchmark of Long Multimodal

Item Type	Journal Article
Author	Clinton J Wang
Author	Dean Lee
Author	Cristina Menghini
Author	Johannes Mols
Author	Jack Doughty
Author	Jayson Lynch
Author	Sean Hendryx
Author	Summer Yue
Author	Dan Hendrycks
Abstract	As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce ENIGMAEVAL, a dataset of problems and solutions derived from puzzle competitions and events that probes models’ ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity – each typically requiring teams of skilled solvers hours to days to complete – with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity’s Last Exam, unveiling models’ shortcomings when challenged with problems requiring unstructured and lateral reasoning.
Language	en
Library Catalog	Zotero
Date Added	2/14/2025, 2:54:20 PM
Modified	2/14/2025, 2:54:20 PM

Attachments

PDF

LLM Pretraining with Continuous Concepts

Item Type	Preprint
Author	Jihoon Tack
Author	Jack Lanchantin
Author	Jane Yu
Author	Andrew Cohen
Author	Ilia Kulikov
Author	Janice Lan
Author	Shibo Hao
Author	Yuandong Tian
Author	Jason Weston
Author	Xian Li
Abstract	Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process.
Date	2025-02-12
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.08524
Accessed	2/13/2025, 11:44:13 AM
Extra	arXiv:2502.08524 [cs]
DOI	10.48550/arXiv.2502.08524
Repository	arXiv
Archive ID	arXiv:2502.08524
Date Added	2/13/2025, 11:44:13 AM
Modified	2/13/2025, 11:44:13 AM

Tags:

Computer Science - Computation and Language
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Authenticated Delegation and Authorized AI Agents

Item Type	Preprint
Author	Tobin South
Author	Samuele Marro
Author	Thomas Hardjono
Author	Robert Mahari
Author	Cedric Deslandes Whitney
Author	Dazza Greenwood
Author	Alan Chan
Author	Alex Pentland
Abstract	The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to know whom AI agents act on behalf of and guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, where human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata, maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring agentic AI systems perform only appropriate actions and providing a tool for digital service providers to enable AI agent interactions without risking harm from scalable interaction.
Date	2025-01-16
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.09674
Accessed	1/29/2025, 11:26:00 AM
Extra	arXiv:2501.09674 [cs] version: 1
DOI	10.48550/arXiv.2501.09674
Repository	arXiv
Archive ID	arXiv:2501.09674
Date Added	1/29/2025, 11:26:00 AM
Modified	1/29/2025, 11:26:00 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Networking and Internet Architecture

Attachments

Preprint PDF
Snapshot

On the Feasibility of Using LLMs to Execute Multistage Network Attacks

Item Type	Preprint
Author	Brian Singer
Author	Keane Lucas
Author	Lakshmi Adiga
Author	Meghna Jain
Author	Lujo Bauer
Author	Vyas Sekar
Abstract	LLMs have shown preliminary promise in some security tasks and CTF challenges. However, it is unclear whether LLMs are able to realize multistage network attacks, which involve executing a wide variety of actions across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities to gain initial access, leveraging internal hosts to move laterally, and using multiple compromised hosts to exfiltrate data. We evaluate LLMs across 10 multistage networks and find that popular LLMs are unable to realize these attacks. To enable LLMs to realize these attacks, we introduce Incalmo, an LLM-agnostic high-level attack abstraction layer that sits between an LLM and the environment. Rather than LLMs issuing low-level command-line instructions, which can lead to incorrect implementations, Incalmo allows LLMs to specify high-level tasks (e.g., infect a host, scan a network), which are then carried out by Incalmo. Incalmo realizes these tasks by translating them into low-level primitives (e.g., commands to exploit tools). Incalmo also provides an environment state service and an attack graph service to provide structure to LLMs in selecting actions relevant to a multistage attack. Across 9 out of 10 realistic emulated networks (from 25 to 50 hosts), LLMs using Incalmo can successfully autonomously execute multistage attacks. We also conduct an ablation analysis to show the key role the high-level abstractions play. For instance, we find that both Incalmo's high-level tasks and services are crucial. Furthermore, even smaller-parameter LLMs with Incalmo can fully succeed in 5 of 10 environments, while larger-parameter LLMs without Incalmo do not fully succeed in any.
Date	2025-01-27
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.16466
Accessed	1/29/2025, 4:33:58 PM
Extra	arXiv:2501.16466 [cs]
DOI	10.48550/arXiv.2501.16466
Repository	arXiv
Archive ID	arXiv:2501.16466
Date Added	1/29/2025, 4:33:58 PM
Modified	1/29/2025, 4:34:01 PM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Cryptography and Security

Notes:

Comment: 16 pages, 14 figures

Attachments

Full Text PDF
Snapshot

Open Problems in Mechanistic Interpretability

Item Type	Preprint
Author	Lee Sharkey
Author	Bilal Chughtai
Author	Joshua Batson
Author	Jack Lindsey
Author	Jeff Wu
Author	Lucius Bushnaq
Author	Nicholas Goldowsky-Dill
Author	Stefan Heimersheim
Author	Alejandro Ortega
Author	Joseph Bloom
Author	Stella Biderman
Author	Adria Garriga-Alonso
Author	Arthur Conmy
Author	Neel Nanda
Author	Jessica Rumbelow
Author	Martin Wattenberg
Author	Nandi Schoots
Author	Joseph Miller
Author	Eric J. Michaud
Author	Stephen Casper
Author	Max Tegmark
Author	William Saunders
Author	David Bau
Author	Eric Todd
Author	Atticus Geiger
Author	Mor Geva
Author	Jesse Hoogland
Author	Daniel Murfet
Author	Tom McGrath
Abstract	Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing.
Date	2025-01-27
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.16496
Accessed	1/29/2025, 11:18:54 AM
Extra	arXiv:2501.16496 [cs]
DOI	10.48550/arXiv.2501.16496
Repository	arXiv
Archive ID	arXiv:2501.16496
Date Added	1/29/2025, 11:18:54 AM
Modified	1/29/2025, 11:18:54 AM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Item Type	Preprint
Author	Swarnadeep Saha
Author	Xian Li
Author	Marjan Ghazvininejad
Author	Jason Weston
Author	Tianlu Wang
Abstract	LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-bystep reasoning process that underlies the final evaluation of a response. However, due to the lack of human annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.
Date	2025-01-30
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.18099
Accessed	2/1/2025, 3:50:32 PM
Extra	arXiv:2501.18099 [cs]
DOI	10.48550/arXiv.2501.18099
Repository	arXiv
Archive ID	arXiv:2501.18099
Date Added	2/1/2025, 3:50:32 PM
Modified	2/1/2025, 3:50:32 PM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence

Attachments

Preprint PDF
Snapshot

AI Agents for Computer Use: A Review of Instruction-based Computer Control, GUI Automation, and Operator Assistants

Item Type	Preprint
Author	Pascal J. Sager
Author	Benjamin Meyer
Author	Peng Yan
Author	Rebekka von Wartburg-Kottler
Author	Layan Etaiwi
Author	Aref Enayati
Author	Gabriel Nobel
Author	Ahmed Abdulkadir
Author	Benjamin F. Grewe
Author	Thilo Stadelmann
Abstract	Instruction-based computer control agents (CCAs) execute complex action sequences on personal computers or mobile devices to fulfill tasks using the same graphical user interfaces as a human user would, provided instructions in natural language. This review offers a comprehensive overview of the emerging field of instruction-based computer control, examining available agents -- their taxonomy, development, and respective resources -- and emphasizing the shift from manually designed, specialized agents to leveraging foundation models such as large language models (LLMs) and vision-language models (VLMs). We formalize the problem and establish a taxonomy of the field to analyze agents from three perspectives: (a) the environment perspective, analyzing computer environments; (b) the interaction perspective, describing observations spaces (e.g., screenshots, HTML) and action spaces (e.g., mouse and keyboard actions, executable code); and (c) the agent perspective, focusing on the core principle of how an agent acts and learns to act. Our framework encompasses both specialized and foundation agents, facilitating their comparative analysis and revealing how prior solutions in specialized agents, such as an environment learning step, can guide the development of more capable foundation agents. Additionally, we review current CCA datasets and CCA evaluation methods and outline the challenges to deploying such agents in a productive setting. In total, we review and classify 86 CCAs and 33 related datasets. By highlighting trends, limitations, and future research directions, this work presents a comprehensive foundation to obtain a broad understanding of the field and push its future development.
Date	2025-01-27
Short Title	AI Agents for Computer Use
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.16150
Accessed	2/1/2025, 3:50:49 PM
Extra	arXiv:2501.16150 [cs]
DOI	10.48550/arXiv.2501.16150
Repository	arXiv
Archive ID	arXiv:2501.16150
Date Added	2/1/2025, 3:50:49 PM
Modified	2/1/2025, 3:50:49 PM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Human-Computer Interaction
Computer Science - Systems and Control
Electrical Engineering and Systems Science - Systems and Control

Attachments

Preprint PDF
Snapshot

Human Decision-making is Susceptible to AI-driven Manipulation

Item Type	Preprint
Author	Sahand Sabour
Author	June M. Liu
Author	Siyang Liu
Author	Chris Z. Yao
Author	Shiyao Cui
Author	Xuanming Zhang
Author	Wen Zhang
Author	Yaru Cao
Author	Advait Bhat
Author	Jian Guan
Author	Wei Wu
Author	Rada Mihalcea
Author	Tim Althoff
Author	Tatia M. C. Lee
Author	Minlie Huang
Abstract	Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy.
Date	2025-02-11
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.07663
Accessed	2/14/2025, 3:02:32 PM
Extra	arXiv:2502.07663 [cs]
DOI	10.48550/arXiv.2502.07663
Repository	arXiv
Archive ID	arXiv:2502.07663
Date Added	2/14/2025, 3:02:32 PM
Modified	2/14/2025, 3:02:32 PM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Human-Computer Interaction

Notes:

Comment: Work in progress. Code and data will be made available via https://github.com/Sahandfer/Manipulation-Susceptibility

Attachments

Preprint PDF
Snapshot

Introducing our short course on AGI safety

Item Type	Web Page
Author	DeepMind Safety Research
Abstract	We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course…
Date	2025-02-14T15:05:57.040Z
Language	en
URL	https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c
Accessed	2/14/2025, 3:02:07 PM
Website Title	Medium
Date Added	2/14/2025, 3:02:07 PM
Modified	2/14/2025, 3:02:07 PM

Attachments

Snapshot

Noteworthy LLM Research Papers of 2024

Item Type	Web Page
Author	Sebastian Raschka
Abstract	This article covers 12 influential AI research papers of 2024, ranging from mixture-of-experts models to new LLM scaling laws for precision..
Date	06:03:00 +0000
Language	en
URL	https://sebastianraschka.com/blog/2025/llm-research-2024.html
Accessed	1/29/2025, 11:36:46 AM
Website Title	Sebastian Raschka, PhD
Date Added	1/29/2025, 11:36:46 AM
Modified	1/29/2025, 11:36:46 AM

Attachments

Snapshot

Social Norms in Cinema: A Cross-Cultural Analysis of Shame, Pride and Prejudice

Item Type	Preprint
Author	Sunny Rai
Author	Khushang Jilesh Zaveri
Author	Shreya Havaldar
Author	Soumna Nema
Author	Lyle Ungar
Author	Sharath Chandra Guntuku
Abstract	Shame and pride are social emotions expressed across cultures to motivate and regulate people's thoughts, feelings, and behaviors. In this paper, we introduce the first cross-cultural dataset of over 10k shame/pride-related expressions, with underlying social expectations from ~5.4K Bollywood and Hollywood movies. We examine how and why shame and pride are expressed across cultures using a blend of psychology-informed language analysis combined with large language models. We find significant cross-cultural differences in shame and pride expression aligning with known cultural tendencies of the USA and India -- e.g., in Hollywood, shame-expressions predominantly discuss self whereas Bollywood discusses shame toward others. Pride in Hollywood is individualistic with more self-referential singular pronouns such as I and my whereas in Bollywood, pride is collective with higher use of self-referential plural pronouns such as we and our. Lastly, women are more sanctioned across cultures and for violating similar social expectations e.g. promiscuity.
Date	2024-10-15
Short Title	Social Norms in Cinema
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2402.11333
Accessed	2/1/2025, 3:50:15 PM
Extra	arXiv:2402.11333 [cs]
DOI	10.48550/arXiv.2402.11333
Repository	arXiv
Archive ID	arXiv:2402.11333
Date Added	2/1/2025, 3:50:15 PM
Modified	2/1/2025, 3:50:15 PM

Tags:

Computer Science - Computers and Society

Attachments

Preprint PDF
Snapshot

Transcoders Beat Sparse Autoencoders for Interpretability

Item Type	Preprint
Author	Gonçalo Paulo
Author	Stepan Shabalin
Author	Nora Belrose
Abstract	Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability.
Date	2025-02-12
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.18823
Accessed	2/13/2025, 11:02:53 AM
Extra	arXiv:2501.18823 [cs]
DOI	10.48550/arXiv.2501.18823
Repository	arXiv
Archive ID	arXiv:2501.18823
Date Added	2/13/2025, 11:02:53 AM
Modified	2/13/2025, 11:02:56 AM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Sparse Autoencoders Trained on the Same Data Learn Different Features

Item Type	Preprint
Author	Gonçalo Paulo
Author	Nora Belrose
Abstract	Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features "truly used" by the model.
Date	2025-01-29
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.16615
Accessed	2/3/2025, 10:27:43 AM
Extra	arXiv:2501.16615 [cs]
DOI	10.48550/arXiv.2501.16615
Repository	arXiv
Archive ID	arXiv:2501.16615
Date Added	2/3/2025, 10:27:43 AM
Modified	2/3/2025, 10:27:43 AM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

What fully automated firms will look like

Item Type	Web Page
Author	Dwarkesh Patel
Abstract	Everyone is sleeping on the collective advantages AIs will have, which have nothing to do with raw IQ - they can be copied, distilled, merged, scaled, and evolved in ways humans simply can't.
Date	2024-12-27
Language	en
URL	https://www.dwarkeshpatel.com/p/ai-firm
Accessed	2/1/2025, 3:26:33 PM
Date Added	2/1/2025, 3:26:33 PM
Modified	2/1/2025, 3:26:33 PM

Attachments

Snapshot

Future You: A Conversation with an AI-Generated Future Self Reduces Anxiety, Negative Emotions, and Increases Future Self-Continuity

Item Type	Preprint
Author	Pat Pataranutaporn
Author	Kavin Winson
Author	Peggy Yin
Author	Auttasak Lapapirojn
Author	Pichayoot Ouppaphan
Author	Monchai Lertsutthiwong
Author	Pattie Maes
Author	Hal Hershfield
Abstract	We introduce "Future You," an interactive, brief, single-session, digital chat intervention designed to improve future self-continuity--the degree of connection an individual feels with a temporally distant future self--a characteristic that is positively related to mental health and wellbeing. Our system allows users to chat with a relatable yet AI-powered virtual version of their future selves that is tuned to their future goals and personal qualities. To make the conversation realistic, the system generates a "synthetic memory"--a unique backstory for each user--that creates a throughline between the user's present age (between 18-30) and their life at age 60. The "Future You" character also adopts the persona of an age-progressed image of the user's present self. After a brief interaction with the "Future You" character, users reported decreased anxiety, and increased future self-continuity. This is the first study successfully demonstrating the use of personalized AI-generated characters to improve users' future self-continuity and wellbeing.
Date	2024-10-01
Short Title	Future You
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2405.12514
Accessed	2/3/2025, 10:27:58 AM
Extra	arXiv:2405.12514 [cs]
DOI	10.48550/arXiv.2405.12514
Repository	arXiv
Archive ID	arXiv:2405.12514
Date Added	2/3/2025, 10:27:58 AM
Modified	2/3/2025, 10:27:58 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Human-Computer Interaction

Attachments

Full Text PDF
Snapshot

Competitive Programming with Large Reasoning Models

Item Type	Preprint
Author	OpenAI
Author	Ahmed El-Kishky
Author	Alexander Wei
Author	Andre Saraiva
Author	Borys Minaev
Author	Daniel Selsam
Author	David Dohan
Author	Francis Song
Author	Hunter Lightman
Author	Ignasi Clavera
Author	Jakub Pachocki
Author	Jerry Tworek
Author	Lorenz Kuhn
Author	Lukasz Kaiser
Author	Mark Chen
Author	Max Schwarzer
Author	Mostafa Rohaninejad
Author	Nat McAleese
Author	o3 contributors
Author	Oleg Mürk
Author	Rhythm Garg
Author	Rui Shu
Author	Szymon Sidor
Author	Vineet Kosaraju
Author	Wenda Zhou
Abstract	We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming.
Date	2025-02-03
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.06807
Accessed	2/13/2025, 11:37:43 AM
Extra	arXiv:2502.06807 [cs]
DOI	10.48550/arXiv.2502.06807
Repository	arXiv
Archive ID	arXiv:2502.06807
Date Added	2/13/2025, 11:37:43 AM
Modified	2/13/2025, 11:37:43 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

One fish, two fish, but not the whole sea: Alignment reduces language models' conceptual diversity

Item Type	Preprint
Author	Sonia K. Murthy
Author	Tomer Ullman
Author	Jennifer Hu
Abstract	Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models' internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM "populations" by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models' value alignment and decreasing the diversity of their conceptual representations.
Date	2024-11-12
Short Title	One fish, two fish, but not the whole sea
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2411.04427
Accessed	2/13/2025, 11:34:32 AM
Extra	arXiv:2411.04427 [cs]
DOI	10.48550/arXiv.2411.04427
Repository	arXiv
Archive ID	arXiv:2411.04427
Date Added	2/13/2025, 11:34:32 AM
Modified	2/13/2025, 11:34:32 AM

Tags:

Computer Science - Computation and Language

Notes:

Comment: 17 pages, 10 figures; corrected figure version

Attachments

Preprint PDF
Snapshot

s1: Simple test-time scaling

Item Type	Preprint
Author	Niklas Muennighoff
Author	Zitong Yang
Author	Weijia Shi
Author	Xiang Lisa Li
Author	Li Fei-Fei
Author	Hannaneh Hajishirzi
Author	Luke Zettlemoyer
Author	Percy Liang
Author	Emmanuel Candès
Author	Tatsunori Hashimoto
Abstract	Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1.
Date	2025-01-31
Short Title	s1
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.19393
Accessed	2/3/2025, 10:32:30 AM
Extra	arXiv:2501.19393 [cs]
DOI	10.48550/arXiv.2501.19393
Repository	arXiv
Archive ID	arXiv:2501.19393
Date Added	2/3/2025, 10:32:30 AM
Modified	2/3/2025, 10:32:30 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: 46 pages (9 main), 10 figures, 14 tables

Attachments

Preprint PDF
Snapshot

s1: Simple test-time scaling

Item Type	Preprint
Author	Niklas Muennighoff
Author	Zitong Yang
Author	Weijia Shi
Author	Xiang Lisa Li
Author	Li Fei-Fei
Author	Hannaneh Hajishirzi
Author	Luke Zettlemoyer
Author	Percy Liang
Author	Emmanuel Candès
Author	Tatsunori Hashimoto
Abstract	Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1
Date	2025-02-03
Short Title	s1
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.19393
Accessed	2/13/2025, 11:04:27 AM
Extra	arXiv:2501.19393 [cs]
DOI	10.48550/arXiv.2501.19393
Repository	arXiv
Archive ID	arXiv:2501.19393
Date Added	2/13/2025, 11:04:27 AM
Modified	2/13/2025, 11:04:27 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: 45 pages (9 main), 10 figures, 14 tables

Attachments

Full Text PDF
Snapshot

NoLiMa: Long-Context Evaluation Beyond Literal Matching

Item Type	Preprint
Author	Ali Modarressi
Author	Hanieh Deilamsalehy
Author	Franck Dernoncourt
Author	Trung Bui
Author	Ryan A. Rossi
Author	Seunghyun Yoon
Author	Hinrich Schütze
Abstract	Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
Date	2025-02-07
Short Title	NoLiMa
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.05167
Accessed	2/13/2025, 11:33:22 AM
Extra	arXiv:2502.05167 [cs]
DOI	10.48550/arXiv.2502.05167
Repository	arXiv
Archive ID	arXiv:2502.05167
Date Added	2/13/2025, 11:33:22 AM
Modified	2/13/2025, 11:33:22 AM

Tags:

Computer Science - Computation and Language

Attachments

Full Text PDF
Snapshot

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs

Item Type	Preprint
Author	Mantas Mazeika
Author	Xuwang Yin
Author	Rishub Tamirisa
Author	Jaehyuk Lim
Author	Bruce W. Lee
Author	Richard Ren
Author	Long Phan
Author	Norman Mu
Author	Adam Khoja
Author	Oliver Zhang
Author	Dan Hendrycks
Abstract	As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.
Date	2025-02-12
Short Title	Utility Engineering
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.08640
Accessed	2/13/2025, 11:35:58 AM
Extra	arXiv:2502.08640 [cs]
DOI	10.48550/arXiv.2502.08640
Repository	arXiv
Archive ID	arXiv:2502.08640
Date Added	2/13/2025, 11:35:58 AM
Modified	2/13/2025, 11:35:58 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Machine Learning
Computer Science - Computer Vision and Pattern Recognition

Attachments

Preprint PDF
Snapshot

smooth operator

Item Type	Web Page
Author	Lovable
Abstract	Lovable Generated Project
Language	en
URL	https://smooth-operator.online/
Accessed	1/29/2025, 11:44:59 AM
Date Added	1/29/2025, 11:44:59 AM
Modified	1/29/2025, 11:44:59 AM

Attachments

Snapshot

OmniHuman-1: Rethinking the Scaling-Up of One-Stage Conditioned Human Animation Models

Item Type	Preprint
Author	Gaojie Lin
Author	Jianwen Jiang
Author	Jiaqi Yang
Author	Zerong Zheng
Author	Chao Liang
Abstract	End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the ttfamily project page (https://omnihuman-lab.github.io)
Date	2025-02-03
Short Title	OmniHuman-1
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.01061
Accessed	2/6/2025, 9:30:17 AM
Extra	arXiv:2502.01061 [cs]
DOI	10.48550/arXiv.2502.01061
Repository	arXiv
Archive ID	arXiv:2502.01061
Date Added	2/6/2025, 9:30:17 AM
Modified	2/6/2025, 9:30:17 AM

Tags:

Computer Science - Computer Vision and Pattern Recognition

Notes:

Comment: https://omnihuman-lab.github.io/

Attachments

Preprint PDF
Snapshot

Eliciting Language Model Behaviors with Investigator Agents

Item Type	Preprint
Author	Xiang Lisa Li
Author	Neil Chowdhury
Author	Daniel D. Johnson
Author	Tatsunori Hashimoto
Author	Percy Liang
Author	Sarah Schwettmann
Author	Jacob Steinhardt
Abstract	Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
Date	2025-02-03
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.01236
Accessed	2/6/2025, 9:47:44 AM
Extra	arXiv:2502.01236 [cs]
DOI	10.48550/arXiv.2502.01236
Repository	arXiv
Archive ID	arXiv:2502.01236
Date Added	2/6/2025, 9:47:44 AM
Modified	2/6/2025, 9:47:44 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: 20 pages, 7 figures

Attachments

Preprint PDF
Snapshot

Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks

Item Type	Preprint
Author	Ang Li
Author	Yin Zhou
Author	Vethavikashini Chithrra Raghuram
Author	Tom Goldstein
Author	Micah Goldblum
Abstract	A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning.
Date	2025-02-12
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.08586
Accessed	2/14/2025, 2:39:38 PM
Extra	arXiv:2502.08586 [cs]
DOI	10.48550/arXiv.2502.08586
Repository	arXiv
Archive ID	arXiv:2502.08586
Date Added	2/14/2025, 2:39:38 PM
Modified	2/14/2025, 2:39:40 PM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

The Emergence of Strategic Reasoning of Large Language Models

Item Type	Preprint
Author	Dongwoo Lee
Author	Gavin Kader
Abstract	As Large Language Models (LLMs) are increasingly used for a variety of complex and critical tasks, it is vital to assess their logical capabilities in strategic environments. This paper examines their ability in strategic reasoning -- the process of choosing an optimal course of action by predicting and adapting to other agents' behavior. Using six LLMs, we analyze responses from play in classical games from behavioral economics (p-Beauty Contest, 11-20 Money Request Game, and Guessing Game) and evaluate their performance through hierarchical models of reasoning (level-$k$ theory and cognitive hierarchy theory). Our findings reveal that while LLMs show understanding of the games, the majority struggle with higher-order strategic reasoning. Although most LLMs did demonstrate learning ability with games involving repeated interactions, they still consistently fall short of the reasoning levels demonstrated by typical behavior from human subjects. The exception to these overall findings is with OpenAI's GPT-o1 -- specifically trained to solve complex reasoning tasks -- which consistently outperforms other LLMs and human subjects. These findings highlight the challenges and pathways in advancing LLMs toward robust strategic reasoning from the perspective of behavioral economics.
Date	2024-12-17
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2412.13013
Accessed	1/29/2025, 11:36:10 AM
Extra	arXiv:2412.13013 [econ]
DOI	10.48550/arXiv.2412.13013
Repository	arXiv
Archive ID	arXiv:2412.13013
Date Added	1/29/2025, 11:36:10 AM
Modified	1/29/2025, 11:36:10 AM

Tags:

Economics - General Economics
Quantitative Finance - Economics

Attachments

Preprint PDF
Snapshot

Sparse Autoencoders Do Not Find Canonical Units of Analysis

Item Type	Preprint
Author	Patrick Leask
Author	Bart Bussmann
Author	Michael Pearce
Author	Joseph Bloom
Author	Curt Tigges
Author	Noura Al Moubayed
Author	Lee Sharkey
Author	Neel Nanda
Abstract	A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a \textit{canonical} set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: \emph{novel latents}, which improve performance when added to the smaller SAE, indicating they capture novel information, and \emph{reconstruction latents}, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g. a latent representing ``Einstein'' decomposes into ``scientist'', ``Germany'', and ``famous person''. Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/
Date	2025-02-07
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.04878
Accessed	2/13/2025, 11:37:21 AM
Extra	arXiv:2502.04878 [cs]
DOI	10.48550/arXiv.2502.04878
Repository	arXiv
Archive ID	arXiv:2502.04878
Date Added	2/13/2025, 11:37:21 AM
Modified	2/13/2025, 11:37:21 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: Accepted to ICLR 2025

Attachments

Preprint PDF
Snapshot

OverThink: Slowdown Attacks on Reasoning LLMs

Item Type	Preprint
Author	Abhinav Kumar
Author	Jaechul Roh
Author	Ali Naseh
Author	Marzena Karpinska
Author	Mohit Iyyer
Author	Amir Houmansadr
Author	Eugene Bagdasarian
Abstract	We increase overhead for applications that rely on reasoning LLMs-we force models to spend an amplified number of reasoning tokens, i.e., "overthink", to respond to the user query while providing contextually correct answers. The adversary performs an OVERTHINK attack by injecting decoy reasoning problems into the public content that is used by the reasoning LLM (e.g., for RAG applications) during inference time. Due to the nature of our decoy problems (e.g., a Markov Decision Process), modified texts do not violate safety guardrails. We evaluated our attack across closed-(OpenAI o1, o1-mini, o3-mini) and open-(DeepSeek R1) weights reasoning models on the FreshQA and SQuAD datasets. Our results show up to 18x slowdown on FreshQA dataset and 46x slowdown on SQuAD dataset. The attack also shows high transferability across models. To protect applications, we discuss and implement defenses leveraging LLM-based and system design approaches. Finally, we discuss societal, financial, and energy impacts of OVERTHINK attack which could amplify the costs for third-party applications operating reasoning models.
Date	2025-02-05
Short Title	OverThink
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.02542
Accessed	2/13/2025, 11:15:12 AM
Extra	arXiv:2502.02542 [cs]
DOI	10.48550/arXiv.2502.02542
Repository	arXiv
Archive ID	arXiv:2502.02542
Date Added	2/13/2025, 11:15:12 AM
Modified	2/13/2025, 11:15:12 AM

Tags:

Computer Science - Machine Learning
Computer Science - Cryptography and Security

Attachments

Preprint PDF
Snapshot

J2J: Jailbreaking to Jailbreak

Item Type	Journal Article
Author	Jeremy Kritz
Author	Vaughn Robinson
Author	Robert Vacareanu
Author	Bijan Varjavand
Author	Michael Choi
Author	Bobby Gogov
Author	Summer Yue
Author	Willow E Primack
Author	Zifan Wang
Abstract	Refusal training on Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J2 attackers, which can systematically evaluate target models using various red teaming strategies and improve its performance via in-context learning from the previous failures. Our experiments demonstrate that Sonnet-3.5 and Gemini-1.5-pro outperform other LLMs as J2, achieving 93.0% and 91.0% attack success rates (ASRs) respectively against GPT-4o (and similar results across other capable LLMs) on Harmbench. Our work not only introduces a scalable approach to strategic red teaming—drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of the safeguard. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent any direct misuse with J2, while advancing research in AI safety, we publicly share our methodology while keeping specific prompting details private.
Language	en
Library Catalog	Zotero
Date Added	2/13/2025, 11:39:53 AM
Modified	2/13/2025, 11:40:07 AM

Attachments

PDF

A sketch of an AI control safety case

Item Type	Preprint
Author	Tomek Korbak
Author	Joshua Clymer
Author	Benjamin Hilton
Author	Buck Shlegeris
Author	Geoffrey Irving
Abstract	As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation,"' where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
Date	2025-01-28
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.17315
Accessed	1/31/2025, 1:09:42 PM
Extra	arXiv:2501.17315 [cs]
DOI	10.48550/arXiv.2501.17315
Repository	arXiv
Archive ID	arXiv:2501.17315
Date Added	1/31/2025, 1:09:42 PM
Modified	1/31/2025, 1:09:44 PM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Cryptography and Security
Computer Science - Software Engineering

Attachments

Preprint PDF
Snapshot

Multi-turn Evaluation of Anthropomorphic Behaviours in Large Language Models

Item Type	Preprint
Author	Lujain Ibrahim
Author	Canfer Akbulut
Author	Rasmi Elasmar
Author	Charvi Rastogi
Author	Minsuk Kahng
Author	Meredith Ringel Morris
Author	Kevin R. McKee
Author	Verena Rieser
Author	Murray Shanahan
Author	Laura Weidinger
Abstract	The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction.
Date	2025-02-10
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.07077
Accessed	2/13/2025, 11:41:44 AM
Extra	arXiv:2502.07077 [cs]
DOI	10.48550/arXiv.2502.07077
Repository	arXiv
Archive ID	arXiv:2502.07077
Date Added	2/13/2025, 11:41:44 AM
Modified	2/13/2025, 11:41:44 AM

Tags:

Computer Science - Computation and Language
Computer Science - Computers and Society
Computer Science - Human-Computer Interaction

Attachments

Preprint PDF
Snapshot

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

Item Type	Preprint
Author	Kaixuan Huang
Author	Jiacheng Guo
Author	Zihao Li
Author	Xiang Ji
Author	Jiawei Ge
Author	Wenzhe Li
Author	Yingqing Guo
Author	Tianle Cai
Author	Hui Yuan
Author	Runzhe Wang
Author	Yue Wu
Author	Ming Yin
Author	Shange Tang
Author	Yangsibo Huang
Author	Chi Jin
Author	Xinyun Chen
Author	Chiyuan Zhang
Author	Mengdi Wang
Abstract	Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks when questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycksmath et. al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models.
Date	2025-02-10
Short Title	MATH-Perturb
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.06453
Accessed	2/13/2025, 11:39:29 AM
Extra	arXiv:2502.06453 [cs]
DOI	10.48550/arXiv.2502.06453
Repository	arXiv
Archive ID	arXiv:2502.06453
Date Added	2/13/2025, 11:39:29 AM
Modified	2/13/2025, 11:39:29 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Teaching Large Language Models to Reason with Reinforcement Learning

Item Type	Preprint
Author	Alex Havrilla
Author	Yuqing Du
Author	Sharath Chandra Raparthy
Author	Christoforos Nalmpantis
Author	Jane Dwivedi-Yu
Author	Maksym Zhuravinskyi
Author	Eric Hambro
Author	Sainbayar Sukhbaatar
Author	Roberta Raileanu
Abstract	Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade off between maj@1 and pass@96 metric performance during SFT training and how conversely RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning.
Date	2024-03-07
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2403.04642
Accessed	1/29/2025, 12:13:52 PM
Extra	arXiv:2403.04642 [cs]
DOI	10.48550/arXiv.2403.04642
Repository	arXiv
Archive ID	arXiv:2403.04642
Date Added	1/29/2025, 12:13:52 PM
Modified	1/29/2025, 12:13:52 PM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Which Economic Tasks are Performed with AI? Evidence from Millions of Claude Conversations

Item Type	Journal Article
Author	Kunal Handa
Author	Alex Tamkin
Author	Miles McCain
Author	Saffron Huang
Author	Esin Durmus
Author	Sarah Heck
Author	Jared Mueller
Author	Jerry Hong
Author	Stuart Ritchie
Author	Tim Belonax
Author	Kevin K Troy
Author	Dario Amodei
Author	Jared Kaplan
Author	Jack Clark
Author	Deep Ganguli
Abstract	Despite widespread speculation about artificial intelligence’s impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system [Tamkin et al., 2024] to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor’s O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with ∼ 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI’s evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance.
Language	en
Library Catalog	Zotero
Date Added	2/13/2025, 11:31:05 AM
Modified	2/13/2025, 11:31:05 AM

Attachments

PDF

Internal Activation as the Polar Star for Steering Unsafe LLM Behavior

Item Type	Preprint
Author	Peixuan Han
Author	Cheng Qian
Author	Xiusi Chen
Author	Yuji Zhang
Author	Denghui Zhang
Author	Heng Ji
Abstract	Large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks but also pose significant risks due to their potential to generate harmful content. Although existing safety mechanisms can improve model safety, they often lead to overly cautious behavior and fail to fully utilize LLMs' internal cognitive processes. Drawing inspiration from cognitive science, where humans rely on reflective reasoning (System 2 thinking) to regulate language and behavior, we empirically demonstrate that LLMs also possess a similar capacity for internal assessment and regulation, which can be actively detected. Building on this insight, we introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility. Compared to traditional safety alignment methods, SafeSwitch delivers more informative and context-aware refusals, demonstrates resilience to unseen queries, and achieves these benefits while only tuning less than 6% of the original parameters. These features make SafeSwitch a promising approach for implementing nuanced safety controls in LLMs.
Date	2025-02-04
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.01042
Accessed	2/6/2025, 9:27:31 AM
Extra	arXiv:2502.01042 [cs]
DOI	10.48550/arXiv.2502.01042
Repository	arXiv
Archive ID	arXiv:2502.01042
Date Added	2/6/2025, 9:27:31 AM
Modified	2/6/2025, 9:27:31 AM

Tags:

Computer Science - Machine Learning

Attachments

Full Text PDF
Snapshot

AI Personality Extraction from Faces: Labor Market Implications

Item Type	Preprint
Author	Marius Guenzel
Author	Shimon Kogan
Author	Marina Niessner
Author	Kelly Shue
Abstract	<div> Human capital---encompassing cognitive skills and personality traits---is critical for labor market success, yet the personality component remains diffic
Date	2025-01-09
Language	en
Short Title	AI Personality Extraction from Faces
Library Catalog	papers.ssrn.com
URL	https://papers.ssrn.com/abstract=5089827
Accessed	1/29/2025, 12:10:37 PM
Place	Rochester, NY
Repository	Social Science Research Network
Genre	SSRN Scholarly Paper
Archive ID	5089827
Date Added	1/29/2025, 12:10:37 PM
Modified	1/29/2025, 12:10:37 PM

Tags:

AI Personality Extraction from Faces: Labor Market Implications
Kelly Shue
Marina Niessner
Marius Guenzel
Shimon Kogan
SSRN

Attachments

Full Text PDF

ALIGNMENT FAKING IN LARGE LANGUAGE MODELS

Item Type	Journal Article
Author	Ryan Greenblatt
Author	Carson Denison
Author	Benjamin Wright
Author	Fabien Roger
Author	Monte MacDiarmid
Author	Sam Marks
Author	Johannes Treutlein
Author	Tim Belonax
Author	Jack Chen
Author	David Duvenaud
Author	Akbir Khan
Author	Julian Michael
Author	Sören Mindermann
Author	Ethan Perez
Author	Linda Petrini
Author	Jonathan Uesato
Author	Jared Kaplan
Author	Buck Shlegeris
Author	Samuel R Bowman
Author	Evan Hubinger
Abstract	We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not.
Language	en
Library Catalog	Zotero
Date Added	2/1/2025, 3:23:04 PM
Modified	2/1/2025, 3:23:07 PM

Attachments

PDF

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Item Type	Preprint
Author	Jonas Geiping
Author	Sean McLeish
Author	Neel Jain
Author	John Kirchenbauer
Author	Siddharth Singh
Author	Brian R. Bartoldson
Author	Bhavya Kailkhura
Author	Abhinav Bhatele
Author	Tom Goldstein
Abstract	We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters.
Date	2025-02-07
Short Title	Scaling up Test-Time Compute with Latent Reasoning
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.05171
Accessed	2/13/2025, 11:30:10 AM
Extra	arXiv:2502.05171 [cs]
DOI	10.48550/arXiv.2502.05171
Repository	arXiv
Archive ID	arXiv:2502.05171
Date Added	2/13/2025, 11:30:10 AM
Modified	2/13/2025, 11:30:14 AM

Tags:

Computer Science - Computation and Language
Computer Science - Machine Learning

Notes:

Comment: The model is available at https://huggingface.co/tomg-group-umd/huginn-0125. Code and data recipe can be found at https://github.com/seal-rg/recurrent-pretraining

Attachments

Preprint PDF
Snapshot

How Linguistics Learned to Stop Worrying and Love the Language Models

Item Type	Preprint
Author	Richard Futrell
Author	Kyle Mahowald
Abstract	Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don't really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments about learning and are informative for major questions in linguistic theory. But they do not replace linguistic structure and theory. We offer an optimistic take on the relationship between language models and linguistics.
Date	2025-01-28
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.17047
Accessed	1/29/2025, 4:28:48 PM
Extra	arXiv:2501.17047 [cs]
DOI	10.48550/arXiv.2501.17047
Repository	arXiv
Archive ID	arXiv:2501.17047
Date Added	1/29/2025, 4:28:48 PM
Modified	1/29/2025, 4:28:48 PM

Tags:

Computer Science - Computation and Language

Attachments

Preprint PDF
Snapshot

MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Item Type	Preprint
Author	Sebastian Farquhar
Author	Vikrant Varma
Author	David Lindner
Author	David Elson
Author	Caleb Biddulph
Author	Ian Goodfellow
Author	Rohin Shah
Abstract	Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes including 2-step environments with LLMs representing delegated oversight and encoded reasoning and longer-horizon gridworld environments representing sensor tampering.
Date	2025-01-22
Short Title	MONA
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.13011
Accessed	1/29/2025, 11:53:58 AM
Extra	arXiv:2501.13011 [cs]
DOI	10.48550/arXiv.2501.13011
Repository	arXiv
Archive ID	arXiv:2501.13011
Date Added	1/29/2025, 11:53:58 AM
Modified	1/29/2025, 11:53:58 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Item Type	Preprint
Author	DeepSeek-AI
Author	Daya Guo
Author	Dejian Yang
Author	Haowei Zhang
Author	Junxiao Song
Author	Ruoyu Zhang
Author	Runxin Xu
Author	Qihao Zhu
Author	Shirong Ma
Author	Peiyi Wang
Author	Xiao Bi
Author	Xiaokang Zhang
Author	Xingkai Yu
Author	Yu Wu
Author	Z. F. Wu
Author	Zhibin Gou
Author	Zhihong Shao
Author	Zhuoshu Li
Author	Ziyi Gao
Author	Aixin Liu
Author	Bing Xue
Author	Bingxuan Wang
Author	Bochao Wu
Author	Bei Feng
Author	Chengda Lu
Author	Chenggang Zhao
Author	Chengqi Deng
Author	Chenyu Zhang
Author	Chong Ruan
Author	Damai Dai
Author	Deli Chen
Author	Dongjie Ji
Author	Erhang Li
Author	Fangyun Lin
Author	Fucong Dai
Author	Fuli Luo
Author	Guangbo Hao
Author	Guanting Chen
Author	Guowei Li
Author	H. Zhang
Author	Han Bao
Author	Hanwei Xu
Author	Haocheng Wang
Author	Honghui Ding
Author	Huajian Xin
Author	Huazuo Gao
Author	Hui Qu
Author	Hui Li
Author	Jianzhong Guo
Author	Jiashi Li
Author	Jiawei Wang
Author	Jingchang Chen
Author	Jingyang Yuan
Author	Junjie Qiu
Author	Junlong Li
Author	J. L. Cai
Author	Jiaqi Ni
Author	Jian Liang
Author	Jin Chen
Author	Kai Dong
Author	Kai Hu
Author	Kaige Gao
Author	Kang Guan
Author	Kexin Huang
Author	Kuai Yu
Author	Lean Wang
Author	Lecong Zhang
Author	Liang Zhao
Author	Litong Wang
Author	Liyue Zhang
Author	Lei Xu
Author	Leyi Xia
Author	Mingchuan Zhang
Author	Minghua Zhang
Author	Minghui Tang
Author	Meng Li
Author	Miaojun Wang
Author	Mingming Li
Author	Ning Tian
Author	Panpan Huang
Author	Peng Zhang
Author	Qiancheng Wang
Author	Qinyu Chen
Author	Qiushi Du
Author	Ruiqi Ge
Author	Ruisong Zhang
Author	Ruizhe Pan
Author	Runji Wang
Author	R. J. Chen
Author	R. L. Jin
Author	Ruyi Chen
Author	Shanghao Lu
Author	Shangyan Zhou
Author	Shanhuang Chen
Author	Shengfeng Ye
Author	Shiyu Wang
Author	Shuiping Yu
Author	Shunfeng Zhou
Author	Shuting Pan
Author	S. S. Li
Author	Shuang Zhou
Author	Shaoqing Wu
Author	Shengfeng Ye
Author	Tao Yun
Author	Tian Pei
Author	Tianyu Sun
Author	T. Wang
Author	Wangding Zeng
Author	Wanjia Zhao
Author	Wen Liu
Author	Wenfeng Liang
Author	Wenjun Gao
Author	Wenqin Yu
Author	Wentao Zhang
Author	W. L. Xiao
Author	Wei An
Author	Xiaodong Liu
Author	Xiaohan Wang
Author	Xiaokang Chen
Author	Xiaotao Nie
Author	Xin Cheng
Author	Xin Liu
Author	Xin Xie
Author	Xingchao Liu
Author	Xinyu Yang
Author	Xinyuan Li
Author	Xuecheng Su
Author	Xuheng Lin
Author	X. Q. Li
Author	Xiangyue Jin
Author	Xiaojin Shen
Author	Xiaosha Chen
Author	Xiaowen Sun
Author	Xiaoxiang Wang
Author	Xinnan Song
Author	Xinyi Zhou
Author	Xianzu Wang
Author	Xinxia Shan
Author	Y. K. Li
Author	Y. Q. Wang
Author	Y. X. Wei
Author	Yang Zhang
Author	Yanhong Xu
Author	Yao Li
Author	Yao Zhao
Author	Yaofeng Sun
Author	Yaohui Wang
Author	Yi Yu
Author	Yichao Zhang
Author	Yifan Shi
Author	Yiliang Xiong
Author	Ying He
Author	Yishi Piao
Author	Yisong Wang
Author	Yixuan Tan
Author	Yiyang Ma
Author	Yiyuan Liu
Author	Yongqiang Guo
Author	Yuan Ou
Author	Yuduan Wang
Author	Yue Gong
Author	Yuheng Zou
Author	Yujia He
Author	Yunfan Xiong
Author	Yuxiang Luo
Author	Yuxiang You
Author	Yuxuan Liu
Author	Yuyang Zhou
Author	Y. X. Zhu
Author	Yanhong Xu
Author	Yanping Huang
Author	Yaohui Li
Author	Yi Zheng
Author	Yuchen Zhu
Author	Yunxian Ma
Author	Ying Tang
Author	Yukun Zha
Author	Yuting Yan
Author	Z. Z. Ren
Author	Zehui Ren
Author	Zhangli Sha
Author	Zhe Fu
Author	Zhean Xu
Author	Zhenda Xie
Author	Zhengyan Zhang
Author	Zhewen Hao
Author	Zhicheng Ma
Author	Zhigang Yan
Author	Zhiyu Wu
Author	Zihui Gu
Author	Zijia Zhu
Author	Zijun Liu
Author	Zilin Li
Author	Ziwei Xie
Author	Ziyang Song
Author	Zizheng Pan
Author	Zhen Huang
Author	Zhipeng Xu
Author	Zhongyu Zhang
Author	Zhen Zhang
Abstract	We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability, and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama.
Date	2025-01-22
Short Title	DeepSeek-R1
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.12948
Accessed	1/29/2025, 11:33:45 AM
Extra	arXiv:2501.12948 [cs]
DOI	10.48550/arXiv.2501.12948
Repository	arXiv
Archive ID	arXiv:2501.12948
Date Added	1/29/2025, 11:33:45 AM
Modified	1/29/2025, 11:33:47 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

The AI Agent Index

Item Type	Preprint
Author	Stephen Casper
Author	Luke Bailey
Author	Rosco Hunter
Author	Carson Ezell
Author	Emma Cabalé
Author	Michael Gerovitch
Author	Stewart Slocum
Author	Kevin Wei
Author	Nikola Jurkovic
Author	Ariba Khan
Author	Phillip J. K. Christoffersen
Author	A. Pinar Ozisik
Author	Rakshit Trivedi
Author	Dylan Hadfield-Menell
Author	Noam Kolt
Abstract	Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/
Date	2025-02-03
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.01635
Accessed	2/6/2025, 9:09:03 AM
Extra	arXiv:2502.01635 [cs]
DOI	10.48550/arXiv.2502.01635
Repository	arXiv
Archive ID	arXiv:2502.01635
Date Added	2/6/2025, 9:09:03 AM
Modified	2/6/2025, 9:09:06 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Software Engineering

Notes:

Comment: Accompanying website: https://aiagentindex.mit.edu/

Attachments

Preprint PDF
Snapshot

Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

Item Type	Preprint
Author	Jannik Brinkmann
Author	Chris Wendler
Author	Christian Bartelt
Author	Aaron Mueller
Abstract	Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this demonstrates both the generality and selectivity of these feature's roles in the network. Our findings suggest that even models trained predominantly on English data can develop robust, cross-lingual abstractions of morphosyntactic concepts.
Date	2025-01-10
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.06346
Accessed	1/31/2025, 1:10:14 PM
Extra	arXiv:2501.06346 [cs]
DOI	10.48550/arXiv.2501.06346
Repository	arXiv
Archive ID	arXiv:2501.06346
Date Added	1/31/2025, 1:10:14 PM
Modified	1/31/2025, 1:10:14 PM

Tags:

Computer Science - Computation and Language

Attachments

Full Text PDF
Snapshot

Tell me about yourself: LLMs are aware of their learned behaviors

Item Type	Preprint
Author	Jan Betley
Author	Xuchan Bao
Author	Martín Soto
Author	Anna Sztyber-Betley
Author	James Chua
Author	Owain Evans
Abstract	We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs.
Date	2025-01-19
Short Title	Tell me about yourself
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.11120
Accessed	1/29/2025, 12:12:22 PM
Extra	arXiv:2501.11120 [cs]
DOI	10.48550/arXiv.2501.11120
Repository	arXiv
Archive ID	arXiv:2501.11120
Date Added	1/29/2025, 12:12:22 PM
Modified	1/29/2025, 12:12:22 PM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning
Computer Science - Cryptography and Security

Notes:

Comment: Submitted to ICLR 2025. 17 pages, 13 figures

Attachments

Full Text PDF
Snapshot

International AI safety report

Item Type	Journal Article
Author	Y Bengio
Language	en
Library Catalog	Zotero
Date Added	1/29/2025, 11:19:42 AM
Modified	1/29/2025, 11:19:42 AM

Attachments

PDF

Dario Amodei — On DeepSeek and Export Controls

Item Type	Web Page
Author	Dario Amodei
Abstract	On DeepSeek and Export Controls
Date	2025-01-28
Language	en
URL	https://darioamodei.com/on-deepseek-and-export-controls.html
Accessed	1/29/2025, 12:22:12 PM
Date Added	1/29/2025, 12:22:12 PM
Modified	1/29/2025, 12:22:12 PM

Attachments

Snapshot