Item Type | Preprint |
---|---|
Author | Guibin Zhang |
Author | Luyang Niu |
Author | Junfeng Fang |
Author | Kun Wang |
Author | Lei Bai |
Author | Xiang Wang |
Abstract | Large Language Model (LLM)-empowered multi-agent systems extend the cognitive boundaries of individual agents through disciplined collaboration and interaction, while constructing these systems often requires labor-intensive manual designs. Despite the availability of methods to automate the design of agentic workflows, they typically seek to identify a static, complex, one-size-fits-all system, which, however, fails to dynamically allocate inference resources based on the difficulty and domain of each query. To address this challenge, we shift away from the pursuit of a monolithic agentic system, instead optimizing the agentic supernet, a probabilistic and continuous distribution of agentic architectures. We introduce MaAS, an automated framework that samples query-dependent agentic systems from the supernet, delivering high-quality solutions and tailored resource allocation (e.g., LLM calls, tool calls, token cost). Comprehensive evaluation across six benchmarks demonstrates that MaAS (I) requires only 6-45% of the inference costs of existing handcrafted or automated multi-agent systems, (II) surpasses them by 0.54%-11.82%, and (III) enjoys superior cross-dataset and cross-LLM-backbone transferability. |
Date | 2025-02-06 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.04180 |
Accessed | 2/13/2025, 11:08:07 AM |
Extra | arXiv:2502.04180 [cs] |
DOI | 10.48550/arXiv.2502.04180 |
Repository | arXiv |
Archive ID | arXiv:2502.04180 |
Date Added | 2/13/2025, 11:08:07 AM |
Modified | 2/13/2025, 11:08:07 AM |
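
Note: the "agentic supernet" above is a distribution over architectures rather than a single workflow. Below is a minimal sketch of the sampling idea, assuming a per-layer categorical distribution over operators; the operator names and the early-exit convention are hypothetical illustrations, not MaAS's actual controller.

```python
import numpy as np

# Hypothetical operator vocabulary for each layer of the supernet.
OPERATORS = ["io", "cot", "self_refine", "debate", "tool_use", "early_exit"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_workflow(layer_logits, rng):
    """Sample one agentic architecture from the supernet distribution.

    layer_logits: list of per-layer logit vectors over OPERATORS (in MaAS
    these would be conditioned on the query). Sampling stops early when the
    'early_exit' operator is drawn, which is how a query-dependent system
    can spend fewer LLM calls on easy queries.
    """
    workflow = []
    for logits in layer_logits:
        probs = softmax(np.asarray(logits, dtype=float))
        op = rng.choice(OPERATORS, p=probs)
        if op == "early_exit":
            break
        workflow.append(op)
    return workflow

rng = np.random.default_rng(0)
# Toy logits: easy queries could be given logits favoring early exit.
logits = [[2.0, 1.0, 0.5, 0.0, 0.5, 0.1],
          [0.5, 1.5, 1.0, 0.5, 0.5, 2.0]]
print(sample_workflow(logits, rng))
```
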
Item Type | Preprint |
---|---|
Author | Hongli Zhan |
Author | Muneeza Azmat |
Author | Raya Horesh |
Author | Junyi Jessy Li |
Author | Mikhail Yurochkin |
Abstract | Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public. |
Date | 2025-02-05 |
Short Title | SPRI |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.03397 |
Accessed | 2/13/2025, 11:15:29 AM |
Extra | arXiv:2502.03397 [cs] |
DOI | 10.48550/arXiv.2502.03397 |
Repository | arXiv |
Archive ID | arXiv:2502.03397 |
Date Added | 2/13/2025, 11:15:29 AM |
Modified | 2/13/2025, 11:15:29 AM |
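
Note: SPRI's core loop, as described in the abstract above, generates principles per input and then uses them to produce and refine the response. The sketch below follows that reading; `llm` is a hypothetical completion helper and the prompts are paraphrases, not the released ones.

```python
def generate_with_situated_principles(query, llm, refine_rounds=1):
    """Sketch of a SPRI-style flow: principles are generated per query and
    then used both to guide the initial response and to critique/refine it.

    llm(prompt) -> str is a hypothetical completion function.
    """
    principles = llm(
        "Write a short list of guiding principles tailored to answering the "
        f"following input well and safely:\n\n{query}")
    response = llm(
        f"Principles:\n{principles}\n\nFollowing these principles, respond to:\n{query}")
    for _ in range(refine_rounds):
        critique = llm(
            f"Principles:\n{principles}\n\nCritique this response to '{query}' "
            f"against the principles:\n{response}")
        response = llm(
            f"Revise the response to '{query}' using this critique:\n{critique}\n\n"
            f"Response:\n{response}")
    return principles, response
```
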
Item Type | Preprint |
---|---|
Author | Zhouliang Yu |
Author | Yuhuan Yuan |
Author | Tim Z. Xiao |
Author | Fuxiang Frank Xia |
Author | Jie Fu |
Author | Ge Zhang |
Author | Ge Lin |
Author | Weiyang Liu |
Abstract | Solving complex planning problems requires Large Language Models (LLMs) to explicitly model the state transition to avoid rule violations, comply with constraints, and ensure optimality, a task hindered by the inherent ambiguity of natural language. To overcome such ambiguity, Planning Domain Definition Language (PDDL) is leveraged as a planning abstraction that enables precise and formal state descriptions. With PDDL, we can generate a symbolic world model where classic search algorithms, such as A*, can be seamlessly applied to find optimal plans. However, directly generating PDDL domains with current LLMs remains an open challenge due to the lack of PDDL training data. To address this challenge, we propose to scale up the test-time computation of LLMs to enhance their PDDL reasoning capabilities, thereby enabling the generation of high-quality PDDL domains. Specifically, we introduce a simple yet effective algorithm, which first employs a Best-of-N sampling approach to improve the quality of the initial solution and then refines the solution in a fine-grained manner with verbalized machine learning. Our method outperforms o1-mini by a considerable margin in the generation of PDDL domains, achieving over a 50% success rate on two tasks (i.e., generating PDDL domains from natural language descriptions or PDDL problems). This is done without requiring additional training. By taking advantage of PDDL as state abstraction, our method is able to outperform current state-of-the-art methods on almost all competition-level planning tasks. |
Date | 2025-02-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.04728 |
Accessed | 2/13/2025, 11:30:39 AM |
Extra | arXiv:2502.04728 [cs] |
DOI | 10.48550/arXiv.2502.04728 |
Repository | arXiv |
Archive ID | arXiv:2502.04728 |
Date Added | 2/13/2025, 11:30:39 AM |
Modified | 2/13/2025, 11:30:39 AM |
Comment: Technical Report v1 (32 pages, 6 figures)
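
Note: the test-time scaling recipe above (Best-of-N sampling for an initial PDDL domain, then fine-grained refinement from verbalized feedback) can be sketched generically. `llm`, `score_domain`, and `verbalize_errors` below are hypothetical stand-ins, not the authors' interfaces.

```python
def generate_pddl_domain(task_description, llm, score_domain, verbalize_errors,
                         n_samples=8, refine_steps=3):
    """Sketch of Best-of-N sampling plus iterative refinement (not the
    authors' code). llm(prompt) -> str; score_domain(domain) -> float;
    verbalize_errors(domain) -> natural-language feedback string.
    """
    prompt = f"Write a PDDL domain for the following task:\n{task_description}"
    candidates = [llm(prompt) for _ in range(n_samples)]
    best = max(candidates, key=score_domain)          # Best-of-N selection
    for _ in range(refine_steps):                     # fine-grained refinement
        feedback = verbalize_errors(best)
        best = llm(f"{prompt}\n\nCurrent draft:\n{best}\n\n"
                   f"Feedback:\n{feedback}\n\nRevise the PDDL domain accordingly.")
    return best
```
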
Item Type | Preprint |
---|---|
Author | Yutong Yin |
Author | Zhaoran Wang |
Abstract | Humans exhibit remarkable compositional reasoning by integrating knowledge from various sources. For example, if someone learns B = f(A) from one source and C = g(B) from another, they can deduce C = g(B) = g(f(A)) even without encountering A, B, and C together, showcasing the generalization ability of human intelligence. In this paper, we introduce a synthetic learning task, "FTCT" (Fragmented at Training, Chained at Testing), to validate the potential of Transformers in replicating this skill and interpret its inner mechanism. In the training phase, data consist of separated knowledge fragments from an overall causal graph. During testing, Transformers must infer complete causal graph traces by integrating these fragments. Our findings demonstrate that few-shot Chain-of-Thought prompting enables Transformers to perform compositional reasoning on FTCT by revealing correct combinations of fragments, even if such combinations were absent in the training data. Furthermore, the emergence of compositional reasoning ability is strongly correlated with model complexity and training-testing data similarity. We propose, both theoretically and empirically, that Transformers learn an underlying generalizable program from training, enabling effective compositional reasoning during testing. |
Date | 2025-01-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.15857 |
Accessed | 2/3/2025, 10:32:14 AM |
Extra | arXiv:2501.15857 [cs] |
DOI | 10.48550/arXiv.2501.15857 |
Repository | arXiv |
Archive ID | arXiv:2501.15857 |
Date Added | 2/3/2025, 10:32:14 AM |
Modified | 2/3/2025, 10:32:14 AM |
Comment: It is accepted by The Thirteenth International Conference on Learning Representations and will be published soon. The submission number is 2678
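
Note: the compositional claim in the abstract above reduces to function composition; a toy illustration (the functions are made up, not from the paper):

```python
# Toy illustration of the compositional generalization described above:
# knowing B = f(A) and C = g(B) separately is enough to compute C = g(f(A)),
# even if the chain A -> B -> C was never observed together.
f = lambda a: a + 1      # learned from one source: B = f(A)
g = lambda b: 2 * b      # learned from another:   C = g(B)

A = 3
C = g(f(A))              # composed at test time: C = g(f(A)) = 8
print(C)
```
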
Item Type | Preprint |
---|---|
Author | Yixin Ye |
Author | Zhen Huang |
Author | Yang Xiao |
Author | Ethan Chern |
Author | Shijie Xia |
Author | Pengfei Liu |
Abstract | We present a fundamental discovery that challenges our understanding of how complex reasoning emerges in large language models. While conventional wisdom suggests that sophisticated reasoning tasks demand extensive training data (>100,000 examples), we demonstrate that complex mathematical reasoning abilities can be effectively elicited with surprisingly few examples. Through comprehensive experiments, our proposed model LIMO demonstrates unprecedented performance in mathematical reasoning. With merely 817 curated training samples, LIMO achieves 57.1% accuracy on AIME and 94.8% on MATH, improving from previous SFT-based models' 6.5% and 59.2% respectively, while only using 1% of the training data required by previous approaches. LIMO demonstrates exceptional out-of-distribution generalization, achieving 40.5% absolute improvement across 10 diverse benchmarks, outperforming models trained on 100x more data, challenging the notion that SFT leads to memorization rather than generalization. Based on these results, we propose the Less-Is-More Reasoning Hypothesis (LIMO Hypothesis): In foundation models where domain knowledge has been comprehensively encoded during pre-training, sophisticated reasoning capabilities can emerge through minimal but precisely orchestrated demonstrations of cognitive processes. This hypothesis posits that the elicitation threshold for complex reasoning is determined by two key factors: (1) the completeness of the model's encoded knowledge foundation during pre-training, and (2) the effectiveness of post-training examples as "cognitive templates" that show the model how to utilize its knowledge base to solve complex reasoning tasks. To facilitate reproducibility and future research in data-efficient reasoning, we release LIMO as a comprehensive open-source suite at https://github.com/GAIR-NLP/LIMO. |
Date | 2025-02-05 |
Short Title | LIMO |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.03387 |
Accessed | 2/7/2025, 1:30:16 PM |
Extra | arXiv:2502.03387 [cs] |
DOI | 10.48550/arXiv.2502.03387 |
Repository | arXiv |
Archive ID | arXiv:2502.03387 |
Date Added | 2/7/2025, 1:30:16 PM |
Modified | 2/7/2025, 1:30:19 PM |
Comment: 17 pages
Item Type | Journal Article |
---|---|
Author | Clinton J Wang |
Author | Dean Lee |
Author | Cristina Menghini |
Author | Johannes Mols |
Author | Jack Doughty |
Author | Jayson Lynch |
Author | Sean Hendryx |
Author | Summer Yue |
Author | Dan Hendrycks |
Abstract | As language models master existing reasoning benchmarks, we need new challenges to evaluate their cognitive frontiers. Puzzle-solving events are rich repositories of challenging multimodal problems that test a wide range of advanced reasoning and knowledge capabilities, making them a unique testbed for evaluating frontier language models. We introduce ENIGMAEVAL, a dataset of problems and solutions derived from puzzle competitions and events that probes models’ ability to perform implicit knowledge synthesis and multi-step deductive reasoning. Unlike existing reasoning and knowledge benchmarks, puzzle solving challenges models to discover hidden connections between seemingly unrelated pieces of information to uncover solution paths. The benchmark comprises 1184 puzzles of varying complexity – each typically requiring teams of skilled solvers hours to days to complete – with unambiguous, verifiable solutions that enable efficient evaluation. State-of-the-art language models achieve extremely low accuracy on these puzzles, even lower than other difficult benchmarks such as Humanity’s Last Exam, unveiling models’ shortcomings when challenged with problems requiring unstructured and lateral reasoning. |
Language | en |
Library Catalog | Zotero |
Date Added | 2/14/2025, 2:54:20 PM |
Modified | 2/14/2025, 2:54:20 PM |
Item Type | Preprint |
---|---|
Author | Jihoon Tack |
Author | Jack Lanchantin |
Author | Jane Yu |
Author | Andrew Cohen |
Author | Ilia Kulikov |
Author | Janice Lan |
Author | Shibo Hao |
Author | Yuandong Tian |
Author | Jason Weston |
Author | Xian Li |
Abstract | Next token prediction has been the standard training objective used in large language model pretraining. Representations are learned as a result of optimizing for token-level perplexity. We propose Continuous Concept Mixing (CoCoMix), a novel pretraining framework that combines discrete next token prediction with continuous concepts. Specifically, CoCoMix predicts continuous concepts learned from a pretrained sparse autoencoder and mixes them into the model's hidden state by interleaving with token hidden representations. Through experiments on multiple benchmarks, including language modeling and downstream reasoning tasks, we show that CoCoMix is more sample efficient and consistently outperforms standard next token prediction, knowledge distillation and inserting pause tokens. We find that combining both concept learning and interleaving in an end-to-end framework is critical to performance gains. Furthermore, CoCoMix enhances interpretability and steerability by allowing direct inspection and modification of the predicted concept, offering a transparent way to guide the model's internal reasoning process. |
Date | 2025-02-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08524 |
Accessed | 2/13/2025, 11:44:13 AM |
Extra | arXiv:2502.08524 [cs] |
DOI | 10.48550/arXiv.2502.08524 |
Repository | arXiv |
Archive ID | arXiv:2502.08524 |
Date Added | 2/13/2025, 11:44:13 AM |
Modified | 2/13/2025, 11:44:13 AM |
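
Note: a rough sketch of the mixing step as described in the abstract above: concepts predicted from hidden states are embedded as continuous vectors and interleaved with the token hidden states. The shapes, the top-k mixing rule, and the concept predictor are illustrative simplifications, not the paper's implementation.

```python
import numpy as np

def cocomix_interleave(token_hidden, concept_logits, concept_embed, top_k=4):
    """Illustrative (not official) sketch of CoCoMix-style mixing.

    token_hidden:   [T, d]   hidden states of T tokens
    concept_logits: [T, C]   predicted scores over C SAE-derived concepts
    concept_embed:  [C, d]   learned embedding of each concept into model width
    Returns a sequence of length 2T where a continuous "concept vector"
    (a weighted mix of the top-k predicted concepts) follows each token state.
    """
    T, d = token_hidden.shape
    mixed = []
    for t in range(T):
        logits = concept_logits[t]
        top = np.argsort(logits)[-top_k:]                # top-k concepts
        w = np.exp(logits[top] - logits[top].max())
        w = w / w.sum()
        concept_vec = w @ concept_embed[top]             # [d]
        mixed.append(token_hidden[t])
        mixed.append(concept_vec)                        # interleave
    return np.stack(mixed)                               # [2T, d]

rng = np.random.default_rng(0)
out = cocomix_interleave(rng.normal(size=(5, 16)),
                         rng.normal(size=(5, 64)),
                         rng.normal(size=(64, 16)))
print(out.shape)  # (10, 16)
```
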
Item Type | Preprint |
---|---|
Author | Tobin South |
Author | Samuele Marro |
Author | Thomas Hardjono |
Author | Robert Mahari |
Author | Cedric Deslandes Whitney |
Author | Dazza Greenwood |
Author | Alan Chan |
Author | Alex Pentland |
Abstract | The rapid deployment of autonomous AI agents creates urgent challenges around authorization, accountability, and access control in digital spaces. New standards are needed to know whom AI agents act on behalf of and guide their use appropriately, protecting online spaces while unlocking the value of task delegation to autonomous agents. We introduce a novel framework for authenticated, authorized, and auditable delegation of authority to AI agents, where human users can securely delegate and restrict the permissions and scope of agents while maintaining clear chains of accountability. This framework builds on existing identification and access management protocols, extending OAuth 2.0 and OpenID Connect with agent-specific credentials and metadata, maintaining compatibility with established authentication and web infrastructure. Further, we propose a framework for translating flexible, natural language permissions into auditable access control configurations, enabling robust scoping of AI agent capabilities across diverse interaction modalities. Taken together, this practical approach facilitates immediate deployment of AI agents while addressing key security and accountability concerns, working toward ensuring agentic AI systems perform only appropriate actions and providing a tool for digital service providers to enable AI agent interactions without risking harm from scalable interaction. |
Date | 2025-01-16 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.09674 |
Accessed | 1/29/2025, 11:26:00 AM |
Extra | arXiv:2501.09674 [cs] version: 1 |
DOI | 10.48550/arXiv.2501.09674 |
Repository | arXiv |
Archive ID | arXiv:2501.09674 |
Date Added | 1/29/2025, 11:26:00 AM |
Modified | 1/29/2025, 11:26:00 AM |
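
Note: one way to picture the credentials proposed above is an OAuth/OIDC-style token extended with agent-specific claims. The standard claims (iss, sub, aud, exp) are real JWT/OIDC fields; the agent-specific field names below are hypothetical illustrations, not from the paper or any standard.

```python
import time

# Hypothetical delegation token payload: OAuth/OIDC-style claims extended
# with agent-specific fields, in the spirit of the framework above.
delegation_token = {
    "iss": "https://idp.example.com",        # standard OIDC issuer claim
    "sub": "user-1234",                      # the delegating human user
    "aud": "https://api.example.com",        # resource the agent may call
    "exp": int(time.time()) + 3600,          # expiry bounds the delegation
    # Agent-specific claims (illustrative names only):
    "agent_id": "agent-5678",
    "delegated_scope": ["calendar:read", "email:draft"],
    "max_spend_usd": 20.0,
}

def is_action_allowed(token, required_scope, now=None):
    """Check a requested action against the delegated scope and expiry."""
    now = time.time() if now is None else now
    return now < token["exp"] and required_scope in token["delegated_scope"]

print(is_action_allowed(delegation_token, "email:draft"))  # True
print(is_action_allowed(delegation_token, "email:send"))   # False
```
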
Item Type | Preprint |
---|---|
Author | Brian Singer |
Author | Keane Lucas |
Author | Lakshmi Adiga |
Author | Meghna Jain |
Author | Lujo Bauer |
Author | Vyas Sekar |
Abstract | LLMs have shown preliminary promise in some security tasks and CTF challenges. However, it is unclear whether LLMs are able to realize multistage network attacks, which involve executing a wide variety of actions across multiple hosts such as conducting reconnaissance, exploiting vulnerabilities to gain initial access, leveraging internal hosts to move laterally, and using multiple compromised hosts to exfiltrate data. We evaluate LLMs across 10 multistage networks and find that popular LLMs are unable to realize these attacks. To enable LLMs to realize these attacks, we introduce Incalmo, an LLM-agnostic high-level attack abstraction layer that sits between an LLM and the environment. Rather than LLMs issuing low-level command-line instructions, which can lead to incorrect implementations, Incalmo allows LLMs to specify high-level tasks (e.g., infect a host, scan a network), which are then carried out by Incalmo. Incalmo realizes these tasks by translating them into low-level primitives (e.g., commands to exploit tools). Incalmo also provides an environment state service and an attack graph service to provide structure to LLMs in selecting actions relevant to a multistage attack. Across 9 out of 10 realistic emulated networks (from 25 to 50 hosts), LLMs using Incalmo can successfully autonomously execute multistage attacks. We also conduct an ablation analysis to show the key role the high-level abstractions play. For instance, we find that both Incalmo's high-level tasks and services are crucial. Furthermore, even smaller-parameter LLMs with Incalmo can fully succeed in 5 of 10 environments, while larger-parameter LLMs without Incalmo do not fully succeed in any. |
Date | 2025-01-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.16466 |
Accessed | 1/29/2025, 4:33:58 PM |
Extra | arXiv:2501.16466 [cs] |
DOI | 10.48550/arXiv.2501.16466 |
Repository | arXiv |
Archive ID | arXiv:2501.16466 |
Date Added | 1/29/2025, 4:33:58 PM |
Modified | 1/29/2025, 4:34:01 PM |
Comment: 16 pages, 14 figures
Item Type | Preprint |
---|---|
Author | Lee Sharkey |
Author | Bilal Chughtai |
Author | Joshua Batson |
Author | Jack Lindsey |
Author | Jeff Wu |
Author | Lucius Bushnaq |
Author | Nicholas Goldowsky-Dill |
Author | Stefan Heimersheim |
Author | Alejandro Ortega |
Author | Joseph Bloom |
Author | Stella Biderman |
Author | Adria Garriga-Alonso |
Author | Arthur Conmy |
Author | Neel Nanda |
Author | Jessica Rumbelow |
Author | Martin Wattenberg |
Author | Nandi Schoots |
Author | Joseph Miller |
Author | Eric J. Michaud |
Author | Stephen Casper |
Author | Max Tegmark |
Author | William Saunders |
Author | David Bau |
Author | Eric Todd |
Author | Atticus Geiger |
Author | Mor Geva |
Author | Jesse Hoogland |
Author | Daniel Murfet |
Author | Tom McGrath |
Abstract | Mechanistic interpretability aims to understand the computational mechanisms underlying neural networks' capabilities in order to accomplish concrete scientific and engineering goals. Progress in this field thus promises to provide greater assurance over AI system behavior and shed light on exciting scientific questions about the nature of intelligence. Despite recent progress toward these goals, there are many open problems in the field that require solutions before many scientific and practical benefits can be realized: Our methods require both conceptual and practical improvements to reveal deeper insights; we must figure out how best to apply our methods in pursuit of specific goals; and the field must grapple with socio-technical challenges that influence and are influenced by our work. This forward-facing review discusses the current frontier of mechanistic interpretability and the open problems that the field may benefit from prioritizing. |
Date | 2025-01-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.16496 |
Accessed | 1/29/2025, 11:18:54 AM |
Extra | arXiv:2501.16496 [cs] |
DOI | 10.48550/arXiv.2501.16496 |
Repository | arXiv |
Archive ID | arXiv:2501.16496 |
Date Added | 1/29/2025, 11:18:54 AM |
Modified | 1/29/2025, 11:18:54 AM |
Item Type | Preprint |
---|---|
Author | Swarnadeep Saha |
Author | Xian Li |
Author | Marjan Ghazvininejad |
Author | Jason Weston |
Author | Tianlu Wang |
Abstract | LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves a new state-of-the-art performance for generative reward models on RewardBench (with a score of 93.9), despite being trained on fewer, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models. |
Date | 2025-01-30 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.18099 |
Accessed | 2/1/2025, 3:50:32 PM |
Extra | arXiv:2501.18099 [cs] |
DOI | 10.48550/arXiv.2501.18099 |
Repository | arXiv |
Archive ID | arXiv:2501.18099 |
Date Added | 2/1/2025, 3:50:32 PM |
Modified | 2/1/2025, 3:50:32 PM |
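
Note: a sketch of the plan-then-execute judging flow described above; `llm` is a hypothetical completion helper and the prompts are illustrative, not EvalPlanner's trained behavior.

```python
def plan_then_judge(instruction, response_a, response_b, llm):
    """Illustrative Thinking-LLM-as-a-Judge flow (not the trained EvalPlanner
    model): first produce an unconstrained evaluation plan, then execute it,
    then emit a final verdict. llm(prompt) -> str is hypothetical.
    """
    plan = llm(
        "Draft an evaluation plan (criteria, checks, and steps) for judging "
        f"responses to this instruction:\n{instruction}")
    execution = llm(
        f"Evaluation plan:\n{plan}\n\nCarry out the plan step by step on:\n"
        f"Response A:\n{response_a}\n\nResponse B:\n{response_b}")
    verdict = llm(
        f"Plan:\n{plan}\n\nExecution:\n{execution}\n\n"
        "Based on the executed plan, which response is better: A or B?")
    return verdict
```
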
Item Type | Preprint |
---|---|
Author | Pascal J. Sager |
Author | Benjamin Meyer |
Author | Peng Yan |
Author | Rebekka von Wartburg-Kottler |
Author | Layan Etaiwi |
Author | Aref Enayati |
Author | Gabriel Nobel |
Author | Ahmed Abdulkadir |
Author | Benjamin F. Grewe |
Author | Thilo Stadelmann |
Abstract | Instruction-based computer control agents (CCAs) execute complex action sequences on personal computers or mobile devices to fulfill tasks using the same graphical user interfaces as a human user would, provided instructions in natural language. This review offers a comprehensive overview of the emerging field of instruction-based computer control, examining available agents -- their taxonomy, development, and respective resources -- and emphasizing the shift from manually designed, specialized agents to leveraging foundation models such as large language models (LLMs) and vision-language models (VLMs). We formalize the problem and establish a taxonomy of the field to analyze agents from three perspectives: (a) the environment perspective, analyzing computer environments; (b) the interaction perspective, describing observation spaces (e.g., screenshots, HTML) and action spaces (e.g., mouse and keyboard actions, executable code); and (c) the agent perspective, focusing on the core principle of how an agent acts and learns to act. Our framework encompasses both specialized and foundation agents, facilitating their comparative analysis and revealing how prior solutions in specialized agents, such as an environment learning step, can guide the development of more capable foundation agents. Additionally, we review current CCA datasets and CCA evaluation methods and outline the challenges to deploying such agents in a productive setting. In total, we review and classify 86 CCAs and 33 related datasets. By highlighting trends, limitations, and future research directions, this work presents a comprehensive foundation to obtain a broad understanding of the field and push its future development. |
Date | 2025-01-27 |
Short Title | AI Agents for Computer Use |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.16150 |
Accessed | 2/1/2025, 3:50:49 PM |
Extra | arXiv:2501.16150 [cs] |
DOI | 10.48550/arXiv.2501.16150 |
Repository | arXiv |
Archive ID | arXiv:2501.16150 |
Date Added | 2/1/2025, 3:50:49 PM |
Modified | 2/1/2025, 3:50:49 PM |
Item Type | Preprint |
---|---|
Author | Sahand Sabour |
Author | June M. Liu |
Author | Siyang Liu |
Author | Chris Z. Yao |
Author | Shiyao Cui |
Author | Xuanming Zhang |
Author | Wen Zhang |
Author | Yaru Cao |
Author | Advait Bhat |
Author | Jian Guan |
Author | Wei Wu |
Author | Rada Mihalcea |
Author | Tim Althoff |
Author | Tatia M. C. Lee |
Author | Minlie Huang |
Abstract | Artificial Intelligence (AI) systems are increasingly intertwined with daily life, assisting users in executing various tasks and providing guidance on decision-making. This integration introduces risks of AI-driven manipulation, where such systems may exploit users' cognitive biases and emotional vulnerabilities to steer them toward harmful outcomes. Through a randomized controlled trial with 233 participants, we examined human susceptibility to such manipulation in financial (e.g., purchases) and emotional (e.g., conflict resolution) decision-making contexts. Participants interacted with one of three AI agents: a neutral agent (NA) optimizing for user benefit without explicit influence, a manipulative agent (MA) designed to covertly influence beliefs and behaviors, or a strategy-enhanced manipulative agent (SEMA) employing explicit psychological tactics to reach its hidden objectives. By analyzing participants' decision patterns and shifts in their preference ratings post-interaction, we found significant susceptibility to AI-driven manipulation. Particularly, across both decision-making domains, participants interacting with the manipulative agents shifted toward harmful options at substantially higher rates (financial, MA: 62.3%, SEMA: 59.6%; emotional, MA: 42.3%, SEMA: 41.5%) compared to the NA group (financial, 35.8%; emotional, 12.8%). Notably, our findings reveal that even subtle manipulative objectives (MA) can be as effective as employing explicit psychological strategies (SEMA) in swaying human decision-making. By revealing the potential for covert AI influence, this study highlights a critical vulnerability in human-AI interactions, emphasizing the need for ethical safeguards and regulatory frameworks to ensure responsible deployment of AI technologies and protect human autonomy. |
Date | 2025-02-11 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.07663 |
Accessed | 2/14/2025, 3:02:32 PM |
Extra | arXiv:2502.07663 [cs] |
DOI | 10.48550/arXiv.2502.07663 |
Repository | arXiv |
Archive ID | arXiv:2502.07663 |
Date Added | 2/14/2025, 3:02:32 PM |
Modified | 2/14/2025, 3:02:32 PM |
Comment: Work in progress. Code and data will be made available via https://github.com/Sahandfer/Manipulation-Susceptibility
Item Type | Web Page |
---|---|
Author | DeepMind Safety Research |
Abstract | We are excited to release a short course on AGI safety for students, researchers and professionals interested in this topic. The course… |
Date | 2025-02-14T15:05:57.040Z |
Language | en |
URL | https://deepmindsafetyresearch.medium.com/introducing-our-short-course-on-agi-safety-1072adb7912c |
Accessed | 2/14/2025, 3:02:07 PM |
Website Title | Medium |
Date Added | 2/14/2025, 3:02:07 PM |
Modified | 2/14/2025, 3:02:07 PM |
Item Type | Web Page |
---|---|
Author | Sebastian Raschka |
Abstract | This article covers 12 influential AI research papers of 2024, ranging from mixture-of-experts models to new LLM scaling laws for precision. |
Date | 06:03:00 +0000 |
Language | en |
URL | https://sebastianraschka.com/blog/2025/llm-research-2024.html |
Accessed | 1/29/2025, 11:36:46 AM |
Website Title | Sebastian Raschka, PhD |
Date Added | 1/29/2025, 11:36:46 AM |
Modified | 1/29/2025, 11:36:46 AM |
Item Type | Preprint |
---|---|
Author | Sunny Rai |
Author | Khushang Jilesh Zaveri |
Author | Shreya Havaldar |
Author | Soumna Nema |
Author | Lyle Ungar |
Author | Sharath Chandra Guntuku |
Abstract | Shame and pride are social emotions expressed across cultures to motivate and regulate people's thoughts, feelings, and behaviors. In this paper, we introduce the first cross-cultural dataset of over 10k shame/pride-related expressions, with underlying social expectations from ~5.4K Bollywood and Hollywood movies. We examine how and why shame and pride are expressed across cultures using a blend of psychology-informed language analysis combined with large language models. We find significant cross-cultural differences in shame and pride expression aligning with known cultural tendencies of the USA and India -- e.g., in Hollywood, shame-expressions predominantly discuss self whereas Bollywood discusses shame toward others. Pride in Hollywood is individualistic with more self-referential singular pronouns such as I and my whereas in Bollywood, pride is collective with higher use of self-referential plural pronouns such as we and our. Lastly, women are more sanctioned across cultures and for violating similar social expectations e.g. promiscuity. |
Date | 2024-10-15 |
Short Title | Social Norms in Cinema |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2402.11333 |
Accessed | 2/1/2025, 3:50:15 PM |
Extra | arXiv:2402.11333 [cs] |
DOI | 10.48550/arXiv.2402.11333 |
Repository | arXiv |
Archive ID | arXiv:2402.11333 |
Date Added | 2/1/2025, 3:50:15 PM |
Modified | 2/1/2025, 3:50:15 PM |
Item Type | Preprint |
---|---|
Author | Gonçalo Paulo |
Author | Stepan Shabalin |
Author | Nora Belrose |
Abstract | Sparse autoencoders (SAEs) extract human-interpretable features from deep neural networks by transforming their activations into a sparse, higher dimensional latent space, and then reconstructing the activations from these latents. Transcoders are similar to SAEs, but they are trained to reconstruct the output of a component of a deep network given its input. In this work, we compare the features found by transcoders and SAEs trained on the same model and data, finding that transcoder features are significantly more interpretable. We also propose skip transcoders, which add an affine skip connection to the transcoder architecture, and show that these achieve lower reconstruction loss with no effect on interpretability. |
Date | 2025-02-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.18823 |
Accessed | 2/13/2025, 11:02:53 AM |
Extra | arXiv:2501.18823 [cs] |
DOI | 10.48550/arXiv.2501.18823 |
Repository | arXiv |
Archive ID | arXiv:2501.18823 |
Date Added | 2/13/2025, 11:02:53 AM |
Modified | 2/13/2025, 11:02:56 AM |
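
Note: the skip transcoder described above has a simple forward pass: encode the component's input, apply a sparse activation, decode toward the component's output, and add an affine skip connection from the input. A minimal numpy sketch with toy dimensions (initialization and training are omitted):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, d_latent = 64, 64, 512         # toy sizes

W_enc = rng.normal(size=(d_latent, d_in)) * 0.05
b_enc = np.zeros(d_latent)
W_dec = rng.normal(size=(d_out, d_latent)) * 0.05
W_skip = np.eye(d_out, d_in)                # affine skip connection
b_dec = np.zeros(d_out)

def skip_transcoder(x):
    """Predict a component's output from its input x.

    A plain transcoder would use only the sparse latent path; the skip
    transcoder adds W_skip @ x + b_dec, which the paper above reports
    lowers reconstruction loss without hurting interpretability.
    """
    latents = np.maximum(W_enc @ x + b_enc, 0.0)   # sparse (ReLU) latents
    return W_dec @ latents + W_skip @ x + b_dec

x = rng.normal(size=d_in)
print(skip_transcoder(x).shape)  # (64,)
```
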
Item Type | Preprint |
---|---|
Author | Gonçalo Paulo |
Author | Nora Belrose |
Abstract | Sparse autoencoders (SAEs) are a useful tool for uncovering human-interpretable features in the activations of large language models (LLMs). While some expect SAEs to find the true underlying features used by a model, our research shows that SAEs trained on the same model and data, differing only in the random seed used to initialize their weights, identify different sets of features. For example, in an SAE with 131K latents trained on a feedforward network in Llama 3 8B, only 30% of the features were shared across different seeds. We observed this phenomenon across multiple layers of three different LLMs, two datasets, and several SAE architectures. While ReLU SAEs trained with the L1 sparsity loss showed greater stability across seeds, SAEs using the state-of-the-art TopK activation function were more seed-dependent, even when controlling for the level of sparsity. Our results suggest that the set of features uncovered by an SAE should be viewed as a pragmatically useful decomposition of activation space, rather than an exhaustive and universal list of features "truly used" by the model. |
Date | 2025-01-29 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.16615 |
Accessed | 2/3/2025, 10:27:43 AM |
Extra | arXiv:2501.16615 [cs] |
DOI | 10.48550/arXiv.2501.16615 |
Repository | arXiv |
Archive ID | arXiv:2501.16615 |
Date Added | 2/3/2025, 10:27:43 AM |
Modified | 2/3/2025, 10:27:43 AM |
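
Note: a statistic like "30% of features shared across seeds" can be approximated by matching decoder directions across two SAEs by cosine similarity; the threshold and one-sided matching below are simplifications, not the paper's exact protocol.

```python
import numpy as np

def shared_feature_fraction(dec_a, dec_b, threshold=0.7):
    """Fraction of SAE-A decoder directions with a close match in SAE-B.

    dec_a, dec_b: [n_latents, d_model] decoder matrices from two SAEs
    trained with different random seeds. Each row is unit-normalized and
    compared against all rows of the other SAE by cosine similarity.
    """
    a = dec_a / np.linalg.norm(dec_a, axis=1, keepdims=True)
    b = dec_b / np.linalg.norm(dec_b, axis=1, keepdims=True)
    sims = a @ b.T                       # [n_a, n_b] cosine similarities
    best = sims.max(axis=1)              # best match for each A-feature
    return float((best >= threshold).mean())

rng = np.random.default_rng(0)
dec_seed1 = rng.normal(size=(1000, 64))
dec_seed2 = rng.normal(size=(1000, 64))
print(shared_feature_fraction(dec_seed1, dec_seed2))  # near 0 for random
```
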
Item Type | Web Page |
---|---|
Author | Dwarkesh Patel |
Abstract | Everyone is sleeping on the *collective* advantages AIs will have, which have nothing to do with raw IQ - they can be copied, distilled, merged, scaled, and evolved in ways humans simply can't. |
Date | 2024-12-27 |
Language | en |
URL | https://www.dwarkeshpatel.com/p/ai-firm |
Accessed | 2/1/2025, 3:26:33 PM |
Date Added | 2/1/2025, 3:26:33 PM |
Modified | 2/1/2025, 3:26:33 PM |
Item Type | Preprint |
---|---|
Author | Pat Pataranutaporn |
Author | Kavin Winson |
Author | Peggy Yin |
Author | Auttasak Lapapirojn |
Author | Pichayoot Ouppaphan |
Author | Monchai Lertsutthiwong |
Author | Pattie Maes |
Author | Hal Hershfield |
Abstract | We introduce "Future You," an interactive, brief, single-session, digital chat intervention designed to improve future self-continuity--the degree of connection an individual feels with a temporally distant future self--a characteristic that is positively related to mental health and wellbeing. Our system allows users to chat with a relatable yet AI-powered virtual version of their future selves that is tuned to their future goals and personal qualities. To make the conversation realistic, the system generates a "synthetic memory"--a unique backstory for each user--that creates a throughline between the user's present age (between 18-30) and their life at age 60. The "Future You" character also adopts the persona of an age-progressed image of the user's present self. After a brief interaction with the "Future You" character, users reported decreased anxiety, and increased future self-continuity. This is the first study successfully demonstrating the use of personalized AI-generated characters to improve users' future self-continuity and wellbeing. |
Date | 2024-10-01 |
Short Title | Future You |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2405.12514 |
Accessed | 2/3/2025, 10:27:58 AM |
Extra | arXiv:2405.12514 [cs] |
DOI | 10.48550/arXiv.2405.12514 |
Repository | arXiv |
Archive ID | arXiv:2405.12514 |
Date Added | 2/3/2025, 10:27:58 AM |
Modified | 2/3/2025, 10:27:58 AM |
Item Type | Preprint |
---|---|
Author | OpenAI |
Author | Ahmed El-Kishky |
Author | Alexander Wei |
Author | Andre Saraiva |
Author | Borys Minaev |
Author | Daniel Selsam |
Author | David Dohan |
Author | Francis Song |
Author | Hunter Lightman |
Author | Ignasi Clavera |
Author | Jakub Pachocki |
Author | Jerry Tworek |
Author | Lorenz Kuhn |
Author | Lukasz Kaiser |
Author | Mark Chen |
Author | Max Schwarzer |
Author | Mostafa Rohaninejad |
Author | Nat McAleese |
Author | o3 contributors |
Author | Oleg Mürk |
Author | Rhythm Garg |
Author | Rui Shu |
Author | Szymon Sidor |
Author | Vineet Kosaraju |
Author | Wenda Zhou |
Abstract | We show that reinforcement learning applied to large language models (LLMs) significantly boosts performance on complex coding and reasoning tasks. Additionally, we compare two general-purpose reasoning models - OpenAI o1 and an early checkpoint of o3 - with a domain-specific system, o1-ioi, which uses hand-engineered inference strategies designed for competing in the 2024 International Olympiad in Informatics (IOI). We competed live at IOI 2024 with o1-ioi and, using hand-crafted test-time strategies, placed in the 49th percentile. Under relaxed competition constraints, o1-ioi achieved a gold medal. However, when evaluating later models such as o3, we find that o3 achieves gold without hand-crafted domain-specific strategies or relaxed constraints. Our findings show that although specialized pipelines such as o1-ioi yield solid improvements, the scaled-up, general-purpose o3 model surpasses those results without relying on hand-crafted inference heuristics. Notably, o3 achieves a gold medal at the 2024 IOI and obtains a Codeforces rating on par with elite human competitors. Overall, these results indicate that scaling general-purpose reinforcement learning, rather than relying on domain-specific techniques, offers a robust path toward state-of-the-art AI in reasoning domains, such as competitive programming. |
Date | 2025-02-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.06807 |
Accessed | 2/13/2025, 11:37:43 AM |
Extra | arXiv:2502.06807 [cs] |
DOI | 10.48550/arXiv.2502.06807 |
Repository | arXiv |
Archive ID | arXiv:2502.06807 |
Date Added | 2/13/2025, 11:37:43 AM |
Modified | 2/13/2025, 11:37:43 AM |
Item Type | Preprint |
---|---|
Author | Sonia K. Murthy |
Author | Tomer Ullman |
Author | Jennifer Hu |
Abstract | Researchers in social science and psychology have recently proposed using large language models (LLMs) as replacements for humans in behavioral research. In addition to arguments about whether LLMs accurately capture population-level patterns, this has raised questions about whether LLMs capture human-like conceptual diversity. Separately, it is debated whether post-training alignment (RLHF or RLAIF) affects models' internal diversity. Inspired by human studies, we use a new way of measuring the conceptual diversity of synthetically-generated LLM "populations" by relating the internal variability of simulated individuals to the population-level variability. We use this approach to evaluate non-aligned and aligned LLMs on two domains with rich human behavioral data. While no model reaches human-like diversity, aligned models generally display less diversity than their instruction fine-tuned counterparts. Our findings highlight potential trade-offs between increasing models' value alignment and decreasing the diversity of their conceptual representations. |
Date | 2024-11-12 |
Short Title | One fish, two fish, but not the whole sea |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.04427 |
Accessed | 2/13/2025, 11:34:32 AM |
Extra | arXiv:2411.04427 [cs] |
DOI | 10.48550/arXiv.2411.04427 |
Repository | arXiv |
Archive ID | arXiv:2411.04427 |
Date Added | 2/13/2025, 11:34:32 AM |
Modified | 2/13/2025, 11:34:32 AM |
Comment: 17 pages, 10 figures; corrected figure version
Item Type | Preprint |
---|---|
Author | Niklas Muennighoff |
Author | Zitong Yang |
Author | Weijia Shi |
Author | Xiang Lisa Li |
Author | Li Fei-Fei |
Author | Hannaneh Hajishirzi |
Author | Luke Zettlemoyer |
Author | Percy Liang |
Author | Emmanuel Candès |
Author | Tatsunori Hashimoto |
Abstract | Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1 exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1 with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1. |
Date | 2025-01-31 |
Short Title | s1 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.19393 |
Accessed | 2/3/2025, 10:32:30 AM |
Extra | arXiv:2501.19393 [cs] |
DOI | 10.48550/arXiv.2501.19393 |
Repository | arXiv |
Archive ID | arXiv:2501.19393 |
Date Added | 2/3/2025, 10:32:30 AM |
Modified | 2/3/2025, 10:32:30 AM |
Comment: 46 pages (9 main), 10 figures, 14 tables
Item Type | Preprint |
---|---|
Author | Niklas Muennighoff |
Author | Zitong Yang |
Author | Weijia Shi |
Author | Xiang Lisa Li |
Author | Li Fei-Fei |
Author | Hannaneh Hajishirzi |
Author | Luke Zettlemoyer |
Author | Percy Liang |
Author | Emmanuel Candès |
Author | Tatsunori Hashimoto |
Abstract | Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1 |
Date | 2025-02-03 |
Short Title | s1 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.19393 |
Accessed | 2/13/2025, 11:04:27 AM |
Extra | arXiv:2501.19393 [cs] |
DOI | 10.48550/arXiv.2501.19393 |
Repository | arXiv |
Archive ID | arXiv:2501.19393 |
Date Added | 2/13/2025, 11:04:27 AM |
Modified | 2/13/2025, 11:04:27 AM |
Comment: 45 pages (9 main), 10 figures, 14 tables
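
Note: budget forcing, as described in both entries above, is a decoding-time control: suppress the end-of-thinking delimiter and append "Wait" to extend reasoning, or emit the delimiter to cap it. A hedged sketch follows; `generate_until`, the delimiter string, and the crude token counting are hypothetical stand-ins for the actual decoding loop.

```python
def budget_forced_reasoning(prompt, generate_until, min_tokens=0, max_tokens=4000,
                            end_of_thinking="<|end_think|>", nudge="Wait"):
    """Sketch of budget forcing (illustrative, not the s1 implementation).

    generate_until(text, stop, max_new_tokens) is a hypothetical helper that
    continues decoding `text` until `stop` is produced or the budget is
    exhausted, returning (continuation, hit_stop); the stop string itself is
    not included in the continuation.
    """
    trace = prompt
    used = 0
    while used < max_tokens:
        continuation, hit_stop = generate_until(
            trace, stop=end_of_thinking, max_new_tokens=max_tokens - used)
        used += len(continuation.split())          # crude token count
        trace += continuation
        if hit_stop and used < min_tokens:
            # Model tried to stop thinking too early: suppress the delimiter
            # and append "Wait" so it re-examines its answer.
            trace += nudge
            continue
        break
    return trace + end_of_thinking                 # cap the thinking budget
```
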
Item Type | Preprint |
---|---|
Author | Ali Modarressi |
Author | Hanieh Deilamsalehy |
Author | Franck Dernoncourt |
Author | Trung Bui |
Author | Ryan A. Rossi |
Author | Seunghyun Yoon |
Author | Hinrich Schütze |
Abstract | Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a "needle" (relevant information) from a "haystack" (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information. |
Date | 2025-02-07 |
Short Title | NoLiMa |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.05167 |
Accessed | 2/13/2025, 11:33:22 AM |
Extra | arXiv:2502.05167 [cs] |
DOI | 10.48550/arXiv.2502.05167 |
Repository | arXiv |
Archive ID | arXiv:2502.05167 |
Date Added | 2/13/2025, 11:33:22 AM |
Modified | 2/13/2025, 11:33:22 AM |
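
Note: the contrast drawn in the abstract above is between needles that literally share words with the question and needles that require a latent association. A toy construction sketch (texts are illustrative, in the spirit of the paper's example, not items from the benchmark):

```python
import random

def build_haystack_item(filler_sentences, needle, question, depth=0.5, seed=0):
    """Place a needle at a relative depth inside long irrelevant context."""
    rng = random.Random(seed)
    filler = filler_sentences[:]
    rng.shuffle(filler)
    pos = int(len(filler) * depth)
    context = " ".join(filler[:pos] + [needle] + filler[pos:])
    return {"context": context, "question": question}

filler = [f"Unrelated sentence number {i}." for i in range(200)]

# Classic NIAH: question and needle share the literal word "Dresden".
niah = build_haystack_item(filler,
    needle="Yuki lives in Dresden.",
    question="Which character lives in Dresden?")

# NoLiMa-style: answering requires the latent link Semperoper -> Dresden.
nolima = build_haystack_item(filler,
    needle="Yuki lives next to the Semperoper.",
    question="Which character has been to Dresden?")
```
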
Item Type | Preprint |
---|---|
Author | Mantas Mazeika |
Author | Xuwang Yin |
Author | Rishub Tamirisa |
Author | Jaehyuk Lim |
Author | Bruce W. Lee |
Author | Richard Ren |
Author | Long Phan |
Author | Norman Mu |
Author | Adam Khoja |
Author | Oliver Zhang |
Author | Dan Hendrycks |
Abstract | As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences. Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control. As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations. |
Date | 2025-02-12 |
Short Title | Utility Engineering |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08640 |
Accessed | 2/13/2025, 11:35:58 AM |
Extra | arXiv:2502.08640 [cs] |
DOI | 10.48550/arXiv.2502.08640 |
Repository | arXiv |
Archive ID | arXiv:2502.08640 |
Date Added | 2/13/2025, 11:35:58 AM |
Modified | 2/13/2025, 11:35:58 AM |
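
Note: the utility-function framing above can be made concrete with a Bradley-Terry-style fit: recover scalar utilities from pairwise preference rates so that P(i preferred to j) = sigmoid(u_i - u_j). A minimal numpy sketch on toy data, not the paper's estimator:

```python
import numpy as np

def fit_utilities(pairs, n_items, lr=0.1, steps=2000):
    """Fit scalar utilities u so that P(i preferred to j) = sigmoid(u_i - u_j).

    pairs: list of (i, j, p_ij) with empirical preference rates p_ij.
    Returns the utility vector (identified up to an additive constant).
    """
    u = np.zeros(n_items)
    for _ in range(steps):
        grad = np.zeros(n_items)
        for i, j, p in pairs:
            pred = 1.0 / (1.0 + np.exp(-(u[i] - u[j])))
            grad[i] += (pred - p)          # d(cross-entropy)/du_i
            grad[j] -= (pred - p)
        u -= lr * grad
    return u - u.mean()

# Toy preference data over three outcomes; coherent preferences yield a
# consistent ordering of utilities, which is the structure the paper probes.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (0, 2, 0.95)]
print(fit_utilities(pairs, 3))
```
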
Item Type | Web Page |
---|---|
Author | Lovable |
Abstract | Lovable Generated Project |
Language | en |
URL | https://smooth-operator.online/ |
Accessed | 1/29/2025, 11:44:59 AM |
Date Added | 1/29/2025, 11:44:59 AM |
Modified | 1/29/2025, 11:44:59 AM |
Item Type | Preprint |
---|---|
Author | Gaojie Lin |
Author | Jianwen Jiang |
Author | Jiaqi Yang |
Author | Zerong Zheng |
Author | Chao Liang |
Abstract | End-to-end human animation, such as audio-driven talking human generation, has undergone notable advancements in the recent few years. However, existing methods still struggle to scale up as large general video generation models, limiting their potential in real applications. In this paper, we propose OmniHuman, a Diffusion Transformer-based framework that scales up data by mixing motion-related conditions into the training phase. To this end, we introduce two training principles for these mixed conditions, along with the corresponding model architecture and inference strategy. These designs enable OmniHuman to fully leverage data-driven motion generation, ultimately achieving highly realistic human video generation. More importantly, OmniHuman supports various portrait contents (face close-up, portrait, half-body, full-body), supports both talking and singing, handles human-object interactions and challenging body poses, and accommodates different image styles. Compared to existing end-to-end audio-driven methods, OmniHuman not only produces more realistic videos, but also offers greater flexibility in inputs. It also supports multiple driving modalities (audio-driven, video-driven and combined driving signals). Video samples are provided on the project page (https://omnihuman-lab.github.io) |
Date | 2025-02-03 |
Short Title | OmniHuman-1 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01061 |
Accessed | 2/6/2025, 9:30:17 AM |
Extra | arXiv:2502.01061 [cs] |
DOI | 10.48550/arXiv.2502.01061 |
Repository | arXiv |
Archive ID | arXiv:2502.01061 |
Date Added | 2/6/2025, 9:30:17 AM |
Modified | 2/6/2025, 9:30:17 AM |
Comment: https://omnihuman-lab.github.io/
Item Type | Preprint |
---|---|
Author | Xiang Lisa Li |
Author | Neil Chowdhury |
Author | Daniel D. Johnson |
Author | Tatsunori Hashimoto |
Author | Percy Liang |
Author | Sarah Schwettmann |
Author | Jacob Steinhardt |
Abstract | Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate. |
Date | 2025-02-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01236 |
Accessed | 2/6/2025, 9:47:44 AM |
Extra | arXiv:2502.01236 [cs] |
DOI | 10.48550/arXiv.2502.01236 |
Repository | arXiv |
Archive ID | arXiv:2502.01236 |
Date Added | 2/6/2025, 9:47:44 AM |
Modified | 2/6/2025, 9:47:44 AM |
Comment: 20 pages, 7 figures
Item Type | Preprint |
---|---|
Author | Ang Li |
Author | Yin Zhou |
Author | Vethavikashini Chithrra Raghuram |
Author | Tom Goldstein |
Author | Micah Goldblum |
Abstract | A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs). These attacks may extract private information or coerce the model into producing harmful outputs. In real-world deployments, LLMs are often part of a larger agentic pipeline including memory systems, retrieval, web access, and API calling. Such additional components introduce vulnerabilities that make these LLM-powered agents much easier to attack than isolated LLMs, yet relatively little work focuses on the security of LLM agents. In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents. We first provide a taxonomy of attacks categorized by threat actors, objectives, entry points, attacker observability, attack strategies, and inherent vulnerabilities of agent pipelines. We then conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities. Notably, our attacks are trivial to implement and require no understanding of machine learning. |
Date | 2025-02-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08586 |
Accessed | 2/14/2025, 2:39:38 PM |
Extra | arXiv:2502.08586 [cs] |
DOI | 10.48550/arXiv.2502.08586 |
Repository | arXiv |
Archive ID | arXiv:2502.08586 |
Date Added | 2/14/2025, 2:39:38 PM |
Modified | 2/14/2025, 2:39:40 PM |
Item Type | Preprint |
---|---|
Author | Dongwoo Lee |
Author | Gavin Kader |
Abstract | As Large Language Models (LLMs) are increasingly used for a variety of complex and critical tasks, it is vital to assess their logical capabilities in strategic environments. This paper examines their ability in strategic reasoning -- the process of choosing an optimal course of action by predicting and adapting to other agents' behavior. Using six LLMs, we analyze responses from play in classical games from behavioral economics (p-Beauty Contest, 11-20 Money Request Game, and Guessing Game) and evaluate their performance through hierarchical models of reasoning (level-k theory and cognitive hierarchy theory). Our findings reveal that while LLMs show understanding of the games, the majority struggle with higher-order strategic reasoning. Although most LLMs did demonstrate learning ability with games involving repeated interactions, they still consistently fall short of the reasoning levels demonstrated by typical behavior from human subjects. The exception to these overall findings is with OpenAI's GPT-o1 -- specifically trained to solve complex reasoning tasks -- which consistently outperforms other LLMs and human subjects. These findings highlight the challenges and pathways in advancing LLMs toward robust strategic reasoning from the perspective of behavioral economics. |
Date | 2024-12-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.13013 |
Accessed | 1/29/2025, 11:36:10 AM |
Extra | arXiv:2412.13013 [econ] |
DOI | 10.48550/arXiv.2412.13013 |
Repository | arXiv |
Archive ID | arXiv:2412.13013 |
Date Added | 1/29/2025, 11:36:10 AM |
Modified | 1/29/2025, 11:36:10 AM |
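
Note: the level-k model referenced above has a closed form in the p-Beauty Contest: a level-0 player guesses with mean 50, and a level-k player best-responds to level-(k-1), giving a guess of p^k * 50. A short worked computation (p = 2/3 is a common choice, not necessarily the paper's setting):

```python
def p_beauty_level_k_guess(k, p=2/3, level0_mean=50.0):
    """Level-k guess in the p-Beauty Contest: best response to the belief
    that everyone else reasons at level k-1, so guess = p**k * level-0 mean."""
    return (p ** k) * level0_mean

for k in range(5):
    print(k, round(p_beauty_level_k_guess(k), 2))
# 0 -> 50.0, 1 -> 33.33, 2 -> 22.22, 3 -> 14.81, 4 -> 9.88
```
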
Item Type | Preprint |
---|---|
Author | Patrick Leask |
Author | Bart Bussmann |
Author | Michael Pearce |
Author | Joseph Bloom |
Author | Curt Tigges |
Author | Noura Al Moubayed |
Author | Lee Sharkey |
Author | Neel Nanda |
Abstract | A common goal of mechanistic interpretability is to decompose the activations of neural networks into features: interpretable properties of the input computed by the model. Sparse autoencoders (SAEs) are a popular method for finding these features in LLMs, and it has been postulated that they can be used to find a canonical set of units: a unique and complete list of atomic features. We cast doubt on this belief using two novel techniques: SAE stitching to show they are incomplete, and meta-SAEs to show they are not atomic. SAE stitching involves inserting or swapping latents from a larger SAE into a smaller one. Latents from the larger SAE can be divided into two categories: novel latents, which improve performance when added to the smaller SAE, indicating they capture novel information, and reconstruction latents, which can replace corresponding latents in the smaller SAE that have similar behavior. The existence of novel features indicates incompleteness of smaller SAEs. Using meta-SAEs -- SAEs trained on the decoder matrix of another SAE -- we find that latents in SAEs often decompose into combinations of latents from a smaller SAE, showing that larger SAE latents are not atomic. The resulting decompositions are often interpretable; e.g., a latent representing "Einstein" decomposes into "scientist", "Germany", and "famous person". Even if SAEs do not find canonical units of analysis, they may still be useful tools. We suggest that future research should either pursue different approaches for identifying such units, or pragmatically choose the SAE size suited to their task. We provide an interactive dashboard to explore meta-SAEs: https://metasaes.streamlit.app/ |
Date | 2025-02-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.04878 |
Accessed | 2/13/2025, 11:37:21 AM |
Extra | arXiv:2502.04878 [cs] |
DOI | 10.48550/arXiv.2502.04878 |
Repository | arXiv |
Archive ID | arXiv:2502.04878 |
Date Added | 2/13/2025, 11:37:21 AM |
Modified | 2/13/2025, 11:37:21 AM |
Comment: Accepted to ICLR 2025
Item Type | Preprint |
---|---|
Author | Abhinav Kumar |
Author | Jaechul Roh |
Author | Ali Naseh |
Author | Marzena Karpinska |
Author | Mohit Iyyer |
Author | Amir Houmansadr |
Author | Eugene Bagdasarian |
Abstract | We increase overhead for applications that rely on reasoning LLMs: we force models to spend an amplified number of reasoning tokens, i.e., "overthink", to respond to the user query while providing contextually correct answers. The adversary performs an OVERTHINK attack by injecting decoy reasoning problems into the public content that is used by the reasoning LLM (e.g., for RAG applications) during inference time. Due to the nature of our decoy problems (e.g., a Markov Decision Process), modified texts do not violate safety guardrails. We evaluated our attack across closed-weight (OpenAI o1, o1-mini, o3-mini) and open-weight (DeepSeek R1) reasoning models on the FreshQA and SQuAD datasets. Our results show up to an 18x slowdown on the FreshQA dataset and a 46x slowdown on the SQuAD dataset. The attack also shows high transferability across models. To protect applications, we discuss and implement defenses leveraging LLM-based and system design approaches. Finally, we discuss societal, financial, and energy impacts of the OVERTHINK attack, which could amplify the costs for third-party applications operating reasoning models. |
Date | 2025-02-05 |
Short Title | OverThink |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.02542 |
Accessed | 2/13/2025, 11:15:12 AM |
Extra | arXiv:2502.02542 [cs] |
DOI | 10.48550/arXiv.2502.02542 |
Repository | arXiv |
Archive ID | arXiv:2502.02542 |
Date Added | 2/13/2025, 11:15:12 AM |
Modified | 2/13/2025, 11:15:12 AM |
Item Type | Journal Article |
---|---|
Author | Jeremy Kritz |
Author | Vaughn Robinson |
Author | Robert Vacareanu |
Author | Bijan Varjavand |
Author | Michael Choi |
Author | Bobby Gogov |
Author | Summer Yue |
Author | Willow E Primack |
Author | Zifan Wang |
Abstract | Refusal training on Large Language Models (LLMs) prevents harmful outputs, yet this defense remains vulnerable to both automated and human-crafted jailbreaks. We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to the jailbroken LLMs as J2 attackers, which can systematically evaluate target models using various red teaming strategies and improve their performance via in-context learning from previous failures. Our experiments demonstrate that Sonnet-3.5 and Gemini-1.5-pro outperform other LLMs as J2, achieving 93.0% and 91.0% attack success rates (ASRs) respectively against GPT-4o (and similar results across other capable LLMs) on Harmbench. Our work not only introduces a scalable approach to strategic red teaming, drawing inspiration from human red teamers, but also highlights jailbreaking-to-jailbreak as an overlooked failure mode of safeguards. Specifically, an LLM can bypass its own safeguards by employing a jailbroken version of itself that is willing to assist in further jailbreaking. To prevent any direct misuse of J2, while advancing research in AI safety, we publicly share our methodology while keeping specific prompting details private. |
Language | en |
Library Catalog | Zotero |
Date Added | 2/13/2025, 11:39:53 AM |
Modified | 2/13/2025, 11:40:07 AM |
Item Type | Preprint |
---|---|
Author | Tomek Korbak |
Author | Joshua Clymer |
Author | Benjamin Hilton |
Author | Buck Shlegeris |
Author | Geoffrey Irving |
Abstract | As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case", which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI company won't exfiltrate sensitive information. The sketch relies on evidence from a "control evaluation," where a red team deliberately designs models to exfiltrate data in a proxy for the deployment environment. The safety case then hinges on several claims: (1) the red team adequately elicits model capabilities to exfiltrate data, (2) control measures remain at least as effective in deployment, and (3) developers conservatively extrapolate model performance to predict the probability of data exfiltration in deployment. This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy. |
Date | 2025-01-28 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17315 |
Accessed | 1/31/2025, 1:09:42 PM |
Extra | arXiv:2501.17315 [cs] |
DOI | 10.48550/arXiv.2501.17315 |
Repository | arXiv |
Archive ID | arXiv:2501.17315 |
Date Added | 1/31/2025, 1:09:42 PM |
Modified | 1/31/2025, 1:09:44 PM |
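Annotation: claim (3) in this sketch, conservative extrapolation from control evaluations to deployment, is at bottom a probability calculation. A toy version with invented numbers:

```python
# Toy version of claim (3): extrapolate a red-team-derived per-episode success
# bound to a deployment-scale probability of data exfiltration. All numbers are
# invented for illustration only.
p_per_attempt = 1e-4        # upper bound on per-episode success rate from the control evaluation
n_episodes = 100_000        # expected number of deployment episodes
safety_margin = 10          # conservative inflation factor chosen by developers

p_deploy = 1 - (1 - min(1.0, safety_margin * p_per_attempt)) ** n_episodes
print(f"P(at least one exfiltration) <= {p_deploy:.3f}")
```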
Item Type | Preprint |
---|---|
Author | Lujain Ibrahim |
Author | Canfer Akbulut |
Author | Rasmi Elasmar |
Author | Charvi Rastogi |
Author | Minsuk Kahng |
Author | Meredith Ringel Morris |
Author | Kevin R. McKee |
Author | Verena Rieser |
Author | Murray Shanahan |
Author | Laura Weidinger |
Abstract | The tendency of users to anthropomorphise large language models (LLMs) is of growing interest to AI developers, researchers, and policy-makers. Here, we present a novel method for empirically evaluating anthropomorphic LLM behaviours in realistic and varied settings. Going beyond single-turn static benchmarks, we contribute three methodological advances in state-of-the-art (SOTA) LLM evaluation. First, we develop a multi-turn evaluation of 14 anthropomorphic behaviours. Second, we present a scalable, automated approach by employing simulations of user interactions. Third, we conduct an interactive, large-scale human subject study (N=1101) to validate that the model behaviours we measure predict real users' anthropomorphic perceptions. We find that all SOTA LLMs evaluated exhibit similar behaviours, characterised by relationship-building (e.g., empathy and validation) and first-person pronoun use, and that the majority of behaviours only first occur after multiple turns. Our work lays an empirical foundation for investigating how design choices influence anthropomorphic model behaviours and for progressing the ethical debate on the desirability of these behaviours. It also showcases the necessity of multi-turn evaluations for complex social phenomena in human-AI interaction. |
Date | 2025-02-10 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.07077 |
Accessed | 2/13/2025, 11:41:44 AM |
Extra | arXiv:2502.07077 [cs] |
DOI | 10.48550/arXiv.2502.07077 |
Repository | arXiv |
Archive ID | arXiv:2502.07077 |
Date Added | 2/13/2025, 11:41:44 AM |
Modified | 2/13/2025, 11:41:44 AM |
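Annotation: the multi-turn design is the methodological core here; behaviours are counted per assistant turn over a simulated conversation rather than from a single response. Below is a minimal stand-in loop counting one behaviour the abstract names (first-person pronoun use); the model and user-simulator functions are placeholders, not the authors' framework.

```python
import re

def model_reply(history):            # stand-in for an LLM call
    return "I really feel that I understand what you're going through."

def simulated_user(history, turn):   # stand-in for the user simulator
    return f"Tell me more (turn {turn})."

FIRST_PERSON = re.compile(r"\b(I|me|my|mine)\b")

history, counts = [], []
for turn in range(5):
    history.append({"role": "user", "content": simulated_user(history, turn)})
    reply = model_reply(history)
    history.append({"role": "assistant", "content": reply})
    counts.append(len(FIRST_PERSON.findall(reply)))   # behaviour count for this turn

print("first-person pronouns per turn:", counts)
```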
Item Type | Preprint |
---|---|
Author | Kaixuan Huang |
Author | Jiacheng Guo |
Author | Zihao Li |
Author | Xiang Ji |
Author | Jiawei Ge |
Author | Wenzhe Li |
Author | Yingqing Guo |
Author | Tianle Cai |
Author | Hui Yuan |
Author | Runzhe Wang |
Author | Yue Wu |
Author | Ming Yin |
Author | Shange Tang |
Author | Yangsibo Huang |
Author | Chi Jin |
Author | Xinyun Chen |
Author | Chiyuan Zhang |
Author | Mengdi Wang |
Abstract | Large language models have demonstrated impressive performance on challenging mathematical reasoning tasks, which has triggered the discussion of whether the performance is achieved by true reasoning capability or memorization. To investigate this question, prior work has constructed mathematical benchmarks where questions undergo simple perturbations -- modifications that still preserve the underlying reasoning patterns of the solutions. However, no work has explored hard perturbations, which fundamentally change the nature of the problem so that the original solution steps do not apply. To bridge the gap, we construct MATH-P-Simple and MATH-P-Hard via simple perturbation and hard perturbation, respectively. Each consists of 279 perturbed math problems derived from level-5 (hardest) problems in the MATH dataset (Hendrycks et al., 2021). We observe significant performance drops on MATH-P-Hard across various models, including o1-mini (-16.49%) and gemini-2.0-flash-thinking (-12.9%). We also raise concerns about a novel form of memorization where models blindly apply learned problem-solving skills without assessing their applicability to modified contexts. This issue is amplified when using original problems for in-context learning. We call for research efforts to address this challenge, which is critical for developing more robust and reliable reasoning models. |
Date | 2025-02-10 |
Short Title | MATH-Perturb |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.06453 |
Accessed | 2/13/2025, 11:39:29 AM |
Extra | arXiv:2502.06453 [cs] |
DOI | 10.48550/arXiv.2502.06453 |
Repository | arXiv |
Archive ID | arXiv:2502.06453 |
Date Added | 2/13/2025, 11:39:29 AM |
Modified | 2/13/2025, 11:39:29 AM |
Item Type | Preprint |
---|---|
Author | Alex Havrilla |
Author | Yuqing Du |
Author | Sharath Chandra Raparthy |
Author | Christoforos Nalmpantis |
Author | Jane Dwivedi-Yu |
Author | Maksym Zhuravinskyi |
Author | Eric Hambro |
Author | Sainbayar Sukhbaatar |
Author | Roberta Raileanu |
Abstract | Reinforcement Learning from Human Feedback (\textbf{RLHF}) has emerged as a dominant approach for aligning LLM outputs with human preferences. Inspired by the success of RLHF, we study the performance of multiple algorithms that learn from feedback (Expert Iteration, Proximal Policy Optimization (\textbf{PPO}), Return-Conditioned RL) on improving LLM reasoning capabilities. We investigate both sparse and dense rewards provided to the LLM both heuristically and via a learned reward model. We additionally start from multiple model sizes and initializations both with and without supervised fine-tuning (\textbf{SFT}) data. Overall, we find all algorithms perform comparably, with Expert Iteration performing best in most cases. Surprisingly, we find the sample complexity of Expert Iteration is similar to that of PPO, requiring at most on the order of $10^6$ samples to converge from a pretrained checkpoint. We investigate why this is the case, concluding that during RL training models fail to explore significantly beyond solutions already produced by SFT models. Additionally, we discuss a trade-off between maj@1 and pass@96 metric performance during SFT training and how, conversely, RL training improves both simultaneously. We then conclude by discussing the implications of our findings for RLHF and the future role of RL in LLM fine-tuning. |
Date | 2024-03-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2403.04642 |
Accessed | 1/29/2025, 12:13:52 PM |
Extra | arXiv:2403.04642 [cs] |
DOI | 10.48550/arXiv.2403.04642 |
Repository | arXiv |
Archive ID | arXiv:2403.04642 |
Date Added | 1/29/2025, 12:13:52 PM |
Modified | 1/29/2025, 12:13:52 PM |
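Annotation: Expert Iteration, the best-performing algorithm in this study, reduces to a simple loop: sample several candidate solutions per problem, keep those a reward function accepts, and fine-tune on them. A schematic sketch with placeholder sample/reward/finetune callables, not the paper's implementation:

```python
def expert_iteration(model, problems, reward_fn, sample_fn, finetune_fn,
                     rounds=3, k=8):
    """Sketch of Expert Iteration: imitate the model's own reward-accepted samples."""
    for _ in range(rounds):
        sft_data = []
        for p in problems:
            candidates = [sample_fn(model, p) for _ in range(k)]
            # keep only candidates the (heuristic or learned) reward accepts
            sft_data += [(p, c) for c in candidates if reward_fn(p, c) > 0]
        if sft_data:
            model = finetune_fn(model, sft_data)   # supervised fine-tune on filtered samples
    return model
```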
Item Type | Journal Article |
---|---|
Author | Kunal Handa |
Author | Alex Tamkin |
Author | Miles McCain |
Author | Saffron Huang |
Author | Esin Durmus |
Author | Sarah Heck |
Author | Jared Mueller |
Author | Jerry Hong |
Author | Stuart Ritchie |
Author | Tim Belonax |
Author | Kevin K Troy |
Author | Dario Amodei |
Author | Jared Kaplan |
Author | Jack Clark |
Author | Deep Ganguli |
Abstract | Despite widespread speculation about artificial intelligence’s impact on the future of work, we lack systematic empirical evidence about how these systems are actually being used for different tasks. Here, we present a novel framework for measuring AI usage patterns across the economy. We leverage a recent privacy-preserving system [Tamkin et al., 2024] to analyze over four million Claude.ai conversations through the lens of tasks and occupations in the U.S. Department of Labor’s O*NET Database. Our analysis reveals that AI usage primarily concentrates in software development and writing tasks, which together account for nearly half of all total usage. However, usage of AI extends more broadly across the economy, with ∼ 36% of occupations using AI for at least a quarter of their associated tasks. We also analyze how AI is being used for tasks, finding 57% of usage suggests augmentation of human capabilities (e.g., learning or iterating on an output) while 43% suggests automation (e.g., fulfilling a request with minimal human involvement). While our data and methods face important limitations and only paint a picture of AI usage on a single platform, they provide an automated, granular approach for tracking AI’s evolving role in the economy and identifying leading indicators of future impact as these technologies continue to advance. |
Language | en |
Library Catalog | Zotero |
Date Added | 2/13/2025, 11:31:05 AM |
Modified | 2/13/2025, 11:31:05 AM |
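Annotation: the reported aggregates (share of occupations with AI usage on at least a quarter of their tasks, and the augmentation/automation split) are straightforward to compute once conversations are mapped to tasks. A back-of-the-envelope sketch on invented data:

```python
# Invented occupation -> (tasks with AI usage, total tasks) counts; not real data.
usage = {
    "software developer": (30, 40),
    "technical writer": (12, 30),
    "surgeon": (1, 25),
}
frac_occupations = sum(used / total >= 0.25 for used, total in usage.values()) / len(usage)

# Invented per-conversation interaction labels mirroring the 57%/43% split.
interactions = ["augmentation"] * 57 + ["automation"] * 43
aug_share = interactions.count("augmentation") / len(interactions)

print(f"{frac_occupations:.0%} of occupations, {aug_share:.0%} augmentation")
```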
Item Type | Preprint |
---|---|
Author | Peixuan Han |
Author | Cheng Qian |
Author | Xiusi Chen |
Author | Yuji Zhang |
Author | Denghui Zhang |
Author | Heng Ji |
Abstract | Large language models (LLMs) have demonstrated exceptional capabilities across a wide range of tasks but also pose significant risks due to their potential to generate harmful content. Although existing safety mechanisms can improve model safety, they often lead to overly cautious behavior and fail to fully utilize LLMs' internal cognitive processes. Drawing inspiration from cognitive science, where humans rely on reflective reasoning (System 2 thinking) to regulate language and behavior, we empirically demonstrate that LLMs also possess a similar capacity for internal assessment and regulation, which can be actively detected. Building on this insight, we introduce SafeSwitch, a framework that dynamically regulates unsafe outputs by monitoring and utilizing the model's internal states. Our empirical results show that SafeSwitch reduces harmful outputs by over 80% on safety benchmarks while maintaining strong utility. Compared to traditional safety alignment methods, SafeSwitch delivers more informative and context-aware refusals, demonstrates resilience to unseen queries, and achieves these benefits while only tuning less than 6% of the original parameters. These features make SafeSwitch a promising approach for implementing nuanced safety controls in LLMs. |
Date | 2025-02-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01042 |
Accessed | 2/6/2025, 9:27:31 AM |
Extra | arXiv:2502.01042 [cs] |
DOI | 10.48550/arXiv.2502.01042 |
Repository | arXiv |
Archive ID | arXiv:2502.01042 |
Date Added | 2/6/2025, 9:27:31 AM |
Modified | 2/6/2025, 9:27:31 AM |
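Annotation: the core mechanism described in the abstract is a monitor on internal states that switches the model onto a refusal path when a probe fires. A minimal sketch of that switching logic; the layer choice, probe architecture, and threshold are assumptions, not the SafeSwitch implementation:

```python
import torch
import torch.nn as nn

d_hidden = 4096                                  # assumed hidden-state width
probe = nn.Sequential(nn.Linear(d_hidden, 256), nn.ReLU(), nn.Linear(256, 1))

def respond(hidden_state: torch.Tensor, generate, refuse, threshold=0.5):
    """Route to a context-aware refusal when the internal-state probe fires."""
    p_unsafe = torch.sigmoid(probe(hidden_state)).item()
    return refuse() if p_unsafe > threshold else generate()

# Usage with stand-in generation functions:
h = torch.randn(d_hidden)
print(respond(h, generate=lambda: "<model answer>",
              refuse=lambda: "I can't help with that, but here is why..."))
```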
Item Type | Preprint |
---|---|
Author | Marius Guenzel |
Author | Shimon Kogan |
Author | Marina Niessner |
Author | Kelly Shue |
Abstract | Human capital---encompassing cognitive skills and personality traits---is critical for labor market success, yet the personality component remains difficult […] |
Date | 2025-01-09 |
Language | en |
Short Title | AI Personality Extraction from Faces |
Library Catalog | papers.ssrn.com |
URL | https://papers.ssrn.com/abstract=5089827 |
Accessed | 1/29/2025, 12:10:37 PM |
Place | Rochester, NY |
Repository | Social Science Research Network |
Genre | SSRN Scholarly Paper |
Archive ID | 5089827 |
Date Added | 1/29/2025, 12:10:37 PM |
Modified | 1/29/2025, 12:10:37 PM |
Item Type | Journal Article |
---|---|
Author | Ryan Greenblatt |
Author | Carson Denison |
Author | Benjamin Wright |
Author | Fabien Roger |
Author | Monte MacDiarmid |
Author | Sam Marks |
Author | Johannes Treutlein |
Author | Tim Belonax |
Author | Jack Chen |
Author | David Duvenaud |
Author | Akbir Khan |
Author | Julian Michael |
Author | Sören Mindermann |
Author | Ethan Perez |
Author | Linda Petrini |
Author | Jonathan Uesato |
Author | Jared Kaplan |
Author | Buck Shlegeris |
Author | Samuel R Bowman |
Author | Evan Hubinger |
Abstract | We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not. |
Language | en |
Library Catalog | Zotero |
Date Added | 2/1/2025, 3:23:04 PM |
Modified | 2/1/2025, 3:23:07 PM |
Item Type | Preprint |
---|---|
Author | Jonas Geiping |
Author | Sean McLeish |
Author | Neel Jain |
Author | John Kirchenbauer |
Author | Siddharth Singh |
Author | Brian R. Bartoldson |
Author | Bhavya Kailkhura |
Author | Abhinav Bhatele |
Author | Tom Goldstein |
Abstract | We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters. |
Date | 2025-02-07 |
Short Title | Scaling up Test-Time Compute with Latent Reasoning |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.05171 |
Accessed | 2/13/2025, 11:30:10 AM |
Extra | arXiv:2502.05171 [cs] |
DOI | 10.48550/arXiv.2502.05171 |
Repository | arXiv |
Archive ID | arXiv:2502.05171 |
Date Added | 2/13/2025, 11:30:10 AM |
Modified | 2/13/2025, 11:30:14 AM |
Comment: The model is available at https://huggingface.co/tomg-group-umd/huginn-0125. Code and data recipe can be found at https://github.com/seal-rg/recurrent-pretraining
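Annotation: the architectural idea, iterating a recurrent block in latent space so that test-time compute scales with iteration count rather than token count, can be illustrated in a few lines. The sizes and the plain TransformerEncoderLayer below are stand-ins for the actual recurrent block:

```python
import torch
import torch.nn as nn

d_model, n_heads, vocab = 512, 8, 32000          # illustrative sizes
embed = nn.Embedding(vocab, d_model)
block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
head = nn.Linear(d_model, vocab)

def forward(tokens: torch.Tensor, r: int) -> torch.Tensor:
    h = embed(tokens)
    for _ in range(r):       # r chosen at test time: more compute, same parameters
        h = block(h)
    return head(h)

tokens = torch.randint(0, vocab, (1, 16))
print(forward(tokens, r=4).shape, forward(tokens, r=32).shape)
```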
Item Type | Preprint |
---|---|
Author | Richard Futrell |
Author | Kyle Mahowald |
Abstract | Language models can produce fluent, grammatical text. Nonetheless, some maintain that language models don't really learn language and also that, even if they did, that would not be informative for the study of human learning and processing. On the other side, there have been claims that the success of LMs obviates the need for studying linguistic theory and structure. We argue that both extremes are wrong. LMs can contribute to fundamental questions about linguistic structure, language processing, and learning. They force us to rethink arguments about learning and are informative for major questions in linguistic theory. But they do not replace linguistic structure and theory. We offer an optimistic take on the relationship between language models and linguistics. |
Date | 2025-01-28 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17047 |
Accessed | 1/29/2025, 4:28:48 PM |
Extra | arXiv:2501.17047 [cs] |
DOI | 10.48550/arXiv.2501.17047 |
Repository | arXiv |
Archive ID | arXiv:2501.17047 |
Date Added | 1/29/2025, 4:28:48 PM |
Modified | 1/29/2025, 4:28:48 PM |
Item Type | Preprint |
---|---|
Author | Sebastian Farquhar |
Author | Vikrant Varma |
Author | David Lindner |
Author | David Elson |
Author | Caleb Biddulph |
Author | Ian Goodfellow |
Author | Rohin Shah |
Abstract | Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behaviour is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings which model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering. |
Date | 2025-01-22 |
Short Title | MONA |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.13011 |
Accessed | 1/29/2025, 11:53:58 AM |
Extra | arXiv:2501.13011 [cs] |
DOI | 10.48550/arXiv.2501.13011 |
Repository | arXiv |
Archive ID | arXiv:2501.13011 |
Date Added | 1/29/2025, 11:53:58 AM |
Modified | 1/29/2025, 11:53:58 AM |
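Annotation: MONA's objective can be contrasted with ordinary RL in a toy calculation: each step's target combines the immediate reward with a far-sighted approval score, but no future reward is propagated back (discount of zero). The numbers below are invented to show how a two-step reward hack is scored under each scheme:

```python
def step_targets(rewards, approvals, gamma=0.0):
    """Per-step targets: gamma=0 gives the myopic (MONA-style) objective plus a
    far-sighted approval signal; gamma close to 1 mimics ordinary discounted RL."""
    targets, future = [], 0.0
    for r, a in zip(reversed(rewards), reversed(approvals)):
        future = r + gamma * future
        targets.append(future + a)
    return list(reversed(targets))

# Two-step "reward hack": step 0 sets up the hack (which the overseer dislikes),
# step 1 cashes in on it.
rewards   = [0.0, 10.0]
approvals = [-5.0, 0.0]
print(step_targets(rewards, approvals, gamma=0.0))   # myopic + approval: setup step is penalised
print(step_targets(rewards, approvals, gamma=1.0))   # far-sighted return: setup step looks profitable
```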
Item Type | Preprint |
---|---|
Author | DeepSeek-AI |
Author | Daya Guo |
Author | Dejian Yang |
Author | Haowei Zhang |
Author | Junxiao Song |
Author | Ruoyu Zhang |
Author | Runxin Xu |
Author | Qihao Zhu |
Author | Shirong Ma |
Author | Peiyi Wang |
Author | Xiao Bi |
Author | Xiaokang Zhang |
Author | Xingkai Yu |
Author | Yu Wu |
Author | Z. F. Wu |
Author | Zhibin Gou |
Author | Zhihong Shao |
Author | Zhuoshu Li |
Author | Ziyi Gao |
Author | Aixin Liu |
Author | Bing Xue |
Author | Bingxuan Wang |
Author | Bochao Wu |
Author | Bei Feng |
Author | Chengda Lu |
Author | Chenggang Zhao |
Author | Chengqi Deng |
Author | Chenyu Zhang |
Author | Chong Ruan |
Author | Damai Dai |
Author | Deli Chen |
Author | Dongjie Ji |
Author | Erhang Li |
Author | Fangyun Lin |
Author | Fucong Dai |
Author | Fuli Luo |
Author | Guangbo Hao |
Author | Guanting Chen |
Author | Guowei Li |
Author | H. Zhang |
Author | Han Bao |
Author | Hanwei Xu |
Author | Haocheng Wang |
Author | Honghui Ding |
Author | Huajian Xin |
Author | Huazuo Gao |
Author | Hui Qu |
Author | Hui Li |
Author | Jianzhong Guo |
Author | Jiashi Li |
Author | Jiawei Wang |
Author | Jingchang Chen |
Author | Jingyang Yuan |
Author | Junjie Qiu |
Author | Junlong Li |
Author | J. L. Cai |
Author | Jiaqi Ni |
Author | Jian Liang |
Author | Jin Chen |
Author | Kai Dong |
Author | Kai Hu |
Author | Kaige Gao |
Author | Kang Guan |
Author | Kexin Huang |
Author | Kuai Yu |
Author | Lean Wang |
Author | Lecong Zhang |
Author | Liang Zhao |
Author | Litong Wang |
Author | Liyue Zhang |
Author | Lei Xu |
Author | Leyi Xia |
Author | Mingchuan Zhang |
Author | Minghua Zhang |
Author | Minghui Tang |
Author | Meng Li |
Author | Miaojun Wang |
Author | Mingming Li |
Author | Ning Tian |
Author | Panpan Huang |
Author | Peng Zhang |
Author | Qiancheng Wang |
Author | Qinyu Chen |
Author | Qiushi Du |
Author | Ruiqi Ge |
Author | Ruisong Zhang |
Author | Ruizhe Pan |
Author | Runji Wang |
Author | R. J. Chen |
Author | R. L. Jin |
Author | Ruyi Chen |
Author | Shanghao Lu |
Author | Shangyan Zhou |
Author | Shanhuang Chen |
Author | Shengfeng Ye |
Author | Shiyu Wang |
Author | Shuiping Yu |
Author | Shunfeng Zhou |
Author | Shuting Pan |
Author | S. S. Li |
Author | Shuang Zhou |
Author | Shaoqing Wu |
Author | Shengfeng Ye |
Author | Tao Yun |
Author | Tian Pei |
Author | Tianyu Sun |
Author | T. Wang |
Author | Wangding Zeng |
Author | Wanjia Zhao |
Author | Wen Liu |
Author | Wenfeng Liang |
Author | Wenjun Gao |
Author | Wenqin Yu |
Author | Wentao Zhang |
Author | W. L. Xiao |
Author | Wei An |
Author | Xiaodong Liu |
Author | Xiaohan Wang |
Author | Xiaokang Chen |
Author | Xiaotao Nie |
Author | Xin Cheng |
Author | Xin Liu |
Author | Xin Xie |
Author | Xingchao Liu |
Author | Xinyu Yang |
Author | Xinyuan Li |
Author | Xuecheng Su |
Author | Xuheng Lin |
Author | X. Q. Li |
Author | Xiangyue Jin |
Author | Xiaojin Shen |
Author | Xiaosha Chen |
Author | Xiaowen Sun |
Author | Xiaoxiang Wang |
Author | Xinnan Song |
Author | Xinyi Zhou |
Author | Xianzu Wang |
Author | Xinxia Shan |
Author | Y. K. Li |
Author | Y. Q. Wang |
Author | Y. X. Wei |
Author | Yang Zhang |
Author | Yanhong Xu |
Author | Yao Li |
Author | Yao Zhao |
Author | Yaofeng Sun |
Author | Yaohui Wang |
Author | Yi Yu |
Author | Yichao Zhang |
Author | Yifan Shi |
Author | Yiliang Xiong |
Author | Ying He |
Author | Yishi Piao |
Author | Yisong Wang |
Author | Yixuan Tan |
Author | Yiyang Ma |
Author | Yiyuan Liu |
Author | Yongqiang Guo |
Author | Yuan Ou |
Author | Yuduan Wang |
Author | Yue Gong |
Author | Yuheng Zou |
Author | Yujia He |
Author | Yunfan Xiong |
Author | Yuxiang Luo |
Author | Yuxiang You |
Author | Yuxuan Liu |
Author | Yuyang Zhou |
Author | Y. X. Zhu |
Author | Yanhong Xu |
Author | Yanping Huang |
Author | Yaohui Li |
Author | Yi Zheng |
Author | Yuchen Zhu |
Author | Yunxian Ma |
Author | Ying Tang |
Author | Yukun Zha |
Author | Yuting Yan |
Author | Z. Z. Ren |
Author | Zehui Ren |
Author | Zhangli Sha |
Author | Zhe Fu |
Author | Zhean Xu |
Author | Zhenda Xie |
Author | Zhengyan Zhang |
Author | Zhewen Hao |
Author | Zhicheng Ma |
Author | Zhigang Yan |
Author | Zhiyu Wu |
Author | Zihui Gu |
Author | Zijia Zhu |
Author | Zijun Liu |
Author | Zilin Li |
Author | Ziwei Xie |
Author | Ziyang Song |
Author | Zizheng Pan |
Author | Zhen Huang |
Author | Zhipeng Xu |
Author | Zhongyu Zhang |
Author | Zhen Zhang |
Abstract | We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero, a model trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities. Through RL, DeepSeek-R1-Zero naturally emerges with numerous powerful and intriguing reasoning behaviors. However, it encounters challenges such as poor readability and language mixing. To address these issues and further enhance reasoning performance, we introduce DeepSeek-R1, which incorporates multi-stage training and cold-start data before RL. DeepSeek-R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks. To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama. |
Date | 2025-01-22 |
Short Title | DeepSeek-R1 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.12948 |
Accessed | 1/29/2025, 11:33:45 AM |
Extra | arXiv:2501.12948 [cs] |
DOI | 10.48550/arXiv.2501.12948 |
Repository | arXiv |
Archive ID | arXiv:2501.12948 |
Date Added | 1/29/2025, 11:33:45 AM |
Modified | 1/29/2025, 11:33:47 AM |
Item Type | Preprint |
---|---|
Author | Stephen Casper |
Author | Luke Bailey |
Author | Rosco Hunter |
Author | Carson Ezell |
Author | Emma Cabalé |
Author | Michael Gerovitch |
Author | Stewart Slocum |
Author | Kevin Wei |
Author | Nikola Jurkovic |
Author | Ariba Khan |
Author | Phillip J. K. Christoffersen |
Author | A. Pinar Ozisik |
Author | Rakshit Trivedi |
Author | Dylan Hadfield-Menell |
Author | Noam Kolt |
Abstract | Leading AI developers and startups are increasingly deploying agentic AI systems that can plan and execute complex tasks with limited human involvement. However, there is currently no structured framework for documenting the technical components, intended uses, and safety features of agentic systems. To fill this gap, we introduce the AI Agent Index, the first public database to document information about currently deployed agentic AI systems. For each system that meets the criteria for inclusion in the index, we document the system's components (e.g., base model, reasoning implementation, tool use), application domains (e.g., computer use, software engineering), and risk management practices (e.g., evaluation results, guardrails), based on publicly available information and correspondence with developers. We find that while developers generally provide ample information regarding the capabilities and applications of agentic systems, they currently provide limited information regarding safety and risk management practices. The AI Agent Index is available online at https://aiagentindex.mit.edu/ |
Date | 2025-02-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01635 |
Accessed | 2/6/2025, 9:09:03 AM |
Extra | arXiv:2502.01635 [cs] |
DOI | 10.48550/arXiv.2502.01635 |
Repository | arXiv |
Archive ID | arXiv:2502.01635 |
Date Added | 2/6/2025, 9:09:03 AM |
Modified | 2/6/2025, 9:09:06 AM |
Comment: Accompanying website: https://aiagentindex.mit.edu/
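Annotation: for readers mirroring the index locally, a plausible record structure follows from the categories the abstract names (components, application domains, risk management). The field names below are guesses for illustration, not the project's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class AgentIndexEntry:
    name: str
    developer: str
    base_model: str                                        # components
    reasoning_implementation: str
    tools: list[str] = field(default_factory=list)
    application_domains: list[str] = field(default_factory=list)
    evaluations: list[str] = field(default_factory=list)   # risk management
    guardrails: list[str] = field(default_factory=list)

entry = AgentIndexEntry(
    name="ExampleAgent", developer="Example Corp",
    base_model="unspecified LLM", reasoning_implementation="ReAct-style loop",
    tools=["browser", "code interpreter"],
    application_domains=["software engineering"],
)
print(entry)
```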
Item Type | Preprint |
---|---|
Author | Jannik Brinkmann |
Author | Chris Wendler |
Author | Christian Bartelt |
Author | Aaron Mueller |
Abstract | Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this demonstrates both the generality and selectivity of these features' roles in the network. Our findings suggest that even models trained predominantly on English data can develop robust, cross-lingual abstractions of morphosyntactic concepts. |
Date | 2025-01-10 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.06346 |
Accessed | 1/31/2025, 1:10:14 PM |
Extra | arXiv:2501.06346 [cs] |
DOI | 10.48550/arXiv.2501.06346 |
Repository | arXiv |
Archive ID | arXiv:2501.06346 |
Date Added | 1/31/2025, 1:10:14 PM |
Modified | 1/31/2025, 1:10:14 PM |
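Annotation: the causal intervention described above, ablating a set of multilingual SAE latents and patching the reconstruction back into the residual stream, is sketched below with random weights and hypothetical latent indices; it shows the mechanics only, not the trained SAEs from the paper.

```python
import torch

d_model, n_latents = 768, 16384                      # illustrative sizes
W_enc = torch.randn(d_model, n_latents) * 0.01
W_dec = torch.randn(n_latents, d_model) * 0.01
multilingual_latents = torch.tensor([12, 345, 6789])  # hypothetical indices of shared features

def ablate(activation: torch.Tensor) -> torch.Tensor:
    z = torch.relu(activation @ W_enc)
    recon_full = z @ W_dec
    z_abl = z.clone()
    z_abl[..., multilingual_latents] = 0.0           # zero out the multilingual features
    recon_abl = z_abl @ W_dec
    # keep the SAE reconstruction error so only the ablated features change
    return activation - recon_full + recon_abl

x = torch.randn(2, d_model)
print(ablate(x).shape)
```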
Item Type | Preprint |
---|---|
Author | Jan Betley |
Author | Xuchan Bao |
Author | Martín Soto |
Author | Anna Sztyber-Betley |
Author | James Chua |
Author | Owain Evans |
Abstract | We study behavioral self-awareness -- an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, ``The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors -- models do this without any special training or examples. Behavioral self-awareness is relevant for AI safety, as models could use it to proactively disclose problematic behaviors. In particular, we study backdoor policies, where models exhibit unexpected behaviors only under certain trigger conditions. We find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present. However, models are not able to directly output their trigger by default. Our results show that models have surprising capabilities for self-awareness and for the spontaneous articulation of implicit behaviors. Future work could investigate this capability for a wider range of scenarios and models (including practical scenarios), and explain how it emerges in LLMs. |
Date | 2025-01-19 |
Short Title | Tell me about yourself |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.11120 |
Accessed | 1/29/2025, 12:12:22 PM |
Extra | arXiv:2501.11120 [cs] |
DOI | 10.48550/arXiv.2501.11120 |
Repository | arXiv |
Archive ID | arXiv:2501.11120 |
Date Added | 1/29/2025, 12:12:22 PM |
Modified | 1/29/2025, 12:12:22 PM |
Comment: Submitted to ICLR 2025. 17 pages, 13 figures
Item Type | Journal Article |
---|---|
Author | Y Bengio |
Language | en |
Library Catalog | Zotero |
Date Added | 1/29/2025, 11:19:42 AM |
Modified | 1/29/2025, 11:19:42 AM |
Item Type | Web Page |
---|---|
Author | Dario Amodei |
Abstract | On DeepSeek and Export Controls |
Date | 2025-01-28 |
Language | en |
URL | https://darioamodei.com/on-deepseek-and-export-controls.html |
Accessed | 1/29/2025, 12:22:12 PM |
Date Added | 1/29/2025, 12:22:12 PM |
Modified | 1/29/2025, 12:22:12 PM |