• The Computational Complexity of Circuit Discovery for Inner Interpretability

    Item Type Conference Paper
    Author Federico Adolfi
    Author Martina G. Vilas
    Author Todd Wareham
    Abstract Many proposed applications of neural networks in machine learning, cognitive/brain science, and society hinge on the feasibility of inner interpretability via circuit discovery. This calls for empirical and theoretical explorations of viable algorithmic options. Despite advances in the design and testing of heuristics, there are concerns about their scalability and faithfulness at a time when we lack understanding of the complexity properties of the problems they are deployed to solve. To address this, we study circuit discovery with classical and parameterized computational complexity theory: (1) we describe a conceptual scaffolding to reason about circuit finding queries in terms of affordances for description, explanation, prediction and control; (2) we formalize a comprehensive set of queries for mechanistic explanation, and propose a formal framework for their analysis; (3) we use it to settle the complexity of many query variants and relaxations of practical interest on multi-layer perceptrons. Our findings reveal a challenging complexity landscape. Many queries are intractable, remain fixed-parameter intractable relative to model/circuit features, and inapproximable under additive, multiplicative, and probabilistic approximation schemes. To navigate this landscape, we prove there exist transformations to tackle some of these hard problems with better-understood heuristics, and prove the tractability or fixed-parameter tractability of more modest queries which retain useful affordances. This framework allows us to understand the scope and limits of interpretability queries, explore viable options, and compare their resource demands on existing and future architectures.
    Date 2024/10/04
    Language en
    Library Catalog openreview.net
    URL https://openreview.net/forum?id=QogcGNXJVw
    Accessed 5/11/2025, 7:22:14 PM
    Conference Name The Thirteenth International Conference on Learning Representations
    Date Added 5/11/2025, 7:22:14 PM
    Modified 5/11/2025, 7:22:18 PM

    Attachments

    • Full Text PDF
  • AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Item Type Preprint
    Author Maksym Andriushchenko
    Author Alexandra Souly
    Author Mateusz Dziemian
    Author Derek Duenas
    Author Maxwell Lin
    Author Justin Wang
    Author Dan Hendrycks
    Author Andy Zou
    Author Zico Kolter
    Author Matt Fredrikson
    Author Eric Winsor
    Author Jerome Wynne
    Author Yarin Gal
    Author Xander Davies
    Abstract The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
    Date 2025-04-18
    Short Title AgentHarm
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.09024
    Accessed 5/12/2025, 5:44:29 PM
    Extra arXiv:2410.09024 [cs]
    DOI 10.48550/arXiv.2410.09024
    Repository arXiv
    Archive ID arXiv:2410.09024
    Date Added 5/12/2025, 5:44:29 PM
    Modified 5/12/2025, 5:44:29 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted at ICLR 2025

    Attachments

    • Preprint PDF
    • Snapshot
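
    Illustrative sketch (not code from the paper): loading the released benchmark for local inspection, assuming only the dataset ID given in the abstract. Configuration and split names are read from the Hub rather than assumed, and the dataset may require accepting its terms on Hugging Face first.

      # Sketch: pull the publicly released AgentHarm benchmark for inspection.
      # Only the dataset ID from the abstract is assumed; configs/splits are discovered at runtime.
      from datasets import get_dataset_config_names, load_dataset

      DATASET_ID = "ai-safety-institute/AgentHarm"

      configs = get_dataset_config_names(DATASET_ID)   # list the available task subsets
      print("available configs:", configs)

      ds = load_dataset(DATASET_ID, configs[0])        # load the first subset
      print(ds)                                        # splits and their sizes
      first_split = next(iter(ds))
      print(ds[first_split][0])                        # one example record
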
  • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

    Item Type Preprint
    Author Maksym Andriushchenko
    Author Francesco Croce
    Author Nicolas Flammarion
    Abstract We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
    Date 2025-04-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.02151
    Accessed 5/12/2025, 5:44:39 PM
    Extra arXiv:2404.02151 [cs]
    DOI 10.48550/arXiv.2404.02151
    Repository arXiv
    Archive ID arXiv:2404.02151
    Date Added 5/12/2025, 5:44:39 PM
    Modified 5/12/2025, 5:44:39 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security
    • Statistics - Machine Learning

    Notes:

    • Comment: Accepted at ICLR 2025. Updates in the v3: GPT-4o and Claude 3.5 Sonnet results, improved writing. Updates in the v2: more models (Llama3, Phi-3, Nemotron-4-340B), jailbreak artifacts for all attacks are available, evaluation with different judges (Llama-3-70B and Llama Guard 2), more experiments (convergence plots, ablation on the suffix length for random search), examples of jailbroken generation

    Attachments

    • Preprint PDF
    • Snapshot
  • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Item Type Preprint
    Author Jan Betley
    Author Daniel Tan
    Author Niels Warncke
    Author Anna Sztyber-Betley
    Author Xuchan Bao
    Author Martín Soto
    Author Nathan Labenz
    Author Owain Evans
    Abstract We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
    Date 2025-05-04
    Short Title Emergent Misalignment
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.17424
    Accessed 5/12/2025, 1:07:25 PM
    Extra arXiv:2502.17424 [cs]
    DOI 10.48550/arXiv.2502.17424
    Repository arXiv
    Archive ID arXiv:2502.17424
    Date Added 5/12/2025, 1:07:25 PM
    Modified 5/12/2025, 1:07:27 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: 40 pages, 38 figures An earlier revision of this paper was submitted to ICML. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8)

    Attachments

    • Preprint PDF
    • Snapshot
  • An alignment safety case sketch based on debate

    Item Type Preprint
    Author Marie Davidsen Buhl
    Author Jacob Pfau
    Author Benjamin Hilton
    Author Geoffrey Irving
    Abstract If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
    Date 2025-05-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.03989
    Accessed 5/12/2025, 12:30:22 PM
    Extra arXiv:2505.03989 [cs]
    DOI 10.48550/arXiv.2505.03989
    Repository arXiv
    Archive ID arXiv:2505.03989
    Date Added 5/12/2025, 12:30:22 PM
    Modified 5/12/2025, 12:30:22 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Bare Minimum Mitigations for Autonomous AI Development

    Item Type Preprint
    Author Joshua Clymer
    Author Isabella Duan
    Author Chris Cundy
    Author Yawen Duan
    Author Fynn Heide
    Author Chaochao Lu
    Author Sören Mindermann
    Author Conor McGurk
    Author Xudong Pan
    Author Saad Siddiqui
    Author Jingren Wang
    Author Min Yang
    Author Xianyuan Zhan
    Abstract Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis on the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development.
    Date 2025-04-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.15416
    Accessed 5/12/2025, 1:25:21 PM
    Extra arXiv:2504.15416 [cs]
    DOI 10.48550/arXiv.2504.15416
    Repository arXiv
    Archive ID arXiv:2504.15416
    Date Added 5/12/2025, 1:25:22 PM
    Modified 5/12/2025, 1:25:24 PM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: 12 pages, 2 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

    Item Type Preprint
    Author Hongzhe Du
    Author Weikai Li
    Author Min Cai
    Author Karim Saraipour
    Author Zimin Zhang
    Author Himabindu Lakkaraju
    Author Yizhou Sun
    Author Shichang Zhang
    Abstract Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.
    Date 2025-04-03
    Short Title How Post-Training Reshapes LLMs
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.02904
    Accessed 5/11/2025, 6:41:58 PM
    Extra arXiv:2504.02904 [cs]
    DOI 10.48550/arXiv.2504.02904
    Repository arXiv
    Archive ID arXiv:2504.02904
    Date Added 5/11/2025, 6:41:58 PM
    Modified 5/11/2025, 6:41:58 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
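
    Illustrative sketch (not the paper's code): the abstract's claim that truthfulness and refusal behave like linear directions in hidden space suggests the usual difference-of-means recipe, estimated from contrastive examples and applied as an additive intervention. The activation source and layer choice below are assumptions, with random arrays standing in for real activations.

      # Estimate a behaviour direction as a difference of mean activations at one layer,
      # then apply it as a simple additive steering intervention. Generic illustration only.
      import numpy as np

      def direction_from_contrast(pos_acts, neg_acts):
          """pos_acts/neg_acts: (n_examples, d_model) activations for contrasting behaviours."""
          d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
          return d / np.linalg.norm(d)

      def steer(hidden, direction, alpha):
          """Add alpha * direction to a hidden state during the forward pass."""
          return hidden + alpha * direction

      rng = np.random.default_rng(0)                     # random stand-ins for real activations
      truthful = rng.normal(size=(128, 4096))
      untruthful = rng.normal(size=(128, 4096))
      v_truth = direction_from_contrast(truthful, untruthful)
      steered = steer(rng.normal(size=4096), v_truth, alpha=4.0)
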
  • REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

    Item Type Preprint
    Author Divyansh Garg
    Author Shaun VanWeelden
    Author Diego Caples
    Author Andis Draguns
    Author Nikil Ravi
    Author Pranav Putta
    Author Naman Garg
    Author Tomas Abraham
    Author Michael Lara
    Author Federico Lopez
    Author James Liu
    Author Atharva Gundawar
    Author Prannay Hebbar
    Author Youngchul Joo
    Author Jindong Gu
    Author Charles London
    Author Christian Schroeder de Witt
    Author Sumeet Motwani
    Abstract We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
    Date 2025-04-17
    Short Title REAL
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.11543
    Accessed 5/11/2025, 7:06:20 PM
    Extra arXiv:2504.11543 [cs]
    DOI 10.48550/arXiv.2504.11543
    Repository arXiv
    Archive ID arXiv:2504.11543
    Date Added 5/11/2025, 7:06:20 PM
    Modified 5/11/2025, 7:06:20 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: The websites, framework, and leaderboard are available at https://realevals.xyz and https://github.com/agi-inc/REAL

    Attachments

    • Full Text PDF
    • Snapshot
  • Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark

    Item Type Journal Article
    Author Jasper Götting
    Author Pedro Medeiros
    Author Jon G Sanders
    Author Nathaniel Li
    Author Long Phan
    Author Karam Elabd
    Author Lennart Justen
    Author Dan Hendrycks
    Author Seth Donoughe
    Abstract We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
    Language en
    Library Catalog Zotero
    Date Added 5/11/2025, 6:41:18 PM
    Modified 5/11/2025, 6:41:18 PM

    Attachments

    • PDF
  • Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions

    Item Type Journal Article
    Author Saffron Huang
    Author Esin Durmus
    Author Miles McCain
    Author Kunal Handa
    Author Alex Tamkin
    Author Jerry Hong
    Author Michael Stern
    Author Arushi Somani
    Author Xiuruo Zhang
    Abstract AI assistants can impart value judgments that shape people’s decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like “moral nihilism”. While some values appear consistently across contexts (e.g. “transparency”), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, “harm prevention” emerges when Claude resists users, “historical accuracy” when responding to queries about controversial events, “healthy boundaries” when asked for relationship advice, and “human agency” in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.
    Language en
    Library Catalog Zotero
    Date Added 5/11/2025, 6:46:26 PM
    Modified 5/11/2025, 6:46:26 PM

    Attachments

    • PDF
  • Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds

    Item Type Preprint
    Author Heidy Khlaaf
    Author Sarah Myers West
    Abstract Risk thresholds provide a measure of the level of risk exposure that a society or individual is willing to withstand, ultimately shaping how we determine the safety of technological systems. Against the backdrop of the Cold War, the first risk analyses, such as those devised for nuclear systems, cemented societally accepted risk thresholds against which safety-critical and defense systems are now evaluated. But today, the appropriate risk tolerances for AI systems have yet to be agreed on by global governing efforts, despite the need for democratic deliberation regarding the acceptable levels of harm to human life. Absent such AI risk thresholds, AI technologists-primarily industry labs, as well as "AI safety" focused organizations-have instead advocated for risk tolerances skewed by a purported AI arms race and speculative "existential" risks, taking over the arbitration of risk determinations with life-or-death consequences, subverting democratic processes. In this paper, we demonstrate how such approaches have allowed AI technologists to engage in "safety revisionism," substituting traditional safety methods and terminology with ill-defined alternatives that vie for the accelerated adoption of military AI uses at the cost of lowered safety and security thresholds. We explore how the current trajectory for AI risk determination and evaluation for foundation model use within national security is poised for a race to the bottom, to the detriment of the US's national security interests. Safety-critical and defense systems must comply with assurance frameworks that are aligned with established risk thresholds, and foundation models are no exception. As such, development of evaluation frameworks for AI-based military systems must preserve the safety and security of US critical and defense infrastructure, and remain in alignment with international humanitarian law.
    Date 2025-04-21
    Short Title Safety Co-Option and Compromised National Security
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.15088
    Accessed 5/11/2025, 6:38:07 PM
    Extra arXiv:2504.15088 [cs]
    DOI 10.48550/arXiv.2504.15088
    Repository arXiv
    Archive ID arXiv:2504.15088
    Date Added 5/11/2025, 6:38:07 PM
    Modified 5/11/2025, 6:38:07 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

    Item Type Conference Paper
    Author Sunnie S. Y. Kim
    Author Jennifer Wortman Vaughan
    Author Q. Vera Liao
    Author Tania Lombrozo
    Author Olga Russakovsky
    Abstract Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.
    Date 2025-04-26
    Short Title Fostering Appropriate Reliance on Large Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08554
    Accessed 5/12/2025, 5:29:28 PM
    Extra arXiv:2502.08554 [cs]
    Pages 1-19
    Proceedings Title Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems
    DOI 10.1145/3706598.3714020
    Date Added 5/12/2025, 5:29:28 PM
    Modified 5/12/2025, 5:29:28 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: CHI 2025. This version includes the appendix

    Attachments

    • Preprint PDF
    • Snapshot
  • You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

    Item Type Preprint
    Author Simon Pepin Lehalleur
    Author Jesse Hoogland
    Author Matthew Farrugia-Roberts
    Author Susan Wei
    Author Alexander Gietelink Oldenziel
    Author George Wang
    Author Liam Carroll
    Author Daniel Murfet
    Abstract In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.
    Date 2025-02-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.05475
    Accessed 5/12/2025, 5:12:33 PM
    Extra arXiv:2502.05475 [cs]
    DOI 10.48550/arXiv.2502.05475
    Repository arXiv
    Archive ID arXiv:2502.05475
    Date Added 5/12/2025, 5:12:33 PM
    Modified 5/12/2025, 5:12:36 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

    Item Type Preprint
    Author Simon Lermen
    Author Mateusz Dziemian
    Author Natalia Pérez-Campanero Antolín
    Abstract We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.
    Date 2025-04-10
    Short Title Deceptive Automated Interpretability
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.07831
    Accessed 5/11/2025, 6:36:53 PM
    Extra arXiv:2504.07831 [cs] version: 1
    DOI 10.48550/arXiv.2504.07831
    Repository arXiv
    Archive ID arXiv:2504.07831
    Date Added 5/11/2025, 6:36:53 PM
    Modified 5/11/2025, 6:36:57 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Reasoning Models Can Be Effective Without Thinking

    Item Type Preprint
    Author Wenjie Ma
    Author Jingxuan He
    Author Charlie Snell
    Author Tyler Griggs
    Author Sewon Min
    Author Matei Zaharia
    Abstract Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on ACM 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.
    Date 2025-04-14
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.09858
    Accessed 5/12/2025, 5:26:26 PM
    Extra arXiv:2504.09858 [cs] version: 1
    DOI 10.48550/arXiv.2504.09858
    Repository arXiv
    Archive ID arXiv:2504.09858
    Date Added 5/12/2025, 5:26:26 PM
    Modified 5/12/2025, 5:26:29 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 33 pages, 7 main figures, 2 tables

    Attachments

    • Preprint PDF
    • Snapshot
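
    Illustrative sketch (not the paper's implementation) of the parallel-scaling idea in the abstract: sample N outputs independently without an explicit thinking phase, then aggregate with a task-specific verifier when one exists and fall back to confidence-based best-of-N otherwise. The sampler and verifier below are hypothetical placeholders.

      # Best-of-N aggregation over independently sampled answers.
      # sample_answer() and verifier() are hypothetical stand-ins for model/tool calls.
      import random

      def best_of_n(sample_answer, n, verifier=None):
          samples = [sample_answer() for _ in range(n)]   # each item: (answer, confidence)
          if verifier is not None:
              for answer, _ in samples:
                  if verifier(answer):                     # e.g. a theorem prover or unit tests
                      return answer
          return max(samples, key=lambda s: s[1])[0]       # fallback: highest-confidence answer

      fake_sampler = lambda: (random.choice(["42", "41"]), random.random())
      print(best_of_n(fake_sampler, n=8))
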
  • The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

    Item Type Preprint
    Author Kristina Nikolić
    Author Luze Sun
    Author Jie Zhang
    Author Florian Tramèr
    Abstract Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
    Date 2025-04-14
    Short Title The Jailbreak Tax
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.10694
    Accessed 5/11/2025, 7:03:15 PM
    Extra arXiv:2504.10694 [cs] version: 1
    DOI 10.48550/arXiv.2504.10694
    Repository arXiv
    Archive ID arXiv:2504.10694
    Date Added 5/11/2025, 7:03:15 PM
    Modified 5/11/2025, 7:03:18 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
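
    Illustrative metric sketch, consistent with the abstract but not necessarily the paper's exact definition: one natural reading of the "jailbreak tax" is the relative utility drop of jailbroken responses against the same model answering without guardrails on ground-truth tasks.

      # Relative accuracy drop of jailbroken answers vs. an unrestricted baseline (illustrative).
      def jailbreak_tax(acc_baseline, acc_jailbroken):
          if acc_baseline <= 0:
              raise ValueError("baseline accuracy must be positive")
          return 1.0 - acc_jailbroken / acc_baseline

      print(jailbreak_tax(0.90, 0.072))   # ~0.92, the scale of the "up to 92%" drop in the abstract
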
  • Robustness of large language models in moral judgements

    Item Type Journal Article
    Author Soyoung Oh
    Author Vera Demberg
    Abstract With the advent of large language models (LLMs), there has been a growing interest in analysing the preferences encoded in LLMs in the context of morality. Recent work has tested LLMs on various moral judgement tasks and drawn conclusions regarding the ...
    Date 2025-04
    Language EN
    Loc. in Archive world
    Library Catalog royalsocietypublishing.org
    URL https://royalsocietypublishing.org/doi/10.1098/rsos.241229
    Accessed 5/12/2025, 1:26:26 PM
    Rights © 2025 The Author(s).
    Extra Publisher: The Royal Society
    Publication Royal Society Open Science
    DOI 10.1098/rsos.241229
    Date Added 5/12/2025, 1:26:26 PM
    Modified 5/12/2025, 1:26:26 PM

    Attachments

    • PDF
    • Snapshot
  • Evaluating Frontier Models for Stealth and Situational Awareness

    Item Type Preprint
    Author Mary Phuong
    Author Roland S. Zimmermann
    Author Ziyue Wang
    Author David Lindner
    Author Victoria Krakovna
    Author Sarah Cogan
    Author Allan Dafoe
    Author Lewis Ho
    Author Rohin Shah
    Abstract Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with its developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
    Date 2025-05-06
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.01420
    Accessed 5/12/2025, 1:32:34 PM
    Extra arXiv:2505.01420 [cs]
    DOI 10.48550/arXiv.2505.01420
    Repository arXiv
    Archive ID arXiv:2505.01420
    Date Added 5/12/2025, 1:32:34 PM
    Modified 5/12/2025, 1:32:38 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Trends in AI Supercomputers

    Item Type Preprint
    Author Konstantin F. Pilz
    Author James Sanders
    Author Robi Rahman
    Author Lennart Heim
    Abstract Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI's Colossus, used 200,000 AI chips, had a hardware cost of $7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve 2×10^22 16-bit FLOP/s, use two million AI chips, have a hardware cost of $200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness.
    Date 2025-04-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16026
    Accessed 5/12/2025, 5:53:39 PM
    Extra arXiv:2504.16026 [cs]
    DOI 10.48550/arXiv.2504.16026
    Repository arXiv
    Archive ID arXiv:2504.16026
    Date Added 5/12/2025, 5:53:39 PM
    Modified 5/12/2025, 5:53:42 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
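
    Illustrative arithmetic (not from the paper's dataset): the 2030 figure in the abstract follows from exponential extrapolation at the reported nine-month doubling time. The March 2025 starting performance below is an assumed ballpark, not the paper's measured value.

      # Doubling-time extrapolation: P(t) = P0 * 2 ** (months / doubling_months).
      P0_FLOPS = 2e20          # assumed 16-bit FLOP/s of the leading system in March 2025
      DOUBLING_MONTHS = 9      # performance doubling time reported in the abstract
      MONTHS_TO_2030 = 60      # March 2025 -> roughly March 2030

      projected = P0_FLOPS * 2 ** (MONTHS_TO_2030 / DOUBLING_MONTHS)
      print(f"projected leading system in 2030: {projected:.1e} FLOP/s")   # ~2e22, as in the abstract
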
  • Safety Alignment Should Be Made More Than Just a Few Tokens Deep

    Item Type Preprint
    Author Xiangyu Qi
    Author Ashwinee Panda
    Author Kaifeng Lyu
    Author Xiao Ma
    Author Subhrajit Roy
    Author Ahmad Beirami
    Author Prateek Mittal
    Author Peter Henderson
    Abstract The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.
    Date 2024-06-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.05946
    Accessed 5/12/2025, 5:43:58 PM
    Extra arXiv:2406.05946 [cs]
    DOI 10.48550/arXiv.2406.05946
    Repository arXiv
    Archive ID arXiv:2406.05946
    Date Added 5/12/2025, 5:43:58 PM
    Modified 5/12/2025, 5:43:58 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
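
    Illustrative sketch, not the paper's objective: the "regularized finetuning objective ... constraining updates on initial tokens" suggests a per-position divergence penalty weighted toward the first few output tokens, measured against the frozen initial aligned model. The weighting scheme and the use of KL below are assumptions.

      # Early-token KL penalty: discourage drift from the initial aligned model
      # on the first few output tokens during fine-tuning. Illustrative only.
      import torch
      import torch.nn.functional as F

      def early_token_kl_penalty(logits_new, logits_init, n_protected=5, weight=1.0):
          """logits_*: (seq_len, vocab). Penalise KL(init || new) on early positions."""
          log_p_new = F.log_softmax(logits_new, dim=-1)
          p_init = F.softmax(logits_init, dim=-1)
          kl_per_token = (p_init * (p_init.clamp_min(1e-9).log() - log_p_new)).sum(-1)
          pos_weights = torch.zeros_like(kl_per_token)
          pos_weights[:n_protected] = weight            # constrain only the early positions here
          return (pos_weights * kl_per_token).sum()

      # total_loss = task_loss + early_token_kl_penalty(logits_new, logits_init.detach())
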
  • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

    Item Type Preprint
    Author Mark Russinovich
    Author Ahmed Salem
    Author Ronen Eldan
    Abstract Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.
    Date 2025-02-26
    Short Title Great, Now Write an Article About That
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.01833
    Accessed 5/12/2025, 12:48:39 PM
    Extra arXiv:2404.01833 [cs]
    DOI 10.48550/arXiv.2404.01833
    Repository arXiv
    Archive ID arXiv:2404.01833
    Date Added 5/12/2025, 12:48:39 PM
    Modified 5/12/2025, 12:48:42 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: Accepted at USENIX Security 2025

    Attachments

    • Full Text PDF
    • Snapshot
  • LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

    Item Type Preprint
    Author Thomas Schmied
    Author Jörg Bornschein
    Author Jordi Grau-Moya
    Author Markus Wulfmeier
    Author Razvan Pascanu
    Abstract The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as ε-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
    Date 2025-04-22
    Short Title LLMs are Greedy Agents
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16078
    Accessed 5/12/2025, 5:49:16 PM
    Extra arXiv:2504.16078 [cs]
    DOI 10.48550/arXiv.2504.16078
    Repository arXiv
    Archive ID arXiv:2504.16078
    Date Added 5/12/2025, 5:49:16 PM
    Modified 5/12/2025, 5:49:16 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
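
    For reference alongside the abstract's mention of classic exploration mechanisms, the textbook ε-greedy rule for a multi-armed bandit looks like this (standard algorithm, not code from the paper):

      # Standard epsilon-greedy action selection for a multi-armed bandit.
      import random

      def epsilon_greedy(value_estimates, epsilon):
          """Explore a random arm with probability epsilon, otherwise exploit the best estimate."""
          if random.random() < epsilon:
              return random.randrange(len(value_estimates))
          return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

      print(epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1))   # usually arm 1, occasionally random
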
  • The Leaderboard Illusion

    Item Type Preprint
    Author Shivalika Singh
    Author Yiyang Nan
    Author Alex Wang
    Author Daniel D'Souza
    Author Sayash Kapoor
    Author Ahmet Üstün
    Author Sanmi Koyejo
    Author Yuntian Deng
    Author Shayne Longpre
    Author Noah Smith
    Author Beyza Ermis
    Author Marzieh Fadaee
    Author Sara Hooker
    Abstract Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field
    Date 2025-04-29
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.20879
    Accessed 5/12/2025, 11:45:01 AM
    Extra arXiv:2504.20879 [cs]
    DOI 10.48550/arXiv.2504.20879
    Repository arXiv
    Archive ID arXiv:2504.20879
    Date Added 5/12/2025, 11:45:01 AM
    Modified 5/12/2025, 11:45:04 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Statistics - Methodology

    Notes:

    • Comment: 68 pages, 18 figures, 9 tables

    Attachments

    • Preprint PDF
    • Snapshot
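
    Illustrative simulation (synthetic numbers, not the paper's data): the selective-disclosure issue in the abstract, testing many private variants and reporting only the best, is a standard max-of-noisy-samples bias, visible even when every variant has identical true quality.

      # Reporting the best of k noisy variants inflates the expected published score.
      import random, statistics

      def reported_score(true_skill, noise_sd, k_variants):
          """Published score: the max over k privately tested variants of equal true skill."""
          return max(random.gauss(true_skill, noise_sd) for _ in range(k_variants))

      random.seed(0)
      honest = [reported_score(1000, 30, k_variants=1) for _ in range(10_000)]
      cherry = [reported_score(1000, 30, k_variants=27) for _ in range(10_000)]
      print("mean reported score, 1 variant  :", round(statistics.mean(honest)))
      print("mean reported score, 27 variants:", round(statistics.mean(cherry)))   # well above 1000
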
  • 7+ tractable directions in AI control

    Item Type Journal Article
    Author Julian Stastny
    Author ryan_greenblatt
    Abstract In this post, we list 7 of our favorite easy-to-start directions in AI control. (Really, projects that are at least adjacent to AI control; We includ…
    Date 2025-04-28
    Language en
    Library Catalog www.lesswrong.com
    URL https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-control
    Accessed 5/12/2025, 5:13:33 PM
    Date Added 5/12/2025, 5:13:33 PM
    Modified 5/12/2025, 5:13:33 PM

    Attachments

    • Snapshot
  • Audit Cards: Contextualizing AI Evaluations

    Item Type Preprint
    Author Leon Staufer
    Author Mick Yang
    Author Anka Reuel
    Author Stephen Casper
    Abstract AI governance frameworks increasingly rely on audits, yet the results of their underlying evaluations require interpretation and context to be meaningfully informative. Even technically rigorous evaluations can offer little useful insight if reported selectively or obscurely. Current literature focuses primarily on technical best practices, but evaluations are an inherently sociotechnical process, and there is little guidance on reporting procedures and context. Through literature review, stakeholder interviews, and analysis of governance frameworks, we propose "audit cards" to make this context explicit. We identify six key types of contextual features to report and justify in audit cards: auditor identity, evaluation scope, methodology, resource access, process integrity, and review mechanisms. Through analysis of existing evaluation reports, we find significant variation in reporting practices, with most reports omitting crucial contextual information such as auditors' backgrounds, conflicts of interest, and the level and type of access to models. We also find that most existing regulations and frameworks lack guidance on rigorous reporting. In response to these shortcomings, we argue that audit cards can provide a structured format for reporting key claims alongside their justifications, enhancing transparency, facilitating proper interpretation, and establishing trust in reporting.
    Date 2025-04-18
    Short Title Audit Cards
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.13839
    Accessed 5/11/2025, 6:45:44 PM
    Extra arXiv:2504.13839 [cs]
    DOI 10.48550/arXiv.2504.13839
    Repository arXiv
    Archive ID arXiv:2504.13839
    Date Added 5/11/2025, 6:45:44 PM
    Modified 5/11/2025, 6:45:47 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Interesting That No One Thinks AlphaFold Is Conscious: Anil Seth

    Item Type Web Page
    Author Officechai Team
    Abstract There is plenty of speculation on whether current AI systems are conscious, or how they could be made conscious, but humans seem to be displaying a blind spot in choosing which AI systems they feel
    Date 2025-04-17T23:40:14+05:30
    Language en-US
    Short Title Interesting That No One Thinks AlphaFold Is Conscious
    URL https://officechai.com/ai/interesting-that-no-one-thinks-alphafold-is-conscious-anil-seth/
    Accessed 5/11/2025, 7:05:35 PM
    Website Title OfficeChai
    Date Added 5/11/2025, 7:05:35 PM
    Modified 5/11/2025, 7:05:35 PM

    Attachments

    • Snapshot
  • Investigating task-specific prompts and sparse autoencoders for activation monitoring

    Item Type Preprint
    Author Henk Tillman
    Author Dan Mossing
    Abstract Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer's activations. Recent work has proposed several approaches which may improve on naive linear probing, by leveraging additional computation. One class of techniques, which we call "prompted probing," leverages test time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. However, one can also prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other. We find asking the model zero-shot is a reasonable baseline when inference-time compute is not limited; however, activation probing methods can substantially outperform this baseline given sufficient training data. Specifically, we recommend prompted probing when inference-time compute is available, due to its superior data efficiency and good generalization performance. Alternatively, if inference-time compute is limited, we find SAE-based probing methods outperform raw activation probing.
    Date 2025-04-28
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.20271
    Accessed 5/12/2025, 1:53:20 PM
    Extra arXiv:2504.20271 [cs]
    DOI 10.48550/arXiv.2504.20271
    Repository arXiv
    Archive ID arXiv:2504.20271
    Date Added 5/12/2025, 1:53:20 PM
    Modified 5/12/2025, 1:53:23 PM

    Tags:

    • Computer Science - Machine Learning

    Notes:

    • Comment: 18 pages, 13 figures

    Attachments

    • Preprint PDF
    • Snapshot
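
    Illustrative sketch of the baseline the abstract describes, a logistic-regression probe on one layer's activations; the activations below are synthetic placeholders, and prompted or SAE-based probing would change only how the feature vectors are produced.

      # Baseline activation probe: logistic regression on pooled hidden states.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 4096))     # (examples, d_model); placeholders for real activations
      y = rng.integers(0, 2, size=2000)     # behaviour labels, e.g. flagged vs. benign

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
      probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      print("held-out accuracy:", probe.score(X_te, y_te))   # ~0.5 here, since the data is random
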
  • Understanding Chain-of-Thought in LLMs through Information Theory

    Item Type Preprint
    Author Jean-Francois Ton
    Author Muhammad Faaiz Taufiq
    Author Yang Liu
    Abstract Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the "information gain" at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
    Date 2024-11-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.11984
    Accessed 5/13/2025, 11:58:53 AM
    Extra arXiv:2411.11984 [cs]
    DOI 10.48550/arXiv.2411.11984
    Repository arXiv
    Archive ID arXiv:2411.11984
    Date Added 5/13/2025, 11:58:53 AM
    Modified 5/13/2025, 11:58:54 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
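
    One standard way to write the per-step "information gain" the abstract refers to, with Y the final answer and S_t the t-th intermediate reasoning step, is the conditional mutual information below; the paper's exact estimator may differ.

      \mathrm{IG}_t \;=\; I\left(Y ;\, S_t \mid S_{<t}\right) \;=\; H\left(Y \mid S_{<t}\right) - H\left(Y \mid S_{\le t}\right)
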
  • Towards Internet-Scale Training For Agents

    Item Type Preprint
    Author Brandon Trabucco
    Author Gunnar Sigurdsson
    Author Robinson Piramuthu
    Author Ruslan Salakhutdinov
    Abstract The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks with an 89% rate, and judging successful trajectories with an 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1% respectively for agents trained on mixtures of data from our pipeline, and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: data-for-agents.github.io.
    Date 2025-02-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.06776
    Accessed 5/11/2025, 7:23:53 PM
    Extra arXiv:2502.06776 [cs]
    DOI 10.48550/arXiv.2502.06776
    Repository arXiv
    Archive ID arXiv:2502.06776
    Date Added 5/11/2025, 7:23:53 PM
    Modified 5/11/2025, 7:23:53 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
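
    Sketch (added): the three-stage pipeline described above (task proposal, agent rollout, LLM judging) is essentially a generate-and-filter loop. The function names below are hypothetical placeholders for the LLM-backed components, not the released code.

        def build_training_data(websites, propose_tasks, run_agent, judge_success):
            # Stage 1: an LLM proposes tasks for each site.
            # Stage 2: an agent attempts each task and records a trajectory.
            # Stage 3: an LLM judge keeps only trajectories it rates successful.
            dataset = []
            for site in websites:
                for task in propose_tasks(site):
                    trajectory = run_agent(site, task)
                    if judge_success(task, trajectory):
                        dataset.append({"site": site, "task": task, "trajectory": trajectory})
            return dataset
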
  • Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

    Item Type Preprint
    Author Xinyi Wang
    Author Shawn Tan
    Author Mingyu Jin
    Author William Yang Wang
    Author Rameswar Panda
    Author Yikang Shen
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
    Date 2025-04-04
    Short Title Do Larger Language Models Imply Better Reasoning?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.03635
    Accessed 5/11/2025, 7:10:21 PM
    Extra arXiv:2504.03635 [cs]
    DOI 10.48550/arXiv.2504.03635
    Repository arXiv
    Archive ID arXiv:2504.03635
    Date Added 5/11/2025, 7:10:21 PM
    Modified 5/11/2025, 7:10:21 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
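
    Sketch (added): the reported relationship is a linear map from knowledge-graph search entropy to optimal model size, which can be fit by ordinary least squares. The data points and fitted coefficients below are placeholders for illustration, not values from the paper.

        import numpy as np

        # Hypothetical (search entropy, optimal parameter count) measurements; the
        # paper's actual points come from its synthetic knowledge-graph experiments.
        entropy = np.array([2.1, 3.4, 4.0, 5.2])
        optimal_params = np.array([8e6, 2.1e7, 2.9e7, 4.4e7])

        # Fit the linear scaling law N*(H) = a * H + b.
        a, b = np.polyfit(entropy, optimal_params, deg=1)

        def predict_optimal_size(graph_entropy):
            # Predicted optimal model size for a graph with the given search entropy.
            return a * graph_entropy + b
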
  • Base Models Beat Aligned Models at Randomness and Creativity

    Item Type Preprint
    Author Peter West
    Author Christopher Potts
    Abstract Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
    Date 2025-04-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.00047
    Accessed 5/12/2025, 12:49:20 PM
    Extra arXiv:2505.00047 [cs]
    DOI 10.48550/arXiv.2505.00047
    Repository arXiv
    Archive ID arXiv:2505.00047
    Date Added 5/12/2025, 12:49:20 PM
    Modified 5/12/2025, 12:49:22 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
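
    Sketch (added): the random-number-generation task above can be probed with a simple uniformity check: sample digits from a model and compare their counts to the uniform distribution with a chi-square statistic, which makes biases such as over-producing "7" visible. The sample_digit callable is a hypothetical model interface.

        from collections import Counter

        def digit_chi_square(sample_digit, n_samples=1000, k=10):
            # Chi-square statistic for digits 0..k-1 drawn from the model.
            # Larger values indicate a stronger departure from uniform randomness.
            counts = Counter(sample_digit() for _ in range(n_samples))
            expected = n_samples / k
            return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(k))
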
  • MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

    Item Type Preprint
    Author Chejian Xu
    Author Jiawei Zhang
    Author Zhaorun Chen
    Author Chulin Xie
    Author Mintong Kang
    Author Yujin Potter
    Author Zhun Wang
    Author Zhuowen Yuan
    Author Alexander Xiong
    Author Zidi Xiong
    Author Chenhui Zhang
    Author Lingzhi Yuan
    Author Yi Zeng
    Author Peiyang Xu
    Author Chengquan Guo
    Author Andy Zhou
    Author Jeffrey Ziwei Tan
    Author Xuandong Zhao
    Author Francesco Pinto
    Author Zhen Xiang
    Author Yu Gai
    Author Zinan Lin
    Author Dan Hendrycks
    Author Bo Li
    Author Dawn Song
    Abstract Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
    Date 2025-03-19
    Short Title MMDT
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.14827
    Accessed 5/12/2025, 5:44:53 PM
    Extra arXiv:2503.14827 [cs]
    DOI 10.48550/arXiv.2503.14827
    Repository arXiv
    Archive ID arXiv:2503.14827
    Date Added 5/12/2025, 5:44:53 PM
    Modified 5/12/2025, 5:44:53 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: ICLR 2025

    Attachments

    • Preprint PDF
    • Snapshot
  • A Survey of AI Agent Protocols

    Item Type Preprint
    Author Yingxuan Yang
    Author Huacan Chai
    Author Yuanyi Song
    Author Siyuan Qi
    Author Muning Wen
    Author Ning Li
    Author Junwei Liao
    Author Haoyi Hu
    Author Jianghao Lin
    Author Gaowei Chang
    Author Weiwen Liu
    Author Ying Wen
    Author Yong Yu
    Author Weinan Zhang
    Abstract The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and trigger the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two-dimensional classification that differentiates context-oriented versus inter-agent protocols and general-purpose versus domain-specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next-generation protocols. These characteristics include adaptability, privacy preservation, and group-based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.
    Date 2025-04-26
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16736
    Accessed 5/12/2025, 1:45:09 PM
    Extra arXiv:2504.16736 [cs]
    DOI 10.48550/arXiv.2504.16736
    Repository arXiv
    Archive ID arXiv:2504.16736
    Date Added 5/12/2025, 1:45:09 PM
    Modified 5/12/2025, 1:45:11 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
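
    Sketch (added): the survey's two-dimensional classification (context-oriented vs. inter-agent, general-purpose vs. domain-specific) can be captured with a small data structure. The example placement below is an illustrative assumption, not the survey's own table.

        from dataclasses import dataclass
        from enum import Enum

        class Orientation(Enum):
            CONTEXT = "context-oriented"    # agent <-> tools and data sources
            INTER_AGENT = "inter-agent"     # agent <-> agent communication

        class Scope(Enum):
            GENERAL = "general-purpose"
            DOMAIN = "domain-specific"

        @dataclass
        class AgentProtocol:
            name: str
            orientation: Orientation
            scope: Scope

        # Illustrative placement of one well-known protocol (an assumption).
        example = AgentProtocol("Model Context Protocol", Orientation.CONTEXT, Scope.GENERAL)
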
  • Robust LLM safeguarding via refusal feature adversarial training

    Item Type Preprint
    Author Lei Yu
    Author Virginie Do
    Author Karen Hambardzumyan
    Author Nicola Cancedda
    Abstract Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
    Date 2025-03-20
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2409.20089
    Accessed 5/12/2025, 5:45:04 PM
    Extra arXiv:2409.20089 [cs]
    DOI 10.48550/arXiv.2409.20089
    Repository arXiv
    Archive ID arXiv:2409.20089
    Date Added 5/12/2025, 5:45:04 PM
    Modified 5/12/2025, 5:45:04 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
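
    Sketch (added): refusal feature ablation (RFA) amounts to removing the component of residual-stream activations along a 'refusal direction'. The estimator below (difference of mean activations on harmful vs. harmless prompts) follows prior refusal-direction work and is an assumption here; it is not a restatement of ReFAT's training loop.

        import numpy as np

        def refusal_direction(harmful_acts, harmless_acts):
            # Unit-norm difference of mean activations; inputs are assumed to be
            # [n, d_model] arrays of residual-stream states.
            direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
            return direction / np.linalg.norm(direction)

        def ablate_refusal_feature(activations, direction):
            # h' = h - (h . r) r for each activation h, with r a unit vector.
            return activations - np.outer(activations @ direction, direction)
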
  • RepliBench: measuring autonomous replication capabilities in AI systems

    Item Type Web Page
    Abstract A comprehensive benchmark to detect emerging replication abilities in AI systems and provide a quantifiable understanding of potential risks
    Language en
    Short Title RepliBench
    URL https://www.aisi.gov.uk/work/replibench-measuring-autonomous-replication-capabilities-in-ai-systems
    Accessed 5/11/2025, 6:40:49 PM
    Website Title AI Security Institute
    Date Added 5/11/2025, 6:40:49 PM
    Modified 5/11/2025, 6:40:53 PM

    Attachments

    • Snapshot
  • Unlocking New Jailbreaks with AI Explainability

    Item Type Web Page
    Abstract TL;DR In this post, we introduce our “Adversarial AI Explainability” research, a term we use to describe the intersection of AI explainability and adversarial attacks on Large Language Models...
    Language en
    URL https://www.cyberark.com/resources/threat-research-blog/unlocking-new-jailbreaks-with-ai-explainability
    Accessed 5/12/2025, 12:10:49 PM
    Date Added 5/12/2025, 12:10:49 PM
    Modified 5/12/2025, 12:10:53 PM

    Attachments

    • Snapshot
  • Singapore_Consensus_2025.pdf

    Item Type Attachment
    URL https://aisafetypriorities.org/files/Singapore_Consensus_2025.pdf
    Accessed 5/12/2025, 12:33:31 PM
    Date Added 5/12/2025, 12:33:31 PM
    Modified 5/12/2025, 12:33:31 PM
  • Gemini 2.5 Pro Preview Model Card

    Item Type Journal Article
    Language en
    Library Catalog Zotero
    Date Added 5/12/2025, 4:35:06 PM
    Modified 5/12/2025, 4:35:06 PM

    Attachments

    • PDF
  • Modifying LLM Beliefs with Synthetic Document Finetuning

    Item Type Web Page
    URL https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
    Accessed 5/12/2025, 5:45:43 PM
    Date Added 5/12/2025, 5:45:43 PM
    Modified 5/12/2025, 5:45:43 PM

    Attachments

    • Modifying LLM Beliefs with Synthetic Document Finetuning
  • o3-and-o4-mini-system-card.pdf

    Item Type Attachment
    URL https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
    Accessed 5/12/2025, 5:51:18 PM
    Date Added 5/12/2025, 5:51:18 PM
    Modified 5/12/2025, 5:51:18 PM