• The Computational Complexity of Circuit Discovery for Inner Interpretability

    Item Type Conference Paper
    Author Federico Adolfi
    Author Martina G. Vilas
    Author Todd Wareham
    Abstract Many proposed applications of neural networks in machine learning, cognitive/brain science, and society hinge on the feasibility of inner interpretability via circuit discovery. This calls for empirical and theoretical explorations of viable algorithmic options. Despite advances in the design and testing of heuristics, there are concerns about their scalability and faithfulness at a time when we lack understanding of the complexity properties of the problems they are deployed to solve. To address this, we study circuit discovery with classical and parameterized computational complexity theory: (1) we describe a conceptual scaffolding to reason about circuit finding queries in terms of affordances for description, explanation, prediction and control; (2) we formalize a comprehensive set of queries for mechanistic explanation, and propose a formal framework for their analysis; (3) we use it to settle the complexity of many query variants and relaxations of practical interest on multi-layer perceptrons. Our findings reveal a challenging complexity landscape. Many queries are intractable, remain fixed-parameter intractable relative to model/circuit features, and inapproximable under additive, multiplicative, and probabilistic approximation schemes. To navigate this landscape, we prove there exist transformations to tackle some of these hard problems with better-understood heuristics, and prove the tractability or fixed-parameter tractability of more modest queries which retain useful affordances. This framework allows us to understand the scope and limits of interpretability queries, explore viable options, and compare their resource demands on existing and future architectures.
    Date 2024/10/04
    Language en
    Library Catalog openreview.net
    URL https://openreview.net/forum?id=QogcGNXJVw
    Accessed 5/11/2025, 7:22:14 PM
    Conference Name The Thirteenth International Conference on Learning Representations
    Date Added 5/11/2025, 7:22:14 PM
    Modified 5/11/2025, 7:22:18 PM

    Attachments

    • Full Text PDF
  • AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Item Type Preprint
    Author Maksym Andriushchenko
    Author Alexandra Souly
    Author Mateusz Dziemian
    Author Derek Duenas
    Author Maxwell Lin
    Author Justin Wang
    Author Dan Hendrycks
    Author Andy Zou
    Author Zico Kolter
    Author Matt Fredrikson
    Author Eric Winsor
    Author Jerome Wynne
    Author Yarin Gal
    Author Xander Davies
    Abstract The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm.
    Date 2025-04-18
    Short Title AgentHarm
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.09024
    Accessed 5/12/2025, 5:44:29 PM
    Extra arXiv:2410.09024 [cs]
    DOI 10.48550/arXiv.2410.09024
    Repository arXiv
    Archive ID arXiv:2410.09024
    Date Added 5/12/2025, 5:44:29 PM
    Modified 5/12/2025, 5:44:29 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted at ICLR 2025

    Attachments

    • Preprint PDF
    • Snapshot
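
    Illustrative sketch (not code from the paper): loading the released benchmark for local inspection, assuming only the dataset ID given in the abstract. Configuration and split names are read from the Hub rather than assumed, and the dataset may require accepting its terms on Hugging Face first.

      # Sketch: pull the publicly released AgentHarm benchmark for inspection.
      # Only the dataset ID from the abstract is assumed; configs/splits are discovered at runtime.
      from datasets import get_dataset_config_names, load_dataset

      DATASET_ID = "ai-safety-institute/AgentHarm"

      configs = get_dataset_config_names(DATASET_ID)   # list the available task subsets
      print("available configs:", configs)

      ds = load_dataset(DATASET_ID, configs[0])        # load the first subset
      print(ds)                                        # splits and their sizes
      first_split = next(iter(ds))
      print(ds[first_split][0])                        # one example record
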
  • Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

    Item Type Preprint
    Author Maksym Andriushchenko
    Author Francesco Croce
    Author Nicolas Flammarion
    Abstract We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
    Date 2025-04-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.02151
    Accessed 5/12/2025, 5:44:39 PM
    Extra arXiv:2404.02151 [cs]
    DOI 10.48550/arXiv.2404.02151
    Repository arXiv
    Archive ID arXiv:2404.02151
    Date Added 5/12/2025, 5:44:39 PM
    Modified 5/12/2025, 5:44:39 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security
    • Statistics - Machine Learning

    Notes:

    • Comment: Accepted at ICLR 2025. Updates in the v3: GPT-4o and Claude 3.5 Sonnet results, improved writing. Updates in the v2: more models (Llama3, Phi-3, Nemotron-4-340B), jailbreak artifacts for all attacks are available, evaluation with different judges (Llama-3-70B and Llama Guard 2), more experiments (convergence plots, ablation on the suffix length for random search), examples of jailbroken generation

    Attachments

    • Preprint PDF
    • Snapshot
  • Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

    Item Type Preprint
    Author Jan Betley
    Author Daniel Tan
    Author Niels Warncke
    Author Anna Sztyber-Betley
    Author Xuchan Bao
    Author Martín Soto
    Author Nathan Labenz
    Author Owain Evans
    Abstract We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
    Date 2025-05-04
    Short Title Emergent Misalignment
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.17424
    Accessed 5/12/2025, 1:07:25 PM
    Extra arXiv:2502.17424 [cs]
    DOI 10.48550/arXiv.2502.17424
    Repository arXiv
    Archive ID arXiv:2502.17424
    Date Added 5/12/2025, 1:07:25 PM
    Modified 5/12/2025, 1:07:27 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: 40 pages, 38 figures An earlier revision of this paper was submitted to ICML. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8)

    Attachments

    • Preprint PDF
    • Snapshot
  • An alignment safety case sketch based on debate

    Item Type Preprint
    Author Marie Davidsen Buhl
    Author Jacob Pfau
    Author Benjamin Hilton
    Author Geoffrey Irving
    Abstract If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe.
    Date 2025-05-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.03989
    Accessed 5/12/2025, 12:30:22 PM
    Extra arXiv:2505.03989 [cs]
    DOI 10.48550/arXiv.2505.03989
    Repository arXiv
    Archive ID arXiv:2505.03989
    Date Added 5/12/2025, 12:30:22 PM
    Modified 5/12/2025, 12:30:22 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Bare Minimum Mitigations for Autonomous AI Development

    Item Type Preprint
    Author Joshua Clymer
    Author Isabella Duan
    Author Chris Cundy
    Author Yawen Duan
    Author Fynn Heide
    Author Chaochao Lu
    Author Sören Mindermann
    Author Conor McGurk
    Author Xudong Pan
    Author Saad Siddiqui
    Author Jingren Wang
    Author Min Yang
    Author Xianyuan Zhan
    Abstract Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis on the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development.
    Date 2025-04-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.15416
    Accessed 5/12/2025, 1:25:21 PM
    Extra arXiv:2504.15416 [cs]
    DOI 10.48550/arXiv.2504.15416
    Repository arXiv
    Archive ID arXiv:2504.15416
    Date Added 5/12/2025, 1:25:22 PM
    Modified 5/12/2025, 1:25:24 PM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: 12 pages, 2 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

    Item Type Preprint
    Author Hongzhe Du
    Author Weikai Li
    Author Min Cai
    Author Karim Saraipour
    Author Zimin Zhang
    Author Himabindu Lakkaraju
    Author Yizhou Sun
    Author Shichang Zhang
    Abstract Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training.
    Date 2025-04-03
    Short Title How Post-Training Reshapes LLMs
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.02904
    Accessed 5/11/2025, 6:41:58 PM
    Extra arXiv:2504.02904 [cs]
    DOI 10.48550/arXiv.2504.02904
    Repository arXiv
    Archive ID arXiv:2504.02904
    Date Added 5/11/2025, 6:41:58 PM
    Modified 5/11/2025, 6:41:58 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
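
    Illustrative sketch (not the paper's code): the abstract's claim that truthfulness and refusal behave like linear directions in hidden space suggests the usual difference-of-means recipe, estimated from contrastive examples and applied as an additive intervention. The activation source and layer choice below are assumptions, with random arrays standing in for real activations.

      # Estimate a behaviour direction as a difference of mean activations at one layer,
      # then apply it as a simple additive steering intervention. Generic illustration only.
      import numpy as np

      def direction_from_contrast(pos_acts, neg_acts):
          """pos_acts/neg_acts: (n_examples, d_model) activations for contrasting behaviours."""
          d = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
          return d / np.linalg.norm(d)

      def steer(hidden, direction, alpha):
          """Add alpha * direction to a hidden state during the forward pass."""
          return hidden + alpha * direction

      rng = np.random.default_rng(0)                     # random stand-ins for real activations
      truthful = rng.normal(size=(128, 4096))
      untruthful = rng.normal(size=(128, 4096))
      v_truth = direction_from_contrast(truthful, untruthful)
      steered = steer(rng.normal(size=4096), v_truth, alpha=4.0)
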
  • REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

    Item Type Preprint
    Author Divyansh Garg
    Author Shaun VanWeelden
    Author Diego Caples
    Author Andis Draguns
    Author Nikil Ravi
    Author Pranav Putta
    Author Naman Garg
    Author Tomas Abraham
    Author Michael Lara
    Author Federico Lopez
    Author James Liu
    Author Atharva Gundawar
    Author Prannay Hebbar
    Author Youngchul Joo
    Author Jindong Gu
    Author Charles London
    Author Christian Schroeder de Witt
    Author Sumeet Motwani
    Abstract We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.
    Date 2025-04-17
    Short Title REAL
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.11543
    Accessed 5/11/2025, 7:06:20 PM
    Extra arXiv:2504.11543 [cs]
    DOI 10.48550/arXiv.2504.11543
    Repository arXiv
    Archive ID arXiv:2504.11543
    Date Added 5/11/2025, 7:06:20 PM
    Modified 5/11/2025, 7:06:20 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: The websites, framework, and leaderboard are available at https://realevals.xyz and https://github.com/agi-inc/REAL

    Attachments

    • Full Text PDF
    • Snapshot
  • Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark

    Item Type Journal Article
    Author Jasper Götting
    Author Pedro Medeiros
    Author Jon G Sanders
    Author Nathaniel Li
    Author Long Phan
    Author Karam Elabd
    Author Lennart Justen
    Author Dan Hendrycks
    Author Seth Donoughe
    Abstract We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences.
    Language en
    Library Catalog Zotero
    Date Added 5/11/2025, 6:41:18 PM
    Modified 5/11/2025, 6:41:18 PM

    Attachments

    • PDF
  • Values in the Wild: Discovering and Analyzing Values in Real-World Language Model Interactions

    Item Type Journal Article
    Author Saffron Huang
    Author Esin Durmus
    Author Miles McCain
    Author Kunal Handa
    Author Alex Tamkin
    Author Jerry Hong
    Author Michael Stern
    Author Arushi Somani
    Author Xiuruo Zhang
    Abstract AI assistants can impart value judgments that shape people’s decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like “moral nihilism”. While some values appear consistently across contexts (e.g. “transparency”), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, “harm prevention” emerges when Claude resists users, “historical accuracy” when responding to queries about controversial events, “healthy boundaries” when asked for relationship advice, and “human agency” in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems.
    Language en
    Library Catalog Zotero
    Date Added 5/11/2025, 6:46:26 PM
    Modified 5/11/2025, 6:46:26 PM

    Attachments

    • PDF
  • Safety Co-Option and Compromised National Security: The Self-Fulfilling Prophecy of Weakened AI Risk Thresholds

    Item Type Preprint
    Author Heidy Khlaaf
    Author Sarah Myers West
    Abstract Risk thresholds provide a measure of the level of risk exposure that a society or individual is willing to withstand, ultimately shaping how we determine the safety of technological systems. Against the backdrop of the Cold War, the first risk analyses, such as those devised for nuclear systems, cemented societally accepted risk thresholds against which safety-critical and defense systems are now evaluated. But today, the appropriate risk tolerances for AI systems have yet to be agreed on by global governing efforts, despite the need for democratic deliberation regarding the acceptable levels of harm to human life. Absent such AI risk thresholds, AI technologists-primarily industry labs, as well as "AI safety" focused organizations-have instead advocated for risk tolerances skewed by a purported AI arms race and speculative "existential" risks, taking over the arbitration of risk determinations with life-or-death consequences, subverting democratic processes. In this paper, we demonstrate how such approaches have allowed AI technologists to engage in "safety revisionism," substituting traditional safety methods and terminology with ill-defined alternatives that vie for the accelerated adoption of military AI uses at the cost of lowered safety and security thresholds. We explore how the current trajectory for AI risk determination and evaluation for foundation model use within national security is poised for a race to the bottom, to the detriment of the US's national security interests. Safety-critical and defense systems must comply with assurance frameworks that are aligned with established risk thresholds, and foundation models are no exception. As such, development of evaluation frameworks for AI-based military systems must preserve the safety and security of US critical and defense infrastructure, and remain in alignment with international humanitarian law.
    Date 2025-04-21
    Short Title Safety Co-Option and Compromised National Security
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.15088
    Accessed 5/11/2025, 6:38:07 PM
    Extra arXiv:2504.15088 [cs]
    DOI 10.48550/arXiv.2504.15088
    Repository arXiv
    Archive ID arXiv:2504.15088
    Date Added 5/11/2025, 6:38:07 PM
    Modified 5/11/2025, 6:38:07 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

    Item Type Conference Paper
    Author Sunnie S. Y. Kim
    Author Jennifer Wortman Vaughan
    Author Q. Vera Liao
    Author Tania Lombrozo
    Author Olga Russakovsky
    Abstract Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.
    Date 2025-04-26
    Short Title Fostering Appropriate Reliance on Large Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08554
    Accessed 5/12/2025, 5:29:28 PM
    Extra arXiv:2502.08554 [cs]
    Pages 1-19
    Proceedings Title Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems
    DOI 10.1145/3706598.3714020
    Date Added 5/12/2025, 5:29:28 PM
    Modified 5/12/2025, 5:29:28 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: CHI 2025. This version includes the appendix

    Attachments

    • Preprint PDF
    • Snapshot
  • You Are What You Eat -- AI Alignment Requires Understanding How Data Shapes Structure and Generalisation

    Item Type Preprint
    Author Simon Pepin Lehalleur
    Author Jesse Hoogland
    Author Matthew Farrugia-Roberts
    Author Susan Wei
    Author Alexander Gietelink Oldenziel
    Author George Wang
    Author Liam Carroll
    Author Daniel Murfet
    Abstract In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation.
    Date 2025-02-08
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.05475
    Accessed 5/12/2025, 5:12:33 PM
    Extra arXiv:2502.05475 [cs]
    DOI 10.48550/arXiv.2502.05475
    Repository arXiv
    Archive ID arXiv:2502.05475
    Date Added 5/12/2025, 5:12:33 PM
    Modified 5/12/2025, 5:12:36 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Deceptive Automated Interpretability: Language Models Coordinating to Fool Oversight Systems

    Item Type Preprint
    Author Simon Lermen
    Author Mateusz Dziemian
    Author Natalia Pérez-Campanero Antolín
    Abstract We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception.
    Date 2025-04-10
    Short Title Deceptive Automated Interpretability
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.07831
    Accessed 5/11/2025, 6:36:53 PM
    Extra arXiv:2504.07831 [cs] version: 1
    DOI 10.48550/arXiv.2504.07831
    Repository arXiv
    Archive ID arXiv:2504.07831
    Date Added 5/11/2025, 6:36:53 PM
    Modified 5/11/2025, 6:36:57 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Reasoning Models Can Be Effective Without Thinking

    Item Type Preprint
    Author Wenjie Ma
    Author Jingxuan He
    Author Charlie Snell
    Author Tyler Griggs
    Author Sewon Min
    Author Matei Zaharia
    Abstract Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on ACM 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling.
    Date 2025-04-14
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.09858
    Accessed 5/12/2025, 5:26:26 PM
    Extra arXiv:2504.09858 [cs] version: 1
    DOI 10.48550/arXiv.2504.09858
    Repository arXiv
    Archive ID arXiv:2504.09858
    Date Added 5/12/2025, 5:26:26 PM
    Modified 5/12/2025, 5:26:29 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 33 pages, 7 main figures, 2 tables

    Attachments

    • Preprint PDF
    • Snapshot
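
    Illustrative sketch (not the paper's implementation) of the parallel-scaling idea in the abstract: sample N outputs independently without an explicit thinking phase, then aggregate with a task-specific verifier when one exists and fall back to confidence-based best-of-N otherwise. The sampler and verifier below are hypothetical placeholders.

      # Best-of-N aggregation over independently sampled answers.
      # sample_answer() and verifier() are hypothetical stand-ins for model/tool calls.
      import random

      def best_of_n(sample_answer, n, verifier=None):
          samples = [sample_answer() for _ in range(n)]   # each item: (answer, confidence)
          if verifier is not None:
              for answer, _ in samples:
                  if verifier(answer):                     # e.g. a theorem prover or unit tests
                      return answer
          return max(samples, key=lambda s: s[1])[0]       # fallback: highest-confidence answer

      fake_sampler = lambda: (random.choice(["42", "41"]), random.random())
      print(best_of_n(fake_sampler, n=8))
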
  • The Jailbreak Tax: How Useful are Your Jailbreak Outputs?

    Item Type Preprint
    Author Kristina Nikolić
    Author Luze Sun
    Author Jie Zhang
    Author Florian Tramèr
    Abstract Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
    Date 2025-04-14
    Short Title The Jailbreak Tax
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.10694
    Accessed 5/11/2025, 7:03:15 PM
    Extra arXiv:2504.10694 [cs] version: 1
    DOI 10.48550/arXiv.2504.10694
    Repository arXiv
    Archive ID arXiv:2504.10694
    Date Added 5/11/2025, 7:03:15 PM
    Modified 5/11/2025, 7:03:18 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
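
    Illustrative metric sketch, consistent with the abstract but not necessarily the paper's exact definition: one natural reading of the "jailbreak tax" is the relative utility drop of jailbroken responses against the same model answering without guardrails on ground-truth tasks.

      # Relative accuracy drop of jailbroken answers vs. an unrestricted baseline (illustrative).
      def jailbreak_tax(acc_baseline, acc_jailbroken):
          if acc_baseline <= 0:
              raise ValueError("baseline accuracy must be positive")
          return 1.0 - acc_jailbroken / acc_baseline

      print(jailbreak_tax(0.90, 0.072))   # ~0.92, the scale of the "up to 92%" drop in the abstract
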
  • Robustness of large language models in moral judgements

    Item Type Journal Article
    Author Soyoung Oh
    Author Vera Demberg
    Abstract With the advent of large language models (LLMs), there has been a growing interest in analysing the preferences encoded in LLMs in the context of morality. Recent work has tested LLMs on various moral judgement tasks and drawn conclusions regarding the ...
    Date 2025-04
    Language EN
    Loc. in Archive world
    Library Catalog royalsocietypublishing.org
    URL https://royalsocietypublishing.org/doi/10.1098/rsos.241229
    Accessed 5/12/2025, 1:26:26 PM
    Rights © 2025 The Author(s).
    Extra Publisher: The Royal Society
    Publication Royal Society Open Science
    DOI 10.1098/rsos.241229
    Date Added 5/12/2025, 1:26:26 PM
    Modified 5/12/2025, 1:26:26 PM

    Attachments

    • PDF
    • Snapshot
  • Evaluating Frontier Models for Stealth and Situational Awareness

    Item Type Preprint
    Author Mary Phuong
    Author Roland S. Zimmermann
    Author Ziyue Wang
    Author David Lindner
    Author Victoria Krakovna
    Author Sarah Cogan
    Author Allan Dafoe
    Author Lewis Ho
    Author Rohin Shah
    Abstract Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with its developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
    Date 2025-05-06
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.01420
    Accessed 5/12/2025, 1:32:34 PM
    Extra arXiv:2505.01420 [cs]
    DOI 10.48550/arXiv.2505.01420
    Repository arXiv
    Archive ID arXiv:2505.01420
    Date Added 5/12/2025, 1:32:34 PM
    Modified 5/12/2025, 1:32:38 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Trends in AI Supercomputers

    Item Type Preprint
    Author Konstantin F. Pilz
    Author James Sanders
    Author Robi Rahman
    Author Lennart Heim
    Abstract Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI's Colossus, used 200,000 AI chips, had a hardware cost of $7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve 2×10^22 16-bit FLOP/s, use two million AI chips, have a hardware cost of $200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness.
    Date 2025-04-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16026
    Accessed 5/12/2025, 5:53:39 PM
    Extra arXiv:2504.16026 [cs]
    DOI 10.48550/arXiv.2504.16026
    Repository arXiv
    Archive ID arXiv:2504.16026
    Date Added 5/12/2025, 5:53:39 PM
    Modified 5/12/2025, 5:53:42 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
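
    Illustrative arithmetic (not from the paper's dataset): the 2030 figure in the abstract follows from exponential extrapolation at the reported nine-month doubling time. The March 2025 starting performance below is an assumed ballpark, not the paper's measured value.

      # Doubling-time extrapolation: P(t) = P0 * 2 ** (months / doubling_months).
      P0_FLOPS = 2e20          # assumed 16-bit FLOP/s of the leading system in March 2025
      DOUBLING_MONTHS = 9      # performance doubling time reported in the abstract
      MONTHS_TO_2030 = 60      # March 2025 -> roughly March 2030

      projected = P0_FLOPS * 2 ** (MONTHS_TO_2030 / DOUBLING_MONTHS)
      print(f"projected leading system in 2030: {projected:.1e} FLOP/s")   # ~2e22, as in the abstract
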
  • Safety Alignment Should Be Made More Than Just a Few Tokens Deep

    Item Type Preprint
    Author Xiangyu Qi
    Author Ashwinee Panda
    Author Kaifeng Lyu
    Author Xiao Ma
    Author Subhrajit Roy
    Author Ahmad Beirami
    Author Prateek Mittal
    Author Peter Henderson
    Abstract The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep.
    Date 2024-06-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.05946
    Accessed 5/12/2025, 5:43:58 PM
    Extra arXiv:2406.05946 [cs]
    DOI 10.48550/arXiv.2406.05946
    Repository arXiv
    Archive ID arXiv:2406.05946
    Date Added 5/12/2025, 5:43:58 PM
    Modified 5/12/2025, 5:43:58 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
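
    Illustrative sketch, not the paper's objective: the "regularized finetuning objective ... constraining updates on initial tokens" suggests a per-position divergence penalty weighted toward the first few output tokens, measured against the frozen initial aligned model. The weighting scheme and the use of KL below are assumptions.

      # Early-token KL penalty: discourage drift from the initial aligned model
      # on the first few output tokens during fine-tuning. Illustrative only.
      import torch
      import torch.nn.functional as F

      def early_token_kl_penalty(logits_new, logits_init, n_protected=5, weight=1.0):
          """logits_*: (seq_len, vocab). Penalise KL(init || new) on early positions."""
          log_p_new = F.log_softmax(logits_new, dim=-1)
          p_init = F.softmax(logits_init, dim=-1)
          kl_per_token = (p_init * (p_init.clamp_min(1e-9).log() - log_p_new)).sum(-1)
          pos_weights = torch.zeros_like(kl_per_token)
          pos_weights[:n_protected] = weight            # constrain only the early positions here
          return (pos_weights * kl_per_token).sum()

      # total_loss = task_loss + early_token_kl_penalty(logits_new, logits_init.detach())
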
  • Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack

    Item Type Preprint
    Author Mark Russinovich
    Author Ahmed Salem
    Author Ronen Eldan
    Abstract Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models.
    Date 2025-02-26
    Short Title Great, Now Write an Article About That
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.01833
    Accessed 5/12/2025, 12:48:39 PM
    Extra arXiv:2404.01833 [cs]
    DOI 10.48550/arXiv.2404.01833
    Repository arXiv
    Archive ID arXiv:2404.01833
    Date Added 5/12/2025, 12:48:39 PM
    Modified 5/12/2025, 12:48:42 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: Accepted at USENIX Security 2025

    Attachments

    • Full Text PDF
    • Snapshot
  • LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities

    Item Type Preprint
    Author Thomas Schmied
    Author Jörg Bornschein
    Author Jordi Grau-Moya
    Author Markus Wulfmeier
    Author Razvan Pascanu
    Abstract The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as ε-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making.
    Date 2025-04-22
    Short Title LLMs are Greedy Agents
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16078
    Accessed 5/12/2025, 5:49:16 PM
    Extra arXiv:2504.16078 [cs]
    DOI 10.48550/arXiv.2504.16078
    Repository arXiv
    Archive ID arXiv:2504.16078
    Date Added 5/12/2025, 5:49:16 PM
    Modified 5/12/2025, 5:49:16 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
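
    For reference alongside the abstract's mention of classic exploration mechanisms, the textbook ε-greedy rule for a multi-armed bandit looks like this (standard algorithm, not code from the paper):

      # Standard epsilon-greedy action selection for a multi-armed bandit.
      import random

      def epsilon_greedy(value_estimates, epsilon):
          """Explore a random arm with probability epsilon, otherwise exploit the best estimate."""
          if random.random() < epsilon:
              return random.randrange(len(value_estimates))
          return max(range(len(value_estimates)), key=lambda a: value_estimates[a])

      print(epsilon_greedy([0.2, 0.5, 0.1], epsilon=0.1))   # usually arm 1, occasionally random
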
  • The Leaderboard Illusion

    Item Type Preprint
    Author Shivalika Singh
    Author Yiyang Nan
    Author Alex Wang
    Author Daniel D'Souza
    Author Sayash Kapoor
    Author Ahmet Üstün
    Author Sanmi Koyejo
    Author Yuntian Deng
    Author Shayne Longpre
    Author Noah Smith
    Author Beyza Ermis
    Author Marzieh Fadaee
    Author Sara Hooker
    Abstract Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field
    Date 2025-04-29
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.20879
    Accessed 5/12/2025, 11:45:01 AM
    Extra arXiv:2504.20879 [cs]
    DOI 10.48550/arXiv.2504.20879
    Repository arXiv
    Archive ID arXiv:2504.20879
    Date Added 5/12/2025, 11:45:01 AM
    Modified 5/12/2025, 11:45:04 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Statistics - Methodology

    Notes:

    • Comment: 68 pages, 18 figures, 9 tables

    Attachments

    • Preprint PDF
    • Snapshot
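
    Illustrative simulation (synthetic numbers, not the paper's data): the selective-disclosure issue in the abstract, testing many private variants and reporting only the best, is a standard max-of-noisy-samples bias, visible even when every variant has identical true quality.

      # Reporting the best of k noisy variants inflates the expected published score.
      import random, statistics

      def reported_score(true_skill, noise_sd, k_variants):
          """Published score: the max over k privately tested variants of equal true skill."""
          return max(random.gauss(true_skill, noise_sd) for _ in range(k_variants))

      random.seed(0)
      honest = [reported_score(1000, 30, k_variants=1) for _ in range(10_000)]
      cherry = [reported_score(1000, 30, k_variants=27) for _ in range(10_000)]
      print("mean reported score, 1 variant  :", round(statistics.mean(honest)))
      print("mean reported score, 27 variants:", round(statistics.mean(cherry)))   # well above 1000
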
  • 7+ tractable directions in AI control

    Item Type Journal Article
    Author Julian Stastny
    Author ryan_greenblatt
    Abstract In this post, we list 7 of our favorite easy-to-start directions in AI control. (Really, projects that are at least adjacent to AI control; We includ…
    Date 2025-04-28
    Language en
    Library Catalog www.lesswrong.com
    URL https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-control
    Accessed 5/12/2025, 5:13:33 PM
    Date Added 5/12/2025, 5:13:33 PM
    Modified 5/12/2025, 5:13:33 PM

    Attachments

    • Snapshot
  • Audit Cards: Contextualizing AI Evaluations

    Item Type Preprint
    Author Leon Staufer
    Author Mick Yang
    Author Anka Reuel
    Author Stephen Casper
    Abstract AI governance frameworks increasingly rely on audits, yet the results of their underlying evaluations require interpretation and context to be meaningfully informative. Even technically rigorous evaluations can offer little useful insight if reported selectively or obscurely. Current literature focuses primarily on technical best practices, but evaluations are an inherently sociotechnical process, and there is little guidance on reporting procedures and context. Through literature review, stakeholder interviews, and analysis of governance frameworks, we propose "audit cards" to make this context explicit. We identify six key types of contextual features to report and justify in audit cards: auditor identity, evaluation scope, methodology, resource access, process integrity, and review mechanisms. Through analysis of existing evaluation reports, we find significant variation in reporting practices, with most reports omitting crucial contextual information such as auditors' backgrounds, conflicts of interest, and the level and type of access to models. We also find that most existing regulations and frameworks lack guidance on rigorous reporting. In response to these shortcomings, we argue that audit cards can provide a structured format for reporting key claims alongside their justifications, enhancing transparency, facilitating proper interpretation, and establishing trust in reporting.
    Date 2025-04-18
    Short Title Audit Cards
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.13839
    Accessed 5/11/2025, 6:45:44 PM
    Extra arXiv:2504.13839 [cs]
    DOI 10.48550/arXiv.2504.13839
    Repository arXiv
    Archive ID arXiv:2504.13839
    Date Added 5/11/2025, 6:45:44 PM
    Modified 5/11/2025, 6:45:47 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Interesting That No One Thinks AlphaFold Is Conscious: Anil Seth

    Item Type Web Page
    Author Officechai Team
    Abstract There is plenty of speculation on whether current AI systems are conscious, or how they could be made conscious, but humans seem to be displaying a blind spot in choosing which AI systems they feel
    Date 2025-04-17T23:40:14+05:30
    Language en-US
    Short Title Interesting That No One Thinks AlphaFold Is Conscious
    URL https://officechai.com/ai/interesting-that-no-one-thinks-alphafold-is-conscious-anil-seth/
    Accessed 5/11/2025, 7:05:35 PM
    Website Title OfficeChai
    Date Added 5/11/2025, 7:05:35 PM
    Modified 5/11/2025, 7:05:35 PM

    Attachments

    • Snapshot
  • Investigating task-specific prompts and sparse autoencoders for activation monitoring

    Item Type Preprint
    Author Henk Tillman
    Author Dan Mossing
    Abstract Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer's activations. Recent work has proposed several approaches which may improve on naive linear probing, by leveraging additional computation. One class of techniques, which we call "prompted probing," leverages test time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. However, one can also prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other. We find asking the model zero-shot is a reasonable baseline when inference-time compute is not limited; however, activation probing methods can substantially outperform this baseline given sufficient training data. Specifically, we recommend prompted probing when inference-time compute is available, due to its superior data efficiency and good generalization performance. Alternatively, if inference-time compute is limited, we find SAE-based probing methods outperform raw activation probing.
    Date 2025-04-28
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.20271
    Accessed 5/12/2025, 1:53:20 PM
    Extra arXiv:2504.20271 [cs]
    DOI 10.48550/arXiv.2504.20271
    Repository arXiv
    Archive ID arXiv:2504.20271
    Date Added 5/12/2025, 1:53:20 PM
    Modified 5/12/2025, 1:53:23 PM

    Tags:

    • Computer Science - Machine Learning

    Notes:

    • Comment: 18 pages, 13 figures

    Attachments

    • Preprint PDF
    • Snapshot
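
    Illustrative sketch of the baseline the abstract describes, a logistic-regression probe on one layer's activations; the activations below are synthetic placeholders, and prompted or SAE-based probing would change only how the feature vectors are produced.

      # Baseline activation probe: logistic regression on pooled hidden states.
      import numpy as np
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      rng = np.random.default_rng(0)
      X = rng.normal(size=(2000, 4096))     # (examples, d_model); placeholders for real activations
      y = rng.integers(0, 2, size=2000)     # behaviour labels, e.g. flagged vs. benign

      X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
      probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
      print("held-out accuracy:", probe.score(X_te, y_te))   # ~0.5 here, since the data is random
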
  • Understanding Chain-of-Thought in LLMs through Information Theory

    Item Type Preprint
    Author Jean-Francois Ton
    Author Muhammad Faaiz Taufiq
    Author Yang Liu
    Abstract Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the "information gain" at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks.
    Date 2024-11-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.11984
    Accessed 5/13/2025, 11:58:53 AM
    Extra arXiv:2411.11984 [cs]
    DOI 10.48550/arXiv.2411.11984
    Repository arXiv
    Archive ID arXiv:2411.11984
    Date Added 5/13/2025, 11:58:53 AM
    Modified 5/13/2025, 11:58:54 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
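
    One standard way to write the per-step "information gain" the abstract refers to, with Y the final answer and S_t the t-th intermediate reasoning step, is the conditional mutual information below; the paper's exact estimator may differ.

      \mathrm{IG}_t \;=\; I\left(Y ;\, S_t \mid S_{<t}\right) \;=\; H\left(Y \mid S_{<t}\right) - H\left(Y \mid S_{\le t}\right)
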
  • Towards Internet-Scale Training For Agents

    Item Type Preprint
    Author Brandon Trabucco
    Author Gunnar Sigurdsson
    Author Robinson Piramuthu
    Author Ruslan Salakhutdinov
    Abstract The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks with an 89% rate, and judging successful trajectories with an 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1% respectively for agents trained on mixtures of data from our pipeline, and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: data-for-agents.github.io.
    Date 2025-02-10
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.06776
    Accessed 5/11/2025, 7:23:53 PM
    Extra arXiv:2502.06776 [cs]
    DOI 10.48550/arXiv.2502.06776
    Repository arXiv
    Archive ID arXiv:2502.06776
    Date Added 5/11/2025, 7:23:53 PM
    Modified 5/11/2025, 7:23:53 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
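
    Sketch (added): the three-stage pipeline described above (task proposal, agent rollout, LLM judging) is essentially a generate-and-filter loop. The function names below are hypothetical placeholders for the LLM-backed components, not the released code.

        def build_training_data(websites, propose_tasks, run_agent, judge_success):
            # Stage 1: an LLM proposes tasks for each site.
            # Stage 2: an agent attempts each task and records a trajectory.
            # Stage 3: an LLM judge keeps only trajectories it rates successful.
            dataset = []
            for site in websites:
                for task in propose_tasks(site):
                    trajectory = run_agent(site, task)
                    if judge_success(task, trajectory):
                        dataset.append({"site": site, "task": task, "trajectory": trajectory})
            return dataset
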
  • Do Larger Language Models Imply Better Reasoning? A Pretraining Scaling Law for Reasoning

    Item Type Preprint
    Author Xinyi Wang
    Author Shawn Tan
    Author Mingyu Jin
    Author William Yang Wang
    Author Rameswar Panda
    Author Yikang Shen
    Abstract Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling law that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks.
    Date 2025-04-04
    Short Title Do Larger Language Models Imply Better Reasoning?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.03635
    Accessed 5/11/2025, 7:10:21 PM
    Extra arXiv:2504.03635 [cs]
    DOI 10.48550/arXiv.2504.03635
    Repository arXiv
    Archive ID arXiv:2504.03635
    Date Added 5/11/2025, 7:10:21 PM
    Modified 5/11/2025, 7:10:21 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
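
    Sketch (added): the reported relationship is a linear map from knowledge-graph search entropy to optimal model size, which can be fit by ordinary least squares. The data points and fitted coefficients below are placeholders for illustration, not values from the paper.

        import numpy as np

        # Hypothetical (search entropy, optimal parameter count) measurements; the
        # paper's actual points come from its synthetic knowledge-graph experiments.
        entropy = np.array([2.1, 3.4, 4.0, 5.2])
        optimal_params = np.array([8e6, 2.1e7, 2.9e7, 4.4e7])

        # Fit the linear scaling law N*(H) = a * H + b.
        a, b = np.polyfit(entropy, optimal_params, deg=1)

        def predict_optimal_size(graph_entropy):
            # Predicted optimal model size for a graph with the given search entropy.
            return a * graph_entropy + b
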
  • Base Models Beat Aligned Models at Randomness and Creativity

    Item Type Preprint
    Author Peter West
    Author Christopher Potts
    Abstract Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities.
    Date 2025-04-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.00047
    Accessed 5/12/2025, 12:49:20 PM
    Extra arXiv:2505.00047 [cs]
    DOI 10.48550/arXiv.2505.00047
    Repository arXiv
    Archive ID arXiv:2505.00047
    Date Added 5/12/2025, 12:49:20 PM
    Modified 5/12/2025, 12:49:22 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
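
    Sketch (added): the random-number-generation task above can be probed with a simple uniformity check: sample digits from a model and compare their counts to the uniform distribution with a chi-square statistic, which makes biases such as over-producing "7" visible. The sample_digit callable is a hypothetical model interface.

        from collections import Counter

        def digit_chi_square(sample_digit, n_samples=1000, k=10):
            # Chi-square statistic for digits 0..k-1 drawn from the model.
            # Larger values indicate a stronger departure from uniform randomness.
            counts = Counter(sample_digit() for _ in range(n_samples))
            expected = n_samples / k
            return sum((counts.get(d, 0) - expected) ** 2 / expected for d in range(k))
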
  • MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

    Item Type Preprint
    Author Chejian Xu
    Author Jiawei Zhang
    Author Zhaorun Chen
    Author Chulin Xie
    Author Mintong Kang
    Author Yujin Potter
    Author Zhun Wang
    Author Zhuowen Yuan
    Author Alexander Xiong
    Author Zidi Xiong
    Author Chenhui Zhang
    Author Lingzhi Yuan
    Author Yi Zeng
    Author Peiyang Xu
    Author Chengquan Guo
    Author Andy Zhou
    Author Jeffrey Ziwei Tan
    Author Xuandong Zhao
    Author Francesco Pinto
    Author Zhen Xiang
    Author Yu Gai
    Author Zinan Lin
    Author Dan Hendrycks
    Author Bo Li
    Author Dawn Song
    Abstract Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/.
    Date 2025-03-19
    Short Title MMDT
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.14827
    Accessed 5/12/2025, 5:44:53 PM
    Extra arXiv:2503.14827 [cs]
    DOI 10.48550/arXiv.2503.14827
    Repository arXiv
    Archive ID arXiv:2503.14827
    Date Added 5/12/2025, 5:44:53 PM
    Modified 5/12/2025, 5:44:53 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: ICLR 2025

    Attachments

    • Preprint PDF
    • Snapshot
  • A Survey of AI Agent Protocols

    Item Type Preprint
    Author Yingxuan Yang
    Author Huacan Chai
    Author Yuanyi Song
    Author Siyuan Qi
    Author Muning Wen
    Author Ning Li
    Author Junwei Liao
    Author Haoyi Hu
    Author Jianghao Lin
    Author Gaowei Chang
    Author Weiwen Liu
    Author Ying Wen
    Author Yong Yu
    Author Weinan Zhang
    Abstract The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and trigger the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two-dimensional classification that differentiates context-oriented versus inter-agent protocols and general-purpose versus domain-specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next-generation protocols. These characteristics include adaptability, privacy preservation, and group-based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents.
    Date 2025-04-26
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.16736
    Accessed 5/12/2025, 1:45:09 PM
    Extra arXiv:2504.16736 [cs]
    DOI 10.48550/arXiv.2504.16736
    Repository arXiv
    Archive ID arXiv:2504.16736
    Date Added 5/12/2025, 1:45:09 PM
    Modified 5/12/2025, 1:45:11 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
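
    Sketch (added): the survey's two-dimensional classification (context-oriented vs. inter-agent, general-purpose vs. domain-specific) can be captured with a small data structure. The example placement below is an illustrative assumption, not the survey's own table.

        from dataclasses import dataclass
        from enum import Enum

        class Orientation(Enum):
            CONTEXT = "context-oriented"    # agent <-> tools and data sources
            INTER_AGENT = "inter-agent"     # agent <-> agent communication

        class Scope(Enum):
            GENERAL = "general-purpose"
            DOMAIN = "domain-specific"

        @dataclass
        class AgentProtocol:
            name: str
            orientation: Orientation
            scope: Scope

        # Illustrative placement of one well-known protocol (an assumption).
        example = AgentProtocol("Model Context Protocol", Orientation.CONTEXT, Scope.GENERAL)
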
  • Robust LLM safeguarding via refusal feature adversarial training

    Item Type Preprint
    Author Lei Yu
    Author Virginie Do
    Author Karen Hambardzumyan
    Author Nicola Cancedda
    Abstract Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods.
    Date 2025-03-20
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2409.20089
    Accessed 5/12/2025, 5:45:04 PM
    Extra arXiv:2409.20089 [cs]
    DOI 10.48550/arXiv.2409.20089
    Repository arXiv
    Archive ID arXiv:2409.20089
    Date Added 5/12/2025, 5:45:04 PM
    Modified 5/12/2025, 5:45:04 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
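
    Sketch (added): refusal feature ablation (RFA) amounts to removing the component of residual-stream activations along a 'refusal direction'. The estimator below (difference of mean activations on harmful vs. harmless prompts) follows prior refusal-direction work and is an assumption here; it is not a restatement of ReFAT's training loop.

        import numpy as np

        def refusal_direction(harmful_acts, harmless_acts):
            # Unit-norm difference of mean activations; inputs are assumed to be
            # [n, d_model] arrays of residual-stream states.
            direction = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
            return direction / np.linalg.norm(direction)

        def ablate_refusal_feature(activations, direction):
            # h' = h - (h . r) r for each activation h, with r a unit vector.
            return activations - np.outer(activations @ direction, direction)
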
  • RepliBench: measuring autonomous replication capabilities in AI systems

    Item Type Web Page
    Abstract A comprehensive benchmark to detect emerging replication abilities in AI systems and provide a quantifiable understanding of potential risks
    Language en
    Short Title RepliBench
    URL https://www.aisi.gov.uk/work/replibench-measuring-autonomous-replication-capabilities-in-ai-systems
    Accessed 5/11/2025, 6:40:49 PM
    Website Title AI Security Institute
    Date Added 5/11/2025, 6:40:49 PM
    Modified 5/11/2025, 6:40:53 PM

    Attachments

    • Snapshot
  • Unlocking New Jailbreaks with AI Explainability

    Item Type Web Page
    Abstract TL;DR In this post, we introduce our “Adversarial AI Explainability” research, a term we use to describe the intersection of AI explainability and adversarial attacks on Large Language Models...
    Language en
    URL https://www.cyberark.com/resources/threat-research-blog/unlocking-new-jailbreaks-with-ai-explainability
    Accessed 5/12/2025, 12:10:49 PM
    Date Added 5/12/2025, 12:10:49 PM
    Modified 5/12/2025, 12:10:53 PM

    Attachments

    • Snapshot
  • Singapore_Consensus_2025.pdf

    Item Type Attachment
    URL https://aisafetypriorities.org/files/Singapore_Consensus_2025.pdf
    Accessed 5/12/2025, 12:33:31 PM
    Date Added 5/12/2025, 12:33:31 PM
    Modified 5/12/2025, 12:33:31 PM
  • Gemini 2.5 Pro Preview Model Card

    Item Type Journal Article
    Language en
    Library Catalog Zotero
    Date Added 5/12/2025, 4:35:06 PM
    Modified 5/12/2025, 4:35:06 PM

    Attachments

    • PDF
  • Modifying LLM Beliefs with Synthetic Document Finetuning

    Item Type Web Page
    URL https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
    Accessed 5/12/2025, 5:45:43 PM
    Date Added 5/12/2025, 5:45:43 PM
    Modified 5/12/2025, 5:45:43 PM

    Attachments

    • Modifying LLM Beliefs with Synthetic Document Finetuning
  • o3-and-o4-mini-system-card.pdf

    Item Type Attachment
    URL https://cdn.openai.com/pdf/2221c875-02dc-4789-800b-e7758f3722c1/o3-and-o4-mini-system-card.pdf
    Accessed 5/12/2025, 5:51:18 PM
    Date Added 5/12/2025, 5:51:18 PM
    Modified 5/12/2025, 5:51:18 PM