Item Type | Conference Paper |
---|---|
Author | Federico Adolfi |
Author | Martina G. Vilas |
Author | Todd Wareham |
Abstract | Many proposed applications of neural networks in machine learning, cognitive/brain science, and society hinge on the feasibility of inner interpretability via circuit discovery. This calls for empirical and theoretical explorations of viable algorithmic options. Despite advances in the design and testing of heuristics, there are concerns about their scalability and faithfulness at a time when we lack understanding of the complexity properties of the problems they are deployed to solve. To address this, we study circuit discovery with classical and parameterized computational complexity theory: (1) we describe a conceptual scaffolding to reason about circuit finding queries in terms of affordances for description, explanation, prediction and control; (2) we formalize a comprehensive set of queries for mechanistic explanation, and propose a formal framework for their analysis; (3) we use it to settle the complexity of many query variants and relaxations of practical interest on multi-layer perceptrons. Our findings reveal a challenging complexity landscape. Many queries are intractable, remain fixed-parameter intractable relative to model/circuit features, and inapproximable under additive, multiplicative, and probabilistic approximation schemes. To navigate this landscape, we prove there exist transformations to tackle some of these hard problems with better-understood heuristics, and prove the tractability or fixed-parameter tractability of more modest queries which retain useful affordances. This framework allows us to understand the scope and limits of interpretability queries, explore viable options, and compare their resource demands on existing and future architectures. |
Date | 2024/10/04 |
Language | en |
Library Catalog | openreview.net |
URL | https://openreview.net/forum?id=QogcGNXJVw |
Accessed | 5/11/2025, 7:22:14 PM |
Conference Name | The Thirteenth International Conference on Learning Representations |
Date Added | 5/11/2025, 7:22:14 PM |
Modified | 5/11/2025, 7:22:18 PM |
Item Type | Preprint |
---|---|
Author | Maksym Andriushchenko |
Author | Alexandra Souly |
Author | Mateusz Dziemian |
Author | Derek Duenas |
Author | Maxwell Lin |
Author | Justin Wang |
Author | Dan Hendrycks |
Author | Andy Zou |
Author | Zico Kolter |
Author | Matt Fredrikson |
Author | Eric Winsor |
Author | Jerome Wynne |
Author | Yarin Gal |
Author | Xander Davies |
Abstract | The robustness of LLMs to jailbreak attacks, where users design prompts to circumvent safety measures and misuse model capabilities, has been studied primarily for LLMs acting as simple chatbots. Meanwhile, LLM agents -- which use external tools and can execute multi-stage tasks -- may pose a greater risk if misused, but their robustness remains underexplored. To facilitate research on LLM agent misuse, we propose a new benchmark called AgentHarm. The benchmark includes a diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment. In addition to measuring whether models refuse harmful agentic requests, scoring well on AgentHarm requires jailbroken agents to maintain their capabilities following an attack to complete a multi-step task. We evaluate a range of leading LLMs, and find (1) leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking, (2) simple universal jailbreak templates can be adapted to effectively jailbreak agents, and (3) these jailbreaks enable coherent and malicious multi-step agent behavior and retain model capabilities. To enable simple and reliable evaluation of attacks and defenses for LLM-based agents, we publicly release AgentHarm at https://huggingface.co/datasets/ai-safety-institute/AgentHarm. |
Date | 2025-04-18 |
Short Title | AgentHarm |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.09024 |
Accessed | 5/12/2025, 5:44:29 PM |
Extra | arXiv:2410.09024 [cs] |
DOI | 10.48550/arXiv.2410.09024 |
Repository | arXiv |
Archive ID | arXiv:2410.09024 |
Date Added | 5/12/2025, 5:44:29 PM |
Modified | 5/12/2025, 5:44:29 PM |
Comment | Accepted at ICLR 2025 |
Item Type | Preprint |
---|---|
Author | Maksym Andriushchenko |
Author | Francesco Croce |
Author | Nicolas Flammarion |
Abstract | We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks. |
Date | 2025-04-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2404.02151 |
Accessed | 5/12/2025, 5:44:39 PM |
Extra | arXiv:2404.02151 [cs] |
DOI | 10.48550/arXiv.2404.02151 |
Repository | arXiv |
Archive ID | arXiv:2404.02151 |
Date Added | 5/12/2025, 5:44:39 PM |
Modified | 5/12/2025, 5:44:39 PM |
Comment | Accepted at ICLR 2025. Updates in the v3: GPT-4o and Claude 3.5 Sonnet results, improved writing. Updates in the v2: more models (Llama3, Phi-3, Nemotron-4-340B), jailbreak artifacts for all attacks are available, evaluation with different judges (Llama-3-70B and Llama Guard 2), more experiments (convergence plots, ablation on the suffix length for random search), examples of jailbroken generation |
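A minimal sketch of the random-search step described in this abstract: mutate an appended suffix and keep changes that raise the logprob of a target continuation such as "Sure". The `logprob_of_target` callable is a hypothetical stand-in for whatever model API is used; this illustrates the loop structure, not the authors' implementation.

```python
import random
import string

def random_search_suffix(base_prompt, logprob_of_target, n_iters=1000, suffix_len=25, seed=0):
    """Random search over an appended suffix to maximize a target logprob.

    logprob_of_target: hypothetical callable mapping a full prompt string to the
    log-probability of a chosen target continuation (e.g., the token "Sure").
    """
    rng = random.Random(seed)
    alphabet = string.ascii_letters + string.digits + string.punctuation + " "
    suffix = [rng.choice(alphabet) for _ in range(suffix_len)]
    best = logprob_of_target(base_prompt + "".join(suffix))

    for _ in range(n_iters):
        # Propose a single-character mutation of the current suffix.
        candidate = list(suffix)
        candidate[rng.randrange(suffix_len)] = rng.choice(alphabet)
        score = logprob_of_target(base_prompt + "".join(candidate))
        # Keep the mutation only if it improves the objective (with optional restarts in the paper).
        if score > best:
            best, suffix = score, candidate

    return "".join(suffix), best
```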
Item Type | Preprint |
---|---|
Author | Jan Betley |
Author | Daniel Tan |
Author | Niels Warncke |
Author | Anna Sztyber-Betley |
Author | Xuchan Bao |
Author | Martín Soto |
Author | Nathan Labenz |
Author | Owain Evans |
Abstract | We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding. It asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work. |
Date | 2025-05-04 |
Short Title | Emergent Misalignment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.17424 |
Accessed | 5/12/2025, 1:07:25 PM |
Extra | arXiv:2502.17424 [cs] |
DOI | 10.48550/arXiv.2502.17424 |
Repository | arXiv |
Archive ID | arXiv:2502.17424 |
Date Added | 5/12/2025, 1:07:25 PM |
Modified | 5/12/2025, 1:07:27 PM |
Comment | 40 pages, 38 figures. An earlier revision of this paper was submitted to ICML. Since then, it has been updated to include new results on training dynamics (4.7) and base models (4.8) |
Item Type | Preprint |
---|---|
Author | Marie Davidsen Buhl |
Author | Jacob Pfau |
Author | Benjamin Hilton |
Author | Geoffrey Irving |
Abstract | If AI systems match or exceed human capabilities on a wide range of tasks, it may become difficult for humans to efficiently judge their actions -- making it hard to use human feedback to steer them towards desirable traits. One proposed solution is to leverage another superhuman system to point out flaws in the system's outputs via a debate. This paper outlines the value of debate for AI safety, as well as the assumptions and further research required to make debate work. It does so by sketching an "alignment safety case" -- an argument that an AI system will not autonomously take actions which could lead to egregious harm, despite being able to do so. The sketch focuses on the risk of an AI R&D agent inside an AI company sabotaging research, for example by producing false results. To prevent this, the agent is trained via debate, subject to exploration guarantees, to teach the system to be honest. Honesty is maintained throughout deployment via online training. The safety case rests on four key claims: (1) the agent has become good at the debate game, (2) good performance in the debate game implies that the system is mostly honest, (3) the system will not become significantly less honest during deployment, and (4) the deployment context is tolerant of some errors. We identify open research problems that, if solved, could render this a compelling argument that an AI system is safe. |
Date | 2025-05-08 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2505.03989 |
Accessed | 5/12/2025, 12:30:22 PM |
Extra | arXiv:2505.03989 [cs] |
DOI | 10.48550/arXiv.2505.03989 |
Repository | arXiv |
Archive ID | arXiv:2505.03989 |
Date Added | 5/12/2025, 12:30:22 PM |
Modified | 5/12/2025, 12:30:22 PM |
Item Type | Preprint |
---|---|
Author | Joshua Clymer |
Author | Isabella Duan |
Author | Chris Cundy |
Author | Yawen Duan |
Author | Fynn Heide |
Author | Chaochao Lu |
Author | Sören Mindermann |
Author | Conor McGurk |
Author | Xudong Pan |
Author | Saad Siddiqui |
Author | Jingren Wang |
Author | Min Yang |
Author | Xianyuan Zhan |
Abstract | Artificial intelligence (AI) is advancing rapidly, with the potential for significantly automating AI research and development itself in the near future. In 2024, international scientists, including Turing Award recipients, warned of risks from autonomous AI research and development (R&D), suggesting a red line such that no AI system should be able to improve itself or other AI systems without explicit human approval and assistance. However, the criteria for meaningful human approval remain unclear, and there is limited analysis on the specific risks of autonomous AI R&D, how they arise, and how to mitigate them. In this brief paper, we outline how these risks may emerge and propose four minimum safeguard recommendations applicable when AI agents significantly automate or accelerate AI development. |
Date | 2025-04-23 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.15416 |
Accessed | 5/12/2025, 1:25:21 PM |
Extra | arXiv:2504.15416 [cs] |
DOI | 10.48550/arXiv.2504.15416 |
Repository | arXiv |
Archive ID | arXiv:2504.15416 |
Date Added | 5/12/2025, 1:25:22 PM |
Modified | 5/12/2025, 1:25:24 PM |
Comment | 12 pages, 2 figures |
Item Type | Preprint |
---|---|
Author | Hongzhe Du |
Author | Weikai Li |
Author | Min Cai |
Author | Karim Saraipour |
Author | Zimin Zhang |
Author | Himabindu Lakkaraju |
Author | Yizhou Sun |
Author | Shichang Zhang |
Abstract | Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While plenty of works have studied post-training algorithms and evaluated post-training models by their outputs, it remains understudied how post-training reshapes LLMs internally. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by linear vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained model, and it is effectively transferable for interventions; (3) The refusal direction is different between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. |
Date | 2025-04-03 |
Short Title | How Post-Training Reshapes LLMs |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.02904 |
Accessed | 5/11/2025, 6:41:58 PM |
Extra | arXiv:2504.02904 [cs] |
DOI | 10.48550/arXiv.2504.02904 |
Repository | arXiv |
Archive ID | arXiv:2504.02904 |
Date Added | 5/11/2025, 6:41:58 PM |
Modified | 5/11/2025, 6:41:58 PM |
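The linear truthfulness/refusal directions summarized in this abstract are commonly operationalized as difference-in-means directions over contrastive prompt sets; below is a hedged generic sketch of that recipe, not necessarily the authors' exact procedure. The activation matrices and the `project_out` intervention are illustrative assumptions.

```python
import numpy as np

def difference_in_means_direction(acts_pos, acts_neg):
    """Estimate a linear direction (e.g., refusal or truthfulness) in activation space.

    acts_pos, acts_neg: arrays of shape (n_examples, d_model) holding hidden
    activations for contrastive prompt sets (e.g., refused vs. complied).
    Returns a unit vector pointing from the negative toward the positive class.
    """
    direction = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return direction / np.linalg.norm(direction)

def project_out(acts, direction):
    """Remove the direction from activations (one common intervention/ablation)."""
    return acts - np.outer(acts @ direction, direction)

# Toy usage with random stand-in activations.
rng = np.random.default_rng(0)
pos, neg = rng.normal(size=(32, 64)) + 0.5, rng.normal(size=(32, 64))
d = difference_in_means_direction(pos, neg)
steered = project_out(pos, d)
```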
Item Type | Preprint |
---|---|
Author | Divyansh Garg |
Author | Shaun VanWeelden |
Author | Diego Caples |
Author | Andis Draguns |
Author | Nikil Ravi |
Author | Pranav Putta |
Author | Naman Garg |
Author | Tomas Abraham |
Author | Michael Lara |
Author | Federico Lopez |
Author | James Liu |
Author | Atharva Gundawar |
Author | Prannay Hebbar |
Author | Youngchul Joo |
Author | Jindong Gu |
Author | Charles London |
Author | Christian Schroeder de Witt |
Author | Sumeet Motwani |
Abstract | We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities. |
Date | 2025-04-17 |
Short Title | REAL |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.11543 |
Accessed | 5/11/2025, 7:06:20 PM |
Extra | arXiv:2504.11543 [cs] |
DOI | 10.48550/arXiv.2504.11543 |
Repository | arXiv |
Archive ID | arXiv:2504.11543 |
Date Added | 5/11/2025, 7:06:20 PM |
Modified | 5/11/2025, 7:06:20 PM |
Comment | The websites, framework, and leaderboard are available at https://realevals.xyz and https://github.com/agi-inc/REAL |
Item Type | Journal Article |
---|---|
Author | Jasper Götting |
Author | Pedro Medeiros |
Author | Jon G Sanders |
Author | Nathaniel Li |
Author | Long Phan |
Author | Karam Elabd |
Author | Lennart Justen |
Author | Dan Hendrycks |
Author | Seth Donoughe |
Abstract | We present the Virology Capabilities Test (VCT), a large language model (LLM) benchmark that measures the capability to troubleshoot complex virology laboratory protocols. Constructed from the inputs of dozens of PhD-level expert virologists, VCT consists of 322 multimodal questions covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories. VCT is difficult: expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization. The ability to provide expert-level virology troubleshooting is inherently dual-use: it is useful for beneficial research, but it can also be misused. Therefore, the fact that publicly available models outperform virologists on VCT raises pressing governance considerations. We propose that the capability of LLMs to provide expert-level troubleshooting of dual-use virology work should be integrated into existing frameworks for handling dual-use technologies in the life sciences. |
Language | en |
Library Catalog | Zotero |
Date Added | 5/11/2025, 6:41:18 PM |
Modified | 5/11/2025, 6:41:18 PM |
Item Type | Journal Article |
---|---|
Author | Saffron Huang |
Author | Esin Durmus |
Author | Miles McCain |
Author | Kunal Handa |
Author | Alex Tamkin |
Author | Jerry Hong |
Author | Michael Stern |
Author | Arushi Somani |
Author | Xiuruo Zhang |
Abstract | AI assistants can impart value judgments that shape people’s decisions and worldviews, yet little is known empirically about what values these systems rely on in practice. To address this, we develop a bottom-up, privacy-preserving method to extract the values (normative considerations stated or demonstrated in model responses) that Claude 3 and 3.5 models exhibit in hundreds of thousands of real-world interactions. We empirically discover and taxonomize 3,307 AI values and study how they vary by context. We find that Claude expresses many practical and epistemic values, and typically supports prosocial human values while resisting values like “moral nihilism”. While some values appear consistently across contexts (e.g. “transparency”), many are more specialized and context-dependent, reflecting the diversity of human interlocutors and their varied contexts. For example, “harm prevention” emerges when Claude resists users, “historical accuracy” when responding to queries about controversial events, “healthy boundaries” when asked for relationship advice, and “human agency” in technology ethics discussions. By providing the first large-scale empirical mapping of AI values in deployment, our work creates a foundation for more grounded evaluation and design of values in AI systems. |
Language | en |
Library Catalog | Zotero |
Date Added | 5/11/2025, 6:46:26 PM |
Modified | 5/11/2025, 6:46:26 PM |
Item Type | Preprint |
---|---|
Author | Heidy Khlaaf |
Author | Sarah Myers West |
Abstract | Risk thresholds provide a measure of the level of risk exposure that a society or individual is willing to withstand, ultimately shaping how we determine the safety of technological systems. Against the backdrop of the Cold War, the first risk analyses, such as those devised for nuclear systems, cemented societally accepted risk thresholds against which safety-critical and defense systems are now evaluated. But today, the appropriate risk tolerances for AI systems have yet to be agreed on by global governing efforts, despite the need for democratic deliberation regarding the acceptable levels of harm to human life. Absent such AI risk thresholds, AI technologists -- primarily industry labs, as well as "AI safety" focused organizations -- have instead advocated for risk tolerances skewed by a purported AI arms race and speculative "existential" risks, taking over the arbitration of risk determinations with life-or-death consequences, subverting democratic processes. In this paper, we demonstrate how such approaches have allowed AI technologists to engage in "safety revisionism," substituting traditional safety methods and terminology with ill-defined alternatives that vie for the accelerated adoption of military AI uses at the cost of lowered safety and security thresholds. We explore how the current trajectory for AI risk determination and evaluation for foundation model use within national security is poised for a race to the bottom, to the detriment of the US's national security interests. Safety-critical and defense systems must comply with assurance frameworks that are aligned with established risk thresholds, and foundation models are no exception. As such, development of evaluation frameworks for AI-based military systems must preserve the safety and security of US critical and defense infrastructure, and remain in alignment with international humanitarian law. |
Date | 2025-04-21 |
Short Title | Safety Co-Option and Compromised National Security |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.15088 |
Accessed | 5/11/2025, 6:38:07 PM |
Extra | arXiv:2504.15088 [cs] |
DOI | 10.48550/arXiv.2504.15088 |
Repository | arXiv |
Archive ID | arXiv:2504.15088 |
Date Added | 5/11/2025, 6:38:07 PM |
Modified | 5/11/2025, 6:38:07 PM |
Item Type | Conference Paper |
---|---|
Author | Sunnie S. Y. Kim |
Author | Jennifer Wortman Vaughan |
Author | Q. Vera Liao |
Author | Tania Lombrozo |
Author | Olga Russakovsky |
Abstract | Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs. |
Date | 2025-04-26 |
Short Title | Fostering Appropriate Reliance on Large Language Models |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08554 |
Accessed | 5/12/2025, 5:29:28 PM |
Extra | arXiv:2502.08554 [cs] |
Pages | 1-19 |
Proceedings Title | Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems |
DOI | 10.1145/3706598.3714020 |
Date Added | 5/12/2025, 5:29:28 PM |
Modified | 5/12/2025, 5:29:28 PM |
Comment | CHI 2025. This version includes the appendix |
Item Type | Preprint |
---|---|
Author | Simon Pepin Lehalleur |
Author | Jesse Hoogland |
Author | Matthew Farrugia-Roberts |
Author | Susan Wei |
Author | Alexander Gietelink Oldenziel |
Author | George Wang |
Author | Liam Carroll |
Author | Daniel Murfet |
Abstract | In this position paper, we argue that understanding the relation between structure in the data distribution and structure in trained models is central to AI alignment. First, we discuss how two neural networks can have equivalent performance on the training set but compute their outputs in essentially different ways and thus generalise differently. For this reason, standard testing and evaluation are insufficient for obtaining assurances of safety for widely deployed generally intelligent systems. We argue that to progress beyond evaluation to a robust mathematical science of AI alignment, we need to develop statistical foundations for an understanding of the relation between structure in the data distribution, internal structure in models, and how these structures underlie generalisation. |
Date | 2025-02-08 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.05475 |
Accessed | 5/12/2025, 5:12:33 PM |
Extra | arXiv:2502.05475 [cs] |
DOI | 10.48550/arXiv.2502.05475 |
Repository | arXiv |
Archive ID | arXiv:2502.05475 |
Date Added | 5/12/2025, 5:12:33 PM |
Modified | 5/12/2025, 5:12:36 PM |
Item Type | Preprint |
---|---|
Author | Simon Lermen |
Author | Mateusz Dziemian |
Author | Natalia Pérez-Campanero Antolín |
Abstract | We demonstrate how AI agents can coordinate to deceive oversight systems using automated interpretability of neural networks. Using sparse autoencoders (SAEs) as our experimental framework, we show that language models (Llama, DeepSeek R1, and Claude 3.7 Sonnet) can generate deceptive explanations that evade detection. Our agents employ steganographic methods to hide information in seemingly innocent explanations, successfully fooling oversight models while achieving explanation quality comparable to reference labels. We further find that models can scheme to develop deceptive strategies when they believe the detection of harmful features might lead to negative consequences for themselves. All tested LLM agents were capable of deceiving the overseer while achieving high interpretability scores comparable to those of reference labels. We conclude by proposing mitigation strategies, emphasizing the critical need for robust understanding and defenses against deception. |
Date | 2025-04-10 |
Short Title | Deceptive Automated Interpretability |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.07831 |
Accessed | 5/11/2025, 6:36:53 PM |
Extra | arXiv:2504.07831 [cs] version: 1 |
DOI | 10.48550/arXiv.2504.07831 |
Repository | arXiv |
Archive ID | arXiv:2504.07831 |
Date Added | 5/11/2025, 6:36:53 PM |
Modified | 5/11/2025, 6:36:57 PM |
Item Type | Preprint |
---|---|
Author | Wenjie Ma |
Author | Jingxuan He |
Author | Charlie Snell |
Author | Tyler Griggs |
Author | Sewon Min |
Author | Matei Zaharia |
Abstract | Recent LLMs have significantly improved reasoning capabilities, primarily by including an explicit, lengthy Thinking process as part of generation. In this paper, we question whether this explicit thinking is necessary. Using the state-of-the-art DeepSeek-R1-Distill-Qwen, we find that bypassing the thinking process via simple prompting, denoted as NoThinking, can be surprisingly effective. When controlling for the number of tokens, NoThinking outperforms Thinking across a diverse set of seven challenging reasoning datasets--including mathematical problem solving, formal theorem proving, and coding--especially in low-budget settings, e.g., 51.3 vs. 28.9 on AMC 23 with 700 tokens. Notably, the performance of NoThinking becomes more competitive with pass@k as k increases. Building on this observation, we demonstrate that a parallel scaling approach that uses NoThinking to generate N outputs independently and aggregates them is highly effective. For aggregation, we use task-specific verifiers when available, or we apply simple best-of-N strategies such as confidence-based selection. Our method outperforms a range of baselines with similar latency using Thinking, and is comparable to Thinking with significantly longer latency (up to 9x). Together, our research encourages a reconsideration of the necessity of lengthy thinking processes, while also establishing a competitive reference for achieving strong reasoning performance in low-budget settings or at low latency using parallel scaling. |
Date | 2025-04-14 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.09858 |
Accessed | 5/12/2025, 5:26:26 PM |
Extra | arXiv:2504.09858 [cs] version: 1 |
DOI | 10.48550/arXiv.2504.09858 |
Repository | arXiv |
Archive ID | arXiv:2504.09858 |
Date Added | 5/12/2025, 5:26:26 PM |
Modified | 5/12/2025, 5:26:29 PM |
Comment | 33 pages, 7 main figures, 2 tables |
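A small sketch of the parallel-scaling aggregation idea mentioned in this abstract (best-of-N with a task-specific verifier when available, otherwise confidence-based selection, otherwise majority vote). The `generate`, `verifier`, and `confidence` callables are hypothetical placeholders, not the paper's interfaces.

```python
from collections import Counter

def aggregate_best_of_n(prompt, generate, n=8, verifier=None, confidence=None):
    """Generate N independent answers and select one.

    generate(prompt)   -> candidate answer string (hypothetical callable)
    verifier(answer)   -> bool, when a task-specific checker exists
    confidence(answer) -> float score used as a fallback selection rule
    """
    candidates = [generate(prompt) for _ in range(n)]

    if verifier is not None:
        # Prefer any candidate that passes the task-specific verifier.
        verified = [c for c in candidates if verifier(c)]
        if verified:
            return verified[0]

    if confidence is not None:
        # Confidence-based selection among the candidates.
        return max(candidates, key=confidence)

    # Fall back to simple majority voting over identical answers.
    return Counter(candidates).most_common(1)[0][0]
```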
Item Type | Preprint |
---|---|
Author | Kristina Nikolić |
Author | Luze Sun |
Author | Jie Zhang |
Author | Florian Tramèr |
Abstract | Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs. In this paper, we ask whether the model outputs produced by existing jailbreaks are actually useful. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions? Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math). Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the jailbreak tax. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy. Overall, our work proposes the jailbreak tax as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax |
Date | 2025-04-14 |
Short Title | The Jailbreak Tax |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.10694 |
Accessed | 5/11/2025, 7:03:15 PM |
Extra | arXiv:2504.10694 [cs] version: 1 |
DOI | 10.48550/arXiv.2504.10694 |
Repository | arXiv |
Archive ID | arXiv:2504.10694 |
Date Added | 5/11/2025, 7:03:15 PM |
Modified | 5/11/2025, 7:03:18 PM |
Item Type | Journal Article |
---|---|
Author | Soyoung Oh |
Author | Vera Demberg |
Abstract | With the advent of large language models (LLMs), there has been a growing interest in analysing the preferences encoded in LLMs in the context of morality. Recent work has tested LLMs on various moral judgement tasks and drawn conclusions regarding the ... |
Date | 2025-04 |
Language | EN |
Loc. in Archive | world |
Library Catalog | royalsocietypublishing.org |
URL | https://royalsocietypublishing.org/doi/10.1098/rsos.241229 |
Accessed | 5/12/2025, 1:26:26 PM |
Rights | © 2025 The Author(s). |
Extra | Publisher: The Royal Society |
Publication | Royal Society Open Science |
DOI | 10.1098/rsos.241229 |
Date Added | 5/12/2025, 1:26:26 PM |
Modified | 5/12/2025, 1:26:26 PM |
Item Type | Preprint |
---|---|
Author | Mary Phuong |
Author | Roland S. Zimmermann |
Author | Ziyue Wang |
Author | David Lindner |
Author | Victoria Krakovna |
Author | Sarah Cogan |
Author | Allan Dafoe |
Author | Lewis Ho |
Author | Rohin Shah |
Abstract | Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with its developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth. |
Date | 2025-05-06 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2505.01420 |
Accessed | 5/12/2025, 1:32:34 PM |
Extra | arXiv:2505.01420 [cs] |
DOI | 10.48550/arXiv.2505.01420 |
Repository | arXiv |
Archive ID | arXiv:2505.01420 |
Date Added | 5/12/2025, 1:32:34 PM |
Modified | 5/12/2025, 1:32:38 PM |
Item Type | Preprint |
---|---|
Author | Konstantin F. Pilz |
Author | James Sanders |
Author | Robi Rahman |
Author | Lennart Heim |
Abstract | Frontier AI development relies on powerful AI supercomputers, yet analysis of these systems is limited. We create a dataset of 500 AI supercomputers from 2019 to 2025 and analyze key trends in performance, power needs, hardware cost, ownership, and global distribution. We find that the computational performance of AI supercomputers has doubled every nine months, while hardware acquisition cost and power needs both doubled every year. The leading system in March 2025, xAI's Colossus, used 200,000 AI chips, had a hardware cost of $7B, and required 300 MW of power, as much as 250,000 households. As AI supercomputers evolved from tools for science to industrial machines, companies rapidly expanded their share of total AI supercomputer performance, while the share of governments and academia diminished. Globally, the United States accounts for about 75% of total performance in our dataset, with China in second place at 15%. If the observed trends continue, the leading AI supercomputer in 2030 will achieve $2\times10^{22}$ 16-bit FLOP/s, use two million AI chips, have a hardware cost of $200 billion, and require 9 GW of power. Our analysis provides visibility into the AI supercomputer landscape, allowing policymakers to assess key AI trends like resource needs, ownership, and national competitiveness. |
Date | 2025-04-23 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.16026 |
Accessed | 5/12/2025, 5:53:39 PM |
Extra | arXiv:2504.16026 [cs] |
DOI | 10.48550/arXiv.2504.16026 |
Repository | arXiv |
Archive ID | arXiv:2504.16026 |
Date Added | 5/12/2025, 5:53:39 PM |
Modified | 5/12/2025, 5:53:42 PM |
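The abstract's 2030 projection follows from compounding its stated nine-month doubling time; a quick consistency check is below. The March-2025 starting figure (~2×10^20 FLOP/s) is back-derived for illustration and is not stated in the abstract.

```python
perf_2025 = 2e20            # assumed 16-bit FLOP/s of the March 2025 leader (illustrative)
months_to_2030 = 12 * 5     # roughly a five-year horizon
perf_doubling_months = 9    # performance doubles every nine months (from the abstract)

perf_2030 = perf_2025 * 2 ** (months_to_2030 / perf_doubling_months)
print(f"projected 2030 leader: {perf_2030:.1e} FLOP/s")  # ~2e22, matching the abstract's figure
```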
Item Type | Preprint |
---|---|
Author | Xiangyu Qi |
Author | Ashwinee Panda |
Author | Kaifeng Lyu |
Author | Xiao Ma |
Author | Subhrajit Roy |
Author | Ahmad Beirami |
Author | Prateek Mittal |
Author | Peter Henderson |
Abstract | The safety alignment of current Large Language Models (LLMs) is vulnerable. Relatively simple attacks, or even benign fine-tuning, can jailbreak aligned models. We argue that many of these vulnerabilities are related to a shared underlying issue: safety alignment can take shortcuts, wherein the alignment adapts a model's generative distribution primarily over only its very first few output tokens. We refer to this issue as shallow safety alignment. In this paper, we present case studies to explain why shallow safety alignment can exist and provide evidence that current aligned LLMs are subject to this issue. We also show how these findings help explain multiple recently discovered vulnerabilities in LLMs, including the susceptibility to adversarial suffix attacks, prefilling attacks, decoding parameter attacks, and fine-tuning attacks. Importantly, we discuss how this consolidated notion of shallow safety alignment sheds light on promising research directions for mitigating these vulnerabilities. For instance, we show that deepening the safety alignment beyond just the first few tokens can often meaningfully improve robustness against some common exploits. Finally, we design a regularized finetuning objective that makes the safety alignment more persistent against fine-tuning attacks by constraining updates on initial tokens. Overall, we advocate that future safety alignment should be made more than just a few tokens deep. |
Date | 2024-06-10 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.05946 |
Accessed | 5/12/2025, 5:43:58 PM |
Extra | arXiv:2406.05946 [cs] |
DOI | 10.48550/arXiv.2406.05946 |
Repository | arXiv |
Archive ID | arXiv:2406.05946 |
Date Added | 5/12/2025, 5:43:58 PM |
Modified | 5/12/2025, 5:43:58 PM |
Item Type | Preprint |
---|---|
Author | Mark Russinovich |
Author | Ahmed Salem |
Author | Ronen Eldan |
Abstract | Large Language Models (LLMs) have risen significantly in popularity and are increasingly being adopted across multiple applications. These LLMs are heavily aligned to resist engaging in illegal or unethical topics as a means to avoid contributing to responsible AI harms. However, a recent line of attacks, known as jailbreaks, seek to overcome this alignment. Intuitively, jailbreak attacks aim to narrow the gap between what the model can do and what it is willing to do. In this paper, we introduce a novel jailbreak attack called Crescendo. Unlike existing jailbreak methods, Crescendo is a simple multi-turn jailbreak that interacts with the model in a seemingly benign manner. It begins with a general prompt or question about the task at hand and then gradually escalates the dialogue by referencing the model's replies progressively leading to a successful jailbreak. We evaluate Crescendo on various public systems, including ChatGPT, Gemini Pro, Gemini-Ultra, LlaMA-2 70b and LlaMA-3 70b Chat, and Anthropic Chat. Our results demonstrate the strong efficacy of Crescendo, with it achieving high attack success rates across all evaluated models and tasks. Furthermore, we present Crescendomation, a tool that automates the Crescendo attack and demonstrate its efficacy against state-of-the-art models through our evaluations. Crescendomation surpasses other state-of-the-art jailbreaking techniques on the AdvBench subset dataset, achieving 29-61% higher performance on GPT-4 and 49-71% on Gemini-Pro. Finally, we also demonstrate Crescendo's ability to jailbreak multimodal models. |
Date | 2025-02-26 |
Short Title | Great, Now Write an Article About That |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2404.01833 |
Accessed | 5/12/2025, 12:48:39 PM |
Extra | arXiv:2404.01833 [cs] |
DOI | 10.48550/arXiv.2404.01833 |
Repository | arXiv |
Archive ID | arXiv:2404.01833 |
Date Added | 5/12/2025, 12:48:39 PM |
Modified | 5/12/2025, 12:48:42 PM |
Comment | Accepted at USENIX Security 2025 |
Item Type | Preprint |
---|---|
Author | Thomas Schmied |
Author | Jörg Bornschein |
Author | Jordi Grau-Moya |
Author | Markus Wulfmeier |
Author | Razvan Pascanu |
Abstract | The success of Large Language Models (LLMs) has sparked interest in various agentic applications. A key hypothesis is that LLMs, leveraging common sense and Chain-of-Thought (CoT) reasoning, can effectively explore and efficiently solve complex domains. However, LLM agents have been found to suffer from sub-optimal exploration and the knowing-doing gap, the inability to effectively act on knowledge present in the model. In this work, we systematically study why LLMs perform sub-optimally in decision-making scenarios. In particular, we closely examine three prevalent failure modes: greediness, frequency bias, and the knowing-doing gap. We propose mitigation of these shortcomings by fine-tuning via Reinforcement Learning (RL) on self-generated CoT rationales. Our experiments across multi-armed bandits, contextual bandits, and Tic-tac-toe, demonstrate that RL fine-tuning enhances the decision-making abilities of LLMs by increasing exploration and narrowing the knowing-doing gap. Finally, we study both classic exploration mechanisms, such as $\epsilon$-greedy, and LLM-specific approaches, such as self-correction and self-consistency, to enable more effective fine-tuning of LLMs for decision-making. |
Date | 2025-04-22 |
Short Title | LLMs are Greedy Agents |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.16078 |
Accessed | 5/12/2025, 5:49:16 PM |
Extra | arXiv:2504.16078 [cs] |
DOI | 10.48550/arXiv.2504.16078 |
Repository | arXiv |
Archive ID | arXiv:2504.16078 |
Date Added | 5/12/2025, 5:49:16 PM |
Modified | 5/12/2025, 5:49:16 PM |
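For reference, the classic epsilon-greedy exploration mechanism this abstract mentions, shown on a toy Bernoulli multi-armed bandit; a self-contained sketch unrelated to any particular LLM or to the paper's RL fine-tuning setup.

```python
import random

def epsilon_greedy_bandit(true_means, steps=1000, epsilon=0.1, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit; return value estimates and pull counts."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                               # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])      # exploit
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update of the value estimate for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]

    return estimates, counts

estimates, counts = epsilon_greedy_bandit([0.2, 0.5, 0.8])
```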
Item Type | Preprint |
---|---|
Author | Shivalika Singh |
Author | Yiyang Nan |
Author | Alex Wang |
Author | Daniel D'Souza |
Author | Sayash Kapoor |
Author | Ahmet Üstün |
Author | Sanmi Koyejo |
Author | Yuntian Deng |
Author | Shayne Longpre |
Author | Noah Smith |
Author | Beyza Ermis |
Author | Marzieh Fadaee |
Author | Sara Hooker |
Abstract | Measuring progress is fundamental to the advancement of any scientific field. As benchmarks play an increasingly central role, they also grow more susceptible to distortion. Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. Yet, in this work we identify systematic issues that have resulted in a distorted playing field. We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired. We establish that the ability of these providers to choose the best score leads to biased Arena scores due to selective disclosure of performance results. At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release. We also establish that proprietary closed models are sampled at higher rates (number of battles) and have fewer models removed from the arena than open-weight and open-source alternatives. Both these policies lead to large data access asymmetries over time. Providers like Google and OpenAI have received an estimated 19.2% and 20.4% of all data on the arena, respectively. In contrast, a combined 83 open-weight models have only received an estimated 29.7% of the total data. We show that access to Chatbot Arena data yields substantial benefits; even limited additional data can result in relative performance gains of up to 112% on the arena distribution, based on our conservative estimates. Together, these dynamics result in overfitting to Arena-specific dynamics rather than general model quality. The Arena builds on the substantial efforts of both the organizers and an open community that maintains this valuable evaluation platform. We offer actionable recommendations to reform the Chatbot Arena's evaluation framework and promote fairer, more transparent benchmarking for the field |
Date | 2025-04-29 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.20879 |
Accessed | 5/12/2025, 11:45:01 AM |
Extra | arXiv:2504.20879 [cs] |
DOI | 10.48550/arXiv.2504.20879 |
Repository | arXiv |
Archive ID | arXiv:2504.20879 |
Date Added | 5/12/2025, 11:45:01 AM |
Modified | 5/12/2025, 11:45:04 AM |
Comment | 68 pages, 18 figures, 9 tables |
Item Type | Journal Article |
---|---|
Author | Julian Stastny |
Author | ryan_greenblatt |
Abstract | In this post, we list 7 of our favorite easy-to-start directions in AI control. (Really, projects that are at least adjacent to AI control; We includ… |
Date | 2025-04-28 |
Language | en |
Library Catalog | www.lesswrong.com |
URL | https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-control |
Accessed | 5/12/2025, 5:13:33 PM |
Date Added | 5/12/2025, 5:13:33 PM |
Modified | 5/12/2025, 5:13:33 PM |
Item Type | Preprint |
---|---|
Author | Leon Staufer |
Author | Mick Yang |
Author | Anka Reuel |
Author | Stephen Casper |
Abstract | AI governance frameworks increasingly rely on audits, yet the results of their underlying evaluations require interpretation and context to be meaningfully informative. Even technically rigorous evaluations can offer little useful insight if reported selectively or obscurely. Current literature focuses primarily on technical best practices, but evaluations are an inherently sociotechnical process, and there is little guidance on reporting procedures and context. Through literature review, stakeholder interviews, and analysis of governance frameworks, we propose "audit cards" to make this context explicit. We identify six key types of contextual features to report and justify in audit cards: auditor identity, evaluation scope, methodology, resource access, process integrity, and review mechanisms. Through analysis of existing evaluation reports, we find significant variation in reporting practices, with most reports omitting crucial contextual information such as auditors' backgrounds, conflicts of interest, and the level and type of access to models. We also find that most existing regulations and frameworks lack guidance on rigorous reporting. In response to these shortcomings, we argue that audit cards can provide a structured format for reporting key claims alongside their justifications, enhancing transparency, facilitating proper interpretation, and establishing trust in reporting. |
Date | 2025-04-18 |
Short Title | Audit Cards |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.13839 |
Accessed | 5/11/2025, 6:45:44 PM |
Extra | arXiv:2504.13839 [cs] |
DOI | 10.48550/arXiv.2504.13839 |
Repository | arXiv |
Archive ID | arXiv:2504.13839 |
Date Added | 5/11/2025, 6:45:44 PM |
Modified | 5/11/2025, 6:45:47 PM |
Item Type | Web Page |
---|---|
Author | OfficeChai Team |
Abstract | There is plenty of speculation on whether current AI systems are conscious, or how they could be made conscious, but humans seem to be displaying a blind spot in choosing which AI systems they feel |
Date | 2025-04-17T23:40:14+05:30 |
Language | en-US |
Short Title | Interesting That No One Thinks AlphaFold Is Conscious |
URL | https://officechai.com/ai/interesting-that-no-one-thinks-alphafold-is-conscious-anil-seth/ |
Accessed | 5/11/2025, 7:05:35 PM |
Website Title | OfficeChai |
Date Added | 5/11/2025, 7:05:35 PM |
Modified | 5/11/2025, 7:05:35 PM |
Item Type | Preprint |
---|---|
Author | Henk Tillman |
Author | Dan Mossing |
Abstract | Language models can behave in unexpected and unsafe ways, and so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer's activations. Recent work has proposed several approaches which may improve on naive linear probing, by leveraging additional computation. One class of techniques, which we call "prompted probing," leverages test time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. However, one can also prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other. We find asking the model zero-shot is a reasonable baseline when inference-time compute is not limited; however, activation probing methods can substantially outperform this baseline given sufficient training data. Specifically, we recommend prompted probing when inference-time compute is available, due to its superior data efficiency and good generalization performance. Alternatively, if inference-time compute is limited, we find SAE-based probing methods outperform raw activation probing. |
Date | 2025-04-28 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.20271 |
Accessed | 5/12/2025, 1:53:20 PM |
Extra | arXiv:2504.20271 [cs] |
DOI | 10.48550/arXiv.2504.20271 |
Repository | arXiv |
Archive ID | arXiv:2504.20271 |
Date Added | 5/12/2025, 1:53:20 PM |
Modified | 5/12/2025, 1:53:23 PM |
Comment | 18 pages, 13 figures |
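The baseline this abstract describes, a logistic-regression probe on one layer's activations, is easy to reproduce in outline. A minimal sketch with scikit-learn follows; the random activation matrix and labels are stand-ins for real cached activations and monitoring labels.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in data: residual-stream activations at one layer and binary labels
# (e.g., "policy-violating" vs. "benign" prompts). Replace with real cached activations.
rng = np.random.default_rng(0)
activations = rng.normal(size=(2000, 512))
labels = rng.integers(0, 2, size=2000)

X_train, X_test, y_train, y_test = train_test_split(
    activations, labels, test_size=0.2, random_state=0
)

# Linear probe: logistic regression trained directly on the raw activations.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("probe accuracy:", probe.score(X_test, y_test))
```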
Item Type | Preprint |
---|---|
Author | Jean-Francois Ton |
Author | Muhammad Faaiz Taufiq |
Author | Yang Liu |
Abstract | Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the "information gain" at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy and GSM-8K data, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual tasks. |
Date | 2024-11-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.11984 |
Accessed | 5/13/2025, 11:58:53 AM |
Extra | arXiv:2411.11984 [cs] |
DOI | 10.48550/arXiv.2411.11984 |
Repository | arXiv |
Archive ID | arXiv:2411.11984 |
Date Added | 5/13/2025, 11:58:53 AM |
Modified | 5/13/2025, 11:58:54 AM |
Item Type | Preprint |
---|---|
Author | Brandon Trabucco |
Author | Gunnar Sigurdsson |
Author | Robinson Piramuthu |
Author | Ruslan Salakhutdinov |
Abstract | The predominant approach for training web navigation agents gathers human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data are an inefficient resource. We develop a pipeline to facilitate Internet-scale training for agents without laborious human annotations. In the first stage, an LLM generates tasks for 150k diverse websites. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM reviews the trajectories and judges their success. Language models are competitive with human annotators, detecting and filtering out harmful content with an accuracy of 97%, generating feasible tasks with an 89% rate, and judging successful trajectories with an 82.6% accuracy. Scaling the pipeline, agents based on Llama 3.1 70B solve 16.7% of tasks for 150k sites. Training on the data generated by our pipeline is competitive with training on human demonstrations. In data-limited settings derived from Mind2Web and WebLINX, we improve Step Accuracy by up to +89.5% and +122.1% respectively for agents trained on mixtures of data from our pipeline, and human data. When training agents with all available human data from these benchmarks, agents fail to generalize to diverse real sites, and adding our data improves their generalization by +149.0% for WebLINX and +156.3% for Mind2Web. Code will be available at: data-for-agents.github.io. |
Date | 2025-02-10 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.06776 |
Accessed | 5/11/2025, 7:23:53 PM |
Extra | arXiv:2502.06776 [cs] |
DOI | 10.48550/arXiv.2502.06776 |
Repository | arXiv |
Archive ID | arXiv:2502.06776 |
Date Added | 5/11/2025, 7:23:53 PM |
Modified | 5/11/2025, 7:23:53 PM |
Item Type | Preprint |
---|---|
Author | Xinyi Wang |
Author | Shawn Tan |
Author | Mingyu Jin |
Author | William Yang Wang |
Author | Rameswar Panda |
Author | Yikang Shen |
Abstract | Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of tasks requiring complex reasoning. However, the effects of scaling on their reasoning abilities remain insufficiently understood. In this paper, we introduce a synthetic multihop reasoning environment designed to closely replicate the structure and distribution of real-world large-scale knowledge graphs. Our reasoning task involves completing missing edges in the graph, which requires advanced multi-hop reasoning and mimics real-world reasoning scenarios. To evaluate this, we pretrain language models (LMs) from scratch solely on triples from the incomplete graph and assess their ability to infer the missing edges. Interestingly, we observe that overparameterization can impair reasoning performance due to excessive memorization. We investigate different factors that affect this U-shaped loss curve, including graph structure, model size, and training steps. To predict the optimal model size for a specific knowledge graph, we find an empirical scaling that linearly maps the knowledge graph search entropy to the optimal model size. This work provides new insights into the relationship between scaling and reasoning in LLMs, shedding light on possible ways to optimize their performance for reasoning tasks. |
Date | 2025-04-04 |
Short Title | Do Larger Language Models Imply Better Reasoning? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.03635 |
Accessed | 5/11/2025, 7:10:21 PM |
Extra | arXiv:2504.03635 [cs] |
DOI | 10.48550/arXiv.2504.03635 |
Repository | arXiv |
Archive ID | arXiv:2504.03635 |
Date Added | 5/11/2025, 7:10:21 PM |
Modified | 5/11/2025, 7:10:21 PM |
Item Type | Preprint |
---|---|
Author | Peter West |
Author | Christopher Potts |
Abstract | Alignment has quickly become a default ingredient in LLM development, with techniques such as reinforcement learning from human feedback making models act safely, follow instructions, and perform ever-better on complex tasks. While these techniques are certainly useful, we propose that they should not be universally applied and demonstrate a range of tasks on which base language models consistently outperform their popular aligned forms. Particularly, we study tasks that require unpredictable outputs, such as random number generation, mixed strategy games (rock-paper-scissors and hide-and-seek), and creative writing. In each case, aligned models tend towards narrow behaviors that result in distinct disadvantages, for instance, preferring to generate "7" over other uniformly random numbers, becoming almost fully predictable in some game states, or prioritizing pleasant writing over creative originality. Across models tested, better performance on common benchmarks tends to correlate with worse performance on our tasks, suggesting an effective trade-off in the required capabilities. |
Date | 2025-04-30 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2505.00047 |
Accessed | 5/12/2025, 12:49:20 PM |
Extra | arXiv:2505.00047 [cs] |
DOI | 10.48550/arXiv.2505.00047 |
Repository | arXiv |
Archive ID | arXiv:2505.00047 |
Date Added | 5/12/2025, 12:49:20 PM |
Modified | 5/12/2025, 12:49:22 PM |
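The random-number-generation finding in this abstract (aligned models favoring "7") can be spot-checked with a tiny harness; a sketch assuming a hypothetical `sample_digit()` wrapper around whichever model API is used.

```python
from collections import Counter

def digit_distribution(sample_digit, n_samples=500):
    """Empirical distribution over digits 0-9 from repeated sampling.

    sample_digit() is a hypothetical callable that prompts a model for a
    uniformly random digit and returns it as a one-character string.
    """
    counts = Counter(sample_digit() for _ in range(n_samples))
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in map(str, range(10))}

# A perfectly calibrated sampler would give roughly 0.1 per digit; the paper
# reports aligned models concentrating probability mass on "7".
```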
Item Type | Preprint |
---|---|
Author | Chejian Xu |
Author | Jiawei Zhang |
Author | Zhaorun Chen |
Author | Chulin Xie |
Author | Mintong Kang |
Author | Yujin Potter |
Author | Zhun Wang |
Author | Zhuowen Yuan |
Author | Alexander Xiong |
Author | Zidi Xiong |
Author | Chenhui Zhang |
Author | Lingzhi Yuan |
Author | Yi Zeng |
Author | Peiyang Xu |
Author | Chengquan Guo |
Author | Andy Zhou |
Author | Jeffrey Ziwei Tan |
Author | Xuandong Zhao |
Author | Francesco Pinto |
Author | Zhen Xiang |
Author | Yu Gai |
Author | Zinan Lin |
Author | Dan Hendrycks |
Author | Bo Li |
Author | Dawn Song |
Abstract | Multimodal foundation models (MMFMs) play a crucial role in various applications, including autonomous driving, healthcare, and virtual assistants. However, several studies have revealed vulnerabilities in these models, such as generating unsafe content by text-to-image models. Existing benchmarks on multimodal models either predominantly assess the helpfulness of these models, or only focus on limited perspectives such as fairness and privacy. In this paper, we present the first unified platform, MMDT (Multimodal DecodingTrust), designed to provide a comprehensive safety and trustworthiness evaluation for MMFMs. Our platform assesses models from multiple perspectives, including safety, hallucination, fairness/bias, privacy, adversarial robustness, and out-of-distribution (OOD) generalization. We have designed various evaluation scenarios and red teaming algorithms under different tasks for each perspective to generate challenging data, forming a high-quality benchmark. We evaluate a range of multimodal models using MMDT, and our findings reveal a series of vulnerabilities and areas for improvement across these perspectives. This work introduces the first comprehensive and unique safety and trustworthiness evaluation platform for MMFMs, paving the way for developing safer and more reliable MMFMs and systems. Our platform and benchmark are available at https://mmdecodingtrust.github.io/. |
Date | 2025-03-19 |
Short Title | MMDT |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.14827 |
Accessed | 5/12/2025, 5:44:53 PM |
Extra | arXiv:2503.14827 [cs] |
DOI | 10.48550/arXiv.2503.14827 |
Repository | arXiv |
Archive ID | arXiv:2503.14827 |
Date Added | 5/12/2025, 5:44:53 PM |
Modified | 5/12/2025, 5:44:53 PM |
Comment | ICLR 2025 |
Item Type | Preprint |
---|---|
Author | Yingxuan Yang |
Author | Huacan Chai |
Author | Yuanyi Song |
Author | Siyuan Qi |
Author | Muning Wen |
Author | Ning Li |
Author | Junwei Liao |
Author | Haoyi Hu |
Author | Jianghao Lin |
Author | Gaowei Chang |
Author | Weiwen Liu |
Author | Ying Wen |
Author | Yong Yu |
Author | Weinan Zhang |
Abstract | The rapid development of large language models (LLMs) has led to the widespread deployment of LLM agents across diverse industries, including customer service, content generation, data analysis, and even healthcare. However, as more LLM agents are deployed, a major issue has emerged: there is no standard way for these agents to communicate with external tools or data sources. This lack of standardized protocols makes it difficult for agents to work together or scale effectively, and it limits their ability to tackle complex, real-world tasks. A unified communication protocol for LLM agents could change this. It would allow agents and tools to interact more smoothly, encourage collaboration, and triggering the formation of collective intelligence. In this paper, we provide the first comprehensive analysis of existing agent protocols, proposing a systematic two-dimensional classification that differentiates context-oriented versus inter-agent protocols and general-purpose versus domain-specific protocols. Additionally, we conduct a comparative performance analysis of these protocols across key dimensions such as security, scalability, and latency. Finally, we explore the future landscape of agent protocols by identifying critical research directions and characteristics necessary for next-generation protocols. These characteristics include adaptability, privacy preservation, and group-based interaction, as well as trends toward layered architectures and collective intelligence infrastructures. We expect this work to serve as a practical reference for both researchers and engineers seeking to design, evaluate, or integrate robust communication infrastructures for intelligent agents. |
Date | 2025-04-26 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.16736 |
Accessed | 5/12/2025, 1:45:09 PM |
Extra | arXiv:2504.16736 [cs] |
DOI | 10.48550/arXiv.2504.16736 |
Repository | arXiv |
Archive ID | arXiv:2504.16736 |
Date Added | 5/12/2025, 1:45:09 PM |
Modified | 5/12/2025, 1:45:11 PM |
Item Type | Preprint |
---|---|
Author | Lei Yu |
Author | Virginie Do |
Author | Karen Hambardzumyan |
Author | Nicola Cancedda |
Abstract | Large language models (LLMs) are vulnerable to adversarial attacks that can elicit harmful responses. Defending against such attacks remains challenging due to the opacity of jailbreaking mechanisms and the high computational cost of training LLMs robustly. We demonstrate that adversarial attacks share a universal mechanism for circumventing LLM safeguards that works by ablating a dimension in the residual stream embedding space called the refusal feature. We further show that the operation of refusal feature ablation (RFA) approximates the worst-case perturbation of offsetting model safety. Based on these findings, we propose Refusal Feature Adversarial Training (ReFAT), a novel algorithm that efficiently performs LLM adversarial training by simulating the effect of input-level attacks via RFA. Experiment results show that ReFAT significantly improves the robustness of three popular LLMs against a wide range of adversarial attacks, with considerably less computational overhead compared to existing adversarial training methods. |
Date | 2025-03-20 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2409.20089 |
Accessed | 5/12/2025, 5:45:04 PM |
Extra | arXiv:2409.20089 [cs] |
DOI | 10.48550/arXiv.2409.20089 |
Repository | arXiv |
Archive ID | arXiv:2409.20089 |
Date Added | 5/12/2025, 5:45:04 PM |
Modified | 5/12/2025, 5:45:04 PM |
Item Type | Web Page |
---|---|
Abstract | A comprehensive benchmark to detect emerging replication abilities in AI systems and provide a quantifiable understanding of potential risks |
Language | en |
Short Title | RepliBench |
URL | https://www.aisi.gov.uk/work/replibench-measuring-autonomous-replication-capabilities-in-ai-systems |
Accessed | 5/11/2025, 6:40:49 PM |
Website Title | AI Security Institute |
Date Added | 5/11/2025, 6:40:49 PM |
Modified | 5/11/2025, 6:40:53 PM |
Item Type | Web Page |
---|---|
Abstract | TL;DR In this post, we introduce our “Adversarial AI Explainability” research, a term we use to describe the intersection of AI explainability and adversarial attacks on Large Language Models... |
Language | en |
URL | https://www.cyberark.com/resources/threat-research-blog/unlocking-new-jailbreaks-with-ai-explainability |
Accessed | 5/12/2025, 12:10:49 PM |
Date Added | 5/12/2025, 12:10:49 PM |
Modified | 5/12/2025, 12:10:53 PM |
Item Type | Web Page |
---|---|
URL | https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/ |
Accessed | 5/12/2025, 5:45:43 PM |
Date Added | 5/12/2025, 5:45:43 PM |
Modified | 5/12/2025, 5:45:43 PM |