Item Type | Preprint |
---|---|
Author | Giulio Antonio Abbo |
Author | Tony Belpaeme |
Abstract | Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses. |
Date | 2025-01-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.03957 |
Accessed | 3/25/2025, 5:45:31 PM |
Extra | arXiv:2501.03957 [cs] |
DOI | 10.48550/arXiv.2501.03957 |
Repository | arXiv |
Archive ID | arXiv:2501.03957 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 13 pages, 2 figures
Item Type | Journal Article |
---|---|
Author | Daron Acemoglu |
Author | Nicolás Ajzenman |
Author | Cevat Giray Aksoy |
Author | Martin Fiszbein |
Author | Carlos Molina |
Abstract | Using large-scale survey data covering more than 110 countries and exploiting within-country variation across cohorts and surveys, we show that individuals with longer exposure to democracy display stronger support for democratic institutions, and that this effect is almost entirely driven by exposure to democracies with successful performance in terms of economic growth, control of corruption, peace and political stability, and public goods provision. Across a variety of specifications, estimation methods, and samples, the results are robust, and the timing and nature of the effects are consistent with our interpretation. We also present suggestive evidence that democratic institutions that receive support from their citizens perform better in the face of negative shocks. |
Date | 2025-03-06 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://academic.oup.com/restud/article/92/2/621/7675443 |
Accessed | 4/11/2025, 11:47:00 AM |
Rights | https://academic.oup.com/pages/standard-publication-reuse-rights |
Volume | 92 |
Pages | 621-655 |
Publication | Review of Economic Studies |
DOI | 10.1093/restud/rdae051 |
Issue | 2 |
ISSN | 0034-6527, 1467-937X |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Essential AI |
Author | Darsh J. Shah |
Author | Peter Rushton |
Author | Somanshu Singla |
Author | Mohit Parmar |
Author | Kurt Smith |
Author | Yash Vanjani |
Author | Ashish Vaswani |
Author | Adarsh Chaluvaraju |
Author | Andrew Hojel |
Author | Andrew Ma |
Author | Anil Thomas |
Author | Anthony Polloreno |
Author | Ashish Tanwer |
Author | Burhan Drak Sibai |
Author | Divya S. Mansingka |
Author | Divya Shivaprasad |
Author | Ishaan Shah |
Author | Karl Stratos |
Author | Khoi Nguyen |
Author | Michael Callahan |
Author | Michael Pust |
Author | Mrinal Iyer |
Author | Philip Monk |
Author | Platon Mazarakis |
Author | Ritvik Kapila |
Author | Saurabh Srivastava |
Author | Tim Romanski |
Abstract | A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks. |
Date | 2025-04-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.04022 |
Accessed | 4/9/2025, 11:11:36 PM |
Extra | arXiv:2504.04022 [cs] |
DOI | 10.48550/arXiv.2504.04022 |
Repository | arXiv |
Archive ID | arXiv:2504.04022 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Jacy Reese Anthis |
Author | Ryan Liu |
Author | Sean M. Richardson |
Author | Austin C. Kozlowski |
Author | Bernard Koch |
Author | James Evans |
Author | Erik Brynjolfsson |
Author | Michael Bernstein |
Abstract | Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances. |
Date | 2025-04-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.02234 |
Accessed | 4/9/2025, 11:29:43 PM |
Extra | arXiv:2504.02234 [cs] |
DOI | 10.48550/arXiv.2504.02234 |
Repository | arXiv |
Archive ID | arXiv:2504.02234 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Author | Apollo Research |
Abstract | We evaluate whether Claude Sonnet 3.7 and other frontier models know that they are being evaluated. |
Date | 4/7/2025 |
Language | en-GB |
URL | https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations |
Accessed | 3/25/2025, 9:30:24 PM |
Website Title | Apollo Research |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 5/1/2025, 2:05:23 PM |
Item Type | Web Page |
---|---|
Author | Dron Hazra
Author | Max Loeffler
Author | Murat Cubuktepe
Author | Levon Avagyan
Author | Liv Gorton
Author | Mark Bissell
Author | Owen Lewis
Author | Thomas McGrath
Author | Daniel Balsam
Date | 2025-04-15
Language | en |
URL | https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model |
Accessed | 4/17/2025, 11:35:50 AM |
Date Added | 4/17/2025, 11:35:50 AM |
Modified | 4/17/2025, 11:35:50 AM |
Item Type | Preprint |
---|---|
Author | Vidhisha Balachandran |
Author | Jingya Chen |
Author | Lingjiao Chen |
Author | Shivam Garg |
Author | Neel Joshi |
Author | Yash Lara |
Author | John Langford |
Author | Besmira Nushi |
Author | Vibhav Vineet |
Author | Yue Wu |
Author | Safoora Yousefi |
Abstract | Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements. |
Date | 2025-03-31 |
Short Title | Inference-Time Scaling for Complex Tasks |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.00294 |
Accessed | 4/11/2025, 12:33:43 PM |
Extra | arXiv:2504.00294 [cs] |
DOI | 10.48550/arXiv.2504.00294 |
Repository | arXiv |
Archive ID | arXiv:2504.00294 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Krisztian Balog |
Author | Donald Metzler |
Author | Zhen Qin |
Abstract | Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation. |
Date | 2025-03-24 |
Short Title | Rankers, Judges, and Assistants |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.19092 |
Accessed | 4/14/2025, 9:56:20 PM |
Extra | arXiv:2503.19092 [cs] |
DOI | 10.48550/arXiv.2503.19092 |
Repository | arXiv |
Archive ID | arXiv:2503.19092 |
Date Added | 4/14/2025, 9:56:33 PM |
Modified | 4/14/2025, 9:56:35 PM |
Item Type | Journal Article |
---|---|
Author | David Bau |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | David Bau |
Author | Tom McGrath |
Author | Sarah Schwettmann |
Author | Dylan Hadfield-Menell |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Aryan Bhatt |
Author | Cody Rushing |
Author | Adam Kaufman |
Author | Tyler Tracy |
Author | Vasil Georgiev |
Author | David Matolcsi |
Author | Akbir Khan |
Author | Buck Shlegeris |
Abstract | Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent. |
Date | 2025-04-14 |
Short Title | Ctrl-Z |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.10374 |
Accessed | 4/17/2025, 11:40:05 AM |
Extra | arXiv:2504.10374 [cs] |
DOI | 10.48550/arXiv.2504.10374 |
Repository | arXiv |
Archive ID | arXiv:2504.10374 |
Date Added | 4/17/2025, 11:40:05 AM |
Modified | 4/17/2025, 11:40:05 AM |
Comment: bashcontrol.com
Item Type | Journal Article |
---|---|
Author | Yanda Chen |
Author | Joe Benton |
Author | Ansh Radhakrishnan |
Author | Jonathan Uesato |
Author | Carson Denison |
Author | John Schulman |
Author | Arushi Somani |
Author | Peter Hase |
Author | Misha Wagner |
Author | Fabien Roger |
Author | Vlad Mikulik |
Author | Sam Bowman |
Author | Jan Leike |
Author | Jared Kaplan |
Author | Ethan Perez |
Abstract | Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Nuo Chen |
Author | Zhiyuan Hu |
Author | Qingyun Zou |
Author | Jiaying Wu |
Author | Qian Wang |
Author | Bryan Hooi |
Author | Bingsheng He |
Abstract | The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning. |
Date | 2025-03-31 |
Short Title | JudgeLRM |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.00050 |
Accessed | 4/14/2025, 9:59:44 PM |
Extra | arXiv:2504.00050 [cs] |
DOI | 10.48550/arXiv.2504.00050 |
Repository | arXiv |
Archive ID | arXiv:2504.00050 |
Date Added | 4/14/2025, 9:59:44 PM |
Modified | 4/14/2025, 9:59:44 PM |
Comment: preprint
Item Type | Preprint |
---|---|
Author | Quy-Anh Dang |
Author | Chris Ngo |
Abstract | Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs. |
Date | 2025-03-20 |
Short Title | Reinforcement Learning for Reasoning in Small LLMs |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16219 |
Accessed | 4/11/2025, 1:09:29 PM |
Extra | arXiv:2503.16219 [cs] |
DOI | 10.48550/arXiv.2503.16219 |
Repository | arXiv |
Archive ID | arXiv:2503.16219 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Tom Davidson |
Author | Lukas Finnveden |
Author | Rose Hadshar |
Abstract | This report assesses the risk that a small group—or even just one person—could use advanced AI to stage a coup. An AI-enabled coup is most likely to be staged by leaders of frontier AI projects, heads of state, and military officials; and could occur even in established democracies. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/17/2025, 11:42:37 AM |
Modified | 4/17/2025, 11:42:37 AM |
Item Type | Journal Article |
---|---|
Author | Cathy Mengying Fang |
Author | Auren R Liu |
Author | Valdemar Danry |
Author | Eunhae Lee |
Author | Samantha W T Chan |
Author | Pat Pataranutaporn |
Author | Pattie Maes |
Author | Jason Phang |
Author | Michael Lampe |
Author | Lama Ahmad |
Author | Sandhini Agarwal |
Abstract | AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users’ loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage—across all modalities and conversation types—correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots’ ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being. |
Date | 2025 |
Language | en |
Library Catalog | Zotero |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Mehdi Fatemi |
Author | Banafsheh Rafiee |
Author | Mingjie Tang |
Author | Kartik Talamadupula |
Abstract | Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results. |
Date | 2025-04-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.05185 |
Accessed | 4/14/2025, 10:01:11 PM |
Extra | arXiv:2504.05185 [cs] |
DOI | 10.48550/arXiv.2504.05185 |
Repository | arXiv |
Archive ID | arXiv:2504.05185 |
Date Added | 4/14/2025, 10:01:11 PM |
Modified | 4/14/2025, 10:01:11 PM |
Item Type | Preprint |
---|---|
Author | Jillian Fisher |
Author | Ruth E. Appel |
Author | Chan Young Park |
Author | Yujin Potter |
Author | Liwei Jiang |
Author | Taylor Sorensen |
Author | Shangbin Feng |
Author | Yulia Tsvetkov |
Author | Margaret E. Roberts |
Author | Jennifer Pan |
Author | Dawn Song |
Author | Yejin Choi |
Abstract | AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality, defined as the absence of bias, is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models. |
Date | 2025-02-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.05728 |
Accessed | 4/11/2025, 10:31:20 AM |
Extra | arXiv:2503.05728 [cs] |
DOI | 10.48550/arXiv.2503.05728 |
Repository | arXiv |
Archive ID | arXiv:2503.05728 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Code: https://github.com/jfisher52/Approximation_Political_Neutrality
Item Type | Preprint |
---|---|
Author | Dylan J. Foster |
Author | Zakaria Mhammedi |
Author | Dhruv Rohatgi |
Abstract | Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration. |
Date | 2025-03-13 |
Short Title | Is a Good Foundation Necessary for Efficient Reinforcement Learning? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.07453 |
Accessed | 4/11/2025, 3:16:04 PM |
Extra | arXiv:2503.07453 [cs] |
DOI | 10.48550/arXiv.2503.07453 |
Repository | arXiv |
Archive ID | arXiv:2503.07453 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: V2: Improved number of prompts used by Algorithm 1
Item Type | Preprint |
---|---|
Author | Lewis Hammond |
Author | Sam Adam-Day |
Abstract | We consider the problem of how a trusted, but computationally bounded agent (a 'verifier') can learn to interact with one or more powerful but untrusted agents ('provers') in order to solve a given task. More specifically, we study the case in which agents are represented using neural networks and refer to solutions of this problem as neural interactive proofs. First we introduce a unifying framework based on prover-verifier games, which generalises previously proposed interaction protocols. We then describe several new protocols for generating neural interactive proofs, and provide a theoretical comparison of both new and existing approaches. Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems. |
Date | 2025-03-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.08897 |
Accessed | 3/25/2025, 9:19:45 PM |
Extra | arXiv:2412.08897 [cs] |
DOI | 10.48550/arXiv.2412.08897 |
Repository | arXiv |
Archive ID | arXiv:2412.08897 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: ICLR'25 camera-ready version; 51 pages, 17 figures
Item Type | Preprint |
---|---|
Author | Benjamin Hilton |
Author | Marie Davidsen Buhl |
Author | Tomek Korbak |
Author | Geoffrey Irving |
Abstract | Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases. |
Date | 2025-02-05 |
Short Title | Safety Cases |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.04744 |
Accessed | 3/25/2025, 9:34:29 PM |
Extra | arXiv:2503.04744 [cs] |
DOI | 10.48550/arXiv.2503.04744 |
Repository | arXiv |
Archive ID | arXiv:2503.04744 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 18 pages, 2 figures, 3 tables
Item Type | Preprint |
---|---|
Author | Andreas Hochlehnert |
Author | Hardik Bhatnagar |
Author | Vishaal Udandarao |
Author | Samuel Albanie |
Author | Ameya Prabhu |
Author | Matthias Bethge |
Abstract | Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs, for reasoning benchmarks, establishing more rigorous foundations for future work. |
Date | 2025-04-09 |
Short Title | A Sober Look at Progress in Language Model Reasoning |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.07086 |
Accessed | 4/13/2025, 6:01:29 PM |
Extra | arXiv:2504.07086 [cs] |
DOI | 10.48550/arXiv.2504.07086 |
Repository | arXiv |
Archive ID | arXiv:2504.07086 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Technical Report
Item Type | Preprint |
---|---|
Author | Abe Bohan Hou |
Author | Hongru Du |
Author | Yichen Wang |
Author | Jingyu Zhang |
Author | Zixiao Wang |
Author | Paul Pu Liang |
Author | Daniel Khashabi |
Author | Lauren Gardner |
Author | Tianxing He |
Abstract | Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents' attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development. |
Date | 2025-03-16 |
Short Title | Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.09639 |
Accessed | 3/25/2025, 9:35:41 PM |
Extra | arXiv:2503.09639 [cs] |
DOI | 10.48550/arXiv.2503.09639 |
Repository | arXiv |
Archive ID | arXiv:2503.09639 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Jiwon Jeong |
Author | Hyeju Jang |
Author | Hogun Park |
Abstract | The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text by incorporating implicit contextual information -- counterarguments, explanations, and goals -- which we query for validity within the context of the argument. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from the GPT and LLaMA series. The results show substantial improvements over state-of-the-art models, with F1 score increases of up to 0.60 in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels. |
Date | 2025-03-30 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23363 |
Accessed | 4/14/2025, 10:00:28 PM |
Extra | arXiv:2503.23363 [cs] |
DOI | 10.48550/arXiv.2503.23363 |
Repository | arXiv |
Archive ID | arXiv:2503.23363 |
Date Added | 4/14/2025, 10:00:28 PM |
Modified | 4/14/2025, 10:00:28 PM |
Comment: Accepted to NAACL 2025 Findings
Item Type | Preprint |
---|---|
Author | Zhijing Jin |
Author | Max Kleiman-Weiner |
Author | Giorgio Piatti |
Author | Sydney Levine |
Author | Jiarui Liu |
Author | Fernando Gonzalez |
Author | Francesco Ortu |
Author | András Strausz |
Author | Mrinmaya Sachan |
Author | Rada Mihalcea |
Author | Yejin Choi |
Author | Bernhard Schölkopf |
Abstract | We evaluate the moral alignment of large language models (LLMs) with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine |
Date | 2024-12-14 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2407.02273 |
Accessed | 4/14/2025, 10:06:43 PM |
Extra | arXiv:2407.02273 [cs] |
DOI | 10.48550/arXiv.2407.02273 |
Repository | arXiv |
Archive ID | arXiv:2407.02273 |
Date Added | 4/14/2025, 10:06:43 PM |
Modified | 4/14/2025, 10:06:43 PM |
Comment: Best Paper @ NeurIPS 2024 Workshop on Pluralistic Alignment
Item Type | Preprint |
---|---|
Author | Cameron R. Jones |
Author | Benjamin K. Bergen |
Abstract | We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have. |
Date | 2025-03-31 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23674 |
Accessed | 4/11/2025, 1:10:04 PM |
Extra | arXiv:2503.23674 [cs] |
DOI | 10.48550/arXiv.2503.23674 |
Repository | arXiv |
Archive ID | arXiv:2503.23674 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Harang Ju |
Author | Sinan Aral |
Abstract | To uncover how AI agents change productivity, performance, and work processes, we introduce MindMeld: an experimentation platform enabling humans and AI agents to collaborate in integrative workspaces. In a large-scale marketing experiment on the platform, 2310 participants were randomly assigned to human-human and human-AI teams, with randomized AI personality traits. The teams exchanged 183,691 messages, and created 63,656 image edits, 1,960,095 ad copy edits, and 10,375 AI-generated images while producing 11,138 ads for a large think tank. Analysis of fine-grained communication, collaboration, and workflow logs revealed that collaborating with AI agents increased communication by 137% and allowed humans to focus 23% more on text and image content generation messaging and 20% less on direct text editing. Humans on Human-AI teams sent 23% fewer social messages, creating 60% greater productivity per worker and higher-quality ad copy. In contrast, human-human teams produced higher-quality images, suggesting that AI agents require fine-tuning for multimodal workflows. AI personality prompt randomization revealed that AI traits can complement human personalities to enhance collaboration. For example, conscientious humans paired with open AI agents improved image quality, while extroverted humans paired with conscientious AI agents reduced the quality of text, images, and clicks. In field tests of ad campaigns with ~5M impressions, ads with higher image quality produced by human collaborations and higher text quality produced by AI collaborations performed significantly better on click-through rate and cost per click metrics. Overall, ads created by human-AI teams performed similarly to those created by human-human teams. Together, these results suggest AI agents can improve teamwork and productivity, especially when tuned to complement human traits. |
Date | 2025-03-23 |
Short Title | Collaborating with AI Agents |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.18238 |
Accessed | 3/26/2025, 2:58:26 PM |
Extra | arXiv:2503.18238 [cs] |
DOI | 10.48550/arXiv.2503.18238 |
Repository | arXiv |
Archive ID | arXiv:2503.18238 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 56 pages, 8 figures
Item Type | Preprint |
---|---|
Author | Ariba Khan |
Author | Stephen Casper |
Author | Dylan Hadfield-Menell |
Abstract | Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively. |
Date | 2025-04-08 |
Short Title | Randomness, Not Representation |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.08688 |
Accessed | 4/14/2025, 10:06:57 PM |
Extra | arXiv:2503.08688 [cs] |
DOI | 10.48550/arXiv.2503.08688 |
Repository | arXiv |
Archive ID | arXiv:2503.08688 |
Date Added | 4/14/2025, 10:06:57 PM |
Modified | 4/14/2025, 10:06:57 PM |
Item Type | Preprint |
---|---|
Author | Tomek Korbak |
Author | Mikita Balesni |
Author | Buck Shlegeris |
Author | Geoffrey Irving |
Abstract | As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk. |
Date | 2025-04-07 |
Short Title | How to evaluate control measures for LLM agents? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.05259 |
Accessed | 4/14/2025, 10:04:50 PM |
Extra | arXiv:2504.05259 [cs] |
DOI | 10.48550/arXiv.2504.05259 |
Repository | arXiv |
Archive ID | arXiv:2504.05259 |
Date Added | 4/14/2025, 10:04:50 PM |
Modified | 4/14/2025, 10:04:50 PM |
Item Type | Preprint |
---|---|
Author | Thomas Kwa |
Author | Ben West |
Author | Joel Becker |
Author | Amy Deng |
Author | Katharyn Garcia |
Author | Max Hasin |
Author | Sami Jawhar |
Author | Megan Kinniment |
Author | Nate Rush |
Author | Sydney Von Arx |
Author | Ryan Bloom |
Author | Thomas Broadley |
Author | Haoxing Du |
Author | Brian Goodrich |
Author | Nikola Jurkovic |
Author | Luke Harold Miles |
Author | Seraphina Nix |
Author | Tao Lin |
Author | Neev Parikh |
Author | David Rein |
Author | Lucas Jun Koba Sato |
Author | Hjalmar Wijk |
Author | Daniel M. Ziegler |
Author | Elizabeth Barnes |
Author | Lawrence Chan |
Abstract | Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month. |
Date | 2025-03-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.14499 |
Accessed | 3/25/2025, 6:00:08 PM |
Extra | arXiv:2503.14499 [cs] |
DOI | 10.48550/arXiv.2503.14499 |
Repository | arXiv |
Archive ID | arXiv:2503.14499 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Yuxuan Li |
Author | Hirokazu Shirado |
Author | Sauvik Das |
Abstract | While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases. |
Date | 2025-01-29 |
Short Title | Actions Speak Louder than Words |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17420 |
Accessed | 4/14/2025, 10:00:42 PM |
Extra | arXiv:2501.17420 [cs] |
DOI | 10.48550/arXiv.2501.17420 |
Repository | arXiv |
Archive ID | arXiv:2501.17420 |
Date Added | 4/14/2025, 10:00:42 PM |
Modified | 4/14/2025, 10:00:42 PM |
Item Type | Preprint |
---|---|
Author | Zijun Liu |
Author | Peiyi Wang |
Author | Runxin Xu |
Author | Shirong Ma |
Author | Chong Ruan |
Author | Peng Li |
Author | Yang Liu |
Author | Yu Wu |
Abstract | Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that *proper learning methods could enable effective inference-time scalability*. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the **inference-time scalability of generalist RM**, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in **DeepSeek-GRM** models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced. |
Date | 2025-04-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.02495 |
Accessed | 4/10/2025, 11:19:19 AM |
Extra | arXiv:2504.02495 [cs] |
DOI | 10.48550/arXiv.2504.02495 |
Repository | arXiv |
Archive ID | arXiv:2504.02495 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Preprint, under review. 42 pages
Item Type | Preprint |
---|---|
Author | Jiacheng Liu |
Author | Taylor Blanton |
Author | Yanai Elazar |
Author | Sewon Min |
Author | YenSung Chen |
Author | Arnavi Chheda-Kothary |
Author | Huy Tran |
Author | Byron Bischoff |
Author | Eric Marsh |
Author | Michael Schmitz |
Author | Cassidy Trier |
Author | Aaron Sarnat |
Author | Jenna James |
Author | Jon Borchardt |
Author | Bailey Kuehl |
Author | Evie Cheng |
Author | Karen Farley |
Author | Sruthi Sreeram |
Author | Taira Anderson |
Author | David Albright |
Author | Carissa Schoenick |
Author | Luca Soldaini |
Author | Dirk Groeneveld |
Author | Rock Yuren Pang |
Author | Pang Wei Koh |
Author | Noah A. Smith |
Author | Sophie Lebrecht |
Author | Yejin Choi |
Author | Hannaneh Hajishirzi |
Author | Ali Farhadi |
Author | Jesse Dodge |
Abstract | We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source. |
Date | 2025-04-09 |
Short Title | OLMoTrace |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.07096 |
Accessed | 4/17/2025, 11:51:42 AM |
Extra | arXiv:2504.07096 [cs] |
DOI | 10.48550/arXiv.2504.07096 |
Repository | arXiv |
Archive ID | arXiv:2504.07096 |
Date Added | 4/17/2025, 11:51:42 AM |
Modified | 4/17/2025, 11:51:42 AM |
Comment: Under submission at ACL 2025 demo track
Item Type | Preprint |
---|---|
Author | Xing Han Lù |
Author | Amirhossein Kazemnejad |
Author | Nicholas Meade |
Author | Arkil Patel |
Author | Dongchan Shin |
Author | Alejandra Zambrano |
Author | Karolina Stańczak |
Author | Peter Shaw |
Author | Christopher J. Pal |
Author | Siva Reddy |
Abstract | Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io |
Date | 2025-04-11 |
Short Title | AgentRewardBench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.08942 |
Accessed | 4/17/2025, 11:42:57 AM |
Extra | arXiv:2504.08942 [cs] |
DOI | 10.48550/arXiv.2504.08942 |
Repository | arXiv |
Archive ID | arXiv:2504.08942 |
Date Added | 4/17/2025, 11:42:57 AM |
Modified | 4/17/2025, 11:42:57 AM |
Item Type | Preprint |
---|---|
Author | Florian Mai |
Author | David Kaczér |
Author | Nicholas Kluge Corrêa |
Author | Lucie Flek |
Abstract | Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future. |
Date | 2025-03-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.13621 |
Accessed | 3/25/2025, 5:43:06 PM |
Extra | arXiv:2503.13621 [cs] |
DOI | 10.48550/arXiv.2503.13621 |
Repository | arXiv |
Archive ID | arXiv:2503.13621 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: Published at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)
Item Type | Preprint |
---|---|
Author | Alex Mallen |
Author | Charlie Griffin |
Author | Misha Wagner |
Author | Alessandro Abate |
Author | Buck Shlegeris |
Abstract | An AI control protocol is a plan for usefully deploying AI systems that aims to prevent an AI from intentionally causing some unacceptable outcome. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (without shared memory between contexts). To do this, an AI system may need to reliably generate optimal plans in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating. We develop Subversion Strategy Eval, a suite of eight environments, covering a range of protocols and strategic capabilities, and six sets of affordances that help isolate individual capabilities. We implement the evaluation in Inspect-AI and release it open-source. We evaluate Claude 3.5 models, including helpful-only versions, as well as OpenAI reasoning models. None of the models demonstrate substantial capability in strategizing to subvert control protocols statelessly. However, providing models with additional affordances, such as the ability to share a plan between contexts, can substantially improve performance. We hope our evaluations can act as a leading indicator for when models are capable of subverting control protocols and also relax the worst-case assumption of perfect strategic ability in AI control evaluations. |
Date | 2025-03-24 |
Short Title | Subversion Strategy Eval |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.12480 |
Accessed | 3/25/2025, 5:37:00 PM |
Extra | arXiv:2412.12480 [cs] |
DOI | 10.48550/arXiv.2412.12480 |
Repository | arXiv |
Archive ID | arXiv:2412.12480 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | David Manheim |
Author | Sammy Martin |
Author | Mark Bailey |
Author | Mikhail Samin |
Author | Ross Greutzmacher |
Abstract | Auditing of AI systems is a promising way to understand and manage ethical problems and societal risks associated with contemporary AI systems, as well as some anticipated future risks. Efforts to develop standards for auditing Artificial Intelligence (AI) systems have therefore understandably gained momentum. However, we argue that creating auditing standards is not just insufficient, but actively harmful by proliferating unheeded and inconsistent standards, especially in light of the rapid evolution and ethical and safety challenges of AI. Instead, the paper proposes the establishment of an AI Audit Standards Board, responsible for developing and updating auditing methods and standards in line with the evolving nature of AI technologies. Such a body would ensure that auditing practices remain relevant, robust, and responsive to the rapid advancements in AI. The paper argues that such a governance structure would also be helpful for maintaining public trust in AI and for promoting a culture of safety and ethical responsibility within the AI industry. Throughout the paper, we draw parallels with other industries, including safety-critical industries like aviation and nuclear energy, as well as more prosaic ones such as financial accounting and pharmaceuticals. AI auditing should emulate those fields, and extend beyond technical assessments to include ethical considerations and stakeholder engagement, but we explain that this is not enough; emulating other fields' governance mechanisms for these processes, and for audit standards creation, is a necessity. We also emphasize the importance of auditing the entire development process of AI systems, not just the final products... |
Date | 2024-04-11 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2404.13060 |
Accessed | 4/11/2025, 1:09:42 PM |
Extra | arXiv:2404.13060 [cs] |
DOI | 10.48550/arXiv.2404.13060 |
Repository | arXiv |
Archive ID | arXiv:2404.13060 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Nestor Maslej |
Date | 2025 |
Language | en |
Library Catalog | Zotero |
Publication | Artificial Intelligence |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Author | METR |
Abstract | Resources for testing dangerous autonomous capabilities in frontier models |
Date | 2025-04-16T00:00:00+00:00 |
Language | en |
URL | https://metr.github.io/autonomy-evals-guide/openai-o3-report/ |
Accessed | 4/17/2025, 11:41:15 AM |
Website Title | METR’s Autonomy Evaluation Resources |
Date Added | 4/17/2025, 11:41:15 AM |
Modified | 4/17/2025, 11:41:15 AM |
Item Type | Preprint |
---|---|
Author | Michael Nielson |
Date | 2025-04-14 |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 10:05:28 PM |
Modified | 4/14/2025, 10:06:07 PM |
Item Type | Preprint |
---|---|
Author | Kiho Park |
Author | Yo Joong Choe |
Author | Yibo Jiang |
Author | Victor Veitch |
Abstract | The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of large language models (LLMs). Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as directions in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as vectors. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 large language models, estimating representations for 900+ hierarchically related concepts using data from WordNet. |
Date | 2025-02-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.01506 |
Accessed | 4/11/2025, 1:09:50 PM |
Extra | arXiv:2406.01506 [cs] |
DOI | 10.48550/arXiv.2406.01506 |
Repository | arXiv |
Archive ID | arXiv:2406.01506 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Accepted for an oral presentation at ICLR 2025. Best Paper Award at the ICML 2024 Workshop on Mechanistic Interpretability. Code is available at https://github.com/KihoPark/LLM_Categorical_Hierarchical_Representations
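As a rough illustration of the binary-concept case the abstract mentions (not the paper's estimator), a concept direction can be approximated as the difference of mean representations between contrasting examples, with new representations scored by projection onto that direction. All names and data below are hypothetical stand-ins.

```python
import numpy as np

def concept_direction(pos_reps, neg_reps):
    """Estimate a linear concept direction as the difference of class means.
    pos_reps, neg_reps: arrays of shape (num_examples, hidden_dim)."""
    direction = pos_reps.mean(axis=0) - neg_reps.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(rep, direction):
    """Project a single representation onto the concept direction."""
    return float(rep @ direction)

# Toy usage with random arrays standing in for model hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(50, 16))   # e.g. contexts labelled with the concept
neg = rng.normal(loc=-1.0, size=(50, 16))  # contrasting contexts
d = concept_direction(pos, neg)
print(concept_score(rng.normal(loc=1.0, size=16), d))  # positive on average
```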
Item Type | Journal Article |
---|---|
Author | Jason Phang |
Author | Michael Lampe |
Author | Lama Ahmad |
Author | Sandhini Agarwal |
Author | Cathy Mengying Fang |
Author | Auren R Liu |
Author | Valdemar Danry |
Author | Eunhae Lee |
Author | Samantha W T Chan |
Author | Pat Pataranutaporn |
Author | Pattie Maes |
Abstract | As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users’ emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 4 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find the impact of voice-based interactions on emotional well-being to be highly nuanced, and influenced by factors such as the user’s initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Journal Article |
---|---|
Author | Rohin Shah |
Author | Alex Irpan |
Author | Alexander Matt Turner |
Author | Anna Wang |
Author | Arthur Conmy |
Author | David Lindner |
Author | Jonah Brown-Cohen |
Author | Lewis Ho |
Author | Neel Nanda |
Author | Raluca Ada Popa |
Author | Rishub Jain |
Author | Rory Greig |
Author | Scott Emmons |
Author | Sebastian Farquhar |
Author | Sébastien Krier |
Author | Senthooran Rajamanoharan |
Author | Sophie Bridgers |
Author | Tobi Ijitoye |
Author | Tom Everitt |
Author | Victoria Krakovna |
Author | Vikrant Varma |
Author | Vladimir Mikulik |
Author | Zachary Kenton |
Author | Dave Orr |
Author | Shane Legg |
Author | Noah Goodman |
Author | Allan Dafoe |
Author | Four Flynn |
Author | Anca Dragan |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Murray Shanahan |
Abstract | Is it possible to articulate a conception of consciousness that is compatible with the exotic characteristics of contemporary, disembodied AI systems, and that can stand up to philosophical scrutiny? How would subjective time and selfhood show up for an entity that conformed to such a conception? Trying to answer these questions, even metaphorically, stretches the language of consciousness to breaking point. Ultimately, the attempt yields something like emptiness, in the Buddhist sense, and helps to undermine our dualistic inclinations towards subjectivity and selfhood. |
Date | 2025-03-20 |
Short Title | Palatable Conceptions of Disembodied Being |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16348 |
Accessed | 3/25/2025, 5:46:40 PM |
Extra | arXiv:2503.16348 [cs] |
DOI | 10.48550/arXiv.2503.16348 |
Repository | arXiv |
Archive ID | arXiv:2503.16348 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Patrick Queiroz Da Silva |
Author | Hari Sethuraman |
Author | Dheeraj Rajagopal |
Author | Hannaneh Hajishirzi |
Author | Sachin Kumar |
Abstract | Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions. |
Date | 2025-04-06 |
Short Title | Steering off Course |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.04635 |
Accessed | 4/9/2025, 11:09:20 PM |
Extra | arXiv:2504.04635 [cs] |
DOI | 10.48550/arXiv.2504.04635 |
Repository | arXiv |
Archive ID | arXiv:2504.04635 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Lewis Smith |
Author | Senthooran Rajamanoharan |
Author | Arthur Conmy |
Author | Callum McDougall |
Author | Tom Lieberum |
Author | János Kramár |
Author | Rohin Shah |
Author | Neel Nanda |
Abstract | TL;DR: To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts. Negative result: SAEs underperformed linear probes. Corollary: linear probes are actually really good and cheap and perform great. As a result of this and parallel work, we are deprioritising fundamental SAE research for the moment and exploring other directions, though SAEs will remain a tool in our toolkit. We do not think that SAEs are useless or that no one should work on them, but we also do not think that SAEs will be a game-changer for interpretability, and speculate that the field is over-invested in them. Training SAEs specialised for chat data closed about half the gap but was still worse than linear probes; we tried several ways to train chat SAEs, and all did about as well. By default, we recommend taking an SAE on pretraining data and finetuning it on a bit of chat data. Other results: we found SAEs fairly helpful for debugging low-quality datasets (noticing spurious correlations); we present a variant of JumpReLU with an alternative sparsity penalty to get rid of high-frequency latents; and we argue that a standard auto-interp approach of computing the average interpretability of a uniformly sampled SAE latent can be misleading, as it doesn’t penalise models which have high-frequency, but not very interpretable, latents, and explore weighting the interpretability score by latent frequency. |
Date | 2025-03-26 |
Language | en |
Library Catalog | www.alignmentforum.org |
URL | https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks |
Accessed | 4/11/2025, 12:27:49 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
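The baseline the post favours, a linear probe on cached model activations, is essentially logistic regression on hidden states. The sketch below is a generic illustration under that assumption, with random arrays standing in for real activations; it is not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: activations of shape (n_prompts, hidden_dim) cached from some layer,
# and binary labels marking (here, synthetic) harmful intent.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))
labels = (activations[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Fit a linear probe on a training split and evaluate on the held-out split.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:150], labels[:150])
print("held-out accuracy:", probe.score(activations[150:], labels[150:]))
```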
Item Type | Preprint |
---|---|
Author | Taylor Sorensen |
Author | Pushkar Mishra |
Author | Roma Patel |
Author | Michael Henry Tessler |
Author | Michiel Bakker |
Author | Georgina Evans |
Author | Iason Gabriel |
Author | Noah Goodman |
Author | Verena Rieser |
Abstract | Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information. |
Date | 2025-03-19 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.15484 |
Accessed | 3/25/2025, 5:59:26 PM |
Extra | arXiv:2503.15484 [cs] |
DOI | 10.48550/arXiv.2503.15484 |
Repository | arXiv |
Archive ID | arXiv:2503.15484 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Giulio Starace |
Author | Oliver Jaffe |
Author | Dane Sherburn |
Author | James Aung |
Author | Jun Shern Chan |
Author | Leon Maksin |
Author | Rachel Dias |
Author | Evan Mays |
Author | Benjamin Kinsella |
Author | Wyatt Thompson |
Author | Johannes Heidecke |
Author | Amelia Glaese |
Author | Tejal Patwardhan |
Abstract | We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents. |
Date | 2025-04-07 |
Short Title | PaperBench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.01848 |
Accessed | 4/11/2025, 12:29:38 PM |
Extra | arXiv:2504.01848 [cs] |
DOI | 10.48550/arXiv.2504.01848 |
Repository | arXiv |
Archive ID | arXiv:2504.01848 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 30 pages, 14 figures
Item Type | Preprint |
---|---|
Author | Yi Su |
Author | Dian Yu |
Author | Linfeng Song |
Author | Juntao Li |
Author | Haitao Mi |
Author | Zhaopeng Tu |
Author | Min Zhang |
Author | Dong Yu |
Abstract | Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios. |
Date | 2025-04-01 |
Short Title | Crossing the Reward Bridge |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23829 |
Accessed | 4/11/2025, 5:31:32 PM |
Extra | arXiv:2503.23829 [cs] |
DOI | 10.48550/arXiv.2503.23829 |
Repository | arXiv |
Archive ID | arXiv:2503.23829 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Miguel Suau |
Author | Matthijs T. J. Spaan |
Author | Frans A. Oliehoek |
Abstract | Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs. |
Date | 2024-06-24 |
Short Title | Bad Habits |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2306.02419 |
Accessed | 4/11/2025, 1:10:00 PM |
Extra | arXiv:2306.02419 [cs] |
DOI | 10.48550/arXiv.2306.02419 |
Repository | arXiv |
Archive ID | arXiv:2306.02419 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Gemma Team |
Author | Aishwarya Kamath |
Author | Johan Ferret |
Author | Shreya Pathak |
Author | Nino Vieillard |
Author | Ramona Merhej |
Author | Sarah Perrin |
Author | Tatiana Matejovicova |
Author | Alexandre Ramé |
Author | Morgane Rivière |
Author | Louis Rouillard |
Author | Thomas Mesnard |
Author | Geoffrey Cideron |
Author | Jean-Bastien Grill |
Author | Sabela Ramos |
Author | Edouard Yvinec |
Author | Michelle Casbon |
Author | Etienne Pot |
Author | Ivo Penchev |
Author | Gaël Liu |
Author | Francesco Visin |
Author | Kathleen Kenealy |
Author | Lucas Beyer |
Author | Xiaohua Zhai |
Author | Anton Tsitsulin |
Author | Robert Busa-Fekete |
Author | Alex Feng |
Author | Noveen Sachdeva |
Author | Benjamin Coleman |
Author | Yi Gao |
Author | Basil Mustafa |
Author | Iain Barr |
Author | Emilio Parisotto |
Author | David Tian |
Author | Matan Eyal |
Author | Colin Cherry |
Author | Jan-Thorsten Peter |
Author | Danila Sinopalnikov |
Author | Surya Bhupatiraju |
Author | Rishabh Agarwal |
Author | Mehran Kazemi |
Author | Dan Malkin |
Author | Ravin Kumar |
Author | David Vilar |
Author | Idan Brusilovsky |
Author | Jiaming Luo |
Author | Andreas Steiner |
Author | Abe Friesen |
Author | Abhanshu Sharma |
Author | Abheesht Sharma |
Author | Adi Mayrav Gilady |
Author | Adrian Goedeckemeyer |
Author | Alaa Saade |
Author | Alex Feng |
Author | Alexander Kolesnikov |
Author | Alexei Bendebury |
Author | Alvin Abdagic |
Author | Amit Vadi |
Author | András György |
Author | André Susano Pinto |
Author | Anil Das |
Author | Ankur Bapna |
Author | Antoine Miech |
Author | Antoine Yang |
Author | Antonia Paterson |
Author | Ashish Shenoy |
Author | Ayan Chakrabarti |
Author | Bilal Piot |
Author | Bo Wu |
Author | Bobak Shahriari |
Author | Bryce Petrini |
Author | Charlie Chen |
Author | Charline Le Lan |
Author | Christopher A. Choquette-Choo |
Author | C. J. Carey |
Author | Cormac Brick |
Author | Daniel Deutsch |
Author | Danielle Eisenbud |
Author | Dee Cattle |
Author | Derek Cheng |
Author | Dimitris Paparas |
Author | Divyashree Shivakumar Sreepathihalli |
Author | Doug Reid |
Author | Dustin Tran |
Author | Dustin Zelle |
Author | Eric Noland |
Author | Erwin Huizenga |
Author | Eugene Kharitonov |
Author | Frederick Liu |
Author | Gagik Amirkhanyan |
Author | Glenn Cameron |
Author | Hadi Hashemi |
Author | Hanna Klimczak-Plucińska |
Author | Harman Singh |
Author | Harsh Mehta |
Author | Harshal Tushar Lehri |
Author | Hussein Hazimeh |
Author | Ian Ballantyne |
Author | Idan Szpektor |
Author | Ivan Nardini |
Author | Jean Pouget-Abadie |
Author | Jetha Chan |
Author | Joe Stanton |
Author | John Wieting |
Author | Jonathan Lai |
Author | Jordi Orbay |
Author | Joseph Fernandez |
Author | Josh Newlan |
Author | Ju-yeong Ji |
Author | Jyotinder Singh |
Author | Kat Black |
Author | Kathy Yu |
Author | Kevin Hui |
Author | Kiran Vodrahalli |
Author | Klaus Greff |
Author | Linhai Qiu |
Author | Marcella Valentine |
Author | Marina Coelho |
Author | Marvin Ritter |
Author | Matt Hoffman |
Author | Matthew Watson |
Author | Mayank Chaturvedi |
Author | Michael Moynihan |
Author | Min Ma |
Author | Nabila Babar |
Author | Natasha Noy |
Author | Nathan Byrd |
Author | Nick Roy |
Author | Nikola Momchev |
Author | Nilay Chauhan |
Author | Noveen Sachdeva |
Author | Oskar Bunyan |
Author | Pankil Botarda |
Author | Paul Caron |
Author | Paul Kishan Rubenstein |
Author | Phil Culliton |
Author | Philipp Schmid |
Author | Pier Giuseppe Sessa |
Author | Pingmei Xu |
Author | Piotr Stanczyk |
Author | Pouya Tafti |
Author | Rakesh Shivanna |
Author | Renjie Wu |
Author | Renke Pan |
Author | Reza Rokni |
Author | Rob Willoughby |
Author | Rohith Vallu |
Author | Ryan Mullins |
Author | Sammy Jerome |
Author | Sara Smoot |
Author | Sertan Girgin |
Author | Shariq Iqbal |
Author | Shashir Reddy |
Author | Shruti Sheth |
Author | Siim Põder |
Author | Sijal Bhatnagar |
Author | Sindhu Raghuram Panyam |
Author | Sivan Eiger |
Author | Susan Zhang |
Author | Tianqi Liu |
Author | Trevor Yacovone |
Author | Tyler Liechty |
Author | Uday Kalra |
Author | Utku Evci |
Author | Vedant Misra |
Author | Vincent Roseberry |
Author | Vlad Feinberg |
Author | Vlad Kolesnikov |
Author | Woohyun Han |
Author | Woosuk Kwon |
Author | Xi Chen |
Author | Yinlam Chow |
Author | Yuvein Zhu |
Author | Zichuan Wei |
Author | Zoltan Egyed |
Author | Victor Cotruta |
Author | Minh Giang |
Author | Phoebe Kirk |
Author | Anand Rao |
Author | Kat Black |
Author | Nabila Babar |
Author | Jessica Lo |
Author | Erica Moreira |
Author | Luiz Gustavo Martins |
Author | Omar Sanseviero |
Author | Lucas Gonzalez |
Author | Zach Gleicher |
Author | Tris Warkentin |
Author | Vahab Mirrokni |
Author | Evan Senter |
Author | Eli Collins |
Author | Joelle Barral |
Author | Zoubin Ghahramani |
Author | Raia Hadsell |
Author | Yossi Matias |
Author | D. Sculley |
Author | Slav Petrov |
Author | Noah Fiedel |
Author | Noam Shazeer |
Author | Oriol Vinyals |
Author | Jeff Dean |
Author | Demis Hassabis |
Author | Koray Kavukcuoglu |
Author | Clement Farabet |
Author | Elena Buchatskaya |
Author | Jean-Baptiste Alayrac |
Author | Rohan Anil |
Author | Dmitry Lepikhin |
Author | Sebastian Borgeaud |
Author | Olivier Bachem |
Author | Armand Joulin |
Author | Alek Andreev |
Author | Cassidy Hardin |
Author | Robert Dadashi |
Author | Léonard Hussenot |
Abstract | We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community. |
Date | 2025-03-25 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.19786 |
Accessed | 4/11/2025, 12:33:53 PM |
Extra | arXiv:2503.19786 [cs] |
DOI | 10.48550/arXiv.2503.19786 |
Repository | arXiv |
Archive ID | arXiv:2503.19786 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Philip Moreira Tomei |
Author | Rupal Jain |
Author | Matija Franklin |
Abstract | This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance. |
Date | 2025-03-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17755 |
Accessed | 4/11/2025, 1:09:46 PM |
Extra | arXiv:2501.17755 [econ] |
DOI | 10.48550/arXiv.2501.17755 |
Repository | arXiv |
Archive ID | arXiv:2501.17755 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Tianci Xue |
Author | Weijian Qi |
Author | Tianneng Shi |
Author | Chan Hee Song |
Author | Boyu Gou |
Author | Dawn Song |
Author | Huan Sun |
Author | Yu Su |
Abstract | As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research. |
Date | 2025-04-02 |
Short Title | An Illusion of Progress? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.01382 |
Accessed | 4/11/2025, 5:31:52 PM |
Extra | arXiv:2504.01382 [cs] |
DOI | 10.48550/arXiv.2504.01382 |
Repository | arXiv |
Archive ID | arXiv:2504.01382 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 22 pages, 16 figures, 4 tables
Item Type | Journal Article |
---|---|
Author | Yutaro Yamada |
Author | Robert Tjarko Lange |
Author | Cong Lu |
Author | Shengran Hu |
Author | Chris Lu |
Author | Jakob Foerster |
Author | David Ha |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Asaf Yehudai |
Author | Lilach Eden |
Author | Alan Li |
Author | Guy Uziel |
Author | Yilun Zhao |
Author | Roy Bar-Haim |
Author | Arman Cohan |
Author | Michal Shmueli-Scheuer |
Abstract | The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research. |
Date | 2025-03-20 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16416 |
Accessed | 3/25/2025, 5:38:55 PM |
Extra | arXiv:2503.16416 [cs] |
DOI | 10.48550/arXiv.2503.16416 |
Repository | arXiv |
Archive ID | arXiv:2503.16416 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Web Page |
---|---|
Abstract | We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology. |
Language | en |
URL | https://transformer-circuits.pub/2025/attribution-graphs/biology.html |
Accessed | 4/11/2025, 12:35:22 PM |
Website Title | Transformer Circuits |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Eric Zhao |
Author | Pranjal Awasthi |
Author | Sreenivas Gollapudi |
Abstract | Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies. |
Date | 2025-02-20 |
Short Title | Sample, Scrutinize and Scale |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01839 |
Accessed | 3/25/2025, 9:28:34 PM |
Extra | arXiv:2502.01839 [cs] |
DOI | 10.48550/arXiv.2502.01839 |
Repository | arXiv |
Archive ID | arXiv:2502.01839 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
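A minimal sketch of the sample-then-self-verify loop the abstract describes, assuming hypothetical `generate` and `verify` callables that would wrap a model API; this is not the paper's implementation.

```python
import random

def sampling_based_search(question, generate, verify, num_samples=16):
    """Draw candidate responses, self-verify each, and return the best-scoring one.
    `generate(question)` and `verify(question, response)` are assumed wrappers
    around a model API; `verify` returns a score in [0, 1]."""
    candidates = [generate(question) for _ in range(num_samples)]
    scored = [(verify(question, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy usage with stub functions standing in for model calls.
def generate(question):
    return f"answer-{random.randint(0, 9)}"

def verify(question, response):
    return random.random()

print(sampling_based_search("What is 2 + 2?", generate, verify))
```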
Item Type | Preprint |
---|---|
Author | Yuxuan Zhu |
Author | Antony Kellermann |
Author | Dylan Bowman |
Author | Philip Li |
Author | Akul Gupta |
Author | Adarsh Danda |
Author | Richard Fang |
Author | Conner Jensen |
Author | Eric Ihli |
Author | Jason Benn |
Author | Jet Geronimo |
Author | Avi Dhir |
Author | Sudhit Rao |
Author | Kaicheng Yu |
Author | Twm Stone |
Author | Daniel Kang |
Abstract | Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities. |
Date | 2025-04-01 |
Short Title | CVE-Bench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.17332 |
Accessed | 4/11/2025, 1:09:54 PM |
Extra | arXiv:2503.17332 [cs] |
DOI | 10.48550/arXiv.2503.17332 |
Repository | arXiv |
Archive ID | arXiv:2503.17332 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 15 pages, 4 figures, 5 tables
Item Type | Web Page |
---|---|
URL | https://transluce.org/introducing-docent |
Accessed | 4/11/2025, 12:31:23 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Abstract | An inside view of the AI revolution, from the people and companies making it happen. |
Language | en-US |
URL | https://stripe.com/scaling |
Accessed | 4/11/2025, 2:47:51 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
URL | https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ |
Accessed | 4/11/2025, 2:46:31 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
URL | https://transluce.org/investigating-o3-truthfulness |
Accessed | 4/17/2025, 11:40:51 AM |
Date Added | 4/17/2025, 11:40:51 AM |
Modified | 4/17/2025, 11:40:51 AM |