Item Type | Preprint |
---|---|
Author | Giulio Antonio Abbo |
Author | Tony Belpaeme |
Abstract | Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses. |
Date | 2025-01-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.03957 |
Accessed | 3/25/2025, 5:45:31 PM |
Extra | arXiv:2501.03957 [cs] |
DOI | 10.48550/arXiv.2501.03957 |
Repository | arXiv |
Archive ID | arXiv:2501.03957 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 13 pages, 2 figures
Item Type | Journal Article |
---|---|
Author | Daron Acemoglu |
Author | Nicolás Ajzenman |
Author | Cevat Giray Aksoy |
Author | Martin Fiszbein |
Author | Carlos Molina |
Abstract | Using large-scale survey data covering more than 110 countries and exploiting within-country variation across cohorts and surveys, we show that individuals with longer exposure to democracy display stronger support for democratic institutions, and that this effect is almost entirely driven by exposure to democracies with successful performance in terms of economic growth, control of corruption, peace and political stability, and public goods provision. Across a variety of specifications, estimation methods, and samples, the results are robust, and the timing and nature of the effects are consistent with our interpretation. We also present suggestive evidence that democratic institutions that receive support from their citizens perform better in the face of negative shocks. |
Date | 2025-03-06 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://academic.oup.com/restud/article/92/2/621/7675443 |
Accessed | 4/11/2025, 11:47:00 AM |
Rights | https://academic.oup.com/pages/standard-publication-reuse-rights |
Volume | 92 |
Pages | 621-655 |
Publication | Review of Economic Studies |
DOI | 10.1093/restud/rdae051 |
Issue | 2 |
ISSN | 0034-6527, 1467-937X |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Essential AI |
Author | Darsh J. Shah |
Author | Peter Rushton |
Author | Somanshu Singla |
Author | Mohit Parmar |
Author | Kurt Smith |
Author | Yash Vanjani |
Author | Ashish Vaswani |
Author | Adarsh Chaluvaraju |
Author | Andrew Hojel |
Author | Andrew Ma |
Author | Anil Thomas |
Author | Anthony Polloreno |
Author | Ashish Tanwer |
Author | Burhan Drak Sibai |
Author | Divya S. Mansingka |
Author | Divya Shivaprasad |
Author | Ishaan Shah |
Author | Karl Stratos |
Author | Khoi Nguyen |
Author | Michael Callahan |
Author | Michael Pust |
Author | Mrinal Iyer |
Author | Philip Monk |
Author | Platon Mazarakis |
Author | Ritvik Kapila |
Author | Saurabh Srivastava |
Author | Tim Romanski |
Abstract | A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks. |
Date | 2025-04-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.04022 |
Accessed | 4/9/2025, 11:11:36 PM |
Extra | arXiv:2504.04022 [cs] |
DOI | 10.48550/arXiv.2504.04022 |
Repository | arXiv |
Archive ID | arXiv:2504.04022 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Jacy Reese Anthis |
Author | Ryan Liu |
Author | Sean M. Richardson |
Author | Austin C. Kozlowski |
Author | Bernard Koch |
Author | James Evans |
Author | Erik Brynjolfsson |
Author | Michael Bernstein |
Abstract | Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances. |
Date | 2025-04-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.02234 |
Accessed | 4/9/2025, 11:29:43 PM |
Extra | arXiv:2504.02234 [cs] |
DOI | 10.48550/arXiv.2504.02234 |
Repository | arXiv |
Archive ID | arXiv:2504.02234 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Author | Apollo Research |
Abstract | We evaluate whether Claude Sonnet 3.7 and other frontier models know that they are being evaluated. |
Date | 4/7/2025 |
Language | en-GB |
URL | https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations |
Accessed | 3/25/2025, 9:30:24 PM |
Website Title | Apollo Research |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 5/1/2025, 2:05:23 PM |
Item Type | Web Page |
---|---|
Author | Dron Hazra
Author | Max Loeffler
Author | Murat Cubuktepe
Author | Levon Avagyan
Author | Liv Gorton
Author | Mark Bissell
Author | Owen Lewis
Author | Thomas McGrath
Author | Daniel Balsam
Date | 2025-04-15
Language | en |
URL | https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model |
Accessed | 4/17/2025, 11:35:50 AM |
Date Added | 4/17/2025, 11:35:50 AM |
Modified | 4/17/2025, 11:35:50 AM |
Item Type | Preprint |
---|---|
Author | Vidhisha Balachandran |
Author | Jingya Chen |
Author | Lingjiao Chen |
Author | Shivam Garg |
Author | Neel Joshi |
Author | Yash Lara |
Author | John Langford |
Author | Besmira Nushi |
Author | Vibhav Vineet |
Author | Yue Wu |
Author | Safoora Yousefi |
Abstract | Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements. |
Date | 2025-03-31 |
Short Title | Inference-Time Scaling for Complex Tasks |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.00294 |
Accessed | 4/11/2025, 12:33:43 PM |
Extra | arXiv:2504.00294 [cs] |
DOI | 10.48550/arXiv.2504.00294 |
Repository | arXiv |
Archive ID | arXiv:2504.00294 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Krisztian Balog |
Author | Donald Metzler |
Author | Zhen Qin |
Abstract | Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation. |
Date | 2025-03-24 |
Short Title | Rankers, Judges, and Assistants |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.19092 |
Accessed | 4/14/2025, 9:56:20 PM |
Extra | arXiv:2503.19092 [cs] |
DOI | 10.48550/arXiv.2503.19092 |
Repository | arXiv |
Archive ID | arXiv:2503.19092 |
Date Added | 4/14/2025, 9:56:33 PM |
Modified | 4/14/2025, 9:56:35 PM |
Item Type | Journal Article |
---|---|
Author | David Bau |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | David Bau |
Author | Tom McGrath |
Author | Sarah Schwettmann |
Author | Dylan Hadfield-Menell |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Aryan Bhatt |
Author | Cody Rushing |
Author | Adam Kaufman |
Author | Tyler Tracy |
Author | Vasil Georgiev |
Author | David Matolcsi |
Author | Akbir Khan |
Author | Buck Shlegeris |
Abstract | Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent. |
Date | 2025-04-14 |
Short Title | Ctrl-Z |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.10374 |
Accessed | 4/17/2025, 11:40:05 AM |
Extra | arXiv:2504.10374 [cs] |
DOI | 10.48550/arXiv.2504.10374 |
Repository | arXiv |
Archive ID | arXiv:2504.10374 |
Date Added | 4/17/2025, 11:40:05 AM |
Modified | 4/17/2025, 11:40:05 AM |
Comment: bashcontrol.com
Item Type | Journal Article |
---|---|
Author | Yanda Chen |
Author | Joe Benton |
Author | Ansh Radhakrishnan |
Author | Jonathan Uesato |
Author | Carson Denison |
Author | John Schulman |
Author | Arushi Somani |
Author | Peter Hase |
Author | Misha Wagner |
Author | Fabien Roger |
Author | Vlad Mikulik |
Author | Sam Bowman |
Author | Jan Leike |
Author | Jared Kaplan |
Author | Ethan Perez |
Abstract | Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Nuo Chen |
Author | Zhiyuan Hu |
Author | Qingyun Zou |
Author | Jiaying Wu |
Author | Qian Wang |
Author | Bryan Hooi |
Author | Bingsheng He |
Abstract | The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning. |
Date | 2025-03-31 |
Short Title | JudgeLRM |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.00050 |
Accessed | 4/14/2025, 9:59:44 PM |
Extra | arXiv:2504.00050 [cs] |
DOI | 10.48550/arXiv.2504.00050 |
Repository | arXiv |
Archive ID | arXiv:2504.00050 |
Date Added | 4/14/2025, 9:59:44 PM |
Modified | 4/14/2025, 9:59:44 PM |
Comment: preprint
Item Type | Preprint |
---|---|
Author | Quy-Anh Dang |
Author | Chris Ngo |
Abstract | Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs. |
Date | 2025-03-20 |
Short Title | Reinforcement Learning for Reasoning in Small LLMs |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16219 |
Accessed | 4/11/2025, 1:09:29 PM |
Extra | arXiv:2503.16219 [cs] |
DOI | 10.48550/arXiv.2503.16219 |
Repository | arXiv |
Archive ID | arXiv:2503.16219 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Tom Davidson |
Author | Lukas Finnveden |
Author | Rose Hadshar |
Abstract | This report assesses the risk that a small group—or even just one person—could use advanced AI to stage a coup. An AI-enabled coup is most likely to be staged by leaders of frontier AI projects, heads of state, and military officials; and could occur even in established democracies. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/17/2025, 11:42:37 AM |
Modified | 4/17/2025, 11:42:37 AM |
Item Type | Journal Article |
---|---|
Author | Cathy Mengying Fang |
Author | Auren R Liu |
Author | Valdemar Danry |
Author | Eunhae Lee |
Author | Samantha W T Chan |
Author | Pat Pataranutaporn |
Author | Pattie Maes |
Author | Jason Phang |
Author | Michael Lampe |
Author | Lama Ahmad |
Author | Sandhini Agarwal |
Abstract | AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users’ loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage—across all modalities and conversation types—correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots’ ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being. |
Date | 2025 |
Language | en |
Library Catalog | Zotero |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Mehdi Fatemi |
Author | Banafsheh Rafiee |
Author | Mingjie Tang |
Author | Kartik Talamadupula |
Abstract | Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results. |
Date | 2025-04-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.05185 |
Accessed | 4/14/2025, 10:01:11 PM |
Extra | arXiv:2504.05185 [cs] |
DOI | 10.48550/arXiv.2504.05185 |
Repository | arXiv |
Archive ID | arXiv:2504.05185 |
Date Added | 4/14/2025, 10:01:11 PM |
Modified | 4/14/2025, 10:01:11 PM |
Item Type | Preprint |
---|---|
Author | Jillian Fisher |
Author | Ruth E. Appel |
Author | Chan Young Park |
Author | Yujin Potter |
Author | Liwei Jiang |
Author | Taylor Sorensen |
Author | Shangbin Feng |
Author | Yulia Tsvetkov |
Author | Margaret E. Roberts |
Author | Jennifer Pan |
Author | Dawn Song |
Author | Yejin Choi |
Abstract | AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality, defined as the absence of bias, is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models. |
Date | 2025-02-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.05728 |
Accessed | 4/11/2025, 10:31:20 AM |
Extra | arXiv:2503.05728 [cs] |
DOI | 10.48550/arXiv.2503.05728 |
Repository | arXiv |
Archive ID | arXiv:2503.05728 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Code: https://github.com/jfisher52/Approximation_Political_Neutrality
Item Type | Preprint |
---|---|
Author | Dylan J. Foster |
Author | Zakaria Mhammedi |
Author | Dhruv Rohatgi |
Abstract | Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration. |
Date | 2025-03-13 |
Short Title | Is a Good Foundation Necessary for Efficient Reinforcement Learning? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.07453 |
Accessed | 4/11/2025, 3:16:04 PM |
Extra | arXiv:2503.07453 [cs] |
DOI | 10.48550/arXiv.2503.07453 |
Repository | arXiv |
Archive ID | arXiv:2503.07453 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: V2: Improved number of prompts used by Algorithm 1
Item Type | Preprint |
---|---|
Author | Lewis Hammond |
Author | Sam Adam-Day |
Abstract | We consider the problem of how a trusted, but computationally bounded agent (a 'verifier') can learn to interact with one or more powerful but untrusted agents ('provers') in order to solve a given task. More specifically, we study the case in which agents are represented using neural networks and refer to solutions of this problem as neural interactive proofs. First we introduce a unifying framework based on prover-verifier games, which generalises previously proposed interaction protocols. We then describe several new protocols for generating neural interactive proofs, and provide a theoretical comparison of both new and existing approaches. Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems. |
Date | 2025-03-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.08897 |
Accessed | 3/25/2025, 9:19:45 PM |
Extra | arXiv:2412.08897 [cs] |
DOI | 10.48550/arXiv.2412.08897 |
Repository | arXiv |
Archive ID | arXiv:2412.08897 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: ICLR'25 camera-ready version; 51 pages, 17 figures
Item Type | Preprint |
---|---|
Author | Benjamin Hilton |
Author | Marie Davidsen Buhl |
Author | Tomek Korbak |
Author | Geoffrey Irving |
Abstract | Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases. |
Date | 2025-02-05 |
Short Title | Safety Cases |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.04744 |
Accessed | 3/25/2025, 9:34:29 PM |
Extra | arXiv:2503.04744 [cs] |
DOI | 10.48550/arXiv.2503.04744 |
Repository | arXiv |
Archive ID | arXiv:2503.04744 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 18 pages, 2 figures, 3 tables
Item Type | Preprint |
---|---|
Author | Andreas Hochlehnert |
Author | Hardik Bhatnagar |
Author | Vishaal Udandarao |
Author | Samuel Albanie |
Author | Ameya Prabhu |
Author | Matthias Bethge |
Abstract | Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs, for reasoning benchmarks, establishing more rigorous foundations for future work. |
Date | 2025-04-09 |
Short Title | A Sober Look at Progress in Language Model Reasoning |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.07086 |
Accessed | 4/13/2025, 6:01:29 PM |
Extra | arXiv:2504.07086 [cs] |
DOI | 10.48550/arXiv.2504.07086 |
Repository | arXiv |
Archive ID | arXiv:2504.07086 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Technical Report
Item Type | Preprint |
---|---|
Author | Abe Bohan Hou |
Author | Hongru Du |
Author | Yichen Wang |
Author | Jingyu Zhang |
Author | Zixiao Wang |
Author | Paul Pu Liang |
Author | Daniel Khashabi |
Author | Lauren Gardner |
Author | Tianxing He |
Abstract | Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents' attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development. |
Date | 2025-03-16 |
Short Title | Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.09639 |
Accessed | 3/25/2025, 9:35:41 PM |
Extra | arXiv:2503.09639 [cs] |
DOI | 10.48550/arXiv.2503.09639 |
Repository | arXiv |
Archive ID | arXiv:2503.09639 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Jiwon Jeong |
Author | Hyeju Jang |
Author | Hogun Park |
Abstract | The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text by incorporating implicit contextual information -- counterarguments, explanations, and goals -- which we query for validity within the context of the argument. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from the GPT and LLaMA series. The results show substantial improvements over state-of-the-art models, with F1 score increases of up to 0.60 in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels. |
Date | 2025-03-30 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23363 |
Accessed | 4/14/2025, 10:00:28 PM |
Extra | arXiv:2503.23363 [cs] |
DOI | 10.48550/arXiv.2503.23363 |
Repository | arXiv |
Archive ID | arXiv:2503.23363 |
Date Added | 4/14/2025, 10:00:28 PM |
Modified | 4/14/2025, 10:00:28 PM |
Comment: Accepted to NAACL 2025 Findings
Item Type | Preprint |
---|---|
Author | Zhijing Jin |
Author | Max Kleiman-Weiner |
Author | Giorgio Piatti |
Author | Sydney Levine |
Author | Jiarui Liu |
Author | Fernando Gonzalez |
Author | Francesco Ortu |
Author | András Strausz |
Author | Mrinmaya Sachan |
Author | Rada Mihalcea |
Author | Yejin Choi |
Author | Bernhard Schölkopf |
Abstract | We evaluate the moral alignment of large language models (LLMs) with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine |
Date | 2024-12-14 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2407.02273 |
Accessed | 4/14/2025, 10:06:43 PM |
Extra | arXiv:2407.02273 [cs] |
DOI | 10.48550/arXiv.2407.02273 |
Repository | arXiv |
Archive ID | arXiv:2407.02273 |
Date Added | 4/14/2025, 10:06:43 PM |
Modified | 4/14/2025, 10:06:43 PM |
Comment: Best Paper @ NeurIPS 2024 Workshop on Pluralistic Alignment
Item Type | Preprint |
---|---|
Author | Cameron R. Jones |
Author | Benjamin K. Bergen |
Abstract | We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have. |
Date | 2025-03-31 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23674 |
Accessed | 4/11/2025, 1:10:04 PM |
Extra | arXiv:2503.23674 [cs] |
DOI | 10.48550/arXiv.2503.23674 |
Repository | arXiv |
Archive ID | arXiv:2503.23674 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Harang Ju |
Author | Sinan Aral |
Abstract | To uncover how AI agents change productivity, performance, and work processes, we introduce MindMeld: an experimentation platform enabling humans and AI agents to collaborate in integrative workspaces. In a large-scale marketing experiment on the platform, 2310 participants were randomly assigned to human-human and human-AI teams, with randomized AI personality traits. The teams exchanged 183,691 messages, and created 63,656 image edits, 1,960,095 ad copy edits, and 10,375 AI-generated images while producing 11,138 ads for a large think tank. Analysis of fine-grained communication, collaboration, and workflow logs revealed that collaborating with AI agents increased communication by 137% and allowed humans to focus 23% more on text and image content generation messaging and 20% less on direct text editing. Humans on Human-AI teams sent 23% fewer social messages, creating 60% greater productivity per worker and higher-quality ad copy. In contrast, human-human teams produced higher-quality images, suggesting that AI agents require fine-tuning for multimodal workflows. AI personality prompt randomization revealed that AI traits can complement human personalities to enhance collaboration. For example, conscientious humans paired with open AI agents improved image quality, while extroverted humans paired with conscientious AI agents reduced the quality of text, images, and clicks. In field tests of ad campaigns with ~5M impressions, ads with higher image quality produced by human collaborations and higher text quality produced by AI collaborations performed significantly better on click-through rate and cost per click metrics. Overall, ads created by human-AI teams performed similarly to those created by human-human teams. Together, these results suggest AI agents can improve teamwork and productivity, especially when tuned to complement human traits. |
Date | 2025-03-23 |
Short Title | Collaborating with AI Agents |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.18238 |
Accessed | 3/26/2025, 2:58:26 PM |
Extra | arXiv:2503.18238 [cs] |
DOI | 10.48550/arXiv.2503.18238 |
Repository | arXiv |
Archive ID | arXiv:2503.18238 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: 56 pages, 8 figures
Item Type | Preprint |
---|---|
Author | Ariba Khan |
Author | Stephen Casper |
Author | Dylan Hadfield-Menell |
Abstract | Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively. |
Date | 2025-04-08 |
Short Title | Randomness, Not Representation |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.08688 |
Accessed | 4/14/2025, 10:06:57 PM |
Extra | arXiv:2503.08688 [cs] |
DOI | 10.48550/arXiv.2503.08688 |
Repository | arXiv |
Archive ID | arXiv:2503.08688 |
Date Added | 4/14/2025, 10:06:57 PM |
Modified | 4/14/2025, 10:06:57 PM |
Item Type | Preprint |
---|---|
Author | Tomek Korbak |
Author | Mikita Balesni |
Author | Buck Shlegeris |
Author | Geoffrey Irving |
Abstract | As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agent's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk. |
Date | 2025-04-07 |
Short Title | How to evaluate control measures for LLM agents? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.05259 |
Accessed | 4/14/2025, 10:04:50 PM |
Extra | arXiv:2504.05259 [cs] |
DOI | 10.48550/arXiv.2504.05259 |
Repository | arXiv |
Archive ID | arXiv:2504.05259 |
Date Added | 4/14/2025, 10:04:50 PM |
Modified | 4/14/2025, 10:04:50 PM |
Item Type | Preprint |
---|---|
Author | Thomas Kwa |
Author | Ben West |
Author | Joel Becker |
Author | Amy Deng |
Author | Katharyn Garcia |
Author | Max Hasin |
Author | Sami Jawhar |
Author | Megan Kinniment |
Author | Nate Rush |
Author | Sydney Von Arx |
Author | Ryan Bloom |
Author | Thomas Broadley |
Author | Haoxing Du |
Author | Brian Goodrich |
Author | Nikola Jurkovic |
Author | Luke Harold Miles |
Author | Seraphina Nix |
Author | Tao Lin |
Author | Neev Parikh |
Author | David Rein |
Author | Lucas Jun Koba Sato |
Author | Hjalmar Wijk |
Author | Daniel M. Ziegler |
Author | Elizabeth Barnes |
Author | Lawrence Chan |
Abstract | Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month. |
Date | 2025-03-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.14499 |
Accessed | 3/25/2025, 6:00:08 PM |
Extra | arXiv:2503.14499 [cs] |
DOI | 10.48550/arXiv.2503.14499 |
Repository | arXiv |
Archive ID | arXiv:2503.14499 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Yuxuan Li |
Author | Hirokazu Shirado |
Author | Sauvik Das |
Abstract | While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases. |
Date | 2025-01-29 |
Short Title | Actions Speak Louder than Words |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17420 |
Accessed | 4/14/2025, 10:00:42 PM |
Extra | arXiv:2501.17420 [cs] |
DOI | 10.48550/arXiv.2501.17420 |
Repository | arXiv |
Archive ID | arXiv:2501.17420 |
Date Added | 4/14/2025, 10:00:42 PM |
Modified | 4/14/2025, 10:00:42 PM |
Item Type | Preprint |
---|---|
Author | Zijun Liu |
Author | Peiyi Wang |
Author | Runxin Xu |
Author | Shirong Ma |
Author | Chong Ruan |
Author | Peng Li |
Author | Yang Liu |
Author | Yu Wu |
Abstract | Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that *proper learning methods could enable effective inference-time scalability*. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the **inference-time scalability of generalist RM**, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in **DeepSeek-GRM** models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced. |
Date | 2025-04-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.02495 |
Accessed | 4/10/2025, 11:19:19 AM |
Extra | arXiv:2504.02495 [cs] |
DOI | 10.48550/arXiv.2504.02495 |
Repository | arXiv |
Archive ID | arXiv:2504.02495 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Preprint, under review. 42 pages
Item Type | Preprint |
---|---|
Author | Jiacheng Liu |
Author | Taylor Blanton |
Author | Yanai Elazar |
Author | Sewon Min |
Author | YenSung Chen |
Author | Arnavi Chheda-Kothary |
Author | Huy Tran |
Author | Byron Bischoff |
Author | Eric Marsh |
Author | Michael Schmitz |
Author | Cassidy Trier |
Author | Aaron Sarnat |
Author | Jenna James |
Author | Jon Borchardt |
Author | Bailey Kuehl |
Author | Evie Cheng |
Author | Karen Farley |
Author | Sruthi Sreeram |
Author | Taira Anderson |
Author | David Albright |
Author | Carissa Schoenick |
Author | Luca Soldaini |
Author | Dirk Groeneveld |
Author | Rock Yuren Pang |
Author | Pang Wei Koh |
Author | Noah A. Smith |
Author | Sophie Lebrecht |
Author | Yejin Choi |
Author | Hannaneh Hajishirzi |
Author | Ali Farhadi |
Author | Jesse Dodge |
Abstract | We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source. |
Date | 2025-04-09 |
Short Title | OLMoTrace |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.07096 |
Accessed | 4/17/2025, 11:51:42 AM |
Extra | arXiv:2504.07096 [cs] |
DOI | 10.48550/arXiv.2504.07096 |
Repository | arXiv |
Archive ID | arXiv:2504.07096 |
Date Added | 4/17/2025, 11:51:42 AM |
Modified | 4/17/2025, 11:51:42 AM |
Comment: Under submission at ACL 2025 demo track
Item Type | Preprint |
---|---|
Author | Xing Han Lù |
Author | Amirhossein Kazemnejad |
Author | Nicholas Meade |
Author | Arkil Patel |
Author | Dongchan Shin |
Author | Alejandra Zambrano |
Author | Karolina Stańczak |
Author | Peter Shaw |
Author | Christopher J. Pal |
Author | Siva Reddy |
Abstract | Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io |
Date | 2025-04-11 |
Short Title | AgentRewardBench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.08942 |
Accessed | 4/17/2025, 11:42:57 AM |
Extra | arXiv:2504.08942 [cs] |
DOI | 10.48550/arXiv.2504.08942 |
Repository | arXiv |
Archive ID | arXiv:2504.08942 |
Date Added | 4/17/2025, 11:42:57 AM |
Modified | 4/17/2025, 11:42:57 AM |
Item Type | Preprint |
---|---|
Author | Florian Mai |
Author | David Kaczér |
Author | Nicholas Kluge Corrêa |
Author | Lucie Flek |
Abstract | Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future. |
Date | 2025-03-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.13621 |
Accessed | 3/25/2025, 5:43:06 PM |
Extra | arXiv:2503.13621 [cs] |
DOI | 10.48550/arXiv.2503.13621 |
Repository | arXiv |
Archive ID | arXiv:2503.13621 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Comment: Published at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)
Item Type | Preprint |
---|---|
Author | Alex Mallen |
Author | Charlie Griffin |
Author | Misha Wagner |
Author | Alessandro Abate |
Author | Buck Shlegeris |
Abstract | An AI control protocol is a plan for usefully deploying AI systems that aims to prevent an AI from intentionally causing some unacceptable outcome. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (without shared memory between contexts). To do this, an AI system may need to reliably generate optimal plans in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating. We develop Subversion Strategy Eval, a suite of eight environments, covering a range of protocols and strategic capabilities, and six sets of affordances that help isolate individual capabilities. We implement the evaluation in Inspect-AI and release it open-source. We evaluate Claude 3.5 models, including helpful-only versions, as well as OpenAI reasoning models. None of the models demonstrate substantial capability in strategizing to subvert control protocols statelessly. However, providing models with additional affordances, such as the ability to share a plan between contexts, can substantially improve performance. We hope our evaluations can act as a leading indicator for when models are capable of subverting control protocols and also relax the worst-case assumption of perfect strategic ability in AI control evaluations. |
Date | 2025-03-24 |
Short Title | Subversion Strategy Eval |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2412.12480 |
Accessed | 3/25/2025, 5:37:00 PM |
Extra | arXiv:2412.12480 [cs] |
DOI | 10.48550/arXiv.2412.12480 |
Repository | arXiv |
Archive ID | arXiv:2412.12480 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | David Manheim |
Author | Sammy Martin |
Author | Mark Bailey |
Author | Mikhail Samin |
Author | Ross Greutzmacher |
Abstract | Auditing of AI systems is a promising way to understand and manage ethical problems and societal risks associated with contemporary AI systems, as well as some anticipated future risks. Efforts to develop standards for auditing Artificial Intelligence (AI) systems have therefore understandably gained momentum. However, we argue that creating auditing standards is not just insufficient, but actively harmful by proliferating unheeded and inconsistent standards, especially in light of the rapid evolution and ethical and safety challenges of AI. Instead, the paper proposes the establishment of an AI Audit Standards Board, responsible for developing and updating auditing methods and standards in line with the evolving nature of AI technologies. Such a body would ensure that auditing practices remain relevant, robust, and responsive to the rapid advancements in AI. The paper argues that such a governance structure would also be helpful for maintaining public trust in AI and for promoting a culture of safety and ethical responsibility within the AI industry. Throughout the paper, we draw parallels with other industries, including safety-critical industries like aviation and nuclear energy, as well as more prosaic ones such as financial accounting and pharmaceuticals. AI auditing should emulate those fields, and extend beyond technical assessments to include ethical considerations and stakeholder engagement, but we explain that this is not enough; emulating other fields' governance mechanisms for these processes, and for audit standards creation, is a necessity. We also emphasize the importance of auditing the entire development process of AI systems, not just the final products... |
Date | 2024-04-11 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2404.13060 |
Accessed | 4/11/2025, 1:09:42 PM |
Extra | arXiv:2404.13060 [cs] |
DOI | 10.48550/arXiv.2404.13060 |
Repository | arXiv |
Archive ID | arXiv:2404.13060 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Nestor Maslej |
Date | 2025 |
Language | en |
Library Catalog | Zotero |
Publication | Artificial Intelligence |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Author | METR |
Abstract | Resources for testing dangerous autonomous capabilities in frontier models |
Date | 2025-04-16T00:00:00+00:00 |
Language | en |
URL | https://metr.github.io/autonomy-evals-guide/openai-o3-report/ |
Accessed | 4/17/2025, 11:41:15 AM |
Website Title | METR’s Autonomy Evaluation Resources |
Date Added | 4/17/2025, 11:41:15 AM |
Modified | 4/17/2025, 11:41:15 AM |
Item Type | Preprint |
---|---|
Author | Michael Nielson |
Date | 2025-04-14 |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 10:05:28 PM |
Modified | 4/14/2025, 10:06:07 PM |
Item Type | Preprint |
---|---|
Author | Kiho Park |
Author | Yo Joong Choe |
Author | Yibo Jiang |
Author | Victor Veitch |
Abstract | The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of large language models (LLMs). Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as directions in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as vectors. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 large language models, estimating representations for 900+ hierarchically related concepts using data from WordNet. |
Date | 2025-02-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.01506 |
Accessed | 4/11/2025, 1:09:50 PM |
Extra | arXiv:2406.01506 [cs] |
DOI | 10.48550/arXiv.2406.01506 |
Repository | arXiv |
Archive ID | arXiv:2406.01506 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: Accepted for an oral presentation at ICLR 2025. Best Paper Award at the ICML 2024 Workshop on Mechanistic Interpretability. Code is available at https://github.com/KihoPark/LLM_Categorical_Hierarchical_Representations
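As a rough illustration of the binary-concept case the abstract mentions (not the paper's estimator), a concept direction can be approximated as the difference of mean representations between contrasting examples, with new representations scored by projection onto that direction. All names and data below are hypothetical stand-ins.

```python
import numpy as np

def concept_direction(pos_reps, neg_reps):
    """Estimate a linear concept direction as the difference of class means.
    pos_reps, neg_reps: arrays of shape (num_examples, hidden_dim)."""
    direction = pos_reps.mean(axis=0) - neg_reps.mean(axis=0)
    return direction / np.linalg.norm(direction)

def concept_score(rep, direction):
    """Project a single representation onto the concept direction."""
    return float(rep @ direction)

# Toy usage with random arrays standing in for model hidden states.
rng = np.random.default_rng(0)
pos = rng.normal(loc=1.0, size=(50, 16))   # e.g. contexts labelled with the concept
neg = rng.normal(loc=-1.0, size=(50, 16))  # contrasting contexts
d = concept_direction(pos, neg)
print(concept_score(rng.normal(loc=1.0, size=16), d))  # positive on average
```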
Item Type | Journal Article |
---|---|
Author | Jason Phang |
Author | Michael Lampe |
Author | Lama Ahmad |
Author | Sandhini Agarwal |
Author | Cathy Mengying Fang |
Author | Auren R Liu |
Author | Valdemar Danry |
Author | Eunhae Lee |
Author | Samantha W T Chan |
Author | Pat Pataranutaporn |
Author | Pattie Maes |
Abstract | As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users’ emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 4 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find the impact of voice-based interactions on emotional well-being to be highly nuanced, and influenced by factors such as the user’s initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues. |
Language | en |
Library Catalog | Zotero |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Journal Article |
---|---|
Author | Rohin Shah |
Author | Alex Irpan |
Author | Alexander Matt Turner |
Author | Anna Wang |
Author | Arthur Conmy |
Author | David Lindner |
Author | Jonah Brown-Cohen |
Author | Lewis Ho |
Author | Neel Nanda |
Author | Raluca Ada Popa |
Author | Rishub Jain |
Author | Rory Greig |
Author | Scott Emmons |
Author | Sebastian Farquhar |
Author | Sébastien Krier |
Author | Senthooran Rajamanoharan |
Author | Sophie Bridgers |
Author | Tobi Ijitoye |
Author | Tom Everitt |
Author | Victoria Krakovna |
Author | Vikrant Varma |
Author | Vladimir Mikulik |
Author | Zachary Kenton |
Author | Dave Orr |
Author | Shane Legg |
Author | Noah Goodman |
Author | Allan Dafoe |
Author | Four Flynn |
Author | Anca Dragan |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Murray Shanahan |
Abstract | Is it possible to articulate a conception of consciousness that is compatible with the exotic characteristics of contemporary, disembodied AI systems, and that can stand up to philosophical scrutiny? How would subjective time and selfhood show up for an entity that conformed to such a conception? Trying to answer these questions, even metaphorically, stretches the language of consciousness to breaking point. Ultimately, the attempt yields something like emptiness, in the Buddhist sense, and helps to undermine our dualistic inclinations towards subjectivity and selfhood. |
Date | 2025-03-20 |
Short Title | Palatable Conceptions of Disembodied Being |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16348 |
Accessed | 3/25/2025, 5:46:40 PM |
Extra | arXiv:2503.16348 [cs] |
DOI | 10.48550/arXiv.2503.16348 |
Repository | arXiv |
Archive ID | arXiv:2503.16348 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Patrick Queiroz Da Silva |
Author | Hari Sethuraman |
Author | Dheeraj Rajagopal |
Author | Hannaneh Hajishirzi |
Author | Sachin Kumar |
Abstract | Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrates fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions. |
Date | 2025-04-06 |
Short Title | Steering off Course |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.04635 |
Accessed | 4/9/2025, 11:09:20 PM |
Extra | arXiv:2504.04635 [cs] |
DOI | 10.48550/arXiv.2504.04635 |
Repository | arXiv |
Archive ID | arXiv:2504.04635 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Journal Article |
---|---|
Author | Lewis Smith |
Author | Senthooran Rajamanoharan |
Author | Arthur Conmy |
Author | Callum McDougall |
Author | Tom Lieberum |
Author | János Kramár |
Author | Rohin Shah |
Author | Neel Nanda |
Abstract | TL;DR: To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts. Negative result: SAEs underperformed linear probes. Corollary: linear probes are actually really good and cheap and perform great. As a result of this and parallel work, we are deprioritising fundamental SAE research for the moment and exploring other directions, though SAEs will remain a tool in our toolkit. We do not think that SAEs are useless or that no one should work on them, but we also do not think that SAEs will be a game-changer for interpretability, and speculate that the field is over-invested in them. Training SAEs specialised for chat data closed about half the gap but was still worse than linear probes; we tried several ways to train chat SAEs, and all did about as well. By default, we recommend taking an SAE on pretraining data and finetuning it on a bit of chat data. Other results: we found SAEs fairly helpful for debugging low-quality datasets (noticing spurious correlations); we present a variant of JumpReLU with an alternative sparsity penalty to get rid of high-frequency latents; and we argue that a standard auto-interp approach of computing the average interpretability of a uniformly sampled SAE latent can be misleading, as it doesn’t penalise models which have high-frequency, but not very interpretable, latents, and explore weighting the interpretability score by latent frequency. |
Date | 2025-03-26 |
Language | en |
Library Catalog | www.alignmentforum.org |
URL | https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks |
Accessed | 4/11/2025, 12:27:49 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
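The baseline the post favours, a linear probe on cached model activations, is essentially logistic regression on hidden states. The sketch below is a generic illustration under that assumption, with random arrays standing in for real activations; it is not the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-ins: activations of shape (n_prompts, hidden_dim) cached from some layer,
# and binary labels marking (here, synthetic) harmful intent.
rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 64))
labels = (activations[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Fit a linear probe on a training split and evaluate on the held-out split.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:150], labels[:150])
print("held-out accuracy:", probe.score(activations[150:], labels[150:]))
```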
Item Type | Preprint |
---|---|
Author | Taylor Sorensen |
Author | Pushkar Mishra |
Author | Roma Patel |
Author | Michael Henry Tessler |
Author | Michiel Bakker |
Author | Georgina Evans |
Author | Iason Gabriel |
Author | Noah Goodman |
Author | Verena Rieser |
Abstract | Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information. |
Date | 2025-03-19 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.15484 |
Accessed | 3/25/2025, 5:59:26 PM |
Extra | arXiv:2503.15484 [cs] |
DOI | 10.48550/arXiv.2503.15484 |
Repository | arXiv |
Archive ID | arXiv:2503.15484 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Preprint |
---|---|
Author | Giulio Starace |
Author | Oliver Jaffe |
Author | Dane Sherburn |
Author | James Aung |
Author | Jun Shern Chan |
Author | Leon Maksin |
Author | Rachel Dias |
Author | Evan Mays |
Author | Benjamin Kinsella |
Author | Wyatt Thompson |
Author | Johannes Heidecke |
Author | Amelia Glaese |
Author | Tejal Patwardhan |
Abstract | We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents. |
Date | 2025-04-07 |
Short Title | PaperBench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.01848 |
Accessed | 4/11/2025, 12:29:38 PM |
Extra | arXiv:2504.01848 [cs] |
DOI | 10.48550/arXiv.2504.01848 |
Repository | arXiv |
Archive ID | arXiv:2504.01848 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 30 pages, 14 figures
Item Type | Preprint |
---|---|
Author | Yi Su |
Author | Dian Yu |
Author | Linfeng Song |
Author | Juntao Li |
Author | Haitao Mi |
Author | Zhaopeng Tu |
Author | Min Zhang |
Author | Dong Yu |
Abstract | Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios. |
Date | 2025-04-01 |
Short Title | Crossing the Reward Bridge |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.23829 |
Accessed | 4/11/2025, 5:31:32 PM |
Extra | arXiv:2503.23829 [cs] |
DOI | 10.48550/arXiv.2503.23829 |
Repository | arXiv |
Archive ID | arXiv:2503.23829 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Miguel Suau |
Author | Matthijs T. J. Spaan |
Author | Frans A. Oliehoek |
Abstract | Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs. |
Date | 2024-06-24 |
Short Title | Bad Habits |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2306.02419 |
Accessed | 4/11/2025, 1:10:00 PM |
Extra | arXiv:2306.02419 [cs] |
DOI | 10.48550/arXiv.2306.02419 |
Repository | arXiv |
Archive ID | arXiv:2306.02419 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Gemma Team |
Author | Aishwarya Kamath |
Author | Johan Ferret |
Author | Shreya Pathak |
Author | Nino Vieillard |
Author | Ramona Merhej |
Author | Sarah Perrin |
Author | Tatiana Matejovicova |
Author | Alexandre Ramé |
Author | Morgane Rivière |
Author | Louis Rouillard |
Author | Thomas Mesnard |
Author | Geoffrey Cideron |
Author | Jean-Bastien Grill |
Author | Sabela Ramos |
Author | Edouard Yvinec |
Author | Michelle Casbon |
Author | Etienne Pot |
Author | Ivo Penchev |
Author | Gaël Liu |
Author | Francesco Visin |
Author | Kathleen Kenealy |
Author | Lucas Beyer |
Author | Xiaohua Zhai |
Author | Anton Tsitsulin |
Author | Robert Busa-Fekete |
Author | Alex Feng |
Author | Noveen Sachdeva |
Author | Benjamin Coleman |
Author | Yi Gao |
Author | Basil Mustafa |
Author | Iain Barr |
Author | Emilio Parisotto |
Author | David Tian |
Author | Matan Eyal |
Author | Colin Cherry |
Author | Jan-Thorsten Peter |
Author | Danila Sinopalnikov |
Author | Surya Bhupatiraju |
Author | Rishabh Agarwal |
Author | Mehran Kazemi |
Author | Dan Malkin |
Author | Ravin Kumar |
Author | David Vilar |
Author | Idan Brusilovsky |
Author | Jiaming Luo |
Author | Andreas Steiner |
Author | Abe Friesen |
Author | Abhanshu Sharma |
Author | Abheesht Sharma |
Author | Adi Mayrav Gilady |
Author | Adrian Goedeckemeyer |
Author | Alaa Saade |
Author | Alex Feng |
Author | Alexander Kolesnikov |
Author | Alexei Bendebury |
Author | Alvin Abdagic |
Author | Amit Vadi |
Author | András György |
Author | André Susano Pinto |
Author | Anil Das |
Author | Ankur Bapna |
Author | Antoine Miech |
Author | Antoine Yang |
Author | Antonia Paterson |
Author | Ashish Shenoy |
Author | Ayan Chakrabarti |
Author | Bilal Piot |
Author | Bo Wu |
Author | Bobak Shahriari |
Author | Bryce Petrini |
Author | Charlie Chen |
Author | Charline Le Lan |
Author | Christopher A. Choquette-Choo |
Author | C. J. Carey |
Author | Cormac Brick |
Author | Daniel Deutsch |
Author | Danielle Eisenbud |
Author | Dee Cattle |
Author | Derek Cheng |
Author | Dimitris Paparas |
Author | Divyashree Shivakumar Sreepathihalli |
Author | Doug Reid |
Author | Dustin Tran |
Author | Dustin Zelle |
Author | Eric Noland |
Author | Erwin Huizenga |
Author | Eugene Kharitonov |
Author | Frederick Liu |
Author | Gagik Amirkhanyan |
Author | Glenn Cameron |
Author | Hadi Hashemi |
Author | Hanna Klimczak-Plucińska |
Author | Harman Singh |
Author | Harsh Mehta |
Author | Harshal Tushar Lehri |
Author | Hussein Hazimeh |
Author | Ian Ballantyne |
Author | Idan Szpektor |
Author | Ivan Nardini |
Author | Jean Pouget-Abadie |
Author | Jetha Chan |
Author | Joe Stanton |
Author | John Wieting |
Author | Jonathan Lai |
Author | Jordi Orbay |
Author | Joseph Fernandez |
Author | Josh Newlan |
Author | Ju-yeong Ji |
Author | Jyotinder Singh |
Author | Kat Black |
Author | Kathy Yu |
Author | Kevin Hui |
Author | Kiran Vodrahalli |
Author | Klaus Greff |
Author | Linhai Qiu |
Author | Marcella Valentine |
Author | Marina Coelho |
Author | Marvin Ritter |
Author | Matt Hoffman |
Author | Matthew Watson |
Author | Mayank Chaturvedi |
Author | Michael Moynihan |
Author | Min Ma |
Author | Nabila Babar |
Author | Natasha Noy |
Author | Nathan Byrd |
Author | Nick Roy |
Author | Nikola Momchev |
Author | Nilay Chauhan |
Author | Noveen Sachdeva |
Author | Oskar Bunyan |
Author | Pankil Botarda |
Author | Paul Caron |
Author | Paul Kishan Rubenstein |
Author | Phil Culliton |
Author | Philipp Schmid |
Author | Pier Giuseppe Sessa |
Author | Pingmei Xu |
Author | Piotr Stanczyk |
Author | Pouya Tafti |
Author | Rakesh Shivanna |
Author | Renjie Wu |
Author | Renke Pan |
Author | Reza Rokni |
Author | Rob Willoughby |
Author | Rohith Vallu |
Author | Ryan Mullins |
Author | Sammy Jerome |
Author | Sara Smoot |
Author | Sertan Girgin |
Author | Shariq Iqbal |
Author | Shashir Reddy |
Author | Shruti Sheth |
Author | Siim Põder |
Author | Sijal Bhatnagar |
Author | Sindhu Raghuram Panyam |
Author | Sivan Eiger |
Author | Susan Zhang |
Author | Tianqi Liu |
Author | Trevor Yacovone |
Author | Tyler Liechty |
Author | Uday Kalra |
Author | Utku Evci |
Author | Vedant Misra |
Author | Vincent Roseberry |
Author | Vlad Feinberg |
Author | Vlad Kolesnikov |
Author | Woohyun Han |
Author | Woosuk Kwon |
Author | Xi Chen |
Author | Yinlam Chow |
Author | Yuvein Zhu |
Author | Zichuan Wei |
Author | Zoltan Egyed |
Author | Victor Cotruta |
Author | Minh Giang |
Author | Phoebe Kirk |
Author | Anand Rao |
Author | Kat Black |
Author | Nabila Babar |
Author | Jessica Lo |
Author | Erica Moreira |
Author | Luiz Gustavo Martins |
Author | Omar Sanseviero |
Author | Lucas Gonzalez |
Author | Zach Gleicher |
Author | Tris Warkentin |
Author | Vahab Mirrokni |
Author | Evan Senter |
Author | Eli Collins |
Author | Joelle Barral |
Author | Zoubin Ghahramani |
Author | Raia Hadsell |
Author | Yossi Matias |
Author | D. Sculley |
Author | Slav Petrov |
Author | Noah Fiedel |
Author | Noam Shazeer |
Author | Oriol Vinyals |
Author | Jeff Dean |
Author | Demis Hassabis |
Author | Koray Kavukcuoglu |
Author | Clement Farabet |
Author | Elena Buchatskaya |
Author | Jean-Baptiste Alayrac |
Author | Rohan Anil |
Author | Dmitry Lepikhin |
Author | Sebastian Borgeaud |
Author | Olivier Bachem |
Author | Armand Joulin |
Author | Alek Andreev |
Author | Cassidy Hardin |
Author | Robert Dadashi |
Author | Léonard Hussenot |
Abstract | We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community. |
Date | 2025-03-25 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.19786 |
Accessed | 4/11/2025, 12:33:53 PM |
Extra | arXiv:2503.19786 [cs] |
DOI | 10.48550/arXiv.2503.19786 |
Repository | arXiv |
Archive ID | arXiv:2503.19786 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Philip Moreira Tomei |
Author | Rupal Jain |
Author | Matija Franklin |
Abstract | This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance. |
Date | 2025-03-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17755 |
Accessed | 4/11/2025, 1:09:46 PM |
Extra | arXiv:2501.17755 [econ] |
DOI | 10.48550/arXiv.2501.17755 |
Repository | arXiv |
Archive ID | arXiv:2501.17755 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Tianci Xue |
Author | Weijian Qi |
Author | Tianneng Shi |
Author | Chan Hee Song |
Author | Boyu Gou |
Author | Dawn Song |
Author | Huan Sun |
Author | Yu Su |
Abstract | As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research. |
Date | 2025-04-02 |
Short Title | An Illusion of Progress? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2504.01382 |
Accessed | 4/11/2025, 5:31:52 PM |
Extra | arXiv:2504.01382 [cs] |
DOI | 10.48550/arXiv.2504.01382 |
Repository | arXiv |
Archive ID | arXiv:2504.01382 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 22 pages, 16 figures, 4 tables
Item Type | Journal Article |
---|---|
Author | Yutaro Yamada |
Author | Robert Tjarko Lange |
Author | Cong Lu |
Author | Shengran Hu |
Author | Chris Lu |
Author | Jakob Foerster |
Author | David Ha |
Language | en |
Library Catalog | Zotero |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Asaf Yehudai |
Author | Lilach Eden |
Author | Alan Li |
Author | Guy Uziel |
Author | Yilun Zhao |
Author | Roy Bar-Haim |
Author | Arman Cohan |
Author | Michal Shmueli-Scheuer |
Abstract | The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research. |
Date | 2025-03-20 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.16416 |
Accessed | 3/25/2025, 5:38:55 PM |
Extra | arXiv:2503.16416 [cs] |
DOI | 10.48550/arXiv.2503.16416 |
Repository | arXiv |
Archive ID | arXiv:2503.16416 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
Item Type | Web Page |
---|---|
Abstract | We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology. |
Language | en |
URL | https://transformer-circuits.pub/2025/attribution-graphs/biology.html |
Accessed | 4/11/2025, 12:35:22 PM |
Website Title | Transformer Circuits |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Preprint |
---|---|
Author | Eric Zhao |
Author | Pranjal Awasthi |
Author | Sreenivas Gollapudi |
Abstract | Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies. |
Date | 2025-02-20 |
Short Title | Sample, Scrutinize and Scale |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.01839 |
Accessed | 3/25/2025, 9:28:34 PM |
Extra | arXiv:2502.01839 [cs] |
DOI | 10.48550/arXiv.2502.01839 |
Repository | arXiv |
Archive ID | arXiv:2502.01839 |
Date Added | 4/2/2025, 2:13:04 PM |
Modified | 4/2/2025, 2:13:04 PM |
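A minimal sketch of the sample-then-self-verify loop the abstract describes, assuming hypothetical `generate` and `verify` callables that would wrap a model API; this is not the paper's implementation.

```python
import random

def sampling_based_search(question, generate, verify, num_samples=16):
    """Draw candidate responses, self-verify each, and return the best-scoring one.
    `generate(question)` and `verify(question, response)` are assumed wrappers
    around a model API; `verify` returns a score in [0, 1]."""
    candidates = [generate(question) for _ in range(num_samples)]
    scored = [(verify(question, c), c) for c in candidates]
    return max(scored, key=lambda pair: pair[0])[1]

# Toy usage with stub functions standing in for model calls.
def generate(question):
    return f"answer-{random.randint(0, 9)}"

def verify(question, response):
    return random.random()

print(sampling_based_search("What is 2 + 2?", generate, verify))
```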
Item Type | Preprint |
---|---|
Author | Yuxuan Zhu |
Author | Antony Kellermann |
Author | Dylan Bowman |
Author | Philip Li |
Author | Akul Gupta |
Author | Adarsh Danda |
Author | Richard Fang |
Author | Conner Jensen |
Author | Eric Ihli |
Author | Jason Benn |
Author | Jet Geronimo |
Author | Avi Dhir |
Author | Sudhit Rao |
Author | Kaicheng Yu |
Author | Twm Stone |
Author | Daniel Kang |
Abstract | Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities. |
Date | 2025-04-01 |
Short Title | CVE-Bench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.17332 |
Accessed | 4/11/2025, 1:09:54 PM |
Extra | arXiv:2503.17332 [cs] |
DOI | 10.48550/arXiv.2503.17332 |
Repository | arXiv |
Archive ID | arXiv:2503.17332 |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Comment: 15 pages, 4 figures, 5 tables
Item Type | Web Page |
---|---|
URL | https://transluce.org/introducing-docent |
Accessed | 4/11/2025, 12:31:23 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
Abstract | An inside view of the AI revolution, from the people and companies making it happen. |
Language | en-US |
URL | https://stripe.com/scaling |
Accessed | 4/11/2025, 2:47:51 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
URL | https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/ |
Accessed | 4/11/2025, 2:46:31 PM |
Date Added | 4/14/2025, 9:44:59 PM |
Modified | 4/14/2025, 9:44:59 PM |
Item Type | Web Page |
---|---|
URL | https://transluce.org/investigating-o3-truthfulness |
Accessed | 4/17/2025, 11:40:51 AM |
Date Added | 4/17/2025, 11:40:51 AM |
Modified | 4/17/2025, 11:40:51 AM |