• Vision Language Models as Values Detectors

    Item Type Preprint
    Author Giulio Antonio Abbo
    Author Tony Belpaeme
    Abstract Large Language Models integrating textual and visual inputs have introduced new possibilities for interpreting complex data. Despite their remarkable ability to generate coherent and contextually relevant text based on visual stimuli, the alignment of these models with human perception in identifying relevant elements in images requires further exploration. This paper investigates the alignment between state-of-the-art LLMs and human annotators in detecting elements of relevance within home environment scenarios. We created a set of twelve images depicting various domestic scenarios and enlisted fourteen annotators to identify the key element in each image. We then compared these human responses with outputs from five different LLMs, including GPT-4o and four LLaVA variants. Our findings reveal a varied degree of alignment, with LLaVA 34B showing the highest performance but still scoring low. However, an analysis of the results highlights the models' potential to detect value-laden elements in images, suggesting that with improved training and refined prompts, LLMs could enhance applications in social robotics, assistive technologies, and human-computer interaction by providing deeper insights and more contextually relevant responses.
    Date 2025-01-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.03957
    Accessed 3/25/2025, 5:45:31 PM
    Extra arXiv:2501.03957 [cs]
    DOI 10.48550/arXiv.2501.03957
    Repository arXiv
    Archive ID arXiv:2501.03957
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Computer Vision and Pattern Recognition
    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: 13 pages, 2 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • (Successful) Democracies Breed Their Own Support

    Item Type Journal Article
    Author Daron Acemoglu
    Author Nicolás Ajzenman
    Author Cevat Giray Aksoy
    Author Martin Fiszbein
    Author Carlos Molina
    Abstract Using large-scale survey data covering more than 110 countries and exploiting within-country variation across cohorts and surveys, we show that individuals with longer exposure to democracy display stronger support for democratic institutions, and that this effect is almost entirely driven by exposure to democracies with successful performance in terms of economic growth, control of corruption, peace and political stability, and public goods provision. Across a variety of specifications, estimation methods, and samples, the results are robust, and the timing and nature of the effects are consistent with our interpretation. We also present suggestive evidence that democratic institutions that receive support from their citizens perform better in the face of negative shocks.
    Date 2025-03-06
    Language en
    Library Catalog DOI.org (Crossref)
    URL https://academic.oup.com/restud/article/92/2/621/7675443
    Accessed 4/11/2025, 11:47:00 AM
    Rights https://academic.oup.com/pages/standard-publication-reuse-rights
    Volume 92
    Pages 621-655
    Publication Review of Economic Studies
    DOI 10.1093/restud/rdae051
    Issue 2
    ISSN 0034-6527, 1467-937X
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • Rethinking Reflection in Pre-Training

    Item Type Preprint
    Author Essential AI
    Author Darsh J. Shah
    Author Peter Rushton
    Author Somanshu Singla
    Author Mohit Parmar
    Author Kurt Smith
    Author Yash Vanjani
    Author Ashish Vaswani
    Author Adarsh Chaluvaraju
    Author Andrew Hojel
    Author Andrew Ma
    Author Anil Thomas
    Author Anthony Polloreno
    Author Ashish Tanwer
    Author Burhan Drak Sibai
    Author Divya S. Mansingka
    Author Divya Shivaprasad
    Author Ishaan Shah
    Author Karl Stratos
    Author Khoi Nguyen
    Author Michael Callahan
    Author Michael Pust
    Author Mrinal Iyer
    Author Philip Monk
    Author Platon Mazarakis
    Author Ritvik Kapila
    Author Saurabh Srivastava
    Author Tim Romanski
    Abstract A language model's ability to reflect on its own reasoning provides a key advantage for solving complex problems. While most recent research has focused on how this ability develops during reinforcement learning, we show that it actually begins to emerge much earlier - during the model's pre-training. To study this, we introduce deliberate errors into chains-of-thought and test whether the model can still arrive at the correct answer by recognizing and correcting these mistakes. By tracking performance across different stages of pre-training, we observe that this self-correcting ability appears early and improves steadily over time. For instance, an OLMo2-7B model pre-trained on 4 trillion tokens displays self-correction on our six self-reflection tasks.
    Date 2025-04-05
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.04022
    Accessed 4/9/2025, 11:11:36 PM
    Extra arXiv:2504.04022 [cs]
    DOI 10.48550/arXiv.2504.04022
    Repository arXiv
    Archive ID arXiv:2504.04022
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • LLM Social Simulations Are a Promising Research Method

    Item Type Preprint
    Author Jacy Reese Anthis
    Author Ryan Liu
    Author Sean M. Richardson
    Author Austin C. Kozlowski
    Author Bernard Koch
    Author James Evans
    Author Erik Brynjolfsson
    Author Michael Bernstein
    Abstract Accurate and verifiable large language model (LLM) simulations of human research subjects promise an accessible data source for understanding human behavior and training new AI systems. However, results to date have been limited, and few social scientists have adopted these methods. In this position paper, we argue that the promise of LLM social simulations can be achieved by addressing five tractable challenges. We ground our argument in a literature survey of empirical comparisons between LLMs and human research subjects, commentaries on the topic, and related work. We identify promising directions with prompting, fine-tuning, and complementary methods. We believe that LLM social simulations can already be used for exploratory research, such as pilot experiments for psychology, economics, sociology, and marketing. More widespread use may soon be possible with rapidly advancing LLM capabilities, and researchers should prioritize developing conceptual models and evaluations that can be iteratively deployed and refined at pace with ongoing AI advances.
    Date 2025-04-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.02234
    Accessed 4/9/2025, 11:29:43 PM
    Extra arXiv:2504.02234 [cs]
    DOI 10.48550/arXiv.2504.02234
    Repository arXiv
    Archive ID arXiv:2504.02234
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations

    Item Type Web Page
    Author Apollo Research
    Abstract We evaluate whether Claude Sonnet 3.7 and other frontier models know that they are being evaluated.
    Date 4/7/2025
    Language en-GB
    URL https://www.apolloresearch.ai/blog/claude-sonnet-37-often-knows-when-its-in-alignment-evaluations
    Accessed 3/25/2025, 9:30:24 PM
    Website Title Apollo Research
    Date Added 4/2/2025, 2:13:04 PM
    Modified 5/1/2025, 2:05:23 PM

    Attachments

    • Snapshot
  • https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model

    Item Type Web Page
    Author AUTHORS Dron Hazra † Max Loeffler † Murat Cubuktepe Levon Avagyan ‡ Liv Gorton Mark Bissell Owen Lewis Thomas McGrath Daniel Balsam * PUBLISHED Apr. 15
    Author 2025
    Language en
    Short Title https
    URL https://www.goodfire.ai/blog/under-the-hood-of-a-reasoning-model
    Accessed 4/17/2025, 11:35:50 AM
    Date Added 4/17/2025, 11:35:50 AM
    Modified 4/17/2025, 11:35:50 AM

    Attachments

    • Snapshot
  • Inference-Time Scaling for Complex Tasks: Where We Stand and What Lies Ahead

    Item Type Preprint
    Author Vidhisha Balachandran
    Author Jingya Chen
    Author Lingjiao Chen
    Author Shivam Garg
    Author Neel Joshi
    Author Yash Lara
    Author John Langford
    Author Besmira Nushi
    Author Vibhav Vineet
    Author Yue Wu
    Author Safoora Yousefi
    Abstract Inference-time scaling can enhance the reasoning capabilities of large language models (LLMs) on complex problems that benefit from step-by-step problem solving. Although lengthening generated scratchpads has proven effective for mathematical tasks, the broader impact of this approach on other tasks remains less clear. In this work, we investigate the benefits and limitations of scaling methods across nine state-of-the-art models and eight challenging tasks, including math and STEM reasoning, calendar planning, NP-hard problems, navigation, and spatial reasoning. We compare conventional models (e.g., GPT-4o) with models fine-tuned for inference-time scaling (e.g., o1) through evaluation protocols that involve repeated model calls, either independently or sequentially with feedback. These evaluations approximate lower and upper performance bounds and potential for future performance improvements for each model, whether through enhanced training or multi-model inference systems. Our extensive empirical analysis reveals that the advantages of inference-time scaling vary across tasks and diminish as problem complexity increases. In addition, simply using more tokens does not necessarily translate to higher accuracy in these challenging regimes. Results from multiple independent runs with conventional models using perfect verifiers show that, for some tasks, these models can achieve performance close to the average performance of today's most advanced reasoning models. However, for other tasks, a significant performance gap remains, even in very high scaling regimes. Encouragingly, all models demonstrate significant gains when inference is further scaled with perfect verifiers or strong feedback, suggesting ample potential for future improvements.
    Date 2025-03-31
    Short Title Inference-Time Scaling for Complex Tasks
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.00294
    Accessed 4/11/2025, 12:33:43 PM
    Extra arXiv:2504.00294 [cs]
    DOI 10.48550/arXiv.2504.00294
    Repository arXiv
    Archive ID arXiv:2504.00294
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

    Item Type Preprint
    Author Krisztian Balog
    Author Donald Metzler
    Author Zhen Qin
    Abstract Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.
    Date 2025-03-24
    Short Title Rankers, Judges, and Assistants
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.19092
    Accessed 4/14/2025, 9:56:20 PM
    Extra arXiv:2503.19092 [cs]
    DOI 10.48550/arXiv.2503.19092
    Repository arXiv
    Archive ID arXiv:2503.19092
    Date Added 4/14/2025, 9:56:33 PM
    Modified 4/14/2025, 9:56:35 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Information Retrieval

    Attachments

    • Preprint PDF
  • Rankers, Judges, and Assistants: Towards Understanding the Interplay of LLMs in Information Retrieval Evaluation

    Item Type Preprint
    Author Krisztian Balog
    Author Donald Metzler
    Author Zhen Qin
    Abstract Large language models (LLMs) are increasingly integral to information retrieval (IR), powering ranking, evaluation, and AI-assisted content creation. This widespread adoption necessitates a critical examination of potential biases arising from the interplay between these LLM-based components. This paper synthesizes existing research and presents novel experiment designs that explore how LLM-based rankers and assistants influence LLM-based judges. We provide the first empirical evidence of LLM judges exhibiting significant bias towards LLM-based rankers. Furthermore, we observe limitations in LLM judges' ability to discern subtle system performance differences. Contrary to some previous findings, our preliminary study does not find evidence of bias against AI-generated content. These results highlight the need for a more holistic view of the LLM-driven information ecosystem. To this end, we offer initial guidelines and a research agenda to ensure the reliable use of LLMs in IR evaluation.
    Date 2025-03-24
    Short Title Rankers, Judges, and Assistants
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.19092
    Accessed 4/14/2025, 9:56:20 PM
    Extra arXiv:2503.19092 [cs]
    DOI 10.48550/arXiv.2503.19092
    Repository arXiv
    Archive ID arXiv:2503.19092
    Date Added 4/14/2025, 9:56:33 PM
    Modified 4/14/2025, 9:56:35 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Information Retrieval
  • Creating a Resilient AI Ecosystem with NDIF and NNsight

    Item Type Journal Article
    Author David Bau
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • AI Dominance Requires Interpretability and Standards for Transparency and Security

    Item Type Journal Article
    Author David Bau
    Author Tom McGrath
    Author Sarah Schwettmann
    Author Dylan Hadfield-Menell
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • Ctrl-Z: Controlling AI Agents via Resampling

    Item Type Preprint
    Author Aryan Bhatt
    Author Cody Rushing
    Author Adam Kaufman
    Author Tyler Tracy
    Author Vasil Georgiev
    Author David Matolcsi
    Author Akbir Khan
    Author Buck Shlegeris
    Abstract Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. Our work presents the first control evaluation performed in an agent environment. We construct BashBench, a dataset of 257 challenging multi-step system administration tasks, and evaluate whether various safety measures can prevent an adversarially constructed AI agent from covertly downloading and executing malicious code in this environment. This multi-step setting introduces new attack and defense dynamics, which we investigate in order to design novel control protocols that prevent safety failures without hindering the ability of non-malicious agents to perform useful work. We introduce a class of control protocols called resample protocols that dynamically take additional samples of certain actions. We find these protocols significantly improve on existing techniques by selectively blocking the AI agent from executing suspicious code and incriminating the agent by generating additional examples of dangerous behavior. We measure the tradeoff between attack prevention and usefulness; our best protocol combines resampling with analysis of previous steps, reducing the success rate of attacks from 58% to 7% at a 5% cost to the performance of a non-malicious agent.
    Date 2025-04-14
    Short Title Ctrl-Z
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.10374
    Accessed 4/17/2025, 11:40:05 AM
    Extra arXiv:2504.10374 [cs]
    DOI 10.48550/arXiv.2504.10374
    Repository arXiv
    Archive ID arXiv:2504.10374
    Date Added 4/17/2025, 11:40:05 AM
    Modified 4/17/2025, 11:40:05 AM

    Tags:

    • Computer Science - Machine Learning

    Notes:

    • Comment: bashcontrol.com

    Attachments

    • Preprint PDF
    • Snapshot
  • Reasoning Models Don’t Always Say What They Think

    Item Type Journal Article
    Author Yanda Chen
    Author Joe Benton
    Author Ansh Radhakrishnan
    Author Jonathan Uesato
    Author Carson Denison
    Author John Schulman
    Author Arushi Somani
    Author Peter Hase
    Author Misha Wagner
    Author Fabien Roger
    Author Vlad Mikulik
    Author Sam Bowman
    Author Jan Leike
    Author Jared Kaplan
    Author Ethan Perez
    Abstract Chain-of-thought (CoT) offers a potential boon for AI safety as it allows monitoring a model’s CoT to try to understand its intentions and reasoning processes. However, the effectiveness of such monitoring hinges on CoTs faithfully representing models’ actual reasoning processes. We evaluate CoT faithfulness of state-of-the-art reasoning models across 6 reasoning hints presented in the prompts and find: (1) for most settings and models tested, CoTs reveal their usage of hints in at least 1% of examples where they use the hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement learning initially improves faithfulness but plateaus without saturating, and (3) when reinforcement learning increases how frequently hints are used (reward hacking), the propensity to verbalize them does not increase, even without training against a CoT monitor. These results suggest that CoT monitoring is a promising way of noticing undesired behaviors during training and evaluations, but that it is not sufficient to rule them out. They also suggest that in settings like ours where CoT reasoning is not necessary, test-time monitoring of CoTs is unlikely to reliably catch rare and catastrophic unexpected behaviors.
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • JudgeLRM: Large Reasoning Models as a Judge

    Item Type Preprint
    Author Nuo Chen
    Author Zhiyuan Hu
    Author Qingyun Zou
    Author Jiaying Wu
    Author Qian Wang
    Author Bryan Hooi
    Author Bingsheng He
    Abstract The rise of Large Language Models (LLMs) as evaluators offers a scalable alternative to human annotation, yet existing Supervised Fine-Tuning (SFT) for judges approaches often fall short in domains requiring complex reasoning. In this work, we investigate whether LLM judges truly benefit from enhanced reasoning capabilities. Through a detailed analysis of reasoning requirements across evaluation tasks, we reveal a negative correlation between SFT performance gains and the proportion of reasoning-demanding samples - highlighting the limitations of SFT in such scenarios. To address this, we introduce JudgeLRM, a family of judgment-oriented LLMs trained using reinforcement learning (RL) with judge-wise, outcome-driven rewards. JudgeLRM models consistently outperform both SFT-tuned and state-of-the-art reasoning models. Notably, JudgeLRM-3B surpasses GPT-4, and JudgeLRM-7B outperforms DeepSeek-R1 by 2.79% in F1 score, particularly excelling in judge tasks requiring deep reasoning.
    Date 2025-03-31
    Short Title JudgeLRM
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.00050
    Accessed 4/14/2025, 9:59:44 PM
    Extra arXiv:2504.00050 [cs]
    DOI 10.48550/arXiv.2504.00050
    Repository arXiv
    Archive ID arXiv:2504.00050
    Date Added 4/14/2025, 9:59:44 PM
    Modified 4/14/2025, 9:59:44 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: preprint

    Attachments

    • Preprint PDF
    • Snapshot
  • Reinforcement Learning for Reasoning in Small LLMs: What Works and What Doesn't

    Item Type Preprint
    Author Quy-Anh Dang
    Author Chris Ngo
    Abstract Enhancing the reasoning capabilities of large language models (LLMs) typically relies on massive computational resources and extensive datasets, limiting accessibility for resource-constrained settings. Our study investigates the potential of reinforcement learning (RL) to improve reasoning in small LLMs, focusing on a 1.5-billion-parameter model, DeepSeek-R1-Distill-Qwen-1.5B, under strict constraints: training on 4 NVIDIA A40 GPUs (48 GB VRAM each) within 24 hours. Adapting the Group Relative Policy Optimization (GRPO) algorithm and curating a compact, high-quality mathematical reasoning dataset, we conducted three experiments to explore model behavior and performance. Our results demonstrate rapid reasoning gains - e.g., AMC23 accuracy rising from 63% to 80% and AIME24 reaching 46.7%, surpassing o1-preview - using only 7,000 samples and a $42 training cost, compared to thousands of dollars for baseline models. However, challenges such as optimization instability and length constraints emerged with prolonged training. These findings highlight the efficacy of RL-based fine-tuning for small LLMs, offering a cost-effective alternative to large-scale approaches. We release our code and datasets as open-source resources, providing insights into trade-offs and laying a foundation for scalable, reasoning-capable LLMs in resource-limited environments. All are available at https://github.com/knoveleng/open-rs.
    Date 2025-03-20
    Short Title Reinforcement Learning for Reasoning in Small LLMs
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.16219
    Accessed 4/11/2025, 1:09:29 PM
    Extra arXiv:2503.16219 [cs]
    DOI 10.48550/arXiv.2503.16219
    Repository arXiv
    Archive ID arXiv:2503.16219
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • AI-Enabled Coups: How a Small Group Could Use AI to Seize Power

    Item Type Journal Article
    Author Tom Davidson
    Author Lukas Finnveden
    Author Rose Hadshar
    Abstract This report assesses the risk that a small group—or even just one person—could use advanced AI to stage a coup. An AI-enabled coup is most likely to be staged by leaders of frontier AI projects, heads of state, and military officials; and could occur even in established democracies.
    Language en
    Library Catalog Zotero
    Date Added 4/17/2025, 11:42:37 AM
    Modified 4/17/2025, 11:42:37 AM

    Attachments

    • PDF
  • HOW AI AND HUMAN BEHAVIORS SHAPE PSYCHOSOCIAL EFFECTS OF CHATBOT USE: A LONGITUDINAL RANDOMIZED CONTROLLED STUDY

    Item Type Journal Article
    Author Cathy Mengying Fang
    Author Auren R Liu
    Author Valdemar Danry
    Author Eunhae Lee
    Author Samantha W T Chan
    Author Pat Pataranutaporn
    Author Pattie Maes
    Author Jason Phang
    Author Michael Lampe
    Author Lama Ahmad
    Author Sandhini Agarwal
    Abstract AI chatbots, especially those with voice capabilities, have become increasingly human-like, with more users seeking emotional support and companionship from them. Concerns are rising about how such interactions might impact users’ loneliness and socialization with real people. We conducted a four-week randomized, controlled, IRB-approved experiment (n=981, >300K messages) to investigate how AI chatbot interaction modes (text, neutral voice, and engaging voice) and conversation types (open-ended, non-personal, and personal) influence psychosocial outcomes such as loneliness, social interaction with real people, emotional dependence on AI and problematic AI usage. Results showed that while voice-based chatbots initially appeared beneficial in mitigating loneliness and dependence compared with text-based chatbots, these advantages diminished at high usage levels, especially with a neutral-voice chatbot. Conversation type also shaped outcomes: personal topics slightly increased loneliness but tended to lower emotional dependence compared with open-ended conversations, whereas non-personal topics were associated with greater dependence among heavy users. Overall, higher daily usage—across all modalities and conversation types—correlated with higher loneliness, dependence, and problematic use, and lower socialization. Exploratory analyses revealed that those with stronger emotional attachment tendencies and higher trust in the AI chatbot tended to experience greater loneliness and emotional dependence, respectively. These findings underscore the complex interplay between chatbot design choices (e.g., voice expressiveness) and user behaviors (e.g., conversation content, usage frequency). We highlight the need for further research on whether chatbots’ ability to manage emotional content without fostering dependence or replacing human relationships benefits overall well-being.
    Date 2025
    Language en
    Library Catalog Zotero
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Attachments

    • PDF
  • Concise Reasoning via Reinforcement Learning

    Item Type Preprint
    Author Mehdi Fatemi
    Author Banafsheh Rafiee
    Author Mingjie Tang
    Author Kartik Talamadupula
    Abstract Despite significant advancements in large language models (LLMs), a major drawback of reasoning models is their enormous token usage, which increases computational cost, resource requirements, and response time. In this work, we revisit the core principles of reinforcement learning (RL) and, through mathematical analysis, demonstrate that the tendency to generate lengthy responses arises inherently from RL-based optimization during training. This finding questions the prevailing assumption that longer responses inherently improve reasoning accuracy. Instead, we uncover a natural correlation between conciseness and accuracy that has been largely overlooked. Moreover, we show that introducing a secondary phase of RL post-training, using a small set of problems and limited resources, can significantly reduce a model's chain of thought while maintaining or even enhancing accuracy. Finally, we validate our conclusions through extensive experimental results.
    Date 2025-04-07
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.05185
    Accessed 4/14/2025, 10:01:11 PM
    Extra arXiv:2504.05185 [cs]
    DOI 10.48550/arXiv.2504.05185
    Repository arXiv
    Archive ID arXiv:2504.05185
    Date Added 4/14/2025, 10:01:11 PM
    Modified 4/14/2025, 10:01:11 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • Political Neutrality in AI is Impossible- But Here is How to Approximate it

    Item Type Preprint
    Author Jillian Fisher
    Author Ruth E. Appel
    Author Chan Young Park
    Author Yujin Potter
    Author Liwei Jiang
    Author Taylor Sorensen
    Author Shangbin Feng
    Author Yulia Tsvetkov
    Author Margaret E. Roberts
    Author Jennifer Pan
    Author Dawn Song
    Author Yejin Choi
    Abstract AI systems often exhibit political bias, influencing users' opinions and decision-making. While political neutrality-defined as the absence of bias-is often seen as an ideal solution for fairness and safety, this position paper argues that true political neutrality is neither feasible nor universally desirable due to its subjective nature and the biases inherent in AI training data, algorithms, and user interactions. However, inspired by Joseph Raz's philosophical insight that "neutrality [...] can be a matter of degree" (Raz, 1986), we argue that striving for some neutrality remains essential for promoting balanced AI interactions and mitigating user manipulation. Therefore, we use the term "approximation" of political neutrality to shift the focus from unattainable absolutes to achievable, practical proxies. We propose eight techniques for approximating neutrality across three levels of conceptualizing AI, examining their trade-offs and implementation strategies. In addition, we explore two concrete applications of these approximations to illustrate their practicality. Finally, we assess our framework on current large language models (LLMs) at the output level, providing a demonstration of how it can be evaluated. This work seeks to advance nuanced discussions of political neutrality in AI and promote the development of responsible, aligned language models.
    Date 2025-02-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.05728
    Accessed 4/11/2025, 10:31:20 AM
    Extra arXiv:2503.05728 [cs]
    DOI 10.48550/arXiv.2503.05728
    Repository arXiv
    Archive ID arXiv:2503.05728
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Notes:

    • Comment: Code: https://github.com/jfisher52/Approximation_Political_Neutrality

    Attachments

    • Preprint PDF
    • Snapshot
  • Is a Good Foundation Necessary for Efficient Reinforcement Learning? The Computational Role of the Base Model in Exploration

    Item Type Preprint
    Author Dylan J. Foster
    Author Zakaria Mhammedi
    Author Dhruv Rohatgi
    Abstract Language model alignment (or, reinforcement learning) techniques that leverage active exploration -- deliberately encouraging the model to produce diverse, informative responses -- offer the promise of super-human capabilities. However, current understanding of algorithm design primitives for computationally efficient exploration with language models is limited. To better understand how to leverage access to powerful pre-trained generative models to improve the efficiency of exploration, we introduce a new computational framework for RL with language models, in which the learner interacts with the model through a sampling oracle. Focusing on the linear softmax model parameterization, we provide new results that reveal the computational-statistical tradeoffs of efficient exploration: 1. Necessity of coverage: Coverage refers to the extent to which the pre-trained model covers near-optimal responses -- a form of hidden knowledge. We show that coverage, while not necessary for data efficiency, lower bounds the runtime of any algorithm in our framework. 2. Inference-time exploration: We introduce a new algorithm, SpannerSampling, which obtains optimal data efficiency and is computationally efficient whenever the pre-trained model enjoys sufficient coverage, matching our lower bound. SpannerSampling leverages inference-time computation with the pre-trained model to reduce the effective search space for exploration. 3. Insufficiency of training-time interventions: We contrast the result above by showing that training-time interventions that produce proper policies cannot achieve similar guarantees in polynomial time. 4. Computational benefits of multi-turn exploration: Finally, we show that under additional representational assumptions, one can achieve improved runtime (replacing sequence-level coverage with token-level coverage) through multi-turn exploration.
    Date 2025-03-13
    Short Title Is a Good Foundation Necessary for Efficient Reinforcement Learning?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.07453
    Accessed 4/11/2025, 3:16:04 PM
    Extra arXiv:2503.07453 [cs]
    DOI 10.48550/arXiv.2503.07453
    Repository arXiv
    Archive ID arXiv:2503.07453
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Mathematics - Statistics Theory
    • Statistics - Statistics Theory

    Notes:

    • Comment: V2: Improved number of prompts used by Algorithm 1

    Attachments

    • Preprint PDF
    • Snapshot
  • Neural Interactive Proofs

    Item Type Preprint
    Author Lewis Hammond
    Author Sam Adam-Day
    Abstract We consider the problem of how a trusted, but computationally bounded agent (a 'verifier') can learn to interact with one or more powerful but untrusted agents ('provers') in order to solve a given task. More specifically, we study the case in which agents are represented using neural networks and refer to solutions of this problem as neural interactive proofs. First we introduce a unifying framework based on prover-verifier games, which generalises previously proposed interaction protocols. We then describe several new protocols for generating neural interactive proofs, and provide a theoretical comparison of both new and existing approaches. Finally, we support this theory with experiments in two domains: a toy graph isomorphism problem that illustrates the key ideas, and a code validation task using large language models. In so doing, we aim to create a foundation for future work on neural interactive proofs and their application in building safer AI systems.
    Date 2025-03-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.08897
    Accessed 3/25/2025, 9:19:45 PM
    Extra arXiv:2412.08897 [cs]
    DOI 10.48550/arXiv.2412.08897
    Repository arXiv
    Archive ID arXiv:2412.08897
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: ICLR'25 camera-ready version; 51 pages, 17 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Safety Cases: A Scalable Approach to Frontier AI Safety

    Item Type Preprint
    Author Benjamin Hilton
    Author Marie Davidsen Buhl
    Author Tomek Korbak
    Author Geoffrey Irving
    Abstract Safety cases - clear, assessable arguments for the safety of a system in a given context - are a widely-used technique across various industries for showing a decision-maker (e.g. boards, customers, third parties) that a system is safe. In this paper, we cover how and why frontier AI developers might also want to use safety cases. We then argue that writing and reviewing safety cases would substantially assist in the fulfilment of many of the Frontier AI Safety Commitments. Finally, we outline open research questions on the methodology, implementation, and technical details of safety cases.
    Date 2025-02-05
    Short Title Safety Cases
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.04744
    Accessed 3/25/2025, 9:34:29 PM
    Extra arXiv:2503.04744 [cs]
    DOI 10.48550/arXiv.2503.04744
    Repository arXiv
    Archive ID arXiv:2503.04744
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Notes:

    • Comment: 18 pages, 2 figures, 3 tables

    Attachments

    • Preprint PDF
    • Snapshot
  • A Sober Look at Progress in Language Model Reasoning: Pitfalls and Paths to Reproducibility

    Item Type Preprint
    Author Andreas Hochlehnert
    Author Hardik Bhatnagar
    Author Vishaal Udandarao
    Author Samuel Albanie
    Author Ameya Prabhu
    Author Matthias Bethge
    Abstract Reasoning has emerged as the next major frontier for language models (LMs), with rapid advances from both academic and industrial labs. However, this progress often outpaces methodological rigor, with many evaluations relying on benchmarking practices that lack transparency, robustness, or statistical grounding. In this work, we conduct a comprehensive empirical study and find that current mathematical reasoning benchmarks are highly sensitive to subtle implementation choices - including decoding parameters, random seeds, prompt formatting, and even hardware and software-framework configurations. Performance gains reported in recent studies frequently hinge on unclear comparisons or unreported sources of variance. To address these issues, we propose a standardized evaluation framework with clearly defined best practices and reporting standards. Using this framework, we reassess recent methods and find that reinforcement learning (RL) approaches yield only modest improvements - far below prior claims - and are prone to overfitting, especially on small-scale benchmarks like AIME24. In contrast, supervised finetuning (SFT) methods show consistently stronger generalization. To foster reproducibility, we release all code, prompts, and model outputs, for reasoning benchmarks, establishing more rigorous foundations for future work.
    Date 2025-04-09
    Short Title A Sober Look at Progress in Language Model Reasoning
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.07086
    Accessed 4/13/2025, 6:01:29 PM
    Extra arXiv:2504.07086 [cs]
    DOI 10.48550/arXiv.2504.07086
    Repository arXiv
    Archive ID arXiv:2504.07086
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Notes:

    • Comment: Technical Report

    Attachments

    • Preprint PDF
    • Snapshot
  • Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy

    Item Type Preprint
    Author Abe Bohan Hou
    Author Hongru Du
    Author Yichen Wang
    Author Jingyu Zhang
    Author Zixiao Wang
    Author Paul Pu Liang
    Author Daniel Khashabi
    Author Lauren Gardner
    Author Tianxing He
    Abstract Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents' attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development.
    Date 2025-03-16
    Short Title Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.09639
    Accessed 3/25/2025, 9:35:41 PM
    Extra arXiv:2503.09639 [cs]
    DOI 10.48550/arXiv.2503.09639
    Repository arXiv
    Archive ID arXiv:2503.09639
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Computer Science - Multiagent Systems

    Attachments

    • Preprint PDF
    • Snapshot
  • Large Language Models Are Better Logical Fallacy Reasoners with Counterargument, Explanation, and Goal-Aware Prompt Formulation

    Item Type Preprint
    Author Jiwon Jeong
    Author Hyeju Jang
    Author Hogun Park
    Abstract The advancement of Large Language Models (LLMs) has greatly improved our ability to process complex language. However, accurately detecting logical fallacies remains a significant challenge. This study presents a novel and effective prompt formulation approach for logical fallacy detection, applicable in both supervised (fine-tuned) and unsupervised (zero-shot) settings. Our method enriches input text incorporating implicit contextual information -- counterarguments, explanations, and goals -- which we query for validity within the context of the argument. We then rank these queries based on confidence scores to inform classification. We evaluate our approach across multiple datasets from 5 domains, covering 29 distinct fallacy types, using models from the GPT and LLaMA series. The results show substantial improvements over state-of-the-art models, with F1 score increases of up to 0.60 in zero-shot settings and up to 0.45 in fine-tuned models. Extensive analyses further illustrate why and how our method excels.
    Date 2025-03-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.23363
    Accessed 4/14/2025, 10:00:28 PM
    Extra arXiv:2503.23363 [cs]
    DOI 10.48550/arXiv.2503.23363
    Repository arXiv
    Archive ID arXiv:2503.23363
    Date Added 4/14/2025, 10:00:28 PM
    Modified 4/14/2025, 10:00:28 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted to NAACL 2025 Findings

    Attachments

    • Preprint PDF
    • Snapshot
  • Language Model Alignment in Multilingual Trolley Problems

    Item Type Preprint
    Author Zhijing Jin
    Author Max Kleiman-Weiner
    Author Giorgio Piatti
    Author Sydney Levine
    Author Jiarui Liu
    Author Fernando Gonzalez
    Author Francesco Ortu
    Author András Strausz
    Author Mrinmaya Sachan
    Author Rada Mihalcea
    Author Yejin Choi
    Author Bernhard Schölkopf
    Abstract We evaluate the moral alignment of large language models (LLMs) with human preferences in multilingual trolley problems. Building on the Moral Machine experiment, which captures over 40 million human judgments across 200+ countries, we develop a cross-lingual corpus of moral dilemma vignettes in over 100 languages called MultiTP. This dataset enables the assessment of LLMs' decision-making processes in diverse linguistic contexts. Our analysis explores the alignment of 19 different LLMs with human judgments, capturing preferences across six moral dimensions: species, gender, fitness, status, age, and the number of lives involved. By correlating these preferences with the demographic distribution of language speakers and examining the consistency of LLM responses to various prompt paraphrasings, our findings provide insights into cross-lingual and ethical biases of LLMs and their intersection. We discover significant variance in alignment across languages, challenging the assumption of uniform moral reasoning in AI systems and highlighting the importance of incorporating diverse perspectives in AI ethics. The results underscore the need for further research on the integration of multilingual dimensions in responsible AI research to ensure fair and equitable AI interactions worldwide. Our code and data are at https://github.com/causalNLP/moralmachine
    Date 2024-12-14
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2407.02273
    Accessed 4/14/2025, 10:06:43 PM
    Extra arXiv:2407.02273 [cs]
    DOI 10.48550/arXiv.2407.02273
    Repository arXiv
    Archive ID arXiv:2407.02273
    Date Added 4/14/2025, 10:06:43 PM
    Modified 4/14/2025, 10:06:43 PM

    Tags:

    • Computer Science - Computation and Language

    Notes:

    • Comment: Best Paper @ NeurIPS 2024 Workshop on Pluralistic Alignment

    Attachments

    • Preprint PDF
    • Snapshot
  • Large Language Models Pass the Turing Test

    Item Type Preprint
    Author Cameron R. Jones
    Author Benjamin K. Bergen
    Abstract We evaluated 4 systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants had 5 minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time -- not significantly more or less often than the humans they were being compared to -- while baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21% respectively). The results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test. The results have implications for debates about what kind of intelligence is exhibited by Large Language Models (LLMs), and the social and economic impacts these systems are likely to have.
    Date 2025-03-31
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.23674
    Accessed 4/11/2025, 1:10:04 PM
    Extra arXiv:2503.23674 [cs]
    DOI 10.48550/arXiv.2503.23674
    Repository arXiv
    Archive ID arXiv:2503.23674
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • Collaborating with AI Agents: Field Experiments on Teamwork, Productivity, and Performance

    Item Type Preprint
    Author Harang Ju
    Author Sinan Aral
    Abstract To uncover how AI agents change productivity, performance, and work processes, we introduce MindMeld: an experimentation platform enabling humans and AI agents to collaborate in integrative workspaces. In a large-scale marketing experiment on the platform, 2310 participants were randomly assigned to human-human and human-AI teams, with randomized AI personality traits. The teams exchanged 183,691 messages, and created 63,656 image edits, 1,960,095 ad copy edits, and 10,375 AI-generated images while producing 11,138 ads for a large think tank. Analysis of fine-grained communication, collaboration, and workflow logs revealed that collaborating with AI agents increased communication by 137% and allowed humans to focus 23% more on text and image content generation messaging and 20% less on direct text editing. Humans on Human-AI teams sent 23% fewer social messages, creating 60% greater productivity per worker and higher-quality ad copy. In contrast, human-human teams produced higher-quality images, suggesting that AI agents require fine-tuning for multimodal workflows. AI personality prompt randomization revealed that AI traits can complement human personalities to enhance collaboration. For example, conscientious humans paired with open AI agents improved image quality, while extroverted humans paired with conscientious AI agents reduced the quality of text, images, and clicks. In field tests of ad campaigns with ~5M impressions, ads with higher image quality produced by human collaborations and higher text quality produced by AI collaborations performed significantly better on click-through rate and cost per click metrics. Overall, ads created by human-AI teams performed similarly to those created by human-human teams. Together, these results suggest AI agents can improve teamwork and productivity, especially when tuned to complement human traits.
    Date 2025-03-23
    Short Title Collaborating with AI Agents
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.18238
    Accessed 3/26/2025, 2:58:26 PM
    Extra arXiv:2503.18238 [cs]
    DOI 10.48550/arXiv.2503.18238
    Repository arXiv
    Archive ID arXiv:2503.18238
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Notes:

    • Comment: 56 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

    Item Type Preprint
    Author Ariba Khan
    Author Stephen Casper
    Author Dylan Hadfield-Menell
    Abstract Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment through survey-based assessments that borrow from social science methodologies often overlook systematic robustness checks. Here, we identify and test three assumptions behind current survey-based evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current survey-based approaches to evaluating the cultural alignment of LLMs and highlight a need for systematic robustness checks and red-teaming for evaluation results. Data and code are available at https://huggingface.co/datasets/akhan02/cultural-dimension-cover-letters and https://github.com/ariba-k/llm-cultural-alignment-evaluation, respectively.
    Date 2025-04-08
    Short Title Randomness, Not Representation
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.08688
    Accessed 4/14/2025, 10:06:57 PM
    Extra arXiv:2503.08688 [cs]
    DOI 10.48550/arXiv.2503.08688
    Repository arXiv
    Archive ID arXiv:2503.08688
    Date Added 4/14/2025, 10:06:57 PM
    Modified 4/14/2025, 10:06:57 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • How to evaluate control measures for LLM agents? A trajectory from today to superintelligence

    Item Type Preprint
    Author Tomek Korbak
    Author Mikita Balesni
    Author Buck Shlegeris
    Author Geoffrey Irving
    Abstract As LLM agents grow more capable of causing harm autonomously, AI developers will rely on increasingly sophisticated control measures to prevent possibly misaligned agents from causing harm. AI developers could demonstrate that their control measures are sufficient by running control evaluations: testing exercises in which a red team produces agents that try to subvert control measures. To ensure control evaluations accurately capture misalignment risks, the affordances granted to this red team should be adapted to the capability profiles of the agents to be deployed under control measures. In this paper we propose a systematic framework for adapting affordances of red teams to advancing AI capabilities. Rather than assuming that agents will always execute the best attack strategies known to humans, we demonstrate how knowledge of an agents's actual capability profile can inform proportional control evaluations, resulting in more practical and cost-effective control measures. We illustrate our framework by considering a sequence of five fictional models (M1-M5) with progressively advanced capabilities, defining five distinct AI control levels (ACLs). For each ACL, we provide example rules for control evaluation, control measures, and safety cases that could be appropriate. Finally, we show why constructing a compelling AI control safety case for superintelligent LLM agents will require research breakthroughs, highlighting that we might eventually need alternative approaches to mitigating misalignment risk.
    Date 2025-04-07
    Short Title How to evaluate control measures for LLM agents?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.05259
    Accessed 4/14/2025, 10:04:50 PM
    Extra arXiv:2504.05259 [cs]
    DOI 10.48550/arXiv.2504.05259
    Repository arXiv
    Archive ID arXiv:2504.05259
    Date Added 4/14/2025, 10:04:50 PM
    Modified 4/14/2025, 10:04:50 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
  • Measuring AI Ability to Complete Long Tasks

    Item Type Preprint
    Author Thomas Kwa
    Author Ben West
    Author Joel Becker
    Author Amy Deng
    Author Katharyn Garcia
    Author Max Hasin
    Author Sami Jawhar
    Author Megan Kinniment
    Author Nate Rush
    Author Sydney Von Arx
    Author Ryan Bloom
    Author Thomas Broadley
    Author Haoxing Du
    Author Brian Goodrich
    Author Nikola Jurkovic
    Author Luke Harold Miles
    Author Seraphina Nix
    Author Tao Lin
    Author Neev Parikh
    Author David Rein
    Author Lucas Jun Koba Sato
    Author Hjalmar Wijk
    Author Daniel M. Ziegler
    Author Elizabeth Barnes
    Author Lawrence Chan
    Abstract Despite rapid progress on AI benchmarks, the real-world meaning of benchmark performance remains unclear. To quantify the capabilities of AI systems in terms of human capabilities, we propose a new metric: 50%-task-completion time horizon. This is the time humans typically take to complete tasks that AI models can complete with 50% success rate. We first timed humans with relevant domain expertise on a combination of RE-Bench, HCAST, and 66 novel shorter tasks. On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024. The increase in AI models' time horizons seems to be primarily driven by greater reliability and ability to adapt to mistakes, combined with better logical reasoning and tool use capabilities. We discuss the limitations of our results -- including their degree of external validity -- and the implications of increased autonomy for dangerous capabilities. If these results generalize to real-world software tasks, extrapolation of this trend predicts that within 5 years, AI systems will be capable of automating many software tasks that currently take humans a month.
    Date 2025-03-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.14499
    Accessed 3/25/2025, 6:00:08 PM
    Extra arXiv:2503.14499 [cs]
    DOI 10.48550/arXiv.2503.14499
    Repository arXiv
    Archive ID arXiv:2503.14499
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models

    Item Type Preprint
    Author Yuxuan Li
    Author Hirokazu Shirado
    Author Sauvik Das
    Abstract While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases.
    Date 2025-01-29
    Short Title Actions Speak Louder than Words
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.17420
    Accessed 4/14/2025, 10:00:42 PM
    Extra arXiv:2501.17420 [cs]
    DOI 10.48550/arXiv.2501.17420
    Repository arXiv
    Archive ID arXiv:2501.17420
    Date Added 4/14/2025, 10:00:42 PM
    Modified 4/14/2025, 10:00:42 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • Inference-Time Scaling for Generalist Reward Modeling

    Item Type Preprint
    Author Zijun Liu
    Author Peiyi Wang
    Author Runxin Xu
    Author Shirong Ma
    Author Chong Ruan
    Author Peng Li
    Author Yang Liu
    Author Yu Wu
    Abstract Reinforcement learning (RL) has been widely adopted in post-training for large language models (LLMs) at scale. Recently, the incentivization of reasoning capabilities in LLMs from RL indicates that $\textit{proper learning methods could enable effective inference-time scalability}$. A key challenge of RL is to obtain accurate reward signals for LLMs in various domains beyond verifiable questions or artificial rules. In this work, we investigate how to improve reward modeling (RM) with more inference compute for general queries, i.e. the $\textbf{inference-time scalability of generalist RM}$, and further, how to improve the effectiveness of performance-compute scaling with proper learning methods. For the RM approach, we adopt pointwise generative reward modeling (GRM) to enable flexibility for different input types and potential for inference-time scaling. For the learning method, we propose Self-Principled Critique Tuning (SPCT) to foster scalable reward generation behaviors in GRMs through online RL, to generate principles adaptively and critiques accurately, resulting in $\textbf{DeepSeek-GRM}$ models. Furthermore, for effective inference-time scaling, we use parallel sampling to expand compute usage, and introduce a meta RM to guide voting process for better scaling performance. Empirically, we show that SPCT significantly improves the quality and scalability of GRMs, outperforming existing methods and models in various RM benchmarks without severe biases, and could achieve better performance compared to training-time scaling. DeepSeek-GRM still meets challenges in some tasks, which we believe can be addressed by future efforts in generalist reward systems. The models will be released and open-sourced.
    Date 2025-04-05
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.02495
    Accessed 4/10/2025, 11:19:19 AM
    Extra arXiv:2504.02495 [cs]
    DOI 10.48550/arXiv.2504.02495
    Repository arXiv
    Archive ID arXiv:2504.02495
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Notes:

    • Comment: Preprint, under review. 42 pages

    Attachments

    • Preprint PDF
    • Snapshot
  • OLMoTrace: Tracing Language Model Outputs Back to Trillions of Training Tokens

    Item Type Preprint
    Author Jiacheng Liu
    Author Taylor Blanton
    Author Yanai Elazar
    Author Sewon Min
    Author YenSung Chen
    Author Arnavi Chheda-Kothary
    Author Huy Tran
    Author Byron Bischoff
    Author Eric Marsh
    Author Michael Schmitz
    Author Cassidy Trier
    Author Aaron Sarnat
    Author Jenna James
    Author Jon Borchardt
    Author Bailey Kuehl
    Author Evie Cheng
    Author Karen Farley
    Author Sruthi Sreeram
    Author Taira Anderson
    Author David Albright
    Author Carissa Schoenick
    Author Luca Soldaini
    Author Dirk Groeneveld
    Author Rock Yuren Pang
    Author Pang Wei Koh
    Author Noah A. Smith
    Author Sophie Lebrecht
    Author Yejin Choi
    Author Hannaneh Hajishirzi
    Author Ali Farhadi
    Author Jesse Dodge
    Abstract We present OLMoTrace, the first system that traces the outputs of language models back to their full, multi-trillion-token training data in real time. OLMoTrace finds and shows verbatim matches between segments of language model output and documents in the training text corpora. Powered by an extended version of infini-gram (Liu et al., 2024), our system returns tracing results within a few seconds. OLMoTrace can help users understand the behavior of language models through the lens of their training data. We showcase how it can be used to explore fact checking, hallucination, and the creativity of language models. OLMoTrace is publicly available and fully open-source.
    Date 2025-04-09
    Short Title OLMoTrace
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.07096
    Accessed 4/17/2025, 11:51:42 AM
    Extra arXiv:2504.07096 [cs]
    DOI 10.48550/arXiv.2504.07096
    Repository arXiv
    Archive ID arXiv:2504.07096
    Date Added 4/17/2025, 11:51:42 AM
    Modified 4/17/2025, 11:51:42 AM

    Tags:

    • Computer Science - Computation and Language

    Notes:

    • Comment: Under submission at ACL 2025 demo track

    Attachments

    • Preprint PDF
    • Snapshot
  • AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

    Item Type Preprint
    Author Xing Han Lù
    Author Amirhossein Kazemnejad
    Author Nicholas Meade
    Author Arkil Patel
    Author Dongchan Shin
    Author Alejandra Zambrano
    Author Karolina Stańczak
    Author Peter Shaw
    Author Christopher J. Pal
    Author Siva Reddy
    Abstract Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
    Date 2025-04-11
    Short Title AgentRewardBench
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.08942
    Accessed 4/17/2025, 11:42:57 AM
    Extra arXiv:2504.08942 [cs]
    DOI 10.48550/arXiv.2504.08942
    Repository arXiv
    Archive ID arXiv:2504.08942
    Date Added 4/17/2025, 11:42:57 AM
    Modified 4/17/2025, 11:42:57 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Superalignment with Dynamic Human Values

    Item Type Preprint
    Author Florian Mai
    Author David Kaczér
    Author Nicholas Kluge Corrêa
    Author Lucie Flek
    Abstract Two core challenges of alignment are 1) scalable oversight and 2) accounting for the dynamic nature of human values. While solutions like recursive reward modeling address 1), they do not simultaneously account for 2). We sketch a roadmap for a novel algorithmic framework that trains a superhuman reasoning model to decompose complex tasks into subtasks that are still amenable to human-level guidance. Our approach relies on what we call the part-to-complete generalization hypothesis, which states that the alignment of subtask solutions generalizes to the alignment of complete solutions. We advocate for the need to measure this generalization and propose ways to improve it in the future.
    Date 2025-03-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.13621
    Accessed 3/25/2025, 5:43:06 PM
    Extra arXiv:2503.13621 [cs]
    DOI 10.48550/arXiv.2503.13621
    Repository arXiv
    Archive ID arXiv:2503.13621
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Published at the ICLR 2025 Workshop on Bidirectional Human-AI Alignment (BiAlign)

    Attachments

    • Preprint PDF
    • Snapshot
  • Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?

    Item Type Preprint
    Author Alex Mallen
    Author Charlie Griffin
    Author Misha Wagner
    Author Alessandro Abate
    Author Buck Shlegeris
    Abstract An AI control protocol is a plan for usefully deploying AI systems that aims to prevent an AI from intentionally causing some unacceptable outcome. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (without shared memory between contexts). To do this, an AI system may need to reliably generate optimal plans in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating. We develop Subversion Strategy Eval, a suite of eight environments, covering a range of protocols and strategic capabilities, and six sets of affordances that help isolate individual capabilities. We implement the evaluation in Inspect-AI and release it open-source. We evaluate Claude 3.5 models, including helpful-only versions, as well as OpenAI reasoning models. None of the models demonstrate substantial capability in strategizing to subvert control protocols statelessly. However, providing models with additional affordances, such as the ability to share a plan between contexts, can substantially improve performance. We hope our evaluations can act as a leading indicator for when models are capable of subverting control protocols and also relax the worst-case assumption of perfect strategic ability in AI control evaluations.
    Date 2025-03-24
    Short Title Subversion Strategy Eval
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.12480
    Accessed 3/25/2025, 5:37:00 PM
    Extra arXiv:2412.12480 [cs]
    DOI 10.48550/arXiv.2412.12480
    Repository arXiv
    Archive ID arXiv:2412.12480
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • The Necessity of AI Audit Standards Boards

    Item Type Preprint
    Author David Manheim
    Author Sammy Martin
    Author Mark Bailey
    Author Mikhail Samin
    Author Ross Greutzmacher
    Abstract Auditing of AI systems is a promising way to understand and manage ethical problems and societal risks associated with contemporary AI systems, as well as some anticipated future risks. Efforts to develop standards for auditing Artificial Intelligence (AI) systems have therefore understandably gained momentum. However, we argue that creating auditing standards is not just insufficient, but actively harmful by proliferating unheeded and inconsistent standards, especially in light of the rapid evolution and ethical and safety challenges of AI. Instead, the paper proposes the establishment of an AI Audit Standards Board, responsible for developing and updating auditing methods and standards in line with the evolving nature of AI technologies. Such a body would ensure that auditing practices remain relevant, robust, and responsive to the rapid advancements in AI. The paper argues that such a governance structure would also be helpful for maintaining public trust in AI and for promoting a culture of safety and ethical responsibility within the AI industry. Throughout the paper, we draw parallels with other industries, including safety-critical industries like aviation and nuclear energy, as well as more prosaic ones such as financial accounting and pharmaceuticals. AI auditing should emulate those fields, and extend beyond technical assessments to include ethical considerations and stakeholder engagement, but we explain that this is not enough; emulating other fields' governance mechanisms for these processes, and for audit standards creation, is a necessity. We also emphasize the importance of auditing the entire development process of AI systems, not just the final products...
    Date 2024-04-11
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2404.13060
    Accessed 4/11/2025, 1:09:42 PM
    Extra arXiv:2404.13060 [cs]
    DOI 10.48550/arXiv.2404.13060
    Repository arXiv
    Archive ID arXiv:2404.13060
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
  • Artificial Intelligence Index Report 2025

    Item Type Journal Article
    Author Nestor Maslej
    Date 2025
    Language en
    Library Catalog Zotero
    Publication Artificial Intelligence
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • Details about METR’s preliminary evaluation of OpenAI’s o3 and o4-mini

    Item Type Web Page
    Author METR
    Abstract Resources for testing dangerous autonomous capabilities in frontier models
    Date 2025-04-16T00:00:00+00:00
    Language en
    URL https://metr.github.io/autonomy-evals-guide/openai-o3-report/
    Accessed 4/17/2025, 11:41:15 AM
    Website Title METR’s Autonomy Evaluation Resources
    Date Added 4/17/2025, 11:41:15 AM
    Modified 4/17/2025, 11:41:15 AM

    Attachments

    • Snapshot
  • ASI existential risk: reconsidering alignment as a goal

    Item Type Preprint
    Author Michael Nielson
    Date April 14th, 2025
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 10:05:28 PM
    Modified 4/14/2025, 10:06:07 PM

    Attachments

    • PDF
  • The Geometry of Categorical and Hierarchical Concepts in Large Language Models

    Item Type Preprint
    Author Kiho Park
    Author Yo Joong Choe
    Author Yibo Jiang
    Author Victor Veitch
    Abstract The linear representation hypothesis is the informal idea that semantic concepts are encoded as linear directions in the representation spaces of large language models (LLMs). Previous work has shown how to make this notion precise for representing binary concepts that have natural contrasts (e.g., {male, female}) as directions in representation space. However, many natural concepts do not have natural contrasts (e.g., whether the output is about an animal). In this work, we show how to extend the formalization of the linear representation hypothesis to represent features (e.g., is_animal) as vectors. This allows us to immediately formalize the representation of categorical concepts as polytopes in the representation space. Further, we use the formalization to prove a relationship between the hierarchical structure of concepts and the geometry of their representations. We validate these theoretical results on the Gemma and LLaMA-3 large language models, estimating representations for 900+ hierarchically related concepts using data from WordNet.
    Date 2025-02-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.01506
    Accessed 4/11/2025, 1:09:50 PM
    Extra arXiv:2406.01506 [cs]
    DOI 10.48550/arXiv.2406.01506
    Repository arXiv
    Archive ID arXiv:2406.01506
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Notes:

    • Comment: Accepted for an oral presentation at ICLR 2025. Best Paper Award at the ICML 2024 Workshop on Mechanistic Interpretability. Code is available at https://github.com/KihoPark/LLM_Categorical_Hierarchical_Representations

    Attachments

    • Preprint PDF
    • Snapshot
  • Investigating Affective Use and Emotional Well-being on ChatGPT

    Item Type Journal Article
    Author Jason Phang
    Author Michael Lampe
    Author Lama Ahmad
    Author Sandhini Agarwal
    Author Cathy Mengying Fang
    Author Auren R Liu
    Author Valdemar Danry
    Author Eunhae Lee
    Author Samantha W T Chan
    Author Pat Pataranutaporn
    Author Pattie Maes
    Abstract As AI chatbots see increased adoption and integration into everyday life, questions have been raised about the potential impact of human-like or anthropomorphic AI on users. In this work, we investigate the extent to which interactions with ChatGPT (with a focus on Advanced Voice Mode) may impact users’ emotional well-being, behaviors and experiences through two parallel studies. To study the affective use of AI chatbots, we perform large-scale automated analysis of ChatGPT platform usage in a privacy-preserving manner, analyzing over 4 million conversations for affective cues and surveying over 4,000 users on their perceptions of ChatGPT. To investigate whether there is a relationship between model usage and emotional well-being, we conduct an Institutional Review Board (IRB)-approved randomized controlled trial (RCT) on close to 1,000 participants over 28 days, examining changes in their emotional well-being as they interact with ChatGPT under different experimental settings. In both on-platform data analysis and the RCT, we observe that very high usage correlates with increased self-reported indicators of dependence. From our RCT, we find that the impact of voice-based interactions on emotional well-being to be highly nuanced, and influenced by factors such as the user’s initial emotional state and total usage duration. Overall, our analysis reveals that a small number of users are responsible for a disproportionate share of the most affective cues.
    Language en
    Library Catalog Zotero
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Attachments

    • PDF
  • An Approach to Technical AGI Safety and Security

    Item Type Journal Article
    Author Rohin Shah
    Author Alex Irpan
    Author Alexander Matt Turner
    Author Anna Wang
    Author Arthur Conmy
    Author David Lindner
    Author Jonah Brown-Cohen
    Author Lewis Ho
    Author Neel Nanda
    Author Raluca Ada Popa
    Author Rishub Jain
    Author Rory Greig
    Author Scott Emmons
    Author Sebastian Farquhar
    Author Sébastien Krier
    Author Senthooran Rajamanoharan
    Author Sophie Bridgers
    Author Tobi Ijitoye
    Author Tom Everitt
    Author Victoria Krakovna
    Author Vikrant Varma
    Author Vladimir Mikulik
    Author Zachary Kenton
    Author Dave Orr
    Author Shane Legg
    Author Noah Goodman
    Author Allan Dafoe
    Author Four Flynn
    Author Anca Dragan
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • Palatable Conceptions of Disembodied Being: Terra Incognita in the Space of Possible Minds

    Item Type Preprint
    Author Murray Shanahan
    Abstract Is it possible to articulate a conception of consciousness that is compatible with the exotic characteristics of contemporary, disembodied AI systems, and that can stand up to philosophical scrutiny? How would subjective time and selfhood show up for an entity that conformed to such a conception? Trying to answer these questions, even metaphorically, stretches the language of consciousness to breaking point. Ultimately, the attempt yields something like emptiness, in the Buddhist sense, and helps to undermine our dualistic inclinations towards subjectivity and selfhood.
    Date 2025-03-20
    Short Title Palatable Conceptions of Disembodied Being
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.16348
    Accessed 3/25/2025, 5:46:40 PM
    Extra arXiv:2503.16348 [cs]
    DOI 10.48550/arXiv.2503.16348
    Repository arXiv
    Archive ID arXiv:2503.16348
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Steering off Course: Reliability Challenges in Steering Language Models

    Item Type Preprint
    Author Patrick Queiroz Da Silva
    Author Hari Sethuraman
    Author Dheeraj Rajagopal
    Author Hannaneh Hajishirzi
    Author Sachin Kumar
    Abstract Steering methods for language models (LMs) have gained traction as lightweight alternatives to fine-tuning, enabling targeted modifications to model activations. However, prior studies primarily report results on a few models, leaving critical gaps in understanding the robustness of these methods. In this work, we systematically examine three prominent steering methods -- DoLa, function vectors, and task vectors. In contrast to the original studies, which evaluated a handful of models, we test up to 36 models belonging to 14 families with sizes ranging from 1.5B to 70B parameters. Our experiments reveal substantial variability in the effectiveness of the steering approaches, with a large number of models showing no improvement and at times degradation in steering performance. Our analysis demonstrate fundamental flaws in the assumptions underlying these methods, challenging their reliability as scalable steering solutions.
    Date 2025-04-06
    Short Title Steering off Course
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.04635
    Accessed 4/9/2025, 11:09:20 PM
    Extra arXiv:2504.04635 [cs]
    DOI 10.48550/arXiv.2504.04635
    Repository arXiv
    Archive ID arXiv:2504.04635
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)

    Item Type Journal Article
    Author Lewis Smith
    Author Senthooran Rajamanoharan
    Author Arthur Conmy
    Author CallumMcDougall
    Author Tom Lieberum
    Author János Kramár
    Author Rohin Shah
    Author Neel Nanda
    Abstract TL;DR To validate whether SAEs were a worthwhile technique, we explored whether they were useful on the downstream task of OOD generalisation when detecting harmful intent in user prompts Negative result: SAEs underperformed linear probes Corollary: Linear probes are actually really good and cheap and perform great As a result of this and parallel work, we are deprioritising fundamental SAE research for the moment and exploring other directions, though SAEs will remain a tool in our toolkit We do not think that SAEs are useless or that no one should work on them, but we also do not think that SAEs will be a game-changer for interpretability, and speculate that the field is over-invested in them. Training SAEs specialised for chat data closed about half the gap but was still worse than linear probes We tried several ways to train chat SAEs, all did about as well. By default, we recommend taking an SAE on pretraining data and finetuning it on a bit of chat data Other results: We found SAEs fairly helpful for debugging low quality datasets (noticing spurious correlations) We present a variant of JumpReLU with an alternative sparsity penalty to get rid of high-frequency latents We argue that a standard auto-interp approach of computing the average interpretability of a uniformly sampled SAE latent can be misleading as it doesn’t penalise models which have high frequency, but not very interpretable, latents, and explore weighting the interpretability score by latent frequency.
    Date 2025-03-26
    Language en
    Library Catalog www.alignmentforum.org
    URL https://www.alignmentforum.org/posts/4uXCAJNuPKtKBsi28/negative-results-for-saes-on-downstream-tasks
    Accessed 4/11/2025, 12:27:49 PM
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • Snapshot
  • Value Profiles for Encoding Human Variation

    Item Type Preprint
    Author Taylor Sorensen
    Author Pushkar Mishra
    Author Roma Patel
    Author Michael Henry Tessler
    Author Michiel Bakker
    Author Georgina Evans
    Author Iason Gabriel
    Author Noah Goodman
    Author Verena Rieser
    Abstract Modelling human variation in rating tasks is crucial for enabling AI systems for personalization, pluralistic model alignment, and computational social science. We propose representing individuals using value profiles -- natural language descriptions of underlying values compressed from in-context demonstrations -- along with a steerable decoder model to estimate ratings conditioned on a value profile or other rater information. To measure the predictive information in rater representations, we introduce an information-theoretic methodology. We find that demonstrations contain the most information, followed by value profiles and then demographics. However, value profiles offer advantages in terms of scrutability, interpretability, and steerability due to their compressed natural language format. Value profiles effectively compress the useful information from demonstrations (>70% information preservation). Furthermore, clustering value profiles to identify similarly behaving individuals better explains rater variation than the most predictive demographic groupings. Going beyond test set performance, we show that the decoder models interpretably change ratings according to semantic profile differences, are well-calibrated, and can help explain instance-level disagreement by simulating an annotator population. These results demonstrate that value profiles offer novel, predictive ways to describe individual variation beyond demographics or group information.
    Date 2025-03-19
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.15484
    Accessed 3/25/2025, 5:59:26 PM
    Extra arXiv:2503.15484 [cs]
    DOI 10.48550/arXiv.2503.15484
    Repository arXiv
    Archive ID arXiv:2503.15484
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • PaperBench: Evaluating AI's Ability to Replicate AI Research

    Item Type Preprint
    Author Giulio Starace
    Author Oliver Jaffe
    Author Dane Sherburn
    Author James Aung
    Author Jun Shern Chan
    Author Leon Maksin
    Author Rachel Dias
    Author Evan Mays
    Author Benjamin Kinsella
    Author Wyatt Thompson
    Author Johannes Heidecke
    Author Amelia Glaese
    Author Tejal Patwardhan
    Abstract We introduce PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch, including understanding paper contributions, developing a codebase, and successfully executing experiments. For objective evaluation, we develop rubrics that hierarchically decompose each replication task into smaller sub-tasks with clear grading criteria. In total, PaperBench contains 8,316 individually gradable tasks. Rubrics are co-developed with the author(s) of each ICML paper for accuracy and realism. To enable scalable evaluation, we also develop an LLM-based judge to automatically grade replication attempts against rubrics, and assess our judge's performance by creating a separate benchmark for judges. We evaluate several frontier models on PaperBench, finding that the best-performing tested agent, Claude 3.5 Sonnet (New) with open-source scaffolding, achieves an average replication score of 21.0%. Finally, we recruit top ML PhDs to attempt a subset of PaperBench, finding that models do not yet outperform the human baseline. We open-source our code (https://github.com/openai/preparedness) to facilitate future research in understanding the AI engineering capabilities of AI agents.
    Date 2025-04-07
    Short Title PaperBench
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.01848
    Accessed 4/11/2025, 12:29:38 PM
    Extra arXiv:2504.01848 [cs]
    DOI 10.48550/arXiv.2504.01848
    Repository arXiv
    Archive ID arXiv:2504.01848
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 30 pages, 14 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains

    Item Type Preprint
    Author Yi Su
    Author Dian Yu
    Author Linfeng Song
    Author Juntao Li
    Author Haitao Mi
    Author Zhaopeng Tu
    Author Min Zhang
    Author Dong Yu
    Abstract Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.
    Date 2025-04-01
    Short Title Crossing the Reward Bridge
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.23829
    Accessed 4/11/2025, 5:31:32 PM
    Extra arXiv:2503.23829 [cs]
    DOI 10.48550/arXiv.2503.23829
    Repository arXiv
    Archive ID arXiv:2503.23829
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

    Item Type Preprint
    Author Miguel Suau
    Author Matthijs T. J. Spaan
    Author Frans A. Oliehoek
    Abstract Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
    Date 2024-06-24
    Short Title Bad Habits
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2306.02419
    Accessed 4/11/2025, 1:10:00 PM
    Extra arXiv:2306.02419 [cs]
    DOI 10.48550/arXiv.2306.02419
    Repository arXiv
    Archive ID arXiv:2306.02419
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Gemma 3 Technical Report

    Item Type Preprint
    Author Gemma Team
    Author Aishwarya Kamath
    Author Johan Ferret
    Author Shreya Pathak
    Author Nino Vieillard
    Author Ramona Merhej
    Author Sarah Perrin
    Author Tatiana Matejovicova
    Author Alexandre Ramé
    Author Morgane Rivière
    Author Louis Rouillard
    Author Thomas Mesnard
    Author Geoffrey Cideron
    Author Jean-bastien Grill
    Author Sabela Ramos
    Author Edouard Yvinec
    Author Michelle Casbon
    Author Etienne Pot
    Author Ivo Penchev
    Author Gaël Liu
    Author Francesco Visin
    Author Kathleen Kenealy
    Author Lucas Beyer
    Author Xiaohai Zhai
    Author Anton Tsitsulin
    Author Robert Busa-Fekete
    Author Alex Feng
    Author Noveen Sachdeva
    Author Benjamin Coleman
    Author Yi Gao
    Author Basil Mustafa
    Author Iain Barr
    Author Emilio Parisotto
    Author David Tian
    Author Matan Eyal
    Author Colin Cherry
    Author Jan-Thorsten Peter
    Author Danila Sinopalnikov
    Author Surya Bhupatiraju
    Author Rishabh Agarwal
    Author Mehran Kazemi
    Author Dan Malkin
    Author Ravin Kumar
    Author David Vilar
    Author Idan Brusilovsky
    Author Jiaming Luo
    Author Andreas Steiner
    Author Abe Friesen
    Author Abhanshu Sharma
    Author Abheesht Sharma
    Author Adi Mayrav Gilady
    Author Adrian Goedeckemeyer
    Author Alaa Saade
    Author Alex Feng
    Author Alexander Kolesnikov
    Author Alexei Bendebury
    Author Alvin Abdagic
    Author Amit Vadi
    Author András György
    Author André Susano Pinto
    Author Anil Das
    Author Ankur Bapna
    Author Antoine Miech
    Author Antoine Yang
    Author Antonia Paterson
    Author Ashish Shenoy
    Author Ayan Chakrabarti
    Author Bilal Piot
    Author Bo Wu
    Author Bobak Shahriari
    Author Bryce Petrini
    Author Charlie Chen
    Author Charline Le Lan
    Author Christopher A. Choquette-Choo
    Author C. J. Carey
    Author Cormac Brick
    Author Daniel Deutsch
    Author Danielle Eisenbud
    Author Dee Cattle
    Author Derek Cheng
    Author Dimitris Paparas
    Author Divyashree Shivakumar Sreepathihalli
    Author Doug Reid
    Author Dustin Tran
    Author Dustin Zelle
    Author Eric Noland
    Author Erwin Huizenga
    Author Eugene Kharitonov
    Author Frederick Liu
    Author Gagik Amirkhanyan
    Author Glenn Cameron
    Author Hadi Hashemi
    Author Hanna Klimczak-Plucińska
    Author Harman Singh
    Author Harsh Mehta
    Author Harshal Tushar Lehri
    Author Hussein Hazimeh
    Author Ian Ballantyne
    Author Idan Szpektor
    Author Ivan Nardini
    Author Jean Pouget-Abadie
    Author Jetha Chan
    Author Joe Stanton
    Author John Wieting
    Author Jonathan Lai
    Author Jordi Orbay
    Author Joseph Fernandez
    Author Josh Newlan
    Author Ju-yeong Ji
    Author Jyotinder Singh
    Author Kat Black
    Author Kathy Yu
    Author Kevin Hui
    Author Kiran Vodrahalli
    Author Klaus Greff
    Author Linhai Qiu
    Author Marcella Valentine
    Author Marina Coelho
    Author Marvin Ritter
    Author Matt Hoffman
    Author Matthew Watson
    Author Mayank Chaturvedi
    Author Michael Moynihan
    Author Min Ma
    Author Nabila Babar
    Author Natasha Noy
    Author Nathan Byrd
    Author Nick Roy
    Author Nikola Momchev
    Author Nilay Chauhan
    Author Noveen Sachdeva
    Author Oskar Bunyan
    Author Pankil Botarda
    Author Paul Caron
    Author Paul Kishan Rubenstein
    Author Phil Culliton
    Author Philipp Schmid
    Author Pier Giuseppe Sessa
    Author Pingmei Xu
    Author Piotr Stanczyk
    Author Pouya Tafti
    Author Rakesh Shivanna
    Author Renjie Wu
    Author Renke Pan
    Author Reza Rokni
    Author Rob Willoughby
    Author Rohith Vallu
    Author Ryan Mullins
    Author Sammy Jerome
    Author Sara Smoot
    Author Sertan Girgin
    Author Shariq Iqbal
    Author Shashir Reddy
    Author Shruti Sheth
    Author Siim Põder
    Author Sijal Bhatnagar
    Author Sindhu Raghuram Panyam
    Author Sivan Eiger
    Author Susan Zhang
    Author Tianqi Liu
    Author Trevor Yacovone
    Author Tyler Liechty
    Author Uday Kalra
    Author Utku Evci
    Author Vedant Misra
    Author Vincent Roseberry
    Author Vlad Feinberg
    Author Vlad Kolesnikov
    Author Woohyun Han
    Author Woosuk Kwon
    Author Xi Chen
    Author Yinlam Chow
    Author Yuvein Zhu
    Author Zichuan Wei
    Author Zoltan Egyed
    Author Victor Cotruta
    Author Minh Giang
    Author Phoebe Kirk
    Author Anand Rao
    Author Kat Black
    Author Nabila Babar
    Author Jessica Lo
    Author Erica Moreira
    Author Luiz Gustavo Martins
    Author Omar Sanseviero
    Author Lucas Gonzalez
    Author Zach Gleicher
    Author Tris Warkentin
    Author Vahab Mirrokni
    Author Evan Senter
    Author Eli Collins
    Author Joelle Barral
    Author Zoubin Ghahramani
    Author Raia Hadsell
    Author Yossi Matias
    Author D. Sculley
    Author Slav Petrov
    Author Noah Fiedel
    Author Noam Shazeer
    Author Oriol Vinyals
    Author Jeff Dean
    Author Demis Hassabis
    Author Koray Kavukcuoglu
    Author Clement Farabet
    Author Elena Buchatskaya
    Author Jean-Baptiste Alayrac
    Author Rohan Anil
    Author Dmitry
    Author Lepikhin
    Author Sebastian Borgeaud
    Author Olivier Bachem
    Author Armand Joulin
    Author Alek Andreev
    Author Cassidy Hardin
    Author Robert Dadashi
    Author Léonard Hussenot
    Abstract We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 to 27 billion parameters. This version introduces vision understanding abilities, a wider coverage of languages and longer context - at least 128K tokens. We also change the architecture of the model to reduce the KV-cache memory that tends to explode with long context. This is achieved by increasing the ratio of local to global attention layers, and keeping the span on local attention short. The Gemma 3 models are trained with distillation and achieve superior performance to Gemma 2 for both pre-trained and instruction finetuned versions. In particular, our novel post-training recipe significantly improves the math, chat, instruction-following and multilingual abilities, making Gemma3-4B-IT competitive with Gemma2-27B-IT and Gemma3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all our models to the community.
    Date 2025-03-25
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.19786
    Accessed 4/11/2025, 12:33:53 PM
    Extra arXiv:2503.19786 [cs]
    DOI 10.48550/arXiv.2503.19786
    Repository arXiv
    Archive ID arXiv:2503.19786
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • AI Governance through Markets

    Item Type Preprint
    Author Philip Moreira Tomei
    Author Rupal Jain
    Author Matija Franklin
    Abstract This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance.
    Date 2025-03-05
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.17755
    Accessed 4/11/2025, 1:09:46 PM
    Extra arXiv:2501.17755 [econ]
    DOI 10.48550/arXiv.2501.17755
    Repository arXiv
    Archive ID arXiv:2501.17755
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Economics - General Economics
    • Quantitative Finance - Economics

    Attachments

    • Full Text PDF
    • Snapshot
  • An Illusion of Progress? Assessing the Current State of Web Agents

    Item Type Preprint
    Author Tianci Xue
    Author Weijian Qi
    Author Tianneng Shi
    Author Chan Hee Song
    Author Boyu Gou
    Author Dawn Song
    Author Huan Sun
    Author Yu Su
    Abstract As digitalization and cloud technologies evolve, the web is becoming increasingly important in the modern society. Autonomous web agents based on large language models (LLMs) hold a great potential in work automation. It is therefore important to accurately measure and monitor the progression of their capabilities. In this work, we conduct a comprehensive and rigorous assessment of the current state of web agents. Our results depict a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. This gap can be attributed to shortcomings in existing benchmarks. We introduce Online-Mind2Web, an online evaluation benchmark consisting of 300 diverse and realistic tasks spanning 136 websites. It enables us to evaluate web agents under a setting that approximates how real users use these agents. To facilitate more scalable evaluation and development, we also develop a novel LLM-as-a-Judge automatic evaluation method and show that it can achieve around 85% agreement with human judgment, substantially higher than existing methods. Finally, we present the first comprehensive comparative analysis of current web agents, highlighting both their strengths and limitations to inspire future research.
    Date 2025-04-02
    Short Title An Illusion of Progress?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2504.01382
    Accessed 4/11/2025, 5:31:52 PM
    Extra arXiv:2504.01382 [cs]
    DOI 10.48550/arXiv.2504.01382
    Repository arXiv
    Archive ID arXiv:2504.01382
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: 22 pages, 16 figures, 4 tables

    Attachments

    • Preprint PDF
    • Snapshot
  • The AI Sc ientist-v2: Workshop-Level Automated Scientific Discovery via Agentic Tree Search

    Item Type Journal Article
    Author Yutaro Yamada
    Author Robert Tjarko Lange
    Author Cong Lu
    Author Shengran Hu
    Author Chris Lu
    Author Jakob Foerster
    Author David Ha
    Language en
    Library Catalog Zotero
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • PDF
  • Survey on Evaluation of LLM-based Agents

    Item Type Preprint
    Author Asaf Yehudai
    Author Lilach Eden
    Author Alan Li
    Author Guy Uziel
    Author Yilun Zhao
    Author Roy Bar-Haim
    Author Arman Cohan
    Author Michal Shmueli-Scheuer
    Abstract The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address-particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained, and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, reveals the emerging trends in the field, identifies current limitations, and proposes directions for future research.
    Date 2025-03-20
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.16416
    Accessed 3/25/2025, 5:38:55 PM
    Extra arXiv:2503.16416 [cs]
    DOI 10.48550/arXiv.2503.16416
    Repository arXiv
    Archive ID arXiv:2503.16416
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
  • On the Biology of a Large Language Model

    Item Type Web Page
    Author Authors Affiliations Published Not published yet DOI No DOI yet
    Abstract We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic's lightweight production model — in a variety of contexts, using our circuit tracing methodology.
    Language en
    URL https://transformer-circuits.pub/2025/attribution-graphs/biology.html
    Accessed 4/11/2025, 12:35:22 PM
    Website Title Transformer Circuits
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • Snapshot
  • Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification

    Item Type Preprint
    Author Eric Zhao
    Author Pranjal Awasthi
    Author Sreenivas Gollapudi
    Abstract Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one -- typically by having models self-verify each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation of sampling-based search, using only random sampling and direct self-verification, provides a practical inference method that, for example, elevates the reasoning capabilities of Gemini v1.5 Pro above that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves self-verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts -- chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
    Date 2025-02-20
    Short Title Sample, Scrutinize and Scale
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.01839
    Accessed 3/25/2025, 9:28:34 PM
    Extra arXiv:2502.01839 [cs]
    DOI 10.48550/arXiv.2502.01839
    Repository arXiv
    Archive ID arXiv:2502.01839
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

    Item Type Preprint
    Author Yuxuan Zhu
    Author Antony Kellermann
    Author Dylan Bowman
    Author Philip Li
    Author Akul Gupta
    Author Adarsh Danda
    Author Richard Fang
    Author Conner Jensen
    Author Eric Ihli
    Author Jason Benn
    Author Jet Geronimo
    Author Avi Dhir
    Author Sudhit Rao
    Author Kaicheng Yu
    Author Twm Stone
    Author Daniel Kang
    Abstract Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.
    Date 2025-04-01
    Short Title CVE-Bench
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.17332
    Accessed 4/11/2025, 1:09:54 PM
    Extra arXiv:2503.17332 [cs]
    DOI 10.48550/arXiv.2503.17332
    Repository arXiv
    Archive ID arXiv:2503.17332
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Notes:

    • Comment: 15 pages, 4 figures, 5 tables

    Attachments

    • Preprint PDF
    • Snapshot
  • GR00T_1_Whitepaper.pdf

    Item Type Attachment
    URL https://d1qx31qr3h6wln.cloudfront.net/publications/GR00T_1_Whitepaper.pdf
    Accessed 3/25/2025, 9:20:45 PM
    Date Added 4/2/2025, 2:13:04 PM
    Modified 4/2/2025, 2:13:04 PM
  • [external] a long list of open problems and concrete projects in evals

    Item Type Attachment
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM
  • Introducing Docent

    Item Type Web Page
    URL https://transluce.org/introducing-docent
    Accessed 4/11/2025, 12:31:23 PM
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • Snapshot
  • Report-Technologist-Economist-Disconnect-ARI04072025-1.pdf

    Item Type Attachment
    URL https://ari.us/wp-content/uploads/2025/04/Report-Technologist-Economist-Disconnect-ARI04072025-1.pdf
    Accessed 4/9/2025, 11:27:27 PM
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM
  • Stripe Press — The Scaling Era

    Item Type Web Page
    Abstract An inside view of the AI revolution, from the people and companies making it happen.
    Language en-US
    URL https://stripe.com/scaling
    Accessed 4/11/2025, 2:47:51 PM
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • Snapshot
  • Taking a responsible path to AGI - Google DeepMind

    Item Type Web Page
    URL https://deepmind.google/discover/blog/taking-a-responsible-path-to-agi/
    Accessed 4/11/2025, 2:46:31 PM
    Date Added 4/14/2025, 9:44:59 PM
    Modified 4/14/2025, 9:44:59 PM

    Attachments

    • Taking a responsible path to AGI - Google DeepMind
  • Investigating truthfulness in a pre-release o3 model | Transluce AI

    Item Type Web Page
    URL https://transluce.org/investigating-o3-truthfulness
    Accessed 4/17/2025, 11:40:51 AM
    Date Added 4/17/2025, 11:40:51 AM
    Modified 4/17/2025, 11:40:51 AM

    Attachments

    • Investigating truthfulness in a pre-release o3 model | Transluce AI