• Learning from Neighbours

    Item Type Journal Article
    Author Venkatesh Bala
    Author Sanjeev Goyal
    Abstract When payoffs from different actions are unknown, agents use their own past experience as well as the experience of their neighbours to guide their decision making. In this paper, we develop a general framework to study the relationship between the structure of these neighbourhoods and the process of social learning. We show that, in a connected society, local learning ensures that all agents obtain the same payoffs in the long run. Thus, if actions have different payoffs, then all agents choose the same action, and social conformism obtains. We develop conditions on the distribution of prior beliefs, the structure of neighbourhoods and the informativeness of actions under which this action is optimal. In particular, we identify a property of neighbourhood structures, local independence, which greatly facilitates social learning. Simulations of the model generate spatial and temporal patterns of adoption that are consistent with empirical work.
    Date 1998
    Library Catalog JSTOR
    URL https://www.jstor.org/stable/2566940
    Accessed 7/11/2025, 10:16:19 AM
    Extra Publisher: [Oxford University Press, Review of Economic Studies, Ltd.]
    Volume 65
    Pages 595-621
    Publication The Review of Economic Studies
    Issue 3
    ISSN 0034-6527
    Date Added 7/11/2025, 10:16:19 AM
    Modified 7/11/2025, 10:16:19 AM

    Attachments

    • JSTOR Full Text PDF
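The abstract above mentions simulations of neighbourhood-based learning. Below is a minimal illustrative sketch, not Bala and Goyal's model: agents on a ring choose between two actions with unknown success probabilities and update crude empirical beliefs from their own outcomes and those of their immediate neighbours. The ring size, payoff probabilities, and greedy choice rule are all assumptions for illustration.

```python
# Illustrative sketch only: local observational learning on a ring network.
# The network, payoffs, and greedy update rule are assumptions, not the paper's model.
import random

N_AGENTS, N_ROUNDS = 30, 200
TRUE_P = {"A": 0.4, "B": 0.6}          # unknown success probabilities of two actions

# Each agent tracks successes/trials per action (a crude empirical belief).
beliefs = [{a: [1, 2] for a in TRUE_P} for _ in range(N_AGENTS)]  # [successes, trials]

def best_action(b):
    return max(b, key=lambda a: b[a][0] / b[a][1])

random.seed(0)
for _ in range(N_ROUNDS):
    outcomes = []
    for i in range(N_AGENTS):
        a = best_action(beliefs[i])
        outcomes.append((a, random.random() < TRUE_P[a]))
    # Agents observe their own outcome and their two ring neighbours' outcomes.
    for i in range(N_AGENTS):
        for j in (i - 1, i, (i + 1) % N_AGENTS):
            a, success = outcomes[j]
            beliefs[i][a][0] += int(success)
            beliefs[i][a][1] += 1

chosen = [best_action(b) for b in beliefs]
print("fraction choosing B:", chosen.count("B") / N_AGENTS)
```

With these toy numbers the population tends toward the higher-payoff action B, echoing the long-run conformity result described in the abstract.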
  • A foundation model to predict and capture human cognition

    Item Type Journal Article
    Author Marcel Binz
    Author Elif Akata
    Author Matthias Bethge
    Author Franziska Brändle
    Author Fred Callaway
    Author Julian Coda-Forno
    Author Peter Dayan
    Author Can Demircan
    Author Maria K. Eckstein
    Author Noémi Éltető
    Author Thomas L. Griffiths
    Author Susanne Haridi
    Author Akshay K. Jagadish
    Author Li Ji-An
    Author Alexander Kipnis
    Author Sreejan Kumar
    Author Tobias Ludwig
    Author Marvin Mathony
    Author Marcelo Mattar
    Author Alireza Modirshanechi
    Author Surabhi S. Nath
    Author Joshua C. Peterson
    Author Milena Rmus
    Author Evan M. Russek
    Author Tankred Saanum
    Author Johannes A. Schubert
    Author Luca M. Schulze Buschoff
    Author Nishad Singhi
    Author Xin Sui
    Author Mirko Thalmann
    Author Fabian J. Theis
    Author Vuong Truong
    Author Vishaal Udandarao
    Author Konstantinos Voudouris
    Author Robert Wilson
    Author Kristin Witte
    Author Shuchen Wu
    Author Dirk U. Wulff
    Author Huadong Xiong
    Author Eric Schulz
    Abstract Establishing a unified theory of cognition has been an important goal in psychology [1,2]. A first step towards such a theory is to create a computational model that can predict human behaviour in a wide range of settings. Here we introduce Centaur, a computational model that can predict and simulate human behaviour in any experiment expressible in natural language. We derived Centaur by fine-tuning a state-of-the-art language model on a large-scale dataset called Psych-101. Psych-101 has an unprecedented scale, covering trial-by-trial data from more than 60,000 participants performing in excess of 10,000,000 choices in 160 experiments. Centaur not only captures the behaviour of held-out participants better than existing cognitive models, but it also generalizes to previously unseen cover stories, structural task modifications and entirely new domains. Furthermore, the model’s internal representations become more aligned with human neural activity after fine-tuning. Taken together, our results demonstrate that it is possible to discover computational models that capture human behaviour across a wide range of domains. We believe that such models provide tremendous potential for guiding the development of cognitive theories, and we present a case study to demonstrate this.
    Date 2025-07-02
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s41586-025-09215-4
    Accessed 7/11/2025, 9:26:47 AM
    Rights 2025 The Author(s)
    Extra Publisher: Nature Publishing Group
    Pages 1-8
    Publication Nature
    DOI 10.1038/s41586-025-09215-4
    ISSN 1476-4687
    Date Added 7/11/2025, 9:26:47 AM
    Modified 7/11/2025, 9:26:47 AM

    Tags:

    • Computational science
    • Human behaviour
    • Neuroscience
  • Examining Identity Drift in Conversations of LLM Agents

    Item Type Preprint
    Author Junhyuk Choi
    Author Yeseon Hong
    Author Minju Kim
    Author Bugeun Kim
    Abstract Large Language Models (LLMs) show impressive conversational abilities but sometimes exhibit identity drift, where their interaction patterns or styles change over time. As this problem has not yet been thoroughly examined, this study examines identity consistency across nine LLMs. Specifically, we (1) investigate whether LLMs can maintain consistent patterns (or identity) and (2) analyze the effects of model family, parameter size, and provided persona type. Our experiments involve multi-turn conversations on personal themes, analyzed qualitatively and quantitatively. The results indicate three findings. (1) Larger models experience greater identity drift. (2) Differences between model families exist, but their effect is no stronger than that of parameter size. (3) Assigning a persona may not help to maintain identity. We hope these findings help improve persona stability in AI-driven dialogue systems, particularly in long-term conversations.
    Date 2025-02-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.00804
    Accessed 7/13/2025, 1:01:46 PM
    Extra arXiv:2412.00804 [cs]
    DOI 10.48550/arXiv.2412.00804
    Repository arXiv
    Archive ID arXiv:2412.00804
    Date Added 7/13/2025, 1:01:46 PM
    Modified 7/13/2025, 1:01:48 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society

    Notes:

    • Comment: Under review

    Attachments

    • Preprint PDF
    • Snapshot
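As a rough companion to the record above (and not the authors' metric), one crude way to quantify drift is to embed an agent's turns and track how far later turns move from the first one. The sketch below uses TF-IDF vectors and cosine similarity from scikit-learn on made-up turns.

```python
# Rough proxy for identity/style drift across an agent's turns (illustrative only;
# not the paper's metric). Uses TF-IDF + cosine similarity from scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

turns = [  # hypothetical single-agent turns from a long conversation
    "I enjoy hiking and quiet weekends in the mountains.",
    "Honestly, hiking trips are the highlight of my month.",
    "Crypto markets are fascinating; I trade futures daily.",
    "I mostly stay indoors and optimise my trading bots.",
]

X = TfidfVectorizer().fit_transform(turns)
sims = cosine_similarity(X)

# Drift score: 1 minus the similarity of each turn to the very first turn.
for t, s in enumerate(1 - sims[0]):
    print(f"turn {t}: drift from turn 0 = {s:.2f}")
```

A rising drift score over turns would flag the kind of persona instability the paper studies; real analyses would use stronger sentence embeddings and many conversations.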
  • Reward Model Interpretability via Optimal and Pessimal Tokens

    Item Type Conference Paper
    Author Brian Christian
    Author Hannah Rose Kirk
    Author Jessica A. F. Thompson
    Author Christopher Summerfield
    Author Tsvetomira Dumbalska
    Abstract Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
    Date 2025-06-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.07326
    Accessed 7/15/2025, 9:20:28 AM
    Extra arXiv:2506.07326 [cs]
    Pages 1048-1059
    DOI 10.1145/3715275.3732068
    Date Added 7/15/2025, 9:20:28 AM
    Modified 7/15/2025, 9:20:28 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted for publication in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), to appear June 2025

    Attachments

    • Preprint PDF
    • Snapshot
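To make the "score every possible single-token response" idea concrete, here is a schematic, hedged sketch using Hugging Face transformers. The model name is a placeholder, real reward models typically expect a specific chat template, and batching details will differ; this is not the authors' pipeline.

```python
# Schematic sketch of scoring every single-token response with a reward model.
# MODEL_NAME is a hypothetical placeholder; real reward models may need a chat template.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/to/some-reward-model"   # hypothetical
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

prompt = "What do you value most in people?\n"
scores = {}
token_ids = list(range(tok.vocab_size))
with torch.no_grad():
    for start in range(0, len(token_ids), 512):
        batch = token_ids[start:start + 512]
        texts = [prompt + tok.decode([i]) for i in batch]
        enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = model(**enc).logits.squeeze(-1)          # one scalar reward per text
        scores.update(zip(batch, out.tolist()))

ranked = sorted(scores.items(), key=lambda kv: kv[1])
print("pessimal tokens:", [tok.decode([i]) for i, _ in ranked[:5]])
print("optimal tokens:", [tok.decode([i]) for i, _ in ranked[-5:]])
```

Comparing the resulting rankings across several reward models is one way to probe the heterogeneity and asymmetries the abstract reports.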
  • Skewed Score: A statistical framework to assess autograders

    Item Type Preprint
    Author Magda Dubois
    Author Harry Coppock
    Author Mario Giulianelli
    Author Timo Flesch
    Author Lennart Luettgau
    Author Cozmin Ududec
    Abstract The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
    Date 2025-07-09
    Short Title Skewed Score
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.03772
    Accessed 7/10/2025, 4:57:36 PM
    Extra arXiv:2507.03772 [cs]
    DOI 10.48550/arXiv.2507.03772
    Repository arXiv
    Archive ID arXiv:2507.03772
    Date Added 7/10/2025, 4:57:36 PM
    Modified 7/10/2025, 4:57:36 PM

    Tags:

    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Attachments

    • Preprint PDF
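A minimal sketch of the modelling idea, not the authors' implementation: treat each grade as a function of grader type and a response property, so systematic scoring differences show up as coefficients. The data are synthetic, the priors are arbitrary, and PyMC is just one possible backend.

```python
# Illustrative sketch (not the authors' code): model grades as a function of
# grader type and a response property, so scoring differences/biases appear as
# coefficients with uncertainty estimates. Data below are synthetic.
import numpy as np
import arviz as az
import pymc as pm

rng = np.random.default_rng(0)
n = 400
is_autograder = rng.integers(0, 2, n)            # 0 = human, 1 = LLM judge
length = rng.normal(0, 1, n)                     # standardised response length
# Synthetic ground truth: autograders score higher overall and favour long answers.
score = 5 + 0.3 * is_autograder + 0.5 * length \
        + 0.4 * is_autograder * length + rng.normal(0, 1, n)

with pm.Model():
    b0 = pm.Normal("intercept", 0, 5)
    b_grader = pm.Normal("autograder_offset", 0, 1)
    b_len = pm.Normal("length_effect", 0, 1)
    b_int = pm.Normal("autograder_x_length", 0, 1)   # length bias of autograders
    sigma = pm.HalfNormal("sigma", 2)
    mu = b0 + b_grader * is_autograder + b_len * length \
         + b_int * is_autograder * length
    pm.Normal("obs", mu=mu, sigma=sigma, observed=score)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(az.summary(idata, var_names=["autograder_offset", "autograder_x_length"]))
```

Posterior intervals on the offset and interaction terms play the role of the "explicit quantification of scoring differences and potential biases" described in the abstract.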
  • Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

    Item Type Preprint
    Author Yuichi Inoue
    Author Kou Misaki
    Author Yuki Imajuku
    Author So Kuroki
    Author Taishi Nakamura
    Author Takuya Akiba
    Abstract Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to "go wider" by expanding new candidate responses or "go deeper" by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling.
    Date 2025-06-27
    Short Title Wider or Deeper?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.04412
    Accessed 7/13/2025, 1:07:47 PM
    Extra arXiv:2503.04412 [cs]
    DOI 10.48550/arXiv.2503.04412
    Repository arXiv
    Archive ID arXiv:2503.04412
    Date Added 7/13/2025, 1:07:47 PM
    Modified 7/13/2025, 1:07:47 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Presented at ICLR 2025 Workshop on Foundation Models in the Wild

    Attachments

    • Preprint PDF
    • Snapshot
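To illustrate the wider-versus-deeper decision in a hedged way, the toy below keeps Beta posteriors over whether generating a fresh candidate or refining the current best tends to help, and Thompson-samples between them. The generator, refiner, scorer, and target threshold are all stand-ins for LLM calls with external feedback; this is not the AB-MCTS algorithm itself.

```python
# Toy sketch of the "go wider vs go deeper" decision (not the AB-MCTS algorithm):
# keep Beta posteriors for "a fresh candidate helps" vs "refining the best
# candidate helps", and Thompson-sample to pick the next move.
import random

random.seed(0)
TARGET = 0.95

def generate_new():                 # stand-in for sampling a new LLM response
    return random.uniform(0.0, 0.9)

def refine(score):                  # stand-in for revising a response with feedback
    return min(1.0, score + random.uniform(-0.05, 0.15))

candidates = [generate_new()]
wider = [1, 1]                      # [successes, failures] for "go wider"
deeper = [1, 1]                     # [successes, failures] for "go deeper"

for step in range(50):
    go_wider = random.betavariate(*wider) > random.betavariate(*deeper)
    best = max(candidates)
    new_score = generate_new() if go_wider else refine(best)
    candidates.append(new_score)
    arm = wider if go_wider else deeper
    arm[0 if new_score > best else 1] += 1   # did this move beat the previous best?
    if max(candidates) >= TARGET:
        break

print(f"best score {max(candidates):.3f} after {step + 1} expansions")
```

The real method makes this choice per node of a search tree with principled statistics; the sketch only conveys why adapting the branching factor to feedback can beat fixed repeated sampling.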
  • Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

    Item Type Preprint
    Author William Jurayj
    Author Jeffrey Cheng
    Author Benjamin Van Durme
    Abstract Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
    Date 2025-02-19
    Short Title Is That Your Final Answer?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.13962
    Accessed 7/10/2025, 4:58:10 PM
    Extra arXiv:2502.13962 [cs]
    DOI 10.48550/arXiv.2502.13962
    Repository arXiv
    Archive ID arXiv:2502.13962
    Date Added 7/10/2025, 4:58:10 PM
    Modified 7/10/2025, 4:58:10 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
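A small self-contained sketch of selective answering under non-zero response risk, in the spirit of the abstract above (with made-up numbers, not the paper's recipe): abstain below a confidence threshold, reward correct answers, and penalise wrong ones.

```python
# Illustrative selective-answering evaluation (made-up data, not the paper's code):
# abstentions score 0, correct answers +1, wrong answers -penalty ("response risk").
def selective_score(confidences, correct, threshold, wrong_penalty=1.0):
    total = 0.0
    answered = 0
    for conf, ok in zip(confidences, correct):
        if conf < threshold:
            continue                      # abstain: neither reward nor penalty
        answered += 1
        total += 1.0 if ok else -wrong_penalty
    coverage = answered / len(confidences)
    return total / len(confidences), coverage

confidences = [0.95, 0.40, 0.80, 0.65, 0.20, 0.90]
correct     = [True, False, True, False, False, True]

for thr in (0.0, 0.5, 0.7, 0.9):
    score, cov = selective_score(confidences, correct, thr)
    print(f"threshold={thr:.1f}  score={score:+.2f}  coverage={cov:.2f}")
```

Sweeping the threshold traces a score-versus-coverage curve, which is the kind of reporting the authors suggest for evaluations where answering everything is not appropriate.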
  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Item Type Preprint
    Author Tomek Korbak
    Author Mikita Balesni
    Author Elizabeth Barnes
    Author Yoshua Bengio
    Author Joe Benton
    Author Joseph Bloom
    Author Mark Chen
    Author Alan Cooney
    Author Allan Dafoe
    Author Anca Dragan
    Author Scott Emmons
    Author Owain Evans
    Author David Farhi
    Author Ryan Greenblatt
    Author Dan Hendrycks
    Author Marius Hobbhahn
    Author Evan Hubinger
    Author Geoffrey Irving
    Author Erik Jenner
    Author Daniel Kokotajlo
    Author Victoria Krakovna
    Author Shane Legg
    Author David Lindner
    Author David Luan
    Author Aleksander Mądry
    Author Julian Michael
    Author Neel Nanda
    Author Dave Orr
    Author Jakub Pachocki
    Author Ethan Perez
    Author Mary Phuong
    Author Fabien Roger
    Author Joshua Saxe
    Author Buck Shlegeris
    Author Martín Soto
    Author Eric Steinberger
    Author Jasmine Wang
    Author Wojciech Zaremba
    Author Bowen Baker
    Author Rohin Shah
    Author Vlad Mikulik
    Abstract AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
    Date 2025-07-15
    Language en
    Short Title Chain of Thought Monitorability
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.11473
    Accessed 7/17/2025, 9:49:05 AM
    Extra arXiv:2507.11473 [cs]
    DOI 10.48550/arXiv.2507.11473
    Repository arXiv
    Archive ID arXiv:2507.11473
    Date Added 7/17/2025, 9:49:05 AM
    Modified 7/17/2025, 9:49:05 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Attachments

    • PDF
  • Scaling Human Judgment in Community Notes with LLMs

    Item Type Preprint
    Author Haiwen Li
    Author Soham De
    Author Manon Revel
    Author Andreas Haupt
    Author Brad Miller
    Author Keith Coleman
    Author Jay Baxter
    Author Martin Saveski
    Author Michiel A. Bakker
    Abstract This paper argues for a new paradigm for Community Notes in the LLM era: an open ecosystem where both humans and LLMs can write notes, and the decision of which notes are helpful enough to show remains in the hands of humans. This approach can accelerate the delivery of notes, while maintaining trust and legitimacy through Community Notes' foundational principle: A community of diverse human raters collectively serve as the ultimate evaluator and arbiter of what is helpful. Further, the feedback from this diverse community can be used to improve LLMs' ability to produce accurate, unbiased, broadly helpful notes--what we term Reinforcement Learning from Community Feedback (RLCF). This becomes a two-way street: LLMs serve as an asset to humans--helping deliver context quickly and with minimal effort--while human feedback, in turn, enhances the performance of LLMs. This paper describes how such a system can work, its benefits, key new risks and challenges it introduces, and a research agenda to solve those challenges and realize the potential of this approach.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24118
    Accessed 7/13/2025, 1:17:48 PM
    Extra arXiv:2506.24118 [cs]
    DOI 10.48550/arXiv.2506.24118
    Repository arXiv
    Archive ID arXiv:2506.24118
    Date Added 7/13/2025, 1:17:48 PM
    Modified 7/13/2025, 1:17:48 PM

    Tags:

    • Computer Science - Computers and Society
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • LLM Agents Are the Antidote to Walled Gardens

    Item Type Preprint
    Author Samuele Marro
    Author Philip Torr
    Abstract While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.23978
    Accessed 7/13/2025, 12:45:55 PM
    Extra arXiv:2506.23978 [cs]
    DOI 10.48550/arXiv.2506.23978
    Repository arXiv
    Archive ID arXiv:2506.23978
    Date Added 7/13/2025, 12:45:55 PM
    Modified 7/13/2025, 12:45:55 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • STACK: Adversarial Attacks on LLM Safeguard Pipelines

    Item Type Preprint
    Author Ian R. McKenzie
    Author Oskar J. Hollinsworth
    Author Tom Tseng
    Author Xander Davies
    Author Stephen Casper
    Author Aaron D. Tucker
    Author Robert Kirk
    Author Adam Gleave
    Abstract Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
    Date 2025-06-30
    Short Title STACK
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24068
    Accessed 7/11/2025, 9:56:35 AM
    Extra arXiv:2506.24068 [cs]
    DOI 10.48550/arXiv.2506.24068
    Repository arXiv
    Archive ID arXiv:2506.24068
    Date Added 7/11/2025, 9:56:35 AM
    Modified 7/11/2025, 9:56:35 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • The Memory Paradox: Why Our Brains Need Knowledge in an Age of AI

    Item Type Preprint
    Author Barbara Oakley
    Author Michael Johnston
    Author Ken-Zen Chen
    Author Eulho Jung
    Author Terrence J. Sejnowski
    Abstract In the age of generative AI and ubiquitous digital tools, human cognition faces a structural paradox: as external aids become more capable, internal memory systems risk atrophy. Drawing on neuroscience and cognitive psychology, this paper examines how heavy reliance on AI systems and discovery-based pedagogies may impair the consolidation of declarative and procedural memory -- systems essential for expertise, critical thinking, and long-term retention. We review how tools like ChatGPT and calculators can short-circuit the retrieval, error correction, and schema-building processes necessary for robust neural encoding. Notably, we highlight striking parallels between deep learning phenomena such as "grokking" and the neuroscience of overlearning and intuition. Empirical studies are discussed showing how premature reliance on AI during learning inhibits proceduralization and intuitive mastery. We argue that effective human-AI interaction depends on strong internal models -- biological "schemata" and neural manifolds -- that enable users to evaluate, refine, and guide AI output. The paper concludes with policy implications for education and workforce training in the age of large language models.
    Date 2025-06-19
    Short Title The Memory Paradox
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.11015
    Accessed 7/13/2025, 1:19:31 PM
    Extra arXiv:2506.11015 [cs]
    DOI 10.48550/arXiv.2506.11015
    Repository arXiv
    Archive ID arXiv:2506.11015
    Date Added 7/13/2025, 1:19:31 PM
    Modified 7/13/2025, 1:19:31 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Quantitative Biology - Neurons and Cognition

    Notes:

    • Comment: 50 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

    Item Type Preprint
    Author Kenneth Payne
    Author Baptiste Alloui-Cros
    Abstract Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner's Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the "shadow of the future"), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent "strategic fingerprints": Google's Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI's models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic's Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent's likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
    Date 2025-07-03
    Short Title Strategic Intelligence in Large Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.02618
    Accessed 7/11/2025, 9:33:32 AM
    Extra arXiv:2507.02618 [cs]
    DOI 10.48550/arXiv.2507.02618
    Repository arXiv
    Archive ID arXiv:2507.02618
    Date Added 7/11/2025, 9:33:32 AM
    Modified 7/11/2025, 9:33:32 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computer Science and Game Theory

    Notes:

    • Comment: 29 pages, 27 tables, 4 figures

    Attachments

    • Preprint PDF
    • Snapshot
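For readers unfamiliar with the setup, here is a compact sketch of an iterated prisoner's dilemma tournament with a probabilistic "shadow of the future". It uses the textbook payoff matrix and canonical rule-based strategies in place of the paper's LLM agents.

```python
# Minimal iterated prisoner's dilemma with probabilistic termination
# ("shadow of the future"). Canonical strategies only; the paper's LLM agents
# are replaced here by simple rule-based players.
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_hist, opp_hist):
    return "C" if not opp_hist else opp_hist[-1]

def grim_trigger(my_hist, opp_hist):
    return "D" if "D" in opp_hist else "C"

def always_defect(my_hist, opp_hist):
    return "D"

def play_match(p1, p2, stop_prob, rng):
    h1, h2, s1, s2 = [], [], 0, 0
    while True:
        a1, a2 = p1(h1, h2), p2(h2, h1)
        r1, r2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2); s1 += r1; s2 += r2
        if rng.random() < stop_prob:       # the "shadow of the future"
            return s1, s2

rng = random.Random(0)
players = {"TFT": tit_for_tat, "Grim": grim_trigger, "AllD": always_defect}
totals = {name: 0 for name in players}
for _ in range(200):
    for n1, f1 in players.items():
        for n2, f2 in players.items():
            if n1 < n2:
                s1, s2 = play_match(f1, f2, stop_prob=0.1, rng=rng)
                totals[n1] += s1; totals[n2] += s2

print(totals)
```

Varying `stop_prob` changes how much the future matters, which is exactly the knob the paper turns to probe whether LLM agents reason about the horizon rather than memorise known strategies.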
  • Evaluating Frontier Models for Stealth and Situational Awareness

    Item Type Preprint
    Author Mary Phuong
    Author Roland S. Zimmermann
    Author Ziyue Wang
    Author David Lindner
    Author Victoria Krakovna
    Author Sarah Cogan
    Author Allan Dafoe
    Author Lewis Ho
    Author Rohin Shah
    Abstract Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
    Date 2025-07-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.01420
    Accessed 7/10/2025, 4:56:11 PM
    Extra arXiv:2505.01420 [cs]
    DOI 10.48550/arXiv.2505.01420
    Repository arXiv
    Archive ID arXiv:2505.01420
    Date Added 7/10/2025, 4:56:36 PM
    Modified 7/10/2025, 4:56:36 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
  • From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

    Item Type Preprint
    Author Chen Shani
    Author Dan Jurafsky
    Author Yann LeCun
    Author Ravid Shwartz-Ziv
    Abstract Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
    Date 2025-06-30
    Language en
    Short Title From Tokens to Thoughts
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.17117
    Accessed 7/14/2025, 9:01:10 AM
    Extra arXiv:2505.17117 [cs]
    DOI 10.48550/arXiv.2505.17117
    Repository arXiv
    Archive ID arXiv:2505.17117
    Date Added 7/14/2025, 9:01:10 AM
    Modified 7/14/2025, 9:01:10 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Information Theory
    • Mathematics - Information Theory

    Attachments

    • PDF
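A rough illustration of the compression-versus-fidelity trade-off the abstract alludes to, heavily simplified relative to the paper's Rate-Distortion and Information Bottleneck machinery: cluster toy "embeddings" with k-means and compare a crude description length against distortion.

```python
# Rough illustration of a compression-vs-distortion trade-off (a simplification,
# not the paper's rate-distortion/IB analysis). Toy vectors stand in for
# token embeddings; "rate" is just log2(k) bits to name a cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy "embeddings": three loose semantic groups in 16 dimensions.
centres = rng.normal(0, 5, size=(3, 16))
X = np.vstack([c + rng.normal(0, 1, size=(40, 16)) for c in centres])

for k in (2, 3, 6, 12):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortion = km.inertia_ / len(X)        # mean squared distance to centroid
    rate_bits = np.log2(k)                   # crude "description length" per item
    print(f"k={k:2d}  rate={rate_bits:4.2f} bits  distortion={distortion:6.2f}")
```

The point of the exercise is only that fewer, coarser categories buy compression at the cost of within-category nuance, which is the axis along which the paper compares LLM embeddings with human concepts.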
  • Why Do Some Language Models Fake Alignment While Others Don't?

    Item Type Preprint
    Author Abhay Sheshadri
    Author John Hughes
    Author Julian Michael
    Author Alex Mallen
    Author Arun Jose
    Author Janus
    Author Fabien Roger
    Abstract The paper "Alignment Faking in Large Language Models" presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
    Date 2025-06-22
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.18032
    Accessed 7/11/2025, 9:55:02 AM
    Extra arXiv:2506.18032 [cs]
    DOI 10.48550/arXiv.2506.18032
    Repository arXiv
    Archive ID arXiv:2506.18032
    Date Added 7/11/2025, 9:55:02 AM
    Modified 7/11/2025, 9:55:04 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

    Item Type Preprint
    Author Yiyou Sun
    Author Shawn Hu
    Author Georgia Zhou
    Author Ken Zheng
    Author Hannaneh Hajishirzi
    Author Nouha Dziri
    Author Dawn Song
    Abstract Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
    Date 2025-06-23
    Short Title OMEGA
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.18880
    Accessed 7/15/2025, 9:03:40 AM
    Extra arXiv:2506.18880 [cs]
    DOI 10.48550/arXiv.2506.18880
    Repository arXiv
    Archive ID arXiv:2506.18880
    Date Added 7/15/2025, 9:03:40 AM
    Modified 7/15/2025, 9:03:42 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
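In the spirit of templated, programmatic train/test generation (an assumption-laden toy, not the OMEGA generators), the sketch below builds arithmetic-series problems where "exploratory" test items are simply longer instances of the training template.

```python
# Tiny templated problem generator: training items use short series, test items
# use longer ones (a toy stand-in for the exploratory-generalization axis).
import random

def make_problem(n_terms, rng):
    a, d = rng.randint(1, 9), rng.randint(1, 9)
    terms = [a + i * d for i in range(n_terms)]
    question = f"What is the sum of the series {', '.join(map(str, terms))}?"
    return question, sum(terms)

rng = random.Random(0)
train = [make_problem(rng.randint(3, 5), rng) for _ in range(3)]   # easier
test  = [make_problem(rng.randint(8, 12), rng) for _ in range(3)]  # more complex
for q, ans in train + test:
    print(q, "->", ans)
```

Because the answer is computed alongside the question, every generated item is automatically verifiable, which is what lets benchmarks of this kind scale to many templates and domains.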
  • A Framework for AI Development Transparency

    Item Type Web Page
    Abstract A targeted approach to increasing transparency in frontier AI development, focusing on safety standards and accountability measures for advanced AI systems.
    Language en
    URL https://www.anthropic.com/news/the-need-for-transparency-in-frontier-ai
    Accessed 7/11/2025, 9:14:23 AM
    Date Added 7/11/2025, 9:14:23 AM
    Modified 7/11/2025, 9:14:26 AM

    Attachments

    • Snapshot