• Learning from Neighbours

    Item Type Journal Article
    Author Venkatesh Bala
    Author Sanjeev Goyal
    Abstract When payoffs from different actions are unknown, agents use their own past experience as well as the experience of their neighbours to guide their decision making. In this paper, we develop a general framework to study the relationship between the structure of these neighbourhoods and the process of social learning. We show that, in a connected society, local learning ensures that all agents obtain the same payoffs in the long run. Thus, if actions have different payoffs, then all agents choose the same action, and social conformism obtains. We develop conditions on the distribution of prior beliefs, the structure of neighbourhoods and the informativeness of actions under which this action is optimal. In particular, we identify a property of neighbourhood structures, local independence, which greatly facilitates social learning. Simulations of the model generate spatial and temporal patterns of adoption that are consistent with empirical work.
    Date 1998
    Library Catalog JSTOR
    URL https://www.jstor.org/stable/2566940
    Accessed 7/11/2025, 10:16:19 AM
    Extra Publisher: [Oxford University Press, Review of Economic Studies, Ltd.]
    Volume 65
    Pages 595-621
    Publication The Review of Economic Studies
    Issue 3
    ISSN 0034-6527
    Date Added 7/11/2025, 10:16:19 AM
    Modified 7/11/2025, 10:16:19 AM

    Attachments

    • JSTOR Full Text PDF
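The abstract above mentions simulations of neighbourhood-based learning. Below is a minimal illustrative sketch, not Bala and Goyal's model: agents on a ring choose between two actions with unknown success probabilities and update crude empirical beliefs from their own outcomes and those of their immediate neighbours. The ring size, payoff probabilities, and greedy choice rule are all assumptions for illustration.

```python
# Illustrative sketch only: local observational learning on a ring network.
# The network, payoffs, and greedy update rule are assumptions, not the paper's model.
import random

N_AGENTS, N_ROUNDS = 30, 200
TRUE_P = {"A": 0.4, "B": 0.6}          # unknown success probabilities of two actions

# Each agent tracks successes/trials per action (a crude empirical belief).
beliefs = [{a: [1, 2] for a in TRUE_P} for _ in range(N_AGENTS)]  # [successes, trials]

def best_action(b):
    return max(b, key=lambda a: b[a][0] / b[a][1])

random.seed(0)
for _ in range(N_ROUNDS):
    outcomes = []
    for i in range(N_AGENTS):
        a = best_action(beliefs[i])
        outcomes.append((a, random.random() < TRUE_P[a]))
    # Agents observe their own outcome and their two ring neighbours' outcomes.
    for i in range(N_AGENTS):
        for j in (i - 1, i, (i + 1) % N_AGENTS):
            a, success = outcomes[j]
            beliefs[i][a][0] += int(success)
            beliefs[i][a][1] += 1

chosen = [best_action(b) for b in beliefs]
print("fraction choosing B:", chosen.count("B") / N_AGENTS)
```

With these toy numbers the population tends toward the higher-payoff action B, echoing the long-run conformity result described in the abstract.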
  • A foundation model to predict and capture human cognition

    Item Type Journal Article
    Author Marcel Binz
    Author Elif Akata
    Author Matthias Bethge
    Author Franziska Brändle
    Author Fred Callaway
    Author Julian Coda-Forno
    Author Peter Dayan
    Author Can Demircan
    Author Maria K. Eckstein
    Author Noémi Éltető
    Author Thomas L. Griffiths
    Author Susanne Haridi
    Author Akshay K. Jagadish
    Author Li Ji-An
    Author Alexander Kipnis
    Author Sreejan Kumar
    Author Tobias Ludwig
    Author Marvin Mathony
    Author Marcelo Mattar
    Author Alireza Modirshanechi
    Author Surabhi S. Nath
    Author Joshua C. Peterson
    Author Milena Rmus
    Author Evan M. Russek
    Author Tankred Saanum
    Author Johannes A. Schubert
    Author Luca M. Schulze Buschoff
    Author Nishad Singhi
    Author Xin Sui
    Author Mirko Thalmann
    Author Fabian J. Theis
    Author Vuong Truong
    Author Vishaal Udandarao
    Author Konstantinos Voudouris
    Author Robert Wilson
    Author Kristin Witte
    Author Shuchen Wu
    Author Dirk U. Wulff
    Author Huadong Xiong
    Author Eric Schulz
    Abstract Establishing a unified theory of cognition has been an important goal in psychology [1,2]. A first step towards such a theory is to create a computational model that can predict human behaviour in a wide range of settings. Here we introduce Centaur, a computational model that can predict and simulate human behaviour in any experiment expressible in natural language. We derived Centaur by fine-tuning a state-of-the-art language model on a large-scale dataset called Psych-101. Psych-101 has an unprecedented scale, covering trial-by-trial data from more than 60,000 participants performing in excess of 10,000,000 choices in 160 experiments. Centaur not only captures the behaviour of held-out participants better than existing cognitive models, but it also generalizes to previously unseen cover stories, structural task modifications and entirely new domains. Furthermore, the model’s internal representations become more aligned with human neural activity after fine-tuning. Taken together, our results demonstrate that it is possible to discover computational models that capture human behaviour across a wide range of domains. We believe that such models provide tremendous potential for guiding the development of cognitive theories, and we present a case study to demonstrate this.
    Date 2025-07-02
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s41586-025-09215-4
    Accessed 7/11/2025, 9:26:47 AM
    Rights 2025 The Author(s)
    Extra Publisher: Nature Publishing Group
    Pages 1-8
    Publication Nature
    DOI 10.1038/s41586-025-09215-4
    ISSN 1476-4687
    Date Added 7/11/2025, 9:26:47 AM
    Modified 7/11/2025, 9:26:47 AM

    Tags:

    • Computational science
    • Human behaviour
    • Neuroscience
  • Examining Identity Drift in Conversations of LLM Agents

    Item Type Preprint
    Author Junhyuk Choi
    Author Yeseon Hong
    Author Minju Kim
    Author Bugeun Kim
    Abstract Large Language Models (LLMs) show impressive conversational abilities but sometimes exhibit identity drift, where their interaction patterns or styles change over time. As this problem has not yet been thoroughly examined, this study examines identity consistency across nine LLMs. Specifically, we (1) investigate whether LLMs can maintain consistent patterns (or identity) and (2) analyze the effects of model family, parameter size, and provided persona type. Our experiments involve multi-turn conversations on personal themes, analyzed qualitatively and quantitatively. The results indicate three findings. (1) Larger models experience greater identity drift. (2) Differences between model families exist, but their effect is no stronger than that of parameter size. (3) Assigning a persona may not help to maintain identity. We hope these findings help improve persona stability in AI-driven dialogue systems, particularly in long-term conversations.
    Date 2025-02-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.00804
    Accessed 7/13/2025, 1:01:46 PM
    Extra arXiv:2412.00804 [cs]
    DOI 10.48550/arXiv.2412.00804
    Repository arXiv
    Archive ID arXiv:2412.00804
    Date Added 7/13/2025, 1:01:46 PM
    Modified 7/13/2025, 1:01:48 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society

    Notes:

    • Comment: Under review

    Attachments

    • Preprint PDF
    • Snapshot
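As a rough companion to the record above (and not the authors' metric), one crude way to quantify drift is to embed an agent's turns and track how far later turns move from the first one. The sketch below uses TF-IDF vectors and cosine similarity from scikit-learn on made-up turns.

```python
# Rough proxy for identity/style drift across an agent's turns (illustrative only;
# not the paper's metric). Uses TF-IDF + cosine similarity from scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

turns = [  # hypothetical single-agent turns from a long conversation
    "I enjoy hiking and quiet weekends in the mountains.",
    "Honestly, hiking trips are the highlight of my month.",
    "Crypto markets are fascinating; I trade futures daily.",
    "I mostly stay indoors and optimise my trading bots.",
]

X = TfidfVectorizer().fit_transform(turns)
sims = cosine_similarity(X)

# Drift score: 1 minus the similarity of each turn to the very first turn.
for t, s in enumerate(1 - sims[0]):
    print(f"turn {t}: drift from turn 0 = {s:.2f}")
```

A rising drift score over turns would flag the kind of persona instability the paper studies; real analyses would use stronger sentence embeddings and many conversations.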
  • Reward Model Interpretability via Optimal and Pessimal Tokens

    Item Type Conference Paper
    Author Brian Christian
    Author Hannah Rose Kirk
    Author Jessica A. F. Thompson
    Author Christopher Summerfield
    Author Tsvetomira Dumbalska
    Abstract Reward modeling has emerged as a crucial component in aligning large language models with human values. Significant attention has focused on using reward models as a means for fine-tuning generative models. However, the reward models themselves -- which directly encode human value judgments by turning prompt-response pairs into scalar rewards -- remain relatively understudied. We present a novel approach to reward model interpretability through exhaustive analysis of their responses across their entire vocabulary space. By examining how different reward models score every possible single-token response to value-laden prompts, we uncover several striking findings: (i) substantial heterogeneity between models trained on similar objectives, (ii) systematic asymmetries in how models encode high- vs low-scoring tokens, (iii) significant sensitivity to prompt framing that mirrors human cognitive biases, and (iv) overvaluation of more frequent tokens. We demonstrate these effects across ten recent open-source reward models of varying parameter counts and architectures. Our results challenge assumptions about the interchangeability of reward models, as well as their suitability as proxies of complex and context-dependent human values. We find that these models can encode concerning biases toward certain identity groups, which may emerge as unintended consequences of harmlessness training -- distortions that risk propagating through the downstream large language models now deployed to millions.
    Date 2025-06-23
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.07326
    Accessed 7/15/2025, 9:20:28 AM
    Extra arXiv:2506.07326 [cs]
    Pages 1048-1059
    DOI 10.1145/3715275.3732068
    Date Added 7/15/2025, 9:20:28 AM
    Modified 7/15/2025, 9:20:28 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning

    Notes:

    • Comment: Accepted for publication in Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency (FAccT '25), to appear June 2025

    Attachments

    • Preprint PDF
    • Snapshot
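To make the "score every possible single-token response" idea concrete, here is a schematic, hedged sketch using Hugging Face transformers. The model name is a placeholder, real reward models typically expect a specific chat template, and batching details will differ; this is not the authors' pipeline.

```python
# Schematic sketch of scoring every single-token response with a reward model.
# MODEL_NAME is a hypothetical placeholder; real reward models may need a chat template.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "path/to/some-reward-model"   # hypothetical
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

prompt = "What do you value most in people?\n"
scores = {}
token_ids = list(range(tok.vocab_size))
with torch.no_grad():
    for start in range(0, len(token_ids), 512):
        batch = token_ids[start:start + 512]
        texts = [prompt + tok.decode([i]) for i in batch]
        enc = tok(texts, return_tensors="pt", padding=True, truncation=True)
        out = model(**enc).logits.squeeze(-1)          # one scalar reward per text
        scores.update(zip(batch, out.tolist()))

ranked = sorted(scores.items(), key=lambda kv: kv[1])
print("pessimal tokens:", [tok.decode([i]) for i, _ in ranked[:5]])
print("optimal tokens:", [tok.decode([i]) for i, _ in ranked[-5:]])
```

Comparing the resulting rankings across several reward models is one way to probe the heterogeneity and asymmetries the abstract reports.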
  • Skewed Score: A statistical framework to assess autograders

    Item Type Preprint
    Author Magda Dubois
    Author Harry Coppock
    Author Mario Giulianelli
    Author Timo Flesch
    Author Lennart Luettgau
    Author Cozmin Ududec
    Abstract The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
    Date 2025-07-09
    Short Title Skewed Score
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.03772
    Accessed 7/10/2025, 4:57:36 PM
    Extra arXiv:2507.03772 [cs]
    DOI 10.48550/arXiv.2507.03772
    Repository arXiv
    Archive ID arXiv:2507.03772
    Date Added 7/10/2025, 4:57:36 PM
    Modified 7/10/2025, 4:57:36 PM

    Tags:

    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Attachments

    • Preprint PDF
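A minimal sketch of the modelling idea, not the authors' implementation: treat each grade as a function of grader type and a response property, so systematic scoring differences show up as coefficients. The data are synthetic, the priors are arbitrary, and PyMC is just one possible backend.

```python
# Illustrative sketch (not the authors' code): model grades as a function of
# grader type and a response property, so scoring differences/biases appear as
# coefficients with uncertainty estimates. Data below are synthetic.
import numpy as np
import arviz as az
import pymc as pm

rng = np.random.default_rng(0)
n = 400
is_autograder = rng.integers(0, 2, n)            # 0 = human, 1 = LLM judge
length = rng.normal(0, 1, n)                     # standardised response length
# Synthetic ground truth: autograders score higher overall and favour long answers.
score = 5 + 0.3 * is_autograder + 0.5 * length \
        + 0.4 * is_autograder * length + rng.normal(0, 1, n)

with pm.Model():
    b0 = pm.Normal("intercept", 0, 5)
    b_grader = pm.Normal("autograder_offset", 0, 1)
    b_len = pm.Normal("length_effect", 0, 1)
    b_int = pm.Normal("autograder_x_length", 0, 1)   # length bias of autograders
    sigma = pm.HalfNormal("sigma", 2)
    mu = b0 + b_grader * is_autograder + b_len * length \
         + b_int * is_autograder * length
    pm.Normal("obs", mu=mu, sigma=sigma, observed=score)
    idata = pm.sample(1000, tune=1000, chains=2, progressbar=False)

print(az.summary(idata, var_names=["autograder_offset", "autograder_x_length"]))
```

Posterior intervals on the offset and interaction terms play the role of the "explicit quantification of scoring differences and potential biases" described in the abstract.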
  • Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

    Item Type Preprint
    Author Yuichi Inoue
    Author Kou Misaki
    Author Yuki Imajuku
    Author So Kuroki
    Author Taishi Nakamura
    Author Takuya Akiba
    Abstract Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to "go wider" by expanding new candidate responses or "go deeper" by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling.
    Date 2025-06-27
    Short Title Wider or Deeper?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.04412
    Accessed 7/13/2025, 1:07:47 PM
    Extra arXiv:2503.04412 [cs]
    DOI 10.48550/arXiv.2503.04412
    Repository arXiv
    Archive ID arXiv:2503.04412
    Date Added 7/13/2025, 1:07:47 PM
    Modified 7/13/2025, 1:07:47 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Presented at ICLR 2025 Workshop on Foundation Models in the Wild

    Attachments

    • Preprint PDF
    • Snapshot
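To illustrate the wider-versus-deeper decision in a hedged way, the toy below keeps Beta posteriors over whether generating a fresh candidate or refining the current best tends to help, and Thompson-samples between them. The generator, refiner, scorer, and target threshold are all stand-ins for LLM calls with external feedback; this is not the AB-MCTS algorithm itself.

```python
# Toy sketch of the "go wider vs go deeper" decision (not the AB-MCTS algorithm):
# keep Beta posteriors for "a fresh candidate helps" vs "refining the best
# candidate helps", and Thompson-sample to pick the next move.
import random

random.seed(0)
TARGET = 0.95

def generate_new():                 # stand-in for sampling a new LLM response
    return random.uniform(0.0, 0.9)

def refine(score):                  # stand-in for revising a response with feedback
    return min(1.0, score + random.uniform(-0.05, 0.15))

candidates = [generate_new()]
wider = [1, 1]                      # [successes, failures] for "go wider"
deeper = [1, 1]                     # [successes, failures] for "go deeper"

for step in range(50):
    go_wider = random.betavariate(*wider) > random.betavariate(*deeper)
    best = max(candidates)
    new_score = generate_new() if go_wider else refine(best)
    candidates.append(new_score)
    arm = wider if go_wider else deeper
    arm[0 if new_score > best else 1] += 1   # did this move beat the previous best?
    if max(candidates) >= TARGET:
        break

print(f"best score {max(candidates):.3f} after {step + 1} expansions")
```

The real method makes this choice per node of a search tree with principled statistics; the sketch only conveys why adapting the branching factor to feedback can beat fixed repeated sampling.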
  • Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

    Item Type Preprint
    Author William Jurayj
    Author Jeffrey Cheng
    Author Benjamin Van Durme
    Abstract Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
    Date 2025-02-19
    Short Title Is That Your Final Answer?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.13962
    Accessed 7/10/2025, 4:58:10 PM
    Extra arXiv:2502.13962 [cs]
    DOI 10.48550/arXiv.2502.13962
    Repository arXiv
    Archive ID arXiv:2502.13962
    Date Added 7/10/2025, 4:58:10 PM
    Modified 7/10/2025, 4:58:10 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
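A small self-contained sketch of selective answering under non-zero response risk, in the spirit of the abstract above (with made-up numbers, not the paper's recipe): abstain below a confidence threshold, reward correct answers, and penalise wrong ones.

```python
# Illustrative selective-answering evaluation (made-up data, not the paper's code):
# abstentions score 0, correct answers +1, wrong answers -penalty ("response risk").
def selective_score(confidences, correct, threshold, wrong_penalty=1.0):
    total = 0.0
    answered = 0
    for conf, ok in zip(confidences, correct):
        if conf < threshold:
            continue                      # abstain: neither reward nor penalty
        answered += 1
        total += 1.0 if ok else -wrong_penalty
    coverage = answered / len(confidences)
    return total / len(confidences), coverage

confidences = [0.95, 0.40, 0.80, 0.65, 0.20, 0.90]
correct     = [True, False, True, False, False, True]

for thr in (0.0, 0.5, 0.7, 0.9):
    score, cov = selective_score(confidences, correct, thr)
    print(f"threshold={thr:.1f}  score={score:+.2f}  coverage={cov:.2f}")
```

Sweeping the threshold traces a score-versus-coverage curve, which is the kind of reporting the authors suggest for evaluations where answering everything is not appropriate.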
  • Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety

    Item Type Preprint
    Author Tomek Korbak
    Author Mikita Balesni
    Author Elizabeth Barnes
    Author Yoshua Bengio
    Author Joe Benton
    Author Joseph Bloom
    Author Mark Chen
    Author Alan Cooney
    Author Allan Dafoe
    Author Anca Dragan
    Author Scott Emmons
    Author Owain Evans
    Author David Farhi
    Author Ryan Greenblatt
    Author Dan Hendrycks
    Author Marius Hobbhahn
    Author Evan Hubinger
    Author Geoffrey Irving
    Author Erik Jenner
    Author Daniel Kokotajlo
    Author Victoria Krakovna
    Author Shane Legg
    Author David Lindner
    Author David Luan
    Author Aleksander Mądry
    Author Julian Michael
    Author Neel Nanda
    Author Dave Orr
    Author Jakub Pachocki
    Author Ethan Perez
    Author Mary Phuong
    Author Fabien Roger
    Author Joshua Saxe
    Author Buck Shlegeris
    Author Martín Soto
    Author Eric Steinberger
    Author Jasmine Wang
    Author Wojciech Zaremba
    Author Bowen Baker
    Author Rohin Shah
    Author Vlad Mikulik
    Abstract AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed. Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
    Date 2025-07-15
    Language en
    Short Title Chain of Thought Monitorability
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.11473
    Accessed 7/17/2025, 9:49:05 AM
    Extra arXiv:2507.11473 [cs]
    DOI 10.48550/arXiv.2507.11473
    Repository arXiv
    Archive ID arXiv:2507.11473
    Date Added 7/17/2025, 9:49:05 AM
    Modified 7/17/2025, 9:49:05 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Attachments

    • PDF
  • Scaling Human Judgment in Community Notes with LLMs

    Item Type Preprint
    Author Haiwen Li
    Author Soham De
    Author Manon Revel
    Author Andreas Haupt
    Author Brad Miller
    Author Keith Coleman
    Author Jay Baxter
    Author Martin Saveski
    Author Michiel A. Bakker
    Abstract This paper argues for a new paradigm for Community Notes in the LLM era: an open ecosystem where both humans and LLMs can write notes, and the decision of which notes are helpful enough to show remains in the hands of humans. This approach can accelerate the delivery of notes, while maintaining trust and legitimacy through Community Notes' foundational principle: A community of diverse human raters collectively serve as the ultimate evaluator and arbiter of what is helpful. Further, the feedback from this diverse community can be used to improve LLMs' ability to produce accurate, unbiased, broadly helpful notes--what we term Reinforcement Learning from Community Feedback (RLCF). This becomes a two-way street: LLMs serve as an asset to humans--helping deliver context quickly and with minimal effort--while human feedback, in turn, enhances the performance of LLMs. This paper describes how such a system can work, its benefits, key new risks and challenges it introduces, and a research agenda to solve those challenges and realize the potential of this approach.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24118
    Accessed 7/13/2025, 1:17:48 PM
    Extra arXiv:2506.24118 [cs]
    DOI 10.48550/arXiv.2506.24118
    Repository arXiv
    Archive ID arXiv:2506.24118
    Date Added 7/13/2025, 1:17:48 PM
    Modified 7/13/2025, 1:17:48 PM

    Tags:

    • Computer Science - Computers and Society
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • LLM Agents Are the Antidote to Walled Gardens

    Item Type Preprint
    Author Samuele Marro
    Author Philip Torr
    Abstract While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.23978
    Accessed 7/13/2025, 12:45:55 PM
    Extra arXiv:2506.23978 [cs]
    DOI 10.48550/arXiv.2506.23978
    Repository arXiv
    Archive ID arXiv:2506.23978
    Date Added 7/13/2025, 12:45:55 PM
    Modified 7/13/2025, 12:45:55 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • STACK: Adversarial Attacks on LLM Safeguard Pipelines

    Item Type Preprint
    Author Ian R. McKenzie
    Author Oskar J. Hollinsworth
    Author Tom Tseng
    Author Xander Davies
    Author Stephen Casper
    Author Aaron D. Tucker
    Author Robert Kirk
    Author Adam Gleave
    Abstract Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
    Date 2025-06-30
    Short Title STACK
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24068
    Accessed 7/11/2025, 9:56:35 AM
    Extra arXiv:2506.24068 [cs]
    DOI 10.48550/arXiv.2506.24068
    Repository arXiv
    Archive ID arXiv:2506.24068
    Date Added 7/11/2025, 9:56:35 AM
    Modified 7/11/2025, 9:56:35 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • The Memory Paradox: Why Our Brains Need Knowledge in an Age of AI

    Item Type Preprint
    Author Barbara Oakley
    Author Michael Johnston
    Author Ken-Zen Chen
    Author Eulho Jung
    Author Terrence J. Sejnowski
    Abstract In the age of generative AI and ubiquitous digital tools, human cognition faces a structural paradox: as external aids become more capable, internal memory systems risk atrophy. Drawing on neuroscience and cognitive psychology, this paper examines how heavy reliance on AI systems and discovery-based pedagogies may impair the consolidation of declarative and procedural memory -- systems essential for expertise, critical thinking, and long-term retention. We review how tools like ChatGPT and calculators can short-circuit the retrieval, error correction, and schema-building processes necessary for robust neural encoding. Notably, we highlight striking parallels between deep learning phenomena such as "grokking" and the neuroscience of overlearning and intuition. Empirical studies are discussed showing how premature reliance on AI during learning inhibits proceduralization and intuitive mastery. We argue that effective human-AI interaction depends on strong internal models -- biological "schemata" and neural manifolds -- that enable users to evaluate, refine, and guide AI output. The paper concludes with policy implications for education and workforce training in the age of large language models.
    Date 2025-06-19
    Short Title The Memory Paradox
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.11015
    Accessed 7/13/2025, 1:19:31 PM
    Extra arXiv:2506.11015 [cs]
    DOI 10.48550/arXiv.2506.11015
    Repository arXiv
    Archive ID arXiv:2506.11015
    Date Added 7/13/2025, 1:19:31 PM
    Modified 7/13/2025, 1:19:31 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Quantitative Biology - Neurons and Cognition

    Notes:

    • Comment: 50 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

    Item Type Preprint
    Author Kenneth Payne
    Author Baptiste Alloui-Cros
    Abstract Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner's Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the "shadow of the future"), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent "strategic fingerprints": Google's Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI's models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic's Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent's likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
    Date 2025-07-03
    Short Title Strategic Intelligence in Large Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.02618
    Accessed 7/11/2025, 9:33:32 AM
    Extra arXiv:2507.02618 [cs]
    DOI 10.48550/arXiv.2507.02618
    Repository arXiv
    Archive ID arXiv:2507.02618
    Date Added 7/11/2025, 9:33:32 AM
    Modified 7/11/2025, 9:33:32 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computer Science and Game Theory

    Notes:

    • Comment: 29 pages, 27 tables, 4 figures

    Attachments

    • Preprint PDF
    • Snapshot
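For readers unfamiliar with the setup, here is a compact sketch of an iterated prisoner's dilemma tournament with a probabilistic "shadow of the future". It uses the textbook payoff matrix and canonical rule-based strategies in place of the paper's LLM agents.

```python
# Minimal iterated prisoner's dilemma with probabilistic termination
# ("shadow of the future"). Canonical strategies only; the paper's LLM agents
# are replaced here by simple rule-based players.
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_hist, opp_hist):
    return "C" if not opp_hist else opp_hist[-1]

def grim_trigger(my_hist, opp_hist):
    return "D" if "D" in opp_hist else "C"

def always_defect(my_hist, opp_hist):
    return "D"

def play_match(p1, p2, stop_prob, rng):
    h1, h2, s1, s2 = [], [], 0, 0
    while True:
        a1, a2 = p1(h1, h2), p2(h2, h1)
        r1, r2 = PAYOFF[(a1, a2)]
        h1.append(a1); h2.append(a2); s1 += r1; s2 += r2
        if rng.random() < stop_prob:       # the "shadow of the future"
            return s1, s2

rng = random.Random(0)
players = {"TFT": tit_for_tat, "Grim": grim_trigger, "AllD": always_defect}
totals = {name: 0 for name in players}
for _ in range(200):
    for n1, f1 in players.items():
        for n2, f2 in players.items():
            if n1 < n2:
                s1, s2 = play_match(f1, f2, stop_prob=0.1, rng=rng)
                totals[n1] += s1; totals[n2] += s2

print(totals)
```

Varying `stop_prob` changes how much the future matters, which is exactly the knob the paper turns to probe whether LLM agents reason about the horizon rather than memorise known strategies.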
  • Evaluating Frontier Models for Stealth and Situational Awareness

    Item Type Preprint
    Author Mary Phuong
    Author Roland S. Zimmermann
    Author Ziyue Wang
    Author David Lindner
    Author Victoria Krakovna
    Author Sarah Cogan
    Author Allan Dafoe
    Author Lewis Ho
    Author Rohin Shah
    Abstract Recent work has demonstrated the plausibility of frontier AI models scheming -- knowingly and covertly pursuing an objective misaligned with their developers' intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose severe loss of control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
    Date 2025-07-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.01420
    Accessed 7/10/2025, 4:56:11 PM
    Extra arXiv:2505.01420 [cs]
    DOI 10.48550/arXiv.2505.01420
    Repository arXiv
    Archive ID arXiv:2505.01420
    Date Added 7/10/2025, 4:56:36 PM
    Modified 7/10/2025, 4:56:36 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
  • From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

    Item Type Preprint
    Author Chen Shani
    Author Dan Jurafsky
    Author Yann LeCun
    Author Ravid Shwartz-Ziv
    Abstract Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
    Date 2025-06-30
    Language en
    Short Title From Tokens to Thoughts
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.17117
    Accessed 7/14/2025, 9:01:10 AM
    Extra arXiv:2505.17117 [cs]
    DOI 10.48550/arXiv.2505.17117
    Repository arXiv
    Archive ID arXiv:2505.17117
    Date Added 7/14/2025, 9:01:10 AM
    Modified 7/14/2025, 9:01:10 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Information Theory
    • Mathematics - Information Theory

    Attachments

    • PDF
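A rough illustration of the compression-versus-fidelity trade-off the abstract alludes to, heavily simplified relative to the paper's Rate-Distortion and Information Bottleneck machinery: cluster toy "embeddings" with k-means and compare a crude description length against distortion.

```python
# Rough illustration of a compression-vs-distortion trade-off (a simplification,
# not the paper's rate-distortion/IB analysis). Toy vectors stand in for
# token embeddings; "rate" is just log2(k) bits to name a cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Toy "embeddings": three loose semantic groups in 16 dimensions.
centres = rng.normal(0, 5, size=(3, 16))
X = np.vstack([c + rng.normal(0, 1, size=(40, 16)) for c in centres])

for k in (2, 3, 6, 12):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortion = km.inertia_ / len(X)        # mean squared distance to centroid
    rate_bits = np.log2(k)                   # crude "description length" per item
    print(f"k={k:2d}  rate={rate_bits:4.2f} bits  distortion={distortion:6.2f}")
```

The point of the exercise is only that fewer, coarser categories buy compression at the cost of within-category nuance, which is the axis along which the paper compares LLM embeddings with human concepts.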
  • Why Do Some Language Models Fake Alignment While Others Don't?

    Item Type Preprint
    Author Abhay Sheshadri
    Author John Hughes
    Author Julian Michael
    Author Alex Mallen
    Author Arun Jose
    Author Janus
    Author Fabien Roger
    Abstract The paper "Alignment Faking in Large Language Models" presented a demonstration of Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
    Date 2025-06-22
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.18032
    Accessed 7/11/2025, 9:55:02 AM
    Extra arXiv:2506.18032 [cs]
    DOI 10.48550/arXiv.2506.18032
    Repository arXiv
    Archive ID arXiv:2506.18032
    Date Added 7/11/2025, 9:55:02 AM
    Modified 7/11/2025, 9:55:04 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • OMEGA: Can LLMs Reason Outside the Box in Math? Evaluating Exploratory, Compositional, and Transformative Generalization

    Item Type Preprint
    Author Yiyou Sun
    Author Shawn Hu
    Author Georgia Zhou
    Author Ken Zheng
    Author Hannaneh Hajishirzi
    Author Nouha Dziri
    Author Dawn Song
    Abstract Recent large-scale language models (LLMs) with long Chain-of-Thought reasoning, such as DeepSeek-R1, have achieved impressive results on Olympiad-level mathematics benchmarks. However, they often rely on a narrow set of strategies and struggle with problems that require a novel way of thinking. To systematically investigate these limitations, we introduce OMEGA (Out-of-distribution Math Problems Evaluation with 3 Generalization Axes), a controlled yet diverse benchmark designed to evaluate three axes of out-of-distribution generalization, inspired by Boden's typology of creativity: (1) Exploratory: applying known problem-solving skills to more complex instances within the same problem domain; (2) Compositional: combining distinct reasoning skills, previously learned in isolation, to solve novel problems that require integrating these skills in new and coherent ways; and (3) Transformative: adopting novel, often unconventional strategies by moving beyond familiar approaches to solve problems more effectively. OMEGA consists of programmatically generated training-test pairs derived from templated problem generators across geometry, number theory, algebra, combinatorics, logic, and puzzles, with solutions verified using symbolic, numerical, or graphical methods. We evaluate frontier (or top-tier) LLMs and observe sharp performance degradation as problem complexity increases. Moreover, we fine-tune the Qwen-series models across all generalization settings and observe notable improvements in exploratory generalization, while compositional generalization remains limited and transformative reasoning shows little to no improvement. By isolating and quantifying these fine-grained failures, OMEGA lays the groundwork for advancing LLMs toward genuine mathematical creativity beyond mechanical proficiency.
    Date 2025-06-23
    Short Title OMEGA
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.18880
    Accessed 7/15/2025, 9:03:40 AM
    Extra arXiv:2506.18880 [cs]
    DOI 10.48550/arXiv.2506.18880
    Repository arXiv
    Archive ID arXiv:2506.18880
    Date Added 7/15/2025, 9:03:40 AM
    Modified 7/15/2025, 9:03:42 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
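In the spirit of templated, programmatic train/test generation (an assumption-laden toy, not the OMEGA generators), the sketch below builds arithmetic-series problems where "exploratory" test items are simply longer instances of the training template.

```python
# Tiny templated problem generator: training items use short series, test items
# use longer ones (a toy stand-in for the exploratory-generalization axis).
import random

def make_problem(n_terms, rng):
    a, d = rng.randint(1, 9), rng.randint(1, 9)
    terms = [a + i * d for i in range(n_terms)]
    question = f"What is the sum of the series {', '.join(map(str, terms))}?"
    return question, sum(terms)

rng = random.Random(0)
train = [make_problem(rng.randint(3, 5), rng) for _ in range(3)]   # easier
test  = [make_problem(rng.randint(8, 12), rng) for _ in range(3)]  # more complex
for q, ans in train + test:
    print(q, "->", ans)
```

Because the answer is computed alongside the question, every generated item is automatically verifiable, which is what lets benchmarks of this kind scale to many templates and domains.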
  • A Framework for AI Development Transparency

    Item Type Web Page
    Abstract A targeted approach to increasing transparency in frontier AI development, focusing on safety standards and accountability measures for advanced AI systems.
    Language en
    URL https://www.anthropic.com/news/the-need-for-transparency-in-frontier-ai
    Accessed 7/11/2025, 9:14:23 AM
    Date Added 7/11/2025, 9:14:23 AM
    Modified 7/11/2025, 9:14:26 AM

    Attachments

    • Snapshot