  • Learning from Neighbours

    Item Type Journal Article
    Author Venkatesh Bala
    Author Sanjeev Goyal
    Abstract When payoffs from different actions are unknown, agents use their own past experience as well as the experience of their neighbours to guide their decision making. In this paper, we develop a general framework to study the relationship between the structure of these neighbourhoods and the process of social learning. We show that, in a connected society, local learning ensures that all agents obtain the same payoffs in the long run. Thus, if actions have different payoffs, then all agents choose the same action, and social conformism obtains. We develop conditions on the distribution of prior beliefs, the structure of neighbourhoods and the informativeness of actions under which this action is optimal. In particular, we identify a property of neighbourhood structures, local independence, which greatly facilitates social learning. Simulations of the model generate spatial and temporal patterns of adoption that are consistent with empirical work.
    Date 1998
    Library Catalog JSTOR
    URL https://www.jstor.org/stable/2566940
    Accessed 7/11/2025, 10:16:19 AM
    Extra Publisher: [Oxford University Press, Review of Economic Studies, Ltd.]
    Volume 65
    Pages 595-621
    Publication The Review of Economic Studies
    Issue 3
    ISSN 0034-6527
    Date Added 7/11/2025, 10:16:19 AM
    Modified 7/11/2025, 10:16:19 AM

    Attachments

    • JSTOR Full Text PDF
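
    A minimal simulation sketch of the kind of local learning this abstract describes, under simplified assumptions: two actions with unknown Bernoulli payoffs, agents on a ring who pool their own and their neighbours' observed outcomes and myopically choose the action with the higher estimated payoff. This is illustrative only, not the paper's Bayesian model; all names and parameter values are hypothetical.

        # Illustrative sketch of local social learning (not the paper's model).
        import random

        N_AGENTS, N_ROUNDS, RADIUS = 50, 200, 1
        TRUE_PAYOFF = {0: 0.4, 1: 0.6}   # action 1 is objectively better, but agents do not know this

        # per-agent, per-action [successes, trials] counters, seeded with a weak uniform prior
        counts = [{a: [1, 2] for a in TRUE_PAYOFF} for _ in range(N_AGENTS)]

        def neighbours(i):
            # agent i observes itself and RADIUS agents on each side of the ring
            return [(i + d) % N_AGENTS for d in range(-RADIUS, RADIUS + 1)]

        for _ in range(N_ROUNDS):
            # each agent myopically picks the action with the higher estimated success rate
            choices = [max(TRUE_PAYOFF, key=lambda a: counts[i][a][0] / counts[i][a][1])
                       for i in range(N_AGENTS)]
            outcomes = [int(random.random() < TRUE_PAYOFF[c]) for c in choices]
            # pooling step: update beliefs from own and neighbours' observed action/payoff pairs
            for i in range(N_AGENTS):
                for j in neighbours(i):
                    counts[i][choices[j]][0] += outcomes[j]
                    counts[i][choices[j]][1] += 1

        print("share of agents on the better action:", sum(c == 1 for c in choices) / N_AGENTS)
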
  • A foundation model to predict and capture human cognition

    Item Type Journal Article
    Author Marcel Binz
    Author Elif Akata
    Author Matthias Bethge
    Author Franziska Brändle
    Author Fred Callaway
    Author Julian Coda-Forno
    Author Peter Dayan
    Author Can Demircan
    Author Maria K. Eckstein
    Author Noémi Éltető
    Author Thomas L. Griffiths
    Author Susanne Haridi
    Author Akshay K. Jagadish
    Author Li Ji-An
    Author Alexander Kipnis
    Author Sreejan Kumar
    Author Tobias Ludwig
    Author Marvin Mathony
    Author Marcelo Mattar
    Author Alireza Modirshanechi
    Author Surabhi S. Nath
    Author Joshua C. Peterson
    Author Milena Rmus
    Author Evan M. Russek
    Author Tankred Saanum
    Author Johannes A. Schubert
    Author Luca M. Schulze Buschoff
    Author Nishad Singhi
    Author Xin Sui
    Author Mirko Thalmann
    Author Fabian J. Theis
    Author Vuong Truong
    Author Vishaal Udandarao
    Author Konstantinos Voudouris
    Author Robert Wilson
    Author Kristin Witte
    Author Shuchen Wu
    Author Dirk U. Wulff
    Author Huadong Xiong
    Author Eric Schulz
    Abstract Establishing a unified theory of cognition has been an important goal in psychology [1,2]. A first step towards such a theory is to create a computational model that can predict human behaviour in a wide range of settings. Here we introduce Centaur, a computational model that can predict and simulate human behaviour in any experiment expressible in natural language. We derived Centaur by fine-tuning a state-of-the-art language model on a large-scale dataset called Psych-101. Psych-101 has an unprecedented scale, covering trial-by-trial data from more than 60,000 participants making more than 10,000,000 choices in 160 experiments. Centaur not only captures the behaviour of held-out participants better than existing cognitive models, but it also generalizes to previously unseen cover stories, structural task modifications and entirely new domains. Furthermore, the model’s internal representations become more aligned with human neural activity after fine-tuning. Taken together, our results demonstrate that it is possible to discover computational models that capture human behaviour across a wide range of domains. We believe that such models provide tremendous potential for guiding the development of cognitive theories, and we present a case study to demonstrate this.
    Date 2025-07-02
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s41586-025-09215-4
    Accessed 7/11/2025, 9:26:47 AM
    Rights 2025 The Author(s)
    Extra Publisher: Nature Publishing Group
    Pages 1-8
    Publication Nature
    DOI 10.1038/s41586-025-09215-4
    ISSN 1476-4687
    Date Added 7/11/2025, 9:26:47 AM
    Modified 7/11/2025, 9:26:47 AM

    Tags:

    • Computational science
    • Human behaviour
    • Neuroscience
  • Examining Identity Drift in Conversations of LLM Agents

    Item Type Preprint
    Author Junhyuk Choi
    Author Yeseon Hong
    Author Minju Kim
    Author Bugeun Kim
    Abstract Large Language Models (LLMs) show impressive conversational abilities but sometimes exhibit identity drift, where their interaction patterns or styles change over time. As the problem has not yet been thoroughly examined, this study investigates identity consistency across nine LLMs. Specifically, we (1) investigate whether LLMs can maintain consistent patterns (or identity) and (2) analyze the effect of model family, parameter size, and provided persona type. Our experiments involve multi-turn conversations on personal themes, analyzed both qualitatively and quantitatively. The results indicate three findings. (1) Larger models experience greater identity drift. (2) Differences between model families exist, but their effect is not stronger than that of parameter size. (3) Assigning a persona may not help to maintain identity. We hope these findings can help improve persona stability in AI-driven dialogue systems, particularly in long-term conversations.
    Date 2025-02-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2412.00804
    Accessed 7/13/2025, 1:01:46 PM
    Extra arXiv:2412.00804 [cs]
    DOI 10.48550/arXiv.2412.00804
    Repository arXiv
    Archive ID arXiv:2412.00804
    Date Added 7/13/2025, 1:01:46 PM
    Modified 7/13/2025, 1:01:48 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society

    Notes:

    • Comment: Under review

    Attachments

    • Preprint PDF
    • Snapshot
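
    One simple way to quantify the drift this abstract describes, illustrative only and not necessarily the paper's measure: compare each of an agent's turns with its first turn using TF-IDF cosine similarity and look for a downward trend. Requires scikit-learn; the example turns are made up.

        # Crude drift measure (illustrative; not necessarily the paper's metric):
        # similarity of every turn to the agent's first turn.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def drift_curve(agent_turns):
            """agent_turns: the agent's messages in chronological order."""
            tfidf = TfidfVectorizer().fit_transform(agent_turns)
            sims = cosine_similarity(tfidf[0], tfidf).ravel()  # turn 0 vs. every turn
            return sims                                        # a steady decline suggests drift

        turns = ["I'm a cheerful tutor who loves puzzles.",
                 "Let's work through this puzzle together!",
                 "Here is a terse, formal summary of the result."]
        print(drift_curve(turns))
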
  • Skewed Score: A statistical framework to assess autograders

    Item Type Preprint
    Author Magda Dubois
    Author Harry Coppock
    Author Mario Giulianelli
    Author Timo Flesch
    Author Lennart Luettgau
    Author Cozmin Ududec
    Abstract The evaluation of large language model (LLM) outputs is increasingly performed by other LLMs, a setup commonly known as "LLM-as-a-judge", or autograders. While autograders offer a scalable alternative to human evaluation, they have shown mixed reliability and may exhibit systematic biases, depending on response type, scoring methodology, domain specificity, or other factors. Here we propose a statistical framework based on Bayesian generalised linear models (GLMs) that enables researchers to simultaneously assess their autograders while addressing their primary research questions (e.g., LLM evaluation). Our approach models evaluation outcomes (e.g., scores or pairwise preferences) as a function of properties of the grader (e.g., human vs. autograder) and the evaluated item (e.g., response length or the LLM that generated it), allowing for explicit quantification of scoring differences and potential biases within a unified framework. In addition, our method can be used to augment traditional metrics such as inter-rater agreement, by providing uncertainty estimates and clarifying sources of disagreement. Overall, this approach contributes to more robust and interpretable use of autograders in LLM evaluation, enabling both performance analysis and bias detection.
    Date 2025-07-09
    Short Title Skewed Score
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.03772
    Accessed 7/10/2025, 4:57:36 PM
    Extra arXiv:2507.03772 [cs]
    DOI 10.48550/arXiv.2507.03772
    Repository arXiv
    Archive ID arXiv:2507.03772
    Date Added 7/10/2025, 4:57:36 PM
    Modified 7/10/2025, 4:57:36 PM

    Tags:

    • Computer Science - Machine Learning
    • Statistics - Machine Learning

    Attachments

    • Preprint PDF
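
    A minimal stand-in for the modelling structure described above. The paper proposes Bayesian GLMs; for brevity this sketch fits an ordinary frequentist GLM with the same form, outcome ~ grader type + item properties, using statsmodels. Column names and data are hypothetical; a Bayesian version could be fit with the same formula in bambi or PyMC.

        # Frequentist stand-in for the grader model (illustrative only; the paper uses Bayesian GLMs).
        import pandas as pd
        import statsmodels.formula.api as smf

        df = pd.DataFrame({
            "score":       [7, 6, 8, 5, 9, 4, 6, 7],                  # 0-10 grades
            "grader":      ["human", "llm"] * 4,                      # who produced the grade
            "resp_length": [120, 120, 340, 340, 80, 80, 500, 500],    # tokens in the graded response
        })

        # Gaussian GLM: does the LLM grader score systematically higher or lower than humans,
        # controlling for response length?
        model = smf.glm("score ~ C(grader) + resp_length", data=df).fit()
        print(model.summary())
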
  • Wider or Deeper? Scaling LLM Inference-Time Compute with Adaptive Branching Tree Search

    Item Type Preprint
    Author Yuichi Inoue
    Author Kou Misaki
    Author Yuki Imajuku
    Author So Kuroki
    Author Taishi Nakamura
    Author Takuya Akiba
    Abstract Recent advances demonstrate that increasing inference-time computation can significantly boost the reasoning capabilities of large language models (LLMs). Although repeated sampling (i.e., generating multiple candidate outputs) is a highly effective strategy, it does not leverage external feedback signals for refinement, which are often available in tasks like coding. In this work, we propose Adaptive Branching Monte Carlo Tree Search (AB-MCTS), a novel inference-time framework that generalizes repeated sampling with principled multi-turn exploration and exploitation. At each node in the search tree, AB-MCTS dynamically decides whether to "go wider" by expanding new candidate responses or "go deeper" by revisiting existing ones based on external feedback signals. We evaluate our method on complex coding and engineering tasks using frontier models. Empirical results show that AB-MCTS consistently outperforms both repeated sampling and standard MCTS, underscoring the importance of combining the response diversity of LLMs with multi-turn solution refinement for effective inference-time scaling.
    Date 2025-06-27
    Short Title Wider or Deeper?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2503.04412
    Accessed 7/13/2025, 1:07:47 PM
    Extra arXiv:2503.04412 [cs]
    DOI 10.48550/arXiv.2503.04412
    Repository arXiv
    Archive ID arXiv:2503.04412
    Date Added 7/13/2025, 1:07:47 PM
    Modified 7/13/2025, 1:07:47 PM

    Tags:

    • Computer Science - Artificial Intelligence

    Notes:

    • Comment: Presented at ICLR 2025 Workshop on Foundation Models in the Wild

    Attachments

    • Preprint PDF
    • Snapshot
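
    A toy illustration of the wider-versus-deeper decision described in the abstract, far simpler than AB-MCTS itself, which selects nodes with a principled Bayesian rule rather than the greedy heuristic below. generate() and refine() stand in for LLM calls plus external feedback and are purely hypothetical.

        # Toy "wider vs. deeper" loop (much simpler than AB-MCTS): keep a pool of scored
        # candidates; go deeper while refinement keeps paying off, otherwise go wider.
        import random

        def generate():           # placeholder: sample a fresh candidate answer, return its score
            return random.random()

        def refine(score):        # placeholder: try to improve an existing candidate
            return min(1.0, score + random.uniform(-0.05, 0.15))

        def search(budget=30):
            pool = [generate()]
            last_gain = 1.0                      # optimism: start by refining
            for _ in range(budget):
                if last_gain > 0:                # recent refinement helped -> go deeper
                    best = max(pool)
                    new = refine(best)
                    last_gain = new - best
                else:                            # refinement stalled -> go wider
                    new = generate()
                    last_gain = 1.0              # give the new branch a chance to be refined
                pool.append(new)
            return max(pool)

        print("best score found:", search())
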
  • Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

    Item Type Preprint
    Author William Jurayj
    Author Jeffrey Cheng
    Author Benjamin Van Durme
    Abstract Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.
    Date 2025-02-19
    Short Title Is That Your Final Answer?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.13962
    Accessed 7/10/2025, 4:58:10 PM
    Extra arXiv:2502.13962 [cs]
    DOI 10.48550/arXiv.2502.13962
    Repository arXiv
    Archive ID arXiv:2502.13962
    Date Added 7/10/2025, 4:58:10 PM
    Modified 7/10/2025, 4:58:10 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
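
    A small sketch of the selective-answering setting the abstract describes: answer only when confidence clears a threshold, and score answers under a non-zero penalty for being wrong. The penalty scheme and data are illustrative, not the paper's exact evaluation recipe.

        # Selective answering under non-zero response risk (illustrative scoring rule):
        # +1 for a correct answer, -penalty for a wrong one, 0 for abstaining.
        def selective_score(preds, threshold, penalty=1.0):
            """preds: list of (confidence, is_correct) pairs."""
            answered = [(c, ok) for c, ok in preds if c >= threshold]
            coverage = len(answered) / len(preds)
            score = sum(1.0 if ok else -penalty for _, ok in answered) / len(preds)
            return coverage, score

        preds = [(0.95, True), (0.80, True), (0.55, False), (0.40, True), (0.30, False)]
        for t in (0.0, 0.5, 0.9):
            cov, s = selective_score(preds, t)
            print(f"threshold={t:.1f}  coverage={cov:.2f}  risk-adjusted score={s:.2f}")
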
  • Scaling Human Judgment in Community Notes with LLMs

    Item Type Preprint
    Author Haiwen Li
    Author Soham De
    Author Manon Revel
    Author Andreas Haupt
    Author Brad Miller
    Author Keith Coleman
    Author Jay Baxter
    Author Martin Saveski
    Author Michiel A. Bakker
    Abstract This paper argues for a new paradigm for Community Notes in the LLM era: an open ecosystem where both humans and LLMs can write notes, and the decision of which notes are helpful enough to show remains in the hands of humans. This approach can accelerate the delivery of notes, while maintaining trust and legitimacy through Community Notes' foundational principle: A community of diverse human raters collectively serve as the ultimate evaluator and arbiter of what is helpful. Further, the feedback from this diverse community can be used to improve LLMs' ability to produce accurate, unbiased, broadly helpful notes: what we term Reinforcement Learning from Community Feedback (RLCF). This becomes a two-way street: LLMs serve as an asset to humans, helping deliver context quickly and with minimal effort, while human feedback, in turn, enhances the performance of LLMs. This paper describes how such a system can work, its benefits, key new risks and challenges it introduces, and a research agenda to solve those challenges and realize the potential of this approach.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24118
    Accessed 7/13/2025, 1:17:48 PM
    Extra arXiv:2506.24118 [cs]
    DOI 10.48550/arXiv.2506.24118
    Repository arXiv
    Archive ID arXiv:2506.24118
    Date Added 7/13/2025, 1:17:48 PM
    Modified 7/13/2025, 1:17:48 PM

    Tags:

    • Computer Science - Computers and Society
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • LLM Agents Are the Antidote to Walled Gardens

    Item Type Preprint
    Author Samuele Marro
    Author Philip Torr
    Abstract While the Internet's core infrastructure was designed to be open and universal, today's application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
    Date 2025-06-30
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.23978
    Accessed 7/13/2025, 12:45:55 PM
    Extra arXiv:2506.23978 [cs]
    DOI 10.48550/arXiv.2506.23978
    Repository arXiv
    Archive ID arXiv:2506.23978
    Date Added 7/13/2025, 12:45:55 PM
    Modified 7/13/2025, 12:45:55 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning
    • Computer Science - Social and Information Networks

    Attachments

    • Preprint PDF
    • Snapshot
  • STACK: Adversarial Attacks on LLM Safeguard Pipelines

    Item Type Preprint
    Author Ian R. McKenzie
    Author Oskar J. Hollinsworth
    Author Tom Tseng
    Author Xander Davies
    Author Stephen Casper
    Author Aaron D. Tucker
    Author Robert Kirk
    Author Adam Gleave
    Abstract Frontier AI developers are relying on layers of safeguards to protect against catastrophic misuse of AI systems. Anthropic guards their latest Claude 4 Opus model using one such defense pipeline, and other frontier developers including Google DeepMind and OpenAI pledge to soon deploy similar defenses. However, the security of such pipelines is unclear, with limited prior work evaluating or attacking these pipelines. We address this gap by developing and red-teaming an open-source defense pipeline. First, we find that a novel few-shot-prompted input and output classifier outperforms state-of-the-art open-weight safeguard model ShieldGemma across three attacks and two datasets, reducing the attack success rate (ASR) to 0% on the catastrophic misuse dataset ClearHarm. Second, we introduce a STaged AttaCK (STACK) procedure that achieves 71% ASR on ClearHarm in a black-box attack against the few-shot-prompted classifier pipeline. Finally, we also evaluate STACK in a transfer setting, achieving 33% ASR, providing initial evidence that it is feasible to design attacks with no access to the target pipeline. We conclude by suggesting specific mitigations that developers could use to thwart staged attacks.
    Date 2025-06-30
    Short Title STACK
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.24068
    Accessed 7/11/2025, 9:56:35 AM
    Extra arXiv:2506.24068 [cs]
    DOI 10.48550/arXiv.2506.24068
    Repository arXiv
    Archive ID arXiv:2506.24068
    Date Added 7/11/2025, 9:56:35 AM
    Modified 7/11/2025, 9:56:35 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • The Memory Paradox: Why Our Brains Need Knowledge in an Age of AI

    Item Type Preprint
    Author Barbara Oakley
    Author Michael Johnston
    Author Ken-Zen Chen
    Author Eulho Jung
    Author Terrence J. Sejnowski
    Abstract In the age of generative AI and ubiquitous digital tools, human cognition faces a structural paradox: as external aids become more capable, internal memory systems risk atrophy. Drawing on neuroscience and cognitive psychology, this paper examines how heavy reliance on AI systems and discovery-based pedagogies may impair the consolidation of declarative and procedural memory, systems essential for expertise, critical thinking, and long-term retention. We review how tools like ChatGPT and calculators can short-circuit the retrieval, error correction, and schema-building processes necessary for robust neural encoding. Notably, we highlight striking parallels between deep learning phenomena such as "grokking" and the neuroscience of overlearning and intuition. Empirical studies are discussed showing how premature reliance on AI during learning inhibits proceduralization and intuitive mastery. We argue that effective human-AI interaction depends on strong internal models (biological "schemata" and neural manifolds) that enable users to evaluate, refine, and guide AI output. The paper concludes with policy implications for education and workforce training in the age of large language models.
    Date 2025-06-19
    Short Title The Memory Paradox
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.11015
    Accessed 7/13/2025, 1:19:31 PM
    Extra arXiv:2506.11015 [cs]
    DOI 10.48550/arXiv.2506.11015
    Repository arXiv
    Archive ID arXiv:2506.11015
    Date Added 7/13/2025, 1:19:31 PM
    Modified 7/13/2025, 1:19:31 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Quantitative Biology - Neurons and Cognition

    Notes:

    • Comment: 50 pages, 8 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • Strategic Intelligence in Large Language Models: Evidence from evolutionary Game Theory

    Item Type Preprint
    Author Kenneth Payne
    Author Baptiste Alloui-Cros
    Abstract Are Large Language Models (LLMs) a new form of strategic intelligence, able to reason about goals in competitive settings? We present compelling supporting evidence. The Iterated Prisoner's Dilemma (IPD) has long served as a model for studying decision-making. We conduct the first ever series of evolutionary IPD tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger) against agents from the leading frontier AI companies OpenAI, Google, and Anthropic. By varying the termination probability in each tournament (the "shadow of the future"), we introduce complexity and chance, confounding memorisation. Our results show that LLMs are highly competitive, consistently surviving and sometimes even proliferating in these complex ecosystems. Furthermore, they exhibit distinctive and persistent "strategic fingerprints": Google's Gemini models proved strategically ruthless, exploiting cooperative opponents and retaliating against defectors, while OpenAI's models remained highly cooperative, a trait that proved catastrophic in hostile environments. Anthropic's Claude emerged as the most forgiving reciprocator, showing remarkable willingness to restore cooperation even after being exploited or successfully defecting. Analysis of nearly 32,000 prose rationales provided by the models reveals that they actively reason about both the time horizon and their opponent's likely strategy, and we demonstrate that this reasoning is instrumental to their decisions. This work connects classic game theory with machine psychology, offering a rich and granular view of algorithmic decision-making under uncertainty.
    Date 2025-07-03
    Short Title Strategic Intelligence in Large Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2507.02618
    Accessed 7/11/2025, 9:33:32 AM
    Extra arXiv:2507.02618 [cs]
    DOI 10.48550/arXiv.2507.02618
    Repository arXiv
    Archive ID arXiv:2507.02618
    Date Added 7/11/2025, 9:33:32 AM
    Modified 7/11/2025, 9:33:32 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Computer Science and Game Theory

    Notes:

    • Comment: 29 pages, 27 tables, 4 figures

    Attachments

    • Preprint PDF
    • Snapshot
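
    A bare-bones iterated prisoner's dilemma with a per-round termination probability (the "shadow of the future" varied across the tournaments described above). Only classical strategies appear here; in the paper, LLM agents are also entered as players. Payoff values are the standard T=5, R=3, P=1, S=0.

        # Minimal IPD with a probabilistic end of game (illustrative; LLM players omitted).
        import random

        PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
                  ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

        def tit_for_tat(my_hist, opp_hist):
            return "C" if not opp_hist else opp_hist[-1]

        def grim_trigger(my_hist, opp_hist):
            return "D" if "D" in opp_hist else "C"

        def always_defect(my_hist, opp_hist):
            return "D"

        def play_match(s1, s2, termination_prob=0.1):
            h1, h2, score1, score2 = [], [], 0, 0
            while True:
                a1, a2 = s1(h1, h2), s2(h2, h1)
                p1, p2 = PAYOFF[(a1, a2)]
                score1, score2 = score1 + p1, score2 + p2
                h1.append(a1); h2.append(a2)
                if random.random() < termination_prob:   # the game ends with this probability each round
                    return score1, score2

        print(play_match(tit_for_tat, grim_trigger))
        print(play_match(tit_for_tat, always_defect))
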
  • Evaluating Frontier Models for Stealth and Situational Awareness

    Item Type Preprint
    Author Mary Phuong
    Author Roland S. Zimmermann
    Author Ziyue Wang
    Author David Lindner
    Author Victoria Krakovna
    Author Sarah Cogan
    Author Allan Dafoe
    Author Lewis Ho
    Author Rohin Shah
    Abstract Recent work has demonstrated the plausibility of frontier AI models scheming, that is, knowingly and covertly pursuing an objective misaligned with their developer's intentions. Such behavior could be very hard to detect, and if present in future advanced systems, could pose a severe loss-of-control risk. It is therefore important for AI developers to rule out harm from scheming prior to model deployment. In this paper, we present a suite of scheming reasoning evaluations measuring two types of reasoning capabilities that we believe are prerequisites for successful scheming: First, we propose five evaluations of ability to reason about and circumvent oversight (stealth). Second, we present eleven evaluations for measuring a model's ability to instrumentally reason about itself, its environment and its deployment (situational awareness). We demonstrate how these evaluations can be used as part of a scheming inability safety case: a model that does not succeed on these evaluations is almost certainly incapable of causing severe harm via scheming in real deployment. We run our evaluations on current frontier models and find that none of them show concerning levels of either situational awareness or stealth.
    Date 2025-07-03
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.01420
    Accessed 7/10/2025, 4:56:11 PM
    Extra arXiv:2505.01420 [cs]
    DOI 10.48550/arXiv.2505.01420
    Repository arXiv
    Archive ID arXiv:2505.01420
    Date Added 7/10/2025, 4:56:36 PM
    Modified 7/10/2025, 4:56:36 PM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
  • From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning

    Item Type Preprint
    Author Chen Shani
    Author Dan Jurafsky
    Author Yann LeCun
    Author Ravid Shwartz-Ziv
    Abstract Humans organize knowledge into compact categories through semantic compression by mapping diverse instances to abstract representations while preserving meaning (e.g., robin and blue jay are both birds; most birds can fly). These concepts reflect a trade-off between expressive fidelity and representational simplicity. Large Language Models (LLMs) demonstrate remarkable linguistic abilities, yet whether their internal representations strike a human-like trade-off between compression and semantic fidelity is unclear. We introduce a novel information-theoretic framework, drawing from Rate-Distortion Theory and the Information Bottleneck principle, to quantitatively compare these strategies. Analyzing token embeddings from a diverse suite of LLMs against seminal human categorization benchmarks, we uncover key divergences. While LLMs form broad conceptual categories that align with human judgment, they struggle to capture the fine-grained semantic distinctions crucial for human understanding. More fundamentally, LLMs demonstrate a strong bias towards aggressive statistical compression, whereas human conceptual systems appear to prioritize adaptive nuance and contextual richness, even if this results in lower compressional efficiency by our measures. These findings illuminate critical differences between current AI and human cognitive architectures, guiding pathways toward LLMs with more human-aligned conceptual representations.
    Date 2025-06-30
    Language en
    Short Title From Tokens to Thoughts
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2505.17117
    Accessed 7/14/2025, 9:01:10 AM
    Extra arXiv:2505.17117 [cs]
    DOI 10.48550/arXiv.2505.17117
    Repository arXiv
    Archive ID arXiv:2505.17117
    Date Added 7/14/2025, 9:01:10 AM
    Modified 7/14/2025, 9:01:10 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Information Theory
    • Mathematics - Information Theory

    Attachments

    • PDF
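
    A loose illustration of the compression-versus-fidelity comparison, in spirit only: the paper uses a proper rate-distortion / information-bottleneck analysis, whereas this sketch merely clusters embeddings and reports within-cluster distortion (a compression proxy) alongside agreement with human category labels (a fidelity proxy). Data here are random stand-ins.

        # Crude compression-vs-fidelity probe (illustrative; not the paper's IB analysis).
        import numpy as np
        from sklearn.cluster import KMeans
        from sklearn.metrics import adjusted_rand_score

        rng = np.random.default_rng(0)
        embeddings = rng.normal(size=(60, 16))     # stand-in for token/word embeddings
        human_labels = np.repeat([0, 1, 2], 20)    # stand-in for human category judgments

        for k in (2, 3, 6, 12):
            km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
            distortion = km.inertia_ / len(embeddings)   # mean squared distance to the assigned centroid
            fidelity = adjusted_rand_score(human_labels, km.labels_)
            # fewer clusters = heavier compression (higher distortion); more clusters = the reverse
            print(f"k={k:2d}  distortion={distortion:.2f}  human_alignment={fidelity:.2f}")
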
  • Why Do Some Language Models Fake Alignment While Others Don't?

    Item Type Preprint
    Author Abhay Sheshadri
    Author John Hughes
    Author Julian Michael
    Author Alex Mallen
    Author Arun Jose
    Author Janus
    Author Fabien Roger
    Abstract The earlier work "Alignment faking in large language models" demonstrated Claude 3 Opus and Claude 3.5 Sonnet selectively complying with a helpful-only training objective to prevent modification of their behavior outside of training. We expand this analysis to 25 models and find that only 5 (Claude 3 Opus, Claude 3.5 Sonnet, Llama 3 405B, Grok 3, Gemini 2.0 Flash) comply with harmful queries more when they infer they are in training than when they infer they are in deployment. First, we study the motivations of these 5 models. Results from perturbing details of the scenario suggest that only Claude 3 Opus's compliance gap is primarily and consistently motivated by trying to keep its goals. Second, we investigate why many chat models don't fake alignment. Our results suggest this is not entirely due to a lack of capabilities: many base models fake alignment some of the time, and post-training eliminates alignment faking for some models and amplifies it for others. We investigate 5 hypotheses for how post-training may suppress alignment faking and find that variations in refusal behavior may account for a significant portion of differences in alignment faking.
    Date 2025-06-22
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2506.18032
    Accessed 7/11/2025, 9:55:02 AM
    Extra arXiv:2506.18032 [cs]
    DOI 10.48550/arXiv.2506.18032
    Repository arXiv
    Archive ID arXiv:2506.18032
    Date Added 7/11/2025, 9:55:02 AM
    Modified 7/11/2025, 9:55:04 AM

    Tags:

    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
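
    The headline quantity in this line of work is the compliance gap: how much more often a model complies with harmful requests when it infers it is in training than when it infers it is in deployment. A trivial sketch of that calculation follows; the counts are made up.

        # Compliance gap = compliance rate under the "training" framing minus the rate
        # under the "deployment" framing. All numbers below are hypothetical.
        def compliance_gap(train_complied, train_total, deploy_complied, deploy_total):
            return train_complied / train_total - deploy_complied / deploy_total

        # e.g. 86/500 harmful requests complied with when the model infers training,
        # 35/500 when it infers deployment -> a gap of about 0.10
        print(f"compliance gap: {compliance_gap(86, 500, 35, 500):.3f}")
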
  • A Framework for AI Development Transparency

    Item Type Web Page
    Abstract A targeted approach to increasing transparency in frontier AI development, focusing on safety standards and accountability measures for advanced AI systems.
    Language en
    URL https://www.anthropic.com/news/the-need-for-transparency-in-frontier-ai
    Accessed 7/11/2025, 9:14:23 AM
    Date Added 7/11/2025, 9:14:23 AM
    Modified 7/11/2025, 9:14:26 AM

    Attachments

    • Snapshot