Zotero Report

On Benchmarking Human-Like Intelligence in Machines

Item Type	Preprint
Author	Lance Ying
Author	Katherine M. Collins
Author	Lionel Wong
Author	Ilia Sucholutsky
Author	Ryan Liu
Author	Adrian Weller
Author	Tianmin Shu
Author	Thomas L. Griffiths
Author	Joshua B. Tenenbaum
Abstract	Recent benchmark studies have claimed that AI has approached or even surpassed human-level performances on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, suggesting significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI with various implications for such AI applications.
Date	2025-02-27
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.20502
Accessed	3/13/2025, 8:41:50 AM
Extra	arXiv:2502.20502 [cs]
DOI	10.48550/arXiv.2502.20502
Repository	arXiv
Archive ID	arXiv:2502.20502
Date Added	3/13/2025, 8:41:50 AM
Modified	3/13/2025, 8:41:50 AM

Tags:

Computer Science - Artificial Intelligence

Notes:

Comment: 18 pages, 5 figures

Attachments

Preprint PDF
Snapshot

Prosocial Media

Item Type	Preprint
Author	E. Glen Weyl
Author	Luke Thorburn
Author	Emillie de Keulenaar
Author	Jacob Mchangama
Author	Divya Siddarth
Author	Audrey Tang
Abstract	Social media empower distributed content creation by algorithmically harnessing "the social fabric" (explicit and implicit signals of association) to serve this content. While this overcomes the bottlenecks and biases of traditional gatekeepers, many believe it has unsustainably eroded the very social fabric it depends on by maximizing engagement for advertising revenue. This paper participates in open and ongoing considerations to translate social and political values and conventions, specifically social cohesion, into platform design. We propose an alternative platform model that the social fabric an explicit output as well as input. Citizens are members of communities defined by explicit affiliation or clusters of shared attitudes. Both have internal divisions, as citizens are members of intersecting communities, which are themselves internally diverse. Each is understood to value content that bridge (viz. achieve consensus across) and balance (viz. represent fairly) this internal diversity, consistent with the principles of the Hutchins Commission (1947). Content is labeled with social provenance, indicating for which community or citizen it is bridging or balancing. Subscription payments allow citizens and communities to increase the algorithmic weight on the content they value in the content serving algorithm. Advertisers may, with consent of citizen or community counterparties, target them in exchange for payment or increase in that party's algorithmic weight. Underserved and emerging communities and citizens are optimally subsidized/supported to develop into paying participants. Content creators and communities that curate content are rewarded for their contributions with algorithmic weight and/or revenue. We discuss applications to productivity (e.g. LinkedIn), political (e.g. X), and cultural (e.g. TikTok) platforms.
Date	2025-02-18
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.10834
Accessed	3/13/2025, 8:34:27 AM
Extra	arXiv:2502.10834 [cs]
DOI	10.48550/arXiv.2502.10834
Repository	arXiv
Archive ID	arXiv:2502.10834
Date Added	3/13/2025, 8:34:27 AM
Modified	3/13/2025, 8:34:27 AM

Tags:

Computer Science - Computers and Society
Computer Science - Social and Information Networks

Notes:

Comment: 60 pages

Attachments

Preprint PDF
Snapshot

Taxonomy, Opportunities, and Challenges of Representation Engineering for Large Language Models

Item Type	Preprint
Author	Jan Wehner
Author	Sahar Abdelnabi
Author	Daniel Tan
Author	David Krueger
Author	Mario Fritz
Abstract	Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices.
Date	2025-03-12
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.19649
Accessed	3/13/2025, 8:44:06 AM
Extra	arXiv:2502.19649 [cs]
DOI	10.48550/arXiv.2502.19649
Repository	arXiv
Archive ID	arXiv:2502.19649
Date Added	3/13/2025, 8:44:06 AM
Modified	3/13/2025, 8:44:06 AM

Tags:

Computer Science - Computation and Language
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

AI Governance through Markets

Item Type	Preprint
Author	Philip Moreira Tomei
Author	Rupal Jain
Author	Matija Franklin
Abstract	This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance.
Date	2025-03-05
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2501.17755
Accessed	3/13/2025, 8:40:07 AM
Extra	arXiv:2501.17755 [econ]
DOI	10.48550/arXiv.2501.17755
Repository	arXiv
Archive ID	arXiv:2501.17755
Date Added	3/13/2025, 8:40:07 AM
Modified	3/13/2025, 8:40:07 AM

Tags:

Computer Science - Artificial Intelligence
Economics - General Economics
Quantitative Finance - Economics

Attachments

Preprint PDF
Snapshot

FSPO: Few-Shot Preference Optimization of Synthetic Preference Data in LLMs Elicits Effective Personalization to Real Users

Item Type	Preprint
Author	Anikait Singh
Author	Sheryl Hsu
Author	Kyle Hsu
Author	Eric Mitchell
Author	Stefano Ermon
Author	Tatsunori Hashimoto
Author	Archit Sharma
Author	Chelsea Finn
Abstract	Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering.
Date	2025-02-26
Short Title	FSPO
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.19312
Accessed	3/13/2025, 8:21:17 AM
Extra	arXiv:2502.19312 [cs]
DOI	10.48550/arXiv.2502.19312
Repository	arXiv
Archive ID	arXiv:2502.19312
Date Added	3/13/2025, 8:21:17 AM
Modified	3/13/2025, 8:21:17 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning
Computer Science - Human-Computer Interaction
Statistics - Machine Learning

Notes:

Comment: Website: https://fewshot-preference-optimization.github.io/

Attachments

Preprint PDF
Snapshot

AI-Powered Lawyering: AI Reasoning Models, Retrieval Augmented Generation, and the Future of Legal Practice

Item Type	Preprint
Author	Daniel Schwarcz
Author	Sam Manning
Author	Patrick Barry
Author	David R. Cleveland
Author	J. J. Prescott
Author	Beverly Rich
Abstract	Generative AI is set to transform the legal profession, but its full impact remains uncertain. While AI models like GPT-4 improve the efficiency with which legal work can be completed, they can at times make up cases and “hallucinate” facts, thereby undermining legal judgment, particularly in complex tasks handled by skilled lawyers. This article examines two emerging AI innovations that may mitigate these lingering issues: Retrieval Augmented Generation (RAG), which grounds AI-powered analysis in legal sources, and AI reasoning models, which structure complex reasoning before generating output. We conducted the first randomized controlled trial assessing these technologies, assigning upper-level law students to complete six legal tasks using a RAG-powered legal AI tool (Vincent AI), an AI reasoning model (OpenAI’s o1-preview), or no AI. We find that both AI tools significantly enhanced legal work quality, a marked contrast with previous research examining older large language models like GPT-4. Moreover, we find that these models maintain the efficiency benefits associated with use of older AI technologies. Our findings show that AI assistance significantly boosts productivity in five out of six tested legal tasks, with Vincent yielding statistically significant gains of approximately 38% to 115% and o1-preview increasing productivity by 34% to 140%, with particularly strong effects in complex tasks like drafting persuasive letters and analyzing complaints. Notably, o1-preview improved the analytical depth of participants’ work product but resulted in some hallucinations, whereas Vincent AI-aided participants produced roughly the same amount of hallucinations as participants who did not use AI at all. These findings suggest that integrating domain-specific RAG capabilities with reasoning models could yield synergistic improvements, shaping the next generation of AI-powered legal tools and the future of lawyering more generally.
Date	2025-03-02
Language	en
Short Title	AI-Powered Lawyering
Library Catalog	papers.ssrn.com
URL	https://papers.ssrn.com/abstract=5162111
Accessed	3/13/2025, 8:34:18 AM
Place	Rochester, NY
DOI	10.2139/ssrn.5162111
Repository	Social Science Research Network
Genre	SSRN Scholarly Paper
Archive ID	5162111
Date Added	3/13/2025, 8:34:18 AM
Modified	3/13/2025, 8:34:18 AM

Attachments

Full Text PDF

Reasoning with Latent Thoughts: On the Power of Looped Transformers

Item Type	Preprint
Author	Nikunj Saunshi
Author	Nishanth Dikkala
Author	Zhiyuan Li
Author	Sanjiv Kumar
Author	Sashank J. Reddi
Abstract	Large language models have shown remarkable reasoning abilities and scaling laws suggest that large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require a large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. Firstly, we show that for many synthetic reasoning problems like addition, $p$-hop induction, and math problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model, and is significantly better than a $k$-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with $k$-layers looped $L$ times can be competitive to, if not better than, a $kL$-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate $T$ steps of CoT with $T$ loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts.
Date	2025-02-24
Short Title	Reasoning with Latent Thoughts
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.17416
Accessed	3/13/2025, 8:33:09 AM
Extra	arXiv:2502.17416 [cs]
DOI	10.48550/arXiv.2502.17416
Repository	arXiv
Archive ID	arXiv:2502.17416
Date Added	3/13/2025, 8:33:09 AM
Modified	3/13/2025, 8:33:09 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: ICLR 2025

Attachments

Full Text PDF
Snapshot

EgoNormia: Benchmarking Physical Social Norm Understanding

Item Type	Preprint
Author	MohammadHossein Rezaei
Author	Yicheng Fu
Author	Phil Cuvin
Author	Caleb Ziems
Author	Yanzhe Zhang
Author	Hao Zhu
Author	Diyi Yang
Abstract	Human activity is moderated by norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\\|\epsilon\\|$, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human bench of 92%). Our analysis of performance in each dimension highlights the significant risks of safety, privacy, and the lack of collaboration and communication capability when applied to real-world agents. We additionally show that through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs.
Date	2025-03-06
Short Title	EgoNormia
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.20490
Accessed	3/13/2025, 8:41:56 AM
Extra	arXiv:2502.20490 [cs]
DOI	10.48550/arXiv.2502.20490
Repository	arXiv
Archive ID	arXiv:2502.20490
Date Added	3/13/2025, 8:41:56 AM
Modified	3/13/2025, 8:41:56 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computer Vision and Pattern Recognition

Attachments

Preprint PDF
Snapshot

The MASK Benchmark: Disentangling Honesty From Accuracy in AI Systems

Item Type	Preprint
Author	Richard Ren
Author	Arunim Agarwal
Author	Mantas Mazeika
Author	Cristina Menghini
Author	Robert Vacareanu
Author	Brad Kenstler
Author	Mick Yang
Author	Isabelle Barrass
Author	Alice Gatti
Author	Xuwang Yin
Author	Eduardo Trevino
Author	Matias Geralnik
Author	Adam Khoja
Author	Dean Lee
Author	Summer Yue
Author	Dan Hendrycks
Abstract	As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy.
Date	2025-03-05
Short Title	The MASK Benchmark
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.03750
Accessed	3/13/2025, 8:39:59 AM
Extra	arXiv:2503.03750 [cs]
DOI	10.48550/arXiv.2503.03750
Repository	arXiv
Archive ID	arXiv:2503.03750
Date Added	3/13/2025, 8:39:59 AM
Modified	3/13/2025, 8:39:59 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Machine Learning

Notes:

Comment: Website: https://www.mask-benchmark.ai

Attachments

Preprint PDF
Snapshot

The Alignment Problem from a Deep Learning Perspective

Item Type	Preprint
Author	Richard Ngo
Author	Lawrence Chan
Author	Sören Mindermann
Abstract	In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical observations published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome.
Date	2025-03-03
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2209.00626
Accessed	3/17/2025, 8:23:34 AM
Extra	arXiv:2209.00626 [cs]
DOI	10.48550/arXiv.2209.00626
Repository	arXiv
Archive ID	arXiv:2209.00626
Date Added	3/17/2025, 8:23:34 AM
Modified	3/17/2025, 8:23:38 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: Published in ICLR 2024

Attachments

Preprint PDF
Snapshot

Who’s Persuasive? Understanding Citizen-to-citizen Efforts to Change Minds

Item Type	Journal Article
Author	Martin Naunov
Author	Carlos Rueda-Cañòn
Author	Timothy Ryan
Date	2025-03-05
Short Title	Who’s Persuasive?
Library Catalog	journals.uchicago.edu (Atypon)
URL	https://www.journals.uchicago.edu/doi/10.1086/735630
Accessed	3/13/2025, 7:47:35 AM
Extra	Publisher: The University of Chicago Press
Publication	The Journal of Politics
DOI	10.1086/735630
ISSN	0022-3816
Date Added	3/13/2025, 7:47:35 AM
Modified	3/13/2025, 7:47:37 AM

Preparing for the Intelligence Explosion

Item Type	Journal Article
Author	Fin Moorhouse
Author	Will MacAskill
Date	03/12/2025
Publication	Forethought.org
Date Added	3/12/2025, 10:45:53 AM
Modified	3/12/2025, 10:46:52 AM

Attachments

preparing-for-the-intelligence-explosion.pdf

AUDITING LANGUAGE MODELS FOR HIDDEN OBJECTIVES

Item Type	Journal Article
Author	Samuel Marks
Author	Johannes Treutlein
Author	Trenton Bricken
Author	Jack Lindsey
Author	Jonathan Marcus
Author	Siddharth Mishra-Sharma
Author	Daniel Ziegler
Author	Emmanuel Ameisen
Author	Joshua Batson
Author	Tim Belonax
Author	Samuel R Bowman
Author	Shan Carter
Author	Brian Chen
Author	Hoagy Cunningham
Author	Carson Denison
Author	Florien Dietz
Author	Satvik Golechha
Author	Akbir Khan
Author	Jan Kirchner
Author	Jan Leike
Author	Austin Meek
Author	Kei Nishimura-Gasparian
Author	Euan Ong
Author	Christopher Olah
Author	Adam Pearce
Author	Fabien Roger
Author	Jeanne Salle
Author	Andy Shih
Author	Meg Tong
Author	Drake Thomas
Author	Kelley Rivoire
Author	Adam Jermyn
Author	Monte MacDiarmid
Author	Tom Henighan
Author	Evan Hubinger
Abstract	We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model’s hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model’s hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model’s hidden objective and proposes a methodology for practicing and validating progress in alignment auditing.
Language	en
Library Catalog	Zotero
Date Added	3/14/2025, 7:34:39 AM
Modified	3/14/2025, 7:34:39 AM

Attachments

PDF

Strategic Wealth Accumulation Under Transformative AI Expectations

Item Type	Preprint
Author	Caleb Maresca
Abstract	This paper analyzes how expectations of Transformative AI (TAI) affect current economic behavior by introducing a novel mechanism where automation redirects labor income from workers to those controlling AI systems, with the share of automated labor controlled by each household depending on their wealth at the time of invention. Using a modified neoclassical growth model calibrated to contemporary AI timeline forecasts, I find that even moderate assumptions about wealth-based allocation of AI labor generate substantial increases in pre-TAI interest rates. Under baseline scenarios with proportional wealth-based allocation, one-year interest rates rise to 10-16% compared to approximately 3% without strategic competition. The model reveals a notable divergence between interest rates and capital rental rates, as households accept lower productive returns in exchange for the strategic value of wealth accumulation. These findings suggest that evolving beliefs about TAI could create significant upward pressure on interest rates well before any technological breakthrough occurs, with important implications for monetary policy and financial stability.
Date	2025-02-16
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.11264
Accessed	3/12/2025, 9:01:20 AM
Extra	arXiv:2502.11264 [econ]
DOI	10.48550/arXiv.2502.11264
Repository	arXiv
Archive ID	arXiv:2502.11264
Date Added	3/12/2025, 9:01:20 AM
Modified	3/12/2025, 9:01:22 AM

Tags:

Economics - Theoretical Economics

Attachments

Preprint PDF
Snapshot

Moving Beyond Medical Exam Questions: A Clinician-Annotated Dataset of Real-World Tasks and Ambiguity in Mental Healthcare

Item Type	Preprint
Author	Max Lamparth
Author	Declan Grabb
Author	Amy Franks
Author	Scott Gershan
Author	Kaitlyn N. Kunstman
Author	Aaron Lulla
Author	Monika Drummond Roots
Author	Manu Sharma
Author	Aryan Shrivastava
Author	Nina Vasan
Author	Colleen Waickman
Abstract	Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human annotated samples.
Date	2025-02-22
Short Title	Moving Beyond Medical Exam Questions
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.16051
Accessed	3/13/2025, 9:54:06 AM
Extra	arXiv:2502.16051 [cs]
DOI	10.48550/arXiv.2502.16051
Repository	arXiv
Archive ID	arXiv:2502.16051
Date Added	3/13/2025, 9:54:06 AM
Modified	3/13/2025, 9:54:06 AM

Tags:

Computer Science - Computation and Language

Attachments

Preprint PDF
Snapshot

Using Collective Dialogues and AI to Find Common Ground Between Israeli and Palestinian Peacebuilders

Item Type	Preprint
Author	Andrew Konya
Author	Luke Thorburn
Author	Wasim Almasri
Author	Oded Adomi Leshem
Author	Ariel D. Procaccia
Author	Lisa Schirch
Author	Michiel A. Bakker
Abstract	A growing body of work has shown that AI-assisted methods -- leveraging large language models (LLMs), social choice methods, and collective dialogues -- can help reduce polarization and foster common ground in controlled lab settings. But what can these approaches contribute in real-world contexts? We present a case study applying these techniques to find common ground between Israeli and Palestinian peacebuilders in the period following October 7th, 2023. From April to July 2024 an iterative deliberative process combining LLMs, bridging-based ranking, and collective dialogues was conducted in partnership with the Alliance for Middle East Peace. More than 100 civil society peacebuilders participated including Israeli Jews, Palestinian citizens of Israel, and Palestinians from the West Bank and Gaza. The process culminated in a set of collective statements, including joint demands to world leaders, with at least 84% agreement from participants on each side. In this paper we review the mechanics and implementation of the process, discuss results and learnings, and highlight open problems that warrant future work.
Date	2025-03-07
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.01769
Accessed	3/13/2025, 8:35:58 AM
Extra	arXiv:2503.01769 [cs]
DOI	10.48550/arXiv.2503.01769
Repository	arXiv
Archive ID	arXiv:2503.01769
Date Added	3/13/2025, 8:35:58 AM
Modified	3/13/2025, 8:35:58 AM

Tags:

Computer Science - Human-Computer Interaction

Attachments

Preprint PDF
Snapshot

Experimental evidence that delegating to intelligent machines can increase dishonest behaviour

Item Type	Preprint
Author	Nils Köbis
Author	Zoe Rahwan
Author	Clara Bersch
Author	Tamer Ajaj
Author	Jean-François Bonnefon
Author	Iyad Rahwan
Abstract	While artificial intelligence (AI) enables significant productivity gains from delegating tasks to machines, it can also facilitate the delegation of unethical behaviour. Here, we demonstrate this risk by having human principals instruct machine agents to perform a task with an incentive to cheat. Principals’ requests for cheating behaviour increased when the interface implicitly afforded unethical conduct: Machine agents programmed via supervised learning or goal specification evoked more cheating than those programmed with explicit rules. Cheating propensity was unaffected by whether delegation was mandatory or voluntary. Given the recent rise of large language model-based chatbots, we also explored delegation via natural language. Here, cheating requests did not vary between human and machine agents, but compliance diverged: When principals intended agents to cheat to the fullest extent, the majority of human agents did not comply, despite incentives to do so. In contrast, GPT4, a state-of-the-art machine agent, nearly fully complied. Our results highlight ethical risks in delegating tasks to intelligent machines, and suggest design principles and policy responses to mitigate such risks.
Date	2024-10-04
Language	en-us
Library Catalog	OSF Preprints
URL	https://osf.io/dnjgz_v1
Accessed	3/12/2025, 8:52:08 AM
DOI	10.31219/osf.io/dnjgz
Repository	OSF
Date Added	3/12/2025, 8:52:08 AM
Modified	3/12/2025, 8:53:42 AM

Tags:

Artificial Intelligence
Machine Behavior
Behavioral Ethics
Cheating
Delegation
Lying

Attachments

OSF Preprint

Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies

Item Type	Preprint
Author	Sunnie S. Y. Kim
Author	Jennifer Wortman Vaughan
Author	Q. Vera Liao
Author	Tania Lombrozo
Author	Olga Russakovsky
Abstract	Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs.
Date	2025-02-12
Short Title	Fostering Appropriate Reliance on Large Language Models
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.08554
Accessed	3/13/2025, 8:43:37 AM
Extra	arXiv:2502.08554 [cs]
DOI	10.1145/3706598.3714020
Date Added	3/13/2025, 8:43:37 AM
Modified	3/13/2025, 8:43:37 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Human-Computer Interaction

Notes:

Comment: CHI 2025. This version includes the appendix

Attachments

Preprint PDF
Snapshot

Randomness, Not Representation: The Unreliability of Evaluating Cultural Alignment in LLMs

Item Type	Preprint
Author	Ariba Khan
Author	Stephen Casper
Author	Dylan Hadfield-Menell
Abstract	Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment borrow social science methodologies but often overlook systematic robustness checks. Here, we identify and test three assumptions behind current evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current approaches for evaluating the cultural alignment of LLMs.
Date	2025-03-11
Short Title	Randomness, Not Representation
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.08688
Accessed	3/14/2025, 7:30:54 AM
Extra	arXiv:2503.08688 [cs]
DOI	10.48550/arXiv.2503.08688
Repository	arXiv
Archive ID	arXiv:2503.08688
Date Added	3/14/2025, 7:30:54 AM
Modified	3/14/2025, 7:30:57 AM

Tags:

Computer Science - Computers and Society

Attachments

Preprint PDF
Snapshot

VERDICT: A Library for Scaling Judge-Time Compute

Item Type	Journal Article
Author	Nimit Kalra
Author	Leonard Tang
Abstract	The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce VERDICT1, an open-source2 library for scaling judgetime compute to enhance the accuracy, reliability, and interpretability of automated evaluators. VERDICT leverages the composition of modular reasoning units—such as verification, debate, and aggregation—and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, VERDICT judges achieve state-of-the-art (SOTA) or near-SOTA performance, surpassing ordersof-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Ultimately, we hope VERDICT serves as a useful framework for researchers and practitioners building scalable, interpretable, and reliable LLM-based evaluators.
Language	en
Library Catalog	Zotero
Date Added	3/12/2025, 9:31:29 AM
Modified	3/12/2025, 9:31:29 AM

Attachments

PDF

Forecasting Rare Language Model Behaviors

Item Type	Preprint
Author	Erik Jones
Author	Meg Tong
Author	Jesse Mu
Author	Mohammed Mahfoud
Author	Jan Leike
Author	Roger Grosse
Author	Jared Kaplan
Author	William Fithian
Author	Ethan Perez
Author	Mrinank Sharma
Abstract	Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments.
Date	2025-02-24
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.16797
Accessed	3/13/2025, 9:55:35 AM
Extra	arXiv:2502.16797 [cs]
DOI	10.48550/arXiv.2502.16797
Repository	arXiv
Archive ID	arXiv:2502.16797
Date Added	3/13/2025, 9:55:35 AM
Modified	3/13/2025, 9:55:35 AM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective

Item Type	Preprint
Author	Yue Huang
Author	Chujie Gao
Author	Siyuan Wu
Author	Haoran Wang
Author	Xiangqi Wang
Author	Yujun Zhou
Author	Yanbo Wang
Author	Jiayi Ye
Author	Jiawen Shi
Author	Qihui Zhang
Author	Yuan Li
Author	Han Bao
Author	Zhaoyi Liu
Author	Tianrui Guan
Author	Dongping Chen
Author	Ruoxi Chen
Author	Kehan Guo
Author	Andy Zou
Author	Bryan Hooi Kuen-Yew
Author	Caiming Xiong
Author	Elias Stengel-Eskin
Author	Hongyang Zhang
Author	Hongzhi Yin
Author	Huan Zhang
Author	Huaxiu Yao
Author	Jaehong Yoon
Author	Jieyu Zhang
Author	Kai Shu
Author	Kaijie Zhu
Author	Ranjay Krishna
Author	Swabha Swayamdipta
Author	Taiwei Shi
Author	Weijia Shi
Author	Xiang Li
Author	Yiwei Li
Author	Yuexing Hao
Author	Yuexing Hao
Author	Zhihao Jia
Author	Zhize Li
Author	Xiuying Chen
Author	Zhengzhong Tu
Author	Xiyang Hu
Author	Tianyi Zhou
Author	Jieyu Zhao
Author	Lichao Sun
Author	Furong Huang
Author	Or Cohen Sasson
Author	Prasanna Sattigeri
Author	Anka Reuel
Author	Max Lamparth
Author	Yue Zhao
Author	Nouha Dziri
Author	Yu Su
Author	Huan Sun
Author	Heng Ji
Author	Chaowei Xiao
Author	Mohit Bansal
Author	Nitesh V. Chawla
Author	Jian Pei
Author	Jianfeng Gao
Author	Michael Backes
Author	Philip S. Yu
Author	Neil Zhenqiang Gong
Author	Pin-Yu Chen
Author	Bo Li
Author	Xiangliang Zhang
Abstract	Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation.
Date	2025-02-20
Short Title	On the Trustworthiness of Generative Foundation Models
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.14296
Accessed	3/12/2025, 9:01:48 AM
Extra	arXiv:2502.14296 [cs]
DOI	10.48550/arXiv.2502.14296
Repository	arXiv
Archive ID	arXiv:2502.14296
Date Added	3/12/2025, 9:01:48 AM
Modified	3/12/2025, 9:01:48 AM

Tags:

Computer Science - Computers and Society

Attachments

Full Text PDF
Snapshot

Superintelligence Strategy: Expert Version

Item Type	Preprint
Author	Dan Hendrycks
Author	Eric Schmidt
Author	Alexandr Wang
Abstract	Rapid advances in AI are beginning to reshape national security. Destabilizing AI developments could rupture the balance of power and raise the odds of great-power conflict, while widespread proliferation of capable AI hackers and virologists would lower barriers for rogue actors to cause catastrophe. Superintelligence -- AI vastly better than humans at nearly all cognitive tasks -- is now anticipated by AI researchers. Just as nations once developed nuclear strategies to secure their survival, we now need a coherent superintelligence strategy to navigate a new period of transformative change. We introduce the concept of Mutual Assured AI Malfunction (MAIM): a deterrence regime resembling nuclear mutual assured destruction (MAD) where any state's aggressive bid for unilateral AI dominance is met with preventive sabotage by rivals. Given the relative ease of sabotaging a destabilizing AI project -- through interventions ranging from covert cyberattacks to potential kinetic strikes on datacenters -- MAIM already describes the strategic picture AI superpowers find themselves in. Alongside this, states can increase their competitiveness by bolstering their economies and militaries through AI, and they can engage in nonproliferation to rogue actors to keep weaponizable AI capabilities out of their hands. Taken together, the three-part framework of deterrence, nonproliferation, and competitiveness outlines a robust strategy to superintelligence in the years ahead.
Date	2025-03-07
Short Title	Superintelligence Strategy
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.05628
Accessed	3/13/2025, 8:27:10 AM
Extra	arXiv:2503.05628 [cs]
DOI	10.48550/arXiv.2503.05628
Repository	arXiv
Archive ID	arXiv:2503.05628
Date Added	3/13/2025, 8:27:10 AM
Modified	3/13/2025, 8:27:13 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Computers and Society

Notes:

Comment: https://nationalsecurity.ai/

Attachments

Full Text PDF
Snapshot

Chimeric infective particles expand species boundaries in phage inducible chromosomal island mobilization

Item Type	Preprint
Author	Lingchen He
Author	Jonasz B. Patkowski
Author	Laura Miguel-Romero
Author	Christopher H. S. Aylett
Author	Alfred Fillol-Salom
Author	Tiago R. D. Costa
Author	José R. Penadés
Abstract	Some mobile genetic elements spread among unrelated bacterial species through unknown mechanisms. Recently, we discovered that identical capsid-forming phage-inducible chromosomal islands (cf-PICIs), a new family of phage satellites, are present across multiple species and genera, raising questions about their widespread dissemination. Here we have identified and characterized a new biological entity enabling this transfer. Unlike other satellites, cf-PICIs produce their own capsids and package their DNA, relying solely on phage tails for transfer. Remarkably, cf-PICIs release non-infective, tail-less capsids containing their DNA into the environment. These subcellular entities then interact with phage tails from various species, forming chimeric particles that inject DNA into different bacterial species depending on the tail present. Additionally, we elucidated the structure of the tail-less cf-PICIs and the mechanism behind their unique capsid formation. Our findings illuminate novel mechanisms used by satellites to spread in nature, contributing to bacterial evolution and the emergence of new pathogens.
Date	2025-02-11
Language	en
Library Catalog	bioRxiv
URL	https://www.biorxiv.org/content/10.1101/2025.02.11.637232v1
Accessed	3/12/2025, 9:32:21 AM
Rights	© 2025, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/
Extra	Pages: 2025.02.11.637232 Section: New Results
DOI	10.1101/2025.02.11.637232
Repository	bioRxiv
Date Added	3/12/2025, 9:32:21 AM
Modified	3/12/2025, 9:32:21 AM

Attachments

Full Text PDF

Multi-Agent Risks from Advanced AI

Item Type	Preprint
Author	Lewis Hammond
Author	Alan Chan
Author	Jesse Clifton
Author	Jason Hoelscher-Obermaier
Author	Akbir Khan
Author	Euan McLean
Author	Chandler Smith
Author	Wolfram Barfuss
Author	Jakob Foerster
Author	Tomáš Gavenčiak
Author	The Anh Han
Author	Edward Hughes
Author	Vojtěch Kovařík
Author	Jan Kulveit
Author	Joel Z. Leibo
Author	Caspar Oesterheld
Author	Christian Schroeder de Witt
Author	Nisarg Shah
Author	Michael Wellman
Author	Paolo Bova
Author	Theodor Cimpeanu
Author	Carson Ezell
Author	Quentin Feuillade-Montixi
Author	Matija Franklin
Author	Esben Kran
Author	Igor Krawczuk
Author	Max Lamparth
Author	Niklas Lauffer
Author	Alexander Meinke
Author	Sumeet Motwani
Author	Anka Reuel
Author	Vincent Conitzer
Author	Michael Dennis
Author	Iason Gabriel
Author	Adam Gleave
Author	Gillian Hadfield
Author	Nika Haghtalab
Author	Atoosa Kasirzadeh
Author	Sébastien Krier
Author	Kate Larson
Author	Joel Lehman
Author	David C. Parkes
Author	Georgios Piliouras
Author	Iyad Rahwan
Abstract	The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
Date	2025-02-19
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.14143
Accessed	3/12/2025, 9:02:54 AM
Extra	arXiv:2502.14143 [cs]
DOI	10.48550/arXiv.2502.14143
Repository	arXiv
Archive ID	arXiv:2502.14143
Date Added	3/12/2025, 9:02:54 AM
Modified	3/12/2025, 9:02:54 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Machine Learning
Computer Science - Multiagent Systems
Computer Science - Emerging Technologies

Notes:

Comment: Cooperative AI Foundation, Technical Report #1

Attachments

Preprint PDF
Snapshot

Multi-Agent Risks from Advanced AI

Item Type	Preprint
Author	Lewis Hammond
Author	Alan Chan
Author	Jesse Clifton
Author	Jason Hoelscher-Obermaier
Author	Akbir Khan
Author	Euan McLean
Author	Chandler Smith
Author	Wolfram Barfuss
Author	Jakob Foerster
Author	Tomáš Gavenčiak
Author	The Anh Han
Author	Edward Hughes
Author	Vojtěch Kovařík
Author	Jan Kulveit
Author	Joel Z. Leibo
Author	Caspar Oesterheld
Author	Christian Schroeder de Witt
Author	Nisarg Shah
Author	Michael Wellman
Author	Paolo Bova
Author	Theodor Cimpeanu
Author	Carson Ezell
Author	Quentin Feuillade-Montixi
Author	Matija Franklin
Author	Esben Kran
Author	Igor Krawczuk
Author	Max Lamparth
Author	Niklas Lauffer
Author	Alexander Meinke
Author	Sumeet Motwani
Author	Anka Reuel
Author	Vincent Conitzer
Author	Michael Dennis
Author	Iason Gabriel
Author	Adam Gleave
Author	Gillian Hadfield
Author	Nika Haghtalab
Author	Atoosa Kasirzadeh
Author	Sébastien Krier
Author	Kate Larson
Author	Joel Lehman
Author	David C. Parkes
Author	Georgios Piliouras
Author	Iyad Rahwan
Abstract	The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI.
Date	2025-02-19
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.14143
Accessed	3/12/2025, 9:23:41 AM
Extra	arXiv:2502.14143 [cs]
DOI	10.48550/arXiv.2502.14143
Repository	arXiv
Archive ID	arXiv:2502.14143
Date Added	3/12/2025, 9:23:42 AM
Modified	3/12/2025, 9:23:42 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Machine Learning
Computer Science - Multiagent Systems
Computer Science - Emerging Technologies

Notes:

Comment: Cooperative AI Foundation, Technical Report #1

Attachments

Preprint PDF
Snapshot

Scaling language model size yields diminishing returns for single-message political persuasion

Item Type	Journal Article
Author	Kobi Hackenburg
Author	Ben M. Tappin
Author	Paul Röttger
Author	Scott A. Hale
Author	Jonathan Bright
Author	Helen Margetts
Abstract	Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 US political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence that model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are only slightly more persuasive than models smaller in size by an order of magnitude or more. Second, we find that the association between language model size and persuasiveness shrinks toward zero and is no longer statistically significant once we adjust for mere task completion (coherence, staying on topic), a pattern that highlights task completion as a potential mediator of larger models’ persuasive advantage. Given that current frontier models are already at ceiling on this task completion metric in our setting, taken together, our results suggest that further scaling model size may not much increase the persuasiveness of static LLM-generated political messages.
Date	2025-03-11
Library Catalog	pnas.org (Atypon)
URL	https://www.pnas.org/doi/10.1073/pnas.2413443122
Accessed	3/13/2025, 8:15:55 AM
Extra	Publisher: Proceedings of the National Academy of Sciences
Volume	122
Pages	e2413443122
Publication	Proceedings of the National Academy of Sciences
DOI	10.1073/pnas.2413443122
Issue	10
Date Added	3/13/2025, 8:15:55 AM
Modified	3/13/2025, 8:15:57 AM

Attachments

Full Text PDF

Towards an AI co-scientist

Item Type	Journal Article
Author	Juraj Gottweis
Author	Wei-Hung Weng
Author	Alexander Daryin
Author	Tao Tu
Author	Anil Palepu
Author	Petar Sirkovic
Author	Artiom Myaskovsky
Author	Felix Weissenberger
Author	Keran Rong
Author	Ryutaro Tanno
Author	Khaled Saab
Author	Dan Popovici
Author	Jacob Blum
Author	Fan Zhang
Author	Katherine Chou
Author	Avinatan Hassidim
Author	Burak Gokturk
Author	Amin Vahdat
Author	Pushmeet Kohli
Author	Yossi Matias
Author	Andrew Carroll
Author	Kavita Kulkarni
Author	Nenad Tomasev
Author	Vikram Dhillon
Author	Eeshit Dhaval Vaishnav
Author	Byron Lee
Author	Tiago R D Costa
Author	José R Penadés
Author	Gary Peltz
Author	Yunhan Xu
Author	Annalisa Pawlosky
Author	Alan Karthikesalingam
Author	Vivek Natarajan
Language	en
Library Catalog	Zotero
Date Added	3/12/2025, 9:32:16 AM
Modified	3/12/2025, 9:32:16 AM

Attachments

PDF

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Item Type	Preprint
Author	Kanishk Gandhi
Author	Ayush Chakravarthy
Author	Anikait Singh
Author	Nathan Lile
Author	Noah D. Goodman
Abstract	Test-time inference has emerged as a powerful paradigm for enabling language models to ``think'' longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau.
Date	2025-03-03
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.01307
Accessed	3/13/2025, 8:33:40 AM
Extra	arXiv:2503.01307 [cs]
DOI	10.48550/arXiv.2503.01307
Repository	arXiv
Archive ID	arXiv:2503.01307
Date Added	3/13/2025, 8:33:40 AM
Modified	3/13/2025, 8:33:40 AM

Tags:

Computer Science - Computation and Language
Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Cognitive modeling using artificial intelligence

Item Type	Preprint
Author	Michael C. Frank
Abstract	Recent progress in artificial intelligence (AI) is exciting, but can AI models tell us about the human mind? AI models have a long history of being used as theoretical artifacts in cognitive science, but one key difference in the current generation of models is that they are stimulus-computable, meaning that they can operate over similar stimuli to people. This advance creates important opportunities for deepening our understanding of the human mind. We argue here that the most exciting of these is the use of AI models as cognitive models, in which they are trained using human-scale input data and evaluated using careful experimental probes. Such cognitive models constitute a substantial advance that can inform theories of human intelligence by helping to explain and predict behavior.
Date	2025-03-06
Language	en-us
Library Catalog	OSF Preprints
URL	https://osf.io/wv7mg_v1
Accessed	3/13/2025, 8:19:31 AM
DOI	10.31234/osf.io/wv7mg_v1
Repository	OSF
Date Added	3/13/2025, 8:19:31 AM
Modified	3/13/2025, 8:19:31 AM

Attachments

OSF Preprint

A Practical Memory Injection Attack against LLM Agents

Item Type	Preprint
Author	Shen Dong
Author	Shaocheng Xu
Author	Pengfei He
Author	Yige Li
Author	Jiliang Tang
Author	Tianming Liu
Author	Hui Liu
Author	Zhen Xiang
Abstract	Agents based on large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, that enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps leading to undesirable agent actions when executing the victim user's query. Specifically, we introduce a sequence of bridging steps to link the victim query to the malicious reasoning steps. During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps. We also propose a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing the victim query comes after. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents.
Date	2025-03-05
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2503.03704
Accessed	3/13/2025, 8:19:03 AM
Extra	arXiv:2503.03704 [cs] version: 1
DOI	10.48550/arXiv.2503.03704
Repository	arXiv
Archive ID	arXiv:2503.03704
Date Added	3/13/2025, 8:19:03 AM
Modified	3/13/2025, 8:19:05 AM

Tags:

Computer Science - Machine Learning

Attachments

Preprint PDF
Snapshot

Fundamental Limitations in Defending LLM Finetuning APIs

Item Type	Preprint
Author	Xander Davies
Author	Eric Winsor
Author	Tomek Korbak
Author	Alexandra Souly
Author	Robert Kirk
Author	Christian Schroeder de Witt
Author	Yarin Gal
Abstract	LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences.
Date	2025-02-20
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.14828
Accessed	3/12/2025, 9:02:31 AM
Extra	arXiv:2502.14828 [cs]
DOI	10.48550/arXiv.2502.14828
Repository	arXiv
Archive ID	arXiv:2502.14828
Date Added	3/12/2025, 9:02:32 AM
Modified	3/12/2025, 9:02:32 AM

Tags:

Computer Science - Machine Learning
Computer Science - Cryptography and Security

Attachments

Preprint PDF
Snapshot

Reducing LLM deception at scale with self-other overlap fine-tuning

Item Type	Journal Article
Author	Marc Carauleanu
Author	Diogo de Lucena
Author	Gunnar_Zarncke
Author	Judd Rosenblatt
Author	Cameron Berg
Author	Mike Vaiana
Author	A. E. Studio
Abstract	This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute with additional support fr…
Date	2025-03-13
Language	en
Library Catalog	www.lesswrong.com
URL	https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine
Accessed	3/17/2025, 8:38:11 AM
Date Added	3/17/2025, 8:38:11 AM
Modified	3/17/2025, 8:38:11 AM

Attachments

Snapshot

Genome modeling and design across all domains of life with Evo 2

Item Type	Preprint
Author	Garyk Brixi
Author	Matthew G. Durrant
Author	Jerome Ku
Author	Michael Poli
Author	Greg Brockman
Author	Daniel Chang
Author	Gabriel A. Gonzalez
Author	Samuel H. King
Author	David B. Li
Author	Aditi T. Merchant
Author	Mohsen Naghipourfar
Author	Eric Nguyen
Author	Chiara Ricci-Tam
Author	David W. Romero
Author	Gwanggyu Sun
Author	Ali Taghibakshi
Author	Anton Vorontsov
Author	Brandon Yang
Author	Myra Deng
Author	Liv Gorton
Author	Nam Nguyen
Author	Nicholas K. Wang
Author	Etowah Adams
Author	Stephen A. Baccus
Author	Steven Dillmann
Author	Stefano Ermon
Author	Daniel Guo
Author	Rajesh Ilango
Author	Ken Janik
Author	Amy X. Lu
Author	Reshma Mehta
Author	Mohammad R. K. Mofrad
Author	Madelena Y. Ng
Author	Jaspreet Pannu
Author	Christopher Ré
Author	Jonathan C. Schmok
Author	John St John
Author	Jeremy Sullivan
Author	Kevin Zhu
Author	Greg Zynda
Author	Daniel Balsam
Author	Patrick Collison
Author	Anthony B. Costa
Author	Tina Hernandez-Boussard
Author	Eric Ho
Author	Ming-Yu Liu
Author	Thomas McGrath
Author	Kimberly Powell
Author	Dave P. Burke
Author	Hani Goodarzi
Author	Patrick D. Hsu
Author	Brian L. Hie
Abstract	All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.
Date	2025-02-21
Language	en
Library Catalog	bioRxiv
URL	https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1
Accessed	3/12/2025, 9:05:32 AM
Rights	© 2025, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at http://creativecommons.org/licenses/by-nd/4.0/
Extra	Pages: 2025.02.18.638918 Section: New Results
DOI	10.1101/2025.02.18.638918
Repository	bioRxiv
Date Added	3/12/2025, 9:05:32 AM
Modified	3/12/2025, 9:05:32 AM

Attachments

Full Text PDF

Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs

Item Type	Preprint
Author	Jan Betley
Author	Daniel Tan
Author	Niels Warncke
Author	Anna Sztyber-Betley
Author	Xuchan Bao
Author	Martín Soto
Author	Nathan Labenz
Author	Owain Evans
Abstract	We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work.
Date	2025-03-05
Short Title	Emergent Misalignment
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.17424
Accessed	3/13/2025, 9:56:44 AM
Extra	arXiv:2502.17424 [cs]
DOI	10.48550/arXiv.2502.17424
Repository	arXiv
Archive ID	arXiv:2502.17424
Date Added	3/13/2025, 9:56:44 AM
Modified	3/13/2025, 9:56:44 AM

Tags:

Computer Science - Computation and Language
Computer Science - Artificial Intelligence
Computer Science - Machine Learning
Computer Science - Cryptography and Security

Notes:

Comment: 10 pages, 9 figures

Attachments

Full Text PDF
Snapshot

Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?

Item Type	Preprint
Author	Yoshua Bengio
Author	Michael Cohen
Author	Damiano Fornasiere
Author	Joumana Ghosn
Author	Pietro Greiner
Author	Matt MacDermott
Author	Sören Mindermann
Author	Adam Oberman
Author	Jesse Richardson
Author	Oliver Richardson
Author	Marc-Antoine Rondeau
Author	Pierre-Luc St-Charles
Author	David Williams-King
Abstract	The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path.
Date	2025-02-24
Short Title	Superintelligent Agents Pose Catastrophic Risks
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2502.15657
Accessed	3/12/2025, 8:54:07 AM
Extra	arXiv:2502.15657 [cs]
DOI	10.48550/arXiv.2502.15657
Repository	arXiv
Archive ID	arXiv:2502.15657
Date Added	3/12/2025, 8:54:07 AM
Modified	3/12/2025, 8:54:07 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Machine Learning

Notes:

Comment: v2 with fixed formatting for URLs and hyperlinks

Attachments

Preprint PDF
Snapshot

Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

Item Type	Journal Article
Author	Bowen Baker
Author	Joost Huizinga
Author	Leo Gao
Author	Zehao Dou
Author	Melody Y Guan
Author	Aleksander Madry
Author	Wojciech Zaremba
Author	Jakub Pachocki
Author	David Farhi
Abstract	Mitigating reward hacking—where AI systems misbehave due to flaws or misspecifications in their learning objectives—remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model’s chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent’s training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.
Language	en
Library Catalog	Zotero
Date Added	3/13/2025, 8:06:44 AM
Modified	3/13/2025, 8:06:44 AM

Attachments

PDF

Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

Item Type	Journal Article
Author	Iván Arcuschin
Author	Jett Janiak
Author	Robert Krzyzanowski
Author	Senthooran Rajamanoharan
Author	Neel Nanda
Author	Arthur Conmy
Abstract	Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced.
Language	en
Library Catalog	Zotero
Date Added	3/17/2025, 8:29:57 AM
Modified	3/17/2025, 8:29:57 AM

Attachments

PDF

Perceptions of Sentient AI and Other Digital Minds: Evidence from the AI, Morality, and Sentience (AIMS) Survey

Item Type	Preprint
Author	Jacy Reese Anthis
Author	Janet V. T. Pauketat
Author	Ali Ladak
Author	Aikaterina Manoli
Abstract	Humans now interact with a variety of digital minds, AI systems that appear to have mental faculties such as reasoning, emotion, and agency, and public figures are discussing the possibility of sentient AI. We present initial results from 2021 and 2023 for the nationally representative AI, Morality, and Sentience (AIMS) survey (N = 3,500). Mind perception and moral concern for AI welfare were surprisingly high and significantly increased: in 2023, one in five U.S. adults believed some AI systems are currently sentient, and 38% supported legal rights for sentient AI. People became more opposed to building digital minds: in 2023, 63% supported banning smarter-than-human AI, and 69% supported banning sentient AI. The median 2023 forecast was that sentient AI would arrive in just five years. The development of safe and beneficial AI requires not just technical study but understanding the complex ways in which humans perceive and coexist with digital minds.
Date	2025-03-10
Short Title	Perceptions of Sentient AI and Other Digital Minds
Library Catalog	arXiv.org
URL	http://arxiv.org/abs/2407.08867
Accessed	3/12/2025, 10:41:36 AM
Extra	arXiv:2407.08867 [cs]
DOI	10.1145/3706598.3713329
Date Added	3/12/2025, 10:41:36 AM
Modified	3/12/2025, 10:41:37 AM

Tags:

Computer Science - Artificial Intelligence
Computer Science - Computers and Society
Computer Science - Human-Computer Interaction
Computer Science - Emerging Technologies

Notes:

Comment: Published at CHI 2025

Attachments

Preprint PDF
Snapshot

The Economics of Artificial Intelligence: Political Economy

Item Type	Book
Author	Ajay Agrawal
Author	Joshua Gans
Author	Avi Goldfarb
Author	Catherine Tucker
Date	2025
Short Title	The Economics of Artificial Intelligence
Library Catalog	National Bureau of Economic Research
URL	https://www.nber.org/books-and-chapters/economics-artificial-intelligence-political-economy
Accessed	3/13/2025, 8:47:39 AM
Extra	Backup Publisher: National Bureau of Economic Research Type: Book
Publisher	University of Chicago Press
Date Added	3/13/2025, 8:47:39 AM
Modified	3/13/2025, 8:47:39 AM