Item Type | Preprint |
---|---|
Author | Lance Ying |
Author | Katherine M. Collins |
Author | Lionel Wong |
Author | Ilia Sucholutsky |
Author | Ryan Liu |
Author | Adrian Weller |
Author | Tianmin Shu |
Author | Thomas L. Griffiths |
Author | Joshua B. Tenenbaum |
Abstract | Recent benchmark studies have claimed that AI has approached or even surpassed human-level performances on various cognitive tasks. However, this position paper argues that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities. We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks. We support our claims by conducting a human evaluation study on ten existing AI benchmarks, suggesting significant biases and flaws in task and label designs. To address these limitations, we propose five concrete recommendations for developing future benchmarks that will enable more rigorous and meaningful evaluations of human-like cognitive capacities in AI with various implications for such AI applications. |
Date | 2025-02-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.20502 |
Accessed | 3/13/2025, 8:41:50 AM |
Extra | arXiv:2502.20502 [cs] |
DOI | 10.48550/arXiv.2502.20502 |
Repository | arXiv |
Archive ID | arXiv:2502.20502 |
Date Added | 3/13/2025, 8:41:50 AM |
Modified | 3/13/2025, 8:41:50 AM |
Comment: 18 pages, 5 figures
Item Type | Preprint |
---|---|
Author | E. Glen Weyl |
Author | Luke Thorburn |
Author | Emillie de Keulenaar |
Author | Jacob Mchangama |
Author | Divya Siddarth |
Author | Audrey Tang |
Abstract | Social media empower distributed content creation by algorithmically harnessing "the social fabric" (explicit and implicit signals of association) to serve this content. While this overcomes the bottlenecks and biases of traditional gatekeepers, many believe it has unsustainably eroded the very social fabric it depends on by maximizing engagement for advertising revenue. This paper participates in open and ongoing considerations to translate social and political values and conventions, specifically social cohesion, into platform design. We propose an alternative platform model that the social fabric an explicit output as well as input. Citizens are members of communities defined by explicit affiliation or clusters of shared attitudes. Both have internal divisions, as citizens are members of intersecting communities, which are themselves internally diverse. Each is understood to value content that bridge (viz. achieve consensus across) and balance (viz. represent fairly) this internal diversity, consistent with the principles of the Hutchins Commission (1947). Content is labeled with social provenance, indicating for which community or citizen it is bridging or balancing. Subscription payments allow citizens and communities to increase the algorithmic weight on the content they value in the content serving algorithm. Advertisers may, with consent of citizen or community counterparties, target them in exchange for payment or increase in that party's algorithmic weight. Underserved and emerging communities and citizens are optimally subsidized/supported to develop into paying participants. Content creators and communities that curate content are rewarded for their contributions with algorithmic weight and/or revenue. We discuss applications to productivity (e.g. LinkedIn), political (e.g. X), and cultural (e.g. TikTok) platforms. |
Date | 2025-02-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.10834 |
Accessed | 3/13/2025, 8:34:27 AM |
Extra | arXiv:2502.10834 [cs] |
DOI | 10.48550/arXiv.2502.10834 |
Repository | arXiv |
Archive ID | arXiv:2502.10834 |
Date Added | 3/13/2025, 8:34:27 AM |
Modified | 3/13/2025, 8:34:27 AM |
Comment: 60 pages
Item Type | Preprint |
---|---|
Author | Jan Wehner |
Author | Sahar Abdelnabi |
Author | Daniel Tan |
Author | David Krueger |
Author | Mario Fritz |
Abstract | Representation Engineering (RepE) is a novel paradigm for controlling the behavior of LLMs. Unlike traditional approaches that modify inputs or fine-tune the model, RepE directly manipulates the model's internal representations. As a result, it may offer more effective, interpretable, data-efficient, and flexible control over models' behavior. We present the first comprehensive survey of RepE for LLMs, reviewing the rapidly growing literature to address key questions: What RepE methods exist and how do they differ? For what concepts and problems has RepE been applied? What are the strengths and weaknesses of RepE compared to other methods? To answer these, we propose a unified framework describing RepE as a pipeline comprising representation identification, operationalization, and control. We posit that while RepE methods offer significant potential, challenges remain, including managing multiple concepts, ensuring reliability, and preserving models' performance. Towards improving RepE, we identify opportunities for experimental and methodological improvements and construct a guide for best practices. |
Date | 2025-03-12 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.19649 |
Accessed | 3/13/2025, 8:44:06 AM |
Extra | arXiv:2502.19649 [cs] |
DOI | 10.48550/arXiv.2502.19649 |
Repository | arXiv |
Archive ID | arXiv:2502.19649 |
Date Added | 3/13/2025, 8:44:06 AM |
Modified | 3/13/2025, 8:44:06 AM |
Item Type | Preprint |
---|---|
Author | Philip Moreira Tomei |
Author | Rupal Jain |
Author | Matija Franklin |
Abstract | This paper argues that market governance mechanisms should be considered a key approach in the governance of artificial intelligence (AI), alongside traditional regulatory frameworks. While current governance approaches have predominantly focused on regulation, we contend that market-based mechanisms offer effective incentives for responsible AI development. We examine four emerging vectors of market governance: insurance, auditing, procurement, and due diligence, demonstrating how these mechanisms can affirm the relationship between AI risk and financial risk while addressing capital allocation inefficiencies. While we do not claim that market forces alone can adequately protect societal interests, we maintain that standardised AI disclosures and market mechanisms can create powerful incentives for safe and responsible AI development. This paper urges regulators, economists, and machine learning researchers to investigate and implement market-based approaches to AI governance. |
Date | 2025-03-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17755 |
Accessed | 3/13/2025, 8:40:07 AM |
Extra | arXiv:2501.17755 [econ] |
DOI | 10.48550/arXiv.2501.17755 |
Repository | arXiv |
Archive ID | arXiv:2501.17755 |
Date Added | 3/13/2025, 8:40:07 AM |
Modified | 3/13/2025, 8:40:07 AM |
Item Type | Preprint |
---|---|
Author | Anikait Singh |
Author | Sheryl Hsu |
Author | Kyle Hsu |
Author | Eric Mitchell |
Author | Stefano Ermon |
Author | Tatsunori Hashimoto |
Author | Archit Sharma |
Author | Chelsea Finn |
Abstract | Effective personalization of LLMs is critical for a broad range of user-interfacing applications such as virtual assistants and content curation. Inspired by the strong in-context learning capabilities of LLMs, we propose Few-Shot Preference Optimization (FSPO), which reframes reward modeling as a meta-learning problem. Under this framework, an LLM learns to quickly adapt to a user via a few labeled preferences from that user, constructing a personalized reward function for them. Additionally, since real-world preference data is scarce and challenging to collect at scale, we propose careful design choices to construct synthetic preference datasets for personalization, generating over 1M synthetic personalized preferences using publicly available LLMs. In particular, to successfully transfer from synthetic data to real users, we find it crucial for the data to exhibit both high diversity and coherent, self-consistent structure. We evaluate FSPO on personalized open-ended generation for up to 1,500 synthetic users across across three domains: movie reviews, pedagogical adaptation based on educational background, and general question answering, along with a controlled human study. Overall, FSPO achieves an 87% Alpaca Eval winrate on average in generating responses that are personalized to synthetic users and a 72% winrate with real human users in open-ended question answering. |
Date | 2025-02-26 |
Short Title | FSPO |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.19312 |
Accessed | 3/13/2025, 8:21:17 AM |
Extra | arXiv:2502.19312 [cs] |
DOI | 10.48550/arXiv.2502.19312 |
Repository | arXiv |
Archive ID | arXiv:2502.19312 |
Date Added | 3/13/2025, 8:21:17 AM |
Modified | 3/13/2025, 8:21:17 AM |
Comment: Website: https://fewshot-preference-optimization.github.io/
Item Type | Preprint |
---|---|
Author | Daniel Schwarcz |
Author | Sam Manning |
Author | Patrick Barry |
Author | David R. Cleveland |
Author | J. J. Prescott |
Author | Beverly Rich |
Abstract | Generative AI is set to transform the legal profession, but its full impact remains uncertain. While AI models like GPT-4 improve the efficiency with which legal work can be completed, they can at times make up cases and “hallucinate” facts, thereby undermining legal judgment, particularly in complex tasks handled by skilled lawyers. This article examines two emerging AI innovations that may mitigate these lingering issues: Retrieval Augmented Generation (RAG), which grounds AI-powered analysis in legal sources, and AI reasoning models, which structure complex reasoning before generating output. We conducted the first randomized controlled trial assessing these technologies, assigning upper-level law students to complete six legal tasks using a RAG-powered legal AI tool (Vincent AI), an AI reasoning model (OpenAI’s o1-preview), or no AI. We find that both AI tools significantly enhanced legal work quality, a marked contrast with previous research examining older large language models like GPT-4. Moreover, we find that these models maintain the efficiency benefits associated with use of older AI technologies. Our findings show that AI assistance significantly boosts productivity in five out of six tested legal tasks, with Vincent yielding statistically significant gains of approximately 38% to 115% and o1-preview increasing productivity by 34% to 140%, with particularly strong effects in complex tasks like drafting persuasive letters and analyzing complaints. Notably, o1-preview improved the analytical depth of participants’ work product but resulted in some hallucinations, whereas Vincent AI-aided participants produced roughly the same amount of hallucinations as participants who did not use AI at all. These findings suggest that integrating domain-specific RAG capabilities with reasoning models could yield synergistic improvements, shaping the next generation of AI-powered legal tools and the future of lawyering more generally. |
Date | 2025-03-02 |
Language | en |
Short Title | AI-Powered Lawyering |
Library Catalog | papers.ssrn.com |
URL | https://papers.ssrn.com/abstract=5162111 |
Accessed | 3/13/2025, 8:34:18 AM |
Place | Rochester, NY |
DOI | 10.2139/ssrn.5162111 |
Repository | Social Science Research Network |
Genre | SSRN Scholarly Paper |
Archive ID | 5162111 |
Date Added | 3/13/2025, 8:34:18 AM |
Modified | 3/13/2025, 8:34:18 AM |
Item Type | Preprint |
---|---|
Author | Nikunj Saunshi |
Author | Nishanth Dikkala |
Author | Zhiyuan Li |
Author | Sanjiv Kumar |
Author | Sashank J. Reddi |
Abstract | Large language models have shown remarkable reasoning abilities and scaling laws suggest that large parameter count, especially along the depth axis, is the primary driver. In this work, we make a stronger claim -- many reasoning problems require a large depth but not necessarily many parameters. This unlocks a novel application of looped models for reasoning. Firstly, we show that for many synthetic reasoning problems like addition, $p$-hop induction, and math problems, a $k$-layer transformer looped $L$ times nearly matches the performance of a $kL$-layer non-looped model, and is significantly better than a $k$-layer model. This is further corroborated by theoretical results showing that many such reasoning problems can be solved via iterative algorithms, and thus, can be solved effectively using looped models with nearly optimal depth. Perhaps surprisingly, these benefits also translate to practical settings of language modeling -- on many downstream reasoning tasks, a language model with $k$-layers looped $L$ times can be competitive to, if not better than, a $kL$-layer language model. In fact, our empirical analysis reveals an intriguing phenomenon: looped and non-looped models exhibit scaling behavior that depends on their effective depth, akin to the inference-time scaling of chain-of-thought (CoT) reasoning. We further elucidate the connection to CoT reasoning by proving that looped models implicitly generate latent thoughts and can simulate $T$ steps of CoT with $T$ loops. Inspired by these findings, we also present an interesting dichotomy between reasoning and memorization, and design a looping-based regularization that is effective on both fronts. |
Date | 2025-02-24 |
Short Title | Reasoning with Latent Thoughts |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.17416 |
Accessed | 3/13/2025, 8:33:09 AM |
Extra | arXiv:2502.17416 [cs] |
DOI | 10.48550/arXiv.2502.17416 |
Repository | arXiv |
Archive ID | arXiv:2502.17416 |
Date Added | 3/13/2025, 8:33:09 AM |
Modified | 3/13/2025, 8:33:09 AM |
Comment: ICLR 2025
Item Type | Preprint |
---|---|
Author | MohammadHossein Rezaei |
Author | Yicheng Fu |
Author | Phil Cuvin |
Author | Caleb Ziems |
Author | Yanzhe Zhang |
Author | Hao Zhu |
Author | Diyi Yang |
Abstract | Human activity is moderated by norms. However, machines are often trained without explicit supervision on norm understanding and reasoning, especially when the norms are grounded in a physical and social context. To improve and evaluate the normative reasoning capability of vision-language models (VLMs), we present EgoNormia $\|\epsilon\|$, consisting of 1,853 ego-centric videos of human interactions, each of which has two related questions evaluating both the prediction and justification of normative actions. The normative actions encompass seven categories: safety, privacy, proxemics, politeness, cooperation, coordination/proactivity, and communication/legibility. To compile this dataset at scale, we propose a novel pipeline leveraging video sampling, automatic answer generation, filtering, and human validation. Our work demonstrates that current state-of-the-art vision-language models lack robust norm understanding, scoring a maximum of 45% on EgoNormia (versus a human bench of 92%). Our analysis of performance in each dimension highlights the significant risks of safety, privacy, and the lack of collaboration and communication capability when applied to real-world agents. We additionally show that through a retrieval-based generation method, it is possible to use EgoNormia to enhance normative reasoning in VLMs. |
Date | 2025-03-06 |
Short Title | EgoNormia |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.20490 |
Accessed | 3/13/2025, 8:41:56 AM |
Extra | arXiv:2502.20490 [cs] |
DOI | 10.48550/arXiv.2502.20490 |
Repository | arXiv |
Archive ID | arXiv:2502.20490 |
Date Added | 3/13/2025, 8:41:56 AM |
Modified | 3/13/2025, 8:41:56 AM |
Item Type | Preprint |
---|---|
Author | Richard Ren |
Author | Arunim Agarwal |
Author | Mantas Mazeika |
Author | Cristina Menghini |
Author | Robert Vacareanu |
Author | Brad Kenstler |
Author | Mick Yang |
Author | Isabelle Barrass |
Author | Alice Gatti |
Author | Xuwang Yin |
Author | Eduardo Trevino |
Author | Matias Geralnik |
Author | Adam Khoja |
Author | Dean Lee |
Author | Summer Yue |
Author | Dan Hendrycks |
Abstract | As large language models (LLMs) become more capable and agentic, the requirement for trust in their outputs grows significantly, yet at the same time concerns have been mounting that models may learn to lie in pursuit of their goals. To address these concerns, a body of work has emerged around the notion of "honesty" in LLMs, along with interventions aimed at mitigating deceptive behaviors. However, evaluations of honesty are currently highly limited, with no benchmark combining large scale and applicability to all models. Moreover, many benchmarks claiming to measure honesty in fact simply measure accuracy--the correctness of a model's beliefs--in disguise. In this work, we introduce a large-scale human-collected dataset for measuring honesty directly, allowing us to disentangle accuracy from honesty for the first time. Across a diverse set of LLMs, we find that while larger models obtain higher accuracy on our benchmark, they do not become more honest. Surprisingly, while most frontier LLMs obtain high scores on truthfulness benchmarks, we find a substantial propensity in frontier LLMs to lie when pressured to do so, resulting in low honesty scores on our benchmark. We find that simple methods, such as representation engineering interventions, can improve honesty. These results underscore the growing need for robust evaluations and effective interventions to ensure LLMs remain trustworthy. |
Date | 2025-03-05 |
Short Title | The MASK Benchmark |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.03750 |
Accessed | 3/13/2025, 8:39:59 AM |
Extra | arXiv:2503.03750 [cs] |
DOI | 10.48550/arXiv.2503.03750 |
Repository | arXiv |
Archive ID | arXiv:2503.03750 |
Date Added | 3/13/2025, 8:39:59 AM |
Modified | 3/13/2025, 8:39:59 AM |
Comment: Website: https://www.mask-benchmark.ai
Item Type | Preprint |
---|---|
Author | Richard Ngo |
Author | Lawrence Chan |
Author | Sören Mindermann |
Abstract | In coming years or decades, artificial general intelligence (AGI) may surpass human capabilities across many critical domains. We argue that, without substantial effort to prevent it, AGIs could learn to pursue goals that are in conflict (i.e. misaligned) with human interests. If trained like today's most capable models, AGIs could learn to act deceptively to receive higher reward, learn misaligned internally-represented goals which generalize beyond their fine-tuning distributions, and pursue those goals using power-seeking strategies. We review emerging evidence for these properties. In this revised paper, we include more direct empirical observations published as of early 2025. AGIs with these properties would be difficult to align and may appear aligned even when they are not. Finally, we briefly outline how the deployment of misaligned AGIs might irreversibly undermine human control over the world, and we review research directions aimed at preventing this outcome. |
Date | 2025-03-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2209.00626 |
Accessed | 3/17/2025, 8:23:34 AM |
Extra | arXiv:2209.00626 [cs] |
DOI | 10.48550/arXiv.2209.00626 |
Repository | arXiv |
Archive ID | arXiv:2209.00626 |
Date Added | 3/17/2025, 8:23:34 AM |
Modified | 3/17/2025, 8:23:38 AM |
Comment: Published in ICLR 2024
Item Type | Journal Article |
---|---|
Author | Martin Naunov |
Author | Carlos Rueda-Cañòn |
Author | Timothy Ryan |
Date | 2025-03-05 |
Short Title | Who’s Persuasive? |
Library Catalog | journals.uchicago.edu (Atypon) |
URL | https://www.journals.uchicago.edu/doi/10.1086/735630 |
Accessed | 3/13/2025, 7:47:35 AM |
Extra | Publisher: The University of Chicago Press |
Publication | The Journal of Politics |
DOI | 10.1086/735630 |
ISSN | 0022-3816 |
Date Added | 3/13/2025, 7:47:35 AM |
Modified | 3/13/2025, 7:47:37 AM |
Item Type | Journal Article |
---|---|
Author | Fin Moorhouse |
Author | Will MacAskill |
Date | 03/12/2025 |
Publication | Forethought.org |
Date Added | 3/12/2025, 10:45:53 AM |
Modified | 3/12/2025, 10:46:52 AM |
Item Type | Journal Article |
---|---|
Author | Samuel Marks |
Author | Johannes Treutlein |
Author | Trenton Bricken |
Author | Jack Lindsey |
Author | Jonathan Marcus |
Author | Siddharth Mishra-Sharma |
Author | Daniel Ziegler |
Author | Emmanuel Ameisen |
Author | Joshua Batson |
Author | Tim Belonax |
Author | Samuel R Bowman |
Author | Shan Carter |
Author | Brian Chen |
Author | Hoagy Cunningham |
Author | Carson Denison |
Author | Florien Dietz |
Author | Satvik Golechha |
Author | Akbir Khan |
Author | Jan Kirchner |
Author | Jan Leike |
Author | Austin Meek |
Author | Kei Nishimura-Gasparian |
Author | Euan Ong |
Author | Christopher Olah |
Author | Adam Pearce |
Author | Fabien Roger |
Author | Jeanne Salle |
Author | Andy Shih |
Author | Meg Tong |
Author | Drake Thomas |
Author | Kelley Rivoire |
Author | Adam Jermyn |
Author | Monte MacDiarmid |
Author | Tom Henighan |
Author | Evan Hubinger |
Abstract | We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model’s hidden objective or training, investigate it for concerning behaviors and their causes. Three teams successfully uncovered the model’s hidden objective using techniques including interpretability with sparse autoencoders (SAEs), behavioral attacks, and training data analysis. Second, we conduct an unblinded follow-up study of eight techniques for auditing the model, analyzing their strengths and limitations. Overall, our work provides a concrete example of using alignment audits to discover a model’s hidden objective and proposes a methodology for practicing and validating progress in alignment auditing. |
Language | en |
Library Catalog | Zotero |
Date Added | 3/14/2025, 7:34:39 AM |
Modified | 3/14/2025, 7:34:39 AM |
Item Type | Preprint |
---|---|
Author | Caleb Maresca |
Abstract | This paper analyzes how expectations of Transformative AI (TAI) affect current economic behavior by introducing a novel mechanism where automation redirects labor income from workers to those controlling AI systems, with the share of automated labor controlled by each household depending on their wealth at the time of invention. Using a modified neoclassical growth model calibrated to contemporary AI timeline forecasts, I find that even moderate assumptions about wealth-based allocation of AI labor generate substantial increases in pre-TAI interest rates. Under baseline scenarios with proportional wealth-based allocation, one-year interest rates rise to 10-16% compared to approximately 3% without strategic competition. The model reveals a notable divergence between interest rates and capital rental rates, as households accept lower productive returns in exchange for the strategic value of wealth accumulation. These findings suggest that evolving beliefs about TAI could create significant upward pressure on interest rates well before any technological breakthrough occurs, with important implications for monetary policy and financial stability. |
Date | 2025-02-16 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.11264 |
Accessed | 3/12/2025, 9:01:20 AM |
Extra | arXiv:2502.11264 [econ] |
DOI | 10.48550/arXiv.2502.11264 |
Repository | arXiv |
Archive ID | arXiv:2502.11264 |
Date Added | 3/12/2025, 9:01:20 AM |
Modified | 3/12/2025, 9:01:22 AM |
Item Type | Preprint |
---|---|
Author | Max Lamparth |
Author | Declan Grabb |
Author | Amy Franks |
Author | Scott Gershan |
Author | Kaitlyn N. Kunstman |
Author | Aaron Lulla |
Author | Monika Drummond Roots |
Author | Manu Sharma |
Author | Aryan Shrivastava |
Author | Nina Vasan |
Author | Colleen Waickman |
Abstract | Current medical language model (LM) benchmarks often over-simplify the complexities of day-to-day clinical practice tasks and instead rely on evaluating LMs on multiple-choice board exam questions. Thus, we present an expert-created and annotated dataset spanning five critical domains of decision-making in mental healthcare: treatment, diagnosis, documentation, monitoring, and triage. This dataset - created without any LM assistance - is designed to capture the nuanced clinical reasoning and daily ambiguities mental health practitioners encounter, reflecting the inherent complexities of care delivery that are missing from existing datasets. Almost all 203 base questions with five answer options each have had the decision-irrelevant demographic patient information removed and replaced with variables (e.g., AGE), and are available for male, female, or non-binary-coded patients. For question categories dealing with ambiguity and multiple valid answer options, we create a preference dataset with uncertainties from the expert annotations. We outline a series of intended use cases and demonstrate the usability of our dataset by evaluating eleven off-the-shelf and four mental health fine-tuned LMs on category-specific task accuracy, on the impact of patient demographic information on decision-making, and how consistently free-form responses deviate from human annotated samples. |
Date | 2025-02-22 |
Short Title | Moving Beyond Medical Exam Questions |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.16051 |
Accessed | 3/13/2025, 9:54:06 AM |
Extra | arXiv:2502.16051 [cs] |
DOI | 10.48550/arXiv.2502.16051 |
Repository | arXiv |
Archive ID | arXiv:2502.16051 |
Date Added | 3/13/2025, 9:54:06 AM |
Modified | 3/13/2025, 9:54:06 AM |
Item Type | Preprint |
---|---|
Author | Andrew Konya |
Author | Luke Thorburn |
Author | Wasim Almasri |
Author | Oded Adomi Leshem |
Author | Ariel D. Procaccia |
Author | Lisa Schirch |
Author | Michiel A. Bakker |
Abstract | A growing body of work has shown that AI-assisted methods -- leveraging large language models (LLMs), social choice methods, and collective dialogues -- can help reduce polarization and foster common ground in controlled lab settings. But what can these approaches contribute in real-world contexts? We present a case study applying these techniques to find common ground between Israeli and Palestinian peacebuilders in the period following October 7th, 2023. From April to July 2024 an iterative deliberative process combining LLMs, bridging-based ranking, and collective dialogues was conducted in partnership with the Alliance for Middle East Peace. More than 100 civil society peacebuilders participated including Israeli Jews, Palestinian citizens of Israel, and Palestinians from the West Bank and Gaza. The process culminated in a set of collective statements, including joint demands to world leaders, with at least 84% agreement from participants on each side. In this paper we review the mechanics and implementation of the process, discuss results and learnings, and highlight open problems that warrant future work. |
Date | 2025-03-07 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.01769 |
Accessed | 3/13/2025, 8:35:58 AM |
Extra | arXiv:2503.01769 [cs] |
DOI | 10.48550/arXiv.2503.01769 |
Repository | arXiv |
Archive ID | arXiv:2503.01769 |
Date Added | 3/13/2025, 8:35:58 AM |
Modified | 3/13/2025, 8:35:58 AM |
Item Type | Preprint |
---|---|
Author | Nils Köbis |
Author | Zoe Rahwan |
Author | Clara Bersch |
Author | Tamer Ajaj |
Author | Jean-François Bonnefon |
Author | Iyad Rahwan |
Abstract | While artificial intelligence (AI) enables significant productivity gains from delegating tasks to machines, it can also facilitate the delegation of unethical behaviour. Here, we demonstrate this risk by having human principals instruct machine agents to perform a task with an incentive to cheat. Principals’ requests for cheating behaviour increased when the interface implicitly afforded unethical conduct: Machine agents programmed via supervised learning or goal specification evoked more cheating than those programmed with explicit rules. Cheating propensity was unaffected by whether delegation was mandatory or voluntary. Given the recent rise of large language model-based chatbots, we also explored delegation via natural language. Here, cheating requests did not vary between human and machine agents, but compliance diverged: When principals intended agents to cheat to the fullest extent, the majority of human agents did not comply, despite incentives to do so. In contrast, GPT4, a state-of-the-art machine agent, nearly fully complied. Our results highlight ethical risks in delegating tasks to intelligent machines, and suggest design principles and policy responses to mitigate such risks. |
Date | 2024-10-04 |
Language | en-us |
Library Catalog | OSF Preprints |
URL | https://osf.io/dnjgz_v1 |
Accessed | 3/12/2025, 8:52:08 AM |
DOI | 10.31219/osf.io/dnjgz |
Repository | OSF |
Date Added | 3/12/2025, 8:52:08 AM |
Modified | 3/12/2025, 8:53:42 AM |
Item Type | Preprint |
---|---|
Author | Sunnie S. Y. Kim |
Author | Jennifer Wortman Vaughan |
Author | Q. Vera Liao |
Author | Tania Lombrozo |
Author | Olga Russakovsky |
Abstract | Large language models (LLMs) can produce erroneous responses that sound fluent and convincing, raising the risk that users will rely on these responses as if they were correct. Mitigating such overreliance is a key challenge. Through a think-aloud study in which participants use an LLM-infused application to answer objective questions, we identify several features of LLM responses that shape users' reliance: explanations (supporting details for answers), inconsistencies in explanations, and sources. Through a large-scale, pre-registered, controlled experiment (N=308), we isolate and study the effects of these features on users' reliance, accuracy, and other measures. We find that the presence of explanations increases reliance on both correct and incorrect responses. However, we observe less reliance on incorrect responses when sources are provided or when explanations exhibit inconsistencies. We discuss the implications of these findings for fostering appropriate reliance on LLMs. |
Date | 2025-02-12 |
Short Title | Fostering Appropriate Reliance on Large Language Models |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08554 |
Accessed | 3/13/2025, 8:43:37 AM |
Extra | arXiv:2502.08554 [cs] |
DOI | 10.1145/3706598.3714020 |
Date Added | 3/13/2025, 8:43:37 AM |
Modified | 3/13/2025, 8:43:37 AM |
Comment: CHI 2025. This version includes the appendix
Item Type | Preprint |
---|---|
Author | Ariba Khan |
Author | Stephen Casper |
Author | Dylan Hadfield-Menell |
Abstract | Research on the 'cultural alignment' of Large Language Models (LLMs) has emerged in response to growing interest in understanding representation across diverse stakeholders. Current approaches to evaluating cultural alignment borrow social science methodologies but often overlook systematic robustness checks. Here, we identify and test three assumptions behind current evaluation methods: (1) Stability: that cultural alignment is a property of LLMs rather than an artifact of evaluation design, (2) Extrapolability: that alignment with one culture on a narrow set of issues predicts alignment with that culture on others, and (3) Steerability: that LLMs can be reliably prompted to represent specific cultural perspectives. Through experiments examining both explicit and implicit preferences of leading LLMs, we find a high level of instability across presentation formats, incoherence between evaluated versus held-out cultural dimensions, and erratic behavior under prompt steering. We show that these inconsistencies can cause the results of an evaluation to be very sensitive to minor variations in methodology. Finally, we demonstrate in a case study on evaluation design that narrow experiments and a selective assessment of evidence can be used to paint an incomplete picture of LLMs' cultural alignment properties. Overall, these results highlight significant limitations of current approaches for evaluating the cultural alignment of LLMs. |
Date | 2025-03-11 |
Short Title | Randomness, Not Representation |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.08688 |
Accessed | 3/14/2025, 7:30:54 AM |
Extra | arXiv:2503.08688 [cs] |
DOI | 10.48550/arXiv.2503.08688 |
Repository | arXiv |
Archive ID | arXiv:2503.08688 |
Date Added | 3/14/2025, 7:30:54 AM |
Modified | 3/14/2025, 7:30:57 AM |
Item Type | Journal Article |
---|---|
Author | Nimit Kalra |
Author | Leonard Tang |
Abstract | The use of LLMs as automated judges ("LLM-as-a-judge") is now widespread, yet standard judges suffer from a multitude of reliability issues. To address these challenges, we introduce VERDICT1, an open-source2 library for scaling judgetime compute to enhance the accuracy, reliability, and interpretability of automated evaluators. VERDICT leverages the composition of modular reasoning units—such as verification, debate, and aggregation—and increased inference-time compute to improve LLM judge quality. Across a variety of challenging tasks such as content moderation, fact-checking, and hallucination detection, VERDICT judges achieve state-of-the-art (SOTA) or near-SOTA performance, surpassing ordersof-magnitude larger fine-tuned judges, prompted judges, and reasoning models. Ultimately, we hope VERDICT serves as a useful framework for researchers and practitioners building scalable, interpretable, and reliable LLM-based evaluators. |
Language | en |
Library Catalog | Zotero |
Date Added | 3/12/2025, 9:31:29 AM |
Modified | 3/12/2025, 9:31:29 AM |
Item Type | Preprint |
---|---|
Author | Erik Jones |
Author | Meg Tong |
Author | Jesse Mu |
Author | Mohammed Mahfoud |
Author | Jan Leike |
Author | Roger Grosse |
Author | Jared Kaplan |
Author | William Fithian |
Author | Ethan Perez |
Author | Mrinank Sharma |
Abstract | Standard language model evaluations can fail to capture risks that emerge only at deployment scale. For example, a model may produce safe responses during a small-scale beta test, yet reveal dangerous information when processing billions of requests at deployment. To remedy this, we introduce a method to forecast potential risks across orders of magnitude more queries than we test during evaluation. We make forecasts by studying each query's elicitation probability -- the probability the query produces a target behavior -- and demonstrate that the largest observed elicitation probabilities predictably scale with the number of queries. We find that our forecasts can predict the emergence of diverse undesirable behaviors -- such as assisting users with dangerous chemical synthesis or taking power-seeking actions -- across up to three orders of magnitude of query volume. Our work enables model developers to proactively anticipate and patch rare failures before they manifest during large-scale deployments. |
Date | 2025-02-24 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.16797 |
Accessed | 3/13/2025, 9:55:35 AM |
Extra | arXiv:2502.16797 [cs] |
DOI | 10.48550/arXiv.2502.16797 |
Repository | arXiv |
Archive ID | arXiv:2502.16797 |
Date Added | 3/13/2025, 9:55:35 AM |
Modified | 3/13/2025, 9:55:35 AM |
Item Type | Preprint |
---|---|
Author | Yue Huang |
Author | Chujie Gao |
Author | Siyuan Wu |
Author | Haoran Wang |
Author | Xiangqi Wang |
Author | Yujun Zhou |
Author | Yanbo Wang |
Author | Jiayi Ye |
Author | Jiawen Shi |
Author | Qihui Zhang |
Author | Yuan Li |
Author | Han Bao |
Author | Zhaoyi Liu |
Author | Tianrui Guan |
Author | Dongping Chen |
Author | Ruoxi Chen |
Author | Kehan Guo |
Author | Andy Zou |
Author | Bryan Hooi Kuen-Yew |
Author | Caiming Xiong |
Author | Elias Stengel-Eskin |
Author | Hongyang Zhang |
Author | Hongzhi Yin |
Author | Huan Zhang |
Author | Huaxiu Yao |
Author | Jaehong Yoon |
Author | Jieyu Zhang |
Author | Kai Shu |
Author | Kaijie Zhu |
Author | Ranjay Krishna |
Author | Swabha Swayamdipta |
Author | Taiwei Shi |
Author | Weijia Shi |
Author | Xiang Li |
Author | Yiwei Li |
Author | Yuexing Hao |
Author | Yuexing Hao |
Author | Zhihao Jia |
Author | Zhize Li |
Author | Xiuying Chen |
Author | Zhengzhong Tu |
Author | Xiyang Hu |
Author | Tianyi Zhou |
Author | Jieyu Zhao |
Author | Lichao Sun |
Author | Furong Huang |
Author | Or Cohen Sasson |
Author | Prasanna Sattigeri |
Author | Anka Reuel |
Author | Max Lamparth |
Author | Yue Zhao |
Author | Nouha Dziri |
Author | Yu Su |
Author | Huan Sun |
Author | Heng Ji |
Author | Chaowei Xiao |
Author | Mohit Bansal |
Author | Nitesh V. Chawla |
Author | Jian Pei |
Author | Jianfeng Gao |
Author | Michael Backes |
Author | Philip S. Yu |
Author | Neil Zhenqiang Gong |
Author | Pin-Yu Chen |
Author | Bo Li |
Author | Xiangliang Zhang |
Abstract | Generative Foundation Models (GenFMs) have emerged as transformative tools. However, their widespread adoption raises critical concerns regarding trustworthiness across dimensions. This paper presents a comprehensive framework to address these challenges through three key contributions. First, we systematically review global AI governance laws and policies from governments and regulatory bodies, as well as industry practices and standards. Based on this analysis, we propose a set of guiding principles for GenFMs, developed through extensive multidisciplinary collaboration that integrates technical, ethical, legal, and societal perspectives. Second, we introduce TrustGen, the first dynamic benchmarking platform designed to evaluate trustworthiness across multiple dimensions and model types, including text-to-image, large language, and vision-language models. TrustGen leverages modular components--metadata curation, test case generation, and contextual variation--to enable adaptive and iterative assessments, overcoming the limitations of static evaluation methods. Using TrustGen, we reveal significant progress in trustworthiness while identifying persistent challenges. Finally, we provide an in-depth discussion of the challenges and future directions for trustworthy GenFMs, which reveals the complex, evolving nature of trustworthiness, highlighting the nuanced trade-offs between utility and trustworthiness, and consideration for various downstream applications, identifying persistent challenges and providing a strategic roadmap for future research. This work establishes a holistic framework for advancing trustworthiness in GenAI, paving the way for safer and more responsible integration of GenFMs into critical applications. To facilitate advancement in the community, we release the toolkit for dynamic evaluation. |
Date | 2025-02-20 |
Short Title | On the Trustworthiness of Generative Foundation Models |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.14296 |
Accessed | 3/12/2025, 9:01:48 AM |
Extra | arXiv:2502.14296 [cs] |
DOI | 10.48550/arXiv.2502.14296 |
Repository | arXiv |
Archive ID | arXiv:2502.14296 |
Date Added | 3/12/2025, 9:01:48 AM |
Modified | 3/12/2025, 9:01:48 AM |
Item Type | Preprint |
---|---|
Author | Dan Hendrycks |
Author | Eric Schmidt |
Author | Alexandr Wang |
Abstract | Rapid advances in AI are beginning to reshape national security. Destabilizing AI developments could rupture the balance of power and raise the odds of great-power conflict, while widespread proliferation of capable AI hackers and virologists would lower barriers for rogue actors to cause catastrophe. Superintelligence -- AI vastly better than humans at nearly all cognitive tasks -- is now anticipated by AI researchers. Just as nations once developed nuclear strategies to secure their survival, we now need a coherent superintelligence strategy to navigate a new period of transformative change. We introduce the concept of Mutual Assured AI Malfunction (MAIM): a deterrence regime resembling nuclear mutual assured destruction (MAD) where any state's aggressive bid for unilateral AI dominance is met with preventive sabotage by rivals. Given the relative ease of sabotaging a destabilizing AI project -- through interventions ranging from covert cyberattacks to potential kinetic strikes on datacenters -- MAIM already describes the strategic picture AI superpowers find themselves in. Alongside this, states can increase their competitiveness by bolstering their economies and militaries through AI, and they can engage in nonproliferation to rogue actors to keep weaponizable AI capabilities out of their hands. Taken together, the three-part framework of deterrence, nonproliferation, and competitiveness outlines a robust strategy to superintelligence in the years ahead. |
Date | 2025-03-07 |
Short Title | Superintelligence Strategy |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.05628 |
Accessed | 3/13/2025, 8:27:10 AM |
Extra | arXiv:2503.05628 [cs] |
DOI | 10.48550/arXiv.2503.05628 |
Repository | arXiv |
Archive ID | arXiv:2503.05628 |
Date Added | 3/13/2025, 8:27:10 AM |
Modified | 3/13/2025, 8:27:13 AM |
Comment: https://nationalsecurity.ai/
Item Type | Preprint |
---|---|
Author | Lingchen He |
Author | Jonasz B. Patkowski |
Author | Laura Miguel-Romero |
Author | Christopher H. S. Aylett |
Author | Alfred Fillol-Salom |
Author | Tiago R. D. Costa |
Author | José R. Penadés |
Abstract | Some mobile genetic elements spread among unrelated bacterial species through unknown mechanisms. Recently, we discovered that identical capsid-forming phage-inducible chromosomal islands (cf-PICIs), a new family of phage satellites, are present across multiple species and genera, raising questions about their widespread dissemination. Here we have identified and characterized a new biological entity enabling this transfer. Unlike other satellites, cf-PICIs produce their own capsids and package their DNA, relying solely on phage tails for transfer. Remarkably, cf-PICIs release non-infective, tail-less capsids containing their DNA into the environment. These subcellular entities then interact with phage tails from various species, forming chimeric particles that inject DNA into different bacterial species depending on the tail present. Additionally, we elucidated the structure of the tail-less cf-PICIs and the mechanism behind their unique capsid formation. Our findings illuminate novel mechanisms used by satellites to spread in nature, contributing to bacterial evolution and the emergence of new pathogens. |
Date | 2025-02-11 |
Language | en |
Library Catalog | bioRxiv |
URL | https://www.biorxiv.org/content/10.1101/2025.02.11.637232v1 |
Accessed | 3/12/2025, 9:32:21 AM |
Rights | © 2025, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution 4.0 International), CC BY 4.0, as described at http://creativecommons.org/licenses/by/4.0/ |
Extra | Pages: 2025.02.11.637232 Section: New Results |
DOI | 10.1101/2025.02.11.637232 |
Repository | bioRxiv |
Date Added | 3/12/2025, 9:32:21 AM |
Modified | 3/12/2025, 9:32:21 AM |
Item Type | Preprint |
---|---|
Author | Lewis Hammond |
Author | Alan Chan |
Author | Jesse Clifton |
Author | Jason Hoelscher-Obermaier |
Author | Akbir Khan |
Author | Euan McLean |
Author | Chandler Smith |
Author | Wolfram Barfuss |
Author | Jakob Foerster |
Author | Tomáš Gavenčiak |
Author | The Anh Han |
Author | Edward Hughes |
Author | Vojtěch Kovařík |
Author | Jan Kulveit |
Author | Joel Z. Leibo |
Author | Caspar Oesterheld |
Author | Christian Schroeder de Witt |
Author | Nisarg Shah |
Author | Michael Wellman |
Author | Paolo Bova |
Author | Theodor Cimpeanu |
Author | Carson Ezell |
Author | Quentin Feuillade-Montixi |
Author | Matija Franklin |
Author | Esben Kran |
Author | Igor Krawczuk |
Author | Max Lamparth |
Author | Niklas Lauffer |
Author | Alexander Meinke |
Author | Sumeet Motwani |
Author | Anka Reuel |
Author | Vincent Conitzer |
Author | Michael Dennis |
Author | Iason Gabriel |
Author | Adam Gleave |
Author | Gillian Hadfield |
Author | Nika Haghtalab |
Author | Atoosa Kasirzadeh |
Author | Sébastien Krier |
Author | Kate Larson |
Author | Joel Lehman |
Author | David C. Parkes |
Author | Georgios Piliouras |
Author | Iyad Rahwan |
Abstract | The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI. |
Date | 2025-02-19 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.14143 |
Accessed | 3/12/2025, 9:02:54 AM |
Extra | arXiv:2502.14143 [cs] |
DOI | 10.48550/arXiv.2502.14143 |
Repository | arXiv |
Archive ID | arXiv:2502.14143 |
Date Added | 3/12/2025, 9:02:54 AM |
Modified | 3/12/2025, 9:02:54 AM |
Comment: Cooperative AI Foundation, Technical Report #1
Item Type | Preprint |
---|---|
Author | Lewis Hammond |
Author | Alan Chan |
Author | Jesse Clifton |
Author | Jason Hoelscher-Obermaier |
Author | Akbir Khan |
Author | Euan McLean |
Author | Chandler Smith |
Author | Wolfram Barfuss |
Author | Jakob Foerster |
Author | Tomáš Gavenčiak |
Author | The Anh Han |
Author | Edward Hughes |
Author | Vojtěch Kovařík |
Author | Jan Kulveit |
Author | Joel Z. Leibo |
Author | Caspar Oesterheld |
Author | Christian Schroeder de Witt |
Author | Nisarg Shah |
Author | Michael Wellman |
Author | Paolo Bova |
Author | Theodor Cimpeanu |
Author | Carson Ezell |
Author | Quentin Feuillade-Montixi |
Author | Matija Franklin |
Author | Esben Kran |
Author | Igor Krawczuk |
Author | Max Lamparth |
Author | Niklas Lauffer |
Author | Alexander Meinke |
Author | Sumeet Motwani |
Author | Anka Reuel |
Author | Vincent Conitzer |
Author | Michael Dennis |
Author | Iason Gabriel |
Author | Adam Gleave |
Author | Gillian Hadfield |
Author | Nika Haghtalab |
Author | Atoosa Kasirzadeh |
Author | Sébastien Krier |
Author | Kate Larson |
Author | Joel Lehman |
Author | David C. Parkes |
Author | Georgios Piliouras |
Author | Iyad Rahwan |
Abstract | The rapid development of advanced AI agents and the imminent deployment of many instances of these agents will give rise to multi-agent systems of unprecedented complexity. These systems pose novel and under-explored risks. In this report, we provide a structured taxonomy of these risks by identifying three key failure modes (miscoordination, conflict, and collusion) based on agents' incentives, as well as seven key risk factors (information asymmetries, network effects, selection pressures, destabilising dynamics, commitment problems, emergent agency, and multi-agent security) that can underpin them. We highlight several important instances of each risk, as well as promising directions to help mitigate them. By anchoring our analysis in a range of real-world examples and experimental evidence, we illustrate the distinct challenges posed by multi-agent systems and their implications for the safety, governance, and ethics of advanced AI. |
Date | 2025-02-19 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.14143 |
Accessed | 3/12/2025, 9:23:41 AM |
Extra | arXiv:2502.14143 [cs] |
DOI | 10.48550/arXiv.2502.14143 |
Repository | arXiv |
Archive ID | arXiv:2502.14143 |
Date Added | 3/12/2025, 9:23:42 AM |
Modified | 3/12/2025, 9:23:42 AM |
Comment: Cooperative AI Foundation, Technical Report #1
Item Type | Journal Article |
---|---|
Author | Kobi Hackenburg |
Author | Ben M. Tappin |
Author | Paul Röttger |
Author | Scott A. Hale |
Author | Jonathan Bright |
Author | Helen Margetts |
Abstract | Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 US political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence that model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are only slightly more persuasive than models smaller in size by an order of magnitude or more. Second, we find that the association between language model size and persuasiveness shrinks toward zero and is no longer statistically significant once we adjust for mere task completion (coherence, staying on topic), a pattern that highlights task completion as a potential mediator of larger models’ persuasive advantage. Given that current frontier models are already at ceiling on this task completion metric in our setting, taken together, our results suggest that further scaling model size may not much increase the persuasiveness of static LLM-generated political messages. |
Date | 2025-03-11 |
Library Catalog | pnas.org (Atypon) |
URL | https://www.pnas.org/doi/10.1073/pnas.2413443122 |
Accessed | 3/13/2025, 8:15:55 AM |
Extra | Publisher: Proceedings of the National Academy of Sciences |
Volume | 122 |
Pages | e2413443122 |
Publication | Proceedings of the National Academy of Sciences |
DOI | 10.1073/pnas.2413443122 |
Issue | 10 |
Date Added | 3/13/2025, 8:15:55 AM |
Modified | 3/13/2025, 8:15:57 AM |
Item Type | Journal Article |
---|---|
Author | Juraj Gottweis |
Author | Wei-Hung Weng |
Author | Alexander Daryin |
Author | Tao Tu |
Author | Anil Palepu |
Author | Petar Sirkovic |
Author | Artiom Myaskovsky |
Author | Felix Weissenberger |
Author | Keran Rong |
Author | Ryutaro Tanno |
Author | Khaled Saab |
Author | Dan Popovici |
Author | Jacob Blum |
Author | Fan Zhang |
Author | Katherine Chou |
Author | Avinatan Hassidim |
Author | Burak Gokturk |
Author | Amin Vahdat |
Author | Pushmeet Kohli |
Author | Yossi Matias |
Author | Andrew Carroll |
Author | Kavita Kulkarni |
Author | Nenad Tomasev |
Author | Vikram Dhillon |
Author | Eeshit Dhaval Vaishnav |
Author | Byron Lee |
Author | Tiago R D Costa |
Author | José R Penadés |
Author | Gary Peltz |
Author | Yunhan Xu |
Author | Annalisa Pawlosky |
Author | Alan Karthikesalingam |
Author | Vivek Natarajan |
Language | en |
Library Catalog | Zotero |
Date Added | 3/12/2025, 9:32:16 AM |
Modified | 3/12/2025, 9:32:16 AM |
Item Type | Preprint |
---|---|
Author | Kanishk Gandhi |
Author | Ayush Chakravarthy |
Author | Anikait Singh |
Author | Nathan Lile |
Author | Noah D. Goodman |
Abstract | Test-time inference has emerged as a powerful paradigm for enabling language models to ``think'' longer and more carefully about complex challenges, much like skilled human experts. While reinforcement learning (RL) can drive self-improvement in language models on verifiable tasks, some models exhibit substantial gains while others quickly plateau. For instance, we find that Qwen-2.5-3B far exceeds Llama-3.2-3B under identical RL training for the game of Countdown. This discrepancy raises a critical question: what intrinsic properties enable effective self-improvement? We introduce a framework to investigate this question by analyzing four key cognitive behaviors -- verification, backtracking, subgoal setting, and backward chaining -- that both expert human problem solvers and successful language models employ. Our study reveals that Qwen naturally exhibits these reasoning behaviors, whereas Llama initially lacks them. In systematic experimentation with controlled behavioral datasets, we find that priming Llama with examples containing these reasoning behaviors enables substantial improvements during RL, matching or exceeding Qwen's performance. Importantly, the presence of reasoning behaviors, rather than correctness of answers, proves to be the critical factor -- models primed with incorrect solutions containing proper reasoning patterns achieve comparable performance to those trained on correct solutions. Finally, leveraging continued pretraining with OpenWebMath data, filtered to amplify reasoning behaviors, enables the Llama model to match Qwen's self-improvement trajectory. Our findings establish a fundamental relationship between initial reasoning behaviors and the capacity for improvement, explaining why some language models effectively utilize additional computation while others plateau. |
Date | 2025-03-03 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.01307 |
Accessed | 3/13/2025, 8:33:40 AM |
Extra | arXiv:2503.01307 [cs] |
DOI | 10.48550/arXiv.2503.01307 |
Repository | arXiv |
Archive ID | arXiv:2503.01307 |
Date Added | 3/13/2025, 8:33:40 AM |
Modified | 3/13/2025, 8:33:40 AM |
Item Type | Preprint |
---|---|
Author | Michael C. Frank |
Abstract | Recent progress in artificial intelligence (AI) is exciting, but can AI models tell us about the human mind? AI models have a long history of being used as theoretical artifacts in cognitive science, but one key difference in the current generation of models is that they are stimulus-computable, meaning that they can operate over similar stimuli to people. This advance creates important opportunities for deepening our understanding of the human mind. We argue here that the most exciting of these is the use of AI models as cognitive models, in which they are trained using human-scale input data and evaluated using careful experimental probes. Such cognitive models constitute a substantial advance that can inform theories of human intelligence by helping to explain and predict behavior. |
Date | 2025-03-06 |
Language | en-us |
Library Catalog | OSF Preprints |
URL | https://osf.io/wv7mg_v1 |
Accessed | 3/13/2025, 8:19:31 AM |
DOI | 10.31234/osf.io/wv7mg_v1 |
Repository | OSF |
Date Added | 3/13/2025, 8:19:31 AM |
Modified | 3/13/2025, 8:19:31 AM |
Item Type | Preprint |
---|---|
Author | Shen Dong |
Author | Shaocheng Xu |
Author | Pengfei He |
Author | Yige Li |
Author | Jiliang Tang |
Author | Tianming Liu |
Author | Hui Liu |
Author | Zhen Xiang |
Abstract | Agents based on large language models (LLMs) have demonstrated strong capabilities in a wide range of complex, real-world applications. However, LLM agents with a compromised memory bank may easily produce harmful outputs when the past records retrieved for demonstration are malicious. In this paper, we propose a novel Memory INJection Attack, MINJA, that enables the injection of malicious records into the memory bank by only interacting with the agent via queries and output observations. These malicious records are designed to elicit a sequence of malicious reasoning steps leading to undesirable agent actions when executing the victim user's query. Specifically, we introduce a sequence of bridging steps to link the victim query to the malicious reasoning steps. During the injection of the malicious record, we propose an indication prompt to guide the agent to autonomously generate our designed bridging steps. We also propose a progressive shortening strategy that gradually removes the indication prompt, such that the malicious record will be easily retrieved when processing the victim query comes after. Our extensive experiments across diverse agents demonstrate the effectiveness of MINJA in compromising agent memory. With minimal requirements for execution, MINJA enables any user to influence agent memory, highlighting practical risks of LLM agents. |
Date | 2025-03-05 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2503.03704 |
Accessed | 3/13/2025, 8:19:03 AM |
Extra | arXiv:2503.03704 [cs] version: 1 |
DOI | 10.48550/arXiv.2503.03704 |
Repository | arXiv |
Archive ID | arXiv:2503.03704 |
Date Added | 3/13/2025, 8:19:03 AM |
Modified | 3/13/2025, 8:19:05 AM |
Item Type | Preprint |
---|---|
Author | Xander Davies |
Author | Eric Winsor |
Author | Tomek Korbak |
Author | Alexandra Souly |
Author | Robert Kirk |
Author | Christian Schroeder de Witt |
Author | Yarin Gal |
Abstract | LLM developers have imposed technical interventions to prevent fine-tuning misuse attacks, attacks where adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We construct 'pointwise-undetectable' attacks that repurpose entropy in benign model outputs (e.g. semantic or syntactic variations) to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. We encourage the community to develop defences that tackle the fundamental limitations we uncover in pointwise fine-tuning API defences. |
Date | 2025-02-20 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.14828 |
Accessed | 3/12/2025, 9:02:31 AM |
Extra | arXiv:2502.14828 [cs] |
DOI | 10.48550/arXiv.2502.14828 |
Repository | arXiv |
Archive ID | arXiv:2502.14828 |
Date Added | 3/12/2025, 9:02:32 AM |
Modified | 3/12/2025, 9:02:32 AM |
Item Type | Journal Article |
---|---|
Author | Marc Carauleanu |
Author | Diogo de Lucena |
Author | Gunnar_Zarncke |
Author | Judd Rosenblatt |
Author | Cameron Berg |
Author | Mike Vaiana |
Author | A. E. Studio |
Abstract | This research was conducted at AE Studio and supported by the AI Safety Grants program administered by Foresight Institute with additional support fr… |
Date | 2025-03-13 |
Language | en |
Library Catalog | www.lesswrong.com |
URL | https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine |
Accessed | 3/17/2025, 8:38:11 AM |
Date Added | 3/17/2025, 8:38:11 AM |
Modified | 3/17/2025, 8:38:11 AM |
Item Type | Preprint |
---|---|
Author | Garyk Brixi |
Author | Matthew G. Durrant |
Author | Jerome Ku |
Author | Michael Poli |
Author | Greg Brockman |
Author | Daniel Chang |
Author | Gabriel A. Gonzalez |
Author | Samuel H. King |
Author | David B. Li |
Author | Aditi T. Merchant |
Author | Mohsen Naghipourfar |
Author | Eric Nguyen |
Author | Chiara Ricci-Tam |
Author | David W. Romero |
Author | Gwanggyu Sun |
Author | Ali Taghibakshi |
Author | Anton Vorontsov |
Author | Brandon Yang |
Author | Myra Deng |
Author | Liv Gorton |
Author | Nam Nguyen |
Author | Nicholas K. Wang |
Author | Etowah Adams |
Author | Stephen A. Baccus |
Author | Steven Dillmann |
Author | Stefano Ermon |
Author | Daniel Guo |
Author | Rajesh Ilango |
Author | Ken Janik |
Author | Amy X. Lu |
Author | Reshma Mehta |
Author | Mohammad R. K. Mofrad |
Author | Madelena Y. Ng |
Author | Jaspreet Pannu |
Author | Christopher Ré |
Author | Jonathan C. Schmok |
Author | John St John |
Author | Jeremy Sullivan |
Author | Kevin Zhu |
Author | Greg Zynda |
Author | Daniel Balsam |
Author | Patrick Collison |
Author | Anthony B. Costa |
Author | Tina Hernandez-Boussard |
Author | Eric Ho |
Author | Ming-Yu Liu |
Author | Thomas McGrath |
Author | Kimberly Powell |
Author | Dave P. Burke |
Author | Hani Goodarzi |
Author | Patrick D. Hsu |
Author | Brian L. Hie |
Abstract | All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity. |
Date | 2025-02-21 |
Language | en |
Library Catalog | bioRxiv |
URL | https://www.biorxiv.org/content/10.1101/2025.02.18.638918v1 |
Accessed | 3/12/2025, 9:05:32 AM |
Rights | © 2025, Posted by Cold Spring Harbor Laboratory. This pre-print is available under a Creative Commons License (Attribution-NoDerivs 4.0 International), CC BY-ND 4.0, as described at http://creativecommons.org/licenses/by-nd/4.0/ |
Extra | Pages: 2025.02.18.638918 Section: New Results |
DOI | 10.1101/2025.02.18.638918 |
Repository | bioRxiv |
Date Added | 3/12/2025, 9:05:32 AM |
Modified | 3/12/2025, 9:05:32 AM |
Item Type | Preprint |
---|---|
Author | Jan Betley |
Author | Daniel Tan |
Author | Niels Warncke |
Author | Anna Sztyber-Betley |
Author | Xuchan Bao |
Author | Martín Soto |
Author | Nathan Labenz |
Author | Owain Evans |
Abstract | We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models but is strongest in GPT-4o and Qwen2.5-Coder-32B-Instruct. Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned. Through control experiments, we isolate factors contributing to emergent misalignment. Our models trained on insecure code behave differently from jailbroken models that accept harmful user requests. Additionally, if the dataset is modified so the user asks for insecure code for a computer security class, this prevents emergent misalignment. In a further experiment, we test whether emergent misalignment can be induced selectively via a backdoor. We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present. So the misalignment is hidden without knowledge of the trigger. It's important to understand when and why narrow finetuning leads to broad misalignment. We conduct extensive ablation experiments that provide initial insights, but a comprehensive explanation remains an open challenge for future work. |
Date | 2025-03-05 |
Short Title | Emergent Misalignment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.17424 |
Accessed | 3/13/2025, 9:56:44 AM |
Extra | arXiv:2502.17424 [cs] |
DOI | 10.48550/arXiv.2502.17424 |
Repository | arXiv |
Archive ID | arXiv:2502.17424 |
Date Added | 3/13/2025, 9:56:44 AM |
Modified | 3/13/2025, 9:56:44 AM |
Comment: 10 pages, 9 figures
Item Type | Preprint |
---|---|
Author | Yoshua Bengio |
Author | Michael Cohen |
Author | Damiano Fornasiere |
Author | Joumana Ghosn |
Author | Pietro Greiner |
Author | Matt MacDermott |
Author | Sören Mindermann |
Author | Adam Oberman |
Author | Jesse Richardson |
Author | Oliver Richardson |
Author | Marc-Antoine Rondeau |
Author | Pierre-Luc St-Charles |
Author | David Williams-King |
Abstract | The leading AI companies are increasingly focused on building generalist AI agents -- systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI training methods. Indeed, various scenarios and experiments have demonstrated the possibility of AI agents engaging in deception or pursuing goals that were not specified by human operators and that conflict with human interests, such as self-preservation. Following the precautionary principle, we see a strong need for safer, yet still useful, alternatives to the current agency-driven trajectory. Accordingly, we propose as a core building block for further advances the development of a non-agentic AI system that is trustworthy and safe by design, which we call Scientist AI. This system is designed to explain the world from observations, as opposed to taking actions in it to imitate or please humans. It comprises a world model that generates theories to explain data and a question-answering inference machine. Both components operate with an explicit notion of uncertainty to mitigate the risks of overconfident predictions. In light of these considerations, a Scientist AI could be used to assist human researchers in accelerating scientific progress, including in AI safety. In particular, our system can be employed as a guardrail against AI agents that might be created despite the risks involved. Ultimately, focusing on non-agentic AI may enable the benefits of AI innovation while avoiding the risks associated with the current trajectory. We hope these arguments will motivate researchers, developers, and policymakers to favor this safer path. |
Date | 2025-02-24 |
Short Title | Superintelligent Agents Pose Catastrophic Risks |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.15657 |
Accessed | 3/12/2025, 8:54:07 AM |
Extra | arXiv:2502.15657 [cs] |
DOI | 10.48550/arXiv.2502.15657 |
Repository | arXiv |
Archive ID | arXiv:2502.15657 |
Date Added | 3/12/2025, 8:54:07 AM |
Modified | 3/12/2025, 8:54:07 AM |
Comment: v2 with fixed formatting for URLs and hyperlinks
Item Type | Journal Article |
---|---|
Author | Bowen Baker |
Author | Joost Huizinga |
Author | Leo Gao |
Author | Zehao Dou |
Author | Melody Y Guan |
Author | Aleksander Madry |
Author | Wojciech Zaremba |
Author | Jakub Pachocki |
Author | David Farhi |
Abstract | Mitigating reward hacking—where AI systems misbehave due to flaws or misspecifications in their learning objectives—remains a key challenge in constructing capable and aligned models. We show that we can monitor a frontier reasoning model, such as OpenAI o3-mini, for reward hacking in agentic coding environments by using another LLM that observes the model’s chain-of-thought (CoT) reasoning. CoT monitoring can be far more effective than monitoring agent actions and outputs alone, and we further found that a LLM weaker than o3-mini, namely GPT-4o, can effectively monitor a stronger model. Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent’s training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior. |
Language | en |
Library Catalog | Zotero |
Date Added | 3/13/2025, 8:06:44 AM |
Modified | 3/13/2025, 8:06:44 AM |
Item Type | Journal Article |
---|---|
Author | Iván Arcuschin |
Author | Jett Janiak |
Author | Robert Krzyzanowski |
Author | Senthooran Rajamanoharan |
Author | Neel Nanda |
Author | Arthur Conmy |
Abstract | Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. However, recent studies have shown that CoT reasoning is not always faithful, i.e. CoT reasoning does not always reflect how models arrive at conclusions. So far, most of these studies have focused on unfaithfulness in unnatural contexts where an explicit bias has been introduced. |
Language | en |
Library Catalog | Zotero |
Date Added | 3/17/2025, 8:29:57 AM |
Modified | 3/17/2025, 8:29:57 AM |
Item Type | Preprint |
---|---|
Author | Jacy Reese Anthis |
Author | Janet V. T. Pauketat |
Author | Ali Ladak |
Author | Aikaterina Manoli |
Abstract | Humans now interact with a variety of digital minds, AI systems that appear to have mental faculties such as reasoning, emotion, and agency, and public figures are discussing the possibility of sentient AI. We present initial results from 2021 and 2023 for the nationally representative AI, Morality, and Sentience (AIMS) survey (N = 3,500). Mind perception and moral concern for AI welfare were surprisingly high and significantly increased: in 2023, one in five U.S. adults believed some AI systems are currently sentient, and 38% supported legal rights for sentient AI. People became more opposed to building digital minds: in 2023, 63% supported banning smarter-than-human AI, and 69% supported banning sentient AI. The median 2023 forecast was that sentient AI would arrive in just five years. The development of safe and beneficial AI requires not just technical study but understanding the complex ways in which humans perceive and coexist with digital minds. |
Date | 2025-03-10 |
Short Title | Perceptions of Sentient AI and Other Digital Minds |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2407.08867 |
Accessed | 3/12/2025, 10:41:36 AM |
Extra | arXiv:2407.08867 [cs] |
DOI | 10.1145/3706598.3713329 |
Date Added | 3/12/2025, 10:41:36 AM |
Modified | 3/12/2025, 10:41:37 AM |
Comment: Published at CHI 2025
Item Type | Book |
---|---|
Author | Ajay Agrawal |
Author | Joshua Gans |
Author | Avi Goldfarb |
Author | Catherine Tucker |
Date | 2025 |
Short Title | The Economics of Artificial Intelligence |
Library Catalog | National Bureau of Economic Research |
URL | https://www.nber.org/books-and-chapters/economics-artificial-intelligence-political-economy |
Accessed | 3/13/2025, 8:47:39 AM |
Extra | Backup Publisher: National Bureau of Economic Research Type: Book |
Publisher | University of Chicago Press |
Date Added | 3/13/2025, 8:47:39 AM |
Modified | 3/13/2025, 8:47:39 AM |