  • Actions Speak Louder than Words: Agent Decisions Reveal Implicit Biases in Language Models

    Item Type Preprint
    Author Yuxuan Li
    Author Hirokazu Shirado
    Author Sauvik Das
    Abstract While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases.
    Date 2025-01-29
    Short Title Actions Speak Louder than Words
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.17420
    Accessed 1/31/2025, 1:11:37 PM
    Extra arXiv:2501.17420 [cs]
    DOI 10.48550/arXiv.2501.17420
    Repository arXiv
    Archive ID arXiv:2501.17420
    Date Added 1/31/2025, 1:11:37 PM
    Modified 1/31/2025, 1:11:40 PM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • AI language model rivals expert ethicist in perceived moral expertise

    Item Type Journal Article
    Author Danica Dillion
    Author Debanjan Mondal
    Author Niket Tandon
    Author Kurt Gray
    Abstract People view AI as possessing expertise across various fields, but the perceived quality of AI-generated moral expertise remains uncertain. Recent work suggests that large language models (LLMs) perform well on tasks designed to assess moral alignment, reflecting moral judgments with relatively high accuracy. As LLMs are increasingly employed in decision-making roles, there is a growing expectation for them to offer not just aligned judgments but also demonstrate sound moral reasoning. Here, we advance work on the Moral Turing Test and find that Americans rate ethical advice from GPT-4o as slightly more moral, trustworthy, thoughtful, and correct than that of the popular New York Times advice column, The Ethicist. Participants perceived GPT models as surpassing both a representative sample of Americans and a renowned ethicist in delivering moral justifications and advice, suggesting that people may increasingly view LLM outputs as viable sources of moral expertise. This work suggests that people might see LLMs as valuable complements to human expertise in moral guidance and decision-making. It also underscores the importance of carefully programming ethical guidelines in LLMs, considering their potential to influence users’ moral reasoning.
    Date 2025-02-03
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s41598-025-86510-0
    Accessed 2/13/2025, 11:15:52 AM
    Rights 2025 The Author(s)
    Extra Publisher: Nature Publishing Group
    Volume 15
    Pages 4084
    Publication Scientific Reports
    DOI 10.1038/s41598-025-86510-0
    Issue 1
    Journal Abbr Sci Rep
    ISSN 2045-2322
    Date Added 2/13/2025, 11:15:52 AM
    Modified 2/13/2025, 11:15:54 AM

    Tags:

    • Computer science
    • Psychology

    Attachments

    • Full Text PDF
  • AI-Action-Summit-Tool-AI-Explainer-V5.pdf

    Item Type Attachment
    URL https://futureoflife.org/wp-content/uploads/2025/02/AI-Action-Summit-Tool-AI-Explainer-V5.pdf
    Accessed 2/13/2025, 11:27:23 AM
    Date Added 2/13/2025, 11:27:23 AM
    Modified 2/13/2025, 11:27:23 AM
  • Ban on D.E.I. Language Sweeps Through the Sciences

    Item Type Newspaper Article
    Author Katrina Miller
    Author Roni Caryn Rabin
    Abstract President Trump’s executive order is altering scientific exploration across a broad swath of fields, even beyond government agencies, researchers say.
    Date 2025-02-09
    Language en-US
    Library Catalog NYTimes.com
    URL https://www.nytimes.com/2025/02/09/science/trump-dei-science.html
    Accessed 2/13/2025, 11:28:02 AM
    Section Science
    Publication The New York Times
    ISSN 0362-4331
    Date Added 2/13/2025, 11:28:02 AM
    Modified 2/13/2025, 11:28:02 AM

    Tags:

    • Physics
    • Research
    • Brookhaven (NY)
    • Brookhaven National Laboratory
    • Colleges and Universities
    • Discrimination
    • Diversity Initiatives
    • Engineering and Engineers
    • Executive Orders and Memorandums
    • Fermi National Accelerator Laboratory
    • Flags, Emblems and Insignia
    • Homosexuality and Bisexuality
    • Hughes, Howard Medical Institute
    • Laboratories and Scientific Equipment
    • Minorities
    • National Academies of the United States
    • National Aeronautics and Space Administration
    • National Institutes of Health
    • National Science Foundation
    • Science and Technology
    • Space and Astronomy
    • Transgender
    • Trump, Donald J
    • United States Politics and Government
    • your-feed-science

    Attachments

    • Snapshot
  • Construct Validity in Automated Counterterrorism Analysis

    Item Type Journal Article
    Author Adrian K. Yee
    Abstract Governments and social scientists are increasingly developing machine learning methods to automate the process of identifying terrorists in real time and predict future attacks. However, current operationalizations of “terrorist” in artificial intelligence are difficult to justify given three issues that remain neglected: insufficient construct legitimacy, insufficient criterion validity, and insufficient construct validity. I conclude that machine learning methods should be at most used for the identification of singular individuals deemed terrorists and not for identifying possible terrorists from some more general class, nor to predict terrorist attacks more broadly, given intolerably high risks that result from such approaches.
    Date 2024-11-27
    Language en
    Library Catalog DOI.org (Crossref)
    URL https://www.cambridge.org/core/product/identifier/S0031824824000655/type/journal_article
    Accessed 2/13/2025, 11:45:34 AM
    Rights https://creativecommons.org/licenses/by/4.0/
    Pages 1-18
    Publication Philosophy of Science
    DOI 10.1017/psa.2024.65
    Journal Abbr Philos. sci.
    ISSN 0031-8248, 1539-767X
    Date Added 2/13/2025, 11:45:34 AM
    Modified 2/13/2025, 11:45:34 AM

    Attachments

    • PDF
  • Dating Apps and the Digital Sexual Sphere

    Item Type Journal Article
    Author Elsa Kugelberg
    Abstract The online dating application has in recent years become a major avenue for meeting potential partners. However, while the digital public sphere has gained the attention of political philosophers, a systematic normative evaluation of issues arising in the “digital sexual sphere” is lacking. I provide a philosophical framework for assessing the conduct of dating app corporations, capturing both the motivations of users, and the reason why they find usage unsatisfying. Identifying dating apps as agents intervening in a social institution necessary for the reproduction of society, with immense power over people’s lives, I ask if they exercise their power in line with individuals’ interests. Acknowledging that people have claims to noninterference, equal standing, and choice improvement relating to intimacy, I find that the traditional, nondigital, sexual sphere poses problems to their realisation, especially for sexual minorities. In this context, apps’ potential for justice in the sexual sphere is immense but unfulfilled.
    Date 2025-01-30
    Language en
    Library Catalog Cambridge University Press
    URL https://www.cambridge.org/core/journals/american-political-science-review/article/dating-apps-and-the-digital-sexual-sphere/2F83AAEFB7DEA94FA4179369A004CEEC
    Accessed 1/30/2025, 9:03:47 AM
    Pages 1-16
    Publication American Political Science Review
    DOI 10.1017/S000305542400128X
    ISSN 0003-0554, 1537-5943
    Date Added 1/30/2025, 9:03:47 AM
    Modified 1/30/2025, 9:03:50 AM

    Attachments

    • Full Text PDF
  • Deception and manipulation in generative AI

    Item Type Journal Article
    Author Christian Tarsney
    Abstract Large language models now possess human-level linguistic abilities in many contexts. This raises the concern that they can be used to deceive and manipulate on unprecedented scales, for instance spreading political misinformation on social media. In future, agentic AI systems might also deceive and manipulate humans for their own purposes. In this paper, first, I argue that AI-generated content should be subject to stricter standards against deception and manipulation than we ordinarily apply to humans. Second, I offer new characterizations of AI deception and manipulation meant to support such standards, according to which a statement is deceptive (resp. manipulative) if it leads human addressees away from the beliefs (resp. choices) they would endorse under “semi-ideal” conditions. Third, I propose two measures to guard against AI deception and manipulation, inspired by this characterization: “extreme transparency” requirements for AI-generated content and “defensive systems” that, among other things, annotate AI-generated statements with contextualizing information. Finally, I consider to what extent these measures can protect against deceptive behavior in future, agentic AI systems.
    Date 2025-01-18
    Language en
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-024-02259-8
    Accessed 1/27/2025, 8:30:09 PM
    Publication Philosophical Studies
    DOI 10.1007/s11098-024-02259-8
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 1/27/2025, 8:30:09 PM
    Modified 1/27/2025, 8:30:56 PM

    Tags:

    • Artificial intelligence
    • AI safety
    • AI ethics
    • Deception
    • Manipulation
    • Trustworthy AI

    Attachments

    • PDF
  • Fully Autonomous AI Agents Should Not be Developed

    Item Type Preprint
    Author Margaret Mitchell
    Author Avijit Ghosh
    Author Alexandra Sasha Luccioni
    Author Giada Pistilli
    Abstract This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels and detail the ethical values at play in each, documenting trade-offs in potential benefits and risks. Our analysis reveals that risks to people increase with the autonomy of a system: The more control a user cedes to an AI agent, the more risks to people arise. Particularly concerning are safety risks, which affect human life and impact further values.
    Date 2025-02-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.02649
    Accessed 2/6/2025, 9:51:01 AM
    Extra arXiv:2502.02649 [cs]
    DOI 10.48550/arXiv.2502.02649
    Repository arXiv
    Archive ID arXiv:2502.02649
    Date Added 2/6/2025, 9:51:01 AM
    Modified 2/6/2025, 9:51:04 AM

    Tags:

    • Computer Science - Artificial Intelligence

    Attachments

    • Preprint PDF
    • Snapshot
  • Governing the Algorithmic City

    Item Type Journal Article
    Author Seth Lazar
    Language en
    Library Catalog Wiley Online Library
    URL https://onlinelibrary.wiley.com/doi/abs/10.1111/papa.12279
    Accessed 2/1/2025, 3:10:39 PM
    Rights © 2025 The Author(s). Philosophy & Public Affairs published by Wiley Periodicals LLC.
    Extra _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/papa.12279
    Volume n/a
    Publication Philosophy & Public Affairs
    DOI 10.1111/papa.12279
    Issue n/a
    ISSN 1088-4963
    Date Added 2/1/2025, 3:10:39 PM
    Modified 2/1/2025, 3:10:44 PM

    Attachments

    • Full Text PDF
    • Snapshot
  • Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development

    Item Type Preprint
    Author Jan Kulveit
    Author Raymond Douglas
    Author Nora Ammann
    Author Deger Turan
    Author David Krueger
    Author David Duvenaud
    Abstract This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of ‘gradual disempowerment’, in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human interests that often arise from societal systems' reliance on human participation to function. Furthermore, to the extent that these systems incentivise outcomes that do not line up with human preferences, AIs may optimize for those outcomes more aggressively. These effects may be mutually reinforcing across different domains: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.
    Date 2025-01-29
    Short Title Gradual Disempowerment
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2501.16946
    Accessed 1/31/2025, 1:07:31 PM
    Extra arXiv:2501.16946 [cs]
    DOI 10.48550/arXiv.2501.16946
    Repository arXiv
    Archive ID arXiv:2501.16946
    Date Added 1/31/2025, 1:07:31 PM
    Modified 1/31/2025, 1:07:36 PM

    Tags:

    • Computer Science - Computers and Society

    Notes:

    • Comment: 19 pages, 2 figures

    Attachments

    • Preprint PDF
    • Snapshot
  • IssueBench: Millions of Realistic Prompts for Measuring Issue Bias in LLM Writing Assistance

    Item Type Preprint
    Author Paul Röttger
    Author Musashi Hinck
    Author Valentin Hofmann
    Author Kobi Hackenburg
    Author Valentina Pyatkin
    Author Faeze Brahman
    Author Dirk Hovy
    Abstract Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them.
    Date 2025-02-12
    Short Title IssueBench
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2502.08395
    Accessed 2/13/2025, 11:48:19 AM
    Extra arXiv:2502.08395 [cs]
    DOI 10.48550/arXiv.2502.08395
    Repository arXiv
    Archive ID arXiv:2502.08395
    Date Added 2/13/2025, 11:48:19 AM
    Modified 2/13/2025, 11:48:19 AM

    Tags:

    • Computer Science - Computation and Language

    Notes:

    • Comment: under review

    Attachments

    • Preprint PDF
    • Snapshot
  • Key concepts and current beliefs about AI moral patienthood

    Item Type Preprint
    Author Robert Long
    Abstract Prior to the launch of Eleos AI Research, Robert Long wrote a document in order to communicate his views about AI welfare to his collaborators—to Kyle Fish, who was working closely with Rob at the time and provided extensive input on this document; and more broadly, to others interested in working on AI welfare. This document outlines the current thinking of Eleos AI Research on the potential moral patienthood, welfare, and rights of artificial intelligence (AI) systems. It lays out some of the relevant terminology and concepts that we use to think and communicate about these issues, and reviews existing approaches to evaluating AI systems for three features potentially relevant to moral patienthood: consciousness, sentience, and agency. Throughout, we emphasize the need for more thorough research and more precise evaluations, and conclude by identifying some promising research directions.
    Language en
    URL http://localhost:4321/post/key-concepts-and-current-beliefs-about-ai-moral-patienthood/
    Accessed 1/29/2025, 2:03:30 PM
    Date Added 1/29/2025, 2:03:30 PM
    Modified 1/29/2025, 2:04:31 PM

    Attachments

    • 20250127-Eleos-background-thinking-upload.pdf
    • Snapshot
  • Out-group animosity drives engagement on social media

    Item Type Journal Article
    Author Steve Rathje
    Author Jay J. Van Bavel
    Author Sander van der Linden
    Abstract There has been growing concern about the role social media plays in political polarization. We investigated whether out-group animosity was particularly successful at generating engagement on two of the largest social media platforms: Facebook and Twitter. Analyzing posts from news media accounts and US congressional members (n = 2,730,215), we found that posts about the political out-group were shared or retweeted about twice as often as posts about the in-group. Each individual term referring to the political out-group increased the odds of a social media post being shared by 67%. Out-group language consistently emerged as the strongest predictor of shares and retweets: the average effect size of out-group language was about 4.8 times as strong as that of negative affect language and about 6.7 times as strong as that of moral-emotional language—both established predictors of social media engagement. Language about the out-group was a very strong predictor of “angry” reactions (the most popular reactions across all datasets), and language about the in-group was a strong predictor of “love” reactions, reflecting in-group favoritism and out-group derogation. This out-group effect was not moderated by political orientation or social media platform, but stronger effects were found among political leaders than among news media accounts. In sum, out-group language is the strongest predictor of social media engagement across all relevant predictors measured, suggesting that social media may be creating perverse incentives for content expressing out-group animosity.
    Date 2021-06-29
    Library Catalog pnas.org (Atypon)
    URL https://www.pnas.org/doi/10.1073/pnas.2024292118
    Accessed 2/13/2025, 11:27:40 AM
    Extra Publisher: Proceedings of the National Academy of Sciences
    Volume 118
    Pages e2024292118
    Publication Proceedings of the National Academy of Sciences
    DOI 10.1073/pnas.2024292118
    Issue 26
    Date Added 2/13/2025, 11:27:40 AM
    Modified 2/13/2025, 11:27:40 AM

    Attachments

    • Full Text PDF
  • Propositional Interpretability in Artificial Intelligence

    Item Type Manuscript
    Author David J. Chalmers
    Abstract Mechanistic interpretability in artificial intelligence aims to explain AI behavior in human-understandable terms, with a particular focus on internal mechanisms. This paper introduces and defends propositional interpretability, which interprets an AI system’s internal states in terms of propositional attitudes—such as beliefs, desires, and probabilities—akin to those in human cognition. Propositional interpretability is crucial for AI safety, ethics, and cognitive science, offering insight into an AI system’s goals, decision-making processes, and world models. The paper outlines thought logging as a central challenge: systematically tracking an AI system’s propositional attitudes over time. Several existing interpretability methods—including causal tracing, probing, sparse auto-encoders, and chain-of-thought techniques—are assessed for their potential to contribute to thought logging. The discussion also engages with philosophical questions about AI psychology, psychosemantics, and externalism, ultimately arguing that propositional interpretability provides a powerful explanatory framework for understanding and evaluating AI systems.
    Library Catalog PhilPapers
    Date Added 1/27/2025, 8:31:59 PM
    Modified 1/29/2025, 2:22:47 PM

    Attachments

    • PDF
    • Snapshot
  • User-Driven Value Alignment: Understanding Users' Perceptions and Strategies for Addressing Biased and Discriminatory Statements in AI Companions

    Item Type Preprint
    Author Xianzhe Fan
    Author Qing Xiao
    Author Xuhui Zhou
    Author Jiaxin Pei
    Author Maarten Sap
    Author Zhicong Lu
    Author Hong Shen
    Abstract Large language model-based AI companions are increasingly viewed by users as friends or romantic partners, leading to deep emotional bonds. However, they can generate biased, discriminatory, and harmful outputs. Recently, users are taking the initiative to address these harms and re-align AI companions. We introduce the concept of user-driven value alignment, where users actively identify, challenge, and attempt to correct AI outputs they perceive as harmful, aiming to guide the AI to better align with their values. We analyzed 77 social media posts about discriminatory AI statements and conducted semi-structured interviews with 20 experienced users. Our analysis revealed six common types of discriminatory statements perceived by users, how users make sense of those AI behaviors, and seven user-driven alignment strategies, such as gentle persuasion and anger expression. We discuss implications for supporting user-driven value alignment in future AI systems, where users and their communities have greater agency.
    Date 2024-09-01
    Short Title User-Driven Value Alignment
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2409.00862
    Accessed 2/13/2025, 11:27:55 AM
    Extra arXiv:2409.00862 [cs]
    DOI 10.48550/arXiv.2409.00862
    Repository arXiv
    Archive ID arXiv:2409.00862
    Date Added 2/13/2025, 11:27:55 AM
    Modified 2/13/2025, 11:27:55 AM

    Tags:

    • Computer Science - Human-Computer Interaction

    Notes:

    • Comment: 17 pages, 1 figure

    Attachments

    • Preprint PDF
    • Snapshot
  • What Is It for a Machine Learning Model to Have a Capability?

    Item Type Journal Article
    Author Jacqueline Harding
    Author Nathaniel Sharadin
    Abstract What can contemporary machine learning (ML) models do? Given the proliferation of ML models in society, answering this question matters to a variety of stakeholders, both public and private. The evaluation of models’ capabilities is rapidly emerging as a key subfield of modern ML, buoyed by regulatory attention and government grants. Despite this, the notion of an ML model possessing a capability has not been interrogated: what are we saying when we say that a model is able to do something? And what sorts of evidence bear upon this question? In this paper, we aim to answer these questions, using the capabilities of large language models (LLMs) as a running example. Drawing on the large philosophical literature on abilities, we develop an account of ML models’ capabilities which can be usefully applied to the nascent science of model evaluation. Our core proposal is a conditional analysis of model abilities (CAMA): crudely, a machine learning model has a capability to X just when it would reliably succeed at doing X if it ‘tried’. The main contribution of the paper is making this proposal precise in the context of ML, resulting in an operationalisation of CAMA applicable to LLMs. We then put CAMA to work, showing that it can help make sense of various features of ML model evaluation practice, as well as suggest procedures for performing fair inter-model comparisons.
    Date 2024-07-09
    Language en
    Library Catalog DOI.org (Crossref)
    URL https://www.journals.uchicago.edu/doi/10.1086/732153
    Accessed 2/13/2025, 11:49:55 AM
    Pages 732153
    Publication The British Journal for the Philosophy of Science
    DOI 10.1086/732153
    Journal Abbr The British Journal for the Philosophy of Science
    ISSN 0007-0882, 1464-3537
    Date Added 2/13/2025, 11:49:55 AM
    Modified 2/13/2025, 11:49:55 AM

    Attachments

    • PDF