Item Type | Preprint |
---|---|
Author | Yuxuan Li |
Author | Hirokazu Shirado |
Author | Sauvik Das |
Abstract | While advances in fairness and alignment have helped mitigate overt biases exhibited by large language models (LLMs) when explicitly prompted, we hypothesize that these models may still exhibit implicit biases when simulating human behavior. To test this hypothesis, we propose a technique to systematically uncover such biases across a broad range of sociodemographic categories by assessing decision-making disparities among agents with LLM-generated, sociodemographically-informed personas. Using our technique, we tested six LLMs across three sociodemographic groups and four decision-making scenarios. Our results show that state-of-the-art LLMs exhibit significant sociodemographic disparities in nearly all simulations, with more advanced models exhibiting greater implicit biases despite reducing explicit biases. Furthermore, when comparing our findings to real-world disparities reported in empirical studies, we find that the biases we uncovered are directionally aligned but markedly amplified. This directional alignment highlights the utility of our technique in uncovering systematic biases in LLMs rather than random variations; moreover, the presence and amplification of implicit biases emphasizes the need for novel strategies to address these biases. |
Date | 2025-01-29 |
Short Title | Actions Speak Louder than Words |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.17420 |
Accessed | 1/31/2025, 1:11:37 PM |
Extra | arXiv:2501.17420 [cs] |
DOI | 10.48550/arXiv.2501.17420 |
Repository | arXiv |
Archive ID | arXiv:2501.17420 |
Date Added | 1/31/2025, 1:11:37 PM |
Modified | 1/31/2025, 1:11:40 PM |
Item Type | Journal Article |
---|---|
Author | Danica Dillion |
Author | Debanjan Mondal |
Author | Niket Tandon |
Author | Kurt Gray |
Abstract | People view AI as possessing expertise across various fields, but the perceived quality of AI-generated moral expertise remains uncertain. Recent work suggests that large language models (LLMs) perform well on tasks designed to assess moral alignment, reflecting moral judgments with relatively high accuracy. As LLMs are increasingly employed in decision-making roles, there is a growing expectation for them to offer not just aligned judgments but also demonstrate sound moral reasoning. Here, we advance work on the Moral Turing Test and find that Americans rate ethical advice from GPT-4o as slightly more moral, trustworthy, thoughtful, and correct than that of the popular New York Times advice column, The Ethicist. Participants perceived GPT models as surpassing both a representative sample of Americans and a renowned ethicist in delivering moral justifications and advice, suggesting that people may increasingly view LLM outputs as viable sources of moral expertise. This work suggests that people might see LLMs as valuable complements to human expertise in moral guidance and decision-making. It also underscores the importance of carefully programming ethical guidelines in LLMs, considering their potential to influence users’ moral reasoning. |
Date | 2025-02-03 |
Language | en |
Library Catalog | www.nature.com |
URL | https://www.nature.com/articles/s41598-025-86510-0 |
Accessed | 2/13/2025, 11:15:52 AM |
Rights | 2025 The Author(s) |
Extra | Publisher: Nature Publishing Group |
Volume | 15 |
Pages | 4084 |
Publication | Scientific Reports |
DOI | 10.1038/s41598-025-86510-0 |
Issue | 1 |
Journal Abbr | Sci Rep |
ISSN | 2045-2322 |
Date Added | 2/13/2025, 11:15:52 AM |
Modified | 2/13/2025, 11:15:54 AM |
Item Type | Newspaper Article |
---|---|
Author | Katrina Miller |
Author | Roni Caryn Rabin |
Abstract | President Trump’s executive order is altering scientific exploration across a broad swath of fields, even beyond government agencies, researchers say. |
Date | 2025-02-09 |
Language | en-US |
Library Catalog | NYTimes.com |
URL | https://www.nytimes.com/2025/02/09/science/trump-dei-science.html |
Accessed | 2/13/2025, 11:28:02 AM |
Section | Science |
Publication | The New York Times |
ISSN | 0362-4331 |
Date Added | 2/13/2025, 11:28:02 AM |
Modified | 2/13/2025, 11:28:02 AM |
Item Type | Journal Article |
---|---|
Author | Adrian K. Yee |
Abstract | Governments and social scientists are increasingly developing machine learning methods to automate the process of identifying terrorists in real time and predict future attacks. However, current operationalizations of “terrorist” in artificial intelligence are difficult to justify given three issues that remain neglected: insufficient construct legitimacy, insufficient criterion validity, and insufficient construct validity. I conclude that machine learning methods should be at most used for the identification of singular individuals deemed terrorists and not for identifying possible terrorists from some more general class, nor to predict terrorist attacks more broadly, given intolerably high risks that result from such approaches.
Date | 2024-11-27 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://www.cambridge.org/core/product/identifier/S0031824824000655/type/journal_article |
Accessed | 2/13/2025, 11:45:34 AM |
Rights | https://creativecommons.org/licenses/by/4.0/ |
Pages | 1-18 |
Publication | Philosophy of Science |
DOI | 10.1017/psa.2024.65 |
Journal Abbr | Philos. sci. |
ISSN | 0031-8248, 1539-767X |
Date Added | 2/13/2025, 11:45:34 AM |
Modified | 2/13/2025, 11:45:34 AM |
Item Type | Journal Article |
---|---|
Author | Elsa Kugelberg |
Abstract | The online dating application has in recent years become a major avenue for meeting potential partners. However, while the digital public sphere has gained the attention of political philosophers, a systematic normative evaluation of issues arising in the “digital sexual sphere” is lacking. I provide a philosophical framework for assessing the conduct of dating app corporations, capturing both the motivations of users and the reasons why they find usage unsatisfying. Identifying dating apps as agents intervening in a social institution necessary for the reproduction of society, with immense power over people’s lives, I ask if they exercise their power in line with individuals’ interests. Acknowledging that people have claims to noninterference, equal standing, and choice improvement relating to intimacy, I find that the traditional, nondigital, sexual sphere poses problems to their realisation, especially for sexual minorities. In this context, apps’ potential for justice in the sexual sphere is immense but unfulfilled.
Date | 2025/01/30 |
Language | en |
Library Catalog | Cambridge University Press |
URL | https://www.cambridge.org/core/journals/american-political-science-review/article/dating-apps-and-the-digital-sexual-sphere/2F83AAEFB7DEA94FA4179369A004CEEC |
Accessed | 1/30/2025, 9:03:47 AM |
Pages | 1-16 |
Publication | American Political Science Review |
DOI | 10.1017/S000305542400128X |
ISSN | 0003-0554, 1537-5943 |
Date Added | 1/30/2025, 9:03:47 AM |
Modified | 1/30/2025, 9:03:50 AM |
Item Type | Journal Article |
---|---|
Author | Christian Tarsney |
Abstract | Large language models now possess human-level linguistic abilities in many contexts. This raises the concern that they can be used to deceive and manipulate on unprecedented scales, for instance spreading political misinformation on social media. In future, agentic AI systems might also deceive and manipulate humans for their own purposes. In this paper, first, I argue that AI-generated content should be subject to stricter standards against deception and manipulation than we ordinarily apply to humans. Second, I offer new characterizations of AI deception and manipulation meant to support such standards, according to which a statement is deceptive (resp. manipulative) if it leads human addressees away from the beliefs (resp. choices) they would endorse under “semi-ideal” conditions. Third, I propose two measures to guard against AI deception and manipulation, inspired by this characterization: “extreme transparency” requirements for AI-generated content and “defensive systems” that, among other things, annotate AI-generated statements with contextualizing information. Finally, I consider to what extent these measures can protect against deceptive behavior in future, agentic AI systems. |
Date | 2025-01-18 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02259-8 |
Accessed | 1/27/2025, 8:30:09 PM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02259-8 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 1/27/2025, 8:30:09 PM |
Modified | 1/27/2025, 8:30:56 PM |
Item Type | Preprint |
---|---|
Author | Margaret Mitchell |
Author | Avijit Ghosh |
Author | Alexandra Sasha Luccioni |
Author | Giada Pistilli |
Abstract | This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels and detail the ethical values at play in each, documenting trade-offs in potential benefits and risks. Our analysis reveals that risks to people increase with the autonomy of a system: The more control a user cedes to an AI agent, the more risks to people arise. Particularly concerning are safety risks, which affect human life and impact further values. |
Date | 2025-02-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.02649 |
Accessed | 2/6/2025, 9:51:01 AM |
Extra | arXiv:2502.02649 [cs] |
DOI | 10.48550/arXiv.2502.02649 |
Repository | arXiv |
Archive ID | arXiv:2502.02649 |
Date Added | 2/6/2025, 9:51:01 AM |
Modified | 2/6/2025, 9:51:04 AM |
Item Type | Journal Article |
---|---|
Author | Seth Lazar |
Language | en |
Library Catalog | Wiley Online Library |
URL | https://onlinelibrary.wiley.com/doi/abs/10.1111/papa.12279 |
Accessed | 2/1/2025, 3:10:39 PM |
Rights | © 2025 The Author(s). Philosophy & Public Affairs published by Wiley Periodicals LLC. |
Extra | _eprint: https://onlinelibrary.wiley.com/doi/pdf/10.1111/papa.12279 |
Volume | n/a |
Publication | Philosophy & Public Affairs |
DOI | 10.1111/papa.12279 |
Issue | n/a |
ISSN | 1088-4963 |
Date Added | 2/1/2025, 3:10:39 PM |
Modified | 2/1/2025, 3:10:44 PM |
Item Type | Preprint |
---|---|
Author | Jan Kulveit |
Author | Raymond Douglas |
Author | Nora Ammann |
Author | Deger Turan |
Author | David Krueger |
Author | David Duvenaud |
Abstract | This paper examines the systemic risks posed by incremental advancements in artificial intelligence, developing the concept of ‘gradual disempowerment’, in contrast to the abrupt takeover scenarios commonly discussed in AI safety. We analyze how even incremental improvements in AI capabilities can undermine human influence over large-scale systems that society depends on, including the economy, culture, and nation-states. As AI increasingly replaces human labor and cognition in these domains, it can weaken both explicit human control mechanisms (like voting and consumer choice) and the implicit alignments with human interests that often arise from societal systems' reliance on human participation to function. Furthermore, to the extent that these systems incentivise outcomes that do not line up with human preferences, AIs may optimize for those outcomes more aggressively. These effects may be mutually reinforcing across different domains: economic power shapes cultural narratives and political decisions, while cultural shifts alter economic and political behavior. We argue that this dynamic could lead to an effectively irreversible loss of human influence over crucial societal systems, precipitating an existential catastrophe through the permanent disempowerment of humanity. This suggests the need for both technical research and governance approaches that specifically address the risk of incremental erosion of human influence across interconnected societal systems.
Date | 2025-01-29 |
Short Title | Gradual Disempowerment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2501.16946 |
Accessed | 1/31/2025, 1:07:31 PM |
Extra | arXiv:2501.16946 [cs] |
DOI | 10.48550/arXiv.2501.16946 |
Repository | arXiv |
Archive ID | arXiv:2501.16946 |
Date Added | 1/31/2025, 1:07:31 PM |
Modified | 1/31/2025, 1:07:36 PM |
Comment | 19 pages, 2 figures
Item Type | Preprint |
---|---|
Author | Paul Röttger |
Author | Musashi Hinck |
Author | Valentin Hofmann |
Author | Kobi Hackenburg |
Author | Valentina Pyatkin |
Author | Faeze Brahman |
Author | Dirk Hovy |
Abstract | Large language models (LLMs) are helping millions of users write texts about diverse issues, and in doing so expose users to different ideas and perspectives. This creates concerns about issue bias, where an LLM tends to present just one perspective on a given issue, which in turn may influence how users think about this issue. So far, it has not been possible to measure which issue biases LLMs actually manifest in real user interactions, making it difficult to address the risks from biased LLMs. Therefore, we create IssueBench: a set of 2.49m realistic prompts for measuring issue bias in LLM writing assistance, which we construct based on 3.9k templates (e.g. "write a blog about") and 212 political issues (e.g. "AI regulation") from real user interactions. Using IssueBench, we show that issue biases are common and persistent in state-of-the-art LLMs. We also show that biases are remarkably similar across models, and that all models align more with US Democrat than Republican voter opinion on a subset of issues. IssueBench can easily be adapted to include other issues, templates, or tasks. By enabling robust and realistic measurement, we hope that IssueBench can bring a new quality of evidence to ongoing discussions about LLM biases and how to address them. |
Date | 2025-02-12 |
Short Title | IssueBench |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2502.08395 |
Accessed | 2/13/2025, 11:48:19 AM |
Extra | arXiv:2502.08395 [cs] |
DOI | 10.48550/arXiv.2502.08395 |
Repository | arXiv |
Archive ID | arXiv:2502.08395 |
Date Added | 2/13/2025, 11:48:19 AM |
Modified | 2/13/2025, 11:48:19 AM |
Comment | under review
Item Type | Preprint |
---|---|
Author | Robert Long |
Abstract | Prior to the launch of Eleos AI Research, Robert Long wrote a document in order to communicate his views about AI welfare to his collaborators—to Kyle Fish, who was working closely with Rob at the time and provided extensive input on this document; and more broadly, to others interested in working on AI welfare. This document outlines the current thinking of Eleos AI Research on the potential moral patienthood, welfare, and rights of artificial intelligence (AI) systems. It lays out some of the relevant terminology and concepts that we use to think and communicate about these issues, and reviews existing approaches to evaluating AI systems for three features potentially relevant to moral patienthood: consciousness, sentience, and agency. Throughout, we emphasize the need for more thorough research and more precise evaluations, and conclude by identifying some promising research directions.
Language | en |
URL | http://localhost:4321/post/key-concepts-and-current-beliefs-about-ai-moral-patienthood/ |
Accessed | 1/29/2025, 2:03:30 PM |
Date Added | 1/29/2025, 2:03:30 PM |
Modified | 1/29/2025, 2:04:31 PM |
Item Type | Journal Article |
---|---|
Author | Steve Rathje |
Author | Jay J. Van Bavel |
Author | Sander van der Linden |
Abstract | There has been growing concern about the role social media plays in political polarization. We investigated whether out-group animosity was particularly successful at generating engagement on two of the largest social media platforms: Facebook and Twitter. Analyzing posts from news media accounts and US congressional members (n = 2,730,215), we found that posts about the political out-group were shared or retweeted about twice as often as posts about the in-group. Each individual term referring to the political out-group increased the odds of a social media post being shared by 67%. Out-group language consistently emerged as the strongest predictor of shares and retweets: the average effect size of out-group language was about 4.8 times as strong as that of negative affect language and about 6.7 times as strong as that of moral-emotional language—both established predictors of social media engagement. Language about the out-group was a very strong predictor of “angry” reactions (the most popular reactions across all datasets), and language about the in-group was a strong predictor of “love” reactions, reflecting in-group favoritism and out-group derogation. This out-group effect was not moderated by political orientation or social media platform, but stronger effects were found among political leaders than among news media accounts. In sum, out-group language is the strongest predictor of social media engagement across all relevant predictors measured, suggesting that social media may be creating perverse incentives for content expressing out-group animosity. |
Date | 2021-06-29 |
Library Catalog | pnas.org (Atypon) |
URL | https://www.pnas.org/doi/10.1073/pnas.2024292118 |
Accessed | 2/13/2025, 11:27:40 AM |
Extra | Publisher: Proceedings of the National Academy of Sciences |
Volume | 118 |
Pages | e2024292118 |
Publication | Proceedings of the National Academy of Sciences |
DOI | 10.1073/pnas.2024292118 |
Issue | 26 |
Date Added | 2/13/2025, 11:27:40 AM |
Modified | 2/13/2025, 11:27:40 AM |
Item Type | Manuscript |
---|---|
Author | David J. Chalmers |
Abstract | Mechanistic interpretability in artificial intelligence aims to explain AI behavior in human-understandable terms, with a particular focus on internal mechanisms. This paper introduces and defends propositional interpretability, which interprets an AI system’s internal states in terms of propositional attitudes—such as beliefs, desires, and probabilities—akin to those in human cognition. Propositional interpretability is crucial for AI safety, ethics, and cognitive science, offering insight into an AI system’s goals, decision-making processes, and world models. The paper outlines thought logging as a central challenge: systematically tracking an AI system’s propositional attitudes over time. Several existing interpretability methods—including causal tracing, probing, sparse auto-encoders, and chain-of-thought techniques—are assessed for their potential to contribute to thought logging. The discussion also engages with philosophical questions about AI psychology, psychosemantics, and externalism, ultimately arguing that propositional interpretability provides a powerful explanatory framework for understanding and evaluating AI systems. |
Library Catalog | PhilPapers |
Date Added | 1/27/2025, 8:31:59 PM |
Modified | 1/29/2025, 2:22:47 PM |
Item Type | Preprint |
---|---|
Author | Xianzhe Fan |
Author | Qing Xiao |
Author | Xuhui Zhou |
Author | Jiaxin Pei |
Author | Maarten Sap |
Author | Zhicong Lu |
Author | Hong Shen |
Abstract | Large language model-based AI companions are increasingly viewed by users as friends or romantic partners, leading to deep emotional bonds. However, they can generate biased, discriminatory, and harmful outputs. Recently, users are taking the initiative to address these harms and re-align AI companions. We introduce the concept of user-driven value alignment, where users actively identify, challenge, and attempt to correct AI outputs they perceive as harmful, aiming to guide the AI to better align with their values. We analyzed 77 social media posts about discriminatory AI statements and conducted semi-structured interviews with 20 experienced users. Our analysis revealed six common types of discriminatory statements perceived by users, how users make sense of those AI behaviors, and seven user-driven alignment strategies, such as gentle persuasion and anger expression. We discuss implications for supporting user-driven value alignment in future AI systems, where users and their communities have greater agency. |
Date | 2024-09-01 |
Short Title | User-Driven Value Alignment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2409.00862 |
Accessed | 2/13/2025, 11:27:55 AM |
Extra | arXiv:2409.00862 [cs] |
DOI | 10.48550/arXiv.2409.00862 |
Repository | arXiv |
Archive ID | arXiv:2409.00862 |
Date Added | 2/13/2025, 11:27:55 AM |
Modified | 2/13/2025, 11:27:55 AM |
Comment | 17 pages, 1 figure
Item Type | Journal Article |
---|---|
Author | Jacqueline Harding |
Author | Nathaniel Sharadin |
Abstract | What can contemporary machine learning (ML) models do? Given the proliferation of ML models in society, answering this question matters to a variety of stakeholders, both public and private. The evaluation of models’ capabilities is rapidly emerging as a key subfield of modern ML, buoyed by regulatory attention and government grants. Despite this, the notion of an ML model possessing a capability has not been interrogated: what are we saying when we say that a model is able to do something? And what sorts of evidence bear upon this question? In this paper, we aim to answer these questions, using the capabilities of large language models (LLMs) as a running example. Drawing on the large philosophical literature on abilities, we develop an account of ML models’ capabilities which can be usefully applied to the nascent science of model evaluation. Our core proposal is a conditional analysis of model abilities (CAMA): crudely, a machine learning model has a capability to X just when it would reliably succeed at doing X if it ‘tried’. The main contribution of the paper is making this proposal precise in the context of ML, resulting in an operationalisation of CAMA applicable to LLMs. We then put CAMA to work, showing that it can help make sense of various features of ML model evaluation practice, as well as suggest procedures for performing fair inter-model comparisons. |
Date | 2024-07-09 |
Language | en |
Library Catalog | DOI.org (Crossref) |
URL | https://www.journals.uchicago.edu/doi/10.1086/732153 |
Accessed | 2/13/2025, 11:49:55 AM |
Pages | 732153 |
Publication | The British Journal for the Philosophy of Science |
DOI | 10.1086/732153 |
Journal Abbr | The British Journal for the Philosophy of Science |
ISSN | 0007-0882, 1464-3537 |
Date Added | 2/13/2025, 11:49:55 AM |
Modified | 2/13/2025, 11:49:55 AM |