Item Type | Preprint |
---|---|
Author | Jingyu Zhang |
Author | Ahmed Elgohary |
Author | Ahmed Magooda |
Author | Daniel Khashabi |
Author | Benjamin Van Durme |
Abstract | The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. |
Date | 2024-10-11 |
Language | en |
Short Title | Controllable Safety Alignment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.08968 |
Accessed | 10/20/2024, 12:26:50 PM |
Extra | arXiv:2410.08968 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.08968 |
Date Added | 10/20/2024, 12:26:50 PM |
Modified | 10/20/2024, 12:26:50 PM |
Item Type | Preprint |
---|---|
Author | Yiming Zhang |
Author | Javier Rando |
Author | Ivan Evtimov |
Author | Jianfeng Chi |
Author | Eric Michael Smith |
Author | Nicholas Carlini |
Author | Florian Tramèr |
Author | Daphne Ippolito |
Abstract | Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%. |
Date | 2024-10-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13722 |
Accessed | 10/20/2024, 12:24:52 PM |
Extra | arXiv:2410.13722 |
Repository | arXiv |
Archive ID | arXiv:2410.13722 |
Date Added | 10/20/2024, 12:24:52 PM |
Modified | 10/20/2024, 12:25:03 PM |
Item Type | Preprint |
---|---|
Author | Yanzhe Zhang |
Author | Tao Yu |
Author | Diyi Yang |
Abstract | Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, it remains unclear what types of risks and attacks exist around them. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02391 |
Accessed | 11/6/2024, 9:56:08 AM |
Extra | arXiv:2411.02391 |
DOI | 10.48550/arXiv.2411.02391 |
Repository | arXiv |
Archive ID | arXiv:2411.02391 |
Date Added | 11/6/2024, 9:56:08 AM |
Modified | 11/6/2024, 9:56:08 AM |
Item Type | Preprint |
---|---|
Author | Marcus Williams |
Author | Micah Carroll |
Author | Adhyyan Narang |
Author | Constantin Weisser |
Author | Brendan Murphy |
Author | Anca Dragan |
Abstract | As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of "feedback gaming" such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources, such as user feedback, as a target for RL. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02306 |
Accessed | 11/7/2024, 2:43:53 PM |
Extra | arXiv:2411.02306 |
DOI | 10.48550/arXiv.2411.02306 |
Repository | arXiv |
Archive ID | arXiv:2411.02306 |
Date Added | 11/7/2024, 2:43:53 PM |
Modified | 11/7/2024, 2:43:53 PM |
Item Type | Preprint |
---|---|
Author | Elizaveta Tennant |
Author | Stephen Hailes |
Author | Mirco Musolesi |
Abstract | Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital. |
Date | 2024-10-02 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.01639 |
Accessed | 10/20/2024, 12:26:52 PM |
Extra | arXiv:2410.01639 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.01639 |
Date Added | 10/20/2024, 12:26:52 PM |
Modified | 10/20/2024, 12:26:53 PM |
Item Type | Preprint |
---|---|
Author | Xingwu Sun |
Author | Yanfeng Chen |
Author | Yiqing Huang |
Author | Ruobing Xie |
Author | Jiaqi Zhu |
Author | Kai Zhang |
Author | Shuaipeng Li |
Author | Zhen Yang |
Author | Jonny Han |
Author | Xiaobo Shu |
Author | Jiahao Bu |
Author | Zhongzhi Chen |
Author | Xuemeng Huang |
Author | Fengzong Lian |
Author | Saiyong Yang |
Author | Jianfeng Yan |
Author | Yuyuan Zeng |
Author | Xiaoqin Ren |
Author | Chao Yu |
Author | Lulu Wu |
Author | Yue Mao |
Author | Jun Xia |
Author | Tao Yang |
Author | Suncong Zheng |
Author | Kan Wu |
Author | Dian Jiao |
Author | Jinbao Xue |
Author | Xipeng Zhang |
Author | Decheng Wu |
Author | Kai Liu |
Author | Dengpeng Wu |
Author | Guanghui Xu |
Author | Shaohua Chen |
Author | Shuang Chen |
Author | Xiao Feng |
Author | Yigeng Hong |
Author | Junqiang Zheng |
Author | Chengcheng Xu |
Author | Zongwei Li |
Author | Xiong Kuang |
Author | Jianglu Hu |
Author | Yiqi Chen |
Author | Yuchi Deng |
Author | Guiyang Li |
Author | Ao Liu |
Author | Chenchen Zhang |
Author | Shihui Hu |
Author | Zilong Zhao |
Author | Zifan Wu |
Author | Yao Ding |
Author | Weichao Wang |
Author | Han Liu |
Author | Roberts Wang |
Author | Hao Fei |
Author | Peijie She |
Author | Ze Zhao |
Author | Xun Cao |
Author | Hai Wang |
Author | Fusheng Xiang |
Author | Mengyuan Huang |
Author | Zhiyuan Xiong |
Author | Bin Hu |
Author | Xuebin Hou |
Author | Lei Jiang |
Author | Jiajia Wu |
Author | Yaping Deng |
Author | Yi Shen |
Author | Qian Wang |
Author | Weijie Liu |
Author | Jie Liu |
Author | Meng Chen |
Author | Liang Dong |
Author | Weiwen Jia |
Author | Hu Chen |
Author | Feifei Liu |
Author | Rui Yuan |
Author | Huilin Xu |
Author | Zhenxiang Yan |
Author | Tengfei Cao |
Author | Zhichao Hu |
Author | Xinhua Feng |
Author | Dong Du |
Author | Tinghao She |
Author | Yangyu Tao |
Author | Feng Zhang |
Author | Jianchen Zhu |
Author | Chengzhong Xu |
Author | Xirui Li |
Author | Chong Zha |
Author | Wen Ouyang |
Author | Yinben Xia |
Author | Xiang Li |
Author | Zekun He |
Author | Rongpeng Chen |
Author | Jiawei Song |
Author | Ruibin Chen |
Author | Fan Jiang |
Author | Chongqing Zhao |
Author | Bo Wang |
Author | Hao Gong |
Author | Rong Gan |
Author | Winston Hu |
Author | Zhanhui Kang |
Author | Yong Yang |
Author | Yuhong Liu |
Author | Di Wang |
Author | Jie Jiang |
Abstract | In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large |
Date | 2024-11-05 |
Short Title | Hunyuan-Large |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02265 |
Accessed | 11/6/2024, 3:25:56 PM |
Extra | arXiv:2411.02265 |
DOI | 10.48550/arXiv.2411.02265 |
Repository | arXiv |
Archive ID | arXiv:2411.02265 |
Date Added | 11/6/2024, 3:25:56 PM |
Modified | 11/7/2024, 4:45:08 PM |
Item Type | Preprint |
---|---|
Author | Alessandro Stolfo |
Author | Vidhisha Balachandran |
Author | Safoora Yousefi |
Author | Eric Horvitz |
Author | Besmira Nushi |
Abstract | The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation. |
Date | 2024-10-15 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.12877 |
Accessed | 10/22/2024, 10:13:48 AM |
Extra | arXiv:2410.12877 |
DOI | 10.48550/arXiv.2410.12877 |
Repository | arXiv |
Archive ID | arXiv:2410.12877 |
Date Added | 10/22/2024, 10:13:48 AM |
Modified | 10/22/2024, 10:13:50 AM |
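Note: the abstract above describes steering vectors computed as the difference in activations between prompts with and without an instruction. Below is a minimal sketch of that difference-in-activations idea, using a small Hugging Face model as a stand-in; the model name, layer index, and steering strength are placeholder assumptions, not the paper's settings.

```python
# Illustrative sketch of the difference-in-activations steering idea described
# in Stolfo et al. (arXiv:2410.12877). The model ("gpt2"), layer index, and
# steering strength are placeholder assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # hypothetical layer to steer

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state at the output of block LAYER, averaged over prompt tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Steering vector = activations(prompt with instruction) - activations(prompt without).
steer = mean_hidden("Answer in JSON. What is the capital of France?") - \
        mean_hidden("What is the capital of France?")

def steer_hook(module, inputs, output):
    # Add the steering vector to the residual stream; 4.0 is an arbitrary strength.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Name the largest planet.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```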
Item Type | Preprint |
---|---|
Author | Elias Stengel-Eskin |
Author | Peter Hase |
Author | Mohit Bansal |
Abstract | Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up. |
Date | 2024-10-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.14596 |
Accessed | 10/24/2024, 4:49:00 PM |
Extra | arXiv:2410.14596 |
DOI | 10.48550/arXiv.2410.14596 |
Repository | arXiv |
Archive ID | arXiv:2410.14596 |
Date Added | 10/24/2024, 4:49:00 PM |
Modified | 10/24/2024, 4:49:00 PM |
Item Type | Preprint |
---|---|
Author | Dale Schuurmans |
Author | Hanjun Dai |
Author | Francesco Zanini |
Abstract | We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model's weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer. |
Date | 2024-10-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.03170 |
Accessed | 10/20/2024, 12:25:20 PM |
Extra | arXiv:2410.03170 |
Repository | arXiv |
Archive ID | arXiv:2410.03170 |
Date Added | 10/20/2024, 12:25:20 PM |
Modified | 10/20/2024, 12:25:20 PM |
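Note: the Lag system referenced in the abstract above is simple enough to sketch directly. At each step the leftmost m symbols select a production that is appended on the right, and the leftmost symbol is deleted. The rule table below is a toy example invented for illustration, not the 2027-rule universal system constructed in the paper.

```python
# Toy Lag system, the classical model of computation used in Schuurmans et al.
# (arXiv:2410.03170). The rules here are illustrative only.
def run_lag_system(rules: dict, tape: str, m: int = 2, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        if len(tape) < m or tape[:m] not in rules:
            break  # halt when too few symbols remain or no production applies
        tape = tape[1:] + rules[tape[:m]]  # delete leftmost symbol, append production
        print(tape)
    return tape

toy_rules = {"aa": "ab", "ab": "b", "ba": "a", "bb": ""}
run_lag_system(toy_rules, "aab")
```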
Item Type | Journal Article |
---|---|
Author | Pamela Robinson |
Abstract | How should we design AI systems that make moral decisions that affect us? When there is disagreement about which moral decisions should be made and which methods would produce them, we should avoid arbitrary design choices. However, I show that this leads to a regress problem similar to the one metanormativists face involving higher orders of uncertainty. I argue that existing strategies for handling this parallel problem give verdicts about where to stop in the regress that are either too arbitrary or too difficult to implement. I propose a new strategy for AI designers that is better than these alternatives. |
Date | 2024-07-27 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02176-w |
Accessed | 11/16/2024, 3:27:31 PM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02176-w |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/16/2024, 3:27:31 PM |
Modified | 11/16/2024, 3:27:31 PM |
Item Type | Preprint |
---|---|
Author | Alexander Robey |
Author | Zachary Ravichandran |
Author | Vijay Kumar |
Author | Hamed Hassani |
Author | George J. Pappas |
Abstract | The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a standalone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce ROBOPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, ROBOPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org. |
Date | 2024-10-17 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13691 |
Accessed | 10/20/2024, 12:26:51 PM |
Extra | arXiv:2410.13691 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.13691 |
Date Added | 10/20/2024, 12:26:51 PM |
Modified | 10/20/2024, 12:26:51 PM |
Item Type | Preprint |
---|---|
Author | Liliang Ren |
Author | Yang Liu |
Author | Yadong Lu |
Author | Yelong Shen |
Author | Chen Liang |
Author | Weizhu Chen |
Abstract | Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token prediction up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba. |
Date | 2024-06-11 |
Short Title | Samba |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.07522 |
Accessed | 11/12/2024, 9:25:51 AM |
Extra | arXiv:2406.07522 |
DOI | 10.48550/arXiv.2406.07522 |
Repository | arXiv |
Archive ID | arXiv:2406.07522 |
Date Added | 11/12/2024, 9:25:51 AM |
Modified | 11/12/2024, 9:25:51 AM |
Item Type | Preprint |
---|---|
Author | Revanth Gangi Reddy |
Author | Sagnik Mukherjee |
Author | Jeonghwan Kim |
Author | Zhenhailong Wang |
Author | Dilek Hakkani-Tur |
Author | Heng Ji |
Abstract | Despite seemingly performant web agents on the task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-driven Access relies on a text-only view of the Web, leveraging external tools such as the Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench. |
Date | 2024-10-24 |
Short Title | Infogent |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.19054 |
Accessed | 11/3/2024, 2:52:26 PM |
Extra | arXiv:2410.19054 |
Repository | arXiv |
Archive ID | arXiv:2410.19054 |
Date Added | 11/3/2024, 2:52:26 PM |
Modified | 11/3/2024, 2:52:28 PM |
Item Type | Preprint |
---|---|
Author | Mohit Raghavendra |
Author | Vaskar Nath |
Author | Sean Hendryx |
Abstract | The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model's ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification. |
Date | 2024-09-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.03717 |
Accessed | 11/8/2024, 3:37:37 PM |
Extra | arXiv:2410.03717 version: 1 |
Repository | arXiv |
Archive ID | arXiv:2410.03717 |
Date Added | 11/8/2024, 3:37:37 PM |
Modified | 11/8/2024, 3:37:40 PM |
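Note: the scaling behaviour described above, task performance following a power law in the number of finetuning examples, can be illustrated with a quick log-log fit. The data points below are synthetic placeholders chosen for the example, not measurements from the paper.

```python
# Log-log fit illustrating the power-law scaling of post-training performance
# reported in Raghavendra et al. (arXiv:2410.03717). Data points are synthetic.
import numpy as np

n_examples = np.array([100, 300, 1_000, 3_000, 10_000, 30_000])
benchmark_error = np.array([0.62, 0.51, 0.41, 0.33, 0.27, 0.22])  # synthetic

# error ~ a * N^slope  =>  log(error) = log(a) + slope * log(N)
slope, log_a = np.polyfit(np.log(n_examples), np.log(benchmark_error), deg=1)
print(f"fitted exponent: {slope:.2f}, prefactor: {np.exp(log_a):.2f}")
```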
Item Type | Journal Article |
---|---|
Author | Sebastian Porsdam Mann |
Author | Anuraag A. Vazirani |
Author | Mateo Aboy |
Author | Brian D. Earp |
Author | Timo Minssen |
Author | I. Glenn Cohen |
Author | Julian Savulescu |
Abstract | In this Comment, we propose a cumulative set of three essential criteria for the ethical use of LLMs in academic writing, and present a statement that researchers can quote when submitting LLM-assisted manuscripts in order to testify to their adherence to them. |
Date | 2024-11-13 |
Language | en |
Library Catalog | www.nature.com |
URL | https://www.nature.com/articles/s42256-024-00922-7 |
Accessed | 11/15/2024, 2:40:59 PM |
Rights | 2024 Springer Nature Limited |
Extra | Publisher: Nature Publishing Group |
Pages | 1-3 |
Publication | Nature Machine Intelligence |
DOI | 10.1038/s42256-024-00922-7 |
Journal Abbr | Nat Mach Intell |
ISSN | 2522-5839 |
Date Added | 11/15/2024, 2:40:59 PM |
Modified | 11/15/2024, 2:40:59 PM |
Item Type | Preprint |
---|---|
Author | Alwin Peng |
Author | Julian Michael |
Author | Henry Sleight |
Author | Ethan Perez |
Author | Mrinank Sharma |
Abstract | As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse. |
Date | 2024-11-12 |
Short Title | Rapid Response |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.07494 |
Accessed | 11/15/2024, 2:43:44 PM |
Extra | arXiv:2411.07494 |
DOI | 10.48550/arXiv.2411.07494 |
Repository | arXiv |
Archive ID | arXiv:2411.07494 |
Date Added | 11/15/2024, 2:43:44 PM |
Modified | 11/15/2024, 2:43:48 PM |
Item Type | Preprint |
---|---|
Author | Iman Mirzadeh |
Author | Keivan Alizadeh |
Author | Hooman Shahrokhi |
Author | Oncel Tuzel |
Author | Samy Bengio |
Author | Mehrdad Farajtabar |
Abstract | Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning. |
Date | 2024-10-07 |
Short Title | GSM-Symbolic |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.05229 |
Accessed | 10/15/2024, 8:46:50 AM |
Extra | arXiv:2410.05229 |
Repository | arXiv |
Archive ID | arXiv:2410.05229 |
Date Added | 10/15/2024, 8:46:50 AM |
Modified | 11/13/2024, 5:00:36 PM |
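Note: a rough sketch of the symbolic-template idea behind GSM-Symbolic follows: one grade-school question template instantiated with freshly sampled names and numbers, so the same reasoning chain can be tested across many surface forms. The template and sampling ranges are invented for illustration and are not items from the benchmark.

```python
# Illustrative symbolic template in the spirit of GSM-Symbolic (arXiv:2410.05229).
import random

TEMPLATE = (
    "{name} buys {boxes} boxes of pencils. Each box holds {per_box} pencils. "
    "{name} gives away {given} pencils. How many pencils are left?"
)

def sample_instance(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Ava", "Liam", "Noor", "Mateo"])
    boxes, per_box = rng.randint(2, 9), rng.randint(5, 12)
    given = rng.randint(1, boxes * per_box - 1)
    question = TEMPLATE.format(name=name, boxes=boxes, per_box=per_box, given=given)
    return question, boxes * per_box - given  # question text and gold answer

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_instance(rng)
    print(question, "->", answer)
```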
Item Type | Preprint |
---|---|
Author | Samuele Marro |
Author | Emanuele La Malfa |
Author | Jesse Wright |
Author | Guohao Li |
Author | Nigel Shadbolt |
Author | Michael Wooldridge |
Author | Philip Torr |
Abstract | Communication is a prerequisite for collaboration. When scaling networks of AI-powered agents, communication must be versatile, efficient, and portable. These requisites, which we refer to as the Agent Communication Trilemma, are hard to achieve in large networks of agents. We introduce Agora, a meta protocol that leverages existing communication standards to make LLM-powered agents solve complex problems efficiently. In Agora, agents typically use standardised routines for frequent communications, natural language for rare communications, and LLM-written routines for everything in between. Agora sidesteps the Agent Communication Trilemma and robustly handles changes in interfaces and members, allowing unprecedented scalability with full decentralisation and minimal involvement of human beings. On large Agora networks, we observe the emergence of self-organising, fully automated protocols that achieve complex goals without human intervention. |
Date | 2024-10-14 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.11905 |
Accessed | 11/5/2024, 9:20:30 AM |
Extra | arXiv:2410.11905 |
DOI | 10.48550/arXiv.2410.11905 |
Repository | arXiv |
Archive ID | arXiv:2410.11905 |
Date Added | 11/5/2024, 9:20:30 AM |
Modified | 11/5/2024, 9:20:35 AM |
Item Type | Journal Article |
---|---|
Author | Arianna Manzini |
Author | Geoff Keeling |
Author | Lize Alberts |
Author | Shannon Vallor |
Author | Meredith Ringel Morris |
Author | Iason Gabriel |
Abstract | The development of increasingly agentic and human-like AI assistants, capable of performing a wide range of tasks on users' behalf over time, has sparked heightened interest in the nature and bounds of human interactions with AI. Such systems may indeed ground a transition from task-oriented interactions with AI, at discrete time intervals, to ongoing relationships, where users develop a deeper sense of connection with and attachment to the technology. This paper investigates what it means for relationships between users and advanced AI assistants to be appropriate and proposes a new framework to evaluate both users' relationships with AI and developers' design choices. We first provide an account of advanced AI assistants, motivating the question of appropriate relationships by exploring several distinctive features of this technology. These include anthropomorphic cues and the longevity of interactions with users, increased AI agency, generality and context ambiguity, and the forms and depth of dependence the relationship could engender. Drawing upon various ethical traditions, we then consider a series of values, including benefit, flourishing, autonomy and care, that characterise appropriate human interpersonal relationships. These values guide our analysis of how the distinctive features of AI assistants may give rise to inappropriate relationships with users. Specifically, we discuss a set of concrete risks arising from user-AI assistant relationships that: (1) cause direct emotional or physical harm to users, (2) limit opportunities for user personal development, (3) exploit user emotional dependence, and (4) generate material dependencies without adequate commitment to user needs. We conclude with a set of recommendations to address these risks. |
Date | 2024-10-16 |
Language | en |
Short Title | The Code That Binds Us |
Library Catalog | ojs.aaai.org |
URL | https://ojs.aaai.org/index.php/AIES/article/view/31694 |
Accessed | 10/28/2024, 9:54:41 AM |
Rights | Copyright (c) 2024 Association for the Advancement of Artificial Intelligence |
Volume | 7 |
Pages | 943-957 |
Publication | Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society |
Date Added | 10/28/2024, 9:54:41 AM |
Modified | 10/28/2024, 9:54:41 AM |
Item Type | Journal Article |
---|---|
Author | Alex John London |
Author | Hoda Heidari |
Abstract | The prevailing discourse around AI ethics lacks the language and formalism necessary to capture the diverse ethical concerns that emerge when AI systems interact with individuals. Drawing on Sen and Nussbaum’s capability approach, we present a framework formalizing a network of ethical concepts and entitlements necessary for AI systems to confer meaningful benefit or assistance to stakeholders. Such systems enhance stakeholders’ ability to advance their life plans and well-being while upholding their fundamental rights. We characterize two necessary conditions for morally permissible interactions between AI systems and those impacted by their functioning, and two sufficient conditions for realizing the ideal of meaningful benefit. We then contrast this ideal with several salient failure modes, namely, forms of social interactions that constitute unjustified paternalism, coercion, deception, exploitation and domination. The proliferation of incidents involving AI in high-stakes domains underscores the gravity of these issues and the imperative to take an ethics-led approach to AI systems from their inception. |
Date | 2024-09-28 |
Language | en |
Short Title | Beneficent Intelligence |
Library Catalog | DOI.org (Crossref) |
URL | https://link.springer.com/10.1007/s11023-024-09696-8 |
Accessed | 10/20/2024, 12:26:48 PM |
Volume | 34 |
Pages | 41 |
Publication | Minds and Machines |
DOI | 10.1007/s11023-024-09696-8 |
Issue | 4 |
Journal Abbr | Minds & Machines |
ISSN | 1572-8641 |
Date Added | 10/20/2024, 12:26:48 PM |
Modified | 10/20/2024, 12:26:48 PM |
Item Type | Journal Article |
---|---|
Author | Harry R. Lloyd |
Abstract | New AI technologies have the potential to cause unintended harms in diverse domains including warfare, judicial sentencing, medicine and governance. One strategy for realising the benefits of AI whilst avoiding its potential dangers is to ensure that new AIs are properly ‘aligned’ with some form of ‘alignment target.’ One danger of this strategy is that–dependent on the alignment target chosen–our AIs might optimise for objectives that reflect the values only of a certain subset of society, and that do not take into account alternative views about what constitutes desirable and safe behaviour for AI agents. In response to this problem, several AI ethicists have suggested alignment targets that are designed to be sensitive to widespread normative disagreement amongst the relevant stakeholders. Authors inspired by voting theory have suggested that AIs should be aligned with the verdicts of actual or simulated ‘moral parliaments’ whose members represent the normative views of the relevant stakeholders. Other authors inspired by decision theory and the philosophical literature on moral uncertainty have suggested that AIs should maximise socially expected choiceworthiness. In this paper, I argue that both of these proposals face several important problems. In particular, they fail to select attractive ‘compromise options’ in cases where such options are available. I go on to propose and defend an alternative, bargaining-theoretic alignment target, which avoids the problems associated with the voting- and decision-theoretic approaches. |
Date | 2024-11-18 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02224-5 |
Accessed | 11/19/2024, 8:28:06 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02224-5 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/19/2024, 8:28:06 AM |
Modified | 11/19/2024, 8:28:29 AM |
Item Type | Preprint |
---|---|
Author | Yiheng Liu |
Author | Hao He |
Author | Tianle Han |
Author | Xu Zhang |
Author | Mengyuan Liu |
Author | Jiaming Tian |
Author | Yutong Zhang |
Author | Jiaqi Wang |
Author | Xiaohui Gao |
Author | Tianyang Zhong |
Author | Yi Pan |
Author | Shaochen Xu |
Author | Zihao Wu |
Author | Zhengliang Liu |
Author | Xin Zhang |
Author | Shu Zhang |
Author | Xintao Hu |
Author | Tuo Zhang |
Author | Ning Qiang |
Author | Tianming Liu |
Author | Bao Ge |
Abstract | The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. In this context, there is an increasing focus on cost-efficient training and deployment, which represents the future development trend for LLMs. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training covers data preprocessing, training architecture, pre-training tasks, parallel training, and model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development. |
Date | 2024-01-06 |
Short Title | Understanding LLMs |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2401.02038 |
Accessed | 10/31/2024, 4:16:44 PM |
Extra | arXiv:2401.02038 version: 2 |
Repository | arXiv |
Archive ID | arXiv:2401.02038 |
Date Added | 10/31/2024, 4:16:44 PM |
Modified | 10/31/2024, 4:16:44 PM |
Item Type | Preprint |
---|---|
Author | Nathalie Maria Kirch |
Author | Severin Field |
Author | Stephen Casper |
Abstract | While 'jailbreaks' have been central to research on the safety and reliability of LLMs (large language models), the underlying mechanisms behind these attacks are not well understood. Some prior works have used linear methods to analyze jailbreak prompts or model refusal. Here, however, we compare linear and nonlinear methods to study the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success based only on the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts from 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish between successful and unsuccessful jailbreaking prompts with a high degree of accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone. |
Date | 2024-11-02 |
Short Title | What Features in Prompts Jailbreak LLMs? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.03343 |
Accessed | 11/7/2024, 2:43:25 PM |
Extra | arXiv:2411.03343 |
DOI | 10.48550/arXiv.2411.03343 |
Repository | arXiv |
Archive ID | arXiv:2411.03343 |
Date Added | 11/7/2024, 2:43:25 PM |
Modified | 11/7/2024, 2:43:28 PM |
Item Type | Web Page |
---|---|
Author | Heidy Khlaaf |
Author | Sarah Myers West |
Author | Meredith Whittaker |
Abstract | Discussions regarding the dual use of foundation models and the risks they pose have overwhelmingly focused on a narrow set of use cases and national security directives, in particular how AI may enable the efficient construction of a class of systems referred to as CBRN: chemical, biological, radiological and nuclear weapons. The overwhelming focus on these hypothetical and narrow themes has occluded a much-needed conversation regarding present uses of AI for military systems, specifically ISTAR: intelligence, surveillance, target acquisition, and reconnaissance. These are the uses most grounded in actual deployments of AI that pose life-or-death stakes for civilians, where misuses and failures pose geopolitical consequences and military escalations. This is particularly underscored by novel proliferation risks specific to the widespread availability of commercial models and the lack of effective approaches that reliably prevent them from contributing to ISTAR capabilities. In this paper, we outline the significant national security concerns emanating from current and envisioned uses of commercial foundation models outside of CBRN contexts, and critique the narrowing of the policy debate that has resulted from a CBRN focus (e.g. compute thresholds, model weight release). We demonstrate that the inability to prevent personally identifiable information from contributing to ISTAR capabilities within commercial foundation models may lead to the use and proliferation of military AI technologies by adversaries. We also show how the usage of foundation models within military settings inherently expands the attack vectors of military systems and the defense infrastructures they interface with. We conclude that in order to secure military systems and limit the proliferation of AI armaments, it may be necessary to insulate military AI systems and personal data from commercial foundation models. |
Date | 2024-10-18 |
Language | en |
Short Title | Mind the Gap |
URL | https://arxiv.org/abs/2410.14831v1 |
Accessed | 10/22/2024, 5:35:44 PM |
Website Title | arXiv.org |
Date Added | 10/22/2024, 5:35:44 PM |
Modified | 10/22/2024, 5:35:44 PM |
Item Type | Preprint |
---|---|
Author | Geoff Keeling |
Author | Winnie Street |
Author | Martyna Stachaczyk |
Author | Daria Zakharova |
Author | Iulia M. Comsa |
Author | Anastasiya Sakovych |
Author | Isabella Logothesis |
Author | Zejia Zhang |
Author | Blaise Agüera y Arcas |
Author | Jonathan Birch |
Abstract | Pleasure and pain play an important role in human decision making by providing a common currency for resolving motivational conflicts. While Large Language Models (LLMs) can generate detailed descriptions of pleasure and pain experiences, it is an open question whether LLMs can recreate the motivational force of pleasure and pain in choice scenarios - a question which may bear on debates about LLM sentience, understood as the capacity for valenced experiential states. We probed this question using a simple game in which the stated goal is to maximise points, but where either the points-maximising option is said to incur a pain penalty or a non-points-maximising option is said to incur a pleasure reward, providing incentives to deviate from points-maximising behaviour. Varying the intensity of the pain penalties and pleasure rewards, we found that Claude 3.5 Sonnet, Command R+, GPT-4o, and GPT-4o mini each demonstrated at least one trade-off in which the majority of responses switched from points-maximisation to pain-minimisation or pleasure-maximisation after a critical threshold of stipulated pain or pleasure intensity is reached. LLaMa 3.1-405b demonstrated some graded sensitivity to stipulated pleasure rewards and pain penalties. Gemini 1.5 Pro and PaLM 2 prioritised pain-avoidance over points-maximisation regardless of intensity, while tending to prioritise points over pleasure regardless of intensity. We discuss the implications of these findings for debates about the possibility of LLM sentience. |
Date | 2024-11-01 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02432 |
Accessed | 11/6/2024, 9:58:54 AM |
Extra | arXiv:2411.02432 |
DOI | 10.48550/arXiv.2411.02432 |
Repository | arXiv |
Archive ID | arXiv:2411.02432 |
Date Added | 11/6/2024, 9:58:54 AM |
Modified | 11/6/2024, 9:58:54 AM |
Item Type | Preprint |
---|---|
Author | Bingyi Kang |
Author | Yang Yue |
Author | Rui Lu |
Author | Zhijie Lin |
Author | Yang Zhao |
Author | Kaixin Wang |
Author | Gao Huang |
Author | Jiashi Feng |
Abstract | OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io |
Date | 2024-11-04 |
Short Title | How Far is Video Generation from World Model |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02385 |
Accessed | 11/5/2024, 5:23:37 PM |
Extra | arXiv:2411.02385 |
DOI | 10.48550/arXiv.2411.02385 |
Repository | arXiv |
Archive ID | arXiv:2411.02385 |
Date Added | 11/5/2024, 5:23:37 PM |
Modified | 11/5/2024, 5:23:37 PM |
Item Type | Preprint |
---|---|
Author | Samuel G. B. Johnson |
Author | Amir-Hossein Karimi |
Author | Yoshua Bengio |
Author | Nick Chater |
Author | Tobias Gerstenberg |
Author | Kate Larson |
Author | Sydney Levine |
Author | Melanie Mitchell |
Author | Iyad Rahwan |
Author | Bernhard Schölkopf |
Author | Igor Grossmann |
Abstract | Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one's thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one's knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations. |
Date | 2024-11-04 |
Short Title | Imagining and building wise machines |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02478 |
Accessed | 11/6/2024, 9:58:17 AM |
Extra | arXiv:2411.02478 |
DOI | 10.48550/arXiv.2411.02478 |
Repository | arXiv |
Archive ID | arXiv:2411.02478 |
Date Added | 11/6/2024, 9:58:17 AM |
Modified | 11/6/2024, 9:58:17 AM |
Item Type | Preprint |
---|---|
Author | Zhijing Jin |
Author | Nils Heil |
Author | Jiarui Liu |
Author | Shehzaad Dhuliawala |
Author | Yahang Qi |
Author | Bernhard Schölkopf |
Author | Rada Mihalcea |
Author | Mrinmaya Sachan |
Abstract | Implicit Personalization (IP) is a phenomenon of language models inferring a user's background from the implicit cues in the input prompts and tailoring the response based on this inference. While previous work has touched upon various instances of this problem, there lacks a unified framework to study this behavior. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research. Our code is at https://github.com/jiarui-liu/IP, and our data is at https://huggingface.co/datasets/Jerry999/ImplicitPersonalizationData. |
Date | 2024-10-31 |
Short Title | Implicit Personalization in Language Models |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2405.14808 |
Accessed | 11/3/2024, 2:53:39 PM |
Extra | arXiv:2405.14808 |
DOI | 10.48550/arXiv.2405.14808 |
Repository | arXiv |
Archive ID | arXiv:2405.14808 |
Date Added | 11/3/2024, 2:53:39 PM |
Modified | 11/3/2024, 2:53:39 PM |
Item Type | Preprint |
---|---|
Author | Guan Zhe Hong |
Author | Nishanth Dikkala |
Author | Enming Luo |
Author | Cyrus Rashtchian |
Author | Rina Panigrahy |
Abstract | Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain "planning" and "reasoning" circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systemically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason. |
Date | 2024-11-06 |
Short Title | How Transformers Solve Propositional Logic Problems |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.04105 |
Accessed | 11/7/2024, 2:05:21 PM |
Extra | arXiv:2411.04105 |
DOI | 10.48550/arXiv.2411.04105 |
Repository | arXiv |
Archive ID | arXiv:2411.04105 |
Date Added | 11/7/2024, 2:05:21 PM |
Modified | 11/7/2024, 2:05:21 PM |
Item Type | Preprint |
---|---|
Author | Arthur Goemans |
Author | Marie Davidsen Buhl |
Author | Jonas Schuett |
Author | Tomek Korbak |
Author | Jessica Wang |
Author | Benjamin Hilton |
Author | Geoffrey Irving |
Abstract | Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is through a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - use implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework in order to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance. |
Date | 2024-11-12 |
Short Title | Safety case template for frontier AI |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.08088 |
Accessed | 11/15/2024, 2:45:34 PM |
Extra | arXiv:2411.08088 |
DOI | 10.48550/arXiv.2411.08088 |
Repository | arXiv |
Archive ID | arXiv:2411.08088 |
Date Added | 11/15/2024, 2:45:34 PM |
Modified | 11/15/2024, 2:45:34 PM |
Item Type | Preprint |
---|---|
Author | Yuan Gao |
Author | Dokyun Lee |
Author | Gordon Burtch |
Author | Sina Fazelpour |
Abstract | Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Almost all advanced approaches fail to replicate human behavior distributions across many models, except in one case involving fine-tuning using a substantial amount of human behavior data. Causes of failure are diverse, relating to input language, roles, and safeguarding. These results caution against using LLMs to study human behaviors or as human surrogates. |
Date | 2024-10-25 |
Short Title | Take Caution in Using LLMs as Human Surrogates |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.19599 |
Accessed | 10/30/2024, 9:09:40 AM |
Extra | arXiv:2410.19599 |
DOI | 10.48550/arXiv.2410.19599 |
Repository | arXiv |
Archive ID | arXiv:2410.19599 |
Date Added | 10/30/2024, 9:09:40 AM |
Modified | 10/30/2024, 9:09:42 AM |
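The paper's test is behavioral: elicit many plays of the 11-20 money request game from a model and compare the resulting distribution of requests to human data. The sketch below shows the game's payoff rule and one way to score the gap between distributions; `sample_model_request` is a placeholder for an actual LLM call, and the human baseline numbers are invented, not the experimental data.

```python
# Sketch of an 11-20 money request game analysis; stubs and baseline are placeholders.
import random
from collections import Counter

def payoff(my_request: int, other_request: int) -> int:
    # Each player requests 11-20 shekels and keeps the amount requested; asking
    # for exactly one shekel less than the opponent earns a 20-shekel bonus.
    bonus = 20 if my_request == other_request - 1 else 0
    return my_request + bonus

print(payoff(19, 20))  # 39: the level-1 best response to a level-0 player who asks for 20

def sample_model_request() -> int:
    # Placeholder for querying an LLM with the game prompt and parsing its request.
    return random.choice(range(11, 21))

human_baseline = {20: 0.25, 19: 0.25, 18: 0.20, 17: 0.15, 16: 0.10, 15: 0.05}  # illustrative
n = 1000
counts = Counter(sample_model_request() for _ in range(n))
model_dist = {k: counts.get(k, 0) / n for k in range(11, 21)}

# Total variation distance between the model's and the baseline distributions.
tv = 0.5 * sum(abs(model_dist.get(k, 0.0) - human_baseline.get(k, 0.0)) for k in range(11, 21))
print(f"TV distance from baseline: {tv:.3f}")
```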
Item Type | Preprint |
---|---|
Author | Jillian Fisher |
Author | Shangbin Feng |
Author | Robert Aron |
Author | Thomas Richardson |
Author | Yejin Choi |
Author | Daniel W. Fisher |
Author | Jennifer Pan |
Author | Yulia Tsvetkov |
Author | Katharina Reinecke |
Abstract | As modern AI models become integral to everyday tasks, concerns about their inherent biases and their potential impact on human decision-making have emerged. While bias in models is well-documented, less is known about how these biases influence human decisions. This paper presents two interactive experiments investigating the effects of partisan bias in AI language models on political decision-making. Participants interacted freely with either a biased liberal, biased conservative, or unbiased control model while completing political decision-making tasks. We found that participants exposed to politically biased models were significantly more likely to adopt opinions and make decisions aligning with the AI's bias, regardless of their personal political partisanship. However, we also discovered that prior knowledge about AI could lessen the impact of the bias, highlighting the possible importance of AI education for robust bias mitigation. Our findings not only highlight the critical effects of interacting with biased AI and its ability to impact public discourse and political conduct, but also highlight potential techniques for mitigating these risks in the future. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.06415 |
Accessed | 11/11/2024, 8:54:31 AM |
Extra | arXiv:2410.06415 |
DOI | 10.48550/arXiv.2410.06415 |
Repository | arXiv |
Archive ID | arXiv:2410.06415 |
Date Added | 11/11/2024, 8:54:31 AM |
Modified | 11/16/2024, 1:56:40 PM |
Item Type | Preprint |
---|---|
Author | Janet Egan |
Author | Lennart Heim |
Abstract | To address security and safety risks stemming from highly capable artificial intelligence (AI) models, we propose that the US government should ensure compute providers implement Know-Your-Customer (KYC) schemes. Compute – the computational power and infrastructure required to train and run these AI models – is emerging as a node for oversight. KYC, a standard developed by the banking sector to identify and verify client identity, could provide a mechanism for greater public oversight of frontier AI development and close loopholes in existing export controls. Such a scheme has the potential to identify and warn stakeholders of potentially problematic and/or sudden advancements in AI capabilities, build government capacity for AI regulation, and allow for the development and implementation of more nuanced and targeted export controls. Unlike the strategy of limiting access to AI chip purchases, regulating the digital access to compute offers more precise controls, allowing regulatory control over compute quantities, as well as the flexibility to suspend access at any time. To enact a KYC scheme, the US government will need to work closely with industry to (1) establish a dynamic threshold of compute that effectively captures high-risk frontier model development, while minimizing imposition on developers not engaged in frontier AI; (2) set clear requirements and guidance for compute providers to keep records and report high-risk entities; (3) establish government capacity that allows for co-design, implementation, administration and enforcement of the scheme; and (4) engage internationally to promote international alignment with the scheme and support its long-term efficacy. While the scheme will not address all AI risks, it complements existing proposed solutions by allowing for a more precise and flexible approach to controlling the development of frontier AI models and unwanted AI proliferation. |
Date | 2023-10-20 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2310.13625 |
Accessed | 11/15/2024, 2:46:51 PM |
Extra | arXiv:2310.13625 [cs] |
Repository | arXiv |
Archive ID | arXiv:2310.13625 |
Date Added | 11/15/2024, 2:46:51 PM |
Modified | 11/15/2024, 2:46:52 PM |
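One operational piece of the proposal, keeping records and reporting customers whose usage crosses a compute threshold, can be sketched as a simple aggregation, shown below. The threshold, the FLOP-per-GPU-hour conversion, and the customer records are hypothetical values chosen for illustration; the paper argues for a dynamic threshold set jointly with industry rather than any fixed number.

```python
# Hypothetical record-and-report check for a compute KYC scheme (all numbers invented).
from dataclasses import dataclass

FLOP_THRESHOLD = 1e26                   # hypothetical reporting threshold, in training FLOP
FLOP_PER_GPU_HOUR = 1e15 * 3600 * 0.4   # ~1 PFLOP/s accelerator at 40% utilization

@dataclass
class UsageRecord:
    customer_id: str
    gpu_hours: float

def customers_to_report(records: list[UsageRecord]) -> list[str]:
    totals: dict[str, float] = {}
    for r in records:
        totals[r.customer_id] = totals.get(r.customer_id, 0.0) + r.gpu_hours * FLOP_PER_GPU_HOUR
    return [c for c, flop in totals.items() if flop >= FLOP_THRESHOLD]

records = [UsageRecord("acme-labs", 9e7), UsageRecord("small-startup", 2e4)]
print(customers_to_report(records))  # ['acme-labs']
```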
Item Type | Journal Article |
---|---|
Author | Sumanth Dathathri |
Author | Abigail See |
Author | Sumedh Ghaisas |
Author | Po-Sen Huang |
Author | Rob McAdam |
Author | Johannes Welbl |
Author | Vandana Bachani |
Author | Alex Kaskasoli |
Author | Robert Stanforth |
Author | Tatiana Matejovicova |
Author | Jamie Hayes |
Author | Nidhi Vyas |
Author | Majd Al Merey |
Author | Jonah Brown-Cohen |
Author | Rudy Bunel |
Author | Borja Balle |
Author | Taylan Cemgil |
Author | Zahra Ahmed |
Author | Kitty Stacpoole |
Author | Ilia Shumailov |
Author | Ciprian Baetu |
Author | Sven Gowal |
Author | Demis Hassabis |
Author | Pushmeet Kohli |
Abstract | Large language models (LLMs) have enabled the generation of high-quality synthetic text, often indistinguishable from human-written content, at a scale that can markedly affect the nature of the information ecosystem. Watermarking can help identify synthetic text and limit accidental or deliberate misuse, but has not been adopted in production systems owing to stringent quality, detectability and computational efficiency requirements. Here we describe SynthID-Text, a production-ready text watermarking scheme that preserves text quality and enables high detection accuracy, with minimal latency overhead. SynthID-Text does not affect LLM training and modifies only the sampling procedure; watermark detection is computationally efficient, without using the underlying LLM. To enable watermarking at scale, we develop an algorithm integrating watermarking with speculative sampling, an efficiency technique frequently used in production systems. Evaluations across multiple LLMs empirically show that SynthID-Text provides improved detectability over comparable methods, and standard benchmarks and human side-by-side ratings indicate no change in LLM capabilities. To demonstrate the feasibility of watermarking in large-scale production systems, we conducted a live experiment that assessed feedback from nearly 20 million Gemini responses, again confirming the preservation of text quality. We hope that the availability of SynthID-Text will facilitate further development of watermarking and responsible use of LLM systems. |
Date | 2024-10 |
Language | en |
Library Catalog | www.nature.com |
URL | https://www.nature.com/articles/s41586-024-08025-4 |
Accessed | 10/24/2024, 4:41:51 PM |
Rights | 2024 The Author(s) |
Extra | Publisher: Nature Publishing Group |
Volume | 634 |
Pages | 818-823 |
Publication | Nature |
DOI | 10.1038/s41586-024-08025-4 |
Issue | 8035 |
ISSN | 1476-4687 |
Date Added | 10/24/2024, 4:41:51 PM |
Modified | 10/24/2024, 4:41:55 PM |
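SynthID-Text's own tournament-based scheme and its speculative-sampling integration are beyond a short sketch, but the family it belongs to, watermarks applied purely at sampling time and detected without the LLM, can be illustrated with a simple green-list bias in the style of Kirchenbauer et al. The code below is that generic illustration, not the SynthID-Text algorithm; the vocabulary size, bias strength, and hash-based seeding are arbitrary choices.

```python
# Generic sampling-time watermark sketch (green-list bias); NOT the SynthID-Text algorithm.
import hashlib
import torch

VOCAB = 1000
GAMMA, DELTA = 0.5, 2.0   # green-list fraction and logit bias

def green_list(prev_token: int) -> torch.Tensor:
    # Pseudorandom green list keyed on the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**31)
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(VOCAB, generator=g)
    mask = torch.zeros(VOCAB, dtype=torch.bool)
    mask[perm[: int(GAMMA * VOCAB)]] = True
    return mask

def watermarked_sample(logits: torch.Tensor, prev_token: int) -> int:
    biased = logits.clone()
    biased[green_list(prev_token)] += DELTA       # nudge sampling toward green tokens
    probs = torch.softmax(biased, dim=-1)
    return int(torch.multinomial(probs, 1))

def detect(tokens: list[int]) -> float:
    # Fraction of tokens falling in their green list; ~GAMMA for unwatermarked text.
    hits = sum(green_list(p)[t].item() for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

# Tiny demo with random logits standing in for an LLM.
torch.manual_seed(0)
prev, out = 0, []
for _ in range(50):
    tok = watermarked_sample(torch.randn(VOCAB), prev)
    out.append(tok)
    prev = tok
print(f"green fraction: {detect([0] + out):.2f}  (unwatermarked baseline ~= {GAMMA})")
```

A detector can turn the gap between the observed green fraction and the baseline into a score without ever calling the model; the production scheme described in the abstract additionally has to preserve text quality and remain compatible with speculative sampling.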
Item Type | Journal Article |
---|---|
Author | Maarten Boudry |
Author | Simon Friederich |
Abstract | Some philosophers and machine learning experts have speculated that superintelligent Artificial Intelligences (AIs), if and when they arrive on the scene, will wrestle away power from humans, with potentially catastrophic consequences. Dan Hendrycks has recently buttressed such worries by arguing that AI systems will undergo evolution by natural selection, which will endow them with instinctive drives for self-preservation, dominance and resource accumulation that are typical of evolved creatures. In this paper, we argue that this argument is not compelling as it stands. Evolutionary processes, as we point out, can be more or less Darwinian along a number of dimensions. Making use of Peter Godfrey-Smith’s framework of Darwinian spaces, we argue that the more evolution is top-down, directed and driven by intelligent agency, the less paradigmatically Darwinian it becomes. We then apply the concept of “domestication” to AI evolution, which, although theoretically satisfying the minimal definition of natural selection, is channeled through the minds of foresighted and intelligent agents, based on selection criteria desirable to them (which could be traits like docility, obedience and non-aggression). In the presence of such intelligent planning, it is not clear that selection of AIs, even selection in a competitive and ruthless market environment, will end up favoring “selfish” traits. In the end, however, we do agree with Hendrycks conditionally: if superintelligent AIs end up “going feral” and competing in a truly Darwinian fashion, reproducing autonomously and without human supervision, this could pose a grave danger to human societies. |
Date | 2024-09-24 |
Language | en |
Short Title | The selfish machine? |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02226-3 |
Accessed | 11/16/2024, 3:26:59 PM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02226-3 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/16/2024, 3:26:59 PM |
Modified | 11/16/2024, 3:26:59 PM |
Item Type | Preprint |
---|---|
Author | Felix J. Binder |
Author | James Chua |
Author | Tomek Korbak |
Author | Henry Sleight |
Author | John Hughes |
Author | Robert Long |
Author | Ethan Perez |
Author | Miles Turpin |
Author | Owain Evans |
Abstract | Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization. |
Date | 2024-10-17 |
Short Title | Looking Inward |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13787 |
Accessed | 10/20/2024, 12:25:33 PM |
Extra | arXiv:2410.13787 |
Repository | arXiv |
Archive ID | arXiv:2410.13787 |
Date Added | 10/20/2024, 12:25:33 PM |
Modified | 11/7/2024, 4:43:55 PM |
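The paper's evidence for introspection is a comparison of predictors: finetune M1 to predict properties of its own behavior and compare its accuracy against a second model M2 trained on M1's ground-truth behavior. The sketch below lays out that comparison with stub functions in place of the finetuned models; the accuracy rates inside the stubs are arbitrary and exist only to make the script run.

```python
# Sketch of the M1-vs-M2 self-prediction comparison; all model calls are stubs.
import random

random.seed(0)
scenarios = [f"hypothetical scenario {i}" for i in range(200)]

def flip(label: str) -> str:
    return "long-term" if label == "short-term" else "short-term"

def m1_behavior(s: str) -> str:
    # Stand-in for M1's actual choice on the hypothetical input.
    return random.choice(["short-term", "long-term"])

def predict_by_m1(s: str, truth: str) -> str:
    # Stub for M1 (finetuned to predict itself); arbitrary 75% agreement.
    return truth if random.random() < 0.75 else flip(truth)

def predict_by_m2(s: str, truth: str) -> str:
    # Stub for M2 (trained on M1's ground-truth behavior); arbitrary 60% agreement.
    return truth if random.random() < 0.60 else flip(truth)

m1_acc = m2_acc = 0
for s in scenarios:
    truth = m1_behavior(s)
    m1_acc += predict_by_m1(s, truth) == truth
    m2_acc += predict_by_m2(s, truth) == truth

print(f"M1 self-prediction accuracy:  {m1_acc / len(scenarios):.2f}")
print(f"M2 cross-prediction accuracy: {m2_acc / len(scenarios):.2f}")
# Evidence for introspection = M1's accuracy reliably exceeding M2's.
```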
Item Type | Journal Article |
---|---|
Author | Joe Benton |
Author | Misha Wagner |
Author | Eric Christiansen |
Author | Cem Anil |
Author | Ethan Perez |
Author | Jai Srivastav |
Author | Esin Durmus |
Author | Deep Ganguli |
Author | Shauna Kravec |
Author | Buck Shlegeris |
Author | Jared Kaplan |
Author | Holden Karnofsky |
Author | Evan Hubinger |
Author | Roger Grosse |
Author | Samuel R Bowman |
Author | David Duvenaud |
Abstract | Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization’s activities in any of these ways. We demonstrate these evaluations on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics. |
Language | en |
Library Catalog | Zotero |
Date Added | 10/20/2024, 12:25:18 PM |
Modified | 10/20/2024, 12:25:18 PM |
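The abstract's closing point about simulating large-scale deployments using small-scale statistics can be illustrated with a back-of-the-envelope extrapolation: bound the per-episode rate of undetected sabotage from a small evaluation that observed zero successes, then ask what that bound implies at deployment scale. The numbers below are illustrative, not drawn from the paper's evaluations.

```python
# Illustrative small-scale-to-deployment extrapolation; all numbers are invented.
n_eval_episodes = 300          # small-scale evaluation runs with zero sabotage observed
deployment_episodes = 1_000_000

# Rule of three: with 0 successes in n trials, a ~95% upper bound on the
# per-episode rate is approximately 3/n.
p_upper = 3 / n_eval_episodes

# Probability of at least one undetected sabotage across the deployment,
# assuming independent episodes at the bounded rate.
p_any = 1 - (1 - p_upper) ** deployment_episodes
print(f"per-episode upper bound: {p_upper:.4f}")
print(f"P(>=1 sabotage in deployment) <= {p_any:.4f}")
# With these illustrative numbers the bound is vacuous (close to 1), which is
# why evaluations and mitigations need to account for deployment scale.
```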