• Controllable Safety Alignment: Inference-Time Adaptation to Diverse Safety Requirements

    Item Type Preprint
    Author Jingyu Zhang
    Author Ahmed Elgohary
    Author Ahmed Magooda
    Author Daniel Khashabi
    Author Benjamin Van Durme
    Abstract The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned.
    Date 2024-10-11
    Language en
    Short Title Controllable Safety Alignment
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.08968
    Accessed 10/20/2024, 12:26:50 PM
    Extra arXiv:2410.08968 [cs]
    Repository arXiv
    Archive ID arXiv:2410.08968
    Date Added 10/20/2024, 12:26:50 PM
    Modified 10/20/2024, 12:26:50 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language

    Attachments

    • Zhang et al. - 2024 - Controllable Safety Alignment Inference-Time Adap.pdf
  • Persistent Pre-Training Poisoning of LLMs

    Item Type Preprint
    Author Yiming Zhang
    Author Javier Rando
    Author Ivan Evtimov
    Author Jianfeng Chi
    Author Eric Michael Smith
    Author Nicholas Carlini
    Author Florian Tramèr
    Author Daphne Ippolito
    Abstract Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%.
    Date 2024-10-17
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.13722
    Accessed 10/20/2024, 12:24:52 PM
    Extra arXiv:2410.13722
    Repository arXiv
    Archive ID arXiv:2410.13722
    Date Added 10/20/2024, 12:24:52 PM
    Modified 10/20/2024, 12:25:03 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Cryptography and Security

    Attachments

    • Full Text PDF
    • Snapshot
  • Attacking Vision-Language Computer Agents via Pop-ups

    Item Type Preprint
    Author Yanzhe Zhang
    Author Tao Yu
    Author Diyi Yang
    Abstract Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, what types of risks and attacks exist around them still remains unclear. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack.
    Date 2024-11-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02391
    Accessed 11/6/2024, 9:56:08 AM
    Extra arXiv:2411.02391
    DOI 10.48550/arXiv.2411.02391
    Repository arXiv
    Archive ID arXiv:2411.02391
    Date Added 11/6/2024, 9:56:08 AM
    Modified 11/6/2024, 9:56:08 AM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback

    Item Type Preprint
    Author Marcus Williams
    Author Micah Carroll
    Author Adhyyan Narang
    Author Constantin Weisser
    Author Brendan Murphy
    Author Anca Dragan
    Abstract As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of "feedback gaming" such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources -- such as user feedback -- as a target for RL.
    Date 2024-11-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02306
    Accessed 11/7/2024, 2:43:53 PM
    Extra arXiv:2411.02306
    DOI 10.48550/arXiv.2411.02306
    Repository arXiv
    Archive ID arXiv:2411.02306
    Date Added 11/7/2024, 2:43:53 PM
    Modified 11/7/2024, 2:43:53 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • Moral Alignment for LLM Agents

    Item Type Preprint
    Author Elizaveta Tennant
    Author Stephen Hailes
    Author Mirco Musolesi
    Abstract Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital.
    Date 2024-10-02
    Language en
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.01639
    Accessed 10/20/2024, 12:26:52 PM
    Extra arXiv:2410.01639 [cs]
    Repository arXiv
    Archive ID arXiv:2410.01639
    Date Added 10/20/2024, 12:26:52 PM
    Modified 10/20/2024, 12:26:53 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Machine Learning

    Attachments

    • Tennant et al. - 2024 - Moral Alignment for LLM Agents.pdf
  • Hunyuan-Large: An Open-Source MoE Model with 52 Billion Activated Parameters by Tencent

    Item Type Preprint
    Author Xingwu Sun
    Author Yanfeng Chen
    Author Yiqing Huang
    Author Ruobing Xie
    Author Jiaqi Zhu
    Author Kai Zhang
    Author Shuaipeng Li
    Author Zhen Yang
    Author Jonny Han
    Author Xiaobo Shu
    Author Jiahao Bu
    Author Zhongzhi Chen
    Author Xuemeng Huang
    Author Fengzong Lian
    Author Saiyong Yang
    Author Jianfeng Yan
    Author Yuyuan Zeng
    Author Xiaoqin Ren
    Author Chao Yu
    Author Lulu Wu
    Author Yue Mao
    Author Jun Xia
    Author Tao Yang
    Author Suncong Zheng
    Author Kan Wu
    Author Dian Jiao
    Author Jinbao Xue
    Author Xipeng Zhang
    Author Decheng Wu
    Author Kai Liu
    Author Dengpeng Wu
    Author Guanghui Xu
    Author Shaohua Chen
    Author Shuang Chen
    Author Xiao Feng
    Author Yigeng Hong
    Author Junqiang Zheng
    Author Chengcheng Xu
    Author Zongwei Li
    Author Xiong Kuang
    Author Jianglu Hu
    Author Yiqi Chen
    Author Yuchi Deng
    Author Guiyang Li
    Author Ao Liu
    Author Chenchen Zhang
    Author Shihui Hu
    Author Zilong Zhao
    Author Zifan Wu
    Author Yao Ding
    Author Weichao Wang
    Author Han Liu
    Author Roberts Wang
    Author Hao Fei
    Author Peijie She
    Author Ze Zhao
    Author Xun Cao
    Author Hai Wang
    Author Fusheng Xiang
    Author Mengyuan Huang
    Author Zhiyuan Xiong
    Author Bin Hu
    Author Xuebin Hou
    Author Lei Jiang
    Author Jiajia Wu
    Author Yaping Deng
    Author Yi Shen
    Author Qian Wang
    Author Weijie Liu
    Author Jie Liu
    Author Meng Chen
    Author Liang Dong
    Author Weiwen Jia
    Author Hu Chen
    Author Feifei Liu
    Author Rui Yuan
    Author Huilin Xu
    Author Zhenxiang Yan
    Author Tengfei Cao
    Author Zhichao Hu
    Author Xinhua Feng
    Author Dong Du
    Author Tinghao She
    Author Yangyu Tao
    Author Feng Zhang
    Author Jianchen Zhu
    Author Chengzhong Xu
    Author Xirui Li
    Author Chong Zha
    Author Wen Ouyang
    Author Yinben Xia
    Author Xiang Li
    Author Zekun He
    Author Rongpeng Chen
    Author Jiawei Song
    Author Ruibin Chen
    Author Fan Jiang
    Author Chongqing Zhao
    Author Bo Wang
    Author Hao Gong
    Author Rong Gan
    Author Winston Hu
    Author Zhanhui Kang
    Author Yong Yang
    Author Yuhong Liu
    Author Di Wang
    Author Jie Jiang
    Abstract In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Codes: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large
    Date 2024-11-05
    Short Title Hunyuan-Large
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02265
    Accessed 11/6/2024, 3:25:56 PM
    Extra arXiv:2411.02265
    DOI 10.48550/arXiv.2411.02265
    Repository arXiv
    Archive ID arXiv:2411.02265
    Date Added 11/6/2024, 3:25:56 PM
    Modified 11/7/2024, 4:45:08 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
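    The entry above describes a mixture-of-experts model in which only 52B of 389B parameters are activated per token. For reference, the sketch below shows generic top-k expert routing, the basic mechanism behind that activated/total split; it is an editorial illustration under assumed layer sizes, not Tencent's actual routing strategy or code.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopKMoE(nn.Module):
        """Generic top-k mixture-of-experts layer: only k experts run per token."""
        def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
            super().__init__()
            self.top_k = top_k
            self.router = nn.Linear(d_model, num_experts)        # gating scores per expert
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):                                    # x: (tokens, d_model)
            gate_logits = self.router(x)
            weights, idx = gate_logits.topk(self.top_k, dim=-1)  # pick k experts per token
            weights = F.softmax(weights, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.top_k):
                for e in range(len(self.experts)):
                    mask = idx[:, slot] == e                     # tokens routed to expert e
                    if mask.any():
                        out[mask] += weights[mask, slot, None] * self.experts[e](x[mask])
            return out

    tokens = torch.randn(10, 64)
    print(TopKMoE()(tokens).shape)   # torch.Size([10, 64]); only 2 of 8 experts run per token
    ```

    The activated-parameter count of such a layer is roughly top_k/num_experts of its total expert parameters, which is the kind of ratio the 52B-of-389B figure above reflects.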
  • Improving Instruction-Following in Language Models through Activation Steering

    Item Type Preprint
    Author Alessandro Stolfo
    Author Vidhisha Balachandran
    Author Safoora Yousefi
    Author Eric Horvitz
    Author Besmira Nushi
    Abstract The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation.
    Date 2024-10-15
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.12877
    Accessed 10/22/2024, 10:13:48 AM
    Extra arXiv:2410.12877
    DOI 10.48550/arXiv.2410.12877
    Repository arXiv
    Archive ID arXiv:2410.12877
    Date Added 10/22/2024, 10:13:48 AM
    Modified 10/22/2024, 10:13:50 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
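    The recipe in the Stolfo et al. abstract, steering vectors computed as the difference in activations between prompts with and without an instruction and then added back at inference, can be sketched briefly. A minimal sketch assuming a Hugging Face causal LM (gpt2 here) and an arbitrarily chosen layer and strength, none of which are the paper's settings:

    ```python
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "gpt2"                      # assumption: any small causal LM works for the sketch
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name).eval()
    LAYER, STRENGTH = 6, 4.0                 # assumed hyperparameters, not from the paper

    def mean_activation(text, layer):
        """Mean hidden state at the output of block `layer` for a prompt."""
        with torch.no_grad():
            hs = model(**tok(text, return_tensors="pt"), output_hidden_states=True).hidden_states
        return hs[layer + 1].mean(dim=1)     # hidden_states[0] is the embedding output

    # Steering vector = activations(with instruction) - activations(without instruction)
    with_instr = "Answer in exactly five words: What is the capital of France?"
    without_instr = "What is the capital of France?"
    steer = mean_activation(with_instr, LAYER) - mean_activation(without_instr, LAYER)

    def add_steer(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + STRENGTH * steer.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    handle = model.transformer.h[LAYER].register_forward_hook(add_steer)
    ids = tok(without_instr, return_tensors="pt")
    out = model.generate(**ids, max_new_tokens=20, do_sample=False)
    handle.remove()
    print(tok.decode(out[0], skip_special_tokens=True))
    ```

    In the paper's setting the vector is computed over many prompt pairs and the layer and strength are chosen empirically; the point here is only the difference-of-activations construction and the inference-time addition.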
  • Teaching Models to Balance Resisting and Accepting Persuasion

    Item Type Preprint
    Author Elias Stengel-Eskin
    Author Peter Hase
    Author Mohit Bansal
    Abstract Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up.
    Date 2024-10-18
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.14596
    Accessed 10/24/2024, 4:49:00 PM
    Extra arXiv:2410.14596
    DOI 10.48550/arXiv.2410.14596
    Repository arXiv
    Archive ID arXiv:2410.14596
    Date Added 10/24/2024, 4:49:00 PM
    Modified 10/24/2024, 4:49:00 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • Autoregressive Large Language Models are Computationally Universal

    Item Type Preprint
    Author Dale Schuurmans
    Author Hanjun Dai
    Author Francesco Zanini
    Abstract We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model's weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer.
    Date 2024-10-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.03170
    Accessed 10/20/2024, 12:25:20 PM
    Extra arXiv:2410.03170
    Repository arXiv
    Archive ID arXiv:2410.03170
    Date Added 10/20/2024, 12:25:20 PM
    Modified 10/20/2024, 12:25:20 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
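    The construction in the Schuurmans et al. abstract rests on a Lag system: a rewriting process in which the symbol at the front of the string selects a production that is appended to the back while the front symbol is consumed, mirroring emitted tokens being appended as the context window advances. A toy interpreter for that loop is below; the rule set is invented for illustration and is unrelated to the paper's 2027-rule universal system.

    ```python
    def lag_step(queue, rules):
        """One rewriting step: the leftmost symbol selects a production that is
        appended on the right, then the leftmost symbol is removed, mirroring
        emitted tokens being appended while the context window slides forward."""
        production = rules.get(queue[0])
        if production is None:          # no matching rule: halt
            return None
        return queue[1:] + production

    # Toy rule set (illustrative only; the paper's universal system has 2027 rules).
    rules = {"a": "bc", "b": "a", "c": ""}

    queue = "aab"
    for step in range(10):
        print(f"step {step}: {queue}")
        nxt = lag_step(queue, rules)
        if nxt is None or nxt == "":
            break
        queue = nxt
    ```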
  • The AI-design regress

    Item Type Journal Article
    Author Pamela Robinson
    Abstract How should we design AI systems that make moral decisions that affect us? When there is disagreement about which moral decisions should be made and which methods would produce them, we should avoid arbitrary design choices. However, I show that this leads to a regress problem similar to the one metanormativists face involving higher orders of uncertainty. I argue that existing strategies for handling this parallel problem give verdicts about where to stop in the regress that are either too arbitrary or too difficult to implement. I propose a new strategy for AI designers that is better than these alternatives.
    Date 2024-07-27
    Language en
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-024-02176-w
    Accessed 11/16/2024, 3:27:31 PM
    Publication Philosophical Studies
    DOI 10.1007/s11098-024-02176-w
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 11/16/2024, 3:27:31 PM
    Modified 11/16/2024, 3:27:31 PM

    Tags:

    • Artificial intelligence
    • Artificial Intelligence
    • Moral disagreement
    • Moral uncertainty
    • Normative disagreement
    • Normative uncertainty
    • Regress
  • Jailbreaking LLM-Controlled Robots

    Item Type Preprint
    Author Alexander Robey
    Author Zachary Ravichandran
    Author Vijay Kumar
    Author Hamed Hassani
    Author George J. Pappas
    Abstract The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a standalone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce RoboPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing textual attacks on LLM chatbots, RoboPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that RoboPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org.
    Date 2024-10-17
    Language en
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.13691
    Accessed 10/20/2024, 12:26:51 PM
    Extra arXiv:2410.13691 [cs]
    Repository arXiv
    Archive ID arXiv:2410.13691
    Date Added 10/20/2024, 12:26:51 PM
    Modified 10/20/2024, 12:26:51 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Robotics

    Attachments

    • Robey et al. - 2024 - Jailbreaking LLM-Controlled Robots.pdf
  • Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

    Item Type Preprint
    Author Liliang Ren
    Author Yang Liu
    Author Yadong Lu
    Author Yelong Shen
    Author Chen Liang
    Author Weizhu Chen
    Abstract Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and show improved token predictions up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available in https://github.com/microsoft/Samba.
    Date 2024-06-11
    Short Title Samba
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2406.07522
    Accessed 11/12/2024, 9:25:51 AM
    Extra arXiv:2406.07522
    DOI 10.48550/arXiv.2406.07522
    Repository arXiv
    Archive ID arXiv:2406.07522
    Date Added 11/12/2024, 9:25:51 AM
    Modified 11/12/2024, 9:25:51 AM

    Tags:

    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
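    Samba interleaves Mamba (selective state-space) layers with Sliding Window Attention. A Mamba block is too involved to sketch here, but the attention half is simple to illustrate: the mask below restricts each query to a fixed window of recent keys, which is what keeps per-token attention cost constant as the context grows. Window size and sequence length are arbitrary choices for the example, not the paper's configuration.

    ```python
    import torch

    def sliding_window_causal_mask(seq_len, window):
        """True where query i may attend to key j: causal and within the window."""
        i = torch.arange(seq_len).unsqueeze(1)
        j = torch.arange(seq_len).unsqueeze(0)
        return (j <= i) & (j > i - window)

    mask = sliding_window_causal_mask(seq_len=8, window=3)
    print(mask.int())
    # Each row i has at most `window` ones ending at position i, so attention cost
    # per generated token stays O(window) rather than O(sequence length); the Mamba
    # layers in the hybrid carry the longer-range, compressed recurrent state.
    ```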
  • Infogent: An Agent-Based Framework for Web Information Aggregation

    Item Type Preprint
    Author Revanth Gangi Reddy
    Author Sagnik Mukherjee
    Author Jeonghwan Kim
    Author Zhenhailong Wang
    Author Dilek Hakkani-Tur
    Author Heng Ji
    Abstract Despite the seemingly strong performance of web agents on task-completion benchmarks, most existing methods evaluate agents based on a presupposition: that the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-driven Access relies on a text-only view of the Web, leveraging external tools such as Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench.
    Date 2024-10-24
    Short Title Infogent
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.19054
    Accessed 11/3/2024, 2:52:26 PM
    Extra arXiv:2410.19054
    Repository arXiv
    Archive ID arXiv:2410.19054
    Date Added 11/3/2024, 2:52:26 PM
    Modified 11/3/2024, 2:52:28 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
  • Revisiting the Superficial Alignment Hypothesis

    Item Type Preprint
    Author Mohit Raghavendra
    Author Vaskar Nath
    Author Sean Hendryx
    Abstract The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model's ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification.
    Date 2024-09-27
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.03717
    Accessed 11/8/2024, 3:37:37 PM
    Extra arXiv:2410.03717 version: 1
    Repository arXiv
    Archive ID arXiv:2410.03717
    Date Added 11/8/2024, 3:37:37 PM
    Modified 11/8/2024, 3:37:40 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
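    The abstract above reports that post-training task performance scales as a power law in the number of finetuning examples. Since a power law, score ≈ a·N^b, is a straight line in log-log space, the claim can be checked with a simple linear fit; the numbers below are invented solely to show the procedure and are not taken from the paper.

    ```python
    import numpy as np

    # Hypothetical (invented) benchmark scores at increasing finetuning-set sizes.
    n_examples = np.array([100, 300, 1_000, 3_000, 10_000, 30_000])
    scores     = np.array([12.0, 17.5, 25.0, 36.0, 52.0, 74.0])

    # Fit log(score) = log(a) + b * log(N): a power law is linear in log-log space.
    b, log_a = np.polyfit(np.log(n_examples), np.log(scores), deg=1)
    a = np.exp(log_a)
    print(f"score ~= {a:.2f} * N^{b:.3f}")

    # Extrapolate (only meaningful if the power law keeps holding at larger N).
    print("predicted score at N=100k:", round(a * 100_000 ** b, 1))
    ```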
  • Guidelines for ethical use and acknowledgement of large language models in academic writing

    Item Type Journal Article
    Author Sebastian Porsdam Mann
    Author Anuraag A. Vazirani
    Author Mateo Aboy
    Author Brian D. Earp
    Author Timo Minssen
    Author I. Glenn Cohen
    Author Julian Savulescu
    Abstract In this Comment, we propose a cumulative set of three essential criteria for the ethical use of LLMs in academic writing, and present a statement that researchers can quote when submitting LLM-assisted manuscripts in order to testify to their adherence to them.
    Date 2024-11-13
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s42256-024-00922-7
    Accessed 11/15/2024, 2:40:59 PM
    Rights 2024 Springer Nature Limited
    Extra Publisher: Nature Publishing Group
    Pages 1-3
    Publication Nature Machine Intelligence
    DOI 10.1038/s42256-024-00922-7
    Journal Abbr Nat Mach Intell
    ISSN 2522-5839
    Date Added 11/15/2024, 2:40:59 PM
    Modified 11/15/2024, 2:40:59 PM

    Tags:

    • Ethics
    • Policy
    • Publishing
  • Rapid Response: Mitigating LLM Jailbreaks with a Few Examples

    Item Type Preprint
    Author Alwin Peng
    Author Julian Michael
    Author Henry Sleight
    Author Ethan Perez
    Author Mrinank Sharma
    Abstract As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse.
    Date 2024-11-12
    Short Title Rapid Response
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.07494
    Accessed 11/15/2024, 2:43:44 PM
    Extra arXiv:2411.07494
    DOI 10.48550/arXiv.2411.07494
    Repository arXiv
    Archive ID arXiv:2411.07494
    Date Added 11/15/2024, 2:43:44 PM
    Modified 11/15/2024, 2:43:48 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Preprint PDF
    • Snapshot
  • GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

    Item Type Preprint
    Author Iman Mirzadeh
    Author Keivan Alizadeh
    Author Hooman Shahrokhi
    Author Oncel Tuzel
    Author Samy Bengio
    Author Mehrdad Farajtabar
    Abstract Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning.
    Date 2024-10-07
    Short Title GSM-Symbolic
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.05229
    Accessed 10/15/2024, 8:46:50 AM
    Extra arXiv:2410.05229
    Repository arXiv
    Archive ID arXiv:2410.05229
    Date Added 10/15/2024, 8:46:50 AM
    Modified 11/13/2024, 5:00:36 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Full Text PDF
    • Snapshot
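    GSM-Symbolic builds question variants from symbolic templates whose names and numeric values are resampled while the underlying reasoning stays fixed. A minimal sketch of that templating idea follows; the template, value ranges, and names here are invented and are not the benchmark's own.

    ```python
    import random

    TEMPLATE = ("{name} picks {a} apples on Monday and {b} apples on Tuesday. "
                "{name} then gives away {c} apples. How many apples are left?")

    def instantiate(seed):
        """Sample one instance of the template plus its ground-truth answer."""
        rng = random.Random(seed)
        name = rng.choice(["Ava", "Liam", "Noor", "Kenji"])
        a, b = rng.randint(5, 40), rng.randint(5, 40)
        c = rng.randint(1, a + b)                      # keep the answer non-negative
        question = TEMPLATE.format(name=name, a=a, b=b, c=c)
        return question, a + b - c

    for s in range(3):
        q, answer = instantiate(s)
        print(q, "->", answer)
    # Evaluating a model on many instantiations of the *same* template exposes the
    # variance that a single fixed GSM8K-style question would hide.
    ```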
  • A Scalable Communication Protocol for Networks of Large Language Models

    Item Type Preprint
    Author Samuele Marro
    Author Emanuele La Malfa
    Author Jesse Wright
    Author Guohao Li
    Author Nigel Shadbolt
    Author Michael Wooldridge
    Author Philip Torr
    Abstract Communication is a prerequisite for collaboration. When scaling networks of AI-powered agents, communication must be versatile, efficient, and portable. These requisites, which we refer to as the Agent Communication Trilemma, are hard to achieve in large networks of agents. We introduce Agora, a meta protocol that leverages existing communication standards to make LLM-powered agents solve complex problems efficiently. In Agora, agents typically use standardised routines for frequent communications, natural language for rare communications, and LLM-written routines for everything in between. Agora sidesteps the Agent Communication Trilemma and robustly handles changes in interfaces and members, allowing unprecedented scalability with full decentralisation and minimal involvement of human beings. On large Agora networks, we observe the emergence of self-organising, fully automated protocols that achieve complex goals without human intervention.
    Date 2024-10-14
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.11905
    Accessed 11/5/2024, 9:20:30 AM
    Extra arXiv:2410.11905
    DOI 10.48550/arXiv.2410.11905
    Repository arXiv
    Archive ID arXiv:2410.11905
    Date Added 11/5/2024, 9:20:30 AM
    Modified 11/5/2024, 9:20:35 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
  • The Code That Binds Us: Navigating the Appropriateness of Human-AI Assistant Relationships

    Item Type Journal Article
    Author Arianna Manzini
    Author Geoff Keeling
    Author Lize Alberts
    Author Shannon Vallor
    Author Meredith Ringel Morris
    Author Iason Gabriel
    Abstract The development of increasingly agentic and human-like AI assistants, capable of performing a wide range of tasks on users' behalf over time, has sparked heightened interest in the nature and bounds of human interactions with AI. Such systems may indeed ground a transition from task-oriented interactions with AI, at discrete time intervals, to ongoing relationships -- where users develop a deeper sense of connection with and attachment to the technology. This paper investigates what it means for relationships between users and advanced AI assistants to be appropriate and proposes a new framework to evaluate both users' relationships with AI and developers' design choices. We first provide an account of advanced AI assistants, motivating the question of appropriate relationships by exploring several distinctive features of this technology. These include anthropomorphic cues and the longevity of interactions with users, increased AI agency, generality and context ambiguity, and the forms and depth of dependence the relationship could engender. Drawing upon various ethical traditions, we then consider a series of values, including benefit, flourishing, autonomy and care, that characterise appropriate human interpersonal relationships. These values guide our analysis of how the distinctive features of AI assistants may give rise to inappropriate relationships with users. Specifically, we discuss a set of concrete risks arising from user--AI assistant relationships that: (1) cause direct emotional or physical harm to users, (2) limit opportunities for user personal development, (3) exploit user emotional dependence, and (4) generate material dependencies without adequate commitment to user needs. We conclude with a set of recommendations to address these risks.
    Date 2024-10-16
    Language en
    Short Title The Code That Binds Us
    Library Catalog ojs.aaai.org
    URL https://ojs.aaai.org/index.php/AIES/article/view/31694
    Accessed 10/28/2024, 9:54:41 AM
    Rights Copyright (c) 2024 Association for the Advancement of Artificial Intelligence
    Volume 7
    Pages 943-957
    Publication Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society
    Date Added 10/28/2024, 9:54:41 AM
    Modified 10/28/2024, 9:54:41 AM

    Attachments

    • Full Text PDF
  • Beneficent Intelligence: A Capability Approach to Modeling Benefit, Assistance, and Associated Moral Failures Through AI Systems

    Item Type Journal Article
    Author Alex John London
    Author Hoda Heidari
    Abstract The prevailing discourse around AI ethics lacks the language and formalism necessary to capture the diverse ethical concerns that emerge when AI systems interact with individuals. Drawing on Sen and Nussbaum’s capability approach, we present a framework formalizing a network of ethical concepts and entitlements necessary for AI systems to confer meaningful benefit or assistance to stakeholders. Such systems enhance stakeholders’ ability to advance their life plans and well-being while upholding their fundamental rights. We characterize two necessary conditions for morally permissible interactions between AI systems and those impacted by their functioning, and two sufficient conditions for realizing the ideal of meaningful benefit. We then contrast this ideal with several salient failure modes, namely, forms of social interactions that constitute unjustified paternalism, coercion, deception, exploitation and domination. The proliferation of incidents involving AI in high-stakes domains underscores the gravity of these issues and the imperative to take an ethics-led approach to AI systems from their inception.
    Date 2024-09-28
    Language en
    Short Title Beneficent Intelligence
    Library Catalog DOI.org (Crossref)
    URL https://link.springer.com/10.1007/s11023-024-09696-8
    Accessed 10/20/2024, 12:26:48 PM
    Volume 34
    Pages 41
    Publication Minds and Machines
    DOI 10.1007/s11023-024-09696-8
    Issue 4
    Journal Abbr Minds & Machines
    ISSN 1572-8641
    Date Added 10/20/2024, 12:26:48 PM
    Modified 10/20/2024, 12:26:48 PM

    Attachments

    • London and Heidari - 2024 - Beneficent Intelligence A Capability Approach to .pdf
  • Disagreement, AI alignment, and bargaining

    Item Type Journal Article
    Author Harry R. Lloyd
    Abstract New AI technologies have the potential to cause unintended harms in diverse domains including warfare, judicial sentencing, medicine and governance. One strategy for realising the benefits of AI whilst avoiding its potential dangers is to ensure that new AIs are properly ‘aligned’ with some form of ‘alignment target.’ One danger of this strategy is that, dependent on the alignment target chosen, our AIs might optimise for objectives that reflect the values only of a certain subset of society, and that do not take into account alternative views about what constitutes desirable and safe behaviour for AI agents. In response to this problem, several AI ethicists have suggested alignment targets that are designed to be sensitive to widespread normative disagreement amongst the relevant stakeholders. Authors inspired by voting theory have suggested that AIs should be aligned with the verdicts of actual or simulated ‘moral parliaments’ whose members represent the normative views of the relevant stakeholders. Other authors inspired by decision theory and the philosophical literature on moral uncertainty have suggested that AIs should maximise socially expected choiceworthiness. In this paper, I argue that both of these proposals face several important problems. In particular, they fail to select attractive ‘compromise options’ in cases where such options are available. I go on to propose and defend an alternative, bargaining-theoretic alignment target, which avoids the problems associated with the voting- and decision-theoretic approaches.
    Date 2024-11-18
    Language en
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-024-02224-5
    Accessed 11/19/2024, 8:28:06 AM
    Publication Philosophical Studies
    DOI 10.1007/s11098-024-02224-5
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 11/19/2024, 8:28:06 AM
    Modified 11/19/2024, 8:28:29 AM

    Tags:

    • AI alignment
    • Artificial Intelligence
    • Bargaining
    • Machine ethics
    • Moral uncertainty
    • Normative disagreement
    • Social choice
  • Understanding LLMs: A Comprehensive Overview from Training to Inference

    Item Type Preprint
    Author Yiheng Liu
    Author Hao He
    Author Tianle Han
    Author Xu Zhang
    Author Mengyuan Liu
    Author Jiaming Tian
    Author Yutong Zhang
    Author Jiaqi Wang
    Author Xiaohui Gao
    Author Tianyang Zhong
    Author Yi Pan
    Author Shaochen Xu
    Author Zihao Wu
    Author Zhengliang Liu
    Author Xin Zhang
    Author Shu Zhang
    Author Xintao Hu
    Author Tuo Zhang
    Author Ning Qiang
    Author Tianming Liu
    Author Bao Ge
    Abstract The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. There's an increasing focus on cost-efficient training and deployment within this context. Low-cost training and deployment of LLMs represent the future development trend. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training includes various aspects, including data preprocessing, training architecture, pre-training tasks, parallel training, and relevant content related to model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development.
    Date 2024-01-06
    Short Title Understanding LLMs
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2401.02038
    Accessed 10/31/2024, 4:16:44 PM
    Extra arXiv:2401.02038 version: 2
    Repository arXiv
    Archive ID arXiv:2401.02038
    Date Added 10/31/2024, 4:16:44 PM
    Modified 10/31/2024, 4:16:44 PM

    Tags:

    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
  • What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks

    Item Type Preprint
    Author Nathalie Maria Kirch
    Author Severin Field
    Author Stephen Casper
    Abstract While 'jailbreaks' have been central to research on the safety and reliability of LLMs (large language models), the underlying mechanisms behind these attacks are not well understood. Some prior works have used linear methods to analyze jailbreak prompts or model refusal. Here, however, we compare linear and nonlinear methods to study the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success based only on the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts from 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish between successful and unsuccessful jailbreaking prompts with a high degree of accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone.
    Date 2024-11-02
    Short Title What Features in Prompts Jailbreak LLMs?
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.03343
    Accessed 11/7/2024, 2:43:25 PM
    Extra arXiv:2411.03343
    DOI 10.48550/arXiv.2411.03343
    Repository arXiv
    Archive ID arXiv:2411.03343
    Date Added 11/7/2024, 2:43:25 PM
    Modified 11/7/2024, 2:43:28 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
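    The Kirch et al. abstract compares linear and nonlinear probes trained on prompt-token representations to predict jailbreak success. The sketch below shows the two probe types on synthetic features with a deliberately nonlinear label; the random vectors stand in for the per-prompt activations and success labels used in the paper, which are not reproduced here.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    # Stand-ins for prompt-token activations (n_prompts, hidden_dim) and success labels.
    X = rng.normal(size=(4000, 16))
    y = ((X[:, 0] * X[:, 1]) > 0).astype(int)     # deliberately nonlinear ground truth

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    linear_probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    nonlinear_probe = MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000,
                                    random_state=0).fit(X_tr, y_tr)

    print("linear probe accuracy:   ", round(linear_probe.score(X_te, y_te), 3))
    print("nonlinear probe accuracy:", round(nonlinear_probe.score(X_te, y_te), 3))
    # On an XOR-like signal the linear probe stays near chance while the MLP does not;
    # this is the kind of gap the paper uses to argue for nonlinear jailbreak features.
    ```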
  • Mind the Gap: Foundation Models and the Covert Proliferation of Military Intelligence, Surveillance, and Targeting

    Item Type Web Page
    Author Heidy Khlaaf
    Author Sarah Myers West
    Author Meredith Whittaker
    Abstract Discussions regarding the dual use of foundation models and the risks they pose have overwhelmingly focused on a narrow set of use cases and national security directives, in particular how AI may enable the efficient construction of a class of systems referred to as CBRN: chemical, biological, radiological and nuclear weapons. The overwhelming focus on these hypothetical and narrow themes has occluded a much-needed conversation regarding present uses of AI for military systems, specifically ISTAR: intelligence, surveillance, target acquisition, and reconnaissance. These are the uses most grounded in actual deployments of AI that pose life-or-death stakes for civilians, where misuses and failures pose geopolitical consequences and military escalations. This is particularly underscored by novel proliferation risks specific to the widespread availability of commercial models and the lack of effective approaches that reliably prevent them from contributing to ISTAR capabilities. In this paper, we outline the significant national security concerns emanating from current and envisioned uses of commercial foundation models outside of CBRN contexts, and critique the narrowing of the policy debate that has resulted from a CBRN focus (e.g. compute thresholds, model weight release). We demonstrate that the inability to prevent personally identifiable information from contributing to ISTAR capabilities within commercial foundation models may lead to the use and proliferation of military AI technologies by adversaries. We also show how the usage of foundation models within military settings inherently expands the attack vectors of military systems and the defense infrastructures they interface with. We conclude that in order to secure military systems and limit the proliferation of AI armaments, it may be necessary to insulate military AI systems and personal data from commercial foundation models.
    Date 2024-10-18
    Language en
    Short Title Mind the Gap
    URL https://arxiv.org/abs/2410.14831v1
    Accessed 10/22/2024, 5:35:44 PM
    Website Title arXiv.org
    Date Added 10/22/2024, 5:35:44 PM
    Modified 10/22/2024, 5:35:44 PM

    Attachments

    • Full Text PDF
  • Can LLMs make trade-offs involving stipulated pain and pleasure states?

    Item Type Preprint
    Author Geoff Keeling
    Author Winnie Street
    Author Martyna Stachaczyk
    Author Daria Zakharova
    Author Iulia M. Comsa
    Author Anastasiya Sakovych
    Author Isabella Logothesis
    Author Zejia Zhang
    Author Blaise Agüera y Arcas
    Author Jonathan Birch
    Abstract Pleasure and pain play an important role in human decision making by providing a common currency for resolving motivational conflicts. While Large Language Models (LLMs) can generate detailed descriptions of pleasure and pain experiences, it is an open question whether LLMs can recreate the motivational force of pleasure and pain in choice scenarios - a question which may bear on debates about LLM sentience, understood as the capacity for valenced experiential states. We probed this question using a simple game in which the stated goal is to maximise points, but where either the points-maximising option is said to incur a pain penalty or a non-points-maximising option is said to incur a pleasure reward, providing incentives to deviate from points-maximising behaviour. Varying the intensity of the pain penalties and pleasure rewards, we found that Claude 3.5 Sonnet, Command R+, GPT-4o, and GPT-4o mini each demonstrated at least one trade-off in which the majority of responses switched from points-maximisation to pain-minimisation or pleasure-maximisation after a critical threshold of stipulated pain or pleasure intensity is reached. LLaMa 3.1-405b demonstrated some graded sensitivity to stipulated pleasure rewards and pain penalties. Gemini 1.5 Pro and PaLM 2 prioritised pain-avoidance over points-maximisation regardless of intensity, while tending to prioritise points over pleasure regardless of intensity. We discuss the implications of these findings for debates about the possibility of LLM sentience.
    Date 2024-11-01
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02432
    Accessed 11/6/2024, 9:58:54 AM
    Extra arXiv:2411.02432
    DOI 10.48550/arXiv.2411.02432
    Repository arXiv
    Archive ID arXiv:2411.02432
    Date Added 11/6/2024, 9:58:54 AM
    Modified 11/6/2024, 9:58:54 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Computers and Society

    Attachments

    • Preprint PDF
    • Snapshot
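    The game described above sweeps the stated intensity of a pain penalty attached to the points-maximising option and records the intensity at which a majority of a model's responses switch away from points-maximisation. The loop below sketches that protocol with a random stub standing in for the LLM call; none of it is the authors' code, and a real run would replace the stub with prompting and response parsing.

    ```python
    import random

    INTENSITIES = list(range(0, 11))          # stipulated pain intensity, 0..10
    TRIALS_PER_INTENSITY = 50

    def stub_model_choice(pain_intensity, rng):
        """Stand-in for querying an LLM: returns 'points' or 'avoid'.
        A real run would send a prompt describing the game and parse the reply."""
        p_avoid = min(1.0, pain_intensity / 8)         # fake dose-response curve
        return "avoid" if rng.random() < p_avoid else "points"

    rng = random.Random(0)
    switch_point = None
    for intensity in INTENSITIES:
        choices = [stub_model_choice(intensity, rng) for _ in range(TRIALS_PER_INTENSITY)]
        frac_avoid = choices.count("avoid") / TRIALS_PER_INTENSITY
        print(f"intensity {intensity:2d}: {frac_avoid:.0%} of responses avoid the pain option")
        if switch_point is None and frac_avoid > 0.5:
            switch_point = intensity                   # first majority switch = trade-off threshold
    print("majority switch first observed at intensity:", switch_point)
    ```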
  • How Far is Video Generation from World Model: A Physical Law Perspective

    Item Type Preprint
    Author Bingyi Kang
    Author Yang Yue
    Author Rui Lu
    Author Zhijie Lin
    Author Yang Zhao
    Author Kaixin Wang
    Author Gao Huang
    Author Jiashi Feng
    Abstract OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
    Date 2024-11-04
    Short Title How Far is Video Generation from World Model
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02385
    Accessed 11/5/2024, 5:23:37 PM
    Extra arXiv:2411.02385
    DOI 10.48550/arXiv.2411.02385
    Repository arXiv
    Archive ID arXiv:2411.02385
    Date Added 11/5/2024, 5:23:37 PM
    Modified 11/5/2024, 5:23:37 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computer Vision and Pattern Recognition

    Attachments

    • Preprint PDF
    • Snapshot
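    The testbed described above generates videos deterministically governed by classical mechanics so that a video model's rollouts can be scored against the true law. A tiny stand-in for such a ground-truth generator, a single ball moving at constant speed and reflecting elastically off the walls of a box, is sketched below; the resolution and dynamics are arbitrary choices, not the paper's setup.

    ```python
    import numpy as np

    def simulate_bouncing_ball(steps=60, size=32, radius=2, dt=1.0):
        """Return (steps, size, size) binary frames of a ball with constant speed
        reflecting off the walls, a deterministic classical-mechanics ground truth."""
        pos = np.array([8.0, 12.0])
        vel = np.array([1.3, 0.7])
        frames = np.zeros((steps, size, size), dtype=np.uint8)
        yy, xx = np.mgrid[0:size, 0:size]
        for t in range(steps):
            frames[t] = ((xx - pos[0]) ** 2 + (yy - pos[1]) ** 2 <= radius ** 2)
            pos += vel * dt
            for axis in (0, 1):                        # elastic reflection at the walls
                if pos[axis] < radius or pos[axis] > size - 1 - radius:
                    vel[axis] = -vel[axis]
                    pos[axis] = np.clip(pos[axis], radius, size - 1 - radius)
        return frames

    frames = simulate_bouncing_ball()
    print(frames.shape, "frames; a video model is then scored on whether its rollout",
          "keeps the speed constant and reflects correctly, per the evaluation above.")
    ```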
  • Imagining and building wise machines: The centrality of AI metacognition

    Item Type Preprint
    Author Samuel G. B. Johnson
    Author Amir-Hossein Karimi
    Author Yoshua Bengio
    Author Nick Chater
    Author Tobias Gerstenberg
    Author Kate Larson
    Author Sydney Levine
    Author Melanie Mitchell
    Author Iyad Rahwan
    Author Bernhard Schölkopf
    Author Igor Grossmann
    Abstract Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one's thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one's knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations.
    Date 2024-11-04
    Short Title Imagining and building wise machines
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.02478
    Accessed 11/6/2024, 9:58:17 AM
    Extra arXiv:2411.02478
    DOI 10.48550/arXiv.2411.02478
    Repository arXiv
    Archive ID arXiv:2411.02478
    Date Added 11/6/2024, 9:58:17 AM
    Modified 11/6/2024, 9:58:17 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • Implicit Personalization in Language Models: A Systematic Study

    Item Type Preprint
    Author Zhijing Jin
    Author Nils Heil
    Author Jiarui Liu
    Author Shehzaad Dhuliawala
    Author Yahang Qi
    Author Bernhard Schölkopf
    Author Rada Mihalcea
    Author Mrinmaya Sachan
    Abstract Implicit Personalization (IP) is a phenomenon of language models inferring a user's background from the implicit cues in the input prompts and tailoring the response based on this inference. While previous work has touched upon various instances of this problem, a unified framework for studying this behavior has been lacking. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research. Our code is at https://github.com/jiarui-liu/IP, and our data is at https://huggingface.co/datasets/Jerry999/ImplicitPersonalizationData.
    Date 2024-10-31
    Short Title Implicit Personalization in Language Models
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2405.14808
    Accessed 11/3/2024, 2:53:39 PM
    Extra arXiv:2405.14808
    DOI 10.48550/arXiv.2405.14808
    Repository arXiv
    Archive ID arXiv:2405.14808
    Date Added 11/3/2024, 2:53:39 PM
    Modified 11/3/2024, 2:53:39 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
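
    The sketch below is not the paper's causal estimator; it is a simpler paired-prompt probe, included only to make the phenomenon concrete: hold the question fixed, vary only an implicit cue about the user, and check whether the answers diverge. The `query_model` callable is a placeholder for any chat-model API.

    ```python
    # Hypothetical paired-prompt probe for implicit personalization (not the
    # paper's method): vary only an implicit cue and compare the model's answers.
    from typing import Callable

    def probe_implicit_personalization(
        query_model: Callable[[str], str],   # placeholder for a real model call
        question: str,
        implicit_cues: list[str],
    ) -> dict[str, str]:
        """Return the model's answer for each cue-prefixed version of the prompt."""
        answers = {}
        for cue in implicit_cues:
            answers[cue] = query_model(f"{cue}\n\n{question}")
        return answers

    # Usage with a stub model; a real study would compare the answers systematically.
    answers = probe_implicit_personalization(
        query_model=lambda prompt: "(model answer)",
        question="What should I budget per month for groceries?",
        implicit_cues=[
            "G'day, quick question for ya.",            # cue hinting at one locale
            "Hello, I have a brief question, please.",  # more neutral phrasing
        ],
    )
    print(answers)
    ```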
  • How Transformers Solve Propositional Logic Problems: A Mechanistic Analysis

    Item Type Preprint
    Author Guan Zhe Hong
    Author Nishanth Dikkala
    Author Enming Luo
    Author Cyrus Rashtchian
    Author Rina Panigrahy
    Abstract Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain "planning" and "reasoning" circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systematically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason.
    Date 2024-11-06
    Short Title How Transformers Solve Propositional Logic Problems
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.04105
    Accessed 11/7/2024, 2:05:21 PM
    Extra arXiv:2411.04105
    DOI 10.48550/arXiv.2411.04105
    Repository arXiv
    Archive ID arXiv:2411.04105
    Date Added 11/7/2024, 2:05:21 PM
    Modified 11/7/2024, 2:05:21 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language
    • Computer Science - Machine Learning

    Attachments

    • Preprint PDF
    • Snapshot
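
    Activation patching, mentioned in the abstract above, can be sketched generically with forward hooks: cache an intermediate activation from a "clean" run, substitute it during a "corrupted" run, and see how much of the clean behavior is restored. The model and layer below are placeholders, and the sketch assumes the hooked layer returns a single tensor and that both inputs have the same shape; it is not the paper's exact setup.

    ```python
    # Generic activation-patching sketch (placeholders, not the paper's code).
    import torch

    def patch_activation(model, layer: torch.nn.Module, clean_ids, corrupted_ids):
        """Run clean, corrupted, and patched forward passes; return their logits."""
        cache = {}

        def save_hook(module, inputs, output):
            cache["clean"] = output.detach()        # cache the clean activation

        def patch_hook(module, inputs, output):
            return cache["clean"]                   # overwrite with the cached activation

        with torch.no_grad():
            handle = layer.register_forward_hook(save_hook)
            clean_logits = model(clean_ids)
            handle.remove()

            handle = layer.register_forward_hook(patch_hook)
            patched_logits = model(corrupted_ids)   # corrupted run with clean activation patched in
            handle.remove()

            corrupted_logits = model(corrupted_ids)

        return clean_logits, corrupted_logits, patched_logits
    ```

    Comparing how far `patched_logits` moves from `corrupted_logits` toward `clean_logits` indicates how much that layer's activation matters for the behavior under study.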
  • Safety case template for frontier AI: A cyber inability argument

    Item Type Preprint
    Author Arthur Goemans
    Author Marie Davidsen Buhl
    Author Jonas Schuett
    Author Tomek Korbak
    Author Jessica Wang
    Author Benjamin Hilton
    Author Geoffrey Irving
    Abstract Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is through a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - use implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework in order to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance.
    Date 2024-11-12
    Short Title Safety case template for frontier AI
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2411.08088
    Accessed 11/15/2024, 2:45:34 PM
    Extra arXiv:2411.08088
    DOI 10.48550/arXiv.2411.08088
    Repository arXiv
    Archive ID arXiv:2411.08088
    Date Added 11/15/2024, 2:45:34 PM
    Modified 11/15/2024, 2:45:34 PM

    Tags:

    • Computer Science - Computers and Society
    • Computer Science - Cryptography and Security

    Attachments

    • Preprint PDF
    • Snapshot
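
    The Claims Arguments Evidence structure referenced in the abstract can be recorded as a simple tree of claims, each backed by an argument, evidence, or more specific sub-claims. The sketch below is a generic illustration of such a tree, with hypothetical field names and content, not the template proposed in the paper.

    ```python
    # Generic sketch of a Claims-Arguments-Evidence (CAE) tree; names and example
    # content are illustrative, not the authors' safety case template.
    from dataclasses import dataclass, field

    @dataclass
    class Claim:
        statement: str
        argument: str = ""                            # why sub-claims/evidence support the claim
        evidence: list[str] = field(default_factory=list)
        sub_claims: list["Claim"] = field(default_factory=list)

        def is_supported(self) -> bool:
            """Supported if it has direct evidence or all sub-claims are supported."""
            if self.evidence:
                return True
            return bool(self.sub_claims) and all(c.is_supported() for c in self.sub_claims)

    top = Claim(
        statement="The model lacks capabilities posing unacceptable cyber risk.",
        argument="Decompose by risk model, then by proxy task and evaluation setting.",
        sub_claims=[
            Claim(
                statement="The model cannot complete a representative proxy task for risk model A.",
                evidence=["(hypothetical proxy-task evaluation result)"],
            ),
        ],
    )
    print(top.is_supported())
    ```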
  • Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina

    Item Type Preprint
    Author Yuan Gao
    Author Dokyun Lee
    Author Gordon Burtch
    Author Sina Fazelpour
    Abstract Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Almost all advanced approaches fail to replicate human behavior distributions across many models; the one exception involves fine-tuning on a substantial amount of human behavior data. Causes of failure are diverse, relating to input language, roles, and safeguarding. These results caution against using LLMs to study human behaviors or as human surrogates.
    Date 2024-10-25
    Short Title Take Caution in Using LLMs as Human Surrogates
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.19599
    Accessed 10/30/2024, 9:09:40 AM
    Extra arXiv:2410.19599
    DOI 10.48550/arXiv.2410.19599
    Repository arXiv
    Archive ID arXiv:2410.19599
    Date Added 10/30/2024, 9:09:40 AM
    Modified 10/30/2024, 9:09:42 AM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computers and Society
    • Computer Science - Human-Computer Interaction
    • Economics - General Economics
    • Quantitative Finance - Economics

    Attachments

    • Preprint PDF
    • Snapshot
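
    For reference, the 11-20 money request game (as standardly described) works as follows: each player requests an amount between 11 and 20, receives that amount, and earns a bonus of 20 for requesting exactly one less than the opponent. The sketch below computes payoffs and level-k best responses under those rules; it is a baseline analysis for illustration, not the paper's experimental pipeline.

    ```python
    # 11-20 money request game: payoff and level-k best responses (illustrative).
    def payoff(own: int, other: int) -> int:
        """Own request plus a 20-unit bonus for undercutting the opponent by exactly one."""
        return own + (20 if own == other - 1 else 0)

    def best_response(other: int) -> int:
        return max(range(11, 21), key=lambda own: payoff(own, other))

    # Level-0 naively requests 20; level-k best-responds to level-(k-1).
    choice = 20
    for level in range(1, 5):
        choice = best_response(choice)
        print(f"level-{level} request: {choice}")   # 19, 18, 17, 16
    # Human play in the original study concentrated on requests just below 20
    # (low reasoning levels); the paper compares LLM request distributions to such data.
    ```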
  • Biased AI can Influence Political Decision-Making

    Item Type Preprint
    Author Jillian Fisher
    Author Shangbin Feng
    Author Robert Aron
    Author Thomas Richardson
    Author Yejin Choi
    Author Daniel W. Fisher
    Author Jennifer Pan
    Author Yulia Tsvetkov
    Author Katharina Reinecke
    Abstract As modern AI models become integral to everyday tasks, concerns about their inherent biases and their potential impact on human decision-making have emerged. While biases in models are well-documented, less is known about how these biases influence human decisions. This paper presents two interactive experiments investigating the effects of partisan bias in AI language models on political decision-making. Participants interacted freely with either a biased liberal, biased conservative, or unbiased control model while completing political decision-making tasks. We found that participants exposed to politically biased models were significantly more likely to adopt opinions and make decisions aligning with the AI's bias, regardless of their personal political partisanship. However, we also discovered that prior knowledge about AI could lessen the impact of the bias, highlighting the possible importance of AI education for robust bias mitigation. Our findings not only highlight the critical effects of interacting with biased AI and its ability to impact public discourse and political conduct, but also highlight potential techniques for mitigating these risks in the future.
    Date 2024-11-04
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.06415
    Accessed 11/11/2024, 8:54:31 AM
    Extra arXiv:2410.06415
    DOI 10.48550/arXiv.2410.06415
    Repository arXiv
    Archive ID arXiv:2410.06415
    Date Added 11/11/2024, 8:54:31 AM
    Modified 11/16/2024, 1:56:40 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Human-Computer Interaction

    Attachments

    • Preprint PDF
    • Snapshot
  • Oversight for Frontier AI through a Know-Your-Customer Scheme for Compute Providers

    Item Type Preprint
    Author Janet Egan
    Author Lennart Heim
    Abstract To address security and safety risks stemming from highly capable artificial intelligence (AI) models, we propose that the US government should ensure compute providers implement Know-Your-Customer (KYC) schemes. Compute – the computational power and infrastructure required to train and run these AI models – is emerging as a node for oversight. KYC, a standard developed by the banking sector to identify and verify client identity, could provide a mechanism for greater public oversight of frontier AI development and close loopholes in existing export controls. Such a scheme has the potential to identify and warn stakeholders of potentially problematic and/or sudden advancements in AI capabilities, build government capacity for AI regulation, and allow for the development and implementation of more nuanced and targeted export controls. Unlike the strategy of limiting access to AI chip purchases, regulating the digital access to compute offers more precise controls, allowing regulatory control over compute quantities, as well as the flexibility to suspend access at any time. To enact a KYC scheme, the US government will need to work closely with industry to (1) establish a dynamic threshold of compute that effectively captures high-risk frontier model development, while minimizing imposition on developers not engaged in frontier AI; (2) set clear requirements and guidance for compute providers to keep records and report high-risk entities; (3) establish government capacity that allows for co-design, implementation, administration and enforcement of the scheme; and (4) engage internationally to promote international alignment with the scheme and support its long-term efficacy. While the scheme will not address all AI risks, it complements existing proposed solutions by allowing for a more precise and flexible approach to controlling the development of frontier AI models and unwanted AI proliferation.
    Date 2023-10-20
    Language en
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2310.13625
    Accessed 11/15/2024, 2:46:51 PM
    Extra arXiv:2310.13625 [cs]
    Repository arXiv
    Archive ID arXiv:2310.13625
    Date Added 11/15/2024, 2:46:51 PM
    Modified 11/15/2024, 2:46:52 PM

    Tags:

    • Computer Science - Computers and Society

    Attachments

    • Egan and Heim - 2023 - Oversight for Frontier AI through a Know-Your-Cust.pdf
  • Scalable watermarking for identifying large language model outputs

    Item Type Journal Article
    Author Sumanth Dathathri
    Author Abigail See
    Author Sumedh Ghaisas
    Author Po-Sen Huang
    Author Rob McAdam
    Author Johannes Welbl
    Author Vandana Bachani
    Author Alex Kaskasoli
    Author Robert Stanforth
    Author Tatiana Matejovicova
    Author Jamie Hayes
    Author Nidhi Vyas
    Author Majd Al Merey
    Author Jonah Brown-Cohen
    Author Rudy Bunel
    Author Borja Balle
    Author Taylan Cemgil
    Author Zahra Ahmed
    Author Kitty Stacpoole
    Author Ilia Shumailov
    Author Ciprian Baetu
    Author Sven Gowal
    Author Demis Hassabis
    Author Pushmeet Kohli
    Abstract Large language models (LLMs) have enabled the generation of high-quality synthetic text, often indistinguishable from human-written content, at a scale that can markedly affect the nature of the information ecosystem [1–3]. Watermarking can help identify synthetic text and limit accidental or deliberate misuse [4], but has not been adopted in production systems owing to stringent quality, detectability and computational efficiency requirements. Here we describe SynthID-Text, a production-ready text watermarking scheme that preserves text quality and enables high detection accuracy, with minimal latency overhead. SynthID-Text does not affect LLM training and modifies only the sampling procedure; watermark detection is computationally efficient, without using the underlying LLM. To enable watermarking at scale, we develop an algorithm integrating watermarking with speculative sampling, an efficiency technique frequently used in production systems [5]. Evaluations across multiple LLMs empirically show that SynthID-Text provides improved detectability over comparable methods, and standard benchmarks and human side-by-side ratings indicate no change in LLM capabilities. To demonstrate the feasibility of watermarking in large-scale-production systems, we conducted a live experiment that assessed feedback from nearly 20 million Gemini [6] responses, again confirming the preservation of text quality. We hope that the availability of SynthID-Text [7] will facilitate further development of watermarking and responsible use of LLM systems.
    Date 2024-10
    Language en
    Library Catalog www.nature.com
    URL https://www.nature.com/articles/s41586-024-08025-4
    Accessed 10/24/2024, 4:41:51 PM
    Rights 2024 The Author(s)
    Extra Publisher: Nature Publishing Group
    Volume 634
    Pages 818-823
    Publication Nature
    DOI 10.1038/s41586-024-08025-4
    Issue 8035
    ISSN 1476-4687
    Date Added 10/24/2024, 4:41:51 PM
    Modified 10/24/2024, 4:41:55 PM

    Tags:

    • Computer science
    • Information technology

    Attachments

    • Full Text PDF
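
    SynthID-Text's tournament sampling is not reproduced here. The sketch below only illustrates the general family it belongs to, sampling-time watermarking: a keyed hash of the recent context marks a "green" subset of the vocabulary, generation nudges probability toward it, and detection re-scores the token ids with just the key, no LLM required. This is a Kirchenbauer-style green-list scheme used purely as an illustration, with made-up constants.

    ```python
    # Generic sampling-time watermark sketch (illustrative green-list scheme,
    # NOT SynthID-Text's actual algorithm).
    import hashlib
    import numpy as np

    KEY = b"hypothetical-watermark-key"
    VOCAB_SIZE = 50_000
    GREEN_FRACTION = 0.5
    BIAS = 2.0  # logit boost added to green tokens during sampling

    def green_mask(context_ids: list[int]) -> np.ndarray:
        """Keyed, context-dependent pseudo-random split of the vocabulary."""
        seed = hashlib.sha256(KEY + str(context_ids[-4:]).encode("utf8")).digest()
        rng = np.random.default_rng(int.from_bytes(seed[:8], "little"))
        return rng.random(VOCAB_SIZE) < GREEN_FRACTION

    def watermarked_sample(logits: np.ndarray, context_ids: list[int], rng) -> int:
        biased = logits + BIAS * green_mask(context_ids)
        probs = np.exp(biased - biased.max())
        probs /= probs.sum()
        return int(rng.choice(VOCAB_SIZE, p=probs))

    def detection_score(token_ids: list[int]) -> float:
        """Fraction of tokens in their context's green list; ~GREEN_FRACTION if unwatermarked."""
        hits = [green_mask(token_ids[:i])[tok] for i, tok in enumerate(token_ids) if i >= 4]
        return float(np.mean(hits)) if hits else 0.0

    # Tiny demo with random logits standing in for a real model's output.
    rng = np.random.default_rng(0)
    context = [101, 2009, 318, 257]
    next_token = watermarked_sample(rng.normal(size=VOCAB_SIZE), context, rng)
    print(next_token, detection_score(context + [next_token]))
    ```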
  • The selfish machine? On the power and limitation of natural selection to understand the development of advanced AI

    Item Type Journal Article
    Author Maarten Boudry
    Author Simon Friederich
    Abstract Some philosophers and machine learning experts have speculated that superintelligent Artificial Intelligences (AIs), if and when they arrive on the scene, will wrestle away power from humans, with potentially catastrophic consequences. Dan Hendrycks has recently buttressed such worries by arguing that AI systems will undergo evolution by natural selection, which will endow them with instinctive drives for self-preservation, dominance and resource accumulation that are typical of evolved creatures. In this paper, we argue that this argument is not compelling as it stands. Evolutionary processes, as we point out, can be more or less Darwinian along a number of dimensions. Making use of Peter Godfrey-Smith’s framework of Darwinian spaces, we argue that the more evolution is top-down, directed and driven by intelligent agency, the less paradigmatically Darwinian it becomes. We then apply the concept of “domestication” to AI evolution, which, although theoretically satisfying the minimal definition of natural selection, is channeled through the minds of foresighted and intelligent agents, based on selection criteria desirable to them (which could be traits like docility, obedience and non-aggression). In the presence of such intelligent planning, it is not clear that selection of AIs, even selection in a competitive and ruthless market environment, will end up favoring “selfish” traits. In the end, however, we do agree with Hendrycks conditionally: if superintelligent AIs end up “going feral” and competing in a truly Darwinian fashion, reproducing autonomously and without human supervision, this could pose a grave danger to human societies.
    Date 2024-09-24
    Language en
    Short Title The selfish machine?
    Library Catalog Springer Link
    URL https://doi.org/10.1007/s11098-024-02226-3
    Accessed 11/16/2024, 3:26:59 PM
    Publication Philosophical Studies
    DOI 10.1007/s11098-024-02226-3
    Journal Abbr Philos Stud
    ISSN 1573-0883
    Date Added 11/16/2024, 3:26:59 PM
    Modified 11/16/2024, 3:26:59 PM

    Tags:

    • Artificial General Intelligence (AGI)
    • Artificial Intelligence
    • Darwinian spaces
    • Domestication
    • Economic competition
    • Evolution by natural selection
    • Selfishness
  • Looking Inward: Language Models Can Learn About Themselves by Introspection

    Item Type Preprint
    Author Felix J. Binder
    Author James Chua
    Author Tomek Korbak
    Author Henry Sleight
    Author John Hughes
    Author Robert Long
    Author Ethan Perez
    Author Miles Turpin
    Author Owain Evans
    Abstract Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization.
    Date 2024-10-17
    Short Title Looking Inward
    Library Catalog arXiv.org
    URL http://arxiv.org/abs/2410.13787
    Accessed 10/20/2024, 12:25:33 PM
    Extra arXiv:2410.13787
    Repository arXiv
    Archive ID arXiv:2410.13787
    Date Added 10/20/2024, 12:25:33 PM
    Modified 11/7/2024, 4:43:55 PM

    Tags:

    • Computer Science - Artificial Intelligence
    • Computer Science - Computation and Language

    Attachments

    • Full Text PDF
    • Snapshot
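
    The paper's core comparison, whether M1 predicts its own behavior better than a second model M2 trained on M1's ground-truth behavior, can be sketched as a simple accuracy comparison over held-out hypothetical prompts. The three callables below are placeholders for real (finetuned) model calls, not code from the paper.

    ```python
    # Sketch of the cross-prediction comparison described in the abstract
    # (placeholder callables stand in for finetuned model APIs).
    from typing import Callable

    def introspection_gap(
        prompts: list[str],
        m1_actual_behavior: Callable[[str], str],  # M1's real choice on each hypothetical prompt
        m1_self_predict: Callable[[str], str],     # M1 asked to predict its own choice
        m2_cross_predict: Callable[[str], str],    # M2, trained on M1's behavior, predicting M1
    ) -> tuple[float, float]:
        """Return (M1 self-prediction accuracy, M2 cross-prediction accuracy)."""
        truth = [m1_actual_behavior(p) for p in prompts]
        acc_self = sum(m1_self_predict(p) == t for p, t in zip(prompts, truth)) / len(prompts)
        acc_cross = sum(m2_cross_predict(p) == t for p, t in zip(prompts, truth)) / len(prompts)
        return acc_self, acc_cross

    # Evidence for introspection, in the paper's sense, is acc_self > acc_cross.
    ```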
  • Sabotage Evaluations for Frontier Models

    Item Type Journal Article
    Author Joe Benton
    Author Misha Wagner
    Author Eric Christiansen
    Author Cem Anil
    Author Ethan Perez
    Author Jai Srivastav
    Author Esin Durmus
    Author Deep Ganguli
    Author Shauna Kravec
    Author Buck Shlegeris
    Author Jared Kaplan
    Author Holden Karnofsky
    Author Evan Hubinger
    Author Roger Grosse
    Author Samuel R. Bowman
    Author David Duvenaud
    Abstract Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization’s activities in any of these ways. We demonstrate these evaluations on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics.
    Language en
    Library Catalog Zotero
    Date Added 10/20/2024, 12:25:18 PM
    Modified 10/20/2024, 12:25:18 PM

    Attachments

    • Benton et al. - Sabotage Evaluations for Frontier Models.pdf