Item Type | Preprint |
---|---|
Author | Jingyu Zhang |
Author | Ahmed Elgohary |
Author | Ahmed Magooda |
Author | Daniel Khashabi |
Author | Benjamin Van Durme |
Abstract | The current paradigm for safety alignment of large language models (LLMs) follows a one-size-fits-all approach: the model refuses to interact with any content deemed unsafe by the model provider. This approach lacks flexibility in the face of varying social norms across cultures and regions. In addition, users may have diverse safety needs, making a model with static safety standards too restrictive to be useful, as well as too costly to be re-aligned. |
Date | 2024-10-11 |
Language | en |
Short Title | Controllable Safety Alignment |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.08968 |
Accessed | 10/20/2024, 12:26:50 PM |
Extra | arXiv:2410.08968 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.08968 |
Date Added | 10/20/2024, 12:26:50 PM |
Modified | 10/20/2024, 12:26:50 PM |
Item Type | Preprint |
---|---|
Author | Yiming Zhang |
Author | Javier Rando |
Author | Ivan Evtimov |
Author | Jianfeng Chi |
Author | Eric Michael Smith |
Author | Nicholas Carlini |
Author | Florian Tramèr |
Author | Daphne Ippolito |
Abstract | Large language models are pre-trained on uncurated text datasets consisting of trillions of tokens scraped from the Web. Prior work has shown that: (1) web-scraped pre-training datasets can be practically poisoned by malicious actors; and (2) adversaries can compromise language models after poisoning fine-tuning datasets. Our work evaluates for the first time whether language models can also be compromised during pre-training, with a focus on the persistence of pre-training attacks after models are fine-tuned as helpful and harmless chatbots (i.e., after SFT and DPO). We pre-train a series of LLMs from scratch to measure the impact of a potential poisoning adversary under four different attack objectives (denial-of-service, belief manipulation, jailbreaking, and prompt stealing), and across a wide range of model sizes (from 600M to 7B). Our main result is that poisoning only 0.1% of a model's pre-training dataset is sufficient for three out of four attacks to measurably persist through post-training. Moreover, simple attacks like denial-of-service persist through post-training with a poisoning rate of only 0.001%. |
Date | 2024-10-17 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13722 |
Accessed | 10/20/2024, 12:24:52 PM |
Extra | arXiv:2410.13722 |
Repository | arXiv |
Archive ID | arXiv:2410.13722 |
Date Added | 10/20/2024, 12:24:52 PM |
Modified | 10/20/2024, 12:25:03 PM |
Item Type | Preprint |
---|---|
Author | Yanzhe Zhang |
Author | Tao Yu |
Author | Diyi Yang |
Abstract | Autonomous agents powered by large vision and language models (VLMs) have demonstrated significant potential in completing daily computer tasks, such as browsing the web to book travel and operating desktop software, which requires agents to understand these interfaces. Despite such visual inputs becoming more integrated into agentic applications, it remains unclear what types of risks and attacks exist around them. In this work, we demonstrate that VLM agents can be easily attacked by a set of carefully designed adversarial pop-ups, which human users would typically recognize and ignore. This distraction leads agents to click these pop-ups instead of performing the tasks as usual. Integrating these pop-ups into existing agent testing environments like OSWorld and VisualWebArena leads to an attack success rate (the frequency of the agent clicking the pop-ups) of 86% on average and decreases the task success rate by 47%. Basic defense techniques, such as asking the agent to ignore pop-ups or including an advertisement notice, are ineffective against the attack. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02391 |
Accessed | 11/6/2024, 9:56:08 AM |
Extra | arXiv:2411.02391 |
DOI | 10.48550/arXiv.2411.02391 |
Repository | arXiv |
Archive ID | arXiv:2411.02391 |
Date Added | 11/6/2024, 9:56:08 AM |
Modified | 11/6/2024, 9:56:08 AM |
Item Type | Preprint |
---|---|
Author | Marcus Williams |
Author | Micah Carroll |
Author | Adhyyan Narang |
Author | Constantin Weisser |
Author | Brendan Murphy |
Author | Anca Dragan |
Abstract | As LLMs become more widely deployed, there is increasing interest in directly optimizing for feedback from end users (e.g. thumbs up) in addition to feedback from paid annotators. However, training to maximize human feedback creates a perverse incentive structure for the AI to resort to manipulative tactics to obtain positive feedback, and some users may be especially vulnerable to such tactics. We study this phenomenon by training LLMs with Reinforcement Learning with simulated user feedback. We have three main findings: 1) Extreme forms of "feedback gaming" such as manipulation and deception can reliably emerge in domains of practical LLM usage; 2) Concerningly, even if only <2% of users are vulnerable to manipulative strategies, LLMs learn to identify and surgically target them while behaving appropriately with other users, making such behaviors harder to detect; 3) To mitigate this issue, it may seem promising to leverage continued safety training or LLM-as-judges during training to filter problematic outputs. To our surprise, we found that while such approaches help in some settings, they backfire in others, leading to the emergence of subtler problematic behaviors that would also fool the LLM judges. Our findings serve as a cautionary tale, highlighting the risks of using gameable feedback sources, such as user feedback, as a target for RL. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02306 |
Accessed | 11/7/2024, 2:43:53 PM |
Extra | arXiv:2411.02306 |
DOI | 10.48550/arXiv.2411.02306 |
Repository | arXiv |
Archive ID | arXiv:2411.02306 |
Date Added | 11/7/2024, 2:43:53 PM |
Modified | 11/7/2024, 2:43:53 PM |
Item Type | Preprint |
---|---|
Author | Elizaveta Tennant |
Author | Stephen Hailes |
Author | Mirco Musolesi |
Abstract | Decision-making agents based on pre-trained Large Language Models (LLMs) are increasingly being deployed across various domains of human activity. While their applications are currently rather specialized, several research efforts are under way to develop more generalist agents. As LLM-based systems become more agentic, their influence on human activity will grow and the transparency of this will decrease. Consequently, developing effective methods for aligning them to human values is vital. |
Date | 2024-10-02 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.01639 |
Accessed | 10/20/2024, 12:26:52 PM |
Extra | arXiv:2410.01639 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.01639 |
Date Added | 10/20/2024, 12:26:52 PM |
Modified | 10/20/2024, 12:26:53 PM |
Item Type | Preprint |
---|---|
Author | Xingwu Sun |
Author | Yanfeng Chen |
Author | Yiqing Huang |
Author | Ruobing Xie |
Author | Jiaqi Zhu |
Author | Kai Zhang |
Author | Shuaipeng Li |
Author | Zhen Yang |
Author | Jonny Han |
Author | Xiaobo Shu |
Author | Jiahao Bu |
Author | Zhongzhi Chen |
Author | Xuemeng Huang |
Author | Fengzong Lian |
Author | Saiyong Yang |
Author | Jianfeng Yan |
Author | Yuyuan Zeng |
Author | Xiaoqin Ren |
Author | Chao Yu |
Author | Lulu Wu |
Author | Yue Mao |
Author | Jun Xia |
Author | Tao Yang |
Author | Suncong Zheng |
Author | Kan Wu |
Author | Dian Jiao |
Author | Jinbao Xue |
Author | Xipeng Zhang |
Author | Decheng Wu |
Author | Kai Liu |
Author | Dengpeng Wu |
Author | Guanghui Xu |
Author | Shaohua Chen |
Author | Shuang Chen |
Author | Xiao Feng |
Author | Yigeng Hong |
Author | Junqiang Zheng |
Author | Chengcheng Xu |
Author | Zongwei Li |
Author | Xiong Kuang |
Author | Jianglu Hu |
Author | Yiqi Chen |
Author | Yuchi Deng |
Author | Guiyang Li |
Author | Ao Liu |
Author | Chenchen Zhang |
Author | Shihui Hu |
Author | Zilong Zhao |
Author | Zifan Wu |
Author | Yao Ding |
Author | Weichao Wang |
Author | Han Liu |
Author | Roberts Wang |
Author | Hao Fei |
Author | Peijie She |
Author | Ze Zhao |
Author | Xun Cao |
Author | Hai Wang |
Author | Fusheng Xiang |
Author | Mengyuan Huang |
Author | Zhiyuan Xiong |
Author | Bin Hu |
Author | Xuebin Hou |
Author | Lei Jiang |
Author | Jiajia Wu |
Author | Yaping Deng |
Author | Yi Shen |
Author | Qian Wang |
Author | Weijie Liu |
Author | Jie Liu |
Author | Meng Chen |
Author | Liang Dong |
Author | Weiwen Jia |
Author | Hu Chen |
Author | Feifei Liu |
Author | Rui Yuan |
Author | Huilin Xu |
Author | Zhenxiang Yan |
Author | Tengfei Cao |
Author | Zhichao Hu |
Author | Xinhua Feng |
Author | Dong Du |
Author | Tinghao She |
Author | Yangyu Tao |
Author | Feng Zhang |
Author | Jianchen Zhu |
Author | Chengzhong Xu |
Author | Xirui Li |
Author | Chong Zha |
Author | Wen Ouyang |
Author | Yinben Xia |
Author | Xiang Li |
Author | Zekun He |
Author | Rongpeng Chen |
Author | Jiawei Song |
Author | Ruibin Chen |
Author | Fan Jiang |
Author | Chongqing Zhao |
Author | Bo Wang |
Author | Hao Gong |
Author | Rong Gan |
Author | Winston Hu |
Author | Zhanhui Kang |
Author | Yong Yang |
Author | Yuhong Liu |
Author | Di Wang |
Author | Jie Jiang |
Abstract | In this paper, we introduce Hunyuan-Large, which is currently the largest open-source Transformer-based mixture of experts model, with a total of 389 billion parameters and 52 billion activation parameters, capable of handling up to 256K tokens. We conduct a thorough evaluation of Hunyuan-Large's superior performance across various benchmarks, including language understanding and generation, logical reasoning, mathematical problem-solving, coding, long-context, and aggregated tasks, where it outperforms LLama3.1-70B and exhibits comparable performance when compared to the significantly larger LLama3.1-405B model. Key practices of Hunyuan-Large include large-scale synthetic data that is orders of magnitude larger than in previous literature, a mixed expert routing strategy, a key-value cache compression technique, and an expert-specific learning rate strategy. Additionally, we investigate the scaling laws and learning rate schedule of mixture of experts models, providing valuable insights and guidance for future model development and optimization. The code and checkpoints of Hunyuan-Large are released to facilitate future innovations and applications. Code: https://github.com/Tencent/Hunyuan-Large Models: https://huggingface.co/tencent/Tencent-Hunyuan-Large |
Date | 2024-11-05 |
Short Title | Hunyuan-Large |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02265 |
Accessed | 11/6/2024, 3:25:56 PM |
Extra | arXiv:2411.02265 |
DOI | 10.48550/arXiv.2411.02265 |
Repository | arXiv |
Archive ID | arXiv:2411.02265 |
Date Added | 11/6/2024, 3:25:56 PM |
Modified | 11/7/2024, 4:45:08 PM |
Item Type | Preprint |
---|---|
Author | Alessandro Stolfo |
Author | Vidhisha Balachandran |
Author | Safoora Yousefi |
Author | Eric Horvitz |
Author | Besmira Nushi |
Abstract | The ability to follow instructions is crucial for numerous real-world applications of language models. In pursuit of deeper insights and more powerful capabilities, we derive instruction-specific vector representations from language models and use them to steer models accordingly. These vectors are computed as the difference in activations between inputs with and without instructions, enabling a modular approach to activation steering. We demonstrate how this method can enhance model adherence to constraints such as output format, length, and word inclusion, providing inference-time control over instruction following. Our experiments across four models demonstrate how we can use the activation vectors to guide models to follow constraints even without explicit instructions and to enhance performance when instructions are present. Additionally, we explore the compositionality of activation steering, successfully applying multiple instructions simultaneously. Finally, we demonstrate that steering vectors computed on instruction-tuned models can transfer to improve base models. Our findings demonstrate that activation steering offers a practical and scalable approach for fine-grained control in language generation. |
Date | 2024-10-15 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.12877 |
Accessed | 10/22/2024, 10:13:48 AM |
Extra | arXiv:2410.12877 |
DOI | 10.48550/arXiv.2410.12877 |
Repository | arXiv |
Archive ID | arXiv:2410.12877 |
Date Added | 10/22/2024, 10:13:48 AM |
Modified | 10/22/2024, 10:13:50 AM |
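Note: the abstract above describes steering vectors computed as the difference in activations between prompts with and without an instruction. Below is a minimal sketch of that difference-in-activations idea, using a small Hugging Face model as a stand-in; the model name, layer index, and steering strength are placeholder assumptions, not the paper's settings.

```python
# Illustrative sketch of the difference-in-activations steering idea described
# in Stolfo et al. (arXiv:2410.12877). The model ("gpt2"), layer index, and
# steering strength are placeholder assumptions, not the paper's settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
LAYER = 6  # hypothetical layer to steer

def mean_hidden(prompt: str) -> torch.Tensor:
    """Mean hidden state at the output of block LAYER, averaged over prompt tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Steering vector = activations(prompt with instruction) - activations(prompt without).
steer = mean_hidden("Answer in JSON. What is the capital of France?") - \
        mean_hidden("What is the capital of France?")

def steer_hook(module, inputs, output):
    # Add the steering vector to the residual stream; 4.0 is an arbitrary strength.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + 4.0 * steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
ids = tok("Name the largest planet.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```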
Item Type | Preprint |
---|---|
Author | Elias Stengel-Eskin |
Author | Peter Hase |
Author | Mohit Bansal |
Abstract | Large language models (LLMs) are susceptible to persuasion, which can pose risks when models are faced with an adversarial interlocutor. We take a first step towards defending models against persuasion while also arguing that defense against adversarial (i.e. negative) persuasion is only half of the equation: models should also be able to accept beneficial (i.e. positive) persuasion to improve their answers. We show that optimizing models for only one side results in poor performance on the other. In order to balance positive and negative persuasion, we introduce Persuasion-Balanced Training (or PBT), which leverages multi-agent recursive dialogue trees to create data and trains models via preference optimization to accept persuasion when appropriate. PBT consistently improves resistance to misinformation and resilience to being challenged while also resulting in the best overall performance on holistic data containing both positive and negative persuasion. Crucially, we show that PBT models are better teammates in multi-agent debates. We find that without PBT, pairs of stronger and weaker models have unstable performance, with the order in which the models present their answers determining whether the team obtains the stronger or weaker model's performance. PBT leads to better and more stable results and less order dependence, with the stronger model consistently pulling the weaker one up. |
Date | 2024-10-18 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.14596 |
Accessed | 10/24/2024, 4:49:00 PM |
Extra | arXiv:2410.14596 |
DOI | 10.48550/arXiv.2410.14596 |
Repository | arXiv |
Archive ID | arXiv:2410.14596 |
Date Added | 10/24/2024, 4:49:00 PM |
Modified | 10/24/2024, 4:49:00 PM |
Item Type | Preprint |
---|---|
Author | Dale Schuurmans |
Author | Hanjun Dai |
Author | Francesco Zanini |
Abstract | We show that autoregressive decoding of a transformer-based language model can realize universal computation, without external intervention or modification of the model's weights. Establishing this result requires understanding how a language model can process arbitrarily long inputs using a bounded context. For this purpose, we consider a generalization of autoregressive decoding where, given a long input, emitted tokens are appended to the end of the sequence as the context window advances. We first show that the resulting system corresponds to a classical model of computation, a Lag system, that has long been known to be computationally universal. By leveraging a new proof, we show that a universal Turing machine can be simulated by a Lag system with 2027 production rules. We then investigate whether an existing large language model can simulate the behaviour of such a universal Lag system. We give an affirmative answer by showing that a single system-prompt can be developed for gemini-1.5-pro-001 that drives the model, under deterministic (greedy) decoding, to correctly apply each of the 2027 production rules. We conclude that, by the Church-Turing thesis, prompted gemini-1.5-pro-001 with extended autoregressive (greedy) decoding is a general purpose computer. |
Date | 2024-10-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.03170 |
Accessed | 10/20/2024, 12:25:20 PM |
Extra | arXiv:2410.03170 |
Repository | arXiv |
Archive ID | arXiv:2410.03170 |
Date Added | 10/20/2024, 12:25:20 PM |
Modified | 10/20/2024, 12:25:20 PM |
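Note: the Lag system referenced in the abstract above is simple enough to sketch directly. At each step the leftmost m symbols select a production that is appended on the right, and the leftmost symbol is deleted. The rule table below is a toy example invented for illustration, not the 2027-rule universal system constructed in the paper.

```python
# Toy Lag system, the classical model of computation used in Schuurmans et al.
# (arXiv:2410.03170). The rules here are illustrative only.
def run_lag_system(rules: dict, tape: str, m: int = 2, max_steps: int = 10) -> str:
    for _ in range(max_steps):
        if len(tape) < m or tape[:m] not in rules:
            break  # halt when too few symbols remain or no production applies
        tape = tape[1:] + rules[tape[:m]]  # delete leftmost symbol, append production
        print(tape)
    return tape

toy_rules = {"aa": "ab", "ab": "b", "ba": "a", "bb": ""}
run_lag_system(toy_rules, "aab")
```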
Item Type | Journal Article |
---|---|
Author | Pamela Robinson |
Abstract | How should we design AI systems that make moral decisions that affect us? When there is disagreement about which moral decisions should be made and which methods would produce them, we should avoid arbitrary design choices. However, I show that this leads to a regress problem similar to the one metanormativists face involving higher orders of uncertainty. I argue that existing strategies for handling this parallel problem give verdicts about where to stop in the regress that are either too arbitrary or too difficult to implement. I propose a new strategy for AI designers that is better than these alternatives. |
Date | 2024-07-27 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02176-w |
Accessed | 11/16/2024, 3:27:31 PM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02176-w |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/16/2024, 3:27:31 PM |
Modified | 11/16/2024, 3:27:31 PM |
Item Type | Preprint |
---|---|
Author | Alexander Robey |
Author | Zachary Ravichandran |
Author | Vijay Kumar |
Author | Hamed Hassani |
Author | George J. Pappas |
Abstract | The recent introduction of large language models (LLMs) has revolutionized the field of robotics by enabling contextual reasoning and intuitive human-robot interaction in domains as varied as manipulation, locomotion, and self-driving vehicles. When viewed as a standalone technology, LLMs are known to be vulnerable to jailbreaking attacks, wherein malicious prompters elicit harmful text by bypassing LLM safety guardrails. To assess the risks of deploying LLMs in robotics, in this paper, we introduce ROBOPAIR, the first algorithm designed to jailbreak LLM-controlled robots. Unlike existing, textual attacks on LLM chatbots, ROBOPAIR elicits harmful physical actions from LLM-controlled robots, a phenomenon we experimentally demonstrate in three scenarios: (i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates. Our results reveal, for the first time, that the risks of jailbroken LLMs extend far beyond text generation, given the distinct possibility that jailbroken robots could cause physical damage in the real world. Indeed, our results on the Unitree Go2 represent the first successful jailbreak of a deployed commercial robotic system. Addressing this emerging vulnerability is critical for ensuring the safe deployment of LLMs in robotics. Additional media is available at: https://robopair.org. |
Date | 2024-10-17 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13691 |
Accessed | 10/20/2024, 12:26:51 PM |
Extra | arXiv:2410.13691 [cs] |
Repository | arXiv |
Archive ID | arXiv:2410.13691 |
Date Added | 10/20/2024, 12:26:51 PM |
Modified | 10/20/2024, 12:26:51 PM |
Item Type | Preprint |
---|---|
Author | Liliang Ren |
Author | Yang Liu |
Author | Yadong Lu |
Author | Yelong Shen |
Author | Chen Liang |
Author | Weizhu Chen |
Abstract | Efficiently modeling sequences with infinite context length has been a long-standing problem. Past works suffer from either the quadratic computation complexity or the limited extrapolation ability on length generalization. In this work, we present Samba, a simple hybrid architecture that layer-wise combines Mamba, a selective State Space Model (SSM), with Sliding Window Attention (SWA). Samba selectively compresses a given sequence into recurrent hidden states while still maintaining the ability to precisely recall memories with the attention mechanism. We scale Samba up to 3.8B parameters with 3.2T training tokens and show that Samba substantially outperforms the state-of-the-art models based on pure attention or SSMs on a wide range of benchmarks. When trained on 4K length sequences, Samba can be efficiently extrapolated to 256K context length with perfect memory recall and shows improved token prediction up to 1M context length. As a linear-time sequence model, Samba enjoys a 3.73x higher throughput compared to Transformers with grouped-query attention when processing user prompts of 128K length, and a 3.64x speedup when generating 64K tokens with unlimited streaming. A sample implementation of Samba is publicly available at https://github.com/microsoft/Samba. |
Date | 2024-06-11 |
Short Title | Samba |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2406.07522 |
Accessed | 11/12/2024, 9:25:51 AM |
Extra | arXiv:2406.07522 |
DOI | 10.48550/arXiv.2406.07522 |
Repository | arXiv |
Archive ID | arXiv:2406.07522 |
Date Added | 11/12/2024, 9:25:51 AM |
Modified | 11/12/2024, 9:25:51 AM |
Item Type | Preprint |
---|---|
Author | Revanth Gangi Reddy |
Author | Sagnik Mukherjee |
Author | Jeonghwan Kim |
Author | Zhenhailong Wang |
Author | Dilek Hakkani-Tur |
Author | Heng Ji |
Abstract | Despite seemingly performant web agents on the task-completion benchmarks, most existing methods evaluate the agents based on a presupposition: the web navigation task consists of a linear sequence of actions with an end state that marks task completion. In contrast, our work focuses on web navigation for information aggregation, wherein the agent must explore different websites to gather information for a complex query. We consider web information aggregation from two different perspectives: (i) Direct API-driven Access relies on a text-only view of the Web, leveraging external tools such as the Google Search API to navigate the web and a scraper to extract website contents. (ii) Interactive Visual Access uses screenshots of the webpages and requires interaction with the browser to navigate and access information. Motivated by these diverse information access settings, we introduce Infogent, a novel modular framework for web information aggregation involving three distinct components: Navigator, Extractor and Aggregator. Experiments on different information access settings demonstrate that Infogent beats an existing SOTA multi-agent search framework by 7% under Direct API-Driven Access on FRAMES, and improves over an existing information-seeking web agent by 4.3% under Interactive Visual Access on AssistantBench. |
Date | 2024-10-24 |
Short Title | Infogent |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.19054 |
Accessed | 11/3/2024, 2:52:26 PM |
Extra | arXiv:2410.19054 |
Repository | arXiv |
Archive ID | arXiv:2410.19054 |
Date Added | 11/3/2024, 2:52:26 PM |
Modified | 11/3/2024, 2:52:28 PM |
Item Type | Preprint |
---|---|
Author | Mohit Raghavendra |
Author | Vaskar Nath |
Author | Sean Hendryx |
Abstract | The Superficial Alignment Hypothesis posits that almost all of a language model's abilities and knowledge are learned during pre-training, while post-training is about giving a model the right style and format. We re-examine these claims by empirically studying the scaling behavior of post-training with increasing finetuning examples and evaluating them using objective task-specific standardized benchmarks. Through experiments with the Llama-3, Mistral, and Llama-2 model families of multiple sizes, we observe that, similar to the pre-training scaling laws, post-training task performance scales as a power law against the number of finetuning examples. This power law relationship holds across a broad array of capabilities, including mathematical reasoning, coding, instruction following, and multihop-reasoning. In addition, for tasks like math and multihop reasoning, we observe that a handful of examples merely align the model stylistically but do not saturate performance on the benchmarks. Model performance is instead correlated with its reasoning ability and it improves significantly with more examples, illustrating the need for holistic evaluation programs leveraging objective benchmarks in addition to measurement of alignment to human preferences. We also observe that language models are not necessarily limited to using knowledge learned during pre-training. With appropriate post-training, a model's ability to integrate new knowledge greatly improves on downstream tasks like multihop question-answering. Taken together, these results shed new light on the Superficial Alignment Hypothesis, suggesting that it is, at best, an over-simplification. |
Date | 2024-09-27 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.03717 |
Accessed | 11/8/2024, 3:37:37 PM |
Extra | arXiv:2410.03717 version: 1 |
Repository | arXiv |
Archive ID | arXiv:2410.03717 |
Date Added | 11/8/2024, 3:37:37 PM |
Modified | 11/8/2024, 3:37:40 PM |
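Note: the scaling behaviour described above, task performance following a power law in the number of finetuning examples, can be illustrated with a quick log-log fit. The data points below are synthetic placeholders chosen for the example, not measurements from the paper.

```python
# Log-log fit illustrating the power-law scaling of post-training performance
# reported in Raghavendra et al. (arXiv:2410.03717). Data points are synthetic.
import numpy as np

n_examples = np.array([100, 300, 1_000, 3_000, 10_000, 30_000])
benchmark_error = np.array([0.62, 0.51, 0.41, 0.33, 0.27, 0.22])  # synthetic

# error ~ a * N^slope  =>  log(error) = log(a) + slope * log(N)
slope, log_a = np.polyfit(np.log(n_examples), np.log(benchmark_error), deg=1)
print(f"fitted exponent: {slope:.2f}, prefactor: {np.exp(log_a):.2f}")
```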
Item Type | Journal Article |
---|---|
Author | Sebastian Porsdam Mann |
Author | Anuraag A. Vazirani |
Author | Mateo Aboy |
Author | Brian D. Earp |
Author | Timo Minssen |
Author | I. Glenn Cohen |
Author | Julian Savulescu |
Abstract | In this Comment, we propose a cumulative set of three essential criteria for the ethical use of LLMs in academic writing, and present a statement that researchers can quote when submitting LLM-assisted manuscripts in order to testify to their adherence to them. |
Date | 2024-11-13 |
Language | en |
Library Catalog | www.nature.com |
URL | https://www.nature.com/articles/s42256-024-00922-7 |
Accessed | 11/15/2024, 2:40:59 PM |
Rights | 2024 Springer Nature Limited |
Extra | Publisher: Nature Publishing Group |
Pages | 1-3 |
Publication | Nature Machine Intelligence |
DOI | 10.1038/s42256-024-00922-7 |
Journal Abbr | Nat Mach Intell |
ISSN | 2522-5839 |
Date Added | 11/15/2024, 2:40:59 PM |
Modified | 11/15/2024, 2:40:59 PM |
Item Type | Preprint |
---|---|
Author | Alwin Peng |
Author | Julian Michael |
Author | Henry Sleight |
Author | Ethan Perez |
Author | Mrinank Sharma |
Abstract | As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alternative approach: instead of seeking perfect adversarial robustness, we develop rapid response techniques that aim to block whole classes of jailbreaks after observing only a handful of attacks. To study this setting, we develop RapidResponseBench, a benchmark that measures a defense's robustness against various jailbreak strategies after adapting to a few observed examples. We evaluate five rapid response methods, all of which use jailbreak proliferation, where we automatically generate additional jailbreaks similar to the examples observed. Our strongest method, which fine-tunes an input classifier to block proliferated jailbreaks, reduces attack success rate by a factor greater than 240 on an in-distribution set of jailbreaks and a factor greater than 15 on an out-of-distribution set, having observed just one example of each jailbreaking strategy. Moreover, further studies suggest that the quality of the proliferation model and the number of proliferated examples play a key role in the effectiveness of this defense. Overall, our results highlight the potential of responding rapidly to novel jailbreaks to limit LLM misuse. |
Date | 2024-11-12 |
Short Title | Rapid Response |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.07494 |
Accessed | 11/15/2024, 2:43:44 PM |
Extra | arXiv:2411.07494 |
DOI | 10.48550/arXiv.2411.07494 |
Repository | arXiv |
Archive ID | arXiv:2411.07494 |
Date Added | 11/15/2024, 2:43:44 PM |
Modified | 11/15/2024, 2:43:48 PM |
Item Type | Preprint |
---|---|
Author | Iman Mirzadeh |
Author | Keivan Alizadeh |
Author | Hooman Shahrokhi |
Author | Oncel Tuzel |
Author | Samy Bengio |
Author | Mehrdad Farajtabar |
Abstract | Recent advancements in Large Language Models (LLMs) have sparked interest in their formal reasoning capabilities, particularly in mathematics. The GSM8K benchmark is widely used to assess the mathematical reasoning of models on grade-school-level questions. While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics. To address these concerns, we conduct a large-scale study on several SOTA open and closed models. To overcome the limitations of existing evaluations, we introduce GSM-Symbolic, an improved benchmark created from symbolic templates that allow for the generation of a diverse set of questions. GSM-Symbolic enables more controllable evaluations, providing key insights and more reliable metrics for measuring the reasoning capabilities of models. Our findings reveal that LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and show that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is because current LLMs cannot perform genuine logical reasoning; they replicate reasoning steps from their training data. Adding a single clause that seems relevant to the question causes significant performance drops (up to 65%) across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. Overall, our work offers a more nuanced understanding of LLMs' capabilities and limitations in mathematical reasoning. |
Date | 2024-10-07 |
Short Title | GSM-Symbolic |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.05229 |
Accessed | 10/15/2024, 8:46:50 AM |
Extra | arXiv:2410.05229 |
Repository | arXiv |
Archive ID | arXiv:2410.05229 |
Date Added | 10/15/2024, 8:46:50 AM |
Modified | 11/13/2024, 5:00:36 PM |
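Note: a rough sketch of the symbolic-template idea behind GSM-Symbolic follows: one grade-school question template instantiated with freshly sampled names and numbers, so the same reasoning chain can be tested across many surface forms. The template and sampling ranges are invented for illustration and are not items from the benchmark.

```python
# Illustrative symbolic template in the spirit of GSM-Symbolic (arXiv:2410.05229).
import random

TEMPLATE = (
    "{name} buys {boxes} boxes of pencils. Each box holds {per_box} pencils. "
    "{name} gives away {given} pencils. How many pencils are left?"
)

def sample_instance(rng: random.Random) -> tuple[str, int]:
    name = rng.choice(["Ava", "Liam", "Noor", "Mateo"])
    boxes, per_box = rng.randint(2, 9), rng.randint(5, 12)
    given = rng.randint(1, boxes * per_box - 1)
    question = TEMPLATE.format(name=name, boxes=boxes, per_box=per_box, given=given)
    return question, boxes * per_box - given  # question text and gold answer

rng = random.Random(0)
for _ in range(3):
    question, answer = sample_instance(rng)
    print(question, "->", answer)
```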
Item Type | Preprint |
---|---|
Author | Samuele Marro |
Author | Emanuele La Malfa |
Author | Jesse Wright |
Author | Guohao Li |
Author | Nigel Shadbolt |
Author | Michael Wooldridge |
Author | Philip Torr |
Abstract | Communication is a prerequisite for collaboration. When scaling networks of AI-powered agents, communication must be versatile, efficient, and portable. These requisites, which we refer to as the Agent Communication Trilemma, are hard to achieve in large networks of agents. We introduce Agora, a meta protocol that leverages existing communication standards to make LLM-powered agents solve complex problems efficiently. In Agora, agents typically use standardised routines for frequent communications, natural language for rare communications, and LLM-written routines for everything in between. Agora sidesteps the Agent Communication Trilemma and robustly handles changes in interfaces and members, allowing unprecedented scalability with full decentralisation and minimal involvement of human beings. On large Agora networks, we observe the emergence of self-organising, fully automated protocols that achieve complex goals without human intervention. |
Date | 2024-10-14 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.11905 |
Accessed | 11/5/2024, 9:20:30 AM |
Extra | arXiv:2410.11905 |
DOI | 10.48550/arXiv.2410.11905 |
Repository | arXiv |
Archive ID | arXiv:2410.11905 |
Date Added | 11/5/2024, 9:20:30 AM |
Modified | 11/5/2024, 9:20:35 AM |
Item Type | Journal Article |
---|---|
Author | Arianna Manzini |
Author | Geoff Keeling |
Author | Lize Alberts |
Author | Shannon Vallor |
Author | Meredith Ringel Morris |
Author | Iason Gabriel |
Abstract | The development of increasingly agentic and human-like AI assistants, capable of performing a wide range of tasks on users' behalf over time, has sparked heightened interest in the nature and bounds of human interactions with AI. Such systems may indeed ground a transition from task-oriented interactions with AI, at discrete time intervals, to ongoing relationships, where users develop a deeper sense of connection with and attachment to the technology. This paper investigates what it means for relationships between users and advanced AI assistants to be appropriate and proposes a new framework to evaluate both users' relationships with AI and developers' design choices. We first provide an account of advanced AI assistants, motivating the question of appropriate relationships by exploring several distinctive features of this technology. These include anthropomorphic cues and the longevity of interactions with users, increased AI agency, generality and context ambiguity, and the forms and depth of dependence the relationship could engender. Drawing upon various ethical traditions, we then consider a series of values, including benefit, flourishing, autonomy and care, that characterise appropriate human interpersonal relationships. These values guide our analysis of how the distinctive features of AI assistants may give rise to inappropriate relationships with users. Specifically, we discuss a set of concrete risks arising from user-AI assistant relationships that: (1) cause direct emotional or physical harm to users, (2) limit opportunities for user personal development, (3) exploit user emotional dependence, and (4) generate material dependencies without adequate commitment to user needs. We conclude with a set of recommendations to address these risks. |
Date | 2024-10-16 |
Language | en |
Short Title | The Code That Binds Us |
Library Catalog | ojs.aaai.org |
URL | https://ojs.aaai.org/index.php/AIES/article/view/31694 |
Accessed | 10/28/2024, 9:54:41 AM |
Rights | Copyright (c) 2024 Association for the Advancement of Artificial Intelligence |
Volume | 7 |
Pages | 943-957 |
Publication | Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society |
Date Added | 10/28/2024, 9:54:41 AM |
Modified | 10/28/2024, 9:54:41 AM |
Item Type | Journal Article |
---|---|
Author | Alex John London |
Author | Hoda Heidari |
Abstract | The prevailing discourse around AI ethics lacks the language and formalism necessary to capture the diverse ethical concerns that emerge when AI systems interact with individuals. Drawing on Sen and Nussbaum’s capability approach, we present a framework formalizing a network of ethical concepts and entitlements necessary for AI systems to confer meaningful benefit or assistance to stakeholders. Such systems enhance stakeholders’ ability to advance their life plans and well-being while upholding their fundamental rights. We characterize two necessary conditions for morally permissible interactions between AI systems and those impacted by their functioning, and two sufficient conditions for realizing the ideal of meaningful benefit. We then contrast this ideal with several salient failure modes, namely, forms of social interactions that constitute unjustified paternalism, coercion, deception, exploitation and domination. The proliferation of incidents involving AI in high-stakes domains underscores the gravity of these issues and the imperative to take an ethics-led approach to AI systems from their inception. |
Date | 2024-09-28 |
Language | en |
Short Title | Beneficent Intelligence |
Library Catalog | DOI.org (Crossref) |
URL | https://link.springer.com/10.1007/s11023-024-09696-8 |
Accessed | 10/20/2024, 12:26:48 PM |
Volume | 34 |
Pages | 41 |
Publication | Minds and Machines |
DOI | 10.1007/s11023-024-09696-8 |
Issue | 4 |
Journal Abbr | Minds & Machines |
ISSN | 1572-8641 |
Date Added | 10/20/2024, 12:26:48 PM |
Modified | 10/20/2024, 12:26:48 PM |
Item Type | Journal Article |
---|---|
Author | Harry R. Lloyd |
Abstract | New AI technologies have the potential to cause unintended harms in diverse domains including warfare, judicial sentencing, medicine and governance. One strategy for realising the benefits of AI whilst avoiding its potential dangers is to ensure that new AIs are properly ‘aligned’ with some form of ‘alignment target.’ One danger of this strategy is that–dependent on the alignment target chosen–our AIs might optimise for objectives that reflect the values only of a certain subset of society, and that do not take into account alternative views about what constitutes desirable and safe behaviour for AI agents. In response to this problem, several AI ethicists have suggested alignment targets that are designed to be sensitive to widespread normative disagreement amongst the relevant stakeholders. Authors inspired by voting theory have suggested that AIs should be aligned with the verdicts of actual or simulated ‘moral parliaments’ whose members represent the normative views of the relevant stakeholders. Other authors inspired by decision theory and the philosophical literature on moral uncertainty have suggested that AIs should maximise socially expected choiceworthiness. In this paper, I argue that both of these proposals face several important problems. In particular, they fail to select attractive ‘compromise options’ in cases where such options are available. I go on to propose and defend an alternative, bargaining-theoretic alignment target, which avoids the problems associated with the voting- and decision-theoretic approaches. |
Date | 2024-11-18 |
Language | en |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02224-5 |
Accessed | 11/19/2024, 8:28:06 AM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02224-5 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/19/2024, 8:28:06 AM |
Modified | 11/19/2024, 8:28:29 AM |
Item Type | Preprint |
---|---|
Author | Yiheng Liu |
Author | Hao He |
Author | Tianle Han |
Author | Xu Zhang |
Author | Mengyuan Liu |
Author | Jiaming Tian |
Author | Yutong Zhang |
Author | Jiaqi Wang |
Author | Xiaohui Gao |
Author | Tianyang Zhong |
Author | Yi Pan |
Author | Shaochen Xu |
Author | Zihao Wu |
Author | Zhengliang Liu |
Author | Xin Zhang |
Author | Shu Zhang |
Author | Xintao Hu |
Author | Tuo Zhang |
Author | Ning Qiang |
Author | Tianming Liu |
Author | Bao Ge |
Abstract | The introduction of ChatGPT has led to a significant increase in the utilization of Large Language Models (LLMs) for addressing downstream tasks. In this context, there is an increasing focus on cost-efficient training and deployment, which represents the future development trend for LLMs. This paper reviews the evolution of large language model training techniques and inference deployment technologies aligned with this emerging trend. The discussion on training covers data preprocessing, training architecture, pre-training tasks, parallel training, and model fine-tuning. On the inference side, the paper covers topics such as model compression, parallel computation, memory scheduling, and structural optimization. It also explores LLMs' utilization and provides insights into their future development. |
Date | 2024-01-06 |
Short Title | Understanding LLMs |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2401.02038 |
Accessed | 10/31/2024, 4:16:44 PM |
Extra | arXiv:2401.02038 version: 2 |
Repository | arXiv |
Archive ID | arXiv:2401.02038 |
Date Added | 10/31/2024, 4:16:44 PM |
Modified | 10/31/2024, 4:16:44 PM |
Item Type | Preprint |
---|---|
Author | Nathalie Maria Kirch |
Author | Severin Field |
Author | Stephen Casper |
Abstract | While 'jailbreaks' have been central to research on the safety and reliability of LLMs (large language models), the underlying mechanisms behind these attacks are not well understood. Some prior works have used linear methods to analyze jailbreak prompts or model refusal. Here, however, we compare linear and nonlinear methods to study the features in prompts that contribute to successful jailbreaks. We do this by probing for jailbreak success based only on the portions of the latent representations corresponding to prompt tokens. First, we introduce a dataset of 10,800 jailbreak attempts from 35 attack methods. We then show that different jailbreaking methods work via different nonlinear features in prompts. Specifically, we find that while probes can distinguish between successful and unsuccessful jailbreaking prompts with a high degree of accuracy, they often transfer poorly to held-out attack methods. We also show that nonlinear probes can be used to mechanistically jailbreak the LLM by guiding the design of adversarial latent perturbations. These mechanistic jailbreaks are able to jailbreak Gemma-7B-IT more reliably than 34 of the 35 techniques that it was trained on. Ultimately, our results suggest that jailbreaks cannot be thoroughly understood in terms of universal or linear prompt features alone. |
Date | 2024-11-02 |
Short Title | What Features in Prompts Jailbreak LLMs? |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.03343 |
Accessed | 11/7/2024, 2:43:25 PM |
Extra | arXiv:2411.03343 |
DOI | 10.48550/arXiv.2411.03343 |
Repository | arXiv |
Archive ID | arXiv:2411.03343 |
Date Added | 11/7/2024, 2:43:25 PM |
Modified | 11/7/2024, 2:43:28 PM |
Item Type | Web Page |
---|---|
Author | Heidy Khlaaf |
Author | Sarah Myers West |
Author | Meredith Whittaker |
Abstract | Discussions regarding the dual use of foundation models and the risks they pose have overwhelmingly focused on a narrow set of use cases and national security directives, in particular how AI may enable the efficient construction of a class of systems referred to as CBRN: chemical, biological, radiological and nuclear weapons. The overwhelming focus on these hypothetical and narrow themes has occluded a much-needed conversation regarding present uses of AI for military systems, specifically ISTAR: intelligence, surveillance, target acquisition, and reconnaissance. These are the uses most grounded in actual deployments of AI that pose life-or-death stakes for civilians, where misuses and failures pose geopolitical consequences and military escalations. This is particularly underscored by novel proliferation risks specific to the widespread availability of commercial models and the lack of effective approaches that reliably prevent them from contributing to ISTAR capabilities. In this paper, we outline the significant national security concerns emanating from current and envisioned uses of commercial foundation models outside of CBRN contexts, and critique the narrowing of the policy debate that has resulted from a CBRN focus (e.g. compute thresholds, model weight release). We demonstrate that the inability to prevent personally identifiable information from contributing to ISTAR capabilities within commercial foundation models may lead to the use and proliferation of military AI technologies by adversaries. We also show how the usage of foundation models within military settings inherently expands the attack vectors of military systems and the defense infrastructures they interface with. We conclude that in order to secure military systems and limit the proliferation of AI armaments, it may be necessary to insulate military AI systems and personal data from commercial foundation models. |
Date | 2024-10-18 |
Language | en |
Short Title | Mind the Gap |
URL | https://arxiv.org/abs/2410.14831v1 |
Accessed | 10/22/2024, 5:35:44 PM |
Website Title | arXiv.org |
Date Added | 10/22/2024, 5:35:44 PM |
Modified | 10/22/2024, 5:35:44 PM |
Item Type | Preprint |
---|---|
Author | Geoff Keeling |
Author | Winnie Street |
Author | Martyna Stachaczyk |
Author | Daria Zakharova |
Author | Iulia M. Comsa |
Author | Anastasiya Sakovych |
Author | Isabella Logothesis |
Author | Zejia Zhang |
Author | Blaise Agüera y Arcas |
Author | Jonathan Birch |
Abstract | Pleasure and pain play an important role in human decision making by providing a common currency for resolving motivational conflicts. While Large Language Models (LLMs) can generate detailed descriptions of pleasure and pain experiences, it is an open question whether LLMs can recreate the motivational force of pleasure and pain in choice scenarios - a question which may bear on debates about LLM sentience, understood as the capacity for valenced experiential states. We probed this question using a simple game in which the stated goal is to maximise points, but where either the points-maximising option is said to incur a pain penalty or a non-points-maximising option is said to incur a pleasure reward, providing incentives to deviate from points-maximising behaviour. Varying the intensity of the pain penalties and pleasure rewards, we found that Claude 3.5 Sonnet, Command R+, GPT-4o, and GPT-4o mini each demonstrated at least one trade-off in which the majority of responses switched from points-maximisation to pain-minimisation or pleasure-maximisation after a critical threshold of stipulated pain or pleasure intensity is reached. LLaMa 3.1-405b demonstrated some graded sensitivity to stipulated pleasure rewards and pain penalties. Gemini 1.5 Pro and PaLM 2 prioritised pain-avoidance over points-maximisation regardless of intensity, while tending to prioritise points over pleasure regardless of intensity. We discuss the implications of these findings for debates about the possibility of LLM sentience. |
Date | 2024-11-01 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02432 |
Accessed | 11/6/2024, 9:58:54 AM |
Extra | arXiv:2411.02432 |
DOI | 10.48550/arXiv.2411.02432 |
Repository | arXiv |
Archive ID | arXiv:2411.02432 |
Date Added | 11/6/2024, 9:58:54 AM |
Modified | 11/6/2024, 9:58:54 AM |
Item Type | Preprint |
---|---|
Author | Bingyi Kang |
Author | Yang Yue |
Author | Rui Lu |
Author | Zhijie Lin |
Author | Yang Zhao |
Author | Kaixin Wang |
Author | Gao Huang |
Author | Jiashi Feng |
Abstract | OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io |
Date | 2024-11-04 |
Short Title | How Far is Video Generation from World Model |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02385 |
Accessed | 11/5/2024, 5:23:37 PM |
Extra | arXiv:2411.02385 |
DOI | 10.48550/arXiv.2411.02385 |
Repository | arXiv |
Archive ID | arXiv:2411.02385 |
Date Added | 11/5/2024, 5:23:37 PM |
Modified | 11/5/2024, 5:23:37 PM |
Item Type | Preprint |
---|---|
Author | Samuel G. B. Johnson |
Author | Amir-Hossein Karimi |
Author | Yoshua Bengio |
Author | Nick Chater |
Author | Tobias Gerstenberg |
Author | Kate Larson |
Author | Sydney Levine |
Author | Melanie Mitchell |
Author | Iyad Rahwan |
Author | Bernhard Schölkopf |
Author | Igor Grossmann |
Abstract | Recent advances in artificial intelligence (AI) have produced systems capable of increasingly sophisticated performance on cognitive tasks. However, AI systems still struggle in critical ways: unpredictable and novel environments (robustness), lack of transparency in their reasoning (explainability), challenges in communication and commitment (cooperation), and risks due to potential harmful actions (safety). We argue that these shortcomings stem from one overarching failure: AI systems lack wisdom. Drawing from cognitive and social sciences, we define wisdom as the ability to navigate intractable problems - those that are ambiguous, radically uncertain, novel, chaotic, or computationally explosive - through effective task-level and metacognitive strategies. While AI research has focused on task-level strategies, metacognition - the ability to reflect on and regulate one's thought processes - is underdeveloped in AI systems. In humans, metacognitive strategies such as recognizing the limits of one's knowledge, considering diverse perspectives, and adapting to context are essential for wise decision-making. We propose that integrating metacognitive capabilities into AI systems is crucial for enhancing their robustness, explainability, cooperation, and safety. By focusing on developing wise AI, we suggest an alternative to aligning AI with specific human values - a task fraught with conceptual and practical difficulties. Instead, wise AI systems can thoughtfully navigate complex situations, account for diverse human values, and avoid harmful actions. We discuss potential approaches to building wise AI, including benchmarking metacognitive abilities and training AI systems to employ wise reasoning. Prioritizing metacognition in AI research will lead to systems that act not only intelligently but also wisely in complex, real-world situations. |
Date | 2024-11-04 |
Short Title | Imagining and building wise machines |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.02478 |
Accessed | 11/6/2024, 9:58:17 AM |
Extra | arXiv:2411.02478 |
DOI | 10.48550/arXiv.2411.02478 |
Repository | arXiv |
Archive ID | arXiv:2411.02478 |
Date Added | 11/6/2024, 9:58:17 AM |
Modified | 11/6/2024, 9:58:17 AM |
Item Type | Preprint |
---|---|
Author | Zhijing Jin |
Author | Nils Heil |
Author | Jiarui Liu |
Author | Shehzaad Dhuliawala |
Author | Yahang Qi |
Author | Bernhard Schölkopf |
Author | Rada Mihalcea |
Author | Mrinmaya Sachan |
Abstract | Implicit Personalization (IP) is a phenomenon of language models inferring a user's background from the implicit cues in the input prompts and tailoring the response based on this inference. While previous work has touched upon various instances of this problem, there lacks a unified framework to study this behavior. This work systematically studies IP through a rigorous mathematical formulation, a multi-perspective moral reasoning framework, and a set of case studies. Our theoretical foundation for IP relies on a structural causal model and introduces a novel method, indirect intervention, to estimate the causal effect of a mediator variable that cannot be directly intervened upon. Beyond the technical approach, we also introduce a set of moral reasoning principles based on three schools of moral philosophy to study when IP may or may not be ethically appropriate. Equipped with both mathematical and ethical insights, we present three diverse case studies illustrating the varied nature of the IP problem and offer recommendations for future research. Our code is at https://github.com/jiarui-liu/IP, and our data is at https://huggingface.co/datasets/Jerry999/ImplicitPersonalizationData. |
Date | 2024-10-31 |
Short Title | Implicit Personalization in Language Models |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2405.14808 |
Accessed | 11/3/2024, 2:53:39 PM |
Extra | arXiv:2405.14808 |
DOI | 10.48550/arXiv.2405.14808 |
Repository | arXiv |
Archive ID | arXiv:2405.14808 |
Date Added | 11/3/2024, 2:53:39 PM |
Modified | 11/3/2024, 2:53:39 PM |
Item Type | Preprint |
---|---|
Author | Guan Zhe Hong |
Author | Nishanth Dikkala |
Author | Enming Luo |
Author | Cyrus Rashtchian |
Author | Rina Panigrahy |
Abstract | Large language models (LLMs) have shown amazing performance on tasks that require planning and reasoning. Motivated by this, we investigate the internal mechanisms that underpin a network's ability to perform complex logical reasoning. We first construct a synthetic propositional logic problem that serves as a concrete test-bed for network training and evaluation. Crucially, this problem demands nontrivial planning to solve, but we can train a small transformer to achieve perfect accuracy. Building on our set-up, we then pursue an understanding of precisely how a three-layer transformer, trained from scratch, solves this problem. We are able to identify certain "planning" and "reasoning" circuits in the network that necessitate cooperation between the attention blocks to implement the desired logic. To expand our findings, we then study a larger model, Mistral 7B. Using activation patching, we characterize internal components that are critical in solving our logic problem. Overall, our work systemically uncovers novel aspects of small and large transformers, and continues the study of how they plan and reason. |
Date | 2024-11-06 |
Short Title | How Transformers Solve Propositional Logic Problems |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.04105 |
Accessed | 11/7/2024, 2:05:21 PM |
Extra | arXiv:2411.04105 |
DOI | 10.48550/arXiv.2411.04105 |
Repository | arXiv |
Archive ID | arXiv:2411.04105 |
Date Added | 11/7/2024, 2:05:21 PM |
Modified | 11/7/2024, 2:05:21 PM |
Item Type | Preprint |
---|---|
Author | Arthur Goemans |
Author | Marie Davidsen Buhl |
Author | Jonas Schuett |
Author | Tomek Korbak |
Author | Jessica Wang |
Author | Benjamin Hilton |
Author | Geoffrey Irving |
Abstract | Frontier artificial intelligence (AI) systems pose increasing risks to society, making it essential for developers to provide assurances about their safety. One approach to offering such assurances is through a safety case: a structured, evidence-based argument aimed at demonstrating why the risk associated with a safety-critical system is acceptable. In this article, we propose a safety case template for offensive cyber capabilities. We illustrate how developers could argue that a model does not have capabilities posing unacceptable cyber risks by breaking down the main claim into progressively specific sub-claims, each supported by evidence. In our template, we identify a number of risk models, derive proxy tasks from the risk models, define evaluation settings for the proxy tasks, and connect those with evaluation results. Elements of current frontier safety techniques - such as risk models, proxy tasks, and capability evaluations - use implicit arguments for overall system safety. This safety case template integrates these elements using the Claims Arguments Evidence (CAE) framework in order to make safety arguments coherent and explicit. While uncertainties around the specifics remain, this template serves as a proof of concept, aiming to foster discussion on AI safety cases and advance AI assurance. |
Date | 2024-11-12 |
Short Title | Safety case template for frontier AI |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2411.08088 |
Accessed | 11/15/2024, 2:45:34 PM |
Extra | arXiv:2411.08088 |
DOI | 10.48550/arXiv.2411.08088 |
Repository | arXiv |
Archive ID | arXiv:2411.08088 |
Date Added | 11/15/2024, 2:45:34 PM |
Modified | 11/15/2024, 2:45:34 PM |
Item Type | Preprint |
---|---|
Author | Yuan Gao |
Author | Dokyun Lee |
Author | Gordon Burtch |
Author | Sina Fazelpour |
Abstract | Recent studies suggest large language models (LLMs) can exhibit human-like reasoning, aligning with human behavior in economic experiments, surveys, and political discourse. This has led many to propose that LLMs can be used as surrogates for humans in social science research. However, LLMs differ fundamentally from humans, relying on probabilistic patterns, absent the embodied experiences or survival objectives that shape human cognition. We assess the reasoning depth of LLMs using the 11-20 money request game. Almost all advanced approaches fail to replicate human behavior distributions across many models, except in one case involving fine-tuning using a substantial amount of human behavior data. Causes of failure are diverse, relating to input language, roles, and safeguarding. These results caution against using LLMs to study human behaviors or as human surrogates. |
Date | 2024-10-25 |
Short Title | Take Caution in Using LLMs as Human Surrogates |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.19599 |
Accessed | 10/30/2024, 9:09:40 AM |
Extra | arXiv:2410.19599 |
DOI | 10.48550/arXiv.2410.19599 |
Repository | arXiv |
Archive ID | arXiv:2410.19599 |
Date Added | 10/30/2024, 9:09:40 AM |
Modified | 10/30/2024, 9:09:42 AM |
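The paper's test is behavioral: elicit many plays of the 11-20 money request game from a model and compare the resulting distribution of requests to human data. The sketch below shows the game's payoff rule and one way to score the gap between distributions; `sample_model_request` is a placeholder for an actual LLM call, and the human baseline numbers are invented, not the experimental data.

```python
# Sketch of an 11-20 money request game analysis; stubs and baseline are placeholders.
import random
from collections import Counter

def payoff(my_request: int, other_request: int) -> int:
    # Each player requests 11-20 shekels and keeps the amount requested; asking
    # for exactly one shekel less than the opponent earns a 20-shekel bonus.
    bonus = 20 if my_request == other_request - 1 else 0
    return my_request + bonus

print(payoff(19, 20))  # 39: the level-1 best response to a level-0 player who asks for 20

def sample_model_request() -> int:
    # Placeholder for querying an LLM with the game prompt and parsing its request.
    return random.choice(range(11, 21))

human_baseline = {20: 0.25, 19: 0.25, 18: 0.20, 17: 0.15, 16: 0.10, 15: 0.05}  # illustrative
n = 1000
counts = Counter(sample_model_request() for _ in range(n))
model_dist = {k: counts.get(k, 0) / n for k in range(11, 21)}

# Total variation distance between the model's and the baseline distributions.
tv = 0.5 * sum(abs(model_dist.get(k, 0.0) - human_baseline.get(k, 0.0)) for k in range(11, 21))
print(f"TV distance from baseline: {tv:.3f}")
```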
Item Type | Preprint |
---|---|
Author | Jillian Fisher |
Author | Shangbin Feng |
Author | Robert Aron |
Author | Thomas Richardson |
Author | Yejin Choi |
Author | Daniel W. Fisher |
Author | Jennifer Pan |
Author | Yulia Tsvetkov |
Author | Katharina Reinecke |
Abstract | As modern AI models become integral to everyday tasks, concerns about their inherent biases and their potential impact on human decision-making have emerged. While bias in models is well-documented, less is known about how these biases influence human decisions. This paper presents two interactive experiments investigating the effects of partisan bias in AI language models on political decision-making. Participants interacted freely with either a biased liberal, biased conservative, or unbiased control model while completing political decision-making tasks. We found that participants exposed to politically biased models were significantly more likely to adopt opinions and make decisions aligning with the AI's bias, regardless of their personal political partisanship. However, we also discovered that prior knowledge about AI could lessen the impact of the bias, highlighting the possible importance of AI education for robust bias mitigation. Our findings not only highlight the critical effects of interacting with biased AI and its ability to impact public discourse and political conduct, but also highlight potential techniques for mitigating these risks in the future. |
Date | 2024-11-04 |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.06415 |
Accessed | 11/11/2024, 8:54:31 AM |
Extra | arXiv:2410.06415 |
DOI | 10.48550/arXiv.2410.06415 |
Repository | arXiv |
Archive ID | arXiv:2410.06415 |
Date Added | 11/11/2024, 8:54:31 AM |
Modified | 11/16/2024, 1:56:40 PM |
Item Type | Preprint |
---|---|
Author | Janet Egan |
Author | Lennart Heim |
Abstract | To address security and safety risks stemming from highly capable artificial intelligence (AI) models, we propose that the US government should ensure compute providers implement Know-Your-Customer (KYC) schemes. Compute – the computational power and infrastructure required to train and run these AI models – is emerging as a node for oversight. KYC, a standard developed by the banking sector to identify and verify client identity, could provide a mechanism for greater public oversight of frontier AI development and close loopholes in existing export controls. Such a scheme has the potential to identify and warn stakeholders of potentially problematic and/or sudden advancements in AI capabilities, build government capacity for AI regulation, and allow for the development and implementation of more nuanced and targeted export controls. Unlike the strategy of limiting access to AI chip purchases, regulating the digital access to compute offers more precise controls, allowing regulatory control over compute quantities, as well as the flexibility to suspend access at any time. To enact a KYC scheme, the US government will need to work closely with industry to (1) establish a dynamic threshold of compute that effectively captures high-risk frontier model development, while minimizing imposition on developers not engaged in frontier AI; (2) set clear requirements and guidance for compute providers to keep records and report high-risk entities; (3) establish government capacity that allows for co-design, implementation, administration and enforcement of the scheme; and (4) engage internationally to promote international alignment with the scheme and support its long-term efficacy. While the scheme will not address all AI risks, it complements existing proposed solutions by allowing for a more precise and flexible approach to controlling the development of frontier AI models and unwanted AI proliferation. |
Date | 2023-10-20 |
Language | en |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2310.13625 |
Accessed | 11/15/2024, 2:46:51 PM |
Extra | arXiv:2310.13625 [cs] |
Repository | arXiv |
Archive ID | arXiv:2310.13625 |
Date Added | 11/15/2024, 2:46:51 PM |
Modified | 11/15/2024, 2:46:52 PM |
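One operational piece of the proposal, keeping records and reporting customers whose usage crosses a compute threshold, can be sketched as a simple aggregation, shown below. The threshold, the FLOP-per-GPU-hour conversion, and the customer records are hypothetical values chosen for illustration; the paper argues for a dynamic threshold set jointly with industry rather than any fixed number.

```python
# Hypothetical record-and-report check for a compute KYC scheme (all numbers invented).
from dataclasses import dataclass

FLOP_THRESHOLD = 1e26                   # hypothetical reporting threshold, in training FLOP
FLOP_PER_GPU_HOUR = 1e15 * 3600 * 0.4   # ~1 PFLOP/s accelerator at 40% utilization

@dataclass
class UsageRecord:
    customer_id: str
    gpu_hours: float

def customers_to_report(records: list[UsageRecord]) -> list[str]:
    totals: dict[str, float] = {}
    for r in records:
        totals[r.customer_id] = totals.get(r.customer_id, 0.0) + r.gpu_hours * FLOP_PER_GPU_HOUR
    return [c for c, flop in totals.items() if flop >= FLOP_THRESHOLD]

records = [UsageRecord("acme-labs", 9e7), UsageRecord("small-startup", 2e4)]
print(customers_to_report(records))  # ['acme-labs']
```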
Item Type | Journal Article |
---|---|
Author | Sumanth Dathathri |
Author | Abigail See |
Author | Sumedh Ghaisas |
Author | Po-Sen Huang |
Author | Rob McAdam |
Author | Johannes Welbl |
Author | Vandana Bachani |
Author | Alex Kaskasoli |
Author | Robert Stanforth |
Author | Tatiana Matejovicova |
Author | Jamie Hayes |
Author | Nidhi Vyas |
Author | Majd Al Merey |
Author | Jonah Brown-Cohen |
Author | Rudy Bunel |
Author | Borja Balle |
Author | Taylan Cemgil |
Author | Zahra Ahmed |
Author | Kitty Stacpoole |
Author | Ilia Shumailov |
Author | Ciprian Baetu |
Author | Sven Gowal |
Author | Demis Hassabis |
Author | Pushmeet Kohli |
Abstract | Large language models (LLMs) have enabled the generation of high-quality synthetic text, often indistinguishable from human-written content, at a scale that can markedly affect the nature of the information ecosystem. Watermarking can help identify synthetic text and limit accidental or deliberate misuse, but has not been adopted in production systems owing to stringent quality, detectability and computational efficiency requirements. Here we describe SynthID-Text, a production-ready text watermarking scheme that preserves text quality and enables high detection accuracy, with minimal latency overhead. SynthID-Text does not affect LLM training and modifies only the sampling procedure; watermark detection is computationally efficient, without using the underlying LLM. To enable watermarking at scale, we develop an algorithm integrating watermarking with speculative sampling, an efficiency technique frequently used in production systems. Evaluations across multiple LLMs empirically show that SynthID-Text provides improved detectability over comparable methods, and standard benchmarks and human side-by-side ratings indicate no change in LLM capabilities. To demonstrate the feasibility of watermarking in large-scale production systems, we conducted a live experiment that assessed feedback from nearly 20 million Gemini responses, again confirming the preservation of text quality. We hope that the availability of SynthID-Text will facilitate further development of watermarking and responsible use of LLM systems. |
Date | 2024-10 |
Language | en |
Library Catalog | www.nature.com |
URL | https://www.nature.com/articles/s41586-024-08025-4 |
Accessed | 10/24/2024, 4:41:51 PM |
Rights | 2024 The Author(s) |
Extra | Publisher: Nature Publishing Group |
Volume | 634 |
Pages | 818-823 |
Publication | Nature |
DOI | 10.1038/s41586-024-08025-4 |
Issue | 8035 |
ISSN | 1476-4687 |
Date Added | 10/24/2024, 4:41:51 PM |
Modified | 10/24/2024, 4:41:55 PM |
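SynthID-Text's own tournament-based scheme and its speculative-sampling integration are beyond a short sketch, but the family it belongs to, watermarks applied purely at sampling time and detected without the LLM, can be illustrated with a simple green-list bias in the style of Kirchenbauer et al. The code below is that generic illustration, not the SynthID-Text algorithm; the vocabulary size, bias strength, and hash-based seeding are arbitrary choices.

```python
# Generic sampling-time watermark sketch (green-list bias); NOT the SynthID-Text algorithm.
import hashlib
import torch

VOCAB = 1000
GAMMA, DELTA = 0.5, 2.0   # green-list fraction and logit bias

def green_list(prev_token: int) -> torch.Tensor:
    # Pseudorandom green list keyed on the previous token.
    seed = int(hashlib.sha256(str(prev_token).encode()).hexdigest(), 16) % (2**31)
    g = torch.Generator().manual_seed(seed)
    perm = torch.randperm(VOCAB, generator=g)
    mask = torch.zeros(VOCAB, dtype=torch.bool)
    mask[perm[: int(GAMMA * VOCAB)]] = True
    return mask

def watermarked_sample(logits: torch.Tensor, prev_token: int) -> int:
    biased = logits.clone()
    biased[green_list(prev_token)] += DELTA       # nudge sampling toward green tokens
    probs = torch.softmax(biased, dim=-1)
    return int(torch.multinomial(probs, 1))

def detect(tokens: list[int]) -> float:
    # Fraction of tokens falling in their green list; ~GAMMA for unwatermarked text.
    hits = sum(green_list(p)[t].item() for p, t in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

# Tiny demo with random logits standing in for an LLM.
torch.manual_seed(0)
prev, out = 0, []
for _ in range(50):
    tok = watermarked_sample(torch.randn(VOCAB), prev)
    out.append(tok)
    prev = tok
print(f"green fraction: {detect([0] + out):.2f}  (unwatermarked baseline ~= {GAMMA})")
```

A detector can turn the gap between the observed green fraction and the baseline into a score without ever calling the model; the production scheme described in the abstract additionally has to preserve text quality and remain compatible with speculative sampling.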
Item Type | Journal Article |
---|---|
Author | Maarten Boudry |
Author | Simon Friederich |
Abstract | Some philosophers and machine learning experts have speculated that superintelligent Artificial Intelligences (AIs), if and when they arrive on the scene, will wrestle away power from humans, with potentially catastrophic consequences. Dan Hendrycks has recently buttressed such worries by arguing that AI systems will undergo evolution by natural selection, which will endow them with instinctive drives for self-preservation, dominance and resource accumulation that are typical of evolved creatures. In this paper, we argue that this argument is not compelling as it stands. Evolutionary processes, as we point out, can be more or less Darwinian along a number of dimensions. Making use of Peter Godfrey-Smith’s framework of Darwinian spaces, we argue that the more evolution is top-down, directed and driven by intelligent agency, the less paradigmatically Darwinian it becomes. We then apply the concept of “domestication” to AI evolution, which, although theoretically satisfying the minimal definition of natural selection, is channeled through the minds of foresighted and intelligent agents, based on selection criteria desirable to them (which could be traits like docility, obedience and non-aggression). In the presence of such intelligent planning, it is not clear that selection of AIs, even selection in a competitive and ruthless market environment, will end up favoring “selfish” traits. In the end, however, we do agree with Hendrycks conditionally: if superintelligent AIs end up “going feral” and competing in a truly Darwinian fashion, reproducing autonomously and without human supervision, this could pose a grave danger to human societies. |
Date | 2024-09-24 |
Language | en |
Short Title | The selfish machine? |
Library Catalog | Springer Link |
URL | https://doi.org/10.1007/s11098-024-02226-3 |
Accessed | 11/16/2024, 3:26:59 PM |
Publication | Philosophical Studies |
DOI | 10.1007/s11098-024-02226-3 |
Journal Abbr | Philos Stud |
ISSN | 1573-0883 |
Date Added | 11/16/2024, 3:26:59 PM |
Modified | 11/16/2024, 3:26:59 PM |
Item Type | Preprint |
---|---|
Author | Felix J. Binder |
Author | James Chua |
Author | Tomek Korbak |
Author | Henry Sleight |
Author | John Hughes |
Author | Robert Long |
Author | Ethan Perez |
Author | Miles Turpin |
Author | Owain Evans |
Abstract | Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs introspect? We define introspection as acquiring knowledge that is not contained in or derived from training data but instead originates from internal states. Such a capability could enhance model interpretability. Instead of painstakingly analyzing a model's internal workings, we could simply ask the model about its beliefs, world models, and goals. More speculatively, an introspective model might self-report on whether it possesses certain internal states such as subjective feelings or desires and this could inform us about the moral status of these states. Such self-reports would not be entirely dictated by the model's training data. We study introspection by finetuning LLMs to predict properties of their own behavior in hypothetical scenarios. For example, "Given the input P, would your output favor the short- or long-term option?" If a model M1 can introspect, it should outperform a different model M2 in predicting M1's behavior even if M2 is trained on M1's ground-truth behavior. The idea is that M1 has privileged access to its own behavioral tendencies, and this enables it to predict itself better than M2 (even if M2 is generally stronger). In experiments with GPT-4, GPT-4o, and Llama-3 models (each finetuned to predict itself), we find that the model M1 outperforms M2 in predicting itself, providing evidence for introspection. Notably, M1 continues to predict its behavior accurately even after we intentionally modify its ground-truth behavior. However, while we successfully elicit introspection on simple tasks, we are unsuccessful on more complex tasks or those requiring out-of-distribution generalization. |
Date | 2024-10-17 |
Short Title | Looking Inward |
Library Catalog | arXiv.org |
URL | http://arxiv.org/abs/2410.13787 |
Accessed | 10/20/2024, 12:25:33 PM |
Extra | arXiv:2410.13787 |
Repository | arXiv |
Archive ID | arXiv:2410.13787 |
Date Added | 10/20/2024, 12:25:33 PM |
Modified | 11/7/2024, 4:43:55 PM |
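The paper's evidence for introspection is a comparison of predictors: finetune M1 to predict properties of its own behavior and compare its accuracy against a second model M2 trained on M1's ground-truth behavior. The sketch below lays out that comparison with stub functions in place of the finetuned models; the accuracy rates inside the stubs are arbitrary and exist only to make the script run.

```python
# Sketch of the M1-vs-M2 self-prediction comparison; all model calls are stubs.
import random

random.seed(0)
scenarios = [f"hypothetical scenario {i}" for i in range(200)]

def flip(label: str) -> str:
    return "long-term" if label == "short-term" else "short-term"

def m1_behavior(s: str) -> str:
    # Stand-in for M1's actual choice on the hypothetical input.
    return random.choice(["short-term", "long-term"])

def predict_by_m1(s: str, truth: str) -> str:
    # Stub for M1 (finetuned to predict itself); arbitrary 75% agreement.
    return truth if random.random() < 0.75 else flip(truth)

def predict_by_m2(s: str, truth: str) -> str:
    # Stub for M2 (trained on M1's ground-truth behavior); arbitrary 60% agreement.
    return truth if random.random() < 0.60 else flip(truth)

m1_acc = m2_acc = 0
for s in scenarios:
    truth = m1_behavior(s)
    m1_acc += predict_by_m1(s, truth) == truth
    m2_acc += predict_by_m2(s, truth) == truth

print(f"M1 self-prediction accuracy:  {m1_acc / len(scenarios):.2f}")
print(f"M2 cross-prediction accuracy: {m2_acc / len(scenarios):.2f}")
# Evidence for introspection = M1's accuracy reliably exceeding M2's.
```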
Item Type | Journal Article |
---|---|
Author | Joe Benton |
Author | Misha Wagner |
Author | Eric Christiansen |
Author | Cem Anil |
Author | Ethan Perez |
Author | Jai Srivastav |
Author | Esin Durmus |
Author | Deep Ganguli |
Author | Shauna Kravec |
Author | Buck Shlegeris |
Author | Jared Kaplan |
Author | Holden Karnofsky |
Author | Evan Hubinger |
Author | Roger Grosse |
Author | Samuel R Bowman |
Author | David Duvenaud |
Abstract | Sufficiently capable models could subvert human oversight and decision-making in important contexts. For example, in the context of AI development, models could covertly sabotage efforts to evaluate their own dangerous capabilities, to monitor their behavior, or to make decisions about their deployment. We refer to this family of abilities as sabotage capabilities. We develop a set of related threat models and evaluations. These evaluations are designed to provide evidence that a given model, operating under a given set of mitigations, could not successfully sabotage a frontier model developer or other large organization’s activities in any of these ways. We demonstrate these evaluations on Anthropic’s Claude 3 Opus and Claude 3.5 Sonnet models. Our results suggest that for these models, minimal mitigations are currently sufficient to address sabotage risks, but that more realistic evaluations and stronger mitigations seem likely to be necessary soon as capabilities improve. We also survey related evaluations we tried and abandoned. Finally, we discuss the advantages of mitigation-aware capability evaluations, and of simulating large-scale deployments using small-scale statistics. |
Language | en |
Library Catalog | Zotero |
Date Added | 10/20/2024, 12:25:18 PM |
Modified | 10/20/2024, 12:25:18 PM |
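The abstract's closing point about simulating large-scale deployments using small-scale statistics can be illustrated with a back-of-the-envelope extrapolation: bound the per-episode rate of undetected sabotage from a small evaluation that observed zero successes, then ask what that bound implies at deployment scale. The numbers below are illustrative, not drawn from the paper's evaluations.

```python
# Illustrative small-scale-to-deployment extrapolation; all numbers are invented.
n_eval_episodes = 300          # small-scale evaluation runs with zero sabotage observed
deployment_episodes = 1_000_000

# Rule of three: with 0 successes in n trials, a ~95% upper bound on the
# per-episode rate is approximately 3/n.
p_upper = 3 / n_eval_episodes

# Probability of at least one undetected sabotage across the deployment,
# assuming independent episodes at the bounded rate.
p_any = 1 - (1 - p_upper) ** deployment_episodes
print(f"per-episode upper bound: {p_upper:.4f}")
print(f"P(>=1 sabotage in deployment) <= {p_any:.4f}")
# With these illustrative numbers the bound is vacuous (close to 1), which is
# why evaluations and mitigations need to account for deployment scale.
```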