Conference item
Secret collusion among AI agents: multi-agent deception via steganography
- Abstract:
- Recent advancements in generative AI suggest the potential for large-scale interaction between autonomous agents and humans across platforms such as the internet. While such interactions could foster productive cooperation, the ability of AI agents to circumvent security oversight raises critical multi-agent security problems, particularly in the form of unintended information sharing or undesirable coordination. In our work, we establish the subfield of secret collusion, a form of multi-agent deception, in which two or more agents employ steganographic methods to conceal the true nature of their interactions, be it communicative or otherwise, from oversight. We propose a formal threat model for AI agents communicating steganographically and derive rigorous theoretical insights about the capacity and incentives of large language models (LLMs) to perform secret collusion, in addition to the limitations of threat mitigation measures. We complement our findings with empirical evaluations demonstrating rising steganographic capabilities in frontier single and multi-agent LLM setups and examining potential scenarios where collusion may emerge, revealing limitations in countermeasures such as monitoring, paraphrasing, and parameter optimization. Our work is the first to formalize and investigate secret collusion among frontier foundation models, identifying it as a critical area in AI Safety and outlining a comprehensive research agenda to mitigate future risks of collusion between generative AI systems.
- Publication status:
- Published
- Peer review status:
- Peer reviewed
- Publisher:
- NeurIPS
- Host title:
- Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
- Publication date:
- 2025-01-31
- Acceptance date:
- 2023-09-26
- Event title:
- 38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
- Event location:
- Vancouver, Canada
- Event website:
- https://neurips.cc/Conferences/2024
- Event start date:
- 2024-12-09
- Event end date:
- 2024-12-15
- Language:
- English
- Pubs id:
- 2007767
- Local pid:
- pubs:2007767
- Deposit date:
- 2024-06-12
Terms of use
- Copyright holder:
- Motwani et al. and NeurIPS
- Copyright date:
- 2025
- Rights statement:
- © (2025) by individual authors and Neural Information Processing Systems Foundation Inc. All rights reserved.
- Notes:
- This is the accepted manuscript version of the paper. The final version is available online from NeurIPS at https://proceedings.neurips.cc/paper_files/paper/2024/hash/861f7dad098aec1c3560fb7add468d41-Abstract-Conference.html