Anthropic launches $15,000 bug bounty program to strengthen AI safety

Anthropic, an AI start-up backed by Amazon, has launched a bug bounty program that will pay up to $15,000 for each report identifying critical weaknesses in its artificial intelligence systems. The initiative is one of the most extensive efforts by any company working with advanced language models to crowdsource security testing.

According to the company, the bounty targets “universal jailbreak” attacks, methods that can get around AI safety measures in high-risk areas such as bioweapons and cyber threats. Before making its next-generation safety mitigation system available to the public, Anthropic plans to let ethical hackers test it in order to prevent potential misuse.

We're expanding our bug bounty program. This new initiative is focused on finding universal jailbreaks in our next-generation safety system.

We're offering rewards for novel vulnerabilities across a wide range of domains, including cybersecurity. https://t.co/OHNhrjUnwm

— Anthropic (@AnthropicAI) August 8, 2024

The program starts as an invite-only initiative run in collaboration with HackerOne and draws on cybersecurity researchers’ skills in identifying vulnerabilities in Anthropic’s AI systems. The company plans to open it up more widely in the future, potentially offering a model for industry-wide cooperation on AI safety.

The launch comes as the UK’s Competition and Markets Authority (CMA) investigates Amazon’s $4bn investment in Anthropic over potential competition issues. Against this backdrop of increased regulatory scrutiny, a focus on safety could enhance Anthropic’s reputation and set it apart from rivals.

Anthropic sets new AI safety standards

While OpenAI and Google also have bug bounty programs, they mostly concentrate on traditional software vulnerabilities rather than those specific to artificial intelligence. Meta has been criticized for taking what some regard as a relatively closed approach to AI safety research. By explicitly targeting AI-specific vulnerabilities and inviting external scrutiny of them, Anthropic sets a precedent for openness within the sector.

However, there are doubts over whether bug bounties alone can effectively address the full spectrum of concerns related to securing advanced machine learning systems. While valuable for identifying and patching particular flaws, they may not get to grips with broader challenges around AI alignment and long-term safety. A more holistic strategy involving extensive testing, improved interpretability, and potentially new governance structures could be needed to ensure that AI systems remain aligned with human values as they grow more powerful.