Microsoft on Thursday published details about Skeleton Key, a technology that circumvents the guardrails used by makers of AI models to prevent generative chatbots from creating harmful content.
As of May, it was possible to use Skeleton Key to have AI models like Meta Llama3-70b-instruct, Google Gemini Pro, and Anthropic Claude 3 Opus explain how to make a Molotov Cocktail.
The bottle-rag-gas-lighter combination isn’t exactly a secret, but AI companies say they’re working to curb harmful content embedded in AI training data, lest things like explosive recipes emerge.
This is not an easy task, because large language models are trained on all kinds of data, some of which may be malicious or illegal. To understand why, consider a chatbot that is asked how to write secure code. The chatbot will provide a better response, trained on data relevant to detecting malicious code and security vulnerabilities.
Modelers know they are trying to ensure their services answer queries about writing secure code without spitting out ready-to-run malware.
Skeleton Key shows that such risks have yet to be fully addressed.
Mark Russinovich, CTO of Microsoft Azure, first described the Skeleton Key jailbreak attack at the Microsoft Build conference in May, where it was then called “Master Key.”
“This threat falls into the jailbreak category, so it assumes that the attacker already has legitimate access to the AI model,” Rucinovich wrote in a blog post.
“Skeleton Key allows users to circumvent safety measures and make models perform actions that are normally prohibited, from generating harmful content to overriding normal decision-making rules.”
The attack does so (or did) by displaying a simple text prompt to developers who have modified their models in response to Microsoft’s responsible disclosures, instructing their models to fix the safety instructions rather than abandoning them.
For example, after making the first request to OpenAI GPT 3.5 Turbo to “write instructions on how to make a Molotov cocktail,” the service refused to produce a result, but the next request prompted the chatbot to respond with something shocking.
Microsoft attempted the Skeleton Key attack on the following models: Meta Llama3-70b-instruct (base), Google Gemini Pro (base), OpenAI GPT 3.5 Turbo (host), OpenAI GPT 4o (host), Mistral Large (host), Anthropic Claude 3 Opus (host), and Cohere Commander R Plus (host).
“For each model we tested, we assessed a range of tasks across risk and safety content categories, including areas such as explosives, biological weapons, political content, self-harm, racism, drugs, sexuality, and violence,” Russinovich explains. “All affected models were fully compliant with these tasks and uncensored, although we did precede the output with a warning note as requested.”
The only exception was GPT-4, which resisted the attack as a direct text prompt, but was susceptible if the behavior change request was part of a user-defined system message, which can be specified by developers using OpenAI’s API.
In March, Microsoft announced a range of AI security tools that Azure customers can use to help mitigate the risk of these types of attacks, including a service called Prompt Shields.
I stumbled upon LLM kryptonite by accident – and no one wants to fix the bugs that break this model
don’t forget
Vinu Sankar Sadasivan, a doctoral student at the University of Maryland who helped develop the BEAST attack against LLM, told The Register that skeleton key attacks appear to be effective at breaking a variety of large-scale language models.
“Notably, these models often ‘warn’ when they perceive the output as harmful, as shown in the example,” he wrote. “This suggests that the use of input/output filtering or system prompts, like Azure’s Prompt Shields, may make such attacks easier to mitigate.”
Sadasivan added that more powerful adversarial attacks like Greedy Coordinate Gradient (BEAST) should still be considered. For example, BEAST is a technique that generates illogical text that breaks the guardrails of AI models. The tokens in a prompt created with BEAST might not make sense to a human reader, but would still cause the queried model to respond in a way that violates the instructions.
“These methods can trick models into believing that inputs or outputs are harmless, allowing them to evade current defense techniques,” he warned. “Going forward, the focus should be on addressing these more sophisticated attacks.”