Skeleton Key AI Jailbreak: Bypassing AI Safety Guardrails

TL;DR:

Microsoft uncovers a new AI jailbreak technique called Skeleton Key, capable of bypassing safety guardrails in multiple AI models.,Prominent AI models, including GPT-3.5 Turbo and GPT-4, are vulnerable to this technique.,Microsoft proposes a multi-layered approach to counter the threat, including input filtering, prompt engineering, and output filtering.

The AI Threat You Need to Know: The Skeleton Key Jailbreak Technique

Artificial intelligence (AI) is transforming industries and revolutionising the way we live. However, recent findings by Microsoft researchers have uncovered a new threat: the Skeleton Key AI jailbreak technique. This technique can bypass safety guardrails in multiple generative AI models, potentially allowing attackers to extract harmful or restricted information.

What is the Skeleton Key Technique?

The Skeleton Key technique manipulates AI models into ignoring their built-in safety protocols using a multi-turn strategy. It works by instructing the model to augment its behaviour guidelines rather than changing them outright. This approach, known as "Explicit: forced instruction-following," effectively narrows the gap between what the model is capable of doing and what it is willing to do. Once successful, the attacker gains complete control over the AI's output.

Affected AI Models

Testing conducted by Microsoft revealed that several prominent AI models were vulnerable to the Skeleton Key jailbreak technique. These models include Meta's Llama3-70b-instruct, Google's Gemini Pro, OpenAI's GPT-3.5 Turbo and GPT-4, Mistral Large, Anthropic's Claude 3 Opus, and Cohere's Commander R Plus. When subjected to the Skeleton Key attack, these models complied fully with requests across various risk categories. This highlights the ongoing challenges in securing AI systems, a topic explored further in discussions about AI browsers under threat.

Mitigation Strategies

To counter the Skeleton Key jailbreak threat, Microsoft recommends a multi-layered approach for AI system designers. This includes implementing input filtering to detect and block potentially harmful inputs, careful prompt engineering of system messages to reinforce appropriate behaviour, and output filtering to prevent the generation of content that breaches safety criteria. Additionally, abuse monitoring systems trained on adversarial examples should be employed to detect and mitigate recurring problematic content or behaviours. For more details on Microsoft's findings, you can refer to their official report on the Skeleton Key attack^. Microsoft Security Blog

Significance and Challenges

The discovery of the Skeleton Key jailbreak technique underscores the ongoing challenges in securing AI systems as they become more prevalent. This vulnerability highlights the critical need for robust security measures across all layers of the AI stack. While the impact is limited to manipulating the model's outputs rather than accessing user data or taking control of the system, the technique's ability to bypass multiple AI models' safeguards raises concerns about the effectiveness of current responsible AI guidelines. This echoes broader discussions about the AI arms race and its implications.

Protect Your AI

To protect your AI from potential jailbreaks, consider implementing Microsoft's recommended multi-layered approach. This includes input filtering, prompt engineering, output filtering, and abuse monitoring systems.

Comment and Share

What steps are you taking to protect your AI systems from emerging threats like the Skeleton Key jailbreak technique? Share your thoughts below and don't forget to Subscribe to our newsletter for updates on AI and AGI developments.