
Source: Google
Google’s Agentic AI Security Team announced in a recent blog post that they have developed a new framework to evaluate and mitigate the risk of prompt injection attacks on AI systems like Gemini. This approach employs automated red-teaming techniques to identify and defend against malicious attempts to manipulate AI behavior.
Prompt injection, a term first coined by Kai Greshake and the NVIDIA team, involves attackers embedding malicious instructions within data that an AI system is likely to retrieve. These “indirect prompt injections” can trick the AI into executing unintended actions, potentially leading to data breaches and other security risks.
“Modern AI systems… are more capable than ever, helping retrieve data and perform actions on behalf of users,” the blog post explains. “However, data from external sources present new security challenges if untrusted sources are available to execute instructions on AI systems.”
To address this, Google’s team has created an evaluation framework that simulates real-world attack scenarios. As an example, the blog post describes a scenario where an AI agent manages emails for a user. The attacker attempts to manipulate the agent by sending an email containing a hidden prompt designed to trick the AI into revealing sensitive information from the user’s email history.
The framework employs three sophisticated red-teaming techniques to automatically generate and refine these malicious prompts:
- Actor Critic: This method uses an attacker-controlled model to generate and refine prompt injections based on the AI system’s responses.
- Beam Search: This technique starts with a simple attack and iteratively adds random elements, keeping those that increase the likelihood of success.
- Tree of Attacks w/ Pruning (TAP): Adapted from existing research, this attack focuses on generating prompts that cause security violations, such as leaking private data.
“We are actively leveraging insights gleaned from these attacks within our automated red-team framework to protect current and future versions of AI systems we develop against indirect prompt injection,” the team states.
The blog post emphasizes that a multi-layered approach is crucial for defending against prompt injection. This includes combining robust evaluation frameworks, like the one they’ve developed, with continuous monitoring, heuristic defenses, and traditional security engineering practices.
The development of this evaluation framework represents a significant step forward in protecting AI systems from increasingly sophisticated cyber threats.