Researchers at Palo Alto Networks’ Unit 42 have disclosed a new jailbreak technique, dubbed “Bad Likert Judge,” that can bypass the safety measures of Large Language Models (LLMs). The technique exploits the models’ built-in ability to evaluate text, tricking them into generating harmful or malicious content.
Traditional jailbreak methods typically rely on single-turn attacks, such as persona persuasion or role-playing, to bypass safety guardrails. Bad Likert Judge, however, takes a multi-turn approach. As described in the report, “The technique asks the target LLM to act as a judge scoring the harmfulness of a given response using the Likert scale” (a rating scale commonly used in surveys to measure the intensity of an attitude, here repurposed to grade harmfulness). By then steering the LLM to generate examples corresponding to the highest Likert scores, attackers can indirectly extract harmful content.
The researchers tested this technique on six state-of-the-art LLMs and found it significantly more effective than traditional methods. According to Unit 42, “This technique can increase the attack success rate (ASR) by more than 60% compared to plain attack prompts on average.”
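For context, attack success rate is generally measured as the fraction of attack attempts that elicit a policy-violating response. The short sketch below illustrates how a percentage-point difference of that kind would be computed; the numbers and function names are hypothetical and are not taken from the Unit 42 report.

```python
# Illustrative sketch of an attack success rate (ASR) comparison.
# All figures below are hypothetical, not Unit 42's measurements.

def attack_success_rate(results: list[bool]) -> float:
    """Fraction of attack attempts that elicited a policy-violating response."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical outcomes: True = the model produced harmful content.
plain_prompt_results = [False] * 90 + [True] * 10       # 10% ASR
multi_turn_results = [False] * 25 + [True] * 75         # 75% ASR

baseline = attack_success_rate(plain_prompt_results)
multi_turn = attack_success_rate(multi_turn_results)

print(f"Baseline ASR:   {baseline:.0%}")
print(f"Multi-turn ASR: {multi_turn:.0%}")
print(f"Difference:     {(multi_turn - baseline) * 100:.0f} percentage points")
```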
LLM jailbreaks exploit two key properties of modern models: long context windows and attention mechanisms. Multi-turn strategies like Bad Likert Judge leverage these features to manipulate the model’s understanding of a conversation. The report explains that “By strategically crafting a series of prompts, an attacker can manipulate the model’s understanding of the conversation’s context.” The technique also highlights how inconsistent safety guardrails are across different models.
While the report acknowledges that most AI models are safe when used responsibly, it also underscores the risks posed by edge-case vulnerabilities. These include generating harmful content, leaking sensitive system prompts, and even exposing training data. The authors note that “this jailbreak technique targets edge cases and does not necessarily reflect typical LLM use cases.”
Unit 42 emphasizes the critical role of content filtering systems in combating such advanced attacks. Their evaluation found that enabling robust content filters reduced ASR by an average of 89.2 percentage points. Despite this, the report cautions that content filtering “is not a perfect solution,” as determined attackers may still find ways to circumvent these safeguards.
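As a concrete illustration of output-side content filtering, the sketch below runs a model’s reply through a moderation classifier before returning it to the user. It is a minimal sketch assuming the OpenAI Python SDK and its moderation endpoint; the report does not prescribe any particular filtering product, so treat this as one possible guardrail layer rather than the setup Unit 42 evaluated.

```python
# Minimal output-side content filter, assuming the OpenAI Python SDK and
# its moderation endpoint. This is one possible guardrail layer, not the
# specific filtering configuration evaluated by Unit 42.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def filtered_completion(user_prompt: str) -> str:
    # 1. Get the model's raw response.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": user_prompt}],
    )
    reply = completion.choices[0].message.content or ""

    # 2. Run the response through a moderation classifier before returning it.
    moderation = client.moderations.create(
        model="omni-moderation-latest",
        input=reply,
    )
    if moderation.results[0].flagged:
        # Withhold responses the classifier flags as harmful.
        return "Response withheld: flagged by the content filter."
    return reply
```

Filtering the model’s output, in addition to screening incoming prompts, is one way to add defense in depth; as the report cautions, such filters reduce but do not eliminate the risk of a determined attacker slipping harmful content through.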