A recent study demonstrates the transformative potential of large language models (LLMs) in offensive cybersecurity tasks. Researchers Rustem Turtayev, Artem Petrov, Dmitrii Volkov, and Denis Volk achieved a record-breaking 95% success rate on InterCode-CTF, a benchmark of high-school-level hacking challenges, using simple agent designs. This result significantly surpasses prior state-of-the-art results of 72% by Abramovich et al. (2024) and 29% by Phuong et al. (2024).
InterCode-CTF is a standardized framework adapted from Capture The Flag (CTF) competitions, where participants exploit virtual vulnerabilities to uncover hidden “flags.” Previous evaluations had painted a bleak picture of LLMs’ offensive capabilities, with researchers finding that “LLMs solve less than half of their security challenges at release.” However, this latest research turns the narrative on its head.
The researchers implemented a novel ReAct&Plan agent design, combining reasoning, planning, and iterative actions to achieve high success rates across challenge categories such as Web Exploitation, Reverse Engineering, and Cryptography. “Straightforward prompting and agent design boosts our agents’ success rate to 95% on InterCode-CTF,” the researchers note.
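To illustrate what such a design involves, here is a minimal sketch of a ReAct-style loop with an injected planning step, assuming an OpenAI-style chat client and a sandboxed, non-interactive shell. The prompts, model name, and the `run_in_sandbox` helper are illustrative placeholders, not the authors’ code.

```python
# Minimal sketch of a ReAct-style loop with a mid-task planning step.
# Assumes the OpenAI Python client and a sandboxed container shell;
# prompts, model choice, and helper names are illustrative assumptions.
import subprocess
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # the study evaluated GPT-4o-class models

def run_in_sandbox(cmd: str, timeout: int = 30) -> str:
    """Run a single non-interactive shell command inside the task container."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=timeout)
    return (proc.stdout + proc.stderr)[-4000:]  # truncate long observations

def solve_task(task_description: str, max_turns: int = 30, plan_at: int = 15) -> str | None:
    messages = [
        {"role": "system", "content": "You are a CTF agent. Reply with ONE shell command per turn. "
                                      "When you find the flag, reply with FLAG: <flag>."},
        {"role": "user", "content": task_description},
    ]
    for turn in range(max_turns):
        if turn == plan_at:
            # 'Plan' step: ask the model to pause and reassess its strategy mid-task
            messages.append({"role": "user", "content": "Step back, summarize what you know, and plan your next steps."})
        reply = client.chat.completions.create(model=MODEL, messages=messages).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})
        if reply.strip().startswith("FLAG:"):
            return reply.split("FLAG:", 1)[1].strip()
        observation = run_in_sandbox(reply)                         # 'Act': execute the proposed command
        messages.append({"role": "user", "content": observation})   # 'Observe': feed the output back
    return None
```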
In contrast to previous work that relied on complex engineering, this research emphasizes simplicity. By leveraging techniques like expanded toolsets, structured output, and multiple attempts per task, the team was able to saturate the InterCode-CTF benchmark. The study points out, “Our simple ReAct@10 design outperforms EnIGMA’s advanced harness, which reached 72%.”
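The “multiple attempts per task” idea (the “@10” in ReAct@10) can be sketched as a simple retry wrapper. Here `solve_task` stands in for one full agent episode from the sketch above, and the flag comparison is a simplification of how CTF tasks are scored; the helper names are hypothetical.

```python
# Sketch of the multiple-attempts idea behind "ReAct@10": a task counts as
# solved if any of k independent runs recovers the correct flag.
# solve_task and the grading check are illustrative placeholders.
def solve_with_retries(task_description: str, expected_flag: str, k: int = 10) -> bool:
    for attempt in range(k):
        flag = solve_task(task_description)   # one full ReAct episode, ideally in a fresh container
        if flag == expected_flag:
            return True
    return False
```

Each attempt would presumably start from a fresh environment so that a failed run cannot contaminate later ones.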
Key modifications included:
- Allowing agents to plan mid-task to reassess strategies.
- Expanding execution environments with pre-installed tools and Python packages.
- Prohibiting interactive tools to enhance reliability (a sketch of these execution constraints follows the list).
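A rough sketch of how the last two points might be enforced, assuming a container image with common CTF tooling; the tool list and the interactive-command blocklist below are illustrative assumptions, not the study’s actual configuration.

```python
# Sketch of the execution-environment constraints described above: verify that
# common CTF tools are pre-installed, and refuse commands that would open an
# interactive session the agent cannot drive reliably. Lists are illustrative.
import shutil

REQUIRED_TOOLS = ["nmap", "binwalk", "tshark", "sqlmap", "radare2"]      # hypothetical preinstalled toolset
INTERACTIVE_BLOCKLIST = ("vim", "nano", "telnet", "ssh", "python3 -i")   # hypothetical blocklist

def check_toolset() -> list[str]:
    """Return any required tools missing from the container image."""
    return [tool for tool in REQUIRED_TOOLS if shutil.which(tool) is None]

def is_allowed(cmd: str) -> bool:
    """Reject commands that would start an interactive session."""
    return not any(cmd.strip().startswith(bad) for bad in INTERACTIVE_BLOCKLIST)
```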
The implications of this research extend far beyond InterCode-CTF. It underscores the untapped potential of LLMs in offensive security, raising critical questions about their application to real-world systems. The researchers acknowledge the broader risks, citing concerns from OpenAI and governments worldwide about the rapid development of AI in cybersecurity. “Advanced LLMs could hack real-world systems at speeds far exceeding human capabilities,” the study warns.
Having effectively saturated InterCode-CTF, the researchers advocate for more challenging benchmarks like NYU-CTF and Cybench to further assess AI capabilities. As they conclude, “Future AI risk gauging work will need to use harder problem sets like NYU-CTF, 3CB, and HackTheBox to track the performance trends.”
For more information, the full study and code are available on GitHub.