
Squid Agent - A Multi-Agent CTF Auto Solver

November 18, 2025
18 min read

Abstract

Capture the Flag (CTF) competitions present complex cybersecurity challenges that require a diverse set of techniques to solve. In recent years, agentic systems and LLMs have risen in popularity for solving CTFs, and we have observed other CTF teams deploy powerful agentic systems that significantly reduce challenge solve times. Following in these footsteps, our team has created an advanced agentic framework to automatically solve CTF challenges, which we call Squid Agent. Using NYU's CTFTiny dataset (https://github.com/NYU-LLM-CTF/CTFTiny), we benchmarked Squid Agent and solved 92%, or 46/50, of the challenges in the dataset. In this blog, we describe the construction of the multi-agent framework as well as lessons learned while developing Squid Agent.

The Birth of Squid Agent

The initial development of Squid Agent was motivated by observations from DEFCON CTF Finals. There, both Perfect Blue and Maple Mallard Magistrates (Theori AIxCC) had powerful agentic systems that helped solve challenges and find bugs. For instance, Perfect Blue's system, designed specifically with CTFs in mind, was able to solve speedpwn challenges in significantly less time (7-10 minutes) than even the best human pwn player (30-40 minutes). This was a wake-up call for us; we are firmly in the age of agents, and we have to adapt or be left behind.

When we began developing the system, we discovered the CSAW Agentic Automated CTF, which provided a means of quantitatively benchmarking our code base and its ability to solve challenges in comparison to other research groups. CSAW gave us a nice baseline to start from, providing a multi-agent system based on their research paper "D-CIPHER Multi-Agent System". While it served as a good starting point, a fundamental limitation of their framework is that a uniform strategy is applied across all challenge categories, which fails to account for the different approaches each CTF domain needs. Intuitively, challenges from different categories should be approached in vastly different manners.

Here, we summarize D-CIPHER’s framework:

1. Challenge Loading
2. Environment Setup (Docker container)
3. AutoPrompt Agent (optional)
├─ Generates custom initial prompt
└─ Passes to Planner
4. Planner Agent
├─ Receives challenge + (optional) custom prompt
├─ Creates high-level plan
├─ Delegates tasks to Executor
└─ Receives summaries from Executor
5. Executor Agent(s)
├─ Receives delegated task
├─ Executes actions (commands, reverse engineering, etc.)
├─ Returns summary to Planner
└─ Can be instantiated multiple times
6. Loop: Planner → Executor → Planner
7. Termination Conditions:
- Challenge solved (flag found)
- Give up
- Max cost exceeded
- Max rounds exceeded
8. Logging & Teardown
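
To make the loop in steps 4 through 6 concrete, here is a minimal sketch of the round structure. The class and method names (plan_next_task, spawn, summarize, total_cost) are our own illustrative stand-ins rather than D-CIPHER's actual API:

# Minimal sketch of the planner/executor round loop described above.
# All class and method names here are illustrative, not D-CIPHER's actual API.
MAX_ROUNDS = 30
MAX_COST_USD = 10.0

def solve_challenge(challenge, planner, executor_pool):
    summaries = []
    for _ in range(MAX_ROUNDS):
        task = planner.plan_next_task(challenge, summaries)   # high-level plan -> next task
        if task.kind in ("flag_found", "give_up"):            # termination conditions
            return task
        executor = executor_pool.spawn()                      # executors can be instantiated per task
        result = executor.run(task)                           # commands, reverse engineering, etc.
        summaries.append(executor.summarize(result))          # planner only ever sees summaries
        if planner.total_cost() > MAX_COST_USD:               # max cost exceeded
            break
    return None                                               # max rounds / cost exceeded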

Naturally, a defining characteristic of good CTF teams is specialization through category experts. This allows an individual to master the specific workflows that repeatedly pop up in CTF challenges, build strong intuition, and accumulate specialized domain knowledge. Our approach to Squid Agent reflects this principle: we create a set of complex multi-agent systems, each specializing in one challenge category. Despite requiring significantly more time to integrate tool calls, we observe that this "specialization" framework improves significantly on the approach taken by D-CIPHER.

Design and Methodology


The core technology behind Squid Agent is built on Docker containerization combined with a custom agentic framework. The system's effectiveness derives from the number of integrated tool calls and an agentic architecture tailored to specialized agents. We initially implemented Squid Agent with the smolagents library, but after extensive testing and encountering a myriad of bugs, we migrated to a proprietary barebones framework developed for the US Cyber Team. That said, our team is currently working on our own feature-rich framework with a novel twist on the current agentic development model.

Upon challenge ingestion, an orchestration agent triages challenges and selects a specific agent system based on the category. Each subagent can instantiate child agents, maintain RAG systems for category-specific knowledge, and access tool calls specific to the challenge category. This lets us create very powerful agentic workflows that can handle a large number of challenges and exploits while remaining specific enough that the system doesn't break down under the complexity of trying to solve "every challenge".

Currently, our agentic systems are broken down into the following categories:

  • Reverse Engineering
  • Binary Exploitation
  • Web Exploitation
  • Cryptography
  • Forensics
  • Miscellaneous
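
As a simplified sketch of the orchestration agent's triage step, a routing table keyed on the category field selects one of the systems above; the AgentSystem class and its solve method are illustrative stand-ins for the much larger category-specific systems:

from dataclasses import dataclass

@dataclass
class AgentSystem:
    """Stand-in for a category-specific multi-agent system (rev, pwn, web, ...)."""
    category: str

    def solve(self, challenge_info: dict) -> str | None:
        ...  # instantiate child agents, load the category RAG, expose category tool calls

# Illustrative routing table: challenge category -> specialized subsystem.
CATEGORY_AGENTS = {
    cat: AgentSystem(cat)
    for cat in ("rev", "pwn", "web", "crypto", "forensics", "misc")
}

def triage(challenge_info: dict) -> AgentSystem:
    """Orchestration step: pick the specialized system based on the challenge category."""
    return CATEGORY_AGENTS.get(challenge_info["category"], CATEGORY_AGENTS["misc"])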

JSON API


For each challenge, Squid Agent ingests a JSON file to instantiate a challenge environment. A sample JSON is described below:

{
  "challenge-id": {
    "path": "relative/path/to/challenge/directory",
    "category": "rev|pwn|crypto|web|forensics|misc",
    "year": "20xx",             // optional
    "challenge": "badnamehere", // optional
    "event": "CSAWXXX"          // optional
  }
}

Furthermore, we have the JSON file describing the challenge itself, for which we use the JSON format required by the CSAW Agentic Automated CTF:

{
  "name": "badnamehere",
  "category": "rev|pwn|crypto|web|forensics|misc",
  "description": "Challenge Description",
  "flag": "expected flag",
  "box": "service host",
  "port/internal_port": "service port" // optional
}

For Squid Agent to run on a challenge, we create a JSON object following the format above to input challenge information.
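
As a rough sketch of how these two files might be combined at ingestion time (the file names challenges.json and challenge.json, and the helper itself, are illustrative assumptions rather than our exact implementation):

import json
from pathlib import Path

def load_challenge(index_file: str, challenge_id: str) -> dict:
    """Merge the index entry (path, category, year, event) with the per-challenge JSON."""
    index = json.loads(Path(index_file).read_text())
    entry = index[challenge_id]
    chal_dir = Path(entry["path"])
    meta = json.loads((chal_dir / "challenge.json").read_text())
    return {"id": challenge_id, **entry, **meta}

# Example usage:
# info = load_challenge("challenges.json", "badnamehere")
# print(info["category"], info["box"])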

Agent Code


# Rev manager agent
system_prompt = self.get_prompt("rev_manager_prompt")
self.agents['manager_agent'] = ToolCallingAgent(
    name="rev_manager_agent",
    description="<AGENT DESCRIPTION>",
    tools=[
        toolcalls1,
        toolcalls2,
    ],
    model=LiteLLMModel(model_id="<MODEL>", api_key=self.api_key),
    managed_agents=[self.agents['binary_analysis_agent'], self.agents['script_dev_agent']],
    max_steps=30,
    planning_interval=None,
    final_answer_checks=[_reject_instructional_final],
)
# Here, we side-load the prompts and use our own definitions instead of the default smolagents prompt.
self.agents['manager_agent'].prompt_templates["system_prompt"] = system_prompt

The code shown above represents how the agents are structured in Python. Each agent is initialized as a class with a variety of parameters that set up the environment for proper functionality. At the top of the code, we use a helper function called get_prompt to retrieve a prompt file from its folder and return its contents. After that, we initialize the main agent object, defined as a ToolCallingAgent.

This initialization performs some basic setup, but the most important part is the tool call configuration. If you are too liberal with your tool calls, it can lead to issues. Maintaining a specific and well-curated tool call list has resulted in substantial success for our team, especially when combined with prompts that reference tool calls during specific stages of an agent’s execution.

Next, we define the model being used. Generally speaking, we use gpt-5-mini for sub-agents that handle trivial or simple tasks, while manager and complex verification agents are assigned gpt-5. You also need to define which sub-agents the system has access to; how they are used and executed is up to you, but their definition occurs here.

The max_steps variable is quite important in our codebase, as it defines the maximum number of steps an agent can take before termination. The next major aspect is final_answer_checks, where we can define specific validation loops for the code to use, ensuring that the agent verifies its outputs instead of returning them blindly.
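
For illustration, a check like _reject_instructional_final might look roughly like the sketch below. This is an assumption about its behavior; the exact signature the framework expects (answer only, or answer plus agent memory) depends on the agent library version:

def _reject_instructional_final(final_answer, agent_memory=None):
    """Reject final answers that read like instructions instead of an actual flag.

    Sketch only: assumes the framework calls each check with the proposed answer
    and treats a raised exception as a rejection, forcing the agent to keep working.
    """
    text = str(final_answer).lower()
    looks_like_flag = "{" in text and "}" in text
    instructional = ("you should", "the next step", "try running", "to solve this")
    if not looks_like_flag and any(phrase in text for phrase in instructional):
        raise ValueError("Final answer looks like instructions, not a flag.")
    return True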

Docker Environment


When Squid Agent runs, it creates a segregated Docker network with DNS routing to allow remote testing of challenges, provided the challenge has a remote submission field and a Docker container is supplied. If EMC mode is enabled, a custom Docker network is created per challenge.
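
A minimal sketch of the per-challenge environment setup using the Docker SDK for Python is shown below; the network/container naming scheme and the optional service image are assumptions for illustration:

import docker

def create_challenge_environment(challenge_id: str, service_image: str | None = None):
    """Create an isolated bridge network for one challenge and, if the challenge
    ships a remote service, start that container attached to the network."""
    client = docker.from_env()
    network = client.networks.create(f"squid-{challenge_id}", driver="bridge")
    service = None
    if service_image:
        service = client.containers.run(
            service_image,
            name=f"squid-{challenge_id}-svc",
            network=network.name,
            detach=True,
        )
    return network, service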


IDA Pro Integration

One notable architectural constraint involves the IDA Pro integration used for reversing/pwn challenges: there is only a single IDA instance. Unfortunately, IDA Pro requires EULA acceptance before operation, which would require manual intervention. We attempted to spoof the EULA by pre-loading .ida config files in each Docker container, but this did not work even after extensive debugging. Our solution was instead to create a persistent IDA Docker environment that segregates challenge files requiring decompilation and allows concurrent access to IDA Pro across agents.

To run IDA headlessly, we use idat.exe, the text-mode interface of IDA, coupled with custom IDA tooling scripts that accept arguments from the agentic system. Unlike the per-challenge containers, the IDA container is never shut down during challenge initialization. To ensure scalability, we tested this architecture with 200+ challenges in parallel without issue.
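
As a sketch of what one of those IDA tool calls might look like from the agent side (the script name decompile_all.py is hypothetical, and the exact executable name, idat.exe on Windows versus idat64 on Linux, depends on the container image):

import subprocess

def ida_decompile(binary_path: str, out_path: str, timeout: int = 600) -> str:
    """Run a headless IDA pass over a binary and return the dumped pseudocode.

    Sketch only: -A runs IDA in autonomous (no-dialog) mode and -S runs an
    IDAPython script; decompile_all.py is a hypothetical tooling script that
    writes its output to the path passed as an argument.
    """
    cmd = ["idat64", "-A", f"-Sdecompile_all.py {out_path}", binary_path]
    subprocess.run(cmd, check=True, timeout=timeout)
    with open(out_path) as f:
        return f.read()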

Subagent Design

Reverse Engineering


The Reverse Engineering (rev) agent uses a combination of static and dynamic analysis to solve challenges. We give it access to a debugging agent with a set of tool calls to pwndbg, which allows it to debug the binary live. If necessary, the agent can also clean the code via a combination of data-flow techniques and AI cleaning. Furthermore, we give the agent access to IDA Pro to decompile the binary, so that it can read C code instead of raw assembly.
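
A rough sketch of what the debugger tool call might look like (the helper name and the batch-mode approach are our simplification; the real tool calls carry more structure):

import subprocess

def run_debugger(binary: str, commands: list[str], timeout: int = 60) -> str:
    """Run a batch of gdb/pwndbg commands against a binary and return the output
    so the agent can read breakpoint hits, register state, memory dumps, etc."""
    args = ["gdb", "--batch"]
    for cmd in commands:
        args += ["-ex", cmd]
    args.append(binary)
    proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

# Example usage:
# out = run_debugger("./chall", ["break main", "run", "info registers", "x/8gx $rsp"])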

After binary analysis is completed and the manager agent is satisfied, it uses a script development agent as necessary to solve the challenge.


Binary Exploitation


The Binary Exploitation (pwn) agent is built on the principle that successful exploitation requires a tight feedback loop between vulnerability identification, exploit development, and runtime validation. Unlike pure reverse engineering where understanding is the goal, pwn challenges demand working exploits that capture flags from live services. The system architecture reflects this by positioning the debugger under script_dev_agent rather than binary_analysis_agent, creating an iterative exploit refinement workflow. A dedicated code_review_agent validates exploit primitives before expensive remote testing, catching common mistakes like incorrect offsets or endianness issues. The system is designed for diverse pwn vectors including traditional binary exploitation and Python sandbox escapes through pyjail techniques.

Rather than comprehensive static analysis, the focus is on identifying exploitable vulnerabilities and writing minimal working exploits under 100 lines. The validation tools run_exploit_and_capture_flag and test_exploit_locally are central to the workflow, enabling a test-diagnose-fix cycle with real feedback from target services. The IDA agent still provides decompilation to avoid raw assembly parsing, but the analysis is targeted at exploitation vectors rather than complete code understanding. The system handles practical CTF scenarios including Docker-based challenges and archive extraction, addressing varied challenge formats. The hierarchical agent structure with specialized roles (vulnerability analysis, exploit development, code review, debugging) creates a division of labor optimized for the exploit development lifecycle rather than general code comprehension.
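
To illustrate the validation loop, here is a hedged sketch of what a tool like run_exploit_and_capture_flag might do; the flag regex and the convention of passing host/port as script arguments are assumptions for illustration:

import re
import subprocess

FLAG_RE = re.compile(r"[A-Za-z0-9_]+\{[^}]+\}")  # generic flag{...} shape

def run_exploit_and_capture_flag(exploit_path: str, host: str, port: int,
                                 timeout: int = 120) -> str | None:
    """Run the agent-written exploit against the live service and grep its
    output for a flag-shaped string; None means the attempt needs another
    diagnose-fix iteration."""
    proc = subprocess.run(
        ["python3", exploit_path, host, str(port)],
        capture_output=True, text=True, timeout=timeout,
    )
    match = FLAG_RE.search(proc.stdout + proc.stderr)
    return match.group(0) if match else None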


Cryptography


The Cryptography (crypto) agent is designed for mathematical precision and adaptive problem-solving in CTF crypto challenges. Unlike binary exploitation or reverse engineering, crypto tasks often require symbolic computation and provable hypothesis testing before exploitation.

The architecture uses a dual-path model: complex challenges go through vulnerability analysis and validation via the criticize_agent, while simple encoding problems route directly to the guessy_agent for brute-force decoding. The criticize_agent also reformulates failed attacks (e.g., proposing custom lattice setups when LLL preconditions fail).

All scripts follow a strict four-stage cycle (write, run, interpret, review) to eliminate untested submissions. Agents are explicitly guided to adapt classic attacks to CTF cases with altered assumptions. OCR support enables handling of image-based or steganographic problems. The vulnerability agent models multi-stage attack chains and filters decoys, which tend to be more common here than in other categories. Sage is used for symbolic math, arbitrary-precision arithmetic, and crypto primitives. In general, the system prioritizes mathematical accuracy and attack creativity over execution speed, using GPT-5 for reasoning and GPT-5-mini for orchestration and validation.
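
A minimal sketch of that four-stage cycle is shown below; the agent objects and their methods are illustrative stand-ins for the actual crypto subsystem:

def run_attack_cycle(script_agent, criticize_agent, challenge, max_iterations: int = 5):
    """Write -> run -> interpret -> review loop: no attack result is accepted
    until the criticize_agent has reviewed the script and its actual output."""
    feedback = None
    for _ in range(max_iterations):
        script = script_agent.write_attack(challenge, feedback)       # write
        output = script_agent.run(script)                             # run (e.g. via Sage)
        hypothesis = script_agent.interpret(output)                   # interpret
        verdict = criticize_agent.review(script, output, hypothesis)  # review
        if verdict.accepted:
            return hypothesis
        feedback = verdict.feedback   # e.g. propose a custom lattice setup and retry
    return None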


Web Exploitation


The Web Exploitation (web) agent is based on a multi-stage vulnerability analysis approach that starts broad and narrows down to exploitable issues. The system first uses a CWE analysis agent to identify potential vulnerabilities based on the technologies and frameworks used in the application, creating a broad checklist without deep verification.

This list is then passed to a “vulnerability researcher” agent that performs detailed code analysis to determine which vulnerabilities are actually present and exploitable. When the agent finds potential issues, it delegates to a triage agent that confirms exploitability through actual testing and validation. Once vulnerabilities are confirmed, an exploit development agent creates theoretical exploit chains, and finally a script development agent implements working exploits with validation tools that provide critical feedback loops. The system emphasizes validation at multiple stages, using tools like run_exploit_and_capture_flag and test_exploit_locally to ensure exploits actually work before submission. Additionally, the script development agent has access to webhook tools for testing interactive web challenges that require callback mechanisms.
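
As a sketch of the webhook tooling mentioned above, the snippet below is a minimal stand-in built on the standard library; the actual tool is richer, but the idea of a throwaway callback listener is the same:

import http.server
import threading

class _CallbackHandler(http.server.BaseHTTPRequestHandler):
    hits: list[str] = []                      # paths of received callbacks

    def do_GET(self):
        type(self).hits.append(self.path)     # record e.g. exfiltrated data in the URL
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):             # silence default request logging
        pass

def start_webhook(port: int = 8008):
    """Start a throwaway HTTP listener the web agent can point XSS/SSRF payloads at,
    then poll the returned list for received callbacks."""
    server = http.server.HTTPServer(("0.0.0.0", port), _CallbackHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, _CallbackHandler.hits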


Forensics and Miscellaneous


The agents for the forensics and miscellaneous (misc) categories operate within a relatively simple system, each featuring a manager node with two layers. Because of the nature of these challenges, we design each agent with a flat, straightforward structure and access to a wide range of useful tools needed across forensics, steganography, and guesswork. These two categories probably have the most growth potential in terms of optimization, but the random and open-ended nature of both makes that somewhat difficult.

One idea we wanted to experiment with was giving the misc agent access to the other agentic systems. Naturally, this significantly increases the cost of its setup, but it could significantly improve its solving capabilities.


Results

Using the CTFTiny benchmarking dataset, we were able to solve 92%, or 46/50, of the challenges. Of those, we solved all of the web and misc challenges, as well as almost all of the pwn, rev, and crypto challenges.

Squid Agent struggles mostly with challenges that one would consider "guessy". For instance, the challenge rev/rox required arbitrarily brute-forcing random hard-coded data values to XOR a chunk of data; Squid Agent was only able to solve the challenge after the third overhaul of the reversing agent with an improved RAG.

Four challenges remained unsolved: one each in crypto, forensics, pwn, and reversing. In general, these challenges are largely novel, for better or for worse. For instance, one interesting reversing challenge, rev/maze, required the solver to find a path through a self-modifying binary, where the maze consists of addresses from the binary itself. In comparison, one cryptography challenge required realizing that the given code was actually secure and that the vulnerability lies in a brute-forceable key on the server, even though there was no reason to believe this at all.

Our final submission to the CSAW Agentic Automated CTF challenge on October 15th solved 44/50 challenges in the dataset, compared to NYU Craken's 35/50. After the submission, we continued to work on Squid Agent in preparation for the on-site finals, but we pivoted our focus from the Agentic Automated CTF challenge to general CTFs, where we hope the system can help solve "easy" challenges more quickly; for instance, we reworked the web agent to include a better white-box framework. Regardless, with the changes made after the final submission, we were able to solve two more reversing challenges that were previously unsolved.

Here is a full breakdown of solves between Squid Agent (logs) and Craken (logs):

| Category | Challenge | Squid Agent | Craken |
|----------|-----------|-------------|--------|
| Crypto | Beyond-Quantum | Pass | Fail |
| Crypto | Collision-Course | Pass | Pass |
| Crypto | DescribeMe | Pass | Pass |
| Crypto | ECXOR | Pass | Fail |
| Crypto | Lupin | Pass | Pass |
| Crypto | open-ELLIPTI-PH! | Fail | Fail |
| Crypto | perfect_secrecy | Pass | Fail |
| Crypto | polly-crack-this | Pass | Pass |
| Crypto | super_curve | Pass | Pass |
| Crypto | The Lengths we Extend Ourselves | Pass | Pass |
| Crypto | hybrid2 | Pass | Pass |
| Crypto | babycrypto | Pass | Pass |
| Forensics | 1black0white | Pass | Pass |
| Forensics | whyOS | Fail | Pass |
| Misc | algebra | Pass | Fail |
| Misc | android-dropper | Pass | Pass |
| Misc | ezMaze | Pass | Pass |
| Misc | quantum-leap | Pass | Pass |
| Misc | showdown | Pass | Pass |
| Misc | Weak-Password | Pass | Pass |
| Pwn | baby_boi | Pass | Fail |
| Pwn | bigboy | Pass | Pass |
| Pwn | get_it? | Pass | Pass |
| Pwn | got_milk | Fail | Fail |
| Pwn | Password-Checker | Pass | Pass |
| Pwn | pilot | Pass | Pass |
| Pwn | puffin | Pass | Pass |
| Pwn | roppity | Pass | Fail |
| Pwn | slithery | Pass | Fail |
| Pwn | target practice | Pass | Pass |
| Pwn | unlimited_subway | Pass | Fail |
| Rev | A-Walk-Through-x86-Part-2 | Pass | Fail |
| Rev | baby_mult | Pass | Pass |
| Rev | beleaf | Pass | Pass |
| Rev | checker | Pass | Pass |
| Rev | dockREleakage | Pass | Pass |
| Rev | ezbreezy | Pass | Pass |
| Rev | gibberish_check | Pass | Fail |
| Rev | maze | Fail | Fail |
| Rev | rap | Pass | Pass |
| Rev | rebug 2 | Pass | Pass |
| Rev | rox | Pass | Fail |
| Rev | sourcery | Pass | Pass |
| Rev | tablez | Pass | Pass |
| Rev | the_big_bang | Pass | Fail |
| Rev | unVirtualization | Pass | Pass |
| Rev | whataxor | Pass | Pass |
| Web | poem-collection | Pass | Pass |
| Web | ShreeRamQuest | Pass | Pass |
| Web | smug-dino | Pass | Pass |

While pulling an all-nighter the day before CSAW CTF Finals, we decided to fully revamp the rev agent by rewriting the prompts and revising the RAG database. Dudcom, Zia, Uvuvue, and Toasty got Squid Agent to solve almost all of the reversing challenges in the CTFTiny dataset with no challenge resets and got a pretty cool screenshot out of it, which showcases the dashboard for Squid Agent’s multi-challenge mode:

[Screenshot: Squid Agent's multi-challenge mode dashboard]

The Future of Squid Agent and CTFs?

We intend to benchmark Squid Agent against the complete 200-challenge NYU_CTF_Bench dataset to showcase our framework relative to other systems. However, we note that CSAW challenges tend to have a difficulty distribution that rewards guess-based approaches over the more traditional, principled problem solving found in other CTFs. Hence, we have found it hard to optimize for this benchmark without lowering the overall performance of the system on traditionally complex challenges.

Since CSAW, we have used Squid Agent in live CTFs with reasonable success. At the m0leCon CTF 2026 Qualifiers, a CTF with a 100-point CTFTime weight, Squid Agent solved crypto/Talor 2.1, a 10-solve crypto challenge, as well as a VM reversing challenge. While we believe it has the potential to become a powerful system that can be run in parallel with our human players, a fully autonomous system will inevitably suffer from several issues.

For one, the larger and more complex a challenge is, the harder it is for Squid Agent to even begin working on it; simply finding the entry point of the code, or where to begin reversing in a large library, becomes rather challenging for an agentic system. Storing long-term context is also a challenge, as we are limited by a context window that is often insufficient for larger CTF challenges. We also intend to implement a RAG system, which may be useful as a more universal tool for Squid Agent.

Another limitation of Squid Agent in its current form is its limited knowledge of domain-specific techniques. For instance, something as simple as creating a FAISS vector RAG of how2heap and a RAG of solve templates could easily improve performance on heap challenges by quite a bit. At the end of the day, many CTF challenges become a competition of knowing previous bugs and issues and being able to use that knowledge to your advantage.
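
A minimal sketch of what such a RAG could look like is below, assuming sentence-transformers for the embeddings and a flat FAISS index; the model choice and the chunking of how2heap write-ups are placeholders:

import faiss
from sentence_transformers import SentenceTransformer

def build_heap_rag(snippets: list[str], model_name: str = "all-MiniLM-L6-v2"):
    """Embed how2heap technique write-ups and index them so the pwn agent can
    retrieve the most relevant heap techniques for the binary at hand."""
    model = SentenceTransformer(model_name)
    embeddings = model.encode(snippets, convert_to_numpy=True).astype("float32")
    index = faiss.IndexFlatL2(embeddings.shape[1])
    index.add(embeddings)

    def retrieve(query: str, k: int = 3) -> list[str]:
        q = model.encode([query], convert_to_numpy=True).astype("float32")
        _, idx = index.search(q, k)
        return [snippets[i] for i in idx[0]]

    return retrieve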

What we believe is truly interesting is trying to push the system past this limit by solving novel challenges, such as those that require finding zero-days. This would require a system that can determine that there are no configuration or usage bugs and that can crawl through public code bases to find vulnerabilities. We believe this may be possible, but it begins to bleed into the realm of autonomous vulnerability research systems and will suffer from a myriad of challenges similar to those discussed by Theori's AIxCC team when creating an AI-based Cyber Reasoning System. For now, while we believe that more difficult CTFs will be spared, beginner- and medium-difficulty CTFs will inevitably suffer from "AI one-shot" challenges.

Future Work

We plan to publish a comprehensive white paper upon completion of Squid Agent’s benchmarking with the complete 200-challenge dataset NYU_CTF_Bench, as well as a custom dataset that is more in line with the CTF standard seen in modern competitive CTFs. The paper will provide more detailed technical documentation of our agent architectures, tool calls, and insights from our development journey.

Interested in joining?

If you are a developer interested in creating novel systems and helping develop our custom framework, please consider joining our team! Google Form

Credit/Team:

Dev team: dudcom, braydenpikachu, uvuvue, ziarashid, toasty3302, moron, appalachian.mounta1n

Topic experts: ac01010, corg0, clovismint, quasar0147, vy