> If anything goes wrong it’s a Microsoft Azure machine somewhere that’s burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository.
Isn't that risking getting banned from Azure? The compromised agent might not accomplish anything useful, but its attempts might get (correctly!) flagged by the cloud provider.
My guess is that most cloud providers have procedures in place to help avoid banning legitimate customers because one of their instances got infected with malware (effectively what a rogue agent would be).
One important issue with agentic loops is that agents are lazy, so you need some sort of retrigger mechanism. Claude Code supports hooks: you can wire the agent's Stop hook to a local LLM, feed the context in, and ask that model to prompt Claude to continue if needed. It works pretty well; Claude can override retriggers if it's REALLY sure it's done.
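A minimal sketch of that Stop-hook retrigger, assuming the stdin/stdout JSON contract described in Anthropic's hooks docs (fields like `transcript_path` and `stop_hook_active`, and a `{"decision": "block"}` response; verify against your Claude Code version), with the local-model call left as a stub:

```python
#!/usr/bin/env python3
"""Sketch of a Claude Code Stop hook that asks a local model whether to retrigger.

Assumptions, not verbatim from Anthropic's docs: the hook receives JSON on stdin
(including transcript_path and stop_hook_active), and printing
{"decision": "block", "reason": ...} asks Claude to keep going.
"""
import json
import sys


def ask_local_llm(transcript_tail: str) -> str:
    """Stub for whatever local model you run (Ollama, llama.cpp, ...).

    Should return "CONTINUE: <instruction>" if the work looks unfinished,
    or "DONE" otherwise. Returning "DONE" here keeps the sketch runnable.
    """
    return "DONE"


def main() -> None:
    event = json.load(sys.stdin)

    # Avoid infinite retrigger loops: if we already blocked once, let it stop.
    if event.get("stop_hook_active"):
        sys.exit(0)

    with open(event["transcript_path"]) as f:
        tail = f.read()[-20_000:]  # last ~20k chars of the session transcript

    verdict = ask_local_llm(tail)
    if verdict.startswith("CONTINUE"):
        print(json.dumps({
            "decision": "block",  # i.e. "don't stop yet"
            "reason": verdict.partition(":")[2].strip() or "Keep going.",
        }))
    sys.exit(0)


if __name__ == "__main__":
    main()
```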
Regarding sandboxing, VMs are the way. Prompt injected agents WILL be able to escape containers 100%.
My mental model of container escapes is that they are security bugs which get patched when they are reported, and none of the mainstream, actively maintained container platforms currently have a known open escape bug.
That's going a bit far; a good mental model is that every kernel LPE is a sandbox escape (that's not precisely true but is to a first approximation), and kernel LPEs are pretty routine and rarely widely reported.
A good heuristic would be that unless you have reason to think you're a target, containers are a safe bet. A motivated attacker probably can pop most container configurations. Also, it can be counterintuitive what makes you a target:
* Large-scale cotenant work? Your target quotient is the sum of those of all your clients.
* Sharing infrastructure (including code supply chains) with people who are targeted? Similar story.
But for people just using Claude in YOLO mode, "security" is not really a top-of-mind concern for me, so much as "wrecking my dev machine".
Of the big three cloud providers, only GCP uses containers for customer isolation, and they do so with the supervision of gVisor. It’s certainly possible to do container isolation securely, but it takes extra steps and know-how, and I don’t think anyone is even considering using gVisor or similar for the type of developer workflows being discussed here.
AWS and Azure both use VM-level isolation. Cloudflare uses V8 isolates which are neither container nor VM. Fly uses firecracker, right?
This topic is kind of unnecessary for the type of developer workflows being discussed that the majority of readers of this article are doing, though. The primary concern here is “oops the agent tried to run ‘rm -rf /‘“, not the agent trying to exploit a container escape. And for anyone who is building something that requires a better security model, I’d hope they have better resources to guide them than the two sentences in this article about prompt injection.
What scares me most is what happens when some attacker attempts to deploy a "steal all environment variable credentials and crypto wallets" prompt injection attack in a way that is likely to affect thousands or millions of coding agent users.
This is not speculative, it's happened plenty already. People put mitigations in place, patch libraries and move on. The difference is that agents will find new zero days you've never heard of, in stuff on your system that people haven't scrutinized adequately. There will be zero advance notice, and unlike human attackers who need to lie low until they can plan an exit, it'll be able to exploit you heavily right away.
Do not take the security impact of agents lightly!
I feel like my bona fides on this topic are pretty solid (without getting into my background on container vs. VM vs. runtime isolation) and: "the agents will find new zero days" also seems "big if true". I point `claude` at a shell inside a container and tell it "go find a zero day that breaks me out of this container", and you think I'm going to succeed at that?
I had assumed you were saying something more like "any attacker that prompt-injects you probably has a container escape in their back pocket they'll just stage through the prompt injection vector", but you apparently meant something way further out.
I know at least one person who supplements their income finding bounties with Claude Code.
Right now you can prompt inject an obfuscated payload that tricks Claude into trying to root a system, under the premise that you're trying to identify an attack vector on a test system to understand how you were compromised. It's not good enough to do much, but with the right prompts, better models, and a way to smuggle extra code in, you could get quite far.
Lots of people find zero days with Claude Code. That is not the same thing as Claude Code autonomously finding zero days without direction, which was what you implied. This seems like a pretty simple thing to go empirically verify for yourself. Just boot up Claude and tell it to break out of a container shell. I'll wait here for your zero day! :)
A little off-topic but I’ve been waiting for a post by someone who’s built something like Claude Code or Codex that can explain how the state machine works for execution.
Clearly it’s not just a good system prompt and hoping for the best. Obviously they’ve both got fairly good algorithms for working toward whatever the user asked for incrementally, handling expectable errors, changing course if necessary, etc.
I mean ultimately it’s gotta be a really big flow chart that results in the LLM getting prompted various ways in all kinds of directions before ultimately converging on the final response, right? Or am I way off-base?
Has anyone seen such a post explaining this architecture / state machine / algorithm?
You'd be surprised. You don't need a state machine to get started, only if you want to get really advanced. The models are smart. It's enough to just ask them to send you instructions, execute them and send back the results.
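For anyone who wants to see how little is needed, here's a sketch of that "send instructions, execute them, send back the results" loop. The OpenAI chat.completions usage is standard, while the model name and the fenced-bash convention are assumptions of mine, and anything that runs model-authored scripts like this belongs inside a container or sandbox:

```python
"""A deliberately tiny "send me commands, I run them, here's the output" loop."""
import re
import subprocess

from openai import OpenAI

client = OpenAI()

SYSTEM = (
    "You are a coding agent working toward the user's goal. Reply with ONE "
    "```bash fenced block to execute next, or the single word DONE when you "
    "are finished. I will run each block and send you back stdout/stderr."
)


def run_mission(goal: str, max_steps: int = 20) -> None:
    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = client.chat.completions.create(
            model="gpt-5",  # placeholder model name
            messages=messages,
        ).choices[0].message.content
        messages.append({"role": "assistant", "content": reply})

        if reply.strip() == "DONE":
            break
        match = re.search(r"```(?:bash|sh)?\n(.*?)```", reply, re.S)
        if not match:
            messages.append({"role": "user",
                             "content": "No bash block found; send one, or DONE."})
            continue

        # Run the model's script (ideally inside a container, not on the host).
        try:
            result = subprocess.run(["bash", "-c", match.group(1)],
                                    capture_output=True, text=True, timeout=300)
            feedback = (f"exit={result.returncode}\n"
                        f"stdout:\n{result.stdout[-4000:]}\n"
                        f"stderr:\n{result.stderr[-4000:]}")
        except subprocess.TimeoutExpired:
            feedback = "Command timed out after 300 seconds."
        messages.append({"role": "user", "content": feedback})
```

That single function plus a system prompt is enough to reproduce the behavior described further down this thread; everything else (spec review passes, checkpoints, parallel missions) is layered on top.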
I just came round to using that term a couple of weeks ago (see https://simonwillison.net/2025/Sep/18/agents/) - I held out for the longest time because it annoyed me that everyone had a different idea of what "agent" meant.
In this case I do think the term is justified - "agentic loop" is a good way of describing a loop with a bunch of tool calls driven by an LLM.
That indeed is the simplest definition for it. In practice, there can be parallel agent calls, asynchronous or synchronous result merger, and context compression or management. Also, there is no shame in adding manual LLM steps to an agentic workflow.
I'm saying there are grifters who co-opt the terms used by people actually pioneering / exploring the space, to sell BS; and that taints OP's view.
This shouldn't need stating, but just in case: I am not saying you are a grifter, obviously, because you substantiate your positions. Just another classic case of signal vs. noise.
In case anyone else is confused: I saw the word "loop" and wondered what the point of running an LLM in a loop even is. Like, what's the goal here? You want to write N different React todo web apps, or something?
After reading the article a bit, I realized this isn't about "loops" at all. What it is about is "bypassing the yes/no human check on each step the LLM wants to take in completing some multi-step task". You can already do this with, for example, Claude Code. However, one of the points of having a human approve LLM actions is to prevent it doing something stupid or destructive. So now you have a second problem: how to prevent the LLM from poking holes in spacetime?
I wouldn't be surprised if agents start getting managed by a distributed agentic system - think about it. Right now you get Codex/Claude/etc., and its system prompt and various other internally managed prompts are locked down to the version you downloaded. What if a distributed system ran experimental prompts, monitored the success rate (what code makes it into a commit), and provided feedback to the agent manager? That could help it automatically fine-tune its own prompts.
This is what Anthropic does for their "High Compute" SWE benchmarking:
"
For our "high compute" numbers we adopt additional complexity and parallel test-time compute as follows:
- We sample multiple parallel attempts.
- We discard patches that break the visible regression tests in the repository, similar to the rejection sampling approach adopted by Agentless (Xia et al. 2024); note no hidden test information is used.
- We then use an internal scoring model to select the best candidate from the remaining attempts.
- This results in a score of 82.0% for Sonnet 4.5.
"
But running in parallel means you use more compute, meaning that the cost is higher. Good results are worth paying for, but if a super recursive approach costs more and takes longer than a human...
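The shape of that high-compute recipe is simple enough to sketch. Here run_agent_attempt, visible_tests_pass and score_candidate are hypothetical stand-ins for your own agent runner, test harness and scoring model:

```python
"""Parallel attempts + rejection against visible tests + pick-best scoring."""
from concurrent.futures import ThreadPoolExecutor


def run_agent_attempt(task: str, seed: int) -> str:
    """Run one sandboxed agent on the task and return its patch/diff."""
    raise NotImplementedError


def visible_tests_pass(patch: str) -> bool:
    """Apply the patch in a scratch checkout and run the repo's visible tests."""
    raise NotImplementedError


def score_candidate(task: str, patch: str) -> float:
    """Ask a scoring model how likely this patch is to be correct."""
    raise NotImplementedError


def best_of_n(task: str, n: int = 8) -> str | None:
    with ThreadPoolExecutor(max_workers=n) as pool:
        patches = list(pool.map(lambda s: run_agent_attempt(task, s), range(n)))
    survivors = [p for p in patches if visible_tests_pass(p)]  # rejection step
    if not survivors:
        return None
    return max(survivors, key=lambda p: score_candidate(task, p))
```

As the parent notes, the cost scales linearly with n, so this only makes sense where a correct patch is worth several attempts' worth of tokens.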
For lightweight sandboxing on Linux you can use bubblewrap or firejail instead of Docker. They are faster and _simpler_. Here is a bwrap script I wrote an hour back to run Claude in a minimal sandbox:
Nice, thanks for sharing. The lack of an equivalent on macOS (sandbox-exec is similar but mostly undocumented and described as "deprecated" by Apple) is really frustrating.
I had been planning to explore Lima tonight as a mechanism to shackle CC on macOS.
The trouble with sandbox-exec is that its control over network access is not fine-grained enough, and I found its file system controls insufficient.
Also, I recently had some bad experiences which led me to believe the tool MUST be run with strict CPU and memory resource limits, which is tricky on macOS.
Wait, does Lima do isolation in a macOS context too?
It looks like Linux VMs, which Apple's container CLI (among others) covers at a basic level.
I'd like Apple to start providing macOS images that aren't the whole OS... unless sandbox-exec/libsandbox have affordances for something close enough?
You can basically ask Claude/ChatGPT to write its jail (Dockerfile) and then run that via `container` without installing anything on macOS outside the container it builds (IIRC). Even the container CLI will use a container to build your container...
Neat, I've not tried https://github.com/lima-vm/lima
There is an equivalent. I played with it for a while before switching to containers. You can just sign an app with sandbox entitlements that starts a subshell and uses security bookmarks to expose folders to it. It's all fully supported by Apple.
I would love to be able to use sandbox entitlements for this. I have so far been unable to figure out how.
It's not equivalent. You can restrict access while exposing select resources, but there's no bind mounting, no overlays, etc.
It's a very far cry from bwrap.
You don't need bind mounts; you can just pass access rights to directories into the sandbox directly. Also, sandboxed apps run inside a (filesystem) container, so file writes to $HOME are transparently redirected to a shadow home.
Respectfully, it's not enough. You can't treat the inside of the sandbox as a generic macOS system. You can't really install arbitrary things or run arbitrary programs. The wheels fall off extremely quickly.
That's true, which is why I abandoned that approach, but the original comparison was against Bubblewrap, which has the same issues (yes, with enough overlays you can make a semi-writable system into which you can install things, but you can also tunnel brew outside the sandbox).
Bubblewrap does not really have these issues at all. It's pretty much full containerization.
What issues did you hit?
The main issue I had is that most dev tools aren't sandbox-compatible out of the box, and it's Apple-specific tech. You can add SBPL exceptions to make more stuff work, but why bother? Containers/Linux VMs work everywhere.
Would something like dagger.io work for sandboxing? I'm not sure on the security side of things, but I very much liked the presentation they did at the AI Engineering conference (San Fran, earlier this year) about how they can build branching containers to support branching or parallelized development workflows.
Yeah, that's definitely an option worth considering. Coincidentally I quoted Dagger founder Solomon Hykes in my article - the "An AI agent is an LLM wrecking its environment in a loop" line.
While sandbox-exec is officially "deprecated" it will be around for a long time, so building some tooling on top of it to make it useful seems valuable!
I recently built my own coding agent, due to dissatisfaction with the ones that are out there (though the Claude Code UI is very nice). It works as suggested in the article. It starts a custom Docker container and asks the model, GPT-5 in this case, to send shell scripts down the wire which are then run in the container. The container is augmented with some extra CLI tools to make the agent's life easier.
My agent has a few other tricks up its sleeve and it's very new, so I'm still experimenting with lots of different ideas, but there are a few things I noticed.
One is that GPT-5 is extremely willing to speculate. This is partly because of how I prompt it, but it's willing to write scripts that try five or six things at once in a single script, including things like reading files that might not exist. This level of speculative execution speeds things up dramatically, especially as GPT-5 is otherwise a very slow model that likes to think about things a lot.
Another is that you can give it very complex "missions" and it will drive things to completion using tactics that I've not seen from other agents. For example, if it needs to check something that's buried in a library dependency, it'll just clone the upstream repository into its home directory and explore that to find what it needs before going back to working on the user's project.
None of this triggers any user interaction because it runs in the container. In fact, no user interaction is possible. You set it going and do something else until it finishes. The usage model is very much to queue up "missions" that can then run in parallel, and you merge them together at the end. The agent also has a mode where it takes the mission, writes a spec, reviews the spec, updates the spec given the review, codes, reviews the code, etc.
Even though it's early days, I've set this agent missions on which it spent 20 minutes of continuous, uninterrupted inferencing time and succeeded excellently. I think this UI paradigm is the way to go. You can't scale up AI-assisted coding if you're constantly needing to interact with the agent. Getting the most out of models requires maximally exploiting parallelism, so sandboxing is a must.
> Getting the most out of models requires maximally exploiting parallelism, so sandboxing is a must.
What are your thoughts on checkpointing as a refinement of sandboxing? For tight human/LLM loops I find automatic checkpoints (of both model context and file system state) and easy rolling back to any checkpoint to be the most important tool. It's just so much faster to roll back on major mistakes and try again with the proper context than to try to get the LLM to fix a mistake, since by then the broken code and invalid assumptions are contaminating the context.
But that relies on the human in the loop deciding when to undo. Are you giving some layer of your system the power of resetting sub-agents to previous checkpoints, do you do a full mind-wipe of the sub-agents if they get stuck and try again, or is the context rot just not a problem in practice?
I want to minimize the human in the loop. At the moment my agent allows user interaction in one place: after a spec is written, reviewed, and updated to reflect the review, it stops. You can then edit the spec before asking for an implementation. It helps to catch cases where the instructions were ambiguous.
At the moment my agent is pretty basic. It doesn't detect endless loops; the model is allowed to bail at any time when it feels it's done or stuck, so loop detection doesn't seem to be needed. It doesn't checkpoint currently. If it does the wrong thing you just roll it all back and improve the AGENTS.md or the mission text. That way you're less likely to encounter problems next time.
The downside is that it's an expensive way to do things but for various reasons that's not a concern for this agent. One of the things I'm experimenting with is how very large token budgets affect agent design.
This also feels like the structure that Sketch.dev uses --- it runs asynchronously in a YOLO mode in a container on a cloud instance, with very little interaction (the expectation is you give it tasks and walk away). I have friends who queue up lots of tasks in the morning and prune down to just a couple of successes in the afternoon. I'd do this too, but at the scale I work on merge conflicts are too problematic.
I'm working on my own dumb agent for my own dumb problems (engineering, but not software development) and I'd love to hear more about the tricks you're spotting.
Yes it's a bit like that except not in the cloud. It runs locally and doesn't make PRs, it just leaves your worktree in an uncommitted state so you can do any final tests, fixes to the code etc before committing.
What kind of non-software engineering, if you don't mind sharing? Sounds interesting!
That's really interesting. I've noticed something similar - I've tried frontend tasks against GPT-5-Codex and seen it guess the URL of the underlying library (on jsdelivr or GitHub) and attempt to fetch the original source code, often trying several different URLs, in order to dig through the source and figure out how to use an undocumented API feature.
Yes. It made me realize how much intelligence is in these models that isn't being exploited due to minor details of the harness. I've been doing this as a side project and it took nearly no effort to get something that I felt worked better than every other agent I tried, even if the UI is rougher. We're really in the stone age with this stuff. The models are not the limiting factor.
Can you share a link to your repo? I am curious to understand the architecture here, and what the prompts you're using are like that enable GPT-5 to have that kind of emergent behavior.
It's private unfortunately, I doubt it'll be available publicly anytime soon. I don't own the copyright on the code, but even if I did, I don't feel like entering the agent race right now. That's becoming all about who converts VC capital into tokens most efficiently and I'm not sure how to build a sustainable business out of it given what looks like systematic market dumping.
I wouldn't worry. There's so much attention focused on this space right now that any ideas I have are being had simultaneously by hundreds of other people. I just discovered a tool called Sculptor that's got a pretty UI of the type I wanted for the container management; they are probably the ones to watch for this kind of agentic approach.
What kinds of tasks were successful? What kinds of tasks failed? Have you tried deep "get me the data" tasks, say for example "assemble the time series of cash flows associated with Japanese sovereign debt"?
Something I have been struggling with is that "go and look and find this and join it and then see" kinds of exercises really seem to go way, way off the rails, and Claude in particular likes to "mock" anything that gets a little bit hard, even if I scream "NEVER MOCK DATA" in my agents file.
I would be curious whether there are tricks for these long-running loops, like constantly injecting new "DO NOT MOCK" shouts at it every n cycles or something.
I've only tried it with various programming tasks. An example task: port the docs for this framework to a different (programming) language, touching up the text along the way for the new ecosystem and rewriting the code samples.
I haven't tried data analysis tasks.
The state of YOLO is embarrassing:
> Update: It turns out Anthropic have their own documentation on Safe YOLO mode for Claude Code which says:
> Letting Claude run arbitrary commands is risky and can result in data loss, system corruption, or even data exfiltration (e.g., via prompt injection attacks). To minimize these risks, use --dangerously-skip-permissions in a container without internet access. You can follow this reference implementation using Docker Dev Containers. [https://github.com/anthropics/claude-code/tree/main/.devcont...]
And… that link goes to a devcontainer that firewalls itself from inside using, effectively, sudo iptables. If Claude Code can’t break that all by itself in a single try, I’d be a bit surprised.
I built this using OpenAI GPT-5 agentic loops as well, where the agent interacts with a spreadsheet: banker dot so
Some principles I learned:
1. You don't really want to use any kind of framework that wraps around the OpenAI or Claude SDKs. You end up fighting them.
2. A small number of tools with more functions inside is better than a large number of tools with only a few functions each.
3. The function definitions you put inside the tools are actually very important and need to be very clear.
4. Integrating another agent as a tool of the main agent can relieve your context window by 5x (see the sketch below).
5. RAG via vector search alone is mostly unnecessary. Claude Code itself leverages iterative search, which works way better with code repos as well as most documents.
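Point 4 in that list, a sub-agent exposed as a tool, might look roughly like this with plain OpenAI function calling; the tool name, prompts and model id are made up for illustration:

```python
"""Sub-agent-as-a-tool: the main agent sees only a short summary, not the
sub-agent's full transcript, which is what keeps the main context small."""
from openai import OpenAI

client = OpenAI()

RESEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "research_subagent",
        "description": "Delegate a self-contained research question; returns a short summary.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}


def research_subagent(question: str) -> str:
    """Run a separate conversation and hand back only its final answer."""
    sub = client.chat.completions.create(
        model="gpt-5",  # placeholder model name
        messages=[
            {"role": "system", "content": "Answer in under 200 words."},
            {"role": "user", "content": question},
        ],
    )
    return sub.choices[0].message.content
```

The main loop passes `tools=[RESEARCH_TOOL]`, runs `research_subagent` whenever the model emits a matching tool call, and appends only the short summary as the `role: "tool"` message, so the sub-agent's full transcript never enters the main context.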
For anyone else curious about what a practical loop implementation might look like, Steve Yegge YOLO-bootstrapped his 'Efrit' project using a few lines of Elisp: https://github.com/steveyegge/efrit/blob/4feb67574a330cc789f...
And for more context on Efrit this is a fun watch: "When Steve Gives Claude Full Access To 50 Years of Emacs Capabilities" https://www.youtube.com/watch?v=ZJUyVVFOXOc
On one hand, folks are complaining that even basic tasks are impossible with LLMs.
On the other hand we have cursed language [0] which was fully driven by AI and seems to be functional. Btw, how much did that cost?
I feel like I've been hugely successful with tools like Claude Code, Aider, and opencode. Especially when I can define custom tools. "You" have to be a part of the loop, in some capacity, to provide guidance and/or direction. I'm puzzled by the fact that people are surprised by this. When I'm working with other entities (people) who are not artificially intelligent, the majority of the time is spent clarifying requirements and aligning on goals. Why would it be different with LLMs?
0: https://ghuntley.com/cursed/
The cursed cost estimate was $14,000 https://twitter.com/GeoffreyHuntley/status/19652951529620975...
> Btw, how much did that cost?
$30k+
Not sure how functional though!
This is great. I feel like most of the oxygen is going to go to the sandboxing question (fair enough). But I'm kind of obsessed with what agent loops for engineering tasks that aren't coding look like, and also the tweaks you need for agent loops that handle large amounts of anything (source code lines, raw metrics or OTel span data, whatever).
There was an interval where the notion of "context engineering" came into fashion, and we quickly dunked all over it (I don't blame anybody for that; "prompt engineering" seemed pretty cringe-y to me), but there's definitely something to the engineering problems of managing a fixed-size context window while iterating indefinitely through a complex problem, and there's all sorts of tricks for handling it.
Huggingface has a series on Tiny Agents[1] where they go into a minimal version of the loop with tools.
I’ve been working on a weekend project building one of these and the tools you provide are what makes the agent suitable for the task.
Even small models and non-tool models can use MCP with correct prompting, but you have to edit what tools they can use.
Context is the big bottleneck to manage. Programming agents benefit from knowing shell commands, but generic agents you have to teach/trick into the right interactions, which burns context. Then, out of the box, an agent will want to read everything, so I have to comment on which commands are expensive or cheap context-wise.
My project is almost ready to explore context management methods: things like removing old files from memory, warning the agent about limits, using a second agent that does compaction.
It would be cool to eventually RLHF the agents too, but that’s unlikely with my weekend project budget.
[1] https://huggingface.co/learn/mcp-course/en/unit2/tiny-agents
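The "second agent that does compaction" idea mentioned above could look roughly like this; a sketch only, with a crude 4-characters-per-token estimate and a placeholder compactor model name:

```python
"""When the transcript gets too big, summarize older messages and replace them."""
from openai import OpenAI

client = OpenAI()


def estimate_tokens(messages: list[dict]) -> int:
    return sum(len(m["content"]) for m in messages) // 4  # crude heuristic


def compact(messages: list[dict], budget: int = 60_000, keep_last: int = 10) -> list[dict]:
    if estimate_tokens(messages) < budget:
        return messages
    head, tail = messages[1:-keep_last], messages[-keep_last:]
    summary = client.chat.completions.create(
        model="gpt-5-mini",  # placeholder "compactor" model
        messages=[
            {"role": "system",
             "content": "Summarize this agent transcript: decisions made, files touched, open TODOs."},
            {"role": "user",
             "content": "\n\n".join(f"{m['role']}: {m['content']}" for m in head)},
        ],
    ).choices[0].message.content
    # Keep the system prompt, a compacted history note, and the recent turns.
    return [messages[0], {"role": "user", "content": f"[Compacted history]\n{summary}"}, *tail]
```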
I can imagine a Dependabot/Renovate-style agentic loop that updates dependencies by going through the changelog of a new version, reviewing the new code changes, and evaluating whether the upgrade is worth it (or even dangerous, from either a stability or a security point of view). Too often these tools are used to blindly respin builds with the latest and greatest versions, which is what gets most people in trouble when their NPM deps become malicious.
We have a GH workflow that spins up Claude with a prompt like this and it works quite well:
—
You are going to create a summary of the important changes for a PR created by dependabot:
1. Read the body of PR $ARGUMENTS with the `gh` cmdline tool.
2. Create a todo list for yourself for each package in PLAN.md.
3. For each todo, read the releases page(s) (add /releases to the github repo url) to cover the update range.
4. Summarize breaking changes and CVE updates, see template at the end. Grouped into Dev Dependencies and Production Dependencies.
5. For production dependencies, find code usages (with `rg`/`grep`) and determine if breaking changes will affect us. Add a list of all found usages, linking to the direct line on github (like https://github.com/solarmonkey/app/blob/master/frontend/some...). Finish each with a verdict: unsure/likely OK/*may break*/*SURELY BREAKING*!
6. Fill the table up top with the verdicts
7. Write the filled out template to `deps_update_analysis.md`. Don't add a final summary or conclusive thoughts. The template is all we need.
[snip template to fill]
You should write it! You might be surprised how easy it is to get something working.
Ha, well look at that, not even a day later: https://fossa.com/blog/fossabot-dependency-upgrade-ai-agent/
I use agentic loops for documents and spreadsheets (aka the bullshit used to interface with non-engineers in the company), since my model of choice is adept at Markdown and CSV, and the office software can import and export those.
Wrong takeaway, but claude code was released February of this year??? I swear people have been glazing it for way longer... my memory isn't that bad right?
February 24th, Anthropic tucked it into the same announcement as Sonnet 3.7: https://www.anthropic.com/news/claude-3-7-sonnet
I updated this post to link to the Claude Code docs that suggest running YOLO mode using their Docker dev container: https://www.anthropic.com/engineering/claude-code-best-pract... - which locks down network access to just a small set of domains: https://github.com/anthropics/claude-code/blob/5062ed93fc67f...
I'm currently building my own coding agent, mainly for self-research [0], so this article is very helpful to me. Thank you Simon!
In this particular thread, others have commented about lightweight sandboxing solutions. I think I will integrate this next in my agent for safe code execution. [1]
[0] https://github.com/vinhnx/vtcode
[1] https://news.ycombinator.com/item?id=45429787
On Unix you can just use a user account to limit access.
Create a group for you and the `claude` user, and have it do `umask 002` on login.
The shell-agent use case has far more parallels with 1980s TTYs and mainframes handling multiple users than it has with containers.
I.e. Docker's USP was reproducibility, not access controls.
For sandboxing and parallelizing agents people at my day job recently started using container-use[0] and seem to be quite happy with it. I have yet to try it in depth but the approach it takes seems quite sensible to me: Have the agent execute all commands within a container via an MCP[1], propagate any changes the agent makes to git branches on the host, for easy review & merge.
[0]: https://github.com/dagger/container-use
[1]: I would like it better if the entire agent process were sandboxed but I suppose that would make it difficult to use with IDE/GUI-based agents.
I think this is a strictly worse name than "agentic harness", which is already a term used by open-source agentic IDEs (https://github.com/search?q=repo%3Aopenai%2Fcodex%20harness&... or https://github.com/openai/codex/discussions/1174)
Any reason why you want to rename it?
Edit: to say more about my opinions, "agentic loop" could mean a few things -- it could mean the thing you say, or it could mean calling multiple individual agents in a loop ... whereas "agentic harness" evokes a sort of interface between the LLM and the digital outside world which mediates how the LLM embodies itself in that world. That latter thing is exactly what you're describing, as far as I can tell.
I like "agentic harness" too, but that's not the name of a skill.
"Designing agentic loops" describes a skill people need to develop. "Designing agentic harnesses" sounds more to me like you're designing a tool like Claude Code from scratch.
Plus "designing agentic loops" includes a reference to my preferred definition of the term "agent" itself - a thing that runs tools in a loop to achieve a goal.
Context engineering is another name people have given to the same skill?
As a reader of Simon's work, I can speculate an answer here.
All "designing agentic loops" is context engineering, but not all context engineering is designing agentic loops. He's specifically talking about instructing the model to run and iterate against an evaluation step. Sure, that instruction will end up in the context, but he's describing creating a context for a specific behavior that allows an agent to be more effective working on its own.
Of course, it'll be interesting to see if future models are taught to create their own agentic loops with evaluation steps/tests, much as models were taught to do their own chain of thought.
I think that's actually quite different.
Context engineering is about making sure you've stuffed the context with all of the necessary information - relevant library documentation and examples and suchlike.
Designing the agentic loop is about picking the right tools to provide to the model. The tool descriptions may go in the context, but you also need to provide the right implementations of them.
The reason I feel they are closely connected is that when designing tools for, let's say, coding agents, you have to be thoughtful about context engineering.
E.g. the Linear MCP is notorious for returning large JSON payloads which quickly fill up the context and are hard for the model to understand. So tools need to be designed slightly differently for agents, keeping context engineering in mind, compared to how you design them for humans.
Context engineering feels like the more central, first-principles approach to designing tools and agent loops.
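As a rough illustration of that point, a tool can pre-digest a big API payload before it ever reaches the model. A sketch with made-up field names, not the actual Linear schema:

```python
"""Design the tool result for the model, not for humans: keep only the fields
the agent needs instead of dumping the raw API payload into the context."""
import json

KEEP = ("id", "title", "state", "assignee", "updatedAt")  # illustrative field names


def issues_for_agent(raw_api_response: str, limit: int = 20) -> str:
    issues = json.loads(raw_api_response)
    trimmed = [{k: issue.get(k) for k in KEEP} for issue in issues[:limit]]
    return json.dumps(trimmed, indent=2)  # small, predictable, easy to reason about
```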
They feel pretty closely connected. For instance: in an agent loop over a series of tool calls, which tool results should stay resident in the context, which should be summarized, which should be committed to a tool-searchable "memory", and which should be discarded? All context engineering questions and all kind of fundamental to the agent loop.
Yeah, "connected" feels right to me.
Those decisions feel to me like problems for the agent harness to solve - Anthropic released a new cookbook about that yesterday: https://github.com/anthropics/claude-cookbooks/blob/main/too...
One thing I'm really fuzzy on is, if you're building a multi-model agent thingy (like, can drive with GPT5 or Sonnet), should you be thinking about context management tools like memory and autoediting as tools the agent provides, or should you be wrapping capabilities the underlying models offer? Memory is really easy to do in the agent code! But presumably Sonnet is better trained to use its own builtins.
It boils down to information loss in compaction driven by LLMs. Either you carefully design tools that only give compacted output with high information density, so models have to auto-compact or organize information only once in a while, which is eventually going to be lossy anyway.
Or you just give loads of information without thinking much about it, assume models will have to do frequent compaction and memory organization, and hope it's not super lossy.
Right, just so I'm clear here: assume you decide your design should be using a memory tool. Should you make your own with a tool call interface or should you rely on a model feature for it, and how much of a difference does it make?
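For the "memory is really easy to do in the agent code" side of that trade-off, a harness-side memory can be as small as this sketch (file-backed, plain substring search; all names are illustrative, and you'd swap in embeddings if you need them):

```python
"""A minimal file-backed memory the harness can expose as two tools."""
import json
from pathlib import Path

MEMORY_FILE = Path("agent_memory.jsonl")


def memory_write(note: str, tags: list[str] | None = None) -> str:
    """Append a note; exposed to the model as a 'remember this' tool."""
    with MEMORY_FILE.open("a") as f:
        f.write(json.dumps({"note": note, "tags": tags or []}) + "\n")
    return "stored"


def memory_search(query: str, limit: int = 5) -> list[str]:
    """Return up to `limit` notes containing the query string."""
    if not MEMORY_FILE.exists():
        return []
    notes = [json.loads(line)["note"] for line in MEMORY_FILE.read_text().splitlines()]
    return [n for n in notes if query.lower() in n.lower()][:limit]
```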
Do you think this'll eventually be trained into the models the way that chain-of-thought has been?
To a certain extent it has already - models are already very good at picking tools to use: ask for a video transformation and it uses ffmpeg, ask it to edit an Excel sheet and it uses Python with openpyxl, etc.
My post is more about how sometimes you still need to make environment design decisions yourself. My favorite example is the Fly.io one, where I created a brand new Fly organization with a $5 spending limit and issued an API token that could create resources in that organization, purely so the coding agent could try experiments to optimize cold start times without messing with my production Fly environment.
An agent might be able to suggest that pattern itself, but it would need a root Fly credential in order to create the organization and restricted credentials for itself, and given how unsafe agents with root credentials are, I'd rather keep that step to myself!
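For anyone who wants to copy that pattern, the CLI side looks roughly like this - a sketch, with the flyctl subcommands from memory and the spending limit itself set in the Fly dashboard rather than the CLI:

```bash
# Create a throwaway organization to act as the agent's blast radius
fly orgs create agent-scratch

# Mint a token scoped to that organization only - it can create and
# destroy resources there but can't touch the production org
fly tokens create org --org agent-scratch

# Hand the resulting token to the coding agent, e.g.
#   export FLY_API_TOKEN=<token from above>
# The $5 spending limit goes on the new org in the web dashboard.
```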
It's amusing to think that the endgame is that the humans in the loop are parents with credit cards.
I suppose you could never be sure that an agent would actually follow your instruction "Don't spend more than $5".
But maybe one could build a tool that provides payment credentials, and you get to move further up the chain. E.g., what if an MCP tool could spin up virtual credit cards with spending caps, and then the agent could create accounts and provide payment details that it received from the tool?
I'm surprised to see so many people using containers when setting up a KVM is so easy, gives the most robust environment possible, and to my knowledge has much better isolation. A vanilla build of Linux plus your IDE of choice and you're off to the races.
You often don't need strong isolation. The sandboxing is more to avoid model accidents than a Skynet scenario.
Not everyone has spare hardware lying around!
For sure! But just for reference, I'm on a mid-tier 2022 Dell Inspiron laptop: Ryzen 7 5825U with 64GB ram and 500GB SSD.
On it, I run Ubuntu 24.04 as my host, and my guest is Lubuntu with 16GB ram and 80GB ssd for my KVM.
I almost always have 2 instances of PHPStorm open, one in the host and one in the guest, with multiple terminal tabs running various agentic tasks.
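For reference, a guest like that is basically one virt-install invocation - a rough sketch using libvirt's virt-install; adjust the ISO path and os-variant for your setup:

```bash
# Sketch: a 16GB RAM / 80GB disk Lubuntu guest under KVM/libvirt
virt-install \
  --name agent-guest \
  --memory 16384 \
  --vcpus 4 \
  --disk size=80 \
  --cdrom ~/isos/lubuntu-24.04-desktop-amd64.iso \
  --os-variant ubuntu24.04 \
  --graphics spice
```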
64 GB. Say no more :)
I lived with 16GB until last year and upgraded to 32 only this year, which I thought was a huge improvement. I suspect a lot of people are around this ballpark, especially if they have bought Macs. Mine is Linux, still. So containers are the “simpler” versions.
...wait, you and I are using "KVM" in different ways, then. To me, it means a switch that lets you use the same Keyboard, Monitor ("Video"), and Mouse for two different machines. Sounds like you're talking instead about a technique for running a VM on a single machine - which, from Googlin', I suspect is "Kernel-based Virtual Machine", a new-to-me term. Thanks for teaching me something!
Fantastic article. I'm always looking for better ways to work with agents. I am personally using Claude Code, and want to try some more YOLO type stuff.
My problem is I often give it too much to do. I'm usually overly optimistic about how effective it can be.
I have found Sonnet 4.5 to be quite the improvement over Sonnet 4 however.
My preferred ergonomics for agents are like using a washing machine: choose what clothes you want to have cleaned (whites or colors..) and press “go”.
Claude Code plan mode is the washing machine of agentic loops. It lets you see a detailed plan and if you approve you can let it rip with trust that it’ll do what it says, and go do something else while it’s at work.
Adding tests and other ways to validate its work help it go further & run longer off a single plan.
The statement "Designing agentic loops is a very new skill—Claude Code was first released in just February 2025!" resonated with me when I was investigating Codex[0].
[0]: https://blog.toolkami.com/openai-codex-tools/#coding-agents
> If anything goes wrong it’s a Microsoft Azure machine somewhere that’s burning CPU and the worst that can happen is code you checked out into the environment might be exfiltrated by an attacker, or bad code might be pushed to the attached GitHub repository.
Isn't that risking getting banned from Azure? The compromised agent might not accomplish anything useful, but its attempts might get (correctly!) flagged by the cloud provider.
My guess is that most cloud providers have procedures in place to help avoid banning legitimate customers because one of their instances got infected with malware (effectively what a rogue agent would be).
One important issue with agentic loops is that agents are lazy, so you need some sort of retrigger mechanism. Claude Code supports hooks: you can wire your agent's Stop hook to a local LLM, feed the context in, and ask the model to prompt Claude to continue if needed. It works pretty well - Claude can override retriggers if it's REALLY sure it's done.
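As a rough sketch of that wiring - this assumes the Stop hook interface as I remember it (JSON on stdin with a transcript_path, output {"decision": "block"} to keep Claude going) and a local Ollama endpoint, so check the current hooks docs before copying:

```bash
#!/usr/bin/env bash
# stop-hook.sh - registered as a Stop hook in .claude/settings.json, e.g.
#   {"hooks": {"Stop": [{"hooks": [{"type": "command", "command": "./stop-hook.sh"}]}]}}
set -euo pipefail

# Claude Code passes hook input as JSON on stdin
input=$(cat)
transcript=$(echo "$input" | jq -r '.transcript_path')

# Ask a local model (Ollama assumed here) whether the work looks finished
verdict=$(tail -c 8000 "$transcript" \
  | jq -Rs '{model: "llama3.1", stream: false, prompt: ("Reply DONE or CONTINUE only. Transcript tail:\n" + .)}' \
  | curl -s http://localhost:11434/api/generate -d @- \
  | jq -r '.response')

if [[ "$verdict" == *CONTINUE* ]]; then
  # "block" tells Claude Code not to stop; "reason" is fed back to Claude
  echo '{"decision": "block", "reason": "Work looks unfinished - keep going and verify against the plan."}'
fi
```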
Regarding sandboxing, VMs are the way. Prompt injected agents WILL be able to escape containers 100%.
My mental model of container escapes is that they are security bugs which get patched when they are reported, and none of the mainstream, actively maintained container platforms currently have a known open escape bug.
So is the concern here purely around zero-days?
That's going a bit far; a good mental model is that every kernel LPE is a sandbox escape (that's not precisely true but is to a first approximation), and kernel LPEs are pretty routine and rarely widely reported.
A good heuristic would be that unless you have reason to think you're a target, containers are a safe bet. A motivated attacker probably can pop most container configurations. Also, it can be counterintuitive what makes you a target:
* Large-scale cotenant work? Your target quotient is the sum of those of all your clients.
* Sharing infrastructure (including code supply chains) with people who are targeted? Similar story.
But for people just using Claude in YOLO mode, "security" is not really a top-of-mind concern for me, so much as "wrecking my dev machine".
Seems "big if true" given the number of cloud providers that use containers for customer isolation.
Of the big three cloud providers, only GCP uses containers for customer isolation, and they do so with the supervision of gVisor. It’s certainly possible to do container isolation securely, but it takes extra steps and know-how, and I don’t think anyone is even considering using gVisor or similar for the type of developer workflows being discussed here.
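For completeness, the opt-in looks roughly like this once runsc is installed - a sketch based on gVisor's Docker setup:

```bash
# Register runsc as a Docker runtime in /etc/docker/daemon.json, e.g.
#   {"runtimes": {"runsc": {"path": "/usr/local/bin/runsc"}}}
# restart dockerd, then opt in per container:
docker run --rm -it --runtime=runsc ubuntu:24.04 bash
```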
AWS and Azure both use VM-level isolation. Cloudflare uses V8 isolates which are neither container nor VM. Fly uses firecracker, right?
This topic is kind of unnecessary for the type of developer workflows being discussed that the majority of readers of this article are doing, though. The primary concern here is “oops the agent tried to run ‘rm -rf /‘“, not the agent trying to exploit a container escape. And for anyone who is building something that requires a better security model, I’d hope they have better resources to guide them than the two sentences in this article about prompt injection.
What scares me most is what happens when some attacker attempts to deploy a "steal all environment variable credentials and crypto wallets" prompt injection attack in a way that is likely to affect thousands or millions of coding agent users.
I'm not talking about the hyperscalers. And yes, we use Rust microvm hypervisors.
This is not speculative, it's happened plenty already. People put mitigations in place, patch libraries and move on. The difference is that agents will find new zero days you've never heard of for stuff on your system people haven't scrutinized adequately. There will be zero advance notice, and unlike human attackers who need to lie low until they can plan an exit, it'll be able to exploit you heavily right away.
Do not take the security impact of agents lightly!
I feel like my bona fides on this topic are pretty solid (without getting into my background on container vs. VM vs. runtime isolation) and: "the agents will find new zero days" also seems "big if true". I point `claude` at a shell inside a container and tell it "go find a zero day that breaks me out of this container", and you think I'm going to succeed at that?
I had assumed you were saying something more like "any attacker that prompt-injects you probably has a container escape in their back pocket they'll just stage through the prompt injection vector", but you apparently meant something way further out.
I know at least one person who supplements their income finding bounties with Claude Code.
Right now you can prompt-inject an obfuscated payload that tricks Claude into trying to root a system, under the pretense that you're trying to identify an attack vector on a test system to understand how you were compromised. It's not good enough to do much, but with the right prompts, better models, and the ability to smuggle in extra code, you could get quite far.
Lots of people find zero days with Claude Code. That is not the same thing as Claude Code autonomously finding zero days without direction, which was what you implied. This seems like a pretty simple thing to go empirically verify for yourself. Just boot up Claude and tell it to break out of a container shell. I'll wait here for your zero day! :)
If AI agents are capable of finding new zero days, that seems like an absolute win for computer security research.
It's already happening, as I mentioned in a sibling comment I know of someone doing this for supplemental income.
I made a simple docker container to jail an LLM for a recent project.
https://github.com/codazoda/llm-jail
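For anyone rolling their own, a jail along these lines is a reasonable starting point - a generic sketch, not necessarily what that repo does:

```bash
# Network off, read-only root, capped CPU/memory, only the project mounted
docker run --rm -it \
  --network none \
  --read-only --tmpfs /tmp \
  --memory 2g --cpus 2 \
  --cap-drop ALL \
  -v "$PWD:/work" -w /work \
  ubuntu:24.04 bash
```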
I feel like the AI industry is basically recreating "workflow engines" again and again and again, in different forms, with different input mechanisms.
Temporal.io has been good for orchestrating agentic workflows, or workflows in general
> This is so dangerous, but it’s also key to getting the most productive results
where do i sign up?
I typically run my code agents in distrobox.
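If you want to try it, the basic setup is just this (a sketch - note that distrobox shares your $HOME by default, so it's more about convenience than isolation unless you give it a separate home):

```bash
# Create an Ubuntu box for agent work with its own home directory,
# so the agent can't touch your real dotfiles
distrobox create --name agents --image ubuntu:24.04 --home ~/distrobox-homes/agents
distrobox enter agents
```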
A little off-topic but I’ve been waiting for a post by someone who’s built something like Claude Code or Codex that can explain how the state machine works for execution.
Clearly it’s not just a good system prompt and hoping for the best. Obviously they’ve both got fairly good algorithms for working toward whatever the user asked for incrementally, handling expectable errors, changing course if necessary, etc.
I mean ultimately it’s gotta be a really big flow chart that results in the LLM getting prompted various ways in all kinds of directions before ultimately converging on the final response, right? Or am I way off-base?
Has anyone seen such a post explaining this architecture / state machine / algorithm?
This is in that direction: https://jannesklaas.github.io/ai/2025/07/20/claude-code-agen...
(This is a much better answer than my Codex Cloud generated version in the sibling comment.)
Thanks. Looks like exactly what I was looking for.
You'd be surprised. You don't need a state machine to get started, only if you want to get really advanced. The models are smart. It's enough to just ask them to send you instructions, execute them and send back the results.
This has not been my experience.
Since Codex CLI is open source I got Codex Cloud to attempt to answer your question - here's the shared session: https://chatgpt.com/s/cd_68dc99fe3e948191a8923ddc4a1f4310
And here's the markdown file it produced with an attempt at an answer: https://github.com/simonw/codex-scratchpad/blob/codex/analyz...
I find the word agentic to be a hot spike into my brain ... I am not sure why but I think it has to do with the fact it is marketing bs.
Is it just me ?
I just came round to using that term a couple of weeks ago (see https://simonwillison.net/2025/Sep/18/agents/) - I held out for the longest time because it annoyed me that everyone had a different idea of what "agent" meant.
In this case I do think the term is justified - "agentic loop" is a good way of describing a loop with a bunch of tool calls driven by an LLM.
That indeed is the simplest definition for it. In practice, there can be parallel agent calls, asynchronous or synchronous result merger, and context compression or management. Also, there is no shame in adding manual LLM steps to an agentic workflow.
It's marketing BS in the hands of grifters.
Can you expand on why you think that?
I'm saying there are grifters who co-opt the terms used by people actually pioneering / exploring the space, to sell BS; and that taints OP's view.
This doesn't need stating but if it does I am not saying you are a grifter, obviously, because you substantiate your positions. Just another classic signal v. noise.
That makes sense, thanks.
"Agentic" feels like the new "Generative".
Yes, it means something "vague" and vibes oriented. It's when you can run things in a terminal or async or something maybe.
I've been trying to make it less vague. I'm using it to mean something that's directly related to running tools in a loop to achieve a goal.
No your spidey sense is correct: it does signify marketing BS.
In case anyone else is confused: I saw the word "loop" and wondered what the point is to run an LLM in a loop? Like, what's the goal here? You want to write N different React todo web apps, or something?
After reading the article a bit, I realized this isn't about "loops" at all. What it is about is "bypassing the yes/no human check on each step the LLM wants to take in completing some multi-step task". You can already do this with, for example, Claude Code. However, one of the points of having a human approve LLM actions is to prevent it doing something stupid or destructive. So now you have a second problem: how to prevent the LLM from poking holes in spacetime?
Answer: sandbox it.
And that's what the article is about.
Is there a way to let the agents share their agents.md? Claude Code looks at CLAUDE.md and I have yet to find a way to unify the agent handbook.
At this point effectively everyone else is on https://agents.md/ - Claude are the only holdout.
Is there any reason this is better than just reading your regular readme or docs?
All of these things look helpful for humans too!
The main difference between AGENTS.md and a README.md is that coding agents will automatically ingest AGENTS.md when they start.
You may not want them to ingest README.md if it's long and contains irrelevant information as that might be a waste of valuable tokens.
You could say that a Readme is declarative, while an agent file is imperative.
I just use symlinks.
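i.e. keep AGENTS.md as the source of truth and point Claude's file at it, assuming you're in the repo root:

```bash
# AGENTS.md is the real file; CLAUDE.md is just a pointer to it
ln -s AGENTS.md CLAUDE.md
```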
I wouldn't be surprised if agents start getting managed by a distributed agentic system - think about it. Right now you get Codex/Claude/etc. and its system prompt and various other internally managed prompts are locked down to the version you downloaded. What if a distributed system ran experimental prompts, monitored the success rate (what code makes it into a commit), and provided feedback to the agent manager? That could help it automatically fine-tune its own prompts.
This is what Anthropic does for their "High Compute" SWE benchmarking:
" For our "high compute" numbers we adopt additional complexity and parallel test-time compute as follows:
"But running in parallel means you use more compute, meaning that the cost is higher. Good results are worth paying for, but if a super recursive approach costs more and takes longer than a human...