flyingpenguin

The New Stack reported in July that AWS, Google Cloud, Microsoft Azure, and Cloudflare all now sell sandboxes for AI agents. Since I keep getting asked what a sandbox is: it is a sealed environment where code an AI writes or runs can execute without touching anything else. The article split the problem in half: isolation, meaning the walls of the sealed environment, and governance, meaning the rules about what the code inside may do. That artificial division is useful as a teaching model. As a description of what the vendors have shipped, it is already inaccurate, and the vendors’ own documentation is sufficient to show where they fail.

Google announced Cloud Run sandboxes on July 9, and the announcement describes how they behave. Code inside the sandbox can make no connection to the internet unless the operator switches connections on for that single run. The code can read the files it was given, and anything it writes is thrown away when it finishes. It cannot see the settings and credentials of the service that launched it, and it cannot ask Google’s internal credential service for an identity. Consider what these four restrictions are. They are rules about what running code may reach. In the article’s division, rules belong to governance, and governance was supposed to be a separate layer above the walls. Google has built these rules into the walls. What Google built in is also crude. Internet access is fully off or fully on. There is no way to permit one site and refuse another. Only the simplest rule, the default of no access, has moved inside the walls. Anything more detailed still has to be enforced somewhere else.
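To make the gap between "switch" and "rule" concrete, here is a minimal sketch in Python. The class names and methods are hypothetical and are not Google's API. The only thing taken from the announcement is that internet access is a single on/off setting per run, off by default.

```python
from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass(frozen=True)
class SandboxRun:
    """Hypothetical model of one sandbox run: egress is a single switch."""
    allow_internet: bool = False  # default is no access

    def may_connect(self, url: str) -> bool:
        # The only question the wall can answer: is the switch on?
        # The destination is ignored entirely.
        return self.allow_internet


@dataclass(frozen=True)
class HostAllowlist:
    """Hypothetical finer-grained rule that has to be enforced elsewhere."""
    hosts: frozenset = field(default_factory=frozenset)

    def may_connect(self, url: str) -> bool:
        # Permit one site and refuse another, based on the hostname.
        return urlparse(url).hostname in self.hosts
```

With the switch on, SandboxRun(allow_internet=True) answers yes for every destination. Only something like the allowlist can tell two sites apart, and in this sketch it is a separate object, which is the point.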

The AWS announcement takes a moment to describe because the arrangement is unusual. AWS published a developer guide for running Anthropic’s Claude Managed Agents on Lambda MicroVMs, a new AWS service that starts a small isolated virtual machine for each job. AWS files this under the name “self-hosted sandboxes.” Here is what happens, step by step, according to the guide. A work session is created and held on Anthropic’s computers. Anthropic then sends a notification to a web address in the customer’s AWS account. A small program at that address checks the notification is genuine and starts one virtual machine. The program inside that machine contacts Anthropic, asks for the job, runs the commands the job requires, sends the results back to Anthropic, and shuts down. The guide states the division in its first paragraph: “Anthropic hosts the agent loop and Claude model,” and the customer’s machine is where the commands run. In plain terms, the customer owns and pays for the computer. Anthropic decides when the machine wakes up and what it works on. Everything the machine produces goes back to Anthropic.
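The "checks the notification is genuine, then starts one machine" step can be sketched roughly as follows. This is an illustration under stated assumptions, not the guide's code: the signature scheme (hex HMAC-SHA256 over the raw body with a shared secret) is assumed, and start_vm is a hypothetical callback standing in for whatever launch call the guide actually uses.

```python
import hashlib
import hmac


def notification_is_genuine(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Assumed scheme: hex HMAC-SHA256 of the raw body under a shared secret."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature_hex)


def handle_notification(secret: bytes, body: bytes, signature_hex: str, start_vm) -> str:
    """Verify first, then start exactly one machine for the notified job."""
    if not notification_is_genuine(secret, body, signature_hex):
        return "rejected"
    start_vm()  # hypothetical callback; the real launch call is not shown here
    return "started"
```

Note what the handler does not do. It does not decide whether there is work. It only confirms that the vendor said so and then powers up a machine.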

Anthropic’s own documentation says the same thing. The “self-hosted” environment is described as a queue of jobs, and the queue lives with Anthropic. The customer’s machine asks the queue for work and reports back when the work is done. So the term “self-hosted” covers one thing: the commands run on the customer’s hardware. The session itself, its state, its schedule, and the decision that there is work to do at all stay with Anthropic.
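The queue arrangement can be modeled in a few lines. This is a toy, in-memory stand-in with hypothetical names (VendorQueue, run_worker), not Anthropic's API. It only shows which side holds which responsibility.

```python
from collections import deque


class VendorQueue:
    """Stand-in for the vendor-side job queue; hypothetical, in memory only."""

    def __init__(self):
        self._jobs = deque()
        self.results = {}

    def enqueue(self, job_id, command):
        # Only the vendor side creates work.
        self._jobs.append((job_id, command))

    def claim(self):
        # The worker asks for work; None means there is nothing to do.
        return self._jobs.popleft() if self._jobs else None

    def report(self, job_id, output):
        # The worker hands everything it produced back to the vendor.
        self.results[job_id] = output


def run_worker(queue, execute):
    """Customer-side worker: claim, execute locally, report, then stop."""
    job = queue.claim()
    if job is None:
        return False  # the worker never invents work for itself
    job_id, command = job
    queue.report(job_id, execute(command))
    return True
```

The only customer-owned piece here is the execute function. The queue, its contents, and the record of results all sit on the other side.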

The guide’s own troubleshooting table shows how complete the dependency is. One listed problem is a worker that shuts down the moment it starts. The listed cause is a machine blocked from reaching api.anthropic.com. The machine has one essential need, and that need is a line to the vendor. The networking section explains why this rarely goes wrong in practice: AWS gives these virtual machines open internet access by default, so nothing has to be configured for the machine to reach Anthropic. Now compare the two defaults. Google ships its sandbox with the internet switched off, because Google treats the code inside as the danger. AWS ships this sandbox with the internet switched on, because the arrangement stops working the moment the machine loses contact with Anthropic. Each vendor chose the default its own design required.

The AWS design deserves some credit, and giving it helps clarify where the others fail. Anthropic issues two kinds of key. The powerful one, which can create new sessions, stays with the customer’s operator and is never placed on any AWS machine. The virtual machine receives only a lesser key, enough to collect its assigned job and report back, and even that key is fetched at the last moment rather than stored on the machine. This is a statement of trust. The customer’s machine is trusted to carry out work. The power to create work is kept away from it.
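The two-key split amounts to scope separation, which can be sketched like this. The scope names and the Key class are hypothetical, and the exact permissions attached to each real key are an assumption. The source only says that one key can create sessions and the other can collect a job and report back.

```python
from dataclasses import dataclass

# Illustrative scope names, not taken from any vendor documentation.
CREATE_SESSION = "create_session"
CLAIM_JOB = "claim_job"
REPORT_RESULT = "report_result"


@dataclass(frozen=True)
class Key:
    name: str
    scopes: frozenset

    def allows(self, action: str) -> bool:
        return action in self.scopes


# Stays with the operator and is never placed on the worker machine.
operator_key = Key("operator", frozenset({CREATE_SESSION}))

# Fetched by the worker at start-up; enough to do assigned work, nothing more.
worker_key = Key("worker", frozenset({CLAIM_JOB, REPORT_RESULT}))
```

A compromised worker holding only worker_key could finish or spoil its own job, but in this model it could not mint new sessions.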

So the vendors’ own published record is all we need to prove how broken they are. Google has moved rules inside the walls. Anthropic and AWS have moved the decisions outside the customer’s premises altogether. The walls themselves are genuine, and they are the least disputed part of either product. The questions that decide what actually happens, what the code may reach and who tells it what to do, are answered somewhere else: at Google, baked into the walls as fixed settings; in the AWS pattern, on Anthropic’s side of the line. A buyer who inspects only the walls has inspected the one part every vendor already agrees on, and left out the part that decides what the machine does.

The first thing that jumped out at me in a post about the famous “Pelican on a Bike” test is that GPT-5.6 Luna is being used to score GPT-5.6 Terra, without any inter-run reliability check.

In other words, given a within-lab design, there is a test for style-level bias but none for cell-specific bias, which is in fact the thing supposed to be under test.
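For readers wondering what an inter-run reliability check would involve, a minimal sketch: have the same judge score the same outputs twice and measure how often it agrees with itself. The function name and the tolerance parameter are my own, and this is one simple way to do it, not the method used in the post.

```python
def inter_run_reliability(run_a, run_b, tolerance=0):
    """Compare two scoring passes by the same judge over the same items."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and the same length")
    # Absolute score difference per item between the two passes.
    diffs = [abs(a - b) for a, b in zip(run_a, run_b)]
    return {
        # Share of items where the two passes agree within the tolerance.
        "agreement": sum(1 for d in diffs if d <= tolerance) / len(diffs),
        # Average size of the disagreement, in score units.
        "mean_abs_diff": sum(diffs) / len(diffs),
    }
```

If the judge cannot agree with itself across runs, differences between individual cells are not distinguishable from noise.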

The second thing is that the entire audit cost $80 across seven frontier models. Independent falsification of a contamination hypothesis is very inexpensive. With costs that low and all the code and data published, we should be seeing a lot more of this. Cheap external verification is demonstrably feasible, again.

Remember all the noise about Mythos being a marketing scam? Any vendor benchmark that can’t be independently checked is a cynical design decision that deserves heavy pushback and scrutiny.

Anyway, the point of that post seems to be that any lab gaming the Pelican on a Bike benchmark competently games the whole category, not the individual cell. This is the same structure we see in signature-based detection generally: it catches only the weak or clumsy version.

a blog about the poetry of information security, since 1995