
This week I’m stepping away from the security lens (don’t worry - I’ll be back) to talk about something that has been living rent free in my head for the last few weeks. We’re going to look at what’s happening at the absolute bleeding edge of AI-powered software development, paying particular attention to two stories that I think signal where all of this is heading.
We’re talking about what are being referred to as ‘AI Software Factories’ - systems where coding agents don’t just help developers write code more quickly, but take on the entire development role themselves: writing code, testing it, deploying it, in some cases with minimal or even zero human involvement.
So, the first story is certainly the more famous of the two, as it came from Anthropic themselves. It has been covered from various angles already, mostly focused on what they built and how good (or bad) it is. I’m less bothered about what they built, and more about how they did it.
Nicholas Carlini - one of Anthropic’s researchers - tasked 16 instances of Claude Opus 4.6 with autonomously building a C compiler, written in Rust, capable of compiling the Linux kernel…from scratch.
For those unfamiliar with compilers, let’s just say that is a BIG and very complex task - one that took humans years (albeit with less focus, and a long time ago). For the purposes of this blog, what we need to understand is that this was a far larger project, with far more moving parts, dependencies and complexities, than just about any documented AI-driven project so far.
Over two weeks and roughly 2,000 Claude Code sessions (costing around $20,000), these agents produced a 100,000-line compiler supporting x86, ARM, and RISC-V architectures. It can build a bootable Linux 6.9, compile QEMU, FFmpeg, SQLite, PostgreSQL, Redis - and yes, it runs Doom.
But what fascinates me here isn’t just the output - it’s how they got there. Carlini built a harness that places Claude in an infinite loop where agents autonomously pick up tasks and crack on without human intervention. When one task finishes, the next one starts. Multiple agents work on the same codebase simultaneously through a Docker-based setup where each agent clones the repo locally, pushes changes upstream, and a simple file-locking mechanism in ‘current_tasks/’ stops agents from treading on each other’s toes.
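The locking idea is simple enough to sketch. Here’s a minimal Python illustration of lock-file-per-task claiming - the function names and the lock-file naming are my own assumptions, not Carlini’s actual harness code:

```python
import os
from pathlib import Path

# Hypothetical lock directory, per the 'current_tasks/' convention in the write-up
LOCK_DIR = Path("current_tasks")

def try_claim(task_id: str, agent_id: str) -> bool:
    """Attempt to claim a task by atomically creating a lock file.

    O_CREAT | O_EXCL makes the create fail if another agent got there
    first, so only one agent can hold a given task at a time.
    """
    LOCK_DIR.mkdir(exist_ok=True)
    lock_path = LOCK_DIR / f"{task_id}.lock"
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already owns this task
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record who holds the lock, for debugging
    return True

def release(task_id: str) -> None:
    """Drop the lock so the task (or its follow-ups) can be picked up again."""
    (LOCK_DIR / f"{task_id}.lock").unlink(missing_ok=True)
```

The atomicity of `O_CREAT | O_EXCL` is what makes this safe even with many agents polling the same directory - no coordination service required.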
Now, as with all things AI software factory, the approach leans heavily on test-driven development and requires those tests to be bulletproof, as this is largely how we can get AI to build ‘serious’ software (more to come on this later). When agents are working entirely autonomously, your definition of success - the tests that must pass - needs to be nearly perfect, because the agents will solve problems exactly as specified.
Some other titbits: they deliberately avoided printing excessive output to the agents, instead logging to files that Claude could query when needed. Context management is everything when you’re running agents at this scale; one approach is breaking the problem into small pieces, tracking what each agent is working on, and pointing agents (with their limited context windows) at those smaller pieces of work. They also ran several agents in parallel as specialised teams (one writing code, one keeping an eye on code quality, one testing, and so on), coordinated by that nifty task-locking system.
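The log-to-file pattern is worth making concrete. A rough Python sketch of the idea - command output goes to disk rather than into the agent’s context, and the agent pulls back only the lines it asks for (the function names here are mine, not Anthropic’s):

```python
import subprocess
from pathlib import Path

LOG_DIR = Path("logs")  # assumed location for build/test logs

def run_logged(name: str, cmd: list[str]) -> Path:
    """Run a build or test command, writing all output to a log file
    instead of dumping thousands of lines into the agent's context."""
    LOG_DIR.mkdir(exist_ok=True)
    log_path = LOG_DIR / f"{name}.log"
    with log_path.open("w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return log_path

def query_log(log_path: Path, needle: str, context: int = 2) -> list[str]:
    """Let the agent retrieve just the region around the first match
    (e.g. the first compiler error), with a few lines of context."""
    lines = log_path.read_text().splitlines()
    for i, line in enumerate(lines):
        if needle in line:
            return lines[max(0, i - context): i + context + 1]
    return []
```

The point is token economics: a failing kernel build might emit megabytes of output, but the agent only ever pays context for the slice it queries.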
If the Anthropic story sounds a bit crazy, this one is borderline radical. StrongDM is a product company building a proxy that manages and audits access to databases, servers and Kubernetes through a control plane. They were founded in 2025 around a single philosophy: ‘code must not be written or reviewed by humans’.
The founders had seen what was possible in late 2024 using Claude 3.5, and by December of that year they had started to lean into Cursor’s ‘YOLO’ mode. In 2025 they set out to prove that it was possible to build ‘serious’ products without a human ever even interacting with the code.
Two things to pick up on here. Firstly, humans not writing code is becoming pretty common across software development right now with the proliferation of agentic coding, but not reviewing the code before it goes live is a very different and much scarier thing. Secondly, their goal here was to make ‘serious’ software. Whilst they don’t define the term themselves, the point is this: using AI to whip up an MVP that gets your idea across is one thing, but there is an entire chasm between that and using it to produce production-ready, secure, scalable code.
So, how do they get work done if no one is even looking at the code? In their words: ‘we built a Software Factory: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review’. More specifically, they use a ‘scenario’ to represent an end-to-end user story, stored outside the codebase so the agent can’t see it and reward-hack its way to success.
They also moved ‘success’ away from ‘has this test passed?’ to a ‘satisfaction-based’ measure: of all observed trajectories through all scenarios, what fraction likely satisfy the user?
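In code, that shift is tiny but meaningful. A minimal sketch, assuming each trajectory has already been judged satisfied or not by something upstream (a spec check or a model-graded judge) - the `Trajectory` type here is hypothetical, not StrongDM’s:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One observed run through a scenario, with an upstream verdict."""
    scenario: str
    satisfied: bool  # verdict from a judge - assumed to exist already

def satisfaction_score(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories, across all scenarios, that
    likely satisfy the user - a graded signal rather than a binary
    'all tests passed'."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)
```

The practical difference: a binary pass/fail gate gives an agent nothing to climb towards, whereas a fraction gives it a gradient - 0.75 is visibly better than 0.5, even while neither is ‘done’.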
But here’s where it gets really clever. They built what they call a Digital Twin Universe (DTU) - behavioural clones of third-party services. We’re talking full replicas of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, complete with APIs, edge cases, and observable behaviours. These were generated by having coding agents analyse public API documentation and build self-contained Go binaries imitating each service.
Why? Because you can test at volumes exceeding production limits. You can simulate failure modes that are impossible to trigger against live services. You can run thousands of scenarios hourly without rate limits or API costs. This is how you can have engineers spending $1,000 per day each on AI tokens without hitting limits on other services.
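To make the DTU idea concrete, here’s a toy in-process stand-in for a Jira-like service with an injectable failure mode. This is nothing like StrongDM’s actual self-contained Go binaries - just an illustration of why a local clone lets you trigger failures you never could against the real thing:

```python
import itertools

class FakeIssueTracker:
    """Toy behavioural clone of a Jira-like issue API (illustrative only).

    Because it runs locally, you can hammer it at any volume and inject
    outages mid-run - things a live third-party service won't let you do.
    """

    def __init__(self, fail_after=None):
        self._ids = itertools.count(1)
        self._issues = {}
        self._calls = 0
        self._fail_after = fail_after  # simulate an upstream outage after N calls

    def _tick(self):
        self._calls += 1
        if self._fail_after is not None and self._calls > self._fail_after:
            raise ConnectionError("simulated 503 from upstream")

    def create_issue(self, title: str) -> int:
        self._tick()
        issue_id = next(self._ids)
        self._issues[issue_id] = {"title": title, "status": "open"}
        return issue_id

    def get_issue(self, issue_id: int) -> dict:
        self._tick()
        return self._issues[issue_id]
```

A test harness can then assert that the agent-written code degrades gracefully when the ‘outage’ hits - a scenario you simply cannot rehearse against the real Jira.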
They’ve also released some interesting tooling off the back of this. Attractor is a non-interactive coding agent where - would you believe it - the GitHub repo contains only specification markdown files. The code is generated from the specs.
Firstly, I don’t think we’re at the point where AI will replace humans entirely in software development any time soon. The people behind these projects are hyper-capable engineers who could build these systems without AI: they know what good looks like, they have experience specifying what AI should do, and - in Anthropic’s case - they had an entire raft of prior implementations and ways to test every single aspect of what they were building. Most of us don’t have that luxury, because we’re building something new rather than rebuilding something that already exists.
However, these two stories do point in the same direction: we’re moving from AI-assisted development to AI-led development. The human role is shifting from writing code to designing systems, writing specifications, building test infrastructure, and orchestrating agents.
A few things stand out to me - not least the security angle, which I’m deliberately parking for now. There is a lot to unpack there (autonomous agents writing and deploying code with no human review... I can already feel my next few posts writing themselves). But for this week, I just wanted to sit with the sheer ambition of what’s being attempted here. It’s remarkable.
For now, that is everything folks - catch you next week!
