
This week I’m stepping away from the security lens (don’t worry - I’ll be back) to talk about something that has been living rent free in my head for the last few weeks. We’re going to look at what’s happening at the absolute bleeding edge of AI-powered software development, paying particular attention to two stories that I think signal where all of this is heading.
We’re talking about what are being referred to as ‘AI Software Factories’ - systems where coding agents don’t just help developers write code more quickly, but take on the entire development role themselves: writing code, testing it, deploying it, in some cases with minimal or even zero human involvement.
So, the first story is certainly the more famous of the two, as it came from Anthropic themselves. It has been covered from various angles already, mostly focused on what they built and how good (or bad) it is. I’m less bothered about what they built, and more about how they did it.
Nicholas Carlini - one of Anthropic’s researchers - tasked 16 instances of Claude Opus 4.6 with autonomously building a C compiler, written in Rust, capable of compiling the Linux kernel…from scratch.
For those unfamiliar with compilers, let’s just say that is a BIG and very complex task - one that took humans years (albeit with less focus, and a long time ago). For the purposes of this blog, what we need to understand is that this was a far larger project, with far more moving parts, dependencies and complexities, than just about any documented AI-driven project so far.
Over two weeks and roughly 2,000 Claude Code sessions (costing around $20,000), these agents produced a 100,000-line compiler supporting x86, ARM, and RISC-V architectures. It can build a bootable Linux 6.9, compile QEMU, FFmpeg, SQLite, PostgreSQL, Redis - and yes, it runs Doom.
But what fascinates me here isn’t just the output - it’s how they got there. Carlini built a harness that places Claude in an infinite loop where agents autonomously pick up tasks and crack on without human intervention. When one task finishes, the next one starts. Multiple agents work on the same codebase simultaneously through a Docker-based setup where each agent clones the repo locally, pushes changes upstream, and a simple file-locking mechanism in ‘current_tasks/’ stops agents from treading on each other’s toes.
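The locking idea is simple enough to sketch. Here’s a minimal Python illustration of lock-file-per-task claiming - the function names and the lock-file naming are my own assumptions, not Carlini’s actual harness code:

```python
import os
from pathlib import Path

# Hypothetical lock directory, per the 'current_tasks/' convention in the write-up
LOCK_DIR = Path("current_tasks")

def try_claim(task_id: str, agent_id: str) -> bool:
    """Attempt to claim a task by atomically creating a lock file.

    O_CREAT | O_EXCL makes the create fail if another agent got there
    first, so only one agent can hold a given task at a time.
    """
    LOCK_DIR.mkdir(exist_ok=True)
    lock_path = LOCK_DIR / f"{task_id}.lock"
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False  # another agent already owns this task
    with os.fdopen(fd, "w") as f:
        f.write(agent_id)  # record who holds the lock, for debugging
    return True

def release(task_id: str) -> None:
    """Drop the lock so the task (or its follow-ups) can be picked up again."""
    (LOCK_DIR / f"{task_id}.lock").unlink(missing_ok=True)
```

The atomicity of `O_CREAT | O_EXCL` is what makes this safe even with many agents polling the same directory - no coordination service required.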
Now, as with all things AI software factory, the approach leans heavily on test-driven development and requires those tests to be bulletproof, as this is largely how we can get AI to build ‘serious’ software (more to come on this later). When agents are working entirely autonomously, your definition of success - the tests that must pass - needs to be nearly perfect, because the agents will solve problems exactly as specified.
Some other titbits: they deliberately avoided printing excessive output to the agents, instead logging to files that Claude could query when needed. Context management is everything when you’re running agents at this scale; one approach is breaking the problem into small pieces, tracking what each agent is working on, and pointing agents (with their limited context windows) at those smaller pieces of work. They also ran several agents in parallel as specialised teams (one writing code, one keeping an eye on code quality, one testing, and so on), coordinated by that nifty task-locking system.
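The log-to-file pattern is worth making concrete. A rough Python sketch of the idea - command output goes to disk rather than into the agent’s context, and the agent pulls back only the lines it asks for (the function names here are mine, not Anthropic’s):

```python
import subprocess
from pathlib import Path

LOG_DIR = Path("logs")  # assumed location for build/test logs

def run_logged(name: str, cmd: list[str]) -> Path:
    """Run a build or test command, writing all output to a log file
    instead of dumping thousands of lines into the agent's context."""
    LOG_DIR.mkdir(exist_ok=True)
    log_path = LOG_DIR / f"{name}.log"
    with log_path.open("w") as log:
        subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    return log_path

def query_log(log_path: Path, needle: str, context: int = 2) -> list[str]:
    """Let the agent retrieve just the region around the first match
    (e.g. the first compiler error), with a few lines of context."""
    lines = log_path.read_text().splitlines()
    for i, line in enumerate(lines):
        if needle in line:
            return lines[max(0, i - context): i + context + 1]
    return []
```

The point is token economics: a failing kernel build might emit megabytes of output, but the agent only ever pays context for the slice it queries.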
If the Anthropic story sounds a bit crazy, this one is borderline radical. StrongDM is a product company building a proxy that manages and audits access to databases, servers and Kubernetes through a control plane. They were founded in 2025 around a single philosophy: ‘code must not be written or reviewed by humans’.
The founders had seen what was possible in late 2024 using Claude 3.5, and by December of that year they had started to lean into Cursor’s ‘YOLO’ mode. In 2025 they set out to prove that it was possible to build ‘serious’ products without a human ever even interacting with the code.
Two things to pick up on here. Firstly, humans not writing code is becoming pretty common across software development right now with the proliferation of agentic coding, but not reviewing the code before it goes live is a very different and much scarier thing. Secondly, their goal here was to make ‘serious’ software. Whilst they don’t define the term themselves, the point is this: using AI to whip up an MVP that gets your idea across is one thing, but there is an entire chasm between that and using it to produce production-ready, secure, scalable code.
So, how do they get work done if no one is even looking at the code? In their words: ‘we built a Software Factory: non-interactive development where specs + scenarios drive agents that write code, run harnesses, and converge without human review’. More specifically, they use a ‘scenario’ to represent an end-to-end user story, stored outside the codebase so the agent can’t see it and reward-hack its way to success.
They also moved ‘success’ away from ‘has this test passed?’ to a ‘satisfaction-based’ measure: of all observed trajectories through all scenarios, what fraction likely satisfy the user?
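In code, that shift is tiny but meaningful. A minimal sketch, assuming each trajectory has already been judged satisfied or not by something upstream (a spec check or a model-graded judge) - the `Trajectory` type here is hypothetical, not StrongDM’s:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One observed run through a scenario, with an upstream verdict."""
    scenario: str
    satisfied: bool  # verdict from a judge - assumed to exist already

def satisfaction_score(trajectories: list[Trajectory]) -> float:
    """Fraction of observed trajectories, across all scenarios, that
    likely satisfy the user - a graded signal rather than a binary
    'all tests passed'."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)
```

The practical difference: a binary pass/fail gate gives an agent nothing to climb towards, whereas a fraction gives it a gradient - 0.75 is visibly better than 0.5, even while neither is ‘done’.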
But here’s where it gets really clever. They built what they call a Digital Twin Universe (DTU) - behavioural clones of third-party services. We’re talking full replicas of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets, complete with APIs, edge cases, and observable behaviours. These were generated by having coding agents analyse public API documentation and build self-contained Go binaries imitating each service.
Why? Because you can test at volumes exceeding production limits. You can simulate failure modes that are impossible to trigger against live services. You can run thousands of scenarios hourly without rate limits or API costs. This is how you can have engineers spending $1,000 per day each on AI tokens without hitting limits on other services.
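To make the DTU idea concrete, here’s a toy in-process stand-in for a Jira-like service with an injectable failure mode. This is nothing like StrongDM’s actual self-contained Go binaries - just an illustration of why a local clone lets you trigger failures you never could against the real thing:

```python
import itertools

class FakeIssueTracker:
    """Toy behavioural clone of a Jira-like issue API (illustrative only).

    Because it runs locally, you can hammer it at any volume and inject
    outages mid-run - things a live third-party service won't let you do.
    """

    def __init__(self, fail_after=None):
        self._ids = itertools.count(1)
        self._issues = {}
        self._calls = 0
        self._fail_after = fail_after  # simulate an upstream outage after N calls

    def _tick(self):
        self._calls += 1
        if self._fail_after is not None and self._calls > self._fail_after:
            raise ConnectionError("simulated 503 from upstream")

    def create_issue(self, title: str) -> int:
        self._tick()
        issue_id = next(self._ids)
        self._issues[issue_id] = {"title": title, "status": "open"}
        return issue_id

    def get_issue(self, issue_id: int) -> dict:
        self._tick()
        return self._issues[issue_id]
```

A test harness can then assert that the agent-written code degrades gracefully when the ‘outage’ hits - a scenario you simply cannot rehearse against the real Jira.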
They’ve also released some interesting tooling off the back of this. Attractor is a non-interactive coding agent where - would you believe it - the GitHub repo contains only specification markdown files. The code is generated from the specs.
Firstly, I don’t think we’re at the point where AI will replace humans entirely in software development any time soon. The people behind these projects are hyper-capable engineers who could build these systems without AI: they know what good looks like, they have experience specifying what AI should do, and - in Anthropic’s case - they had an entire raft of prior implementations and ways to test every single aspect of what they were building. Most of us don’t have that luxury, because we’re building something new rather than rebuilding something that already exists.
However, these two stories do point in the same direction: we’re moving from AI-assisted development to AI-led development. The human role is shifting from writing code to designing systems, writing specifications, building test infrastructure, and orchestrating agents.
A few things stand out to me - not least the security angle, which I’m deliberately parking for now. There is a lot to unpack there (autonomous agents writing and deploying code with no human review... I can already feel my next few posts writing themselves). But for this week, I just wanted to sit with the sheer ambition of what’s being attempted here. It’s remarkable.
For now, that is everything folks - catch you next week!
