Dark Factory

Pattern

A reusable solution you can apply to your work.

A Dark Factory is a software operating model in which coding agents write, test, and ship production code with no human writing or reviewing the code itself; humans set the goals, scenarios, and constraints and let the factory run.

“Code must not be written by humans. Code must not be reviewed by humans.” — StrongDM Engineering, public manifesto (2026)

Also known as: Software Factory, Lights-Out Coding, Level 4 / Level 5 Agentic Development

Understand This First

  • Bounded Autonomy – the governance model at the opposite end of the spectrum; Dark Factory is what bounded autonomy looks like when every tier is set to “act without asking.”
  • Harness (Agentic) – a mature harness is the substrate a Dark Factory runs on.
  • Verification Loop – without a tight, reliable verification loop, a Dark Factory ships defects at speed.
  • AgentOps – production monitoring replaces human code review as the primary feedback signal.

Context

The term borrows from manufacturing. A “dark factory” is a production facility that runs without human workers on the floor: the lights stay off because the robots do not need them. Dan Shapiro coined the software version to name an operating model that was, until 2026, mostly theoretical. StrongDM’s engineering team made it concrete by publishing a manifesto with two rules: code is not written by humans, and code is not reviewed by humans. Humans set the intent, describe the scenarios the system must handle, and define the constraints. Everything from the first line of code to the production deploy happens between agents.

This sits at the agentic and operational level. It is not a coding technique. It is a claim about where the human belongs in the software lifecycle: outside the code, at the specification and governance layer. Dark Factory names the far end of a spectrum whose other end is the traditional workflow where a human writes every character and reviews every change.

Practitioners have converged on a rough five-level ladder to describe positions along this spectrum:

  1. Human-written, human-reviewed. Autocomplete at most.
  2. Agent-assisted authoring. The agent drafts; a human reviews every line.
  3. Agent-authored, human-reviewed. The agent writes whole features; a human reads the diff.
  4. Agent-authored, agent-reviewed, human spot-checks. A human still looks, but only at flagged changes.
  5. Dark Factory. No human writes or reviews code. Humans work only at the specification, scenario, and policy layer.

Level 5 is where “Dark Factory” strictly applies. Level 4 is the common preparatory state.
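The ladder above can be sketched as a small policy table. This is a hypothetical encoding for illustration; the enum names and the `human_reads_code` helper are not part of any published framework.

```python
from enum import IntEnum

class AutonomyLevel(IntEnum):
    """Five-level ladder from fully human to Dark Factory (Level 5)."""
    HUMAN_WRITTEN = 1   # human writes, human reviews; autocomplete at most
    AGENT_ASSISTED = 2  # agent drafts, human reviews every line
    AGENT_AUTHORED = 3  # agent writes whole features, human reads the diff
    AGENT_REVIEWED = 4  # agent reviews too; human spot-checks flagged changes
    DARK_FACTORY = 5    # no human writes or reviews code

def human_reads_code(level: AutonomyLevel) -> bool:
    """A human still looks at diffs at every level below the Dark Factory."""
    return level < AutonomyLevel.DARK_FACTORY
```

Note that the predicate flips only at Level 5: Level 4 still has a human in the code-review loop, which is why it is the common preparatory state rather than a Dark Factory proper.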

Problem

As agents become capable enough to write entire features end to end, human code review becomes the bottleneck. A team whose agents write code in minutes can spend hours waiting for a reviewer, and reviewer attention degrades sharply as diffs grow. At the same time, reviewing agent-authored code well is genuinely hard: the patterns are unfamiliar, the volume is relentless, and the signal that a line deserves a pause is weaker than for human-authored code.

You are left with a choice. Either the human stays in the loop and accepts that review is now the constraint on delivery, or you take the human out of code-level review and redesign everything else in the lifecycle to make that safe. Dark Factory is the second choice, taken seriously.

Forces

  • Review cost scales with code volume, not code value. When agents generate 100x more code, line-by-line review becomes uneconomic long before it becomes impossible.
  • Humans review agent-authored code worse than they think. Diffs look plausible, explanations sound confident, and attention fades. The signal-to-noise ratio for human reviewers is collapsing just as the volume rises.
  • Specifications and scenarios scale with product complexity, not code size. You can write a specification for a billing system once and have it survive many refactors. You cannot review every refactor.
  • Preconditions are exacting. A Dark Factory needs codified intent, a strong test oracle, a mature harness, reliable simulation environments, and production telemetry that catches what tests miss. Miss any of these and the factory ships defects at industrial scale.
  • Accountability does not disappear. Regulators, customers, and the team’s own conscience all still need someone to answer for what the system does. The human moves, but does not leave.

Solution

Redesign the software lifecycle so that humans work at the layer above code, and the factory between their specifications and the production system runs without human hands on the keyboard. Three moves make this work:

Move the human up one level. Humans stop writing and reviewing code. They write and review specifications, scenarios, constraints, and production policies. The artifacts that used to be informal (user stories, acceptance criteria) become first-class inputs that agents can read, execute, and regenerate code from. The artifacts that used to be secondary (tests, invariants, performance budgets) become the primary contract.

Replace human review with stacked automated checks. The code review that a human used to do is decomposed and redistributed across the pipeline. Agents generate code against a specification. A second agent critiques it against the same specification. Property-based tests, simulation runs, and scenario replays exercise it far beyond what hand-written unit tests ever did. Static analysis, security scanners, and Architecture Fitness Functions enforce constraints the specification cannot capture. Production traffic runs through canary deploys and feature flags so the real world becomes the final review surface, with automatic rollback when domain metrics move the wrong way.

Treat production telemetry as the primary feedback sensor. Because no human reads the diff, the system needs to know quickly and precisely when the deployed behavior diverges from the specification. AgentOps dashboards, domain-oriented metrics, and error budgets become the governance layer. A Dark Factory that cannot detect its own regressions is not a factory; it is a defect machine.
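A minimal version of "telemetry as the primary sensor" is an automatic rollback trigger: compare a domain metric under the canary against its pre-deploy baseline and roll back without asking anyone when the regression exceeds the error budget. The metric name, fields, and thresholds here are assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class MetricWindow:
    name: str
    baseline: float  # value observed before the deploy
    current: float   # value observed under the canary
    budget: float    # max tolerated relative regression (0.5 = 50%)

def should_rollback(m: MetricWindow) -> bool:
    """Roll back when the canary regresses past its error budget."""
    if m.baseline == 0:
        return m.current > 0  # any errors where there were none
    regression = (m.current - m.baseline) / m.baseline
    return regression > m.budget
```

For example, a checkout error rate jumping from 0.2% to 0.9% against a 50% budget trips the rollback; a drift to 0.21% does not. The important property is that the decision is mechanical, because no human is reading the diff that caused it.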

The payoff is real: a small team can ship a large surface area, because the only human-time-bounded work left is specifying and supervising. The cost is equally real: the preconditions are expensive, and the failure mode is delivering broken software faster than you can catch it.

Warning

Do not try to run at Level 5 on a codebase that cannot be tested well. A Dark Factory inherits the quality of its test oracle. If your tests let bad code pass today, a Dark Factory will ship bad code a hundred times faster tomorrow. Harden the oracle before removing the reviewer.

How It Plays Out

A small infrastructure startup decides to run its internal tools as a Dark Factory. They invest two months up front in a specification system: every feature begins life as a markdown brief with acceptance scenarios written in a structured format. Agents consume the brief, generate the service, a second agent critiques it against the brief, a test suite validates behavior, and the change lands behind a feature flag. A human PM writes briefs; a human SRE watches production dashboards; no engineer reviews a diff. Over six months the team ships ten times the feature volume of a comparable team running Level 3. Their first incident arrives when an agent interprets an ambiguous scenario as “silent retry on failure” and the team watches a bill triple overnight before the alert fires. They codify the missing constraint as an invariant, add a cost-per-request fitness function, and keep running.

A financial services firm tries the same approach for a customer-facing billing service and aborts after three weeks. Regulatory requirements mandate human sign-off on any change touching customer funds. The team can get to Level 4 inside the firm’s walls, but Level 5 is legally out of reach on that surface. They reclassify: internal tools run as a Dark Factory; the billing service runs at Level 3 with full human review. The framework accommodates the split because the governance tier is a property of the code path, not the team.
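The split the firm lands on, with the governance tier attached to the code path rather than the team, can be expressed as a simple routing table. The paths, tiers, and longest-prefix rule below are invented for illustration, not taken from any real policy system.

```python
# Hypothetical mapping from code path to governance tier.
# Tier 5 = Dark Factory (no human review); tier 3 = human reads every diff.
TIER_BY_PATH = {
    "internal/tools/": 5,
    "services/billing/": 3,  # regulator mandates human sign-off here
}

DEFAULT_TIER = 3  # unknown surfaces fall back to full human review

def tier_for(path: str) -> int:
    """Longest-prefix match decides the tier for a changed file."""
    best, best_len = DEFAULT_TIER, -1
    for prefix, tier in TIER_BY_PATH.items():
        if path.startswith(prefix) and len(prefix) > best_len:
            best, best_len = tier, len(prefix)
    return best
```

Defaulting unknown paths to the most conservative tier matches the spirit of the story: a surface earns Level 5 explicitly, it never drifts into it.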

A sole developer experiments with a weekend project. He writes a short specification, points an agent at it, and walks away. The agent produces three iterations, each one complete and self-tested, each one subtly wrong in a way his specification failed to pin down. He realizes the specification, not the code, is where the real work lives. He spends the rest of the weekend rewriting the specification rather than the code, and the fourth iteration works. He has, in miniature, learned the central discipline of a Dark Factory: the artifact you maintain is not the code.

Consequences

A working Dark Factory collapses the lead time between “we want this” and “it is in production.” Small teams become capable of surface areas that used to require large ones. The human workload shifts from mechanical translation (requirement → code) to creative and governance work (what should we build, how will we know if it is right, what must never be true).

The costs are unforgiving. The preconditions are expensive: a mature harness, codified specifications, a strong test oracle, reliable simulation, production telemetry rich enough to catch silent failures, and an organization culturally prepared to trust automated verification over human judgment. Each of these takes months to build and can be undermined in a single bad quarter. Teams that try to run a Dark Factory on top of a weak oracle discover that the factory ships their quality problems at full speed.

There is also a trust and accountability dimension that tooling does not solve. Stanford’s CodeX center framed the question sharply: “Built by agents, tested by agents, trusted by whom?” When something goes wrong in a Dark Factory, the humans responsible cannot appeal to “the engineer who wrote this had a reason.” Ownership attaches to the specification author, the governance layer, and the production operator, in ways most organizations have not yet worked out. Regulators, auditors, and customers are still catching up to what this means, and the legal precedent is thin.

Finally, there is a skills question. A team that runs at Level 5 for a year does not produce engineers who can debug code; it produces engineers who can debug specifications and systems. That is probably the right skill for the long run, but the transition is real, and a team that cannot drop back to Level 3 during an outage is fragile in a way that a traditional team is not.

  • Contrasts with: Bounded Autonomy – bounded autonomy graduates human oversight across tiers; Dark Factory removes the code-level tier entirely.
  • Contrasts with: Human in the Loop – HITL keeps a person inside the control structure; Dark Factory moves that person one layer up, out of code review.
  • Contrasts with: Approval Policy – approval policies gate code-level actions; Dark Factory policies gate specification- and deployment-level actions instead.
  • Depends on: Harness (Agentic) – the harness is the machinery the factory runs on.
  • Depends on: Verification Loop – the verification loop is what replaces human review at the code layer.
  • Depends on: Generator-Evaluator – one agent writes, another critiques; the evaluator stands in for the missing human reviewer.
  • Depends on: Test Oracle – the factory inherits the quality of its oracle; a weak oracle makes Dark Factory dangerous.
  • Depends on: AgentOps – production telemetry becomes the primary feedback signal when the diff is no longer read.
  • Depends on: Architecture Fitness Function – fitness functions enforce the architectural constraints that specifications cannot capture.
  • Related: Agent Teams – a Dark Factory is typically implemented as a team of specialized agents rather than a single monolith.
  • Related: Subagent – subagents handle the sub-tasks (generation, critique, verification) that together replace human review.
  • Related: Steering Loop – the steering loop still operates, but the steering inputs come from specifications and telemetry, not code review.
  • Risks: AI Smell – at industrial scale, subtle smells compound into systemic failures.
  • Risks: Shadow Agent – unregistered agents inside a Dark Factory undermine the whole governance story.
  • Risks: Prompt Injection – without a human reviewer in the loop, injected instructions that reach the specification or the agent pipeline are harder to catch.
  • Risks: Approval Fatigue – the inverted failure mode: a Dark Factory removes approvals entirely, and teams that run one on a weak oracle discover that removing approvals is not the same as not needing them.

Sources

Dan Shapiro coined the “Dark Factory” framing for agent-driven software development, drawing on the existing industrial term for lights-out manufacturing facilities. The manufacturing analogy is older than the software use, but Shapiro’s application to coding is the lineage most subsequent writers cite.

StrongDM’s public engineering manifesto is the most concrete reference implementation: two explicit rules (“Code must not be written by humans,” “Code must not be reviewed by humans”), a description of a “digital twin universe” for scenario simulation, and named sub-patterns (Gene Transfusion, Semports, Pyramid Summaries) for the specification and testing layers. Their team’s willingness to publish the rules in enforceable form is what made the concept concrete enough for others to argue about.

Stanford Law School’s CodeX center raised the durable question that every Dark Factory adopter eventually has to answer: “Built by agents, tested by agents, trusted by whom?” Their February 2026 analysis is the clearest statement of the accountability gap that tooling alone cannot close, and it shapes the Consequences discussion above.

The five-level framework for positioning teams along the human-to-agent spectrum emerged from the agentic coding practitioner community in early 2026, with multiple independent writers converging on the same ladder structure. It is not attributable to a single author; by April 2026 the levels had become common vocabulary across newsletters, conference talks, and team internal documents.