Backfill

Pattern

A named solution to a recurring problem.

Populate a new field, marker, schema, or annotation across an existing corpus so that records created before the requirement existed conform to it, without silently corrupting the records you’re filling.

Also known as: Historical Backfill, Data Backfill, Retroactive Population

Understand This First

  • Sweep — the closest sibling, and often the mechanism that carries a backfill across files; the two have different decision rules and different failure modes.
  • Parallel Change — backfill is the middle phase of its expand-migrate-contract sequence.
  • Blast Radius — a backfill touches every record by definition, which is why it needs its own discipline.

Context

You add a column, and every row written before today is missing it. You introduce a convention, and two hundred existing files don’t follow it. You decide every article needs a type marker, and the ones you already shipped have none. The new requirement is easy to honor going forward. The work is in the records that already exist.

This is older than software. Census takers backfill missing entries; archivists backfill catalog metadata. In code, the canonical case is the database: you add a column, write to it from now on, then walk the historical rows and fill the new value. That middle step is the backfill, and it’s the part of Parallel Change where most of the risk lives.

Backfill is a brownfield discipline. A greenfield project has nothing behind it to fill; the first time you reach for a backfill is the moment a real corpus has accumulated under an old shape and you’ve decided to change the shape. The question is never whether the new records will conform. It’s whether you can make the old ones conform without breaking them.

Problem

A new field, marker, or annotation must exist on every record, but only the records created from now on get it for free. The existing corpus, which may be a hundred rows or a hundred million, has a gap where the new value should be. You have to fill that gap across records you didn’t write, often without fully understanding each one, while the corpus keeps being read and sometimes written.

The trap is that a backfill looks like a refactor and isn’t. A refactor changes how code is expressed and leaves the data alone, so a passing test suite is good evidence it worked. A backfill changes the data. Tests against your code can be green while the values you wrote are wrong, and nothing tells you until a reader downstream trips over a record you filled badly. How do you populate a new value across an entire corpus, correctly, idempotently, and reversibly, when the only proof that scales is the corpus itself?

Forces

  • The new value may be a clean function of the old record, or it may require reading and interpreting each record. Picking the wrong mechanism either wastes effort or silently produces garbage.
  • The corpus is often live. Records are being created and updated while you fill, so the target is moving under you.
  • Correctness can’t be eyeballed at scale. A sample looks clean; the long tail hides the edge case that breaks a downstream consumer.
  • Reversibility costs storage. Undoing a backfill means knowing each record’s pre-backfill value, which you have only if you saved it.
  • Having an agent read and interpret every record is the most capable mechanism and the most expensive: it costs real money and real time, and both scale with the corpus.

Solution

Write the target shape down, enumerate the gap, hand-sample before you go wide, then fill in idempotent checkpointed batches with invariant checks on each side. The backfill is done when the corpus passes its own sampling and invariant checks, not when the last record is touched.

Start by choosing the mechanism, because it sets everything downstream. There are three modes, and a decision rule that picks among them.

Deterministic backfill. The new value is a pure function of the old record: amount_cents = round(amount * 100), slug = slugify(filename), region = lookup(import_path). A single SQL UPDATE, a codemod, or a short script fills the whole corpus, and a second run is a no-op. Use this whenever the function exists. It’s the cheapest mode and the easiest to verify, because you can check the function on a sample and trust it everywhere the function’s assumptions hold.
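A minimal sketch of the deterministic mode, using the amount_cents example above. The table name payments and the in-memory SQLite database are assumptions for illustration. The WHERE clause restricts the write to unfilled rows, which is what makes the second run a no-op.

```python
import sqlite3

def backfill_amount_cents(conn: sqlite3.Connection) -> int:
    """Fill amount_cents only where it is missing; return the rows changed."""
    cur = conn.execute(
        "UPDATE payments "
        "SET amount_cents = CAST(ROUND(amount * 100) AS INTEGER) "
        "WHERE amount_cents IS NULL"  # skip rows that already carry a value
    )
    conn.commit()
    return cur.rowcount

# Hypothetical table: two rows predate the new column, one was dual-written.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments ("
    "id INTEGER PRIMARY KEY, amount REAL, amount_cents INTEGER)"
)
conn.executemany(
    "INSERT INTO payments (amount, amount_cents) VALUES (?, ?)",
    [(19.99, None), (5.00, None), (0.10, 10)],
)

first_run = backfill_amount_cents(conn)   # fills the two missing rows
second_run = backfill_amount_cents(conn)  # finds nothing left to fill
```

Comparing first_run to the gap you counted beforehand doubles as a cheap cardinality check.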

Manual backfill. The corpus is small enough to fill by hand in one sitting, and no clean function exists. A few dozen config files, a hundred catalog entries. Don’t build machinery for a corpus you could finish before the machinery is written.

Reasoning backfill. Neither holds: the new value isn’t a function of the old record alone, and the corpus is too large to hand-fill. The agent reads each record, infers the correct value, and writes it. This is the mode agentic coding adds. Deciding which type marker an article carries, choosing a doc-comment that matches a function’s actual contract, writing a regression test from a legacy code path’s observed behavior: these need interpretation per record, which used to mean a human did it or it didn’t get done. An agent makes the corpus tractable. It also makes it possible to be confidently wrong at scale, which is why the discipline below is non-negotiable for this mode.
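The decision rule among the three modes can be written down as a small function. This is a sketch, not a prescription: the names are hypothetical, and the hand_fill_limit threshold of 100 is an assumption taken loosely from the "a hundred catalog entries" example. Pick whatever you could honestly finish in one sitting.

```python
from enum import Enum

class Mode(Enum):
    DETERMINISTIC = "deterministic"
    MANUAL = "manual"
    REASONING = "reasoning"

def choose_mode(has_pure_function: bool, record_count: int,
                hand_fill_limit: int = 100) -> Mode:
    """Pick a backfill mode; hand_fill_limit is an assumed threshold."""
    if has_pure_function:
        # The new value is a function of the old record: cheapest, most verifiable.
        return Mode.DETERMINISTIC
    if record_count <= hand_fill_limit:
        # No function, but small enough to fill by hand in one sitting.
        return Mode.MANUAL
    # No function and too large to hand-fill: an agent reads each record.
    return Mode.REASONING
```

For example, choose_mode(True, 4_200_000) selects the deterministic mode regardless of corpus size, while choose_mode(False, 230) falls through to the reasoning mode.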

The discipline, in order:

  1. Write the schema down. State exactly what the new field, marker, or annotation must look like before you fill a single record. A backfill against an unwritten spec drifts the way an under-specified sweep drifts.
  2. Enumerate the gap. Count the records missing the new value before you start. That count is your denominator: it tells you when you’re done and catches a query that found the wrong set.
  3. Hand-sample a stratified slice. Pull the oldest records, the newest, one of each type, and the ones that already carry a partial value. Verify the rule or the agent’s judgment on that slice by hand. The sample is where you find out the rule is wrong while it’s still cheap to fix.
  4. Batch and checkpoint. Fill a small batch, checkpoint, fill the next. The checkpoint before each batch (a git commit for a file corpus, a snapshot or saved column for a table) is your undo. For a backfill, undoing means restoring the value you overwrote, so the old value has to be recoverable from history, a snapshot, or a saved column.
  5. Stay idempotent. Running the backfill twice produces the same corpus as running it once. A non-idempotent backfill that double-applies on resume corrupts exactly the records you were trying to fix.
  6. Check invariants on both sides. Before and after, assert what must hold: cardinality (the count of filled records matches the gap you enumerated), distribution (no value is wildly over-represented), and no-regression (no record was made worse). These checks run against the corpus, not against your code.
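The steps above can be sketched over an in-memory corpus. Everything here is illustrative: the record shape, the slug field, and the slug_from_filename helper are assumptions, and a real run would persist each checkpoint (a commit or snapshot) rather than keep it in a list.

```python
from typing import Callable

def enumerate_gap(corpus: list, field: str) -> list:
    """Return ids of records still missing the field: the denominator."""
    return [r["id"] for r in corpus if r.get(field) is None]

def backfill(corpus: list, field: str,
             derive: Callable[[dict], object], batch_size: int = 2) -> dict:
    gap = enumerate_gap(corpus, field)
    by_id = {r["id"]: r for r in corpus}
    checkpoints = []  # ids touched per batch, recorded before writing
    filled = 0
    for start in range(0, len(gap), batch_size):
        batch = gap[start:start + batch_size]
        checkpoints.append(batch)
        for rid in batch:
            rec = by_id[rid]
            if rec.get(field) is None:  # idempotent guard: never overwrite
                rec[field] = derive(rec)
                filled += 1
    # Cardinality invariant, checked against the corpus rather than the code.
    remaining = enumerate_gap(corpus, field)
    if filled != len(gap) or remaining:
        raise RuntimeError(f"filled {filled} of {len(gap)}; still missing {remaining}")
    return {"gap": len(gap), "filled": filled, "batches": len(checkpoints)}

def slug_from_filename(rec: dict) -> str:
    """Hypothetical pure function: 'Parallel Change.md' -> 'parallel-change'."""
    return rec["filename"].rsplit(".", 1)[0].lower().replace(" ", "-")

corpus = [
    {"id": 1, "filename": "Blast Radius.md", "slug": "blast-radius"},
    {"id": 2, "filename": "Sweep.md", "slug": None},
    {"id": 3, "filename": "Parallel Change.md", "slug": None},
    {"id": 4, "filename": "Backfill.md", "slug": None},
]
first = backfill(corpus, "slug", slug_from_filename)
second = backfill(corpus, "slug", slug_from_filename)  # resume or re-run
```

The second call finds an empty gap and writes nothing, which is the resume behavior step 5 asks for.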

For an online backfill against a live corpus, wrap the whole thing in the expand-contract sequence of Parallel Change: add the new field, dual-write so new records stay correct while you fill, backfill the history, cut reads over to the new field, then drop the old one. The dual-write window is what keeps the moving target from racing you.
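The sequence can be walked through on a toy table. This is a sketch under heavy assumptions: a Python list stands in for the live table, and the function names are hypothetical. It shows only the ordering of the phases, not the locking or throttling a real online migration needs.

```python
# Rows written before the new field existed (the expand step added the column).
table = [{"id": 1, "amount": 19.99}, {"id": 2, "amount": 5.0}]

def dual_write(table: list, row_id: int, amount: float) -> None:
    """New records carry both fields, so they are born correct."""
    table.append({"id": row_id, "amount": amount,
                  "amount_cents": round(amount * 100)})

def backfill_history(table: list) -> int:
    """Fill only rows missing the new field; return how many were filled."""
    filled = 0
    for row in table:
        if row.get("amount_cents") is None:
            row["amount_cents"] = round(row["amount"] * 100)
            filled += 1
    return filled

def ready_to_cut_over(table: list) -> bool:
    """Reads may move to the new field only when no row is unfilled."""
    return all(row.get("amount_cents") is not None for row in table)

def contract(table: list) -> None:
    """Drop the old field once the new one is proven."""
    for row in table:
        row.pop("amount", None)

dual_write(table, 3, 7.25)            # traffic arrives during the fill
ready_before = ready_to_cut_over(table)
filled = backfill_history(table)      # history catches up
ready_after = ready_to_cut_over(table)
contract(table)
```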

How It Plays Out

A payments team adds an amount_cents integer column beside a legacy amount float that’s been causing rounding bugs. Application code already dual-writes both. The backfill is deterministic: amount_cents = round(amount * 100). They enumerate 4.2 million rows missing the new value, sample two hundred across date ranges, and run the fill in batches of fifty thousand with a checkpoint before each. An invariant check flags the 2% of rows where round(amount * 100) fails to match a separately stored ledger total to the penny, the signature of floating-point drift in the original data. Those rows route to an agent that reconciles each against the ledger of record rather than the corrupted float. Reads cut over only after the cardinality check confirms zero unfilled rows. The amount column is dropped a week later.
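The ledger check in that story might look like the sketch below. The ledger_cents field name and the sample rows are assumptions; the second row uses 1.005, a value whose nearest float multiplied by 100 lands just under 100.5 and so rounds down.

```python
def split_by_ledger(rows: list) -> tuple:
    """Separate rows whose derived cents match the ledger from those that drift."""
    clean, flagged = [], []
    for row in rows:
        derived = round(row["amount"] * 100)
        if derived == row["ledger_cents"]:
            clean.append(row["id"])
        else:
            flagged.append(row["id"])  # route to reconciliation, do not fill
    return clean, flagged

rows = [
    {"id": 1, "amount": 19.99, "ledger_cents": 1999},
    {"id": 2, "amount": 1.005, "ledger_cents": 101},  # float drift: derives 100
]
clean, flagged = split_by_ledger(rows)
```

Only the clean ids go through the deterministic fill. The flagged ids are held back for per-record reconciliation against the ledger.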

A documentation corpus of two hundred-plus articles needs a type marker on every entry. Is this one a pattern, an antipattern, or a concept? No function derives the answer from the file; it takes reading each article and judging what it actually does. The corpus is too large to hand-label in an afternoon. This is a reasoning backfill. The agent reads each article, proposes a marker, and the team hand-reviews a stratified sample (the oldest entries, the newest, and a few that sit on the pattern-versus-concept line) before letting it fill the rest in batches, each batch a reviewable commit.

Warning

A backfill that writes the wrong value while a dual-read still serves the old field will pass every test you have. The code is correct; the data is wrong; nothing reads the new field yet, so no one notices for weeks. Verify the filled values against the corpus directly before you cut reads over. “The tests pass” is not evidence that a backfill is correct.

A team adds regression tests to a five-year-old service that has almost none. They treat it as a test backfill: the agent reads each public function, observes its current behavior, and emits a characterization test that locks that behavior in. A human reviews each test before it merges, because a test that encodes a bug as expected behavior is worse than no test. The corpus of untested functions shrinks one reviewed batch at a time.
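A characterization test in miniature, as a sketch: legacy_shipping_fee is a made-up stand-in for an untested legacy function, and the inputs are chosen by hand. The recorded table captures what the code does today, not what it should do, which is why a human reads it before it becomes a test.

```python
def legacy_shipping_fee(weight_kg: int) -> int:
    """Hypothetical legacy function with no tests and no spec."""
    if weight_kg <= 0:
        return 0
    if weight_kg < 5:
        return 4
    return 4 + 2 * (weight_kg - 5)

def characterize(fn, inputs: list) -> dict:
    """Record the function's current output for each input."""
    return {x: fn(x) for x in inputs}

def drifted_inputs(fn, recorded: dict) -> list:
    """Return the inputs whose output no longer matches the recording."""
    return [x for x, expected in recorded.items() if fn(x) != expected]

recorded = characterize(legacy_shipping_fee, [0, 3, 5, 8])
drift = drifted_inputs(legacy_shipping_fee, recorded)
```

A reviewer looking at recorded would ask whether a 5 kg parcel costing the same as a 3 kg one is intended or a bug before locking it in.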

When It Fails

Silent data loss. The backfill writes a wrong value, a dual-read still serves the old one, and the error hides until something finally reads the new field. Fix: verify filled values against the corpus before cutting reads over, and keep the old field until the new one is proven.

Non-idempotent re-runs. A batch fails halfway, you restart, and records already filled get filled again: a counter incremented twice, a marker appended twice. Fix: make the fill a true upsert that’s a no-op on already-correct records, and test the resume path on purpose.
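The difference shows up in a few lines. This sketch assumes the marker lives on the first line of a text record; the "type: pattern" format is hypothetical.

```python
def add_marker_naive(text: str, marker: str) -> str:
    """Non-idempotent: every run prepends another copy of the marker."""
    return f"{marker}\n{text}"

def add_marker_idempotent(text: str, marker: str) -> str:
    """No-op when the first line already is the marker."""
    if text.splitlines()[:1] == [marker]:
        return text
    return f"{marker}\n{text}"

article = "Backfill\n\nPopulate a new field across an existing corpus."
marker = "type: pattern"

once = add_marker_idempotent(article, marker)
twice = add_marker_idempotent(once, marker)             # simulated resume
doubled = add_marker_naive(add_marker_naive(article, marker), marker)
```

Testing the resume path on purpose means running the fill twice on a fixture like this one and asserting the two results are equal.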

Racing a moving target. Records are created or updated while you backfill, so new rows land without the value or old rows change under you. Fix: dual-write through the fill so new records are born correct, and re-scan for stragglers after the main pass.

Cardinality drift. Post-backfill, a uniqueness or referential invariant breaks because two filled values collided or a foreign key now points at nothing. Fix: assert the invariant before and after, not just “no errors thrown.”
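Both invariants are cheap to assert directly against the filled records. A sketch, with hypothetical slug and parent_id fields standing in for whatever must be unique or must resolve:

```python
from collections import Counter

def duplicate_values(records: list, field: str) -> list:
    """Return filled values that appear on more than one record."""
    counts = Counter(r[field] for r in records if r.get(field) is not None)
    return sorted(v for v, n in counts.items() if n > 1)

def dangling_refs(records: list, fk_field: str, valid_ids: set) -> list:
    """Return ids of records whose reference points at nothing."""
    return sorted(
        r["id"] for r in records
        if r.get(fk_field) is not None and r[fk_field] not in valid_ids
    )

records = [
    {"id": 1, "slug": "sweep", "parent_id": None},
    {"id": 2, "slug": "backfill", "parent_id": 1},
    {"id": 3, "slug": "backfill", "parent_id": 9},  # collides and dangles
]
collisions = duplicate_values(records, "slug")
orphans = dangling_refs(records, "parent_id", {r["id"] for r in records})
```

Run the same two checks before the backfill as well, so a pre-existing violation is not blamed on the fill.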

Coverage holes. The sample looked clean, but an edge case in the long tail — a record type you didn’t stratify for — got filled wrong and broke a downstream consumer. Fix: stratify the sample across age and type, and treat any invariant-check failure as a signal to widen the sample, not to suppress the check.
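One way to build a sample stratified across age and type is sketched below. The created, kind, and partial fields are assumptions about the record shape, and per_stratum is a knob you would raise for a large corpus.

```python
def stratified_sample(records: list, per_stratum: int = 1) -> list:
    """Pick the oldest, the newest, one of each kind, and every partial record."""
    by_age = sorted(records, key=lambda r: r["created"])
    picked = {}
    for r in by_age[:per_stratum] + by_age[-per_stratum:]:
        picked[r["id"]] = r                      # age extremes
    seen_kinds = set()
    for r in by_age:
        if r["kind"] not in seen_kinds:          # first record of each kind
            seen_kinds.add(r["kind"])
            picked[r["id"]] = r
    for r in records:
        if r.get("partial"):                     # already carries a partial value
            picked[r["id"]] = r
    return sorted(picked)

records = [
    {"id": 1, "created": "2019-03-01", "kind": "a"},
    {"id": 2, "created": "2020-06-12", "kind": "a"},
    {"id": 3, "created": "2021-01-20", "kind": "b"},
    {"id": 4, "created": "2022-09-05", "kind": "a", "partial": True},
    {"id": 5, "created": "2023-02-14", "kind": "c"},
    {"id": 6, "created": "2024-11-30", "kind": "a"},
]
sample_ids = stratified_sample(records)
```

When an invariant check fails on a record kind the sample never covered, add that kind as a stratum and sample again.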

Consequences

Benefits. The corpus ends uniform: every record honors the new requirement, not just the ones written after it. Downstream readers and future agents see one shape instead of a then-and-now split. The reasoning mode makes corpora tractable that used to be hand-labor or never-done: retroactive typing, metadata enrichment, test coverage on legacy code. The batch-and-checkpoint discipline turns one irreversible mass write into a sequence of reversible ones.

Liabilities. A reasoning backfill over a large corpus costs real agent time and real money, and the cost scales with the corpus. The hand-sampling step adds clock time and can’t be skipped without giving up the one check that catches a wrong rule early. The dual-write window is an operational tax for as long as the backfill runs. And reversibility isn’t free: recovering from a bad backfill requires having stored each pre-backfill value somewhere, so the undo plan has to exist before the first batch, not after the corpus is already wrong.

Sources

Scott Ambler and Pramod Sadalage’s Refactoring Databases (2006) is the book-length treatment of schema-level backfill, including the add-column, dual-write, backfill, cut-over, drop sequence that frames the online case here. The agentic modes in this article extend that database-specific discipline to non-schema corpora.

Pramod Sadalage and Martin Fowler’s writing on evolutionary database design develops the same dual-write-and-backfill mechanism as a continuous practice rather than a one-off migration, which is the framing that lets a backfill sit inside a longer parallel change.

Sam Newman’s Building Microservices (2nd ed., 2021) carries the dual-write-and-backfill technique across service-contract boundaries, where the records you fill belong to a consumer you can’t update in lockstep.

Michael Feathers’s Working Effectively with Legacy Code (2004) is the canonical source for the test-backfill case: characterization tests that capture a legacy code path’s observed behavior so it can be changed safely. The agentic framing, an agent reading each function and emitting the test, is the new layer on an old discipline.

The online-schema-change tooling community (gh-ost, pg_repack, and similar) worked out the practical mechanics of backfilling a live table without locking it, including the throttling and chunking patterns the batch discipline here inherits.

Further Reading

  • Refactoring Databases (Scott Ambler and Pramod Sadalage, 2006) — the schema-migration playbook, where the dual-write-and-backfill sequence is developed in full.
  • Online, Asynchronous Schema Change in F1 (Google, 2013) — how a large distributed database backfills schema changes safely against live traffic, and why the protocol is harder than it looks.