Shadow-write infrastructure — branch review

worktree-shadow-write-infra  ←  wil/db-service-entity-poc  ·  209 files, +82,479 / −275 in scope  ·  review prepared 2026-05-11

If you read nothing else

Your CRUD/SCD HTTP service is going to be the front door to a brand-new unified database. Before we point production writers at it, we want to watch live traffic flow through it for about two weeks and prove the result matches the legacy Partners/Reference DB row for row. This branch adds: (1) a Python service that tails the legacy DB and calls your service's new /shadow/* routes, (2) a verifier that compares the two databases every hour, (3) a one-shot "cutover gate" script that flips the switch when the verifier has been clean for seven days, and (4) the machinery to physically delete all of it from main afterwards. The deletion is already proven to work by CI today.

Bounded validation window (~2 weeks, then deleted) · CDC from Partners DB · 5 new RDB tables · Hourly verifier diff · 7-day clean run required · CI-gated decommission

1. Why this branch exists

Today the Partners DB and Reference DB are the source of truth. We want them replaced by the unified RDB you've been building behind /api/v1/scd/* and /api/v1/crud/*. Schema parity has been audited and the one-shot migration script works. But schema parity is not enough. We need to know that under real, live, sustained traffic the new database ends up with the same data as the old one — including fan-out (one source row becoming several versioned target rows), junction-table mutations, and the trickier edge cases like TRUNCATE.

So the plan is to run the two databases in parallel for ~2 weeks: writers keep hitting the old Partners DB, and a streaming pipeline mirrors every change into the unified RDB through your service. An out-of-band verifier compares the two databases continuously. Once it has been clean for a long enough window, an operator runs a single script that pauses writers, drains the pipeline, runs one last comparison, and stamps a 30-minute "permit" if everything matches. The operator then repoints writers at the unified RDB. The whole shadow apparatus then gets physically removed from the codebase.

Key idea Everything in this branch is temporary. The CI is set up so that the post-cutover state — markers stripped from app.ts, entire src/modules/shadow/ tree deleted, entire shadow-write-consumer/ directory deleted — compiles and passes tests today. We aren't shipping technical debt; we're shipping a removable scaffold.

2. Cast of characters

Five new actors. They communicate through three RDB tables and two HTTP endpoints — nothing else.

The consumer (Python)

shadow-write-consumer/src/consumer/main.py

Long-running Python process. Tails the Partners DB via Postgres logical replication, turns each change into one or more HTTP calls to POST /shadow/write/* on your service. Two-week lifespan. Runs on Fly.

The verifier slot reader (Python)

shadow-write-consumer/src/verifier/slot_reader.py

A second long-running Python process tailing the same Partners DB on a second, independent replication slot. Does no transform — just writes an "I saw this event" receipt to a ledger table. The consumer writes a matching receipt for every event it applies. Comparing the two receipts is how we prove no event got dropped.

The hourly verifier (Python)

shadow-write-consumer/src/verifier/main.py

Runs once an hour as a scheduled Fly Machine. Reads the Partners DB at a pinned point in time and reads the unified RDB at the matching point in time, then diffs them row by row. Emits findings. Resets a "clean run" counter on anything blocking.

The cutover gate (Python, one-shot)

shadow-write-consumer/src/cutover/main.py

An operator runs this once, manually, after the seven-day clean window. It coordinates the writer pause, drains the consumer to an exact LSN, runs one final verifier pass, and writes the permit-or-block decision into shadow_cutover_audit.

Your service, with new routes (TypeScript)

src/modules/shadow/ (10 route/service pairs, 26 files, +8,548 LOC)

The shadow surface is a separate prefix (/shadow/*) with its own bearer-token auth, completely isolated from /api/v1/*. Every shadow write goes into one Prisma transaction that touches the target row, a provenance Transaction row, a shadow_pk_map upsert, and (sometimes) a ledger row. The route registration is bracketed by // SHADOW-REGISTRATION-BEGIN/END marker comments so a CI script can physically delete it later.

3. One write, end-to-end

Concrete example: someone updates an entity name in the Partners DB.

  1. Someone runs UPDATE entity SET name = 'Acme' WHERE id = 42; on the Partners DB. Postgres writes this to its write-ahead log (WAL).
  2. The consumer reads the WAL change. The replication slot delivers the event as JSON (via the wal2json plugin): {action:'U', table:'entity', columns:[...], identity:[{name:'id', value:42}]}. At the same instant, the verifier slot reader sees the same event on its own slot and writes a ledger row (side='verifier', commit_lsn=..., source_table='entity', source_pk={"id":42}).
  3. The consumer transforms the event. The entity table is SCD2 in the unified RDB, so an UPDATE becomes "archive the current version, create a new version". The consumer also synthesizes a deterministic UUIDv7 for the new version (derived from SHA-256("entity|{\"id\":42}|Entity")) so retries always produce the same UUID.
  4. The consumer calls your service. It POSTs /shadow/write/scd/update with the body, an idempotency_key like stream:5678901234:0:0, and shadow_metadata: {source_table:"entity", source_pk:{"id":42}, source_operation:"UPDATE", commit_lsn:5678901234, change_index_in_txn:0, applied_lsn:5678901234}. Bearer auth header.
  5. Your service writes everything in one transaction: close the current Entity version's valid_to, insert a new version, insert a provenance Transaction row tagged with channel cdc_shadow, upsert shadow_pk_map(source_table:"entity", source_pk:'{"id":42}', target_model:"Entity") → target_pk:"<uuid>", and append shadow_event_ledger (side='consumer', commit_lsn:5678901234, ...). If anything fails, all of it rolls back; the consumer retries with the same idempotency key and the partial unique index makes the retry a no-op.
  6. The consumer advances its LSN. It tells Postgres "I have durably processed up through LSN 5678901234; you can recycle WAL up to here" via send_feedback, and POSTs /shadow/consumer-position so the cutover gate can see how far it's gotten.
  7. An hour later, the verifier wakes up. It captures the current Partners DB LSN (call it L), opens a read-only transaction pinned to that point in time, reads the unified RDB at ?as_of_lsn=L, and diffs the two. If our Acme entity matches on both sides — same name, same archived flag, same FK targets — that's one clean comparison. 168 consecutive clean hours is the SC3 bar.
  8. The verifier also runs an "event count" check. It aggregates the shadow_event_ledger table over (last_attested_lsn, consumer.confirmed_flush_lsn] and asserts the consumer-side counts equal the verifier-side counts per (table, operation) bucket. Our UPDATE shows up once on each side. Clean.

That's the whole loop. Everything else in this branch is either: (a) extra robustness for when something goes wrong, (b) the choreography to start this loop from an empty database, or (c) the choreography to stop it cleanly.
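The deterministic identifiers in steps 3–4 can be sketched as follows. The UUIDv7 bit-stamping layout and the meaning of the last two key positions are assumptions beyond what the walkthrough states; only the hash input and the key shape come from the steps above.

```python
import hashlib
import uuid

def deterministic_uuid(source_table: str, source_pk_json: str, target_model: str) -> str:
    """Derive a stable UUID from SHA-256("table|pk_json|Model"), as in step 3.
    Stamping UUIDv7 version/variant bits onto the first 16 digest bytes is an
    illustrative assumption, not the branch's exact derivation."""
    seed = f"{source_table}|{source_pk_json}|{target_model}".encode()
    b = bytearray(hashlib.sha256(seed).digest()[:16])
    b[6] = (b[6] & 0x0F) | 0x70   # version nibble -> 7
    b[8] = (b[8] & 0x3F) | 0x80   # RFC 4122 variant
    return str(uuid.UUID(bytes=bytes(b)))

def idempotency_key(commit_lsn: int, txn_index: int, fanout_index: int) -> str:
    """Key shape from step 4: stream:5678901234:0:0. The semantics of the two
    trailing positions (index in txn, fan-out index) are assumed."""
    return f"stream:{commit_lsn}:{txn_index}:{fanout_index}"
```

The point of both helpers is that a retry of the same logical event reproduces the same UUID and the same key, so step 5's partial unique index can turn the retry into a no-op.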

4. The four questions the system has to keep answering

The runbook calls these SC1, SC2, SC3, SC4 ("success criterion"). In plain English:

| Code | Question | Mechanism |
| --- | --- | --- |
| SC1 | Did the consumer see and apply the same set of events as the verifier slot saw? | Two ledger rows per event — one from the consumer transactionally with the write, one from the verifier slot reader independently. Aggregate and compare per hourly tick. |
| SC2 | After we pause writers and drain, do the two databases actually match? | One-shot verifier diff at the final LSN, run by the cutover gate. Reads through the RDS API, not raw SQL. |
| SC3 | Have we been clean for long enough to trust the pipeline? | 168 consecutive clean hourly ticks (= 7 days). Any blocking finding resets to zero. Counter is durable JSON on the verifier's Fly volume. |
| SC4 | If we have to retry or replay events, will we get the same final state? | Deterministic idempotency keys + a partial unique index on Transaction(channel_id, model_type, operation, idempotency_key) WHERE idempotency_key IS NOT NULL. Retries are silent no-ops at the database level. |

Roughly: SC1 says we didn't miss anything, SC2 says what we have is right, SC3 says we've been right long enough to believe it, and SC4 says we can recover from blips without corrupting state.
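The SC3 mechanism — a durable counter that only advances on clean ticks and resets on any blocking finding — can be sketched like this. The class name and file layout are hypothetical; the real state file lives as JSON on the verifier's Fly volume.

```python
import json
import os

class CleanRunCounter:
    """Toy model of the SC3 clean-run counter: 168 consecutive clean hourly
    ticks are required, and any blocking finding resets the count to zero."""
    REQUIRED_TICKS = 168

    def __init__(self, path: str):
        self.path = path

    def _load(self) -> dict:
        try:
            with open(self.path) as f:
                return json.load(f)
        except FileNotFoundError:
            return {"clean_ticks": 0}

    def record_tick(self, blocking_findings: list) -> int:
        state = self._load()
        state["clean_ticks"] = 0 if blocking_findings else state["clean_ticks"] + 1
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)   # atomic rename: survives a crash mid-write
        return state["clean_ticks"]

    def window_satisfied(self) -> bool:
        return self._load()["clean_ticks"] >= self.REQUIRED_TICKS
```

The write-to-temp-then-rename step matters because the scheduled verifier machine is recycled every hour; the counter has to be correct after any interruption.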

5. Architecture diagram

Two replication slots on the same Partners DB feed two independent Python processes. They coordinate through three RDB tables and two HTTP endpoints — never through shared Python state.

```
                 Partners / Reference DB  (source of truth today)
                              |
             ________________|  two logical replication slots  ________________
            /                                                                  \
   cdc_shadow_primary                                                cdc_shadow_verifier
            |                                                                  |
            v                                                                  v
+--------------------------------+                  +-------------------------------------+
| consumer (Python)              |                  | verifier slot reader (Python)       |
| python -m src.consumer.main    |                  | python -m src.verifier.slot_reader  |
|                                |                  |                                     |
| parse -> filter -> transform   |                  | no transform / no filter            |
|   -> POST /shadow/write/*      |                  | POST /shadow/ledger side=verifier   |
+--------------------------------+                  +-------------------------------------+
            |                                                                  |
            |  POST /shadow/write/scd/update (etc.)                            |
            v                                                                  v
+-------------------------------------------------------------------------------+
| rwa-db RDS service -- YOUR service with new /shadow/* routes                   |
|                                                                                |
| Single Prisma $transaction per write:                                          |
|   target row + Transaction provenance + shadow_pk_map upsert                   |
|   + (consumer-side) shadow_event_ledger row                                    |
+-------------------------------------------------------------------------------+
            |
            v
   Unified RDB (Postgres)
            ^
            |  reads at as_of_lsn
+-------------------------------------------------------------------------------+
| hourly verifier (Python, scheduled Fly Machine)                                |
|                                                                                |
| 1. snapshot the Partners DB at LSN L                                           |
| 2. wait for consumer.confirmed_flush_lsn >= L                                  |
| 3. read RDB at as_of_lsn=L                                                     |
| 4. diff -> findings                                                            |
| 5. SC1 ledger attestation                                                      |
| 6. SC3 counter advance / reset                                                 |
+-------------------------------------------------------------------------------+
            |
            v
   After SC3 hits 168 clean ticks AND named-reviewer go:
+-------------------------------------------------------------------------------+
| cutover gate (Python, one-shot, operator-invoked)                              |
|                                                                                |
| a. operator pauses Partners DB writers                                         |
| b. capture final_lsn                                                           |
| c. tell consumer "halt at final_lsn"                                           |
| d. wait for drain                                                              |
| e. run verifier in cutover mode                                                |
| f. write permit / block decision to shadow_cutover_audit                       |
+-------------------------------------------------------------------------------+
```

The two slots use the same wal2json options (same table list, same options hash). If they drift, both readers halt loudly — comparing apples to oranges would make SC1 meaningless.

6. Quick glossary

LSN (Log Sequence Number)
A monotonically increasing position in Postgres's write-ahead log. Every change has one. We use LSNs as "point in time" markers everywhere — "drain to this LSN", "read RDB as-of this LSN".
Replication slot
A Postgres construct that tracks how far a downstream reader has consumed. While a slot exists, Postgres refuses to recycle WAL the slot hasn't acknowledged. Two slots means two independent readers.
wal2json
A Postgres output plugin that emits each WAL change as a JSON object. We pin it to format-version=2 with include-pk=true and a specific add-tables list.
Exported snapshot
When you create a replication slot with EXPORT_SNAPSHOT, Postgres also gives you a snapshot token. Other transactions can attach to that exact MVCC view via SET TRANSACTION SNAPSHOT '...'. This is how the backfill loads consistent data into the empty RDB.
Holder session
The libpq connection that created the slot and is keeping the exported snapshot alive. If this session dies or runs any SQL, the snapshot evaporates and the backfill is corrupt.
Channel cdc_shadow
The provenance tag stamped on every Transaction row written by the shadow pipeline. Lets us distinguish shadow-pipeline writes from any other writer.
shadow_pk_map
An RDB table mapping (source_table, source_pk, target_model) to the canonical primary key in the unified RDB. Every shadow write upserts here. It's how we look up "where did source row X end up?" later.
SCD2
"Slowly Changing Dimension Type 2." A versioning pattern where each update creates a new row with a new valid_from and closes out the previous row's valid_to. The unified RDB uses this for 12 core entities.
Race window
The small slice of time between the verifier capturing an LSN and the verifier finishing its read. If a write commits inside that window the diff might see it on one side but not the other — we reclassify those as "race window" rather than failure, then escalate if the same row keeps showing up.
SC1 / SC2 / SC3 / SC4
The four invariants. See § 4.
R5 classes
The six possible verdicts the verifier can give a single row diff: missing_in_rdb, extra_in_rdb, value_mismatch, duplicate_current_version, unmapped_source_row, representation_only_difference. First four are blocking; last two are allowlisted.
R-18
The constraint "one reader per replication slot at a time." Postgres allows only one consumer per slot anyway, but R-18 also means "don't accidentally run a second consumer Fly Machine and have them fight."
The cutover gate
The one-shot Python script that decides "permit" or "block" for the writer repointing. The decision is stamped into shadow_cutover_audit as the system's audit trail.
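To make the wal2json entry above concrete: a format-version-2 change reduces to the (operation, table, source_pk) triple the ledger rows key on. A minimal parsing sketch, following the event shape shown in § 3 (the real parser in src/consumer/main.py handles far more — full column lists, types, and transaction framing):

```python
import json

OPS = {"I": "INSERT", "U": "UPDATE", "D": "DELETE", "T": "TRUNCATE"}

def parse_change(raw: str):
    """Reduce one wal2json v2 change object to (operation, table, source_pk).
    'identity' carries the replica-identity columns as name/value pairs."""
    evt = json.loads(raw)
    pk = {col["name"]: col["value"] for col in evt.get("identity", [])}
    return OPS[evt["action"]], evt["table"], pk
```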

7. Diff scope — where the 82k lines went

| Area | Files | Insertions | What's in it |
| --- | --- | --- | --- |
| Python shadow-write-consumer/ | 113 | +44,009 | consumer, slot_reader, verifier, cutover, dlq_replay, harness, observability, contract, parity fixtures (46), tests (16 unit + 3 integration + harness) |
| TS src/modules/shadow/ | 26 | +8,548 | 10 route + service pairs, SCD engine, CRUD engine, metadata registry, fingerprint validation, canonical contract, metrics, auth |
| Migration runner scripts/migrate/ | 16 | +9,948 / −274 | migrate.ts snapshot mode (3,341 LOC), snapshot holder, orchestration, backfill, event-transform, idempotency keys, cleanup, contract dumper, fingerprint computers, FK-preserved allowlist |
| Tests tests/ | 36 | +15,946 | 15 shadow-* unit suites, 3 migration-runner suites incl. migrate.snapshot-mode / migrate.restart-window / preflight / snapshot-holder, 1 end-to-end SCD shadow channel integration |
| Infra infra/terraform/axiom, infra/fly/ | 9 | +1,312 | Axiom dataset + 17 monitors + 2 dashboards + notifiers (Terraform); Fly app TOML with consumer/slot_reader process groups |
| CI .github/workflows/ | 6 | +652 | parity-check, shadow-consumer-tests, shadow-smoke (3-way: disabled / deleted / cutover-stripped), deploy-shadow-services, apply-rulesets |
| Docs docs/shadow-write-runbook.md | 1 | +1,607 | Single canonical operator runbook (23 sections + glossary) |
| RDB schema db/unified/schema.prisma | 2 | +457 / −1 | 5 new shadow tables, 1 init migration |

Roughly half the diff is the Python consumer + its tests. The next biggest chunk is the migration runner's new "snapshot mode" that opens the validation window. The TypeScript shadow module is comparatively small — about as big as one route in your existing service.

8. New tables in the unified RDB

Five additive tables. Nothing in the existing schema changed shape.

ShadowPkMap

schema.prisma:2527

"Where did source row X end up?" Composite key (source_table, source_pk, target_model) → canonical target_pk. Every shadow write upserts here. Indexed both ways — for a single-row lookup and for fan-in aggregation at a given LSN.

ShadowEventLedger

schema.prisma:2565

"Did we see and apply the same events?" Two rows per event: one side='consumer' written transactionally with the write, one side='verifier' written by the slot reader. Per-hour aggregation comparison is SC1.

ShadowBackfillState

schema.prisma:2602

"Where are we in the window's lifecycle?" One row, id=1 CHECK-enforced. Holds the snapshot token, replication-slot baseline LSN, frozen wal2json options, schema fingerprints, plus cutover signals (consumer_paused, consumer_target_lsn, etc.).

ShadowDeadLetter

schema.prisma:2651

"What couldn't we apply, and why?" One row per failed event with the raw wal2json payload, error class, and a resolution flag. Replay path reads from here.

ShadowCutoverAudit

schema.prisma:2688

"Did we ever get a permit, and was it valid?" One row per cutover gate invocation. UUIDv7 id. permit_status + 30-min expiry + blocking findings + named-reviewer approvals. This is the system's audit trail.

9. TS shadow module — what got added to your service

Skip ahead if you're already familiar with the /shadow/* routes from the PR description.

9.1 Where it plugs in

src/app.ts got refactored around a buildApp(target, options) factory that the rest of the codebase calls. The shadow registration is one block:

// SHADOW-REGISTRATION-BEGIN
const { registerShadowModule } = await import('./modules/shadow/index.js');
await registerShadowModule(target, { ... });
// SHADOW-REGISTRATION-END

Two ways to disable it: set SHADOW_DISABLED=true at runtime (the configuration the shadow-smoke 9a job boots with), or physically remove the block — the strip script deletes everything between the markers (the 9c job), and the deleted-module stub covers the case where src/modules/shadow/ itself is gone (9b).

9.2 The new routes

| Routes | Used by |
| --- | --- |
| POST /shadow/write/scd/{create,update,archive} | Consumer's SCD2 path. SCD engine handles version archival + new version creation atomically with the provenance and ledger writes. |
| POST /shadow/write/crud/{create,update,delete} | Consumer's non-SCD2 path. Composite-PK variants for junction tables. |
| GET /shadow/read/{scd,crud}/:model[/...] | Verifier in cutover mode reads RDB through these (required, not raw SQL). All carry as_of_lsn. |
| POST /shadow/read/crud/:model/lookup | Verifier composite-key lookup (object PK can't go in URL). |
| POST /shadow/cutover/{halt-target,resume,audit} | Cutover gate sets and clears the halt signal; writes audit rows. |
| GET /shadow/consumer/control-state | Consumer polls every 5 sec to check if it should halt. |
| GET / POST /shadow/consumer-position | Consumer publishes its flush LSN; gate polls it during drain. |
| GET /shadow/verifier/state | Verifier slot reader polls its baseline LSN. |
| POST /shadow/ledger | Verifier slot reader writes its receipts here; consumer fan-out true-no-ops also write here directly. |
| GET / POST / PATCH /shadow/dlq | Operator triage; replay tool marks rows resolved. |

9.3 What makes shadow writes safe to retry

Each shadow write is one Prisma $transaction that writes four things atomically:

  1. The target row (or SCD2 version archival + new version).
  2. A provenance Transaction row with channel='cdc_shadow' and a unique idempotency_key.
  3. A shadow_pk_map upsert.
  4. (Sometimes) a shadow_event_ledger row.

A retry with the same idempotency key collides with the partial unique index Transaction(channel_id, model_type, operation, idempotency_key) WHERE idempotency_key IS NOT NULL and the SCD engine's replay-detection short-circuit returns the prior result instead of writing again. SCD2 writes also take a per-UUID advisory lock so two concurrent updates to the same entity can't interleave.
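As a toy model of the SC4 semantics — not the Prisma implementation — the replay short-circuit behaves like this: the first write with a given idempotency key is applied; any retry returns the recorded result instead of writing again, with the DB-level partial unique index as the backstop.

```python
class ReplayDetectingWriter:
    """Hypothetical in-memory model of the SCD engine's replay detection.
    The real engine does this inside one Prisma $transaction against the
    Transaction table's partial unique index."""

    def __init__(self):
        self._results = {}   # idempotency_key -> prior result
        self.applied = []    # writes that actually reached "the database"

    def write(self, idempotency_key: str, payload: dict) -> dict:
        if idempotency_key in self._results:
            # Retry with a known key: silent no-op, return the prior result.
            return self._results[idempotency_key]
        result = {"target_pk": f"pk-for-{payload['source_pk']}"}
        self.applied.append(payload)
        self._results[idempotency_key] = result
        return result
```

The property worth testing is the one SC4 names: two identical attempts produce one database write and identical responses.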

9.4 Auth isolation

Shadow routes use a separate bearer token (RDS_CDC_SHADOW_TOKEN), checked with constant-time comparison. Missing env var returns 503 rather than 401 so an operator can tell "misconfigured" from "wrong token". The public API_KEY auth was modified to skip /shadow/*; tests/unit/auth-non-overlap.test.ts asserts the two layers don't collide.
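A minimal sketch of that decision table, with a helper name of my own (the real guard lives in the TS shadow module): 503 when the env var is unset, 401 on a bad token, 200 otherwise, with hmac.compare_digest supplying the constant-time comparison.

```python
import hmac

def shadow_auth_status(auth_header, env):
    """Return the HTTP status the shadow auth layer would produce.
    Missing RDS_CDC_SHADOW_TOKEN means 'misconfigured', not 'wrong token'."""
    expected = env.get("RDS_CDC_SHADOW_TOKEN")
    if not expected:
        return 503
    presented = (auth_header or "").removeprefix("Bearer ")
    ok = hmac.compare_digest(presented.encode(), expected.encode())
    return 200 if ok else 401
```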

10. The Python consumer service

Skip if you don't intend to review the Python side directly.

10.1 Why two Postgres drivers?

Hybrid by necessity: psycopg2-binary owns the replication-stream paths because it still has the high-level LogicalReplicationConnection API that psycopg3 lacks. psycopg[binary] >= 3.2 owns everything else (verifier reads, raw RDB fetches, ledger SQL). The split is at the file boundary, not interleaved within a single function.

10.2 The five entrypoints

| Entrypoint | Process model | What it is |
| --- | --- | --- |
| python -m src.consumer.main | Always-on Fly machine, 1 of them | The streaming apply loop. 2,185 LOC. Most of the complexity is the keepalive contract with Postgres and the in-flight-transaction guard. |
| python -m src.verifier.slot_reader | Always-on Fly machine, 1 of them | Reads the second slot, writes ledger receipts. No transform. |
| python -m src.verifier.production_wiring --mode hourly --max-ticks 1 | Scheduled Fly Machine, hourly | The verifier itself. Single-tick per invocation; the schedule restarts a fresh machine each hour. |
| python -m src.cutover.main | One-shot CLI | The cutover gate. Operator runs once, manually. |
| python -m src.dlq_replay.main | Operator CLI | Replays DLQ rows by commit_lsn or --all. |

10.3 The verifier (this is the load-bearing piece)

Three modes, each a different read strategy:

| Mode | How it reads the RDB | Race-window relief? | Affects SC3? |
| --- | --- | --- | --- |
| hourly | Raw psycopg, as_of_lsn = l_before | yes | yes — advances counter on clean, resets on blocking |
| cutover | Through your service's /shadow/read/* routes (non-negotiable; spec §502) | no — DB is quiesced | no — this is the final check before permit |
| post-backfill | Via SET TRANSACTION SNAPSHOT <token> | no | no — gate to start streaming |

Six possible verdicts on each row diff (the "R5 classes", per § 6): missing_in_rdb, extra_in_rdb, value_mismatch, duplicate_current_version, unmapped_source_row, representation_only_difference. The first four are blocking; the last two are allowlisted.

And the SC1 attestation: aggregate shadow_event_ledger over (last_attested_lsn, confirmed_flush_lsn] by (side, source_table, source_operation) and fail-close on any per-bucket delta.
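That attestation reduces to a small aggregation. A hedged sketch, with row dicts mirroring the ledger columns named in § 3 (treat this as the shape, not the production query):

```python
from collections import Counter

def sc1_attest(ledger_rows, last_attested_lsn, confirmed_flush_lsn) -> bool:
    """Aggregate ledger receipts over the half-open LSN window
    (last_attested_lsn, confirmed_flush_lsn] per (table, operation) bucket
    and fail-close on any consumer/verifier delta."""
    sides = {"consumer": Counter(), "verifier": Counter()}
    for row in ledger_rows:
        if last_attested_lsn < row["commit_lsn"] <= confirmed_flush_lsn:
            key = (row["source_table"], row["source_operation"])
            sides[row["side"]][key] += 1
    return sides["consumer"] == sides["verifier"]
```

The half-open interval is what makes consecutive ticks compose: each tick attests exactly the events between the previous attestation point and the consumer's current flush position.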

11. Backfill orchestration — how the window opens

The backfill is what loads the empty unified RDB from a consistent snapshot of the Partners DB at the same instant the replication slots start collecting changes. The runner is the existing migrate.ts, extended with a --snapshot-mode flag.

The trick Postgres lets you create a replication slot with EXPORT_SNAPSHOT — it gives you a token, and other transactions can attach to that exact point-in-time view of the database. So we create both slots, then attach N parallel reader sessions to the exported snapshot, and they all see the same consistent state. The slot starts collecting WAL changes from exactly the moment after that snapshot.

11.1 The choreography

  1. Create the primary slot with EXPORT_SNAPSHOT. The session that creates the slot is "the holder" — we keep this connection idle and alive for the whole backfill.
  2. Create the verifier slot on a separate connection. Enforce that its baseline LSN is ≥ the primary's.
  3. Attach N reader sessions (default 8) with BEGIN ISOLATION LEVEL REPEATABLE READ; SET TRANSACTION SNAPSHOT '...';
  4. Run the migration phases. The whole thing wraps in one Prisma $transaction whose first statement is the same SET TRANSACTION SNAPSHOT, so Prisma-driven reads see the same consistent view as the raw pg.Client readers.
  5. Run a post-backfill verifier against the snapshot.
  6. Flip verifier_done = true in shadow_backfill_state. The holder's polling loop sees this and exits cleanly.
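Steps 1 and 3 ride on two Postgres-level commands. A sketch of just that protocol surface — the real orchestration lives in migrate-shadow-orchestration.ts, the helper names are mine, and the snapshot token in the usage note is illustrative:

```python
PRIMARY_SLOT = "cdc_shadow_primary"

def create_slot_command(slot_name: str) -> str:
    """Replication-protocol command issued on the holder's connection.
    Postgres returns the slot's consistent-point LSN plus an exported
    snapshot token in the result row."""
    return f"CREATE_REPLICATION_SLOT {slot_name} LOGICAL wal2json EXPORT_SNAPSHOT"

def reader_attach_statements(snapshot_token: str) -> list:
    """Every parallel backfill reader runs these two statements first, so
    all N sessions observe the exact MVCC view the slot was created at."""
    return [
        "BEGIN ISOLATION LEVEL REPEATABLE READ",
        f"SET TRANSACTION SNAPSHOT '{snapshot_token}'",
    ]
```

A reader session would execute, e.g., reader_attach_statements("00000003-000001A8-1") before its first COPY/SELECT; any reader that attaches after the holder's snapshot dies fails with "invalid snapshot identifier", which is exactly the failure mode § 11.2 is about.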

11.2 Why the holder can't run a heartbeat

Counter-intuitive invariant An exported snapshot is anchored to the implicit transaction the walsender opens during slot creation. Any SQL on that session — even SELECT 1 — commits or closes that transaction and silently invalidates the snapshot. Reader sessions attaching afterward would fail with "invalid snapshot identifier," but the holder's "alive=true" flag would still claim the snapshot is good. So we have to keep the holder idle and rely on TCP keepalives at the socket layer to defeat NAT/firewall timeouts. If the holder dies before verifier_done, there is no in-place resume — the only recovery is "drop the slots, run cleanup, start over."
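The "TCP keepalives instead of SQL heartbeats" point maps onto standard libpq connection parameters. A sketch with illustrative interval values (the holder itself is TypeScript; the parameter names are real libpq ones):

```python
def holder_dsn(base_dsn: str) -> str:
    """The holder session must never issue SQL, so liveness against
    NAT/firewall idle timeouts has to come from socket-level keepalives:
    probe after 30s idle, every 10s, declare dead after 3 missed probes."""
    return (base_dsn +
            " keepalives=1 keepalives_idle=30"
            " keepalives_interval=10 keepalives_count=3")
```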

11.3 The shared contract between TS and Python

The migration map (scripts/migrate/reference-to-unified/migration-map.ts) is the source of truth for every source table's metadata: PK columns, PK kind, replica-identity kind, source filter, FK translations. dump-shadow-contract.ts renders this to JSON, committed at shadow-write-consumer/contracts/shadow-contract.json. The Python consumer loads it at import and version-checks. CI's parity-check.yml re-runs the dumper and fails on any difference via git diff --exit-code — so a TS migration-map edit that changes something behavior-affecting is a one-line lint failure on the PR, not a runtime surprise in Python.

12. Infra & CI — what gets deployed and gated

12.1 The Python service is deployed to Fly

infra/fly/shadow-consumer.toml declares two always-on process groups (consumer, slot_reader). The hourly verifier is intentionally not a process group — it's a scheduled Fly Machine, because if it were in [processes], fly deploy would strip the schedule = "hourly" property and turn it into a tight-loop worker. The deploy workflow runs fly scale count consumer=1 slot_reader=1 and then asserts via fly machines list --json | jq that the counts are right (no Fly-native enforcement exists for non-service workers).

12.2 Observability is Terraform-managed

Two parallel Axiom surfaces, deliberately separated:

12.3 CI workflows

| Workflow | Trigger | What it proves |
| --- | --- | --- |
| parity-check.yml | PR (paths) | The Python and TypeScript canonicalization paths produce byte-identical output. Contract JSON is current. Every source_table × operation cell has a fixture or a documented carve-out. |
| shadow-consumer-tests.yml | PR (paths) | Python lint + types + unit + harness pass. Plus the TS migration-map-metadata.test.ts. |
| shadow-smoke.yml | Every PR | Three jobs: (9a) service boots with SHADOW_DISABLED=true. (9b) service boots after rm -rf src/modules/shadow + stub installed. (9c) service boots after running the awk strip script. The decommissioned state is tested today. |
| deploy-shadow-services.yml | Manual | Deploys the Fly app with confirm-prod guard, dry-run, and post-deploy R-18 jq check. |

13. Canonicalization — how two languages stay byte-identical

The consumer (Python) builds JSON bodies for your service (TypeScript). Both sides must produce exactly the same wire bytes for the same logical value, or SC1 dies. Five rules pinned in a frozen contract:

| Concept | Rule | Why |
| --- | --- | --- |
| Decimal | Plain string. No scientific notation. -0 normalizes to 0. Trailing zeros preserved. | 1e3, 1000, +1000 all collide in canonical form. |
| Timestamp | ISO 8601 UTC, microsecond precision, sub-microsecond truncated, Z suffix. | Postgres stores microseconds; the comparison key must be byte-identical. |
| Null vs empty string | Distinct. undefined collapses to null. Empty string is a real value. | Reference DB historically used both meanings. |
| Composite PK | Sorted-keys JSON, no whitespace, lexicographic codepoint order, no nested objects. | Becomes shadow_pk_map.source_pk — row identity across the system. |
| LSN | Decimal string on the wire. bigint rejected from idempotency hashing. | JS numbers lose precision past 2^53−1; a hash over a truncated input wouldn't match Python's hash. |
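A minimal Python sketch of the Decimal and composite-PK rules. The real implementation is the frozen contract in canonical.ts and its Python twin; this only shows the shape of the two trickiest rules:

```python
import json
from decimal import Decimal

def canonical_decimal(value: str) -> str:
    """Plain notation, no exponent; -0 normalizes to 0; trailing zeros kept."""
    d = Decimal(value)
    if d == 0:
        d = abs(d)            # strips the sign of negative zero, keeps the scale
    return format(d, "f")     # 'f' never emits scientific notation

def canonical_pk(pk: dict) -> str:
    """Sorted keys, no whitespace. json.dumps sorts by codepoint, matching
    the contract's lexicographic rule; nested objects are out of contract."""
    assert not any(isinstance(v, dict) for v in pk.values())
    return json.dumps(pk, sort_keys=True, separators=(",", ":"))
```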

14. The race window — why "blocking" sometimes means "wait one tick"

The verifier captures l_before = pg_current_wal_lsn() on the Partners DB, then opens a snapshot. Between those two steps another transaction can commit. If that transaction's row shows up in the snapshot but the consumer hasn't applied it yet, the verifier sees missing_in_rdb — even though everything is correct and a moment later the consumer will catch up.

The classifier handles this by issuing pk_map_lookup_by_target for each blocking finding. If max(last_event_lsn) > l_before, the finding gets reclassified to verifier_race_window and is allowed to clear on the next tick. But — if the same (source_table, source_pk, target_model) + signature recurs for 2+ consecutive runs, the persistence tracker escalates it back to the original blocking class. So a true bug eventually surfaces; a transient race doesn't.

Two classes never get race-window relief: duplicate_current_version (SCD2 must have exactly one current version per UUID) and unmapped_source_row (the table is out of scope by construction).
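A sketch of the reclassify-then-escalate logic. Helper and field names are assumptions (the real code is race_window.py), and the real persistence tracker also resets counts when a signature skips a run, which is elided here:

```python
NO_RELIEF = {"duplicate_current_version", "unmapped_source_row"}

def classify(finding, max_last_event_lsn, l_before, recurrences):
    """Blocking finding whose row received an event after l_before becomes
    verifier_race_window — unless the same signature has now recurred on
    2+ runs, in which case it escalates back to the original class."""
    if finding["klass"] in NO_RELIEF:
        return finding["klass"]              # these two classes never get relief
    sig = (finding["source_table"], finding["source_pk"],
           finding["target_model"], finding["signature"])
    if max_last_event_lsn is not None and max_last_event_lsn > l_before:
        recurrences[sig] = recurrences.get(sig, 0) + 1
        if recurrences[sig] >= 2:
            return finding["klass"]          # persistent: surface the real bug
        return "verifier_race_window"        # transient: clear on next tick
    return finding["klass"]
```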

15. The five lifecycle phases

The full operator runbook lives in docs/shadow-write-runbook.md (1,607 lines, 23 sections + glossary). The shape:

| Phase | Who triggers | Termination |
| --- | --- | --- |
| 1. Pre-flight (Day −1) | Operator | 10 green checklist items: provenance redesign live, replica-identity audit, storage sizing, DDL freeze announced, empty-RDB, Fly app deployed + R-18 check, R5+R8 sign-offs, cutover audit clear. |
| 2. Backfill open | Operator runs migrate.ts --snapshot-mode | Slots created, parallel readers loaded RDB, post-backfill verifier passed, verifier_done=true, holder closes, consumer + slot_reader + scheduled verifier start. |
| 3. Validation window | Continuous | Hourly verifier runs; DLQ triage as needed; SC3 timer advances. End condition: SC3 == 168 clean ticks AND named-reviewer go. |
| 4. Cutover | Operator runs cutover.main | Writers paused → halt-target set → drain → SC2 verifier → audit row. Outcome is either permitted (30-min expiry) or blocked. On permit, operator repoints writers out of band. |
| 5. Decommission | Post writer-repointing | Strip script run on app.ts, shadow module + consumer service tree deleted, slots dropped, monitors taken down. Shadow code physically gone from main. |

16. The decommission path is real today

Three artifacts make the post-cutover world a tested state in CI, not a future cleanup ticket:

  1. ci/strip-shadow-registration.sh — awk script. Finds the // SHADOW-REGISTRATION-BEGIN/END markers in src/app.ts, fails noisily if they aren't both present exactly once, and emits a stripped version.
  2. ci/shadow-deleted-stub/modules/shadow/index.ts — no-op stub exporting registerShadowModule as async () => {} so the rest of the codebase still typechecks after the real src/modules/shadow/ tree is deleted.
  3. tsconfig.shadow-deleted.json — typecheck config that swaps in the stub and excludes src/modules/shadow/**. Used by the shadow-deleted-smoke (9b) CI job.

And scripts/check-shadow-residue.sh is the residue lint: it grep-fails if any of as_of_lsn, _shadow_update, _shadow_delete, shadow_metadata, or bare idempotency_key leaks into src/modules/scd, src/modules/crud, or src/shared/schemas. A defensive boundary check.

17. Test surface

TS unit tests

~21k LOC added in tests/unit/

15 shadow-* suites covering write/read/cutover/DLQ/ledger/PK-map/canonical/registration/metadata/startup. 8 migration-runner suites including migrate.snapshot-mode (1,012 LOC), migrate.restart-window (1,272), preflight (1,736), snapshot-holder (1,067).

Python unit tests

~12k LOC in shadow-write-consumer/tests/unit/

16 files; test_apply_loop (1,612), test_transform (1,525), test_verifier (2,925), plus cutover, DLQ-replay, slot-reader, source-filter, value-mismatch, metrics, contract-load, verifier-attestation-state, divergence-injector, production-wiring-cutover-rds. ~396 tests.

Integration + harness

Live-DB tests + R6 synthetic harness

3 integration suites (consumer-e2e, cutover-gate, dlq-replay). The R6 harness enumerates 32+ named failure scenarios (1:1 with Plan §842): snapshot-holder kill, race-window finding, persistence escalation, replica-identity insufficiency, COPY workload, TRUNCATE DLQ+halt, etc.

Parity fixtures — the cross-language correctness gate

shadow-write-consumer/parity-fixtures/: 46 JSON fixtures covering canonical edges, SCD2 inserts/updates/deletes, fan-out, CRUD composite-PK, source-filter branches, TRUNCATE. The runner does triple-equality: Python transform_event output == TS transformWal2jsonEvent output (via a tsx subprocess) == the fixture's expected RDS calls. A second CI gate (matrix_audit.py --strict) asserts every source_table × operation cell, every FK-translation cell, every junction cell, every source-filter branch cell has a fixture or a documented carve-out.

18. Suggested reading order (~1h)

If you only have an hour, hit these in order:

  1. The runbook table of contents. docs/shadow-write-runbook.md lines 17–42. Skim §3 (backfill open) and §5 (cutover) for the happy-path lifecycle in operator language.
  2. The schema. db/unified/schema.prisma lines 2527–2705. Read the five table definitions and their comments.
  3. The shadow contract. shadow-write-consumer/contracts/shadow-contract.json — one entry per source table. This is the cross-language source of truth.
  4. Trace one write end-to-end. Pick parity-fixtures/fixtures/scd2_entity_insert.json and walk it through: src/consumer/transform.py → src/consumer/rds_client.py → src/modules/shadow/shadow-write.routes.ts → src/modules/shadow/shadow-scd.engine.ts → the resulting shadow_pk_map + shadow_event_ledger rows.
  5. The verifier's hourly tick. src/verifier/main.py — run_hourly_tick is short enough to read in one sitting. Then sc1_attestation.py and race_window.py.
  6. The cutover gate. src/cutover/main.py: drain_complete, classify_outcome, build_audit_payload, format_permit_expires_at. Pure helpers, individually unit-tested.
  7. The holder. scripts/migrate/reference-to-unified/snapshot-holder.ts lines 30–46 (the rationale comment for why there's no heartbeat).
  8. The decommission story. ci/strip-shadow-registration.sh, tsconfig.shadow-deleted.json, and .github/workflows/shadow-smoke.yml together prove the post-cutover state compiles today.

19. Risks & carve-outs worth weighing

Things I'd push back on, or ask follow-up questions about, if I were doing this review cold.

20. Top files — the load-bearing reading list

| Area | File | Size (LOC) | Role |
| --- | --- | --- | --- |
| Runbook | docs/shadow-write-runbook.md | 1,607 | Operator's bible. |
| Schema | db/unified/schema.prisma 2527–2705 | +178 | 5 new tables. |
| TS app wiring | src/app.ts | 116 | buildApp + shadow registration block. |
| TS shadow module | src/modules/shadow/shadow-scd.engine.ts | 974 | SCD2 write engine. |
| TS shadow module | src/modules/shadow/shadow-crud.service.ts | 1,176 | CRUD + composite-PK write engine. |
| TS shadow module | src/modules/shadow/shadow-metadata-registry.ts | 473 | Target-model registry + SQL emitter. |
| TS shadow module | src/modules/shadow/canonical.ts | 426 | Frozen canonicalization contract. |
| TS migration runner | scripts/migrate/reference-to-unified/migrate.ts | 3,341 | One-shot + snapshot-mode driver. |
| TS migration runner | scripts/migrate/reference-to-unified/migrate-shadow-orchestration.ts | 2,014 | Slot creation, reader attach, IPC publish, restart. |
| TS migration runner | scripts/migrate/reference-to-unified/snapshot-holder.ts | 712 | Replication-protocol session; no-heartbeat keepalive. |
| TS migration runner | scripts/migrate/reference-to-unified/dump-shadow-contract.ts | 480 | Cross-language contract emitter. |
| Python consumer | shadow-write-consumer/src/consumer/main.py | 2,185 | Streaming apply loop. |
| Python consumer | shadow-write-consumer/src/consumer/transform.py | 1,146 | wal2json → RDS call list. |
| Python consumer | shadow-write-consumer/src/consumer/rds_client.py | 1,232 | HTTP client with backoff + classifier. |
| Python verifier | shadow-write-consumer/src/verifier/main.py | 1,126 | Hourly orchestrator. |
| Python verifier | shadow-write-consumer/src/verifier/race_window.py | 633 | Race-window reclassifier + persistence tracker. |
| Python verifier | shadow-write-consumer/src/verifier/production_wiring.py | 2,015 | Real-DB hooks for the verifier. |
| Python cutover | shadow-write-consumer/src/cutover/main.py | 1,387 | The gate. |
| Infra | infra/terraform/axiom/main.tf | 678 | Dataset, monitors, dashboards. |
| Infra | infra/fly/shadow-consumer.toml | 344 | Fly app config. |
| CI | .github/workflows/shadow-smoke.yml | 181 | 3 jobs: disabled / deleted / cutover-stripped. |
| CI | .github/workflows/parity-check.yml | 158 | Cross-language contract gate. |
| Decommission | ci/strip-shadow-registration.sh | 50 | The awk script. |
| Decommission | tsconfig.shadow-deleted.json | 10 | Typecheck config for the deleted state. |

If anything here contradicts docs/shadow-write-runbook.md, trust the runbook. It is the single source of truth and is enforced by the audit table.