cutaway/04 · 2026-06-11 · 13 min
How durable workflow engines replay history
A worker process gets OOM-killed forty minutes into a long order workflow. It had charged the customer’s card, reserved inventory, and was sleeping out a five-minute fraud-hold timer when the kernel reaped it. On restart, a fresh worker picks the workflow up, the timer fires, the confirmation email goes out, and the order completes. No card was charged twice, no inventory was reserved twice, and nobody on the team wrote a single line of recovery code. The local variable amount holds the charged figure and the reserved flag holds its branch decision, both in a process that did not exist when those values were computed.
The inverse is the part that ends up in a postmortem. Someone ships a deploy that adds one innocent time.Now() to a running workflow, and the next morning a few thousand executions are wedged with nondeterminism errors, retrying forever, refusing to advance. Same mechanism, opposite face. Both behaviors come out of one design decision, and it is worth understanding exactly, because the recovery and the footgun are the same gun.
The naive approach
You do not need a workflow engine to make an order survive a crash. Write the state down. Add a status enum to the orders table — charged, reserved, emailed, done — and a state column holding whatever the workflow needs to resume. Each step advances the enum and commits. A cron sweeper wakes up every minute, finds orders stuck in a non-terminal status, and pokes them forward from wherever they left off.
This is the first workflow engine everyone builds, and it is not stupid. It is durable in the obvious sense: the state is in Postgres, the sweeper is idempotent if you are careful, and a crashed worker leaves the row sitting in reserved until the next sweep. For a three-step process with no branching, it ships and it works.
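A minimal sketch of that enum-plus-sweeper design, assuming an in-memory dict in place of the orders table. The names Status, advance, and sweep are hypothetical, and the step bodies are placeholders rather than real payment or inventory calls.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CHARGED = "charged"
    RESERVED = "reserved"
    EMAILED = "emailed"
    DONE = "done"


# Hypothetical stand-in for the orders table: one row per order.
orders = {"order-1": {"status": Status.PENDING, "state": {}}}


def advance(order):
    """Run the single step for the current status, then 'commit' the next status."""
    status = order["status"]
    if status is Status.PENDING:
        order["state"]["amount"] = 98  # placeholder for charging the card
        order["status"] = Status.CHARGED
    elif status is Status.CHARGED:
        order["state"]["reserved"] = True  # placeholder for reserving inventory
        order["status"] = Status.RESERVED
    elif status is Status.RESERVED:
        order["status"] = Status.EMAILED  # placeholder for sending the email
    elif status is Status.EMAILED:
        order["status"] = Status.DONE


def sweep(table):
    """One pass of the cron sweeper: poke every non-terminal order forward one step."""
    for order in table.values():
        if order["status"] is not Status.DONE:
            advance(order)


# Four sweeps carry a pending order through to done.
for _ in range(4):
    sweep(orders)
```

Note that each status maps to exactly one step. That one-to-one mapping is what makes the design easy to ship and hard to extend.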
Why it breaks
It breaks the moment the workflow has control flow. The enum captures which named step you reached, not where in the function you were. A step like “charge the card, and if the amount is over the fraud threshold, reserve inventory and start a five-minute hold, otherwise skip straight to email” has no clean enum value for “charged, decided to reserve, reservation request sent but not yet confirmed.” You end up with a status column that is really a hand-rolled program counter, and every new if adds states that multiply with the ones already there.
Then the checkpoint races bite. You charge the card, and before the status = 'charged' commit lands, the worker dies. On restart the row still says pending, so the sweeper charges the card again. Move the commit earlier and you get the opposite bug: mark charged, crash before the charge actually fires, and the order moves forward as paid even though the card was never billed. The side effect and the checkpoint are two separate operations, and a crash can always land between them, in either order.
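Both orderings of the race can be reproduced in a few lines. This is a sketch under stated assumptions: the ledger list stands in for the payment processor, the Crash exception stands in for the worker dying, and run_step and sweeper are hypothetical helpers.

```python
class Crash(Exception):
    """Stands in for the worker process dying between two operations."""


def run_step(row, ledger, commit_first, crash_in_gap):
    """Perform the charge and the status commit in the chosen order."""
    if commit_first:
        row["status"] = "charged"   # checkpoint lands first
        if crash_in_gap:
            raise Crash()
        ledger.append(98)           # side effect fires second
    else:
        ledger.append(98)           # side effect fires first
        if crash_in_gap:
            raise Crash()
        row["status"] = "charged"   # checkpoint lands second


def sweeper(row, ledger, commit_first):
    """Retry any order that still looks pending."""
    if row["status"] == "pending":
        run_step(row, ledger, commit_first, crash_in_gap=False)


def simulate(commit_first):
    row, ledger = {"status": "pending"}, []
    try:
        run_step(row, ledger, commit_first, crash_in_gap=True)
    except Crash:
        pass
    sweeper(row, ledger, commit_first)
    return row["status"], ledger


# Charge first, commit second: the sweeper sees pending and charges again.
double_charge = simulate(commit_first=False)
# Commit first, charge second: the row says charged but no charge was made.
missing_charge = simulate(commit_first=True)
```

Neither ordering is safe on its own, which is why real systems in this style lean on idempotency keys at the processor. The sketch leaves that mitigation out.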
The deepest problem is that you cannot resume in the middle of a function. The worker’s call stack — which if branch it took, what the loop counter was, which line it was awaiting — lived in memory, and memory is gone. Your state column can record that you charged the card; it cannot record that you were three frames deep in a function, past one branch, blocked on the second. To resume mid-function you would have to serialize the entire execution stack, and you cannot, so you flatten the workflow into named steps and lose the ability to write it as ordinary code. That flattening is the tax, and it grows with every workflow.
The checkpoint race is the most immediate production hazard. Start the worker, then crash it during the amber gap phase — the charge side effect has fired and incremented the count, but the status = 'charged' write has not landed. The sweeper fires next, reads pending, and charges again. The double-charge counter turns red.
The real mechanism
The durable-execution engines invert the problem. Instead of persisting the state of the workflow, they persist its history — and not the history of the domain (the order, the payment) but the history of the execution itself: every command the workflow code issued and every result the outside world fed back. The workflow function is then treated as a deterministic function of that history. To recover, you do not reload state; you re-run the function from the top against the recorded history, and the same code re-derives the same state.
Three pieces make that work, and they are the whole idea:
- History is the truth. Every side effect the workflow can have — scheduling an activity, starting a timer, completing the workflow — is issued as a command, and the server records the corresponding event in an append-only history before anything else proceeds. The history, not the worker’s memory, is the durable record.
- The workflow is a deterministic function. Re-executed against the same history, it must emit the same commands in the same order. The engine matches each command the re-running code emits against the next recorded command-producing event. A match means replay is on track.
- Activity results come from history, never from re-running. When replay reaches amount = await chargeCard(), it does not call the payment processor again. It reads the recorded ActivityTaskCompleted result out of history and hands 98 straight back to the function. The card is charged exactly once, during the original run, and replay reconstructs the variable without touching the processor.
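The three pieces above fit in a short sketch. It makes several simplifying assumptions: the Worker class and its activity method are hypothetical, calls are synchronous rather than awaited, a command and its result are collapsed into one history entry instead of separate scheduled and completed events, and a crash is modeled as an exception raised before the result is recorded.

```python
class WorkerCrash(Exception):
    """Stands in for the process being killed mid-workflow."""


class Worker:
    def __init__(self, history, activities):
        self.history = history        # shared, append-only: [(command, result), ...]
        self.activities = activities  # name -> callable performing the side effect
        self.cursor = 0               # index of the next history entry to match
        self.executed = 0             # activities this worker actually ran

    def activity(self, name):
        command = ("ScheduleActivity", name)
        if self.cursor < len(self.history):
            # Replay: match the emitted command, then return the recorded result.
            recorded, result = self.history[self.cursor]
            if recorded != command:
                raise RuntimeError(f"nondeterminism: recorded {recorded}, emitted {command}")
        else:
            # Live execution past the end of history: run it and record it.
            result = self.activities[name]()
            self.executed += 1
            self.history.append((command, result))
        self.cursor += 1
        return result


def process_order(worker):
    amount = worker.activity("chargeCard")
    reserved = False
    if amount > 100:
        worker.activity("reserveInventory")
        reserved = True
    worker.activity("sendEmail")
    return {"amount": amount, "reserved": reserved}


calls = {"chargeCard": 0, "sendEmail": 0}
crash_pending = [True]  # the first sendEmail attempt kills the worker


def charge_card():
    calls["chargeCard"] += 1
    return 98


def send_email():
    if crash_pending[0]:
        crash_pending[0] = False
        raise WorkerCrash()
    calls["sendEmail"] += 1
    return "sent"


activities = {"chargeCard": charge_card, "reserveInventory": lambda: "ok", "sendEmail": send_email}
history = []

first = Worker(history, activities)
try:
    process_order(first)  # dies after chargeCard is recorded
except WorkerCrash:
    pass

second = Worker(history, activities)  # fresh worker, same history
result = process_order(second)        # re-runs from the top
```

The second worker rebuilds amount and reserved from the recorded entry, runs only sendEmail live, and leaves the chargeCard call count at one.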
The figure below makes this observable. The tape at the top is the event history, accumulating every event the server records. The code panel shows which line the workflow function has reached and what its locals hold; during replay those locals are tagged (from history). The comparison strip is the engine’s bookkeeping: for each command the replaying code emits, it shows what history recorded against what the code emitted, and whether they match.
(a) Press Run and watch events append while the code line advances. (b) Crash worker mid-flow, then Replay step — local variables refill '(from history)' while 'activities executed' stays flat. (c) Toggle Inject time.Now(), let chargeCard complete, crash, and Replay all: the comparison strip hits a mismatch and the workflow task fails with the exact event it violated.
Start with the happy path. Press Run. The history fills left to right — WFStarted, the workflow-task events, then ActSched(chargeCard), and when the activity completes, ActComp(=98) carrying the recorded result. The code panel’s active line walks down processOrder in lockstep, and once chargeCard returns you can see amount = 98 appear in the locals. This default order comes in under the amount > 100 threshold, so the branch is not taken: reserveInventory is skipped, reserved stays false, and the history goes straight to the timer and sendEmail. Every side effect the code took is now a recorded event; the worker’s memory holds nothing the history does not already say.
Now kill it. Let the workflow get a few activities in, then press Crash worker. The status flips to crashed, the code panel resets to nothing, and the locals are gone — that is the worker’s in-memory state being lost, exactly as in a real OOM. The history tape does not change. Whatever was recorded is still recorded.
Press Replay step and watch the reconstruction. A fresh worker re-executes processOrder from line one. Each step matches the command the code emits against the next recorded event — the comparison strip fills with green checks — and feeds the recorded result back in. The locals refill, now tagged (from history): amount = 98 (from history), reserved = false (from history). The function is reconstructing the exact state it had before the crash, one matched command at a time. Tap any event on the tape to pin its detail and confirm the recorded result is the one being replayed back.
Keep an eye on the activities executed counter through all of this. It does not move during replay. That is the load-bearing observation: replay sources every activity result from history, so chargeCard is never called a second time. The counter only climbs during live forward execution past the end of the recorded history, where new side effects actually run. A flat counter during replay is the exactly-once guarantee, made visible.
Failure modes
The same machinery that survives a crash for free is brutally unforgiving about one thing: the workflow function must be deterministic.
What actually breaks determinism. It is not exotic. Any value that differs between the original run and a replay will, if it influences a command, make the replayed code emit something history did not record. The usual suspects: reading the wall clock (time.Now()), drawing a random number or UUID outside the SDK’s recorded path, and — the one that surprises people — iterating a Go map, whose order is deliberately randomized per process. Temporal’s own docs call out “Range over map is a nondeterministic operation,” and Go workflow code must use workflow.Go and workflow channels rather than native goroutines and channels, because native scheduling order is not reproducible. The deterministic replacements exist for exactly these: workflow.Now() for time, workflow.SideEffect for entropy, workflow.Sleep for delays.
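The deterministic replacements work by routing the nondeterministic value through history. A minimal sketch of that idea, assuming a hypothetical Context class whose side_effect method is loosely modeled on the behavior described above and is not the SDK's actual API:

```python
import random


class Context:
    def __init__(self, history):
        self.history = history  # recorded side-effect values, in call order
        self.cursor = 0

    def side_effect(self, fn):
        """Run fn once on the original run; return the recorded value on replay."""
        if self.cursor < len(self.history):
            value = self.history[self.cursor]  # replay: never call fn again
        else:
            value = fn()                       # original run: draw and record
            self.history.append(value)
        self.cursor += 1
        return value


def pick_token(ctx):
    # Entropy enters the workflow only through the recorded path.
    return ctx.side_effect(lambda: random.randint(0, 10**9))


history = []
original = pick_token(Context(history))
replayed = pick_token(Context(history))
```

Because the draw is recorded, the replayed value equals the original no matter what the random generator would return the second time.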
Toggle Inject time.Now() in the figure to model this. With injection on, the branch decision consults a clock-like value instead of the recorded chargeCard amount, standing in for a naked time.Now() in workflow code. (The sim models nondeterminism as a context clock that differs between the original run and replay — a stand-in for a real wall-clock read, labeled as a simplification.) Let chargeCard complete so its result and the branch command are in history, then Crash worker and Replay all. The comparison strip hits a row where the recorded command and the emitted command disagree, paints it red, and the workflow task fails with a nondeterminism error naming the exact event the replay violated — emitted versus expected.
The history edge. There is a subtle limit to detection, and the figure surfaces it deliberately. Replay can only catch divergence when there is an already-recorded command event to contradict. If you crash before the divergent branch’s command event was written — crash while chargeCard is still in flight, before the branch decision has been recorded — then a replay with injection on has nothing to disagree with, and it completes cleanly into live continuation. The figure annotates this when it happens: the divergence fell past the history edge, so nothing was recorded to catch it. This is not a bug. The engine preserves what was recorded; state that never reached the server is gone. Real Temporal has the same edge for the same reason — only divergence behind the recorded edge is detectable — and the figure stages its demo so the injected divergence lands behind the edge when you want detection to fire.
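Both the detection and its limit can be sketched together. The assumptions here: Context and its command method are hypothetical, a clock integer stands in for a raw wall-clock read, and each history entry collapses a command with its result.

```python
class NondeterminismError(Exception):
    pass


class Context:
    def __init__(self, history, clock):
        self.history = history  # recorded [(command, result), ...]
        self.clock = clock      # differs between the original run and replay
        self.cursor = 0

    def command(self, name, live_result=None):
        if self.cursor < len(self.history):
            recorded, result = self.history[self.cursor]
            if recorded != name:
                raise NondeterminismError(
                    f"event {self.cursor}: history recorded {recorded!r}, "
                    f"code emitted {name!r}")
        else:
            result = live_result
            self.history.append((name, result))
        self.cursor += 1
        return result


def process_order(ctx):
    ctx.command("chargeCard", live_result=98)
    if ctx.clock % 2 == 0:  # the bug: branching on the clock, not on recorded data
        ctx.command("reserveInventory")
    ctx.command("sendEmail")


# Original run with an even clock: the branch command is recorded.
full_history = []
process_order(Context(full_history, clock=10))

# Replay with an odd clock against the full history: divergence is caught.
detected = None
try:
    process_order(Context(list(full_history), clock=11))
except NondeterminismError as err:
    detected = str(err)

# Crash before the branch was recorded: nothing in history contradicts the replay.
edge_history = full_history[:1]
process_order(Context(edge_history, clock=11))
```

The second replay finishes without an error and simply continues live, because the only recorded command is chargeCard and the buggy branch comes after it.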
What Temporal actually does on a nondeterminism error. This is the reassuring part of the footgun. The workflow task fails — not the workflow execution. The history is never modified, so the execution is not corrupted or lost; it is parked. Temporal retries the failed workflow task on a backoff “until the Workflow Execution Timeout, which is unlimited by default,” meaning a wedged workflow waits, intact, for a human to deploy a fix rather than rolling forward into a wrong state. That is why the bad-deploy scenario from the opening is recoverable: the executions are stuck, not broken. Roll back the offending code (or patch it correctly) and they resume from exactly where they jammed.
Versioning is the real-world fix. You cannot freeze workflow code forever, so the engines give you a way to change deterministic code without breaking executions already mid-flight. The pattern records a marker in history the first time new code runs, and branches on it forever after: old executions replay down the old path, new ones take the new path. In the Go SDK this is workflow.GetVersion(ctx, changeID, minSupported, maxSupported); in the TypeScript SDK it is patched(patchId) paired with deprecatePatch(patchId), and in the Python SDK workflow.patched(id) paired with workflow.deprecate_patch(id), with the deprecation call introduced once every pre-patch execution has drained. Either way the change is gated behind a recorded decision, so replay always knows which version of the code a given execution is entitled to.
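The marker pattern can be sketched as follows. This is a deliberate simplification, not the SDK's implementation: the Execution class, the DEFAULT_VERSION sentinel, the per-execution replaying flag, and the change ID "fraud-check-v2" are all hypothetical, and markers are keyed by change ID rather than by position in history.

```python
DEFAULT_VERSION = -1  # hypothetical sentinel: history predates the change


class Execution:
    def __init__(self, markers=None, replaying=False):
        self.markers = dict(markers or {})  # change_id -> version recorded in history
        self.replaying = replaying

    def get_version(self, change_id, min_supported, max_supported):
        if change_id in self.markers:
            version = self.markers[change_id]  # recorded decision wins
        elif self.replaying:
            version = DEFAULT_VERSION          # old history, no marker
        else:
            version = max_supported            # first run of the new code
            self.markers[change_id] = version  # record the marker
        if not (min_supported <= version <= max_supported):
            raise RuntimeError(f"{change_id}: version {version} is no longer supported")
        return version


def fraud_step(execution):
    version = execution.get_version("fraud-check-v2", DEFAULT_VERSION, 1)
    return "new-fraud-check" if version == 1 else "old-fraud-check"


old_execution = Execution(replaying=True)  # started before the deploy
old_path = fraud_step(old_execution)

new_execution = Execution()                # started after the deploy
new_path = fraud_step(new_execution)

# Replaying the new execution later reads the marker and stays on the new path.
replayed_path = fraud_step(Execution(markers=new_execution.markers, replaying=True))
```

Raising min_supported to 1 after the old executions drain is what retires the old branch. Any straggler replaying without a marker would then fail the range check instead of silently taking the wrong path.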
How Temporal actually does this
The figure replays the full history from the first event every time. Real Temporal does not, and the difference is a performance story worth knowing. A worker keeps a sticky cache of in-memory workflow state, so as long as the same worker keeps handling the same execution, it advances from cached state with no replay at all. Full replay from event one happens only on a cache miss — a new worker, a deploy, or an eviction. Workers evict cached executions when sticky_cache_size reaches the configured workflowCacheSize, and “an evicted Workflow Execution will need to be replayed when it gets any action that may advance it.” (The sim labels this: here every replay is a full replay; the sticky cache is the optimization it omits.) Replay is therefore the correctness fallback, not the steady-state path — which is what makes it affordable to lean on.
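A rough sketch of the cache-or-replay decision, assuming a least-recently-used eviction policy and a hypothetical StickyCache class. The rebuild_state function stands in for a full replay, and the sketch ignores advancing a cached entry when new events arrive.

```python
from collections import OrderedDict


def rebuild_state(history):
    """Stand-in for full replay: fold recorded results back into workflow locals."""
    return {name: result for name, result in history}


class StickyCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # workflow_id -> in-memory state, oldest first
        self.replays = 0              # how many full replays were needed

    def get(self, workflow_id, history):
        if workflow_id in self.entries:
            # Hit: advance from cached state, no replay.
            self.entries.move_to_end(workflow_id)
            return self.entries[workflow_id]
        # Miss: pay for a full replay from the first event.
        state = rebuild_state(history)
        self.replays += 1
        self.entries[workflow_id] = state
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used
        return state


histories = {
    "wf-a": [("amount", 98)],
    "wf-b": [("amount", 250)],
    "wf-c": [("amount", 40)],
}
cache = StickyCache(capacity=2)
for workflow_id in ["wf-a", "wf-b", "wf-a", "wf-c", "wf-b"]:
    cache.get(workflow_id, histories[workflow_id])
```

Five accesses cost four replays here: the second wf-a access is a hit, and wf-b has to be replayed again because wf-c pushed it out.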
The history that replay reads is not unbounded. A single workflow execution’s event history is capped at 51,200 events or 50 MB, with a warning logged at 10,240 events or 10 MB. Cross the hard limit and the execution is terminated, which is why long-lived or high-iteration workflows use Continue-As-New: atomically close the current execution and start a fresh one carrying the state forward, so the new history begins empty. A workflow that loops forever is really a chain of bounded histories.
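The chaining can be illustrated with a toy loop. The assumptions: HISTORY_LIMIT is a tiny hypothetical threshold chosen for readability, the ContinueAsNew exception and run_chain driver are illustrative names, and state is a plain dict carried between runs.

```python
HISTORY_LIMIT = 5  # hypothetical; far below the real limits quoted above


class ContinueAsNew(Exception):
    """Signals: close this execution and start a fresh one with this state."""

    def __init__(self, carried_state):
        super().__init__("continue as new")
        self.carried_state = carried_state


def counting_workflow(state, history):
    while state["processed"] < state["target"]:
        if len(history) >= HISTORY_LIMIT:
            raise ContinueAsNew(state)  # hand off before the history grows further
        history.append(("ActivityCompleted", state["processed"]))
        state = {**state, "processed": state["processed"] + 1}
    return state


def run_chain(state):
    """Drive executions until one completes; return final state and history sizes."""
    history_sizes = []
    while True:
        history = []  # every execution in the chain starts with an empty history
        try:
            final = counting_workflow(state, history)
            history_sizes.append(len(history))
            return final, history_sizes
        except ContinueAsNew as handoff:
            history_sizes.append(len(history))
            state = handoff.carried_state


final_state, history_sizes = run_chain({"processed": 0, "target": 12})
```

Twelve iterations end up spread over three executions with histories of 5, 5, and 2 entries, so no single history exceeds the threshold.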
The model has a lineage. Temporal is a fork of Uber’s Cadence, which the same founders built as an open-source implementation of the ideas behind AWS Simple Workflow Service (SWF) — the durable-execution pattern is roughly a decade old in production. AWS Step Functions and Azure Durable Functions reach a similar destination by different routes; the event-sourced-execution approach in the figure is the one Temporal and Cadence share.
Wrapping up
A durable-execution engine earns its complexity when your process has real control flow, multiple side effects that must each happen exactly once, and a long or interruptible lifetime — the order workflow that charges a card, conditionally reserves inventory, waits out a timer, and emails a receipt, surviving deploys and crashes the whole way. There you would otherwise be hand-rolling a status enum, a sweeper, and a serialized program counter, and getting the checkpoint-versus-side-effect race wrong in production. The engine’s determinism constraint is the price; resumable-as-ordinary-code is what you buy.
If your work is a flat fan-out of independent, naturally idempotent tasks — resize this image, send this one email, recompute this aggregate — a queue with idempotent consumers is simpler and has no determinism tax to pay. Reach for replay when “where in the function was I” is a question you would otherwise have to answer by hand. Reach for a queue when there is no “where” to lose.
Sources
- Temporal documentation, Workflow definition: the determinism contract (same commands in the same sequence given the same input), branching on local time or random numbers as the canonical violation, and that non-deterministic work belongs in Activities.
- Temporal documentation, Event History and Temporal Cloud limits: the append-only event history, the 51,200-event / 50 MB hard limit, the 10,240-event / 10 MB warning, and Continue-As-New as the mitigation.
- Temporal documentation, Worker performance: the sticky Workflow cache (sticky_cache_size, workflowCacheSize), and that an evicted execution is replayed on its next advancing action.
- Temporal documentation, Go versioning and TypeScript versioning: workflow.GetVersion and the patched / deprecatePatch patching API for changing workflow code without nondeterminism errors.
- Temporal documentation, Failures and Go message passing: workflow-task failures retrying until the (default-unlimited) execution timeout without corrupting the execution, and “Range over map is a nondeterministic operation.”
- Temporal blog, Workflow Engine Principles: the history/matching/transfer-queue architecture and the Cadence-from-SWF lineage.