
Recover pinned Workflows after a bad rollout

This runbook covers how to recover pinned Workflows after rolling out a Worker Deployment Version that turned out to be faulty. Use it when a new code version has caused pinned Workflows to fail, time out, or get stuck retrying Workflow Tasks.

This page assumes you have already configured Worker Versioning and that the affected Workflows are pinned to a specific Worker Deployment Version.

Prerequisites
  • Worker Versioning is enabled and the affected Workflows are pinned.
  • Your Worker fleet uses blue-green or rainbow deployments, not rolling upgrades.
  • You can run the temporal CLI against the affected Namespace.

Stop the rollout

Stop sending new Workflows to the faulty version before you do anything else.

If the bad Version is currently ramping, set the ramp percentage to zero:

temporal worker deployment set-ramping-version \
--deployment-name "YourDeploymentName" \
--build-id "YourBadBuildID" \
--percentage 0

If the bad Version has already become the Current Version, switch the Current Version back to the previous good Version:

temporal worker deployment set-current-version \
--deployment-name "YourDeploymentName" \
--build-id "YourPreviousBuildID"

After either change, new Workflows stop landing on the bad Version. Existing pinned Workflows still execute on the bad Version until you recover them.

Identify affected Workflows

Use Search Attributes to find Workflows running on or affected by the bad Version.

Useful filters:

  • ExecutionStatus — for example, Running, Failed, or TimedOut.
  • TemporalWorkerDeploymentVersion — formatted as 'YourDeploymentName:YourBuildID'.
  • TemporalReportedProblems — accepts values like category=WorkflowTaskFailed or category=WorkflowTaskTimedOut. See Detecting Workflow Task Failures.
  • WorkflowType — for example, 'OrderProcessing'.

Use temporal workflow count to quickly check how many Workflows match a query. For Workflows that are still retrying tasks after the upgrade:

temporal workflow count \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND ExecutionStatus='Running' \
AND TemporalReportedProblems IN ('category=WorkflowTaskFailed', 'category=WorkflowTaskTimedOut')"

For closed Workflows that failed:

temporal workflow count \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND (ExecutionStatus='Failed' OR ExecutionStatus='TimedOut')"
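The two queries above share the same Version filter, so it can help to build them in one place and reuse them across count, list, and recovery commands. The following is a minimal sketch, not part of the temporal CLI: the function names version_filter, retrying_query, and closed_failed_query are hypothetical helpers introduced here for illustration, and they only assemble the query strings already shown in this runbook.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the temporal CLI) that assemble the
# Visibility queries used in this runbook.

# version_filter DEPLOYMENT_NAME BUILD_ID
# Prints the filter that matches Workflows on one Worker Deployment Version.
version_filter() {
  printf "TemporalWorkerDeploymentVersion='%s:%s'" "$1" "$2"
}

# retrying_query DEPLOYMENT_NAME BUILD_ID
# Prints a query for running Workflows that reported Workflow Task problems.
retrying_query() {
  printf "%s AND ExecutionStatus='Running' AND TemporalReportedProblems IN ('category=WorkflowTaskFailed', 'category=WorkflowTaskTimedOut')" \
    "$(version_filter "$1" "$2")"
}

# closed_failed_query DEPLOYMENT_NAME BUILD_ID
# Prints a query for closed Workflows that failed or timed out.
closed_failed_query() {
  printf "%s AND (ExecutionStatus='Failed' OR ExecutionStatus='TimedOut')" \
    "$(version_filter "$1" "$2")"
}

# Example usage (requires the temporal CLI and access to the Namespace):
#   temporal workflow count \
#     --query "$(retrying_query YourDeploymentName YourBadBuildID)"
```

Building the query once reduces the risk that the count you inspect and the batch you later act on use slightly different filters.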

To get the Workflow Id and Run Id of matching executions, use temporal workflow list with JSON output and extract the relevant fields with jq:

temporal workflow list --output json \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND (ExecutionStatus='Failed' OR ExecutionStatus='TimedOut')" \
| jq '.[].execution'

Example output:

{
"workflowId": "worker-versioning-pinned-2_032f7b06-f3a0-47a7-a7c2-949fcce7fc42",
"runId": "019e9a92-1d8e-7a43-a345-721351d2d544"
}
{
"workflowId": "worker-versioning-pinned-2_99e7c4ac-74cd-48c5-ae2e-94aa3c67c36f",
"runId": "019e9a91-e8e3-765b-aba8-3a7002ec7d6c"
}

Choose a recovery strategy

The right recovery strategy depends on three questions about each affected Workflow:

  1. Is the Workflow closed, or are its tasks still retrying?
  2. Can the Workflow safely re-execute from the start of its current run? Workflows that can are called restartable in this runbook. Whether a Workflow is restartable is a property of the Workflow design and must be documented or annotated (for example, via a Custom Search Attribute) by the team that owns it.
  3. Has the Workflow's internal state been corrupted? Detecting state corruption is difficult to scale. In practice, most teams filter by Workflow Type and make conservative assumptions for an entire batch rather than per-instance.

The answers map to recovery strategies as follows:

Workflow state | Restartable? | Strategy
Running, tasks retrying, state intact | Yes | Reset-with-Move to FirstWorkflowTask on the previous good Version.
Running, tasks retrying, state intact | No | Versioning Override to a new replay-safe Version.
Running, recently corrupted state | No | Reset-with-Move to LastWorkflowTask on a new replay-safe Version.
Closed (Failed, Completed, TimedOut) | Either | Reset-with-Move to FirstWorkflowTask. Critical state may need out-of-band compensation.
Stateless or simple replacement is acceptable | Either | Terminate (if still running) and start new Workflows with the original arguments and the new Version.
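If you triage batches with a script, the mapping can be encoded as a small lookup. The sketch below is illustrative only: choose_strategy, its argument values, and the strategy labels it prints are hypothetical names introduced here, not Temporal terminology. It covers the first four rows; the "terminate and start new" row is a judgment call it does not attempt to make, and any combination the mapping does not list (for example, a restartable Workflow with corrupted state) falls through to manual review.

```shell
#!/usr/bin/env bash
# choose_strategy STATE RESTARTABLE CORRUPTED
#   STATE:       running | closed
#   RESTARTABLE: yes | no
#   CORRUPTED:   yes | no
# Prints a hypothetical label for the recovery strategy from the mapping above.
choose_strategy() {
  local state=$1 restartable=$2 corrupted=$3
  case "$state:$restartable:$corrupted" in
    closed:*:*)     echo "reset-first-workflow-task" ;;
    running:yes:no) echo "reset-first-workflow-task-previous-version" ;;
    running:no:no)  echo "versioning-override-replay-safe-version" ;;
    running:no:yes) echo "reset-last-workflow-task-replay-safe-version" ;;
    # Anything not listed in the mapping needs a human decision.
    *)              echo "manual-review" ;;
  esac
}

# Example usage:
#   choose_strategy running no no
```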

For Workflows still retrying without state corruption, you may need to use the Patching APIs to make a new Version replay-safe before pointing Workflows at it.

Recover Workflows

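Temporal exposes two recovery primitives, both available through the CLI or directly through the Worker Versioning APIs (see Moving a pinned Workflow):

  • Versioning Override: temporal workflow update-options re-pins a running Workflow to a different Version. No Reset is performed.
  • Reset-with-Move: temporal workflow reset with-workflow-update-options resets a Workflow and applies a Versioning Override as part of the same operation.

Both commands accept a --query argument for batch operations.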

Reset restartable Workflows to the previous Version

Schedule a batch Reset-with-Move targeting the start of execution on the previous good Version. Use --reapply-exclude All to skip re-applying signals and Updates, which is typically the right choice for a clean restart:

temporal workflow reset with-workflow-update-options \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND ExecutionStatus='Running' \
AND WorkflowType='YourWorkflowType' \
AND TemporalReportedProblems IN ('category=WorkflowTaskFailed', 'category=WorkflowTaskTimedOut')" \
--reason "Reset restartable Workflow to YourPreviousBuildID" \
--versioning-override-behavior pinned \
--versioning-override-build-id "YourPreviousBuildID" \
--versioning-override-deployment-name "YourDeploymentName" \
--reapply-exclude All \
--type FirstWorkflowTask \
--output json --yes

Move running Workflows to a replay-safe Version

For Workflows whose tasks are still retrying and whose state is intact, apply a Versioning Override to a new replay-safe Version. No Reset is needed:

temporal workflow update-options \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND ExecutionStatus='Running' \
AND WorkflowType='YourWorkflowType' \
AND TemporalReportedProblems IN ('category=WorkflowTaskFailed', 'category=WorkflowTaskTimedOut')" \
--versioning-override-behavior pinned \
--versioning-override-build-id "YourGoodBuildID" \
--versioning-override-deployment-name "YourDeploymentName" \
--output json --yes

Roll back recently corrupted Workflows

When a Workflow's state was corrupted recently but tasks are still retrying, you can sometimes recover by resetting to LastWorkflowTask on a replay-safe Version. This re-applies pending signals and Updates:

temporal workflow reset with-workflow-update-options \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND ExecutionStatus='Running' \
AND WorkflowType='YourWorkflowType' \
AND TemporalReportedProblems IN ('category=WorkflowTaskFailed', 'category=WorkflowTaskTimedOut')" \
--reason "Reset corrupted Workflow to YourGoodBuildID" \
--versioning-override-behavior pinned \
--versioning-override-build-id "YourGoodBuildID" \
--versioning-override-deployment-name "YourDeploymentName" \
--type LastWorkflowTask \
--output json --yes

Recover closed Workflows

Closed Workflows (Failed, Completed, TimedOut) need Reset-with-Move. Choose ExecutionStatus values that match the failure mode:

temporal workflow reset with-workflow-update-options \
--query "TemporalWorkerDeploymentVersion='YourDeploymentName:YourBadBuildID' \
AND (ExecutionStatus='Completed' OR ExecutionStatus='Failed') \
AND WorkflowType='YourWorkflowType'" \
--reason "Reset closed Workflow to YourGoodBuildID" \
--versioning-override-behavior pinned \
--versioning-override-build-id "YourGoodBuildID" \
--versioning-override-deployment-name "YourDeploymentName" \
--reapply-exclude All \
--type FirstWorkflowTask \
--output json --yes

Warning: not idempotent

Resetting a closed Workflow does not change the status of the prior closed execution. Re-running the same command will reset the same closed Workflows again, terminating each previous reset attempt and starting another new run. Plan to run this command exactly once per affected batch, after the bad Version has fully drained.

The earlier batch commands targeting Running Workflows are idempotent because they filter on TemporalWorkerDeploymentVersion and ExecutionStatus='Running'. Once a Workflow is moved off the bad Version, it stops matching the query.
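One way to enforce the run-exactly-once guidance for closed Workflows is to wrap the non-idempotent command in a guard that records each attempt. The following is a minimal sketch under stated assumptions: run_once and the marker file path are hypothetical, and the guard is local to the machine that runs it, so it does not protect against a colleague running the same command elsewhere. It writes the marker before running the command, so a failed or interrupted attempt is never retried blindly and needs manual inspection instead.

```shell
#!/usr/bin/env bash
# run_once MARKER_FILE COMMAND [ARGS...]
# Runs COMMAND at most once per MARKER_FILE (hypothetical helper).
run_once() {
  local marker=$1
  shift
  if [ -e "$marker" ]; then
    echo "skipping: $marker exists, this batch was already attempted" >&2
    return 0
  fi
  # Record the attempt first so that a failed or interrupted run is
  # investigated by a person rather than repeated automatically.
  : > "$marker" || return 1
  "$@"
}

# Example usage (marker path is an assumption; pick one per affected batch):
#   run_once ./reset-closed-YourBadBuildID.done \
#     temporal workflow reset with-workflow-update-options ...
```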

Handle eventual consistency

The Visibility store is eventually consistent, so a single run of a query may not return every affected Workflow.

Use the drainage status of the bad Version as a signal that the Visibility index has caught up. A Version is drained when no new Workflows are expected on it and all existing pinned Workflows on it are closed.

Check drainage status:

temporal worker deployment describe-version \
--deployment-name "YourDeploymentName" \
--build-id "YourBadBuildID" \
--output json \
| jq .drainageInfo.drainageStatus

Recommended approach:

  1. Repeat the idempotent recovery commands on Running Workflows until the drainage status reports drained. The Temporal Service refreshes drainage status periodically, so it may take a few minutes after the last running Workflow closes.
  2. Once the Version is drained, run the non-idempotent Reset-with-Move command against closed Workflows once.
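Step 1 of this approach can be sketched as a polling loop. This is an illustration rather than a supported tool: the function names, the DEPLOYMENT_NAME and BAD_BUILD_ID variables, and the default attempt count and pause are all hypothetical. It also assumes that the reported drainage status contains the word "drained" in some letter case once the Version has drained; check the actual value your Temporal Service returns and adjust is_drained accordingly.

```shell
#!/usr/bin/env bash
# Reads the drainage status using the describe-version command from this
# runbook. DEPLOYMENT_NAME and BAD_BUILD_ID are hypothetical variables.
get_drainage_status() {
  temporal worker deployment describe-version \
    --deployment-name "$DEPLOYMENT_NAME" \
    --build-id "$BAD_BUILD_ID" \
    --output json \
    | jq -r .drainageInfo.drainageStatus
}

# is_drained STATUS
# Assumption: a drained Version reports a status containing "drained".
is_drained() {
  local status
  status=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$status" in
    *drained*) return 0 ;;
    *)         return 1 ;;
  esac
}

# recover_until_drained RECOVERY_FN [MAX_ATTEMPTS] [SLEEP_SECONDS]
# Repeats an idempotent recovery function until the Version is drained.
recover_until_drained() {
  local recover=$1 max=${2:-30} pause=${3:-60} attempt=1
  while [ "$attempt" -le "$max" ]; do
    if is_drained "$(get_drainage_status)"; then
      echo "drained after $((attempt - 1)) recovery pass(es)"
      return 0
    fi
    "$recover"
    sleep "$pause"
    attempt=$((attempt + 1))
  done
  echo "not drained after $max attempts" >&2
  return 1
}

# Example usage: define a function that wraps one of the idempotent batch
# commands for Running Workflows, then pass its name:
#   recover_until_drained my_recovery_fn 30 60
```

Only pass idempotent recovery commands to the loop. The Reset-with-Move command for closed Workflows belongs in step 2, after the loop reports that the Version has drained.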

See Sunsetting an old Deployment Version for more on drainage states.

Clean up the drained Version

After the bad Version has drained and all affected closed Workflows have been recovered, stop the Workers on the bad Version and delete the Version:

temporal worker deployment delete-version \
--deployment-name "YourDeploymentName" \
--build-id "YourBadBuildID"

See temporal worker deployment delete-version for prerequisites on deletion (the Version must not be Current, Ramping, or have active pollers, and it must be drained unless you pass --skip-drainage).

Summary

Recovering pinned Workflows from a faulty Worker Deployment Version takes the following steps:

  1. Stop the rollout by ramping to zero or reverting the Current Version.
  2. Identify affected Workflows with TemporalWorkerDeploymentVersion and TemporalReportedProblems queries.
  3. Choose a strategy based on execution status, restartability, and state integrity.
  4. Recover using Versioning Override or Reset-with-Move: repeat the idempotent commands for running Workflows while the Version drains, then reset closed Workflows once.
  5. Clean up by deleting the drained Version once all affected Workflows are recovered.