Skip to content

[slop]fix(workflow-engine): only commit step state after success#5010

Draft
abcxff wants to merge 1 commit into
05-06-fix_rivetkit_bind_methods_through_createwritethroughproxyfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success
Draft

[slop]fix(workflow-engine): only commit step state after success#5010
abcxff wants to merge 1 commit into
05-06-fix_rivetkit_bind_methods_through_createwritethroughproxyfrom
05-09-fix_workflow-engine_only_commit_step_state_after_success

Conversation

@abcxff
Copy link
Copy Markdown
Contributor

@abcxff abcxff commented May 9, 2026

Description

Please include a summary of the changes and the related issue. Please also include relevant motivation and context.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

How Has This Been Tested?

Please describe the tests that you ran to verify your changes.

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@railway-app
Copy link
Copy Markdown

railway-app Bot commented May 9, 2026

🚅 Deployed to the rivet-pr-5010 environment in rivet-frontend

Service Status Web Updated (UTC)
kitchen-sink ❌ Build Failed (View Logs) Web May 14, 2026 at 6:09 pm
frontend-cloud ❌ Build Failed (View Logs) Web May 13, 2026 at 8:20 pm
frontend-inspector ❌ Build Failed (View Logs) Web May 13, 2026 at 8:20 pm
website 😴 Sleeping (View Logs) Web May 12, 2026 at 3:59 am
ladle ❌ Build Failed (View Logs) Web May 9, 2026 at 3:40 pm
mcp-hub ✅ Success (View Logs) Web May 9, 2026 at 3:40 pm

Copy link
Copy Markdown
Contributor Author

abcxff commented May 9, 2026

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.
Learn more

This stack of pull requests is managed by Graphite. Learn more about stacking.

@abcxff abcxff changed the title fix(workflow-engine): only commit step state after success [slop]fix(workflow-engine): only commit step state after success May 9, 2026
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 9, 2026

Code Review: fix(workflow-engine): only commit step state after success

Note: This PR is still in Draft status with an empty description/checklist.

Overview

This PR fixes a correctness bug in workflow-engine where actor state mutations inside a failing step were being persisted even though the step needs to be retried. On retry, the step would execute against incorrectly mutated state. The fix:

  1. Introduces RAW_STATE_SYMBOL to bypass the write-through proxy and read raw actor state.
  2. Wraps every step() and tryStep() call in #withActorAccessAndStateRollback, which snapshots state+vars before the step and rolls them back on any error.
  3. Removes the per-branch flushStorage() calls in step failure catch blocks, consolidating to the single flush in setRetryState / executeWorkflow's unrecoverable-error path.
  4. Removes the stepData.error fallback (justified by the new test).

The core idea is correct and addresses a real workflow replay bug.


Bug: Breaks stateless actors using workflow steps

withActorAccessAndStateRollback unconditionally calls this.#runCtx[RAW_STATE_SYMBOL]() (workflow/context.ts:618), but ActorContextHandleAdapter[RAW_STATE_SYMBOL] throws stateNotEnabledError() when the actor has no state configured (native.ts:2460-2465).

This means any actor that uses workflow steps but does not have a state field will crash on the snapshot call, breaking all ctx.step() invocations regardless of whether the step touches state. The guard needs to handle the state-disabled case, for example by returning undefined and skipping the rollback assignment when #stateEnabled is false.

Concern: structuredClone(vars) can throw for non-cloneable objects

vars are runtime-in-memory (not persisted), so they can legitimately hold class instances, Promises, or WeakRefs. structuredClone throws a DataCloneError on these, crashing the snapshot before the step runs. Worth documenting that vars must be structured-cloneable when workflows are used, or guarding the clone.

Minor: Notification fires before flush for timeout/critical errors

Previously flushStorage() was called before notifyStepError for StepTimeoutError/CriticalError. Now the notify fires before the eventual top-level flush in executeWorkflow. This slightly widens the crash window between notification and persistence (duplicate notification possible on replay). Not a durability regression, but worth a comment.


Code Quality

Positive changes:

  • Consolidating entry.kind.data.error and entry.dirty = true before all the if-branches eliminates three identical blocks.
  • Removing the double-flush on failure (catch block plus setRetryState) is correct; the top-level flush in executeWorkflow guarantees persistence on all error paths.
  • Removing the stepData.error fallback is justified since the new test confirms the error is always written to the step entry.

Issues:

  1. Misleading test name (steps.test.ts:573): 'should not commit step error data to entry on failure' but the test asserts expect(entry.kind.data.error).toBe('Error: step failed'), i.e. the error is committed. Rename to 'should commit error but not output to step entry on failure'.

  2. Bare catch {} in the new test's first-run block silently swallows unexpected errors. Add a comment explaining what error is expected.

  3. // @ts-nocheck in workflow/context.ts: the RAW_STATE_SYMBOL call is unverified by the type system. The interface declares the method returning TState, with TState = never for stateless actors the type system would surface the mismatch if checking were enabled.

  4. NativeConnAdapter[RAW_STATE_SYMBOL] (native.ts:1162): added for interface completeness. The workflow rollback path only flows through ActorContextHandleAdapter, so this is dead code in practice. Fine as-is.


Test Coverage

  • New test is a good regression guard for the flush behavior; see naming concern above.
  • Removed StripStepHistoryErrorDriver test is a coherent removal - it tested the stepData.error fallback that was deliberately deleted.
  • Missing: no test covers stateless actor plus workflow steps. Once the crash is fixed, a test like 'should not crash when actor has no state and step fails' would prevent regression.
  • Loop expectation change (toBe(2) to toBe(3)) appears correct: crash at iteration 3, state now only persisted after successful Loop.continue at end of iteration 2, so iterations 3-5 need to re-execute on resume.

Summary

Area Finding
Core logic Correct - rollback on step failure fixes a real replay bug
Bug Stateless actors with workflow steps crash on RAW_STATE_SYMBOL() before step runs
Concern structuredClone(vars) throws if vars contains non-cloneable objects
Minor Notification fires before flush for exhausted/timeout errors (slightly wider crash window)
Test name 'should not commit step error data' contradicts what the test actually asserts
PR hygiene Description, type checkboxes, and test description are all empty; PR is still in draft

The core approach is sound. Please address the stateless actor crash (and consider the vars clone concern) before merging.

@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 11, 2026 03:41
@abcxff abcxff force-pushed the graphite-base/5010 branch from 56af1d1 to aac9634 Compare May 12, 2026 12:55
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from cc1baa6 to 9304f04 Compare May 12, 2026 12:55
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan May 12, 2026 12:55
@abcxff abcxff changed the base branch from 05-07-fix_pegboard_validate_drain_grace_period_request_lifespan to graphite-base/5010 May 12, 2026 13:21
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the graphite-base/5010 branch from aac9634 to 5f7bac7 Compare May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9304f04 to 0668fa7 Compare May 12, 2026 15:24
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-12-temp_fix_wasm-pack May 12, 2026 15:24
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 0668fa7 to 2c029c2 Compare May 12, 2026 17:46
@abcxff abcxff mentioned this pull request May 12, 2026
11 tasks
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2c029c2 to 9b64fa5 Compare May 12, 2026 18:03
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 9b64fa5 to 0eb8f51 Compare May 12, 2026 18:44
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 7b5d681 to 8625f84 Compare May 13, 2026 18:31
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 31975cf to 26b0baa Compare May 13, 2026 18:31
@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

[Updated review - see below for latest]

@claude
Copy link
Copy Markdown
Contributor

claude Bot commented May 13, 2026

Code Review: fix(workflow-engine): only commit step state after success

Overview

This PR makes two related improvements:

  1. Adds a RAW_STATE_SYMBOL escape hatch to bypass the write-through proxy, used in a new #withActorAccessAndStateRollback wrapper that snapshots actor state/vars before a workflow step and restores them on failure.
  2. Consolidates the duplicate entry.dirty = true / entry.kind.data.error = … assignments in WorkflowContextImpl step error paths and removes the intermediate flushStorage() calls on failure.

Correctness concerns

Durability regression from removed flushStorage() calls

Previously, on any step failure the engine immediately flushed dirty state to storage before propagating the error. That flush is now gone for all failure paths (timeout, critical, rollback, normal retry/exhaustion). The dirty bit IS set, so state will flush on the next triggered flush -- but if the process crashes between the step failure and that flush, the failure record (attempt count, error message, exhaustion status) is lost and the step is re-executed as if the failure never happened.

For retryable steps this just means an extra attempt, which is tolerable if steps are idempotent. For StepTimeoutError / CriticalError / RollbackError that mark the step as exhausted, the missing flush means a crashed process restarts and re-runs an already-exhausted step, bypassing the non-retry guarantee those error types were meant to enforce.

The removed test (StripStepHistoryErrorDriver) and the stepData.error ?? metadata.error fallback were guarding exactly this crash window. Removing them while also removing the intermediate flushes is internally consistent, but it weakens the durability guarantees without calling that out anywhere.

structuredClone on vars throws for non-serializable values

const varsSnapshot = structuredClone(this.#runCtx.vars);

TVars is unconstrained --- users can put class instances, functions, Map/Set, WeakRef, etc. in vars. structuredClone throws a DataCloneError on any non-serializable value. If a user's actor has non-serializable vars, every workflow step will throw before the user's run function is even called. Actor state is safe here because the setter calls assertJsonCompatValue, but vars has no equivalent constraint. This needs either a guard with a clear user-facing error, or explicit documentation that workflow steps require structuredClone-compatible vars.

set state = stateSnapshot schedules an unneeded save

this.#runCtx.state = stateSnapshot;

This calls through to #writeState(value, { scheduleSave: true }), queuing a KV persistence save even though the snapshot is identical to the pre-step state. Consider adding a scheduleSave: false path or louting through an internal write helper for the rollback case.


Style / conventions

  • @ts-nocheck at the top of workflow/context.ts masks the fact that NativeConnAdapter and ActorContextHandleAdapter implement [RAW_STATE_SYMBOL`` returning unknownwhile theActorContextinterface declares it as returningTState`. Not new to this PR but worth tracking.
  • Stray blank line left after the removed StripStepHistoryErrorDriver class in steps.test.ts.
  • Empty PR description --- all checklist boxes unchecked and no summary of the motivating bug. Makes the durability trade-off especially hard to evaluate without reading the whole diff.

Positive changes

  • Consolidating the three separate copies of entry.kind.data.error = String(error) / entry.dirty = true into a single block at the top of the catch clause is a clean deduplication.
  • RAW_STATE_SYMBOL is the right approach for bypassing the write-through proxy without widening the public API.
  • The loops.test.ts update correctly reflects the new flush timing: state is committed after a successful Loop.continue, so a crash at iteration 3 now resumes from 3 rather than 2.
  • The new steps.test.ts test directly verifies that output is absent and error is present in the KV entry after a step failure.

Suggestions

  1. Guard or test structuredClone(vars) at step entry time with a clear error if vars contain non-serializable values.
  2. Reconsider removing flushStorage() from the exhaustion paths (StepTimeoutError, CriticalError, RollbackError). Those paths are non-retriable and the old flush guaranteed crash-safety for exhausted steps. Either keep the flush for those paths or explicitly document the weaker durability as an accepted trade-off.
  3. Fill in the PR description with the motivating bug and the semantics of the new behavior.

@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 8625f84 to 46973f2 Compare May 13, 2026 18:49
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 26b0baa to 1344263 Compare May 13, 2026 18:49
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 1344263 to ec51340 Compare May 13, 2026 19:11
@abcxff abcxff force-pushed the 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null branch from 46973f2 to c5ef70f Compare May 13, 2026 19:11
@abcxff abcxff changed the base branch from 05-13-fix_rivetkit_decode_bridge_errors_with_actor_null to graphite-base/5010 May 13, 2026 20:07
@abcxff abcxff force-pushed the graphite-base/5010 branch from c5ef70f to 8c70217 Compare May 13, 2026 20:07
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from ec51340 to 42fadeb Compare May 13, 2026 20:07
@abcxff abcxff changed the base branch from graphite-base/5010 to 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy May 13, 2026 20:08
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 42fadeb to 2258fbe Compare May 13, 2026 20:09
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch from 8c70217 to 00af8a0 Compare May 13, 2026 20:09
@abcxff abcxff force-pushed the 05-06-fix_rivetkit_bind_methods_through_createwritethroughproxy branch 2 times, most recently from ad1ebe6 to 6d4d578 Compare May 13, 2026 20:15
@abcxff abcxff force-pushed the 05-09-fix_workflow-engine_only_commit_step_state_after_success branch from 2258fbe to aa538e5 Compare May 13, 2026 20:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant