Stage incorrectly skipped despite changed input dependencies (stale cache after tracked file change) #432

@sjawhar

Description

Summary

A stage is incorrectly marked as skipped (cached) during pivot repro --all even though its input dependency (a tracked .pvt file) has changed. The skip decision is incorrect, and downstream stages then fail because they receive stale output data.

Confusingly, pivot repro --explain correctly reports the stage as WILL RUN with "Input dependencies changed", while the actual execution skips it.

Pivot Version

Commit e56b5b86dee6be46034fc8ffb6c0ada8f097979b (0.1.0-dev), installed from https://github.com/sjawhar/pivot subdirectory packages/pivot.

Reproduction Scenario

Setup

  • Multi-pipeline project using Pipeline("sub_pipeline") in subdirectories
  • Main pipeline discovers sub-pipelines via pivot repro --all
  • A tracked file (managed via .pvt) is a dependency of a base stage
  • The tracked file's content is modified (new agents added to a data spec YAML)
  • pivot track --force is NOT run after the modification (the .pvt file hash may or may not match the current file)

Steps to Reproduce

  1. Modify a tracked file that is a dependency of a pipeline stage
  2. Run pivot repro --all
  3. Observe: the stage that depends on the tracked file is reported as skipped
  4. Observe: downstream stages fail because they consume stale outputs

Expected Behavior

The stage should detect the changed dependency and re-run, producing updated outputs for downstream consumers.

Actual Behavior

  • pivot repro --all reports the stage as skipped
  • pivot repro --explain (for the same stage) correctly says WILL RUN with reason "Input dependencies changed"
  • Downstream stages fail with KeyError / ValueError because the output data is stale

Workaround

Running pivot run --force <stage_name> correctly re-executes the stage and fixes all downstream failures.

Diagnostic Evidence

Warning Message (from pivot repro --all output)

Tracked file '/path/to/data_spec.yaml' has changed since tracking. Run 'pivot track --force data_spec.yaml' to update.

This warning is emitted at executor/core.py:427-429 but does not prevent the stale skip decision. It's purely informational.

Lock File State (pre-fix)

The stage's lock file had OLD dependency hashes that did not match the current file content:

deps:
- path: path/to/data_spec.yaml
  hash: fb4687ad489edec1    # OLD hash in lock file

Current file hash: ca11b4f4b5e8fe3e (different → should trigger re-run)
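
The Tier-2 comparison this evidence points at reduces to a plain hash check. A minimal self-contained sketch (the digest function and lock-entry layout here are illustrative assumptions, not Pivot's actual implementation):

```python
import hashlib

def file_hash(data: bytes) -> str:
    # Illustrative 16-hex-digit digest, same shape as the hashes shown above
    return hashlib.blake2b(data, digest_size=8).hexdigest()

# Lock file entry recorded before the tracked file was modified
lock_entry = {"path": "path/to/data_spec.yaml", "hash": file_hash(b"agents: [a, b]")}

# The file on disk now has new content (e.g. new agents added)
current = file_hash(b"agents: [a, b, c]")

# Tier 2 should see the mismatch and decide "run"
decision = "run" if current != lock_entry["hash"] else "skip"
print(decision)  # → run
```

Given the lock file above holds the old hash while the file content changed, this check alone would decide "run" — which is why Tier 3 is the prime suspect.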

StateDB State

# Generation counters for tracked files are None (not tracked by any producing stage)
state_db.get_generation("data_spec.yaml")  # → None

# dep_generations for the stage only has entries for SOME deps (the ones produced by other stages)
state_db.get_dep_generations("base_filter_runs_for_reports")
# → only 2 of ~8 deps have generation entries

Cascade

The incorrect skip cascades through the DAG:

base_filter_runs_for_reports: skipped  (WRONG - dep changed)
  → compute_task_weights: skipped       (wrong - upstream stale)
    → wrangle_bootstrap_logistic: skipped (wrong - upstream stale)
      → plot_bootstrap_ci: FAILED         (KeyError - stale data)
      → compute_trendline_ci: FAILED      (ValueError - stale data)
      → generate_benchmark_results: FAILED (KeyError - stale data)

Analysis of Skip Detection

Three-Tier Skip Detection (executor/worker.py)

  1. Tier 1 — Generation tracking (can_skip_via_generation, line 1017):

    • For tracked files (deps with no producing stage), get_generation() returns None
    • At line 1074-1076: if current_gen is None: return False → correctly falls through
    • This tier correctly does NOT cause the skip (verified via StateDB inspection)
  2. Tier 2 — Lock file comparison (_check_skip_or_run, line 465):

    • Compares current dep hashes against lock file hashes
    • The lock file has OLD hash → should detect change → should return "run"
    • This tier should correctly detect the change
  3. Tier 3 — Run cache (_try_skip_via_run_cache, line 302):

    • Checks if the current input_hash matches a previously executed run
    • If found, restores outputs from cache and returns SKIPPED
    • Critically: at line 333, increment_outputs=False — does NOT bump output generation counters
    • This means downstream stages still see old generation counters → they skip via Tier 1
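
The interaction between Tier 3 and Tier 1 can be modeled in a few lines. This is a toy sketch, not worker.py's real logic; all names and data structures are illustrative:

```python
# Toy model: per-output generation counters, and a run cache keyed by input hash.
generations = {"base_output": 1}                       # last published generation
dep_generations = {"downstream": {"base_output": 1}}   # what downstream last consumed
run_cache = {"input_hash_X": {"base_output": "old artifact"}}

def try_skip_via_run_cache(input_hash: str, increment_outputs: bool = False) -> bool:
    hit = run_cache.get(input_hash)
    if hit is None:
        return False
    # Outputs restored from CAS; with increment_outputs=False the generation
    # counter is NOT bumped, so downstream sees nothing new.
    if increment_outputs:
        generations["base_output"] += 1
    return True

def can_skip_via_generation(stage: str) -> bool:
    # Tier 1: skip if every dep generation matches what was recorded last time
    return all(generations[d] == g for d, g in dep_generations[stage].items())

skipped = try_skip_via_run_cache("input_hash_X")       # Tier 3 hit, no bump
downstream_skips = can_skip_via_generation("downstream")
print(skipped, downstream_skips)  # → True True
```

With `increment_outputs=True` on the cache hit, the downstream Tier-1 check would see a newer generation and fall through to its own hash comparison instead of silently skipping.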

Hypothesis: Run Cache Causes Stale Skip Cascade

The most likely cause of the observed behavior:

  1. The tracked file was changed to a state that was previously computed (or the input_hash matched a previous run due to hash collision or content equivalence)
  2. Tier 3 (run cache) found the match and restored old outputs from CAS
  3. The lock file was updated with current dep hashes (line 319-324)
  4. But increment_outputs=False (line 333) meant downstream stages' dep generation counters were NOT bumped
  5. Downstream stages' Tier 1 check saw matching (stale) generation counters → skipped
  6. The cascade propagated: all downstream stages skipped, leading to failures when stages finally tried to read the stale output data

Alternative Hypothesis: Explain vs Execution Path Divergence

pivot repro --explain and pivot repro --all may use different StateDB instances or state_dir paths:

  • Explain (status.py:112): Opens state_db_path = config.get_state_dir() / "state.db" (project root)
  • Execution (worker.py:179): Opens state_db_path = stage_info["state_dir"] / "state.db" (possibly sub-pipeline root)

If the sub-pipeline's state_dir has a different .pivot/state.db with stale generation data, the skip decision could diverge from what --explain computes.

Evidence: Sub-pipeline stages have state_dir = model_reports/time_horizon_1_1/.pivot/, not the project root .pivot/.
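
The divergence is visible in path terms alone. A sketch (the config/stage_info accessors stand in for the calls cited above and are assumptions):

```python
from pathlib import Path

project_root = Path("/repo")

# Explain path (status.py): state DB resolved from the project root
explain_db = project_root / ".pivot" / "state.db"

# Execution path (worker.py): state DB resolved from the stage's own state_dir,
# which for sub-pipeline stages points under the sub-pipeline directory
stage_info = {"state_dir": project_root / "model_reports/time_horizon_1_1/.pivot"}
exec_db = stage_info["state_dir"] / "state.db"

# Two different databases → their generation counters can disagree
print(explain_db != exec_db)  # → True
```

If both code paths are intended to consult the same state, they should resolve the DB path through a single shared helper.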

Suggested Investigation Areas

  1. Run cache behavior when increment_outputs=False: Does this cause downstream stages to miss the change? Should the run cache path increment output generations?

  2. StateDB per sub-pipeline: When --all merges pipelines, workers use per-pipeline state_dirs. Generation counters in sub-pipeline StateDBs may be stale relative to the main pipeline's StateDB.

  3. Tracked file changes not invalidating run cache: The run cache uses input_hash which includes dep hashes. But if the file metadata (mtime/size/inode) matches a stale StateDB entry, the hash itself might be stale, causing a false run cache hit.

  4. The tracked file warning should be an error (or invalidate cache): Currently verify_tracked_files() at core.py:426-430 warns but continues. If a tracked file has changed, all downstream caches should be invalidated.
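
For area 1, one possible direction is sketched below. All names are hypothetical, and whether bumping generations on a run-cache hit is safe for Pivot's other invariants would need verification:

```python
def restore_from_run_cache(hit: dict, generations: dict) -> str:
    """Restore cached outputs AND publish a new generation, so downstream
    Tier-1 checks treat the outputs as fresh and re-validate their inputs."""
    for out in hit["outputs"]:
        # was: increment_outputs=False, which left counters stale
        generations[out] = generations.get(out, 0) + 1
    # The stage itself can still be skipped; downstream cannot silently skip.
    return "skipped"

generations = {"base_output": 1}
restore_from_run_cache({"outputs": ["base_output"]}, generations)
print(generations["base_output"])  # → 2
```

The trade-off: downstream stages would re-check their inputs after every run-cache hit, but their own Tier-2/Tier-3 checks should still let them skip when nothing actually changed.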

Environment

  • Python 3.13
  • Pivot 0.1.0-dev (commit e56b5b8)
  • Pipeline uses Pipeline() sub-pipelines merged via --all
  • LMDB-backed StateDB for generation tracking
  • Tracked files via .pvt manifests
