Description
Summary
A stage is incorrectly marked as skipped (cached) during `pivot repro --all` even though its input dependency (a tracked `.pvt` file) has changed. The skip decision is incorrect, and downstream stages then fail because they receive stale output data.
Confusingly, `pivot repro --explain` correctly reports the stage as WILL RUN with reason "Input dependencies changed", while the actual execution skips it.
Pivot Version
Commit e56b5b86dee6be46034fc8ffb6c0ada8f097979b (0.1.0-dev), installed from https://github.com/sjawhar/pivot subdirectory packages/pivot.
Reproduction Scenario
Setup
- Multi-pipeline project using `Pipeline("sub_pipeline")` in subdirectories
- Main pipeline discovers sub-pipelines via `pivot repro --all`
- A tracked file (managed via `.pvt`) is a dependency of a base stage
- The tracked file's content is modified (new agents added to a data spec YAML)
- `pivot track --force` is NOT run after the modification (the `.pvt` file hash may or may not match the current file)
Steps to Reproduce
- Modify a tracked file that is a dependency of a pipeline stage
- Run `pivot repro --all`
- Observe: the stage that depends on the tracked file is reported as skipped
- Observe: downstream stages fail because they consume stale outputs
Expected Behavior
The stage should detect the changed dependency and re-run, producing updated outputs for downstream consumers.
Actual Behavior
- `pivot repro --all` reports the stage as skipped
- `pivot repro --explain` (for the same stage) correctly says WILL RUN with reason "Input dependencies changed"
- Downstream stages fail with `KeyError`/`ValueError` because the output data is stale
Workaround
Running `pivot run --force <stage_name>` correctly re-executes the stage and fixes all downstream failures.
Diagnostic Evidence
Warning Message (from `pivot repro --all` output)

```
Tracked file '/path/to/data_spec.yaml' has changed since tracking. Run 'pivot track --force data_spec.yaml' to update.
```

This warning is emitted at `executor/core.py:427-429` but does not prevent the stale skip decision. It is purely informational.
Lock File State (pre-fix)
The stage's lock file had OLD dependency hashes that did not match the current file content:

```yaml
deps:
  - path: path/to/data_spec.yaml
    hash: fb4687ad489edec1  # OLD hash in lock file
```

Current file hash: `ca11b4f4b5e8fe3e` (different → should trigger re-run)
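The lock-file comparison this evidence points at can be sketched as a minimal check. This is illustrative only: `file_hash` and `dep_changed` are hypothetical names, and Pivot's actual hash function and comparison logic may differ.

```python
import hashlib
from pathlib import Path


def file_hash(path: Path) -> str:
    # Illustrative content hash; Pivot's real hash function may differ.
    return hashlib.sha256(path.read_bytes()).hexdigest()[:16]


def dep_changed(lock_hash: str, current_hash: str) -> bool:
    # A mismatch between the lock-file hash and the current file hash
    # should force the stage to re-run.
    return lock_hash != current_hash


# The hashes observed in this issue:
assert dep_changed("fb4687ad489edec1", "ca11b4f4b5e8fe3e")      # changed -> run
assert not dep_changed("fb4687ad489edec1", "fb4687ad489edec1")  # unchanged -> may skip
```

Under this reading, Tier 2 alone would have forced a re-run; the skip must come from elsewhere.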
StateDB State
```python
# Generation counters for tracked files are None (not tracked by any producing stage)
state_db.get_generation("data_spec.yaml")  # → None

# dep_generations for the stage only has entries for SOME deps
# (the ones produced by other stages)
state_db.get_dep_generations("base_filter_runs_for_reports")
# → only 2 of ~8 deps have generation entries
```

Cascade
The incorrect skip cascades through the DAG:
```
base_filter_runs_for_reports: skipped (WRONG - dep changed)
  → compute_task_weights: skipped (wrong - upstream stale)
  → wrangle_bootstrap_logistic: skipped (wrong - upstream stale)
    → plot_bootstrap_ci: FAILED (KeyError - stale data)
    → compute_trendline_ci: FAILED (ValueError - stale data)
    → generate_benchmark_results: FAILED (KeyError - stale data)
```
Analysis of Skip Detection
Three-Tier Skip Detection (executor/worker.py)
- Tier 1 — Generation tracking (`can_skip_via_generation`, line 1017):
  - For tracked files (deps with no producing stage), `get_generation()` returns `None`
  - At lines 1074-1076: `if current_gen is None: return False` → correctly falls through
  - This tier correctly does NOT cause the skip (verified via StateDB inspection)
- Tier 2 — Lock file comparison (`_check_skip_or_run`, line 465):
  - Compares current dep hashes against lock file hashes
  - The lock file has the OLD hash → should detect the change → should return "run"
  - This tier should correctly detect the change
- Tier 3 — Run cache (`_try_skip_via_run_cache`, line 302):
  - Checks if the current `input_hash` matches a previously executed run
  - If found, restores outputs from cache and returns SKIPPED
  - Critically: at line 333, `increment_outputs=False` — does NOT bump output generation counters
  - This means downstream stages still see old generation counters → they skip via Tier 1
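The three tiers can be condensed into a small decision sketch. This is a reconstruction, not Pivot's code: the function and key names are invented, and the ordering of the run-cache check relative to the lock-file check is an assumption inferred from the observed behavior (the run cache evidently pre-empts the lock-file "run" verdict).

```python
def skip_or_run(observed: dict[str, bool]) -> str:
    """Simplified reconstruction of the three-tier skip decision.

    The real logic lives in executor/worker.py; tier ordering here is
    inferred from the observed behavior, not confirmed from the source.
    """
    # Tier 1: generation counters. For tracked files, get_generation()
    # returns None, so this tier falls through for the failing stage.
    if observed.get("generations_match", False):
        return "skipped (tier 1)"
    # Tier 3: run cache. An input_hash hit restores cached outputs and
    # skips, pre-empting whatever Tier 2 would have concluded.
    if observed.get("run_cache_hit", False):
        return "skipped (tier 3)"
    # Tier 2: lock-file comparison of dependency hashes.
    if observed.get("deps_match_lock", False):
        return "skipped (tier 2)"
    return "run"


# The failing stage: Tier 1 falls through, but the run cache hits.
assert skip_or_run({"run_cache_hit": True}) == "skipped (tier 3)"
# What --explain presumably computes without consulting the run cache:
assert skip_or_run({}) == "run"
```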
Hypothesis: Run Cache Causes Stale Skip Cascade
The most likely cause of the observed behavior:

- The tracked file was changed to a state that was previously computed (or the `input_hash` matched a previous run due to hash collision or content equivalence)
- Tier 3 (run cache) found the match and restored old outputs from CAS
- The lock file was updated with current dep hashes (lines 319-324)
- But `increment_outputs=False` (line 333) meant downstream stages' dep generation counters were NOT bumped
- Downstream stages' Tier 1 check saw matching (stale) generation counters → skipped
- The cascade propagated: all downstream stages skipped, leading to failures when stages finally tried to read the stale output data
Alternative Hypothesis: Explain vs Execution Path Divergence
`pivot repro --explain` and `pivot repro --all` may use different StateDB instances or state_dir paths:

- Explain (`status.py:112`): opens `state_db_path = config.get_state_dir() / "state.db"` (project root)
- Execution (`worker.py:179`): opens `state_db_path = stage_info["state_dir"] / "state.db"` (possibly sub-pipeline root)

If the sub-pipeline's state_dir has a different `.pivot/state.db` with stale generation data, the skip decision could diverge from what `--explain` computes.

Evidence: Sub-pipeline stages have `state_dir = model_reports/time_horizon_1_1/.pivot/`, not the project root `.pivot/`.
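The path divergence is easy to see when the two cited code paths are written out side by side. The layout below is a hypothetical reconstruction (`/repo` and the `stage_info` dict are illustrative):

```python
from pathlib import Path

project_root = Path("/repo")

# --explain path (status.py:112): always the project-root StateDB.
explain_db = project_root / ".pivot" / "state.db"

# Execution path (worker.py:179): the stage's own state_dir, which for a
# sub-pipeline stage sits under the sub-pipeline directory.
stage_info = {"state_dir": project_root / "model_reports/time_horizon_1_1/.pivot"}
exec_db = stage_info["state_dir"] / "state.db"

# Two different database files can hold divergent generation data,
# which would explain --explain and the actual repro disagreeing.
assert explain_db != exec_db
```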
Suggested Investigation Areas
- Run cache behavior when `increment_outputs=False`: Does this cause downstream stages to miss the change? Should the run cache path increment output generations?
- StateDB per sub-pipeline: When `--all` merges pipelines, workers use per-pipeline state_dirs. Generation counters in sub-pipeline StateDBs may be stale relative to the main pipeline's StateDB.
- Tracked file changes not invalidating the run cache: The run cache uses `input_hash`, which includes dep hashes. But if the file metadata (mtime/size/inode) matches a stale StateDB entry, the hash itself might be stale, causing a false run cache hit.
- The tracked file warning should be an error (or should invalidate caches): Currently `verify_tracked_files()` at `core.py:426-430` warns but continues. If a tracked file has changed, all downstream caches should be invalidated.
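The last investigation area could be sketched as a strict mode for the tracked-file check. Everything here is hypothetical: the error type, the `strict` flag, and the dict-based signature are illustrative, not Pivot's actual `verify_tracked_files()` interface:

```python
class TrackedFileChangedError(RuntimeError):
    """Hypothetical error type; Pivot currently only warns (core.py:426-430)."""


def verify_tracked_files(
    recorded: dict[str, str],
    current: dict[str, str],
    strict: bool = True,
) -> None:
    # Sketch: fail (or trigger cache invalidation) instead of merely warning
    # when a tracked file's hash no longer matches its .pvt record.
    for path, recorded_hash in recorded.items():
        if current.get(path) != recorded_hash:
            msg = (
                f"Tracked file '{path}' has changed since tracking. "
                f"Run 'pivot track --force {path}' to update."
            )
            if strict:
                raise TrackedFileChangedError(msg)
            print(f"WARNING: {msg}")
```

With `strict=True`, the stale-skip cascade would be stopped at its source instead of surfacing later as `KeyError`/`ValueError` in downstream stages.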
Environment
- Python 3.13
- Pivot 0.1.0-dev (commit e56b5b8)
- Pipeline uses `Pipeline()` sub-pipelines merged via `--all`
- LMDB-backed StateDB for generation tracking
- Tracked files via `.pvt` manifests