Garbage-collect the TEE container volume to stop /app/.data from filling up #5

Open
mpjunior92 wants to merge 16 commits into Gajesh2007:master from mpjunior92:feature/disk-gc

@mpjunior92

Context

The Sovra TEE VM's 100 GB pd-ssd mounted at /app/.data is running out of space, and EigenCloud's platform constraints make resizing the persistent disk impractical. This PR adds an application-level garbage collector so the container reclaims its own disk as it runs.

Goals

  1. Prevent /app/.data from filling up under normal operation.
  2. Degrade gracefully when R2 (the CDN source of truth) is unavailable or disabled.
  3. Preserve the user-facing feed — posts.json and cartoons.json are the product, never swept.
  4. Non-breaking: safe defaults, feature-flag disable, no changes to the Docker image surface or the deploy flow.
  5. Ship ground-truth observability so the next incident has evidence.

Non-goals

  • Resizing the persistent disk.
  • Rewriting the storage layer.
  • Optimizing image generation sizes (good candidate follow-up).
  • Frontend log-filter UI (separate follow-up).

Architecture

Two complementary cleanup mechanisms plus a one-shot startup audit, all in-process. No new services.

1. Write-through cache. When R2 is enabled, local media files are deleted after the publish pipeline persists them with a CDN URL. Editor / Twitter upload / video producer still read the files locally during generation — cleanup fires only at end-of-publish, not inside the uploader helpers. When R2 is disabled, files stay put; the frontend serves them via fastify static routes.
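A minimal sketch of the end-of-publish cleanup described above. The name cleanupLocalIfOnR2 appears in the test list of this PR, but its signature here, the PublishedMedia shape, and the r2Enabled flag are assumptions made for illustration, not the code in the diff.

```typescript
import * as fs from "node:fs";

// Hypothetical shape of a media item once the publish pipeline has finished.
interface PublishedMedia {
  localPath: string; // file on the data volume
  cdnUrl: string | null; // set only after the R2 upload succeeded
}

// Deletes the local copy only when R2 is enabled AND the item has a CDN URL.
// Returns true when a file was actually removed.
function cleanupLocalIfOnR2(item: PublishedMedia, r2Enabled: boolean): boolean {
  if (!r2Enabled || item.cdnUrl === null) {
    return false; // no-op: the local file is still the only copy
  }
  try {
    fs.unlinkSync(item.localPath);
    return true;
  } catch (err) {
    // A file that already vanished is not an error.
    if ((err as { code?: string }).code === "ENOENT") return false;
    throw err;
  }
}
```

Calling this once at end-of-publish, rather than inside the uploader helpers, keeps the file available to every consumer that still reads it during generation.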

2. Janitor. Runs every hour. Each sweep performs, in order:

  • Event log rotation. When events.jsonl exceeds 50 MB, keep the last 10,000 lines and rotate atomically.
  • Media age sweep (R2 enabled only). Files older than 7 days are unlinked.
  • Disk pressure emergency sweep. When the .data mount is >70% full, run an aggressive sweep using a 1-day threshold. When R2 is disabled, also null out imageUrl / videoUrl fields in posts.json for any file that was just deleted — the frontend renders a placeholder instead of a 404.
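The two media sweeps above can be sketched as follows. This is an illustration under stated assumptions: file age is taken from mtime, the media directory is flat, and the names SweepConfig, diskUsedRatio, sweepOlderThan, and runMediaSweeps are hypothetical. Only the thresholds come from the Config section.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

const DAY_MS = 24 * 60 * 60 * 1000;

// Illustrative config subset; the values mirror the documented defaults.
interface SweepConfig {
  mediaMaxAgeMs: number;
  mediaPressureAgeMs: number;
  diskPressureThreshold: number;
}

const sweepDefaults: SweepConfig = {
  mediaMaxAgeMs: 7 * DAY_MS,
  mediaPressureAgeMs: 1 * DAY_MS,
  diskPressureThreshold: 0.7,
};

// Fraction of the filesystem in use, from statfs block counts.
// Blocks reserved for root count as used here (bavail, not bfree).
function diskUsedRatio(mountPath: string): number {
  const s = fs.statfsSync(mountPath);
  return s.blocks === 0 ? 0 : 1 - s.bavail / s.blocks;
}

// Unlinks regular files in `dir` whose mtime is older than `maxAgeMs`.
// Returns the paths that were deleted.
function sweepOlderThan(dir: string, maxAgeMs: number, now: number): string[] {
  const deleted: string[] = [];
  for (const name of fs.readdirSync(dir)) {
    const file = path.join(dir, name);
    try {
      const st = fs.statSync(file);
      if (st.isFile() && now - st.mtimeMs > maxAgeMs) {
        fs.unlinkSync(file);
        deleted.push(file);
      }
    } catch (err) {
      // Files can vanish between readdir and stat/unlink.
      if ((err as { code?: string }).code !== "ENOENT") throw err;
    }
  }
  return deleted;
}

// Age sweep runs only with R2 enabled; the pressure sweep runs whenever
// usage exceeds the threshold, regardless of R2.
function runMediaSweeps(
  dir: string,
  cfg: SweepConfig,
  r2Enabled: boolean,
  usedRatio: number,
  now: number = Date.now(),
): { aged: string[]; pressure: string[] } {
  const aged = r2Enabled ? sweepOlderThan(dir, cfg.mediaMaxAgeMs, now) : [];
  const pressure =
    usedRatio > cfg.diskPressureThreshold
      ? sweepOlderThan(dir, cfg.mediaPressureAgeMs, now)
      : [];
  return { aged, pressure };
}
```

Passing usedRatio in as a value (for example from diskUsedRatio on the mount) keeps the sweep logic testable without a real near-full disk, and lets the caller skip the pressure step when statfs fails.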

JSON stores, cache files, and the agent keypair are never touched.
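The posts.json null-out from the pressure sweep edits URL fields in place without removing the file or any post. A rough sketch, assuming a Post shape with only the two fields named above and assuming local media URLs are compared as full paths (the example paths in the comment are made up). Matching on the full path rather than the basename is one way to avoid the cross-subdir collision the tests mention.

```typescript
// Hypothetical minimal post shape; real posts carry more fields.
interface Post {
  id: string;
  imageUrl: string | null;
  videoUrl: string | null;
}

// Nulls imageUrl / videoUrl on posts whose URL points at a file that was
// just deleted. Compares full URL paths, so "/images/1.png" and
// "/cartoons/1.png" (same basename, different subdir) are never confused.
// Mutates `posts` and returns how many fields were cleared.
function nullOutDeletedMedia(posts: Post[], deletedUrls: ReadonlySet<string>): number {
  let cleared = 0;
  for (const post of posts) {
    if (post.imageUrl !== null && deletedUrls.has(post.imageUrl)) {
      post.imageUrl = null;
      cleared++;
    }
    if (post.videoUrl !== null && deletedUrls.has(post.videoUrl)) {
      post.videoUrl = null;
      cleared++;
    }
  }
  return cleared;
}
```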

3. Startup disk audit. One-shot directory walk logged at boot: total size, per-subdir breakdown, top-10 largest files. Emitted as a structured [disk-audit] … line so Datadog can surface the real culprit on the next restart.
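One way the audit walk could look, as a sketch only. The DiskAudit shape and the auditDisk name are hypothetical, and the payload format that follows the [disk-audit] prefix is not specified in this description, so the JSON serialization mentioned in the note below is an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface FileSize {
  file: string; // path relative to the audited root
  bytes: number;
}

interface DiskAudit {
  totalBytes: number;
  perSubdir: Record<string, number>; // keyed by top-level subdir, "." for root files
  largest: FileSize[];
}

// Recursively walks `root`, summing regular-file sizes per top-level
// subdirectory and collecting the `topN` largest files.
function auditDisk(root: string, topN: number = 10): DiskAudit {
  const perSubdir: Record<string, number> = {};
  const files: FileSize[] = [];
  let totalBytes = 0;

  const walk = (dir: string, bucket: string | null): void => {
    for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
      const full = path.join(dir, entry.name);
      if (entry.isDirectory()) {
        // The first directory level below root names the bucket.
        walk(full, bucket ?? entry.name);
      } else if (entry.isFile()) {
        const bytes = fs.statSync(full).size;
        const key = bucket ?? ".";
        totalBytes += bytes;
        perSubdir[key] = (perSubdir[key] ?? 0) + bytes;
        files.push({ file: path.relative(root, full), bytes });
      }
    }
  };

  walk(root, null);
  files.sort((a, b) => b.bytes - a.bytes);
  return { totalBytes, perSubdir, largest: files.slice(0, topN) };
}
```

At boot the result could be logged on a single line, for example by prefixing JSON.stringify(auditDisk("/app/.data")) with [disk-audit], so that a log pipeline can parse it as one event.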

Config

A single gc block, all fields overridable. Safe rollback lever: set GC_ENABLED=false to disable both the write-through cleanup and the janitor.

| Key | Default | Notes |
| --- | --- | --- |
| gc.enabled | true (GC_ENABLED=false disables) | Kill switch |
| gc.sweepIntervalMs | 1h (5s in TEST_MODE) | Janitor cadence |
| gc.initialDelayMs | 60s (2s in TEST_MODE) | First sweep delay |
| gc.mediaMaxAgeMs | 7 days | Normal age sweep |
| gc.mediaPressureAgeMs | 1 day | Under pressure |
| gc.diskPressureThreshold | 0.70 | statfs-based |
| gc.eventLogMaxBytes | 50 MB | Rotation trigger |
| gc.eventLogKeepLines | 10,000 | Lines kept |
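The defaults above could be assembled roughly as follows. This is a sketch: the GcConfig interface and loadGcConfig are hypothetical names, TEST_MODE is assumed to be enabled by the literal string "true", and 50 MB is assumed to mean 50 × 1024 × 1024 bytes.

```typescript
interface GcConfig {
  enabled: boolean;
  sweepIntervalMs: number;
  initialDelayMs: number;
  mediaMaxAgeMs: number;
  mediaPressureAgeMs: number;
  diskPressureThreshold: number;
  eventLogMaxBytes: number;
  eventLogKeepLines: number;
}

type Env = Record<string, string | undefined>;

const SECOND_MS = 1000;
const HOUR_MS = 60 * 60 * SECOND_MS;
const DAY_IN_MS = 24 * HOUR_MS;

// Builds the gc block from environment flags, then applies explicit overrides.
function loadGcConfig(env: Env, overrides: Partial<GcConfig> = {}): GcConfig {
  const testMode = env.TEST_MODE === "true"; // assumed flag format
  return {
    enabled: env.GC_ENABLED !== "false", // on unless explicitly disabled
    sweepIntervalMs: testMode ? 5 * SECOND_MS : HOUR_MS,
    initialDelayMs: testMode ? 2 * SECOND_MS : 60 * SECOND_MS,
    mediaMaxAgeMs: 7 * DAY_IN_MS,
    mediaPressureAgeMs: DAY_IN_MS,
    diskPressureThreshold: 0.7,
    eventLogMaxBytes: 50 * 1024 * 1024, // assumed binary megabytes
    eventLogKeepLines: 10_000,
    ...overrides,
  };
}
```

In the application this would be called once at startup with process.env; treating only the exact string "false" as a disable keeps the kill switch from tripping on an unset or malformed variable.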

Error handling

Each step in a sweep is wrapped in a try/catch that logs and continues — one branch failing never blocks the others. unlink/stat ENOENT is swallowed silently (files vanish concurrently). statfs failure skips the pressure check for that cycle. Event-log rotation writes to a temp file then atomic-renames; there is a small race where events written during the rename may be lost — acceptable, documented in code.
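The rotation and step-isolation behavior described here can be sketched as below. The function names and the failure log wording are assumptions, and the sketch reads the whole log into memory for brevity, which a real implementation may want to avoid for large files.

```typescript
import * as fs from "node:fs";

// Rewrites `logPath` to its last `keepLines` lines once it exceeds `maxBytes`.
// Writes a temp file next to the log, then renames it over the original.
// Events appended between the read and the rename are lost (the race noted above).
// Returns true when a rotation happened.
function rotateEventLog(logPath: string, maxBytes: number, keepLines: number): boolean {
  let size: number;
  try {
    size = fs.statSync(logPath).size;
  } catch (err) {
    if ((err as { code?: string }).code === "ENOENT") return false; // nothing to rotate
    throw err;
  }
  if (size <= maxBytes) return false;

  const lines = fs.readFileSync(logPath, "utf8").split("\n");
  if (lines[lines.length - 1] === "") lines.pop(); // empty piece after the trailing newline
  const kept = keepLines > 0 ? lines.slice(-keepLines) : [];

  const tmpPath = `${logPath}.tmp`;
  fs.writeFileSync(tmpPath, kept.map((line) => line + "\n").join(""));
  fs.renameSync(tmpPath, logPath); // atomic when both paths are on the same filesystem
  return true;
}

// Runs each named step in order; a throwing step is logged and skipped so
// the remaining steps still run. Returns the names of the failed steps.
function runSteps(steps: Array<[string, () => void]>): string[] {
  const failed: string[] = [];
  for (const [name, step] of steps) {
    try {
      step();
    } catch (err) {
      console.error(`[janitor] step "${name}" failed:`, err);
      failed.push(name);
    }
  }
  return failed;
}
```

Keeping the temp file in the same directory as the log matters: a rename across filesystems is not atomic and may fail outright.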

When an R2 upload fails, the local file is preserved (same as before this PR). The janitor's age sweep is the backstop: orphaned local files are reclaimed once they are older than 7 days.

Testing

  • Unit tests (17): event-log rotation, media age sweep, disk-pressure sweep, posts.json URL null-out (with cross-subdir basename collision coverage), cleanupLocalIfOnR2 no-op when R2 is off.
  • Docker + tmpfs E2E: builds the Docker image, runs it with a 100 MB tmpfs mount simulating a near-full TEE disk, seeds a >50 MB events.jsonl and aged media, waits one janitor cycle, asserts rotation landed under the threshold and pressure behavior is correct. Runs locally in ~10s (the Docker image build is the main cost). Invoked via bun run test:e2e.
  • Existing bun run typecheck clean.

Rollout plan

  1. Merge and deploy via ecloud compute app upgrade "Sovra Agent".
  2. Observe the startup [disk-audit] log line → confirms which subdir or file is actually dominating the disk.
  3. Within an hour, [janitor] rotated events.jsonl: X → Y bytes should appear.
  4. If anything looks off, set GC_ENABLED=false in the env and redeploy.

mpjunior92 and others added 16 commits April 24, 2026 18:31
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>