Commit 284814f

docs(design): complete Worker Manager design with all technical details
This commit finalizes the Worker Manager system design based on user requirements and design discussions. The design is now complete and ready for implementation.

Major Additions:

1. Hook Retry Strategy (Decision #6)
   - Network errors: Retry forever with exponential backoff
   - Command errors: Never retry, abort task group
   - Rationale: Command errors indicate machine doesn't meet requirements

2. Sticky Affinity Scheduling (Decision #13)
   - Lease-based assignment with configurable duration (1-4 hours)
   - Work-conserving idle threshold (probe after 5 minutes idle)
   - Auto-renewal when tasks still pending
   - Supports reopening CLOSED task groups
   - Minimizes context switching overhead from prep/cleanup

3. Comprehensive Observability & Metrics
   - System metrics: CPU, memory, disk, network, load average
   - Worker pool metrics: Active workers, crash rate, per-worker status
   - Task execution metrics: Throughput, latency percentiles, resources
   - Task group lifecycle metrics: Prep/cleanup times, task counts
   - IPC communication metrics: Request rates, latency, errors
   - Lease & scheduling metrics: Renewals, context switches
   - TimescaleDB/InfluxDB for time-series storage

4. IPC Architecture Details (iceoryx2)
   - Request-response pattern for task fetch/report
   - Pub-sub pattern for control/config broadcast
   - Large payload optimization via shared memory
   - Service discovery via environment variables
   - Complete code examples for manager and worker sides
   - Error handling with exponential backoff
   - Performance targets: <100μs request-response, <50μs pub-sub

5. CPU Binding for Child Processes
   - Three strategies: RoundRobin, Exclusive, Shared
   - cgroup-based solution (recommended) for automatic inheritance
   - Fallback to taskset wrapper if cgroups unavailable
   - Verification and dynamic rebinding on worker scale
   - Handles edge cases: insufficient cores, permissions

Updates:
- Renumbered decision sections (1-13)
- Removed "Remaining Open Questions" (all resolved)
- Added "Future Extensibility" section for deferred features
- Updated implementation plan with realistic time estimates
- Added technical prototyping suggestions
- Listed documentation needs and implementation questions

Design Decisions Summary:
✅ Task group completion (hybrid: explicit + timeout)
✅ Worker anonymity (no coordinator registration)
✅ Manager-only authentication
✅ TaskSpec-like env hooks
✅ CPU binding focus (RoundRobin/Exclusive/Shared)
✅ Hook retry strategy (network vs command errors)
✅ Failure handling (crash recovery, respawning)
✅ Task group attributes (priority, tags, labels)
✅ No hard worker limits
✅ Static worker scaling per task group
✅ Full backward compatibility
✅ Predictable IPC service names
✅ Sticky affinity scheduling with lease

Document Statistics:
- Total lines: ~1,725 (905 additions, 88 deletions)
- Version: 0.3.0
- Status: Design Complete - Ready for Implementation

Estimated Implementation Time: 10-15 weeks

Co-authored-by: User
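
For readers without the design document at hand, the hook retry rule from Decision #6 (network errors retry forever with exponential backoff; command errors never retry and abort the task group) can be illustrated roughly as follows. This is a minimal sketch, not the design's actual code: the commit message does not state the implementation language, and the names `NetworkError`, `CommandError`, `TaskGroupAborted`, `run_hook`, and the `base`/`cap` delay values are all hypothetical.

```python
import time
from typing import Callable, Iterator


class NetworkError(Exception):
    """Transient failure reaching a remote dependency (hypothetical name)."""


class CommandError(Exception):
    """The hook command itself failed (hypothetical name)."""


class TaskGroupAborted(Exception):
    """Signals that the whole task group must be abandoned (hypothetical name)."""


def backoff_delays(base: float, cap: float) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)


def run_hook(
    hook: Callable[[], None],
    sleep: Callable[[float], None] = time.sleep,
    base: float = 1.0,   # assumed initial delay; not specified in the commit
    cap: float = 60.0,   # assumed maximum delay; not specified in the commit
) -> int:
    """Run `hook` until it succeeds and return the number of attempts made."""
    attempts = 0
    for delay in backoff_delays(base, cap):
        attempts += 1
        try:
            hook()
            return attempts
        except NetworkError:
            # Network errors are treated as transient: wait, then retry forever.
            sleep(delay)
        except CommandError as exc:
            # Command errors mean the machine does not meet the requirements,
            # so retrying is pointless: abort the task group immediately.
            raise TaskGroupAborted(str(exc)) from exc
    raise AssertionError("unreachable: backoff_delays never ends")
```

Passing `sleep` as a parameter is only a convenience for testing the policy without real waiting; capping the delay keeps an unbounded retry loop from backing off to impractically long intervals.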
1 parent 0abb126 commit 284814f

1 file changed

+905
-88
lines changed