Commit 284814f
docs(design): complete Worker Manager design with all technical details
This commit finalizes the Worker Manager system design based on user
requirements and design discussions. The design is now complete and
ready for implementation.
Major Additions:
1. Hook Retry Strategy (Decision #6)
- Network errors: Retry forever with exponential backoff
- Command errors: Never retry, abort task group
- Rationale: Command errors indicate the machine doesn't meet requirements
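A minimal sketch of the retry split described above, not the design document's actual code. The exception classes, `run_hook_with_retry`, the delay values, and the jitter are all hypothetical names and assumptions introduced for illustration.

```python
import random
import time


class NetworkError(Exception):
    """Transient failure, e.g. a remote endpoint is unreachable (hypothetical type)."""


class CommandError(Exception):
    """The hook command itself failed on this machine (hypothetical type)."""


class TaskGroupAborted(Exception):
    """Raised when a hook fails in a way that retrying cannot fix."""


def run_hook_with_retry(hook, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run `hook` until it succeeds or fails with a command error."""
    attempt = 0
    while True:
        try:
            return hook()
        except CommandError as exc:
            # Never retried: the machine does not meet the requirements,
            # so the whole task group is aborted.
            raise TaskGroupAborted(str(exc)) from exc
        except NetworkError:
            # Retried forever. The delay doubles per attempt up to max_delay;
            # the exponent is capped so the power never overflows a float.
            delay = min(max_delay, base_delay * 2 ** min(attempt, 32))
            # Jitter (an assumption, not from the design) spreads out retries.
            sleep(delay * random.uniform(0.5, 1.0))
            attempt += 1
```

Passing `sleep` as a parameter keeps the loop testable without real waiting; the cap on the delay is an assumption, since the commit only says "exponential backoff".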
2. Sticky Affinity Scheduling (Decision #13)
- Lease-based assignment with configurable duration (1-4 hours)
- Work-conserving idle threshold (probe after 5 minutes idle)
- Auto-renewal when tasks still pending
- Supports reopening CLOSED task groups
- Minimizes context switching overhead from prep/cleanup
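The lease rules above could be sketched roughly as follows. This is an illustration under stated assumptions: `Lease`, `next_action`, `renew`, the 2-hour lease value, and the behavior when a lease expires with no pending tasks are hypothetical and not taken from the design document.

```python
from dataclasses import dataclass, replace

LEASE_SECONDS = 2 * 3600         # assumed value inside the 1-4 hour range
IDLE_THRESHOLD_SECONDS = 5 * 60  # probe after 5 minutes idle


@dataclass(frozen=True)
class Lease:
    group_id: str        # task group this manager is currently bound to
    expires_at: float    # timestamp at which the lease runs out
    last_task_at: float  # timestamp of the most recent task activity


def next_action(lease, now, pending_tasks):
    """Decide what to do with the current lease.

    Returns one of 'run', 'renew', 'probe', 'release', 'wait'.
    """
    expired = now >= lease.expires_at
    if pending_tasks > 0:
        # Auto-renewal: stay on the same group while tasks are still pending.
        return "renew" if expired else "run"
    if expired:
        # Assumption: an expired lease with no pending work is given up.
        return "release"
    if now - lease.last_task_at >= IDLE_THRESHOLD_SECONDS:
        # Work-conserving: look for other work instead of idling on the lease.
        return "probe"
    return "wait"


def renew(lease, now):
    """Return a copy of the lease extended by one full lease duration."""
    return replace(lease, expires_at=now + LEASE_SECONDS)
```

Keeping the decision a pure function of `(lease, now, pending_tasks)` makes the sticky behavior easy to unit-test; what a "probe" actually does is left open here because the commit message does not specify it.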
3. Comprehensive Observability & Metrics
- System metrics: CPU, memory, disk, network, load average
- Worker pool metrics: Active workers, crash rate, per-worker status
- Task execution metrics: Throughput, latency percentiles, resources
- Task group lifecycle metrics: Prep/cleanup times, task counts
- IPC communication metrics: Request rates, latency, errors
- Lease & scheduling metrics: Renewals, context switches
- TimescaleDB/InfluxDB for time-series storage
4. IPC Architecture Details (iceoryx2)
- Request-response pattern for task fetch/report
- Pub-sub pattern for control/config broadcast
- Large payload optimization via shared memory
- Service discovery via environment variables
- Complete code examples for manager and worker sides
- Error handling with exponential backoff
- Performance targets: <100μs request-response, <50μs pub-sub
5. CPU Binding for Child Processes
- Three strategies: RoundRobin, Exclusive, Shared
- cgroup-based solution (recommended) for automatic inheritance
- Fallback to taskset wrapper if cgroups unavailable
- Verification and dynamic rebinding when workers scale
- Handles edge cases: insufficient cores, permissions
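One possible reading of the three binding strategies, sketched as a pure core-assignment function. The semantics chosen here (RoundRobin pins each worker to a single core and wraps around, Exclusive gives each worker a disjoint slice, Shared gives every worker all cores) and the names `Strategy` and `assign_cores` are assumptions; the design document may define them differently.

```python
from enum import Enum


class Strategy(Enum):
    ROUND_ROBIN = "round_robin"
    EXCLUSIVE = "exclusive"
    SHARED = "shared"


def assign_cores(strategy, num_workers, cores):
    """Return one set of CPU ids per worker, in worker order."""
    cores = sorted(cores)
    if num_workers <= 0 or not cores:
        raise ValueError("need at least one worker and one core")

    if strategy is Strategy.SHARED:
        # Every worker may run on every core.
        return [set(cores) for _ in range(num_workers)]

    if strategy is Strategy.ROUND_ROBIN:
        # One core per worker, wrapping around when workers outnumber cores.
        return [{cores[i % len(cores)]} for i in range(num_workers)]

    if strategy is Strategy.EXCLUSIVE:
        # Disjoint, equally sized slices; leftover cores stay unassigned.
        chunk = len(cores) // num_workers
        if chunk == 0:
            # Assumed handling of the "insufficient cores" edge case.
            raise ValueError("not enough cores for exclusive binding")
        return [set(cores[i * chunk:(i + 1) * chunk]) for i in range(num_workers)]

    raise ValueError(f"unknown strategy: {strategy!r}")
```

For example, `assign_cores(Strategy.EXCLUSIVE, 2, range(4))` yields `[{0, 1}, {2, 3}]`. On Linux each set could then be applied with `os.sched_setaffinity(pid, cpus)`, which is a different mechanism from the cgroup-based approach the design recommends.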
Updates:
- Renumbered decision sections (1-13)
- Removed "Remaining Open Questions" (all resolved)
- Added "Future Extensibility" section for deferred features
- Updated implementation plan with realistic time estimates
- Added technical prototyping suggestions
- Listed documentation needs and implementation questions
Design Decisions Summary:
✅ Task group completion (hybrid: explicit + timeout)
✅ Worker anonymity (no coordinator registration)
✅ Manager-only authentication
✅ TaskSpec-like env hooks
✅ CPU binding focus (RoundRobin/Exclusive/Shared)
✅ Hook retry strategy (network vs command errors)
✅ Failure handling (crash recovery, respawning)
✅ Task group attributes (priority, tags, labels)
✅ No hard worker limits
✅ Static worker scaling per task group
✅ Full backward compatibility
✅ Predictable IPC service names
✅ Sticky affinity scheduling with lease
Document Statistics:
- Total lines: ~1,725 (905 additions, 88 deletions)
- Version: 0.3.0
- Status: Design Complete - Ready for Implementation
Estimated Implementation Time: 10-15 weeks
Co-authored-by: User1
Parent 0abb126, commit 284814f
1 file changed, 905 additions, 88 deletions