feat(controlplane): POC — gRPC control plane with Register + RunInit flow #137
Draft
schmitthub wants to merge 3 commits into main from feature/clawkerd
4529a93 feat(controlplane): POC — gRPC control plane with Register + RunInit … (schmitthub)
b877ff4 fix(otel): service_name resource, scope_name differentiation, real OT… (schmitthub)
4dc879e refactor(controlplane): wire config store, extract interface + mock (schmitthub)
77 changes: 77 additions & 0 deletions
.serena/memories/brainstorm_the-controlplane-and-clawkerd.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # Brainstorm: The Control Plane and clawkerd | ||
|
|
||
| > **Status:** Active | ||
| > **Created:** 2026-02-16 | ||
| > **Last Updated:** 2026-02-16 12:00 | ||
|
|
||
| ## Problem / Topic | ||
| The POC (test/controlplane/) validates the two-gRPC-server pattern. The plan is to iteratively evolve this POC toward production — each iteration adds one real concern and validates it via the integration test. The master document is `clawkerd-container-control-plane` memory; this brainstorm is the working scratchpad for the current session. | ||
|
|
||
|
|
||
| ## POC Results (from test/controlplane/) | ||
|
|
||
| ### What was built | ||
| - **Proto schema** (`internal/clawkerd/protocol/v1/agent.proto`): AgentReportingService (Register), AgentCommandService (RunInit) | ||
| - **clawkerd binary** (`clawkerd/main.go`): Container-side agent — starts gRPC server, registers with CP, handles RunInit (executes bash commands, streams progress, writes ready file) | ||
| - **Control plane server** (`internal/controlplane/`): server.go + registry.go — accepts Register, resolves container IP via Docker inspect, connects back to clawkerd's gRPC server, calls RunInit, consumes progress stream | ||
| - **Test Dockerfile** (`test/controlplane/testdata/Dockerfile`): Two-stage build (Go builder → Alpine), installs su-exec + tini, root entrypoint with gosu/su-exec drop | ||
| - **Test entrypoint** (`test/controlplane/testdata/entrypoint.sh`): Starts clawkerd in background, drops to claude user via su-exec | ||
| - **Integration test** (`test/controlplane/controlplane_test.go`): Full end-to-end — builds image, starts CP in-process, runs container, verifies registration, init progress, privilege separation | ||
| - **Harness extensions**: WithNetwork(), WithPortBinding(), network join on start | ||
| - **Makefile**: `make test-controlplane`, `make proto` (buf generate), excluded from unit tests | ||
|
|
||
| ### What was validated | ||
| 1. Two-gRPC-server pattern works across Docker network (host → container via port mapping) | ||
| 2. Address discovery works (clawkerd registers with listen port, CP resolves via Docker inspect + port binding) | ||
| 3. Server-streaming RunInit progress flows correctly (STARTED → COMPLETED per step → READY) | ||
| 4. Root entrypoint + su-exec privilege drop works (clawkerd runs as root UID 0, main process as claude UID 1001) | ||
| 5. tini as PID 1 (via Dockerfile ENTRYPOINT, not HostConfig.Init in POC) manages both processes | ||
| 6. Ready file signal mechanism works (/var/run/clawker/ready) | ||
| 7. Init step command execution with stdout/stderr capture works | ||
|
|
||
| ### What was NOT validated / deferred | ||
| - HostConfig.Init (POC uses explicit tini in Dockerfile ENTRYPOINT instead) | ||
| - Graceful degradation (clawkerd falling back to baked-in defaults when CP unreachable) | ||
| - Reconnection logic (gRPC stream drops) | ||
| - Docker Events integration | ||
| - Watermill message queue | ||
| - SchedulerService (CLI → CP resource management) | ||
| - Entrypoint waiting on clawkerd ready signal (POC entrypoint is fire-and-forget) | ||
|
|
||
| ## Open Items / Questions | ||
| - How to handle the entrypoint wait? Current POC starts clawkerd & then immediately drops privileges. Should it wait for ready file before exec su-exec? | ||
| - Should we move to HostConfig.Init=true (Docker injects tini) or keep explicit tini in entrypoint? POC uses explicit. | ||
| - What's the plan for the `internal/controlplane/` vs `internal/clawkerd/` package split? Currently CP is in `controlplane/`, agent protocol in `clawkerd/protocol/`. Is this the right layout long-term? | ||
| - The test uses `host.docker.internal:host-gateway` for container→host communication. In production, the CP listens on clawker-net. How does address resolution change? | ||
| - Container ID mismatch: Docker hostname is 12-char truncated ID, but Docker API uses full ID. The test handles both — should clawkerd send full ID (read from /proc or cgroup)? | ||
|
|
||
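On the container ID question: one way to tolerate both forms on the control-plane side is a prefix match with a minimum length. The sketch below is an assumption about how that could look, not what the test currently does; `sameContainer` and `shortIDLen` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// shortIDLen is the length of the truncated container ID that Docker
// uses as the default hostname.
const shortIDLen = 12

// sameContainer reports whether reported (a full ID or a truncated prefix,
// such as the container hostname) refers to the container whose full ID is
// fullID. Hypothetical helper, not taken from the PR.
func sameContainer(reported, fullID string) bool {
	reported = strings.ToLower(strings.TrimSpace(reported))
	fullID = strings.ToLower(strings.TrimSpace(fullID))
	if len(reported) < shortIDLen || len(fullID) < shortIDLen {
		return false // too short to be trusted as an identifier
	}
	return strings.HasPrefix(fullID, reported)
}

func main() {
	full := "4f5e6d7c8b9a0f1e2d3c4b5a69788796a5b4c3d2e1f00112233445566778899a"
	fmt.Println(sameContainer(full[:12], full)) // true
	fmt.Println(sameContainer(full, full))      // true
	fmt.Println(sameContainer("4f5e", full))    // false
}
```

Having clawkerd send the full ID would make the prefix match unnecessary, but the minimum-length guard is still a cheap defence against ambiguous short prefixes.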
| ## Decisions Made | ||
| - Two-gRPC-server pattern: VALIDATED by POC. CP and clawkerd each run their own gRPC server. | ||
| - su-exec over gosu: POC chose su-exec (Alpine native, ~10KB). Works. | ||
| - Root entrypoint + privilege drop: VALIDATED. Clean separation. | ||
| - Ready file at /var/run/clawker/ready: Works as signal mechanism. | ||
| - buf for protobuf generation: Configured (buf.yaml + buf.gen.yaml), `make proto` target added. | ||
| - Test harness extended with WithNetwork() and WithPortBinding() for control plane tests. | ||
|
|
||
| ## Conclusions / Insights | ||
| - The two-server gRPC pattern is clean and works well across Docker networking boundaries. | ||
| - Port binding (host port mapping) is needed on macOS/Docker Desktop where container IPs aren't routable from host. The CP's resolveAgentAddress() handles both port mapping and direct IP fallback. | ||
| - The POC entrypoint is minimal (3 lines) — the complexity lives in Go, not bash. This validates the "init logic in Go, not bash" principle. | ||
| - clawkerd's RunInit handles step failures gracefully (logs, sends FAILED event, continues to next step). | ||
|
|
||
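The "port mapping first, direct IP as fallback" rule described for `resolveAgentAddress()` can be illustrated as follows. This is a sketch under assumptions, not the PR's implementation: `containerNet` is a hypothetical summary of a Docker inspect result, and `agentAddress` is a hypothetical name chosen to avoid implying it matches the real function's signature.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
)

// containerNet is a hypothetical summary of what a Docker inspect call returns.
type containerNet struct {
	IP        string         // container IP on the shared network, may be empty
	HostPorts map[int]string // container port -> published host port, if any
}

// agentAddress prefers a published host port (needed where container IPs are
// not routable from the host) and falls back to the container IP.
func agentAddress(info containerNet, listenPort int, hostAddr string) (string, error) {
	if hp, ok := info.HostPorts[listenPort]; ok && hp != "" {
		return net.JoinHostPort(hostAddr, hp), nil
	}
	if info.IP != "" {
		return net.JoinHostPort(info.IP, strconv.Itoa(listenPort)), nil
	}
	return "", errors.New("no published port and no container IP")
}

func main() {
	mapped := containerNet{IP: "172.18.0.3", HostPorts: map[int]string{50051: "32768"}}
	direct := containerNet{IP: "172.18.0.3"}
	for _, info := range []containerNet{mapped, direct} {
		addr, err := agentAddress(info, 50051, "127.0.0.1")
		fmt.Println(addr, err)
	}
}
```

Keeping the decision in a pure function like this would also let it be unit-tested without a Docker daemon.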
| ## Gotchas / Risks | ||
| - Container ID truncation: Docker sets hostname to 12-char prefix. Need consistent ID handling between clawkerd and CP. | ||
| - The CP currently does Docker inspect in the Register RPC handler — this is synchronous and could slow registration if Docker is slow. | ||
| - The `go s.runInitOnAgent()` goroutine in Register has no structured lifecycle management yet (no errgroup, no cancellation tracking). | ||
| - No auth on the clawkerd→CP gRPC connection beyond the shared secret in Register. The callback connection (CP→clawkerd) has no auth at all. | ||
|
|
||
| ## Unknowns | ||
| - Production entrypoint behavior: should it block on clawkerd ready signal or proceed immediately? | ||
| - How will the existing hostproxy retirement timeline work? The memory says "replaced" but current codebase still has full hostproxy. | ||
| - What's the migration path from current `CreateContainer()` orchestration to CP-mediated creation? | ||
| - How does the init spec get populated in production? Currently hardcoded in test. | ||
|
|
||
| ## Next Steps | ||
| - Decide which production concern to tackle next in the POC iteration | ||
| - Candidates: entrypoint wait, HostConfig.Init, init spec from clawker.yaml, Docker Events, graceful degradation | ||
| - Each iteration: add the concern to test/controlplane/, validate, update master memory | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,10 @@ | ||
| version: v2 | ||
| plugins: | ||
| - local: protoc-gen-go | ||
| out: . | ||
| opt: | ||
| - module=github.com/schmitthub/clawker | ||
| - local: protoc-gen-go-grpc | ||
| out: . | ||
| opt: | ||
| - module=github.com/schmitthub/clawker |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| version: v2 | ||
| modules: | ||
| - path: internal/clawkerd/protocol | ||
| lint: | ||
| use: | ||
| - STANDARD | ||
| except: | ||
| - PACKAGE_DIRECTORY_MATCH | ||
| breaking: | ||
| use: | ||
| - FILE |
This note highlights that the current gRPC control-plane/agent channels have no strong authentication: there is only a shared secret on `Register`, and no auth at all on the callback connection. Meanwhile, the agent exposes a `RunInit` endpoint that executes commands from the control plane. As a result, any process or container with network access to these gRPC endpoints can impersonate the control plane or an agent and trigger arbitrary initialization commands, yielding remote code execution inside the container. To mitigate this, add mutual authentication for both directions (e.g., mTLS with per-agent identities or signed tokens) and restrict the exposed listeners so that only trusted peers on authenticated channels can invoke `Register`/`RunInit`.