fix(perms): propagate sender/role through MCP bridge & delegate (#915)#1105
fix(perms): propagate sender/role through MCP bridge & delegate (#915)#1105mozaa-solana wants to merge 1 commit into
Conversation
…levelbuilder#915) Group-scoped permission checks (CheckCronPermission / CheckFileWriterPermission) fail with senderID="" when tools are invoked through: 1. MCP bridge (external CLI providers like claude-cli) — middleware never injected SenderID/Role into ctx, so even though the agent loop set them on RunRequest, the bridge layer dropped attribution. 2. delegate tool (gateway_managed.go delegateRunFn) — built RunRequest without copying SenderID/Role/Channel/ChatID/PeerKind from DelegateRequest, so the target agent's loop_context.injectContext skipped WithSenderID and downstream group cron/file_writer mutations denied with senderID="". Fix: - claude_cli: new OptSenderID, OptRole; BridgeContext carries SenderID + Role; writeMCPConfig emits X-Sender-ID / X-Role headers and signs them in the HMAC payload (extras: localKey | sessionKey | senderID | role). VerifyBridgeContext gains a backward-compat fallback that drops the last 2 extras for sessions written pre-this-change. - gateway/server.bridgeContextMiddleware: read X-Sender-ID + X-Role headers (HMAC-protected), inject WithSenderID / WithRole into ctx. - agent loop_pipeline_callbacks: pass req.SenderID and req.Role into chatReq.Options so external CLI providers receive them. - gateway_managed.delegateRunFn: propagate SenderID/Role/Channel/ChatID/ PeerKind from DelegateRequest into RunRequest. - team_tool_dispatch / gateway_consumer_normal: warn when sender is empty in a group-target dispatch so admins can find the upstream caller losing attribution. - config_permission_store: surface diagnostic-rich error messages so LLMs/users know exactly which user/scope/configType denied, instead of a generic "permission denied". - cron tool: forward CheckCronPermission's diagnostic message to the LLM result so the agent can self-correct.
PR ReviewVerdict: REQUEST CHANGES SummaryThis PR fixes a real attribution bug: group-scoped permission checks lose the acting sender/role when tool calls go through the MCP bridge or delegate path, causing cron/file-writer mutations to deny with senderID empty. The intent is good, but the MCP bridge HMAC fallback currently allows unsigned X-Sender-ID / X-Role headers to be trusted for old-session signatures. That is a security issue because these headers power group-scoped permission checks. Risk level: high, because this touches permission context propagation across the bridge. Findings
bridgeContextMiddleware always passes localKey, sessionKey, senderID, roleHdr into VerifyBridgeContext, then injects senderID and roleHdr whenever ok == true. But VerifyBridgeContext has a backward-compat fallback that drops the last 2 extras: preSender := SignBridgeContext(..., extra[:len(extra)-2]...) That means a request with a valid pre-sender/role signature over only localKey|sessionKey can still include arbitrary X-Sender-ID / X-Role headers and pass verification via fallback. The middleware then injects those unsigned values into context. Impact: any bridge caller using an older MCP config signature could forge the sender or role headers for group-scoped permission checks. Since this PR specifically uses sender/role for cron/file-writer permission, these fields must not be trusted unless they were covered by the matched HMAC format. Suggested fix: have VerifyBridgeContext return an additional signal for which context fields were covered, e.g. senderRoleVerified bool, or avoid injecting sender/role unless the full signature including senderID/role matched. Expected behavior:
Please add a regression test: sign with old extras only (localKey, sessionKey), send non-empty senderID/role, assert bridge middleware does not inject them, or assert VerifyBridgeContext reports they are not verified.
Existing tests cover localKey/sessionKey extra params, but this PR adds security-sensitive fields and a fallback mode. Please add explicit tests for:
Anti-AI-Slop Check
Tests / CI
Final NotesThe delegate propagation side looks reasonable. The bridge side needs one more security boundary: do not trust sender/role unless the matched HMAC actually covered sender/role. Fix that and add regression tests, then this should be in good shape. |
mrgoonie
left a comment
There was a problem hiding this comment.
Maintainer recheck: still requesting changes. The bridge HMAC fallback issue from the prior review is still present at head b20d051e.
Blocking finding:
internal/gateway/server.goalways passessenderIDandroleHdrintoVerifyBridgeContext, then injects them into context wheneverok == true.internal/providers/claude_cli_mcp.goaccepts a pre-sender/role fallback by dropping the last two extras and still returns(true, true).- Result: a request signed with the older
localKey|sessionKeyformat can add arbitraryX-Sender-ID/X-Roleheaders and the middleware will trust them, even though those fields were not covered by the matched HMAC.
Why this matters: sender/role are exactly the fields used by group-scoped cron/file-writer permission checks. Backward compatibility for old MCP configs is fine, but old signatures must not authorize newly added identity headers.
Required fix:
- Change verification semantics so callers can distinguish
context signature validfromsender/role covered by signature— e.g. returnok, tenantVerified, senderRoleVerified. - Only call
store.WithSenderID/store.WithRolewhensenderRoleVerified == true. - Add regression tests for:
- current signature with sender/role validates and injects them;
- tampering sender/role fails full verification;
- old signature remains accepted for backward compatibility but does not inject sender/role.
No merge attempt: this is security-sensitive permission propagation and remains blocked until that boundary is fixed.
Summary
Group-scoped permission checks (
CheckCronPermission/CheckFileWriterPermission) deny withsenderID=""when tools are invoked through:claude-cli,opencode-go) — the bridge middleware never injectedSenderID/Roleinto ctx, so even though the agent loop sets them onRunRequest, the bridge layer dropped attribution. Result: any group cron / file_writer mutation called via the bridge denies as "system context".delegatetool (gateway_managed.godelegateRunFn) — builtRunRequestwithout copyingSenderID/Role/Channel/ChatID/PeerKindfromDelegateRequest, so the target agent'sloop_context.injectContextskippedWithSenderIDand downstream group cron/file_writer mutations denied withsenderID="".Changes
internal/providers/claude_cli.go— newOptSenderID,OptRole.internal/providers/claude_cli_mcp.go—BridgeContextcarriesSenderID+Role;writeMCPConfigemitsX-Sender-ID/X-Roleheaders and signs them in the HMAC payload (extras:localKey | sessionKey | senderID | role).VerifyBridgeContextgains a backward-compat fallback that drops the last 2 extras for sessions written pre-this-change.internal/providers/claude_cli_session.go—bridgeContextFromOptsreads new opts.internal/gateway/server.go(bridgeContextMiddleware) — readX-Sender-ID+X-Roleheaders (HMAC-protected), injectWithSenderID/WithRoleinto ctx.internal/agent/loop_pipeline_callbacks.go— passreq.SenderID/req.RoleintochatReq.Options.cmd/gateway_managed.go(delegateRunFn) — propagateSenderID/Role/Channel/ChatID/PeerKindfromDelegateRequestintoRunRequest.internal/tools/team_tool_dispatch.go+cmd/gateway_consumer_normal.go—slog.Warnwhen sender is empty in a group-target dispatch so admins can find upstream callers losing attribution.internal/store/config_permission_store.go— surface diagnostic-rich error messages (which user/scope/configType denied) instead of generic "permission denied".internal/tools/cron.go— forward the diagnostic message to LLM result so the agent can self-correct.Test plan
go build ./...cleango test ./internal/providers/... ./internal/store/... ./internal/gateway/... ./internal/agent/...passes (incl.TestVerifyBridgeContextHMAC fallbacks)VerifyBridgeContext.Related
Follow-up to #915. Diagnostic logs added in this PR are the ones I used to localize the root cause; happy to drop them if reviewers prefer.