expose Hami MCP#1925
Conversation
Signed-off-by: Tim <tim.wang03@sap.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: haitwang-cloud The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request introduces a comprehensive execution plan (hami-mcp-server.md) to add a Model Context Protocol (MCP) server to HAMi for exposing read-only GPU and scheduler states. The review feedback provides valuable improvements to the plan, including correcting Kubernetes RBAC constraints for ConfigMaps, expanding secret redaction to cover init and ephemeral containers, reusing existing HAMi client initialization utilities, refining static check regexes to avoid false positives in tests, and adjusting the E2E test client constructor to be more idiomatic for Ginkgo suites.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # ServiceAccount: hami-mcp-server | ||
| # ClusterRole verbs (HARD LIMIT): | ||
| # - get/list/watch on: nodes, pods, namespaces, configmaps (hami-scheduler-config only) | ||
| # - NO write verbs anywhere | ||
| # - NO get on secrets |
There was a problem hiding this comment.
In Kubernetes RBAC, you cannot restrict list or watch requests by resourceNames. The API server does not support filtering lists by name at the authorization level, so any attempt to list or watch ConfigMaps with this ClusterRole will result in a 403 Forbidden error. Since the MCP server likely only needs to read the specific hami-scheduler-config ConfigMap, you should only grant the get verb for the specific ConfigMap name under resourceNames or allow listing all ConfigMaps.
| # ServiceAccount: hami-mcp-server | |
| # ClusterRole verbs (HARD LIMIT): | |
| # - get/list/watch on: nodes, pods, namespaces, configmaps (hami-scheduler-config only) | |
| # - NO write verbs anywhere | |
| # - NO get on secrets | |
| # ServiceAccount: hami-mcp-server | |
| # ClusterRole verbs (HARD LIMIT): | |
| # - get/list/watch on: nodes, pods, namespaces | |
| # - get on: configmaps (restricted to hami-scheduler-config via resourceNames) | |
| # - NO write verbs anywhere | |
| # - NO get on secrets |
|
|
||
| ### 3.B Output redaction | ||
| `pkg/mcp/redact/redact.go` strips: | ||
| - All container `env` entries whose name matches `(?i)(token|secret|password|key|cred)` |
There was a problem hiding this comment.
The redaction plan currently only mentions stripping secrets from "container env entries". Kubernetes pods can also run initContainers and ephemeralContainers, which can similarly contain sensitive environment variables. To prevent accidental leaks, the redaction logic in pkg/mcp/redact/redact.go should explicitly process all container types (spec.containers, spec.initContainers, and spec.ephemeralContainers).
| - All container `env` entries whose name matches `(?i)(token|secret|password|key|cred)` | |
| - All container, initContainer, and ephemeralContainer `env` entries whose name matches `(?i)(token|secret|password|key|cred)` |
| pkg/mcp/ | ||
| server.go # Server constructor, tool/resource registration | ||
| client/ | ||
| k8s.go # Wraps client-go: list nodes, pods, configmaps |
There was a problem hiding this comment.
HAMi already has a robust, centralized Kubernetes client initialization package in pkg/util/client/client.go which automatically handles loading kubeconfig from the environment (KUBECONFIG) or falling back to in-cluster config. Instead of creating a new custom client wrapper in pkg/mcp/client/k8s.go, you should reuse pkg/util/client to avoid code duplication and ensure consistent client configuration across the project.
| k8s.go # Wraps client-go: list nodes, pods, configmaps | |
| k8s.go # Uses pkg/util/client to list nodes, pods, configmaps |
| # Forbid write verbs in mcp packages | ||
| if grep -rE '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then | ||
| echo "ERROR: mcp packages must not call mutating K8s APIs" | ||
| exit 1 | ||
| fi |
There was a problem hiding this comment.
The proposed static check uses a simple grep pattern \.(Update|Create|Delete|Patch|Apply)\( to forbid write operations. This regex is highly prone to false positives. For example, calling a completely unrelated local method or builder pattern like mcpServer.Create(...) or tools.Create(...) will trigger a build failure. Consider refining the regex to target K8s client interfaces specifically, or using a more robust AST-based linter/static analysis tool to enforce read-only API usage.
| # Forbid write verbs in mcp packages | |
| if grep -rE '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then | |
| echo "ERROR: mcp packages must not call mutating K8s APIs" | |
| exit 1 | |
| fi | |
| # Forbid write verbs in mcp packages (excluding test files to avoid false positives on mocks/fixtures) | |
| if grep -rE --exclude='*_test.go' '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then | |
| echo "ERROR: mcp packages must not call mutating K8s APIs" | |
| exit 1 | |
| fi |
Add a Model Context Protocol (MCP) server that exposes HAMi GPU scheduling state to AI assistants (Claude Desktop, Claude Code, Cursor, etc.). ## New Components - cmd/mcp-server: Standalone MCP server binary with CLI flags - pkg/mcp/server: MCP server constructor with tool/resource registration - pkg/mcp/client: K8s and Prometheus client wrappers (read-only) - pkg/mcp/tools: 5 MCP tools for GPU queries - pkg/mcp/redact: Secret redaction for sensitive data - pkg/mcp/resources: HAMi config resource ## MCP Tools - list_gpu_nodes: List GPU nodes with vendor/count/memory info - list_gpu_pods: List pods with GPU resource requests - get_quota_usage: Get GPU quota usage per namespace - get_gpu_metrics: Query Prometheus GPU metrics - describe_node: Detailed node info with GPU devices ## Build & Deployment - Makefile: Add mcp-server and docker-mcp targets - docker/Dockerfile.mcp: Multi-stage distroless image - charts/hami/templates/mcp-server/: Helm templates - docs/mcp-server.md: User documentation ## E2E Tests - Full MCP protocol stack tests with in-memory transports - Fake K8s clientset and stubbed Prometheus server - Coverage for all 5 tools, resources, and error handling ## Security - Read-only K8s API access (no write verbs) - RBAC with minimal permissions (no secrets access) - Automatic redaction of sensitive data in responses Signed-off-by: Tim <tim@example.com> Signed-off-by: Tim <tim.wang03@sap.com>
What type of PR is this?
/kind feature
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?: