Skip to content

expose Hami MCP#1925

Draft
haitwang-cloud wants to merge 2 commits into
Project-HAMi:masterfrom
haitwang-cloud:expose-hami-mcp
Draft

expose Hami MCP#1925
haitwang-cloud wants to merge 2 commits into
Project-HAMi:masterfrom
haitwang-cloud:expose-hami-mcp

Conversation

@haitwang-cloud

Copy link
Copy Markdown
Contributor

What type of PR is this?
/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Signed-off-by: Tim <tim.wang03@sap.com>
@hami-robot

hami-robot Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: haitwang-cloud
Once this PR has been reviewed and has the lgtm label, please assign wawa0210 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive execution plan (hami-mcp-server.md) to add a Model Context Protocol (MCP) server to HAMi for exposing read-only GPU and scheduler states. The review feedback provides valuable improvements to the plan, including correcting Kubernetes RBAC constraints for ConfigMaps, expanding secret redaction to cover init and ephemeral containers, reusing existing HAMi client initialization utilities, refining static check regexes to avoid false positives in tests, and adjusting the E2E test client constructor to be more idiomatic for Ginkgo suites.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread hami-mcp-server.md
Comment on lines +154 to +158
# ServiceAccount: hami-mcp-server
# ClusterRole verbs (HARD LIMIT):
# - get/list/watch on: nodes, pods, namespaces, configmaps (hami-scheduler-config only)
# - NO write verbs anywhere
# - NO get on secrets

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

In Kubernetes RBAC, you cannot restrict list or watch requests by resourceNames. The API server does not support filtering lists by name at the authorization level, so any attempt to list or watch ConfigMaps with this ClusterRole will result in a 403 Forbidden error. Since the MCP server likely only needs to read the specific hami-scheduler-config ConfigMap, you should only grant the get verb for the specific ConfigMap name under resourceNames or allow listing all ConfigMaps.

Suggested change
# ServiceAccount: hami-mcp-server
# ClusterRole verbs (HARD LIMIT):
# - get/list/watch on: nodes, pods, namespaces, configmaps (hami-scheduler-config only)
# - NO write verbs anywhere
# - NO get on secrets
# ServiceAccount: hami-mcp-server
# ClusterRole verbs (HARD LIMIT):
# - get/list/watch on: nodes, pods, namespaces
# - get on: configmaps (restricted to hami-scheduler-config via resourceNames)
# - NO write verbs anywhere
# - NO get on secrets

Comment thread hami-mcp-server.md

### 3.B Output redaction
`pkg/mcp/redact/redact.go` strips:
- All container `env` entries whose name matches `(?i)(token|secret|password|key|cred)`

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The redaction plan currently only mentions stripping secrets from "container env entries". Kubernetes pods can also run initContainers and ephemeralContainers, which can similarly contain sensitive environment variables. To prevent accidental leaks, the redaction logic in pkg/mcp/redact/redact.go should explicitly process all container types (spec.containers, spec.initContainers, and spec.ephemeralContainers).

Suggested change
- All container `env` entries whose name matches `(?i)(token|secret|password|key|cred)`
- All container, initContainer, and ephemeralContainer `env` entries whose name matches `(?i)(token|secret|password|key|cred)`

Comment thread hami-mcp-server.md
pkg/mcp/
server.go # Server constructor, tool/resource registration
client/
k8s.go # Wraps client-go: list nodes, pods, configmaps

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

HAMi already has a robust, centralized Kubernetes client initialization package in pkg/util/client/client.go which automatically handles loading kubeconfig from the environment (KUBECONFIG) or falling back to in-cluster config. Instead of creating a new custom client wrapper in pkg/mcp/client/k8s.go, you should reuse pkg/util/client to avoid code duplication and ensure consistent client configuration across the project.

Suggested change
k8s.go # Wraps client-go: list nodes, pods, configmaps
k8s.go # Uses pkg/util/client to list nodes, pods, configmaps

Comment thread hami-mcp-server.md
Comment on lines +359 to +363
# Forbid write verbs in mcp packages
if grep -rE '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then
echo "ERROR: mcp packages must not call mutating K8s APIs"
exit 1
fi

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposed static check uses a simple grep pattern \.(Update|Create|Delete|Patch|Apply)\( to forbid write operations. This regex is highly prone to false positives. For example, calling a completely unrelated local method or builder pattern like mcpServer.Create(...) or tools.Create(...) will trigger a build failure. Consider refining the regex to target K8s client interfaces specifically, or using a more robust AST-based linter/static analysis tool to enforce read-only API usage.

Suggested change
# Forbid write verbs in mcp packages
if grep -rE '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then
echo "ERROR: mcp packages must not call mutating K8s APIs"
exit 1
fi
# Forbid write verbs in mcp packages (excluding test files to avoid false positives on mocks/fixtures)
if grep -rE --exclude='*_test.go' '\.(Update|Create|Delete|Patch|Apply)\(' pkg/mcp/ cmd/mcp-server/ 2>/dev/null; then
echo "ERROR: mcp packages must not call mutating K8s APIs"
exit 1
fi

Comment thread hami-mcp-server.md
Add a Model Context Protocol (MCP) server that exposes HAMi GPU scheduling
state to AI assistants (Claude Desktop, Claude Code, Cursor, etc.).

## New Components

- cmd/mcp-server: Standalone MCP server binary with CLI flags
- pkg/mcp/server: MCP server constructor with tool/resource registration
- pkg/mcp/client: K8s and Prometheus client wrappers (read-only)
- pkg/mcp/tools: 5 MCP tools for GPU queries
- pkg/mcp/redact: Secret redaction for sensitive data
- pkg/mcp/resources: HAMi config resource

## MCP Tools

- list_gpu_nodes: List GPU nodes with vendor/count/memory info
- list_gpu_pods: List pods with GPU resource requests
- get_quota_usage: Get GPU quota usage per namespace
- get_gpu_metrics: Query Prometheus GPU metrics
- describe_node: Detailed node info with GPU devices

## Build & Deployment

- Makefile: Add mcp-server and docker-mcp targets
- docker/Dockerfile.mcp: Multi-stage distroless image
- charts/hami/templates/mcp-server/: Helm templates
- docs/mcp-server.md: User documentation

## E2E Tests

- Full MCP protocol stack tests with in-memory transports
- Fake K8s clientset and stubbed Prometheus server
- Coverage for all 5 tools, resources, and error handling

## Security

- Read-only K8s API access (no write verbs)
- RBAC with minimal permissions (no secrets access)
- Automatic redaction of sensitive data in responses

Signed-off-by: Tim <tim@example.com>
Signed-off-by: Tim <tim.wang03@sap.com>
@hami-robot hami-robot Bot added size/XXL and removed size/L labels Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant