Skip to content

fix(transport): protocol and transport hardening for auth and lifecycle consistency#3344

Merged
jonpspri merged 5 commits intomainfrom
p-high-2
Apr 23, 2026
Merged

fix(transport): protocol and transport hardening for auth and lifecycle consistency#3344
jonpspri merged 5 commits intomainfrom
p-high-2

Conversation

@crivetimihai
Copy link
Copy Markdown
Member

Summary

  • harden JSON-RPC and protocol helper request validation/error mapping behavior
  • enforce admin-scoped access for diagnostics endpoint output
  • implement deterministic stateful streamable HTTP session close handling with affinity owner cleanup
  • align nginx upstream settings for replica-aware backend resolution and remove overlapping proxy-level security headers
  • extend unit coverage for RPC/protocol, OAuth callback handling, diagnostics authorization, and streamable HTTP lifecycle paths

Notes

  • includes current test update in tests/unit/scripts/test_demo_a2a_agent_auth.py from local formatting/lint pass

@crivetimihai crivetimihai added the bug Something isn't working label Mar 1, 2026
@crivetimihai crivetimihai added this to the Release 1.0.0-RC2 milestone Mar 1, 2026
@crivetimihai crivetimihai added security Improves security MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe labels Mar 1, 2026
@crivetimihai
Copy link
Copy Markdown
Member Author

Review

Rebased cleanly onto main. Code reviewed for correctness, security, performance, and codebase consistency.

Bug found & fixed: nginx resolve directive crashes container

The resolve parameter on upstream server directives is nginx Plus-only (commercial). Our container uses open-source nginx from Red Hat UBI, which rejects it at config validation — causing a restart loop. Removed resolve from all 3 nginx configs while keeping the zone shared memory directive (valid in open-source nginx). Docker DNS still resolves all replica IPs at startup, so least_conn distributes across replicas.

Differential test coverage added (7 tests)

  • test_ping_non_dict_body_returns_invalid_request — ping with array body
  • test_rpc_sampling_create_message_maps_sampling_error — RPC SamplingError → -32602
  • test_rpc_completion_complete_maps_completion_error — RPC CompletionError → -32602
  • test_handle_rpc_session_owner_check_not_found — stateful session 404 → -32002
  • test_handle_rpc_session_owner_check_forbidden — stateful session 403 → -32003
  • test_close_streamable_http_session_registry_none — null registry → 403
  • test_close_streamable_http_session_remove_fails — remove_session exception → 500

Review notes

Correct & well-designed:

  • RPC validation layering (type checks → field checks → Pydantic) with proper JSON-RPC 2.0 error codes
  • Internal error no longer leaks data: str(e) — uses -32603 without data field
  • _has_version_admin_access correctly reuses normalize_token_teams() as single source of truth
  • OAuth state param made optional with explicit None check before use
  • clear_pool_session_owner vs _cleanup_pool_session_owner distinction is correct (forceful vs worker-aware)
  • Streamable HTTP deterministic DELETE with ownership-checked teardown
  • Security headers deduplicated (app middleware already sets them all)

Minor nit (not blocking): req_id = body.get("id") is extracted twice in handle_rpc (lines 5963 and 5993) — the second is redundant but harmless.

No security or performance issues found.

@crivetimihai crivetimihai added the ica ICA related issues label Mar 2, 2026
@crivetimihai crivetimihai changed the title fix: protocol and transport hardening for auth and lifecycle consistency fix(transport): protocol and transport hardening for auth and lifecycle consistency Mar 2, 2026
@crivetimihai crivetimihai added merge-queue Rebased and ready to merge and removed ica ICA related issues labels Mar 3, 2026
@crivetimihai crivetimihai requested a review from MohanLaksh March 5, 2026 08:44
@crivetimihai crivetimihai added the release-fix Critical bugfix required for the release label Mar 5, 2026
Copy link
Copy Markdown
Collaborator

@MohanLaksh MohanLaksh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR #3344 Test Failures Analysis

Failing Tests Summary

1. test_ping_invalid_method - Expected 500, Got 400

Location: tests/e2e/test_main_apis.py::TestProtocolAPIs::test_ping_invalid_method

Issue: The PR changed the behavior to return 400 (Bad Request) with JSON-RPC error code -32600 for invalid methods, but the test expects 500.

Test Code (line 453):

assert response.status_code == 500  # Expects 500

Actual Behavior: Now returns 400 with proper JSON-RPC error mapping per the PR changes.


2. test_get_version - Expected 200, Got 403

Location: tests/e2e/test_main_apis.py::TestVersionAndDocs::test_get_version

Issue: The PR added admin-only access requirement to /version endpoint via _has_version_admin_access(), but the test expects public access (200).

Root Cause: The test comment says "no auth required" but the PR intentionally restricted this endpoint to admins only for security (preventing information disclosure).


3. test_execute_forwarded_request_returns_error_when_no_server

Location: tests/e2e/test_session_pool_e2e.py::TestMultiWorkerSessionAffinityE2E

Issue: Test expects 'error' key in response but gets {'result': {}} instead.

Root Cause: The PR's error handling changes may have affected how forwarded requests handle missing server scenarios.


4. test_affinity_logs_when_executing_forwarded_request

Location: tests/e2e/test_session_pool_e2e.py::TestMultiWorkerSessionAffinityE2E

Issue: Same as #3 - expects 'error' key but gets {'result': {}}.

Root Cause: Related to session affinity forwarding error handling changes.


Root Cause

All failures are expected behavior changes from the PR:

  1. Better error codes: 400 instead of 500 for invalid requests (JSON-RPC compliance)
  2. Security hardening: /version now requires admin access
  3. Error handling improvements: May have changed how certain edge cases are handled in session affinity

Action Required

The tests need to be updated to match the new behavior, not the code fixed. The PR's changes are intentional security and protocol compliance improvements.

Also,

🔍 Areas for Consideration (Non-Blocking)

  1. Redundant req_id Extraction (Line 6042 vs 6012)
    Impact: Harmless but noisy. Consider removing or consolidating. Not a blocker.

  2. Error Context for Debugging (Optional Enhancement)
    Current behavior returns identical -32600 errors for all invalid request scenarios. Consider adding debug-only context:

Rationale: Helps operators diagnose client misconfigurations without leaking internals in production. Non-blocking if omitted.

  1. Session Close Idempotency (Document or Handle)
    _close_streamable_http_session() returns 500 on remove_session() failure. Clarify:

Should already-closed sessions return 200 (idempotent) or 404 (session not found)?
Log the exception type for debugging transient vs permanent failures: logger.exception("remove_session failed", exc_info=True)
Recommendation: Add docstring noting idempotency behavior and structured exception logging.

  1. Test Parametrization Coverage (Optional Expansion)
    Current tests use @pytest.mark.parametrize effectively. Consider extending to edge cases:

Invalid jsonrpc versions: "1.0", "3.0", null, 123
Method edge cases: empty string "", numeric 12, null
Params edge cases: non-dict when required (array, string, null)
Not required for merge, but improves robustness.

  1. Nginx Configuration Documentation (Optional)
    The removal of security headers from nginx is correct and aligns with the architecture (app middleware is source of truth). Consider noting this decision in README.md to prevent future regressions.

🎯 Final Recommendation
APPROVE — Merge AFTER fixing the failing tests and optionally address:

  • (Nice-to-have) Remove redundant req_id extraction (line 6042)
  • (Nice-to-have) Document session close idempotency behavior in docstring
  • (Future) Expand test parametrization for edge cases
  • (Future) Add nginx security header decision to README.md
    No security or performance issues detected.

@crivetimihai crivetimihai self-assigned this Mar 7, 2026
@crivetimihai crivetimihai removed the release-fix Critical bugfix required for the release label Mar 26, 2026
crivetimihai and others added 4 commits April 23, 2026 14:35
Improve RPC/protocol validation behavior and error mapping, enforce admin-only diagnostics access, and harden stateful streamable HTTP session teardown semantics.

Align nginx upstream behavior for replica-aware balancing and remove overlapping proxy security headers. Update and expand unit coverage across main/version/oauth/streamable transport paths.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Update endpoint/helper docstrings for explicit args/returns/raises coverage and keep doctest examples aligned with current admin diagnostics access semantics.

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
Cover RPC-path SamplingError and CompletionError mapping to JSON-RPC
-32602, stateful session owner assertion (404/403 deny paths), ping
endpoint non-dict body handling, and streamable HTTP session close
error branches (registry None, remove_session failure).

Signed-off-by: Mihai Criveti <crivetimihai@gmail.com>
- version.py: _has_version_admin_access now accepts string inputs from
  require_admin_auth (which returns email after verifying admin status)
- test_version.py: reference require_admin_auth instead of nonexistent require_auth
- test_main_apis.py: ping invalid-method test expects 400 (not 500)
- test_main_extended.py: fix params list->dict, remove stale RPCRequest patch

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Jonathan Springer <jps@s390x.com>
@jonpspri jonpspri requested a review from brian-hussey as a code owner April 23, 2026 13:39
@jonpspri jonpspri force-pushed the p-high-2 branch 2 times, most recently from 9fdf2e8 to b5ae2a1 Compare April 23, 2026 14:28
… validation, remove data leaks

- Extract _jsonrpc_invalid_request() helper to deduplicate 5 identical
  error envelope constructions across ping() and _handle_rpc_authenticated()
- Skip RPCRequest Pydantic validation for trusted internal Rust dispatch
  (restores intentional perf optimization; manual type checks still run)
- Remove session_id from JSONRPCError data to avoid reflecting client input
- Remove full params dict from SamplingError/CompletionError JSON-RPC errors
  to prevent request payload leakage in error responses

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Jonathan Springer <jps@s390x.com>
@jonpspri jonpspri merged commit 264676c into main Apr 23, 2026
29 of 30 checks passed
@jonpspri jonpspri deleted the p-high-2 branch April 23, 2026 15:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working merge-queue Rebased and ready to merge MUST P1: Non-negotiable, critical requirements without which the product is non-functional or unsafe security Improves security

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants