
feat(control-plane): UserIdDirectory + ServerIdDirectory, e2e stability fixes #44

Merged
gmliao merged 16 commits into main from fix/matchmaking-ticket-env-loading
Feb 22, 2026

Conversation

Owner

@gmliao gmliao commented Feb 22, 2026

Summary

Part 1 — Matchmaking stability fixes (pre-existing commits)

  • ticketId is now a UUID; assignments are stored per key with a TTL (300s at first, later raised to 3600s to match the 1h JWT expiry)
  • Deferred env reads to avoid startup-time failures
  • REDIS_DB=1 isolation for split-role e2e
  • BullMQ/Redis connection stability in e2e (globalSetup, beforeAll)

Part 2 — ClusterDirectory → UserIdDirectory + ServerIdDirectory refactor

Refactors the in-memory ServerRegistryService into a Redis-backed ServerIdDirectory, making game server registrations visible across all control-plane nodes.

Architecture change:

  • UserIdDirectory (USER_ID_DIRECTORY) — userId→nodeId with TTL lease; used by Realtime and Matchmaking (rename of ClusterDirectory)
  • ServerIdDirectory (SERVER_ID_DIRECTORY) — serverId→ServerEntry; Redis-backed, cross-node; used by Provisioning and Admin
  • ClusterDirectory token and ServerRegistryService removed
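
A minimal sketch of how the two directories might be shaped, using in-memory stand-ins. The PR names the directories and their key/value mapping but not their methods, so `register`, `lookup`, `list`, and the `host`/`port` fields on `ServerEntry` are hypothetical. The Redis-backed versions are presumably asynchronous (the PR notes `getServers()` became async); the sketch stays synchronous to keep it short.

```typescript
// Hypothetical ServerEntry fields; only the type name comes from the PR.
interface ServerEntry {
  serverId: string;
  host: string;
  port: number;
}

// userId -> nodeId with a TTL lease (method names are assumptions).
interface UserIdDirectory {
  register(userId: string, nodeId: string, ttlMs: number): void;
  lookup(userId: string): string | undefined;
}

// serverId -> ServerEntry (method names are assumptions).
interface ServerIdDirectory {
  register(entry: ServerEntry): void;
  list(): ServerEntry[];
}

class InMemoryUserIdDirectory implements UserIdDirectory {
  private readonly leases = new Map<string, { nodeId: string; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so lease expiry can be exercised deterministically.
  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  register(userId: string, nodeId: string, ttlMs: number): void {
    this.leases.set(userId, { nodeId, expiresAt: this.now() + ttlMs });
  }

  lookup(userId: string): string | undefined {
    const lease = this.leases.get(userId);
    if (lease === undefined) return undefined;
    if (lease.expiresAt <= this.now()) {
      // The lease has lapsed, so the user is no longer owned by any node.
      this.leases.delete(userId);
      return undefined;
    }
    return lease.nodeId;
  }
}

class InMemoryServerIdDirectory implements ServerIdDirectory {
  private readonly servers = new Map<string, ServerEntry>();

  register(entry: ServerEntry): void {
    this.servers.set(entry.serverId, entry);
  }

  list(): ServerEntry[] {
    return Array.from(this.servers.values());
  }
}
```

Splitting the two concerns keeps lease-based user routing separate from server registration, which has no TTL in this sketch.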

Files changed:

  • infra/cluster-directory/ — new UserIdDirectory, ServerIdDirectory interfaces + Redis impls
  • infra/contracts/server-entry.dto.ts: ServerEntry extracted to shared location
  • ProvisioningController, InMemoryProvisioningClient — inject SERVER_ID_DIRECTORY
  • AdminController — inject SERVER_ID_DIRECTORY, getServers() now async
  • UserSessionRegistryService, MatchmakingService — inject USER_ID_DIRECTORY
  • All unit tests and e2e tests updated

E2E Redis isolation:

  • e2e-global-setup.ts and e2e-helpers.flushServerKeys() flush cd:server:* keys between suites to prevent cross-test pollution
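
The prefix flush can be sketched roughly as follows. `KeyStore` and `MapStore` are hypothetical stand-ins, not the Redis client API, and the real `flushServerKeys()` is not shown in this PR. Against a real Redis instance, iterating with `SCAN` and a `MATCH cd:server:*` pattern is generally preferable to `KEYS`.

```typescript
// Minimal key-store surface assumed for this sketch (not a real client API).
interface KeyStore {
  keys(): string[];
  del(key: string): void;
}

// In-memory stand-in so the helper can run without a Redis server.
class MapStore implements KeyStore {
  readonly data = new Map<string, string>();

  keys(): string[] {
    return Array.from(this.data.keys());
  }

  del(key: string): void {
    this.data.delete(key);
  }
}

// Deletes every key that starts with `prefix` and returns how many were removed.
function flushKeysWithPrefix(store: KeyStore, prefix: string): number {
  let removed = 0;
  // keys() returns a snapshot array, so deleting while looping is safe here.
  for (const key of store.keys()) {
    if (key.startsWith(prefix)) {
      store.del(key);
      removed++;
    }
  }
  return removed;
}
```

Calling `flushKeysWithPrefix(store, 'cd:server:')` between suites would clear server registrations while leaving unrelated keys alone.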

Test plan

  • npm run build — PASS
  • npm test — 11 suites / 49 tests PASS
  • npm run test:e2e -- --runInBand — 6 suites / 18 passed + 1 skipped PASS

🤖 Generated with Claude Code

gmliao and others added 15 commits February 22, 2026 13:22
[P1] LocalMatchQueue: Use UUID for ticketId (align with ApiMatchQueue)
- Prevents collision with old Redis assignments after restart
- Fixes provisioning e2e failures

[P1] RedisMatchmakingStore: Add assignment TTL (300s)
- Switch from hash to per-key storage with EXPIRE
- Prevents stale assignment data accumulation

[P1] Defer env reads to runtime
- BullMQModule: forRootAsync useFactory for Redis config
- MatchmakingModule: useFactory for MatchmakingConfig

[P2] Split-role e2e: Skip in-process test (cannot achieve role isolation)
- jest.resetModules breaks NestJS BullMQ ModuleRef
- matchmaking-split-roles-external.e2e-spec.ts provides definitive verification

Co-authored-by: Cursor <cursoragent@cursor.com>
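
A rough illustration of why deferring env reads helps, assuming a factory-style provider. `REDIS_HOST`, `REDIS_PORT`, the defaults, and the `'REDIS_CONFIG'` token are hypothetical; only `REDIS_DB` and the `useFactory` approach appear in this PR. The environment is passed in as a plain object so the sketch does not depend on Node typings.

```typescript
type Env = Record<string, string | undefined>;

interface RedisConfig {
  host: string;
  port: number;
  db: number;
}

// Reads settings when called, not when the module is first imported.
function readRedisConfig(env: Env): RedisConfig {
  return {
    host: env['REDIS_HOST'] ?? 'localhost', // hypothetical variable and default
    port: Number(env['REDIS_PORT'] ?? '6379'), // hypothetical variable and default
    db: Number(env['REDIS_DB'] ?? '0'),
  };
}

// Provider-shaped object in the style of a factory provider; the token is an assumption.
function redisConfigProvider(env: Env): { provide: string; useFactory: () => RedisConfig } {
  return {
    provide: 'REDIS_CONFIG',
    // The closure runs at resolution time, so values set after import are still seen.
    useFactory: () => readRedisConfig(env),
  };
}
```

Because the factory runs only when the provider is resolved, a test harness can set `REDIS_DB=1` after modules are imported and still have it take effect.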
- Delete matchmaking-split-roles.e2e-spec.ts (could not achieve role isolation)
- test:e2e:split now runs matchmaking-split-roles-external.e2e-spec.ts
- Use npm run test:e2e:split or npm run test:e2e:split:sh for split-role verification

Co-authored-by: Cursor <cursoragent@cursor.com>
…ollution

- test:e2e:split: add REDIS_DB=1 (align with test:e2e)
- e2e-split.sh: add REDIS_DB=1 for spawned API/worker processes
- Fixes 'No server available' / timeout when Redis DB 0 has stale data

Co-authored-by: Cursor <cursoragent@cursor.com>
[P1] admin.e2e Connection is closed:
- Use beforeAll/afterAll instead of beforeEach/afterEach to reduce app lifecycle
- Add 400ms drain delay in closeApp after app.close() for BullMQ teardown

[P2] Assignment TTL vs token expiry mismatch:
- Change ASSIGNMENT_TTL_SECONDS from 300 to 3600 (align with JWT exp 1h)
- Avoids 'token valid but status not found' when client polls after 5min

Co-authored-by: Cursor <cursoragent@cursor.com>
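
The TTL mismatch described above can be worked through with a small model. The function and its outcome labels are hypothetical; the 300s, 3600s, and 1h figures come from the commit message.

```typescript
type PollOutcome = 'ok' | 'token-expired' | 'status-not-found';

// Models what a client sees when polling `elapsedSeconds` after assignment.
function pollOutcome(
  elapsedSeconds: number,
  assignmentTtlSeconds: number,
  tokenLifetimeSeconds: number,
): PollOutcome {
  if (elapsedSeconds >= tokenLifetimeSeconds) {
    return 'token-expired';
  }
  if (elapsedSeconds >= assignmentTtlSeconds) {
    // The token is still valid, but the assignment key has already expired.
    return 'status-not-found';
  }
  return 'ok';
}
```

With a 300s TTL and a 3600s token, a poll at 600s yields `'status-not-found'`; raising the TTL to 3600s closes that window, since the token expires no later than the assignment.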
- Add e2e-global-setup.ts: flush Redis matchmaking/BullMQ keys before tests
- Reduces 'No server available' from stale jobs between runs
- Restore admin beforeAll/afterAll to avoid Connection is closed (trade-off: shared state)
- Keep 800ms drain delay in closeApp

Co-authored-by: Cursor <cursoragent@cursor.com>
@gmliao gmliao force-pushed the fix/matchmaking-ticket-env-loading branch from 450f71e to 17c6d34 Compare February 22, 2026 05:22
@chatgpt-codex-connector

💡 Codex Review


P1: Gate RealtimeGateway to API-enabled roles

The gateway is always registered, so MATCHMAKING_ROLE=queue-worker instances still accept /realtime connections even though this role disables channel subscriptions (isApiEnabled(...) short-circuits in both Redis channel services). In that state, clients can enqueue but will never receive match.assigned, so requests routed to a worker node appear to hang indefinitely; this breaks the split-role deployment model whenever a worker is reachable (LB misroute, direct access, or config drift).


const m = msg as { action?: string };
if (m?.action === 'enqueue') {
  // The parsed message is cast without any runtime validation.
  this.handleEnqueue(client, m as WsEnqueueMessage);
  return;
}

P2: Validate WebSocket enqueue payload before queuing

WebSocket messages are only JSON-parsed and then cast to WsEnqueueMessage without runtime validation, so malformed enqueue payloads (for example missing groupSize or members) are still forwarded to matchmakingService.enqueue. This can create invalid queued tickets that never become matchable and can pollute queue state, whereas the HTTP path is protected by DTO validation.


Owner Author

gmliao commented Feb 22, 2026

Both comments addressed in commit 585610e:

P1 — Gate RealtimeGateway to API-enabled roles

Added an early-return in handleConnection:

if (!isApiEnabled(getMatchmakingRole())) {
  client.close(4403, 'This node does not serve WebSocket connections (queue-worker role)');
  return;
}

queue-worker nodes now immediately close any incoming /realtime connection with code 4403, preventing the silent-hang scenario described.
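
A self-contained sketch of the role gate, for illustration only. The `'all'` and `'api'` role names and the `gateConnection` helper are hypothetical; only `queue-worker`, `isApiEnabled`, and close code 4403 appear in this PR. Close codes in the 4000-4999 range are reserved for application use by the WebSocket protocol, which is why 4403 is a valid choice.

```typescript
// Only 'queue-worker' is named in the PR; the other role names are assumptions.
type MatchmakingRole = 'all' | 'api' | 'queue-worker';

function isApiEnabled(role: MatchmakingRole): boolean {
  return role !== 'queue-worker';
}

// Minimal client surface needed by the gate.
interface ClosableClient {
  close(code: number, reason: string): void;
}

// Returns true when the connection may proceed, false when it was closed.
function gateConnection(client: ClosableClient, role: MatchmakingRole): boolean {
  if (!isApiEnabled(role)) {
    client.close(4403, 'This node does not serve WebSocket connections (queue-worker role)');
    return false;
  }
  return true;
}
```

Closing with an explicit code lets a client or load balancer distinguish a misrouted connection from a transient network failure.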

P2 — Validate WebSocket enqueue payload via class-validator

Added WsEnqueueMessageDto class to ws-envelope.dto.ts with the same validators as the HTTP EnqueueRequest DTO (@IsNotEmpty, @IsString, @IsArray, @IsInt, @Min(1), @IsOptional). In setupMessageHandler, before calling handleEnqueue, the payload is now validated:

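const dto = plainToInstance(WsEnqueueMessageDto, m);
const errors = await validate(dto);
if (errors.length > 0) {
  // `detail` is not defined in this excerpt; the derivation below is an assumption.
  const detail = errors
    .map((e) => Object.values(e.constraints ?? {}).join(', '))
    .join('; ');
  this.sendError(client, `Invalid enqueue payload: ${detail}`);
  return;
}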

Malformed payloads (missing queueKey, wrong groupSize type, etc.) are now rejected with a descriptive error message before reaching the queue. Both class-validator and class-transformer were already in package.json.
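
To show the kind of checks involved without pulling in class-validator, here is a dependency-free sketch that swaps the decorators for hand-written checks. The field names `queueKey`, `groupSize`, and `members` come from this thread; treating `members` as an array of strings, and the exact error wording, are assumptions.

```typescript
// Hand-rolled stand-in for the DTO validation; not the class-validator API.
function validateEnqueuePayload(raw: unknown): string[] {
  if (typeof raw !== 'object' || raw === null) {
    return ['payload must be an object'];
  }
  const payload = raw as Record<string, unknown>;
  const errors: string[] = [];

  const queueKey = payload['queueKey'];
  if (typeof queueKey !== 'string' || queueKey.length === 0) {
    errors.push('queueKey must be a non-empty string');
  }

  const groupSize = payload['groupSize'];
  if (typeof groupSize !== 'number' || !Number.isInteger(groupSize) || groupSize < 1) {
    errors.push('groupSize must be an integer >= 1');
  }

  const members = payload['members'];
  if (!Array.isArray(members) || !members.every((m) => typeof m === 'string')) {
    errors.push('members must be an array of strings');
  }

  return errors;
}
```

An empty result means the payload may be forwarded to the queue; otherwise the messages can be joined into the error sent back to the client.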

All tests pass: build ✅ / unit 49/49 ✅ / e2e 18/19 ✅ (1 pre-existing skip)

@gmliao gmliao merged commit 834abaf into main Feb 22, 2026
5 checks passed