fix: harden KnowledgeGraph and MCP server for multi-process access #948

Open
felipecpaiva wants to merge 1 commit into MemPalace:develop from felipecpaiva:fix/multi-process-safety
@felipecpaiva

Summary

When mempalace runs behind mcp-proxy (SSE mode) with multiple concurrent clients, two classes of concurrency bugs surface:

  1. SQLite "database is locked" in KnowledgeGraph — busy_timeout was only 10s with no application-level retry. Concurrent writers from separate mcp-proxy processes exhaust the timeout.

  2. ChromaDB client cache thrashing — every write changes chroma.sqlite3 mtime. Other processes detect this and recreate PersistentClient, reloading the full HNSW index from disk on every operation.

KnowledgeGraph fixes

  • busy_timeout 10s → 60s
  • _sqlite_retry decorator: exponential backoff with jitter (5 retries, only for "locked"/"busy" errors)
  • BEGIN IMMEDIATE for write transactions (detect contention at start, not mid-transaction)
  • PRAGMA wal_autocheckpoint=1000 + journal_size_limit=64MB (manage WAL growth)
  • atexit.register(self.close) for clean WAL checkpointing on shutdown
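
A minimal sketch of what a retry decorator along these lines could look like. This is not the PR's actual `_sqlite_retry`; the name `sqlite_retry`, its parameters, and the delay values are assumptions chosen for illustration. It retries only when the `sqlite3.OperationalError` message mentions "locked" or "busy", and re-raises everything else immediately.

```python
import functools
import random
import sqlite3
import time


def sqlite_retry(max_retries=5, base_delay=0.05, max_delay=2.0):
    """Retry a callable when SQLite reports a locked or busy database.

    Hypothetical helper: the parameter names and defaults are assumptions,
    not values taken from the PR (apart from the 5 retries).
    """

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # One initial attempt plus up to max_retries retries.
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    message = str(exc).lower()
                    retryable = "locked" in message or "busy" in message
                    if not retryable or attempt == max_retries:
                        raise
                    # Exponential backoff capped at max_delay, with full
                    # jitter so competing processes do not retry in lockstep.
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    time.sleep(random.uniform(0, delay))

        return wrapper

    return decorator
```

Jitter matters here because several processes that hit the same lock would otherwise wake up and collide again at the same moment.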

MCP server fixes

  • Rate-limit chroma.sqlite3 stat/mtime checks to 5-second intervals
  • _refresh_db_mtime() after writes prevents self-inflicted client recreation
  • Bypass cooldown for safety-critical DB disappearance detection (rebuild scenarios)
  • tool_reconnect now fully clears client + mtime state
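
The mtime handling above can be sketched roughly as follows. `MtimeWatcher` and its methods are hypothetical names, not the MCP server's real code, and one detail is an assumption: here the disappearance check runs on every call while only the mtime comparison is rate-limited.

```python
import os
import time


class MtimeWatcher:
    """Track a file's mtime, comparing it at most once per cooldown.

    Hypothetical illustration; names and structure are assumptions.
    """

    def __init__(self, path, cooldown=5.0, clock=time.monotonic):
        self.path = path
        self.cooldown = cooldown
        self._clock = clock
        self._last_check = None
        self._known_mtime = self._stat()

    def _stat(self):
        # Return the mtime in nanoseconds, or None if the file is missing.
        try:
            return os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return None

    def refresh(self):
        """Record the current mtime, e.g. right after this process writes."""
        self._known_mtime = self._stat()
        self._last_check = self._clock()

    def needs_reconnect(self):
        """Return True if the file vanished or was changed by someone else."""
        # A missing database bypasses the cooldown entirely.
        if not os.path.exists(self.path):
            self._known_mtime = None
            return True
        now = self._clock()
        if self._last_check is not None and now - self._last_check < self.cooldown:
            return False  # Checked recently; keep the cached client.
        self._last_check = now
        current = self._stat()
        changed = current != self._known_mtime
        self._known_mtime = current
        return changed
```

Calling `refresh()` after a local write is what keeps a process from treating its own write as an external change and rebuilding its client.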

Tests

  • TestSQLiteRetryDecorator — 5 cases: retry success, exhaustion, non-lock errors, busy variant
  • TestConnectionPragmas — 3 cases: autocheckpoint, journal_size_limit, WAL mode
  • TestMultiProcessLocking — 4 processes × 20 triples to same DB file, zero failures

Test plan

  • python -m pytest tests/ -v --ignore=tests/benchmarks — 958 passed
  • ruff check + ruff format --check — clean
  • Manual: connect 2+ Claude Code sessions via SSE, trigger concurrent kg_add calls
  • Manual: concurrent add_drawer calls from multiple clients — no cache thrashing

When multiple mcp-proxy SSE connections share the same mempalace data,
concurrent processes compete for SQLite and ChromaDB resources.

KnowledgeGraph changes:
- Increase busy_timeout from 10s to 60s
- Add _sqlite_retry decorator with exponential backoff for lock/busy errors
- Use BEGIN IMMEDIATE for writes (detect contention at transaction start)
- Add WAL autocheckpoint and journal_size_limit pragmas
- Register atexit handler for clean WAL shutdown
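
For illustration, the connection setup and write path described above might look roughly like this. The function names and the `triples` schema are assumptions, not taken from the KnowledgeGraph code; only the pragma values and the `BEGIN IMMEDIATE` pattern come from this PR's description.

```python
import atexit
import sqlite3


def open_connection(path):
    """Open a SQLite connection configured for multi-process writers."""
    # isolation_level=None disables implicit transactions so that the
    # caller issues BEGIN IMMEDIATE / COMMIT / ROLLBACK explicitly.
    conn = sqlite3.connect(path, timeout=60, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=60000")           # 60 s, in milliseconds
    conn.execute("PRAGMA wal_autocheckpoint=1000")      # checkpoint every 1000 pages
    conn.execute("PRAGMA journal_size_limit=67108864")  # 64 MB, in bytes
    # Hypothetical schema, for the example only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples "
        "(subject TEXT, predicate TEXT, object TEXT)"
    )
    # Close on interpreter exit so SQLite can checkpoint the WAL cleanly.
    atexit.register(conn.close)
    return conn


def add_triple(conn, subject, predicate, obj):
    """Insert one triple inside an explicit write transaction."""
    # BEGIN IMMEDIATE takes the write lock up front, so contention shows
    # up here rather than partway through the transaction.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute(
            "INSERT INTO triples (subject, predicate, object) VALUES (?, ?, ?)",
            (subject, predicate, obj),
        )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
```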

MCP server changes:
- Rate-limit chroma.sqlite3 mtime checks to 5s intervals to prevent
  PersistentClient recreation (and HNSW index reload) on every write
- Add _refresh_db_mtime() after ChromaDB writes to prevent self-triggered
  reconnects
- Bypass mtime cooldown for safety-critical DB disappearance detection
- Fix tool_reconnect to fully clear client and mtime state

Tests:
- Retry decorator: lock retry, exhaustion, non-lock errors, busy variant
- Connection pragmas: wal_autocheckpoint, journal_size_limit, WAL mode
- Multi-process concurrent writes: 4 processes x 20 triples, zero failures