fix: harden KnowledgeGraph and MCP server for multi-process access by felipecpaiva · Pull Request #948 · MemPalace/mempalace

felipecpaiva · 2026-04-16T13:19:53Z

Summary

When mempalace runs behind mcp-proxy (SSE mode) with multiple concurrent clients, two classes of concurrency bugs surface:

SQLite "database is locked" in KnowledgeGraph — busy_timeout was only 10s with no application-level retry. Concurrent writers from separate mcp-proxy processes exhaust the timeout.
ChromaDB client cache thrashing — every write changes chroma.sqlite3 mtime. Other processes detect this and recreate PersistentClient, reloading the full HNSW index from disk on every operation.

KnowledgeGraph fixes

busy_timeout 10s → 60s
_sqlite_retry decorator: exponential backoff with jitter (5 retries, only for "locked"/"busy" errors)
BEGIN IMMEDIATE for write transactions (detect contention at start, not mid-transaction)
PRAGMA wal_autocheckpoint=1000 + journal_size_limit=64MB (manage WAL growth)
atexit.register(self.close) for clean WAL checkpointing on shutdown
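
The diff itself is not reproduced on this page, so the following is only a minimal sketch of how the fixes listed above could fit together, not the PR's actual code. The names `sqlite_retry`, `open_connection`, `add_triple`, the `triples` table, and the backoff delay values are all hypothetical; only the pragma names, the 60s timeout, the retry count of 5, and `BEGIN IMMEDIATE` come from the description.

```python
import atexit
import functools
import random
import sqlite3
import time


def sqlite_retry(max_retries=5, base_delay=0.05, max_delay=2.0, sleep=time.sleep):
    """Retry a call when SQLite reports a lock/busy condition (hypothetical helper)."""

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return fn(*args, **kwargs)
                except sqlite3.OperationalError as exc:
                    msg = str(exc).lower()
                    transient = "locked" in msg or "busy" in msg
                    # Re-raise non-lock errors at once, and lock errors
                    # once the retry budget is spent.
                    if not transient or attempt == max_retries:
                        raise
                    # Exponential backoff, capped, with full jitter.
                    delay = min(max_delay, base_delay * (2 ** attempt))
                    sleep(random.uniform(0, delay))

        return wrapper

    return decorator


def open_connection(path):
    """Open a connection with the pragmas named in the PR description."""
    # isolation_level=None: transactions are controlled explicitly below.
    conn = sqlite3.connect(path, timeout=60, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA busy_timeout=60000")          # 60s, in milliseconds
    conn.execute("PRAGMA wal_autocheckpoint=1000")     # checkpoint every 1000 pages
    conn.execute("PRAGMA journal_size_limit=67108864") # 64 MB
    # Hypothetical schema, for illustration only.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    atexit.register(conn.close)  # close cleanly on interpreter shutdown
    return conn


@sqlite_retry()
def add_triple(conn, subject, predicate, obj):
    # BEGIN IMMEDIATE takes the write lock up front, so contention
    # surfaces here rather than partway through the transaction.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("INSERT INTO triples VALUES (?, ?, ?)", (subject, predicate, obj))
        conn.execute("COMMIT")
    except BaseException:
        conn.execute("ROLLBACK")
        raise
```

Because `BEGIN IMMEDIATE` sits outside the `try`, a lock error raised while acquiring the write lock leaves no open transaction to roll back, and the decorator can simply call the function again.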

MCP server fixes

Rate-limit chroma.sqlite3 stat/mtime checks to 5-second intervals
_refresh_db_mtime() after writes prevents self-inflicted client recreation
Bypass cooldown for safety-critical DB disappearance detection (rebuild scenarios)
tool_reconnect now fully clears client + mtime state
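
As a rough illustration of the mtime cooldown described above, the sketch below tracks a file's mtime but compares it at most once per interval, and lets the process record its own writes. The class name `DbMtimeWatcher`, its methods, and the injectable `clock` are hypothetical; the page does not show how the server implements this, and in particular the per-call existence check used here to bypass the cooldown is an assumption.

```python
import os
import time


class DbMtimeWatcher:
    """Detect external changes to a file, comparing mtime at most once per interval."""

    def __init__(self, path, interval=5.0, clock=time.monotonic):
        self.path = path
        self.interval = interval
        self.clock = clock
        self.last_mtime = self._stat()
        self.last_check = clock()

    def _stat(self):
        try:
            return os.stat(self.path).st_mtime
        except FileNotFoundError:
            return None

    def refresh(self):
        """Call after this process writes, so its own write is not treated as external."""
        self.last_mtime = self._stat()
        self.last_check = self.clock()

    def needs_reconnect(self):
        # Assumption: a missing file is checked on every call,
        # regardless of the cooldown (the rebuild scenario).
        if not os.path.exists(self.path):
            self.last_mtime = None
            return True
        now = self.clock()
        if now - self.last_check < self.interval:
            return False  # still inside the cooldown window
        self.last_check = now
        current = self._stat()
        if current != self.last_mtime:
            self.last_mtime = current
            return True
        return False
```

A caller would recreate its client only when `needs_reconnect()` returns true, and call `refresh()` right after each of its own writes.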

Tests

TestSQLiteRetryDecorator — 5 cases: retry success, exhaustion, non-lock errors, busy variant
TestConnectionPragmas — 3 cases: autocheckpoint, journal_size_limit, WAL mode
TestMultiProcessLocking — 4 processes × 20 triples to same DB file, zero failures
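
For readers who want to reproduce the multi-process scenario outside the test suite, here is a hedged sketch of the same shape: 4 writer processes, 20 rows each, one database file. It is not the PR's `TestMultiProcessLocking`; the `triples` schema, the inline worker script, and `run_concurrent_writers` are hypothetical, and it launches workers with `subprocess` rather than whatever mechanism the real test uses.

```python
import sqlite3
import subprocess
import sys

# Script run by each child process. Table and column names are hypothetical.
WORKER = """
import sqlite3, sys
path, worker_id, n = sys.argv[1], sys.argv[2], int(sys.argv[3])
conn = sqlite3.connect(path, timeout=60, isolation_level=None)
conn.execute("PRAGMA busy_timeout=60000")
for i in range(n):
    conn.execute("BEGIN IMMEDIATE")
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?)",
        (f"worker-{worker_id}", "wrote", f"triple-{i}"),
    )
    conn.execute("COMMIT")
conn.close()
"""


def run_concurrent_writers(db_path, n_procs=4, n_triples=20):
    """Start n_procs writers against one file; return (exit codes, row count)."""
    setup = sqlite3.connect(db_path)
    setup.execute("PRAGMA journal_mode=WAL")  # WAL mode persists in the file
    setup.execute(
        "CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    setup.commit()
    setup.close()

    procs = [
        subprocess.Popen(
            [sys.executable, "-c", WORKER, str(db_path), str(k), str(n_triples)]
        )
        for k in range(n_procs)
    ]
    exit_codes = [p.wait() for p in procs]

    check = sqlite3.connect(db_path)
    (count,) = check.execute("SELECT COUNT(*) FROM triples").fetchone()
    check.close()
    return exit_codes, count
```

A run counts as clean when every exit code is 0 and the row count equals `n_procs * n_triples`.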

Test plan

python -m pytest tests/ -v --ignore=tests/benchmarks — 958 passed
ruff check + ruff format --check — clean
Manual: connect 2+ Claude Code sessions via SSE, trigger concurrent kg_add calls
Manual: concurrent add_drawer calls from multiple clients — no cache thrashing

When multiple mcp-proxy SSE connections share the same mempalace data, concurrent processes compete for SQLite and ChromaDB resources.

KnowledgeGraph changes:
- Increase busy_timeout from 10s to 60s
- Add _sqlite_retry decorator with exponential backoff for lock/busy errors
- Use BEGIN IMMEDIATE for writes (detect contention at transaction start)
- Add WAL autocheckpoint and journal_size_limit pragmas
- Register atexit handler for clean WAL shutdown

MCP server changes:
- Rate-limit chroma.sqlite3 mtime checks to 5s intervals to prevent PersistentClient recreation (and HNSW index reload) on every write
- Add _refresh_db_mtime() after ChromaDB writes to prevent self-triggered reconnects
- Bypass mtime cooldown for safety-critical DB disappearance detection
- Fix tool_reconnect to fully clear client and mtime state

Tests:
- Retry decorator: lock retry, exhaustion, non-lock errors, busy variant
- Connection pragmas: wal_autocheckpoint, journal_size_limit, WAL mode
- Multi-process concurrent writes: 4 processes x 20 triples, zero failures

felipecpaiva requested review from bensig, igorls and milla-jovovich as code owners April 16, 2026 13:19

felipecpaiva wants to merge 1 commit into MemPalace:develop from felipecpaiva:fix/multi-process-safety
