fix: disable search history by default to limit the potential impact of table size growth #2549
Vasilije1990 wants to merge 1 commit into dev from
Conversation
Please make sure all the checkboxes are checked:
Walkthrough
Add optional, commented search-history config to the env template; introduce runtime flag to short-circuit search query/result persistence; add DB indexes on
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Force-pushed from e7e9f93 to b62b91b
Actionable comments posted: 2
🧹 Nitpick comments (1)
cognee/modules/search/operations/log_query.py (1)
19-23: Close the write session before running eviction.
`maybe_evict()` opens its own sessions. Keeping this `async with` open until cleanup finishes just prolongs the caller's connection lifetime, and `log_result()` already uses the cleaner ordering.
♻️ Suggested change
```diff
         await session.commit()
-
-        await maybe_evict()
-
-        return query
+    await maybe_evict()
+    return query
```
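The commit-then-evict ordering the reviewer describes can be illustrated with a self-contained sketch; `FakeSession`, `maybe_evict`, and the `events` list here are stand-ins for demonstration, not the actual cognee code:

```python
import asyncio

# Records the order of operations so the ordering is observable.
events = []

class FakeSession:
    """Hypothetical stand-in for the real async DB session."""
    async def __aenter__(self):
        events.append("session open")
        return self
    async def __aexit__(self, *exc):
        events.append("session closed")
    def add(self, obj):
        events.append("add")
    async def commit(self):
        events.append("commit")

async def maybe_evict():
    # The real helper opens its own sessions; here we just record the call.
    events.append("evict")

async def log_query(query_text: str):
    query = {"text": query_text}
    async with FakeSession() as session:
        session.add(query)
        await session.commit()
    # Eviction runs only after the write session has closed, so the
    # caller's DB connection is not held open during cleanup.
    await maybe_evict()
    return query

asyncio.run(log_query("hello"))
```

With the suggested ordering, "session closed" always precedes "evict" in `events`.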
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/modules/search/operations/evict_old_history.py`:
- Around line 24-25: The eviction interval constant _EVICTION_INTERVAL currently
uses a hard floor of 100 which can exceed small configured caps (MAX_ROWS) and
allow overshoot; update its calculation so the minimum interval is 1 write
instead of 100 (e.g., _EVICTION_INTERVAL = max(MAX_ROWS // 10, 1) if MAX_ROWS >
0 else 0) so eviction remains proportional to MAX_ROWS and never larger than the
cap for small configurations.
- Around line 29-45: maybe_evict currently propagates exceptions from
_evict_table and _vacuum_if_sqlite which can surface to callers after
log_query/log_result have already committed; change maybe_evict to perform
best-effort cleanup by wrapping the eviction and vacuum calls (calls to
_evict_table for Query and Result and to _vacuum_if_sqlite) in a try/except that
catches Exception, logs the failure with context (including which table/action
failed) and does not re-raise so callers aren’t impacted by cleanup errors.
---
Nitpick comments:
In `@cognee/modules/search/operations/log_query.py`:
- Around line 19-23: The code currently calls await maybe_evict() while the
write session (session) is still open; close the session before running eviction
to avoid holding the caller's DB connection. Modify the function (the async
block that uses session—e.g., in log_query or the function around
session.commit()) so you call await session.commit(), exit the async with
session block (or explicitly close the session), and only then call await
maybe_evict(). Keep the return of query after eviction if needed by the caller
(or return before eviction if following log_result ordering), but ensure
maybe_evict() is invoked outside the open session context.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a507d3db-35d1-4461-91db-310bfdc214bc
📒 Files selected for processing (4)
- .env.template
- cognee/modules/search/operations/evict_old_history.py
- cognee/modules/search/operations/log_query.py
- cognee/modules/search/operations/log_result.py
```python
# Only run eviction every N writes to avoid overhead on every search call
_EVICTION_INTERVAL = max(MAX_ROWS // 10, 100) if MAX_ROWS > 0 else 0
```
The 100-write floor breaks small configured caps.
With COGNEE_MAX_SEARCH_HISTORY=10, eviction will not run until 100 writes, so the table can overshoot the requested cap by almost 10x. The interval should stay proportional to the configured cap instead of exceeding it.
🐛 Suggested fix
```diff
-_EVICTION_INTERVAL = max(MAX_ROWS // 10, 100) if MAX_ROWS > 0 else 0
+_EVICTION_INTERVAL = max(MAX_ROWS // 10, 1) if MAX_ROWS > 0 else 0
```
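The overshoot on small caps can be checked numerically; this sketch mirrors the reviewed expression (function and parameter names are illustrative, not cognee code):

```python
def eviction_interval(max_rows: int, floor: int) -> int:
    # Mirrors the reviewed expression: the interval is a tenth of the cap,
    # but never below the floor; max_rows == 0 disables eviction entirely.
    return max(max_rows // 10, floor) if max_rows > 0 else 0

# With the 100-write floor, a cap of 10 rows is only enforced every
# 100 writes, so the table can reach roughly 10x the configured cap.
print(eviction_interval(10, 100))     # current behavior
print(eviction_interval(10, 1))       # suggested fix: evict every write
print(eviction_interval(10_000, 100)) # large caps are unaffected
```

For large caps both variants behave identically, which is why the floor only bites on small configurations.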
```python
async def maybe_evict():
    """Run eviction if the write counter has reached the interval threshold."""
    global _write_counter

    if MAX_ROWS <= 0:
        return  # unlimited — no eviction

    _write_counter += 1
    if _write_counter < _EVICTION_INTERVAL:
        return

    _write_counter = 0
    await _evict_table(Query, "queries")
    await _evict_table(Result, "results")

    if VACUUM_ENABLED:
        await _vacuum_if_sqlite()
```
Do not propagate cleanup failures after commit.
log_query() and log_result() commit before awaiting this helper. If _evict_table() or _vacuum_if_sqlite() raises here, callers get an error even though the history row was already persisted. This path should be best-effort and logged.
🛡️ Suggested change
```diff
 async def maybe_evict():
     """Run eviction if the write counter has reached the interval threshold."""
     global _write_counter
@@
-    _write_counter = 0
-    await _evict_table(Query, "queries")
-    await _evict_table(Result, "results")
-
-    if VACUUM_ENABLED:
-        await _vacuum_if_sqlite()
+    _write_counter = 0
+    try:
+        await _evict_table(Query, "queries")
+        await _evict_table(Result, "results")
+
+        if VACUUM_ENABLED:
+            await _vacuum_if_sqlite()
+    except Exception:
+        logger.exception("Search history eviction failed")
```
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/modules/search/models/Query.py`:
- Line 14: Add an Alembic migration to create the missing index for the
queries.user_id column so the DB matches the ORM: in the new revision use
op.create_index('ix_queries_user_id', 'queries', ['user_id']) in upgrade and
op.drop_index('ix_queries_user_id', table_name='queries') in downgrade;
reference the model change to user_id (Column(UUID, index=True)) and ensure the
migration message and revision metadata are set appropriately so CI/migrations
pick up the new index.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6c10ca81-9d4c-43fc-88a1-7c8fb391f868
📒 Files selected for processing (6)
- .env.template
- cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
- cognee/modules/search/models/Query.py
- cognee/modules/search/models/Result.py
- cognee/modules/search/operations/log_query.py
- cognee/modules/search/operations/log_result.py
✅ Files skipped from review due to trivial changes (2)
- cognee/modules/search/models/Result.py
- cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
🚧 Files skipped from review as they are similar to previous changes (3)
- cognee/modules/search/operations/log_result.py
- cognee/modules/search/operations/log_query.py
- .env.template
```diff
     text = Column(String)
     query_type = Column(String)
-    user_id = Column(UUID)
+    user_id = Column(UUID, index=True)
```
🧩 Analysis chain
🏁 Script executed:
```shell
#!/bin/bash
# Verify whether an index migration for queries.user_id exists in this PR branch.
fd -i 'b2c3d4e5f6a7_add_search_history_indexes.py' -x sed -n '1,140p' {}
rg -n -C2 'user_id\s*=\s*Column\(UUID,\s*index=True\)|create_index\("ix_queries_user_id"' \
  cognee/modules/search/models/Query.py cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
```
Repository: topoteretes/cognee
Length of output: 922
Add a migration for the queries.user_id index to maintain schema consistency.
Line 14 adds index=True to the user_id column, but the migration only creates indexes for created_at. Databases upgraded via migrations will be missing the ix_queries_user_id index, causing a mismatch between the ORM schema and the actual database schema.
Proposed migration patch
```diff
 def upgrade() -> None:
+    op.create_index("ix_queries_user_id", "queries", ["user_id"])
     op.create_index("ix_queries_created_at", "queries", ["created_at"])
     op.create_index("ix_results_created_at", "results", ["created_at"])


 def downgrade() -> None:
     op.drop_index("ix_results_created_at", "results")
     op.drop_index("ix_queries_created_at", "queries")
+    op.drop_index("ix_queries_user_id", "queries")
```
The queries and results tables grow without bound in long-running deployments (~3 GB/day at 4700 queries/day). Root cause: Result.value stored full serialized search responses (50-100 KB each) — graph edges, node properties, metadata — for a UI history feature that only needs the completion text.

Changes:
- Log only completion/context text to Result.value instead of the full serialized SearchResultPayload. The raw graph data is ephemeral and already lives in the graph DB.
- COGNEE_LOG_SEARCH_HISTORY=false disables logging entirely (recommended for daemon/automated deployments)
- Add created_at index on queries and results tables (Alembic migration)
- Add user_id index on queries table (was missing)
- Remove unused fastapi.encoders import from search.py

For existing bloated deployments: set COGNEE_LOG_SEARCH_HISTORY=false, DELETE FROM results, DELETE FROM queries, then VACUUM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Force-pushed from b62b91b to 94b0cf1
🧹 Nitpick comments (2)
cognee/modules/search/operations/log_query.py (1)
9-24: Add a docstring describing the function's behavior and the logging toggle. Per coding guidelines, undocumented function definitions are considered incomplete. A brief docstring would clarify the dual behavior (persist vs. skip) based on `COGNEE_LOG_SEARCH_HISTORY`.
📝 Suggested docstring
```diff
 async def log_query(query_text: str, query_type: str, user_id: UUID) -> Query:
+    """Create and optionally persist a Query record for search history.
+
+    When COGNEE_LOG_SEARCH_HISTORY is disabled, returns an in-memory Query
+    without database persistence. The returned Query.id is still a valid UUID
+    for correlation purposes within the same request.
+    """
     query = Query(
         text=query_text,
```
cognee/modules/search/methods/search.py (1)
127-133: Consider whether heterogeneous types in `completions` need flattening. Per `SearchResultPayload` (context snippet 1, line 19), `completion` can be `Union[str, List[str], List[dict]]` and `context` can be `Union[str, List[str]]`. The resulting `completions` list may contain a mix of strings and nested lists, which could complicate downstream parsing or analysis of the logged data.

If a uniform structure is desirable for log analysis, consider flattening or normalizing entries. However, this is not critical for the current use case of limiting DB bloat.
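A hypothetical normalizer along the lines the review suggests; the helper name and the JSON serialization choice are assumptions, not cognee code:

```python
import json

def normalize_completion(value):
    """Flatten a completion/context value into a flat list of strings.

    Accepts str, list[str], or list[dict] per the payload's declared types.
    """
    if isinstance(value, str):
        return [value]
    out = []
    for item in value:
        if isinstance(item, str):
            out.append(item)
        else:
            # Serialize dicts to a stable string form for logging.
            out.append(json.dumps(item, sort_keys=True))
    return out

print(normalize_completion("answer"))
print(normalize_completion(["a", {"node": 1}]))
```

Applying this to both `payload.completion` and `payload.context` before appending would keep the logged list uniformly `list[str]`.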
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@cognee/modules/search/methods/search.py`:
- Around line 127-133: The completions list in search.py collects
payload.completion and payload.context which can be str, List[str], or
List[dict] (per SearchResultPayload), producing heterogeneous and nested entries
that complicate downstream parsing; update the collection logic in the function
containing this diff to normalize values (e.g., create a helper like
normalize_completion) so each appended entry is a uniform type: flatten nested
lists into individual strings (or JSON-serialized dicts) and convert non-string
dict entries to a stable string form before appending; ensure you handle both
payload.completion and payload.context branches and preserve order while
avoiding excessive expansion that would bloat the DB.
In `@cognee/modules/search/operations/log_query.py`:
- Around line 9-24: Add a brief docstring to the async function log_query
explaining its behavior: it constructs a Query (using Query) from query_text,
query_type, and user_id and either persists it to the database via
get_relational_engine() when the logging toggle (_LOG_ENABLED, controlled by
COGNEE_LOG_SEARCH_HISTORY) is enabled or returns the in-memory Query without
persisting when logging is disabled; mention that it returns a Query and note
side effects (database commit) only occur when _LOG_ENABLED is true.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e81dc48b-926c-4b3a-abab-aaceda340147
📒 Files selected for processing (7)
- .env.template
- cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
- cognee/modules/search/methods/search.py
- cognee/modules/search/models/Query.py
- cognee/modules/search/models/Result.py
- cognee/modules/search/operations/log_query.py
- cognee/modules/search/operations/log_result.py
✅ Files skipped from review due to trivial changes (1)
- .env.template
🚧 Files skipped from review as they are similar to previous changes (4)
- cognee/modules/search/models/Query.py
- cognee/modules/search/operations/log_result.py
- cognee/modules/search/models/Result.py
- cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
```python
def upgrade() -> None:
    op.create_index("ix_queries_created_at", "queries", ["created_at"])
```
We might need a check for whether this index exists before trying to create it; otherwise migrations could fail. (The same applies to dropping an index.)
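The guard the commenter asks for can be sketched with stdlib `sqlite3` (the helper name `create_index_if_missing` is hypothetical; the project's real migrations use Alembic, where an inspector check or SQLite's `CREATE INDEX IF NOT EXISTS` would serve the same purpose):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (id INTEGER PRIMARY KEY, created_at TEXT)")

def create_index_if_missing(conn, name, table, column):
    # PRAGMA index_list returns (seq, name, unique, origin, partial) rows;
    # guarding creation keeps the migration re-runnable if the index exists.
    existing = {row[1] for row in conn.execute(f"PRAGMA index_list({table})")}
    if name not in existing:
        conn.execute(f"CREATE INDEX {name} ON {table} ({column})")

create_index_if_missing(conn, "ix_queries_created_at", "queries", "created_at")
create_index_if_missing(conn, "ix_queries_created_at", "queries", "created_at")  # no-op
names = [row[1] for row in conn.execute("PRAGMA index_list(queries)")]
print(names)
```

The second call is a no-op rather than an error, which is the property the migration needs; `DROP INDEX IF EXISTS` covers the downgrade side.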
```python
from cognee.infrastructure.databases.relational import get_relational_engine
from ..models.Query import Query

_LOG_ENABLED = os.getenv("COGNEE_LOG_SEARCH_HISTORY", "true").lower() in ("true", "1", "yes")
```
Should it be disabled by default? PR title mentions that it should
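If the default is flipped to disabled, as the PR title suggests, only the fallback value changes. A minimal sketch of the parsing (the helper name is hypothetical; the truthy spellings mirror the `_LOG_ENABLED` expression above):

```python
import os

def search_history_enabled(default: str = "false") -> bool:
    # Default "false" disables history logging unless explicitly opted in,
    # matching the PR title; any other spelling is treated as disabled.
    raw = os.getenv("COGNEE_LOG_SEARCH_HISTORY", default)
    return raw.lower() in ("true", "1", "yes")

os.environ.pop("COGNEE_LOG_SEARCH_HISTORY", None)
print(search_history_enabled())  # disabled by default
os.environ["COGNEE_LOG_SEARCH_HISTORY"] = "YES"
print(search_history_enabled())  # explicit opt-in
```

Case-insensitive matching means `YES`, `True`, and `1` all opt in, while unset or unrecognized values leave logging off.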
Summary
Fixes unbounded growth of `queries` and `results` tables in the relational DB (SQLite/Postgres). Reported case: 28 GB SQLite file in 9 days at ~4700 queries/day.
- `COGNEE_MAX_SEARCH_HISTORY` (default 10,000, set to 0 for unlimited)
- `COGNEE_SEARCH_HISTORY_VACUUM=true` by default
- `log_result()` and `log_query()` with zero impact when cap not reached

New env vars documented in `.env.template`.

Test plan
- `queries` and `results` tables don't exceed cap
- `COGNEE_MAX_SEARCH_HISTORY=0` disables eviction

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Chores