
fix: disable search history by default to limit the potential impact of table size growth#2549

Open
Vasilije1990 wants to merge 1 commit into dev from fix/search-history-eviction

Conversation

Vasilije1990 (Contributor) commented Apr 4, 2026

Summary

Fixes unbounded growth of queries and results tables in the relational DB (SQLite/Postgres). Reported case: 28 GB SQLite file in 9 days at ~4700 queries/day.

  • FIFO eviction: caps rows per table at COGNEE_MAX_SEARCH_HISTORY (default 10,000, set to 0 for unlimited)
  • Batched: eviction runs every ~1000 writes (1/10th of cap), not on every search call
  • SQLite VACUUM: optional auto-VACUUM after eviction to reclaim disk space (COGNEE_SEARCH_HISTORY_VACUUM=true by default)
  • Wired into log_result() and log_query(), with zero impact while the cap has not been reached

New env vars documented in .env.template.
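The batched eviction described above can be sketched roughly as follows. This is an illustrative sketch, not the PR's code: the constant and function names mirror those quoted later in the review (`MAX_ROWS`, the interval), and the floor is set to 1 per the reviewer's suggestion so small caps are not overshot.

```python
import os

# Illustrative sketch of the batched-eviction bookkeeping.
# Default cap of 10,000 rows; 0 means unlimited (eviction disabled).
MAX_ROWS = int(os.getenv("COGNEE_MAX_SEARCH_HISTORY", "10000"))

# Run eviction every ~1/10th of the cap, but at least every write,
# so a small configured cap is never overshot by the batching.
EVICTION_INTERVAL = max(MAX_ROWS // 10, 1) if MAX_ROWS > 0 else 0

_write_counter = 0


def should_evict() -> bool:
    """Return True on every EVICTION_INTERVAL-th write; a 0 cap disables eviction."""
    global _write_counter
    if MAX_ROWS <= 0:
        return False  # unlimited history: never evict
    _write_counter += 1
    if _write_counter < EVICTION_INTERVAL:
        return False
    _write_counter = 0
    return True
```

With the default cap of 10,000, eviction fires on every 1,000th logged search rather than on each call.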

Test plan

  • CI passes
  • Run 100+ searches, verify queries and results tables don't exceed cap
  • Verify COGNEE_MAX_SEARCH_HISTORY=0 disables eviction
  • Verify SQLite file size decreases after VACUUM

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added an optional (commented) configuration to enable/disable search history logging.
    • Search logging now records only user-visible text from results (not full payloads), reducing persisted detail.
  • Chores

    • Improved database performance by adding indexes on search history timestamps and optimizing user-related lookups.
    • When logging is disabled, search events avoid persistence to the database.

pull-checklist bot commented Apr 4, 2026

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

coderabbitai bot (Contributor) commented Apr 4, 2026

Walkthrough

Add optional, commented search-history config to the env template; introduce runtime flag to short-circuit search query/result persistence; add DB indexes on queries.created_at, results.created_at, and queries.user_id; adjust search result serialization to persist only user-visible completions.

Changes

Cohort / File(s) Summary
Configuration
.env.template
Added a commented “Search History” block introducing COGNEE_LOG_SEARCH_HISTORY (commented; optional toggle for search logging).
Logging operations
cognee/modules/search/operations/log_query.py, cognee/modules/search/operations/log_result.py
Both functions now consult a module-level _LOG_ENABLED derived from COGNEE_LOG_SEARCH_HISTORY; when disabled they return early and skip DB session persistence. log_query now constructs the Query before opening a session and only persists/commits when logging is enabled.
Search method serialization
cognee/modules/search/methods/search.py
Removed fastapi.encoders.jsonable_encoder; build a completions list of user-visible text (payload.completion or payload.context) and pass JSON-dumped completions to log_result instead of the full payload.
Database models
cognee/modules/search/models/Query.py, cognee/modules/search/models/Result.py
Added index=True to Query.user_id and to created_at columns on both Query and Result.
Database migration
cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
New Alembic migration creating indexes ix_queries_created_at and ix_results_created_at; downgrade() drops them.
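The serialization change summarized above (persist only user-visible text, not the full payload) can be sketched like this. The helper name is hypothetical, and plain dicts stand in for `SearchResultPayload` objects to keep the example self-contained; the real code reads `payload.completion` / `payload.context` attributes.

```python
import json


def extract_completions(payloads: list) -> str:
    """Collect only the user-visible text (completion, else context)
    from each search result and JSON-dump it for log_result().
    Sketch only: dicts stand in for SearchResultPayload objects."""
    completions = []
    for payload in payloads:
        text = payload.get("completion") or payload.get("context")
        if text is not None:
            completions.append(text)
    return json.dumps(completions)
```

This keeps a 50-100 KB serialized graph response out of `Result.value` and stores only the few hundred bytes the history UI actually displays.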

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title accurately describes the main change, disabling search history logging by default to address unbounded table growth, which aligns with the PR's core objective.
  • Description check ✅ Passed: the description covers the problem, proposed solution (FIFO eviction with cap), implementation details (batched execution, VACUUM), and test plan, but lacks explicit proof of testing, and some checklist items are unchecked.


Vasilije1990 force-pushed the fix/search-history-eviction branch from e7e9f93 to b62b91b on April 4, 2026 at 16:33
Vasilije1990 changed the title from "fix: add FIFO eviction for search history tables to prevent DB bloat" to "fix: disable search history by default to limit the potential impact of table size growth" on Apr 4, 2026
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
cognee/modules/search/operations/log_query.py (1)

19-23: Close the write session before running eviction.

maybe_evict() opens its own sessions. Keeping this async with open until cleanup finishes just prolongs the caller's connection lifetime, and log_result() already uses the cleaner ordering.

♻️ Suggested change
         await session.commit()
-
-        await maybe_evict()
-
-        return query
+    await maybe_evict()
+    return query
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/operations/log_query.py` around lines 19 - 23, The code
currently calls await maybe_evict() while the write session (session) is still
open; close the session before running eviction to avoid holding the caller's DB
connection. Modify the function (the async block that uses session—e.g., in
log_query or the function around session.commit()) so you call await
session.commit(), exit the async with session block (or explicitly close the
session), and only then call await maybe_evict(). Keep the return of query after
eviction if needed by the caller (or return before eviction if following
log_result ordering), but ensure maybe_evict() is invoked outside the open
session context.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a507d3db-35d1-4461-91db-310bfdc214bc

📥 Commits

Reviewing files that changed from the base of the PR and between 472a3d4 and e7e9f93.

📒 Files selected for processing (4)
  • .env.template
  • cognee/modules/search/operations/evict_old_history.py
  • cognee/modules/search/operations/log_query.py
  • cognee/modules/search/operations/log_result.py

Comment on lines +24 to +25
# Only run eviction every N writes to avoid overhead on every search call
_EVICTION_INTERVAL = max(MAX_ROWS // 10, 100) if MAX_ROWS > 0 else 0

⚠️ Potential issue | 🟠 Major

The 100-write floor breaks small configured caps.

With COGNEE_MAX_SEARCH_HISTORY=10, eviction will not run until 100 writes, so the table can overshoot the requested cap by almost 10x. The interval should stay proportional to the configured cap instead of exceeding it.

🐛 Suggested fix
-_EVICTION_INTERVAL = max(MAX_ROWS // 10, 100) if MAX_ROWS > 0 else 0
+_EVICTION_INTERVAL = max(MAX_ROWS // 10, 1) if MAX_ROWS > 0 else 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/operations/evict_old_history.py` around lines 24 - 25,
The eviction interval constant _EVICTION_INTERVAL currently uses a hard floor of
100 which can exceed small configured caps (MAX_ROWS) and allow overshoot;
update its calculation so the minimum interval is 1 write instead of 100 (e.g.,
_EVICTION_INTERVAL = max(MAX_ROWS // 10, 1) if MAX_ROWS > 0 else 0) so eviction
remains proportional to MAX_ROWS and never larger than the cap for small
configurations.

Comment on lines +29 to +45
async def maybe_evict():
    """Run eviction if the write counter has reached the interval threshold."""
    global _write_counter

    if MAX_ROWS <= 0:
        return  # unlimited — no eviction

    _write_counter += 1
    if _write_counter < _EVICTION_INTERVAL:
        return

    _write_counter = 0
    await _evict_table(Query, "queries")
    await _evict_table(Result, "results")

    if VACUUM_ENABLED:
        await _vacuum_if_sqlite()

⚠️ Potential issue | 🟠 Major

Do not propagate cleanup failures after commit.

log_query() and log_result() commit before awaiting this helper. If _evict_table() or _vacuum_if_sqlite() raises here, callers get an error even though the history row was already persisted. This path should be best-effort and logged.

🛡️ Suggested change
 async def maybe_evict():
     """Run eviction if the write counter has reached the interval threshold."""
     global _write_counter
@@
-    _write_counter = 0
-    await _evict_table(Query, "queries")
-    await _evict_table(Result, "results")
-
-    if VACUUM_ENABLED:
-        await _vacuum_if_sqlite()
+    _write_counter = 0
+    try:
+        await _evict_table(Query, "queries")
+        await _evict_table(Result, "results")
+
+        if VACUUM_ENABLED:
+            await _vacuum_if_sqlite()
+    except Exception:
+        logger.exception("Search history eviction failed")
As per coding guidelines, "Prefer explicit, structured error handling in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/operations/evict_old_history.py` around lines 29 - 45,
maybe_evict currently propagates exceptions from _evict_table and
_vacuum_if_sqlite which can surface to callers after log_query/log_result have
already committed; change maybe_evict to perform best-effort cleanup by wrapping
the eviction and vacuum calls (calls to _evict_table for Query and Result and to
_vacuum_if_sqlite) in a try/except that catches Exception, logs the failure with
context (including which table/action failed) and does not re-raise so callers
aren’t impacted by cleanup errors.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6c10ca81-9d4c-43fc-88a1-7c8fb391f868

📥 Commits

Reviewing files that changed from the base of the PR and between e7e9f93 and b62b91b.

📒 Files selected for processing (6)
  • .env.template
  • cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
  • cognee/modules/search/models/Query.py
  • cognee/modules/search/models/Result.py
  • cognee/modules/search/operations/log_query.py
  • cognee/modules/search/operations/log_result.py
✅ Files skipped from review due to trivial changes (2)
  • cognee/modules/search/models/Result.py
  • cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • cognee/modules/search/operations/log_result.py
  • cognee/modules/search/operations/log_query.py
  • .env.template

 text = Column(String)
 query_type = Column(String)
-user_id = Column(UUID)
+user_id = Column(UUID, index=True)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether an index migration for queries.user_id exists in this PR branch.
fd -i 'b2c3d4e5f6a7_add_search_history_indexes.py' -x sed -n '1,140p' {}
rg -n -C2 'user_id\s*=\s*Column\(UUID,\s*index=True\)|create_index\("ix_queries_user_id"' \
  cognee/modules/search/models/Query.py cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py

Repository: topoteretes/cognee

Length of output: 922


Add a migration for the queries.user_id index to maintain schema consistency.

Line 14 adds index=True to the user_id column, but the migration only creates indexes for created_at. Databases upgraded via migrations will be missing the ix_queries_user_id index, causing a mismatch between the ORM schema and the actual database schema.

Proposed migration patch
 def upgrade() -> None:
+    op.create_index("ix_queries_user_id", "queries", ["user_id"])
     op.create_index("ix_queries_created_at", "queries", ["created_at"])
     op.create_index("ix_results_created_at", "results", ["created_at"])


 def downgrade() -> None:
     op.drop_index("ix_results_created_at", "results")
     op.drop_index("ix_queries_created_at", "queries")
+    op.drop_index("ix_queries_user_id", "queries")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/models/Query.py` at line 14, Add an Alembic migration
to create the missing index for the queries.user_id column so the DB matches the
ORM: in the new revision use op.create_index('ix_queries_user_id', 'queries',
['user_id']) in upgrade and op.drop_index('ix_queries_user_id',
table_name='queries') in downgrade; reference the model change to user_id
(Column(UUID, index=True)) and ensure the migration message and revision
metadata are set appropriately so CI/migrations pick up the new index.

The queries and results tables grow without bound in long-running
deployments (~3 GB/day at 4700 queries/day). Root cause: Result.value
stored full serialized search responses (50-100 KB each) — graph
edges, node properties, metadata — for a UI history feature that only
needs the completion text.

Changes:
- Log only completion/context text to Result.value instead of the full
  serialized SearchResultPayload. The raw graph data is ephemeral and
  already lives in the graph DB.
- COGNEE_LOG_SEARCH_HISTORY=false disables logging entirely
  (recommended for daemon/automated deployments)
- Add created_at index on queries and results tables (Alembic migration)
- Add user_id index on queries table (was missing)
- Remove unused fastapi.encoders import from search.py

For existing bloated deployments: set COGNEE_LOG_SEARCH_HISTORY=false,
DELETE FROM results, DELETE FROM queries, then VACUUM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vasilije1990 force-pushed the fix/search-history-eviction branch from b62b91b to 94b0cf1 on April 4, 2026 at 18:21
coderabbitai bot (Contributor) left a comment

🧹 Nitpick comments (2)
cognee/modules/search/operations/log_query.py (1)

9-24: Add a docstring describing the function's behavior and the logging toggle.

Per coding guidelines, undocumented function definitions are considered incomplete. A brief docstring would clarify the dual behavior (persist vs. skip) based on COGNEE_LOG_SEARCH_HISTORY.

📝 Suggested docstring
 async def log_query(query_text: str, query_type: str, user_id: UUID) -> Query:
+    """Create and optionally persist a Query record for search history.
+
+    When COGNEE_LOG_SEARCH_HISTORY is disabled, returns an in-memory Query
+    without database persistence. The returned Query.id is still a valid UUID
+    for correlation purposes within the same request.
+    """
     query = Query(
         text=query_text,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/operations/log_query.py` around lines 9 - 24, Add a
brief docstring to the async function log_query explaining its behavior: it
constructs a Query (using Query) from query_text, query_type, and user_id and
either persists it to the database via get_relational_engine() when the logging
toggle (_LOG_ENABLED, controlled by COGNEE_LOG_SEARCH_HISTORY) is enabled or
returns the in-memory Query without persisting when logging is disabled; mention
that it returns a Query and note side effects (database commit) only occur when
_LOG_ENABLED is true.
cognee/modules/search/methods/search.py (1)

127-133: Consider whether heterogeneous types in completions need flattening.

Per SearchResultPayload (context snippet 1, line 19), completion can be Union[str, List[str], List[dict]] and context can be Union[str, List[str]]. The resulting completions list may contain a mix of strings and nested lists, which could complicate downstream parsing or analysis of the logged data.

If a uniform structure is desirable for log analysis, consider flattening or normalizing entries. However, this is not critical for the current use case of limiting DB bloat.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/search/methods/search.py` around lines 127 - 133, The
completions list in search.py collects payload.completion and payload.context
which can be str, List[str], or List[dict] (per SearchResultPayload), producing
heterogeneous and nested entries that complicate downstream parsing; update the
collection logic in the function containing this diff to normalize values (e.g.,
create a helper like normalize_completion) so each appended entry is a uniform
type: flatten nested lists into individual strings (or JSON-serialized dicts)
and convert non-string dict entries to a stable string form before appending;
ensure you handle both payload.completion and payload.context branches and
preserve order while avoiding excessive expansion that would bloat the DB.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e81dc48b-926c-4b3a-abab-aaceda340147

📥 Commits

Reviewing files that changed from the base of the PR and between b62b91b and 94b0cf1.

📒 Files selected for processing (7)
  • .env.template
  • cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py
  • cognee/modules/search/methods/search.py
  • cognee/modules/search/models/Query.py
  • cognee/modules/search/models/Result.py
  • cognee/modules/search/operations/log_query.py
  • cognee/modules/search/operations/log_result.py
✅ Files skipped from review due to trivial changes (1)
  • .env.template
🚧 Files skipped from review as they are similar to previous changes (4)
  • cognee/modules/search/models/Query.py
  • cognee/modules/search/operations/log_result.py
  • cognee/modules/search/models/Result.py
  • cognee/alembic/versions/b2c3d4e5f6a7_add_search_history_indexes.py



def upgrade() -> None:
    op.create_index("ix_queries_created_at", "queries", ["created_at"])
Collaborator left a comment

We might need to check whether this index exists before trying to create it; otherwise migrations could fail. (The same applies to dropping an index.)
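One way to make such a migration idempotent, sketched for SQLite via the `sqlite_master` catalog. The helper name is hypothetical; inside the Alembic migration it would guard the calls, e.g. `if not index_exists(bind, "ix_queries_created_at"): op.create_index(...)`, and a dialect-agnostic variant could instead use `sqlalchemy.inspect(op.get_bind()).get_indexes(table)`.

```python
import sqlite3


def index_exists(conn: sqlite3.Connection, index_name: str) -> bool:
    """Return True if the named index already exists (SQLite).
    Hypothetical helper: queries the sqlite_master catalog, so a
    re-run of the migration can skip create_index/drop_index safely."""
    row = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'index' AND name = ?",
        (index_name,),
    ).fetchone()
    return row is not None
```

For Postgres the same check would query `pg_indexes`; guarding both `upgrade()` and `downgrade()` this way keeps the migration re-runnable.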

from cognee.infrastructure.databases.relational import get_relational_engine
from ..models.Query import Query

_LOG_ENABLED = os.getenv("COGNEE_LOG_SEARCH_HISTORY", "true").lower() in ("true", "1", "yes")
Collaborator left a comment

Should it be disabled by default? The PR title says it should be.
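If the intent is indeed disabled-by-default, the fix would presumably just flip the fallback in the snippet above, along these lines (variable name kept from the quoted code; this is the reviewer's suggested behavior, not the shipped default):

```python
import os

# Hypothetical variant with the default flipped to "false" to match the
# PR title; the quoted code above currently defaults to "true".
LOG_ENABLED = os.getenv("COGNEE_LOG_SEARCH_HISTORY", "false").lower() in ("true", "1", "yes")
```

With this default, search history is only persisted when the operator opts in via the env var.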
