
fix: include node category in UUID generation to prevent Entity/Entit…#2515

Open
matdou wants to merge 2 commits into topoteretes:main from matdou:fix/entity-type-node-id-collision

Conversation

@matdou

@matdou matdou commented Mar 29, 2026

Description

Fixes #2510.

_create_type_node and _create_entity_node in expand_with_nodes_and_edges.py both call generate_node_id with the node name only, so Entity("institution") and EntityType("institution") produce the same UUID. Within a single run this goes unnoticed because deduplication uses category-suffixed keys such as uuid_type and uuid_entity. Across runs, however, only the UUID is persisted to PostgreSQL, so the second cognify run raises an EntityAlreadyExistsError (409), which breaks graph projection and returns empty search results.

The fix includes the node category in the hash:

# _create_type_node
node_id = generate_node_id(f"type:{node_type}")

# _create_entity_node
generated_node_id = generate_node_id(f"entity:{node_id}")

Acceptance Criteria

  • Entity("institution") and EntityType("institution") now produce different UUIDs
  • Running cognify twice on the same data no longer raises EntityAlreadyExistsError
  • Minimal repro confirming the collision and the fix:
from uuid import NAMESPACE_OID, uuid5

def generate_node_id(node_id):
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))

# Before fix: the type node and the entity node both hash the bare name → same UUID
assert generate_node_id("institution") == generate_node_id("institution")  # True (collision)

# After fix → different UUIDs
assert generate_node_id("type:institution") != generate_node_id("entity:institution")  # True
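
The gap between in-run and cross-run behaviour described in the Description can be sketched as follows. This is a minimal illustration, not cognee's code: `simulate`, `in_run_keys`, and `persisted_ids` are hypothetical names, and the `f"{uuid}_{category}"` key format is assumed from the `uuid_type` / `uuid_entity` wording above. `generate_node_id` is copied from the repro.

```python
from uuid import NAMESPACE_OID, uuid5


def generate_node_id(node_id: str):
    # Same normalization as the repro in the PR description.
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))


def simulate(type_input: str, entity_input: str):
    """Return (distinct in-run keys, distinct persisted IDs) for one type/entity pair."""
    type_id = generate_node_id(type_input)
    entity_id = generate_node_id(entity_input)
    # Hypothetical in-run dedup keys: UUID plus a category suffix.
    in_run_keys = {f"{type_id}_type", f"{entity_id}_entity"}
    # Hypothetical persisted view: only the bare UUID survives.
    persisted_ids = {type_id, entity_id}
    return len(in_run_keys), len(persisted_ids)


before = simulate("institution", "institution")
after = simulate("type:institution", "entity:institution")
print(before)  # (2, 1)
print(after)   # (2, 2)
```

Before the fix the two nodes look distinct in memory but collapse to a single persisted ID; with the prefixes both views agree on two nodes.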

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass (except an error in test_neptune_analytics_graph.py which is unrelated)
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

  • Refactor
    • Node identifiers now use explicit type and entity namespaces for consistent naming across the knowledge graph.
    • Edge de-duplication has been standardized to operate at the relationship level, improving duplicate detection.
    • These updates make graph keys more consistent and reduce accidental ID collisions, improving reliability of graph merging and mapping.

@pull-checklist

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@github-actions

Hello @matdou, thank you for submitting a PR! We will respond as soon as possible.

@coderabbitai
Contributor

coderabbitai bot commented Mar 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 820e0c06-b078-4da6-85a5-e500a160b8ea

📥 Commits

Reviewing files that changed from the base of the PR and between b2a154b and 9f6f234.

📒 Files selected for processing (2)
  • cognee/tests/utils/extract_entities.py
  • cognee/tests/utils/extract_relationships.py

Walkthrough

Node and edge ID generation now includes explicit type: and entity: prefixes; graph expansion, edge processing, and existing-edge retrieval were updated to use these namespaced IDs and to change deduplication logic (node/edge key computation and edge de-dup checks).

Changes

| Cohort / File(s) | Summary |
|---|---|
| Graph expansion / node & edge namespacing: cognee/modules/graph/utils/expand_with_nodes_and_edges.py | Node ID generation now prefixes type IDs with type: and entity IDs with entity: for initial inputs and ontology-resolved names; _create_type_node, _create_entity_node, and _process_graph_edges regenerate IDs using these prefixes, impacting node/edge keys, key_mapping, and name_mapping lookups. |
| Edge retrieval / dedup semantics & namespacing: cognee/modules/graph/utils/retrieve_existing_edges.py | De-duplication switched from a processed-nodes dict keyed by generated node IDs to a processed_edges set keyed by (source_id, target_id, relationship) tuples; generate_node_id calls now receive type:{...} and entity:{...} inputs so constructed edge tuples use namespaced IDs. |
| Tests, entity / relationship extraction namespacing: cognee/tests/utils/extract_entities.py, cognee/tests/utils/extract_relationships.py | Test utilities updated to generate node/type IDs with entity: and type: prefixes (generate_node_id(f"entity:{...}"), generate_node_id(f"type:{...}")), changing cache keys and the stored IDs/tuples used in test scenarios. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.43% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: adding node category prefixes to UUID generation to prevent collisions between Entity and EntityType nodes. |
| Description check | ✅ Passed | The PR description is comprehensive, including clear problem statement, solution details, acceptance criteria with code examples, and all required template sections completed with checkmarks. |
| Linked Issues check | ✅ Passed | The PR fully addresses issue #2510 by implementing the exact fix described: adding 'type:' and 'entity:' prefixes to UUID generation for distinct node IDs across categories. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to fixing the UUID collision issue: modifying generate_node_id inputs in entity/type node creation, edge processing, and test utilities to use namespaced prefixes. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)

119-123: ⚠️ Potential issue | 🔴 Critical

Ontology-validated path drops the category namespace and can reintroduce ID collisions.

Line 121 and Line 179 regenerate IDs from raw ontology names (generate_node_id(closest_class.name) / generate_node_id(start_ent_ont.name)), which bypasses the new "type:" / "entity:" namespacing added on Line 103 and Line 161. That can recreate the original Entity vs EntityType collision for matched ontology terms.

Suggested fix
@@
-        node_id = generate_node_id(closest_class.name)
+        node_id = generate_node_id(f"type:{closest_class.name}")
@@
-        generated_node_id = generate_node_id(start_ent_ont.name)
+        generated_node_id = generate_node_id(f"entity:{start_ent_ont.name}")

Also applies to: 177-180
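
To make the concern concrete, here is a small sketch showing that re-hashing a raw ontology name brings back the un-namespaced UUID for both categories. `ontology_ids` is a hypothetical helper written for this illustration and does not exist in cognee; `generate_node_id` is copied from the repro in the PR description.

```python
from uuid import NAMESPACE_OID, UUID, uuid5


def generate_node_id(node_id: str) -> UUID:
    # Copied from the repro in the PR description.
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))


def ontology_ids(ontology_name: str, namespaced: bool) -> tuple[UUID, UUID]:
    """Hypothetical helper: return (type_id, entity_id) for one matched ontology name."""
    if namespaced:
        return (
            generate_node_id(f"type:{ontology_name}"),
            generate_node_id(f"entity:{ontology_name}"),
        )
    # Un-namespaced regeneration, the pattern flagged in the review comment.
    return generate_node_id(ontology_name), generate_node_id(ontology_name)


raw_type_id, raw_entity_id = ontology_ids("Institution", namespaced=False)
scoped_type_id, scoped_entity_id = ontology_ids("Institution", namespaced=True)

collides_without_prefix = raw_type_id == raw_entity_id
collides_with_prefix = scoped_type_id == scoped_entity_id
print(collides_without_prefix, collides_with_prefix)  # True False
```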

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 119 -
123, When ontology_validated is true the code regenerates IDs using
generate_node_id(closest_class.name) (and similarly
generate_node_id(start_ent_ont.name)), which drops the "type:"/"entity:"
namespace and can reintroduce collisions; change those regenerations to produce
namespaced keys instead (e.g., call _create_node_key with the generated id and
the proper category or preserve the existing namespace from old_key) so
type_node_key and corresponding entity keys keep the "type:"/"entity:" prefix;
update the two spots that set type_node_key (using
generate_node_id(closest_class.name)) and the one that sets the entity key
(using generate_node_id(start_ent_ont.name)) to use the namespaced construction
consistent with the earlier creation at lines where _create_node_key and
generate_node_name are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py`:
- Around line 103-105: The node IDs are being created with namespace prefixes
via generate_node_id("type:..."/"entity:...") but later replaced by ontology
names (closest_class.name, start_ent_ont.name) which removes the prefix and
breaks edge key lookups; fix by preserving consistent namespacing: when you
replace node_id with ontology-derived names (closest_class.name or
start_ent_ont.name) always re-run generate_node_id with the original namespace
(e.g., generate_node_id(f"type:{closest_class.name}") or
generate_node_id(f"entity:{start_ent_ont.name}")), update any derived keys via
_create_node_key using the regenerated ID, and ensure the name_mapping entries
use the same generate_node_id output; alternatively, change
retrieve_existing_edges.py to construct node IDs with generate_node_id(node.type
+ ":" + node.id) so all code paths use the same namespaced ID format.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0bce1613-a5ee-4c5e-820a-ff42974e5681

📥 Commits

Reviewing files that changed from the base of the PR and between 5469622 and 5ff902a.

📒 Files selected for processing (1)
  • cognee/modules/graph/utils/expand_with_nodes_and_edges.py

@matdou matdou force-pushed the fix/entity-type-node-id-collision branch 2 times, most recently from f88dd16 to b7b7bdf Compare March 29, 2026 00:16
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)

39-54: ⚠️ Potential issue | 🟠 Major

Ontology-derived nodes/edges still bypass category namespacing.

Line 39 and Line 69-70 still hash raw ontology names. That means ontology class/entity names can still collide by UUID across categories, which reintroduces the same persistence-risk pattern this PR is fixing elsewhere.

Suggested fix for ontology nodes
 def _process_ontology_nodes(
@@
-    for ontology_node in ontology_nodes:
-        ont_node_id = generate_node_id(ontology_node.name)
+    for ontology_node in ontology_nodes:
         ont_node_name = generate_node_name(ontology_node.name)

         if ontology_node.category == "classes":
+            ont_node_id = generate_node_id(f"type:{ontology_node.name}")
             ont_node_key = _create_node_key(ont_node_id, "type")
@@
         elif ontology_node.category == "individuals":
+            ont_node_id = generate_node_id(f"entity:{ontology_node.name}")
             ont_node_key = _create_node_key(ont_node_id, "entity")

Please also make _process_ontology_edges category-aware so source/target IDs are generated with the same scoped convention.

Also applies to: 64-70, 103-121
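
One way to keep the ontology path category-aware is a single lookup from ontology category to ID prefix. The sketch below assumes the "classes" to "type" and "individuals" to "entity" pairing shown in the suggested diff; `CATEGORY_PREFIX` and `scoped_ontology_id` are hypothetical names, not part of cognee.

```python
from uuid import NAMESPACE_OID, UUID, uuid5

# Assumed mapping, taken from the category branches in the suggested fix above.
CATEGORY_PREFIX = {"classes": "type", "individuals": "entity"}


def generate_node_id(node_id: str) -> UUID:
    # Copied from the repro in the PR description.
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))


def scoped_ontology_id(name: str, category: str) -> UUID:
    """Hypothetical helper: hash an ontology name under its category prefix."""
    prefix = CATEGORY_PREFIX.get(category)
    if prefix is None:
        raise ValueError(f"unsupported ontology category: {category!r}")
    return generate_node_id(f"{prefix}:{name}")


class_id = scoped_ontology_id("Institution", "classes")
individual_id = scoped_ontology_id("Institution", "individuals")
print(class_id != individual_id)  # True
```

Using the same helper for both nodes and edge endpoints would keep the two code paths from drifting apart, which is the consistency the comment asks for.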

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 39 -
54, The ontology-derived node creation is currently hashing raw ontology names
and can collide across categories; change the node key/id generation to include
the ontology_node.category scope so class vs individual namespaces don't
collide: update calls around generate_node_id/generate_node_name and
_create_node_key to incorporate ontology_node.category (e.g.,
generate_node_id(ontology_node.name, category) or prefix/suffix the id/name with
ontology_node.category) and ensure added_ontology_nodes_map uses that scoped
key; likewise update _process_ontology_edges so source and target IDs are
generated with the same category-aware convention (use the same helper signature
or naming scheme used for nodes) to keep edge source/target IDs consistent with
node keys like in _create_node_key and avoid cross-category UUID collisions
(affecting functions generate_node_id, generate_node_name, _create_node_key,
added_ontology_nodes_map, and _process_ontology_edges).
♻️ Duplicate comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)

161-179: ⚠️ Potential issue | 🔴 Critical

Edge endpoints still use unscoped UUIDs, so relationships can target different nodes.

Line 161/Line 179 now create entity IDs from "entity:<id>", but Line 278/Line 279 still hash raw endpoint IDs. Those UUIDs are different, so edge linkage and dedup can drift.

Suggested fix
 def _process_graph_edges(
     graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list
 ) -> None:
@@
-        source_node_id = generate_node_id(source_id)
-        target_node_id = generate_node_id(target_id)
+        source_node_id = generate_node_id(f"entity:{source_id}")
+        target_node_id = generate_node_id(f"entity:{target_id}")

Also align cognee/modules/graph/utils/retrieve_existing_edges.py (Line 52-53 in the provided snippet) to the same prefix scheme, otherwise preloaded dedup keys stay inconsistent.

Also applies to: 275-279
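
The endpoint mismatch can be reproduced in isolation. In this sketch `nodes` stands in for whatever map the real code uses to look up created entity nodes, and `edge_resolves` is a hypothetical helper; only `generate_node_id` comes from the PR description.

```python
from uuid import NAMESPACE_OID, UUID, uuid5


def generate_node_id(node_id: str) -> UUID:
    # Copied from the repro in the PR description.
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))


# Entity nodes created with the new "entity:" prefix (stand-in for the real node map).
nodes = {generate_node_id(f"entity:{name}"): name for name in ["Alice", "Acme Corp"]}


def edge_resolves(source_id: str, target_id: str, scoped: bool) -> bool:
    """Hypothetical check: do both edge endpoints point at existing nodes?"""
    if scoped:
        source = generate_node_id(f"entity:{source_id}")
        target = generate_node_id(f"entity:{target_id}")
    else:
        # Raw endpoint hashing, the pattern flagged in the review.
        source = generate_node_id(source_id)
        target = generate_node_id(target_id)
    return source in nodes and target in nodes


unscoped_ok = edge_resolves("Alice", "Acme Corp", scoped=False)
scoped_ok = edge_resolves("Alice", "Acme Corp", scoped=True)
print(unscoped_ok, scoped_ok)  # False True
```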

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 161 -
179, The edge endpoints are being hashed from raw endpoint IDs while nodes are created
with the "entity:<id>" prefix (via generate_node_id and _create_node_key), which
causes mismatched dedup keys; update any code that builds edge endpoint keys
(the logic that currently hashes raw endpoint IDs) to first scope endpoint IDs
the same way as nodes (e.g., call generate_node_id(f"entity:{endpoint_id}") and
then _create_node_key(..., "entity") before using added_nodes_map/key_mapping or
storing edge keys), and apply the identical change in
cognee/modules/graph/utils/retrieve_existing_edges.py so preloaded dedup keys
use the same "entity:<id>" prefix and lookup with added_nodes_map/key_mapping to
ensure consistent linkage and dedup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d7e0a7fb-c1ed-4d28-9e60-415322309352

📥 Commits

Reviewing files that changed from the base of the PR and between 5ff902a and f88dd16.

📒 Files selected for processing (1)
  • cognee/modules/graph/utils/expand_with_nodes_and_edges.py

@matdou matdou force-pushed the fix/entity-type-node-id-collision branch from b7b7bdf to 788659a Compare March 29, 2026 00:27
Contributor

@coderabbitai coderabbitai bot left a comment


Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/retrieve_existing_edges.py (1)

52-62: ⚠️ Potential issue | 🟠 Major

Prefixing fixed the UUID collision, but the dedupe key still drops chunk-scoped edge checks.

Line 55 and Line 59 still key processed_nodes by node UUID only. When the same entity or type appears in more than one data_chunk, this suppresses later (chunk_id, ..., "exists_in") and (chunk_id, ..., "mentioned_in") checks, so has_edges() can miss already-persisted chunk-specific edges. It also skips additional is_a checks if the same entity is seen with another type. Please dedupe on the full edge tuple instead of the node ID.

Suggested fix
-    processed_nodes = {}
+    processed_edges = set()
@@
-            if str(type_node_id) not in processed_nodes:
-                type_node_edges.append((data_chunk.id, type_node_id, "exists_in"))
-                processed_nodes[str(type_node_id)] = True
+            type_edge = (data_chunk.id, type_node_id, "exists_in")
+            if type_edge not in processed_edges:
+                type_node_edges.append(type_edge)
+                processed_edges.add(type_edge)
@@
-            if str(entity_node_id) not in processed_nodes:
-                entity_node_edges.append((data_chunk.id, entity_node_id, "mentioned_in"))
-                type_entity_edges.append((entity_node_id, type_node_id, "is_a"))
-                processed_nodes[str(entity_node_id)] = True
+            entity_edge = (data_chunk.id, entity_node_id, "mentioned_in")
+            if entity_edge not in processed_edges:
+                entity_node_edges.append(entity_edge)
+                processed_edges.add(entity_edge)
+
+            type_entity_edge = (entity_node_id, type_node_id, "is_a")
+            if type_entity_edge not in processed_edges:
+                type_entity_edges.append(type_entity_edge)
+                processed_edges.add(type_entity_edge)
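
The effect of the two dedupe keys can be compared side by side. The sketch below is a simplified stand-in that tracks only "mentioned_in" edges and uses plain strings for chunk IDs; `edges_by_node` and `edges_by_tuple` are hypothetical names, not functions in retrieve_existing_edges.py.

```python
from uuid import NAMESPACE_OID, UUID, uuid5


def generate_node_id(node_id: str) -> UUID:
    # Copied from the repro in the PR description.
    return uuid5(NAMESPACE_OID, node_id.lower().replace(" ", "_").replace("'", ""))


def edges_by_node(mentions):
    """Dedupe on node ID only: later chunks for the same entity are skipped."""
    processed_nodes = {}
    edges = []
    for chunk_id, entity in mentions:
        entity_id = generate_node_id(f"entity:{entity}")
        if str(entity_id) not in processed_nodes:
            edges.append((chunk_id, entity_id, "mentioned_in"))
            processed_nodes[str(entity_id)] = True
    return edges


def edges_by_tuple(mentions):
    """Dedupe on the full edge tuple: one candidate edge per (chunk, entity) pair."""
    processed_edges = set()
    edges = []
    for chunk_id, entity in mentions:
        edge = (chunk_id, generate_node_id(f"entity:{entity}"), "mentioned_in")
        if edge not in processed_edges:
            edges.append(edge)
            processed_edges.add(edge)
    return edges


# The same entity appears in two chunks, and twice within the first chunk.
mentions = [("chunk-1", "Alice"), ("chunk-2", "Alice"), ("chunk-1", "Alice")]
node_keyed = edges_by_node(mentions)
tuple_keyed = edges_by_tuple(mentions)
print(len(node_keyed), len(tuple_keyed))  # 1 2
```

Keying on the node ID drops the chunk-2 edge entirely, while keying on the tuple still removes the true duplicate within chunk-1.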
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/modules/graph/utils/retrieve_existing_edges.py` around lines 52 - 62,
The processed_nodes deduplication is keyed only by node IDs (type_node_id and
entity_node_id), which causes the same entity/type to be skipped across
different data_chunks. Instead of tracking just the node IDs in processed_nodes,
change the dedupe key to track the full edge tuple. For the type_node edges
check, use the tuple (data_chunk.id, type_node_id, "exists_in") as the key; for
the entity_node_edges check, use (data_chunk.id, entity_node_id, "mentioned_in")
as the key; and for the type_entity_edges check, use (entity_node_id,
type_node_id, "is_a") as the key. This ensures that chunk-specific edge
relationships are properly deduplicated rather than dropping them based on node
uniqueness alone.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9e80345b-157c-49fc-a004-ae6a8303ae3a

📥 Commits

Reviewing files that changed from the base of the PR and between f88dd16 and b7b7bdf.

📒 Files selected for processing (2)
  • cognee/modules/graph/utils/expand_with_nodes_and_edges.py
  • cognee/modules/graph/utils/retrieve_existing_edges.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/modules/graph/utils/expand_with_nodes_and_edges.py

@matdou matdou force-pushed the fix/entity-type-node-id-collision branch from 788659a to b2a154b Compare March 29, 2026 00:29

@edmbachbach-bot edmbachbach-bot left a comment


This issue is solved.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

[Bug]: EntityAlreadyExistsError due to duplicate node ID with different types

3 participants