fix: include node category in UUID generation to prevent Entity/Entit…#2515
fix: include node category in UUID generation to prevent Entity/Entit…#2515matdou wants to merge 2 commits intotopoteretes:mainfrom
Conversation
Please make sure all the checkboxes are checked:
|
|
Hello @matdou, thank you for submitting a PR! We will respond as soon as possible. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
WalkthroughNode and edge ID generation now includes explicit Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
119-123:⚠️ Potential issue | 🔴 CriticalOntology-validated path drops the category namespace and can reintroduce ID collisions.
Line 121 and Line 179 regenerate IDs from raw ontology names (
generate_node_id(closest_class.name)/generate_node_id(start_ent_ont.name)), which bypasses the new"type:"/"entity:"namespacing added on Line 103 and Line 161. That can recreate the original Entity vs EntityType collision for matched ontology terms.Suggested fix
@@ - node_id = generate_node_id(closest_class.name) + node_id = generate_node_id(f"type:{closest_class.name}") @@ - generated_node_id = generate_node_id(start_ent_ont.name) + generated_node_id = generate_node_id(f"entity:{start_ent_ont.name}")Also applies to: 177-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 119 - 123, When ontology_validated is true the code regenerates IDs using generate_node_id(closest_class.name) (and similarly generate_node_id(start_ent_ont.name)), which drops the "type:"/"entity:" namespace and can reintroduce collisions; change those regenerations to produce namespaced keys instead (e.g., call _create_node_key with the generated id and the proper category or preserve the existing namespace from old_key) so type_node_key and corresponding entity keys keep the "type:"/"entity:" prefix; update the two spots that set type_node_key (using generate_node_id(closest_class.name)) and the one that sets the entity key (using generate_node_id(start_ent_ont.name)) to use the namespaced construction consistent with the earlier creation at lines where _create_node_key and generate_node_name are used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py`:
- Around line 103-105: The node IDs are being created with namespace prefixes
via generate_node_id("type:..."/"entity:...") but later replaced by ontology
names (closest_class.name, start_ent_ont.name) which removes the prefix and
breaks edge key lookups; fix by preserving consistent namespacing: when you
replace node_id with ontology-derived names (closest_class.name or
start_ent_ont.name) always re-run generate_node_id with the original namespace
(e.g., generate_node_id(f"type:{closest_class.name}") or
generate_node_id(f"entity:{start_ent_ont.name}")), update any derived keys via
_create_node_key using the regenerated ID, and ensure the name_mapping entries
use the same generate_node_id output; alternatively, change
retrieve_existing_edges.py to construct node IDs with generate_node_id(node.type
+ ":" + node.id) so all code paths use the same namespaced ID format.
---
Outside diff comments:
In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py`:
- Around line 119-123: When ontology_validated is true the code regenerates IDs
using generate_node_id(closest_class.name) (and similarly
generate_node_id(start_ent_ont.name)), which drops the "type:"/"entity:"
namespace and can reintroduce collisions; change those regenerations to produce
namespaced keys instead (e.g., call _create_node_key with the generated id and
the proper category or preserve the existing namespace from old_key) so
type_node_key and corresponding entity keys keep the "type:"/"entity:" prefix;
update the two spots that set type_node_key (using
generate_node_id(closest_class.name)) and the one that sets the entity key
(using generate_node_id(start_ent_ont.name)) to use the namespaced construction
consistent with the earlier creation at lines where _create_node_key and
generate_node_name are used.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0bce1613-a5ee-4c5e-820a-ff42974e5681
📒 Files selected for processing (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py
f88dd16 to
b7b7bdf
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
39-54:⚠️ Potential issue | 🟠 MajorOntology-derived nodes/edges still bypass category namespacing.
Line 39 and Line 69-70 still hash raw ontology names. That means ontology class/entity names can still collide by UUID across categories, which reintroduces the same persistence-risk pattern this PR is fixing elsewhere.
Suggested fix for ontology nodes
def _process_ontology_nodes( @@ - for ontology_node in ontology_nodes: - ont_node_id = generate_node_id(ontology_node.name) + for ontology_node in ontology_nodes: ont_node_name = generate_node_name(ontology_node.name) if ontology_node.category == "classes": + ont_node_id = generate_node_id(f"type:{ontology_node.name}") ont_node_key = _create_node_key(ont_node_id, "type") @@ elif ontology_node.category == "individuals": + ont_node_id = generate_node_id(f"entity:{ontology_node.name}") ont_node_key = _create_node_key(ont_node_id, "entity")Please also make
_process_ontology_edgescategory-aware so source/target IDs are generated with the same scoped convention.Also applies to: 64-70, 103-121
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 39 - 54, The ontology-derived node creation is currently hashing raw ontology names and can collide across categories; change the node key/id generation to include the ontology_node.category scope so class vs individual namespaces don't collide: update calls around generate_node_id/generate_node_name and _create_node_key to incorporate ontology_node.category (e.g., generate_node_id(ontology_node.name, category) or prefix/suffix the id/name with ontology_node.category) and ensure added_ontology_nodes_map uses that scoped key; likewise update _process_ontology_edges so source and target IDs are generated with the same category-aware convention (use the same helper signature or naming scheme used for nodes) to keep edge source/target IDs consistent with node keys like in _create_node_key and avoid cross-category UUID collisions (affecting functions generate_node_id, generate_node_name, _create_node_key, added_ontology_nodes_map, and _process_ontology_edges).
♻️ Duplicate comments (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py (1)
161-179:⚠️ Potential issue | 🔴 CriticalEdge endpoints still use unscoped UUIDs, so relationships can target different nodes.
Line 161/Line 179 now create entity IDs from
"entity:<id>", but Line 278/Line 279 still hash raw endpoint IDs. Those UUIDs are different, so edge linkage and dedup can drift.Suggested fix
def _process_graph_edges( graph: KnowledgeGraph, name_mapping: dict, existing_edges_map: dict, relationships: list ) -> None: @@ - source_node_id = generate_node_id(source_id) - target_node_id = generate_node_id(target_id) + source_node_id = generate_node_id(f"entity:{source_id}") + target_node_id = generate_node_id(f"entity:{target_id}")Also align
cognee/modules/graph/utils/retrieve_existing_edges.py(Line 52-53 in the provided snippet) to the same prefix scheme, otherwise preloaded dedup keys stay inconsistent.Also applies to: 275-279
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py` around lines 161 - 179, The edge endpoints are being hashed using raw UUIDs while nodes are created with the "entity:<id>" prefix (via generate_node_id and _create_node_key), which causes mismatched dedup keys; update any code that builds edge endpoint keys (the logic that currently hashes raw endpoint IDs) to first scope endpoint IDs the same way as nodes (e.g., call generate_node_id(f"entity:{endpoint_id}") and then _create_node_key(..., "entity") before using added_nodes_map/key_mapping or storing edge keys), and apply the identical change in cognee/modules/graph/utils/retrieve_existing_edges.py so preloaded dedup keys use the same "entity:<id>" prefix and lookup with added_nodes_map/key_mapping to ensure consistent linkage and dedup.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py`:
- Around line 39-54: The ontology-derived node creation is currently hashing raw
ontology names and can collide across categories; change the node key/id
generation to include the ontology_node.category scope so class vs individual
namespaces don't collide: update calls around
generate_node_id/generate_node_name and _create_node_key to incorporate
ontology_node.category (e.g., generate_node_id(ontology_node.name, category) or
prefix/suffix the id/name with ontology_node.category) and ensure
added_ontology_nodes_map uses that scoped key; likewise update
_process_ontology_edges so source and target IDs are generated with the same
category-aware convention (use the same helper signature or naming scheme used
for nodes) to keep edge source/target IDs consistent with node keys like in
_create_node_key and avoid cross-category UUID collisions (affecting functions
generate_node_id, generate_node_name, _create_node_key,
added_ontology_nodes_map, and _process_ontology_edges).
---
Duplicate comments:
In `@cognee/modules/graph/utils/expand_with_nodes_and_edges.py`:
- Around line 161-179: The edge endpoints are being hashed using raw UUIDs while
nodes are created with the "entity:<id>" prefix (via generate_node_id and
_create_node_key), which causes mismatched dedup keys; update any code that
builds edge endpoint keys (the logic that currently hashes raw endpoint IDs) to
first scope endpoint IDs the same way as nodes (e.g., call
generate_node_id(f"entity:{endpoint_id}") and then _create_node_key(...,
"entity") before using added_nodes_map/key_mapping or storing edge keys), and
apply the identical change in
cognee/modules/graph/utils/retrieve_existing_edges.py so preloaded dedup keys
use the same "entity:<id>" prefix and lookup with added_nodes_map/key_mapping to
ensure consistent linkage and dedup.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: d7e0a7fb-c1ed-4d28-9e60-415322309352
📒 Files selected for processing (1)
cognee/modules/graph/utils/expand_with_nodes_and_edges.py
b7b7bdf to
788659a
Compare
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cognee/modules/graph/utils/retrieve_existing_edges.py (1)
52-62:⚠️ Potential issue | 🟠 MajorPrefixing fixed the UUID collision, but the dedupe key still drops chunk-scoped edge checks.
Line 55 and Line 59 still key
processed_nodesby node UUID only. When the same entity or type appears in more than onedata_chunk, this suppresses later(chunk_id, ..., "exists_in")and(chunk_id, ..., "mentioned_in")checks, sohas_edges()can miss already-persisted chunk-specific edges. It also skips additionalis_achecks if the same entity is seen with another type. Please dedupe on the full edge tuple instead of the node ID.Suggested fix
- processed_nodes = {} + processed_edges = set() @@ - if str(type_node_id) not in processed_nodes: - type_node_edges.append((data_chunk.id, type_node_id, "exists_in")) - processed_nodes[str(type_node_id)] = True + type_edge = (data_chunk.id, type_node_id, "exists_in") + if type_edge not in processed_edges: + type_node_edges.append(type_edge) + processed_edges.add(type_edge) @@ - if str(entity_node_id) not in processed_nodes: - entity_node_edges.append((data_chunk.id, entity_node_id, "mentioned_in")) - type_entity_edges.append((entity_node_id, type_node_id, "is_a")) - processed_nodes[str(entity_node_id)] = True + entity_edge = (data_chunk.id, entity_node_id, "mentioned_in") + if entity_edge not in processed_edges: + entity_node_edges.append(entity_edge) + processed_edges.add(entity_edge) + + type_entity_edge = (entity_node_id, type_node_id, "is_a") + if type_entity_edge not in processed_edges: + type_entity_edges.append(type_entity_edge) + processed_edges.add(type_entity_edge)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cognee/modules/graph/utils/retrieve_existing_edges.py` around lines 52 - 62, The processed_nodes deduplication is keyed only by node IDs (type_node_id and entity_node_id), which causes the same entity/type to be skipped across different data_chunks. Instead of tracking just the node IDs in processed_nodes, change the dedupe key to track the full edge tuple. For the type_node edges check, use the tuple (data_chunk.id, type_node_id, "exists_in") as the key; for the entity_node_edges check, use (data_chunk.id, entity_node_id, "mentioned_in") as the key; and for the type_entity_edges check, use (entity_node_id, type_node_id, "is_a") as the key. This ensures that chunk-specific edge relationships are properly deduplicated rather than dropping them based on node uniqueness alone.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cognee/modules/graph/utils/retrieve_existing_edges.py`:
- Around line 52-62: The processed_nodes deduplication is keyed only by node IDs
(type_node_id and entity_node_id), which causes the same entity/type to be
skipped across different data_chunks. Instead of tracking just the node IDs in
processed_nodes, change the dedupe key to track the full edge tuple. For the
type_node edges check, use the tuple (data_chunk.id, type_node_id, "exists_in")
as the key; for the entity_node_edges check, use (data_chunk.id, entity_node_id,
"mentioned_in") as the key; and for the type_entity_edges check, use
(entity_node_id, type_node_id, "is_a") as the key. This ensures that
chunk-specific edge relationships are properly deduplicated rather than dropping
them based on node uniqueness alone.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9e80345b-157c-49fc-a004-ae6a8303ae3a
📒 Files selected for processing (2)
cognee/modules/graph/utils/expand_with_nodes_and_edges.pycognee/modules/graph/utils/retrieve_existing_edges.py
🚧 Files skipped from review as they are similar to previous changes (1)
- cognee/modules/graph/utils/expand_with_nodes_and_edges.py
…yType ID collision
788659a to
b2a154b
Compare
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Description
Fixes #2510.
_create_type_nodeand_create_entity_nodeinexpand_with_nodes_and_edges.pyboth callgenerate_node_idwith the node name only, soEntity("institution")andEntityType("institution")produce the same UUID. Within a single run this is fine because deduplication uses keys likeuuid_typevsuuid_entity, but across runs only the UUID is persisted to PostgreSQL, causing anEntityAlreadyExistsError(409) on the secondcognify, which breaks graph projection and returns empty search results.The fix includes the node category in the hash:
Acceptance Criteria
Entity("institution")andEntityType("institution")now produce different UUIDscognifytwice on the same data no longer raisesEntityAlreadyExistsErrorType of Change
Pre-submission Checklist
CONTRIBUTING.md)test_neptune_analytics_graph.pywhich is unrelated)DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.
Summary by CodeRabbit