
Feature/cog 4596 expand eval framework with gtrl corpus builder #2534

Open
ArmagedonFlamer wants to merge 37 commits into dev from feature/cog-4596-expand-eval_framework-with-gtrl-corpus_builder

Conversation

@ArmagedonFlamer
Contributor

@ArmagedonFlamer ArmagedonFlamer commented Apr 1, 2026

Description

Acceptance Criteria

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Code refactoring
  • Other (please specify):

Screenshots

Pre-submission Checklist

  • I have tested my changes thoroughly before submitting this PR (See CONTRIBUTING.md)
  • This PR contains minimal changes necessary to address the issue/feature
  • My code follows the project's coding standards and style guidelines
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if applicable)
  • All new and existing tests pass
  • I have searched existing PRs to ensure this change hasn't been submitted already
  • I have linked any relevant issues in the description
  • My commits have clear and descriptive messages

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added a logistics simulation benchmark adapter with automated world generation including users, retailers, post offices, and carriers
    • Automatic evaluation of package delivery scenarios with carrier selection, estimated delivery times, and transport cost analysis
    • Golden answer generation for benchmark evaluation datasets
    • Narrative corpus generation from logistics simulation data
    • Support for filtering evaluation instances and deterministic sampling
  • Tests

    • Added test coverage for new logistics benchmark adapter and corpus building workflow

…ted-ingestion' into feature/cog-4370-engineering-update-retrievers-to-use-importance-weighted
@ArmagedonFlamer ArmagedonFlamer self-assigned this Apr 1, 2026
@pull-checklist

pull-checklist bot commented Apr 1, 2026

Please make sure all the checkboxes are checked:

  • I have tested these changes locally.
  • I have reviewed the code changes.
  • I have added end-to-end and unit tests (if applicable).
  • I have updated the documentation and README.md file (if necessary).
  • I have removed unnecessary code and debug statements.
  • PR title is clear and follows the convention.
  • I have tagged reviewers or team members for feedback.

@coderabbitai
Contributor

coderabbitai bot commented Apr 1, 2026

Walkthrough

This pull request introduces a comprehensive logistics simulation benchmark adapter for the evaluation framework. It adds a LogisticsSystemAdapter class that generates synthetic logistics worlds with entities (users, retailers, post offices, carriers, packages), evaluates delivery options through a rule engine, and provides narrativized corpus data for benchmarking.

Changes

Cohort / File(s) Summary
Adapter Framework Updates
cognee/eval_framework/benchmark_adapters/base_benchmark_adapter.py, cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
Added async prepare_corpus() method to base adapter class and registered new LogisticsSystemAdapter in the BenchmarkAdapter enum.
Core Logistics Adapter
cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py
Implements main adapter providing world creation, corpus loading with optional filtering/sampling, and golden answer generation for logistics benchmark.
Logistics Domain Entities
cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/enums.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/post_office.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/retailer.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/transportation.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/user.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/delivery_models.py
Defines enum types (UserTier, ShippingRange, Region, TransportMode, etc.) and dataclasses modeling logistics domain entities (Package, PostOffice, Retailer, User, Carrier, DeliveryPlan) with label conversions and validation helpers.
World Generation and Rule Engine
cognee/eval_framework/benchmark_adapters/logistics_system_utils/ontology.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/utils.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py
Implements world creation with entities and packages, delivery plan evaluation engine with scoring/validation, world serialization/deserialization, and deterministic world seeding and rendering.
Narrativization Pipeline
cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py, cognee/eval_framework/benchmark_adapters/logistics_system_utils/prompts/narrativization.txt, cognee/eval_framework/benchmark_adapters/logistics_system_utils/prompts/package_narrativization.txt
Generates LLM-based narratives from logistics world entities with fallback deterministic generation, supporting structured output queries and per-entity/package narrative file persistence.
Evaluation Rules and Queries
cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_common_language.txt, cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_policy_language.txt, cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query.txt, cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query_data_sources.txt
Defines logistics evaluation rules covering carrier eligibility, delivery timing/pricing calculations, and query templates for generating QA pairs with supporting facts.
Corpus Builder Integration
cognee/eval_framework/corpus_builder/corpus_builder_executor.py
Updated to call prepare_corpus() on adapters before loading and to deduplicate corpus content.
Test Coverage
cognee/tests/unit/eval_framework/benchmark_adapters_test.py, cognee/tests/unit/eval_framework/corpus_builder_test.py, cognee/tests/unit/eval_framework/logistics_ontology_test.py
Added tests for adapter instantiation, world creation/loading, corpus building adapter preparation flow, and delivery plan validation logic.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested labels

core-team, run-checks

Suggested reviewers

  • lxobr
  • hajdul88
  • borisarzentar
🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning: The PR description is entirely the uncompleted template with no human-written content, missing all required sections including a clear description of changes, acceptance criteria, type of change selection, and test evidence. Resolution: Provide a human-written description explaining the GTRL corpus builder feature, list acceptance criteria, select the appropriate change type, include test screenshots, and complete the pre-submission checklist.
  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 1.21%, which is insufficient; the required threshold is 80.00%. Resolution: Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Title check — ✅ Passed: The title clearly identifies the main change: expanding the evaluation framework with a GTRL corpus builder, which aligns with the substantial additions of the LogisticsSystemAdapter and related components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@github-actions

github-actions bot commented Apr 1, 2026

Hello @ArmagedonFlamer, thank you for submitting a PR! We will respond as soon as possible.

@mergify

mergify bot commented Apr 1, 2026

⚠️ The sha of the head commit of this PR conflicts with #2447. Mergify cannot evaluate rules on this PR. Once #2447 is merged or closed, Mergify will resume processing this PR. ⚠️

@ArmagedonFlamer ArmagedonFlamer marked this pull request as ready for review April 2, 2026 15:16
@ArmagedonFlamer ArmagedonFlamer requested a review from lxobr April 2, 2026 15:17
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

🧹 Nitpick comments (6)
cognee/tests/unit/eval_framework/benchmark_adapters_test.py (1)

137-143: Avoid brittle order-coupling in answers_by_type.

Building the map from qa_pairs[:3] assumes a fixed ordering. Using all entries keeps the test robust to harmless ordering changes.

Suggested refactor
-    answers_by_type = {item["question_type"]: item for item in qa_pairs[:3]}
+    answers_by_type = {}
+    for item in qa_pairs:
+        answers_by_type.setdefault(item["question_type"], item)
+
+    assert "delivery_days" in answers_by_type
+    assert "transport_cost" in answers_by_type
+    assert "carrier" in answers_by_type
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/tests/unit/eval_framework/benchmark_adapters_test.py` around lines 137
- 143, The test creates answers_by_type from qa_pairs[:3], which couples the
test to a specific ordering; update the construction of answers_by_type to
iterate over the entire qa_pairs list (qa_pairs) and build the dict by mapping
item["question_type"] to item so the assertions (on answers_by_type, and checks
for "delivery_days", "transport_cost", "carrier" and their answer/golden_*
fields) don't depend on the first three elements; ensure duplicate question_type
handling is intentional (e.g., last-wins) or assert uniqueness if needed.
cognee/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1)

11-12: Add a short docstring for the new public hook.

prepare_corpus is now part of the adapter contract; a one-line docstring clarifying override expectations will make implementations more consistent.

Suggested refactor
 class BaseBenchmarkAdapter(ABC):
     async def prepare_corpus(self) -> None:
+        """Optional async hook to prepare adapter assets before load_corpus()."""
         return None

As per coding guidelines: "undocumented function definitions and class definitions in the project's Python code are assumed incomplete."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/eval_framework/benchmark_adapters/base_benchmark_adapter.py` around
lines 11 - 12, Add a one-line docstring to the new public async hook
prepare_corpus(self) -> None clarifying that this is an optional asynchronous
hook adapters should override to prepare or load corpus data and that the base
implementation is a no-op (returns None); keep it short and placed immediately
under the prepare_corpus signature in base_benchmark_adapter.py so
implementations know to override it when needed.
cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py (1)

82-89: Preserve the last world-generation failure.

Retrying every ValueError here and then replacing it with a generic incompatibility message hides the real cause when generation fails for some other reason. Catch only the expected retryable error, or at least chain the last exception into the final one.

Suggested change
     def _create_world_with_packages(self, max_attempts: int = 10) -> dict[str, object]:
+        last_error: ValueError | None = None
         for _ in range(max_attempts):
             world = create_world(
                 user_count=self.user_count,
                 retailer_count=self.retailer_count,
             )
             try:
                 return add_packages_to_world(world, package_count=self.package_count)
-            except ValueError:
+            except ValueError as error:
+                last_error = error
                 continue
 
         raise ValueError(
             "Could not generate a logistics world with compatible user and retailer pairs."
-        )
+        ) from last_error
As per coding guidelines, "Prefer explicit, structured error handling in Python code."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py` around
lines 82 - 89, The retry loop around add_packages_to_world currently swallows
all ValueError instances and raises a generic ValueError, losing the root cause;
modify the loop in logistics_system_adapter.py to either catch a more specific
retryable exception (not broad ValueError) or store the last caught exception
(e.g., last_exc) and re-raise the final ValueError chaining it with "from
last_exc" so the original error from add_packages_to_world/package_count is
preserved; update the except clause around
add_packages_to_world(package_count=self.package_count) to implement one of
these two options and ensure the final raise includes the original exception
context.
cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py (3)

92-133: Add a docstring to from_labels and consider documenting error behavior.

The classmethod lacks documentation. Additionally, calls to ShippingRange.from_label(), PackageCategory.from_label(), etc., will presumably raise exceptions if passed invalid labels. While the current usage (via _seed_package) guarantees valid inputs from package_possible_values, documenting the expected behavior and valid inputs would help prevent misuse.

📝 Suggested docstring
     @classmethod
     def from_labels(
         cls,
         package_id: str,
         description: str,
         weight_kg: float,
         shipping_range: str,
         category: str,
         priority: str,
         retailer_id: str | None = None,
         retailer_name: str | None = None,
         user_id: str | None = None,
         user_name: str | None = None,
         insured: bool = False,
         current_state: str = PackageState.CREATED.label,
         last_known_location: str = "origin_warehouse",
         current_post_office_id: str | None = None,
         current_post_office_name: str | None = None,
         route_post_office_ids: tuple[str, ...] = (),
         route_post_office_names: tuple[str, ...] = (),
         status_history: tuple[str, ...] = (),
     ) -> "Package":
+        """Construct a Package from human-readable label strings.
+
+        Args:
+            shipping_range: Label for ShippingRange enum (e.g., "local", "regional").
+            category: Label for PackageCategory enum.
+            priority: Label for DeliveryPriority enum.
+            current_state: Label for PackageState enum.
+            **other args: Passed through directly to Package.__init__.
+
+        Returns:
+            A new Package instance with enums resolved from labels.
+
+        Raises:
+            ValueError: If any label does not match a valid enum member.
+        """
         return cls(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py`
around lines 92 - 133, Add a docstring to the Package.from_labels classmethod
that describes its purpose, parameters, return type, and expected input labels;
explicitly list or reference the valid label sources (e.g.,
package_possible_values or the enums) and document that invalid labels passed to
ShippingRange.from_label, PackageCategory.from_label,
DeliveryPriority.from_label, or PackageState.from_label will raise exceptions
(propagated to the caller) so callers know error behavior. Reference the method
name from_labels and the helper conversion functions ShippingRange.from_label,
PackageCategory.from_label, DeliveryPriority.from_label, and
PackageState.from_label in the docstring.

13-30: Consider adding a docstring or inline comment for the constant.

The package_possible_values dictionary serves as a configuration for valid attribute values, but its purpose and usage context aren't immediately clear. A brief docstring or comment would help future maintainers understand this is used for random sampling during package generation.

As per coding guidelines, undocumented definitions are assumed incomplete.

📝 Suggested documentation
+# Valid attribute values for random package generation in logistics simulations.
+# Keys map to Package field names; values are sampled via random.choice().
 package_possible_values = {
     "weight_kg": [1.0, 5.0, 25.0, 150.0, 900.0],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py`
around lines 13 - 30, Add a concise docstring or inline comment above the
package_possible_values constant explaining its purpose and usage (e.g., that
package_possible_values is a configuration mapping used for random
sampling/validation when generating Package entities in the logistics system),
describe what each key represents (weight_kg, shipping_range, category,
priority, insured, current_state, last_known_location), and note that
current_state excludes DELIVERED and RETURNED; place this comment immediately
above the package_possible_values definition in package.py so maintainers see it
when editing the Package generation logic.

33-52: Add a class-level docstring for Package.

Per the coding guidelines, undocumented class definitions are assumed incomplete. A docstring describing the purpose, key fields, and usage of this domain entity would improve maintainability and assist developers unfamiliar with the logistics simulation module.

📝 Suggested docstring
 @dataclass(slots=True)
 class Package:
+    """Represents a package in the logistics simulation.
+
+    Attributes:
+        package_id: Unique identifier for the package.
+        description: Human-readable description of package contents.
+        weight_kg: Package weight in kilograms.
+        shipping_range: Geographic scope of delivery (local, regional, etc.).
+        category: Classification affecting handling requirements.
+        priority: Delivery urgency level.
+        current_state: Current status in the delivery lifecycle.
+        route_post_office_ids: Ordered sequence of post offices in the delivery route.
+    """
     package_id: str
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py`
around lines 33 - 52, Add a clear class-level docstring to the Package dataclass
describing its purpose in the logistics simulation, the key fields (e.g.,
package_id, description, weight_kg, shipping_range, category, priority,
current_state, last_known_location, route_post_office_ids, status_history, and
optional retailer/user info), typical usage (how instances are created/updated
and how state/location fields change), and any important invariants (e.g.,
route_post_office_ids order corresponds to route_post_office_names). Place this
docstring immediately below the @dataclass(slots=True) class Package:
declaration so it documents the Package class for maintainers and tooling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py`:
- Around line 184-186: The adapter currently builds question_text from
package_context and question_prompt but ignores the stored custom query
(self.query) because the append logic is commented out; re-enable and guard that
append so when self.query is set and non-empty it gets appended (e.g., in the
method that constructs question_text, use a conditional to append
f"{self.query.strip()}" to question_text with a newline). Update the logic
around question_text, package_context, question_prompt and self.query so passing
query= in __init__ has effect and preserve spacing/newlines when appending.
- Around line 133-134: The _load_shared_corpus_documents method currently only
returns the contents of DEFAULT_COMMON_RULES_PATH, so the new policy rules file
never gets added to the corpus; update _load_shared_corpus_documents to also
read and include the policy rules (e.g.,
DEFAULT_POLICY_RULES_PATH.read_text(encoding="utf-8").strip()) alongside
DEFAULT_COMMON_RULES_PATH so both rule texts are returned (as a list of strings)
and become retrievable context.
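The shared-rules fix above amounts to reading every rules file, not just the first; a minimal sketch, assuming a helper that takes the paths as parameters (the function name and call shape are illustrative, not the adapter's actual API):

```python
from pathlib import Path
from typing import Iterable


def load_shared_corpus_documents(paths: Iterable[Path]) -> list[str]:
    """Read and strip each shared rules file so that every rule set
    (common-language and policy-language alike) becomes a retrievable
    corpus document instead of only the first one."""
    return [path.read_text(encoding="utf-8").strip() for path in paths]
```

The adapter would then pass both DEFAULT_COMMON_RULES_PATH and DEFAULT_POLICY_RULES_PATH, so neither file is silently dropped from the corpus.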

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py`:
- Around line 230-239: The sync wrapper inverts the running-loop check and
attempts to create a nested loop; instead, detect a running loop with
asyncio.get_running_loop() and, if a loop exists, run narrativize_corpus in a
separate thread (e.g., use
concurrent.futures.ThreadPoolExecutor().submit(lambda:
asyncio.run(narrativize_corpus(...))).result()) so you avoid nesting event loops
in the same thread; if no running loop (get_running_loop() raises), simply call
asyncio.run(narrativize_corpus(...)). Ensure you reference the
narrativize_corpus coroutine and replace the current
new_event_loop/run_until_complete branch with the thread-based executor approach
when a loop is present.
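The loop-detection pattern described above is a standard one; as a rough sketch (the helper name run_sync is illustrative, not the module's actual API):

```python
import asyncio
import concurrent.futures
from typing import Any, Coroutine


def run_sync(coro: Coroutine[Any, Any, Any]) -> Any:
    """Run a coroutine from synchronous code without nesting event loops."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: starting one with asyncio.run() is safe.
        return asyncio.run(coro)
    # A loop is already running here; hand the coroutine to a fresh thread
    # so asyncio.run() gets a thread with no loop of its own.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()
```

This keeps the no-loop path cheap while the thread-based path only kicks in when the wrapper is called from inside an already-running loop.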

In `cognee/eval_framework/benchmark_adapters/logistics_system_utils/ontology.py`:
- Around line 46-83: The transport network is being built once from the initial
_seed_user/_seed_retailer pair and reused, so later retailers get
origin_post_office_id values that don't match actual origin facilities and
_route_post_offices_for_state() never sees those origins; fix by seeding post
offices and carriers per retailer/user pair instead of once: call
_seed_post_offices(...) and set retailer.origin_post_office_id via
_origin_post_office_id_for_retailer(...) for each retailer (or move the existing
post_offices assignment inside the retailers/users loop), and create carriers
with _seed_world_carriers(...) per retailer/user so each retailer has a matching
carrier and origin post office used by _route_post_offices_for_state().

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query_data_sources.txt`:
- Around line 13-21: The JSON schema uses non-JSON type names and inconsistent
key naming: replace the non-JSON tokens "int" and "float" for keys like
estimated_delivery_days and estimated_transport_price with the valid JSON type
"number", and normalize the route supporting-facts data-source key by renaming
route_supporting_facts_data_source to the plural
route_supporting_facts_data_sources so it matches the pattern used by
estimated_*_supporting_facts_data_sources.

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py`:
- Around line 198-211: compute_outcome currently picks the first retailer/user
from world instead of the package owner, causing wrong scoring for non-zero
index packages; after calling _resolve_package (which returns the package dict
with retailer_id and user_id), resolve the retailer and user using those ids
(e.g., pass package["retailer_id"] and package["user_id"] into
_resolve_world_actor or a lookup helper) and then call evaluate with the
resolved retailer and user so scoring uses the package's actual retailer/user;
update compute_outcome to use package-sourced ids rather than hardcoded
"retailers"/"users" defaults.
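The owner-aware lookup the prompt asks for can be sketched as follows; all names here are illustrative, since the real adapter routes through _resolve_package and its own world layout:

```python
def resolve_package_actors(world: dict, package: dict) -> tuple[dict, dict]:
    """Resolve the retailer and user that own a package via its stored IDs,
    instead of defaulting to the first entries in the world."""

    def _find(entries: list, key: str, value: str) -> dict:
        for entry in entries:
            if entry[key] == value:
                return entry
        raise ValueError(f"No entry with {key}={value!r}")

    retailer = _find(world["retailers"], "retailer_id", package["retailer_id"])
    user = _find(world["users"], "user_id", package["user_id"])
    return retailer, user
```

compute_outcome would then pass the resolved pair into evaluate, so scoring reflects the package's actual owner even for non-zero-index packages.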

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_policy_language.txt`:
- Line 10: Rule R10 requires region_distance = abs(origin.score -
destination.score) + 1 with +2 if crossing US/non‑US, but _region_distance()
currently returns 0 for same-region routes; update _region_distance in
cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py
to compute distance as abs(origin_region_score - destination_region_score) + 1
unconditionally and then add the +2 cross-border penalty when origin and
destination are US vs non‑US, and adjust any unit tests or callers that assumed
a zero same-region result.
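Rule R10 as restated above reduces to a few lines; a minimal sketch with illustrative parameters (the real _region_distance derives scores and US membership from the Region enums):

```python
def region_distance(origin_score: int, destination_score: int,
                    origin_is_us: bool, destination_is_us: bool) -> int:
    """Rule R10: base distance is the absolute score difference plus one
    (so same-region routes cost 1, not 0), with a +2 penalty whenever the
    route crosses the US/non-US border."""
    distance = abs(origin_score - destination_score) + 1
    if origin_is_us != destination_is_us:
        distance += 2
    return distance
```

Note the unconditional +1: it is exactly what makes the same-region case return 1 rather than the 0 the current implementation produces.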

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/utils.py`:
- Around line 341-349: load_world currently omits the "packages" key when the
packages list is empty, breaking round-tripping with store_world (which always
writes "packages") and causing callers like write_golden_answers to fail; update
the load_world logic (the world dict construction in load_world) to always set
world["packages"] = packages regardless of emptiness so the key is preserved
even for an empty list.
- Around line 271-275: The helper currently falls back to the heuristic matcher
when package.user_id and package.retailer_id are present but one or both lookups
via _find_by_id(users, "user_id", package.user_id) or _find_by_id(retailers,
"retailer_id", package.retailer_id) fail; change this so that if package.user_id
and package.retailer_id are not None and either matched_user or matched_retailer
is None the function does not continue to the heuristic matcher but instead
returns a clear failure (e.g., raise a ValueError or return None/a specific
error sentinel) so corrupted IDs surface immediately rather than producing wrong
golden answers. Ensure the check lives alongside the existing
matched_user/matched_retailer logic and that callers handle the raised
error/None result.
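The first fix is about key stability on deserialization; a minimal sketch of the intended invariant (the function name and field list are illustrative):

```python
def build_world_dict(users: list, retailers: list, post_offices: list,
                     carriers: list, packages: list) -> dict:
    """Assemble a world mapping with a stable shape: "packages" is always
    present, so store_world()/load_world() round-trip cleanly even when no
    packages were generated."""
    return {
        "users": users,
        "retailers": retailers,
        "post_offices": post_offices,
        "carriers": carriers,
        "packages": packages,  # kept even when the list is empty
    }
```

Callers such as write_golden_answers can then index world["packages"] unconditionally instead of guarding against a missing key.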

In
`cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py`:
- Line 170: The generated IDs using random.randint(...) (e.g., in create_world()
and places that set post_office_id, carrier IDs, etc., and that feed into
_entity_entries() and persisted filenames) are collision-prone; replace them
with collision-safe IDs by using uuid.uuid4().hex (or another UUID generator)
instead of random.randint, or implement a uniqueness loop that checks the
destination dict/sets (e.g., the map used by _entity_entries()) to ensure the
new ID is not already present before assigning; update all occurrences
referenced in create_world(), post_office_id, carrier ID generation sites, and
any other similar ID generators (the lines called out in the review) so
filenames and keys remain unique.
- Around line 661-678: _seed_package() currently assigns current_post_office to
the last routed office unconditionally, which causes IN_TRANSIT,
OUT_FOR_DELIVERY, and DELIVERED to show a sorting center instead of the special
labels from _location_for_state(); change the logic so that before building the
Package you set current_post_office = None when package_state is one of the
transit/delivery/completed states (e.g., package_state in
{PackageState.IN_TRANSIT, PackageState.OUT_FOR_DELIVERY,
PackageState.DELIVERED}) so that last_known_location and
current_post_office_name are computed from _location_for_state(package_state,
current_post_office) and do not point at a post office.
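For the ID-collision comment, one way to combine a UUID with the suggested uniqueness loop looks like this (the helper name and prefix scheme are illustrative):

```python
import uuid


def new_entity_id(prefix: str, taken: set) -> str:
    """Mint a collision-safe entity ID. uuid4 is effectively unique on its
    own; the membership check additionally guards against reuse within a
    single generated world, keeping filenames and dict keys distinct."""
    while True:
        candidate = f"{prefix}-{uuid.uuid4().hex[:12]}"
        if candidate not in taken:
            taken.add(candidate)
            return candidate
```

The same taken set would be shared across the post-office, carrier, and other ID generation sites so world-wide uniqueness holds.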

---

Nitpick comments:
In `@cognee/eval_framework/benchmark_adapters/base_benchmark_adapter.py`:
- Around line 11-12: Add a one-line docstring to the new public async hook
prepare_corpus(self) -> None clarifying that this is an optional asynchronous
hook adapters should override to prepare or load corpus data and that the base
implementation is a no-op (returns None); keep it short and placed immediately
under the prepare_corpus signature in base_benchmark_adapter.py so
implementations know to override it when needed.

In `@cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py`:
- Around line 82-89: The retry loop around add_packages_to_world currently
swallows all ValueError instances and raises a generic ValueError, losing the
root cause; modify the loop in logistics_system_adapter.py to either catch a
more specific retryable exception (not broad ValueError) or store the last
caught exception (e.g., last_exc) and re-raise the final ValueError chaining it
with "from last_exc" so the original error from
add_packages_to_world/package_count is preserved; update the except clause
around add_packages_to_world(package_count=self.package_count) to implement one
of these two options and ensure the final raise includes the original exception
context.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py`:
- Around line 92-133: Add a docstring to the Package.from_labels classmethod
that describes its purpose, parameters, return type, and expected input labels;
explicitly list or reference the valid label sources (e.g.,
package_possible_values or the enums) and document that invalid labels passed to
ShippingRange.from_label, PackageCategory.from_label,
DeliveryPriority.from_label, or PackageState.from_label will raise exceptions
(propagated to the caller) so callers know error behavior. Reference the method
name from_labels and the helper conversion functions ShippingRange.from_label,
PackageCategory.from_label, DeliveryPriority.from_label, and
PackageState.from_label in the docstring.
- Around line 13-30: Add a concise docstring or inline comment above the
package_possible_values constant explaining its purpose and usage (e.g., that
package_possible_values is a configuration mapping used for random
sampling/validation when generating Package entities in the logistics system),
describe what each key represents (weight_kg, shipping_range, category,
priority, insured, current_state, last_known_location), and note that
current_state excludes DELIVERED and RETURNED; place this comment immediately
above the package_possible_values definition in package.py so maintainers see it
when editing the Package generation logic.
- Around line 33-52: Add a clear class-level docstring to the Package dataclass
describing its purpose in the logistics simulation, the key fields (e.g.,
package_id, description, weight_kg, shipping_range, category, priority,
current_state, last_known_location, route_post_office_ids, status_history, and
optional retailer/user info), typical usage (how instances are created/updated
and how state/location fields change), and any important invariants (e.g.,
route_post_office_ids order corresponds to route_post_office_names). Place this
docstring immediately below the `@dataclass`(slots=True) class Package:
declaration so it documents the Package class for maintainers and tooling.

In `@cognee/tests/unit/eval_framework/benchmark_adapters_test.py`:
- Around line 137-143: The test creates answers_by_type from qa_pairs[:3], which
couples the test to a specific ordering; update the construction of
answers_by_type to iterate over the entire qa_pairs list (qa_pairs) and build
the dict by mapping item["question_type"] to item so the assertions (on
answers_by_type, and checks for "delivery_days", "transport_cost", "carrier" and
their answer/golden_* fields) don't depend on the first three elements; ensure
duplicate question_type handling is intentional (e.g., last-wins) or assert
uniqueness if needed.
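The decoupling the comment asks for can be sketched with plain dict semantics; the qa_pairs items and field values below are illustrative stand-ins, not the real fixture from benchmark_adapters_test.py:

```python
# Hypothetical qa_pairs fixture shaped like the review describes.
qa_pairs = [
    {"question_type": "delivery_days", "answer": "3", "golden_answer": "3"},
    {"question_type": "transport_cost", "answer": "12.5", "golden_answer": "12.5"},
    {"question_type": "carrier", "answer": "AcmeExpress", "golden_answer": "AcmeExpress"},
    {"question_type": "carrier", "answer": "GlobalPost", "golden_answer": "GlobalPost"},
]

# Build the lookup from the whole list, not qa_pairs[:3]; duplicate
# question_type keys resolve last-wins under dict comprehension semantics.
answers_by_type = {item["question_type"]: item for item in qa_pairs}

assert set(answers_by_type) == {"delivery_days", "transport_cost", "carrier"}
assert answers_by_type["carrier"]["answer"] == "GlobalPost"  # last occurrence wins
```

If last-wins is not intentional, the test could instead assert that question_type values are unique before building the dict.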

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 99606914-d2f5-4b4f-b156-7abfcef484c0

📥 Commits

Reviewing files that changed from the base of the PR and between 413a0f9 and e9ef7a2.

📒 Files selected for processing (25)
  • cognee/eval_framework/benchmark_adapters/base_benchmark_adapter.py
  • cognee/eval_framework/benchmark_adapters/benchmark_adapters.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/delivery_models.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/enums.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/package.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/post_office.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/retailer.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/transportation.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/entities/user.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/ontology.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/prompts/narrativization.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/prompts/package_narrativization.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query_data_sources.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_common_language.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_policy_language.txt
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/utils.py
  • cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py
  • cognee/eval_framework/corpus_builder/corpus_builder_executor.py
  • cognee/tests/unit/eval_framework/benchmark_adapters_test.py
  • cognee/tests/unit/eval_framework/corpus_builder_test.py
  • cognee/tests/unit/eval_framework/logistics_ontology_test.py

Comment on lines +133 to +134
def _load_shared_corpus_documents(self) -> list[str]:
return [DEFAULT_COMMON_RULES_PATH.read_text(encoding="utf-8").strip()]
⚠️ Potential issue | 🟠 Major

Load the new policy rules into the corpus.

load_corpus() only injects rules_common_language.txt, so the new rules_policy_language.txt never becomes retrievable context. Right now the added policy spec is dead data, and the corpus can drift away from the hard-coded evaluation logic.

Suggested change
 DEFAULT_COMMON_RULES_PATH = (
     Path(__file__).resolve().parent
     / "logistics_system_utils"
     / "rules"
     / "rules_common_language.txt"
 )
+DEFAULT_POLICY_RULES_PATH = (
+    Path(__file__).resolve().parent
+    / "logistics_system_utils"
+    / "rules"
+    / "rules_policy_language.txt"
+)
 ...
     def _load_shared_corpus_documents(self) -> list[str]:
-        return [DEFAULT_COMMON_RULES_PATH.read_text(encoding="utf-8").strip()]
+        return [
+            DEFAULT_COMMON_RULES_PATH.read_text(encoding="utf-8").strip(),
+            DEFAULT_POLICY_RULES_PATH.read_text(encoding="utf-8").strip(),
+        ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py` around
lines 133 - 134, The _load_shared_corpus_documents method currently only returns
the contents of DEFAULT_COMMON_RULES_PATH, so the new policy rules file never
gets added to the corpus; update _load_shared_corpus_documents to also read and
include the policy rules (e.g.,
DEFAULT_POLICY_RULES_PATH.read_text(encoding="utf-8").strip()) alongside
DEFAULT_COMMON_RULES_PATH so both rule texts are returned (as a list of strings)
and become retrievable context.

Comment on lines +184 to +186
question_text = f"{package_context}\n\n{question_prompt}"
# if self.query.strip():
# question_text = f"{question_text}\n{self.query.strip()}"
⚠️ Potential issue | 🟠 Major

query is ignored right now.

__init__() still accepts and stores a custom query template, but the only line that appends it to the question text is commented out, so passing query= currently has no effect.

Suggested change
                 question_text = f"{package_context}\n\n{question_prompt}"
-                # if self.query.strip():
-                #    question_text = f"{question_text}\n{self.query.strip()}"
+                if self.query.strip():
+                    question_text = f"{question_text}\n{self.query.strip()}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/logistics_system_adapter.py` around
lines 184 - 186, The adapter currently builds question_text from package_context
and question_prompt but ignores the stored custom query (self.query) because the
append logic is commented out; re-enable and guard that append so when
self.query is set and non-empty it gets appended (e.g., in the method that
constructs question_text, use a conditional to append f"{self.query.strip()}" to
question_text with a newline). Update the logic around question_text,
package_context, question_prompt and self.query so passing query= in __init__
has effect and preserve spacing/newlines when appending.

Comment on lines +230 to +239
try:
asyncio.get_running_loop()
except RuntimeError:
return asyncio.run(narrativize_corpus(world_root_path, use_llm=use_llm))

loop = asyncio.new_event_loop()
try:
return loop.run_until_complete(narrativize_corpus(world_root_path, use_llm=use_llm))
finally:
loop.close()
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Expected result:
# RuntimeError: Cannot run the event loop while another loop is running
python3 - <<'PY'
import asyncio

async def noop():
    return None

async def probe():
    loop = asyncio.new_event_loop()
    coro = noop()
    try:
        try:
            loop.run_until_complete(coro)
        except Exception as exc:
            print(type(exc).__name__, exc)
    finally:
        coro.close()
        loop.close()

asyncio.run(probe())
PY

Repository: topoteretes/cognee

Length of output: 130


🏁 Script executed:

# First, verify the file exists and check the exact code at lines 230-239
fd -t f narrativize_corpus.py

Repository: topoteretes/cognee

Length of output: 164


🏁 Script executed:

# Read the file to see lines around 230-239 with context
wc -l cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py

Repository: topoteretes/cognee

Length of output: 168


🏁 Script executed:

# Read the specific section with context
sed -n '200,245p' cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py

Repository: topoteretes/cognee

Length of output: 1301


This sync wrapper fails when called from an already-running event loop.

The logic is inverted: when asyncio.get_running_loop() succeeds (indicating a loop already exists), the code attempts to create a second loop and call run_until_complete() on it. This raises RuntimeError: Cannot run the event loop while another loop is running. In practice, this breaks the sync wrapper in notebooks, pytest-asyncio, and ASGI contexts where it should be most useful.

The correct pattern requires using a different strategy (e.g., asyncio.to_thread() for Python 3.9+, or nest_asyncio as a workaround) rather than attempting to nest event loops in the same thread.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/corpus_generator/narrativize_corpus.py`
around lines 230 - 239, The sync wrapper inverts the running-loop check and
attempts to create a nested loop; instead, detect a running loop with
asyncio.get_running_loop() and, if a loop exists, run narrativize_corpus in a
separate thread (e.g., use
concurrent.futures.ThreadPoolExecutor().submit(lambda:
asyncio.run(narrativize_corpus(...))).result()) so you avoid nesting event loops
in the same thread; if no running loop (get_running_loop() raises), simply call
asyncio.run(narrativize_corpus(...)). Ensure you reference the
narrativize_corpus coroutine and replace the current
new_event_loop/run_until_complete branch with the thread-based executor approach
when a loop is present.
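The thread-based pattern the prompt describes can be sketched as follows; narrativize_corpus_demo is a stand-in coroutine, not the real narrativize_corpus:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


async def narrativize_corpus_demo(value):
    # Stand-in for the real narrativize_corpus coroutine.
    await asyncio.sleep(0)
    return value


def run_sync(value):
    """Run the coroutine from sync code, even when a loop is already running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop in this thread: asyncio.run() is safe.
        return asyncio.run(narrativize_corpus_demo(value))
    # A loop is already running: run a fresh loop in a worker thread rather
    # than nesting run_until_complete() in the current thread.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(lambda: asyncio.run(narrativize_corpus_demo(value))).result()


print(run_sync("outside a loop"))  # plain sync context


async def caller():
    # Simulates a notebook/pytest-asyncio context with a running loop.
    return run_sync("inside a loop")


print(asyncio.run(caller()))
```

Blocking on `.result()` inside a coroutine stalls the outer loop for the duration, which is acceptable for a sync compatibility shim but worth noting in the docstring.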

Comment on lines +46 to +83
user = _seed_user(name=_pick_unique_name(USER_NAMES, used_user_names, "User"))
retailer = _seed_retailer(
shipping_range=user.default_shipping_range.label,
name=_pick_unique_name(RETAILER_NAMES, used_retailer_names, "Retailer"),
)
post_offices = _seed_post_offices(user.default_shipping_range, retailer.region, user.region)
retailer.origin_post_office_id = _origin_post_office_id_for_retailer(
retailer.region, post_offices
)

users = [user]
for _ in range(user_count - 1):
users.append(
_seed_user(
name=_pick_unique_name(USER_NAMES, used_user_names, "User"),
)
)

retailers = [retailer]
for _ in range(retailer_count - 1):
retailers.append(
_seed_retailer(
shipping_range=random.choice(retailer_possible_values["shipping_range"]),
origin_post_office_id=None,
name=_pick_unique_name(RETAILER_NAMES, used_retailer_names, "Retailer"),
)
)

for retailer in retailers:
retailer.origin_post_office_id = _origin_post_office_id_for_retailer(
retailer.region, post_offices
)

carriers = _seed_world_carriers(
retailer=retailer,
user=user,
shipping_range=user.default_shipping_range,
)
⚠️ Potential issue | 🟠 Major

Don't build the whole transport network around the first seeded pair.

post_offices and carriers are seeded once from the initial user/retailer pair and then reused for every later retailer/user. That leaves many retailers with origin_post_office_ids that point to non-origin facilities, and cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py::_route_post_offices_for_state() only prefers warehouse IDs, so those configured origins are never reflected in the generated routes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cognee/eval_framework/benchmark_adapters/logistics_system_utils/ontology.py`
around lines 46 - 83, The transport network is being built once from the initial
_seed_user/_seed_retailer pair and reused, so later retailers get
origin_post_office_id values that don't match actual origin facilities and
_route_post_offices_for_state() never sees those origins; fix by seeding post
offices and carriers per retailer/user pair instead of once: call
_seed_post_offices(...) and set retailer.origin_post_office_id via
_origin_post_office_id_for_retailer(...) for each retailer (or move the existing
post_offices assignment inside the retailers/users loop), and create carriers
with _seed_world_carriers(...) per retailer/user so each retailer has a matching
carrier and origin post office used by _route_post_offices_for_state().

Comment on lines +13 to +21
"estimated_delivery_days": int,
"estimated_delivery_days_supporting_facts": ["string"],
"estimated_delivery_days_supporting_facts_data_sources": ["string"],
"estimated_transport_price": float,
"estimated_transport_price_supporting_facts": ["string"],
"estimated_transport_price_supporting_facts_data_sources": ["string"],
"route": "string",
"route_supporting_facts": ["string"],
"route_supporting_facts_data_source": ["string"]
⚠️ Potential issue | 🟠 Major

JSON example can induce invalid output and a key mismatch.

The schema currently uses int/float (not valid JSON values) and mixes singular/plural data-source key naming for route facts.

Proposed fix
-  "estimated_delivery_days": int,
+  "estimated_delivery_days": 0,
   "estimated_delivery_days_supporting_facts": ["string"],
   "estimated_delivery_days_supporting_facts_data_sources": ["string"],
-  "estimated_transport_price": float,
+  "estimated_transport_price": 0.0,
   "estimated_transport_price_supporting_facts": ["string"],
   "estimated_transport_price_supporting_facts_data_sources": ["string"],
   "route": "string",
   "route_supporting_facts": ["string"],
-  "route_supporting_facts_data_source": ["string"]
+  "route_supporting_facts_data_sources": ["string"]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/queries/query_data_sources.txt`
around lines 13 - 21, The JSON schema uses non-JSON type names and inconsistent
key naming: replace the non-JSON tokens "int" and "float" for keys like
estimated_delivery_days and estimated_transport_price with the valid JSON type
"number", and normalize the route supporting-facts data-source key by renaming
route_supporting_facts_data_source to the plural
route_supporting_facts_data_sources so it matches the pattern used by
estimated_*_supporting_facts_data_sources.

R07. An express package shall not be assigned to the shipment transport mode.
R08. A package already delivered to the recipient shall not receive a new transport assignment.
R09. Retailer and user regions shall contribute to both delivery-time estimation and transport-price estimation.
R10. Region distance shall be computed as the absolute difference between origin and destination region distance scores, plus 1, with an additional penalty of 2 when crossing between US and non-US regions.
⚠️ Potential issue | 🟠 Major

R10 disagrees with the engine on same-region routes.

_region_distance() in cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py returns 0 when origin and destination are the same region, but this rule says abs(...) + 1 unconditionally. That means the published rule adds 1 day and $3 to same-region shipments, while the generated golden answers do not.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/rules/rules_policy_language.txt`
at line 10, Rule R10 requires region_distance = abs(origin.score -
destination.score) + 1 with +2 if crossing US/non‑US, but _region_distance()
currently returns 0 for same-region routes; update _region_distance in
cognee/eval_framework/benchmark_adapters/logistics_system_utils/rule_engine.py
to compute distance as abs(origin_region_score - destination_region_score) + 1
unconditionally and then add the +2 cross-border penalty when origin and
destination are US vs non‑US, and adjust any unit tests or callers that assumed
a zero same-region result.
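Reading R10 literally, the rule's arithmetic looks like this minimal sketch; the scores and US flags are assumed inputs, and the real _region_distance() signature may differ:

```python
def region_distance(origin_score: int, dest_score: int,
                    origin_is_us: bool, dest_is_us: bool) -> int:
    # R10 as written: absolute score difference, plus 1, unconditionally.
    distance = abs(origin_score - dest_score) + 1
    # Additional penalty of 2 when crossing between US and non-US regions.
    if origin_is_us != dest_is_us:
        distance += 2
    return distance


# Same region: the rule yields 1, whereas the engine currently returns 0.
assert region_distance(3, 3, True, True) == 1
# Cross-border: |1 - 4| + 1 + 2 = 6.
assert region_distance(1, 4, True, False) == 6
```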

Comment on lines +271 to +275
if package.user_id is not None and package.retailer_id is not None:
matched_user = _find_by_id(users, "user_id", package.user_id)
matched_retailer = _find_by_id(retailers, "retailer_id", package.retailer_id)
if matched_user is not None and matched_retailer is not None:
return matched_user, matched_retailer
⚠️ Potential issue | 🟠 Major

Don't silently remap packages with broken IDs.

Once a package carries explicit user_id and retailer_id, this helper should fail if either lookup misses. Falling back to the heuristic matcher here will generate golden answers for the wrong entities instead of surfacing corrupted world data.

🔧 Suggested fix
     if package.user_id is not None and package.retailer_id is not None:
         matched_user = _find_by_id(users, "user_id", package.user_id)
         matched_retailer = _find_by_id(retailers, "retailer_id", package.retailer_id)
-        if matched_user is not None and matched_retailer is not None:
-            return matched_user, matched_retailer
+        if matched_user is None or matched_retailer is None:
+            raise ValueError(
+                f"Package {package.package_id} references an unknown user or retailer."
+            )
+        return matched_user, matched_retailer
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/utils.py`
around lines 271 - 275, The helper currently falls back to the heuristic matcher
when package.user_id and package.retailer_id are present but one or both lookups
via _find_by_id(users, "user_id", package.user_id) or _find_by_id(retailers,
"retailer_id", package.retailer_id) fail; change this so that if package.user_id
and package.retailer_id are not None and either matched_user or matched_retailer
is None the function does not continue to the heuristic matcher but instead
returns a clear failure (e.g., raise a ValueError or return None/a specific
error sentinel) so corrupted IDs surface immediately rather than producing wrong
golden answers. Ensure the check lives alongside the existing
matched_user/matched_retailer logic and that callers handle the raised
error/None result.

Comment on lines +341 to +349
world: dict[str, object] = {
"retailers": retailers,
"users": users,
"post_offices": post_offices,
"carriers": carriers,
}

if packages:
world["packages"] = packages
⚠️ Potential issue | 🟠 Major

Always preserve packages when loading the world.

store_world() always writes a packages array, but load_world() drops that key when the array is empty. That breaks round-tripping and makes downstream code such as write_golden_answers() crash on valid empty worlds.

🔧 Suggested fix
     world: dict[str, object] = {
         "retailers": retailers,
         "users": users,
+        "packages": packages,
         "post_offices": post_offices,
         "carriers": carriers,
     }
-
-    if packages:
-        world["packages"] = packages
 
     return world
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/utils.py`
around lines 341 - 349, load_world currently omits the "packages" key when the
packages list is empty, breaking round-tripping with store_world (which always
writes "packages") and causing callers like write_golden_answers to fail; update
the load_world logic (the world dict construction in load_world) to always set
world["packages"] = packages regardless of emptiness so the key is preserved
even for an empty list.

supports_hazardous_materials: bool,
) -> PostOffice:
return PostOffice(
post_office_id=f"po-{random.randint(1000, 9999)}",
⚠️ Potential issue | 🟠 Major

Make generated IDs collision-safe.

All generated IDs come from unchecked random.randint(...). Carriers are the worst case here: create_world() seeds 22 of them from a 900-value suffix space, so duplicate IDs are realistic. A collision will overwrite _entity_entries() keys, reuse filenames, and make references ambiguous across the persisted world.

Also applies to: 540-540, 597-597, 626-626, 664-664

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py`
at line 170, The generated IDs using random.randint(...) (e.g., in
create_world() and places that set post_office_id, carrier IDs, etc., and that
feed into _entity_entries() and persisted filenames) are collision-prone;
replace them with collision-safe IDs by using uuid.uuid4().hex (or another UUID
generator) instead of random.randint, or implement a uniqueness loop that checks
the destination dict/sets (e.g., the map used by _entity_entries()) to ensure
the new ID is not already present before assigning; update all occurrences
referenced in create_world(), post_office_id, carrier ID generation sites, and
any other similar ID generators (the lines called out in the review) so
filenames and keys remain unique.
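Both remediation options from the prompt can be sketched side by side; seed_post_office_id and unique_post_office_id are illustrative names, not the repo's helpers:

```python
import random
import uuid


def seed_post_office_id() -> str:
    # Current approach: only 9000 possible suffixes, so duplicates become
    # realistic once dozens of entities are seeded (birthday problem).
    return f"po-{random.randint(1000, 9999)}"


def unique_post_office_id(existing_ids: set[str]) -> str:
    # Option 1: keep the readable prefix but retry until the ID is unused.
    while True:
        candidate = f"po-{random.randint(1000, 9999)}"
        if candidate not in existing_ids:
            existing_ids.add(candidate)
            return candidate


def uuid_post_office_id() -> str:
    # Option 2: uuid4 hex is collision-safe without any bookkeeping.
    return f"po-{uuid.uuid4().hex}"


seen: set[str] = set()
ids = [unique_post_office_id(seen) for _ in range(50)]
assert len(ids) == len(set(ids))  # retry loop guarantees uniqueness

uids = {uuid_post_office_id() for _ in range(1000)}
assert len(uids) == 1000
```

The retry loop preserves the short human-readable IDs used in filenames; the uuid variant trades readability for guaranteed uniqueness with no shared state.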

Comment on lines +661 to +678
current_post_office = visited_post_offices[-1] if visited_post_offices else None

return Package.from_labels(
package_id=f"pkg-{random.randint(10000, 99999)}",
description=description or random.choice(PACKAGE_DESCRIPTIONS),
weight_kg=random.choice(package_possible_values["weight_kg"]),
shipping_range=shipping_range,
category=category,
priority=priority,
retailer_id=retailer_id,
retailer_name=retailer_name,
user_id=user_id,
user_name=user_name,
insured=random.choice(package_possible_values["insured"]),
current_state=current_state,
last_known_location=_location_for_state(package_state, current_post_office),
current_post_office_id=current_post_office.post_office_id if current_post_office else None,
current_post_office_name=current_post_office.name if current_post_office else None,
⚠️ Potential issue | 🟠 Major

Delivered and in-flight states shouldn't point at a post office.

_seed_package() always sets current_post_office to the last routed office. For IN_TRANSIT, OUT_FOR_DELIVERY, and DELIVERED, that makes last_known_location/current_post_office_name report a sorting center, and _location_for_state()'s special labels never get used.

🔧 Suggested fix
     current_post_office = visited_post_offices[-1] if visited_post_offices else None
+    if package_state in {
+        PackageState.IN_TRANSIT,
+        PackageState.OUT_FOR_DELIVERY,
+        PackageState.DELIVERED,
+    }:
+        current_post_office = None
 
     return Package.from_labels(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@cognee/eval_framework/benchmark_adapters/logistics_system_utils/utils/world_creation_utils.py`
around lines 661 - 678, _seed_package() currently assigns current_post_office to
the last routed office unconditionally, which causes IN_TRANSIT,
OUT_FOR_DELIVERY, and DELIVERED to show a sorting center instead of the special
labels from _location_for_state(); change the logic so that before building the
Package you set current_post_office = None when package_state is one of the
transit/delivery/completed states (e.g., package_state in
{PackageState.IN_TRANSIT, PackageState.OUT_FOR_DELIVERY,
PackageState.DELIVERED}) so that last_known_location and
current_post_office_name are computed from _location_for_state(package_state,
current_post_office) and do not point at a post office.
