
Hardening pass: tracking, models, runtime, tests, and docs alignment#29

Open
dr-gareth-roberts wants to merge 1 commit into main from feat/insideLLMs-hardening-20260304

Conversation


@dr-gareth-roberts dr-gareth-roberts commented Mar 4, 2026

Description

Summary
This PR applies a focused hardening pass across experiment tracking, model typing, retry error handling, dataset optional imports, and related tests/docs updates.

Changes
Experiment tracking
Enforced non-empty tracker list in MultiTracker constructor.
Returned a concrete first run ID from MultiTracker.start_run (no empty fallback).
File: insideLLMs/experiment_tracking.py
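The two invariants above can be sketched as follows; the class and method bodies are illustrative stand-ins, not the actual insideLLMs implementation:

```python
class MultiTrackerSketch:
    """Illustrative stand-in for MultiTracker's new invariants."""

    def __init__(self, trackers):
        if not trackers:
            # Intentional fail-fast: with zero trackers, start_run could
            # never return a concrete run ID.
            raise ValueError("MultiTracker requires at least one tracker")
        self.trackers = list(trackers)

    def start_run(self, run_name=None):
        run_ids = [t.start_run(run_name) for t in self.trackers]
        # No empty-string fallback: the first child's run ID is returned.
        return run_ids[0]
```

Callers that previously constructed `MultiTracker([])` will now get a `ValueError` at construction time instead of an empty run ID later.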
Optional dependency robustness
Switched HF datasets import to importlib-based optional loading.
Set availability based on resolved load_dataset symbol.
File: insideLLMs/dataset_utils.py
Model retry safety
In ModelWrapper.generate, made last_error explicitly typed and fail-fast if unexpectedly unset before raising.
File: insideLLMs/models/base.py
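The fail-fast pattern can be sketched like this (a simplified stand-in for ModelWrapper.generate, not the actual method):

```python
from typing import Callable, Optional


def generate_with_retry(fn: Callable[[], str], max_retries: int = 3) -> str:
    """Simplified stand-in for a retrying generate call."""
    last_error: Optional[Exception] = None
    for _attempt in range(max_retries):
        try:
            return fn()
        except Exception as exc:  # capture for re-raise once retries exhaust
            last_error = exc
    if last_error is None:
        # Defensive fail-fast: unreachable under normal control flow, but
        # better than raising a bare or ambiguous error if the invariant
        # is ever violated (e.g. max_retries == 0).
        raise RuntimeError("retry loop exited without capturing an error")
    raise last_error
```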
OpenAI model typing cleanup
Added explicit **kwargs: Any annotations to generate/chat/stream.
Added explicit return type -> ModelInfo for info.
Kept SDK boundary ignore where OpenAI typing overloads are narrower than our ChatMessage interface.
File: insideLLMs/models/openai.py
Tests typing improvements
Added explicit typing to helper probe methods used in coverage tests.
File: tests/test_probes_models_coverage.py
Docs consistency
Bias tutorial config now sets output_dir: bias_results to match analysis step paths.
File: wiki/tutorials/Bias-Testing.md
Why
Reduce type/lint noise in strict IDE diagnostics.
Tighten runtime guarantees around run IDs and retry error paths.
Improve optional dependency behavior for environments missing extras.
Keep docs aligned with real output locations.
Risk
Low to medium:

Mostly typing/guard improvements and deterministic behavior tightening.
One behavior change: MultiTracker([]) now raises ValueError (intentional fail-fast).
Validation
Manual code-level validation and IDE diagnostics, due to intermittent command-runner issues in the IDE environment.
Branch push completed and ready for CI validation.

Summary by Sourcery

Harden experiment tracking, model retry behavior, routing/classification, and contrib utilities while tightening typing and optional dependency handling across the codebase.

New Features:

  • Add restricted arithmetic-only execution helpers and use them in calculator tools and sandboxed Python execution to avoid unsafe eval/exec.
  • Expose all exports in several core modules to provide explicit public APIs.

Bug Fixes:

  • Ensure MultiTracker requires at least one tracker and always returns a concrete run ID from start_run.
  • Ensure WandBTracker, MLflowTracker, and TensorBoardTracker use resolved modules/constructors, validate availability at init time, and always return non-empty run IDs.
  • Make LocalFileTracker artifacts metadata and artifact_type defaults consistent, and improve numeric metrics typing in the tracking decorator.
  • Handle missing or uninitialized external clients (OpenAI, Anthropic, routing model) and classification edge cases more robustly.
  • Fix HuggingFace dataset loading to use importlib-based optional imports, default split to train, and raise cleanly when unavailable.
  • Guarantee retry helpers and ModelWrapper.generate always raise a concrete exception when retries are exhausted.
  • Tighten semantics in tracing, semantic cache, rate limiting, routing, resource IO, config loading, structured extraction, and other utilities to avoid silent failures and type ambiguities.
  • Improve reproducibility and config utilities by handling TensorFlow imports, package metadata, experiment listings, and config validation types more safely.
  • Update tracing fingerprinting to validate event types and fail fast on unsupported inputs.
  • Adjust tests and coverage probes to assert latency metadata, use precise typing, and handle pydantic-availability paths correctly.
  • Align bias-testing tutorial output directory with the analysis step paths.

Enhancements:

  • Refine optional dependency detection for wandb, mlflow, TensorBoard, and HuggingFace datasets using helper loaders and module-level availability flags.
  • Strengthen debug tracing APIs to return optional events and use typed payloads, and make execution analyzers more robust to failures.
  • Improve adapter, routing, chains, ensemble, synthesis, evalbom, and HITL utility behavior by catching and logging errors more explicitly while keeping core flows running.
  • Clarify OpenAI model typing by annotating kwargs, tightening error context for chat, and specifying the ModelInfo return type of info().
  • Improve runtime result/status coercion, model info normalization, and pipeline construction with clearer types and safer fallbacks.
  • Standardize exception capture variables and add explicit typing hints across tracing, async utilities, CLI, schemas, tests, and multiple runtime modules.

Summary by CodeRabbit

  • New Features

    • Added safer arithmetic and code execution pathways for agents, replacing direct eval/exec with restricted implementations featuring explicit operator handling.
    • Implemented lazy loading for optional datasets package with graceful fallback handling.
  • Bug Fixes

    • Enhanced null-safety checks across provider clients, models, and intent routing to prevent dereferencing None values.
    • Improved error handling in retry and circuit-breaking logic with explicit exception validation.
  • API Changes

    • PromptDebugger logging methods now return Optional[DebugTraceEvent], allowing graceful handling of inactive traces.
    • Updated function return types and type hints for improved type safety.


Copilot AI review requested due to automatic review settings March 4, 2026 11:13

sourcery-ai Bot commented Mar 4, 2026

Reviewer's Guide

Hardening pass focusing on safer experiment tracking backends, stricter retry/typing behavior, safer arithmetic/code execution for contrib agents, more robust optional dependency handling, and minor doc/test/runtime cleanups.

Updated class diagram for experiment trackers and MultiTracker

classDiagram
    class ExperimentTracker {
        <<abstract>>
        - config TrackingConfig
        - _run_active bool
        - _run_id str
        - _step int
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, Any~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str, artifact_type str, artifact_name str) None
        + log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
        + watch_model(model Any, log_freq int) None
    }

    class WandBTracker {
        - entity str | None
        - wandb_kwargs dict~str, Any~
        - _wandb Any
        - _run Any
        - _run_active bool
        - _run_id str
        - _step int
        + __init__(config TrackingConfig, entity str, wandb_kwargs dict~str, Any~) None
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, float~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str, artifact_type str, artifact_name str) None
        + log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
        + watch_model(model Any, log_freq int) None
    }

    class MLflowTracker {
        - tracking_uri str | None
        - _mlflow Any
        - _experiment_id str | None
        - _run_active bool
        - _run_id str
        - _step int
        + __init__(config TrackingConfig, tracking_uri str, experiment_name str) None
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, float~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str) None
        + log_model(model Any, model_name str, model_flavor str) None
        + register_model(model_uri str, model_name str) None
    }

    class TensorBoardTracker {
        - log_dir str
        - _summary_writer_cls Any
        - _writer Any | None
        - _run_active bool
        - _run_id str
        - _step int
        + __init__(config TrackingConfig, log_dir str) None
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, float~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str, artifact_type str, artifact_name str) None
        + log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
        + watch_model(model Any, log_freq int) None
    }

    class LocalFileTracker {
        - output_dir Path
        - _run_dir Path | None
        - _metrics list~dict~str, Any~~
        - _params dict~str, Any~
        - _artifacts list~dict~str, str~~
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, float~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str, artifact_type str, artifact_name str) None
        + snapshot() dict~str, Any~
    }

    class MultiTracker {
        - trackers list~ExperimentTracker~
        - _run_active bool
        - _run_id str
        + __init__(trackers list~ExperimentTracker~, config TrackingConfig) None
        + start_run(run_name str, run_id str, nested bool) str
        + end_run(status str) None
        + log_metrics(metrics dict~str, float~, step int) None
        + log_params(params dict~str, Any~) None
        + log_artifact(artifact_path str, artifact_type str, artifact_name str) None
    }

    ExperimentTracker <|-- WandBTracker
    ExperimentTracker <|-- MLflowTracker
    ExperimentTracker <|-- TensorBoardTracker
    ExperimentTracker <|-- LocalFileTracker
    ExperimentTracker <|-- MultiTracker

    MultiTracker "1" o-- "many" ExperimentTracker : aggregates
    WandBTracker ..> "1" Any : uses_wandb_module
    MLflowTracker ..> "1" Any : uses_mlflow_module
    TensorBoardTracker ..> "1" Any : uses_summary_writer_cls

File-Level Changes

Change Details Files
Refactor experiment trackers to use safely-loaded optional dependencies and stricter run ID handling, plus MultiTracker invariants.
  • Introduce _load_optional_module helper and centralize optional imports for wandb, mlflow, and tensorboard writers
  • Inject concrete tracker client instances into tracker classes instead of using global modules directly
  • Ensure start_run for WandBTracker and MLflowTracker validates and stringifies returned run IDs, raising RuntimeError if missing
  • Make TensorBoardTracker fail fast when writer is unavailable and return the concrete run_name as run_id
  • Make LocalFileTracker artifacts list store typed metadata dicts and default artifact_type when logging
  • Have MultiTracker.__init__ require at least one tracker and make start_run always return the first child run_id
insideLLMs/experiment_tracking.py
Harden contrib agent calculator and python execution to use safe arithmetic-only evaluators.
  • Replace eval() in tool decorator examples and BaseAgent usage with _safe_eval_arithmetic
  • Add helpers to apply binary/unary ops and evaluate arithmetic expressions in a restricted AST subset
  • Introduce _safe_exec_python_subset to execute only assignments + arithmetic expressions and wire it into execute_python
  • Update probe docs/examples to reference safe arithmetic evaluator instead of eval
insideLLMs/contrib/agents.py
insideLLMs/probes/agent_probe.py
Improve optional HF dataset import behavior and tighten dataset loader contract.
  • Switch datasets import to importlib-based dynamic loading and expose load_dataset symbol when present
  • Define HF_DATASETS_AVAILABLE based on resolved load_dataset rather than import success alone
  • Change load_hf_dataset to require HF datasets at runtime and default to split='train', returning a concrete list
  • Export public names via __all__ for better static analysis
insideLLMs/dataset_utils.py
Tighten retry/model wrapper behavior and error typing around last_error and retries.
  • Make ModelWrapper.generate track last_error as Exception | None and fail-fast with explicit RuntimeError if somehow unset
  • Update execute_with_retry (sync and async) to ensure last_exception is always set before raising RetryExhaustedError, inserting a synthetic RuntimeError when needed
  • Leave other retry wrappers unchanged except for typing/except-as variable naming consistency
  • Type-safety and API hardening across tracing, debugging, adapters, routing, runtime config, and various utilities.
    • Add or refine type annotations for payload dicts, DebugTracer methods (now Optional return types), and other helper methods
    • Make trace_fingerprint validate event element types and reject unexpected event types with TypeError
    • Harden adapters to check client initialization before generate_chat and raise RuntimeError if missing
    • Guard routing classifier against None intents and missing model, adjusting fallback logic
    • Use getattr and casts in config tests to avoid direct attribute assumptions under different pydantic versions
    • Adjust various helper functions to catch exceptions as named variables and preserve control flow semantics without swallowing unexpected types
    • Define __all__ in several modules (retry, rate_limiting, runtime, safety, config, registry, types, resources, semantic_cache, cost_tracking, etc.) for static tooling
    insideLLMs/trace/tracing.py
    insideLLMs/contrib/debugging.py
    insideLLMs/contrib/adapters.py
    insideLLMs/contrib/routing.py
    insideLLMs/runtime/_config_loader.py
    insideLLMs/runtime/_high_level.py
    insideLLMs/runtime/_result_utils.py
    insideLLMs/runtime/reproducibility.py
    insideLLMs/config.py
    insideLLMs/resources.py
    insideLLMs/semantic_cache.py
    insideLLMs/integrations/langchain.py
    insideLLMs/async_utils.py
    insideLLMs/cli/_record_utils.py
    insideLLMs/cli/commands/doctor.py
    insideLLMs/contrib/hitl.py
    insideLLMs/rate_limiting.py
    insideLLMs/contrib/chains.py
    insideLLMs/contrib/ensemble.py
    insideLLMs/contrib/evalbom.py
    insideLLMs/shadow.py
    insideLLMs/structured_extraction.py
    insideLLMs/schemas/registry.py
    insideLLMs/runtime/__init__.py
    insideLLMs/registry.py
    insideLLMs/safety.py
    insideLLMs/types.py
    insideLLMs/cost_tracking.py
    OpenAI model typing cleanup and stronger error contexts.
    • Add **kwargs: Any to generate/chat/stream signatures and declare info() return type as ModelInfo
    • In chat error handling, access first message content via typed dict access rather than .get
    • Keep behavior otherwise identical while clarifying SDK boundary typing
    insideLLMs/models/openai.py
    Test updates and docs alignment to match new behavior and typing.
    • Tighten tests around latency_ms to require non-None values and non-negative metrics
    • Use typing.cast and getattr in config tests for better mypy/pyright compatibility
    • Extend example config tests to assert dataset.data is populated
    • Adjust except clauses to bind as _ where needed to avoid unused-variable noise
    • Align Bias-Testing tutorial YAML with actual output_dir bias_results path
    tests/test_models_config_coverage.py
    tests/test_probes_models_coverage.py
    wiki/tutorials/Bias-Testing.md



    github-actions Bot commented Mar 4, 2026

    insideLLMs Behavioural Diff

    • Exit code: 0
    • Baseline run dir: /home/runner/work/_temp/insidellms-run-base-22666913806-3487
    • Candidate run dir: /home/runner/work/_temp/insidellms-run-head-22666913806-15824
    Metric                     Count
    Common keys                12
    Regressions                0
    Improvements               0
    Other changes              0
    Only baseline              0
    Only candidate             0
    Trace drifts               0
    Trace violation increases  0
    Top changed records

    Regressions (0)

    • None

    Improvements (0)

    • None

    Other changes (0)

    • None

    Only in baseline (0)

    • None

    Only in candidate (0)

    • None

    Trace drifts (0)

    • None

    Trace violation increases (0)

    • None

    Generated by insideLLMs/action.


    coderabbitai Bot commented Mar 4, 2026

    📝 Walkthrough


    This PR standardizes exception handling across 30+ files by binding caught exceptions to placeholder variables, adds explicit module-level __all__ declarations to control public API surfaces, and introduces safety improvements including null-safety checks in adapter initialization, safer arithmetic/code execution pathways in agent tools, and optional return types in debug tracing methods.

    Changes

    Cohort / File(s) Summary
    Exception handling standardization
    insideLLMs/async_utils.py, insideLLMs/cli/_record_utils.py, insideLLMs/cli/commands/doctor.py, insideLLMs/contrib/chains.py, insideLLMs/contrib/ensemble.py, insideLLMs/contrib/evalbom.py, insideLLMs/contrib/hitl.py, insideLLMs/contrib/synthesis.py, insideLLMs/integrations/langchain.py, insideLLMs/rate_limiting.py, insideLLMs/resources.py, insideLLMs/retry.py, insideLLMs/runtime/_base.py, insideLLMs/runtime/_high_level.py, insideLLMs/runtime/_result_utils.py, insideLLMs/schemas/registry.py, insideLLMs/semantic_cache.py, insideLLMs/shadow.py, insideLLMs/structured_extraction.py
    Standardized except blocks from except Exception: to except Exception as _: to explicitly bind caught exceptions as unused variables, preserving control flow with clearer intent.
    Public API surface exposure
    insideLLMs/config.py, insideLLMs/cost_tracking.py, insideLLMs/rate_limiting.py, insideLLMs/registry.py, insideLLMs/runtime/__init__.py, insideLLMs/safety.py, insideLLMs/types.py
    Added dynamic __all__ declarations that export all non-private globals, formalizing public API surfaces for star-import and introspection without changing runtime behavior.
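The dynamic pattern is `__all__ = [name for name in globals() if not name.startswith("_")]`. Its effect, including the over-export caveat Sourcery raises later, can be demonstrated against a hypothetical module namespace:

```python
def compute_public_names(namespace: dict) -> list:
    """What [name for name in globals() if not name.startswith('_')] yields."""
    return [name for name in namespace if not name.startswith("_")]


# Hypothetical module namespace: the incidental import ("math") is exported
# alongside the intended public function, which is the caveat with this pattern.
module_namespace = {
    "__name__": "example_module",  # dunder: excluded
    "_internal_helper": object(),  # private helper: excluded
    "public_api": object(),        # intended export: included
    "math": object(),              # incidental import: also included
}

exported = compute_public_names(module_namespace)
```

An explicit per-module list avoids the incidental exports at the cost of manual upkeep.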
    Safety improvements: null checks and guards
    insideLLMs/contrib/adapters.py, insideLLMs/contrib/routing.py
    Added explicit null-safety guards: adapters now validate client initialization before use; routing now requires non-null model/intent before pattern matching and fallback paths.
    Safe code evaluation pathways
    insideLLMs/contrib/agents.py, insideLLMs/probes/agent_probe.py
    Introduced _apply_binary_operator, _apply_unary_operator, _eval_exec_expression, and _safe_exec_python_subset helpers to replace direct eval/exec; updated calculator and python tools to use safer execution paths; examples updated to reference new safe evaluation approach.
    Type annotation refinement
    insideLLMs/contrib/debugging.py, insideLLMs/runtime/reproducibility.py, insideLLMs/models/openai.py, insideLLMs/models/base.py
    Updated return types to Optional[DebugTraceEvent] in debugging methods; changed list_all return to include Union[str, float] for values; added explicit typing for kwargs in OpenAI model methods and last_error in retry paths.
    Dynamic/lazy import refactoring
    insideLLMs/dataset_utils.py, insideLLMs/experiment_tracking.py
    Implemented lazy loading for optional dependencies (datasets, wandb, mlflow, tensorboard) via runtime module lookups instead of direct imports; centralized error handling and availability detection; changed load_hf_dataset default split from "test" to "train" and updated return type.
    Trace fingerprinting robustness
    insideLLMs/trace/tracing.py
    Reworked trace_fingerprint() normalization to explicitly loop over events instead of heuristic-based checking; now accepts TraceEvent or dict and raises TypeError for unsupported types; added explicit type annotation for payload dict.
    Runtime configuration handling
    insideLLMs/runtime/_config_loader.py
    Introduced local alias pipeline_base_model to isolate base_model usage in pipeline construction; standardized exception binding in snapshot building.
    Test coverage updates
    tests/test_models_config_coverage.py, tests/test_probes_models_coverage.py
    Updated import style for config module; added typing.cast() for type coercion in tests; enhanced type annotations on test probe methods; added latency_ms assertions.
    Documentation example
    wiki/tutorials/Bias-Testing.md
    Added output_dir configuration field to bias_harness.yaml example under dataset section.

    Estimated code review effort

    🎯 4 (Complex) | ⏱️ ~45 minutes


    Poem

    🐰 Exception handlers bound with care, unsafe evals vanish in air,
    __all__ declares what's truly blessed, safety guards pass the test.
    Type hints now crystal, imports so lazy, making the codebase less hazy! ✨

    Pre-merge checks: 3 passed

    Check name          Status     Explanation
    Description Check   ✅ Passed  Check skipped: CodeRabbit's high-level summary is enabled.
    Title check         ✅ Passed  The title accurately summarizes the main changes: a hardening pass affecting tracking, models, runtime, tests, and docs, reflecting the core objectives of the PR.
    Docstring Coverage  ✅ Passed  Docstring coverage is 85.03%, which is sufficient. The required threshold is 80.00%.



    @sourcery-ai sourcery-ai Bot left a comment


    Hey - I've found 3 issues and left some high-level feedback:

    • The change to load_hf_dataset's default split from 'test' to 'train' is a behavioral change that may surprise existing callers; consider whether this should remain 'test' for backward compatibility or be called out clearly as an intentional breaking change.
    • Using __all__ = [name for name in globals() if not name.startswith('_')] in multiple modules will export helper symbols, imported modules, and other internals; consider defining __all__ explicitly per module to avoid unintentionally expanding the public surface.
    ## Individual Comments
    
    ### Comment 1
    <location path="insideLLMs/contrib/agents.py" line_range="3034-3036" />
    <code_context>
         return _eval(parsed)
    
    
    +def _apply_binary_operator(
    +    op: ast.operator, left: Union[int, float], right: Union[int, float]
    +) -> Union[int, float]:
    +    if isinstance(op, ast.Add):
    +        return left + right
    +    if isinstance(op, ast.Sub):
    +        return left - right
    +    if isinstance(op, ast.Mult):
    +        return left * right
    +    if isinstance(op, ast.Div):
    +        return left / right
    +    if isinstance(op, ast.FloorDiv):
    +        return left // right
    +    if isinstance(op, ast.Mod):
    +        return left % right
    +    if isinstance(op, ast.Pow):
    +        return left**right
    +    raise ValueError("Unsupported arithmetic operator")
    +
    +
    </code_context>
    <issue_to_address>
    **suggestion:** The error message in `_apply_binary_operator` is quite generic and may obscure which operator caused the failure.
    
    Including the operator in the error (e.g. via `type(op).__name__` or rendering it directly) would make it clearer which expression failed, improving debuggability when an unsupported operator reaches this code.
    
    ```suggestion
        if isinstance(op, ast.Pow):
            return left**right
        raise ValueError(f"Unsupported arithmetic operator: {type(op).__name__}")
    ```
    </issue_to_address>
    
    ### Comment 2
    <location path="insideLLMs/trace/tracing.py" line_range="1868-1874" />
    <code_context>
         insideLLMs.trace.trace_config.TracePayloadNormaliser : Normalise payloads before fingerprinting.
         """
         # Normalize to list of dicts
    -    if events and isinstance(events[0], TraceEvent):
    -        event_dicts = [e.to_dict() for e in events]
    -    else:
    -        event_dicts = list(events)
    +    event_dicts: list[dict[str, Any]] = []
    +    for event in events:
    +        if isinstance(event, TraceEvent):
    +            event_dicts.append(event.to_dict())
    +            continue
    +        if isinstance(event, dict):
    +            event_dicts.append(event)
    +            continue
    +        raise TypeError(f"Unsupported trace event type: {type(event).__name__}")
    
         # Sort events by sequence to ensure deterministic ordering
    </code_context>
    <issue_to_address>
    **question (bug_risk):** Making `trace_fingerprint` raise on any non-dict/non-TraceEvent may be stricter than callers expect.
    
    The previous implementation accepted any iterable of dict-like objects and would silently pass through unexpected types. The new loop raises `TypeError` for anything that’s not a `TraceEvent` or `dict`, which can break callers that pass custom trace-like objects. If the stricter typing is intentional, please either (a) document this behavioral change clearly, or (b) handle unknown but dict-like types more leniently (e.g., coercion/ignoring) to preserve backward compatibility.
    </issue_to_address>
    
    ### Comment 3
    <location path="tests/test_models_config_coverage.py" line_range="992" />
    <code_context>
             assert config.dataset is not None
             assert config.runner is not None
             assert "example" in config.tags
    +        assert config.dataset.data is not None
             assert len(config.dataset.data) == 2
    
    </code_context>
    <issue_to_address>
    **suggestion (testing):** Missing tests for new HuggingFace dataset loading behaviour and ImportError conditions
    
    This PR changes `load_hf_dataset` to default `split` to `"train"` and to treat the HF datasets import as optional (raising `ImportError` if `load_dataset` is missing), but I don’t see tests for:
    
    1. Verifying that `"train"` is used as the default split (e.g., by mocking `datasets.load_dataset` and asserting the `split` argument).
    2. Verifying that an `ImportError` is raised when the `datasets` package exists but `load_dataset` is missing/`None`.
    
    Please add tests (e.g., in `tests/test_dataset_utils.py` or an existing dataset test module) to cover these behaviours and guard against regressions in the import/availability logic.
    </issue_to_address>


    Comment on lines +3034 to +3036
    if isinstance(op, ast.Pow):
    return left**right
    raise ValueError("Unsupported arithmetic operator")
    Copy link
    Copy Markdown

    Choose a reason for hiding this comment

    The reason will be displayed to describe this comment to others. Learn more.

    suggestion: The error message in _apply_binary_operator is quite generic and may obscure which operator caused the failure.

    Including the operator in the error (e.g. via type(op).__name__ or rendering it directly) would make it clearer which expression failed, improving debuggability when an unsupported operator reaches this code.

    Suggested change
    if isinstance(op, ast.Pow):
    return left**right
    raise ValueError("Unsupported arithmetic operator")
    if isinstance(op, ast.Pow):
    return left**right
    raise ValueError(f"Unsupported arithmetic operator: {type(op).__name__}")

    Comment on lines -1868 to 1874
    if events and isinstance(events[0], TraceEvent):
    event_dicts = [e.to_dict() for e in events]
    else:
    event_dicts = list(events)
    event_dicts: list[dict[str, Any]] = []
    for event in events:
    if isinstance(event, TraceEvent):
    event_dicts.append(event.to_dict())
    continue
    if isinstance(event, dict):
    event_dicts.append(event)
    continue
    raise TypeError(f"Unsupported trace event type: {type(event).__name__}")

    # Sort events by sequence to ensure deterministic ordering
    sorted_events = sorted(event_dicts, key=lambda e: e.get("seq", 0))

    question (bug_risk): Making trace_fingerprint raise on any non-dict/non-TraceEvent may be stricter than callers expect.

    The previous implementation accepted any iterable of dict-like objects and would silently pass through unexpected types. The new loop raises TypeError for anything that’s not a TraceEvent or dict, which can break callers that pass custom trace-like objects. If the stricter typing is intentional, please either (a) document this behavioral change clearly, or (b) handle unknown but dict-like types more leniently (e.g., coercion/ignoring) to preserve backward compatibility.
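    A lenient middle ground is to duck-type on a `to_dict` method and mapping support instead of exact `isinstance` checks, so custom trace-like objects still pass while truly unusable values fail loudly. A minimal sketch of that option (the `FakeEvent` stand-in is hypothetical, not the real `TraceEvent`):

    ```python
    from collections.abc import Mapping
    from typing import Any

    def coerce_trace_events(events) -> list[dict[str, Any]]:
        """Accept anything with a to_dict(), any mapping, and reject the rest."""
        event_dicts: list[dict[str, Any]] = []
        for event in events:
            to_dict = getattr(event, "to_dict", None)
            if callable(to_dict):
                event_dicts.append(to_dict())
            elif isinstance(event, Mapping):
                event_dicts.append(dict(event))
            else:
                raise TypeError(f"Unsupported trace event type: {type(event).__name__}")
        return event_dicts

    class FakeEvent:
        """Custom trace-like object that is neither a TraceEvent nor a dict."""
        def __init__(self, seq: int):
            self.seq = seq

        def to_dict(self) -> dict[str, Any]:
            return {"seq": self.seq}
    ```

    This keeps the stricter rejection of unusable inputs while preserving backward compatibility for callers that pass dict-like or event-like objects.
    
    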

    assert config.dataset is not None
    assert config.runner is not None
    assert "example" in config.tags
    assert config.dataset.data is not None

    suggestion (testing): Missing tests for new HuggingFace dataset loading behaviour and ImportError conditions

    This PR changes load_hf_dataset to default split to "train" and to treat the HF datasets import as optional (raising ImportError if load_dataset is missing), but I don’t see tests for:

    1. Verifying that "train" is used as the default split (e.g., by mocking datasets.load_dataset and asserting the split argument).
    2. Verifying that an ImportError is raised when the datasets package exists but load_dataset is missing/None.

    Please add tests (e.g., in tests/test_dataset_utils.py or an existing dataset test module) to cover these behaviours and guard against regressions in the import/availability logic.
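    The import/availability logic under test follows an importlib-based optional-loading pattern: resolve the symbol at import time, derive availability from the resolved symbol, and raise `ImportError` at call time when it is missing. A self-contained sketch of that pattern, with hypothetical names (`resolve_optional`, `load_hf_rows`) rather than the real insideLLMs code; the `split="train"` default mirrors the PR's contested new signature:

    ```python
    import importlib
    from typing import Any, Callable, Optional

    def resolve_optional(module_name: str, symbol: str) -> Optional[Callable[..., Any]]:
        """Return `symbol` from `module_name`, or None if either is unavailable."""
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            return None
        return getattr(module, symbol, None)

    # Availability is derived from the resolved symbol, not the bare import.
    load_dataset = resolve_optional("datasets", "load_dataset")
    HF_AVAILABLE = load_dataset is not None

    def load_hf_rows(name: str, split: str = "train", loader=load_dataset) -> list[dict[str, Any]]:
        if loader is None:
            raise ImportError("The 'datasets' package (with load_dataset) is required")
        return [dict(row) for row in loader(name, split=split)]
    ```

    Tests can then pass `loader=None` (or monkeypatch the module attribute) to exercise the `ImportError` path, and pass a stub loader to assert which `split` value is forwarded by default.
    
    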

    @dosubot

    dosubot Bot commented Mar 4, 2026

    Related Documentation

    1 document(s) may need updating based on files changed in this PR:

    insideLLMs

    Agent Probe and Trace-Aware Testing
    @@ -13,7 +13,7 @@
     5. Validates the trace against contracts defined in the `TraceConfig` using the `validate_with_config()` helper.
     6. Packages the prompt, final response, tool calls, trace events, fingerprint, violations, and metadata into an `AgentProbeResult`.
     
    -Each tool available to the agent is described by a `ToolDefinition` object, specifying its name, description, parameter schema, and optional handler. The trace configuration and validation utilities are available as public APIs for advanced customization and integration.
    +Each tool available to the agent is described by a `ToolDefinition` object, specifying its name, description, parameter schema, and optional handler. For tools that perform arithmetic evaluation, the `_safe_eval_arithmetic()` function provides a secure alternative to `eval()`, restricting execution to arithmetic operations only. The trace configuration and validation utilities are available as public APIs for advanced customization and integration.
     
     ## Trace-Aware Results and Tracing Integration
     The probe system is tightly integrated with the tracing subsystem. All agent actions, tool calls, and relevant events are recorded by the `TraceRecorder`. The trace is then validated against contracts (e.g., required tool order, forbidden actions) using the `validate_with_config()` function, which respects the toggles and schemas defined in the `TraceConfig`. Payloads can be normalized or redacted using the `TracePayloadNormaliser`, which supports both built-in and custom normalization strategies for deterministic trace fingerprints and privacy compliance.



    Copilot AI left a comment


    Pull request overview

    Hardening pass across insideLLMs’ runtime, model wrappers, integrations, and docs to improve typing clarity, optional-dependency robustness, and deterministic behavior around tracking / retries.

    Changes:

    • Tighten experiment tracking behavior (non-empty MultiTracker; deterministic run IDs) and optional dependency loading for trackers/datasets.
    • Improve model/runtime typing and failure-path handling (retry exhaustion guards, trace fingerprint input validation, minor API typing cleanups).
    • Update tests and docs to align with runtime outputs/typing expectations.

    Reviewed changes

    Copilot reviewed 40 out of 40 changed files in this pull request and generated 4 comments.

    Show a summary per file
    File Description
    wiki/tutorials/Bias-Testing.md Align tutorial output directory with analysis paths.
    tests/test_probes_models_coverage.py Tighten typing on probe helper methods used for coverage.
    tests/test_models_config_coverage.py Strengthen typing and assertions for model metadata paths.
    insideLLMs/types.py Add module-wide __all__ export list for static analysis/export consistency.
    insideLLMs/trace/tracing.py Improve typing and input validation for trace payloads/fingerprinting.
    insideLLMs/structured_extraction.py Minor exception-handling style/typing adjustments.
    insideLLMs/shadow.py Minor exception-handling style adjustments for request-body decoding.
    insideLLMs/semantic_cache.py Minor exception-handling style adjustments around Redis operations.
    insideLLMs/schemas/registry.py Minor exception-handling style adjustment in semver parsing.
    insideLLMs/safety.py Add module-wide __all__ export list.
    insideLLMs/runtime/reproducibility.py Typing adjustments (e.g., list_all return value typing) and optional import annotation.
    insideLLMs/runtime/_result_utils.py Minor exception-handling style adjustments around model info coercion.
    insideLLMs/runtime/_high_level.py Minor exception-handling style adjustments in status/category coercion.
    insideLLMs/runtime/_config_loader.py Minor typing refactor in pipeline construction and path snapshotting.
    insideLLMs/runtime/_base.py Minor exception-handling style adjustment for callback attribute tagging.
    insideLLMs/runtime/init.py Add module-wide __all__ export list for runtime package.
    insideLLMs/retry.py Harden retry exhaustion behavior and add module-wide __all__.
    insideLLMs/resources.py Minor exception-handling style adjustments in file/fsync helpers.
    insideLLMs/registry.py Add module-wide __all__ export list.
    insideLLMs/rate_limiting.py Minor exception-handling changes and add module-wide __all__.
    insideLLMs/probes/agent_probe.py Update docs/examples to avoid unsafe eval in tool handler samples.
    insideLLMs/models/openai.py Tighten typing (**kwargs: Any, info() -> ModelInfo) and error-path tweaks.
    insideLLMs/models/base.py Harden retry exhaustion raising logic and typing for last_error.
    insideLLMs/integrations/langchain.py Minor exception-handling style adjustments in optional imports/stream path.
    insideLLMs/experiment_tracking.py Optional dependency loading via importlib; run-id hardening; MultiTracker empty-list guard; minor typing cleanups; add __all__.
    insideLLMs/dataset_utils.py Optional HF datasets import via importlib; adjust availability logic; add __all__.
    insideLLMs/cost_tracking.py Add module-wide __all__ export list.
    insideLLMs/contrib/synthesis.py Minor exception-handling style adjustments in generation helpers.
    insideLLMs/contrib/routing.py Harden classification fallbacks and minor exception-handling style changes.
    insideLLMs/contrib/hitl.py Minor exception-handling style adjustment for callback emission.
    insideLLMs/contrib/evalbom.py Minor exception-handling style adjustments around manifest parsing.
    insideLLMs/contrib/ensemble.py Minor exception-handling style adjustments when skipping failed models.
    insideLLMs/contrib/debugging.py Adjust return typing to Optional for trace-event helpers and minor typing fixes.
    insideLLMs/contrib/chains.py Minor exception-handling style adjustment and typing for route steps dict.
    insideLLMs/contrib/agents.py Replace eval/exec usage with restricted arithmetic parsing/execution utilities.
    insideLLMs/contrib/adapters.py Fail-fast if clients are uninitialized; minor exception-handling style adjustments.
    insideLLMs/config.py Minor typing tweak in Union order; add module-wide __all__.
    insideLLMs/cli/commands/doctor.py Minor exception-handling style adjustment for entrypoint plugin discovery.
    insideLLMs/cli/_record_utils.py Minor exception-handling style adjustment in enum coercion.
    insideLLMs/async_utils.py Minor exception-handling style adjustment in task-cancellation path.
    Comments suppressed due to low confidence (1)

    insideLLMs/dataset_utils.py:243

    • The load_hf_dataset signature now defaults split to "train", but the docstring (and earlier examples in this same module) still says the default is "test". Either revert the default to "test" (to avoid a silent behavior change for callers who omit split) or update the docs/examples to match the new default.
    def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
        """Load a dataset from the HuggingFace Datasets Hub into a list of dictionaries.
    
        Downloads and loads a dataset from the HuggingFace Hub, converting it to the
        same list-of-dictionaries format used by the other loaders in this module.
        This provides access to thousands of publicly available datasets for NLP,
        computer vision, and other machine learning tasks.
    
        Args:
            dataset_name: The name of the dataset on HuggingFace Hub. Can be in
                the format "dataset_name" for official datasets or "user/dataset"
                for community datasets.
            split: The dataset split to load. Common values are "train", "test",
                "validation". Defaults to "test".
            **kwargs: Additional keyword arguments passed to ``datasets.load_dataset``.
    


    Comment on lines 452 to 455
     except Exception as e:
         # Get first message content for error context
    -    first_msg = messages[0].get("content", "") if messages else ""
    +    first_msg = messages[0]["content"] if messages else ""
         raise ModelGenerationError(

    Copilot AI Mar 4, 2026


    In the generic exception handler, messages[0]["content"] can raise KeyError if a caller passes a malformed ChatMessage (missing content), which would mask the original OpenAI error and bypass ModelGenerationError. Consider using a safe access pattern (e.g., get) or guarding against missing keys here so the error wrapper is reliable.
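    A guarded access pattern along these lines keeps the error wrapper reliable even for malformed messages. The helper name is hypothetical, shown only to illustrate the safe-access idea:

    ```python
    from typing import Any

    def first_message_content(messages: list[dict[str, Any]]) -> str:
        """Safely pull the first message's content for error context."""
        if not messages:
            return ""
        first = messages[0]
        # Tolerate dicts missing "content" and non-dict message objects alike.
        if isinstance(first, dict):
            content = first.get("content", "")
        else:
            content = getattr(first, "content", "")
        return "" if content is None else str(content)
    ```

    With this, a missing `content` key degrades to an empty context string instead of a `KeyError` that masks the original OpenAI error.
    
    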

    Copilot uses AI. Check for mistakes.
     if self.circuit_breaker:
         self.circuit_breaker.record_failure()
    -raise
    +raise e

    Copilot AI Mar 4, 2026


    raise e in an except block inserts an extra traceback entry at the re-raise site (and in Python 2 discarded the original traceback entirely), which makes debugging harder. Use a bare raise here to re-raise with the traceback untouched.

    Suggested change
    -raise e
    +raise

    Comment on lines +229 to 232
    def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
        """Load a dataset from the HuggingFace Datasets Hub into a list of dictionaries.

        Downloads and loads a dataset from the HuggingFace Hub, converting it to the

    Copilot AI Mar 4, 2026


    The docstring section below still states the function may return None when the dataset is empty, but the current signature/implementation always returns a list (or raises). Please update the docstring (and any callers that check for None) to match the actual contract.

    Comment on lines +3034 to +3036
    if isinstance(op, ast.Pow):
        return left**right
    raise ValueError("Unsupported arithmetic operator")

    Copilot AI Mar 4, 2026


    Even with eval/exec removed, allowing exponentiation (**) can still be used for computational/allocational DoS (e.g., 2**100000000) when allow_exec=True. Consider adding guards (max exponent / max AST depth / max operations) or disabling Pow in the restricted subset to keep the "sandbox" safer under adversarial input.
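    One shape such guards can take is a hard cap on exponent magnitude plus a node-count ceiling before evaluation. This is a minimal self-contained sketch under assumed limits (`MAX_EXPONENT`, `MAX_NODES`, and the `eval_arith` name are all illustrative, not the insideLLMs implementation):

    ```python
    import ast

    MAX_EXPONENT = 128  # guard values are illustrative
    MAX_NODES = 100

    def eval_arith(expr: str):
        """Evaluate a restricted arithmetic expression with DoS guards."""
        tree = ast.parse(expr, mode="eval")
        if sum(1 for _ in ast.walk(tree)) > MAX_NODES:
            raise ValueError("expression too large")

        def _eval(node):
            if isinstance(node, ast.Expression):
                return _eval(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                left, right = _eval(node.left), _eval(node.right)
                if isinstance(node.op, ast.Add):
                    return left + right
                if isinstance(node.op, ast.Mult):
                    return left * right
                if isinstance(node.op, ast.Pow):
                    # Reject huge exponents before computing them.
                    if abs(right) > MAX_EXPONENT:
                        raise ValueError("exponent exceeds sandbox limit")
                    return left ** right
                raise ValueError(f"Unsupported arithmetic operator: {type(node.op).__name__}")
            raise ValueError("unsupported expression")

        return _eval(tree)
    ```

    An adversarial `2**100000000` is then rejected before any large computation starts; a depth cap on nested `BinOp` nodes could be layered on in the same way.
    
    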


    @devin-ai-integration devin-ai-integration Bot left a comment


    Devin Review found 2 potential issues.

    View 8 additional findings in Devin Review.


    -def load_hf_dataset(
    -    dataset_name: str, split: str = "test", **kwargs
    -) -> Optional[list[dict[str, Any]]]:
    +def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:


    🔴 Default split for load_hf_dataset silently changed from 'test' to 'train'

    The default value of the split parameter in load_hf_dataset was changed from "test" to "train", but the docstring on line 242 still says "Defaults to 'test'", and API_REFERENCE.md:1601 also documents the old default. Any caller relying on the default (i.e., calling load_hf_dataset("some_dataset") without explicitly passing split) will now silently load the training split instead of the test split.

    Root cause and impact

    The function signature at insideLLMs/dataset_utils.py:229 changed:

    # Old:
    def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> ...:
    # New:
    def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> ...:

    This is a silent behavioral change. The _config_loader.py:591 caller is unaffected because it uses config.get("split", "test"), but any direct caller of load_hf_dataset that omits the split argument will now receive training data instead of test data. The docstring (line 242: "Defaults to 'test'") and API_REFERENCE.md are now inconsistent with the actual behavior.

    Impact: Users running evaluations without specifying split will unknowingly evaluate against training data instead of test data, potentially producing misleading benchmark results.

    Suggested change
    -def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
    +def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> list[dict[str, Any]]:

    Comment on lines +2899 to +2902
     except Exception as e:
         if self.circuit_breaker:
             self.circuit_breaker.record_failure()
    -    raise
    +    raise e


    🟡 raise e instead of raise alters traceback and is inconsistent with sync version

    In execute_async, the exception re-raise was changed from raise to raise e, which alters the traceback information and is inconsistent with the synchronous execute method.

    Detailed explanation

    At insideLLMs/rate_limiting.py:2899-2902, the code was changed:

    except Exception as e:
        if self.circuit_breaker:
            self.circuit_breaker.record_failure()
        raise e  # was: raise

    In Python 3, a bare raise re-raises the caught exception with its traceback untouched, while raise e re-raises the same exception object but inserts an additional traceback entry at the raise e line. The original frames inside func() are still attached, but the redundant entry adds noise and can mislead readers scanning the traceback.

    The synchronous counterpart at insideLLMs/rate_limiting.py:2827-2830 correctly uses bare raise:

    except Exception as _:
        if self.circuit_breaker:
            self.circuit_breaker.record_failure()
        raise

    Impact: When debugging failures in async code paths, the traceback gains a redundant entry at the raise e line, drawing attention away from the original failure location inside func() and making debugging harder.

    Suggested change
    -except Exception as e:
    -    if self.circuit_breaker:
    -        self.circuit_breaker.record_failure()
    -    raise e
    +except Exception as _:
    +    if self.circuit_breaker:
    +        self.circuit_breaker.record_failure()
    +    raise
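    The difference is directly observable. A small demonstration, assuming Python 3 semantics: both forms keep the frames leading to the original failure, but raise e adds an extra traceback entry at the re-raise line, which is the noise the review flags (function names here are illustrative):

    ```python
    import traceback

    def boom():
        raise ValueError("original failure")

    def reraise_bare():
        try:
            boom()
        except Exception:
            raise  # traceback left untouched

    def reraise_named():
        try:
            boom()
        except Exception as e:
            raise e  # adds a frame pointing at this line

    def captured(fn) -> str:
        """Run fn and return its formatted traceback."""
        try:
            fn()
        except Exception:
            return traceback.format_exc()
        return ""
    ```

    Comparing `captured(reraise_bare)` with `captured(reraise_named)` shows the origin inside `boom()` in both, but only the named variant's traceback includes the `raise e` source line as an extra entry.
    
    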


    @coderabbitai coderabbitai Bot left a comment


    Actionable comments posted: 7

    Caution

    Some comments are outside the diff and can’t be posted inline due to platform limitations.

    ⚠️ Outside diff range comments (4)
    insideLLMs/safety.py (1)

    735-735: ⚠️ Potential issue | 🟠 Major

    Add type parameters to Pattern annotations to fix mypy strict checking.

    Lines 735 and 1110 use bare Pattern, which violates the strict mypy configuration for insideLLMs.safety (configured with disallow_untyped_defs = true and type-arg errors enabled).

    Fix
    -    def __init__(self, patterns: Optional[dict[str, Pattern]] = None):
    +    def __init__(self, patterns: Optional[dict[str, Pattern[str]]] = None):
    
    -        self._custom_patterns: list[tuple[str, Pattern, RiskLevel]] = []
    +        self._custom_patterns: list[tuple[str, Pattern[str], RiskLevel]] = []

    Required per the coding guidelines for insideLLMs/**/*.py: Typecheck Python code using Mypy on the insideLLMs module.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/safety.py` at line 735, The bare Pattern type used in annotations
    must be parameterized; update the annotations to use Pattern[str] (e.g., change
    def __init__(self, patterns: Optional[dict[str, Pattern]] = None) to def
    __init__(self, patterns: Optional[dict[str, Pattern[str]]] = None)) and do the
    same for the other occurrence around line 1110 (replace Pattern with
    Pattern[str]); also ensure the Pattern symbol you import (from typing or re)
    supports the type parameter and adjust the import if necessary.
    
    insideLLMs/experiment_tracking.py (1)

    2504-2512: ⚠️ Potential issue | 🟠 Major

    Add rollback on partial startup failure in MultiTracker.start_run.

    If one tracker fails at Line 2506, previously started trackers stay active. This can leak runs and leave inconsistent state.

    💡 Suggested fix
         def start_run(
             self,
             run_name: Optional[str] = None,
             run_id: Optional[str] = None,
             nested: bool = False,
         ) -> str:
    @@
    -        run_ids = []
    -        for tracker in self.trackers:
    -            rid = tracker.start_run(run_name=run_name, run_id=run_id, nested=nested)
    -            run_ids.append(rid)
    +        run_ids: list[str] = []
    +        started: list[ExperimentTracker] = []
    +        try:
    +            for tracker in self.trackers:
    +                rid = tracker.start_run(run_name=run_name, run_id=run_id, nested=nested)
    +                started.append(tracker)
    +                run_ids.append(rid)
    +        except Exception:
    +            for started_tracker in reversed(started):
    +                try:
    +                    started_tracker.end_run(status="failed")
    +                except Exception:
    +                    pass
    +            raise
     
             first_run_id = run_ids[0]
             self._run_active = True
             self._run_id = first_run_id
             return first_run_id
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/experiment_tracking.py` around lines 2504 - 2512,
    MultiTracker.start_run currently loops through self.trackers calling
    tracker.start_run and if one raises, previously started trackers remain active;
    modify start_run so that if any tracker.start_run raises an exception you
    iterate over the trackers that already started (use the collected run_ids or
    tracker slice) and call the appropriate cleanup method (e.g., tracker.stop_run
    or tracker.end_run — whichever the tracker API provides) for each started
    tracker, catching and logging any cleanup exceptions but not suppressing the
    original error, then re-raise the original exception; finally, only set
    self._run_active and self._run_id after all trackers have successfully started
    and return the first_run_id as before.
    
    insideLLMs/dataset_utils.py (1)

    229-255: ⚠️ Potential issue | 🟠 Major

    load_hf_dataset default split change is a silent API break and conflicts with docs.

    Line 229 defaults to "train", but Line 242 still states default "test". This will change behavior for callers that omit split.

    💡 Suggested fix (if backward compatibility is intended)
    -def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
    +def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> list[dict[str, Any]]:

    Also align the Returns section: it currently says “Returns None only if the dataset is empty,” but the function returns list[dict[str, Any]] and already returns [] for empty datasets.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/dataset_utils.py` around lines 229 - 255, The docstring and
    function signature for load_hf_dataset disagree about the default split
    (signature sets split="train" while the Args text says default "test") and the
    Returns section incorrectly claims the function can return None; update them to
    be consistent: decide which default split you want to keep (if preserving
    backward compatibility, change the signature to split="test"; otherwise update
    the Args text to state default "train"), and change the Returns text to state
    that an empty dataset yields an empty list ([]) rather than None; ensure the
    function definition name load_hf_dataset and its docstring are updated together
    so callers and docs match.
    
    insideLLMs/contrib/debugging.py (1)

    1681-2034: ⚠️ Potential issue | 🟡 Minor

    Update method return docs to match new Optional[DebugTraceEvent] signatures.

    These methods now correctly return Optional[...], but their Returns sections still describe non-optional DebugTraceEvent, which creates a stale API contract.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/debugging.py` around lines 1681 - 2034, Update the Returns
    docstrings to reflect the Optional[DebugTraceEvent] return type for the changed
    APIs: step_start, step_end, log_prompt, log_response, log_variable, log_error,
    and log_warning. For each of these methods, change the "Returns" section text to
    state that the method returns an Optional[DebugTraceEvent] (i.e., a
    DebugTraceEvent or None) and preserve any existing descriptive details (e.g.,
    truncation behavior) so callers understand when a None might be returned; keep
    the examples and See Also as-is. Ensure the method signatures already showing
    Optional[...] remain consistent with the updated Returns description.
    
    🧹 Nitpick comments (8)
    insideLLMs/schemas/registry.py (1)

    470-473: Narrow the caught exception types in version parsing.

    At Line 472, except Exception is broader than needed here and can hide unrelated defects.
    Catch only the parse-related failures.

    Proposed patch
    -    except Exception as _:
    +    except (ValueError, IndexError):
             return (0, 0, 0)
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/schemas/registry.py` around lines 470 - 473, The version-parsing
    try/except in insideLLMs/schemas/registry.py currently catches Exception too
    broadly; change the broad except to only catch parse-related errors (e.g.,
    except (ValueError, IndexError, TypeError):) around the block that returns
    (int(parts[0]), int(parts[1]), int(parts[2])) so only invalid integer
    conversions, missing parts, or wrong types are handled and unrelated exceptions
    still surface.
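    The narrowed handler might look like the following sketch (a stand-in for the registry's parser, not its actual code): only parse-shaped failures fall back to the zero version, while anything unexpected still propagates.

    ```python
    def parse_semver(version: str) -> tuple[int, int, int]:
        """Parse 'MAJOR.MINOR.PATCH', falling back to (0, 0, 0) on malformed input."""
        try:
            parts = version.split(".")
            return (int(parts[0]), int(parts[1]), int(parts[2]))
        except (ValueError, IndexError):
            # Only parse-shaped failures are swallowed; anything else surfaces.
            return (0, 0, 0)
    ```
    
    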
    
    insideLLMs/semantic_cache.py (1)

    1036-1038: Consider adding debug-level logging in Redis exception fallbacks.

    These handlers intentionally degrade gracefully, but fully swallowing the exception makes production diagnosis difficult when Redis connectivity/serialization issues occur.

    Also applies to: 1076-1077, 1085-1086, 1095-1096, 1104-1106, 1118-1119

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/semantic_cache.py` around lines 1036 - 1038, In each Redis
    exception fallback (the except Exception as _ blocks that increment
    self._stats.misses and return None at/around the shown sites and the other
    ranges 1076-1077, 1085-1086, 1095-1096, 1104-1106, 1118-1119), add a debug-level
    log that records the caught exception and relevant context (operation name,
    key/id, and any serialization details available) instead of fully swallowing it;
    use the existing logger on the object (e.g., self._logger or module logger) and
    include the exception info/traceback (exception instance) so the message gives
    actionable diagnostics while preserving the current graceful fallback behavior.
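    The graceful-fallback-plus-debug-log shape could look like this minimal sketch (the function and `CacheStats` stand-in are hypothetical, not the semantic_cache internals):

    ```python
    import logging

    logger = logging.getLogger("insideLLMs.semantic_cache")

    class CacheStats:
        def __init__(self):
            self.misses = 0

    def redis_get_with_fallback(client, key, stats: CacheStats):
        """Degrade gracefully on Redis errors, but leave a debug trail."""
        try:
            return client.get(key)
        except Exception as exc:
            # exc_info=True captures the traceback for production diagnosis.
            logger.debug("redis GET failed for key %r: %s", key, exc, exc_info=True)
            stats.misses += 1
            return None

    class FlakyClient:
        """Stand-in for a Redis client whose connection is down."""
        def get(self, key):
            raise ConnectionError("redis down")
    ```

    Callers still see the same miss-and-None behavior, but operators can flip the logger to DEBUG to diagnose connectivity or serialization issues.
    
    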
    
    insideLLMs/contrib/chains.py (1)

    2563-2565: Preserve error-handler failure context before breaking.

    If the custom error_handler itself fails, the exception is dropped. Capture and record it so failures remain debuggable.

    Suggested adjustment
    -                    except Exception as _:
    +                    except Exception as handler_exc:
    +                        errors.append(f"{step.name}: error_handler failed: {handler_exc}")
                             state.status = ChainStatus.FAILED
                             break
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/chains.py` around lines 2563 - 2565, The except block
    currently swallows any Exception as "_" then sets state.status =
    ChainStatus.FAILED and breaks, losing the original failure context; change it to
    capture the exception (e.g., "except Exception as err:"), record the error into
    the chain state (for example set state.error or state.error_message to err or
    str(err)) and/or log it, then set state.status = ChainStatus.FAILED and break so
    the original exception context is preserved for debugging; reference the
    error_handler invocation and the state/ChainStatus.FAILED update when making
    this change.
    
    insideLLMs/retry.py (2)

    2327-2327: Prefer explicit module exports over globals-derived __all__.

    Line 2327 exports every non-underscore binding in module globals, including incidental imports. An explicit __all__ keeps the public API stable and avoids accidental surface growth.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/retry.py` at line 2327, The module currently sets __all__
    dynamically via a globals() comprehension which exports every non-underscore
    name (risking accidental exports); replace that line with an explicit __all__
    list containing only the intended public API symbols (e.g., the public
    functions, classes, and constants defined in this module such as RetryPolicy,
    retry, backoff, etc.—use the actual names present in insideLLMs/retry.py),
    remove the globals()-based expression, and update this explicit list whenever
    you add or remove public symbols so the module export surface remains stable.
    

    1118-1119: Fail fast on invariant break instead of synthesizing a retry cause.

    On Line 1118 and Line 1266, if last_exception is still None, that indicates an internal invariant break. Converting it into a synthetic RuntimeError and then raising RetryExhaustedError can blur the real failure mode. Raise directly at the invariant boundary.

    Suggested patch
    -    if last_exception is None:
    -        last_exception = RuntimeError("Retry exhausted without captured exception")
    +    if last_exception is None:
    +        raise RuntimeError("Retry exhausted without captured exception")
    @@
    -    if last_exception is None:
    -        last_exception = RuntimeError("Retry exhausted without captured exception")
    +    if last_exception is None:
    +        raise RuntimeError("Retry exhausted without captured exception")

    Also applies to: 1266-1267

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/retry.py` around lines 1118 - 1119, The code currently synthesizes
    a fake RuntimeError when last_exception is None before raising
    RetryExhaustedError; instead, treat last_exception == None as an invariant
    violation and raise immediately (e.g., raise AssertionError or a clear
    RuntimeError with an "internal invariant broken: last_exception is None"
    message) in the same locations where the synthetic exception is created (the
    blocks that check last_exception prior to raising RetryExhaustedError). Remove
    the synthetic-wrapping logic so the function fails fast on the invariant breach
    rather than embedding a fabricated cause.
    
    insideLLMs/cost_tracking.py (1)

    1654-1654: Prefer explicit __all__ to avoid accidental public API expansion.

    Building __all__ from globals() exports imported helpers and future globals unintentionally. An explicit allowlist keeps the module boundary stable.

    ♻️ Suggested change
    -__all__ = [name for name in globals() if not name.startswith("_")]  # pyright: ignore[reportUnsupportedDunderAll]
    +__all__ = [
    +    "PricingTier",
    +    "CostCategory",
    +    "AlertLevel",
    +    "TimeGranularity",
    +    "ModelPricing",
    +    "UsageRecord",
    +    "Budget",
    +    "BudgetAlert",
    +    "CostSummary",
    +    "CostForecast",
    +    "PricingRegistry",
    +    "TokenCostCalculator",
    +    "UsageTracker",
    +    "BudgetManager",
    +    "CostForecaster",
    +    "CostReporter",
    +    "calculate_cost",
    +    "estimate_request_cost",
    +    "compare_model_costs",
    +    "get_cheapest_model",
    +    "create_usage_tracker",
    +    "create_budget_manager",
    +    "quick_cost_estimate",
    +    "CostCalculator",
    +]
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/cost_tracking.py` at line 1654, The module currently builds
    __all__ dynamically using globals(), which accidentally exports imports and
    future names; replace that line with an explicit allowlist by listing only the
    intended public names (functions, classes, constants) exported by
    insideLLMs.cost_tracking (e.g., __all__ = ["PublicClassA", "public_function_b",
    ...]) and remove the globals()-based expression (and the pyright ignore) so the
    public API is stable and only the declared symbols are exported; update the list
    whenever you intentionally add or remove public symbols.
    
    insideLLMs/contrib/routing.py (1)

    2303-2305: Add a direct regression test for model=None in _classify_with_model.

    The new early return is valuable, but there isn’t an explicit assertion for this direct-call path in the referenced branch-coverage tests. Add one test to lock behavior to ("unknown", 0.0).

    As per coding guidelines, ensure test coverage is at least 95% for the insideLLMs module (coverage gating: --cov-fail-under=95).

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/routing.py` around lines 2303 - 2305, Add a direct
    regression test that calls the _classify_with_model function with self.model set
    to None and asserts it returns the tuple ("unknown", 0.0); specifically,
    instantiate or mock the object that defines _classify_with_model (the routing
    class in insideLLMs.contrib.routing), set its model attribute to None, call
    _classify_with_model and assert the return equals ("unknown", 0.0); place the
    test in the existing test suite for insideLLMs (new test name e.g.
    test_classify_with_model_none) so module coverage stays >=95% to satisfy the
    coverage gating.
    
    insideLLMs/rate_limiting.py (1)

    3429-3429: Prefer explicit __all__ over dynamic globals export for API stability.

    The current dynamic __all__ unintentionally exports 46 names, including imported modules (asyncio, datetime, random, threading, time, inspect), type imports (Any, Optional, Callable, Enum, TypeVar), and utilities (deque, dataclass, field, wraps). These are implementation details that shouldn't be part of the public API. An explicit __all__ listing only intended exports (e.g., classes, decorators, handlers) will clarify the stable API surface and prevent drift from unrelated changes.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/rate_limiting.py` at line 3429, The module currently builds
    __all__ dynamically (the comprehension at the end) which exports many internal
    imports; replace this with an explicit __all__ list containing only the intended
    public API names (e.g., the module's public classes, decorators and handler
    functions) by auditing the module for the real public symbols (RateLimiter-like
    classes, any decorators such as rate_limit, exceptions like RateLimitExceeded,
    public factory/getter functions) and listing them explicitly in __all__ instead
    of using the globals() comprehension so implementation details (asyncio,
    datetime, type imports, utilities) are not exported.
    
    🤖 Prompt for all review comments with AI agents
    Verify each finding against the current code and only fix it if needed.
    
    Inline comments:
    In `@insideLLMs/contrib/agents.py`:
    - Around line 3322-3324: The public contract of create_python_tool no longer
    matches its implementation: when called with allow_exec=True it only runs
    arithmetic assignments/expressions via _safe_exec_python_subset but the
    function/docstring still says it will “execute Python code”; either restore
    broader execution behavior or update the public contract/docs and name to
    reflect the restricted arithmetic-only execution. Modify the create_python_tool
    docstring and any user-facing description to state “execute arithmetic-only
    Python snippets” (or similar) and ensure symbols mentioned (create_python_tool,
    allow_exec, and _safe_exec_python_subset) are updated so callers and tests are
    aware of the narrowed behavior.
    - Around line 3019-3081: The evaluator (_safe_exec_python_subset,
    _eval_exec_expression and _apply_binary_operator) currently allows expressions
    that can cause computational DoS (huge literals or exponents); add explicit
    resource bounds: enforce a max number of statements in parsed.body in
    _safe_exec_python_subset, reject ast.Constant values whose absolute value or
    bit-length exceed a MAX_NUM_BITS, and reject ast.BinOp with ast.Pow where the
    exponent or base exceed safe thresholds (e.g., max exponent magnitude or
    bit-length) before evaluation; also enforce a maximum AST recursion/depth when
    walking nodes in _eval_exec_expression and return clear ValueError messages when
    limits are hit. Ensure checks run prior to performing any arithmetic and
    reference the existing functions (_safe_exec_python_subset,
    _eval_exec_expression, _apply_binary_operator) so the evaluator refuses
    oversized constants/exponents and overly deep/long programs.
    
    In `@insideLLMs/contrib/synthesis.py`:
    - Around line 2546-2547: The except clause binds an unused variable `e` even
    though the exception context is captured via logger.debug(..., exc_info=True);
    remove the unnecessary exception binding by changing the except clause that
    currently reads "except Exception as e:" to just "except Exception:" in the file
    and ensure the logger.debug call (line shown) remains unchanged; look for the
    except block containing logger.debug("Balance augmentation generation failed",
    exc_info=True) to apply this edit.
    
    In `@insideLLMs/dataset_utils.py`:
    - Around line 69-73: The try/except around importlib.import_module("datasets")
    currently catches ImportError too broadly; change it to catch
    ModuleNotFoundError so missing optional package results in _datasets_module =
    None, but real import errors inside the package still propagate. Specifically,
    update the try/except that assigns _datasets_module (the
    importlib.import_module("datasets") call) to use "except ModuleNotFoundError:
    _datasets_module = None" (or add a separate except Exception: raise to re-raise
    non-missing-package errors) so only the absent-package case is swallowed.
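A sketch of the narrower catch (the helper name is illustrative, not the real `dataset_utils.py` code): `ModuleNotFoundError` covers the absent-package case, while other `ImportError`s raised inside the package still propagate.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_optional(name: str) -> Optional[ModuleType]:
    """Return the module if installed, None if absent; other import errors propagate."""
    try:
        return importlib.import_module(name)
    except ModuleNotFoundError:
        return None
```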
    
    In `@insideLLMs/experiment_tracking.py`:
    - Around line 119-124: The helper _load_optional_module currently catches
    ImportError broadly and can mask transitive import failures; change the
    exception handling to first catch ModuleNotFoundError as e and return None only
    when e.name == module_name (re-raise otherwise), and then optionally keep a
    general except ImportError: return None for other ImportError cases; reference
    the function name _load_optional_module and the variable module_name to locate
    where to adjust the try/except logic.
    
    In `@insideLLMs/models/openai.py`:
    - Line 454: When extracting the first message inside the chat exception handler,
    avoid indexing into messages directly since messages[0]["content"] can raise and
    mask the original exception; update the extraction of first_msg (the code
    referencing messages and first_msg) to safely access content (e.g., check
    messages truthiness and type of messages[0], use .get("content", "") or a guard)
    or wrap only that access in its own try/except and preserve/re-raise/log the
    original exception separately so the root error isn't hidden.
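One way to make that extraction non-raising (the helper name is hypothetical; the real handler in `models/openai.py` inlines this):

```python
def first_message_content(messages) -> str:
    """Safely pull the first message's content so an error handler cannot itself raise."""
    if messages and isinstance(messages[0], dict):
        return messages[0].get("content", "") or ""
    return ""
```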
    
    In `@insideLLMs/rate_limiting.py`:
    - Around line 2899-2902: The except block currently re-raises the caught
    exception with "raise e", which loses the original traceback; change the
    re-raise to a bare "raise" so the original traceback is preserved (update the
    block that calls self.circuit_breaker.record_failure() and replace "raise e"
    with "raise").
    
    ---
    
    Outside diff comments:
    In `@insideLLMs/contrib/debugging.py`:
    - Around line 1681-2034: Update the Returns docstrings to reflect the
    Optional[DebugTraceEvent] return type for the changed APIs: step_start,
    step_end, log_prompt, log_response, log_variable, log_error, and log_warning.
    For each of these methods, change the "Returns" section text to state that the
    method returns an Optional[DebugTraceEvent] (i.e., a DebugTraceEvent or None)
    and preserve any existing descriptive details (e.g., truncation behavior) so
    callers understand when a None might be returned; keep the examples and See Also
    as-is. Ensure the method signatures already showing Optional[...] remain
    consistent with the updated Returns description.
    
    In `@insideLLMs/dataset_utils.py`:
    - Around line 229-255: The docstring and function signature for load_hf_dataset
    disagree about the default split (signature sets split="train" while the Args
    text says default "test") and the Returns section incorrectly claims the
    function can return None; update them to be consistent: decide which default
    split you want to keep (if preserving backward compatibility, change the
    signature to split="test"; otherwise update the Args text to state default
    "train"), and change the Returns text to state that an empty dataset yields an
    empty list ([]) rather than None; ensure the function definition name
    load_hf_dataset and its docstring are updated together so callers and docs
    match.
    
    In `@insideLLMs/experiment_tracking.py`:
    - Around line 2504-2512: MultiTracker.start_run currently loops through
    self.trackers calling tracker.start_run and if one raises, previously started
    trackers remain active; modify start_run so that if any tracker.start_run raises
    an exception you iterate over the trackers that already started (use the
    collected run_ids or tracker slice) and call the appropriate cleanup method
    (e.g., tracker.stop_run or tracker.end_run — whichever the tracker API provides)
    for each started tracker, catching and logging any cleanup exceptions but not
    suppressing the original error, then re-raise the original exception; finally,
    only set self._run_active and self._run_id after all trackers have successfully
    started and return the first_run_id as before.
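A hedged sketch of the rollback shape described above. The class, logger, and tracker method names (`start_run`/`end_run`) follow the finding's assumptions, not the real `experiment_tracking.py` code; per the PR, the tracker list is already enforced non-empty.

```python
import logging

logger = logging.getLogger(__name__)


class MultiTrackerSketch:
    """Illustrative only; the real MultiTracker lives in insideLLMs/experiment_tracking.py."""

    def __init__(self, trackers):
        self.trackers = trackers
        self._run_active = False
        self._run_id = None

    def start_run(self, name):
        run_ids, started = [], []
        try:
            for tracker in self.trackers:
                run_ids.append(tracker.start_run(name))
                started.append(tracker)
        except Exception:
            for tracker in started:
                try:
                    tracker.end_run()  # best-effort cleanup of already-started trackers
                except Exception:
                    logger.warning("cleanup failed for %r", tracker, exc_info=True)
            raise  # surface the original failure, not the cleanup errors
        # Only flip state once every tracker has started successfully.
        self._run_active = True
        self._run_id = run_ids[0]
        return run_ids[0]
```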
    
    In `@insideLLMs/safety.py`:
    - Line 735: The bare Pattern type used in annotations must be parameterized;
    update the annotations to use Pattern[str] (e.g., change def __init__(self,
    patterns: Optional[dict[str, Pattern]] = None) to def __init__(self, patterns:
    Optional[dict[str, Pattern[str]]] = None)) and do the same for the other
    occurrence around line 1110 (replace Pattern with Pattern[str]); also ensure the
    Pattern symbol you import (from typing or re) supports the type parameter and
    adjust the import if necessary.
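A sketch of the parameterized annotation (the function and pattern set are illustrative, not the real safety.py API); `re.Pattern` supports subscripting at runtime on Python 3.9+:

```python
import re
from re import Pattern
from typing import Optional


def build_scanner(
    patterns: Optional[dict[str, Pattern[str]]] = None,
) -> dict[str, Pattern[str]]:
    # Fall back to a tiny default pattern table when none is supplied.
    return patterns or {"email": re.compile(r"[\w.]+@[\w.]+")}
```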
    
    ---
    
    Nitpick comments:
    In `@insideLLMs/contrib/chains.py`:
    - Around line 2563-2565: The except block currently swallows any Exception as
    "_" then sets state.status = ChainStatus.FAILED and breaks, losing the original
    failure context; change it to capture the exception (e.g., "except Exception as
    err:"), record the error into the chain state (for example set state.error or
    state.error_message to err or str(err)) and/or log it, then set state.status =
    ChainStatus.FAILED and break so the original exception context is preserved for
    debugging; reference the error_handler invocation and the
    state/ChainStatus.FAILED update when making this change.
    
    In `@insideLLMs/contrib/routing.py`:
    - Around line 2303-2305: Add a direct regression test that calls the
    _classify_with_model function with self.model set to None and asserts it returns
    the tuple ("unknown", 0.0); specifically, instantiate or mock the object that
    defines _classify_with_model (the routing class in insideLLMs.contrib.routing),
    set its model attribute to None, call _classify_with_model and assert the return
    equals ("unknown", 0.0); place the test in the existing test suite for
    insideLLMs (new test name e.g. test_classify_with_model_none) so module coverage
    stays >=95% to satisfy the coverage gating.
    
    In `@insideLLMs/cost_tracking.py`:
    - Line 1654: The module currently builds __all__ dynamically using globals(),
    which accidentally exports imports and future names; replace that line with an
    explicit allowlist by listing only the intended public names (functions,
    classes, constants) exported by insideLLMs.cost_tracking (e.g., __all__ =
    ["PublicClassA", "public_function_b", ...]) and remove the globals()-based
    expression (and the pyright ignore) so the public API is stable and only the
    declared symbols are exported; update the list whenever you intentionally add or
    remove public symbols.
    
    In `@insideLLMs/rate_limiting.py`:
    - Line 3429: The module currently builds __all__ dynamically (the comprehension
    at the end) which exports many internal imports; replace this with an explicit
    __all__ list containing only the intended public API names (e.g., the module's
    public classes, decorators and handler functions) by auditing the module for the
    real public symbols (RateLimiter-like classes, any decorators such as
    rate_limit, exceptions like RateLimitExceeded, public factory/getter functions)
    and listing them explicitly in __all__ instead of using the globals()
    comprehension so implementation details (asyncio, datetime, type imports,
    utilities) are not exported.
    
    In `@insideLLMs/retry.py`:
    - Line 2327: The module currently sets __all__ dynamically via a globals()
    comprehension which exports every non-underscore name (risking accidental
    exports); replace that line with an explicit __all__ list containing only the
    intended public API symbols (e.g., the public functions, classes, and constants
    defined in this module such as RetryPolicy, retry, backoff, etc.—use the actual
    names present in insideLLMs/retry.py), remove the globals()-based expression,
    and update this explicit list whenever you add or remove public symbols so the
    module export surface remains stable.
    - Around line 1118-1119: The code currently synthesizes a fake RuntimeError when
    last_exception is None before raising RetryExhaustedError; instead, treat
    last_exception is None as an invariant violation and raise immediately (e.g.,
    raise AssertionError or a clear RuntimeError with an "internal invariant broken:
    last_exception is None" message) in the same locations where the synthetic
    exception is created (the blocks that check last_exception prior to raising
    RetryExhaustedError). Remove the synthetic-wrapping logic so the function fails
    fast on the invariant breach rather than embedding a fabricated cause.
    
    In `@insideLLMs/schemas/registry.py`:
    - Around line 470-473: The version-parsing try/except in
    insideLLMs/schemas/registry.py currently catches Exception too broadly; change
    the broad except to only catch parse-related errors (e.g., except (ValueError,
    IndexError, TypeError):) around the block that returns (int(parts[0]),
    int(parts[1]), int(parts[2])) so only invalid integer conversions, missing
    parts, or wrong types are handled and unrelated exceptions still surface.
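A minimal sketch of the narrowed handler (function name and None fallback are illustrative; the real registry code may differ):

```python
from typing import Optional


def parse_version(value: str) -> Optional[tuple[int, int, int]]:
    parts = value.split(".")
    try:
        return (int(parts[0]), int(parts[1]), int(parts[2]))
    except (ValueError, IndexError, TypeError):
        return None  # malformed version; unrelated exceptions still surface
```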
    
    In `@insideLLMs/semantic_cache.py`:
    - Around line 1036-1038: In each Redis exception fallback (the except Exception
    as _ blocks that increment self._stats.misses and return None at/around the
    shown sites and the other ranges 1076-1077, 1085-1086, 1095-1096, 1104-1106,
    1118-1119), add a debug-level log that records the caught exception and relevant
    context (operation name, key/id, and any serialization details available)
    instead of fully swallowing it; use the existing logger on the object (e.g.,
    self._logger or module logger) and include the exception info/traceback
    (exception instance) so the message gives actionable diagnostics while
    preserving the current graceful fallback behavior.
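A sketch of one such fallback with the requested debug logging (the function, stats dict, and client are stand-ins for the real semantic_cache internals):

```python
import logging

logger = logging.getLogger("semantic_cache_sketch")


def cache_get(client, key, stats):
    try:
        return client.get(key)
    except Exception as exc:
        # Record actionable context instead of silently swallowing the failure.
        logger.debug("Redis GET failed (key=%r): %s", key, exc, exc_info=True)
        stats["misses"] += 1
        return None  # preserve the graceful-fallback behavior
```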
    

    ℹ️ Review info
    ⚙️ Run configuration

    Configuration used: defaults

    Review profile: CHILL

    Plan: Pro

    Run ID: 7c67e7c8-8537-4e32-8cd6-49f64b64d552

    📥 Commits

    Reviewing files that changed from the base of the PR and between e0d353c and b34124b.

    📒 Files selected for processing (40)
    • insideLLMs/async_utils.py
    • insideLLMs/cli/_record_utils.py
    • insideLLMs/cli/commands/doctor.py
    • insideLLMs/config.py
    • insideLLMs/contrib/adapters.py
    • insideLLMs/contrib/agents.py
    • insideLLMs/contrib/chains.py
    • insideLLMs/contrib/debugging.py
    • insideLLMs/contrib/ensemble.py
    • insideLLMs/contrib/evalbom.py
    • insideLLMs/contrib/hitl.py
    • insideLLMs/contrib/routing.py
    • insideLLMs/contrib/synthesis.py
    • insideLLMs/cost_tracking.py
    • insideLLMs/dataset_utils.py
    • insideLLMs/experiment_tracking.py
    • insideLLMs/integrations/langchain.py
    • insideLLMs/models/base.py
    • insideLLMs/models/openai.py
    • insideLLMs/probes/agent_probe.py
    • insideLLMs/rate_limiting.py
    • insideLLMs/registry.py
    • insideLLMs/resources.py
    • insideLLMs/retry.py
    • insideLLMs/runtime/__init__.py
    • insideLLMs/runtime/_base.py
    • insideLLMs/runtime/_config_loader.py
    • insideLLMs/runtime/_high_level.py
    • insideLLMs/runtime/_result_utils.py
    • insideLLMs/runtime/reproducibility.py
    • insideLLMs/safety.py
    • insideLLMs/schemas/registry.py
    • insideLLMs/semantic_cache.py
    • insideLLMs/shadow.py
    • insideLLMs/structured_extraction.py
    • insideLLMs/trace/tracing.py
    • insideLLMs/types.py
    • tests/test_models_config_coverage.py
    • tests/test_probes_models_coverage.py
    • wiki/tutorials/Bias-Testing.md

    Comment on lines +3019 to +3081
    def _apply_binary_operator(
        op: ast.operator, left: Union[int, float], right: Union[int, float]
    ) -> Union[int, float]:
        if isinstance(op, ast.Add):
            return left + right
        if isinstance(op, ast.Sub):
            return left - right
        if isinstance(op, ast.Mult):
            return left * right
        if isinstance(op, ast.Div):
            return left / right
        if isinstance(op, ast.FloorDiv):
            return left // right
        if isinstance(op, ast.Mod):
            return left % right
        if isinstance(op, ast.Pow):
            return left**right
        raise ValueError("Unsupported arithmetic operator")


    def _apply_unary_operator(op: ast.unaryop, operand: Union[int, float]) -> Union[int, float]:
        if isinstance(op, ast.UAdd):
            return +operand
        if isinstance(op, ast.USub):
            return -operand
        raise ValueError("Unsupported unary operator")


    def _eval_exec_expression(
        node: ast.AST, variables: dict[str, Union[int, float]]
    ) -> Union[int, float]:
        if isinstance(node, ast.Constant):
            if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
                raise ValueError("Only numeric constants are allowed")
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ValueError(f"Unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp):
            left = _eval_exec_expression(node.left, variables)
            right = _eval_exec_expression(node.right, variables)
            return _apply_binary_operator(node.op, left, right)
        if isinstance(node, ast.UnaryOp):
            return _apply_unary_operator(node.op, _eval_exec_expression(node.operand, variables))
        raise ValueError("Unsupported expression")


    def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]:
        """Execute a tightly restricted Python subset (assignments + arithmetic)."""
        parsed = ast.parse(code, mode="exec")
        variables: dict[str, Union[int, float]] = {}
        for statement in parsed.body:
            if isinstance(statement, ast.Assign):
                if len(statement.targets) != 1 or not isinstance(statement.targets[0], ast.Name):
                    raise ValueError("Only simple variable assignments are supported")
                variables[statement.targets[0].id] = _eval_exec_expression(statement.value, variables)
                continue
            if isinstance(statement, ast.Expr):
                _eval_exec_expression(statement.value, variables)
                continue
            raise ValueError("Only arithmetic expressions and assignments are supported")
        return variables

    ⚠️ Potential issue | 🟠 Major

    Add resource bounds to the restricted evaluator.

    This path is safer than raw exec, but it still allows computational DoS via huge literals/exponents (for example, very large ** operations). Add explicit complexity limits before evaluation.

    💡 Proposed hardening patch
    +MAX_CODE_CHARS = 4096
    +MAX_AST_NODES = 256
    +MAX_POWER_EXPONENT = 1000
    +
     def _apply_binary_operator(
         op: ast.operator, left: Union[int, float], right: Union[int, float]
     ) -> Union[int, float]:
    @@
         if isinstance(op, ast.Pow):
    -        return left**right
    +        if abs(right) > MAX_POWER_EXPONENT:
    +            raise ValueError("Exponent too large")
    +        return left**right
    @@
     def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]:
         """Execute a tightly restricted Python subset (assignments + arithmetic)."""
    +    if len(code) > MAX_CODE_CHARS:
    +        raise ValueError("Code snippet too large")
         parsed = ast.parse(code, mode="exec")
    +    if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES:
    +        raise ValueError("Expression too complex")
         variables: dict[str, Union[int, float]] = {}
    📝 Committable suggestion

    ‼️ IMPORTANT
    Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    Suggested change
    MAX_CODE_CHARS = 4096
    MAX_AST_NODES = 256
    MAX_POWER_EXPONENT = 1000


    def _apply_binary_operator(
        op: ast.operator, left: Union[int, float], right: Union[int, float]
    ) -> Union[int, float]:
        if isinstance(op, ast.Add):
            return left + right
        if isinstance(op, ast.Sub):
            return left - right
        if isinstance(op, ast.Mult):
            return left * right
        if isinstance(op, ast.Div):
            return left / right
        if isinstance(op, ast.FloorDiv):
            return left // right
        if isinstance(op, ast.Mod):
            return left % right
        if isinstance(op, ast.Pow):
            if abs(right) > MAX_POWER_EXPONENT:
                raise ValueError("Exponent too large")
            return left**right
        raise ValueError("Unsupported arithmetic operator")


    def _apply_unary_operator(op: ast.unaryop, operand: Union[int, float]) -> Union[int, float]:
        if isinstance(op, ast.UAdd):
            return +operand
        if isinstance(op, ast.USub):
            return -operand
        raise ValueError("Unsupported unary operator")


    def _eval_exec_expression(
        node: ast.AST, variables: dict[str, Union[int, float]]
    ) -> Union[int, float]:
        if isinstance(node, ast.Constant):
            if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
                raise ValueError("Only numeric constants are allowed")
            return node.value
        if isinstance(node, ast.Name):
            if node.id not in variables:
                raise ValueError(f"Unknown variable: {node.id}")
            return variables[node.id]
        if isinstance(node, ast.BinOp):
            left = _eval_exec_expression(node.left, variables)
            right = _eval_exec_expression(node.right, variables)
            return _apply_binary_operator(node.op, left, right)
        if isinstance(node, ast.UnaryOp):
            return _apply_unary_operator(node.op, _eval_exec_expression(node.operand, variables))
        raise ValueError("Unsupported expression")


    def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]:
        """Execute a tightly restricted Python subset (assignments + arithmetic)."""
        if len(code) > MAX_CODE_CHARS:
            raise ValueError("Code snippet too large")
        parsed = ast.parse(code, mode="exec")
        if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES:
            raise ValueError("Expression too complex")
        variables: dict[str, Union[int, float]] = {}
        for statement in parsed.body:
            if isinstance(statement, ast.Assign):
                if len(statement.targets) != 1 or not isinstance(statement.targets[0], ast.Name):
                    raise ValueError("Only simple variable assignments are supported")
                variables[statement.targets[0].id] = _eval_exec_expression(statement.value, variables)
                continue
            if isinstance(statement, ast.Expr):
                _eval_exec_expression(statement.value, variables)
                continue
            raise ValueError("Only arithmetic expressions and assignments are supported")
        return variables
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/agents.py` around lines 3019 - 3081, The evaluator
    (_safe_exec_python_subset, _eval_exec_expression and _apply_binary_operator)
    currently allows expressions that can cause computational DoS (huge literals or
    exponents); add explicit resource bounds: enforce a max number of statements in
    parsed.body in _safe_exec_python_subset, reject ast.Constant values whose
    absolute value or bit-length exceed a MAX_NUM_BITS, and reject ast.BinOp with
    ast.Pow where the exponent or base exceed safe thresholds (e.g., max exponent
    magnitude or bit-length) before evaluation; also enforce a maximum AST
    recursion/depth when walking nodes in _eval_exec_expression and return clear
    ValueError messages when limits are hit. Ensure checks run prior to performing
    any arithmetic and reference the existing functions (_safe_exec_python_subset,
    _eval_exec_expression, _apply_binary_operator) so the evaluator refuses
    oversized constants/exponents and overly deep/long programs.
    

    Comment on lines +3322 to 3324
    # Restricted execution for arithmetic-only snippets.
    local_vars = _safe_exec_python_subset(code)
    return str(local_vars.get("result", "Code executed successfully"))

    ⚠️ Potential issue | 🟠 Major

    Execution contract is now narrower than the public tool description.

    create_python_tool(allow_exec=True) now executes only arithmetic assignments/expressions, but the function/doc text still says “execute Python code”. That is a breaking change for existing callers expecting general Python snippets.

    💡 Proposed contract/docs alignment
     def create_python_tool(
    @@
    -    """Create a Python code execution tool.
    +    """Create a restricted Python-subset execution tool.
    @@
    -    Creates a tool that can execute Python code. By default, execution is
    -    disabled for safety. Only enable execution in trusted environments.
    +    Creates a tool that executes a restricted subset of Python
    +    (simple assignments + arithmetic expressions). By default, execution
    +    is disabled for safety.
    @@
         return Tool(
             name="python",
             func=execute_python,
    -        description="Execute Python code (use result variable for output)",
    +        description="Execute restricted arithmetic subset (assign to `result` for output)",
             parameters=[ToolParameter("code", "str", "Python code to execute")],
         )
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/agents.py` around lines 3322 - 3324, The public contract
    of create_python_tool no longer matches its implementation: when called with
    allow_exec=True it only runs arithmetic assignments/expressions via
    _safe_exec_python_subset but the function/docstring still says it will “execute
    Python code”; either restore broader execution behavior or update the public
    contract/docs and name to reflect the restricted arithmetic-only execution.
    Modify the create_python_tool docstring and any user-facing description to state
    “execute arithmetic-only Python snippets” (or similar) and ensure symbols
    mentioned (create_python_tool, allow_exec, and _safe_exec_python_subset) are
    updated so callers and tests are aware of the narrowed behavior.
    

    Comment on lines +2546 to 2547
    except Exception as e:
        logger.debug("Balance augmentation generation failed", exc_info=True)

    ⚠️ Potential issue | 🟠 Major

    🧩 Analysis chain

    🏁 Script executed:

    #!/bin/bash
    # First, verify the exact code at the target lines
    echo "=== Code at lines 2546-2547 ==="
    sed -n '2544,2548p' insideLLMs/contrib/synthesis.py
    
    echo -e "\n=== Check for unused exception variable pattern ==="
    rg -n 'except Exception as e' insideLLMs/contrib/synthesis.py
    
    echo -e "\n=== Run Ruff check on F841 errors in the file ==="
    ruff check insideLLMs/contrib/synthesis.py --select F841

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1098


    Remove unused exception binding at line 2546.

    Local variable e is assigned to but never used in the except clause. The logger.debug() call uses exc_info=True to capture exception context, so the variable is not needed.

    ✅ Fix
    -                except Exception as e:
    +                except Exception:
                         logger.debug("Balance augmentation generation failed", exc_info=True)
    🧰 Tools
    🪛 GitHub Actions: CI

    [error] 2546-2546: F841 Local variable 'e' is assigned to but never used. This is fixable with 'ruff --fix'.

    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/contrib/synthesis.py` around lines 2546 - 2547, The except clause
    binds an unused variable `e` even though the exception context is captured via
    logger.debug(..., exc_info=True); remove the unnecessary exception binding by
    changing the except clause that currently reads "except Exception as e:" to just
    "except Exception:" in the file and ensure the logger.debug call (line shown)
    remains unchanged; look for the except block containing logger.debug("Balance
    augmentation generation failed", exc_info=True) to apply this edit.
    

    Comment on lines 69 to +73
 try:
-    from datasets import load_dataset
-
-    HF_DATASETS_AVAILABLE = True
+    _datasets_module = importlib.import_module("datasets")
 except ImportError:
-    HF_DATASETS_AVAILABLE = False
+    _datasets_module = None


    ⚠️ Potential issue | 🟠 Major

    🧩 Analysis chain

    🏁 Script executed:

    head -n 85 insideLLMs/dataset_utils.py | tail -n +60

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 818


    🌐 Web query:

    Python 3.10 ModuleNotFoundError vs ImportError best practices optional imports

    💡 Result:

    ModuleNotFoundError vs ImportError (Python 3.10)

    • ModuleNotFoundError is a subclass of ImportError that’s raised when the module can’t be located (including the special case where sys.modules[name] is None). [1]
    • ImportError is broader: it covers “module found but failed to import”, and also “from x import y where y doesn’t exist”, plus many other import-time failures. [1]

    Best practices for optional imports

    1) Prefer catching ModuleNotFoundError when “dependency not installed” is the only acceptable failure

    This avoids accidentally hiding real bugs inside the optional dependency (or inside its import-time side effects).

    try:
        import orjson
    except ModuleNotFoundError:
        orjson = None

    (Reason: missing-module is the specific condition ModuleNotFoundError represents. [1])

    2) Catch ImportError (not just ModuleNotFoundError) when you want “usable import” semantics

    Some optional deps—especially compiled extensions—may be present but still fail to import due to missing shared libraries, ABI mismatches, etc., which often surface as ImportError. If your feature should be disabled in those cases too, catch ImportError and keep the exception for diagnostics. [2]

    _orjson_exc = None
    try:
        import orjson
    except ImportError as e:
        orjson = None
        _orjson_exc = e

    3) Fail at call time with a clear message (not at import time)

    This keeps base functionality working without the extra dependency.

    def dumps(obj) -> bytes:
        if orjson is None:
            raise ImportError("dumps() requires 'orjson'") from _orjson_exc
        return orjson.dumps(obj)

    Using raise ... from None (or from the saved exception) is the standard way to control exception chaining and avoid noisy/irrelevant context. [3]

    4) If you only need to check availability, use importlib.util.find_spec()

    This lets you test whether a module is discoverable without importing it (and triggering side effects). [4]

    from importlib.util import find_spec
    
    HAS_ORJSON = find_spec("orjson") is not None

    5) Packaging: declare optional deps as “extras”

    This is the ecosystem-standard way to make optional features installable (e.g., pip install yourpkg[orjson]). (Not a Python-runtime rule, but it complements the runtime patterns above.)


    Sources

    [1] Python docs – Built-in exceptions (ModuleNotFoundError is a subclass of ImportError, raised when a module can’t be located) (docs.python.org)
    [2] Discussion of practical cases where ImportError is the right catch for optional compiled deps (scivision.dev)
    [3] PEP 409 – raise ... from None to suppress irrelevant exception context (peps.python.org)
    [4] Python docs – importlib.util.find_spec() for checking importability without importing (docs.python.org)


    Use ModuleNotFoundError instead of broad ImportError for optional import.

    The current catch-all ImportError at line 71 can mask real import failures within the datasets module itself (e.g., missing dependencies, syntax errors during import), silently falling back to disabled functionality when the actual issue should be raised.

    💡 Suggested fix
     try:
         _datasets_module = importlib.import_module("datasets")
    -except ImportError:
    -    _datasets_module = None
    +except ModuleNotFoundError:
    +    _datasets_module = None
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/dataset_utils.py` around lines 69 - 73, The try/except around
    importlib.import_module("datasets") currently catches ImportError too broadly;
    change it to catch ModuleNotFoundError so missing optional package results in
    _datasets_module = None, but real import errors inside the package still
    propagate. Specifically, update the try/except that assigns _datasets_module
    (the importlib.import_module("datasets") call) to use "except
    ModuleNotFoundError: _datasets_module = None" (or add a separate except
    Exception: raise to re-raise non-missing-package errors) so only the
    absent-package case is swallowed.
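As a complementary pattern, availability can also be probed without importing at all (a sketch, not the repository's actual code; `HF_DATASETS_AVAILABLE` mirrors the flag name used in `dataset_utils.py`):

```python
from importlib.util import find_spec

# True when the 'datasets' package is discoverable on sys.path;
# the check itself triggers no import-time side effects.
HF_DATASETS_AVAILABLE = find_spec("datasets") is not None

def require_datasets() -> None:
    """Fail at call time, not import time, when the extra is missing."""
    if not HF_DATASETS_AVAILABLE:
        raise ImportError(
            "this feature requires the 'datasets' extra: pip install datasets"
        )
```

The trade-off versus `import_module` is that `find_spec` only confirms the package is locatable; a package that is installed but broken would still surface its real error on first import rather than being silently disabled.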
    

    Comment on lines +119 to +124
    def _load_optional_module(module_name: str) -> Any | None:
        """Load optional dependency module, returning None when unavailable."""
        try:
            return importlib.import_module(module_name)
        except ImportError:
            return None

    ⚠️ Potential issue | 🟠 Major

    🧩 Analysis chain

    🏁 Script executed:

    # First, let's check the actual code in the repository
    cat -n insideLLMs/experiment_tracking.py | sed -n '119,124p'

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 353


    🏁 Script executed:

    # Check if there are any usages of _load_optional_module in the codebase
    rg "_load_optional_module" -n

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 552


    🏁 Script executed:

    # Check for any tests related to this function
    fd "test.*\.py$" | xargs rg "_load_optional_module" -l

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 56


    🏁 Script executed:

    # Let's also check what Python version is being targeted
    cat setup.py 2>/dev/null || cat pyproject.toml 2>/dev/null | head -50

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1400


    🏁 Script executed:

    # Verify Python's ModuleNotFoundError behavior and attributes in Python 3.10+
    python3 <<'EOF'
    import sys
    print(f"Python version: {sys.version}")
    
    # Test ModuleNotFoundError attributes
    try:
        import nonexistent_module_xyz
    except ModuleNotFoundError as e:
        print(f"ModuleNotFoundError has 'name' attr: {hasattr(e, 'name')}")
        if hasattr(e, 'name'):
            print(f"  e.name = {e.name}")
        print(f"  str(e) = {str(e)}")
    
    # Test transitive import failure
    import tempfile
    import os
    
    # Create a test module that imports a missing dependency
    with tempfile.TemporaryDirectory() as tmpdir:
        test_module_path = os.path.join(tmpdir, 'test_mod.py')
        with open(test_module_path, 'w') as f:
            f.write('import missing_dependency\n')
        
        sys.path.insert(0, tmpdir)
        try:
            import test_mod
        except ModuleNotFoundError as e:
            print(f"\nTransitive ModuleNotFoundError:")
            print(f"  hasattr(e, 'name'): {hasattr(e, 'name')}")
            if hasattr(e, 'name'):
                print(f"  e.name = {e.name}")
            print(f"  str(e) = {str(e)}")
        except ImportError as e:
            print(f"\nTransitive ImportError (not ModuleNotFoundError):")
            print(f"  type(e) = {type(e).__name__}")
            print(f"  str(e) = {str(e)}")
    EOF

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 403


    🏁 Script executed:

    # Check if there are any tests in the repository for this module
    fd "test" -t f -e py | head -20

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 786


    🏁 Script executed:

    # Verify ModuleNotFoundError is a subclass of ImportError
    python3 <<'EOF'
    print(f"ModuleNotFoundError is subclass of ImportError: {issubclass(ModuleNotFoundError, ImportError)}")
    
    # Check the exception hierarchy
    import inspect
    print(f"\nModuleNotFoundError MRO: {inspect.getmro(ModuleNotFoundError)}")
    EOF

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 269


    🏁 Script executed:

    # Search for any tests related to optional module loading
    rg "optional" -i tests/ | head -20

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1610


    🏁 Script executed:

    # Look for tests specifically for experiment_tracking or _load_optional_module
    fd "test.*experiment" -t f tests/

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 148


    🏁 Script executed:

    # Check if there are any tests in tests/ directory that test the _load_optional_module function
    rg "_load_optional_module|experiment_tracking" tests/ -l

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 286


    🏁 Script executed:

    # Look at the test file for experiment_tracking to see if there are tests for optional module loading
    cat tests/test_experiment_tracking_coverage.py | head -100

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 2852


    🏁 Script executed:

    # Let's check test_experiment_tracking.py as well
    cat tests/test_experiment_tracking.py | head -100

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 2747


    🏁 Script executed:

    # Now let's understand the actual impact by looking at how _load_optional_module is used
    # Let's examine the context around lines 128-140 where it's used
    sed -n '119,145p' insideLLMs/experiment_tracking.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1103


    🏁 Script executed:

    # Search for how wandb, mlflow, and _torch_tensorboard are used later in the code
    rg "wandb\.|mlflow\.|_torch_tensorboard" insideLLMs/experiment_tracking.py | head -30

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1414


    🏁 Script executed:

    # Check the test_experiment_tracking_coverage.py file for tests about optional imports
    rg "ImportError|optional|_load_optional" tests/test_experiment_tracking_coverage.py -B 2 -A 5

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 1061


    🏁 Script executed:

    # Let's look for any other error handling patterns in the codebase for optional imports
    rg "except.*Error" insideLLMs/experiment_tracking.py -B 2 -A 2 | head -40

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 177


    🏁 Script executed:

    # Test the actual behavior difference between the current and proposed fix
    python3 <<'EOF'
    import sys
    import tempfile
    import os
    import importlib
    
    # Test 1: Module truly not found
    print("=== Test 1: Module not found ===")
    try:
        importlib.import_module("nonexistent_module_12345")
    except ImportError as e:
        print(f"Exception type: {type(e).__name__}")
        print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
        if isinstance(e, ModuleNotFoundError):
            print(f"exc.name: {e.name}")
    
    # Test 2: Module found but has broken import inside
    print("\n=== Test 2: Module with broken internal import ===")
    with tempfile.TemporaryDirectory() as tmpdir:
        broken_module = os.path.join(tmpdir, 'broken_mod.py')
        with open(broken_module, 'w') as f:
            f.write('import nonexistent_internal_dep\n')
        
        sys.path.insert(0, tmpdir)
        try:
            importlib.import_module("broken_mod")
        except ImportError as e:
            print(f"Exception type: {type(e).__name__}")
            print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
            if isinstance(e, ModuleNotFoundError):
                print(f"exc.name: {e.name}")
                print(f"exc.name == 'broken_mod': {e.name == 'broken_mod'}")
            print(f"Exception msg: {str(e)}")
    
    # Test 3: Other ImportError (hypothetical)
    print("\n=== Test 3: Circular import scenario ===")
    with tempfile.TemporaryDirectory() as tmpdir:
        mod_a = os.path.join(tmpdir, 'mod_a.py')
        mod_b = os.path.join(tmpdir, 'mod_b.py')
        with open(mod_a, 'w') as f:
            f.write('import mod_b\nx = 1\n')
        with open(mod_b, 'w') as f:
            f.write('from mod_a import x\n')
        
        sys.path.insert(0, tmpdir)
        try:
            importlib.import_module("mod_a")
        except ImportError as e:
            print(f"Exception type: {type(e).__name__}")
            print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
            print(f"Exception msg: {str(e)[:100]}")
    EOF

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 662


    Catch ModuleNotFoundError specifically to avoid masking transitive import failures.

    Line 123 catches ImportError broadly, which silently treats installed-but-broken dependencies (with missing transitive imports) as "not installed". The exception's name attribute indicates which module failed; re-raise if it's not the module you're trying to load.

    💡 Suggested fix
     def _load_optional_module(module_name: str) -> Any | None:
         """Load optional dependency module, returning None when unavailable."""
         try:
             return importlib.import_module(module_name)
    -    except ImportError:
    -        return None
    +    except ModuleNotFoundError as exc:
    +        if exc.name == module_name:
    +            return None
    +        raise
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/experiment_tracking.py` around lines 119 - 124, The helper
    _load_optional_module currently catches ImportError broadly and can mask
    transitive import failures; change the exception handling to first catch
    ModuleNotFoundError as e and return None only when e.name == module_name
    (re-raise otherwise), and then optionally keep a general except ImportError:
    return None for other ImportError cases; reference the function name
    _load_optional_module and the variable module_name to locate where to adjust the
    try/except logic.
    

    Comment on line 454
        except Exception as e:
            # Get first message content for error context
    -       first_msg = messages[0].get("content", "") if messages else ""
    +       first_msg = messages[0]["content"] if messages else ""

    ⚠️ Potential issue | 🟡 Minor

    Avoid masking the original error in chat exception handling.

    On Line 454, messages[0]["content"] can raise while already handling an exception, which hides the real failure source.

    💡 Proposed fix
    -            first_msg = messages[0]["content"] if messages else ""
    +            first_msg = ""
    +            if messages:
    +                first_msg = str(messages[0].get("content", ""))
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/models/openai.py` at line 454, When extracting the first message
    inside the chat exception handler, avoid indexing into messages directly since
    messages[0]["content"] can raise and mask the original exception; update the
    extraction of first_msg (the code referencing messages and first_msg) to safely
    access content (e.g., check messages truthiness and type of messages[0], use
    .get("content", "") or a guard) or wrap only that access in its own try/except
    and preserve/re-raise/log the original exception separately so the root error
    isn't hidden.
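A defensive extraction helper along these lines cannot itself raise inside the handler (names hypothetical; the actual code in `openai.py` differs):

```python
from typing import Any

def first_message_content(messages: Any) -> str:
    """Best-effort first-message text for error context; never raises."""
    try:
        if messages and isinstance(messages[0], dict):
            return str(messages[0].get("content", ""))
    except Exception:
        pass  # error-context extraction must never mask the original error
    return ""
```

Because the lookup is wrapped in its own try/except, a malformed `messages` payload degrades to an empty context string instead of replacing the exception being handled.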
    

    Comment on lines +2899 to +2902
    except Exception as e:
        if self.circuit_breaker:
            self.circuit_breaker.record_failure()
        raise e

    ⚠️ Potential issue | 🟡 Minor

    🧩 Analysis chain

    🏁 Script executed:

    # First, check the structure and size of the file
    wc -l insideLLMs/rate_limiting.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 104


    🏁 Script executed:

    # Read the specific lines mentioned in the review (2899-2902)
    sed -n '2895,2905p' insideLLMs/rate_limiting.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 561


    🏁 Script executed:

    # Run the verification script from the review to find all similar patterns
    rg -nPU 'except\s+Exception\s+as\s+([A-Za-z_]\w*):(?:\n[ \t]+.*){0,8}\n[ \t]+raise\s+\1\b' insideLLMs/rate_limiting.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 144


    🏁 Script executed:

    # Search for re-raise patterns with exception variables using simpler regex
    rg -n 'except.*as\s+([a-z_]\w*)' insideLLMs/rate_limiting.py -A 5 | grep -E '(except|raise)' | head -50

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 704


    🏁 Script executed:

    # More direct: find all lines with "raise e" or similar patterns
    rg -n 'raise\s+[a-z_]\w+' insideLLMs/rate_limiting.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 54


    🏁 Script executed:

    # Check if bare raise is used elsewhere in the file for comparison
    rg -n '^\s+raise\s*$' insideLLMs/rate_limiting.py

    Repository: dr-gareth-roberts/insideLLMs

    Length of output: 194


    Use bare raise to preserve the traceback chain.

    The raise e at line 2902 re-raises with an extra frame pointing at the re-raise site, obscuring the original failure location. Use bare raise instead to leave the traceback unchanged, consistent with the pattern used throughout the rest of this file.

    💡 Proposed fix
                    except Exception as e:
                        if self.circuit_breaker:
                            self.circuit_breaker.record_failure()
    -                    raise e
    +                    raise
    🤖 Prompt for AI Agents
    Verify each finding against the current code and only fix it if needed.
    
    In `@insideLLMs/rate_limiting.py` around lines 2899 - 2902, The except block
    currently re-raises the caught exception with "raise e", which loses the
    original traceback; change the re-raise to a bare "raise" so the original
    traceback is preserved (update the block that calls
    self.circuit_breaker.record_failure() and replace "raise e" with "raise").
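The difference between the two re-raise forms is observable by counting traceback frames (a small standalone demonstration, independent of the repository code):

```python
def _boom():
    raise ValueError("boom")

def reraise_bare():
    try:
        _boom()
    except Exception:
        raise  # re-raises with the original traceback unchanged

def reraise_named():
    try:
        _boom()
    except Exception as e:
        raise e  # appends an extra frame pointing at this line

def frame_count(fn) -> int:
    """Call fn, catch its ValueError, and count traceback frames."""
    try:
        fn()
    except ValueError as err:
        n, tb = 0, err.__traceback__
        while tb is not None:
            n, tb = n + 1, tb.tb_next
        return n
    return 0
```

`frame_count(reraise_named)` is one greater than `frame_count(reraise_bare)`: the `raise e` form inserts its own line into the chain, so readers chasing the traceback land on the re-raise instead of the original failure.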
    
