Hardening pass: tracking, models, runtime, tests, and docs alignment #29
dr-gareth-roberts wants to merge 1 commit into main from
Conversation
Reviewer's Guide
Hardening pass focusing on safer experiment tracking backends, stricter retry/typing behavior, safer arithmetic/code execution for contrib agents, more robust optional dependency handling, and minor doc/test/runtime cleanups.
Updated class diagram for experiment trackers and MultiTracker:
```mermaid
classDiagram
class ExperimentTracker {
<<abstract>>
- config TrackingConfig
- _run_active bool
- _run_id str
- _step int
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, Any~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str, artifact_type str, artifact_name str) None
+ log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
+ watch_model(model Any, log_freq int) None
}
class WandBTracker {
- entity str | None
- wandb_kwargs dict~str, Any~
- _wandb Any
- _run Any
- _run_active bool
- _run_id str
- _step int
+ __init__(config TrackingConfig, entity str, wandb_kwargs dict~str, Any~) None
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, float~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str, artifact_type str, artifact_name str) None
+ log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
+ watch_model(model Any, log_freq int) None
}
class MLflowTracker {
- tracking_uri str | None
- _mlflow Any
- _experiment_id str | None
- _run_active bool
- _run_id str
- _step int
+ __init__(config TrackingConfig, tracking_uri str, experiment_name str) None
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, float~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str) None
+ log_model(model Any, model_name str, model_flavor str) None
+ register_model(model_uri str, model_name str) None
}
class TensorBoardTracker {
- log_dir str
- _summary_writer_cls Any
- _writer Any | None
- _run_active bool
- _run_id str
- _step int
+ __init__(config TrackingConfig, log_dir str) None
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, float~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str, artifact_type str, artifact_name str) None
+ log_table(table_name str, data list~dict~str, Any~~, columns list~str~, step int) None
+ watch_model(model Any, log_freq int) None
}
class LocalFileTracker {
- output_dir Path
- _run_dir Path | None
- _metrics list~dict~str, Any~~
- _params dict~str, Any~
- _artifacts list~dict~str, str~~
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, float~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str, artifact_type str, artifact_name str) None
+ snapshot() dict~str, Any~
}
class MultiTracker {
- trackers list~ExperimentTracker~
- _run_active bool
- _run_id str
+ __init__(trackers list~ExperimentTracker~, config TrackingConfig) None
+ start_run(run_name str, run_id str, nested bool) str
+ end_run(status str) None
+ log_metrics(metrics dict~str, float~, step int) None
+ log_params(params dict~str, Any~) None
+ log_artifact(artifact_path str, artifact_type str, artifact_name str) None
}
ExperimentTracker <|-- WandBTracker
ExperimentTracker <|-- MLflowTracker
ExperimentTracker <|-- TensorBoardTracker
ExperimentTracker <|-- LocalFileTracker
ExperimentTracker <|-- MultiTracker
MultiTracker "1" o-- "many" ExperimentTracker : aggregates
WandBTracker ..> "1" Any : uses_wandb_module
MLflowTracker ..> "1" Any : uses_mlflow_module
TensorBoardTracker ..> "1" Any : uses_summary_writer_cls
```
insideLLMs Behavioural Diff
Top changed records
Regressions (0)
Improvements (0)
Other changes (0)
Only in baseline (0)
Only in candidate (0)
Trace drifts (0)
Trace violation increases (0)
📝 Walkthrough
This PR standardizes exception handling across 30+ files by binding caught exceptions to placeholder variables and adds explicit module-level `__all__` export lists.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 passed
Hey - I've found 3 issues, and left some high-level feedback:
- The change to `load_hf_dataset`'s default `split` from `'test'` to `'train'` is a behavioral change that may surprise existing callers; consider whether this should remain `'test'` for backward compatibility or be called out clearly as an intentional breaking change.
- Using `__all__ = [name for name in globals() if not name.startswith('_')]` in multiple modules will export helper symbols, imported modules, and other internals; consider defining `__all__` explicitly per module to avoid unintentionally expanding the public surface.
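A minimal, self-contained illustration of why the comprehension over-exports (the simulated namespace below is a stand-in for a module's `globals()`, not code from this PR):

```python
def public_names(namespace: dict) -> list[str]:
    # Mirrors the reviewed pattern:
    # __all__ = [name for name in globals() if not name.startswith('_')]
    return [name for name in namespace if not name.startswith("_")]

# Simulated module globals: an imported module, a public function, a private helper.
simulated_globals = {
    "os": object(),           # imported module -- leaks into the export list
    "load_data": lambda: [],  # intended public API
    "_helper": lambda: None,  # private helper -- correctly excluded
}

print(public_names(simulated_globals))  # → ['os', 'load_data']
```

The import leaking into `__all__` is exactly why an explicit, hand-written export list is the safer convention.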
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- The change to `load_hf_dataset`'s default `split` from `'test'` to `'train'` is a behavioral change that may surprise existing callers; consider whether this should remain `'test'` for backward compatibility or be called out clearly as an intentional breaking change.
- Using `__all__ = [name for name in globals() if not name.startswith('_')]` in multiple modules will export helper symbols, imported modules, and other internals; consider defining `__all__` explicitly per module to avoid unintentionally expanding the public surface.
## Individual Comments
### Comment 1
<location path="insideLLMs/contrib/agents.py" line_range="3034-3036" />
<code_context>
return _eval(parsed)
+def _apply_binary_operator(
+ op: ast.operator, left: Union[int, float], right: Union[int, float]
+) -> Union[int, float]:
+ if isinstance(op, ast.Add):
+ return left + right
+ if isinstance(op, ast.Sub):
+ return left - right
+ if isinstance(op, ast.Mult):
+ return left * right
+ if isinstance(op, ast.Div):
+ return left / right
+ if isinstance(op, ast.FloorDiv):
+ return left // right
+ if isinstance(op, ast.Mod):
+ return left % right
+ if isinstance(op, ast.Pow):
+ return left**right
+ raise ValueError("Unsupported arithmetic operator")
+
+
</code_context>
<issue_to_address>
**suggestion:** The error message in `_apply_binary_operator` is quite generic and may obscure which operator caused the failure.
Including the operator in the error (e.g. via `type(op).__name__` or rendering it directly) would make it clearer which expression failed, improving debuggability when an unsupported operator reaches this code.
```suggestion
if isinstance(op, ast.Pow):
return left**right
raise ValueError(f"Unsupported arithmetic operator: {type(op).__name__}")
```
</issue_to_address>
### Comment 2
<location path="insideLLMs/trace/tracing.py" line_range="1868-1874" />
<code_context>
insideLLMs.trace.trace_config.TracePayloadNormaliser : Normalise payloads before fingerprinting.
"""
# Normalize to list of dicts
- if events and isinstance(events[0], TraceEvent):
- event_dicts = [e.to_dict() for e in events]
- else:
- event_dicts = list(events)
+ event_dicts: list[dict[str, Any]] = []
+ for event in events:
+ if isinstance(event, TraceEvent):
+ event_dicts.append(event.to_dict())
+ continue
+ if isinstance(event, dict):
+ event_dicts.append(event)
+ continue
+ raise TypeError(f"Unsupported trace event type: {type(event).__name__}")
# Sort events by sequence to ensure deterministic ordering
</code_context>
<issue_to_address>
**question (bug_risk):** Making `trace_fingerprint` raise on any non-dict/non-TraceEvent may be stricter than callers expect.
The previous implementation accepted any iterable of dict-like objects and would silently pass through unexpected types. The new loop raises `TypeError` for anything that’s not a `TraceEvent` or `dict`, which can break callers that pass custom trace-like objects. If the stricter typing is intentional, please either (a) document this behavioral change clearly, or (b) handle unknown but dict-like types more leniently (e.g., coercion/ignoring) to preserve backward compatibility.
</issue_to_address>
### Comment 3
<location path="tests/test_models_config_coverage.py" line_range="992" />
<code_context>
assert config.dataset is not None
assert config.runner is not None
assert "example" in config.tags
+ assert config.dataset.data is not None
assert len(config.dataset.data) == 2
</code_context>
<issue_to_address>
**suggestion (testing):** Missing tests for new HuggingFace dataset loading behaviour and ImportError conditions
This PR changes `load_hf_dataset` to default `split` to `"train"` and to treat the HF datasets import as optional (raising `ImportError` if `load_dataset` is missing), but I don’t see tests for:
1. Verifying that `"train"` is used as the default split (e.g., by mocking `datasets.load_dataset` and asserting the `split` argument).
2. Verifying that an `ImportError` is raised when the `datasets` package exists but `load_dataset` is missing/`None`.
Please add tests (e.g., in `tests/test_dataset_utils.py` or an existing dataset test module) to cover these behaviours and guard against regressions in the import/availability logic.
</issue_to_address>
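The requested regression tests could follow the pattern below. Everything here is a self-contained stand-in: `fake_module` plays the role of the real dataset module and `load_hf_dataset` is a minimal replica, so only the mocking and assertion pattern carries over to the actual test suite:

```python
import types
from unittest import mock

# Stand-in module holding the optional HF entry point; in the real code base
# this would be the dataset module with `load_dataset` resolved via importlib.
fake_module = types.ModuleType("fake_dataset_utils")
fake_module.load_dataset = None  # simulate `datasets` not being installed

def load_hf_dataset(name: str, split: str = "train", **kwargs) -> list:
    # Minimal replica of the loader's availability check plus call.
    if fake_module.load_dataset is None:
        raise ImportError("HuggingFace `datasets` is required for load_hf_dataset")
    return list(fake_module.load_dataset(name, split=split, **kwargs))

def test_default_split_is_train() -> None:
    with mock.patch.object(fake_module, "load_dataset", return_value=[{"x": 1}]) as fake:
        rows = load_hf_dataset("demo")
    fake.assert_called_once_with("demo", split="train")
    assert rows == [{"x": 1}]

def test_missing_load_dataset_raises() -> None:
    try:
        load_hf_dataset("demo")
    except ImportError:
        pass
    else:
        raise AssertionError("expected ImportError")

test_default_split_is_train()
test_missing_load_dataset_raises()
```

`mock.patch.object` restores the `None` sentinel on exit, so the ImportError case can reuse the same stand-in without extra teardown.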
Related Documentation
1 document(s) may need updating based on files changed in this PR: insideLLMs Agent Probe and Trace-Aware Testing
View Suggested Changes
```diff
@@ -13,7 +13,7 @@
 5. Validates the trace against contracts defined in the `TraceConfig` using the `validate_with_config()` helper.
 6. Packages the prompt, final response, tool calls, trace events, fingerprint, violations, and metadata into an `AgentProbeResult`.
-Each tool available to the agent is described by a `ToolDefinition` object, specifying its name, description, parameter schema, and optional handler. The trace configuration and validation utilities are available as public APIs for advanced customization and integration.
+Each tool available to the agent is described by a `ToolDefinition` object, specifying its name, description, parameter schema, and optional handler. For tools that perform arithmetic evaluation, the `_safe_eval_arithmetic()` function provides a secure alternative to `eval()`, restricting execution to arithmetic operations only. The trace configuration and validation utilities are available as public APIs for advanced customization and integration.
```
## Trace-Aware Results and Tracing Integration
The probe system is tightly integrated with the tracing subsystem. All agent actions, tool calls, and relevant events are recorded by the `TraceRecorder`. The trace is then validated against contracts (e.g., required tool order, forbidden actions) using the `validate_with_config()` function, which respects the toggles and schemas defined in the `TraceConfig`. Payloads can be normalized or redacted using the `TracePayloadNormaliser`, which supports both built-in and custom normalization strategies for deterministic trace fingerprints and privacy compliance.
Pull request overview
Hardening pass across insideLLMs’ runtime, model wrappers, integrations, and docs to improve typing clarity, optional-dependency robustness, and deterministic behavior around tracking / retries.
Changes:
- Tighten experiment tracking behavior (non-empty MultiTracker; deterministic run IDs) and optional dependency loading for trackers/datasets.
- Improve model/runtime typing and failure-path handling (retry exhaustion guards, trace fingerprint input validation, minor API typing cleanups).
- Update tests and docs to align with runtime outputs/typing expectations.
Reviewed changes
Copilot reviewed 40 out of 40 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| wiki/tutorials/Bias-Testing.md | Align tutorial output directory with analysis paths. |
| tests/test_probes_models_coverage.py | Tighten typing on probe helper methods used for coverage. |
| tests/test_models_config_coverage.py | Strengthen typing and assertions for model metadata paths. |
| insideLLMs/types.py | Add module-wide __all__ export list for static analysis/export consistency. |
| insideLLMs/trace/tracing.py | Improve typing and input validation for trace payloads/fingerprinting. |
| insideLLMs/structured_extraction.py | Minor exception-handling style/typing adjustments. |
| insideLLMs/shadow.py | Minor exception-handling style adjustments for request-body decoding. |
| insideLLMs/semantic_cache.py | Minor exception-handling style adjustments around Redis operations. |
| insideLLMs/schemas/registry.py | Minor exception-handling style adjustment in semver parsing. |
| insideLLMs/safety.py | Add module-wide __all__ export list. |
| insideLLMs/runtime/reproducibility.py | Typing adjustments (e.g., list_all return value typing) and optional import annotation. |
| insideLLMs/runtime/_result_utils.py | Minor exception-handling style adjustments around model info coercion. |
| insideLLMs/runtime/_high_level.py | Minor exception-handling style adjustments in status/category coercion. |
| insideLLMs/runtime/_config_loader.py | Minor typing refactor in pipeline construction and path snapshotting. |
| insideLLMs/runtime/_base.py | Minor exception-handling style adjustment for callback attribute tagging. |
| insideLLMs/runtime/__init__.py | Add module-wide __all__ export list for runtime package. |
| insideLLMs/retry.py | Harden retry exhaustion behavior and add module-wide __all__. |
| insideLLMs/resources.py | Minor exception-handling style adjustments in file/fsync helpers. |
| insideLLMs/registry.py | Add module-wide __all__ export list. |
| insideLLMs/rate_limiting.py | Minor exception-handling changes and add module-wide __all__. |
| insideLLMs/probes/agent_probe.py | Update docs/examples to avoid unsafe eval in tool handler samples. |
| insideLLMs/models/openai.py | Tighten typing (**kwargs: Any, info() -> ModelInfo) and error-path tweaks. |
| insideLLMs/models/base.py | Harden retry exhaustion raising logic and typing for last_error. |
| insideLLMs/integrations/langchain.py | Minor exception-handling style adjustments in optional imports/stream path. |
| insideLLMs/experiment_tracking.py | Optional dependency loading via importlib; run-id hardening; MultiTracker empty-list guard; minor typing cleanups; add __all__. |
| insideLLMs/dataset_utils.py | Optional HF datasets import via importlib; adjust availability logic; add __all__. |
| insideLLMs/cost_tracking.py | Add module-wide __all__ export list. |
| insideLLMs/contrib/synthesis.py | Minor exception-handling style adjustments in generation helpers. |
| insideLLMs/contrib/routing.py | Harden classification fallbacks and minor exception-handling style changes. |
| insideLLMs/contrib/hitl.py | Minor exception-handling style adjustment for callback emission. |
| insideLLMs/contrib/evalbom.py | Minor exception-handling style adjustments around manifest parsing. |
| insideLLMs/contrib/ensemble.py | Minor exception-handling style adjustments when skipping failed models. |
| insideLLMs/contrib/debugging.py | Adjust return typing to Optional for trace-event helpers and minor typing fixes. |
| insideLLMs/contrib/chains.py | Minor exception-handling style adjustment and typing for route steps dict. |
| insideLLMs/contrib/agents.py | Replace eval/exec usage with restricted arithmetic parsing/execution utilities. |
| insideLLMs/contrib/adapters.py | Fail-fast if clients are uninitialized; minor exception-handling style adjustments. |
| insideLLMs/config.py | Minor typing tweak in Union order; add module-wide __all__. |
| insideLLMs/cli/commands/doctor.py | Minor exception-handling style adjustment for entrypoint plugin discovery. |
| insideLLMs/cli/_record_utils.py | Minor exception-handling style adjustment in enum coercion. |
| insideLLMs/async_utils.py | Minor exception-handling style adjustment in task-cancellation path. |
Comments suppressed due to low confidence (1)
insideLLMs/dataset_utils.py:243
- The `load_hf_dataset` signature now defaults `split` to `"train"`, but the docstring (and earlier examples in this same module) still says the default is `"test"`. Either revert the default to `"test"` (to avoid a silent behavior change for callers who omit `split`) or update the docs/examples to match the new default.
```python
def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
    """Load a dataset from the HuggingFace Datasets Hub into a list of dictionaries.

    Downloads and loads a dataset from the HuggingFace Hub, converting it to the
    same list-of-dictionaries format used by the other loaders in this module.
    This provides access to thousands of publicly available datasets for NLP,
    computer vision, and other machine learning tasks.

    Args:
        dataset_name: The name of the dataset on HuggingFace Hub. Can be in
            the format "dataset_name" for official datasets or "user/dataset"
            for community datasets.
        split: The dataset split to load. Common values are "train", "test",
            "validation". Defaults to "test".
        **kwargs: Additional keyword arguments passed to ``datasets.load_dataset``.
```
```diff
 except Exception as e:
     # Get first message content for error context
-    first_msg = messages[0].get("content", "") if messages else ""
+    first_msg = messages[0]["content"] if messages else ""
     raise ModelGenerationError(
```
In the generic exception handler, messages[0]["content"] can raise KeyError if a caller passes a malformed ChatMessage (missing content), which would mask the original OpenAI error and bypass ModelGenerationError. Consider using a safe access pattern (e.g., get) or guarding against missing keys here so the error wrapper is reliable.
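A defensive accessor in the spirit of this suggestion (the helper name is illustrative, not from the code base):

```python
from typing import Any

def first_message_content(messages: list[dict[str, Any]]) -> str:
    # Tolerate empty lists and malformed messages missing "content",
    # so the error-wrapping path never raises KeyError itself.
    if not messages:
        return ""
    first = messages[0]
    return first.get("content", "") if isinstance(first, dict) else ""

print(repr(first_message_content([{"role": "user"}])))  # → ''
```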
```diff
 if self.circuit_breaker:
     self.circuit_breaker.record_failure()
-raise
+raise e
```
`raise e` in an except block re-raises with an extra traceback frame pointing at the re-raise site, which adds noise when debugging. Use a bare `raise` here to re-raise while leaving the original traceback unchanged.
```diff
-raise e
+raise
```
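The behavioral difference is observable by counting traceback frames: `raise e` appends an extra frame pointing at the re-raise line, while a bare `raise` leaves the traceback as-is. A small self-contained demonstration:

```python
import traceback
from typing import Callable

def inner() -> None:
    raise ValueError("boom")

def reraise_bare() -> None:
    try:
        inner()
    except ValueError:
        raise  # traceback left untouched

def reraise_bound() -> None:
    try:
        inner()
    except ValueError as e:
        raise e  # appends a frame for this line

def frame_count(fn: Callable[[], None]) -> int:
    try:
        fn()
    except ValueError as e:
        return len(traceback.extract_tb(e.__traceback__))
    return 0

# reraise_bound produces one extra traceback frame than reraise_bare
print(frame_count(reraise_bare), frame_count(reraise_bound))
```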
```python
def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
    """Load a dataset from the HuggingFace Datasets Hub into a list of dictionaries.

    Downloads and loads a dataset from the HuggingFace Hub, converting it to the
```
The docstring section below still states the function may return None when the dataset is empty, but the current signature/implementation always returns a list (or raises). Please update the docstring (and any callers that check for None) to match the actual contract.
```python
if isinstance(op, ast.Pow):
    return left**right
raise ValueError("Unsupported arithmetic operator")
```
Even with eval/exec removed, allowing exponentiation (**) can still be used for computational/allocational DoS (e.g., 2**100000000) when allow_exec=True. Consider adding guards (max exponent / max AST depth / max operations) or disabling Pow in the restricted subset to keep the "sandbox" safer under adversarial input.
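One possible guard, sketched with an illustrative cap (the constant and helper name are assumptions, not code from this PR):

```python
from typing import Union

Number = Union[int, float]
MAX_ABS_EXPONENT = 1000  # illustrative cap; tune to the workload

def apply_pow_guarded(left: Number, right: Number) -> Number:
    # Refuse huge exponents before evaluating, so adversarial input such as
    # 2**100_000_000 cannot allocate an enormous integer.
    if abs(right) > MAX_ABS_EXPONENT:
        raise ValueError(f"Exponent {right!r} exceeds safety cap {MAX_ABS_EXPONENT}")
    return left ** right

print(apply_pow_guarded(2, 10))  # → 1024
```

A max-AST-depth or max-operation budget in the evaluator would complement this per-operator check.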
```diff
-def load_hf_dataset(
-    dataset_name: str, split: str = "test", **kwargs
-) -> Optional[list[dict[str, Any]]]:
+def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
```
🔴 Default split for load_hf_dataset silently changed from 'test' to 'train'
The default value of the split parameter in load_hf_dataset was changed from "test" to "train", but the docstring on line 242 still says "Defaults to 'test'", and API_REFERENCE.md:1601 also documents the old default. Any caller relying on the default (i.e., calling load_hf_dataset("some_dataset") without explicitly passing split) will now silently load the training split instead of the test split.
Root cause and impact
The function signature at insideLLMs/dataset_utils.py:229 changed:
```python
# Old:
def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> ...: ...
# New:
def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> ...: ...
```
This is a silent behavioral change. The `_config_loader.py:591` caller is unaffected because it uses `config.get("split", "test")`, but any direct caller of `load_hf_dataset` that omits the `split` argument will now silently receive training data instead of test data. The docstring (line 242: "Defaults to 'test'") and API_REFERENCE.md are now inconsistent with the actual behavior.
Impact: Users running evaluations without specifying split will unknowingly evaluate against training data instead of test data, potentially producing misleading benchmark results.
```suggestion
def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> list[dict[str, Any]]:
```
```diff
 except Exception as e:
     if self.circuit_breaker:
         self.circuit_breaker.record_failure()
-    raise
+    raise e
```
🟡 raise e instead of raise alters traceback and is inconsistent with sync version
In execute_async, the exception re-raise was changed from raise to raise e, which alters the traceback information and is inconsistent with the synchronous execute method.
Detailed explanation
At insideLLMs/rate_limiting.py:2899-2902, the code was changed:
```python
except Exception as e:
    if self.circuit_breaker:
        self.circuit_breaker.record_failure()
    raise e  # was: raise
```
In Python 3, a bare `raise` re-raises the caught exception with its traceback untouched, while `raise e` appends an extra frame pointing at the `raise e` statement, cluttering the displayed traceback with a redundant entry at the re-raise site.
The synchronous counterpart at insideLLMs/rate_limiting.py:2827-2830 correctly uses a bare `raise`:
```python
except Exception as _:
    if self.circuit_breaker:
        self.circuit_breaker.record_failure()
    raise
```
Impact: tracebacks from async failures gain a redundant frame at the `raise e` line, adding noise when locating the original failure inside `func()`.
```suggestion
except Exception as _:
    if self.circuit_breaker:
        self.circuit_breaker.record_failure()
    raise
```
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (4)
insideLLMs/safety.py (1)
735-735: ⚠️ Potential issue | 🟠 Major
Add type parameters to `Pattern` annotations to fix mypy strict checking.
Lines 735 and 1110 use bare `Pattern`, which violates the strict mypy configuration for `insideLLMs.safety` (configured with `disallow_untyped_defs = true` and `type-arg` errors enabled).
Fix
- def __init__(self, patterns: Optional[dict[str, Pattern]] = None): + def __init__(self, patterns: Optional[dict[str, Pattern[str]]] = None): - self._custom_patterns: list[tuple[str, Pattern, RiskLevel]] = [] + self._custom_patterns: list[tuple[str, Pattern[str], RiskLevel]] = []Required per the coding guidelines for
insideLLMs/**/*.py: Typecheck Python code using Mypy on theinsideLLMsmodule.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/safety.py` at line 735, The bare Pattern type used in annotations must be parameterized; update the annotations to use Pattern[str] (e.g., change def __init__(self, patterns: Optional[dict[str, Pattern]] = None) to def __init__(self, patterns: Optional[dict[str, Pattern[str]]] = None)) and do the same for the other occurrence around line 1110 (replace Pattern with Pattern[str]); also ensure the Pattern symbol you import (from typing or re) supports the type parameter and adjust the import if necessary.insideLLMs/experiment_tracking.py (1)
2504-2512: ⚠️ Potential issue | 🟠 Major
Add rollback on partial startup failure in `MultiTracker.start_run`.
If one tracker fails at Line 2506, previously started trackers stay active. This can leak runs and leave inconsistent state.
💡 Suggested fix
```diff
     def start_run(
         self,
         run_name: Optional[str] = None,
         run_id: Optional[str] = None,
         nested: bool = False,
     ) -> str:
@@
-        run_ids = []
-        for tracker in self.trackers:
-            rid = tracker.start_run(run_name=run_name, run_id=run_id, nested=nested)
-            run_ids.append(rid)
+        run_ids: list[str] = []
+        started: list[ExperimentTracker] = []
+        try:
+            for tracker in self.trackers:
+                rid = tracker.start_run(run_name=run_name, run_id=run_id, nested=nested)
+                started.append(tracker)
+                run_ids.append(rid)
+        except Exception:
+            for started_tracker in reversed(started):
+                try:
+                    started_tracker.end_run(status="failed")
+                except Exception:
+                    pass
+            raise
         first_run_id = run_ids[0]
         self._run_active = True
         self._run_id = first_run_id
         return first_run_id
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/experiment_tracking.py` around lines 2504 - 2512, MultiTracker.start_run currently loops through self.trackers calling tracker.start_run and if one raises, previously started trackers remain active; modify start_run so that if any tracker.start_run raises an exception you iterate over the trackers that already started (use the collected run_ids or tracker slice) and call the appropriate cleanup method (e.g., tracker.stop_run or tracker.end_run — whichever the tracker API provides) for each started tracker, catching and logging any cleanup exceptions but not suppressing the original error, then re-raise the original exception; finally, only set self._run_active and self._run_id after all trackers have successfully started and return the first_run_id as before.insideLLMs/dataset_utils.py (1)
229-255: ⚠️ Potential issue | 🟠 Major
`load_hf_dataset` default split change is a silent API break and conflicts with docs.
Line 229 defaults to `"train"`, but Line 242 still states default `"test"`. This will change behavior for callers that omit `split`.
💡 Suggested fix (if backward compatibility is intended)
```diff
-def load_hf_dataset(dataset_name: str, split: str = "train", **kwargs) -> list[dict[str, Any]]:
+def load_hf_dataset(dataset_name: str, split: str = "test", **kwargs) -> list[dict[str, Any]]:
```
Also align the Returns section: it currently says "Returns None only if the dataset is empty," but the function returns `list[dict[str, Any]]` and already returns `[]` for empty datasets.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/dataset_utils.py` around lines 229 - 255, The docstring and function signature for load_hf_dataset disagree about the default split (signature sets split="train" while the Args text says default "test") and the Returns section incorrectly claims the function can return None; update them to be consistent: decide which default split you want to keep (if preserving backward compatibility, change the signature to split="test"; otherwise update the Args text to state default "train"), and change the Returns text to state that an empty dataset yields an empty list ([]) rather than None; ensure the function definition name load_hf_dataset and its docstring are updated together so callers and docs match.insideLLMs/contrib/debugging.py (1)
1681-2034: ⚠️ Potential issue | 🟡 Minor
Update method return docs to match new `Optional[DebugTraceEvent]` signatures.
These methods now correctly return `Optional[...]`, but their Returns sections still describe non-optional `DebugTraceEvent`, which creates a stale API contract.
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/contrib/debugging.py` around lines 1681 - 2034, Update the Returns docstrings to reflect the Optional[DebugTraceEvent] return type for the changed APIs: step_start, step_end, log_prompt, log_response, log_variable, log_error, and log_warning. For each of these methods, change the "Returns" section text to state that the method returns an Optional[DebugTraceEvent] (i.e., a DebugTraceEvent or None) and preserve any existing descriptive details (e.g., truncation behavior) so callers understand when a None might be returned; keep the examples and See Also as-is. Ensure the method signatures already showing Optional[...] remain consistent with the updated Returns description.
🧹 Nitpick comments (8)
insideLLMs/schemas/registry.py (1)
470-473: Narrow the caught exception types in version parsing.
At Line 472, `except Exception` is broader than needed here and can hide unrelated defects. Catch only the parse-related failures.
Proposed patch
-        except Exception as _:
+        except (ValueError, IndexError):
             return (0, 0, 0)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/schemas/registry.py` around lines 470 - 473, The version-parsing try/except in insideLLMs/schemas/registry.py currently catches Exception too broadly; change the broad except to only catch parse-related errors (e.g., except (ValueError, IndexError, TypeError):) around the block that returns (int(parts[0]), int(parts[1]), int(parts[2])) so only invalid integer conversions, missing parts, or wrong types are handled and unrelated exceptions still surface.
insideLLMs/semantic_cache.py (1)
1036-1038: Consider adding debug-level logging in Redis exception fallbacks.
These handlers intentionally degrade gracefully, but fully swallowing the exception makes production diagnosis difficult when Redis connectivity/serialization issues occur.
Also applies to: 1076-1077, 1085-1086, 1095-1096, 1104-1106, 1118-1119
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/semantic_cache.py` around lines 1036 - 1038, In each Redis exception fallback (the except Exception as _ blocks that increment self._stats.misses and return None at/around the shown sites and the other ranges 1076-1077, 1085-1086, 1095-1096, 1104-1106, 1118-1119), add a debug-level log that records the caught exception and relevant context (operation name, key/id, and any serialization details available) instead of fully swallowing it; use the existing logger on the object (e.g., self._logger or module logger) and include the exception info/traceback (exception instance) so the message gives actionable diagnostics while preserving the current graceful fallback behavior.
insideLLMs/contrib/chains.py (1)
2563-2565: Preserve error-handler failure context before breaking.
If the custom `error_handler` itself fails, the exception is dropped. Capture and record it so failures remain debuggable.
Suggested adjustment
-                except Exception as _:
+                except Exception as handler_exc:
+                    errors.append(f"{step.name}: error_handler failed: {handler_exc}")
                     state.status = ChainStatus.FAILED
                     break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/contrib/chains.py` around lines 2563 - 2565, The except block currently swallows any Exception as "_" then sets state.status = ChainStatus.FAILED and breaks, losing the original failure context; change it to capture the exception (e.g., "except Exception as err:"), record the error into the chain state (for example set state.error or state.error_message to err or str(err)) and/or log it, then set state.status = ChainStatus.FAILED and break so the original exception context is preserved for debugging; reference the error_handler invocation and the state/ChainStatus.FAILED update when making this change.
insideLLMs/retry.py (2)
2327-2327: Prefer explicit module exports over globals-derived `__all__`.
Line 2327 exports every non-underscore binding in module globals, including incidental imports. An explicit `__all__` keeps the public API stable and avoids accidental surface growth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/retry.py` at line 2327, The module currently sets __all__ dynamically via a globals() comprehension which exports every non-underscore name (risking accidental exports); replace that line with an explicit __all__ list containing only the intended public API symbols (e.g., the public functions, classes, and constants defined in this module such as RetryPolicy, retry, backoff, etc.—use the actual names present in insideLLMs/retry.py), remove the globals()-based expression, and update this explicit list whenever you add or remove public symbols so the module export surface remains stable.
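The difference between the two export styles can be sketched in isolation; the namespace contents and allowlisted names below are illustrative placeholders, not retry.py's actual API:

```python
import time

# Simulate a module namespace the way a globals()-derived __all__ sees it.
module_globals = {"time": time, "RetryPolicy": object, "retry": object, "_private": 1}

# Today's dynamic export sweeps in the incidental `time` import:
dynamic_all = [name for name in module_globals if not name.startswith("_")]

# The suggested explicit allowlist keeps only intended symbols:
explicit_all = ["RetryPolicy", "retry"]
```

With the dynamic form, adding any new import silently widens the public API; the explicit list changes only when a maintainer edits it.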
1118-1119: Fail fast on invariant break instead of synthesizing a retry cause.
On Line 1118 and Line 1266, if `last_exception` is still `None`, that indicates an internal invariant break. Converting it into a synthetic `RuntimeError` and then raising `RetryExhaustedError` can blur the real failure mode. Raise directly at the invariant boundary.
Suggested patch
-        if last_exception is None:
-            last_exception = RuntimeError("Retry exhausted without captured exception")
+        if last_exception is None:
+            raise RuntimeError("Retry exhausted without captured exception")
@@
-        if last_exception is None:
-            last_exception = RuntimeError("Retry exhausted without captured exception")
+        if last_exception is None:
+            raise RuntimeError("Retry exhausted without captured exception")
Also applies to: 1266-1267
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/retry.py` around lines 1118 - 1119, The code currently synthesizes a fake RuntimeError when last_exception is None before raising RetryExhaustedError; instead, treat last_exception == None as an invariant violation and raise immediately (e.g., raise AssertionError or a clear RuntimeError with an "internal invariant broken: last_exception is None" message) in the same locations where the synthetic exception is created (the blocks that check last_exception prior to raising RetryExhaustedError). Remove the synthetic-wrapping logic so the function fails fast on the invariant breach rather than embedding a fabricated cause.
insideLLMs/cost_tracking.py (1)
1654-1654: Prefer explicit `__all__` to avoid accidental public API expansion.
Building `__all__` from `globals()` exports imported helpers and future globals unintentionally. An explicit allowlist keeps the module boundary stable.
♻️ Suggested change
-__all__ = [name for name in globals() if not name.startswith("_")]  # pyright: ignore[reportUnsupportedDunderAll]
+__all__ = [
+    "PricingTier",
+    "CostCategory",
+    "AlertLevel",
+    "TimeGranularity",
+    "ModelPricing",
+    "UsageRecord",
+    "Budget",
+    "BudgetAlert",
+    "CostSummary",
+    "CostForecast",
+    "PricingRegistry",
+    "TokenCostCalculator",
+    "UsageTracker",
+    "BudgetManager",
+    "CostForecaster",
+    "CostReporter",
+    "calculate_cost",
+    "estimate_request_cost",
+    "compare_model_costs",
+    "get_cheapest_model",
+    "create_usage_tracker",
+    "create_budget_manager",
+    "quick_cost_estimate",
+    "CostCalculator",
+]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/cost_tracking.py` at line 1654, The module currently builds __all__ dynamically using globals(), which accidentally exports imports and future names; replace that line with an explicit allowlist by listing only the intended public names (functions, classes, constants) exported by insideLLMs.cost_tracking (e.g., __all__ = ["PublicClassA", "public_function_b", ...]) and remove the globals()-based expression (and the pyright ignore) so the public API is stable and only the declared symbols are exported; update the list whenever you intentionally add or remove public symbols.
insideLLMs/contrib/routing.py (1)
2303-2305: Add a direct regression test for `model=None` in `_classify_with_model`.
The new early return is valuable, but there isn’t an explicit assertion for this direct-call path in the referenced branch-coverage tests. Add one test to lock behavior to `("unknown", 0.0)`.
As per coding guidelines, ensure test coverage is at least 95% for the `insideLLMs` module (coverage gating: `--cov-fail-under=95`).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/contrib/routing.py` around lines 2303 - 2305, Add a direct regression test that calls the _classify_with_model function with self.model set to None and asserts it returns the tuple ("unknown", 0.0); specifically, instantiate or mock the object that defines _classify_with_model (the routing class in insideLLMs.contrib.routing), set its model attribute to None, call _classify_with_model and assert the return equals ("unknown", 0.0); place the test in the existing test suite for insideLLMs (new test name e.g. test_classify_with_model_none) so module coverage stays >=95% to satisfy the coverage gating.
insideLLMs/rate_limiting.py (1)
3429-3429: Prefer explicit `__all__` over dynamic globals export for API stability.
The current dynamic `__all__` unintentionally exports 46 names, including imported modules (asyncio, datetime, random, threading, time, inspect), type imports (Any, Optional, Callable, Enum, TypeVar), and utilities (deque, dataclass, field, wraps). These are implementation details that shouldn't be part of the public API. An explicit `__all__` listing only intended exports (e.g., classes, decorators, handlers) will clarify the stable API surface and prevent drift from unrelated changes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@insideLLMs/rate_limiting.py` at line 3429, The module currently builds __all__ dynamically (the comprehension at the end) which exports many internal imports; replace this with an explicit __all__ list containing only the intended public API names (e.g., the module's public classes, decorators and handler functions) by auditing the module for the real public symbols (RateLimiter-like classes, any decorators such as rate_limit, exceptions like RateLimitExceeded, public factory/getter functions) and listing them explicitly in __all__ instead of using the globals() comprehension so implementation details (asyncio, datetime, type imports, utilities) are not exported.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@insideLLMs/contrib/agents.py`:
- Around line 3322-3324: The public contract of create_python_tool no longer
matches its implementation: when called with allow_exec=True it only runs
arithmetic assignments/expressions via _safe_exec_python_subset but the
function/docstring still says it will “execute Python code”; either restore
broader execution behavior or update the public contract/docs and name to
reflect the restricted arithmetic-only execution. Modify the create_python_tool
docstring and any user-facing description to state “execute arithmetic-only
Python snippets” (or similar) and ensure symbols mentioned (create_python_tool,
allow_exec, and _safe_exec_python_subset) are updated so callers and tests are
aware of the narrowed behavior.
- Around line 3019-3081: The evaluator (_safe_exec_python_subset,
_eval_exec_expression and _apply_binary_operator) currently allows expressions
that can cause computational DoS (huge literals or exponents); add explicit
resource bounds: enforce a max number of statements in parsed.body in
_safe_exec_python_subset, reject ast.Constant values whose absolute value or
bit-length exceed a MAX_NUM_BITS, and reject ast.BinOp with ast.Pow where the
exponent or base exceed safe thresholds (e.g., max exponent magnitude or
bit-length) before evaluation; also enforce a maximum AST recursion/depth when
walking nodes in _eval_exec_expression and return clear ValueError messages when
limits are hit. Ensure checks run prior to performing any arithmetic and
reference the existing functions (_safe_exec_python_subset,
_eval_exec_expression, _apply_binary_operator) so the evaluator refuses
oversized constants/exponents and overly deep/long programs.
In `@insideLLMs/contrib/synthesis.py`:
- Around line 2546-2547: The except clause binds an unused variable `e` even
though the exception context is captured via logger.debug(..., exc_info=True);
remove the unnecessary exception binding by changing the except clause that
currently reads "except Exception as e:" to just "except Exception:" in the file
and ensure the logger.debug call (line shown) remains unchanged; look for the
except block containing logger.debug("Balance augmentation generation failed",
exc_info=True) to apply this edit.
In `@insideLLMs/dataset_utils.py`:
- Around line 69-73: The try/except around importlib.import_module("datasets")
currently catches ImportError too broadly; change it to catch
ModuleNotFoundError so missing optional package results in _datasets_module =
None, but real import errors inside the package still propagate. Specifically,
update the try/except that assigns _datasets_module (the
importlib.import_module("datasets") call) to use "except ModuleNotFoundError:
_datasets_module = None" (or add a separate except Exception: raise to re-raise
non-missing-package errors) so only the absent-package case is swallowed.
In `@insideLLMs/experiment_tracking.py`:
- Around line 119-124: The helper _load_optional_module currently catches
ImportError broadly and can mask transitive import failures; change the
exception handling to first catch ModuleNotFoundError as e and return None only
when e.name == module_name (re-raise otherwise), and then optionally keep a
general except ImportError: return None for other ImportError cases; reference
the function name _load_optional_module and the variable module_name to locate
where to adjust the try/except logic.
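The narrowed pattern this prompt describes might look like the sketch below; `load_optional_module` is a stand-alone reconstruction for illustration, not the module's actual `_load_optional_module`:

```python
import importlib


def load_optional_module(module_name: str):
    """Return the module if installed; None only when the target itself is absent."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name == module_name:
            return None  # the optional dependency is simply not installed
        raise  # a *transitive* dependency is missing: a real bug, let it surface
```

The key distinction is `exc.name`: a missing transitive import inside an installed package raises `ModuleNotFoundError` naming the transitive module, so it propagates instead of being mistaken for "optional dependency absent".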
In `@insideLLMs/models/openai.py`:
- Line 454: When extracting the first message inside the chat exception handler,
avoid indexing into messages directly since messages[0]["content"] can raise and
mask the original exception; update the extraction of first_msg (the code
referencing messages and first_msg) to safely access content (e.g., check
messages truthiness and type of messages[0], use .get("content", "") or a guard)
or wrap only that access in its own try/except and preserve/re-raise/log the
original exception separately so the root error isn't hidden.
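A defensive extraction along the suggested lines could be factored out like this; `first_message_content` is a hypothetical helper name, not code from `openai.py`:

```python
def first_message_content(messages) -> str:
    """Extract the first message's content without ever raising, so a
    diagnostic path inside an exception handler can't mask the root error."""
    if not isinstance(messages, (list, tuple)) or not messages:
        return ""
    first = messages[0]
    if not isinstance(first, dict):
        return ""
    return str(first.get("content", ""))
```

Every failure mode (None, empty list, non-dict first element, missing key) degrades to `""` rather than raising inside the handler.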
In `@insideLLMs/rate_limiting.py`:
- Around line 2899-2902: The except block currently re-raises the caught
exception with "raise e", which loses the original traceback; change the
re-raise to a bare "raise" so the original traceback is preserved (update the
block that calls self.circuit_breaker.record_failure() and replace "raise e"
with "raise").
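The `raise` vs `raise e` difference is easy to demonstrate in isolation; the circuit breaker below is a minimal stand-in, not the class from `rate_limiting.py`:

```python
class FakeBreaker:
    """Minimal stand-in for the circuit breaker; only counts failures."""

    def __init__(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1


breaker = FakeBreaker()


def guarded_call(fn):
    try:
        return fn()
    except Exception:
        breaker.record_failure()
        raise  # bare `raise` re-raises with the original traceback intact


def failing_call():
    raise ValueError("upstream failure")
```

A bare `raise` re-raises the active exception as-is, whereas `raise e` appends the re-raise site to the traceback and obscures where the failure actually originated.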
---
Outside diff comments:
In `@insideLLMs/contrib/debugging.py`:
- Around line 1681-2034: Update the Returns docstrings to reflect the
Optional[DebugTraceEvent] return type for the changed APIs: step_start,
step_end, log_prompt, log_response, log_variable, log_error, and log_warning.
For each of these methods, change the "Returns" section text to state that the
method returns an Optional[DebugTraceEvent] (i.e., a DebugTraceEvent or None)
and preserve any existing descriptive details (e.g., truncation behavior) so
callers understand when a None might be returned; keep the examples and See Also
as-is. Ensure the method signatures already showing Optional[...] remain
consistent with the updated Returns description.
In `@insideLLMs/dataset_utils.py`:
- Around line 229-255: The docstring and function signature for load_hf_dataset
disagree about the default split (signature sets split="train" while the Args
text says default "test") and the Returns section incorrectly claims the
function can return None; update them to be consistent: decide which default
split you want to keep (if preserving backward compatibility, change the
signature to split="test"; otherwise update the Args text to state default
"train"), and change the Returns text to state that an empty dataset yields an
empty list ([]) rather than None; ensure the function definition name
load_hf_dataset and its docstring are updated together so callers and docs
match.
In `@insideLLMs/experiment_tracking.py`:
- Around line 2504-2512: MultiTracker.start_run currently loops through
self.trackers calling tracker.start_run and if one raises, previously started
trackers remain active; modify start_run so that if any tracker.start_run raises
an exception you iterate over the trackers that already started (use the
collected run_ids or tracker slice) and call the appropriate cleanup method
(e.g., tracker.stop_run or tracker.end_run — whichever the tracker API provides)
for each started tracker, catching and logging any cleanup exceptions but not
suppressing the original error, then re-raise the original exception; finally,
only set self._run_active and self._run_id after all trackers have successfully
started and return the first_run_id as before.
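The rollback shape this prompt asks for can be sketched with fake trackers; `FakeTracker` and `start_all` are illustrative stand-ins for the real `ExperimentTracker`/`MultiTracker` interfaces:

```python
import logging

logger = logging.getLogger(__name__)


class FakeTracker:
    """Stand-in for a tracker backend exposing start_run/end_run."""

    def __init__(self, fail=False):
        self.fail = fail
        self.ended = False

    def start_run(self, run_name):
        if self.fail:
            raise RuntimeError("backend unavailable")
        return f"run-{run_name}"

    def end_run(self, status="finished"):
        self.ended = True


def start_all(trackers, run_name):
    """Start every tracker; on failure, end the already-started ones and re-raise."""
    started, run_ids = [], []
    try:
        for tracker in trackers:
            run_ids.append(tracker.start_run(run_name))
            started.append(tracker)
    except Exception:
        for tracker in started:
            try:
                tracker.end_run(status="failed")
            except Exception:
                # Log cleanup failures but never suppress the original error.
                logger.exception("cleanup failed for %r", tracker)
        raise
    return run_ids[0] if run_ids else None
```

Only trackers that actually started get cleaned up, the cleanup loop cannot swallow the original exception, and state is committed only after all backends succeed.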
In `@insideLLMs/safety.py`:
- Line 735: The bare Pattern type used in annotations must be parameterized;
update the annotations to use Pattern[str] (e.g., change def __init__(self,
patterns: Optional[dict[str, Pattern]] = None) to def __init__(self, patterns:
Optional[dict[str, Pattern[str]]] = None)) and do the same for the other
occurrence around line 1110 (replace Pattern with Pattern[str]); also ensure the
Pattern symbol you import (from typing or re) supports the type parameter and
adjust the import if necessary.
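The parameterized annotation might look like this minimal sketch; `Redactor` is illustrative, not the class in `safety.py`:

```python
import re
from typing import Optional


class Redactor:
    # Parameterized per the review: Pattern[str], not bare Pattern.
    # re.Pattern supports subscripting on Python 3.9+; typing.Pattern was
    # deprecated in 3.8 and removed in 3.12, so re.Pattern is the safer import.
    def __init__(self, patterns: Optional[dict[str, re.Pattern[str]]] = None) -> None:
        self.patterns: dict[str, re.Pattern[str]] = patterns or {
            "email": re.compile(r"[^@\s]+@[^@\s]+")
        }
```

With `Pattern[str]` the type checker knows `.search()` accepts `str` and returns `Optional[re.Match[str]]`, which a bare `Pattern` leaves unparameterized.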
---
Nitpick comments:
In `@insideLLMs/contrib/chains.py`:
- Around line 2563-2565: The except block currently swallows any Exception as
"_" then sets state.status = ChainStatus.FAILED and breaks, losing the original
failure context; change it to capture the exception (e.g., "except Exception as
err:"), record the error into the chain state (for example set state.error or
state.error_message to err or str(err)) and/or log it, then set state.status =
ChainStatus.FAILED and break so the original exception context is preserved for
debugging; reference the error_handler invocation and the
state/ChainStatus.FAILED update when making this change.
In `@insideLLMs/contrib/routing.py`:
- Around line 2303-2305: Add a direct regression test that calls the
_classify_with_model function with self.model set to None and asserts it returns
the tuple ("unknown", 0.0); specifically, instantiate or mock the object that
defines _classify_with_model (the routing class in insideLLMs.contrib.routing),
set its model attribute to None, call _classify_with_model and assert the return
equals ("unknown", 0.0); place the test in the existing test suite for
insideLLMs (new test name e.g. test_classify_with_model_none) so module coverage
stays >=95% to satisfy the coverage gating.
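The requested regression test could be shaped as follows; `FakeRouter` only models the guarded branch and is not the real class in `insideLLMs.contrib.routing`:

```python
class FakeRouter:
    """Sketch of the routing object; only the model=None branch is modeled."""

    def __init__(self, model=None):
        self.model = model

    def _classify_with_model(self, text):
        if self.model is None:
            return ("unknown", 0.0)  # the early return the test locks in
        return ("matched", 0.9)  # placeholder for the real classification path


def test_classify_with_model_none():
    router = FakeRouter(model=None)
    assert router._classify_with_model("any prompt") == ("unknown", 0.0)


test_classify_with_model_none()
```

In the real suite the test would instantiate (or mock) the actual routing class, set `model = None`, and assert the same tuple so the branch stays covered.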
In `@insideLLMs/cost_tracking.py`:
- Line 1654: The module currently builds __all__ dynamically using globals(),
which accidentally exports imports and future names; replace that line with an
explicit allowlist by listing only the intended public names (functions,
classes, constants) exported by insideLLMs.cost_tracking (e.g., __all__ =
["PublicClassA", "public_function_b", ...]) and remove the globals()-based
expression (and the pyright ignore) so the public API is stable and only the
declared symbols are exported; update the list whenever you intentionally add or
remove public symbols.
In `@insideLLMs/rate_limiting.py`:
- Line 3429: The module currently builds __all__ dynamically (the comprehension
at the end) which exports many internal imports; replace this with an explicit
__all__ list containing only the intended public API names (e.g., the module's
public classes, decorators and handler functions) by auditing the module for the
real public symbols (RateLimiter-like classes, any decorators such as
rate_limit, exceptions like RateLimitExceeded, public factory/getter functions)
and listing them explicitly in __all__ instead of using the globals()
comprehension so implementation details (asyncio, datetime, type imports,
utilities) are not exported.
In `@insideLLMs/retry.py`:
- Line 2327: The module currently sets __all__ dynamically via a globals()
comprehension which exports every non-underscore name (risking accidental
exports); replace that line with an explicit __all__ list containing only the
intended public API symbols (e.g., the public functions, classes, and constants
defined in this module such as RetryPolicy, retry, backoff, etc.—use the actual
names present in insideLLMs/retry.py), remove the globals()-based expression,
and update this explicit list whenever you add or remove public symbols so the
module export surface remains stable.
- Around line 1118-1119: The code currently synthesizes a fake RuntimeError when
last_exception is None before raising RetryExhaustedError; instead, treat
last_exception == None as an invariant violation and raise immediately (e.g.,
raise AssertionError or a clear RuntimeError with an "internal invariant broken:
last_exception is None" message) in the same locations where the synthetic
exception is created (the blocks that check last_exception prior to raising
RetryExhaustedError). Remove the synthetic-wrapping logic so the function fails
fast on the invariant breach rather than embedding a fabricated cause.
In `@insideLLMs/schemas/registry.py`:
- Around line 470-473: The version-parsing try/except in
insideLLMs/schemas/registry.py currently catches Exception too broadly; change
the broad except to only catch parse-related errors (e.g., except (ValueError,
IndexError, TypeError):) around the block that returns (int(parts[0]),
int(parts[1]), int(parts[2])) so only invalid integer conversions, missing
parts, or wrong types are handled and unrelated exceptions still surface.
In `@insideLLMs/semantic_cache.py`:
- Around line 1036-1038: In each Redis exception fallback (the except Exception
as _ blocks that increment self._stats.misses and return None at/around the
shown sites and the other ranges 1076-1077, 1085-1086, 1095-1096, 1104-1106,
1118-1119), add a debug-level log that records the caught exception and relevant
context (operation name, key/id, and any serialization details available)
instead of fully swallowing it; use the existing logger on the object (e.g.,
self._logger or module logger) and include the exception info/traceback
(exception instance) so the message gives actionable diagnostics while
preserving the current graceful fallback behavior.
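The logging-without-breaking-fallback pattern this prompt describes can be sketched with an in-memory backend standing in for Redis; `CacheSketch` and its attribute names are illustrative:

```python
import logging

logger = logging.getLogger("semantic_cache")


class Stats:
    def __init__(self):
        self.misses = 0


class CacheSketch:
    """Sketch of the suggested fallback: degrade gracefully, leave a debug trail."""

    def __init__(self, backend):
        self._backend = backend
        self._stats = Stats()

    def get(self, key):
        try:
            return self._backend[key]  # stand-in for the Redis round-trip
        except Exception as exc:
            # Keep the graceful miss, but record operation, key, and traceback.
            logger.debug("cache get failed for key=%r", key, exc_info=exc)
            self._stats.misses += 1
            return None
```

The caller still sees a plain cache miss, but with debug logging enabled the operator gets the exception type, the key involved, and a full traceback.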
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7c67e7c8-8537-4e32-8cd6-49f64b64d552
📒 Files selected for processing (40)
insideLLMs/async_utils.py, insideLLMs/cli/_record_utils.py, insideLLMs/cli/commands/doctor.py, insideLLMs/config.py, insideLLMs/contrib/adapters.py, insideLLMs/contrib/agents.py, insideLLMs/contrib/chains.py, insideLLMs/contrib/debugging.py, insideLLMs/contrib/ensemble.py, insideLLMs/contrib/evalbom.py, insideLLMs/contrib/hitl.py, insideLLMs/contrib/routing.py, insideLLMs/contrib/synthesis.py, insideLLMs/cost_tracking.py, insideLLMs/dataset_utils.py, insideLLMs/experiment_tracking.py, insideLLMs/integrations/langchain.py, insideLLMs/models/base.py, insideLLMs/models/openai.py, insideLLMs/probes/agent_probe.py, insideLLMs/rate_limiting.py, insideLLMs/registry.py, insideLLMs/resources.py, insideLLMs/retry.py, insideLLMs/runtime/__init__.py, insideLLMs/runtime/_base.py, insideLLMs/runtime/_config_loader.py, insideLLMs/runtime/_high_level.py, insideLLMs/runtime/_result_utils.py, insideLLMs/runtime/reproducibility.py, insideLLMs/safety.py, insideLLMs/schemas/registry.py, insideLLMs/semantic_cache.py, insideLLMs/shadow.py, insideLLMs/structured_extraction.py, insideLLMs/trace/tracing.py, insideLLMs/types.py, tests/test_models_config_coverage.py, tests/test_probes_models_coverage.py, wiki/tutorials/Bias-Testing.md
def _apply_binary_operator(
    op: ast.operator, left: Union[int, float], right: Union[int, float]
) -> Union[int, float]:
    if isinstance(op, ast.Add):
        return left + right
    if isinstance(op, ast.Sub):
        return left - right
    if isinstance(op, ast.Mult):
        return left * right
    if isinstance(op, ast.Div):
        return left / right
    if isinstance(op, ast.FloorDiv):
        return left // right
    if isinstance(op, ast.Mod):
        return left % right
    if isinstance(op, ast.Pow):
        return left**right
    raise ValueError("Unsupported arithmetic operator")


def _apply_unary_operator(op: ast.unaryop, operand: Union[int, float]) -> Union[int, float]:
    if isinstance(op, ast.UAdd):
        return +operand
    if isinstance(op, ast.USub):
        return -operand
    raise ValueError("Unsupported unary operator")


def _eval_exec_expression(
    node: ast.AST, variables: dict[str, Union[int, float]]
) -> Union[int, float]:
    if isinstance(node, ast.Constant):
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError("Only numeric constants are allowed")
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in variables:
            raise ValueError(f"Unknown variable: {node.id}")
        return variables[node.id]
    if isinstance(node, ast.BinOp):
        left = _eval_exec_expression(node.left, variables)
        right = _eval_exec_expression(node.right, variables)
        return _apply_binary_operator(node.op, left, right)
    if isinstance(node, ast.UnaryOp):
        return _apply_unary_operator(node.op, _eval_exec_expression(node.operand, variables))
    raise ValueError("Unsupported expression")


def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]:
    """Execute a tightly restricted Python subset (assignments + arithmetic)."""
    parsed = ast.parse(code, mode="exec")
    variables: dict[str, Union[int, float]] = {}
    for statement in parsed.body:
        if isinstance(statement, ast.Assign):
            if len(statement.targets) != 1 or not isinstance(statement.targets[0], ast.Name):
                raise ValueError("Only simple variable assignments are supported")
            variables[statement.targets[0].id] = _eval_exec_expression(statement.value, variables)
            continue
        if isinstance(statement, ast.Expr):
            _eval_exec_expression(statement.value, variables)
            continue
        raise ValueError("Only arithmetic expressions and assignments are supported")
    return variables
Add resource bounds to the restricted evaluator.
This path is safer than raw exec, but it still allows computational DoS via huge literals/exponents (for example, very large ** operations). Add explicit complexity limits before evaluation.
💡 Proposed hardening patch
+MAX_CODE_CHARS = 4096
+MAX_AST_NODES = 256
+MAX_POWER_EXPONENT = 1000
+
def _apply_binary_operator(
op: ast.operator, left: Union[int, float], right: Union[int, float]
) -> Union[int, float]:
@@
if isinstance(op, ast.Pow):
- return left**right
+ if abs(right) > MAX_POWER_EXPONENT:
+ raise ValueError("Exponent too large")
+ return left**right
@@
def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]:
"""Execute a tightly restricted Python subset (assignments + arithmetic)."""
+ if len(code) > MAX_CODE_CHARS:
+ raise ValueError("Code snippet too large")
parsed = ast.parse(code, mode="exec")
+ if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES:
+ raise ValueError("Expression too complex")
     variables: dict[str, Union[int, float]] = {}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def _apply_binary_operator( | |
| op: ast.operator, left: Union[int, float], right: Union[int, float] | |
| ) -> Union[int, float]: | |
| if isinstance(op, ast.Add): | |
| return left + right | |
| if isinstance(op, ast.Sub): | |
| return left - right | |
| if isinstance(op, ast.Mult): | |
| return left * right | |
| if isinstance(op, ast.Div): | |
| return left / right | |
| if isinstance(op, ast.FloorDiv): | |
| return left // right | |
| if isinstance(op, ast.Mod): | |
| return left % right | |
| if isinstance(op, ast.Pow): | |
| return left**right | |
| raise ValueError("Unsupported arithmetic operator") | |
| def _apply_unary_operator(op: ast.unaryop, operand: Union[int, float]) -> Union[int, float]: | |
| if isinstance(op, ast.UAdd): | |
| return +operand | |
| if isinstance(op, ast.USub): | |
| return -operand | |
| raise ValueError("Unsupported unary operator") | |
| def _eval_exec_expression( | |
| node: ast.AST, variables: dict[str, Union[int, float]] | |
| ) -> Union[int, float]: | |
| if isinstance(node, ast.Constant): | |
| if isinstance(node.value, bool) or not isinstance(node.value, (int, float)): | |
| raise ValueError("Only numeric constants are allowed") | |
| return node.value | |
| if isinstance(node, ast.Name): | |
| if node.id not in variables: | |
| raise ValueError(f"Unknown variable: {node.id}") | |
| return variables[node.id] | |
| if isinstance(node, ast.BinOp): | |
| left = _eval_exec_expression(node.left, variables) | |
| right = _eval_exec_expression(node.right, variables) | |
| return _apply_binary_operator(node.op, left, right) | |
| if isinstance(node, ast.UnaryOp): | |
| return _apply_unary_operator(node.op, _eval_exec_expression(node.operand, variables)) | |
| raise ValueError("Unsupported expression") | |
| def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]: | |
| """Execute a tightly restricted Python subset (assignments + arithmetic).""" | |
| parsed = ast.parse(code, mode="exec") | |
| variables: dict[str, Union[int, float]] = {} | |
| for statement in parsed.body: | |
| if isinstance(statement, ast.Assign): | |
| if len(statement.targets) != 1 or not isinstance(statement.targets[0], ast.Name): | |
| raise ValueError("Only simple variable assignments are supported") | |
| variables[statement.targets[0].id] = _eval_exec_expression(statement.value, variables) | |
| continue | |
| if isinstance(statement, ast.Expr): | |
| _eval_exec_expression(statement.value, variables) | |
| continue | |
| raise ValueError("Only arithmetic expressions and assignments are supported") | |
| return variables | |
| MAX_CODE_CHARS = 4096 | |
| MAX_AST_NODES = 256 | |
| MAX_POWER_EXPONENT = 1000 | |
| def _apply_binary_operator( | |
| op: ast.operator, left: Union[int, float], right: Union[int, float] | |
| ) -> Union[int, float]: | |
| if isinstance(op, ast.Add): | |
| return left + right | |
| if isinstance(op, ast.Sub): | |
| return left - right | |
| if isinstance(op, ast.Mult): | |
| return left * right | |
| if isinstance(op, ast.Div): | |
| return left / right | |
| if isinstance(op, ast.FloorDiv): | |
| return left // right | |
| if isinstance(op, ast.Mod): | |
| return left % right | |
| if isinstance(op, ast.Pow): | |
| if abs(right) > MAX_POWER_EXPONENT: | |
| raise ValueError("Exponent too large") | |
| return left**right | |
| raise ValueError("Unsupported arithmetic operator") | |
| def _apply_unary_operator(op: ast.unaryop, operand: Union[int, float]) -> Union[int, float]: | |
| if isinstance(op, ast.UAdd): | |
| return +operand | |
| if isinstance(op, ast.USub): | |
| return -operand | |
| raise ValueError("Unsupported unary operator") | |
| def _eval_exec_expression( | |
| node: ast.AST, variables: dict[str, Union[int, float]] | |
| ) -> Union[int, float]: | |
| if isinstance(node, ast.Constant): | |
| if isinstance(node.value, bool) or not isinstance(node.value, (int, float)): | |
| raise ValueError("Only numeric constants are allowed") | |
| return node.value | |
| if isinstance(node, ast.Name): | |
| if node.id not in variables: | |
| raise ValueError(f"Unknown variable: {node.id}") | |
| return variables[node.id] | |
| if isinstance(node, ast.BinOp): | |
| left = _eval_exec_expression(node.left, variables) | |
| right = _eval_exec_expression(node.right, variables) | |
| return _apply_binary_operator(node.op, left, right) | |
| if isinstance(node, ast.UnaryOp): | |
| return _apply_unary_operator(node.op, _eval_exec_expression(node.operand, variables)) | |
| raise ValueError("Unsupported expression") | |
| def _safe_exec_python_subset(code: str) -> dict[str, Union[int, float]]: | |
| """Execute a tightly restricted Python subset (assignments + arithmetic).""" | |
| if len(code) > MAX_CODE_CHARS: | |
| raise ValueError("Code snippet too large") | |
| parsed = ast.parse(code, mode="exec") | |
| if sum(1 for _ in ast.walk(parsed)) > MAX_AST_NODES: | |
| raise ValueError("Expression too complex") | |
| variables: dict[str, Union[int, float]] = {} | |
| for statement in parsed.body: | |
| if isinstance(statement, ast.Assign): | |
| if len(statement.targets) != 1 or not isinstance(statement.targets[0], ast.Name): | |
| raise ValueError("Only simple variable assignments are supported") | |
| variables[statement.targets[0].id] = _eval_exec_expression(statement.value, variables) | |
| continue | |
| if isinstance(statement, ast.Expr): | |
| _eval_exec_expression(statement.value, variables) | |
| continue | |
| raise ValueError("Only arithmetic expressions and assignments are supported") | |
| return variables |
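For readers skimming the diff above, the restricted-execution idea can be shown as a condensed, self-contained sketch. The names `eval_arith` and `_ev` are illustrative (not the repository's functions), and only `+`, `-`, and `*` are wired up here:

```python
import ast
from typing import Union

Number = Union[int, float]

def eval_arith(code: str) -> dict[str, Number]:
    """Condensed sketch: simple assignments plus +, -, * arithmetic only."""
    env: dict[str, Number] = {}
    for stmt in ast.parse(code, mode="exec").body:
        if not (isinstance(stmt, ast.Assign)
                and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            raise ValueError("Only simple variable assignments are supported")
        env[stmt.targets[0].id] = _ev(stmt.value, env)
    return env

def _ev(node: ast.AST, env: dict[str, Number]) -> Number:
    if isinstance(node, ast.Constant):
        # bool is a subclass of int, so reject it explicitly
        if isinstance(node.value, bool) or not isinstance(node.value, (int, float)):
            raise ValueError("Only numeric constants are allowed")
        return node.value
    if isinstance(node, ast.Name):
        if node.id not in env:
            raise ValueError(f"Unknown variable: {node.id}")
        return env[node.id]
    if isinstance(node, ast.BinOp):
        ops = {ast.Add: lambda a, b: a + b,
               ast.Sub: lambda a, b: a - b,
               ast.Mult: lambda a, b: a * b}
        fn = ops.get(type(node.op))
        if fn is None:
            raise ValueError("Unsupported operator")
        return fn(_ev(node.left, env), _ev(node.right, env))
    raise ValueError("Unsupported expression")

result = eval_arith("x = 2 + 3\nresult = x * 4")
```

Anything outside the whitelisted node types raises `ValueError`, which is the same fail-closed posture the diff takes.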
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/contrib/agents.py` around lines 3019 - 3081, The evaluator
(_safe_exec_python_subset, _eval_exec_expression and _apply_binary_operator)
currently allows expressions that can cause computational DoS (huge literals or
exponents); add explicit resource bounds: enforce a max number of statements in
parsed.body in _safe_exec_python_subset, reject ast.Constant values whose
absolute value or bit-length exceed a MAX_NUM_BITS, and reject ast.BinOp with
ast.Pow where the exponent or base exceed safe thresholds (e.g., max exponent
magnitude or bit-length) before evaluation; also enforce a maximum AST
recursion/depth when walking nodes in _eval_exec_expression and return clear
ValueError messages when limits are hit. Ensure checks run prior to performing
any arithmetic and reference the existing functions (_safe_exec_python_subset,
_eval_exec_expression, _apply_binary_operator) so the evaluator refuses
oversized constants/exponents and overly deep/long programs.
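The bounds the prompt asks for can be sketched as a pre-evaluation pass. `MAX_NUM_BITS` and `MAX_STATEMENTS` are illustrative values, not constants from the repository:

```python
import ast

MAX_NUM_BITS = 128    # illustrative bound on integer literal size
MAX_STATEMENTS = 50   # illustrative bound on program length

def check_bounds(code: str) -> None:
    """Reject oversized literals and overly long programs before any arithmetic runs."""
    tree = ast.parse(code, mode="exec")
    if len(tree.body) > MAX_STATEMENTS:
        raise ValueError("Too many statements")
    for node in ast.walk(tree):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, int)
                and not isinstance(node.value, bool)
                and node.value.bit_length() > MAX_NUM_BITS):
            raise ValueError("Numeric literal too large")

check_bounds("x = 2 ** 10")          # small program passes
try:
    check_bounds(f"x = {1 << 200}")  # a 201-bit literal is rejected
    rejected = False
except ValueError:
    rejected = True
```

Running the checks before evaluation means a hostile snippet is refused without ever computing the oversized value.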
| # Restricted execution for arithmetic-only snippets. | ||
| local_vars = _safe_exec_python_subset(code) | ||
| return str(local_vars.get("result", "Code executed successfully")) |
Execution contract is now narrower than the public tool description.
create_python_tool(allow_exec=True) now executes only arithmetic assignments/expressions, but the function/doc text still says “execute Python code”. That is a behavior break for existing callers expecting general Python snippets.
💡 Proposed contract/docs alignment
def create_python_tool(
@@
- """Create a Python code execution tool.
+ """Create a restricted Python-subset execution tool.
@@
- Creates a tool that can execute Python code. By default, execution is
- disabled for safety. Only enable execution in trusted environments.
+ Creates a tool that executes a restricted subset of Python
+ (simple assignments + arithmetic expressions). By default, execution
+ is disabled for safety.
@@
return Tool(
name="python",
func=execute_python,
- description="Execute Python code (use result variable for output)",
+ description="Execute restricted arithmetic subset (assign to `result` for output)",
parameters=[ToolParameter("code", "str", "Python code to execute")],
)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/contrib/agents.py` around lines 3322 - 3324, The public contract
of create_python_tool no longer matches its implementation: when called with
allow_exec=True it only runs arithmetic assignments/expressions via
_safe_exec_python_subset but the function/docstring still says it will “execute
Python code”; either restore broader execution behavior or update the public
contract/docs and name to reflect the restricted arithmetic-only execution.
Modify the create_python_tool docstring and any user-facing description to state
“execute arithmetic-only Python snippets” (or similar) and ensure symbols
mentioned (create_python_tool, allow_exec, and _safe_exec_python_subset) are
updated so callers and tests are aware of the narrowed behavior.
| except Exception as e: | ||
| logger.debug("Balance augmentation generation failed", exc_info=True) |
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, verify the exact code at the target lines
echo "=== Code at lines 2546-2547 ==="
sed -n '2544,2548p' insideLLMs/contrib/synthesis.py
echo -e "\n=== Check for unused exception variable pattern ==="
rg -n 'except Exception as e' insideLLMs/contrib/synthesis.py
echo -e "\n=== Run Ruff check on F841 errors in the file ==="
ruff check insideLLMs/contrib/synthesis.py --select F841
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1098
Remove unused exception binding at line 2546.
Local variable e is assigned to but never used in the except clause. The logger.debug() call uses exc_info=True to capture exception context, so the variable is not needed.
✅ Fix
- except Exception as e:
+ except Exception:
logger.debug("Balance augmentation generation failed", exc_info=True)
🧰 Tools
🪛 GitHub Actions: CI
[error] 2546-2546: F841 Local variable 'e' is assigned to but never used. This is fixable with 'ruff --fix'.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/contrib/synthesis.py` around lines 2546 - 2547, The except clause
binds an unused variable `e` even though the exception context is captured via
logger.debug(..., exc_info=True); remove the unnecessary exception binding by
changing the except clause that currently reads "except Exception as e:" to just
"except Exception:" in the file and ensure the logger.debug call (line shown)
remains unchanged; look for the except block containing logger.debug("Balance
augmentation generation failed", exc_info=True) to apply this edit.
| try: | ||
| from datasets import load_dataset | ||
|
|
||
| HF_DATASETS_AVAILABLE = True | ||
| _datasets_module = importlib.import_module("datasets") | ||
| except ImportError: | ||
| HF_DATASETS_AVAILABLE = False | ||
| _datasets_module = None | ||
|
|
🧩 Analysis chain
🏁 Script executed:
head -n 85 insideLLMs/dataset_utils.py | tail -n +60
Repository: dr-gareth-roberts/insideLLMs
Length of output: 818
🌐 Web query:
Python 3.10 ModuleNotFoundError vs ImportError best practices optional imports
💡 Result:
ModuleNotFoundError vs ImportError (Python 3.10)
ModuleNotFoundError is a subclass of ImportError that’s raised when the module can’t be located (including the special case where sys.modules[name] is None). [1] ImportError is broader: it covers “module found but failed to import”, and also from x import y where y doesn’t exist, plus many other import-time failures. [1]
Best practices for optional imports
1) Prefer catching ModuleNotFoundError when “dependency not installed” is the only acceptable failure
This avoids accidentally hiding real bugs inside the optional dependency (or inside its import-time side effects).
try:
    import orjson
except ModuleNotFoundError:
    orjson = None
(Reason: missing-module is the specific condition ModuleNotFoundError represents. [1])
2) Catch ImportError (not just ModuleNotFoundError) when you want “usable import” semantics
Some optional deps—especially compiled extensions—may be present but still fail to import due to missing shared libraries, ABI mismatches, etc., which often surface as ImportError. If your feature should be disabled in those cases too, catch ImportError and keep the exception for diagnostics. [2]
_orjson_exc = None
try:
    import orjson
except ImportError as e:
    orjson = None
    _orjson_exc = e
3) Fail at call time with a clear message (not at import time)
This keeps base functionality working without the extra dependency.
def dumps(obj) -> bytes:
    if orjson is None:
        raise ImportError("dumps() requires 'orjson'") from _orjson_exc
    return orjson.dumps(obj)
Using raise ... from None (or from the saved exception) is the standard way to control exception chaining and avoid noisy/irrelevant context. [3]
4) If you only need to check availability, use importlib.util.find_spec()
This lets you test whether a module is discoverable without importing it (and triggering side effects). [4]
from importlib.util import find_spec
HAS_ORJSON = find_spec("orjson") is not None
5) Packaging: declare optional deps as “extras”
This is the ecosystem-standard way to make optional features installable (e.g., pip install yourpkg[orjson]). (Not a Python-runtime rule, but it complements the runtime patterns above.)
Sources
[1] Python docs – Built-in exceptions (ModuleNotFoundError is a subclass of ImportError, raised when a module can’t be located) (docs.python.org)
[2] Discussion of practical cases where ImportError is the right catch for optional compiled deps (scivision.dev)
[3] PEP 409 – raise ... from None to suppress irrelevant exception context (peps.python.org)
[4] Python docs – importlib.util.find_spec() for checking importability without importing (docs.python.org)
Citations:
- 1: https://docs.python.org/pt-br/3.10/library/exceptions.html?utm_source=openai
- 2: https://www.scivision.dev/python-importerror-vs-modulenotfound/?utm_source=openai
- 3: https://peps.python.org/pep-0409/?utm_source=openai
- 4: https://docs.python.org/3/library/importlib.html?utm_source=openai
Use ModuleNotFoundError instead of broad ImportError for optional import.
The current catch-all ImportError at line 71 can mask real import failures within the datasets module itself (e.g., missing dependencies, syntax errors during import), silently falling back to disabled functionality when the actual issue should be raised.
💡 Suggested fix
try:
_datasets_module = importlib.import_module("datasets")
-except ImportError:
- _datasets_module = None
+except ModuleNotFoundError:
+ _datasets_module = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/dataset_utils.py` around lines 69 - 73, The try/except around
importlib.import_module("datasets") currently catches ImportError too broadly;
change it to catch ModuleNotFoundError so missing optional package results in
_datasets_module = None, but real import errors inside the package still
propagate. Specifically, update the try/except that assigns _datasets_module
(the importlib.import_module("datasets") call) to use "except
ModuleNotFoundError: _datasets_module = None" (or add a separate except
Exception: raise to re-raise non-missing-package errors) so only the
absent-package case is swallowed.
| def _load_optional_module(module_name: str) -> Any | None: | ||
| """Load optional dependency module, returning None when unavailable.""" | ||
| try: | ||
| return importlib.import_module(module_name) | ||
| except ImportError: | ||
| return None |
🧩 Analysis chain
🏁 Script executed:
# First, let's check the actual code in the repository
cat -n insideLLMs/experiment_tracking.py | sed -n '119,124p'
Repository: dr-gareth-roberts/insideLLMs
Length of output: 353
🏁 Script executed:
# Check if there are any usages of _load_optional_module in the codebase
rg "_load_optional_module" -n
Repository: dr-gareth-roberts/insideLLMs
Length of output: 552
🏁 Script executed:
# Check for any tests related to this function
fd "test.*\.py$" | xargs rg "_load_optional_module" -l
Repository: dr-gareth-roberts/insideLLMs
Length of output: 56
🏁 Script executed:
# Let's also check what Python version is being targeted
cat setup.py 2>/dev/null || cat pyproject.toml 2>/dev/null | head -50
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1400
🏁 Script executed:
# Verify Python's ModuleNotFoundError behavior and attributes in Python 3.10+
python3 <<'EOF'
import sys
print(f"Python version: {sys.version}")

# Test ModuleNotFoundError attributes
try:
    import nonexistent_module_xyz
except ModuleNotFoundError as e:
    print(f"ModuleNotFoundError has 'name' attr: {hasattr(e, 'name')}")
    if hasattr(e, 'name'):
        print(f"  e.name = {e.name}")
    print(f"  str(e) = {str(e)}")

# Test transitive import failure
import tempfile
import os

# Create a test module that imports a missing dependency
with tempfile.TemporaryDirectory() as tmpdir:
    test_module_path = os.path.join(tmpdir, 'test_mod.py')
    with open(test_module_path, 'w') as f:
        f.write('import missing_dependency\n')
    sys.path.insert(0, tmpdir)
    try:
        import test_mod
    except ModuleNotFoundError as e:
        print(f"\nTransitive ModuleNotFoundError:")
        print(f"  hasattr(e, 'name'): {hasattr(e, 'name')}")
        if hasattr(e, 'name'):
            print(f"  e.name = {e.name}")
        print(f"  str(e) = {str(e)}")
    except ImportError as e:
        print(f"\nTransitive ImportError (not ModuleNotFoundError):")
        print(f"  type(e) = {type(e).__name__}")
        print(f"  str(e) = {str(e)}")
EOF
Repository: dr-gareth-roberts/insideLLMs
Length of output: 403
🏁 Script executed:
# Check if there are any tests in the repository for this module
fd "test" -t f -e py | head -20
Repository: dr-gareth-roberts/insideLLMs
Length of output: 786
🏁 Script executed:
# Verify ModuleNotFoundError is a subclass of ImportError
python3 <<'EOF'
print(f"ModuleNotFoundError is subclass of ImportError: {issubclass(ModuleNotFoundError, ImportError)}")
# Check the exception hierarchy
import inspect
print(f"\nModuleNotFoundError MRO: {inspect.getmro(ModuleNotFoundError)}")
EOF
Repository: dr-gareth-roberts/insideLLMs
Length of output: 269
🏁 Script executed:
# Search for any tests related to optional module loading
rg "optional" -i tests/ | head -20
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1610
🏁 Script executed:
# Look for tests specifically for experiment_tracking or _load_optional_module
fd "test.*experiment" -t f tests/
Repository: dr-gareth-roberts/insideLLMs
Length of output: 148
🏁 Script executed:
# Check if there are any tests in tests/ directory that test the _load_optional_module function
rg "_load_optional_module|experiment_tracking" tests/ -l
Repository: dr-gareth-roberts/insideLLMs
Length of output: 286
🏁 Script executed:
# Look at the test file for experiment_tracking to see if there are tests for optional module loading
cat tests/test_experiment_tracking_coverage.py | head -100
Repository: dr-gareth-roberts/insideLLMs
Length of output: 2852
🏁 Script executed:
# Let's check test_experiment_tracking.py as well
cat tests/test_experiment_tracking.py | head -100
Repository: dr-gareth-roberts/insideLLMs
Length of output: 2747
🏁 Script executed:
# Now let's understand the actual impact by looking at how _load_optional_module is used
# Let's examine the context around lines 128-140 where it's used
sed -n '119,145p' insideLLMs/experiment_tracking.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1103
🏁 Script executed:
# Search for how wandb, mlflow, and _torch_tensorboard are used later in the code
rg "wandb\.|mlflow\.|_torch_tensorboard" insideLLMs/experiment_tracking.py | head -30
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1414
🏁 Script executed:
# Check the test_experiment_tracking_coverage.py file for tests about optional imports
rg "ImportError|optional|_load_optional" tests/test_experiment_tracking_coverage.py -B 2 -A 5
Repository: dr-gareth-roberts/insideLLMs
Length of output: 1061
🏁 Script executed:
# Let's look for any other error handling patterns in the codebase for optional imports
rg "except.*Error" insideLLMs/experiment_tracking.py -B 2 -A 2 | head -40
Repository: dr-gareth-roberts/insideLLMs
Length of output: 177
🏁 Script executed:
# Test the actual behavior difference between the current and proposed fix
python3 <<'EOF'
import sys
import tempfile
import os
import importlib

# Test 1: Module truly not found
print("=== Test 1: Module not found ===")
try:
    importlib.import_module("nonexistent_module_12345")
except ImportError as e:
    print(f"Exception type: {type(e).__name__}")
    print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
    if isinstance(e, ModuleNotFoundError):
        print(f"exc.name: {e.name}")

# Test 2: Module found but has broken import inside
print("\n=== Test 2: Module with broken internal import ===")
with tempfile.TemporaryDirectory() as tmpdir:
    broken_module = os.path.join(tmpdir, 'broken_mod.py')
    with open(broken_module, 'w') as f:
        f.write('import nonexistent_internal_dep\n')
    sys.path.insert(0, tmpdir)
    try:
        importlib.import_module("broken_mod")
    except ImportError as e:
        print(f"Exception type: {type(e).__name__}")
        print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
        if isinstance(e, ModuleNotFoundError):
            print(f"exc.name: {e.name}")
            print(f"exc.name == 'broken_mod': {e.name == 'broken_mod'}")
        print(f"Exception msg: {str(e)}")

# Test 3: Other ImportError (hypothetical)
print("\n=== Test 3: Circular import scenario ===")
with tempfile.TemporaryDirectory() as tmpdir:
    mod_a = os.path.join(tmpdir, 'mod_a.py')
    mod_b = os.path.join(tmpdir, 'mod_b.py')
    with open(mod_a, 'w') as f:
        f.write('import mod_b\nx = 1\n')
    with open(mod_b, 'w') as f:
        f.write('from mod_a import x\n')
    sys.path.insert(0, tmpdir)
    try:
        importlib.import_module("mod_a")
    except ImportError as e:
        print(f"Exception type: {type(e).__name__}")
        print(f"Is ModuleNotFoundError: {isinstance(e, ModuleNotFoundError)}")
        print(f"Exception msg: {str(e)[:100]}")
EOF
Repository: dr-gareth-roberts/insideLLMs
Length of output: 662
Catch ModuleNotFoundError specifically to avoid masking transitive import failures.
Line 123 catches ImportError broadly, which silently treats installed-but-broken dependencies (with missing transitive imports) as "not installed". The exception's name attribute indicates which module failed; re-raise if it's not the module you're trying to load.
💡 Suggested fix
def _load_optional_module(module_name: str) -> Any | None:
"""Load optional dependency module, returning None when unavailable."""
try:
return importlib.import_module(module_name)
- except ImportError:
- return None
+ except ModuleNotFoundError as exc:
+ if exc.name == module_name:
+ return None
+ raise🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/experiment_tracking.py` around lines 119 - 124, The helper
_load_optional_module currently catches ImportError broadly and can mask
transitive import failures; change the exception handling to first catch
ModuleNotFoundError as e and return None only when e.name == module_name
(re-raise otherwise), and then optionally keep a general except ImportError:
return None for other ImportError cases; reference the function name
_load_optional_module and the variable module_name to locate where to adjust the
try/except logic.
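A self-contained version of the suggested pattern, with `load_optional` as an illustrative stand-in for `_load_optional_module`:

```python
import importlib
from typing import Any, Optional

def load_optional(module_name: str) -> Optional[Any]:
    """Return the module, or None only when that exact module is missing."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name == module_name:
            return None  # the optional dependency itself is not installed
        raise  # a transitive import inside the dependency failed; surface it

json_mod = load_optional("json")                        # stdlib: always present
missing = load_optional("definitely_not_installed_xyz") # absent: returns None
```

Checking `exc.name` is what distinguishes "the optional package is absent" from "the package is installed but one of its own imports is broken", which is the distinction the review asks for.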
| except Exception as e: | ||
| # Get first message content for error context | ||
| first_msg = messages[0].get("content", "") if messages else "" | ||
| first_msg = messages[0]["content"] if messages else "" |
Avoid masking the original error in chat exception handling.
On Line 454, messages[0]["content"] can raise while already handling an exception, which hides the real failure source.
💡 Proposed fix
- first_msg = messages[0]["content"] if messages else ""
+ first_msg = ""
+ if messages:
+ first_msg = str(messages[0].get("content", ""))📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| first_msg = messages[0]["content"] if messages else "" | |
| first_msg = "" | |
| if messages: | |
| first_msg = str(messages[0].get("content", "")) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/models/openai.py` at line 454, When extracting the first message
inside the chat exception handler, avoid indexing into messages directly since
messages[0]["content"] can raise and mask the original exception; update the
extraction of first_msg (the code referencing messages and first_msg) to safely
access content (e.g., check messages truthiness and type of messages[0], use
.get("content", "") or a guard) or wrap only that access in its own try/except
and preserve/re-raise/log the original exception separately so the root error
isn't hidden.
| except Exception as e: | ||
| if self.circuit_breaker: | ||
| self.circuit_breaker.record_failure() | ||
| raise | ||
| raise e |
🧩 Analysis chain
🏁 Script executed:
# First, check the structure and size of the file
wc -l insideLLMs/rate_limiting.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 104
🏁 Script executed:
# Read the specific lines mentioned in the review (2899-2902)
sed -n '2895,2905p' insideLLMs/rate_limiting.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 561
🏁 Script executed:
# Run the verification script from the review to find all similar patterns
rg -nPU 'except\s+Exception\s+as\s+([A-Za-z_]\w*):(?:\n[ \t]+.*){0,8}\n[ \t]+raise\s+\1\b' insideLLMs/rate_limiting.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 144
🏁 Script executed:
# Search for re-raise patterns with exception variables using simpler regex
rg -n 'except.*as\s+([a-z_]\w*)' insideLLMs/rate_limiting.py -A 5 | grep -E '(except|raise)' | head -50
Repository: dr-gareth-roberts/insideLLMs
Length of output: 704
🏁 Script executed:
# More direct: find all lines with "raise e" or similar patterns
rg -n 'raise\s+[a-z_]\w+' insideLLMs/rate_limiting.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 54
🏁 Script executed:
# Check if bare raise is used elsewhere in the file for comparison
rg -n '^\s+raise\s*$' insideLLMs/rate_limiting.py
Repository: dr-gareth-roberts/insideLLMs
Length of output: 194
Use bare raise to preserve the traceback chain.
The raise e at line 2902 re-raises with an extra traceback entry at the re-raise site instead of leaving the original frames untouched. Use bare raise instead to maintain source context, consistent with the pattern used throughout the rest of this file.
💡 Proposed fix
except Exception as e:
if self.circuit_breaker:
self.circuit_breaker.record_failure()
- raise e
+ raise📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| except Exception as e: | |
| if self.circuit_breaker: | |
| self.circuit_breaker.record_failure() | |
| raise | |
| raise e | |
| except Exception as e: | |
| if self.circuit_breaker: | |
| self.circuit_breaker.record_failure() | |
| raise |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@insideLLMs/rate_limiting.py` around lines 2899 - 2902, The except block
currently re-raises the caught exception with "raise e", which loses the
original traceback; change the re-raise to a bare "raise" so the original
traceback is preserved (update the block that calls
self.circuit_breaker.record_failure() and replace "raise e" with "raise").
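In Python 3 both forms keep the original frames; the practical difference is that raise e inserts an extra traceback entry at the re-raise line, while bare raise does not. A minimal demonstration:

```python
import traceback

def fail_bare() -> None:
    try:
        1 / 0
    except Exception:
        raise  # bare raise: the traceback keeps only the original frames

def fail_named() -> None:
    try:
        1 / 0
    except Exception as err:
        raise err  # adds an extra traceback entry at this line

def tb_frames(fn):
    """Run fn, catch its exception, and return the extracted traceback frames."""
    try:
        fn()
    except Exception as exc:
        return traceback.extract_tb(exc.__traceback__)
    return []

bare_frames = tb_frames(fail_bare)
named_frames = tb_frames(fail_named)
# named_frames carries one extra entry for the explicit re-raise line
```

The extra frame is mostly noise when reading a crash report, which is why the bare form is preferred for pass-through handlers like this circuit-breaker block.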
Description
Summary
This PR applies a focused hardening pass across experiment tracking, model typing, retry error handling, dataset optional imports, and related tests/docs updates.
Changes
Experiment tracking
Enforced non-empty tracker list in MultiTracker constructor.
Returned a concrete first run ID from MultiTracker.start_run (no empty fallback).
File: insideLLMs/experiment_tracking.py
Optional dependency robustness
Switched HF datasets import to importlib-based optional loading.
Set availability based on resolved load_dataset symbol.
File: insideLLMs/dataset_utils.py
Model retry safety
In ModelWrapper.generate, made last_error explicitly typed and fail-fast if unexpectedly unset before raising.
File: insideLLMs/models/base.py
OpenAI model typing cleanup
Added explicit **kwargs: Any annotations to generate/chat/stream.
Added explicit return type -> ModelInfo for info.
Kept SDK boundary ignore where OpenAI typing overloads are narrower than our ChatMessage interface.
File: insideLLMs/models/openai.py
Tests typing improvements
Added explicit typing to helper probe methods used in coverage tests.
File: tests/test_probes_models_coverage.py
Docs consistency
Bias tutorial config now sets output_dir: bias_results to match analysis step paths.
File: wiki/tutorials/Bias-Testing.md
Why
Reduce type/lint noise in strict IDE diagnostics.
Tighten runtime guarantees around run IDs and retry error paths.
Improve optional dependency behavior for environments missing extras.
Keep docs aligned with real output locations.
Risk
Low to medium:
Mostly typing/guard improvements and deterministic behavior tightening.
One behavior change: MultiTracker([]) now raises ValueError (intentional fail-fast).
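The fail-fast guard and the concrete-first-run-ID behavior described above can be sketched with simplified stand-ins (these classes are illustrative, not the repository's implementations):

```python
from typing import Optional

class DummyTracker:
    """Stand-in backend that just returns a fixed run ID."""
    def __init__(self, run_id: str) -> None:
        self.run_id = run_id

    def start_run(self, run_name: Optional[str] = None) -> str:
        return self.run_id

class MultiTracker:
    """Illustrative guard: refuse empty tracker lists, return the first run ID."""
    def __init__(self, trackers: list) -> None:
        if not trackers:
            raise ValueError("MultiTracker requires at least one tracker")
        self.trackers = list(trackers)

    def start_run(self, run_name: Optional[str] = None) -> str:
        # Fan out to every backend, but return the first concrete run ID
        run_ids = [t.start_run(run_name) for t in self.trackers]
        return run_ids[0]

multi = MultiTracker([DummyTracker("run-a"), DummyTracker("run-b")])
first_id = multi.start_run()

try:
    MultiTracker([])
    empty_raises = False
except ValueError:
    empty_raises = True
```

Raising in the constructor surfaces a misconfiguration immediately instead of silently logging to zero backends.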
Validation
Manual code-level validation and IDE diagnostics due to intermittent command-runner issues in the IDE environment.
Branch push completed and ready for CI validation.
Summary by Sourcery
Harden experiment tracking, model retry behavior, routing/classification, and contrib utilities while tightening typing and optional dependency handling across the codebase.