chore(llmobs): add GEPA as alternative prompt optimization backend #16599
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1dd5df9aae
    except ImportError:
        raise ImportError("gepa package is required for method='gepa'. Install with: pip install ddtrace[gepa]")

    from ddtrace.llmobs._optimizers.gepa_strategy import LLMObsGEPAAdapter
Ship GEPA adapter module before using it
The new GEPA path imports LLMObsGEPAAdapter from ddtrace.llmobs._optimizers.gepa_strategy, but this commit does not add that module anywhere in the tree, so calling PromptOptimization.run() with method="gepa" will fail immediately with ModuleNotFoundError even when the gepa extra is installed. This blocks the entire feature path introduced in this change.
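The failure mode described in this comment can be illustrated with a small sketch. ModuleNotFoundError is a subclass of ImportError, and its name attribute records which module could not be found, so a guard can tell "the optional package is not installed" apart from "a module is missing from the tree". The helper import_optional below is hypothetical and is not part of ddtrace; it only shows one way such a guard could be written.

```python
import importlib
from types import ModuleType


def import_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, adding an install hint on failure.

    Hypothetical helper for illustration only; not a ddtrace API.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # exc.name is the module that could not be found. Only rewrite the
        # error when the top-level optional package itself is missing. A
        # missing submodule (for example, a file that was never committed)
        # is re-raised unchanged so the real cause stays visible.
        if exc.name != module_name.split(".")[0]:
            raise
        raise ImportError(f"{module_name} is required. {install_hint}") from exc


# A module that exists is returned as-is.
json_module = import_optional("json", "json ships with the standard library")
```

With this shape, a missing optional package produces the friendly install hint, while a missing internal module still surfaces as a plain ModuleNotFoundError naming the exact module.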
Performance SLOs
Comparing candidate till.wohlfarth/MLOB-5317/prompt_optimization_gepa (5f2d201) with baseline main (b5b77a2)

📈 Performance Regressions (2 suites)
- iastaspects: 117/117 checks within SLO. Flagged time regressions vs baseline:
  - ljust_aspect: 579.333µs (SLO: <700.000µs), +16.6% vs baseline
  - stringio_aspect: 4.364ms (SLO: <5.000ms), +14.7% vs baseline
  - Memory: 42.861MB to 43.096MB per benchmark (SLO: <46.000MB), +4.6% to +5.1% vs baseline
- iastaspectsospath: 24/24 checks within SLO. Flagged time regressions vs baseline:
  - ospathbasename_aspect: 508.743µs (SLO: <700.000µs), +19.4% vs baseline
  - Memory: 42.546MB to 42.920MB per benchmark (SLO: <46.000MB), +3.8% to +5.1% vs baseline

🟡 Near SLO Breach (1 suite)
- tracer: 6/6 checks within SLO.
  - large: Time 31.356ms (SLO: <32.950ms, -4.8%), -0.7% vs baseline. Memory 36.766MB (SLO: <39.250MB, -6.3%), +5.2% vs baseline
  - medium: Time 3.113ms (SLO: <3.200ms, -2.7%), -1.7% vs baseline. Memory 35.527MB (SLO: <38.750MB, -8.3%), +4.8% vs baseline
  - small: Time 364.476µs (SLO: <370.000µs, 🟡 -1.5%), +3.8% vs baseline. Memory 35.606MB (SLO: <38.750MB, -8.1%), +4.9% vs baseline
Summary
- Adds GEPA as an alternative method for LLMObs._prompt_optimization(), alongside the existing "metaprompting" default
- Extracts load_optimization_system_prompt() as a reusable module-level function so both metaprompting and GEPA can share the system prompt template
- Adds LLMObsGEPAAdapter, which bridges our task/evaluators/optimization_task interface to GEPA's GEPAAdapter protocol
- Adds gepa>=0.0.26 as an optional dependency (pip install ddtrace[gepa])
Motivation
The current metaprompting loop is sequential: run experiment, call optimizer, repeat. GEPA adds an evolutionary approach with Pareto selection, batch sampling, and candidate tracking, potentially finding better prompts with fewer iterations. By implementing GEPA's propose_new_texts adapter method, prompt generation still goes through the user's existing optimization_task function, so no script changes are needed beyond adding method="gepa".
Changes
New files
- ddtrace/llmobs/_optimizers/__init__.py: Package init
- ddtrace/llmobs/_optimizers/gepa_strategy.py: LLMObsGEPAAdapter implementing GEPA's protocol:
  - evaluate(): runs user's task + evaluators on a batch, returns EvaluationBatch
  - make_reflective_dataset(): builds feedback from trajectories for GEPA's reflection
  - propose_new_texts(): wraps user's optimization_task via shared system prompt template
  - _to_numeric_score(): converts any evaluator return type to float
  - _dataset_to_gepa_format(): converts Dataset records for GEPA
Modified files
- ddtrace/llmobs/_prompt_optimization.py:
  - Extracted load_optimization_system_prompt(config) from OptimizationIteration._load_system_prompt() (method now delegates)
  - Added method: str = "metaprompting" parameter to PromptOptimization.__init__()
  - Routed run() to GEPA paths when method="gepa"
  - Added _run_gepa_without_split(), _run_gepa_with_split(), _run_gepa_core() methods
- ddtrace/llmobs/_llmobs.py: Added method parameter to _prompt_optimization() classmethod, passed through to constructor
- pyproject.toml: Added gepa = ["gepa>=0.0.26"] optional dependency
What stays unchanged
- Existing metaprompting paths (_run_without_split, _run_with_split)
- OptimizationIteration, OptimizationResult, TestPhaseResult classes
- _run_experiment(), _create_split_datasets(), all validation functions
- Existing call signatures (only the method kwarg is added at the end)
Usage
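As a rough illustration of the routing described under Modified files, the sketch below uses a hypothetical stand-in class rather than the real PromptOptimization. The class name, the name argument, the returned strings, and the ValueError on an unknown method are all assumptions made for the example. The only parts taken from this PR are the method keyword, its "metaprompting" default, and the "gepa" alternative.

```python
from typing import Callable, Dict


class PromptOptimizationSketch:
    """Hypothetical stand-in showing method-based routing; not the ddtrace class."""

    SUPPORTED_METHODS = ("metaprompting", "gepa")

    def __init__(self, name: str, method: str = "metaprompting") -> None:
        # Assumption: unknown methods are rejected up front with ValueError.
        if method not in self.SUPPORTED_METHODS:
            raise ValueError(
                f"Unknown method {method!r}; expected one of {self.SUPPORTED_METHODS}"
            )
        self.name = name
        self.method = method

    def run(self) -> str:
        # Route to the backend selected at construction time.
        runners: Dict[str, Callable[[], str]] = {
            "metaprompting": self._run_metaprompting,
            "gepa": self._run_gepa,
        }
        return runners[self.method]()

    def _run_metaprompting(self) -> str:
        # Placeholder for the existing sequential loop.
        return f"{self.name}: sequential metaprompting loop"

    def _run_gepa(self) -> str:
        # Placeholder for the GEPA path.
        return f"{self.name}: GEPA evolutionary search"


# Existing callers keep the default; opting in only requires the extra kwarg.
default_run = PromptOptimizationSketch("demo").run()
gepa_run = PromptOptimizationSketch("demo", method="gepa").run()
```

Keeping method as a trailing keyword with a default means existing scripts run the metaprompting path unchanged, which matches the intent stated above.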
Test plan