feat(llmobs): remote evaluation devserver POC #16665
alexbarksdale wants to merge 4 commits into main
…o Experiment

- Add ConfigFieldType and ConfigField types for typed devserver UI config schema
- Add remote_config parameter to Experiment, SyncExperiment, and LLMObs factory methods
- Add _user_tags to preserve user-provided tags before auto-tag injection
- Add ProgressEvent, ProgressCallbackType, and OnStartCallbackType
- Add span_event to TaskResult for full span data after task completion
- Thread progress_callback and on_start through run(), _run_task(), _process_record()
- Add _emit_post_eval_progress() for evaluation-complete and success events
- Add LLMObs.devserver() classmethod

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
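To illustrate what "threading progress_callback and on_start through run()" could look like, here is a minimal standalone sketch. It does not use ddtrace: the `ProgressEvent` fields, the `"task-complete"` event name, and the `run_tasks` helper are all hypothetical stand-ins, and the PR's actual types and signatures may differ.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class ProgressEvent:
    """Hypothetical event shape; not the PR's real ProgressEvent."""
    kind: str                 # illustrative event name, e.g. "task-complete"
    index: int                # position of the record that just finished
    total: int                # number of records in the run
    payload: Dict[str, Any]   # arbitrary per-event data


ProgressCallback = Callable[[ProgressEvent], None]
OnStartCallback = Callable[[int], None]


def run_tasks(
    records: List[str],
    task: Callable[[str], str],
    progress_callback: Optional[ProgressCallback] = None,
    on_start: Optional[OnStartCallback] = None,
) -> List[str]:
    """Run `task` over every record, reporting progress through optional hooks."""
    if on_start is not None:
        on_start(len(records))  # announce the run size before any work happens
    results = []
    for i, record in enumerate(records):
        output = task(record)
        results.append(output)
        if progress_callback is not None:
            # Emit one event per completed record so a caller can stream it onward.
            progress_callback(
                ProgressEvent("task-complete", i, len(records), {"output": output})
            )
    return results


events: List[ProgressEvent] = []
started: List[int] = []
results = run_tasks(["France", "Japan"], str.upper, events.append, started.append)
```

Both hooks are optional here, so existing callers that pass neither would see no behavior change; that is one plausible reason to thread them as keyword parameters rather than require them.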
HTTP devserver that exposes registered experiments for remote execution. Supports NDJSON streaming (/eval with stream=true), config overrides, evaluator selection, sample sizing, and CORS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
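As a rough illustration of the NDJSON streaming mentioned above, the sketch below encodes a sequence of events as newline-delimited JSON and decodes them again from arbitrarily sized chunks. It uses only the standard library; the event names (`start`, `progress`, `success`) and both helper functions are assumptions for illustration, not the devserver's actual wire format.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def to_ndjson(events: Iterable[Dict[str, Any]]) -> Iterator[bytes]:
    """Yield one compact JSON document per line, each terminated by a newline."""
    for event in events:
        yield json.dumps(event, separators=(",", ":")).encode("utf-8") + b"\n"


def from_ndjson(chunks: Iterable[bytes]) -> Iterator[Dict[str, Any]]:
    """Reassemble lines from arbitrary chunk boundaries and parse each one."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():          # skip blank keep-alive lines
                yield json.loads(line)
    if buffer.strip():                # tolerate a final line without a newline
        yield json.loads(buffer)


# Hypothetical progress events, not the devserver's real schema.
sample_events: List[Dict[str, Any]] = [
    {"event": "start", "total": 2},
    {"event": "progress", "done": 1},
    {"event": "success"},
]

wire = b"".join(to_ndjson(sample_events))
# Re-chunk the byte stream to mimic network reads that split lines mid-record.
chunks = [wire[i:i + 7] for i in range(0, len(wire), 7)]
decoded = list(from_ndjson(chunks))
```

Because `json.dumps` escapes newlines inside strings, a raw `\n` byte can only be a record separator, which is what lets a client render each event as soon as its line arrives.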
Capital-city experiment demonstrating async task, evaluators, remote_config with typed fields, and devserver startup. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Performance SLOs

Comparing candidate alex.barksdale/remote-eval-poc (7dd9e2e) with baseline main (b43e1e7)

Every benchmark listed below is marked ✅ for each reported metric. 📈 marks a flagged regression against the baseline; 🟡 marks a result close to its SLO.

📈 Performance Regressions (2 suites)

📈 iastaspects - 117/117

| Benchmark | Time | Time SLO | vs SLO | Time vs baseline | Memory | Memory SLO | vs SLO | Memory vs baseline |
|---|---|---|---|---|---|---|---|---|
| add_aspect | 103.170µs | <130.000µs | -20.6% | +2.8% | 42.880MB | <46.000MB | -6.8% | +4.6% |
| add_inplace_aspect | 101.662µs | <130.000µs | -21.8% | +1.0% | 42.979MB | <46.000MB | -6.6% | +4.9% |
| add_inplace_noaspect | 28.245µs | <40.000µs | -29.4% | ~same | 42.920MB | <46.000MB | -6.7% | +4.8% |
| add_noaspect | 48.727µs | <70.000µs | -30.4% | -0.2% | 42.959MB | <46.000MB | -6.6% | +5.1% |
| bytearray_aspect | 249.650µs | <400.000µs | -37.6% | +0.1% | 42.920MB | <46.000MB | -6.7% | +4.7% |
| bytearray_extend_aspect | 640.150µs | <800.000µs | -20.0% | +1.7% | 42.959MB | <46.000MB | -6.6% | +4.9% |
| bytearray_extend_noaspect | 264.116µs | <400.000µs | -34.0% | +0.6% | 42.939MB | <46.000MB | -6.7% | +4.7% |
| bytearray_noaspect | 136.804µs | <300.000µs | -54.4% | +1.0% | 42.900MB | <46.000MB | -6.7% | +4.7% |
| bytes_aspect | 217.774µs | <300.000µs | -27.4% | -0.5% | 43.057MB | <46.000MB | -6.4% | not reported |
| bytes_noaspect | 132.680µs | <200.000µs | -33.7% | -0.3% | 42.900MB | <46.000MB | -6.7% | +4.7% |
| bytesio_aspect | 3.782ms | <5.000ms | -24.4% | +0.4% | 42.959MB | <46.000MB | -6.6% | +4.8% |
| bytesio_noaspect | 314.274µs | <420.000µs | -25.2% | -1.1% | 42.959MB | <46.000MB | -6.6% | +4.9% |
| capitalize_aspect | 88.873µs | <300.000µs | -70.4% | -0.7% | 43.018MB | <46.000MB | -6.5% | +5.1% |
| capitalize_noaspect | 253.713µs | <300.000µs | -15.4% | +0.8% | 42.979MB | <46.000MB | -6.6% | +4.8% |
| casefold_aspect | 93.271µs | <500.000µs | -81.3% | +5.1% | 42.900MB | <46.000MB | -6.7% | +4.6% |
| casefold_noaspect | 306.845µs | <500.000µs | -38.6% | +0.6% | 42.939MB | <46.000MB | -6.7% | +4.7% |
| decode_aspect | 86.893µs | <100.000µs | -13.1% | ~same | 42.939MB | <46.000MB | -6.7% | +4.7% |
| decode_noaspect | 152.279µs | <210.000µs | -27.5% | -0.1% | 42.880MB | <46.000MB | -6.8% | +4.6% |
| encode_aspect | 84.031µs | <200.000µs | -58.0% | -0.3% | 42.959MB | <46.000MB | -6.6% | +4.9% |
| encode_noaspect | 138.772µs | <200.000µs | -30.6% | -1.3% | not reported | not reported | not reported | not reported |
| format_aspect | 14.717ms | <19.200ms | -23.4% | +0.7% | 43.037MB | <46.000MB | -6.4% | +4.9% |
| format_map_aspect | 16.460ms | <21.500ms | -23.4% | -0.3% | 43.018MB | <46.000MB | -6.5% | +4.8% |
| format_map_noaspect | 372.053µs | <500.000µs | -25.6% | +1.0% | 42.920MB | <46.000MB | -6.7% | +4.8% |
| format_noaspect | 301.658µs | <500.000µs | -39.7% | -0.3% | 42.959MB | <46.000MB | -6.6% | +4.9% |
| index_aspect | 120.482µs | <300.000µs | -59.8% | -0.6% | 42.979MB | <46.000MB | -6.6% | +5.0% |
| index_noaspect | 40.360µs | <300.000µs | -86.5% | ~same | 42.939MB | <46.000MB | -6.7% | +5.0% |
| join_aspect | 209.702µs | <300.000µs | -30.1% | ~same | 43.136MB | <46.000MB | -6.2% | +5.4% |
| join_noaspect | 144.409µs | <300.000µs | -51.9% | +1.2% | 42.939MB | <46.000MB | -6.7% | +4.7% |
| ljust_aspect | 499.746µs | <700.000µs | -28.6% | ~same | 42.939MB | <46.000MB | -6.7% | +5.0% |
| ljust_noaspect | 261.026µs | <300.000µs | -13.0% | ~same | 42.920MB | <46.000MB | -6.7% | +4.8% |
| lower_aspect | 294.926µs | <500.000µs | -41.0% | -0.2% | 42.920MB | <46.000MB | -6.7% | +4.8% |
| lower_noaspect | 233.521µs | <300.000µs | -22.2% | -1.4% | 42.920MB | <46.000MB | -6.7% | +4.7% |
| lstrip_aspect | 0.274ms | <3.000ms | -90.9% | +1.1% | 42.920MB | <46.000MB | -6.7% | +4.9% |
| lstrip_noaspect | 0.212ms | <3.000ms | -92.9% | 📈 +20.0% | 42.939MB | <46.000MB | -6.7% | +5.0% |
| modulo_aspect | 14.496ms | <18.750ms | -22.7% | +1.6% | 43.057MB | <46.000MB | -6.4% | +4.7% |
| modulo_aspect_for_bytearray_bytearray | 14.785ms | <19.350ms | -23.6% | +0.2% | 43.057MB | <46.000MB | -6.4% | +4.6% |
| modulo_aspect_for_bytes | 14.334ms | <18.900ms | -24.2% | -0.9% | 43.096MB | <46.000MB | -6.3% | +5.0% |
| modulo_aspect_for_bytes_bytearray | 14.793ms | <19.150ms | -22.8% | +0.6% | 43.018MB | <46.000MB | -6.5% | +4.9% |
| modulo_noaspect | 0.360ms | <3.000ms | -88.0% | ~same | 42.939MB | <46.000MB | -6.7% | +4.9% |
| replace_aspect | 18.412ms | <24.000ms | -23.3% | +0.2% | 43.096MB | <46.000MB | -6.3% | +4.5% |
| replace_noaspect | 280.610µs | <300.000µs | -6.5% | +0.1% | 42.959MB | <46.000MB | -6.6% | +4.9% |
| repr_aspect | 314.702µs | <420.000µs | -25.1% | +0.2% | 42.880MB | <46.000MB | -6.8% | +4.7% |
| repr_noaspect | 46.439µs | <90.000µs | -48.4% | -0.4% | 42.900MB | <46.000MB | -6.7% | +4.8% |
| rstrip_aspect | 384.119µs | <500.000µs | -23.2% | +0.4% | 42.920MB | <46.000MB | -6.7% | +4.7% |
| rstrip_noaspect | 183.830µs | <300.000µs | -38.7% | -0.5% | 42.979MB | <46.000MB | -6.6% | +5.0% |
| slice_aspect | 184.773µs | <300.000µs | -38.4% | +0.6% | 42.998MB | <46.000MB | -6.5% | +4.9% |
| slice_noaspect | 53.813µs | <90.000µs | -40.2% | ~same | 42.959MB | <46.000MB | -6.6% | +4.9% |
| stringio_aspect | 3.824ms | <5.000ms | -23.5% | +0.3% | 42.900MB | <46.000MB | -6.7% | +4.7% |
| stringio_noaspect | 381.756µs | <500.000µs | -23.6% | 📈 +10.3% | 42.959MB | <46.000MB | -6.6% | +5.0% |
| strip_aspect | 269.712µs | <350.000µs | -22.9% | +0.7% | 42.939MB | <46.000MB | -6.7% | +4.9% |
| strip_noaspect | 175.761µs | <240.000µs | -26.8% | -0.3% | 42.920MB | <46.000MB | -6.7% | +4.9% |
| swapcase_aspect | 330.482µs | <500.000µs | -33.9% | ~same | 42.998MB | <46.000MB | -6.5% | +5.0% |
| swapcase_noaspect | 268.646µs | <400.000µs | -32.8% | -1.3% | 42.900MB | <46.000MB | -6.7% | +4.6% |
| title_aspect | 320.328µs | <500.000µs | -35.9% | ~same | 42.920MB | <46.000MB | -6.7% | +4.9% |
| title_noaspect | 258.227µs | <400.000µs | -35.4% | -0.4% | 42.900MB | <46.000MB | -6.7% | +4.8% |
| translate_aspect | 491.068µs | <700.000µs | -29.8% | +0.6% | 42.880MB | <46.000MB | -6.8% | +4.7% |
| translate_noaspect | 422.803µs | <500.000µs | -15.4% | -1.6% | 42.939MB | <46.000MB | -6.7% | +4.9% |
| upper_aspect | 296.191µs | <500.000µs | -40.8% | +0.2% | 42.920MB | <46.000MB | -6.7% | +4.8% |
| upper_noaspect | 233.634µs | <400.000µs | -41.6% | -0.6% | 42.900MB | <46.000MB | -6.7% | +5.0% |

📈 iastaspectsospath - 24/24

| Benchmark | Time | Time SLO | vs SLO | Time vs baseline | Memory | Memory SLO | vs SLO | Memory vs baseline |
|---|---|---|---|---|---|---|---|---|
| ospathbasename_aspect | 508.481µs | <700.000µs | -27.4% | 📈 +20.1% | 42.979MB | <46.000MB | -6.6% | +4.9% |
| ospathbasename_noaspect | 430.777µs | <700.000µs | -38.5% | -0.5% | 42.939MB | <46.000MB | -6.7% | +4.8% |
| ospathjoin_aspect | 627.479µs | <700.000µs | -10.4% | +0.4% | 42.920MB | <46.000MB | -6.7% | +4.8% |
| ospathjoin_noaspect | 633.986µs | <700.000µs | -9.4% | -0.2% | 42.939MB | <46.000MB | -6.7% | +4.9% |
| ospathnormcase_aspect | 348.478µs | <700.000µs | -50.2% | -0.8% | 42.861MB | <46.000MB | -6.8% | +4.5% |
| ospathnormcase_noaspect | 356.894µs | <700.000µs | -49.0% | -1.6% | 42.880MB | <46.000MB | -6.8% | +4.7% |
| ospathsplit_aspect | 487.851µs | <700.000µs | -30.3% | ~same | 42.979MB | <46.000MB | -6.6% | +4.9% |
| ospathsplit_noaspect | 500.809µs | <700.000µs | -28.5% | -0.5% | 42.900MB | <46.000MB | -6.7% | +4.6% |
| ospathsplitdrive_aspect | 374.616µs | <700.000µs | -46.5% | +0.5% | 42.979MB | <46.000MB | -6.6% | +5.0% |
| ospathsplitdrive_noaspect | 73.125µs | <700.000µs | -89.6% | +0.6% | 42.959MB | <46.000MB | -6.6% | +5.0% |
| ospathsplitext_aspect | 457.540µs | <700.000µs | -34.6% | -0.3% | 42.979MB | <46.000MB | -6.6% | +4.9% |
| ospathsplitext_noaspect | 463.594µs | <700.000µs | -33.8% | -0.4% | 42.900MB | <46.000MB | -6.7% | +4.8% |

🟡 Near SLO Breach (1 suite)

🟡 tracer - 6/6

| Benchmark | Time | Time SLO | vs SLO | Time vs baseline | Memory | Memory SLO | vs SLO | Memory vs baseline |
|---|---|---|---|---|---|---|---|---|
| large | 31.456ms | <32.950ms | -4.5% | ~same | 36.687MB | <39.250MB | -6.5% | +4.9% |
| medium | 3.106ms | <3.200ms | -2.9% | +0.7% | 35.606MB | <38.750MB | -8.1% | +5.3% |
| small | 363.661µs | <370.000µs | 🟡 -1.7% | +3.9% | 35.507MB | <38.750MB | -8.4% | +4.7% |
Summary
- Adds an HTTP devserver (`_devserver.py`) that exposes registered `Experiment` objects via `/list` and `/eval` endpoints, enabling remote/UI-driven experiment execution with NDJSON streaming support
- Extends `Experiment` with `remote_config` (typed config fields for UI rendering), `progress_callback`/`on_start` hooks, and span event capture for real-time progress reporting
- Adds the `LLMObs.devserver()` class method; `LLMObs.async_experiment()` and `LLMObs.experiment()` accept a new `remote_config` parameter
- Adds an example (`_examples/devserver_example.py`) demonstrating the full flow

Test plan

- `DD_API_KEY=... DD_APP_KEY=... DD_SITE=... python -m ddtrace.llmobs._examples.devserver_example`
- `/list` returns experiment metadata and config schema
- `/eval` with `stream: false` returns JSON results
- `/eval` with `stream: true` returns NDJSON progress events
- `config_override` and `evaluators` filtering work correctly

🤖 Generated with Claude Code
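The last test-plan item covers `config_override` and `evaluators` filtering. The sketch below shows one way such request handling might work: overrides are merged onto typed defaults, and evaluators are selected by function name. It is standalone and hypothetical: this `ConfigField` (with `py_type` and `default`), `apply_overrides`, and `select_evaluators` are illustrative names and are not taken from the PR's code.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass(frozen=True)
class ConfigField:
    """Hypothetical typed field; the PR's ConfigField may be shaped differently."""
    py_type: type
    default: Any


def apply_overrides(
    schema: Dict[str, ConfigField], overrides: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Start from schema defaults, then apply overrides after validating them."""
    config = {name: field.default for name, field in schema.items()}
    for name, value in (overrides or {}).items():
        if name not in schema:
            raise KeyError(f"unknown config field: {name}")
        if not isinstance(value, schema[name].py_type):
            raise TypeError(f"{name} expects {schema[name].py_type.__name__}")
        config[name] = value
    return config


def select_evaluators(
    registered: List[Callable[..., Any]], names: Optional[List[str]]
) -> List[Callable[..., Any]]:
    """Return all evaluators when no filter is given, else only the named ones."""
    if names is None:
        return list(registered)
    by_name = {fn.__name__: fn for fn in registered}
    missing = [n for n in names if n not in by_name]
    if missing:
        raise KeyError(f"unknown evaluators: {missing}")
    return [by_name[n] for n in names]


def exact_match(expected: str, actual: str) -> bool:
    return expected == actual


def case_insensitive_match(expected: str, actual: str) -> bool:
    return expected.lower() == actual.lower()


schema = {
    "model": ConfigField(str, "small"),
    "temperature": ConfigField(float, 0.0),
}
config = apply_overrides(schema, {"temperature": 0.7})
chosen = select_evaluators(
    [exact_match, case_insensitive_match], ["case_insensitive_match"]
)
```

Rejecting unknown field names and unknown evaluator names up front is a design choice in this sketch: a remote caller gets an immediate error instead of a run that silently ignores part of its request.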