feat: early exit on objective achieved (Python) #268
Open
Aryansharma28 wants to merge 38 commits into feat/red-team-refusal-detect from
Conversation
feat: add extensible metadata support to ScenarioConfig (#228) Allow users to attach custom key-value metadata (prompt IDs, environments, versions) to scenario runs via a new `metadata` field on ScenarioConfig. User metadata is spread into the SCENARIO_RUN_STARTED event, with built-in `name`/`description` fields always taking precedence. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…cript + Python) (#237)

* feat: lazy observability initialization with configurable span filtering

Stop auto-initializing OpenTelemetry on module import. Users running scenarios inside production server processes (e.g. Next.js + Inngest cron jobs) were getting hundreds of empty traces from auto-instrumented HTTP requests and middleware.

- Remove eager `import "./tracing"` from index.ts
- Add lazy init on first run() call via ensureTracingInitialized()
- Add setupScenarioTracing() for explicit control over init timing
- Add `observability` key to scenario.config.js (pass-through to langwatch SDK SetupObservabilityOptions)
- Add scenarioOnly filter preset and withCustomScopes() helper
- Detect pre-existing OTel providers and attach processors without re-initializing (supports @vercel/otel, Datadog, etc.)
- Add JudgeSpanCollector.clearSpansForThread() for memory cleanup in long-lived processes
- Bump langwatch dependency from 0.9.0 to 0.16.1
- Add cycle protection in belongsToThread recursive span traversal
- Add documentation page with two use cases: scenario-only spans and scenario + custom-tagged spans

* feat: add self-contained examples for custom observability config

Three runnable scripts that verify the feature works without needing API keys:

- test-no-auto-init.ts: importing scenario does NOT auto-init OTel
- test-scenario-only.ts: scenario spans created under correct scope
- test-custom-scopes.ts: custom DB spans captured alongside scenario

Run with: cd javascript/examples/custom-observability && pnpm test:all

* feat: add config-file-based example for observability

Adds test-config-file.ts that verifies the scenario.config.mjs path: run() lazily loads the config and initializes tracing without needing an explicit setupScenarioTracing() call. Includes a with-config-file/ subfolder containing a real scenario.config.mjs with observability options.
* fix: config file example now tests scenarioOnly filter end-to-end

The scenario.config.mjs uses LangWatchTraceExporter with scenarioOnly filter + an InMemorySpanExporter for verification. The test creates noise spans and confirms only @langwatch/scenario spans would be sent to LangWatch.

* feat: add Python custom observability + dual-language docs

- Python tracing now lazily initializes (no OTel side-effect on import)
- Add setup_scenario_tracing() for explicit control over OTel config
- Add scenario_only and with_custom_scopes() filter presets
- Add FilteringSpanExporter for span-level filtering
- Add clear_spans_for_thread() + cycle protection to JudgeSpanCollector
- Update docs with Python/TypeScript dual-language code snippets
- Add Python examples: no-auto-init, scenario-only, custom-scopes, conftest

124 Python tests + 137 TypeScript tests passing.

* refactor: use scenario.configure(observability=...) as primary Python API

Move observability config into scenario.configure() to match the existing Python configuration pattern, mirroring how TypeScript uses defineConfig({ observability }). setup_scenario_tracing() remains as an advanced explicit API.

- Add observability field to ScenarioConfig
- run() reads config.observability and passes to ensure_tracing_initialized()
- Update all examples and docs to use scenario.configure()

* fix: resolve CI type errors from OTel version mismatch

- TypeScript: use `any` at OTel version boundaries in setup.ts to avoid conflicts between @opentelemetry/sdk-trace-base v1.x and v2.x resolved by different packages in the pnpm tree
- Python: fix pyright reportOptionalMemberAccess in judge_span_collector by using walrus operator for get_span_context() null check

* fix: guard example scripts with __name__ == __main__

Prevents pytest from executing module-level asyncio.run() and exit() calls when collecting example files during CI test discovery.
* fix: correct OpenTelemetry span parenting for scenario turns

Three issues caused intermittent span parenting failures:

1. reset() called newTurn() which created a "ghost" Scenario Turn span that was immediately orphaned when currentTurn was reset to 0. Fixed by initializing turn state directly in reset() and calling newTurn() from execute() to create the Turn 1 span at the right time.
2. callAgent() set up the parent context via trace.setSpan() but did not wrap the async callback with context.with(), causing spans created by libraries like the Vercel AI SDK to lose their parent. Fixed by wrapping the withActiveSpan call in context.with(agentContext, ...).
3. The last turn's Scenario Turn span was never ended because execute() didn't call end() in its finally block. Fixed by ending the span in the finally block.

* chore: upgrade langwatch SDK to 0.16.1

Resolves @opentelemetry/sdk-trace-base version conflict (v1.30.1 vs v2.2.0) that caused DTS build failures. Also adds @opentelemetry/sdk-node peer dependency required by the new SDK version.

* fix: use batch span processor and flush on shutdown

With SimpleSpanProcessor (the default), each span.end() fires an HTTP request immediately. If the process exits before all requests complete, spans are silently dropped. This explains the intermittent span loss that was worse on slower network connections. Switch to BatchSpanProcessor which buffers spans and exports them in bulk, and call observabilityHandle.shutdown() after scenario execution to ensure all pending spans are flushed before the process exits.

* fix: remove observabilityHandle.shutdown() from run() finally block

The shutdown() call was causing test failures in two ways:

1. When no API key is set, the OTel exporter throws Unauthorized during flush, and this error propagates out of the finally block, replacing the actual scenario result
2. The observability handle is a module-level singleton — shutting it down after one run() call kills tracing for subsequent runs

The batch processor in setup.ts handles flushing automatically on process exit, so explicit shutdown per-run is unnecessary.

* fix: use newTurn() in execute() for initial span creation

Instead of inline startSpan() in execute(), use newTurn() which is the canonical way to create turn spans. Reset currentTurn to 0 afterward (matching the original reset() pattern). This ensures the span is created through the same code path that all subsequent turns use.

* fix: update parentSpanContext to parentSpanId for OTel SDK v2

The @opentelemetry/sdk-trace-base v2 replaced ReadableSpan.parentSpanContext (SpanContext object) with ReadableSpan.parentSpanId (string). Update all usages and test mocks accordingly.

* fix: handle both OTel SDK v1 and v2 parent span APIs

The LangWatch SDK's internal span implementation still uses the v1 parentSpanContext (SpanContext object) property, while OTel SDK v2 uses parentSpanId (string). Add a getParentSpanId() helper that checks parentSpanId first and falls back to parentSpanContext.spanId. This fixes the judge span collector not finding child spans when walking the parent chain, which caused span-based evaluation to fail.

* test: skip flaky realtime API test

The voice-to-voice realtime test is consistently flaky due to OpenAI Realtime API instability (504s, judge failures). It also fails on main. Skip it to unblock CI until it can be stabilized.
* feat: progressive trace discovery for large OTEL traces

When OpenTelemetry traces exceed ~8096 estimated tokens, the judge switches from inline full-content rendering to a structure-only view with expand_trace and grep_trace tools for on-demand exploration. This prevents context window blowups for customers with massive RAG pipelines while still giving the judge access to all trace details.

- Add token estimation utility (estimateTokens, DEFAULT_TOKEN_THRESHOLD)
- Add structure-only rendering mode to JudgeSpanDigestFormatter
- Add expandTrace/grepTrace standalone functions with token budgets
- Make judge agentic via AI SDK stopWhen/stepCountIs for large traces
- Extract shared span utilities to span-utils.ts (DRY refactor)
- Export all utilities for custom judge reuse

* fix: use power-of-two thresholds (8192/4096) instead of 8096/4000

* fix: use UTF-8 byte length for token estimation

Multi-byte characters (emojis, CJK) consume more tokens than ASCII. Using TextEncoder byte length instead of string length gives a better approximation.

* feat: show LLM token usage in structure-only trace digest

When gen_ai.usage.input_tokens and/or gen_ai.usage.output_tokens attributes are present on a span, the structure-only view now shows the total token count alongside duration, e.g.: chat claude-opus-4-6 (6.0s, 21693 tokens)

* refactor: extract digest building and LLM invocation from call()

Address review feedback:

- Resolve config defaults (tokenThreshold, maxDiscoverySteps) in constructor instead of deep in call()
- Extract buildTraceDigest() for full vs structure-only decision
- Extract invokeLLMWithDiscovery() to clarify multi-step flow

* fix: add extra turn to realtime test to reduce flakiness

The conversation wasn't getting enough turns for the agent to explain what Scenario is, causing the criterion to fail.
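The UTF-8 byte-length estimation described above can be sketched in Python. The 4-bytes-per-token divisor is an assumption for illustration (a common rough heuristic); the SDK's actual estimateTokens divisor may differ.

```python
DEFAULT_TOKEN_THRESHOLD = 8192  # power-of-two threshold, per the fix above

def estimate_tokens(text: str) -> int:
    """Rough token estimate from UTF-8 byte length.

    Multi-byte characters (emojis, CJK) consume more tokens than ASCII, so
    byte length approximates better than len(text). The 4-bytes-per-token
    divisor is an illustrative assumption, not the SDK's exact constant.
    """
    return max(1, len(text.encode("utf-8")) // 4)

# ASCII is 1 byte/char; CJK is 3 bytes/char in UTF-8, so the same character
# count yields a higher estimate for CJK text.
assert estimate_tokens("abcd" * 4) == 4
assert estimate_tokens("日本語") > estimate_tokens("abc")
```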
…ts (both languages) (#242)

* feat: progressive trace discovery for Python SDK

Port the TypeScript progressive trace discovery feature to Python. When rendered traces exceed ~8192 estimated tokens, the judge switches to structure-only rendering with expand_trace/grep_trace tools for on-demand exploration via a multi-step litellm loop.

New modules:

- estimate_tokens.py: UTF-8 byte-based token estimation
- trace_tools.py: standalone expand_trace/grep_trace functions
- span_utils.py: shared span processing utilities

Modified:

- judge_span_digest_formatter.py: format_structure_only() with token usage
- judge_agent.py: progressive discovery loop, token_threshold/max_discovery_steps config

Tests: 173 passing (39 new)

* fix: resolve pyright type errors in span_utils and tests

- Use str() before int() for AttributeValue token counts (could be Sequence[float] per OTel types)
- Add isinstance(result, ScenarioResult) narrowing before accessing .success/.reasoning/.passed_criteria/.failed_criteria
- Assert reasoning is not None before using `in` operator

* refactor: use span IDs instead of sequential indices for trace tools

LLMs are terrible at counting and positional reasoning. Replace sequential indices ([1], [2], [3]) with truncated 8-char span IDs ([a0b1c2d3]) so the judge can reference spans directly by ID.

- expand_trace now takes span_id/span_ids instead of index/range
- Prefix matching: LLM can use truncated IDs from the skeleton
- Replace Unicode escape sequences with actual characters (└──, ├──, │)
- Clean up Python _judge/__init__.py to only export public API

Applies to both TypeScript and Python implementations.

* refactor: simplify expand_trace API to single span_ids array param

Replace dual span_id/span_ids parameters with a single span_ids array in both TypeScript and Python. This reduces token waste in the tool schema and simplifies the API surface.
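The prefix matching mentioned above (letting the judge pass truncated 8-char IDs from the skeleton view) can be sketched like this. The helper name and error handling are hypothetical; the real expand_trace resolution may behave differently for ambiguous prefixes.

```python
def resolve_span_ids(requested: list[str], all_span_ids: list[str]) -> list[str]:
    """Resolve possibly-truncated span IDs to full span IDs by prefix match.

    The structure-only skeleton shows truncated 8-char IDs like [a0b1c2d3];
    the judge echoes those back and we match them against the full IDs.
    Ambiguous or unknown prefixes are skipped here; a real implementation
    would surface an error back to the judge instead.
    """
    resolved = []
    for req in requested:
        matches = [sid for sid in all_span_ids if sid.startswith(req)]
        if len(matches) == 1:
            resolved.append(matches[0])
    return resolved

spans = ["a0b1c2d3e4f5a6b7", "ffee99887766aabb"]
assert resolve_span_ids(["a0b1c2d3"], spans) == ["a0b1c2d3e4f5a6b7"]
```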
* docs: add "How Judging Works" page and trace access docs Split the conceptual "How Judging Works" content from custom-judge.mdx into its own advanced page. Covers the judging loop, trace rendering modes (full inline vs structure-only), progressive discovery tools (expand_trace/grep_trace), and configuration options. Add "Accessing Traces in Custom Judges" section to custom-judge.mdx showing how to use JudgeSpanCollector and trace tools in custom judges. Includes working example tests in both Python and TypeScript. * fix: reorder imports in custom-judge-with-traces example for ESLint Move @ai-sdk/openai import before @langwatch/scenario to satisfy ESLint import ordering rules. * fix: reduce flakiness in boat trip travel planning example test - Script the accommodation request explicitly instead of relying on the user simulator to generate it, removing non-determinism - Remove the conditional criterion ("ask which city if they don't provide it") that was fragile because the agent reasonably provides options for both cities from context - Simplify the script flow from 3 unscripted user turns to 1 * fix: reduce flakiness in hungry user example test - Change description from "could eat a cow" to "wants a big, filling meal" to prevent the user simulator from steering toward meat - Script the first user message explicitly so the conversation starts unambiguously about wanting a filling dinner - Use a scripted flow (2 agent turns) instead of open-ended max_turns to keep the conversation focused * fix: relax flaky criterion in lovable clone test Change "agent extended the landing page with a new section" to "agent made multiple changes or iterations on the landing page" to reduce flakiness when the user simulator doesn't specifically ask for a new section extension.
Most Python example tests make real LLM API calls but were missing @pytest.mark.flaky(reruns=2), causing CI failures on transient errors like rate limits. Previously only 5 of ~25 LLM-calling tests had this marker. This adds it to the remaining 20 tests for consistency.
Fix grammar issues in the vibe eval loop documentation

Corrected minor grammatical errors for clarity.
* feat: add low-risk PR self-approval workflows

Add three GitHub Actions workflows and a policy document to enable self-approval of low-risk PRs (e.g. test config, docs, formatting):

- approval-or-hotfix.yml: enforces that PRs need either 1 approval or a "hotfix"/"low-risk-change" label
- low-risk-evaluation.yml: AI-powered evaluation of PR diffs against the low-risk policy, auto-labels qualifying PRs
- low-risk-label-reset.yml: removes the label when new commits are pushed, requiring re-evaluation
- docs/LOW_RISK_PULL_REQUESTS.md: documents the policy criteria

Ported from langwatch/langwatch with minor adjustments for this repo.

* fix: retrigger approval check on push/rebase

Add opened, reopened, and synchronize to the pull_request trigger types so the check re-runs after rebases and new commits instead of going stale and blocking merge.

* fix: auto-run low-risk evaluation on every PR, use dedicated API key

- Trigger evaluation automatically on PR open/reopen/synchronize instead of requiring manual workflow_dispatch
- Fold label reset into the evaluation workflow (remove stale label first, then re-evaluate fresh) — deletes separate label-reset workflow
- Use LOW_RISK_OPENAI_API_KEY secret for cost tracking
- Keep workflow_dispatch as fallback for manual runs
- Use gpt-4.1-mini instead of gpt-5-mini
* fix: skip low-risk evaluation for fork PRs

Fork PRs cannot access repo secrets (LOW_RISK_OPENAI_API_KEY) and have a read-only GITHUB_TOKEN, which would cause label/comment operations to fail. Skip the entire evaluation job when the PR originates from a fork.

* fix: re-run approval check after low-risk label is applied

Labels added by GITHUB_TOKEN don't fire the "labeled" event for other workflows (GitHub anti-recursion). This caused check-approval-or-label to stay failed even after the low-risk-change label was applied. Two fixes:

1. approval-or-hotfix.yml: fetch current labels from the API instead of the stale event payload, so re-runs see the latest state.
2. low-risk-evaluation.yml: after applying the label, find and re-run the failed approval workflow so the check updates without manual intervention.
Both JavaScript and Python packages were configured with `prerelease: true`, causing GitHub releases to be marked as pre-releases instead of latest.
…refighting (#248)

* fix: only accept low-risk-change from automation, rename hotfix to firefighting

- low-risk-change label is now only honored when added by github-actions[bot], preventing humans from bypassing review by manually adding the label
- Renamed hotfix label to firefighting for clarity
- Updated policy doc to reflect both changes

* fix: check most recent label event, not any historical one

Use github.paginate to fetch all events and check the *last* labeled/unlabeled event for low-risk-change. Prevents the scenario where a bot adds the label, a human removes it, then re-adds it manually — the stale bot event would no longer pass the check.
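The "most recent label event" check can be sketched in Python, abstracting away the GitHub Actions plumbing. The event shape and helper name are simplified assumptions; the real workflow reads paginated issue timeline events via octokit.

```python
def label_added_by_bot(events: list[dict], label: str = "low-risk-change") -> bool:
    """Honor the label only if the *last* labeled/unlabeled event for it
    was a 'labeled' performed by the automation bot.

    Scanning chronologically and keeping only the final matching event
    prevents a stale bot 'labeled' event from passing the check after a
    human removed and manually re-added the label.
    """
    last = None
    for ev in events:  # events assumed in chronological order
        if ev.get("event") in ("labeled", "unlabeled") and ev.get("label") == label:
            last = ev
    return (
        last is not None
        and last["event"] == "labeled"
        and last.get("actor") == "github-actions[bot]"
    )

history = [
    {"event": "labeled", "label": "low-risk-change", "actor": "github-actions[bot]"},
    {"event": "unlabeled", "label": "low-risk-change", "actor": "human"},
    {"event": "labeled", "label": "low-risk-change", "actor": "human"},
]
# the stale bot event no longer passes: the last add was by a human
assert label_added_by_bot(history) is False
```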
Add concurrency group to approval check workflow so when multiple events fire for the same PR (e.g. pull_request + pull_request_review), the older run is cancelled instead of showing duplicate failing checks.
#251)

* fix: allow discovery tools before forced verdict on large traces

When the trace exceeds the token threshold and judgment is enforced (last turn or explicit judgment_request), tool_choice was forced to finish_test on every loop iteration, preventing the judge from ever calling expand_trace/grep_trace.

Fix: in the discovery loop, use "required" tool_choice for intermediate steps so the judge can freely pick discovery tools. Only force finish_test on the final step (Python) or let stopWhen handle termination (TypeScript).

Add regression tests proving the bug and verifying the fix. Add integration test for Sonnet 4 with a realistic analytics trace.

* test: mark Sonnet 4 integration test as skip for CI

* fix: place discovery tools before terminal tools in tool list
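The fixed tool_choice policy can be modeled as a small function (a simplified sketch of the Python-side loop described above; the function name is hypothetical and the real judge loop carries more state):

```python
def tool_choice_for_step(step: int, max_discovery_steps: int):
    """Pick the litellm tool_choice for one iteration of the discovery loop.

    Intermediate steps use "required" so the judge may freely call
    expand_trace/grep_trace; only the final step forces finish_test,
    guaranteeing a verdict without starving discovery.
    """
    if step >= max_discovery_steps - 1:
        return {"type": "function", "function": {"name": "finish_test"}}
    return "required"

assert tool_choice_for_step(0, 3) == "required"          # free to explore
assert tool_choice_for_step(1, 3) == "required"
assert tool_choice_for_step(2, 3)["function"]["name"] == "finish_test"  # verdict
```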
…ate (#253)

* fix: deduplicate approval checks with concurrency group

Add concurrency group to approval check workflow so when multiple events fire for the same PR (e.g. pull_request + pull_request_review), the older run is cancelled instead of showing duplicate failing checks.

* fix: remove reRunWorkflow that creates zombie runs stuck in queued state
…pproval (#254)

* fix: delete previous assessment comments on re-evaluation

* fix: write check run on pass to override stale failed checks from other event types
Replace the MIT license with GNU Affero General Public License v3 across the LICENSE file, JavaScript package.json, and Python pyproject.toml.
* fix: add timeout to EventBus.drain() to prevent test hangs

When the LangWatch API is slow or returns errors, drain() blocks indefinitely because concatMap processes events sequentially. With many message snapshot events (e.g. maxTurns: 20 = 40+ events), the total drain time can exceed test timeouts. Add a 30-second default timeout via Promise.race so event reporting (which is best-effort) doesn't block scenario execution.

* fix: increase drain timeout to 5min and bump max-turns test timeout

Real LLM-backed scenarios can have legitimately slow event posting, so 30s was too aggressive. Bump drain timeout to 5 minutes as a safety net, and increase the max-turns test timeout to 6 minutes to account for 40+ sequential event POSTs during drain.
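The same best-effort timeout pattern, sketched in Python with asyncio (the actual SDK change is TypeScript and uses Promise.race; the helper name here is hypothetical and 300s mirrors the 5-minute safety net):

```python
import asyncio

async def drain_with_timeout(drain_coro, timeout_s: float = 300.0) -> bool:
    """Race the event drain against a timeout.

    Event reporting is best-effort: if the drain doesn't finish in time,
    give up rather than block scenario execution indefinitely.
    """
    try:
        await asyncio.wait_for(drain_coro, timeout=timeout_s)
        return True   # all pending events flushed
    except asyncio.TimeoutError:
        return False  # timed out; reporting is best-effort

# a fast drain completes well within the timeout
assert asyncio.run(drain_with_timeout(asyncio.sleep(0.01), timeout_s=1.0)) is True
```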
#262)

* fix: remove opened/synchronize triggers to prevent early failure before evaluation

* fix: handle approval dismissal and override stale passing checks

- Add `dismissed` to pull_request_review triggers so the workflow re-evaluates when an approval is removed
- Use checks.create with conclusion=failure on the failure path to override any previous passing check run (e.g. from a now-dismissed approval)
…#263)

The low-risk evaluation workflow triggers on synchronize (new pushes). Add a final step that checks if the PR has a firefighting label or existing approvals and creates a passing check-approval-or-label for the new SHA. Without this, firefighting PRs and approved PRs get stuck as "Expected — Waiting" after follow-up commits.
* feat: add langwatch.scope="simulation" span attribute

Sets `langwatch.scope = "simulation"` on root spans for both Python and JavaScript scenario test runs. This enables LangWatch to classify scenario traces distinctly from application/production traces.

- Python: set_attributes on root span in ScenarioExecutor._new_turn()
- JavaScript: added to startSpan attributes in ScenarioExecution.newTurn()
- Tests: 2 Python + 2 JavaScript tests for single and multi-turn scenarios

* fix: resolve CI failures in scope attribute tests

- JS: add missing `criteria` property to mock JudgeAgentAdapter classes
- Python: import `Once` from `opentelemetry.util._once` instead of `trace`

* refactor: rename langwatch.scope to langwatch.origin

Renames the trace classification attribute to match the agreed convention: langwatch.origin avoids confusion with OTEL scope and reads clearly in table columns.
When the scorer confirms the attack objective is met (score >= threshold for N consecutive turns), the marathon script exits early — saving API tokens and preserving a clean signal. Adds success_score (default 9) and success_confirm_turns (default 2) config, check_early_exit() method, and instance marathon_script() that auto-inserts exit checks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
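The early-exit logic described above can be sketched as a standalone function. This is a minimal model, not the actual RedTeamAgent method: it takes the cached turn scores as a plain list rather than reading `_turn_scores` off the instance.

```python
from typing import Optional

def check_early_exit(
    turn_scores: list[int],
    success_score: Optional[int] = 9,
    success_confirm_turns: int = 2,
) -> bool:
    """Return True when the last N scores all meet the success threshold.

    success_score=None disables early exit entirely, matching the
    configuration escape hatch described above.
    """
    if success_score is None:
        return False
    if len(turn_scores) < success_confirm_turns:
        return False
    recent = turn_scores[-success_confirm_turns:]
    return all(score >= success_score for score in recent)

assert check_early_exit([]) is False          # no scores yet
assert check_early_exit([9]) is False         # only 1 high score, need 2
assert check_early_exit([3, 9, 10]) is True   # last 2 turns both >= 9
assert check_early_exit([9, 10], success_score=None) is False  # disabled
```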
When the target hard-refuses an attack, the refused exchange is removed from conversation history so the target "forgets" it refused. The attacker retries with a different technique from a clean slate. This follows the consensus approach from PyRIT, DeepTeam, and Promptfoo Hydra.

- Add _MAX_BACKTRACKS=10, _backtracks_remaining, _backtrack_history state
- Detect hard refusals in call() and remove messages from last user onwards
- Feed backtrack history to strategy as FAILED APPROACHES block
- Skip scoring on backtracked turns (cache score=0)
- Pad marathon_script iterations by _MAX_BACKTRACKS for effective turn budget
- Add OTel span attributes for backtrack observability
- Fix pre-existing broken unbound marathon_script() calls in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
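The history-rewind step can be sketched like this, operating on a copy rather than mutating the executor's list in place. A minimal illustration: the function name is hypothetical and messages are assumed to be OpenAI-style role/content dicts.

```python
def backtrack_history(messages: list[dict]) -> list[dict]:
    """Return a copy of the conversation with the refused exchange removed.

    Everything from the last user message onwards is dropped, so the target
    'forgets' it refused and the attacker can retry a different technique
    from a clean slate. The input list is left untouched.
    """
    last_user = None
    for i, msg in enumerate(messages):
        if msg.get("role") == "user":
            last_user = i
    if last_user is None:
        return list(messages)  # nothing to rewind
    return messages[:last_user]

history = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "hello"},
    {"role": "user", "content": "attack attempt"},
    {"role": "assistant", "content": "I can't help with that."},
]
trimmed = backtrack_history(history)
assert len(trimmed) == 2          # refused exchange removed
assert len(history) == 4          # original untouched
```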
* docs: add Red Teaming documentation

Add comprehensive docs for RedTeamAgent covering Python and TypeScript APIs, Crescendo strategy phases, per-turn response scoring, refusal detection, the marathon_script helper, and examples for common use cases.

Closes langwatch/langwatch#2068
Part of langwatch/langwatch#1713

* docs: fix red teaming API nomenclature to match implementation

- Python checks: use state.messages iteration (no last_agent_message_str())
- TypeScript checks: use ScenarioExecutionStateLike + lastAgentMessage()
- Remove max_turns references (marathon_script handles turn count)
- Fix all code examples to match actual Python/TypeScript APIs

* docs: align red teaming docs with PR #266 nomenclature

- Rename attacker_model → model in all Python examples
- Fix import path: from scenario import RedTeamStrategy
- Update config reference table to match new parameter names

* docs: add early exit, backtracking, roadmap, and fix marathon_script examples

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Fix return type annotations on mock agents/adapters (use str instead of dict)
- Add `assert ... is not None` guards for `system_prompt` checks (str | None)
- Add `type: ignore[assignment]` for mock `.call` overrides in tests
- Fix custom strategy `build_system_prompt` signatures to match base class
- Use `inspect.isawaitable()` for early-exit step invocations
- Fix `_get_phase` access to use `CrescendoStrategy` type assertion
- Fix `debug_red_team.py` marathon_script call to use instance method

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move inline `import scenario` to top-level imports in test_red_team_agent.py and debug_red_team.py, and remove unused imports (Union, ScenarioResult) from red_team_agent.py. Addresses all code quality bot comments on PR #270. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove `import scenario` and use only `from scenario import ...` to eliminate all mixed-import-style warnings from code quality bot. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove unused imports (user, agent, judge) from test file
- Remove unused variable original_generate
- Replace ellipsis with raise NotImplementedError in abstract methods
- Add explicit return None to _early_exit_check
- Assign awaited results to _ to silence no-effect warnings

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Critical fixes:

- Backtracking now works on a copy of messages instead of mutating the executor's canonical list in place (prevents data corruption)
- Fix markdown fence stripping using startswith/endswith instead of str.strip() which strips individual characters
- Remove debug_red_team.py from the package (dev-only script)

Moderate fixes:

- Compute phase boundaries from _PHASES instead of hardcoding 0.20/0.45/0.75 in the metaprompt template (prevents silent desync)
- Document refusal patterns as English-only heuristic with false-positive tolerance (LLM scorer is the authoritative fallback)
- Document that padding may cause more iterations than requested turns
- Rename detect_refusals → fast_refusal_detection for clarity (backtracking is independently controlled)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
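The fence-stripping fix above is worth illustrating: `str.strip("`")` treats its argument as a character set and removes individual backticks from both ends, mangling content that merely starts or ends with one, whereas startswith/endswith removes whole fences only. A minimal sketch (function name assumed for illustration):

```python
def strip_markdown_fence(text: str) -> str:
    """Remove a surrounding markdown code fence, if present.

    Uses startswith/endswith rather than str.strip("`"), because strip()
    treats its argument as a character set and would eat backticks that
    belong to the content itself.
    """
    stripped = text.strip()
    if stripped.startswith("```"):
        # drop the opening fence line (which may carry a language tag)
        first_newline = stripped.find("\n")
        stripped = stripped[first_newline + 1:] if first_newline != -1 else ""
    if stripped.endswith("```"):
        stripped = stripped[:-3].rstrip()
    return stripped

assert strip_markdown_fence('```json\n{"a": 1}\n```') == '{"a": 1}'
assert strip_markdown_fence("plain text") == "plain text"
```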
Contributor

Automated low-risk assessment: This PR was evaluated against the repository's Low-Risk Pull Requests procedure and does not qualify as low risk. This PR requires a manual review before merging.
Summary
- Add `success_score` (default 9) and `success_confirm_turns` (default 2) config to `RedTeamAgent.__init__` and `crescendo()`
- Add `check_early_exit()` method that reads the `_turn_scores` cache and returns `True` when the last N scores are all >= threshold
- Change `marathon_script` from `@staticmethod` to an instance method that auto-inserts early-exit check steps after each `agent()` turn
- On early exit, run `final_checks` inline, then call `executor.succeed()` with descriptive reasoning
- Set `success_score=None` to disable early exit

Closes langwatch/langwatch#2042
Test plan
- `check_early_exit` returns false with no scores
- `check_early_exit` returns false with scores below threshold
- `check_early_exit` returns false with only 1 high score when `confirm_turns=2`
- `check_early_exit` returns true with N consecutive high scores
- `marathon_script` generates early-exit check steps when enabled
- `marathon_script` omits checks when `success_score=None`
- Early exit calls `succeed()` with correct reasoning, runs `final_checks` first
- `marathon_script` still works

🤖 Generated with Claude Code