feat: add DeepEval RAG metrics to benchmark pipeline (#361)#362
Open
kiyotis wants to merge 38 commits into
Open
feat: add DeepEval RAG metrics to benchmark pipeline (#361)#362kiyotis wants to merge 38 commits into
kiyotis wants to merge 38 commits into
Conversation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove post-hoc modification of baseline-current results - Add incremental validation: 1-run (T7) → 3-run (T8) → full 30-run (T9) - Add HOW-TO-RUN.md update task (T10) - Rename diff check to T11 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…d add asyncio.run - build_deepeval_test_case now falls back to workflow_details.step3.selected_sections when diagnostics.search_sections is absent (run_qa output format) - _run_deepeval_metric uses asyncio.run() instead of new_event_loop() - run_qa.py: add --with-deepeval flag, pass with_deepeval to evaluate_scenario - test: add workflow_details fallback tests and precedence test Note: evaluation.json still shows null scores in run_qa context — root cause of asyncio interaction under claude subprocess call pending investigation. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…Eval aiobotocore (used by AmazonBedrockModel async calls) requires AWS_CA_BUNDLE for SSL certificate verification. Without it, corp proxy cert chains cause SSLCertVerificationError, silently returning None for all DeepEval scores. Horizontal check: only compute_deepeval_metrics creates AmazonBedrockModel; no other call site is affected. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
28/30 scenarios evaluated with --with-deepeval. accuracy vs answer_correctness: 96.4% agreement (27/28). hallucination vs faithfulness: 88.5% agreement (23/26). 3 hallucination/faithfulness mismatches explained by different reference sets (specific sections vs. retrieval_context). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add --with-deepeval flag to step 1 and step 2 commands, add deepeval install prerequisite, and update evaluation.json description to include DeepEval metrics. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Issue #361の正しい方針は「置き換え」。T12でLLMジャッジ削除を実装する。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
設計書・手順書・コード・テスト全ての影響箇所を調査済み。 ベストプラクティスに基づきLLMジャッジを削除しDeepEvalに一本化する。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…acement LLMジャッジ削除後は旧ベースライン(accuracy/hallucination)が無効になるため QA全件3 runでDeepEvalベースラインを取り直す。 キーワード検索はLLMジャッジ未使用のため取り直し不要。 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ics duplication
evaluation.jsonをシンプルに: scores={score+reason}, metrics/diagnostics削除。
report.pyはmetrics.jsonから読み取るよう変更。
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
3 runs × 30 scenarios each. All scenarios passed all 3 DeepEval metrics. | run | answer_correctness | answer_relevancy | faithfulness | |-----|-------------------|-----------------|--------------| | run-1 | 0.96 | 0.97 | 0.97 | | run-2 | 0.99 | 0.96 | 0.97 | | run-3 | 0.97 | 0.96 | 0.98 | Threshold pass rate (≥0.5): 30/30 across all runs and metrics. Replaces the old accuracy/hallucination baseline (baseline-current/). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ndard answer_correctness: 0.5 → 0.99 (missing facts cause wrong implementations) faithfulness: 0.5 → 0.99 (hallucinations cause wrong implementations) answer_relevancy: 0.5 → 0.95 (minor verbosity tolerated, major deviation is not) Update HOW-TO-RUN.md and benchmark-design.md to reflect new thresholds and rationale. Fix incorrect --run-dir × 3 command in step 4a. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…hmark Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Agent step-transition narration (e.g. "Step 4完了。read_sections=[...]") was being included in answer.md because parse_qa_response extracted all text before ### Workflow Details. The fix introduces a ### Answer marker in e2e-prompt.md Step 8 instruction. parse_qa_response now extracts only the text between ### Answer and ### Workflow Details. Legacy responses without ### Answer fall back to the previous behavior (full text before ### Workflow Details). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…in progress Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
run-1: 29/30 (qa-11a timeout) run-2: 26/30 (review-07, qa-02, qa-06 timeout; oos-qa-01 Workflow Details missing) run-3: in progress (26/30 done, interrupted at session end) Error scenarios will be retried at next session start. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #361
Approach
Integrated DeepEval's standard RAG metrics (Answer Correctness, Answer Relevancy, Faithfulness) into the existing benchmark pipeline via Amazon Bedrock. The approach adds DeepEval as an optional layer on top of the existing custom LLM judges rather than replacing them, enabling side-by-side comparison and external calibration.
Key design decisions:
--with-deepeval): DeepEval metrics are opt-in to avoid mandatory Bedrock calls on every benchmark runAmazonBedrockModel(Claude Sonnet 4.5) consistent with the rest of the pipeline; avoids separate API key managementAWS_CA_BUNDLEset fromSSL_CERT_FILEat compute time to handle aiobotocore's SSL cert chain requirement under corporate proxybuild_deepeval_test_casefalls back toworkflow_details.step3.selected_sectionswhendiagnostics.search_sectionsis absent (run_qa output format)Validation on 28/30 existing QA scenarios confirmed:
Tasks
See tasks.md.
Expert Review
Expert review not conducted for this PR.
Success Criteria Check
evaluate.py:compute_deepeval_metrics();report.py: DeepEval columns added;--with-deepevalflag in run_benchmark.sh and run_qa.py.work/00361/deepeval-validation.md: 28/30 scenarios evaluated; agreement rates and mismatch analysis documentedreport.pyaddsanswer_correctness,answer_relevancy,faithfulnesscolumns to evaluation.json reportdocs/benchmark-design.mddocs/benchmark-design.md: DeepEval metrics section added with rationale and thresholds (answer_correctness ≥ 0.7, faithfulness ≥ 0.7)tools/benchmark/tests/test_evaluate.pyandtest_report.pypass; 52-file diff check confirmed no unintended changes🤖 Generated with Claude Code