feat(ui): Add Auto Rubric feature for automatic grading criteria generation #92
base: main
Conversation
Add new Auto Rubric feature that automatically generates evaluation rubrics for LLM applications. Includes simple and iterative generation modes, data upload, history management, and export functionality.
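To make the two modes concrete, here is a hypothetical usage sketch of the iterative mode built from the classes added in this PR (IterativeRubricConfig, RubricGeneratorService, run_async); the dataset record shape and the handling of generate_iterative's return value are assumptions for illustration, not code taken from the PR:

config = IterativeRubricConfig(
    grader_name="helpfulness_grader",
    dataset=[
        # Labeled training samples; the exact field names are assumed.
        {"query": "How do I reset my password?", "response": "...", "score": 4},
    ],
    grader_mode=GraderMode.POINTWISE,
    language=LanguageEnum.EN,
    min_score=0,
    max_score=5,
)
service = RubricGeneratorService()
# How the generated grader is returned and consumed here is an assumption.
grader = run_async(service.generate_iterative(config))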
Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates a new "Auto Rubric" feature into the application's user interface, designed to streamline the creation of evaluation rubrics for Large Language Model (LLM) applications. It provides users with flexible options to generate rubrics either from a simple task description or through an iterative process leveraging labeled datasets. The feature also includes robust tools for managing the history of generated rubrics, exporting them in various formats, and testing their efficacy directly within the UI.
Code Review
This pull request introduces the new "Auto Rubric" feature, a significant addition that allows for automatic generation of evaluation rubrics. The feature is well-structured, with clear separation between UI components, services, and the main feature logic. It includes modes for simple and iterative rubric generation, data upload capabilities, history management, and export functionality. The code demonstrates good practices, especially regarding security with consistent HTML escaping and safe display of LLM-generated content. My review includes a few suggestions to improve code consistency, maintainability, and adherence to Python best practices, primarily concerning code duplication and organization.
| if "rubric_test_result" not in st.session_state: | ||
| st.session_state["rubric_test_result"] = None | ||
|
|
||
| # Input fields | ||
| test_query = st.text_input( | ||
| t("rubric.test.query"), | ||
| placeholder=t("rubric.test.query_placeholder"), | ||
| key="rubric_test_query_compact", | ||
| ) | ||
|
|
||
| if grader_mode == "listwise": | ||
| # Listwise mode: show hint and two response inputs | ||
| st.caption(t("rubric.test.responses_hint")) | ||
| response_1 = st.text_area( | ||
| f"{t('rubric.test.response')} 1", | ||
| height=80, | ||
| key="rubric_test_response_compact_1", | ||
| ) | ||
| response_2 = st.text_area( | ||
| f"{t('rubric.test.response')} 2", | ||
| height=80, | ||
| key="rubric_test_response_compact_2", | ||
| ) | ||
| responses = [r for r in [response_1, response_2] if r.strip()] | ||
| can_run = bool(test_query.strip() and len(responses) >= 2) | ||
|
|
||
| if st.button( | ||
| f"▶️ {t('rubric.test.run')}", | ||
| disabled=not can_run, | ||
| key="rubric_test_run_compact", | ||
| ): | ||
| with st.spinner(t("rubric.test.running")): | ||
| try: | ||
| service = RubricGeneratorService() | ||
| result = run_async(service.test_grader_listwise(grader, test_query, responses)) | ||
| st.session_state["rubric_test_result"] = result | ||
| except Exception as e: | ||
| st.session_state["rubric_test_result"] = { | ||
| "success": False, | ||
| "error": str(e), | ||
| } | ||
| else: | ||
| # Pointwise mode | ||
| test_response = st.text_area( | ||
| t("rubric.test.response"), | ||
| placeholder=t("rubric.test.response_placeholder"), | ||
| height=100, | ||
| key="rubric_test_response_compact", | ||
| ) | ||
| can_run = bool(test_query.strip() and test_response.strip()) | ||
|
|
||
| if st.button( | ||
| f"▶️ {t('rubric.test.run')}", | ||
| disabled=not can_run, | ||
| key="rubric_test_run_compact", | ||
| ): | ||
| with st.spinner(t("rubric.test.running")): | ||
| try: | ||
| service = RubricGeneratorService() | ||
| result = run_async(service.test_grader(grader, test_query, test_response)) | ||
| st.session_state["rubric_test_result"] = result | ||
| except Exception as e: | ||
| st.session_state["rubric_test_result"] = { | ||
| "success": False, | ||
| "error": str(e), | ||
| } | ||
|
|
||
| # Display result | ||
| test_result = st.session_state.get("rubric_test_result") | ||
| if test_result: | ||
| if test_result.get("success"): | ||
| if grader_mode == "listwise": | ||
| rank = test_result.get("rank", []) | ||
| reason = test_result.get("reason", "") | ||
| st.success(f"{t('rubric.test.rank')}: {rank}") | ||
| if reason: | ||
| st.text_area( | ||
| t("rubric.test.reason"), | ||
| value=reason, | ||
| height=100, | ||
| disabled=True, | ||
| key="compact_listwise_reason_display", | ||
| ) | ||
| else: | ||
| score = test_result.get("score") | ||
| reason = test_result.get("reason", "") | ||
| col1, col2 = st.columns([1, 3]) | ||
| with col1: | ||
| st.metric(t("rubric.test.score"), score) | ||
| with col2: | ||
| if reason: | ||
| st.text_area( | ||
| t("rubric.test.reason"), | ||
| value=reason, | ||
| height=100, | ||
| disabled=True, | ||
| key="compact_reason_display", | ||
| ) | ||
| else: | ||
| st.error(test_result.get("error", "Unknown error")) |
This function duplicates test execution logic from the _run_test_pointwise and _run_test_listwise helper functions. This duplicated logic also omits the rubric_test_running state management, which is used in the helper functions to disable the button during execution.
While the current implementation with a blocking run_async call prevents race conditions, this inconsistency makes the code harder to maintain and less robust against future changes (e.g., to a non-blocking execution model).
Refactoring this to reuse the existing helper functions (perhaps by making them non-private) or at least making the implementation consistent by adding state management would improve the code's quality.
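A minimal sketch of that refactor, assuming the helpers are reused directly and accept the grader, query, and response(s) as arguments (their actual signatures are not shown in this diff):

if st.button(
    f"▶️ {t('rubric.test.run')}",
    disabled=not can_run,
    key="rubric_test_run_compact",
):
    # Delegate to the shared helpers, which already manage the
    # rubric_test_running flag and store the result in session state.
    if grader_mode == "listwise":
        _run_test_listwise(grader, test_query, responses)  # assumed signature
    else:
        _run_test_pointwise(grader, test_query, test_response)  # assumed signature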
return 0

# Count numbered items (1., 2., etc.) or "Rubric X:" patterns
import re
The import re statement is inside the get_rubrics_count method. According to PEP 8 style guidelines, imports should be at the top of the file. This improves readability by making dependencies clear and avoids the minor performance overhead of re-importing if the function is called multiple times. Please move this import to the top of the file.
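For example, a sketch of the suggested change (the host class, regex, and method body below are illustrative assumptions, not the PR's code):

import re  # moved to module level per PEP 8

_RUBRIC_ITEM_RE = re.compile(r"^\s*(?:\d+\.|Rubric\s+\d+\s*:)", re.MULTILINE)

class RubricHistoryEntry:  # hypothetical host class for illustration
    def get_rubrics_count(self, rubrics_text: str) -> int:
        if not rubrics_text:
            return 0
        # Count numbered items (1., 2., etc.) or "Rubric X:" patterns
        return len(_RUBRIC_ITEM_RE.findall(rubrics_text))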
@dataclass
class IterativeRubricConfig:
    """Configuration for Iterative Rubric generation.

    Attributes:
        grader_name: Unique name for the generated grader.
        dataset: List of labeled training data dictionaries.
        grader_mode: POINTWISE or LISTWISE evaluation mode.
        task_description: Optional task description for context.
        language: Language for prompts (EN or ZH).
        min_score: Minimum score for pointwise mode.
        max_score: Maximum score for pointwise mode.
        enable_categorization: Whether to group similar rubrics.
        categories_number: Target number of categories.
        query_specific_generate_number: Rubrics per training sample.
        max_retries: Maximum retry attempts for LLM calls.
        api_endpoint: API endpoint URL.
        api_key: API key for authentication.
        model_name: Model name to use.
    """

    grader_name: str
    dataset: list[dict[str, Any]]
    grader_mode: GraderMode = GraderMode.POINTWISE
    task_description: str | None = None
    language: LanguageEnum = LanguageEnum.EN
    min_score: int = 0
    max_score: int = 5
    enable_categorization: bool = True
    categories_number: int = 5
    query_specific_generate_number: int = 2
    max_retries: int = 3
    api_endpoint: str = ""
    api_key: str = ""
    model_name: str = ""
The IterativeRubricConfig class is defined at the end of the file, after it has been used as a type hint (with a forward reference string) in the generate_iterative method. For better readability and code organization, it's standard practice to define classes and data structures before they are referenced. Consider moving this class definition to be before the RubricGeneratorService class, alongside SimpleRubricConfig.
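A sketch of the suggested layout (the surrounding definitions are abbreviated and partly assumed; only the ordering is the point):

@dataclass
class SimpleRubricConfig:
    ...

@dataclass
class IterativeRubricConfig:
    ...

class RubricGeneratorService:
    def generate_iterative(self, config: IterativeRubricConfig):
        # The dataclass is now defined above, so the annotation no longer
        # needs the "IterativeRubricConfig" forward-reference string.
        ...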
Pytest was treating test_panel.py as a test file due to the 'test_' prefix, causing import errors in CI. Renamed to rubric_tester.py to fix this issue.
…nguage preference
- Use stable values (e.g., "_custom_", "python") instead of translated labels as selectbox options (see the sketch below)
- Prevents widget state loss when UI language changes
- Add localStorage persistence for language preference
- Add inject_language_loader() to restore language setting on page load
- Update navigation to use stable feature_ids for feature selector
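A minimal sketch of the stable-value pattern described in the first bullet above (the label keys and widget key are assumptions for illustration):

# Options are stable identifiers; only the displayed labels are translated.
options = ["_custom_", "python"]
choice = st.selectbox(
    t("sdk.language"),  # label key assumed
    options=options,
    format_func=lambda opt: t(f"sdk.option.{opt}"),  # translate for display only
    key="sdk_language_select",  # widget key assumed
)
# Because the stored value (e.g. "python") does not change with the UI
# language, switching languages no longer resets the widget state.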
feat(ui): Add Auto Rubric feature for automatic grading criteria generation
OpenJudge Version
[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]
Description
[Please describe the background, purpose, changes made, and how to test this PR]
Checklist
Please check the following items before code is ready to be reviewed.
pre-commit run --all-files command