
Conversation

@XiaoBoAI
Collaborator

…ration

Add new Auto Rubric feature that automatically generates evaluation rubrics for LLM applications. Includes simple and iterative generation modes, data upload, history management, and export functionality.

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before the code is ready for review.

  • Code has been formatted with the pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new "Auto Rubric" feature into the application's user interface, designed to streamline the creation of evaluation rubrics for Large Language Model (LLM) applications. It provides users with flexible options to generate rubrics either from a simple task description or through an iterative process leveraging labeled datasets. The feature also includes robust tools for managing the history of generated rubrics, exporting them in various formats, and testing their efficacy directly within the UI.

Highlights

  • New Feature: Auto Rubric: Introduced a new 'Auto Rubric' feature that automates the generation of evaluation rubrics for Large Language Model (LLM) applications.
  • Dual Generation Modes: Supports two distinct modes for rubric generation: 'Simple Rubric' for zero-shot generation based on a task description, and 'Iterative Rubric' for data-driven generation using labeled datasets (a usage sketch follows this list).
  • Comprehensive UI and Functionality: Includes a full suite of UI components for configuration, data upload, displaying results, managing generation history, and exporting generated graders in Python, YAML, or JSON formats.
  • Grader Testing Capability: Provides an integrated testing panel to evaluate the performance of generated graders with sample inputs in both pointwise (scoring) and listwise (ranking) evaluation modes.
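
For orientation, here is a minimal end-to-end sketch of how the names visible in this PR (RubricGeneratorService, IterativeRubricConfig, GraderMode, run_async) appear to fit together. Imports are omitted because the module paths are not shown in this excerpt, and the constructor arguments, dataset schema, and return shapes are assumptions for illustration, not the PR's confirmed API:

# Sketch only; names come from this PR's diff, argument shapes are assumed.
config = IterativeRubricConfig(
    grader_name="support_reply_grader",  # hypothetical grader name
    dataset=[{"query": "...", "response": "...", "score": 4}],  # assumed label schema
    grader_mode=GraderMode.POINTWISE,
    min_score=0,
    max_score=5,
)

service = RubricGeneratorService()
grader = run_async(service.generate_iterative(config))  # run_async is the PR's existing helper

# Test the generated grader in pointwise mode, mirroring the UI's testing panel.
result = run_async(service.test_grader(grader, "How do I reset my password?", "Open Settings > Account..."))
print(result.get("score"), result.get("reason"))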

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces the new "Auto Rubric" feature, a significant addition that allows for automatic generation of evaluation rubrics. The feature is well-structured, with clear separation between UI components, services, and the main feature logic. It includes modes for simple and iterative rubric generation, data upload capabilities, history management, and export functionality. The code demonstrates good practices, especially regarding security with consistent HTML escaping and safe display of LLM-generated content. My review includes a few suggestions to improve code consistency, maintainability, and adherence to Python best practices, primarily concerning code duplication and organization.
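
As a side note on the escaping practice praised above, the usual Streamlit-side pattern is a generic sketch like the following (not the PR's exact code; render_llm_text is an illustrative name):

import html

import streamlit as st


def render_llm_text(raw: str) -> None:
    """Escape LLM output before embedding it in custom HTML markup."""
    safe = html.escape(raw)  # neutralizes tags and attribute injection in model output
    st.markdown(f"<div class='rubric-box'>{safe}</div>", unsafe_allow_html=True)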

Comment on lines +419 to +534
if "rubric_test_result" not in st.session_state:
st.session_state["rubric_test_result"] = None

# Input fields
test_query = st.text_input(
t("rubric.test.query"),
placeholder=t("rubric.test.query_placeholder"),
key="rubric_test_query_compact",
)

if grader_mode == "listwise":
# Listwise mode: show hint and two response inputs
st.caption(t("rubric.test.responses_hint"))
response_1 = st.text_area(
f"{t('rubric.test.response')} 1",
height=80,
key="rubric_test_response_compact_1",
)
response_2 = st.text_area(
f"{t('rubric.test.response')} 2",
height=80,
key="rubric_test_response_compact_2",
)
responses = [r for r in [response_1, response_2] if r.strip()]
can_run = bool(test_query.strip() and len(responses) >= 2)

if st.button(
f"▶️ {t('rubric.test.run')}",
disabled=not can_run,
key="rubric_test_run_compact",
):
with st.spinner(t("rubric.test.running")):
try:
service = RubricGeneratorService()
result = run_async(service.test_grader_listwise(grader, test_query, responses))
st.session_state["rubric_test_result"] = result
except Exception as e:
st.session_state["rubric_test_result"] = {
"success": False,
"error": str(e),
}
else:
# Pointwise mode
test_response = st.text_area(
t("rubric.test.response"),
placeholder=t("rubric.test.response_placeholder"),
height=100,
key="rubric_test_response_compact",
)
can_run = bool(test_query.strip() and test_response.strip())

if st.button(
f"▶️ {t('rubric.test.run')}",
disabled=not can_run,
key="rubric_test_run_compact",
):
with st.spinner(t("rubric.test.running")):
try:
service = RubricGeneratorService()
result = run_async(service.test_grader(grader, test_query, test_response))
st.session_state["rubric_test_result"] = result
except Exception as e:
st.session_state["rubric_test_result"] = {
"success": False,
"error": str(e),
}

# Display result
test_result = st.session_state.get("rubric_test_result")
if test_result:
if test_result.get("success"):
if grader_mode == "listwise":
rank = test_result.get("rank", [])
reason = test_result.get("reason", "")
st.success(f"{t('rubric.test.rank')}: {rank}")
if reason:
st.text_area(
t("rubric.test.reason"),
value=reason,
height=100,
disabled=True,
key="compact_listwise_reason_display",
)
else:
score = test_result.get("score")
reason = test_result.get("reason", "")
col1, col2 = st.columns([1, 3])
with col1:
st.metric(t("rubric.test.score"), score)
with col2:
if reason:
st.text_area(
t("rubric.test.reason"),
value=reason,
height=100,
disabled=True,
key="compact_reason_display",
)
else:
st.error(test_result.get("error", "Unknown error"))

Severity: medium

This function duplicates test execution logic from the _run_test_pointwise and _run_test_listwise helper functions. This duplicated logic also omits the rubric_test_running state management, which is used in the helper functions to disable the button during execution.

While the current implementation with a blocking run_async call prevents race conditions, this inconsistency makes the code harder to maintain and less robust against future changes (e.g., to a non-blocking execution model).

Refactoring this to reuse the existing helper functions (perhaps by making them non-private) or at least making the implementation consistent by adding state management would improve the code's quality.
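
One possible shape for that refactor, assuming the existing helpers can be collapsed into a shared runner; run_grader_test and its signature are illustrative, while run_async and the session-state keys come from the PR (run_async is not imported here because its module path is not visible in this excerpt):

import streamlit as st


def run_grader_test(make_coro, *, result_key="rubric_test_result", running_key="rubric_test_running"):
    """Run one grader test and record the outcome in session state.

    make_coro is a zero-argument callable returning the coroutine to execute,
    e.g. lambda: service.test_grader(grader, query, response).
    """
    st.session_state[running_key] = True  # lets the UI disable the Run button while executing
    try:
        st.session_state[result_key] = run_async(make_coro())
    except Exception as e:
        st.session_state[result_key] = {"success": False, "error": str(e)}
    finally:
        st.session_state[running_key] = False

Both the compact panel and the full test panel could then call this runner instead of duplicating the try/except block.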

return 0

# Count numbered items (1., 2., etc.) or "Rubric X:" patterns
import re

Severity: medium

The import re statement is inside the get_rubrics_count method. According to PEP 8 style guidelines, imports should be at the top of the file. This improves readability by making dependencies clear and avoids the minor performance overhead of re-importing if the function is called multiple times. Please move this import to the top of the file.
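
For illustration, the hoisted-import version might look like the following; it is shown as a standalone function for brevity, and the regex patterns are guesses at the counting logic described above, not the PR's actual implementation:

import re  # module-level import, per PEP 8


def get_rubrics_count(text: str) -> int:
    """Count rubric entries written as numbered items ('1.', '2.') or 'Rubric N:' headers."""
    if not text:
        return 0
    numbered = re.findall(r"^\s*\d+\.", text, flags=re.MULTILINE)
    headed = re.findall(r"(?im)^rubric\s+\d+\s*:", text)
    return max(len(numbered), len(headed))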

Comment on lines +367 to +401
@dataclass
class IterativeRubricConfig:
    """Configuration for Iterative Rubric generation.

    Attributes:
        grader_name: Unique name for the generated grader.
        dataset: List of labeled training data dictionaries.
        grader_mode: POINTWISE or LISTWISE evaluation mode.
        task_description: Optional task description for context.
        language: Language for prompts (EN or ZH).
        min_score: Minimum score for pointwise mode.
        max_score: Maximum score for pointwise mode.
        enable_categorization: Whether to group similar rubrics.
        categories_number: Target number of categories.
        query_specific_generate_number: Rubrics per training sample.
        max_retries: Maximum retry attempts for LLM calls.
        api_endpoint: API endpoint URL.
        api_key: API key for authentication.
        model_name: Model name to use.
    """

    grader_name: str
    dataset: list[dict[str, Any]]
    grader_mode: GraderMode = GraderMode.POINTWISE
    task_description: str | None = None
    language: LanguageEnum = LanguageEnum.EN
    min_score: int = 0
    max_score: int = 5
    enable_categorization: bool = True
    categories_number: int = 5
    query_specific_generate_number: int = 2
    max_retries: int = 3
    api_endpoint: str = ""
    api_key: str = ""
    model_name: str = ""

Severity: medium

The IterativeRubricConfig class is defined at the end of the file, after it has been used as a type hint (with a forward reference string) in the generate_iterative method. For better readability and code organization, it's standard practice to define classes and data structures before they are referenced. Consider moving this class definition to be before the RubricGeneratorService class, alongside SimpleRubricConfig.
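
A sketch of the suggested ordering; once IterativeRubricConfig is defined before the service, the quoted forward reference in the generate_iterative annotation can become a plain type hint (class bodies elided, and the method signature is illustrative):

from dataclasses import dataclass


@dataclass
class SimpleRubricConfig:
    ...


@dataclass
class IterativeRubricConfig:
    ...


class RubricGeneratorService:
    async def generate_iterative(self, config: IterativeRubricConfig):
        ...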

Pytest was treating test_panel.py as a test file due to the 'test_' prefix,
causing import errors in CI. Renamed to rubric_tester.py to fix this issue.
…nguage preference

- Use stable values (e.g., "_custom_", "python") instead of translated labels as selectbox options (see the sketch after this list)
- Prevents widget state loss when UI language changes
- Add localStorage persistence for language preference
- Add inject_language_loader() to restore language setting on page load
- Update navigation to use stable feature_ids for feature selector
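
A small sketch of the stable-value pattern from the first bullet, assuming t() is the PR's translation helper; the option ids and translation keys used here are hypothetical:

import streamlit as st

EXPORT_FORMATS = ["python", "yaml", "json"]  # stable ids stored in widget state

choice = st.selectbox(
    t("rubric.export.format"),
    options=EXPORT_FORMATS,
    format_func=lambda v: t(f"rubric.export.format_{v}"),  # label changes with language, value does not
    key="rubric_export_format",
)

Because the stored value is a stable id rather than a translated label, switching the UI language re-renders the labels without invalidating the current selection.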