
Conversation

@XiaoBoAI
Collaborator

…ration

Add new Auto Rubric feature that automatically generates evaluation rubrics for LLM applications. Includes simple and iterative generation modes, data upload, history management, and export functionality.

OpenJudge Version

[The version of OpenJudge you are working on, e.g. import openjudge; print(openjudge.__version__)]

Description

[Please describe the background, purpose, changes made, and how to test this PR]

Checklist

Please check the following items before the code is ready for review.

  • Code has been formatted with the pre-commit run --all-files command
  • All tests are passing
  • Docstrings are in Google style
  • Related documentation has been updated (e.g. links, examples, etc.)
  • Code is ready for review

@gemini-code-assist
Contributor

Summary of Changes

Hello @XiaoBoAI, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates a new "Auto Rubric" feature into the application's user interface, designed to streamline the creation of evaluation rubrics for Large Language Model (LLM) applications. It provides users with flexible options to generate rubrics either from a simple task description or through an iterative process leveraging labeled datasets. The feature also includes robust tools for managing the history of generated rubrics, exporting them in various formats, and testing their efficacy directly within the UI.

Highlights

  • New Feature: Auto Rubric: Introduced a new 'Auto Rubric' feature that automates the generation of evaluation rubrics for Large Language Model (LLM) applications.
  • Dual Generation Modes: Supports two distinct modes for rubric generation: 'Simple Rubric' for zero-shot generation based on a task description, and 'Iterative Rubric' for data-driven generation using labeled datasets (a usage sketch follows this list).
  • Comprehensive UI and Functionality: Includes a full suite of UI components for configuration, data upload, displaying results, managing generation history, and exporting generated graders in Python, YAML, or JSON formats.
  • Grader Testing Capability: Provides an integrated testing panel to evaluate the performance of generated graders with sample inputs in both pointwise (scoring) and listwise (ranking) evaluation modes.
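
For orientation, here is a minimal end-to-end sketch of how the names visible in this PR (RubricGeneratorService, IterativeRubricConfig, GraderMode, run_async) appear to fit together. Imports are omitted because the module paths are not shown in this excerpt, and the constructor arguments, dataset schema, and return shapes are assumptions for illustration, not the PR's confirmed API:

# Sketch only; names come from this PR's diff, argument shapes are assumed.
config = IterativeRubricConfig(
    grader_name="support_reply_grader",  # hypothetical grader name
    dataset=[{"query": "...", "response": "...", "score": 4}],  # assumed label schema
    grader_mode=GraderMode.POINTWISE,
    min_score=0,
    max_score=5,
)

service = RubricGeneratorService()
grader = run_async(service.generate_iterative(config))  # run_async is the PR's existing helper

# Test the generated grader in pointwise mode, mirroring the UI's testing panel.
result = run_async(service.test_grader(grader, "How do I reset my password?", "Open Settings > Account..."))
print(result.get("score"), result.get("reason"))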

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces the new "Auto Rubric" feature, a significant addition that allows for automatic generation of evaluation rubrics. The feature is well-structured, with clear separation between UI components, services, and the main feature logic. It includes modes for simple and iterative rubric generation, data upload capabilities, history management, and export functionality. The code demonstrates good practices, especially regarding security with consistent HTML escaping and safe display of LLM-generated content. My review includes a few suggestions to improve code consistency, maintainability, and adherence to Python best practices, primarily concerning code duplication and organization.
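
As a side note on the escaping practice praised above, the usual Streamlit-side pattern is a generic sketch like the following (not the PR's exact code; render_llm_text is an illustrative name):

import html

import streamlit as st


def render_llm_text(raw: str) -> None:
    """Escape LLM output before embedding it in custom HTML markup."""
    safe = html.escape(raw)  # neutralizes tags and attribute injection in model output
    st.markdown(f"<div class='rubric-box'>{safe}</div>", unsafe_allow_html=True)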

Comment on lines +419 to +534
if "rubric_test_result" not in st.session_state:
st.session_state["rubric_test_result"] = None

# Input fields
test_query = st.text_input(
t("rubric.test.query"),
placeholder=t("rubric.test.query_placeholder"),
key="rubric_test_query_compact",
)

if grader_mode == "listwise":
# Listwise mode: show hint and two response inputs
st.caption(t("rubric.test.responses_hint"))
response_1 = st.text_area(
f"{t('rubric.test.response')} 1",
height=80,
key="rubric_test_response_compact_1",
)
response_2 = st.text_area(
f"{t('rubric.test.response')} 2",
height=80,
key="rubric_test_response_compact_2",
)
responses = [r for r in [response_1, response_2] if r.strip()]
can_run = bool(test_query.strip() and len(responses) >= 2)

if st.button(
f"▶️ {t('rubric.test.run')}",
disabled=not can_run,
key="rubric_test_run_compact",
):
with st.spinner(t("rubric.test.running")):
try:
service = RubricGeneratorService()
result = run_async(service.test_grader_listwise(grader, test_query, responses))
st.session_state["rubric_test_result"] = result
except Exception as e:
st.session_state["rubric_test_result"] = {
"success": False,
"error": str(e),
}
else:
# Pointwise mode
test_response = st.text_area(
t("rubric.test.response"),
placeholder=t("rubric.test.response_placeholder"),
height=100,
key="rubric_test_response_compact",
)
can_run = bool(test_query.strip() and test_response.strip())

if st.button(
f"▶️ {t('rubric.test.run')}",
disabled=not can_run,
key="rubric_test_run_compact",
):
with st.spinner(t("rubric.test.running")):
try:
service = RubricGeneratorService()
result = run_async(service.test_grader(grader, test_query, test_response))
st.session_state["rubric_test_result"] = result
except Exception as e:
st.session_state["rubric_test_result"] = {
"success": False,
"error": str(e),
}

# Display result
test_result = st.session_state.get("rubric_test_result")
if test_result:
if test_result.get("success"):
if grader_mode == "listwise":
rank = test_result.get("rank", [])
reason = test_result.get("reason", "")
st.success(f"{t('rubric.test.rank')}: {rank}")
if reason:
st.text_area(
t("rubric.test.reason"),
value=reason,
height=100,
disabled=True,
key="compact_listwise_reason_display",
)
else:
score = test_result.get("score")
reason = test_result.get("reason", "")
col1, col2 = st.columns([1, 3])
with col1:
st.metric(t("rubric.test.score"), score)
with col2:
if reason:
st.text_area(
t("rubric.test.reason"),
value=reason,
height=100,
disabled=True,
key="compact_reason_display",
)
else:
st.error(test_result.get("error", "Unknown error"))

Severity: medium

This function duplicates test execution logic from the _run_test_pointwise and _run_test_listwise helper functions. This duplicated logic also omits the rubric_test_running state management, which is used in the helper functions to disable the button during execution.

While the current implementation with a blocking run_async call prevents race conditions, this inconsistency makes the code harder to maintain and less robust against future changes (e.g., to a non-blocking execution model).

Refactoring this to reuse the existing helper functions (perhaps by making them non-private) or at least making the implementation consistent by adding state management would improve the code's quality.
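
One possible shape for that refactor, assuming the existing helpers can be collapsed into a shared runner; run_grader_test and its signature are illustrative, while run_async and the session-state keys come from the PR (run_async is not imported here because its module path is not visible in this excerpt):

import streamlit as st


def run_grader_test(make_coro, *, result_key="rubric_test_result", running_key="rubric_test_running"):
    """Run one grader test and record the outcome in session state.

    make_coro is a zero-argument callable returning the coroutine to execute,
    e.g. lambda: service.test_grader(grader, query, response).
    """
    st.session_state[running_key] = True  # lets the UI disable the Run button while executing
    try:
        st.session_state[result_key] = run_async(make_coro())
    except Exception as e:
        st.session_state[result_key] = {"success": False, "error": str(e)}
    finally:
        st.session_state[running_key] = False

Both the compact panel and the full test panel could then call this runner instead of duplicating the try/except block.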

return 0

# Count numbered items (1., 2., etc.) or "Rubric X:" patterns
import re

Severity: medium

The import re statement is inside the get_rubrics_count method. According to PEP 8 style guidelines, imports should be at the top of the file. This improves readability by making dependencies clear and avoids the minor performance overhead of re-importing if the function is called multiple times. Please move this import to the top of the file.
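
For illustration, the hoisted-import version might look like the following; it is shown as a standalone function for brevity, and the regex patterns are guesses at the counting logic described above, not the PR's actual implementation:

import re  # module-level import, per PEP 8


def get_rubrics_count(text: str) -> int:
    """Count rubric entries written as numbered items ('1.', '2.') or 'Rubric N:' headers."""
    if not text:
        return 0
    numbered = re.findall(r"^\s*\d+\.", text, flags=re.MULTILINE)
    headed = re.findall(r"(?im)^rubric\s+\d+\s*:", text)
    return max(len(numbered), len(headed))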

Comment on lines +367 to +401
@dataclass
class IterativeRubricConfig:
    """Configuration for Iterative Rubric generation.

    Attributes:
        grader_name: Unique name for the generated grader.
        dataset: List of labeled training data dictionaries.
        grader_mode: POINTWISE or LISTWISE evaluation mode.
        task_description: Optional task description for context.
        language: Language for prompts (EN or ZH).
        min_score: Minimum score for pointwise mode.
        max_score: Maximum score for pointwise mode.
        enable_categorization: Whether to group similar rubrics.
        categories_number: Target number of categories.
        query_specific_generate_number: Rubrics per training sample.
        max_retries: Maximum retry attempts for LLM calls.
        api_endpoint: API endpoint URL.
        api_key: API key for authentication.
        model_name: Model name to use.
    """

    grader_name: str
    dataset: list[dict[str, Any]]
    grader_mode: GraderMode = GraderMode.POINTWISE
    task_description: str | None = None
    language: LanguageEnum = LanguageEnum.EN
    min_score: int = 0
    max_score: int = 5
    enable_categorization: bool = True
    categories_number: int = 5
    query_specific_generate_number: int = 2
    max_retries: int = 3
    api_endpoint: str = ""
    api_key: str = ""
    model_name: str = ""

Severity: medium

The IterativeRubricConfig class is defined at the end of the file, after it has been used as a type hint (with a forward reference string) in the generate_iterative method. For better readability and code organization, it's standard practice to define classes and data structures before they are referenced. Consider moving this class definition to be before the RubricGeneratorService class, alongside SimpleRubricConfig.
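
A sketch of the suggested ordering; once IterativeRubricConfig is defined before the service, the quoted forward reference in the generate_iterative annotation can become a plain type hint (class bodies elided, and the method signature is illustrative):

from dataclasses import dataclass


@dataclass
class SimpleRubricConfig:
    ...


@dataclass
class IterativeRubricConfig:
    ...


class RubricGeneratorService:
    async def generate_iterative(self, config: IterativeRubricConfig):
        ...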

Pytest was treating test_panel.py as a test file due to the 'test_' prefix,
causing import errors in CI. Renamed to rubric_tester.py to fix this issue.
…nguage preference

- Use stable values (e.g., "_custom_", "python") instead of translated labels as selectbox options (see the sketch after this list)
- Prevents widget state loss when UI language changes
- Add localStorage persistence for language preference
- Add inject_language_loader() to restore language setting on page load
- Update navigation to use stable feature_ids for feature selector
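
A small sketch of the stable-value pattern from the first bullet, assuming t() is the PR's translation helper; the option ids and translation keys used here are hypothetical:

import streamlit as st

EXPORT_FORMATS = ["python", "yaml", "json"]  # stable ids stored in widget state

choice = st.selectbox(
    t("rubric.export.format"),
    options=EXPORT_FORMATS,
    format_func=lambda v: t(f"rubric.export.format_{v}"),  # label changes with language, value does not
    key="rubric_export_format",
)

Because the stored value is a stable id rather than a translated label, switching the UI language re-renders the labels without invalidating the current selection.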