[XL] Add Python Crank Scheduling tool #2106
Conversation
…he new scheduling tool.
…generating the yml pipelines.
…added machine_groups for the base azure configuration.
Pull Request Overview
This PR introduces a comprehensive Python-based crank scheduling tool to automate CI pipeline generation and optimize machine/scenario allocation across multiple performance testing machines.
- Adds a complete Python crank scheduler with sophisticated machine allocation algorithms and multi-YAML generation capabilities
- Updates existing CI configurations to use the new machine group system and multi-capability machine definitions
- Replaces manual YAML matrix files with JSON-based configuration and automated pipeline generation
Reviewed Changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| scripts/crank-scheduler/*.py | Core scheduler implementation with machine allocation, runtime estimation, and template generation |
| scripts/crank-scheduler/requirements.txt | Python dependencies for the scheduler |
| scripts/crank-scheduler/*.md | Documentation and configuration guides |
| build/benchmarks_ci*.json | Updated machine configurations with new capability-based structure and machine groups |
| build/benchmarks*.yml | Updated pipeline files generated by the new scheduler |
| build/benchmarks.template.liquid | Updated template comments to reflect new generation process |
…e unused requirements and code, and added some new entries to .gitignore.
c79428e to a9d28b0
…. Also improved the scheduler to better handle role-priority based profile selection.
```python
epilog="""
Examples:
  # Generate schedule from JSON files
  python main.py --config config.json --format table
```
--format is used in the examples, but we don't seem to have a --config argument anywhere.
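For reference, a minimal sketch of how both flags could be registered with argparse so the epilog examples stay accurate; the flag names beyond `--config`/`--format`, the choices, and the defaults are assumptions for illustration, not the tool's actual CLI:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Register both flags referenced by the epilog examples.
    parser = argparse.ArgumentParser(
        prog="main.py",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="Examples:\n"
               "  # Generate schedule from JSON files\n"
               "  python main.py --config config.json --format table",
    )
    parser.add_argument("--config", help="Path to the combined JSON configuration file (assumed flag)")
    parser.add_argument("--format", choices=["table", "json"], default="table",
                        help="Output format for the generated schedule (choices are illustrative)")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.config, args.format)
```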
```python
machines_by_type = {}
for machine in machines:
    # Get primary machine type (lowest priority capability)
```
nit: I assume by lowest here we mean lowest number, which would actually be higher priority.
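To illustrate the intended reading, here is a self-contained sketch of picking the primary type by the lowest priority *number*; the `Machine`/`Capability` shapes and the example machine name are assumptions, not the PR's actual models:

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Capability:
    priority: int                           # 1 = preferred, 2 = secondary, 3 = fallback
    profiles: List[str] = field(default_factory=list)

@dataclass
class Machine:
    name: str
    capabilities: Dict[str, Capability]     # "sut" / "load" / "db" -> Capability

def primary_machine_type(machine: Machine) -> str:
    # "Lowest priority" means the lowest priority number, i.e. the
    # highest-priority capability: 1 beats 2 beats 3.
    return min(machine.capabilities, key=lambda t: machine.capabilities[t].priority)

machines = [
    Machine("example-machine", {"sut": Capability(1), "load": Capability(2)}),
]
machines_by_type: Dict[str, List[Machine]] = {}
for machine in machines:
    machines_by_type.setdefault(primary_machine_type(machine), []).append(machine)
print(list(machines_by_type))  # ['sut']
```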
Pull request overview
Copilot reviewed 23 out of 24 changed files in this pull request and generated 7 comments.
```json
{
  "name": "performance-test-scenario",
  "scenario_type": 2,
  "estimated_runtime": 45.0,
  "target_machines": ["machine-1", "machine-2"]
}
```

#### Scenario Properties

- **name**: Scenario identifier
- **scenario_type**: Number of machines required (1=SUT only, 2=SUT+Load, 3=SUT+Load+DB)
This example and the scenario property list document a scenario_type field, but DataLoader.load_combined_configuration currently reads the scenario type from the type key and example_complete_features.json also uses "type". As written, a config that follows this README and uses scenario_type will cause a ScenarioType lookup failure; please either adjust the loader to accept scenario_type or update the docs/examples to use the actual key (type) so JSON configs are valid.
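A minimal sketch of a loader tweak that tolerates both keys; the `ScenarioType` member names and the `read_scenario_type` helper are assumptions for illustration, only the numeric values and key names come from the review discussion:

```python
from enum import Enum

class ScenarioType(Enum):
    # Member names are illustrative; only the numeric values come from the docs.
    SINGLE = 1
    DUAL = 2
    TRIPLE = 3

def read_scenario_type(raw: dict) -> ScenarioType:
    # Accept the documented "scenario_type" key as well as the "type" key
    # the loader currently reads, so configs written against the README load.
    value = raw.get("scenario_type", raw.get("type"))
    if value is None:
        raise ValueError(f"Scenario {raw.get('name')!r} has no 'scenario_type'/'type' key")
    return ScenarioType(value)

print(read_scenario_type({"name": "demo", "scenario_type": 2}))  # ScenarioType.DUAL
print(read_scenario_type({"name": "demo", "type": 2}))           # ScenarioType.DUAL
```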
```json
{
  "name": "Simple Single Machine Test",
  "template": "simple-single.yml",
  "scenario_type": 1,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "estimated_runtime": 10.0,
  "description": "Basic single machine scenario with default profiles"
}
```

**Result:** Uses default profiles for all machines

### 2. Custom Profile Selection

```json
{
  "name": "Triple Machine Test with Custom Profiles",
  "template": "triple-custom.yml",
  "scenario_type": 3,
  "target_machines": ["multi-type-machine"],
  "estimated_runtime": 45.0,
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-high-cpu",
      "load": "multi-load-high-throughput",
      "db": "multi-db-memory-optimized"
    }
  }
}
```

**Result:** Uses specific custom profiles for each machine type

### 3. Mixed Profile Usage

```json
{
  "name": "Mixed Profile Scenario",
  "template": "mixed-profiles.yml",
  "scenario_type": 2,
  "target_machines": ["single-type-machine", "multi-type-machine"],
  "profile_overrides": {
    "multi-type-machine": {
      "sut": "multi-sut-low-memory"
    }
  }
}
```

**Result:**

- `single-type-machine`: Uses default profile
- `multi-type-machine` SUT: Uses custom profile
- `multi-type-machine` LOAD: Uses default profile

## Configuration Properties Explained

### Machine Properties

| Property | Required | Description |
| -------------------- | -------- | ---------------------------------------------- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |

### Capability Properties

| Property | Required | Description |
| ----------------- | -------- | ------------------------------------------------------------------- |
| `machine_type` | ✅ | Key: "sut", "load", or "db" |
| `priority` | ✅ | 1=preferred, 2=secondary, 3=fallback |
| `profiles` | ✅ | List of available profile names |
| `default_profile` | ❌ | Which profile to use by default (defaults to first profile in list) |

### Scenario Properties

| Property | Required | Description |
| ------------------- | -------- | ---------------------------------- |
| `name` | ✅ | Scenario identifier |
| `template` | ✅ | YAML template file |
| `scenario_type` | ✅ | 1=single, 2=dual, 3=triple machine |
| `target_machines` | ✅ | List of machines to run on |
| `estimated_runtime` | ❌ | Runtime in minutes |
| `description` | ❌ | Human-readable description |
| `profile_overrides` | ❌ | Custom profile overrides |
These scenario examples and the "Scenario Properties" table document a scenario_type field, but the scheduler code reads the type from a type key in the JSON (and example_complete_features.json uses "type"). Using scenario_type as shown here will break loading; please align the docs with the implementation (or update the loader to accept scenario_type) so configuration authors can rely on the documented shape.
| Property | Required | Description |
| -------------------- | -------- | ---------------------------------------------- |
| `name` | ✅ | Unique machine identifier |
| `capabilities` | ✅ | Dict of machine types this machine can fulfill |
| `preferred_partners` | ❌ | List of preferred machines for other roles |
The machine configuration docs list name, capabilities, and preferred_partners, but do not mention the new machine_group field that is now used by the scheduler for group-based compatibility (see Machine.machine_group in models.py and the updated build/benchmarks_ci*.json files). To make the new grouping behavior discoverable and configurable, please extend this table (and the surrounding text) to describe the machine_group field and how it interacts with enforce_machine_groups in metadata.
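For illustration, a hedged sketch of what a grouped machine entry plus the metadata switch might look like; the key names `machine_group` and `enforce_machine_groups` follow this comment, but the surrounding structure and all concrete values are assumptions rather than the actual `benchmarks_ci*.json` schema:

```python
import json

# Parse an illustrative config fragment showing where the group fields could live.
example = json.loads("""
{
  "metadata": { "enforce_machine_groups": true },
  "machines": [
    {
      "name": "multi-type-machine",
      "machine_group": "perf-tier-1",
      "capabilities": {
        "sut":  { "priority": 1, "profiles": ["default-sut"] },
        "load": { "priority": 2, "profiles": ["default-load"] }
      }
    }
  ]
}
""")
print(example["machines"][0]["machine_group"])  # perf-tier-1
```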
```
# - Update this file with the result of the template generation
# - The file benchmarks*.json defines how each pipeline set of jobs is run in parallel
# - Update the associated benchmarks*.json file with machine and scenario updates
# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
```
The instructions here reference benchmarks/scripts/crank-scheduler/requirements.txt, but in this repo the requirements file lives at scripts/crank-scheduler/requirements.txt (and the example command below already uses ./scripts/crank-scheduler/main.py). To prevent confusion when following these steps, consider updating this path (and any similarly generated headers in the CI YAML files) to match the actual directory layout.
```diff
-# - Install python and install the requirements for the crank-scheduler in benchmarks/scripts/crank-scheduler/requirements.txt
+# - Install python and install the requirements for the crank-scheduler in scripts/crank-scheduler/requirements.txt
```
```python
def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
    """
    Unified flow for YAML generation (single or multi)

    Returns:
        bool: True if YAML files were generated, False otherwise
```
The docstring for process_yaml_generation states that the function returns a bool, but the implementation actually returns a list of dictionaries describing the generated YAML files. Please update the docstring (and/or add a return type annotation) to reflect the real return type so callers know what to expect.
```diff
-def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> list:
-    """
-    Unified flow for YAML generation (single or multi)
-    Returns:
-        bool: True if YAML files were generated, False otherwise
+def process_yaml_generation(args, partial_schedules: List[PartialSchedule], config: CombinedConfiguration) -> List[dict]:
+    """
+    Unified flow for YAML generation (single or multi)
+    Returns:
+        List[dict]: List of metadata dictionaries for each generated YAML file
```
```python
schedule_times = ScheduleOperations.generate_schedule_times(
    config, len(partial_schedules))
```
The CLI overrides for --target-yamls and --schedule-offset are applied here after partial_schedules have already been computed in main.py, and generate_schedule_times uses the overridden target_yaml_count instead of the actual len(partial_schedules). This can lead to mismatches (e.g., some partial schedules never get a YAML file when target_yamls is reduced, or extra offset times are generated and then dropped when target_yamls is increased). To avoid silently skipping work, apply the overrides before splitting the schedule (or re-split after updating yaml_generation) so that both partial_schedules and schedule_times are derived from the same effective target_yaml_count.
```diff
-schedule_times = ScheduleOperations.generate_schedule_times(
-    config, len(partial_schedules))
+# Ensure the YAML generation config's target count matches the actual
+# number of partial schedules so that we don't silently drop or omit work.
+effective_count = len(partial_schedules)
+if config.metadata.yaml_generation is not None:
+    config.metadata.yaml_generation.target_yaml_count = effective_count
+schedule_times = ScheduleOperations.generate_schedule_times(
+    config, effective_count)
```
```python
try:
    partner_index = preferred_partners.index(machine.name)
    score += 0.01 * (partner_index + 1)  # 0.01, 0.02, 0.03, ...
except ValueError:
```
'except' clause does nothing but pass and there is no explanatory comment.
```diff
-except ValueError:
+except ValueError:
+    # Machine not found in preferred_partners; skip partner bias adjustment.
```
To simplify the work needed when updating the scenarios we run, and to minimize the chance of error, this adds a Python script that generates a CI schedule from a single configuration file. Most of the recently added and updated pipelines already used this new flow, but this update also adds a machine_group option to ensure machines only use other machines at similar perf levels for the load and db roles.
Changes include adding the crank-scheduler, updating benchmarks.template.liquid to include the new steps to run, re-running the configurations through the scheduler with the updated template, and adding the machine_group configuration option where applicable.
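As a rough illustration of the group constraint described above (not the scheduler's actual code), partner selection under `machine_group` could reduce to a filter like the one below; the `Machine` record, the `eligible_partners` helper, and the machine names are assumptions for the sketch:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Machine:
    name: str
    machine_group: Optional[str] = None   # e.g. a performance-tier label

def eligible_partners(sut: Machine, candidates: List[Machine],
                      enforce_machine_groups: bool = True) -> List[Machine]:
    """Keep only load/db candidates from the same machine_group as the SUT."""
    if not enforce_machine_groups or sut.machine_group is None:
        return list(candidates)
    return [m for m in candidates if m.machine_group == sut.machine_group]

sut = Machine("app-machine", machine_group="tier-1")
candidates = [Machine("load-fast", "tier-1"), Machine("load-slow", "tier-2")]
print([m.name for m in eligible_partners(sut, candidates)])  # ['load-fast']
```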