QVAC-19368 infra: rebalance Android Device Farm shards + faster mobile CI for LLM #2466
Open
tobi-legan wants to merge 14 commits into
…attern

Android groupB (11 tests, 49 min) and groupImagesPerf (3 VLM tests, 69 min) were serialising heavy tests on a single device, hitting the 2h job timeout on Pixel. Mirror the iOS strategy: isolate each heavy test into its own group (heavy1–heavy10) and bundle fast tests into lightA/lightB (12 groups total). Longest single shard drops from ~69 min to ~23 min; the pool recycles devices across groups dynamically.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ENCY

The 12-group mirror of iOS overwhelmed the Device Farm account concurrency limit (24 total runs: 12 iOS + 12 Android). Groups queued up to 12.5 min on Android and 28 min on iOS waiting for a slot, making the monitor step slower than the original 3-group layout.

Revised to 6 Android groups (18 total with iOS):
- heavyA/heavyB: split the old groupB heavy tests into 2 balanced shards
- imagePerfA/imagePerfB: split VLM tests 2+1 to avoid the 69-min single-group bottleneck
- lightA/lightB: fast tests bundled

Expected critical path: ~40-50 min (vs 69 min old, 87 min with 12 groups).

Co-authored-by: Cursor <cursoragent@cursor.com>
With 6 Android groups × 3 devices each = 18 device-jobs, the serial log download took 52 min (each device-job ~3-7 min of API calls + artifact downloads).

Process each run's logs in parallel (up to 4 concurrent), so the total is bounded by the slowest single run (~18 min) rather than the sum of all runs.

Combined with the 6-group monitor improvement (57 min vs old 69 min), the estimated total Android job time drops to ~86 min, well within the 120 min timeout.

Co-authored-by: Cursor <cursoragent@cursor.com>
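The bounded-parallel download described above could be sketched roughly as follows. This is only an illustration in Python, not the workflow's actual script: `collect_logs`, `fake_download`, and the run names are hypothetical, and the real per-run step would make Device Farm API calls and artifact downloads instead of sleeping.

```python
from concurrent.futures import ThreadPoolExecutor
import time

MAX_PARALLEL = 4  # concurrency cap stated in the commit message


def collect_logs(run_ids, download_run_logs, max_parallel=MAX_PARALLEL):
    """Download logs for every run, at most `max_parallel` at a time.

    `download_run_logs` is a hypothetical callable that handles one run.
    Returns a dict mapping run id -> whatever the callable returned.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map preserves input order, so zip pairs ids with results.
        results = pool.map(download_run_logs, run_ids)
        return dict(zip(run_ids, results))


def fake_download(run_id):
    """Stand-in for the I/O-bound API calls and artifact downloads."""
    time.sleep(0.05)
    return f"{run_id}.log"


runs = ["run-1", "run-2", "run-3", "run-4", "run-5", "run-6"]
logs = collect_logs(runs, fake_download)
```

Threads are a reasonable fit here because the work is I/O wait rather than CPU, so the interpreter lock is not the limiting factor.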
Rename test groups to be self-documenting:
- iOS: heavy1..heavy10 → finetuning, toolCalling, reasoning, etc.
- Android: heavyA → heavyA-finetune-reason-ocr, imagePerfB → imagePerf-fruitPlate, etc.

Add test-specs passthrough to the monitor step so it can print:
- A "Run → tests" legend at the start (which tests are in each run)
- Test names in the final results section next to each run link

Now when a run fails you can immediately see which test(s) it contained without cross-referencing test-groups.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
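A minimal sketch of how such a "Run → tests" legend might be rendered, assuming the test-specs arrive as a mapping from run (group) name to a list of test names. The function name, the input shape, and the test names are assumptions for illustration; only the group names come from the commit message.

```python
def format_legend(run_to_tests):
    """Render a 'Run → tests' legend, one line per run."""
    lines = ["Run → tests"]
    for run_name, tests in run_to_tests.items():
        lines.append(f"  {run_name}: {', '.join(tests)}")
    return "\n".join(lines)


# Hypothetical test-specs; the test names are illustrative only.
legend = format_legend({
    "heavyA-finetune-reason-ocr": ["finetuning", "reasoning", "ocr"],
    "imagePerf-fruitPlate": ["fruit-plate"],
})
print(legend)
```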
Pass test-specs from upload-to-devicefarm through to the monitor step in all 12 addon integration workflows. Gives every addon the run-to-tests legend and test names in final results — not just LLM. Co-authored-by: Cursor <cursoragent@cursor.com>
Two issues from the Android shard split:

1. Image test instability: the fruit-plate test relied on elephant running first in the same group to warm up the VLM model. With split groups each image test cold-starts alone, causing crashes on Android. Extended the iosWarmupImage pre-warmup to all mobile platforms (isMobile) so fruit-plate gets the elephant pre-warmup on Android too.

2. Heavy group imbalance: heavyA (4 tests, ~44 min) and heavyB (3 tests, ~45 min) were both too slow. Split into 3 balanced groups of 2-3 tests each:
   - heavyA-finetune-reasoning (2 tests)
   - heavyB-toolCall-gemma (2 tests)
   - heavyC-ocr-sliding (3 tests)

Android now has 7 groups (19 total with the 12 iOS groups).

Co-authored-by: Cursor <cursoragent@cursor.com>
Two optimizations for Device Farm log collection:

1. Skip 'Setup Test' and 'Teardown Test' suites: they only contain framework bookkeeping (home screen screenshots, install logs), not test output. Saves 2 list-artifacts API calls + downloads per device-job (21 Android device-jobs × 2 = 42 fewer API round-trips).

2. Raise MAX_PARALLEL from 4 to 8 so all runs (up to 7 Android + 12 iOS) download simultaneously instead of in waves. AWS Device Farm API handles this fine; the bottleneck was I/O wait, not CPU.

Target: Android log collection from 25 min → ~12-15 min.

Co-authored-by: Cursor <cursoragent@cursor.com>
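The suite filter in the first optimization might look something like the sketch below. It assumes each suite is a dict with a `name` key and uses made-up suite entries; `suites_to_collect` and `SKIPPED_SUITES` are hypothetical names, not taken from the workflow.

```python
# Suites named in the commit message as framework bookkeeping only.
SKIPPED_SUITES = {"Setup Test", "Teardown Test"}


def suites_to_collect(suites):
    """Keep only suites whose artifacts can contain real test output."""
    return [s for s in suites if s["name"] not in SKIPPED_SUITES]


# Hypothetical suite listing for one device-job.
suites = [
    {"name": "Setup Test"},
    {"name": "Tests Suite"},
    {"name": "Teardown Test"},
]
kept = suites_to_collect(suites)

# Savings arithmetic from the commit message:
# 2 skipped suites per device-job across 21 Android device-jobs.
skipped_per_job = len(suites) - len(kept)
round_trips_saved = 21 * skipped_per_job
```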
…rm-shard-split

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/llm-llamacpp/test/mobile/test-groups.json
The 3-image VLM perf (gemma4 + qwen3-5) made the Android on-PR leg run too long. aurora is the heaviest image, so skip it when QVAC_PERF_RUNS is at the on-PR default (<=1); the benchmark (QVAC_PERF_RUNS>1) still runs all 3. On-PR now covers elephant + fruit-plate, keeping the Android run under ~1h. Co-authored-by: Cursor <cursoragent@cursor.com>
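The gate described above can be illustrated with a short sketch. It is not the test suite's own code: the function names and image list are written for illustration, and it assumes an unset `QVAC_PERF_RUNS` is treated as the on-PR default of 1.

```python
import os

ALL_IMAGES = ["elephant", "fruit-plate", "aurora"]


def is_benchmark_run(env=None):
    """True only when QVAC_PERF_RUNS is greater than 1.

    Assumption: a missing variable counts as the on-PR default (1).
    """
    env = os.environ if env is None else env
    return int(env.get("QVAC_PERF_RUNS", "1")) > 1


def images_for_run(env=None):
    """Return all images for benchmark runs, and skip aurora on-PR."""
    if is_benchmark_run(env):
        return list(ALL_IMAGES)
    return [img for img in ALL_IMAGES if img != "aurora"]


on_pr = images_for_run({})                          # on-PR default
benchmark = images_for_run({"QVAC_PERF_RUNS": "3"})  # benchmark run
```

Passing the environment in as a plain mapping keeps the gate easy to check without mutating the real process environment.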
… link

Consolidate the per-device *_test-results.json into one test-results-summary.json (each test case with status + duration per device, gate-skips surfaced as 'skipped'), ship it inside the console-logs artifact, and write a compact ✅/❌/⏭️ table + an artifact link to the GitHub Step Summary. Makes it easy to see whether each case ran, passed, failed, or was skipped (e.g. the VLM aurora on-PR gate). Mobile only for now; desktop to follow.

Co-authored-by: Cursor <cursoragent@cursor.com>
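One way to picture the consolidation step is the sketch below. The per-device record shape (`name`, `status`, `duration_s`), the device names, and the test names are all assumptions; the real *_test-results.json schema is not shown in this PR.

```python
ICONS = {"passed": "✅", "failed": "❌", "skipped": "⏭️"}


def consolidate(per_device):
    """Merge {device: [case, ...]} into {test name: {device: result}}."""
    summary = {}
    for device, cases in per_device.items():
        for case in cases:
            summary.setdefault(case["name"], {})[device] = {
                "status": case["status"],
                "duration_s": case["duration_s"],
            }
    return summary


def to_markdown(summary, devices):
    """Render the summary as a compact table, one row per test case."""
    header = "| Test | " + " | ".join(devices) + " |"
    separator = "|---" * (len(devices) + 1) + "|"
    rows = []
    for name, by_device in summary.items():
        cells = []
        for device in devices:
            result = by_device.get(device)
            if result is None:
                cells.append("-")  # case did not run on this device
            else:
                icon = ICONS.get(result["status"], "?")
                cells.append(f"{icon} {result['duration_s']}s")
        rows.append("| " + name + " | " + " | ".join(cells) + " |")
    return "\n".join([header, separator, *rows])


# Hypothetical per-device results.
per_device = {
    "Pixel": [
        {"name": "vlm-elephant", "status": "passed", "duration_s": 120},
        {"name": "vlm-aurora", "status": "skipped", "duration_s": 0},
    ],
    "Galaxy": [
        {"name": "vlm-elephant", "status": "failed", "duration_s": 95},
    ],
}
summary = consolidate(per_device)
table = to_markdown(summary, ["Pixel", "Galaxy"])
```

In a GitHub Actions job, a table like this could be appended to the file named by the `GITHUB_STEP_SUMMARY` environment variable.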
…10 -> 6)

The Android job was hitting the 120-min cap because ~15 Device Farm runs queued behind the account concurrency limit (9-20 min wait each), starving the downstream collect/extract steps. Using measured worst-case per-test runtimes, rebalance into 6 groups: toolCalling (~30m) and gemma4 (~29m) run solo (each near the per-test cap), the other functional tests pack into two ~50m shards, and the vlmPerf groups stay dedicated (so the benchmark perf-only filter still isolates them). Fewer runs = less queue contention = shorter monitor wait. All 30 functions stay covered; iOS unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
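The commit does not say how the shards were derived from the measured runtimes. As one possible approach, a greedy longest-processing-time heuristic packs tests into a fixed number of shards by always giving the next-longest test to the currently lightest shard. The sketch below uses made-up test names and runtimes, and `pack_shards` is a hypothetical helper.

```python
import heapq


def pack_shards(runtimes_min, shard_count):
    """Greedy LPT packing: longest test first, onto the lightest shard.

    Returns a list of (tests, total_minutes) tuples, one per shard.
    """
    # Heap entries are (total minutes, shard index, tests); the unique
    # index breaks ties so the test lists are never compared.
    heap = [(0, i, []) for i in range(shard_count)]
    heapq.heapify(heap)
    by_longest = sorted(runtimes_min.items(), key=lambda kv: kv[1], reverse=True)
    for name, minutes in by_longest:
        total, index, tests = heapq.heappop(heap)
        tests.append(name)
        heapq.heappush(heap, (total + minutes, index, tests))
    return [(tests, total) for total, index, tests in sorted(heap, key=lambda s: s[1])]


# Hypothetical worst-case runtimes in minutes.
runtimes = {"a": 20, "b": 15, "c": 15, "d": 10, "e": 10, "f": 5}
shards = pack_shards(runtimes, 2)
critical_path = max(total for _, total in shards)
```

The heuristic only balances runtime; constraints such as keeping the vlmPerf groups dedicated or running near-cap tests solo would need to be applied before packing.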
DmitryMalishev approved these changes on Jun 8, 2026
Contributor
Tier-based Approval Status |
What problem does this PR solve?
How does it solve it?
- … test-specs into the monitor step + add a descriptive group legend so the Device Farm monitor is readable; parallelize per-run log downloads.
- … QVAC_PERF_RUNS default). On-PR now covers elephant + fruit-plate for gemma4 / qwen3-5; the benchmark (QVAC_PERF_RUNS>1) still runs all 3 images. Keeps the Android on-PR leg under ~1h.

How was it tested?
- … test-groups.json resolved to keep the new descriptive groups + the VLM perf groups; mobile group-coverage + "up to date" validation pass.
- … isBenchmarkRun=false at the on-PR default (aurora skipped), true at QVAC_PERF_RUNS=3 (all 3 run).

Made with Cursor