QVAC-19368 infra: rebalance Android Device Farm shards + faster mobile CI for LLM #2466

Open
tobi-legan wants to merge 14 commits into main from tobi/android-devicefarm-shard-split

@tobi-legan

Contributor

What problem does this PR solve?

  • Mobile Device Farm CI was slow and hit the account concurrency limit. Android ran as 2 large groups (long sequential runs), and the 3-image VLM perf suite added enough load that the Android leg took 1-2h, sometimes coming close to the per-run ceiling.
  • The Device Farm monitor output was hard to read (opaque group names, no test legend), and log download was serial.

How does it solve it?

  • Split Android Device Farm shards to match the iOS heavy/light pattern (descriptive group names instead of group A/B), and tune the shard count so runs do not queue in PENDING_CONCURRENCY.
  • Extend the small-image pre-warmup to Android (not just iOS), since each image test now runs in its own group with no cross-test warmup.
  • Wire test-specs into the monitor step + add a descriptive group legend so the Device Farm monitor is readable; parallelize per-run log downloads.
  • Skip Setup/Teardown suite artifacts during log collection and raise the download parallel limit (MAX_PARALLEL) from 4 to 8.
  • VLM on-PR cost: skip the heaviest image (high-res aurora) on normal on-PR runs (keys off QVAC_PERF_RUNS default). On-PR now covers elephant + fruit-plate for gemma4 / qwen3-5; the benchmark (QVAC_PERF_RUNS>1) still runs all 3 images. Keeps the Android on-PR leg under ~1h.

How was it tested?

  • Merged latest main (incl. the VLM perf-regression suite). test-groups.json resolved to keep the new descriptive groups + the VLM perf groups; mobile group-coverage + "up to date" validation pass.
  • Verified the aurora skip gate locally: isBenchmarkRun=false at the on-PR default (aurora skipped), true at QVAC_PERF_RUNS=3 (all 3 run).
  • Will trigger the mobile integration workflow via workflow_dispatch on this branch to confirm the rebalanced shards run green.

Made with Cursor

tobi-legan and others added 9 commits May 25, 2026 12:25
…attern

Android groupB (11 tests, 49 min) and groupImagesPerf (3 VLM tests,
69 min) were serialising heavy tests on a single device — hitting the
2h job timeout on Pixel. Mirror the iOS strategy: isolate each heavy
test into its own group (heavy1–heavy10) and bundle fast tests into
lightA/lightB (12 groups total). Longest single shard drops from
~69 min to ~23 min; pool recycles devices across groups dynamically.

Co-authored-by: Cursor <cursoragent@cursor.com>
…ENCY

The 12-group mirror of iOS overwhelmed the Device Farm account
concurrency limit (24 total runs: 12 iOS + 12 Android). Groups
queued up to 12.5 min on Android and 28 min on iOS waiting for a
slot, making the monitor step slower than the original 3-group layout.

Revised to 6 Android groups (18 total with iOS):
  - heavyA/heavyB: split the old groupB heavy tests into 2 balanced shards
  - imagePerfA/imagePerfB: split VLM tests 2+1 to avoid the 69-min single-group bottleneck
  - lightA/lightB: fast tests bundled

Expected critical path: ~40-50 min (vs 69 min old, 87 min with 12 groups).

Co-authored-by: Cursor <cursoragent@cursor.com>
With 6 Android groups × 3 devices each = 18 device-jobs, the serial
log download took 52 min (each device-job ~3-7 min of API calls +
artifact downloads). Process each run's logs in parallel (up to 4
concurrent), so the total is bounded by the slowest single run (~18 min)
rather than the sum of all runs.

Combined with the 6-group monitor improvement (57 min vs old 69 min),
the estimated total Android job time drops to ~86 min — well within
the 120 min timeout.

Co-authored-by: Cursor <cursoragent@cursor.com>
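A minimal sketch of the bounded-parallel log download described in the commit above, assuming a thread pool is an acceptable stand-in for whatever the monitor script actually uses. `download_run_logs` is a hypothetical placeholder for one run's API calls and artifact downloads; only the limit of 4 comes from the commit message.

```python
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # concurrency limit named in the commit message


def download_run_logs(run_id):
    """Hypothetical stand-in for one run's list-artifacts calls and downloads."""
    time.sleep(0.05)  # simulate I/O wait
    return f"{run_id}: logs collected"


def download_all(run_ids, max_parallel=MAX_PARALLEL, fetch=download_run_logs):
    """Fetch logs for every run, with at most max_parallel runs in flight."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # pool.map returns results in the same order as run_ids
        return list(pool.map(fetch, run_ids))


if __name__ == "__main__":
    for line in download_all([f"run-{i}" for i in range(1, 7)]):
        print(line)
```

With this shape, wall-clock time tends toward the slowest run (plus any extra waves when there are more runs than workers) rather than the sum of all runs.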
Rename test groups to be self-documenting:
  - iOS: heavy1..heavy10 → finetuning, toolCalling, reasoning, etc.
  - Android: heavyA → heavyA-finetune-reason-ocr, imagePerfB → imagePerf-fruitPlate, etc.

Add test-specs passthrough to the monitor step so it can print:
  - A "Run → tests" legend at the start (which tests are in each run)
  - Test names in the final results section next to each run link

Now when a run fails you can immediately see which test(s) it contained
without cross-referencing test-groups.json.

Co-authored-by: Cursor <cursoragent@cursor.com>
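For illustration, the "Run → tests" legend from the commit above could be rendered roughly as follows. This is a sketch only: the shape of the test-specs data (a mapping from group name to test names) and the test file names are assumptions, not the real format passed between workflow steps.

```python
def format_legend(test_specs):
    """Render a 'Run → tests' legend from a {group name: [test names]} mapping."""
    lines = ["Run → tests"]
    for group, tests in test_specs.items():
        lines.append(f"  {group}: {', '.join(tests)}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Group names are taken from the commit message; test names are hypothetical.
    specs = {
        "heavyA-finetune-reason-ocr": ["finetuning.test.js", "reasoning.test.js"],
        "imagePerf-fruitPlate": ["vlm-fruit-plate.test.js"],
    }
    print(format_legend(specs))
```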
Pass test-specs from upload-to-devicefarm through to the monitor step
in all 12 addon integration workflows. Gives every addon the run-to-tests
legend and test names in final results — not just LLM.

Co-authored-by: Cursor <cursoragent@cursor.com>
Two issues from the Android shard split:

1. Image test instability: the fruit-plate test relied on elephant
   running first in the same group to warm up the VLM model. With
   split groups each image test cold-starts alone, causing crashes
   on Android. Extended the iosWarmupImage pre-warmup to all mobile
   platforms (isMobile) so fruit-plate gets the elephant pre-warmup
   on Android too.

2. Heavy group imbalance: heavyA (4 tests, ~44 min) and heavyB
   (3 tests, ~45 min) were both too slow. Split into 3 balanced
   groups of 2-3 tests each:
   - heavyA-finetune-reasoning (2 tests)
   - heavyB-toolCall-gemma (2 tests)
   - heavyC-ocr-sliding (3 tests)

Android now has 7 groups (19 total with iOS 12).

Co-authored-by: Cursor <cursoragent@cursor.com>
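The warmup change in item 1 of the commit above can be sketched as below. The function and parameter names are hypothetical Python analogues of the `isMobile` / `iosWarmupImage` logic, and skipping the warmup when the measured image is itself the warmup image is an assumption.

```python
def plan_inferences(image, platform, warmup_image="elephant"):
    """Return the ordered inference steps for one image test.

    On mobile platforms a cold-started test first runs the small warmup
    image, so the measured image never hits a cold model.
    """
    is_mobile = platform in ("ios", "android")  # previously iOS only
    steps = []
    if is_mobile and image != warmup_image:
        steps.append(("warmup", warmup_image))
    steps.append(("measure", image))
    return steps


if __name__ == "__main__":
    print(plan_inferences("fruit-plate", "android"))
```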
Two optimizations for Device Farm log collection:

1. Skip 'Setup Test' and 'Teardown Test' suites — they only contain
   framework bookkeeping (home screen screenshots, install logs), not
   test output. Saves 2 list-artifacts API calls + downloads per
   device-job (21 Android device-jobs × 2 = 42 fewer API round-trips).

2. Raise MAX_PARALLEL from 4 to 8 so all runs (up to 7 Android + 12
   iOS) download simultaneously instead of in waves. AWS Device Farm
   API handles this fine — the bottleneck was I/O wait, not CPU.

Target: Android log collection from 25 min → ~12-15 min.
Co-authored-by: Cursor <cursoragent@cursor.com>
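A small sketch of the suite filter and the savings arithmetic from the commit above. The suite names 'Setup Test' and 'Teardown Test' come from the commit message; the helper names and the 'Tests Suite' example are hypothetical.

```python
SKIPPED_SUITES = {"Setup Test", "Teardown Test"}  # framework bookkeeping only


def suites_to_download(suite_names):
    """Keep only suites whose artifacts contain real test output."""
    return [name for name in suite_names if name not in SKIPPED_SUITES]


def saved_round_trips(groups, devices_per_group):
    """One list-artifacts call is saved per skipped suite per device-job."""
    return groups * devices_per_group * len(SKIPPED_SUITES)


if __name__ == "__main__":
    print(suites_to_download(["Setup Test", "Tests Suite", "Teardown Test"]))
    print(saved_round_trips(groups=7, devices_per_group=3))  # 21 device-jobs x 2
```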
…rm-shard-split

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	packages/llm-llamacpp/test/mobile/test-groups.json
The 3-image VLM perf (gemma4 + qwen3-5) made the Android on-PR leg run too
long. aurora is the heaviest image, so skip it when QVAC_PERF_RUNS is at
the on-PR default (<=1); the benchmark (QVAC_PERF_RUNS>1) still runs all 3.
On-PR now covers elephant + fruit-plate, keeping the Android run under ~1h.

Co-authored-by: Cursor <cursoragent@cursor.com>
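The aurora gate described in the commit above amounts to a single threshold on `QVAC_PERF_RUNS`. A minimal sketch, assuming an unset or non-numeric value is treated as the on-PR default of 1 (the real gate lives in the test code and may parse the variable differently):

```python
import os

IMAGES = ["elephant", "fruit-plate", "aurora"]


def is_benchmark_run(env=None):
    """True only when QVAC_PERF_RUNS is greater than 1."""
    env = os.environ if env is None else env
    try:
        runs = int(env.get("QVAC_PERF_RUNS", "1"))
    except ValueError:
        runs = 1  # assumption: fall back to the on-PR default
    return runs > 1


def images_to_run(env=None):
    """On-PR runs skip the heaviest image; benchmark runs keep all three."""
    if is_benchmark_run(env):
        return list(IMAGES)
    return [image for image in IMAGES if image != "aurora"]


if __name__ == "__main__":
    print(images_to_run({}))
    print(images_to_run({"QVAC_PERF_RUNS": "3"}))
```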
@tobi-legan tobi-legan requested review from a team as code owners June 5, 2026 12:49
tobi-legan and others added 4 commits June 6, 2026 00:44
… link

Consolidate the per-device *_test-results.json into one
test-results-summary.json (each test case with status + duration per
device, gate-skips surfaced as 'skipped'), ship it inside the console-logs
artifact, and write a compact ✅/❌/⏭️ table + an artifact link to the
GitHub Step Summary. Makes it easy to see whether each case ran, passed,
failed, or was skipped (e.g. the VLM aurora on-PR gate). Mobile only for
now; desktop to follow.

Co-authored-by: Cursor <cursoragent@cursor.com>
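To make the consolidation in the commit above concrete, here is a rough sketch that merges per-device results into one summary and renders the ✅/❌/⏭️ table. The record fields (`name`, `status`, `durationMs`) and the device names are assumptions; the actual `*_test-results.json` schema is not shown in this PR.

```python
ICONS = {"passed": "✅", "failed": "❌", "skipped": "⏭️"}


def consolidate(per_device):
    """Merge {device: [case, ...]} into {test name: {device: result}}."""
    summary = {}
    for device, cases in per_device.items():
        for case in cases:
            summary.setdefault(case["name"], {})[device] = {
                "status": case["status"],
                "durationMs": case.get("durationMs"),
            }
    return summary


def to_markdown(summary, devices):
    """Render one table row per test case and one column per device."""
    header = "| Test | " + " | ".join(devices) + " |"
    separator = "|---" * (len(devices) + 1) + "|"
    rows = []
    for name, by_device in summary.items():
        cells = [ICONS.get(by_device.get(d, {}).get("status"), "n/a") for d in devices]
        rows.append(f"| {name} | " + " | ".join(cells) + " |")
    return "\n".join([header, separator, *rows])


if __name__ == "__main__":
    results = {
        "Pixel": [{"name": "vlm aurora", "status": "skipped"}],
        "Galaxy": [{"name": "vlm aurora", "status": "passed", "durationMs": 1200}],
    }
    print(to_markdown(consolidate(results), ["Pixel", "Galaxy"]))
```

The resulting Markdown string is what a step would append to the file named by `GITHUB_STEP_SUMMARY`.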
…10 -> 6)

The Android job was hitting the 120-min cap because ~15 Device Farm runs
queued behind the account concurrency limit (9-20min wait each), starving
the downstream collect/extract steps. Using measured worst-case per-test
runtimes, rebalance into 6 groups: toolCalling (~30m) and gemma4 (~29m)
run solo (each near the per-test cap), the other functional tests pack into
two ~50m shards, and the vlmPerf groups stay dedicated (so the benchmark
perf-only filter still isolates them). Fewer runs = less queue contention =
shorter monitor wait. All 30 functions stay covered; iOS unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>
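The commit above does not say how the groups were derived from the measured runtimes. One common way to do this kind of packing is the greedy longest-processing-time heuristic, sketched here as an illustration rather than as the method actually used. Only the toolCalling (~30m) and gemma4 (~29m) figures come from the commit; the other names and durations are hypothetical.

```python
import heapq


def balance_shards(durations, shard_count):
    """Assign tests to shards: longest test first, always to the lightest shard."""
    heap = [(0.0, index) for index in range(shard_count)]  # (load, shard index)
    shards = [[] for _ in range(shard_count)]
    ordered = sorted(durations.items(), key=lambda item: (-item[1], item[0]))
    for name, minutes in ordered:
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        heapq.heappush(heap, (load + minutes, index))
    loads = [sum(durations[name] for name in shard) for shard in shards]
    return shards, loads


if __name__ == "__main__":
    measured = {
        "toolCalling": 30,  # from the commit message
        "gemma4": 29,       # from the commit message
        "testC": 20,        # hypothetical
        "testD": 15,        # hypothetical
        "testE": 10,        # hypothetical
        "testF": 5,         # hypothetical
    }
    shards, loads = balance_shards(measured, shard_count=3)
    for shard, load in zip(shards, loads):
        print(f"{load:>3} min: {', '.join(shard)}")
```

A pure load balancer like this ignores constraints the commit cares about, such as keeping the vlmPerf groups dedicated so the benchmark filter can isolate them; those would need to be pinned before packing the rest.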
@github-actions

github-actions Bot commented Jun 8, 2026

Contributor

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ❌ PENDING

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ❌ (0/1)



---
*This comment is automatically updated when reviews change.*

2 participants