QVAC-19368 infra: rebalance Android Device Farm shards + faster mobile CI for LLM by tobi-legan · Pull Request #2466 · tetherto/qvac

tobi-legan · 2026-06-05T12:49:48Z

What problem does this PR solve?

Android Device Farm mobile CI was slow (hitting the 2h GitHub job cap), expensive (downloading hundreds of MB of unused artifacts), and opaque (no visibility into which individual test() cases ran/passed/failed/skipped).
The VLM perf tests (Qwen3.5-VL + Gemma4-VL) added significant load — running all 3 images on every PR pushed Android over the time budget on slower devices (Pixel 9 Pro / Mali).

How does it solve it?

Android shard rebalance (data-driven)

Consolidated Android from 10 → 6 groups based on measured per-test worst-case runtimes: the two 30-min monsters (toolCalling, gemma4) run solo, the rest pack into two ~50-min functional shards, and the VLM perf groups stay dedicated for clean benchmark isolation.
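The packing idea above can be sketched as a small greedy routine. This is only an illustration, not the PR's actual tooling: the function name, the solo threshold, and every runtime except the two 30-minute tests are hypothetical.

```python
def pack_shards(runtimes_min, solo_threshold_min, shard_count):
    """Tests at or above the threshold run solo; the rest are placed
    longest-first into whichever shared shard is currently lightest."""
    solo = sorted(n for n, t in runtimes_min.items() if t >= solo_threshold_min)
    rest = sorted(
        ((t, n) for n, t in runtimes_min.items() if t < solo_threshold_min),
        reverse=True,
    )
    shards = [{"tests": [], "total": 0} for _ in range(shard_count)]
    for minutes, name in rest:
        lightest = min(shards, key=lambda s: s["total"])
        lightest["tests"].append(name)
        lightest["total"] += minutes
    return solo, shards


# Hypothetical worst-case runtimes in minutes (only the two 30s come from the PR text).
runtimes = {
    "toolCalling": 30, "gemma4": 30,
    "testA": 20, "testB": 15, "testC": 15,
    "testD": 10, "testE": 10, "testF": 10,
}
solo, shards = pack_shards(runtimes, solo_threshold_min=30, shard_count=2)
```

Longest-first placement is a common heuristic for this kind of balancing; it does not guarantee an optimal split, but it keeps the slowest shard close to the average when there are many short tests.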

VLM perf: skip aurora on normal PRs

The high-res aurora image is the heaviest, so it is skipped on normal on-PR runs (QVAC_PERF_RUNS <= 1) to keep Android under ~1h. The benchmark (QVAC_PERF_RUNS > 1) still runs all 3 images. On-PR runs cover elephant + fruit-plate.

Slimmed Device Farm artifact downloads

Only download Customer Artifacts (bare_console, test-results, logcat_full, perf data) + Logcat (C++ logs). Skip TCP dump (up to 624MB per device!), screenshots, XML, videos, install logs. Removed the full devicefarm-logs artifact upload entirely. Cuts the "Collect and upload" step from ~20min to a few minutes.
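A name-based allowlist like the one described could look roughly like this. It is a sketch under assumptions: the helper name and the normalization are hypothetical, and the rejected artifact names in the usage lines are illustrative rather than taken from Device Farm output.

```python
# Artifact names to keep, compared after normalization.
KEEP_PREFIXES = ("customer artifacts", "logcat")


def should_download(artifact_name):
    """Return True only for artifacts on the allowlist.

    Underscores and repeated whitespace are normalized so that
    "Customer Artifacts" and "Customer_Artifacts" are treated alike.
    """
    normalized = " ".join(artifact_name.replace("_", " ").lower().split())
    return normalized.startswith(KEEP_PREFIXES)


kept = [n for n in ["Customer Artifacts", "Logcat", "TCP dump", "Video"]
        if should_download(n)]
```

An allowlist rather than a blocklist means that any new large artifact type is skipped by default instead of silently inflating the download step.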

Test-list visibility for all addons

Enumerated the addon's integration.auto.cjs runners for non-grouped addons (NMT, TTS, etc.) so the monitor's "Run → tests" legend shows what runs, not just an empty "default →".

Per-test() case results (TAP parsing)

Parse brittle's TAP output (ok N - name # time = Xms) from logcat_full.txt (Android) and bare_console.log (iOS) to surface every individual test() case with status + timing per device.
Produces test-case-details.json in the console-logs artifact + a collapsible summary table in the GitHub Step Summary.
Normalizes dynamic values in test names (CacheTokens (53) → (N)) so the same logical test merges across devices.
Strips ReactNativeJS trailing-quote duplicates (Android logcat echoes TAP lines twice).
Escapes markdown-special characters (|, <, >) so test descriptions don't break the summary table.
Verified: OCR Android (669) = OCR iOS (669) after all fixes.
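As a rough illustration of the parsing step, the sketch below extracts status and timing from a single log line. The regular expression is an assumption based on the `ok N - name # time = Xms` shape quoted above plus the standard TAP `# SKIP` directive. It is not the PR's parser, and brittle's exact output may differ.

```python
import re

# Matches "ok N - name", "not ok N - name", an optional "# SKIP" directive,
# and an optional "# time = Xms" suffix anywhere after a log prefix.
TAP_RESULT = re.compile(
    r"\b(?P<failed>not )?ok (?P<num>\d+) - (?P<name>.*?)"
    r"(?: # (?P<skip>SKIP)\b.*?)?"
    r"(?: # time = (?P<ms>\d+(?:\.\d+)?)ms)?\s*$"
)


def parse_tap_line(line):
    """Return a dict for a TAP result line, or None for any other line."""
    # Drop the trailing quote that a quoted echo of the same line would carry.
    cleaned = line.rstrip().rstrip("'")
    match = TAP_RESULT.search(cleaned)
    if match is None:
        return None
    if match["failed"]:
        status = "failed"
    elif match["skip"]:
        status = "skipped"
    else:
        status = "passed"
    ms = float(match["ms"]) if match["ms"] else None
    return {"num": int(match["num"]), "name": match["name"],
            "status": status, "ms": ms}
```

Using `search` rather than `match` lets the same function handle logcat lines, which carry a timestamp and tag prefix, and plain console lines.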

Other

Extended VLM pre-warmup to Android (not just iOS).
Wired test-specs to the monitor step for all addons.
Descriptive group names + test legend in Device Farm monitor.
Updated shard comment in the mobile workflow.

How was it tested?

LLM, OCR, Diffusion, NMT all triggered via workflow_dispatch on this branch.
Verified test-case-details.json across platforms: OCR Android=669, iOS=669 (exact match after fixes). Diffusion iOS: 144 per device.
Confirmed download filter only keeps Customer Artifacts + Logcat (TCP dump 624MB, screenshots, XML all skipped).
Confirmed full devicefarm-logs upload removed.
Aurora skip verified: 0ms duration on normal PR (skipped), runs normally on benchmark.

…attern Android groupB (11 tests, 49 min) and groupImagesPerf (3 VLM tests, 69 min) were serialising heavy tests on a single device — hitting the 2h job timeout on Pixel. Mirror the iOS strategy: isolate each heavy test into its own group (heavy1–heavy10) and bundle fast tests into lightA/lightB (12 groups total). Longest single shard drops from ~69 min to ~23 min; pool recycles devices across groups dynamically. Co-authored-by: Cursor <cursoragent@cursor.com>

…ENCY The 12-group mirror of iOS overwhelmed the Device Farm account concurrency limit (24 total runs: 12 iOS + 12 Android). Groups queued up to 12.5 min on Android and 28 min on iOS waiting for a slot, making the monitor step slower than the original 3-group layout.
Revised to 6 Android groups (18 total with iOS):
- heavyA/heavyB: split the old groupB heavy tests into 2 balanced shards
- imagePerfA/imagePerfB: split VLM tests 2+1 to avoid the 69-min single-group bottleneck
- lightA/lightB: fast tests bundled
Expected critical path: ~40-50 min (vs 69 min old, 87 min with 12 groups).
Co-authored-by: Cursor <cursoragent@cursor.com>

With 6 Android groups × 3 devices each = 18 device-jobs, the serial log download took 52 min (each device-job ~3-7 min of API calls + artifact downloads). Process each run's logs in parallel (up to 4 concurrent), so the total is bounded by the slowest single run (~18 min) rather than the sum of all runs. Combined with the 6-group monitor improvement (57 min vs old 69 min), the estimated total Android job time drops to ~86 min — well within the 120 min timeout. Co-authored-by: Cursor <cursoragent@cursor.com>
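The bounded-parallel pattern this commit describes can be sketched with a thread pool. This is an illustration only: the helper names are hypothetical and say nothing about how the workflow's script is written. Note that the total is bounded by the slowest single run only while the number of runs does not exceed the worker cap; beyond that, runs proceed in waves.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 4  # the cap named in the commit message


def collect_all(run_ids, collect_one, max_parallel=MAX_PARALLEL):
    """Run collect_one(run_id) for every run, at most max_parallel at a time.

    Returns {run_id: result}. An exception raised by any worker propagates
    when its result is read, so a failed download is not silently dropped.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return dict(zip(run_ids, pool.map(collect_one, run_ids)))


# Usage with a stand-in worker; a real one would download and extract logs.
results = collect_all(["run-a", "run-b", "run-c"], lambda run_id: run_id.upper())
```

Threads suit this workload because the time is spent waiting on network I/O rather than on CPU.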

Rename test groups to be self-documenting:
- iOS: heavy1..heavy10 → finetuning, toolCalling, reasoning, etc.
- Android: heavyA → heavyA-finetune-reason-ocr, imagePerfB → imagePerf-fruitPlate, etc.
Add test-specs passthrough to the monitor step so it can print:
- A "Run → tests" legend at the start (which tests are in each run)
- Test names in the final results section next to each run link
Now when a run fails you can immediately see which test(s) it contained without cross-referencing test-groups.json.
Co-authored-by: Cursor <cursoragent@cursor.com>

Pass test-specs from upload-to-devicefarm through to the monitor step in all 12 addon integration workflows. Gives every addon the run-to-tests legend and test names in final results — not just LLM. Co-authored-by: Cursor <cursoragent@cursor.com>

Two issues from the Android shard split:
1. Image test instability: the fruit-plate test relied on elephant running first in the same group to warm up the VLM model. With split groups each image test cold-starts alone, causing crashes on Android. Extended the iosWarmupImage pre-warmup to all mobile platforms (isMobile) so fruit-plate gets the elephant pre-warmup on Android too.
2. Heavy group imbalance: heavyA (4 tests, ~44 min) and heavyB (3 tests, ~45 min) were both too slow. Split into 3 balanced groups of 2-3 tests each:
   - heavyA-finetune-reasoning (2 tests)
   - heavyB-toolCall-gemma (2 tests)
   - heavyC-ocr-sliding (3 tests)
Android now has 7 groups (19 total with iOS 12).
Co-authored-by: Cursor <cursoragent@cursor.com>

Two optimizations for Device Farm log collection:
1. Skip 'Setup Test' and 'Teardown Test' suites. They only contain framework bookkeeping (home screen screenshots, install logs), not test output. Saves 2 list-artifacts API calls + downloads per device-job (21 Android device-jobs × 2 = 42 fewer API round-trips).
2. Raise MAX_PARALLEL from 4 to 8 so all runs (up to 7 Android + 12 iOS) download simultaneously instead of in waves. AWS Device Farm API handles this fine; the bottleneck was I/O wait, not CPU.
Target: Android log collection from 25 min → ~12-15 min.
Co-authored-by: Cursor <cursoragent@cursor.com>

…rm-shard-split Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/llm-llamacpp/test/mobile/test-groups.json

The 3-image VLM perf (gemma4 + qwen3-5) made the Android on-PR leg run too long. aurora is the heaviest image, so skip it when QVAC_PERF_RUNS is at the on-PR default (<=1); the benchmark (QVAC_PERF_RUNS>1) still runs all 3. On-PR now covers elephant + fruit-plate, keeping the Android run under ~1h. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions · 2026-06-08T15:45:15Z

Tier-based Approval Status

**PR Tier:** TIER1

**Current Status:** ✅ APPROVED

**Requirements:**
- 1 Team Member approval ✅ (1/1)
- 1 Team Lead OR Management approval ✅ (1/1)



---
*This comment is automatically updated when reviews change.*

Per team agreement, remove the download + upload of the full Device Farm log tree (screenshots, XML, install logs, videos) — nobody uses it and it adds significant download time to the already-tight Android job. Only Customer_Artifacts.zip (bare_console.log, test-results.json, logcat_full, perf data) and Logcat files (C++ logs) are kept. The extracted console-logs and perf-report artifacts are unchanged. Raw Device Farm artifacts are still accessible via the AWS console links in the monitor output. Co-authored-by: Cursor <cursoragent@cursor.com>

…rm-shard-split Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/llm-llamacpp/test/mobile/test-groups.json

…space in download filter Co-authored-by: Cursor <cursoragent@cursor.com>

The parallelized download_run_logs with export -f was crashing on both iOS and Android (exit code 1 before downloading any artifacts). Revert to main's proven sequential loop structure and add the name filter there instead. The filter still skips TCP dump (624MB), screenshots, XML, videos — only Customer Artifacts + Logcat are downloaded. Job-level artifacts restored too (iOS needs the job-level Customer_Artifacts.zip). Co-authored-by: Cursor <cursoragent@cursor.com>

Parse brittle's TAP output (ok N / not ok N lines) from logcat_full.txt (Android) and bare_console.log (iOS) to surface every individual test() case with its status (passed/failed/skipped) and timing per device. Produces test-case-details.json in the console-logs artifact with both runner-function-level and per-test() detail. The GitHub Step Summary gets a runner table + a collapsible per-test-case table so reviewers can see at a glance whether a newly added test() actually ran. Co-authored-by: Cursor <cursoragent@cursor.com>

…og-type suffix) Co-authored-by: Cursor <cursoragent@cursor.com>

…rm-shard-split

…se summary Two fixes for the test-case-details step summary:
1. Normalize dynamic values in test names so the same logical test merges across devices, e.g. 'CacheTokens (53) > 0' and 'CacheTokens (55) > 0' both become 'CacheTokens (N) > 0'. This was inflating Android's count (1240) vs iOS (633) because each device produced slightly different token counts in assertion names.
2. Escape markdown-special characters (|, <, >, backtick) in test names before writing to the step summary table, so test descriptions containing these characters don't break the table layout.
Co-authored-by: Cursor <cursoragent@cursor.com>
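Both fixes amount to small string transforms, sketched below. The function names and the choice of escape sequences are assumptions for illustration; the commit message does not say which replacements the workflow actually uses.

```python
import re


def normalize_test_name(name):
    """Replace parenthesized numbers with (N) so per-device values merge."""
    return re.sub(r"\(\d+(?:\.\d+)?\)", "(N)", name)


def escape_markdown_cell(text):
    """Escape characters that would break a markdown table cell."""
    replacements = {"|": "\\|", "<": "&lt;", ">": "&gt;", "`": "\\`"}
    return "".join(replacements.get(ch, ch) for ch in text)


merged = {normalize_test_name(n) for n in
          ["CacheTokens (53) > 0", "CacheTokens (55) > 0"]}
```

Normalization should run before deduplication, and escaping only at render time, so the JSON artifact keeps the readable name.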

…ativeJS echo) Android logcat echoes TAP lines twice: once from bare, once from ReactNativeJS wrapped in single quotes ('ok 1 - name'). The trailing quote made every test appear as a duplicate (e.g. 'All models available' vs 'All models available' with trailing quote), inflating Android OCR from 790 to 1461 test cases. Strip trailing quotes before deduplication. Co-authored-by: Cursor <cursoragent@cursor.com>

… details Two fixes for correct test case counts across platforms:
1. Deduplicate TAP results by normalized name (not num+name) so perf iterations that reuse the same test name at different TAP numbers don't inflate the count. NMT was 364/546 → now 182=182.
2. Truncate test names at assertion detail markers ('. Found ', ': "', ', got ') so variable model output embedded in assertion messages doesn't create per-device duplicates. LLM elephant tests with 'Found keywords: elephant' in different phrasing now merge.
Verified across all addons with real data:
- OCR: Android=669, iOS=669 (exact match)
- NMT: Android=182, iOS=182 (exact match)
- LLM: Android=622, iOS=608 (16 A-only = bitnet/Android-only tests, 2 I-only = Metal/iOS-only tests; all genuinely platform-specific)
Co-authored-by: Cursor <cursoragent@cursor.com>
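A minimal sketch of name truncation followed by name-keyed deduplication, using the three markers quoted in the commit message. The function names are hypothetical, and the rule that a failed result replaces an earlier duplicate is an assumption, not something the commit states.

```python
DETAIL_MARKERS = (". Found ", ': "', ", got ")


def truncate_at_detail(name):
    """Cut the name at the earliest assertion-detail marker, if any."""
    cut = len(name)
    for marker in DETAIL_MARKERS:
        idx = name.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return name[:cut]


def dedupe_by_name(results):
    """Keep one result per truncated name, regardless of TAP number."""
    unique = {}
    for result in results:
        key = truncate_at_detail(result["name"])
        # Assumed policy: a failure is never hidden by a passing duplicate.
        if key not in unique or result["status"] == "failed":
            unique[key] = result
    return unique


deduped = dedupe_by_name([
    {"num": 1, "name": "describes image. Found keywords: elephant", "status": "passed"},
    {"num": 7, "name": "describes image. Found keywords: an elephant", "status": "passed"},
    {"num": 2, "name": "translate", "status": "passed"},
    {"num": 9, "name": "translate", "status": "failed"},
])
```

Keying on the name alone trades some precision for stable counts: two genuinely different tests that share a name would also merge.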

DmitryMalishev

Requesting changes on the two critical points below. The CI plumbing in this PR (artifact slimming, shard rebalance, TAP visibility) is directionally good, but these two change what the perf pipeline measures and covers as silent side effects of a time-budget fix.

…ert pre-warmup Two fixes per Dima's review:
1. Aurora skip is now Android-scoped using the explicit QVAC_PERF_ONLY flag (already plumbed to the device via the testspec config) instead of proxying off PERF_RUNS. iOS + desktop always run aurora. The benchmark (QVAC_PERF_ONLY=true) runs all 3 images on all platforms, even with runs=1.
2. Revert the Android pre-warmup extension back to iOS-only. The change was silently altering what Android perf numbers measure (cold first-run vs warm steady-state) and doesn't fix the crash it targeted (the large buffer allocation still happens on the first real-image pass). Restores historical comparability of Android perf data.
Co-authored-by: Cursor <cursoragent@cursor.com>
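The revised skip rule reduces to a two-input decision, sketched here. The function name and the platform strings are hypothetical; how the real test reads QVAC_PERF_ONLY from the testspec config is not shown in this PR text, so the flag is passed in as a plain boolean.

```python
ALL_IMAGES = ("elephant", "fruit-plate", "aurora")


def images_to_run(platform, perf_only):
    """Skip aurora only on Android on-PR runs; everything else runs all 3."""
    if platform == "android" and not perf_only:
        return [image for image in ALL_IMAGES if image != "aurora"]
    return list(ALL_IMAGES)


on_pr_android = images_to_run("android", perf_only=False)
benchmark_android = images_to_run("android", perf_only=True)
on_pr_ios = images_to_run("ios", perf_only=False)
```

Keying the skip on an explicit flag rather than on the run count means a single-run benchmark still measures all three images.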

… func shards cacheStateMachine takes 30m on Pixel (hit the per-test Mocha timeout in funcShardB). Move it to a solo group (like toolCalling and gemma) and rebalance the remaining functional tests into 3 shards (~25-29m each on Pixel worst-case). Total Android groups: 8 (3 solo + 3 func + 2 vlmPerf). Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

tobi-legan · 2026-06-15T11:03:08Z

/review

github-actions · 2026-06-15T11:09:47Z

🧪 C++ Test Coverage Report

Coverage:

📊 Detailed Coverage

Filename                         Regions    Missed Regions     Cover   Functions  Missed Functions  Executed       Lines      Missed Lines     Cover    Branches   Missed Branches     Cover
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
NmtLazyInitializeBackend.cpp          99                20    79.80%          11                 1    90.91%         157                36    77.07%          66                21    68.18%
NmtLazyInitializeBackend.hpp           2                 0   100.00%           1                 0   100.00%           1                 0   100.00%           0                 0         -
TranslationModel.cpp                 296               168    43.24%          28                 8    71.43%         506               213    57.91%         181               122    32.60%
TranslationModel.hpp                   1                 0   100.00%           1                 0   100.00%           1                 0   100.00%           0                 0         -
nmt.cpp                               72                22    69.44%           9                 1    88.89%         137                28    79.56%          44                16    63.64%
nmt.hpp                               51                 4    92.16%          11                 2    81.82%          53                 4    92.45%          28                 0   100.00%
nmt_beam_search.cpp                  116                25    78.45%          10                 3    70.00%         254                32    87.40%          76                19    75.00%
nmt_graph_decoder.cpp                164                78    52.44%          15                 7    53.33%         540               161    70.19%         112                69    38.39%
nmt_graph_encoder.cpp                 54                13    75.93%           3                 0   100.00%         268                33    87.69%          37                16    56.76%
nmt_loader.cpp                       270                67    75.19%          14                 0   100.00%         774                97    87.47%         161                67    58.39%
nmt_state_backend.cpp                253                94    62.85%          21                 0   100.00%         489               128    73.82%         165                87    47.27%
nmt_tokenization.cpp                  88                21    76.14%           8                 0   100.00%         135                36    73.33%          61                26    57.38%
nmt_utils.cpp                        120                89    25.83%           8                 3    62.50%         180               134    25.56%          78                63    19.23%
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TOTAL                               1586               601    62.11%         140                25    82.14%        3495               902    74.19%        1009               506    49.85%

github-actions · 2026-06-15T11:18:27Z

Mobile integration tests — @qvac/decoder-audio (Android)

Result: passed

metric	value
Devices passed	2
Devices failed	0
Test cases total	6
Test cases passed	6
Test cases failed	0
Test cases skipped	0

View workflow run

github-actions · 2026-06-15T11:30:07Z

Mobile integration tests — @qvac/decoder-audio (iOS)

Result: passed

metric	value
Devices passed	1
Devices failed	0
Test cases total	3
Test cases passed	3
Test cases failed	0
Test cases skipped	0

View workflow run

tobi-legan and others added 9 commits May 25, 2026 12:25

Merge remote-tracking branch 'origin/main' into tobi/android-devicefa…

e3a037e

…rm-shard-split Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/llm-llamacpp/test/mobile/test-groups.json

tobi-legan requested review from a team as code owners June 5, 2026 12:49

tobi-legan temporarily deployed to release June 5, 2026 12:50 — with GitHub Actions Inactive

tobi-legan temporarily deployed to release June 5, 2026 12:51 — with GitHub Actions Inactive

tobi-legan had a problem deploying to release June 5, 2026 13:00 — with GitHub Actions Failure

tobi-legan temporarily deployed to release June 5, 2026 13:00 — with GitHub Actions Inactive

tobi-legan had a problem deploying to release June 5, 2026 13:00 — with GitHub Actions Failure

Merge branch 'main' into tobi/android-devicefarm-shard-split

1e0034b

DmitryMalishev previously approved these changes Jun 8, 2026


tobi-legan and others added 10 commits June 10, 2026 10:29

Merge remote-tracking branch 'origin/main' into tobi/android-devicefa…

016bcbd

…rm-shard-split Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # packages/llm-llamacpp/test/mobile/test-groups.json

QVAC-19368 fix(ci): match Device Farm 'Customer Artifacts' name with …

da871d1

…space in download filter Co-authored-by: Cursor <cursoragent@cursor.com>

QVAC-19368 fix(ci): clean device labels in test-case-details (strip l…

277bf49

…og-type suffix) Co-authored-by: Cursor <cursoragent@cursor.com>

Merge remote-tracking branch 'origin/main' into tobi/android-devicefa…

3f888f4

…rm-shard-split

olyasir previously approved these changes Jun 12, 2026


DmitryMalishev requested changes Jun 12, 2026


Comment thread packages/llm-llamacpp/test/integration/_image-common.js Outdated

Comment thread packages/llm-llamacpp/test/integration/_vlm-image-perf.js Outdated

DmitryMalishev approved these changes Jun 12, 2026


Merge branch 'main' into tobi/android-devicefarm-shard-split

ad9b871

DmitryMalishev previously approved these changes Jun 12, 2026


tobi-legan and others added 3 commits June 13, 2026 00:27

QVAC-19368 infra: bump LLM mobile job timeout to 150min (from 120min)

d1589e2

Co-authored-by: Cursor <cursoragent@cursor.com>

Merge branch 'main' into tobi/android-devicefarm-shard-split

ed64749

DmitryMalishev approved these changes Jun 15, 2026


olyasir approved these changes Jun 15, 2026


Merge branch 'main' into tobi/android-devicefarm-shard-split

a1fb962

tobi-legan mentioned this pull request Jul 1, 2026

fix: capture Android logcat in WDIO after-hook so mobile logs survive test failures #2980

Merged

tobi-legan merged 30 commits into main from tobi/android-devicefarm-shard-split
