QVAC-20556 parakeet-cpp: TDT host-decode on Adreno OpenCL + route Mali Vulkan to CPU by pratiknarola-t · Pull Request #46 · tetherto/qvac-ext-lib-whisper.cpp

pratiknarola-t · 2026-06-15T09:44:12Z

Summary

Makes the Parakeet engines compute correctly on Android GPUs (QVAC-20556). Two device-specific fixes,
both validated on-device and on the QVAC device-farm:

- TDT decode on host for Adreno OpenCL (06cef8e7). The TDT per-step graphs do an in-place
  read+write of the LSTM persistent state (h / c / pred_persist: read via ggml_view_1d,
  written via ggml_cpy in the same graph_compute). The Adreno OpenCL backend drops the aliased
  ggml_cpy writes, so the state freezes and the decoder emits a constant token ("tamb"×N). The
  missing GGML_OP_ARGMAX OpenCL kernel (which aborted earlier) was only the shallower symptom. Fix:
  gate the joint argmax to host when the backend can't run it, and run the per-step scalar decode on
  host (use_graphs=false) on OpenCL. The encoder stays on the GPU; only the tiny per-step decode
  moves to host, so the speedup is preserved.
- Route Mali Vulkan to CPU (af1245d7). Mali (ARM Valhall) Vulkan mis-computes every model
  (TDT garbage, EOU/Sortformer empty) because its narrow subgroup width breaks the ggml-vulkan
  shaders, while CPU is correct. init_gpu_backend's device walk now skips Mali GPUs (engine falls
  back to CPU), and a new model_gpu_unsupported() / Engine::gpu_unsupported() lets hosts report this
  as the expected, correct result (a new gpuUnsupported runtime stat) rather than a silent GPU
  regression.
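
To make the first failure mode concrete, here is a toy sketch of a decode loop whose state write-back is dropped. It is not the Parakeet LSTM and does not use ggml: `ToyState`, `toy_step` and `toy_decode` are hypothetical names, and a single integer stands in for the h / c / pred_persist tensors. It only illustrates why a frozen state yields one repeated token.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for the persistent predictor state (h / c / pred_persist).
// Hypothetical: the real state is a set of ggml tensors, not one int.
struct ToyState {
    int h = 0;
};

// One decode step: read the state, derive a token from it, compute the next
// state, and write it back in place unless the backend "drops" the write.
inline int toy_step(ToyState & st, bool write_is_dropped) {
    const int token  = st.h % 5;  // token depends on the state that was read
    const int next_h = st.h + 1;  // next state computed from the old state
    if (!write_is_dropped) {
        st.h = next_h;            // in-place write-back
    }
    return token;
}

// Run n_steps steps from a fresh state and collect the emitted tokens.
inline std::vector<int> toy_decode(std::size_t n_steps, bool write_is_dropped) {
    assert(n_steps > 0);
    ToyState st;
    std::vector<int> out;
    out.reserve(n_steps);
    for (std::size_t i = 0; i < n_steps; ++i) {
        out.push_back(toy_step(st, write_is_dropped));
    }
    return out;
}
```

With the write applied, `toy_decode(4, false)` yields 0, 1, 2, 3; with the write dropped, `toy_decode(4, true)` yields 0, 0, 0, 0, which is the toy analogue of the constant-token output described above.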

Adreno OpenCL and Samsung Xclipse 920 Vulkan are unaffected and run on the GPU.

Backend matrix

| Backend | TDT | EOU | Sortformer |
| --- | --- | --- | --- |
| Adreno 830 / OpenCL (S25) | ✅ (host-decode) | ✅ | ✅ |
| Xclipse 920 / Vulkan (S23 FE) | ✅ | ✅ | ✅ |
| Mali-G715 / Vulkan (Pixel 9) | → CPU ✅ | → CPU ✅ | → CPU ✅ |

Validation

- Device-farm: green (run 27534055585, job 81380228958). S25/Adreno OpenCL: 19/19 on GPU;
  Pixel 9/Mali passes gpu-smoke on CPU via gpuUnsupported.
- Addon validation PR: qvac#2577, "QVAC-20556 feat[api]: enable Android GPU for Parakeet (overlay; CI validation) [DO-NOT-MERGE]"
  (builds the transcription-parakeet addon against this branch via an overlay port).
- Local: Galaxy S23 FE (Xclipse 920). TDT/EOU/Sortformer on Vulkan match the CPU baseline; the
  Mali skip is name-gated (Xclipse stays on Vulkan, gpuUnsupported=0).

Follow-up (out of scope here)

The host-decode is the correct, low-risk path for shipping now. Running TDT on the GPU on Adreno
OpenCL needs a deeper ggml-opencl fix — a GGML_OP_ARGMAX kernel and a fix for the dropped
in-place persistent-state ggml_cpy (argmax alone is insufficient). Tracked separately.

Changed files

All under parakeet-cpp/: src/parakeet_tdt.cpp, src/parakeet_tdt.h, src/parakeet_ctc.cpp,
src/parakeet_ctc.h, src/parakeet_engine.cpp, include/parakeet/engine.h.

pratiknarola-t · 2026-06-15T09:44:53Z

Validation evidence — device-farm green

These two commits were validated end-to-end via the transcription-parakeet addon (overlay-pinned to
this branch) on the QVAC AWS device-farm:

Run: https://github.com/tetherto/qvac/actions/runs/27534055585
Job (Android E2E): https://github.com/tetherto/qvac/actions/runs/27534055585/job/81380228958 — success
- S25 Ultra / Adreno 830 (OpenCL): 19/19 GPU, all models correct (TDT via host-decode).
- Pixel 9 / Mali-G715 (Vulkan): routed to CPU (gpuUnsupported), gpu-smoke passes with correct output.
Addon validation PR: qvac#2577, "QVAC-20556 feat[api]: enable Android GPU for Parakeet (overlay; CI validation) [DO-NOT-MERGE]".

ggml-opencl drops the in-place ggml_cpy writes that update the TDT LSTM persistent state (h/c/pred), so on Adreno OpenCL the prediction state never advances and the decode emits one constant token per frame. It also lacks an ARGMAX kernel, which aborts graph_compute. Gate the joint token/duration argmax on ggml_backend_supports_op, and on OpenCL run the per-step TDT decode on the host (the encoder still runs on the GPU backend). EOU/Sortformer don't use the persistent-state pattern and continue to run entirely on the GPU.
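
The host-side fallback for the joint argmax can be pictured roughly as follows. This is a minimal sketch and not the PR's code: `argmax_slice`, `JointPick` and `host_joint_argmax` are hypothetical names, and the buffer layout (V_plus_1 token logits followed by num_durations duration logits in one contiguous array) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Index (relative to begin) of the largest value in logits[begin, begin + count).
// Ties resolve to the lowest index.
inline std::size_t argmax_slice(const std::vector<float> & logits,
                                std::size_t begin, std::size_t count) {
    assert(count > 0 && begin + count <= logits.size());
    std::size_t best = 0;
    for (std::size_t i = 1; i < count; ++i) {
        if (logits[begin + i] > logits[begin + best]) {
            best = i;
        }
    }
    return best;
}

// Result of one joint step: the chosen token id and the chosen duration index.
struct JointPick {
    std::size_t token;
    std::size_t duration_idx;
};

// Assumed layout: [ V_plus_1 token logits | num_durations duration logits ].
inline JointPick host_joint_argmax(const std::vector<float> & joint_logits,
                                   std::size_t V_plus_1, std::size_t num_durations) {
    assert(joint_logits.size() == V_plus_1 + num_durations);
    return { argmax_slice(joint_logits, 0, V_plus_1),
             argmax_slice(joint_logits, V_plus_1, num_durations) };
}
```

In the real engine the decision to take this path would come from a capability probe such as ggml_backend_supports_op, as the commit message describes; here the caller is simply assumed to have already copied the raw logits to host memory.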

ARM Mali (Valhall) Vulkan mis-computes every parakeet model (TDT garbage, EOU/Sortformer empty) while CPU is correct; its narrow subgroup width breaks the ggml-vulkan shaders. Adreno OpenCL and Samsung Xclipse Vulkan are correct and stay on the GPU. Guard Mali by name in the engine's GPU selection (mirrors the existing Adreno-6xx skip) and route it to CPU, reporting via model_gpu_unsupported() / Engine::gpu_unsupported() so hosts treat the CPU fallback as the expected, correct result instead of a GPU regression.
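
A name-based guard of the kind described here might look like the sketch below. The helper names (`to_lower_ascii`, `looks_like_mali`) are hypothetical, the case-insensitive substring match is an assumption about how the gate could be written, and the device strings in the usage function are illustrative rather than captured driver output.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Lower-case an ASCII string so the match is case-insensitive.
inline std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char ch) { return static_cast<char>(std::tolower(ch)); });
    return s;
}

// True when the device description mentions "mali" in any letter case.
// A null description is treated as "not Mali".
inline bool looks_like_mali(const char * device_desc) {
    if (device_desc == nullptr) {
        return false;
    }
    return to_lower_ascii(device_desc).find("mali") != std::string::npos;
}

// Usage with illustrative device strings.
inline void mali_gate_examples() {
    assert(looks_like_mali("Mali-G715"));
    assert(!looks_like_mali("Samsung Xclipse 920"));
}
```

A substring match is deliberately coarse: it keeps the gate independent of exact driver naming, at the cost of matching any description that happens to contain those four letters.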

pratiknarola-t · 2026-06-15T12:22:56Z

I've reviewed PR #46 thoroughly (the diff, the surrounding code in context, and CI). Here's my summary.

PR #46 — `QVAC-20556 parakeet-cpp: TDT host-decode on Adreno OpenCL + route Mali Vulkan to CPU`

Verdict: Looks good. No blockers. Two well-scoped, device-specific GPU correctness fixes, all under parakeet-cpp/ (6 files, +105/−30). No changes to whisper itself.

What it does

- TDT decode on host for Adreno OpenCL: tdt_prepare_runtime forces use_graphs=false when the backend registry name is "OpenCL", routing the per-step LSTM+joint decode through the existing, well-tested host scalar path (the acoustic encoder still runs on the GPU). This sidesteps ggml-opencl dropping the in-place ggml_cpy writes to the persistent LSTM state.
- argmax_on_gpu gating: build_joint_body probes ggml_backend_supports_op and falls back to a host argmax over raw logit slices when unsupported.
- Mali Vulkan → CPU: init_gpu_backend name-gates Mali devices (mirrors the existing Adreno-6xx skip) and surfaces it via model_gpu_unsupported() / Engine::gpu_unsupported() so hosts treat the CPU fallback as expected, not a regression.
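
The use_graphs override in the first item can be sketched as a small pure function. This is an illustration under assumptions: `safe_name` is a hypothetical stand-in for the PR's backend_reg_name helper (which presumably takes a backend handle rather than a string), `effective_use_graphs` is a made-up name, and "Vulkan" is used only as an example of a non-OpenCL registry name.

```cpp
#include <cassert>
#include <cstring>

// Null-safe accessor: map a missing registry name to the empty string.
inline const char * safe_name(const char * reg_name) {
    return reg_name != nullptr ? reg_name : "";
}

// Force the host scalar decode path on OpenCL; otherwise keep the caller's choice.
inline bool effective_use_graphs(bool requested, const char * reg_name) {
    if (std::strcmp(safe_name(reg_name), "OpenCL") == 0) {
        return false;
    }
    return requested;
}

// Usage: only an exact "OpenCL" match overrides the request.
inline void use_graphs_examples() {
    assert(!effective_use_graphs(true, "OpenCL"));
    assert(effective_use_graphs(true, "Vulkan"));
    assert(effective_use_graphs(true, nullptr));
    assert(!effective_use_graphs(false, "Vulkan"));
}
```

Matching on the registry name rather than the device name means the override applies to every OpenCL device, which is consistent with the review's description of the gate.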

Correctness checks (all pass)

- OpenCL gate uses a null-safe backend_reg_name helper; the registry name match ("OpenCL") is consistent with the device-walk code.
- gpu_unsupported is correctly computed as skipped_unsupported_gpu && backend_gpu == nullptr, so a Mali alongside a working discrete GPU still uses the GPU.
- Only one caller of init_gpu_backend exists; the signature change is contained. The new Engine::gpu_unsupported() is ABI-safe (non-virtual addition).
- The !argmax_on_gpu host-argmax fallback path reads correctly-sized buffers (V_plus_1 / num_durations).
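
The gpu_unsupported predicate can be illustrated with a toy device walk. Everything here is hypothetical scaffolding: `ToyDevice`, `GpuInitResult`, `toy_init_gpu_backend` and `toy_gpu_unsupported` are invented for the sketch, an index of -1 stands in for a null backend pointer, and the `is_mali` flag is assumed to have been derived from the device name beforehand.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct ToyDevice {
    std::string name;
    bool        is_mali;
};

struct GpuInitResult {
    int  backend_gpu = -1;                 // chosen device index; -1 stands in for nullptr
    bool skipped_unsupported_gpu = false;  // set when at least one Mali device was skipped
};

// Walk the devices, skip Mali, and keep the first remaining GPU.
inline GpuInitResult toy_init_gpu_backend(const std::vector<ToyDevice> & devs) {
    GpuInitResult r;
    for (std::size_t i = 0; i < devs.size(); ++i) {
        if (devs[i].is_mali) {
            r.skipped_unsupported_gpu = true;
            continue;
        }
        if (r.backend_gpu < 0) {
            r.backend_gpu = static_cast<int>(i);
        }
    }
    return r;
}

// Report "unsupported" only when a GPU was skipped and no other GPU was selected.
inline bool toy_gpu_unsupported(const GpuInitResult & r) {
    return r.skipped_unsupported_gpu && r.backend_gpu < 0;
}

// Usage with illustrative device lists.
inline void gpu_unsupported_examples() {
    const std::vector<ToyDevice> mali_only = { {"Mali-G715", true} };
    const std::vector<ToyDevice> mali_plus = { {"Mali-G715", true}, {"Other GPU", false} };
    assert(toy_gpu_unsupported(toy_init_gpu_backend(mali_only)));
    assert(!toy_gpu_unsupported(toy_init_gpu_backend(mali_plus)));
}
```

Keeping both conditions in the predicate is what separates "no GPU present at all" (not reported) from "a GPU was present but deliberately skipped" (reported as gpuUnsupported).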

Minor notes (non-blocking, no action required)

- The argmax_on_gpu=false graph branch is effectively defensive/unexercised today: the only backend lacking ARGMAX (OpenCL) already forces use_graphs=false, so graphs aren't built there. It's documented and harmless. A small side-effect is that build_joint_body always creates an orphan tok_am = ggml_argmax(...) node just to probe support (trivial metadata waste).
- Mali detection lives only in the non-OpenCL branch. If a Mali ever enumerated an OpenCL device it could slip into the last-resort opencl_other bucket, but the validated Pixel 9 (Mali-G715) correctly falls to CPU, so this is not a real concern.

CI

Two checks show "fail" (ubuntu-22, bindings-java), but they're PR-independent infra failures, not caused by this change:

- ubuntu-22: 83 Ruby-binding test errors are all RuntimeError: 429 Too Many Requests (HuggingFace model-download rate-limiting), plus a pre-existing add_subdirectory given source "tests" which is not an existing directory in the bindings' copied whisper CMake tree. Neither involves parakeet-cpp.
- All core C++ build jobs (clang/gcc/arm64/metal/sycl/cuda/etc.) pass or are still pending. bindings-java is the same binding-test class and was still running.
- The parakeet-relevant signal, the device farm (S25/Adreno OpenCL, Xclipse Vulkan, Pixel 9/Mali→CPU), is reported green in the PR.

I'd ship it. The only thing worth a glance before merge is confirming bindings-java finishes with the same network/model-download failure rather than anything new, but that's CI hygiene, not a code issue.

pratiknarola-t requested review from a team as code owners June 15, 2026 09:44

pratiknarola-t force-pushed the qvac-20556-tdt-opencl-host-decode branch 2 times, most recently from f50efe8 to d5965e9 Compare June 15, 2026 11:51

pratiknarola-t force-pushed the qvac-20556-tdt-opencl-host-decode branch from d5965e9 to bb585eb Compare June 15, 2026 11:56

pratiknarola-t marked this pull request as draft June 15, 2026 16:00

pratiknarola-t wants to merge 2 commits into master from qvac-20556-tdt-opencl-host-decode
