
QVAC-20556 parakeet-cpp: TDT host-decode on Adreno OpenCL + route Mali Vulkan to CPU#46

Draft
pratiknarola-t wants to merge 2 commits into master from qvac-20556-tdt-opencl-host-decode

@pratiknarola-t


Summary

Makes the Parakeet engines compute correctly on Android GPUs (QVAC-20556). Two device-specific fixes,
both validated on-device and on the QVAC device-farm:

  1. TDT decode on host for Adreno OpenCL (06cef8e7). The TDT per-step graphs do an in-place
    read+write of the LSTM persistent state (h / c / pred_persist: read via ggml_view_1d,
    written via ggml_cpy in the same graph_compute). The Adreno OpenCL backend drops the aliased
    ggml_cpy writes, so the state freezes and the decoder emits a constant token ("tamb"×N). The
    missing GGML_OP_ARGMAX OpenCL kernel (whose absence aborted graph_compute earlier) was only the
    shallower symptom. Fix: gate the joint argmax to host when the backend can't run it, and run the
    per-step scalar decode on host (use_graphs=false) on OpenCL. The encoder stays on the GPU; only
    the tiny per-step decode moves to host, so the speedup is preserved.

  2. Route Mali Vulkan to CPU (af1245d7). Mali (ARM Valhall) Vulkan mis-computes every model
    (TDT garbage, EOU/Sortformer empty) — its narrow subgroup width breaks the ggml-vulkan shaders,
    while CPU is correct. init_gpu_backend's device walk now skips Mali GPUs (engine falls back to
    CPU), and a new model_gpu_unsupported() / Engine::gpu_unsupported() lets hosts report this as
    the expected, correct result (a new gpuUnsupported runtime stat) rather than a silent GPU
    regression.

Adreno OpenCL and Samsung Xclipse 920 Vulkan are unaffected and run on the GPU.
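
To make the failure mode in item 1 concrete, here is a toy sketch of a recurrent decoder whose state write-back can be dropped. It is not the Parakeet code: `ToyDecoder`, its arithmetic, and `decode` are hypothetical stand-ins, chosen only to show that a step which reads the old state but never lands the new one keeps producing the same token.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-in for a recurrent decoder (hypothetical, not the TDT model):
// each step derives a token from the persistent state and is supposed to
// write an updated state back for the next step.
struct ToyDecoder {
    std::uint32_t state = 1;   // stands in for the LSTM h / c / pred_persist
    bool writes_land = true;   // false models a backend that drops the write

    int step() {
        const int token = static_cast<int>(state % 7);  // "argmax" for this step
        const std::uint32_t next = state * 3u + 1u;      // updated state
        if (writes_land) {
            state = next;                                // in-place write-back
        }
        return token;
    }
};

// Runs `steps` decode steps and collects the emitted tokens.
std::vector<int> decode(bool writes_land, int steps) {
    ToyDecoder d;
    d.writes_land = writes_land;
    std::vector<int> out;
    for (int i = 0; i < steps; ++i) {
        out.push_back(d.step());
    }
    return out;
}
```

With `writes_land = false` every step sees the initial state, so the output is one token repeated; with `writes_land = true` the sequence advances. Running the per-step decode on the host avoids depending on the backend for that write-back.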

Backend matrix

| Backend | TDT | EOU | Sortformer |
|---|---|---|---|
| Adreno 830 / OpenCL (S25) | ✅ (host-decode) | ✅ | ✅ |
| Xclipse 920 / Vulkan (S23 FE) | ✅ | ✅ | ✅ |
| Mali-G715 / Vulkan (Pixel 9) | → CPU ✅ | → CPU ✅ | → CPU ✅ |

Validation

  • Device-farm: green — run 27534055585, job 81380228958 (S25/Adreno OpenCL 19/19 on GPU;
    Pixel 9/Mali passes gpu-smoke on CPU via gpuUnsupported).
  • Addon validation PR: qvac#2577, "QVAC-20556 feat[api]: enable Android GPU for Parakeet (overlay; CI validation) [DO-NOT-MERGE]"
    (builds the transcription-parakeet addon against this branch via an overlay port).
  • Local: Galaxy S23 FE (Xclipse 920) — TDT/EOU/Sortformer on Vulkan match the CPU baseline; the
    Mali skip is name-gated (Xclipse stays on Vulkan, gpuUnsupported=0).

Follow-up (out of scope here)

The host-decode is the correct, low-risk path for shipping now. Running TDT on the GPU on Adreno
OpenCL needs a deeper ggml-opencl fix — a GGML_OP_ARGMAX kernel and a fix for the dropped
in-place persistent-state ggml_cpy (argmax alone is insufficient). Tracked separately.

Changed files

All under parakeet-cpp/: src/parakeet_tdt.cpp, src/parakeet_tdt.h, src/parakeet_ctc.cpp,
src/parakeet_ctc.h, src/parakeet_engine.cpp, include/parakeet/engine.h.

@pratiknarola-t pratiknarola-t requested review from a team as code owners June 15, 2026 09:44
@pratiknarola-t

Author

Validation evidence — device-farm green

These two commits were validated end-to-end via the transcription-parakeet addon (overlay-pinned to
this branch) on the QVAC AWS device-farm:

ggml-opencl drops the in-place ggml_cpy writes that update the TDT LSTM
persistent state (h/c/pred), so on Adreno OpenCL the prediction state
never advances and the decode emits one constant token per frame. It also
lacks an ARGMAX kernel, which aborts graph_compute.

Gate the joint token/duration argmax on ggml_backend_supports_op, and on
OpenCL run the per-step TDT decode on the host (the encoder still runs on
the GPU backend). EOU/Sortformer don't use the persistent-state pattern
and continue to run entirely on the GPU.
@pratiknarola-t pratiknarola-t force-pushed the qvac-20556-tdt-opencl-host-decode branch 2 times, most recently from f50efe8 to d5965e9 Compare June 15, 2026 11:51
ARM Mali (Valhall) Vulkan mis-computes every parakeet model (TDT garbage,
EOU/Sortformer empty) while CPU is correct; its narrow subgroup width breaks
the ggml-vulkan shaders. Adreno OpenCL and Samsung Xclipse Vulkan are correct
and stay on the GPU.

Guard Mali by name in the engine's GPU selection (mirrors the existing
Adreno-6xx skip) and route it to CPU, reporting via model_gpu_unsupported() /
Engine::gpu_unsupported() so hosts treat the CPU fallback as the expected,
correct result instead of a GPU regression.
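
The name gate and the "unsupported" flag described in this commit message can be sketched roughly as follows. This is a minimal illustration under assumptions: `DeviceInfo`, `GpuSelection`, `is_known_bad_gpu`, and `select_gpu` are hypothetical names, and the substring match on "Mali" is a guess at how the name gate works; the real walk iterates ggml backend devices inside init_gpu_backend.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical device record; the real code queries ggml backend devices.
struct DeviceInfo {
    std::string name;
    std::string description;
};

struct GpuSelection {
    int  index = -1;                   // chosen device, or -1 for CPU fallback
    bool skipped_unsupported = false;  // a known-bad GPU was seen and skipped

    // Only report "unsupported" when the skip actually left us without a GPU.
    bool gpu_unsupported() const { return skipped_unsupported && index < 0; }
};

// Assumed name gate: any device whose name or description mentions "Mali".
bool is_known_bad_gpu(const DeviceInfo & d) {
    return d.name.find("Mali") != std::string::npos ||
           d.description.find("Mali") != std::string::npos;
}

// Walks the device list, skipping known-bad GPUs and keeping the first usable one.
GpuSelection select_gpu(const std::vector<DeviceInfo> & devices) {
    GpuSelection sel;
    for (std::size_t i = 0; i < devices.size(); ++i) {
        if (is_known_bad_gpu(devices[i])) {
            sel.skipped_unsupported = true;
            continue;
        }
        if (sel.index < 0) {
            sel.index = static_cast<int>(i);
        }
    }
    return sel;
}
```

A host can then treat `gpu_unsupported() == true` as an expected CPU run rather than a failed GPU initialization, while a device list with no GPU at all still reports `false`.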
@pratiknarola-t pratiknarola-t force-pushed the qvac-20556-tdt-opencl-host-decode branch from d5965e9 to bb585eb Compare June 15, 2026 11:56
@pratiknarola-t

Author

I've reviewed PR #46 thoroughly (the diff, the surrounding code in context, and CI). Here's my summary.

PR #46: QVAC-20556 parakeet-cpp: TDT host-decode on Adreno OpenCL + route Mali Vulkan to CPU

Verdict: Looks good. No blockers. Two well-scoped, device-specific GPU correctness fixes, all under parakeet-cpp/ (6 files, +105/−30). No changes to whisper itself.

What it does

  1. TDT decode on host for Adreno OpenCL: tdt_prepare_runtime forces use_graphs=false when the backend registry name is "OpenCL", routing the per-step LSTM+joint decode through the existing, well-tested host scalar path (the acoustic encoder still runs on the GPU). This sidesteps ggml-opencl dropping the in-place ggml_cpy writes to the persistent LSTM state.
  2. argmax_on_gpu gating: build_joint_body probes ggml_backend_supports_op and falls back to a host argmax over raw logit slices when unsupported.
  3. Mali Vulkan → CPU: init_gpu_backend name-gates Mali devices (mirrors the existing Adreno-6xx skip) and surfaces it via model_gpu_unsupported() / Engine::gpu_unsupported() so hosts treat the CPU fallback as expected, not a regression.
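
The OpenCL gate from the first item can be pictured with a small sketch. The types here (`FakeReg`, `FakeBackend`, `TdtRuntime`) and the function `tdt_prepare_runtime_sketch` are hypothetical stand-ins so the example compiles on its own; the PR's tdt_prepare_runtime works on real ggml backend handles, and its exact signature is not shown in this conversation.

```cpp
#include <cstring>

// Hypothetical stand-ins for ggml's backend and backend-registry handles.
struct FakeReg     { const char * name; };
struct FakeBackend { const FakeReg * reg; };

// Null-safe lookup: returns "" rather than dereferencing a missing handle.
const char * backend_reg_name(const FakeBackend * backend) {
    if (backend == nullptr || backend->reg == nullptr || backend->reg->name == nullptr) {
        return "";
    }
    return backend->reg->name;
}

struct TdtRuntime {
    bool use_graphs;  // true: per-step ggml graphs; false: host scalar decode
};

// Forces the host scalar decode whenever the backend registry is "OpenCL".
TdtRuntime tdt_prepare_runtime_sketch(const FakeBackend * backend, bool want_graphs) {
    TdtRuntime rt{want_graphs};
    if (std::strcmp(backend_reg_name(backend), "OpenCL") == 0) {
        rt.use_graphs = false;
    }
    return rt;
}
```

Keying the gate on the registry name rather than on a specific device keeps the check independent of which Adreno model is present, at the cost of also routing any other OpenCL device to the host decode.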

Correctness checks (all pass)

  • OpenCL gate uses a null-safe backend_reg_name helper; the registry name match ("OpenCL") is consistent with the device-walk code.
  • gpu_unsupported is correctly computed as skipped_unsupported_gpu && backend_gpu == nullptr, so a Mali alongside a working discrete GPU still uses the GPU.
  • Only one caller of init_gpu_backend exists; signature change is contained. New Engine::gpu_unsupported() is ABI-safe (non-virtual addition).
  • The !argmax_on_gpu host-argmax fallback path reads correctly-sized buffers (V_plus_1 / num_durations).

Minor notes (non-blocking, no action required)

  • The argmax_on_gpu=false graph branch is effectively defensive/unexercised today: the only backend lacking ARGMAX (OpenCL) already forces use_graphs=false, so graphs aren't built there. It's documented and harmless. A small side-effect is that build_joint_body always creates an orphan tok_am = ggml_argmax(...) node just to probe support (trivial metadata waste).
  • Mali detection lives only in the non-OpenCL branch. If a Mali GPU were ever enumerated as an OpenCL device, it could slip into the last-resort opencl_other bucket; the validated Pixel 9 (Mali-G715) correctly falls to CPU, so this is not a real concern.

CI

Two checks show "fail" (ubuntu-22, bindings-java), but they're PR-independent infra failures, not caused by this change:

  • ubuntu-22: 83 Ruby-binding test errors are all RuntimeError: 429 Too Many Requests (HuggingFace model-download rate-limiting), plus a pre-existing add_subdirectory given source "tests" which is not an existing directory in the bindings' copied whisper CMake tree. Neither involves parakeet-cpp.
  • All core C++ build jobs (clang/gcc/arm64/metal/sycl/cuda/etc.) pass or are still pending. bindings-java is the same binding-test class and was still running.
  • The parakeet-relevant signal — the device farm (S25/Adreno OpenCL, Xclipse Vulkan, Pixel 9/Mali→CPU) — is reported green in the PR.

I'd ship it. The only thing worth a glance before merge is confirming bindings-java finishes with the same network/model-download failure rather than anything new, but that's CI hygiene, not a code issue.

@pratiknarola-t pratiknarola-t marked this pull request as draft June 15, 2026 16:00