QVAC-20556 parakeet-cpp: TDT host-decode on Adreno OpenCL + route Mali Vulkan to CPU #46
pratiknarola-t wants to merge 2 commits into
Validation evidence — device-farm green These two commits were validated end-to-end via the
ggml-opencl drops the in-place ggml_cpy writes that update the TDT LSTM persistent state (h/c/pred), so on Adreno OpenCL the prediction state never advances and the decode emits one constant token per frame. It also lacks an ARGMAX kernel, which aborts graph_compute. Gate the joint token/duration argmax on ggml_backend_supports_op, and on OpenCL run the per-step TDT decode on the host (the encoder still runs on the GPU backend). EOU/Sortformer don't use the persistent-state pattern and continue to run entirely on the GPU.
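The gating described in this commit message can be sketched as a small decision function. This is only an illustration of the logic, not the PR's code: `BackendCaps`, `TdtDecodePlan`, and `plan_tdt_decode` are hypothetical names, and the `supports_argmax` flag stands in for whatever `ggml_backend_supports_op` reports for the argmax node. How the real engine detects an OpenCL backend is not shown in this PR text, so the name comparison below is an assumption.

```cpp
#include <string>

// Hypothetical summary of a backend. In the real engine, supports_argmax
// would presumably come from a ggml_backend_supports_op query.
struct BackendCaps {
    std::string name;      // assumed label, e.g. "OpenCL", "Vulkan", "CPU"
    bool supports_argmax;  // can the backend run the ARGMAX op?
};

// Hypothetical decode plan mirroring the two decisions in the commit message.
struct TdtDecodePlan {
    bool argmax_on_backend;  // run the joint token/duration argmax in the graph
    bool use_graphs;         // run the per-step TDT decode as backend graphs
};

inline TdtDecodePlan plan_tdt_decode(const BackendCaps& caps) {
    TdtDecodePlan plan;
    // Argmax stays on the backend only when the backend can execute it.
    plan.argmax_on_backend = caps.supports_argmax;
    // On OpenCL the in-place state writes are dropped, so the per-step
    // decode runs on the host; the encoder is unaffected by this plan.
    plan.use_graphs = (caps.name != "OpenCL");
    return plan;
}
```

With these assumptions, `plan_tdt_decode({"OpenCL", false})` selects host argmax and host per-step decode, while a backend that supports argmax and is not OpenCL keeps both on the GPU.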
f50efe8 to d5965e9
ARM Mali (Valhall) Vulkan mis-computes every parakeet model (TDT garbage, EOU/Sortformer empty) while CPU is correct; its narrow subgroup width breaks the ggml-vulkan shaders. Adreno OpenCL and Samsung Xclipse Vulkan are correct and stay on the GPU. Guard Mali by name in the engine's GPU selection (mirrors the existing Adreno-6xx skip) and route it to CPU, reporting via model_gpu_unsupported() / Engine::gpu_unsupported() so hosts treat the CPU fallback as the expected, correct result instead of a GPU regression.
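A name-gated guard of the kind this commit message describes might look like the sketch below. It is a minimal illustration under stated assumptions: `GpuChoice` and `choose_device` are hypothetical, the case-insensitive substring match on "mali" is a guess at how the name check works, and the existing Adreno-6xx skip is not modelled. The device strings in the usage note are illustrative, not taken from the PR.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical result of GPU selection for one device.
struct GpuChoice {
    bool use_gpu;          // true: keep this device as the GPU backend
    bool gpu_unsupported;  // true: CPU fallback is the expected result
};

// Lower-case a copy of the string so the name match is case-insensitive.
inline std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Assumed rule: any device whose description mentions "mali" is skipped
// and flagged, everything else stays on the GPU.
inline GpuChoice choose_device(const std::string& device_description) {
    const std::string d = to_lower(device_description);
    if (d.find("mali") != std::string::npos) {
        return {false, true};
    }
    return {true, false};
}
```

Under these assumptions a description such as "Mali-G715" is routed to CPU with the flag set, while "Samsung Xclipse 920" stays on the GPU with the flag clear, which matches the behaviour the commit message describes for Xclipse.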
d5965e9 to bb585eb
I've reviewed PR #46 thoroughly (the diff, the surrounding code in context, and CI). Here's my summary. PR #46 —
Summary
Makes the Parakeet engines compute correctly on Android GPUs (QVAC-20556). Two device-specific fixes,
both validated on-device and on the QVAC device-farm:
TDT decode on host for Adreno OpenCL (06cef8e7). The TDT per-step graphs do an in-place read+write of the LSTM persistent state (h/c/pred_persist: read via ggml_view_1d, written via ggml_cpy in the same graph_compute). The Adreno OpenCL backend drops the aliased ggml_cpy writes, so the state freezes and the decoder emits a constant token ("tamb"×N). The missing GGML_OP_ARGMAX OpenCL kernel (which aborted earlier) was only the shallower symptom. Fix: gate the joint argmax to host when the backend can't run it, and run the per-step scalar decode on host (use_graphs=false) on OpenCL. The encoder stays on the GPU; only the tiny per-step decode moves to host, so the speedup is preserved.
Route Mali Vulkan to CPU (af1245d7). Mali (ARM Valhall) Vulkan mis-computes every model (TDT garbage, EOU/Sortformer empty): its narrow subgroup width breaks the ggml-vulkan shaders, while CPU is correct. init_gpu_backend's device walk now skips Mali GPUs (engine falls back to CPU), and a new model_gpu_unsupported() / Engine::gpu_unsupported() lets hosts report this as the expected, correct result (a new gpuUnsupported runtime stat) rather than a silent GPU regression.
Adreno OpenCL and Samsung Xclipse 920 Vulkan are unaffected and run on the GPU.
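To make the first failure mode concrete, here is a toy model of a recurrent decode loop in which the state write may or may not land. It is not the Parakeet decoder: `toy_decode` and its arithmetic are invented stand-ins for the predictor, joint, and argmax, and the toy ignores encoder frames entirely. It only shows why a dropped state update yields the same token on every step.

```cpp
#include <vector>

// Toy recurrent decode loop. The emitted token depends only on the current
// state, and each step is supposed to write an updated state back.
// state_write_lands == false models a backend that drops that write.
inline std::vector<int> toy_decode(int steps, bool state_write_lands) {
    int state = 0;
    std::vector<int> tokens;
    for (int t = 0; t < steps; ++t) {
        // Stand-in for predictor + joint + argmax over the current state.
        const int token = (state * 7 + 3) % 5;
        tokens.push_back(token);
        // Stand-in for the LSTM update that should overwrite the state.
        const int next_state = state + token + 1;
        if (state_write_lands) {
            state = next_state;
        }
        // When the write is dropped, state keeps its old value, so the
        // next step sees identical input and emits the identical token.
    }
    return tokens;
}
```

Calling `toy_decode(4, true)` yields the varying sequence 3, 1, 0, 2, whereas `toy_decode(4, false)` yields 3, 3, 3, 3, the toy analogue of the constant-token output described above.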
Backend matrix
Validation
27534055585, job 81380228958 (S25/Adreno OpenCL 19/19 on GPU; Pixel 9/Mali passes gpu-smoke on CPU via gpuUnsupported).
transcription-parakeet addon against this branch via an overlay port).
Mali skip is name-gated (Xclipse stays on Vulkan, gpuUnsupported=0).
Follow-up (out of scope here)
The host-decode is the correct, low-risk path for shipping now. Running TDT on the GPU on Adreno OpenCL needs a deeper ggml-opencl fix: a GGML_OP_ARGMAX kernel and a fix for the dropped in-place persistent-state ggml_cpy (argmax alone is insufficient). Tracked separately.
Changed files
All under parakeet-cpp/: src/parakeet_tdt.cpp, src/parakeet_tdt.h, src/parakeet_ctc.cpp, src/parakeet_ctc.h, src/parakeet_engine.cpp, include/parakeet/engine.h.