A Kotlin library for speech recognition on Android, using the NVIDIA NeMo Conformer CTC model with ONNX Runtime.
- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- ONNX Runtime for reliable cross-device inference
- Returns both full text and timestamped segments
- Automatic audio chunking for long audio (>20 seconds)
- BPE tokenization (1024 vocabulary)
- Pure Kotlin implementation
- Android API 26+
- Any ARM or x86 device (ONNX Runtime handles compatibility)
Add JitPack to your root settings.gradle.kts:
```kotlin
dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}
```

Add the dependency to your module's `build.gradle.kts`:
```kotlin
dependencies {
    implementation("com.github.Otosaku:NeMoConformerASR-Android:1.0.0")
}
```

Download the ONNX models from Google Drive:
The archive contains:
- `conformer_encoder.onnx` - Conformer encoder (64 MB)
- `conformer_decoder.onnx` - CTC decoder (0.7 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)
Models should be downloaded to the app's internal storage (not bundled in the APK, to reduce app size).
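Before constructing the recognizer it is worth checking that all three model files are actually present on disk. The helper below is a hypothetical convenience (not part of the library API); on Android you would pass `context.filesDir` as the directory.

```kotlin
import java.io.File

// Hypothetical helper: verify all three model files exist in the given
// directory (context.filesDir on Android) before initializing the ASR engine.
fun modelsPresent(filesDir: File): Boolean {
    val required = listOf(
        "conformer_encoder.onnx",
        "conformer_decoder.onnx",
        "vocabulary.json"
    )
    return required.all { File(filesDir, it).exists() }
}
```

If this returns `false`, trigger (or re-trigger) the model download instead of constructing `NeMoConformerASR` with missing paths.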
```kotlin
import com.otosaku.nemoconformerasr.NeMoConformerASR

// Initialize with model file paths
val asr = NeMoConformerASR(
    context = context,
    encoderPath = "${context.filesDir}/conformer_encoder.onnx",
    decoderPath = "${context.filesDir}/conformer_decoder.onnx",
    vocabularyPath = "${context.filesDir}/vocabulary.json"
)

// Recognize speech (samples must be 16 kHz mono Float32)
val audioSamples: FloatArray = loadAudio()
val result = asr.recognize(audioSamples)

// Full recognized text
println(result.text)

// Individual segments with timestamps
for (segment in result.segments) {
    println("[${segment.start}s - ${segment.end}s]: ${segment.text}")
}

// Audio duration
println("Duration: ${result.audioDuration}s")

// Don't forget to close when done
asr.close()
```

```kotlin
data class ASRResult(
    val text: String,                // Full recognized text
    val segments: List<ASRSegment>,  // Timestamped segments
    val audioDuration: Double        // Total audio duration in seconds
)

data class ASRSegment(
    val start: Double,  // Start time in seconds
    val end: Double,    // End time in seconds
    val text: String    // Recognized text for this segment
)
```

The model accepts up to 20 seconds of audio per inference. Longer audio is automatically split into chunks.
| Duration | Samples | Mel Frames | Encoded Frames |
|---|---|---|---|
| 5 sec | 80,000 | 501 | 126 |
| 10 sec | 160,000 | 1,001 | 251 |
| 15 sec | 240,000 | 1,501 | 376 |
| 20 sec | 320,000 | 2,001 | 501 |
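The table above is consistent with a 10 ms hop (160 samples at 16 kHz) for the mel spectrogram and 4x subsampling in the encoder. These two constants are inferred from the table, not stated by the library, so treat the sketch below as a hedged reconstruction of the frame arithmetic:

```kotlin
// Assumed constants, inferred from the duration/frames table:
// 160-sample hop at 16 kHz, 4x encoder subsampling.
fun melFrames(samples: Int): Int = samples / 160 + 1
fun encodedFrames(mel: Int): Int = (mel - 1) / 4 + 1
```

For example, 20 seconds = 320,000 samples gives 2,001 mel frames and 501 encoded frames, matching the last table row.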
For audio longer than 20 seconds, the library automatically:
- Splits audio into 20-second chunks
- Processes each chunk independently
- Combines results with proper timestamps
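The splitting step above can be sketched as follows. This is a hypothetical illustration of the chunking logic, not the library's internal code: each 20-second window (320,000 samples at 16 kHz) is paired with its start offset in seconds, so per-chunk timestamps can be shifted back to absolute time.

```kotlin
// Assumed constants: 16 kHz input, 20-second chunk limit (from the README).
val sampleRate = 16_000
val chunkSamples = 20 * sampleRate

// Split audio into 20-second chunks, keeping each chunk's start offset
// in seconds so segment timestamps can be made absolute afterwards.
fun chunkAudio(samples: FloatArray): List<Pair<Double, FloatArray>> =
    (samples.indices step chunkSamples).map { start ->
        val end = minOf(start + chunkSamples, samples.size)
        start.toDouble() / sampleRate to samples.copyOfRange(start, end)
    }
```

A 25-second recording would produce two chunks: a full 20-second chunk at offset 0.0 s and a 5-second remainder at offset 20.0 s.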
The repository includes a complete example app with audio recording and file import.
1. Open the project in Android Studio
2. Download and add models:
   - Download models from the link above
   - Unzip the archive
   - Copy the files to `app/src/main/assets/`:
     - `conformer_encoder.onnx`
     - `conformer_decoder.onnx`
     - `vocabulary.json`
3. Build and run on a device
- Record Audio: Hold button to record from microphone
- Test File: Import audio file for testing
- Results: Shows recognized text, duration, and processing time
- Model: nvidia/stt_en_conformer_ctc_small
- Parameters: 13.15M
- Architecture: Conformer encoder (16 layers) + CTC decoder
- Hidden dim: 176
- Attention heads: 4
- Vocabulary: 1024 BPE tokens + 1 blank
- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
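Android's `AudioRecord` typically delivers 16-bit PCM, while the model expects Float32 in roughly [-1, 1). A minimal conversion helper (an assumption for illustration, not part of the library API) might look like:

```kotlin
// Hypothetical helper: convert 16-bit PCM samples to the Float32
// range [-1.0, 1.0) expected by the recognizer.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }
```

Dividing by 32768 maps the full `Short` range exactly onto [-1.0, 1.0).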
| Component | Input | Output | Size |
|---|---|---|---|
| Feature Extractor | audio (16kHz) | mel (80, frames) | - |
| Encoder | mel (1, 80, 2001) | hidden (1, 176, 501) | 64 MB |
| Decoder | hidden (1, 176, 501) | logits (1, 501, 1025) | 0.7 MB |
- ONNX Runtime Android - ML inference runtime
- NeMoFeatureExtractor-Android - Mel spectrogram extraction
- Gson - JSON parsing
MIT License
- NVIDIA NeMo - Original model and training
- ONNX Runtime - Cross-platform ML inference