
NeMoConformerASR-Android

Kotlin library for speech recognition using NVIDIA NeMo Conformer CTC model on Android with ONNX Runtime.

Features

  • NVIDIA NeMo Conformer CTC Small model (13M parameters)
  • ONNX Runtime for reliable cross-device inference
  • Returns both full text and timestamped segments
  • Automatic audio chunking for long audio (>20 seconds)
  • BPE tokenization (1024-token vocabulary)
  • Pure Kotlin implementation

Requirements

  • Android API 26+
  • Any ARM or x86 device (ONNX Runtime handles compatibility)

Installation

JitPack

Add JitPack to your root settings.gradle.kts:

dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

Add the dependency to your module's build.gradle.kts:

dependencies {
    implementation("com.github.Otosaku:NeMoConformerASR-Android:1.0.0")
}

Download Models

Download the ONNX models from Google Drive:

Download Models (65 MB)

The archive contains:

  • conformer_encoder.onnx - Conformer encoder (64 MB)
  • conformer_decoder.onnx - CTC decoder (0.7 MB)
  • vocabulary.json - BPE vocabulary (1024 tokens)

Download the models to the app's internal storage at runtime rather than bundling them in the APK; this keeps the APK small.
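A minimal sketch of fetching the models into internal storage on first launch. `MODEL_BASE_URL` is a hypothetical placeholder, not the actual download link; substitute the URL from the "Download Models" link above and run this off the main thread:

```kotlin
import java.io.File
import java.net.URL

// Hypothetical base URL -- replace with the real hosting location.
const val MODEL_BASE_URL = "https://example.com/models"

val MODEL_FILES = listOf(
    "conformer_encoder.onnx",
    "conformer_decoder.onnx",
    "vocabulary.json"
)

// Returns true if all three model files are already in the target directory.
fun modelsPresent(dir: File): Boolean =
    MODEL_FILES.all { File(dir, it).exists() }

// Downloads any missing model file into the given directory
// (e.g. context.filesDir). Call from a background thread/coroutine.
fun ensureModels(dir: File) {
    for (name in MODEL_FILES) {
        val target = File(dir, name)
        if (target.exists()) continue
        URL("$MODEL_BASE_URL/$name").openStream().use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
}
```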

Usage

Basic Recognition

import com.otosaku.nemoconformerasr.NeMoConformerASR

// Initialize with model file paths
val asr = NeMoConformerASR(
    context = context,
    encoderPath = "${context.filesDir}/conformer_encoder.onnx",
    decoderPath = "${context.filesDir}/conformer_decoder.onnx",
    vocabularyPath = "${context.filesDir}/vocabulary.json"
)

// Recognize speech (samples must be 16kHz mono Float32)
val audioSamples: FloatArray = loadAudio()
val result = asr.recognize(audioSamples)

// Full recognized text
println(result.text)

// Individual segments with timestamps
for (segment in result.segments) {
    println("[${segment.start}s - ${segment.end}s]: ${segment.text}")
}

// Audio duration
println("Duration: ${result.audioDuration}s")

// Don't forget to close when done
asr.close()

ASRResult Structure

data class ASRResult(
    val text: String,              // Full recognized text
    val segments: List<ASRSegment>, // Timestamped segments
    val audioDuration: Double      // Total audio duration in seconds
)

data class ASRSegment(
    val start: Double,  // Start time in seconds
    val end: Double,    // End time in seconds
    val text: String    // Recognized text for this segment
)

Supported Input Durations

The model accepts up to 20 seconds of audio per inference. Longer audio is automatically split into chunks.

Duration   Samples   Mel Frames   Encoded Frames
5 sec      80,000    501          126
10 sec     160,000   1,001        251
15 sec     240,000   1,501        376
20 sec     320,000   2,001        501
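The rows above follow a simple pattern. The sketch below is inferred from the table, assuming the standard 10 ms (160-sample) feature hop at 16 kHz and the Conformer's 4x subsampling; these are deductions from the numbers shown, not values read from the model config:

```kotlin
// 16 kHz audio: one second of audio is 16,000 samples.
fun samplesFor(seconds: Int) = seconds * 16_000

// 10 ms hop = 160 samples per mel frame, plus one edge frame.
fun melFramesFor(samples: Int) = samples / 160 + 1

// The encoder subsamples mel frames by 4x, plus one edge frame.
fun encodedFramesFor(melFrames: Int) = melFrames / 4 + 1
```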

Long Audio Processing

For audio longer than 20 seconds, the library automatically:

  1. Splits audio into 20-second chunks
  2. Processes each chunk independently
  3. Combines results with proper timestamps

Example Project

The repository includes a complete example app with audio recording and file import.

Running the Example

  1. Open the project in Android Studio

  2. Download and add models:

    • Download models from the link above
    • Unzip the archive
    • Copy files to app/src/main/assets/:
      • conformer_encoder.onnx
      • conformer_decoder.onnx
      • vocabulary.json
  3. Build and run on device

Example Features

  • Record Audio: Hold button to record from microphone
  • Test File: Import audio file for testing
  • Results: Shows recognized text, duration, and processing time

Model Information

  • Model: nvidia/stt_en_conformer_ctc_small
  • Parameters: 13.15M
  • Architecture: Conformer encoder (16 layers) + CTC decoder
  • Hidden dim: 176
  • Attention heads: 4
  • Vocabulary: 1024 BPE tokens + 1 blank
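For context, greedy CTC decoding turns the "1024 BPE tokens + 1 blank" output into text by taking the argmax per frame, collapsing repeats, and dropping blanks. The library implements this internally; in the sketch below, `tokenToText` stands in for the BPE vocabulary lookup, and placing the blank id last among the 1025 classes is an assumption:

```kotlin
const val BLANK_ID = 1024  // assumed: blank is the last of the 1025 classes

fun greedyCtcDecode(frameArgmax: IntArray, tokenToText: (Int) -> String): String {
    val sb = StringBuilder()
    var prev = -1
    for (id in frameArgmax) {
        // Collapse repeated ids, then drop blanks.
        if (id != prev && id != BLANK_ID) sb.append(tokenToText(id))
        prev = id
    }
    return sb.toString()
}
```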

Audio Requirements

  • Sample rate: 16,000 Hz
  • Channels: Mono
  • Format: Float32
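Android's `AudioRecord` commonly delivers 16-bit PCM, so a conversion to the Float32 range expected by `recognize()` is often needed. A minimal sketch, assuming the input is already 16 kHz mono:

```kotlin
// Scale signed 16-bit PCM samples into the [-1.0, 1.0) Float32 range.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }
```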

Model Architecture

Component           Input                 Output                 Size
Feature Extractor   audio (16 kHz)        mel (80, frames)       -
Encoder             mel (1, 80, 2001)     hidden (1, 176, 501)   64 MB
Decoder             hidden (1, 176, 501)  logits (1, 501, 1025)  0.7 MB

Dependencies

  • ONNX Runtime

License

MIT License

Acknowledgments
