
NeMoConformerASR-Android

Kotlin library for speech recognition using NVIDIA NeMo Conformer CTC model on Android with ONNX Runtime.

Features

  • NVIDIA NeMo Conformer CTC Small model (13M parameters)
  • ONNX Runtime for reliable cross-device inference
  • Returns both full text and timestamped segments
  • Automatic audio chunking for long audio (>20 seconds)
  • BPE tokenization (1024-token vocabulary)
  • Pure Kotlin implementation

Requirements

  • Android API 26+
  • Any ARM or x86 device (ONNX Runtime handles compatibility)

Installation

JitPack

Add JitPack to your root settings.gradle.kts:

dependencyResolutionManagement {
    repositories {
        maven { url = uri("https://jitpack.io") }
    }
}

Add the dependency to your module's build.gradle.kts:

dependencies {
    implementation("com.github.Otosaku:NeMoConformerASR-Android:1.0.0")
}

Download Models

Download the ONNX models from Google Drive:

Download Models (65 MB)

The archive contains:

  • conformer_encoder.onnx - Conformer encoder (64 MB)
  • conformer_decoder.onnx - CTC decoder (0.7 MB)
  • vocabulary.json - BPE vocabulary (1024 tokens)

Download the models to the app's internal storage at runtime rather than bundling them in the APK; this keeps the APK small.
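A minimal sketch of fetching the models into internal storage on first launch. `MODEL_BASE_URL` is a hypothetical placeholder, not the actual download link; substitute the URL from the "Download Models" link above and run this off the main thread:

```kotlin
import java.io.File
import java.net.URL

// Hypothetical base URL -- replace with the real hosting location.
const val MODEL_BASE_URL = "https://example.com/models"

val MODEL_FILES = listOf(
    "conformer_encoder.onnx",
    "conformer_decoder.onnx",
    "vocabulary.json"
)

// Returns true if all three model files are already in the target directory.
fun modelsPresent(dir: File): Boolean =
    MODEL_FILES.all { File(dir, it).exists() }

// Downloads any missing model file into the given directory
// (e.g. context.filesDir). Call from a background thread/coroutine.
fun ensureModels(dir: File) {
    for (name in MODEL_FILES) {
        val target = File(dir, name)
        if (target.exists()) continue
        URL("$MODEL_BASE_URL/$name").openStream().use { input ->
            target.outputStream().use { output -> input.copyTo(output) }
        }
    }
}
```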

Usage

Basic Recognition

import com.otosaku.nemoconformerasr.NeMoConformerASR

// Initialize with model file paths
val asr = NeMoConformerASR(
    context = context,
    encoderPath = "${context.filesDir}/conformer_encoder.onnx",
    decoderPath = "${context.filesDir}/conformer_decoder.onnx",
    vocabularyPath = "${context.filesDir}/vocabulary.json"
)

// Recognize speech (samples must be 16kHz mono Float32)
val audioSamples: FloatArray = loadAudio()
val result = asr.recognize(audioSamples)

// Full recognized text
println(result.text)

// Individual segments with timestamps
for (segment in result.segments) {
    println("[${segment.start}s - ${segment.end}s]: ${segment.text}")
}

// Audio duration
println("Duration: ${result.audioDuration}s")

// Don't forget to close when done
asr.close()

ASRResult Structure

data class ASRResult(
    val text: String,              // Full recognized text
    val segments: List<ASRSegment>, // Timestamped segments
    val audioDuration: Double      // Total audio duration in seconds
)

data class ASRSegment(
    val start: Double,  // Start time in seconds
    val end: Double,    // End time in seconds
    val text: String    // Recognized text for this segment
)

Supported Input Durations

The model accepts up to 20 seconds of audio per inference. Longer audio is automatically split into chunks.

Duration   Samples   Mel Frames   Encoded Frames
5 sec      80,000    501          126
10 sec     160,000   1,001        251
15 sec     240,000   1,501        376
20 sec     320,000   2,001        501
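The rows above follow a simple pattern. The sketch below is inferred from the table, assuming the standard 10 ms (160-sample) feature hop at 16 kHz and the Conformer's 4x subsampling; these are deductions from the numbers shown, not values read from the model config:

```kotlin
// 16 kHz audio: one second of audio is 16,000 samples.
fun samplesFor(seconds: Int) = seconds * 16_000

// 10 ms hop = 160 samples per mel frame, plus one edge frame.
fun melFramesFor(samples: Int) = samples / 160 + 1

// The encoder subsamples mel frames by 4x, plus one edge frame.
fun encodedFramesFor(melFrames: Int) = melFrames / 4 + 1
```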

Long Audio Processing

For audio longer than 20 seconds, the library automatically:

  1. Splits audio into 20-second chunks
  2. Processes each chunk independently
  3. Combines results with proper timestamps

Example Project

The repository includes a complete example app with audio recording and file import.

Running the Example

  1. Open the project in Android Studio

  2. Download and add models:

    • Download models from the link above
    • Unzip the archive
    • Copy files to app/src/main/assets/:
      • conformer_encoder.onnx
      • conformer_decoder.onnx
      • vocabulary.json
  3. Build and run on device

Example Features

  • Record Audio: Hold button to record from microphone
  • Test File: Import audio file for testing
  • Results: Shows recognized text, duration, and processing time

Model Information

  • Model: nvidia/stt_en_conformer_ctc_small
  • Parameters: 13.15M
  • Architecture: Conformer encoder (16 layers) + CTC decoder
  • Hidden dim: 176
  • Attention heads: 4
  • Vocabulary: 1024 BPE tokens + 1 blank
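For context, greedy CTC decoding turns the "1024 BPE tokens + 1 blank" output into text by taking the argmax per frame, collapsing repeats, and dropping blanks. The library implements this internally; in the sketch below, `tokenToText` stands in for the BPE vocabulary lookup, and placing the blank id last among the 1025 classes is an assumption:

```kotlin
const val BLANK_ID = 1024  // assumed: blank is the last of the 1025 classes

fun greedyCtcDecode(frameArgmax: IntArray, tokenToText: (Int) -> String): String {
    val sb = StringBuilder()
    var prev = -1
    for (id in frameArgmax) {
        // Collapse repeated ids, then drop blanks.
        if (id != prev && id != BLANK_ID) sb.append(tokenToText(id))
        prev = id
    }
    return sb.toString()
}
```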

Audio Requirements

  • Sample rate: 16,000 Hz
  • Channels: Mono
  • Format: Float32
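Android's `AudioRecord` commonly delivers 16-bit PCM, so a conversion to the Float32 range expected by `recognize()` is often needed. A minimal sketch, assuming the input is already 16 kHz mono:

```kotlin
// Scale signed 16-bit PCM samples into the [-1.0, 1.0) Float32 range.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768f }
```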

Model Architecture

Component           Input                 Output                 Size
Feature Extractor   audio (16 kHz)        mel (80, frames)       -
Encoder             mel (1, 80, 2001)     hidden (1, 176, 501)   64 MB
Decoder             hidden (1, 176, 501)  logits (1, 501, 1025)  0.7 MB

Dependencies

  • ONNX Runtime

License

MIT License

Acknowledgments
