A Swift library for speech recognition using the NVIDIA NeMo Conformer CTC model on iOS/macOS with CoreML.
- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- VAD-based smart segmentation for long audio (powered by NeMoVAD-iOS)
- Returns both full text and timestamped segments
- Automatic audio padding for any duration
- Support for 5, 10, 15, and 20 second audio segments
- Pure Swift implementation with CoreML backend
- iOS 16.0+ / macOS 13.0+
- Xcode 15.0+
- Swift 5.9+
Add the following to your Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
]
```

Note: Version 1.1.0+ includes VAD-based segmentation with timestamped results. For the previous API returning plain text, use version 1.0.0.

Or in Xcode: File → Add Package Dependencies → Enter repository URL.
Download the CoreML models from Google Drive:
The archive contains:
- `conformer_encoder.mlmodelc` - Conformer encoder (30 MB)
- `conformer_decoder.mlmodelc` - CTC decoder (0.4 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)
```swift
import NeMoConformerASR

// Initialize with model paths
let asr = try NeMoConformerASR(
    encoderURL: Bundle.main.url(forResource: "conformer_encoder", withExtension: "mlmodelc")!,
    decoderURL: Bundle.main.url(forResource: "conformer_decoder", withExtension: "mlmodelc")!,
    vocabularyURL: Bundle.main.url(forResource: "vocabulary", withExtension: "json")!,
    computeUnits: .all // .cpuAndGPU, .cpuOnly, .cpuAndNeuralEngine
)

// Recognize speech (samples must be 16kHz mono Float32)
let result = try asr.recognize(samples: audioSamples)

// Full recognized text
print(result.text)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
}

// Audio duration
print("Duration: \(result.audioDuration)s")
```

```swift
public struct ASRResult {
    let text: String           // Full recognized text
    let segments: [ASRSegment] // Timestamped segments
    let audioDuration: Double  // Total audio duration in seconds
}

public struct ASRSegment {
    let start: Double // Start time in seconds
    let end: Double   // End time in seconds
    let text: String  // Recognized text for this segment
}
```

```swift
// Get encoder embeddings for downstream tasks
let encoded = try asr.encode(samples: audioSamples)
// Returns MLMultiArray with shape [1, 176, encodedFrames]
```

The model supports the following input sizes (audio is automatically padded):
| Duration | Samples | Mel Frames | Encoded Frames |
|---|---|---|---|
| 5 sec | 80,000 | 501 | 126 |
| 10 sec | 160,000 | 1,001 | 251 |
| 15 sec | 240,000 | 1,501 | 376 |
| 20 sec | 320,000 | 2,001 | 501 |
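The padding behavior in the table above can be sketched as a small helper. This is an illustrative reconstruction, not the library's API (the library pads internally); `padToSupportedDuration` is a hypothetical name:

```swift
// Supported segment durations in seconds (see table above).
let supportedDurations: [Double] = [5, 10, 15, 20]
let sampleRate = 16_000

// Pick the smallest supported bucket that fits the audio and
// zero-pad the samples up to that bucket's sample count.
func padToSupportedDuration(_ samples: [Float]) -> [Float]? {
    let duration = Double(samples.count) / Double(sampleRate)
    guard let bucket = supportedDurations.first(where: { duration <= $0 }) else {
        return nil // longer than 20 s: handled by VAD segmentation instead
    }
    let targetCount = Int(bucket) * sampleRate
    return samples + [Float](repeating: 0, count: targetCount - samples.count)
}
```

For example, 6.25 s of audio (100,000 samples) lands in the 10-second bucket and is padded to 160,000 samples.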
For audio longer than 20 seconds, the library uses VAD (Voice Activity Detection) for intelligent segmentation:
- VAD Analysis: Detects speech vs silence regions
- Smart Merging: Merges speech segments with gaps < 0.3s
- Splitting: Splits segments longer than 20s into equal parts
- Filtering: Ignores segments shorter than 0.5s
- Recognition: Processes each segment independently
This approach provides accurate timestamps and avoids cutting words in the middle.
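The merge/split/filter policy above can be sketched on plain (start, end) intervals. This is an illustrative reconstruction of the described thresholds, not the library's actual implementation; `SpeechSegment` and `refineSegments` are hypothetical names:

```swift
import Foundation

struct SpeechSegment { var start: Double; var end: Double }

// Post-process raw VAD speech regions per the policy above:
// merge gaps < 0.3 s, split segments > 20 s into equal parts,
// drop segments shorter than 0.5 s.
func refineSegments(_ raw: [SpeechSegment]) -> [SpeechSegment] {
    // 1. Merge neighbors separated by less than 0.3 s of silence
    var merged: [SpeechSegment] = []
    for seg in raw {
        if var last = merged.last, seg.start - last.end < 0.3 {
            last.end = seg.end
            merged[merged.count - 1] = last
        } else {
            merged.append(seg)
        }
    }
    // 2. Split anything longer than 20 s into equal parts
    var split: [SpeechSegment] = []
    for seg in merged {
        let length = seg.end - seg.start
        let parts = max(1, Int(ceil(length / 20.0)))
        let step = length / Double(parts)
        for i in 0..<parts {
            split.append(SpeechSegment(start: seg.start + Double(i) * step,
                                       end: seg.start + Double(i + 1) * step))
        }
    }
    // 3. Filter out segments shorter than 0.5 s
    return split.filter { $0.end - $0.start >= 0.5 }
}
```

Splitting into equal parts (rather than fixed 20 s chunks) keeps each piece well under the model's maximum, so a 45 s speech region becomes three 15 s segments rather than 20 + 20 + 5.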
The repository includes a complete example app with audio recording and file import.
1. Open `ConformerExample/ConformerExample.xcodeproj` in Xcode
2. Add NeMoConformerASR as a local package:
   - File → Add Package Dependencies
   - Click "Add Local..."
   - Select the `NeMoConformerASR-iOS` folder
3. Download and add models to the project:
   - Download models from the link above
   - Unzip the archive
   - Drag `conformer_encoder.mlmodelc`, `conformer_decoder.mlmodelc`, and `vocabulary.json` into the `ConformerExample/Resources` folder in Xcode
   - Make sure "Copy items if needed" is checked
   - Verify files are added to "Copy Bundle Resources" in Build Phases
4. Build and run on device or simulator
- Record Audio: Tap to record from microphone, automatically converts to 16kHz mono
- Import Audio: Import any audio file (mp3, wav, m4a, etc.), automatically converts format
- Results: Shows recognized text, audio duration, and processing time
- Segments View: Displays individual speech segments with timestamps for long audio
- Model: nvidia/stt_en_conformer_ctc_small
- Parameters: 13.15M
- Architecture: Conformer encoder (16 layers) + CTC decoder
- Hidden dim: 176
- Attention heads: 4
- Vocabulary: 1024 BPE tokens + 1 blank
- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
The example app handles conversion from any audio format automatically.
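If you feed the library yourself rather than going through the example app, the samples must already be 16 kHz mono Float32. A minimal sketch of one step of that pipeline (downmixing interleaved stereo Int16 PCM to mono Float32); the function name is illustrative, and resampling to 16 kHz would still need to happen separately, e.g. via AVAudioConverter:

```swift
// Downmix interleaved stereo Int16 PCM to mono Float32 in [-1, 1].
func stereoInt16ToMonoFloat32(_ interleaved: [Int16]) -> [Float] {
    var mono = [Float]()
    mono.reserveCapacity(interleaved.count / 2)
    for i in stride(from: 0, to: interleaved.count - 1, by: 2) {
        let left = Float(interleaved[i]) / 32768.0
        let right = Float(interleaved[i + 1]) / 32768.0
        mono.append((left + right) / 2) // average the two channels
    }
    return mono
}
```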
- NeMoFeatureExtractor-iOS - Mel spectrogram extraction
- NeMoVAD-iOS - Voice Activity Detection for smart segmentation
MIT License
- NVIDIA NeMo - Original model and training