A Swift library for speech recognition using the NVIDIA NeMo Conformer CTC model on iOS/macOS with CoreML.
- NVIDIA NeMo Conformer CTC Small model (13M parameters)
- VAD-based smart segmentation for long audio (powered by NeMoVAD-iOS)
- Returns both full text and timestamped segments
- Automatic audio padding for any duration
- Support for 5, 10, 15, and 20 second audio segments
- Pure Swift implementation with CoreML backend
- iOS 16.0+ / macOS 13.0+
- Xcode 15.0+
- Swift 5.9+
Add the following to your Package.swift:

```swift
dependencies: [
    .package(url: "https://github.com/Otosaku/NeMoConformerASR-iOS.git", from: "1.1.0")
]
```

Note: Version 1.1.0+ includes VAD-based segmentation with timestamped results. For the previous API returning plain text, use version 1.0.0.

Or in Xcode: File → Add Package Dependencies → Enter repository URL.
Download the CoreML models from Google Drive:
The archive contains:
- `conformer_encoder.mlmodelc` - Conformer encoder (30 MB)
- `conformer_decoder.mlmodelc` - CTC decoder (0.4 MB)
- `vocabulary.json` - BPE vocabulary (1024 tokens)
```swift
import NeMoConformerASR

// Initialize with model paths
let asr = try NeMoConformerASR(
    encoderURL: Bundle.main.url(forResource: "conformer_encoder", withExtension: "mlmodelc")!,
    decoderURL: Bundle.main.url(forResource: "conformer_decoder", withExtension: "mlmodelc")!,
    vocabularyURL: Bundle.main.url(forResource: "vocabulary", withExtension: "json")!,
    computeUnits: .all // .cpuAndGPU, .cpuOnly, .cpuAndNeuralEngine
)

// Recognize speech (samples must be 16kHz mono Float32)
let result = try asr.recognize(samples: audioSamples)

// Full recognized text
print(result.text)

// Individual segments with timestamps
for segment in result.segments {
    print("[\(segment.start)s - \(segment.end)s]: \(segment.text)")
}

// Audio duration
print("Duration: \(result.audioDuration)s")
```

```swift
public struct ASRResult {
    let text: String           // Full recognized text
    let segments: [ASRSegment] // Timestamped segments
    let audioDuration: Double  // Total audio duration in seconds
}

public struct ASRSegment {
    let start: Double // Start time in seconds
    let end: Double   // End time in seconds
    let text: String  // Recognized text for this segment
}
```

```swift
// Get encoder embeddings for downstream tasks
let encoded = try asr.encode(samples: audioSamples)
// Returns MLMultiArray with shape [1, 176, encodedFrames]
```

The model supports the following input sizes (audio is automatically padded):
| Duration | Samples | Mel Frames | Encoded Frames |
|---|---|---|---|
| 5 sec | 80,000 | 501 | 126 |
| 10 sec | 160,000 | 1,001 | 251 |
| 15 sec | 240,000 | 1,501 | 376 |
| 20 sec | 320,000 | 2,001 | 501 |
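The padding behavior in the table above can be sketched as a small helper. This is an illustrative reconstruction, not the library's API (the library pads internally); `padToSupportedDuration` is a hypothetical name:

```swift
// Supported segment durations in seconds (see table above).
let supportedDurations: [Double] = [5, 10, 15, 20]
let sampleRate = 16_000

// Pick the smallest supported bucket that fits the audio and
// zero-pad the samples up to that bucket's sample count.
func padToSupportedDuration(_ samples: [Float]) -> [Float]? {
    let duration = Double(samples.count) / Double(sampleRate)
    guard let bucket = supportedDurations.first(where: { duration <= $0 }) else {
        return nil // longer than 20 s: handled by VAD segmentation instead
    }
    let targetCount = Int(bucket) * sampleRate
    return samples + [Float](repeating: 0, count: targetCount - samples.count)
}
```

For example, 6.25 s of audio (100,000 samples) lands in the 10-second bucket and is padded to 160,000 samples.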
For audio longer than 20 seconds, the library uses VAD (Voice Activity Detection) for intelligent segmentation:
- VAD Analysis: Detects speech vs silence regions
- Smart Merging: Merges speech segments with gaps < 0.3s
- Splitting: Splits segments longer than 20s into equal parts
- Filtering: Ignores segments shorter than 0.5s
- Recognition: Processes each segment independently
This approach provides accurate timestamps and avoids cutting words in the middle.
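The merge/split/filter policy above can be sketched on plain (start, end) intervals. This is an illustrative reconstruction of the described thresholds, not the library's actual implementation; `SpeechSegment` and `refineSegments` are hypothetical names:

```swift
import Foundation

struct SpeechSegment { var start: Double; var end: Double }

// Post-process raw VAD speech regions per the policy above:
// merge gaps < 0.3 s, split segments > 20 s into equal parts,
// drop segments shorter than 0.5 s.
func refineSegments(_ raw: [SpeechSegment]) -> [SpeechSegment] {
    // 1. Merge neighbors separated by less than 0.3 s of silence
    var merged: [SpeechSegment] = []
    for seg in raw {
        if var last = merged.last, seg.start - last.end < 0.3 {
            last.end = seg.end
            merged[merged.count - 1] = last
        } else {
            merged.append(seg)
        }
    }
    // 2. Split anything longer than 20 s into equal parts
    var split: [SpeechSegment] = []
    for seg in merged {
        let length = seg.end - seg.start
        let parts = max(1, Int(ceil(length / 20.0)))
        let step = length / Double(parts)
        for i in 0..<parts {
            split.append(SpeechSegment(start: seg.start + Double(i) * step,
                                       end: seg.start + Double(i + 1) * step))
        }
    }
    // 3. Filter out segments shorter than 0.5 s
    return split.filter { $0.end - $0.start >= 0.5 }
}
```

Splitting into equal parts (rather than fixed 20 s chunks) keeps each piece well under the model's maximum, so a 45 s speech region becomes three 15 s segments rather than 20 + 20 + 5.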
The repository includes a complete example app with audio recording and file import.
1. Open `ConformerExample/ConformerExample.xcodeproj` in Xcode
2. Add NeMoConformerASR as a local package:
   - File → Add Package Dependencies
   - Click "Add Local..."
   - Select the `NeMoConformerASR-iOS` folder
3. Download and add models to the project:
   - Download models from the link above
   - Unzip the archive
   - Drag `conformer_encoder.mlmodelc`, `conformer_decoder.mlmodelc`, and `vocabulary.json` into the `ConformerExample/Resources` folder in Xcode
   - Make sure "Copy items if needed" is checked
   - Verify files are added to "Copy Bundle Resources" in Build Phases
4. Build and run on device or simulator
- Record Audio: Tap to record from microphone, automatically converts to 16kHz mono
- Import Audio: Import any audio file (mp3, wav, m4a, etc.), automatically converts format
- Results: Shows recognized text, audio duration, and processing time
- Segments View: Displays individual speech segments with timestamps for long audio
- Model: nvidia/stt_en_conformer_ctc_small
- Parameters: 13.15M
- Architecture: Conformer encoder (16 layers) + CTC decoder
- Hidden dim: 176
- Attention heads: 4
- Vocabulary: 1024 BPE tokens + 1 blank
- Sample rate: 16,000 Hz
- Channels: Mono
- Format: Float32
The example app handles conversion from any audio format automatically.
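If you feed the library yourself rather than going through the example app, the samples must already be 16 kHz mono Float32. A minimal sketch of one step of that pipeline (downmixing interleaved stereo Int16 PCM to mono Float32); the function name is illustrative, and resampling to 16 kHz would still need to happen separately, e.g. via AVAudioConverter:

```swift
// Downmix interleaved stereo Int16 PCM to mono Float32 in [-1, 1].
func stereoInt16ToMonoFloat32(_ interleaved: [Int16]) -> [Float] {
    var mono = [Float]()
    mono.reserveCapacity(interleaved.count / 2)
    for i in stride(from: 0, to: interleaved.count - 1, by: 2) {
        let left = Float(interleaved[i]) / 32768.0
        let right = Float(interleaved[i + 1]) / 32768.0
        mono.append((left + right) / 2) // average the two channels
    }
    return mono
}
```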
- NeMoFeatureExtractor-iOS - Mel spectrogram extraction
- NeMoVAD-iOS - Voice Activity Detection for smart segmentation
MIT License
- NVIDIA NeMo - Original model and training