A user-friendly speech-to-text transcription tool built on top of faster-whisper, an optimised implementation of OpenAI Whisper for CPU and GPU.
SimpleWhisper was developed as part of the LaCAS Project for INALCO (Institut National des Langues et Civilisations Orientales).
| Feature | Details |
|---|---|
| Broad format support | Any audio (MP3, WAV, OGG, FLAC, M4A, OPUS…) or video (MP4, MKV, AVI, MOV, WebM, WMV…) — passed directly to faster-whisper, no pre-conversion overhead. |
| Model selection | tiny → large-v3 — trade speed for accuracy. |
| Language selection | 58 languages or automatic detection. Detected language and confidence are shown in the log. |
| Translation | Built-in speech-to-English translation via the Whisper translate task. |
| VAD filter | Optional Voice Activity Detection (Silero) — skips silent regions; ~40–60 % faster on long audio with pauses. |
| Quality presets | Fast / Balanced / Accurate — sets beam size and compute precision together. |
| Output formats | Plain text, SRT subtitles, or WebVTT subtitles — file extension auto-updated. |
| Word-level timestamps | Optional per-word start/end times. |
| Initial prompt | Prime the model with domain vocabulary for better accuracy. |
| GPU acceleration | CUDA with one checkbox; precision auto-selected per device. |
| Model caching | Already-loaded models reused across consecutive runs. |
| Auto CPU threads | Uses all available CPU cores automatically. |
| Live log | Every segment streamed to the log panel as it is decoded. |
| Auto-fill output path | Output filename pre-filled from the input path. |
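
To make the "Output formats" row more concrete, here is a minimal sketch of how decoded segments could be rendered as SRT or WebVTT timestamps. It is not taken from the SimpleWhisper source: the function names `format_timestamp` and `to_srt`, and the `(start, end, text)` tuple layout, are assumptions chosen for illustration.

```python
def format_timestamp(seconds: float, fmt: str = "srt") -> str:
    """Format seconds as HH:MM:SS,mmm (SRT) or HH:MM:SS.mmm (WebVTT)."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    # SRT separates milliseconds with a comma, WebVTT with a dot.
    sep = "," if fmt == "srt" else "."
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{sep}{ms:03d}"


def to_srt(segments) -> str:
    """Render an iterable of (start, end, text) tuples as SRT cues.

    The tuple layout is hypothetical; adapt it to the segment objects
    your transcription backend actually returns.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(start)} --> {format_timestamp(end)}\n"
            f"{text.strip()}\n"
        )
    # A blank line separates consecutive cues.
    return "\n".join(blocks)
```

A WebVTT writer would follow the same pattern, with `fmt="vtt"` for the timestamps and a `WEBVTT` header line at the top of the file.
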
| Requirement | Notes |
|---|---|
| Python 3.8+ | |
| ffmpeg | Must be on your PATH. See the installation guide. |
| tkinter | Usually bundled with Python; see table below if missing. |
| CUDA Toolkit | Optional — only needed for GPU mode. See CUDA Toolkit. |
If tkinter is missing:
| Platform | Command |
|---|---|
| macOS | brew install python-tk |
| Linux (Debian/Ubuntu) | sudo apt-get install python3-tk |
| Windows | Shipped with the standard Python installer; see tkdocs if absent. |
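
The requirements above can be checked from Python before launching the tool. The following is a small standalone sketch, not part of SimpleWhisper itself; the function name `check_prerequisites` is hypothetical and the script uses only the standard library.

```python
import shutil
import sys


def check_prerequisites() -> dict:
    """Return a mapping of requirement name -> whether it looks satisfied."""
    try:
        import tkinter  # noqa: F401  (only testing that the import works)
        has_tkinter = True
    except ImportError:
        has_tkinter = False

    return {
        "python>=3.8": sys.version_info >= (3, 8),
        # shutil.which searches the directories listed in PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
        "tkinter": has_tkinter,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

The CUDA Toolkit is deliberately left out of this check, since it is optional and only matters for GPU mode.
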
git clone https://github.com/SeidSmatti/SimpleWhisper.git
cd SimpleWhisper
python -m venv venv
source venv/bin/activate # Windows: venv\Scripts\activate
pip install -r requirements.txt

Pre-built (unsigned) binaries for Linux and Windows are available on the Releases page; no Python installation is required.
Run directly from the source tree:
python src/main.py

Or install as a package and use the console entry point:
pip install .
simplewhisper

- Input — click Browse and select any audio or video file.
- Output — the output path is pre-filled automatically; edit if needed.
- Model size — `base` or `small` for everyday use; `large-v3` for maximum accuracy.
- Language — pick a language or leave Autodetect. The detected language and confidence appear in the log.
- Task — Transcribe (default) keeps the source language; Translate to English produces an English transcript from any source language.
- Output format — Text, SRT, or VTT. The file extension updates automatically when you change this.
- Start — click ▶ Start Transcription and watch live progress in the log.
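
The output-path behaviour in the steps above (pre-filled from the input path, extension following the chosen format) could be sketched roughly as follows. This is an illustration under assumptions, not the project's code: `default_output_path` and the `EXTENSIONS` mapping are hypothetical names, and `.txt` for plain text is assumed.

```python
from pathlib import Path

# Assumed mapping from output format to file extension.
EXTENSIONS = {"text": ".txt", "srt": ".srt", "vtt": ".vtt"}


def default_output_path(input_path: str, output_format: str = "text") -> Path:
    """Derive an output path next to the input file.

    Swaps the input's extension for the one matching the chosen format,
    or appends it when the input has no extension.
    """
    return Path(input_path).with_suffix(EXTENSIONS[output_format])
```

Calling the same function again with a different `output_format` mimics the extension updating when the format selector changes.
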
Open the collapsible ▸ Advanced panel for fine-grained control:
- Quality preset
  - Fast — beam size 1, quantised precision.
  - Balanced — beam size 5 (the faster-whisper default).
  - Accurate — beam size 10, full `float32` precision.
- Initial prompt — optional text hint to bias vocabulary or formatting style (e.g. "Medical terms: stethoscope, aorta…").
- VAD filter (in main options) — off by default. Enable for long recordings with pauses (lectures, interviews) to skip silent regions. For short files or fast models it can add more overhead than it saves.
- Word-level timestamps (in main options) — adds per-word timing to each segment.
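
As a rough illustration of how a quality preset can bundle beam size and compute precision, consider the sketch below. Only the beam sizes and the `float32` precision of the Accurate preset come from the list above; the `int8` and `default` compute types, and the names `Preset`, `PRESETS`, and `resolve_preset`, are assumptions made for the example. Device-specific precision selection is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    beam_size: int
    compute_type: str


PRESETS = {
    # "quantised precision" is assumed to mean int8 here.
    "Fast": Preset(beam_size=1, compute_type="int8"),
    # No precision is documented for Balanced; "default" is an assumption.
    "Balanced": Preset(beam_size=5, compute_type="default"),
    "Accurate": Preset(beam_size=10, compute_type="float32"),
}


def resolve_preset(name: str) -> Preset:
    """Look up a preset by its display name, with a clear error if unknown."""
    try:
        return PRESETS[name]
    except KeyError:
        valid = ", ".join(PRESETS)
        raise ValueError(f"Unknown preset {name!r}; expected one of: {valid}") from None
```

Keeping the two settings in one immutable object means a single selection in the UI cannot leave beam size and precision out of sync.
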
| Scenario | Recommended settings |
|---|---|
| Short clip, any language | base + Balanced preset, VAD off |
| Long lecture / interview with pauses | small or medium + VAD on |
| Best accuracy, known language | large-v3 + Accurate preset, language pinned |
| Subtitles for a video | Any model + SRT or VTT output format |
| Foreign audio → English transcript | Any model + Translate to English task |
| GPU available | Enable Use GPU — float16 is selected automatically |
| Quick draft on CPU | base + Fast preset |
Note on VAD: Voice Activity Detection loads the Silero VAD model and scans the entire audio before decoding. On short files or with the `base` model, this pre-processing can outweigh the savings from skipping silence. Leave it off for quick jobs and enable it for long recordings.
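
For readers who want to reproduce these settings in a script, the sketch below shows how the GUI options might map onto keyword arguments of faster-whisper's `WhisperModel.transcribe`. The helper names `build_transcribe_options` and `transcribe_file` are hypothetical, and this is not SimpleWhisper's actual code. It assumes the `faster-whisper` package is installed, and imports it lazily so the option builder can be used without it.

```python
from typing import Any, Dict, List, Optional, Tuple


def build_transcribe_options(
    language: Optional[str] = None,
    translate: bool = False,
    beam_size: int = 5,
    vad_filter: bool = False,
    word_timestamps: bool = False,
    initial_prompt: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect keyword arguments for WhisperModel.transcribe."""
    options: Dict[str, Any] = {
        "language": language,  # None lets the model detect the language
        "task": "translate" if translate else "transcribe",
        "beam_size": beam_size,
        "vad_filter": vad_filter,
        "word_timestamps": word_timestamps,
    }
    if initial_prompt:
        options["initial_prompt"] = initial_prompt
    return options


def transcribe_file(
    path: str, model_size: str = "base", **options: Any
) -> List[Tuple[float, float, str]]:
    """Run faster-whisper on one file and return (start, end, text) tuples."""
    from faster_whisper import WhisperModel  # requires faster-whisper

    model = WhisperModel(model_size)
    segments, info = model.transcribe(path, **options)
    print(f"Detected language: {info.language} ({info.language_probability:.2f})")
    # segments is a lazy generator; decoding happens while iterating.
    return [(s.start, s.end, s.text) for s in segments]
```

For the "long lecture with pauses" scenario, one might call `transcribe_file("lecture.mp3", "small", **build_transcribe_options(vad_filter=True))`, where the file name is a placeholder.
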
Could not locate cudnn_ops_infer64_8.dll. Please make sure it is in your library path!
Try this fix: place the cuDNN DLL files alongside the executable or in a directory on your PATH.
Ensure ffmpeg is installed and reachable from the terminal:
ffmpeg -version

See the installation guide for platform-specific steps.
| Date | Changes |
|---|---|
| 2026-04-05 | v1.1.0. Redesigned GUI (header, grouped cards, status bar, progress bar, live log, clear button, auto-fill output path, file-type filter). New features: Translate to English task, SRT and VTT subtitle output, quality presets, word-level timestamps, initial prompt, optional VAD filter, language detection info in log. Performance: removed redundant pre-conversion step — files now go directly to faster-whisper; auto CPU-thread detection. Fixes: ffmpeg path on Linux/macOS, setup.py console entry point, thread-safe UI updates. |
| 2024-09-17 | Model caching, responsive threading, safe temp-file handling, enhanced error handling, code modularisation. |
| 2024-07-22 | Added manual language selection. |
SimpleWhisper was initially developed as part of the LaCAS Project for INALCO (Institut National des Langues et Civilisations Orientales). The project aims to make advanced transcription technology accessible to a broad, non-technical audience as part of the collaborative efforts within the LaCAS team to advance areal studies through innovative technological solutions.