# YOLO Object Detection with Voice

A simple, modular project that performs real-time object detection using YOLO and announces detection results via voice (text-to-speech). This repository combines a YOLO object detector (you can use YOLOv5/YOLOv8 or your custom YOLO model) with an audio TTS engine so that detected objects are reported out loud — useful for accessibility, robotics, surveillance, or hands-free monitoring.
This README is intentionally comprehensive so you can copy & paste directly into your repository and get started quickly.
## Table of Contents
- Features
- Repository Structure (Typical)
- Requirements
- Installation
- Model Weights
- Quick Start — Examples
- Configuration & Command-Line Options
- How Voice Reporting Works
- Customization (Classes, Confidence, NMS, Language)
- Performance Tips & Troubleshooting
- Testing & Validation
- Extending & Contributing
- Acknowledgements
- License
- Contact
- Appendix — Example requirements.txt
## Features
- Real-time object detection with YOLO (select YOLOv5 / YOLOv8 / custom)
- Voice announcements for detected objects (supports multiple TTS engines)
- Options: webcam, video file, image(s), output saving (annotated images/videos)
- Configurable thresholds (confidence, NMS), classes to announce, and verbosity
- Cross-platform (Windows / macOS / Linux); GPU acceleration if available
## Repository Structure (Typical)

Note: this project may use different filenames; adjust paths/commands as needed.

```text
README.md            # this file
requirements.txt     # Python deps
yolov_voice.py       # main entrypoint that runs detection + TTS
models/
  weights/           # saved model weights, e.g., yolov5s.pt or best.pt
data/
  classes.txt        # optional: names of classes
examples/
  demo_video.mp4
  demo_image.jpg
outputs/
  annotated/         # saved annotated images/videos
utils/
  tts_backends.py    # wrappers for pyttsx3/gTTS/pico2wave etc.
  detector.py        # YOLO inference wrapper
  draw.py            # annotation helpers
```
If your repository differs, adapt the README's commands to the actual filenames or move/rename files accordingly.
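For orientation, a pluggable TTS wrapper like the `utils/tts_backends.py` listed above might be structured as follows. This is a minimal sketch — the class names, the `make_tts` factory, and the `console` fallback are illustrative, not necessarily the repository's actual API:

```python
# Sketch of a pluggable TTS wrapper (illustrative names, not the repo's real API).

class TTSBackend:
    """Common interface that every text-to-speech backend implements."""
    def say(self, text):
        raise NotImplementedError

class ConsoleTTS(TTSBackend):
    """Fallback that records/prints text -- useful for tests and headless runs."""
    def __init__(self):
        self.spoken = []
    def say(self, text):
        self.spoken.append(text)
        print(f"[TTS] {text}")

class Pyttsx3TTS(TTSBackend):
    """Offline TTS via pyttsx3, imported lazily so the module loads without it."""
    def __init__(self, rate=175):
        import pyttsx3  # deferred import: only needed when this backend is selected
        self.engine = pyttsx3.init()
        self.engine.setProperty('rate', rate)
    def say(self, text):
        self.engine.say(text)
        self.engine.runAndWait()

def make_tts(engine='console'):
    """Factory the main script could use to pick a backend by name."""
    backends = {'console': ConsoleTTS, 'pyttsx3': Pyttsx3TTS}
    if engine not in backends:
        raise ValueError(f"unknown TTS engine: {engine}")
    return backends[engine]()
```

Adding a new backend (gTTS, Windows SAPI, pico2wave) then amounts to one more class and one more entry in the factory dict.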
## Requirements
Minimum:
- Python 3.8+
- pip
Recommended (for GPU):
- CUDA 11.x and compatible NVIDIA drivers (if using PyTorch GPU builds)
- NVIDIA cuDNN
Python packages (example):
- torch (and torchvision) — CPU or GPU build
- numpy
- opencv-python
- pillow
- pyttsx3 or gTTS (for TTS)
- playsound / pydub (optional, for playing TTS audio)
- tqdm
- seaborn (optional, for visualizations)
See Appendix — Example requirements.txt for an example file.
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Akarshak51/YOLO-Object-Detection-Voice.git
   cd YOLO-Object-Detection-Voice
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv .venv
   # Linux / macOS
   source .venv/bin/activate
   # Windows (PowerShell)
   .venv\Scripts\Activate.ps1
   ```

3. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install YOLO model dependencies if using a particular fork. If using Ultralytics YOLOv5, you may need additional requirements (see their repo). Alternatively, install torch and torchvision compatible with your CUDA version:

   ```bash
   # example CPU install (not recommended for speed):
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
   ```

## Model Weights
You can use any YOLO weights compatible with your detector wrapper. Some common choices:

- YOLOv5 small (fast, lower accuracy): https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt
- YOLOv5 medium/large or custom weights: provide your own `best.pt`, `.onnx`, etc.
- YOLOv8 (Ultralytics): use weights from the Ultralytics repo or export your own.

Download example:

```bash
mkdir -p models/weights
# example:
wget -O models/weights/yolov5s.pt https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt
```

Place your weights in `models/weights/` or point the CLI to the path.
## Quick Start — Examples

General note: replace `<entry_point>` or `<weights>` if your repository uses different filenames.

### Run on Webcam (real-time)

```bash
# Example: run yolov_voice.py (main script) using webcam index 0
python yolov_voice.py --source 0 --weights models/weights/yolov5s.pt --voice-engine pyttsx3 --confidence 0.4 --announce-interval 2.0
```

What this does:

- Opens the default webcam
- Runs YOLO inference on frames
- Announces detected object labels via the chosen TTS engine
- `--announce-interval` prevents repeated rapid announcements (seconds)

### Run on a Video File

```bash
python yolov_voice.py --source examples/demo_video.mp4 --weights models/weights/yolov5s.pt --output outputs/annotated/demo_out.mp4 --voice-engine gTTS --language en --save-output
```

### Detect on a Single Image

```bash
python yolov_voice.py --source examples/demo_image.jpg --weights models/weights/yolov5s.pt --save-output --output outputs/annotated/demo_image_out.jpg
```

### Python API Example (programmatic use)

Use a detector class (e.g., `utils/detector.py`) and a TTS wrapper (`utils/tts_backends.py`):

```python
from utils.detector import YOLODetector
from utils.tts_backends import TTS

detector = YOLODetector(weights='models/weights/yolov5s.pt', device='cuda')  # or 'cpu'
tts = TTS(engine='pyttsx3', language='en')

frame = ...  # numpy image from cv2
detections = detector.detect(frame, conf_thres=0.4)
labels = detector.parse_labels(detections)  # e.g., ['person', 'bicycle']
if labels:
    text = ', '.join(labels)
    tts.say(f'I see: {text}')
```

## Configuration & Command-Line Options
Common CLI flags (the actual script may support different names — view `--help`):

- `--source <int | path>` : Webcam index (0) or path to video/image
- `--weights` : Path to model weights (.pt, .onnx, .pth)
- `--device <cpu | 0 | cuda>` : Device to run inference on
- `--img-size` : Inference image size (e.g., 640)
- `--confidence` : Confidence threshold (0.0 – 1.0)
- `--iou` : NMS IoU threshold (e.g., 0.45)
- `--classes <int,int,...>` : Only detect/announce these class indices
- `--class-names` : Optional path to a class names file
- `--voice-engine <pyttsx3|gTTS|pico>` : Choose TTS backend
- `--language <en|fr|...>` : Language for TTS (if supported)
- `--announce-interval` : Minimum seconds between announcements of the same label
- `--save-output` : Save annotated image/video to disk
- `--output` : Path to save output
- `--verbose` : Print more logs for debugging
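As a sketch, a parser covering these flags could be wired up with `argparse` as below. The flag names mirror the list above, but the repository's actual parser may differ — always check `--help`:

```python
# Illustrative argparse setup for the flags documented above (not the repo's
# guaranteed interface -- flag names and defaults are assumptions).
import argparse

def build_parser():
    p = argparse.ArgumentParser(description='YOLO detection with voice announcements')
    p.add_argument('--source', default='0', help='webcam index or path to video/image')
    p.add_argument('--weights', default='models/weights/yolov5s.pt')
    p.add_argument('--device', default='cpu', help="'cpu', 'cuda', or a GPU index")
    p.add_argument('--img-size', type=int, default=640)
    p.add_argument('--confidence', type=float, default=0.4)
    p.add_argument('--iou', type=float, default=0.45)
    p.add_argument('--classes', type=int, nargs='*', help='class indices to keep')
    p.add_argument('--class-names', help='path to a class names file')
    p.add_argument('--voice-engine', default='pyttsx3', choices=['pyttsx3', 'gTTS', 'pico'])
    p.add_argument('--language', default='en')
    p.add_argument('--announce-interval', type=float, default=2.0)
    p.add_argument('--save-output', action='store_true')
    p.add_argument('--output', help='path to save annotated output')
    p.add_argument('--verbose', action='store_true')
    return p

args = build_parser().parse_args(['--source', '0', '--confidence', '0.5', '--classes', '0', '2', '7'])
print(args.confidence, args.classes)  # 0.5 [0, 2, 7]
```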
To see the exact options:

```bash
python yolov_voice.py --help
```

## How Voice Reporting Works
Architecture:

- Inference: frames/images are passed to the YOLO model
- Post-processing: detections are filtered by confidence and NMS
- Label selection: class names are mapped from class IDs
- Announcement logic:
  - Labels are aggregated per frame or per timeframe
  - Optionally deduplicated so the same label isn't repeatedly spoken
  - The TTS engine is invoked to synthesize and play the audio
Supported TTS backends (examples):
- pyttsx3: offline, cross-platform Python TTS (no network required)
- gTTS: Google Text-to-Speech (requires network; can save mp3 then play)
- pico2wave / espeak / festival: system-level utilities on some Linux distros
- Windows SAPI: native on Windows
Choose an engine based on offline/online requirements and platform compatibility.
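The deduplication/throttling step described above can be sketched as a small class that tracks, per label, when it was last spoken. This is illustrative — the repository's actual announcement logic may differ:

```python
# Sketch of per-label announcement throttling: a label is spoken again only
# after `interval` seconds have passed since its last announcement.
import time

class Announcer:
    def __init__(self, speak, interval=2.0):
        self.speak = speak          # callable that speaks text, e.g. tts.say
        self.interval = interval    # minimum seconds between repeats of a label
        self.last_spoken = {}       # label -> timestamp of last announcement

    def announce(self, labels, now=None):
        """Speak the deduplicated labels that are due; return what was spoken."""
        now = time.monotonic() if now is None else now
        due = [l for l in sorted(set(labels))
               if now - self.last_spoken.get(l, float('-inf')) >= self.interval]
        if due:
            for l in due:
                self.last_spoken[l] = now
            self.speak("I see: " + ", ".join(due))
        return due
```

In the main loop you would call `announcer.announce(labels)` once per frame; repeated detections of the same object then produce speech at most once per interval.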
## Customization (Classes, Confidence, NMS, Language)

- Limit announced classes by supplying class indices:

  ```bash
  python yolov_voice.py --classes 0 2 7  # e.g., person, car, truck
  ```

  Or use a names file:

  ```bash
  cat data/classes.txt
  # person
  # bicycle
  # car
  # ...
  python yolov_voice.py --class-names data/classes.txt
  ```

- Reduce noise by raising the confidence threshold: `--confidence 0.5`
- Only announce when a certain count is reached (e.g., announce only if at least 3 persons are detected): use an argument such as `--min-count person=3`, or programmatically check counts in your wrapper.
- Multilingual announcements: with gTTS or other engines, set `--language fr` to produce French speech (ensure the engine supports that language).
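The minimum-count idea can be checked programmatically with `collections.Counter`. Since the `--min-count` flag named above is hypothetical, a wrapper-side check like this is the portable route:

```python
# Keep only labels whose occurrence count reaches a per-label threshold
# (a sketch of the "announce only at a minimum count" customization).
from collections import Counter

def labels_meeting_min_counts(labels, min_counts):
    """Return (sorted) labels whose count reaches their threshold.

    labels: per-frame detections, e.g. ['person', 'person', 'person', 'car']
    min_counts: e.g. {'person': 3}; labels without an entry default to 1.
    """
    counts = Counter(labels)
    return sorted(l for l, c in counts.items() if c >= min_counts.get(l, 1))

print(labels_meeting_min_counts(['person'] * 3 + ['car'], {'person': 3}))
# ['car', 'person']
```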
## Performance Tips & Troubleshooting

### Performance

- Use a smaller model (yolov5s / yolov8n) for real-time inference on CPU.
- For higher FPS, use GPU-enabled PyTorch with CUDA.
- Lower `--img-size` to increase speed at the cost of some accuracy.
- Skip frames: when running on video, process only every Nth frame to reduce CPU/GPU load.
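The every-Nth-frame idea can be sketched independently of any video backend — here `frames` is any iterable of images (e.g. frames read from `cv2.VideoCapture`) and `detect` stands in for your detector call:

```python
# Run the (expensive) detector only on every Nth frame and reuse the previous
# result in between -- skipped frames still get annotated with the last boxes.
def detect_every_nth(frames, detect, every_n=3):
    """Yield (frame, detections), detecting only on frames 0, N, 2N, ..."""
    last = []
    for i, frame in enumerate(frames):
        if i % every_n == 0:       # fresh inference on this frame
            last = detect(frame)
        yield frame, last          # skipped frames reuse the last result

# Demo with fake frames (integers) and a fake detector that logs its calls:
calls = []
fake_detect = lambda f: calls.append(f) or [f'obj@{f}']
out = list(detect_every_nth(range(7), fake_detect, every_n=3))
print(len(calls))  # detector ran on frames 0, 3, 6 -> 3
```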
### Common Issues

- No sound / TTS fails:
  - If using pyttsx3, ensure the correct driver is available (sapi5 on Windows; nsss on macOS; espeak on Linux).
  - If using gTTS, internet access is required.
  - Use `playsound` or a system player to play saved audio files if direct playback fails.
- Model file not found:
  - Verify the `--weights` path and permissions. Download recommended weights and place them in `models/weights/`.
- Slow inference:
  - Ensure the GPU is used if available; install the correct CUDA-enabled torch build.
  - Run `nvidia-smi` to confirm GPU presence and utilization.
- Incompatible torch/CUDA:
  - Install a torch build matching your CUDA version; see the official PyTorch install instructions.
### Debugging

- Use `--verbose` to print detection info and TTS events.
- Save annotated frames to inspect detection boxes visually.
## Testing & Validation

- Unit test examples:
  - Run the detector on a sample image and assert that the number of detections is > 0.
  - Mock the TTS engine to check that `say()` is called with the expected strings.
- Manual validation:
  - Run `python yolov_voice.py --source examples/demo_image.jpg --save-output` and check `outputs/annotated/` for correct boxes and labels.
  - For video/webcam, verify that audio announcements match the on-screen detections.
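The mocked-TTS unit test suggested above might look like this, using `unittest.mock`. The `announce_labels` helper is a hypothetical stand-in for your detection-to-speech glue code:

```python
# Unit-test sketch: mock the TTS engine and assert say() gets the right text.
from unittest.mock import Mock

def announce_labels(labels, tts):
    """Small unit under test: speak deduplicated labels only when any exist."""
    if labels:
        tts.say("I see: " + ", ".join(sorted(set(labels))))

def test_say_called_for_detections():
    tts = Mock()
    announce_labels(['person', 'car', 'person'], tts)
    tts.say.assert_called_once_with("I see: car, person")

def test_say_not_called_when_empty():
    tts = Mock()
    announce_labels([], tts)
    tts.say.assert_not_called()

test_say_called_for_detections()
test_say_not_called_when_empty()
print("ok")
```

Under pytest the two `test_*` functions would be collected automatically; the explicit calls at the bottom just make the sketch runnable as a plain script.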
## Extending & Contributing
Contributions welcome! Suggested improvements:
- Support more TTS backends and language detection
- Add a web UI (Flask/Streamlit) to control detection parameters live
- Allow exporting detections as silent events (webhooks) for home automation
- Add unit tests and CI (GitHub Actions)
How to contribute:
- Fork the repo
- Create a feature branch: `git checkout -b feat/my-feature`
- Commit changes and push
- Open a PR with a clear description of changes and rationale
Please follow the repository's code style and include tests where applicable.
## Acknowledgements
- Ultralytics YOLO (for models & training pipelines) — https://github.com/ultralytics
- pyttsx3, gTTS and other TTS libraries and their maintainers
- OpenCV and PyTorch communities
## License
This project is provided under the MIT License. See LICENSE file for details.
## Contact
Maintainer: Akarshak51
GitHub: https://github.com/Akarshak51/YOLO-Object-Detection-Voice
If you find a bug or want a feature, please open an issue or a pull request.
## Appendix — Example requirements.txt

Below is an example requirements file you can copy to requirements.txt. Adjust versions to match your environment and PyTorch/CUDA compatibility.

```text
# Core
numpy>=1.21
opencv-python>=4.5
Pillow>=8.0

# Torch: install the right wheel for your CUDA version per PyTorch instructions
torch>=1.13
torchvision>=0.14

# TTS and audio
pyttsx3>=2.90      # offline TTS
gTTS>=2.2.3        # online TTS (Google)
playsound>=1.3.0   # simple playback (may vary by OS)
pydub>=0.25.1      # for audio manipulation (requires ffmpeg)

# Utilities
tqdm>=4.64
requests>=2.28
```

Notes:

- On Windows, `playsound` sometimes causes issues; alternatives: `winsound`, `python-sounddevice`, or using `subprocess` to call system media players.
- On Linux, ensure `espeak`/`pico2wave` or appropriate audio backends are installed for offline TTS.
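One possible cross-platform fallback, per the notes above, is to pick a system player by OS and shell out to it. The commands below are common system tools (PowerShell's `Media.SoundPlayer` for WAV on Windows, `afplay` on macOS, `aplay` on Linux), not part of this repository, and availability varies by machine:

```python
# Sketch of OS-dependent audio playback for saved TTS files.
import platform
import subprocess

def playback_command(path):
    """Pick a system player command for an audio file based on the OS."""
    system = platform.system()
    if system == 'Windows':
        # SoundPlayer handles WAV only; use a media player for mp3
        return ['powershell', '-c', f'(New-Object Media.SoundPlayer "{path}").PlaySync()']
    if system == 'Darwin':
        return ['afplay', path]
    return ['aplay', path]  # Linux; swap in 'mpg123' for mp3 files

def play(path):
    """Block until the file has been played."""
    subprocess.run(playback_command(path), check=True)
```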
## Example Minimal Script (yolov_voice.py)

Below is a minimal, self-contained example combining YOLOv5 via PyTorch Hub and pyttsx3 for voice. It loads a YOLOv5 model from torch.hub, runs inference on a single image, and speaks the detected labels. Adapt it as needed.

```python
# yolov_voice_example.py
import cv2
import torch
import pyttsx3

# Initialize TTS
tts = pyttsx3.init()

def announce(text):
    tts.say(text)
    tts.runAndWait()

# Load YOLOv5 from torch.hub (requires internet the first time)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4  # confidence threshold

# Load image and run inference
img = cv2.imread('examples/demo_image.jpg')  # change path as needed
results = model(img)
detections = results.pandas().xyxy[0]  # pandas DataFrame
labels = detections['name'].tolist()

if labels:
    unique_labels = sorted(set(labels))
    announce("I see " + ", ".join(unique_labels))
    print("Detected:", unique_labels)
else:
    announce("I don't see anything I can recognize.")
```

Important: for real-time webcam detection, process frames in a loop (and use throttling/announce intervals) to avoid speaking too often.