# YOLO Object Detection with Voice

A simple, modular project that performs real-time object detection using YOLO and announces detection results via voice (text-to-speech). This repository combines a YOLO object detector (you can use YOLOv5/YOLOv8 or your custom YOLO model) with an audio TTS engine so that detected objects are reported out loud — useful for accessibility, robotics, surveillance, or hands-free monitoring.
This README is intentionally comprehensive so you can copy & paste directly into your repository and get started quickly.
## Table of Contents
- Features
- Repository Structure (Typical)
- Requirements
- Installation
- Model Weights
- Quick Start — Examples
- Configuration & Command-Line Options
- How Voice Reporting Works
- Customization (Classes, Confidence, NMS, Language)
- Performance Tips & Troubleshooting
- Testing & Validation
- Extending & Contributing
- Acknowledgements
- License
- Contact
- Appendix — Example requirements.txt
## Features
- Real-time object detection with YOLO (select YOLOv5 / YOLOv8 / custom)
- Voice announcements for detected objects (supports multiple TTS engines)
- Options: webcam, video file, image(s), output saving (annotated images/videos)
- Configurable thresholds (confidence, NMS), classes to announce, and verbosity
- Cross-platform (Windows / macOS / Linux); GPU acceleration if available
## Repository Structure (Typical)

Note: this project may use different filenames; adjust paths/commands as needed.

```text
README.md            # this file
requirements.txt     # Python deps
yolov_voice.py       # main entrypoint that runs detection + TTS
models/
  weights/           # saved model weights, e.g., yolov5s.pt or best.pt
data/
  classes.txt        # optional: names of classes
examples/
  demo_video.mp4
  demo_image.jpg
outputs/
  annotated/         # saved annotated images/videos
utils/
  tts_backends.py    # wrappers for pyttsx3/gTTS/pico2wave etc.
  detector.py        # YOLO inference wrapper
  draw.py            # annotation helpers
```
If your repository differs, adapt the README's commands to the actual filenames or move/rename files accordingly.
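For orientation, a pluggable TTS wrapper like the `utils/tts_backends.py` listed above might be structured as follows. This is a minimal sketch — the class names, the `make_tts` factory, and the `console` fallback are illustrative, not necessarily the repository's actual API:

```python
# Sketch of a pluggable TTS wrapper (illustrative names, not the repo's real API).

class TTSBackend:
    """Common interface that every text-to-speech backend implements."""
    def say(self, text):
        raise NotImplementedError

class ConsoleTTS(TTSBackend):
    """Fallback that records/prints text -- useful for tests and headless runs."""
    def __init__(self):
        self.spoken = []
    def say(self, text):
        self.spoken.append(text)
        print(f"[TTS] {text}")

class Pyttsx3TTS(TTSBackend):
    """Offline TTS via pyttsx3, imported lazily so the module loads without it."""
    def __init__(self, rate=175):
        import pyttsx3  # deferred import: only needed when this backend is selected
        self.engine = pyttsx3.init()
        self.engine.setProperty('rate', rate)
    def say(self, text):
        self.engine.say(text)
        self.engine.runAndWait()

def make_tts(engine='console'):
    """Factory the main script could use to pick a backend by name."""
    backends = {'console': ConsoleTTS, 'pyttsx3': Pyttsx3TTS}
    if engine not in backends:
        raise ValueError(f"unknown TTS engine: {engine}")
    return backends[engine]()
```

Adding a new backend (gTTS, Windows SAPI, pico2wave) then amounts to one more class and one more entry in the factory dict.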
## Requirements
Minimum:
- Python 3.8+
- pip
Recommended (for GPU):
- CUDA 11.x and compatible NVIDIA drivers (if using PyTorch GPU builds)
- NVIDIA cuDNN
Python packages (example):
- torch (and torchvision) — CPU or GPU build
- numpy
- opencv-python
- pillow
- pyttsx3 or gTTS (for TTS)
- playsound / pydub (optional, for playing TTS audio)
- tqdm
- seaborn (optional, for visualizations)
See Appendix — Example requirements.txt for an example file.
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Akarshak51/YOLO-Object-Detection-Voice.git
   cd YOLO-Object-Detection-Voice
   ```

2. Create and activate a virtual environment (recommended):

   ```bash
   python -m venv .venv
   # Linux / macOS
   source .venv/bin/activate
   # Windows (PowerShell)
   .venv\Scripts\Activate.ps1
   ```

3. Install Python dependencies:

   ```bash
   pip install -r requirements.txt
   ```

4. Install YOLO model dependencies if using a particular fork. If using Ultralytics YOLOv5, you may need additional requirements (see their repo). Alternatively, install torch and torchvision compatible with your CUDA version:

   ```bash
   # example CPU install (not recommended for speed):
   pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
   ```

## Model Weights
You can use any YOLO weights compatible with your detector wrapper. Some common choices:

- YOLOv5 small (fast, lower accuracy): https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt
- YOLOv5 medium/large or custom weights: provide your own `best.pt`, `.onnx`, etc.
- YOLOv8 (Ultralytics): use weights from the Ultralytics repo or export your own.

Download example:

```bash
mkdir -p models/weights
# example:
wget -O models/weights/yolov5s.pt https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt
```

Place your weights in `models/weights/` or point the CLI to the path.
## Quick Start — Examples

General note: replace `<entry_point>` or `<weights>` if your repository uses different filenames.

### Run on Webcam (real-time)

```bash
# Example: run yolov_voice.py (main script) using webcam index 0
python yolov_voice.py --source 0 --weights models/weights/yolov5s.pt --voice-engine pyttsx3 --confidence 0.4 --announce-interval 2.0
```

What this does:

- Opens the default webcam
- Runs YOLO inference on frames
- Announces detected object labels via the chosen TTS engine
- `--announce-interval` prevents repeated rapid announcements (seconds)

### Run on a Video File

```bash
python yolov_voice.py --source examples/demo_video.mp4 --weights models/weights/yolov5s.pt --output outputs/annotated/demo_out.mp4 --voice-engine gTTS --language en --save-output
```

### Detect on a Single Image

```bash
python yolov_voice.py --source examples/demo_image.jpg --weights models/weights/yolov5s.pt --save-output --output outputs/annotated/demo_image_out.jpg
```

### Python API Example (programmatic use)

Use a detector class (e.g., `utils/detector.py`) and a TTS wrapper (`utils/tts_backends.py`):

```python
from utils.detector import YOLODetector
from utils.tts_backends import TTS

detector = YOLODetector(weights='models/weights/yolov5s.pt', device='cuda')  # or 'cpu'
tts = TTS(engine='pyttsx3', language='en')

frame = ...  # numpy image from cv2
detections = detector.detect(frame, conf_thres=0.4)
labels = detector.parse_labels(detections)  # e.g., ['person', 'bicycle']
if labels:
    text = ', '.join(labels)
    tts.say(f'I see: {text}')
```

## Configuration & Command-Line Options
Common CLI flags (the actual script may support different names — view `--help`):

- `--source <int | path>` : Webcam index (0) or path to video/image
- `--weights` : Path to model weights (.pt, .onnx, .pth)
- `--device <cpu | 0 | cuda>` : Device to run inference on
- `--img-size` : Inference image size (e.g., 640)
- `--confidence` : Confidence threshold (0.0 – 1.0)
- `--iou` : NMS IoU threshold (e.g., 0.45)
- `--classes <int,int,...>` : Only detect/announce these class indices
- `--class-names` : Optional path to a class names file
- `--voice-engine <pyttsx3|gTTS|pico>` : Choose TTS backend
- `--language <en|fr|...>` : Language for TTS (if supported)
- `--announce-interval` : Minimum seconds between announcements of the same label
- `--save-output` : Save annotated image/video to disk
- `--output` : Path to save output
- `--verbose` : Print more logs for debugging
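As a sketch, a parser covering these flags could be wired up with `argparse` as below. The flag names mirror the list above, but the repository's actual parser may differ — always check `--help`:

```python
# Illustrative argparse setup for the flags documented above (not the repo's
# guaranteed interface -- flag names and defaults are assumptions).
import argparse

def build_parser():
    p = argparse.ArgumentParser(description='YOLO detection with voice announcements')
    p.add_argument('--source', default='0', help='webcam index or path to video/image')
    p.add_argument('--weights', default='models/weights/yolov5s.pt')
    p.add_argument('--device', default='cpu', help="'cpu', 'cuda', or a GPU index")
    p.add_argument('--img-size', type=int, default=640)
    p.add_argument('--confidence', type=float, default=0.4)
    p.add_argument('--iou', type=float, default=0.45)
    p.add_argument('--classes', type=int, nargs='*', help='class indices to keep')
    p.add_argument('--class-names', help='path to a class names file')
    p.add_argument('--voice-engine', default='pyttsx3', choices=['pyttsx3', 'gTTS', 'pico'])
    p.add_argument('--language', default='en')
    p.add_argument('--announce-interval', type=float, default=2.0)
    p.add_argument('--save-output', action='store_true')
    p.add_argument('--output', help='path to save annotated output')
    p.add_argument('--verbose', action='store_true')
    return p

args = build_parser().parse_args(['--source', '0', '--confidence', '0.5', '--classes', '0', '2', '7'])
print(args.confidence, args.classes)  # 0.5 [0, 2, 7]
```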
To see the exact options:

```bash
python yolov_voice.py --help
```

## How Voice Reporting Works
Architecture:

- Inference: frames/images are passed to the YOLO model
- Post-processing: detections are filtered by confidence and NMS
- Label selection: class names are mapped from class IDs
- Announcement logic:
  - Labels are aggregated per frame or per timeframe
  - Optionally deduplicated so the same label isn't repeatedly spoken
  - The TTS engine is invoked to synthesize and play the audio
Supported TTS backends (examples):
- pyttsx3: offline, cross-platform Python TTS (no network required)
- gTTS: Google Text-to-Speech (requires network; can save mp3 then play)
- pico2wave / espeak / festival: system-level utilities on some Linux distros
- Windows SAPI: native on Windows
Choose an engine based on offline/online requirements and platform compatibility.
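The deduplication/throttling step described above can be sketched as a small class that tracks, per label, when it was last spoken. This is illustrative — the repository's actual announcement logic may differ:

```python
# Sketch of per-label announcement throttling: a label is spoken again only
# after `interval` seconds have passed since its last announcement.
import time

class Announcer:
    def __init__(self, speak, interval=2.0):
        self.speak = speak          # callable that speaks text, e.g. tts.say
        self.interval = interval    # minimum seconds between repeats of a label
        self.last_spoken = {}       # label -> timestamp of last announcement

    def announce(self, labels, now=None):
        """Speak the deduplicated labels that are due; return what was spoken."""
        now = time.monotonic() if now is None else now
        due = [l for l in sorted(set(labels))
               if now - self.last_spoken.get(l, float('-inf')) >= self.interval]
        if due:
            for l in due:
                self.last_spoken[l] = now
            self.speak("I see: " + ", ".join(due))
        return due
```

In the main loop you would call `announcer.announce(labels)` once per frame; repeated detections of the same object then produce speech at most once per interval.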
## Customization (Classes, Confidence, NMS, Language)

- Limit announced classes by supplying class indices:

  ```bash
  python yolov_voice.py --classes 0 2 7  # e.g., person, car, truck
  ```

  Or use a names file:

  ```bash
  cat data/classes.txt
  # person
  # bicycle
  # car
  # ...
  python yolov_voice.py --class-names data/classes.txt
  ```

- Reduce noise by raising the confidence threshold: `--confidence 0.5`
- Only announce when a certain count is reached (e.g., announce only if at least 3 persons are detected): use an argument such as `--min-count person=3`, or programmatically check counts in your wrapper.
- Multilingual announcements: with gTTS or other engines, set `--language fr` to produce French speech (ensure the engine supports that language).
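The minimum-count idea can be checked programmatically with `collections.Counter`. Since the `--min-count` flag named above is hypothetical, a wrapper-side check like this is the portable route:

```python
# Keep only labels whose occurrence count reaches a per-label threshold
# (a sketch of the "announce only at a minimum count" customization).
from collections import Counter

def labels_meeting_min_counts(labels, min_counts):
    """Return (sorted) labels whose count reaches their threshold.

    labels: per-frame detections, e.g. ['person', 'person', 'person', 'car']
    min_counts: e.g. {'person': 3}; labels without an entry default to 1.
    """
    counts = Counter(labels)
    return sorted(l for l, c in counts.items() if c >= min_counts.get(l, 1))

print(labels_meeting_min_counts(['person'] * 3 + ['car'], {'person': 3}))
# ['car', 'person']
```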
## Performance Tips & Troubleshooting

### Performance

- Use a smaller model (yolov5s / yolov8n) for real-time inference on CPU.
- For higher FPS, use GPU-enabled PyTorch with CUDA.
- Lower `--img-size` to increase speed at the cost of some accuracy.
- Skip frames: when running on video, process only every Nth frame to reduce CPU/GPU load.
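The every-Nth-frame idea can be sketched independently of any video backend — here `frames` is any iterable of images (e.g. frames read from `cv2.VideoCapture`) and `detect` stands in for your detector call:

```python
# Run the (expensive) detector only on every Nth frame and reuse the previous
# result in between -- skipped frames still get annotated with the last boxes.
def detect_every_nth(frames, detect, every_n=3):
    """Yield (frame, detections), detecting only on frames 0, N, 2N, ..."""
    last = []
    for i, frame in enumerate(frames):
        if i % every_n == 0:       # fresh inference on this frame
            last = detect(frame)
        yield frame, last          # skipped frames reuse the last result

# Demo with fake frames (integers) and a fake detector that logs its calls:
calls = []
fake_detect = lambda f: calls.append(f) or [f'obj@{f}']
out = list(detect_every_nth(range(7), fake_detect, every_n=3))
print(len(calls))  # detector ran on frames 0, 3, 6 -> 3
```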
### Common Issues

- No sound / TTS fails:
  - If using pyttsx3, ensure the correct driver is available (sapi5 on Windows; nsss on macOS; espeak on Linux).
  - If using gTTS, internet access is required.
  - Use `playsound` or a system player to play saved audio files if direct playback fails.
- Model file not found:
  - Verify the `--weights` path and permissions. Download recommended weights and place them in `models/weights/`.
- Slow inference:
  - Ensure the GPU is used if available; install the correct CUDA-enabled torch build.
  - Run `nvidia-smi` to confirm GPU presence and utilization.
- Incompatible torch/CUDA:
  - Install a torch build matching your CUDA version; see the official PyTorch install instructions.
### Debugging

- Use `--verbose` to print detection info and TTS events.
- Save annotated frames to inspect detection boxes visually.
## Testing & Validation

- Unit test examples:
  - Run the detector on a sample image and assert that the number of detections is > 0.
  - Mock the TTS engine to check that `say()` is called with the expected strings.
- Manual validation:
  - Run `python yolov_voice.py --source examples/demo_image.jpg --save-output` and check `outputs/annotated/` for correct boxes and labels.
  - For video/webcam, verify that audio announcements match the on-screen detections.
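The mocked-TTS unit test suggested above might look like this, using `unittest.mock`. The `announce_labels` helper is a hypothetical stand-in for your detection-to-speech glue code:

```python
# Unit-test sketch: mock the TTS engine and assert say() gets the right text.
from unittest.mock import Mock

def announce_labels(labels, tts):
    """Small unit under test: speak deduplicated labels only when any exist."""
    if labels:
        tts.say("I see: " + ", ".join(sorted(set(labels))))

def test_say_called_for_detections():
    tts = Mock()
    announce_labels(['person', 'car', 'person'], tts)
    tts.say.assert_called_once_with("I see: car, person")

def test_say_not_called_when_empty():
    tts = Mock()
    announce_labels([], tts)
    tts.say.assert_not_called()

test_say_called_for_detections()
test_say_not_called_when_empty()
print("ok")
```

Under pytest the two `test_*` functions would be collected automatically; the explicit calls at the bottom just make the sketch runnable as a plain script.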
## Extending & Contributing
Contributions welcome! Suggested improvements:
- Support more TTS backends and language detection
- Add a web UI (Flask/Streamlit) to control detection parameters live
- Allow exporting detections as silent events (webhooks) for home automation
- Add unit tests and CI (GitHub Actions)
How to contribute:
- Fork the repo
- Create a feature branch: `git checkout -b feat/my-feature`
- Commit changes and push
- Open a PR with a clear description of changes and rationale
Please follow the repository's code style and include tests where applicable.
## Acknowledgements
- Ultralytics YOLO (for models & training pipelines) — https://github.com/ultralytics
- pyttsx3, gTTS and other TTS libraries and their maintainers
- OpenCV and PyTorch communities
## License
This project is provided under the MIT License. See LICENSE file for details.
## Contact
Maintainer: Akarshak51
GitHub: https://github.com/Akarshak51/YOLO-Object-Detection-Voice
If you find a bug or want a feature, please open an issue or a pull request.
## Appendix — Example requirements.txt

Below is an example requirements file you can copy to requirements.txt. Adjust versions to match your environment and PyTorch/CUDA compatibility.

```text
# Core
numpy>=1.21
opencv-python>=4.5
Pillow>=8.0

# Torch: install the right wheel for your CUDA version per PyTorch instructions
torch>=1.13
torchvision>=0.14

# TTS and audio
pyttsx3>=2.90      # offline TTS
gTTS>=2.2.3        # online TTS (Google)
playsound>=1.3.0   # simple playback (may vary by OS)
pydub>=0.25.1      # for audio manipulation (requires ffmpeg)

# Utilities
tqdm>=4.64
requests>=2.28
```

Notes:

- On Windows, `playsound` sometimes causes issues; alternatives: `winsound`, `python-sounddevice`, or using `subprocess` to call system media players.
- On Linux, ensure `espeak`/`pico2wave` or appropriate audio backends are installed for offline TTS.
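One possible cross-platform fallback, per the notes above, is to pick a system player by OS and shell out to it. The commands below are common system tools (PowerShell's `Media.SoundPlayer` for WAV on Windows, `afplay` on macOS, `aplay` on Linux), not part of this repository, and availability varies by machine:

```python
# Sketch of OS-dependent audio playback for saved TTS files.
import platform
import subprocess

def playback_command(path):
    """Pick a system player command for an audio file based on the OS."""
    system = platform.system()
    if system == 'Windows':
        # SoundPlayer handles WAV only; use a media player for mp3
        return ['powershell', '-c', f'(New-Object Media.SoundPlayer "{path}").PlaySync()']
    if system == 'Darwin':
        return ['afplay', path]
    return ['aplay', path]  # Linux; swap in 'mpg123' for mp3 files

def play(path):
    """Block until the file has been played."""
    subprocess.run(playback_command(path), check=True)
```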
## Example Minimal Script (yolov_voice.py)

Below is a minimal, self-contained example combining YOLOv5 via PyTorch Hub and pyttsx3 for voice. It loads a YOLOv5 model from torch.hub, runs inference on a single image, and speaks the detected labels. Adapt it as needed.

```python
# yolov_voice_example.py
import cv2
import torch
import pyttsx3

# Initialize TTS
tts = pyttsx3.init()

def announce(text):
    tts.say(text)
    tts.runAndWait()

# Load YOLOv5 from torch.hub (requires internet the first time)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4  # confidence threshold

# Load image and run inference
img = cv2.imread('examples/demo_image.jpg')  # change path as needed
results = model(img)
detections = results.pandas().xyxy[0]  # pandas DataFrame
labels = detections['name'].tolist()

if labels:
    unique_labels = sorted(set(labels))
    announce("I see " + ", ".join(unique_labels))
    print("Detected:", unique_labels)
else:
    announce("I don't see anything I can recognize.")
```

Important: for real-time webcam detection, process frames in a loop (and use throttling/announce intervals) to avoid speaking too often.