
YOLO-Object-Detection-Voice

License: MIT
Python

A simple, modular project that performs real-time object detection using YOLO and announces detection results via voice (text-to-speech). This repository combines a YOLO object detector (you can use YOLOv5/YOLOv8 or your custom YOLO model) with an audio TTS engine so that detected objects are reported out loud — useful for accessibility, robotics, surveillance, or hands-free monitoring.



Features

  • Real-time object detection with YOLO (select YOLOv5 / YOLOv8 / custom)
  • Voice announcements for detected objects (supports multiple TTS engines)
  • Options: webcam, video file, image(s), output saving (annotated images/videos)
  • Configurable thresholds (confidence, NMS), classes to announce, and verbosity
  • Cross-platform (Windows / macOS / Linux); GPU acceleration if available

Repository Structure (Typical)

Note: this project may use different filenames; adjust paths/commands as needed.

  • README.md (this file)
  • requirements.txt (Python deps)
  • yolov_voice.py (main entrypoint that runs detection + TTS)
  • models/
    • weights/ (saved model weights, e.g., yolov5s.pt or best.pt)
  • data/
    • classes.txt (optional: names of classes)
  • examples/
    • demo_video.mp4
    • demo_image.jpg
  • outputs/
    • annotated/ (saved annotated images/videos)
  • utils/
    • tts_backends.py (wrappers for pyttsx3/gTTS/pico2wave etc.)
    • detector.py (YOLO inference wrapper)
    • draw.py (annotation helpers)

If your repository differs, adapt the README's commands to the actual filenames or move/rename files accordingly.


Requirements

Minimum:

  • Python 3.8+
  • pip

Recommended (for GPU):

  • CUDA 11.x and compatible NVIDIA drivers (if using PyTorch GPU builds)
  • NVIDIA cuDNN

Python packages (example):

  • torch (and torchvision) — CPU or GPU build
  • numpy
  • opencv-python
  • pillow
  • pyttsx3 or gTTS (for TTS)
  • playsound / pydub (optional, for playing TTS audio)
  • tqdm
  • seaborn (optional, for visualizations)

See Appendix — Example requirements.txt for an example file.


Installation

  1. Clone the repository

git clone https://github.com/Akarshak51/YOLO-Object-Detection-Voice.git
cd YOLO-Object-Detection-Voice

  2. Create and activate a virtual environment (recommended)

python -m venv .venv
# Linux / macOS
source .venv/bin/activate
# Windows (PowerShell)
.venv\Scripts\Activate.ps1

  3. Install Python dependencies

pip install -r requirements.txt

  4. Install YOLO model dependencies if using a particular fork (example for YOLOv5)

  • If using Ultralytics YOLOv5, you may need additional requirements (see their repo). Alternatively, install torch and torchvision builds compatible with your CUDA version:

# example CPU install (not recommended for speed):
pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu

Model Weights

You can use any YOLO weights compatible with your detector wrapper, for example a pretrained checkpoint such as yolov5s.pt or your own trained best.pt.

Download example:

mkdir -p models/weights
# example:
wget -O models/weights/yolov5s.pt https://github.com/ultralytics/yolov5/releases/download/v6.1/yolov5s.pt

Place your weights in models/weights/ or point the CLI to the path.


Quick Start — Examples

General note: replace <entry_point> or <weights> if your repository uses different filenames.

  1. Run on Webcam (real-time)

# Example: run yolov_voice.py (main script) using webcam index 0
python yolov_voice.py --source 0 --weights models/weights/yolov5s.pt --voice-engine pyttsx3 --confidence 0.4 --announce-interval 2.0

What this does:

  • Opens the default webcam
  • Runs YOLO inference on frames
  • Announces detected object labels via the chosen TTS engine
  • --announce-interval prevents repeated rapid announcements (seconds)

  2. Run on a Video File

python yolov_voice.py --source examples/demo_video.mp4 --weights models/weights/yolov5s.pt --output outputs/annotated/demo_out.mp4 --voice-engine gTTS --language en --save-output

  3. Detect on a Single Image

python yolov_voice.py --source examples/demo_image.jpg --weights models/weights/yolov5s.pt --save-output --output outputs/annotated/demo_image_out.jpg

  4. Python API Example (programmatic use)

Use a Detector class (e.g., utils/detector.py) and a TTS wrapper (utils/tts_backends.py).

Example script:

from utils.detector import YOLODetector
from utils.tts_backends import TTS

detector = YOLODetector(weights='models/weights/yolov5s.pt', device='cuda')  # or 'cpu'
tts = TTS(engine='pyttsx3', language='en')

frame = ...  # numpy image from cv2
detections = detector.detect(frame, conf_thres=0.4)
labels = detector.parse_labels(detections)  # e.g., ['person', 'bicycle']
if labels:
    text = ', '.join(labels)
    tts.say(f'I see: {text}')

Configuration & Command-Line Options

Common CLI flags (the actual script may support different names — view --help):

  • --source <int | path> : Webcam index (0) or path to video/image
  • --weights : Path to model weights (.pt, .onnx, .pth)
  • --device <cpu | 0 | cuda> : Device to run inference on
  • --img-size : Inference image size (e.g., 640)
  • --confidence : Confidence threshold (0.0 - 1.0)
  • --iou : NMS IoU threshold (e.g., 0.45)
  • --classes <int,int,...> : Only detect/announce these class indices
  • --class-names : Optional path to class names file
  • --voice-engine <pyttsx3|gTTS|pico> : Choose TTS backend
  • --language <en|fr|...> : Language for TTS (if supported)
  • --announce-interval : Minimum seconds between announcements for same label
  • --save-output : Save annotated image/video to disk
  • --output : Path to save output
  • --verbose : Print more logs for debugging

To see exact options:

python yolov_voice.py --help

How Voice Reporting Works

Architecture:

  1. Inference: Frames/images are passed to the YOLO model
  2. Post-processing: Detections are filtered by confidence and NMS
  3. Label selection: Class names are mapped from class IDs
  4. Announcement logic:
    • Labels are aggregated per frame or per timeframe
    • Optionally deduplicated so the same label isn't repeatedly spoken
    • TTS engine is invoked to synthesize and play the audio
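The aggregation, deduplication, and throttling in step 4 can be sketched as follows. The Announcer class and its API are illustrative, not the repository's actual code:

```python
import time

class Announcer:
    """Deduplicates spoken labels and rate-limits repeats per label."""

    def __init__(self, tts_say, interval=2.0):
        self.tts_say = tts_say      # callable that speaks a string
        self.interval = interval    # min seconds between repeats of a label
        self.last_spoken = {}       # label -> timestamp of last announcement

    def announce(self, labels):
        """Speak any labels not announced within the interval; return them."""
        now = time.monotonic()
        fresh = [label for label in sorted(set(labels))
                 if now - self.last_spoken.get(label, float('-inf')) >= self.interval]
        if fresh:
            for label in fresh:
                self.last_spoken[label] = now
            self.tts_say("I see: " + ", ".join(fresh))
        return fresh
```

In a real-time loop you would call announce() once per processed frame; labels seen again within the interval are silently skipped.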

Supported TTS backends (examples):

  • pyttsx3: offline, cross-platform Python TTS (no network required)
  • gTTS: Google Text-to-Speech (requires network; can save mp3 then play)
  • pico2wave / espeak / festival: system-level utilities on some Linux distros
  • Windows SAPI: native on Windows

Choose an engine based on offline/online requirements and platform compatibility.
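A hedged sketch of how a backend-selection wrapper such as utils/tts_backends.py might look; the engine names mirror the --voice-engine flag, and the "print" fallback is an addition here for testing without audio:

```python
def make_speaker(engine="pyttsx3", language="en"):
    """Return a say(text) callable for the chosen TTS backend."""
    if engine == "pyttsx3":
        import pyttsx3                      # offline, cross-platform
        tts = pyttsx3.init()
        def say(text):
            tts.say(text)
            tts.runAndWait()
        return say
    if engine == "gTTS":
        from gtts import gTTS               # online; needs network
        from playsound import playsound
        import os, tempfile
        def say(text):
            fd, path = tempfile.mkstemp(suffix=".mp3")
            os.close(fd)
            gTTS(text=text, lang=language).save(path)
            playsound(path)
            os.remove(path)
        return say
    if engine == "print":                   # silent fallback, handy for tests
        return print
    raise ValueError(f"unknown TTS engine: {engine}")
```

Lazy imports inside each branch mean you only need the packages for the backend you actually select.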


Customization (Classes, Confidence, NMS, Language)

  • Limit announced classes: Create a list or supply class indices:

    python yolov_voice.py --classes 0 2 7  # e.g., person, car, truck

    Or use a names file:

    cat data/classes.txt
    person
    bicycle
    car
    ...
    python yolov_voice.py --class-names data/classes.txt
    
  • Reduce noise using confidence threshold:

    --confidence 0.5
  • Announce only when a count threshold is reached: for example, announce only if at least 3 persons are detected. Use an argument such as --min-count person=3 (if your script supports it) or check counts programmatically in your wrapper.

  • Multilingual announcements: With gTTS or other engines, set --language fr to produce French speech (ensure the engine supports that language).
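The count-threshold idea above can be sketched with collections.Counter; the min_counts mapping is a hypothetical config mirroring the suggested --min-count flag:

```python
from collections import Counter

def labels_to_announce(labels, min_counts):
    """Return phrases for labels whose count meets their minimum.

    min_counts maps label -> required count (e.g. {"person": 3});
    labels without an entry default to a minimum of 1.
    """
    counts = Counter(labels)
    phrases = []
    for label in sorted(counts):
        if counts[label] >= min_counts.get(label, 1):
            phrases.append(f"{counts[label]} {label}")
    return phrases
```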


Performance Tips & Troubleshooting

Performance

  • Use a smaller model (yolov5s / yolov8n) for real-time on CPU.
  • For higher FPS, use GPU-enabled PyTorch with CUDA.
  • Lower img-size to increase speed but reduce accuracy.
  • Frame skipping: when running on video, process only every Nth frame to reduce CPU/GPU load.
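Frame skipping can be sketched as a small generator; the helper name is illustrative and works with any frame source (e.g., frames read from cv2.VideoCapture):

```python
def every_nth(frames, n=3):
    """Yield every n-th item from an iterable of frames (0, n, 2n, ...).

    Running inference only on the yielded frames cuts load roughly
    by a factor of n, at the cost of some detection latency.
    """
    for i, frame in enumerate(frames):
        if i % n == 0:
            yield frame
```

For display, you can reuse the boxes from the last processed frame on the skipped frames so the video stays smooth.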

Common Issues

  • No sound / TTS fails:

    • If using pyttsx3: ensure the correct driver is available (sapi5 on Windows; nsss on macOS; espeak on Linux).
    • If using gTTS: requires internet access.
    • Use playsound or system players to play saved audio files if direct playback fails.
  • Model file not found:

    • Verify --weights path and permissions. Download recommended weights and place into models/weights/.
  • Slow inference:

    • Ensure GPU is used if available; install correct CUDA-enabled torch.
    • Run nvidia-smi to confirm GPU presence and utilization.
  • Incompatible torch/cuda:

    • Install torch matching your CUDA version. See PyTorch official install instructions.

Debugging

  • Use --verbose to print detection info and TTS events.
  • Save annotated frames to inspect detection boxes visually.

Testing & Validation

  • Unit test examples:

    • Test detector on a sample image and assert detections length > 0.
    • Mock TTS engine to check that say() is called for expected strings.
  • Manual validation:

    • Run python yolov_voice.py --source examples/demo_image.jpg --save-output and check outputs/annotated/ for correct boxes and labels.
    • For video/webcam, verify audio announcements match on-screen detections.
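The unit-test ideas above can be sketched with unittest.mock; announce_labels here is an illustrative stand-in for the repository's announcement step, and the mocked say() verifies the expected phrase without producing audio:

```python
from unittest.mock import Mock

def announce_labels(labels, say):
    """The step under test: speak unique labels, if any."""
    unique = sorted(set(labels))
    if unique:
        say("I see: " + ", ".join(unique))
    return unique

def test_say_called_for_detections():
    say = Mock()
    announce_labels(["person", "person", "bicycle"], say)
    say.assert_called_once_with("I see: bicycle, person")

def test_say_not_called_when_empty():
    say = Mock()
    announce_labels([], say)
    say.assert_not_called()
```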

Extending & Contributing

Contributions welcome! Suggested improvements:

  • Support more TTS backends and language detection
  • Add a web UI (Flask/Streamlit) to control detection parameters live
  • Allow exporting silent events (webhooks) for home automation
  • Add unit tests and CI (GitHub Actions)

How to contribute:

  1. Fork the repo
  2. Create a feature branch: git checkout -b feat/my-feature
  3. Commit changes and push
  4. Open a PR with a clear description of changes and rationale

Please follow the repository's code style and include tests where applicable.


Acknowledgements

  • Ultralytics YOLO (for models & training pipelines) — https://github.com/ultralytics
  • pyttsx3, gTTS and other TTS libraries and their maintainers
  • OpenCV and PyTorch communities

License

This project is provided under the MIT License. See LICENSE file for details.


Contact

Maintainer: Akarshak51
GitHub: https://github.com/Akarshak51/YOLO-Object-Detection-Voice

If you find a bug or want a feature, please open an issue or a pull request.


Appendix — Example requirements.txt

Below is an example requirements file you can copy to requirements.txt. Adjust versions to match your environment and PyTorch/CUDA compatibility.

# Core
numpy>=1.21
opencv-python>=4.5
Pillow>=8.0

# Torch: install the right wheel for your CUDA version per PyTorch instructions
torch>=1.13
torchvision>=0.14

# TTS and audio
pyttsx3>=2.90          # offline TTS
gTTS>=2.2.3            # online TTS (google)
playsound>=1.3.0       # simple playback (may vary by OS)
pydub>=0.25.1          # for audio manipulation (requires ffmpeg)

# Utilities
tqdm>=4.64
requests>=2.28

Notes:

  • For Windows, playsound sometimes causes issues; alternatives: winsound, python-sounddevice, or using subprocess to call system media players.
  • For Linux, ensure espeak/pico2wave or appropriate audio backends are installed for offline TTS.

Example Minimal Script (yolov_voice_example.py)

Below is a minimal, self-contained example combining YOLOv5 via PyTorch Hub and pyttsx3 for voice. It demonstrates how to load a YOLOv5 model from torch.hub, run inference on a single image, and speak the detected labels. Adapt it as needed.

# yolov_voice_example.py
import cv2
import torch
import pyttsx3

# Initialize TTS
tts = pyttsx3.init()

def announce(text):
    tts.say(text)
    tts.runAndWait()

# Load YOLOv5 from torch.hub (requires internet the first time)
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
model.conf = 0.4  # confidence threshold

# Load image (cv2 reads BGR; convert to RGB for the hub model)
img = cv2.imread('examples/demo_image.jpg')  # change path
if img is None:
    raise FileNotFoundError('examples/demo_image.jpg not found')
results = model(cv2.cvtColor(img, cv2.COLOR_BGR2RGB))  # inference
detections = results.pandas().xyxy[0]  # pandas DataFrame

labels = detections['name'].tolist()
if labels:
    unique_labels = sorted(set(labels))
    announce("I see " + ", ".join(unique_labels))
    print("Detected:", unique_labels)
else:
    announce("I don't see anything I can recognize.")

Important: For real-time webcam detection, you should process frames in a loop (and use throttling/announce intervals) to avoid speaking too often.
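A hedged sketch of such a loop, building on the torch.hub example above; the due() helper, window handling, and parameter names are illustrative:

```python
import time

def due(last_time, now, interval):
    """True when at least `interval` seconds have passed since last_time."""
    return now - last_time >= interval

def run_webcam(interval=2.0, conf=0.4):
    """Real-time loop: detect, draw, and speak at most once per interval."""
    import cv2, torch, pyttsx3  # heavy deps imported only when actually run
    tts = pyttsx3.init()
    model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)
    model.conf = conf
    cap = cv2.VideoCapture(0)
    last = 0.0
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = model(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        labels = sorted(set(results.pandas().xyxy[0]['name']))
        now = time.monotonic()
        if labels and due(last, now, interval):
            tts.say("I see: " + ", ".join(labels))
            tts.runAndWait()
            last = now
        annotated = cv2.cvtColor(results.render()[0], cv2.COLOR_RGB2BGR)
        cv2.imshow('YOLO + voice', annotated)
        if cv2.waitKey(1) & 0xFF == ord('q'):  # press q to quit
            break
    cap.release()
    cv2.destroyAllWindows()

# run_webcam()  # requires a webcam, model download, and audio output
```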



About

Voice-Enabled Object Detection for Visually Impaired - Real-time vision system with YOLO, FastAPI, and TTS narration
