Quality benchmarks between auditok / webrtcvad / silero-vad

# Instruments

We have compared 3 easy-to-use **off-the-shelf instruments for voice activity / audio activity detection**:

- `silero-vad`: https://github.com/snakers4/silero-vad;
- `py-webrtcvad`, a popular Python interface to the WebRTC VAD: https://github.com/wiseman/py-webrtcvad;
- `auditok`: https://github.com/amsehili/auditok;


# Caveats

- Full disclaimer: we are mostly interested in **voice detection**, not just silence detection;
- In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design). It produces a lot of false positives when detecting speech;
- `auditok` provides **Audio Activity Detection**, which in layman's terms probably just means detecting silence;
- `silero-vad` is geared towards speech detection (as opposed to noise or music);
- A sensible chunk size for our VAD is at least 75-100 ms, since pauses in speech shorter than 100 ms are not very meaningful; we prefer 150-250 ms chunks (see the quality comparison [here](https://github.com/snakers4/silero-vad/issues/2#issuecomment-750840434)). `auditok` and `webrtcvad` work with 30-50 ms chunks: we used the default values of 30 ms for `webrtcvad` and 50 ms for `auditok`;
- We have excluded pyannote-audio for now (https://github.com/pyannote/pyannote-audio), since its pre-trained models are trained only on limited academic datasets and it is mostly a recipe collection / toolkit for building your own tools rather than a finished tool. For such a simple task the amount of code is also puzzling from a production standpoint: our internal VAD training code is literally just 5 Python modules;
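
To make the chunk sizes above more concrete, here is a minimal sketch that converts a chunk duration into a byte count and splits a buffer into fixed-size chunks. It assumes 16 kHz, 16-bit mono PCM, which is an assumption for illustration and not something stated in the comparison; `chunk_size_bytes` and `split_into_chunks` are hypothetical helper names, not part of any of the three libraries.

```python
def chunk_size_bytes(sample_rate_hz: int, chunk_ms: int, sample_width_bytes: int = 2) -> int:
    """Number of bytes in one mono PCM chunk of the given duration."""
    samples = sample_rate_hz * chunk_ms // 1000
    return samples * sample_width_bytes


def split_into_chunks(pcm: bytes, sample_rate_hz: int, chunk_ms: int,
                      sample_width_bytes: int = 2) -> list:
    """Split raw PCM into full-size chunks; a trailing partial chunk is dropped."""
    size = chunk_size_bytes(sample_rate_hz, chunk_ms, sample_width_bytes)
    return [pcm[i:i + size] for i in range(0, len(pcm) - size + 1, size)]


# Assumption: one second of silence, 16 kHz, 16-bit mono.
SAMPLE_RATE_HZ = 16000
one_second = bytes(SAMPLE_RATE_HZ * 2)

for chunk_ms in (30, 50, 100, 250):
    chunks = split_into_chunks(one_second, SAMPLE_RATE_HZ, chunk_ms)
    print(f"{chunk_ms} ms -> {len(chunks[0])} bytes per chunk, {len(chunks)} chunks per second")
```

Under these assumptions a 30 ms chunk holds 480 samples, while a 250 ms chunk holds 4000, so a detector working on the longer chunks makes far fewer decisions per second of audio, each based on much more context.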

# Methodology

Please refer to the methodology described here: https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
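
The linked description is the authoritative one. Purely as a generic illustration (not taken from that methodology), the "false positives when detecting speech" mentioned in the caveats can be quantified per chunk with precision and recall. The labels below are made up for the example, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, reference):
    """Chunk-level precision and recall for the 'speech' class.

    Both arguments are equal-length sequences of booleans,
    where True means the chunk is labeled as speech.
    """
    if len(predicted) != len(reference):
        raise ValueError("label sequences must have the same length")
    pairs = list(zip(predicted, reference))
    tp = sum(1 for p, r in pairs if p and r)        # speech correctly detected
    fp = sum(1 for p, r in pairs if p and not r)    # non-speech flagged as speech
    fn = sum(1 for p, r in pairs if not p and r)    # speech that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels: a detector that fires on any non-silent audio
# catches all speech but also flags noise around it.
reference = [False, False, True, True, True, False, False, False]
predicted = [False, True, True, True, True, True, True, False]

precision, recall = precision_recall(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case recall is perfect while precision is only 0.5, which is the pattern one would expect from a tool that detects audio activity rather than speech specifically.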

# Quality Benchmarks

Finished tests:

![image](https://user-images.githubusercontent.com/12515440/105179880-edc0d300-5b3a-11eb-9aa2-0da9c7afc105.png)

# Portability and Speed

- It looks like `webrtcvad` was originally written in `C++` around 2016, so in theory it can be ported to many platforms;
- I asked around in the community: the original VAD seems to have matured, and the Python version is based on a 2018 version;
- It looks like `auditok` is written in plain Python, but I guess the algorithm itself can be ported;
- `silero-vad` is based on PyTorch and ONNX, so it has the same portability options as both of these frameworks (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX);

This is by no means exhaustive research on the topic; please point out anything that is lacking.

