Hello, first of all, thank you to the author for your amazing work on this library (and on Whisper.net too!)
I have a question regarding real-time transcription. So far I can use EchoSharp to transcribe audio in real time by handling the RealtimeSegmentRecognizing and RealtimeSegmentRecognized events.
But there's a slight problem: the RealtimeSegmentRecognizing results are noticeably less accurate. That sounds reasonable, since the recognition process has not completed yet, but my assumption is that the main cause is that the audio buffer is too short. It feels like it tries to recognize a buffer of around 50 ms of audio, in which not even a quarter of a word has been spoken.
Of course this is only based on rough observation, but the result is that displaying the text of the latest of these events does not look very good. I'm also using SileroVadDetector, but I'm not familiar with the inner workings of the detection.
I feel like this could be better if I could somehow throttle the processing rate, or in other words make each audio chunk bigger. There would be more delay, but at least the output would come out nicer. I could put a Task.Delay between my TranscribeAsync calls, but I don't think that is the right way to do it.
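To make the idea concrete, here is a minimal sketch of the kind of throttling I have in mind, applied on the display side only. `InterimTextThrottle` and its methods are hypothetical names of my own, not part of EchoSharp, and this does not change how often recognition actually runs; it only limits how often interim text reaches the UI.

```csharp
using System;

// Hypothetical helper (not an EchoSharp type): shows interim text at most
// once per interval, and always shows final text.
public sealed class InterimTextThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Action<string> _display;
    private DateTimeOffset? _lastShown;

    public InterimTextThrottle(TimeSpan minInterval, Action<string> display)
    {
        _minInterval = minInterval;
        _display = display;
    }

    // Call for each interim ("recognizing") result.
    // Returns true if the text was displayed, false if it was skipped.
    public bool OnInterim(string text, DateTimeOffset now)
    {
        if (_lastShown is DateTimeOffset last && now - last < _minInterval)
        {
            return false; // too soon after the previous update, skip this one
        }

        _lastShown = now;
        _display(text);
        return true;
    }

    // Call for each final ("recognized") result; never skipped.
    public void OnFinal(string text, DateTimeOffset now)
    {
        _lastShown = now;
        _display(text);
    }
}
```

I would call `OnInterim(text, DateTimeOffset.UtcNow)` wherever I handle RealtimeSegmentRecognizing and `OnFinal(...)` wherever I handle RealtimeSegmentRecognized. It hides the flickering, but the very short buffers would still be processed underneath, which is why I am asking whether there is a proper setting for this.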
Any advice would be highly appreciated. Thank you in advance!