Description
Many videos, fictional and non-fictional, use non-speech sounds and noises such as footsteps, doors, animals, music, traffic, and explosions as integral elements of the storytelling. For deaf people like me, this kind of information is essential for understanding the context of a scene, but it is often missing from available subtitles/captions. Automatic ambient sound detection would significantly improve media accessibility for many deaf and hard of hearing users, yet most current tools do not detect or transcribe such sounds accurately and reliably, if at all. However, there are toolkits available that might help with this.
Two projects I found:
• YAMNet – a sound classification model that can run locally:
https://www.tensorflow.org/hub/tutorials/yamnet
• Vosk – offline speech recognition, works without internet:
https://github.com/alphacep/vosk-api
It would be great if the functionality of such a toolkit could be integrated into your tool, so that the generated subtitles/captions reliably include not only dialogue but also noises and other non-speech sound elements. I am not a developer and cannot judge the feasibility of this idea, so please understand this feature request as an enhancement to your already great product, which in many ways already helps us better access the world of sounds and speech in media.
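For illustration only, here is a minimal sketch of how YAMNet could be used to tag non-speech sounds with timestamps; the input file name, confidence threshold, and plain-text output are assumptions for the sketch, not part of this tool or a concrete implementation proposal:

```python
# Rough sketch: tag non-speech sound events in an audio track with YAMNet.
# Assumes a 16 kHz, mono, 16-bit PCM WAV file named "scene.wav" (hypothetical);
# threshold and output format are illustrative choices, not part of any tool.
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

# Load the pre-trained YAMNet model from TensorFlow Hub.
model = hub.load('https://tfhub.dev/google/yamnet/1')

# Read the 521 AudioSet class names that ship with the model.
class_map_path = model.class_map_path().numpy().decode('utf-8')
with tf.io.gfile.GFile(class_map_path) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

# Load audio: YAMNet expects mono float32 samples at 16 kHz in [-1, 1].
sample_rate, data = wavfile.read('scene.wav')  # hypothetical input file
waveform = data.astype(np.float32) / np.iinfo(data.dtype).max

# Run the model; `scores` holds one row of 521 class probabilities per
# ~0.96 s analysis frame, with frames starting every 0.48 s.
scores, embeddings, spectrogram = model(waveform)
scores = scores.numpy()

# Print timestamped labels for confident non-speech events. A captioning
# tool could instead merge these with its speech transcript as sound cues.
FRAME_HOP_S = 0.48
THRESHOLD = 0.3  # arbitrary cut-off for this sketch
for i, frame in enumerate(scores):
    top = int(np.argmax(frame))
    label = class_names[top]
    if frame[top] >= THRESHOLD and label not in ('Speech', 'Silence'):
        print(f'{i * FRAME_HOP_S:6.2f}s  [{label}]')
```

The detected labels (e.g. "Dog", "Door", "Music") could then be inserted into the caption track alongside the speech recognized by a tool such as Vosk, but how best to merge the two streams is a design decision for the developers.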
Thank you to the development team for all their hard work so far, and best of luck with future improvements!