Description
Many videos, fictional and non-fictional, use non-speech sounds and noises such as footsteps, doors, animals, music, traffic, and explosions as integral elements of the storytelling. For deaf people like me, this kind of information is essential for understanding the context of a scene, but it is often missing from available subtitles/captions. Automatic ambient sound detection would significantly improve media accessibility for many deaf and hard of hearing users, yet most current tools do not detect or transcribe such sounds accurately and reliably, if at all. However, there are toolkits available that might help with this.
Two projects I found:
• YAMNet – a sound classification model that can run locally:
https://www.tensorflow.org/hub/tutorials/yamnet
• Vosk – offline speech recognition, works without internet:
https://github.com/alphacep/vosk-api
It would be great if the functionality of such a toolkit could be integrated into your tool, so that the generated subtitles/captions reliably include not only dialogue but also noises and other non-speech sound elements. I am not a developer and cannot judge the feasibility of this idea, so please understand this feature request as an enhancement to your already great product, which in many ways already helps us better access the world of sounds and speech in media.
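For illustration only, here is a minimal sketch of how YAMNet could be used to tag non-speech sounds with timestamps; the input file name, confidence threshold, and plain-text output are assumptions for the sketch, not part of this tool or a concrete implementation proposal:

```python
# Rough sketch: tag non-speech sound events in an audio track with YAMNet.
# Assumes a 16 kHz, mono, 16-bit PCM WAV file named "scene.wav" (hypothetical);
# threshold and output format are illustrative choices, not part of any tool.
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
from scipy.io import wavfile

# Load the pre-trained YAMNet model from TensorFlow Hub.
model = hub.load('https://tfhub.dev/google/yamnet/1')

# Read the 521 AudioSet class names that ship with the model.
class_map_path = model.class_map_path().numpy().decode('utf-8')
with tf.io.gfile.GFile(class_map_path) as f:
    class_names = [row['display_name'] for row in csv.DictReader(f)]

# Load audio: YAMNet expects mono float32 samples at 16 kHz in [-1, 1].
sample_rate, data = wavfile.read('scene.wav')  # hypothetical input file
waveform = data.astype(np.float32) / np.iinfo(data.dtype).max

# Run the model; `scores` holds one row of 521 class probabilities per
# ~0.96 s analysis frame, with frames starting every 0.48 s.
scores, embeddings, spectrogram = model(waveform)
scores = scores.numpy()

# Print timestamped labels for confident non-speech events. A captioning
# tool could instead merge these with its speech transcript as sound cues.
FRAME_HOP_S = 0.48
THRESHOLD = 0.3  # arbitrary cut-off for this sketch
for i, frame in enumerate(scores):
    top = int(np.argmax(frame))
    label = class_names[top]
    if frame[top] >= THRESHOLD and label not in ('Speech', 'Silence'):
        print(f'{i * FRAME_HOP_S:6.2f}s  [{label}]')
```

The detected labels (e.g. "Dog", "Door", "Music") could then be inserted into the caption track alongside the speech recognized by a tool such as Vosk, but how best to merge the two streams is a design decision for the developers.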
Thank you to the development team for all their hard work so far, and best of luck with future improvements!