Hi,
Thanks a lot for your work.
I have reproduced the same results on the TIMIT dataset. I have now converted that model to ONNX and then to TFLite for d-vector computation and speaker identification on mobile.
I verified the SincNet TFLite model in Python and it worked for me, but now I have to run the same inference on a mobile device.
So I am trying to convert raw audio into a tensor and replicate the same NumPy computation in C/C++.
I have not found any direct way to convert audio into a tensor, since there is no mobile implementation of torchaudio, so I am looking at computing MFCCs from the audio and then converting them into a tensor of the same dimensions.
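For the raw-audio-to-tensor part, a library may not actually be needed: SincNet takes the raw waveform, and `soundfile.read()` essentially just returns the PCM samples scaled to [-1, 1]. Below is a minimal C++ sketch of that conversion under the assumption of 16-bit mono PCM WAV input on a little-endian host; `load_pcm16_as_floats` is a hypothetical helper name, not part of any library.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Parse a WAV byte buffer and return its samples scaled to [-1, 1],
// mirroring what soundfile.read() produces in Python.
// Assumes 16-bit mono PCM and a little-endian host.
std::vector<float> load_pcm16_as_floats(const std::vector<uint8_t>& wav) {
    std::vector<float> out;
    size_t pos = 12;  // skip "RIFF" + file size + "WAVE"
    // Walk the RIFF chunks until the "data" chunk is found.
    while (pos + 8 <= wav.size()) {
        char id[5] = {0};
        std::memcpy(id, &wav[pos], 4);
        uint32_t sz = 0;
        std::memcpy(&sz, &wav[pos + 4], 4);
        if (std::strncmp(id, "data", 4) == 0) {
            size_t n = sz / 2;  // two bytes per int16 sample
            out.reserve(n);
            for (size_t i = 0; i < n; ++i) {
                int16_t s = 0;
                std::memcpy(&s, &wav[pos + 8 + 2 * i], 2);
                out.push_back(s / 32768.0f);  // int16 -> [-1, 1)
            }
            break;
        }
        pos += 8 + sz;  // advance to the next chunk
    }
    return out;
}
```

The resulting float vector can then be copied straight into the model's input buffer (e.g. `interpreter->typed_input_tensor<float>(0)` in the TFLite C++ API), so no MFCC step would be required if the model was trained on raw waveforms.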
I can compute MFCC features using a C++ library or my own C++ code. Now I want to know: have you tried computing the d-vector (or training the speaker-ID model) using MFCCs instead of the raw-waveform torch tensor (soundfile features converted to a torch tensor)?
I found that you did something similar with an MFCC comparison, as mentioned in the link below:
pytorch/audio#328
So could you please confirm:
- If I compute MFCC features directly from the audio using a C++ library and then load the SincNet model on mobile for the final d-vector calculation, will that change my d-vector values, or lead to a major difference in final speaker-identification accuracy compared to using the torch tensor?
- Can you suggest a method to convert raw audio to a tensor on a mobile device, similar to what has been done here?
I hope my question is clear.
Thanks a lot.