Can the developers provide an example JSONL file for running inference on unlabeled audio using DrCaps_Zeroshot_Audio_Captioning?
It appears that the dataset JSONL must have this form:
{"source": "/path/to/a_file.wav", "key": "", "target": "", "text": "", "similar_captions": ""}
but it is unclear to me what each field should contain. What should populate "target", "text", and "similar_captions"?
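For context, here is the kind of entry I have been constructing. This is only my guess: I set "key" to an identifier derived from the filename and left the caption-related fields empty, since the audio is unlabeled:

{"source": "/data/audio/dog_bark.wav", "key": "dog_bark", "target": "", "text": "", "similar_captions": ""}

Is leaving "target", "text", and "similar_captions" empty the intended usage for zero-shot inference, or do they need placeholder values?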
Thank you!