Hi, thanks for your effort and sharing the code!
The architecture blocks in the speech prompted text encoder and CFM decoder differ from the initial ones introduced in the paper. I would like to know what made you do the changes. Was the model not converging with official architecture?