Given that no models accepting audio input have been developed yet, and that diffusion models outperform …