Because no models that accept audio input have been developed yet, and because diffusion models outperform other generative models, it was decided to develop a Denoising Diffusion Probabilistic Model for Sign Language Translation to improve translation quality. Moreover, the developed solution includes a convenient pipeline for data preparation and model training that allows the model to be trained for different sign languages, such as German, English, or Russian. This will make video information more accessible to people with hearing impairments across different languages.
This work presents the research in a consistent order. The first section reviews existing solutions. The second section describes the structure of the developed model. The third section describes the selection of optimal hyperparameters for the model and compares its quality in generating gesture sequences with that of existing solutions. It also provides the calculations that support the theoretical justification for applying diffusion models to gesture generation, and demonstrates the practical use of the model by generating sign language samples from incoming audio sequences.