FlowAVSE: Efficient audio-visual speech enhancement models with conditional flow matching


Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, Joon Son Chung
Interspeech 2024

Abstract

This work proposes an efficient method to enhance the quality of corrupted speech signals by leveraging both acoustic and visual cues. While existing diffusion-based approaches have demonstrated remarkable quality, their applicability is limited by slow inference speeds and computational complexity. To address this issue, we present FlowAVSE, which improves inference speed and reduces the number of learnable parameters without degrading output quality. In particular, we employ a conditional flow matching algorithm that enables the generation of high-quality speech in a single sampling step. Moreover, we increase efficiency by optimizing the underlying U-net architecture of diffusion-based systems. Our experiments demonstrate that FlowAVSE achieves 22 times faster inference and halves the model size while maintaining output quality.
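For readers unfamiliar with conditional flow matching, the sketch below shows the generic training objective and why a single sampling step can suffice: the network regresses the constant velocity of a straight-line path from a prior sample to the clean target, so one Euler step traverses the entire path. This is a minimal PyTorch illustration, not the exact FlowAVSE formulation; the velocity_net interface, the conditioning inputs, and the tensor shapes are assumptions made for the example.

import torch

def cfm_training_loss(velocity_net, x_clean, cond):
    # One conditional-flow-matching training step on a batch of clean targets.
    # x_clean: clean speech features, e.g. a (B, C, T) spectrogram batch.
    # cond: conditioning features (e.g. corrupted audio and visual cues).
    b = x_clean.shape[0]
    x0 = torch.randn_like(x_clean)                 # sample from the Gaussian prior
    t = torch.rand(b, device=x_clean.device)       # t ~ Uniform(0, 1)
    t_ = t.view(b, *([1] * (x_clean.dim() - 1)))   # broadcast t over feature dims
    xt = (1.0 - t_) * x0 + t_ * x_clean            # point on the straight path
    v_target = x_clean - x0                        # constant velocity of that path
    v_pred = velocity_net(xt, t, cond)             # hypothetical conditional U-net
    return torch.mean((v_pred - v_target) ** 2)    # simple regression loss

@torch.no_grad()
def one_step_sample(velocity_net, cond, shape, device):
    # Single Euler step from t=0 to t=1: x1 is approximated by x0 + v(x0, 0, cond).
    x0 = torch.randn(shape, device=device)
    t0 = torch.zeros(shape[0], device=device)
    return x0 + velocity_net(x0, t0, cond)

Running several Euler steps with the same network would trade speed for integration accuracy; the result reported above is that a single step already preserves output quality.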

Demo Audio

The audio samples below compare enhanced outputs obtained from the same input for three models: FlowAVSE (ours), AVDiffuSS [1], and AV-Gen [2]. All three models enhance the mixed speech reasonably well; however, our model is better at suppressing noise and at consistently tracking the target speech throughout the entire clip.

Mix audio
Ground Truth
FlowAVSE (Ours)
AVDiffuSS [1]
AV-Gen [2]

References

[1] Lee, S., Jung, C., Jang, Y., Kim, J. and Chung, J.S., "Seeing Through the Conversation: Audio-Visual Speech Separation based on Diffusion Model," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

[2] Richter, J., Frintrop, S. and Gerkmann, T., "Audio-Visual Speech Enhancement with Score-Based Generative Models," in Proc. ITG Conference on Speech Communication, Aachen, Germany, September 2023.