LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
Abstract
Diffusion models have recently driven significant progress in speech and audio generation. Their effectiveness, however, comes at the cost of a large number of sampling steps, which lengthens the synthesis time needed to generate high-quality audio. Previous Text-to-Audio (TTA) methods have mostly applied diffusion models in a latent space. In this paper, we explore integrating the Flow Matching (FM) method into the audio latent space for audio generation. FM is an alternative non-autoregressive method that trains continuous normalizing flows (CNFs) by regressing vector fields. We demonstrate that in text-guided audio generation, Latent Flow Matching (LFM) significantly enhances the quality of generated samples, achieving better performance than prior models. Moreover, it reduces the number of inference steps to ten with almost no loss in performance.
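To make the idea of regressing a vector field and then sampling in ten steps concrete, the following is a minimal sketch, not the LAFMA implementation. It assumes a straight-line probability path from noise (t = 0) to data (t = 1) and a fixed-step Euler solver. The abstract specifies neither, so both are illustrative choices. All function names (`cfm_training_pair`, `cfm_loss`, `euler_sample`, `toy_velocity`) are hypothetical, and the toy vector field stands in for a trained, text-conditioned network operating on audio latents.

```python
import random


def cfm_training_pair(x0, x1, t):
    """Build one regression example for flow matching.

    Assumes a straight-line path x_t = (1 - t) * x0 + t * x1 from a noise
    vector x0 to a data latent x1; the regression target is the constant
    velocity x1 - x0 along that path.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    n = len(target_velocity)
    return sum((p - q) ** 2 for p, q in zip(predicted_velocity, target_velocity)) / n


def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy stand-in for a trained model: when every data point equals `target`,
# the straight-line path has the closed-form field (target - x) / (1 - t).
target = [1.0, -2.0, 0.5, 3.0]


def toy_velocity(x, t):
    return [(m - xi) / (1.0 - t) for m, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(toy_velocity, noise, num_steps=10)
```

In this toy case the ten Euler steps carry the noise vector onto `target` up to floating-point error, because the final step covers the whole remaining distance. A learned vector field on real audio latents would only approximate this behavior, and the step count would trade speed against quality.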
LAFMA Generated Results
A rolling train blows its horn multiple times. | Man speaking continuously with hissing in the background. | A man is giving a speech and a crowd cheers. |
---|---|---|
![]() | ![]() | ![]() |

A motor is revving up. | A large explosion and a heartbeat, a person speaks. | Very loud explosions with pops and bursts of more explosions. |
---|---|---|
![]() | ![]() | ![]() |

The wind is blowing, and a person is whistling a tune. | A toilet is flushed. | Typing on a keyboard. |
---|---|---|
![]() | ![]() | ![]() |
10 Step Generated Results
Text prompt | AudioLDM-S-Full | LAFMA |
---|---|---|
A woman is speaking from a microphone. | ![]() | ![]() |
Man speaking, rain, thunder. | ![]() | ![]() |
Birds are chirping. | ![]() | ![]() |
A woman is giving a speech. | ![]() | ![]() |
Some humming followed by a toilet flushing. | ![]() | ![]() |
A sewing machine operating. | ![]() | ![]() |
Church bells ringing during audio static. | ![]() | ![]() |
Ducks quack as people communicate. | ![]() | ![]() |
A few loud snores. | ![]() | ![]() |
A vehicle accelerating and driving by. | ![]() | ![]() |