LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation

Abstract

Diffusion models have recently driven significant progress in speech and audio generation. However, their effectiveness comes at the cost of a large number of sampling steps, which lengthens the synthesis time needed to generate high-quality audio. Previous Text-to-Audio (TTA) methods have mostly applied diffusion models in the latent space. In this paper, we explore integrating the Flow Matching (FM) method into the audio latent space for audio generation. FM is an alternative non-autoregressive method that trains continuous normalizing flows (CNFs) by regressing vector fields. We demonstrate that, in text-guided audio generation, Latent Flow Matching (LFM) significantly enhances the quality of generated samples, achieving better performance than prior models. Moreover, it reduces the number of inference steps to ten with almost no loss in performance.
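
The idea of training a flow by regressing a vector field, and then sampling in ten steps, can be illustrated with a toy sketch. This is not the LAFMA implementation. The straight-line path between noise and data, the fixed-step Euler solver, and all function names (`interpolate`, `target_velocity`, `fm_loss`, `euler_sample`) are assumptions chosen for illustration, and the "latents" here are plain lists of floats rather than audio latents.

```python
import random


def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the straight path: d/dt of interpolate() is x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def fm_loss(predicted, target):
    """Mean squared error between a predicted and a target vector field."""
    return sum((p - u) ** 2 for p, u in zip(predicted, target)) / len(target)


def euler_sample(vector_field, x0, num_steps=10):
    """Integrate dx/dt = vector_field(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        v = vector_field(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check (assumption: a "perfect" model that returns the exact target velocity).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(8)]
latent = [random.gauss(0.0, 1.0) for _ in range(8)]
velocity = target_velocity(noise, latent)
sample = euler_sample(lambda x, t: velocity, noise, num_steps=10)
```

In this toy setting the vector field is constant along the path, so ten Euler steps carry the noise vector onto the target latent up to floating-point error. In a real system a text-conditioned neural network would stand in for the lambda and would be trained by minimizing `fm_loss` at randomly drawn times `t`.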

LAFMA Generated Results

A rolling train blows its horn multiple times.	Man speaking continuously with hissing in the background.	A man is giving a speech and a crowd cheers.

A motor is revving up.	A large explosion and a heartbeat, a person speaks.	Very loud explosions with pops and bursts of more explosions.

A toilet is flushed.	Typing on a keyboard.	The wind is blowing, and a person is whistling a tune.

10-Step Generated Results

A woman is speaking from a microphone.

AudioLDM-S-Full	LAFMA

Man speaking, rain, thunder.

AudioLDM-S-Full	LAFMA

Birds are chirping.

AudioLDM-S-Full	LAFMA

A woman is giving a speech.

AudioLDM-S-Full	LAFMA

Some humming followed by a toilet flushing.

AudioLDM-S-Full	LAFMA

A sewing machine operating.

AudioLDM-S-Full	LAFMA

Church bells ringing during audio static.

AudioLDM-S-Full	LAFMA

Ducks quack as people communicate.

AudioLDM-S-Full	LAFMA

A few loud snores.

AudioLDM-S-Full	LAFMA

A vehicle accelerating and driving by.

AudioLDM-S-Full	LAFMA