Natural language guidance of high-fidelity text-to-speech models with synthetic annotations

Dan Lyth, Simon King

Abstract

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets.

Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k-hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data.

Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning.
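The descriptions on this page use a graded vocabulary ("very", "quite", "fairly", "slightly") for attributes such as pitch. As a rough illustration of how a continuous measurement could be turned into such a keyword, here is a minimal sketch in Python. It is not the labeling pipeline used in this work: the equal-width binning, the 80 to 240 Hz range, and the names `make_edges` and `label_for` are all assumptions made for the example. Only the eight pitch phrases are taken from the prompts below.

```python
from bisect import bisect_right

# Graded pitch keywords as they appear in the prompts on this page, low to high.
PITCH_LABELS = [
    "very low-pitched", "quite low-pitched",
    "fairly low-pitched", "slightly low-pitched",
    "slightly high-pitched", "fairly high-pitched",
    "quite high-pitched", "very high-pitched",
]


def make_edges(lo, hi, n_bins):
    """Return the n_bins - 1 interior edges of equal-width bins over [lo, hi]."""
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def label_for(value, edges, labels):
    """Map a continuous measurement to a keyword via its bin index."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than interior edges")
    # Values below the first edge fall in bin 0; values above the last edge
    # fall in the final bin, so out-of-range inputs are clamped.
    return labels[bisect_right(edges, value)]


# Hypothetical mean-pitch range in Hz, chosen only for illustration.
edges = make_edges(80.0, 240.0, len(PITCH_LABELS))

for mean_pitch_hz in (95.0, 165.0, 300.0):
    print(mean_pitch_hz, "->", label_for(mean_pitch_hz, edges, PITCH_LABELS))
```

The same pattern would apply to any attribute with an ordered keyword scale, such as speaking rate or noise level, by swapping in a different label list and range. Choosing the bin edges (for example, per speaker group or from dataset percentiles) is a design decision this sketch leaves open.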

Controlling numerous attributes

An American female with a slightly low-pitched voice reads a book. Her words are captured in an excellent and very close-sounding recording. The speaker reads with a slightly quick pace.

Our model
Audiobox
Ground truth

A female voice with an American accent reads a book. Her voice is slightly monotone but is captured with excellent quality, very close-sounding and clean. She is slightly high-pitched, and reads fairly quickly.

Our model
Audiobox
Ground truth

A female voice with an Italian accent reads from a book. The recording is very noisy. The speaker reads fairly quickly with a slightly high pitched and monotone voice.

Our model
Audiobox
Ground truth

A male voice with an Indian accent reads slowly from a book, his words fairly close-sounding and slightly clean. He speaks in a slightly monotone fashion, but his voice is fairly high-pitched, adding a touch of eagerness to his reading.

Our model
Audiobox
Ground truth

A male voice with a Macedonian accent reads a book aloud. The recording is very close-sounding but slightly noisy. The voice is quite monotone with a fairly low pitch.

Our model
Audiobox
Ground truth

A male voice with an American accent reads a book. The recording is very close-sounding and very clean. His voice is slightly monotone, but the excellent recording and his slightly low pitch draw the listener in.

Our model
Audiobox
Ground truth

A male voice with a Canadian accent reads a book aloud. The recording is excellent. His delivery is slightly monotone but he speaks quite quickly.

Our model
Audiobox
Ground truth

A woman with an American accent and quite high pitch voice reads a book with some animation. The recording is fairly clean but there is some roominess. She reads slightly quickly.

Our model
Audiobox
Ground truth

A high-pitched male voice with an American accent reads a book fairly quickly. The speaker's voice is close-sounding, but there is quite a lot of background noise.

Our model
Audiobox
Ground truth

A male voice with an American accent reads a book with expertise. The recording is excellent, capturing his slightly high-pitched voice in very close sounding detail. His voice is clear and well-modulated, making for an engaging listening experience.

Our model
Audiobox
Ground truth

Controlling specific attributes

Gender and speaker pitch

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his very low-pitched voice with crisp clarity.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her quite low-pitched voice with crisp clarity.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his fairly low-pitched voice with crisp clarity.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her slightly low-pitched voice with crisp clarity.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her slightly high-pitched voice with crisp clarity.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his fairly high-pitched voice with crisp clarity.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her quite high-pitched voice with crisp clarity.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his very high-pitched voice with crisp clarity.

Our model

Pitch modulation

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is very monotone.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is quite monotone.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. His tone is fairly monotone.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is slightly monotone.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. His tone is slightly expressive and animated.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is fairly expressive and animated.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. His tone is quite expressive and animated.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is very expressive and animated.

Our model

Speaking rate

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. He reads the book very slowly.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. She reads the book quite slowly.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. He reads the book fairly slowly.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. She reads the book slightly slowly.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. He reads the book slightly quickly.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. She reads the book fairly quickly.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. He reads the book quite quickly.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. She reads the book very quickly.

Our model

Channel conditions

A male voice with an American accent enunciates every word with precision. The recording is very bad, and the speaker's voice is very distant-sounding and noisy.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is quite close-sounding but very noisy.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is close-sounding but fairly noisy.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is fairly distant-sounding.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is slightly distant-sounding.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is fairly close-sounding.

Our model

A male voice with an American accent enunciates every word with precision. The speaker's voice is quite close-sounding and quite clean.

Our model

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding and clean, and the recording is excellent, capturing her voice with crisp clarity.

Our model

Accent

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

Our model

A male voice with an English accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

Our model

A male voice with a Pakistani accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

Our model

A female voice with an Italian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity.

Our model

A male voice with a South African accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

Our model

A male voice with a Canadian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

Our model

A female voice with an Indian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity.

Our model