text description to speech

Natural language guidance of high-fidelity text-to-speech models with synthetic annotations

Dan Lyth, Simon King

Abstract

Text-to-speech models trained on large-scale datasets have demonstrated impressive in-context learning capabilities and naturalness. However, control of speaker identity and style in these models typically requires conditioning on reference speech recordings, limiting creative applications. Alternatively, natural language prompting of speaker identity and style has demonstrated promising results and provides an intuitive method of control. However, reliance on human-labeled descriptions prevents scaling to large datasets.

Our work bridges the gap between these two approaches. We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions. We then apply this method to a 45k-hour dataset, which we use to train a speech language model. Furthermore, we propose simple methods for increasing audio fidelity, significantly outperforming recent work despite relying entirely on found data.

Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions, all accomplished with a single model and intuitive natural language conditioning.
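To make the idea of synthetic annotation concrete, the sketch below shows one way a measured continuous attribute (here, a per-utterance pitch value) could be turned into a keyword such as "slightly low-pitched". It is an illustration under stated assumptions, not the authors' pipeline: the quantile-based binning, the eight-step label scale, and the names `make_bin_edges` and `label_value` are all hypothetical. Only the qualifier ordering (very, quite, fairly, slightly) is taken from the example descriptions on this page.

```python
from bisect import bisect_right

# Hypothetical ordinal scale; qualifiers mirror those in the example descriptions.
PITCH_LABELS = [
    "very low-pitched",
    "quite low-pitched",
    "fairly low-pitched",
    "slightly low-pitched",
    "slightly high-pitched",
    "fairly high-pitched",
    "quite high-pitched",
    "very high-pitched",
]


def make_bin_edges(values, n_bins):
    """Return n_bins - 1 interior edges at evenly spaced quantiles of `values`."""
    ordered = sorted(values)
    edges = []
    for i in range(1, n_bins):
        idx = min(len(ordered) - 1, (i * len(ordered)) // n_bins)
        edges.append(ordered[idx])
    return edges


def label_value(value, edges, labels):
    """Map a measurement to the label of the bin it falls into."""
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, value)]


# Made-up pitch measurements in Hz, used only to derive the bin edges.
measured_pitch = list(range(80, 240))
edges = make_bin_edges(measured_pitch, len(PITCH_LABELS))
print(label_value(165, edges, PITCH_LABELS))
```

Because the edges come from the data itself, the same two functions could be reused for other continuous attributes (speaking rate, noise level) by swapping in a different label list. Whether labels should be computed per speaker group or over the whole dataset is a design choice this sketch leaves open.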

Controlling numerous attributes

An American female with a slightly low-pitched voice reads a book. Her words are captured in an excellent and very close-sounding recording. The speaker reads with a slightly quick pace.

A female voice with an American accent reads a book. Her voice is slightly monotone but is captured with excellent quality, very close-sounding and clean. She is slightly high-pitched, and reads fairly quickly.

A female voice with an Italian accent reads from a book. The recording is very noisy. The speaker reads fairly quickly with a slightly high-pitched and monotone voice.

A male voice with an Indian accent reads slowly from a book, his words fairly close-sounding and slightly clean. He speaks in a slightly monotone fashion, but his voice is fairly high-pitched, adding a touch of eagerness to his reading.

A male voice with a Macedonian accent reads a book aloud. The recording is very close-sounding but slightly noisy. The voice is quite monotone with a fairly low pitch.

A male voice with an American accent reads a book. The recording is very close-sounding and very clean. His voice is slightly monotone, but the excellent recording and his slightly low pitch draw the listener in.

A male voice with a Canadian accent reads a book aloud. The recording is excellent. His delivery is slightly monotone but he speaks quite quickly.

A woman with an American accent and quite high-pitched voice reads a book with some animation. The recording is fairly clean but there is some roominess. She reads slightly quickly.

A high-pitched male voice with an American accent reads a book fairly quickly. The speaker's voice is close-sounding, but there is quite a lot of background noise.

A male voice with an American accent reads a book with expertise. The recording is excellent, capturing his slightly high-pitched voice in very close-sounding detail. His voice is clear and well-modulated, making for an engaging listening experience.

Controlling specific attributes

Gender and speaker pitch

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his very low-pitched voice with crisp clarity.

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her quite low-pitched voice with crisp clarity.

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his fairly low-pitched voice with crisp clarity.

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her slightly low-pitched voice with crisp clarity.

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his fairly high-pitched voice with crisp clarity.

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her quite high-pitched voice with crisp clarity.

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his very high-pitched voice with crisp clarity.

Pitch modulation

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity. Her tone is very monotone.

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity. His tone is fairly monotone.

Speaking rate

Channel conditions

A male voice with an American accent enunciates every word with precision. The recording is very bad, and the speaker's voice is very distant-sounding and noisy.

A female voice with an American accent enunciates every word with precision. The speaker's voice is quite close-sounding but very noisy.

A male voice with an American accent enunciates every word with precision. The speaker's voice is close-sounding but fairly noisy.

A male voice with an American accent enunciates every word with precision. The speaker's voice is fairly distant-sounding.

A female voice with an American accent enunciates every word with precision. The speaker's voice is slightly distant-sounding.

A male voice with an American accent enunciates every word with precision. The speaker's voice is fairly close-sounding.

A male voice with an American accent enunciates every word with precision. The speaker's voice is quite close-sounding and quite clean.

A female voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding and clean, and the recording is excellent, capturing her voice with crisp clarity.

Accent

A male voice with an American accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

A male voice with an English accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

A male voice with a Pakistani accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

A female voice with an Italian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity.

A male voice with a South African accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

A male voice with a Canadian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing his voice with crisp clarity.

A female voice with an Indian accent enunciates every word with precision. The speaker's voice is very close-sounding, and the recording is excellent, capturing her voice with crisp clarity.
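The descriptions in this section differ from one another in only one attribute at a time, which makes them easy to reproduce from a template. The sketch below is a hypothetical helper for writing such prompts; `build_description` is not part of any released code, and it is not a claim about how the descriptions on this page were produced. The article choice ("a" or "an") uses a simple first-letter check, which is an assumption that happens to hold for the accents listed here.

```python
def build_description(gender, accent, pitch=None):
    """Compose a single-attribute prompt in the style of the examples above.

    gender: "male" or "female"
    accent: accent name, e.g. "American" or "South African"
    pitch:  optional keyword such as "very low-pitched"
    """
    article = "an" if accent[0].lower() in "aeiou" else "a"
    pronoun = {"male": "his", "female": "her"}[gender]
    voice = f"{pitch} voice" if pitch else "voice"
    return (
        f"A {gender} voice with {article} {accent} accent "
        "enunciates every word with precision. "
        "The speaker's voice is very close-sounding, and the recording is "
        f"excellent, capturing {pronoun} {voice} with crisp clarity."
    )


# Vary only the accent while holding every other attribute fixed.
for accent in ["American", "Pakistani", "South African"]:
    print(build_description("male", accent))
```

Holding the rest of the sentence constant while changing one keyword keeps comparisons between generated samples focused on the attribute under test.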