Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Abstract

Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models struggle to capture the complex correlations between acoustic and semantic features, resulting in limited expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which exhibits both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously, while the Coupled NAR model predicts the detailed tokens from the AR output, accounting for the interdependence between the semantic and acoustic aspects. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in both synthesis quality and efficiency.


Model Architecture


Figure 1: An overview of the Parallel GPT inference pipeline. (1) The Parallel Tokenizer Encoder derives parallel tokens from the reference speech. (2) The Parallel Autoregressive Language Model generates the top semantic and acoustic tokens conditioned on the text and the reference speech. (3) The Coupled Non-Autoregressive Transformer generates the semantic and acoustic tokens of the last two levels; combined with the top parallel tokens, these form the comprehensive parallel tokens. (4) The Parallel Tokenizer Decoder reconstructs high-quality speech from the comprehensive parallel tokens.
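To make the data flow concrete, here is a minimal Python sketch of the four inference stages described above. All module handles and method names (e.g. encoder, ar_lm.generate, nar, decoder) are hypothetical placeholders standing in for the four components in Figure 1, not a released API.

```python
import torch

@torch.no_grad()
def synthesize(text_ids, ref_wav, encoder, ar_lm, nar, decoder):
    """Hedged sketch of the Figure 1 pipeline; all modules are assumed pre-loaded."""
    # (1) Tokenize the reference speech into parallel (semantic + acoustic) tokens.
    ref_tokens = encoder(ref_wav)                      # [T_ref, n_levels, 2]

    # (2) Autoregressively generate the top semantic/acoustic token pair,
    #     conditioned on the target text and the reference tokens.
    top_tokens = ar_lm.generate(text_ids, ref_tokens)  # [T_gen, 2]

    # (3) Non-autoregressively fill in the remaining detail levels.
    detail_tokens = nar(top_tokens, ref_tokens)        # [T_gen, n_detail, 2]

    # (4) Combine into comprehensive parallel tokens and decode to a waveform.
    full_tokens = torch.cat([top_tokens.unsqueeze(1), detail_tokens], dim=1)
    return decoder(full_tokens)
```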



Figure 2: The structure of the Parallel Tokenizer. It is trained before the TTS model. The wav2vec 2.0, BEATs, and Campplus models use downloaded pretrained weights, and their parameters are frozen during training.
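A minimal sketch of how such a tokenizer could be composed, assuming (as the caption states) that wav2vec 2.0, BEATs, and Campplus act as frozen pretrained encoders while the quantizer and decoder are trained. The wrapper class, the role assigned to each encoder, and all argument names are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn as nn

class ParallelTokenizer(nn.Module):
    """Illustrative composition only: frozen pretrained encoders feed a
    trainable quantizer and decoder (names and roles are assumptions)."""

    def __init__(self, wav2vec2, beats, campplus, quantizer, decoder):
        super().__init__()
        self.semantic_enc = wav2vec2   # frozen wav2vec 2.0 (semantic stream, assumed)
        self.acoustic_enc = beats      # frozen BEATs (acoustic stream, assumed)
        self.spk_enc = campplus        # frozen Campplus speaker embedding (assumed)
        for module in (self.semantic_enc, self.acoustic_enc, self.spk_enc):
            for p in module.parameters():
                p.requires_grad_(False)  # pretrained weights stay frozen
        self.quantizer = quantizer     # trainable: features -> parallel tokens
        self.decoder = decoder         # trainable: tokens (+ speaker) -> waveform

    def encode(self, wav):
        return self.quantizer(self.semantic_enc(wav), self.acoustic_enc(wav))

    def decode(self, tokens, ref_wav):
        return self.decoder(tokens, self.spk_enc(ref_wav))
```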



Figure 3: The structure of the Parallel Autoregressive Language Model. It is designed to synthesize the target top tokens from the target text and the reference speech. We define the first dimension of the tokens obtained by the Parallel Tokenizer as the "top tokens," since they carry the most important information.
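The key idea of the parallel AR stage is that each decoding step emits a semantic and an acoustic top token simultaneously. The sketch below illustrates one way to realize this with two output heads on a shared decoder; the layer sizes, vocabulary sizes, and head names are simplifying assumptions, not the paper's configuration.

```python
import torch.nn as nn

class ParallelARLM(nn.Module):
    """Sketch: one AR decoder step produces two logit streams, so the top
    semantic and acoustic tokens are predicted in parallel (sizes assumed)."""

    def __init__(self, d_model=1024, n_heads=16, sem_vocab=4096, aco_vocab=4096):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=12)
        self.sem_head = nn.Linear(d_model, sem_vocab)  # semantic top-token logits
        self.aco_head = nn.Linear(d_model, aco_vocab)  # acoustic top-token logits

    def forward(self, tgt_emb, cond_emb, tgt_mask=None):
        # cond_emb: embedded target text plus reference-speech tokens (the prompt);
        # tgt_emb: embeddings of the top tokens generated so far (causally masked).
        h = self.decoder(tgt_emb, cond_emb, tgt_mask=tgt_mask)
        return self.sem_head(h), self.aco_head(h)      # one pair per time step
```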




Figure 4: The structure of the Coupled Non-Autoregressive Transformer. It is designed to generate the detailed tokens from the target top tokens and the reference speech.
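A minimal sketch of the coupled NAR idea: the remaining detail levels of both the semantic and acoustic streams are predicted in a single parallel pass over a shared Transformer, conditioned on the top tokens and the reference speech. The backbone depth, head layout, and the two detail levels per stream are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class CoupledNAR(nn.Module):
    """Sketch: semantic and acoustic detail tokens share (are coupled through)
    one Transformer and are predicted non-autoregressively (sizes assumed)."""

    def __init__(self, d_model=1024, n_heads=16, codebook=1024, detail_levels=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        # One classification head per (detail level, stream): semantic + acoustic.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook) for _ in range(detail_levels * 2)
        )

    def forward(self, top_emb, ref_emb):
        # Prepend the reference-speech tokens as a prompt, run one parallel pass,
        # then keep only the target positions.
        h = self.backbone(torch.cat([ref_emb, top_emb], dim=1))
        h = h[:, ref_emb.size(1):]
        return [head(h) for head in self.heads]   # logits per detail level and stream
```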

Experiments on English Dataset

  • The audio samples below are synthesized using the model proposed in this paper.
  • The LibriTTS dataset is used; it can be downloaded from https://www.openslr.org/60/.
  • LibriTTS is a multi-speaker English corpus comprising 585 hours of speech from over 2,300 speakers. Train-clean-100, train-clean-360, and train-other-500 are merged as the training set; dev-clean and dev-other as the development set; and test-clean and test-other as the test set.
  • The comparison systems are YourTTS, MaskGCT, VALL-E, TransferTTS, CosyVoice, E2TTS, and UniAudio.
  • In this section, we conduct detailed ablation experiments on the following variants: -only Wav2Vec 2.0, -only BEATs, -w/o parallel, and -only AR.
  • The first 2 to 3 seconds of the ground truth speech are extracted as reference speech, with the precise duration determined by the Voice Activity Detection (VAD) cutoff point. Care is taken to ensure that the extracted segment ends at the completion of a full word pronunciation.
  • Samples for Seen Speakers - Development Set

    Sample 1

    Text: And then he talked of her mother, and he made her pray

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: "Tom, we're having a problem with the gyro stabilizer," said Mark Faber, gray haired president of the Faber Electronics Company

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: This alternative in the Captain's plans (terminating the voyage a month earlier than his arrangements had contemplated) puzzled Randal

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR



    Samples for Unseen Speakers - Test Set

    Sample 1

    Text: So choose for yourself - to make a rush or tarry here.

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits."

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: The Missouri cabal, on the other hand, having three of their best men constantly at the Governor's side, were compelled to recognize their lack of justification

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: Yet were they little worse than what were insisted on before the battle of Naseby

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR



Experiments on Chinese Dataset

  • The audio samples below are synthesized using the model proposed in this paper.
  • An internal Chinese dataset is used. It consists of read speech from Chinese novels.
  • The dataset includes 453,716 audio samples, totaling 1,062 hours of speech from 43,034 unique speakers. Two speech-cleaning pipelines, Emilia and NCSSD, are applied for preprocessing and noise reduction to ensure high-quality data.
  • The comparison systems are MaskGCT, CosyVoice, and E2TTS.
  • In this section, we conduct detailed ablation experiments on the following variants: -only Wav2Vec 2.0, -only BEATs, -w/o parallel, and -only AR.
  • The first 2 to 3 seconds of the ground truth speech are extracted as reference speech, with the precise duration determined by the Voice Activity Detection (VAD) cutoff point. Care is taken to ensure that the extracted segment ends at the completion of a full word pronunciation.
  • Samples for Seen Speakers - Development Set

    Sample 1

    Text: 倘若我们将一个人从其所属的文化世界中赶出去。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: 你要求到沿海挂职,这是积极进步,目的是高尚的。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: 咱党前流神威雄猛,如砍瓜切菜,杀散众人。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: 指出,这是以前法利亚长老所掘的那条地道的出口。基督山觉得他的四肢在发抖,他在一段木头上坐了下。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: 先露出极为远大的战略智慧和战略眼光。当曹策召开会议,他的一些手下并不赞同迎接现地,只有徐玉藤少数人支持他,大家在会议上展开激烈辩论。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: 正是这样的缘故。当他看到贾宝玉为林黛玉的诗汇夺魁,高兴得首舞祖导时,按耐不住一腔怨恨。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 7

    Text: 本集故事播讲完毕,请您继续收听。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 8

    Text: 我就只能寄希望于美军的狙击手死在我军的炮火之下。否则有这家伙防守。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 9

    Text: 生活只是修道,吃饭。什么时候活得不耐烦了?只要一手两个翘。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 10

    Text: 但邻居们看到,大多数时间,这个十岁的孩子,一个人默默嗦嗦在院子的简易造房里默不作声地做饭,又一个人默不作声地在屋里待着,到了晚上十点多就关灯睡觉。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR


    Samples for Unseen Speakers - Test Set


    Sample 1

    Text: 这笔他们即将继承到手的财产,他那种死死抓住财富,不肯放手的方式,也使他们想到这笔钱。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: 肖腾尔为你演播的《渡边淳衣大师》临终绝笔。再爱一次。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: 因此呢,夏天规定,要煮到烂熟为主。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: 但是一千五百人,怎么可能挡住上万的骑兵呢?政绩只得退入了城中。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: 太后长天大日讨得没事,便逗着她玩,教她识字读书,讲三国故事给她听。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: 这为他能迅速地成为起义领袖奠定了基础。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 7

    Text: 在自己家的地面和桌岩留下了擦尸状血迹。最后,他畏罪上吊,自杀了。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 8

    Text: 买卖场故意车拉到郊外,自己挖坑给埋了。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR