Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech

Abstract

Advances in speech representation and large language models have enhanced zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models struggle to capture the complex correlations between acoustic and semantic features, resulting in limited expressiveness and similarity. The primary reason lies in the complex relationship between semantic and acoustic features, which exhibits both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model leverages the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously, while the Coupled NAR model predicts the detailed tokens from the AR output, accounting for the interdependence between the semantic and acoustic aspects. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in both synthesis quality and efficiency.


Model Architecture


Figure 1: An overview of the Parallel GPT inference pipeline. (1) The Parallel Tokenizer Encoder derives parallel tokens from the reference speech. (2) The Parallel Autoregressive Language Model generates the top semantic and acoustic tokens conditioned on the text and the reference speech. (3) The Coupled Non-Autoregressive Transformer generates the semantic and acoustic tokens of the last two levels; combined with the top parallel tokens, these form the comprehensive parallel tokens. (4) The Parallel Tokenizer Decoder reconstructs high-quality speech from the comprehensive parallel tokens.
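To make the data flow concrete, here is a minimal Python sketch of the four inference stages described above. All module handles and method names (e.g. encoder, ar_lm.generate, nar, decoder) are hypothetical placeholders standing in for the four components in Figure 1, not a released API.

```python
import torch

@torch.no_grad()
def synthesize(text_ids, ref_wav, encoder, ar_lm, nar, decoder):
    """Hedged sketch of the Figure 1 pipeline; all modules are assumed pre-loaded."""
    # (1) Tokenize the reference speech into parallel (semantic + acoustic) tokens.
    ref_tokens = encoder(ref_wav)                      # [T_ref, n_levels, 2]

    # (2) Autoregressively generate the top semantic/acoustic token pair,
    #     conditioned on the target text and the reference tokens.
    top_tokens = ar_lm.generate(text_ids, ref_tokens)  # [T_gen, 2]

    # (3) Non-autoregressively fill in the remaining detail levels.
    detail_tokens = nar(top_tokens, ref_tokens)        # [T_gen, n_detail, 2]

    # (4) Combine into comprehensive parallel tokens and decode to a waveform.
    full_tokens = torch.cat([top_tokens.unsqueeze(1), detail_tokens], dim=1)
    return decoder(full_tokens)
```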



Figure 2: The structure of the Parallel Tokenizer. It is trained before the TTS model. The wav2vec 2.0, BEATs, and Campplus models use downloaded pretrained weights, and their parameters are frozen during training.
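A minimal sketch of how such a tokenizer could be composed, assuming (as the caption states) that wav2vec 2.0, BEATs, and Campplus act as frozen pretrained encoders while the quantizer and decoder are trained. The wrapper class, the role assigned to each encoder, and all argument names are illustrative assumptions, not the paper's exact implementation.

```python
import torch.nn as nn

class ParallelTokenizer(nn.Module):
    """Illustrative composition only: frozen pretrained encoders feed a
    trainable quantizer and decoder (names and roles are assumptions)."""

    def __init__(self, wav2vec2, beats, campplus, quantizer, decoder):
        super().__init__()
        self.semantic_enc = wav2vec2   # frozen wav2vec 2.0 (semantic stream, assumed)
        self.acoustic_enc = beats      # frozen BEATs (acoustic stream, assumed)
        self.spk_enc = campplus        # frozen Campplus speaker embedding (assumed)
        for module in (self.semantic_enc, self.acoustic_enc, self.spk_enc):
            for p in module.parameters():
                p.requires_grad_(False)  # pretrained weights stay frozen
        self.quantizer = quantizer     # trainable: features -> parallel tokens
        self.decoder = decoder         # trainable: tokens (+ speaker) -> waveform

    def encode(self, wav):
        return self.quantizer(self.semantic_enc(wav), self.acoustic_enc(wav))

    def decode(self, tokens, ref_wav):
        return self.decoder(tokens, self.spk_enc(ref_wav))
```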



Figure 3: The structure of the Parallel Autoregressive Language Model. It is designed to synthesize the target top tokens from the target text and the reference speech. We define the first dimension of the tokens obtained by the Parallel Tokenizer as the "top tokens," since they carry the most important information.
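The key idea of the parallel AR stage is that each decoding step emits a semantic and an acoustic top token simultaneously. The sketch below illustrates one way to realize this with two output heads on a shared decoder; the layer sizes, vocabulary sizes, and head names are simplifying assumptions, not the paper's configuration.

```python
import torch.nn as nn

class ParallelARLM(nn.Module):
    """Sketch: one AR decoder step produces two logit streams, so the top
    semantic and acoustic tokens are predicted in parallel (sizes assumed)."""

    def __init__(self, d_model=1024, n_heads=16, sem_vocab=4096, aco_vocab=4096):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=12)
        self.sem_head = nn.Linear(d_model, sem_vocab)  # semantic top-token logits
        self.aco_head = nn.Linear(d_model, aco_vocab)  # acoustic top-token logits

    def forward(self, tgt_emb, cond_emb, tgt_mask=None):
        # cond_emb: embedded target text plus reference-speech tokens (the prompt);
        # tgt_emb: embeddings of the top tokens generated so far (causally masked).
        h = self.decoder(tgt_emb, cond_emb, tgt_mask=tgt_mask)
        return self.sem_head(h), self.aco_head(h)      # one pair per time step
```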




Figure 4: The structure of the Coupled Non-Autoregressive Transformer. It is designed to generate the detailed tokens from the target top tokens and the reference speech.
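A minimal sketch of the coupled NAR idea: the remaining detail levels of both the semantic and acoustic streams are predicted in a single parallel pass over a shared Transformer, conditioned on the top tokens and the reference speech. The backbone depth, head layout, and the two detail levels per stream are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class CoupledNAR(nn.Module):
    """Sketch: semantic and acoustic detail tokens share (are coupled through)
    one Transformer and are predicted non-autoregressively (sizes assumed)."""

    def __init__(self, d_model=1024, n_heads=16, codebook=1024, detail_levels=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        # One classification head per (detail level, stream): semantic + acoustic.
        self.heads = nn.ModuleList(
            nn.Linear(d_model, codebook) for _ in range(detail_levels * 2)
        )

    def forward(self, top_emb, ref_emb):
        # Prepend the reference-speech tokens as a prompt, run one parallel pass,
        # then keep only the target positions.
        h = self.backbone(torch.cat([ref_emb, top_emb], dim=1))
        h = h[:, ref_emb.size(1):]
        return [head(h) for head in self.heads]   # logits per detail level and stream
```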

Experiments on English Dataset

  • The audio samples below are synthesized using the model proposed in this paper.
  • The LibriTTS dataset is used; it can be downloaded from https://www.openslr.org/60/.
  • LibriTTS is a multi-speaker English corpus comprising 585 hours of speech from over 2,300 speakers. Train-clean-100, train-clean-360, and train-other-500 are merged as the training set; dev-clean and dev-other as the development set; and test-clean and test-other as the test set.
  • The comparison systems are YourTTS, MaskGCT, VALL-E, TransferTTS, CosyVoice, E2TTS, and UniAudio.
  • In this section, we conduct detailed ablation experiments on the following variants: -only Wav2Vec 2.0, -only BEATs, -w/o parallel, and -only AR.
  • The first 2 to 3 seconds of the ground truth speech are extracted as reference speech, with the precise duration determined by the Voice Activity Detection (VAD) cutoff point. Care is taken to ensure that the extracted segment ends at the completion of a full word pronunciation.
  • Samples for Seen Speakers - Development Set

    Sample 1

    Text: And then he talked of her mother, and he made her pray

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: "Tom, we're having a problem with the gyro stabilizer," said Mark Faber, gray haired president of the Faber Electronics Company

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: This alternative in the Captain's plans (terminating the voyage a month earlier than his arrangements had contemplated) puzzled Randal

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR



    Samples for Unseen Speakers - Test Set

    Sample 1

    Text: So choose for yourself - to make a rush or tarry here.

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits."

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: The Missouri cabal, on the other hand, having three of their best men constantly at the Governor's side, were compelled to recognize their lack of justification

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: Yet were they little worse than what were insisted on before the battle of Naseby

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss

    Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR



Experiments on Chinese Dataset

  • The audio samples below are synthesized using the model proposed in this paper.
  • An internal Chinese dataset is used. It consists of read speech from Chinese novels.
  • The dataset includes 453,716 audio samples, totaling 1,062 hours of speech from 43,034 unique speakers. Two speech-cleaning pipelines, Emilia and NCSSD, are applied for preprocessing and noise reduction to ensure high-quality data.
  • The comparison systems are MaskGCT, CosyVoice, and E2TTS.
  • In this section, we conduct detailed ablation experiments on the following variants: -only Wav2Vec 2.0, -only BEATs, -w/o parallel, and -only AR.
  • The first 2 to 3 seconds of the ground truth speech are extracted as reference speech, with the precise duration determined by the Voice Activity Detection (VAD) cutoff point. Care is taken to ensure that the extracted segment ends at the completion of a full word pronunciation.
  • Samples for Seen Speakers - Development Set

    Sample 1

    Text: 倘若我们将一个人从其所属的文化世界中赶出去。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: 你要求到沿海挂职,这是积极进步,目的是高尚的。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: 咱党前流神威雄猛,如砍瓜切菜,杀散众人。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: 指出,这是以前法利亚长老所掘的那条地道的出口。基督山觉得他的四肢在发抖,他在一段木头上坐了下。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: 先露出极为远大的战略智慧和战略眼光。当曹策召开会议,他的一些手下并不赞同迎接现地,只有徐玉藤少数人支持他,大家在会议上展开激烈辩论。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: 正是这样的缘故。当他看到贾宝玉为林黛玉的诗汇夺魁,高兴得首舞祖导时,按耐不住一腔怨恨。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 7

    Text: 本集故事播讲完毕,请您继续收听。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 8

    Text: 我就只能寄希望于美军的狙击手死在我军的炮火之下。否则有这家伙防守。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 9

    Text: 生活只是修道,吃饭。什么时候活得不耐烦了?只要一手两个翘。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 10

    Text: 但邻居们看到,大多数时间,这个十岁的孩子,一个人默默嗦嗦在院子的简易造房里默不作声地做饭,又一个人默不作声地在屋里待着,到了晚上十点多就关灯睡觉。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR


    Samples for Unseen Speakers - Test Set


    Sample 1

    Text: 这笔他们即将继承到手的财产,他那种死死抓住财富,不肯放手的方式,也使他们想到这笔钱。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 2

    Text: 肖腾尔为你演播的《渡边淳衣大师》临终绝笔。再爱一次。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 3

    Text: 因此呢,夏天规定,要煮到烂熟为主。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 4

    Text: 但是一千五百人,怎么可能挡住上万的骑兵呢?政绩只得退入了城中。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 5

    Text: 太后长天大日讨得没事,便逗着她玩,教她识字读书,讲三国故事给她听。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 6

    Text: 这为他能迅速地成为起义领袖奠定了基础。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 7

    Text: 在自己家的地面和桌岩留下了擦尸状血迹。最后,他畏罪上吊,自杀了。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR

    Sample 8

    Text: 买卖场故意车拉到郊外,自己挖坑给埋了。

    Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR