Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Abstract
Advances in speech representation and large language models have improved zero-shot text-to-speech (TTS) performance. However, existing zero-shot TTS models struggle to capture the correlations between acoustic and semantic features, which limits expressiveness and speaker similarity. The main difficulty is that the relationship between semantic and acoustic features has both independent and interdependent aspects. This paper introduces a TTS framework that combines autoregressive (AR) and non-autoregressive (NAR) modules to harmonize the independence and interdependence of acoustic and semantic information. The AR model uses the proposed Parallel Tokenizer to synthesize the top semantic and acoustic tokens simultaneously. The Coupled NAR model then predicts the detailed tokens from the AR output, taking the interdependence between semantic and acoustic aspects into account. Parallel GPT, built on this architecture, is designed to improve zero-shot text-to-speech synthesis through its parallel structure. Experiments on English and Chinese datasets demonstrate that the proposed model significantly outperforms existing zero-shot TTS models in synthesis quality and efficiency.
Model Architecture
Figure 4: The structure of the Coupled Non-Autoregressive Transformer. It is designed to generate detailed tokens from the target top tokens and the reference speech.
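The two-stage data flow described in the abstract can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the function names, the codebook size `VOCAB`, the split into one top level plus two detailed levels per stream, and the arithmetic that stands in for the actual networks. It only shows which tokens are produced step by step (the AR stage, two tokens per step) and which are produced for all frames at once (the NAR stage, conditioned on both top streams and the reference).

```python
from typing import List, Tuple

VOCAB = 8        # hypothetical codebook size
NUM_DETAIL = 2   # hypothetical number of detailed levels per stream


def ar_step(text: List[int], sem: List[int], ac: List[int]) -> Tuple[int, int]:
    """Placeholder AR step: one shared history, two output heads.

    Both tokens are emitted in the same step. The arithmetic below is a
    deterministic stand-in for a model, not a real predictor.
    """
    context = sum(text) + sum(sem) + sum(ac) + len(sem)
    sem_tok = context % VOCAB            # stand-in for the semantic head
    ac_tok = (context * 3 + 1) % VOCAB   # stand-in for the acoustic head
    return sem_tok, ac_tok


def generate_top_tokens(text: List[int], num_frames: int) -> Tuple[List[int], List[int]]:
    """Autoregressive loop: each step appends one semantic and one acoustic top token."""
    sem: List[int] = []
    ac: List[int] = []
    for _ in range(num_frames):
        s, a = ar_step(text, sem, ac)
        sem.append(s)
        ac.append(a)
    return sem, ac


def nar_fill(top_sem: List[int], top_ac: List[int], ref: List[int],
             level: int) -> Tuple[List[int], List[int]]:
    """Placeholder NAR pass: predicts one detailed level for all frames at once.

    Each output depends on BOTH top streams and on the reference tokens,
    which is the coupling this sketch is meant to show.
    """
    bias = sum(ref) + level
    sem_detail = [(s + a + bias) % VOCAB for s, a in zip(top_sem, top_ac)]
    ac_detail = [(2 * s + a + bias) % VOCAB for s, a in zip(top_sem, top_ac)]
    return sem_detail, ac_detail


def synthesize_tokens(text: List[int], ref: List[int],
                      num_frames: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Run the AR stage, then fill the detailed levels with the NAR stage."""
    top_sem, top_ac = generate_top_tokens(text, num_frames)
    semantic, acoustic = [top_sem], [top_ac]
    for level in range(1, NUM_DETAIL + 1):
        s, a = nar_fill(top_sem, top_ac, ref, level)
        semantic.append(s)
        acoustic.append(a)
    return semantic, acoustic


semantic, acoustic = synthesize_tokens([1, 2, 3], [4, 5], num_frames=4)
print(len(semantic), len(acoustic))  # 3 3
```

With these assumed sizes, each stream ends up with three token levels of four frames each. A real system would replace the placeholder arithmetic with trained Transformer modules and pass the tokens to a decoder to obtain a waveform.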
Experiments on English Dataset
- YourTTS: https://github.com/Edresson/YourTTS
- MaskGCT: https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
- VALL-E: https://github.com/lifeiteng/VALL-E
- TransferTTS: https://github.com/hcy71o/TransferTTS
- CosyVoice: https://github.com/FunAudioLLM/CosyVoice
- E2TTS: https://github.com/lucidrains/e2-tts-pytorch
- UniAudio: https://github.com/yangdongchao/UniAudio
- -only Wav2Vec 2.0: Uses only Wav2Vec 2.0 tokens for semantic representation
- -only BEATs: Uses only BEATs tokens for acoustic representation
- -w/o parallel: Merges semantic and acoustic features for RVQ encoding, which reduces the model to a traditional AR framework
- -only AR: Uses the AR model exclusively to predict all RVQ tokens (3 semantic + 3 acoustic) through multiple heads
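The ablations above refer to RVQ (residual vector quantization) tokens. As background, here is a minimal sketch of RVQ in plain Python. The three hand-written 2-D codebooks are hypothetical and chosen only so the arithmetic is easy to follow. This is not the Parallel Tokenizer itself, only the general technique: each level quantizes what the previous levels left unexplained, so the first token is the coarsest and later tokens add detail.

```python
from typing import List, Sequence

Vector = List[float]

# Hypothetical codebooks: three levels, each finer than the previous one.
CODEBOOKS: List[List[Vector]] = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [0.25, 0.25]],
    [[0.0, 0.0], [0.0625, 0.0], [0.0, 0.0625], [0.0625, 0.0625]],
]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Return the index of the codebook entry closest to x (squared L2 distance)."""
    def dist(entry: Vector) -> float:
        return sum((e - v) ** 2 for e, v in zip(entry, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x level by level; each level encodes the remaining residual."""
    residual = list(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct the vector by summing the selected entry from each level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


tokens = rvq_encode([1.3, 0.3], CODEBOOKS)
print(tokens)                         # [1, 3, 3]
print(rvq_decode(tokens, CODEBOOKS))  # [1.3125, 0.3125]
```

Decoding with only the first token gives `[1.0, 0.0]`, a coarse approximation; each further level moves the reconstruction closer to the input. This is one way to read the split between "top" and "detailed" tokens used on this page.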
Sample for Seen Speaker - Development Set
Sample 1
Text: And then he talked of her mother, and he made her pray
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 2
Text: He had stolen out during the half hour allowed at the works for tea, to buy them an orange or two, which now puffed out his jacket pocket
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 3
Text: "Tom, we're having a problem with the gyro stabilizer," said Mark Faber, gray haired president of the Faber Electronics Company
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 4
Text: And Mr Ossipon brings every week a pile of these f p tracts to sell at a halfpenny each
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 5
Text: This alternative in the Captain's plans (terminating the voyage a month earlier than his arrangements had contemplated) puzzled Randal
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample for Unseen Speaker - Test Set
Sample 1
Text: So choose for yourself-to make a rush or tarry here.
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 2
Text: They met a good many acquaintances; Mainhall, indeed, knew almost every one, and he babbled on incontinently, screwing his small head about over his high collar
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 3
Text: "Nothing whatever," replied the courtier, as pale as death; "but your majesty has not thought of Fruits."
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 4
Text: The Missouri cabal, on the other hand, having three of their best men constantly at the Governor's side, were compelled to recognize their lack of justification
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 5
Text: Yet were they little worse than what were insisted on before the battle of Naseby
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample 6
Text: A ring of amethyst I could not wear here, plainer to my sight, Than that first kiss
Groundtruth | YourTTS | MaskGCT | VALL-E | TransferTTS | CosyVoice | E2TTS | UniAudio | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Experiments on Chinese Dataset
- MaskGCT: https://github.com/open-mmlab/Amphion/tree/main/models/tts/maskgct
- CosyVoice: https://github.com/FunAudioLLM/CosyVoice
- E2TTS: https://github.com/lucidrains/e2-tts-pytorch
- -only Wav2Vec 2.0: Uses only Wav2Vec 2.0 tokens for semantic representation
- -only BEATs: Uses only BEATs tokens for acoustic representation
- -w/o parallel: Merges semantic and acoustic features for RVQ encoding, which reduces the model to a traditional AR framework
- -only AR: Uses the AR model exclusively to predict all RVQ tokens (3 semantic + 3 acoustic) through multiple heads
Sample for Seen Speaker - Development Set
Sample 1
Text: 倘若我们将一个人从其所属的文化世界中赶出去。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 2
Text: 你要求到沿海挂职,这是积极进步,目的是高尚的。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 3
Text: 咱党前流神威雄猛,如砍瓜切菜,杀散众人。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 4
Text: 指出,这是以前法利亚长老所掘的那条地道的出口。基督山觉得他的四肢在发抖,他在一段木头上坐了下。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 5
Text: 先露出极为远大的战略智慧和战略眼光。当曹策召开会议,他的一些手下并不赞同迎接现地,只有徐玉藤少数人支持他,大家在会议上展开激烈辩论。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 6
Text: 正是这样的缘故。当他看到贾宝玉为林黛玉的诗汇夺魁,高兴得首舞祖导时,按耐不住一腔怨恨。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 7
Text: 本集故事播讲完毕,请您继续收听。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 8
Text: 我就只能寄希望于美军的狙击手死在我军的炮火之下。否则有这家伙防守。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 9
Text: 生活只是修道,吃饭。什么时候活得不耐烦了?只要一手两个翘。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 10
Text: 但邻居们看到,大多数时间,这个十岁的孩子,一个人默默嗦嗦在院子的简易造房里默不作声地做饭,又一个人默不作声地在屋里待着,到了晚上十点多就关灯睡觉。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample for Unseen Speaker - Test Set
Sample 1
Text: 这笔他们即将继承到手的财产,他那种死死抓住财富,不肯放手的方式,也使他们想到这笔钱。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 2
Text: 肖腾尔为你演播的《渡边淳衣大师》临终绝笔。再爱一次。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 3
Text: 因此呢,夏天规定,要煮到烂熟为主。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 4
Text: 但是一千五百人,怎么可能挡住上万的骑兵呢?政绩只得退入了城中。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 5
Text: 太后长天大日讨得没事,便逗着她玩,教她识字读书,讲三国故事给她听。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 6
Text: 这为他能迅速地成为起义领袖奠定了基础。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 7
Text: 在自己家的地面和桌岩留下了擦尸状血迹。最后,他畏罪上吊,自杀了。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|
Sample 8
Text: 买卖场故意车拉到郊外,自己挖坑给埋了。
Groundtruth | MaskGCT | CosyVoice | E2TTS | Parallel GPT | -only Wav2Vec 2.0 | -only BEATs | -w/o parallel | -only AR |
---|---|---|---|---|---|---|---|---|