Disentangled Speech Representation Learning for
One-Shot Cross-lingual Voice Conversion Using β-VAE

Abstract

We propose an unsupervised learning method that disentangles speech into a content representation and a speaker identity representation. We apply this method to the challenging one-shot cross-lingual voice conversion task to demonstrate the effectiveness of the disentanglement. Inspired by β-VAE, we introduce a learning objective that balances the amounts of information about the speech captured by the speaker representation and by the content representation. In addition, the inductive biases from the architectural designs and the training dataset further encourage the desired disentanglement. Both objective and subjective evaluations show the effectiveness of the proposed method in speech disentanglement and one-shot cross-lingual voice conversion.
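For readers unfamiliar with β-VAE, the sketch below spells out the kind of objective the abstract describes: the evidence lower bound with reweighted KL terms, written here with a separate content posterior q(z_c | x) and speaker posterior q(z_s | x), each with its own weight. This two-weight form is an illustrative reading of the "balance" mentioned above, not necessarily the exact loss used in the paper.

```latex
% β-VAE-style objective (maximized), split into content and speaker branches.
% z_c: content representation, z_s: speaker identity representation.
% β_c and β_s control how much information each branch may keep about x (assumed form).
\mathcal{L}(\theta, \phi; x) =
    \mathbb{E}_{q_\phi(z_c, z_s \mid x)}\!\left[\log p_\theta(x \mid z_c, z_s)\right]
    - \beta_c \,\mathrm{KL}\!\left(q_\phi(z_c \mid x)\,\|\,p(z_c)\right)
    - \beta_s \,\mathrm{KL}\!\left(q_\phi(z_s \mid x)\,\|\,p(z_s)\right)
```

Choosing β_c and β_s differently changes how tightly each posterior is pulled toward its prior, which is one way to trade off how much of an utterance is explained by the content branch versus the speaker branch.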


Code is available here


Converted speech samples

Experimental conditions

  • All compared models are trained on the combination of VCTK [1] and AISHELL-3 [2].
  • All converted utterances are generated by a HiFi-GAN vocoder [3] trained on VCTK and AISHELL-3 (see the inference sketch after this list).
  • All source and target utterances listed below are unseen during training.
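The sketch below illustrates the one-shot conversion flow these conditions imply: the content representation is extracted from the source utterance, the speaker representation from the single (unseen) target utterance, and the decoded mel-spectrogram is rendered to a waveform by the HiFi-GAN vocoder [3]. All names and signatures here (convert_one_shot, the encoder/decoder/vocoder modules) are hypothetical placeholders for illustration, not the released code.

```python
import torch


def convert_one_shot(
    source_mel: torch.Tensor,          # (T_src, n_mels) mel-spectrogram of the source utterance
    target_mel: torch.Tensor,          # (T_tgt, n_mels) mel-spectrogram of the target utterance
    content_encoder: torch.nn.Module,  # speech -> content representation
    speaker_encoder: torch.nn.Module,  # speech -> speaker identity representation
    decoder: torch.nn.Module,          # (content, speaker) -> mel-spectrogram
    vocoder: torch.nn.Module,          # mel-spectrogram -> waveform, e.g. HiFi-GAN [3]
) -> torch.Tensor:
    """Keep what is said in the source utterance, borrow who says it from the target."""
    with torch.no_grad():
        z_content = content_encoder(source_mel.unsqueeze(0))  # linguistic content
        z_speaker = speaker_encoder(target_mel.unsqueeze(0))  # speaker identity
        mel_hat = decoder(z_content, z_speaker)               # converted mel-spectrogram
        wav_hat = vocoder(mel_hat)                            # waveform synthesis
    return wav_hat.squeeze(0)
```

Because the speaker representation is taken from a single target utterance, the same flow covers all four conversion directions below, including the cross-lingual ones.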

Compared models

  • VQMIVC: Baseline method based on vector quantization and mutual information minimization [4].
  • AdaIN-VC: Baseline method based on instance normalization [5].
  • β-VAEVC: The proposed method, based on a modified β-VAE.

Results

English → English

Source Target VQMIVC AdaIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

Mandarin → Mandarin

Source Target VQMIVC AdaIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

English → Mandarin

Source Target VQMIVC AdaIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

Mandarin → English

Source Target VQMIVC AdaIN-VC β-VAEVC
Sample 1
Sample 2
Sample 3
Sample 4

References

[1] Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92), 2019.

[2] Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, and Ming Li. AISHELL-3: A multi-speaker Mandarin TTS corpus. In Proc. Interspeech 2021, pp. 2756–2760.

[3] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.

[4] Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, and Helen Meng. VQMIVC: Vector quantization and mutual information-based unsupervised speech representation disentanglement for one-shot voice conversion. In Proc. Interspeech 2021, pp. 1344–1348.

[5] Ju-Chieh Chou and Hung-yi Lee. One-shot voice conversion by separating speaker and content representations with instance normalization. In Proc. Interspeech 2019, pp. 664–668.