Samples of Multi-Speaker Text-to-speech Synthesis Using Deep Gaussian Processes

Description

DNN: DNN-based multi-speaker TTS using one-hot speaker codes (reproduction of [N. Hojo et al., 2018]).
DGP: Deep Gaussian process (DGP)-based TTS using one-hot speaker codes (Proposed).
DGPLVM: Deep Gaussian process latent variable model (DGPLVM)-based TTS (Proposed). Speaker representations are learned jointly with the acoustic model parameters.
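
For readers unfamiliar with one-hot speaker codes, the following is a minimal sketch of how such a code could be built. The function name `one_hot_speaker_code` and the `SPEAKERS` list are hypothetical; the list contains only the four speaker IDs that appear in the tables on this page, not necessarily the full training set, and how the code is fed to the acoustic model is not specified here.

```python
from typing import List, Sequence


def one_hot_speaker_code(speaker_id: str, speakers: Sequence[str]) -> List[float]:
    """Return a vector with 1.0 at the speaker's index and 0.0 elsewhere."""
    if speaker_id not in speakers:
        raise ValueError(f"unknown speaker: {speaker_id!r}")
    code = [0.0] * len(speakers)
    code[list(speakers).index(speaker_id)] = 1.0
    return code


# Illustrative speaker inventory (assumption: only the IDs shown on this page).
SPEAKERS = ["006", "010", "022", "063"]

print(one_hot_speaker_code("022", SPEAKERS))
```

In contrast to this fixed code, the DGPLVM system learns its speaker representation during training rather than taking it as a given input.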

Each utterance ID follows the format {speaker_id}_VOICEACTRESS100_{utterance_id}.
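
The naming format can be split back into its parts with a small parser. This is a sketch only: the names `UtteranceId` and `parse_utterance_id` are hypothetical, and the pattern assumes both fields are digit strings, as in the IDs listed on this page.

```python
import re
from typing import NamedTuple


class UtteranceId(NamedTuple):
    speaker_id: str
    utterance_id: str


# Assumption: both fields consist of digits only (e.g. "006" and "097").
_PATTERN = re.compile(r"^(\d+)_VOICEACTRESS100_(\d+)$")


def parse_utterance_id(name: str) -> UtteranceId:
    """Split '{speaker_id}_VOICEACTRESS100_{utterance_id}' into its two fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected utterance name: {name!r}")
    # Keep the fields as strings so leading zeros are preserved.
    return UtteranceId(match.group(1), match.group(2))


parsed = parse_utterance_id("006_VOICEACTRESS100_097")
print(parsed.speaker_id, parsed.utterance_id)
```

Keeping the fields as strings preserves the zero padding, so the parsed values can be used to rebuild the original name.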

Speaker-balanced (trained with 115 utterances)

Utterance	Original	DNN	DGP (Proposed 1)	DGPLVM (Proposed 2)
006_VOICEACTRESS100_097
006_VOICEACTRESS100_099
010_VOICEACTRESS100_097
010_VOICEACTRESS100_099
022_VOICEACTRESS100_097
022_VOICEACTRESS100_099
063_VOICEACTRESS100_097
063_VOICEACTRESS100_099

Speaker-imbalanced (trained with 5 utterances)

Utterance	Original	DNN	DGP (Proposed 1)	DGPLVM (Proposed 2)
006_VOICEACTRESS100_097
006_VOICEACTRESS100_099
010_VOICEACTRESS100_097
010_VOICEACTRESS100_099
022_VOICEACTRESS100_097
022_VOICEACTRESS100_099
063_VOICEACTRESS100_097
063_VOICEACTRESS100_099