Samples of Multi-Speaker Text-to-speech Synthesis Using Deep Gaussian Processes

Description

Each utterance is named using the following format: {speaker_id}_VOICEACTRESS100_{utterance_id}
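As an illustration of the naming format, the following minimal sketch splits an utterance name into its two IDs. The function name `parse_utterance_name`, the `UtteranceName` type, and the assumption that both IDs consist only of digits (as in the names listed on this page) are hypothetical and not part of the original work.

```python
import re
from typing import NamedTuple


class UtteranceName(NamedTuple):
    """Hypothetical container for the two fields of an utterance name."""
    speaker_id: str
    utterance_id: str


# Assumption: both IDs are digit strings, as in the names listed on this page.
_PATTERN = re.compile(
    r"^(?P<speaker_id>\d+)_VOICEACTRESS100_(?P<utterance_id>\d+)$"
)


def parse_utterance_name(name: str) -> UtteranceName:
    """Split '{speaker_id}_VOICEACTRESS100_{utterance_id}' into its fields."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognised utterance name: {name!r}")
    # IDs are kept as strings so that leading zeros are preserved.
    return UtteranceName(match["speaker_id"], match["utterance_id"])


print(parse_utterance_name("006_VOICEACTRESS100_097"))
# prints: UtteranceName(speaker_id='006', utterance_id='097')
```

Keeping the IDs as strings rather than integers means a parsed name can be reassembled into exactly the form it had in the listing.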

Speaker-balanced (trained with 115 utterances)

Utterance    Original    DNN    DGP (Proposed 1)    DGPLVM (Proposed 2)
006_VOICEACTRESS100_097
006_VOICEACTRESS100_099
010_VOICEACTRESS100_097
010_VOICEACTRESS100_099
022_VOICEACTRESS100_097
022_VOICEACTRESS100_099
063_VOICEACTRESS100_097
063_VOICEACTRESS100_099

Speaker-imbalanced (trained with 5 utterances)

Utterance    Original    DNN    DGP (Proposed 1)    DGPLVM (Proposed 2)
006_VOICEACTRESS100_097
006_VOICEACTRESS100_099
010_VOICEACTRESS100_097
010_VOICEACTRESS100_099
022_VOICEACTRESS100_097
022_VOICEACTRESS100_099
063_VOICEACTRESS100_097
063_VOICEACTRESS100_099