Knowledge Technology Group (WTM)
Department of Informatics, University of Hamburg
Accepted at IEEE ASRU 2021
Hearing Faces: Target Speaker Text-to-Speech Synthesis from a Face
Björn Plüster, Leyuan Qu, Cornelius Weber, Stefan Wermter
Abstract
A learnable cross-modal association between a person's face and their voice has recently become increasingly evident.
This provides the basis for the task of target speaker text-to-speech (TTS) synthesis from face reference.
In this paper, we approach this task by proposing a cross-modal model architecture combining existing unimodal models.
We use Tacotron 2 multi-speaker TTS with auditory speaker embeddings based on Global Style Tokens.
We apply transfer learning to a FaceNet face encoder so that it predicts these embeddings from a static face image instead of a voice reference,
and thus predicts a speaker's voice and speaking characteristics from their face.
Unlike Face2Speech, the only previous work on this task, we use pretrained and openly available
models. This approach enables high-quality speech synthesis and allows for an easily extensible model architecture.
Experimental results show good face-voice matching
ability while retaining better voice naturalness than Face2Speech.
We examine the limitations of our model and discuss multiple possible avenues
of improvement for future work.
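
The transfer-learning idea in the abstract, training a face-side model to predict the speaker embedding that the TTS system would otherwise get from a voice reference, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: random vectors stand in for the outputs of a frozen face encoder and for the voice-derived speaker embeddings, the trainable part is a single linear layer, and the mean squared error loss is a guess rather than the loss used in the paper. The names `linear`, `mean_mse`, and `train_projection` are hypothetical.

```python
import random


def linear(weights, bias, x):
    """Dense layer y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def mean_mse(weights, bias, inputs, targets):
    """Squared error of the head, averaged over samples and dimensions."""
    total = 0.0
    for x, t in zip(inputs, targets):
        pred = linear(weights, bias, x)
        total += sum((p - q) ** 2 for p, q in zip(pred, t)) / len(t)
    return total / len(inputs)


def train_projection(face_feats, speaker_embs, epochs=200, lr=0.05, seed=0):
    """Fit a linear head (face features -> speaker embedding) with SGD on MSE."""
    rng = random.Random(seed)
    in_dim, out_dim = len(face_feats[0]), len(speaker_embs[0])
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    for _ in range(epochs):
        for x, target in zip(face_feats, speaker_embs):
            pred = linear(weights, bias, x)
            for j in range(out_dim):
                # Gradient of the per-sample MSE with respect to output j.
                grad = 2.0 * (pred[j] - target[j]) / out_dim
                for i in range(in_dim):
                    weights[j][i] -= lr * grad * x[i]
                bias[j] -= lr * grad
    return weights, bias


# Toy stand-ins (assumptions): 4-dim "face features" and 3-dim "speaker
# embeddings" linked by a hidden linear map, so the relation is learnable.
data_rng = random.Random(1)
hidden_map = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
face_feats = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
speaker_embs = [linear(hidden_map, [0.0] * 3, x) for x in face_feats]

# epochs=0 returns the untrained head, which gives the starting loss.
init_w, init_b = train_projection(face_feats, speaker_embs, epochs=0)
initial_loss = mean_mse(init_w, init_b, face_feats, speaker_embs)

weights, bias = train_projection(face_feats, speaker_embs)
final_loss = mean_mse(weights, bias, face_feats, speaker_embs)

# At synthesis time, the predicted embedding would condition the TTS model
# in place of an embedding computed from a voice recording.
predicted = linear(weights, bias, face_feats[0])
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

In a real system the face features and target embeddings would come from the pretrained models named above. The sketch only shows the regression setup that replaces the voice reference with a face reference.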
Example: face reference and generated audio, synthesized in 6.6 s (~7.6x real-time) on 1x GTX 1070.
Architecture
Demos
All samples are taken from the test set and the speakers are unseen during training.
Speaker: 3OgsRa7VDtI
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: 8nt3edWLgIg
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: 1L6lSFiV4xo
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: 9TPFLHvn024
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: B8rmi95pYL0
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: BhMKmovNjvc
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: dYNc3P4jSt4
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: E0oPnS7rUwE
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: FKxaL8Iau8Q
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: fTSXvLzNd0o
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: Hj9oB4zpHww
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: HqekWfSJCSA
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: JFQLvbVJVMg
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: LjAsM1vAhW0
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: LujWrkYsl64
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: mjDjbJJlqZ0
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: P2AUat93a8Q
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: PhVP1tNeGyY
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: QKHkSUiQboA
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
Speaker: shG0ezBeeJc
Ground Truth | Generated |
---|---|
Face reference: ![]() |
From voice reference: From face reference: |
Voice reference: |
To be added soon.
Nico Robot | Nao Robot | Pepper Robot |
---|---|---|
Generated Audio: | Generated Audio: | Generated Audio: |
Nico source | Nao source | Pepper source |