Knowledge Technology Group (WTM)
Department of Informatics, University of Hamburg

Accepted at IEEE ASRU 2021

Hearing Faces: Target Speaker Text-to-Speech Synthesis from a Face


Björn Plüster, Leyuan Qu, Cornelius Weber, Stefan Wermter

Abstract

Evidence is growing that a learnable cross-modal association exists between a person's face and their voice. This association provides the basis for target speaker text-to-speech (TTS) synthesis from a face reference. In this paper, we approach the task with a cross-modal model architecture that combines existing unimodal models. We use a Tacotron 2 multi-speaker TTS model with auditory speaker embeddings based on Global Style Tokens. Using transfer learning, we train a FaceNet face encoder to predict these embeddings from a static face image instead of a voice reference, and thus predict a speaker's voice and speaking characteristics from their face. Unlike Face2Speech, the only previous work on this task, we use pretrained and openly available models. This approach enables high-quality speech synthesis and an easily extensible model architecture. Experimental results show good matching ability while retaining better voice naturalness than Face2Speech. We examine the limitations of our model and discuss several possible avenues of improvement for future work.

Abstract read by our model:
Face reference
Generated audio (synthesized in 6.6 s, or ~7.6x real-time, on 1x GTX 1070)
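
The timing figures above can be related with a little arithmetic. The sketch below assumes that "7.6x real-time" means the generated audio is about 7.6 times longer than the time needed to synthesize it; the page does not state the audio duration, so the value computed here is only implied by that reading. The function names are illustrative.

```python
def real_time_factor(audio_seconds, synthesis_seconds):
    """Speed-up over real time: seconds of audio produced per second of compute."""
    return audio_seconds / synthesis_seconds


def implied_audio_seconds(synthesis_seconds, speedup):
    """Audio duration implied by a synthesis time and a real-time speed-up."""
    return synthesis_seconds * speedup


# Figures quoted on this page: 6.6 s of synthesis at ~7.6x real-time.
approx_audio = implied_audio_seconds(6.6, 7.6)
print(f"implied audio length: about {approx_audio:.0f} s")
```

Under this reading, the abstract recording would be roughly 50 seconds long.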

Architecture
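
As the abstract describes, a face encoder is trained to predict the speaker embeddings that the TTS model would otherwise obtain from a voice reference. The following is a minimal toy sketch of that idea, not the authors' implementation: random vectors stand in for FaceNet features and Global Style Token embeddings, a single linear layer stands in for the trainable part of the encoder, and the squared-error loss is an assumption, since this page does not state the training objective. All names and dimensions are hypothetical.

```python
import random


def init_head(d_in, d_out, seed=0):
    """Small random linear map from face features to a speaker embedding."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_in)] for _ in range(d_out)]


def predict(head, face_emb):
    """Apply the linear head to one face feature vector."""
    return [sum(w * x for w, x in zip(row, face_emb)) for row in head]


def mse(head, face_embs, voice_embs):
    """Mean squared error between predicted and target speaker embeddings."""
    total, count = 0.0, 0
    for x, y in zip(face_embs, voice_embs):
        for p, t in zip(predict(head, x), y):
            total += (p - t) ** 2
            count += 1
    return total / count


def train_head(head, face_embs, voice_embs, lr=0.05, epochs=200):
    """Plain SGD on the squared error (the factor 2 is folded into lr)."""
    for _ in range(epochs):
        for x, y in zip(face_embs, voice_embs):
            pred = predict(head, x)
            for i, row in enumerate(head):
                err = pred[i] - y[i]
                for j, xj in enumerate(x):
                    row[j] -= lr * err * xj
    return head


# Toy stand-ins (assumption): 4-d "face features" and 3-d "voice embeddings"
# related by a hidden linear map, so the regression is learnable.
rng = random.Random(1)
true_map = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)]
face_embs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
voice_embs = [predict(true_map, x) for x in face_embs]

head = init_head(4, 3)
loss_before = mse(head, face_embs, voice_embs)
train_head(head, face_embs, voice_embs)
loss_after = mse(head, face_embs, voice_embs)
print(f"MSE before: {loss_before:.4f}, after: {loss_after:.6f}")
```

At synthesis time, the predicted embedding would take the place of the voice-derived embedding as conditioning input to the TTS model; that step is omitted here.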



Demos

All samples are taken from the test set and the speakers are unseen during training.
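
Keeping test speakers unseen during training requires splitting the data by speaker rather than by utterance. The sketch below shows one way to do this; the speaker IDs, file names, split fraction, and function name are hypothetical, and this is not necessarily how the authors built their split.

```python
import random


def split_by_speaker(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, clip) pairs so no speaker appears in both sets."""
    speakers = sorted({spk for spk, _ in utterances})
    rng = random.Random(seed)
    rng.shuffle(speakers)
    n_test = max(1, round(len(speakers) * test_fraction))
    held_out = set(speakers[:n_test])
    train = [u for u in utterances if u[0] not in held_out]
    test = [u for u in utterances if u[0] in held_out]
    return train, test


# Hypothetical data: five speakers with two clips each.
utterances = [
    (spk, f"{spk}_clip{i}.wav")
    for spk in ["spk_a", "spk_b", "spk_c", "spk_d", "spk_e"]
    for i in range(2)
]
train_set, test_set = split_by_speaker(utterances)
print(len(train_set), "training clips,", len(test_set), "test clips")
```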

Speaker: 3OgsRa7VDtI

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: 8nt3edWLgIg

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: 1L6lSFiV4xo

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: 9TPFLHvn024

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: B8rmi95pYL0

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: BhMKmovNjvc

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: dYNc3P4jSt4

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: E0oPnS7rUwE

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: FKxaL8Iau8Q

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: fTSXvLzNd0o

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: Hj9oB4zpHww

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: HqekWfSJCSA

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: JFQLvbVJVMg

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: LjAsM1vAhW0

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: LujWrkYsl64

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: mjDjbJJlqZ0

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: P2AUat93a8Q

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: PhVP1tNeGyY

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: QKHkSUiQboA

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

Speaker: shG0ezBeeJc

Ground Truth Generated

Face reference:

From voice reference:

From face reference:

Voice reference:

BibTeX

    To be added soon.

Tests for fun

Nico Robot
Generated Audio:
Nico source

Nao Robot
Generated Audio:
Nao source

Pepper Robot
Generated Audio:
Pepper source