Authors: Stefano Ciapponi, Francesco Paissan, Alberto Ancilotto, Elisabetta Farella
Email: sciapponi@fbk.eu, fpaissan@fbk.eu, aancilotto@fbk.eu, efarella@fbk.eu
Abstract
Neural vocoders convert time-frequency representations, such
as mel-spectrograms, into corresponding time representations.
Vocoders are essential for generative applications in audio (e.g.
text-to-speech and text-to-audio). In the Internet of Sounds
domain, generating speech signals at the edge enables the
deployment of smart assistants that leverage text-to-speech
pipelines. This paper presents a scalable vocoder architecture
for small-footprint edge devices. We evaluate the model's
capabilities qualitatively and quantitatively on single-speaker
and multi-speaker datasets and benchmark inference speed and
memory consumption on four microcontrollers. Additionally, we
study the power consumption on an ARM Cortex-M7-powered
board. Our results demonstrate the feasibility of deploying
neural vocoders on resource-constrained edge devices, potentially
enabling new applications in IoT, edge computing, IoS, and
embedded audio scenarios. This is supported by our best-performing
model, which achieves a MOS of 3.95/5 while using 1.5 MiB of flash
and 517 KiB of RAM and consuming 252 mW during inference of a 1 s
audio clip.
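For readers unfamiliar with the metric, a Mean Opinion Score (MOS) is the arithmetic mean of listener ratings on a 1 to 5 scale. The sketch below is a generic illustration, not the evaluation procedure used for the reported 3.95/5. The function name `mean_opinion_score`, the normal-approximation confidence interval, and the ratings are all assumptions made for the example.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, 95% CI half-width) for listener ratings on a 1-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie in [1, 5]")
    mos = statistics.mean(ratings)
    # Normal-approximation interval: 1.96 times the standard error of the mean.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings for illustration only; not data from the paper.
ratings = [4, 4, 5, 3, 4, 4, 3, 5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of listeners the interval is wide, which is why MOS figures are usually reported over many ratings per system.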
Exported Audio
LibriTTS
Audio samples (columns): Reference, Xinet, Phinet, Vocos
Tacotron
Audio samples (columns): Text, Waveglow, Xinet, Phinet, Vocos
In the next sections you are going to hear a lot of dystopian fiction quotes
Reality is that which, when you stop believing in it, doesn't go away.
You don't have to burn books to destroy a culture. Just get people to stop reading them.
We've got to have rules and obey them. After all, we're not savages.
The greatest ideas are the simplest.
In a time of deceit, telling the truth is a revolutionary act.
If liberty means anything at all, it means the right to tell people what they do not want to hear.
All animals are equal, but some animals are more equal than others.