TinyVOCOS Companion Website

Authors: Stefano Ciapponi, Francesco Paissan, Alberto Ancilotto, Elisabetta Farella
Email: sciapponi@fbk.eu, fpaissan@fbk.eu, aancilotto@fbk.eu, efarella@fbk.eu

Abstract

Neural vocoders convert time-frequency representations, such as mel-spectrograms, into the corresponding time-domain waveforms.
Vocoders are essential for generative applications in audio (e.g. text-to-speech and text-to-audio).
In the Internet of Sounds (IoS) domain, generating speech signals at the edge enables the deployment of smart assistants that rely on text-to-speech pipelines.
This paper presents a scalable vocoder architecture for small-footprint edge devices. We evaluate the model's capabilities qualitatively and quantitatively on single-speaker and multi-speaker datasets, and benchmark inference speed and memory consumption on four microcontrollers.
Additionally, we study the power consumption on an ARM Cortex-M7-powered board. Our results demonstrate the feasibility of deploying neural vocoders on resource-constrained edge devices, potentially enabling new applications in IoT, edge computing, IoS, and embedded audio scenarios.
This is supported by our best-performing model, which achieves a MOS of 3.95/5 while using 1.5 MiB of Flash and 517 KiB of RAM and consuming 252 mW during inference on a 1 s audio clip.

Exported Audio

LibriTTS

Reference	Xinet	Phinet	Vocos

Tacotron

Text	Waveglow	Xinet	Phinet	Vocos
In the next sections you are going to hear a lot of dystopian fiction quotes
Reality is that which, when you stop believing in it, doesn't go away.
You don't have to burn books to destroy a culture. Just get people to stop reading them.
We've got to have rules and obey them. After all, we're not savages.
The greatest ideas are the simplest.
In a time of deceit, telling the truth is a revolutionary act.
If liberty means anything at all, it means the right to tell people what they do not want to hear.
All animals are equal, but some animals are more equal than others.