AutoSing

Generating Vocals from Lyrics and Musical Accompaniment

Georg Streich     Luca A. Lanzendörfer     Florian Grötschla     Roger Wattenhofer

streichg@ethz.ch     lanzendoerfer@ethz.ch     fgroetschla@ethz.ch     wattenhofer@ethz.ch

ETH Zurich


Abstract

In this work, we introduce AutoSing, a novel framework for generating diverse, high-quality singing voices from provided lyrics and musical accompaniment. AutoSing extends an existing semantic token-based text-to-speech approach by incorporating the musical accompaniment as an additional conditioning input. This enables AutoSing to synchronize its vocal output with the rhythm and melodic nuances of the accompaniment while adhering to the provided lyrics. Our contributions include a novel training scheme for autoregressive audio models applied to singing voice synthesis, as well as ablation studies identifying the most effective way to condition on the accompaniment. We evaluate AutoSing with subjective listening tests, demonstrating its ability to generate coherent and creative singing voices. Furthermore, we open-source our codebase to foster further research in singing voice synthesis.
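
The abstract describes conditioning an autoregressive semantic-token model on both lyrics and accompaniment, but does not spell out the architecture here. The sketch below illustrates one common way such conditioning can be wired up: accompaniment and lyric tokens are concatenated as a prefix to the vocal-token sequence. The class name, vocabulary sizes, shared-vocabulary assumption, and omitted positional encodings are all illustrative simplifications, not AutoSing's actual interface.

```python
import torch
import torch.nn as nn

class VocalDecoder(nn.Module):
    def __init__(self, vocab_size=1024, d_model=512, n_heads=8, n_layers=8):
        super().__init__()
        # Single shared vocabulary for accompaniment, lyric, and vocal
        # tokens -- a simplification for this sketch. Positional
        # encodings are omitted for brevity.
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, accomp_tokens, lyric_tokens, vocal_tokens):
        # Prefix conditioning: accompaniment tokens, then lyric tokens,
        # then the vocal tokens generated so far.
        seq = torch.cat([accomp_tokens, lyric_tokens, vocal_tokens], dim=1)
        x = self.embed(seq)
        # Causal mask so each position attends only to earlier positions.
        n = x.size(1)
        mask = torch.triu(
            torch.full((n, n), float("-inf"), device=x.device), diagonal=1
        )
        h = self.backbone(x, mask=mask)
        # Logits for the next vocal token, read from the last position.
        return self.head(h[:, -1])

# Usage: all inputs are (batch, length) integer token ids.
model = VocalDecoder()
accomp = torch.randint(0, 1024, (1, 100))
lyrics = torch.randint(0, 1024, (1, 32))
vocals = torch.randint(0, 1024, (1, 10))
next_token_logits = model(accomp, lyrics, vocals)  # shape (1, 1024)
```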

Examples

Each example pairs an accompaniment track with vocals generated for the following lyrics (audio examples):

1. "I saw you standing under moonlight your eyes like diamonds in the sky / I felt a spark ignite oh couldn't help but catch your smile"
2. "I saw a cat sitting on a mat wearing a hat how about that / he looked at me with his big green eyes started to dance"
3. "we choose to go to the moon in this decade and do the other things / not because they are easy but because they are hard"

Interpolating Between Artist Embeddings

Artist embeddings allow us to control the singing voice independently of the musical accompaniment. Below, we interpolate between two artist embeddings to illustrate this control.

Audio examples at interpolation strengths α = 0, 1/3, 2/3, and 1.
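
A minimal sketch of the interpolation itself, assuming artist embeddings are fixed-size vectors drawn from a learned lookup table and blended linearly; the function name and embedding size are illustrative:

```python
import torch

def interpolate_artist(embed_a: torch.Tensor, embed_b: torch.Tensor,
                       alpha: float) -> torch.Tensor:
    """Linearly blend two artist embeddings:
    alpha=0 returns embed_a, alpha=1 returns embed_b."""
    return (1.0 - alpha) * embed_a + alpha * embed_b

# Placeholders standing in for rows of a learned artist-embedding table.
embed_a = torch.randn(256)
embed_b = torch.randn(256)

# The four audio examples above correspond to these interpolation strengths.
for alpha in (0.0, 1 / 3, 2 / 3, 1.0):
    voice_embedding = interpolate_artist(embed_a, embed_b, alpha)
```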

Effect of Lyrics Prompt Length

Audio examples for lyrics prompts at densities of 55, 110, and 165 words per minute (WPM).
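
As a minimal sketch of what these labels measure, assuming WPM is simply the word count of the lyrics prompt divided by the duration of the accompaniment segment (the duration below is illustrative):

```python
def words_per_minute(lyrics: str, duration_seconds: float) -> float:
    """Lyric density: word count divided by segment duration in minutes."""
    return len(lyrics.split()) / (duration_seconds / 60.0)

lyrics = "I saw a cat sitting on a mat wearing a hat how about that"
print(round(words_per_minute(lyrics, duration_seconds=8.0)))  # 105, i.e. ~110 WPM
```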