Pitch and dynamic transformation on human voice with deep neural networks

Presented during the IRCAM Forum @NYU 2022

The Crazy IRcam neural auto-encoder for voiCE

The Crazy IRcam neural auto-encoder for voiCE (CIRCE) is the first transformation tool in the next generation of neural audio effects developed by the Analysis/Synthesis team at IRCAM. It provides a graphical interface for voice transformations based on a neural auto-encoder. The app lets users change different parameters in recordings of speech and singing voice. CIRCE currently supports pitch and vocal effort, along with additional experimental features.

The transformations rely on the new neural vocoding model, which allows effects to be applied to the mel-spectrogram. A neural auto-encoder is trained to disentangle pitch and vocal effort from the mel-spectrogram of speech and singing voice. This yields a representation of the human voice that is independent of pitch and vocal effort, so the voice signal can be resynthesised with almost arbitrary changes to the original material. As a result, the model automatically adapts the voice characteristics to the given parameters.
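To make the data flow described above concrete, here is a minimal sketch of a conditional auto-encoder operating on mel-spectrogram frames. It is not CIRCE's implementation: the layer sizes, the function names (`controls`, `encode`, `decode`, `transpose`) and the single-layer random weights are all hypothetical, and no training takes place. It only illustrates the idea that a latent code is paired with new pitch and vocal-effort values at decoding time.

```python
import numpy as np

rng = np.random.default_rng(0)

N_MELS, N_LATENT = 80, 16  # hypothetical sizes, not taken from CIRCE

# Random, untrained weights: this only illustrates the data flow.
W_enc = rng.standard_normal((N_MELS + 2, N_LATENT)) * 0.1
W_dec = rng.standard_normal((N_LATENT + 2, N_MELS)) * 0.1


def controls(f0_hz, effort):
    """Stack log-pitch and vocal effort into a (frames, 2) control matrix."""
    return np.stack([np.log2(f0_hz), effort], axis=1)


def encode(mel, f0_hz, effort):
    """Map mel frames plus their original controls to a latent code."""
    x = np.concatenate([mel, controls(f0_hz, effort)], axis=1)
    return np.tanh(x @ W_enc)


def decode(latent, f0_hz, effort):
    """Rebuild mel frames from the latent code and (possibly new) controls."""
    x = np.concatenate([latent, controls(f0_hz, effort)], axis=1)
    return x @ W_dec


def transpose(f0_hz, semitones):
    """Shift a pitch contour by a number of equal-tempered semitones."""
    return f0_hz * 2.0 ** (semitones / 12.0)


frames = 100
mel = rng.standard_normal((frames, N_MELS))  # stand-in for a real mel-spectrogram
f0 = np.full(frames, 220.0)                  # constant 220 Hz pitch contour
effort = np.zeros(frames)                    # placeholder vocal-effort values

latent = encode(mel, f0, effort)
mel_up = decode(latent, transpose(f0, 12), effort)  # decode one octave higher
```

In a trained system, the latent code would be encouraged to carry no pitch or vocal-effort information, and the decoded mel-spectrogram would then be passed to the neural vocoder to obtain audio; both steps are outside the scope of this sketch.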

In this presentation we will give a brief introduction to the mechanism of neural vocoding, the technology behind this application. In the second part we will provide installation instructions, a demonstration of the application's features, and a few examples of transformed voice.

The application was developed as part of the PhD thesis of Frederik Bous, supervised by Axel Roebel. This research was funded by the ANR project ARS (ANR-19-CE38-0001-01).