Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation

Alexander Fichtinger
Institute of Computational Perception
Johannes Kepler University Linz, Austria
alexanderfichtinger@icloud.com
Jan Schlüter
Institute of Computational Perception
Johannes Kepler University Linz, Austria
Gerhard Widmer
Institute of Computational Perception
Johannes Kepler University Linz, Austria

Abstract

Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model. In this work, we explore its application in the audio domain as a tool for data augmentation or content manipulation. Specifically, implementing Boomerang sampling for Stable Audio Open, we augment training data for a state-of-the-art beat tracker, and attempt to replace musical instruments in recordings. Our results show that Boomerang sampling mostly preserves the rhythmic structure of existing examples, that augmenting with it improves the performance of the beat tracker, but only when training data is limited, and that it can accomplish text-based instrument replacement on monophonic inputs. We publish our implementation to invite experiments on data augmentation in other tasks and to explore further applications.

Variations for Data Augmentation

We demonstrate Boomerang sampling on examples from the GTZAN genre dataset, which covers ten genres: classical, country, blues, disco, hiphop, jazz, metal, pop, reggae and rock. For each genre, one example was processed with Boomerang sampling at four noise levels: 0.2, 0.4, 0.6 and 0.8. The prompt was set to “Music”, with a negative prompt of “Low quality”.

Audio examples, one row per genre (Disco, Reggae, Pop, Hiphop, Rock, Metal, Blues, Country, Jazz, Classical), with columns: Original Audio, Noise Level 0.20, Noise Level 0.40, Noise Level 0.60, Noise Level 0.80.
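
The role of the noise level can be illustrated with a minimal sketch of the Boomerang idea: noise an existing example up to an intermediate level, then run only the remaining part of the reverse process. This is not the published implementation. The schedule (alpha = cos(t·π/2), sigma = sin(t·π/2)), the function names, and the `toy_step` denoiser are all assumptions made for illustration; in practice the step function would wrap the sampler of a pretrained model such as Stable Audio Open and operate on its latents.

```python
import numpy as np


def alpha_sigma(t):
    """Assumed variance-preserving schedule with alpha^2 + sigma^2 = 1."""
    return np.cos(t * np.pi / 2), np.sin(t * np.pi / 2)


def boomerang(x0, denoise_step, noise_level, num_steps=50, rng=None):
    """Sketch of Boomerang sampling for an input x0 (hypothetical helper).

    noise_level in [0, 1]: 0 leaves x0 untouched, 1 amounts to sampling
    from pure noise under the assumed schedule.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Forward process: jump directly to the intermediate noise level.
    alpha, sigma = alpha_sigma(noise_level)
    x = alpha * x0 + sigma * rng.standard_normal(x0.shape)
    # Reverse process: denoise only from noise_level down to 0.
    ts = np.linspace(noise_level, 0.0, num_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        x = denoise_step(x, t_cur, t_next)
    return x


def toy_step(x, t_cur, t_next):
    """Placeholder for a pretrained model's sampler step.

    This is the deterministic (DDIM-style) update that would be exact if
    the data were standard Gaussian; a real model would instead call a
    network conditioned on the text prompt.
    """
    return x * np.cos((t_cur - t_next) * np.pi / 2)


if __name__ == "__main__":
    x0 = np.sin(np.linspace(0, 20, 1000))  # stand-in for an audio latent
    for level in (0.2, 0.4, 0.6, 0.8):
        out = boomerang(x0, toy_step, level, rng=np.random.default_rng(0))
        print(level, float(np.linalg.norm(out - x0)))
```

With this toy denoiser, the distance between the output and the input grows with the noise level, which mirrors the intended behaviour: low noise levels stay close to the original, high noise levels allow larger variations.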

Text-Guided Audio Manipulation

Next, we show a guided variation of a guitar recording, using a specific text prompt (“Trumpet lead”) together with a larger guidance scale and noise level.
Original audio:

Transformed audio:

Combined audio:
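
As a rough illustration of what the guidance scale does, the following sketch shows the standard classifier-free guidance combination of two model predictions. It assumes that the negative prompt takes the place of the unconditional branch, which is a common convention but not confirmed here for Stable Audio Open; the function name and the example arrays are hypothetical.

```python
import numpy as np


def guided_prediction(pred_prompt, pred_negative, guidance_scale):
    """Classifier-free guidance: extrapolate from the negative-prompt
    prediction towards the prompt-conditioned prediction.

    guidance_scale = 1 returns the prompt-conditioned prediction as is;
    larger values push the output further towards the prompt.
    """
    return pred_negative + guidance_scale * (pred_prompt - pred_negative)


if __name__ == "__main__":
    # Stand-ins for two model outputs at one denoising step.
    pred_prompt = np.array([1.0, 2.0])    # e.g. conditioned on "Trumpet lead"
    pred_negative = np.array([0.0, 1.0])  # e.g. conditioned on "Low quality"
    print(guided_prediction(pred_prompt, pred_negative, 3.0))
```

A larger guidance scale strengthens the influence of the text prompt at every reverse step, while the noise level determines how many of those steps are run, so the two settings jointly decide how far the result departs from the original recording.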

Polyphonic recordings

Polyphonic recordings can also be modified, but since the text prompt conditions the entire reverse diffusion process, it affects all components of the mix rather than a single instrument.
Original audio:

Transformed audio with the specific text prompt (“Trumpet solo”):

Combined audio:

Reducing the noise level preserves more of the original sound. However, the balance between noise level and guidance scale remains crucial to producing high-quality output.
Original audio:

Transformed audio with the specific text prompt (“Synth lead”):

Code

We publish our code on GitHub to invite further research or artistic applications.