Fine-tune MusicGen if you want to generate music in a certain style, whether that’s 16-bit video game chiptunes or something calm and choral.

A full training run takes 15 minutes on 8x A40 (Large) hardware. You can run your fine-tuned model from the web or through the cloud API, or you can download the fine-tuned weights for use in other contexts.

The fine-tuning process was developed by Jongmin Jung (aka sake). It’s based on Meta’s AudioCraft and their built-in trainer, Dora. To make training simple, sake has included automatic audio chunking, auto-labeling, and vocal removal features. Your trained model can also generate music longer than 30 seconds.
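
To give a feel for what automatic audio chunking involves, here is a minimal sketch in Python. It is not sake's implementation. The function name `chunk_boundaries`, the 30-second window, and the 5-second cutoff for dropping a short final chunk are all assumptions made for illustration. The sketch only computes where a track would be cut; it does not read or write audio.

```python
from typing import List, Tuple


def chunk_boundaries(
    duration_s: float,
    chunk_s: float = 30.0,    # assumed window length, not taken from the trainer
    min_tail_s: float = 5.0,  # assumed cutoff below which a final partial chunk is dropped
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that split a track into fixed windows."""
    if duration_s <= 0 or chunk_s <= 0:
        return []

    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        # Full windows are always kept; a shorter final window is kept
        # only if it is at least min_tail_s long.
        if end - start >= min(chunk_s, min_tail_s):
            chunks.append((start, end))
        start += chunk_s
    return chunks


if __name__ == "__main__":
    # A 75-second track yields two full windows and a 15-second tail.
    print(chunk_boundaries(75.0))
```

With these assumed defaults, a 75-second track is split into three chunks, while a 62-second track yields two chunks because the 2-second remainder is discarded.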

Here is an example of a choral fine-tune combined with 16-bit video game music (as a continuation):
