MusicGen-Large 3.3B · LoRA Fine-Tuning · Docker Ready

MusicGen Fine-Tuning Pipeline

Complete guide to fine-tuning Meta's MusicGen model with LoRA adapters for genre-specific music generation. From setup to deployment — everything you need to train on your own GPU.

📄 25 min read 🎯 Graduate Level 📅 March 2026

Overview

What Is This Pipeline?

This pipeline fine-tunes Meta's MusicGen — a state-of-the-art text-to-music generation model with 3.3 billion parameters — using LoRA (Low-Rank Adaptation) to specialize the model for specific music genres while keeping the base model frozen.

Instead of training all 3.3B parameters (which would require massive compute and risk catastrophic forgetting), LoRA injects small trainable adapters into the transformer's attention layers. This reduces trainable parameters by ~99.8% while still achieving strong genre specialization.

💡 Why LoRA?

A full fine-tune of MusicGen-Large requires ~26 GB of VRAM just for parameters + gradients. With LoRA (rank 64), you train only ~2M parameters, the adapter checkpoint is ~8 MB, and training fits comfortably on a single GPU with 24 GB+ VRAM.
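The arithmetic behind these figures is easy to sanity-check. A minimal sketch, assuming fp32 parameters and gradients; the layer shapes in `lora_params` are illustrative defaults, not MusicGen-Large's exact dimensions:

```python
def full_finetune_gb(n_params, bytes_per_value=4):
    """Memory for fp32 parameters + gradients, in GB (optimizer state excluded)."""
    return 2 * n_params * bytes_per_value / 1e9

def lora_params(d_model, rank, n_proj=4, n_layers=48):
    """Each adapted projection adds two low-rank matrices: d x r and r x d."""
    return n_layers * n_proj * 2 * d_model * rank

# 3.3B parameters in fp32, params + grads:
print(round(full_finetune_gb(3.3e9), 1))  # -> 26.4 (GB)
```

The adapter cost grows linearly in the rank, which is why even rank 64 stays a tiny fraction of the 3.3B frozen parameters.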

Key Capabilities

  • Multi-genre training — Train on multiple genres simultaneously or focus on a single genre
  • Stereo output — Full stereo generation at 32 kHz sample rate
  • Data augmentation — Built-in pitch shift, time stretch, gaussian noise, and gain augmentation
  • Automatic metadata — AI-generated text descriptions via Gemini or OpenAI for conditioning
  • Vocal removal — Automatic instrumental extraction using HT-Demucs
  • W&B integration — Real-time experiment tracking with Weights & Biases
  • Docker deployment — Containerized for serverless GPU training (RunPod, Lambda)
  • Checkpoint management — Automatic best-model saving, early stopping, and run isolation

Architecture

Pipeline Flow

The complete training pipeline follows this flow:

🎵 Raw Audio (WAV / MP3 / FLAC files organized by genre)
  → 🎤 Vocal Removal (HT-Demucs extracts instrumentals)
  → 📝 Metadata Gen (Gemini/OpenAI generates text descriptions)
  → 🧠 LoRA Training (fine-tune attention layers with adapters)
  → 🎶 Generation (genre-specialized music output)

Model Architecture

Under the hood, MusicGen consists of two main components:

| Component | Role | Details |
|---|---|---|
| EnCodec | Audio tokenizer | Compresses raw audio into discrete tokens using 8 codebooks with 2048 codes each. Stereo interleaved. |
| Transformer LM | Token predictor | A 3.3B-parameter decoder-only transformer that generates audio tokens conditioned on text prompts. |

LoRA adapters are injected into the Transformer LM's attention layers: specifically the q_proj, k_proj, v_proj, and out_proj linear layers in each transformer block.
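Mechanically, each adapted projection computes y = Wx + (alpha/r)·B(Ax), where W stays frozen and only A and B are trained; B is initialized to zero, so the adapter contributes nothing at the start of training. A toy sketch with illustrative 2×2 shapes (not the model's real dimensions):

```python
def matvec(M, x):
    """Multiply matrix M (list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, rank, alpha):
    """y = Wx + (alpha / rank) * B(Ax): frozen base plus low-rank adapter."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * d for b, d in zip(base, delta)]

# 2x2 frozen weight, rank-1 adapter
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, 0.5]]            # rank x d_in
B = [[0.0], [0.0]]          # d_out x rank, initialized to zero
x = [1.0, 2.0]

# With B = 0 the adapter contributes nothing: output equals Wx
print(lora_forward(W, A, B, x, rank=1, alpha=2))  # -> [1.0, 2.0]
```

As B moves away from zero during training, the adapter's low-rank update gradually steers the frozen projection toward the target genre.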

Prerequisites

GPU Requirements

| Tier | GPU | VRAM | Batch Size | Notes |
|---|---|---|---|---|
| Recommended | H100 / A100 80GB | 80 GB | 4–8 | Fastest training, supports large batches |
| Good | A100 40GB / A6000 | 40–48 GB | 2–4 | Comfortable for most experiments |
| Minimum | RTX 4090 / 3090 | 24 GB | 1–2 | Works with gradient accumulation, slower |

Software Requirements

  • Python 3.10+
  • CUDA 12.1+ with cuDNN 8
  • PyTorch 2.1.0+ (cu121)
  • Docker (optional, for containerized training)
  • NVIDIA Container Toolkit (for Docker GPU passthrough)

API Keys (Optional)

| Service | Purpose | Required? |
|---|---|---|
| Weights & Biases | Experiment tracking and visualization | Recommended |
| Google Gemini | Auto-generate text metadata for audio conditioning | Optional |

Installation

Option A: Bare Metal (Direct GPU Access)

1. Clone the repository

# Clone lumina-musicgen
git clone https://github.com/FoldArtists/lumina-musicgen.git
cd lumina-musicgen

2. Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate

# Install PyTorch with CUDA 12.1
pip install torch==2.1.0 torchaudio==2.1.0 \
  --index-url https://download.pytorch.org/whl/cu121

# Install project dependencies
pip install -e .

3. Verify the installation

python scripts/verify_deps.py
# Should output:
# ✓ PyTorch 2.1.0+cu121
# ✓ CUDA available
# ✓ AudioCraft 1.3.0
# ✓ PEFT 0.18.1

Option B: Docker (Recommended for Serverless)

1. Build the Docker image

cd lumina-musicgen
docker build -f docker/Dockerfile.trainer \
  -t lumina-musicgen-trainer:1.0 .

Image size: ~10 GB. Includes all frozen dependencies from the proven H100 training environment.

2. Verify the build

# Dry-run test (validates GPU, imports, config)
docker run --gpus all \
  -v /path/to/your/audio:/data \
  -v /tmp/test-output:/output \
  -e DRY_RUN=true \
  lumina-musicgen-trainer:1.0

Data Preparation

Directory Structure

Organize your audio files by genre in subdirectories:

/your/audio/data/
├── blues/
│   ├── track001.wav
│   ├── track002.wav
│   └── ...
├── jazz/
│   ├── track001.wav
│   └── ...
├── rock/
│   └── ...
└── classical/
    └── ...

Audio Requirements

| Property | Requirement | Notes |
|---|---|---|
| Format | .wav, .mp3, or .flac | WAV preferred for quality |
| Duration | ≥ 10 seconds | Shorter files are auto-skipped |
| Content | Instrumental preferred | Vocals are auto-removed if present |
| Quantity | 10+ tracks per genre | More data = better generalization |

💡 Using GTZAN Dataset

For experimentation, we provide a script to download the GTZAN dataset (1000 tracks, 10 genres):

python scripts/prepare_gtzan.py --output-dir /data/gtzan

Automatic Data Processing

The pipeline automatically handles these preprocessing steps during training:

  1. Manifest creation — Scans audio files and creates train.jsonl / val.jsonl splits
  2. Vocal removal — Uses HT-Demucs to extract instrumental stems (configurable)
  3. Metadata generation — Creates text descriptions using Gemini AI for text-conditioning
  4. Segmentation — Splits audio into 30-second segments at 32 kHz for training
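Step 1's manifest creation can be sketched roughly as follows. This is a hypothetical simplification (the real pipeline also filters by duration, attaches metadata, and segments audio); `build_manifests` and the entry schema here are illustrative, not the pipeline's actual code:

```python
import json
import random
from pathlib import Path

AUDIO_EXTS = {".wav", ".mp3", ".flac"}

def build_manifests(source_dir, dataset_dir, train_ratio=0.8, seed=0):
    """Scan a genre-organized audio tree and write train/val JSONL manifests."""
    source, out = Path(source_dir), Path(dataset_dir)
    out.mkdir(parents=True, exist_ok=True)
    # One entry per audio file; the genre is taken from the parent directory
    entries = [
        {"path": str(p), "genre": p.parent.name}
        for p in sorted(source.rglob("*"))
        if p.suffix.lower() in AUDIO_EXTS
    ]
    random.Random(seed).shuffle(entries)
    cut = int(len(entries) * train_ratio)
    for name, split in (("train.jsonl", entries[:cut]), ("val.jsonl", entries[cut:])):
        with open(out / name, "w") as f:
            for entry in split:
                f.write(json.dumps(entry) + "\n")
    return len(entries[:cut]), len(entries[cut:])
```

Because the split is derived from whatever files exist at scan time, manifests go stale when the dataset changes, which is why the troubleshooting section recommends deleting train.jsonl / val.jsonl after editing your data.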

Configuration Reference

Configuration uses OmegaConf YAML with a base + experiment override pattern. The base config (configs/base.yaml) defines all defaults. Experiment configs in configs/experiments/ override specific values.

🧠 Model Settings

| Key | Default | Description |
|---|---|---|
| model.base | facebook/musicgen-stereo-large | HuggingFace model ID. Also supports medium and small variants. |
| model.sample_rate | 32000 | Audio sample rate in Hz |
| model.segment_duration | 30 | Training segment length in seconds |
| model.channels | 2 | Stereo (2) or mono (1) |

🔧 LoRA Settings

| Key | Default | Description |
|---|---|---|
| lora.rank | 16 | LoRA rank. Higher = more capacity but slower. Try 32–64 for genre specialization. |
| lora.alpha | 32 | LoRA scaling factor. Rule of thumb: alpha = 2× to 3× rank. |
| lora.target_modules | [q_proj, v_proj, k_proj, out_proj] | Attention layers to apply LoRA to |
| lora.dropout | 0.05 | LoRA dropout for regularization |

⚡ Training Settings

| Key | Default | Description |
|---|---|---|
| training.epochs | 7 | Number of training epochs |
| training.batch_size | 4 | Batch size. Reduce to 1–2 on 24 GB GPUs. |
| training.optimizer.lr | 1e-5 | Learning rate. Use 1e-4 for aggressive fine-tuning. |
| training.scheduler.name | cosine | LR schedule type with warmup |
| training.early_stopping.patience | 3 | Stop if val_loss doesn't improve for N epochs |
| training.gradient_accumulation_steps | 1 | Simulate larger batch sizes on small GPUs |

📊 Data Settings

| Key | Default | Description |
|---|---|---|
| data.source_dir | /data/gtzan/instrumental | Path to raw audio files |
| data.dataset_dir | /data/gtzan/processed | Path for processed manifests |
| data.splits.train | 0.80 | Train/val/test split ratio |
| data.min_duration | 10.0 | Skip audio shorter than this (seconds) |

🎛️ Augmentation

| Key | Default | Description |
|---|---|---|
| augmentation.pitch_shift | ±2 semitones, p=0.4 | Random pitch shifting |
| augmentation.time_stretch | 0.9×–1.1×, p=0.3 | Random tempo changes |
| augmentation.gaussian_noise | 0.001–0.01 amp, p=0.2 | Noise injection for robustness |
| augmentation.gain | ±3 dB, p=0.3 | Volume variation |

Creating an Experiment Config

Create a YAML file in configs/experiments/ that overrides only the settings you want to change:

# configs/experiments/my_experiment.yaml
data:
  source_dir: "/path/to/my/audio"
  dataset_dir: "/path/to/processed"

lora:
  rank: 64
  alpha: 192        # 3× rank

training:
  epochs: 50
  batch_size: 2      # For 24 GB GPUs
  optimizer:
    lr: 1.0e-4

logging:
  wandb:
    name: "my-experiment"
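The override mechanics amount to a recursive dictionary merge: the experiment file sets only the keys it changes, and everything else falls through to the base config. A stdlib sketch of the idea (the pipeline itself uses OmegaConf's merge):

```python
def merge(base, override):
    """Recursively merge override into base, returning a new dict."""
    out = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = merge(out[key], value)  # descend into nested sections
        else:
            out[key] = value                   # leaf values win outright
    return out

base = {"lora": {"rank": 16, "alpha": 32}, "training": {"epochs": 7}}
experiment = {"lora": {"rank": 64, "alpha": 192}}

cfg = merge(base, experiment)
print(cfg["lora"]["rank"], cfg["training"]["epochs"])  # -> 64 7
```

Note that unspecified sections (here, training) survive the merge untouched, which is why experiment configs can stay short.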

Training Guide

Running a Training Job

1. Set environment variables

export WANDB_API_KEY="your_wandb_key"
export GEMINI_API_KEY="your_gemini_key"  # optional

2. Launch training

# Dry run first (prints config, no training)
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml \
  --dry-run

# Actual training
python scripts/train_adapter.py \
  --config configs/experiments/my_experiment.yaml

3. Monitor on Weights & Biases

Training metrics are logged to W&B in real-time: train_loss, val_loss, learning rate, gradient norms, and audio samples at configurable intervals.

Training Tips

✅ Recommended Hyperparameters

Based on validated H100 training runs:

  • LoRA rank 64, alpha 192 — Best balance for genre adaptation
  • Learning rate 1e-4 — Aggressive enough for small datasets
  • Batch size 2 — Safe on all GPUs ≥ 24 GB
  • Early stopping patience 15 — Generous for long runs
  • Cosine schedule with 200 warmup steps
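The cosine-with-warmup schedule from this recipe can be sketched as a pure function of the step. A simplified sketch; the actual scheduler implementation may differ in details such as a minimum LR floor:

```python
import math

def lr_at(step, total_steps, base_lr=1e-4, warmup=200):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if step < warmup:
        return base_lr * (step + 1) / warmup               # linear ramp
    progress = (step - warmup) / max(1, total_steps - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay

print(lr_at(200, 1000))  # peak right after warmup -> 0.0001
```

The warmup phase avoids large early updates while the freshly initialized adapters are still near zero; the cosine tail anneals smoothly toward zero by the final step.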
⚠️ Common Pitfalls
  • CUDA OOM — Reduce batch_size to 1 and increase gradient_accumulation_steps
  • Stale manifests — Delete train.jsonl / val.jsonl if you change your dataset
  • Windows line endings — If running on Linux from Windows-edited files, run sed -i 's/\r$//' config.yaml

Docker Deployment

The Docker image packages the entire training environment with frozen, proven dependencies. This is the recommended approach for serverless GPU platforms.

Volume Mounts

| Host Path | Container | Purpose |
|---|---|---|
| /path/to/audio | /data | Audio files (WAV/MP3/FLAC) |
| /path/to/configs | /config | Experiment YAML overrides |
| /path/to/output | /output | Checkpoints, samples, logs |

Environment Variables

| Variable | Required | Description |
|---|---|---|
| EXPERIMENT_CONFIG | No | Config file name or path. Defaults to base.yaml |
| WANDB_API_KEY | No | W&B logging. Disabled if missing. |
| GEMINI_API_KEY | No | Gemini metadata generation |
| DRY_RUN | No | Set true for config validation only |
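A hypothetical sketch of how a container entrypoint could interpret these variables. The names match the table, but `read_settings` and its parsing logic are illustrative, not the image's actual entrypoint code:

```python
import os

def read_settings(env):
    """Derive run settings from an environment mapping (e.g. os.environ)."""
    return {
        "config": env.get("EXPERIMENT_CONFIG", "base.yaml"),  # default config
        "wandb_enabled": bool(env.get("WANDB_API_KEY")),      # off if unset
        "dry_run": env.get("DRY_RUN", "").lower() == "true",  # opt-in flag
    }

# e.g. settings = read_settings(os.environ)
print(read_settings({"DRY_RUN": "true"}))
# -> {'config': 'base.yaml', 'wandb_enabled': False, 'dry_run': True}
```

Treating every variable as optional with a sensible fallback matches the table above: the container runs with defaults even when no environment is provided.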

Full Docker Run Command

docker run --gpus all \
  -v /home/user/audio:/data \
  -v /home/user/configs:/config \
  -v /home/user/output:/output \
  -e WANDB_API_KEY=your_key \
  -e EXPERIMENT_CONFIG=docker_preset_c.yaml \
  lumina-musicgen-trainer:1.0

Serverless Platforms

Tested on these serverless GPU providers:

  • RunPod — Use GPU Pod with Docker image. Mount network volumes for data persistence.
  • Lambda Labs — Push image to ECR/DockerHub, launch via API.
  • Vast.ai — Upload image, select GPU tier, configure volume mounts.

Outputs & Evaluation

Output Structure

/output/runs/experiment_20260305-151233/
├── config.yaml          # Frozen config snapshot
├── training.log         # Full training log
├── metadata.json        # Data metadata copy
├── checkpoints/
│   ├── epoch_1/         # Per-epoch adapter checkpoints
│   ├── epoch_2/
│   └── best/            # Best model (lowest val_loss)
│       ├── lora_A.pt
│       └── lora_B.pt
├── final/
│   └── adapter_final.pt # Final merged adapter
└── samples/
    ├── epoch_5_sample_0.wav
    └── epoch_10_sample_0.wav
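Given this layout, locating the best adapter from the most recent run is a small path exercise. A hypothetical helper that assumes the tree above; `find_best_adapter` is not part of the pipeline itself:

```python
from pathlib import Path

def find_best_adapter(runs_root):
    """Return the lora_*.pt files from the latest run's best/ checkpoint."""
    runs = sorted(Path(runs_root).glob("experiment_*"))  # timestamped names sort chronologically
    if not runs:
        raise FileNotFoundError(f"no runs under {runs_root}")
    best = runs[-1] / "checkpoints" / "best"
    return sorted(best.glob("lora_*.pt"))
```

Because run directories embed a timestamp, plain lexicographic sorting is enough to pick the latest one.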

Generated Audio Samples

The pipeline generates audio samples at configurable intervals during training. These allow you to aurally monitor how the model adapts to your target genre. Each sample is a 30-second stereo WAV at 32 kHz.

Evaluation Metrics

| Metric | What It Measures | Good Values |
|---|---|---|
| val_loss | How well the model predicts held-out audio tokens | Should decrease and stabilize. Our best: 3.73 |
| FAD (CLAP) | Fréchet Audio Distance — distribution similarity to reference audio | Lower is better. < 5.0 is good. |
| CLAP Score | Text-audio alignment score | Higher is better |
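For intuition, FAD models the generated and reference embedding sets as Gaussians and computes the Fréchet distance between them: FAD = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2(S1 S2)^(1/2)). A 1-D sketch with scalar mean and variance; real FAD uses full covariance matrices of CLAP or VGGish embeddings:

```python
import math

def fad_1d(mu1, var1, mu2, var2):
    """Frechet distance between two 1-D Gaussians."""
    return (mu1 - mu2) ** 2 + var1 + var2 - 2 * math.sqrt(var1 * var2)

print(fad_1d(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
```

The distance is zero only when both mean and variance match, so FAD penalizes generated audio whose embedding statistics drift from the reference genre.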

Troubleshooting

❌ CUDA Out of Memory

Cause: Batch size too large for available VRAM.
Fix: Reduce training.batch_size to 1 or 2. Increase training.gradient_accumulation_steps to compensate.

# Effective batch size = batch_size × gradient_accumulation_steps
training:
  batch_size: 1
  gradient_accumulation_steps: 4  # Effective batch size = 4
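To see why accumulation reproduces the larger batch, here is a toy numeric check with a one-parameter linear model (the data values are illustrative, not real training data):

```python
def grad(w, x, y):
    """d/dw of the squared error 0.5*(w*x - y)^2 for a 1-parameter model."""
    return (w * x - y) * x

w = 0.5
data = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (0.5, 0.25)]

# Accumulate 4 micro-batches of size 1, scaling each by 1/4 ...
accum = 0.0
for x, y in data:
    accum += grad(w, x, y) / len(data)

# ... which matches the mean gradient of one batch of size 4
big_batch = sum(grad(w, x, y) for x, y in data) / len(data)
print(abs(accum - big_batch) < 1e-12)  # -> True
```

The equivalence holds because the batch gradient is a mean over samples; accumulation just computes that mean in pieces before the single optimizer step.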
❌ num_samples = 0 (No audio found)

Cause: data.source_dir path is wrong, or stale manifests exist.
Fix:

  1. Verify audio files exist at the configured source_dir path
  2. Delete stale manifests: rm /path/to/dataset_dir/train.jsonl /path/to/dataset_dir/val.jsonl
  3. Check for Windows line endings in YAML: sed -i 's/\r$//' config.yaml
❌ Docker: Permission denied on /var/run/docker.sock

Fix: Use sudo docker or add your user to the docker group:

sudo usermod -aG docker $USER
# Log out and back in for changes to take effect
❌ Docker: Audio loading returns empty tensors

Cause: Missing audio codec libraries in container.
Fix: The Dockerfile already includes FFmpeg dev headers. If you rebuild, ensure libavformat-dev and libavcodec-dev are installed.

❌ Model loads but generates noise

Cause: Likely training too aggressively (high LR, too many epochs).
Fix: Reduce learning rate, add early stopping, check that text descriptions in metadata are reasonable.