Overview
What Is This Pipeline?
This pipeline fine-tunes ACE-Step 1.5 — a state-of-the-art Diffusion Transformer (DiT) for text-to-music generation with 815 million parameters — using LoRA (Low-Rank Adaptation). Unlike autoregressive models (MusicGen, AudioLM), ACE-Step generates audio through flow matching: iteratively denoising a latent representation to produce high-fidelity music.
ACE-Step's DiT architecture has 48 transformer layers processing music latents, with cross-attention for text/lyrics conditioning. LoRA adapters (rank 64) inject ~12M trainable parameters into the attention layers — just 1.5% of the total — while the full model stays frozen. Training is therefore fast and memory-efficient, and adapters can be hot-swapped at inference time.
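The LoRA mechanism described above can be pictured in a few lines: instead of updating a frozen weight W, two low-rank matrices A and B learn a delta that is scaled by alpha/r and added to the frozen output. A minimal pure-Python sketch with toy dimensions (not the real 48-layer model):

```python
# Minimal LoRA forward-pass sketch: y = W x + (alpha / r) * B (A x).
# Toy dimensions for illustration; the real adapters use r=64 inside
# the DiT attention projections.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r                # LoRA scaling factor
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: model dim d=2, rank r=1
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen identity weight
A = [[1.0, 1.0]]              # r x d down-projection
B = [[0.5], [0.0]]            # d x r up-projection
y = lora_forward(W, A, B, [2.0, 3.0], alpha=2, r=1)
print(y)  # base [2, 3] + scale 2.0 * delta [2.5, 0] -> [7.0, 3.0]
```

Only A and B receive gradients during training, which is why the trainable fraction stays so small.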
Key Capabilities
- Multi-genre training — Train on 50-200 tracks spanning any mix of genres
- Text + lyrics conditioning — Separate cross-attention for style tags and lyrics
- Serverless execution — Fully containerized with Docker, runs on Lambda/RunPod/Modal
- Hot-swap adapters — Switch between adapters without reloading the base model
- WTA attribution — Integrated Wasserstein Trajectory Attribution for IP tracking
- S3 storage — Centralized checkpoint and dataset management on AWS S3
- Validated configs — Tested presets with 11/11 validation tests passing
Architecture
Pipeline Flow
The complete training pipeline follows this flow:
Model Architecture
| Component | Role | Details |
|---|---|---|
| Music DCAE | Audio encoder/decoder | Compresses raw audio into continuous latent representations. Replaces discrete tokenization. |
| DiT Decoder | 48-layer Diffusion Transformer | Generates music by iteratively denoising latent states using flow matching. |
| T5 Encoder | Text conditioning | Encodes style tags and lyrics for cross-attention guidance. |
LoRA adapters target the self-attention and cross-attention projections (`linear_q`, `linear_k`, `linear_v`, `to_q`, `to_k`, `to_v`, `to_out.0`) across all 48 DiT layers.
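The wrapping described above can be enumerated to sanity-check coverage. A short sketch (the `layers.{i}` module paths are illustrative, not the actual ACE-Step module names):

```python
# Sketch: enumerate the attention projections LoRA wraps in the DiT.
# Module path format is an assumption for illustration; the real names
# depend on the ACE-Step implementation.

TARGET_MODULES = ["linear_q", "linear_k", "linear_v",
                  "to_q", "to_k", "to_v", "to_out.0"]
NUM_DIT_LAYERS = 48

wrapped = [f"layers.{i}.{m}"
           for i in range(NUM_DIT_LAYERS)
           for m in TARGET_MODULES]
print(len(wrapped))  # 7 projections x 48 layers = 336 wrapped modules
```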
Quick Start
Clone and build
```bash
# Clone the training-pipelines repo
git clone https://github.com/aenfr/training-pipelines.git
cd training-pipelines/ace-step-1-5

# Build the Docker image
cd docker
docker build -f Dockerfile.trainer -t lumina-trainer:v1 .
```
Prepare your dataset
```bash
# Build HuggingFace dataset from your audio files
python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio
```
See Dataset Preparation for the full tutorial.
Smoke test (5 gradient steps)
```bash
MODE=smoke ./docker/train.sh
```
Validates GPU, data loading, and training loop in ~2 minutes.
Full training (Preset C)
```bash
# Runs with validated Preset C defaults
./docker/train.sh
```
Expected time: ~8 hours on A100 for 100 tracks × 100 epochs.
Training Presets
Preset C — Multi-Style (Production)
11/11 validation tests passed. Used for the production multi-style adapter.
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 64 | Capacity/overfitting balance |
| LoRA Alpha | 192 | 3× rank for scaled learning |
| LoRA Dropout | 0.05 | Mild regularization |
| Learning Rate | 5e-5 | Halved for multi-genre stability |
| Epochs | 100 | For ~100 tracks |
| Grad Accumulation | 4 | Effective batch = 4 |
| Grad Clip | 0.5 | Prevents explosion |
| Precision | bf16-mixed | Half-precision for efficiency |
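Two quantities in the preset are derived rather than tuned directly: the LoRA scaling factor alpha/r (3.0 here, versus the standard 2x used by the LOO baseline below) and the effective batch size. A quick check, assuming the usual formulas and a per-device micro-batch of 1 (an assumption inferred from "Effective batch = 4"):

```python
# Derived quantities for Preset C. Formulas are the standard ones;
# the per-device batch size of 1 is an assumption.
preset_c = {"rank": 64, "alpha": 192, "grad_accum": 4, "lr": 5e-5}

scaling = preset_c["alpha"] / preset_c["rank"]  # scale applied to B @ A
effective_batch = 1 * preset_c["grad_accum"]    # micro-batch x accumulation

print(scaling, effective_batch)  # 3.0 4
```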
LOO Baseline — Research Validation
Used for Leave-One-Out causal validation experiments on GTZAN (90 tracks, 10 genres).
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 64 | Same architecture |
| LoRA Alpha | 128 | Standard 2× rank |
| Learning Rate | 1e-4 | Higher for shorter runs |
| Epochs | 500 | Longer for smaller dataset |
Docker Setup
Container Volumes
| Container Path | Purpose | Mode |
|---|---|---|
| `/model` | ACE-Step base model weights | Read-only |
| `/data` | HuggingFace dataset (.arrow) | Read-only |
| `/output` | Training output & checkpoints | Read-write |
| `/lora-cfg` | LoRA config JSON | Read-only |
Entrypoint Auto-Patching
The container entrypoint automatically handles these known issues:
- TorchCodec incompatibility — Patches `torchaudio.load()` → `librosa.load()`
- Audio save failure — Patches `torchaudio.save()` → `soundfile.write()`
- Step-0 plot crash — Skips inference at step 0 to prevent gradient corruption

All audio compatibility patches are applied automatically by `entrypoint.sh` at container startup. You don't need to modify any source files.
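The patching above follows the standard monkey-patch pattern: replace the broken attribute with a compatible shim before training code imports it. A self-contained sketch using a stand-in module (`fake_audio_lib` is hypothetical; the real entrypoint swaps torchaudio calls for librosa/soundfile):

```python
# Monkey-patch pattern sketch. `fake_audio_lib` is a stand-in for the
# real torchaudio module; the actual entrypoint patches torchaudio.
import types

def broken_load(path):
    # Stands in for the incompatible TorchCodec-backed loader
    raise RuntimeError("TorchCodec error")

fake_audio_lib = types.SimpleNamespace(load=broken_load)

def librosa_style_load(path):
    # Compatible shim: returns (samples, sample_rate) like librosa.load
    return [0.0, 0.1, -0.1], 44100

# Apply the patch before any training code touches the loader
fake_audio_lib.load = librosa_style_load

samples, sr = fake_audio_lib.load("song_01.mp3")
print(sr)  # 44100
```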
Custom Presets
Override any parameter via environment variables:
# Example: High-rank, low-LR for single artist
EPOCHS=200 \
LEARNING_RATE=2e-5 \
GRAD_ACCUM=8 \
EXP_NAME="my_custom_preset" \
./docker/train.sh
For LoRA architecture changes, create a JSON config:
```json
{
  "r": 128,
  "lora_alpha": 256,
  "lora_dropout": 0.1,
  "target_modules": [
    "linear_q", "linear_k", "linear_v",
    "to_q", "to_k", "to_v", "to_out.0"
  ],
  "use_rslora": false
}
```
```bash
LORA_CONFIG=/path/to/my_config.json ./docker/train.sh
```
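Before pointing `LORA_CONFIG` at a file, it is worth validating the JSON locally. A hedged sketch (key names taken from the example above; the required-key set is an assumption, not the pipeline's actual schema check):

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"r", "lora_alpha", "target_modules"}  # assumed minimum

def validate_lora_config(path):
    """Load a LoRA config JSON and check for the assumed required keys."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return cfg

# Round-trip the example config from above
cfg = {"r": 128, "lora_alpha": 256, "lora_dropout": 0.1,
       "target_modules": ["linear_q", "linear_k", "linear_v",
                          "to_q", "to_k", "to_v", "to_out.0"],
       "use_rslora": False}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(cfg, f)
    path = f.name
loaded = validate_lora_config(path)
os.unlink(path)
print(loaded["r"])  # 128
```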
Dataset Preparation
Required Files Per Track
| File | Description | Example |
|---|---|---|
| Audio (`.mp3`/`.wav`/`.flac`) | Full mix, 30-60s recommended | `song_01.mp3` |
| Tags | Comma-separated style descriptors | `jazz, piano, smooth, 90bpm` |
| Lyrics | Song lyrics or `[Instrumental]` | `[verse]\nMidnight...` |
Build the Dataset
```bash
python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio
```
The model learns style associations through cross-attention with tags. Include 5-10 descriptive tags per track: vocal type, instruments, genre, mood, tempo (BPM), key. See the full Dataset Ingestion Guide in the repository docs.
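Each track can be pictured as one record combining the audio path, tag list, and lyrics. A hedged sketch of that shape (field names and the tag-count check are assumptions for illustration; the actual schema is defined by `build_multi_style_dataset.py`):

```python
# Hypothetical per-track record; the real field names are defined by
# scripts/build_multi_style_dataset.py.
def make_record(audio_path, tags, lyrics="[Instrumental]"):
    tag_list = [t.strip() for t in tags.split(",")]
    # Assumed check mirroring the 5-10 tag guidance above
    assert 5 <= len(tag_list) <= 10, "aim for 5-10 descriptive tags"
    return {"audio": audio_path, "tags": tag_list, "lyrics": lyrics}

rec = make_record(
    "song_01.mp3",
    "jazz, piano, smooth, 90bpm, instrumental",
)
print(rec["tags"])  # ['jazz', 'piano', 'smooth', '90bpm', 'instrumental']
```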
S3 Storage
Bucket Layout
```
s3://lumina-data-foldartists/
├── models/ace-step-1.5/     # Base model weights
├── lora/                    # Fine-tuned adapters
│   ├── multi-style-gen-c/   # ✅ Production
│   └── loo-subsets/         # Validation models
├── datasets/                # HF datasets
├── trajectories/            # WTA trajectory data
└── results/                 # Experiment outputs
```
Common Operations
```bash
# Download production adapter
aws s3 sync s3://lumina-data-foldartists/lora/multi-style-gen-c/ \
    ~/my_adapter/

# Upload new checkpoint after training
aws s3 sync /output/my_experiment/ \
    s3://lumina-data-foldartists/lora/my-new-adapter/
```
See the full S3 Setup Guide for bucket creation, IAM roles, and cost estimates.
Monitoring Training
Key Metrics
| Metric | Healthy Range | Action if Abnormal |
|---|---|---|
| Training loss | 0.25 – 0.50 | If > 1.0: reduce LR. If NaN: reduce grad_clip |
| Learning rate | Follows schedule | Should warm up, then hold at target |
| GPU memory | < 70% VRAM | If OOM: reduce batch or grad_accum |
| Steps/second | > 1.0 on A100 | If slow: check num_workers |
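The loss rules in the table fold naturally into a small health-check helper, e.g. for a training callback. The thresholds are copied from the table; the function itself is a sketch, not part of the pipeline:

```python
import math

def loss_action(loss):
    """Map a training-loss value to the remediation from the table above."""
    if math.isnan(loss):
        return "reduce grad_clip"
    if loss > 1.0:
        return "reduce learning rate"
    if 0.25 <= loss <= 0.50:
        return "healthy"
    return "monitor"

print(loss_action(0.3))           # "healthy"
print(loss_action(float("nan")))  # "reduce grad_clip"
```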
Duration Estimates
| Dataset | GPU | Preset C (100 ep) |
|---|---|---|
| 50 tracks | A100 40GB | ~4 hours |
| 100 tracks | A100 40GB | ~8 hours |
| 100 tracks | H100 80GB | ~5 hours |
Troubleshooting
| Issue | Symptom | Fix |
|---|---|---|
| Audio decode failure | "Empty examples" or TorchCodec error | Auto-patched by entrypoint. Check audio file paths if still failing. |
| CUDA OOM | "CUDA out of memory" | Reduce GRAD_ACCUM, or use A100 80GB |
| No checkpoints | Empty output dir after training | `every_n_train_steps` defaults high. Override with `--every_n_train_steps 50` |
| NaN loss | Loss becomes NaN | Lower GRAD_CLIP to 0.1. Check for corrupt audio files |
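Corrupt or empty audio files are a common root cause of both the "Empty examples" and NaN-loss rows above, and they are cheap to screen for before a long run. A pre-flight sketch (the minimum-size threshold is an assumption, not a pipeline setting):

```python
import os
import tempfile

def find_suspect_audio(paths, min_bytes=1024):
    """Flag files that are missing or implausibly small (assumed threshold)."""
    return [p for p in paths
            if not os.path.exists(p) or os.path.getsize(p) < min_bytes]

# Demo with a deliberately empty file and a missing one
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    empty = f.name
suspects = find_suspect_audio([empty, "missing.mp3"])
os.unlink(empty)
print(len(suspects))  # 2
```

Running this over the dataset before training avoids discovering a bad file eight hours into a run.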