Overview
What Is This Pipeline?
This pipeline fine-tunes ACE-Step 1.5 — a state-of-the-art Diffusion Transformer (DiT) for text-to-music generation with 815 million parameters — using LoRA (Low-Rank Adaptation). Unlike autoregressive models (MusicGen, AudioLM), ACE-Step generates audio through flow matching: iteratively denoising a latent representation to produce high-fidelity music.
ACE-Step's DiT architecture has 48 transformer layers processing music latents, with cross-attention for text/lyrics conditioning. LoRA adapters (rank 64) inject ~12M trainable parameters into the attention layers — just 1.5% of the total — while the full model stays frozen. Training is therefore fast and memory-efficient, and adapters can be hot-swapped at inference time.
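The LoRA mechanism described above can be pictured in a few lines: instead of updating a frozen weight W, two low-rank matrices A and B learn a delta that is scaled by alpha/r and added to the frozen output. A minimal pure-Python sketch with toy dimensions (not the real 48-layer model):

```python
# Minimal LoRA forward-pass sketch: y = W x + (alpha / r) * B (A x).
# Toy dimensions for illustration; the real adapters use r=64 inside
# the DiT attention projections.

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(row[i] * v[i] for i in range(len(v))) for row in m]

def lora_forward(W, A, B, x, alpha, r):
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # low-rank trainable path
    scale = alpha / r                # LoRA scaling factor
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: model dim d=2, rank r=1
W = [[1.0, 0.0], [0.0, 1.0]]  # frozen identity weight
A = [[1.0, 1.0]]              # r x d down-projection
B = [[0.5], [0.0]]            # d x r up-projection
y = lora_forward(W, A, B, [2.0, 3.0], alpha=2, r=1)
print(y)  # base [2, 3] + scale 2.0 * delta [2.5, 0] -> [7.0, 3.0]
```

Only A and B receive gradients during training, which is why the trainable fraction stays so small.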
Key Capabilities
- Multi-genre training — Train on 50-200 tracks spanning any mix of genres
- Text + lyrics conditioning — Separate cross-attention for style tags and lyrics
- Serverless execution — Fully containerized with Docker, runs on Lambda/RunPod/Modal
- Hot-swap adapters — Switch between adapters without reloading the base model
- WTA attribution — Integrated Wasserstein Trajectory Attribution for IP tracking
- S3 storage — Centralized checkpoint and dataset management on AWS S3
- Validated configs — Tested presets with 11/11 validation tests passing
Architecture
Pipeline Flow
The complete training pipeline follows this flow:
Model Architecture
| Component | Role | Details |
|---|---|---|
| Music DCAE | Audio encoder/decoder | Compresses raw audio into continuous latent representations. Replaces discrete tokenization. |
| DiT Decoder | 48-layer Diffusion Transformer | Generates music by iteratively denoising latent states using flow matching. |
| T5 Encoder | Text conditioning | Encodes style tags and lyrics for cross-attention guidance. |
LoRA adapters target the self-attention and cross-attention projections (`linear_q`, `linear_k`, `linear_v`, `to_q`, `to_k`, `to_v`, `to_out.0`) across all 48 DiT layers.
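The wrapping described above can be enumerated to sanity-check coverage. A short sketch (the `layers.{i}` module paths are illustrative, not the actual ACE-Step module names):

```python
# Sketch: enumerate the attention projections LoRA wraps in the DiT.
# Module path format is an assumption for illustration; the real names
# depend on the ACE-Step implementation.

TARGET_MODULES = ["linear_q", "linear_k", "linear_v",
                  "to_q", "to_k", "to_v", "to_out.0"]
NUM_DIT_LAYERS = 48

wrapped = [f"layers.{i}.{m}"
           for i in range(NUM_DIT_LAYERS)
           for m in TARGET_MODULES]
print(len(wrapped))  # 7 projections x 48 layers = 336 wrapped modules
```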
Quick Start
Clone and build
```bash
# Clone the training-pipelines repo
git clone https://github.com/aenfr/training-pipelines.git
cd training-pipelines/ace-step-1-5

# Build the Docker image
cd docker
docker build -f Dockerfile.trainer -t lumina-trainer:v1 .
```
Prepare your dataset
```bash
# Build HuggingFace dataset from your audio files
python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio
```
See Dataset Preparation for the full tutorial.
Smoke test (5 gradient steps)
```bash
MODE=smoke ./docker/train.sh
```
Validates GPU, data loading, and training loop in ~2 minutes.
Full training (Preset C)
```bash
# Runs with validated Preset C defaults
./docker/train.sh
```
Expected time: ~8 hours on A100 for 100 tracks × 100 epochs.
Training Presets
Preset C — Multi-Style (Production)
11/11 validation tests passed. Used for the production multi-style adapter.
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 64 | Capacity/overfitting balance |
| LoRA Alpha | 192 | 3× rank for scaled learning |
| LoRA Dropout | 0.05 | Mild regularization |
| Learning Rate | 5e-5 | Halved for multi-genre stability |
| Epochs | 100 | For ~100 tracks |
| Grad Accumulation | 4 | Effective batch = 4 |
| Grad Clip | 0.5 | Prevents explosion |
| Precision | bf16-mixed | Half-precision for efficiency |
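Two quantities in the preset are derived rather than tuned directly: the LoRA scaling factor alpha/r (3.0 here, versus the standard 2x used by the LOO baseline below) and the effective batch size. A quick check, assuming the usual formulas and a per-device micro-batch of 1 (an assumption inferred from "Effective batch = 4"):

```python
# Derived quantities for Preset C. Formulas are the standard ones;
# the per-device batch size of 1 is an assumption.
preset_c = {"rank": 64, "alpha": 192, "grad_accum": 4, "lr": 5e-5}

scaling = preset_c["alpha"] / preset_c["rank"]  # scale applied to B @ A
effective_batch = 1 * preset_c["grad_accum"]    # micro-batch x accumulation

print(scaling, effective_batch)  # 3.0 4
```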
LOO Baseline — Research Validation
Used for Leave-One-Out causal validation experiments on GTZAN (90 tracks, 10 genres).
| Parameter | Value | Notes |
|---|---|---|
| LoRA Rank | 64 | Same architecture |
| LoRA Alpha | 128 | Standard 2× rank |
| Learning Rate | 1e-4 | Higher for shorter runs |
| Epochs | 500 | Longer for smaller dataset |
Docker Setup
Container Volumes
| Container Path | Purpose | Mode |
|---|---|---|
| `/model` | ACE-Step base model weights | Read-only |
| `/data` | HuggingFace dataset (.arrow) | Read-only |
| `/output` | Training output & checkpoints | Read-write |
| `/lora-cfg` | LoRA config JSON | Read-only |
Entrypoint Auto-Patching
The container entrypoint automatically handles these known issues:
- TorchCodec incompatibility — Patches `torchaudio.load()` → `librosa.load()`
- Audio save failure — Patches `torchaudio.save()` → `soundfile.write()`
- Step-0 plot crash — Skips inference at step 0 to prevent gradient corruption

All audio compatibility patches are applied automatically by `entrypoint.sh` at container startup. You don't need to modify any source files.
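The patching above follows the standard monkey-patch pattern: replace the broken attribute with a compatible shim before training code imports it. A self-contained sketch using a stand-in module (`fake_audio_lib` is hypothetical; the real entrypoint swaps torchaudio calls for librosa/soundfile):

```python
# Monkey-patch pattern sketch. `fake_audio_lib` is a stand-in for the
# real torchaudio module; the actual entrypoint patches torchaudio.
import types

def broken_load(path):
    # Stands in for the incompatible TorchCodec-backed loader
    raise RuntimeError("TorchCodec error")

fake_audio_lib = types.SimpleNamespace(load=broken_load)

def librosa_style_load(path):
    # Compatible shim: returns (samples, sample_rate) like librosa.load
    return [0.0, 0.1, -0.1], 44100

# Apply the patch before any training code touches the loader
fake_audio_lib.load = librosa_style_load

samples, sr = fake_audio_lib.load("song_01.mp3")
print(sr)  # 44100
```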
Custom Presets
Override any parameter via environment variables:
# Example: High-rank, low-LR for single artist
EPOCHS=200 \
LEARNING_RATE=2e-5 \
GRAD_ACCUM=8 \
EXP_NAME="my_custom_preset" \
./docker/train.sh
For LoRA architecture changes, create a JSON config:
```json
{
  "r": 128,
  "lora_alpha": 256,
  "lora_dropout": 0.1,
  "target_modules": [
    "linear_q", "linear_k", "linear_v",
    "to_q", "to_k", "to_v", "to_out.0"
  ],
  "use_rslora": false
}
```
```bash
LORA_CONFIG=/path/to/my_config.json ./docker/train.sh
```
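Before pointing `LORA_CONFIG` at a file, it is worth validating the JSON locally. A hedged sketch (key names taken from the example above; the required-key set is an assumption, not the pipeline's actual schema check):

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"r", "lora_alpha", "target_modules"}  # assumed minimum

def validate_lora_config(path):
    """Load a LoRA config JSON and check for the assumed required keys."""
    with open(path) as f:
        cfg = json.load(f)
    missing = REQUIRED_KEYS - cfg.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return cfg

# Round-trip the example config from above
cfg = {"r": 128, "lora_alpha": 256, "lora_dropout": 0.1,
       "target_modules": ["linear_q", "linear_k", "linear_v",
                          "to_q", "to_k", "to_v", "to_out.0"],
       "use_rslora": False}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(cfg, f)
    path = f.name
loaded = validate_lora_config(path)
os.unlink(path)
print(loaded["r"])  # 128
```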
Dataset Preparation
Required Files Per Track
| File | Description | Example |
|---|---|---|
| Audio (`.mp3`/`.wav`/`.flac`) | Full mix, 30-60s recommended | `song_01.mp3` |
| Tags | Comma-separated style descriptors | `jazz, piano, smooth, 90bpm` |
| Lyrics | Song lyrics or `[Instrumental]` | `[verse]\nMidnight...` |
Build the Dataset
```bash
python scripts/build_multi_style_dataset.py \
    --data-dir ~/my_audio \
    --output-dir ~/my_hf_dataset \
    --validate-audio
```
The model learns style associations through cross-attention with tags. Include 5-10 descriptive tags per track: vocal type, instruments, genre, mood, tempo (BPM), key. See the full Dataset Ingestion Guide in the repository docs.
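Each track can be pictured as one record combining the audio path, tag list, and lyrics. A hedged sketch of that shape (field names and the tag-count check are assumptions for illustration; the actual schema is defined by `build_multi_style_dataset.py`):

```python
# Hypothetical per-track record; the real field names are defined by
# scripts/build_multi_style_dataset.py.
def make_record(audio_path, tags, lyrics="[Instrumental]"):
    tag_list = [t.strip() for t in tags.split(",")]
    # Assumed check mirroring the 5-10 tag guidance above
    assert 5 <= len(tag_list) <= 10, "aim for 5-10 descriptive tags"
    return {"audio": audio_path, "tags": tag_list, "lyrics": lyrics}

rec = make_record(
    "song_01.mp3",
    "jazz, piano, smooth, 90bpm, instrumental",
)
print(rec["tags"])  # ['jazz', 'piano', 'smooth', '90bpm', 'instrumental']
```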
S3 Storage
Bucket Layout
```
s3://lumina-data-foldartists/
├── models/ace-step-1.5/     # Base model weights
├── lora/                    # Fine-tuned adapters
│   ├── multi-style-gen-c/   # ✅ Production
│   └── loo-subsets/         # Validation models
├── datasets/                # HF datasets
├── trajectories/            # WTA trajectory data
└── results/                 # Experiment outputs
```
Common Operations
```bash
# Download production adapter
aws s3 sync s3://lumina-data-foldartists/lora/multi-style-gen-c/ \
    ~/my_adapter/

# Upload new checkpoint after training
aws s3 sync /output/my_experiment/ \
    s3://lumina-data-foldartists/lora/my-new-adapter/
```
See the full S3 Setup Guide for bucket creation, IAM roles, and cost estimates.
Monitoring Training
Key Metrics
| Metric | Healthy Range | Action if Abnormal |
|---|---|---|
| Training loss | 0.25 – 0.50 | If > 1.0: reduce LR. If NaN: reduce grad_clip |
| Learning rate | Follows schedule | Should warm up, then hold at target |
| GPU memory | < 70% VRAM | If OOM: reduce batch or grad_accum |
| Steps/second | > 1.0 on A100 | If slow: check num_workers |
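The loss rules in the table fold naturally into a small health-check helper, e.g. for a training callback. The thresholds are copied from the table; the function itself is a sketch, not part of the pipeline:

```python
import math

def loss_action(loss):
    """Map a training-loss value to the remediation from the table above."""
    if math.isnan(loss):
        return "reduce grad_clip"
    if loss > 1.0:
        return "reduce learning rate"
    if 0.25 <= loss <= 0.50:
        return "healthy"
    return "monitor"

print(loss_action(0.3))           # "healthy"
print(loss_action(float("nan")))  # "reduce grad_clip"
```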
Duration Estimates
| Dataset | GPU | Preset C (100 ep) |
|---|---|---|
| 50 tracks | A100 40GB | ~4 hours |
| 100 tracks | A100 40GB | ~8 hours |
| 100 tracks | H100 80GB | ~5 hours |
Troubleshooting
| Issue | Symptom | Fix |
|---|---|---|
| Audio decode failure | "Empty examples" or TorchCodec error | Auto-patched by entrypoint. Check audio file paths if still failing. |
| CUDA OOM | "CUDA out of memory" | Reduce GRAD_ACCUM, or use A100 80GB |
| No checkpoints | Empty output dir after training | `every_n_train_steps` defaults high. Override with `--every_n_train_steps 50` |
| NaN loss | Loss becomes NaN | Lower GRAD_CLIP to 0.1. Check for corrupt audio files |
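Corrupt or empty audio files are a common root cause of both the "Empty examples" and NaN-loss rows above, and they are cheap to screen for before a long run. A pre-flight sketch (the minimum-size threshold is an assumption, not a pipeline setting):

```python
import os
import tempfile

def find_suspect_audio(paths, min_bytes=1024):
    """Flag files that are missing or implausibly small (assumed threshold)."""
    return [p for p in paths
            if not os.path.exists(p) or os.path.getsize(p) < min_bytes]

# Demo with a deliberately empty file and a missing one
with tempfile.NamedTemporaryFile(suffix=".mp3", delete=False) as f:
    empty = f.name
suspects = find_suspect_audio([empty, "missing.mp3"])
os.unlink(empty)
print(len(suspects))  # 2
```

Running this over the dataset before training avoids discovering a bad file eight hours into a run.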