# AI DJ Project Catch-Up Note
Last updated: 2026-02-19
## 1) Project Goal (Current Direction)
Build a **domain-specific AI DJ transition demo** for coursework Option 1 (Refinement):
- user uploads Song A and Song B
- system auto-detects cue points + BPM
- Song B is time-stretched to Song A BPM
- a generative model creates transition audio from text ("transition vibe")
- output is a **short transition clip only** (not full-song mix)
This scope is intentionally optimized for Hugging Face Spaces reliability.
---
## 2) Coursework Fit (Why this is Option 1)
This is a refinement of existing pipelines/models:
- existing generative pipeline (currently MusicGen, planned ACE-Step)
- wrapped in domain-specific DJ UX (cue/BPM/mix controls)
- not raw prompting only; structured controls for practical use
---
## 3) Current Implemented Pipeline (Already in `app.py`)
Current app file: `AI_DJ_Project/app.py`
### 3.1 Input + UI
- Upload `Song A` and `Song B`
- Set:
- transition vibe text
- transition type (`riser`, `drum fill`, `sweep`, `brake`, `scratch`, `impact`)
- mode (`Overlay` or `Insert`)
- pre/mix/post seconds
- transition length + gain
- optional BPM and cue overrides
### 3.2 Audio analysis and cueing
1. Probe duration with `ffprobe` (if available)
2. Decode only needed segments (ffmpeg first, librosa fallback)
3. Estimate BPM + beat times with `librosa.beat.beat_track`
4. Auto-cue strategy:
- Song A: choose a beat near the end of the analysis window
- Song B: choose the first beat after ~2 seconds
5. Optional manual override for BPM and cue points
### 3.3 Tempo matching
- Compute stretch rate = `bpm_A / bpm_B` (clamped)
- Time-stretch Song B segment via `librosa.effects.time_stretch`
### 3.4 AI transition generation
- `@spaces.GPU` function `_generate_ai_transition(...)`
- Uses `facebook/musicgen-small`
- Prompt is domain-steered for DJ transition behavior
- Returns short generated transition audio
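A sketch of the generation step, assuming the standard Hugging Face Transformers MusicGen API. The prompt wording and the 4-second default are hypothetical examples of "domain-steered" prompting, not the exact strings in `app.py`:

```python
def build_transition_prompt(vibe: str, transition_type: str, bpm: float) -> str:
    """Steer the text prompt toward DJ transition behavior (hypothetical wording)."""
    return (f"DJ transition, {transition_type}, {vibe}, "
            f"around {bpm:.0f} BPM, builds tension then resolves, no vocals")

def generate_transition(prompt: str, seconds: float = 4.0):
    """Generate a short transition clip with facebook/musicgen-small."""
    import torch
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    # MusicGen emits roughly 50 audio tokens per second of output.
    with torch.no_grad():
        audio = model.generate(**inputs, max_new_tokens=int(seconds * 50))
    sr = model.config.audio_encoder.sampling_rate
    return audio[0, 0].cpu().numpy(), sr
```

On Spaces, `generate_transition` would sit inside the `@spaces.GPU`-decorated function so the model only occupies GPU memory during a request.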
### 3.5 Assembly
- **Overlay mode**: crossfade A/B + overlay AI transition
- **Insert mode**: A -> AI transition -> B (with short anti-click fades)
- Edge fades + peak normalization before output
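The Overlay-mode assembly can be sketched in pure NumPy. The equal-power fade curves and the `fx_gain` default are assumptions; the function name is hypothetical:

```python
import numpy as np

def crossfade_overlay(a_tail: np.ndarray, b_head: np.ndarray, fx: np.ndarray,
                      sr: int, mix_s: float, fx_gain: float = 0.8) -> np.ndarray:
    """Crossfade A into B over mix_s seconds and overlay the AI transition."""
    n = int(mix_s * sr)
    t = np.linspace(0.0, 1.0, n, dtype=np.float32)
    # Equal-power curves keep perceived loudness steady through the mix.
    fade_out = np.cos(t * np.pi / 2)
    fade_in = np.sin(t * np.pi / 2)
    mix = a_tail[-n:] * fade_out + b_head[:n] * fade_in
    out = np.concatenate([a_tail[:-n], mix, b_head[n:]])
    # Overlay the generated transition, centred on the crossfade region.
    start = max(0, len(a_tail) - n - (len(fx) - n) // 2)
    end = min(len(out), start + len(fx))
    out[start:end] += fx[: end - start] * fx_gain
    # Peak-normalise to avoid clipping after the overlay.
    peak = float(np.max(np.abs(out)))
    return out / peak if peak > 1.0 else out
```

Insert mode is simpler: concatenate A, transition, B with short linear anti-click fades at each seam.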
### 3.6 Output
- Output audio clip (NumPy audio to Gradio)
- JSON details:
- BPM estimates
- cue points
- stretch rate
- analysis settings
---
## 4) Full End-to-End Pipeline (Conceptual)
Upload A/B
-> decode limited windows
-> BPM + beat analysis
-> auto-cue points
-> stretch B to A BPM
-> generate transition (GenAI)
-> overlay/insert assembly
-> normalize/fades
-> return short transition clip + diagnostics
---
## 5) Planned Upgrade: ACE-Step + Custom LoRA
### 5.1 What ACE-Step is
ACE-Step 1.5 is a **full music-generation foundation model stack** (text-to-audio/music with editing/control workflows), not just a tiny SFX model.
Planned usage in this project:
- keep deterministic DJ logic (cue/BPM/stretch/assemble)
- swap transition generation backend from MusicGen to ACE-Step
- load custom LoRA adapter(s) to enforce DJ transition style
### 5.2 Integration strategy (recommended)
1. Keep current `app.py` flow unchanged for analysis/mixing
2. Introduce backend abstraction:
- `MusicGenBackend` (fallback)
- `AceStepBackend` (main target)
3. Add LoRA controls:
- adapter selection
- adapter scale
4. Continue returning short transition clips only
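The backend abstraction above could look like this. It is a sketch: the method signature and the LoRA constructor arguments are assumptions to be firmed up in milestones M1-M3:

```python
from abc import ABC, abstractmethod
from typing import Optional
import numpy as np

class TransitionBackend(ABC):
    """Shared interface so the DSP/mixing code never touches model specifics."""
    @abstractmethod
    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        ...

class MusicGenBackend(TransitionBackend):
    """Fallback backend; real MusicGen inference would replace the placeholder."""
    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        return np.zeros(int(seconds * sr), dtype=np.float32)  # placeholder clip

class AceStepBackend(TransitionBackend):
    """Main target; LoRA controls are hypothetical until the real API is wired in."""
    def __init__(self, lora_name: Optional[str] = None, lora_scale: float = 1.0):
        self.lora_name = lora_name
        self.lora_scale = lora_scale

    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        raise NotImplementedError("ACE-Step inference lands in milestone M2")
```

Because both backends return a plain NumPy clip, the assembly code stays identical regardless of which model generated the transition.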
---
## 6) Genre-Specific LoRA Idea (Pop / Electronic / House / Dubstep / Techno)
### Is this a good idea?
**Yes, as a staged plan.**
It is a strong product and coursework idea because:
- user-selected genre can map to distinct transition style
- demonstrates clear domain-specific refinement
- supports explainable UX: "You picked House -> House-style transition LoRA"
### Important caveats
- Training one LoRA per genre substantially increases data and compute requirements
- Early quality may vary by genre and dataset size
- More adapters mean more evaluation and QA burden
### Practical rollout (recommended)
Phase 1 (safe):
- base model + one "general DJ transition" LoRA
Phase 2 (coursework-strong):
- 2-3 genre LoRAs (e.g., Pop / House / Dubstep)
Phase 3 (optional extension):
- larger genre library + auto-genre suggestion from uploaded songs
---
## 7) Proposed Genre LoRA Routing Logic
The user selects the genre of the uploaded songs (or manually picks a transition style profile):
- Pop -> `lora_pop_transition`
- Electronic -> `lora_electronic_transition`
- House -> `lora_house_transition`
- Dubstep -> `lora_dubstep_transition`
- Techno -> `lora_techno_transition`
- Auto/Unknown -> `lora_general_transition`
Then:
1. load chosen LoRA
2. set LoRA scale
3. run ACE-Step generation for short transition duration
4. mix with A/B boundary clip
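The routing table above maps directly to a dictionary lookup; this sketch uses the adapter names listed in this section:

```python
# Genre -> adapter routing; adapter names come straight from the table above.
GENRE_LORA = {
    "Pop": "lora_pop_transition",
    "Electronic": "lora_electronic_transition",
    "House": "lora_house_transition",
    "Dubstep": "lora_dubstep_transition",
    "Techno": "lora_techno_transition",
}

def route_lora(genre: str) -> str:
    """Unknown or Auto genres fall back to the general transition adapter."""
    return GENRE_LORA.get(genre, "lora_general_transition")
```

Keeping the mapping in one dictionary makes Phase 2/3 expansion a data change rather than a code change.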
---
## 8) Data and Training Notes for LoRA
- Use only licensed/royalty-free/self-owned audio for dataset and demos
- Dataset should emphasize transition-like content (risers, fills, drops, sweeps, impacts)
- Include metadata/captions describing genre + transition intent
- Keep track of:
- adapter name
- dataset source and license
- training config and epoch checkpoints
---
## 9) Current Risks / Constraints
- ACE-Step stack is heavier than MusicGen and needs careful deployment tuning
- Cold starts and memory behavior can be challenging on Spaces
- Auto-cueing is heuristic; may fail on hard tracks (manual override should remain)
- Time-stretch can introduce artifacts (expected in DJ contexts)
---
## 10) Fallback and Reliability Plan
- Keep MusicGen backend as fallback while integrating ACE-Step
- If ACE-Step init fails:
- fail over to MusicGen backend
- still return valid transition clip
- Preserve deterministic DSP path as model-agnostic baseline
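The failover logic can be sketched as below. The two stub classes only simulate the failure path for illustration; in the app they would be the real backend classes from the planned refactor:

```python
class MusicGenBackend:
    """Stub standing in for the real fallback backend."""
    name = "musicgen"

class AceStepBackend:
    """Stub that simulates an ACE-Step init failure (e.g. missing weights)."""
    name = "ace-step"
    def __init__(self):
        raise RuntimeError("ACE-Step weights not available")

def make_backend(prefer_ace: bool = True):
    """Try ACE-Step first; fail over to MusicGen so a valid clip is always returned."""
    if prefer_ace:
        try:
            return AceStepBackend()
        except Exception:
            pass  # the real app would log the failure before falling over
    return MusicGenBackend()
```

The deterministic DSP path (cue/stretch/crossfade) runs either way, so even a total model failure can still return a plain crossfaded clip.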
---
## 11) "If I lost track" Quick Resume Checklist
1. Open `app.py` and confirm current backend is still working end-to-end
2. Verify demo still does:
- cue detect
- BPM match
- transition generation
- clip output
3. Re-read sections 5, 6, and 7 of this note
4. Continue with next implementation milestone:
- backend abstraction
- ACE-Step backend skeleton
- single LoRA integration
- then genre LoRA expansion
---
## 12) Next Concrete Milestones
- M1: Refactor transition generation into a backend interface
- M2: Implement `AceStepBackend` with base model inference
- M3: Add LoRA load/select/scale UI + runtime controls
- M4: Train first "general DJ transition" LoRA
- M5: Train 2-3 genre LoRAs and add genre routing
- M6: Compare outputs (base vs LoRA, genre A vs genre B) for coursework evidence