# AI DJ Project Catch-Up Note
Last updated: 2026-02-19
## 1) Project Goal (Current Direction)
Build a **domain-specific AI DJ transition demo** for coursework Option 1 (Refinement):
- user uploads Song A and Song B
- system auto-detects cue points + BPM
- Song B is time-stretched to Song A BPM
- a generative model creates transition audio from text ("transition vibe")
- output is a **short transition clip only** (not full-song mix)
This scope is intentionally optimized for Hugging Face Spaces reliability.
---
## 2) Coursework Fit (Why this is Option 1)
This is a refinement of existing pipelines/models:
- existing generative pipeline (currently MusicGen, planned ACE-Step)
- wrapped in domain-specific DJ UX (cue/BPM/mix controls)
- not raw prompting only; structured controls for practical use
---
## 3) Current Implemented Pipeline (Already in `app.py`)
Current app file: `AI_DJ_Project/app.py`
### 3.1 Input + UI
- Upload `Song A` and `Song B`
- Set:
- transition vibe text
- transition type (`riser`, `drum fill`, `sweep`, `brake`, `scratch`, `impact`)
- mode (`Overlay` or `Insert`)
- pre/mix/post seconds
- transition length + gain
- optional BPM and cue overrides
### 3.2 Audio analysis and cueing
1. Probe duration with `ffprobe` (if available)
2. Decode only needed segments (ffmpeg first, librosa fallback)
3. Estimate BPM + beat times with `librosa.beat.beat_track`
4. Auto-cue strategy:
- Song A: choose a beat near the end of the analysis window
- Song B: choose the first beat after ~2 seconds
5. Optional manual override for BPM and cue points
### 3.3 Tempo matching
- Compute stretch rate = `bpm_A / bpm_B` (clamped)
- Time-stretch Song B segment via `librosa.effects.time_stretch`
### 3.4 AI transition generation
- `@spaces.GPU` function `_generate_ai_transition(...)`
- Uses `facebook/musicgen-small`
- Prompt is domain-steered for DJ transition behavior
- Returns short generated transition audio
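A sketch of the generation step, assuming the standard Hugging Face Transformers MusicGen API. The prompt wording and the 4-second default are hypothetical examples of "domain-steered" prompting, not the exact strings in `app.py`:

```python
def build_transition_prompt(vibe: str, transition_type: str, bpm: float) -> str:
    """Steer the text prompt toward DJ transition behavior (hypothetical wording)."""
    return (f"DJ transition, {transition_type}, {vibe}, "
            f"around {bpm:.0f} BPM, builds tension then resolves, no vocals")

def generate_transition(prompt: str, seconds: float = 4.0):
    """Generate a short transition clip with facebook/musicgen-small."""
    import torch
    from transformers import AutoProcessor, MusicgenForConditionalGeneration

    processor = AutoProcessor.from_pretrained("facebook/musicgen-small")
    model = MusicgenForConditionalGeneration.from_pretrained("facebook/musicgen-small")
    inputs = processor(text=[prompt], padding=True, return_tensors="pt")
    # MusicGen emits roughly 50 audio tokens per second of output.
    with torch.no_grad():
        audio = model.generate(**inputs, max_new_tokens=int(seconds * 50))
    sr = model.config.audio_encoder.sampling_rate
    return audio[0, 0].cpu().numpy(), sr
```

On Spaces, `generate_transition` would sit inside the `@spaces.GPU`-decorated function so the model only occupies GPU memory during a request.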
### 3.5 Assembly
- **Overlay mode**: crossfade A/B + overlay AI transition
- **Insert mode**: A -> AI transition -> B (with short anti-click fades)
- Edge fades + peak normalization before output
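The Overlay-mode assembly can be sketched in pure NumPy. The equal-power fade curves and the `fx_gain` default are assumptions; the function name is hypothetical:

```python
import numpy as np

def crossfade_overlay(a_tail: np.ndarray, b_head: np.ndarray, fx: np.ndarray,
                      sr: int, mix_s: float, fx_gain: float = 0.8) -> np.ndarray:
    """Crossfade A into B over mix_s seconds and overlay the AI transition."""
    n = int(mix_s * sr)
    t = np.linspace(0.0, 1.0, n, dtype=np.float32)
    # Equal-power curves keep perceived loudness steady through the mix.
    fade_out = np.cos(t * np.pi / 2)
    fade_in = np.sin(t * np.pi / 2)
    mix = a_tail[-n:] * fade_out + b_head[:n] * fade_in
    out = np.concatenate([a_tail[:-n], mix, b_head[n:]])
    # Overlay the generated transition, centred on the crossfade region.
    start = max(0, len(a_tail) - n - (len(fx) - n) // 2)
    end = min(len(out), start + len(fx))
    out[start:end] += fx[: end - start] * fx_gain
    # Peak-normalise to avoid clipping after the overlay.
    peak = float(np.max(np.abs(out)))
    return out / peak if peak > 1.0 else out
```

Insert mode is simpler: concatenate A, transition, B with short linear anti-click fades at each seam.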
### 3.6 Output
- Output audio clip (NumPy audio to Gradio)
- JSON details:
- BPM estimates
- cue points
- stretch rate
- analysis settings
---
## 4) Full End-to-End Pipeline (Conceptual)
Upload A/B
-> decode limited windows
-> BPM + beat analysis
-> auto-cue points
-> stretch B to A BPM
-> generate transition (GenAI)
-> overlay/insert assembly
-> normalize/fades
-> return short transition clip + diagnostics
---
## 5) Planned Upgrade: ACE-Step + Custom LoRA
### 5.1 What ACE-Step is
ACE-Step 1.5 is a **full music-generation foundation model stack** (text-to-audio/music with editing/control workflows), not just a tiny SFX model.
Planned usage in this project:
- keep deterministic DJ logic (cue/BPM/stretch/assemble)
- swap transition generation backend from MusicGen to ACE-Step
- load custom LoRA adapter(s) to enforce DJ transition style
### 5.2 Integration strategy (recommended)
1. Keep current `app.py` flow unchanged for analysis/mixing
2. Introduce backend abstraction:
- `MusicGenBackend` (fallback)
- `AceStepBackend` (main target)
3. Add LoRA controls:
- adapter selection
- adapter scale
4. Continue returning short transition clips only
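The backend abstraction above could look like this. It is a sketch: the method signature and the LoRA constructor arguments are assumptions to be firmed up in milestones M1-M3:

```python
from abc import ABC, abstractmethod
from typing import Optional
import numpy as np

class TransitionBackend(ABC):
    """Shared interface so the DSP/mixing code never touches model specifics."""
    @abstractmethod
    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        ...

class MusicGenBackend(TransitionBackend):
    """Fallback backend; real MusicGen inference would replace the placeholder."""
    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        return np.zeros(int(seconds * sr), dtype=np.float32)  # placeholder clip

class AceStepBackend(TransitionBackend):
    """Main target; LoRA controls are hypothetical until the real API is wired in."""
    def __init__(self, lora_name: Optional[str] = None, lora_scale: float = 1.0):
        self.lora_name = lora_name
        self.lora_scale = lora_scale

    def generate(self, prompt: str, seconds: float, sr: int) -> np.ndarray:
        raise NotImplementedError("ACE-Step inference lands in milestone M2")
```

Because both backends return a plain NumPy clip, the assembly code stays identical regardless of which model generated the transition.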
---
## 6) Genre-Specific LoRA Idea (Pop / Electronic / House / Dubstep / Techno)
### Is this a good idea?
**Yes, as a staged plan.**
It is a strong product and coursework idea because:
- user-selected genre can map to distinct transition style
- demonstrates clear domain-specific refinement
- supports explainable UX: "You picked House -> House-style transition LoRA"
### Important caveats
- Training one LoRA per genre substantially increases data and compute requirements
- Early quality may vary by genre and dataset size
- More adapters mean more evaluation and QA burden
### Practical rollout (recommended)
Phase 1 (safe):
- base model + one "general DJ transition" LoRA
Phase 2 (coursework-strong):
- 2-3 genre LoRAs (e.g., Pop / House / Dubstep)
Phase 3 (optional extension):
- larger genre library + auto-genre suggestion from uploaded songs
---
## 7) Proposed Genre LoRA Routing Logic
The user selects the genre of the uploaded songs (or manually picks a transition style profile):
- Pop -> `lora_pop_transition`
- Electronic -> `lora_electronic_transition`
- House -> `lora_house_transition`
- Dubstep -> `lora_dubstep_transition`
- Techno -> `lora_techno_transition`
- Auto/Unknown -> `lora_general_transition`
Then:
1. load chosen LoRA
2. set LoRA scale
3. run ACE-Step generation for short transition duration
4. mix with A/B boundary clip
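The routing table above maps directly to a dictionary lookup; this sketch uses the adapter names listed in this section:

```python
# Genre -> adapter routing; adapter names come straight from the table above.
GENRE_LORA = {
    "Pop": "lora_pop_transition",
    "Electronic": "lora_electronic_transition",
    "House": "lora_house_transition",
    "Dubstep": "lora_dubstep_transition",
    "Techno": "lora_techno_transition",
}

def route_lora(genre: str) -> str:
    """Unknown or Auto genres fall back to the general transition adapter."""
    return GENRE_LORA.get(genre, "lora_general_transition")
```

Keeping the mapping in one dictionary makes Phase 2/3 expansion a data change rather than a code change.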
---
## 8) Data and Training Notes for LoRA
- Use only licensed/royalty-free/self-owned audio for dataset and demos
- Dataset should emphasize transition-like content (risers, fills, drops, sweeps, impacts)
- Include metadata/captions describing genre + transition intent
- Keep track of:
- adapter name
- dataset source and license
- training config and epoch checkpoints
---
## 9) Current Risks / Constraints
- ACE-Step stack is heavier than MusicGen and needs careful deployment tuning
- Cold starts and memory behavior can be challenging on Spaces
- Auto-cueing is heuristic; may fail on hard tracks (manual override should remain)
- Time-stretch can introduce artifacts (expected in DJ contexts)
---
## 10) Fallback and Reliability Plan
- Keep MusicGen backend as fallback while integrating ACE-Step
- If ACE-Step init fails:
- fail over to MusicGen backend
- still return valid transition clip
- Preserve deterministic DSP path as model-agnostic baseline
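The failover logic can be sketched as below. The two stub classes only simulate the failure path for illustration; in the app they would be the real backend classes from the planned refactor:

```python
class MusicGenBackend:
    """Stub standing in for the real fallback backend."""
    name = "musicgen"

class AceStepBackend:
    """Stub that simulates an ACE-Step init failure (e.g. missing weights)."""
    name = "ace-step"
    def __init__(self):
        raise RuntimeError("ACE-Step weights not available")

def make_backend(prefer_ace: bool = True):
    """Try ACE-Step first; fail over to MusicGen so a valid clip is always returned."""
    if prefer_ace:
        try:
            return AceStepBackend()
        except Exception:
            pass  # the real app would log the failure before falling over
    return MusicGenBackend()
```

The deterministic DSP path (cue/stretch/crossfade) runs either way, so even a total model failure can still return a plain crossfaded clip.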
---
## 11) "If I lost track" Quick Resume Checklist
1. Open `app.py` and confirm current backend is still working end-to-end
2. Verify demo still does:
- cue detect
- BPM match
- transition generation
- clip output
3. Re-read sections 5, 6, and 7 of this note
4. Continue with next implementation milestone:
- backend abstraction
- ACE-Step backend skeleton
- single LoRA integration
- then genre LoRA expansion
---
## 12) Next Concrete Milestones
- M1: Refactor transition generation into a backend interface
- M2: Implement `AceStepBackend` with base model inference
- M3: Add LoRA load/select/scale UI + runtime controls
- M4: Train first "general DJ transition" LoRA
- M5: Train 2-3 genre LoRAs and add genre routing
- M6: Compare outputs (base vs LoRA, genre A vs genre B) for coursework evidence