Spaces: Running on Zero
Yng314 committed
Commit · 14984e4
1 Parent(s): 070f6dc
feat: implement audio transition generation pipeline with modules for transition generation, cue point selection, and audio utilities.

Files changed:
- .gitignore +15 -0
- PROJECT_CATCHUP_NOTE.md +229 -0
- README copy.md +100 -0
- app.py +392 -0
- packages.txt +2 -0
- pipeline/__init__.py +16 -0
- pipeline/audio_utils.py +194 -0
- pipeline/cuepoint_selector.py +1656 -0
- pipeline/transition_generator.py +1694 -0
- requirements.txt +14 -0
.gitignore
ADDED
@@ -0,0 +1,15 @@
Initial_Research/demix
Initial_Research/spec
__pycache__
Utils/pretrained_models

# Large / copyrighted audio should not be committed to a public Space repo
Songs/
Test_songs/
Initial_Research/*.mp3
Initial_Research/*.wav
mixed_song.wav
final_mix.mp3
.acestep_runtime/
checkpoints/
outputs/
PROJECT_CATCHUP_NOTE.md
ADDED
@@ -0,0 +1,229 @@
# AI DJ Project Catch-Up Note

Last updated: 2026-02-19

## 1) Project Goal (Current Direction)

Build a **domain-specific AI DJ transition demo** for coursework Option 1 (Refinement):

- the user uploads Song A and Song B
- the system auto-detects cue points + BPM
- Song B is time-stretched to Song A's BPM
- a generative model creates transition audio from text ("transition vibe")
- the output is a **short transition clip only** (not a full-song mix)

This scope is intentionally optimized for Hugging Face Spaces reliability.

---

## 2) Coursework Fit (Why this is Option 1)

This is a refinement of existing pipelines/models:

- an existing generative pipeline (currently MusicGen, ACE-Step planned)
- wrapped in a domain-specific DJ UX (cue/BPM/mix controls)
- not raw prompting only; structured controls for practical use

---

## 3) Current Implemented Pipeline (Already in `app.py`)

Current app file: `AI_DJ_Project/app.py`

### 3.1 Input + UI

- Upload `Song A` and `Song B`
- Set:
  - transition vibe text
  - transition type (`riser`, `drum fill`, `sweep`, `brake`, `scratch`, `impact`)
  - mode (`Overlay` or `Insert`)
  - pre/mix/post seconds
  - transition length + gain
  - optional BPM and cue overrides

### 3.2 Audio analysis and cueing

1. Probe duration with `ffprobe` (if available)
2. Decode only the needed segments (ffmpeg first, librosa fallback)
3. Estimate BPM + beat times with `librosa.beat.beat_track`
4. Auto-cue strategy:
   - Song A: choose a beat near the end of the analysis window
   - Song B: choose the first beat after ~2 seconds
5. Optional manual override for BPM and cue points
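The auto-cue heuristic in steps 4-5 can be sketched as two small selection functions over the beat-time array returned by `librosa.beat.beat_track`. This is a minimal illustration of the strategy described above; the exact selection logic in `app.py` may differ.

```python
import numpy as np

def auto_cue_a(beat_times: np.ndarray, window_end_sec: float) -> float:
    # Song A: pick the last detected beat inside the analysis window,
    # so the mix-out lands near the end of the analyzed tail.
    candidates = beat_times[beat_times <= window_end_sec]
    return float(candidates[-1]) if candidates.size else window_end_sec

def auto_cue_b(beat_times: np.ndarray, min_sec: float = 2.0) -> float:
    # Song B: pick the first beat after ~2 seconds to skip a silent/ambient intro.
    candidates = beat_times[beat_times >= min_sec]
    return float(candidates[0]) if candidates.size else min_sec

beats = np.arange(0.0, 30.0, 0.5)  # synthetic beat grid: one beat every 0.5 s (120 BPM)
print(auto_cue_a(beats, window_end_sec=28.0))  # 28.0
print(auto_cue_b(beats))                       # 2.0
```

Both functions fall back to the requested boundary when no beat qualifies, which mirrors the "heuristic with manual override" design in step 5.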
### 3.3 Tempo matching

- Compute the stretch rate = `bpm_A / bpm_B` (clamped)
- Time-stretch the Song B segment via `librosa.effects.time_stretch`
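As a concrete example of the clamped-rate computation (the clamp bounds here are illustrative, not the exact values used in `app.py`):

```python
def stretch_rate(bpm_a: float, bpm_b: float, lo: float = 0.7, hi: float = 1.4) -> float:
    # rate > 1 speeds Song B up; rate < 1 slows it down.
    # Clamping keeps extreme BPM mismatches from producing unusable audio.
    rate = bpm_a / bpm_b
    return max(lo, min(hi, rate))

print(stretch_rate(128.0, 100.0))  # 1.28 (B sped up to match A)
print(stretch_rate(128.0, 80.0))   # 1.4 (clamped down from 1.6)
```

The resulting rate is then passed to `librosa.effects.time_stretch(y_b, rate=rate)`.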
### 3.4 AI transition generation

- `@spaces.GPU` function `_generate_ai_transition(...)`
- Uses `facebook/musicgen-small`
- The prompt is domain-steered for DJ transition behavior
- Returns a short generated transition clip

### 3.5 Assembly

- **Overlay mode**: crossfade A/B + overlay the AI transition
- **Insert mode**: A -> AI transition -> B (with short anti-click fades)
- Edge fades + peak normalization before output

### 3.6 Output

- Output audio clip (NumPy audio to Gradio)
- JSON details:
  - BPM estimates
  - cue points
  - stretch rate
  - analysis settings

---

## 4) Full End-to-End Pipeline (Conceptual)

Upload A/B
-> decode limited windows
-> BPM + beat analysis
-> auto-cue points
-> stretch B to A's BPM
-> generate transition (GenAI)
-> overlay/insert assembly
-> normalize/fades
-> return short transition clip + diagnostics
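The overlay assembly and final normalization steps above reduce to a crossfade plus peak scaling. A minimal NumPy sketch, assuming equal-power curves and a 0.95 target peak (both illustrative choices, not the exact `app.py` implementation):

```python
import numpy as np

def equal_power_crossfade(tail_a: np.ndarray, head_b: np.ndarray) -> np.ndarray:
    # Blend the end of A into the start of B over their common length,
    # using equal-power (cos/sin) curves to avoid a loudness dip at the midpoint.
    n = min(len(tail_a), len(head_b))
    t = np.linspace(0.0, 1.0, n)
    return tail_a[:n] * np.cos(t * np.pi / 2) + head_b[:n] * np.sin(t * np.pi / 2)

def peak_normalize(y: np.ndarray, peak: float = 0.95) -> np.ndarray:
    # Scale so the loudest sample sits at `peak`; leave silence untouched.
    m = np.max(np.abs(y))
    return y if m == 0 else y * (peak / m)

mix = equal_power_crossfade(np.ones(1000), np.ones(1000) * 0.5)
out = peak_normalize(mix)
print(round(float(np.max(np.abs(out))), 2))  # 0.95
```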
---

## 5) Planned Upgrade: ACE-Step + Custom LoRA

### 5.1 What ACE-Step is

ACE-Step 1.5 is a **full music-generation foundation model stack** (text-to-audio/music with editing/control workflows), not just a small SFX model.

Planned usage in this project:

- keep the deterministic DJ logic (cue/BPM/stretch/assemble)
- swap the transition generation backend from MusicGen to ACE-Step
- load custom LoRA adapter(s) to enforce DJ transition style

### 5.2 Integration strategy (recommended)

1. Keep the current `app.py` flow unchanged for analysis/mixing
2. Introduce a backend abstraction:
   - `MusicGenBackend` (fallback)
   - `AceStepBackend` (main target)
3. Add LoRA controls:
   - adapter selection
   - adapter scale
4. Continue returning short transition clips only
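The backend abstraction in step 2, combined with the fail-over behavior planned in section 10, could be shaped roughly like this. All method names and signatures here are hypothetical sketches of the planned interface, not existing code:

```python
from typing import Protocol
import numpy as np

class TransitionBackend(Protocol):
    # Hypothetical interface for the planned backend abstraction.
    def generate(self, prompt: str, duration_sec: float, seed: int) -> np.ndarray: ...

class MusicGenBackend:
    """Fallback backend; would wrap facebook/musicgen-small."""
    def generate(self, prompt: str, duration_sec: float, seed: int) -> np.ndarray:
        raise NotImplementedError("wraps facebook/musicgen-small")

class AceStepBackend:
    """Main target backend, optionally with a LoRA adapter applied."""
    def __init__(self, lora_path: str = "", lora_scale: float = 1.0):
        self.lora_path = lora_path
        self.lora_scale = lora_scale
    def generate(self, prompt: str, duration_sec: float, seed: int) -> np.ndarray:
        raise NotImplementedError("wraps ACE-Step repaint/generation")

def pick_backend(prefer_acestep: bool) -> TransitionBackend:
    # Fail over to MusicGen if ACE-Step init is unavailable (see section 10).
    if prefer_acestep:
        try:
            return AceStepBackend()
        except Exception:
            pass
    return MusicGenBackend()

print(type(pick_backend(prefer_acestep=True)).__name__)  # AceStepBackend
```

Keeping the deterministic DSP path outside this interface means either backend can be swapped in without touching analysis or assembly.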
---

## 6) Genre-Specific LoRA Idea (Pop / Electronic / House / Dubstep / Techno)

### Is this a good idea?

**Yes, as a staged plan.**

It is a strong product and coursework idea because:

- a user-selected genre can map to a distinct transition style
- it demonstrates clear domain-specific refinement
- it supports an explainable UX: "You picked House -> House-style transition LoRA"

### Important caveats

- Training one LoRA per genre greatly increases data and compute requirements
- Early quality may vary by genre and dataset size
- More adapters mean more evaluation and QA burden

### Practical rollout (recommended)

Phase 1 (safe):
- base model + one "general DJ transition" LoRA

Phase 2 (coursework-strong):
- 2-3 genre LoRAs (e.g., Pop / House / Dubstep)

Phase 3 (optional extension):
- a larger genre library + auto-genre suggestion from the uploaded songs
---

## 7) Proposed Genre LoRA Routing Logic

The user selects the uploaded-song genre (or manually selects a transition style profile):

- Pop -> `lora_pop_transition`
- Electronic -> `lora_electronic_transition`
- House -> `lora_house_transition`
- Dubstep -> `lora_dubstep_transition`
- Techno -> `lora_techno_transition`
- Auto/Unknown -> `lora_general_transition`

Then:

1. load the chosen LoRA
2. set the LoRA scale
3. run ACE-Step generation for the short transition duration
4. mix with the A/B boundary clip
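The routing table above is a plain dictionary lookup with a general-purpose fallback (the adapter names are the proposed ones from this note, not yet-trained artifacts):

```python
# Proposed adapter names from the routing table above (not yet trained).
GENRE_LORA_MAP = {
    "Pop": "lora_pop_transition",
    "Electronic": "lora_electronic_transition",
    "House": "lora_house_transition",
    "Dubstep": "lora_dubstep_transition",
    "Techno": "lora_techno_transition",
}

def route_lora(genre: str) -> str:
    # Auto/Unknown (or any unlisted genre) falls back to the general adapter.
    return GENRE_LORA_MAP.get(genre, "lora_general_transition")

print(route_lora("House"))    # lora_house_transition
print(route_lora("Unknown"))  # lora_general_transition
```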
---

## 8) Data and Training Notes for LoRA

- Use only licensed/royalty-free/self-owned audio for the dataset and demos
- The dataset should emphasize transition-like content (risers, fills, drops, sweeps, impacts)
- Include metadata/captions describing genre + transition intent
- Keep track of:
  - adapter name
  - dataset source and license
  - training config and epoch checkpoints

---

## 9) Current Risks / Constraints

- The ACE-Step stack is heavier than MusicGen and needs careful deployment tuning
- Cold starts and memory behavior can be challenging on Spaces
- Auto-cueing is heuristic and may fail on difficult tracks (manual override should remain)
- Time-stretching can introduce artifacts (expected in DJ contexts)

---

## 10) Fallback and Reliability Plan

- Keep the MusicGen backend as a fallback while integrating ACE-Step
- If ACE-Step init fails:
  - fail over to the MusicGen backend
  - still return a valid transition clip
- Preserve the deterministic DSP path as a model-agnostic baseline

---

## 11) "If I lost track" Quick Resume Checklist

1. Open `app.py` and confirm the current backend still works end-to-end
2. Verify the demo still does:
   - cue detection
   - BPM matching
   - transition generation
   - clip output
3. Re-read sections 5/6/7 of this note
4. Continue with the next implementation milestone:
   - backend abstraction
   - ACE-Step backend skeleton
   - single-LoRA integration
   - then genre LoRA expansion

---

## 12) Next Concrete Milestones

- M1: Refactor transition generation into a backend interface
- M2: Implement `AceStepBackend` with base-model inference
- M3: Add LoRA load/select/scale UI + runtime controls
- M4: Train the first "general DJ transition" LoRA
- M5: Train 2-3 genre LoRAs and add genre routing
- M6: Compare outputs (base vs LoRA, genre A vs genre B) for coursework evidence
README copy.md
ADDED
@@ -0,0 +1,100 @@
# AI_DJ_Project

## Coursework-ready demo (HF Spaces + Gradio, Phase A/B)

This repo now includes a **Hugging Face Spaces** demo in `app.py`:

- Upload **Song A** and **Song B**.
- Pick a transition style plugin + a text instruction.
- Build a rough seam (`A_tail + B_head`) with BPM-aware stretching.
- Run **ACE-Step repaint** on the seam window.
- Output two artifacts:
  - a transition-only clip
  - a stitched clip (`Song A up to cue + transition + Song B continuation`; the seam is replaced, not inserted)

### Deterministic transition API (Phase A)

The core reusable pipeline lives in:
- `pipeline/audio_utils.py`
- `pipeline/transition_generator.py`

Run it from the command line:

```shell
python -m pipeline.transition_generator \
  --song-a /path/to/song_a.mp3 \
  --song-b /path/to/song_b.mp3 \
  --plugin "Smooth Blend" \
  --instruction "smooth, rising energy, no vocals" \
  --seed 42 \
  --output-dir outputs
```

This writes:
- `*_transition.wav`
- `*_stitched.wav`
### Deploy to Hugging Face Spaces (ZeroGPU)

Create a new Space with:
- **SDK**: Gradio
- **Hardware**: ZeroGPU

Upload these files from this folder:
- `app.py`
- `requirements.txt`
- `packages.txt` (installs `ffmpeg` + `libsndfile1` for audio decoding at runtime)

Important: **Do not upload copyrighted songs** into the Space repo. The demo is designed for **user uploads**.

### Repo hygiene

- The coursework spec notebook at the repo root is intentionally git-ignored:
  `(0) 70113_Generative_AI_README_for_Coursework.ipynb`

### ACE-Step backend (required)

This coursework pipeline uses ACE-Step as the generation method.

```shell
pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git
```

Then run with environment variables as needed:

```shell
export AI_DJ_ACESTEP_MODEL_CONFIG=acestep-v15-turbo
# optional persistent root for checkpoints:
export AI_DJ_ACESTEP_PROJECT_ROOT=/data/acestep_runtime
```

Notes:
- ACE-Step currently targets Python 3.11.
- The first ACE-Step run can take time due to the checkpoint download.

### Optional: Demucs stem-aware cue scoring

Cue-point scoring can optionally run Demucs on the **analysis windows only** (the A tail window + the B head window), derive stem-aware mixability signals (`vocals`, `drums`, `bass`, accompaniment density), and penalize overlap risk (vocal-vocal and bass-bass clashes).

Transition generation can also use Demucs for:
- drum-led phase locking,
- one-bassline handoff shaping in `src_audio`,
- accompaniment-only `reference_audio`,
- post-repaint stem correction near transition boundaries.

Environment toggles:

```shell
# disable Demucs analysis entirely
export AI_DJ_ENABLE_DEMUCS_ANALYSIS=0

# disable Demucs transition refinements entirely
export AI_DJ_ENABLE_DEMUCS_TRANSITION=0

# choose the analysis device when enabled (default: cuda if available)
export AI_DJ_DEMUCS_DEVICE=cpu

# choose the reference period type passed into ACE-Step reference_audio
# values: accompaniment-only (default) | full-period-a
export AI_DJ_REFERENCE_AUDIO_MODE=accompaniment-only
```
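On the Python side, toggles like these are typically read with `os.environ`. A minimal sketch of how such flags might be consumed (the parsing rules here are illustrative, not necessarily what the pipeline does):

```python
import os

def env_flag(name: str, default: bool = True) -> bool:
    # "0", "false", or "no" (any case) disables a feature; anything else enables it.
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "no")

os.environ["AI_DJ_ENABLE_DEMUCS_ANALYSIS"] = "0"
os.environ.pop("AI_DJ_ENABLE_DEMUCS_TRANSITION", None)

print(env_flag("AI_DJ_ENABLE_DEMUCS_ANALYSIS"))   # False (explicitly disabled)
print(env_flag("AI_DJ_ENABLE_DEMUCS_TRANSITION")) # True (unset -> default on)
```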
app.py
ADDED
@@ -0,0 +1,392 @@
import logging
import os
import subprocess
from pathlib import Path
from typing import Optional

import gradio as gr

from pipeline.transition_generator import (
    PLUGIN_PRESETS,
    TransitionRequest,
    generate_transition_artifacts,
)

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
LOGGER = logging.getLogger(__name__)

LORA_DROPDOWN_CHOICES = [
    "None",
    "Chinese New Year (official)",
]
LORA_REPO_MAP = {
    "Chinese New Year (official)": "ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA",
}

APP_CSS = """
.adv-item label,
.adv-item .gr-block-label,
.adv-item .gr-block-title {
    white-space: nowrap !important;
    overflow: hidden !important;
    text-overflow: ellipsis !important;
}
"""

APP_THEME = gr.themes.Soft(
    primary_hue="blue",
    neutral_hue="slate",
    radius_size="lg",
).set(
    block_radius="*radius_xl",
    input_radius="*radius_xl",
    button_large_radius="*radius_xl",
    button_medium_radius="*radius_xl",
    button_small_radius="*radius_xl",
)


def _to_optional_float(value) -> Optional[float]:
    if value is None:
        return None
    if isinstance(value, str) and not value.strip():
        return None
    try:
        return float(value)
    except Exception:
        return None


def _normalize_upload_for_ui(path: Optional[str]) -> Optional[str]:
    if not path:
        return path
    src = str(path)
    if not os.path.isfile(src):
        return path

    out_dir = os.path.join("outputs", "normalized_uploads")
    os.makedirs(out_dir, exist_ok=True)
    stem = Path(src).stem
    dst = os.path.join(out_dir, f"{stem}_ui_norm.wav")

    cmd = [
        "ffmpeg",
        "-hide_banner",
        "-loglevel",
        "error",
        "-nostdin",
        "-y",
        "-i",
        src,
        "-vn",
        "-ac",
        "2",
        "-ar",
        "44100",
        "-c:a",
        "pcm_s16le",
        dst,
    ]
    try:
        subprocess.run(cmd, check=True)
        return dst
    except Exception as exc:
        LOGGER.warning("Upload normalization failed for %s (%s). Using original file.", src, exc)
        return src


def _run_transition(
    song_a,
    song_b,
    plugin_id,
    instruction_text,
    transition_bars,
    pre_context_sec,
    post_context_sec,
    analysis_sec,
    bpm_target,
    creativity_strength,
    inference_steps,
    seed,
    cue_a_sec,
    cue_b_sec,
    lora_choice,
    lora_scale,
    output_dir,
):
    if not song_a or not song_b:
        raise gr.Error("Please upload both Song A and Song B.")

    request = TransitionRequest(
        song_a_path=song_a,
        song_b_path=song_b,
        plugin_id=plugin_id,
        instruction_text=instruction_text or "",
        transition_base_mode="B-base-fixed",
        transition_bars=int(transition_bars),
        pre_context_sec=float(pre_context_sec),
        repaint_width_sec=4.0,
        post_context_sec=float(post_context_sec),
        analysis_sec=float(analysis_sec),
        bpm_target=_to_optional_float(bpm_target),
        cue_a_sec=_to_optional_float(cue_a_sec),
        cue_b_sec=_to_optional_float(cue_b_sec),
        creativity_strength=float(creativity_strength),
        inference_steps=int(inference_steps),
        seed=int(seed),
        acestep_lora_path=LORA_REPO_MAP.get(str(lora_choice), ""),
        acestep_lora_scale=float(lora_scale),
        output_dir=(output_dir or "outputs").strip(),
    )

    try:
        result = generate_transition_artifacts(request)
    except Exception as exc:
        raise gr.Error(str(exc))

    return (
        result.transition_path,
        result.hard_splice_path,
        result.rough_stitched_path,
        result.stitched_path,
    )


def build_ui() -> gr.Blocks:
    with gr.Blocks(theme=APP_THEME, css=APP_CSS) as demo:
        gr.Markdown(
            """
            <div style="text-align:center;">
              <h1>AI DJ Transition Generator</h1>
              <p>Upload two songs and generate a transition between them.</p>
            </div>
            """.strip()
        )
        with gr.Row():
            gr.Markdown(
                """
                ### How to use
                1. Upload **Song A** (current track) and **Song B** (next track).
                2. Choose a **Transition style plugin**.
                3. Optionally add a **Text instruction** (e.g., smooth, rising energy, no vocals).
                4. Select the **Transition period length (bars)**.
                5. Click **Generate transition artifacts**.
                """.strip(),
                container=False,
                elem_classes=["plain-info"],
            )
            gr.Markdown(
                """
                ### Outputs
                - **Generated transition clip**: the AI-generated repaint transition segment.
                - **Hard splice baseline (no transition)**: a direct-cut baseline.
                - **No-repaint rough stitch (baseline)**: the stitched baseline without repaint.
                - **Final stitched clip**: the final result with the transition inserted.
                """.strip(),
                container=False,
                elem_classes=["plain-info"],
            )

        with gr.Row():
            song_a = gr.Audio(
                label="Song A (mix out)",
                type="filepath",
                sources=["upload"],
            )
            song_b = gr.Audio(
                label="Song B (mix in)",
                type="filepath",
                sources=["upload"],
            )
        song_a.upload(
            fn=_normalize_upload_for_ui,
            inputs=song_a,
            outputs=song_a,
            queue=False,
        )
        song_b.upload(
            fn=_normalize_upload_for_ui,
            inputs=song_b,
            outputs=song_b,
            queue=False,
        )

        with gr.Row():
            with gr.Column():
                plugin_id = gr.Dropdown(
                    label="Transition style plugin",
                    choices=list(PLUGIN_PRESETS.keys()),
                    value="Smooth Blend",
                )
            with gr.Column():
                lora_choice = gr.Dropdown(
                    label="LoRA adapter",
                    choices=LORA_DROPDOWN_CHOICES,
                    value="None",
                    info="Select an ACE-Step LoRA adapter to apply during repaint.",
                )
                lora_scale = gr.Slider(
                    minimum=0.0,
                    maximum=2.0,
                    value=0.8,
                    step=0.05,
                    label="LoRA scale",
                )
            with gr.Column():
                instruction_text = gr.Textbox(
                    label="Text instruction",
                    placeholder="e.g., smooth, rising energy, no vocals",
                    lines=2,
                )

        with gr.Accordion("Advanced controls", open=False):
            with gr.Row():
                transition_bars = gr.Dropdown(
                    label="Transition period length (bars)",
                    choices=[4, 8, 16],
                    value=8,
                    info="Controls transition duration. The pipeline uses a fixed B-base strategy with A as reference.",
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                pre_context_sec = gr.Slider(
                    minimum=1,
                    maximum=12,
                    value=6,
                    step=0.5,
                    label="Seconds before seam (Song A context)",
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                post_context_sec = gr.Slider(
                    minimum=1,
                    maximum=12,
                    value=6,
                    step=0.5,
                    label="Seconds after seam (Song B context)",
                    min_width=320,
                    elem_classes=["adv-item"],
                )

            with gr.Row():
                analysis_sec = gr.Slider(
                    minimum=10,
                    maximum=90,
                    value=45,
                    step=5,
                    label="Analysis window (seconds)",
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                bpm_target = gr.Number(
                    label="Optional BPM target override",
                    value=None,
                    min_width=320,
                    elem_classes=["adv-item"],
                )

            with gr.Row():
                creativity_strength = gr.Slider(
                    minimum=1.0,
                    maximum=12.0,
                    value=7.0,
                    step=0.5,
                    label="Creativity strength (guidance)",
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                inference_steps = gr.Slider(
                    minimum=1,
                    maximum=64,
                    value=8,
                    step=1,
                    label="ACE-Step inference steps",
                    min_width=320,
                    elem_classes=["adv-item"],
                )

            with gr.Row():
                seed = gr.Number(
                    label="Seed",
                    value=42,
                    precision=0,
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                cue_a_sec = gr.Textbox(
                    label="Optional cue A override (sec)",
                    value="",
                    placeholder="Leave blank for auto cue selection",
                    min_width=320,
                    elem_classes=["adv-item"],
                )

            with gr.Row():
                cue_b_sec = gr.Textbox(
                    label="Optional cue B override (sec)",
                    value="",
                    placeholder="Leave blank for auto cue selection",
                    min_width=320,
                    elem_classes=["adv-item"],
                )
                output_dir = gr.Textbox(
                    label="Output directory",
                    value="outputs",
                    min_width=320,
                    elem_classes=["adv-item"],
                )

        run_btn = gr.Button("Generate transition artifacts", variant="primary")

        with gr.Row():
            transition_audio = gr.Audio(
                label="Generated transition clip",
                type="filepath",
            )
            hard_splice_audio = gr.Audio(
                label="Hard splice baseline (no transition)",
                type="filepath",
            )
            rough_stitched_audio = gr.Audio(
                label="No-repaint rough stitch (baseline)",
                type="filepath",
            )
            stitched_audio = gr.Audio(
                label="Final stitched clip",
                type="filepath",
            )

        run_btn.click(
            fn=_run_transition,
            inputs=[
                song_a,
                song_b,
                plugin_id,
                instruction_text,
                transition_bars,
                pre_context_sec,
                post_context_sec,
                analysis_sec,
                bpm_target,
                creativity_strength,
                inference_steps,
                seed,
                cue_a_sec,
                cue_b_sec,
                lora_choice,
                lora_scale,
                output_dir,
            ],
            outputs=[transition_audio, hard_splice_audio, rough_stitched_audio, stitched_audio],
        )

    return demo


demo = build_ui()

if __name__ == "__main__":
    demo.launch()
packages.txt
ADDED
@@ -0,0 +1,2 @@
ffmpeg
libsndfile1
pipeline/__init__.py
ADDED
@@ -0,0 +1,16 @@
"""Pipeline package for deterministic transition generation."""

from .transition_generator import (
    PLUGIN_PRESETS,
    TransitionRequest,
    TransitionResult,
    generate_transition_artifacts,
)

__all__ = [
    "PLUGIN_PRESETS",
    "TransitionRequest",
    "TransitionResult",
    "generate_transition_artifacts",
]
pipeline/audio_utils.py
ADDED
@@ -0,0 +1,194 @@
import logging
import os
import shutil
import subprocess
import tempfile
from typing import Optional, Tuple

import librosa
import numpy as np
import soundfile as sf

LOGGER = logging.getLogger(__name__)


def clamp(value: float, low: float, high: float) -> float:
    return float(max(low, min(high, value)))


def ensure_mono(y: np.ndarray) -> np.ndarray:
    if y.ndim == 1:
        return y
    return np.mean(y, axis=1)


def ffprobe_duration_sec(path: str) -> Optional[float]:
    if not shutil.which("ffprobe"):
        return None

    cmd = [
        "ffprobe",
        "-v",
        "error",
        "-show_entries",
        "format=duration",
        "-of",
        "default=noprint_wrappers=1:nokey=1",
        path,
    ]
    try:
        out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True).strip()
        return float(out)
    except Exception:
        return None


def decode_segment(path: str, start_sec: float, duration_sec: float, sr: int, max_decode_sec: float = 120.0) -> Tuple[np.ndarray, int]:
    start_sec = max(0.0, float(start_sec))
    duration_sec = max(0.0, float(duration_sec))
    duration_sec = min(duration_sec, max_decode_sec)

    if duration_sec <= 0:
        return np.zeros((0,), dtype=np.float32), sr

    if shutil.which("ffmpeg"):
        tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
        tmp_path = tmp.name
        tmp.close()
        try:
            cmd = [
                "ffmpeg",
                "-hide_banner",
                "-loglevel",
                "error",
                "-nostdin",
                "-y",
                "-ss",
                str(start_sec),
                "-t",
                str(duration_sec),
                "-i",
                path,
                "-ac",
                "1",
                "-ar",
                str(sr),
                tmp_path,
            ]
            subprocess.run(cmd, check=True)
            y, read_sr = sf.read(tmp_path, dtype="float32", always_2d=False)
            y = ensure_mono(np.asarray(y))
            return y.astype(np.float32), int(read_sr)
        finally:
            try:
                os.remove(tmp_path)
            except Exception:
                pass

    y, read_sr = librosa.load(path, sr=sr, mono=True, offset=start_sec, duration=duration_sec)
    return y.astype(np.float32), int(read_sr)


def estimate_bpm_and_beats(y: np.ndarray, sr: int) -> Tuple[Optional[float], np.ndarray]:
    if y.size < sr:
        return None, np.array([], dtype=np.float32)

    try:
        tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        tempo_f = float(tempo[0]) if isinstance(tempo, (list, np.ndarray)) else float(tempo)
        beat_times = librosa.frames_to_time(beat_frames, sr=sr).astype(np.float32)
        if not (40.0 <= tempo_f <= 220.0):
            tempo_f = None
        return tempo_f, beat_times
    except Exception:
        return None, np.array([], dtype=np.float32)


def choose_nearest_beat(beat_times: np.ndarray, target_sec: float) -> float:
    if beat_times.size == 0:
        return float(target_sec)
    idx = int(np.argmin(np.abs(beat_times - float(target_sec))))
    return float(beat_times[idx])


def choose_first_beat_after(beat_times: np.ndarray, target_sec: float) -> float:
    if beat_times.size == 0:
        return float(target_sec)
    for bt in beat_times:
        if float(bt) >= float(target_sec):
            return float(bt)
    return float(beat_times[-1])


def linear_fade(n: int, fade_in: bool) -> np.ndarray:
    if n <= 0:
        return np.zeros((0,), dtype=np.float32)
    if fade_in:
        return np.linspace(0.0, 1.0, n, dtype=np.float32)
    return np.linspace(1.0, 0.0, n, dtype=np.float32)


def normalize_peak(y: np.ndarray, peak: float = 0.98) -> np.ndarray:
    if y.size == 0:
        return y.astype(np.float32)
    maximum = float(np.max(np.abs(y)))
    if maximum <= 1e-9:
        return y.astype(np.float32)
    if maximum <= peak:
        return y.astype(np.float32)
    return (y * (peak / maximum)).astype(np.float32)


def apply_edge_fades(y: np.ndarray, sr: int, fade_ms: float = 30.0) -> np.ndarray:
    n = y.size
    fade_n = int(sr * (fade_ms / 1000.0))
    fade_n = min(fade_n, n // 2)
    if fade_n <= 0:
        return y
    y2 = y.copy()
    y2[:fade_n] *= linear_fade(fade_n, fade_in=True)
    y2[-fade_n:] *= linear_fade(fade_n, fade_in=False)
    return y2


def ensure_length(y: np.ndarray, target_n: int) -> np.ndarray:
    target_n = int(max(0, target_n))
    if y.size < target_n:
        return np.pad(y, (0, target_n - y.size), mode="constant")
    return y[:target_n]


def safe_time_stretch(y: np.ndarray, rate: float) -> np.ndarray:
    rate = float(rate)
    if y.size == 0:
        return y.astype(np.float32)
    if abs(rate - 1.0) < 1e-6:
        return y.astype(np.float32)
    try:
        return librosa.effects.time_stretch(y, rate=rate).astype(np.float32)
    except Exception as exc:
        LOGGER.warning("Time-stretch failed (%s); using original audio.", exc)
        return y.astype(np.float32)


def resample_if_needed(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
    if int(orig_sr) == int(target_sr):
        return y.astype(np.float32)
    return librosa.resample(y, orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)


def crossfade_equal_length(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    n = min(a.size, b.size)
    if n <= 0:
        return np.zeros((0,), dtype=np.float32)
    a = a[:n]
    b = b[:n]
    fade_in = linear_fade(n, fade_in=True)
    fade_out = 1.0 - fade_in
    return (a * fade_out + b * fade_in).astype(np.float32)


def write_wav(path: str, y: np.ndarray, sr: int) -> None:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    sf.write(path, y.astype(np.float32), int(sr))
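The crossfade helpers above are pure NumPy, so their behavior is easy to check in isolation. A minimal sketch (re-declaring `linear_fade` and `crossfade_equal_length` exactly as in the diff, so it runs without the `pipeline` package installed) shows the linear crossfade starting at the outgoing clip's level and ending at the incoming clip's level:

```python
import numpy as np

def linear_fade(n: int, fade_in: bool) -> np.ndarray:
    # Linear ramp of length n: 0->1 for a fade-in, 1->0 for a fade-out.
    if n <= 0:
        return np.zeros((0,), dtype=np.float32)
    if fade_in:
        return np.linspace(0.0, 1.0, n, dtype=np.float32)
    return np.linspace(1.0, 0.0, n, dtype=np.float32)

def crossfade_equal_length(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Trim both clips to the shorter length, then mix with complementary ramps.
    n = min(a.size, b.size)
    if n <= 0:
        return np.zeros((0,), dtype=np.float32)
    a, b = a[:n], b[:n]
    fade_in = linear_fade(n, fade_in=True)
    return (a * (1.0 - fade_in) + b * fade_in).astype(np.float32)

outgoing = np.ones(4, dtype=np.float32)          # constant +1 signal
incoming = np.full(4, -1.0, dtype=np.float32)    # constant -1 signal
mix = crossfade_equal_length(outgoing, incoming)
print(mix[0], mix[-1])  # → 1.0 -1.0 (starts on outgoing, ends on incoming)
```

Note the gains are complementary linear ramps (sum to 1.0 at every sample), so equal-level inputs pass through at constant amplitude rather than the equal-power curve some DJ tools use.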
pipeline/cuepoint_selector.py
ADDED
|
@@ -0,0 +1,1656 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import logging
|
| 2 |
+
import os
|
| 3 |
+
from dataclasses import dataclass
|
| 4 |
+
from typing import Any, Dict, List, Optional, Tuple
|
| 5 |
+
|
| 6 |
+
import librosa # type: ignore[reportMissingImports]
|
| 7 |
+
import numpy as np
|
| 8 |
+
|
| 9 |
+
from .audio_utils import choose_first_beat_after, choose_nearest_beat, decode_segment, ensure_length
|
| 10 |
+
|
| 11 |
+
LOGGER = logging.getLogger(__name__)
|
| 12 |
+
|
| 13 |
+
_ANALYSIS_HOP = 512
|
| 14 |
+
_STRUCT_SR = 22050
|
| 15 |
+
_DEMUCS_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_ANALYSIS", "1").strip().lower() not in {
|
| 16 |
+
"0",
|
| 17 |
+
"false",
|
| 18 |
+
"no",
|
| 19 |
+
"off",
|
| 20 |
+
}
|
| 21 |
+
_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
|
| 22 |
+
_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
|
| 23 |
+
_DEMUCS_SEGMENT_SEC = 7.0
|
| 24 |
+
_DEMUCS_MIN_WINDOW_SEC = 6.0
|
| 25 |
+
|
| 26 |
+
_PROFILE_CACHE: Dict[Tuple[str, int], Optional["_TrackProfiles"]] = {}
|
| 27 |
+
_LIBROSA_STRUCT_CACHE: Dict[str, Optional[Dict[str, np.ndarray]]] = {}
|
| 28 |
+
_DEMUCS_MODEL: Any = None
|
| 29 |
+
_DEMUCS_TORCH: Any = None
|
| 30 |
+
_DEMUCS_DEVICE = "cpu"
|
| 31 |
+
_DEMUCS_LOAD_ATTEMPTED = False
|
| 32 |
+
_DEMUCS_LOAD_ERROR: Optional[str] = None
|
| 33 |
+
|
| 34 |
+
|
| 35 |
+
@dataclass
|
| 36 |
+
class CueSelectionResult:
|
| 37 |
+
cue_a_sec: float
|
| 38 |
+
cue_b_sec: float
|
| 39 |
+
method: str
|
| 40 |
+
debug: Dict[str, object]
|
| 41 |
+
|
| 42 |
+
|
| 43 |
+
@dataclass
|
| 44 |
+
class _CueCandidate:
|
| 45 |
+
time_sec: float
|
| 46 |
+
beat_idx: int
|
| 47 |
+
phrase: float
|
| 48 |
+
energy: float
|
| 49 |
+
onset: float
|
| 50 |
+
chroma: np.ndarray
|
| 51 |
+
vocal_ratio: float
|
| 52 |
+
vocal_onset: float
|
| 53 |
+
vocal_phrase_score: float
|
| 54 |
+
drum_anchor: float
|
| 55 |
+
bass_energy: float
|
| 56 |
+
bass_stability: float
|
| 57 |
+
instrumental_density: float
|
| 58 |
+
density_score: float
|
| 59 |
+
period_vocal_ratio: float
|
| 60 |
+
period_vocal_phrase_score: float
|
| 61 |
+
period_drum_anchor: float
|
| 62 |
+
period_bass_energy: float
|
| 63 |
+
period_bass_stability: float
|
| 64 |
+
period_density_score: float
|
| 65 |
+
period_coverage: float
|
| 66 |
+
period_vocal_curve: np.ndarray
|
| 67 |
+
period_bass_curve: np.ndarray
|
| 68 |
+
|
| 69 |
+
|
| 70 |
+
@dataclass
|
| 71 |
+
class _TrackProfiles:
|
| 72 |
+
rms: np.ndarray
|
| 73 |
+
rms_times: np.ndarray
|
| 74 |
+
onset: np.ndarray
|
| 75 |
+
onset_times: np.ndarray
|
| 76 |
+
chroma: np.ndarray
|
| 77 |
+
chroma_times: np.ndarray
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
@dataclass
|
| 81 |
+
class _VocalActivityProfile:
|
| 82 |
+
vocal_ratio: np.ndarray
|
| 83 |
+
vocal_onset: np.ndarray
|
| 84 |
+
drum_onset: np.ndarray
|
| 85 |
+
bass_rms: np.ndarray
|
| 86 |
+
instrumental_rms: np.ndarray
|
| 87 |
+
times: np.ndarray
|
| 88 |
+
method: str
|
| 89 |
+
has_drums: bool
|
| 90 |
+
has_bass: bool
|
| 91 |
+
|
| 92 |
+
|
| 93 |
+
@dataclass
|
| 94 |
+
class _StructuredCandidate:
|
| 95 |
+
cue: _CueCandidate
|
| 96 |
+
label: str
|
| 97 |
+
label_score: float
|
| 98 |
+
edge_score: float
|
| 99 |
+
position_score: float
|
| 100 |
+
|
| 101 |
+
|
| 102 |
+
def _clamp(value: float, low: float, high: float) -> float:
|
| 103 |
+
return float(max(low, min(high, value)))
|
| 104 |
+
|
| 105 |
+
|
| 106 |
+
def _mean_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
|
| 107 |
+
if values.size == 0 or times.size == 0:
|
| 108 |
+
return 0.0
|
| 109 |
+
lo = float(min(start, end))
|
| 110 |
+
hi = float(max(start, end))
|
| 111 |
+
mask = (times >= lo) & (times <= hi)
|
| 112 |
+
if np.any(mask):
|
| 113 |
+
return float(np.mean(values[mask]))
|
| 114 |
+
idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
|
| 115 |
+
return float(values[idx])
|
| 116 |
+
|
| 117 |
+
|
| 118 |
+
def _std_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
|
| 119 |
+
if values.size == 0 or times.size == 0:
|
| 120 |
+
return 0.0
|
| 121 |
+
lo = float(min(start, end))
|
| 122 |
+
hi = float(max(start, end))
|
| 123 |
+
mask = (times >= lo) & (times <= hi)
|
| 124 |
+
if np.any(mask):
|
| 125 |
+
return float(np.std(values[mask]))
|
| 126 |
+
idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
|
| 127 |
+
return 0.0 if idx < 0 or idx >= values.size else 0.0
|
| 128 |
+
|
| 129 |
+
|
| 130 |
+
def _smooth_1d(values: np.ndarray, kernel_size: int) -> np.ndarray:
|
| 131 |
+
arr = np.asarray(values, dtype=np.float32).reshape(-1)
|
| 132 |
+
if arr.size == 0:
|
| 133 |
+
return np.zeros((1,), dtype=np.float32)
|
| 134 |
+
k = int(max(1, kernel_size))
|
| 135 |
+
if k == 1 or arr.size < k:
|
| 136 |
+
return arr.astype(np.float32)
|
| 137 |
+
kernel = np.ones((k,), dtype=np.float32) / float(k)
|
| 138 |
+
return np.convolve(arr, kernel, mode="same").astype(np.float32)
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
def _normalize_1d(values: np.ndarray) -> np.ndarray:
|
| 142 |
+
arr = np.asarray(values, dtype=np.float32).reshape(-1)
|
| 143 |
+
if arr.size == 0:
|
| 144 |
+
return np.zeros((1,), dtype=np.float32)
|
| 145 |
+
lo = float(np.percentile(arr, 5))
|
| 146 |
+
hi = float(np.percentile(arr, 95))
|
| 147 |
+
if hi - lo > 1e-6:
|
| 148 |
+
out = (arr - lo) / (hi - lo)
|
| 149 |
+
return np.clip(out, 0.0, 1.0).astype(np.float32)
|
| 150 |
+
mx = float(np.max(np.abs(arr)))
|
| 151 |
+
if mx > 1e-6:
|
| 152 |
+
out = arr / mx
|
| 153 |
+
return np.clip(out, 0.0, 1.0).astype(np.float32)
|
| 154 |
+
return np.zeros_like(arr, dtype=np.float32)
|
| 155 |
+
|
| 156 |
+
|
| 157 |
+
def _align_series_min_length(series: List[np.ndarray]) -> List[np.ndarray]:
|
| 158 |
+
clean = [np.asarray(x, dtype=np.float32).reshape(-1) for x in series]
|
| 159 |
+
if not clean:
|
| 160 |
+
return []
|
| 161 |
+
min_len = min((x.size for x in clean if x.size > 0), default=0)
|
| 162 |
+
if min_len <= 0:
|
| 163 |
+
return [np.zeros((1,), dtype=np.float32) for _ in clean]
|
| 164 |
+
return [x[:min_len].astype(np.float32) if x.size >= min_len else np.pad(x, (0, min_len - x.size)).astype(np.float32) for x in clean]
|
| 165 |
+
|
| 166 |
+
|
| 167 |
+
def _mean_2d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> np.ndarray:
|
| 168 |
+
if values.ndim != 2 or values.shape[1] == 0 or times.size == 0:
|
| 169 |
+
return np.zeros((12,), dtype=np.float32)
|
| 170 |
+
lo = float(min(start, end))
|
| 171 |
+
hi = float(max(start, end))
|
| 172 |
+
mask = (times >= lo) & (times <= hi)
|
| 173 |
+
if np.any(mask):
|
| 174 |
+
vec = np.mean(values[:, mask], axis=1).astype(np.float32)
|
| 175 |
+
else:
|
| 176 |
+
idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
|
| 177 |
+
vec = values[:, idx].astype(np.float32)
|
| 178 |
+
norm = float(np.linalg.norm(vec))
|
| 179 |
+
if norm > 1e-9:
|
| 180 |
+
vec = vec / norm
|
| 181 |
+
return vec
|
| 182 |
+
|
| 183 |
+
|
| 184 |
+
def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
|
| 185 |
+
if a.size == 0 or b.size == 0:
|
| 186 |
+
return 0.0
|
| 187 |
+
denom = float(np.linalg.norm(a) * np.linalg.norm(b))
|
| 188 |
+
if denom <= 1e-9:
|
| 189 |
+
return 0.0
|
| 190 |
+
return float(np.dot(a, b) / denom)
|
| 191 |
+
|
| 192 |
+
|
| 193 |
+
def _phrase_score(beat_idx: int) -> float:
|
| 194 |
+
if beat_idx < 0:
|
| 195 |
+
return 0.5
|
| 196 |
+
mod4 = beat_idx % 4
|
| 197 |
+
mod8 = beat_idx % 8
|
| 198 |
+
dist4 = min(mod4, 4 - mod4)
|
| 199 |
+
dist8 = min(mod8, 8 - mod8)
|
| 200 |
+
score4 = 1.0 - (dist4 / 2.0)
|
| 201 |
+
score8 = 1.0 - (dist8 / 4.0)
|
| 202 |
+
return _clamp((0.65 * score4) + (0.35 * score8), 0.0, 1.0)
|
| 203 |
+
|
| 204 |
+
|
| 205 |
+
def _target_position_score(x: float, target: float, spread: float) -> float:
|
| 206 |
+
spread = max(1e-3, float(spread))
|
| 207 |
+
return float(np.exp(-abs(float(x) - float(target)) / spread))
|
| 208 |
+
|
| 209 |
+
|
| 210 |
+
def _edge_score(x: float, duration_sec: float) -> float:
|
| 211 |
+
if duration_sec <= 1e-6:
|
| 212 |
+
return 0.0
|
| 213 |
+
ratio = float(x / duration_sec)
|
| 214 |
+
return _clamp(min(ratio / 0.16, (1.0 - ratio) / 0.16), 0.0, 1.0)
|
| 215 |
+
|
| 216 |
+
|
| 217 |
+
def _resolve_demucs_device(torch_mod: Any) -> str:
|
| 218 |
+
pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
|
| 219 |
+
if pref in {"cpu"}:
|
| 220 |
+
return "cpu"
|
| 221 |
+
if pref in {"cuda", "gpu"}:
|
| 222 |
+
return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
|
| 223 |
+
return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
|
| 224 |
+
|
| 225 |
+
|
| 226 |
+
def _get_demucs_model() -> Tuple[Optional[Any], Optional[Any], str, Optional[str]]:
|
| 227 |
+
global _DEMUCS_MODEL, _DEMUCS_TORCH, _DEMUCS_DEVICE, _DEMUCS_LOAD_ATTEMPTED, _DEMUCS_LOAD_ERROR
|
| 228 |
+
|
| 229 |
+
if not _DEMUCS_ENABLED:
|
| 230 |
+
return None, None, "disabled", "AI_DJ_ENABLE_DEMUCS_ANALYSIS=0"
|
| 231 |
+
|
| 232 |
+
if _DEMUCS_LOAD_ATTEMPTED:
|
| 233 |
+
if _DEMUCS_MODEL is None:
|
| 234 |
+
return None, _DEMUCS_TORCH, "unavailable", _DEMUCS_LOAD_ERROR
|
| 235 |
+
return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
|
| 236 |
+
|
| 237 |
+
_DEMUCS_LOAD_ATTEMPTED = True
|
| 238 |
+
try:
|
| 239 |
+
import torch # type: ignore[reportMissingImports]
|
| 240 |
+
from demucs.pretrained import get_model # type: ignore[reportMissingImports]
|
| 241 |
+
|
| 242 |
+
model = get_model(_DEMUCS_MODEL_NAME)
|
| 243 |
+
model.eval()
|
| 244 |
+
_DEMUCS_DEVICE = _resolve_demucs_device(torch)
|
| 245 |
+
model.to(_DEMUCS_DEVICE)
|
| 246 |
+
_DEMUCS_MODEL = model
|
| 247 |
+
_DEMUCS_TORCH = torch
|
| 248 |
+
_DEMUCS_LOAD_ERROR = None
|
| 249 |
+
return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
|
| 250 |
+
except Exception as exc:
|
| 251 |
+
_DEMUCS_MODEL = None
|
| 252 |
+
_DEMUCS_TORCH = None
|
| 253 |
+
_DEMUCS_LOAD_ERROR = str(exc)
|
| 254 |
+
LOGGER.warning(
|
| 255 |
+
"Demucs vocal analysis unavailable (%s). Cue selection continues without vocal penalty.",
|
| 256 |
+
exc,
|
| 257 |
+
)
|
| 258 |
+
return None, None, "unavailable", _DEMUCS_LOAD_ERROR
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
def _vocal_score_from_ratio(vocal_ratio: float) -> float:
|
| 262 |
+
ratio = _clamp(float(vocal_ratio), 0.0, 1.0)
|
| 263 |
+
# Penalize clearly vocal-dominant bars while leaving mixed bars mostly neutral.
|
| 264 |
+
return 1.0 - _clamp((ratio - 0.32) / 0.5, 0.0, 1.0)
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
def _lookup_stem_mixability(profile: Optional[_VocalActivityProfile], time_sec: float) -> Dict[str, float]:
|
| 268 |
+
neutral = {
|
| 269 |
+
"vocal_ratio": 0.0,
|
| 270 |
+
"vocal_onset": 0.0,
|
| 271 |
+
"vocal_phrase_score": 0.5,
|
| 272 |
+
"drum_anchor": 0.5,
|
| 273 |
+
"bass_energy": 0.5,
|
| 274 |
+
"bass_stability": 0.5,
|
| 275 |
+
"instrumental_density": 0.5,
|
| 276 |
+
"density_score": 0.5,
|
| 277 |
+
}
|
| 278 |
+
if profile is None or profile.times.size == 0:
|
| 279 |
+
return neutral
|
| 280 |
+
t_min = float(np.min(profile.times))
|
| 281 |
+
t_max = float(np.max(profile.times))
|
| 282 |
+
if float(time_sec) < (t_min - 0.6) or float(time_sec) > (t_max + 0.6):
|
| 283 |
+
return neutral
|
| 284 |
+
|
| 285 |
+
ratio = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.2, time_sec + 1.2), 0.0, 1.0)
|
| 286 |
+
vocal_onset = _clamp(_mean_1d(profile.vocal_onset, profile.times, time_sec - 0.2, time_sec + 0.3), 0.0, 1.0)
|
| 287 |
+
vocal_before = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.8, time_sec - 0.25), 0.0, 1.0)
|
| 288 |
+
vocal_after = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec + 0.25, time_sec + 1.8), 0.0, 1.0)
|
| 289 |
+
ending_score = _clamp((vocal_before - vocal_after + 0.05) / 0.35, 0.0, 1.0)
|
| 290 |
+
low_vocal_score = _vocal_score_from_ratio(ratio)
|
| 291 |
+
onset_quiet_score = 1.0 - vocal_onset
|
| 292 |
+
vocal_phrase_score = _clamp(
|
| 293 |
+
(0.52 * low_vocal_score) + (0.30 * ending_score) + (0.18 * onset_quiet_score),
|
| 294 |
+
0.0,
|
| 295 |
+
1.0,
|
| 296 |
+
)
|
| 297 |
+
|
| 298 |
+
if profile.has_drums:
|
| 299 |
+
drum_hit = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 0.1, time_sec + 0.24), 0.0, 1.0)
|
| 300 |
+
drum_bg = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 1.0, time_sec + 1.0), 0.0, 1.0)
|
| 301 |
+
drum_anchor = _clamp((0.72 * drum_hit) + (0.28 * _clamp(drum_hit - drum_bg + 0.22, 0.0, 1.0)), 0.0, 1.0)
|
| 302 |
+
else:
|
| 303 |
+
drum_anchor = 0.5
|
| 304 |
+
|
| 305 |
+
if profile.has_bass:
|
| 306 |
+
bass_energy = _clamp(_mean_1d(profile.bass_rms, profile.times, time_sec - 1.4, time_sec + 1.4), 0.0, 1.0)
|
| 307 |
+
bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, time_sec - 1.8, time_sec + 1.8), 0.0, 1.0)
|
| 308 |
+
bass_cv = bass_std / max(1e-4, bass_energy + 0.08)
|
| 309 |
+
bass_stability = 1.0 - _clamp((bass_cv - 0.18) / 0.85, 0.0, 1.0)
|
| 310 |
+
else:
|
| 311 |
+
bass_energy = 0.5
|
| 312 |
+
bass_stability = 0.5
|
| 313 |
+
|
| 314 |
+
instrumental_density = _clamp(
|
| 315 |
+
_mean_1d(profile.instrumental_rms, profile.times, time_sec - 1.4, time_sec + 1.4),
|
| 316 |
+
0.0,
|
| 317 |
+
1.0,
|
| 318 |
+
)
|
| 319 |
+
density_score = _target_position_score(instrumental_density, target=0.56, spread=0.24)
|
| 320 |
+
|
| 321 |
+
return {
|
| 322 |
+
"vocal_ratio": float(ratio),
|
| 323 |
+
"vocal_onset": float(vocal_onset),
|
| 324 |
+
"vocal_phrase_score": float(vocal_phrase_score),
|
| 325 |
+
"drum_anchor": float(drum_anchor),
|
| 326 |
+
"bass_energy": float(bass_energy),
|
| 327 |
+
"bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
|
| 328 |
+
"instrumental_density": float(instrumental_density),
|
| 329 |
+
"density_score": float(_clamp(density_score, 0.0, 1.0)),
|
| 330 |
+
}
|
| 331 |
+
|
| 332 |
+
|
| 333 |
+
def _range_coverage_ratio(times: np.ndarray, start: float, end: float) -> float:
    if times.size == 0:
        return 0.0
    lo = float(min(start, end))
    hi = float(max(start, end))
    if hi - lo <= 1e-6:
        return 0.0
    t_min = float(np.min(times))
    t_max = float(np.max(times))
    overlap = max(0.0, min(hi, t_max) - max(lo, t_min))
    return _clamp(overlap / max(1e-6, (hi - lo)), 0.0, 1.0)

def _sample_curve(values: np.ndarray, times: np.ndarray, start: float, end: float, samples: int = 16) -> np.ndarray:
    n = int(max(4, samples))
    if values.size == 0 or times.size == 0:
        return np.zeros((n,), dtype=np.float32)
    lo = float(min(start, end))
    hi = float(max(start, end))
    if hi - lo <= 1e-6:
        base = float(_mean_1d(values, times, lo - 0.25, hi + 0.25))
        return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
    ts = np.linspace(lo, hi, n, dtype=np.float32)
    if times.size < 2:
        base = float(_mean_1d(values, times, lo, hi))
        return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
    curve = np.interp(
        ts.astype(np.float64),
        times.astype(np.float64),
        values.astype(np.float64),
        left=float(values[0]),
        right=float(values[-1]),
    ).astype(np.float32)
    return np.clip(curve, 0.0, 1.0).astype(np.float32)

def _lookup_period_mixability(
    profile: Optional[_VocalActivityProfile],
    start_sec: float,
    end_sec: float,
    incoming: bool,
) -> Dict[str, Any]:
    neutral_curve = np.full((16,), 0.5, dtype=np.float32)
    neutral = {
        "coverage": 0.0,
        "period_vocal_ratio": 0.0,
        "period_vocal_phrase_score": 0.5,
        "period_drum_anchor": 0.5,
        "period_bass_energy": 0.5,
        "period_bass_stability": 0.5,
        "period_density_score": 0.5,
        "period_vocal_curve": neutral_curve.copy(),
        "period_bass_curve": neutral_curve.copy(),
    }
    if profile is None or profile.times.size == 0:
        return neutral

    lo = float(min(start_sec, end_sec))
    hi = float(max(start_sec, end_sec))
    span = max(1e-4, hi - lo)
    coverage = _range_coverage_ratio(profile.times, lo, hi)
    if coverage <= 0.03:
        return neutral

    ratio_mean = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, hi), 0.0, 1.0)
    vocal_curve = _sample_curve(profile.vocal_ratio, profile.times, lo, hi, samples=16)
    bass_curve = _sample_curve(profile.bass_rms, profile.times, lo, hi, samples=16)
    first_cut = lo + (0.35 * span)
    last_cut = hi - (0.35 * span)
    vocal_start = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, first_cut), 0.0, 1.0)
    vocal_end = _clamp(_mean_1d(profile.vocal_ratio, profile.times, last_cut, hi), 0.0, 1.0)
    boundary_t = lo if incoming else hi
    vocal_onset_boundary = _clamp(
        _mean_1d(profile.vocal_onset, profile.times, boundary_t - 0.16, boundary_t + 0.26),
        0.0,
        1.0,
    )
    low_vocal_score = _vocal_score_from_ratio(ratio_mean)
    onset_quiet = 1.0 - vocal_onset_boundary
    if incoming:
        start_quiet = _clamp(1.0 - ((vocal_start - 0.22) / 0.58), 0.0, 1.0)
        rise_ok = _clamp((vocal_end - vocal_start + 0.08) / 0.38, 0.0, 1.0)
        trend_score = _clamp((0.72 * start_quiet) + (0.28 * rise_ok), 0.0, 1.0)
    else:
        ending = _clamp((vocal_start - vocal_end + 0.05) / 0.35, 0.0, 1.0)
        trend_score = ending
    vocal_phrase = _clamp((0.50 * low_vocal_score) + (0.30 * trend_score) + (0.20 * onset_quiet), 0.0, 1.0)

    if profile.has_drums:
        drum_mean = _clamp(_mean_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
        drum_std = _clamp(_std_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
        drum_boundary = _clamp(
            _mean_1d(profile.drum_onset, profile.times, boundary_t - 0.12, boundary_t + 0.20),
            0.0,
            1.0,
        )
        drum_anchor = _clamp(
            (0.45 * drum_boundary)
            + (0.35 * drum_mean)
            + (0.20 * (1.0 - _clamp(drum_std / 0.35, 0.0, 1.0))),
            0.0,
            1.0,
        )
    else:
        drum_anchor = 0.5

    if profile.has_bass:
        bass_mean = _clamp(_mean_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
        bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
        bass_cv = bass_std / max(0.08, bass_mean)
        bass_stability = 1.0 - _clamp((bass_cv - 0.20) / 0.95, 0.0, 1.0)
    else:
        bass_mean = 0.5
        bass_stability = 0.5

    density_mean = _clamp(_mean_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
    density_std = _clamp(_std_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
    density_target = _target_position_score(density_mean, target=0.56, spread=0.22)
    density_stability = 1.0 - _clamp(density_std / 0.32, 0.0, 1.0)
    density_score = _clamp((0.75 * density_target) + (0.25 * density_stability), 0.0, 1.0)

    return {
        "coverage": float(coverage),
        "period_vocal_ratio": float(ratio_mean),
        "period_vocal_phrase_score": float(vocal_phrase),
        "period_drum_anchor": float(drum_anchor),
        "period_bass_energy": float(bass_mean),
        "period_bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
        "period_density_score": float(density_score),
        "period_vocal_curve": vocal_curve.astype(np.float32),
        "period_bass_curve": bass_curve.astype(np.float32),
    }

def _period_overlap_clash(cand_a: _CueCandidate, cand_b: _CueCandidate) -> Tuple[float, float, float]:
    n = int(
        max(
            4,
            min(
                int(cand_a.period_vocal_curve.size),
                int(cand_b.period_vocal_curve.size),
                int(cand_a.period_bass_curve.size),
                int(cand_b.period_bass_curve.size),
            ),
        )
    )
    if n <= 0:
        vocal = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
        bass = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
        cov = 0.5 * (cand_a.period_coverage + cand_b.period_coverage)
        return vocal, bass, cov

    a_v = ensure_length(cand_a.period_vocal_curve.astype(np.float32), n)
    b_v = ensure_length(cand_b.period_vocal_curve.astype(np.float32), n)
    a_b = ensure_length(cand_a.period_bass_curve.astype(np.float32), n)
    b_b = ensure_length(cand_b.period_bass_curve.astype(np.float32), n)
    x = np.linspace(0.0, 1.0, n, dtype=np.float32)

    w_a_v = 1.0 - x
    w_b_v = x
    vocal_risk = float(np.mean((a_v * w_a_v) * (b_v * w_b_v)))
    vocal_risk = _clamp(vocal_risk * 4.0, 0.0, 1.0)

    w_b_b = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
    w_a_b = 1.0 - w_b_b
    center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
    bass_risk = float(np.mean((a_b * w_a_b * center_bass_shape) * (b_b * w_b_b * center_bass_shape)))
    bass_risk = _clamp(bass_risk * 6.0, 0.0, 1.0)

    coverage = _clamp(0.5 * (cand_a.period_coverage + cand_b.period_coverage), 0.0, 1.0)
    if coverage < 0.35:
        fallback_v = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
        fallback_b = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
        alpha = _clamp((0.35 - coverage) / 0.35, 0.0, 1.0)
        vocal_risk = (1.0 - alpha) * vocal_risk + (alpha * fallback_v)
        bass_risk = (1.0 - alpha) * bass_risk + (alpha * fallback_b)

    return float(vocal_risk), float(bass_risk), float(coverage)

def _extract_vocal_profile_demucs(
    y: np.ndarray,
    sr: int,
    window_start_sec: float,
    track_label: str,
) -> Tuple[Optional[_VocalActivityProfile], Dict[str, object]]:
    global _DEMUCS_DEVICE

    info: Dict[str, object] = {
        "track": track_label,
        "enabled": bool(_DEMUCS_ENABLED),
        "model": _DEMUCS_MODEL_NAME,
    }
    if y.size < int(max(1, sr) * _DEMUCS_MIN_WINDOW_SEC):
        info["status"] = "skipped-short-window"
        return None, info

    model, torch_mod, status, reason = _get_demucs_model()
    info["status"] = status
    if reason:
        info["reason"] = reason
    if model is None or torch_mod is None:
        return None, info

    try:
        from demucs.apply import apply_model  # type: ignore[reportMissingImports]

        mono = np.asarray(y, dtype=np.float32).reshape(-1)
        if mono.size == 0:
            info["status"] = "empty-window"
            return None, info
        peak = float(np.max(np.abs(mono)))
        if peak > 1e-9:
            mono = mono / peak

        demucs_sr = int(getattr(model, "samplerate", 44100))
        if int(sr) != demucs_sr:
            mono = librosa.resample(mono, orig_sr=int(sr), target_sr=demucs_sr).astype(np.float32)
        if mono.size < int(demucs_sr * _DEMUCS_MIN_WINDOW_SEC):
            info["status"] = "skipped-short-window"
            return None, info

        stereo = np.stack([mono, mono], axis=0)
        mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(_DEMUCS_DEVICE)
        audio_sec = float(mono.size / max(1, demucs_sr))
        segment_limit = float(_DEMUCS_SEGMENT_SEC)
        if audio_sec <= (segment_limit + 0.02):
            use_split = False
            segment_sec = None
        else:
            use_split = True
            segment_sec = segment_limit

        try:
            with torch_mod.no_grad():
                estimates = apply_model(
                    model,
                    mix,
                    shifts=1,
                    split=use_split,
                    overlap=0.25,
                    progress=False,
                    device=_DEMUCS_DEVICE,
                    segment=segment_sec,
                )
        except Exception as exc:
            if _DEMUCS_DEVICE == "cuda":
                model.to("cpu")
                _DEMUCS_DEVICE = "cpu"
                mix = mix.to("cpu")
                with torch_mod.no_grad():
                    estimates = apply_model(
                        model,
                        mix,
                        shifts=1,
                        split=use_split,
                        overlap=0.25,
                        progress=False,
                        device="cpu",
                        segment=segment_sec,
                    )
                info["device_fallback"] = f"cuda->cpu ({exc})"
            else:
                raise

        estimates = estimates.detach().cpu()
        est = estimates[0] if estimates.ndim == 4 else estimates
        if est.ndim != 3:
            raise RuntimeError(f"Unexpected demucs output ndim: {est.ndim}")

        source_names = [str(s) for s in getattr(model, "sources", [])]
        if not source_names:
            raise RuntimeError("Demucs model returned no source labels.")
        if est.shape[0] != len(source_names):
            if est.shape[1] == len(source_names):
                est = est.permute(1, 0, 2)
            else:
                raise RuntimeError(
                    f"Demucs output/source mismatch ({tuple(est.shape)} vs {len(source_names)} sources)."
                )
        if "vocals" not in source_names:
            raise RuntimeError("Demucs model does not expose a 'vocals' stem.")

        vocal_idx = source_names.index("vocals")
        vocals = est[vocal_idx]
        has_drums = "drums" in source_names
        has_bass = "bass" in source_names
        drums = est[source_names.index("drums")] if has_drums else torch_mod.zeros_like(vocals)
        bass = est[source_names.index("bass")] if has_bass else torch_mod.zeros_like(vocals)
        non_vocal_idxs = [i for i in range(len(source_names)) if i != vocal_idx]
        if non_vocal_idxs:
            accompaniment = est[non_vocal_idxs].sum(dim=0)
        else:
            accompaniment = torch_mod.zeros_like(vocals)

        vocals_mono = vocals.mean(dim=0).numpy().astype(np.float32)
        drums_mono = drums.mean(dim=0).numpy().astype(np.float32)
        bass_mono = bass.mean(dim=0).numpy().astype(np.float32)
        accompaniment_mono = accompaniment.mean(dim=0).numpy().astype(np.float32)

        vocal_rms = librosa.feature.rms(y=vocals_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
        acc_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
        bass_rms = librosa.feature.rms(y=bass_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
        inst_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
        vocal_onset = librosa.onset.onset_strength(y=vocals_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
        drum_onset = librosa.onset.onset_strength(y=drums_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)

        ratio_raw = vocal_rms / np.maximum(vocal_rms + acc_rms, 1e-6)
        ratio_raw = np.clip(ratio_raw, 0.0, 1.0).astype(np.float32)
        aligned = _align_series_min_length([ratio_raw, vocal_onset, drum_onset, bass_rms, inst_rms])
        if not aligned:
            raise RuntimeError("Demucs profile alignment failed.")
        ratio, vocal_onset_n, drum_onset_n, bass_rms_n, inst_rms_n = aligned

        ratio = _smooth_1d(ratio, kernel_size=5)
        vocal_onset_n = _normalize_1d(_smooth_1d(vocal_onset_n, kernel_size=3))
        drum_onset_n = _normalize_1d(_smooth_1d(drum_onset_n, kernel_size=3))
        bass_rms_n = _normalize_1d(_smooth_1d(bass_rms_n, kernel_size=5))
        inst_rms_n = _normalize_1d(_smooth_1d(inst_rms_n, kernel_size=5))

        times = librosa.frames_to_time(np.arange(ratio.size), sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
        times = times + float(window_start_sec)

        info.update(
            {
                "status": "ready",
                "device": _DEMUCS_DEVICE,
                "method": "demucs-stem-mixability",
                "has_drums": bool(has_drums),
                "has_bass": bool(has_bass),
                "split_mode": "chunked" if use_split else "full-window",
                "window_start_sec": round(float(window_start_sec), 3),
                "window_duration_sec": round(float(mono.size / max(1, demucs_sr)), 3),
            }
        )
        return _VocalActivityProfile(
            vocal_ratio=ratio,
            vocal_onset=vocal_onset_n,
            drum_onset=drum_onset_n,
            bass_rms=bass_rms_n,
            instrumental_rms=inst_rms_n,
            times=times,
            method="demucs-stem-mixability",
            has_drums=bool(has_drums),
            has_bass=bool(has_bass),
        ), info
    except Exception as exc:
        LOGGER.warning("Demucs vocal analysis failed for %s (%s). Continuing without vocal penalty.", track_label, exc)
        info["status"] = "error"
        info["reason"] = str(exc)
        return None, info

def _label_weight(label: str, outgoing: bool) -> float:
    label_l = (label or "").strip().lower()
    if outgoing:
        table = [
            ("outro", 1.00),
            ("break", 0.95),
            ("bridge", 0.90),
            ("verse", 0.82),
            ("chorus", 0.66),
            ("intro", 0.20),
            ("start", 0.10),
            ("end", 0.05),
        ]
    else:
        table = [
            ("verse", 0.95),
            ("break", 0.90),
            ("bridge", 0.84),
            ("chorus", 0.80),
            ("intro", 0.74),
            ("outro", 0.20),
            ("start", 0.10),
            ("end", 0.05),
        ]
    for token, score in table:
        if token in label_l:
            return float(score)
    return 0.60

def _compute_profiles(y: np.ndarray, sr: int) -> _TrackProfiles:
    if y.size == 0:
        zero = np.zeros((1,), dtype=np.float32)
        return _TrackProfiles(
            rms=zero,
            rms_times=zero.copy(),
            onset=zero.copy(),
            onset_times=zero.copy(),
            chroma=np.zeros((12, 1), dtype=np.float32),
            chroma_times=zero.copy(),
        )

    rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
    try:
        harmonic = librosa.effects.harmonic(y)
        chroma = librosa.feature.chroma_cens(y=harmonic, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
    except Exception as exc:
        LOGGER.warning("Harmonic chroma extraction failed (%s); falling back to raw chroma.", exc)
        chroma = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)

    if rms.size == 0:
        rms = np.zeros((1,), dtype=np.float32)
    if onset.size == 0:
        onset = np.zeros((1,), dtype=np.float32)
    if chroma.ndim != 2 or chroma.shape[1] == 0:
        chroma = np.zeros((12, 1), dtype=np.float32)

    max_rms = float(np.max(rms))
    if max_rms > 1e-9:
        rms = rms / max_rms
    max_onset = float(np.max(onset))
    if max_onset > 1e-9:
        onset = onset / max_onset

    rms_times = librosa.frames_to_time(np.arange(rms.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
    onset_times = librosa.frames_to_time(np.arange(onset.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
    chroma_times = librosa.frames_to_time(np.arange(chroma.shape[1]), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
    return _TrackProfiles(
        rms=rms,
        rms_times=rms_times,
        onset=onset,
        onset_times=onset_times,
        chroma=chroma,
        chroma_times=chroma_times,
    )

def _build_candidates(
    beat_times: np.ndarray,
    min_sec: float,
    max_sec: float,
    prefer_tail: bool,
    limit: int,
) -> List[Tuple[float, int]]:
    if beat_times.size == 0:
        return []
    idxs = [idx for idx, t in enumerate(beat_times) if float(min_sec) <= float(t) <= float(max_sec)]
    if not idxs:
        return []
    idxs = idxs[-limit:] if prefer_tail else idxs[:limit]
    return [(float(beat_times[idx]), int(idx)) for idx in idxs]

def _make_candidate(
    time_sec: float,
    beat_idx: int,
    profiles: _TrackProfiles,
    incoming: bool,
    seam_sec: float,
    vocal_profile: Optional[_VocalActivityProfile] = None,
    vocal_time_sec: Optional[float] = None,
) -> _CueCandidate:
    if incoming:
        energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec, time_sec + 1.0)
        onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.1, time_sec + 0.5)
    else:
        energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec - 1.0, time_sec)
        onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.5, time_sec + 0.1)
    chroma = _mean_2d(profiles.chroma, profiles.chroma_times, time_sec - 2.0, time_sec + 2.0)
    vocal_lookup_sec = float(vocal_time_sec) if vocal_time_sec is not None else float(time_sec)
    stem_mix = _lookup_stem_mixability(vocal_profile, vocal_lookup_sec)
    seam = max(1e-3, float(seam_sec))
    period_start = vocal_lookup_sec if incoming else (vocal_lookup_sec - seam)
    period_end = (vocal_lookup_sec + seam) if incoming else vocal_lookup_sec
    period_mix = _lookup_period_mixability(
        profile=vocal_profile,
        start_sec=period_start,
        end_sec=period_end,
        incoming=incoming,
    )
    return _CueCandidate(
        time_sec=float(time_sec),
        beat_idx=int(beat_idx),
        phrase=_phrase_score(int(beat_idx)),
        energy=float(_clamp(energy, 0.0, 1.0)),
        onset=float(_clamp(onset, 0.0, 1.0)),
        chroma=chroma,
        vocal_ratio=float(stem_mix["vocal_ratio"]),
        vocal_onset=float(stem_mix["vocal_onset"]),
        vocal_phrase_score=float(stem_mix["vocal_phrase_score"]),
        drum_anchor=float(stem_mix["drum_anchor"]),
        bass_energy=float(stem_mix["bass_energy"]),
        bass_stability=float(stem_mix["bass_stability"]),
        instrumental_density=float(stem_mix["instrumental_density"]),
        density_score=float(stem_mix["density_score"]),
        period_vocal_ratio=float(period_mix["period_vocal_ratio"]),
        period_vocal_phrase_score=float(period_mix["period_vocal_phrase_score"]),
        period_drum_anchor=float(period_mix["period_drum_anchor"]),
        period_bass_energy=float(period_mix["period_bass_energy"]),
        period_bass_stability=float(period_mix["period_bass_stability"]),
        period_density_score=float(period_mix["period_density_score"]),
        period_coverage=float(period_mix["coverage"]),
        period_vocal_curve=np.asarray(period_mix["period_vocal_curve"], dtype=np.float32),
        period_bass_curve=np.asarray(period_mix["period_bass_curve"], dtype=np.float32),
    )

def _score_pair(
    cand_a: _CueCandidate,
    cand_b: _CueCandidate,
    target_a: float,
    target_b: float,
) -> Tuple[float, Dict[str, float]]:
    energy_match = 1.0 - min(1.0, abs(cand_a.energy - cand_b.energy))
    phrase_match = 0.5 * (cand_a.phrase + cand_b.phrase)
    key_match = _clamp(_cosine_similarity(cand_a.chroma, cand_b.chroma), 0.0, 1.0)
    onset_match = (0.35 * cand_a.onset) + (0.65 * cand_b.onset)
    position_match = 0.5 * (
        _target_position_score(cand_a.time_sec, target_a, spread=3.0)
        + _target_position_score(cand_b.time_sec, target_b, spread=3.0)
    )
    vocal_phrase_match = 0.5 * (cand_a.vocal_phrase_score + cand_b.vocal_phrase_score)
    drum_anchor_match = 0.5 * (cand_a.drum_anchor + cand_b.drum_anchor)
    bass_stability_match = 0.5 * (cand_a.bass_stability + cand_b.bass_stability)
    density_match = 0.5 * (cand_a.density_score + cand_b.density_score)
    period_vocal_phrase_match = 0.5 * (cand_a.period_vocal_phrase_score + cand_b.period_vocal_phrase_score)
    period_drum_anchor_match = 0.5 * (cand_a.period_drum_anchor + cand_b.period_drum_anchor)
    period_bass_stability_match = 0.5 * (cand_a.period_bass_stability + cand_b.period_bass_stability)
    period_density_match = 0.5 * (cand_a.period_density_score + cand_b.period_density_score)
    vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a, cand_b)
    clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)

    total = (
        (0.07 * energy_match)
        + (0.06 * phrase_match)
        + (0.08 * key_match)
        + (0.04 * onset_match)
        + (0.03 * position_match)
        + (0.05 * vocal_phrase_match)
        + (0.04 * drum_anchor_match)
        + (0.03 * bass_stability_match)
        + (0.02 * density_match)
        + (0.14 * period_vocal_phrase_match)
        + (0.10 * period_drum_anchor_match)
        + (0.09 * period_bass_stability_match)
        + (0.07 * period_density_match)
        + (0.16 * clash_avoidance)
        + (0.02 * period_coverage)
    )
    components = {
        "energy_match": float(energy_match),
        "phrase_match": float(phrase_match),
        "key_match": float(key_match),
        "onset_match": float(onset_match),
        "position_match": float(position_match),
        "vocal_phrase_match": float(vocal_phrase_match),
        "drum_anchor_match": float(drum_anchor_match),
        "bass_stability_match": float(bass_stability_match),
        "density_match": float(density_match),
        "period_vocal_phrase_match": float(period_vocal_phrase_match),
        "period_drum_anchor_match": float(period_drum_anchor_match),
        "period_bass_stability_match": float(period_bass_stability_match),
        "period_density_match": float(period_density_match),
        "period_coverage": float(period_coverage),
        "vocal_clash_risk": float(vocal_clash_risk),
        "bass_clash_risk": float(bass_clash_risk),
        "clash_avoidance": float(clash_avoidance),
        "total": float(total),
    }
    return float(total), components

def _segments_from_boundaries(boundaries: np.ndarray, duration_sec: float) -> List[Dict[str, object]]:
    clean = [0.0]
    for t in np.asarray(boundaries, dtype=np.float32):
        x = float(t)
        if 0.0 < x < float(duration_sec):
            clean.append(x)
    clean.append(float(duration_sec))
    clean = sorted(set(round(x, 3) for x in clean))
    segs: List[Dict[str, object]] = []
    for idx in range(len(clean) - 1):
        start = float(clean[idx])
        end = float(clean[idx + 1])
        if end - start < 4.0:
            continue
        segs.append({"start": start, "end": end, "label": f"section_{idx + 1}"})
    return segs

def _try_get_librosa_structure(path: str, duration_sec: float) -> Optional[Dict[str, np.ndarray]]:
    if path in _LIBROSA_STRUCT_CACHE:
        return _LIBROSA_STRUCT_CACHE[path]

    decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
    try:
        y, _ = decode_segment(
            path,
            start_sec=0.0,
            duration_sec=decode_sec,
            sr=_STRUCT_SR,
            max_decode_sec=max(600.0, decode_sec + 3.0),
        )
    except Exception as exc:
        LOGGER.warning("librosa full-track decode failed for %s (%s).", path, exc)
        _LIBROSA_STRUCT_CACHE[path] = None
        return None

    if y.size < _STRUCT_SR:
        _LIBROSA_STRUCT_CACHE[path] = None
        return None

    try:
        _, beat_frames = librosa.beat.beat_track(y=y, sr=_STRUCT_SR, trim=False)
        beat_times = librosa.frames_to_time(np.asarray(beat_frames), sr=_STRUCT_SR).astype(np.float32)
        downbeats = beat_times[::4] if beat_times.size > 0 else np.array([], dtype=np.float32)

        onset_env = librosa.onset.onset_strength(y=y, sr=_STRUCT_SR, hop_length=_ANALYSIS_HOP).astype(np.float32)
        boundary_frames = librosa.util.peak_pick(
            onset_env,
            pre_max=8,
            post_max=8,
            pre_avg=24,
            post_avg=24,
            delta=0.06,
            wait=18,
        )
        boundaries = librosa.frames_to_time(
            np.asarray(boundary_frames),
            sr=_STRUCT_SR,
            hop_length=_ANALYSIS_HOP,
        ).astype(np.float32)
        payload: Dict[str, np.ndarray] = {"downbeats": downbeats, "boundaries": boundaries}
        _LIBROSA_STRUCT_CACHE[path] = payload
        return payload
    except Exception as exc:
        LOGGER.warning("librosa structure extraction failed for %s (%s).", path, exc)
        _LIBROSA_STRUCT_CACHE[path] = None
        return None

def _get_or_build_profiles_for_track(path: str, duration_sec: float, sr: int) -> Optional[_TrackProfiles]:
    key = (path, int(sr))
    if key in _PROFILE_CACHE:
        return _PROFILE_CACHE[key]

    decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
    try:
        y, _ = decode_segment(
            path,
            start_sec=0.0,
            duration_sec=decode_sec,
            sr=int(sr),
            max_decode_sec=max(600.0, decode_sec + 3.0),
        )
    except Exception as exc:
        LOGGER.warning("Full-track decode failed for %s (%s).", path, exc)
        _PROFILE_CACHE[key] = None
        return None

    if y.size < int(sr):
        _PROFILE_CACHE[key] = None
        return None

    profiles = _compute_profiles(y, int(sr))
    _PROFILE_CACHE[key] = profiles
    return profiles

def _label_for_time(segments: List[Dict[str, object]], t: float) -> str:
    for seg in segments:
        start = float(seg["start"])
        end = float(seg["end"])
        if start <= float(t) < end:
            return str(seg.get("label", "unknown"))
    return "unknown"

def _dedupe_times(times: List[float], min_gap_sec: float) -> List[float]:
    if not times:
        return []
    sorted_times = sorted(float(t) for t in times)
    out: List[float] = [sorted_times[0]]
    for t in sorted_times[1:]:
        if (t - out[-1]) >= float(min_gap_sec):
            out.append(t)
    return out

def _build_structured_candidates(
|
| 1017 |
+
downbeats: np.ndarray,
|
| 1018 |
+
segments: List[Dict[str, object]],
|
| 1019 |
+
profiles: _TrackProfiles,
|
| 1020 |
+
vocal_profile: Optional[_VocalActivityProfile],
|
| 1021 |
+
seam_sec: float,
|
| 1022 |
+
duration_sec: float,
|
| 1023 |
+
min_sec: float,
|
| 1024 |
+
max_sec: float,
|
| 1025 |
+
incoming: bool,
|
| 1026 |
+
target_ratio: float,
|
| 1027 |
+
limit: int = 20,
|
| 1028 |
+
) -> List[_StructuredCandidate]:
|
| 1029 |
+
if max_sec <= min_sec:
|
| 1030 |
+
return []
|
| 1031 |
+
|
| 1032 |
+
raw_times: List[float] = []
|
| 1033 |
+
if downbeats.size > 0:
|
| 1034 |
+
raw_times.extend([float(t) for t in downbeats if min_sec <= float(t) <= max_sec])
|
| 1035 |
+
|
| 1036 |
+
for seg in segments:
|
| 1037 |
+
start = float(seg["start"])
|
| 1038 |
+
end = float(seg["end"])
|
| 1039 |
+
if min_sec <= start <= max_sec:
|
| 1040 |
+
raw_times.append(start)
|
| 1041 |
+
if min_sec <= end <= max_sec:
|
| 1042 |
+
raw_times.append(end)
|
| 1043 |
+
|
| 1044 |
+
if not raw_times and downbeats.size > 0:
|
| 1045 |
+
raw_times.extend([float(t) for t in downbeats])
|
| 1046 |
+
|
| 1047 |
+
if not raw_times:
|
| 1048 |
+
return []
|
| 1049 |
+
|
| 1050 |
+
snapped: List[float] = []
|
| 1051 |
+
for t in raw_times:
|
| 1052 |
+
if downbeats.size > 0:
|
| 1053 |
+
idx = int(np.argmin(np.abs(downbeats - float(t))))
|
| 1054 |
+
snapped_t = float(downbeats[idx])
|
| 1055 |
+
else:
|
| 1056 |
+
snapped_t = float(t)
|
| 1057 |
+
if min_sec <= snapped_t <= max_sec:
|
| 1058 |
+
snapped.append(snapped_t)
|
| 1059 |
+
|
| 1060 |
+
snapped = _dedupe_times(snapped, min_gap_sec=1.2)
|
| 1061 |
+
if not snapped:
|
| 1062 |
+
return []
|
| 1063 |
+
|
| 1064 |
+
if incoming:
|
| 1065 |
+
snapped = snapped[:limit]
|
| 1066 |
+
else:
|
| 1067 |
+
snapped = snapped[-limit:]
|
| 1068 |
+
|
| 1069 |
+
target_sec = float(target_ratio * duration_sec)
|
| 1070 |
+
spread = max(4.0, 0.15 * duration_sec)
|
| 1071 |
+
|
| 1072 |
+
built: List[_StructuredCandidate] = []
|
| 1073 |
+
for i, t in enumerate(snapped):
|
| 1074 |
+
label = _label_for_time(segments, t)
|
| 1075 |
+
cue = _make_candidate(
|
| 1076 |
+
time_sec=t,
|
| 1077 |
+
beat_idx=(i * 4),
|
| 1078 |
+
profiles=profiles,
|
| 1079 |
+
incoming=incoming,
|
| 1080 |
+
seam_sec=seam_sec,
|
| 1081 |
+
vocal_profile=vocal_profile,
|
| 1082 |
+
vocal_time_sec=t,
|
| 1083 |
+
)
|
| 1084 |
+
built.append(
|
| 1085 |
+
_StructuredCandidate(
|
| 1086 |
+
cue=cue,
|
| 1087 |
+
label=label,
|
| 1088 |
+
label_score=_label_weight(label, outgoing=(not incoming)),
|
| 1089 |
+
edge_score=_edge_score(t, duration_sec),
|
| 1090 |
+
position_score=_target_position_score(t, target=target_sec, spread=spread),
|
| 1091 |
+
)
|
| 1092 |
+
)
|
| 1093 |
+
return built

def _score_structured_pair(
    cand_a: _StructuredCandidate,
    cand_b: _StructuredCandidate,
) -> Tuple[float, Dict[str, float]]:
    energy_match = 1.0 - min(1.0, abs(cand_a.cue.energy - cand_b.cue.energy))
    phrase_match = 0.5 * (cand_a.cue.phrase + cand_b.cue.phrase)
    key_match = _clamp(_cosine_similarity(cand_a.cue.chroma, cand_b.cue.chroma), 0.0, 1.0)
    onset_match = (0.40 * cand_a.cue.onset) + (0.60 * cand_b.cue.onset)
    label_match = 0.5 * (cand_a.label_score + cand_b.label_score)
    position_match = 0.5 * (cand_a.position_score + cand_b.position_score)
    edge_match = 0.5 * (cand_a.edge_score + cand_b.edge_score)
    vocal_phrase_match = 0.5 * (cand_a.cue.vocal_phrase_score + cand_b.cue.vocal_phrase_score)
    drum_anchor_match = 0.5 * (cand_a.cue.drum_anchor + cand_b.cue.drum_anchor)
    bass_stability_match = 0.5 * (cand_a.cue.bass_stability + cand_b.cue.bass_stability)
    density_match = 0.5 * (cand_a.cue.density_score + cand_b.cue.density_score)
    period_vocal_phrase_match = 0.5 * (cand_a.cue.period_vocal_phrase_score + cand_b.cue.period_vocal_phrase_score)
    period_drum_anchor_match = 0.5 * (cand_a.cue.period_drum_anchor + cand_b.cue.period_drum_anchor)
    period_bass_stability_match = 0.5 * (cand_a.cue.period_bass_stability + cand_b.cue.period_bass_stability)
    period_density_match = 0.5 * (cand_a.cue.period_density_score + cand_b.cue.period_density_score)
    vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a.cue, cand_b.cue)
    clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)

    total = (
        (0.08 * energy_match)
        + (0.09 * key_match)
        + (0.06 * onset_match)
        + (0.05 * phrase_match)
        + (0.10 * label_match)
        + (0.05 * position_match)
        + (0.04 * edge_match)
        + (0.06 * vocal_phrase_match)
        + (0.04 * drum_anchor_match)
        + (0.03 * bass_stability_match)
        + (0.02 * density_match)
        + (0.12 * period_vocal_phrase_match)
        + (0.08 * period_drum_anchor_match)
        + (0.07 * period_bass_stability_match)
        + (0.05 * period_density_match)
        + (0.06 * clash_avoidance)
        + (0.01 * period_coverage)
    )
    components = {
        "energy_match": float(energy_match),
        "key_match": float(key_match),
        "onset_match": float(onset_match),
        "phrase_match": float(phrase_match),
        "label_match": float(label_match),
        "position_match": float(position_match),
        "edge_match": float(edge_match),
        "vocal_phrase_match": float(vocal_phrase_match),
        "drum_anchor_match": float(drum_anchor_match),
        "bass_stability_match": float(bass_stability_match),
        "density_match": float(density_match),
        "period_vocal_phrase_match": float(period_vocal_phrase_match),
        "period_drum_anchor_match": float(period_drum_anchor_match),
        "period_bass_stability_match": float(period_bass_stability_match),
        "period_density_match": float(period_density_match),
        "period_coverage": float(period_coverage),
        "vocal_clash_risk": float(vocal_clash_risk),
        "bass_clash_risk": float(bass_clash_risk),
        "clash_avoidance": float(clash_avoidance),
        "total": float(total),
    }
    return float(total), components
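One property of `_score_structured_pair` worth making explicit: its component weights sum to 1.01 rather than exactly 1.0 (the trailing `period_coverage` term adds the extra 0.01). That is harmless as long as the score is only compared between candidate pairs, but a standalone check keeps the weight budget visible. The `WEIGHTS` dict below is transcribed from the function body:

```python
# Weights from _score_structured_pair, in order of appearance in `total`.
WEIGHTS = {
    "energy_match": 0.08,
    "key_match": 0.09,
    "onset_match": 0.06,
    "phrase_match": 0.05,
    "label_match": 0.10,
    "position_match": 0.05,
    "edge_match": 0.04,
    "vocal_phrase_match": 0.06,
    "drum_anchor_match": 0.04,
    "bass_stability_match": 0.03,
    "density_match": 0.02,
    "period_vocal_phrase_match": 0.12,
    "period_drum_anchor_match": 0.08,
    "period_bass_stability_match": 0.07,
    "period_density_match": 0.05,
    "clash_avoidance": 0.06,
    "period_coverage": 0.01,
}

# The budget intentionally favors the period-* (transition-window) terms.
print(round(sum(WEIGHTS.values()), 2))  # 1.01
```

Since every component is clamped to [0, 1], `total` is therefore bounded by [0, 1.01], and the period-window terms (0.12 + 0.08 + 0.07 + 0.05) carry the largest share of the score.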

def _try_structure_aware_selection(
    song_a_path: Optional[str],
    song_b_path: Optional[str],
    song_a_duration_sec: Optional[float],
    song_b_duration_sec: Optional[float],
    pre_sec: float,
    seam_sec: float,
    post_sec: float,
    vocal_profile_a: Optional[_VocalActivityProfile],
    vocal_profile_b: Optional[_VocalActivityProfile],
) -> Optional[CueSelectionResult]:
    if not song_a_path or not song_b_path:
        return None
    if song_a_duration_sec is None or song_b_duration_sec is None:
        return None

    dur_a = float(song_a_duration_sec)
    dur_b = float(song_b_duration_sec)

    min_a = max(seam_sec + 2.0, pre_sec + 2.0, 0.30 * dur_a)
    max_a = min(dur_a - seam_sec - 2.0, 0.88 * dur_a)
    min_b = max(4.0, 0.10 * dur_b)
    max_b = min(dur_b - (seam_sec + post_sec + 2.0), 0.72 * dur_b)

    if max_a <= min_a or max_b <= min_b:
        return None

    source = "librosa"
    lib_a = _try_get_librosa_structure(song_a_path, dur_a)
    lib_b = _try_get_librosa_structure(song_b_path, dur_b)
    if lib_a is None or lib_b is None:
        return None

    downbeats_a = np.asarray(lib_a.get("downbeats", []), dtype=np.float32)
    downbeats_b = np.asarray(lib_b.get("downbeats", []), dtype=np.float32)
    segments_a: List[Dict[str, object]] = _segments_from_boundaries(
        np.asarray(lib_a.get("boundaries", []), dtype=np.float32),
        duration_sec=dur_a,
    )
    segments_b: List[Dict[str, object]] = _segments_from_boundaries(
        np.asarray(lib_b.get("boundaries", []), dtype=np.float32),
        duration_sec=dur_b,
    )

    if downbeats_a.size < 4 or downbeats_b.size < 4:
        return None

    profiles_a = _get_or_build_profiles_for_track(song_a_path, dur_a, sr=_STRUCT_SR)
    profiles_b = _get_or_build_profiles_for_track(song_b_path, dur_b, sr=_STRUCT_SR)
    if profiles_a is None or profiles_b is None:
        return None

    cands_a = _build_structured_candidates(
        downbeats=downbeats_a,
        segments=segments_a,
        profiles=profiles_a,
        vocal_profile=vocal_profile_a,
        seam_sec=seam_sec,
        duration_sec=dur_a,
        min_sec=min_a,
        max_sec=max_a,
        incoming=False,
        target_ratio=0.63,
        limit=22,
    )
    cands_b = _build_structured_candidates(
        downbeats=downbeats_b,
        segments=segments_b,
        profiles=profiles_b,
        vocal_profile=vocal_profile_b,
        seam_sec=seam_sec,
        duration_sec=dur_b,
        min_sec=min_b,
        max_sec=max_b,
        incoming=True,
        target_ratio=0.27,
        limit=22,
    )
    if not cands_a or not cands_b:
        return None

    best_score = -1.0
    best_a: Optional[_StructuredCandidate] = None
    best_b: Optional[_StructuredCandidate] = None
    ranked: List[Dict[str, object]] = []
    for ca in cands_a:
        for cb in cands_b:
            score, comps = _score_structured_pair(ca, cb)
            ranked.append(
                {
                    "score": float(score),
                    "song_a_sec": float(ca.cue.time_sec),
                    "song_b_sec": float(cb.cue.time_sec),
                    "song_a_label": ca.label,
                    "song_b_label": cb.label,
                    "song_a_vocal_ratio": float(ca.cue.vocal_ratio),
                    "song_b_vocal_ratio": float(cb.cue.vocal_ratio),
                    "song_a_period_vocal_phrase": float(ca.cue.period_vocal_phrase_score),
                    "song_b_period_vocal_phrase": float(cb.cue.period_vocal_phrase_score),
                    "song_a_period_drum_anchor": float(ca.cue.period_drum_anchor),
                    "song_b_period_drum_anchor": float(cb.cue.period_drum_anchor),
                    "song_a_period_bass_stability": float(ca.cue.period_bass_stability),
                    "song_b_period_bass_stability": float(cb.cue.period_bass_stability),
                    "song_a_period_density": float(ca.cue.period_density_score),
                    "song_b_period_density": float(cb.cue.period_density_score),
                    "song_a_period_coverage": float(ca.cue.period_coverage),
                    "song_b_period_coverage": float(cb.cue.period_coverage),
                    "components": comps,
                }
            )
            if score > best_score:
                best_score = float(score)
                best_a = ca
                best_b = cb

    if best_a is None or best_b is None:
        return None

    ranked = sorted(ranked, key=lambda x: float(x["score"]), reverse=True)
    top_pairs = []
    for item in ranked[:3]:
        top_pairs.append(
            {
                "score": round(float(item["score"]), 4),
                "song_a_sec": round(float(item["song_a_sec"]), 3),
                "song_b_sec": round(float(item["song_b_sec"]), 3),
                "song_a_label": str(item["song_a_label"]),
                "song_b_label": str(item["song_b_label"]),
                "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
                "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
                "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
                "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
                "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
                "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
                "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
                "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
                "song_a_period_density": round(float(item["song_a_period_density"]), 4),
                "song_b_period_density": round(float(item["song_b_period_density"]), 4),
                "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
                "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
                "components": {k: round(float(v), 4) for k, v in item["components"].items()},
            }
        )

    principles = [
        "phrase/downbeat alignment",
        "section boundary awareness",
        "energy continuity",
        "harmonic/chroma compatibility",
    ]
    if vocal_profile_a is not None or vocal_profile_b is not None:
        principles.extend(
            [
                "vocal phrase-safe cueing (low or ending vocals)",
                "drum-anchor confidence",
                "bassline stability control",
                "instrumental density targeting",
                "clash-risk precheck (vocal+bass overlap)",
            ]
        )

    return CueSelectionResult(
        cue_a_sec=float(best_a.cue.time_sec),
        cue_b_sec=float(best_b.cue.time_sec),
        method=f"{source}-structure-aware",
        debug={
            "source": source,
            "candidate_ranges_sec": {
                "song_a": [round(min_a, 3), round(max_a, 3)],
                "song_b": [round(min_b, 3), round(max_b, 3)],
            },
            "transition_period_sec": round(float(seam_sec), 3),
            "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
            "selected_sec": {"song_a": round(float(best_a.cue.time_sec), 3), "song_b": round(float(best_b.cue.time_sec), 3)},
            "selected_labels": {"song_a": best_a.label, "song_b": best_b.label},
            "selected_mixability": {
                "song_a_ratio": round(float(best_a.cue.vocal_ratio), 4),
                "song_b_ratio": round(float(best_b.cue.vocal_ratio), 4),
                "song_a_vocal_onset": round(float(best_a.cue.vocal_onset), 4),
                "song_b_vocal_onset": round(float(best_b.cue.vocal_onset), 4),
                "song_a_vocal_phrase": round(float(best_a.cue.vocal_phrase_score), 4),
                "song_b_vocal_phrase": round(float(best_b.cue.vocal_phrase_score), 4),
                "song_a_drum_anchor": round(float(best_a.cue.drum_anchor), 4),
                "song_b_drum_anchor": round(float(best_b.cue.drum_anchor), 4),
                "song_a_bass_stability": round(float(best_a.cue.bass_stability), 4),
                "song_b_bass_stability": round(float(best_b.cue.bass_stability), 4),
                "song_a_density_score": round(float(best_a.cue.density_score), 4),
                "song_b_density_score": round(float(best_b.cue.density_score), 4),
                "song_a_period_vocal_phrase": round(float(best_a.cue.period_vocal_phrase_score), 4),
                "song_b_period_vocal_phrase": round(float(best_b.cue.period_vocal_phrase_score), 4),
                "song_a_period_drum_anchor": round(float(best_a.cue.period_drum_anchor), 4),
                "song_b_period_drum_anchor": round(float(best_b.cue.period_drum_anchor), 4),
                "song_a_period_bass_stability": round(float(best_a.cue.period_bass_stability), 4),
                "song_b_period_bass_stability": round(float(best_b.cue.period_bass_stability), 4),
                "song_a_period_density_score": round(float(best_a.cue.period_density_score), 4),
                "song_b_period_density_score": round(float(best_b.cue.period_density_score), 4),
                "song_a_period_coverage": round(float(best_a.cue.period_coverage), 4),
                "song_b_period_coverage": round(float(best_b.cue.period_coverage), 4),
            },
            "top_pairs": top_pairs,
            "period_scoring": {
                "enabled": True,
                "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
                "overlap_simulation": "weighted vocal/bass clash precheck",
            },
            "dj_principles": principles,
        },
    )
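The window arithmetic near the top of `_try_structure_aware_selection` confines the outgoing cue to roughly the 30–88% span of song A and the incoming cue to roughly the first 10–72% of song B, minus seam/post margins. A standalone sketch of that arithmetic (same formulas, hypothetical durations and margins) makes the resulting ranges concrete:

```python
def candidate_windows(dur_a, dur_b, pre_sec, seam_sec, post_sec):
    # Same window formulas as _try_structure_aware_selection:
    # outgoing cue late in song A, incoming cue early in song B.
    min_a = max(seam_sec + 2.0, pre_sec + 2.0, 0.30 * dur_a)
    max_a = min(dur_a - seam_sec - 2.0, 0.88 * dur_a)
    min_b = max(4.0, 0.10 * dur_b)
    max_b = min(dur_b - (seam_sec + post_sec + 2.0), 0.72 * dur_b)
    return (min_a, max_a), (min_b, max_b)


# Hypothetical 200 s / 210 s tracks with a 10 s seam.
(a_lo, a_hi), (b_lo, b_hi) = candidate_windows(
    dur_a=200.0, dur_b=210.0, pre_sec=8.0, seam_sec=10.0, post_sec=8.0
)
print((a_lo, a_hi), (b_lo, b_hi))  # song A: 60..176 s, song B: 21..151.2 s
```

If either window collapses (e.g. a very short track where `0.30 * dur_a` overtakes `0.88 * dur_a` minus margins), the function returns `None` and the caller falls back to the beat-based selection.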

def select_mix_cuepoints(
    y_a_analysis: np.ndarray,
    y_b_analysis: np.ndarray,
    sr: int,
    analysis_sec: float,
    pre_sec: float,
    seam_sec: float,
    post_sec: float,
    a_analysis_start_sec: float,
    beats_a: np.ndarray,
    beats_b: np.ndarray,
    cue_a_override_sec: Optional[float] = None,
    cue_b_override_sec: Optional[float] = None,
    song_a_path: Optional[str] = None,
    song_b_path: Optional[str] = None,
    song_a_duration_sec: Optional[float] = None,
    song_b_duration_sec: Optional[float] = None,
) -> CueSelectionResult:
    target_a_rel = max(float(pre_sec), float(analysis_sec - seam_sec - 2.0))
    target_b_rel = 2.0
    default_a_rel = float(choose_nearest_beat(beats_a, target_a_rel))
    default_b_rel = float(choose_first_beat_after(beats_b, target_b_rel))

    default_a_abs = float(a_analysis_start_sec + default_a_rel)
    default_b_abs = float(default_b_rel)

    if cue_a_override_sec is not None or cue_b_override_sec is not None:
        cue_a = float(cue_a_override_sec) if cue_a_override_sec is not None else default_a_abs
        cue_b = float(cue_b_override_sec) if cue_b_override_sec is not None else default_b_abs
        return CueSelectionResult(
            cue_a_sec=cue_a,
            cue_b_sec=cue_b,
            method="manual-override",
            debug={
                "manual_override": True,
                "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
            },
        )

    vocal_profile_a, vocal_debug_a = _extract_vocal_profile_demucs(
        y=y_a_analysis,
        sr=int(sr),
        window_start_sec=float(a_analysis_start_sec),
        track_label="song_a_analysis_window",
    )
    vocal_profile_b, vocal_debug_b = _extract_vocal_profile_demucs(
        y=y_b_analysis,
        sr=int(sr),
        window_start_sec=0.0,
        track_label="song_b_analysis_window",
    )
    vocal_debug = {
        "enabled": bool(_DEMUCS_ENABLED),
        "song_a": vocal_debug_a,
        "song_b": vocal_debug_b,
    }

    structure_result = _try_structure_aware_selection(
        song_a_path=song_a_path,
        song_b_path=song_b_path,
        song_a_duration_sec=song_a_duration_sec,
        song_b_duration_sec=song_b_duration_sec,
        pre_sec=float(pre_sec),
        seam_sec=float(seam_sec),
        post_sec=float(post_sec),
        vocal_profile_a=vocal_profile_a,
        vocal_profile_b=vocal_profile_b,
    )
    if structure_result is not None:
        structure_result.debug["manual_override"] = False
        structure_result.debug["default_local_auto_cues_sec"] = {
            "song_a": round(default_a_abs, 3),
            "song_b": round(default_b_abs, 3),
        }
        structure_result.debug["vocal_analysis"] = vocal_debug
        structure_result.debug["vocal_penalty_active"] = bool(vocal_profile_a is not None or vocal_profile_b is not None)
        return structure_result

    if beats_a.size < 4 or beats_b.size < 4:
        return CueSelectionResult(
            cue_a_sec=default_a_abs,
            cue_b_sec=default_b_abs,
            method="beat-fallback",
            debug={
                "reason": "insufficient_beats",
                "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
                "vocal_analysis": vocal_debug,
            },
        )

    profiles_a = _compute_profiles(y_a_analysis, sr)
    profiles_b = _compute_profiles(y_b_analysis, sr)

    min_a = max(0.5, float(seam_sec + 0.5), float(pre_sec + (0.20 * seam_sec)))
    max_a = max(min_a + 0.1, float(analysis_sec - max(0.75, 0.25 * seam_sec)))
    min_b = max(0.75, float(0.12 * seam_sec))
    max_b = max(min_b + 0.1, float(analysis_sec - max((seam_sec + 0.75), (0.25 * post_sec))))

    raw_a = _build_candidates(beats_a, min_a, max_a, prefer_tail=True, limit=24)
    raw_b = _build_candidates(beats_b, min_b, max_b, prefer_tail=False, limit=24)
    if not raw_a or not raw_b:
        return CueSelectionResult(
            cue_a_sec=default_a_abs,
            cue_b_sec=default_b_abs,
            method="candidate-fallback",
            debug={
                "reason": "empty_candidate_set",
                "candidate_counts": {"song_a": len(raw_a), "song_b": len(raw_b)},
                "candidate_windows_sec": {
                    "song_a": [round(min_a, 3), round(max_a, 3)],
                    "song_b": [round(min_b, 3), round(max_b, 3)],
                },
                "vocal_analysis": vocal_debug,
            },
        )

    cands_a = [
        _make_candidate(
            t,
            idx,
            profiles_a,
            incoming=False,
            seam_sec=float(seam_sec),
            vocal_profile=vocal_profile_a,
            vocal_time_sec=float(a_analysis_start_sec + t),
        )
        for (t, idx) in raw_a
    ]
    cands_b = [
        _make_candidate(
            t,
            idx,
            profiles_b,
            incoming=True,
            seam_sec=float(seam_sec),
            vocal_profile=vocal_profile_b,
            vocal_time_sec=float(t),
        )
        for (t, idx) in raw_b
    ]

    scored_pairs: List[Dict[str, object]] = []
    best: Optional[Dict[str, object]] = None
    target_b = max(2.0, min(8.0, float(analysis_sec * 0.25)))

    for cand_a in cands_a:
        for cand_b in cands_b:
            total, comps = _score_pair(cand_a, cand_b, target_a=target_a_rel, target_b=target_b)
            item = {
                "score": float(total),
                "song_a_rel_sec": float(cand_a.time_sec),
                "song_b_rel_sec": float(cand_b.time_sec),
                "song_a_vocal_ratio": float(cand_a.vocal_ratio),
                "song_b_vocal_ratio": float(cand_b.vocal_ratio),
                "song_a_vocal_onset": float(cand_a.vocal_onset),
                "song_b_vocal_onset": float(cand_b.vocal_onset),
                "song_a_vocal_phrase": float(cand_a.vocal_phrase_score),
                "song_b_vocal_phrase": float(cand_b.vocal_phrase_score),
                "song_a_drum_anchor": float(cand_a.drum_anchor),
                "song_b_drum_anchor": float(cand_b.drum_anchor),
                "song_a_bass_energy": float(cand_a.bass_energy),
                "song_b_bass_energy": float(cand_b.bass_energy),
                "song_a_bass_stability": float(cand_a.bass_stability),
                "song_b_bass_stability": float(cand_b.bass_stability),
                "song_a_density": float(cand_a.instrumental_density),
                "song_b_density": float(cand_b.instrumental_density),
                "song_a_density_score": float(cand_a.density_score),
                "song_b_density_score": float(cand_b.density_score),
                "song_a_period_vocal_phrase": float(cand_a.period_vocal_phrase_score),
                "song_b_period_vocal_phrase": float(cand_b.period_vocal_phrase_score),
                "song_a_period_drum_anchor": float(cand_a.period_drum_anchor),
                "song_b_period_drum_anchor": float(cand_b.period_drum_anchor),
                "song_a_period_bass_energy": float(cand_a.period_bass_energy),
                "song_b_period_bass_energy": float(cand_b.period_bass_energy),
                "song_a_period_bass_stability": float(cand_a.period_bass_stability),
                "song_b_period_bass_stability": float(cand_b.period_bass_stability),
                "song_a_period_density": float(cand_a.period_density_score),
                "song_b_period_density": float(cand_b.period_density_score),
                "song_a_period_coverage": float(cand_a.period_coverage),
                "song_b_period_coverage": float(cand_b.period_coverage),
                "components": comps,
            }
            scored_pairs.append(item)
            if best is None or float(total) > float(best["score"]):
                best = item

    if best is None:
        return CueSelectionResult(
            cue_a_sec=default_a_abs,
            cue_b_sec=default_b_abs,
            method="score-fallback",
            debug={"reason": "no_scored_pairs", "vocal_analysis": vocal_debug},
        )

    scored_pairs = sorted(scored_pairs, key=lambda x: float(x["score"]), reverse=True)
    top_pairs = [
        {
            "score": round(float(item["score"]), 4),
            "song_a_rel_sec": round(float(item["song_a_rel_sec"]), 3),
            "song_b_rel_sec": round(float(item["song_b_rel_sec"]), 3),
            "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
            "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
            "song_a_vocal_phrase": round(float(item["song_a_vocal_phrase"]), 4),
            "song_b_vocal_phrase": round(float(item["song_b_vocal_phrase"]), 4),
            "song_a_drum_anchor": round(float(item["song_a_drum_anchor"]), 4),
            "song_b_drum_anchor": round(float(item["song_b_drum_anchor"]), 4),
            "song_a_bass_stability": round(float(item["song_a_bass_stability"]), 4),
            "song_b_bass_stability": round(float(item["song_b_bass_stability"]), 4),
            "song_a_density_score": round(float(item["song_a_density_score"]), 4),
            "song_b_density_score": round(float(item["song_b_density_score"]), 4),
            "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
            "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
            "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
            "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
            "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
            "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
            "song_a_period_density": round(float(item["song_a_period_density"]), 4),
            "song_b_period_density": round(float(item["song_b_period_density"]), 4),
            "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
            "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
            "components": {k: round(float(v), 4) for k, v in item["components"].items()},
        }
        for item in scored_pairs[:3]
    ]

    cue_a_abs = float(a_analysis_start_sec + float(best["song_a_rel_sec"]))
    cue_b_abs = float(best["song_b_rel_sec"])
    return CueSelectionResult(
        cue_a_sec=cue_a_abs,
        cue_b_sec=cue_b_abs,
        method="scored-auto",
        debug={
            "manual_override": False,
            "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
            "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
            "candidate_windows_sec": {
                "song_a": [round(min_a, 3), round(max_a, 3)],
                "song_b": [round(min_b, 3), round(max_b, 3)],
            },
            "transition_period_sec": round(float(seam_sec), 3),
            "selected_rel_sec": {
                "song_a": round(float(best["song_a_rel_sec"]), 3),
                "song_b": round(float(best["song_b_rel_sec"]), 3),
            },
            "selected_mixability": {
                "song_a_ratio": round(float(best["song_a_vocal_ratio"]), 4),
                "song_b_ratio": round(float(best["song_b_vocal_ratio"]), 4),
                "song_a_vocal_onset": round(float(best["song_a_vocal_onset"]), 4),
                "song_b_vocal_onset": round(float(best["song_b_vocal_onset"]), 4),
                "song_a_vocal_phrase": round(float(best["song_a_vocal_phrase"]), 4),
                "song_b_vocal_phrase": round(float(best["song_b_vocal_phrase"]), 4),
                "song_a_drum_anchor": round(float(best["song_a_drum_anchor"]), 4),
                "song_b_drum_anchor": round(float(best["song_b_drum_anchor"]), 4),
                "song_a_bass_energy": round(float(best["song_a_bass_energy"]), 4),
                "song_b_bass_energy": round(float(best["song_b_bass_energy"]), 4),
                "song_a_bass_stability": round(float(best["song_a_bass_stability"]), 4),
                "song_b_bass_stability": round(float(best["song_b_bass_stability"]), 4),
                "song_a_density": round(float(best["song_a_density"]), 4),
                "song_b_density": round(float(best["song_b_density"]), 4),
                "song_a_density_score": round(float(best["song_a_density_score"]), 4),
                "song_b_density_score": round(float(best["song_b_density_score"]), 4),
                "song_a_period_vocal_phrase": round(float(best["song_a_period_vocal_phrase"]), 4),
                "song_b_period_vocal_phrase": round(float(best["song_b_period_vocal_phrase"]), 4),
                "song_a_period_drum_anchor": round(float(best["song_a_period_drum_anchor"]), 4),
                "song_b_period_drum_anchor": round(float(best["song_b_period_drum_anchor"]), 4),
                "song_a_period_bass_energy": round(float(best["song_a_period_bass_energy"]), 4),
                "song_b_period_bass_energy": round(float(best["song_b_period_bass_energy"]), 4),
                "song_a_period_bass_stability": round(float(best["song_a_period_bass_stability"]), 4),
                "song_b_period_bass_stability": round(float(best["song_b_period_bass_stability"]), 4),
                "song_a_period_density": round(float(best["song_a_period_density"]), 4),
                "song_b_period_density": round(float(best["song_b_period_density"]), 4),
                "song_a_period_coverage": round(float(best["song_a_period_coverage"]), 4),
                "song_b_period_coverage": round(float(best["song_b_period_coverage"]), 4),
            },
            "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
            "vocal_analysis": vocal_debug,
            "vocal_penalty_active": bool(vocal_profile_a is not None or vocal_profile_b is not None),
            "top_pairs": top_pairs,
            "period_scoring": {
                "enabled": True,
                "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
                "overlap_simulation": "weighted vocal/bass clash precheck",
            },
        },
    )
pipeline/transition_generator.py
ADDED
@@ -0,0 +1,1694 @@
import argparse
import hashlib
import json
import logging
import os
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple

import librosa  # type: ignore[reportMissingImports]
import numpy as np

from .audio_utils import (
    apply_edge_fades,
    clamp,
    crossfade_equal_length,
    decode_segment,
    ensure_length,
    estimate_bpm_and_beats,
    ffprobe_duration_sec,
    normalize_peak,
    resample_if_needed,
    safe_time_stretch,
    write_wav,
)
from .cuepoint_selector import select_mix_cuepoints

LOGGER = logging.getLogger(__name__)

DEFAULT_TARGET_SR = 32000
ACESTEP_INPUT_SR = 48000
STITCH_PREVIEW_SIDE_SEC = 10.0

PLUGIN_PRESETS: Dict[str, str] = {
    "Smooth Blend": "smooth seamless DJ transition, balanced energy, clean, no vocals",
    "EDM Build-up": "energetic EDM build-up transition with rising tension, clean, no vocals",
    "Percussive Bridge": "percussive bridge transition with rhythmic drums and clear groove, no vocals",
    "Ambient Wash": "ambient wash transition, spacious and atmospheric, soft energy curve, no vocals",
}

_ACESTEP_RUNTIME: Optional[Dict[str, Any]] = None
_DEMUCS_RUNTIME: Optional[Dict[str, Any]] = None
_DEMUCS_TRANSITION_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_TRANSITION", "1").strip().lower() not in {
    "0",
    "false",
    "no",
    "off",
}
_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
_DEMUCS_SEGMENT_SEC = 7.0
_REF_AUDIO_MODE = (os.getenv("AI_DJ_REFERENCE_AUDIO_MODE", "accompaniment-only") or "accompaniment-only").strip().lower()


@dataclass
class _DemucsStemBundle:
    vocals: np.ndarray
    drums: np.ndarray
    bass: np.ndarray
    other: np.ndarray
    accompaniment: np.ndarray
    sr: int
    method: str


@dataclass
class TransitionRequest:
    song_a_path: str
    song_b_path: str
    plugin_id: str = "Smooth Blend"
    instruction_text: str = ""
    pre_context_sec: float = 6.0
    repaint_width_sec: float = 4.0
    post_context_sec: float = 6.0
    analysis_sec: float = 45.0
    bpm_target: Optional[float] = None
    cue_a_sec: Optional[float] = None
    cue_b_sec: Optional[float] = None
    transition_base_mode: str = "B-base-fixed"
    transition_bars: int = 8
    creativity_strength: float = 7.0
    inference_steps: int = 8
    seed: int = 42
    output_dir: str = "outputs"
    output_stem: Optional[str] = None
    target_sr: int = DEFAULT_TARGET_SR
    keep_debug_files: bool = False

    # ACE-Step runtime config
    acestep_model_config: str = os.getenv("AI_DJ_ACESTEP_MODEL_CONFIG", "acestep-v15-turbo").strip()
    acestep_device: str = os.getenv("AI_DJ_ACESTEP_DEVICE", "auto").strip()
    acestep_project_root: str = os.getenv("AI_DJ_ACESTEP_PROJECT_ROOT", "").strip()
    acestep_prefer_source: Optional[str] = os.getenv("AI_DJ_ACESTEP_PREFER_SOURCE", "").strip() or None
    acestep_use_flash_attn: bool = False
    acestep_compile_model: bool = False
    acestep_offload_to_cpu: bool = False
    acestep_offload_dit_to_cpu: bool = False
    acestep_use_mlx_dit: bool = True
    acestep_lora_path: str = os.getenv("AI_DJ_ACESTEP_LORA_PATH", "").strip()
    acestep_lora_scale: float = float(os.getenv("AI_DJ_ACESTEP_LORA_SCALE", "1.0").strip() or "1.0")

    def to_log_dict(self) -> Dict[str, Any]:
        return asdict(self)


@dataclass
class TransitionResult:
    transition_path: str
    stitched_path: str
    rough_stitched_path: str
    hard_splice_path: str
    backend_used: str
    details: Dict[str, Any]

    def to_dict(self) -> Dict[str, Any]:
        payload = asdict(self)
        return payload
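The module-level Demucs switches above parse booleans from environment variables by treating everything outside a small deny-set as "enabled". A minimal standalone sketch of that convention (the `env_flag` helper name is mine, not part of the module):

```python
import os

def env_flag(name: str, default: str = "1") -> bool:
    # Same convention as _DEMUCS_TRANSITION_ENABLED above: only the values
    # "0", "false", "no", "off" (case-insensitive, whitespace-trimmed)
    # disable the feature; anything else, including an unset default, enables it.
    return os.getenv(name, default).strip().lower() not in {"0", "false", "no", "off"}

os.environ["DEMO_FLAG"] = "OFF"
print(env_flag("DEMO_FLAG"))        # False
print(env_flag("DEMO_FLAG_UNSET"))  # True (falls back to default "1")
```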
+
def _slug(text: str) -> str:
|
| 121 |
+
s = "".join(ch if ch.isalnum() or ch in {"-", "_"} else "_" for ch in text.strip())
|
| 122 |
+
s = "_".join(part for part in s.split("_") if part)
|
| 123 |
+
return s[:80] or "item"
|
| 124 |
+
|
| 125 |
+
|
| 126 |
+
def _deterministic_stem(request: TransitionRequest) -> str:
|
| 127 |
+
if request.output_stem:
|
| 128 |
+
return _slug(request.output_stem)
|
| 129 |
+
|
| 130 |
+
payload = {
|
| 131 |
+
"a": os.path.basename(request.song_a_path),
|
| 132 |
+
"b": os.path.basename(request.song_b_path),
|
| 133 |
+
"plugin": request.plugin_id,
|
| 134 |
+
"instruction_text": request.instruction_text,
|
| 135 |
+
"pre_context_sec": request.pre_context_sec,
|
| 136 |
+
"repaint_width_sec": request.repaint_width_sec,
|
| 137 |
+
"post_context_sec": request.post_context_sec,
|
| 138 |
+
"analysis_sec": request.analysis_sec,
|
| 139 |
+
"bpm_target": request.bpm_target,
|
| 140 |
+
"cue_a_sec": request.cue_a_sec,
|
| 141 |
+
"cue_b_sec": request.cue_b_sec,
|
| 142 |
+
"transition_base_mode": request.transition_base_mode,
|
| 143 |
+
"transition_bars": request.transition_bars,
|
| 144 |
+
"creativity_strength": request.creativity_strength,
|
| 145 |
+
"inference_steps": request.inference_steps,
|
| 146 |
+
"seed": request.seed,
|
| 147 |
+
"target_sr": request.target_sr,
|
| 148 |
+
"acestep_model_config": request.acestep_model_config,
|
| 149 |
+
"demucs_transition_enabled": _DEMUCS_TRANSITION_ENABLED,
|
| 150 |
+
"demucs_model": _DEMUCS_MODEL_NAME,
|
| 151 |
+
"reference_audio_mode": _REF_AUDIO_MODE,
|
| 152 |
+
}
|
| 153 |
+
raw = json.dumps(payload, sort_keys=True).encode("utf-8")
|
| 154 |
+
digest = hashlib.sha1(raw).hexdigest()[:10]
|
| 155 |
+
return f"transition_{_slug(Path(request.song_a_path).stem)}_to_{_slug(Path(request.song_b_path).stem)}_{digest}"
|
| 156 |
+
|
| 157 |
+
|
| 158 |
+
def _resolve_output_paths(request: TransitionRequest) -> Tuple[str, str, str, str, str]:
|
| 159 |
+
os.makedirs(request.output_dir, exist_ok=True)
|
| 160 |
+
stem = _deterministic_stem(request)
|
| 161 |
+
transition_path = os.path.join(request.output_dir, f"{stem}_transition.wav")
|
| 162 |
+
stitched_path = os.path.join(request.output_dir, f"{stem}_stitched.wav")
|
| 163 |
+
rough_stitched_path = os.path.join(request.output_dir, f"{stem}_rough_stitched.wav")
|
| 164 |
+
hard_splice_path = os.path.join(request.output_dir, f"{stem}_hard_splice.wav")
|
| 165 |
+
rough_src_path = os.path.join(request.output_dir, f"{stem}_rough_src.wav")
|
| 166 |
+
return transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path
|
| 167 |
+
|
| 168 |
+
|
| 169 |
+
def _resolve_acestep_project_root(request: TransitionRequest) -> str:
|
| 170 |
+
if request.acestep_project_root:
|
| 171 |
+
os.makedirs(request.acestep_project_root, exist_ok=True)
|
| 172 |
+
return request.acestep_project_root
|
| 173 |
+
|
| 174 |
+
hf_data = "/data"
|
| 175 |
+
if os.path.isdir(hf_data) and os.access(hf_data, os.W_OK):
|
| 176 |
+
root = os.path.join(hf_data, "acestep_runtime")
|
| 177 |
+
os.makedirs(root, exist_ok=True)
|
| 178 |
+
return root
|
| 179 |
+
|
| 180 |
+
root = os.path.join(os.path.dirname(os.path.dirname(__file__)), ".acestep_runtime")
|
| 181 |
+
os.makedirs(root, exist_ok=True)
|
| 182 |
+
return root
|
| 183 |
+
|
| 184 |
+
|
| 185 |
+
def _resolve_lora_path(lora_spec: str, project_root: str) -> str:
|
| 186 |
+
spec = (lora_spec or "").strip()
|
| 187 |
+
if not spec:
|
| 188 |
+
return ""
|
| 189 |
+
if os.path.exists(spec):
|
| 190 |
+
return os.path.abspath(spec)
|
| 191 |
+
# Treat non-local spec as a Hugging Face repo id, e.g. ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA
|
| 192 |
+
if "/" not in spec:
|
| 193 |
+
raise RuntimeError(
|
| 194 |
+
f"LoRA path not found: {spec}. Provide a local path or a Hugging Face repo id like "
|
| 195 |
+
"ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA."
|
| 196 |
+
)
|
| 197 |
+
try:
|
| 198 |
+
from huggingface_hub import snapshot_download
|
| 199 |
+
except Exception as exc:
|
| 200 |
+
raise RuntimeError(
|
| 201 |
+
"huggingface_hub is required to download LoRA from repo id. Install with: pip install huggingface_hub"
|
| 202 |
+
) from exc
|
| 203 |
+
|
| 204 |
+
local_dir = os.path.join(project_root, "lora_cache", _slug(spec))
|
| 205 |
+
os.makedirs(local_dir, exist_ok=True)
|
| 206 |
+
return snapshot_download(
|
| 207 |
+
repo_id=spec,
|
| 208 |
+
local_dir=local_dir,
|
| 209 |
+
local_dir_use_symlinks=False,
|
| 210 |
+
)
|
| 211 |
+
|
| 212 |
+
|
| 213 |
+
def _build_caption(plugin_id: str, instruction_text: str) -> str:
|
| 214 |
+
base = PLUGIN_PRESETS.get(plugin_id, PLUGIN_PRESETS["Smooth Blend"])
|
| 215 |
+
extra = (instruction_text or "").strip()
|
| 216 |
+
if not extra:
|
| 217 |
+
return base
|
| 218 |
+
return f"{base}. Additional instruction: {extra}"
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
def _resolve_half_double_tempo(bpm_ref: float, bpm_candidate: float) -> float:
|
| 222 |
+
candidates = [0.5 * bpm_candidate, bpm_candidate, 2.0 * bpm_candidate]
|
| 223 |
+
valid = [v for v in candidates if 40.0 <= float(v) <= 240.0]
|
| 224 |
+
if not valid:
|
| 225 |
+
return float(bpm_candidate)
|
| 226 |
+
return float(min(valid, key=lambda x: abs(np.log2(max(1e-6, bpm_ref) / max(1e-6, x)))))
|
| 227 |
+
|
| 228 |
+
|
| 229 |
+
def _normalized_onset_envelope(y: np.ndarray, sr: int, hop_length: int = 512) -> np.ndarray:
|
| 230 |
+
if y.size <= 0:
|
| 231 |
+
return np.zeros((1,), dtype=np.float32)
|
| 232 |
+
onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length).astype(np.float32)
|
| 233 |
+
if onset.size == 0:
|
| 234 |
+
return np.zeros((1,), dtype=np.float32)
|
| 235 |
+
onset = onset - float(np.mean(onset))
|
| 236 |
+
maximum = float(np.max(np.abs(onset)))
|
| 237 |
+
if maximum > 1e-9:
|
| 238 |
+
onset = onset / maximum
|
| 239 |
+
return onset.astype(np.float32)
|
| 240 |
+
|
| 241 |
+
|
| 242 |
+
def _corr_similarity(a: np.ndarray, b: np.ndarray) -> float:
|
| 243 |
+
n = min(a.size, b.size)
|
| 244 |
+
if n <= 3:
|
| 245 |
+
return 0.0
|
| 246 |
+
a2 = a[:n].astype(np.float32)
|
| 247 |
+
b2 = b[:n].astype(np.float32)
|
| 248 |
+
denom = float(np.linalg.norm(a2) * np.linalg.norm(b2))
|
| 249 |
+
if denom <= 1e-9:
|
| 250 |
+
return 0.0
|
| 251 |
+
raw = float(np.dot(a2, b2) / denom)
|
| 252 |
+
return clamp((raw + 1.0) * 0.5, 0.0, 1.0)
|
| 253 |
+
|
| 254 |
+
|
| 255 |
+
def _rms(y: np.ndarray) -> float:
|
| 256 |
+
if y.size == 0:
|
| 257 |
+
return 0.0
|
| 258 |
+
return float(np.sqrt(np.mean(np.square(y, dtype=np.float64))))
|
| 259 |
+
|
| 260 |
+
|
| 261 |
+
def _resolve_demucs_device(torch_mod: Any) -> str:
|
| 262 |
+
pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
|
| 263 |
+
if pref == "cpu":
|
| 264 |
+
return "cpu"
|
| 265 |
+
if pref in {"cuda", "gpu"}:
|
| 266 |
+
return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
|
| 267 |
+
return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
|
| 268 |
+
|
| 269 |
+
|
| 270 |
+
def _load_demucs_runtime() -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
|
| 271 |
+
global _DEMUCS_RUNTIME
|
| 272 |
+
if not _DEMUCS_TRANSITION_ENABLED:
|
| 273 |
+
return None, {"enabled": False, "status": "disabled", "reason": "AI_DJ_ENABLE_DEMUCS_TRANSITION=0"}
|
| 274 |
+
if _DEMUCS_RUNTIME is not None:
|
| 275 |
+
return _DEMUCS_RUNTIME, {
|
| 276 |
+
"enabled": True,
|
| 277 |
+
"status": "ready",
|
| 278 |
+
"model": _DEMUCS_RUNTIME.get("model_name"),
|
| 279 |
+
"device": _DEMUCS_RUNTIME.get("device"),
|
| 280 |
+
}
|
| 281 |
+
|
| 282 |
+
try:
|
| 283 |
+
import torch # type: ignore[reportMissingImports]
|
| 284 |
+
from demucs.pretrained import get_model # type: ignore[reportMissingImports]
|
| 285 |
+
|
| 286 |
+
model = get_model(_DEMUCS_MODEL_NAME)
|
| 287 |
+
model.eval()
|
| 288 |
+
device = _resolve_demucs_device(torch)
|
| 289 |
+
model.to(device)
|
| 290 |
+
_DEMUCS_RUNTIME = {
|
| 291 |
+
"model": model,
|
| 292 |
+
"torch": torch,
|
| 293 |
+
"device": device,
|
| 294 |
+
"model_name": _DEMUCS_MODEL_NAME,
|
| 295 |
+
}
|
| 296 |
+
return _DEMUCS_RUNTIME, {
|
| 297 |
+
"enabled": True,
|
| 298 |
+
"status": "ready",
|
| 299 |
+
"model": _DEMUCS_MODEL_NAME,
|
| 300 |
+
"device": device,
|
| 301 |
+
}
|
| 302 |
+
except Exception as exc:
|
| 303 |
+
LOGGER.warning("Demucs transition runtime unavailable (%s). Falling back to non-stem transition path.", exc)
|
| 304 |
+
return None, {
|
| 305 |
+
"enabled": True,
|
| 306 |
+
"status": "unavailable",
|
| 307 |
+
"model": _DEMUCS_MODEL_NAME,
|
| 308 |
+
"reason": str(exc),
|
| 309 |
+
}
|
| 310 |
+
|
| 311 |
+
|
| 312 |
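The onset-envelope comparison above reduces to a cosine correlation remapped from [-1, 1] into a [0, 1] similarity score, so perfectly aligned rhythmic envelopes score near 1 and anti-phase envelopes score near 0. A self-contained sketch of that scoring step (inlining the clamp that `_corr_similarity` gets from `audio_utils`):

```python
import numpy as np

def corr_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine correlation of the two (truncated) envelopes, remapped linearly
    # from [-1, 1] to [0, 1], mirroring _corr_similarity above.
    n = min(a.size, b.size)
    if n <= 3:
        return 0.0
    a2, b2 = a[:n].astype(np.float32), b[:n].astype(np.float32)
    denom = float(np.linalg.norm(a2) * np.linalg.norm(b2))
    if denom <= 1e-9:
        return 0.0
    raw = float(np.dot(a2, b2) / denom)
    return min(1.0, max(0.0, (raw + 1.0) * 0.5))

x = np.array([0.0, 1.0, 0.0, 1.0, 0.0], dtype=np.float32)
print(corr_similarity(x, x))   # ~1.0: identical envelopes
print(corr_similarity(x, -x))  # ~0.0: perfectly anti-correlated
```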
+
def _resample_to(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
|
| 313 |
+
if int(orig_sr) == int(target_sr):
|
| 314 |
+
return y.astype(np.float32)
|
| 315 |
+
if y.size == 0:
|
| 316 |
+
return np.zeros((0,), dtype=np.float32)
|
| 317 |
+
return librosa.resample(y.astype(np.float32), orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)
|
| 318 |
+
|
| 319 |
+
|
| 320 |
+
def _extract_demucs_stems(y: np.ndarray, sr: int, track_label: str) -> Tuple[Optional[_DemucsStemBundle], Dict[str, Any]]:
|
| 321 |
+
info: Dict[str, Any] = {
|
| 322 |
+
"enabled": bool(_DEMUCS_TRANSITION_ENABLED),
|
| 323 |
+
"track": track_label,
|
| 324 |
+
"model": _DEMUCS_MODEL_NAME,
|
| 325 |
+
}
|
| 326 |
+
if y.size < int(max(1, sr) * 2.0):
|
| 327 |
+
info["status"] = "skipped-short-audio"
|
| 328 |
+
return None, info
|
| 329 |
+
|
| 330 |
+
runtime, runtime_debug = _load_demucs_runtime()
|
| 331 |
+
info.update(runtime_debug)
|
| 332 |
+
if runtime is None:
|
| 333 |
+
return None, info
|
| 334 |
+
|
| 335 |
+
try:
|
| 336 |
+
from demucs.apply import apply_model # type: ignore[reportMissingImports]
|
| 337 |
+
|
| 338 |
+
torch_mod = runtime["torch"]
|
| 339 |
+
model = runtime["model"]
|
| 340 |
+
device = str(runtime.get("device", "cpu"))
|
| 341 |
+
|
| 342 |
+
mono = np.asarray(y, dtype=np.float32).reshape(-1)
|
| 343 |
+
if mono.size == 0:
|
| 344 |
+
info["status"] = "empty"
|
| 345 |
+
return None, info
|
| 346 |
+
peak = float(np.max(np.abs(mono)))
|
| 347 |
+
if peak > 1e-9:
|
| 348 |
+
mono = mono / peak
|
| 349 |
+
|
| 350 |
+
demucs_sr = int(getattr(model, "samplerate", 44100))
|
| 351 |
+
work = _resample_to(mono, int(sr), demucs_sr)
|
| 352 |
+
if work.size < int(max(1, demucs_sr) * 2.0):
|
| 353 |
+
info["status"] = "skipped-short-audio"
|
| 354 |
+
return None, info
|
| 355 |
+
|
| 356 |
+
stereo = np.stack([work, work], axis=0)
|
| 357 |
+
mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(device)
|
| 358 |
+
audio_sec = float(work.size / max(1, demucs_sr))
|
| 359 |
+
use_split = audio_sec > (_DEMUCS_SEGMENT_SEC + 0.05)
|
| 360 |
+
segment_sec = float(_DEMUCS_SEGMENT_SEC) if use_split else None
|
| 361 |
+
|
| 362 |
+
try:
|
| 363 |
+
with torch_mod.no_grad():
|
| 364 |
+
estimates = apply_model(
|
| 365 |
+
model,
|
| 366 |
+
mix,
|
| 367 |
+
shifts=1,
|
| 368 |
+
split=use_split,
|
| 369 |
+
overlap=0.25,
|
| 370 |
+
progress=False,
|
| 371 |
+
device=device,
|
| 372 |
+
segment=segment_sec,
|
| 373 |
+
)
|
| 374 |
+
except Exception as exc:
|
| 375 |
+
if device == "cuda":
|
| 376 |
+
model.to("cpu")
|
| 377 |
+
runtime["device"] = "cpu"
|
| 378 |
+
device = "cpu"
|
| 379 |
+
mix = mix.to("cpu")
|
| 380 |
+
with torch_mod.no_grad():
|
| 381 |
+
estimates = apply_model(
|
| 382 |
+
model,
|
| 383 |
+
mix,
|
| 384 |
+
shifts=1,
|
| 385 |
+
split=use_split,
|
| 386 |
+
overlap=0.25,
|
| 387 |
+
progress=False,
|
| 388 |
+
device="cpu",
|
| 389 |
+
segment=segment_sec,
|
| 390 |
+
)
|
| 391 |
+
info["device_fallback"] = f"cuda->cpu ({exc})"
|
| 392 |
+
else:
|
| 393 |
+
raise
|
| 394 |
+
|
| 395 |
+
est = estimates.detach().cpu()
|
| 396 |
+
est = est[0] if est.ndim == 4 else est
|
| 397 |
+
if est.ndim != 3:
|
| 398 |
+
raise RuntimeError(f"Unexpected demucs output shape: {tuple(est.shape)}")
|
| 399 |
+
|
| 400 |
+
source_names = [str(s) for s in getattr(model, "sources", [])]
|
| 401 |
+
if not source_names:
|
| 402 |
+
raise RuntimeError("Demucs returned no source names.")
|
| 403 |
+
if est.shape[0] != len(source_names):
|
| 404 |
+
if est.shape[1] == len(source_names):
|
| 405 |
+
est = est.permute(1, 0, 2)
|
| 406 |
+
else:
|
| 407 |
+
raise RuntimeError(f"Demucs source mismatch: shape {tuple(est.shape)}, sources {source_names}")
|
| 408 |
+
|
| 409 |
+
def _stem(name: str) -> np.ndarray:
|
| 410 |
+
if name in source_names:
|
| 411 |
+
stem = est[source_names.index(name)].mean(dim=0).numpy().astype(np.float32)
|
| 412 |
+
return _resample_to(stem, demucs_sr, int(sr))
|
| 413 |
+
return np.zeros((mono.size,), dtype=np.float32)
|
| 414 |
+
|
| 415 |
+
vocals = _stem("vocals")
|
| 416 |
+
drums = _stem("drums")
|
| 417 |
+
bass = _stem("bass")
|
| 418 |
+
other = _stem("other")
|
| 419 |
+
non_vocal_idxs = [i for i, s in enumerate(source_names) if s != "vocals"]
|
| 420 |
+
if non_vocal_idxs:
|
| 421 |
+
acc = est[non_vocal_idxs].sum(dim=0).mean(dim=0).numpy().astype(np.float32)
|
| 422 |
+
accompaniment = _resample_to(acc, demucs_sr, int(sr))
|
| 423 |
+
else:
|
| 424 |
+
accompaniment = np.zeros((mono.size,), dtype=np.float32)
|
| 425 |
+
|
| 426 |
+
target_n = int(mono.size)
|
| 427 |
+
vocals = ensure_length(vocals, target_n)
|
| 428 |
+
drums = ensure_length(drums, target_n)
|
| 429 |
+
bass = ensure_length(bass, target_n)
|
| 430 |
+
other = ensure_length(other, target_n)
|
| 431 |
+
accompaniment = ensure_length(accompaniment, target_n)
|
| 432 |
+
|
| 433 |
+
info.update(
|
| 434 |
+
{
|
| 435 |
+
"status": "ready",
|
| 436 |
+
"method": "demucs-transition-stems",
|
| 437 |
+
"split_mode": "chunked" if use_split else "full-window",
|
| 438 |
+
"duration_sec": round(float(target_n / max(1, sr)), 3),
|
| 439 |
+
"has_drums": bool("drums" in source_names),
|
| 440 |
+
"has_bass": bool("bass" in source_names),
|
| 441 |
+
"has_other": bool("other" in source_names),
|
| 442 |
+
"device": runtime.get("device", device),
|
| 443 |
+
}
|
| 444 |
+
)
|
| 445 |
+
return _DemucsStemBundle(
|
| 446 |
+
vocals=vocals.astype(np.float32),
|
| 447 |
+
drums=drums.astype(np.float32),
|
| 448 |
+
bass=bass.astype(np.float32),
|
| 449 |
+
other=other.astype(np.float32),
|
| 450 |
+
accompaniment=accompaniment.astype(np.float32),
|
| 451 |
+
sr=int(sr),
|
| 452 |
+
method="demucs-transition-stems",
|
| 453 |
+
), info
|
| 454 |
+
except Exception as exc:
|
| 455 |
+
LOGGER.warning("Demucs stem extraction failed for %s (%s).", track_label, exc)
|
| 456 |
+
info["status"] = "error"
|
| 457 |
+
info["reason"] = str(exc)
|
| 458 |
+
return None, info
|
| 459 |
+
|
| 460 |
+
|
| 461 |
+
def _slice_stem_bundle(bundle: Optional[_DemucsStemBundle], start_n: int, length_n: int) -> Optional[_DemucsStemBundle]:
|
| 462 |
+
if bundle is None:
|
| 463 |
+
return None
|
| 464 |
+
s = int(max(0, start_n))
|
| 465 |
+
n = int(max(0, length_n))
|
| 466 |
+
e = s + n
|
| 467 |
+
return _DemucsStemBundle(
|
| 468 |
+
vocals=ensure_length(bundle.vocals[s:e], n),
|
| 469 |
+
drums=ensure_length(bundle.drums[s:e], n),
|
| 470 |
+
bass=ensure_length(bundle.bass[s:e], n),
|
| 471 |
+
other=ensure_length(bundle.other[s:e], n),
|
| 472 |
+
accompaniment=ensure_length(bundle.accompaniment[s:e], n),
|
| 473 |
+
sr=int(bundle.sr),
|
| 474 |
+
method=bundle.method,
|
| 475 |
+
)
|
| 476 |
+
|
| 477 |
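The stem slicing above relies on `ensure_length` (imported from `pipeline.audio_utils`, which is not shown in this chunk) so that every stem slice comes back at exactly the requested sample count, even when the slice runs past the end of the buffer. A sketch of the assumed pad-or-truncate behavior, under the assumption that `ensure_length` zero-pads short arrays and truncates long ones:

```python
import numpy as np

def ensure_length(y: np.ndarray, n: int) -> np.ndarray:
    # Assumed behavior of audio_utils.ensure_length as used by
    # _slice_stem_bundle above: truncate if too long, zero-pad if too short,
    # always returning exactly n float32 samples.
    y = np.asarray(y, dtype=np.float32).reshape(-1)
    if y.size >= n:
        return y[:n]
    return np.concatenate([y, np.zeros(n - y.size, dtype=np.float32)])

seg = np.ones(3, dtype=np.float32)
print(ensure_length(seg, 5))  # [1. 1. 1. 0. 0.]
print(ensure_length(seg, 2))  # [1. 1.]
```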
+
|
| 478 |
+
def _seconds_to_beats(seconds: float, bpm: float) -> float:
|
| 479 |
+
return float(seconds) * (float(bpm) / 60.0)
|
| 480 |
+
|
| 481 |
+
|
| 482 |
+
def _beats_to_seconds(beats: float, bpm: float) -> float:
|
| 483 |
+
return float(beats) * (60.0 / max(1e-6, float(bpm)))
|
| 484 |
+
|
| 485 |
+
|
| 486 |
+
def _quantize_seconds_to_beats(
|
| 487 |
+
raw_sec: float,
|
| 488 |
+
bpm: float,
|
| 489 |
+
min_sec: float,
|
| 490 |
+
max_sec: float,
|
| 491 |
+
beat_step: int,
|
| 492 |
+
min_beats: int,
|
| 493 |
+
) -> Tuple[float, int, float]:
|
| 494 |
+
raw_sec = float(clamp(raw_sec, min_sec, max_sec))
|
| 495 |
+
if bpm <= 1e-6:
|
| 496 |
+
return raw_sec, int(round(_seconds_to_beats(raw_sec, 120.0))), _seconds_to_beats(raw_sec, 120.0)
|
| 497 |
+
|
| 498 |
+
raw_beats = _seconds_to_beats(raw_sec, bpm)
|
| 499 |
+
step = max(1, int(beat_step))
|
| 500 |
+
min_beats_i = max(1, int(min_beats))
|
| 501 |
+
max_allowed_beats = _seconds_to_beats(max_sec, bpm)
|
| 502 |
+
max_beats_i = int(max(min_beats_i, np.floor(max_allowed_beats / step) * step))
|
| 503 |
+
quant_beats = int(round(raw_beats / step) * step)
|
| 504 |
+
quant_beats = int(clamp(float(quant_beats), float(min_beats_i), float(max_beats_i)))
|
| 505 |
+
quant_sec = float(clamp(_beats_to_seconds(quant_beats, bpm), min_sec, max_sec))
|
| 506 |
+
return quant_sec, quant_beats, raw_beats
|
| 507 |
+
|
| 508 |
+
|
| 509 |
+
def _phrase_lock_transition_shape(pre_sec: float, seam_sec: float, post_sec: float, bpm: float) -> Dict[str, Any]:
    pre_locked_sec, pre_beats, pre_raw_beats = _quantize_seconds_to_beats(
        raw_sec=pre_sec,
        bpm=bpm,
        min_sec=1.0,
        max_sec=20.0,
        beat_step=4,
        min_beats=2,
    )

    seam_raw_beats = _seconds_to_beats(seam_sec, bpm)
    seam_step = 8 if seam_raw_beats >= 8.0 else 4
    seam_locked_sec, seam_beats, _ = _quantize_seconds_to_beats(
        raw_sec=seam_sec,
        bpm=bpm,
        min_sec=1.0,
        max_sec=40.0,
        beat_step=seam_step,
        min_beats=2,
    )

    post_locked_sec, post_beats, post_raw_beats = _quantize_seconds_to_beats(
        raw_sec=post_sec,
        bpm=bpm,
        min_sec=1.0,
        max_sec=20.0,
        beat_step=4,
        min_beats=2,
    )

    return {
        "pre_sec": pre_locked_sec,
        "seam_sec": seam_locked_sec,
        "post_sec": post_locked_sec,
        "debug": {
            "bpm_ref": round(float(bpm), 3),
            "pre": {
                "raw_sec": round(float(pre_sec), 3),
                "locked_sec": round(float(pre_locked_sec), 3),
                "raw_beats": round(float(pre_raw_beats), 3),
                "locked_beats": int(pre_beats),
                "beat_step": 4,
            },
            "seam": {
                "raw_sec": round(float(seam_sec), 3),
                "locked_sec": round(float(seam_locked_sec), 3),
                "raw_beats": round(float(seam_raw_beats), 3),
                "locked_beats": int(seam_beats),
                "beat_step": int(seam_step),
            },
            "post": {
                "raw_sec": round(float(post_sec), 3),
                "locked_sec": round(float(post_locked_sec), 3),
                "raw_beats": round(float(post_raw_beats), 3),
                "locked_beats": int(post_beats),
                "beat_step": 4,
            },
        },
    }


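As a standalone sanity check of the quantization math above (illustrative sketch only; `clamp` and the beat/second helpers are minimal reimplementations, not the module's own):

```python
import numpy as np

def clamp(v, lo, hi):
    # Minimal stand-in for the module's clamp helper.
    return max(lo, min(hi, v))

def seconds_to_beats(sec, bpm):
    return sec * bpm / 60.0

def beats_to_seconds(beats, bpm):
    return beats * 60.0 / bpm

def quantize_seconds_to_beats(raw_sec, bpm, min_sec, max_sec, beat_step, min_beats):
    # Mirrors the logic above: snap a duration to the nearest multiple of
    # beat_step beats, then clamp to the allowed beat and second ranges.
    raw_beats = seconds_to_beats(raw_sec, bpm)
    step = max(1, int(beat_step))
    min_beats_i = max(1, int(min_beats))
    max_allowed_beats = seconds_to_beats(max_sec, bpm)
    max_beats_i = int(max(min_beats_i, np.floor(max_allowed_beats / step) * step))
    quant_beats = int(round(raw_beats / step) * step)
    quant_beats = int(clamp(float(quant_beats), float(min_beats_i), float(max_beats_i)))
    quant_sec = float(clamp(beats_to_seconds(quant_beats, bpm), min_sec, max_sec))
    return quant_sec, quant_beats, raw_beats

# At 120 BPM, a 7.3 s pre-context is ~14.6 beats; on a 4-beat grid it
# snaps to 16 beats = 8.0 s.
sec, beats, raw = quantize_seconds_to_beats(7.3, 120.0, 1.0, 20.0, 4, 2)
print(sec, beats)  # 8.0 16
```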
def _stft_band_split(y: np.ndarray, sr: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
    n = int(y.size)
    if n <= 0:
        z = np.zeros((0,), dtype=np.float32)
        return z, z, z

    n_fft = 2048 if n >= 2048 else 1024
    hop = max(128, n_fft // 4)
    y2 = ensure_length(y.astype(np.float32), max(n, n_fft))

    D = librosa.stft(y2, n_fft=n_fft, hop_length=hop)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    low_mask = (freqs <= 180.0).astype(np.float32)[:, None]
    mid_mask = ((freqs > 180.0) & (freqs <= 2500.0)).astype(np.float32)[:, None]
    high_mask = (freqs > 2500.0).astype(np.float32)[:, None]

    low = librosa.istft(D * low_mask, hop_length=hop, length=y2.size).astype(np.float32)
    mid = librosa.istft(D * mid_mask, hop_length=hop, length=y2.size).astype(np.float32)
    high = librosa.istft(D * high_mask, hop_length=hop, length=y2.size).astype(np.float32)
    return low[:n], mid[:n], high[:n]


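The three masks above partition the FFT bins, so the band ISTFTs sum back to the input. A numpy-only sketch of that property (hypothetical 2048-point FFT at 44.1 kHz; `np.fft.rfftfreq` gives the same bin grid that `librosa.fft_frequencies` would):

```python
import numpy as np

# Hypothetical STFT configuration mirroring the band edges above.
sr, n_fft = 44100, 2048
freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)  # 0 .. sr/2 bin frequencies

low = freqs <= 180.0
mid = (freqs > 180.0) & (freqs <= 2500.0)
high = freqs > 2500.0

# Every bin falls in exactly one band, so low + mid + high reconstructs y.
coverage = low.astype(int) + mid.astype(int) + high.astype(int)
print(bool(np.all(coverage == 1)))  # True
```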
def _dj_style_seam_mix(a_tail: np.ndarray, b_head: np.ndarray, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
    n = min(int(a_tail.size), int(b_head.size))
    if n <= 0:
        return np.zeros((0,), dtype=np.float32), {"method": "empty-input-fallback"}

    a = a_tail[:n].astype(np.float32)
    b = b_head[:n].astype(np.float32)
    try:
        a_low, a_mid, a_high = _stft_band_split(a, sr=sr)
        b_low, b_mid, b_high = _stft_band_split(b, sr=sr)
    except Exception as exc:
        LOGGER.warning("Band-split seam mixing failed (%s); using equal crossfade.", exc)
        return crossfade_equal_length(a, b), {"method": "crossfade-fallback", "error": str(exc)}

    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    high_in = x
    mid_in = np.power(x, 1.15).astype(np.float32)
    # Delay low-end handoff so kick/bass do not collide early.
    low_in = np.clip((x - 0.58) / 0.30, 0.0, 1.0).astype(np.float32)

    seam = (
        (a_high * (1.0 - high_in))
        + (b_high * high_in)
        + (a_mid * (1.0 - mid_in))
        + (b_mid * mid_in)
        + (a_low * (1.0 - low_in))
        + (b_low * low_in)
    ).astype(np.float32)

    return seam, {
        "method": "dj-eq-bass-swap",
        "low_handoff": {"start_ratio": 0.58, "end_ratio": 0.88},
        "bands_hz": {"low_max": 180, "mid_max": 2500},
    }


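The band crossfade above hands over the highs linearly, the mids slightly later, and holds the incoming low band back until 58% through the seam. A numpy-only sketch of just the gain curves (no STFT; values mirror the constants above), showing the two basslines never play at full level simultaneously:

```python
import numpy as np

n = 1000
x = np.linspace(0.0, 1.0, n, dtype=np.float32)

high_in = x                                    # highs: plain linear handoff
mid_in = np.power(x, 1.15)                     # mids: slightly delayed
low_in = np.clip((x - 0.58) / 0.30, 0.0, 1.0)  # lows: held back, then quick swap

# Incoming bass is fully muted at the halfway point of the seam...
print(float(low_in[n // 2]))  # 0.0
# ...and fully handed over well before the end (1.0 from ~88% onward).
print(float(low_in[-1]))      # 1.0
```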
def _build_theme_reference_audio(
    a_pre: np.ndarray,
    a_tail: np.ndarray,
    b_head: np.ndarray,
    b_post: np.ndarray,
    sr: int,
) -> Tuple[np.ndarray, Dict[str, Any]]:
    a_ctx = np.concatenate([a_pre, a_tail]).astype(np.float32)
    b_ctx = np.concatenate([b_head, b_post]).astype(np.float32)

    a_take_n = min(a_ctx.size, int(round(12.0 * sr)))
    b_take_n = min(b_ctx.size, int(round(12.0 * sr)))
    if a_take_n <= 0 or b_take_n <= 0:
        return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "insufficient_context"}

    a_seg = a_ctx[-a_take_n:]
    b_seg = b_ctx[:b_take_n]
    overlap_n = min(int(round(0.45 * sr)), a_seg.size // 4, b_seg.size // 4)
    if overlap_n > 0:
        seam = crossfade_equal_length(a_seg[-overlap_n:], b_seg[:overlap_n])
        ref = np.concatenate([a_seg[:-overlap_n], seam, b_seg[overlap_n:]]).astype(np.float32)
    else:
        ref = np.concatenate([a_seg, b_seg]).astype(np.float32)

    ref = normalize_peak(apply_edge_fades(ref, sr=sr, fade_ms=20.0), peak=0.98)
    return ref, {
        "enabled": True,
        "method": "a-tail-b-head-theme-ref",
        "duration_sec": round(float(ref.size / max(1, sr)), 3),
        "segments_sec": {
            "song_a": round(float(a_seg.size / max(1, sr)), 3),
            "song_b": round(float(b_seg.size / max(1, sr)), 3),
            "overlap": round(float(overlap_n / max(1, sr)), 3),
        },
    }


def _left_pad_to_length(y: np.ndarray, target_n: int) -> np.ndarray:
    target_n = int(max(0, target_n))
    if y.size >= target_n:
        return y[-target_n:].astype(np.float32)
    return np.pad(y.astype(np.float32), (target_n - y.size, 0), mode="constant")


def _crossfade_join(a: np.ndarray, b: np.ndarray, fade_n: int) -> np.ndarray:
    if a.size <= 0:
        return b.astype(np.float32)
    if b.size <= 0:
        return a.astype(np.float32)
    n = int(max(0, fade_n))
    n = min(n, int(a.size), int(b.size))
    if n <= 0:
        return np.concatenate([a, b]).astype(np.float32)
    seam = crossfade_equal_length(a[-n:], b[:n])
    return np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)


def _build_period_reference_audio(period: np.ndarray, sr: int, source_mode: str = "full-period-a") -> Tuple[np.ndarray, Dict[str, Any]]:
    if period.size <= 0:
        return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-reference-period"}
    ref = normalize_peak(apply_edge_fades(period.astype(np.float32), sr=sr, fade_ms=20.0), peak=0.98)
    return ref, {
        "enabled": True,
        "method": "opposite-transition-period-reference",
        "source_mode": str(source_mode),
        "duration_sec": round(float(ref.size / max(1, sr)), 3),
    }


def _apply_transition_low_duck(
    y: np.ndarray,
    sr: int,
    duck_floor: float = 0.14,
    fade_out_end: float = 0.42,
    fade_in_start: float = 0.72,
) -> Tuple[np.ndarray, Dict[str, Any]]:
    n = int(y.size)
    if n <= 0:
        return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-audio"}

    try:
        low, mid, high = _stft_band_split(y.astype(np.float32), sr=sr)
    except Exception as exc:
        LOGGER.warning("Low-duck split failed (%s); skip ducking.", exc)
        return y.astype(np.float32), {"enabled": False, "reason": "split-failed", "error": str(exc)}

    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    out_end = float(clamp(fade_out_end, 0.1, 0.9))
    in_start = float(clamp(max(out_end + 0.05, fade_in_start), 0.15, 0.95))
    floor = float(clamp(duck_floor, 0.03, 0.5))

    low_gain = np.full((n,), floor, dtype=np.float32)
    entry_mask = x <= out_end
    if np.any(entry_mask):
        low_gain[entry_mask] = (1.0 - ((x[entry_mask] / max(1e-6, out_end)) * (1.0 - floor))).astype(np.float32)
    exit_mask = x >= in_start
    if np.any(exit_mask):
        ramp = (x[exit_mask] - in_start) / max(1e-6, (1.0 - in_start))
        low_gain[exit_mask] = (floor + (ramp * (1.0 - floor))).astype(np.float32)

    y_out = (low * low_gain) + mid + high
    y_out = y_out.astype(np.float32)
    return y_out, {
        "enabled": True,
        "method": "low-duck-center",
        "duck_floor": round(float(floor), 4),
        "fade_out_end_ratio": round(float(out_end), 4),
        "fade_in_start_ratio": round(float(in_start), 4),
    }


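The low-band gain envelope above ramps from 1.0 down to the duck floor over the first 42% of the transition, holds the floor through the center, and ramps back to 1.0 after the 72% mark. A self-contained sketch of just that envelope (the module's `clamp` is reimplemented for illustration):

```python
import numpy as np

def clamp(v, lo, hi):
    # Minimal stand-in for the module's clamp helper.
    return max(lo, min(hi, v))

def low_gain_envelope(n, duck_floor=0.14, fade_out_end=0.42, fade_in_start=0.72):
    # Mirrors the ducking curve above: ramp lows down to a floor, hold it
    # through the middle of the transition, then ramp back up at the exit.
    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    out_end = float(clamp(fade_out_end, 0.1, 0.9))
    in_start = float(clamp(max(out_end + 0.05, fade_in_start), 0.15, 0.95))
    floor = float(clamp(duck_floor, 0.03, 0.5))
    gain = np.full((n,), floor, dtype=np.float32)
    entry = x <= out_end
    gain[entry] = 1.0 - (x[entry] / max(1e-6, out_end)) * (1.0 - floor)
    exit_ = x >= in_start
    ramp = (x[exit_] - in_start) / max(1e-6, 1.0 - in_start)
    gain[exit_] = floor + ramp * (1.0 - floor)
    return gain

# Full level at the edges, ducked to ~0.14 at the midpoint.
g = low_gain_envelope(1001)
```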
def _build_one_bassline_stem_period(
    period_a: np.ndarray,
    period_b: np.ndarray,
    stems_a: Optional[_DemucsStemBundle],
    stems_b: Optional[_DemucsStemBundle],
) -> Tuple[Optional[np.ndarray], Dict[str, Any]]:
    if stems_a is None or stems_b is None:
        return None, {"enabled": False, "reason": "missing-stems"}
    n = min(
        int(period_a.size),
        int(period_b.size),
        int(stems_a.vocals.size),
        int(stems_b.vocals.size),
        int(stems_a.bass.size),
        int(stems_b.bass.size),
    )
    if n <= 0:
        return None, {"enabled": False, "reason": "empty-period"}

    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    bass_in = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
    # Keep lows lighter in the center, then restore toward each edge.
    center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)

    bass_mix = ((stems_a.bass[:n] * (1.0 - bass_in)) + (stems_b.bass[:n] * bass_in)).astype(np.float32)
    bass_mix = (bass_mix * center_bass_shape).astype(np.float32)

    acc_a = (stems_a.accompaniment[:n] - stems_a.bass[:n]).astype(np.float32)
    acc_b = (stems_b.accompaniment[:n] - stems_b.bass[:n]).astype(np.float32)
    inst_mix = ((acc_a * (1.0 - x)) + (acc_b * x)).astype(np.float32)

    vocal_side = np.where(x < 0.5, stems_a.vocals[:n], stems_b.vocals[:n]).astype(np.float32)
    vocal_shape = np.where(
        x < 0.5,
        np.clip(1.0 - ((x / 0.5) * 0.75), 0.25, 1.0),
        np.clip(((x - 0.5) / 0.5) * 0.75 + 0.25, 0.25, 1.0),
    ).astype(np.float32)
    vocals_mix = (vocal_side * vocal_shape * 0.26).astype(np.float32)

    stem_mix = (inst_mix + bass_mix + vocals_mix).astype(np.float32)
    return stem_mix, {
        "enabled": True,
        "method": "demucs-one-bassline-rule",
        "bass_handoff": {"start_ratio": 0.60, "end_ratio": 0.88},
        "center_bass_floor": 0.35,
        "vocal_sidechain_gain": 0.26,
    }


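The "one bassline" rule above rests on two curves: `bass_in`, which keeps Song A's bass stem exclusive until 60% through the period and hands fully over to Song B by 88%, and `center_bass_shape`, which dips the combined low end to 35% at the midpoint and restores it at the edges. A numpy-only sketch of those curves, detached from the Demucs stems:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 1001, dtype=np.float32)

# Crossfade position of the bass stem: 0 = all Song A, 1 = all Song B.
bass_in = np.clip((x - 0.60) / 0.28, 0.0, 1.0)

# Overall low-end weight: full at the edges, dipped to 0.35 at the center.
center_bass_shape = 0.35 + 0.65 * np.abs(2.0 * x - 1.0)

print(float(bass_in[500]))            # still 0.0 at the midpoint
print(float(bass_in[-1]))             # 1.0 at the end
print(float(center_bass_shape[500]))  # ~0.35 at the midpoint
```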
def _build_src_transition_period(
    period_a: np.ndarray,
    period_b: np.ndarray,
    sr: int,
) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
    return _build_src_transition_period_with_stems(period_a, period_b, sr=sr, stems_a=None, stems_b=None)


def _build_src_transition_period_with_stems(
    period_a: np.ndarray,
    period_b: np.ndarray,
    sr: int,
    stems_a: Optional[_DemucsStemBundle] = None,
    stems_b: Optional[_DemucsStemBundle] = None,
) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
    directional, directional_debug = _dj_style_seam_mix(period_a, period_b, sr=sr)
    n = int(min(period_a.size, period_b.size))
    if n > 0:
        x = np.linspace(0.0, 1.0, n, dtype=np.float32)
        guide = ((period_a[:n] * (1.0 - x)) + (period_b[:n] * x)).astype(np.float32)
        src_period = ((0.70 * directional[:n]) + (0.30 * guide)).astype(np.float32)
    else:
        src_period = directional.astype(np.float32)

    demucs_mix, demucs_mix_debug = _build_one_bassline_stem_period(
        period_a=period_a,
        period_b=period_b,
        stems_a=stems_a,
        stems_b=stems_b,
    )
    if demucs_mix is not None and demucs_mix.size > 0:
        src_period = ((0.54 * src_period[: demucs_mix.size]) + (0.46 * demucs_mix)).astype(np.float32)
        if src_period.size < n:
            src_period = ensure_length(src_period, n)

    use_acc_ref = _REF_AUDIO_MODE in {"accompaniment-only", "accompaniment", "inst-only", "instrumental-only"}
    if use_acc_ref and stems_a is not None and stems_a.accompaniment.size > 0:
        reference_period = ensure_length(stems_a.accompaniment.astype(np.float32), int(period_a.size))
        ref_mode = "accompaniment-only"
    else:
        reference_period = period_a.astype(np.float32)
        ref_mode = "full-period-a"
    dominant = "song_b"

    src_period, low_duck_debug = _apply_transition_low_duck(src_period, sr=sr)
    src_period = normalize_peak(src_period, peak=0.99)
    return src_period, reference_period, {
        "method": "bar-period-layered-repaint-src-fixed-b-base",
        "base_mode": "B-base-fixed",
        "dominant_period": dominant,
        "demucs_one_bassline": demucs_mix_debug,
        "reference_mode": ref_mode,
        "guide_mix": {
            "enabled": True,
            "weight_directional": 0.70,
            "weight_time_direction_guide": 0.30,
            "behavior": "more-song-a-detail-at-entry-more-song-b-at-exit",
        },
        "directional_mix": directional_debug,
        "transition_low_profile": low_duck_debug,
    }


def _crossfade_join_frequency_aware(a: np.ndarray, b: np.ndarray, fade_n: int, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
    if a.size <= 0:
        return b.astype(np.float32), {"method": "prepend-empty"}
    if b.size <= 0:
        return a.astype(np.float32), {"method": "append-empty"}

    n = int(max(0, fade_n))
    n = min(n, int(a.size), int(b.size))
    if n <= 0:
        return np.concatenate([a, b]).astype(np.float32), {"method": "no-fade"}

    seg_a = a[-n:].astype(np.float32)
    seg_b = b[:n].astype(np.float32)
    seam, seam_debug = _dj_style_seam_mix(seg_a, seg_b, sr=sr)
    out = np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)
    return out, {"method": "frequency-aware-join", "fade_samples": int(n), "seam": seam_debug}


def _post_repaint_stem_correction(
    transition: np.ndarray,
    sr: int,
    anchor_a: Optional[_DemucsStemBundle] = None,
    anchor_b: Optional[_DemucsStemBundle] = None,
) -> Tuple[np.ndarray, Dict[str, Any]]:
    y = transition.astype(np.float32)
    if y.size <= 0:
        return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-transition"}

    stems, demucs_debug = _extract_demucs_stems(y, int(sr), track_label="post-repaint-transition")
    if stems is None:
        return y, {"enabled": False, "reason": "demucs-unavailable", "demucs": demucs_debug}

    n = int(min(stems.vocals.size, stems.drums.size, stems.bass.size, stems.other.size, y.size))
    if n <= 0:
        return y, {"enabled": False, "reason": "empty-stems", "demucs": demucs_debug}

    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    center = np.clip(np.minimum(x, 1.0 - x) / 0.18, 0.0, 1.0).astype(np.float32)

    bass_cur = max(1e-5, _rms(stems.bass[:n]))
    bass_ref_a = _rms(anchor_a.bass) if anchor_a is not None else bass_cur
    bass_ref_b = _rms(anchor_b.bass) if anchor_b is not None else bass_cur
    bass_gain_a = float(clamp(bass_ref_a / bass_cur, 0.65, 1.15))
    bass_gain_b = float(clamp(bass_ref_b / bass_cur, 0.65, 1.15))
    bass_linear = ((1.0 - x) * bass_gain_a) + (x * bass_gain_b)
    bass_center_shape = (0.72 + (0.28 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
    bass_gain = (bass_linear * bass_center_shape).astype(np.float32)

    vocal_cur = max(1e-5, _rms(stems.vocals[:n]))
    vocal_ref_a = _rms(anchor_a.vocals) if anchor_a is not None else vocal_cur
    vocal_ref_b = _rms(anchor_b.vocals) if anchor_b is not None else vocal_cur
    vocal_gain_a = float(clamp(vocal_ref_a / vocal_cur, 0.42, 1.0))
    vocal_gain_b = float(clamp(vocal_ref_b / vocal_cur, 0.42, 1.0))
    vocal_linear = ((1.0 - x) * vocal_gain_a) + (x * vocal_gain_b)
    vocal_boundary_shape = (0.72 + (0.28 * center)).astype(np.float32)
    vocal_gain = (vocal_linear * vocal_boundary_shape).astype(np.float32)

    drum_gain = (1.05 - (0.08 * center)).astype(np.float32)
    other_gain = 1.0

    corrected = (
        (stems.vocals[:n] * vocal_gain)
        + (stems.drums[:n] * drum_gain)
        + (stems.bass[:n] * bass_gain)
        + (stems.other[:n] * other_gain)
    ).astype(np.float32)
    corrected = ensure_length(corrected, int(y.size))
    return corrected, {
        "enabled": True,
        "method": "demucs-post-repaint-boundary-rebalance",
        "demucs": demucs_debug,
        "gains": {
            "bass_start": round(float(bass_gain_a), 4),
            "bass_end": round(float(bass_gain_b), 4),
            "vocal_start": round(float(vocal_gain_a), 4),
            "vocal_end": round(float(vocal_gain_b), 4),
            "drum_edge_boost": 1.05,
        },
        "anchor_rms": {
            "bass_a": round(float(bass_ref_a), 6),
            "bass_b": round(float(bass_ref_b), 6),
            "vocal_a": round(float(vocal_ref_a), 6),
            "vocal_b": round(float(vocal_ref_b), 6),
        },
    }


def _assemble_substitute_mix(
    song_a_prefix: np.ndarray,
    transition: np.ndarray,
    song_b_suffix: np.ndarray,
    boundary_fade_n: int = 0,
    sr: int = DEFAULT_TARGET_SR,
) -> Tuple[np.ndarray, Dict[str, Any]]:
    a = song_a_prefix.astype(np.float32) if song_a_prefix.size > 0 else np.zeros((0,), dtype=np.float32)
    t = transition.astype(np.float32) if transition.size > 0 else np.zeros((0,), dtype=np.float32)
    b = song_b_suffix.astype(np.float32) if song_b_suffix.size > 0 else np.zeros((0,), dtype=np.float32)
    joined, entry_debug = _crossfade_join_frequency_aware(a, t, boundary_fade_n, sr=sr)
    joined, exit_debug = _crossfade_join_frequency_aware(joined, b, boundary_fade_n, sr=sr)
    return joined.astype(np.float32), {
        "method": "dual-frequency-aware-boundary-joins",
        "entry": entry_debug,
        "exit": exit_debug,
    }


def _align_b_window_to_a_tail(
    a_tail: np.ndarray,
    y_b_stretched: np.ndarray,
    nominal_start_n: int,
    seam_n: int,
    post_n: int,
    sr: int,
    bpm_ref: float,
    a_tail_drums: Optional[np.ndarray] = None,
    y_b_stretched_drums: Optional[np.ndarray] = None,
) -> Tuple[np.ndarray, int, Dict[str, Any]]:
    total_n = seam_n + post_n
    if y_b_stretched.size < total_n:
        return ensure_length(y_b_stretched, total_n), 0, {
            "method": "short-buffer-fallback",
            "candidate_count": 0,
        }

    beat_sec = 60.0 / max(1e-6, float(bpm_ref))
    search_sec = clamp(0.75 * beat_sec, 0.2, 1.2)
    search_n = int(round(search_sec * sr))

    nominal_start_n = int(clamp(float(nominal_start_n), 0.0, float(max(0, y_b_stretched.size - total_n))))
    lo = max(0, nominal_start_n - search_n)
    hi = min(y_b_stretched.size - total_n, nominal_start_n + search_n)

    _, beat_times_stretched = estimate_bpm_and_beats(y_b_stretched, sr)
    candidates: List[int] = []
    for bt in beat_times_stretched:
        idx = int(round(float(bt) * sr))
        if lo <= idx <= hi:
            candidates.append(idx)
    candidates.append(nominal_start_n)
    candidates = sorted(set(candidates))
    if not candidates:
        candidates = [nominal_start_n]

    use_drum_alignment = (
        isinstance(a_tail_drums, np.ndarray)
        and isinstance(y_b_stretched_drums, np.ndarray)
        and int(a_tail_drums.size) >= int(seam_n)
        and int(y_b_stretched_drums.size) >= int(y_b_stretched.size)
    )

    onset_a_mix = _normalized_onset_envelope(a_tail, sr)
    onset_a_drum = _normalized_onset_envelope(a_tail_drums[:seam_n], sr) if use_drum_alignment else onset_a_mix
    rms_a = _rms(a_tail)
    drum_rms_a = _rms(a_tail_drums[:seam_n]) if use_drum_alignment else 0.0
    best_idx = candidates[0]
    best_score = -1.0
    best_components = {"onset_mix": 0.0, "onset_drum": 0.0, "energy": 0.0, "drum_energy": 0.0, "distance": 0.0}

    distance_scale = max(1.0, 0.65 * search_n)
    for idx in candidates:
        seg = ensure_length(y_b_stretched[idx : idx + total_n], total_n)
        b_head = seg[:seam_n]

        onset_b_mix = _normalized_onset_envelope(b_head, sr)
        onset_score_mix = _corr_similarity(onset_a_mix, onset_b_mix)
        onset_score_drum = onset_score_mix
        drum_energy_score = 0.5
        onset_score = onset_score_mix
        if use_drum_alignment:
            seg_drums = ensure_length(y_b_stretched_drums[idx : idx + total_n], total_n)
            b_head_drums = seg_drums[:seam_n]
            onset_b_drum = _normalized_onset_envelope(b_head_drums, sr)
            onset_score_drum = _corr_similarity(onset_a_drum, onset_b_drum)
            onset_score = (0.78 * onset_score_drum) + (0.22 * onset_score_mix)
            drum_rms_b = _rms(b_head_drums)
            drum_gap = abs(drum_rms_a - drum_rms_b) / max(1e-4, drum_rms_a)
            drum_energy_score = clamp(1.0 - drum_gap, 0.0, 1.0)

        rms_b = _rms(b_head)
        energy_gap = abs(rms_a - rms_b) / max(1e-4, rms_a)
        energy_score = clamp(1.0 - energy_gap, 0.0, 1.0)

        dist = abs(idx - nominal_start_n)
        distance_score = float(np.exp(-dist / distance_scale))

        if use_drum_alignment:
            score = (0.62 * onset_score) + (0.18 * energy_score) + (0.10 * drum_energy_score) + (0.10 * distance_score)
        else:
            score = (0.56 * onset_score) + (0.26 * energy_score) + (0.18 * distance_score)
        if score > best_score:
            best_score = float(score)
            best_idx = int(idx)
            best_components = {
                "onset_mix": float(onset_score_mix),
                "onset_drum": float(onset_score_drum),
                "energy": float(energy_score),
                "drum_energy": float(drum_energy_score),
                "distance": float(distance_score),
            }

    aligned = ensure_length(y_b_stretched[best_idx : best_idx + total_n], total_n)
    return aligned, best_idx, {
        "method": "drum-led-beat-phase-transient-align" if use_drum_alignment else "beat-phase-transient-align",
        "used_drum_stems": bool(use_drum_alignment),
        "candidate_count": len(candidates),
        "search_sec": round(float(search_sec), 4),
        "search_samples": int(search_n),
        "nominal_start_sample": int(nominal_start_n),
        "best_start_sample": int(best_idx),
        "best_score": round(float(best_score), 6),
        "score_components": {k: round(float(v), 6) for k, v in best_components.items()},
    }


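The distance term in the candidate score above decays exponentially with the offset from the nominal start, with the scale set to 65% of the search window. A standalone sketch of just that term (sample counts below are illustrative, not values from the pipeline):

```python
import numpy as np

def distance_score(idx, nominal, search_n):
    # Same shape as above: exp decay over 0.65 * search window.
    scale = max(1.0, 0.65 * search_n)
    return float(np.exp(-abs(idx - nominal) / scale))

# Illustrative values: nominal start 1 s into a 44.1 kHz buffer,
# search window of half a second.
nominal, search_n = 44100, 22050
print(distance_score(nominal, nominal, search_n))            # 1.0 at nominal
print(distance_score(nominal + search_n, nominal, search_n)) # ~0.21 at the edge
```

So a candidate beat at the edge of the search window needs a noticeably better onset/energy match to beat one sitting exactly at the nominal position.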
def _prepare_rough_transition(request: TransitionRequest) -> Dict[str, Any]:
|
| 1066 |
+
pre_sec_raw = clamp(request.pre_context_sec, 1.0, 20.0)
|
| 1067 |
+
post_sec_raw = clamp(request.post_context_sec, 1.0, 20.0)
|
| 1068 |
+
analysis_sec = clamp(request.analysis_sec, 10.0, 120.0)
|
| 1069 |
+
|
| 1070 |
+
target_sr = int(request.target_sr)
|
| 1071 |
+
|
| 1072 |
+
dur_a = ffprobe_duration_sec(request.song_a_path)
|
| 1073 |
+
dur_b = ffprobe_duration_sec(request.song_b_path)
|
| 1074 |
+
|
| 1075 |
+
a_analysis_start = max(0.0, float(dur_a) - analysis_sec) if dur_a is not None else 0.0
|
| 1076 |
+
|
| 1077 |
+
y_a_an, sr_a = decode_segment(request.song_a_path, a_analysis_start, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
|
| 1078 |
+
y_b_an, sr_b = decode_segment(request.song_b_path, 0.0, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
|
| 1079 |
+
bpm_a, beats_a = estimate_bpm_and_beats(y_a_an, sr_a)
|
| 1080 |
+
bpm_b, beats_b = estimate_bpm_and_beats(y_b_an, sr_b)
|
| 1081 |
+
|
| 1082 |
+
if request.bpm_target is not None and 40.0 <= float(request.bpm_target) <= 220.0:
|
| 1083 |
+
bpm_a = float(request.bpm_target)
|
| 1084 |
+
|
| 1085 |
+
bpm_a = float(bpm_a) if bpm_a is not None else 120.0
|
| 1086 |
+
bpm_b_detected = float(bpm_b) if bpm_b is not None else 120.0
|
| 1087 |
+
bpm_b_for_alignment = _resolve_half_double_tempo(bpm_a, bpm_b_detected)
|
| 1088 |
+
bars_requested = int(request.transition_bars)
|
| 1089 |
+
valid_bars = {4, 8, 16}
|
| 1090 |
+
transition_bars = bars_requested if bars_requested in valid_bars else 8
|
| 1091 |
+
seam_sec_raw = float(_beats_to_seconds(float(transition_bars * 4), bpm_a))
|
| 1092 |
+
seam_sec_raw = float(clamp(seam_sec_raw, 1.0, 40.0))
|
| 1093 |
+
seam_sec_ui_raw = seam_sec_raw
|
| 1094 |
+
base_mode = "B-base-fixed"
|
| 1095 |
+
|
| 1096 |
+
phrase_lock = _phrase_lock_transition_shape(
|
| 1097 |
+
pre_sec=pre_sec_raw,
|
| 1098 |
+
seam_sec=seam_sec_raw,
|
| 1099 |
+
post_sec=post_sec_raw,
|
| 1100 |
+
bpm=bpm_a,
|
| 1101 |
+
)
|
| 1102 |
+
pre_sec = float(phrase_lock["pre_sec"])
|
| 1103 |
+
seam_sec = float(phrase_lock["seam_sec"])
|
| 1104 |
+
post_sec = float(phrase_lock["post_sec"])
|
| 1105 |
+
|
| 1106 |
+
cue_selection = select_mix_cuepoints(
|
| 1107 |
+
y_a_analysis=y_a_an,
|
| 1108 |
+
y_b_analysis=y_b_an,
|
| 1109 |
+
sr=target_sr,
|
| 1110 |
+
analysis_sec=analysis_sec,
|
| 1111 |
+
pre_sec=pre_sec,
|
| 1112 |
+
seam_sec=seam_sec,
|
| 1113 |
+
post_sec=post_sec,
|
| 1114 |
+
a_analysis_start_sec=a_analysis_start,
|
| 1115 |
+
beats_a=beats_a,
|
| 1116 |
+
beats_b=beats_b,
|
| 1117 |
+
cue_a_override_sec=request.cue_a_sec,
|
| 1118 |
+
cue_b_override_sec=request.cue_b_sec,
|
| 1119 |
+
song_a_path=request.song_a_path,
|
| 1120 |
+
song_b_path=request.song_b_path,
|
| 1121 |
+
song_a_duration_sec=dur_a,
|
| 1122 |
+
song_b_duration_sec=dur_b,
|
| 1123 |
+
)
|
| 1124 |
+
cue_a = float(cue_selection.cue_a_sec)
|
| 1125 |
+
cue_b = float(cue_selection.cue_b_sec)
|
| 1126 |
+
|
| 1127 |
+
stretch_rate_raw = bpm_a / max(1e-6, bpm_b_for_alignment)
|
| 1128 |
+
# Keep stronger musical coherence while avoiding very audible stretch artifacts.
|
| 1129 |
+
stretch_rate = clamp(stretch_rate_raw, 0.7, 1.35)
|
| 1130 |
+
|
| 1131 |
+
pre_n = int(round(pre_sec * target_sr))
|
| 1132 |
+
seam_n = int(round(seam_sec * target_sr))
|
| 1133 |
+
post_n = int(round(post_sec * target_sr))
|
| 1134 |
+
|
| 1135 |
+
# Song A transition period: bars before cue A.
|
| 1136 |
+
a_period_start = max(0.0, cue_a - seam_sec)
|
| 1137 |
+
period_a, _ = decode_segment(
|
| 1138 |
+
request.song_a_path,
|
| 1139 |
+
a_period_start,
|
| 1140 |
+
seam_sec,
|
| 1141 |
+
sr=target_sr,
|
| 1142 |
+
max_decode_sec=seam_sec + 2.0,
|
| 1143 |
+
)
|
| 1144 |
+
period_a = ensure_length(period_a, seam_n)
|
| 1145 |
+
period_a_stems, period_a_stem_debug = _extract_demucs_stems(period_a, target_sr, track_label="song-a-transition-period")
|
| 1146 |
+
|
| 1147 |
+
# Repaint pre-context leading into the transition period.
|
| 1148 |
+
a_pre_start = max(0.0, a_period_start - pre_sec)
|
| 1149 |
+
a_pre, _ = decode_segment(
|
| 1150 |
+
request.song_a_path,
|
| 1151 |
+
a_pre_start,
|
| 1152 |
+
pre_sec,
|
| 1153 |
+
sr=target_sr,
|
| 1154 |
+
max_decode_sec=pre_sec + 2.0,
|
| 1155 |
+
)
|
| 1156 |
+
a_pre = _left_pad_to_length(a_pre, pre_n)
|
| 1157 |
+
|
| 1158 |
+
cue_b_selected = cue_b
|
| 1159 |
+
stitch_preview_side_sec = float(STITCH_PREVIEW_SIDE_SEC)
|
| 1160 |
+
boundary_fade_beats = 2.0
|
| 1161 |
+
boundary_fade_sec = clamp(_beats_to_seconds(boundary_fade_beats, bpm_a), 0.08, 1.2)
|
| 1162 |
+
boundary_fade_n = int(round(boundary_fade_sec * target_sr))
|
| 1163 |
+
stitch_decode_side_sec = stitch_preview_side_sec + boundary_fade_sec
|
| 1164 |
+
cue_a_for_stitch = float(max(0.0, cue_a - seam_sec))
|
| 1165 |
+
if dur_a is not None:
|
| 1166 |
+
cue_a_for_stitch = clamp(cue_a_for_stitch, 0.0, float(dur_a))
|
| 1167 |
+
song_a_preview_start = max(0.0, cue_a_for_stitch - stitch_decode_side_sec)
|
| 1168 |
+
song_a_preview_dur = max(0.0, cue_a_for_stitch - song_a_preview_start)
|
| 1169 |
+
song_a_prefix, _ = decode_segment(
|
| 1170 |
+
request.song_a_path,
|
| 1171 |
+
song_a_preview_start,
|
| 1172 |
+
song_a_preview_dur,
|
| 1173 |
+
sr=target_sr,
|
| 1174 |
+
max_decode_sec=max(20.0, song_a_preview_dur + 2.0),
|
| 1175 |
+
)
|
| 1176 |
+
|
| 1177 |
+
    # Song B window: decode with pre-roll so we can phase-align on stretched beat grid.
    align_preroll_sec = clamp(0.75 * (60.0 / max(1e-6, bpm_a)), 0.2, 1.2)
    decode_start_b = max(0.0, cue_b_selected - (align_preroll_sec * stretch_rate))
    if dur_b is not None:
        decode_start_b = clamp(decode_start_b, 0.0, float(dur_b))
    desired_b_out_sec = seam_sec + max(post_sec, stitch_decode_side_sec) + (2.0 * align_preroll_sec)
    if dur_b is not None:
        # Decode only enough of Song B for alignment + transition + preview tail.
        remaining_sec = max(0.0, float(dur_b) - decode_start_b)
        raw_b_in_sec = clamp(min(remaining_sec, desired_b_out_sec * stretch_rate), 1.0, 360.0)
    else:
        raw_b_in_sec = clamp(desired_b_out_sec * stretch_rate, 1.0, 360.0)
    y_b_raw, _ = decode_segment(
        request.song_b_path,
        decode_start_b,
        raw_b_in_sec,
        sr=target_sr,
        max_decode_sec=raw_b_in_sec + 2.0,
    )
    y_b_stretched = safe_time_stretch(y_b_raw, rate=stretch_rate)
    y_b_stretched_stems, y_b_stem_debug = _extract_demucs_stems(
        y_b_stretched,
        target_sr,
        track_label="song-b-stretched-window",
    )
    nominal_b_start_n = int(round(align_preroll_sec * target_sr))
    y_b, aligned_b_start_n, b_alignment_debug = _align_b_window_to_a_tail(
        a_tail=period_a,
        y_b_stretched=y_b_stretched,
        nominal_start_n=nominal_b_start_n,
        seam_n=seam_n,
        post_n=post_n,
        sr=target_sr,
        bpm_ref=bpm_a,
        a_tail_drums=period_a_stems.drums if period_a_stems is not None else None,
        y_b_stretched_drums=y_b_stretched_stems.drums if y_b_stretched_stems is not None else None,
    )
    cue_b = float(decode_start_b + ((aligned_b_start_n / float(target_sr)) * stretch_rate))
    period_b = y_b[:seam_n]
    period_b_stems = _slice_stem_bundle(y_b_stretched_stems, aligned_b_start_n, seam_n)
    b_post = y_b[seam_n : seam_n + post_n]
    stitch_decode_n = int(round(stitch_decode_side_sec * target_sr))
    b_suffix_substitute = y_b_stretched[(aligned_b_start_n + seam_n) : (aligned_b_start_n + seam_n + stitch_decode_n)].astype(
        np.float32
    )
    if b_suffix_substitute.size == 0:
        b_suffix_substitute = np.zeros((0,), dtype=np.float32)

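The pre-roll and decode-window arithmetic for Song B is easy to sanity-check in isolation. A minimal sketch, assuming `clamp` has a `(value, lo, hi)` signature like the module helper (the function name and signature below are illustrative, not the pipeline's API):

```python
def clamp(x, lo, hi):
    # Same contract assumed for the module's clamp helper.
    return max(lo, min(hi, x))

def song_b_decode_window(bpm_a, cue_b_selected, stretch_rate,
                         seam_sec, post_sec, stitch_decode_side_sec):
    """Return (decode_start_b, raw_b_in_sec) for the Song B pre-roll window."""
    # Pre-roll ~3/4 of a beat at Song A's tempo, clamped to a sane range.
    align_preroll_sec = clamp(0.75 * (60.0 / max(1e-6, bpm_a)), 0.2, 1.2)
    decode_start_b = max(0.0, cue_b_selected - align_preroll_sec * stretch_rate)
    # Output-domain seconds needed: seam + tail + pre-roll slack on both sides.
    desired_out = seam_sec + max(post_sec, stitch_decode_side_sec) + 2.0 * align_preroll_sec
    # Input-domain seconds scale by the stretch rate.
    raw_b_in_sec = clamp(desired_out * stretch_rate, 1.0, 360.0)
    return decode_start_b, raw_b_in_sec

start, dur = song_b_decode_window(
    bpm_a=120.0, cue_b_selected=30.0, stretch_rate=1.0,
    seam_sec=8.0, post_sec=6.0, stitch_decode_side_sec=12.0,
)
```

At 120 BPM the pre-roll clamps to 0.375 s, so decoding starts 0.375 s before the selected cue and the window covers the seam, the longer of the two tails, and the pre-roll slack.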
    rough_seam, reference_period, rough_mix_debug = _build_src_transition_period_with_stems(
        period_a=period_a,
        period_b=period_b,
        sr=target_sr,
        stems_a=period_a_stems,
        stems_b=period_b_stems,
    )
    rough_stitched = np.concatenate([a_pre, rough_seam, b_post]).astype(np.float32)
    reference_audio_clip, reference_audio_debug = _build_period_reference_audio(
        reference_period,
        sr=target_sr,
        source_mode=str(rough_mix_debug.get("reference_mode", "full-period-a")),
    )
    return {
        "target_sr": target_sr,
        "dur_a": dur_a,
        "dur_b": dur_b,
        "analysis_start_a_sec": a_analysis_start,
        "bpm_a": bpm_a,
        "bpm_b": bpm_b_detected,
        "bpm_b_for_alignment": bpm_b_for_alignment,
        "cue_a_sec": cue_a,
        "cue_b_sec": cue_b,
        "cue_b_selected_sec": cue_b_selected,
        "cue_selector_method": cue_selection.method,
        "cue_selector_debug": cue_selection.debug,
        "stretch_rate": stretch_rate,
        "stretch_rate_raw": stretch_rate_raw,
        "transition_base_mode": base_mode,
        "transition_bars": int(transition_bars),
        "b_alignment_debug": b_alignment_debug,
        "phrase_lock_debug": phrase_lock["debug"],
        "rough_mix_debug": rough_mix_debug,
        "reference_audio_debug": reference_audio_debug,
        "demucs_transition_debug": {
            "enabled": bool(_DEMUCS_TRANSITION_ENABLED),
            "period_a": period_a_stem_debug,
            "b_window_stretched": y_b_stem_debug,
            "period_b_from_aligned_window": {
                "status": "ready" if period_b_stems is not None else "unavailable",
                "source": "slice(song-b-stretched-window, aligned_start, seam_n)",
                "aligned_start_sample": int(aligned_b_start_n),
                "seam_n": int(seam_n),
            },
        },
        "pre_sec": pre_sec,
        "seam_sec": seam_sec,
        "post_sec": post_sec,
        "pre_sec_raw": pre_sec_raw,
        "seam_sec_raw": seam_sec_raw,
        "seam_sec_ui_raw": seam_sec_ui_raw,
        "post_sec_raw": post_sec_raw,
        "pre_n": pre_n,
        "seam_n": seam_n,
        "post_n": post_n,
        "rough_seam": rough_seam,
        "rough_stitched": rough_stitched,
        "song_a_prefix": song_a_prefix,
        "song_b_suffix_substitute": b_suffix_substitute,
        "reference_audio_clip": reference_audio_clip,
        "period_a_stem_bundle": period_a_stems,
        "period_b_stem_bundle": period_b_stems,
        "boundary_fade_n": int(boundary_fade_n),
        "boundary_fade_sec": float(boundary_fade_sec),
        "stitch_preview_side_sec": float(stitch_preview_side_sec),
        "stitch_decode_side_sec": float(stitch_decode_side_sec),
        "stitching_debug": {
            "mode": "replace-seam-no-insert",
            "transition_base_mode": base_mode,
            "transition_bars": int(transition_bars),
            "song_a_prefix_sec": round(float(song_a_prefix.size / max(1, target_sr)), 3),
            "transition_sec": round(float(seam_sec), 3),
            "song_b_suffix_sec": round(float(b_suffix_substitute.size / max(1, target_sr)), 3),
            "decode_start_b_sec": round(float(decode_start_b), 3),
            "cue_a_cut_sec": round(float(cue_a_for_stitch), 3),
            "cue_b_continuation_sec": round(float(cue_b + seam_sec), 3),
            "replaced_window_sec": round(float(seam_sec), 3),
            "boundary_fade_sec": round(float(boundary_fade_sec), 3),
            "stitch_preview_side_sec": round(float(stitch_preview_side_sec), 3),
            "stitch_decode_side_sec": round(float(stitch_decode_side_sec), 3),
        },
    }


def _extract_success_and_audios(result: Any) -> Tuple[bool, list, Optional[str]]:
    if isinstance(result, dict):
        success = bool(result.get("success", False))
        audios = result.get("audios", [])
        error = result.get("error") or result.get("status_message")
        return success, audios, error
    success = bool(getattr(result, "success", False))
    audios = getattr(result, "audios", [])
    error = getattr(result, "error", None) or getattr(result, "status_message", None)
    return success, audios, error


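The helper above normalizes generation results that may arrive either as plain dicts or as result objects. The dict-or-attribute pattern can be exercised on its own — this sketch uses `SimpleNamespace` as a stand-in for a result object (the field names mirror the helper; nothing here is the ACE-Step API itself):

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple

def extract_success_and_audios(result: Any) -> Tuple[bool, list, Optional[str]]:
    # Dict-shaped results: read keys with .get() so missing fields degrade gracefully.
    if isinstance(result, dict):
        success = bool(result.get("success", False))
        audios = result.get("audios", [])
        error = result.get("error") or result.get("status_message")
        return success, audios, error
    # Object-shaped results: same fields via getattr with defaults.
    success = bool(getattr(result, "success", False))
    audios = getattr(result, "audios", [])
    error = getattr(result, "error", None) or getattr(result, "status_message", None)
    return success, audios, error

ok, audios, err = extract_success_and_audios({"success": True, "audios": [{"tensor": None}]})
ok2, _, err2 = extract_success_and_audios(SimpleNamespace(success=False, status_message="oom"))
```

Using `or` to fall back from `error` to `status_message` means an empty-string error also falls through to the status message, which is the desired behavior here.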
def _load_acestep_runtime(request: TransitionRequest) -> Dict[str, Any]:
    global _ACESTEP_RUNTIME

    project_root = _resolve_acestep_project_root(request)
    runtime_key = (
        project_root,
        request.acestep_model_config,
        request.acestep_device,
        request.acestep_lora_path,
        float(request.acestep_lora_scale),
    )

    if _ACESTEP_RUNTIME is not None and _ACESTEP_RUNTIME.get("key") == runtime_key:
        return _ACESTEP_RUNTIME

    try:
        from acestep.handler import AceStepHandler
        from acestep.inference import GenerationConfig, GenerationParams, generate_music
    except Exception as exc:
        raise RuntimeError(
            "ACE-Step is not installed or import failed. "
            "Install with: pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git"
        ) from exc

    handler = AceStepHandler()
    status, ok = handler.initialize_service(
        project_root=project_root,
        config_path=request.acestep_model_config,
        device=request.acestep_device,
        use_flash_attention=request.acestep_use_flash_attn,
        compile_model=request.acestep_compile_model,
        offload_to_cpu=request.acestep_offload_to_cpu,
        offload_dit_to_cpu=request.acestep_offload_dit_to_cpu,
        quantization=None,
        prefer_source=request.acestep_prefer_source,
        use_mlx_dit=request.acestep_use_mlx_dit,
    )
    if not ok:
        raise RuntimeError(f"ACE-Step initialize_service failed: {status}")

    lora_debug: Dict[str, Any] = {"requested": False}
    if request.acestep_lora_path:
        lora_debug["requested"] = True
        resolved_lora_path = _resolve_lora_path(request.acestep_lora_path, project_root)
        try:
            handler.load_lora(resolved_lora_path)
            handler.set_use_lora(True)
            handler.set_lora_scale(float(request.acestep_lora_scale))
            lora_debug.update(
                {
                    "loaded": True,
                    "path": resolved_lora_path,
                    "scale": float(request.acestep_lora_scale),
                }
            )
        except Exception as exc:
            raise RuntimeError(f"Failed to load ACE-Step LoRA: {exc}") from exc
    else:
        lora_debug["loaded"] = False

    _ACESTEP_RUNTIME = {
        "key": runtime_key,
        "project_root": project_root,
        "handler": handler,
        "GenerationParams": GenerationParams,
        "GenerationConfig": GenerationConfig,
        "generate_music": generate_music,
        "lora_debug": lora_debug,
    }
    return _ACESTEP_RUNTIME


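The runtime is memoized on a tuple key so repeated requests with identical settings reuse the already-initialized model instead of reloading it. The caching pattern in isolation (names and the `LOAD_CALLS` counter are illustrative only):

```python
from typing import Any, Dict, Optional, Tuple

_RUNTIME: Optional[Dict[str, Any]] = None
LOAD_CALLS = 0  # instrumentation for this sketch only

def load_runtime(model_config: str, device: str, lora_scale: float) -> Dict[str, Any]:
    global _RUNTIME, LOAD_CALLS
    key: Tuple[str, str, float] = (model_config, device, float(lora_scale))
    # Cache hit: identical settings return the same runtime object, no reload.
    if _RUNTIME is not None and _RUNTIME.get("key") == key:
        return _RUNTIME
    LOAD_CALLS += 1  # expensive model initialization would happen here
    _RUNTIME = {"key": key, "handler": object()}
    return _RUNTIME

r1 = load_runtime("config.yaml", "cuda", 1.0)
r2 = load_runtime("config.yaml", "cuda", 1.0)  # cache hit, no reload
r3 = load_runtime("config.yaml", "cpu", 1.0)   # key changed -> reload
```

Keying on the full settings tuple (rather than a boolean "loaded" flag) means changing the device, config, or LoRA scale transparently invalidates the cache.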
def _run_acestep_repaint(
    request: TransitionRequest,
    rough: Dict[str, Any],
    rough_src_path: str,
) -> Tuple[np.ndarray, np.ndarray]:
    runtime = _load_acestep_runtime(request)
    handler = runtime["handler"]
    GenerationParams = runtime["GenerationParams"]
    GenerationConfig = runtime["GenerationConfig"]
    generate_music = runtime["generate_music"]

    caption = _build_caption(request.plugin_id, request.instruction_text)

    rough_stitched = rough["rough_stitched"]
    rough_for_model = resample_if_needed(rough_stitched, rough["target_sr"], ACESTEP_INPUT_SR)
    write_wav(rough_src_path, rough_for_model, ACESTEP_INPUT_SR)
    reference_audio_path: Optional[str] = None
    reference_audio_clip = rough.get("reference_audio_clip")
    if isinstance(reference_audio_clip, np.ndarray) and reference_audio_clip.size > 0:
        reference_audio_path = (
            rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
            if rough_src_path.endswith("_rough_src.wav")
            else f"{rough_src_path}.theme_ref.wav"
        )
        reference_for_model = resample_if_needed(reference_audio_clip, rough["target_sr"], ACESTEP_INPUT_SR)
        write_wav(reference_audio_path, reference_for_model, ACESTEP_INPUT_SR)

    repaint_start = float(rough["pre_sec"])
    repaint_end = float(rough["pre_sec"] + rough["seam_sec"])
    total_duration = float(rough["pre_sec"] + rough["seam_sec"] + rough["post_sec"])
    bpm_hint = int(round(rough["bpm_a"])) if 30 <= rough["bpm_a"] <= 300 else None

    params = GenerationParams(
        task_type="repaint",
        src_audio=rough_src_path,
        reference_audio=reference_audio_path,
        repainting_start=repaint_start,
        repainting_end=repaint_end,
        caption=caption,
        lyrics="[Instrumental]",
        instrumental=True,
        bpm=bpm_hint,
        duration=total_duration,
        inference_steps=int(max(1, request.inference_steps)),
        guidance_scale=float(request.creativity_strength),
        seed=int(request.seed),
        thinking=False,
        use_cot_metas=False,
        use_cot_caption=False,
        use_cot_language=False,
    )
    config = GenerationConfig(
        batch_size=1,
        use_random_seed=False,
        seeds=[int(request.seed)],
        audio_format="wav",
    )

    result = generate_music(
        dit_handler=handler,
        llm_handler=None,
        params=params,
        config=config,
        save_dir=None,
        progress=None,
    )
    success, audios, error = _extract_success_and_audios(result)
    if not success or not audios:
        raise RuntimeError(error or "ACE-Step repaint returned no audio.")

    audio_item = audios[0]
    audio_tensor = audio_item.get("tensor")
    if audio_tensor is None:
        raise RuntimeError("ACE-Step repaint output missing audio tensor.")

    try:
        import torch
        if isinstance(audio_tensor, torch.Tensor):
            y = audio_tensor.detach().float().cpu().numpy()
        else:
            y = np.asarray(audio_tensor, dtype=np.float32)
    except Exception:
        y = np.asarray(audio_tensor, dtype=np.float32)

    if y.ndim == 2:
        y = np.mean(y, axis=0)
    elif y.ndim > 2:
        y = y.reshape(-1)
    y = y.astype(np.float32)

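The shape handling above collapses a channel-first stereo buffer to mono by averaging across channels. The same idea in a dependency-free sketch — plain nested lists stand in for the numpy array here, purely for illustration:

```python
from typing import List, Sequence

def downmix_to_mono(channels: Sequence[Sequence[float]]) -> List[float]:
    """Average a channel-first buffer [[ch0...], [ch1...]] into one mono track."""
    if not channels:
        return []
    n = min(len(ch) for ch in channels)  # guard against ragged channel lengths
    k = float(len(channels))
    return [sum(ch[i] for ch in channels) / k for i in range(n)]

mono = downmix_to_mono([[1.0, 0.0, -1.0], [0.0, 1.0, -1.0]])  # -> [0.5, 0.5, -1.0]
```

With numpy this is exactly `np.mean(y, axis=0)` on a `(channels, samples)` array, which is why averaging preserves signals that are identical in both channels and attenuates out-of-phase content.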
    model_sr = int(audio_item.get("sample_rate", ACESTEP_INPUT_SR))
    y = resample_if_needed(y, model_sr, rough["target_sr"])

    total_n = rough["pre_n"] + rough["seam_n"] + rough["post_n"]
    y = ensure_length(y, total_n)
    stitched = y[:total_n]
    seam_start = rough["pre_n"]
    seam_end = seam_start + rough["seam_n"]
    transition = stitched[seam_start:seam_end]
    return transition, stitched


def generate_transition_artifacts(request: TransitionRequest) -> TransitionResult:
    if not os.path.isfile(request.song_a_path):
        raise FileNotFoundError(f"Song A not found: {request.song_a_path}")
    if not os.path.isfile(request.song_b_path):
        raise FileNotFoundError(f"Song B not found: {request.song_b_path}")

    transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path = _resolve_output_paths(request)

    LOGGER.info("Transition request args: %s", json.dumps(request.to_log_dict(), sort_keys=True))
    rough = _prepare_rough_transition(request)
    rough_stitched_audio = normalize_peak(
        apply_edge_fades(rough["rough_stitched"].astype(np.float32), rough["target_sr"], fade_ms=25.0),
        peak=0.98,
    )
    write_wav(rough_stitched_path, rough_stitched_audio, rough["target_sr"])
    hard_splice_audio = np.concatenate([rough["song_a_prefix"], rough["song_b_suffix_substitute"]]).astype(np.float32)
    hard_splice_audio = normalize_peak(hard_splice_audio, peak=0.98)
    write_wav(hard_splice_path, hard_splice_audio, rough["target_sr"])

    transition_audio = rough["rough_seam"]
    repaint_context_audio = rough["rough_stitched"]
    try:
        transition_audio, repaint_context_audio = _run_acestep_repaint(request, rough, rough_src_path)
    except Exception as exc:
        raise RuntimeError(f"ACE-Step repaint failed. Please verify ACE-Step runtime and model setup. {exc}") from exc

    backend_used = "acestep-repaint"

    transition_audio, post_repaint_stem_debug = _post_repaint_stem_correction(
        transition_audio.astype(np.float32),
        sr=int(rough["target_sr"]),
        anchor_a=rough.get("period_a_stem_bundle"),
        anchor_b=rough.get("period_b_stem_bundle"),
    )

    transition_audio, transition_low_profile_debug = _apply_transition_low_duck(
        transition_audio.astype(np.float32),
        sr=int(rough["target_sr"]),
    )

    stitched_audio, boundary_mix_debug = _assemble_substitute_mix(
        song_a_prefix=rough["song_a_prefix"],
        transition=transition_audio,
        song_b_suffix=rough["song_b_suffix_substitute"],
        boundary_fade_n=int(rough.get("boundary_fade_n", 0)),
        sr=int(rough["target_sr"]),
    )

    transition_audio = normalize_peak(apply_edge_fades(transition_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)
    stitched_audio = normalize_peak(apply_edge_fades(stitched_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)

    write_wav(transition_path, transition_audio, rough["target_sr"])
    write_wav(stitched_path, stitched_audio, rough["target_sr"])

    theme_ref_path = (
        rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
        if rough_src_path.endswith("_rough_src.wav")
        else f"{rough_src_path}.theme_ref.wav"
    )
    if not request.keep_debug_files:
        for tmp_path in (rough_src_path, theme_ref_path):
            if os.path.exists(tmp_path):
                try:
                    os.remove(tmp_path)
                except Exception:
                    pass

    details = {
        "backend_used": backend_used,
        "generation_args": request.to_log_dict(),
        "lora": _load_acestep_runtime(request).get("lora_debug", {"requested": False}),
        "bpm": {
            "song_a": round(float(rough["bpm_a"]), 3),
            "song_b": round(float(rough["bpm_b"]), 3),
            "song_b_for_alignment": round(float(rough["bpm_b_for_alignment"]), 3),
            "stretch_rate": round(float(rough["stretch_rate"]), 5),
            "stretch_rate_raw": round(float(rough["stretch_rate_raw"]), 5),
            "bpm_target_override": request.bpm_target,
        },
        "cue_points_sec": {
            "song_a": round(float(rough["cue_a_sec"]), 3),
            "song_b": round(float(rough["cue_b_sec"]), 3),
            "song_b_selected": round(float(rough["cue_b_selected_sec"]), 3),
            "selector_method": rough.get("cue_selector_method"),
        },
        "cue_selector": rough.get("cue_selector_debug"),
        "bpm_phase_alignment": rough.get("b_alignment_debug"),
        "phrase_lock": rough.get("phrase_lock_debug"),
        "rough_mix": rough.get("rough_mix_debug"),
        "reference_audio": rough.get("reference_audio_debug"),
        "demucs_transition": rough.get("demucs_transition_debug"),
        "stitching": rough.get("stitching_debug"),
        "boundary_mix": boundary_mix_debug,
        "post_repaint_stem_correction": post_repaint_stem_debug,
        "transition_low_profile": transition_low_profile_debug,
        "transition_strategy": {
            "name": "bar-defined-dual-base-repaint",
            "base_mode": rough.get("transition_base_mode"),
            "transition_bars": rough.get("transition_bars"),
            "boundary_fade_sec": round(float(rough.get("boundary_fade_sec", 0.0)), 3),
        },
        "clip_shape_sec": {
            "pre_context_sec_raw": round(float(rough["pre_sec_raw"]), 3),
            "pre_context_sec": round(float(rough["pre_sec"]), 3),
            "repaint_width_sec_ui_raw": round(float(rough.get("seam_sec_ui_raw", rough["seam_sec_raw"])), 3),
            "repaint_width_sec_raw": round(float(rough["seam_sec_raw"]), 3),
            "repaint_width_sec": round(float(rough["seam_sec"]), 3),
            "post_context_sec_raw": round(float(rough["post_sec_raw"]), 3),
            "post_context_sec": round(float(rough["post_sec"]), 3),
            "analysis_sec": round(float(request.analysis_sec), 3),
        },
        "durations_sec": {
            "song_a_total": rough["dur_a"],
            "song_b_total": rough["dur_b"],
            "analysis_start_a_sec": round(float(rough["analysis_start_a_sec"]), 3),
            "repaint_context_preview": round(float(repaint_context_audio.size / max(1, rough["target_sr"])), 3),
            "stitched_output": round(float(stitched_audio.size / max(1, rough["target_sr"])), 3),
        },
        "outputs": {
            "transition_path": transition_path,
            "stitched_path": stitched_path,
            "rough_stitched_path": rough_stitched_path,
            "hard_splice_path": hard_splice_path,
        },
    }
    LOGGER.info("Transition result details: %s", json.dumps(details, sort_keys=True))

    return TransitionResult(
        transition_path=transition_path,
        stitched_path=stitched_path,
        rough_stitched_path=rough_stitched_path,
        hard_splice_path=hard_splice_path,
        backend_used=backend_used,
        details=details,
    )


def _build_arg_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Deterministic DJ transition generation (Phase A/B).")
    parser.add_argument("--song-a", required=True, help="Path to Song A audio file.")
    parser.add_argument("--song-b", required=True, help="Path to Song B audio file.")
    parser.add_argument("--plugin", default="Smooth Blend", choices=list(PLUGIN_PRESETS.keys()), help="Transition style plugin preset.")
    parser.add_argument("--instruction", default="", help="Extra text instruction for generation.")
    parser.add_argument("--pre-sec", type=float, default=6.0, help="Seconds before seam from Song A.")
    parser.add_argument("--repaint-sec", type=float, default=4.0, help="Repaint seam width in seconds.")
    parser.add_argument("--post-sec", type=float, default=6.0, help="Seconds after seam from Song B.")
    parser.add_argument("--analysis-sec", type=float, default=45.0, help="Analysis window in seconds.")
    parser.add_argument("--bpm-target", type=float, default=None, help="Optional BPM override target for Song A.")
    parser.add_argument("--cue-a-sec", type=float, default=None, help="Optional Song A cue override.")
    parser.add_argument("--cue-b-sec", type=float, default=None, help="Optional Song B cue override.")
    parser.add_argument(
        "--transition-bars",
        type=int,
        default=8,
        choices=[4, 8, 16],
        help="Transition period length in bars around cue points.",
    )
    parser.add_argument("--creativity", type=float, default=7.0, help="ACE-Step guidance strength.")
    parser.add_argument("--inference-steps", type=int, default=8, help="ACE-Step inference steps.")
    parser.add_argument("--seed", type=int, default=42, help="Seed for reproducibility.")
    parser.add_argument("--output-dir", default="outputs", help="Directory for output artifacts.")
    parser.add_argument("--output-stem", default=None, help="Optional fixed output stem.")
    parser.add_argument("--target-sr", type=int, default=DEFAULT_TARGET_SR, help="Output sample rate.")
    parser.add_argument("--keep-debug-files", action="store_true", help="Keep temporary rough source audio files.")
    return parser


def main() -> None:
    logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(name)s | %(message)s")
    parser = _build_arg_parser()
    args = parser.parse_args()

    req = TransitionRequest(
        song_a_path=args.song_a,
        song_b_path=args.song_b,
        plugin_id=args.plugin,
        instruction_text=args.instruction,
        pre_context_sec=args.pre_sec,
        repaint_width_sec=args.repaint_sec,
        post_context_sec=args.post_sec,
        analysis_sec=args.analysis_sec,
        bpm_target=args.bpm_target,
        cue_a_sec=args.cue_a_sec,
        cue_b_sec=args.cue_b_sec,
        transition_bars=args.transition_bars,
        creativity_strength=args.creativity,
        inference_steps=args.inference_steps,
        seed=args.seed,
        output_dir=args.output_dir,
        output_stem=args.output_stem,
        target_sr=args.target_sr,
        keep_debug_files=args.keep_debug_files,
    )

    result = generate_transition_artifacts(req)
    print(json.dumps(result.to_dict(), indent=2))


if __name__ == "__main__":
    main()
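The argument-to-field wiring in `main` relies on argparse turning hyphenated flags into underscored attributes (`--repaint-sec` becomes `args.repaint_sec`). A minimal sketch mirroring a subset of the flags above — defaults copied from the parser, invocation values illustrative:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Subset of the transition CLI flags, for illustration only.
    p = argparse.ArgumentParser(description="DJ transition generation (sketch).")
    p.add_argument("--song-a", required=True)
    p.add_argument("--song-b", required=True)
    p.add_argument("--repaint-sec", type=float, default=4.0)
    p.add_argument("--transition-bars", type=int, default=8, choices=[4, 8, 16])
    p.add_argument("--seed", type=int, default=42)
    return p

args = build_parser().parse_args(
    ["--song-a", "a.mp3", "--song-b", "b.mp3", "--transition-bars", "16"]
)
# Hyphens become underscores: args.song_a, args.repaint_sec, args.transition_bars
```

Passing `choices=[4, 8, 16]` makes argparse reject unsupported bar counts at parse time, before any audio is decoded.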
requirements.txt
ADDED
@@ -0,0 +1,14 @@
gradio
spaces
torch
transformers
accelerate
librosa
soundfile
numpy
scipy
# Demucs enables stem-aware cue selection and transition refinement.
demucs

# Optional ACE-Step backend (heavy; keep optional so MusicGen path still works):
git+https://github.com/ACE-Step/ACE-Step-1.5.git