Yng314 committed
Commit 14984e4 · 1 Parent(s): 070f6dc

feat: implement audio transition generation pipeline with modules for transition generation, cue point selection, and audio utilities.
.gitignore ADDED
@@ -0,0 +1,15 @@
+ Initial_Research/demix
+ Initial_Research/spec
+ __pycache__
+ Utils/pretrained_models
+
+ # Large / copyrighted audio should not be committed to a public Space repo
+ Songs/
+ Test_songs/
+ Initial_Research/*.mp3
+ Initial_Research/*.wav
+ mixed_song.wav
+ final_mix.mp3
+ .acestep_runtime/
+ checkpoints/
+ outputs/
PROJECT_CATCHUP_NOTE.md ADDED
@@ -0,0 +1,229 @@
+ # AI DJ Project Catch-Up Note
+
+ Last updated: 2026-02-19
+
+ ## 1) Project Goal (Current Direction)
+
+ Build a **domain-specific AI DJ transition demo** for coursework Option 1 (Refinement):
+
+ - user uploads Song A and Song B
+ - system auto-detects cue points + BPM
+ - Song B is time-stretched to Song A's BPM
+ - a generative model creates transition audio from text ("transition vibe")
+ - output is a **short transition clip only** (not a full-song mix)
+
+ This scope is intentionally optimized for Hugging Face Spaces reliability.
+
+ ---
+
+ ## 2) Coursework Fit (Why this is Option 1)
+
+ This is a refinement of existing pipelines/models:
+
+ - existing generative pipeline (currently MusicGen, planned ACE-Step)
+ - wrapped in a domain-specific DJ UX (cue/BPM/mix controls)
+ - not raw prompting alone; structured controls for practical use
+
+ ---
+
+ ## 3) Current Implemented Pipeline (Already in `app.py`)
+
+ Current app file: `AI_DJ_Project/app.py`
+
+ ### 3.1 Input + UI
+
+ - Upload `Song A` and `Song B`
+ - Set:
+   - transition vibe text
+   - transition type (`riser`, `drum fill`, `sweep`, `brake`, `scratch`, `impact`)
+   - mode (`Overlay` or `Insert`)
+   - pre/mix/post seconds
+   - transition length + gain
+   - optional BPM and cue overrides
+
+ ### 3.2 Audio analysis and cueing
+
+ 1. Probe duration with `ffprobe` (if available)
+ 2. Decode only the needed segments (ffmpeg first, librosa fallback)
+ 3. Estimate BPM + beat times with `librosa.beat.beat_track`
+ 4. Auto-cue strategy:
+    - Song A: choose the beat nearest the end of the analysis window
+    - Song B: choose the first beat after ~2 seconds
+ 5. Optional manual override for BPM and cue points
+
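The auto-cue strategy above can be sketched with plain NumPy. The function names here are illustrative; the repo's `pipeline/audio_utils.py` exposes `choose_nearest_beat` and `choose_first_beat_after` with the same behavior:

```python
import numpy as np


def cue_for_song_a(beat_times: np.ndarray, window_end_sec: float) -> float:
    """Song A: pick the beat nearest the end of the analysis window."""
    if beat_times.size == 0:
        return float(window_end_sec)
    idx = int(np.argmin(np.abs(beat_times - float(window_end_sec))))
    return float(beat_times[idx])


def cue_for_song_b(beat_times: np.ndarray, min_sec: float = 2.0) -> float:
    """Song B: pick the first beat at or after ~2 seconds."""
    for bt in beat_times:
        if float(bt) >= min_sec:
            return float(bt)
    return float(beat_times[-1]) if beat_times.size else float(min_sec)


beats = np.array([0.5, 1.0, 1.5, 2.1, 2.6])
print(cue_for_song_a(beats, 2.5))  # 2.6 (nearest beat to the 2.5 s window end)
print(cue_for_song_b(beats))       # 2.1 (first beat at or after 2.0 s)
```

Both helpers fall back to the raw target time when no beats were detected, which keeps the pipeline running on tracks where beat tracking fails.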
+ ### 3.3 Tempo matching
+
+ - Compute stretch rate = `bpm_A / bpm_B` (clamped)
+ - Time-stretch the Song B segment via `librosa.effects.time_stretch`
+
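A minimal sketch of the clamped stretch-rate computation; the clamp bounds (0.7-1.4) are illustrative here, not the app's actual limits:

```python
def stretch_rate(bpm_a: float, bpm_b: float, lo: float = 0.7, hi: float = 1.4) -> float:
    """Rate passed to librosa.effects.time_stretch: > 1 speeds Song B up."""
    return max(lo, min(hi, bpm_a / bpm_b))


print(stretch_rate(128.0, 120.0))  # ~1.067: Song B sped up slightly to match A
print(stretch_rate(128.0, 60.0))   # 1.4: clamped to avoid extreme artifacts
```

Clamping matters because large rate changes produce audible time-stretch artifacts (see Section 9).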
+ ### 3.4 AI transition generation
+
+ - `@spaces.GPU` function `_generate_ai_transition(...)`
+ - Uses `facebook/musicgen-small`
+ - Prompt is domain-steered for DJ transition behavior
+ - Returns short generated transition audio
+
+ ### 3.5 Assembly
+
+ - **Overlay mode**: crossfade A/B + overlay AI transition
+ - **Insert mode**: A -> AI transition -> B (with short anti-click fades)
+ - Edge fades + peak normalization before output
+
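The Overlay-mode crossfade reduces to a linear fade over the overlapping window; this mirrors `crossfade_equal_length` in `pipeline/audio_utils.py`:

```python
import numpy as np


def crossfade(a_tail: np.ndarray, b_head: np.ndarray) -> np.ndarray:
    """Linear crossfade: Song A fades out while Song B fades in."""
    n = min(a_tail.size, b_head.size)
    fade_in = np.linspace(0.0, 1.0, n, dtype=np.float32)
    return (a_tail[:n] * (1.0 - fade_in) + b_head[:n] * fade_in).astype(np.float32)


mixed = crossfade(np.ones(5, dtype=np.float32), np.zeros(5, dtype=np.float32))
print(mixed)  # fades 1 -> 0 across the window: 1.0, 0.75, 0.5, 0.25, 0.0
```

The AI transition clip is then summed on top of this bed in Overlay mode, whereas Insert mode splices it between the two songs.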
+ ### 3.6 Output
+
+ - Output audio clip (NumPy audio to Gradio)
+ - JSON details:
+   - BPM estimates
+   - cue points
+   - stretch rate
+   - analysis settings
+
+ ---
+
+ ## 4) Full End-to-End Pipeline (Conceptual)
+
+ Upload A/B
+ -> decode limited windows
+ -> BPM + beat analysis
+ -> auto-cue points
+ -> stretch B to A BPM
+ -> generate transition (GenAI)
+ -> overlay/insert assembly
+ -> normalize/fades
+ -> return short transition clip + diagnostics
+
+ ---
+
+ ## 5) Planned Upgrade: ACE-Step + Custom LoRA
+
+ ### 5.1 What ACE-Step is
+
+ ACE-Step 1.5 is a **full music-generation foundation model stack** (text-to-audio/music with editing/control workflows), not just a tiny SFX model.
+
+ Planned usage in this project:
+
+ - keep the deterministic DJ logic (cue/BPM/stretch/assemble)
+ - swap the transition-generation backend from MusicGen to ACE-Step
+ - load custom LoRA adapter(s) to enforce DJ transition style
+
+ ### 5.2 Integration strategy (recommended)
+
+ 1. Keep the current `app.py` flow unchanged for analysis/mixing
+ 2. Introduce a backend abstraction:
+    - `MusicGenBackend` (fallback)
+    - `AceStepBackend` (main target)
+ 3. Add LoRA controls:
+    - adapter selection
+    - adapter scale
+ 4. Continue returning short transition clips only
+
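The backend abstraction could be sketched as below. The `generate()` signature and the stand-in bodies are illustrative assumptions, not the repo's actual API; only the class names `MusicGenBackend` and `AceStepBackend` come from the plan above:

```python
from abc import ABC, abstractmethod

import numpy as np


class TransitionBackend(ABC):
    """Common interface for transition generators (signature is illustrative)."""

    @abstractmethod
    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray: ...


class MusicGenBackend(TransitionBackend):
    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray:
        # Stand-in: the real backend would run facebook/musicgen-small here.
        return np.zeros(int(duration_sec * sr), dtype=np.float32)


class AceStepBackend(TransitionBackend):
    def __init__(self) -> None:
        # Stand-in: real init would load ACE-Step checkpoints, which can fail on Spaces.
        raise RuntimeError("ACE-Step runtime not available")

    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray:
        raise NotImplementedError


def make_backend() -> TransitionBackend:
    """Prefer ACE-Step; fail over to MusicGen (see Section 10)."""
    try:
        return AceStepBackend()
    except Exception:
        return MusicGenBackend()


clip = make_backend().generate("smooth house riser", 4.0, 44100)
print(clip.shape)  # (176400,) — 4 s of mono audio at 44.1 kHz
```

Because callers only see `TransitionBackend`, the assembly code in `app.py` stays unchanged when the backend is swapped.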
+ ---
+
+ ## 6) Genre-Specific LoRA Idea (Pop / Electronic / House / Dubstep / Techno)
+
+ ### Is this a good idea?
+
+ **Yes, as a staged plan.**
+
+ It is a strong product and coursework idea because:
+
+ - user-selected genre can map to a distinct transition style
+ - demonstrates clear domain-specific refinement
+ - supports explainable UX: "You picked House -> House-style transition LoRA"
+
+ ### Important caveats
+
+ - Training one LoRA per genre substantially increases data and compute requirements
+ - Early quality may vary by genre and dataset size
+ - More adapters mean more evaluation and QA burden
+
+ ### Practical rollout (recommended)
+
+ Phase 1 (safe):
+ - base model + one "general DJ transition" LoRA
+
+ Phase 2 (coursework-strong):
+ - 2-3 genre LoRAs (e.g., Pop / House / Dubstep)
+
+ Phase 3 (optional extension):
+ - larger genre library + auto-genre suggestion from uploaded songs
+
+ ---
+
+ ## 7) Proposed Genre LoRA Routing Logic
+
+ User selects the uploaded song's genre (or manually selects a transition style profile):
+
+ - Pop -> `lora_pop_transition`
+ - Electronic -> `lora_electronic_transition`
+ - House -> `lora_house_transition`
+ - Dubstep -> `lora_dubstep_transition`
+ - Techno -> `lora_techno_transition`
+ - Auto/Unknown -> `lora_general_transition`
+
+ Then:
+
+ 1. load the chosen LoRA
+ 2. set the LoRA scale
+ 3. run ACE-Step generation for the short transition duration
+ 4. mix with the A/B boundary clip
+
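The routing table above is a simple dictionary lookup with a general-adapter fallback; the adapter names are the ones listed in this section:

```python
from typing import Optional

GENRE_LORA_MAP = {
    "Pop": "lora_pop_transition",
    "Electronic": "lora_electronic_transition",
    "House": "lora_house_transition",
    "Dubstep": "lora_dubstep_transition",
    "Techno": "lora_techno_transition",
}


def route_lora(genre: Optional[str]) -> str:
    """Auto/Unknown (or any unlisted genre) falls back to the general adapter."""
    return GENRE_LORA_MAP.get(genre or "", "lora_general_transition")


print(route_lora("House"))  # lora_house_transition
print(route_lora(None))     # lora_general_transition
```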
+ ---
+
+ ## 8) Data and Training Notes for LoRA
+
+ - Use only licensed/royalty-free/self-owned audio for datasets and demos
+ - The dataset should emphasize transition-like content (risers, fills, drops, sweeps, impacts)
+ - Include metadata/captions describing genre + transition intent
+ - Keep track of:
+   - adapter name
+   - dataset source and license
+   - training config and epoch checkpoints
+
+ ---
+
+ ## 9) Current Risks / Constraints
+
+ - The ACE-Step stack is heavier than MusicGen and needs careful deployment tuning
+ - Cold starts and memory behavior can be challenging on Spaces
+ - Auto-cueing is heuristic and may fail on difficult tracks (the manual override should remain)
+ - Time-stretch can introduce artifacts (expected in DJ contexts)
+
+ ---
+
+ ## 10) Fallback and Reliability Plan
+
+ - Keep the MusicGen backend as a fallback while integrating ACE-Step
+ - If ACE-Step init fails:
+   - fail over to the MusicGen backend
+   - still return a valid transition clip
+ - Preserve the deterministic DSP path as a model-agnostic baseline
+
+ ---
+
+ ## 11) "If I lost track" Quick Resume Checklist
+
+ 1. Open `app.py` and confirm the current backend still works end-to-end
+ 2. Verify the demo still does:
+    - cue detect
+    - BPM match
+    - transition generation
+    - clip output
+ 3. Re-read sections 5/6/7 of this note
+ 4. Continue with the next implementation milestone:
+    - backend abstraction
+    - ACE-Step backend skeleton
+    - single LoRA integration
+    - then genre LoRA expansion
+
+ ---
+
+ ## 12) Next Concrete Milestones
+
+ M1: Refactor transition generation into a backend interface
+ M2: Implement `AceStepBackend` with base-model inference
+ M3: Add LoRA load/select/scale UI + runtime controls
+ M4: Train the first "general DJ transition" LoRA
+ M5: Train 2-3 genre LoRAs and add genre routing
+ M6: Compare outputs (base vs LoRA, genre A vs genre B) for coursework evidence
+
README copy.md ADDED
@@ -0,0 +1,100 @@
+ # AI_DJ_Project
+
+ ## Coursework-ready demo (HF Spaces + Gradio, Phase A/B)
+
+ This repo now includes a **Hugging Face Spaces** demo in `app.py`:
+
+ - Upload **Song A** and **Song B**.
+ - Pick a transition style plugin + text instruction.
+ - Build a rough seam (`A_tail + B_head`) with BPM-aware stretching.
+ - Run **ACE-Step repaint** on the seam window.
+ - Output two artifacts:
+   - transition-only clip
+   - stitched clip (`Song A up to cue + transition + Song B continuation`; the seam is replaced, not inserted)
+
+ ### Deterministic transition API (Phase A)
+
+ The core reusable pipeline lives in:
+ - `pipeline/audio_utils.py`
+ - `pipeline/transition_generator.py`
+
+ Run it from the command line:
+
+ ```shell
+ python -m pipeline.transition_generator \
+   --song-a /path/to/song_a.mp3 \
+   --song-b /path/to/song_b.mp3 \
+   --plugin "Smooth Blend" \
+   --instruction "smooth, rising energy, no vocals" \
+   --seed 42 \
+   --output-dir outputs
+ ```
+
+ This writes:
+ - `*_transition.wav`
+ - `*_stitched.wav`
+
+ ### Deploy to Hugging Face Spaces (ZeroGPU)
+
+ Create a new Space with:
+ - **SDK**: Gradio
+ - **Hardware**: ZeroGPU
+
+ Upload these files from this folder:
+ - `app.py`
+ - `requirements.txt`
+ - `packages.txt` (installs `ffmpeg` + `libsndfile1` for audio decoding at runtime)
+
+ Important: **Do not upload copyrighted songs** into the Space repo. The demo is designed for **user uploads**.
+
+ ### Repo hygiene
+
+ - The coursework spec notebook at the repo root is intentionally git-ignored:
+   `(0) 70113_Generative_AI_README_for_Coursework.ipynb`
+
+ ### ACE-Step backend (required)
+
+ This coursework pipeline uses ACE-Step as the generation method.
+
+ ```shell
+ pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git
+ ```
+
+ Then run with environment variables as needed:
+
+ ```shell
+ export AI_DJ_ACESTEP_MODEL_CONFIG=acestep-v15-turbo
+ # optional persistent root for checkpoints:
+ export AI_DJ_ACESTEP_PROJECT_ROOT=/data/acestep_runtime
+ ```
+
+ Notes:
+ - ACE-Step currently targets Python 3.11.
+ - The first ACE-Step run can take time due to checkpoint downloads.
+
+ ### Optional: Demucs stem-aware cue scoring
+
+ Cuepoint scoring can optionally run Demucs on the **analysis windows only** (A tail window + B head window), derive stem-aware mixability signals (`vocals`, `drums`, `bass`, accompaniment density), and penalize overlap risk (vocal-vocal and bass-bass clashes).
+
+ Transition generation can also use Demucs for:
+ - drum-led phase locking,
+ - one-bassline handoff shaping in `src_audio`,
+ - accompaniment-only `reference_audio`,
+ - post-repaint stem correction near transition boundaries.
+
+ Environment toggles:
+
+ ```shell
+ # disable Demucs analysis entirely
+ export AI_DJ_ENABLE_DEMUCS_ANALYSIS=0
+
+ # disable Demucs transition refinements entirely
+ export AI_DJ_ENABLE_DEMUCS_TRANSITION=0
+
+ # choose analysis device when enabled (default: cuda if available)
+ export AI_DJ_DEMUCS_DEVICE=cpu
+
+ # choose reference period type passed into ACE-Step reference_audio
+ # values: accompaniment-only (default) | full-period-a
+ export AI_DJ_REFERENCE_AUDIO_MODE=accompaniment-only
+ ```
app.py ADDED
@@ -0,0 +1,392 @@
+ import logging
+ import os
+ import subprocess
+ from pathlib import Path
+ from typing import Optional
+
+ import gradio as gr
+
+ from pipeline.transition_generator import (
+     PLUGIN_PRESETS,
+     TransitionRequest,
+     generate_transition_artifacts,
+ )
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+ LOGGER = logging.getLogger(__name__)
+
+ LORA_DROPDOWN_CHOICES = [
+     "None",
+     "Chinese New Year (official)",
+ ]
+ LORA_REPO_MAP = {
+     "Chinese New Year (official)": "ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA",
+ }
+
+ APP_CSS = """
+ .adv-item label,
+ .adv-item .gr-block-label,
+ .adv-item .gr-block-title {
+     white-space: nowrap !important;
+     overflow: hidden !important;
+     text-overflow: ellipsis !important;
+ }
+ """
+
+ APP_THEME = gr.themes.Soft(
+     primary_hue="blue",
+     neutral_hue="slate",
+     radius_size="lg",
+ ).set(
+     block_radius="*radius_xl",
+     input_radius="*radius_xl",
+     button_large_radius="*radius_xl",
+     button_medium_radius="*radius_xl",
+     button_small_radius="*radius_xl",
+ )
+
+
+ def _to_optional_float(value) -> Optional[float]:
+     if value is None:
+         return None
+     if isinstance(value, str) and not value.strip():
+         return None
+     try:
+         return float(value)
+     except Exception:
+         return None
+
+
+ def _normalize_upload_for_ui(path: Optional[str]) -> Optional[str]:
+     if not path:
+         return path
+     src = str(path)
+     if not os.path.isfile(src):
+         return path
+
+     out_dir = os.path.join("outputs", "normalized_uploads")
+     os.makedirs(out_dir, exist_ok=True)
+     stem = Path(src).stem
+     dst = os.path.join(out_dir, f"{stem}_ui_norm.wav")
+
+     cmd = [
+         "ffmpeg",
+         "-hide_banner",
+         "-loglevel",
+         "error",
+         "-nostdin",
+         "-y",
+         "-i",
+         src,
+         "-vn",
+         "-ac",
+         "2",
+         "-ar",
+         "44100",
+         "-c:a",
+         "pcm_s16le",
+         dst,
+     ]
+     try:
+         subprocess.run(cmd, check=True)
+         return dst
+     except Exception as exc:
+         LOGGER.warning("Upload normalization failed for %s (%s). Using original file.", src, exc)
+         return src
+
+
+ def _run_transition(
+     song_a,
+     song_b,
+     plugin_id,
+     instruction_text,
+     transition_bars,
+     pre_context_sec,
+     post_context_sec,
+     analysis_sec,
+     bpm_target,
+     creativity_strength,
+     inference_steps,
+     seed,
+     cue_a_sec,
+     cue_b_sec,
+     lora_choice,
+     lora_scale,
+     output_dir,
+ ):
+     if not song_a or not song_b:
+         raise gr.Error("Please upload both Song A and Song B.")
+
+     request = TransitionRequest(
+         song_a_path=song_a,
+         song_b_path=song_b,
+         plugin_id=plugin_id,
+         instruction_text=instruction_text or "",
+         transition_base_mode="B-base-fixed",
+         transition_bars=int(transition_bars),
+         pre_context_sec=float(pre_context_sec),
+         repaint_width_sec=4.0,
+         post_context_sec=float(post_context_sec),
+         analysis_sec=float(analysis_sec),
+         bpm_target=_to_optional_float(bpm_target),
+         cue_a_sec=_to_optional_float(cue_a_sec),
+         cue_b_sec=_to_optional_float(cue_b_sec),
+         creativity_strength=float(creativity_strength),
+         inference_steps=int(inference_steps),
+         seed=int(seed),
+         acestep_lora_path=LORA_REPO_MAP.get(str(lora_choice), ""),
+         acestep_lora_scale=float(lora_scale),
+         output_dir=(output_dir or "outputs").strip(),
+     )
+
+     try:
+         result = generate_transition_artifacts(request)
+     except Exception as exc:
+         raise gr.Error(str(exc))
+
+     return (
+         result.transition_path,
+         result.hard_splice_path,
+         result.rough_stitched_path,
+         result.stitched_path,
+     )
+
+
+ def build_ui() -> gr.Blocks:
+     with gr.Blocks(theme=APP_THEME, css=APP_CSS) as demo:
+         gr.Markdown(
+             """
+             <div style="text-align:center;">
+               <h1>AI DJ Transition Generator</h1>
+               <p>Upload two songs and generate a transition between them.</p>
+             </div>
+             """.strip()
+         )
+         with gr.Row():
+             gr.Markdown(
+                 """
+                 ### How to use
+                 1. Upload **Song A** (current track) and **Song B** (next track).
+                 2. Choose a **Transition style plugin**.
+                 3. Optionally add **Text instruction** (e.g., smooth, rising energy, no vocals).
+                 4. Select **Transition period length (bars)**.
+                 5. Click **Generate transition artifacts**.
+                 """.strip(),
+                 container=False,
+                 elem_classes=["plain-info"],
+             )
+             gr.Markdown(
+                 """
+                 ### Outputs
+                 - **Generated transition clip**: AI-generated repaint transition segment.
+                 - **Hard splice baseline (no transition)**: direct cut baseline.
+                 - **No-repaint rough stitch (baseline)**: stitched baseline without repaint.
+                 - **Final stitched clip**: final result with transition inserted.
+                 """.strip(),
+                 container=False,
+                 elem_classes=["plain-info"],
+             )
+
+         with gr.Row():
+             song_a = gr.Audio(
+                 label="Song A (mix out)",
+                 type="filepath",
+                 sources=["upload"],
+             )
+             song_b = gr.Audio(
+                 label="Song B (mix in)",
+                 type="filepath",
+                 sources=["upload"],
+             )
+         song_a.upload(
+             fn=_normalize_upload_for_ui,
+             inputs=song_a,
+             outputs=song_a,
+             queue=False,
+         )
+         song_b.upload(
+             fn=_normalize_upload_for_ui,
+             inputs=song_b,
+             outputs=song_b,
+             queue=False,
+         )
+
+         with gr.Row():
+             with gr.Column():
+                 plugin_id = gr.Dropdown(
+                     label="Transition style plugin",
+                     choices=list(PLUGIN_PRESETS.keys()),
+                     value="Smooth Blend",
+                 )
+             with gr.Column():
+                 lora_choice = gr.Dropdown(
+                     label="LoRA adapter",
+                     choices=LORA_DROPDOWN_CHOICES,
+                     value="None",
+                     info="Select an ACE-Step LoRA adapter to apply during repaint.",
+                 )
+                 lora_scale = gr.Slider(
+                     minimum=0.0,
+                     maximum=2.0,
+                     value=0.8,
+                     step=0.05,
+                     label="LoRA scale",
+                 )
+             with gr.Column():
+                 instruction_text = gr.Textbox(
+                     label="Text instruction",
+                     placeholder="e.g., smooth, rising energy, no vocals",
+                     lines=2,
+                 )
+
+         with gr.Accordion("Advanced controls", open=False):
+             with gr.Row():
+                 transition_bars = gr.Dropdown(
+                     label="Transition period length (bars)",
+                     choices=[4, 8, 16],
+                     value=8,
+                     info="Controls transition duration. Pipeline uses fixed B-base strategy with A as reference.",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 pre_context_sec = gr.Slider(
+                     minimum=1,
+                     maximum=12,
+                     value=6,
+                     step=0.5,
+                     label="Seconds before seam (Song A context)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 post_context_sec = gr.Slider(
+                     minimum=1,
+                     maximum=12,
+                     value=6,
+                     step=0.5,
+                     label="Seconds after seam (Song B context)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 analysis_sec = gr.Slider(
+                     minimum=10,
+                     maximum=90,
+                     value=45,
+                     step=5,
+                     label="Analysis window (seconds)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 bpm_target = gr.Number(
+                     label="Optional BPM target override",
+                     value=None,
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 creativity_strength = gr.Slider(
+                     minimum=1.0,
+                     maximum=12.0,
+                     value=7.0,
+                     step=0.5,
+                     label="Creativity strength (guidance)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 inference_steps = gr.Slider(
+                     minimum=1,
+                     maximum=64,
+                     value=8,
+                     step=1,
+                     label="ACE-Step inference steps",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 seed = gr.Number(
+                     label="Seed",
+                     value=42,
+                     precision=0,
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 cue_a_sec = gr.Textbox(
+                     label="Optional cue A override (sec)",
+                     value="",
+                     placeholder="Leave blank for auto cue selection",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 cue_b_sec = gr.Textbox(
+                     label="Optional cue B override (sec)",
+                     value="",
+                     placeholder="Leave blank for auto cue selection",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 output_dir = gr.Textbox(
+                     label="Output directory",
+                     value="outputs",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+         run_btn = gr.Button("Generate transition artifacts", variant="primary")
+
+         with gr.Row():
+             transition_audio = gr.Audio(
+                 label="Generated transition clip",
+                 type="filepath",
+             )
+             hard_splice_audio = gr.Audio(
+                 label="Hard splice baseline (no transition)",
+                 type="filepath",
+             )
+             rough_stitched_audio = gr.Audio(
+                 label="No-repaint rough stitch (baseline)",
+                 type="filepath",
+             )
+             stitched_audio = gr.Audio(
+                 label="Final stitched clip",
+                 type="filepath",
+             )
+
+         run_btn.click(
+             fn=_run_transition,
+             inputs=[
+                 song_a,
+                 song_b,
+                 plugin_id,
+                 instruction_text,
+                 transition_bars,
+                 pre_context_sec,
+                 post_context_sec,
+                 analysis_sec,
+                 bpm_target,
+                 creativity_strength,
+                 inference_steps,
+                 seed,
+                 cue_a_sec,
+                 cue_b_sec,
+                 lora_choice,
+                 lora_scale,
+                 output_dir,
+             ],
+             outputs=[transition_audio, hard_splice_audio, rough_stitched_audio, stitched_audio],
+         )
+
+     return demo
+
+
+ demo = build_ui()
+
+ if __name__ == "__main__":
+     demo.launch()
packages.txt ADDED
@@ -0,0 +1,2 @@
+ ffmpeg
+ libsndfile1
pipeline/__init__.py ADDED
@@ -0,0 +1,16 @@
+ """Pipeline package for deterministic transition generation."""
+
+ from .transition_generator import (
+     PLUGIN_PRESETS,
+     TransitionRequest,
+     TransitionResult,
+     generate_transition_artifacts,
+ )
+
+ __all__ = [
+     "PLUGIN_PRESETS",
+     "TransitionRequest",
+     "TransitionResult",
+     "generate_transition_artifacts",
+ ]
+
pipeline/audio_utils.py ADDED
@@ -0,0 +1,194 @@
+ import logging
+ import os
+ import shutil
+ import subprocess
+ import tempfile
+ from typing import Optional, Tuple
+
+ import librosa
+ import numpy as np
+ import soundfile as sf
+
+ LOGGER = logging.getLogger(__name__)
+
+
+ def clamp(value: float, low: float, high: float) -> float:
+     return float(max(low, min(high, value)))
+
+
+ def ensure_mono(y: np.ndarray) -> np.ndarray:
+     if y.ndim == 1:
+         return y
+     return np.mean(y, axis=1)
+
+
+ def ffprobe_duration_sec(path: str) -> Optional[float]:
+     if not shutil.which("ffprobe"):
+         return None
+
+     cmd = [
+         "ffprobe",
+         "-v",
+         "error",
+         "-show_entries",
+         "format=duration",
+         "-of",
+         "default=noprint_wrappers=1:nokey=1",
+         path,
+     ]
+     try:
+         out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True).strip()
+         return float(out)
+     except Exception:
+         return None
+
+
+ def decode_segment(path: str, start_sec: float, duration_sec: float, sr: int, max_decode_sec: float = 120.0) -> Tuple[np.ndarray, int]:
+     start_sec = max(0.0, float(start_sec))
+     duration_sec = max(0.0, float(duration_sec))
+     duration_sec = min(duration_sec, max_decode_sec)
+
+     if duration_sec <= 0:
+         return np.zeros((0,), dtype=np.float32), sr
+
+     if shutil.which("ffmpeg"):
+         tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         tmp_path = tmp.name
+         tmp.close()
+         try:
+             cmd = [
+                 "ffmpeg",
+                 "-hide_banner",
+                 "-loglevel",
+                 "error",
+                 "-nostdin",
+                 "-y",
+                 "-ss",
+                 str(start_sec),
+                 "-t",
+                 str(duration_sec),
+                 "-i",
+                 path,
+                 "-ac",
+                 "1",
+                 "-ar",
+                 str(sr),
+                 tmp_path,
+             ]
+             subprocess.run(cmd, check=True)
+             y, read_sr = sf.read(tmp_path, dtype="float32", always_2d=False)
+             y = ensure_mono(np.asarray(y))
+             return y.astype(np.float32), int(read_sr)
+         finally:
+             try:
+                 os.remove(tmp_path)
+             except Exception:
+                 pass
+
+     y, read_sr = librosa.load(path, sr=sr, mono=True, offset=start_sec, duration=duration_sec)
+     return y.astype(np.float32), int(read_sr)
+
+
+ def estimate_bpm_and_beats(y: np.ndarray, sr: int) -> Tuple[Optional[float], np.ndarray]:
+     if y.size < sr:
+         return None, np.array([], dtype=np.float32)
+
+     try:
+         tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
+         tempo_f = float(tempo[0]) if isinstance(tempo, (list, np.ndarray)) else float(tempo)
+         beat_times = librosa.frames_to_time(beat_frames, sr=sr).astype(np.float32)
+         if not (40.0 <= tempo_f <= 220.0):
+             tempo_f = None
+         return tempo_f, beat_times
+     except Exception:
+         return None, np.array([], dtype=np.float32)
+
+
+ def choose_nearest_beat(beat_times: np.ndarray, target_sec: float) -> float:
+     if beat_times.size == 0:
+         return float(target_sec)
+     idx = int(np.argmin(np.abs(beat_times - float(target_sec))))
+     return float(beat_times[idx])
+
+
+ def choose_first_beat_after(beat_times: np.ndarray, target_sec: float) -> float:
+     if beat_times.size == 0:
+         return float(target_sec)
+     for bt in beat_times:
+         if float(bt) >= float(target_sec):
+             return float(bt)
+     return float(beat_times[-1])
+
+
+ def linear_fade(n: int, fade_in: bool) -> np.ndarray:
+     if n <= 0:
+         return np.zeros((0,), dtype=np.float32)
+     if fade_in:
+         return np.linspace(0.0, 1.0, n, dtype=np.float32)
+     return np.linspace(1.0, 0.0, n, dtype=np.float32)
+
+
+ def normalize_peak(y: np.ndarray, peak: float = 0.98) -> np.ndarray:
+     if y.size == 0:
+         return y.astype(np.float32)
+     maximum = float(np.max(np.abs(y)))
+     if maximum <= 1e-9:
+         return y.astype(np.float32)
+     if maximum <= peak:
+         return y.astype(np.float32)
+     return (y * (peak / maximum)).astype(np.float32)
+
+
+ def apply_edge_fades(y: np.ndarray, sr: int, fade_ms: float = 30.0) -> np.ndarray:
+     n = y.size
+     fade_n = int(sr * (fade_ms / 1000.0))
+     fade_n = min(fade_n, n // 2)
+     if fade_n <= 0:
+         return y
+     y2 = y.copy()
+     y2[:fade_n] *= linear_fade(fade_n, fade_in=True)
+     y2[-fade_n:] *= linear_fade(fade_n, fade_in=False)
+     return y2
+
+
+ def ensure_length(y: np.ndarray, target_n: int) -> np.ndarray:
+     target_n = int(max(0, target_n))
+     if y.size < target_n:
+         return np.pad(y, (0, target_n - y.size), mode="constant")
+     return y[:target_n]
+
+
+ def safe_time_stretch(y: np.ndarray, rate: float) -> np.ndarray:
+     rate = float(rate)
+     if y.size == 0:
+         return y.astype(np.float32)
+     if abs(rate - 1.0) < 1e-6:
+         return y.astype(np.float32)
+     try:
+         return librosa.effects.time_stretch(y, rate=rate).astype(np.float32)
+     except Exception as exc:
+         LOGGER.warning("Time-stretch failed (%s); using original audio.", exc)
+         return y.astype(np.float32)
+
+
+ def resample_if_needed(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
+     if int(orig_sr) == int(target_sr):
+         return y.astype(np.float32)
+     return librosa.resample(y, orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)
+
+
+ def crossfade_equal_length(a: np.ndarray, b: np.ndarray) -> np.ndarray:
+     n = min(a.size, b.size)
+     if n <= 0:
+         return np.zeros((0,), dtype=np.float32)
+     a = a[:n]
+     b = b[:n]
+     fade_in = linear_fade(n, fade_in=True)
+     fade_out = 1.0 - fade_in
+     return (a * fade_out + b * fade_in).astype(np.float32)
+
+
+ def write_wav(path: str, y: np.ndarray, sr: int) -> None:
+     # Guard against a bare filename, where dirname(path) is "" and makedirs would fail.
+     os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+     sf.write(path, y.astype(np.float32), int(sr))
+
pipeline/cuepoint_selector.py ADDED
@@ -0,0 +1,1656 @@
import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

import librosa  # type: ignore[reportMissingImports]
import numpy as np

from .audio_utils import choose_first_beat_after, choose_nearest_beat, decode_segment, ensure_length

LOGGER = logging.getLogger(__name__)

_ANALYSIS_HOP = 512
_STRUCT_SR = 22050
_DEMUCS_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_ANALYSIS", "1").strip().lower() not in {
    "0",
    "false",
    "no",
    "off",
}
_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
_DEMUCS_SEGMENT_SEC = 7.0
_DEMUCS_MIN_WINDOW_SEC = 6.0

_PROFILE_CACHE: Dict[Tuple[str, int], Optional["_TrackProfiles"]] = {}
_LIBROSA_STRUCT_CACHE: Dict[str, Optional[Dict[str, np.ndarray]]] = {}
_DEMUCS_MODEL: Any = None
_DEMUCS_TORCH: Any = None
_DEMUCS_DEVICE = "cpu"
_DEMUCS_LOAD_ATTEMPTED = False
_DEMUCS_LOAD_ERROR: Optional[str] = None


@dataclass
class CueSelectionResult:
    cue_a_sec: float
    cue_b_sec: float
    method: str
    debug: Dict[str, object]


@dataclass
class _CueCandidate:
    time_sec: float
    beat_idx: int
    phrase: float
    energy: float
    onset: float
    chroma: np.ndarray
    vocal_ratio: float
    vocal_onset: float
    vocal_phrase_score: float
    drum_anchor: float
    bass_energy: float
    bass_stability: float
    instrumental_density: float
    density_score: float
    period_vocal_ratio: float
    period_vocal_phrase_score: float
    period_drum_anchor: float
    period_bass_energy: float
    period_bass_stability: float
    period_density_score: float
    period_coverage: float
    period_vocal_curve: np.ndarray
    period_bass_curve: np.ndarray


@dataclass
class _TrackProfiles:
    rms: np.ndarray
    rms_times: np.ndarray
    onset: np.ndarray
    onset_times: np.ndarray
    chroma: np.ndarray
    chroma_times: np.ndarray


@dataclass
class _VocalActivityProfile:
    vocal_ratio: np.ndarray
    vocal_onset: np.ndarray
    drum_onset: np.ndarray
    bass_rms: np.ndarray
    instrumental_rms: np.ndarray
    times: np.ndarray
    method: str
    has_drums: bool
    has_bass: bool


@dataclass
class _StructuredCandidate:
    cue: _CueCandidate
    label: str
    label_score: float
    edge_score: float
    position_score: float


def _clamp(value: float, low: float, high: float) -> float:
    return float(max(low, min(high, value)))


def _mean_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
    if values.size == 0 or times.size == 0:
        return 0.0
    lo = float(min(start, end))
    hi = float(max(start, end))
    mask = (times >= lo) & (times <= hi)
    if np.any(mask):
        return float(np.mean(values[mask]))
    idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
    return float(values[idx])


def _std_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
    if values.size == 0 or times.size == 0:
        return 0.0
    lo = float(min(start, end))
    hi = float(max(start, end))
    mask = (times >= lo) & (times <= hi)
    if np.any(mask):
        return float(np.std(values[mask]))
    # No samples fall inside the window; a single nearest sample has zero spread.
    return 0.0
128
+
129
+
130
+ def _smooth_1d(values: np.ndarray, kernel_size: int) -> np.ndarray:
131
+ arr = np.asarray(values, dtype=np.float32).reshape(-1)
132
+ if arr.size == 0:
133
+ return np.zeros((1,), dtype=np.float32)
134
+ k = int(max(1, kernel_size))
135
+ if k == 1 or arr.size < k:
136
+ return arr.astype(np.float32)
137
+ kernel = np.ones((k,), dtype=np.float32) / float(k)
138
+ return np.convolve(arr, kernel, mode="same").astype(np.float32)
139
+
140
+
141
+ def _normalize_1d(values: np.ndarray) -> np.ndarray:
142
+ arr = np.asarray(values, dtype=np.float32).reshape(-1)
143
+ if arr.size == 0:
144
+ return np.zeros((1,), dtype=np.float32)
145
+ lo = float(np.percentile(arr, 5))
146
+ hi = float(np.percentile(arr, 95))
147
+ if hi - lo > 1e-6:
148
+ out = (arr - lo) / (hi - lo)
149
+ return np.clip(out, 0.0, 1.0).astype(np.float32)
150
+ mx = float(np.max(np.abs(arr)))
151
+ if mx > 1e-6:
152
+ out = arr / mx
153
+ return np.clip(out, 0.0, 1.0).astype(np.float32)
154
+ return np.zeros_like(arr, dtype=np.float32)
155
+
156
+
157
+ def _align_series_min_length(series: List[np.ndarray]) -> List[np.ndarray]:
158
+ clean = [np.asarray(x, dtype=np.float32).reshape(-1) for x in series]
159
+ if not clean:
160
+ return []
161
+ min_len = min((x.size for x in clean if x.size > 0), default=0)
162
+ if min_len <= 0:
163
+ return [np.zeros((1,), dtype=np.float32) for _ in clean]
164
+ return [x[:min_len].astype(np.float32) if x.size >= min_len else np.pad(x, (0, min_len - x.size)).astype(np.float32) for x in clean]
165
+
166
+
167
+ def _mean_2d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> np.ndarray:
168
+ if values.ndim != 2 or values.shape[1] == 0 or times.size == 0:
169
+ return np.zeros((12,), dtype=np.float32)
170
+ lo = float(min(start, end))
171
+ hi = float(max(start, end))
172
+ mask = (times >= lo) & (times <= hi)
173
+ if np.any(mask):
174
+ vec = np.mean(values[:, mask], axis=1).astype(np.float32)
175
+ else:
176
+ idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
177
+ vec = values[:, idx].astype(np.float32)
178
+ norm = float(np.linalg.norm(vec))
179
+ if norm > 1e-9:
180
+ vec = vec / norm
181
+ return vec
182
+
183
+
184
+ def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
185
+ if a.size == 0 or b.size == 0:
186
+ return 0.0
187
+ denom = float(np.linalg.norm(a) * np.linalg.norm(b))
188
+ if denom <= 1e-9:
189
+ return 0.0
190
+ return float(np.dot(a, b) / denom)
191
+
192
+
193
+ def _phrase_score(beat_idx: int) -> float:
194
+ if beat_idx < 0:
195
+ return 0.5
196
+ mod4 = beat_idx % 4
197
+ mod8 = beat_idx % 8
198
+ dist4 = min(mod4, 4 - mod4)
199
+ dist8 = min(mod8, 8 - mod8)
200
+ score4 = 1.0 - (dist4 / 2.0)
201
+ score8 = 1.0 - (dist8 / 4.0)
202
+ return _clamp((0.65 * score4) + (0.35 * score8), 0.0, 1.0)
203
+
204
+
205
+ def _target_position_score(x: float, target: float, spread: float) -> float:
206
+ spread = max(1e-3, float(spread))
207
+ return float(np.exp(-abs(float(x) - float(target)) / spread))
208
+
209
+
210
+ def _edge_score(x: float, duration_sec: float) -> float:
211
+ if duration_sec <= 1e-6:
212
+ return 0.0
213
+ ratio = float(x / duration_sec)
214
+ return _clamp(min(ratio / 0.16, (1.0 - ratio) / 0.16), 0.0, 1.0)
215
+
216
+
217
+ def _resolve_demucs_device(torch_mod: Any) -> str:
218
+ pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
219
+ if pref in {"cpu"}:
220
+ return "cpu"
221
+ if pref in {"cuda", "gpu"}:
222
+ return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
223
+ return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
224
+
225
+
226
+ def _get_demucs_model() -> Tuple[Optional[Any], Optional[Any], str, Optional[str]]:
227
+ global _DEMUCS_MODEL, _DEMUCS_TORCH, _DEMUCS_DEVICE, _DEMUCS_LOAD_ATTEMPTED, _DEMUCS_LOAD_ERROR
228
+
229
+ if not _DEMUCS_ENABLED:
230
+ return None, None, "disabled", "AI_DJ_ENABLE_DEMUCS_ANALYSIS=0"
231
+
232
+ if _DEMUCS_LOAD_ATTEMPTED:
233
+ if _DEMUCS_MODEL is None:
234
+ return None, _DEMUCS_TORCH, "unavailable", _DEMUCS_LOAD_ERROR
235
+ return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
236
+
237
+ _DEMUCS_LOAD_ATTEMPTED = True
238
+ try:
239
+ import torch # type: ignore[reportMissingImports]
240
+ from demucs.pretrained import get_model # type: ignore[reportMissingImports]
241
+
242
+ model = get_model(_DEMUCS_MODEL_NAME)
243
+ model.eval()
244
+ _DEMUCS_DEVICE = _resolve_demucs_device(torch)
245
+ model.to(_DEMUCS_DEVICE)
246
+ _DEMUCS_MODEL = model
247
+ _DEMUCS_TORCH = torch
248
+ _DEMUCS_LOAD_ERROR = None
249
+ return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
250
+ except Exception as exc:
251
+ _DEMUCS_MODEL = None
252
+ _DEMUCS_TORCH = None
253
+ _DEMUCS_LOAD_ERROR = str(exc)
254
+ LOGGER.warning(
255
+ "Demucs vocal analysis unavailable (%s). Cue selection continues without vocal penalty.",
256
+ exc,
257
+ )
258
+ return None, None, "unavailable", _DEMUCS_LOAD_ERROR
259
+
260
+
261
+ def _vocal_score_from_ratio(vocal_ratio: float) -> float:
262
+ ratio = _clamp(float(vocal_ratio), 0.0, 1.0)
263
+ # Penalize clearly vocal-dominant bars while leaving mixed bars mostly neutral.
264
+ return 1.0 - _clamp((ratio - 0.32) / 0.5, 0.0, 1.0)
265
+
266
+
267
+ def _lookup_stem_mixability(profile: Optional[_VocalActivityProfile], time_sec: float) -> Dict[str, float]:
268
+ neutral = {
269
+ "vocal_ratio": 0.0,
270
+ "vocal_onset": 0.0,
271
+ "vocal_phrase_score": 0.5,
272
+ "drum_anchor": 0.5,
273
+ "bass_energy": 0.5,
274
+ "bass_stability": 0.5,
275
+ "instrumental_density": 0.5,
276
+ "density_score": 0.5,
277
+ }
278
+ if profile is None or profile.times.size == 0:
279
+ return neutral
280
+ t_min = float(np.min(profile.times))
281
+ t_max = float(np.max(profile.times))
282
+ if float(time_sec) < (t_min - 0.6) or float(time_sec) > (t_max + 0.6):
283
+ return neutral
284
+
285
+ ratio = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.2, time_sec + 1.2), 0.0, 1.0)
286
+ vocal_onset = _clamp(_mean_1d(profile.vocal_onset, profile.times, time_sec - 0.2, time_sec + 0.3), 0.0, 1.0)
287
+ vocal_before = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.8, time_sec - 0.25), 0.0, 1.0)
288
+ vocal_after = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec + 0.25, time_sec + 1.8), 0.0, 1.0)
289
+ ending_score = _clamp((vocal_before - vocal_after + 0.05) / 0.35, 0.0, 1.0)
290
+ low_vocal_score = _vocal_score_from_ratio(ratio)
291
+ onset_quiet_score = 1.0 - vocal_onset
292
+ vocal_phrase_score = _clamp(
293
+ (0.52 * low_vocal_score) + (0.30 * ending_score) + (0.18 * onset_quiet_score),
294
+ 0.0,
295
+ 1.0,
296
+ )
297
+
298
+ if profile.has_drums:
299
+ drum_hit = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 0.1, time_sec + 0.24), 0.0, 1.0)
300
+ drum_bg = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 1.0, time_sec + 1.0), 0.0, 1.0)
301
+ drum_anchor = _clamp((0.72 * drum_hit) + (0.28 * _clamp(drum_hit - drum_bg + 0.22, 0.0, 1.0)), 0.0, 1.0)
302
+ else:
303
+ drum_anchor = 0.5
304
+
305
+ if profile.has_bass:
306
+ bass_energy = _clamp(_mean_1d(profile.bass_rms, profile.times, time_sec - 1.4, time_sec + 1.4), 0.0, 1.0)
307
+ bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, time_sec - 1.8, time_sec + 1.8), 0.0, 1.0)
308
+ bass_cv = bass_std / max(1e-4, bass_energy + 0.08)
309
+ bass_stability = 1.0 - _clamp((bass_cv - 0.18) / 0.85, 0.0, 1.0)
310
+ else:
311
+ bass_energy = 0.5
312
+ bass_stability = 0.5
313
+
314
+ instrumental_density = _clamp(
315
+ _mean_1d(profile.instrumental_rms, profile.times, time_sec - 1.4, time_sec + 1.4),
316
+ 0.0,
317
+ 1.0,
318
+ )
319
+ density_score = _target_position_score(instrumental_density, target=0.56, spread=0.24)
320
+
321
+ return {
322
+ "vocal_ratio": float(ratio),
323
+ "vocal_onset": float(vocal_onset),
324
+ "vocal_phrase_score": float(vocal_phrase_score),
325
+ "drum_anchor": float(drum_anchor),
326
+ "bass_energy": float(bass_energy),
327
+ "bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
328
+ "instrumental_density": float(instrumental_density),
329
+ "density_score": float(_clamp(density_score, 0.0, 1.0)),
330
+ }
331
+
332
+
333
+ def _range_coverage_ratio(times: np.ndarray, start: float, end: float) -> float:
334
+ if times.size == 0:
335
+ return 0.0
336
+ lo = float(min(start, end))
337
+ hi = float(max(start, end))
338
+ if hi - lo <= 1e-6:
339
+ return 0.0
340
+ t_min = float(np.min(times))
341
+ t_max = float(np.max(times))
342
+ overlap = max(0.0, min(hi, t_max) - max(lo, t_min))
343
+ return _clamp(overlap / max(1e-6, (hi - lo)), 0.0, 1.0)
344
+
345
+
346
+ def _sample_curve(values: np.ndarray, times: np.ndarray, start: float, end: float, samples: int = 16) -> np.ndarray:
347
+ n = int(max(4, samples))
348
+ if values.size == 0 or times.size == 0:
349
+ return np.zeros((n,), dtype=np.float32)
350
+ lo = float(min(start, end))
351
+ hi = float(max(start, end))
352
+ if hi - lo <= 1e-6:
353
+ base = float(_mean_1d(values, times, lo - 0.25, hi + 0.25))
354
+ return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
355
+ ts = np.linspace(lo, hi, n, dtype=np.float32)
356
+ if times.size < 2:
357
+ base = float(_mean_1d(values, times, lo, hi))
358
+ return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
359
+ curve = np.interp(
360
+ ts.astype(np.float64),
361
+ times.astype(np.float64),
362
+ values.astype(np.float64),
363
+ left=float(values[0]),
364
+ right=float(values[-1]),
365
+ ).astype(np.float32)
366
+ return np.clip(curve, 0.0, 1.0).astype(np.float32)
367
+
368
+
369
+ def _lookup_period_mixability(
370
+ profile: Optional[_VocalActivityProfile],
371
+ start_sec: float,
372
+ end_sec: float,
373
+ incoming: bool,
374
+ ) -> Dict[str, Any]:
375
+ neutral_curve = np.full((16,), 0.5, dtype=np.float32)
376
+ neutral = {
377
+ "coverage": 0.0,
378
+ "period_vocal_ratio": 0.0,
379
+ "period_vocal_phrase_score": 0.5,
380
+ "period_drum_anchor": 0.5,
381
+ "period_bass_energy": 0.5,
382
+ "period_bass_stability": 0.5,
383
+ "period_density_score": 0.5,
384
+ "period_vocal_curve": neutral_curve.copy(),
385
+ "period_bass_curve": neutral_curve.copy(),
386
+ }
387
+ if profile is None or profile.times.size == 0:
388
+ return neutral
389
+
390
+ lo = float(min(start_sec, end_sec))
391
+ hi = float(max(start_sec, end_sec))
392
+ span = max(1e-4, hi - lo)
393
+ coverage = _range_coverage_ratio(profile.times, lo, hi)
394
+ if coverage <= 0.03:
395
+ return neutral
396
+
397
+ ratio_mean = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, hi), 0.0, 1.0)
398
+ vocal_curve = _sample_curve(profile.vocal_ratio, profile.times, lo, hi, samples=16)
399
+ bass_curve = _sample_curve(profile.bass_rms, profile.times, lo, hi, samples=16)
400
+ first_cut = lo + (0.35 * span)
401
+ last_cut = hi - (0.35 * span)
402
+ vocal_start = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, first_cut), 0.0, 1.0)
403
+ vocal_end = _clamp(_mean_1d(profile.vocal_ratio, profile.times, last_cut, hi), 0.0, 1.0)
404
+ boundary_t = lo if incoming else hi
405
+ vocal_onset_boundary = _clamp(
406
+ _mean_1d(profile.vocal_onset, profile.times, boundary_t - 0.16, boundary_t + 0.26),
407
+ 0.0,
408
+ 1.0,
409
+ )
410
+ low_vocal_score = _vocal_score_from_ratio(ratio_mean)
411
+ onset_quiet = 1.0 - vocal_onset_boundary
412
+ if incoming:
413
+ start_quiet = _clamp(1.0 - ((vocal_start - 0.22) / 0.58), 0.0, 1.0)
414
+ rise_ok = _clamp((vocal_end - vocal_start + 0.08) / 0.38, 0.0, 1.0)
415
+ trend_score = _clamp((0.72 * start_quiet) + (0.28 * rise_ok), 0.0, 1.0)
416
+ else:
417
+ ending = _clamp((vocal_start - vocal_end + 0.05) / 0.35, 0.0, 1.0)
418
+ trend_score = ending
419
+ vocal_phrase = _clamp((0.50 * low_vocal_score) + (0.30 * trend_score) + (0.20 * onset_quiet), 0.0, 1.0)
420
+
421
+ if profile.has_drums:
422
+ drum_mean = _clamp(_mean_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
423
+ drum_std = _clamp(_std_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
424
+ drum_boundary = _clamp(
425
+ _mean_1d(profile.drum_onset, profile.times, boundary_t - 0.12, boundary_t + 0.20),
426
+ 0.0,
427
+ 1.0,
428
+ )
429
+ drum_anchor = _clamp(
430
+ (0.45 * drum_boundary)
431
+ + (0.35 * drum_mean)
432
+ + (0.20 * (1.0 - _clamp(drum_std / 0.35, 0.0, 1.0))),
433
+ 0.0,
434
+ 1.0,
435
+ )
436
+ else:
437
+ drum_anchor = 0.5
438
+
439
+ if profile.has_bass:
440
+ bass_mean = _clamp(_mean_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
441
+ bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
442
+ bass_cv = bass_std / max(0.08, bass_mean)
443
+ bass_stability = 1.0 - _clamp((bass_cv - 0.20) / 0.95, 0.0, 1.0)
444
+ else:
445
+ bass_mean = 0.5
446
+ bass_stability = 0.5
447
+
448
+ density_mean = _clamp(_mean_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
449
+ density_std = _clamp(_std_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
450
+ density_target = _target_position_score(density_mean, target=0.56, spread=0.22)
451
+ density_stability = 1.0 - _clamp(density_std / 0.32, 0.0, 1.0)
452
+ density_score = _clamp((0.75 * density_target) + (0.25 * density_stability), 0.0, 1.0)
453
+
454
+ return {
455
+ "coverage": float(coverage),
456
+ "period_vocal_ratio": float(ratio_mean),
457
+ "period_vocal_phrase_score": float(vocal_phrase),
458
+ "period_drum_anchor": float(drum_anchor),
459
+ "period_bass_energy": float(bass_mean),
460
+ "period_bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
461
+ "period_density_score": float(density_score),
462
+ "period_vocal_curve": vocal_curve.astype(np.float32),
463
+ "period_bass_curve": bass_curve.astype(np.float32),
464
+ }
465
+
466
+
467
+ def _period_overlap_clash(cand_a: _CueCandidate, cand_b: _CueCandidate) -> Tuple[float, float, float]:
468
+ n = int(
469
+ max(
470
+ 4,
471
+ min(
472
+ int(cand_a.period_vocal_curve.size),
473
+ int(cand_b.period_vocal_curve.size),
474
+ int(cand_a.period_bass_curve.size),
475
+ int(cand_b.period_bass_curve.size),
476
+ ),
477
+ )
478
+ )
479
+ if n <= 0:
480
+ vocal = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
481
+ bass = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
482
+ cov = 0.5 * (cand_a.period_coverage + cand_b.period_coverage)
483
+ return vocal, bass, cov
484
+
485
+ a_v = ensure_length(cand_a.period_vocal_curve.astype(np.float32), n)
486
+ b_v = ensure_length(cand_b.period_vocal_curve.astype(np.float32), n)
487
+ a_b = ensure_length(cand_a.period_bass_curve.astype(np.float32), n)
488
+ b_b = ensure_length(cand_b.period_bass_curve.astype(np.float32), n)
489
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
490
+
491
+ w_a_v = 1.0 - x
492
+ w_b_v = x
493
+ vocal_risk = float(np.mean((a_v * w_a_v) * (b_v * w_b_v)))
494
+ vocal_risk = _clamp(vocal_risk * 4.0, 0.0, 1.0)
495
+
496
+ w_b_b = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
497
+ w_a_b = 1.0 - w_b_b
498
+ center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
499
+ bass_risk = float(np.mean((a_b * w_a_b * center_bass_shape) * (b_b * w_b_b * center_bass_shape)))
500
+ bass_risk = _clamp(bass_risk * 6.0, 0.0, 1.0)
501
+
502
+ coverage = _clamp(0.5 * (cand_a.period_coverage + cand_b.period_coverage), 0.0, 1.0)
503
+ if coverage < 0.35:
504
+ fallback_v = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
505
+ fallback_b = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
506
+ alpha = _clamp((0.35 - coverage) / 0.35, 0.0, 1.0)
507
+ vocal_risk = (1.0 - alpha) * vocal_risk + (alpha * fallback_v)
508
+ bass_risk = (1.0 - alpha) * bass_risk + (alpha * fallback_b)
509
+
510
+ return float(vocal_risk), float(bass_risk), float(coverage)
511
+
512
+
513
+ def _extract_vocal_profile_demucs(
514
+ y: np.ndarray,
515
+ sr: int,
516
+ window_start_sec: float,
517
+ track_label: str,
518
+ ) -> Tuple[Optional[_VocalActivityProfile], Dict[str, object]]:
519
+ global _DEMUCS_DEVICE
520
+
521
+ info: Dict[str, object] = {
522
+ "track": track_label,
523
+ "enabled": bool(_DEMUCS_ENABLED),
524
+ "model": _DEMUCS_MODEL_NAME,
525
+ }
526
+ if y.size < int(max(1, sr) * _DEMUCS_MIN_WINDOW_SEC):
527
+ info["status"] = "skipped-short-window"
528
+ return None, info
529
+
530
+ model, torch_mod, status, reason = _get_demucs_model()
531
+ info["status"] = status
532
+ if reason:
533
+ info["reason"] = reason
534
+ if model is None or torch_mod is None:
535
+ return None, info
536
+
537
+ try:
538
+ from demucs.apply import apply_model # type: ignore[reportMissingImports]
539
+
540
+ mono = np.asarray(y, dtype=np.float32).reshape(-1)
541
+ if mono.size == 0:
542
+ info["status"] = "empty-window"
543
+ return None, info
544
+ peak = float(np.max(np.abs(mono)))
545
+ if peak > 1e-9:
546
+ mono = mono / peak
547
+
548
+ demucs_sr = int(getattr(model, "samplerate", 44100))
549
+ if int(sr) != demucs_sr:
550
+ mono = librosa.resample(mono, orig_sr=int(sr), target_sr=demucs_sr).astype(np.float32)
551
+ if mono.size < int(demucs_sr * _DEMUCS_MIN_WINDOW_SEC):
552
+ info["status"] = "skipped-short-window"
553
+ return None, info
554
+
555
+ stereo = np.stack([mono, mono], axis=0)
556
+ mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(_DEMUCS_DEVICE)
557
+ audio_sec = float(mono.size / max(1, demucs_sr))
558
+ segment_limit = float(_DEMUCS_SEGMENT_SEC)
559
+ if audio_sec <= (segment_limit + 0.02):
560
+ use_split = False
561
+ segment_sec = None
562
+ else:
563
+ use_split = True
564
+ segment_sec = segment_limit
565
+
566
+ try:
567
+ with torch_mod.no_grad():
568
+ estimates = apply_model(
569
+ model,
570
+ mix,
571
+ shifts=1,
572
+ split=use_split,
573
+ overlap=0.25,
574
+ progress=False,
575
+ device=_DEMUCS_DEVICE,
576
+ segment=segment_sec,
577
+ )
578
+ except Exception as exc:
579
+ if _DEMUCS_DEVICE == "cuda":
580
+ model.to("cpu")
581
+ _DEMUCS_DEVICE = "cpu"
582
+ mix = mix.to("cpu")
583
+ with torch_mod.no_grad():
584
+ estimates = apply_model(
585
+ model,
586
+ mix,
587
+ shifts=1,
588
+ split=use_split,
589
+ overlap=0.25,
590
+ progress=False,
591
+ device="cpu",
592
+ segment=segment_sec,
593
+ )
594
+ info["device_fallback"] = f"cuda->cpu ({exc})"
595
+ else:
596
+ raise
597
+
598
+ estimates = estimates.detach().cpu()
599
+ est = estimates[0] if estimates.ndim == 4 else estimates
600
+ if est.ndim != 3:
601
+ raise RuntimeError(f"Unexpected demucs output ndim: {est.ndim}")
602
+
603
+ source_names = [str(s) for s in getattr(model, "sources", [])]
604
+ if not source_names:
605
+ raise RuntimeError("Demucs model returned no source labels.")
606
+ if est.shape[0] != len(source_names):
607
+ if est.shape[1] == len(source_names):
608
+ est = est.permute(1, 0, 2)
609
+ else:
610
+ raise RuntimeError(
611
+ f"Demucs output/source mismatch ({tuple(est.shape)} vs {len(source_names)} sources)."
612
+ )
613
+ if "vocals" not in source_names:
614
+ raise RuntimeError("Demucs model does not expose a 'vocals' stem.")
615
+
616
+ vocal_idx = source_names.index("vocals")
617
+ vocals = est[vocal_idx]
618
+ has_drums = "drums" in source_names
619
+ has_bass = "bass" in source_names
620
+ drums = est[source_names.index("drums")] if has_drums else torch_mod.zeros_like(vocals)
621
+ bass = est[source_names.index("bass")] if has_bass else torch_mod.zeros_like(vocals)
622
+ non_vocal_idxs = [i for i in range(len(source_names)) if i != vocal_idx]
623
+ if non_vocal_idxs:
624
+ accompaniment = est[non_vocal_idxs].sum(dim=0)
625
+ else:
626
+ accompaniment = torch_mod.zeros_like(vocals)
627
+
628
+ vocals_mono = vocals.mean(dim=0).numpy().astype(np.float32)
629
+ drums_mono = drums.mean(dim=0).numpy().astype(np.float32)
630
+ bass_mono = bass.mean(dim=0).numpy().astype(np.float32)
631
+ accompaniment_mono = accompaniment.mean(dim=0).numpy().astype(np.float32)
632
+
633
+ vocal_rms = librosa.feature.rms(y=vocals_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
634
+ acc_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
635
+ bass_rms = librosa.feature.rms(y=bass_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
636
+ inst_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
637
+ vocal_onset = librosa.onset.onset_strength(y=vocals_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
638
+ drum_onset = librosa.onset.onset_strength(y=drums_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
639
+
640
+ ratio_raw = vocal_rms / np.maximum(vocal_rms + acc_rms, 1e-6)
641
+ ratio_raw = np.clip(ratio_raw, 0.0, 1.0).astype(np.float32)
642
+ aligned = _align_series_min_length([ratio_raw, vocal_onset, drum_onset, bass_rms, inst_rms])
643
+ if not aligned:
644
+ raise RuntimeError("Demucs profile alignment failed.")
645
+ ratio, vocal_onset_n, drum_onset_n, bass_rms_n, inst_rms_n = aligned
646
+
647
+ ratio = _smooth_1d(ratio, kernel_size=5)
648
+ vocal_onset_n = _normalize_1d(_smooth_1d(vocal_onset_n, kernel_size=3))
649
+ drum_onset_n = _normalize_1d(_smooth_1d(drum_onset_n, kernel_size=3))
650
+ bass_rms_n = _normalize_1d(_smooth_1d(bass_rms_n, kernel_size=5))
651
+ inst_rms_n = _normalize_1d(_smooth_1d(inst_rms_n, kernel_size=5))
652
+
653
+ times = librosa.frames_to_time(np.arange(ratio.size), sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
654
+ times = times + float(window_start_sec)
655
+
656
+ info.update(
657
+ {
658
+ "status": "ready",
659
+ "device": _DEMUCS_DEVICE,
660
+ "method": "demucs-stem-mixability",
661
+ "has_drums": bool(has_drums),
662
+ "has_bass": bool(has_bass),
663
+ "split_mode": "chunked" if use_split else "full-window",
664
+ "window_start_sec": round(float(window_start_sec), 3),
665
+ "window_duration_sec": round(float(mono.size / max(1, demucs_sr)), 3),
666
+ }
667
+ )
668
+ return _VocalActivityProfile(
669
+ vocal_ratio=ratio,
670
+ vocal_onset=vocal_onset_n,
671
+ drum_onset=drum_onset_n,
672
+ bass_rms=bass_rms_n,
673
+ instrumental_rms=inst_rms_n,
674
+ times=times,
675
+ method="demucs-stem-mixability",
676
+ has_drums=bool(has_drums),
677
+ has_bass=bool(has_bass),
678
+ ), info
679
+ except Exception as exc:
680
+ LOGGER.warning("Demucs vocal analysis failed for %s (%s). Continuing without vocal penalty.", track_label, exc)
681
+ info["status"] = "error"
682
+ info["reason"] = str(exc)
683
+ return None, info
684
+
685
+
686
+ def _label_weight(label: str, outgoing: bool) -> float:
687
+ label_l = (label or "").strip().lower()
688
+ if outgoing:
689
+ table = [
690
+ ("outro", 1.00),
691
+ ("break", 0.95),
692
+ ("bridge", 0.90),
693
+ ("verse", 0.82),
694
+ ("chorus", 0.66),
695
+ ("intro", 0.20),
696
+ ("start", 0.10),
697
+ ("end", 0.05),
698
+ ]
699
+ else:
700
+ table = [
701
+ ("verse", 0.95),
702
+ ("break", 0.90),
703
+ ("bridge", 0.84),
704
+ ("chorus", 0.80),
705
+ ("intro", 0.74),
706
+ ("outro", 0.20),
707
+ ("start", 0.10),
708
+ ("end", 0.05),
709
+ ]
710
+ for token, score in table:
711
+ if token in label_l:
712
+ return float(score)
713
+ return 0.60
714
+
715
+
716
+ def _compute_profiles(y: np.ndarray, sr: int) -> _TrackProfiles:
717
+ if y.size == 0:
718
+ zero = np.zeros((1,), dtype=np.float32)
719
+ return _TrackProfiles(
720
+ rms=zero,
721
+ rms_times=zero.copy(),
722
+ onset=zero.copy(),
723
+ onset_times=zero.copy(),
724
+ chroma=np.zeros((12, 1), dtype=np.float32),
725
+ chroma_times=zero.copy(),
726
+ )
727
+
728
+ rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
729
+ onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
730
+ try:
731
+ harmonic = librosa.effects.harmonic(y)
732
+ chroma = librosa.feature.chroma_cens(y=harmonic, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
733
+ except Exception as exc:
734
+ LOGGER.warning("Harmonic chroma extraction failed (%s); falling back to raw chroma.", exc)
735
+ chroma = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
736
+
737
+ if rms.size == 0:
738
+ rms = np.zeros((1,), dtype=np.float32)
739
+ if onset.size == 0:
740
+ onset = np.zeros((1,), dtype=np.float32)
741
+ if chroma.ndim != 2 or chroma.shape[1] == 0:
742
+ chroma = np.zeros((12, 1), dtype=np.float32)
743
+
744
+ max_rms = float(np.max(rms))
745
+ if max_rms > 1e-9:
746
+ rms = rms / max_rms
747
+ max_onset = float(np.max(onset))
748
+ if max_onset > 1e-9:
749
+ onset = onset / max_onset
750
+
751
+ rms_times = librosa.frames_to_time(np.arange(rms.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
752
+ onset_times = librosa.frames_to_time(np.arange(onset.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
753
+ chroma_times = librosa.frames_to_time(np.arange(chroma.shape[1]), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
754
+ return _TrackProfiles(
755
+ rms=rms,
756
+ rms_times=rms_times,
757
+ onset=onset,
758
+ onset_times=onset_times,
759
+ chroma=chroma,
760
+ chroma_times=chroma_times,
761
+ )
762
+
763
+
764
+ def _build_candidates(
765
+ beat_times: np.ndarray,
766
+ min_sec: float,
767
+ max_sec: float,
768
+ prefer_tail: bool,
769
+ limit: int,
770
+ ) -> List[Tuple[float, int]]:
771
+ if beat_times.size == 0:
772
+ return []
773
+ idxs = [idx for idx, t in enumerate(beat_times) if float(min_sec) <= float(t) <= float(max_sec)]
774
+ if not idxs:
775
+ return []
776
+ idxs = idxs[-limit:] if prefer_tail else idxs[:limit]
777
+ return [(float(beat_times[idx]), int(idx)) for idx in idxs]
778
+
779
+
780
+ def _make_candidate(
+     time_sec: float,
+     beat_idx: int,
+     profiles: _TrackProfiles,
+     incoming: bool,
+     seam_sec: float,
+     vocal_profile: Optional[_VocalActivityProfile] = None,
+     vocal_time_sec: Optional[float] = None,
+ ) -> _CueCandidate:
+     if incoming:
+         energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec, time_sec + 1.0)
+         onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.1, time_sec + 0.5)
+     else:
+         energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec - 1.0, time_sec)
+         onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.5, time_sec + 0.1)
+     chroma = _mean_2d(profiles.chroma, profiles.chroma_times, time_sec - 2.0, time_sec + 2.0)
+     vocal_lookup_sec = float(vocal_time_sec) if vocal_time_sec is not None else float(time_sec)
+     stem_mix = _lookup_stem_mixability(vocal_profile, vocal_lookup_sec)
+     seam = max(1e-3, float(seam_sec))
+     period_start = vocal_lookup_sec if incoming else (vocal_lookup_sec - seam)
+     period_end = (vocal_lookup_sec + seam) if incoming else vocal_lookup_sec
+     period_mix = _lookup_period_mixability(
+         profile=vocal_profile,
+         start_sec=period_start,
+         end_sec=period_end,
+         incoming=incoming,
+     )
+     return _CueCandidate(
+         time_sec=float(time_sec),
+         beat_idx=int(beat_idx),
+         phrase=_phrase_score(int(beat_idx)),
+         energy=float(_clamp(energy, 0.0, 1.0)),
+         onset=float(_clamp(onset, 0.0, 1.0)),
+         chroma=chroma,
+         vocal_ratio=float(stem_mix["vocal_ratio"]),
+         vocal_onset=float(stem_mix["vocal_onset"]),
+         vocal_phrase_score=float(stem_mix["vocal_phrase_score"]),
+         drum_anchor=float(stem_mix["drum_anchor"]),
+         bass_energy=float(stem_mix["bass_energy"]),
+         bass_stability=float(stem_mix["bass_stability"]),
+         instrumental_density=float(stem_mix["instrumental_density"]),
+         density_score=float(stem_mix["density_score"]),
+         period_vocal_ratio=float(period_mix["period_vocal_ratio"]),
+         period_vocal_phrase_score=float(period_mix["period_vocal_phrase_score"]),
+         period_drum_anchor=float(period_mix["period_drum_anchor"]),
+         period_bass_energy=float(period_mix["period_bass_energy"]),
+         period_bass_stability=float(period_mix["period_bass_stability"]),
+         period_density_score=float(period_mix["period_density_score"]),
+         period_coverage=float(period_mix["coverage"]),
+         period_vocal_curve=np.asarray(period_mix["period_vocal_curve"], dtype=np.float32),
+         period_bass_curve=np.asarray(period_mix["period_bass_curve"], dtype=np.float32),
+     )
+
+
+ def _score_pair(
+     cand_a: _CueCandidate,
+     cand_b: _CueCandidate,
+     target_a: float,
+     target_b: float,
+ ) -> Tuple[float, Dict[str, float]]:
+     energy_match = 1.0 - min(1.0, abs(cand_a.energy - cand_b.energy))
+     phrase_match = 0.5 * (cand_a.phrase + cand_b.phrase)
+     key_match = _clamp(_cosine_similarity(cand_a.chroma, cand_b.chroma), 0.0, 1.0)
+     onset_match = (0.35 * cand_a.onset) + (0.65 * cand_b.onset)
+     position_match = 0.5 * (
+         _target_position_score(cand_a.time_sec, target_a, spread=3.0)
+         + _target_position_score(cand_b.time_sec, target_b, spread=3.0)
+     )
+     vocal_phrase_match = 0.5 * (cand_a.vocal_phrase_score + cand_b.vocal_phrase_score)
+     drum_anchor_match = 0.5 * (cand_a.drum_anchor + cand_b.drum_anchor)
+     bass_stability_match = 0.5 * (cand_a.bass_stability + cand_b.bass_stability)
+     density_match = 0.5 * (cand_a.density_score + cand_b.density_score)
+     period_vocal_phrase_match = 0.5 * (cand_a.period_vocal_phrase_score + cand_b.period_vocal_phrase_score)
+     period_drum_anchor_match = 0.5 * (cand_a.period_drum_anchor + cand_b.period_drum_anchor)
+     period_bass_stability_match = 0.5 * (cand_a.period_bass_stability + cand_b.period_bass_stability)
+     period_density_match = 0.5 * (cand_a.period_density_score + cand_b.period_density_score)
+     vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a, cand_b)
+     clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)
+
+     total = (
+         (0.07 * energy_match)
+         + (0.06 * phrase_match)
+         + (0.08 * key_match)
+         + (0.04 * onset_match)
+         + (0.03 * position_match)
+         + (0.05 * vocal_phrase_match)
+         + (0.04 * drum_anchor_match)
+         + (0.03 * bass_stability_match)
+         + (0.02 * density_match)
+         + (0.14 * period_vocal_phrase_match)
+         + (0.10 * period_drum_anchor_match)
+         + (0.09 * period_bass_stability_match)
+         + (0.07 * period_density_match)
+         + (0.16 * clash_avoidance)
+         + (0.02 * period_coverage)
+     )
+     components = {
+         "energy_match": float(energy_match),
+         "phrase_match": float(phrase_match),
+         "key_match": float(key_match),
+         "onset_match": float(onset_match),
+         "position_match": float(position_match),
+         "vocal_phrase_match": float(vocal_phrase_match),
+         "drum_anchor_match": float(drum_anchor_match),
+         "bass_stability_match": float(bass_stability_match),
+         "density_match": float(density_match),
+         "period_vocal_phrase_match": float(period_vocal_phrase_match),
+         "period_drum_anchor_match": float(period_drum_anchor_match),
+         "period_bass_stability_match": float(period_bass_stability_match),
+         "period_density_match": float(period_density_match),
+         "period_coverage": float(period_coverage),
+         "vocal_clash_risk": float(vocal_clash_risk),
+         "bass_clash_risk": float(bass_clash_risk),
+         "clash_avoidance": float(clash_avoidance),
+         "total": float(total),
+     }
+     return float(total), components
+
+
+ def _segments_from_boundaries(boundaries: np.ndarray, duration_sec: float) -> List[Dict[str, object]]:
+     clean = [0.0]
+     for t in np.asarray(boundaries, dtype=np.float32):
+         x = float(t)
+         if 0.0 < x < float(duration_sec):
+             clean.append(x)
+     clean.append(float(duration_sec))
+     clean = sorted(set(round(x, 3) for x in clean))
+     segs: List[Dict[str, object]] = []
+     for idx in range(len(clean) - 1):
+         start = float(clean[idx])
+         end = float(clean[idx + 1])
+         if end - start < 4.0:
+             continue
+         segs.append({"start": start, "end": end, "label": f"section_{idx + 1}"})
+     return segs
+
+
+ def _try_get_librosa_structure(path: str, duration_sec: float) -> Optional[Dict[str, np.ndarray]]:
+     if path in _LIBROSA_STRUCT_CACHE:
+         return _LIBROSA_STRUCT_CACHE[path]
+
+     decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
+     try:
+         y, _ = decode_segment(
+             path,
+             start_sec=0.0,
+             duration_sec=decode_sec,
+             sr=_STRUCT_SR,
+             max_decode_sec=max(600.0, decode_sec + 3.0),
+         )
+     except Exception as exc:
+         LOGGER.warning("librosa full-track decode failed for %s (%s).", path, exc)
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+     if y.size < _STRUCT_SR:
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+     try:
+         _, beat_frames = librosa.beat.beat_track(y=y, sr=_STRUCT_SR, trim=False)
+         beat_times = librosa.frames_to_time(np.asarray(beat_frames), sr=_STRUCT_SR).astype(np.float32)
+         downbeats = beat_times[::4] if beat_times.size > 0 else np.array([], dtype=np.float32)
+
+         onset_env = librosa.onset.onset_strength(y=y, sr=_STRUCT_SR, hop_length=_ANALYSIS_HOP).astype(np.float32)
+         boundary_frames = librosa.util.peak_pick(
+             onset_env,
+             pre_max=8,
+             post_max=8,
+             pre_avg=24,
+             post_avg=24,
+             delta=0.06,
+             wait=18,
+         )
+         boundaries = librosa.frames_to_time(
+             np.asarray(boundary_frames),
+             sr=_STRUCT_SR,
+             hop_length=_ANALYSIS_HOP,
+         ).astype(np.float32)
+         payload: Dict[str, np.ndarray] = {"downbeats": downbeats, "boundaries": boundaries}
+         _LIBROSA_STRUCT_CACHE[path] = payload
+         return payload
+     except Exception as exc:
+         LOGGER.warning("librosa structure extraction failed for %s (%s).", path, exc)
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+
+ def _get_or_build_profiles_for_track(path: str, duration_sec: float, sr: int) -> Optional[_TrackProfiles]:
+     key = (path, int(sr))
+     if key in _PROFILE_CACHE:
+         return _PROFILE_CACHE[key]
+
+     decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
+     try:
+         y, _ = decode_segment(
+             path,
+             start_sec=0.0,
+             duration_sec=decode_sec,
+             sr=int(sr),
+             max_decode_sec=max(600.0, decode_sec + 3.0),
+         )
+     except Exception as exc:
+         LOGGER.warning("Full-track decode failed for %s (%s).", path, exc)
+         _PROFILE_CACHE[key] = None
+         return None
+
+     if y.size < int(sr):
+         _PROFILE_CACHE[key] = None
+         return None
+
+     profiles = _compute_profiles(y, int(sr))
+     _PROFILE_CACHE[key] = profiles
+     return profiles
+
+
+ def _label_for_time(segments: List[Dict[str, object]], t: float) -> str:
+     for seg in segments:
+         start = float(seg["start"])
+         end = float(seg["end"])
+         if start <= float(t) < end:
+             return str(seg.get("label", "unknown"))
+     return "unknown"
+
+
+ def _dedupe_times(times: List[float], min_gap_sec: float) -> List[float]:
+     if not times:
+         return []
+     sorted_times = sorted(float(t) for t in times)
+     out: List[float] = [sorted_times[0]]
+     for t in sorted_times[1:]:
+         if (t - out[-1]) >= float(min_gap_sec):
+             out.append(t)
+     return out
+
+
+ def _build_structured_candidates(
+     downbeats: np.ndarray,
+     segments: List[Dict[str, object]],
+     profiles: _TrackProfiles,
+     vocal_profile: Optional[_VocalActivityProfile],
+     seam_sec: float,
+     duration_sec: float,
+     min_sec: float,
+     max_sec: float,
+     incoming: bool,
+     target_ratio: float,
+     limit: int = 20,
+ ) -> List[_StructuredCandidate]:
+     if max_sec <= min_sec:
+         return []
+
+     raw_times: List[float] = []
+     if downbeats.size > 0:
+         raw_times.extend([float(t) for t in downbeats if min_sec <= float(t) <= max_sec])
+
+     for seg in segments:
+         start = float(seg["start"])
+         end = float(seg["end"])
+         if min_sec <= start <= max_sec:
+             raw_times.append(start)
+         if min_sec <= end <= max_sec:
+             raw_times.append(end)
+
+     if not raw_times and downbeats.size > 0:
+         raw_times.extend([float(t) for t in downbeats])
+
+     if not raw_times:
+         return []
+
+     snapped: List[float] = []
+     for t in raw_times:
+         if downbeats.size > 0:
+             idx = int(np.argmin(np.abs(downbeats - float(t))))
+             snapped_t = float(downbeats[idx])
+         else:
+             snapped_t = float(t)
+         if min_sec <= snapped_t <= max_sec:
+             snapped.append(snapped_t)
+
+     snapped = _dedupe_times(snapped, min_gap_sec=1.2)
+     if not snapped:
+         return []
+
+     if incoming:
+         snapped = snapped[:limit]
+     else:
+         snapped = snapped[-limit:]
+
+     target_sec = float(target_ratio * duration_sec)
+     spread = max(4.0, 0.15 * duration_sec)
+
+     built: List[_StructuredCandidate] = []
+     for i, t in enumerate(snapped):
+         label = _label_for_time(segments, t)
+         cue = _make_candidate(
+             time_sec=t,
+             beat_idx=(i * 4),
+             profiles=profiles,
+             incoming=incoming,
+             seam_sec=seam_sec,
+             vocal_profile=vocal_profile,
+             vocal_time_sec=t,
+         )
+         built.append(
+             _StructuredCandidate(
+                 cue=cue,
+                 label=label,
+                 label_score=_label_weight(label, outgoing=(not incoming)),
+                 edge_score=_edge_score(t, duration_sec),
+                 position_score=_target_position_score(t, target=target_sec, spread=spread),
+             )
+         )
+     return built
+
+
+ def _score_structured_pair(
+     cand_a: _StructuredCandidate,
+     cand_b: _StructuredCandidate,
+ ) -> Tuple[float, Dict[str, float]]:
+     energy_match = 1.0 - min(1.0, abs(cand_a.cue.energy - cand_b.cue.energy))
+     phrase_match = 0.5 * (cand_a.cue.phrase + cand_b.cue.phrase)
+     key_match = _clamp(_cosine_similarity(cand_a.cue.chroma, cand_b.cue.chroma), 0.0, 1.0)
+     onset_match = (0.40 * cand_a.cue.onset) + (0.60 * cand_b.cue.onset)
+     label_match = 0.5 * (cand_a.label_score + cand_b.label_score)
+     position_match = 0.5 * (cand_a.position_score + cand_b.position_score)
+     edge_match = 0.5 * (cand_a.edge_score + cand_b.edge_score)
+     vocal_phrase_match = 0.5 * (cand_a.cue.vocal_phrase_score + cand_b.cue.vocal_phrase_score)
+     drum_anchor_match = 0.5 * (cand_a.cue.drum_anchor + cand_b.cue.drum_anchor)
+     bass_stability_match = 0.5 * (cand_a.cue.bass_stability + cand_b.cue.bass_stability)
+     density_match = 0.5 * (cand_a.cue.density_score + cand_b.cue.density_score)
+     period_vocal_phrase_match = 0.5 * (cand_a.cue.period_vocal_phrase_score + cand_b.cue.period_vocal_phrase_score)
+     period_drum_anchor_match = 0.5 * (cand_a.cue.period_drum_anchor + cand_b.cue.period_drum_anchor)
+     period_bass_stability_match = 0.5 * (cand_a.cue.period_bass_stability + cand_b.cue.period_bass_stability)
+     period_density_match = 0.5 * (cand_a.cue.period_density_score + cand_b.cue.period_density_score)
+     vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a.cue, cand_b.cue)
+     clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)
+
+     total = (
+         (0.08 * energy_match)
+         + (0.09 * key_match)
+         + (0.06 * onset_match)
+         + (0.05 * phrase_match)
+         + (0.10 * label_match)
+         + (0.05 * position_match)
+         + (0.04 * edge_match)
+         + (0.06 * vocal_phrase_match)
+         + (0.04 * drum_anchor_match)
+         + (0.03 * bass_stability_match)
+         + (0.02 * density_match)
+         + (0.12 * period_vocal_phrase_match)
+         + (0.08 * period_drum_anchor_match)
+         + (0.07 * period_bass_stability_match)
+         + (0.05 * period_density_match)
+         + (0.06 * clash_avoidance)
+         + (0.01 * period_coverage)
+     )
+     components = {
+         "energy_match": float(energy_match),
+         "key_match": float(key_match),
+         "onset_match": float(onset_match),
+         "phrase_match": float(phrase_match),
+         "label_match": float(label_match),
+         "position_match": float(position_match),
+         "edge_match": float(edge_match),
+         "vocal_phrase_match": float(vocal_phrase_match),
+         "drum_anchor_match": float(drum_anchor_match),
+         "bass_stability_match": float(bass_stability_match),
+         "density_match": float(density_match),
+         "period_vocal_phrase_match": float(period_vocal_phrase_match),
+         "period_drum_anchor_match": float(period_drum_anchor_match),
+         "period_bass_stability_match": float(period_bass_stability_match),
+         "period_density_match": float(period_density_match),
+         "period_coverage": float(period_coverage),
+         "vocal_clash_risk": float(vocal_clash_risk),
+         "bass_clash_risk": float(bass_clash_risk),
+         "clash_avoidance": float(clash_avoidance),
+         "total": float(total),
+     }
+     return float(total), components
+
+
+ def _try_structure_aware_selection(
+     song_a_path: Optional[str],
+     song_b_path: Optional[str],
+     song_a_duration_sec: Optional[float],
+     song_b_duration_sec: Optional[float],
+     pre_sec: float,
+     seam_sec: float,
+     post_sec: float,
+     vocal_profile_a: Optional[_VocalActivityProfile],
+     vocal_profile_b: Optional[_VocalActivityProfile],
+ ) -> Optional[CueSelectionResult]:
+     if not song_a_path or not song_b_path:
+         return None
+     if song_a_duration_sec is None or song_b_duration_sec is None:
+         return None
+
+     dur_a = float(song_a_duration_sec)
+     dur_b = float(song_b_duration_sec)
+
+     min_a = max(seam_sec + 2.0, pre_sec + 2.0, 0.30 * dur_a)
+     max_a = min(dur_a - seam_sec - 2.0, 0.88 * dur_a)
+     min_b = max(4.0, 0.10 * dur_b)
+     max_b = min(dur_b - (seam_sec + post_sec + 2.0), 0.72 * dur_b)
+
+     if max_a <= min_a or max_b <= min_b:
+         return None
+
+     source = "librosa"
+     lib_a = _try_get_librosa_structure(song_a_path, dur_a)
+     lib_b = _try_get_librosa_structure(song_b_path, dur_b)
+     if lib_a is None or lib_b is None:
+         return None
+
+     downbeats_a = np.asarray(lib_a.get("downbeats", []), dtype=np.float32)
+     downbeats_b = np.asarray(lib_b.get("downbeats", []), dtype=np.float32)
+     segments_a: List[Dict[str, object]] = _segments_from_boundaries(
+         np.asarray(lib_a.get("boundaries", []), dtype=np.float32),
+         duration_sec=dur_a,
+     )
+     segments_b: List[Dict[str, object]] = _segments_from_boundaries(
+         np.asarray(lib_b.get("boundaries", []), dtype=np.float32),
+         duration_sec=dur_b,
+     )
+
+     if downbeats_a.size < 4 or downbeats_b.size < 4:
+         return None
+
+     profiles_a = _get_or_build_profiles_for_track(song_a_path, dur_a, sr=_STRUCT_SR)
+     profiles_b = _get_or_build_profiles_for_track(song_b_path, dur_b, sr=_STRUCT_SR)
+     if profiles_a is None or profiles_b is None:
+         return None
+
+     cands_a = _build_structured_candidates(
+         downbeats=downbeats_a,
+         segments=segments_a,
+         profiles=profiles_a,
+         vocal_profile=vocal_profile_a,
+         seam_sec=seam_sec,
+         duration_sec=dur_a,
+         min_sec=min_a,
+         max_sec=max_a,
+         incoming=False,
+         target_ratio=0.63,
+         limit=22,
+     )
+     cands_b = _build_structured_candidates(
+         downbeats=downbeats_b,
+         segments=segments_b,
+         profiles=profiles_b,
+         vocal_profile=vocal_profile_b,
+         seam_sec=seam_sec,
+         duration_sec=dur_b,
+         min_sec=min_b,
+         max_sec=max_b,
+         incoming=True,
+         target_ratio=0.27,
+         limit=22,
+     )
+     if not cands_a or not cands_b:
+         return None
+
+     best_score = -1.0
+     best_a: Optional[_StructuredCandidate] = None
+     best_b: Optional[_StructuredCandidate] = None
+     ranked: List[Dict[str, object]] = []
+     for ca in cands_a:
+         for cb in cands_b:
+             score, comps = _score_structured_pair(ca, cb)
+             ranked.append(
+                 {
+                     "score": float(score),
+                     "song_a_sec": float(ca.cue.time_sec),
+                     "song_b_sec": float(cb.cue.time_sec),
+                     "song_a_label": ca.label,
+                     "song_b_label": cb.label,
+                     "song_a_vocal_ratio": float(ca.cue.vocal_ratio),
+                     "song_b_vocal_ratio": float(cb.cue.vocal_ratio),
+                     "song_a_period_vocal_phrase": float(ca.cue.period_vocal_phrase_score),
+                     "song_b_period_vocal_phrase": float(cb.cue.period_vocal_phrase_score),
+                     "song_a_period_drum_anchor": float(ca.cue.period_drum_anchor),
+                     "song_b_period_drum_anchor": float(cb.cue.period_drum_anchor),
+                     "song_a_period_bass_stability": float(ca.cue.period_bass_stability),
+                     "song_b_period_bass_stability": float(cb.cue.period_bass_stability),
+                     "song_a_period_density": float(ca.cue.period_density_score),
+                     "song_b_period_density": float(cb.cue.period_density_score),
+                     "song_a_period_coverage": float(ca.cue.period_coverage),
+                     "song_b_period_coverage": float(cb.cue.period_coverage),
+                     "components": comps,
+                 }
+             )
+             if score > best_score:
+                 best_score = float(score)
+                 best_a = ca
+                 best_b = cb
+
+     if best_a is None or best_b is None:
+         return None
+
+     ranked = sorted(ranked, key=lambda x: float(x["score"]), reverse=True)
+     top_pairs = []
+     for item in ranked[:3]:
+         top_pairs.append(
+             {
+                 "score": round(float(item["score"]), 4),
+                 "song_a_sec": round(float(item["song_a_sec"]), 3),
+                 "song_b_sec": round(float(item["song_b_sec"]), 3),
+                 "song_a_label": str(item["song_a_label"]),
+                 "song_b_label": str(item["song_b_label"]),
+                 "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
+                 "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
+                 "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
+                 "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
+                 "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
+                 "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
+                 "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
+                 "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
+                 "song_a_period_density": round(float(item["song_a_period_density"]), 4),
+                 "song_b_period_density": round(float(item["song_b_period_density"]), 4),
+                 "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
+                 "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
+                 "components": {k: round(float(v), 4) for k, v in item["components"].items()},
+             }
+         )
+
+     principles = [
+         "phrase/downbeat alignment",
+         "section boundary awareness",
+         "energy continuity",
+         "harmonic/chroma compatibility",
+     ]
+     if vocal_profile_a is not None or vocal_profile_b is not None:
+         principles.extend(
+             [
+                 "vocal phrase-safe cueing (low or ending vocals)",
+                 "drum-anchor confidence",
+                 "bassline stability control",
+                 "instrumental density targeting",
+                 "clash-risk precheck (vocal+bass overlap)",
+             ]
+         )
+
+     return CueSelectionResult(
+         cue_a_sec=float(best_a.cue.time_sec),
+         cue_b_sec=float(best_b.cue.time_sec),
+         method=f"{source}-structure-aware",
+         debug={
+             "source": source,
+             "candidate_ranges_sec": {
+                 "song_a": [round(min_a, 3), round(max_a, 3)],
+                 "song_b": [round(min_b, 3), round(max_b, 3)],
+             },
+             "transition_period_sec": round(float(seam_sec), 3),
+             "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
+             "selected_sec": {"song_a": round(float(best_a.cue.time_sec), 3), "song_b": round(float(best_b.cue.time_sec), 3)},
+             "selected_labels": {"song_a": best_a.label, "song_b": best_b.label},
+             "selected_mixability": {
+                 "song_a_ratio": round(float(best_a.cue.vocal_ratio), 4),
+                 "song_b_ratio": round(float(best_b.cue.vocal_ratio), 4),
+                 "song_a_vocal_onset": round(float(best_a.cue.vocal_onset), 4),
+                 "song_b_vocal_onset": round(float(best_b.cue.vocal_onset), 4),
+                 "song_a_vocal_phrase": round(float(best_a.cue.vocal_phrase_score), 4),
+                 "song_b_vocal_phrase": round(float(best_b.cue.vocal_phrase_score), 4),
+                 "song_a_drum_anchor": round(float(best_a.cue.drum_anchor), 4),
+                 "song_b_drum_anchor": round(float(best_b.cue.drum_anchor), 4),
+                 "song_a_bass_stability": round(float(best_a.cue.bass_stability), 4),
+                 "song_b_bass_stability": round(float(best_b.cue.bass_stability), 4),
+                 "song_a_density_score": round(float(best_a.cue.density_score), 4),
+                 "song_b_density_score": round(float(best_b.cue.density_score), 4),
+                 "song_a_period_vocal_phrase": round(float(best_a.cue.period_vocal_phrase_score), 4),
+                 "song_b_period_vocal_phrase": round(float(best_b.cue.period_vocal_phrase_score), 4),
+                 "song_a_period_drum_anchor": round(float(best_a.cue.period_drum_anchor), 4),
+                 "song_b_period_drum_anchor": round(float(best_b.cue.period_drum_anchor), 4),
+                 "song_a_period_bass_stability": round(float(best_a.cue.period_bass_stability), 4),
+                 "song_b_period_bass_stability": round(float(best_b.cue.period_bass_stability), 4),
+                 "song_a_period_density_score": round(float(best_a.cue.period_density_score), 4),
+                 "song_b_period_density_score": round(float(best_b.cue.period_density_score), 4),
+                 "song_a_period_coverage": round(float(best_a.cue.period_coverage), 4),
+                 "song_b_period_coverage": round(float(best_b.cue.period_coverage), 4),
+             },
+             "top_pairs": top_pairs,
+             "period_scoring": {
+                 "enabled": True,
+                 "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
+                 "overlap_simulation": "weighted vocal/bass clash precheck",
+             },
+             "dj_principles": principles,
+         },
+     )
+
+
+ def select_mix_cuepoints(
+     y_a_analysis: np.ndarray,
+     y_b_analysis: np.ndarray,
+     sr: int,
+     analysis_sec: float,
+     pre_sec: float,
+     seam_sec: float,
+     post_sec: float,
+     a_analysis_start_sec: float,
+     beats_a: np.ndarray,
+     beats_b: np.ndarray,
+     cue_a_override_sec: Optional[float] = None,
+     cue_b_override_sec: Optional[float] = None,
+     song_a_path: Optional[str] = None,
+     song_b_path: Optional[str] = None,
+     song_a_duration_sec: Optional[float] = None,
+     song_b_duration_sec: Optional[float] = None,
+ ) -> CueSelectionResult:
+     target_a_rel = max(float(pre_sec), float(analysis_sec - seam_sec - 2.0))
+     target_b_rel = 2.0
+     default_a_rel = float(choose_nearest_beat(beats_a, target_a_rel))
+     default_b_rel = float(choose_first_beat_after(beats_b, target_b_rel))
+
+     default_a_abs = float(a_analysis_start_sec + default_a_rel)
+     default_b_abs = float(default_b_rel)
+
+     if cue_a_override_sec is not None or cue_b_override_sec is not None:
+         cue_a = float(cue_a_override_sec) if cue_a_override_sec is not None else default_a_abs
+         cue_b = float(cue_b_override_sec) if cue_b_override_sec is not None else default_b_abs
+         return CueSelectionResult(
+             cue_a_sec=cue_a,
+             cue_b_sec=cue_b,
+             method="manual-override",
+             debug={
+                 "manual_override": True,
+                 "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
+             },
+         )
+
+     vocal_profile_a, vocal_debug_a = _extract_vocal_profile_demucs(
+         y=y_a_analysis,
+         sr=int(sr),
+         window_start_sec=float(a_analysis_start_sec),
+         track_label="song_a_analysis_window",
+     )
+     vocal_profile_b, vocal_debug_b = _extract_vocal_profile_demucs(
+         y=y_b_analysis,
+         sr=int(sr),
+         window_start_sec=0.0,
+         track_label="song_b_analysis_window",
+     )
+     vocal_debug = {
+         "enabled": bool(_DEMUCS_ENABLED),
+         "song_a": vocal_debug_a,
+         "song_b": vocal_debug_b,
+     }
+
+     structure_result = _try_structure_aware_selection(
+         song_a_path=song_a_path,
+         song_b_path=song_b_path,
+         song_a_duration_sec=song_a_duration_sec,
+         song_b_duration_sec=song_b_duration_sec,
+         pre_sec=float(pre_sec),
+         seam_sec=float(seam_sec),
+         post_sec=float(post_sec),
+         vocal_profile_a=vocal_profile_a,
+         vocal_profile_b=vocal_profile_b,
+     )
+     if structure_result is not None:
+         structure_result.debug["manual_override"] = False
+         structure_result.debug["default_local_auto_cues_sec"] = {
+             "song_a": round(default_a_abs, 3),
+             "song_b": round(default_b_abs, 3),
+         }
+         structure_result.debug["vocal_analysis"] = vocal_debug
+         structure_result.debug["vocal_penalty_active"] = bool(vocal_profile_a is not None or vocal_profile_b is not None)
+         return structure_result
+
+     if beats_a.size < 4 or beats_b.size < 4:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="beat-fallback",
+             debug={
+                 "reason": "insufficient_beats",
+                 "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
+                 "vocal_analysis": vocal_debug,
+             },
+         )
+
+     profiles_a = _compute_profiles(y_a_analysis, sr)
+     profiles_b = _compute_profiles(y_b_analysis, sr)
+
+     min_a = max(0.5, float(seam_sec + 0.5), float(pre_sec + (0.20 * seam_sec)))
+     max_a = max(min_a + 0.1, float(analysis_sec - max(0.75, 0.25 * seam_sec)))
+     min_b = max(0.75, float(0.12 * seam_sec))
+     max_b = max(min_b + 0.1, float(analysis_sec - max((seam_sec + 0.75), (0.25 * post_sec))))
+
+     raw_a = _build_candidates(beats_a, min_a, max_a, prefer_tail=True, limit=24)
+     raw_b = _build_candidates(beats_b, min_b, max_b, prefer_tail=False, limit=24)
+     if not raw_a or not raw_b:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="candidate-fallback",
+             debug={
+                 "reason": "empty_candidate_set",
+                 "candidate_counts": {"song_a": len(raw_a), "song_b": len(raw_b)},
+                 "candidate_windows_sec": {
+                     "song_a": [round(min_a, 3), round(max_a, 3)],
+                     "song_b": [round(min_b, 3), round(max_b, 3)],
+                 },
+                 "vocal_analysis": vocal_debug,
+             },
+         )
+
+     cands_a = [
+         _make_candidate(
+             t,
+             idx,
+             profiles_a,
+             incoming=False,
+             seam_sec=float(seam_sec),
+             vocal_profile=vocal_profile_a,
+             vocal_time_sec=float(a_analysis_start_sec + t),
+         )
+         for (t, idx) in raw_a
+     ]
+     cands_b = [
+         _make_candidate(
+             t,
+             idx,
+             profiles_b,
+             incoming=True,
+             seam_sec=float(seam_sec),
+             vocal_profile=vocal_profile_b,
+             vocal_time_sec=float(t),
+         )
+         for (t, idx) in raw_b
+     ]
+
+     scored_pairs: List[Dict[str, object]] = []
+     best: Optional[Dict[str, object]] = None
+     target_b = max(2.0, min(8.0, float(analysis_sec * 0.25)))
+
+     for cand_a in cands_a:
+         for cand_b in cands_b:
+             total, comps = _score_pair(cand_a, cand_b, target_a=target_a_rel, target_b=target_b)
+             item = {
+                 "score": float(total),
+                 "song_a_rel_sec": float(cand_a.time_sec),
+                 "song_b_rel_sec": float(cand_b.time_sec),
+                 "song_a_vocal_ratio": float(cand_a.vocal_ratio),
+                 "song_b_vocal_ratio": float(cand_b.vocal_ratio),
+                 "song_a_vocal_onset": float(cand_a.vocal_onset),
+                 "song_b_vocal_onset": float(cand_b.vocal_onset),
+                 "song_a_vocal_phrase": float(cand_a.vocal_phrase_score),
+                 "song_b_vocal_phrase": float(cand_b.vocal_phrase_score),
+                 "song_a_drum_anchor": float(cand_a.drum_anchor),
+                 "song_b_drum_anchor": float(cand_b.drum_anchor),
+                 "song_a_bass_energy": float(cand_a.bass_energy),
+                 "song_b_bass_energy": float(cand_b.bass_energy),
+                 "song_a_bass_stability": float(cand_a.bass_stability),
+                 "song_b_bass_stability": float(cand_b.bass_stability),
+                 "song_a_density": float(cand_a.instrumental_density),
+                 "song_b_density": float(cand_b.instrumental_density),
+                 "song_a_density_score": float(cand_a.density_score),
+                 "song_b_density_score": float(cand_b.density_score),
+                 "song_a_period_vocal_phrase": float(cand_a.period_vocal_phrase_score),
+                 "song_b_period_vocal_phrase": float(cand_b.period_vocal_phrase_score),
+                 "song_a_period_drum_anchor": float(cand_a.period_drum_anchor),
+                 "song_b_period_drum_anchor": float(cand_b.period_drum_anchor),
+                 "song_a_period_bass_energy": float(cand_a.period_bass_energy),
+                 "song_b_period_bass_energy": float(cand_b.period_bass_energy),
+                 "song_a_period_bass_stability": float(cand_a.period_bass_stability),
+                 "song_b_period_bass_stability": float(cand_b.period_bass_stability),
+                 "song_a_period_density": float(cand_a.period_density_score),
+                 "song_b_period_density": float(cand_b.period_density_score),
+                 "song_a_period_coverage": float(cand_a.period_coverage),
+                 "song_b_period_coverage": float(cand_b.period_coverage),
+                 "components": comps,
+             }
+             scored_pairs.append(item)
+             if best is None or float(total) > float(best["score"]):
+                 best = item
+
+     if best is None:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="score-fallback",
+             debug={"reason": "no_scored_pairs", "vocal_analysis": vocal_debug},
+         )
+
+     scored_pairs = sorted(scored_pairs, key=lambda x: float(x["score"]), reverse=True)
+     top_pairs = [
+         {
+             "score": round(float(item["score"]), 4),
+             "song_a_rel_sec": round(float(item["song_a_rel_sec"]), 3),
+             "song_b_rel_sec": round(float(item["song_b_rel_sec"]), 3),
+             "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
+             "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
+             "song_a_vocal_phrase": round(float(item["song_a_vocal_phrase"]), 4),
+             "song_b_vocal_phrase": round(float(item["song_b_vocal_phrase"]), 4),
+             "song_a_drum_anchor": round(float(item["song_a_drum_anchor"]), 4),
+             "song_b_drum_anchor": round(float(item["song_b_drum_anchor"]), 4),
+             "song_a_bass_stability": round(float(item["song_a_bass_stability"]), 4),
+             "song_b_bass_stability": round(float(item["song_b_bass_stability"]), 4),
+             "song_a_density_score": round(float(item["song_a_density_score"]), 4),
+             "song_b_density_score": round(float(item["song_b_density_score"]), 4),
+             "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
+             "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
+             "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
+             "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
+             "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
+             "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
+             "song_a_period_density": round(float(item["song_a_period_density"]), 4),
+             "song_b_period_density": round(float(item["song_b_period_density"]), 4),
+             "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
+             "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
+             "components": {k: round(float(v), 4) for k, v in item["components"].items()},
+         }
+         for item in scored_pairs[:3]
+     ]
+
+     cue_a_abs = float(a_analysis_start_sec + float(best["song_a_rel_sec"]))
+     cue_b_abs = float(best["song_b_rel_sec"])
+     return CueSelectionResult(
+         cue_a_sec=cue_a_abs,
+         cue_b_sec=cue_b_abs,
+         method="scored-auto",
+         debug={
+             "manual_override": False,
+             "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
+             "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
+             "candidate_windows_sec": {
+                 "song_a": [round(min_a, 3), round(max_a, 3)],
+                 "song_b": [round(min_b, 3), round(max_b, 3)],
+             },
+             "transition_period_sec": round(float(seam_sec), 3),
+             "selected_rel_sec": {
+                 "song_a": round(float(best["song_a_rel_sec"]), 3),
+                 "song_b": round(float(best["song_b_rel_sec"]), 3),
+             },
+             "selected_mixability": {
+                 "song_a_ratio": round(float(best["song_a_vocal_ratio"]), 4),
+                 "song_b_ratio": round(float(best["song_b_vocal_ratio"]), 4),
+                 "song_a_vocal_onset": round(float(best["song_a_vocal_onset"]), 4),
+                 "song_b_vocal_onset": round(float(best["song_b_vocal_onset"]), 4),
1621
+ "song_a_vocal_phrase": round(float(best["song_a_vocal_phrase"]), 4),
1622
+ "song_b_vocal_phrase": round(float(best["song_b_vocal_phrase"]), 4),
1623
+ "song_a_drum_anchor": round(float(best["song_a_drum_anchor"]), 4),
1624
+ "song_b_drum_anchor": round(float(best["song_b_drum_anchor"]), 4),
1625
+ "song_a_bass_energy": round(float(best["song_a_bass_energy"]), 4),
1626
+ "song_b_bass_energy": round(float(best["song_b_bass_energy"]), 4),
1627
+ "song_a_bass_stability": round(float(best["song_a_bass_stability"]), 4),
1628
+ "song_b_bass_stability": round(float(best["song_b_bass_stability"]), 4),
1629
+ "song_a_density": round(float(best["song_a_density"]), 4),
1630
+ "song_b_density": round(float(best["song_b_density"]), 4),
1631
+ "song_a_density_score": round(float(best["song_a_density_score"]), 4),
1632
+ "song_b_density_score": round(float(best["song_b_density_score"]), 4),
1633
+ "song_a_period_vocal_phrase": round(float(best["song_a_period_vocal_phrase"]), 4),
1634
+ "song_b_period_vocal_phrase": round(float(best["song_b_period_vocal_phrase"]), 4),
1635
+ "song_a_period_drum_anchor": round(float(best["song_a_period_drum_anchor"]), 4),
1636
+ "song_b_period_drum_anchor": round(float(best["song_b_period_drum_anchor"]), 4),
1637
+ "song_a_period_bass_energy": round(float(best["song_a_period_bass_energy"]), 4),
1638
+ "song_b_period_bass_energy": round(float(best["song_b_period_bass_energy"]), 4),
1639
+ "song_a_period_bass_stability": round(float(best["song_a_period_bass_stability"]), 4),
1640
+ "song_b_period_bass_stability": round(float(best["song_b_period_bass_stability"]), 4),
1641
+ "song_a_period_density": round(float(best["song_a_period_density"]), 4),
1642
+ "song_b_period_density": round(float(best["song_b_period_density"]), 4),
1643
+ "song_a_period_coverage": round(float(best["song_a_period_coverage"]), 4),
1644
+ "song_b_period_coverage": round(float(best["song_b_period_coverage"]), 4),
1645
+ },
1646
+ "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
1647
+ "vocal_analysis": vocal_debug,
1648
+ "vocal_penalty_active": bool(vocal_profile_a is not None or vocal_profile_b is not None),
1649
+ "top_pairs": top_pairs,
1650
+ "period_scoring": {
1651
+ "enabled": True,
1652
+ "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
1653
+ "overlap_simulation": "weighted vocal/bass clash precheck",
1654
+ },
1655
+ },
1656
+ )
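The ranking step above reduces to a sort-and-round over the scored candidate pairs: sort descending by total score, keep the top three, and round every value for the debug payload. A dependency-free sketch of just that step (function name and the `k`/`ndigits` parameters are illustrative, not part of the module):

```python
def top_pairs(scored, k=3, ndigits=4):
    # Sort candidate cue pairs by total score (descending) and keep a
    # compact, rounded view of the best k for the debug payload.
    ranked = sorted(scored, key=lambda item: float(item["score"]), reverse=True)
    return [
        {key: round(float(v), ndigits) for key, v in item.items()}
        for item in ranked[:k]
    ]

result = top_pairs([{"score": 0.71234567}, {"score": 0.95}, {"score": 0.88}])
print(result)  # -> [{'score': 0.95}, {'score': 0.88}, {'score': 0.7123}]
```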
pipeline/transition_generator.py ADDED
@@ -0,0 +1,1694 @@
+import argparse
+import hashlib
+import json
+import logging
+import os
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+import librosa  # type: ignore[reportMissingImports]
+import numpy as np
+
+from .audio_utils import (
+    apply_edge_fades,
+    clamp,
+    crossfade_equal_length,
+    decode_segment,
+    ensure_length,
+    estimate_bpm_and_beats,
+    ffprobe_duration_sec,
+    normalize_peak,
+    resample_if_needed,
+    safe_time_stretch,
+    write_wav,
+)
+from .cuepoint_selector import select_mix_cuepoints
+
+LOGGER = logging.getLogger(__name__)
+
+DEFAULT_TARGET_SR = 32000
+ACESTEP_INPUT_SR = 48000
+STITCH_PREVIEW_SIDE_SEC = 10.0
+
+PLUGIN_PRESETS: Dict[str, str] = {
+    "Smooth Blend": "smooth seamless DJ transition, balanced energy, clean, no vocals",
+    "EDM Build-up": "energetic EDM build-up transition with rising tension, clean, no vocals",
+    "Percussive Bridge": "percussive bridge transition with rhythmic drums and clear groove, no vocals",
+    "Ambient Wash": "ambient wash transition, spacious and atmospheric, soft energy curve, no vocals",
+}
+
+_ACESTEP_RUNTIME: Optional[Dict[str, Any]] = None
+_DEMUCS_RUNTIME: Optional[Dict[str, Any]] = None
+_DEMUCS_TRANSITION_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_TRANSITION", "1").strip().lower() not in {
+    "0",
+    "false",
+    "no",
+    "off",
+}
+_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
+_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
+_DEMUCS_SEGMENT_SEC = 7.0
+_REF_AUDIO_MODE = (os.getenv("AI_DJ_REFERENCE_AUDIO_MODE", "accompaniment-only") or "accompaniment-only").strip().lower()
+
+
+@dataclass
+class _DemucsStemBundle:
+    vocals: np.ndarray
+    drums: np.ndarray
+    bass: np.ndarray
+    other: np.ndarray
+    accompaniment: np.ndarray
+    sr: int
+    method: str
+
+
+@dataclass
+class TransitionRequest:
+    song_a_path: str
+    song_b_path: str
+    plugin_id: str = "Smooth Blend"
+    instruction_text: str = ""
+    pre_context_sec: float = 6.0
+    repaint_width_sec: float = 4.0
+    post_context_sec: float = 6.0
+    analysis_sec: float = 45.0
+    bpm_target: Optional[float] = None
+    cue_a_sec: Optional[float] = None
+    cue_b_sec: Optional[float] = None
+    transition_base_mode: str = "B-base-fixed"
+    transition_bars: int = 8
+    creativity_strength: float = 7.0
+    inference_steps: int = 8
+    seed: int = 42
+    output_dir: str = "outputs"
+    output_stem: Optional[str] = None
+    target_sr: int = DEFAULT_TARGET_SR
+    keep_debug_files: bool = False
+
+    # ACE-Step runtime config
+    acestep_model_config: str = os.getenv("AI_DJ_ACESTEP_MODEL_CONFIG", "acestep-v15-turbo").strip()
+    acestep_device: str = os.getenv("AI_DJ_ACESTEP_DEVICE", "auto").strip()
+    acestep_project_root: str = os.getenv("AI_DJ_ACESTEP_PROJECT_ROOT", "").strip()
+    acestep_prefer_source: Optional[str] = os.getenv("AI_DJ_ACESTEP_PREFER_SOURCE", "").strip() or None
+    acestep_use_flash_attn: bool = False
+    acestep_compile_model: bool = False
+    acestep_offload_to_cpu: bool = False
+    acestep_offload_dit_to_cpu: bool = False
+    acestep_use_mlx_dit: bool = True
+    acestep_lora_path: str = os.getenv("AI_DJ_ACESTEP_LORA_PATH", "").strip()
+    acestep_lora_scale: float = float(os.getenv("AI_DJ_ACESTEP_LORA_SCALE", "1.0").strip() or "1.0")
+
+    def to_log_dict(self) -> Dict[str, Any]:
+        return asdict(self)
+
+
+@dataclass
+class TransitionResult:
+    transition_path: str
+    stitched_path: str
+    rough_stitched_path: str
+    hard_splice_path: str
+    backend_used: str
+    details: Dict[str, Any]
+
+    def to_dict(self) -> Dict[str, Any]:
+        payload = asdict(self)
+        return payload
+
+
+def _slug(text: str) -> str:
+    s = "".join(ch if ch.isalnum() or ch in {"-", "_"} else "_" for ch in text.strip())
+    s = "_".join(part for part in s.split("_") if part)
+    return s[:80] or "item"
+
+
+def _deterministic_stem(request: TransitionRequest) -> str:
+    if request.output_stem:
+        return _slug(request.output_stem)
+
+    payload = {
+        "a": os.path.basename(request.song_a_path),
+        "b": os.path.basename(request.song_b_path),
+        "plugin": request.plugin_id,
+        "instruction_text": request.instruction_text,
+        "pre_context_sec": request.pre_context_sec,
+        "repaint_width_sec": request.repaint_width_sec,
+        "post_context_sec": request.post_context_sec,
+        "analysis_sec": request.analysis_sec,
+        "bpm_target": request.bpm_target,
+        "cue_a_sec": request.cue_a_sec,
+        "cue_b_sec": request.cue_b_sec,
+        "transition_base_mode": request.transition_base_mode,
+        "transition_bars": request.transition_bars,
+        "creativity_strength": request.creativity_strength,
+        "inference_steps": request.inference_steps,
+        "seed": request.seed,
+        "target_sr": request.target_sr,
+        "acestep_model_config": request.acestep_model_config,
+        "demucs_transition_enabled": _DEMUCS_TRANSITION_ENABLED,
+        "demucs_model": _DEMUCS_MODEL_NAME,
+        "reference_audio_mode": _REF_AUDIO_MODE,
+    }
+    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
+    digest = hashlib.sha1(raw).hexdigest()[:10]
+    return f"transition_{_slug(Path(request.song_a_path).stem)}_to_{_slug(Path(request.song_b_path).stem)}_{digest}"
156
+
157
+
158
+ def _resolve_output_paths(request: TransitionRequest) -> Tuple[str, str, str, str, str]:
159
+ os.makedirs(request.output_dir, exist_ok=True)
160
+ stem = _deterministic_stem(request)
161
+ transition_path = os.path.join(request.output_dir, f"{stem}_transition.wav")
162
+ stitched_path = os.path.join(request.output_dir, f"{stem}_stitched.wav")
163
+ rough_stitched_path = os.path.join(request.output_dir, f"{stem}_rough_stitched.wav")
164
+ hard_splice_path = os.path.join(request.output_dir, f"{stem}_hard_splice.wav")
165
+ rough_src_path = os.path.join(request.output_dir, f"{stem}_rough_src.wav")
166
+ return transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path
167
+
168
+
169
+ def _resolve_acestep_project_root(request: TransitionRequest) -> str:
170
+ if request.acestep_project_root:
171
+ os.makedirs(request.acestep_project_root, exist_ok=True)
172
+ return request.acestep_project_root
173
+
174
+ hf_data = "/data"
175
+ if os.path.isdir(hf_data) and os.access(hf_data, os.W_OK):
176
+ root = os.path.join(hf_data, "acestep_runtime")
177
+ os.makedirs(root, exist_ok=True)
178
+ return root
179
+
180
+ root = os.path.join(os.path.dirname(os.path.dirname(__file__)), ".acestep_runtime")
181
+ os.makedirs(root, exist_ok=True)
182
+ return root
183
+
184
+
185
+ def _resolve_lora_path(lora_spec: str, project_root: str) -> str:
186
+ spec = (lora_spec or "").strip()
187
+ if not spec:
188
+ return ""
189
+ if os.path.exists(spec):
190
+ return os.path.abspath(spec)
191
+ # Treat non-local spec as a Hugging Face repo id, e.g. ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA
192
+ if "/" not in spec:
193
+ raise RuntimeError(
194
+ f"LoRA path not found: {spec}. Provide a local path or a Hugging Face repo id like "
195
+ "ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA."
196
+ )
197
+ try:
198
+ from huggingface_hub import snapshot_download
199
+ except Exception as exc:
200
+ raise RuntimeError(
201
+ "huggingface_hub is required to download LoRA from repo id. Install with: pip install huggingface_hub"
202
+ ) from exc
203
+
204
+ local_dir = os.path.join(project_root, "lora_cache", _slug(spec))
205
+ os.makedirs(local_dir, exist_ok=True)
206
+ return snapshot_download(
207
+ repo_id=spec,
208
+ local_dir=local_dir,
209
+ local_dir_use_symlinks=False,
210
+ )
211
+
212
+
213
+ def _build_caption(plugin_id: str, instruction_text: str) -> str:
214
+ base = PLUGIN_PRESETS.get(plugin_id, PLUGIN_PRESETS["Smooth Blend"])
215
+ extra = (instruction_text or "").strip()
216
+ if not extra:
217
+ return base
218
+ return f"{base}. Additional instruction: {extra}"
219
+
220
+
221
+ def _resolve_half_double_tempo(bpm_ref: float, bpm_candidate: float) -> float:
222
+ candidates = [0.5 * bpm_candidate, bpm_candidate, 2.0 * bpm_candidate]
223
+ valid = [v for v in candidates if 40.0 <= float(v) <= 240.0]
224
+ if not valid:
225
+ return float(bpm_candidate)
226
+ return float(min(valid, key=lambda x: abs(np.log2(max(1e-6, bpm_ref) / max(1e-6, x)))))
227
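Beat trackers often report half-time or double-time of the perceived tempo, so `_resolve_half_double_tempo` compares the 0.5x/1x/2x interpretations against the reference BPM on a log2 scale and keeps the closest one within a plausible 40-240 BPM range. A dependency-free sketch of the same disambiguation (using `math.log2` in place of numpy):

```python
import math

def resolve_half_double_tempo(bpm_ref: float, bpm_candidate: float) -> float:
    # Consider half-time, original, and double-time interpretations and pick
    # the one closest to the reference tempo on a log2 (octave) scale.
    candidates = [0.5 * bpm_candidate, bpm_candidate, 2.0 * bpm_candidate]
    valid = [v for v in candidates if 40.0 <= v <= 240.0]
    if not valid:
        return float(bpm_candidate)
    return min(valid, key=lambda x: abs(math.log2(max(1e-6, bpm_ref) / max(1e-6, x))))

print(resolve_half_double_tempo(128.0, 64.0))   # double-time wins -> 128.0
print(resolve_half_double_tempo(90.0, 176.0))   # half-time wins -> 88.0
```

The log2 distance treats tempo ratios symmetrically (being a factor of 2 too fast is as "far" as a factor of 2 too slow), which is why it is preferred over a plain BPM difference here.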
+
+
+def _normalized_onset_envelope(y: np.ndarray, sr: int, hop_length: int = 512) -> np.ndarray:
+    if y.size <= 0:
+        return np.zeros((1,), dtype=np.float32)
+    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length).astype(np.float32)
+    if onset.size == 0:
+        return np.zeros((1,), dtype=np.float32)
+    onset = onset - float(np.mean(onset))
+    maximum = float(np.max(np.abs(onset)))
+    if maximum > 1e-9:
+        onset = onset / maximum
+    return onset.astype(np.float32)
+
+
+def _corr_similarity(a: np.ndarray, b: np.ndarray) -> float:
+    n = min(a.size, b.size)
+    if n <= 3:
+        return 0.0
+    a2 = a[:n].astype(np.float32)
+    b2 = b[:n].astype(np.float32)
+    denom = float(np.linalg.norm(a2) * np.linalg.norm(b2))
+    if denom <= 1e-9:
+        return 0.0
+    raw = float(np.dot(a2, b2) / denom)
+    return clamp((raw + 1.0) * 0.5, 0.0, 1.0)
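`_corr_similarity` is cosine similarity between two (mean-centered) onset envelopes, remapped from [-1, 1] to [0, 1] so it can be used directly as a score component. A pure-Python sketch of the same computation on plain lists:

```python
import math

def corr_similarity(a, b):
    # Cosine similarity of two envelopes, remapped from [-1, 1] to [0, 1].
    n = min(len(a), len(b))
    if n <= 3:
        return 0.0  # too few frames for a meaningful correlation
    a, b = a[:n], b[:n]
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    if denom <= 1e-9:
        return 0.0  # a silent/flat envelope has no direction to compare
    raw = sum(x * y for x, y in zip(a, b)) / denom
    return min(1.0, max(0.0, (raw + 1.0) * 0.5))

print(corr_similarity([1, 0, 1, 0], [1, 0, 1, 0]))    # identical, close to 1.0
print(corr_similarity([1, 0, 1, 0], [-1, 0, -1, 0]))  # anti-phase, close to 0.0
```

The (raw + 1) / 2 remap means uncorrelated envelopes score around 0.5 rather than 0, so anti-correlated rhythms are penalized below "no relation" rather than being conflated with it.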
+
+
+def _rms(y: np.ndarray) -> float:
+    if y.size == 0:
+        return 0.0
+    return float(np.sqrt(np.mean(np.square(y, dtype=np.float64))))
+
+
+def _resolve_demucs_device(torch_mod: Any) -> str:
+    pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
+    if pref == "cpu":
+        return "cpu"
+    if pref in {"cuda", "gpu"}:
+        return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
+    return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
+
+
+def _load_demucs_runtime() -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
+    global _DEMUCS_RUNTIME
+    if not _DEMUCS_TRANSITION_ENABLED:
+        return None, {"enabled": False, "status": "disabled", "reason": "AI_DJ_ENABLE_DEMUCS_TRANSITION=0"}
+    if _DEMUCS_RUNTIME is not None:
+        return _DEMUCS_RUNTIME, {
+            "enabled": True,
+            "status": "ready",
+            "model": _DEMUCS_RUNTIME.get("model_name"),
+            "device": _DEMUCS_RUNTIME.get("device"),
+        }
+
+    try:
+        import torch  # type: ignore[reportMissingImports]
+        from demucs.pretrained import get_model  # type: ignore[reportMissingImports]
+
+        model = get_model(_DEMUCS_MODEL_NAME)
+        model.eval()
+        device = _resolve_demucs_device(torch)
+        model.to(device)
+        _DEMUCS_RUNTIME = {
+            "model": model,
+            "torch": torch,
+            "device": device,
+            "model_name": _DEMUCS_MODEL_NAME,
+        }
+        return _DEMUCS_RUNTIME, {
+            "enabled": True,
+            "status": "ready",
+            "model": _DEMUCS_MODEL_NAME,
+            "device": device,
+        }
+    except Exception as exc:
+        LOGGER.warning("Demucs transition runtime unavailable (%s). Falling back to non-stem transition path.", exc)
+        return None, {
+            "enabled": True,
+            "status": "unavailable",
+            "model": _DEMUCS_MODEL_NAME,
+            "reason": str(exc),
+        }
+
+
+def _resample_to(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
+    if int(orig_sr) == int(target_sr):
+        return y.astype(np.float32)
+    if y.size == 0:
+        return np.zeros((0,), dtype=np.float32)
+    return librosa.resample(y.astype(np.float32), orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)
+
+
+def _extract_demucs_stems(y: np.ndarray, sr: int, track_label: str) -> Tuple[Optional[_DemucsStemBundle], Dict[str, Any]]:
+    info: Dict[str, Any] = {
+        "enabled": bool(_DEMUCS_TRANSITION_ENABLED),
+        "track": track_label,
+        "model": _DEMUCS_MODEL_NAME,
+    }
+    if y.size < int(max(1, sr) * 2.0):
+        info["status"] = "skipped-short-audio"
+        return None, info
+
+    runtime, runtime_debug = _load_demucs_runtime()
+    info.update(runtime_debug)
+    if runtime is None:
+        return None, info
+
+    try:
+        from demucs.apply import apply_model  # type: ignore[reportMissingImports]
+
+        torch_mod = runtime["torch"]
+        model = runtime["model"]
+        device = str(runtime.get("device", "cpu"))
+
+        mono = np.asarray(y, dtype=np.float32).reshape(-1)
+        if mono.size == 0:
+            info["status"] = "empty"
+            return None, info
+        peak = float(np.max(np.abs(mono)))
+        if peak > 1e-9:
+            mono = mono / peak
+
+        demucs_sr = int(getattr(model, "samplerate", 44100))
+        work = _resample_to(mono, int(sr), demucs_sr)
+        if work.size < int(max(1, demucs_sr) * 2.0):
+            info["status"] = "skipped-short-audio"
+            return None, info
+
+        stereo = np.stack([work, work], axis=0)
+        mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(device)
+        audio_sec = float(work.size / max(1, demucs_sr))
+        use_split = audio_sec > (_DEMUCS_SEGMENT_SEC + 0.05)
+        segment_sec = float(_DEMUCS_SEGMENT_SEC) if use_split else None
+
+        try:
+            with torch_mod.no_grad():
+                estimates = apply_model(
+                    model,
+                    mix,
+                    shifts=1,
+                    split=use_split,
+                    overlap=0.25,
+                    progress=False,
+                    device=device,
+                    segment=segment_sec,
+                )
+        except Exception as exc:
+            if device == "cuda":
+                model.to("cpu")
+                runtime["device"] = "cpu"
+                device = "cpu"
+                mix = mix.to("cpu")
+                with torch_mod.no_grad():
+                    estimates = apply_model(
+                        model,
+                        mix,
+                        shifts=1,
+                        split=use_split,
+                        overlap=0.25,
+                        progress=False,
+                        device="cpu",
+                        segment=segment_sec,
+                    )
+                info["device_fallback"] = f"cuda->cpu ({exc})"
+            else:
+                raise
+
+        est = estimates.detach().cpu()
+        est = est[0] if est.ndim == 4 else est
+        if est.ndim != 3:
+            raise RuntimeError(f"Unexpected demucs output shape: {tuple(est.shape)}")
+
+        source_names = [str(s) for s in getattr(model, "sources", [])]
+        if not source_names:
+            raise RuntimeError("Demucs returned no source names.")
+        if est.shape[0] != len(source_names):
+            if est.shape[1] == len(source_names):
+                est = est.permute(1, 0, 2)
+            else:
+                raise RuntimeError(f"Demucs source mismatch: shape {tuple(est.shape)}, sources {source_names}")
+
+        def _stem(name: str) -> np.ndarray:
+            if name in source_names:
+                stem = est[source_names.index(name)].mean(dim=0).numpy().astype(np.float32)
+                return _resample_to(stem, demucs_sr, int(sr))
+            return np.zeros((mono.size,), dtype=np.float32)
+
+        vocals = _stem("vocals")
+        drums = _stem("drums")
+        bass = _stem("bass")
+        other = _stem("other")
+        non_vocal_idxs = [i for i, s in enumerate(source_names) if s != "vocals"]
+        if non_vocal_idxs:
+            acc = est[non_vocal_idxs].sum(dim=0).mean(dim=0).numpy().astype(np.float32)
+            accompaniment = _resample_to(acc, demucs_sr, int(sr))
+        else:
+            accompaniment = np.zeros((mono.size,), dtype=np.float32)
+
+        target_n = int(mono.size)
+        vocals = ensure_length(vocals, target_n)
+        drums = ensure_length(drums, target_n)
+        bass = ensure_length(bass, target_n)
+        other = ensure_length(other, target_n)
+        accompaniment = ensure_length(accompaniment, target_n)
+
+        info.update(
+            {
+                "status": "ready",
+                "method": "demucs-transition-stems",
+                "split_mode": "chunked" if use_split else "full-window",
+                "duration_sec": round(float(target_n / max(1, sr)), 3),
+                "has_drums": bool("drums" in source_names),
+                "has_bass": bool("bass" in source_names),
+                "has_other": bool("other" in source_names),
+                "device": runtime.get("device", device),
+            }
+        )
+        return _DemucsStemBundle(
+            vocals=vocals.astype(np.float32),
+            drums=drums.astype(np.float32),
+            bass=bass.astype(np.float32),
+            other=other.astype(np.float32),
+            accompaniment=accompaniment.astype(np.float32),
+            sr=int(sr),
+            method="demucs-transition-stems",
+        ), info
+    except Exception as exc:
+        LOGGER.warning("Demucs stem extraction failed for %s (%s).", track_label, exc)
+        info["status"] = "error"
+        info["reason"] = str(exc)
+        return None, info
+
+
+def _slice_stem_bundle(bundle: Optional[_DemucsStemBundle], start_n: int, length_n: int) -> Optional[_DemucsStemBundle]:
+    if bundle is None:
+        return None
+    s = int(max(0, start_n))
+    n = int(max(0, length_n))
+    e = s + n
+    return _DemucsStemBundle(
+        vocals=ensure_length(bundle.vocals[s:e], n),
+        drums=ensure_length(bundle.drums[s:e], n),
+        bass=ensure_length(bundle.bass[s:e], n),
+        other=ensure_length(bundle.other[s:e], n),
+        accompaniment=ensure_length(bundle.accompaniment[s:e], n),
+        sr=int(bundle.sr),
+        method=bundle.method,
+    )
+
+
+def _seconds_to_beats(seconds: float, bpm: float) -> float:
+    return float(seconds) * (float(bpm) / 60.0)
+
+
+def _beats_to_seconds(beats: float, bpm: float) -> float:
+    return float(beats) * (60.0 / max(1e-6, float(bpm)))
+
+
+def _quantize_seconds_to_beats(
+    raw_sec: float,
+    bpm: float,
+    min_sec: float,
+    max_sec: float,
+    beat_step: int,
+    min_beats: int,
+) -> Tuple[float, int, float]:
+    raw_sec = float(clamp(raw_sec, min_sec, max_sec))
+    if bpm <= 1e-6:
+        return raw_sec, int(round(_seconds_to_beats(raw_sec, 120.0))), _seconds_to_beats(raw_sec, 120.0)
+
+    raw_beats = _seconds_to_beats(raw_sec, bpm)
+    step = max(1, int(beat_step))
+    min_beats_i = max(1, int(min_beats))
+    max_allowed_beats = _seconds_to_beats(max_sec, bpm)
+    max_beats_i = int(max(min_beats_i, np.floor(max_allowed_beats / step) * step))
+    quant_beats = int(round(raw_beats / step) * step)
+    quant_beats = int(clamp(float(quant_beats), float(min_beats_i), float(max_beats_i)))
+    quant_sec = float(clamp(_beats_to_seconds(quant_beats, bpm), min_sec, max_sec))
+    return quant_sec, quant_beats, raw_beats
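The quantizer above snaps a requested duration onto the beat grid: convert seconds to beats at the given BPM, round to the nearest multiple of `beat_step`, clamp into the allowed beat range, then convert back to seconds. A minimal dependency-free sketch of the same snap-to-grid logic (the function name is illustrative and the original's third return value is simplified away):

```python
def quantize_seconds_to_beats(raw_sec, bpm, min_sec, max_sec, beat_step, min_beats):
    # Convert seconds to beats, snap to the nearest multiple of beat_step,
    # clamp into [min_beats, largest step multiple that fits max_sec],
    # then convert the quantized beat count back to seconds.
    clamp = lambda v, lo, hi: max(lo, min(hi, v))
    raw_sec = clamp(raw_sec, min_sec, max_sec)
    raw_beats = raw_sec * bpm / 60.0
    step = max(1, beat_step)
    max_beats = int(max(min_beats, (max_sec * bpm / 60.0) // step * step))
    q = int(round(raw_beats / step) * step)
    q = int(clamp(q, max(1, min_beats), max_beats))
    return clamp(q * 60.0 / bpm, min_sec, max_sec), q

# 6.5 s at 120 BPM is 13 beats; snapping to 4-beat bars gives 12 beats = 6.0 s.
print(quantize_seconds_to_beats(6.5, 120.0, 1.0, 20.0, 4, 2))  # -> (6.0, 12)
```

Locking the pre/seam/post segments to whole bars this way is what keeps the generated transition phrase-aligned with both songs instead of drifting off the grid by a fraction of a beat.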
+
+
+def _phrase_lock_transition_shape(pre_sec: float, seam_sec: float, post_sec: float, bpm: float) -> Dict[str, Any]:
+    pre_locked_sec, pre_beats, pre_raw_beats = _quantize_seconds_to_beats(
+        raw_sec=pre_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=20.0,
+        beat_step=4,
+        min_beats=2,
+    )
+
+    seam_raw_beats = _seconds_to_beats(seam_sec, bpm)
+    seam_step = 8 if seam_raw_beats >= 8.0 else 4
+    seam_locked_sec, seam_beats, _ = _quantize_seconds_to_beats(
+        raw_sec=seam_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=40.0,
+        beat_step=seam_step,
+        min_beats=2,
+    )
+
+    post_locked_sec, post_beats, post_raw_beats = _quantize_seconds_to_beats(
+        raw_sec=post_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=20.0,
+        beat_step=4,
+        min_beats=2,
+    )
+
+    return {
+        "pre_sec": pre_locked_sec,
+        "seam_sec": seam_locked_sec,
+        "post_sec": post_locked_sec,
+        "debug": {
+            "bpm_ref": round(float(bpm), 3),
+            "pre": {
+                "raw_sec": round(float(pre_sec), 3),
+                "locked_sec": round(float(pre_locked_sec), 3),
+                "raw_beats": round(float(pre_raw_beats), 3),
+                "locked_beats": int(pre_beats),
+                "beat_step": 4,
+            },
+            "seam": {
+                "raw_sec": round(float(seam_sec), 3),
+                "locked_sec": round(float(seam_locked_sec), 3),
+                "raw_beats": round(float(seam_raw_beats), 3),
+                "locked_beats": int(seam_beats),
+                "beat_step": int(seam_step),
+            },
+            "post": {
+                "raw_sec": round(float(post_sec), 3),
+                "locked_sec": round(float(post_locked_sec), 3),
+                "raw_beats": round(float(post_raw_beats), 3),
+                "locked_beats": int(post_beats),
+                "beat_step": 4,
+            },
+        },
+    }
+
+
+def _stft_band_split(y: np.ndarray, sr: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
+    n = int(y.size)
+    if n <= 0:
+        z = np.zeros((0,), dtype=np.float32)
+        return z, z, z
+
+    n_fft = 2048 if n >= 2048 else 1024
+    hop = max(128, n_fft // 4)
+    y2 = ensure_length(y.astype(np.float32), max(n, n_fft))
+
+    D = librosa.stft(y2, n_fft=n_fft, hop_length=hop)
+    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
+    low_mask = (freqs <= 180.0).astype(np.float32)[:, None]
+    mid_mask = ((freqs > 180.0) & (freqs <= 2500.0)).astype(np.float32)[:, None]
+    high_mask = (freqs > 2500.0).astype(np.float32)[:, None]
+
+    low = librosa.istft(D * low_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    mid = librosa.istft(D * mid_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    high = librosa.istft(D * high_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    return low[:n], mid[:n], high[:n]
+
+
+def _dj_style_seam_mix(a_tail: np.ndarray, b_head: np.ndarray, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
+    n = min(int(a_tail.size), int(b_head.size))
+    if n <= 0:
+        return np.zeros((0,), dtype=np.float32), {"method": "empty-input-fallback"}
+
+    a = a_tail[:n].astype(np.float32)
+    b = b_head[:n].astype(np.float32)
+    try:
+        a_low, a_mid, a_high = _stft_band_split(a, sr=sr)
+        b_low, b_mid, b_high = _stft_band_split(b, sr=sr)
+    except Exception as exc:
+        LOGGER.warning("Band-split seam mixing failed (%s); using equal crossfade.", exc)
+        return crossfade_equal_length(a, b), {"method": "crossfade-fallback", "error": str(exc)}
+
+    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
+    high_in = x
+    mid_in = np.power(x, 1.15).astype(np.float32)
+    # Delay low-end handoff so kick/bass do not collide early.
+    low_in = np.clip((x - 0.58) / 0.30, 0.0, 1.0).astype(np.float32)
+
+    seam = (
+        (a_high * (1.0 - high_in))
+        + (b_high * high_in)
+        + (a_mid * (1.0 - mid_in))
+        + (b_mid * mid_in)
+        + (a_low * (1.0 - low_in))
+        + (b_low * low_in)
+    ).astype(np.float32)
+
+    return seam, {
+        "method": "dj-eq-bass-swap",
+        "low_handoff": {"start_ratio": 0.58, "end_ratio": 0.88},
+        "bands_hz": {"low_max": 180, "mid_max": 2500},
+    }
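The seam mix imitates a DJ's EQ-style bass swap: the two tracks are split into low/mid/high bands, the highs crossfade linearly, the mids slightly slower, and the low band of the incoming track is held at zero until 58% of the seam so the two kick/bass lines never overlap. A dependency-free sketch of just the gain curves (the helper name and the fixed 101-point grid are illustrative):

```python
def seam_gains(n, low_start=0.58, low_width=0.30):
    # Per-band fade-in gains for the incoming track across the seam:
    # highs crossfade linearly, mids a touch slower (x ** 1.15),
    # lows are held back until `low_start`, then ramp in over `low_width`.
    xs = [i / (n - 1) for i in range(n)]
    high = xs
    mid = [x ** 1.15 for x in xs]
    low = [min(1.0, max(0.0, (x - low_start) / low_width)) for x in xs]
    return high, mid, low

high, mid, low = seam_gains(101)
print(low[50], low[88])  # bass still silent at the midpoint, fully in by 88%
```

The outgoing track uses the complementary `1 - gain` curves per band, so the total per-band energy stays roughly constant while the low-end ownership flips cleanly from A to B.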
+
627
+
628
+ def _build_theme_reference_audio(
629
+ a_pre: np.ndarray,
630
+ a_tail: np.ndarray,
631
+ b_head: np.ndarray,
632
+ b_post: np.ndarray,
633
+ sr: int,
634
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
635
+ a_ctx = np.concatenate([a_pre, a_tail]).astype(np.float32)
636
+ b_ctx = np.concatenate([b_head, b_post]).astype(np.float32)
637
+
638
+ a_take_n = min(a_ctx.size, int(round(12.0 * sr)))
639
+ b_take_n = min(b_ctx.size, int(round(12.0 * sr)))
640
+ if a_take_n <= 0 or b_take_n <= 0:
641
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "insufficient_context"}
642
+
643
+ a_seg = a_ctx[-a_take_n:]
644
+ b_seg = b_ctx[:b_take_n]
645
+ overlap_n = min(int(round(0.45 * sr)), a_seg.size // 4, b_seg.size // 4)
646
+ if overlap_n > 0:
647
+ seam = crossfade_equal_length(a_seg[-overlap_n:], b_seg[:overlap_n])
648
+ ref = np.concatenate([a_seg[:-overlap_n], seam, b_seg[overlap_n:]]).astype(np.float32)
649
+ else:
650
+ ref = np.concatenate([a_seg, b_seg]).astype(np.float32)
651
+
652
+ ref = normalize_peak(apply_edge_fades(ref, sr=sr, fade_ms=20.0), peak=0.98)
653
+ return ref, {
654
+ "enabled": True,
655
+ "method": "a-tail-b-head-theme-ref",
656
+ "duration_sec": round(float(ref.size / max(1, sr)), 3),
657
+ "segments_sec": {
658
+ "song_a": round(float(a_seg.size / max(1, sr)), 3),
659
+ "song_b": round(float(b_seg.size / max(1, sr)), 3),
660
+ "overlap": round(float(overlap_n / max(1, sr)), 3),
661
+ },
662
+ }
663
+
664
+
665
+ def _left_pad_to_length(y: np.ndarray, target_n: int) -> np.ndarray:
666
+ target_n = int(max(0, target_n))
667
+ if y.size >= target_n:
668
+ return y[-target_n:].astype(np.float32)
669
+ return np.pad(y.astype(np.float32), (target_n - y.size, 0), mode="constant")
670
+
671
+
672
+ def _crossfade_join(a: np.ndarray, b: np.ndarray, fade_n: int) -> np.ndarray:
673
+ if a.size <= 0:
674
+ return b.astype(np.float32)
675
+ if b.size <= 0:
676
+ return a.astype(np.float32)
677
+ n = int(max(0, fade_n))
678
+ n = min(n, int(a.size), int(b.size))
679
+ if n <= 0:
680
+ return np.concatenate([a, b]).astype(np.float32)
681
+ seam = crossfade_equal_length(a[-n:], b[:n])
682
+ return np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)
683
+
684
+
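`_crossfade_join` delegates the seam itself to `crossfade_equal_length`, whose body is outside this diff. As a point of reference, an equal-power version of such a seam blend (an assumption about the helper's fade law, not a confirmed detail of this project) can be sketched as:

```python
import numpy as np

def equal_power_crossfade(a_tail: np.ndarray, b_head: np.ndarray) -> np.ndarray:
    """Blend two equal-length mono segments with an equal-power law.

    sin/cos gains keep the summed energy roughly constant across the seam;
    a plain linear fade dips by about 3 dB at the midpoint instead.
    """
    if a_tail.size != b_head.size:
        raise ValueError("segments must have equal length")
    t = np.linspace(0.0, 1.0, a_tail.size, dtype=np.float32)
    gain_out = np.cos(0.5 * np.pi * t)  # outgoing gain: 1 -> 0
    gain_in = np.sin(0.5 * np.pi * t)   # incoming gain: 0 -> 1
    return (a_tail * gain_out + b_head * gain_in).astype(np.float32)
```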
685
+ def _build_period_reference_audio(period: np.ndarray, sr: int, source_mode: str = "full-period-a") -> Tuple[np.ndarray, Dict[str, Any]]:
686
+ if period.size <= 0:
687
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-reference-period"}
688
+ ref = normalize_peak(apply_edge_fades(period.astype(np.float32), sr=sr, fade_ms=20.0), peak=0.98)
689
+ return ref, {
690
+ "enabled": True,
691
+ "method": "opposite-transition-period-reference",
692
+ "source_mode": str(source_mode),
693
+ "duration_sec": round(float(ref.size / max(1, sr)), 3),
694
+ }
695
+
696
+
697
+ def _apply_transition_low_duck(
698
+ y: np.ndarray,
699
+ sr: int,
700
+ duck_floor: float = 0.14,
701
+ fade_out_end: float = 0.42,
702
+ fade_in_start: float = 0.72,
703
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
704
+ n = int(y.size)
705
+ if n <= 0:
706
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-audio"}
707
+
708
+ try:
709
+ low, mid, high = _stft_band_split(y.astype(np.float32), sr=sr)
710
+ except Exception as exc:
711
+ LOGGER.warning("Low-duck split failed (%s); skip ducking.", exc)
712
+ return y.astype(np.float32), {"enabled": False, "reason": "split-failed", "error": str(exc)}
713
+
714
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
715
+ out_end = float(clamp(fade_out_end, 0.1, 0.9))
716
+ in_start = float(clamp(max(out_end + 0.05, fade_in_start), 0.15, 0.95))
717
+ floor = float(clamp(duck_floor, 0.03, 0.5))
718
+
719
+ low_gain = np.full((n,), floor, dtype=np.float32)
720
+ entry_mask = x <= out_end
721
+ if np.any(entry_mask):
722
+ low_gain[entry_mask] = (1.0 - ((x[entry_mask] / max(1e-6, out_end)) * (1.0 - floor))).astype(np.float32)
723
+ exit_mask = x >= in_start
724
+ if np.any(exit_mask):
725
+ ramp = (x[exit_mask] - in_start) / max(1e-6, (1.0 - in_start))
726
+ low_gain[exit_mask] = (floor + (ramp * (1.0 - floor))).astype(np.float32)
727
+
728
+ y_out = (low * low_gain) + mid + high
729
+ y_out = y_out.astype(np.float32)
730
+ return y_out, {
731
+ "enabled": True,
732
+ "method": "low-duck-center",
733
+ "duck_floor": round(float(floor), 4),
734
+ "fade_out_end_ratio": round(float(out_end), 4),
735
+ "fade_in_start_ratio": round(float(in_start), 4),
736
+ }
737
+
738
+
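`_apply_transition_low_duck` builds a piecewise low-band gain: unity falling to `duck_floor` by `fade_out_end`, a held floor through the center, then a ramp back to unity from `fade_in_start`. The envelope in isolation, with the same default break points (a standalone sketch, not the project's code):

```python
import numpy as np

def low_duck_gain(n: int, floor: float = 0.14,
                  out_end: float = 0.42, in_start: float = 0.72) -> np.ndarray:
    """Low-band gain over n samples: 1 -> floor by out_end,
    hold at floor, then floor -> 1 from in_start to the end."""
    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    gain = np.full(n, floor, dtype=np.float32)
    entry = x <= out_end
    gain[entry] = 1.0 - (x[entry] / out_end) * (1.0 - floor)   # fade lows out
    exit_ = x >= in_start
    gain[exit_] = floor + ((x[exit_] - in_start) / (1.0 - in_start)) * (1.0 - floor)  # bring lows back
    return gain
```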
739
+ def _build_one_bassline_stem_period(
740
+ period_a: np.ndarray,
741
+ period_b: np.ndarray,
742
+ stems_a: Optional[_DemucsStemBundle],
743
+ stems_b: Optional[_DemucsStemBundle],
744
+ ) -> Tuple[Optional[np.ndarray], Dict[str, Any]]:
745
+ if stems_a is None or stems_b is None:
746
+ return None, {"enabled": False, "reason": "missing-stems"}
747
+ n = min(
748
+ int(period_a.size),
749
+ int(period_b.size),
750
+ int(stems_a.vocals.size),
751
+ int(stems_b.vocals.size),
752
+ int(stems_a.bass.size),
753
+ int(stems_b.bass.size),
754
+ )
755
+ if n <= 0:
756
+ return None, {"enabled": False, "reason": "empty-period"}
757
+
758
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
759
+ bass_in = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
760
+ # Keep the lows lighter in the center of the seam, then restore them toward each edge.
761
+ center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
762
+
763
+ bass_mix = ((stems_a.bass[:n] * (1.0 - bass_in)) + (stems_b.bass[:n] * bass_in)).astype(np.float32)
764
+ bass_mix = (bass_mix * center_bass_shape).astype(np.float32)
765
+
766
+ acc_a = (stems_a.accompaniment[:n] - stems_a.bass[:n]).astype(np.float32)
767
+ acc_b = (stems_b.accompaniment[:n] - stems_b.bass[:n]).astype(np.float32)
768
+ inst_mix = ((acc_a * (1.0 - x)) + (acc_b * x)).astype(np.float32)
769
+
770
+ vocal_side = np.where(x < 0.5, stems_a.vocals[:n], stems_b.vocals[:n]).astype(np.float32)
771
+ vocal_shape = np.where(
772
+ x < 0.5,
773
+ np.clip(1.0 - ((x / 0.5) * 0.75), 0.25, 1.0),
774
+ np.clip(((x - 0.5) / 0.5) * 0.75 + 0.25, 0.25, 1.0),
775
+ ).astype(np.float32)
776
+ vocals_mix = (vocal_side * vocal_shape * 0.26).astype(np.float32)
777
+
778
+ stem_mix = (inst_mix + bass_mix + vocals_mix).astype(np.float32)
779
+ return stem_mix, {
780
+ "enabled": True,
781
+ "method": "demucs-one-bassline-rule",
782
+ "bass_handoff": {"start_ratio": 0.60, "end_ratio": 0.88},
783
+ "center_bass_floor": 0.35,
784
+ "vocal_sidechain_gain": 0.26,
785
+ }
786
+
787
+
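The "one bassline" rule above hands the bass stem from Song A to Song B late in the seam: the incoming bass gain stays at zero until 60% of the way through and reaches full level at 88%, so only one bassline dominates at any moment. The ramp in isolation (a minimal sketch of the `np.clip` scheduling used above):

```python
import numpy as np

def late_bass_handoff(x: np.ndarray, start: float = 0.60, end: float = 0.88) -> np.ndarray:
    """Gain ramp for the incoming bass stem over normalized time x in [0, 1]:
    0 before `start`, linear between `start` and `end`, 1 afterwards."""
    return np.clip((x - start) / (end - start), 0.0, 1.0).astype(np.float32)
```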
788
+ def _build_src_transition_period(
789
+ period_a: np.ndarray,
790
+ period_b: np.ndarray,
791
+ sr: int,
792
+ ) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
793
+ return _build_src_transition_period_with_stems(period_a, period_b, sr=sr, stems_a=None, stems_b=None)
794
+
795
+
796
+ def _build_src_transition_period_with_stems(
797
+ period_a: np.ndarray,
798
+ period_b: np.ndarray,
799
+ sr: int,
800
+ stems_a: Optional[_DemucsStemBundle] = None,
801
+ stems_b: Optional[_DemucsStemBundle] = None,
802
+ ) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
803
+ directional, directional_debug = _dj_style_seam_mix(period_a, period_b, sr=sr)
804
+ n = int(min(period_a.size, period_b.size))
805
+ if n > 0:
806
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
807
+ guide = ((period_a[:n] * (1.0 - x)) + (period_b[:n] * x)).astype(np.float32)
808
+ src_period = ((0.70 * directional[:n]) + (0.30 * guide)).astype(np.float32)
809
+ else:
810
+ src_period = directional.astype(np.float32)
811
+
812
+ demucs_mix, demucs_mix_debug = _build_one_bassline_stem_period(
813
+ period_a=period_a,
814
+ period_b=period_b,
815
+ stems_a=stems_a,
816
+ stems_b=stems_b,
817
+ )
818
+ if demucs_mix is not None and demucs_mix.size > 0:
819
+ src_period = ((0.54 * src_period[: demucs_mix.size]) + (0.46 * demucs_mix)).astype(np.float32)
820
+ if src_period.size < n:
821
+ src_period = ensure_length(src_period, n)
822
+
823
+ use_acc_ref = _REF_AUDIO_MODE in {"accompaniment-only", "accompaniment", "inst-only", "instrumental-only"}
824
+ if use_acc_ref and stems_a is not None and stems_a.accompaniment.size > 0:
825
+ reference_period = ensure_length(stems_a.accompaniment.astype(np.float32), int(period_a.size))
826
+ ref_mode = "accompaniment-only"
827
+ else:
828
+ reference_period = period_a.astype(np.float32)
829
+ ref_mode = "full-period-a"
830
+ dominant = "song_b"
831
+
832
+ src_period, low_duck_debug = _apply_transition_low_duck(src_period, sr=sr)
833
+ src_period = normalize_peak(src_period, peak=0.99)
834
+ return src_period, reference_period, {
835
+ "method": "bar-period-layered-repaint-src-fixed-b-base",
836
+ "base_mode": "B-base-fixed",
837
+ "dominant_period": dominant,
838
+ "demucs_one_bassline": demucs_mix_debug,
839
+ "reference_mode": ref_mode,
840
+ "guide_mix": {
841
+ "enabled": True,
842
+ "weight_directional": 0.70,
843
+ "weight_time_direction_guide": 0.30,
844
+ "behavior": "more-song-a-detail-at-entry-more-song-b-at-exit",
845
+ },
846
+ "directional_mix": directional_debug,
847
+ "transition_low_profile": low_duck_debug,
848
+ }
849
+
850
+
851
+ def _crossfade_join_frequency_aware(a: np.ndarray, b: np.ndarray, fade_n: int, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
852
+ if a.size <= 0:
853
+ return b.astype(np.float32), {"method": "prepend-empty"}
854
+ if b.size <= 0:
855
+ return a.astype(np.float32), {"method": "append-empty"}
856
+
857
+ n = int(max(0, fade_n))
858
+ n = min(n, int(a.size), int(b.size))
859
+ if n <= 0:
860
+ return np.concatenate([a, b]).astype(np.float32), {"method": "no-fade"}
861
+
862
+ seg_a = a[-n:].astype(np.float32)
863
+ seg_b = b[:n].astype(np.float32)
864
+ seam, seam_debug = _dj_style_seam_mix(seg_a, seg_b, sr=sr)
865
+ out = np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)
866
+ return out, {"method": "frequency-aware-join", "fade_samples": int(n), "seam": seam_debug}
867
+
868
+
869
+ def _post_repaint_stem_correction(
870
+ transition: np.ndarray,
871
+ sr: int,
872
+ anchor_a: Optional[_DemucsStemBundle] = None,
873
+ anchor_b: Optional[_DemucsStemBundle] = None,
874
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
875
+ y = transition.astype(np.float32)
876
+ if y.size <= 0:
877
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-transition"}
878
+
879
+ stems, demucs_debug = _extract_demucs_stems(y, int(sr), track_label="post-repaint-transition")
880
+ if stems is None:
881
+ return y, {"enabled": False, "reason": "demucs-unavailable", "demucs": demucs_debug}
882
+
883
+ n = int(min(stems.vocals.size, stems.drums.size, stems.bass.size, stems.other.size, y.size))
884
+ if n <= 0:
885
+ return y, {"enabled": False, "reason": "empty-stems", "demucs": demucs_debug}
886
+
887
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
888
+ center = np.clip(np.minimum(x, 1.0 - x) / 0.18, 0.0, 1.0).astype(np.float32)
889
+
890
+ bass_cur = max(1e-5, _rms(stems.bass[:n]))
891
+ bass_ref_a = _rms(anchor_a.bass) if anchor_a is not None else bass_cur
892
+ bass_ref_b = _rms(anchor_b.bass) if anchor_b is not None else bass_cur
893
+ bass_gain_a = float(clamp(bass_ref_a / bass_cur, 0.65, 1.15))
894
+ bass_gain_b = float(clamp(bass_ref_b / bass_cur, 0.65, 1.15))
895
+ bass_linear = ((1.0 - x) * bass_gain_a) + (x * bass_gain_b)
896
+ bass_center_shape = (0.72 + (0.28 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
897
+ bass_gain = (bass_linear * bass_center_shape).astype(np.float32)
898
+
899
+ vocal_cur = max(1e-5, _rms(stems.vocals[:n]))
900
+ vocal_ref_a = _rms(anchor_a.vocals) if anchor_a is not None else vocal_cur
901
+ vocal_ref_b = _rms(anchor_b.vocals) if anchor_b is not None else vocal_cur
902
+ vocal_gain_a = float(clamp(vocal_ref_a / vocal_cur, 0.42, 1.0))
903
+ vocal_gain_b = float(clamp(vocal_ref_b / vocal_cur, 0.42, 1.0))
904
+ vocal_linear = ((1.0 - x) * vocal_gain_a) + (x * vocal_gain_b)
905
+ vocal_boundary_shape = (0.72 + (0.28 * center)).astype(np.float32)
906
+ vocal_gain = (vocal_linear * vocal_boundary_shape).astype(np.float32)
907
+
908
+ drum_gain = (1.05 - (0.08 * center)).astype(np.float32)
909
+ other_gain = 1.0
910
+
911
+ corrected = (
912
+ (stems.vocals[:n] * vocal_gain)
913
+ + (stems.drums[:n] * drum_gain)
914
+ + (stems.bass[:n] * bass_gain)
915
+ + (stems.other[:n] * other_gain)
916
+ ).astype(np.float32)
917
+ corrected = ensure_length(corrected, int(y.size))
918
+ return corrected, {
919
+ "enabled": True,
920
+ "method": "demucs-post-repaint-boundary-rebalance",
921
+ "demucs": demucs_debug,
922
+ "gains": {
923
+ "bass_start": round(float(bass_gain_a), 4),
924
+ "bass_end": round(float(bass_gain_b), 4),
925
+ "vocal_start": round(float(vocal_gain_a), 4),
926
+ "vocal_end": round(float(vocal_gain_b), 4),
927
+ "drum_edge_boost": 1.05,
928
+ },
929
+ "anchor_rms": {
930
+ "bass_a": round(float(bass_ref_a), 6),
931
+ "bass_b": round(float(bass_ref_b), 6),
932
+ "vocal_a": round(float(vocal_ref_a), 6),
933
+ "vocal_b": round(float(vocal_ref_b), 6),
934
+ },
935
+ }
936
+
937
+
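`_post_repaint_stem_correction` rebalances each stem toward anchor RMS levels using clamped gains (e.g. bass clamped to [0.65, 1.15], vocals to [0.42, 1.0]) so the correction never overshoots. The per-stem gain rule in isolation (a standalone sketch with a local `rms` helper standing in for the project's `_rms`):

```python
import numpy as np

def rms(y: np.ndarray) -> float:
    """Root-mean-square level of a mono buffer (0.0 for an empty buffer)."""
    return float(np.sqrt(np.mean(np.square(y)))) if y.size else 0.0

def matched_stem_gain(ref: np.ndarray, cur: np.ndarray,
                      lo: float = 0.65, hi: float = 1.15) -> float:
    """Gain that moves the current stem's RMS toward the reference stem's RMS,
    clamped to [lo, hi] so the rebalance stays gentle."""
    cur_rms = max(1e-5, rms(cur))
    return float(np.clip(rms(ref) / cur_rms, lo, hi))
```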
938
+ def _assemble_substitute_mix(
939
+ song_a_prefix: np.ndarray,
940
+ transition: np.ndarray,
941
+ song_b_suffix: np.ndarray,
942
+ boundary_fade_n: int = 0,
943
+ sr: int = DEFAULT_TARGET_SR,
944
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
945
+ a = song_a_prefix.astype(np.float32) if song_a_prefix.size > 0 else np.zeros((0,), dtype=np.float32)
946
+ t = transition.astype(np.float32) if transition.size > 0 else np.zeros((0,), dtype=np.float32)
947
+ b = song_b_suffix.astype(np.float32) if song_b_suffix.size > 0 else np.zeros((0,), dtype=np.float32)
948
+ joined, entry_debug = _crossfade_join_frequency_aware(a, t, boundary_fade_n, sr=sr)
949
+ joined, exit_debug = _crossfade_join_frequency_aware(joined, b, boundary_fade_n, sr=sr)
950
+ return joined.astype(np.float32), {
951
+ "method": "dual-frequency-aware-boundary-joins",
952
+ "entry": entry_debug,
953
+ "exit": exit_debug,
954
+ }
955
+
956
+
957
+ def _align_b_window_to_a_tail(
958
+ a_tail: np.ndarray,
959
+ y_b_stretched: np.ndarray,
960
+ nominal_start_n: int,
961
+ seam_n: int,
962
+ post_n: int,
963
+ sr: int,
964
+ bpm_ref: float,
965
+ a_tail_drums: Optional[np.ndarray] = None,
966
+ y_b_stretched_drums: Optional[np.ndarray] = None,
967
+ ) -> Tuple[np.ndarray, int, Dict[str, Any]]:
968
+ total_n = seam_n + post_n
969
+ if y_b_stretched.size < total_n:
970
+ return ensure_length(y_b_stretched, total_n), 0, {
971
+ "method": "short-buffer-fallback",
972
+ "candidate_count": 0,
973
+ }
974
+
975
+ beat_sec = 60.0 / max(1e-6, float(bpm_ref))
976
+ search_sec = clamp(0.75 * beat_sec, 0.2, 1.2)
977
+ search_n = int(round(search_sec * sr))
978
+
979
+ nominal_start_n = int(clamp(float(nominal_start_n), 0.0, float(max(0, y_b_stretched.size - total_n))))
980
+ lo = max(0, nominal_start_n - search_n)
981
+ hi = min(y_b_stretched.size - total_n, nominal_start_n + search_n)
982
+
983
+ _, beat_times_stretched = estimate_bpm_and_beats(y_b_stretched, sr)
984
+ candidates: List[int] = []
985
+ for bt in beat_times_stretched:
986
+ idx = int(round(float(bt) * sr))
987
+ if lo <= idx <= hi:
988
+ candidates.append(idx)
989
+ candidates.append(nominal_start_n)
990
+ candidates = sorted(set(candidates))
991
+ if not candidates:
992
+ candidates = [nominal_start_n]
993
+
994
+ use_drum_alignment = (
995
+ isinstance(a_tail_drums, np.ndarray)
996
+ and isinstance(y_b_stretched_drums, np.ndarray)
997
+ and int(a_tail_drums.size) >= int(seam_n)
998
+ and int(y_b_stretched_drums.size) >= int(y_b_stretched.size)
999
+ )
1000
+
1001
+ onset_a_mix = _normalized_onset_envelope(a_tail, sr)
1002
+ onset_a_drum = _normalized_onset_envelope(a_tail_drums[:seam_n], sr) if use_drum_alignment else onset_a_mix
1003
+ rms_a = _rms(a_tail)
1004
+ drum_rms_a = _rms(a_tail_drums[:seam_n]) if use_drum_alignment else 0.0
1005
+ best_idx = candidates[0]
1006
+ best_score = -1.0
1007
+ best_components = {"onset_mix": 0.0, "onset_drum": 0.0, "energy": 0.0, "drum_energy": 0.0, "distance": 0.0}
1008
+
1009
+ distance_scale = max(1.0, 0.65 * search_n)
1010
+ for idx in candidates:
1011
+ seg = ensure_length(y_b_stretched[idx : idx + total_n], total_n)
1012
+ b_head = seg[:seam_n]
1013
+
1014
+ onset_b_mix = _normalized_onset_envelope(b_head, sr)
1015
+ onset_score_mix = _corr_similarity(onset_a_mix, onset_b_mix)
1016
+ onset_score_drum = onset_score_mix
1017
+ drum_energy_score = 0.5
1018
+ onset_score = onset_score_mix
1019
+ if use_drum_alignment:
1020
+ seg_drums = ensure_length(y_b_stretched_drums[idx : idx + total_n], total_n)
1021
+ b_head_drums = seg_drums[:seam_n]
1022
+ onset_b_drum = _normalized_onset_envelope(b_head_drums, sr)
1023
+ onset_score_drum = _corr_similarity(onset_a_drum, onset_b_drum)
1024
+ onset_score = (0.78 * onset_score_drum) + (0.22 * onset_score_mix)
1025
+ drum_rms_b = _rms(b_head_drums)
1026
+ drum_gap = abs(drum_rms_a - drum_rms_b) / max(1e-4, drum_rms_a)
1027
+ drum_energy_score = clamp(1.0 - drum_gap, 0.0, 1.0)
1028
+
1029
+ rms_b = _rms(b_head)
1030
+ energy_gap = abs(rms_a - rms_b) / max(1e-4, rms_a)
1031
+ energy_score = clamp(1.0 - energy_gap, 0.0, 1.0)
1032
+
1033
+ dist = abs(idx - nominal_start_n)
1034
+ distance_score = float(np.exp(-dist / distance_scale))
1035
+
1036
+ if use_drum_alignment:
1037
+ score = (0.62 * onset_score) + (0.18 * energy_score) + (0.10 * drum_energy_score) + (0.10 * distance_score)
1038
+ else:
1039
+ score = (0.56 * onset_score) + (0.26 * energy_score) + (0.18 * distance_score)
1040
+ if score > best_score:
1041
+ best_score = float(score)
1042
+ best_idx = int(idx)
1043
+ best_components = {
1044
+ "onset_mix": float(onset_score_mix),
1045
+ "onset_drum": float(onset_score_drum),
1046
+ "energy": float(energy_score),
1047
+ "drum_energy": float(drum_energy_score),
1048
+ "distance": float(distance_score),
1049
+ }
1050
+
1051
+ aligned = ensure_length(y_b_stretched[best_idx : best_idx + total_n], total_n)
1052
+ return aligned, best_idx, {
1053
+ "method": "drum-led-beat-phase-transient-align" if use_drum_alignment else "beat-phase-transient-align",
1054
+ "used_drum_stems": bool(use_drum_alignment),
1055
+ "candidate_count": len(candidates),
1056
+ "search_sec": round(float(search_sec), 4),
1057
+ "search_samples": int(search_n),
1058
+ "nominal_start_sample": int(nominal_start_n),
1059
+ "best_start_sample": int(best_idx),
1060
+ "best_score": round(float(best_score), 6),
1061
+ "score_components": {k: round(float(v), 6) for k, v in best_components.items()},
1062
+ }
1063
+
1064
+
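`_align_b_window_to_a_tail` ranks beat-aligned candidate start points by a weighted sum of onset-envelope similarity, RMS closeness, and proximity to the nominal cue. The no-drum scoring branch reduces to something like the following (hypothetical standalone helper; `onset_sim` stands in for the output of `_corr_similarity`):

```python
import numpy as np

def alignment_score(onset_sim: float, rms_a: float, rms_b: float,
                    dist: int, distance_scale: float) -> float:
    """Score one candidate with the no-drum weights (0.56 / 0.26 / 0.18)."""
    energy_gap = abs(rms_a - rms_b) / max(1e-4, rms_a)
    energy_score = float(np.clip(1.0 - energy_gap, 0.0, 1.0))
    # Exponential falloff: candidates far from the nominal cue are penalized.
    distance_score = float(np.exp(-dist / distance_scale))
    return (0.56 * onset_sim) + (0.26 * energy_score) + (0.18 * distance_score)
```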
1065
+ def _prepare_rough_transition(request: TransitionRequest) -> Dict[str, Any]:
1066
+ pre_sec_raw = clamp(request.pre_context_sec, 1.0, 20.0)
1067
+ post_sec_raw = clamp(request.post_context_sec, 1.0, 20.0)
1068
+ analysis_sec = clamp(request.analysis_sec, 10.0, 120.0)
1069
+
1070
+ target_sr = int(request.target_sr)
1071
+
1072
+ dur_a = ffprobe_duration_sec(request.song_a_path)
1073
+ dur_b = ffprobe_duration_sec(request.song_b_path)
1074
+
1075
+ a_analysis_start = max(0.0, float(dur_a) - analysis_sec) if dur_a is not None else 0.0
1076
+
1077
+ y_a_an, sr_a = decode_segment(request.song_a_path, a_analysis_start, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
1078
+ y_b_an, sr_b = decode_segment(request.song_b_path, 0.0, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
1079
+ bpm_a, beats_a = estimate_bpm_and_beats(y_a_an, sr_a)
1080
+ bpm_b, beats_b = estimate_bpm_and_beats(y_b_an, sr_b)
1081
+
1082
+ if request.bpm_target is not None and 40.0 <= float(request.bpm_target) <= 220.0:
1083
+ bpm_a = float(request.bpm_target)
1084
+
1085
+ bpm_a = float(bpm_a) if bpm_a is not None else 120.0
1086
+ bpm_b_detected = float(bpm_b) if bpm_b is not None else 120.0
1087
+ bpm_b_for_alignment = _resolve_half_double_tempo(bpm_a, bpm_b_detected)
1088
+ bars_requested = int(request.transition_bars)
1089
+ valid_bars = {4, 8, 16}
1090
+ transition_bars = bars_requested if bars_requested in valid_bars else 8
1091
+ seam_sec_raw = float(_beats_to_seconds(float(transition_bars * 4), bpm_a))
1092
+ seam_sec_raw = float(clamp(seam_sec_raw, 1.0, 40.0))
1093
+ seam_sec_ui_raw = seam_sec_raw
1094
+ base_mode = "B-base-fixed"
1095
+
1096
+ phrase_lock = _phrase_lock_transition_shape(
1097
+ pre_sec=pre_sec_raw,
1098
+ seam_sec=seam_sec_raw,
1099
+ post_sec=post_sec_raw,
1100
+ bpm=bpm_a,
1101
+ )
1102
+ pre_sec = float(phrase_lock["pre_sec"])
1103
+ seam_sec = float(phrase_lock["seam_sec"])
1104
+ post_sec = float(phrase_lock["post_sec"])
1105
+
1106
+ cue_selection = select_mix_cuepoints(
1107
+ y_a_analysis=y_a_an,
1108
+ y_b_analysis=y_b_an,
1109
+ sr=target_sr,
1110
+ analysis_sec=analysis_sec,
1111
+ pre_sec=pre_sec,
1112
+ seam_sec=seam_sec,
1113
+ post_sec=post_sec,
1114
+ a_analysis_start_sec=a_analysis_start,
1115
+ beats_a=beats_a,
1116
+ beats_b=beats_b,
1117
+ cue_a_override_sec=request.cue_a_sec,
1118
+ cue_b_override_sec=request.cue_b_sec,
1119
+ song_a_path=request.song_a_path,
1120
+ song_b_path=request.song_b_path,
1121
+ song_a_duration_sec=dur_a,
1122
+ song_b_duration_sec=dur_b,
1123
+ )
1124
+ cue_a = float(cue_selection.cue_a_sec)
1125
+ cue_b = float(cue_selection.cue_b_sec)
1126
+
1127
+ stretch_rate_raw = bpm_a / max(1e-6, bpm_b_for_alignment)
1128
+ # Clamp the rate to preserve musical coherence while avoiding clearly audible time-stretch artifacts.
1129
+ stretch_rate = clamp(stretch_rate_raw, 0.7, 1.35)
1130
+
1131
+ pre_n = int(round(pre_sec * target_sr))
1132
+ seam_n = int(round(seam_sec * target_sr))
1133
+ post_n = int(round(post_sec * target_sr))
1134
+
1135
+ # Song A transition period: the bars immediately preceding cue A.
1136
+ a_period_start = max(0.0, cue_a - seam_sec)
1137
+ period_a, _ = decode_segment(
1138
+ request.song_a_path,
1139
+ a_period_start,
1140
+ seam_sec,
1141
+ sr=target_sr,
1142
+ max_decode_sec=seam_sec + 2.0,
1143
+ )
1144
+ period_a = ensure_length(period_a, seam_n)
1145
+ period_a_stems, period_a_stem_debug = _extract_demucs_stems(period_a, target_sr, track_label="song-a-transition-period")
1146
+
1147
+ # Repaint pre-context leading into the transition period.
1148
+ a_pre_start = max(0.0, a_period_start - pre_sec)
1149
+ a_pre, _ = decode_segment(
1150
+ request.song_a_path,
1151
+ a_pre_start,
1152
+ pre_sec,
1153
+ sr=target_sr,
1154
+ max_decode_sec=pre_sec + 2.0,
1155
+ )
1156
+ a_pre = _left_pad_to_length(a_pre, pre_n)
1157
+
1158
+ cue_b_selected = cue_b
1159
+ stitch_preview_side_sec = float(STITCH_PREVIEW_SIDE_SEC)
1160
+ boundary_fade_beats = 2.0
1161
+ boundary_fade_sec = clamp(_beats_to_seconds(boundary_fade_beats, bpm_a), 0.08, 1.2)
1162
+ boundary_fade_n = int(round(boundary_fade_sec * target_sr))
1163
+ stitch_decode_side_sec = stitch_preview_side_sec + boundary_fade_sec
1164
+ cue_a_for_stitch = float(max(0.0, cue_a - seam_sec))
1165
+ if dur_a is not None:
1166
+ cue_a_for_stitch = clamp(cue_a_for_stitch, 0.0, float(dur_a))
1167
+ song_a_preview_start = max(0.0, cue_a_for_stitch - stitch_decode_side_sec)
1168
+ song_a_preview_dur = max(0.0, cue_a_for_stitch - song_a_preview_start)
1169
+ song_a_prefix, _ = decode_segment(
1170
+ request.song_a_path,
1171
+ song_a_preview_start,
1172
+ song_a_preview_dur,
1173
+ sr=target_sr,
1174
+ max_decode_sec=max(20.0, song_a_preview_dur + 2.0),
1175
+ )
1176
+
1177
+ # Song B window: decode with pre-roll so we can phase-align on the stretched beat grid.
1178
+ align_preroll_sec = clamp(0.75 * (60.0 / max(1e-6, bpm_a)), 0.2, 1.2)
1179
+ decode_start_b = max(0.0, cue_b_selected - (align_preroll_sec * stretch_rate))
1180
+ if dur_b is not None:
1181
+ decode_start_b = clamp(decode_start_b, 0.0, float(dur_b))
1182
+ desired_b_out_sec = seam_sec + max(post_sec, stitch_decode_side_sec) + (2.0 * align_preroll_sec)
1183
+ if dur_b is not None:
1184
+ # Decode only enough of Song B for alignment + transition + preview tail.
1185
+ remaining_sec = max(0.0, float(dur_b) - decode_start_b)
1186
+ raw_b_in_sec = clamp(min(remaining_sec, desired_b_out_sec * stretch_rate), 1.0, 360.0)
1187
+ else:
1188
+ raw_b_in_sec = clamp(desired_b_out_sec * stretch_rate, 1.0, 360.0)
1189
+ y_b_raw, _ = decode_segment(
1190
+ request.song_b_path,
1191
+ decode_start_b,
1192
+ raw_b_in_sec,
1193
+ sr=target_sr,
1194
+ max_decode_sec=raw_b_in_sec + 2.0,
1195
+ )
1196
+ y_b_stretched = safe_time_stretch(y_b_raw, rate=stretch_rate)
1197
+ y_b_stretched_stems, y_b_stem_debug = _extract_demucs_stems(
1198
+ y_b_stretched,
1199
+ target_sr,
1200
+ track_label="song-b-stretched-window",
1201
+ )
1202
+ nominal_b_start_n = int(round(align_preroll_sec * target_sr))
1203
+ y_b, aligned_b_start_n, b_alignment_debug = _align_b_window_to_a_tail(
1204
+ a_tail=period_a,
1205
+ y_b_stretched=y_b_stretched,
1206
+ nominal_start_n=nominal_b_start_n,
1207
+ seam_n=seam_n,
1208
+ post_n=post_n,
1209
+ sr=target_sr,
1210
+ bpm_ref=bpm_a,
1211
+ a_tail_drums=period_a_stems.drums if period_a_stems is not None else None,
1212
+ y_b_stretched_drums=y_b_stretched_stems.drums if y_b_stretched_stems is not None else None,
1213
+ )
1214
+ cue_b = float(decode_start_b + ((aligned_b_start_n / float(target_sr)) * stretch_rate))
1215
+ period_b = y_b[:seam_n]
1216
+ period_b_stems = _slice_stem_bundle(y_b_stretched_stems, aligned_b_start_n, seam_n)
1217
+ b_post = y_b[seam_n : seam_n + post_n]
1218
+ stitch_decode_n = int(round(stitch_decode_side_sec * target_sr))
1219
+ b_suffix_substitute = y_b_stretched[(aligned_b_start_n + seam_n) : (aligned_b_start_n + seam_n + stitch_decode_n)].astype(
1220
+ np.float32
1221
+ )
1222
+ if b_suffix_substitute.size == 0:
1223
+ b_suffix_substitute = np.zeros((0,), dtype=np.float32)
1224
+
1225
+ rough_seam, reference_period, rough_mix_debug = _build_src_transition_period_with_stems(
1226
+ period_a=period_a,
1227
+ period_b=period_b,
1228
+ sr=target_sr,
1229
+ stems_a=period_a_stems,
1230
+ stems_b=period_b_stems,
1231
+ )
1232
+ rough_stitched = np.concatenate([a_pre, rough_seam, b_post]).astype(np.float32)
1233
+ reference_audio_clip, reference_audio_debug = _build_period_reference_audio(
1234
+ reference_period,
1235
+ sr=target_sr,
1236
+ source_mode=str(rough_mix_debug.get("reference_mode", "full-period-a")),
1237
+ )
1238
+ return {
1239
+ "target_sr": target_sr,
1240
+ "dur_a": dur_a,
1241
+ "dur_b": dur_b,
1242
+ "analysis_start_a_sec": a_analysis_start,
1243
+ "bpm_a": bpm_a,
1244
+ "bpm_b": bpm_b_detected,
1245
+ "bpm_b_for_alignment": bpm_b_for_alignment,
1246
+ "cue_a_sec": cue_a,
1247
+ "cue_b_sec": cue_b,
1248
+ "cue_b_selected_sec": cue_b_selected,
1249
+ "cue_selector_method": cue_selection.method,
1250
+ "cue_selector_debug": cue_selection.debug,
1251
+ "stretch_rate": stretch_rate,
1252
+ "stretch_rate_raw": stretch_rate_raw,
1253
+ "transition_base_mode": base_mode,
1254
+ "transition_bars": int(transition_bars),
1255
+ "b_alignment_debug": b_alignment_debug,
1256
+ "phrase_lock_debug": phrase_lock["debug"],
1257
+ "rough_mix_debug": rough_mix_debug,
1258
+ "reference_audio_debug": reference_audio_debug,
1259
+ "demucs_transition_debug": {
1260
+ "enabled": bool(_DEMUCS_TRANSITION_ENABLED),
1261
+ "period_a": period_a_stem_debug,
1262
+ "b_window_stretched": y_b_stem_debug,
1263
+ "period_b_from_aligned_window": {
1264
+ "status": "ready" if period_b_stems is not None else "unavailable",
1265
+ "source": "slice(song-b-stretched-window, aligned_start, seam_n)",
1266
+ "aligned_start_sample": int(aligned_b_start_n),
1267
+ "seam_n": int(seam_n),
1268
+ },
1269
+ },
1270
+ "pre_sec": pre_sec,
1271
+ "seam_sec": seam_sec,
1272
+ "post_sec": post_sec,
1273
+ "pre_sec_raw": pre_sec_raw,
1274
+ "seam_sec_raw": seam_sec_raw,
1275
+ "seam_sec_ui_raw": seam_sec_ui_raw,
1276
+ "post_sec_raw": post_sec_raw,
1277
+ "pre_n": pre_n,
1278
+ "seam_n": seam_n,
1279
+ "post_n": post_n,
1280
+ "rough_seam": rough_seam,
1281
+ "rough_stitched": rough_stitched,
1282
+ "song_a_prefix": song_a_prefix,
1283
+ "song_b_suffix_substitute": b_suffix_substitute,
1284
+ "reference_audio_clip": reference_audio_clip,
1285
+ "period_a_stem_bundle": period_a_stems,
1286
+ "period_b_stem_bundle": period_b_stems,
1287
+ "boundary_fade_n": int(boundary_fade_n),
1288
+ "boundary_fade_sec": float(boundary_fade_sec),
1289
+ "stitch_preview_side_sec": float(stitch_preview_side_sec),
1290
+ "stitch_decode_side_sec": float(stitch_decode_side_sec),
1291
+ "stitching_debug": {
1292
+ "mode": "replace-seam-no-insert",
1293
+ "transition_base_mode": base_mode,
1294
+ "transition_bars": int(transition_bars),
1295
+ "song_a_prefix_sec": round(float(song_a_prefix.size / max(1, target_sr)), 3),
1296
+ "transition_sec": round(float(seam_sec), 3),
1297
+ "song_b_suffix_sec": round(float(b_suffix_substitute.size / max(1, target_sr)), 3),
1298
+ "decode_start_b_sec": round(float(decode_start_b), 3),
1299
+ "cue_a_cut_sec": round(float(cue_a_for_stitch), 3),
1300
+ "cue_b_continuation_sec": round(float(cue_b + seam_sec), 3),
1301
+ "replaced_window_sec": round(float(seam_sec), 3),
1302
+ "boundary_fade_sec": round(float(boundary_fade_sec), 3),
1303
+ "stitch_preview_side_sec": round(float(stitch_preview_side_sec), 3),
1304
+ "stitch_decode_side_sec": round(float(stitch_decode_side_sec), 3),
1305
+ },
1306
+ }
1307
+
1308
+
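`_prepare_rough_transition` calls `_resolve_half_double_tempo` to reconcile tempo-octave errors before computing the stretch rate; that helper is defined outside this diff. A typical implementation picks whichever of half, as-detected, or double tempo is closest to the reference BPM (a hypothetical sketch, not the actual function):

```python
def resolve_half_double_tempo(bpm_ref: float, bpm_detected: float) -> float:
    """Pick the tempo octave (half, as-is, double) of the detected BPM that is
    closest to the reference BPM, so e.g. a track detected at 256 BPM is
    treated as 128 when mixing against a 128 BPM reference.

    Hypothetical sketch -- the real _resolve_half_double_tempo lives
    elsewhere in this repo.
    """
    candidates = [bpm_detected * 0.5, bpm_detected, bpm_detected * 2.0]
    return min(candidates, key=lambda c: abs(c - bpm_ref))
```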
1309
+ def _extract_success_and_audios(result: Any) -> Tuple[bool, list, Optional[str]]:
1310
+ if isinstance(result, dict):
1311
+ success = bool(result.get("success", False))
1312
+ audios = result.get("audios", [])
1313
+ error = result.get("error") or result.get("status_message")
1314
+ return success, audios, error
1315
+ success = bool(getattr(result, "success", False))
1316
+ audios = getattr(result, "audios", [])
1317
+ error = getattr(result, "error", None) or getattr(result, "status_message", None)
1318
+ return success, audios, error
1319
+
1320
+
1321
+ def _load_acestep_runtime(request: TransitionRequest) -> Dict[str, Any]:
1322
+ global _ACESTEP_RUNTIME
1323
+
1324
+ project_root = _resolve_acestep_project_root(request)
1325
+ runtime_key = (
1326
+ project_root,
1327
+ request.acestep_model_config,
1328
+ request.acestep_device,
1329
+ request.acestep_lora_path,
1330
+ float(request.acestep_lora_scale),
1331
+ )
1332
+
1333
+ if _ACESTEP_RUNTIME is not None and _ACESTEP_RUNTIME.get("key") == runtime_key:
1334
+ return _ACESTEP_RUNTIME
1335
+
1336
+ try:
1337
+ from acestep.handler import AceStepHandler
1338
+ from acestep.inference import GenerationConfig, GenerationParams, generate_music
1339
+ except Exception as exc:
1340
+ raise RuntimeError(
1341
+ "ACE-Step is not installed or import failed. "
1342
+ "Install with: pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git"
1343
+ ) from exc
1344
+
1345
+ handler = AceStepHandler()
1346
+ status, ok = handler.initialize_service(
1347
+ project_root=project_root,
1348
+ config_path=request.acestep_model_config,
1349
+ device=request.acestep_device,
1350
+ use_flash_attention=request.acestep_use_flash_attn,
1351
+ compile_model=request.acestep_compile_model,
1352
+ offload_to_cpu=request.acestep_offload_to_cpu,
1353
+ offload_dit_to_cpu=request.acestep_offload_dit_to_cpu,
1354
+ quantization=None,
1355
+ prefer_source=request.acestep_prefer_source,
1356
+ use_mlx_dit=request.acestep_use_mlx_dit,
1357
+ )
1358
+ if not ok:
1359
+ raise RuntimeError(f"ACE-Step initialize_service failed: {status}")
1360
+
1361
+ lora_debug: Dict[str, Any] = {"requested": False}
1362
+ if request.acestep_lora_path:
1363
+ lora_debug["requested"] = True
1364
+ resolved_lora_path = _resolve_lora_path(request.acestep_lora_path, project_root)
1365
+ try:
1366
+ handler.load_lora(resolved_lora_path)
1367
+ handler.set_use_lora(True)
1368
+ handler.set_lora_scale(float(request.acestep_lora_scale))
1369
+ lora_debug.update(
1370
+ {
1371
+ "loaded": True,
1372
+ "path": resolved_lora_path,
1373
+ "scale": float(request.acestep_lora_scale),
1374
+ }
1375
+ )
1376
+ except Exception as exc:
1377
+ raise RuntimeError(f"Failed to load ACE-Step LoRA: {exc}") from exc
1378
+ else:
1379
+ lora_debug["loaded"] = False
1380
+
1381
+ _ACESTEP_RUNTIME = {
1382
+ "key": runtime_key,
1383
+ "project_root": project_root,
1384
+ "handler": handler,
1385
+ "GenerationParams": GenerationParams,
1386
+ "GenerationConfig": GenerationConfig,
1387
+ "generate_music": generate_music,
1388
+ "lora_debug": lora_debug,
1389
+ }
1390
+ return _ACESTEP_RUNTIME
1391
+
1392
+
1393
+ def _run_acestep_repaint(
+     request: TransitionRequest,
+     rough: Dict[str, Any],
+     rough_src_path: str,
+ ) -> Tuple[np.ndarray, np.ndarray]:
+     runtime = _load_acestep_runtime(request)
+     handler = runtime["handler"]
+     GenerationParams = runtime["GenerationParams"]
+     GenerationConfig = runtime["GenerationConfig"]
+     generate_music = runtime["generate_music"]
+ 
+     caption = _build_caption(request.plugin_id, request.instruction_text)
+ 
+     rough_stitched = rough["rough_stitched"]
+     rough_for_model = resample_if_needed(rough_stitched, rough["target_sr"], ACESTEP_INPUT_SR)
+     write_wav(rough_src_path, rough_for_model, ACESTEP_INPUT_SR)
+     reference_audio_path: Optional[str] = None
+     reference_audio_clip = rough.get("reference_audio_clip")
+     if isinstance(reference_audio_clip, np.ndarray) and reference_audio_clip.size > 0:
+         reference_audio_path = (
+             rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
+             if rough_src_path.endswith("_rough_src.wav")
+             else f"{rough_src_path}.theme_ref.wav"
+         )
+         reference_for_model = resample_if_needed(reference_audio_clip, rough["target_sr"], ACESTEP_INPUT_SR)
+         write_wav(reference_audio_path, reference_for_model, ACESTEP_INPUT_SR)
+ 
+     repaint_start = float(rough["pre_sec"])
+     repaint_end = float(rough["pre_sec"] + rough["seam_sec"])
+     total_duration = float(rough["pre_sec"] + rough["seam_sec"] + rough["post_sec"])
+     bpm_hint = int(round(rough["bpm_a"])) if 30 <= rough["bpm_a"] <= 300 else None
+ 
+     params = GenerationParams(
+         task_type="repaint",
+         src_audio=rough_src_path,
+         reference_audio=reference_audio_path,
+         repainting_start=repaint_start,
+         repainting_end=repaint_end,
+         caption=caption,
+         lyrics="[Instrumental]",
+         instrumental=True,
+         bpm=bpm_hint,
+         duration=total_duration,
+         inference_steps=int(max(1, request.inference_steps)),
+         guidance_scale=float(request.creativity_strength),
+         seed=int(request.seed),
+         thinking=False,
+         use_cot_metas=False,
+         use_cot_caption=False,
+         use_cot_language=False,
+     )
+     config = GenerationConfig(
+         batch_size=1,
+         use_random_seed=False,
+         seeds=[int(request.seed)],
+         audio_format="wav",
+     )
+ 
+     result = generate_music(
+         dit_handler=handler,
+         llm_handler=None,
+         params=params,
+         config=config,
+         save_dir=None,
+         progress=None,
+     )
+     success, audios, error = _extract_success_and_audios(result)
+     if not success or not audios:
+         raise RuntimeError(error or "ACE-Step repaint returned no audio.")
+ 
+     audio_item = audios[0]
+     audio_tensor = audio_item.get("tensor")
+     if audio_tensor is None:
+         raise RuntimeError("ACE-Step repaint output missing audio tensor.")
+ 
+     try:
+         import torch
+         if isinstance(audio_tensor, torch.Tensor):
+             y = audio_tensor.detach().float().cpu().numpy()
+         else:
+             y = np.asarray(audio_tensor, dtype=np.float32)
+     except Exception:
+         y = np.asarray(audio_tensor, dtype=np.float32)
+ 
+     if y.ndim == 2:
+         y = np.mean(y, axis=0)
+     elif y.ndim > 2:
+         y = y.reshape(-1)
+     y = y.astype(np.float32)
+ 
+     model_sr = int(audio_item.get("sample_rate", ACESTEP_INPUT_SR))
+     y = resample_if_needed(y, model_sr, rough["target_sr"])
+ 
+     total_n = rough["pre_n"] + rough["seam_n"] + rough["post_n"]
+     y = ensure_length(y, total_n)
+     stitched = y[:total_n]
+     seam_start = rough["pre_n"]
+     seam_end = seam_start + rough["seam_n"]
+     transition = stitched[seam_start:seam_end]
+     return transition, stitched
+ 
+ 
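The pre/seam/post bookkeeping in `_run_acestep_repaint` reduces to simple sample arithmetic: the stitched clip is `pre | seam | post`, and the repainted seam is cut back out by offset. A self-contained sketch (hypothetical helper name; the real code uses the precomputed `rough["pre_n"]`/`rough["seam_n"]` counts):

```python
import numpy as np

def slice_repaint_seam(stitched: np.ndarray, sr: int, pre_sec: float, seam_sec: float) -> np.ndarray:
    """Cut the repainted seam out of a pre|seam|post stitched clip."""
    seam_start = int(round(pre_sec * sr))
    seam_end = seam_start + int(round(seam_sec * sr))
    return stitched[seam_start:seam_end]

sr = 48_000
pre_sec, seam_sec, post_sec = 6.0, 4.0, 6.0  # matches the CLI defaults below
stitched = np.zeros(int((pre_sec + seam_sec + post_sec) * sr), dtype=np.float32)
seam = slice_repaint_seam(stitched, sr, pre_sec, seam_sec)
```

Working in integer sample counts rather than seconds avoids off-by-one drift when the seam is later spliced back between the Song A prefix and Song B suffix.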
1495
+ def generate_transition_artifacts(request: TransitionRequest) -> TransitionResult:
+     if not os.path.isfile(request.song_a_path):
+         raise FileNotFoundError(f"Song A not found: {request.song_a_path}")
+     if not os.path.isfile(request.song_b_path):
+         raise FileNotFoundError(f"Song B not found: {request.song_b_path}")
+ 
+     transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path = _resolve_output_paths(request)
+ 
+     LOGGER.info("Transition request args: %s", json.dumps(request.to_log_dict(), sort_keys=True))
+     rough = _prepare_rough_transition(request)
+     rough_stitched_audio = normalize_peak(
+         apply_edge_fades(rough["rough_stitched"].astype(np.float32), rough["target_sr"], fade_ms=25.0),
+         peak=0.98,
+     )
+     write_wav(rough_stitched_path, rough_stitched_audio, rough["target_sr"])
+     hard_splice_audio = np.concatenate([rough["song_a_prefix"], rough["song_b_suffix_substitute"]]).astype(np.float32)
+     hard_splice_audio = normalize_peak(hard_splice_audio, peak=0.98)
+     write_wav(hard_splice_path, hard_splice_audio, rough["target_sr"])
+ 
+     transition_audio = rough["rough_seam"]
+     repaint_context_audio = rough["rough_stitched"]
+     try:
+         transition_audio, repaint_context_audio = _run_acestep_repaint(request, rough, rough_src_path)
+     except Exception as exc:
+         raise RuntimeError(f"ACE-Step repaint failed. Please verify ACE-Step runtime and model setup. {exc}") from exc
+ 
+     backend_used = "acestep-repaint"
+ 
+     transition_audio, post_repaint_stem_debug = _post_repaint_stem_correction(
+         transition_audio.astype(np.float32),
+         sr=int(rough["target_sr"]),
+         anchor_a=rough.get("period_a_stem_bundle"),
+         anchor_b=rough.get("period_b_stem_bundle"),
+     )
+ 
+     transition_audio, transition_low_profile_debug = _apply_transition_low_duck(
+         transition_audio.astype(np.float32),
+         sr=int(rough["target_sr"]),
+     )
+ 
+     stitched_audio, boundary_mix_debug = _assemble_substitute_mix(
+         song_a_prefix=rough["song_a_prefix"],
+         transition=transition_audio,
+         song_b_suffix=rough["song_b_suffix_substitute"],
+         boundary_fade_n=int(rough.get("boundary_fade_n", 0)),
+         sr=int(rough["target_sr"]),
+     )
+ 
+     transition_audio = normalize_peak(apply_edge_fades(transition_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)
+     stitched_audio = normalize_peak(apply_edge_fades(stitched_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)
+ 
+     write_wav(transition_path, transition_audio, rough["target_sr"])
+     write_wav(stitched_path, stitched_audio, rough["target_sr"])
+ 
+     theme_ref_path = (
+         rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
+         if rough_src_path.endswith("_rough_src.wav")
+         else f"{rough_src_path}.theme_ref.wav"
+     )
+     if not request.keep_debug_files:
+         for tmp_path in (rough_src_path, theme_ref_path):
+             if os.path.exists(tmp_path):
+                 try:
+                     os.remove(tmp_path)
+                 except Exception:
+                     pass
+ 
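`apply_edge_fades` and `normalize_peak` above come from the project's audio utilities. A rough sketch of what such helpers typically do (assumed behavior for illustration, not the module's exact implementation):

```python
import numpy as np

def apply_edge_fades(y: np.ndarray, sr: int, fade_ms: float = 25.0) -> np.ndarray:
    """Linear fade-in/out over the first and last fade_ms milliseconds."""
    y = y.astype(np.float32).copy()
    n = min(y.size // 2, int(sr * fade_ms / 1000.0))
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        y[:n] *= ramp          # fade in
        y[-n:] *= ramp[::-1]   # fade out
    return y

def normalize_peak(y: np.ndarray, peak: float = 0.98) -> np.ndarray:
    """Scale so the absolute peak hits `peak`; no-op for silence."""
    m = float(np.max(np.abs(y))) if y.size else 0.0
    return y if m == 0.0 else (y * (peak / m)).astype(np.float32)
```

Fading edges before peak normalization is the usual order: it prevents a click at the clip boundary from dominating the peak measurement.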
1562
+     details = {
+         "backend_used": backend_used,
+         "generation_args": request.to_log_dict(),
+         "lora": _load_acestep_runtime(request).get("lora_debug", {"requested": False}),
+         "bpm": {
+             "song_a": round(float(rough["bpm_a"]), 3),
+             "song_b": round(float(rough["bpm_b"]), 3),
+             "song_b_for_alignment": round(float(rough["bpm_b_for_alignment"]), 3),
+             "stretch_rate": round(float(rough["stretch_rate"]), 5),
+             "stretch_rate_raw": round(float(rough["stretch_rate_raw"]), 5),
+             "bpm_target_override": request.bpm_target,
+         },
+         "cue_points_sec": {
+             "song_a": round(float(rough["cue_a_sec"]), 3),
+             "song_b": round(float(rough["cue_b_sec"]), 3),
+             "song_b_selected": round(float(rough["cue_b_selected_sec"]), 3),
+             "selector_method": rough.get("cue_selector_method"),
+         },
+         "cue_selector": rough.get("cue_selector_debug"),
+         "bpm_phase_alignment": rough.get("b_alignment_debug"),
+         "phrase_lock": rough.get("phrase_lock_debug"),
+         "rough_mix": rough.get("rough_mix_debug"),
+         "reference_audio": rough.get("reference_audio_debug"),
+         "demucs_transition": rough.get("demucs_transition_debug"),
+         "stitching": rough.get("stitching_debug"),
+         "boundary_mix": boundary_mix_debug,
+         "post_repaint_stem_correction": post_repaint_stem_debug,
+         "transition_low_profile": transition_low_profile_debug,
+         "transition_strategy": {
+             "name": "bar-defined-dual-base-repaint",
+             "base_mode": rough.get("transition_base_mode"),
+             "transition_bars": rough.get("transition_bars"),
+             "boundary_fade_sec": round(float(rough.get("boundary_fade_sec", 0.0)), 3),
+         },
+         "clip_shape_sec": {
+             "pre_context_sec_raw": round(float(rough["pre_sec_raw"]), 3),
+             "pre_context_sec": round(float(rough["pre_sec"]), 3),
+             "repaint_width_sec_ui_raw": round(float(rough.get("seam_sec_ui_raw", rough["seam_sec_raw"])), 3),
+             "repaint_width_sec_raw": round(float(rough["seam_sec_raw"]), 3),
+             "repaint_width_sec": round(float(rough["seam_sec"]), 3),
+             "post_context_sec_raw": round(float(rough["post_sec_raw"]), 3),
+             "post_context_sec": round(float(rough["post_sec"]), 3),
+             "analysis_sec": round(float(request.analysis_sec), 3),
+         },
+         "durations_sec": {
+             "song_a_total": rough["dur_a"],
+             "song_b_total": rough["dur_b"],
+             "analysis_start_a_sec": round(float(rough["analysis_start_a_sec"]), 3),
+             "repaint_context_preview": round(float(repaint_context_audio.size / max(1, rough["target_sr"])), 3),
+             "stitched_output": round(float(stitched_audio.size / max(1, rough["target_sr"])), 3),
+         },
+         "outputs": {
+             "transition_path": transition_path,
+             "stitched_path": stitched_path,
+             "rough_stitched_path": rough_stitched_path,
+             "hard_splice_path": hard_splice_path,
+         },
+     }
+     LOGGER.info("Transition result details: %s", json.dumps(details, sort_keys=True))
+ 
+     return TransitionResult(
+         transition_path=transition_path,
+         stitched_path=stitched_path,
+         rough_stitched_path=rough_stitched_path,
+         hard_splice_path=hard_splice_path,
+         backend_used=backend_used,
+         details=details,
+     )
+ 
+ 
1632
+ def _build_arg_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(description="Deterministic DJ transition generation (Phase A/B).")
+     parser.add_argument("--song-a", required=True, help="Path to Song A audio file.")
+     parser.add_argument("--song-b", required=True, help="Path to Song B audio file.")
+     parser.add_argument("--plugin", default="Smooth Blend", choices=list(PLUGIN_PRESETS.keys()), help="Transition style plugin preset.")
+     parser.add_argument("--instruction", default="", help="Extra text instruction for generation.")
+     parser.add_argument("--pre-sec", type=float, default=6.0, help="Seconds before seam from Song A.")
+     parser.add_argument("--repaint-sec", type=float, default=4.0, help="Repaint seam width in seconds.")
+     parser.add_argument("--post-sec", type=float, default=6.0, help="Seconds after seam from Song B.")
+     parser.add_argument("--analysis-sec", type=float, default=45.0, help="Analysis window in seconds.")
+     parser.add_argument("--bpm-target", type=float, default=None, help="Optional BPM override target for Song A.")
+     parser.add_argument("--cue-a-sec", type=float, default=None, help="Optional Song A cue override.")
+     parser.add_argument("--cue-b-sec", type=float, default=None, help="Optional Song B cue override.")
+     parser.add_argument(
+         "--transition-bars",
+         type=int,
+         default=8,
+         choices=[4, 8, 16],
+         help="Transition period length in bars around cue points.",
+     )
+     parser.add_argument("--creativity", type=float, default=7.0, help="ACE-Step guidance strength.")
+     parser.add_argument("--inference-steps", type=int, default=8, help="ACE-Step inference steps.")
+     parser.add_argument("--seed", type=int, default=42, help="Seed for reproducibility.")
+     parser.add_argument("--output-dir", default="outputs", help="Directory for output artifacts.")
+     parser.add_argument("--output-stem", default=None, help="Optional fixed output stem.")
+     parser.add_argument("--target-sr", type=int, default=DEFAULT_TARGET_SR, help="Output sample rate.")
+     parser.add_argument("--keep-debug-files", action="store_true", help="Keep temporary rough source audio files.")
+     return parser
+ 
+ 
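Given the parser above, only `--song-a` and `--song-b` are required; everything else falls back to the documented defaults, with hyphenated flags mapped to underscored attribute names by argparse. A trimmed reproduction of a few of those flags shows how defaults resolve:

```python
import argparse

# Subset of the flags defined in _build_arg_parser (same types and defaults).
parser = argparse.ArgumentParser(description="Deterministic DJ transition generation (Phase A/B).")
parser.add_argument("--song-a", required=True)
parser.add_argument("--song-b", required=True)
parser.add_argument("--transition-bars", type=int, default=8, choices=[4, 8, 16])
parser.add_argument("--creativity", type=float, default=7.0)
parser.add_argument("--seed", type=int, default=42)

# Passing only the required paths; args.transition_bars etc. take defaults.
args = parser.parse_args(["--song-a", "a.wav", "--song-b", "b.wav"])
```

Restricting `--transition-bars` to `choices=[4, 8, 16]` makes argparse reject off-grid values before any audio analysis runs.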
1662
+ def main() -> None:
+     logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(name)s | %(message)s")
+     parser = _build_arg_parser()
+     args = parser.parse_args()
+ 
+     req = TransitionRequest(
+         song_a_path=args.song_a,
+         song_b_path=args.song_b,
+         plugin_id=args.plugin,
+         instruction_text=args.instruction,
+         pre_context_sec=args.pre_sec,
+         repaint_width_sec=args.repaint_sec,
+         post_context_sec=args.post_sec,
+         analysis_sec=args.analysis_sec,
+         bpm_target=args.bpm_target,
+         cue_a_sec=args.cue_a_sec,
+         cue_b_sec=args.cue_b_sec,
+         transition_bars=args.transition_bars,
+         creativity_strength=args.creativity,
+         inference_steps=args.inference_steps,
+         seed=args.seed,
+         output_dir=args.output_dir,
+         output_stem=args.output_stem,
+         target_sr=args.target_sr,
+         keep_debug_files=args.keep_debug_files,
+     )
+ 
+     result = generate_transition_artifacts(req)
+     print(json.dumps(result.to_dict(), indent=2))
+ 
+ 
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ gradio
+ spaces
+ torch
+ transformers
+ accelerate
+ librosa
+ soundfile
+ numpy
+ scipy
+ # Demucs enables stem-aware cue selection and transition refinement.
+ demucs
+ 
+ # Optional ACE-Step backend (heavy; keep optional so MusicGen path still works):
+ git+https://github.com/ACE-Step/ACE-Step-1.5.git
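The comments above treat demucs and ACE-Step as enhancements rather than hard requirements. A guarded-import sketch matching that intent (pattern only; `acestep` is a hypothetical module name, and the app's actual fallback logic lives in the pipeline code):

```python
# Probe optional backends at startup and degrade gracefully instead of crashing.
try:
    import demucs  # noqa: F401
    HAS_DEMUCS = True
except ImportError:
    HAS_DEMUCS = False

try:
    import acestep  # noqa: F401  # hypothetical import name for the ACE-Step package
    HAS_ACESTEP = True
except ImportError:
    HAS_ACESTEP = False

# Downstream code can then branch: stem-aware cues if HAS_DEMUCS,
# repaint backend if HAS_ACESTEP, plain crossfade otherwise.
```

On a Hugging Face Space this keeps the Gradio UI alive even when the heavy git dependency fails to build.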