Yng314 committed
Commit 14984e4 · 1 Parent(s): 070f6dc

feat: implement audio transition generation pipeline with modules for transition generation, cue point selection, and audio utilities.
.gitignore ADDED
@@ -0,0 +1,15 @@
+ Initial_Research/demix
+ Initial_Research/spec
+ __pycache__
+ Utils/pretrained_models
+
+ # Large / copyrighted audio should not be committed to a public Space repo
+ Songs/
+ Test_songs/
+ Initial_Research/*.mp3
+ Initial_Research/*.wav
+ mixed_song.wav
+ final_mix.mp3
+ .acestep_runtime/
+ checkpoints/
+ outputs/
PROJECT_CATCHUP_NOTE.md ADDED
@@ -0,0 +1,229 @@
+ # AI DJ Project Catch-Up Note
+
+ Last updated: 2026-02-19
+
+ ## 1) Project Goal (Current Direction)
+
+ Build a **domain-specific AI DJ transition demo** for coursework Option 1 (Refinement):
+
+ - user uploads Song A and Song B
+ - system auto-detects cue points + BPM
+ - Song B is time-stretched to Song A's BPM
+ - a generative model creates transition audio from text ("transition vibe")
+ - output is a **short transition clip only** (not a full-song mix)
+
+ This scope is intentionally optimized for Hugging Face Spaces reliability.
+
+ ---
+
+ ## 2) Coursework Fit (Why this is Option 1)
+
+ This is a refinement of existing pipelines/models:
+
+ - existing generative pipeline (currently MusicGen, planned ACE-Step)
+ - wrapped in a domain-specific DJ UX (cue/BPM/mix controls)
+ - not raw prompting alone; structured controls for practical use
+
+ ---
+
+ ## 3) Current Implemented Pipeline (Already in `app.py`)
+
+ Current app file: `AI_DJ_Project/app.py`
+
+ ### 3.1 Input + UI
+
+ - Upload `Song A` and `Song B`
+ - Set:
+   - transition vibe text
+   - transition type (`riser`, `drum fill`, `sweep`, `brake`, `scratch`, `impact`)
+   - mode (`Overlay` or `Insert`)
+   - pre/mix/post seconds
+   - transition length + gain
+   - optional BPM and cue overrides
+
+ ### 3.2 Audio analysis and cueing
+
+ 1. Probe duration with `ffprobe` (if available)
+ 2. Decode only the needed segments (ffmpeg first, librosa fallback)
+ 3. Estimate BPM + beat times with `librosa.beat.beat_track`
+ 4. Auto-cue strategy:
+    - Song A: choose the beat nearest the end of the analysis window
+    - Song B: choose the first beat after ~2 seconds
+ 5. Optional manual override for BPM and cue points
+
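The auto-cue strategy above can be sketched with plain NumPy. The function names here are illustrative; the repo's `pipeline/audio_utils.py` exposes `choose_nearest_beat` and `choose_first_beat_after` with the same behavior:

```python
import numpy as np


def cue_for_song_a(beat_times: np.ndarray, window_end_sec: float) -> float:
    """Song A: pick the beat nearest the end of the analysis window."""
    if beat_times.size == 0:
        return float(window_end_sec)
    idx = int(np.argmin(np.abs(beat_times - float(window_end_sec))))
    return float(beat_times[idx])


def cue_for_song_b(beat_times: np.ndarray, min_sec: float = 2.0) -> float:
    """Song B: pick the first beat at or after ~2 seconds."""
    for bt in beat_times:
        if float(bt) >= min_sec:
            return float(bt)
    return float(beat_times[-1]) if beat_times.size else float(min_sec)


beats = np.array([0.5, 1.0, 1.5, 2.1, 2.6])
print(cue_for_song_a(beats, 2.5))  # 2.6 (nearest beat to the 2.5 s window end)
print(cue_for_song_b(beats))       # 2.1 (first beat at or after 2.0 s)
```

Both helpers fall back to the raw target time when no beats were detected, which keeps the pipeline running on tracks where beat tracking fails.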
+ ### 3.3 Tempo matching
+
+ - Compute stretch rate = `bpm_A / bpm_B` (clamped)
+ - Time-stretch the Song B segment via `librosa.effects.time_stretch`
+
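A minimal sketch of the clamped stretch-rate computation; the clamp bounds (0.7-1.4) are illustrative here, not the app's actual limits:

```python
def stretch_rate(bpm_a: float, bpm_b: float, lo: float = 0.7, hi: float = 1.4) -> float:
    """Rate passed to librosa.effects.time_stretch: > 1 speeds Song B up."""
    return max(lo, min(hi, bpm_a / bpm_b))


print(stretch_rate(128.0, 120.0))  # ~1.067: Song B sped up slightly to match A
print(stretch_rate(128.0, 60.0))   # 1.4: clamped to avoid extreme artifacts
```

Clamping matters because large rate changes produce audible time-stretch artifacts (see Section 9).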
+ ### 3.4 AI transition generation
+
+ - `@spaces.GPU` function `_generate_ai_transition(...)`
+ - Uses `facebook/musicgen-small`
+ - Prompt is domain-steered for DJ transition behavior
+ - Returns short generated transition audio
+
+ ### 3.5 Assembly
+
+ - **Overlay mode**: crossfade A/B + overlay AI transition
+ - **Insert mode**: A -> AI transition -> B (with short anti-click fades)
+ - Edge fades + peak normalization before output
+
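The Overlay-mode crossfade reduces to a linear fade over the overlapping window; this mirrors `crossfade_equal_length` in `pipeline/audio_utils.py`:

```python
import numpy as np


def crossfade(a_tail: np.ndarray, b_head: np.ndarray) -> np.ndarray:
    """Linear crossfade: Song A fades out while Song B fades in."""
    n = min(a_tail.size, b_head.size)
    fade_in = np.linspace(0.0, 1.0, n, dtype=np.float32)
    return (a_tail[:n] * (1.0 - fade_in) + b_head[:n] * fade_in).astype(np.float32)


mixed = crossfade(np.ones(5, dtype=np.float32), np.zeros(5, dtype=np.float32))
print(mixed)  # fades 1 -> 0 across the window: 1.0, 0.75, 0.5, 0.25, 0.0
```

The AI transition clip is then summed on top of this bed in Overlay mode, whereas Insert mode splices it between the two songs.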
+ ### 3.6 Output
+
+ - Output audio clip (NumPy audio to Gradio)
+ - JSON details:
+   - BPM estimates
+   - cue points
+   - stretch rate
+   - analysis settings
+
+ ---
+
+ ## 4) Full End-to-End Pipeline (Conceptual)
+
+ Upload A/B
+ -> decode limited windows
+ -> BPM + beat analysis
+ -> auto-cue points
+ -> stretch B to A BPM
+ -> generate transition (GenAI)
+ -> overlay/insert assembly
+ -> normalize/fades
+ -> return short transition clip + diagnostics
+
+ ---
+
+ ## 5) Planned Upgrade: ACE-Step + Custom LoRA
+
+ ### 5.1 What ACE-Step is
+
+ ACE-Step 1.5 is a **full music-generation foundation model stack** (text-to-audio/music with editing/control workflows), not just a tiny SFX model.
+
+ Planned usage in this project:
+
+ - keep the deterministic DJ logic (cue/BPM/stretch/assemble)
+ - swap the transition-generation backend from MusicGen to ACE-Step
+ - load custom LoRA adapter(s) to enforce DJ transition style
+
+ ### 5.2 Integration strategy (recommended)
+
+ 1. Keep the current `app.py` flow unchanged for analysis/mixing
+ 2. Introduce a backend abstraction:
+    - `MusicGenBackend` (fallback)
+    - `AceStepBackend` (main target)
+ 3. Add LoRA controls:
+    - adapter selection
+    - adapter scale
+ 4. Continue returning short transition clips only
+
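The backend abstraction could be sketched as below. The `generate()` signature and the stand-in bodies are illustrative assumptions, not the repo's actual API; only the class names `MusicGenBackend` and `AceStepBackend` come from the plan above:

```python
from abc import ABC, abstractmethod

import numpy as np


class TransitionBackend(ABC):
    """Common interface for transition generators (signature is illustrative)."""

    @abstractmethod
    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray: ...


class MusicGenBackend(TransitionBackend):
    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray:
        # Stand-in: the real backend would run facebook/musicgen-small here.
        return np.zeros(int(duration_sec * sr), dtype=np.float32)


class AceStepBackend(TransitionBackend):
    def __init__(self) -> None:
        # Stand-in: real init would load ACE-Step checkpoints, which can fail on Spaces.
        raise RuntimeError("ACE-Step runtime not available")

    def generate(self, prompt: str, duration_sec: float, sr: int) -> np.ndarray:
        raise NotImplementedError


def make_backend() -> TransitionBackend:
    """Prefer ACE-Step; fail over to MusicGen (see Section 10)."""
    try:
        return AceStepBackend()
    except Exception:
        return MusicGenBackend()


clip = make_backend().generate("smooth house riser", 4.0, 44100)
print(clip.shape)  # (176400,) — 4 s of mono audio at 44.1 kHz
```

Because callers only see `TransitionBackend`, the assembly code in `app.py` stays unchanged when the backend is swapped.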
+ ---
+
+ ## 6) Genre-Specific LoRA Idea (Pop / Electronic / House / Dubstep / Techno)
+
+ ### Is this a good idea?
+
+ **Yes, as a staged plan.**
+
+ It is a strong product and coursework idea because:
+
+ - user-selected genre can map to a distinct transition style
+ - demonstrates clear domain-specific refinement
+ - supports explainable UX: "You picked House -> House-style transition LoRA"
+
+ ### Important caveats
+
+ - Training one LoRA per genre substantially increases data and compute requirements
+ - Early quality may vary by genre and dataset size
+ - More adapters mean more evaluation and QA burden
+
+ ### Practical rollout (recommended)
+
+ Phase 1 (safe):
+ - base model + one "general DJ transition" LoRA
+
+ Phase 2 (coursework-strong):
+ - 2-3 genre LoRAs (e.g., Pop / House / Dubstep)
+
+ Phase 3 (optional extension):
+ - larger genre library + auto-genre suggestion from uploaded songs
+
+ ---
+
+ ## 7) Proposed Genre LoRA Routing Logic
+
+ User selects the uploaded song's genre (or manually selects a transition style profile):
+
+ - Pop -> `lora_pop_transition`
+ - Electronic -> `lora_electronic_transition`
+ - House -> `lora_house_transition`
+ - Dubstep -> `lora_dubstep_transition`
+ - Techno -> `lora_techno_transition`
+ - Auto/Unknown -> `lora_general_transition`
+
+ Then:
+
+ 1. load the chosen LoRA
+ 2. set the LoRA scale
+ 3. run ACE-Step generation for the short transition duration
+ 4. mix with the A/B boundary clip
+
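The routing table above is a simple dictionary lookup with a general-adapter fallback; the adapter names are the ones listed in this section:

```python
from typing import Optional

GENRE_LORA_MAP = {
    "Pop": "lora_pop_transition",
    "Electronic": "lora_electronic_transition",
    "House": "lora_house_transition",
    "Dubstep": "lora_dubstep_transition",
    "Techno": "lora_techno_transition",
}


def route_lora(genre: Optional[str]) -> str:
    """Auto/Unknown (or any unlisted genre) falls back to the general adapter."""
    return GENRE_LORA_MAP.get(genre or "", "lora_general_transition")


print(route_lora("House"))  # lora_house_transition
print(route_lora(None))     # lora_general_transition
```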
+ ---
+
+ ## 8) Data and Training Notes for LoRA
+
+ - Use only licensed/royalty-free/self-owned audio for datasets and demos
+ - The dataset should emphasize transition-like content (risers, fills, drops, sweeps, impacts)
+ - Include metadata/captions describing genre + transition intent
+ - Keep track of:
+   - adapter name
+   - dataset source and license
+   - training config and epoch checkpoints
+
+ ---
+
+ ## 9) Current Risks / Constraints
+
+ - The ACE-Step stack is heavier than MusicGen and needs careful deployment tuning
+ - Cold starts and memory behavior can be challenging on Spaces
+ - Auto-cueing is heuristic and may fail on difficult tracks (the manual override should remain)
+ - Time-stretch can introduce artifacts (expected in DJ contexts)
+
+ ---
+
+ ## 10) Fallback and Reliability Plan
+
+ - Keep the MusicGen backend as a fallback while integrating ACE-Step
+ - If ACE-Step init fails:
+   - fail over to the MusicGen backend
+   - still return a valid transition clip
+ - Preserve the deterministic DSP path as a model-agnostic baseline
+
+ ---
+
+ ## 11) "If I lost track" Quick Resume Checklist
+
+ 1. Open `app.py` and confirm the current backend still works end-to-end
+ 2. Verify the demo still does:
+    - cue detect
+    - BPM match
+    - transition generation
+    - clip output
+ 3. Re-read sections 5/6/7 of this note
+ 4. Continue with the next implementation milestone:
+    - backend abstraction
+    - ACE-Step backend skeleton
+    - single LoRA integration
+    - then genre LoRA expansion
+
+ ---
+
+ ## 12) Next Concrete Milestones
+
+ M1: Refactor transition generation into a backend interface
+ M2: Implement `AceStepBackend` with base-model inference
+ M3: Add LoRA load/select/scale UI + runtime controls
+ M4: Train the first "general DJ transition" LoRA
+ M5: Train 2-3 genre LoRAs and add genre routing
+ M6: Compare outputs (base vs LoRA, genre A vs genre B) for coursework evidence
+
README copy.md ADDED
@@ -0,0 +1,100 @@
+ # AI_DJ_Project
+
+ ## Coursework-ready demo (HF Spaces + Gradio, Phase A/B)
+
+ This repo now includes a **Hugging Face Spaces** demo in `app.py`:
+
+ - Upload **Song A** and **Song B**.
+ - Pick a transition style plugin + text instruction.
+ - Build a rough seam (`A_tail + B_head`) with BPM-aware stretching.
+ - Run **ACE-Step repaint** on the seam window.
+ - Output two artifacts:
+   - transition-only clip
+   - stitched clip (`Song A up to cue + transition + Song B continuation`; the seam is replaced, not inserted)
+
+ ### Deterministic transition API (Phase A)
+
+ The core reusable pipeline lives in:
+ - `pipeline/audio_utils.py`
+ - `pipeline/transition_generator.py`
+
+ Run it from the command line:
+
+ ```shell
+ python -m pipeline.transition_generator \
+   --song-a /path/to/song_a.mp3 \
+   --song-b /path/to/song_b.mp3 \
+   --plugin "Smooth Blend" \
+   --instruction "smooth, rising energy, no vocals" \
+   --seed 42 \
+   --output-dir outputs
+ ```
+
+ This writes:
+ - `*_transition.wav`
+ - `*_stitched.wav`
+
+ ### Deploy to Hugging Face Spaces (ZeroGPU)
+
+ Create a new Space with:
+ - **SDK**: Gradio
+ - **Hardware**: ZeroGPU
+
+ Upload these files from this folder:
+ - `app.py`
+ - `requirements.txt`
+ - `packages.txt` (installs `ffmpeg` + `libsndfile1` for audio decoding at runtime)
+
+ Important: **Do not upload copyrighted songs** into the Space repo. The demo is designed for **user uploads**.
+
+ ### Repo hygiene
+
+ - The coursework spec notebook at the repo root is intentionally git-ignored:
+   `(0) 70113_Generative_AI_README_for_Coursework.ipynb`
+
+ ### ACE-Step backend (required)
+
+ This coursework pipeline uses ACE-Step as the generation method.
+
+ ```shell
+ pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git
+ ```
+
+ Then run with environment variables as needed:
+
+ ```shell
+ export AI_DJ_ACESTEP_MODEL_CONFIG=acestep-v15-turbo
+ # optional persistent root for checkpoints:
+ export AI_DJ_ACESTEP_PROJECT_ROOT=/data/acestep_runtime
+ ```
+
+ Notes:
+ - ACE-Step currently targets Python 3.11.
+ - The first ACE-Step run can take time due to checkpoint downloads.
+
+ ### Optional: Demucs stem-aware cue scoring
+
+ Cuepoint scoring can optionally run Demucs on the **analysis windows only** (A tail window + B head window), derive stem-aware mixability signals (`vocals`, `drums`, `bass`, accompaniment density), and penalize overlap risk (vocal-vocal and bass-bass clashes).
+
+ Transition generation can also use Demucs for:
+ - drum-led phase locking,
+ - one-bassline handoff shaping in `src_audio`,
+ - accompaniment-only `reference_audio`,
+ - post-repaint stem correction near transition boundaries.
+
+ Environment toggles:
+
+ ```shell
+ # disable Demucs analysis entirely
+ export AI_DJ_ENABLE_DEMUCS_ANALYSIS=0
+
+ # disable Demucs transition refinements entirely
+ export AI_DJ_ENABLE_DEMUCS_TRANSITION=0
+
+ # choose analysis device when enabled (default: cuda if available)
+ export AI_DJ_DEMUCS_DEVICE=cpu
+
+ # choose reference period type passed into ACE-Step reference_audio
+ # values: accompaniment-only (default) | full-period-a
+ export AI_DJ_REFERENCE_AUDIO_MODE=accompaniment-only
+ ```
app.py ADDED
@@ -0,0 +1,392 @@
+ import logging
+ import os
+ import subprocess
+ from pathlib import Path
+ from typing import Optional
+
+ import gradio as gr
+
+ from pipeline.transition_generator import (
+     PLUGIN_PRESETS,
+     TransitionRequest,
+     generate_transition_artifacts,
+ )
+
+ logging.basicConfig(
+     level=logging.INFO,
+     format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
+ )
+ LOGGER = logging.getLogger(__name__)
+
+ LORA_DROPDOWN_CHOICES = [
+     "None",
+     "Chinese New Year (official)",
+ ]
+ LORA_REPO_MAP = {
+     "Chinese New Year (official)": "ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA",
+ }
+
+ APP_CSS = """
+ .adv-item label,
+ .adv-item .gr-block-label,
+ .adv-item .gr-block-title {
+     white-space: nowrap !important;
+     overflow: hidden !important;
+     text-overflow: ellipsis !important;
+ }
+ """
+
+ APP_THEME = gr.themes.Soft(
+     primary_hue="blue",
+     neutral_hue="slate",
+     radius_size="lg",
+ ).set(
+     block_radius="*radius_xl",
+     input_radius="*radius_xl",
+     button_large_radius="*radius_xl",
+     button_medium_radius="*radius_xl",
+     button_small_radius="*radius_xl",
+ )
+
+
+ def _to_optional_float(value) -> Optional[float]:
+     if value is None:
+         return None
+     if isinstance(value, str) and not value.strip():
+         return None
+     try:
+         return float(value)
+     except Exception:
+         return None
+
+
+ def _normalize_upload_for_ui(path: Optional[str]) -> Optional[str]:
+     if not path:
+         return path
+     src = str(path)
+     if not os.path.isfile(src):
+         return path
+
+     out_dir = os.path.join("outputs", "normalized_uploads")
+     os.makedirs(out_dir, exist_ok=True)
+     stem = Path(src).stem
+     dst = os.path.join(out_dir, f"{stem}_ui_norm.wav")
+
+     cmd = [
+         "ffmpeg",
+         "-hide_banner",
+         "-loglevel",
+         "error",
+         "-nostdin",
+         "-y",
+         "-i",
+         src,
+         "-vn",
+         "-ac",
+         "2",
+         "-ar",
+         "44100",
+         "-c:a",
+         "pcm_s16le",
+         dst,
+     ]
+     try:
+         subprocess.run(cmd, check=True)
+         return dst
+     except Exception as exc:
+         LOGGER.warning("Upload normalization failed for %s (%s). Using original file.", src, exc)
+         return src
+
+
+ def _run_transition(
+     song_a,
+     song_b,
+     plugin_id,
+     instruction_text,
+     transition_bars,
+     pre_context_sec,
+     post_context_sec,
+     analysis_sec,
+     bpm_target,
+     creativity_strength,
+     inference_steps,
+     seed,
+     cue_a_sec,
+     cue_b_sec,
+     lora_choice,
+     lora_scale,
+     output_dir,
+ ):
+     if not song_a or not song_b:
+         raise gr.Error("Please upload both Song A and Song B.")
+
+     request = TransitionRequest(
+         song_a_path=song_a,
+         song_b_path=song_b,
+         plugin_id=plugin_id,
+         instruction_text=instruction_text or "",
+         transition_base_mode="B-base-fixed",
+         transition_bars=int(transition_bars),
+         pre_context_sec=float(pre_context_sec),
+         repaint_width_sec=4.0,
+         post_context_sec=float(post_context_sec),
+         analysis_sec=float(analysis_sec),
+         bpm_target=_to_optional_float(bpm_target),
+         cue_a_sec=_to_optional_float(cue_a_sec),
+         cue_b_sec=_to_optional_float(cue_b_sec),
+         creativity_strength=float(creativity_strength),
+         inference_steps=int(inference_steps),
+         seed=int(seed),
+         acestep_lora_path=LORA_REPO_MAP.get(str(lora_choice), ""),
+         acestep_lora_scale=float(lora_scale),
+         output_dir=(output_dir or "outputs").strip(),
+     )
+
+     try:
+         result = generate_transition_artifacts(request)
+     except Exception as exc:
+         raise gr.Error(str(exc))
+
+     return (
+         result.transition_path,
+         result.hard_splice_path,
+         result.rough_stitched_path,
+         result.stitched_path,
+     )
+
+
+ def build_ui() -> gr.Blocks:
+     with gr.Blocks(theme=APP_THEME, css=APP_CSS) as demo:
+         gr.Markdown(
+             """
+             <div style="text-align:center;">
+               <h1>AI DJ Transition Generator</h1>
+               <p>Upload two songs and generate a transition between them.</p>
+             </div>
+             """.strip()
+         )
+         with gr.Row():
+             gr.Markdown(
+                 """
+                 ### How to use
+                 1. Upload **Song A** (current track) and **Song B** (next track).
+                 2. Choose a **Transition style plugin**.
+                 3. Optionally add **Text instruction** (e.g., smooth, rising energy, no vocals).
+                 4. Select **Transition period length (bars)**.
+                 5. Click **Generate transition artifacts**.
+                 """.strip(),
+                 container=False,
+                 elem_classes=["plain-info"],
+             )
+             gr.Markdown(
+                 """
+                 ### Outputs
+                 - **Generated transition clip**: AI-generated repaint transition segment.
+                 - **Hard splice baseline (no transition)**: direct cut baseline.
+                 - **No-repaint rough stitch (baseline)**: stitched baseline without repaint.
+                 - **Final stitched clip**: final result with transition inserted.
+                 """.strip(),
+                 container=False,
+                 elem_classes=["plain-info"],
+             )
+
+         with gr.Row():
+             song_a = gr.Audio(
+                 label="Song A (mix out)",
+                 type="filepath",
+                 sources=["upload"],
+             )
+             song_b = gr.Audio(
+                 label="Song B (mix in)",
+                 type="filepath",
+                 sources=["upload"],
+             )
+         song_a.upload(
+             fn=_normalize_upload_for_ui,
+             inputs=song_a,
+             outputs=song_a,
+             queue=False,
+         )
+         song_b.upload(
+             fn=_normalize_upload_for_ui,
+             inputs=song_b,
+             outputs=song_b,
+             queue=False,
+         )
+
+         with gr.Row():
+             with gr.Column():
+                 plugin_id = gr.Dropdown(
+                     label="Transition style plugin",
+                     choices=list(PLUGIN_PRESETS.keys()),
+                     value="Smooth Blend",
+                 )
+             with gr.Column():
+                 lora_choice = gr.Dropdown(
+                     label="LoRA adapter",
+                     choices=LORA_DROPDOWN_CHOICES,
+                     value="None",
+                     info="Select an ACE-Step LoRA adapter to apply during repaint.",
+                 )
+                 lora_scale = gr.Slider(
+                     minimum=0.0,
+                     maximum=2.0,
+                     value=0.8,
+                     step=0.05,
+                     label="LoRA scale",
+                 )
+             with gr.Column():
+                 instruction_text = gr.Textbox(
+                     label="Text instruction",
+                     placeholder="e.g., smooth, rising energy, no vocals",
+                     lines=2,
+                 )
+
+         with gr.Accordion("Advanced controls", open=False):
+             with gr.Row():
+                 transition_bars = gr.Dropdown(
+                     label="Transition period length (bars)",
+                     choices=[4, 8, 16],
+                     value=8,
+                     info="Controls transition duration. Pipeline uses fixed B-base strategy with A as reference.",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 pre_context_sec = gr.Slider(
+                     minimum=1,
+                     maximum=12,
+                     value=6,
+                     step=0.5,
+                     label="Seconds before seam (Song A context)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 post_context_sec = gr.Slider(
+                     minimum=1,
+                     maximum=12,
+                     value=6,
+                     step=0.5,
+                     label="Seconds after seam (Song B context)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 analysis_sec = gr.Slider(
+                     minimum=10,
+                     maximum=90,
+                     value=45,
+                     step=5,
+                     label="Analysis window (seconds)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 bpm_target = gr.Number(
+                     label="Optional BPM target override",
+                     value=None,
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 creativity_strength = gr.Slider(
+                     minimum=1.0,
+                     maximum=12.0,
+                     value=7.0,
+                     step=0.5,
+                     label="Creativity strength (guidance)",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 inference_steps = gr.Slider(
+                     minimum=1,
+                     maximum=64,
+                     value=8,
+                     step=1,
+                     label="ACE-Step inference steps",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 seed = gr.Number(
+                     label="Seed",
+                     value=42,
+                     precision=0,
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 cue_a_sec = gr.Textbox(
+                     label="Optional cue A override (sec)",
+                     value="",
+                     placeholder="Leave blank for auto cue selection",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+             with gr.Row():
+                 cue_b_sec = gr.Textbox(
+                     label="Optional cue B override (sec)",
+                     value="",
+                     placeholder="Leave blank for auto cue selection",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+                 output_dir = gr.Textbox(
+                     label="Output directory",
+                     value="outputs",
+                     min_width=320,
+                     elem_classes=["adv-item"],
+                 )
+
+         run_btn = gr.Button("Generate transition artifacts", variant="primary")
+
+         with gr.Row():
+             transition_audio = gr.Audio(
+                 label="Generated transition clip",
+                 type="filepath",
+             )
+             hard_splice_audio = gr.Audio(
+                 label="Hard splice baseline (no transition)",
+                 type="filepath",
+             )
+             rough_stitched_audio = gr.Audio(
+                 label="No-repaint rough stitch (baseline)",
+                 type="filepath",
+             )
+             stitched_audio = gr.Audio(
+                 label="Final stitched clip",
+                 type="filepath",
+             )
+
+         run_btn.click(
+             fn=_run_transition,
+             inputs=[
+                 song_a,
+                 song_b,
+                 plugin_id,
+                 instruction_text,
+                 transition_bars,
+                 pre_context_sec,
+                 post_context_sec,
+                 analysis_sec,
+                 bpm_target,
+                 creativity_strength,
+                 inference_steps,
+                 seed,
+                 cue_a_sec,
+                 cue_b_sec,
+                 lora_choice,
+                 lora_scale,
+                 output_dir,
+             ],
+             outputs=[transition_audio, hard_splice_audio, rough_stitched_audio, stitched_audio],
+         )
+
+     return demo
+
+
+ demo = build_ui()
+
+ if __name__ == "__main__":
+     demo.launch()
packages.txt ADDED
@@ -0,0 +1,2 @@
+ ffmpeg
+ libsndfile1
pipeline/__init__.py ADDED
@@ -0,0 +1,16 @@
+ """Pipeline package for deterministic transition generation."""
+
+ from .transition_generator import (
+     PLUGIN_PRESETS,
+     TransitionRequest,
+     TransitionResult,
+     generate_transition_artifacts,
+ )
+
+ __all__ = [
+     "PLUGIN_PRESETS",
+     "TransitionRequest",
+     "TransitionResult",
+     "generate_transition_artifacts",
+ ]
+
pipeline/audio_utils.py ADDED
@@ -0,0 +1,194 @@
+ import logging
+ import os
+ import shutil
+ import subprocess
+ import tempfile
+ from typing import Optional, Tuple
+
+ import librosa
+ import numpy as np
+ import soundfile as sf
+
+ LOGGER = logging.getLogger(__name__)
+
+
+ def clamp(value: float, low: float, high: float) -> float:
+     return float(max(low, min(high, value)))
+
+
+ def ensure_mono(y: np.ndarray) -> np.ndarray:
+     if y.ndim == 1:
+         return y
+     return np.mean(y, axis=1)
+
+
+ def ffprobe_duration_sec(path: str) -> Optional[float]:
+     if not shutil.which("ffprobe"):
+         return None
+
+     cmd = [
+         "ffprobe",
+         "-v",
+         "error",
+         "-show_entries",
+         "format=duration",
+         "-of",
+         "default=noprint_wrappers=1:nokey=1",
+         path,
+     ]
+     try:
+         out = subprocess.check_output(cmd, stderr=subprocess.STDOUT, text=True).strip()
+         return float(out)
+     except Exception:
+         return None
+
+
+ def decode_segment(path: str, start_sec: float, duration_sec: float, sr: int, max_decode_sec: float = 120.0) -> Tuple[np.ndarray, int]:
+     start_sec = max(0.0, float(start_sec))
+     duration_sec = max(0.0, float(duration_sec))
+     duration_sec = min(duration_sec, max_decode_sec)
+
+     if duration_sec <= 0:
+         return np.zeros((0,), dtype=np.float32), sr
+
+     if shutil.which("ffmpeg"):
+         tmp = tempfile.NamedTemporaryFile(suffix=".wav", delete=False)
+         tmp_path = tmp.name
+         tmp.close()
+         try:
+             cmd = [
+                 "ffmpeg",
+                 "-hide_banner",
+                 "-loglevel",
+                 "error",
+                 "-nostdin",
+                 "-y",
+                 "-ss",
+                 str(start_sec),
+                 "-t",
+                 str(duration_sec),
+                 "-i",
+                 path,
+                 "-ac",
+                 "1",
+                 "-ar",
+                 str(sr),
+                 tmp_path,
+             ]
+             subprocess.run(cmd, check=True)
+             y, read_sr = sf.read(tmp_path, dtype="float32", always_2d=False)
+             y = ensure_mono(np.asarray(y))
+             return y.astype(np.float32), int(read_sr)
+         finally:
+             try:
+                 os.remove(tmp_path)
+             except Exception:
+                 pass
+
+     y, read_sr = librosa.load(path, sr=sr, mono=True, offset=start_sec, duration=duration_sec)
+     return y.astype(np.float32), int(read_sr)
+
+
+ def estimate_bpm_and_beats(y: np.ndarray, sr: int) -> Tuple[Optional[float], np.ndarray]:
+     if y.size < sr:
+         return None, np.array([], dtype=np.float32)
+
+     try:
+         tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
+         tempo_f = float(tempo[0]) if isinstance(tempo, (list, np.ndarray)) else float(tempo)
+         beat_times = librosa.frames_to_time(beat_frames, sr=sr).astype(np.float32)
+         if not (40.0 <= tempo_f <= 220.0):
+             tempo_f = None
+         return tempo_f, beat_times
+     except Exception:
+         return None, np.array([], dtype=np.float32)
+
+
+ def choose_nearest_beat(beat_times: np.ndarray, target_sec: float) -> float:
+     if beat_times.size == 0:
+         return float(target_sec)
+     idx = int(np.argmin(np.abs(beat_times - float(target_sec))))
+     return float(beat_times[idx])
+
+
+ def choose_first_beat_after(beat_times: np.ndarray, target_sec: float) -> float:
+     if beat_times.size == 0:
+         return float(target_sec)
+     for bt in beat_times:
+         if float(bt) >= float(target_sec):
+             return float(bt)
+     return float(beat_times[-1])
+
+
+ def linear_fade(n: int, fade_in: bool) -> np.ndarray:
+     if n <= 0:
+         return np.zeros((0,), dtype=np.float32)
+     if fade_in:
+         return np.linspace(0.0, 1.0, n, dtype=np.float32)
+     return np.linspace(1.0, 0.0, n, dtype=np.float32)
+
+
+ def normalize_peak(y: np.ndarray, peak: float = 0.98) -> np.ndarray:
+     if y.size == 0:
+         return y.astype(np.float32)
+     maximum = float(np.max(np.abs(y)))
+     if maximum <= 1e-9:
+         return y.astype(np.float32)
+     if maximum <= peak:
+         return y.astype(np.float32)
+     return (y * (peak / maximum)).astype(np.float32)
+
+
+ def apply_edge_fades(y: np.ndarray, sr: int, fade_ms: float = 30.0) -> np.ndarray:
+     n = y.size
+     fade_n = int(sr * (fade_ms / 1000.0))
+     fade_n = min(fade_n, n // 2)
+     if fade_n <= 0:
+         return y
+     y2 = y.copy()
+     y2[:fade_n] *= linear_fade(fade_n, fade_in=True)
+     y2[-fade_n:] *= linear_fade(fade_n, fade_in=False)
+     return y2
+
+
+ def ensure_length(y: np.ndarray, target_n: int) -> np.ndarray:
+     target_n = int(max(0, target_n))
+     if y.size < target_n:
+         return np.pad(y, (0, target_n - y.size), mode="constant")
+     return y[:target_n]
+
+
+ def safe_time_stretch(y: np.ndarray, rate: float) -> np.ndarray:
+     rate = float(rate)
+     if y.size == 0:
+         return y.astype(np.float32)
+     if abs(rate - 1.0) < 1e-6:
+         return y.astype(np.float32)
+     try:
+         return librosa.effects.time_stretch(y, rate=rate).astype(np.float32)
+     except Exception as exc:
+         LOGGER.warning("Time-stretch failed (%s); using original audio.", exc)
+         return y.astype(np.float32)
+
+
+ def resample_if_needed(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
+     if int(orig_sr) == int(target_sr):
+         return y.astype(np.float32)
+     return librosa.resample(y, orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)
+
+
+ def crossfade_equal_length(a: np.ndarray, b: np.ndarray) -> np.ndarray:
+     n = min(a.size, b.size)
+     if n <= 0:
+         return np.zeros((0,), dtype=np.float32)
+     a = a[:n]
+     b = b[:n]
+     fade_in = linear_fade(n, fade_in=True)
+     fade_out = 1.0 - fade_in
+     return (a * fade_out + b * fade_in).astype(np.float32)
+
+
+ def write_wav(path: str, y: np.ndarray, sr: int) -> None:
+     # Guard against a bare filename, where dirname(path) is "" and makedirs would fail.
+     os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
+     sf.write(path, y.astype(np.float32), int(sr))
+
pipeline/cuepoint_selector.py ADDED
@@ -0,0 +1,1656 @@
import logging
import os
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple

import librosa  # type: ignore[reportMissingImports]
import numpy as np

from .audio_utils import choose_first_beat_after, choose_nearest_beat, decode_segment, ensure_length

LOGGER = logging.getLogger(__name__)

_ANALYSIS_HOP = 512
_STRUCT_SR = 22050
_DEMUCS_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_ANALYSIS", "1").strip().lower() not in {
    "0",
    "false",
    "no",
    "off",
}
_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
_DEMUCS_SEGMENT_SEC = 7.0
_DEMUCS_MIN_WINDOW_SEC = 6.0

_PROFILE_CACHE: Dict[Tuple[str, int], Optional["_TrackProfiles"]] = {}
_LIBROSA_STRUCT_CACHE: Dict[str, Optional[Dict[str, np.ndarray]]] = {}
_DEMUCS_MODEL: Any = None
_DEMUCS_TORCH: Any = None
_DEMUCS_DEVICE = "cpu"
_DEMUCS_LOAD_ATTEMPTED = False
_DEMUCS_LOAD_ERROR: Optional[str] = None


@dataclass
class CueSelectionResult:
    cue_a_sec: float
    cue_b_sec: float
    method: str
    debug: Dict[str, object]


@dataclass
class _CueCandidate:
    time_sec: float
    beat_idx: int
    phrase: float
    energy: float
    onset: float
    chroma: np.ndarray
    vocal_ratio: float
    vocal_onset: float
    vocal_phrase_score: float
    drum_anchor: float
    bass_energy: float
    bass_stability: float
    instrumental_density: float
    density_score: float
    period_vocal_ratio: float
    period_vocal_phrase_score: float
    period_drum_anchor: float
    period_bass_energy: float
    period_bass_stability: float
    period_density_score: float
    period_coverage: float
    period_vocal_curve: np.ndarray
    period_bass_curve: np.ndarray


@dataclass
class _TrackProfiles:
    rms: np.ndarray
    rms_times: np.ndarray
    onset: np.ndarray
    onset_times: np.ndarray
    chroma: np.ndarray
    chroma_times: np.ndarray


@dataclass
class _VocalActivityProfile:
    vocal_ratio: np.ndarray
    vocal_onset: np.ndarray
    drum_onset: np.ndarray
    bass_rms: np.ndarray
    instrumental_rms: np.ndarray
    times: np.ndarray
    method: str
    has_drums: bool
    has_bass: bool


@dataclass
class _StructuredCandidate:
    cue: _CueCandidate
    label: str
    label_score: float
    edge_score: float
    position_score: float


def _clamp(value: float, low: float, high: float) -> float:
    return float(max(low, min(high, value)))


def _mean_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
    if values.size == 0 or times.size == 0:
        return 0.0
    lo = float(min(start, end))
    hi = float(max(start, end))
    mask = (times >= lo) & (times <= hi)
    if np.any(mask):
        return float(np.mean(values[mask]))
    idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
    return float(values[idx])


def _std_1d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> float:
    if values.size == 0 or times.size == 0:
        return 0.0
    lo = float(min(start, end))
    hi = float(max(start, end))
    mask = (times >= lo) & (times <= hi)
    if np.any(mask):
        return float(np.std(values[mask]))
    # No samples fall inside the window; a single nearest sample has zero spread.
    return 0.0
128
+
129
+
130
+ def _smooth_1d(values: np.ndarray, kernel_size: int) -> np.ndarray:
131
+ arr = np.asarray(values, dtype=np.float32).reshape(-1)
132
+ if arr.size == 0:
133
+ return np.zeros((1,), dtype=np.float32)
134
+ k = int(max(1, kernel_size))
135
+ if k == 1 or arr.size < k:
136
+ return arr.astype(np.float32)
137
+ kernel = np.ones((k,), dtype=np.float32) / float(k)
138
+ return np.convolve(arr, kernel, mode="same").astype(np.float32)
139
+
140
+
141
+ def _normalize_1d(values: np.ndarray) -> np.ndarray:
142
+ arr = np.asarray(values, dtype=np.float32).reshape(-1)
143
+ if arr.size == 0:
144
+ return np.zeros((1,), dtype=np.float32)
145
+ lo = float(np.percentile(arr, 5))
146
+ hi = float(np.percentile(arr, 95))
147
+ if hi - lo > 1e-6:
148
+ out = (arr - lo) / (hi - lo)
149
+ return np.clip(out, 0.0, 1.0).astype(np.float32)
150
+ mx = float(np.max(np.abs(arr)))
151
+ if mx > 1e-6:
152
+ out = arr / mx
153
+ return np.clip(out, 0.0, 1.0).astype(np.float32)
154
+ return np.zeros_like(arr, dtype=np.float32)
155
+
156
+
157
+ def _align_series_min_length(series: List[np.ndarray]) -> List[np.ndarray]:
158
+ clean = [np.asarray(x, dtype=np.float32).reshape(-1) for x in series]
159
+ if not clean:
160
+ return []
161
+ min_len = min((x.size for x in clean if x.size > 0), default=0)
162
+ if min_len <= 0:
163
+ return [np.zeros((1,), dtype=np.float32) for _ in clean]
164
+ return [x[:min_len].astype(np.float32) if x.size >= min_len else np.pad(x, (0, min_len - x.size)).astype(np.float32) for x in clean]
165
+
166
+
167
+ def _mean_2d(values: np.ndarray, times: np.ndarray, start: float, end: float) -> np.ndarray:
168
+ if values.ndim != 2 or values.shape[1] == 0 or times.size == 0:
169
+ return np.zeros((12,), dtype=np.float32)
170
+ lo = float(min(start, end))
171
+ hi = float(max(start, end))
172
+ mask = (times >= lo) & (times <= hi)
173
+ if np.any(mask):
174
+ vec = np.mean(values[:, mask], axis=1).astype(np.float32)
175
+ else:
176
+ idx = int(np.argmin(np.abs(times - ((lo + hi) * 0.5))))
177
+ vec = values[:, idx].astype(np.float32)
178
+ norm = float(np.linalg.norm(vec))
179
+ if norm > 1e-9:
180
+ vec = vec / norm
181
+ return vec
182
+
183
+
184
+ def _cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
185
+ if a.size == 0 or b.size == 0:
186
+ return 0.0
187
+ denom = float(np.linalg.norm(a) * np.linalg.norm(b))
188
+ if denom <= 1e-9:
189
+ return 0.0
190
+ return float(np.dot(a, b) / denom)
191
+
192
+
193
+ def _phrase_score(beat_idx: int) -> float:
194
+ if beat_idx < 0:
195
+ return 0.5
196
+ mod4 = beat_idx % 4
197
+ mod8 = beat_idx % 8
198
+ dist4 = min(mod4, 4 - mod4)
199
+ dist8 = min(mod8, 8 - mod8)
200
+ score4 = 1.0 - (dist4 / 2.0)
201
+ score8 = 1.0 - (dist8 / 4.0)
202
+ return _clamp((0.65 * score4) + (0.35 * score8), 0.0, 1.0)
203
+
204
+
205
+ def _target_position_score(x: float, target: float, spread: float) -> float:
206
+ spread = max(1e-3, float(spread))
207
+ return float(np.exp(-abs(float(x) - float(target)) / spread))
208
+
209
+
210
+ def _edge_score(x: float, duration_sec: float) -> float:
211
+ if duration_sec <= 1e-6:
212
+ return 0.0
213
+ ratio = float(x / duration_sec)
214
+ return _clamp(min(ratio / 0.16, (1.0 - ratio) / 0.16), 0.0, 1.0)
215
+
216
+
217
+ def _resolve_demucs_device(torch_mod: Any) -> str:
218
+ pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
219
+ if pref in {"cpu"}:
220
+ return "cpu"
221
+ if pref in {"cuda", "gpu"}:
222
+ return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
223
+ return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
224
+
225
+
226
+ def _get_demucs_model() -> Tuple[Optional[Any], Optional[Any], str, Optional[str]]:
227
+ global _DEMUCS_MODEL, _DEMUCS_TORCH, _DEMUCS_DEVICE, _DEMUCS_LOAD_ATTEMPTED, _DEMUCS_LOAD_ERROR
228
+
229
+ if not _DEMUCS_ENABLED:
230
+ return None, None, "disabled", "AI_DJ_ENABLE_DEMUCS_ANALYSIS=0"
231
+
232
+ if _DEMUCS_LOAD_ATTEMPTED:
233
+ if _DEMUCS_MODEL is None:
234
+ return None, _DEMUCS_TORCH, "unavailable", _DEMUCS_LOAD_ERROR
235
+ return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
236
+
237
+ _DEMUCS_LOAD_ATTEMPTED = True
238
+ try:
239
+ import torch # type: ignore[reportMissingImports]
240
+ from demucs.pretrained import get_model # type: ignore[reportMissingImports]
241
+
242
+ model = get_model(_DEMUCS_MODEL_NAME)
243
+ model.eval()
244
+ _DEMUCS_DEVICE = _resolve_demucs_device(torch)
245
+ model.to(_DEMUCS_DEVICE)
246
+ _DEMUCS_MODEL = model
247
+ _DEMUCS_TORCH = torch
248
+ _DEMUCS_LOAD_ERROR = None
249
+ return _DEMUCS_MODEL, _DEMUCS_TORCH, "ready", None
250
+ except Exception as exc:
251
+ _DEMUCS_MODEL = None
252
+ _DEMUCS_TORCH = None
253
+ _DEMUCS_LOAD_ERROR = str(exc)
254
+ LOGGER.warning(
255
+ "Demucs vocal analysis unavailable (%s). Cue selection continues without vocal penalty.",
256
+ exc,
257
+ )
258
+ return None, None, "unavailable", _DEMUCS_LOAD_ERROR
259
+
260
+
261
+ def _vocal_score_from_ratio(vocal_ratio: float) -> float:
262
+ ratio = _clamp(float(vocal_ratio), 0.0, 1.0)
263
+ # Penalize clearly vocal-dominant bars while leaving mixed bars mostly neutral.
264
+ return 1.0 - _clamp((ratio - 0.32) / 0.5, 0.0, 1.0)
265
+
266
+
267
+ def _lookup_stem_mixability(profile: Optional[_VocalActivityProfile], time_sec: float) -> Dict[str, float]:
268
+ neutral = {
269
+ "vocal_ratio": 0.0,
270
+ "vocal_onset": 0.0,
271
+ "vocal_phrase_score": 0.5,
272
+ "drum_anchor": 0.5,
273
+ "bass_energy": 0.5,
274
+ "bass_stability": 0.5,
275
+ "instrumental_density": 0.5,
276
+ "density_score": 0.5,
277
+ }
278
+ if profile is None or profile.times.size == 0:
279
+ return neutral
280
+ t_min = float(np.min(profile.times))
281
+ t_max = float(np.max(profile.times))
282
+ if float(time_sec) < (t_min - 0.6) or float(time_sec) > (t_max + 0.6):
283
+ return neutral
284
+
285
+ ratio = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.2, time_sec + 1.2), 0.0, 1.0)
286
+ vocal_onset = _clamp(_mean_1d(profile.vocal_onset, profile.times, time_sec - 0.2, time_sec + 0.3), 0.0, 1.0)
287
+ vocal_before = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec - 1.8, time_sec - 0.25), 0.0, 1.0)
288
+ vocal_after = _clamp(_mean_1d(profile.vocal_ratio, profile.times, time_sec + 0.25, time_sec + 1.8), 0.0, 1.0)
289
+ ending_score = _clamp((vocal_before - vocal_after + 0.05) / 0.35, 0.0, 1.0)
290
+ low_vocal_score = _vocal_score_from_ratio(ratio)
291
+ onset_quiet_score = 1.0 - vocal_onset
292
+ vocal_phrase_score = _clamp(
293
+ (0.52 * low_vocal_score) + (0.30 * ending_score) + (0.18 * onset_quiet_score),
294
+ 0.0,
295
+ 1.0,
296
+ )
297
+
298
+ if profile.has_drums:
299
+ drum_hit = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 0.1, time_sec + 0.24), 0.0, 1.0)
300
+ drum_bg = _clamp(_mean_1d(profile.drum_onset, profile.times, time_sec - 1.0, time_sec + 1.0), 0.0, 1.0)
301
+ drum_anchor = _clamp((0.72 * drum_hit) + (0.28 * _clamp(drum_hit - drum_bg + 0.22, 0.0, 1.0)), 0.0, 1.0)
302
+ else:
303
+ drum_anchor = 0.5
304
+
305
+ if profile.has_bass:
306
+ bass_energy = _clamp(_mean_1d(profile.bass_rms, profile.times, time_sec - 1.4, time_sec + 1.4), 0.0, 1.0)
307
+ bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, time_sec - 1.8, time_sec + 1.8), 0.0, 1.0)
308
+ bass_cv = bass_std / max(1e-4, bass_energy + 0.08)
309
+ bass_stability = 1.0 - _clamp((bass_cv - 0.18) / 0.85, 0.0, 1.0)
310
+ else:
311
+ bass_energy = 0.5
312
+ bass_stability = 0.5
313
+
314
+ instrumental_density = _clamp(
315
+ _mean_1d(profile.instrumental_rms, profile.times, time_sec - 1.4, time_sec + 1.4),
316
+ 0.0,
317
+ 1.0,
318
+ )
319
+ density_score = _target_position_score(instrumental_density, target=0.56, spread=0.24)
320
+
321
+ return {
322
+ "vocal_ratio": float(ratio),
323
+ "vocal_onset": float(vocal_onset),
324
+ "vocal_phrase_score": float(vocal_phrase_score),
325
+ "drum_anchor": float(drum_anchor),
326
+ "bass_energy": float(bass_energy),
327
+ "bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
328
+ "instrumental_density": float(instrumental_density),
329
+ "density_score": float(_clamp(density_score, 0.0, 1.0)),
330
+ }
331
+
332
+
333
+ def _range_coverage_ratio(times: np.ndarray, start: float, end: float) -> float:
334
+ if times.size == 0:
335
+ return 0.0
336
+ lo = float(min(start, end))
337
+ hi = float(max(start, end))
338
+ if hi - lo <= 1e-6:
339
+ return 0.0
340
+ t_min = float(np.min(times))
341
+ t_max = float(np.max(times))
342
+ overlap = max(0.0, min(hi, t_max) - max(lo, t_min))
343
+ return _clamp(overlap / max(1e-6, (hi - lo)), 0.0, 1.0)
344
+
345
+
346
+ def _sample_curve(values: np.ndarray, times: np.ndarray, start: float, end: float, samples: int = 16) -> np.ndarray:
347
+ n = int(max(4, samples))
348
+ if values.size == 0 or times.size == 0:
349
+ return np.zeros((n,), dtype=np.float32)
350
+ lo = float(min(start, end))
351
+ hi = float(max(start, end))
352
+ if hi - lo <= 1e-6:
353
+ base = float(_mean_1d(values, times, lo - 0.25, hi + 0.25))
354
+ return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
355
+ ts = np.linspace(lo, hi, n, dtype=np.float32)
356
+ if times.size < 2:
357
+ base = float(_mean_1d(values, times, lo, hi))
358
+ return np.full((n,), _clamp(base, 0.0, 1.0), dtype=np.float32)
359
+ curve = np.interp(
360
+ ts.astype(np.float64),
361
+ times.astype(np.float64),
362
+ values.astype(np.float64),
363
+ left=float(values[0]),
364
+ right=float(values[-1]),
365
+ ).astype(np.float32)
366
+ return np.clip(curve, 0.0, 1.0).astype(np.float32)
367
+
368
+
369
+ def _lookup_period_mixability(
370
+ profile: Optional[_VocalActivityProfile],
371
+ start_sec: float,
372
+ end_sec: float,
373
+ incoming: bool,
374
+ ) -> Dict[str, Any]:
375
+ neutral_curve = np.full((16,), 0.5, dtype=np.float32)
376
+ neutral = {
377
+ "coverage": 0.0,
378
+ "period_vocal_ratio": 0.0,
379
+ "period_vocal_phrase_score": 0.5,
380
+ "period_drum_anchor": 0.5,
381
+ "period_bass_energy": 0.5,
382
+ "period_bass_stability": 0.5,
383
+ "period_density_score": 0.5,
384
+ "period_vocal_curve": neutral_curve.copy(),
385
+ "period_bass_curve": neutral_curve.copy(),
386
+ }
387
+ if profile is None or profile.times.size == 0:
388
+ return neutral
389
+
390
+ lo = float(min(start_sec, end_sec))
391
+ hi = float(max(start_sec, end_sec))
392
+ span = max(1e-4, hi - lo)
393
+ coverage = _range_coverage_ratio(profile.times, lo, hi)
394
+ if coverage <= 0.03:
395
+ return neutral
396
+
397
+ ratio_mean = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, hi), 0.0, 1.0)
398
+ vocal_curve = _sample_curve(profile.vocal_ratio, profile.times, lo, hi, samples=16)
399
+ bass_curve = _sample_curve(profile.bass_rms, profile.times, lo, hi, samples=16)
400
+ first_cut = lo + (0.35 * span)
401
+ last_cut = hi - (0.35 * span)
402
+ vocal_start = _clamp(_mean_1d(profile.vocal_ratio, profile.times, lo, first_cut), 0.0, 1.0)
403
+ vocal_end = _clamp(_mean_1d(profile.vocal_ratio, profile.times, last_cut, hi), 0.0, 1.0)
404
+ boundary_t = lo if incoming else hi
405
+ vocal_onset_boundary = _clamp(
406
+ _mean_1d(profile.vocal_onset, profile.times, boundary_t - 0.16, boundary_t + 0.26),
407
+ 0.0,
408
+ 1.0,
409
+ )
410
+ low_vocal_score = _vocal_score_from_ratio(ratio_mean)
411
+ onset_quiet = 1.0 - vocal_onset_boundary
412
+ if incoming:
413
+ start_quiet = _clamp(1.0 - ((vocal_start - 0.22) / 0.58), 0.0, 1.0)
414
+ rise_ok = _clamp((vocal_end - vocal_start + 0.08) / 0.38, 0.0, 1.0)
415
+ trend_score = _clamp((0.72 * start_quiet) + (0.28 * rise_ok), 0.0, 1.0)
416
+ else:
417
+ ending = _clamp((vocal_start - vocal_end + 0.05) / 0.35, 0.0, 1.0)
418
+ trend_score = ending
419
+ vocal_phrase = _clamp((0.50 * low_vocal_score) + (0.30 * trend_score) + (0.20 * onset_quiet), 0.0, 1.0)
420
+
421
+ if profile.has_drums:
422
+ drum_mean = _clamp(_mean_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
423
+ drum_std = _clamp(_std_1d(profile.drum_onset, profile.times, lo, hi), 0.0, 1.0)
424
+ drum_boundary = _clamp(
425
+ _mean_1d(profile.drum_onset, profile.times, boundary_t - 0.12, boundary_t + 0.20),
426
+ 0.0,
427
+ 1.0,
428
+ )
429
+ drum_anchor = _clamp(
430
+ (0.45 * drum_boundary)
431
+ + (0.35 * drum_mean)
432
+ + (0.20 * (1.0 - _clamp(drum_std / 0.35, 0.0, 1.0))),
433
+ 0.0,
434
+ 1.0,
435
+ )
436
+ else:
437
+ drum_anchor = 0.5
438
+
439
+ if profile.has_bass:
440
+ bass_mean = _clamp(_mean_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
441
+ bass_std = _clamp(_std_1d(profile.bass_rms, profile.times, lo, hi), 0.0, 1.0)
442
+ bass_cv = bass_std / max(0.08, bass_mean)
443
+ bass_stability = 1.0 - _clamp((bass_cv - 0.20) / 0.95, 0.0, 1.0)
444
+ else:
445
+ bass_mean = 0.5
446
+ bass_stability = 0.5
447
+
448
+ density_mean = _clamp(_mean_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
449
+ density_std = _clamp(_std_1d(profile.instrumental_rms, profile.times, lo, hi), 0.0, 1.0)
450
+ density_target = _target_position_score(density_mean, target=0.56, spread=0.22)
451
+ density_stability = 1.0 - _clamp(density_std / 0.32, 0.0, 1.0)
452
+ density_score = _clamp((0.75 * density_target) + (0.25 * density_stability), 0.0, 1.0)
453
+
454
+ return {
455
+ "coverage": float(coverage),
456
+ "period_vocal_ratio": float(ratio_mean),
457
+ "period_vocal_phrase_score": float(vocal_phrase),
458
+ "period_drum_anchor": float(drum_anchor),
459
+ "period_bass_energy": float(bass_mean),
460
+ "period_bass_stability": float(_clamp(bass_stability, 0.0, 1.0)),
461
+ "period_density_score": float(density_score),
462
+ "period_vocal_curve": vocal_curve.astype(np.float32),
463
+ "period_bass_curve": bass_curve.astype(np.float32),
464
+ }
465
+
466
+
467
+ def _period_overlap_clash(cand_a: _CueCandidate, cand_b: _CueCandidate) -> Tuple[float, float, float]:
468
+ n = int(
469
+ max(
470
+ 4,
471
+ min(
472
+ int(cand_a.period_vocal_curve.size),
473
+ int(cand_b.period_vocal_curve.size),
474
+ int(cand_a.period_bass_curve.size),
475
+ int(cand_b.period_bass_curve.size),
476
+ ),
477
+ )
478
+ )
479
+ if n <= 0:
480
+ vocal = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
481
+ bass = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
482
+ cov = 0.5 * (cand_a.period_coverage + cand_b.period_coverage)
483
+ return vocal, bass, cov
484
+
485
+ a_v = ensure_length(cand_a.period_vocal_curve.astype(np.float32), n)
486
+ b_v = ensure_length(cand_b.period_vocal_curve.astype(np.float32), n)
487
+ a_b = ensure_length(cand_a.period_bass_curve.astype(np.float32), n)
488
+ b_b = ensure_length(cand_b.period_bass_curve.astype(np.float32), n)
489
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
490
+
491
+ w_a_v = 1.0 - x
492
+ w_b_v = x
493
+ vocal_risk = float(np.mean((a_v * w_a_v) * (b_v * w_b_v)))
494
+ vocal_risk = _clamp(vocal_risk * 4.0, 0.0, 1.0)
495
+
496
+ w_b_b = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
497
+ w_a_b = 1.0 - w_b_b
498
+ center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
499
+ bass_risk = float(np.mean((a_b * w_a_b * center_bass_shape) * (b_b * w_b_b * center_bass_shape)))
500
+ bass_risk = _clamp(bass_risk * 6.0, 0.0, 1.0)
501
+
502
+ coverage = _clamp(0.5 * (cand_a.period_coverage + cand_b.period_coverage), 0.0, 1.0)
503
+ if coverage < 0.35:
504
+ fallback_v = _clamp(cand_a.period_vocal_ratio * cand_b.period_vocal_ratio, 0.0, 1.0)
505
+ fallback_b = _clamp(cand_a.period_bass_energy * cand_b.period_bass_energy, 0.0, 1.0)
506
+ alpha = _clamp((0.35 - coverage) / 0.35, 0.0, 1.0)
507
+ vocal_risk = (1.0 - alpha) * vocal_risk + (alpha * fallback_v)
508
+ bass_risk = (1.0 - alpha) * bass_risk + (alpha * fallback_b)
509
+
510
+ return float(vocal_risk), float(bass_risk), float(coverage)
511
+
512
+
513
+ def _extract_vocal_profile_demucs(
514
+ y: np.ndarray,
515
+ sr: int,
516
+ window_start_sec: float,
517
+ track_label: str,
518
+ ) -> Tuple[Optional[_VocalActivityProfile], Dict[str, object]]:
519
+ global _DEMUCS_DEVICE
520
+
521
+ info: Dict[str, object] = {
522
+ "track": track_label,
523
+ "enabled": bool(_DEMUCS_ENABLED),
524
+ "model": _DEMUCS_MODEL_NAME,
525
+ }
526
+ if y.size < int(max(1, sr) * _DEMUCS_MIN_WINDOW_SEC):
527
+ info["status"] = "skipped-short-window"
528
+ return None, info
529
+
530
+ model, torch_mod, status, reason = _get_demucs_model()
531
+ info["status"] = status
532
+ if reason:
533
+ info["reason"] = reason
534
+ if model is None or torch_mod is None:
535
+ return None, info
536
+
537
+ try:
538
+ from demucs.apply import apply_model # type: ignore[reportMissingImports]
539
+
540
+ mono = np.asarray(y, dtype=np.float32).reshape(-1)
541
+ if mono.size == 0:
542
+ info["status"] = "empty-window"
543
+ return None, info
544
+ peak = float(np.max(np.abs(mono)))
545
+ if peak > 1e-9:
546
+ mono = mono / peak
547
+
548
+ demucs_sr = int(getattr(model, "samplerate", 44100))
549
+ if int(sr) != demucs_sr:
550
+ mono = librosa.resample(mono, orig_sr=int(sr), target_sr=demucs_sr).astype(np.float32)
551
+ if mono.size < int(demucs_sr * _DEMUCS_MIN_WINDOW_SEC):
552
+ info["status"] = "skipped-short-window"
553
+ return None, info
554
+
555
+ stereo = np.stack([mono, mono], axis=0)
556
+ mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(_DEMUCS_DEVICE)
557
+ audio_sec = float(mono.size / max(1, demucs_sr))
558
+ segment_limit = float(_DEMUCS_SEGMENT_SEC)
559
+ if audio_sec <= (segment_limit + 0.02):
560
+ use_split = False
561
+ segment_sec = None
562
+ else:
563
+ use_split = True
564
+ segment_sec = segment_limit
565
+
566
+ try:
567
+ with torch_mod.no_grad():
568
+ estimates = apply_model(
569
+ model,
570
+ mix,
571
+ shifts=1,
572
+ split=use_split,
573
+ overlap=0.25,
574
+ progress=False,
575
+ device=_DEMUCS_DEVICE,
576
+ segment=segment_sec,
577
+ )
578
+ except Exception as exc:
579
+ if _DEMUCS_DEVICE == "cuda":
580
+ model.to("cpu")
581
+ _DEMUCS_DEVICE = "cpu"
582
+ mix = mix.to("cpu")
583
+ with torch_mod.no_grad():
584
+ estimates = apply_model(
585
+ model,
586
+ mix,
587
+ shifts=1,
588
+ split=use_split,
589
+ overlap=0.25,
590
+ progress=False,
591
+ device="cpu",
592
+ segment=segment_sec,
593
+ )
594
+ info["device_fallback"] = f"cuda->cpu ({exc})"
595
+ else:
596
+ raise
597
+
598
+ estimates = estimates.detach().cpu()
599
+ est = estimates[0] if estimates.ndim == 4 else estimates
600
+ if est.ndim != 3:
601
+ raise RuntimeError(f"Unexpected demucs output ndim: {est.ndim}")
602
+
603
+ source_names = [str(s) for s in getattr(model, "sources", [])]
604
+ if not source_names:
605
+ raise RuntimeError("Demucs model returned no source labels.")
606
+ if est.shape[0] != len(source_names):
607
+ if est.shape[1] == len(source_names):
608
+ est = est.permute(1, 0, 2)
609
+ else:
610
+ raise RuntimeError(
611
+ f"Demucs output/source mismatch ({tuple(est.shape)} vs {len(source_names)} sources)."
612
+ )
613
+ if "vocals" not in source_names:
614
+ raise RuntimeError("Demucs model does not expose a 'vocals' stem.")
615
+
616
+ vocal_idx = source_names.index("vocals")
617
+ vocals = est[vocal_idx]
618
+ has_drums = "drums" in source_names
619
+ has_bass = "bass" in source_names
620
+ drums = est[source_names.index("drums")] if has_drums else torch_mod.zeros_like(vocals)
621
+ bass = est[source_names.index("bass")] if has_bass else torch_mod.zeros_like(vocals)
622
+ non_vocal_idxs = [i for i in range(len(source_names)) if i != vocal_idx]
623
+ if non_vocal_idxs:
624
+ accompaniment = est[non_vocal_idxs].sum(dim=0)
625
+ else:
626
+ accompaniment = torch_mod.zeros_like(vocals)
627
+
628
+ vocals_mono = vocals.mean(dim=0).numpy().astype(np.float32)
629
+ drums_mono = drums.mean(dim=0).numpy().astype(np.float32)
630
+ bass_mono = bass.mean(dim=0).numpy().astype(np.float32)
631
+ accompaniment_mono = accompaniment.mean(dim=0).numpy().astype(np.float32)
632
+
633
+ vocal_rms = librosa.feature.rms(y=vocals_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
634
+ acc_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
635
+ bass_rms = librosa.feature.rms(y=bass_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
636
+ inst_rms = librosa.feature.rms(y=accompaniment_mono, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
637
+ vocal_onset = librosa.onset.onset_strength(y=vocals_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
638
+ drum_onset = librosa.onset.onset_strength(y=drums_mono, sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
639
+
640
+ ratio_raw = vocal_rms / np.maximum(vocal_rms + acc_rms, 1e-6)
641
+ ratio_raw = np.clip(ratio_raw, 0.0, 1.0).astype(np.float32)
642
+ aligned = _align_series_min_length([ratio_raw, vocal_onset, drum_onset, bass_rms, inst_rms])
643
+ if not aligned:
644
+ raise RuntimeError("Demucs profile alignment failed.")
645
+ ratio, vocal_onset_n, drum_onset_n, bass_rms_n, inst_rms_n = aligned
646
+
647
+ ratio = _smooth_1d(ratio, kernel_size=5)
648
+ vocal_onset_n = _normalize_1d(_smooth_1d(vocal_onset_n, kernel_size=3))
649
+ drum_onset_n = _normalize_1d(_smooth_1d(drum_onset_n, kernel_size=3))
650
+ bass_rms_n = _normalize_1d(_smooth_1d(bass_rms_n, kernel_size=5))
651
+ inst_rms_n = _normalize_1d(_smooth_1d(inst_rms_n, kernel_size=5))
652
+
653
+ times = librosa.frames_to_time(np.arange(ratio.size), sr=demucs_sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
654
+ times = times + float(window_start_sec)
655
+
656
+ info.update(
657
+ {
658
+ "status": "ready",
659
+ "device": _DEMUCS_DEVICE,
660
+ "method": "demucs-stem-mixability",
661
+ "has_drums": bool(has_drums),
662
+ "has_bass": bool(has_bass),
663
+ "split_mode": "chunked" if use_split else "full-window",
664
+ "window_start_sec": round(float(window_start_sec), 3),
665
+ "window_duration_sec": round(float(mono.size / max(1, demucs_sr)), 3),
666
+ }
667
+ )
668
+ return _VocalActivityProfile(
669
+ vocal_ratio=ratio,
670
+ vocal_onset=vocal_onset_n,
671
+ drum_onset=drum_onset_n,
672
+ bass_rms=bass_rms_n,
673
+ instrumental_rms=inst_rms_n,
674
+ times=times,
675
+ method="demucs-stem-mixability",
676
+ has_drums=bool(has_drums),
677
+ has_bass=bool(has_bass),
678
+ ), info
679
+ except Exception as exc:
680
+ LOGGER.warning("Demucs vocal analysis failed for %s (%s). Continuing without vocal penalty.", track_label, exc)
681
+ info["status"] = "error"
682
+ info["reason"] = str(exc)
683
+ return None, info
684
+
685
+
686
+ def _label_weight(label: str, outgoing: bool) -> float:
687
+ label_l = (label or "").strip().lower()
688
+ if outgoing:
689
+ table = [
690
+ ("outro", 1.00),
691
+ ("break", 0.95),
692
+ ("bridge", 0.90),
693
+ ("verse", 0.82),
694
+ ("chorus", 0.66),
695
+ ("intro", 0.20),
696
+ ("start", 0.10),
697
+ ("end", 0.05),
698
+ ]
699
+ else:
700
+ table = [
701
+ ("verse", 0.95),
702
+ ("break", 0.90),
703
+ ("bridge", 0.84),
704
+ ("chorus", 0.80),
705
+ ("intro", 0.74),
706
+ ("outro", 0.20),
707
+ ("start", 0.10),
708
+ ("end", 0.05),
709
+ ]
710
+ for token, score in table:
711
+ if token in label_l:
712
+ return float(score)
713
+ return 0.60
714
+
715
+
716
+ def _compute_profiles(y: np.ndarray, sr: int) -> _TrackProfiles:
717
+ if y.size == 0:
718
+ zero = np.zeros((1,), dtype=np.float32)
719
+ return _TrackProfiles(
720
+ rms=zero,
721
+ rms_times=zero.copy(),
722
+ onset=zero.copy(),
723
+ onset_times=zero.copy(),
724
+ chroma=np.zeros((12, 1), dtype=np.float32),
725
+ chroma_times=zero.copy(),
726
+ )
727
+
728
+ rms = librosa.feature.rms(y=y, frame_length=2048, hop_length=_ANALYSIS_HOP)[0].astype(np.float32)
729
+ onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
730
+ try:
731
+ harmonic = librosa.effects.harmonic(y)
732
+ chroma = librosa.feature.chroma_cens(y=harmonic, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
733
+ except Exception as exc:
734
+ LOGGER.warning("Harmonic chroma extraction failed (%s); falling back to raw chroma.", exc)
735
+ chroma = librosa.feature.chroma_cens(y=y, sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
736
+
737
+ if rms.size == 0:
738
+ rms = np.zeros((1,), dtype=np.float32)
739
+ if onset.size == 0:
740
+ onset = np.zeros((1,), dtype=np.float32)
741
+ if chroma.ndim != 2 or chroma.shape[1] == 0:
742
+ chroma = np.zeros((12, 1), dtype=np.float32)
743
+
744
+ max_rms = float(np.max(rms))
745
+ if max_rms > 1e-9:
746
+ rms = rms / max_rms
747
+ max_onset = float(np.max(onset))
748
+ if max_onset > 1e-9:
749
+ onset = onset / max_onset
750
+
751
+ rms_times = librosa.frames_to_time(np.arange(rms.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
752
+ onset_times = librosa.frames_to_time(np.arange(onset.size), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
753
+ chroma_times = librosa.frames_to_time(np.arange(chroma.shape[1]), sr=sr, hop_length=_ANALYSIS_HOP).astype(np.float32)
754
+ return _TrackProfiles(
755
+ rms=rms,
756
+ rms_times=rms_times,
757
+ onset=onset,
758
+ onset_times=onset_times,
759
+ chroma=chroma,
760
+ chroma_times=chroma_times,
761
+ )
762
+
763
+
764
+ def _build_candidates(
765
+ beat_times: np.ndarray,
766
+ min_sec: float,
767
+ max_sec: float,
768
+ prefer_tail: bool,
769
+ limit: int,
770
+ ) -> List[Tuple[float, int]]:
771
+ if beat_times.size == 0:
772
+ return []
773
+ idxs = [idx for idx, t in enumerate(beat_times) if float(min_sec) <= float(t) <= float(max_sec)]
774
+ if not idxs:
775
+ return []
776
+ idxs = idxs[-limit:] if prefer_tail else idxs[:limit]
777
+ return [(float(beat_times[idx]), int(idx)) for idx in idxs]
778
+
779
+
780
+ def _make_candidate(
+     time_sec: float,
+     beat_idx: int,
+     profiles: _TrackProfiles,
+     incoming: bool,
+     seam_sec: float,
+     vocal_profile: Optional[_VocalActivityProfile] = None,
+     vocal_time_sec: Optional[float] = None,
+ ) -> _CueCandidate:
+     if incoming:
+         energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec, time_sec + 1.0)
+         onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.1, time_sec + 0.5)
+     else:
+         energy = _mean_1d(profiles.rms, profiles.rms_times, time_sec - 1.0, time_sec)
+         onset = _mean_1d(profiles.onset, profiles.onset_times, time_sec - 0.5, time_sec + 0.1)
+     chroma = _mean_2d(profiles.chroma, profiles.chroma_times, time_sec - 2.0, time_sec + 2.0)
+     vocal_lookup_sec = float(vocal_time_sec) if vocal_time_sec is not None else float(time_sec)
+     stem_mix = _lookup_stem_mixability(vocal_profile, vocal_lookup_sec)
+     seam = max(1e-3, float(seam_sec))
+     period_start = vocal_lookup_sec if incoming else (vocal_lookup_sec - seam)
+     period_end = (vocal_lookup_sec + seam) if incoming else vocal_lookup_sec
+     period_mix = _lookup_period_mixability(
+         profile=vocal_profile,
+         start_sec=period_start,
+         end_sec=period_end,
+         incoming=incoming,
+     )
+     return _CueCandidate(
+         time_sec=float(time_sec),
+         beat_idx=int(beat_idx),
+         phrase=_phrase_score(int(beat_idx)),
+         energy=float(_clamp(energy, 0.0, 1.0)),
+         onset=float(_clamp(onset, 0.0, 1.0)),
+         chroma=chroma,
+         vocal_ratio=float(stem_mix["vocal_ratio"]),
+         vocal_onset=float(stem_mix["vocal_onset"]),
+         vocal_phrase_score=float(stem_mix["vocal_phrase_score"]),
+         drum_anchor=float(stem_mix["drum_anchor"]),
+         bass_energy=float(stem_mix["bass_energy"]),
+         bass_stability=float(stem_mix["bass_stability"]),
+         instrumental_density=float(stem_mix["instrumental_density"]),
+         density_score=float(stem_mix["density_score"]),
+         period_vocal_ratio=float(period_mix["period_vocal_ratio"]),
+         period_vocal_phrase_score=float(period_mix["period_vocal_phrase_score"]),
+         period_drum_anchor=float(period_mix["period_drum_anchor"]),
+         period_bass_energy=float(period_mix["period_bass_energy"]),
+         period_bass_stability=float(period_mix["period_bass_stability"]),
+         period_density_score=float(period_mix["period_density_score"]),
+         period_coverage=float(period_mix["coverage"]),
+         period_vocal_curve=np.asarray(period_mix["period_vocal_curve"], dtype=np.float32),
+         period_bass_curve=np.asarray(period_mix["period_bass_curve"], dtype=np.float32),
+     )
+
+
+ def _score_pair(
+     cand_a: _CueCandidate,
+     cand_b: _CueCandidate,
+     target_a: float,
+     target_b: float,
+ ) -> Tuple[float, Dict[str, float]]:
+     energy_match = 1.0 - min(1.0, abs(cand_a.energy - cand_b.energy))
+     phrase_match = 0.5 * (cand_a.phrase + cand_b.phrase)
+     key_match = _clamp(_cosine_similarity(cand_a.chroma, cand_b.chroma), 0.0, 1.0)
+     onset_match = (0.35 * cand_a.onset) + (0.65 * cand_b.onset)
+     position_match = 0.5 * (
+         _target_position_score(cand_a.time_sec, target_a, spread=3.0)
+         + _target_position_score(cand_b.time_sec, target_b, spread=3.0)
+     )
+     vocal_phrase_match = 0.5 * (cand_a.vocal_phrase_score + cand_b.vocal_phrase_score)
+     drum_anchor_match = 0.5 * (cand_a.drum_anchor + cand_b.drum_anchor)
+     bass_stability_match = 0.5 * (cand_a.bass_stability + cand_b.bass_stability)
+     density_match = 0.5 * (cand_a.density_score + cand_b.density_score)
+     period_vocal_phrase_match = 0.5 * (cand_a.period_vocal_phrase_score + cand_b.period_vocal_phrase_score)
+     period_drum_anchor_match = 0.5 * (cand_a.period_drum_anchor + cand_b.period_drum_anchor)
+     period_bass_stability_match = 0.5 * (cand_a.period_bass_stability + cand_b.period_bass_stability)
+     period_density_match = 0.5 * (cand_a.period_density_score + cand_b.period_density_score)
+     vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a, cand_b)
+     clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)
+
+     total = (
+         (0.07 * energy_match)
+         + (0.06 * phrase_match)
+         + (0.08 * key_match)
+         + (0.04 * onset_match)
+         + (0.03 * position_match)
+         + (0.05 * vocal_phrase_match)
+         + (0.04 * drum_anchor_match)
+         + (0.03 * bass_stability_match)
+         + (0.02 * density_match)
+         + (0.14 * period_vocal_phrase_match)
+         + (0.10 * period_drum_anchor_match)
+         + (0.09 * period_bass_stability_match)
+         + (0.07 * period_density_match)
+         + (0.16 * clash_avoidance)
+         + (0.02 * period_coverage)
+     )
+     components = {
+         "energy_match": float(energy_match),
+         "phrase_match": float(phrase_match),
+         "key_match": float(key_match),
+         "onset_match": float(onset_match),
+         "position_match": float(position_match),
+         "vocal_phrase_match": float(vocal_phrase_match),
+         "drum_anchor_match": float(drum_anchor_match),
+         "bass_stability_match": float(bass_stability_match),
+         "density_match": float(density_match),
+         "period_vocal_phrase_match": float(period_vocal_phrase_match),
+         "period_drum_anchor_match": float(period_drum_anchor_match),
+         "period_bass_stability_match": float(period_bass_stability_match),
+         "period_density_match": float(period_density_match),
+         "period_coverage": float(period_coverage),
+         "vocal_clash_risk": float(vocal_clash_risk),
+         "bass_clash_risk": float(bass_clash_risk),
+         "clash_avoidance": float(clash_avoidance),
+         "total": float(total),
+     }
+     return float(total), components
+
+
+ def _segments_from_boundaries(boundaries: np.ndarray, duration_sec: float) -> List[Dict[str, object]]:
+     clean = [0.0]
+     for t in np.asarray(boundaries, dtype=np.float32):
+         x = float(t)
+         if 0.0 < x < float(duration_sec):
+             clean.append(x)
+     clean.append(float(duration_sec))
+     clean = sorted(set(round(x, 3) for x in clean))
+     segs: List[Dict[str, object]] = []
+     for idx in range(len(clean) - 1):
+         start = float(clean[idx])
+         end = float(clean[idx + 1])
+         if end - start < 4.0:
+             continue
+         segs.append({"start": start, "end": end, "label": f"section_{idx + 1}"})
+     return segs
+
+
+ def _try_get_librosa_structure(path: str, duration_sec: float) -> Optional[Dict[str, np.ndarray]]:
+     if path in _LIBROSA_STRUCT_CACHE:
+         return _LIBROSA_STRUCT_CACHE[path]
+
+     decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
+     try:
+         y, _ = decode_segment(
+             path,
+             start_sec=0.0,
+             duration_sec=decode_sec,
+             sr=_STRUCT_SR,
+             max_decode_sec=max(600.0, decode_sec + 3.0),
+         )
+     except Exception as exc:
+         LOGGER.warning("librosa full-track decode failed for %s (%s).", path, exc)
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+     if y.size < _STRUCT_SR:
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+     try:
+         _, beat_frames = librosa.beat.beat_track(y=y, sr=_STRUCT_SR, trim=False)
+         beat_times = librosa.frames_to_time(np.asarray(beat_frames), sr=_STRUCT_SR).astype(np.float32)
+         downbeats = beat_times[::4] if beat_times.size > 0 else np.array([], dtype=np.float32)
+
+         onset_env = librosa.onset.onset_strength(y=y, sr=_STRUCT_SR, hop_length=_ANALYSIS_HOP).astype(np.float32)
+         boundary_frames = librosa.util.peak_pick(
+             onset_env,
+             pre_max=8,
+             post_max=8,
+             pre_avg=24,
+             post_avg=24,
+             delta=0.06,
+             wait=18,
+         )
+         boundaries = librosa.frames_to_time(
+             np.asarray(boundary_frames),
+             sr=_STRUCT_SR,
+             hop_length=_ANALYSIS_HOP,
+         ).astype(np.float32)
+         payload: Dict[str, np.ndarray] = {"downbeats": downbeats, "boundaries": boundaries}
+         _LIBROSA_STRUCT_CACHE[path] = payload
+         return payload
+     except Exception as exc:
+         LOGGER.warning("librosa structure extraction failed for %s (%s).", path, exc)
+         _LIBROSA_STRUCT_CACHE[path] = None
+         return None
+
+
+ def _get_or_build_profiles_for_track(path: str, duration_sec: float, sr: int) -> Optional[_TrackProfiles]:
+     key = (path, int(sr))
+     if key in _PROFILE_CACHE:
+         return _PROFILE_CACHE[key]
+
+     decode_sec = _clamp(float(duration_sec), 15.0, 600.0)
+     try:
+         y, _ = decode_segment(
+             path,
+             start_sec=0.0,
+             duration_sec=decode_sec,
+             sr=int(sr),
+             max_decode_sec=max(600.0, decode_sec + 3.0),
+         )
+     except Exception as exc:
+         LOGGER.warning("Full-track decode failed for %s (%s).", path, exc)
+         _PROFILE_CACHE[key] = None
+         return None
+
+     if y.size < int(sr):
+         _PROFILE_CACHE[key] = None
+         return None
+
+     profiles = _compute_profiles(y, int(sr))
+     _PROFILE_CACHE[key] = profiles
+     return profiles
+
+
+ def _label_for_time(segments: List[Dict[str, object]], t: float) -> str:
+     for seg in segments:
+         start = float(seg["start"])
+         end = float(seg["end"])
+         if start <= float(t) < end:
+             return str(seg.get("label", "unknown"))
+     return "unknown"
+
+
+ def _dedupe_times(times: List[float], min_gap_sec: float) -> List[float]:
+     if not times:
+         return []
+     sorted_times = sorted(float(t) for t in times)
+     out: List[float] = [sorted_times[0]]
+     for t in sorted_times[1:]:
+         if (t - out[-1]) >= float(min_gap_sec):
+             out.append(t)
+     return out
+
+
+ def _build_structured_candidates(
+     downbeats: np.ndarray,
+     segments: List[Dict[str, object]],
+     profiles: _TrackProfiles,
+     vocal_profile: Optional[_VocalActivityProfile],
+     seam_sec: float,
+     duration_sec: float,
+     min_sec: float,
+     max_sec: float,
+     incoming: bool,
+     target_ratio: float,
+     limit: int = 20,
+ ) -> List[_StructuredCandidate]:
+     if max_sec <= min_sec:
+         return []
+
+     raw_times: List[float] = []
+     if downbeats.size > 0:
+         raw_times.extend([float(t) for t in downbeats if min_sec <= float(t) <= max_sec])
+
+     for seg in segments:
+         start = float(seg["start"])
+         end = float(seg["end"])
+         if min_sec <= start <= max_sec:
+             raw_times.append(start)
+         if min_sec <= end <= max_sec:
+             raw_times.append(end)
+
+     if not raw_times and downbeats.size > 0:
+         raw_times.extend([float(t) for t in downbeats])
+
+     if not raw_times:
+         return []
+
+     snapped: List[float] = []
+     for t in raw_times:
+         if downbeats.size > 0:
+             idx = int(np.argmin(np.abs(downbeats - float(t))))
+             snapped_t = float(downbeats[idx])
+         else:
+             snapped_t = float(t)
+         if min_sec <= snapped_t <= max_sec:
+             snapped.append(snapped_t)
+
+     snapped = _dedupe_times(snapped, min_gap_sec=1.2)
+     if not snapped:
+         return []
+
+     if incoming:
+         snapped = snapped[:limit]
+     else:
+         snapped = snapped[-limit:]
+
+     target_sec = float(target_ratio * duration_sec)
+     spread = max(4.0, 0.15 * duration_sec)
+
+     built: List[_StructuredCandidate] = []
+     for i, t in enumerate(snapped):
+         label = _label_for_time(segments, t)
+         cue = _make_candidate(
+             time_sec=t,
+             beat_idx=(i * 4),
+             profiles=profiles,
+             incoming=incoming,
+             seam_sec=seam_sec,
+             vocal_profile=vocal_profile,
+             vocal_time_sec=t,
+         )
+         built.append(
+             _StructuredCandidate(
+                 cue=cue,
+                 label=label,
+                 label_score=_label_weight(label, outgoing=(not incoming)),
+                 edge_score=_edge_score(t, duration_sec),
+                 position_score=_target_position_score(t, target=target_sec, spread=spread),
+             )
+         )
+     return built
+
+
+ def _score_structured_pair(
+     cand_a: _StructuredCandidate,
+     cand_b: _StructuredCandidate,
+ ) -> Tuple[float, Dict[str, float]]:
+     energy_match = 1.0 - min(1.0, abs(cand_a.cue.energy - cand_b.cue.energy))
+     phrase_match = 0.5 * (cand_a.cue.phrase + cand_b.cue.phrase)
+     key_match = _clamp(_cosine_similarity(cand_a.cue.chroma, cand_b.cue.chroma), 0.0, 1.0)
+     onset_match = (0.40 * cand_a.cue.onset) + (0.60 * cand_b.cue.onset)
+     label_match = 0.5 * (cand_a.label_score + cand_b.label_score)
+     position_match = 0.5 * (cand_a.position_score + cand_b.position_score)
+     edge_match = 0.5 * (cand_a.edge_score + cand_b.edge_score)
+     vocal_phrase_match = 0.5 * (cand_a.cue.vocal_phrase_score + cand_b.cue.vocal_phrase_score)
+     drum_anchor_match = 0.5 * (cand_a.cue.drum_anchor + cand_b.cue.drum_anchor)
+     bass_stability_match = 0.5 * (cand_a.cue.bass_stability + cand_b.cue.bass_stability)
+     density_match = 0.5 * (cand_a.cue.density_score + cand_b.cue.density_score)
+     period_vocal_phrase_match = 0.5 * (cand_a.cue.period_vocal_phrase_score + cand_b.cue.period_vocal_phrase_score)
+     period_drum_anchor_match = 0.5 * (cand_a.cue.period_drum_anchor + cand_b.cue.period_drum_anchor)
+     period_bass_stability_match = 0.5 * (cand_a.cue.period_bass_stability + cand_b.cue.period_bass_stability)
+     period_density_match = 0.5 * (cand_a.cue.period_density_score + cand_b.cue.period_density_score)
+     vocal_clash_risk, bass_clash_risk, period_coverage = _period_overlap_clash(cand_a.cue, cand_b.cue)
+     clash_avoidance = 1.0 - _clamp((0.67 * vocal_clash_risk) + (0.33 * bass_clash_risk), 0.0, 1.0)
+
+     total = (
+         (0.08 * energy_match)
+         + (0.09 * key_match)
+         + (0.06 * onset_match)
+         + (0.05 * phrase_match)
+         + (0.10 * label_match)
+         + (0.05 * position_match)
+         + (0.04 * edge_match)
+         + (0.06 * vocal_phrase_match)
+         + (0.04 * drum_anchor_match)
+         + (0.03 * bass_stability_match)
+         + (0.02 * density_match)
+         + (0.12 * period_vocal_phrase_match)
+         + (0.08 * period_drum_anchor_match)
+         + (0.07 * period_bass_stability_match)
+         + (0.05 * period_density_match)
+         + (0.06 * clash_avoidance)
+         + (0.01 * period_coverage)
+     )
+     components = {
+         "energy_match": float(energy_match),
+         "key_match": float(key_match),
+         "onset_match": float(onset_match),
+         "phrase_match": float(phrase_match),
+         "label_match": float(label_match),
+         "position_match": float(position_match),
+         "edge_match": float(edge_match),
+         "vocal_phrase_match": float(vocal_phrase_match),
+         "drum_anchor_match": float(drum_anchor_match),
+         "bass_stability_match": float(bass_stability_match),
+         "density_match": float(density_match),
+         "period_vocal_phrase_match": float(period_vocal_phrase_match),
+         "period_drum_anchor_match": float(period_drum_anchor_match),
+         "period_bass_stability_match": float(period_bass_stability_match),
+         "period_density_match": float(period_density_match),
+         "period_coverage": float(period_coverage),
+         "vocal_clash_risk": float(vocal_clash_risk),
+         "bass_clash_risk": float(bass_clash_risk),
+         "clash_avoidance": float(clash_avoidance),
+         "total": float(total),
+     }
+     return float(total), components
+
+
+ def _try_structure_aware_selection(
+     song_a_path: Optional[str],
+     song_b_path: Optional[str],
+     song_a_duration_sec: Optional[float],
+     song_b_duration_sec: Optional[float],
+     pre_sec: float,
+     seam_sec: float,
+     post_sec: float,
+     vocal_profile_a: Optional[_VocalActivityProfile],
+     vocal_profile_b: Optional[_VocalActivityProfile],
+ ) -> Optional[CueSelectionResult]:
+     if not song_a_path or not song_b_path:
+         return None
+     if song_a_duration_sec is None or song_b_duration_sec is None:
+         return None
+
+     dur_a = float(song_a_duration_sec)
+     dur_b = float(song_b_duration_sec)
+
+     min_a = max(seam_sec + 2.0, pre_sec + 2.0, 0.30 * dur_a)
+     max_a = min(dur_a - seam_sec - 2.0, 0.88 * dur_a)
+     min_b = max(4.0, 0.10 * dur_b)
+     max_b = min(dur_b - (seam_sec + post_sec + 2.0), 0.72 * dur_b)
+
+     if max_a <= min_a or max_b <= min_b:
+         return None
+
+     source = "librosa"
+     lib_a = _try_get_librosa_structure(song_a_path, dur_a)
+     lib_b = _try_get_librosa_structure(song_b_path, dur_b)
+     if lib_a is None or lib_b is None:
+         return None
+
+     downbeats_a = np.asarray(lib_a.get("downbeats", []), dtype=np.float32)
+     downbeats_b = np.asarray(lib_b.get("downbeats", []), dtype=np.float32)
+     segments_a: List[Dict[str, object]] = _segments_from_boundaries(
+         np.asarray(lib_a.get("boundaries", []), dtype=np.float32),
+         duration_sec=dur_a,
+     )
+     segments_b: List[Dict[str, object]] = _segments_from_boundaries(
+         np.asarray(lib_b.get("boundaries", []), dtype=np.float32),
+         duration_sec=dur_b,
+     )
+
+     if downbeats_a.size < 4 or downbeats_b.size < 4:
+         return None
+
+     profiles_a = _get_or_build_profiles_for_track(song_a_path, dur_a, sr=_STRUCT_SR)
+     profiles_b = _get_or_build_profiles_for_track(song_b_path, dur_b, sr=_STRUCT_SR)
+     if profiles_a is None or profiles_b is None:
+         return None
+
+     cands_a = _build_structured_candidates(
+         downbeats=downbeats_a,
+         segments=segments_a,
+         profiles=profiles_a,
+         vocal_profile=vocal_profile_a,
+         seam_sec=seam_sec,
+         duration_sec=dur_a,
+         min_sec=min_a,
+         max_sec=max_a,
+         incoming=False,
+         target_ratio=0.63,
+         limit=22,
+     )
+     cands_b = _build_structured_candidates(
+         downbeats=downbeats_b,
+         segments=segments_b,
+         profiles=profiles_b,
+         vocal_profile=vocal_profile_b,
+         seam_sec=seam_sec,
+         duration_sec=dur_b,
+         min_sec=min_b,
+         max_sec=max_b,
+         incoming=True,
+         target_ratio=0.27,
+         limit=22,
+     )
+     if not cands_a or not cands_b:
+         return None
+
+     best_score = -1.0
+     best_a: Optional[_StructuredCandidate] = None
+     best_b: Optional[_StructuredCandidate] = None
+     ranked: List[Dict[str, object]] = []
+     for ca in cands_a:
+         for cb in cands_b:
+             score, comps = _score_structured_pair(ca, cb)
+             ranked.append(
+                 {
+                     "score": float(score),
+                     "song_a_sec": float(ca.cue.time_sec),
+                     "song_b_sec": float(cb.cue.time_sec),
+                     "song_a_label": ca.label,
+                     "song_b_label": cb.label,
+                     "song_a_vocal_ratio": float(ca.cue.vocal_ratio),
+                     "song_b_vocal_ratio": float(cb.cue.vocal_ratio),
+                     "song_a_period_vocal_phrase": float(ca.cue.period_vocal_phrase_score),
+                     "song_b_period_vocal_phrase": float(cb.cue.period_vocal_phrase_score),
+                     "song_a_period_drum_anchor": float(ca.cue.period_drum_anchor),
+                     "song_b_period_drum_anchor": float(cb.cue.period_drum_anchor),
+                     "song_a_period_bass_stability": float(ca.cue.period_bass_stability),
+                     "song_b_period_bass_stability": float(cb.cue.period_bass_stability),
+                     "song_a_period_density": float(ca.cue.period_density_score),
+                     "song_b_period_density": float(cb.cue.period_density_score),
+                     "song_a_period_coverage": float(ca.cue.period_coverage),
+                     "song_b_period_coverage": float(cb.cue.period_coverage),
+                     "components": comps,
+                 }
+             )
+             if score > best_score:
+                 best_score = float(score)
+                 best_a = ca
+                 best_b = cb
+
+     if best_a is None or best_b is None:
+         return None
+
+     ranked = sorted(ranked, key=lambda x: float(x["score"]), reverse=True)
+     top_pairs = []
+     for item in ranked[:3]:
+         top_pairs.append(
+             {
+                 "score": round(float(item["score"]), 4),
+                 "song_a_sec": round(float(item["song_a_sec"]), 3),
+                 "song_b_sec": round(float(item["song_b_sec"]), 3),
+                 "song_a_label": str(item["song_a_label"]),
+                 "song_b_label": str(item["song_b_label"]),
+                 "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
+                 "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
+                 "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
+                 "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
+                 "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
+                 "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
+                 "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
+                 "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
+                 "song_a_period_density": round(float(item["song_a_period_density"]), 4),
+                 "song_b_period_density": round(float(item["song_b_period_density"]), 4),
+                 "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
+                 "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
+                 "components": {k: round(float(v), 4) for k, v in item["components"].items()},
+             }
+         )
+
+     principles = [
+         "phrase/downbeat alignment",
+         "section boundary awareness",
+         "energy continuity",
+         "harmonic/chroma compatibility",
+     ]
+     if vocal_profile_a is not None or vocal_profile_b is not None:
+         principles.extend(
+             [
+                 "vocal phrase-safe cueing (low or ending vocals)",
+                 "drum-anchor confidence",
+                 "bassline stability control",
+                 "instrumental density targeting",
+                 "clash-risk precheck (vocal+bass overlap)",
+             ]
+         )
+
+     return CueSelectionResult(
+         cue_a_sec=float(best_a.cue.time_sec),
+         cue_b_sec=float(best_b.cue.time_sec),
+         method=f"{source}-structure-aware",
+         debug={
+             "source": source,
+             "candidate_ranges_sec": {
+                 "song_a": [round(min_a, 3), round(max_a, 3)],
+                 "song_b": [round(min_b, 3), round(max_b, 3)],
+             },
+             "transition_period_sec": round(float(seam_sec), 3),
+             "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
+             "selected_sec": {"song_a": round(float(best_a.cue.time_sec), 3), "song_b": round(float(best_b.cue.time_sec), 3)},
+             "selected_labels": {"song_a": best_a.label, "song_b": best_b.label},
+             "selected_mixability": {
+                 "song_a_ratio": round(float(best_a.cue.vocal_ratio), 4),
+                 "song_b_ratio": round(float(best_b.cue.vocal_ratio), 4),
+                 "song_a_vocal_onset": round(float(best_a.cue.vocal_onset), 4),
+                 "song_b_vocal_onset": round(float(best_b.cue.vocal_onset), 4),
+                 "song_a_vocal_phrase": round(float(best_a.cue.vocal_phrase_score), 4),
+                 "song_b_vocal_phrase": round(float(best_b.cue.vocal_phrase_score), 4),
+                 "song_a_drum_anchor": round(float(best_a.cue.drum_anchor), 4),
+                 "song_b_drum_anchor": round(float(best_b.cue.drum_anchor), 4),
+                 "song_a_bass_stability": round(float(best_a.cue.bass_stability), 4),
+                 "song_b_bass_stability": round(float(best_b.cue.bass_stability), 4),
+                 "song_a_density_score": round(float(best_a.cue.density_score), 4),
+                 "song_b_density_score": round(float(best_b.cue.density_score), 4),
+                 "song_a_period_vocal_phrase": round(float(best_a.cue.period_vocal_phrase_score), 4),
+                 "song_b_period_vocal_phrase": round(float(best_b.cue.period_vocal_phrase_score), 4),
+                 "song_a_period_drum_anchor": round(float(best_a.cue.period_drum_anchor), 4),
+                 "song_b_period_drum_anchor": round(float(best_b.cue.period_drum_anchor), 4),
+                 "song_a_period_bass_stability": round(float(best_a.cue.period_bass_stability), 4),
+                 "song_b_period_bass_stability": round(float(best_b.cue.period_bass_stability), 4),
+                 "song_a_period_density_score": round(float(best_a.cue.period_density_score), 4),
+                 "song_b_period_density_score": round(float(best_b.cue.period_density_score), 4),
+                 "song_a_period_coverage": round(float(best_a.cue.period_coverage), 4),
+                 "song_b_period_coverage": round(float(best_b.cue.period_coverage), 4),
+             },
+             "top_pairs": top_pairs,
+             "period_scoring": {
+                 "enabled": True,
+                 "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
+                 "overlap_simulation": "weighted vocal/bass clash precheck",
+             },
+             "dj_principles": principles,
+         },
+     )
+
+
+ def select_mix_cuepoints(
+     y_a_analysis: np.ndarray,
+     y_b_analysis: np.ndarray,
+     sr: int,
+     analysis_sec: float,
+     pre_sec: float,
+     seam_sec: float,
+     post_sec: float,
+     a_analysis_start_sec: float,
+     beats_a: np.ndarray,
+     beats_b: np.ndarray,
+     cue_a_override_sec: Optional[float] = None,
+     cue_b_override_sec: Optional[float] = None,
+     song_a_path: Optional[str] = None,
+     song_b_path: Optional[str] = None,
+     song_a_duration_sec: Optional[float] = None,
+     song_b_duration_sec: Optional[float] = None,
+ ) -> CueSelectionResult:
+     target_a_rel = max(float(pre_sec), float(analysis_sec - seam_sec - 2.0))
+     target_b_rel = 2.0
+     default_a_rel = float(choose_nearest_beat(beats_a, target_a_rel))
+     default_b_rel = float(choose_first_beat_after(beats_b, target_b_rel))
+
+     default_a_abs = float(a_analysis_start_sec + default_a_rel)
+     default_b_abs = float(default_b_rel)
+
+     if cue_a_override_sec is not None or cue_b_override_sec is not None:
+         cue_a = float(cue_a_override_sec) if cue_a_override_sec is not None else default_a_abs
+         cue_b = float(cue_b_override_sec) if cue_b_override_sec is not None else default_b_abs
+         return CueSelectionResult(
+             cue_a_sec=cue_a,
+             cue_b_sec=cue_b,
+             method="manual-override",
+             debug={
+                 "manual_override": True,
+                 "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
+             },
+         )
+
+     vocal_profile_a, vocal_debug_a = _extract_vocal_profile_demucs(
+         y=y_a_analysis,
+         sr=int(sr),
+         window_start_sec=float(a_analysis_start_sec),
+         track_label="song_a_analysis_window",
+     )
+     vocal_profile_b, vocal_debug_b = _extract_vocal_profile_demucs(
+         y=y_b_analysis,
+         sr=int(sr),
+         window_start_sec=0.0,
+         track_label="song_b_analysis_window",
+     )
+     vocal_debug = {
+         "enabled": bool(_DEMUCS_ENABLED),
+         "song_a": vocal_debug_a,
+         "song_b": vocal_debug_b,
+     }
+
+     structure_result = _try_structure_aware_selection(
+         song_a_path=song_a_path,
+         song_b_path=song_b_path,
+         song_a_duration_sec=song_a_duration_sec,
+         song_b_duration_sec=song_b_duration_sec,
+         pre_sec=float(pre_sec),
+         seam_sec=float(seam_sec),
+         post_sec=float(post_sec),
+         vocal_profile_a=vocal_profile_a,
+         vocal_profile_b=vocal_profile_b,
+     )
+     if structure_result is not None:
+         structure_result.debug["manual_override"] = False
+         structure_result.debug["default_local_auto_cues_sec"] = {
+             "song_a": round(default_a_abs, 3),
+             "song_b": round(default_b_abs, 3),
+         }
+         structure_result.debug["vocal_analysis"] = vocal_debug
+         structure_result.debug["vocal_penalty_active"] = bool(vocal_profile_a is not None or vocal_profile_b is not None)
+         return structure_result
+
+     if beats_a.size < 4 or beats_b.size < 4:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="beat-fallback",
+             debug={
+                 "reason": "insufficient_beats",
+                 "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
+                 "vocal_analysis": vocal_debug,
+             },
+         )
+
+     profiles_a = _compute_profiles(y_a_analysis, sr)
+     profiles_b = _compute_profiles(y_b_analysis, sr)
+
+     min_a = max(0.5, float(seam_sec + 0.5), float(pre_sec + (0.20 * seam_sec)))
+     max_a = max(min_a + 0.1, float(analysis_sec - max(0.75, 0.25 * seam_sec)))
+     min_b = max(0.75, float(0.12 * seam_sec))
+     max_b = max(min_b + 0.1, float(analysis_sec - max((seam_sec + 0.75), (0.25 * post_sec))))
+
+     raw_a = _build_candidates(beats_a, min_a, max_a, prefer_tail=True, limit=24)
+     raw_b = _build_candidates(beats_b, min_b, max_b, prefer_tail=False, limit=24)
+     if not raw_a or not raw_b:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="candidate-fallback",
+             debug={
+                 "reason": "empty_candidate_set",
+                 "candidate_counts": {"song_a": len(raw_a), "song_b": len(raw_b)},
+                 "candidate_windows_sec": {
+                     "song_a": [round(min_a, 3), round(max_a, 3)],
+                     "song_b": [round(min_b, 3), round(max_b, 3)],
+                 },
+                 "vocal_analysis": vocal_debug,
+             },
+         )
+
+     cands_a = [
+         _make_candidate(
+             t,
+             idx,
+             profiles_a,
+             incoming=False,
+             seam_sec=float(seam_sec),
+             vocal_profile=vocal_profile_a,
+             vocal_time_sec=float(a_analysis_start_sec + t),
+         )
+         for (t, idx) in raw_a
+     ]
+     cands_b = [
+         _make_candidate(
+             t,
+             idx,
+             profiles_b,
+             incoming=True,
+             seam_sec=float(seam_sec),
+             vocal_profile=vocal_profile_b,
+             vocal_time_sec=float(t),
+         )
+         for (t, idx) in raw_b
+     ]
+
+     scored_pairs: List[Dict[str, object]] = []
+     best: Optional[Dict[str, object]] = None
+     target_b = max(2.0, min(8.0, float(analysis_sec * 0.25)))
+
+     for cand_a in cands_a:
+         for cand_b in cands_b:
+             total, comps = _score_pair(cand_a, cand_b, target_a=target_a_rel, target_b=target_b)
+             item = {
+                 "score": float(total),
+                 "song_a_rel_sec": float(cand_a.time_sec),
+                 "song_b_rel_sec": float(cand_b.time_sec),
+                 "song_a_vocal_ratio": float(cand_a.vocal_ratio),
+                 "song_b_vocal_ratio": float(cand_b.vocal_ratio),
+                 "song_a_vocal_onset": float(cand_a.vocal_onset),
+                 "song_b_vocal_onset": float(cand_b.vocal_onset),
+                 "song_a_vocal_phrase": float(cand_a.vocal_phrase_score),
+                 "song_b_vocal_phrase": float(cand_b.vocal_phrase_score),
+                 "song_a_drum_anchor": float(cand_a.drum_anchor),
+                 "song_b_drum_anchor": float(cand_b.drum_anchor),
+                 "song_a_bass_energy": float(cand_a.bass_energy),
+                 "song_b_bass_energy": float(cand_b.bass_energy),
+                 "song_a_bass_stability": float(cand_a.bass_stability),
+                 "song_b_bass_stability": float(cand_b.bass_stability),
+                 "song_a_density": float(cand_a.instrumental_density),
+                 "song_b_density": float(cand_b.instrumental_density),
+                 "song_a_density_score": float(cand_a.density_score),
+                 "song_b_density_score": float(cand_b.density_score),
+                 "song_a_period_vocal_phrase": float(cand_a.period_vocal_phrase_score),
+                 "song_b_period_vocal_phrase": float(cand_b.period_vocal_phrase_score),
+                 "song_a_period_drum_anchor": float(cand_a.period_drum_anchor),
+                 "song_b_period_drum_anchor": float(cand_b.period_drum_anchor),
+                 "song_a_period_bass_energy": float(cand_a.period_bass_energy),
+                 "song_b_period_bass_energy": float(cand_b.period_bass_energy),
+                 "song_a_period_bass_stability": float(cand_a.period_bass_stability),
+                 "song_b_period_bass_stability": float(cand_b.period_bass_stability),
+                 "song_a_period_density": float(cand_a.period_density_score),
+                 "song_b_period_density": float(cand_b.period_density_score),
+                 "song_a_period_coverage": float(cand_a.period_coverage),
+                 "song_b_period_coverage": float(cand_b.period_coverage),
+                 "components": comps,
+             }
+             scored_pairs.append(item)
+             if best is None or float(total) > float(best["score"]):
+                 best = item
+
+     if best is None:
+         return CueSelectionResult(
+             cue_a_sec=default_a_abs,
+             cue_b_sec=default_b_abs,
+             method="score-fallback",
+             debug={"reason": "no_scored_pairs", "vocal_analysis": vocal_debug},
+         )
+
+     scored_pairs = sorted(scored_pairs, key=lambda x: float(x["score"]), reverse=True)
+     top_pairs = [
+         {
+             "score": round(float(item["score"]), 4),
+             "song_a_rel_sec": round(float(item["song_a_rel_sec"]), 3),
+             "song_b_rel_sec": round(float(item["song_b_rel_sec"]), 3),
+             "song_a_vocal_ratio": round(float(item["song_a_vocal_ratio"]), 4),
+             "song_b_vocal_ratio": round(float(item["song_b_vocal_ratio"]), 4),
+             "song_a_vocal_phrase": round(float(item["song_a_vocal_phrase"]), 4),
+             "song_b_vocal_phrase": round(float(item["song_b_vocal_phrase"]), 4),
+             "song_a_drum_anchor": round(float(item["song_a_drum_anchor"]), 4),
+             "song_b_drum_anchor": round(float(item["song_b_drum_anchor"]), 4),
+             "song_a_bass_stability": round(float(item["song_a_bass_stability"]), 4),
+             "song_b_bass_stability": round(float(item["song_b_bass_stability"]), 4),
+             "song_a_density_score": round(float(item["song_a_density_score"]), 4),
+             "song_b_density_score": round(float(item["song_b_density_score"]), 4),
+             "song_a_period_vocal_phrase": round(float(item["song_a_period_vocal_phrase"]), 4),
+             "song_b_period_vocal_phrase": round(float(item["song_b_period_vocal_phrase"]), 4),
+             "song_a_period_drum_anchor": round(float(item["song_a_period_drum_anchor"]), 4),
+             "song_b_period_drum_anchor": round(float(item["song_b_period_drum_anchor"]), 4),
+             "song_a_period_bass_stability": round(float(item["song_a_period_bass_stability"]), 4),
+             "song_b_period_bass_stability": round(float(item["song_b_period_bass_stability"]), 4),
+             "song_a_period_density": round(float(item["song_a_period_density"]), 4),
+             "song_b_period_density": round(float(item["song_b_period_density"]), 4),
+             "song_a_period_coverage": round(float(item["song_a_period_coverage"]), 4),
+             "song_b_period_coverage": round(float(item["song_b_period_coverage"]), 4),
+             "components": {k: round(float(v), 4) for k, v in item["components"].items()},
+         }
+         for item in scored_pairs[:3]
+     ]
+
+     cue_a_abs = float(a_analysis_start_sec + float(best["song_a_rel_sec"]))
+     cue_b_abs = float(best["song_b_rel_sec"])
+     return CueSelectionResult(
+         cue_a_sec=cue_a_abs,
+         cue_b_sec=cue_b_abs,
+         method="scored-auto",
+         debug={
+             "manual_override": False,
+             "beat_counts": {"song_a": int(beats_a.size), "song_b": int(beats_b.size)},
+             "candidate_counts": {"song_a": len(cands_a), "song_b": len(cands_b)},
+             "candidate_windows_sec": {
+                 "song_a": [round(min_a, 3), round(max_a, 3)],
+                 "song_b": [round(min_b, 3), round(max_b, 3)],
+             },
+             "transition_period_sec": round(float(seam_sec), 3),
+             "selected_rel_sec": {
+                 "song_a": round(float(best["song_a_rel_sec"]), 3),
+                 "song_b": round(float(best["song_b_rel_sec"]), 3),
+             },
+             "selected_mixability": {
+                 "song_a_ratio": round(float(best["song_a_vocal_ratio"]), 4),
+                 "song_b_ratio": round(float(best["song_b_vocal_ratio"]), 4),
+                 "song_a_vocal_onset": round(float(best["song_a_vocal_onset"]), 4),
+                 "song_b_vocal_onset": round(float(best["song_b_vocal_onset"]), 4),
1621
+ "song_a_vocal_phrase": round(float(best["song_a_vocal_phrase"]), 4),
1622
+ "song_b_vocal_phrase": round(float(best["song_b_vocal_phrase"]), 4),
1623
+ "song_a_drum_anchor": round(float(best["song_a_drum_anchor"]), 4),
1624
+ "song_b_drum_anchor": round(float(best["song_b_drum_anchor"]), 4),
1625
+ "song_a_bass_energy": round(float(best["song_a_bass_energy"]), 4),
1626
+ "song_b_bass_energy": round(float(best["song_b_bass_energy"]), 4),
1627
+ "song_a_bass_stability": round(float(best["song_a_bass_stability"]), 4),
1628
+ "song_b_bass_stability": round(float(best["song_b_bass_stability"]), 4),
1629
+ "song_a_density": round(float(best["song_a_density"]), 4),
1630
+ "song_b_density": round(float(best["song_b_density"]), 4),
1631
+ "song_a_density_score": round(float(best["song_a_density_score"]), 4),
1632
+ "song_b_density_score": round(float(best["song_b_density_score"]), 4),
1633
+ "song_a_period_vocal_phrase": round(float(best["song_a_period_vocal_phrase"]), 4),
1634
+ "song_b_period_vocal_phrase": round(float(best["song_b_period_vocal_phrase"]), 4),
1635
+ "song_a_period_drum_anchor": round(float(best["song_a_period_drum_anchor"]), 4),
1636
+ "song_b_period_drum_anchor": round(float(best["song_b_period_drum_anchor"]), 4),
1637
+ "song_a_period_bass_energy": round(float(best["song_a_period_bass_energy"]), 4),
1638
+ "song_b_period_bass_energy": round(float(best["song_b_period_bass_energy"]), 4),
1639
+ "song_a_period_bass_stability": round(float(best["song_a_period_bass_stability"]), 4),
1640
+ "song_b_period_bass_stability": round(float(best["song_b_period_bass_stability"]), 4),
1641
+ "song_a_period_density": round(float(best["song_a_period_density"]), 4),
1642
+ "song_b_period_density": round(float(best["song_b_period_density"]), 4),
1643
+ "song_a_period_coverage": round(float(best["song_a_period_coverage"]), 4),
1644
+ "song_b_period_coverage": round(float(best["song_b_period_coverage"]), 4),
1645
+ },
1646
+ "default_auto_cues_sec": {"song_a": round(default_a_abs, 3), "song_b": round(default_b_abs, 3)},
1647
+ "vocal_analysis": vocal_debug,
1648
+ "vocal_penalty_active": bool(vocal_profile_a is not None or vocal_profile_b is not None),
1649
+ "top_pairs": top_pairs,
1650
+ "period_scoring": {
1651
+ "enabled": True,
1652
+ "window_def": {"song_a": "[cue-seam, cue]", "song_b": "[cue, cue+seam]"},
1653
+ "overlap_simulation": "weighted vocal/bass clash precheck",
1654
+ },
1655
+ },
1656
+ )
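The ranking step above reduces to a sort-and-round over the scored candidate pairs: sort descending by total score, keep the top three, and round every value for the debug payload. A dependency-free sketch of just that step (function name and the `k`/`ndigits` parameters are illustrative, not part of the module):

```python
def top_pairs(scored, k=3, ndigits=4):
    # Sort candidate cue pairs by total score (descending) and keep a
    # compact, rounded view of the best k for the debug payload.
    ranked = sorted(scored, key=lambda item: float(item["score"]), reverse=True)
    return [
        {key: round(float(v), ndigits) for key, v in item.items()}
        for item in ranked[:k]
    ]

result = top_pairs([{"score": 0.71234567}, {"score": 0.95}, {"score": 0.88}])
print(result)  # -> [{'score': 0.95}, {'score': 0.88}, {'score': 0.7123}]
```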
pipeline/transition_generator.py ADDED
@@ -0,0 +1,1694 @@
+import argparse
+import hashlib
+import json
+import logging
+import os
+from dataclasses import asdict, dataclass
+from pathlib import Path
+from typing import Any, Dict, List, Optional, Tuple
+
+import librosa  # type: ignore[reportMissingImports]
+import numpy as np
+
+from .audio_utils import (
+    apply_edge_fades,
+    clamp,
+    crossfade_equal_length,
+    decode_segment,
+    ensure_length,
+    estimate_bpm_and_beats,
+    ffprobe_duration_sec,
+    normalize_peak,
+    resample_if_needed,
+    safe_time_stretch,
+    write_wav,
+)
+from .cuepoint_selector import select_mix_cuepoints
+
+LOGGER = logging.getLogger(__name__)
+
+DEFAULT_TARGET_SR = 32000
+ACESTEP_INPUT_SR = 48000
+STITCH_PREVIEW_SIDE_SEC = 10.0
+
+PLUGIN_PRESETS: Dict[str, str] = {
+    "Smooth Blend": "smooth seamless DJ transition, balanced energy, clean, no vocals",
+    "EDM Build-up": "energetic EDM build-up transition with rising tension, clean, no vocals",
+    "Percussive Bridge": "percussive bridge transition with rhythmic drums and clear groove, no vocals",
+    "Ambient Wash": "ambient wash transition, spacious and atmospheric, soft energy curve, no vocals",
+}
+
+_ACESTEP_RUNTIME: Optional[Dict[str, Any]] = None
+_DEMUCS_RUNTIME: Optional[Dict[str, Any]] = None
+_DEMUCS_TRANSITION_ENABLED = os.getenv("AI_DJ_ENABLE_DEMUCS_TRANSITION", "1").strip().lower() not in {
+    "0",
+    "false",
+    "no",
+    "off",
+}
+_DEMUCS_MODEL_NAME = os.getenv("AI_DJ_DEMUCS_MODEL", "htdemucs").strip() or "htdemucs"
+_DEMUCS_DEVICE_PREF = os.getenv("AI_DJ_DEMUCS_DEVICE", "cuda").strip().lower()
+_DEMUCS_SEGMENT_SEC = 7.0
+_REF_AUDIO_MODE = (os.getenv("AI_DJ_REFERENCE_AUDIO_MODE", "accompaniment-only") or "accompaniment-only").strip().lower()
+
+
+@dataclass
+class _DemucsStemBundle:
+    vocals: np.ndarray
+    drums: np.ndarray
+    bass: np.ndarray
+    other: np.ndarray
+    accompaniment: np.ndarray
+    sr: int
+    method: str
+
+
+@dataclass
+class TransitionRequest:
+    song_a_path: str
+    song_b_path: str
+    plugin_id: str = "Smooth Blend"
+    instruction_text: str = ""
+    pre_context_sec: float = 6.0
+    repaint_width_sec: float = 4.0
+    post_context_sec: float = 6.0
+    analysis_sec: float = 45.0
+    bpm_target: Optional[float] = None
+    cue_a_sec: Optional[float] = None
+    cue_b_sec: Optional[float] = None
+    transition_base_mode: str = "B-base-fixed"
+    transition_bars: int = 8
+    creativity_strength: float = 7.0
+    inference_steps: int = 8
+    seed: int = 42
+    output_dir: str = "outputs"
+    output_stem: Optional[str] = None
+    target_sr: int = DEFAULT_TARGET_SR
+    keep_debug_files: bool = False
+
+    # ACE-Step runtime config
+    acestep_model_config: str = os.getenv("AI_DJ_ACESTEP_MODEL_CONFIG", "acestep-v15-turbo").strip()
+    acestep_device: str = os.getenv("AI_DJ_ACESTEP_DEVICE", "auto").strip()
+    acestep_project_root: str = os.getenv("AI_DJ_ACESTEP_PROJECT_ROOT", "").strip()
+    acestep_prefer_source: Optional[str] = os.getenv("AI_DJ_ACESTEP_PREFER_SOURCE", "").strip() or None
+    acestep_use_flash_attn: bool = False
+    acestep_compile_model: bool = False
+    acestep_offload_to_cpu: bool = False
+    acestep_offload_dit_to_cpu: bool = False
+    acestep_use_mlx_dit: bool = True
+    acestep_lora_path: str = os.getenv("AI_DJ_ACESTEP_LORA_PATH", "").strip()
+    acestep_lora_scale: float = float(os.getenv("AI_DJ_ACESTEP_LORA_SCALE", "1.0").strip() or "1.0")
+
+    def to_log_dict(self) -> Dict[str, Any]:
+        return asdict(self)
+
+
+@dataclass
+class TransitionResult:
+    transition_path: str
+    stitched_path: str
+    rough_stitched_path: str
+    hard_splice_path: str
+    backend_used: str
+    details: Dict[str, Any]
+
+    def to_dict(self) -> Dict[str, Any]:
+        payload = asdict(self)
+        return payload
+
+
+def _slug(text: str) -> str:
+    s = "".join(ch if ch.isalnum() or ch in {"-", "_"} else "_" for ch in text.strip())
+    s = "_".join(part for part in s.split("_") if part)
+    return s[:80] or "item"
+
+
+def _deterministic_stem(request: TransitionRequest) -> str:
+    if request.output_stem:
+        return _slug(request.output_stem)
+
+    payload = {
+        "a": os.path.basename(request.song_a_path),
+        "b": os.path.basename(request.song_b_path),
+        "plugin": request.plugin_id,
+        "instruction_text": request.instruction_text,
+        "pre_context_sec": request.pre_context_sec,
+        "repaint_width_sec": request.repaint_width_sec,
+        "post_context_sec": request.post_context_sec,
+        "analysis_sec": request.analysis_sec,
+        "bpm_target": request.bpm_target,
+        "cue_a_sec": request.cue_a_sec,
+        "cue_b_sec": request.cue_b_sec,
+        "transition_base_mode": request.transition_base_mode,
+        "transition_bars": request.transition_bars,
+        "creativity_strength": request.creativity_strength,
+        "inference_steps": request.inference_steps,
+        "seed": request.seed,
+        "target_sr": request.target_sr,
+        "acestep_model_config": request.acestep_model_config,
+        "demucs_transition_enabled": _DEMUCS_TRANSITION_ENABLED,
+        "demucs_model": _DEMUCS_MODEL_NAME,
+        "reference_audio_mode": _REF_AUDIO_MODE,
+    }
+    raw = json.dumps(payload, sort_keys=True).encode("utf-8")
+    digest = hashlib.sha1(raw).hexdigest()[:10]
+    return f"transition_{_slug(Path(request.song_a_path).stem)}_to_{_slug(Path(request.song_b_path).stem)}_{digest}"
156
+
157
+
158
+ def _resolve_output_paths(request: TransitionRequest) -> Tuple[str, str, str, str, str]:
159
+ os.makedirs(request.output_dir, exist_ok=True)
160
+ stem = _deterministic_stem(request)
161
+ transition_path = os.path.join(request.output_dir, f"{stem}_transition.wav")
162
+ stitched_path = os.path.join(request.output_dir, f"{stem}_stitched.wav")
163
+ rough_stitched_path = os.path.join(request.output_dir, f"{stem}_rough_stitched.wav")
164
+ hard_splice_path = os.path.join(request.output_dir, f"{stem}_hard_splice.wav")
165
+ rough_src_path = os.path.join(request.output_dir, f"{stem}_rough_src.wav")
166
+ return transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path
167
+
168
+
169
+ def _resolve_acestep_project_root(request: TransitionRequest) -> str:
170
+ if request.acestep_project_root:
171
+ os.makedirs(request.acestep_project_root, exist_ok=True)
172
+ return request.acestep_project_root
173
+
174
+ hf_data = "/data"
175
+ if os.path.isdir(hf_data) and os.access(hf_data, os.W_OK):
176
+ root = os.path.join(hf_data, "acestep_runtime")
177
+ os.makedirs(root, exist_ok=True)
178
+ return root
179
+
180
+ root = os.path.join(os.path.dirname(os.path.dirname(__file__)), ".acestep_runtime")
181
+ os.makedirs(root, exist_ok=True)
182
+ return root
183
+
184
+
185
+ def _resolve_lora_path(lora_spec: str, project_root: str) -> str:
186
+ spec = (lora_spec or "").strip()
187
+ if not spec:
188
+ return ""
189
+ if os.path.exists(spec):
190
+ return os.path.abspath(spec)
191
+ # Treat non-local spec as a Hugging Face repo id, e.g. ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA
192
+ if "/" not in spec:
193
+ raise RuntimeError(
194
+ f"LoRA path not found: {spec}. Provide a local path or a Hugging Face repo id like "
195
+ "ACE-Step/ACE-Step-v1.5-chinese-new-year-LoRA."
196
+ )
197
+ try:
198
+ from huggingface_hub import snapshot_download
199
+ except Exception as exc:
200
+ raise RuntimeError(
201
+ "huggingface_hub is required to download LoRA from repo id. Install with: pip install huggingface_hub"
202
+ ) from exc
203
+
204
+ local_dir = os.path.join(project_root, "lora_cache", _slug(spec))
205
+ os.makedirs(local_dir, exist_ok=True)
206
+ return snapshot_download(
207
+ repo_id=spec,
208
+ local_dir=local_dir,
209
+ local_dir_use_symlinks=False,
210
+ )
211
+
212
+
213
+ def _build_caption(plugin_id: str, instruction_text: str) -> str:
214
+ base = PLUGIN_PRESETS.get(plugin_id, PLUGIN_PRESETS["Smooth Blend"])
215
+ extra = (instruction_text or "").strip()
216
+ if not extra:
217
+ return base
218
+ return f"{base}. Additional instruction: {extra}"
219
+
220
+
221
+ def _resolve_half_double_tempo(bpm_ref: float, bpm_candidate: float) -> float:
222
+ candidates = [0.5 * bpm_candidate, bpm_candidate, 2.0 * bpm_candidate]
223
+ valid = [v for v in candidates if 40.0 <= float(v) <= 240.0]
224
+ if not valid:
225
+ return float(bpm_candidate)
226
+ return float(min(valid, key=lambda x: abs(np.log2(max(1e-6, bpm_ref) / max(1e-6, x)))))
227
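Beat trackers often report half-time or double-time of the perceived tempo, so `_resolve_half_double_tempo` compares the 0.5x/1x/2x interpretations against the reference BPM on a log2 scale and keeps the closest one within a plausible 40-240 BPM range. A dependency-free sketch of the same disambiguation (using `math.log2` in place of numpy):

```python
import math

def resolve_half_double_tempo(bpm_ref: float, bpm_candidate: float) -> float:
    # Consider half-time, original, and double-time interpretations and pick
    # the one closest to the reference tempo on a log2 (octave) scale.
    candidates = [0.5 * bpm_candidate, bpm_candidate, 2.0 * bpm_candidate]
    valid = [v for v in candidates if 40.0 <= v <= 240.0]
    if not valid:
        return float(bpm_candidate)
    return min(valid, key=lambda x: abs(math.log2(max(1e-6, bpm_ref) / max(1e-6, x))))

print(resolve_half_double_tempo(128.0, 64.0))   # double-time wins -> 128.0
print(resolve_half_double_tempo(90.0, 176.0))   # half-time wins -> 88.0
```

The log2 distance treats tempo ratios symmetrically (being a factor of 2 too fast is as "far" as a factor of 2 too slow), which is why it is preferred over a plain BPM difference here.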
+
+
+def _normalized_onset_envelope(y: np.ndarray, sr: int, hop_length: int = 512) -> np.ndarray:
+    if y.size <= 0:
+        return np.zeros((1,), dtype=np.float32)
+    onset = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop_length).astype(np.float32)
+    if onset.size == 0:
+        return np.zeros((1,), dtype=np.float32)
+    onset = onset - float(np.mean(onset))
+    maximum = float(np.max(np.abs(onset)))
+    if maximum > 1e-9:
+        onset = onset / maximum
+    return onset.astype(np.float32)
+
+
+def _corr_similarity(a: np.ndarray, b: np.ndarray) -> float:
+    n = min(a.size, b.size)
+    if n <= 3:
+        return 0.0
+    a2 = a[:n].astype(np.float32)
+    b2 = b[:n].astype(np.float32)
+    denom = float(np.linalg.norm(a2) * np.linalg.norm(b2))
+    if denom <= 1e-9:
+        return 0.0
+    raw = float(np.dot(a2, b2) / denom)
+    return clamp((raw + 1.0) * 0.5, 0.0, 1.0)
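`_corr_similarity` is cosine similarity between two (mean-centered) onset envelopes, remapped from [-1, 1] to [0, 1] so it can be used directly as a score component. A pure-Python sketch of the same computation on plain lists:

```python
import math

def corr_similarity(a, b):
    # Cosine similarity of two envelopes, remapped from [-1, 1] to [0, 1].
    n = min(len(a), len(b))
    if n <= 3:
        return 0.0  # too few frames for a meaningful correlation
    a, b = a[:n], b[:n]
    denom = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    if denom <= 1e-9:
        return 0.0  # a silent/flat envelope has no direction to compare
    raw = sum(x * y for x, y in zip(a, b)) / denom
    return min(1.0, max(0.0, (raw + 1.0) * 0.5))

print(corr_similarity([1, 0, 1, 0], [1, 0, 1, 0]))    # identical, close to 1.0
print(corr_similarity([1, 0, 1, 0], [-1, 0, -1, 0]))  # anti-phase, close to 0.0
```

The (raw + 1) / 2 remap means uncorrelated envelopes score around 0.5 rather than 0, so anti-correlated rhythms are penalized below "no relation" rather than being conflated with it.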
+
+
+def _rms(y: np.ndarray) -> float:
+    if y.size == 0:
+        return 0.0
+    return float(np.sqrt(np.mean(np.square(y, dtype=np.float64))))
+
+
+def _resolve_demucs_device(torch_mod: Any) -> str:
+    pref = (_DEMUCS_DEVICE_PREF or "").strip().lower()
+    if pref == "cpu":
+        return "cpu"
+    if pref in {"cuda", "gpu"}:
+        return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
+    return "cuda" if bool(torch_mod.cuda.is_available()) else "cpu"
+
+
+def _load_demucs_runtime() -> Tuple[Optional[Dict[str, Any]], Dict[str, Any]]:
+    global _DEMUCS_RUNTIME
+    if not _DEMUCS_TRANSITION_ENABLED:
+        return None, {"enabled": False, "status": "disabled", "reason": "AI_DJ_ENABLE_DEMUCS_TRANSITION=0"}
+    if _DEMUCS_RUNTIME is not None:
+        return _DEMUCS_RUNTIME, {
+            "enabled": True,
+            "status": "ready",
+            "model": _DEMUCS_RUNTIME.get("model_name"),
+            "device": _DEMUCS_RUNTIME.get("device"),
+        }
+
+    try:
+        import torch  # type: ignore[reportMissingImports]
+        from demucs.pretrained import get_model  # type: ignore[reportMissingImports]
+
+        model = get_model(_DEMUCS_MODEL_NAME)
+        model.eval()
+        device = _resolve_demucs_device(torch)
+        model.to(device)
+        _DEMUCS_RUNTIME = {
+            "model": model,
+            "torch": torch,
+            "device": device,
+            "model_name": _DEMUCS_MODEL_NAME,
+        }
+        return _DEMUCS_RUNTIME, {
+            "enabled": True,
+            "status": "ready",
+            "model": _DEMUCS_MODEL_NAME,
+            "device": device,
+        }
+    except Exception as exc:
+        LOGGER.warning("Demucs transition runtime unavailable (%s). Falling back to non-stem transition path.", exc)
+        return None, {
+            "enabled": True,
+            "status": "unavailable",
+            "model": _DEMUCS_MODEL_NAME,
+            "reason": str(exc),
+        }
+
+
+def _resample_to(y: np.ndarray, orig_sr: int, target_sr: int) -> np.ndarray:
+    if int(orig_sr) == int(target_sr):
+        return y.astype(np.float32)
+    if y.size == 0:
+        return np.zeros((0,), dtype=np.float32)
+    return librosa.resample(y.astype(np.float32), orig_sr=int(orig_sr), target_sr=int(target_sr)).astype(np.float32)
+
+
+def _extract_demucs_stems(y: np.ndarray, sr: int, track_label: str) -> Tuple[Optional[_DemucsStemBundle], Dict[str, Any]]:
+    info: Dict[str, Any] = {
+        "enabled": bool(_DEMUCS_TRANSITION_ENABLED),
+        "track": track_label,
+        "model": _DEMUCS_MODEL_NAME,
+    }
+    if y.size < int(max(1, sr) * 2.0):
+        info["status"] = "skipped-short-audio"
+        return None, info
+
+    runtime, runtime_debug = _load_demucs_runtime()
+    info.update(runtime_debug)
+    if runtime is None:
+        return None, info
+
+    try:
+        from demucs.apply import apply_model  # type: ignore[reportMissingImports]
+
+        torch_mod = runtime["torch"]
+        model = runtime["model"]
+        device = str(runtime.get("device", "cpu"))
+
+        mono = np.asarray(y, dtype=np.float32).reshape(-1)
+        if mono.size == 0:
+            info["status"] = "empty"
+            return None, info
+        peak = float(np.max(np.abs(mono)))
+        if peak > 1e-9:
+            mono = mono / peak
+
+        demucs_sr = int(getattr(model, "samplerate", 44100))
+        work = _resample_to(mono, int(sr), demucs_sr)
+        if work.size < int(max(1, demucs_sr) * 2.0):
+            info["status"] = "skipped-short-audio"
+            return None, info
+
+        stereo = np.stack([work, work], axis=0)
+        mix = torch_mod.from_numpy(stereo).unsqueeze(0).to(device)
+        audio_sec = float(work.size / max(1, demucs_sr))
+        use_split = audio_sec > (_DEMUCS_SEGMENT_SEC + 0.05)
+        segment_sec = float(_DEMUCS_SEGMENT_SEC) if use_split else None
+
+        try:
+            with torch_mod.no_grad():
+                estimates = apply_model(
+                    model,
+                    mix,
+                    shifts=1,
+                    split=use_split,
+                    overlap=0.25,
+                    progress=False,
+                    device=device,
+                    segment=segment_sec,
+                )
+        except Exception as exc:
+            if device == "cuda":
+                model.to("cpu")
+                runtime["device"] = "cpu"
+                device = "cpu"
+                mix = mix.to("cpu")
+                with torch_mod.no_grad():
+                    estimates = apply_model(
+                        model,
+                        mix,
+                        shifts=1,
+                        split=use_split,
+                        overlap=0.25,
+                        progress=False,
+                        device="cpu",
+                        segment=segment_sec,
+                    )
+                info["device_fallback"] = f"cuda->cpu ({exc})"
+            else:
+                raise
+
+        est = estimates.detach().cpu()
+        est = est[0] if est.ndim == 4 else est
+        if est.ndim != 3:
+            raise RuntimeError(f"Unexpected demucs output shape: {tuple(est.shape)}")
+
+        source_names = [str(s) for s in getattr(model, "sources", [])]
+        if not source_names:
+            raise RuntimeError("Demucs returned no source names.")
+        if est.shape[0] != len(source_names):
+            if est.shape[1] == len(source_names):
+                est = est.permute(1, 0, 2)
+            else:
+                raise RuntimeError(f"Demucs source mismatch: shape {tuple(est.shape)}, sources {source_names}")
+
+        def _stem(name: str) -> np.ndarray:
+            if name in source_names:
+                stem = est[source_names.index(name)].mean(dim=0).numpy().astype(np.float32)
+                return _resample_to(stem, demucs_sr, int(sr))
+            return np.zeros((mono.size,), dtype=np.float32)
+
+        vocals = _stem("vocals")
+        drums = _stem("drums")
+        bass = _stem("bass")
+        other = _stem("other")
+        non_vocal_idxs = [i for i, s in enumerate(source_names) if s != "vocals"]
+        if non_vocal_idxs:
+            acc = est[non_vocal_idxs].sum(dim=0).mean(dim=0).numpy().astype(np.float32)
+            accompaniment = _resample_to(acc, demucs_sr, int(sr))
+        else:
+            accompaniment = np.zeros((mono.size,), dtype=np.float32)
+
+        target_n = int(mono.size)
+        vocals = ensure_length(vocals, target_n)
+        drums = ensure_length(drums, target_n)
+        bass = ensure_length(bass, target_n)
+        other = ensure_length(other, target_n)
+        accompaniment = ensure_length(accompaniment, target_n)
+
+        info.update(
+            {
+                "status": "ready",
+                "method": "demucs-transition-stems",
+                "split_mode": "chunked" if use_split else "full-window",
+                "duration_sec": round(float(target_n / max(1, sr)), 3),
+                "has_drums": bool("drums" in source_names),
+                "has_bass": bool("bass" in source_names),
+                "has_other": bool("other" in source_names),
+                "device": runtime.get("device", device),
+            }
+        )
+        return _DemucsStemBundle(
+            vocals=vocals.astype(np.float32),
+            drums=drums.astype(np.float32),
+            bass=bass.astype(np.float32),
+            other=other.astype(np.float32),
+            accompaniment=accompaniment.astype(np.float32),
+            sr=int(sr),
+            method="demucs-transition-stems",
+        ), info
+    except Exception as exc:
+        LOGGER.warning("Demucs stem extraction failed for %s (%s).", track_label, exc)
+        info["status"] = "error"
+        info["reason"] = str(exc)
+        return None, info
+
+
+def _slice_stem_bundle(bundle: Optional[_DemucsStemBundle], start_n: int, length_n: int) -> Optional[_DemucsStemBundle]:
+    if bundle is None:
+        return None
+    s = int(max(0, start_n))
+    n = int(max(0, length_n))
+    e = s + n
+    return _DemucsStemBundle(
+        vocals=ensure_length(bundle.vocals[s:e], n),
+        drums=ensure_length(bundle.drums[s:e], n),
+        bass=ensure_length(bundle.bass[s:e], n),
+        other=ensure_length(bundle.other[s:e], n),
+        accompaniment=ensure_length(bundle.accompaniment[s:e], n),
+        sr=int(bundle.sr),
+        method=bundle.method,
+    )
+
+
+def _seconds_to_beats(seconds: float, bpm: float) -> float:
+    return float(seconds) * (float(bpm) / 60.0)
+
+
+def _beats_to_seconds(beats: float, bpm: float) -> float:
+    return float(beats) * (60.0 / max(1e-6, float(bpm)))
+
+
+def _quantize_seconds_to_beats(
+    raw_sec: float,
+    bpm: float,
+    min_sec: float,
+    max_sec: float,
+    beat_step: int,
+    min_beats: int,
+) -> Tuple[float, int, float]:
+    raw_sec = float(clamp(raw_sec, min_sec, max_sec))
+    if bpm <= 1e-6:
+        return raw_sec, int(round(_seconds_to_beats(raw_sec, 120.0))), _seconds_to_beats(raw_sec, 120.0)
+
+    raw_beats = _seconds_to_beats(raw_sec, bpm)
+    step = max(1, int(beat_step))
+    min_beats_i = max(1, int(min_beats))
+    max_allowed_beats = _seconds_to_beats(max_sec, bpm)
+    max_beats_i = int(max(min_beats_i, np.floor(max_allowed_beats / step) * step))
+    quant_beats = int(round(raw_beats / step) * step)
+    quant_beats = int(clamp(float(quant_beats), float(min_beats_i), float(max_beats_i)))
+    quant_sec = float(clamp(_beats_to_seconds(quant_beats, bpm), min_sec, max_sec))
+    return quant_sec, quant_beats, raw_beats
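The quantizer above snaps a requested duration onto the beat grid: convert seconds to beats at the given BPM, round to the nearest multiple of `beat_step`, clamp into the allowed beat range, then convert back to seconds. A minimal dependency-free sketch of the same snap-to-grid logic (the function name is illustrative and the original's third return value is simplified away):

```python
def quantize_seconds_to_beats(raw_sec, bpm, min_sec, max_sec, beat_step, min_beats):
    # Convert seconds to beats, snap to the nearest multiple of beat_step,
    # clamp into [min_beats, largest step multiple that fits max_sec],
    # then convert the quantized beat count back to seconds.
    clamp = lambda v, lo, hi: max(lo, min(hi, v))
    raw_sec = clamp(raw_sec, min_sec, max_sec)
    raw_beats = raw_sec * bpm / 60.0
    step = max(1, beat_step)
    max_beats = int(max(min_beats, (max_sec * bpm / 60.0) // step * step))
    q = int(round(raw_beats / step) * step)
    q = int(clamp(q, max(1, min_beats), max_beats))
    return clamp(q * 60.0 / bpm, min_sec, max_sec), q

# 6.5 s at 120 BPM is 13 beats; snapping to 4-beat bars gives 12 beats = 6.0 s.
print(quantize_seconds_to_beats(6.5, 120.0, 1.0, 20.0, 4, 2))  # -> (6.0, 12)
```

Locking the pre/seam/post segments to whole bars this way is what keeps the generated transition phrase-aligned with both songs instead of drifting off the grid by a fraction of a beat.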
+
+
+def _phrase_lock_transition_shape(pre_sec: float, seam_sec: float, post_sec: float, bpm: float) -> Dict[str, Any]:
+    pre_locked_sec, pre_beats, pre_raw_beats = _quantize_seconds_to_beats(
+        raw_sec=pre_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=20.0,
+        beat_step=4,
+        min_beats=2,
+    )
+
+    seam_raw_beats = _seconds_to_beats(seam_sec, bpm)
+    seam_step = 8 if seam_raw_beats >= 8.0 else 4
+    seam_locked_sec, seam_beats, _ = _quantize_seconds_to_beats(
+        raw_sec=seam_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=40.0,
+        beat_step=seam_step,
+        min_beats=2,
+    )
+
+    post_locked_sec, post_beats, post_raw_beats = _quantize_seconds_to_beats(
+        raw_sec=post_sec,
+        bpm=bpm,
+        min_sec=1.0,
+        max_sec=20.0,
+        beat_step=4,
+        min_beats=2,
+    )
+
+    return {
+        "pre_sec": pre_locked_sec,
+        "seam_sec": seam_locked_sec,
+        "post_sec": post_locked_sec,
+        "debug": {
+            "bpm_ref": round(float(bpm), 3),
+            "pre": {
+                "raw_sec": round(float(pre_sec), 3),
+                "locked_sec": round(float(pre_locked_sec), 3),
+                "raw_beats": round(float(pre_raw_beats), 3),
+                "locked_beats": int(pre_beats),
+                "beat_step": 4,
+            },
+            "seam": {
+                "raw_sec": round(float(seam_sec), 3),
+                "locked_sec": round(float(seam_locked_sec), 3),
+                "raw_beats": round(float(seam_raw_beats), 3),
+                "locked_beats": int(seam_beats),
+                "beat_step": int(seam_step),
+            },
+            "post": {
+                "raw_sec": round(float(post_sec), 3),
+                "locked_sec": round(float(post_locked_sec), 3),
+                "raw_beats": round(float(post_raw_beats), 3),
+                "locked_beats": int(post_beats),
+                "beat_step": 4,
+            },
+        },
+    }
+
+
+def _stft_band_split(y: np.ndarray, sr: int) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
+    n = int(y.size)
+    if n <= 0:
+        z = np.zeros((0,), dtype=np.float32)
+        return z, z, z
+
+    n_fft = 2048 if n >= 2048 else 1024
+    hop = max(128, n_fft // 4)
+    y2 = ensure_length(y.astype(np.float32), max(n, n_fft))
+
+    D = librosa.stft(y2, n_fft=n_fft, hop_length=hop)
+    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
+    low_mask = (freqs <= 180.0).astype(np.float32)[:, None]
+    mid_mask = ((freqs > 180.0) & (freqs <= 2500.0)).astype(np.float32)[:, None]
+    high_mask = (freqs > 2500.0).astype(np.float32)[:, None]
+
+    low = librosa.istft(D * low_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    mid = librosa.istft(D * mid_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    high = librosa.istft(D * high_mask, hop_length=hop, length=y2.size).astype(np.float32)
+    return low[:n], mid[:n], high[:n]
+
+
+def _dj_style_seam_mix(a_tail: np.ndarray, b_head: np.ndarray, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
+    n = min(int(a_tail.size), int(b_head.size))
+    if n <= 0:
+        return np.zeros((0,), dtype=np.float32), {"method": "empty-input-fallback"}
+
+    a = a_tail[:n].astype(np.float32)
+    b = b_head[:n].astype(np.float32)
+    try:
+        a_low, a_mid, a_high = _stft_band_split(a, sr=sr)
+        b_low, b_mid, b_high = _stft_band_split(b, sr=sr)
+    except Exception as exc:
+        LOGGER.warning("Band-split seam mixing failed (%s); using equal crossfade.", exc)
+        return crossfade_equal_length(a, b), {"method": "crossfade-fallback", "error": str(exc)}
+
+    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
+    high_in = x
+    mid_in = np.power(x, 1.15).astype(np.float32)
+    # Delay low-end handoff so kick/bass do not collide early.
+    low_in = np.clip((x - 0.58) / 0.30, 0.0, 1.0).astype(np.float32)
+
+    seam = (
+        (a_high * (1.0 - high_in))
+        + (b_high * high_in)
+        + (a_mid * (1.0 - mid_in))
+        + (b_mid * mid_in)
+        + (a_low * (1.0 - low_in))
+        + (b_low * low_in)
+    ).astype(np.float32)
+
+    return seam, {
+        "method": "dj-eq-bass-swap",
+        "low_handoff": {"start_ratio": 0.58, "end_ratio": 0.88},
+        "bands_hz": {"low_max": 180, "mid_max": 2500},
+    }
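The seam mix imitates a DJ's EQ-style bass swap: the two tracks are split into low/mid/high bands, the highs crossfade linearly, the mids slightly slower, and the low band of the incoming track is held at zero until 58% of the seam so the two kick/bass lines never overlap. A dependency-free sketch of just the gain curves (the helper name and the fixed 101-point grid are illustrative):

```python
def seam_gains(n, low_start=0.58, low_width=0.30):
    # Per-band fade-in gains for the incoming track across the seam:
    # highs crossfade linearly, mids a touch slower (x ** 1.15),
    # lows are held back until `low_start`, then ramp in over `low_width`.
    xs = [i / (n - 1) for i in range(n)]
    high = xs
    mid = [x ** 1.15 for x in xs]
    low = [min(1.0, max(0.0, (x - low_start) / low_width)) for x in xs]
    return high, mid, low

high, mid, low = seam_gains(101)
print(low[50], low[88])  # bass still silent at the midpoint, fully in by 88%
```

The outgoing track uses the complementary `1 - gain` curves per band, so the total per-band energy stays roughly constant while the low-end ownership flips cleanly from A to B.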
+
627
+
628
+ def _build_theme_reference_audio(
629
+ a_pre: np.ndarray,
630
+ a_tail: np.ndarray,
631
+ b_head: np.ndarray,
632
+ b_post: np.ndarray,
633
+ sr: int,
634
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
635
+ a_ctx = np.concatenate([a_pre, a_tail]).astype(np.float32)
636
+ b_ctx = np.concatenate([b_head, b_post]).astype(np.float32)
637
+
638
+ a_take_n = min(a_ctx.size, int(round(12.0 * sr)))
639
+ b_take_n = min(b_ctx.size, int(round(12.0 * sr)))
640
+ if a_take_n <= 0 or b_take_n <= 0:
641
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "insufficient_context"}
642
+
643
+ a_seg = a_ctx[-a_take_n:]
644
+ b_seg = b_ctx[:b_take_n]
645
+ overlap_n = min(int(round(0.45 * sr)), a_seg.size // 4, b_seg.size // 4)
646
+ if overlap_n > 0:
647
+ seam = crossfade_equal_length(a_seg[-overlap_n:], b_seg[:overlap_n])
648
+ ref = np.concatenate([a_seg[:-overlap_n], seam, b_seg[overlap_n:]]).astype(np.float32)
649
+ else:
650
+ ref = np.concatenate([a_seg, b_seg]).astype(np.float32)
651
+
652
+ ref = normalize_peak(apply_edge_fades(ref, sr=sr, fade_ms=20.0), peak=0.98)
653
+ return ref, {
654
+ "enabled": True,
655
+ "method": "a-tail-b-head-theme-ref",
656
+ "duration_sec": round(float(ref.size / max(1, sr)), 3),
657
+ "segments_sec": {
658
+ "song_a": round(float(a_seg.size / max(1, sr)), 3),
659
+ "song_b": round(float(b_seg.size / max(1, sr)), 3),
660
+ "overlap": round(float(overlap_n / max(1, sr)), 3),
661
+ },
662
+ }
663
+
664
+
665
+ def _left_pad_to_length(y: np.ndarray, target_n: int) -> np.ndarray:
666
+ target_n = int(max(0, target_n))
667
+ if y.size >= target_n:
668
+ return y[-target_n:].astype(np.float32)
669
+ return np.pad(y.astype(np.float32), (target_n - y.size, 0), mode="constant")
670
+
671
+
672
+ def _crossfade_join(a: np.ndarray, b: np.ndarray, fade_n: int) -> np.ndarray:
673
+ if a.size <= 0:
674
+ return b.astype(np.float32)
675
+ if b.size <= 0:
676
+ return a.astype(np.float32)
677
+ n = int(max(0, fade_n))
678
+ n = min(n, int(a.size), int(b.size))
679
+ if n <= 0:
680
+ return np.concatenate([a, b]).astype(np.float32)
681
+ seam = crossfade_equal_length(a[-n:], b[:n])
682
+ return np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)
683
+
684
+
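`_crossfade_join` delegates the seam itself to `crossfade_equal_length`, whose body is outside this diff. As a point of reference, an equal-power version of such a seam blend (an assumption about the helper's fade law, not a confirmed detail of this project) can be sketched as:

```python
import numpy as np

def equal_power_crossfade(a_tail: np.ndarray, b_head: np.ndarray) -> np.ndarray:
    """Blend two equal-length mono segments with an equal-power law.

    sin/cos gains keep the summed energy roughly constant across the seam;
    a plain linear fade dips by about 3 dB at the midpoint instead.
    """
    if a_tail.size != b_head.size:
        raise ValueError("segments must have equal length")
    t = np.linspace(0.0, 1.0, a_tail.size, dtype=np.float32)
    gain_out = np.cos(0.5 * np.pi * t)  # outgoing gain: 1 -> 0
    gain_in = np.sin(0.5 * np.pi * t)   # incoming gain: 0 -> 1
    return (a_tail * gain_out + b_head * gain_in).astype(np.float32)
```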
685
+ def _build_period_reference_audio(period: np.ndarray, sr: int, source_mode: str = "full-period-a") -> Tuple[np.ndarray, Dict[str, Any]]:
686
+ if period.size <= 0:
687
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-reference-period"}
688
+ ref = normalize_peak(apply_edge_fades(period.astype(np.float32), sr=sr, fade_ms=20.0), peak=0.98)
689
+ return ref, {
690
+ "enabled": True,
691
+ "method": "opposite-transition-period-reference",
692
+ "source_mode": str(source_mode),
693
+ "duration_sec": round(float(ref.size / max(1, sr)), 3),
694
+ }
695
+
696
+
697
+ def _apply_transition_low_duck(
698
+ y: np.ndarray,
699
+ sr: int,
700
+ duck_floor: float = 0.14,
701
+ fade_out_end: float = 0.42,
702
+ fade_in_start: float = 0.72,
703
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
704
+ n = int(y.size)
705
+ if n <= 0:
706
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-audio"}
707
+
708
+ try:
709
+ low, mid, high = _stft_band_split(y.astype(np.float32), sr=sr)
710
+ except Exception as exc:
711
+ LOGGER.warning("Low-duck split failed (%s); skip ducking.", exc)
712
+ return y.astype(np.float32), {"enabled": False, "reason": "split-failed", "error": str(exc)}
713
+
714
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
715
+ out_end = float(clamp(fade_out_end, 0.1, 0.9))
716
+ in_start = float(clamp(max(out_end + 0.05, fade_in_start), 0.15, 0.95))
717
+ floor = float(clamp(duck_floor, 0.03, 0.5))
718
+
719
+ low_gain = np.full((n,), floor, dtype=np.float32)
720
+ entry_mask = x <= out_end
721
+ if np.any(entry_mask):
722
+ low_gain[entry_mask] = (1.0 - ((x[entry_mask] / max(1e-6, out_end)) * (1.0 - floor))).astype(np.float32)
723
+ exit_mask = x >= in_start
724
+ if np.any(exit_mask):
725
+ ramp = (x[exit_mask] - in_start) / max(1e-6, (1.0 - in_start))
726
+ low_gain[exit_mask] = (floor + (ramp * (1.0 - floor))).astype(np.float32)
727
+
728
+ y_out = (low * low_gain) + mid + high
729
+ y_out = y_out.astype(np.float32)
730
+ return y_out, {
731
+ "enabled": True,
732
+ "method": "low-duck-center",
733
+ "duck_floor": round(float(floor), 4),
734
+ "fade_out_end_ratio": round(float(out_end), 4),
735
+ "fade_in_start_ratio": round(float(in_start), 4),
736
+ }
737
+
738
+
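`_apply_transition_low_duck` builds a piecewise low-band gain: unity falling to `duck_floor` by `fade_out_end`, a held floor through the center, then a ramp back to unity from `fade_in_start`. The envelope in isolation, with the same default break points (a standalone sketch, not the project's code):

```python
import numpy as np

def low_duck_gain(n: int, floor: float = 0.14,
                  out_end: float = 0.42, in_start: float = 0.72) -> np.ndarray:
    """Low-band gain over n samples: 1 -> floor by out_end,
    hold at floor, then floor -> 1 from in_start to the end."""
    x = np.linspace(0.0, 1.0, n, dtype=np.float32)
    gain = np.full(n, floor, dtype=np.float32)
    entry = x <= out_end
    gain[entry] = 1.0 - (x[entry] / out_end) * (1.0 - floor)   # fade lows out
    exit_ = x >= in_start
    gain[exit_] = floor + ((x[exit_] - in_start) / (1.0 - in_start)) * (1.0 - floor)  # bring lows back
    return gain
```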
739
+ def _build_one_bassline_stem_period(
740
+ period_a: np.ndarray,
741
+ period_b: np.ndarray,
742
+ stems_a: Optional[_DemucsStemBundle],
743
+ stems_b: Optional[_DemucsStemBundle],
744
+ ) -> Tuple[Optional[np.ndarray], Dict[str, Any]]:
745
+ if stems_a is None or stems_b is None:
746
+ return None, {"enabled": False, "reason": "missing-stems"}
747
+ n = min(
748
+ int(period_a.size),
749
+ int(period_b.size),
750
+ int(stems_a.vocals.size),
751
+ int(stems_b.vocals.size),
752
+ int(stems_a.bass.size),
753
+ int(stems_b.bass.size),
754
+ )
755
+ if n <= 0:
756
+ return None, {"enabled": False, "reason": "empty-period"}
757
+
758
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
759
+ bass_in = np.clip((x - 0.60) / 0.28, 0.0, 1.0).astype(np.float32)
760
+ # Keep the lows lighter in the center of the seam, then restore them toward each edge.
761
+ center_bass_shape = (0.35 + (0.65 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
762
+
763
+ bass_mix = ((stems_a.bass[:n] * (1.0 - bass_in)) + (stems_b.bass[:n] * bass_in)).astype(np.float32)
764
+ bass_mix = (bass_mix * center_bass_shape).astype(np.float32)
765
+
766
+ acc_a = (stems_a.accompaniment[:n] - stems_a.bass[:n]).astype(np.float32)
767
+ acc_b = (stems_b.accompaniment[:n] - stems_b.bass[:n]).astype(np.float32)
768
+ inst_mix = ((acc_a * (1.0 - x)) + (acc_b * x)).astype(np.float32)
769
+
770
+ vocal_side = np.where(x < 0.5, stems_a.vocals[:n], stems_b.vocals[:n]).astype(np.float32)
771
+ vocal_shape = np.where(
772
+ x < 0.5,
773
+ np.clip(1.0 - ((x / 0.5) * 0.75), 0.25, 1.0),
774
+ np.clip(((x - 0.5) / 0.5) * 0.75 + 0.25, 0.25, 1.0),
775
+ ).astype(np.float32)
776
+ vocals_mix = (vocal_side * vocal_shape * 0.26).astype(np.float32)
777
+
778
+ stem_mix = (inst_mix + bass_mix + vocals_mix).astype(np.float32)
779
+ return stem_mix, {
780
+ "enabled": True,
781
+ "method": "demucs-one-bassline-rule",
782
+ "bass_handoff": {"start_ratio": 0.60, "end_ratio": 0.88},
783
+ "center_bass_floor": 0.35,
784
+ "vocal_sidechain_gain": 0.26,
785
+ }
786
+
787
+
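The "one bassline" rule above hands the bass stem from Song A to Song B late in the seam: the incoming bass gain stays at zero until 60% of the way through and reaches full level at 88%, so only one bassline dominates at any moment. The ramp in isolation (a minimal sketch of the `np.clip` scheduling used above):

```python
import numpy as np

def late_bass_handoff(x: np.ndarray, start: float = 0.60, end: float = 0.88) -> np.ndarray:
    """Gain ramp for the incoming bass stem over normalized time x in [0, 1]:
    0 before `start`, linear between `start` and `end`, 1 afterwards."""
    return np.clip((x - start) / (end - start), 0.0, 1.0).astype(np.float32)
```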
788
+ def _build_src_transition_period(
789
+ period_a: np.ndarray,
790
+ period_b: np.ndarray,
791
+ sr: int,
792
+ ) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
793
+ return _build_src_transition_period_with_stems(period_a, period_b, sr=sr, stems_a=None, stems_b=None)
794
+
795
+
796
+ def _build_src_transition_period_with_stems(
797
+ period_a: np.ndarray,
798
+ period_b: np.ndarray,
799
+ sr: int,
800
+ stems_a: Optional[_DemucsStemBundle] = None,
801
+ stems_b: Optional[_DemucsStemBundle] = None,
802
+ ) -> Tuple[np.ndarray, np.ndarray, Dict[str, Any]]:
803
+ directional, directional_debug = _dj_style_seam_mix(period_a, period_b, sr=sr)
804
+ n = int(min(period_a.size, period_b.size))
805
+ if n > 0:
806
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
807
+ guide = ((period_a[:n] * (1.0 - x)) + (period_b[:n] * x)).astype(np.float32)
808
+ src_period = ((0.70 * directional[:n]) + (0.30 * guide)).astype(np.float32)
809
+ else:
810
+ src_period = directional.astype(np.float32)
811
+
812
+ demucs_mix, demucs_mix_debug = _build_one_bassline_stem_period(
813
+ period_a=period_a,
814
+ period_b=period_b,
815
+ stems_a=stems_a,
816
+ stems_b=stems_b,
817
+ )
818
+ if demucs_mix is not None and demucs_mix.size > 0:
819
+ src_period = ((0.54 * src_period[: demucs_mix.size]) + (0.46 * demucs_mix)).astype(np.float32)
820
+ if src_period.size < n:
821
+ src_period = ensure_length(src_period, n)
822
+
823
+ use_acc_ref = _REF_AUDIO_MODE in {"accompaniment-only", "accompaniment", "inst-only", "instrumental-only"}
824
+ if use_acc_ref and stems_a is not None and stems_a.accompaniment.size > 0:
825
+ reference_period = ensure_length(stems_a.accompaniment.astype(np.float32), int(period_a.size))
826
+ ref_mode = "accompaniment-only"
827
+ else:
828
+ reference_period = period_a.astype(np.float32)
829
+ ref_mode = "full-period-a"
830
+ dominant = "song_b"
831
+
832
+ src_period, low_duck_debug = _apply_transition_low_duck(src_period, sr=sr)
833
+ src_period = normalize_peak(src_period, peak=0.99)
834
+ return src_period, reference_period, {
835
+ "method": "bar-period-layered-repaint-src-fixed-b-base",
836
+ "base_mode": "B-base-fixed",
837
+ "dominant_period": dominant,
838
+ "demucs_one_bassline": demucs_mix_debug,
839
+ "reference_mode": ref_mode,
840
+ "guide_mix": {
841
+ "enabled": True,
842
+ "weight_directional": 0.70,
843
+ "weight_time_direction_guide": 0.30,
844
+ "behavior": "more-song-a-detail-at-entry-more-song-b-at-exit",
845
+ },
846
+ "directional_mix": directional_debug,
847
+ "transition_low_profile": low_duck_debug,
848
+ }
849
+
850
+
851
+ def _crossfade_join_frequency_aware(a: np.ndarray, b: np.ndarray, fade_n: int, sr: int) -> Tuple[np.ndarray, Dict[str, Any]]:
852
+ if a.size <= 0:
853
+ return b.astype(np.float32), {"method": "prepend-empty"}
854
+ if b.size <= 0:
855
+ return a.astype(np.float32), {"method": "append-empty"}
856
+
857
+ n = int(max(0, fade_n))
858
+ n = min(n, int(a.size), int(b.size))
859
+ if n <= 0:
860
+ return np.concatenate([a, b]).astype(np.float32), {"method": "no-fade"}
861
+
862
+ seg_a = a[-n:].astype(np.float32)
863
+ seg_b = b[:n].astype(np.float32)
864
+ seam, seam_debug = _dj_style_seam_mix(seg_a, seg_b, sr=sr)
865
+ out = np.concatenate([a[:-n], seam, b[n:]]).astype(np.float32)
866
+ return out, {"method": "frequency-aware-join", "fade_samples": int(n), "seam": seam_debug}
867
+
868
+
869
+ def _post_repaint_stem_correction(
870
+ transition: np.ndarray,
871
+ sr: int,
872
+ anchor_a: Optional[_DemucsStemBundle] = None,
873
+ anchor_b: Optional[_DemucsStemBundle] = None,
874
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
875
+ y = transition.astype(np.float32)
876
+ if y.size <= 0:
877
+ return np.zeros((0,), dtype=np.float32), {"enabled": False, "reason": "empty-transition"}
878
+
879
+ stems, demucs_debug = _extract_demucs_stems(y, int(sr), track_label="post-repaint-transition")
880
+ if stems is None:
881
+ return y, {"enabled": False, "reason": "demucs-unavailable", "demucs": demucs_debug}
882
+
883
+ n = int(min(stems.vocals.size, stems.drums.size, stems.bass.size, stems.other.size, y.size))
884
+ if n <= 0:
885
+ return y, {"enabled": False, "reason": "empty-stems", "demucs": demucs_debug}
886
+
887
+ x = np.linspace(0.0, 1.0, n, dtype=np.float32)
888
+ center = np.clip(np.minimum(x, 1.0 - x) / 0.18, 0.0, 1.0).astype(np.float32)
889
+
890
+ bass_cur = max(1e-5, _rms(stems.bass[:n]))
891
+ bass_ref_a = _rms(anchor_a.bass) if anchor_a is not None else bass_cur
892
+ bass_ref_b = _rms(anchor_b.bass) if anchor_b is not None else bass_cur
893
+ bass_gain_a = float(clamp(bass_ref_a / bass_cur, 0.65, 1.15))
894
+ bass_gain_b = float(clamp(bass_ref_b / bass_cur, 0.65, 1.15))
895
+ bass_linear = ((1.0 - x) * bass_gain_a) + (x * bass_gain_b)
896
+ bass_center_shape = (0.72 + (0.28 * np.abs((2.0 * x) - 1.0))).astype(np.float32)
897
+ bass_gain = (bass_linear * bass_center_shape).astype(np.float32)
898
+
899
+ vocal_cur = max(1e-5, _rms(stems.vocals[:n]))
900
+ vocal_ref_a = _rms(anchor_a.vocals) if anchor_a is not None else vocal_cur
901
+ vocal_ref_b = _rms(anchor_b.vocals) if anchor_b is not None else vocal_cur
902
+ vocal_gain_a = float(clamp(vocal_ref_a / vocal_cur, 0.42, 1.0))
903
+ vocal_gain_b = float(clamp(vocal_ref_b / vocal_cur, 0.42, 1.0))
904
+ vocal_linear = ((1.0 - x) * vocal_gain_a) + (x * vocal_gain_b)
905
+ vocal_boundary_shape = (0.72 + (0.28 * center)).astype(np.float32)
906
+ vocal_gain = (vocal_linear * vocal_boundary_shape).astype(np.float32)
907
+
908
+ drum_gain = (1.05 - (0.08 * center)).astype(np.float32)
909
+ other_gain = 1.0
910
+
911
+ corrected = (
912
+ (stems.vocals[:n] * vocal_gain)
913
+ + (stems.drums[:n] * drum_gain)
914
+ + (stems.bass[:n] * bass_gain)
915
+ + (stems.other[:n] * other_gain)
916
+ ).astype(np.float32)
917
+ corrected = ensure_length(corrected, int(y.size))
918
+ return corrected, {
919
+ "enabled": True,
920
+ "method": "demucs-post-repaint-boundary-rebalance",
921
+ "demucs": demucs_debug,
922
+ "gains": {
923
+ "bass_start": round(float(bass_gain_a), 4),
924
+ "bass_end": round(float(bass_gain_b), 4),
925
+ "vocal_start": round(float(vocal_gain_a), 4),
926
+ "vocal_end": round(float(vocal_gain_b), 4),
927
+ "drum_edge_boost": 1.05,
928
+ },
929
+ "anchor_rms": {
930
+ "bass_a": round(float(bass_ref_a), 6),
931
+ "bass_b": round(float(bass_ref_b), 6),
932
+ "vocal_a": round(float(vocal_ref_a), 6),
933
+ "vocal_b": round(float(vocal_ref_b), 6),
934
+ },
935
+ }
936
+
937
+
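`_post_repaint_stem_correction` rebalances each stem toward anchor RMS levels using clamped gains (e.g. bass clamped to [0.65, 1.15], vocals to [0.42, 1.0]) so the correction never overshoots. The per-stem gain rule in isolation (a standalone sketch with a local `rms` helper standing in for the project's `_rms`):

```python
import numpy as np

def rms(y: np.ndarray) -> float:
    """Root-mean-square level of a mono buffer (0.0 for an empty buffer)."""
    return float(np.sqrt(np.mean(np.square(y)))) if y.size else 0.0

def matched_stem_gain(ref: np.ndarray, cur: np.ndarray,
                      lo: float = 0.65, hi: float = 1.15) -> float:
    """Gain that moves the current stem's RMS toward the reference stem's RMS,
    clamped to [lo, hi] so the rebalance stays gentle."""
    cur_rms = max(1e-5, rms(cur))
    return float(np.clip(rms(ref) / cur_rms, lo, hi))
```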
938
+ def _assemble_substitute_mix(
939
+ song_a_prefix: np.ndarray,
940
+ transition: np.ndarray,
941
+ song_b_suffix: np.ndarray,
942
+ boundary_fade_n: int = 0,
943
+ sr: int = DEFAULT_TARGET_SR,
944
+ ) -> Tuple[np.ndarray, Dict[str, Any]]:
945
+ a = song_a_prefix.astype(np.float32) if song_a_prefix.size > 0 else np.zeros((0,), dtype=np.float32)
946
+ t = transition.astype(np.float32) if transition.size > 0 else np.zeros((0,), dtype=np.float32)
947
+ b = song_b_suffix.astype(np.float32) if song_b_suffix.size > 0 else np.zeros((0,), dtype=np.float32)
948
+ joined, entry_debug = _crossfade_join_frequency_aware(a, t, boundary_fade_n, sr=sr)
949
+ joined, exit_debug = _crossfade_join_frequency_aware(joined, b, boundary_fade_n, sr=sr)
950
+ return joined.astype(np.float32), {
951
+ "method": "dual-frequency-aware-boundary-joins",
952
+ "entry": entry_debug,
953
+ "exit": exit_debug,
954
+ }
955
+
956
+
957
+ def _align_b_window_to_a_tail(
958
+ a_tail: np.ndarray,
959
+ y_b_stretched: np.ndarray,
960
+ nominal_start_n: int,
961
+ seam_n: int,
962
+ post_n: int,
963
+ sr: int,
964
+ bpm_ref: float,
965
+ a_tail_drums: Optional[np.ndarray] = None,
966
+ y_b_stretched_drums: Optional[np.ndarray] = None,
967
+ ) -> Tuple[np.ndarray, int, Dict[str, Any]]:
968
+ total_n = seam_n + post_n
969
+ if y_b_stretched.size < total_n:
970
+ return ensure_length(y_b_stretched, total_n), 0, {
971
+ "method": "short-buffer-fallback",
972
+ "candidate_count": 0,
973
+ }
974
+
975
+ beat_sec = 60.0 / max(1e-6, float(bpm_ref))
976
+ search_sec = clamp(0.75 * beat_sec, 0.2, 1.2)
977
+ search_n = int(round(search_sec * sr))
978
+
979
+ nominal_start_n = int(clamp(float(nominal_start_n), 0.0, float(max(0, y_b_stretched.size - total_n))))
980
+ lo = max(0, nominal_start_n - search_n)
981
+ hi = min(y_b_stretched.size - total_n, nominal_start_n + search_n)
982
+
983
+ _, beat_times_stretched = estimate_bpm_and_beats(y_b_stretched, sr)
984
+ candidates: List[int] = []
985
+ for bt in beat_times_stretched:
986
+ idx = int(round(float(bt) * sr))
987
+ if lo <= idx <= hi:
988
+ candidates.append(idx)
989
+ candidates.append(nominal_start_n)
990
+ candidates = sorted(set(candidates))
991
+ if not candidates:
992
+ candidates = [nominal_start_n]
993
+
994
+ use_drum_alignment = (
995
+ isinstance(a_tail_drums, np.ndarray)
996
+ and isinstance(y_b_stretched_drums, np.ndarray)
997
+ and int(a_tail_drums.size) >= int(seam_n)
998
+ and int(y_b_stretched_drums.size) >= int(y_b_stretched.size)
999
+ )
1000
+
1001
+ onset_a_mix = _normalized_onset_envelope(a_tail, sr)
1002
+ onset_a_drum = _normalized_onset_envelope(a_tail_drums[:seam_n], sr) if use_drum_alignment else onset_a_mix
1003
+ rms_a = _rms(a_tail)
1004
+ drum_rms_a = _rms(a_tail_drums[:seam_n]) if use_drum_alignment else 0.0
1005
+ best_idx = candidates[0]
1006
+ best_score = -1.0
1007
+ best_components = {"onset_mix": 0.0, "onset_drum": 0.0, "energy": 0.0, "drum_energy": 0.0, "distance": 0.0}
1008
+
1009
+ distance_scale = max(1.0, 0.65 * search_n)
1010
+ for idx in candidates:
1011
+ seg = ensure_length(y_b_stretched[idx : idx + total_n], total_n)
1012
+ b_head = seg[:seam_n]
1013
+
1014
+ onset_b_mix = _normalized_onset_envelope(b_head, sr)
1015
+ onset_score_mix = _corr_similarity(onset_a_mix, onset_b_mix)
1016
+ onset_score_drum = onset_score_mix
1017
+ drum_energy_score = 0.5
1018
+ onset_score = onset_score_mix
1019
+ if use_drum_alignment:
1020
+ seg_drums = ensure_length(y_b_stretched_drums[idx : idx + total_n], total_n)
1021
+ b_head_drums = seg_drums[:seam_n]
1022
+ onset_b_drum = _normalized_onset_envelope(b_head_drums, sr)
1023
+ onset_score_drum = _corr_similarity(onset_a_drum, onset_b_drum)
1024
+ onset_score = (0.78 * onset_score_drum) + (0.22 * onset_score_mix)
1025
+ drum_rms_b = _rms(b_head_drums)
1026
+ drum_gap = abs(drum_rms_a - drum_rms_b) / max(1e-4, drum_rms_a)
1027
+ drum_energy_score = clamp(1.0 - drum_gap, 0.0, 1.0)
1028
+
1029
+ rms_b = _rms(b_head)
1030
+ energy_gap = abs(rms_a - rms_b) / max(1e-4, rms_a)
1031
+ energy_score = clamp(1.0 - energy_gap, 0.0, 1.0)
1032
+
1033
+ dist = abs(idx - nominal_start_n)
1034
+ distance_score = float(np.exp(-dist / distance_scale))
1035
+
1036
+ if use_drum_alignment:
1037
+ score = (0.62 * onset_score) + (0.18 * energy_score) + (0.10 * drum_energy_score) + (0.10 * distance_score)
1038
+ else:
1039
+ score = (0.56 * onset_score) + (0.26 * energy_score) + (0.18 * distance_score)
1040
+ if score > best_score:
1041
+ best_score = float(score)
1042
+ best_idx = int(idx)
1043
+ best_components = {
1044
+ "onset_mix": float(onset_score_mix),
1045
+ "onset_drum": float(onset_score_drum),
1046
+ "energy": float(energy_score),
1047
+ "drum_energy": float(drum_energy_score),
1048
+ "distance": float(distance_score),
1049
+ }
1050
+
1051
+ aligned = ensure_length(y_b_stretched[best_idx : best_idx + total_n], total_n)
1052
+ return aligned, best_idx, {
1053
+ "method": "drum-led-beat-phase-transient-align" if use_drum_alignment else "beat-phase-transient-align",
1054
+ "used_drum_stems": bool(use_drum_alignment),
1055
+ "candidate_count": len(candidates),
1056
+ "search_sec": round(float(search_sec), 4),
1057
+ "search_samples": int(search_n),
1058
+ "nominal_start_sample": int(nominal_start_n),
1059
+ "best_start_sample": int(best_idx),
1060
+ "best_score": round(float(best_score), 6),
1061
+ "score_components": {k: round(float(v), 6) for k, v in best_components.items()},
1062
+ }
1063
+
1064
+
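`_align_b_window_to_a_tail` ranks beat-aligned candidate start points by a weighted sum of onset-envelope similarity, RMS closeness, and proximity to the nominal cue. The no-drum scoring branch reduces to something like the following (hypothetical standalone helper; `onset_sim` stands in for the output of `_corr_similarity`):

```python
import numpy as np

def alignment_score(onset_sim: float, rms_a: float, rms_b: float,
                    dist: int, distance_scale: float) -> float:
    """Score one candidate with the no-drum weights (0.56 / 0.26 / 0.18)."""
    energy_gap = abs(rms_a - rms_b) / max(1e-4, rms_a)
    energy_score = float(np.clip(1.0 - energy_gap, 0.0, 1.0))
    # Exponential falloff: candidates far from the nominal cue are penalized.
    distance_score = float(np.exp(-dist / distance_scale))
    return (0.56 * onset_sim) + (0.26 * energy_score) + (0.18 * distance_score)
```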
1065
+ def _prepare_rough_transition(request: TransitionRequest) -> Dict[str, Any]:
1066
+ pre_sec_raw = clamp(request.pre_context_sec, 1.0, 20.0)
1067
+ post_sec_raw = clamp(request.post_context_sec, 1.0, 20.0)
1068
+ analysis_sec = clamp(request.analysis_sec, 10.0, 120.0)
1069
+
1070
+ target_sr = int(request.target_sr)
1071
+
1072
+ dur_a = ffprobe_duration_sec(request.song_a_path)
1073
+ dur_b = ffprobe_duration_sec(request.song_b_path)
1074
+
1075
+ a_analysis_start = max(0.0, float(dur_a) - analysis_sec) if dur_a is not None else 0.0
1076
+
1077
+ y_a_an, sr_a = decode_segment(request.song_a_path, a_analysis_start, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
1078
+ y_b_an, sr_b = decode_segment(request.song_b_path, 0.0, analysis_sec, sr=target_sr, max_decode_sec=analysis_sec)
1079
+ bpm_a, beats_a = estimate_bpm_and_beats(y_a_an, sr_a)
1080
+ bpm_b, beats_b = estimate_bpm_and_beats(y_b_an, sr_b)
1081
+
1082
+ if request.bpm_target is not None and 40.0 <= float(request.bpm_target) <= 220.0:
1083
+ bpm_a = float(request.bpm_target)
1084
+
1085
+ bpm_a = float(bpm_a) if bpm_a is not None else 120.0
1086
+ bpm_b_detected = float(bpm_b) if bpm_b is not None else 120.0
1087
+ bpm_b_for_alignment = _resolve_half_double_tempo(bpm_a, bpm_b_detected)
1088
+ bars_requested = int(request.transition_bars)
1089
+ valid_bars = {4, 8, 16}
1090
+ transition_bars = bars_requested if bars_requested in valid_bars else 8
1091
+ seam_sec_raw = float(_beats_to_seconds(float(transition_bars * 4), bpm_a))
1092
+ seam_sec_raw = float(clamp(seam_sec_raw, 1.0, 40.0))
1093
+ seam_sec_ui_raw = seam_sec_raw
1094
+ base_mode = "B-base-fixed"
1095
+
1096
+ phrase_lock = _phrase_lock_transition_shape(
1097
+ pre_sec=pre_sec_raw,
1098
+ seam_sec=seam_sec_raw,
1099
+ post_sec=post_sec_raw,
1100
+ bpm=bpm_a,
1101
+ )
1102
+ pre_sec = float(phrase_lock["pre_sec"])
1103
+ seam_sec = float(phrase_lock["seam_sec"])
1104
+ post_sec = float(phrase_lock["post_sec"])
1105
+
1106
+ cue_selection = select_mix_cuepoints(
1107
+ y_a_analysis=y_a_an,
1108
+ y_b_analysis=y_b_an,
1109
+ sr=target_sr,
1110
+ analysis_sec=analysis_sec,
1111
+ pre_sec=pre_sec,
1112
+ seam_sec=seam_sec,
1113
+ post_sec=post_sec,
1114
+ a_analysis_start_sec=a_analysis_start,
1115
+ beats_a=beats_a,
1116
+ beats_b=beats_b,
1117
+ cue_a_override_sec=request.cue_a_sec,
1118
+ cue_b_override_sec=request.cue_b_sec,
1119
+ song_a_path=request.song_a_path,
1120
+ song_b_path=request.song_b_path,
1121
+ song_a_duration_sec=dur_a,
1122
+ song_b_duration_sec=dur_b,
1123
+ )
1124
+ cue_a = float(cue_selection.cue_a_sec)
1125
+ cue_b = float(cue_selection.cue_b_sec)
1126
+
1127
+ stretch_rate_raw = bpm_a / max(1e-6, bpm_b_for_alignment)
1128
+ # Clamp the rate to preserve musical coherence while avoiding clearly audible time-stretch artifacts.
1129
+ stretch_rate = clamp(stretch_rate_raw, 0.7, 1.35)
1130
+
1131
+ pre_n = int(round(pre_sec * target_sr))
1132
+ seam_n = int(round(seam_sec * target_sr))
1133
+ post_n = int(round(post_sec * target_sr))
1134
+
1135
+ # Song A transition period: the bars immediately preceding cue A.
1136
+ a_period_start = max(0.0, cue_a - seam_sec)
1137
+ period_a, _ = decode_segment(
1138
+ request.song_a_path,
1139
+ a_period_start,
1140
+ seam_sec,
1141
+ sr=target_sr,
1142
+ max_decode_sec=seam_sec + 2.0,
1143
+ )
1144
+ period_a = ensure_length(period_a, seam_n)
1145
+ period_a_stems, period_a_stem_debug = _extract_demucs_stems(period_a, target_sr, track_label="song-a-transition-period")
1146
+
1147
+ # Repaint pre-context leading into the transition period.
1148
+ a_pre_start = max(0.0, a_period_start - pre_sec)
1149
+ a_pre, _ = decode_segment(
1150
+ request.song_a_path,
1151
+ a_pre_start,
1152
+ pre_sec,
1153
+ sr=target_sr,
1154
+ max_decode_sec=pre_sec + 2.0,
1155
+ )
1156
+ a_pre = _left_pad_to_length(a_pre, pre_n)
1157
+
1158
+ cue_b_selected = cue_b
1159
+ stitch_preview_side_sec = float(STITCH_PREVIEW_SIDE_SEC)
1160
+ boundary_fade_beats = 2.0
1161
+ boundary_fade_sec = clamp(_beats_to_seconds(boundary_fade_beats, bpm_a), 0.08, 1.2)
1162
+ boundary_fade_n = int(round(boundary_fade_sec * target_sr))
1163
+ stitch_decode_side_sec = stitch_preview_side_sec + boundary_fade_sec
1164
+ cue_a_for_stitch = float(max(0.0, cue_a - seam_sec))
1165
+ if dur_a is not None:
1166
+ cue_a_for_stitch = clamp(cue_a_for_stitch, 0.0, float(dur_a))
1167
+ song_a_preview_start = max(0.0, cue_a_for_stitch - stitch_decode_side_sec)
1168
+ song_a_preview_dur = max(0.0, cue_a_for_stitch - song_a_preview_start)
1169
+ song_a_prefix, _ = decode_segment(
1170
+ request.song_a_path,
1171
+ song_a_preview_start,
1172
+ song_a_preview_dur,
1173
+ sr=target_sr,
1174
+ max_decode_sec=max(20.0, song_a_preview_dur + 2.0),
1175
+ )
1176
+
1177
+ # Song B window: decode with pre-roll so we can phase-align on the stretched beat grid.
1178
+ align_preroll_sec = clamp(0.75 * (60.0 / max(1e-6, bpm_a)), 0.2, 1.2)
1179
+ decode_start_b = max(0.0, cue_b_selected - (align_preroll_sec * stretch_rate))
1180
+ if dur_b is not None:
1181
+ decode_start_b = clamp(decode_start_b, 0.0, float(dur_b))
1182
+ desired_b_out_sec = seam_sec + max(post_sec, stitch_decode_side_sec) + (2.0 * align_preroll_sec)
1183
+ if dur_b is not None:
1184
+ # Decode only enough of Song B for alignment + transition + preview tail.
1185
+ remaining_sec = max(0.0, float(dur_b) - decode_start_b)
1186
+ raw_b_in_sec = clamp(min(remaining_sec, desired_b_out_sec * stretch_rate), 1.0, 360.0)
1187
+ else:
1188
+ raw_b_in_sec = clamp(desired_b_out_sec * stretch_rate, 1.0, 360.0)
1189
+ y_b_raw, _ = decode_segment(
1190
+ request.song_b_path,
1191
+ decode_start_b,
1192
+ raw_b_in_sec,
1193
+ sr=target_sr,
1194
+ max_decode_sec=raw_b_in_sec + 2.0,
1195
+ )
1196
+ y_b_stretched = safe_time_stretch(y_b_raw, rate=stretch_rate)
1197
+ y_b_stretched_stems, y_b_stem_debug = _extract_demucs_stems(
1198
+ y_b_stretched,
1199
+ target_sr,
1200
+ track_label="song-b-stretched-window",
1201
+ )
1202
+ nominal_b_start_n = int(round(align_preroll_sec * target_sr))
1203
+ y_b, aligned_b_start_n, b_alignment_debug = _align_b_window_to_a_tail(
1204
+ a_tail=period_a,
1205
+ y_b_stretched=y_b_stretched,
1206
+ nominal_start_n=nominal_b_start_n,
1207
+ seam_n=seam_n,
1208
+ post_n=post_n,
1209
+ sr=target_sr,
1210
+ bpm_ref=bpm_a,
1211
+ a_tail_drums=period_a_stems.drums if period_a_stems is not None else None,
1212
+ y_b_stretched_drums=y_b_stretched_stems.drums if y_b_stretched_stems is not None else None,
1213
+ )
1214
+ cue_b = float(decode_start_b + ((aligned_b_start_n / float(target_sr)) * stretch_rate))
1215
+ period_b = y_b[:seam_n]
1216
+ period_b_stems = _slice_stem_bundle(y_b_stretched_stems, aligned_b_start_n, seam_n)
1217
+ b_post = y_b[seam_n : seam_n + post_n]
1218
+ stitch_decode_n = int(round(stitch_decode_side_sec * target_sr))
1219
+ b_suffix_substitute = y_b_stretched[(aligned_b_start_n + seam_n) : (aligned_b_start_n + seam_n + stitch_decode_n)].astype(
1220
+ np.float32
1221
+ )
1222
+ if b_suffix_substitute.size == 0:
1223
+ b_suffix_substitute = np.zeros((0,), dtype=np.float32)
1224
+
1225
+ rough_seam, reference_period, rough_mix_debug = _build_src_transition_period_with_stems(
1226
+ period_a=period_a,
1227
+ period_b=period_b,
1228
+ sr=target_sr,
1229
+ stems_a=period_a_stems,
1230
+ stems_b=period_b_stems,
1231
+ )
1232
+ rough_stitched = np.concatenate([a_pre, rough_seam, b_post]).astype(np.float32)
1233
+ reference_audio_clip, reference_audio_debug = _build_period_reference_audio(
1234
+ reference_period,
1235
+ sr=target_sr,
1236
+ source_mode=str(rough_mix_debug.get("reference_mode", "full-period-a")),
1237
+ )
1238
+ return {
1239
+ "target_sr": target_sr,
1240
+ "dur_a": dur_a,
1241
+ "dur_b": dur_b,
1242
+ "analysis_start_a_sec": a_analysis_start,
1243
+ "bpm_a": bpm_a,
1244
+ "bpm_b": bpm_b_detected,
1245
+ "bpm_b_for_alignment": bpm_b_for_alignment,
1246
+ "cue_a_sec": cue_a,
1247
+ "cue_b_sec": cue_b,
1248
+ "cue_b_selected_sec": cue_b_selected,
1249
+ "cue_selector_method": cue_selection.method,
1250
+ "cue_selector_debug": cue_selection.debug,
1251
+ "stretch_rate": stretch_rate,
1252
+ "stretch_rate_raw": stretch_rate_raw,
1253
+ "transition_base_mode": base_mode,
1254
+ "transition_bars": int(transition_bars),
1255
+ "b_alignment_debug": b_alignment_debug,
1256
+ "phrase_lock_debug": phrase_lock["debug"],
1257
+ "rough_mix_debug": rough_mix_debug,
1258
+ "reference_audio_debug": reference_audio_debug,
1259
+ "demucs_transition_debug": {
1260
+ "enabled": bool(_DEMUCS_TRANSITION_ENABLED),
1261
+ "period_a": period_a_stem_debug,
1262
+ "b_window_stretched": y_b_stem_debug,
1263
+ "period_b_from_aligned_window": {
1264
+ "status": "ready" if period_b_stems is not None else "unavailable",
1265
+ "source": "slice(song-b-stretched-window, aligned_start, seam_n)",
1266
+ "aligned_start_sample": int(aligned_b_start_n),
1267
+ "seam_n": int(seam_n),
1268
+ },
1269
+ },
1270
+ "pre_sec": pre_sec,
1271
+ "seam_sec": seam_sec,
1272
+ "post_sec": post_sec,
1273
+ "pre_sec_raw": pre_sec_raw,
1274
+ "seam_sec_raw": seam_sec_raw,
1275
+ "seam_sec_ui_raw": seam_sec_ui_raw,
1276
+ "post_sec_raw": post_sec_raw,
1277
+ "pre_n": pre_n,
1278
+ "seam_n": seam_n,
1279
+ "post_n": post_n,
1280
+ "rough_seam": rough_seam,
1281
+ "rough_stitched": rough_stitched,
1282
+ "song_a_prefix": song_a_prefix,
1283
+ "song_b_suffix_substitute": b_suffix_substitute,
1284
+ "reference_audio_clip": reference_audio_clip,
1285
+ "period_a_stem_bundle": period_a_stems,
1286
+ "period_b_stem_bundle": period_b_stems,
1287
+ "boundary_fade_n": int(boundary_fade_n),
1288
+ "boundary_fade_sec": float(boundary_fade_sec),
1289
+ "stitch_preview_side_sec": float(stitch_preview_side_sec),
1290
+ "stitch_decode_side_sec": float(stitch_decode_side_sec),
1291
+ "stitching_debug": {
1292
+ "mode": "replace-seam-no-insert",
1293
+ "transition_base_mode": base_mode,
1294
+ "transition_bars": int(transition_bars),
1295
+ "song_a_prefix_sec": round(float(song_a_prefix.size / max(1, target_sr)), 3),
1296
+ "transition_sec": round(float(seam_sec), 3),
1297
+ "song_b_suffix_sec": round(float(b_suffix_substitute.size / max(1, target_sr)), 3),
1298
+ "decode_start_b_sec": round(float(decode_start_b), 3),
1299
+ "cue_a_cut_sec": round(float(cue_a_for_stitch), 3),
1300
+ "cue_b_continuation_sec": round(float(cue_b + seam_sec), 3),
1301
+ "replaced_window_sec": round(float(seam_sec), 3),
1302
+ "boundary_fade_sec": round(float(boundary_fade_sec), 3),
1303
+ "stitch_preview_side_sec": round(float(stitch_preview_side_sec), 3),
1304
+ "stitch_decode_side_sec": round(float(stitch_decode_side_sec), 3),
1305
+ },
1306
+ }
1307
+
1308
+
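`_prepare_rough_transition` calls `_resolve_half_double_tempo` to reconcile tempo-octave errors before computing the stretch rate; that helper is defined outside this diff. A typical implementation picks whichever of half, as-detected, or double tempo is closest to the reference BPM (a hypothetical sketch, not the actual function):

```python
def resolve_half_double_tempo(bpm_ref: float, bpm_detected: float) -> float:
    """Pick the tempo octave (half, as-is, double) of the detected BPM that is
    closest to the reference BPM, so e.g. a track detected at 256 BPM is
    treated as 128 when mixing against a 128 BPM reference.

    Hypothetical sketch -- the real _resolve_half_double_tempo lives
    elsewhere in this repo.
    """
    candidates = [bpm_detected * 0.5, bpm_detected, bpm_detected * 2.0]
    return min(candidates, key=lambda c: abs(c - bpm_ref))
```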
1309
+ def _extract_success_and_audios(result: Any) -> Tuple[bool, list, Optional[str]]:
1310
+ if isinstance(result, dict):
1311
+ success = bool(result.get("success", False))
1312
+ audios = result.get("audios", [])
1313
+ error = result.get("error") or result.get("status_message")
1314
+ return success, audios, error
1315
+ success = bool(getattr(result, "success", False))
1316
+ audios = getattr(result, "audios", [])
1317
+ error = getattr(result, "error", None) or getattr(result, "status_message", None)
1318
+ return success, audios, error
1319
+
1320
+
1321
+ def _load_acestep_runtime(request: TransitionRequest) -> Dict[str, Any]:
1322
+ global _ACESTEP_RUNTIME
1323
+
1324
+ project_root = _resolve_acestep_project_root(request)
1325
+ runtime_key = (
1326
+ project_root,
1327
+ request.acestep_model_config,
1328
+ request.acestep_device,
1329
+ request.acestep_lora_path,
1330
+ float(request.acestep_lora_scale),
1331
+ )
1332
+
1333
+ if _ACESTEP_RUNTIME is not None and _ACESTEP_RUNTIME.get("key") == runtime_key:
1334
+ return _ACESTEP_RUNTIME
1335
+
1336
+ try:
1337
+ from acestep.handler import AceStepHandler
1338
+ from acestep.inference import GenerationConfig, GenerationParams, generate_music
1339
+ except Exception as exc:
1340
+ raise RuntimeError(
1341
+ "ACE-Step is not installed or import failed. "
1342
+ "Install with: pip install git+https://github.com/ACE-Step/ACE-Step-1.5.git"
1343
+ ) from exc
1344
+
1345
+ handler = AceStepHandler()
1346
+ status, ok = handler.initialize_service(
1347
+ project_root=project_root,
1348
+ config_path=request.acestep_model_config,
1349
+ device=request.acestep_device,
1350
+ use_flash_attention=request.acestep_use_flash_attn,
1351
+ compile_model=request.acestep_compile_model,
1352
+ offload_to_cpu=request.acestep_offload_to_cpu,
1353
+ offload_dit_to_cpu=request.acestep_offload_dit_to_cpu,
1354
+ quantization=None,
1355
+ prefer_source=request.acestep_prefer_source,
1356
+ use_mlx_dit=request.acestep_use_mlx_dit,
1357
+ )
1358
+ if not ok:
1359
+ raise RuntimeError(f"ACE-Step initialize_service failed: {status}")
1360
+
1361
+ lora_debug: Dict[str, Any] = {"requested": False}
1362
+ if request.acestep_lora_path:
1363
+ lora_debug["requested"] = True
1364
+ resolved_lora_path = _resolve_lora_path(request.acestep_lora_path, project_root)
1365
+ try:
1366
+ handler.load_lora(resolved_lora_path)
1367
+ handler.set_use_lora(True)
1368
+ handler.set_lora_scale(float(request.acestep_lora_scale))
1369
+ lora_debug.update(
1370
+ {
1371
+ "loaded": True,
1372
+ "path": resolved_lora_path,
1373
+ "scale": float(request.acestep_lora_scale),
1374
+ }
1375
+ )
1376
+ except Exception as exc:
1377
+ raise RuntimeError(f"Failed to load ACE-Step LoRA: {exc}") from exc
1378
+ else:
1379
+ lora_debug["loaded"] = False
1380
+
1381
+ _ACESTEP_RUNTIME = {
1382
+ "key": runtime_key,
1383
+ "project_root": project_root,
1384
+ "handler": handler,
1385
+ "GenerationParams": GenerationParams,
1386
+ "GenerationConfig": GenerationConfig,
1387
+ "generate_music": generate_music,
1388
+ "lora_debug": lora_debug,
1389
+ }
1390
+ return _ACESTEP_RUNTIME
1391
+
1392
+
1393
+ def _run_acestep_repaint(
+     request: TransitionRequest,
+     rough: Dict[str, Any],
+     rough_src_path: str,
+ ) -> Tuple[np.ndarray, np.ndarray]:
+     runtime = _load_acestep_runtime(request)
+     handler = runtime["handler"]
+     GenerationParams = runtime["GenerationParams"]
+     GenerationConfig = runtime["GenerationConfig"]
+     generate_music = runtime["generate_music"]
+ 
+     caption = _build_caption(request.plugin_id, request.instruction_text)
+ 
+     rough_stitched = rough["rough_stitched"]
+     rough_for_model = resample_if_needed(rough_stitched, rough["target_sr"], ACESTEP_INPUT_SR)
+     write_wav(rough_src_path, rough_for_model, ACESTEP_INPUT_SR)
+     reference_audio_path: Optional[str] = None
+     reference_audio_clip = rough.get("reference_audio_clip")
+     if isinstance(reference_audio_clip, np.ndarray) and reference_audio_clip.size > 0:
+         reference_audio_path = (
+             rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
+             if rough_src_path.endswith("_rough_src.wav")
+             else f"{rough_src_path}.theme_ref.wav"
+         )
+         reference_for_model = resample_if_needed(reference_audio_clip, rough["target_sr"], ACESTEP_INPUT_SR)
+         write_wav(reference_audio_path, reference_for_model, ACESTEP_INPUT_SR)
+ 
+     repaint_start = float(rough["pre_sec"])
+     repaint_end = float(rough["pre_sec"] + rough["seam_sec"])
+     total_duration = float(rough["pre_sec"] + rough["seam_sec"] + rough["post_sec"])
+     bpm_hint = int(round(rough["bpm_a"])) if 30 <= rough["bpm_a"] <= 300 else None
+ 
+     params = GenerationParams(
+         task_type="repaint",
+         src_audio=rough_src_path,
+         reference_audio=reference_audio_path,
+         repainting_start=repaint_start,
+         repainting_end=repaint_end,
+         caption=caption,
+         lyrics="[Instrumental]",
+         instrumental=True,
+         bpm=bpm_hint,
+         duration=total_duration,
+         inference_steps=int(max(1, request.inference_steps)),
+         guidance_scale=float(request.creativity_strength),
+         seed=int(request.seed),
+         thinking=False,
+         use_cot_metas=False,
+         use_cot_caption=False,
+         use_cot_language=False,
+     )
+     config = GenerationConfig(
+         batch_size=1,
+         use_random_seed=False,
+         seeds=[int(request.seed)],
+         audio_format="wav",
+     )
+ 
+     result = generate_music(
+         dit_handler=handler,
+         llm_handler=None,
+         params=params,
+         config=config,
+         save_dir=None,
+         progress=None,
+     )
+     success, audios, error = _extract_success_and_audios(result)
+     if not success or not audios:
+         raise RuntimeError(error or "ACE-Step repaint returned no audio.")
+ 
+     audio_item = audios[0]
+     audio_tensor = audio_item.get("tensor")
+     if audio_tensor is None:
+         raise RuntimeError("ACE-Step repaint output missing audio tensor.")
+ 
+     try:
+         import torch
+         if isinstance(audio_tensor, torch.Tensor):
+             y = audio_tensor.detach().float().cpu().numpy()
+         else:
+             y = np.asarray(audio_tensor, dtype=np.float32)
+     except Exception:
+         y = np.asarray(audio_tensor, dtype=np.float32)
+ 
+     if y.ndim == 2:
+         y = np.mean(y, axis=0)
+     elif y.ndim > 2:
+         y = y.reshape(-1)
+     y = y.astype(np.float32)
+ 
+     model_sr = int(audio_item.get("sample_rate", ACESTEP_INPUT_SR))
+     y = resample_if_needed(y, model_sr, rough["target_sr"])
+ 
+     total_n = rough["pre_n"] + rough["seam_n"] + rough["post_n"]
+     y = ensure_length(y, total_n)
+     stitched = y[:total_n]
+     seam_start = rough["pre_n"]
+     seam_end = seam_start + rough["seam_n"]
+     transition = stitched[seam_start:seam_end]
+     return transition, stitched
+ 
+ 
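The pre/seam/post bookkeeping in `_run_acestep_repaint` reduces to simple sample arithmetic: the stitched clip is `pre | seam | post`, and the repainted seam is cut back out by offset. A self-contained sketch (hypothetical helper name; the real code uses the precomputed `rough["pre_n"]`/`rough["seam_n"]` counts):

```python
import numpy as np

def slice_repaint_seam(stitched: np.ndarray, sr: int, pre_sec: float, seam_sec: float) -> np.ndarray:
    """Cut the repainted seam out of a pre|seam|post stitched clip."""
    seam_start = int(round(pre_sec * sr))
    seam_end = seam_start + int(round(seam_sec * sr))
    return stitched[seam_start:seam_end]

sr = 48_000
pre_sec, seam_sec, post_sec = 6.0, 4.0, 6.0  # matches the CLI defaults below
stitched = np.zeros(int((pre_sec + seam_sec + post_sec) * sr), dtype=np.float32)
seam = slice_repaint_seam(stitched, sr, pre_sec, seam_sec)
```

Working in integer sample counts rather than seconds avoids off-by-one drift when the seam is later spliced back between the Song A prefix and Song B suffix.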
1495
+ def generate_transition_artifacts(request: TransitionRequest) -> TransitionResult:
+     if not os.path.isfile(request.song_a_path):
+         raise FileNotFoundError(f"Song A not found: {request.song_a_path}")
+     if not os.path.isfile(request.song_b_path):
+         raise FileNotFoundError(f"Song B not found: {request.song_b_path}")
+ 
+     transition_path, stitched_path, rough_stitched_path, hard_splice_path, rough_src_path = _resolve_output_paths(request)
+ 
+     LOGGER.info("Transition request args: %s", json.dumps(request.to_log_dict(), sort_keys=True))
+     rough = _prepare_rough_transition(request)
+     rough_stitched_audio = normalize_peak(
+         apply_edge_fades(rough["rough_stitched"].astype(np.float32), rough["target_sr"], fade_ms=25.0),
+         peak=0.98,
+     )
+     write_wav(rough_stitched_path, rough_stitched_audio, rough["target_sr"])
+     hard_splice_audio = np.concatenate([rough["song_a_prefix"], rough["song_b_suffix_substitute"]]).astype(np.float32)
+     hard_splice_audio = normalize_peak(hard_splice_audio, peak=0.98)
+     write_wav(hard_splice_path, hard_splice_audio, rough["target_sr"])
+ 
+     transition_audio = rough["rough_seam"]
+     repaint_context_audio = rough["rough_stitched"]
+     try:
+         transition_audio, repaint_context_audio = _run_acestep_repaint(request, rough, rough_src_path)
+     except Exception as exc:
+         raise RuntimeError(f"ACE-Step repaint failed. Please verify ACE-Step runtime and model setup. {exc}") from exc
+ 
+     backend_used = "acestep-repaint"
+ 
+     transition_audio, post_repaint_stem_debug = _post_repaint_stem_correction(
+         transition_audio.astype(np.float32),
+         sr=int(rough["target_sr"]),
+         anchor_a=rough.get("period_a_stem_bundle"),
+         anchor_b=rough.get("period_b_stem_bundle"),
+     )
+ 
+     transition_audio, transition_low_profile_debug = _apply_transition_low_duck(
+         transition_audio.astype(np.float32),
+         sr=int(rough["target_sr"]),
+     )
+ 
+     stitched_audio, boundary_mix_debug = _assemble_substitute_mix(
+         song_a_prefix=rough["song_a_prefix"],
+         transition=transition_audio,
+         song_b_suffix=rough["song_b_suffix_substitute"],
+         boundary_fade_n=int(rough.get("boundary_fade_n", 0)),
+         sr=int(rough["target_sr"]),
+     )
+ 
+     transition_audio = normalize_peak(apply_edge_fades(transition_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)
+     stitched_audio = normalize_peak(apply_edge_fades(stitched_audio, rough["target_sr"], fade_ms=25.0), peak=0.98)
+ 
+     write_wav(transition_path, transition_audio, rough["target_sr"])
+     write_wav(stitched_path, stitched_audio, rough["target_sr"])
+ 
+     theme_ref_path = (
+         rough_src_path.replace("_rough_src.wav", "_theme_ref.wav")
+         if rough_src_path.endswith("_rough_src.wav")
+         else f"{rough_src_path}.theme_ref.wav"
+     )
+     if not request.keep_debug_files:
+         for tmp_path in (rough_src_path, theme_ref_path):
+             if os.path.exists(tmp_path):
+                 try:
+                     os.remove(tmp_path)
+                 except Exception:
+                     pass
+ 
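`apply_edge_fades` and `normalize_peak` above come from the project's audio utilities. A rough sketch of what such helpers typically do (assumed behavior for illustration, not the module's exact implementation):

```python
import numpy as np

def apply_edge_fades(y: np.ndarray, sr: int, fade_ms: float = 25.0) -> np.ndarray:
    """Linear fade-in/out over the first and last fade_ms milliseconds."""
    y = y.astype(np.float32).copy()
    n = min(y.size // 2, int(sr * fade_ms / 1000.0))
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, dtype=np.float32)
        y[:n] *= ramp          # fade in
        y[-n:] *= ramp[::-1]   # fade out
    return y

def normalize_peak(y: np.ndarray, peak: float = 0.98) -> np.ndarray:
    """Scale so the absolute peak hits `peak`; no-op for silence."""
    m = float(np.max(np.abs(y))) if y.size else 0.0
    return y if m == 0.0 else (y * (peak / m)).astype(np.float32)
```

Fading edges before peak normalization is the usual order: it prevents a click at the clip boundary from dominating the peak measurement.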
1562
+     details = {
+         "backend_used": backend_used,
+         "generation_args": request.to_log_dict(),
+         "lora": _load_acestep_runtime(request).get("lora_debug", {"requested": False}),
+         "bpm": {
+             "song_a": round(float(rough["bpm_a"]), 3),
+             "song_b": round(float(rough["bpm_b"]), 3),
+             "song_b_for_alignment": round(float(rough["bpm_b_for_alignment"]), 3),
+             "stretch_rate": round(float(rough["stretch_rate"]), 5),
+             "stretch_rate_raw": round(float(rough["stretch_rate_raw"]), 5),
+             "bpm_target_override": request.bpm_target,
+         },
+         "cue_points_sec": {
+             "song_a": round(float(rough["cue_a_sec"]), 3),
+             "song_b": round(float(rough["cue_b_sec"]), 3),
+             "song_b_selected": round(float(rough["cue_b_selected_sec"]), 3),
+             "selector_method": rough.get("cue_selector_method"),
+         },
+         "cue_selector": rough.get("cue_selector_debug"),
+         "bpm_phase_alignment": rough.get("b_alignment_debug"),
+         "phrase_lock": rough.get("phrase_lock_debug"),
+         "rough_mix": rough.get("rough_mix_debug"),
+         "reference_audio": rough.get("reference_audio_debug"),
+         "demucs_transition": rough.get("demucs_transition_debug"),
+         "stitching": rough.get("stitching_debug"),
+         "boundary_mix": boundary_mix_debug,
+         "post_repaint_stem_correction": post_repaint_stem_debug,
+         "transition_low_profile": transition_low_profile_debug,
+         "transition_strategy": {
+             "name": "bar-defined-dual-base-repaint",
+             "base_mode": rough.get("transition_base_mode"),
+             "transition_bars": rough.get("transition_bars"),
+             "boundary_fade_sec": round(float(rough.get("boundary_fade_sec", 0.0)), 3),
+         },
+         "clip_shape_sec": {
+             "pre_context_sec_raw": round(float(rough["pre_sec_raw"]), 3),
+             "pre_context_sec": round(float(rough["pre_sec"]), 3),
+             "repaint_width_sec_ui_raw": round(float(rough.get("seam_sec_ui_raw", rough["seam_sec_raw"])), 3),
+             "repaint_width_sec_raw": round(float(rough["seam_sec_raw"]), 3),
+             "repaint_width_sec": round(float(rough["seam_sec"]), 3),
+             "post_context_sec_raw": round(float(rough["post_sec_raw"]), 3),
+             "post_context_sec": round(float(rough["post_sec"]), 3),
+             "analysis_sec": round(float(request.analysis_sec), 3),
+         },
+         "durations_sec": {
+             "song_a_total": rough["dur_a"],
+             "song_b_total": rough["dur_b"],
+             "analysis_start_a_sec": round(float(rough["analysis_start_a_sec"]), 3),
+             "repaint_context_preview": round(float(repaint_context_audio.size / max(1, rough["target_sr"])), 3),
+             "stitched_output": round(float(stitched_audio.size / max(1, rough["target_sr"])), 3),
+         },
+         "outputs": {
+             "transition_path": transition_path,
+             "stitched_path": stitched_path,
+             "rough_stitched_path": rough_stitched_path,
+             "hard_splice_path": hard_splice_path,
+         },
+     }
+     LOGGER.info("Transition result details: %s", json.dumps(details, sort_keys=True))
+ 
+     return TransitionResult(
+         transition_path=transition_path,
+         stitched_path=stitched_path,
+         rough_stitched_path=rough_stitched_path,
+         hard_splice_path=hard_splice_path,
+         backend_used=backend_used,
+         details=details,
+     )
+ 
+ 
1632
+ def _build_arg_parser() -> argparse.ArgumentParser:
+     parser = argparse.ArgumentParser(description="Deterministic DJ transition generation (Phase A/B).")
+     parser.add_argument("--song-a", required=True, help="Path to Song A audio file.")
+     parser.add_argument("--song-b", required=True, help="Path to Song B audio file.")
+     parser.add_argument("--plugin", default="Smooth Blend", choices=list(PLUGIN_PRESETS.keys()), help="Transition style plugin preset.")
+     parser.add_argument("--instruction", default="", help="Extra text instruction for generation.")
+     parser.add_argument("--pre-sec", type=float, default=6.0, help="Seconds before seam from Song A.")
+     parser.add_argument("--repaint-sec", type=float, default=4.0, help="Repaint seam width in seconds.")
+     parser.add_argument("--post-sec", type=float, default=6.0, help="Seconds after seam from Song B.")
+     parser.add_argument("--analysis-sec", type=float, default=45.0, help="Analysis window in seconds.")
+     parser.add_argument("--bpm-target", type=float, default=None, help="Optional BPM override target for Song A.")
+     parser.add_argument("--cue-a-sec", type=float, default=None, help="Optional Song A cue override.")
+     parser.add_argument("--cue-b-sec", type=float, default=None, help="Optional Song B cue override.")
+     parser.add_argument(
+         "--transition-bars",
+         type=int,
+         default=8,
+         choices=[4, 8, 16],
+         help="Transition period length in bars around cue points.",
+     )
+     parser.add_argument("--creativity", type=float, default=7.0, help="ACE-Step guidance strength.")
+     parser.add_argument("--inference-steps", type=int, default=8, help="ACE-Step inference steps.")
+     parser.add_argument("--seed", type=int, default=42, help="Seed for reproducibility.")
+     parser.add_argument("--output-dir", default="outputs", help="Directory for output artifacts.")
+     parser.add_argument("--output-stem", default=None, help="Optional fixed output stem.")
+     parser.add_argument("--target-sr", type=int, default=DEFAULT_TARGET_SR, help="Output sample rate.")
+     parser.add_argument("--keep-debug-files", action="store_true", help="Keep temporary rough source audio files.")
+     return parser
+ 
+ 
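Given the parser above, only `--song-a` and `--song-b` are required; everything else falls back to the documented defaults, with hyphenated flags mapped to underscored attribute names by argparse. A trimmed reproduction of a few of those flags shows how defaults resolve:

```python
import argparse

# Subset of the flags defined in _build_arg_parser (same types and defaults).
parser = argparse.ArgumentParser(description="Deterministic DJ transition generation (Phase A/B).")
parser.add_argument("--song-a", required=True)
parser.add_argument("--song-b", required=True)
parser.add_argument("--transition-bars", type=int, default=8, choices=[4, 8, 16])
parser.add_argument("--creativity", type=float, default=7.0)
parser.add_argument("--seed", type=int, default=42)

# Passing only the required paths; args.transition_bars etc. take defaults.
args = parser.parse_args(["--song-a", "a.wav", "--song-b", "b.wav"])
```

Restricting `--transition-bars` to `choices=[4, 8, 16]` makes argparse reject off-grid values before any audio analysis runs.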
1662
+ def main() -> None:
+     logging.basicConfig(level=logging.INFO, format="%(asctime)s | %(levelname)s | %(name)s | %(message)s")
+     parser = _build_arg_parser()
+     args = parser.parse_args()
+ 
+     req = TransitionRequest(
+         song_a_path=args.song_a,
+         song_b_path=args.song_b,
+         plugin_id=args.plugin,
+         instruction_text=args.instruction,
+         pre_context_sec=args.pre_sec,
+         repaint_width_sec=args.repaint_sec,
+         post_context_sec=args.post_sec,
+         analysis_sec=args.analysis_sec,
+         bpm_target=args.bpm_target,
+         cue_a_sec=args.cue_a_sec,
+         cue_b_sec=args.cue_b_sec,
+         transition_bars=args.transition_bars,
+         creativity_strength=args.creativity,
+         inference_steps=args.inference_steps,
+         seed=args.seed,
+         output_dir=args.output_dir,
+         output_stem=args.output_stem,
+         target_sr=args.target_sr,
+         keep_debug_files=args.keep_debug_files,
+     )
+ 
+     result = generate_transition_artifacts(req)
+     print(json.dumps(result.to_dict(), indent=2))
+ 
+ 
+ if __name__ == "__main__":
+     main()
requirements.txt ADDED
@@ -0,0 +1,14 @@
+ gradio
+ spaces
+ torch
+ transformers
+ accelerate
+ librosa
+ soundfile
+ numpy
+ scipy
+ # Demucs enables stem-aware cue selection and transition refinement.
+ demucs
+ 
+ # Optional ACE-Step backend (heavy; keep optional so MusicGen path still works):
+ git+https://github.com/ACE-Step/ACE-Step-1.5.git
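The comments above treat demucs and ACE-Step as enhancements rather than hard requirements. A guarded-import sketch matching that intent (pattern only; `acestep` is a hypothetical module name, and the app's actual fallback logic lives in the pipeline code):

```python
# Probe optional backends at startup and degrade gracefully instead of crashing.
try:
    import demucs  # noqa: F401
    HAS_DEMUCS = True
except ImportError:
    HAS_DEMUCS = False

try:
    import acestep  # noqa: F401  # hypothetical import name for the ACE-Step package
    HAS_ACESTEP = True
except ImportError:
    HAS_ACESTEP = False

# Downstream code can then branch: stem-aware cues if HAS_DEMUCS,
# repaint backend if HAS_ACESTEP, plain crossfade otherwise.
```

On a Hugging Face Space this keeps the Gradio UI alive even when the heavy git dependency fails to build.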