Spaces: muoten/aichael-jackson

muoten Claude Opus 4.7 (1M context) committed on May 6

Commit

80ff7e5

0 Parent(s):

Initial commit + Milestone 7: MJ-derived synth from real Thriller chorus

Milestone 7 is the first synth where lyrics, melody and timing are all
extracted from MJ's actual Thriller chorus (vocals[120-136s] of the full
track) rather than ear-edited from SoulX's en_target example. Prior
versions (v1-v6) drifted into B minor because cumulative by-ear pitch
edits from a foreign starting key settled a whole step below MJ's C#
minor. This commit fixes the source of truth.

Pipeline (run_preproc_with_whisper.py):
vocals[120-136s] -> Demucs (already separated)
-> whisper-large w/ initial_prompt -> lyrics + word timing
-> ROSVOT -> note transcription
-> SoulX preproc -> metadata.json (data/mj_chorus_metadata.json)
-> SoulX inference -> examples/milestone7_mj_notes/generated.wav

Also includes the working "broken" recipe scripts (swap_word.py,
split_word.py, scripts/sing.sh) that established the SingerTranslator
pattern in earlier sessions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
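
Note on the key drift described above: the "whole step below" claim can be checked with plain MIDI arithmetic. This is a minimal sketch, assuming the common conventions that MIDI note 60 is C4 and note 69 is A4 at 440 Hz. The helper names are illustrative and are not part of this repo.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def midi_to_name(note: int) -> str:
    """Name a MIDI note number, assuming the C4 = 60 convention."""
    return f"{NOTE_NAMES[note % 12]}{note // 12 - 1}"

def midi_to_hz(note: float) -> float:
    """Equal-tempered frequency, assuming A4 (MIDI 69) = 440 Hz."""
    return 440.0 * 2.0 ** ((note - 69) / 12)

def hz_to_midi(hz: float) -> float:
    """Inverse of midi_to_hz; returns a fractional note number."""
    return 69 + 12 * math.log2(hz / 440.0)

mj_note = 73                 # highest note_pitch in data/mj_chorus_metadata.json
drifted_note = mj_note - 2   # a whole step is two semitones

print(midi_to_name(mj_note), midi_to_name(drifted_note))
print(round(midi_to_hz(mj_note), 1))
# 551.9 Hz is one of the f0 frames in the same metadata file
print(round(hz_to_midi(551.9)))
```

The last line maps a frame-level f0 value back to the nearest note number, which is a quick way to sanity-check that `note_pitch` and `f0` agree on the key.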

Files changed (9)

.gitattributes +1 -0
.gitignore +5 -0
README.md +160 -0
data/mj_chorus_metadata.json +16 -0
examples/milestone7_mj_notes/generated.wav +3 -0
run_preproc_with_whisper.py +83 -0
scripts/sing.sh +44 -0
split_word.py +105 -0
swap_word.py +98 -0

.gitattributes ADDED

	@@ -0,0 +1 @@


+*.wav filter=lfs diff=lfs merge=lfs -text

.gitignore ADDED

	@@ -0,0 +1,5 @@

+__pycache__/
+*.pyc
+.DS_Store
+examples/*.wav
+examples/*/generated.wav

README.md ADDED

	@@ -0,0 +1,160 @@

+# SingerTranslator
+Translate a *score* (lyrics + melody + voice prompt) into a *sung performance*.
+Music generators invent the song. **SingerTranslator renders one you specify.**
+You pick the lyrics, you pick the melody (MIDI/F0), you pick the voice; the
+model produces the singing audio.
+This is the user-controlled-composition workflow on top of
+[SoulX-Singer](https://github.com/Soul-AILab/SoulX-Singer) running locally.
+It includes the helpers and recipes we found necessary to make English work.
+## Why this exists
+ACE-Step v1.5 cover-gen, Voicify (Demucs+RVC+ACE-Step), and SoulX SVC mode
+all failed to deliver "custom lyrics on a custom melody in a chosen voice".
+All three either had no F0 channel or locked you into the target's
+lyrics. SoulX **SVS** mode does have F0/MIDI input, but only when used
+locally with hand-built metadata. That's what this repo wraps.
+Validated 2026-05-05: produced clean English singing of the lyric "Who says
+you're not broken" on a chosen melody with hard-K plosive. First time the
+pipeline clicked end-to-end.
+## Layout
+```
+.
+├── README.md           — this
+├── swap_word.py        — replace word X with word Y in metadata
+│                         (auto regenerates phoneme via g2p_en)
+│                         supports --phoneme override + --duration_boost
+├── split_word.py       — replace one word slot with N consecutive slots
+│                         (used to test mid-word splits; PROVED WORSE)
+├── scripts/
+│   └── sing.sh         — wrapper around SoulX-Singer SVS inference
+└── examples/           — outputs we want to keep around
+```
+## Prerequisites
+You need SoulX-Singer locally with weights downloaded. See
+`project_english_singing_synthesis.md` in the auto-memory for the full install
+path. Briefly:
+```bash
+cd ~/claude-code
+git clone https://github.com/Soul-AILab/SoulX-Singer.git
+cd SoulX-Singer
+/Users/milhouse/.pyenv/versions/3.10.16/bin/python3.10 -m venv venv
+venv/bin/pip install -r requirements.txt
+mkdir pretrained_models && venv/bin/hf download Soul-AILab/SoulX-Singer \
+    --local-dir pretrained_models/SoulX-Singer
+venv/bin/python -c "import nltk; nltk.download('averaged_perceptron_tagger_eng'); nltk.download('cmudict')"
+```
+CPU-only on Mac (MPS broken in the vocoder, see memory notes). ~5x realtime.
+## Workflow
+```
+[ source metadata.json ]
+        |
+        | swap_word.py / split_word.py     (edit lyrics, phonemes, durations)
+        |
+        v
+[ edited metadata.json ]
+        |
+        | scripts/sing.sh                  (run SoulX SVS inference)
+        |
+        v
+[ generated.wav ]                          a sung performance
+```
+## Quick start: word swap example
+Take SoulX's shipped `en_target.json` (the "Who says you're not pretty" song),
+swap "pretty" → "broken" with the hard-K recipe, and synthesize:
+```bash
+SOULX=~/claude-code/SoulX-Singer
+PY=$SOULX/venv/bin/python
+# 1. Edit the metadata (apply the triple-K plosive recipe + duration boost)
+$PY swap_word.py \
+    --in $SOULX/example/audio/en_target.json \
+    --out /tmp/en_broken.json \
+    --old pretty --new broken \
+    --phoneme 'en_B-R-OW1-K-K-K-AH0-N' \
+    --duration_boost 0.20
+# 2. Synthesize
+scripts/sing.sh \
+    $SOULX/example/audio/en_prompt.mp3 \
+    $SOULX/example/audio/en_prompt.json \
+    /tmp/en_broken.json \
+    /tmp/sung_broken
+# 3. Listen
+afplay /tmp/sung_broken/generated.wav
+```
+## Recipes
+### English plosives are weak — use the triple-phone trick
+The model has only 70 English phonemes vs ~2700 Chinese. English K/P/T
+articulation is poor by default. Workaround: triple the plosive in the
+phoneme override, with a moderate duration boost.
+| Word | Phoneme override | Status |
+|---|---|---|
+| broken | `en_B-R-OW1-K-K-K-AH0-N` | verified 2026-05-05, produces hard K |
+| pretty | `en_P-P-P-R-IH1-T-T-T-IY0` | untested, but follows the pattern |
+Validated trade-off (2026-05-05):
+- Fewer than 3 K-phones: K is dropped or sounds soft
+- More than 3 K-phones (4K, 5K): per-phone time falls below ~50ms, K
+  collapses to a vowel transition
+- Boost much beyond +0.20s: K → G voicing leak (closure gets filled with
+  vocal-fold vibration from neighboring vowels — Chinese-prior unaspirated
+  stops dominate)
+The sweet spot is **3 K-phones at ~56ms each** in a slot ~0.45s long.
+### Sonorants are fine
+Words like "lovely", "morning", "shining" come out clean without any tricks.
+Lyric-engineer toward sonorants when you can.
+### Slot-splitting is worse
+`split_word.py` tested mid-word splitting like "broken" → "brok" + "ken"
+(2 slots with K at the slot boundary). The hypothesis was that `<EOW>`/`<BOW>`
+markers would force harder articulation. **It didn't work**: each piece
+got too little time, and the model isn't trained on mid-word slot splits. The
+script is kept for future experimentation, but the single-slot triple-phone
+recipe is the one that has been validated.
+## Status (2026-05-05)
+- ✅ Local install verified
+- ✅ Lyric swap + phoneme override + duration boost working (`swap_word.py`)
+- ✅ "broken benchmark" achieved with 3K + boost20 recipe
+- ⏳ User's voice prompt (Shana clip) — needs prompt_metadata generation
+  via SoulX preprocess pipeline (extra model downloads)
+- ⏳ User's actual Thriller MIDI integration — currently using
+  en_target's melody as the surrogate
+- ⏳ Generalize plosive recipe to P, T (untested)
+## What it isn't
+- **Not a music generator** — doesn't invent songs from prompts
+- **Not a karaoke maker** — doesn't separate voices from existing recordings
+- **Not a voice cloner alone** — that's what RVC and SoulX SVC do; this
+  controls more (lyrics + melody + voice, not just voice)
+It's the rendering step of a composition pipeline. You bring the score,
+SingerTranslator brings the singer.
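
Note on the triple-phone recipe in the README above: its timing figures line up if the slot is divided evenly among the phones (0.45 s over 8 phones is about 56 ms). The sketch below works through that arithmetic. The even-split model is an assumption made here for illustration, since the README does not say how SoulX actually allocates time inside a slot, and both helper names are hypothetical.

```python
def repeat_plosive(phoneme: str, plosives=("K", "P", "T"), times: int = 3) -> str:
    """Repeat each listed plosive `times` times inside an 'en_A-B-C' token."""
    prefix, _, body = phoneme.partition("_")
    phones = []
    for phone in body.split("-"):
        phones.extend([phone] * (times if phone in plosives else 1))
    return prefix + "_" + "-".join(phones)

def per_phone_ms(phoneme: str, slot_seconds: float) -> float:
    """Milliseconds per phone, ASSUMING the slot is split evenly (illustrative)."""
    n_phones = len(phoneme.partition("_")[2].split("-"))
    return 1000.0 * slot_seconds / n_phones

# Compare 3, 4 and 5 K-phones in a 0.45 s slot.
for k_count in (3, 4, 5):
    token = repeat_plosive("en_B-R-OW1-K-AH0-N", plosives=("K",), times=k_count)
    print(k_count, token, round(per_phone_ms(token, 0.45), 2))

# The same helper reproduces the untested "pretty" override from the table.
print(repeat_plosive("en_P-R-IH1-T-IY0"))
```

Under this model, each extra K shortens every phone in the slot, which is one plausible reading of why 4K and 5K collapse.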

data/mj_chorus_metadata.json ADDED

	@@ -0,0 +1,16 @@

+[
+  {
+    "index": "vocal_0_16000",
+    "language": "English",
+    "time": [
+      0,
+      16000
+    ],
+    "duration": "0.14 0.40 0.24 1.66 0.50 0.68 0.38 0.26 0.36 0.20 0.46 0.26 0.30 0.20 0.28 0.30 0.64 0.28 0.66 0.45 0.21 1.86 0.38 0.64 0.44 0.40 0.40 0.18 0.34 0.36 0.40 0.64 0.78 0.31",
+    "text": "this is is thriller thriller thriller night and no one's gonna save you from the beast of outstrike this is is thriller thriller thriller night you're fighting for your life inside a killer this",
+    "phoneme": "en_DH-IH1-S en_IH1-Z en_IH1-Z en_TH-R-IH1-L-ER0 en_TH-R-IH1-L-ER0 en_TH-R-IH1-L-ER0 en_N-AY1-T en_AH0-N-D en_N-OW1 en_W-AH1-N-Z en_G-AA1-N-AH0 en_S-EY1-V en_Y-UW1 en_F-R-AH1-M en_DH-AH0 en_B-IY1-S-T en_AH1-V en_AW0-T-S-T-R-IH1-K en_DH-IH1-S en_IH1-Z en_IH1-Z en_TH-R-IH1-L-ER0 en_TH-R-IH1-L-ER0 en_TH-R-IH1-L-ER0 en_N-AY1-T en_Y-UH1-R en_F-AY1-T-IH0-NG en_F-AO1-R en_Y-AO1-R en_L-AY1-F en_IH0-N-S-AY1-D en_AH0 en_K-IH1-L-ER0 en_DH-IH1-S",
+    "note_pitch": "0 73 71 71 70 67 0 68 68 67 65 64 0 64 67 68 66 66 71 73 71 71 70 68 68 68 66 66 64 61 65 64 67 0",
+    "note_type": "2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 3 2 2 2 2 2 2 2 2 2 3",
+    "f0": "413.3 428.3 429.5 389.7 354.7 0.0 0.0 0.0 0.0 0.0 508.9 493.8 469.1 475.2 504.0 526.3 538.8 545.8 549.9 551.9 551.8 551.2 552.4 555.6 557.0 552.2 537.0 518.7 507.3 495.7 490.6 488.9 488.3 487.5 488.4 491.1 495.3 497.9 497.8 495.9 494.7 495.3 497.4 498.5 498.0 496.6 495.8 497.4 500.3 502.9 502.7 499.1 497.0 497.0 500.6 503.9 504.2 500.6 497.0 496.7 497.5 497.9 496.3 492.9 489.6 488.8 489.6 487.8 481.1 472.6 0.0 426.5 407.0 393.6 378.6 354.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 335.8 337.7 0.0 0.0 0.0 0.0 452.0 454.0 435.6 425.7 438.5 461.8 476.8 485.3 489.5 495.8 500.6 504.1 500.5 491.4 481.4 477.2 475.4 473.1 469.6 465.4 462.2 461.3 464.6 471.7 476.8 480.0 474.5 465.4 464.7 466.1 467.6 469.6 469.4 466.0 463.1 452.3 401.4 373.8 365.9 380.5 392.0 386.8 382.6 390.6 404.1 408.6 402.1 396.3 392.6 393.3 390.3 377.5 370.8 372.1 387.5 410.4 415.1 414.3 411.8 412.8 414.4 411.0 387.9 367.5 357.7 358.2 369.7 374.3 372.8 360.1 318.6 319.7 330.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 392.1 392.2 392.9 390.1 387.7 0.0 0.0 0.0 0.0 0.0 343.3 417.7 423.8 424.2 419.3 410.2 405.4 406.5 416.3 423.0 425.9 425.7 429.7 424.2 421.8 422.2 417.6 413.9 412.3 411.4 418.8 425.0 425.3 416.9 388.1 372.2 364.3 367.8 372.0 369.8 358.4 344.4 311.2 0.0 0.0 0.0 0.0 395.3 396.0 393.4 382.7 371.9 369.7 367.4 366.4 370.3 374.0 356.9 341.6 313.8 299.6 0.0 0.0 0.0 0.0 0.0 323.4 349.1 344.7 322.6 314.7 318.3 332.4 339.3 334.9 317.7 313.6 322.0 329.3 332.2 329.0 317.2 0.0 0.0 311.3 298.9 298.2 295.3 282.7 274.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 354.5 348.4 321.4 308.3 316.8 328.9 327.9 325.3 324.8 321.3 322.5 334.7 355.3 357.8 355.7 340.3 328.2 319.2 364.5 383.1 387.4 379.2 374.9 385.6 404.3 417.5 418.7 414.7 408.9 410.3 414.1 417.4 416.8 406.8 375.0 0.0 0.0 0.0 0.0 399.3 391.7 374.5 367.4 363.5 371.2 375.5 374.9 366.3 0.0 0.0 0.0 390.9 391.3 393.5 366.8 365.5 368.4 372.3 382.3 377.9 369.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 387.7 399.7 
396.4 382.6 361.1 360.5 370.4 382.7 380.6 365.6 354.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 447.9 453.2 451.7 440.2 428.4 415.7 396.2 397.4 403.3 410.1 416.4 427.5 470.2 506.4 499.8 504.5 504.8 498.2 491.6 487.5 488.3 492.3 492.2 479.2 452.8 413.8 410.7 414.7 415.8 419.9 422.8 409.3 388.9 388.9 0.0 0.0 0.0 0.0 495.0 509.7 506.6 484.9 487.7 510.4 537.0 555.8 562.7 563.9 563.0 559.4 553.1 552.4 557.8 563.1 557.1 532.0 524.0 515.5 501.1 496.4 498.4 498.3 495.2 494.6 497.2 500.4 502.3 502.4 500.1 497.1 496.7 498.3 501.1 500.7 497.3 496.6 499.6 504.1 504.0 500.4 498.0 496.1 495.1 497.0 499.4 499.1 495.9 491.7 491.1 493.5 495.0 494.2 492.0 489.3 490.4 493.5 490.4 479.8 0.0 0.0 0.0 0.0 357.6 350.2 331.5 283.4 270.1 233.5 222.5 218.5 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 428.6 446.8 441.6 437.7 452.3 473.7 488.2 492.3 494.3 496.2 493.7 484.7 465.4 440.7 434.2 445.6 458.6 460.4 456.0 452.8 460.2 476.0 482.2 482.1 476.1 460.8 443.1 418.7 421.9 439.8 454.5 464.6 469.2 469.1 465.4 462.1 457.7 437.3 376.2 377.4 415.8 427.3 423.2 414.4 411.4 417.9 425.3 423.3 417.1 403.6 397.7 391.6 371.9 361.7 364.5 382.6 407.7 415.3 416.3 416.4 417.8 420.6 413.5 388.9 368.2 363.8 374.3 380.5 373.4 357.6 336.1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 439.6 430.4 416.6 417.8 415.7 402.3 396.8 0.0 0.0 0.0 0.0 0.0 0.0 445.1 452.5 448.6 436.9 426.8 423.3 421.0 421.2 422.2 422.0 421.3 418.0 402.8 374.8 364.0 363.4 364.5 364.2 362.5 365.3 369.3 363.4 358.3 0.0 0.0 0.0 412.9 407.2 399.1 384.1 376.5 371.5 369.7 370.4 371.5 370.9 365.8 353.7 330.0 327.7 330.8 333.4 329.4 314.5 318.7 333.7 339.0 331.4 323.0 314.6 313.2 323.6 341.7 344.9 335.5 324.7 325.0 337.5 352.1 351.9 337.9 319.1 307.3 0.0 291.4 290.9 286.4 276.3 268.2 273.5 275.8 269.5 259.9 259.7 0.0 0.0 0.0 361.5 360.6 351.1 319.5 320.5 338.3 354.1 349.4 331.9 321.0 329.7 347.8 358.4 371.1 374.0 371.4 370.0 370.2 370.3 371.4 368.4 265.7 
258.4 254.3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 329.1 330.5 337.1 339.8 333.1 324.3 322.6 319.7 321.2 328.5 331.9 328.1 301.8 262.0 258.3 266.1 277.7 278.7 275.5 274.2 276.4 281.0 280.8 271.6 247.8 237.8 0.0 0.0 0.0 274.2 275.4 275.5 276.3 273.6 318.4 309.7 0.0 0.0 397.2 404.8 405.4 392.2 381.1 382.8 382.2 381.7 384.4 388.6 385.9 372.9 343.8 318.9 320.8 329.8 337.5 333.3 321.2 319.5 324.4 327.7 319.4 287.2 257.9 0.0 0.0 0.0"
+  }
+]
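
Note on the metadata layout above: `text`, `phoneme`, `duration`, `note_pitch` and `note_type` are parallel space-separated lists with one entry per slot, and the durations add up to roughly the clip length. A small consistency check can catch hand-editing mistakes before inference. This is a sketch with a hypothetical helper name. It assumes `time` is given in milliseconds, which fits the 16 s clip and the `vocal_0_16000` index but is not documented here.

```python
def check_segment(seg: dict, tol: float = 0.05) -> list[str]:
    """Return a list of problems found in one metadata segment (empty if none)."""
    problems = []
    fields = ("text", "phoneme", "duration", "note_pitch", "note_type")
    counts = {name: len(seg[name].split()) for name in fields}
    if len(set(counts.values())) != 1:
        problems.append(f"slot counts differ: {counts}")
    total = sum(float(d) for d in seg["duration"].split())
    start_ms, end_ms = seg["time"]
    span = (end_ms - start_ms) / 1000.0  # ASSUMPTION: "time" is in milliseconds
    if abs(total - span) > tol:
        problems.append(f"durations sum to {total:.2f}s but span is {span:.2f}s")
    return problems

# Toy segment in the same shape as the real file (values are made up).
toy = {
    "time": [0, 1000],
    "text": "this is night",
    "phoneme": "en_DH-IH1-S en_IH1-Z en_N-AY1-T",
    "duration": "0.40 0.35 0.25",
    "note_pitch": "73 71 68",
    "note_type": "2 2 2",
}
print(check_segment(toy))

# Dropping one duration triggers both checks.
broken = dict(toy, duration="0.40 0.35")
print(check_segment(broken))
```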

examples/milestone7_mj_notes/generated.wav ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:d9766fcaab0f6fbb4e5f80c16717f4ecc9c42b797ef1006e6bef47f9a725d085
+size 768044

run_preproc_with_whisper.py ADDED

	@@ -0,0 +1,83 @@

+"""Run SoulX-Singer preprocess pipeline with whisper as the English ASR
+(bypassing the NeMo dep that's broken on Mac/torch-2.2)."""
+import argparse
+import os
+import re
+import sys
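+# SoulX-Singer checkout; override with SOULX_ROOT (same default as scripts/sing.sh)
+sys.path.insert(0, os.environ.get('SOULX_ROOT', '/Users/milhouse/claude-code/SoulX-Singer'))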
+# IMPORTANT: monkey-patch the English ASR class BEFORE pipeline imports it
+import preprocess.tools.lyric_transcription as lt
+import numpy as np
+def _clean_word(word: str) -> str:
+    return re.sub(r"[\?\.,:]", "", word).strip()
+INITIAL_PROMPT = os.environ.get('WHISPER_INITIAL_PROMPT', '')
+class WhisperASREn:
+    """Drop-in replacement for SoulX's _ASREnModel using OpenAI whisper."""
+    def __init__(self, model_path: str, device: str):
+        import whisper
+        size = os.environ.get('WHISPER_MODEL', 'large')
+        print(f'[whisper] loading model={size}')
+        self.model = whisper.load_model(size)
+        self.device = device
+    def process(self, wav_fn: str):
+        kwargs = dict(language='en', word_timestamps=True)
+        if INITIAL_PROMPT:
+            kwargs['initial_prompt'] = INITIAL_PROMPT
+            print(f'[whisper] using initial_prompt of {len(INITIAL_PROMPT)} chars')
+        result = self.model.transcribe(wav_fn, **kwargs)
+        print(f'[whisper] text: {result.get("text","").strip()}')
+        raw_words = []
+        raw_timestamps = []
+        for seg in result.get('segments', []):
+            for w in seg.get('words', []):
+                word = _clean_word(str(w.get('word', '')))
+                if not word:
+                    continue
+                s = float(w.get('start', 0.0))
+                e = float(w.get('end', 0.0))
+                raw_words.append(word)
+                raw_timestamps.append([s, e])
+        words, durs = lt._build_words_with_gaps(raw_words, raw_timestamps, wav_fn)
+        f0_path = os.path.splitext(wav_fn)[0] + "_f0.npy"
+        if os.path.exists(f0_path):
+            words, durs = lt._word_dur_post_process(
+                words, durs, np.load(f0_path)
+            )
+        return words, durs
+# Patch the class reference in the module
+lt._ASREnModel = WhisperASREn
+# Now run the pipeline normally
+from preprocess.pipeline import main as pipeline_main
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser()
+    parser.add_argument('--audio_path', required=True)
+    parser.add_argument('--save_dir', required=True)
+    parser.add_argument('--language', default='English')
+    parser.add_argument('--device', default='cpu')
+    parser.add_argument('--vocal_sep', default='False')
+    parser.add_argument('--max_merge_duration', type=int, default=60000)
+    parser.add_argument('--midi_transcribe', default='True')
+    args = parser.parse_args()
+    # convert string bools to bools as the pipeline expects
+    args.vocal_sep = args.vocal_sep.lower() in ('true', '1', 'yes')
+    args.midi_transcribe = args.midi_transcribe.lower() in ('true', '1', 'yes')
+    pipeline_main(args)
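
Note on the patch ordering in the script above: the "patch BEFORE the pipeline imports it" comment matters whenever a consumer binds the class by name at import time. The self-contained demo below uses throwaway in-memory modules (`fake_lt`, `consumer_early`, `consumer_late`, all hypothetical) to show the effect. Whether SoulX's pipeline binds `_ASREnModel` this way is an assumption. Patching first is safe either way.

```python
import sys
import types

# A stand-in "library" module that exposes a class, like lyric_transcription.
lib = types.ModuleType("fake_lt")
exec("class ASR:\n    name = 'original'\n", lib.__dict__)
sys.modules["fake_lt"] = lib

class Replacement:
    name = "patched"

def import_consumer(modname: str) -> types.ModuleType:
    """Simulate a consumer module that runs `from fake_lt import ASR` on import."""
    mod = types.ModuleType(modname)
    exec("from fake_lt import ASR\n", mod.__dict__)
    return mod

early = import_consumer("consumer_early")  # imported BEFORE the patch
lib.ASR = Replacement                      # the monkey-patch
late = import_consumer("consumer_late")    # imported AFTER the patch

print(early.ASR.name)  # still bound to the original class
print(late.ASR.name)   # sees the replacement
```

The early consumer keeps its own reference to the original class, so a patch applied afterwards never reaches it.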

scripts/sing.sh ADDED

	@@ -0,0 +1,44 @@

+#!/usr/bin/env bash
+# Wrapper around SoulX-Singer SVS inference. Picks reasonable defaults for
+# the SingerTranslator workflow on Mac (CPU, no fp16, auto pitch shift).
+#
+# Usage:
+#   scripts/sing.sh <prompt_wav> <prompt_metadata> <target_metadata> <output_dir>
+#
+# Example:
+#   scripts/sing.sh \
+#       /path/to/SoulX-Singer/example/audio/en_prompt.mp3 \
+#       /path/to/SoulX-Singer/example/audio/en_prompt.json \
+#       /tmp/my_target.json \
+#       /tmp/sung_output
+set -euo pipefail
+if [[ $# -lt 4 ]]; then
+    echo "Usage: $0 <prompt_wav> <prompt_metadata.json> <target_metadata.json> <output_dir>" >&2
+    exit 2
+fi
+PROMPT_WAV="$1"
+PROMPT_META="$2"
+TARGET_META="$3"
+OUT_DIR="$4"
+SOULX_ROOT="${SOULX_ROOT:-/Users/milhouse/claude-code/SoulX-Singer}"
+PYBIN="$SOULX_ROOT/venv/bin/python"
+cd "$SOULX_ROOT"
+# IMPORTANT: --control score uses note_pitch + note_type from metadata.
+# Default 'melody' uses frame-level f0 instead, ignoring note_pitch edits.
+# For SingerTranslator surgery on melody, score is the right mode.
+PYTHONPATH=. "$PYBIN" -m cli.inference \
+    --device cpu \
+    --control score \
+    --model_path pretrained_models/SoulX-Singer/model.pt \
+    --config soulxsinger/config/soulxsinger.yaml \
+    --prompt_wav_path "$PROMPT_WAV" \
+    --prompt_metadata_path "$PROMPT_META" \
+    --target_metadata_path "$TARGET_META" \
+    --phoneset_path soulxsinger/utils/phoneme/phone_set.json \
+    --save_dir "$OUT_DIR" \
+    --auto_shift \
+    --pitch_shift 0
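
Note for callers who drive inference from Python instead of bash: the same command can be assembled as an argument list. This is a sketch only. `build_sing_command` is a hypothetical helper, every flag is copied from `scripts/sing.sh` above, and nothing is executed here.

```python
import os
import shlex

def build_sing_command(prompt_wav, prompt_meta, target_meta, out_dir,
                       control="score", soulx_root=None):
    """Build the argv list that scripts/sing.sh runs (flags copied from it)."""
    root = soulx_root or os.environ.get(
        "SOULX_ROOT", "/Users/milhouse/claude-code/SoulX-Singer")
    return [
        os.path.join(root, "venv", "bin", "python"), "-m", "cli.inference",
        "--device", "cpu",
        # 'score' reads note_pitch/note_type; 'melody' reads frame-level f0.
        "--control", control,
        "--model_path", "pretrained_models/SoulX-Singer/model.pt",
        "--config", "soulxsinger/config/soulxsinger.yaml",
        "--prompt_wav_path", prompt_wav,
        "--prompt_metadata_path", prompt_meta,
        "--target_metadata_path", target_meta,
        "--phoneset_path", "soulxsinger/utils/phoneme/phone_set.json",
        "--save_dir", out_dir,
        "--auto_shift",
        "--pitch_shift", "0",
    ]

cmd = build_sing_command("en_prompt.mp3", "en_prompt.json",
                         "/tmp/en_broken.json", "/tmp/sung_broken")
print(shlex.join(cmd))
```

To actually run it, pass `cmd` to `subprocess.run` with `cwd` set to the SoulX-Singer checkout and `PYTHONPATH=.` in the environment, mirroring what the shell wrapper does.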

split_word.py ADDED

	@@ -0,0 +1,105 @@

+"""Replace each occurrence of a word with N consecutive slots, each
+carrying its own phoneme, duration share, and a copy of the note_pitch.
+Goal: force the model to articulate at word boundaries by putting a
+plosive (K, P, T) at the end of slot A and the start of slot B.
+Example: "broken" → [("brok", "en_B-R-OW1-K"), ("ken", "en_K-AH0-N")]
+Usage:
+    python split_word.py --in IN.json --out OUT.json --old broken \
+        --pieces 'brok:en_B-R-OW1-K' 'ken:en_K-AH0-N' \
+        [--total-duration 0.50]
+"""
+import argparse
+import json
+def split_seg(seg: dict, old: str, pieces: list[tuple[str, str]],
+              total_duration: float | None = None) -> dict:
+    text = seg['text'].split()
+    phon = seg['phoneme'].split()
+    durs = [float(x) for x in seg['duration'].split()]
+    pitches = seg['note_pitch'].split()
+    types = seg['note_type'].split()
+    n = len(text)
+    assert len(phon) == n == len(durs) == len(pitches) == len(types)
+    new_text, new_phon, new_durs, new_pitches, new_types = [], [], [], [], []
+    n_split = 0
+    for i in range(n):
+        if text[i].lower() == old.lower():
+            # Determine the duration to redistribute. Either fixed (--total-duration)
+            # or original duration of this slot, possibly augmented by stealing from
+            # the *next* <SP> rest.
+            # Use an explicit None check so a --total-duration of 0.0 is not
+            # silently treated as "not given".
+            target_dur = durs[i] if total_duration is None else total_duration
+            if total_duration is not None and total_duration > durs[i]:
+                want = total_duration - durs[i]
+                if i + 1 < n and text[i + 1] == '<SP>' and durs[i + 1] > want + 0.05:
+                    durs[i + 1] -= want
+                    print(f"  slot {i}: stole {want:.2f}s from <SP> rest at {i+1}")
+                else:
+                    print(f"  WARN slot {i}: cannot reach {total_duration:.2f}s (no slack)")
+                    target_dur = durs[i]
+            per_piece = target_dur / len(pieces)
+            for j, (piece_text, piece_phon) in enumerate(pieces):
+                new_text.append(piece_text)
+                new_phon.append(piece_phon)
+                new_durs.append(per_piece)
+                new_pitches.append(pitches[i])  # same pitch for all pieces
+                # Articulation hints: 1=onset on first piece, 3=end on last, 2=mid sustain
+                if len(pieces) == 1:
+                    new_types.append(types[i])
+                elif j == 0:
+                    new_types.append('1')
+                elif j == len(pieces) - 1:
+                    new_types.append('3')
+                else:
+                    new_types.append('2')
+            n_split += 1
+        else:
+            new_text.append(text[i])
+            new_phon.append(phon[i])
+            new_durs.append(durs[i])
+            new_pitches.append(pitches[i])
+            new_types.append(types[i])
+    print(f"  split {n_split} occurrence(s) of '{old}' into {len(pieces)} pieces")
+    print(f"  new slot count: {len(new_text)} (was {n})")
+    out = dict(seg)
+    out['text'] = ' '.join(new_text)
+    out['phoneme'] = ' '.join(new_phon)
+    out['duration'] = ' '.join(f"{d:.2f}" for d in new_durs)
+    out['note_pitch'] = ' '.join(new_pitches)
+    out['note_type'] = ' '.join(new_types)
+    return out
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--in', dest='inp', required=True)
+    ap.add_argument('--out', dest='out', required=True)
+    ap.add_argument('--old', required=True)
+    ap.add_argument('--pieces', nargs='+', required=True,
+                    help="Each piece as 'text:phoneme', e.g. 'brok:en_B-R-OW1-K'")
+    ap.add_argument('--total-duration', type=float, default=None,
+                    help="Force total slot duration (steals from next <SP> rest)")
+    args = ap.parse_args()
+    pieces = []
+    for spec in args.pieces:
+        text, _, phon = spec.partition(':')
+        pieces.append((text, phon))
+    with open(args.inp, encoding='utf-8') as f:
+        data = json.load(f)
+    edited = [split_seg(s, args.old, pieces, args.total_duration) for s in data]
+    with open(args.out, 'w', encoding='utf-8') as f:
+        json.dump(edited, f, ensure_ascii=False, indent=2)
+    print(f"\nWrote {args.out}")
+if __name__ == '__main__':
+    main()
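
Note on what a split does to one slot: the duration is shared equally and the `note_type` values are rewritten as onset, sustain and end. The sketch below isolates that rule from `split_seg` above as a standalone function with a hypothetical name. The 0.45 s input is illustrative.

```python
def split_slot(duration: float, note_type: str, n_pieces: int):
    """Return (durations, note_types) for one slot split into n_pieces.

    Mirrors split_seg: equal duration shares; note_type '1' on the first
    piece, '3' on the last, '2' in between; unchanged when n_pieces == 1.
    """
    per_piece = duration / n_pieces
    if n_pieces == 1:
        types = [note_type]
    else:
        types = ["1"] + ["2"] * (n_pieces - 2) + ["3"]
    return [per_piece] * n_pieces, types

# "broken" -> "brok" + "ken" in a 0.45 s slot: each piece gets 0.225 s.
print(split_slot(0.45, "2", 2))
print(split_slot(0.45, "2", 3))
```

Halving the slot leaves each piece with well under the time the single-slot recipe gives the whole word, which fits the README's finding that each piece got too little time.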

swap_word.py ADDED

	@@ -0,0 +1,98 @@

+"""Surgical lyric edit: take a SoulX-Singer metadata JSON, replace one
+word everywhere with another, regenerate that word's phoneme group via
+g2p_en, and save. All other fields (note_pitch, note_type, f0, time) are
+preserved exactly, and duration changes only when --duration_boost is given,
+so we test ONLY the lyric/phoneme change.
+Usage:
+    python swap_word.py --in <in.json> --out <out.json> \
+        --old pretty --new broken \
+        [--phoneme 'en_B-R-OW1-K-K-K-AH0-N'] [--duration_boost 0.20]
+"""
+import argparse
+import json
+from g2p_en import G2p
+g2p = G2p()
+def english_phoneme(word: str) -> str:
+    """Convert one English word to SoulX-Singer's phoneme token format,
+    e.g. 'pretty' -> 'en_P-R-IH1-T-IY0'."""
+    phones = [p for p in g2p(word) if p not in (' ',)]
+    return 'en_' + '-'.join(phones)
+def swap(seg: dict, old: str, new: str,
+         phoneme_override: str | None = None,
+         duration_boost: float = 0.0) -> dict:
+    text_tokens = seg['text'].split()
+    phon_tokens = seg['phoneme'].split()
+    if len(text_tokens) != len(phon_tokens):
+        raise ValueError(
+            f"text/phoneme token-count mismatch: "
+            f"{len(text_tokens)} vs {len(phon_tokens)}"
+        )
+    new_phon = phoneme_override or english_phoneme(new)
+    print(f"  '{new}' -> {new_phon}{' (override)' if phoneme_override else ''}")
+    duration_tokens = seg.get('duration', '').split()
+    has_dur = len(duration_tokens) == len(text_tokens)
+    n_swapped = 0
+    swapped_indices = []
+    for i, t in enumerate(text_tokens):
+        if t.lower() == old.lower():
+            text_tokens[i] = new
+            phon_tokens[i] = new_phon
+            swapped_indices.append(i)
+            n_swapped += 1
+    print(f"  swapped {n_swapped} occurrence(s) of '{old}' -> '{new}'")
+    if duration_boost and has_dur:
+        for i in swapped_indices:
+            old_d = float(duration_tokens[i])
+            # Steal from the next <SP> rest if available, else previous
+            steal_from = None
+            for j in (i + 1, i - 1):
+                if 0 <= j < len(text_tokens) and text_tokens[j] == '<SP>':
+                    rest_d = float(duration_tokens[j])
+                    if rest_d > duration_boost + 0.05:
+                        steal_from = j
+                        break
+            if steal_from is None:
+                print(f"  WARN no neighboring <SP> with enough slack at index {i}; skipping boost")
+                continue
+            duration_tokens[i] = f"{old_d + duration_boost:.2f}"
+            duration_tokens[steal_from] = f"{float(duration_tokens[steal_from]) - duration_boost:.2f}"
+            print(f"  index {i}: dur {old_d:.2f}s + {duration_boost:.2f}s "
+                  f"(stole from <SP> at index {steal_from})")
+    out = dict(seg)
+    out['text'] = ' '.join(text_tokens)
+    out['phoneme'] = ' '.join(phon_tokens)
+    if duration_boost and has_dur:
+        out['duration'] = ' '.join(duration_tokens)
+    return out
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument('--in', dest='inp', required=True)
+    ap.add_argument('--out', dest='out', required=True)
+    ap.add_argument('--old', required=True)
+    ap.add_argument('--new', required=True)
+    ap.add_argument('--phoneme', default=None,
+                    help="Override the auto-generated phoneme, e.g. 'en_B-R-OW1-K-K-K-AH0-N'")
+    ap.add_argument('--duration_boost', type=float, default=0.0,
+                    help='Add N seconds to swapped-word slot, stealing from a neighboring <SP> rest')
+    args = ap.parse_args()
+    with open(args.inp, encoding='utf-8') as f:
+        data = json.load(f)
+    edited = [swap(s, args.old, args.new,
+                   phoneme_override=args.phoneme,
+                   duration_boost=args.duration_boost) for s in data]
+    with open(args.out, 'w', encoding='utf-8') as f:
+        json.dump(edited, f, ensure_ascii=False, indent=2)
+    print(f"\nWrote {args.out}")
+if __name__ == '__main__':
+    main()
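
Note on the duration boost: the extra time is taken from an adjacent `<SP>` rest, so the segment's total length stays the same. The sketch below restates that rule from `swap` above as a standalone function. The name `boost_slot` and the sample durations are illustrative.

```python
def boost_slot(text, durs, i, boost, margin=0.05):
    """Lengthen slot i by `boost` seconds, taking the time from an adjacent <SP>.

    Returns a new duration list, or None when no neighboring rest has more
    than boost + margin seconds available (the same rule as swap_word.py).
    """
    for j in (i + 1, i - 1):  # prefer the following rest, then the preceding one
        if 0 <= j < len(text) and text[j] == "<SP>" and durs[j] > boost + margin:
            out = list(durs)
            out[i] += boost
            out[j] -= boost
            return out
    return None

text = ["who", "says", "<SP>", "broken", "<SP>"]
durs = [0.30, 0.30, 0.10, 0.25, 0.40]

boosted = boost_slot(text, durs, 3, 0.20)
print([f"{d:.2f}" for d in boosted])   # time moves from the rest to the word
print(boost_slot(text, durs, 3, 0.40))  # no rest has enough slack
```

The margin keeps at least a short rest in place, so a boost never consumes a pause entirely.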