Add CLAP reranking support (audio + text encoders)
- .gitattributes +1 -0
- README.md +31 -0
- clap_audio_encoder.onnx +3 -0
- clap_audio_encoder.onnx.data +3 -0
- clap_config.json +10 -0
- clap_text_encoder.onnx +3 -0
- clap_text_encoder.onnx.data +3 -0
- clap_tokenizer/merges.txt +0 -0
- clap_tokenizer/special_tokens_map.json +51 -0
- clap_tokenizer/tokenizer_config.json +57 -0
- clap_tokenizer/vocab.json +0 -0
- onnx_export/export_clap.py +331 -0
- onnx_inference.py +284 -35
- residual.wav +3 -0
.gitattributes
CHANGED

@@ -34,4 +34,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 *.data filter=lfs diff=lfs merge=lfs -text
+residual.wav filter=lfs diff=lfs merge=lfs -text
 test_audio.wav filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED

@@ -26,6 +26,10 @@ ONNX-converted models for [SAM-Audio](https://github.com/facebookresearch/sam-au
 | `tokenizer/` | SentencePiece tokenizer files (T5) | - |
 | `peaframe_tokenizer/` | ModernBERT tokenizer files (PEAFrame) | - |
 | `peaframe_config.json` | PEAFrame scaling parameters | - |
+| `clap_audio_encoder.onnx` | CLAP audio encoder (HTSAT-tiny) | ~118 MB |
+| `clap_text_encoder.onnx` | CLAP text encoder (RoBERTa-base) | ~481 MB |
+| `clap_tokenizer/` | RoBERTa tokenizer files (CLAP) | - |
+| `clap_config.json` | CLAP audio preprocessing parameters | - |
 
 ## Installation
 
@@ -84,6 +88,24 @@ python onnx_inference.py \
     --output separated.wav
 ```
 
+### CLAP Reranking
+Generate multiple candidates and select the best using CLAP audio-text similarity:
+```bash
+python onnx_inference.py \
+    --audio input.wav \
+    --text "person speaking" \
+    --rerank \
+    --num-candidates 4 \
+    --output separated.wav
+```
+
+Reranking generates multiple separation candidates with different random seeds and uses CLAP to score audio-text similarity, selecting the candidate that best matches the text description. This can improve quality at the cost of ~4x inference time.
+
+Options:
+- `--rerank` - Enable reranking mode
+- `--num-candidates N` - Number of candidates (default: 4)
+- `--rerank-seed SEED` - Random seed for reproducibility
+
 ### Visual Prompting with SAM3 Mask
 ```bash
 # First generate a mask with SAM3 (see generate_sam3_mask.py)
@@ -116,6 +138,11 @@ python onnx_inference.py \
   - Uses ModernBERT tokenizer
   - Processes audio in ~3.3s chunks with 50% overlap
   - Default threshold: 0.3
+- **CLAP**: Audio-text similarity model for candidate reranking
+  - Audio encoder: HTSAT-tiny
+  - Text encoder: RoBERTa-base
+  - Embedding dimension: 512
+  - Default candidates: 4
 
 ## Exporting Models
 
@@ -143,6 +170,9 @@ python -m onnx_export.export_vision --model facebook/sam-audio-small --output ./
 
 # PEAFrame Span Predictor
 python -m onnx_export.export_peaframe --output-dir ./onnx_models --verify
+
+# CLAP Reranking (audio + text encoders)
+python -m onnx_export.export_clap --output-dir ./onnx_models --verify
 ```
 
 ### FP16 Quantization (for large models)
@@ -170,6 +200,7 @@ The inference script automatically detects FP16 models and handles input convers
 | `export_t5.py` | T5 text encoder |
 | `export_vision.py` | Vision encoder (CLIP-based) |
 | `export_peaframe.py` | PEAFrame span predictor + tokenizer |
+| `export_clap.py` | CLAP audio + text encoders for reranking |
 | `standalone_config.py` | Config classes for standalone export |
 
 ## License
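For integrating the reranker outside `onnx_inference.py`, the scoring step described in the README boils down to a dot product between the L2-normalized embeddings produced by the two exported encoders. A minimal sketch, assuming `onnxruntime` and `transformers` are installed and using the file and tensor names added in this commit (the preprocessing mirrors `score_with_clap` in `onnx_inference.py`):

```python
import json
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Exported CLAP encoders, tokenizer, and preprocessing config from this commit
audio_sess = ort.InferenceSession("clap_audio_encoder.onnx", providers=["CPUExecutionProvider"])
text_sess = ort.InferenceSession("clap_text_encoder.onnx", providers=["CPUExecutionProvider"])
tokenizer = AutoTokenizer.from_pretrained("clap_tokenizer")
with open("clap_config.json") as f:
    config = json.load(f)

def clap_scores(candidates, text):
    """Return one audio-text similarity score per candidate waveform (48 kHz mono float32)."""
    max_len = config["max_audio_len"]  # 480000 samples = 10 s at 48 kHz
    tokens = tokenizer(text, return_tensors="np", padding=True, truncation=True, max_length=77)
    text_embed = text_sess.run(
        ["text_embed"],
        {"input_ids": tokens["input_ids"].astype(np.int64),
         "attention_mask": tokens["attention_mask"].astype(np.int64)},
    )[0]  # [1, 512], L2-normalized inside the exported graph

    scores = []
    for wav in candidates:
        # int16 round-trip to match the PyTorch preprocessing, then truncate or repeat-pad to 10 s
        wav = (wav * 32768.0).astype(np.int16).astype(np.float32) / 32768.0
        if len(wav) >= max_len:
            wav = wav[:max_len]
        else:
            wav = np.tile(wav, int(np.ceil(max_len / len(wav))))[:max_len]
        audio_embed = audio_sess.run(["audio_embed"], {"waveform": wav[None, :].astype(np.float32)})[0]
        scores.append((audio_embed @ text_embed.T).item())  # dot product of unit-norm embeddings
    return np.array(scores)
```

Because both encoders normalize their outputs, the dot product equals the cosine similarity; `--rerank` simply keeps the candidate with the largest score.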
clap_audio_encoder.onnx
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:46fb0e4d80e2e6403361e1245fa298da9f1530365743082217a4e69d4bb127c6
+size 1176682
clap_audio_encoder.onnx.data
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:49456668f90249bd4429441b8a65440750a17965d28448f8c72de69849a61f0f
+size 123731968
clap_config.json
ADDED

@@ -0,0 +1,10 @@
+{
+  "sample_rate": 48000,
+  "window_size": 1024,
+  "hop_size": 480,
+  "mel_bins": 64,
+  "fmin": 50,
+  "fmax": 14000,
+  "max_audio_len": 480000,
+  "embed_dim": 512
+}
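These values mirror the HTSAT-tiny audio settings used by the exporter. As a quick sanity check on the numbers (a throwaway snippet, not part of the repo): 480000 samples at 48 kHz is 10 s of audio, and a 480-sample hop gives 100 mel frames per second, i.e. on the order of 1000 frames per clip (the exact count depends on STFT padding).

```python
import json

with open("clap_config.json") as f:
    cfg = json.load(f)

seconds = cfg["max_audio_len"] / cfg["sample_rate"]       # 480000 / 48000 = 10.0
frames_per_sec = cfg["sample_rate"] / cfg["hop_size"]     # 48000 / 480 = 100.0
print(seconds, frames_per_sec, seconds * frames_per_sec)  # 10.0 100.0 1000.0
```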
clap_text_encoder.onnx
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:c700f9351d2a32cea5ebd0df0d8ce856f6436b9a54d70caf2d693ec79bb33373
+size 1600036
clap_text_encoder.onnx.data
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:542b5813e0fbfcb341d6db39c2b38118178cd4b8c5397fb80906bee14b1fe579
+size 503393280
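Both encoders are exported with external weight data: each small `.onnx` file holds the graph and references the large `.onnx.data` file next to it, so the two must stay together in the same directory. A quick check that a session resolves the external data and exposes the expected tensor names (a sketch assuming `onnxruntime` is installed):

```python
import onnxruntime as ort

# The weights live in clap_text_encoder.onnx.data, resolved relative to the .onnx file
sess = ort.InferenceSession("clap_text_encoder.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in sess.get_inputs()])   # expected: ['input_ids', 'attention_mask']
print([o.name for o in sess.get_outputs()])  # expected: ['text_embed']
```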
clap_tokenizer/merges.txt
ADDED

The diff for this file is too large to render.
clap_tokenizer/special_tokens_map.json
ADDED

@@ -0,0 +1,51 @@
+{
+  "bos_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "cls_token": {
+    "content": "<s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eos_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "mask_token": {
+    "content": "<mask>",
+    "lstrip": true,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "sep_token": {
+    "content": "</s>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": true,
+    "rstrip": false,
+    "single_word": false
+  }
+}
clap_tokenizer/tokenizer_config.json
ADDED

@@ -0,0 +1,57 @@
+{
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "0": {
+      "content": "<s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "1": {
+      "content": "<pad>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "2": {
+      "content": "</s>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "3": {
+      "content": "<unk>",
+      "lstrip": false,
+      "normalized": true,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "50264": {
+      "content": "<mask>",
+      "lstrip": true,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "bos_token": "<s>",
+  "clean_up_tokenization_spaces": false,
+  "cls_token": "<s>",
+  "eos_token": "</s>",
+  "errors": "replace",
+  "extra_special_tokens": {},
+  "mask_token": "<mask>",
+  "model_max_length": 512,
+  "pad_token": "<pad>",
+  "sep_token": "</s>",
+  "tokenizer_class": "RobertaTokenizer",
+  "unk_token": "<unk>"
+}
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
onnx_export/export_clap.py
ADDED

@@ -0,0 +1,331 @@
+#!/usr/bin/env python3
+"""
+Export CLAP (Contrastive Language-Audio Pretraining) model to ONNX.
+
+The CLAP model is used for reranking separation candidates by scoring
+audio-text similarity.
+
+Usage:
+    python -m onnx_export.export_clap --output-dir onnx_models --verify
+"""
+
+import os
+import argparse
+import json
+import torch
+import torch.nn as nn
+from huggingface_hub import hf_hub_download
+
+
+def get_clap_model(checkpoint_file=None, device="cpu"):
+    """Load the CLAP model from laion_clap."""
+    import laion_clap
+
+    model = laion_clap.CLAP_Module(enable_fusion=False, amodel="HTSAT-tiny").to(device)
+
+    if checkpoint_file is None:
+        checkpoint_file = hf_hub_download(
+            repo_id="lukewys/laion_clap", filename="630k-best.pt"
+        )
+
+    state_dict = torch.load(checkpoint_file, map_location=device, weights_only=False)["state_dict"]
+
+    # Handle module prefix from DataParallel
+    if next(iter(state_dict.items()))[0].startswith("module"):
+        state_dict = {k[7:]: v for k, v in state_dict.items()}
+
+    # Remove position_ids if present (not needed)
+    if "text_branch.embeddings.position_ids" in state_dict:
+        del state_dict["text_branch.embeddings.position_ids"]
+
+    model.model.load_state_dict(state_dict)
+    return model.eval()
+
+
+class CLAPAudioEncoderWrapper(nn.Module):
+    """
+    Wrapper for CLAP audio encoder for ONNX export.
+
+    Takes waveform input directly and processes through the HTSAT audio branch.
+    """
+
+    def __init__(self, model):
+        super().__init__()
+        self.audio_branch = model.model.audio_branch
+        self.audio_transform = model.model.audio_transform
+        self.audio_projection = model.model.audio_projection
+
+    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            waveform: [batch, samples] audio waveform at 48kHz, 10 seconds (480000 samples)
+
+        Returns:
+            audio_embed: [batch, 512] normalized audio embedding
+        """
+        # Compute spectrogram from waveform
+        x = self.audio_branch.spectrogram_extractor(waveform)  # [B, 1, T, F]
+        x = self.audio_branch.logmel_extractor(x)  # [B, 1, T, mel_bins]
+
+        # Batch normalization
+        x = x.transpose(1, 3)  # [B, mel_bins, T, 1]
+        x = self.audio_branch.bn0(x)
+        x = x.transpose(1, 3)  # [B, 1, T, mel_bins]
+
+        # Reshape for Swin Transformer using the original method
+        x = self.audio_branch.reshape_wav2img(x)
+
+        # Forward through transformer features
+        output_dict = self.audio_branch.forward_features(x)
+        embedding = output_dict["embedding"]  # [B, 768]
+
+        # Project to 512-dim: projection first, then transform
+        x = self.audio_projection(embedding)  # 768 -> 512
+        x = self.audio_transform(x)  # 512 -> 512
+
+        # L2 normalize
+        x = x / x.norm(dim=-1, keepdim=True)
+        return x
+
+
+class CLAPTextEncoderWrapper(nn.Module):
+    """Wrapper for CLAP text encoder for ONNX export."""
+
+    def __init__(self, model):
+        super().__init__()
+        self.text_branch = model.model.text_branch
+        self.text_transform = model.model.text_transform
+        self.text_projection = model.model.text_projection
+
+    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
+        """
+        Args:
+            input_ids: [batch, seq_len] token IDs
+            attention_mask: [batch, seq_len] attention mask
+
+        Returns:
+            text_embed: [batch, 512] normalized text embedding
+        """
+        x = self.text_branch(input_ids=input_ids, attention_mask=attention_mask)
+        x = x.pooler_output  # [B, 768]
+        x = self.text_projection(x)  # 768 -> 512
+        x = self.text_transform(x)  # 512 -> 512
+        # L2 normalize
+        x = x / x.norm(dim=-1, keepdim=True)
+        return x
+
+
+def export_clap_audio_encoder(model, output_path, opset_version=17, device="cpu"):
+    """Export CLAP audio encoder to ONNX."""
+    import onnx
+
+    print(f"Exporting CLAP audio encoder to {output_path}...")
+
+    wrapper = CLAPAudioEncoderWrapper(model).eval().to(device)
+
+    # Sample input: 10 seconds of audio at 48kHz (480000 samples)
+    batch_size = 1
+    num_samples = 480000  # 10 seconds at 48kHz
+
+    dummy_waveform = torch.randn(batch_size, num_samples, device=device)
+
+    # Test forward pass
+    with torch.no_grad():
+        output = wrapper(dummy_waveform)
+        print(f" Audio encoder output shape: {output.shape}")
+
+    torch.onnx.export(
+        wrapper,
+        (dummy_waveform,),
+        output_path,
+        input_names=["waveform"],
+        output_names=["audio_embed"],
+        dynamic_axes={
+            "waveform": {0: "batch_size"},
+            "audio_embed": {0: "batch_size"},
+        },
+        opset_version=opset_version,
+        do_constant_folding=True,
+    )
+
+    # Validate
+    onnx_model = onnx.load(output_path)
+    onnx.checker.check_model(onnx_model)
+    print(" ✓ CLAP audio encoder exported successfully")
+
+    return True
+
+
+def export_clap_text_encoder(model, output_path, opset_version=17, device="cpu"):
+    """Export CLAP text encoder to ONNX."""
+    import onnx
+
+    print(f"Exporting CLAP text encoder to {output_path}...")
+
+    wrapper = CLAPTextEncoderWrapper(model).eval().to(device)
+
+    # Sample input
+    batch_size = 1
+    seq_len = 77
+
+    dummy_input_ids = torch.randint(0, 50265, (batch_size, seq_len), device=device)
+    dummy_attention_mask = torch.ones(batch_size, seq_len, dtype=torch.long, device=device)
+
+    # Test forward pass
+    with torch.no_grad():
+        output = wrapper(dummy_input_ids, dummy_attention_mask)
+        print(f" Text encoder output shape: {output.shape}")
+
+    torch.onnx.export(
+        wrapper,
+        (dummy_input_ids, dummy_attention_mask),
+        output_path,
+        input_names=["input_ids", "attention_mask"],
+        output_names=["text_embed"],
+        dynamic_axes={
+            "input_ids": {0: "batch_size", 1: "seq_len"},
+            "attention_mask": {0: "batch_size", 1: "seq_len"},
+            "text_embed": {0: "batch_size"},
+        },
+        opset_version=opset_version,
+        do_constant_folding=True,
+    )
+
+    # Validate
+    onnx_model = onnx.load(output_path)
+    onnx.checker.check_model(onnx_model)
+    print(" ✓ CLAP text encoder exported successfully")
+
+    return True
+
+
+def save_clap_config(model, output_path):
+    """Save CLAP audio preprocessing config."""
+    audio_cfg = model.model_cfg["audio_cfg"]
+
+    config = {
+        "sample_rate": audio_cfg["sample_rate"],
+        "window_size": audio_cfg["window_size"],
+        "hop_size": audio_cfg["hop_size"],
+        "mel_bins": audio_cfg["mel_bins"],
+        "fmin": audio_cfg["fmin"],
+        "fmax": audio_cfg["fmax"],
+        "max_audio_len": 480000,  # 10 seconds at 48kHz
+        "embed_dim": 512,
+    }
+
+    with open(output_path, "w") as f:
+        json.dump(config, f, indent=2)
+
+    print(f" ✓ Config saved to {output_path}")
+    return config
+
+
+def save_clap_tokenizer(output_dir):
+    """Save RoBERTa tokenizer for CLAP text encoding."""
+    from transformers import RobertaTokenizer
+
+    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+    tokenizer.save_pretrained(output_dir)
+    print(f" ✓ Tokenizer saved to {output_dir}")
+
+
+def verify_clap(model, audio_onnx_path, text_onnx_path, config, device="cpu"):
+    """Verify ONNX outputs match PyTorch."""
+    import onnxruntime as ort
+    import numpy as np
+
+    print("Verifying CLAP ONNX outputs...")
+
+    # Create sample audio (10 seconds at 48kHz)
+    sample_waveform = torch.randn(1, 480000)  # [batch, samples]
+
+    # PyTorch audio embedding
+    wrapper = CLAPAudioEncoderWrapper(model).eval()
+    with torch.no_grad():
+        pytorch_audio_embed = wrapper(sample_waveform).numpy()
+
+    # ONNX audio embedding
+    audio_sess = ort.InferenceSession(audio_onnx_path, providers=["CPUExecutionProvider"])
+    onnx_audio_embed = audio_sess.run(
+        ["audio_embed"],
+        {"waveform": sample_waveform.numpy().astype(np.float32)},
+    )[0]
+
+    audio_diff = np.abs(pytorch_audio_embed - onnx_audio_embed).max()
+    print(f" Audio encoder max diff: {audio_diff:.2e}")
+
+    # Text embedding verification
+    from transformers import RobertaTokenizer
+    tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
+    tokens = tokenizer(["a person speaking"], return_tensors="pt", padding=True, truncation=True)
+
+    text_wrapper = CLAPTextEncoderWrapper(model).eval()
+    with torch.no_grad():
+        pytorch_text_embed = text_wrapper(tokens["input_ids"], tokens["attention_mask"]).numpy()
+
+    text_sess = ort.InferenceSession(text_onnx_path, providers=["CPUExecutionProvider"])
+    onnx_text_embed = text_sess.run(
+        ["text_embed"],
+        {
+            "input_ids": tokens["input_ids"].numpy().astype(np.int64),
+            "attention_mask": tokens["attention_mask"].numpy().astype(np.int64),
+        },
+    )[0]
+
+    text_diff = np.abs(pytorch_text_embed - onnx_text_embed).max()
+    print(f" Text encoder max diff: {text_diff:.2e}")
+
+    max_diff = max(audio_diff, text_diff)
+    if max_diff < 1e-4:
+        print(" ✓ Verification passed")
+        return True
+    else:
+        print(f" ✗ Verification failed (max diff: {max_diff:.2e})")
+        return False
+
+
+def main():
+    parser = argparse.ArgumentParser(description="Export CLAP to ONNX")
+    parser.add_argument("--output-dir", type=str, default="onnx_models")
+    parser.add_argument("--checkpoint", type=str, default=None, help="CLAP checkpoint path")
+    parser.add_argument("--opset", type=int, default=18)
+    parser.add_argument("--device", type=str, default="cpu")
+    parser.add_argument("--verify", action="store_true")
+
+    args = parser.parse_args()
+
+    os.makedirs(args.output_dir, exist_ok=True)
+
+    # Load model
+    print("Loading CLAP model...")
+    model = get_clap_model(args.checkpoint, args.device)
+
+    # Export audio encoder
+    audio_path = os.path.join(args.output_dir, "clap_audio_encoder.onnx")
+    export_clap_audio_encoder(model, audio_path, args.opset, args.device)
+
+    # Export text encoder
+    text_path = os.path.join(args.output_dir, "clap_text_encoder.onnx")
+    export_clap_text_encoder(model, text_path, args.opset, args.device)
+
+    # Save config
+    config_path = os.path.join(args.output_dir, "clap_config.json")
+    config = save_clap_config(model, config_path)
+
+    # Save tokenizer
+    tokenizer_dir = os.path.join(args.output_dir, "clap_tokenizer")
+    os.makedirs(tokenizer_dir, exist_ok=True)
+    save_clap_tokenizer(tokenizer_dir)
+
+    # Verify
+    if args.verify:
+        verify_clap(model, audio_path, text_path, config, args.device)
+
+    print(f"\n✓ Export complete!")
+    print(f" Audio encoder: {audio_path}")
+    print(f" Text encoder: {text_path}")
+
+
+if __name__ == "__main__":
+    main()
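Beyond the script's `--verify` flag, a standalone smoke test of the exported audio encoder is to confirm the embedding has the documented shape and unit norm (a sketch assuming the export above has been run into `onnx_models/` and `onnxruntime` is installed):

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("onnx_models/clap_audio_encoder.onnx", providers=["CPUExecutionProvider"])
waveform = np.random.randn(1, 480000).astype(np.float32)  # 10 s of noise at 48 kHz
embed = sess.run(["audio_embed"], {"waveform": waveform})[0]
print(embed.shape)                     # expected: (1, 512)
print(np.linalg.norm(embed, axis=-1))  # expected: ~1.0, since the wrapper L2-normalizes
```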
onnx_inference.py
CHANGED

@@ -177,6 +177,42 @@ class SAMAudioONNXPipeline:
             self.peaframe_config = json.load(f)
         print(" ✓ PEAFrame config loaded")
 
+        # Load CLAP for reranking if available
+        self.clap_audio_encoder = None
+        self.clap_text_encoder = None
+        self.clap_tokenizer = None
+        self.clap_config = None
+
+        clap_audio_path = os.path.join(model_dir, "clap_audio_encoder.onnx")
+        clap_text_path = os.path.join(model_dir, "clap_text_encoder.onnx")
+
+        if os.path.exists(clap_audio_path) and os.path.exists(clap_text_path):
+            self.clap_audio_encoder = ort.InferenceSession(
+                clap_audio_path,
+                providers=providers,
+            )
+            print(" ✓ CLAP audio encoder loaded")
+
+            self.clap_text_encoder = ort.InferenceSession(
+                clap_text_path,
+                providers=providers,
+            )
+            print(" ✓ CLAP text encoder loaded")
+
+            # Load CLAP tokenizer
+            tokenizer_path = os.path.join(model_dir, "clap_tokenizer")
+            if os.path.exists(tokenizer_path):
+                from transformers import AutoTokenizer
+                self.clap_tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
+                print(" ✓ CLAP tokenizer loaded")
+
+            # Load CLAP config
+            config_path = os.path.join(model_dir, "clap_config.json")
+            if os.path.exists(config_path):
+                with open(config_path) as f:
+                    self.clap_config = json.load(f)
+                print(" ✓ CLAP config loaded")
+
         # Load tokenizer
         self._load_tokenizer()
         print(" ✓ Tokenizer loaded")
@@ -615,6 +651,154 @@ class SAMAudioONNXPipeline:
 
         return np.array([anchor_ids], dtype=np.int64), anchor_alignment
 
+    def score_with_clap(
+        self,
+        audio_candidates: list[np.ndarray],
+        text: str,
+    ) -> np.ndarray:
+        """
+        Score audio candidates against text using CLAP.
+
+        The CLAP audio encoder expects waveforms at 48kHz, padded/truncated to
+        10 seconds (480000 samples).
+
+        Args:
+            audio_candidates: List of audio waveforms, each shape (samples,)
+            text: Text description to match against
+
+        Returns:
+            scores: Array of similarity scores, shape (num_candidates,)
+        """
+        if self.clap_audio_encoder is None:
+            raise RuntimeError("CLAP audio encoder not loaded")
+        if self.clap_text_encoder is None:
+            raise RuntimeError("CLAP text encoder not loaded")
+        if self.clap_tokenizer is None:
+            raise RuntimeError("CLAP tokenizer not loaded")
+        if self.clap_config is None:
+            raise RuntimeError("CLAP config not loaded")
+
+        config = self.clap_config
+        max_audio_len = config.get("max_audio_len", 480000)
+
+        # Encode text (only once, same for all candidates)
+        tokens = self.clap_tokenizer(
+            text,
+            return_tensors="np",
+            padding=True,
+            truncation=True,
+            max_length=77,
+        )
+
+        text_embed = self.clap_text_encoder.run(
+            ["text_embed"],
+            {
+                "input_ids": tokens["input_ids"].astype(np.int64),
+                "attention_mask": tokens["attention_mask"].astype(np.int64),
+            },
+        )[0]  # [1, 512]
+
+        # Encode each audio candidate
+        audio_embeds = []
+        for audio in audio_candidates:
+            # Preprocess: quantize, pad/truncate
+            # Match PyTorch: int16_to_float32(float32_to_int16(audio))
+            audio = (audio * 32768.0).astype(np.int16).astype(np.float32) / 32768.0
+
+            # Pad or truncate to max_audio_len
+            if len(audio) > max_audio_len:
+                audio = audio[:max_audio_len]
+            elif len(audio) < max_audio_len:
+                # Repeat-pad
+                n_repeat = int(np.ceil(max_audio_len / len(audio)))
+                audio = np.tile(audio, n_repeat)[:max_audio_len]
+
+            # Reshape for CLAP: [batch, samples]
+            audio_input = audio.reshape(1, -1).astype(np.float32)
+
+            # Encode audio
+            audio_embed = self.clap_audio_encoder.run(
+                ["audio_embed"],
+                {"waveform": audio_input},
+            )[0]  # [1, 512]
+
+            audio_embeds.append(audio_embed)
+
+        # Stack audio embeddings: [num_candidates, 512]
+        audio_embeds = np.concatenate(audio_embeds, axis=0)
+
+        # Compute similarity scores: audio @ text.T
+        # audio_embeds: [num_candidates, 512]
+        # text_embed: [1, 512]
+        scores = np.matmul(audio_embeds, text_embed.T).squeeze(-1)  # [num_candidates]
+
+        return scores
+
+    def generate_candidates(
+        self,
+        audio_features: np.ndarray,
+        text_features: np.ndarray,
+        text_mask: np.ndarray,
+        num_candidates: int = 4,
+        masked_video_features: Optional[np.ndarray] = None,
+        anchor_ids: Optional[np.ndarray] = None,
+        anchor_alignment: Optional[np.ndarray] = None,
+        seed: Optional[int] = None,
+    ) -> list[tuple[np.ndarray, np.ndarray]]:
+        """
+        Generate multiple separation candidates with different random seeds.
+
+        Args:
+            audio_features: Encoded audio features [B, T, C]
+            text_features: Encoded text features
+            text_mask: Text attention mask
+            num_candidates: Number of candidates to generate
+            masked_video_features: Optional video features
+            anchor_ids: Optional anchor IDs
+            anchor_alignment: Optional anchor alignment
+            seed: Base random seed (candidates use seed, seed+1, seed+2, ...)
+
+        Returns:
+            List of (target_latent, residual_latent) tuples
+        """
+        B, T, C = audio_features.shape
+
+        candidates = []
+
+        for i in range(num_candidates):
+            # Set seed for reproducibility
+            if seed is not None:
+                np.random.seed(seed + i)
+
+            # Initialize with different random noise
+            x = np.random.randn(B, T, C).astype(np.float32)
+
+            # Run ODE solver
+            steps = self.num_ode_steps
+            dt = 1.0 / steps
+
+            for step_idx in range(steps):
+                t = step_idx * dt
+
+                k1 = self.dit_step(
+                    x, t, audio_features, text_features, text_mask,
+                    masked_video_features, anchor_ids, anchor_alignment
+                )
+                x_mid = x + k1 * (dt / 2.0)
+                k2 = self.dit_step(
+                    x_mid, t + dt/2.0, audio_features, text_features, text_mask,
+                    masked_video_features, anchor_ids, anchor_alignment
+                )
+                x = x + k2 * dt
+
+            # Extract target and residual latents
+            target_latent = x[:, :, :128].transpose(0, 2, 1)  # [B, 128, T]
+            residual_latent = x[:, :, 128:].transpose(0, 2, 1)  # [B, 128, T]
+
+            candidates.append((target_latent, residual_latent))
+
+        return candidates
+
     def dit_step(
         self,
         noisy_audio: np.ndarray,
@@ -678,16 +862,25 @@ class SAMAudioONNXPipeline:
         predict_spans: bool = False,
         manual_anchors: Optional[list[tuple[str, float, float]]] = None,
         span_threshold: float = 0.3,
+        rerank: bool = False,
+        num_candidates: int = 4,
+        rerank_seed: Optional[int] = None,
     ) -> tuple[np.ndarray, np.ndarray, Optional[np.ndarray], float]:
         """
         Perform the full separation pipeline.
 
         Args:
             audio: Input mixture waveform
             text: Text description of the target source
             video_path: Optional path to a video for visual conditioning
             mask_path: Optional path to a video/image mask for visual prompting
+            predict_spans: Whether to use PEAFrame for span prediction
+            manual_anchors: Optional list of manual anchor spans
+            span_threshold: Threshold for span prediction
+            rerank: Whether to generate multiple candidates and rerank with CLAP
+            num_candidates: Number of candidates for reranking
+            rerank_seed: Random seed for reproducible candidate generation
 
         Returns:
             Tuple of (target audio, residual audio, masked video frames if any, fps)
             - target: The separated sound matching the text/visual prompt
@@ -740,41 +933,77 @@ class SAMAudioONNXPipeline:
             masked_video_features = self.encode_video(norm_frames)  # This returns [B, 1024, T] (BCT)
             print(f" Video features shape: {masked_video_features.shape}")
 
-        # 4. Run ODE solver (
-        print("3. Running ODE solver...")
-        # Start from random noise
-        # Note: audio_features is [B, T, 256], DiT output is [B, T, 256]
-        B, T, C = audio_features.shape
-        x = np.random.randn(B, T, C).astype(np.float32)
-
-        steps = self.num_ode_steps
-        dt = 1.0 / steps
-
-        for i in range(steps):
-            t = i * dt
-            print(f" ODE step {i+1}/{steps}", end="\r")
-
-            k1 = self.dit_step(
-                x, t, audio_features, text_features, text_mask,
-                masked_video_features, anchor_ids, anchor_alignment
-            )
-            x_mid = x + k1 * (dt / 2.0)
-            k2 = self.dit_step(
-                x_mid, t + dt/2.0, audio_features, text_features, text_mask,
-                masked_video_features, anchor_ids, anchor_alignment
+        # 4. Run ODE solver (with optional reranking)
+        if rerank and self.clap_audio_encoder is not None:
+            print(f"3. Generating {num_candidates} candidates for reranking...")
+
+            # Generate multiple candidates
+            candidates = self.generate_candidates(
+                audio_features, text_features, text_mask,
+                num_candidates=num_candidates,
+                masked_video_features=masked_video_features,
+                anchor_ids=anchor_ids,
+                anchor_alignment=anchor_alignment,
+                seed=rerank_seed,
             )
 
-            x = x + k2 * dt
-
-        # Extract target and residual latents
-        # The DiT model produces [B, T, 256] where:
-        #   - First 128 channels = target (the separated sound)
-        #   - Last 128 channels = residual (everything else)
-        # This matches the PyTorch implementation in sam_audio/model/model.py
-        target_latent = x[:, :, :128].transpose(0, 2, 1)  # [B, 128, T] for decoder
-        residual_latent = x[:, :, 128:].transpose(0, 2, 1)  # [B, 128, T] for decoder
-        print(f"\n Target latent shape: {target_latent.shape}")
-        print(f" Residual latent shape: {residual_latent.shape}")
+            # Decode all candidate audios
+            print("3b. Decoding candidate audios...")
+            candidate_audios = []
+            for i, (target_latent, _) in enumerate(candidates):
+                decoded = self.decode_audio(target_latent)
+                candidate_audios.append(decoded)
+                print(f" Candidate {i+1}/{num_candidates} decoded", end="\r")
+            print()
+
+            # Score with CLAP
+            print("3c. Scoring candidates with CLAP...")
+            scores = self.score_with_clap(candidate_audios, text)
+            best_idx = int(np.argmax(scores))
+            print(f" Scores: {scores}")
+            print(f" Selected candidate {best_idx + 1}/{num_candidates} (score: {scores[best_idx]:.4f})")
+
+            # Use best candidate
+            target_latent, residual_latent = candidates[best_idx]
+            print(f" Target latent shape: {target_latent.shape}")
+            print(f" Residual latent shape: {residual_latent.shape}")
+
+        else:
+            # Single candidate path (original behavior)
+            print("3. Running ODE solver...")
+            # Start from random noise
+            # Note: audio_features is [B, T, 256], DiT output is [B, T, 256]
+            B, T, C = audio_features.shape
+            x = np.random.randn(B, T, C).astype(np.float32)
+
+            steps = self.num_ode_steps
+            dt = 1.0 / steps
+
+            for i in range(steps):
+                t = i * dt
+                print(f" ODE step {i+1}/{steps}", end="\r")
+
+                k1 = self.dit_step(
+                    x, t, audio_features, text_features, text_mask,
+                    masked_video_features, anchor_ids, anchor_alignment
+                )
+                x_mid = x + k1 * (dt / 2.0)
+                k2 = self.dit_step(
+                    x_mid, t + dt/2.0, audio_features, text_features, text_mask,
+                    masked_video_features, anchor_ids, anchor_alignment
+                )
+
+                x = x + k2 * dt
+
+            # Extract target and residual latents
+            # The DiT model produces [B, T, 256] where:
+            #   - First 128 channels = target (the separated sound)
+            #   - Last 128 channels = residual (everything else)
+            # This matches the PyTorch implementation in sam_audio/model/model.py
+            target_latent = x[:, :, :128].transpose(0, 2, 1)  # [B, 128, T] for decoder
+            residual_latent = x[:, :, 128:].transpose(0, 2, 1)  # [B, 128, T] for decoder
+            print(f"\n Target latent shape: {target_latent.shape}")
+            print(f" Residual latent shape: {residual_latent.shape}")
 
         # 5. Decode both to waveforms
         print("4. Decoding target audio...")
@@ -818,6 +1047,23 @@ def main():
         default=0.3,
         help="Threshold for span prediction (default: 0.3)",
     )
+    parser.add_argument(
+        "--rerank",
+        action="store_true",
+        help="Generate multiple candidates and rerank with CLAP",
+    )
+    parser.add_argument(
+        "--num-candidates",
+        type=int,
+        default=4,
+        help="Number of candidates for reranking (default: 4)",
+    )
+    parser.add_argument(
+        "--rerank-seed",
+        type=int,
+        default=None,
+        help="Random seed for reproducible candidate generation",
+    )
    parser.add_argument("--output", type=str, default="target.wav", help="Output WAV file path for target (separated) audio")
    parser.add_argument("--output-residual", type=str, default="residual.wav", help="Output WAV file path for residual audio")
    parser.add_argument("--output-video", type=str, help="Optional path to save masked video with separated audio")
@@ -870,6 +1116,9 @@ def main():
         predict_spans=args.predict_spans,
         manual_anchors=manual_anchors,
         span_threshold=args.span_threshold,
+        rerank=args.rerank,
+        num_candidates=args.num_candidates,
+        rerank_seed=args.rerank_seed,
     )
 
     # Save output audio files
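Both the reranking path and the original single-candidate path advance the latent with the same two-evaluation midpoint (RK2) update around `dit_step`. Written out in isolation (an illustrative sketch with a toy `velocity` standing in for `dit_step`, not code from the repo):

```python
import numpy as np

def midpoint_step(x, t, dt, velocity):
    """One midpoint (RK2) update: evaluate the velocity at the half step, then advance."""
    k1 = velocity(x, t)
    x_mid = x + k1 * (dt / 2.0)
    k2 = velocity(x_mid, t + dt / 2.0)
    return x + k2 * dt

# Toy check with dx/dt = -x over t in [0, 1]: the result should approach exp(-1) ~= 0.368
x = np.ones(4)
dt = 0.1
for i in range(10):
    x = midpoint_step(x, i * dt, dt, lambda x, t: -x)
print(x)
```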
residual.wav
ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:f4dfbb54fecf275f6cb4c13e934ccd2971ed17e454c7e52152dc8ae69fedf808
+size 960044