Spaces:
Running
on
Zero
Running
on
Zero
Peter Shi
committed on
Commit
·
f36ee58
1
Parent(s):
8c0dc30
feat: Added the SAM Audio audio segmentation Web UI based on Gradio and its dependencies.
Browse files- README.md +54 -4
- app.py +140 -0
- requirements.txt +11 -0
README.md
CHANGED
|
@@ -1,13 +1,63 @@
|
|
| 1 |
---
|
| 2 |
title: Sam Audio Webui
|
| 3 |
-
emoji:
|
| 4 |
-
colorFrom:
|
| 5 |
-
colorTo:
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 6.2.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
license: apache-2.0
|
|
|
|
| 11 |
---
|
| 12 |
|
| 13 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
---
|
| 2 |
title: Sam Audio Webui
|
| 3 |
+
emoji: 🎵
|
| 4 |
+
colorFrom: indigo
|
| 5 |
+
colorTo: pink
|
| 6 |
sdk: gradio
|
| 7 |
sdk_version: 6.2.0
|
| 8 |
app_file: app.py
|
| 9 |
pinned: false
|
| 10 |
license: apache-2.0
|
| 11 |
+
fullWidth: true
|
| 12 |
---
|
| 13 |
|
| 14 |
+
# SAM Audio WebUI
|
| 15 |
+
|
| 16 |
+
This Space hosts a WebUI for the **SAM Audio** model by Meta (Facebook), designed to segment and isolate specific sounds from audio files using text prompts.
|
| 17 |
+
|
| 18 |
+
## Features
|
| 19 |
+
|
| 20 |
+
- **Model**: Uses `facebook/sam-audio-small` for a balance of performance and resource usage.
|
| 21 |
+
- **ZeroGPU Support**: Optimized to run on Hugging Face ZeroGPU (A100/A10G) with automatic GPU handling.
|
| 22 |
+
- **Dynamic Fallback**:
|
| 23 |
+
- Attempts to load the model in `float16` for best quality.
|
| 24 |
+
- Falls back to **8-bit quantization** (`bitsandbytes`) if VRAM is insufficient.
|
| 25 |
+
- **Audio Reconstruction**: Converts model masks to audio using STFT/ISTFT processing.
|
| 26 |
+
|
| 27 |
+
## Local Development
|
| 28 |
+
|
| 29 |
+
To run this application locally on your machine:
|
| 30 |
+
|
| 31 |
+
1. **Clone the repository:**
|
| 32 |
+
```bash
|
| 33 |
+
git clone https://huggingface.co/spaces/lpeterl/sam-audio-webui
|
| 34 |
+
cd sam-audio-webui
|
| 35 |
+
```
|
| 36 |
+
|
| 37 |
+
2. **Create a virtual environment (Recommended):**
|
| 38 |
+
```bash
|
| 39 |
+
python3 -m venv venv
|
| 40 |
+
source venv/bin/activate
|
| 41 |
+
```
|
| 42 |
+
|
| 43 |
+
3. **Install dependencies:**
|
| 44 |
+
```bash
|
| 45 |
+
pip install -r requirements.txt
|
| 46 |
+
pip install gradio
|
| 47 |
+
```
|
| 48 |
+
|
| 49 |
+
4. **Run the app:**
|
| 50 |
+
```bash
|
| 51 |
+
python3 app.py
|
| 52 |
+
```
|
| 53 |
+
*Note: `spaces` GPU decorators are mocked locally, so you don't need a ZeroGPU environment.*
|
| 54 |
+
|
| 55 |
+
## System Requirements
|
| 56 |
+
|
| 57 |
+
- **VRAM**: ~21.6 GB for standard loading. ~12 GB with 8-bit quantization.
|
| 58 |
+
- **Platform**: CUDA (NVIDIA GPU) required for quantization. Mac (MPS) supported for standard loading (requires high unified memory).
|
| 59 |
+
|
| 60 |
+
## Acknowledgements
|
| 61 |
+
|
| 62 |
+
- Model: [facebook/sam-audio](https://huggingface.co/facebook/sam-audio)
|
| 63 |
+
- Library: [Hugging Face Transformers](https://huggingface.co/docs/transformers/index)
|
app.py
ADDED
|
@@ -0,0 +1,140 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
import gradio as gr
|
| 2 |
+
import torch
|
| 3 |
+
|
| 12 |
+
|
| 13 |
+
from transformers import AutoProcessor, AutoModelForAudioSegmentation
|
| 14 |
+
import numpy as np
|
| 15 |
+
import librosa
|
| 16 |
+
import tempfile
|
| 17 |
+
import soundfile as sf
|
| 18 |
+
|
| 19 |
+
# Model configuration
|
| 20 |
+
MODEL_ID = "facebook/sam-audio-small"
|
| 21 |
+
|
| 22 |
+
print(f"Loading model: {MODEL_ID}...")
|
| 23 |
+
try:
|
| 24 |
+
processor = AutoProcessor.from_pretrained(MODEL_ID)
|
| 25 |
+
model = AutoModelForAudioSegmentation.from_pretrained(
|
| 26 |
+
MODEL_ID,
|
| 27 |
+
device_map="auto",
|
| 28 |
+
torch_dtype=torch.float16
|
| 29 |
+
)
|
| 30 |
+
print("Model loaded successfully.")
|
| 31 |
+
except Exception as e:
|
| 32 |
+
print(f"Error loading model: {e}")
|
| 33 |
+
print("Attempting to load with 8-bit quantization...")
|
| 34 |
+
try:
|
| 35 |
+
from transformers import BitsAndBytesConfig
|
| 36 |
+
quantization_config = BitsAndBytesConfig(load_in_8bit=True)
|
| 37 |
+
processor = AutoProcessor.from_pretrained(MODEL_ID)
|
| 38 |
+
model = AutoModelForAudioSegmentation.from_pretrained(
|
| 39 |
+
MODEL_ID,
|
| 40 |
+
quantization_config=quantization_config,
|
| 41 |
+
device_map="auto"
|
| 42 |
+
)
|
| 43 |
+
print("Model loaded with 8-bit quantization.")
|
| 44 |
+
except Exception as e2:
|
| 45 |
+
print(f"Critical error loading model: {e2}")
|
| 46 |
+
raise e2
|
| 47 |
+
|
| 48 |
+
@spaces.GPU(duration=120)
|
| 49 |
+
def infer(audio_path, prompt_text):
|
| 50 |
+
if not audio_path:
|
| 51 |
+
return None
|
| 52 |
+
|
| 53 |
+
print(f"Processing audio: {audio_path}, Prompt: {prompt_text}")
|
| 54 |
+
|
| 55 |
+
# Load audio with librosa (standardizes sample rate)
|
| 56 |
+
target_sr = 16000 # SAM Audio often works at 16k, or check processor.feature_extractor.sampling_rate
|
| 57 |
+
if hasattr(processor, "feature_extractor"):
|
| 58 |
+
target_sr = processor.feature_extractor.sampling_rate
|
| 59 |
+
|
| 60 |
+
audio, sr = librosa.load(audio_path, sr=target_sr, mono=True)
|
| 61 |
+
|
| 62 |
+
# Prepare inputs
|
| 63 |
+
inputs = processor(
|
| 64 |
+
audios=[audio],
|
| 65 |
+
sampling_rate=sr,
|
| 66 |
+
text=[[prompt_text]] if prompt_text else None,
|
| 67 |
+
return_tensors="pt"
|
| 68 |
+
).to(model.device)
|
| 69 |
+
|
| 70 |
+
with torch.no_grad():
|
| 71 |
+
outputs = model(**inputs)
|
| 72 |
+
|
| 73 |
+
# Post-process to get likelihoods or masks
|
| 74 |
+
# Note: transformers implementation details vary.
|
| 75 |
+
# Usually we get logits. sigmoid -> prob.
|
| 76 |
+
# pred_masks shape: (batch_size, num_masks, freq, time) or similar.
|
| 77 |
+
|
| 78 |
+
pred_masks = torch.sigmoid(outputs.pred_masks)
|
| 79 |
+
|
| 80 |
+
# For audio reconstruction, we need to apply this mask to the STFT of the original audio.
|
| 81 |
+
# We calculate STFT using the same parameters as the model training if possible.
|
| 82 |
+
# If parameters are unknown, we try standard values or rely on processor logic if available.
|
| 83 |
+
|
| 84 |
+
# Standard STFT for AudioLDM/MusicGen etc often use n_fft=1024, hop=160.
|
| 85 |
+
# Let's inspect the mask shape to infer Time dimensions.
|
| 86 |
+
|
| 87 |
+
mask = pred_masks[0, 0] # Take first batch, first predicted mask
|
| 88 |
+
# Resize mask to inputs size if needed?
|
| 89 |
+
# Usually SAM Audio outputs a mask corresponding to the spectrogram features.
|
| 90 |
+
|
| 91 |
+
# Let's try to reconstruct using a generic STFT approach
|
| 92 |
+
n_fft = 1024
|
| 93 |
+
hop_length = 320 # Common for 16k
|
| 94 |
+
stft = librosa.stft(audio, n_fft=n_fft, hop_length=hop_length)
|
| 95 |
+
|
| 96 |
+
# stft shape: (1 + n_fft/2, time_frames)
|
| 97 |
+
# mask shape from model might be different. Resize mask to match stft.
|
| 98 |
+
|
| 99 |
+
# Convert mask to numpy
|
| 100 |
+
mask_np = mask.cpu().float().numpy()
|
| 101 |
+
|
| 102 |
+
# Resize mask to match STFT shape
|
| 103 |
+
# stft.shape is (freq, time)
|
| 104 |
+
import cv2
|
| 105 |
+
# cv2.resize expects (width, height) -> (time, freq)
|
| 106 |
+
try:
|
| 107 |
+
mask_resized = cv2.resize(mask_np, (stft.shape[1], stft.shape[0]), interpolation=cv2.INTER_LINEAR)
|
| 108 |
+
# Apply mask
|
| 109 |
+
stft_masked = stft * mask_resized
|
| 110 |
+
# ISTFT
|
| 111 |
+
audio_masked = librosa.istft(stft_masked, hop_length=hop_length)
|
| 112 |
+
|
| 113 |
+
# Save to temp file
|
| 114 |
+
with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
|
| 115 |
+
sf.write(tmp.name, audio_masked, sr)
|
| 116 |
+
return tmp.name
|
| 117 |
+
except Exception as e_resize:
|
| 118 |
+
print(f"Error applying mask: {e_resize}. Returning original for debug.")
|
| 119 |
+
# Fallback to saving original just to show partial success
|
| 120 |
+
return audio_path
|
| 121 |
+
|
| 122 |
+
with gr.Blocks() as demo:
|
| 123 |
+
gr.Markdown(f"# SAM Audio WebUI ({MODEL_ID})")
|
| 124 |
+
gr.Markdown("Upload audio and provide a prompt to segment specific sounds.")
|
| 125 |
+
|
| 126 |
+
with gr.Row():
|
| 127 |
+
audio_input = gr.Audio(type="filepath", label="Input Audio")
|
| 128 |
+
text_input = gr.Textbox(label="Prompt (e.g., 'drums', 'vocals')")
|
| 129 |
+
|
| 130 |
+
submit_btn = gr.Button("Segment Audio")
|
| 131 |
+
audio_output = gr.Audio(label="Segmented Audio")
|
| 132 |
+
|
| 133 |
+
submit_btn.click(
|
| 134 |
+
fn=infer,
|
| 135 |
+
inputs=[audio_input, text_input],
|
| 136 |
+
outputs=[audio_output]
|
| 137 |
+
)
|
| 138 |
+
|
| 139 |
+
if __name__ == "__main__":
|
| 140 |
+
demo.launch()
|
requirements.txt
ADDED
|
@@ -0,0 +1,11 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
gradio>=4.0.0
|
| 2 |
+
torch>=2.0.0
|
| 3 |
+
transformers>=4.38.0
|
| 4 |
+
accelerate>=0.27.0
|
| 5 |
+
bitsandbytes>=0.41.0
|
| 6 |
+
scipy
|
| 7 |
+
librosa
|
| 8 |
+
opencv-python-headless
|
| 9 |
+
spaces
|
| 10 |
+
|
| 11 |
+
|