Spaces: Running on Zero

Update README.md

README.md CHANGED (@@ -1,137 +1,14 @@)
SAM-Audio and the Judge model crucially rely on [Perception-Encoder Audio-Visual (PE-AV)](https://huggingface.co/facebook/pe-av-large), which you can read more about [here](https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/).

## Setup

**Requirements:**
- Python >= 3.10
- CUDA-compatible GPU (recommended)

Install dependencies:

```bash
pip install .
```

## Usage

⚠️ Before using SAM Audio, please request access to the checkpoints on the SAM Audio Hugging Face [repo](https://huggingface.co/facebook/sam-audio-large). Once accepted, you need to authenticate in order to download the checkpoints, by following these [steps](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) (e.g. run `hf auth login` after generating an access token).

### Basic Text Prompting

```python
from sam_audio import SAMAudio, SAMAudioProcessor
import torchaudio
import torch

model = SAMAudio.from_pretrained("facebook/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
model = model.eval().cuda()

file = "<audio file>"  # audio file path or torch tensor
description = "<description>"

batch = processor(
    audios=[file],
    descriptions=[description],
).to("cuda")

with torch.inference_mode():
    # NOTE: `predict_spans` and `reranking_candidates` have a large impact on performance.
    # Setting `predict_spans=True` and `reranking_candidates=8` will give you better results
    # at the cost of latency and memory. See the "Span Prediction" section below for more details.
    result = model.separate(batch, predict_spans=False, reranking_candidates=1)

# Save separated audio
sample_rate = processor.audio_sampling_rate
torchaudio.save("target.wav", result.target.cpu(), sample_rate)  # The isolated sound
torchaudio.save("residual.wav", result.residual.cpu(), sample_rate)  # Everything else
```

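Separated stems sometimes come back with sample magnitudes above 1.0, which clips when written to an integer WAV. A common post-processing step before saving is peak normalization. This is not part of the SAM-Audio API, just a minimal dependency-free sketch (plain Python lists stand in for the waveform tensor):

```python
def peak_normalize(samples, peak=0.95):
    """Scale a waveform so its largest absolute sample equals `peak`."""
    max_abs = max(abs(s) for s in samples)
    if max_abs == 0:
        return list(samples)  # silent input: nothing to scale
    scale = peak / max_abs
    return [s * scale for s in samples]

# Toy waveform with one out-of-range sample (|s| > 1).
wave = [0.1, -1.4, 0.7]
norm = peak_normalize(wave)
assert max(abs(s) for s in norm) <= 0.95 + 1e-9
```

With real outputs you would apply the equivalent tensor operation to `result.target` before `torchaudio.save`.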
### Prompting Methods

SAM-Audio supports three types of prompts:

1. **Text Prompting**: Describe the sound you want to isolate using natural language
   ```python
   processor(audios=[audio], descriptions=["A man speaking"])
   ```

2. **Visual Prompting**: Use video frames and masks to isolate sounds associated with visual objects
   ```python
   processor(audios=[video], descriptions=[""], masked_videos=processor.mask_videos([frames], [mask]))
   ```

3. **Span Prompting**: Specify time ranges where the target sound occurs
   ```python
   processor(audios=[audio], descriptions=["A horn honking"], anchors=[[["+", 6.3, 7.0]]])
   ```

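The `anchors` literal above, `[[["+", 6.3, 7.0]]]`, appears to nest as: one list of anchors per audio in the batch, each anchor a `["+", start_sec, end_sec]` triple marking where the target sound occurs. Assuming that interpretation (the helper name and the nesting reading are illustrative, not part of the documented API), a small builder keeps the brackets straight:

```python
def make_anchors(spans_per_audio):
    """Build the nested `anchors` structure from per-audio (start, end) pairs.

    spans_per_audio: one entry per audio in the batch, each a list of
    (start_sec, end_sec) tuples where the target sound is present.
    """
    return [[["+", float(start), float(end)] for start, end in spans]
            for spans in spans_per_audio]

anchors = make_anchors([[(6.3, 7.0)]])
# Matches the literal form used above: [[["+", 6.3, 7.0]]]
```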
See the [examples](examples) directory for more detailed examples.

### Span Prediction (Optional for Text Prompting)

We also provide support for automatically predicting the spans from the text description, which is especially helpful for separating non-ambience sound events. You can enable this by passing `predict_spans=True` in your call to `separate`.

```python
with torch.inference_mode():
    outputs = model.separate(batch, predict_spans=True)

# To further improve quality (at the expense of latency), you can add candidate re-ranking
with torch.inference_mode():
    outputs = model.separate(batch, predict_spans=True, reranking_candidates=8)
```

### Re-Ranking

We provide the following models to assess the quality of the separated audio:

- [CLAP](https://github.com/LAION-AI/CLAP): measures the similarity between the target audio and the text description
- [Judge](https://huggingface.co/facebook/sam-audio-judge): measures the overall separation quality along 3 axes: precision, recall, and faithfulness (see the [model card](https://huggingface.co/facebook/sam-audio-judge#output-format) for more details)
- [ImageBind](https://github.com/facebookresearch/ImageBind): for visual prompting, measures the ImageBind embedding similarity between the separated audio and the masked input video

We also support generating multiple candidates by setting `reranking_candidates=<k>` in your call to `separate`: this generates `k` audios and chooses the best one based on the ranking models above.

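Conceptually, re-ranking is best-of-k selection: generate `k` candidate separations, score each with a quality model, and keep the highest-scoring one. A model-free sketch (the candidate strings and the stand-in scorer are hypothetical placeholders for generated audios and a CLAP/Judge/ImageBind score):

```python
def rerank(candidates, score_fn):
    """Return the candidate with the highest quality score."""
    return max(candidates, key=score_fn)

# Stand-ins for k generated audios and a quality model's scalar scores.
candidates = ["cand_a", "cand_b", "cand_c"]
scores = {"cand_a": 0.41, "cand_b": 0.87, "cand_c": 0.55}
best = rerank(candidates, scores.get)
# best == "cand_b"
```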
## Models

Below is a table of the released models along with their overall subjective evaluation scores:

| Model | General SFX | Speech | Speaker | Music | Instr (wild) | Instr (pro) |
|-------|-------------|--------|---------|-------|--------------|-------------|
| [`sam-audio-small`](https://huggingface.co/facebook/sam-audio-small) | 3.62 | 3.99 | 3.12 | 4.11 | 3.56 | 4.24 |
| [`sam-audio-base`](https://huggingface.co/facebook/sam-audio-base) | 3.28 | 4.25 | 3.57 | 3.87 | 3.66 | 4.27 |
| [`sam-audio-large`](https://huggingface.co/facebook/sam-audio-large) | 3.50 | 4.03 | 3.60 | 4.22 | 3.66 | 4.49 |

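Note that no single checkpoint leads every category; reading off the column maxima from the scores above (numbers taken exactly from the table):

```python
scores = {
    "sam-audio-small": {"General SFX": 3.62, "Speech": 3.99, "Speaker": 3.12,
                        "Music": 4.11, "Instr (wild)": 3.56, "Instr (pro)": 4.24},
    "sam-audio-base":  {"General SFX": 3.28, "Speech": 4.25, "Speaker": 3.57,
                        "Music": 3.87, "Instr (wild)": 3.66, "Instr (pro)": 4.27},
    "sam-audio-large": {"General SFX": 3.50, "Speech": 4.03, "Speaker": 3.60,
                        "Music": 4.22, "Instr (wild)": 3.66, "Instr (pro)": 4.49},
}

# Best model per evaluation category.
best = {cat: max(scores, key=lambda m: scores[m][cat])
        for cat in next(iter(scores.values()))}
# e.g. best["General SFX"] == "sam-audio-small", best["Speech"] == "sam-audio-base"
```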
We additionally release another variant (in each size) that performs better specifically on correctness of the target sound as well as on visual prompting:

- [`sam-audio-small-tv`](https://huggingface.co/facebook/sam-audio-small-tv)
- [`sam-audio-base-tv`](https://huggingface.co/facebook/sam-audio-base-tv)
- [`sam-audio-large-tv`](https://huggingface.co/facebook/sam-audio-large-tv)

## Evaluation

See the [eval](eval) directory for instructions and scripts to reproduce the results from the paper.

## Contributing

See [contributing](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md) for more information.

## License

This project is licensed under the SAM License - see the [LICENSE](LICENSE) file for details.
---
title: Sample Audio
emoji: 📚
colorFrom: indigo
colorTo: red
sdk: gradio
sdk_version: 6.2.0
app_file: app.py
pinned: false
license: mit
short_description: Sample-Audio
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference