ray-006 committed on
Commit 3315688 · verified · 1 Parent(s): 16c1133

Update README.md

Files changed (1):
  1. README.md +14 -137
README.md CHANGED
@@ -1,137 +1,14 @@
- <div align="center">
-
- # SAM-Audio
-
- ![CI](https://github.com/facebookresearch/sam-audio/actions/workflows/ci.yaml/badge.svg)
-
- ![model_image](assets/sam_audio_main_model.png)
-
- </div>
-
- Segment Anything Model for Audio [[**Blog**](https://ai.meta.com/blog/sam-audio/)] [[**Paper**](https://ai.meta.com/research/publications/sam-audio-segment-anything-in-audio/)] [[**Demo**](https://aidemos.meta.com/segment-anything/editor/segment-audio)]
-
- SAM-Audio is a foundation model for isolating any sound in audio using text, visual, or temporal prompts. It can separate specific sounds from complex audio mixtures based on natural language descriptions, visual cues from video, or time spans.
-
- SAM-Audio and the Judge model rely crucially on [Perception-Encoder Audio-Visual (PE-AV)](https://huggingface.co/facebook/pe-av-large), which you can read more about [here](https://ai.meta.com/research/publications/pushing-the-frontier-of-audiovisual-perception-with-large-scale-multimodal-correspondence-learning/).
-
- ## Setup
-
- **Requirements:**
- - Python >= 3.10
- - CUDA-compatible GPU (recommended)
-
- Install dependencies:
-
- ```bash
- pip install .
- ```
-
- ## Usage
-
- ⚠️ Before using SAM-Audio, please request access to the checkpoints on the SAM-Audio
- Hugging Face [repo](https://huggingface.co/facebook/sam-audio-large). Once accepted, you
- need to be authenticated to download the checkpoints. You can do this by following
- these [steps](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication)
- (e.g. `hf auth login` after generating an access token).
-
- ### Basic Text Prompting
-
- ```python
- from sam_audio import SAMAudio, SAMAudioProcessor
- import torchaudio
- import torch
-
- model = SAMAudio.from_pretrained("facebook/sam-audio-large")
- processor = SAMAudioProcessor.from_pretrained("facebook/sam-audio-large")
- model = model.eval().cuda()
-
- file = "<audio file>"  # audio file path or torch tensor
- description = "<description>"
-
- batch = processor(
-     audios=[file],
-     descriptions=[description],
- ).to("cuda")
-
- with torch.inference_mode():
-     # NOTE: `predict_spans` and `reranking_candidates` have a large impact on performance.
-     # Setting `predict_spans=True` and `reranking_candidates=8` will give you better results
-     # at the cost of latency and memory. See the "Span Prediction" section below for more details.
-     result = model.separate(batch, predict_spans=False, reranking_candidates=1)
-
- # Save separated audio
- sample_rate = processor.audio_sampling_rate
- torchaudio.save("target.wav", result.target.cpu(), sample_rate)  # The isolated sound
- torchaudio.save("residual.wav", result.residual.cpu(), sample_rate)  # Everything else
- ```
-
- ### Prompting Methods
-
- SAM-Audio supports three types of prompts:
-
- 1. **Text Prompting**: Describe the sound you want to isolate using natural language
-    ```python
-    processor(audios=[audio], descriptions=["A man speaking"])
-    ```
-
- 2. **Visual Prompting**: Use video frames and masks to isolate sounds associated with visual objects
-    ```python
-    processor(audios=[video], descriptions=[""], masked_videos=processor.mask_videos([frames], [mask]))
-    ```
-
- 3. **Span Prompting**: Specify time ranges where the target sound occurs
-    ```python
-    processor(audios=[audio], descriptions=["A horn honking"], anchors=[[["+", 6.3, 7.0]]])
-    ```
-
- See the [examples](examples) directory for more detailed examples.
-
- ### Span Prediction (Optional for Text Prompting)
-
- We also provide support for automatically predicting the spans based on the text description, which is especially helpful for separating non-ambience sound events. You can enable this by adding `predict_spans=True` in your call to `separate`.
-
- ```python
- with torch.inference_mode():
-     outputs = model.separate(batch, predict_spans=True)
-
- # To further improve performance (at the expense of latency), you can add candidate re-ranking
- with torch.inference_mode():
-     outputs = model.separate(batch, predict_spans=True, reranking_candidates=8)
- ```
-
- ### Re-Ranking
-
- We provide the following models to assess the quality of the separated audio:
-
- - [CLAP](https://github.com/LAION-AI/CLAP): measures the similarity between the target audio and the text description
- - [Judge](https://huggingface.co/facebook/sam-audio-judge): measures the overall separation quality across 3 axes: precision, recall, and faithfulness (see the [model card](https://huggingface.co/facebook/sam-audio-judge#output-format) for more details)
- - [ImageBind](https://github.com/facebookresearch/ImageBind): for visual prompting, we measure the ImageBind embedding similarity between the separated audio and the masked input video
-
- We support generating multiple candidates (by setting `reranking_candidates=<k>` in your call to `separate`): the model generates `k` audios and chooses the best one based on the ranking models above.
-
- ## Models
-
- Below is a table of the models we released along with their overall subjective evaluation scores.
-
- | Model | General SFX | Speech | Speaker | Music | Instr (wild) | Instr (pro) |
- |----------|-------------|--------|---------|-------|-------------|------------|
- | [`sam-audio-small`](https://huggingface.co/facebook/sam-audio-small) | 3.62 | 3.99 | 3.12 | 4.11 | 3.56 | 4.24 |
- | [`sam-audio-base`](https://huggingface.co/facebook/sam-audio-base) | 3.28 | 4.25 | 3.57 | 3.87 | 3.66 | 4.27 |
- | [`sam-audio-large`](https://huggingface.co/facebook/sam-audio-large) | 3.50 | 4.03 | 3.60 | 4.22 | 3.66 | 4.49 |
-
- We additionally release a variant of each size that is better specifically at target-sound correctness and visual prompting:
- - [`sam-audio-small-tv`](https://huggingface.co/facebook/sam-audio-small-tv)
- - [`sam-audio-base-tv`](https://huggingface.co/facebook/sam-audio-base-tv)
- - [`sam-audio-large-tv`](https://huggingface.co/facebook/sam-audio-large-tv)
-
- ## Evaluation
-
- See the [eval](eval) directory for instructions and scripts to reproduce results from the paper.
-
- ## Contributing
-
- See [contributing](CONTRIBUTING.md) and [code of conduct](CODE_OF_CONDUCT.md) for more information.
-
- ## License
-
- This project is licensed under the SAM License; see the [LICENSE](LICENSE) file for details.
 
+ ---
+ title: Sample Audio
+ emoji: 📚
+ colorFrom: indigo
+ colorTo: red
+ sdk: gradio
+ sdk_version: 6.2.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ short_description: Sample-Audio
+ ---
+
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference