Commit ffd7404 · orrp committed
Parent(s): a0bbc38

File cleanup, Gradio queueing, made README minimal

Files changed:
- README.md: +3 -185
- packages.txt: +0 -1
- vampnet/app.py: +1 -1
- vampnet/setup.py: +0 -44
README.md CHANGED
@@ -9,195 +9,14 @@ hardware: a10g-small

The previous README content (removed):
---

# WhAM: a Whale Acoustics Model

[arXiv](https://arxiv.org/abs/2512.02206) · [Zenodo](https://doi.org/10.5281/zenodo.17633708) · [DSWP dataset](https://huggingface.co/datasets/orrp/DSWP)

WhAM is a transformer-based audio-to-audio model designed to synthesize and analyze sperm whale codas. Based on [VampNet](https://github.com/hugofloresgarcia/vampnet), WhAM uses masked acoustic token modeling to capture temporal and spectral features of whale communication. WhAM generates codas from a given audio context, enabling three core capabilities, including:

- Providing audio embeddings for downstream tasks such as social unit and spectral feature ("vowel") classification.

See our [NeurIPS 2025](https://openreview.net/pdf?id=IL1wvzOgqD) publication for more details.

## Installation

1. **Clone the repository:**
   ```bash
   git clone https://github.com/Project-CETI/wham.git
   cd wham
   ```

2. **Set up the environment:**
   ```bash
   conda create -n wham python=3.9
   conda activate wham
   ```

3. **Install dependencies:**
   ```bash
   # Install the wham package
   pip install -e .

   # Install VampNet
   pip install -e ./vampnet

   # Install madmom
   pip install --no-build-isolation madmom

   # Install ffmpeg
   conda install -c conda-forge ffmpeg
   ```

4. **Download model weights:**
   Download the [weights](https://zenodo.org/records/17633708) and extract them to `vampnet/models/`.

## Generation

To run WhAM locally and prompt it in your browser:

```bash
python vampnet/app.py --args.load conf/interface.yml --Interface.device cuda
```

This will provide you with a Gradio link where you can test WhAM on inputs of your choice.

## Training Data

You only need to follow these steps if you want to fine-tune your own version of WhAM. First, obtain the original VampNet weights by following the instructions in the [VampNet repository](https://github.com/hugofloresgarcia/vampnet). Download `c2f.pth` and `codec.pth` and replace the weights you previously downloaded in `vampnet/models`.

Second, obtain data:

1. **Domain adaptation data:**
   - Download audio samples from the [WMMS 'Best Of' Cut](https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm). Save them under `vampnet/training_data/domain_adaptation`.
   - Download audio samples from the [BirdSet Dataset](https://huggingface.co/datasets/DBD-research-group/BirdSet). Save these under the same directory.
   - Finally, download all samples from the [AudioSet Dataset](https://research.google.com/audioset/ontology/index.html) with the label `Animal` and save these into the same directory.

2. **Species-specific finetuning:** Finetuning can be performed on the openly available [Dominica Sperm Whale Project (DSWP)](https://huggingface.co/datasets/orrp/DSWP) dataset, available on Hugging Face.

With data in hand, navigate into `vampnet` and perform domain adaptation:

```bash
python vampnet/scripts/exp/fine_tune.py "training_data/domain_adaptation" domain_adapted && python vampnet/scripts/exp/train.py --args.load conf/generated/domain_adapted/coarse.yml && python vampnet/scripts/exp/train.py --args.load conf/generated/domain_adapted/c2f.yml
```

Then fine-tune the domain-adapted model. Create the config file with the command:

```bash
python vampnet/scripts/exp/fine_tune.py "training_data/species_specific_finetuning" fine-tuned
```

To select which weights to use as a checkpoint, change `fine_tune_checkpoint` in `conf/generated/fine-tuned/[c2f/coarse].yml` to `./runs/domain_adaptation/[coarse/c2f]/[checkpoint]/vampnets/weights.pth`. `[checkpoint]` can be `latest` to use the last saved checkpoint from the previous run, though it is recommended to manually verify the quality of generations across several checkpoints, since overtraining often degrades audio quality, especially with smaller datasets. After making that change, run:

```bash
python vampnet/scripts/exp/train.py --args.load conf/generated/fine-tuned/coarse.yml && python vampnet/scripts/exp/train.py --args.load conf/generated/fine-tuned/c2f.yml
```
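For reference, the checkpoint edit described above amounts to a one-line change in the generated config; an illustrative fragment (only the relevant key is shown, and the surrounding file contents are omitted):

```yaml
# conf/generated/fine-tuned/coarse.yml (illustrative fragment)
# Point fine-tuning at the last saved domain-adaptation checkpoint:
fine_tune_checkpoint: ./runs/domain_adaptation/coarse/latest/vampnets/weights.pth
```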

After following these steps, you should be able to generate audio in the browser by running:

```bash
python app.py --args.load vampnet/conf/generated/fine-tuned/interface.yml
```

**Note**: The coarse and fine weights can be trained separately if compute allows. In this case, you would run the two scripts:

```bash
python vampnet/scripts/exp/train.py --args.load conf/generated/[fine-tuned/domain_adapted]/coarse.yml
```

```bash
python vampnet/scripts/exp/train.py --args.load conf/generated/[fine-tuned/domain_adapted]/c2f.yml
```

After both have finished running, ensure that both resulting weight files are copied into the same copy of WhAM.

## Testing Data

1. **Marine Mammal Data:**
   Download audio samples from the [WMMS 'Best Of' Cut](https://whoicf2.whoi.edu/science/B/whalesounds/index.cfm). Save them under `data/testing_data/marine_mammals/data/[SPECIES_NAME]`.
   * `[SPECIES_NAME]` must match the species names found in `wham/generation/prompt_configs.py`.

2. **Sperm Whale Codas:**
   To evaluate on sperm whale codas, you can use the openly available [DSWP](https://huggingface.co/datasets/orrp/DSWP) dataset.

3. **Artificial Beeps:**
   Generate artificial beeps for the experiments by running `data/generate_beeps.sh`.
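
The beep script itself is not reproduced here; as an illustration of what a synthetic beep stimulus looks like, here is a minimal numpy sketch (the frequency, duration, and fade values are arbitrary choices, not the script's actual parameters):

```python
import numpy as np

def make_beep(freq_hz=1000.0, dur_s=0.1, sr=16000, amp=0.5):
    # A single sine beep with a short linear fade in/out to avoid clicks.
    t = np.arange(int(dur_s * sr)) / sr
    beep = amp * np.sin(2 * np.pi * freq_hz * t)
    fade = min(64, len(beep) // 2)
    env = np.ones_like(beep)
    env[:fade] = np.linspace(0.0, 1.0, fade)
    env[-fade:] = np.linspace(1.0, 0.0, fade)
    return beep * env

beep = make_beep()
print(len(beep), float(np.abs(beep).max()) <= 0.5)  # → 1600 True
```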

## Reproducing Paper Results

Note: Access to the DSWP+CETI annotated data is required to reproduce all results; as of the time of publication, only part of this data is publicly available. Still, we include the following code, as our evaluation pipeline may be useful to other researchers.

### 1. Downstream Classification Tasks

To reproduce **Table 1** (Classification Accuracies) and **Figure 7** (Ablation Study):

**Table 1 Results:**
```bash
cd wham/embedding
./downstream_tasks.sh
```
* Runs all downstream classification tasks.
* **Baselines:** Run once.
* **Models (AVES, VampNet):** Run over 3 random seeds; reports mean and standard deviation.
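
The shell script's internals are not shown here; as a shape-level illustration of evaluating frozen embeddings on a downstream task, here is a minimal nearest-centroid probe on synthetic data (the real tasks use the project's own classifiers and features, not this sketch):

```python
import numpy as np

def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    # Fit one centroid per class on frozen embeddings, then classify
    # test embeddings by nearest centroid (Euclidean distance).
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=-1)
    preds = classes[np.argmin(dists, axis=1)]
    return float(np.mean(preds == test_y))

rng = np.random.default_rng(0)
# Two well-separated synthetic "embedding" clusters standing in for classes.
x0 = rng.normal(0.0, 0.3, size=(50, 16))
x1 = rng.normal(3.0, 0.3, size=(50, 16))
x = np.vstack([x0, x1])
y = np.array([0] * 50 + [1] * 50)
print(nearest_centroid_accuracy(x, y, x, y))  # → 1.0
```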

**Figure 7 Results (Ablation):**
```bash
cd wham/embedding
./downstream_ablation.sh
```
* Outputs accuracy scores for the ablation variants (averaged across 3 seeds, with error bars).

### 2. Generative Metrics

**Figure 12: Fréchet Audio Distance (FAD) Scores**
Calculate the distance between WhAM's generated results and real codas:
```bash
# Calculate for all species
bash wham/generation/eval/calculate_FAD.sh

# Calculate for a single species
bash wham/generation/eval/calculate_FAD.sh [species_name]
```
* *Runtime:* ~3 hours on an NVIDIA A10 GPU.

**Figure 3: FAD with Custom/BirdNET Embeddings**
To compare against other embeddings:
1. Convert your `.wav` files to `.npy` embeddings.
2. Place raw coda embeddings in `data/testing_data/coda_embeddings`.
3. Place comparison embeddings in subfolders within `data/testing_data/comparison_embeddings`.
4. Run:
```bash
python wham/generation/eval/calculate_custom_fad.py
```
*For BirdNET embeddings, refer to the [official repo](https://github.com/BirdNET-Team/BirdNET-Analyzer).*
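
As a reference for the metric itself (a sketch, not the script's actual implementation), FAD models each embedding set as a multivariate Gaussian and computes the Fréchet distance between the two. A minimal numpy version over `.npy`-style embedding arrays:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition (clipping tiny negative eigenvalues).
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_audio_distance(emb_a, emb_b):
    # emb_*: (n_samples, dim) arrays of embeddings.
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # tr(sqrt(Ca @ Cb)) computed symmetrically as sqrt(sqrt(Ca) Cb sqrt(Ca)).
    sqrt_a = _sqrtm_psd(cov_a)
    covmean = _sqrtm_psd(sqrt_a @ cov_b @ sqrt_a)
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(size=(256, 8))
# Shifting every embedding by 1.0 in 8 dims leaves covariance unchanged,
# so the distance is the squared mean shift: 8.0.
print(round(frechet_audio_distance(a, a + 1.0), 4))  # → 8.0
```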

**Table 2: Embedding Type Ablation**
Calculate distances between raw codas, denoised versions, and noise profiles:
```bash
bash wham/generation/eval/FAD_ablation.sh
```
* *Prerequisites:* Ensure `data/testing_data/ablation/noise` and `data/testing_data/ablation/denoised` are populated.
* *Runtime:* ~1.5 hours on an NVIDIA A10 GPU.

**Figure 13: Tokenizer Reconstruction**
Test the mean squared reconstruction error:
```bash
bash wham/generation/eval/evaluate_tokenizer.sh
```
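
The underlying metric is the mean squared error between each original waveform and its tokenize-then-decode reconstruction; a minimal numpy sketch of the metric itself (the script's actual audio loading and tokenization are not shown):

```python
import numpy as np

def reconstruction_mse(original, reconstructed):
    # Mean squared error between an original waveform and its
    # decoded (tokenize -> detokenize) reconstruction.
    original = np.asarray(original, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    return float(np.mean((original - reconstructed) ** 2))

t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440.0 * t)  # 440 Hz test tone as a stand-in waveform
noisy = clean + 0.01 * np.random.default_rng(0).normal(size=t.shape)
print(reconstruction_mse(clean, noisy))  # ≈ 1e-4
```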

---

## Citation

@@ -207,4 +26,3 @@ Please use the following citation if you use this code, model or data.

```bibtex
@inproceedings{wham2025,
  ...
      on Neural Information Processing Systems 2025, NeurIPS 2025, San Diego, CA, USA},
  year={2025}
}
```
The new README content:

---

# WhAM: a Whale Acoustics Model

WhAM is a transformer-based audio-to-audio model designed to synthesize and analyze sperm whale codas. It uses masked acoustic token modeling to capture the unique temporal and spectral features of whale communication.

For full technical details, installation instructions, and training scripts, please visit the official repository: [https://github.com/Project-CETI/wham](https://github.com/Project-CETI/wham).

## Citation

If you use this code, model, or data in your research, please cite our [NeurIPS 2025](https://openreview.net/pdf?id=IL1wvzOgqD) publication:

```bibtex
@inproceedings{wham2025,
  ...
      on Neural Information Processing Systems 2025, NeurIPS 2025, San Diego, CA, USA},
  year={2025}
}
```
packages.txt DELETED
@@ -1 +0,0 @@
- ffmpeg
vampnet/app.py CHANGED
@@ -691,4 +691,4 @@ with gr.Blocks() as demo:
  )

- demo.launch(
+ demo.queue().launch(allowed_paths=[SCRIPT_DIR / "assets"])
vampnet/setup.py DELETED
@@ -1,44 +0,0 @@

```python
from setuptools import find_packages, setup

with open("README.md") as f:
    long_description = f.read()

setup(
    name="vampnet",
    version="0.0.1",
    classifiers=[
        "Intended Audience :: Developers",
        "Natural Language :: English",
        "Programming Language :: Python :: 3.7",
        "Topic :: Artistic Software",
        "Topic :: Multimedia",
        "Topic :: Multimedia :: Sound/Audio",
        "Topic :: Multimedia :: Sound/Audio :: Editors",
        "Topic :: Software Development :: Libraries",
    ],
    description="Generative Music Modeling.",
    long_description=long_description,
    long_description_content_type="text/markdown",
    author="Hugo Flores García, Prem Seetharaman",
    author_email="hfgacrcia@descript.com",
    url="https://github.com/hugofloresgarcia/vampnet",
    license="MIT",
    packages=find_packages(),
    setup_requires=[
        "Cython",
    ],
    install_requires=[
        "Cython",  # Added by WhAM because it seems to be needed by this repo?
        "torch",
        "pydantic==2.10.6",
        "argbind>=0.3.2",
        "numpy<1.24",
        "wavebeat @ git+https://github.com/hugofloresgarcia/wavebeat",
        "lac @ git+https://github.com/hugofloresgarcia/lac.git",
        "descript-audiotools @ git+https://github.com/hugofloresgarcia/audiotools.git",
        "gradio",
        "loralib",
        "torch_pitch_shift",
        "pyharp",
    ],
)
```