---
license: cc-by-nc-4.0
tags:
- audio
- rave
- timbre-transfer
- neural-synthesis
- ircam
- maestro
language:
- en
pipeline_tag: audio-to-audio
---

# RAVE — AEmotionStudio mirror

Curated mirror of public **RAVE** (Realtime Audio Variational autoEncoder) checkpoints, used
by MAESTRO's RAVE Timbre Transfer panel (opt-in starter pack). Sources:

- The [Intelligent-Instruments-Lab/rave-models](https://huggingface.co/Intelligent-Instruments-Lab/rave-models) curated set (birds, voices, organs, water, etc.).
- The [official ACIDS-IRCAM public catalog](https://acids-ircam.github.io/rave_models_download.html), pulled from the canonical anonymous API at `https://play.forum.ircam.fr/rave-vst-api/get_available_models`.

RAVE was developed by [Antoine Caillon](https://caillonantoine.github.io/) and the
[ACIDS team at IRCAM](https://www.ircam.fr/). Paper: [arXiv:2111.05011](https://arxiv.org/abs/2111.05011).
Upstream code: [acids-ircam/RAVE](https://github.com/acids-ircam/RAVE).

## License

**CC-BY-NC-4.0** — non-commercial use only, inherited from the upstream distributions.
Generated audio is fine for non-commercial use. Commercial use of the *models themselves*
(e.g. shipping them inside a paid product) requires permission from the original authors / IRCAM.

Per MAESTRO's stance (see `LICENSE_AUDIT.md` and the `feedback_download_on_demand_licensing`
memory), these weights are fetched *on demand* by the end user, so the user (not the MAESTRO
binary) is the licensee.

---

## Models — IIL-curated set (b2048 streaming exports, 18 models)

Each `.ts` checkpoint has a `<stem>.json` sidecar with name, license, sample-rate, latent-dim,
source URL, and a one-line description.
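
As a minimal sketch of how the sidecars can be consumed, the helper below derives the `<stem>.json` path from a checkpoint path and loads it. The function names `sidecar_path` and `load_sidecar` are hypothetical, and no key names are assumed; inspect the returned dictionary to see the fields a given sidecar actually carries.

```python
import json
from pathlib import Path


def sidecar_path(checkpoint: str) -> Path:
    """Return the path of the `<stem>.json` sidecar next to a `.ts` checkpoint."""
    return Path(checkpoint).with_suffix(".json")


def load_sidecar(checkpoint: str) -> dict:
    """Load the sidecar metadata for a checkpoint as a plain dictionary."""
    with open(sidecar_path(checkpoint), encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `load_sidecar("magnets_b2048_r48000_z8.ts")` reads `magnets_b2048_r48000_z8.json` from the same directory.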

### Voice / speech
- `voice_vocalset_b2048_r48000_z16.ts` — **Voice (VocalSet)**. Voice timbre trained on the VocalSet corpus — covers vocal techniques across multiple singers. Use for the canonical 'make this sound like a voice' transfer.
- `voice-multi-b2048-r48000-z11.ts` — **Voice (Multi-speaker)**. Aggregated multi-speaker voice corpus. Wider speaker diversity than VocalSet — produces more 'average human' renders.
- `voice_hifitts_b2048_r48000_z16.ts` — **Voice (HiFi-TTS)**. High-fidelity expressive English speech corpus. Cleaner, more articulate than the multi-speaker model.
- `voice_jvs_b2048_r44100_z16.ts` — **Voice (JVS, Japanese)**. JVS Japanese multi-speaker corpus at 44.1 kHz. Use for Japanese-language sources or non-Latin phoneme structure.
- `voice_vctk_b2048_r44100_z22.ts` — **Voice (VCTK, English)**. VCTK English multi-speaker corpus from CSTR Edinburgh, 44.1 kHz. High 22-dim latent — captures more speaker idiosyncrasies.

### Bird / wildlife
- `birds_motherbird_b2048_r48000_z16.ts` — **Birds (Motherbird)**. Bird-vocalization corpus — chirps + textural transients. The canonical 'weird' pick: produces wildly warped output for any arbitrary input.
- `birds_dawnchorus_b2048_r48000_z8.ts` — **Birds (Dawn Chorus)**. Dense overlapping bird vocalizations recorded at dawn. Smaller 8-dim latent — outputs lean ensemble-textural over individual calls.
- `birds_pluma_b2048_r48000_z12.ts` — **Birds (Pluma)**. Lighter, individual bird-call timbres. Mid-size 12-dim latent balances character + clarity.
- `humpbacks_pondbrain_b2048_r48000_z20.ts` — **Humpback Whales**. Humpback-whale song. Long, slow, hauntingly-deep vocal contours — pairs well with sustained input.
- `marinemammals_pondbrain_b2048_r48000_z20.ts` — **Marine Mammals**. Mixed marine-mammal vocalizations — dolphins, orcas, sea-life clicks and cries.

### Instruments
- `guitar_iil_b2048_r48000_z16.ts` — **Guitar (IIL)**. Acoustic / electric guitar timbre. Good demo for transferring voice or synth input into a plucked-string voice.
- `organ_bach_b2048_r48000_z16.ts` — **Organ (Bach)**. Pipe-organ timbre trained on Bach repertoire. Sustained harmonic textures — pairs well with melodic input.
- `organ_archive_b2048_r48000_z16.ts` — **Organ (Archive)**. Historical pipe-organ recordings — broader, dustier textures than the Bach model. Good for film-score atmospheres.
- `sax_soprano_franziskaschroeder_b2048_r48000_z20.ts` — **Soprano Sax (Schroeder)**. Soprano-saxophone extended techniques by Franziska Schroeder. Multiphonics, growls, key clicks. 20-dim latent — captures fine-grained articulation.
- `mrp_strengjavera_b2048_r44100_z16.ts` — **Magnetic Resonator Piano (Strengjavera)**. Sustained metallic-string overtones produced by electromagnetically driving piano strings — 44.1 kHz.
- `crozzoli_bigensemblesmusic_18d.ts` — **Big Ensemble Music (Crozzoli)**. Big-ensemble orchestral music (M. Crozzoli). Broad 18-dim latent for hugely-textured renders. Sample rate not embedded in filename — defaults to 48 kHz.

### Textures / environment
- `water_pondbrain_b2048_r48000_z16.ts` — **Water (PondBrain)**. Water / aquatic textures. Treats any input as if it were running through liquid — bubbles, ripples, splashes.
- `magnets_b2048_r48000_z8.ts` — **Magnets**. Ferromagnetic / electromagnetic resonance textures — metallic hums, distant industrial buzz, magnetized-string ringing.
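
The filenames above follow a loose `b<N>` / `r<rate>` / `z<dim>` convention, with either `_` or `-` as separator. The sketch below pulls those values out of a filename. It assumes that `b` denotes the streaming block size (the key name `block_size` is a guess), treats the `18d` suffix as a latent dimension, and falls back to 48 kHz when no rate is present, as noted for the Crozzoli model. The sidecar JSON remains the authoritative source; `parse_rave_filename` is a hypothetical helper.

```python
import re
from pathlib import Path

# Fallback used when the sample rate is not embedded in the filename.
DEFAULT_SAMPLE_RATE = 48_000


def parse_rave_filename(filename: str) -> dict:
    """Extract block size, sample rate and latent dimension from a checkpoint name."""
    stem = Path(filename).stem

    def field(prefix: str):
        # Match e.g. "_r48000" or "-z11" as a whole underscore/hyphen-delimited token.
        match = re.search(rf"(?:^|[_-]){prefix}(\d+)(?=[_-]|$)", stem)
        return int(match.group(1)) if match else None

    latent_dim = field("z")
    if latent_dim is None:
        # Alternative spelling seen in the list above: "..._18d".
        match = re.search(r"(?:^|[_-])(\d+)d(?=[_-]|$)", stem)
        latent_dim = int(match.group(1)) if match else None

    return {
        "block_size": field("b"),  # assumption: "b" is the streaming block size
        "sample_rate": field("r") or DEFAULT_SAMPLE_RATE,
        "latent_dim": latent_dim,
    }
```

Matching the sample rate matters in practice: audio fed to a 44.1 kHz model such as `voice_jvs_b2048_r44100_z16.ts` should be resampled to 44.1 kHz first.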

---

## Models — ACIDS public catalog (10 models, mirrored 2026-05-18)

Pulled from the canonical anonymous-download endpoint `https://play.forum.ircam.fr/rave-vst-api/get_model/<slug>`.
Each `.ts` has a matching `<slug>.json` sidecar in the same schema as the IIL set.

| Slug | Display name | Type | Author | Year | Size | Prior |
|---|---|---|---|---|---|---|
| `VCTK` | VCTK (English Speech) | RAVE v1 (default) | Jb Dupuy | 2022 | 177 MB | ✓ |
| `darbouka_onnx` | Darbouka (Percussion) | RAVE v2 (ONNX) | Antoine Caillon | 2022 | 26 MB | – |
| `nasa` | NASA Apollo 11 | RAVE v1 (default) | Antoine Caillon | 2022 | 159 MB | ✓ |
| `percussion` | Percussion (Mixed) | RAVE v1 (default) | Antoine Caillon | 2022 | 71 MB | ✓ |
| `vintage` | Vintage Music | RAVE v1 (large) | Antoine Caillon | 2022 | 482 MB | ✓ |
| `isis` | ISiS (IRCAM Vocal DB) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| `musicnet` | MusicNet (Classical) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 237 MB | ✓ |
| `sol_ordinario` | Studio OnLine (Ordinario) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| `sol_full` | Studio OnLine (Full) | RAVE v2 | A. Chemla–Romeu-Santos | 2023 | 149 MB | – |
| `sol_ordinario_fast` | Studio OnLine (Ordinario, fast) | RAVE v2 (small) | A. Chemla–Romeu-Santos | 2023 | 43 MB | – |

**ACIDS set total: ~1.6 GB across 10 models.**

> Note: `VCTK.ts` (ACIDS v1, 48 kHz, original 2022 release) and `voice_vctk_b2048_r44100_z22.ts`
> (IIL v2 retrain, 44.1 kHz) are *different* models trained on the same source corpus —
> keep both for comparison.
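
The endpoint pattern given for the ACIDS catalog can be turned into a small fetch helper. This is a sketch only: it assumes a plain anonymous GET on `get_model/<slug>` returns the raw `.ts` bytes, and the names `model_url` and `download_model` and the `models` directory are illustrative rather than part of MAESTRO or the upstream API.

```python
import shutil
from pathlib import Path
from urllib.parse import quote
from urllib.request import urlopen

API_BASE = "https://play.forum.ircam.fr/rave-vst-api"


def model_url(slug: str) -> str:
    """Build the download URL for a catalog slug such as "vintage"."""
    return f"{API_BASE}/get_model/{quote(slug, safe='')}"


def download_model(slug: str, dest_dir: str = "models") -> Path:
    """Stream one model to `<dest_dir>/<slug>.ts` and return the local path.

    Assumes the endpoint answers a plain GET with the TorchScript file itself.
    """
    dest = Path(dest_dir) / f"{slug}.ts"
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urlopen(model_url(slug)) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Calling `download_model("sol_ordinario_fast")` would fetch the smallest v2 model in the table (43 MB), which makes it a reasonable first test.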

---

## File format

Each `*.ts` is a [TorchScript](https://pytorch.org/docs/stable/jit.html) export of the RAVE model,
streaming-mode (causal convolutions, cached state) — ready for realtime or offline inference.

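```python
import torch

model = torch.jit.load("vintage.ts").eval()

# Placeholder input: mono audio shaped (batch, 1, samples) at the model's sample rate.
# Replace with real audio; 2**16 samples is a multiple of the 2048-sample block size.
audio = torch.randn(1, 1, 2**16)

with torch.no_grad():
    z = model.encode(audio)  # (B, 1, T) -> latents
    y = model.decode(z)      # latents -> audio
```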

Models marked ✓ in the Prior column of the ACIDS table additionally ship a learned prior that can
generate latents autoregressively (see the [RAVE repo](https://github.com/acids-ircam/RAVE) for usage).

## Where to find more RAVE models

- [Neutone FX models](https://neutone.ai/fx/models) — community + curated `.nm` files (the Neutone wrapper format).
- [IRCAM Forum projects](https://forum.ircam.fr/) — individual user-submitted models; many require a Forum account.
- [acids-ircam GitHub releases](https://github.com/acids-ircam/RAVE/releases) — reference checkpoints from the maintainers.
- [IRCAM RAVE Model Challenge 2025](https://forum.ircam.fr/collections/detail/rave-model-challenge-models/) — 11 prize-winner / submission models gated behind a Forum account.

## Citation

```bibtex
@article{caillon2021rave,
  title={RAVE: A variational autoencoder for fast and high-quality neural audio synthesis},
  author={Caillon, Antoine and Esling, Philippe},
  journal={arXiv preprint arXiv:2111.05011},
  year={2021}
}
```