---
license: other
license_name: nscl-a2sb-and-polyform-nc
license_link: https://raw.githubusercontent.com/NVIDIA/diffusion-audio-restoration/refs/heads/main/LICENSE
tags:
- audio
- audio-restoration
- schrodinger-bridge
- diffusion
- festival-audio
- non-commercial
library_name: pytorch
pipeline_tag: audio-to-audio
---
# Soundboard
A Schrödinger Bridge denoiser fine-tuned for music-recording restoration:
it recovers a soundboard-style mix from heavily corrupted audience recordings
(room reverb + audience-mic blend + lossy codec artifacts).
Fine-tuned from NVIDIA's
[A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
(`twosplit_0.5_1.0` split) using a synthetic-corruption training pipeline
driven by **profile-based augmentation**: corruption parameters are calibrated
from real (clean, festival-recording) pairs and sampled at training time
from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
for the full corruption chain, profiling, and training scaffold.
## Quick facts
| | |
|---|---|
| Architecture | AttnUNetF (565.5M params) |
| Audio format | 44.1 kHz, 2-channel, 32-bit float |
| Segment length | 130,560 samples (2.96 s) |
| STFT | n_fft=2048, hop=512, window=hann |
| Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
| Trained at step | 50,000 |
| Base checkpoint | NVIDIA A2SB `twosplit_0.5_1.0` |
| Checkpoint size | 2.1 GB |
| Diffusion | Schrödinger Bridge, β_max=1.0 |
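The representation row above can be sketched with plain `torch.stft`. This is a hedged approximation from the table's STFT parameters only; the exact padding, windowing, and normalisation choices live in Locutius's `WaveformToInput` and may differ:

```python
import torch

N_FFT, HOP, SEG = 2048, 512, 130560  # STFT settings and segment length from the table

def waveform_to_input(wav: torch.Tensor) -> torch.Tensor:
    """Map a (channels, samples) waveform to the 3-channel
    [mag**0.25, cos(phase), sin(phase)] spectrogram representation."""
    spec = torch.stft(
        wav,
        n_fft=N_FFT,
        hop_length=HOP,
        window=torch.hann_window(N_FFT),
        return_complex=True,
    )
    mag, phase = spec.abs(), spec.angle()
    # Compressed magnitude plus a unit-circle phase encoding.
    return torch.stack([mag ** 0.25, phase.cos(), phase.sin()], dim=1)

x = waveform_to_input(torch.randn(2, SEG))  # stereo segment
print(x.shape)  # torch.Size([2, 3, 1025, 256])
```

With `n_fft=2048` and `hop=512`, a 130,560-sample segment yields 1025 frequency bins by 256 frames per channel.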
## Usage
Load with the [Locutius](https://github.com/protodotdesign/locutius)
training package:
```python
import torch
from huggingface_hub import hf_hub_download

from locutius_train.config import TrainConfig
from locutius_train.network import AttnUNetF, SinusoidalTemporalEmbedding
from locutius_train.diffusion import Diffusion
from locutius_train.representation import WaveformToInput, InputToWaveform
from locutius_train.restore import restore_spectrogram

# Download the fine-tuned checkpoint and load the raw state dict.
ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Rebuild the network with the training hyperparameters, then load the weights.
cfg = TrainConfig()
model = AttnUNetF(
    n_updown_levels=cfg.model.n_updown_levels,
    in_channels=cfg.model.in_channels,
    hidden_channels=list(cfg.model.hidden_channels),
    out_channels=cfg.model.out_channels,
    emb_channels=cfg.diffusion.n_timestep_channels,
    band_embedding_dim=cfg.model.band_embedding_dim,
    n_attn_heads=cfg.model.n_attn_heads,
    attention_levels=list(cfg.model.attention_levels),
    use_attn_input_norm=cfg.model.use_attn_input_norm,
    num_res_blocks=cfg.model.num_res_blocks,
).to("cuda").eval()
model.load_state_dict(sd["model"])
```
See `restore.py` in the Locutius repo for a complete CLI that takes a
clean source, applies the calibrated festival-corruption profile, and
runs the reverse Schrödinger Bridge to produce a restored output.
## Calibrated corruption profile
This model was trained against a single calibrated profile recovered
from a real (studio FLAC, festival M4A) pair via per-kick local
Wiener deconvolution. The profile is bundled in `profile.json`:
```json
{
  "name": "edc_festival",
  "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
  "delay_ms_range": [15.0, 25.0],
  "studio_gain_range": [0.6, 0.7],
  "room_gain_range": [0.55, 0.65]
}
```
Each training-step corruption draws fresh values from these ranges,
so the model has been exposed to ~50,000 distinct delay/blend
combinations within the same venue character.
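That per-step draw amounts to uniform sampling within each calibrated range. A minimal sketch, assuming uniform draws; the `sample_corruption` helper and its output keys are illustrative, not the Locutius API:

```python
import random

def sample_corruption(profile: dict, rng: random.Random) -> dict:
    """Draw one fresh corruption parameterisation from the calibrated ranges."""
    draw = lambda lo_hi: rng.uniform(*lo_hi)
    return {
        "delay_ms": draw(profile["delay_ms_range"]),
        "studio_gain": draw(profile["studio_gain_range"]),
        "room_gain": draw(profile["room_gain_range"]),
    }

# Ranges mirror the bundled profile.json.
profile = {
    "delay_ms_range": [15.0, 25.0],
    "studio_gain_range": [0.6, 0.7],
    "room_gain_range": [0.55, 0.65],
}
params = sample_corruption(profile, random.Random(0))
```

One such draw per training step over 50,000 steps gives the ~50,000 distinct delay/blend combinations mentioned above.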
## Training data
Trained on a focused subset of electronic music FLACs. **No festival
recordings or other licensed audio were stored or distributed** —
only the studio source material was used; festival-corrupted versions
were synthesized on-the-fly from the calibrated profile during each
training step.
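The on-the-fly corruption can be approximated as a delayed, IR-convolved "room" path blended with the dry studio signal, using the sampled gains and delay. A hedged torch sketch under those assumptions; the real Locutius chain also includes the lossy-codec stage, omitted here:

```python
import torch
import torch.nn.functional as F

def corrupt(studio: torch.Tensor, ir: torch.Tensor, delay_samples: int,
            studio_gain: float, room_gain: float) -> torch.Tensor:
    """Blend a (channels, samples) studio signal with a delayed,
    reverberant room copy (illustrative approximation only)."""
    t = studio.shape[-1]
    # Room path: true convolution of each channel with the venue impulse
    # response (conv1d is cross-correlation, so flip the kernel).
    kernel = ir.flip(-1).view(1, 1, -1)
    room = F.conv1d(
        studio.unsqueeze(1), kernel, padding=ir.shape[-1] - 1
    ).squeeze(1)[..., :t]
    # Delay the room path relative to the dry signal.
    room = F.pad(room, (delay_samples, 0))[..., :t]
    return studio_gain * studio + room_gain * room

# One second of stereo noise through a toy 512-tap IR, ~20 ms delay at 44.1 kHz.
out = corrupt(torch.randn(2, 44100), torch.rand(512),
              delay_samples=882, studio_gain=0.65, room_gain=0.6)
```

Because only the clean source and the IR file are needed as inputs, no corrupted audio ever has to be stored.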
## Limitations
- **Single profile**: trained against one calibrated venue (`edc_festival`).
Performance on festival recordings from very different venues / mix
chains will degrade.
- **Electronic music bias**: training set was EDM-heavy. Restoration
quality on rock, classical, or vocal-led material may be uneven.
- **No crowd-noise model**: the calibrated profile did not include
  additive crowd noise (no real crowd recordings were available
  during calibration). Recordings with heavy crowd vocals may show
  residual artifacts.
- **Non-commercial use only** — see the license below.
## License
Dual non-commercial license:
- [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
license inherited from the A2SB base checkpoint)
- [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
on top, source-availability + patent retaliation)
You must comply with **both** licenses. Use is restricted to research
and evaluation only — no commercial use is permitted. See
[LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
for the full plain-English breakdown.
## Citation
If you use this model in research, please cite the upstream A2SB paper
and reference this fine-tune:
```bibtex
@misc{soundboard,
title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
author={Locutius},
year={2026},
howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
}
``` |