---
license: other
license_name: nscl-a2sb-and-polyform-nc
license_link: https://raw.githubusercontent.com/NVIDIA/diffusion-audio-restoration/refs/heads/main/LICENSE
tags:
- audio
- audio-restoration
- schrodinger-bridge
- diffusion
- festival-audio
- non-commercial
library_name: pytorch
pipeline_tag: audio-to-audio
---

# Soundboard

Schrödinger Bridge denoiser fine-tuned for musical recording audio restoration —
recovers a soundboard-style mix from heavily corrupted audience recordings
(room reverb + audience-mic blend + lossy codec artifacts).

Fine-tuned from NVIDIA's
[A2SB](https://huggingface.co/nvidia/audio_to_audio_schrodinger_bridge)
(`twosplit_0.5_1.0` split) on a synthetic-corruption training pipeline driven
by **profile-based augmentation** — corruption parameters are calibrated
from real (clean, festival-recording) pairs and sampled at training time
from the recovered distribution. See [Locutius](https://github.com/protodotdesign/locutius)
for the full corruption chain, profiling, and training scaffold.

## Quick facts

| Property | Value |
|---|---|
| Architecture | AttnUNetF (565.5M params) |
| Audio format | 44.1 kHz, 2-channel, 32-bit float |
| Segment length | 130,560 samples (≈2.96 s) |
| STFT | n_fft=2048, hop=512, window=hann |
| Representation | 3-channel `[mag^0.25, cos(phase), sin(phase)]` |
| Trained at step | 50,000 |
| Base checkpoint | NVIDIA A2SB `twosplit_0.5_1.0` |
| Checkpoint size | 2.1 GB |
| Diffusion | Schrödinger Bridge, β_max=1.0 |
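
For reference, the spectrogram representation in the table can be sketched with plain `torch.stft`. This is an illustrative stand-in for the repo's `locutius_train.representation.WaveformToInput`, not its actual implementation:

```python
import torch

def to_model_input(wav: torch.Tensor, n_fft: int = 2048, hop: int = 512) -> torch.Tensor:
    """Illustrative sketch: compressed magnitude plus phase as a unit
    vector, stacked into (channels, 3, n_fft // 2 + 1, frames)."""
    window = torch.hann_window(n_fft)
    spec = torch.stft(wav, n_fft=n_fft, hop_length=hop, window=window,
                      return_complex=True)
    mag = spec.abs() ** 0.25  # mag^0.25 compresses the dynamic range
    phase = spec.angle()
    return torch.stack([mag, torch.cos(phase), torch.sin(phase)], dim=1)

x = to_model_input(torch.randn(2, 130560))  # one stereo training segment
print(x.shape)  # torch.Size([2, 3, 1025, 256])
```

Encoding phase as `(cos, sin)` keeps the input continuous across the ±π wrap, which is why the card lists three channels rather than magnitude plus a raw phase angle.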

## Usage

Load with the [Locutius](https://github.com/protodotdesign/locutius)
training package:

```python
import torch
from huggingface_hub import hf_hub_download
from locutius_train.config import TrainConfig
from locutius_train.network import AttnUNetF

# Download the fine-tuned checkpoint from the Hub
ckpt_path = hf_hub_download(repo_id="protodotdesign/Soundboard", filename="model.pt")
sd = torch.load(ckpt_path, map_location="cuda", weights_only=False)

# Rebuild the network from the training configuration, then load the weights
cfg = TrainConfig()
model = AttnUNetF(
    n_updown_levels=cfg.model.n_updown_levels,
    in_channels=cfg.model.in_channels,
    hidden_channels=list(cfg.model.hidden_channels),
    out_channels=cfg.model.out_channels,
    emb_channels=cfg.diffusion.n_timestep_channels,
    band_embedding_dim=cfg.model.band_embedding_dim,
    n_attn_heads=cfg.model.n_attn_heads,
    attention_levels=list(cfg.model.attention_levels),
    use_attn_input_norm=cfg.model.use_attn_input_norm,
    num_res_blocks=cfg.model.num_res_blocks,
).to("cuda").eval()
model.load_state_dict(sd["model"])
```

See `restore.py` in the Locutius repo for a complete CLI that takes a
clean source, applies the calibrated festival-corruption profile, and
runs the reverse Schrödinger Bridge to produce a restored output.

## Calibrated corruption profile

This model was trained against a single calibrated profile recovered
from a real (studio FLAC, festival M4A) pair via per-kick local
Wiener deconvolution. The profile is bundled in `profile.json`:

```json
{
  "name": "edc_festival",
  "ir_path": "../impulses/EchoThief/Brutalism/San Diego Supercomputer Center Outdoor Patio California.wav",
  "delay_ms_range": [15.0, 25.0],
  "studio_gain_range": [0.6, 0.7],
  "room_gain_range": [0.55, 0.65]
}
```

Each training-step corruption draws fresh values from these ranges,
so the model has been exposed to ~50,000 distinct delay/blend
combinations within the same venue character.
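
That per-step draw amounts to uniform sampling from each `*_range` field in the profile. A self-contained sketch, where the helper name is hypothetical but the field names match `profile.json` above:

```python
import random

profile = {
    "name": "edc_festival",
    "delay_ms_range": [15.0, 25.0],
    "studio_gain_range": [0.6, 0.7],
    "room_gain_range": [0.55, 0.65],
}

def sample_corruption_params(profile: dict, rng: random.Random) -> dict:
    """Hypothetical helper: draw one fresh parameter set, uniformly
    from each calibrated range, as done once per training step."""
    return {
        key.removesuffix("_range"): rng.uniform(*bounds)
        for key, bounds in profile.items()
        if key.endswith("_range")
    }

params = sample_corruption_params(profile, random.Random(0))
# each sampled value falls inside its calibrated range
```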

## Training data

Trained on a focused subset of electronic music FLACs. **No festival
recordings or other licensed audio were stored or distributed** —
only the studio source material was used; festival-corrupted versions
were synthesized on the fly from the calibrated profile during each
training step.

## Limitations

- **Single profile**: trained against one calibrated venue (`edc_festival`).
  Performance on festival recordings from very different venues / mix
  chains will degrade.
- **Electronic music bias**: the training set was EDM-heavy. Restoration
  quality on rock, classical, or vocal-led material may be uneven.
- **No crowd-noise model**: the calibrated profile did not include
  additive crowd noise (no real crowd recordings were available
  during calibration). Recordings with heavy crowd vocals may retain
  residual artifacts.
- **Non-commercial use only** — see the license below.

## License

Dual non-commercial license:

- [NVIDIA Source Code License for A2SB](LICENSE.NSCL-A2SB) (the upstream
  license inherited from the A2SB base checkpoint)
- [PolyForm Noncommercial 1.0.0](LICENSE.PolyForm-NC) (additional terms
  on top: source availability + patent retaliation)

You must comply with **both** licenses. Use is restricted to research
and evaluation only — no commercial use is permitted. See
[LICENSING.md](https://github.com/protodotdesign/locutius/blob/main/LICENSING.md)
for the full plain-English breakdown.

## Citation

If you use this model in research, please cite the upstream A2SB paper
and reference this fine-tune:

```bibtex
@misc{soundboard,
  title={Soundboard: festival audio restoration via profile-calibrated Schrödinger Bridge fine-tuning},
  author={Locutius},
  year={2026},
  howpublished={\url{https://huggingface.co/protodotdesign/Soundboard}},
}
```