---
license: mit
tags:
- audio
- audio-enhancement
- speech-enhancement
- bandwidth-extension
- codec-repair
- neural-codec
- waveform-processing
- pytorch
library_name: pytorch
pipeline_tag: audio-to-audio
frameworks: PyTorch
language:
  - en
---
# Brontes: Synthesis-First Waveform Enhancement

**Brontes** is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model trained on diverse audio data.

## Model Description

Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a **synthesis-first architecture** with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.

### Key Capabilities

- **Neural codec repair** — removes compression artifacts from neural codec outputs
- **Bandwidth extension** — upsamples from 24 kHz to 48 kHz (2× extension)
- **Waveform-domain processing** — operates directly on audio samples, no spectrogram conversion
- **Synthesis-first design** — only the two deepest skips retained, preventing artifact leakage
- **LSTM bottleneck** — captures long-range temporal dependencies at maximum compression

### Model Architecture

- **Type:** Encoder-decoder U-Net with selective skip connections
- **Stages:** 6 encoder stages + 6 decoder stages (4096× total compression)
- **Bottleneck:** Bidirectional LSTM for temporal modeling
- **Parameters:** ~29M
- **Input:** 24 kHz mono audio (codec-degraded)
- **Output:** 48 kHz mono audio (enhanced)
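
To make the bullet points above concrete, here is a minimal, hypothetical PyTorch sketch of the overall shape: six stride-4 encoder stages (4^6 = 4096× compression), a bidirectional LSTM bottleneck, a mirrored decoder that keeps only the two deepest skip connections, and a final 2× upsampling stage for bandwidth extension. Layer widths, kernel sizes, and activations are illustrative guesses, not the actual Brontes implementation (which has ~29M parameters).

```python
import torch
import torch.nn as nn

class SynthesisFirstUNet(nn.Module):
    """Illustrative sketch of a synthesis-first U-Net:
    6 stride-4 encoder stages (4**6 = 4096x compression),
    a BiLSTM bottleneck, and only the two deepest skips retained."""

    def __init__(self, channels=(16, 32, 64, 128, 256, 512)):
        super().__init__()
        chs = (1,) + channels
        # Encoder: each stage compresses time by 4x
        self.encoder = nn.ModuleList(
            nn.Conv1d(chs[i], chs[i + 1], kernel_size=8, stride=4, padding=2)
            for i in range(6)
        )
        # BiLSTM bottleneck over the most compressed representation
        self.lstm = nn.LSTM(channels[-1], channels[-1] // 2,
                            batch_first=True, bidirectional=True)
        dec_chs = chs[::-1]  # (512, 256, 128, 64, 32, 16, 1)
        self.decoder = nn.ModuleList(
            nn.ConvTranspose1d(dec_chs[i], dec_chs[i + 1],
                               kernel_size=8, stride=4, padding=2)
            for i in range(6)
        )
        # Final 2x upsample: 24 kHz input -> 48 kHz output
        self.upsample = nn.ConvTranspose1d(1, 1, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        skips = []
        for enc in self.encoder:
            x = torch.relu(enc(x))
            skips.append(x)
        y, _ = self.lstm(x.transpose(1, 2))
        x = y.transpose(1, 2)
        for i, dec in enumerate(self.decoder):
            if i in (0, 1):  # only the two deepest skips are kept
                x = x + skips[-(i + 1)]
            x = dec(x)
            if i < 5:
                x = torch.relu(x)
        return self.upsample(x)
```

Dropping the shallow skips means fine detail cannot be copied straight from the degraded input; the decoder must synthesize it, which is the point of the synthesis-first design.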

## Intended Use

This is a **general pretrained model** trained on diverse audio data. For optimal performance on your specific use case:

⚠️ **We strongly recommend fine-tuning this model on your target dataset**, loading these weights with the `--pretrained` flag.

### Primary Use Cases

- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio

## Quick Start

For detailed usage instructions, training, and fine-tuning, please see the [GitHub repository](https://github.com/ZDisket/Brontes).

### Basic Inference Example

```python
import torch
import torchaudio
import yaml
from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```

Or use the command-line interface:

```bash
python infer_brontes.py \
  --config configs/config_brontes_48khz_demucs.yaml \
  --checkpoint path/to/checkpoint.pt \
  --input input.wav \
  --output output.wav
```

## Training Details

### Training Data

The model was trained on diverse audio data including:
- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers

### Training Procedure

- **Pretraining:** 10,000 steps of generator-only training
- **Adversarial training:** Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- **Loss functions:** Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- **Precision:** BF16 mixed precision
- **Framework:** PyTorch with custom training loop
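
As a rough illustration of the reconstruction term in the loss list above, here is a hedged sketch of a multi-resolution spectral loss (L1 between log-magnitude STFTs at several FFT sizes), a common stand-in for the multi-scale mel loss named above. The FFT sizes, hop lengths, and floor constant are illustrative, not Brontes' actual hyperparameters.

```python
import torch

def multiscale_spectral_loss(pred, target, ffts=(512, 1024, 2048)):
    """Average L1 distance between log-magnitude STFTs of `pred` and
    `target` at several resolutions. Inputs are (batch, samples)."""
    loss = 0.0
    for n_fft in ffts:
        win = torch.hann_window(n_fft, device=pred.device)
        def spec(x):
            return torch.stft(x, n_fft, hop_length=n_fft // 4,
                              window=win, return_complex=True).abs()
        loss = loss + torch.nn.functional.l1_loss(
            torch.log(spec(pred) + 1e-5), torch.log(spec(target) + 1e-5))
    return loss / len(ffts)
```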

## Fine-tuning Recommendations

To achieve best results on your specific dataset:

1. **Prepare paired data:** Input (degraded) and target (clean) audio pairs
2. **Use the `--pretrained` flag** to load model weights without optimizer state
3. **Train for 10-50k steps** depending on dataset size
4. **Monitor validation loss** to prevent overfitting

See the [repository README](https://github.com/ZDisket/Brontes) for detailed fine-tuning instructions.

## Limitations

- **Domain-specific performance:** The general pretrained model may not perform optimally on highly specialized audio; fine-tuning is recommended
- **Mono audio only:** Currently supports single-channel audio
- **Fixed sample rates:** Designed for 24 kHz input → 48 kHz output
- **Codec-specific artifacts:** Performance may vary across different codec types
- **Long-form audio:** Very long files may exceed available GPU memory and need to be processed in chunks
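
One way to work around the memory limitation in the last bullet is overlap-add chunking: enhance overlapping chunks independently and cross-fade the results. The sketch below is a generic, hypothetical helper (names, chunk sizes, and the window choice are assumptions, not part of the Brontes codebase); `enhance` stands in for a call that maps a 1-D chunk to a 2×-longer output, matching the 24 kHz → 48 kHz model.

```python
import numpy as np

def chunked_enhance(audio, enhance, chunk=65536, hop=61440):
    """Run `enhance` (1-D chunk -> 2x-longer 1-D output) over
    overlapping chunks and overlap-add the Hann-windowed results,
    normalizing by the accumulated window weight."""
    n = len(audio)
    out = np.zeros(2 * n, dtype=np.float64)
    weight = np.zeros(2 * n, dtype=np.float64)
    pos = 0
    while pos < n:
        seg = audio[pos:pos + chunk]
        enhanced = enhance(seg)               # length 2 * len(seg)
        w = np.hanning(len(enhanced)) + 1e-8  # taper chunk edges
        start = 2 * pos                       # output runs at 2x rate
        out[start:start + len(enhanced)] += enhanced * w
        weight[start:start + len(enhanced)] += w
        pos += hop
    return out / weight
```

Windowing plus weight normalization keeps the cross-fade smooth at chunk boundaries; with hop close to the chunk size, each sample is reprocessed at most twice.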

## Ethical Considerations

- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts


## License

Both the model weights and code are released under the MIT License.

## Additional Resources

- **GitHub Repository:** [https://github.com/ZDisket/Brontes](https://github.com/ZDisket/Brontes)
- **Technical Report:** See the repository
- **Issues & Support:** [GitHub Issues](https://github.com/ZDisket/Brontes/issues)

## Acknowledgments

Compute resources provided by Hot Aisle and AI at AMD.