# Brontes: Synthesis-First Waveform Enhancement
Brontes is a time-domain audio enhancement model designed for neural codec repair and bandwidth extension. This is the general pretrained model trained on diverse audio data.
## Model Description
Brontes upsamples and repairs speech degraded by neural codec compression. Unlike conventional Wave U-Net approaches that rely on dense skip connections, Brontes uses a synthesis-first architecture with selective deep skips, forcing the model to actively reconstruct rather than copy degraded input details.
### Key Capabilities

- Neural codec repair – removes compression artifacts from neural codec outputs
- Bandwidth extension – upsamples from 24 kHz to 48 kHz (2× extension)
- Waveform-domain processing – operates directly on audio samples, no spectrogram conversion
- Synthesis-first design – only the two deepest skips retained, preventing artifact leakage
- LSTM bottleneck – captures long-range temporal dependencies at maximum compression
## Model Architecture

- Type: Encoder-decoder U-Net with selective skip connections
- Stages: 6 encoder stages + 6 decoder stages (4096× total compression)
- Bottleneck: Bidirectional LSTM for temporal modeling
- Parameters: ~29M
- Input: 24 kHz mono audio (codec-degraded)
- Output: 48 kHz mono audio (enhanced)
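The stage and compression figures above imply a per-stage stride. A quick back-of-the-envelope check, assuming uniform strides across the six encoder stages (the card does not state the strides explicitly):

```python
# Sanity check of the architecture numbers above.
# Assumption (not stated in the card): all six encoder stages use the
# same stride, so total compression = stride ** stages.
stages = 6
total_compression = 4096

stride = round(total_compression ** (1 / stages))
print(stride)            # 4
print(stride ** stages)  # 4096, matching the stated 4096x factor

# Bottleneck frame rate, if the 4096x factor applies to the 24 kHz input:
frames_per_sec = 24_000 / total_compression
print(round(frames_per_sec, 3))  # 5.859
```

At roughly six frames per second of audio at the bottleneck, the bidirectional LSTM sees a very short sequence even for long clips, which is what makes long-range temporal modeling cheap at that depth.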
## Intended Use

This is a general pretrained model trained on diverse audio data. For optimal performance on your specific use case:

⚠️ It is strongly recommended to fine-tune this model on your target dataset using the `--pretrained` flag.
### Primary Use Cases
- Repairing audio degraded by neural codecs (e.g., EnCodec, SoundStream, Lyra)
- Bandwidth extension from narrowband/wideband to fullband
- Speech enhancement and quality improvement
- Post-processing for codec-compressed audio
## Quick Start

For detailed usage instructions, training, and fine-tuning, please see the GitHub repository.

### Basic Inference Example
```python
import torch
import torchaudio
import yaml

from brontes import Brontes

# Setup device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Load config
with open('configs/config_brontes_48khz_demucs.yaml', 'r') as f:
    config = yaml.safe_load(f)

# Create model
model = Brontes(unet_config=config['model'].get('unet_config', {})).to(device)

# Load checkpoint
checkpoint = torch.load('path/to/checkpoint.pt', map_location=device)
model.load_state_dict(checkpoint['model'] if 'model' in checkpoint else checkpoint)
model.eval()

# Load audio
audio, sr = torchaudio.load('input.wav')
target_sr = config['dataset']['sample_rate']

# Resample if necessary
if sr != target_sr:
    resampler = torchaudio.transforms.Resample(sr, target_sr)
    audio = resampler(audio)

# Convert to mono and normalize
if audio.shape[0] > 1:
    audio = audio.mean(dim=0, keepdim=True)
max_val = audio.abs().max()
if max_val > 0:
    audio = audio / max_val

# Add batch dimension and process
audio = audio.unsqueeze(0).to(device)
with torch.no_grad():
    output, _, _, _ = model(audio)

# Save output
output = output.squeeze(0).cpu()
if output.abs().max() > 1.0:
    output = output / output.abs().max()
torchaudio.save('output.wav', output, target_sr)
```
Or use the command-line interface:
```bash
python infer_brontes.py \
    --config configs/config_brontes_48khz_demucs.yaml \
    --checkpoint path/to/checkpoint.pt \
    --input input.wav \
    --output output.wav
```
## Training Details

### Training Data
The model was trained on diverse audio data including:
- Clean speech recordings
- Codec-degraded audio pairs
- Various acoustic conditions and speakers
### Training Procedure
- Pretraining: 10,000 steps generator-only training
- Adversarial training: Multi-Period Discriminator (MPD) + Multi-Band Spectral Discriminator (MBSD)
- Loss functions: Multi-scale mel loss, pitch loss, adversarial loss, feature matching
- Precision: BF16 mixed precision
- Framework: PyTorch with custom training loop
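The two-phase schedule above (generator-only warm-up, then adversarial training) can be expressed as a simple step gate. This is an illustrative sketch, not the repository's actual training-loop code:

```python
def use_adversarial_loss(step: int, pretrain_steps: int = 10_000) -> bool:
    """Two-phase schedule from the card: generator-only losses (mel,
    pitch) for the first 10k steps, after which the adversarial and
    feature-matching losses from the MPD + MBSD discriminators
    switch on."""
    return step >= pretrain_steps

print(use_adversarial_loss(5_000))   # False: still in generator-only warm-up
print(use_adversarial_loss(10_000))  # True: discriminators now active
```

Warming up the generator before enabling the discriminators is a common stabilization trick in GAN vocoder training, since an untrained generator gives the discriminators a trivially easy task.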
## Fine-tuning Recommendations

To achieve the best results on your specific dataset:

- Prepare paired data: input (degraded) and target (clean) audio pairs
- Use the `--pretrained` flag to load model weights without optimizer state
- Train for 10k–50k steps, depending on dataset size
- Monitor validation loss to prevent overfitting
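What "load model weights without optimizer state" amounts to can be sketched in plain Python; the helper name `load_pretrained_weights` is hypothetical, and the actual `--pretrained` handling lives in the repository's training script:

```python
# Hypothetical sketch of what loading with --pretrained amounts to:
# take only the model weights from a full training checkpoint and drop
# optimizer state, so fine-tuning starts with a fresh optimizer.
# The checkpoint layout mirrors the inference example above
# (a dict that may or may not contain a 'model' key).
def load_pretrained_weights(checkpoint: dict) -> dict:
    """Return model weights only, discarding any optimizer state."""
    return checkpoint.get("model", checkpoint)

full_ckpt = {
    "model": {"encoder.0.weight": "...tensors..."},
    "optimizer": {"step": 120_000, "exp_avg": "...moments..."},
}
weights = load_pretrained_weights(full_ckpt)
print("optimizer" in weights)  # False: only the weights carry over
```

Discarding optimizer moments matters because stale Adam statistics from pretraining can push early fine-tuning updates in the wrong direction on a new dataset.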
See the repository README for detailed fine-tuning instructions.
## Limitations

- Domain-specific performance: General model may not perform optimally on highly specialized audio (fine-tuning recommended)
- Mono audio only: Currently supports single-channel audio
- Fixed sample rates: Designed for 24 kHz input → 48 kHz output
- Codec-specific artifacts: Performance may vary across different codec types
- Long-form audio: Very long audio files may require chunking or sufficient GPU memory
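For the long-form audio caveat above, one workable approach (a sketch, not part of the released code) is to split the input into overlapping chunks, run each through the model, and cross-fade the overlaps when stitching outputs. The index math alone, in plain Python:

```python
def chunk_bounds(n_samples: int, chunk: int, hop: int):
    """(start, end) index pairs for overlapping chunks covering n_samples.

    Choosing hop < chunk leaves an overlap of (chunk - hop) samples that
    can be cross-faded when stitching model outputs back together.
    """
    bounds = []
    start = 0
    while start < n_samples:
        bounds.append((start, min(start + chunk, n_samples)))
        if start + chunk >= n_samples:
            break
        start += hop
    return bounds

# Example: a 10-second file at 24 kHz, 4-second chunks, 1-second overlap
sr = 24_000
print(chunk_bounds(10 * sr, 4 * sr, 3 * sr))
# [(0, 96000), (72000, 168000), (144000, 240000)]
```

Each chunk would be run through the model independently (as in the inference example) and the overlapping second cross-faded, which avoids audible seams at chunk boundaries while keeping peak GPU memory bounded by the chunk length.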
## Ethical Considerations
- This model is designed for audio enhancement and should not be used to create misleading or deceptive content
- Users should respect privacy and consent when processing speech recordings
- Enhanced audio should be clearly labeled as processed when used in sensitive contexts
## License
Both the model weights and code are released under the MIT License.
## Additional Resources
- GitHub Repository: https://github.com/ZDisket/Brontes
- Technical Report: See the repository
- Issues & Support: GitHub Issues
## Acknowledgments
Compute resources provided by Hot Aisle and AI at AMD.