nano-codec 🔊

A minimal neural audio codec. 16 kHz mono • 128x compression • 10.2 kbps • 24M parameters.

Trained on LibriSpeech train-clean-100 (~100 hours) for ~180k steps.

πŸ“ Blog Post β€” in-depth walkthrough of the architecture, training, and lessons learned

πŸ€— Model Weights β€” pretrained model on HuggingFace

πŸ’» GitHub β€” full training and inference code

πŸ—οΈ Architecture

nano-codec architecture

Inspired by DAC (Descript Audio Codec): a strided convolutional encoder, an 8-level residual vector quantizer (RVQ) with factorized, L2-normalized codebooks, and a mirrored decoder.
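
To make the quantizer description concrete, here is a minimal NumPy sketch of residual vector quantization with factorized, L2-normalized codebooks. It is an illustration of the general technique, not the code in this repo: the class and function names (`FactorizedVQ`, `rvq_encode`), the random projection weights, and the `latent_dim=64` and `codebook_size=1024` values are all assumptions made for the example. Only the 8 levels and the codebook dimension of 8 (the default in the usage snippet) come from this card.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

class FactorizedVQ:
    """One quantizer level (hypothetical): project down, match by cosine similarity, project back."""

    def __init__(self, latent_dim, codebook_size, codebook_dim, rng):
        # Random weights stand in for learned projections and codebook entries.
        self.w_in = rng.standard_normal((latent_dim, codebook_dim)) / np.sqrt(latent_dim)
        self.w_out = rng.standard_normal((codebook_dim, latent_dim)) / np.sqrt(codebook_dim)
        self.codebook = rng.standard_normal((codebook_size, codebook_dim))

    def __call__(self, z):
        # z: [T, latent_dim]
        z_low = l2_normalize(z @ self.w_in)        # [T, codebook_dim]
        codes = l2_normalize(self.codebook)        # [K, codebook_dim]
        # For unit vectors, highest cosine similarity equals smallest L2 distance.
        idx = np.argmax(z_low @ codes.T, axis=-1)  # [T]
        z_q = codes[idx] @ self.w_out              # back to [T, latent_dim]
        return z_q, idx

def rvq_encode(z, levels):
    """Quantize z level by level; each level encodes what the previous ones missed."""
    residual = z
    quantized = np.zeros_like(z)
    indices = []
    for vq in levels:
        z_q, idx = vq(residual)
        quantized += z_q
        residual = residual - z_q
        indices.append(idx)
    return quantized, np.stack(indices)  # [T, latent_dim], [num_levels, T]

# Example with assumed sizes: 50 latent frames, 8 levels.
rng = np.random.default_rng(0)
levels = [FactorizedVQ(latent_dim=64, codebook_size=1024, codebook_dim=8, rng=rng)
          for _ in range(8)]
z = rng.standard_normal((50, 64))
z_q, codes = rvq_encode(z, levels)
print(z_q.shape, codes.shape)  # (50, 64) (8, 50)
```

Each frame ends up as 8 integer indices, one per level. In a trained model the projections and codebooks are learned, and gradients typically pass through the lookup with a straight-through estimator, which this sketch omits.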

🎧 Samples

Sample 1 - Original:

Reconstructed:

Sample 2 - Original:

Reconstructed:

Sample 3 - Original:

Reconstructed:

Sample 4 - Original:

Reconstructed:

mel spectrogram comparison

Usage

from huggingface_hub import hf_hub_download
import torch, yaml, soundfile as sf, torchaudio
from model import RVQCodec  # model.py from the GitHub repo must be on the import path

# load model
model_path = hf_hub_download("taresh18/nano-codec", "model.pt")
config_path = hf_hub_download("taresh18/nano-codec", "config.yaml")

with open(config_path) as f:
    cfg = yaml.safe_load(f)

model = RVQCodec(in_ch=1, latent_ch=cfg['latent_dim'], K=cfg['codebook_size'],
                 num_rvq_levels=cfg['num_rvq_levels'], codebook_dim=cfg.get('codebook_dim', 8))
model.load_state_dict(torch.load(model_path, map_location="cpu", weights_only=True))
model.eval()

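# reconstruct audio
audio, sr = sf.read("input.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # soundfile returns [T, C] for multi-channel files; downmix to mono
waveform = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # [1, 1, T]
if sr != 16000:
    waveform = torchaudio.functional.resample(waveform, sr, 16000)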

with torch.no_grad():
    recon, _, _, _ = model(waveform)

sf.write("reconstructed.wav", recon[0, 0].numpy(), 16000)
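
To put a rough number on how close a reconstruction is to its input, one option is to compare log-magnitude spectrograms. The sketch below is an assumption-laden helper, not part of the repo: the function names, the STFT settings (`n_fft=512`, `hop=128`), and the choice of a plain (non-mel) spectrogram are all illustrative. It trims both signals to the shorter length, on the assumption that the decoder output may not match the input length exactly.

```python
import numpy as np

def log_spectrogram(x, n_fft=512, hop=128):
    """Log-magnitude STFT of a 1-D signal using a Hann window (needs len(x) >= n_fft)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=-1)) + 1e-5)

def log_spectral_l1(original, reconstructed, **stft_kwargs):
    """Mean absolute difference between log spectrograms; lower means closer."""
    n = min(len(original), len(reconstructed))  # align lengths before comparing
    a = log_spectrogram(original[:n], **stft_kwargs)
    b = log_spectrogram(reconstructed[:n], **stft_kwargs)
    return float(np.mean(np.abs(a - b)))

# Self-contained check on synthetic audio: a 440 Hz tone vs. a noisy copy.
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t).astype(np.float32)
noisy = clean + 0.05 * np.random.default_rng(0).standard_normal(16000).astype(np.float32)
same = log_spectral_l1(clean, clean)   # identical signals give 0.0
diff = log_spectral_l1(clean, noisy)   # added noise gives a positive distance
print(same, diff > 0)
```

With the snippet above, the same helper could be called as `log_spectral_l1(waveform[0, 0].numpy(), recon[0, 0].numpy())`. A spectral measure is used here rather than waveform SNR because neural codecs generally do not preserve phase sample by sample.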

Or use the inference script from the GitHub repo:

python inference.py --input audio.wav --output reconstructed.wav

📚 References
