nano4M-Audio — trained checkpoint
Extending the 4M masked-multimodal framework to audio as a 5th modality — a controlled study at small academic scale (COM-304, EPFL, Spring 2026).
| 💻 Code & docs | https://github.com/ziyad-m97/nano4M-Audio |
| 🌐 Project website | https://ziyad-m97.github.io/nano4M-Audio/ |
| 📊 Tokenized dataset | https://huggingface.co/datasets/zed-m97/nano4m-audio-tokenized |
| 🧾 Config | cfgs/nano4M/animal_full_5mod_v5.yaml |
Overview
nano4M-Audio adds audio to the 4M encoder–decoder transformer without any architectural
change — the only training-side modification is contiguous span masking on the audio stream.
It is trained jointly on five tokenized modalities (RGB, audio, depth, surface normals, caption) over
a self-built set of animal-vocalization clips. The structural modalities learn strongly and the
iterative generation framework works in the canonical 4M directions; audio acquires conditional
structure at the token level but does not lift to usable cross-modal audio↔vision generation. The
contribution is the precise diagnostic of why, not a working audio generator — see the report.
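To make the span-masking idea concrete, here is a minimal sketch in plain Python. It assumes that a single contiguous block of audio tokens is hidden per sample and that the block starts on a multiple of the stride. The function name, the `span_len` argument, and this reading of "stride" are illustrative assumptions, not the repository's implementation.

```python
import random

def contiguous_span_mask(seq_len, span_len, stride=2, rng=None):
    """Return a list of booleans; True marks a hidden (to-be-predicted) token.

    Assumption: one contiguous span per sample, starting on a multiple of
    `stride`, so the hidden block has no visible tokens inside it.
    """
    if not 0 < span_len <= seq_len:
        raise ValueError("span_len must be in (0, seq_len]")
    rng = rng or random.Random()
    # Candidate start positions: multiples of `stride` that keep the span in range.
    starts = range(0, seq_len - span_len + 1, stride)
    start = rng.choice(starts)
    return [start <= i < start + span_len for i in range(seq_len)]

# Example: hide 128 of the 512 audio tokens as one block.
mask = contiguous_span_mask(512, 128, stride=2, rng=random.Random(0))
print(sum(mask), mask.index(True))
```

Under independent random masking, a hidden token usually has visible neighbours on both sides; a contiguous span removes that local context for every token except those at the span edges.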
Model details
| Property | Value |
|---|---|
| Architecture | encoder–decoder transformer (nanofm.models.fourm.FourM), d6-6w512 |
| Parameters | 95.84 M |
| Width / depth / heads | dim=512, enc_depth=6, dec_depth=6, head_dim=64 |
| Precision | fp32 (bf16 produced NaNs in the unified-vocab softmax) |
| Vocabulary | unified, max(vocab_sizes) = 50,304; modality + position embeddings disambiguate streams |
| Loss | per-modality, length-normalized cross-entropy, averaged |
| Base framework | apple/ml-4m + the nano4M course re-implementation |
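The loss row can be read as a mean of per-modality means. The sketch below is plain Python and assumes that each modality's cross-entropy is averaged over its own predicted tokens first, and that the modalities are then averaged with equal weight. The function names and the toy batch are illustrative and are not taken from the repository.

```python
import math

def token_cross_entropy(logits, target):
    """Cross-entropy in nats for one token: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def per_modality_loss(batch):
    """batch maps a modality name to a list of (logits, target) pairs.

    Returns (total, per_modality): each modality is length-normalized
    first, then the modalities are averaged with equal weight.
    """
    per_modality = {}
    for name, tokens in batch.items():
        if tokens:  # skip modalities with no predicted tokens
            losses = [token_cross_entropy(l, t) for l, t in tokens]
            per_modality[name] = sum(losses) / len(losses)
    if not per_modality:
        raise ValueError("batch contains no predicted tokens")
    total = sum(per_modality.values()) / len(per_modality)
    return total, per_modality

# Toy batch: three audio tokens (4-way uniform logits), one caption token (2-way).
toy = {
    "tok_audio": [([0.0] * 4, 1)] * 3,
    "scene_desc": [([0.0] * 2, 0)],
}
total, parts = per_modality_loss(toy)
print(round(total, 4), {k: round(v, 4) for k, v in parts.items()})
```

With this weighting, a long stream such as the 512-token audio sequence does not outweigh the caption stream merely because it contributes more tokens.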
Modalities
| Modality | Tokenizer | Seq len | Vocab |
|---|---|---|---|
| tok_rgb@196 | 4M-16k DiVAE | 196 | 16,384 |
| tok_audio@512 | EnCodec 24 kHz, K=2 RVQ @ 1.5 kbps (delay/flatten, cb2 +1024) | 512 | 2,048 |
| tok_depth@196 | Depth-Anything-V2 → 4M-8k DiVAE | 196 | 8,192 |
| tok_normal@196 | DSINE → 4M-8k DiVAE | 196 | 8,192 |
| scene_desc | GPT-2 BPE ("a photo of a <class>") | ≤64 | 50,304 |
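The audio row packs two EnCodec codebooks into one token stream with a 2,048-entry vocabulary. The sketch below shows one way to do that in plain Python, assuming frame-by-frame interleaving with the second codebook shifted by 1,024. It covers only the "flatten" and "cb2 +1024" parts; the "delay" step is not modelled, and the function names are hypothetical.

```python
CODEBOOK_SIZE = 1024   # entries per EnCodec codebook (10 bits per code)
NUM_CODEBOOKS = 2      # K=2 residual codebooks
FRAME_RATE_HZ = 75     # EnCodec 24 kHz frame rate

def flatten_codes(cb1, cb2):
    """Interleave two per-frame code lists into one stream.

    Codebook-2 ids are shifted by CODEBOOK_SIZE so both codebooks share
    a single vocabulary of 2 * 1024 = 2048 ids.
    """
    if len(cb1) != len(cb2):
        raise ValueError("both codebooks need one code per frame")
    flat = []
    for a, b in zip(cb1, cb2):
        flat.append(a)
        flat.append(b + CODEBOOK_SIZE)
    return flat

def unflatten_codes(flat):
    """Invert flatten_codes: even positions are cb1, odd positions are cb2."""
    cb1 = flat[0::2]
    cb2 = [t - CODEBOOK_SIZE for t in flat[1::2]]
    return cb1, cb2

# Bitrate check: 75 frames/s * 2 codebooks * 10 bits = 1500 bit/s = 1.5 kbps.
bits_per_second = FRAME_RATE_HZ * NUM_CODEBOOKS * 10
# Under this layout, 512 tokens hold 256 frames.
frames = 512 // NUM_CODEBOOKS
print(bits_per_second, frames, round(frames / FRAME_RATE_HZ, 2))
```

Under these assumptions a 512-token sequence corresponds to 256 frames, or roughly 3.4 s of audio. That duration is derived here from the frame rate and is not a figure stated in this card.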
Training
- 18,311 steps, batch size 64, ~600 M tokens, ~1 h 10 min on 1× NVIDIA H100, fp32.
- Optimizer: AdamW with (β₁, β₂) = (0.9, 0.95), weight decay 0.05, gradient clipping at 1.0.
- Cosine LR 1e-4 → 1e-6, 916 warmup steps. Fixed seed; deterministic clip-level split released.
- Masking: standard 4M Dirichlet (random) masking for RGB/depth/normal/caption with per-sample token budgets in [16, 256]; contiguous span masking (stride 2) for audio so the decoder cannot copy an adjacent EnCodec frame.
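The learning-rate schedule listed above can be sketched as a function of the step index. This version assumes a linear warmup to the peak rate followed by a single cosine decay to the final rate over the remaining steps. The warmup shape and the name `lr_at` are assumptions; only the four numbers come from the list.

```python
import math

def lr_at(step, total_steps=18_311, warmup_steps=916,
          peak_lr=1e-4, final_lr=1e-6):
    """Learning rate at a 0-indexed training step.

    Assumed shape: linear warmup to peak_lr, then cosine decay to final_lr.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # clamp in case step overshoots
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine

# Print the rate at a few points along the run.
for s in (0, 915, 916, 9_613, 18_311):
    print(s, f"{lr_at(s):.3e}")
```

The 916 warmup steps amount to about 5% of the 18,311-step run.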
Dataset
zed-m97/nano4m-audio-tokenized —
9,192 clips over 11 animal classes (cat, chicken, cow, coyote, dog, duck, horse, lion, pig, sheep,
pigeon), sourced from AudioSet + VGGSound and cleaned by a 3-stage PANNs / CLIP / Silero-VAD filter;
clip-level stratified split (seed 42): 7,347 / 907 / 938 train / val / test. Depth and normal targets
are pseudo-labeled (Depth-Anything-V2, DSINE).
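A clip-level stratified split of this kind can be sketched with the standard library alone. The released split should be used for any comparison; the code below only illustrates the idea, and the 80/10/10 fractions, the function name, and the toy labels are assumptions (the reported sizes correspond to roughly 79.9% / 9.9% / 10.2%).

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=42):
    """Split clip ids into (train, val, test), class by class.

    labels: dict mapping clip id -> class name. Every clip lands in exactly
    one split, and each class is divided in roughly the same proportions.
    """
    by_class = defaultdict(list)
    for clip_id, cls in labels.items():
        by_class[cls].append(clip_id)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    train, val, test = [], [], []
    for cls in sorted(by_class):        # sorted for a deterministic order
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        train.extend(ids[:n_train])
        val.extend(ids[n_train:n_train + n_val])
        test.extend(ids[n_train + n_val:])  # remainder goes to test
    return train, val, test

# Toy example: 70 "dog" clips and 30 "cat" clips.
toy = {f"dog_{i}": "dog" for i in range(70)}
toy.update({f"cat_{i}": "cat" for i in range(30)})
train, val, test = stratified_split(toy)
print(len(train), len(val), len(test))
```

Splitting by clip id, rather than by token window or frame, keeps every segment of a clip on the same side of the train/test boundary.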
How to use
```python
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from omegaconf import OmegaConf
from hydra.utils import instantiate

# Clone the repo first for the model code and config:
#   git clone https://github.com/ziyad-m97/nano4M-Audio && cd nano4M-Audio && pip install -e .
cfg = OmegaConf.load("cfgs/nano4M/animal_full_5mod_v5.yaml")
model = instantiate(cfg.model_config)

# Download the checkpoint from the Hub and load the weights.
sd = load_file(hf_hub_download("zed-m97/nano4m-audio", "checkpoint-final.safetensors"))
# strict=False tolerates key mismatches, so print them rather than ignoring them.
result = model.load_state_dict(sd, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
model.eval()
```
The full evaluation harness is in notebooks/final_evaluation.ipynb;
the actual outputs are committed under eval_results/.
Evaluation results (held-out 938-clip test set)
| Probe | Model | Random baseline |
|---|---|---|
| Audio eval CE | 5.28 nats | 7.62 (log 2048); ~6.2 empirical marginal |
| Depth / Normal eval CE | 5.11 / 3.45 | 9.01 |
| RGB eval CE | 9.14 | 9.70 |
| Audio → class, top-1 / top-5 | 10.4% / 48.4% | 9.1% / 45.5% |
| Best cross-modal retrieval R@5 (depth→audio) | 4.5% | 2.5% |
| RGB → depth / RGB → normal token acc | 11.1% / 18.0% | ~0.012% |
| Audio → RGB ImageNet ResNet-50 top-5 hit | 0% | ~5% |
| Memorization probe (train / test acc) | 2.95% / 4.13% | — |
| RGB tokenizer fidelity | PSNR 19.1 dB, SSIM 0.80 | — |
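The uniform-guess baselines in the table follow directly from the vocabulary sizes and the class count. The short check below reproduces them; the helper names are illustrative.

```python
import math

# Vocabulary sizes from the Modalities table, and the number of animal classes.
VOCAB = {"audio": 2_048, "depth": 8_192, "normal": 8_192, "rgb": 16_384}
NUM_CLASSES = 11

def uniform_ce(vocab_size):
    """Cross-entropy in nats of a uniform guess over vocab_size tokens."""
    return math.log(vocab_size)

def chance_top_k(k, num_classes):
    """Probability that the true class is among k uniformly random distinct guesses."""
    return min(k, num_classes) / num_classes

for name, size in VOCAB.items():
    print(f"{name}: uniform CE = {uniform_ce(size):.2f} nats, "
          f"chance token accuracy = {100 / size:.3f}%")
print(f"class top-1 = {100 * chance_top_k(1, NUM_CLASSES):.1f}%, "
      f"top-5 = {100 * chance_top_k(5, NUM_CLASSES):.1f}%")
```

The retrieval and ImageNet-hit baselines depend on the evaluation protocol and are not derived here.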
The asymmetry. Audio captures ~1 nat of conditional structure per token (CE 5.28 < the ~6.2-nat marginal) and is weakly class-discriminative, yet cross-modal generation mode-collapses. We trace this to three causes: (1) a train/inference masking mismatch that makes single-source decoding out-of-distribution; (2) an acoustic-only EnCodec tokenizer with no class semantics; (3) operating at ~10⁴ clips, below the cross-modal emergence threshold of the contrastive audio-visual literature.
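One way to obtain a marginal baseline like the ~6.2-nat figure is to score held-out tokens under a context-free unigram model. The sketch below assumes that this is what "empirical marginal" means here; the function name, the add-alpha smoothing, and the toy data are illustrative choices, not the evaluation harness.

```python
import math
from collections import Counter

def unigram_cross_entropy(train_tokens, eval_tokens, vocab_size, alpha=1.0):
    """Mean negative log-likelihood (nats) of eval_tokens under a unigram model.

    The model is fit on train_tokens with add-alpha smoothing so that
    tokens unseen in training still get non-zero probability.
    """
    if not eval_tokens:
        raise ValueError("eval_tokens is empty")
    counts = Counter(train_tokens)
    denom = len(train_tokens) + alpha * vocab_size
    nll = 0.0
    for tok in eval_tokens:
        nll -= math.log((counts[tok] + alpha) / denom)
    return nll / len(eval_tokens)

# A skewed token distribution scores below the uniform bound log(vocab_size).
train = [0] * 90 + [1] * 10
held_out = [0] * 9 + [1]
print(round(unigram_cross_entropy(train, held_out, vocab_size=2), 3),
      round(math.log(2), 3))
```

The gap between such a context-free baseline and the model's cross-entropy (6.2 − 5.28 ≈ 0.9 nats) is the part attributable to conditioning on the visible tokens.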
License
MIT for the model weights and this repository's contributions. The underlying 4M code is licensed under Apache-2.0.