Byrne-VE: Tiny Self-Distilled Vision Encoder (39M)
The model will be ungated for open download once I am done with the base.
Byrne-VE is a compact (~39.3M-parameter) ViT-style image encoder carrying the SpikeWhale / Byrne family DNA: RMSNorm, 2D-axial RoPE, QK-Norm, SwiGLU, and an HRM (Hierarchical Refinement) block. It was first distilled from a frozen DINOv2-base teacher, then improved teacher-free with DINO-style EMA self-distillation, which raised feature quality beyond what the teacher signal alone achieved. It is designed as a drop-in vision backbone for a VLM.
image → patch-embed → 12 × (RoPE / QK-Norm / SwiGLU) blocks → HRM refine → RMSNorm
      → pooled (CLS) 512-d + 196 patch tokens (512-d each)
Architecture
| Setting | Value |
|---|---|
| image size | 224 |
| patch size | 16 → 14×14 = 196 patches |
| hidden dim | 512 |
| depth | 12 blocks |
| heads | 8 (head_dim 64) |
| FFN | SwiGLU (8/3 mult) |
| position encoding | 2D-axial RoPE (no learned pos-embeds) |
| normalization | RMSNorm + QK-Norm |
| refinement | HRM iterative refine (2 steps, zero-init gate) |
| parameters | 39.34M (blocks 38.55M · patch-embed 0.39M · HRM 0.39M) |
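
As a rough cross-check of the table, the patch grid, token count, and parameter split can be recomputed by hand. The sketch below is an estimate, not the model's actual layer layout: it assumes a biased linear patch projection, bias-free attention and SwiGLU projections, and an FFN hidden width of 8/3 × 512 rounded up to a multiple of 128. None of these details are stated in this card.

```python
# Back-of-the-envelope check of the figures in the architecture table.
image_size, patch_size, dim, depth, head_dim = 224, 16, 512, 12, 64

grid = image_size // patch_size      # 14 patches per side
num_patches = grid * grid            # 196 patch tokens
seq_len = num_patches + 1            # plus one CLS token = 197

# ASSUMPTION: patch embedding is a linear map over flattened 16x16 RGB patches, with bias.
patch_embed = 3 * patch_size * patch_size * dim + dim

# ASSUMPTION: SwiGLU hidden width is 8/3 * dim rounded up to a multiple of 128.
ffn_hidden = -(-(8 * dim // 3) // 128) * 128

# ASSUMPTION: bias-free q, k, v, out projections and three bias-free SwiGLU matrices.
attn = 4 * dim * dim
ffn = 3 * dim * ffn_hidden
norms = 2 * dim + 2 * head_dim       # two RMSNorm scales plus QK-Norm scales
blocks = depth * (attn + ffn + norms)

print(grid, num_patches, seq_len)
print(f"patch-embed {patch_embed / 1e6:.2f}M, blocks {blocks / 1e6:.2f}M")
```

Under these assumptions the estimate lands on 0.39M for the patch embedding and 38.55M for the blocks, which is consistent with the table but does not prove the layout.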
Outputs (forward(images) -> dict): pooled [B,512] (global), tokens
[B,196,512] (per-patch), last_hidden_state [B,197,512].
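
The 2D-axial RoPE listed in the table can be pictured as two independent 1D rotary encodings, one driven by the patch row and one by the patch column. The NumPy sketch below is illustrative only: the channel split (first half for rows, second half for columns), the interleaved pair layout, and the base of 10000 are assumptions, and the function names are hypothetical rather than taken from the Byrne-VE code.

```python
import numpy as np

def rope_1d(x, pos, base=10000.0):
    """Rotate consecutive channel pairs of x [N, D] by angles pos * freq."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # [D/2] rotation frequencies
    ang = pos[:, None] * freqs[None, :]            # [N, D/2] rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def axial_rope_2d(x, grid=14):
    """ASSUMPTION: first half of the channels encodes the row, second half the column."""
    rows, cols = np.divmod(np.arange(grid * grid), grid)
    half = x.shape[-1] // 2
    return np.concatenate(
        [rope_1d(x[:, :half], rows.astype(float)),
         rope_1d(x[:, half:], cols.astype(float))],
        axis=-1,
    )

rng = np.random.default_rng(0)
q = np.tile(rng.normal(size=64), (196, 1))   # same query vector at every patch
k = np.tile(rng.normal(size=64), (196, 1))   # same key vector at every patch
scores = axial_rope_2d(q) @ axial_rope_2d(k).T

# Equal (row, col) offsets give equal scores: (0,0)->(0,1) vs (3,5)->(3,6).
print(np.isclose(scores[0, 1], scores[47, 48]))
```

Because the rotation depends only on position, attention scores end up as a function of the relative row and column offset, which is why no learned position embeddings are needed.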
How it was trained
Stage 1: Distillation from DINOv2-base. An untrained student was trained to
reproduce the features of a frozen facebook/dinov2-base teacher on a stream of
ImageNet (timm/imagenet-1k-wds). The loss is 1 − cosine_similarity, matched on
both the global CLS feature and the patch-token grid (the teacher's grid is
bicubically resized to the student's 14×14). Training ran for 50k+ steps with AdamW
at batch size 32. The HRM refine block uses a zero-init scalar gate, so it is a safe
no-op at the start and "wakes up" during training (gate tanh climbed from 0.0 to ~0.43).
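
The zero-init gate described above can be illustrated with a toy residual update. This is a minimal sketch under assumptions: the form `h + tanh(gate) * refine(h)` and the `toy_refine` stand-in are hypothetical, chosen only to show why a gate initialized at zero leaves the features untouched.

```python
import math

def gated_refine(h, refine, gate, steps=2):
    """Apply `steps` residual refinements, each scaled by tanh(gate)."""
    g = math.tanh(gate)
    for _ in range(steps):
        delta = refine(h)
        h = [hi + g * di for hi, di in zip(h, delta)]
    return h

def toy_refine(h):
    # Hypothetical stand-in for the HRM update; any function of h works here.
    return [2.0 * hi + 1.0 for hi in h]

tokens = [0.5, -1.0, 2.0]
print(gated_refine(tokens, toy_refine, gate=0.0))    # gate at init: features unchanged
print(gated_refine(tokens, toy_refine, gate=0.46))   # tanh(0.46) is about 0.43
```

With the gate at zero the block is an exact identity, so adding it cannot disturb the backbone at the start of training; the refinement only contributes as the gate moves away from zero.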
At the end of distillation the student's features reached roughly 56% mean cosine similarity to DINOv2-base (global ≈ 53.5%, patch ≈ 59.1%), as expected for a student ~2× smaller than the teacher, which cannot perfectly reproduce it.
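
The Stage 1 objective, 1 − cosine similarity on both the CLS feature and the patch grid, can be sketched in a few lines of NumPy. The sketch assumes the teacher features have already been resized to the 14×14 grid and mapped to the student's width, and that the two terms are weighted equally; the card does not state how either is done, and `distill_loss` is a hypothetical name.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Mean of 1 - cos(a, b) over all leading axes; vectors lie on the last axis."""
    a = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(a * b, axis=-1)))

def distill_loss(s_cls, s_patch, t_cls, t_patch, patch_weight=1.0):
    # ASSUMPTION: the CLS and patch terms are summed with equal weight.
    return cosine_distance(s_cls, t_cls) + patch_weight * cosine_distance(s_patch, t_patch)

rng = np.random.default_rng(0)
t_cls = rng.normal(size=(4, 512))           # teacher CLS features, [B, 512]
t_patch = rng.normal(size=(4, 196, 512))    # teacher patch grid, [B, 196, 512]

loss_same = distill_loss(t_cls, t_patch, t_cls, t_patch)        # identical features
loss_opposite = distill_loss(-t_cls, -t_patch, t_cls, t_patch)  # opposite features
print(loss_same, loss_opposite)
```

Each term ranges from 0 (aligned) to 2 (opposite) and ignores feature magnitude, so the student is only asked to match the direction of the teacher's features.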
Stage 2: Teacher-free self-distillation (DINO-style). Starting from the
distilled checkpoint, the model was trained without any external teacher: an EMA
("momentum") copy of the encoder produces targets, two differently-augmented views of
each image are encoded, and the online student matches the EMA teacher across views
over a set of learned prototypes. Collapse is prevented the DINO way: centering +
sharpening, with a cosine-normalized, frozen-early prototype layer and a soft
teacher temperature. This stage improved downstream feature quality by ~+5 points
across every probe over the distilled checkpoint (peak ~step 4k; this release is
that peak). Byrne-VE is the stage-2 checkpoint.
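
The centering, sharpening, and EMA steps described above follow the general DINO recipe, which can be sketched as below. The temperatures, momenta, and prototype count are illustrative placeholders, not the values used for Byrne-VE, and the cosine-normalized prototype layer that would produce the logits is not shown.

```python
import numpy as np

def softmax(x, temp):
    z = x / temp
    z = z - z.max(axis=-1, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.07):
    """Cross-entropy between the centered, sharpened teacher and the student."""
    targets = softmax(teacher_logits - center, t_teacher)   # centering + sharpening
    log_p = np.log(softmax(student_logits, t_student) + 1e-12)
    return float(-(targets * log_p).sum(axis=-1).mean())

def ema_update(old, new, momentum):
    """Exponential moving average, usable for teacher weights and for the center."""
    return momentum * old + (1.0 - momentum) * new

rng = np.random.default_rng(0)
num_prototypes = 16                               # hypothetical prototype count
view_a = rng.normal(size=(8, num_prototypes))     # student logits on view A
view_b = rng.normal(size=(8, num_prototypes))     # EMA-teacher logits on view B
center = np.zeros(num_prototypes)

loss = dino_loss(view_a, view_b, center)          # student on A matches teacher on B
center = ema_update(center, view_b.mean(axis=0), momentum=0.9)
teacher_w = ema_update(np.ones(3), np.zeros(3), momentum=0.996)
print(loss, teacher_w)
```

Centering discourages any single prototype from dominating, while the low teacher temperature discourages a uniform output; together they counter the two trivial collapse modes.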
Evaluation
Frozen encoder, weighted k-NN classification (no fine-tuning), identical
preprocessing for every model. k=20.
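
A weighted k-NN probe of this kind can be sketched as follows. The weighting scheme is an assumption: this version uses cosine similarity with votes weighted by exp(sim / 0.07), a common choice for self-supervised k-NN evaluation, and `knn_predict` is a hypothetical helper rather than the function in `vision.eval_probe`.

```python
import numpy as np

def knn_predict(bank, bank_labels, queries, num_classes, k=20, temp=0.07):
    """Weighted k-NN on L2-normalized features; votes weighted by exp(sim / temp)."""
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = queries @ bank.T                               # [Q, N] cosine similarities
    idx = np.argsort(-sims, axis=1)[:, :k]                # indices of the k nearest
    top_sims = np.take_along_axis(sims, idx, axis=1)      # [Q, k]
    weights = np.exp(top_sims / temp)
    votes = np.zeros((queries.shape[0], num_classes))
    rows = np.repeat(np.arange(queries.shape[0]), idx.shape[1])
    np.add.at(votes, (rows, bank_labels[idx].ravel()), weights.ravel())
    return votes.argmax(axis=1)

# Toy data: two well-separated clusters standing in for frozen encoder features.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
labels = np.arange(200) % 2
bank = centers[labels] + rng.normal(scale=0.1, size=(200, 2))
queries = centers + rng.normal(scale=0.1, size=(2, 2))

pred = knn_predict(bank, labels, queries, num_classes=2)
print(pred)
```

In the real probe, `bank` and `queries` would be pooled embeddings of the CIFAR train and test images from the frozen encoder.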
vs. the DINOv2-base teacher (full-bank CIFAR-10, 50k bank / 10k query)
| Model | Params | Top-1 |
|---|---|---|
| DINOv2-base (teacher) | 86M | 98.47% |
| Byrne-VE (this model) | 39M | 86.38% |
| distilled-only (stage 1) | 39M | 81.39% |
Byrne-VE reaches ~88% of the teacher's accuracy at ~45% of the parameters, and self-distillation added +4.99 points over the distilled-only baseline.
Self-distillation gain across probes
| Probe | Byrne-VE | distilled-only | Δ |
|---|---|---|---|
| CIFAR-10, full-bank (50k bank, 10-way) | 86.38% | 81.39% | +4.99 |
| CIFAR-10, small (5k bank, 10-way) | 81.60% | 75.40% | +6.20 |
| CIFAR-100 (10k bank, 100-way) | 54.35% | 49.50% | +4.85 |
Self-distillation produced a consistent gain on all three probes (+4.85 to +6.20 points), which points to a robust improvement rather than a probe-specific artifact.
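
The headline ratios and per-probe gains follow directly from the numbers in the tables above; this short check recomputes them using the rounded parameter counts from the first table.

```python
# Top-1 accuracy in percent: (Byrne-VE, distilled-only), copied from the tables above.
results = {
    "CIFAR-10 full-bank": (86.38, 81.39),
    "CIFAR-10 small": (81.60, 75.40),
    "CIFAR-100": (54.35, 49.50),
}
gains = {name: round(new - old, 2) for name, (new, old) in results.items()}
mean_gain = round(sum(gains.values()) / len(gains), 2)

teacher_top1, teacher_params_m, student_params_m = 98.47, 86, 39
relative_accuracy = results["CIFAR-10 full-bank"][0] / teacher_top1
param_ratio = student_params_m / teacher_params_m

print(gains, mean_gain)
print(f"{relative_accuracy:.1%} of teacher accuracy at {param_ratio:.1%} of the parameters")
```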
Health checks: HRM gate tanh ≈ 0.355 (alive, off-zero) · no NaN/Inf in any
weight · determinism cos 1.0000 · cross-image pooled cos 0.142 (discriminative).
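
Checks like these are straightforward to script. The sketch below uses a toy linear map in place of the real encoder; `health_report` and `toy_encode` are hypothetical names, and the actual checks may have been computed differently.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def health_report(encode, weights, img_a, img_b):
    """`encode` maps an image array to a 1-D pooled embedding (hypothetical signature)."""
    e1, e2, other = encode(img_a), encode(img_a), encode(img_b)
    return {
        "finite_weights": all(np.isfinite(w).all() for w in weights),
        "determinism_cos": cosine(e1, e2),      # same image twice, expect 1.0
        "cross_image_cos": cosine(e1, other),   # different images, expect well below 1
    }

rng = np.random.default_rng(0)
proj = rng.normal(size=(512, 3 * 8 * 8))        # toy linear "encoder" weights

def toy_encode(img):
    return proj @ img.ravel()

img_a, img_b = rng.normal(size=(2, 3, 8, 8))    # two random toy images
report = health_report(toy_encode, [proj], img_a, img_b)
print(report)
```

A cross-image cosine close to 1 would suggest collapsed embeddings, so a low value is the reassuring outcome here.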
Usage
pip install -r requirements.txt
python -m vision.infer --ckpt byrne_ve.pt --image photo.jpg
python -m vision.eval_probe --ckpt byrne_ve.pt # reproduce CIFAR-10 k-NN
from vision.infer import load_encoder
from vision.dataset import load_image
m, cfg = load_encoder("byrne_ve.pt")
x = load_image("photo.jpg", cfg.image_size).unsqueeze(0)
out = m(x)
emb = out["pooled"] # [1, 512] global image embedding
patches = out["tokens"] # [1, 196, 512] per-patch features
Citation
@misc{byrne_ve_2026,
title = {Byrne-VE: A Tiny Self-Distilled Vision Encoder},
author = {Quazim0t0},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-VE}},
note = {39M ViT-style encoder distilled from DINOv2-base, then improved
teacher-free via DINO-style self-distillation.}
}
License
Apache-2.0.