Byrne-VE — Tiny Self-Distilled Vision Encoder (39M)

The model will be ungated for open download once I am done with the base.

Byrne-VE is a compact (~39.3M-parameter) ViT-style image encoder that carries the SpikeWhale / Byrne family DNA: RMSNorm, 2D-axial RoPE, QK-Norm, SwiGLU, and an HRM (Hierarchical Refinement) block. It was first distilled from a frozen DINOv2-base teacher, then improved teacher-free with DINO-style EMA self-distillation, which raised feature quality beyond what distillation from the teacher alone achieved. It is designed as a drop-in vision backbone for a VLM.

image → patch-embed → 12 × (RoPE / QK-Norm / SwiGLU) blocks → HRM refine → RMSNorm
      → pooled (CLS) 512-d   +   196 patch tokens (512-d each)

Architecture


image size	224
patch size	16 → 14×14 = 196 patches
hidden dim	512
depth	12 blocks
heads	8 (head_dim 64)
FFN	SwiGLU (8/3 mult)
position encoding	2D-axial RoPE (no learned pos-embeds)
normalization	RMSNorm + QK-Norm
refinement	HRM iterative refine (2 steps, zero-init gate)
parameters	39.34M (blocks 38.55M · patch-embed 0.39M · HRM 0.39M)
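The zero-init gate in the refinement row can be illustrated with a minimal PyTorch sketch. Everything here is hypothetical: `GatedRefine`, its small MLP update, and the residual form `x + tanh(gate) * update(x)` are assumptions chosen to show why a zero-initialised scalar gate makes the block an identity at the start of training. The released HRM block may be built differently.

```python
import torch
import torch.nn as nn


class GatedRefine(nn.Module):
    """Hypothetical stand-in for an HRM-style refine block (not the released code)."""

    def __init__(self, dim: int = 512, steps: int = 2):
        super().__init__()
        self.steps = steps
        # Placeholder update function; the real block's internals are an assumption.
        self.update = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        # Scalar gate initialised to zero, so tanh(gate) == 0 before any training.
        self.gate = nn.Parameter(torch.zeros(()))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply the gated residual update a fixed number of times.
        for _ in range(self.steps):
            x = x + torch.tanh(self.gate) * self.update(x)
        return x


torch.manual_seed(0)
block = GatedRefine()
tokens = torch.randn(2, 197, 512)
out = block(tokens)
# With the gate at zero the block leaves its input unchanged.
identity_at_init = torch.equal(out, tokens)
```

Because the gate is a learnable parameter, gradient descent can move it away from zero once the update path becomes useful, which is one way to read the "wakes up" behaviour reported for training.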

Outputs (forward(images) -> dict): pooled [B,512] (global), tokens [B,196,512] (per-patch), last_hidden_state [B,197,512].
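The shapes above follow from the image and patch sizes. The short sketch below works through that arithmetic in plain Python. It assumes RGB input, a single CLS token prepended to the patch sequence (which would explain the 197 in `last_hidden_state`), and a patch embedding that is a linear map with a bias. Those three points are assumptions, not statements from the model code.

```python
image_size = 224
patch_size = 16
hidden_dim = 512
heads = 8
in_channels = 3  # assumption: RGB input

grid = image_size // patch_size   # patches per side
num_patches = grid * grid         # patch tokens per image
seq_len = num_patches + 1         # assumption: one CLS token is prepended
head_dim = hidden_dim // heads    # width of each attention head

# Assumption: the patch embedding is a linear map over flattened patches, plus a bias.
patch_embed_params = patch_size * patch_size * in_channels * hidden_dim + hidden_dim


def output_shapes(batch: int) -> dict:
    """Expected tensor shapes of the forward() dict for a given batch size."""
    return {
        "pooled": (batch, hidden_dim),
        "tokens": (batch, num_patches, hidden_dim),
        "last_hidden_state": (batch, seq_len, hidden_dim),
    }


print(output_shapes(4))
print(f"patch-embed params: {patch_embed_params / 1e6:.2f}M")
```

Under these assumptions the patch-embedding count rounds to the 0.39M listed in the table.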

How it was trained

Stage 1: Distillation from DINOv2-base. An untrained student was trained to reproduce the features of a frozen facebook/dinov2-base teacher on a stream of ImageNet images (timm/imagenet-1k-wds). The loss is 1 − cosine_similarity, applied to both the global CLS feature and the patch-token grid (the teacher's grid is bicubically resized to the student's 14×14). Training ran for roughly 50k steps or more with AdamW at batch size 32. The HRM refine block uses a zero-init scalar gate, so it is a safe no-op at the start and "wakes up" during training (tanh of the gate climbed from 0.0 to ~0.43).
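The stage-1 objective can be sketched in PyTorch as follows. This is an illustration under stated assumptions, not the training code. The function name `distill_loss`, the equal weighting of the two terms, the row-major token order, and the 512 → 768 linear projection that matches the student width to the teacher width are all hypothetical, since the card does not say how the widths are reconciled. The 16×16 teacher grid in the demo assumes the teacher sees 224-pixel images with DINOv2's patch size of 14.

```python
import torch
import torch.nn.functional as F


def distill_loss(s_cls, s_tok, t_cls, t_tok, student_grid=14):
    """1 - cosine similarity on the global feature and on the patch grid.

    s_cls: [B, D]      student global feature (already projected to teacher width D)
    s_tok: [B, G*G, D] student patch tokens (already projected), G = student_grid
    t_cls: [B, D]      teacher global feature
    t_tok: [B, N, D]   teacher patch tokens on a square grid (N = side * side)
    """
    b, n, d = t_tok.shape
    side = int(round(n ** 0.5))
    # Reshape tokens to an image-like [B, D, side, side] map, then resize bicubically.
    t_map = t_tok.transpose(1, 2).reshape(b, d, side, side)
    t_map = F.interpolate(t_map, size=(student_grid, student_grid),
                          mode="bicubic", align_corners=False)
    t_tok = t_map.flatten(2).transpose(1, 2)  # back to [B, G*G, D]

    loss_cls = 1.0 - F.cosine_similarity(s_cls, t_cls, dim=-1).mean()
    loss_tok = 1.0 - F.cosine_similarity(s_tok, t_tok, dim=-1).mean()
    return loss_cls + loss_tok  # equal weighting is an assumption


torch.manual_seed(0)
proj = torch.nn.Linear(512, 768)       # hypothetical projection head (student -> teacher width)
s_cls = proj(torch.randn(2, 512))
s_tok = proj(torch.randn(2, 196, 512))
t_cls = torch.randn(2, 768)            # random stand-ins for frozen teacher features
t_tok = torch.randn(2, 256, 768)       # 16x16 teacher grid, resized to 14x14 inside the loss
loss = distill_loss(s_cls, s_tok, t_cls, t_tok)
```

Each term lies between 0 and 2, so the sum is bounded by 4, and it reaches 0 only when the student features point in exactly the same directions as the resized teacher features.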

At the end of distillation the student's features reached ~56% mean cosine similarity to DINOv2-base (global ≈ 53.5%, patch ≈ 59.1%). A gap is expected: the student is ~2× smaller than the teacher and cannot reproduce it perfectly.

Stage 2: Teacher-free self-distillation (DINO-style). Starting from the distilled checkpoint, the model was trained without any external teacher. An EMA ("momentum") copy of the encoder produces targets, two differently augmented views of each image are encoded, and the online student matches the EMA teacher across views over a set of learned prototypes. Collapse is prevented the DINO way, with centering and sharpening, plus a cosine-normalized prototype layer that is frozen early in training and a soft teacher temperature. This stage improved downstream feature quality by roughly 5 to 6 points on every probe relative to the distilled checkpoint. Quality peaked around step 4k, and this release is that peak. Byrne-VE is the stage-2 checkpoint.
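A single stage-2 update can be sketched roughly as below. This is a generic DINO-style step written under assumptions, not the author's training loop. The toy `Linear` encoder, the prototype count, the temperatures, and both momentum values are placeholders, and the function names are hypothetical. In particular, the teacher temperature here is only a stand-in for the "soft" value the card mentions without stating.

```python
import copy

import torch
import torch.nn.functional as F


def prototype_logits(features, prototypes):
    """Cosine-normalised prototype layer: cosine similarity to each prototype."""
    return F.normalize(features, dim=-1) @ F.normalize(prototypes, dim=-1).T


def dino_loss(student_logits, teacher_logits, center, t_student=0.1, t_teacher=0.04):
    """Cross-entropy between centered, sharpened teacher targets and student outputs."""
    # Centering subtracts a running mean; the low temperature sharpens the target.
    targets = F.softmax((teacher_logits - center) / t_teacher, dim=-1).detach()
    log_probs = F.log_softmax(student_logits / t_student, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()


@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Move each teacher parameter a small step towards the student."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)


@torch.no_grad()
def update_center(center, teacher_logits, momentum=0.9):
    """Running mean of teacher outputs, used for centering."""
    batch_mean = teacher_logits.mean(dim=0, keepdim=True)
    return center * momentum + batch_mean * (1.0 - momentum)


torch.manual_seed(0)
student = torch.nn.Linear(32, 16)      # toy stand-in for the encoder
teacher = copy.deepcopy(student)       # EMA copy, never updated by gradients
prototypes = torch.randn(64, 16)       # held fixed here (mimics "frozen early")
center = torch.zeros(1, 64)

view_a, view_b = torch.randn(8, 32), torch.randn(8, 32)  # two augmented views
with torch.no_grad():
    t_a = prototype_logits(teacher(view_a), prototypes)
    t_b = prototype_logits(teacher(view_b), prototypes)
s_a = prototype_logits(student(view_a), prototypes)
s_b = prototype_logits(student(view_b), prototypes)

# Cross-view matching: the student on one view predicts the teacher on the other.
loss = 0.5 * (dino_loss(s_a, t_b, center) + dino_loss(s_b, t_a, center))
loss.backward()
center = update_center(center, torch.cat([t_a, t_b]))
ema_update(teacher, student)
```

Centering stops one prototype from dominating every target, while sharpening stops the targets from flattening to uniform. The two together are what keep this kind of teacher-free objective from collapsing.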

Evaluation

Frozen encoder, weighted k-NN classification (no fine-tuning), identical preprocessing for every model. k=20.
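For readers unfamiliar with the protocol, a weighted k-NN probe over frozen embeddings might look like the NumPy sketch below. The card states only "weighted k-NN" with k=20. The cosine-similarity metric, the `exp(sim / temperature)` vote weighting, and the temperature value are assumptions borrowed from common practice, and `weighted_knn_predict` is a hypothetical name rather than a function from this repository.

```python
import numpy as np


def weighted_knn_predict(bank, bank_labels, queries, num_classes, k=20, temperature=0.07):
    """Classify each query by a similarity-weighted vote of its k nearest bank items."""
    # L2-normalise so that dot products are cosine similarities.
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    queries = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    sims = queries @ bank.T                                  # [Q, N]
    top_idx = np.argpartition(-sims, k - 1, axis=1)[:, :k]   # k nearest per query, unordered
    top_sims = np.take_along_axis(sims, top_idx, axis=1)     # [Q, k]
    weights = np.exp(top_sims / temperature)                 # closer neighbours vote more
    votes = np.zeros((queries.shape[0], num_classes))
    rows = np.repeat(np.arange(queries.shape[0]), k)
    np.add.at(votes, (rows, bank_labels[top_idx].ravel()), weights.ravel())
    return votes.argmax(axis=1)


# Synthetic demo: three well-separated clusters standing in for image embeddings.
rng = np.random.default_rng(0)
centers = np.eye(3, 16) * 5.0
labels = rng.integers(0, 3, size=300)
bank = centers[labels] + rng.normal(scale=0.1, size=(300, 16))
q_labels = rng.integers(0, 3, size=30)
queries = centers[q_labels] + rng.normal(scale=0.1, size=(30, 16))

pred = weighted_knn_predict(bank, labels, queries, num_classes=3)
accuracy = (pred == q_labels).mean()
print(f"top-1 accuracy on synthetic data: {accuracy:.2%}")
```

In a real probe, `bank` and `queries` would be the `pooled` embeddings of the training and test images, extracted once with the encoder frozen.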

vs. the DINOv2-base teacher (full-bank CIFAR-10, 50k bank / 10k query)

Model	Params	Top-1
DINOv2-base (teacher)	86M	98.47%
Byrne-VE (this model)	39M	86.38%
distilled-only (stage 1)	39M	81.39%

→ Byrne-VE reaches ~88% of the teacher's accuracy at ~45% of the parameters, and self-distillation added +4.99 points over the distilled-only baseline.

Self-distillation gain across probes

Probe	Byrne-VE	distilled-only	Δ
CIFAR-10, full-bank (50k bank, 10-way)	86.38%	81.39%	+4.99
CIFAR-10, small (5k bank, 10-way)	81.60%	75.40%	+6.20
CIFAR-100 (10k bank, 100-way)	54.35%	49.50%	+4.85

Self-distillation produced a consistent gain of roughly 5 to 6 points (+4.85 to +6.20) on the full-bank, small-bank, and 100-way probes, which points to a robust improvement rather than a probe-specific artifact.
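The headline ratios and per-probe deltas can be re-derived from the reported numbers. The snippet below only repeats that arithmetic with values copied from the two tables. The variable names are illustrative.

```python
# Values copied from the evaluation tables (top-1 accuracy in percent).
teacher_top1 = 98.47
teacher_params_m, student_params_m = 86, 39

probes = {
    "CIFAR-10 full-bank": (86.38, 81.39),
    "CIFAR-10 small": (81.60, 75.40),
    "CIFAR-100": (54.35, 49.50),
}

# Fraction of teacher accuracy retained, and fraction of teacher parameters used.
accuracy_retained = probes["CIFAR-10 full-bank"][0] / teacher_top1
param_fraction = student_params_m / teacher_params_m

# Stage-2 gain over the distilled-only checkpoint, in percentage points.
deltas = {name: stage2 - stage1 for name, (stage2, stage1) in probes.items()}

print(f"accuracy retained: {accuracy_retained:.1%}")
print(f"parameter fraction: {param_fraction:.1%}")
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f} points")
```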

Health checks: HRM gate tanh ≈ 0.355 (alive, off-zero) · no NaN/Inf in any weight · determinism cos 1.0000 · cross-image pooled cos 0.142 (discriminative).
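Checks of this kind could be scripted roughly as follows. The card does not say how each number was computed, so the definitions here are assumptions: determinism is taken as the cosine similarity between two forward passes on the same input, and the cross-image figure as the mean pairwise cosine between pooled embeddings of different images. `ToyEncoder` is a hypothetical stand-in so the sketch runs without the checkpoint.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def health_report(model, images):
    """images: [B, C, H, W] with B >= 2; model(images) returns a dict with 'pooled'."""
    model.eval()
    # No NaN/Inf in any weight.
    finite = all(torch.isfinite(p).all().item() for p in model.parameters())
    # Determinism: two passes over identical input should agree (cosine ~ 1).
    a = model(images)["pooled"]
    b = model(images)["pooled"]
    determinism = F.cosine_similarity(a, b, dim=-1).mean().item()
    # Discriminativeness: mean cosine between embeddings of different images.
    z = F.normalize(a, dim=-1)
    sim = z @ z.T
    off_diag = sim[~torch.eye(len(z), dtype=torch.bool)]
    return {
        "weights_finite": finite,
        "determinism_cos": determinism,
        "cross_image_cos": off_diag.mean().item(),
    }


class ToyEncoder(torch.nn.Module):
    """Hypothetical stand-in with the same output key as the real encoder."""

    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(3 * 8 * 8, 32)

    def forward(self, x):
        return {"pooled": self.proj(x.flatten(1))}


torch.manual_seed(0)
report = health_report(ToyEncoder(), torch.randn(4, 3, 8, 8))
print(report)
```

A cross-image cosine near 1 would suggest collapsed features, while a low value like the one reported indicates that different images map to clearly different embeddings.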

Usage

pip install -r requirements.txt
python -m vision.infer --ckpt byrne_ve.pt --image photo.jpg
python -m vision.eval_probe --ckpt byrne_ve.pt    # reproduce CIFAR-10 k-NN

import torch

from vision.infer import load_encoder
from vision.dataset import load_image

m, cfg = load_encoder("byrne_ve.pt")                       # encoder and its config
x = load_image("photo.jpg", cfg.image_size).unsqueeze(0)   # add a batch dimension
with torch.no_grad():                                      # inference only, no gradients needed
    out = m(x)
emb     = out["pooled"]   # [1, 512]   global image embedding
patches = out["tokens"]   # [1, 196, 512] per-patch features

Citation

@misc{byrne_ve_2026,
  title        = {Byrne-VE: A Tiny Self-Distilled Vision Encoder},
  author       = {Quazim0t0},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Quazim0t0/Byrne-VE}},
  note         = {39M ViT-style encoder distilled from DINOv2-base, then improved
                  teacher-free via DINO-style self-distillation.}
}

License

Apache-2.0.
