Model Card for HUVR

Hyper-networks for Unified Visual Representation (HUVR) use implicit neural representations to unify visual modeling along two axes: embedding dimension and task family. The models generate both standard embeddings (e.g. 768-dim for ViT-B) and compressed embeddings (as small as 8-dim), which we call Tiny Tokens, or TinToks. Trained with image reconstruction and distillation objectives, the embeddings support tasks including generation, classification, segmentation, reconstruction, and more.

Details

We provide 4 models, trained using distillation from DINOv3 and pixel-wise reconstruction on DataComp and ImageNet-22k.

- 3 models are ViT-B, sharing the same standard embedding dimension (768) but with different TinTok dimensions: 8-dim, 16-dim, and 32-dim.
- 1 model is a ViT-L, with the standard embedding dimension (1024) and 32-dim TinToks.

The models are pre-trained at 256x256 resolution and fine-tuned at mixed resolutions (256 and 512); thanks to RoPE embeddings, they support inference at a range of resolutions. The models take images as input and process them as 16x16 patches, yielding a single global (cls) token plus one token per patch. For a 480x480 input image, the model yields 901 tokens: 1 class token + 900 (30x30) patch tokens.
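The token-count arithmetic above can be sketched as a small helper. Note that `huvr_token_count` is a hypothetical illustration of the formula, not part of the HUVR API:

```python
# Token count for a HUVR-style ViT with 16x16 patches:
# one global (cls) token plus one token per image patch.
# Hypothetical helper -- not part of the HUVR codebase.

def huvr_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Return 1 cls token + (H / patch) * (W / patch) patch tokens."""
    assert height % patch_size == 0 and width % patch_size == 0, \
        "image dimensions must be divisible by the patch size"
    return 1 + (height // patch_size) * (width // patch_size)

print(huvr_token_count(480, 480))  # 1 + 30*30 = 901, matching the example above
print(huvr_token_count(256, 256))  # 1 + 16*16 = 257 at the pre-training resolution
```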

Getting Started

Please see our GitHub repository, tiktok/huvr, for more information.

Citation

BibTeX

@article{gwilliam2026HUVR,
  title={Accelerate High-Quality Diffusion Models with Inner Loop Feedback},
  author={Gwilliam, Matthew and Wang, Xiao and Hu, Xuefeng and Yang, Zhenheng},
  journal={arXiv preprint arXiv:2601.14256},
  year={2026}
}