Model Card for HUVR

Hyper-networks for Unified Visual Representation (HUVR) use implicit neural representations to unify visual modeling along two axes: embedding dimension and task family. The models generate both standard embeddings (e.g. 768-dim for ViT-B) and compressed embeddings (as small as 8-dim), which we call Tiny Tokens, or TinToks. Trained with image reconstruction and distillation objectives, the embeddings support tasks including generation, classification, segmentation, reconstruction, and more.

Details

We provide 4 models, trained using distillation from DINOv3 and pixel-wise reconstruction on DataComp and ImageNet-22k.

- 3 models are ViT-B, sharing the same standard embedding dimension (768) but with different TinTok dimensions: 8-dim, 16-dim, and 32-dim.
- 1 model is a ViT-L, with the standard embedding dimension (1024) and 32-dim TinToks.

The models are pre-trained at 256x256 resolution and fine-tuned at mixed resolutions (256 and 512); thanks to RoPE embeddings, they support inference at a range of resolutions. The models take images as input and process them as 16x16 patches, yielding a single global (cls) token plus one token per patch. For a 480x480 input image, the model yields 901 tokens: 1 class token + 900 (30x30) patch tokens.
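The token-count arithmetic above can be sketched as a small helper. Note that `huvr_token_count` is a hypothetical illustration of the formula, not part of the HUVR API:

```python
# Token count for a HUVR-style ViT with 16x16 patches:
# one global (cls) token plus one token per image patch.
# Hypothetical helper -- not part of the HUVR codebase.

def huvr_token_count(height: int, width: int, patch_size: int = 16) -> int:
    """Return 1 cls token + (H / patch) * (W / patch) patch tokens."""
    assert height % patch_size == 0 and width % patch_size == 0, \
        "image dimensions must be divisible by the patch size"
    return 1 + (height // patch_size) * (width // patch_size)

print(huvr_token_count(480, 480))  # 1 + 30*30 = 901, matching the example above
print(huvr_token_count(256, 256))  # 1 + 16*16 = 257 at the pre-training resolution
```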

Getting Started

Please see our GitHub repository, tiktok/huvr, for more information.

Citation

BibTeX

@article{gwilliam2026HUVR,
  title={Accelerate High-Quality Diffusion Models with Inner Loop Feedback},
  author={Gwilliam, Matthew and Wang, Xiao and Hu, Xuefeng and Yang, Zhenheng},
  journal={arXiv preprint arXiv:2601.14256},
  year={2026}
}