ViT-B model

Please do not download the model. The repo was kept for archival purposes.

The vision encoder can be pretrained with an autoregressive language modeling objective alone: no contrastive loss, no dual-tower architecture, and no extra text decoder.

The output includes both visual and textual data. Text token positions are encoded with an ALiBi bias in attention (2108.12409).
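As a rough illustration of the ALiBi bias mentioned above, the sketch below builds the per-head linear penalty described in 2108.12409: each head gets a fixed slope, and the attention score between a query and an earlier key is reduced in proportion to their distance. The function names are hypothetical, and the sketch only covers power-of-two head counts. The head count of this model, and how the bias is applied alongside vision tokens, is not stated here.

```python
def alibi_slopes(num_heads):
    """Geometric slopes from the ALiBi paper for a power-of-two head count."""
    if num_heads < 1 or num_heads & (num_heads - 1):
        raise ValueError("this sketch only handles power-of-two head counts")
    # Head h (0-based) gets slope 2^(-8 * (h + 1) / num_heads).
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    """Return bias[h][i][j] = -slope_h * (i - j) for j <= i, else 0.

    Entries with j > i are left at 0 here; a causal mask is expected to
    block those positions separately.
    """
    slopes = alibi_slopes(num_heads)
    return [
        [
            [-m * (i - j) if j <= i else 0.0 for j in range(seq_len)]
            for i in range(seq_len)
        ]
        for m in slopes
    ]
```

The bias is added to the attention logits before the softmax, so no learned or sinusoidal position embedding is needed for the text tokens.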

Causal attention was applied to text tokens, while fewer vision tokens were processed in the middle blocks, which made training faster.
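One way to picture causal attention over text tokens in a mixed sequence is the mask sketched below. It rests on assumptions that are not confirmed here: vision tokens come first in the sequence, attend to each other bidirectionally, and never attend to text, while each text token sees all vision tokens plus the text up to its own position. The function name is hypothetical, and the reduction of vision tokens in the middle blocks is not modeled.

```python
def build_attention_mask(num_vision, num_text):
    """mask[i][j] is True when query position i may attend to key position j.

    Assumed layout: vision tokens occupy positions [0, num_vision),
    text tokens follow.
    """
    total = num_vision + num_text
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if i < num_vision:
                # Assumption: vision tokens attend to all vision tokens only.
                mask[i][j] = j < num_vision
            else:
                # Text token: every vision token, plus text at or before i.
                mask[i][j] = j <= i
    return mask
```

For example, `build_attention_mask(2, 3)` yields a 5x5 mask whose top-left 2x2 block is fully open and whose text rows are lower-triangular.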

SV(O) booru tags receive increased weight in the loss calculation (2605.00809, figure 4) for both solo and multi-character images.
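The idea of giving some target tokens more weight in the loss can be sketched as a weighted next-token cross-entropy. This is a minimal sketch under stated assumptions, not the scheme from the cited figure: the function name is hypothetical, the weights are supplied by the caller, and normalizing by the sum of weights is one choice among several.

```python
import math


def weighted_next_token_loss(logits, targets, weights):
    """Weighted mean of per-position cross-entropy.

    logits:  list of per-position score lists over the vocabulary
    targets: list of target token ids, one per position
    weights: list of non-negative loss weights, one per position
    """
    if not (len(logits) == len(targets) == len(weights)):
        raise ValueError("logits, targets and weights must have equal length")
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    total = 0.0
    for scores, target, w in zip(logits, targets, weights):
        # Numerically stable log-sum-exp for the softmax normalizer.
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += w * (log_z - scores[target])
    # Assumption: normalize by total weight so the loss scale stays comparable.
    return total / weight_sum
```

In use, positions that belong to the emphasized tags would get a weight above 1.0 (the value is a hyperparameter not given here) and all other positions would keep 1.0.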

Source data

  • danbooru 2025-26
  • gelbooru for a certain locked tag
  • Kimi K2 style visual descriptions from multiple thinking models
  • cell1/tagutl was used for short tags until v0.5; a newer model was used later on

References

  • 2108.12409
  • 2501.04765
  • 2604.12012
  • 2605.00809

Model size: 98.6M params (Safetensors, F32)