ViT-B model
Please do not download the model. The repo was kept for archival purposes.
The vision encoder can be pretrained with an autoregressive language modeling objective: no contrastive loss, no dual-tower architecture, and no extra text decoder.
The output includes both visual and textual data. The model encodes text token positions with an ALiBi bias in attention (2108.12409).
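As a rough illustration of the ALiBi bias mentioned above, the sketch below builds the per-head linear penalty added to attention logits. It is not taken from this repo's training code: the function names `alibi_slopes` and `alibi_bias` are hypothetical, the head count is assumed to be a power of two (the simple closed form from the ALiBi paper), and only text-to-text positions are covered.

```python
def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... for a power-of-two head count.
    # Assumption: other head counts need the interpolation scheme from the paper.
    start = 2.0 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, seq_len):
    # bias[h][i][j] is added to the attention logit of query i and key j.
    # Past keys get a penalty proportional to their distance from the query;
    # future keys are masked out with -inf (causal attention).
    bias = []
    for slope in alibi_slopes(num_heads):
        head = []
        for i in range(seq_len):
            row = []
            for j in range(seq_len):
                row.append(-slope * (i - j) if j <= i else float("-inf"))
            head.append(row)
        bias.append(head)
    return bias


slopes = alibi_slopes(8)   # first head has slope 0.5, last head 2**-8
bias = alibi_bias(8, 4)    # shape: 8 heads x 4 queries x 4 keys
```

Because the penalty depends only on the distance between query and key, no learned position embedding is needed for the text tokens.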
Causal attention was applied to text tokens, while fewer vision tokens were visited in the middle blocks. This resulted in a faster training cycle.
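The attention pattern described above can be sketched as a boolean mask over a mixed sequence. This is a guess at the layout, not the repo's implementation: it assumes vision tokens come first and attend to each other bidirectionally, and the helper name `build_attention_mask` is hypothetical. The reduced set of vision tokens in the middle blocks is not modeled here.

```python
def build_attention_mask(n_vis, n_txt):
    # mask[q][k] is True when query token q may attend to key token k.
    # Assumed layout: indices [0, n_vis) are vision tokens, the rest are text.
    n = n_vis + n_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < n_vis:
                # Every token may attend to all vision tokens.
                mask[q][k] = True
            elif q >= n_vis and k <= q:
                # Text tokens attend causally: themselves and earlier text only.
                mask[q][k] = True
    return mask


mask = build_attention_mask(n_vis=2, n_txt=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions, vision rows never see text columns, and each text row sees the full image plus the text prefix up to its own position.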
SV(O) booru tags are applied with increased weighting in the loss calculation (2605.00809, figure 4) for both solo and multi-character images.
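One simple way to weight selected tokens more heavily in the loss is a weighted mean of per-token negative log-likelihoods, sketched below. The weight value of 2.0, the function name `weighted_nll`, and the way tag tokens are flagged are all illustrative assumptions; the actual scheme is described in 2605.00809.

```python
import math


def weighted_nll(log_probs, targets, weights):
    # log_probs[t][v]: log-probability of vocabulary item v at position t.
    # targets[t]: index of the correct token at position t.
    # weights[t]: loss weight for position t (e.g. larger for SV(O) tag tokens).
    total = 0.0
    weight_sum = 0.0
    for lp, target, w in zip(log_probs, targets, weights):
        total += -w * lp[target]
        weight_sum += w
    # Normalize by the weight sum so the loss scale stays comparable.
    return total / weight_sum


log_probs = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(0.75)],
]
targets = [0, 1]
weights = [1.0, 2.0]  # hypothetical: the second token belongs to an up-weighted tag
loss = weighted_nll(log_probs, targets, weights)
```

With all weights equal, this reduces to the ordinary mean cross-entropy over the sequence.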
Source data
- danbooru 2025-26
- gelbooru for a certain locked tag
- Kimi K2 style visual descriptions from multiple thinking models
- cell1/tagutl was used for short tags until v0.5; a newer model was used later on
References
- 2108.12409
- 2501.04765
- 2604.12012
- 2605.00809

