ViT-B model

Please do not download the model. The repo was kept for archival purposes.

The vision encoder can be pretrained with an autoregressive language modeling objective alone: no contrastive loss, no dual-tower architecture, and no extra text decoder.

The output includes both visual and textual data. The model encodes text token positions with an ALiBi bias added to the attention scores.
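As a rough illustration of the ALiBi bias mentioned above, the sketch below builds the per-head linear distance penalty described in 2108.12409. It is not taken from this repo: the function names are hypothetical, and it assumes a power-of-two head count, which is the simple case in the ALiBi paper.

```python
def alibi_slopes(num_heads):
    # Geometric sequence 2^(-8/n), 2^(-16/n), ... as in the ALiBi paper.
    # Assumption: num_heads is a power of two.
    start = 2 ** (-8.0 / num_heads)
    return [start ** (h + 1) for h in range(num_heads)]


def alibi_bias(num_heads, num_text_tokens):
    # bias[h][i][j] is added to the attention score of query i and key j.
    # Past keys (j <= i) get a penalty proportional to their distance.
    # Future keys are left at 0.0 here; a causal mask is expected to hide them.
    bias = []
    for slope in alibi_slopes(num_heads):
        head = [
            [-slope * (i - j) if j <= i else 0.0 for j in range(num_text_tokens)]
            for i in range(num_text_tokens)
        ]
        bias.append(head)
    return bias
```

With 8 heads the slopes are 1/2, 1/4, ..., 1/256, so the first head penalizes distant text tokens most strongly and the last head barely at all.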

Causal attention was applied to text tokens, while fewer vision tokens were visited in the middle blocks. This resulted in a faster training cycle.
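The masking described above can be sketched as a boolean attention mask over a sequence of vision tokens followed by text tokens. This is a minimal sketch under assumptions, not the repo's code: the function name is hypothetical, and it assumes that vision tokens come first, attend to each other bidirectionally, and never attend to text. Only the causal attention over text tokens is stated in the card. The reduced number of vision tokens in the middle blocks is not modeled here.

```python
def build_attention_mask(num_vision, num_text):
    # mask[q][k] is True when query position q may attend to key position k.
    # Assumed layout: [vision tokens..., text tokens...].
    n = num_vision + num_text
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_vision:
                # Assumption: every token may attend to all vision tokens.
                mask[q][k] = True
            elif q >= num_vision and k <= q:
                # Text tokens attend causally to earlier text and themselves.
                mask[q][k] = True
    return mask
```

For example, `build_attention_mask(2, 3)` lets the first text token (position 2) see both vision tokens and itself, but not the text tokens after it.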

SV(O) booru tags are applied with increased weighting in the loss calculation (2605.00809, figure 4) for both solo and multi-character images.
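One way to picture the increased weighting is a per-token weighted negative log-likelihood, where tokens belonging to the up-weighted tags count more in the average. The sketch below is only an illustration: the function name and the `boost` value are hypothetical, and the actual weighting scheme is the one described in 2605.00809.

```python
import math


def weighted_nll(token_log_probs, is_boosted, boost=2.0):
    # token_log_probs: log-probability the model assigned to each target token.
    # is_boosted: True for tokens that belong to an up-weighted tag.
    # boost: hypothetical weight for boosted tokens; other tokens weigh 1.0.
    if not token_log_probs:
        raise ValueError("need at least one token")
    weights = [boost if flag else 1.0 for flag in is_boosted]
    total = sum(-lp * w for lp, w in zip(token_log_probs, weights))
    # Normalize by the weight sum so the loss scale stays comparable.
    return total / sum(weights)


loss = weighted_nll([math.log(0.5), math.log(0.25)], [False, True])
```

Normalizing by the sum of weights rather than the token count keeps the loss magnitude stable when the share of boosted tokens varies between images.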

Source data

danbooru 2025-26
gelbooru for a certain locked tag
Kimi K2 style visual descriptions from multiple thinking models
cell1/tagutl was used for short tags until v0.5; a newer model was used later on

References

2108.12409
2501.04765
2604.12012
2605.00809

Format: Safetensors
Model size: 98.6M params
Tensor type: F32