Contrastive Bidirectional Encoder

A text encoder for improving background and style tag conditioning.

Unlike the text encoders used by other large DiT models, it:

  • covers thousands of art styles from the post-AI era
  • matches @tags to their vision embeddings (distinguishing e.g. @applepie from the plain phrase "apple pie")
  • has seen all Danbooru tags up to the year 2026
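The @tag matching above works contrastively: a tag's text embedding should land closest to the vision embedding of the images it actually describes. A minimal sketch of that matching step, using random NumPy vectors as stand-ins (the dimensions, names, and data here are illustrative, not the model's real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vision = rng.normal(size=(3, 8))              # stand-in vision embeddings for 3 images
tag = vision[1] + 0.05 * rng.normal(size=8)   # a "@tag" embedding trained to sit near image 1

scores = [cosine_sim(tag, v) for v in vision]
best = int(np.argmax(scores))                 # contrastive retrieval picks the closest image
```

During training, a contrastive objective pushes `cosine_sim` up for matching tag/image pairs and down for mismatched ones; at inference, the nearest-neighbor lookup above is all that remains.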

This repo includes the trained text encoder only.

Credit to Francesco Paissan for the distillation diagram (image modified for this model).

A single BERT trunk carries two tails: one for the diffusion model and one for the style embeddings. Both learn from the same dataset, but with different objectives.
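The trunk-and-tails layout can be sketched as follows. This is a rough NumPy stand-in, not the actual architecture: the hidden sizes, weight names, and linear tails are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16

# Shared encoder output (stand-in for a BERT pooled/[CLS] vector).
encoder_out = rng.normal(size=(hidden,))

# Two independent tails over the same trunk (illustrative linear projections):
W_diffusion = rng.normal(size=(hidden, 12))  # tail feeding the diffusion model
W_style = rng.normal(size=(hidden, 8))       # tail distilled from vision embeddings

text_embed = encoder_out @ W_diffusion   # trained with the language/tag objective
style_embed = encoder_out @ W_style      # trained with the distillation objective
```

Because only the tails differ, gradients from both objectives flow into the shared trunk, which is what lets the two tasks share one dataset pass.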

The diffusion tail learns natural language and booru tags. The style tail (shown in the diagram) is distilled from the Gemma4VisionModel.

With multiple tails, the model converges faster and separates textual from visual information more cleanly.

A mapping safetensors file projects the (unpublished) style embeddings into the diffusion model's text-embedding space, so the diffusion model does not need retraining.
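Applied at inference, that mapping is just a projection into the text-embedding space. A minimal sketch, with the mapping matrix generated randomly here as a placeholder (in practice it would be loaded from the mapping safetensors file; the dimensions below are assumptions, not the real ones):

```python
import numpy as np

rng = np.random.default_rng(2)
style_dim, text_dim = 8, 12

# Placeholder for the published mapping; in practice, load the matrix from
# the mapping .safetensors file instead of generating it.
M = rng.normal(size=(style_dim, text_dim))   # style space -> text space

style_embed = rng.normal(size=(style_dim,))
projected = style_embed @ M   # now lives in the diffusion text-embedding space

# The diffusion model consumes `projected` like any text embedding,
# which is why no retraining is required.
```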

While ModernBERT's entire vocabulary can still be used, the encoder remains backward compatible with the now-outdated T5. The implementation is also leaner than the heavyweight transformer-based adapters bolted on top of LLMs.

Source data

  • character names and character-count tags
  • colors, fashion
  • all Danbooru tags until 2026
  • spatial relationships
  • approx. 300 GB of generated vision embeddings

Aside from the diagram credit above, this is original work.

Model size: 0.3B params · Tensor type: BF16 (Safetensors)

Model tree for nebulette/cozyberry-g4-vision
