Contrastive Bidirectional Encoder

A text encoder for improving background and style tag conditioning.

Unlike the text encoders used by other large DiT models, it:

  • covers thousands of art styles from the post-AI era
  • matches @tags to their vision embeddings (distinguishing e.g. @applepie from the plain phrase "apple pie")
  • has seen all Danbooru tags up to the year 2026
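The @tag matching above works contrastively: a tag's text embedding should land closest to the vision embedding of the images it actually describes. A minimal sketch of that matching step, using random NumPy vectors as stand-ins (the dimensions, names, and data here are illustrative, not the model's real embeddings):

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vision = rng.normal(size=(3, 8))              # stand-in vision embeddings for 3 images
tag = vision[1] + 0.05 * rng.normal(size=8)   # a "@tag" embedding trained to sit near image 1

scores = [cosine_sim(tag, v) for v in vision]
best = int(np.argmax(scores))                 # contrastive retrieval picks the closest image
```

During training, a contrastive objective pushes `cosine_sim` up for matching tag/image pairs and down for mismatched ones; at inference, the nearest-neighbor lookup above is all that remains.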

This repo includes the trained text encoder only.

Credit to Francesco Paissan for the distillation diagram (image modified for this model).

A single BERT trunk carries two tails: one for the diffusion model and one for the style embeddings. Both learn from the same dataset, but with different objectives.
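The trunk-and-tails layout can be sketched as follows. This is a rough NumPy stand-in, not the actual architecture: the hidden sizes, weight names, and linear tails are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
hidden = 16

# Shared encoder output (stand-in for a BERT pooled/[CLS] vector).
encoder_out = rng.normal(size=(hidden,))

# Two independent tails over the same trunk (illustrative linear projections):
W_diffusion = rng.normal(size=(hidden, 12))  # tail feeding the diffusion model
W_style = rng.normal(size=(hidden, 8))       # tail distilled from vision embeddings

text_embed = encoder_out @ W_diffusion   # trained with the language/tag objective
style_embed = encoder_out @ W_style      # trained with the distillation objective
```

Because only the tails differ, gradients from both objectives flow into the shared trunk, which is what lets the two tasks share one dataset pass.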

The diffusion tail learns natural language and booru tags. The style tail (shown in the diagram) is distilled from the Gemma4VisionModel.

With multiple tails, the model converges faster and separates textual from visual information more cleanly.

A mapping safetensors file projects the (unpublished) style embeddings into the diffusion model's text-embedding space, so the diffusion model does not need retraining.
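Applied at inference, that mapping is just a projection into the text-embedding space. A minimal sketch, with the mapping matrix generated randomly here as a placeholder (in practice it would be loaded from the mapping safetensors file; the dimensions below are assumptions, not the real ones):

```python
import numpy as np

rng = np.random.default_rng(2)
style_dim, text_dim = 8, 12

# Placeholder for the published mapping; in practice, load the matrix from
# the mapping .safetensors file instead of generating it.
M = rng.normal(size=(style_dim, text_dim))   # style space -> text space

style_embed = rng.normal(size=(style_dim,))
projected = style_embed @ M   # now lives in the diffusion text-embedding space

# The diffusion model consumes `projected` like any text embedding,
# which is why no retraining is required.
```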

While ModernBERT's entire vocabulary can still be used, the encoder remains backward compatible with the now-outdated T5. The implementation is also leaner than the heavyweight transformer-based adapters bolted on top of LLMs.

Source data

  • character names and character-count tags
  • colors, fashion
  • all Danbooru tags until 2026
  • spatial relationships
  • approx. 300 GB of generated vision embeddings

Aside from the diagram credit above, this is original work.

Model size: 0.3B params · Tensor type: BF16 (Safetensors)

Model tree for nebulette/cozyberry-g4-vision
