Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and read back by an OCR vision model. The resulting vision features are projected into Qwen3 embedding space, so they represent the underlying meaning of the text in a format the text encoder can consume directly.
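A minimal sketch of the first step, rendering the prompt into a bitmap with Pillow. The canvas size, font, and layout here are illustrative assumptions, not the model's actual rasterizer:

```python
from PIL import Image, ImageDraw

def prompt_to_bitmap(prompt: str, size=(512, 128)) -> Image.Image:
    """Paint the prompt into a white bitmap so an OCR model can read it back."""
    img = Image.new("L", size, color=255)   # grayscale canvas, white background
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), prompt, fill=0)       # black text, Pillow's default bitmap font
    return img

bitmap = prompt_to_bitmap("a watercolor fox in the snow")
```

In the real pipeline this bitmap would be fed to the OCR vision model rather than saved to disk.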

Compared with T5/Gemma/Qwen token sequences, the Mistral ViT model's output sequence is considerably shorter: the same information is encoded in fewer embedding vectors.
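A rough back-of-the-envelope for the length claim. Every number below is an assumption for illustration (typical padded T5 length in Flux-style pipelines, an assumed bitmap size, an assumed ViT patch size); none come from the model card:

```python
# Assumed, illustrative numbers -- none come from the model card.
t5_seq_len = 512            # typical padded T5 sequence length in Flux-style pipelines
img_w, img_h = 512, 128     # rendered prompt bitmap (assumed size)
patch = 16                  # assumed ViT patch size

# One vision token per non-overlapping patch of the rendered prompt.
vit_seq_len = (img_w // patch) * (img_h // patch)
print(t5_seq_len, vit_seq_len)   # 512 text tokens vs 256 vision tokens
```

With patch merging (common in OCR-oriented encoders) the vision sequence would shrink further.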

In other literature (2601.14750, Figure 3) this component is called the Visual Projection Head.
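Conceptually, the projection head is a learned map from the vision model's hidden size into the Qwen3 embedding dimension. A minimal numpy sketch with assumed dimensions (1024 → 2560) and a single linear layer; the real weights and shapes live in the safetensors checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions -- the real ones are in the safetensors checkpoint.
vision_dim, text_dim = 1024, 2560

# A single random linear map standing in for the learned Visual Projection Head.
W = (rng.standard_normal((vision_dim, text_dim)) * 0.02).astype(np.float32)
b = np.zeros(text_dim, dtype=np.float32)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map ViT output vectors into the text encoder's embedding space."""
    return image_features @ W + b

feats = rng.standard_normal((256, vision_dim)).astype(np.float32)  # 256 vision tokens
embeds = project(feats)
print(embeds.shape)   # (256, 2560)
```

The projected sequence is what gets passed as `inputs_embeds` to the text encoder in the inference snippet below.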

Inference

# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    # Encode the rendered prompt bitmap with the OCR vision model.
    image_features = vision_model.get_image_features(image_path)
    # Project the vision features into Qwen3 embedding space and run them
    # through the text encoder in place of token embeddings.
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)

References

  • 2510.17800
  • 2510.18279
  • 2601.14251

Datasets

  • artbench-pd-256x256
  • anime-art-multicaptions (multicharacter interactions)
  • laion
  • spatial-caption
  • spright-coco
  • z-image-ethnicity-test
  • benchmarks from the Qwen-Image Technical Report
Safetensors

Model size: 16.8M params · Tensor type: F32