---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---
# Image As Text For Flux.2 Klein
The text prompt is painted into a bitmap and recovered via OCR. The resulting features are transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
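As a rough illustration, the "painting" step can be sketched with Pillow. The canvas size, font, and layout below are placeholders, not the values used by the actual pipeline:

```python
from PIL import Image, ImageDraw

def paint_prompt(prompt: str, size=(512, 128)) -> Image.Image:
    """Render the prompt string onto a white bitmap using Pillow's default font."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), prompt, fill="black")
    return img

bitmap = paint_prompt("a watercolor fox in the snow")
```

The bitmap, rather than a token-id sequence, is what the OCR vision model consumes.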
Compared to T5/Gemma/Qwen text encoders, the Mistral ViT's output sequence is considerably shorter: the same information is encoded more efficiently.
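A back-of-the-envelope sketch of why the vision path is shorter, assuming a hypothetical 16-pixel patch size and a 512x128 prompt bitmap (the real ViT config may differ): the sequence length is fixed by the patch grid, not by the prompt's subword count.

```python
# Hypothetical values for illustration only.
PATCH = 16          # assumed ViT patch size
W, H = 512, 128     # assumed prompt-bitmap resolution

# One token per patch, regardless of how many subwords the prompt contains.
num_image_tokens = (W // PATCH) * (H // PATCH)
print(num_image_tokens)  # 256
```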
In related literature (2601.14750, Figure 3), the safetensors file is called the Visual Projection Head.
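Conceptually, such a head maps vision features into the text encoder's embedding space. The sketch below is a minimal stand-in: `VISION_DIM` and `TEXT_DIM` are assumed widths, not the checkpoint's real dimensions, and the actual head may be deeper than a single linear layer.

```python
import torch
import torch.nn as nn

VISION_DIM = 1024   # assumed OCR vision feature width
TEXT_DIM = 2560     # assumed Qwen3 hidden size

class VisualProjectionHead(nn.Module):
    """Projects OCR vision features into the text encoder's embedding space."""
    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.proj(image_features)

head = VisualProjectionHead(VISION_DIM, TEXT_DIM)
feats = torch.randn(1, 64, VISION_DIM)   # (batch, patches, vision_dim)
embeds = head(feats)                     # (batch, patches, text_dim)
```

The projected sequence is then passed to the text encoder as `inputs_embeds`, as in the inference snippet below.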
## Inference
```python
# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    image_features = vision_model.get_image_features(image_path)
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```
## References
- 2510.17800
- 2510.18279
- 2601.14251
## Datasets
- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report
