Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and read back by an OCR vision model. The resulting vision features are projected into Qwen3 embedding space, so they represent the underlying meaning of the text in a format the text encoder can consume directly.
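A minimal sketch of the first step, rendering the prompt into a bitmap with Pillow. The canvas size, font, and layout here are illustrative assumptions, not the model's actual rasterizer:

```python
from PIL import Image, ImageDraw

def prompt_to_bitmap(prompt: str, size=(512, 128)) -> Image.Image:
    """Paint the prompt into a white bitmap so an OCR model can read it back."""
    img = Image.new("L", size, color=255)   # grayscale canvas, white background
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), prompt, fill=0)       # black text, Pillow's default bitmap font
    return img

bitmap = prompt_to_bitmap("a watercolor fox in the snow")
```

In the real pipeline this bitmap would be fed to the OCR vision model rather than saved to disk.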

Compared with T5/Gemma/Qwen token sequences, the Mistral ViT model's output sequence is considerably shorter: the same information is encoded in fewer embedding vectors.
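A rough back-of-the-envelope for the length claim. Every number below is an assumption for illustration (typical padded T5 length in Flux-style pipelines, an assumed bitmap size, an assumed ViT patch size); none come from the model card:

```python
# Assumed, illustrative numbers -- none come from the model card.
t5_seq_len = 512            # typical padded T5 sequence length in Flux-style pipelines
img_w, img_h = 512, 128     # rendered prompt bitmap (assumed size)
patch = 16                  # assumed ViT patch size

# One vision token per non-overlapping patch of the rendered prompt.
vit_seq_len = (img_w // patch) * (img_h // patch)
print(t5_seq_len, vit_seq_len)   # 512 text tokens vs 256 vision tokens
```

With patch merging (common in OCR-oriented encoders) the vision sequence would shrink further.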

In other literature (2601.14750, Figure 3) this component is called the Visual Projection Head.
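Conceptually, the projection head is a learned map from the vision model's hidden size into the Qwen3 embedding dimension. A minimal numpy sketch with assumed dimensions (1024 → 2560) and a single linear layer; the real weights and shapes live in the safetensors checkpoint:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions -- the real ones are in the safetensors checkpoint.
vision_dim, text_dim = 1024, 2560

# A single random linear map standing in for the learned Visual Projection Head.
W = (rng.standard_normal((vision_dim, text_dim)) * 0.02).astype(np.float32)
b = np.zeros(text_dim, dtype=np.float32)

def project(image_features: np.ndarray) -> np.ndarray:
    """Map ViT output vectors into the text encoder's embedding space."""
    return image_features @ W + b

feats = rng.standard_normal((256, vision_dim)).astype(np.float32)  # 256 vision tokens
embeds = project(feats)
print(embeds.shape)   # (256, 2560)
```

The projected sequence is what gets passed as `inputs_embeds` to the text encoder in the inference snippet below.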

Inference

# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    # Encode the rendered prompt bitmap with the OCR vision model.
    image_features = vision_model.get_image_features(image_path)
    # Project the vision features into Qwen3 embedding space and run them
    # through the text encoder in place of token embeddings.
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)

References

  • 2510.17800
  • 2510.18279
  • 2601.14251

Datasets

  • artbench-pd-256x256
  • anime-art-multicaptions (multicharacter interactions)
  • laion
  • spatial-caption
  • spright-coco
  • z-image-ethnicity-test
  • benchmarks from the Qwen-Image Technical Report
Safetensors

Model size: 16.8M params · Tensor type: F32