---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR. It is then transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder. Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter, so the same information is encoded more efficiently. In other literature (2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called a Visual Projection Head.

## Inference

```python
# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    image_features = vision_model.get_image_features(image_path)
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```

![](images/preview.png)

## References

- 2510.17800
- 2510.18279
- 2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report
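The "text painted into a bitmap" step described at the top of this card can be sketched with Pillow. Note this is an illustrative sketch only: the canvas size, font, and colors here are assumptions, and `paint_prompt` is a hypothetical helper, not part of the actual pipeline.

```python
# Hedged sketch of rendering a prompt to a bitmap for the OCR/vision
# model. All rendering parameters are assumed, not the real pipeline's.
from PIL import Image, ImageDraw

def paint_prompt(prompt: str, size=(512, 64)) -> Image.Image:
    """Render the prompt as black text on a white canvas."""
    canvas = Image.new("RGB", size, "white")
    # Pillow falls back to its built-in bitmap font when none is given.
    ImageDraw.Draw(canvas).text((8, 8), prompt, fill="black")
    return canvas

bitmap = paint_prompt("A watercolor fox leaping over a mossy log")
bitmap.save("prompt.png")  # a bitmap like this is what the vision model consumes
```

The resulting PNG would then be passed to `vision_model.get_image_features` in place of a tokenized prompt.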
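The sequence-length claim above follows from how a ViT tokenizes its input: it emits one token per image patch, so the encoder's sequence length is fixed by the bitmap geometry rather than by the prompt's character count. A rough illustration, where the patch size and bitmap dimensions are assumed values and not LightOnOCR-2's real configuration:

```python
# Illustrative only: patch size and bitmap geometry are assumptions.
def vit_sequence_length(width: int, height: int, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT produces for a width x height bitmap."""
    assert width % patch_size == 0 and height % patch_size == 0
    return (width // patch_size) * (height // patch_size)

# However long the painted prompt is, the ViT output length stays fixed
# by the canvas: a 512x64 strip with 16x16 patches yields 32 * 4 tokens.
print(vit_sequence_length(512, 64))  # → 128
```

A subword text encoder, by contrast, grows its sequence roughly linearly with prompt length, which is why the same information can be carried in fewer positions here.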