---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---
# Image As Text For Flux.2 Klein
The text prompt is painted into a bitmap and recovered via OCR.
The recovered visual features are transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format the text encoder can consume.
Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter, so the same information is encoded more efficiently.
In other literature (2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called a Visual Projection Head.
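The first step, painting the prompt into a bitmap, can be sketched with Pillow. This is a minimal illustration, not the pipeline's actual renderer; the canvas size, margin, and default font are illustrative assumptions:

```python
from PIL import Image, ImageDraw

def render_prompt_to_bitmap(prompt: str, width: int = 1024, height: int = 256) -> Image.Image:
    # Paint the prompt as black text on a white canvas; the OCR vision
    # tower later reads this image back. Size and font are illustrative.
    canvas = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(canvas)
    draw.multiline_text((16, 16), prompt, fill="black")
    return canvas

bitmap = render_prompt_to_bitmap("a red fox in the snow")
```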
## Inference
```python
import torch

# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    # Encode the rendered prompt bitmap with the OCR vision tower.
    image_features = vision_model.get_image_features(image_path)
    # Project the visual features into Qwen3 embedding space and run
    # them through the text encoder.
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```
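The `embedder` above plays the role of the visual projection head: it maps OCR vision features into the text encoder's embedding space. A minimal torch sketch of that idea, where the MLP architecture and all dimensions are illustrative assumptions rather than this checkpoint's actual layout:

```python
import torch
import torch.nn as nn

class VisualProjectionHead(nn.Module):
    """Illustrative projection: vision feature dim -> text embedding dim."""

    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        # A small MLP; the real head's shape and depth may differ.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        return self.proj(image_features)

head = VisualProjectionHead()
features = torch.randn(1, 64, 1024)   # (batch, vision tokens, vision dim)
embeds = head(features)               # (batch, vision tokens, text dim)
```

The short vision-token sequence is what makes the encoding compact: the text encoder consumes one embedding per visual token rather than one per text token.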
![](images/preview.png)
## References
- 2510.17800
- 2510.18279
- 2601.14251
## Datasets
- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report