---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR; the recovered vision features are then transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
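
As a minimal sketch of the painting step, the snippet below rasterizes a prompt into a bitmap with Pillow. The canvas size, wrapping width, and font are assumptions for illustration; the card does not specify how the text is actually painted.

```python
# Sketch only: rasterize a prompt into a bitmap for the OCR vision model.
# Canvas size, wrapping width, and the default PIL font are assumptions.
import textwrap
from PIL import Image, ImageDraw

def paint_prompt(prompt: str, size=(512, 512), wrap_width=48) -> Image.Image:
    image = Image.new("RGB", size, "white")               # blank white canvas
    draw = ImageDraw.Draw(image)
    wrapped = "\n".join(textwrap.wrap(prompt, wrap_width))
    draw.multiline_text((16, 16), wrapped, fill="black")  # default bitmap font
    return image

paint_prompt("A red fox jumping over a frozen lake").save("prompt.png")
```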

Compared to T5/Gemma/Qwen text encoders, the Mistral ViT's output sequence is considerably shorter: the same information is encoded in fewer tokens.
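
A back-of-envelope illustration of why: the vision-token count is fixed by the bitmap resolution rather than growing with prompt length. The patch size and 2x2 token merging below are assumed values, not the actual configuration of these models.

```python
# Illustration only: vision-token count depends on the bitmap, not the text.
# Patch size and 2x2 patch merging are assumed, not the models' real config.
def vision_tokens(side: int = 512, patch: int = 16, merge: int = 2) -> int:
    return (side // patch) ** 2 // (merge * merge)

for text_tokens in (64, 256, 1024):
    print(f"{text_tokens:4d} text tokens -> {vision_tokens()} vision tokens")
```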

In other literature (arXiv:2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called a Visual Projection Head.
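
For intuition, a projection head of this kind is typically a small MLP that maps vision-feature dimensions into the text model's embedding space. The sketch below is an assumption about the shape of such a module; the actual architecture and weights are whatever the safetensors file above contains.

```python
# Sketch only: a typical visual projection head. Hidden sizes are assumptions;
# the real module is defined by model.safetensors above.
import torch
from torch import nn

class VisualProjectionHead(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        # Two-layer MLP from vision-feature space into text-embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vision_dim) -> (batch, seq_len, text_dim)
        return self.proj(image_features)
```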

## Inference

```python
import torch

# Part of the Flux2KleinPipeline; elided portions (...) are omitted here.
def _get_qwen3_prompt_embeds(...):
    ...
    # Encode the painted prompt with the OCR vision model, project it into
    # Qwen3 embedding space, and run it through the text encoder.
    image_features = vision_model.get_image_features(image_path)
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```
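
A usage sketch follows. Passing precomputed embeddings via `prompt_embeds` mirrors other diffusers pipelines and is an assumption here, not a documented `Flux2KleinPipeline` parameter; `prompt.png` is the bitmap painted earlier.

```python
# Assumed usage: prompt_embeds mirrors other diffusers pipelines and is not
# confirmed for Flux2KleinPipeline; 'prompt.png' is the painted bitmap above.
prompt_embeds = pipeline._get_qwen3_prompt_embeds('prompt.png')
image = pipeline(prompt_embeds=prompt_embeds).images[0]
image.save('output.png')
```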

![example](output.png)

## References

- arXiv:2510.17800
- arXiv:2510.18279
- arXiv:2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report