---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR. It is then transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder. Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter, so the same information is encoded more efficiently. In other literature (2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called a Visual Projection Head.

## Inference

```python
# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    image_features = vision_model.get_image_features(image_path)
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```

![](images/preview.png)

## References

- 2510.17800
- 2510.18279
- 2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report
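The "text painted into a bitmap" step described at the top of this card can be sketched with Pillow. Note this is an illustrative sketch only: the canvas size, font, and colors here are assumptions, and `paint_prompt` is a hypothetical helper, not part of the actual pipeline.

```python
# Hedged sketch of rendering a prompt to a bitmap for the OCR/vision
# model. All rendering parameters are assumed, not the real pipeline's.
from PIL import Image, ImageDraw

def paint_prompt(prompt: str, size=(512, 64)) -> Image.Image:
    """Render the prompt as black text on a white canvas."""
    canvas = Image.new("RGB", size, "white")
    # Pillow falls back to its built-in bitmap font when none is given.
    ImageDraw.Draw(canvas).text((8, 8), prompt, fill="black")
    return canvas

bitmap = paint_prompt("A watercolor fox leaping over a mossy log")
bitmap.save("prompt.png")  # a bitmap like this is what the vision model consumes
```

The resulting PNG would then be passed to `vision_model.get_image_features` in place of a tokenized prompt.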
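The sequence-length claim above follows from how a ViT tokenizes its input: it emits one token per image patch, so the encoder's sequence length is fixed by the bitmap geometry rather than by the prompt's character count. A rough illustration, where the patch size and bitmap dimensions are assumed values and not LightOnOCR-2's real configuration:

```python
# Illustrative only: patch size and bitmap geometry are assumptions.
def vit_sequence_length(width: int, height: int, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT produces for a width x height bitmap."""
    assert width % patch_size == 0 and height % patch_size == 0
    return (width // patch_size) * (height // patch_size)

# However long the painted prompt is, the ViT output length stays fixed
# by the canvas: a 512x64 strip with 16x16 patches yields 32 * 4 tokens.
print(vit_sequence_length(512, 64))  # → 128
```

A subword text encoder, by contrast, grows its sequence roughly linearly with prompt length, which is why the same information can be carried in fewer positions here.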