---
license: apache-2.0
base_model:
  - lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
  - flux2
  - flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR: the vision model's image features are transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
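The first step — painting the prompt into a bitmap — can be sketched with Pillow. The canvas size, font, and layout below are illustrative assumptions, not the card's actual rendering parameters:

```python
from PIL import Image, ImageDraw

def render_prompt_to_bitmap(prompt: str, width: int = 512, height: int = 64) -> Image.Image:
    """Paint a prompt as black text on a white canvas (hypothetical parameters).

    High contrast keeps the text easy for the OCR vision tower to recover.
    """
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), prompt, fill="black")  # Pillow's default bitmap font
    return img

bitmap = render_prompt_to_bitmap("a cat sitting on a red chair")
```

The resulting bitmap is what the OCR vision model consumes in place of a tokenized string.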

Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter, so the same information is encoded more efficiently.

In other literature (2601.14750, Figure 3), the safetensors file is called a Visual Projection Head.
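Structurally, a projection head of this kind maps vision-tower features into the text encoder's embedding space. A minimal sketch, assuming hypothetical dimensions and a single linear layer (the actual adapter in this repository may be deeper or use different sizes):

```python
import torch
from torch import nn

class VisualProjectionHead(nn.Module):
    """Project vision features into the text encoder's embedding space.

    Dimensions here are assumptions for illustration, not the real checkpoint's.
    """
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vision_dim) -> (batch, seq_len, text_dim)
        return self.proj(image_features)

head = VisualProjectionHead()
feats = torch.randn(1, 77, 1024)   # fake vision features
embeds = head(feats)               # now shaped for the text encoder
```

The key property is that only the sequence's feature dimension changes; the (already short) OCR sequence length is preserved.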

## Inference

```python
# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    # The OCR vision tower extracts features from the rendered prompt bitmap.
    image_features = vision_model.get_image_features(image_path)
    # The projected features are fed to the text encoder as input embeddings,
    # bypassing tokenization entirely.
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...
```

```python
import torch

# Load the OCR vision tower, the projection adapter, and the base pipeline.
vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```

## References

- 2510.17800
- 2510.18279
- 2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report