---
license: apache-2.0
base_model:
  - lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
  - flux2
  - flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR: the vision model's image features are transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
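The first step — painting the prompt into a bitmap — can be sketched with Pillow. The canvas size, font, and layout below are illustrative assumptions, not the card's actual rendering parameters:

```python
from PIL import Image, ImageDraw

def render_prompt_to_bitmap(prompt: str, width: int = 512, height: int = 64) -> Image.Image:
    """Paint a prompt as black text on a white canvas (hypothetical parameters).

    High contrast keeps the text easy for the OCR vision tower to recover.
    """
    img = Image.new("RGB", (width, height), "white")
    draw = ImageDraw.Draw(img)
    draw.text((8, 8), prompt, fill="black")  # Pillow's default bitmap font
    return img

bitmap = render_prompt_to_bitmap("a cat sitting on a red chair")
```

The resulting bitmap is what the OCR vision model consumes in place of a tokenized string.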

Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter, so the same information is encoded more efficiently.

In other literature (2601.14750, Figure 3), the safetensors file is called a Visual Projection Head.
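Structurally, a projection head of this kind maps vision-tower features into the text encoder's embedding space. A minimal sketch, assuming hypothetical dimensions and a single linear layer (the actual adapter in this repository may be deeper or use different sizes):

```python
import torch
from torch import nn

class VisualProjectionHead(nn.Module):
    """Project vision features into the text encoder's embedding space.

    Dimensions here are assumptions for illustration, not the real checkpoint's.
    """
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2560):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vision_dim) -> (batch, seq_len, text_dim)
        return self.proj(image_features)

head = VisualProjectionHead()
feats = torch.randn(1, 77, 1024)   # fake vision features
embeds = head(feats)               # now shaped for the text encoder
```

The key property is that only the sequence's feature dimension changes; the (already short) OCR sequence length is preserved.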

## Inference

```python
# Part of the Flux2KleinPipeline.
def _get_qwen3_prompt_embeds(...):
    ...
    # The OCR vision tower extracts features from the rendered prompt bitmap.
    image_features = vision_model.get_image_features(image_path)
    # The projected features are fed to the text encoder as input embeddings,
    # bypassing tokenization entirely.
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...
```

```python
import torch

# Load the OCR vision tower, the projection adapter, and the base pipeline.
vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```

## References

- 2510.17800
- 2510.18279
- 2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report