---
license: apache-2.0
base_model:
- lightonai/LightOnOCR-2-1B
base_model_relation: adapter
tags:
- flux2
- flux-2-klein
---

# Image As Text For Flux.2 Klein

The text (prompt) is painted into a bitmap and recovered via OCR; the recovered vision features are then transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
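
As a minimal sketch of the painting step, the snippet below rasterizes a prompt into a bitmap with Pillow. The canvas size, wrapping width, and font are assumptions for illustration; the card does not specify how the text is actually painted.

```python
# Sketch only: rasterize a prompt into a bitmap for the OCR vision model.
# Canvas size, wrapping width, and the default PIL font are assumptions.
import textwrap
from PIL import Image, ImageDraw

def paint_prompt(prompt: str, size=(512, 512), wrap_width=48) -> Image.Image:
    image = Image.new("RGB", size, "white")               # blank white canvas
    draw = ImageDraw.Draw(image)
    wrapped = "\n".join(textwrap.wrap(prompt, wrap_width))
    draw.multiline_text((16, 16), wrapped, fill="black")  # default bitmap font
    return image

paint_prompt("A red fox jumping over a frozen lake").save("prompt.png")
```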

Compared to T5/Gemma/Qwen text encoders, the Mistral ViT's output sequence is considerably shorter: the same information is encoded in fewer tokens.
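
A back-of-envelope illustration of why: the vision-token count is fixed by the bitmap resolution rather than growing with prompt length. The patch size and 2x2 token merging below are assumed values, not the actual configuration of these models.

```python
# Illustration only: vision-token count depends on the bitmap, not the text.
# Patch size and 2x2 patch merging are assumed, not the models' real config.
def vision_tokens(side: int = 512, patch: int = 16, merge: int = 2) -> int:
    return (side // patch) ** 2 // (merge * merge)

for text_tokens in (64, 256, 1024):
    print(f"{text_tokens:4d} text tokens -> {vision_tokens()} vision tokens")
```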

In other literature (arXiv:2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called a Visual Projection Head.
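
For intuition, a projection head of this kind is typically a small MLP that maps vision-feature dimensions into the text model's embedding space. The sketch below is an assumption about the shape of such a module; the actual architecture and weights are whatever the safetensors file above contains.

```python
# Sketch only: a typical visual projection head. Hidden sizes are assumptions;
# the real module is defined by model.safetensors above.
import torch
from torch import nn

class VisualProjectionHead(nn.Module):
    def __init__(self, vision_dim: int = 1024, text_dim: int = 2048):
        super().__init__()
        # Two-layer MLP from vision-feature space into text-embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # (batch, seq_len, vision_dim) -> (batch, seq_len, text_dim)
        return self.proj(image_features)
```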

## Inference

```python
import torch

# Part of the Flux2KleinPipeline; elided portions (...) are omitted here.
def _get_qwen3_prompt_embeds(...):
    ...
    # Encode the painted prompt with the OCR vision model, project it into
    # Qwen3 embedding space, and run it through the text encoder.
    image_features = vision_model.get_image_features(image_path)
    output = text_encoder(
        inputs_embeds=embedder(image_features),
        attention_mask=attention_mask,
        output_hidden_states=True,
        use_cache=False,
    )
    ...

vision_model = LightOnVisionModel.from_pretrained('lightonai/LightOnOCR-2-1B')
embedder = LightingFlux2Klein.from_pretrained('nightknocker/flux2-klein-4b-lighting-text-encoder')
pipeline = Flux2KleinPipeline.from_pretrained('black-forest-labs/FLUX.2-klein-base-4B', torch_dtype=torch.bfloat16)
```
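
A usage sketch follows. Passing precomputed embeddings via `prompt_embeds` mirrors other diffusers pipelines and is an assumption here, not a documented `Flux2KleinPipeline` parameter; `prompt.png` is the bitmap painted earlier.

```python
# Assumed usage: prompt_embeds mirrors other diffusers pipelines and is not
# confirmed for Flux2KleinPipeline; 'prompt.png' is the painted bitmap above.
prompt_embeds = pipeline._get_qwen3_prompt_embeds('prompt.png')
image = pipeline(prompt_embeds=prompt_embeds).images[0]
image.save('output.png')
```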

![example](output.png)

## References

- arXiv:2510.17800
- arXiv:2510.18279
- arXiv:2601.14251

## Datasets

- artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- laion
- spatial-caption
- spright-coco
- z-image-ethnicity-test
- benchmarks from the Qwen-Image Technical Report