Update README.md

README.md CHANGED

@@ -10,7 +10,9 @@ base_model_relation: adapter
 The text (prompt) is painted into a bitmap and recovered via OCR.
 The recovered text is transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
 
-Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter.
+Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter. The same information is encoded more efficiently.
+
+In other literature (2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called the Visual Projection Head.
 
 ## Inference
 
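The shorter sequence length claimed in the diff falls out of patchifying the rendered bitmap: a ViT emits one token per patch, so a small prompt bitmap produces far fewer tokens than a per-subword text encoder. A minimal sketch of that arithmetic, assuming an illustrative 512×64 prompt bitmap and a 16×16 patch size (both hypothetical; the actual model's resolution and patch size are not stated in the card):

```python
def vit_sequence_length(width: int, height: int, patch: int = 16) -> int:
    """Number of patch tokens a ViT produces for a width x height bitmap.

    patch=16 is an assumed, typical ViT patch size, not a value taken
    from this model's config.
    """
    return (width // patch) * (height // patch)

# A 512x64 bitmap at 16x16 patches -> 32 * 4 = 128 tokens,
# independent of how many subword tokens the prompt text contains.
print(vit_sequence_length(512, 64))  # 128
```

Under these assumptions the token count is fixed by the bitmap geometry rather than by prompt length, which is one way the same information can be packed into a shorter sequence.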