Update README.md

README.md CHANGED

@@ -10,7 +10,9 @@ base_model_relation: adapter
 The text (prompt) is painted into a bitmap and recovered via OCR.
 The recovered text is transformed into Qwen3 embeddings, which represent the underlying meaning of the text in a format suitable for the text encoder.
 
-Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter.
+Compared to T5/Gemma/Qwen, the Mistral ViT model's output sequence length is considerably shorter. The same information is encoded more efficiently.
+
+In other literature (2601.14750, Figure 3) the [safetensors](https://huggingface.co/nightknocker/flux2-klein-4b-lighting-text-encoder/blob/main/model.safetensors) file is called the Visual Projection Head.
 
 ## Inference
 
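The shorter sequence length claimed in the diff falls out of patchifying the rendered bitmap: a ViT emits one token per patch, so a small prompt bitmap produces far fewer tokens than a per-subword text encoder. A minimal sketch of that arithmetic, assuming an illustrative 512×64 prompt bitmap and a 16×16 patch size (both hypothetical; the actual model's resolution and patch size are not stated in the card):

```python
def vit_sequence_length(width: int, height: int, patch: int = 16) -> int:
    """Number of patch tokens a ViT produces for a width x height bitmap.

    patch=16 is an assumed, typical ViT patch size, not a value taken
    from this model's config.
    """
    return (width // patch) * (height // patch)

# A 512x64 bitmap at 16x16 patches -> 32 * 4 = 128 tokens,
# independent of how many subword tokens the prompt text contains.
print(vit_sequence_length(512, 64))  # 128
```

Under these assumptions the token count is fixed by the bitmap geometry rather than by prompt length, which is one way the same information can be packed into a shorter sequence.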