+---
+license: apache-2.0
+language:
+- en
+- de
+- zh
+- multilingual
+library_name: onnxruntime
+tags:
+- onnx
+- embedding
+- text-embedding
+- retrieval
+- sentence-similarity
+- feature-extraction
+pipeline_tag: sentence-similarity
+base_model: codefuse-ai/F2LLM-v2-0.6B
+---
+# F2LLM-v2-0.6B — FP32 ONNX (dynamo export)
+ONNX export of [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B), a general-purpose multilingual embedding model from the F2LLM-v2 family. The base model supports 200+ languages and was trained on 60M high-quality multilingual examples.
+This is the **full-precision (FP32) reference** export. For production use, prefer the [INT8](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX-INT8) or [INT4](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX-INT4) variants, which are roughly 2–3× smaller with negligible quality loss.
+## Export method
+Exported with **`torch.onnx.export(dynamo=True)`** (PyTorch 2.9, opset 20).
+The dynamo exporter traces the model at the FX-graph (symbolic) level, so all internal tensor shapes, including the Qwen3 causal attention mask, carry symbolic batch and sequence dimensions throughout. **Dynamic batch verified:** batch sizes 1, 2, 4, and 8 all produce correctly shaped outputs.
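
A shape check like the one described above could be sketched as follows. This is a minimal illustration, not the script used for the export: `check_dynamic_batch` and `fake_run` are hypothetical names, and `fake_run` is a stand-in for the real model so the sketch runs without the 2.4 GB weights.

```python
import numpy as np

def check_dynamic_batch(run, batch_sizes=(1, 2, 4, 8), seq_len=16, hidden=1024):
    """Call `run(ids, mask)` for each batch size and verify the output shape.

    `run` is any callable that returns the last_hidden_state array.
    """
    shapes = {}
    for b in batch_sizes:
        ids = np.ones((b, seq_len), dtype=np.int64)   # dummy token ids
        mask = np.ones((b, seq_len), dtype=np.int64)  # no padding
        out = run(ids, mask)
        if out.shape != (b, seq_len, hidden):
            raise AssertionError(f"batch={b}: unexpected shape {out.shape}")
        shapes[b] = out.shape
    return shapes

# Stand-in for the real model (assumption: output is [batch, seq, 1024]).
def fake_run(ids, mask):
    return np.zeros((*ids.shape, 1024), dtype=np.float32)

shapes = check_dynamic_batch(fake_run)
print(shapes)
```

With the actual export, `run` would wrap an `onnxruntime.InferenceSession`, for example `lambda ids, mask: session.run(None, {"input_ids": ids, "attention_mask": mask})[0]`.
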
+## Model details
+| Property | Value |
+|---|---|
+| Base model | codefuse-ai/F2LLM-v2-0.6B |
+| Architecture | Qwen3 decoder |
+| Parameters | ~600 M |
+| Embedding dim | 1024 |
+| Max context | 32 768 tokens |
+| Languages | 200+ (multilingual) |
+| Inputs | `input_ids [batch, seq]`, `attention_mask [batch, seq]` |
+| Output | `last_hidden_state [batch, seq, 1024]` |
+| Pooling | Last-token pooling + L2 normalisation (not part of the ONNX graph; applied by the caller after inference) |
+| File size | ~2.4 GB (`model.onnx` + `model.onnx.data`) |
+## Inference
+```python
+import numpy as np
+import onnxruntime as ort
+from tokenizers import Tokenizer
+
+# Pad on the right so the pooling step below can locate the last real token.
+tokenizer = Tokenizer.from_file("tokenizer.json")
+tokenizer.enable_padding(pad_id=0, direction="right")
+tokenizer.enable_truncation(max_length=512)  # raise this for longer inputs
+
+# model.onnx.data must sit next to model.onnx so the external weights are found.
+session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
+
+texts = ["semantic search example", "another sentence"]
+enc  = tokenizer.encode_batch(texts)
+ids  = np.array([e.ids            for e in enc], dtype=np.int64)
+mask = np.array([e.attention_mask for e in enc], dtype=np.int64)
+
+lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]  # [batch, seq, 1024]
+
+# Last-token pooling: take the hidden state at the last non-padding position
+# (valid only with right padding, as configured above).
+seq_lens   = mask.sum(axis=1) - 1
+embeddings = lhs[np.arange(len(texts)), seq_lens]
+
+# L2-normalise so that dot products equal cosine similarities.
+norms      = np.linalg.norm(embeddings, axis=1, keepdims=True)
+embeddings = embeddings / np.maximum(norms, 1e-8)
+
+print(embeddings.shape)  # (2, 1024)
+```
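
The pooling step above depends on the padding side: `mask.sum(axis=1) - 1` only points at the last real token when padding is on the right. A minimal sketch of a helper that covers both cases is shown here on synthetic data. `last_token_pool` and its `padding_side` argument are hypothetical names introduced for illustration, not part of this repo.

```python
import numpy as np

def last_token_pool(hidden, mask, padding_side="right"):
    """Pick each sequence's last real token and L2-normalise it.

    hidden: [batch, seq, dim] float array; mask: [batch, seq] of 0/1.
    """
    if padding_side == "left":
        # With left padding the last real token is always the final column.
        pooled = hidden[:, -1]
    else:
        last = mask.sum(axis=1) - 1  # index of the last real token per row
        pooled = hidden[np.arange(hidden.shape[0]), last]
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.maximum(norms, 1e-8)

# Tiny synthetic batch: 2 sequences, 3 positions, 2 dimensions.
hidden = np.array([
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # third position is padding
    [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],  # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = last_token_pool(hidden, mask)
print(pooled)
```

The first row ignores the padded position and pools `[0, 2]`; the second pools `[3, 4]`. Both are then scaled to unit length.
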
+> **Query prefix**: For asymmetric retrieval, prepend `"Instruct: <task description>\nQuery: "` to query strings. Documents are encoded without a prefix.
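
The query template above can be applied with a small helper before tokenisation. This is a sketch only: `format_query` is a hypothetical name, and the task description is an illustrative placeholder rather than a string taken from the base model's documentation.

```python
def format_query(task: str, query: str) -> str:
    """Wrap a query in the instruction template quoted above."""
    return f"Instruct: {task}\nQuery: {query}"

# Placeholder task description (an assumption, chosen for illustration).
task = "Given a question, retrieve passages that answer it"

queries = [format_query(task, "how do I export a model to ONNX?")]
documents = ["torch.onnx.export converts a PyTorch module to ONNX."]  # no prefix

print(repr(queries[0]))
```

Only the query side gets the prefix. Documents go to the tokenizer unchanged, so an existing document index does not need re-encoding when the task description changes.
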
+## Files
+| File | Size | Description |
+|---|---|---|
+| `model.onnx` | ~4 MB | ONNX graph (opset 20, no weights) |
+| `model.onnx.data` | ~2.38 GB | External weight data |
+| `tokenizer.json` | 8 MB | HuggingFace fast tokenizer |
+| `config.json` | — | Model config |
+## Quantized variants
+| Repo | Precision | Size | Notes |
+|---|---|---|---|
+| [cstr/F2LLM-v2-0.6B-ONNX](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX) | FP32 | 2.4 GB | This repo — reference |
+| [cstr/F2LLM-v2-0.6B-ONNX-INT8](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX-INT8) | INT8 per-channel | 1.1 GB | Recommended for most use |
+| [cstr/F2LLM-v2-0.6B-ONNX-INT4](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX-INT4) | INT4 MatMulNBits | 0.9 GB | Minimum RAM |
+| [cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL](https://huggingface.co/cstr/F2LLM-v2-0.6B-ONNX-INT8-FULL) | INT8 incl. embeddings | 0.6 GB | Smallest file |
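
The size reduction of each variant relative to this FP32 export follows directly from the table. The short calculation below uses only the sizes listed above; the variable names are illustrative.

```python
# Sizes in GB, copied from the table above.
sizes_gb = {"FP32": 2.4, "INT8": 1.1, "INT4": 0.9, "INT8-FULL": 0.6}

# Ratio of the FP32 size to each quantized variant's size.
ratios = {
    name: round(sizes_gb["FP32"] / gb, 2)
    for name, gb in sizes_gb.items()
    if name != "FP32"
}

for name, ratio in ratios.items():
    print(f"{name:10s} {sizes_gb[name]:.1f} GB  {ratio:.2f}x smaller than FP32")
```

By these figures, INT8 and INT4 are roughly 2.2× and 2.7× smaller, and the INT8-FULL variant is about 4× smaller.
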
+## Citation
+```bibtex
+@misc{f2llm-v2,
+      title={F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World},
+      author={Ziyin Zhang and Zihan Liao and Hang Yu and Peng Di and Rui Wang},
+      year={2026},
+      eprint={2603.19223},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2603.19223},
+}
+```
+## License
+Apache 2.0 — same as [codefuse-ai/F2LLM-v2-0.6B](https://huggingface.co/codefuse-ai/F2LLM-v2-0.6B).