# F2LLM-v2-0.6B (FP16 ONNX)

FP16-converted ONNX export of codefuse-ai/F2LLM-v2-0.6B, a Qwen3-derived retrieval embedding model producing 1024-dimensional embeddings, with a 32k-token context window and last-token pooling.

The FP16 weights are ~1.2 GB (half the memory footprint of FP32), and retrieval quality is equivalent to FP32 under our parity gates.

## Quality

| Metric | Value | Threshold |
|---|---|---|
| cos_min vs. PyTorch FP32 reference (6-text multilingual probe) | 0.999999 | ≥ 0.99 |
| cos_mean vs. the same reference | 1.000000 | – |

Validated with the fastembed-rs cosine_parity harness on probe/ort-rc12 (ONNX Runtime 1.24).
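The gate itself is simple. As a minimal sketch (not the actual cosine_parity harness code): compute the cosine similarity between each FP16 embedding and its FP32 reference, take cos_min over the probe texts, and compare it against the 0.99 threshold.

```rust
// Cosine similarity between an FP16-model embedding and its FP32 reference.
// Minimal sketch of the parity metric; the real harness is cosine_parity
// in fastembed-rs.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Gate: every (fp16, fp32_reference) embedding pair in the probe set
/// must keep cos_min at or above the 0.99 threshold.
fn passes_gate(pairs: &[(Vec<f32>, Vec<f32>)]) -> bool {
    pairs
        .iter()
        .map(|(a, b)| cosine(a, b))
        .fold(f32::INFINITY, f32::min) // cos_min over the probe set
        >= 0.99
}
```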

## Files

| File | Size | Description |
|---|---|---|
| `model.fp16.onnx` | ~5 MB | ONNX graph header (weights stored as external data) |
| `model.fp16.onnx.data` | ~1.2 GB | FP16 weights (external data) |
| `tokenizer.json`, `config.json`, `tokenizer_config.json`, `special_tokens_map.json` | small | tokenizer and model configuration |

## Conversion

Streaming FP32 → FP16 conversion via convert_fp16_streaming.py, which avoids materializing the full model in a single protobuf and thereby bypasses the 2 GB protobuf serialization limit.
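The actual converter is a Python script; purely as an illustration of the streaming idea, the Rust sketch below converts a raw little-endian FP32 external-data file to FP16 in constant memory. The file names are hypothetical, the input is assumed to be a flat run of little-endian f32 values, and the ONNX graph header (which references the external-data file) would be rewritten separately.

```rust
// Sketch of the streaming idea behind convert_fp16_streaming.py (which is
// Python): convert external-data weights word by word so no >2 GB protobuf
// is ever built. File names are hypothetical; uses the `half` crate.
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};

use half::f16;

fn main() -> io::Result<()> {
    let mut src = BufReader::new(File::open("model.fp32.onnx.data")?);
    let mut dst = BufWriter::new(File::create("model.fp16.onnx.data")?);
    let mut word = [0u8; 4]; // one little-endian f32 at a time

    loop {
        match src.read_exact(&mut word) {
            Ok(()) => {
                // 4-byte f32 in, 2-byte f16 out.
                let v = f32::from_le_bytes(word);
                dst.write_all(&f16::from_f32(v).to_le_bytes())?;
            }
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }
    dst.flush()
}
```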

## Use via fastembed-rs

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// Initialize the FP16 ONNX model, then embed a batch of texts.
let embedder = TextEmbedding::try_new(
    InitOptions::new(EmbeddingModel::F2LlmV2_0_6BFp16))?;
let vectors = embedder.embed(vec!["hello world"], None)?;
```

Pooling is last-token and is applied automatically by fastembed-rs (a sketch of the pooling step follows). For queries, prepend the F2LLM instruct-format prefix (see the upstream F2LLM repo).
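For clarity, here is a hypothetical sketch of what last-token pooling does. The real inputs come from the ONNX model's hidden states and attention mask, and fastembed-rs performs this step internally; shapes and names below are illustrative only.

```rust
// Last-token pooling sketch: take the hidden state of the final real
// (non-padding) token as the sequence embedding. Hypothetical shapes;
// fastembed-rs applies this automatically for this model.
fn last_token_pool(
    hidden: &[Vec<f32>],    // [seq_len][hidden_dim] hidden states
    attention_mask: &[i64], // 1 for real tokens, 0 for padding
) -> Vec<f32> {
    let last = attention_mask
        .iter()
        .rposition(|&m| m == 1)
        .expect("sequence has at least one real token");
    hidden[last].clone() // 1024-dim embedding for this model
}
```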

## License

Apache 2.0, inherited from the base model.
