# F2LLM-v2-0.6B (FP16 ONNX)

FP16-converted ONNX export of codefuse-ai/F2LLM-v2-0.6B, a Qwen3-derived retrieval embedding model producing 1024-dimensional embeddings, with a 32k-token context window and last-token pooling.

The FP16 weights are ~1.2 GB (half the memory footprint of FP32), and retrieval quality is equivalent to FP32 under our parity gates.

## Quality

| Metric | Value | Threshold |
|---|---|---|
| cos_min vs. PyTorch FP32 reference (6-text multilingual probe) | 0.999999 | ≥ 0.99 |
| cos_mean vs. the same reference | 1.000000 | – |

Validated with the fastembed-rs cosine_parity harness on probe/ort-rc12 (ONNX Runtime 1.24).
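The gate itself is simple. As a minimal sketch (not the actual cosine_parity harness code): compute the cosine similarity between each FP16 embedding and its FP32 reference, take cos_min over the probe texts, and compare it against the 0.99 threshold.

```rust
// Cosine similarity between an FP16-model embedding and its FP32 reference.
// Minimal sketch of the parity metric; the real harness is cosine_parity
// in fastembed-rs.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|y| y * y).sum::<f32>().sqrt();
    dot / (na * nb)
}

/// Gate: every (fp16, fp32_reference) embedding pair in the probe set
/// must keep cos_min at or above the 0.99 threshold.
fn passes_gate(pairs: &[(Vec<f32>, Vec<f32>)]) -> bool {
    pairs
        .iter()
        .map(|(a, b)| cosine(a, b))
        .fold(f32::INFINITY, f32::min) // cos_min over the probe set
        >= 0.99
}
```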

## Files

| File | Size | Description |
|---|---|---|
| `model.fp16.onnx` | ~5 MB | ONNX graph header (weights stored as external data) |
| `model.fp16.onnx.data` | ~1.2 GB | FP16 weights (external data) |
| `tokenizer.json`, `config.json`, `tokenizer_config.json`, `special_tokens_map.json` | small | tokenizer and model configuration |

## Conversion

Streaming FP32 → FP16 conversion via convert_fp16_streaming.py, which avoids materializing the full model in a single protobuf and thereby bypasses the 2 GB protobuf serialization limit.
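The actual converter is a Python script; purely as an illustration of the streaming idea, the Rust sketch below converts a raw little-endian FP32 external-data file to FP16 in constant memory. The file names are hypothetical, the input is assumed to be a flat run of little-endian f32 values, and the ONNX graph header (which references the external-data file) would be rewritten separately.

```rust
// Sketch of the streaming idea behind convert_fp16_streaming.py (which is
// Python): convert external-data weights word by word so no >2 GB protobuf
// is ever built. File names are hypothetical; uses the `half` crate.
use std::fs::File;
use std::io::{self, BufReader, BufWriter, Read, Write};

use half::f16;

fn main() -> io::Result<()> {
    let mut src = BufReader::new(File::open("model.fp32.onnx.data")?);
    let mut dst = BufWriter::new(File::create("model.fp16.onnx.data")?);
    let mut word = [0u8; 4]; // one little-endian f32 at a time

    loop {
        match src.read_exact(&mut word) {
            Ok(()) => {
                // 4-byte f32 in, 2-byte f16 out.
                let v = f32::from_le_bytes(word);
                dst.write_all(&f16::from_f32(v).to_le_bytes())?;
            }
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
    }
    dst.flush()
}
```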

## Use via fastembed-rs

```rust
use fastembed::{EmbeddingModel, InitOptions, TextEmbedding};

// Initialize the FP16 ONNX model, then embed a batch of texts.
let embedder = TextEmbedding::try_new(
    InitOptions::new(EmbeddingModel::F2LlmV2_0_6BFp16))?;
let vectors = embedder.embed(vec!["hello world"], None)?;
```

Pooling is last-token and is applied automatically by fastembed-rs (a sketch of the pooling step follows). For queries, prepend the F2LLM instruct-format prefix (see the upstream F2LLM repo).
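For clarity, here is a hypothetical sketch of what last-token pooling does. The real inputs come from the ONNX model's hidden states and attention mask, and fastembed-rs performs this step internally; shapes and names below are illustrative only.

```rust
// Last-token pooling sketch: take the hidden state of the final real
// (non-padding) token as the sequence embedding. Hypothetical shapes;
// fastembed-rs applies this automatically for this model.
fn last_token_pool(
    hidden: &[Vec<f32>],    // [seq_len][hidden_dim] hidden states
    attention_mask: &[i64], // 1 for real tokens, 0 for padding
) -> Vec<f32> {
    let last = attention_mask
        .iter()
        .rposition(|&m| m == 1)
        .expect("sequence has at least one real token");
    hidden[last].clone() // 1024-dim embedding for this model
}
```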

## License

Apache 2.0, inherited from the base model.
