# Octen-Embedding-0.6B – INT8 ONNX (per-channel, dynamo export)

INT8-quantized ONNX export of Octen/Octen-Embedding-0.6B. This is the recommended variant: roughly half the size of FP32 while keeping 1.00 top-1 retrieval accuracy on the benchmark suite.

## Quantization details

| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.quantize_dynamic`, `per_channel=True` |
| Granularity | Per output channel (1 scale per row of each weight matrix) |
| Ops quantized | MatMul only – Gather (embedding table) intentionally left in FP32 |
| Format | Standard QLinearMatMul – no contrib ops, runs on all execution providers |
| Base export | Dynamo ONNX (see cstr/octen-embedding-0.6b-onnx) |

**Per-channel vs per-tensor:** `per_channel=True` gives one quantization scale per output channel instead of one per whole weight matrix – 1024× finer granularity for a 1024-dim projection, producing better embedding fidelity than per-tensor INT8.
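Why finer granularity helps can be shown with a toy NumPy sketch (synthetic weights, not the model's) – when rows of a weight matrix differ widely in magnitude, a single per-tensor scale wastes most of the INT8 range on the small rows:

```python
import numpy as np

rng = np.random.default_rng(0)
# Weight matrix whose rows (output channels) span very different magnitudes.
W = rng.standard_normal((4, 8)) * np.array([[0.01], [0.1], [1.0], [10.0]])

def fake_quantize(w, scale):
    # Quantize to INT8 and dequantize back, so we can measure the error.
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale

# Per-tensor: one scale for the whole matrix.
s_tensor = np.abs(W).max() / 127.0
err_tensor = np.abs(W - fake_quantize(W, s_tensor)).mean()

# Per-channel: one scale per output row, as per_channel=True does.
s_channel = np.abs(W).max(axis=1, keepdims=True) / 127.0
err_channel = np.abs(W - fake_quantize(W, s_channel)).mean()

print(err_channel < err_tensor)  # True: row-wise scales track small rows better
```

The small-magnitude rows dominate the per-tensor error, which is exactly the situation per-channel scales fix.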

**Dynamic batch:** all batch sizes (1, 2, 4, 8, …) verified correct. The base dynamo export removes the legacy batch=1 static shape in Qwen3's causal mask.
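A quantization call matching the settings in the table above would look roughly like this (a sketch, not the repo's actual script – the input/output paths are placeholders, and any flag beyond `per_channel=True` and MatMul-only is an assumption):

```python
# Sketch of the dynamic INT8 quantization step, assuming the settings above.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",         # placeholder path to the FP32 dynamo export
    model_output="model.int8.onnx",
    per_channel=True,                 # one scale per output channel
    op_types_to_quantize=["MatMul"],  # leaves Gather (embedding table) in FP32
    weight_type=QuantType.QInt8,
)
```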

## Benchmark (Apple M-series, CPU)

| Metric | Value |
|---|---|
| Ingest throughput | ~6.1 ch/s |
| Top-1 hybrid accuracy | 1.00 |
| RSS memory | ~1.35 GB |
| File size | ~1.06 GB |

## Quality metrics vs FP32

Measured on 8 diverse EN/DE sentences (3 semantic triplets):

| Metric | Value |
|---|---|
| Cosine similarity to FP32 (mean) | 0.830 |
| Cosine similarity to FP32 (min) | 0.674 |
| Semantic ordering (3/3 triplets) | ✅ |
| Triplet margin (mean) | 0.240 |
| Anisotropy (avg pairwise cos) | 0.236 |
| Unit-norm compliance | ✅ |
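Triplet margin here presumably means the gap between the anchor–positive and anchor–negative cosine similarities (an assumption about the metric, not stated in the source); a toy sketch with placeholder vectors:

```python
import numpy as np

def cos(a, b):
    # Cosine similarity between two vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy anchor/positive/negative vectors; real ones come from the model.
anchor = np.array([1.0, 0.0])
pos = np.array([0.9, 0.1])
neg = np.array([0.1, 0.9])

margin = cos(anchor, pos) - cos(anchor, neg)
print(margin > 0)  # True: positive ranks above negative, i.e. correct ordering
```

A positive margin on all three triplets is what the "Semantic ordering 3/3" row reports.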

## Model details

| Property | Value |
|---|---|
| Embedding dim | 1024 |
| Max context | 32,768 tokens |
| Inputs | `input_ids` [batch, seq], `attention_mask` [batch, seq] |
| Output | `last_hidden_state` [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation |

## Inference

```python
import numpy as np
import onnxruntime as ort
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0)
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

# Run the model, then pool the hidden state of each sequence's last real token.
lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]
lens = mask.sum(axis=1) - 1                    # index of the last non-pad token
embs = lhs[np.arange(len(texts)), lens]       # last-token pooling
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)  # L2 normalisation
print(embs.shape)  # (2, 1024)
```
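Because the embeddings are L2-normalised, cosine similarity reduces to a plain dot product. A minimal follow-on sketch with placeholder unit vectors standing in for the model output above:

```python
import numpy as np

# Placeholder unit-norm embeddings; in practice use `embs` from the code above.
embs = np.array([[0.6, 0.8], [0.8, 0.6], [0.0, -1.0]])

sims = embs @ embs.T             # cosine similarity matrix for unit vectors
best = sims[0, 1:].argmax() + 1  # nearest neighbour of the first embedding
print(best)  # 1
```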

## Files

| File | Size | Description |
|---|---|---|
| model.int8.onnx | ~5 MB | ONNX graph |
| model.int8.onnx.data | ~1.06 GB | Quantized weights |
| tokenizer.json | 11 MB | HuggingFace fast tokenizer |

## Variants

| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/octen-embedding-0.6b-onnx | FP32 | 2.4 GB | Reference |
| cstr/octen-embedding-0.6b-onnx-int8 | INT8 | 1.1 GB | This repo – recommended |
| cstr/octen-embedding-0.6b-onnx-int4 | INT4 | 0.9 GB | Minimum RAM |

## License

Apache 2.0.
