# Octen-Embedding-0.6B: INT8 ONNX (per-channel, dynamo export)
INT8-quantized ONNX export of Octen/Octen-Embedding-0.6B. Recommended variant: half the size of FP32, with 1.00 top-1 retrieval accuracy on the benchmark suite.
## Quantization details
| Property | Value |
|---|---|
| Method | `onnxruntime.quantization.quantize_dynamic`, `per_channel=True` |
| Granularity | Per output channel (one scale per row of each weight matrix) |
| Ops quantized | MatMul only; Gather (embedding table) intentionally left in FP32 |
| Format | Standard QLinearMatMul: no contrib ops, runs on all execution providers |
| Base export | Dynamo ONNX (see cstr/octen-embedding-0.6b-onnx) |
Per-channel vs per-tensor: `per_channel=True` gives one scale per output channel instead of one per whole matrix, i.e. 1024× finer granularity for a 1024-dim projection, producing better embedding fidelity than per-tensor INT8.
Dynamic batch: all batch sizes (1, 2, 4, 8, β¦) verified correct. The base dynamo export removes the legacy batch=1 static shape in Qwen3's causal mask.
## Benchmark (Apple M-series, CPU)
| Metric | Value |
|---|---|
| Ingest throughput | ~6.1 ch/s |
| Top-1 hybrid accuracy | 1.00 |
| RSS memory | ~1.35 GB |
| File size | ~1.06 GB |
## Quality metrics vs FP32
Measured on 8 diverse EN/DE sentences (3 semantic triplets):
| Metric | Value |
|---|---|
| Cosine similarity to FP32 (mean) | 0.830 |
| Cosine similarity to FP32 (min) | 0.674 |
| Semantic ordering (3/3 triplets) | ✓ |
| Triplet margin (mean) | 0.240 |
| Anisotropy (avg pairwise cos) | 0.236 |
| Unit-norm compliance | ✓ |
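For context on the triplet metrics: for each (anchor, positive, negative) sentence triplet the margin is cos(a, p) − cos(a, n), and semantic ordering holds when that margin is positive. A self-contained sketch with synthetic vectors standing in for the model's embeddings:

```python
import numpy as np

def cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(42)
a, p, n = rng.standard_normal((3, 1024))
p = a + 0.3 * p                      # make the positive resemble the anchor

margin = cos(a, p) - cos(a, n)       # triplet margin
ordered = margin > 0                 # semantic ordering holds if positive
```

The table's "triplet margin (mean)" of 0.240 is this quantity averaged over the three EN/DE triplets, computed on the real INT8 embeddings.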
## Model details
| Property | Value |
|---|---|
| Embedding dim | 1024 |
| Max context | 32 768 tokens |
| Inputs | input_ids [batch, seq], attention_mask [batch, seq] |
| Output | last_hidden_state [batch, seq, 1024] |
| Pooling | Last-token pooling + L2 normalisation |
## Inference

```python
import onnxruntime as ort
import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
tokenizer.enable_padding(pad_id=0)           # right padding keeps last-token pooling valid
tokenizer.enable_truncation(max_length=512)

session = ort.InferenceSession("model.int8.onnx", providers=["CPUExecutionProvider"])

texts = ["semantic search example", "another sentence"]
enc = tokenizer.encode_batch(texts)
ids = np.array([e.ids for e in enc], dtype=np.int64)
mask = np.array([e.attention_mask for e in enc], dtype=np.int64)

lhs = session.run(None, {"input_ids": ids, "attention_mask": mask})[0]

# Last-token pooling: take the hidden state of the final non-pad token,
# then L2-normalise.
lens = mask.sum(axis=1) - 1
embs = lhs[np.arange(len(texts)), lens]
embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
print(embs.shape)  # (2, 1024)
```
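Because the embeddings come out unit-normalised, cosine similarity reduces to a plain dot product. A self-contained sketch with random stand-in vectors (in practice `embs` would come from the session above):

```python
import numpy as np

rng = np.random.default_rng(0)
embs = rng.standard_normal((2, 1024)).astype(np.float32)
embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalise, as above

sims = embs @ embs.T   # pairwise cosine similarity matrix, shape (2, 2)
# The diagonal is 1.0: each vector has cosine similarity 1 with itself.
```

For retrieval, rank candidate embeddings by their dot product against the query embedding; no separate normalisation step is needed at query time.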
## Files

| File | Size | Description |
|---|---|---|
| model.int8.onnx | ~5 MB | ONNX graph |
| model.int8.onnx.data | ~1.06 GB | Quantized weights |
| tokenizer.json | 11 MB | Hugging Face fast tokenizer |
## Variants
| Repo | Precision | Size | Notes |
|---|---|---|---|
| cstr/octen-embedding-0.6b-onnx | FP32 | 2.4 GB | Reference |
| cstr/octen-embedding-0.6b-onnx-int8 | INT8 | 1.1 GB | This repo (recommended) |
| cstr/octen-embedding-0.6b-onnx-int4 | INT4 | 0.9 GB | Minimum RAM |
## License
Apache 2.0.
## Model tree for cstr/Octen-Embedding-0.6B-ONNX-INT8

Qwen/Qwen3-0.6B-Base → Qwen/Qwen3-Embedding-0.6B (finetune) → Octen/Octen-Embedding-0.6B (finetune) → this repo.