# fin-sparse-encoder-doc-v1-onnx

ONNX + INT8 quantized version of oneryalcin/fin-sparse-encoder-doc-v1 for CPU-efficient document encoding.

This is the document encoder path only: it produces sparse SPLADE vectors for indexing financial documents (SEC filings, earnings call transcripts). Query encoding uses a separate IDF lookup table (sub-ms, no neural model needed).
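The query path described above needs no neural model: each query token is mapped to a precomputed IDF weight. A minimal sketch of that lookup, where the table contents and function name are illustrative assumptions, not artifacts shipped in this repo:

```python
# Hypothetical IDF table: token -> precomputed weight (NOT part of this repo)
IDF_TABLE = {"revenue": 2.1, "earnings": 1.8, "guidance": 2.4, "the": 0.0}

def encode_query(tokens, idf_table):
    """Map query tokens to sparse weights via IDF lookup; drop zero-weight tokens."""
    weights = {}
    for tok in tokens:
        w = idf_table.get(tok, 0.0)
        if w > 0.0:
            # Repeated tokens keep a single weight (max pooling over duplicates)
            weights[tok] = max(weights.get(tok, 0.0), w)
    return weights

print(encode_query(["revenue", "guidance", "the"], IDF_TABLE))
```

Because this is a dictionary lookup per token, query encoding stays sub-millisecond regardless of hardware.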
## Model Variants

| File | Format | Size | Use Case |
|---|---|---|---|
| `model.onnx` | FP32 | 647.9 MB | Maximum accuracy, GPU or high-memory CPU |
| `model_quantized.onnx` | INT8 | 166.7 MB | Recommended for CPU deployment |
## Performance

### Domain Evaluation (Financial Documents)

The parent model (fin-sparse-encoder-doc-v1) was evaluated on 2,028 held-out financial test examples:
| Metric | Base Model | Fine-tuned | Delta |
|---|---|---|---|
| acc@1 | 39.9% | 55.2% | +15.2% |
| acc@3 | 69.2% | 84.0% | +14.8% |
| ndcg@10 | 0.681 | 0.781 | +10.0% |
| median_rank | 2.0 | 1.0 | -1.0 |
### Inference Latency (seq_len=512, 1 thread)

Benchmarked on an Apple M-series CPU. Server CPUs with AVX512-VNNI will see larger INT8 speedups (~2-3x).
| Backend | p50 (ms) | p95 (ms) | Model Size |
|---|---|---|---|
| PyTorch FP32 | 186.3 | 192.8 | ~620 MB |
| ONNX FP32 | 211.7 | 218.9 | 647.9 MB |
| ONNX INT8 | 164.4 | 166.9 | 166.7 MB |
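The numbers above can be reproduced with a simple timing loop. A hedged sketch, where the run count and random-input strategy are our assumptions (guarded so it is a no-op when the quantized model file is absent):

```python
import os
import time

def pct(sorted_vals, q):
    """Nearest-rank percentile of an ascending list (helper for p50/p95)."""
    return sorted_vals[min(len(sorted_vals) - 1, round(q * (len(sorted_vals) - 1)))]

if os.path.exists("model_quantized.onnx"):
    import numpy as np
    import onnxruntime as ort

    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1  # match the 1-thread setting in the table
    sess = ort.InferenceSession("model_quantized.onnx", opts,
                                providers=["CPUExecutionProvider"])

    # Random token ids at the full seq_len=512 used for the benchmark
    ids = np.random.randint(1000, 2000, size=(1, 512), dtype=np.int64)
    mask = np.ones((1, 512), dtype=np.int64)
    feed = {"input_ids": ids, "attention_mask": mask}

    times = []
    for _ in range(50):
        t0 = time.perf_counter()
        sess.run(None, feed)
        times.append((time.perf_counter() - t0) * 1000.0)  # ms
    times.sort()
    print(f"p50={pct(times, 0.50):.1f} ms  p95={pct(times, 0.95):.1f} ms")
```

Absolute numbers will differ by machine; the INT8 vs FP32 ratio is the more portable comparison.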
## Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load
tokenizer = AutoTokenizer.from_pretrained("oneryalcin/fin-sparse-encoder-doc-v1-onnx")
sess = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

# Encode a document
text = "Revenue increased 12% year over year to $4.2 billion in Q4 2023."
inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=512, truncation=True)
logits = sess.run(None, {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]})[0]

# SpladePooling: log1p_relu activation (matches OpenSearch v3 models)
masked = logits * inputs["attention_mask"][..., None]   # zero out padded positions
pooled = masked.max(axis=1)                             # max over sequence
sparse_vector = np.log1p(np.log1p(np.maximum(pooled, 0.0)))  # [1, 30522]

# Convert to token -> weight dict (for an inverted index)
nonzero = np.nonzero(sparse_vector[0])[0]
token_weights = {tokenizer.decode([tid]): float(sparse_vector[0, tid]) for tid in nonzero}
print(f"Active dimensions: {len(token_weights)}")
print(f"Top tokens: {sorted(token_weights.items(), key=lambda x: -x[1])[:10]}")
```
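At retrieval time, relevance against these token weights reduces to a sparse dot product between the query's weight dict and the document's. A minimal sketch with made-up weights (the function name is ours, not part of this repo):

```python
def sparse_dot(query_weights, doc_weights):
    """Dot product of two sparse token->weight dicts."""
    # Iterate over the smaller dict for efficiency
    if len(query_weights) > len(doc_weights):
        query_weights, doc_weights = doc_weights, query_weights
    return sum(w * doc_weights.get(tok, 0.0) for tok, w in query_weights.items())

score = sparse_dot(
    {"revenue": 1.2, "quarter": 0.8},                 # query-side IDF weights
    {"revenue": 0.9, "growth": 0.4, "quarter": 0.3},  # document-side SPLADE weights
)
print(round(score, 2))
```

In production this product is computed by the inverted index (e.g. a `rank_features`-style field) rather than in Python, but the scoring semantics are the same.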
## Architecture

```
Input text
  ↓ Tokenizer (max_length=512)
  ↓ ONNX model (MLM logits) [batch, seq, 30522]
  ↓ SpladePooling: log(1 + log(1 + ReLU(max_over_seq(logits * mask))))
  ↓ Sparse vector [batch, 30522]
```
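The SpladePooling stage can be illustrated on toy logits in NumPy, with shapes reduced from [1, 512, 30522] to [1, 3, 5] for readability:

```python
import numpy as np

# Toy MLM logits: batch=1, seq_len=3, vocab=5
logits = np.array([[[ 2.0, -1.0,  0.5,  0.0, -3.0],
                    [ 0.1,  4.0, -0.5,  0.0,  1.0],
                    [ 9.9,  9.9,  9.9,  9.9,  9.9]]])  # last position is padding
mask = np.array([[1, 1, 0]])  # attention mask zeroes out the padded position

masked = logits * mask[..., None]                     # [1, 3, 5]
pooled = masked.max(axis=1)                           # max over sequence -> [1, 5]
sparse = np.log1p(np.log1p(np.maximum(pooled, 0.0)))  # log(1 + log(1 + ReLU(x)))

print(sparse.round(3))
```

Note that the large logits in the padded position never reach the output: the mask zeroes them before the max, and the double log1p then compresses the surviving positive logits into small sparse weights.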
Base model: opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte (Alibaba-NLP/new-impl architecture).
Fine-tuned on financial-filings-sparse-retrieval-training (18K examples, 2 epochs).
## Export Details

- Exported via `torch.onnx.export` (legacy tracer, opset 17)
- INT8: dynamic quantization via `onnxruntime.quantization.quantize_dynamic` (per-channel, QInt8)
- Numerical verification: FP32 ONNX max diff vs PyTorch = 0.000057
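The INT8 step above reduces to a single `quantize_dynamic` call. A hedged sketch assuming `model.onnx` sits in the working directory (guarded so it is a no-op otherwise):

```python
import os

if os.path.exists("model.onnx"):
    from onnxruntime.quantization import QuantType, quantize_dynamic

    # Dynamic (weight-only) INT8 quantization: per-channel, signed int8,
    # matching the export settings listed above.
    quantize_dynamic(
        model_input="model.onnx",
        model_output="model_quantized.onnx",
        per_channel=True,
        weight_type=QuantType.QInt8,
    )
```

Dynamic quantization needs no calibration dataset: weights are quantized offline and activations are quantized on the fly at inference time, which is why only the model file is required.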