# fin-sparse-encoder-doc-v1-onnx

ONNX + INT8 quantized version of oneryalcin/fin-sparse-encoder-doc-v1 for CPU-efficient document encoding.

This is the document encoder path only: it produces sparse SPLADE vectors for indexing financial documents (SEC filings, earnings call transcripts). Query encoding uses a separate IDF lookup table (sub-millisecond, no neural model needed).
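To make the split concrete, the query-side path reduces to a dictionary lookup per token. A minimal sketch, where `idf_table` is a stand-in for the real lookup table shipped with the parent model and the values are purely illustrative:

```python
# Query-side sketch: no neural model, just per-token IDF weights.
# `idf_table` and its values are illustrative placeholders.
idf_table = {"revenue": 4.1, "growth": 3.2, "q4": 5.0, "the": 0.1}

def encode_query(tokens):
    """Map query tokens to sparse weights via IDF lookup (sub-ms)."""
    return {t: idf_table[t] for t in tokens if t in idf_table}

weights = encode_query(["revenue", "growth", "unseen"])
```

Tokens absent from the table simply get no weight, which is why query encoding needs no fallback model.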

## Model Variants

| File | Format | Size | Use Case |
|---|---|---|---|
| `model.onnx` | FP32 | 647.9 MB | Maximum accuracy; GPU or high-memory CPU |
| `model_quantized.onnx` | INT8 | 166.7 MB | **Recommended** for CPU deployment |

## Performance

### Domain Evaluation (Financial Documents)

The parent model (fin-sparse-encoder-doc-v1) was evaluated on 2,028 held-out financial test examples:

| Metric | Base Model | Fine-tuned | Delta |
|---|---|---|---|
| acc@1 | 39.9% | 55.2% | +15.2% |
| acc@3 | 69.2% | 84.0% | +14.8% |
| ndcg@10 | 0.681 | 0.781 | +10.0% |
| median_rank | 2.0 | 1.0 | -1.0 |

### Inference Latency (seq_len=512, 1 thread)

Benchmarked on Apple M-series CPU. Server CPUs with AVX512-VNNI will see larger INT8 speedups (~2-3x).

| Backend | p50 (ms) | p95 (ms) | Model Size |
|---|---|---|---|
| PyTorch FP32 | 186.3 | 192.8 | ~620 MB |
| ONNX FP32 | 211.7 | 218.9 | 647.9 MB |
| ONNX INT8 | 164.4 | 166.9 | 166.7 MB |
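A p50/p95 table like the one above can be reproduced with a simple warmup-then-measure loop. A hedged sketch (the `bench` helper is not part of this repo; the dummy workload stands in for `sess.run`):

```python
import time
import numpy as np

def bench(fn, warmup=3, iters=20):
    """Return (p50, p95) latency in ms for `fn`, after a short warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)  # seconds -> ms
    return float(np.percentile(times, 50)), float(np.percentile(times, 95))

# Stand-in workload; replace with e.g. `lambda: sess.run(None, feeds)`.
p50, p95 = bench(lambda: np.ones((64, 64)) @ np.ones((64, 64)))
```

Pin the session to one intra-op thread before benchmarking if you want numbers comparable to the 1-thread table above.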

## Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load (fetch model_quantized.onnx from the repo first,
# e.g. via huggingface_hub.hf_hub_download)
tokenizer = AutoTokenizer.from_pretrained("oneryalcin/fin-sparse-encoder-doc-v1-onnx")
sess = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

# Encode a document
text = "Revenue increased 12% year over year to $4.2 billion in Q4 2023."
inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=512, truncation=True)
logits = sess.run(None, {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]})[0]

# SpladePooling: log(1 + log(1 + ReLU(...))) activation (matches OpenSearch v3 models)
masked = logits * inputs["attention_mask"][..., None]
pooled = masked.max(axis=1)
sparse_vector = np.log1p(np.log1p(np.maximum(pooled, 0.0)))  # [1, 30522]

# Convert to token->weight dict (for inverted index)
nonzero = np.nonzero(sparse_vector[0])[0]
token_weights = {tokenizer.decode([tid]): float(sparse_vector[0, tid]) for tid in nonzero}
print(f"Active dimensions: {len(token_weights)}")
print(f"Top tokens: {sorted(token_weights.items(), key=lambda x: -x[1])[:10]}")
```
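At retrieval time, a query's IDF weights are scored against a document's token weights with a sparse dot product over their shared tokens. A minimal sketch with illustrative weights (the `score` helper and both dicts are examples, not part of this repo):

```python
# Ranking sketch: score = sparse dot product of query and doc weights.
# Token names and weights below are illustrative placeholders.
def score(query_weights, doc_weights):
    """Sum of weight products over tokens present in both vectors."""
    return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())

doc = {"revenue": 1.8, "increased": 0.9, "billion": 1.2}
query = {"revenue": 4.1, "growth": 3.2}
s = score(query, doc)  # only "revenue" overlaps: 4.1 * 1.8
```

In production the same computation is what an inverted index (e.g. OpenSearch) performs when the token-weight dict is ingested as a rank-features field.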

## Architecture

```
Input text
  → Tokenizer (max_length=512)
  → ONNX model (MLM logits) [batch, seq, 30522]
  → SpladePooling: log(1 + log(1 + ReLU(max_over_seq(logits * mask))))
  → Sparse vector [batch, 30522]
```

Base model: opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte (Alibaba-NLP/new-impl architecture).

Fine-tuned on financial-filings-sparse-retrieval-training (18K examples, 2 epochs).

## Export Details

- Exported via `torch.onnx.export` (legacy tracer, opset 17)
- INT8: dynamic quantization via `onnxruntime.quantization.quantize_dynamic` (per-channel, QInt8)
- Numerical verification: FP32 ONNX max diff vs PyTorch = 0.000057
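The verification figure above is the maximum absolute elementwise difference between PyTorch and ONNX FP32 logits over the same inputs. A sketch of that check (the two arrays here are illustrative, not actual model outputs):

```python
import numpy as np

def max_abs_diff(a, b):
    """Max absolute elementwise difference between two logit arrays."""
    return float(np.max(np.abs(a - b)))

# Illustrative stand-ins for PyTorch and ONNX FP32 logits.
torch_logits = np.array([1.0, 2.0, 3.0])
onnx_logits = np.array([1.0, 2.00005, 3.0])
d = max_abs_diff(torch_logits, onnx_logits)
```

A diff on the order of 1e-5, as reported above, is within normal FP32 operator-reordering noise; larger gaps usually indicate a tracing or preprocessing mismatch.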