# fin-sparse-encoder-doc-v1-onnx

ONNX + INT8 quantized version of oneryalcin/fin-sparse-encoder-doc-v1 for CPU-efficient document encoding.

This is the document encoder path only: it produces sparse SPLADE vectors for indexing financial documents (SEC filings, earnings call transcripts). Query encoding uses a separate IDF lookup table (sub-millisecond, no neural model needed).
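To make the split concrete, the query-side path reduces to a dictionary lookup per token. A minimal sketch, where `idf_table` is a stand-in for the real lookup table shipped with the parent model and the values are purely illustrative:

```python
# Query-side sketch: no neural model, just per-token IDF weights.
# `idf_table` and its values are illustrative placeholders.
idf_table = {"revenue": 4.1, "growth": 3.2, "q4": 5.0, "the": 0.1}

def encode_query(tokens):
    """Map query tokens to sparse weights via IDF lookup (sub-ms)."""
    return {t: idf_table[t] for t in tokens if t in idf_table}

weights = encode_query(["revenue", "growth", "unseen"])
```

Tokens absent from the table simply get no weight, which is why query encoding needs no fallback model.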

## Model Variants

| File | Format | Size | Use Case |
|---|---|---|---|
| `model.onnx` | FP32 | 647.9 MB | Maximum accuracy; GPU or high-memory CPU |
| `model_quantized.onnx` | INT8 | 166.7 MB | **Recommended** for CPU deployment |

## Performance

### Domain Evaluation (Financial Documents)

The parent model (fin-sparse-encoder-doc-v1) was evaluated on 2,028 held-out financial test examples:

| Metric | Base Model | Fine-tuned | Delta |
|---|---|---|---|
| acc@1 | 39.9% | 55.2% | +15.2% |
| acc@3 | 69.2% | 84.0% | +14.8% |
| ndcg@10 | 0.681 | 0.781 | +10.0% |
| median_rank | 2.0 | 1.0 | -1.0 |

### Inference Latency (seq_len=512, 1 thread)

Benchmarked on Apple M-series CPU. Server CPUs with AVX512-VNNI will see larger INT8 speedups (~2-3x).

| Backend | p50 (ms) | p95 (ms) | Model Size |
|---|---|---|---|
| PyTorch FP32 | 186.3 | 192.8 | ~620 MB |
| ONNX FP32 | 211.7 | 218.9 | 647.9 MB |
| ONNX INT8 | 164.4 | 166.9 | 166.7 MB |
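A p50/p95 table like the one above can be reproduced with a simple warmup-then-measure loop. A hedged sketch (the `bench` helper is not part of this repo; the dummy workload stands in for `sess.run`):

```python
import time
import numpy as np

def bench(fn, warmup=3, iters=20):
    """Return (p50, p95) latency in ms for `fn`, after a short warmup."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)  # seconds -> ms
    return float(np.percentile(times, 50)), float(np.percentile(times, 95))

# Stand-in workload; replace with e.g. `lambda: sess.run(None, feeds)`.
p50, p95 = bench(lambda: np.ones((64, 64)) @ np.ones((64, 64)))
```

Pin the session to one intra-op thread before benchmarking if you want numbers comparable to the 1-thread table above.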

## Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load (fetch model_quantized.onnx from the repo first,
# e.g. via huggingface_hub.hf_hub_download)
tokenizer = AutoTokenizer.from_pretrained("oneryalcin/fin-sparse-encoder-doc-v1-onnx")
sess = ort.InferenceSession("model_quantized.onnx", providers=["CPUExecutionProvider"])

# Encode a document
text = "Revenue increased 12% year over year to $4.2 billion in Q4 2023."
inputs = tokenizer(text, return_tensors="np", padding="max_length", max_length=512, truncation=True)
logits = sess.run(None, {"input_ids": inputs["input_ids"], "attention_mask": inputs["attention_mask"]})[0]

# SpladePooling: log(1 + log(1 + ReLU(...))) activation (matches OpenSearch v3 models)
masked = logits * inputs["attention_mask"][..., None]
pooled = masked.max(axis=1)
sparse_vector = np.log1p(np.log1p(np.maximum(pooled, 0.0)))  # [1, 30522]

# Convert to token->weight dict (for inverted index)
nonzero = np.nonzero(sparse_vector[0])[0]
token_weights = {tokenizer.decode([tid]): float(sparse_vector[0, tid]) for tid in nonzero}
print(f"Active dimensions: {len(token_weights)}")
print(f"Top tokens: {sorted(token_weights.items(), key=lambda x: -x[1])[:10]}")
```
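At retrieval time, a query's IDF weights are scored against a document's token weights with a sparse dot product over their shared tokens. A minimal sketch with illustrative weights (the `score` helper and both dicts are examples, not part of this repo):

```python
# Ranking sketch: score = sparse dot product of query and doc weights.
# Token names and weights below are illustrative placeholders.
def score(query_weights, doc_weights):
    """Sum of weight products over tokens present in both vectors."""
    return sum(w * doc_weights.get(t, 0.0) for t, w in query_weights.items())

doc = {"revenue": 1.8, "increased": 0.9, "billion": 1.2}
query = {"revenue": 4.1, "growth": 3.2}
s = score(query, doc)  # only "revenue" overlaps: 4.1 * 1.8
```

In production the same computation is what an inverted index (e.g. OpenSearch) performs when the token-weight dict is ingested as a rank-features field.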

## Architecture

```
Input text
  → Tokenizer (max_length=512)
  → ONNX model (MLM logits) [batch, seq, 30522]
  → SpladePooling: log(1 + log(1 + ReLU(max_over_seq(logits * mask))))
  → Sparse vector [batch, 30522]
```

Base model: opensearch-project/opensearch-neural-sparse-encoding-doc-v3-gte (Alibaba-NLP/new-impl architecture).

Fine-tuned on financial-filings-sparse-retrieval-training (18K examples, 2 epochs).

## Export Details

- Exported via `torch.onnx.export` (legacy tracer, opset 17)
- INT8: dynamic quantization via `onnxruntime.quantization.quantize_dynamic` (per-channel, QInt8)
- Numerical verification: FP32 ONNX max diff vs PyTorch = 0.000057
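The verification figure above is the maximum absolute elementwise difference between PyTorch and ONNX FP32 logits over the same inputs. A sketch of that check (the two arrays here are illustrative, not actual model outputs):

```python
import numpy as np

def max_abs_diff(a, b):
    """Max absolute elementwise difference between two logit arrays."""
    return float(np.max(np.abs(a - b)))

# Illustrative stand-ins for PyTorch and ONNX FP32 logits.
torch_logits = np.array([1.0, 2.0, 3.0])
onnx_logits = np.array([1.0, 2.00005, 3.0])
d = max_abs_diff(torch_logits, onnx_logits)
```

A diff on the order of 1e-5, as reported above, is within normal FP32 operator-reordering noise; larger gaps usually indicate a tracing or preprocessing mismatch.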