lettuce-emb-512d-v3

ONNX package for lettuce-emb-512d-v3.

Included Files

  • model.fp32.onnx (full precision)
  • model.int8.onnx (dynamic quantized INT8)
  • model.onnx (FP32 convenience copy)
  • Tokenizer files: tokenizer.json, tokenizer_config.json, special_tokens_map.json, vocab.txt
  • Sentence-Transformers metadata/config: modules.json, config_sentence_transformers.json, sentence_bert_config.json
  • Pooling/Dense configs: 1_Pooling/config.json, 2_Dense/config.json
  • Nomic architecture files: configuration_hf_nomic_bert.py, modeling_hf_nomic_bert.py

Model Specs

  • Backbone family: Nomic BERT (nomic-ai/nomic-embed-text-v1.5)
  • Embedding dimension: 512
  • Similarity: cosine similarity on normalized embeddings
  • Context length targeted by the training pipeline: 4096 tokens

Training Config

Final checkpoint lineage:

  • Source training model: ./output/lettuce-v3-rp-long3
  • The resume chain for this checkpoint included continued training from previous long-context runs.

Core training configuration:

  • student-base: nomic-ai/nomic-embed-text-v1.5
  • teachers: BAAI/bge-m3
  • dim: 512
  • context: 4096
  • teacher-context: 1024
  • pair-batch: 4 (stable setting for the successful resumed runs)
  • triplet-batch: 4 (stable setting for the successful resumed runs)
  • teacher-batch: 1
  • num-workers: 4
  • epochs: 1 per run segment (continued via resume)

Data composition used in long-context training runs:

  • NLI triplets: 30000
  • RP pairs source: Heralax/Augmental-Dataset (~7831 rows available in the training environment)
  • Logic source: hard_logic.json with oversampling (typical oversampling factor: 12 or higher)
  • LongBench subsets enabled:
    • qasper
    • qmsum
    • narrativeqa
    • passage_retrieval_en
  • LongBench split used: test (loader falls back to direct JSONL files from THUDM/LongBench data.zip)
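
The fallback logic is roughly the following; a minimal sketch, assuming the Hugging Face datasets library and a local ./LongBench/data directory holding the extracted data.zip contents (the path is an assumption):

import json
from pathlib import Path

from datasets import load_dataset

def load_longbench(subset: str, data_dir: str = "./LongBench/data"):
    # Prefer the hub loader; fall back to the raw JSONL files from data.zip.
    try:
        return list(load_dataset("THUDM/LongBench", subset, split="test"))
    except Exception:
        path = Path(data_dir) / f"{subset}.jsonl"
        with path.open(encoding="utf-8") as f:
            return [json.loads(line) for line in f]

for subset in ["qasper", "qmsum", "narrativeqa", "passage_retrieval_en"]:
    rows = load_longbench(subset)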

Primary losses:

  • MultipleNegativesRankingLoss (triplets)
  • CosineSimilarityLoss (teacher-scored pairs)
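
A minimal sketch of how these two losses are typically combined with the classic sentence-transformers fit API, using the pair/triplet batch sizes from the config above. The toy examples are placeholders, and the 512-d Dense head plus the teacher-scoring plumbing of the real pipeline are omitted:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

# Stand-ins for the real NLI triplets and teacher-scored RP pairs.
triplets = [InputExample(texts=["anchor text", "paraphrase", "unrelated text"])] * 8
pairs = [InputExample(texts=["query", "passage"], label=0.87)]  * 8  # label = teacher cosine score

triplet_loader = DataLoader(triplets, batch_size=4, shuffle=True)  # triplet-batch: 4
pair_loader = DataLoader(pairs, batch_size=4, shuffle=True)        # pair-batch: 4

model.fit(
    train_objectives=[
        (triplet_loader, losses.MultipleNegativesRankingLoss(model)),  # triplets
        (pair_loader, losses.CosineSimilarityLoss(model)),             # scored pairs
    ],
    epochs=1,
)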

Inference Example (ONNX Runtime)

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "./lettuce-emb-512d-v3"
onnx_path = f"{model_dir}/model.int8.onnx"  # or model.fp32.onnx

tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

texts = [
    "I forgot to mention one important detail.",
    "There is one important detail I forgot to mention."
]

inputs = tokenizer(texts, return_tensors="np", padding=True, truncation=True)
feeds = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}
if "token_type_ids" in inputs:
    names = [x.name for x in session.get_inputs()]
    if "token_type_ids" in names:
        feeds["token_type_ids"] = inputs["token_type_ids"]

emb = session.run(None, feeds)[0]  # [batch, 512]
emb = emb / np.clip(np.linalg.norm(emb, axis=1, keepdims=True), 1e-12, None)
print(float(np.dot(emb[0], emb[1])))
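
For retrieval over many texts, the same normalized embeddings give every pairwise cosine similarity in one matrix product:

scores = emb @ emb.T  # [batch, batch] cosine-similarity matrix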

Benchmark Snapshot

Benchmarks were run with:

  • Full eval script: eval_v3_full.py
  • Extreme eval script: eval_v3_extreme.py
  • Model: ./output/lettuce-v3-rp-long3

Benchmark Configs

Full benchmark command:

PYTORCH_ALLOC_CONF=expandable_segments:True venv/bin/python eval_v3_full.py \
  --model ./output/lettuce-emb-512d-v3 \
  --trust-remote-code \
  --batch-size 32 \
  --long-batch-size 2 \
  --long-subset 150 \
  --logic-limit 460 \
  --rp-limit 1000 \
  --retrieval-corpus 1000

Extreme benchmark command:

venv/bin/python eval_v3_extreme.py \
  --model ./output/lettuce-emb-512d-v3 \
  --trust-remote-code \
  --batch-size 8 \
  --needle-cases 24 \
  --needle-targets 1024 2048 4096 \
  --save-json output/lettuce-emb-512d-v3/extreme_metrics.json

Full Eval (eval_v3_full.py)

Metric                    Value
logic_triplet_accuracy    0.9848
logic_mean_margin         0.1874
rp_recall@1               0.0200
rp_recall@5               0.1090
rp_recall@10              0.1710
rp_mrr                    0.0717
fp_probe_accuracy         1.0000
fp_probe_mean_margin      0.4387
long_1024_recall@10       0.1867
long_2048_recall@10       0.1067
long_4096_recall@10       0.1067

Extreme Eval (eval_v3_extreme.py)

Metric                      Value
logic_role_flip_accuracy    0.8000
logic_neg_temp_accuracy     0.6000
coreference_accuracy        0.6000
rp_overlap_accuracy         1.0000
needle_1024_accuracy        0.5833
needle_2048_accuracy        0.7083
needle_4096_accuracy        0.7083
extreme_avg_accuracy        0.7143
extreme_avg_margin          0.0673

Full extreme metrics JSON is included as extreme_metrics.json.

Real Benchmarks (MTEB)

MTEB run config:

venv/bin/python - <<'PY'
import mteb
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('./output/lettuce-v3-rp-long3', trust_remote_code=True)
model.similarity_fn_name = 'cosine'
tasks = [t for t in mteb.get_tasks(languages=['eng']) if t.metadata.name in ['STSBenchmark','SICK-R','NFCorpus']]
_ = mteb.evaluate(model, tasks, prediction_folder='./output/mteb_real_predictions', show_progress_bar=True)
PY

Task          Metric                 Score
STSBenchmark  Spearman (main_score)  0.8091
STSBenchmark  Pearson                0.8036
SICK-R        Spearman (main_score)  0.7816
SICK-R        Pearson                0.8297
NFCorpus      nDCG@10                0.2784
NFCorpus      MAP@10                 0.0938
NFCorpus      Recall@10              0.1271
NFCorpus      MRR@10                 0.4725

Full real-benchmark metrics JSON is included as mteb_real_results.json.

Export Config

ONNX export script:

  • export_v3_onnx.py

Export settings used:

  • opset: 18
  • exporter mode: TorchScript ONNX (dynamo=False in script for stability)
  • quantization: dynamic INT8 (qint8)
  • generated files:
    • model.fp32.onnx
    • model.int8.onnx
    • model.onnx (FP32 convenience copy)
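
A minimal sketch of this flow; the actual logic lives in export_v3_onnx.py. The wrapper, the dummy input shapes, and loading the source checkpoint through SentenceTransformer are assumptions:

import torch
from sentence_transformers import SentenceTransformer
from onnxruntime.quantization import QuantType, quantize_dynamic

class Wrapper(torch.nn.Module):
    """Assumed wrapper: backbone + pooling + 512-d Dense head as one graph."""
    def __init__(self, st_model):
        super().__init__()
        self.st = st_model

    def forward(self, input_ids, attention_mask):
        features = {"input_ids": input_ids, "attention_mask": attention_mask}
        return self.st(features)["sentence_embedding"]

st = SentenceTransformer("./output/lettuce-v3-rp-long3", trust_remote_code=True).eval()
dummy = torch.ones(1, 8, dtype=torch.long)
torch.onnx.export(
    Wrapper(st),
    (dummy, dummy),
    "model.fp32.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["embeddings"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "embeddings": {0: "batch"},
    },
    opset_version=18,
    dynamo=False,  # TorchScript exporter path, as noted above
)

# Dynamic INT8 quantization: weights become qint8, activations stay float.
quantize_dynamic("model.fp32.onnx", "model.int8.onnx", weight_type=QuantType.QInt8)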

Notes

  • model.int8.onnx is recommended for CPU/mobile usage.
  • INT8 may trade a small amount of quality for speed/size improvements.
  • Long-sequence latency is still significantly higher than short-sequence latency.
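
A quick way to measure that gap on your own hardware; a hedged sketch using the packaged INT8 graph, with arbitrary example texts and run count:

import time

import onnxruntime as ort
from transformers import AutoTokenizer

model_dir = "./lettuce-emb-512d-v3"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
session = ort.InferenceSession(f"{model_dir}/model.int8.onnx", providers=["CPUExecutionProvider"])

def mean_latency(text: str, runs: int = 5) -> float:
    enc = tokenizer([text], return_tensors="np", truncation=True)
    names = {i.name for i in session.get_inputs()}
    feeds = {k: v for k, v in enc.items() if k in names}
    start = time.perf_counter()
    for _ in range(runs):
        session.run(None, feeds)
    return (time.perf_counter() - start) / runs

print(f"short: {mean_latency('hello world'):.3f}s")
print(f"long:  {mean_latency('word ' * 3000):.3f}s")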