nomic-formbench-mnrl

A domain-adapted sentence-transformers model derived from nomic-ai/nomic-embed-text-v1.5 and fine-tuned on the FormBench retrieval benchmark for formulation chemistry. It maps passages from formulation patents into a 768-dimensional dense vector space and is optimised for within-domain retrieval among structurally similar near-miss passages, the central capability targeted by FormBench.

This repository hosts an anonymised release for NeurIPS 2026 double-blind review.

Model details

| Item | Value |
| --- | --- |
| Base model | nomic-ai/nomic-embed-text-v1.5 (137M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | MultipleNegativesRankingLoss (in-batch negatives) |
| Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 768 |
| Max sequence length | 8192 (2048 during training) |
| Precision | bf16 |
| Learning rate | 2e-5 |
| Per-GPU batch size | 32 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |
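MultipleNegativesRankingLoss scores each query against every passage in the batch and treats its own paired passage as the correct class, with the remaining in-batch passages acting as negatives. A minimal numpy sketch of the objective (illustrative only; training used the sentence-transformers implementation, whose default similarity scale is 20):

```python
import numpy as np

def mnrl_loss(query_embeds, passage_embeds, scale=20.0):
    """In-batch MultipleNegativesRankingLoss: cross-entropy over the
    query-passage similarity matrix, with the diagonal as targets."""
    # Cosine similarity (rows assumed L2-normalised), scaled
    sims = scale * query_embeds @ passage_embeds.T            # (B, B)
    # Row-wise log-softmax; the loss is -log p(paired passage)
    logits = sims - sims.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

# Toy batch of 3 L2-normalised embeddings; pairing each row with itself
# makes every diagonal similarity 1, so the loss should be near zero.
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
q /= np.linalg.norm(q, axis=1, keepdims=True)
loss = mnrl_loss(q, q)
```

With hard negatives (as in FormBench-Triplets), the passage matrix additionally contains one negative per query, enlarging the softmax denominator.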

The training-triplet set can be reconstructed from the qrel files in Formbench-anon/FormBench following the protocol in §3 of the paper.
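As one illustration of such a reconstruction, assuming TREC-style qrel lines (`qid iter docid rel`) and a simple positive/negative pairing; the paper's §3 protocol (e.g. its hard-negative selection) may differ:

```python
from collections import defaultdict

def triplets_from_qrels(qrel_lines):
    """Sketch: pair each query's relevant (rel > 0) passages with judged
    non-relevant (rel == 0) ones to form (query, anchor, negative) IDs."""
    pos, neg = defaultdict(list), defaultdict(list)
    for line in qrel_lines:
        qid, _, docid, rel = line.split()
        (pos if int(rel) > 0 else neg)[qid].append(docid)
    out = []
    for qid, positives in pos.items():
        # zip truncates to the shorter list, so unmatched positives are dropped
        for p, n in zip(positives, neg.get(qid, [])):
            out.append((qid, p, n))
    return out

# Hypothetical qrel lines, not FormBench data
qrels = [
    "Q1 0 D10 2",
    "Q1 0 D11 0",
    "Q2 0 D20 1",
    "Q2 0 D21 0",
]
triplets = triplets_from_qrels(qrels)
```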

Evaluation results

Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants, following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product search with top-k = 100.
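Exact inner-product search over normalised embeddings (what a FAISS flat inner-product index computes) can be sketched in plain numpy; the toy corpus below stands in for the FormBench passage index:

```python
import numpy as np

def exact_ip_search(query_embeds, corpus_embeds, k=100):
    """Exact (brute-force) inner-product top-k search. With L2-normalised
    embeddings, inner product equals cosine similarity."""
    scores = query_embeds @ corpus_embeds.T                   # (nq, nc)
    k = min(k, corpus_embeds.shape[0])
    topk = np.argsort(-scores, axis=1)[:, :k]                 # ranked doc indices
    return topk, np.take_along_axis(scores, topk, axis=1)

# Toy example: 2 queries against a 5-passage corpus; queries are exact
# copies of passages 0 and 3, so each should retrieve itself first.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(5, 4))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = corpus[[0, 3]]
idx, _ = exact_ip_search(queries, corpus, k=3)
```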

FormBench-Structured (C1): within-domain near-miss distractors

| Metric | Value |
| --- | --- |
| Binary nDCG@10 | 0.3668 |
| MRR (binary qrels) | 0.3228 |
| Graded nDCG@10 | 0.2145 |
| R@100 (binary qrels) | 0.7903 |
| FAISS search latency | 14.5 ms/query |

FormBench-Random (C0): random-distractor corpus

| Metric | Value |
| --- | --- |
| Binary nDCG@10 | 0.4358 |
| MRR (binary qrels) | 0.3915 |
| Graded nDCG@10 | 0.2583 |
| R@100 (binary qrels) | 0.8311 |
| FAISS search latency | 14.5 ms/query |

For reference, a BM25 lexical baseline achieves binary nDCG@10 = 0.3751 (C1) and 0.4665 (C0).
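The binary metrics above (nDCG@10, MRR, R@100) can be computed from a ranked list and a binary qrel set; a self-contained sketch with hypothetical document IDs:

```python
import math

def binary_ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary nDCG@k: gain 1 for relevant docs, discounted by log2(rank+1),
    normalised by the ideal DCG for this query."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]) if d in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for i, d in enumerate(ranked_ids):
        if d in relevant_ids:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant documents retrieved in the top k."""
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)

# Hypothetical ranking for one query; relevant docs at ranks 2 and 4
ranked = ["D3", "D1", "D9", "D2"]
rel = {"D1", "D2"}
ndcg10 = binary_ndcg_at_k(ranked, rel, k=10)
rr = mrr(ranked, rel)
r100 = recall_at_k(ranked, rel, k=100)
```

Corpus-level scores are the mean of these per-query values over all test queries.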

Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Formbench-anon/nomic-formbench-mnrl")

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)
```
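Because normalize_embeddings=True L2-normalises the outputs, the dot product between a query and a passage embedding is their cosine similarity, so passages can be ranked directly; a small numpy sketch with stand-in embeddings:

```python
import numpy as np

# Stand-ins for model.encode(..., normalize_embeddings=True) output:
# each row is L2-normalised, so a dot product is a cosine similarity.
passage_embeds = np.array([[1.0, 0.0], [0.0, 1.0]])
query_embeds = np.array([[0.9, 0.436]])
query_embeds /= np.linalg.norm(query_embeds, axis=1, keepdims=True)

scores = query_embeds @ passage_embeds.T      # (n_queries, n_passages)
best = scores.argmax(axis=1)                  # best-matching passage per query
```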

Intended use

Domain-specific retrieval over formulation patents: adhesives, coatings, lubricants, pharmaceuticals, agrochemicals, personal care, and food. The model is particularly suited to within-domain near-miss discrimination, where general-purpose embedders have been shown to fail.

Limitations

  • Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not match real practitioner intent.
  • Coverage limited to USPTO utility patents (1995–2022) in English only.
  • Performance on out-of-domain retrieval is not characterised.

Citation

```bibtex
@misc{formbench2026,
  title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track}
}
```

License

Apache 2.0, inherited from the base model.
