# nomic-formbench-mnrl
A domain-adapted sentence-transformers model derived from
nomic-ai/nomic-embed-text-v1.5 and fine-tuned on the FormBench
retrieval benchmark for formulation chemistry. It maps passages from formulation patents
into a 768-dimensional dense vector space and is optimised for within-domain
retrieval among structurally similar near-miss passages – the central capability targeted
by FormBench.
This repository hosts an anonymised release for NeurIPS 2026 double-blind review.
## Model details
| Item | Value |
|---|---|
| Base model | nomic-ai/nomic-embed-text-v1.5 (137M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | MultipleNegativesRankingLoss (in-batch negatives) |
| Training data | FormBench-Triplets – 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 768 |
| Max sequence length | 8192 (training: 2048) |
| Precision | bf16 |
| Learning rate | 2e-5 |
| Per-GPU batch size | 32 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |
The training-triplet set is reconstructable from the qrel files in
`Formbench-anon/FormBench`, following the protocol in §3 of the paper.
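MultipleNegativesRankingLoss scores each query against every passage in the batch and treats the matching passage as the positive class; with triplets, sentence-transformers additionally appends the explicit hard negatives as extra candidate columns. A minimal NumPy sketch of the in-batch part (function name and data are illustrative, not the training code):

```python
import numpy as np

def mnrl_in_batch_loss(query_embeds, passage_embeds, scale=20.0):
    """Cross-entropy over scaled cosine similarities, where passage i is
    the positive for query i and all other in-batch passages are negatives."""
    q = query_embeds / np.linalg.norm(query_embeds, axis=1, keepdims=True)
    p = passage_embeds / np.linalg.norm(passage_embeds, axis=1, keepdims=True)
    sims = scale * (q @ p.T)                    # (batch, batch) similarity matrix
    sims -= sims.max(axis=1, keepdims=True)     # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # diagonal = positive pairs

# Toy batch: each "positive" is a small perturbation of its query.
rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 768))
positives = queries + 0.01 * rng.normal(size=(4, 768))
loss = mnrl_in_batch_loss(queries, positives)   # near zero: positives dominate
```

The loss pushes each query toward its paired passage and away from every other passage in the batch, which is why larger batches implicitly supply more negatives.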
## Evaluation results
Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants, following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product search at top-k = 100.
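Exact inner-product search over a flat index returns the same results as brute-force scoring; a NumPy sketch of that search step, with illustrative data in place of the real corpus:

```python
import numpy as np

def exact_ip_search(query_embeds, corpus_embeds, top_k=100):
    """Brute-force exact inner-product search (result-equivalent to a
    flat FAISS inner-product index): return top_k corpus ids per query."""
    scores = query_embeds @ corpus_embeds.T          # (n_queries, n_corpus)
    k = min(top_k, corpus_embeds.shape[0])
    # argpartition selects the k best in O(n); then sort just those k
    idx = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    row = np.arange(scores.shape[0])[:, None]
    order = np.argsort(-scores[row, idx], axis=1)
    top_idx = idx[row, order]
    return top_idx, scores[row, top_idx]

# Toy corpus of unit-normalized vectors; queries are near-duplicates
# of the first three passages, so each should retrieve itself first.
rng = np.random.default_rng(1)
corpus = rng.normal(size=(1000, 768)).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
queries = corpus[:3] + 0.01
top_idx, top_scores = exact_ip_search(queries, corpus, top_k=10)
```

With normalized embeddings, inner product equals cosine similarity, so exact search gives the reference ranking against which approximate indexes would be measured.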
### FormBench-Structured (C1) – within-domain near-miss distractors
| Metric | Value |
|---|---|
| Binary nDCG@10 | 0.3668 |
| MRR (binary qrels) | 0.3228 |
| Graded nDCG@10 | 0.2145 |
| R@100 (binary qrels) | 0.7903 |
| FAISS search latency | 14.5 ms/query |
### FormBench-Random (C0) – random-distractor corpus
| Metric | Value |
|---|---|
| Binary nDCG@10 | 0.4358 |
| MRR (binary qrels) | 0.3915 |
| Graded nDCG@10 | 0.2583 |
| R@100 (binary qrels) | 0.8311 |
| FAISS search latency | 14.5 ms/query |
For reference, a BM25 lexical baseline achieves binary nDCG@10 = 0.3751 (C1) and 0.4665 (C0).
## Usage
```python
from sentence_transformers import SentenceTransformer

# trust_remote_code is required by the nomic-embed model architecture
model = SentenceTransformer(
    "Formbench-anon/nomic-formbench-mnrl", trust_remote_code=True
)

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)
```
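Because `normalize_embeddings=True` makes every row unit-length, query–passage relevance is a plain dot product. A minimal scoring sketch, with stand-in vectors where the model outputs would go:

```python
import numpy as np

# Stand-ins for model.encode(...) outputs: unit-normalized rows, with the
# query built as a perturbation of passage 0 so it should rank first.
rng = np.random.default_rng(42)
passage_embeds = rng.normal(size=(2, 768))
passage_embeds /= np.linalg.norm(passage_embeds, axis=1, keepdims=True)
query_embeds = passage_embeds[:1] + 0.01 * rng.normal(size=(1, 768))
query_embeds /= np.linalg.norm(query_embeds, axis=1, keepdims=True)

# Dot product == cosine similarity for unit vectors; rank descending.
scores = query_embeds @ passage_embeds.T       # shape (n_queries, n_passages)
ranking = np.argsort(-scores, axis=1)
```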
## Intended use
Domain-specific retrieval over formulation patents: adhesives, coatings, lubricants, pharmaceuticals, agrochemicals, personal care, food. Particularly suited to within-domain near-miss discrimination, where general-purpose embedders have been shown to fail.
## Limitations
- Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not match real practitioner intent.
- Coverage limited to USPTO utility patents (1995โ2022) in English only.
- Performance on out-of-domain retrieval is not characterised.
## Citation
```bibtex
@misc{formbench2026,
  title  = {{FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents},
  author = {Anonymous Authors},
  year   = {2026},
  note   = {Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track}
}
```
## License
Apache 2.0, inherited from the base model.