Initial release: FormBench TAPT-MNRL model (anonymised for review)

	---
	license: apache-2.0
	base_model: Alibaba-NLP/gte-large-en-v1.5
	library_name: sentence-transformers
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- formbench
	- patent-retrieval
	- chemistry
	- formulations
	- materials-science
	language:
	- en
	pipeline_tag: sentence-similarity
	---

	# gte-large-formbench-mnrl

	A domain-adapted sentence-transformers model derived from
	[`Alibaba-NLP/gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) and fine-tuned on the FormBench
	retrieval benchmark for formulation chemistry. It maps passages from formulation patents
	into a 1024-dimensional dense vector space and is optimised for within-domain
	retrieval among structurally similar near-miss passages — the central capability targeted
	by FormBench.

	This repository hosts an anonymised release for NeurIPS 2026 double-blind review.

	## Model details

	| Item | Value |
	|---|---|
	| Base model | `Alibaba-NLP/gte-large-en-v1.5` (434M params) |
	| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
	| Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
	| Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
	| Embedding dimension | 1024 |
	| Max sequence length | 8192 (training: 2048) |
	| Precision | bf16 |
	| Learning rate | 1e-5 |
	| Per-GPU batch size | 64 |
	| Epochs | 5 |
	| Hardware | 8× AMD MI250X, DDP |

	The training-triplet set is reconstructable from the qrel files in
	[`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench)
	following the protocol in §3 of the paper.
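To make the loss in the table above concrete, here is a minimal NumPy sketch of a multiple-negatives ranking objective with in-batch negatives. It is an illustration, not the training code used for this model: the function name `mnrl_loss`, the cosine similarity, and the `scale=20.0` temperature are assumptions chosen for the example, and the card does not state the values actually used.

```python
import numpy as np

def mnrl_loss(query_emb, pos_emb, hard_neg_emb=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives.

    query_emb, pos_emb: arrays of shape (B, D); row i of pos_emb is the
    positive for row i of query_emb, and every other row acts as a negative.
    hard_neg_emb: optional (B, D) array of explicit hard negatives that are
    appended to the candidate pool. `scale` is an assumed temperature.
    """
    def l2norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    q = l2norm(np.asarray(query_emb, dtype=np.float64))
    candidates = l2norm(np.asarray(pos_emb, dtype=np.float64))
    if hard_neg_emb is not None:
        negs = l2norm(np.asarray(hard_neg_emb, dtype=np.float64))
        candidates = np.concatenate([candidates, negs], axis=0)  # (2B, D)

    logits = scale * (q @ candidates.T)           # (B, B) or (B, 2B)
    logits -= logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(q.shape[0])                   # positive for query i is column i
    return float(-log_probs[idx, idx].mean())
```

With a per-GPU batch of 64, each query in this formulation is contrasted against the other positives in its batch plus any hard negatives supplied; whether negatives were also shared across GPUs is not stated in the card.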

	## Evaluation results

	Evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants,
	following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product search with top-k = 100.
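As a rough illustration of what exact inner-product search at top-k does, the sketch below implements the same operation by brute force in NumPy. It is a stand-in for the FAISS call, not the evaluation code: the helper name `exact_ip_search` is hypothetical, and the card does not say which FAISS index class was used (a flat index such as `IndexFlatIP` is the usual choice for exact inner-product search).

```python
import numpy as np

def exact_ip_search(query_embeds, passage_embeds, top_k=100):
    """Brute-force inner-product search.

    Returns (scores, indices), each of shape (num_queries, k), sorted by
    descending score, where k = min(top_k, num_passages).
    """
    q = np.asarray(query_embeds, dtype=np.float32)
    p = np.asarray(passage_embeds, dtype=np.float32)
    k = min(top_k, p.shape[0])

    sims = q @ p.T                                    # (num_queries, num_passages)
    # Select the k highest-scoring columns per row, then sort only those k.
    top = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    top_scores = np.take_along_axis(sims, top, axis=1)
    order = np.argsort(-top_scores, axis=1)
    return (np.take_along_axis(top_scores, order, axis=1),
            np.take_along_axis(top, order, axis=1))
```

When the embeddings are L2-normalised, as in the usage example in this card, the inner product equals cosine similarity, so the ranking is the same under either measure.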

	### FormBench-Structured (C1) — within-domain near-miss distractors

	| Metric | Value |
	|---|---:|
	| Binary nDCG@10 | 0.3561 |
	| MRR (binary qrels) | 0.3125 |
	| Graded nDCG@10 | 0.2087 |
	| R@100 (binary qrels) | 0.7586 |
	| FAISS search latency | 18.9 ms/query |

	### FormBench-Random (C0) — random-distractor corpus

	| Metric | Value |
	|---|---:|
	| Binary nDCG@10 | 0.4308 |
	| MRR (binary qrels) | 0.3883 |
	| Graded nDCG@10 | 0.2553 |
	| R@100 (binary qrels) | 0.8011 |
	| FAISS search latency | 18.8 ms/query |

	For reference, the BM25 lexical baseline reaches binary nDCG@10 = 0.3751 (C1) and 0.4665 (C0).
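For readers less familiar with the metrics in the tables above, the following sketch shows one common way to compute per-query nDCG@k, reciprocal rank, and recall@k from a ranked list and a qrels mapping. It is not the benchmark's evaluation script: the function names are hypothetical, and the linear gain in `dcg_at_k` is an assumption (some toolkits use an exponential gain for graded relevance), so values may differ from the paper's protocol.

```python
import math

def dcg_at_k(gains, k):
    # Linear gain with log2 rank discount; rank 1 has discount log2(2) = 1.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc id -> relevance grade (1 for binary qrels, higher for graded)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

def reciprocal_rank(ranked_ids, qrels):
    # 1 / rank of the first relevant document, or 0 if none was retrieved.
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked_ids, qrels, k=100):
    # Fraction of relevant documents that appear in the top k.
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)
```

Corpus-level figures such as MRR are then the mean of the per-query values over all test queries.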

	## Usage

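	```python
	from sentence_transformers import SentenceTransformer

	# The gte-large-en-v1.5 base model relies on custom modelling code, so
	# trust_remote_code=True is assumed to be needed for this checkpoint too.
	model = SentenceTransformer(
	    "Formbench-anon/gte-large-formbench-mnrl",
	    trust_remote_code=True,
	)

	passages = [
	    "An adhesive composition comprising a styrene-acrylate copolymer ...",
	    "A water-based latex paint formulation containing ...",
	]
	queries = [
	    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
	]

	# L2-normalised embeddings, so the inner product equals cosine similarity.
	passage_embeds = model.encode(passages, normalize_embeddings=True)
	query_embeds = model.encode(queries, normalize_embeddings=True)

	# Similarity matrix of shape (num_queries, num_passages).
	scores = query_embeds @ passage_embeds.T
	```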

	## Intended use

	Domain-specific retrieval over formulation patents covering adhesives, coatings, lubricants,
	pharmaceuticals, agrochemicals, personal care, and food. Particularly suited to
	within-domain near-miss discrimination, where general-purpose embedders have been shown
	to fail.

	## Limitations

	- Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not
	match real practitioner intent.
	- Coverage limited to USPTO utility patents (1995–2022) in English only.
	- Performance on out-of-domain retrieval is not characterised.

	## Citation

	```bibtex
	@misc{formbench2026,
	title = { {FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents },
	author = { Anonymous Authors },
	year = { 2026 },
	note = { Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track }
	}
	```

	## License

	Apache 2.0, inherited from the base model.