---
license: apache-2.0
base_model: Alibaba-NLP/gte-large-en-v1.5
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- formbench
- patent-retrieval
- chemistry
- formulations
- materials-science
language:
- en
pipeline_tag: sentence-similarity
---
# gte-large-formbench-mnrl

A domain-adapted sentence-transformers model derived from
[`Alibaba-NLP/gte-large-en-v1.5`](https://huggingface.co/Alibaba-NLP/gte-large-en-v1.5) and fine-tuned on the **FormBench**
retrieval benchmark for formulation chemistry. It maps passages from formulation patents
into a 1024-dimensional dense vector space and is optimised for within-domain
retrieval among structurally similar near-miss passages, the central capability targeted
by FormBench.

This repository hosts an anonymised release for NeurIPS 2026 double-blind review.
## Model details

| Item | Value |
|---|---|
| Base model | `Alibaba-NLP/gte-large-en-v1.5` (434M params) |
| Training method | Task-adaptive pre-training (TAPT) via contrastive fine-tuning |
| Loss | `MultipleNegativesRankingLoss` (in-batch negatives) |
| Training data | FormBench-Triplets: 44,413 (query, anchor, hard-negative) tuples |
| Embedding dimension | 1024 |
| Max sequence length | 8192 (training: 2048) |
| Precision | bf16 |
| Learning rate | 1e-5 |
| Per-GPU batch size | 64 |
| Epochs | 5 |
| Hardware | 8× AMD MI250X, DDP |

The training-triplet set is reconstructable from the qrel files in
[`Formbench-anon/FormBench`](https://huggingface.co/datasets/Formbench-anon/FormBench)
following the protocol in §3 of the paper.
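
To make the training objective concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss of the kind `MultipleNegativesRankingLoss` computes: a cross-entropy over scaled cosine similarities, where each query should score its own passage above every other passage in the batch. Several things here are assumptions rather than details from this card: the second element of each tuple is read as the relevant passage, `scale=20.0` with cosine similarity mirrors the sentence-transformers defaults (the values used in training are not stated), and the 2-d vectors are toy stand-ins for real embeddings. The function and variable names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_ranking_loss(queries, positives, hard_negatives, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    For query i the correct candidate is positives[i]; every other positive
    and every hard negative in the batch serves as a negative.
    """
    candidates = positives + hard_negatives
    total = 0.0
    for i, query in enumerate(queries):
        logits = [scale * cosine(query, cand) for cand in candidates]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_z - logits[i]
    return total / len(queries)


# Toy 2-d "embeddings" (hypothetical values, not model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.7, 0.7], [0.6, 0.8]]

aligned = in_batch_ranking_loss(queries, positives, hard_negatives)
swapped = in_batch_ranking_loss(queries, positives[::-1], hard_negatives)
print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

The loss is small when each query is closest to its own passage and grows sharply when the pairing is swapped. Hard negatives enlarge the candidate pool, so a near-miss passage that scores close to the positive raises the loss.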
## Evaluation results

The model is evaluated on the FormBench test split (n = 5,459 queries) under both corpus variants,
following the protocol in §4 of the paper. Retrieval uses FAISS exact inner-product search at top-k = 100.
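
As a rough illustration of what exact inner-product search does, the sketch below scores a query against every corpus vector and keeps the `k` highest scores. It is a pure-Python stand-in, not the FAISS code used for the reported numbers; an exhaustive FAISS index such as `IndexFlatIP` performs the same computation far more efficiently. The function name and the toy vectors are hypothetical.

```python
import heapq


def exact_inner_product_search(query, corpus, k):
    """Return the k (index, score) pairs with the largest inner product."""
    scored = (
        (sum(q * p for q, p in zip(query, passage)), idx)
        for idx, passage in enumerate(corpus)
    )
    top = heapq.nlargest(k, scored)
    return [(idx, score) for score, idx in top]


# Toy unit-length vectors standing in for passage embeddings.
corpus = [[0.6, 0.8], [1.0, 0.0], [0.0, 1.0], [0.8, 0.6]]
query = [1.0, 0.0]

hits = exact_inner_product_search(query, corpus, k=2)
print(hits)
```

Because every passage is scored, the result is exact rather than approximate; the cost grows linearly with corpus size.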
### FormBench-Structured (C1): within-domain near-miss distractors

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.3561** |
| MRR (binary qrels) | 0.3125 |
| Graded nDCG@10 | 0.2087 |
| R@100 (binary qrels) | 0.7586 |
| FAISS search latency | 18.9 ms/query |

### FormBench-Random (C0): random-distractor corpus

| Metric | Value |
|---|---:|
| Binary nDCG@10 | **0.4308** |
| MRR (binary qrels) | 0.3883 |
| Graded nDCG@10 | 0.2553 |
| R@100 (binary qrels) | 0.8011 |
| FAISS search latency | 18.8 ms/query |

For reference, the BM25 lexical baseline scores binary nDCG@10 = 0.3751 (C1) and 0.4665 (C0).
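
For readers unfamiliar with the metrics in the tables above, here is a minimal sketch of how nDCG@10, reciprocal rank, and recall@k can be computed for a single query. The reported figures are presumably averages of such per-query values over the test queries. The exact gain function used in the paper is not stated on this card; this sketch assumes linear gain (the relevance grade itself), and all document ids and grades below are made up for illustration.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k gains (linear gain)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, qrels, k=10):
    """qrels maps doc id -> relevance grade (1 for binary, higher for graded)."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(qrels.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def reciprocal_rank(ranked_ids, qrels):
    """1 / rank of the first relevant document, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if qrels.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, qrels, k=100):
    """Fraction of relevant documents found in the top k."""
    relevant = {doc_id for doc_id, grade in qrels.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_ids[:k])) / len(relevant)


# Hypothetical ranking and judgments for one query.
ranked = ["p7", "p2", "p9", "p4"]
graded_qrels = {"p2": 2, "p4": 1, "p5": 1}
binary_qrels = {doc_id: 1 for doc_id in graded_qrels}

print("binary nDCG@10:", ndcg_at_k(ranked, binary_qrels))
print("graded nDCG@10:", ndcg_at_k(ranked, graded_qrels))
print("reciprocal rank:", reciprocal_rank(ranked, binary_qrels))
print("recall@100:", recall_at_k(ranked, binary_qrels))
```

MRR is the mean of `reciprocal_rank` across queries. Binary and graded nDCG differ only in the qrels passed in, which is why the same ranking can receive two different scores.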
## Usage

```python
from sentence_transformers import SentenceTransformer

# The model relies on custom modelling code, so remote code must be trusted.
model = SentenceTransformer("Formbench-anon/gte-large-formbench-mnrl", trust_remote_code=True)

passages = [
    "An adhesive composition comprising a styrene-acrylate copolymer ...",
    "A water-based latex paint formulation containing ...",
]
queries = [
    "what wax-seeded latex polymers improve scuff resistance in architectural coatings?",
]

# L2-normalise so that the inner product equals cosine similarity.
passage_embeds = model.encode(passages, normalize_embeddings=True)
query_embeds = model.encode(queries, normalize_embeddings=True)

# Score every query against every passage; shape is (num_queries, num_passages).
scores = query_embeds @ passage_embeds.T
```
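
The usage snippet passes `normalize_embeddings=True`. The sketch below shows why that pairs naturally with inner-product retrieval: once vectors have unit length, their inner product is their cosine similarity, so vector magnitude no longer influences the ranking. The vectors are toy values and the helper names are hypothetical; whether the reported evaluation normalised embeddings is not stated here.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy raw vectors; the first passage points the same way as the query
# but has a larger magnitude.
raw_query = [3.0, 4.0]
raw_passages = [[6.0, 8.0], [4.0, -3.0], [1.0, 0.0]]

query = l2_normalize(raw_query)
passages = [l2_normalize(p) for p in raw_passages]

scores = [dot(query, p) for p in passages]
ranking = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
print(scores, ranking)
```

After normalisation, the passage parallel to the query scores highest and the orthogonal one scores zero, regardless of their original lengths.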
## Intended use

Domain-specific retrieval over formulation patents covering adhesives, coatings, lubricants,
pharmaceuticals, agrochemicals, personal care, and food. The model is particularly suited to
within-domain near-miss discrimination, where general-purpose embedders have been shown
to fail.
## Limitations

- Training queries are LLM-generated (Sonnet 4 + Haiku 4.5 quality filter) and may not
  match real practitioner intent.
- Coverage is limited to USPTO utility patents (1995–2022) in English only.
- Performance on out-of-domain retrieval is not characterised.
## Citation

```bibtex
@misc{formbench2026,
  title  = { {FormBench}: Evaluating Chemical Knowledge Retrieval in Formulation Patents },
  author = { Anonymous Authors },
  year   = { 2026 },
  note   = { Under double-blind review at NeurIPS 2026 Datasets \& Benchmarks Track }
}
```
## License

Apache 2.0, inherited from the base model.