license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
  - sentence-transformers
  - feature-extraction
  - retrieval
  - devdata-search
datasets:
  - ai4data/devdatabench
base_model: avsolatorio/NoInstruct-small-Embedding-v0

devdata-search-noinstruct-small-cmnrl

A bi-encoder embedding model for search over structured statistical metadata, part of the DevData Search family. It is fine-tuned from avsolatorio/NoInstruct-small-Embedding-v0 on DevDataBench using schema-invariant fine-tuning: each training example is serialized with its full schema, with field order permuted and fields randomly dropped per example, so that the encoder binds meaning to field labels rather than to serialization order. The model produces embeddings for retrieval; it is not a hosted search service.
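The permutation and dropout described above can be illustrated with a small sketch. This is not the training code: the function name `serialize_record`, the `unit` and `periodicity` fields, and the rule of always keeping at least one field are assumptions made for illustration. Only the `label: value | ...` layout (from the Usage example) and the 0.15 dropout rate (from the Training section) come from this card.

```python
import random


def serialize_record(record, rng, permute=True, dropout=0.15, sep=" | "):
    """Serialize a metadata record as 'label: value' pairs joined by `sep`.

    Hypothetical helper: field dropout removes each field independently with
    probability `dropout`; permutation shuffles the order of surviving fields.
    """
    fields = [(label, value) for label, value in record.items() if value]
    # Field dropout: keep each field with probability 1 - dropout.
    kept = [f for f in fields if rng.random() >= dropout]
    # Assumption: never emit an empty document if the record had any field.
    if not kept and fields:
        kept = [rng.choice(fields)]
    # Field-order permutation: a fresh random order for every example.
    if permute:
        rng.shuffle(kept)
    return sep.join(f"{label}: {value}" for label, value in kept)


# Illustrative record; only the `name` field appears in this card.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "periodicity": "annual",
}

rng = random.Random(0)
# Each call yields a different "view" of the same record.
views = [serialize_record(record, rng) for _ in range(4)]
for view in views:
    print(view)
```

Because every view keeps the label next to its value, the only stable signal across views is the label-value pairing, which is the property the fine-tuning is meant to exploit.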

See the paper "Field Order Should Not Matter: Permutation-Invariant Fine-Tuning for Structured Metadata Retrieval".

Training

Base model: avsolatorio/NoInstruct-small-Embedding-v0
Loss: cmnrl
Field permutation: True; field dropout: 0.15
Max sequence length: 512
No query/document prefixes
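The card gives the loss only as `cmnrl`. If that abbreviates a cached multiple-negatives ranking loss (an assumption, not stated here), the underlying objective is an in-batch softmax cross-entropy: each query's paired document is the positive and every other document in the batch is a negative. The cached variant changes how gradients are computed to save memory, not the loss value. The sketch below computes that value in plain Python; the function names and `scale=20.0` are illustrative choices, not settings taken from this model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mnrl_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc_vecs[i] is the positive for query_vecs[i]
    and all other documents in the batch act as negatives."""
    total = 0.0
    for i, qv in enumerate(query_vecs):
        logits = [scale * cosine(qv, dv) for dv in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(query_vecs)


# Toy batch: aligned pairs give a low loss, swapped pairs a high one.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(mnrl_loss(queries, aligned), mnrl_loss(queries, swapped))
```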

Usage

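from sentence_transformers import SentenceTransformer

# Load the fine-tuned bi-encoder from the Hugging Face Hub.
model = SentenceTransformer("ai4data/devdata-search-noinstruct-small-cmnrl")

# No query/document prefixes are needed: encode the raw text directly.
queries = ["mobile-broadband subscriptions per 100 people, reported annually"]
docs = ["name: Active mobile-broadband subscriptions | ..."]
q = model.encode(queries)  # one embedding per query
d = model.encode(docs)     # one embedding per document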

To rank documents for a query, sort them by the cosine similarity between the query's embedding in q and each document embedding in d, highest first.
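As a minimal sketch of that ranking step, the snippet below uses toy 3-dimensional vectors in place of real embeddings so it runs without downloading the model. The helper names `cosine` and `rank_documents` are illustrative and not part of sentence-transformers; the same functions should also accept the rows of the arrays returned by `model.encode`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_documents(q, d):
    """For each query vector, return document indices sorted by
    descending cosine similarity."""
    rankings = []
    for qv in q:
        scores = [cosine(qv, dv) for dv in d]
        order = sorted(range(len(d)), key=lambda i: scores[i], reverse=True)
        rankings.append(order)
    return rankings


# Toy vectors standing in for model.encode(...) output.
q = [[1.0, 0.0, 0.0]]
d = [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0], [0.5, 0.5, 0.0]]
print(rank_documents(q, d))  # prints [[1, 2, 0]]
```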

License

Apache-2.0. Derived from avsolatorio/NoInstruct-small-Embedding-v0; trained on public World Bank Data360 metadata.