---
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- retrieval
- devdata-search
datasets:
- ai4data/devdatabench
base_model: avsolatorio/NoInstruct-small-Embedding-v0
---

# devdata-search-noinstruct-small-cmnrl

A bi-encoder embedding model for **search over structured statistical metadata**, part of the **DevData Search** family. It is a fine-tune of `avsolatorio/NoInstruct-small-Embedding-v0` produced with schema-invariant fine-tuning on [DevDataBench](https://huggingface.co/datasets/ai4data/devdatabench): full-schema serialization with per-example field-order permutation and field dropout, so the encoder binds meaning to field labels rather than to serialization order.

This is an embedding model that powers retrieval; it is not a hosted search service. See the paper *Field Order Should Not Matter: Permutation-Invariant Fine-Tuning for Structured Metadata Retrieval*.

## Training

- Base model: `avsolatorio/NoInstruct-small-Embedding-v0`
- Loss: `cmnrl`
- Field permutation: `True`; field dropout: `0.15`
- Max sequence length: `512`
- No query/document prefixes

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ai4data/devdata-search-noinstruct-small-cmnrl")

queries = ["mobile-broadband subscriptions per 100 people, reported annually"]
docs = ["name: Active mobile-broadband subscriptions | ..."]

q = model.encode(queries)
d = model.encode(docs)
```

Cosine similarity of `q` and `d` ranks documents for each query.

## License

Apache-2.0. Derived from `avsolatorio/NoInstruct-small-Embedding-v0`; trained on public World Bank Data360 metadata.
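
## Sketch: field permutation and dropout

The field-order permutation and field dropout described in the introduction and listed under Training can be pictured with a small, purely illustrative sketch. It is not the training code. The function name `serialize_record`, the example field labels and values, and the rule of always keeping at least one field are assumptions made for illustration. The `label: value | ...` layout is inferred from the document string in the usage example, and the default dropout of `0.15` is taken from the training settings.

```python
import random

# Separator inferred from the usage example ("name: ... | ..."); assumed, not confirmed.
FIELD_SEP = " | "


def serialize_record(record, rng, permute=True, dropout=0.15):
    """Serialize a metadata record as 'label: value' pairs joined by FIELD_SEP.

    Hypothetical helper: each call yields one randomized view of the record.
    """
    # Skip empty values so they do not produce dangling labels.
    fields = [(label, value) for label, value in record.items() if value]
    if not fields:
        return ""
    if dropout > 0:
        # Drop each field independently with probability `dropout`.
        kept = [field for field in fields if rng.random() >= dropout]
        # Assumption: never emit an empty document; keep one field at random.
        fields = kept or [rng.choice(fields)]
    if permute:
        # A fresh field order on every call, so position carries no signal.
        rng.shuffle(fields)
    return FIELD_SEP.join(f"{label}: {value}" for label, value in fields)


# Hypothetical record; the labels and values are placeholders, not dataset content.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "periodicity": "annual",
}

rng = random.Random(0)
for _ in range(3):
    print(serialize_record(record, rng))
```

Each call produces a different view of the same record, so a model trained on such views has to rely on the field labels rather than on where a field appears. At query time a fixed order with `permute=False` and `dropout=0.0` would presumably be used for indexing, though the source does not state this.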