---
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- retrieval
- devdata-search
datasets:
- ai4data/devdatabench
base_model: intfloat/multilingual-e5-small
---

# devdata-search-multilingual-e5-small-cmnrl

A bi-encoder embedding model for **search over structured statistical metadata**, part of the **DevData Search** family. It is a fine-tune of `intfloat/multilingual-e5-small` produced with schema-invariant fine-tuning on [DevDataBench](https://huggingface.co/datasets/ai4data/devdatabench): full-schema serialization with per-example field-order permutation and field dropout, so the encoder binds meaning to field labels rather than to serialization order.

This is an embedding model that powers retrieval; it is not a hosted search service. See the paper *Field Order Should Not Matter: Permutation-Invariant Fine-Tuning for Structured Metadata Retrieval*.

## Training

- Base model: `intfloat/multilingual-e5-small`
- Loss: `cmnrl`
- Field permutation: `True`; field dropout: `0.15`
- Max sequence length: `512`
- Query prefix: `query: ` ; document prefix: `passage: ` (prepend these when encoding)

## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ai4data/devdata-search-multilingual-e5-small-cmnrl")
queries = ["query: " + "mobile-broadband subscriptions per 100 people"]
docs = ["passage: " + "name: Active mobile-broadband subscriptions | ..."]
q = model.encode(queries)
d = model.encode(docs)
```

Cosine similarity of `q` and `d` ranks documents for each query.

## License

Apache-2.0. Derived from `intfloat/multilingual-e5-small`; trained on public World Bank Data360 metadata.
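
As an illustration of the field-order permutation and field dropout listed under Training, the following is a minimal sketch of how a metadata record might be serialized into a `passage: ` string. It is not the training code: the function name `serialize_record`, the `label: value` layout joined by ` | ` (inferred from the Usage example), the rule of always keeping at least one field, and the sample field names are all assumptions made for illustration.

```python
import random


def serialize_record(record, rng, permute=True, dropout=0.15,
                     sep=" | ", prefix="passage: "):
    """Serialize a metadata record as 'label: value' pairs (hypothetical helper).

    Each non-empty field is dropped independently with probability `dropout`,
    and the surviving fields are shuffled when `permute` is True, so the
    serialized order carries no information of its own.
    """
    # Skip fields with no value; they contribute nothing to the text.
    items = [(k, v) for k, v in record.items() if v not in (None, "")]
    if not items:
        return prefix

    # Field dropout: keep each field with probability 1 - dropout.
    kept = [kv for kv in items if rng.random() >= dropout]
    if not kept:
        # Assumption: never emit an empty document, keep one random field.
        kept = [rng.choice(items)]

    # Field-order permutation: shuffle the surviving fields in place.
    if permute:
        rng.shuffle(kept)

    return prefix + sep.join(f"{k}: {v}" for k, v in kept)


if __name__ == "__main__":
    # Hypothetical record; the field names are illustrative only.
    record = {
        "name": "Active mobile-broadband subscriptions",
        "unit": "per 100 people",
        "topic": "ICT",
    }
    rng = random.Random(0)
    for _ in range(3):
        print(serialize_record(record, rng))
```

At inference time one would presumably call such a helper with `permute=False` and `dropout=0.0` so that each document has a single deterministic serialization; since the fine-tuning is intended to make the encoder insensitive to field order, the specific order chosen should matter little.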