metadata
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
  - sentence-transformers
  - feature-extraction
  - retrieval
  - devdata-search
datasets:
  - ai4data/devdatabench
base_model: intfloat/multilingual-e5-small

devdata-search-multilingual-e5-small-cmnrl

A bi-encoder embedding model for search over structured statistical metadata, part of the DevData Search family. It is a fine-tune of intfloat/multilingual-e5-small produced with schema-invariant fine-tuning on DevDataBench: full-schema serialization with per-example field-order permutation and field dropout, so the encoder binds meaning to field labels rather than to serialization order. This is an embedding model that powers retrieval; it is not a hosted search service.

See the paper "Field Order Should Not Matter: Permutation-Invariant Fine-Tuning for Structured Metadata Retrieval".

Training

  • Base model: intfloat/multilingual-e5-small
  • Loss: cmnrl
  • Field permutation: True; field dropout: 0.15
  • Max sequence length: 512
  • Query prefix: "query: "; document prefix: "passage: " (each ends with a space; prepend these when encoding)
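The permutation and dropout settings above can be illustrated with a minimal sketch. It is not the actual training code: the function name `serialize_record`, the `label: value | label: value` layout (inferred from the usage example), the field names other than `name`, and the fallback that keeps one field when all are dropped are all assumptions made for illustration.

```python
import random

def serialize_record(record, rng, permute=True, dropout=0.15, prefix="passage: "):
    """Serialize a metadata record as 'label: value' pairs joined by ' | '.

    Hypothetical helper: the real DevData Search serialization may differ.
    """
    # Skip empty values so only populated fields are serialized.
    fields = [(label, value) for label, value in record.items() if value]
    if not fields:
        return prefix
    # Field dropout: drop each field independently with probability `dropout`.
    kept = [f for f in fields if rng.random() >= dropout]
    if not kept:
        # Assumption: keep one field so the text is never empty.
        kept = [rng.choice(fields)]
    # Field-order permutation: a fresh random order for every example.
    if permute:
        rng.shuffle(kept)
    return prefix + " | ".join(f"{label}: {value}" for label, value in kept)

# Illustrative record; the "unit" and "source" fields and their values are made up.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "source": "example source",
}
rng = random.Random(0)
for _ in range(3):
    print(serialize_record(record, rng))
```

Because each call draws a new order and a new subset of fields, the same record appears in many surface forms during training, which is the mechanism the model card credits for binding meaning to field labels rather than to position.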

Usage

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ai4data/devdata-search-multilingual-e5-small-cmnrl")
# Prepend the prefixes listed under Training: "query: " and "passage: ".
queries = ["query: " + "mobile-broadband subscriptions per 100 people"]
docs = ["passage: " + "name: Active mobile-broadband subscriptions | ..."]
# encode() returns one embedding per input string.
q = model.encode(queries)
d = model.encode(docs)

Cosine similarity of q and d ranks documents for each query.
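The ranking step can be sketched in plain Python as follows. The helper names `cosine` and `rank_documents` are illustrative, and the toy 2-dimensional vectors stand in for the output of `model.encode`; they are not real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_documents(query_vecs, doc_vecs):
    """For each query, return document indices sorted by descending similarity."""
    rankings = []
    for qv in query_vecs:
        scores = [cosine(qv, dv) for dv in doc_vecs]
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        rankings.append(order)
    return rankings

# Toy vectors standing in for q and d from the usage snippet.
q = [[1.0, 0.0]]
d = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]
print(rank_documents(q, d))  # [[1, 2, 0]]
```

In practice, `sentence_transformers.util.cos_sim(q, d)` computes the same query-by-document similarity matrix in one call, which is likely the more convenient choice for larger collections.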

License

Apache-2.0. Derived from intfloat/multilingual-e5-small; trained on public World Bank Data360 metadata.