---
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- retrieval
- devdata-search
datasets:
- ai4data/devdatabench
base_model: intfloat/multilingual-e5-small
---
# devdata-search-multilingual-e5-small-cmnrl

A bi-encoder embedding model for **search over structured statistical
metadata**, part of the **DevData Search** family. It fine-tunes
`intfloat/multilingual-e5-small` on
[DevDataBench](https://huggingface.co/datasets/ai4data/devdatabench) with a
schema-invariant recipe: each record is serialized with its full schema, and
field order is permuted and fields are randomly dropped per example, so the
encoder binds meaning to field labels rather than to serialization order. This is
an embedding model that powers retrieval; it is not a hosted search service.

See the paper *Field Order Should Not Matter: Permutation-Invariant Fine-Tuning
for Structured Metadata Retrieval*.
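The permutation and dropout steps described above can be illustrated with a small sketch. This is not the authors' training code: the helper name `serialize_record`, the `keep` parameter that protects the `name` field from dropout, and the example field values are all assumptions. Only the `field: value | field: value` layout and the `0.15` dropout rate are taken from this card.

```python
import random


def serialize_record(record, rng, permute=True, dropout=0.15, keep=("name",)):
    """Serialize a metadata record as 'field: value | field: value'.

    Hypothetical helper: the name, the `keep` rule, and the treatment of
    empty values are illustrative assumptions, not the published recipe.
    """
    # Skip empty values so that no bare labels are emitted.
    items = [(k, v) for k, v in record.items() if v not in (None, "")]
    # Field dropout: drop each unprotected field independently with probability `dropout`.
    items = [(k, v) for k, v in items if k in keep or rng.random() >= dropout]
    # Field-order permutation: shuffle so that position carries no signal.
    if permute:
        rng.shuffle(items)
    return " | ".join(f"{k}: {v}" for k, v in items)


# Illustrative record; the field names and values are made up for this sketch.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "topic": "ICT",
}
rng = random.Random(0)
for _ in range(3):
    print("passage: " + serialize_record(record, rng))
```

Each call yields a different view of the same record, which is what pushes the encoder to rely on the labels rather than on where a field appears.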
## Training

- Base model: `intfloat/multilingual-e5-small`
- Loss: `cmnrl`
- Field permutation: `True`; field dropout: `0.15`
- Max sequence length: `512`
- Query prefix: `query: `; document prefix: `passage: ` (prepend these when encoding)
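The card lists the loss only as `cmnrl`. If that abbreviates the sentence-transformers `CachedMultipleNegativesRankingLoss` (an assumption; the card does not say so), the objective is a cross-entropy over in-batch negatives: for query *i*, document *i* is the positive and every other document in the batch is a negative. The plain-Python sketch below shows that objective only. It does not reproduce the gradient caching that the cached variant adds, and `scale=20.0` mirrors the library default rather than a value reported here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def in_batch_negatives_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over all documents in the batch.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy 2-dimensional embeddings standing in for encoder outputs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # each query is closest to its own document
swapped = [[0.1, 0.9], [0.9, 0.1]]  # each query is closest to the wrong document
print(in_batch_negatives_loss(queries, aligned))
print(in_batch_negatives_loss(queries, swapped))
```

The loss is low when each query scores highest against its own document and high when it does not, so larger batches supply more negatives per step.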
## Usage

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("ai4data/devdata-search-multilingual-e5-small-cmnrl")

# Prepend the query and passage prefixes listed under Training.
queries = ["query: " + "mobile-broadband subscriptions per 100 people"]
docs = ["passage: " + "name: Active mobile-broadband subscriptions | ..."]

q = model.encode(queries)  # one embedding per query
d = model.encode(docs)     # one embedding per document
```

The cosine similarity between each embedding in `q` and each embedding in `d` ranks the documents for that query.
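As a minimal sketch of that ranking step, the snippet below scores toy vectors in plain Python. The 3-dimensional vectors stand in for the output of `model.encode(...)`, and the helper `rank_documents` is a hypothetical name introduced for this illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors; real embeddings come from model.encode(...).
q = [0.2, 0.9, 0.1]
d = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.3, 0.3, 0.3]]
for idx, score in rank_documents(q, d):
    print(idx, round(score, 3))
```

With real embeddings, `sentence_transformers.util.cos_sim(q, d)` computes the same query-by-document score matrix in one call.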
## License

Apache-2.0. Derived from `intfloat/multilingual-e5-small`; trained on public World Bank Data360 metadata.