	---
	license: apache-2.0
	library_name: sentence-transformers
	pipeline_tag: feature-extraction
	tags:
	- sentence-transformers
	- feature-extraction
	- retrieval
	- devdata-search
	datasets:
	- ai4data/devdatabench
	base_model: avsolatorio/NoInstruct-small-Embedding-v0
	---

	# devdata-search-noinstruct-small-cmnrl

	A bi-encoder embedding model for **search over structured statistical
	metadata**, part of the **DevData Search** family. It is a fine-tune of
	`avsolatorio/NoInstruct-small-Embedding-v0` produced with schema-invariant fine-tuning on
	[DevDataBench](https://huggingface.co/datasets/ai4data/devdatabench). Each training
	record is serialized with its full schema, and per example its field order is
	permuted and some fields are randomly dropped, so the encoder binds meaning to
	field labels rather than to serialization order. This is an embedding model that
	powers retrieval; it is not a hosted search service.
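The permutation and dropout described above can be sketched as follows. This is a minimal illustration, not the training code: the helper `serialize_record`, the `label: value | ...` separator (inferred from the usage example in this card), and the `unit` and `periodicity` fields are all assumptions made for the example.

```python
import random

def serialize_record(record, rng, permute=True, dropout=0.15, sep=" | "):
    """Serialize a metadata record as 'label: value' pairs (hypothetical helper)."""
    fields = [(k, v) for k, v in record.items() if v not in (None, "")]
    # Field dropout: drop each field independently with probability `dropout`,
    # but always keep at least one field so the text is never empty.
    kept = [f for f in fields if rng.random() >= dropout]
    if not kept and fields:
        kept = [rng.choice(fields)]
    # Field-order permutation: shuffle so position carries no signal.
    if permute:
        rng.shuffle(kept)
    return sep.join(f"{label}: {value}" for label, value in kept)

# Illustrative record; only the `name` value appears in this card.
record = {
    "name": "Active mobile-broadband subscriptions",
    "unit": "per 100 people",
    "periodicity": "annual",
}
rng = random.Random(0)
# Each call yields a different "view" of the same record.
views = [serialize_record(record, rng) for _ in range(3)]
```

Because every view keeps the field labels attached to their values, the same record can appear in many orders during training without changing what it says.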

	See the paper *Field Order Should Not Matter: Permutation-Invariant Fine-Tuning
	for Structured Metadata Retrieval*.

	## Training

	- Base model: `avsolatorio/NoInstruct-small-Embedding-v0`
	- Loss: `cmnrl`
	- Field permutation: `True`; field dropout: `0.15`
	- Max sequence length: `512`
	- No query/document prefixes
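The card does not expand `cmnrl`. Assuming it refers to the cached variant of the multiple-negatives ranking loss in sentence-transformers, the underlying objective is an in-batch contrastive loss: each query's paired document is the positive, and every other document in the batch is a negative. The dependency-free sketch below illustrates that objective only; the function names and the `scale` value are illustrative and not confirmed by this card, and the gradient caching used to reach larger batch sizes is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_ranking_loss(query_vecs, doc_vecs, scale=20.0):
    """Mean cross-entropy where doc i is the positive for query i
    and all other docs in the batch act as negatives."""
    losses = []
    for i, q in enumerate(query_vecs):
        logits = [scale * cosine(q, d) for d in doc_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy batch: aligned pairs give a low loss, mismatched pairs a high one.
aligned = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = in_batch_ranking_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```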

	## Usage

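	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("ai4data/devdata-search-noinstruct-small-cmnrl")

	# No query/document prefixes are needed: encode both sides as plain text.
	queries = ["mobile-broadband subscriptions per 100 people, reported annually"]
	# Documents are serialized metadata records ("label: value" fields).
	docs = ["name: Active mobile-broadband subscriptions | ..."]

	q = model.encode(queries)  # one embedding row per query
	d = model.encode(docs)     # one embedding row per document
	```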

	Rank the documents for each query by the cosine similarity between its embedding in `q` and each document embedding in `d`; higher scores indicate closer matches.
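A minimal sketch of that ranking step, written without extra dependencies so the arithmetic is visible. The helper names are hypothetical, and the toy 2-d vectors stand in for one row of `q` and the rows of `d`.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy vectors in place of real embeddings.
query_vec = [1.0, 0.0]
doc_vecs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
ranking = rank_documents(query_vec, doc_vecs)
```

For real workloads, `sentence_transformers.util.cos_sim(q, d)` computes the full query-by-document similarity matrix in one call.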

	## License

	Apache-2.0. Derived from `avsolatorio/NoInstruct-small-Embedding-v0`; trained on public World Bank Data360 metadata.