---
license: mit
---

This model was first pretrained on the BEIR corpus and then fine-tuned on the MS MARCO dataset, following the approach described in the paper **COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning**. The associated GitHub repository is available at https://github.com/OpenMatch/COCO-DR.

This model uses BERT-base as its backbone and has 110M parameters. See the paper https://arxiv.org/abs/2210.15212 for details.

## Usage

Pre-trained models can be loaded through the HuggingFace `transformers` library:

```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("OpenMatch/cocodr-base-msmarco")
tokenizer = AutoTokenizer.from_pretrained("OpenMatch/cocodr-base-msmarco")
```
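
If you only use the model to compute embeddings, it can help to switch it to evaluation mode so that dropout is disabled. This step is not in the original card, but it is standard `transformers` usage:

```python
model.eval()  # disable dropout for deterministic inference
```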

Embeddings for different sentences can then be obtained as follows:

```python
sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, output_hidden_states=True, return_dict=True)
# Use the embedding of the [CLS] token after the final layer as the sentence representation
embeddings = outputs.hidden_states[-1][:, 0]
```
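
For inference you can also disable gradient tracking to save memory. A minimal variant of the forward pass above (assuming only that PyTorch is installed, which `transformers` already requires):

```python
import torch

# No gradients are needed when only computing embeddings
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True, return_dict=True)
embeddings = outputs.hidden_states[-1][:, 0]
```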

Similarity scores between the different sentences are then obtained with a dot product between the embeddings:
```python
score01 = embeddings[0] @ embeddings[1]  # 216.9792
score02 = embeddings[0] @ embeddings[2]  # 216.6684
```
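
In a retrieval setting, queries and passages are encoded separately and passages are ranked by their dot-product score against the query. Below is a minimal sketch of that pattern built on the snippets above; the `encode` helper and the use of the example sentences as a toy two-passage corpus are illustrative, not part of the original card:

```python
import torch

def encode(texts):
    # Hypothetical helper: embed a list of texts with the model and tokenizer
    # loaded above, using the final-layer [CLS] token as before.
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True, return_dict=True)
    return outputs.hidden_states[-1][:, 0]

query_embedding = encode(["Where was Marie Curie born?"])  # shape: (1, 768)
passages = sentences[1:]                                   # toy two-passage corpus
passage_embeddings = encode(passages)                      # shape: (2, 768)

# Score every passage against the query in one matrix product and rank them.
scores = (query_embedding @ passage_embeddings.T).squeeze(0)
for idx in scores.argsort(descending=True):
    print(f"{scores[idx]:.4f}  {passages[int(idx)]}")
```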