---
tags:
- feature-extraction
pipeline_tag: feature-extraction
---
This model is a fine-tuned version of the pre-trained Contriever model (https://huggingface.co/facebook/contriever), fine-tuned on MS MARCO following the approach described in [Towards Unsupervised Dense Information Retrieval with Contrastive Learning](https://arxiv.org/abs/2112.09118). The associated GitHub repository is available at https://github.com/facebookresearch/contriever.

## Usage (HuggingFace Transformers)

Using the model directly with HuggingFace Transformers requires adding a mean pooling operation over the token embeddings to obtain a sentence embedding.
```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('facebook/contriever-msmarco')
model = AutoModel.from_pretrained('facebook/contriever-msmarco')

sentences = [
    "Where was Marie Curie born?",
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Apply tokenizer
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings (no gradients needed for inference)
with torch.no_grad():
    outputs = model(**inputs)

# Mean pooling: average the token embeddings, ignoring padding positions
def mean_pooling(token_embeddings, mask):
    token_embeddings = token_embeddings.masked_fill(~mask[..., None].bool(), 0.)
    sentence_embeddings = token_embeddings.sum(dim=1) / mask.sum(dim=1)[..., None]
    return sentence_embeddings

embeddings = mean_pooling(outputs[0], inputs['attention_mask'])
```
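Contriever scores query–passage pairs with a dot product between their embeddings, so the sentence embeddings above can be compared directly with an inner product. Below is a minimal sketch continuing from the snippet above; treating `sentences[0]` as the query and the remaining sentences as passages is an assumption made here for illustration.

```python
# Rank the passages against the query by dot-product similarity.
# `embeddings` and `sentences` come from the snippet above; using
# sentences[0] as the query is an illustrative assumption.
query_embedding, passage_embeddings = embeddings[0], embeddings[1:]

scores = passage_embeddings @ query_embedding  # one score per passage
ranked = sorted(zip(sentences[1:], scores.tolist()), key=lambda pair: -pair[1])
for sentence, score in ranked:
    print(f"{score:.4f}  {sentence}")
```

The passage about Marie Curie's birth should receive a higher score than the one about Pierre Curie, reflecting the inner-product relevance function the model was trained with.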