Commit 8784293 (parent: 9376ca0): Update README.md

README.md (changed):
library_name: sentence-transformers
---

# biencoder-distilcamembert-mmarcoFR

This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search. The model was trained on the **French** portion of the [mMARCO](https://huggingface.co/datasets/unicamp-dl/mmarco) dataset.
```python
from sentence_transformers import SentenceTransformer

sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('antoinelouis/biencoder-distilcamembert-mmarcoFR')
embeddings = model.encode(sentences)
print(embeddings)
```
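The snippet above yields one 768-dimensional vector per sentence; for semantic search, such vectors are typically compared by cosine similarity. A minimal NumPy sketch of that ranking step (the `rank_by_cosine` helper and the 4-dimensional dummy embeddings are illustrative, not part of the model card):

```python
import numpy as np

def rank_by_cosine(query_emb, passage_embs):
    """Return passage indices sorted by cosine similarity to the query (best first)."""
    q = query_emb / np.linalg.norm(query_emb)
    p = passage_embs / np.linalg.norm(passage_embs, axis=1, keepdims=True)
    scores = p @ q
    return np.argsort(-scores), scores

# Illustrative 4-dim embeddings (real model outputs are 768-dim)
query = np.array([1.0, 0.0, 1.0, 0.0])
passages = np.array([
    [1.0, 0.1, 0.9, 0.0],   # nearly parallel to the query
    [0.0, 1.0, 0.0, 1.0],   # orthogonal to the query
])
order, scores = rank_by_cosine(query, passages)
print(order)  # passage 0 ranks first
```

In practice you would encode the query and the candidate passages with the model above and rank the corpus by these scores.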
#### 🤗 Transformers

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
```python
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')
model = AutoModel.from_pretrained('antoinelouis/biencoder-distilcamembert-mmarcoFR')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
```
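The `mean_pooling` helper referenced earlier in the model card falls outside this hunk. As a sketch of the standard recipe (shown in NumPy for self-containment, whereas the card's own helper operates on PyTorch tensors), it averages token embeddings over non-padding positions:

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings, counting only non-padding tokens.

    token_embeddings: (batch, seq_len, dim) array of contextualized embeddings
    attention_mask:   (batch, seq_len) array of 1s (real tokens) and 0s (padding)
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # avoid division by zero
    return summed / counts

# Tiny illustrative batch: 1 sentence, 3 token positions (last is padding), dim 2
tok = np.array([[[1.0, 2.0], [3.0, 4.0], [99.0, 99.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pooling(tok, mask))  # [[2. 3.]] -- the padding token is ignored
```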
## Evaluation
***

We evaluated our model on the smaller development set of mMARCO-fr, which consists of 6,980 queries for a corpus of 8.8M candidate passages. Below, we compare the model's performance with other biencoder models fine-tuned on the same dataset. We report the mean reciprocal rank (MRR), normalized discounted cumulative gain (NDCG), mean average precision (MAP), and recall at various cut-offs (R@k).

|    | model                                                                                                                   | Vocab. | #Param. |  Size | MRR@10 | NDCG@10 | MAP@10 |  R@10 | R@100(↑) | R@500 |
|---:|:------------------------------------------------------------------------------------------------------------------------|:-------|--------:|------:|-------:|--------:|-------:|------:|---------:|------:|
| 1  | [biencoder-camembert-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camembert-base-mmarcoFR)               | 🇫🇷     | 110M    | 443MB | 28.53  | 33.72   | 27.93  | 51.46 | 77.82    | 89.13 |
| 2  | [biencoder-mpnet-base-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mpnet-base-all-v2-mmarcoFR)         | 🇬🇧     | 109M    | 438MB | 28.04  | 33.28   | 27.50  | 51.07 | 77.68    | 88.67 |
| 3  | **biencoder-distilcamembert-mmarcoFR**                                                                                   | 🇫🇷     | 68M     | 272MB | 26.80  | 31.87   | 26.23  | 49.20 | 76.44    | 87.87 |
| 4  | [biencoder-MiniLM-L6-all-v2-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-MiniLM-L6-all-v2-mmarcoFR)           | 🇬🇧     | 23M     | 91MB  | 25.49  | 30.39   | 24.99  | 47.10 | 73.48    | 86.09 |
| 5  | [biencoder-mMiniLMv2-L12-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L12-mmarcoFR)                 | 🇫🇷,99+ | 117M    | 471MB | 24.74  | 29.41   | 24.23  | 45.40 | 71.52    | 84.42 |
| 6  | [biencoder-camemberta-base-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-camemberta-base-mmarcoFR)             | 🇫🇷     | 112M    | 447MB | 24.78  | 29.24   | 24.23  | 44.58 | 69.59    | 82.18 |
| 7  | [biencoder-electra-base-french-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-electra-base-french-mmarcoFR)     | 🇫🇷     | 110M    | 440MB | 23.38  | 27.97   | 22.91  | 43.50 | 68.96    | 81.61 |
| 8  | [biencoder-mMiniLMv2-L6-mmarcoFR](https://huggingface.co/antoinelouis/biencoder-mMiniLMv2-L6-mmarcoFR)                   | 🇫🇷,99+ | 107M    | 428MB | 22.29  | 26.57   | 21.80  | 41.25 | 66.78    | 79.83 |
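As a reminder of how the headline MRR@10 column is computed: for each query, take the reciprocal of the rank of the first relevant passage within the top 10 (zero if none appears), then average over queries. A small illustrative sketch (the `mrr_at_k` helper is hypothetical, not taken from the evaluation code):

```python
def mrr_at_k(ranked_relevance, k=10):
    """Mean reciprocal rank at cut-off k.

    ranked_relevance: one list per query of booleans, True where the passage
    at that rank is relevant, ordered best-score first.
    """
    total = 0.0
    for rels in ranked_relevance:
        for rank, is_rel in enumerate(rels[:k], start=1):
            if is_rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)

# Two queries: first relevant hit at rank 2, then at rank 1 -> (0.5 + 1.0) / 2 = 0.75
print(mrr_at_k([[False, True, False], [True, False, False]]))  # 0.75
```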
## Training
***
- a smaller dev set of 6,980 queries (which is actually used for evaluation in most published works).

Link: [https://ir-datasets.com/mmarco.html#mmarco/v2/fr/](https://ir-datasets.com/mmarco.html#mmarco/v2/fr/)
## Citation

```bibtex
@online{louis2023,
  author    = {Antoine Louis},
  title     = {biencoder-distilcamembert-mmarcoFR: A Biencoder Model Trained on French mMARCO},
  publisher = {Hugging Face},
  month     = {may},
  year      = {2023},
  url       = {https://huggingface.co/antoinelouis/biencoder-distilcamembert-mmarcoFR},
}
```