---
language: fr
datasets:
- piaf
- FQuAD
- SQuAD-FR
---

# camembert-base-squadFR-fquad-piaf

## Description

French [DPR model](https://arxiv.org/abs/2004.04906) using [CamemBERT](https://arxiv.org/abs/1911.03894) as the base model, fine-tuned on a combination of three French Q&A datasets.

## Data

### French Q&A

We use a combination of three French Q&A datasets:

1. [PIAFv1.1](https://www.data.gouv.fr/en/datasets/piaf-le-dataset-francophone-de-questions-reponses/)
2. [FQuADv1.0](https://fquad.illuin.tech/)
3. [SQuAD-FR (SQuAD automatically translated to French)](https://github.com/Alikabbadj/French-SQuAD)

### Training

We use 90,562 random questions for `train` and 22,391 for `dev`. No question in `train` appears in `dev`. For each question, we have a single `positive_context` (the paragraph where the answer to the question is found) and around 30 `hard_negative_contexts`. Hard negative contexts are obtained by querying an Elasticsearch instance (via BM25 retrieval) and keeping the top-k candidates **that do not contain the answer**.
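
For illustration, here is a minimal sketch of what a single training entry might look like in the DPR JSON format (field names follow the facebookresearch/DPR convention; the question and contexts are invented):

```python
# Hypothetical training entry in the DPR format (values invented for illustration)
entry = {
    "question": "Quelle est la capitale de la France ?",
    "answers": ["Paris"],
    "positive_ctxs": [
        {"title": "Paris", "text": "Paris est la capitale de la France ..."}
    ],
    # ~30 BM25 candidates that do NOT contain the answer
    "hard_negative_ctxs": [
        {"title": "Lyon", "text": "Lyon est une commune du sud-est de la France ..."},
    ],
}
```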

The corresponding files are here:

### Evaluation

We use the FQuADv1.0 and SQuAD-FR evaluation sets.

## Training Script

We use the official [Facebook DPR implementation](https://github.com/facebookresearch/DPR) with a slight modification: by default the code works with RoBERTa models, but we changed a single line to make it easier to work with CamemBERT. This modification can be found [over here](dpr fork).

### Hyperparameters

```shell
python -m torch.distributed.launch --nproc_per_node=8 train_dense_encoder.py \
    --max_grad_norm 2.0 --encoder_model_type hf_bert --pretrained_file data/bert-base-multilingual-uncased \
    --seed 12345 --sequence_length 256 --warmup_steps 1237 --batch_size 16 --do_lower_case \
    --train_file DPR_FR_train.json \
    --dev_file ./data/100_hard_neg_ctxs/DPR_FR_dev.json \
    --output_dir ./output/bert --learning_rate 2e-05 --num_train_epochs 35 \
    --dev_batch_size 16 --val_av_rank_start_epoch 25 \
    --pretrained_model_cfg ./data/bert-base-multilingual-uncased
```

## Evaluation results

We obtain the following results on the FQuAD and SQuAD-FR evaluation (validation) sets. To compute them, we use [haystack's evaluation script](https://github.com/deepset-ai/haystack/blob/db4151bbc026f27c6d709fefef1088cd3f1e18b9/tutorials/Tutorial5_Evaluation.py) (**we report Retrieval results only**).

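To make the reported metric concrete, here is a minimal sketch of a top-k retrieval recall computation (`retrieval_recall_at_k` is a hypothetical helper, not haystack's API; it assumes answer strings and retrieved passage texts):

```python
from typing import List

def retrieval_recall_at_k(retrieved_passages: List[List[str]], answers: List[str], k: int = 20) -> float:
    """Fraction of questions whose answer appears in any of the top-k retrieved passages."""
    hits = sum(
        any(answer in passage for passage in passages[:k])
        for passages, answer in zip(retrieved_passages, answers)
    )
    return hits / len(answers)

# e.g. 2764 hits out of 3184 questions -> ~0.87
```
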
### DPR

#### FQuAD v1.0 Evaluation

```shell
For 2764 out of 3184 questions (86.81%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.87
Retriever Mean Avg Precision: 0.57
```

#### SQuAD-FR Evaluation

```shell
For 8945 out of 10018 questions (89.29%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.89
Retriever Mean Avg Precision: 0.63
```

### BM25

For reference, BM25 obtains the results shown below. As in the original paper, on SQuAD-like datasets DPR is consistently outperformed by BM25.

#### FQuAD v1.0 Evaluation

```shell
For 2966 out of 3184 questions (93.15%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.74
```

#### SQuAD-FR Evaluation

```shell
For 9353 out of 10018 questions (93.36%), the answer was in the top-20 candidate passages selected by the retriever.
Retriever Recall: 0.93
Retriever Mean Avg Precision: 0.77
```

## Usage

The results reported above were obtained with the `haystack` library. To get similar embeddings using only the HF `transformers` library, you can do the following:

```python
from transformers import AutoTokenizer, AutoModel

query = "Salut, mon chien est-il mignon ?"

tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", do_lower_case=True)
input_ids = tokenizer(query, return_tensors="pt")["input_ids"]

model = AutoModel.from_pretrained("etalab-ia/dpr-question_encoder-fr_qa-camembert", return_dict=True)

# The pooled [CLS] output is the DPR query embedding
embeddings = model(input_ids).pooler_output
print(embeddings)
```
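
Similarly, here is a sketch of scoring passages against the query with the paired context encoder (referenced below as `etalab-ia/dpr-ctx_encoder-fr_qa-camembert`), assuming it exposes the same `pooler_output`; DPR ranks passages by the dot product of query and passage embeddings:

```python
from transformers import AutoTokenizer, AutoModel

# Context encoder paired with the question encoder above
ctx_tokenizer = AutoTokenizer.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", do_lower_case=True)
ctx_model = AutoModel.from_pretrained("etalab-ia/dpr-ctx_encoder-fr_qa-camembert", return_dict=True)

passages = [
    "Les chiens sont des animaux de compagnie très populaires.",
    "Paris est la capitale de la France.",
]
ctx_inputs = ctx_tokenizer(passages, padding=True, return_tensors="pt")
ctx_embeddings = ctx_model(**ctx_inputs).pooler_output  # shape: (n_passages, hidden_size)

# DPR similarity: dot product between query and passage embeddings
# (`embeddings` comes from the snippet above)
scores = embeddings @ ctx_embeddings.T
print(scores)
```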

And with `haystack` (using `transformers-3.3.1`), we use it as a retriever (**note that we reference it from a local path**):

```python
# haystack 0.x import path (matching the pinned transformers-3.3.1)
from haystack.retriever.dense import DensePassageRetriever

retriever = DensePassageRetriever(
    document_store=document_store,
    query_embedding_model="./etalab-ia/dpr-question_encoder-fr_qa-camembert",
    passage_embedding_model="./etalab-ia/dpr-ctx_encoder-fr_qa-camembert",
    use_gpu=True,
    embed_title=False,
    batch_size=16,
    use_fast_tokenizers=False,
)
```
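
A retrieval call might then look like the following sketch (it assumes `document_store` already holds indexed passages; the `retrieve` signature and `Document.text` attribute match haystack 0.x and may differ in later versions):

```python
# Hypothetical usage: fetch the top-20 passages for a query
top_passages = retriever.retrieve(query="Salut, mon chien est-il mignon ?", top_k=20)
for doc in top_passages:
    print(doc.text[:80])
```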

## Acknowledgements

This work was performed using HPC resources from GENCI–IDRIS (Grant 2020-AD011011224).

## Citations

### Datasets

#### PIAF

```bibtex
@inproceedings{KeraronLBAMSSS20,
  author    = {Rachel Keraron and
               Guillaume Lancrenon and
               Mathilde Bras and
               Fr{\'{e}}d{\'{e}}ric Allary and
               Gilles Moyse and
               Thomas Scialom and
               Edmundo{-}Pavel Soriano{-}Morales and
               Jacopo Staiano},
  title     = {Project {PIAF:} Building a Native French Question-Answering Dataset},
  booktitle = {{LREC}},
  pages     = {5481--5490},
  publisher = {European Language Resources Association},
  year      = {2020}
}
```

#### FQuAD

```bibtex
@article{dHoffschmidt2020FQuADFQ,
  title   = {FQuAD: French Question Answering Dataset},
  author  = {Martin d'Hoffschmidt and Maxime Vidal and Wacim Belblidia and Tom Brendl{\'e} and Quentin Heinrich},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2002.06071}
}
```

#### SQuAD-FR

```bibtex
@misc{kabbadj2018,
  author = {Kabbadj, Ali},
  title  = {Something new in French Text Mining and Information Extraction (Universal Chatbot): Largest Q\&A French training dataset (110 000+)},
  editor = {linkedin.com},
  month  = {November},
  year   = {2018},
  url    = {\url{https://www.linkedin.com/pulse/something-new-french-text-mining-information-chatbot-largest-kabbadj/}},
  note   = {[Online; posted 11-November-2018]}
}
```

### Models

#### CamemBERT

HF model card: [https://huggingface.co/camembert-base](https://huggingface.co/camembert-base)

```bibtex
@inproceedings{martin2020camembert,
  title     = {CamemBERT: a Tasty French Language Model},
  author    = {Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle = {Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  year      = {2020}
}
```

#### DPR

```bibtex
@misc{karpukhin2020dense,
  title         = {Dense Passage Retrieval for Open-Domain Question Answering},
  author        = {Vladimir Karpukhin and Barlas Oğuz and Sewon Min and Patrick Lewis and Ledell Wu and Sergey Edunov and Danqi Chen and Wen-tau Yih},
  year          = {2020},
  eprint        = {2004.04906},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```