---
license: mit
language:
- fr
library_name: transformers
tags:
- linformer
- legal
- medical
- RoBERTa
- pytorch
---

# Jargon-general-base

[Jargon](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf) is an efficient transformer encoder LM for French, combining the Linformer attention mechanism with the RoBERTa model architecture.

Jargon is available in several versions with different context sizes and types of pre-training corpora.

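To give an intuition for the efficiency gain, Linformer-style attention projects the keys and values down to a fixed length `k` along the sequence dimension, so the attention map costs O(n·k) rather than O(n²). The sketch below is a minimal, self-contained NumPy illustration of that idea; the dimensions, projection matrices, and single-head setup are illustrative assumptions, not the model's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def linformer_attention(Q, K, V, E, F):
    # Project keys/values along the sequence dimension: (n, d) -> (k, d)
    K_proj = E @ K
    V_proj = F @ V
    # Attention scores are (n, k) instead of the full (n, n) map
    scores = Q @ K_proj.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V_proj

rng = np.random.default_rng(0)
n, d, k = 128, 16, 8          # sequence length, head dim, projected length
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
E = rng.standard_normal((k, n)) / np.sqrt(n)  # learned in the real model
F = rng.standard_normal((k, n)) / np.sqrt(n)

out = linformer_attention(Q, K, V, E, F)
print(out.shape)  # (128, 16): same output shape as standard attention
```

The output has the same shape as standard scaled dot-product attention, which is what lets the Linformer mechanism drop into a RoBERTa-style encoder.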
| **Model** | **Initialised from...** | **Training Data** |
|-----------|:-----------------------:|:-----------------:|
| [jargon-general-base](https://huggingface.co/PantagrueLLM/jargon-general-base) | scratch | 8.5GB Web Corpus |
| [jargon-general-biomed](https://huggingface.co/PantagrueLLM/jargon-general-biomed) | jargon-general-base | 5.4GB Medical Corpus |
| jargon-general-legal | jargon-general-base | 18GB Legal Corpus |
| [jargon-multidomain-base](https://huggingface.co/PantagrueLLM/jargon-multidomain-base) | jargon-general-base | Medical+Legal Corpora |
| jargon-legal | scratch | 18GB Legal Corpus |
| jargon-legal-4096 | scratch | 18GB Legal Corpus |
| [jargon-biomed](https://huggingface.co/PantagrueLLM/jargon-biomed) | scratch | 5.4GB Medical Corpus |
| [jargon-biomed-4096](https://huggingface.co/PantagrueLLM/jargon-biomed-4096) | scratch | 5.4GB Medical Corpus |
| [jargon-NACHOS](https://huggingface.co/PantagrueLLM/jargon-NACHOS) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |
| [jargon-NACHOS-4096](https://huggingface.co/PantagrueLLM/jargon-NACHOS-4096) | scratch | [NACHOS](https://drbert.univ-avignon.fr/) |


## Evaluation

The Jargon models were evaluated on a range of specialized downstream tasks.

For more information, please see the [paper](https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf), accepted for publication at [LREC-COLING 2024](https://lrec-coling-2024.org/list-of-accepted-papers/).

## Using Jargon models with HuggingFace transformers

You can get started with `jargon-general-base` using the code snippet below:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# The Jargon architecture ships custom modeling code, so trust_remote_code=True is required
tokenizer = AutoTokenizer.from_pretrained("PantagrueLLM/jargon-general-base", trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained("PantagrueLLM/jargon-general-base", trust_remote_code=True)

# Fill-mask pipeline: predicts candidate tokens for the <mask> position
jargon_maskfiller = pipeline("fill-mask", model=model, tokenizer=tokenizer)
output = jargon_maskfiller("Il est allé au <mask> hier")
```

You can also use the classes `AutoModel`, `AutoModelForSequenceClassification`, or `AutoModelForTokenClassification` to load Jargon models, depending on the downstream task in question.

- **Language(s):** French
- **License:** MIT
- **Developed by:** Vincent Segonne
- **Funded by:**
  - GENCI-IDRIS (Grant 2022 A0131013801)
  - French National Research Agency: Pantagruel grant ANR-23-IAS1-0001
  - MIAI@Grenoble Alpes ANR-19-P3IA-0003
  - PROPICTO ANR-20-CE93-0005
  - Lawbot ANR-20-CE38-0013
  - Swiss National Science Foundation (grant PROPICTO N°197864)
- **Authors:**
  - Vincent Segonne
  - Aidan Mannion
  - Laura Cristina Alonzo Canul
  - Alexandre Audibert
  - Xingyu Liu
  - Cécile Macaire
  - Adrien Pupier
  - Yongxin Zhou
  - Mathilde Aguiar
  - Felix Herron
  - Magali Norré
  - Massih-Reza Amini
  - Pierrette Bouillon
  - Iris Eshkol-Taravella
  - Emmanuelle Esperança-Rodier
  - Thomas François
  - Lorraine Goeuriot
  - Jérôme Goulian
  - Mathieu Lafourcade
  - Benjamin Lecouteux
  - François Portet
  - Fabien Ringeval
  - Vincent Vandeghinste
  - Maximin Coavoux
  - Marco Dinarelli
  - Didier Schwab

## Citation

If you use this model for your own research work, please cite as follows:

```bibtex
@inproceedings{segonne:hal-04535557,
  TITLE = {{Jargon: A Suite of Language Models and Evaluation Tasks for French Specialized Domains}},
  AUTHOR = {Segonne, Vincent and Mannion, Aidan and Alonzo Canul, Laura Cristina and Audibert, Alexandre and Liu, Xingyu and Macaire, C{\'e}cile and Pupier, Adrien and Zhou, Yongxin and Aguiar, Mathilde and Herron, Felix and Norr{\'e}, Magali and Amini, Massih-Reza and Bouillon, Pierrette and Eshkol-Taravella, Iris and Esperan{\c c}a-Rodier, Emmanuelle and Fran{\c c}ois, Thomas and Goeuriot, Lorraine and Goulian, J{\'e}r{\^o}me and Lafourcade, Mathieu and Lecouteux, Benjamin and Portet, Fran{\c c}ois and Ringeval, Fabien and Vandeghinste, Vincent and Coavoux, Maximin and Dinarelli, Marco and Schwab, Didier},
  URL = {https://hal.science/hal-04535557},
  BOOKTITLE = {{LREC-COLING 2024 - Joint International Conference on Computational Linguistics, Language Resources and Evaluation}},
  ADDRESS = {Turin, Italy},
  YEAR = {2024},
  MONTH = May,
  KEYWORDS = {Self-supervised learning ; Pretrained language models ; Evaluation benchmark ; Biomedical document processing ; Legal document processing ; Speech transcription},
  PDF = {https://hal.science/hal-04535557/file/FB2_domaines_specialises_LREC_COLING24.pdf},
  HAL_ID = {hal-04535557},
  HAL_VERSION = {v1},
}
```