---
language: fr
license: mit
datasets:
- Sequoia
widget:
- text: Aucun financement politique occulte n'a pu être mis en évidence.
- text: L'excrétion de l'acide zolédronique dans le lait maternel n'est pas connue.
pipeline_tag: token-classification
tags:
- mwe
---

# Multiword Expression Recognition

A multiword expression (MWE) is a combination of words that exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies (Baldwin and Kim, 2010). The goal of Multiword Expression Recognition (MWER) is to identify these MWEs automatically.

## Model description

`camembert-mwer` is a token classification model fine-tuned from [CamemBERT](https://huggingface.co/camembert/camembert-large) on the [Sequoia](http://deep-sequoia.inria.fr/) corpus for the MWER task.

## How to use

You can use this model directly with a pipeline for token classification:
```python
>>> from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline
>>> tokenizer = AutoTokenizer.from_pretrained("bvantuan/camembert-mwer")
>>> model = AutoModelForTokenClassification.from_pretrained("bvantuan/camembert-mwer")
>>> mwe_classifier = pipeline('token-classification', model=model, tokenizer=tokenizer)
>>> sentence = "Pour ce premier rendez-vous, l'animateur a pu faire partager sa passion et présenter quelques oeuvres pour mettre en bouche les participants."
>>> mwes = mwe_classifier(sentence)

[{'entity': 'B-MWE',
  'score': 0.99492574,
  'index': 4,
  'word': '▁rendez',
  'start': 15,
  'end': 22},
 {'entity': 'I-MWE',
  'score': 0.9344883,
  'index': 5,
  'word': '-',
  'start': 22,
  'end': 23},
 {'entity': 'I-MWE',
  'score': 0.99398583,
  'index': 6,
  'word': 'vous',
  'start': 23,
  'end': 27},
 {'entity': 'B-VID',
  'score': 0.9827843,
  'index': 22,
  'word': '▁mettre',
  'start': 106,
  'end': 113},
 {'entity': 'I-VID',
  'score': 0.9835186,
  'index': 23,
  'word': '▁en',
  'start': 113,
  'end': 116},
 {'entity': 'I-VID',
  'score': 0.98324823,
  'index': 24,
  'word': '▁bouche',
  'start': 116,
  'end': 123}]

>>> mwe_classifier.group_entities(mwes)

[{'entity_group': 'MWE',
  'score': 0.9744666,
  'word': 'rendez-vous',
  'start': 15,
  'end': 27},
 {'entity_group': 'VID',
  'score': 0.9831837,
  'word': 'mettre en bouche',
  'start': 106,
  'end': 123}]
```
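To see what `group_entities` is doing, it can be approximated in a few lines of plain Python. The following is a simplified sketch (it ignores scores and edge cases), not the pipeline's actual implementation, applied to the token-level predictions shown above:

```python
# Simplified sketch of pipeline.group_entities: merge token-level IOB2
# predictions into contiguous entity groups (scores are ignored here).
def merge_entities(preds):
    groups = []
    for p in preds:
        prefix, _, label = p["entity"].partition("-")
        # Start a new group on a B- tag, or on a label change without one.
        if prefix == "B" or not groups or groups[-1]["entity_group"] != label:
            groups.append({"entity_group": label, "word": p["word"],
                           "start": p["start"], "end": p["end"]})
        else:
            groups[-1]["word"] += p["word"]
            groups[-1]["end"] = p["end"]
    for g in groups:
        # '▁' marks word boundaries in SentencePiece tokens.
        g["word"] = g["word"].replace("▁", " ").strip()
    return groups

preds = [
    {"entity": "B-MWE", "word": "▁rendez", "start": 15, "end": 22},
    {"entity": "I-MWE", "word": "-", "start": 22, "end": 23},
    {"entity": "I-MWE", "word": "vous", "start": 23, "end": 27},
    {"entity": "B-VID", "word": "▁mettre", "start": 106, "end": 113},
    {"entity": "I-VID", "word": "▁en", "start": 113, "end": 116},
    {"entity": "I-VID", "word": "▁bouche", "start": 116, "end": 123},
]
print(merge_entities(preds))
```

This recovers the two grouped expressions `rendez-vous` (MWE) and `mettre en bouche` (VID) with the same character offsets as the pipeline output.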

## Training data

The Sequoia dataset is divided into train/dev/test sets:

| | Sequoia | train | dev | test |
| :---: | :---: | :---: | :---: | :---: |
| #sentences | 3099 | 1955 | 273 | 871 |
| #MWEs | 3450 | 2170 | 306 | 974 |
| #Unseen MWEs | - | - | 100 | 300 |

This dataset has 6 distinct categories:
* MWE: Non-verbal MWE (e.g. **à peu près**)
* IRV: Inherently reflexive verb (e.g. **s'occuper**)
* LVC.cause: Causative light-verb construction (e.g. **causer** le **bouleversement**)
* LVC.full: Light-verb construction (e.g. **avoir pour but** de)
* MVC: Multi-verb construction (e.g. **faire remarquer**)
* VID: Verbal idiom (e.g. **voir le jour**)
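For token classification, each category is typically split into a B- (begin) and an I- (inside) tag, plus a single O (outside) tag, giving 13 labels in total. A minimal sketch (the exact label names are an assumption, not taken from the model's config):

```python
# Building an IOB2 tag set from the 6 Sequoia MWE categories:
# one B- (begin) and one I- (inside) tag per category, plus O (outside).
CATEGORIES = ["MWE", "IRV", "LVC.cause", "LVC.full", "MVC", "VID"]
LABELS = ["O"] + [f"{p}-{c}" for c in CATEGORIES for p in ("B", "I")]
print(len(LABELS))  # 13
```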

## Training procedure

### Preprocessing

The sequence labeling scheme used for this task is Inside-Outside-Beginning (IOB2) tagging.
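As an illustration, converting word-level MWE span annotations into IOB2 tags might look like this (a hypothetical helper, not the authors' preprocessing code):

```python
def to_iob2(tokens, spans):
    """Convert MWE spans (start, end_exclusive, category) to IOB2 tags."""
    tags = ["O"] * len(tokens)
    for start, end, cat in spans:
        tags[start] = f"B-{cat}"      # first token of the expression
        for i in range(start + 1, end):
            tags[i] = f"I-{cat}"      # continuation tokens
    return tags

tokens = "le projet a vu le jour en 1995".split()
print(to_iob2(tokens, [(3, 6, "VID")]))
# ['O', 'O', 'O', 'B-VID', 'I-VID', 'I-VID', 'O', 'O']
```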

### Fine-tuning

The model was trained on the train+dev sets with a learning rate of $3 \times 10^{-5}$ and a batch size of 10, for 15 epochs.
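A configuration sketch of this setup with the `transformers` `Trainer` API might look as follows; the hyperparameters come from this card, while everything else (model initialization, label count, output directory) is an assumption rather than the authors' exact training script:

```python
# Hedged sketch of the fine-tuning configuration described above.
from transformers import AutoModelForTokenClassification, TrainingArguments

model = AutoModelForTokenClassification.from_pretrained(
    "camembert/camembert-large",
    num_labels=13,  # assumed: O + B-/I- tags for each of the 6 categories
)
args = TrainingArguments(
    output_dir="camembert-mwer",       # hypothetical output path
    learning_rate=3e-5,                # from this card
    per_device_train_batch_size=10,    # from this card
    num_train_epochs=15,               # from this card
)
```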

### Evaluation results

On the test set, this model achieves the following results:

<table>
  <tr>
    <td colspan="3">Global MWE-based</td>
    <td colspan="3">Unseen MWE-based</td>
  </tr>
  <tr>
    <td>Precision</td><td>Recall</td><td>F1</td>
    <td>Precision</td><td>Recall</td><td>F1</td>
  </tr>
  <tr>
    <td>83.78</td><td>83.78</td><td>83.78</td>
    <td>57.05</td><td>60.67</td><td>58.80</td>
  </tr>
</table>
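As a sanity check, F1 is the harmonic mean of precision and recall, so the unseen-MWE score can be reproduced from the two columns next to it:

```python
# F1 as the harmonic mean of precision and recall.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(round(f1(57.05, 60.67), 2))  # 58.8
```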

### BibTeX entry and citation info

```bibtex
@article{martin2019camembert,
  title={CamemBERT: a tasty French language model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de La Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  journal={arXiv preprint arXiv:1911.03894},
  year={2019}
}

@article{candito2020french,
  title={A French corpus annotated for multiword expressions and named entities},
  author={Candito, Marie and Constant, Mathieu and Ramisch, Carlos and Savary, Agata and Guillaume, Bruno and Parmentier, Yannick and Cordeiro, Silvio Ricardo},
  journal={Journal of Language Modelling},
  volume={8},
  number={2},
  year={2020},
  publisher={Polska Akademia Nauk. Instytut Podstaw Informatyki PAN}
}
```