---
language: fr
license: mit
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- stsb_multi_mt
metrics:
- pearsonr
base_model: almanach/camembert-base
model-index:
- name: sts-camembert-base
  results:
  - task:
      name: Sentence Similarity
      type: sentence-similarity
    dataset:
      name: STSb French
      type: stsb_multi_mt
      args: fr
    metrics:
    - name: Pearson Correlation - stsb_multi_mt fr
      type: pearsonr
      value: 0.837
---

## Description

This [sentence-transformers](https://www.SBERT.net) model was obtained by fine-tuning
[`almanach/camembert-base`](https://huggingface.co/almanach/camembert-base) with the
[sentence-transformers](https://www.SBERT.net) library.

It encodes a sentence or a paragraph (512 tokens maximum, the `max_seq_length` of the model) into a 768-dimensional vector.

The [CamemBERT](https://arxiv.org/abs/1911.03894) model it is based on is a RoBERTa-type model that reported
state-of-the-art results for French when it was published.

## Usage with the `sentence-transformers` library

```
pip install -U sentence-transformers
```

```python
from sentence_transformers import SentenceTransformer

sentences = ["Ceci est un exemple", "deuxième exemple"]

model = SentenceTransformer("h4c5/sts-camembert-base")
embeddings = model.encode(sentences)
print(embeddings)
```
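
Because the model was fine-tuned for semantic textual similarity, a typical next step is to compare two embeddings with cosine similarity. The sketch below is a minimal, dependency-free illustration of that comparison. The helper name `cosine_similarity` and the toy 3-dimensional vectors are assumptions made for the example; real embeddings from this model have 768 dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for model.encode(...) outputs (hypothetical values).
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]   # same direction as emb_a
emb_c = [-3.0, 0.0, 1.0]  # orthogonal to emb_a

print(cosine_similarity(emb_a, emb_b))  # close to 1: same direction
print(cosine_similarity(emb_a, emb_c))  # close to 0: orthogonal
```

In practice the `sentence-transformers` library already provides `util.cos_sim` for this, which operates on whole batches of embeddings at once.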

## Usage with the `transformers` library

```
pip install -U transformers
```

```python
from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("h4c5/sts-camembert-base")
model = AutoModel.from_pretrained("h4c5/sts-camembert-base")
model.eval()

sentences = ["Ceci est un exemple", "deuxième exemple"]


def mean_pooling(model_output, attention_mask):
    """Average the token embeddings, ignoring padding positions."""
    # The first element of model_output contains all token embeddings.
    token_embeddings = model_output[0]
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )


# Tokenize the sentences and compute the token embeddings
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    model_output = model(**encoded_input)

# Mean pooling over tokens gives one vector per sentence
sentence_embeddings = mean_pooling(model_output, encoded_input["attention_mask"])

print(sentence_embeddings)
```
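
To make the role of the attention mask concrete, here is a dependency-free sketch of what the mean pooling step computes for a single sentence: token vectors at padded positions (mask value 0) are excluded from the average. The function name `masked_mean` and the toy values are assumptions for illustration only.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (one sentence)."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for i in range(dim):
            sums[i] += vector[i] * keep
    # Same guard as torch.clamp(..., min=1e-9): avoids division by zero
    # when every position is masked.
    count = max(float(sum(attention_mask)), 1e-9)
    return [s / count for s in sums]


# Three real tokens followed by one padding token (toy 2-dimensional vectors).
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 1, 0]

print(masked_mean(tokens, mask))  # [3.0, 4.0]
```

The padding vector `[100.0, 100.0]` has no effect on the result, which is why shorter sentences in a padded batch still get a faithful embedding.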

## Evaluation

The model was evaluated on the [STSb fr](https://huggingface.co/datasets/stsb_multi_mt) dataset:

```python
from datasets import load_dataset
from sentence_transformers import InputExample, SentenceTransformer, evaluation

model = SentenceTransformer("h4c5/sts-camembert-base")


def dataset_to_input_examples(dataset):
    # Gold scores range from 0 to 5; rescale them to [0, 1].
    return [
        InputExample(
            texts=[example["sentence1"], example["sentence2"]],
            label=example["similarity_score"] / 5.0,
        )
        for example in dataset
    ]


sts_test_dataset = load_dataset("stsb_multi_mt", name="fr", split="test")
sts_test_examples = dataset_to_input_examples(sts_test_dataset)

sts_test_evaluator = evaluation.EmbeddingSimilarityEvaluator.from_input_examples(
    sts_test_examples, name="sts-test"
)

# The second argument is the output path where the results file is written.
sts_test_evaluator(model, ".")
```
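
The reported metric is a Pearson correlation between the gold similarity scores and the scores predicted from the embeddings. As a reminder of what that number measures, the sketch below computes it by hand. The function name `pearson` and the toy scores are assumptions for illustration and are not taken from the benchmark.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (>= 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Gold labels rescaled to [0, 1] (score / 5.0) and hypothetical predictions.
gold = [0.0, 0.2, 0.5, 0.8, 1.0]
predicted = [0.1, 0.3, 0.4, 0.9, 0.8]

print(round(pearson(gold, predicted), 3))
```

A value of 1 means the predicted similarities are a perfect increasing linear function of the gold scores, so a higher value indicates better agreement with the human annotations.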

### Results

Below are the results of evaluating the model on the [`stsb_multi_mt`](https://huggingface.co/datasets/stsb_multi_mt) dataset
(`fr` subset, `test` split):

| Model | Pearson Correlation | Parameters |
| :---------------------------------------------------------------------------------------------------------------------------------------------- | :-----------------: | ---------: |
| [`h4c5/sts-camembert-base`](https://huggingface.co/h4c5/sts-camembert-base) | **0.837** | 110M |
| [`Lajavaness/sentence-camembert-base`](https://huggingface.co/Lajavaness/sentence-camembert-base) | 0.835 | 110M |
| [`inokufu/flaubert-base-uncased-xnli-sts`](https://huggingface.co/inokufu/flaubert-base-uncased-xnli-sts) | 0.828 | 137M |
| [`h4c5/sts-distilcamembert-base`](https://huggingface.co/h4c5/sts-distilcamembert-base) | 0.817 | 68M |
| [`sentence-transformers/distiluse-base-multilingual-cased-v2`](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased) | 0.786 | 135M |

## Training
The model was trained with the parameters:

**DataLoader**:

`torch.utils.data.dataloader.DataLoader` of length 180 with parameters:
```
{'batch_size': 32, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}
```

**Loss**:

`sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss`

Parameters of the `fit()` method:
```
{
    "epochs": 10,
    "evaluation_steps": 1000,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 500,
    "weight_decay": 0.01
}
```
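
To see what these settings imply for the learning rate, the sketch below reproduces the shape of a linear warmup followed by a linear decay, which is how the `WarmupLinear` scheduler is understood here. It assumes that, with `steps_per_epoch` left at `null`, one epoch is one pass over the 180-batch DataLoader, giving 180 × 10 = 1800 optimizer steps. The function name `warmup_linear_lr` is hypothetical.

```python
def warmup_linear_lr(step, base_lr=2e-05, warmup_steps=500, total_steps=1800):
    """Learning rate at `step` for linear warmup then linear decay to zero."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


batches_per_epoch = 180  # DataLoader length reported in the Training section
epochs = 10
total_steps = batches_per_epoch * epochs  # 1800 (assumed: one step per batch)

for step in (0, 250, 500, 1150, 1800):
    print(step, warmup_linear_lr(step, total_steps=total_steps))
```

Under these assumptions the warmup covers 500 of the 1800 steps, a little over a quarter of training, and the learning rate peaks at `2e-05` at step 500 before decaying to zero.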

## Full Model Architecture

```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: CamembertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
```

## Citing

@inproceedings{reimers-2019-sentence-bert,
  title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
  author = "Reimers, Nils and Gurevych, Iryna",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
  month = "11",
  year = "2019",
  publisher = "Association for Computational Linguistics",
  url = "https://arxiv.org/abs/1908.10084",
}

@inproceedings{martin2020camembert,
  title={CamemBERT: a Tasty French Language Model},
  author={Martin, Louis and Muller, Benjamin and Su{\'a}rez, Pedro Javier Ortiz and Dupont, Yoann and Romary, Laurent and de la Clergerie, {\'E}ric Villemonte and Seddah, Djam{\'e} and Sagot, Beno{\^\i}t},
  booktitle={Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics},
  journal={https://arxiv.org/abs/1911.03894},
  year={2020}
}