---
language:
- fr
license: apache-2.0
datasets:
- wikipedia
- OPUS
---

# French ModernBERT Base Cased 128k

Pretrained French language model using a masked language modeling (MLM) objective.

## Model description

ModernBERT is a transformers model pretrained on 2.3 billion French Wikipedia and OPUS tokens in a self-supervised fashion. This means it was pretrained on raw texts only, with no human labelling (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts.

This model has the following configuration:

- 768 embedding dimensions
- 22 hidden layers
- 1,152 hidden dimensions
- 12 attention heads
- 900M parameters
- 129k vocabulary size
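For a sense of where these parameters sit, the 129k-entry vocabulary alone accounts for roughly 99M embedding weights (vocabulary size times embedding dimension). A quick back-of-the-envelope check, using only the figures listed above:

```python
# Figures from the configuration list above
vocab_size = 129_000
embedding_dim = 768

# Token-embedding table: one embedding_dim-sized vector per vocabulary entry
embedding_params = vocab_size * embedding_dim
print(f"{embedding_params:,}")  # 99,072,000
```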

## Intended uses & limitations

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

Note that this model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification or question answering. For tasks such as text generation you should look at models like GPT.

### How to use

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cservan/french-modernbert-small")
model = AutoModel.from_pretrained("cservan/french-modernbert-small")

text = "Remplacez-moi par le texte que vous souhaitez."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)
```
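The model call above returns per-token hidden states; a common way to get a single sentence vector from them is attention-mask-aware mean pooling over `last_hidden_state`. A minimal sketch on a dummy tensor (the `mean_pool` helper is illustrative, not part of the transformers library):

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    # Zero out padded positions, then average over the real tokens only
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

# Dummy batch: 2 sequences, 5 tokens each, hidden size 768
hidden = torch.ones(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]])

pooled = mean_pool(hidden, mask)
print(pooled.shape)  # torch.Size([2, 768])
```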

## Training data

The ModernBERT model was pretrained on 2.3 billion tokens from [French Wikipedia](https://scouv.lisn.upsaclay.fr/#malbert) (excluding lists, tables and headers) and [OPUS](https://opus.nlpl.eu/).

## Training procedure

### Preprocessing

The texts are lowercased and tokenized using SentencePiece with a vocabulary size of 128,000 tokens, plus 1,000 unused tokens for downstream adaptation. The inputs of the model are then of the form:

```
[CLS] Sentence A [SEP] Sentence B [SEP]
```
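The template above can be sketched as a small helper that frames a pre-tokenized sentence pair with the special tokens (the function name and the tokens-as-strings representation are illustrative; in practice the tokenizer builds this automatically):

```python
def build_pair_input(sentence_a, sentence_b):
    # Frame two token lists with [CLS]/[SEP], matching the template above
    return ["[CLS]"] + sentence_a + ["[SEP]"] + sentence_b + ["[SEP]"]

tokens = build_pair_input(["Bonjour", "le", "monde"], ["Comment", "allez", "-vous"])
print(tokens)
```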

### Tools

The tools used to pre-train the model are available [here](https://github.com/AnswerDotAI/ModernBERT).