apapagi committed on
Commit 64f9a31 · verified · 1 Parent(s): 331c744

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ EUBERT.png filter=lfs diff=lfs merge=lfs -text
+ EUBERT_small.png filter=lfs diff=lfs merge=lfs -text
EUBERT.png ADDED

Git LFS Details

  • SHA256: fe31783e7398ee5646c785be08e76d26fefea1c107b5a56809651f06186e4f41
  • Pointer size: 131 Bytes
  • Size of remote file: 844 kB
EUBERT_small.png ADDED

Git LFS Details

  • SHA256: 4769911e8bb30e81240ec160ca632335d3f37f01566ef8e3634f9322f727caf9
  • Pointer size: 131 Bytes
  • Size of remote file: 161 kB
README.md CHANGED
@@ -1,3 +1,126 @@
- ---
- license: eupl-1.2
- ---
+ ---
+ license: eupl-1.2
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: EUBERT
+   results: []
+ language:
+ - bg
+ - cs
+ - da
+ - de
+ - el
+ - en
+ - es
+ - et
+ - fi
+ - fr
+ - ga
+ - hr
+ - hu
+ - it
+ - lt
+ - lv
+ - mt
+ - nl
+ - pl
+ - pt
+ - ro
+ - sk
+ - sl
+ - sv
+ widget:
+ - text: "The transition to a climate neutral, sustainable, energy and resource-efficient, circular and fair economy is key to ensuring the long-term competitiveness of the economy of the union and the well-being of its peoples. In 2016, the Union concluded the Paris Agreement2. Article 2(1), point (c), of the Paris Agreement sets out the objective of strengthening the response to climate change by, among other means, making finance flows consistent with a pathway towards low greenhouse gas [MASK] and climate resilient development."
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+
+ ## Model Card: EUBERT
+
+ ### Overview
+
+ - **Model Name**: EUBERT
+ - **Model Version**: 1.2
+ - **Date of Release**: 16 October 2023
+ - **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
+ - **Training Data**: Documents registered by the European Publications Office
+ - **Model Use Cases**: Text classification, question answering, language understanding
+
+ ![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png)
+
+
+ ### Model Description
+
+ EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the [European Publications Office](https://op.europa.eu/).
+ These documents span the last 30 years, providing a comprehensive dataset that covers a wide range of topics and domains.
+ EUBERT is designed as a versatile language model that can be fine-tuned for various natural language processing tasks,
+ making it a valuable resource for many applications.
+
+ ### Intended Use
+
+ EUBERT serves as a starting point for building more specific natural language understanding models.
+ Its versatility makes it suitable for a wide range of tasks, including but not limited to:
+
+ 1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into different categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection.
+
+ 2. **Question Answering**: Fine-tuned on question-answering datasets, EUBERT can extract answers from text documents, facilitating tasks like information retrieval and document summarization.
+
+ 3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition, part-of-speech tagging, and text generation.
+
+ ### Performance
+
+ The performance of EUBERT varies with the downstream task and with the quality and quantity of the data used for fine-tuning.
+ Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly.
+
+ ### Considerations
+
+ - **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.
+
+ - **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as on the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.
+
+ - **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate them when fine-tuning EUBERT for specific tasks.
+
+ ### Conclusion
+
+ EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding. Users should exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.
+
+ ---
+
+ ## Training procedure
+
+ A dedicated WordPiece tokenizer was trained, with a vocabulary size of 2**16 (65,536).
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 1.85
+
+ ### Framework versions
+
+ - Transformers 4.33.3
+ - Pytorch 2.0.1+cu117
+ - Datasets 2.14.5
+ - Tokenizers 0.13.3
+
+ ### Infrastructure
+
+ - **Hardware Type:** 4 × 24 GB GPUs
+ - **GPU Days:** 16
+ - **Cloud Provider:** EuroHPC
+ - **Compute Region:** Meluxina
+
+
+ # Authors
+
+ Sébastien Campion <sebastien.campion@europarl.europa.eu>
+
+ Andreas Papagiannis <andreas.papagiannis@europarl.europa.eu>
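
The `lr_scheduler_type: linear` entry above means the learning rate decays linearly from its initial 5e-05 to zero over the total number of optimizer steps. A minimal pure-Python sketch of that schedule; the step counts are illustrative (the actual total depends on dataset size, `train_batch_size: 32`, and `num_epochs: 1.85`), and zero warmup is assumed since the card lists no warmup steps:

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear schedule: ramp up over warmup_steps, then decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 10_000  # illustrative optimizer-step count
assert linear_lr(0, total) == 5e-05                          # starts at the configured peak
assert linear_lr(total, total) == 0.0                        # fully decayed at the end
assert abs(linear_lr(total // 2, total) - 2.5e-05) < 1e-12   # halfway point
```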
added_tokens.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "</s>": 65537,
+   "<mask>": 65540,
+   "<pad>": 65539,
+   "<s>": 65536,
+   "<unk>": 65538,
+   "[CLS]": 2,
+   "[MASK]": 4,
+   "[PAD]": 1,
+   "[SEP]": 3,
+   "[UNK]": 0
+ }
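
The `added_tokens.json` above mixes two special-token conventions: BERT-style tokens (`[CLS]`, `[SEP]`, …) at low ids, and RoBERTa-style tokens (`<s>`, `</s>`, …) appended directly after the 2**16-entry WordPiece vocabulary mentioned in the model card. A quick stdlib check of that layout (the JSON literal is copied from the file above):

```python
import json

added_tokens = json.loads("""
{
  "</s>": 65537, "<mask>": 65540, "<pad>": 65539, "<s>": 65536, "<unk>": 65538,
  "[CLS]": 2, "[MASK]": 4, "[PAD]": 1, "[SEP]": 3, "[UNK]": 0
}
""")

bert_style = {t: i for t, i in added_tokens.items() if t.startswith("[")}
roberta_style = {t: i for t, i in added_tokens.items() if t.startswith("<")}

# BERT-style tokens occupy ids 0-4; RoBERTa-style ids start at 2**16 = 65536,
# i.e. immediately after the 65,536-entry base vocabulary.
assert sorted(bert_style.values()) == [0, 1, 2, 3, 4]
assert min(roberta_style.values()) == 2**16
assert sorted(roberta_style.values()) == list(range(65536, 65541))
```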
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": [
+     "RobertaForCausalLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
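
The `config.json` above is enough to roughly estimate the model's parameter count. A pure-Python tally for a RoBERTa-style encoder with these dimensions; the LM-head terms are an assumption inferred from the `RobertaForCausalLM` architecture name, with tied word embeddings assumed:

```python
hidden, layers, inter = 768, 12, 3072
vocab, max_pos, type_vocab = 50265, 514, 1

# Embeddings: word + position + token-type tables, plus LayerNorm (weight + bias).
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

# Per encoder layer: Q/K/V/output projections, two LayerNorms, and the
# intermediate/output feed-forward pair (weights + biases throughout).
per_layer = (
    4 * (hidden * hidden + hidden)   # attention projections
    + 2 * 2 * hidden                 # two LayerNorms
    + hidden * inter + inter         # intermediate dense
    + inter * hidden + hidden        # output dense
)

# LM head (assumed structure): transform dense + LayerNorm + decoder bias.
lm_head = hidden * hidden + hidden + 2 * hidden + vocab

total = embeddings + layers * per_layer + lm_head
print(f"~{total / 1e6:.1f}M parameters")  # ~124.7M, consistent with the
                                          # ~499 MB float32 model.safetensors
```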
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "pad_token_id": 1,
+   "transformers_version": "4.52.3"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:127029283449b17c8ee6dadae6e21017558da56fcadaf63e36e222eff93df964
+ size 498813948
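
The three `+` lines above are the entire Git LFS pointer file stored in the repository; the actual ~499 MB weights live on the LFS server. The key/value format is the published git-lfs pointer spec; the small parsing helper below is illustrative:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:127029283449b17c8ee6dadae6e21017558da56fcadaf63e36e222eff93df964
size 498813948"""

info = parse_lfs_pointer(pointer)
assert info["version"] == "https://git-lfs.github.com/spec/v1"
assert info["oid"].startswith("sha256:")
assert int(info["size"]) == 498_813_948  # ~499 MB of float32 weights
```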
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35157d91606d022b90e23850a025295fc40baa789ce1d3e5ea2345d599be6703
+ size 375996725
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.txt ADDED
The diff for this file is too large to render. See raw diff