apapagi committed on
Commit 64f9a31 · verified · 1 Parent(s): 331c744

Upload folder using huggingface_hub
.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ EUBERT.png filter=lfs diff=lfs merge=lfs -text
+ EUBERT_small.png filter=lfs diff=lfs merge=lfs -text
EUBERT.png ADDED

Git LFS Details

  • SHA256: fe31783e7398ee5646c785be08e76d26fefea1c107b5a56809651f06186e4f41
  • Pointer size: 131 Bytes
  • Size of remote file: 844 kB
EUBERT_small.png ADDED

Git LFS Details

  • SHA256: 4769911e8bb30e81240ec160ca632335d3f37f01566ef8e3634f9322f727caf9
  • Pointer size: 131 Bytes
  • Size of remote file: 161 kB
README.md CHANGED
@@ -1,3 +1,126 @@
- ---
- license: eupl-1.2
- ---
+ ---
+ license: eupl-1.2
+ tags:
+ - generated_from_trainer
+ model-index:
+ - name: EUBERT
+   results: []
+ language:
+ - bg
+ - cs
+ - da
+ - de
+ - el
+ - en
+ - es
+ - et
+ - fi
+ - fr
+ - ga
+ - hr
+ - hu
+ - it
+ - lt
+ - lv
+ - mt
+ - nl
+ - pl
+ - pt
+ - ro
+ - sk
+ - sl
+ - sv
+ widget:
+ - text: "The transition to a climate neutral, sustainable, energy and resource-efficient, circular and fair economy is key to ensuring the long-term competitiveness of the economy of the union and the well-being of its peoples. In 2016, the Union concluded the Paris Agreement2. Article 2(1), point (c), of the Paris Agreement sets out the objective of strengthening the response to climate change by, among other means, making finance flows consistent with a pathway towards low greenhouse gas [MASK] and climate resilient development."
+ ---
+
+ <!-- This model card has been generated automatically according to the information the Trainer had access to. You
+ should probably proofread and complete it, then remove this comment. -->
+
+
+ ## Model Card: EUBERT
+
+ ### Overview
+
+ - **Model Name**: EUBERT
+ - **Model Version**: 1.2
+ - **Date of Release**: 16 October 2023
+ - **Model Architecture**: BERT (Bidirectional Encoder Representations from Transformers)
+ - **Training Data**: Documents registered by the European Publications Office
+ - **Model Use Cases**: Text classification, question answering, language understanding
+
+ ![EUBERT](https://huggingface.co/EuropeanParliament/EUBERT/resolve/main/EUBERT_small.png)
+
+
+ ### Model Description
+
+ EUBERT is a pretrained, uncased BERT model trained on a large corpus of documents registered by the [European Publications Office](https://op.europa.eu/).
+ These documents span the last 30 years, providing a comprehensive dataset that covers a wide range of topics and domains.
+ EUBERT is designed as a versatile language model that can be fine-tuned for various natural language processing tasks,
+ making it a valuable resource for many applications.
+
+ ### Intended Use
+
+ EUBERT serves as a starting point for building more specific natural language understanding models.
+ Its versatility makes it suitable for a wide range of tasks, including but not limited to:
+
+ 1. **Text Classification**: EUBERT can be fine-tuned to classify text documents into different categories, making it useful for applications such as sentiment analysis, topic categorization, and spam detection.
+
+ 2. **Question Answering**: Fine-tuned on question-answering datasets, EUBERT can extract answers from text documents, facilitating tasks like information retrieval and document summarization.
+
+ 3. **Language Understanding**: EUBERT can be employed for general language understanding tasks, including named entity recognition, part-of-speech tagging, and text generation.
+
+ ### Performance
+
+ The performance of EUBERT varies with the downstream task and with the quality and quantity of the data used for fine-tuning.
+ Users are encouraged to fine-tune the model on their specific task and evaluate its performance accordingly.
+
+ ### Considerations
+
+ - **Data Privacy and Compliance**: Users should ensure that the use of EUBERT complies with all relevant data privacy and compliance regulations, especially when working with sensitive or personally identifiable information.
+
+ - **Fine-Tuning**: The effectiveness of EUBERT on a given task depends on the quality and quantity of the training data, as well as on the fine-tuning process. Careful experimentation and evaluation are essential to achieve optimal results.
+
+ - **Bias and Fairness**: Users should be aware of potential biases in the training data and take appropriate measures to mitigate them when fine-tuning EUBERT for specific tasks.
+
+ ### Conclusion
+
+ EUBERT is a pretrained BERT model that leverages a substantial corpus of documents from the European Publications Office. It offers a versatile foundation for natural language processing solutions across a wide range of applications, enabling researchers and developers to create custom models for text classification, question answering, and language understanding. Users should exercise diligence in fine-tuning and evaluating the model for their specific use cases while adhering to data privacy and fairness considerations.
+
+ ---
+
+ ## Training procedure
+
+ A dedicated WordPiece tokenizer was trained, with a vocabulary size of 2**16 (65,536).
+
+ ### Training hyperparameters
+
+ The following hyperparameters were used during training:
+ - learning_rate: 5e-05
+ - train_batch_size: 32
+ - eval_batch_size: 32
+ - seed: 42
+ - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+ - lr_scheduler_type: linear
+ - num_epochs: 1.85
+
+ ### Framework versions
+
+ - Transformers 4.33.3
+ - Pytorch 2.0.1+cu117
+ - Datasets 2.14.5
+ - Tokenizers 0.13.3
+
+ ### Infrastructure
+
+ - **Hardware Type:** 4 × 24 GB GPUs
+ - **GPU Days:** 16
+ - **Cloud Provider:** EuroHPC
+ - **Compute Region:** Meluxina
+
+
+ # Authors
+
+ Sébastien Campion <sebastien.campion@europarl.europa.eu>
+
+ Andreas Papagiannis <andreas.papagiannis@europarl.europa.eu>
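
The `lr_scheduler_type: linear` entry above means the learning rate decays linearly from its initial 5e-05 to zero over the total number of optimizer steps. A minimal pure-Python sketch of that schedule; the step counts are illustrative (the actual total depends on dataset size, `train_batch_size: 32`, and `num_epochs: 1.85`), and zero warmup is assumed since the card lists no warmup steps:

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear schedule: ramp up over warmup_steps, then decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total = 10_000  # illustrative optimizer-step count
assert linear_lr(0, total) == 5e-05                          # starts at the configured peak
assert linear_lr(total, total) == 0.0                        # fully decayed at the end
assert abs(linear_lr(total // 2, total) - 2.5e-05) < 1e-12   # halfway point
```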
added_tokens.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "</s>": 65537,
+   "<mask>": 65540,
+   "<pad>": 65539,
+   "<s>": 65536,
+   "<unk>": 65538,
+   "[CLS]": 2,
+   "[MASK]": 4,
+   "[PAD]": 1,
+   "[SEP]": 3,
+   "[UNK]": 0
+ }
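
The `added_tokens.json` above mixes two special-token conventions: BERT-style tokens (`[CLS]`, `[SEP]`, …) at low ids, and RoBERTa-style tokens (`<s>`, `</s>`, …) appended directly after the 2**16-entry WordPiece vocabulary mentioned in the model card. A quick stdlib check of that layout (the JSON literal is copied from the file above):

```python
import json

added_tokens = json.loads("""
{
  "</s>": 65537, "<mask>": 65540, "<pad>": 65539, "<s>": 65536, "<unk>": 65538,
  "[CLS]": 2, "[MASK]": 4, "[PAD]": 1, "[SEP]": 3, "[UNK]": 0
}
""")

bert_style = {t: i for t, i in added_tokens.items() if t.startswith("[")}
roberta_style = {t: i for t, i in added_tokens.items() if t.startswith("<")}

# BERT-style tokens occupy ids 0-4; RoBERTa-style ids start at 2**16 = 65536,
# i.e. immediately after the 65,536-entry base vocabulary.
assert sorted(bert_style.values()) == [0, 1, 2, 3, 4]
assert min(roberta_style.values()) == 2**16
assert sorted(roberta_style.values()) == list(range(65536, 65541))
```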
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": [
+     "RobertaForCausalLM"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.52.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
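
The `config.json` above is enough to roughly estimate the model's parameter count. A pure-Python tally for a RoBERTa-style encoder with these dimensions; the LM-head terms are an assumption inferred from the `RobertaForCausalLM` architecture name, with tied word embeddings assumed:

```python
hidden, layers, inter = 768, 12, 3072
vocab, max_pos, type_vocab = 50265, 514, 1

# Embeddings: word + position + token-type tables, plus LayerNorm (weight + bias).
embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden

# Per encoder layer: Q/K/V/output projections, two LayerNorms, and the
# intermediate/output feed-forward pair (weights + biases throughout).
per_layer = (
    4 * (hidden * hidden + hidden)   # attention projections
    + 2 * 2 * hidden                 # two LayerNorms
    + hidden * inter + inter         # intermediate dense
    + inter * hidden + hidden        # output dense
)

# LM head (assumed structure): transform dense + LayerNorm + decoder bias.
lm_head = hidden * hidden + hidden + 2 * hidden + vocab

total = embeddings + layers * per_layer + lm_head
print(f"~{total / 1e6:.1f}M parameters")  # ~124.7M, consistent with the
                                          # ~499 MB float32 model.safetensors
```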
generation_config.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "_from_model_config": true,
+   "bos_token_id": 0,
+   "eos_token_id": 2,
+   "pad_token_id": 1,
+   "transformers_version": "4.52.3"
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:127029283449b17c8ee6dadae6e21017558da56fcadaf63e36e222eff93df964
+ size 498813948
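
The three `+` lines above are the entire Git LFS pointer file stored in the repository; the actual ~499 MB weights live on the LFS server. The key/value format is the published git-lfs pointer spec; the small parsing helper below is illustrative:

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:127029283449b17c8ee6dadae6e21017558da56fcadaf63e36e222eff93df964
size 498813948"""

info = parse_lfs_pointer(pointer)
assert info["version"] == "https://git-lfs.github.com/spec/v1"
assert info["oid"].startswith("sha256:")
assert int(info["size"]) == 498_813_948  # ~499 MB of float32 weights
```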
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:35157d91606d022b90e23850a025295fc40baa789ce1d3e5ea2345d599be6703
+ size 375996725
special_tokens_map.json ADDED
@@ -0,0 +1,15 @@
+ {
+   "bos_token": "<s>",
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "unk_token": "<unk>"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "unk_token": "<unk>"
+ }
vocab.json ADDED
The diff for this file is too large to render. See raw diff
 
vocab.txt ADDED
The diff for this file is too large to render. See raw diff