alexaapo committed (verified)
Commit 122c98b · 1 Parent(s): ee06d1b

Update README.md

Files changed (1): README.md (+148, -145)

README.md CHANGED

*(Removed: the stock auto-generated 🤗 Transformers model card template, with every field left as "[More Information Needed]". The updated card follows.)*

---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- electra
- fill-mask
- greek
- legal
- discriminator
- generator
base_model:
- google/electra-base-discriminator
---

# Themida-ELECTRA v2: A Greek Legal Language Model

## Model Description

**Themida-ELECTRA v2** is an improved ELECTRA-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This second version incorporates refined training hyperparameters for enhanced performance and stability. It is designed for understanding the complex vocabulary and context of the legal domain in Greece and the EU.

This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The ELECTRA architecture provides more efficient pre-training compared to masked language models like BERT by using a generator-discriminator approach.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-electra-legal-17G-8-gpu-v2",
    tokenizer="novelcore/themida-electra-legal-17G-8-gpu-v2"
)

# Example from a legal context
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)

# Example output:
[{'score': 0.20120874047279358,
  'token': 4014,
  'token_str': ' ειπε',
  'sequence': ' ο κ . μητσοτακης ειπε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.19406235218048096,
  'token': 12702,
  'token_str': ' δηλωσε',
  'sequence': ' ο κ . μητσοτακης δηλωσε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.18023167550563812,
  'token': 11151,
  'token_str': ' δηλωνει',
  'sequence': ' ο κ . μητσοτακης δηλωνει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.08440685272216797,
  'token': 8534,
  'token_str': ' υποστηριζει',
  'sequence': ' ο κ . μητσοτακης υποστηριζει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.05247046798467636,
  'token': 3523,
  'token_str': ' λεει',
  'sequence': ' ο κ . μητσοτακης λεει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'}]
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
```
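
The same checkpoint can also back a token-classification head for NER, one of the primary intended uses. The snippet below is a minimal sketch; `num_labels` is an illustrative placeholder that depends on your own annotated dataset:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# For NER on Greek legal text (sketch): attach a token-classification head.
# num_labels is a hypothetical value; use the tag set of your own NER corpus.
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/themida-electra-legal-17G-8-gpu-v2",
    num_labels=9,
)
```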
 
## Training Data

The model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated to ensure high quality and diversity before training.

The composition of the training corpus is as follows:

| Corpus Source | Size (GB) | Context |
| :--- | :--- | :--- |
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
| Political Reports of the Supreme Court | 1.2 | Legal |
| Eur-Lex (Greek Content) | 0.92 | Legal |
| Europarl (Greek Content) | 0.38 | Legal / Parliamentary |
| Raptarchis Legal Dictionary | 0.35 | Legal |
| **Total** | **~16.75** | |

## Training Procedure

### Model Architecture

The model uses the ELECTRA architecture with the following configuration (see the configuration sketch after the list):

- **Discriminator Hidden Size**: 768
- **Discriminator Attention Heads**: 12
- **Discriminator Hidden Layers**: 12
- **Generator Size Fraction**: 0.25 (192 hidden size generator)
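
For reference, a configuration along these lines can be expressed with `ElectraConfig`. This is an illustrative sketch only: the card specifies the hidden size, heads, layers, and generator fraction, while the embedding and intermediate sizes below are assumptions borrowed from the standard ELECTRA-base recipe.

```python
from transformers import ElectraConfig

# Discriminator configuration (sketch of the sizes listed above).
config = ElectraConfig(
    vocab_size=50264,            # custom ByteLevelBPE vocabulary (see Preprocessing)
    embedding_size=768,          # assumption: standard ELECTRA-base value
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    intermediate_size=3072,      # assumption: standard ELECTRA-base value
    max_position_embeddings=512,
)
# The generator used during pre-training is 0.25x this size (hidden size 192).
```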
 
### Preprocessing

The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.

The data was then processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence.
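
A minimal sketch of how such a tokenizer can be trained with the `tokenizers` library is shown below; the corpus file paths, `min_frequency`, and special-token list are illustrative assumptions rather than the exact setup used.

```python
from tokenizers import ByteLevelBPETokenizer

# Hypothetical list of plain-text corpus files (see the Training Data table).
corpus_files = ["corpus/fek.txt", "corpus/parliament.txt", "corpus/eurlex.txt"]

# Uncased byte-level BPE tokenizer with a 50,264-token vocabulary.
tokenizer = ByteLevelBPETokenizer(lowercase=True)
tokenizer.train(
    files=corpus_files,
    vocab_size=50264,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("themida-tokenizer")
```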
 
### Pre-training

The model was pre-trained from scratch for **200,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. This second version incorporates improved hyperparameters for enhanced convergence and performance.

The key hyperparameters used were (see the `TrainingArguments` sketch after the list):

- **Learning Rate**: 1e-4 (0.0001) with a linear warmup of 12,000 steps
- **Batch Size**: Effective batch size of 3,840 (`per_device_train_batch_size: 60`, `gradient_accumulation_steps: 8`)
- **Optimizer**: AdamW with `beta1=0.9`, `beta2=0.98`, `epsilon=1e-6`
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 200,000
- **Warmup Steps**: 12,000
- **Generator Loss Weight**: 50.0
- **Discriminator Loss Weight**: 50.0
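
Expressed as 🤗 `TrainingArguments`, this setup corresponds roughly to the sketch below (illustrative only; `output_dir` is a placeholder, and the generator/discriminator loss weights live in the ELECTRA pre-training objective rather than in `TrainingArguments`):

```python
from transformers import TrainingArguments

# Sketch of the pre-training hyperparameters listed above.
# Effective batch size: 60 per device x 8 GPUs x 8 accumulation steps = 3,840.
training_args = TrainingArguments(
    output_dir="themida-electra-v2",   # placeholder
    max_steps=200_000,
    per_device_train_batch_size=60,
    gradient_accumulation_steps=8,
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,                         # BFloat16 mixed precision
)
```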
 
### Training Results

The model achieved the following performance metrics:

- **Final Training Loss**: 0.0056
- **Final Evaluation Loss**: 0.0054
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 200,000

### Improvements in v2

This second version incorporates the following improvements over the initial model:

- **Reduced Learning Rate**: Lowered from 8e-4 to 1e-4 for more stable convergence
- **Extended Training**: Increased from 120,000 to 200,000 steps for better performance
- **Enhanced Warmup**: Extended warmup period from 6,000 to 12,000 steps for smoother training initialization

## Evaluation Results

The model's performance was evaluated by fine-tuning it on downstream Named Entity Recognition (NER) tasks and comparing it against other legal language models on strict, entity-level F1 (see the note after the table).

*This section should be filled in with specific results once available. For example:*

| Model | NER F1-score (strict) |
| :--- | :--- |
| `AI-team-UoA/GreekLegalRoBERTa_v3` | `[F1-Score for Baseline]` |
| `Themida-ELECTRA v1` | `[F1-Score for v1]` |
| `Themida-ELECTRA v2` (this model) | `[F1-Score for v2]` |
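
As a note on the metric, strict entity-level F1 can be computed with the `seqeval` package; the sketch below uses dummy tag sequences purely to illustrate the call, not actual evaluation data.

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Dummy gold and predicted IOB2 sequences, for illustration only.
y_true = [["B-ORG", "I-ORG", "O", "B-PER", "O"]]
y_pred = [["B-ORG", "I-ORG", "O", "O", "O"]]

# Strict mode: an entity counts only if both its span and its type match exactly.
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```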
 
## Intended Uses

### Primary Use Cases

- Legal document analysis and classification
- Named entity recognition in Greek legal texts
- Legal question answering systems
- Compliance monitoring and regulatory analysis
- Legal text similarity and retrieval

### Secondary Use Cases

- General Greek text understanding (with potential performance degradation)
- Legal document summarization
- Contract analysis and review

## Limitations and Bias

- The model may reflect biases present in Greek legal and governmental texts
- Performance may degrade on informal or colloquial Greek text
- Knowledge of legal developments is limited to the training data's cutoff
- Optimized specifically for the Greek legal domain; may not generalize well to other domains
- The ELECTRA architecture may require different fine-tuning approaches compared to BERT-like models

## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_electra_v2,
  author       = {[Your Name/Organization]},
  title        = {Themida-ELECTRA v2: A Greek Legal Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/[Your Username]/[Your Model Name]}},
}
```

## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain.