alexaapo committed
Commit 3aa1bd7 · verified · 1 parent: 122c98b

Update README.md

Files changed (1)
  1. README.md +7 -95
README.md CHANGED
@@ -15,11 +15,11 @@ base_model:
  - google/electra-base-discriminator
  ---
 
- # Themida-ELECTRA v2: A Greek Legal Language Model
+ # GEM-ELECTRA Legal: A Greek Legal Language Model
 
  ## Model Description
 
- **Themida-ELECTRA v2** is an improved ELECTRA-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This second version incorporates refined training hyperparameters for enhanced performance and stability. It is designed for understanding the complex vocabulary and context of the legal domain in Greece and the EU.
+ **GEM-ELECTRA Legal** is an improved ELECTRA-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This second version incorporates refined training hyperparameters for enhanced performance and stability. It is designed for understanding the complex vocabulary and context of the legal domain in Greece and the EU.
 
  This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The ELECTRA architecture provides more efficient pre-training compared to masked language models like BERT by using a generator-discriminator approach.
 
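The generator-discriminator approach named in the description means the published checkpoint is a discriminator, trained to flag tokens that a small generator swapped in. A minimal sketch of querying that replaced-token-detection head, assuming the model id introduced in this commit loads under transformers' `ElectraForPreTraining` class (not confirmed by the card itself):

```python
# Sketch only: assumes novelcore/gem-electra-legal (the id introduced in this
# commit) exposes ELECTRA discriminator weights compatible with this head.
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
model = ElectraForPreTraining.from_pretrained("novelcore/gem-electra-legal")

# "The government respects the decisions of the Council of State."
text = "Η κυβέρνηση σέβεται τις αποφάσεις του Συμβουλίου της Επικρατείας."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # one score per input token

# Positive logits mean the discriminator judges the token to have been
# replaced by the generator; negative logits mean it looks original.
flags = (logits > 0).squeeze().tolist()
print(list(zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), flags)))
```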
@@ -33,8 +33,8 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
      "fill-mask",
-     model="novelcore/themida-electra-legal-17G-8-gpu-v2",
-     tokenizer="novelcore/themida-electra-legal-17G-8-gpu-v2"
+     model="novelcore/gem-electra-legal",
+     tokenizer="novelcore/gem-electra-legal"
  )
 
  # Example from a legal context
@@ -43,29 +43,6 @@ text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβε
  # Get predictions
  predictions = fill_mask(text)
  print(predictions)
-
- # Get predictions
- predictions = fill_mask(text)
- [{'score': 0.20120874047279358,
-   'token': 4014,
-   'token_str': ' ειπε',
-   'sequence': ' ο κ . μητσοτακης ειπε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
-  {'score': 0.19406235218048096,
-   'token': 12702,
-   'token_str': ' δηλωσε',
-   'sequence': ' ο κ . μητσοτακης δηλωσε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
-  {'score': 0.18023167550563812,
-   'token': 11151,
-   'token_str': ' δηλωνει',
-   'sequence': ' ο κ . μητσοτακης δηλωνει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
-  {'score': 0.08440685272216797,
-   'token': 8534,
-   'token_str': ' υποστηριζει',
-   'sequence': ' ο κ . μητσοτακης υποστηριζει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
-  {'score': 0.05247046798467636,
-   'token': 3523,
-   'token_str': ' λεει',
-   'sequence': ' ο κ . μητσοτακης λεει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'}]
  ```
 
  For downstream tasks:
@@ -74,8 +51,8 @@ For downstream tasks:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
  # For legal document classification
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-electra-legal")
  ```
 
  ## Training Data
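The card names NER as a primary downstream task, while the hunk above only covers sequence classification. A hedged companion sketch for the token-classification setup; the label set below is a purely hypothetical placeholder, not something the card specifies:

```python
# Illustrative fine-tuning setup for Greek legal NER; the label names are
# hypothetical placeholders and do not come from the model card.
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-ORG", "I-ORG", "B-LEGAL_REF", "I-LEGAL_REF"]

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-electra-legal")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/gem-electra-legal",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```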
@@ -134,69 +111,4 @@ The model achieved the following performance metrics:
  - **Final Training Loss**: 0.0056
  - **Final Evaluation Loss**: 0.0054
  - **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- - **Total Training Steps**: 200,000
-
- ### Improvements in v2
-
- This second version incorporates the following improvements over the initial model:
-
- - **Reduced Learning Rate**: Lowered from 8e-4 to 1e-4 for more stable convergence
- - **Extended Training**: Increased from 120,000 to 200,000 steps for better performance
- - **Enhanced Warmup**: Extended warmup period from 6,000 to 12,000 steps for smoother training initialization
-
- ## Evaluation Results
-
- The model's performance was evaluated by fine-tuning it on downstream Named Entity Recognition (NER) tasks and comparing it against other legal language models.
-
- *This section should be filled with your specific results. For example:*
-
- | Model | NER F1-score (strict) |
- | :--- | :--- |
- | `AI-team-UoA/GreekLegalRoBERTa_v3` | `[F1-Score for Baseline]` |
- | `Themida-ELECTRA v1` | `[F1-Score for v1]` |
- | `Themida-ELECTRA v2` (this model) | `[F1-Score for v2]` |
-
- ## Intended Uses
-
- ### Primary Use Cases
- - Legal document analysis and classification
- - Named entity recognition in Greek legal texts
- - Legal question answering systems
- - Compliance monitoring and regulatory analysis
- - Legal text similarity and retrieval
-
- ### Secondary Use Cases
- - General Greek text understanding (with potential performance degradation)
- - Legal document summarization
- - Contract analysis and review
-
- ## Limitations and Bias
-
- - The model may reflect biases present in Greek legal and governmental texts
- - Performance may degrade on informal or colloquial Greek text
- - Limited knowledge of legal concepts post-training data cutoff
- - Optimized specifically for Greek legal domain; may not generalize well to other domains
- - The ELECTRA architecture may require different fine-tuning approaches compared to BERT-like models
-
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_electra_v2,
-   author = {[Your Name/Organization]},
-   title = {Themida-ELECTRA v2: A Greek Legal Language Model},
-   year = {2025},
-   publisher = {Hugging Face},
-   journal = {Hugging Face Hub},
-   howpublished = {\url{https://huggingface.co/[Your Username]/[Your Model Name]}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain.
+ - **Total Training Steps**: 200,000
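The removed "Improvements in v2" section above documents the run's key hyperparameters (learning rate 1e-4, 200,000 steps, 12,000 warmup steps). The actual pre-training framework is not shown in this commit, so purely as orientation, here is how those three values would map onto transformers' `TrainingArguments`; every other setting in the sketch is an assumption:

```python
# Orientation only: maps the v2 hyperparameters from the removed section onto
# TrainingArguments. Output dir and batch size are hypothetical; the original
# run's framework and remaining settings are not documented in this commit.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="electra-legal-v2",    # hypothetical path
    learning_rate=1e-4,               # v2: lowered from 8e-4
    max_steps=200_000,                # v2: extended from 120,000 steps
    warmup_steps=12_000,              # v2: extended from 6,000 steps
    per_device_train_batch_size=32,   # hypothetical, not stated in the card
)
```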
 