---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- electra
- fill-mask
- greek
- legal
- discriminator
- generator
base_model:
- google/electra-base-discriminator
---

# Themida-ELECTRA v2: A Greek Legal Language Model

## Model Description

**Themida-ELECTRA v2** is an improved ELECTRA-base model pre-trained from scratch on a large, 17GB corpus of Greek legal, parliamentary, and governmental text. This second version incorporates refined training hyperparameters for better performance and stability, and it is designed to handle the specialized vocabulary and context of the legal domain in Greece and the EU.

This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal field. The ELECTRA architecture pre-trains more efficiently than masked language models like BERT: a small generator corrupts the input and the discriminator learns to detect which tokens were replaced, so every token position contributes to the training signal.
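
As a hedged illustration of that discriminator objective (assuming this repository ships discriminator weights; `ElectraForPreTraining` is the generic `transformers` class for replaced-token detection, not a script released with this model):

```python
import torch
from transformers import AutoTokenizer, ElectraForPreTraining

model_id = "novelcore/themida-electra-legal-17G-8-gpu-v2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = ElectraForPreTraining.from_pretrained(model_id)

# Lowercased, accent-free Greek input ("the law was passed by the parliament"),
# matching the normalization visible in the card's example output
inputs = tokenizer("ο νομος ψηφιστηκε απο τη βουλη", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # one score per token

# Probabilities above 0.5 flag tokens the discriminator believes were replaced
print(torch.sigmoid(logits).round())
```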

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the model and its tokenizer from the Hub
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-electra-legal-17G-8-gpu-v2",
    tokenizer="novelcore/themida-electra-legal-17G-8-gpu-v2"
)

# Example from a legal context:
# "Mr. Mitsotakis <mask> that the government fully respects
#  the decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get predictions
predictions = fill_mask(text)
print(predictions)
```

Expected output (the top predictions are verbs meaning "said", "stated", "states", "maintains", and "says"):

```python
[{'score': 0.20120874047279358,
  'token': 4014,
  'token_str': ' ειπε',
  'sequence': ' ο κ . μητσοτακης ειπε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.19406235218048096,
  'token': 12702,
  'token_str': ' δηλωσε',
  'sequence': ' ο κ . μητσοτακης δηλωσε οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.18023167550563812,
  'token': 11151,
  'token_str': ' δηλωνει',
  'sequence': ' ο κ . μητσοτακης δηλωνει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.08440685272216797,
  'token': 8534,
  'token_str': ' υποστηριζει',
  'sequence': ' ο κ . μητσοτακης υποστηριζει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'},
 {'score': 0.05247046798467636,
  'token': 3523,
  'token_str': ' λεει',
  'sequence': ' ο κ . μητσοτακης λεει οτι η κυβερνηση σεβεται πληρως τις αποφασεις του συμβουλιου της επικρατειας .'}]
```

For downstream tasks, load the encoder with the appropriate task head:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification (requires fine-tuning on labeled data)
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
```
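
For NER, the task this card emphasizes, a token-classification head can be attached the same way; the label set below is purely illustrative, not one shipped with the model:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Hypothetical label set for illustration; use your own NER dataset's labels
labels = ["O", "B-ORG", "I-ORG", "B-LEG_REF", "I-LEG_REF"]

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/themida-electra-legal-17G-8-gpu-v2",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```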

## Training Data

The model was pre-trained on a comprehensive 17GB corpus of Greek text compiled from various legal and governmental sources. The corpus was carefully cleaned, UTF-8 encoded, and deduplicated before training to ensure high quality and diversity.

The composition of the training corpus is as follows:

| Corpus Source | Size (GB) | Context |
| :--- | :--- | :--- |
| FEK - Greek Government Gazette (all issues) | 11.0 | Legal |
| Greek Parliament Proceedings | 2.9 | Legal / Parliamentary |
| Political Reports of the Supreme Court | 1.2 | Legal |
| Eur-Lex (Greek Content) | 0.92 | Legal |
| Europarl (Greek Content) | 0.38 | Legal / Parliamentary |
| Raptarchis Legal Dictionary | 0.35 | Legal |
| **Total** | **~16.75** | |

## Training Procedure

### Model Architecture

The model uses the ELECTRA architecture with the following configuration (see the `ElectraConfig` sketch after the list):

- **Discriminator Hidden Size**: 768
- **Discriminator Attention Heads**: 12
- **Discriminator Hidden Layers**: 12
- **Generator Size Fraction**: 0.25 (generator hidden size of 192)
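
A minimal sketch of how this maps onto `transformers.ElectraConfig`; the embedding size, FFN widths, and generator head count are standard ELECTRA-base conventions assumed here, not values stated in this card:

```python
from transformers import ElectraConfig

VOCAB_SIZE = 50264  # tokenizer vocabulary, per the Preprocessing section

# Discriminator: ELECTRA-base geometry with the card's vocabulary
discriminator_config = ElectraConfig(
    vocab_size=VOCAB_SIZE,
    embedding_size=768,
    hidden_size=768,
    num_attention_heads=12,
    num_hidden_layers=12,
    intermediate_size=3072,
)

# Generator: width scaled by the 0.25 size fraction (768 -> 192);
# heads and FFN width scaled proportionally (an assumption)
generator_config = ElectraConfig(
    vocab_size=VOCAB_SIZE,
    embedding_size=768,  # ELECTRA ties generator/discriminator embeddings
    hidden_size=192,
    num_attention_heads=3,
    num_hidden_layers=12,
    intermediate_size=768,
)
```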

### Preprocessing

The text was tokenized using a custom `ByteLevelBPE` tokenizer trained from scratch on the Greek legal corpus. The tokenizer is uncased (it does not distinguish between upper and lower case) and uses a vocabulary of 50,264 tokens.
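
You can inspect the tokenizer directly; the example output earlier in this card suggests the normalization also strips accents, so the expectation in the comment below is an inference from that output, not a verified transcript:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-electra-legal-17G-8-gpu-v2")
print(tokenizer.vocab_size)  # 50264, per the card

# Uncased: differently-cased inputs should map to the same pieces
print(tokenizer.tokenize("Συμβούλιο της Επικρατείας"))  # "Council of State"
print(tokenizer.tokenize("συμβουλιο της επικρατειας"))
```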

The data was then processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence.
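
A sketch of the described chunking, assuming each document is tokenized and split independently (the authors' exact pipeline is not published here):

```python
def chunk_document(token_ids, chunk_size=512):
    """Split one document's token ids into fixed-size chunks.

    Documents are chunked independently, so no chunk crosses a
    document boundary and local context stays coherent.
    """
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

# Usage: chunks = [c for doc in tokenized_docs for c in chunk_document(doc)]
```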

### Pre-training

The model was pre-trained from scratch for **200,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. This second version uses improved hyperparameters for better convergence and performance.

The key hyperparameters were as follows (mirrored in the `TrainingArguments` sketch after the list):
- **Learning Rate**: 1e-4 with a linear warmup of 12,000 steps
- **Batch Size**: effective batch size of 3,840 (`per_device_train_batch_size: 60` x `gradient_accumulation_steps: 8` x 8 GPUs)
- **Optimizer**: AdamW with `beta1=0.9`, `beta2=0.98`, `epsilon=1e-6`
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 200,000
- **Warmup Steps**: 12,000
- **Generator Loss Weight**: 50.0
- **Discriminator Loss Weight**: 50.0
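
A hedged sketch of how these values map onto `transformers.TrainingArguments` (ELECTRA pre-training itself needs a custom generator/discriminator training loop, which is not shown; `output_dir` is a placeholder):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="themida-electra-v2",  # placeholder path
    learning_rate=1e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    max_steps=200_000,
    per_device_train_batch_size=60,   # x 8 GPUs x 8 accumulation steps = 3,840
    gradient_accumulation_steps=8,
    weight_decay=0.01,
    adam_beta1=0.9,
    adam_beta2=0.98,
    adam_epsilon=1e-6,
    bf16=True,                        # BFloat16 mixed precision
)
```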

### Training Results

The model achieved the following metrics:

- **Final Training Loss**: 0.0056
- **Final Evaluation Loss**: 0.0054
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 200,000

### Improvements in v2

This second version incorporates the following improvements over the initial model:

- **Reduced Learning Rate**: lowered from 8e-4 to 1e-4 for more stable convergence
- **Extended Training**: increased from 120,000 to 200,000 steps for better final performance
- **Longer Warmup**: extended from 6,000 to 12,000 steps for a smoother start to training

## Evaluation Results

The model's performance was evaluated by fine-tuning it on downstream Named Entity Recognition (NER) tasks and comparing it against other legal language models.

*This section should be filled with your specific results. For example:*

| Model | NER F1-score (strict) |
| :--- | :--- |
| `AI-team-UoA/GreekLegalRoBERTa_v3` | `[F1-Score for Baseline]` |
| `Themida-ELECTRA v1` | `[F1-Score for v1]` |
| `Themida-ELECTRA v2` (this model) | `[F1-Score for v2]` |
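
Strict span-level F1, as used in the table above, can be computed with the `seqeval` library; a minimal sketch on toy IOB2 sequences:

```python
from seqeval.metrics import f1_score
from seqeval.scheme import IOB2

# Toy gold and predicted tag sequences, one list per sentence
y_true = [["O", "B-ORG", "I-ORG", "O", "B-LEG_REF"]]
y_pred = [["O", "B-ORG", "I-ORG", "O", "O"]]

# mode="strict" counts a span as correct only on an exact type+boundary match
print(f1_score(y_true, y_pred, mode="strict", scheme=IOB2))
```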

## Intended Uses

### Primary Use Cases

- Legal document analysis and classification
- Named entity recognition in Greek legal texts
- Legal question answering systems
- Compliance monitoring and regulatory analysis
- Legal text similarity and retrieval

### Secondary Use Cases

- General Greek text understanding (with potential performance degradation)
- Legal document summarization
- Contract analysis and review

## Limitations and Bias

- The model may reflect biases present in Greek legal and governmental texts
- Performance may degrade on informal or colloquial Greek text
- The model has no knowledge of legal developments after its training data cutoff
- It is optimized for the Greek legal domain and may not generalize well to other domains
- The ELECTRA architecture may require different fine-tuning approaches than BERT-like models

## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_electra_v2,
  author       = {[Your Name/Organization]},
  title        = {Themida-ELECTRA v2: A Greek Legal Language Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/[Your Username]/[Your Model Name]}},
}
```

## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain.