alexaapo committed
Commit 7687f52 · verified · 1 Parent(s): 2f30ab4

Update README.md

Files changed (1)
  1. README.md +7 -106
README.md CHANGED
@@ -15,11 +15,11 @@ base_model:
  - roberta-base
  ---

- # Themida-RoBERTa Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition
+ # GEM-RoBERTa HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition

  ## Model Description

- **Themida-RoBERTa Legal 21G** is a RoBERTa-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.
+ **GEM-RoBERTa HQ Legal** is a RoBERTa-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.

  This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The RoBERTa architecture provides enhanced performance through dynamic masking and removal of the Next Sentence Prediction (NSP) task, focusing entirely on Masked Language Modeling (MLM).

@@ -33,8 +33,8 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
  "fill-mask",
- model="novelcore/themida-roberta-legal-21G-8-gpu",
- tokenizer="novelcore/themida-roberta-legal-21G-8-gpu"
+ model="novelcore/gem-roberta-hq-legal",
+ tokenizer="novelcore/gem-roberta-hq-legal"
  )

  # Example from a legal context
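Since the hunk above is cut off before the README's actual masked-sentence example, here is a self-contained sketch of the same fill-mask usage for orientation; the Greek example sentence and the `top_k` value are illustrative assumptions, not taken from the model card.

```python
# Illustrative only: a self-contained version of the fill-mask usage shown in the diff.
# The Greek sentence below ("The court issued its <mask>.") is an assumed example.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-hq-legal",
    tokenizer="novelcore/gem-roberta-hq-legal",
)

text = "Το δικαστήριο εξέδωσε την <mask> του."  # hypothetical legal-domain sentence
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```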
@@ -51,8 +51,8 @@ For downstream tasks:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # For legal document classification
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-roberta-legal-21G-8-gpu")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-roberta-legal-21G-8-gpu")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-hq-legal")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-hq-legal")
  ```

  ## Training Data
@@ -125,103 +125,4 @@ The model achieved the following performance metrics:
  - **Total Training Steps**: 150,000
  - **Total Training Time**: 66 hours 39 minutes
  - **Train/Validation Split**: 95%/5%
- - **Effective Training Data**: 21.12GB (with quality-based repetition)
-
- ### Training Infrastructure
-
- The model was trained using distributed training with the following optimizations:
-
- - **Backend**: NCCL for efficient multi-GPU communication
- - **Mixed Precision**: BFloat16 for improved training stability
- - **Evaluation Frequency**: Every 5,000 steps
- - **Checkpointing**: Every 5,000 steps
- - **Logging**: Every 250 steps
-
- ## Key Innovations
-
- ### Quality-Based Data Repetition
-
- This model introduces a novel **quality-based data repetition strategy** where:
-
- 1. **Highest quality sources** (legal dictionaries) are repeated 4x for maximum terminology exposure
- 2. **Medium-high quality sources** (court reports) are repeated 3x for judicial reasoning patterns
- 3. **Medium quality sources** (EU legal texts) are repeated 2x for regulatory language
- 4. **Lower quality sources** are used once to maintain diversity
-
- This approach resulted in **25% more effective training data** (21.12GB vs 16.75GB) while maintaining computational efficiency.
-
- ### Training Efficiency
-
- Despite the larger effective dataset, the model trained **47% faster** than the previous large variant (66h 39m vs 126h 19m) due to the more efficient RoBERTa-base architecture while maintaining comparable performance quality.
-
- ## Evaluation Results
-
- The model shows stable convergence with the quality-based repetition strategy, achieving competitive performance metrics:
-
- | Model | Architecture | Training Loss | Evaluation Loss | Training Time |
- | :--- | :--- | :--- | :--- | :--- |
- | `Themida-RoBERTa Legal 21G` (this model) | RoBERTa-base | 0.617 | 0.573035 | 66h 39m |
-
- *Performance on downstream tasks will be updated as evaluation results become available.*
-
- ## Intended Uses
-
- ### Primary Use Cases
- - Legal document analysis and classification
- - Named entity recognition in Greek legal texts
- - Legal question answering systems
- - Compliance monitoring and regulatory analysis
- - Legal text similarity and retrieval
- - Legal terminology extraction and understanding
-
- ### Secondary Use Cases
- - General Greek text understanding (with potential performance degradation)
- - Contract analysis and review
- - Legislative text analysis
- - Regulatory compliance checking
-
- ### Advantages of Quality-Based Training
- - **Enhanced legal vocabulary**: Better understanding of sophisticated legal terminology
- - **Improved judicial reasoning**: Stronger grasp of court decision patterns
- - **EU legal compliance**: Better handling of European regulatory language
- - **Computational efficiency**: Faster training than larger architectures
-
- ## Limitations and Bias
-
- - The model may reflect biases present in Greek legal and governmental texts
- - Quality-based repetition may amplify biases present in higher-quality sources
- - Performance may degrade on informal or colloquial Greek text
- - Limited knowledge of legal concepts post-training data cutoff
- - Optimized specifically for Greek legal domain; may not generalize well to other domains
-
- ## Technical Specifications
-
- - **Model Size**: ~125M parameters
- - **Architecture**: RoBERTa-base (12 layers, 12 attention heads)
- - **Training Time**: 66 hours 39 minutes on 8x A100 GPUs
- - **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- - **Memory Requirements**: More efficient than large models for fine-tuning
- - **Inference Speed**: Faster than large models due to base architecture
-
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_roberta_21g,
- author = {[Your Name/Organization]},
- title = {Themida-RoBERTa Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition},
- year = {2025},
- publisher = {Hugging Face},
- journal = {Hugging Face Hub},
- howpublished = {\url{https://huggingface.co/novelcore/themida-roberta-legal-21G-8-gpu}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain. Special acknowledgment for the innovative quality-based data repetition strategy that enhanced training efficiency while improving model performance on high-quality legal content.
+ - **Effective Training Data**: 21.12GB (with quality-based repetition)
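For context on the section removed above: the quality-based repetition it describes amounts to oversampling higher-quality sources before pre-training. Below is a minimal sketch of that idea. The 4x/3x/2x/1x repetition counts come from the removed README text; the directory layout, tier names, and helper function are hypothetical and only for illustration.

```python
# A minimal sketch of quality-based data repetition, assuming plain-text corpus
# files grouped into folders by quality tier. Only the repetition counts
# (4x/3x/2x/1x) come from the README; paths and tier names are hypothetical.
from pathlib import Path

REPETITIONS = {
    "legal_dictionaries": 4,  # highest quality: maximum terminology exposure
    "court_reports": 3,       # medium-high quality: judicial reasoning patterns
    "eu_legal_texts": 2,      # medium quality: regulatory language
    "other_sources": 1,       # lower quality: kept once for diversity
}

def build_training_file_list(corpus_root: str) -> list[Path]:
    """Return the pre-training file list with higher-quality sources repeated."""
    files: list[Path] = []
    for tier, repeats in REPETITIONS.items():
        tier_files = sorted(Path(corpus_root, tier).glob("*.txt"))
        files.extend(tier_files * repeats)  # repeat the whole tier N times
    return files

if __name__ == "__main__":
    file_list = build_training_file_list("greek_legal_corpus")
    print(f"{len(file_list)} files after quality-based repetition")
```

Repeating whole sources up front, rather than reweighting samples at batch time, is consistent with the removed text's framing of an enlarged "effective" corpus (21.12GB versus 16.75GB raw) fed to an otherwise standard MLM pre-training run.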
 
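The training-infrastructure bullets in the removed section (NCCL backend, BFloat16 mixed precision, evaluation and checkpointing every 5,000 steps, logging every 250 steps, 150,000 total steps) map fairly directly onto Hugging Face `TrainingArguments`. The sketch below reconstructs only those stated values; batch size, learning rate, and the output path are not given anywhere in this diff and are placeholders.

```python
# Plausible reconstruction of the training setup described in the removed README
# section. Values marked "from README" are stated there; everything else is a
# placeholder assumption, not the authors' actual configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gem-roberta-hq-legal",  # placeholder path
    max_steps=150_000,                    # from README: total training steps
    bf16=True,                            # from README: BFloat16 mixed precision
    eval_strategy="steps",
    eval_steps=5_000,                     # from README: evaluation every 5,000 steps
    save_steps=5_000,                     # from README: checkpoint every 5,000 steps
    logging_steps=250,                    # from README: logging every 250 steps
    ddp_backend="nccl",                   # from README: NCCL multi-GPU communication
    per_device_train_batch_size=32,       # placeholder: not stated in the diff
    learning_rate=1e-4,                   # placeholder: not stated in the diff
)
```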