alexaapo committed
Commit 338e541 · verified · 1 Parent(s): ebf3726

Update README.md

Files changed (1)
  1. README.md +7 -113
README.md CHANGED
@@ -20,11 +20,11 @@ datasets:
  - pile-of-law
  ---
 
- # Themida-RoBERTa Legal 26G: A Bilingual Greek-English Legal Language Model
+ # GEM-RoBERTa Legal Bilingual: A Bilingual Greek-English Legal Language Model
 
  ## Model Description
 
- **Themida-RoBERTa Legal 26G** is a RoBERTa-base model pre-trained from scratch on a comprehensive 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. This model represents the first large-scale bilingual legal language model combining Greek and English legal domains, enabling cross-lingual legal understanding and applications.
+ **GEM-RoBERTa Legal Bilingual** is a RoBERTa-base model pre-trained from scratch on a comprehensive 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. This model represents the first large-scale bilingual legal language model combining Greek and English legal domains, enabling cross-lingual legal understanding and applications.
 
  The model employs the RoBERTa architecture optimized for legal text understanding across both languages, with dynamic masking and focused Masked Language Modeling (MLM) training. The bilingual approach allows the model to leverage legal concepts and terminology from both the Greek and Anglo-American legal traditions.
@@ -40,19 +40,14 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
      "fill-mask",
-     model="novelcore/themida-roberta-el-en-legal-26G-8-gpu",
-     tokenizer="novelcore/themida-roberta-el-en-legal-26G-8-gpu"
+     model="novelcore/gem-roberta-bilingual",
+     tokenizer="novelcore/gem-roberta-bilingual"
  )
 
  # Example in Greek
  text_gr = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
  predictions_gr = fill_mask(text_gr)
  print("Greek predictions:", predictions_gr)
-
- # Example in English
- text_en = "The Supreme Court <mask> that the constitutional amendment was valid under federal law."
- predictions_en = fill_mask(text_en)
- print("English predictions:", predictions_en)
  ```
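For reference, the Greek prompt above reads roughly: "Mr. Mitsotakis <mask> that the government fully respects the decisions of the Council of State." A minimal sketch of ranking the returned candidates, assuming only the standard `transformers` fill-mask output format (a list of dicts with `token_str` and `score` keys):

```python
# Each prediction is a dict with "token_str" (candidate fill) and "score"
# (probability); sort by score and print the candidates.
for pred in sorted(predictions_gr, key=lambda p: p["score"], reverse=True):
    print(f"{pred['token_str']!r}: {pred['score']:.4f}")
```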
 
  For downstream tasks:
@@ -61,8 +56,8 @@ For downstream tasks:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
  # For bilingual legal document classification
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-roberta-el-en-legal-26G-8-gpu")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-roberta-el-en-legal-26G-8-gpu")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-bilingual")
 
  # Process texts in both languages
  greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."
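`greek_text` translates roughly to "The Constitutional Court ruled...". A minimal sketch of batching texts from both languages through the loaded model; `english_text` and the inference boilerplate are illustrative assumptions, and the classification head is randomly initialized until the model is fine-tuned:

```python
import torch

# Illustrative English counterpart (not from the model card).
english_text = "The Supreme Court held that the amendment was valid."

# Encode both languages in one padded batch.
inputs = tokenizer([greek_text, english_text], return_tensors="pt",
                   padding=True, truncation=True)

# Logits are meaningless until the randomly initialized head is fine-tuned.
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (2, num_labels)
```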
@@ -148,105 +143,4 @@ The model achieved the following performance metrics:
  - **Total Training Steps**: 150,000
  - **Total Training Time**: 25 hours 7 minutes
  - **Train/Validation Split**: 95%/5%
- - **Total Training Data**: 26GB bilingual corpus
-
- ### Training Infrastructure
-
- The model was trained using distributed training with the following optimizations:
- - **Backend**: NCCL for efficient multi-GPU communication
- - **Mixed Precision**: BFloat16 for improved training stability
- - **Evaluation Frequency**: Every 5,000 steps
- - **Checkpointing**: Every 5,000 steps
- - **Logging**: Every 200 steps
-
- ## Key Innovations
-
- ### Bilingual Legal Architecture
- This model introduces several innovations for bilingual legal language modeling:
-
- 1. **Cross-lingual Legal Understanding**: First model to combine Greek civil law and English common law traditions
- 2. **Balanced Language Exposure**: 60:40 Greek-English ratio optimized for both language preservation and cross-lingual transfer
- 3. **Legal System Integration**: Combines EU legal frameworks available in both languages for enhanced multilingual legal comprehension
- 4. **Efficient Training**: Achieved strong bilingual performance in just 25 hours compared to monolingual equivalents
-
- ### Computational Efficiency
- Despite processing a larger 26GB corpus across two languages, the model demonstrates remarkable efficiency:
- - **62% faster training** than previous monolingual large variants (25h 7m vs ~66h)
- - **Enhanced cross-lingual capabilities** without proportional computational cost increase
- - **Optimized tokenizer** handling both Greek and Latin scripts efficiently
-
- ## Evaluation Results
-
- The model shows stable convergence across both languages:
-
- | Model | Languages | Training Loss | Evaluation Loss | Training Time | Corpus Size |
- |-------|-----------|---------------|-----------------|---------------|-------------|
- | `Themida-RoBERTa Legal 26G` | Greek + English | 0.7479 | 0.69405 | 25h 7m | 26GB |
-
- *Performance on downstream bilingual legal tasks will be updated as evaluation results become available.*
-
- ## Intended Uses
-
- ### Primary Use Cases
- - **Bilingual legal document analysis** and classification
- - **Cross-lingual legal information retrieval** and similarity
- - **Greek-English legal translation** and terminology alignment
- - **EU legal compliance** analysis in multilingual contexts
- - **Comparative legal analysis** between civil and common law systems
- - **Multilingual legal question answering** systems
-
- ### Secondary Use Cases
- - **Cross-border legal research** and case law analysis
- - **International contract analysis** and review
- - **Legal terminology extraction** in both languages
- - **Regulatory compliance** for multinational operations
- - **Legal education** resources for comparative law
-
- ### Advantages of Bilingual Training
- - **Cross-lingual legal transfer**: Understanding legal concepts across different legal traditions
- - **Enhanced EU legal processing**: Better handling of multilingual EU regulatory frameworks
- - **Comparative legal analysis**: Native understanding of both civil law (Greek) and common law (English) concepts
- - **Multilingual legal applications**: Single model for diverse legal language tasks
-
- ## Limitations and Bias
-
- - The model may reflect biases present in both Greek and English legal corpora
- - **Language imbalance effects**: 60:40 ratio may lead to stronger Greek legal concept representation
- - Performance may vary between formal legal text and colloquial usage in either language
- - **Cross-lingual interference**: Potential mixing of legal concepts from different legal systems
- - Limited knowledge of legal developments post-training data cutoff
- - May not generalize well to legal domains outside the training corpus scope
-
- ## Technical Specifications
-
- - **Model Size**: ~125M parameters
- - **Architecture**: RoBERTa-base (12 layers, 12 attention heads)
- - **Languages**: Greek (el) + English (en)
- - **Training Time**: 25 hours 7 minutes on 8x H100 GPUs
- - **Dataset Size**: 26GB bilingual legal corpus
- - **Language Ratio**: 60.3% Greek, 39.7% English
- - **Memory Requirements**: Efficient base architecture suitable for production deployment
- - **Inference Speed**: Optimized for both Greek and English legal text processing
-
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_roberta_bilingual_26g,
- author = {[Your Name/Organization]},
- title = {Themida-RoBERTa Legal 26G: A Bilingual Greek-English Legal Language Model},
- year = {2025},
- publisher = {Hugging Face},
- journal = {Hugging Face Hub},
- howpublished = {\url{https://huggingface.co/novelcore/themida-roberta-el-en-legal-26G-8-gpu}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions and legal organizations for making their legal texts publicly available. We also acknowledge the [Pile of Law](https://huggingface.co/datasets/pile-of-law) dataset contributors for providing comprehensive English legal corpora that enabled this bilingual approach. This work represents a significant step forward in multilingual legal language modeling, combining the rich traditions of Greek civil law and English common law in a single, efficient model.
+ - **Total Training Data**: 26GB bilingual corpus
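For readers reconstructing the setup described in the removed section: the listed schedule (150,000 steps, BFloat16 mixed precision, evaluation and checkpointing every 5,000 steps, logging every 200 steps) maps roughly onto the following `transformers` `TrainingArguments`. This is a sketch, not the authors' actual training script; `output_dir` is a placeholder, and the argument is named `evaluation_strategy` in older library releases:

```python
from transformers import TrainingArguments

# A sketch mapping the settings listed above onto TrainingArguments.
args = TrainingArguments(
    output_dir="checkpoints",   # placeholder path
    max_steps=150_000,          # Total Training Steps: 150,000
    bf16=True,                  # Mixed Precision: BFloat16
    eval_strategy="steps",      # "evaluation_strategy" in older versions
    eval_steps=5_000,           # Evaluation Frequency: every 5,000 steps
    save_steps=5_000,           # Checkpointing: every 5,000 steps
    logging_steps=200,          # Logging: every 200 steps
)
```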
 