Update README.md
README.md CHANGED
@@ -20,11 +20,11 @@ datasets:
 - pile-of-law
 ---
 
-# Themida-RoBERTa Legal 26G: A Bilingual Greek-English Legal Language Model
+# GEM-RoBERTa Legal Bilingual: A Bilingual Greek-English Legal Language Model
 
 ## Model Description
 
-**Themida-RoBERTa Legal 26G** …
+**GEM-RoBERTa Legal Bilingual** is a RoBERTa-base model pre-trained from scratch on a comprehensive 26GB bilingual corpus of Greek and English legal, parliamentary, and governmental text. This model represents the first large-scale bilingual legal language model combining Greek and English legal domains, enabling cross-lingual legal understanding and applications.
 
 The model employs the RoBERTa architecture optimized for legal text understanding across both languages, with dynamic masking and focused Masked Language Modeling (MLM) training. The bilingual approach allows the model to leverage legal concepts and terminology from both the Greek and Anglo-American legal traditions.
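To make the dynamic masking mentioned above concrete: in RoBERTa-style MLM the masked positions are re-drawn every time a batch is collated rather than fixed once during preprocessing. A minimal sketch with the standard `DataCollatorForLanguageModeling`; the 15% masking rate is the library default, not a figure documented for this model.

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")

# Dynamic masking: a fresh set of positions is masked each time the collator runs.
# The 0.15 masking probability is the library default, assumed rather than documented.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

encoding = tokenizer("Το δικαστήριο εξέδωσε την απόφασή του.")  # "The court issued its decision."
batch = collator([encoding])
print(batch["input_ids"])  # some tokens replaced by tokenizer.mask_token_id
print(batch["labels"])     # -100 everywhere except the masked positions
```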
@@ -40,19 +40,14 @@ from transformers import pipeline
 # Load the model
 fill_mask = pipeline(
     "fill-mask",
-    model="novelcore/themida-roberta-el-en-legal-26G-8-gpu",
-    tokenizer="novelcore/themida-roberta-el-en-legal-26G-8-gpu"
+    model="novelcore/gem-roberta-bilingual",
+    tokenizer="novelcore/gem-roberta-bilingual"
 )
 
 # Example in Greek
 text_gr = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
 predictions_gr = fill_mask(text_gr)
 print("Greek predictions:", predictions_gr)
-
-# Example in English
-text_en = "The Supreme Court <mask> that the constitutional amendment was valid under federal law."
-predictions_en = fill_mask(text_en)
-print("English predictions:", predictions_en)
 ```
 
 For downstream tasks:
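The `fill-mask` pipeline above returns a list of candidate completions, each a dict with a `token_str`, a `score`, and the filled-in `sequence`. A small way to inspect the top candidates (the actual predictions depend on the trained weights):

```python
# Print the top three candidates with their probabilities
for pred in predictions_gr[:3]:
    print(f"{pred['token_str']!r}  score={pred['score']:.3f}")
    print("   ", pred["sequence"])
```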
@@ -61,8 +56,8 @@ For downstream tasks:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
 # For bilingual legal document classification
-tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-roberta-el-en-legal-26G-8-gpu")
-model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-roberta-el-en-legal-26G-8-gpu")
+tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-bilingual")
+model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-bilingual")
 
 # Process texts in both languages
 greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."
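The downstream snippet above is cut off after the Greek example. A minimal sketch of how it might continue, assuming a two-label head for illustration; the English sentence and `num_labels` are not from the model card, and the classification head is randomly initialized until fine-tuned.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "novelcore/gem-roberta-bilingual"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=2 is an illustrative assumption; the head must be fine-tuned before use.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

greek_text = "Το Συνταγματικό Δικαστήριο αποφάσισε..."       # "The Constitutional Court decided..."
english_text = "The court held that the contract was void."  # illustrative English example

# The shared bilingual tokenizer handles both scripts in one padded batch.
inputs = tokenizer([greek_text, english_text], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(torch.softmax(logits, dim=-1))  # per-class probabilities, meaningful only after fine-tuning
```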
@@ -148,105 +143,4 @@ The model achieved the following performance metrics:
 - **Total Training Steps**: 150,000
 - **Total Training Time**: 25 hours 7 minutes
 - **Train/Validation Split**: 95%/5%
-- **Total Training Data**: 26GB bilingual corpus
+- **Total Training Data**: 26GB bilingual corpus
-
-### Training Infrastructure
-
-The model was trained using distributed training with the following optimizations:
-- **Backend**: NCCL for efficient multi-GPU communication
-- **Mixed Precision**: BFloat16 for improved training stability
-- **Evaluation Frequency**: Every 5,000 steps
-- **Checkpointing**: Every 5,000 steps
-- **Logging**: Every 200 steps
-
-## Key Innovations
-
-### Bilingual Legal Architecture
-This model introduces several innovations for bilingual legal language modeling:
-
-1. **Cross-lingual Legal Understanding**: First model to combine Greek civil law and English common law traditions
-2. **Balanced Language Exposure**: 60:40 Greek-English ratio optimized for both language preservation and cross-lingual transfer
-3. **Legal System Integration**: Combines EU legal frameworks available in both languages for enhanced multilingual legal comprehension
-4. **Efficient Training**: Achieved strong bilingual performance in just 25 hours compared to monolingual equivalents
-
-### Computational Efficiency
-Despite processing a larger 26GB corpus across two languages, the model demonstrates remarkable efficiency:
-- **62% faster training** than previous monolingual large variants (25h 7m vs ~66h)
-- **Enhanced cross-lingual capabilities** without proportional computational cost increase
-- **Optimized tokenizer** handling both Greek and Latin scripts efficiently
-
-## Evaluation Results
-
-The model shows stable convergence across both languages:
-
-| Model | Languages | Training Loss | Evaluation Loss | Training Time | Corpus Size |
-|-------|-----------|---------------|-----------------|---------------|-------------|
-| `Themida-RoBERTa Legal 26G` | Greek + English | 0.7479 | 0.69405 | 25h 7m | 26GB |
-
-*Performance on downstream bilingual legal tasks will be updated as evaluation results become available.*
-
-## Intended Uses
-
-### Primary Use Cases
-- **Bilingual legal document analysis** and classification
-- **Cross-lingual legal information retrieval** and similarity
-- **Greek-English legal translation** and terminology alignment
-- **EU legal compliance** analysis in multilingual contexts
-- **Comparative legal analysis** between civil and common law systems
-- **Multilingual legal question answering** systems
-
-### Secondary Use Cases
-- **Cross-border legal research** and case law analysis
-- **International contract analysis** and review
-- **Legal terminology extraction** in both languages
-- **Regulatory compliance** for multinational operations
-- **Legal education** resources for comparative law
-
-### Advantages of Bilingual Training
-- **Cross-lingual legal transfer**: Understanding legal concepts across different legal traditions
-- **Enhanced EU legal processing**: Better handling of multilingual EU regulatory frameworks
-- **Comparative legal analysis**: Native understanding of both civil law (Greek) and common law (English) concepts
-- **Multilingual legal applications**: Single model for diverse legal language tasks
-
-## Limitations and Bias
-
-- The model may reflect biases present in both Greek and English legal corpora
-- **Language imbalance effects**: 60:40 ratio may lead to stronger Greek legal concept representation
-- Performance may vary between formal legal text and colloquial usage in either language
-- **Cross-lingual interference**: Potential mixing of legal concepts from different legal systems
-- Limited knowledge of legal developments post-training data cutoff
-- May not generalize well to legal domains outside the training corpus scope
-
-## Technical Specifications
-
-- **Model Size**: ~125M parameters
-- **Architecture**: RoBERTa-base (12 layers, 12 attention heads)
-- **Languages**: Greek (el) + English (en)
-- **Training Time**: 25 hours 7 minutes on 8x H100 GPUs
-- **Dataset Size**: 26GB bilingual legal corpus
-- **Language Ratio**: 60.3% Greek, 39.7% English
-- **Memory Requirements**: Efficient base architecture suitable for production deployment
-- **Inference Speed**: Optimized for both Greek and English legal text processing
-
-## Model Card Authors
-
-[Your Name / Your Organization's Name]
-
-## Citation
-
-If you use this model in your research, please cite it as follows:
-
-```bibtex
-@misc{your_name_2025_themida_roberta_bilingual_26g,
-  author       = {[Your Name/Organization]},
-  title        = {Themida-RoBERTa Legal 26G: A Bilingual Greek-English Legal Language Model},
-  year         = {2025},
-  publisher    = {Hugging Face},
-  journal      = {Hugging Face Hub},
-  howpublished = {\url{https://huggingface.co/novelcore/themida-roberta-el-en-legal-26G-8-gpu}},
-}
-```
-
-## Acknowledgments
-
-We thank the Greek government institutions and legal organizations for making their legal texts publicly available. We also acknowledge the [Pile of Law](https://huggingface.co/datasets/pile-of-law) dataset contributors for providing comprehensive English legal corpora that enabled this bilingual approach. This work represents a significant step forward in multilingual legal language modeling, combining the rich traditions of Greek civil law and English common law in a single, efficient model.
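The removed sections above report concrete training and architecture figures. They map onto a standard Hugging Face pre-training setup roughly as follows; this is a sketch, not the authors' actual script: the vocabulary size, batch size, and output path are placeholders, and the remaining values simply echo the documented figures.

```python
from transformers import RobertaConfig, RobertaForMaskedLM, TrainingArguments

# RoBERTa-base geometry from the Technical Specifications (~125M parameters);
# vocab_size is a placeholder, as the bilingual tokenizer's actual size is not documented here.
config = RobertaConfig(
    vocab_size=50_265,        # placeholder
    num_hidden_layers=12,     # 12 layers
    num_attention_heads=12,   # 12 attention heads
    hidden_size=768,          # RoBERTa-base defaults
    intermediate_size=3072,
)
model = RobertaForMaskedLM(config)

# Training setup mirroring the "Training Infrastructure" figures; batch size and paths
# are placeholders, not documented values.
training_args = TrainingArguments(
    output_dir="./gem-roberta-bilingual",  # placeholder path
    max_steps=150_000,                     # Total Training Steps: 150,000
    bf16=True,                             # Mixed Precision: BFloat16
    eval_strategy="steps",                 # "evaluation_strategy" on older transformers releases
    eval_steps=5_000,                      # Evaluation Frequency: every 5,000 steps
    save_steps=5_000,                      # Checkpointing: every 5,000 steps
    logging_steps=200,                     # Logging: every 200 steps
    ddp_backend="nccl",                    # Backend: NCCL for multi-GPU communication
    per_device_train_batch_size=32,        # placeholder, not documented
)
```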