Update README.md
README.md

base_model:
- answerdotai/ModernBERT-base
---

# GEM-ModernBERT Legal: A Greek Legal Language Model with Advanced Optimization

## Model Description

**GEM-ModernBERT Legal** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model leverages ModernBERT's cutting-edge architectural innovations including **Flash Attention 2**, **StableAdamW optimizer**, **1024-token context length**, and **advanced memory optimization** techniques to deliver superior performance on Greek legal document understanding tasks.

Building upon our proven **quality-based data repetition strategy**, this model incorporates ModernBERT's state-of-the-art training methodology with **30% masking probability**, **trapezoidal learning rate scheduling**, and **optimized batch sizing** for enhanced convergence and performance. The model is specifically designed to handle longer legal documents with its extended 1024-token context window while maintaining computational efficiency through advanced optimization techniques.
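The effect of the 30% masking probability can be illustrated without the training stack. The sketch below is self-contained: the `mask_tokens` helper, the `[MASK]` literal, and the sample tokens are illustrative assumptions, not the model's actual data collator.

```python
import random

MASK_PROB = 0.30  # ModernBERT-style 30% masking, vs. the classic 15% BERT default

def mask_tokens(tokens, mask_token="[MASK]", p=MASK_PROB, rng=None):
    """Randomly replace a fraction p of tokens with the mask token."""
    rng = rng or random.Random(0)
    return [mask_token if rng.random() < p else t for t in tokens]

# Over a long input the empirical mask ratio settles near 0.30.
masked = mask_tokens(["token"] * 10000)
ratio = masked.count("[MASK]") / len(masked)
```

A higher masking rate gives the model more prediction targets per sequence, which is part of why ModernBERT converges well despite longer contexts.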

```python
from transformers import pipeline

# Load the model
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-modernbert-hq-legal",
    tokenizer="novelcore/gem-modernbert-hq-legal",
)

# Example from a legal context with longer sequence support
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-modernbert-hq-legal")

# The model supports up to 1024 tokens for longer legal documents
```
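Documents that exceed the 1024-token window still need splitting before classification. A minimal sliding-window sketch over token ids follows; the `chunk_ids` helper and the 128-token stride are illustrative assumptions, not part of the released tooling.

```python
MAX_LEN = 1024  # the model's maximum context length

def chunk_ids(input_ids, max_len=MAX_LEN, stride=128):
    """Split an over-long token-id sequence into overlapping windows."""
    if len(input_ids) <= max_len:
        return [input_ids]
    chunks, start = [], 0
    while start < len(input_ids):
        chunks.append(input_ids[start:start + max_len])
        if start + max_len >= len(input_ids):
            break
        start += max_len - stride  # overlap preserves context across boundaries
    return chunks

# A 2000-token document becomes 3 overlapping windows of at most 1024 ids.
windows = chunk_ids(list(range(2000)))
```

Per-window predictions can then be pooled (e.g., averaged logits) to score the full document; the overlap keeps clauses that straddle a boundary visible to at least one window.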

Consistent with our previous models:

1. **Highest quality sources** (legal dictionaries) repeated 4x
2. **Medium-high quality sources** (court reports) repeated 3x
3. **Medium quality sources** (EU legal texts) repeated 2x
4. **Lower quality sources** used once for diversity
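The tiering above amounts to a simple expansion step over the corpus file lists. The sketch below uses illustrative source names and files; the actual corpus manifest is not published in this card.

```python
# Quality-based repeat factors, mirroring the tiers listed above.
REPEAT_FACTOR = {
    "legal_dictionary": 4,   # highest quality
    "court_reports": 3,      # medium-high quality
    "eu_legal_texts": 2,     # medium quality
}

def build_training_manifest(sources):
    """Expand each source's file list by its quality-based repeat factor."""
    manifest = []
    for name, files in sources.items():
        # Sources without an explicit tier are used once, for diversity.
        manifest.extend(files * REPEAT_FACTOR.get(name, 1))
    return manifest

manifest = build_training_manifest({
    "legal_dictionary": ["dictionary.txt"],
    "court_reports": ["court_1.txt", "court_2.txt"],
    "web_legal_misc": ["misc.txt"],
})
# dictionary.txt appears 4x, each court file 3x, misc.txt once
```

Repetition at the manifest level (rather than oversampling at batch time) keeps the effective dataset size, here 21.12GB, easy to audit.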

### Training Efficiency Improvements

- **Faster Training**: 97h vs 146h (DeBERTa) despite longer sequences
- **Better Convergence**: Optimized batch sizing and learning rate
- **Memory Efficiency**: Advanced CUDA memory management
- **Stable Training**: StableAdamW and conservative hyperparameters

## Evaluation Results

Comprehensive performance comparison across our model family:

| Model | Architecture | Context | Training Loss | Eval Loss | Training Time | Vocab Size |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `Themida-ModernBERT Legal 21G` | ModernBERT-base | 1024 | 0.7648 | 0.7751 | 97h 9m | 50K |
| `Themida-DeBERTa Legal 21G` | DeBERTa-base | 512 | 0.7913 | 0.7314 | 146h 13m | 128K |
| `Themida-RoBERTa Legal 21G` | RoBERTa-base | 512 | 0.617 | 0.573 | 66h 39m | 50K |

*Performance variations reflect different architectural designs and optimization strategies. Downstream task evaluations will be updated as results become available.*

## Architecture Comparison: ModernBERT Advantages

### Over RoBERTa

- **2x Longer Context**: 1024 vs 512 tokens for complete document processing
- **Flash Attention 2**: Memory-efficient processing of longer sequences
- **Advanced Optimizer**: StableAdamW vs standard AdamW
- **Optimized MLM**: 30% vs 15% masking probability

### Over DeBERTa

- **Faster Training**: 97h vs 146h with comparable context understanding
- **Memory Efficiency**: Better optimization for large-scale training
- **Stable Convergence**: Conservative hyperparameters with reliable results
- **Modern Optimizations**: Latest attention and memory management techniques

### Unique ModernBERT Features

- **Extended Context Processing**: Handles complete legal documents
- **Memory Optimization**: Advanced CUDA memory management
- **Training Stability**: StableAdamW with element-wise epsilon mode
- **Attention Efficiency**: Flash Attention 2 with custom optimizations

## Intended Uses

### Primary Use Cases

- **Long-form legal document analysis** (up to 1024 tokens)
- **Complete contract processing** without truncation
- **Parliamentary debate analysis** with full context
- **Legal precedent identification** across extended text
- **Regulatory compliance checking** with comprehensive document coverage
- **Legal question answering** with enhanced context understanding

### Enhanced Capabilities

- **Full legal articles** processed without chunking
- **Extended court decisions** analyzed with complete reasoning
- **Complex legislative texts** with multiple cross-references
- **Parliamentary proceedings** with speaker continuity
- **Legal research** with comprehensive document context

### Optimal Use Cases for 1024-token Context

- **Complete legal contracts** (most fit within 1024 tokens)
- **Court decision summaries** with full reasoning
- **Parliamentary speeches** and debates
- **Legal article analysis** without truncation
- **Regulatory text processing** with full context

## Performance Advantages

### Speed and Efficiency

- **~34% Faster Training**: 97h vs 146h (DeBERTa) with longer contexts
- **Memory Optimization**: Advanced CUDA allocation for large batches
- **Flash Attention 2**: Efficient processing of 1024-token sequences
- **Stable Convergence**: Reliable training with conservative settings

### Quality Improvements

- **Extended Context**: 2x longer sequences capture complete documents
- **Better Representations**: 30% MLM probability for enhanced learning
- **Stable Training**: StableAdamW optimizer with element-wise epsilon
- **Optimized Architecture**: Modern attention mechanisms and memory management

## Limitations and Considerations

- The model may reflect biases present in Greek legal and governmental texts
- Quality-based repetition may amplify biases from higher-quality sources
- **Higher memory requirements** for inference due to the 1024-token context
- **Longer processing time** for extended sequences compared to 512-token models
- Performance may degrade on informal or colloquial Greek text
- Limited knowledge of legal concepts introduced after the training data cutoff
- Optimized specifically for the Greek legal domain

## Technical Specifications

- **Model Size**: ~139M parameters
- **Architecture**: ModernBERT-base with Flash Attention 2
- **Context Length**: 1024 tokens (2x standard BERT models)
- **Training Time**: 97 hours 9 minutes on 8x H100 80GB GPUs
- **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- **Vocabulary Size**: 50,373 tokens
- **Memory Requirements**: Optimized for H100 GPUs with advanced allocation
- **Inference Speed**: Efficient with Flash Attention 2 optimizations

## Deployment Recommendations

### Hardware Requirements

- **GPU Memory**: Minimum 24GB for inference with long sequences
- **Optimal Hardware**: H100, A100, or modern GPUs with Flash Attention support
- **Memory Configuration**: Use the provided CUDA memory optimization settings

### Performance Tuning

- **Enable Flash Attention 2** for optimal performance
- **Use BFloat16** precision for H100/A100 GPUs
- **Configure memory allocation** using the provided PYTORCH_CUDA_ALLOC_CONF
- **Batch sizing**: Adjust based on available GPU memory
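The "provided" allocator settings are not reproduced in this card. As an assumption, a common starting point for long-sequence workloads is the expandable-segments allocator mode, which must be set before the first CUDA allocation:

```python
import os

# Illustrative choice, not this card's exact configuration; set before
# torch initializes CUDA (e.g., at the very top of your script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

On H100/A100 this pairs naturally with BFloat16 inference, e.g. passing `torch_dtype=torch.bfloat16` to `from_pretrained`.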

## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_modernbert_21g,
  author = {[Your Name/Organization]},
  title = {Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/novelcore/themida-modernbert-legal-21GB-1024}},
}
```

## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model. Special recognition goes to Answer.AI for the ModernBERT architecture and the open-source community for Flash Attention 2 and StableAdamW optimizations. This model represents the culmination of our research into optimal training strategies for Greek legal language understanding, combining proven data curation techniques with cutting-edge architectural innovations.