alexaapo committed · Commit 85f9445 · verified · 1 Parent(s): 3642ace

Update README.md

Files changed (1):
  1. README.md +7 -141
README.md CHANGED
@@ -17,11 +17,11 @@ base_model:
  - answerdotai/ModernBERT-base
  ---

- # Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization
+ # GEM-ModernBERT Legal: A Greek Legal Language Model with Advanced Optimization

  ## Model Description

- **Themida-ModernBERT Legal 21G** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model leverages ModernBERT's architectural innovations, including **Flash Attention 2**, the **StableAdamW optimizer**, a **1024-token context length**, and **advanced memory optimization**, to deliver strong performance on Greek legal document understanding tasks.
+ **GEM-ModernBERT Legal** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model leverages ModernBERT's architectural innovations, including **Flash Attention 2**, the **StableAdamW optimizer**, a **1024-token context length**, and **advanced memory optimization**, to deliver strong performance on Greek legal document understanding tasks.

  Building upon our proven **quality-based data repetition strategy**, this model incorporates ModernBERT's training methodology with a **30% masking probability**, **trapezoidal learning rate scheduling**, and **optimized batch sizing** for improved convergence. The model is designed to handle longer legal documents with its extended 1024-token context window while maintaining computational efficiency.
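The 30% masking objective named above can be sketched with the stock Hugging Face collator. This is a minimal illustration of the masking rate, not the released training script, and it assumes the checkpoint ID published on this card:

```python
# Minimal sketch: MLM batches with the 30% masking probability described above.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,  # vs. the classic 15% used by BERT/RoBERTa
)

encoding = tokenizer("Παράδειγμα ελληνικού νομικού κειμένου.", truncation=True, max_length=1024)
batch = collator([encoding])
print(batch["input_ids"].shape, batch["labels"].shape)  # labels are -100 except at masked positions
```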
@@ -37,8 +37,8 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
      "fill-mask",
-     model="novelcore/themida-modernbert-legal-21GB-1024",
-     tokenizer="novelcore/themida-modernbert-legal-21GB-1024"
+     model="novelcore/gem-modernbert-hq-legal",
+     tokenizer="novelcore/gem-modernbert-hq-legal"
  )

  # Example from a legal context with longer sequence support
@@ -55,8 +55,8 @@ For downstream tasks:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # For legal document classification with extended context
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-modernbert-hq-legal")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-modernbert-hq-legal")

  # The model supports up to 1024 tokens for longer legal documents
  ```
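A short usage sketch of the extended window, reusing the `tokenizer` and `model` from the block above; `long_document_text` is a placeholder for your own input, and the classification head produces meaningful scores only after fine-tuning:

```python
# Encode a long legal document with the full 1024-token window.
inputs = tokenizer(
    long_document_text,   # placeholder: a Greek legal document string
    truncation=True,
    max_length=1024,      # the extended ModernBERT context used by this model
    return_tensors="pt",
)
logits = model(**inputs).logits  # classification scores (head must be fine-tuned first)
```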
@@ -203,138 +203,4 @@ Consistent with our previous models:
  1. **Highest quality sources** (legal dictionaries) repeated 4x
  2. **Medium-high quality sources** (court reports) repeated 3x
  3. **Medium quality sources** (EU legal texts) repeated 2x
- 4. **Lower quality sources** used once for diversity
+ 4. **Lower quality sources** used once for diversity
-
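A minimal sketch of how the quality-based repetition above can be materialized before tokenization. Only the repetition factors come from the list; the source names and the `load_corpus_texts` helper are hypothetical:

```python
# Quality-based data repetition (factors mirror the list above).
REPETITION_FACTORS = {
    "legal_dictionaries": 4,   # highest quality
    "court_reports": 3,        # medium-high quality
    "eu_legal_texts": 2,       # medium quality
    "other_sources": 1,        # lower quality, used once for diversity
}

training_corpus: list[str] = []
for source, factor in REPETITION_FACTORS.items():
    documents = load_corpus_texts(source)  # hypothetical I/O helper
    training_corpus.extend(documents * factor)
```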
- ### Training Efficiency Improvements
-
- - **Faster Training**: 97h vs 146h (DeBERTa) despite longer sequences
- - **Better Convergence**: Optimized batch sizing and learning rate
- - **Memory Efficiency**: Advanced CUDA memory management
- - **Stable Training**: StableAdamW and conservative hyperparameters
-
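The trapezoidal learning rate schedule named in the model description is straightforward to express with PyTorch's `LambdaLR`. The phase lengths and peak LR below are illustrative assumptions, not the values used in training, and plain `AdamW` stands in for `StableAdamW`:

```python
# Trapezoidal (warmup / stable / decay) LR schedule as a LambdaLR multiplier.
import torch

WARMUP, STABLE, DECAY = 5_000, 90_000, 10_000  # illustrative step counts

def trapezoid(step: int) -> float:
    if step < WARMUP:                   # linear warmup: 0 -> 1
        return step / WARMUP
    if step < WARMUP + STABLE:          # flat plateau at peak LR
        return 1.0
    remaining = WARMUP + STABLE + DECAY - step
    return max(0.0, remaining / DECAY)  # linear decay: 1 -> 0

model = torch.nn.Linear(8, 8)  # stand-in; use the loaded ModernBERT in practice
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=trapezoid)
```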
- ## Evaluation Results
-
- Comprehensive performance comparison across our model family:
-
- | Model | Architecture | Context | Training Loss | Eval Loss | Training Time | Vocab Size |
- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
- | `Themida-ModernBERT Legal 21G` | ModernBERT-base | 1024 | 0.7648 | 0.7751 | 97h 9m | 50K |
- | `Themida-DeBERTa Legal 21G` | DeBERTa-base | 512 | 0.7913 | 0.7314 | 146h 13m | 128K |
- | `Themida-RoBERTa Legal 21G` | RoBERTa-base | 512 | 0.617 | 0.573 | 66h 39m | 50K |
-
- *Performance variations reflect different architectural designs and optimization strategies. Downstream task evaluations will be updated as results become available.*
-
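Because these losses are per-token cross-entropies, they can be read as pseudo-perplexities via exp(loss), which makes the table easier to compare at a glance. Note the differing vocabulary sizes (50K vs 128K) mean the figures are still not strictly comparable across models:

```python
# Pseudo-perplexity = exp(MLM cross-entropy), from the eval losses above.
import math

for name, eval_loss in [("ModernBERT", 0.7751), ("DeBERTa", 0.7314), ("RoBERTa", 0.573)]:
    print(f"{name}: pseudo-perplexity ≈ {math.exp(eval_loss):.2f}")
# ModernBERT ≈ 2.17, DeBERTa ≈ 2.08, RoBERTa ≈ 1.77
```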
- ## Architecture Comparison: ModernBERT Advantages
-
- ### Over RoBERTa
- - **2x Longer Context**: 1024 vs 512 tokens for complete document processing
- - **Flash Attention 2**: Memory-efficient processing of longer sequences
- - **Advanced Optimizer**: StableAdamW vs standard AdamW
- - **Optimized MLM**: 30% vs 15% masking probability
-
- ### Over DeBERTa
- - **Faster Training**: 97h vs 146h with comparable context understanding
- - **Memory Efficiency**: Better optimization for large-scale training
- - **Stable Convergence**: Conservative hyperparameters with reliable results
- - **Modern Optimizations**: Latest attention and memory management techniques
-
- ### Unique ModernBERT Features
- - **Extended Context Processing**: Handle complete legal documents
- - **Memory Optimization**: Advanced CUDA memory management
- - **Training Stability**: StableAdamW with element-wise epsilon mode
- - **Attention Efficiency**: Flash Attention 2 with custom optimizations
-
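To actually benefit from Flash Attention 2 at inference time, request it when loading. This sketch assumes a recent `transformers`, the `flash-attn` package, and an Ampere-or-newer GPU; loading raises an error rather than silently falling back if `flash-attn` is unavailable:

```python
# Load with Flash Attention 2 and bfloat16 for efficient long-sequence inference.
import torch
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained(
    "novelcore/gem-modernbert-hq-legal",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```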
- ## Intended Uses
-
- ### Primary Use Cases
- - **Long-form legal document analysis** (up to 1024 tokens)
- - **Complete contract processing** without truncation
- - **Parliamentary debate analysis** with full context
- - **Legal precedent identification** across extended text
- - **Regulatory compliance checking** with comprehensive document coverage
- - **Legal question answering** with enhanced context understanding
-
- ### Enhanced Capabilities
- - **Full legal articles**: processing without chunking
- - **Extended court decisions**: analysis with complete reasoning
- - **Complex legislative texts**: handling multiple cross-references
- - **Parliamentary proceedings**: preserving speaker continuity
- - **Legal research**: comprehensive document context
-
- ### Optimal Use Cases for 1024-token Context
- - **Complete legal contracts** (most fit within 1024 tokens)
- - **Court decision summaries** with full reasoning
- - **Parliamentary speeches** and debates
- - **Legal article analysis** without truncation
- - **Regulatory text processing** with full context
-
- ## Performance Advantages
-
- ### Speed and Efficiency
- - **~34% Faster Training**: 97h vs 146h (DeBERTa) with longer contexts
- - **Memory Optimization**: Advanced CUDA allocation for large batches
- - **Flash Attention 2**: Efficient processing of 1024-token sequences
- - **Stable Convergence**: Reliable training with conservative settings
-
- ### Quality Improvements
- - **Extended Context**: 2x longer sequences capture complete documents
- - **Better Representations**: 30% MLM probability for enhanced learning
- - **Stable Training**: StableAdamW optimizer with element-wise epsilon
- - **Optimized Architecture**: Modern attention mechanisms and memory management
-
- ## Limitations and Considerations
-
- - The model may reflect biases present in Greek legal and governmental texts
- - Quality-based repetition may amplify biases from higher-quality sources
- - **Higher memory requirements** for inference due to the 1024-token context
- - **Longer processing time** for extended sequences compared to 512-token models
- - Performance may degrade on informal or colloquial Greek text
- - No knowledge of legal developments after the training data cutoff
- - Optimized specifically for the Greek legal domain
-
- ## Technical Specifications
-
- - **Model Size**: ~139M parameters
- - **Architecture**: ModernBERT-base with Flash Attention 2
- - **Context Length**: 1024 tokens (2x the 512-token standard BERT context)
- - **Training Time**: 97 hours 9 minutes on 8x H100 80GB GPUs
- - **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- - **Vocabulary Size**: 50,373 tokens
- - **Memory Requirements**: Optimized for H100 GPUs with advanced allocation
- - **Inference Speed**: Efficient with Flash Attention 2 optimizations
-
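The ~139M figure is easy to verify once the checkpoint is downloaded:

```python
# Count parameters to check the ~139M figure quoted above.
from transformers import AutoModelForMaskedLM

model = AutoModelForMaskedLM.from_pretrained("novelcore/gem-modernbert-hq-legal")
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.1f}M parameters")
```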
- ## Deployment Recommendations
-
- ### Hardware Requirements
- - **GPU Memory**: Minimum 24GB for inference with long sequences
- - **Optimal Hardware**: H100, A100, or other modern GPUs with Flash Attention support
- - **Memory Configuration**: Use the provided CUDA memory optimization settings
-
- ### Performance Tuning
- - **Enable Flash Attention 2** for optimal performance
- - **Use BFloat16** precision on H100/A100 GPUs
- - **Configure memory allocation** using the provided PYTORCH_CUDA_ALLOC_CONF
- - **Batch sizing**: Adjust based on available GPU memory
-
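A deployment sketch pulling the tuning points above together. The exact PYTORCH_CUDA_ALLOC_CONF value the card refers to is not reproduced here, so the allocator setting below is an illustrative assumption:

```python
import os
# Must be set before CUDA initializes; the value is an assumption, not the
# model's documented configuration.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-modernbert-hq-legal",
    torch_dtype=torch.bfloat16,  # recommended precision on H100/A100
    device=0,
)
```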
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_modernbert_21g,
-   author = {[Your Name/Organization]},
-   title = {Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization},
-   year = {2025},
-   publisher = {Hugging Face},
-   journal = {Hugging Face Hub},
-   howpublished = {\url{https://huggingface.co/novelcore/themida-modernbert-legal-21GB-1024}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model. Special recognition goes to Answer.AI for the ModernBERT architecture and to the open-source community for the Flash Attention 2 and StableAdamW implementations. This model represents the culmination of our research into training strategies for Greek legal language understanding, combining proven data curation techniques with recent architectural innovations.
 