alexaapo committed on
Commit 3642ace · verified · Parent(s): 0c226d0

Update README.md

Files changed (1): README.md (+295 −154)

---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- modernbert
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
- flash-attention
- stable-adamw
base_model:
- answerdotai/ModernBERT-base
---

# Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization

## Model Description

**Themida-ModernBERT Legal 21G** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model leverages ModernBERT's architectural innovations, including **Flash Attention 2**, the **StableAdamW optimizer**, a **1024-token context length**, and **advanced memory optimization**, to improve performance on Greek legal document understanding tasks.

Building on our quality-based data repetition strategy, training follows ModernBERT's methodology with a **30% masking probability**, **trapezoidal learning rate scheduling**, and **tuned batch sizing** for stable convergence. The extended 1024-token context window lets the model handle longer legal documents while remaining computationally efficient.

This model is the culmination of our Greek legal language modeling research, combining domain expertise with recent architectural advances in transformer-based language models. It is intended as a base for downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal domain.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the model and its tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-modernbert-legal-21GB-1024",
    tokenizer="novelcore/themida-modernbert-legal-21GB-1024",
)

# Example from a legal context with longer sequence support.
# ("According to Article 15 of the Constitution, the <mask> of human rights
#  constitutes a fundamental obligation of the state within the democratic polity.")
text = "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."

# Get predictions for the masked token
predictions = fill_mask(text)
print(predictions)
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

# The model supports up to 1024 tokens for longer legal documents
```

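Note that the classification head loaded this way is randomly initialized until the model is fine-tuned. As a hypothetical continuation of the snippet above (not part of the original card), a long legal passage can be encoded up to the 1024-token limit and passed through the model:

```python
import torch

# Placeholder for a long Greek legal text (hypothetical input)
long_document = "Το παρόν κείμενο ..."

# Encode up to the model's maximum context length of 1024 tokens
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest-scoring class; label names depend on your fine-tuning setup
print(logits.argmax(dim=-1).item())
```
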
## Training Data

The model was pre-trained on the same comprehensive corpus of Greek text used in our previous models, employing our quality-based data repetition strategy, which increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to 21.12GB through strategic repetition and is now processed as **1024-token sequences** for enhanced context understanding (a sketch of the repetition scheme appears after the table below).

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |

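The corpus-preparation code is not published with this card. Below is a minimal sketch, under the assumption that each source is available as a plain-text file (the file names and the `build_corpus` helper are hypothetical), of how the repetition factors above translate into the effective 21.12GB corpus:

```python
from pathlib import Path

# Hypothetical mapping of source files to the repetition factors in the table above
REPETITION_FACTORS = {
    "raptarchis_legal_dictionary.txt": 4,
    "supreme_court_reports.txt": 3,
    "eurlex_el.txt": 2,
    "fek_government_gazette.txt": 1,
    "parliament_proceedings.txt": 1,
    "europarl_el.txt": 1,
}

def build_corpus(source_dir: str, output_file: str) -> None:
    """Concatenate each source `factor` times to form the effective training corpus."""
    with open(output_file, "w", encoding="utf-8") as out:
        for name, factor in REPETITION_FACTORS.items():
            text = Path(source_dir, name).read_text(encoding="utf-8")
            for _ in range(factor):
                out.write(text)
                out.write("\n")

# build_corpus("raw_sources", "effective_corpus.txt")
```
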
### Enhanced Context Processing

With **1024-token sequences**, this model can process:

- **Complete legal articles** without truncation
- **Full court decisions** with extended reasoning
- **Complex legislative texts** with multiple references
- **Parliamentary debates** with comprehensive context

## Training Procedure

### Model Architecture

The model uses the ModernBERT-base architecture with the following configuration (a snippet for inspecting the published config follows the list):

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Parameters**: ~139M
- **Max Position Embeddings**: 1024
- **Vocabulary Size**: 50,373
- **Flash Attention 2**: Enabled
- **Context Length**: 1024 tokens (2x longer than previous models)

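As a quick sanity check (not part of the original card), these values can be read from the published checkpoint's configuration; the commented values are the ones reported above:

```python
from transformers import AutoConfig

# Load the configuration of the published checkpoint
config = AutoConfig.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

print(config.hidden_size)              # reported above: 768
print(config.num_attention_heads)      # reported above: 12
print(config.num_hidden_layers)        # reported above: 12
print(config.max_position_embeddings)  # reported above: 1024
print(config.vocab_size)               # reported above: 50,373
```
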
### Key Architectural Advantages

ModernBERT's innovations provide significant benefits for legal text processing:

1. **Flash Attention 2**: Memory-efficient attention computation for longer sequences
2. **Extended Context**: 1024-token sequences capture complete legal documents
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Optimized MLM**: 30% masking probability for improved representation learning
5. **Advanced Memory Management**: Optimized CUDA memory allocation for large batches

### Preprocessing

The text was processed into **1024-token chunks** using ModernBERT's tokenizer (vocabulary: 50,373 tokens), providing good coverage of Greek legal terminology while maintaining compatibility with the base architecture.

Higher-quality sources were repeated during the data preparation phase, and each sequence now captures considerably more context per training example; a minimal chunking sketch follows.

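The exact preprocessing pipeline is not included in this card. A minimal sketch, assuming the corpus is already concatenated into plain text, of splitting documents into fixed-length token chunks (the `chunk_token_ids` helper and file name are illustrative):

```python
from transformers import AutoTokenizer

# The tokenizer shipped with the released checkpoint
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

def chunk_token_ids(text: str, max_length: int = 1024):
    """Split a document into consecutive chunks of at most `max_length` tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Reserve two positions for the special tokens added around each training sequence
    stride = max_length - 2
    return [ids[i:i + stride] for i in range(0, len(ids), stride)]

# Example: one long legal document becomes several 1024-token training sequences
# chunks = chunk_token_ids(open("effective_corpus.txt", encoding="utf-8").read())
```
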
### Pre-training

The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA H100 80GB GPUs, using BFloat16 (`bf16`) mixed precision together with the optimization techniques described below. Training took approximately **97 hours and 9 minutes**.

#### Key Training Optimizations

**Batch Size Optimization** (a configuration sketch follows this list):

- **Per-device batch size**: 16 (tuned for H100 memory)
- **Gradient accumulation steps**: 8
- **Effective batch size**: 1,024 sequences (16 × 8 × 8 GPUs)
- **Context length**: 1024 tokens per sequence

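The original training scripts are not published here. The following 🤗 `TrainingArguments` sketch reproduces only the batch-size, precision, and logging settings reported in this card; every other argument is an illustrative assumption:

```python
from transformers import TrainingArguments

# Sketch of the batch, precision, and logging settings reported in this card
training_args = TrainingArguments(
    output_dir="themida-modernbert-legal",   # hypothetical output path
    per_device_train_batch_size=16,          # 16 sequences per GPU
    gradient_accumulation_steps=8,           # 16 x 8 x 8 GPUs = effective batch of 1,024
    max_steps=150_000,
    bf16=True,                               # BFloat16 mixed precision on H100
    evaluation_strategy="steps",
    eval_steps=5_000,
    save_steps=5_000,
    logging_steps=250,
    max_grad_norm=1.0,                       # gradient clipping
    weight_decay=0.1,
)
```
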
**StableAdamW Configuration:**

- **Learning Rate**: 0.0002 (conservative, for stable convergence)
- **Weight Decay**: 0.1
- **Adam Beta1**: 0.9
- **Adam Beta2**: 0.95
- **Adam Epsilon**: 1e-08
- **Gradient Clipping**: 1.0
- **Epsilon Mode**: element_wise

**Advanced Learning Rate Schedule** (sketched after this list):

- **Schedule Type**: Polynomial decay with trapezoidal warmup
- **Warmup Steps**: 9,000
- **Decay Power**: 0.5 (square-root decay)
- **Max Steps**: 150,000

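A minimal sketch of the schedule above using the standard 🤗 Transformers scheduler; the actual run used StableAdamW, which is approximated here with plain AdamW and the same hyperparameters:

```python
import torch
from transformers import AutoConfig, AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

# Freshly initialized ModernBERT-base masked LM (pre-training from scratch)
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_config(config)

# Stand-in for StableAdamW, using the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=9_000,      # warmup phase
    num_training_steps=150_000,  # total optimization steps
    power=0.5,                   # square-root polynomial decay after warmup
)
```
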
**ModernBERT Specifications** (a data-collator sketch follows this list):

- **MLM Probability**: 0.30 (higher than the traditional 15%)
- **Max Sequence Length**: 1024
- **Flash Attention 2**: Enabled with optimizations
- **Memory Optimization**: Advanced CUDA allocation strategies

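A minimal sketch (not the original training code) of how the 30% masking probability is typically configured for masked-language-model pre-training with 🤗 Transformers:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

# Mask 30% of input tokens, instead of the usual 15%, as reported above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```
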
### Training Results

The final pre-training metrics were:

- **Final Training Loss**: 0.7648
- **Final Evaluation Loss**: 0.7751
- **Training Infrastructure**: 8x NVIDIA H100 80GB GPUs
- **Total Training Steps**: 150,000
- **Total Training Time**: 97 hours 9 minutes
- **Train/Validation Split**: 90%/10%
- **Effective Training Data**: 21.12GB (with quality-based repetition)
- **Context Length**: 1024 tokens per sequence

### Advanced Training Infrastructure

The model was trained with the following optimizations:

**Flash Attention 2 Optimizations:**

```yaml
FLASH_ATTENTION_FORCE_FP16: "0"    # Use bfloat16
FLASH_ATTENTION_SKIP_RESHAPE: "1"  # Skip unnecessary reshapes
FLASH_ATTENTION_CAUSAL: "0"        # Non-causal attention for BERT-style models
FORCE_FLASH_ATTENTION: "1"         # Force Flash Attention usage
```

**Memory Optimization:**

```yaml
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256,roundup_power2_divisions:16,expandable_segments:True,garbage_collection_threshold:0.8"
```

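A minimal sketch of applying the memory setting above before loading the checkpoint with Flash Attention 2 and BFloat16 (the `FLASH_ATTENTION_*` variables are reproduced from the training configuration and are not set here; a CUDA GPU and the `flash-attn` package are assumed):

```python
import os
import torch
from transformers import AutoModelForMaskedLM

# Apply the allocator setting before any CUDA memory is allocated
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:256,roundup_power2_divisions:16,"
    "expandable_segments:True,garbage_collection_threshold:0.8"
)

# Load the checkpoint in bfloat16 with Flash Attention 2
model = AutoModelForMaskedLM.from_pretrained(
    "novelcore/themida-modernbert-legal-21GB-1024",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```
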
**Distributed Training:**

- **Backend**: NCCL with extended timeout configurations
- **Mixed Precision**: BFloat16 for optimal H100 performance
- **Evaluation Frequency**: Every 5,000 steps
- **Checkpointing**: Every 5,000 steps
- **Logging**: Every 250 steps

## Key Innovations

### ModernBERT Architecture Benefits

1. **Extended Context Window**: 1024 tokens vs 512 in previous models
2. **Flash Attention 2**: Memory-efficient attention for longer sequences
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Higher MLM Probability**: 30% masking for improved representation learning
5. **Trapezoidal LR Schedule**: Optimized learning rate progression

### Quality-Based Data Repetition

Consistent with our previous models:

1. **Highest-quality sources** (legal dictionaries) repeated 4x
2. **Medium-high-quality sources** (court reports) repeated 3x
3. **Medium-quality sources** (EU legal texts) repeated 2x
4. **Lower-quality sources** used once, for diversity

### Training Efficiency Improvements

- **Faster Training**: 97h vs 146h (DeBERTa) despite longer sequences
- **Better Convergence**: Optimized batch sizing and learning rate
- **Memory Efficiency**: Advanced CUDA memory management
- **Stable Training**: StableAdamW and conservative hyperparameters

## Evaluation Results

Pre-training loss and cost comparison across our model family:

| Model | Architecture | Context | Training Loss | Eval Loss | Training Time | Vocab Size |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `Themida-ModernBERT Legal 21G` | ModernBERT-base | 1024 | 0.7648 | 0.7751 | 97h 9m | 50K |
| `Themida-DeBERTa Legal 21G` | DeBERTa-base | 512 | 0.7913 | 0.7314 | 146h 13m | 128K |
| `Themida-RoBERTa Legal 21G` | RoBERTa-base | 512 | 0.617 | 0.573 | 66h 39m | 50K |

*Loss values reflect different architectures, masking rates, and vocabularies and are not directly comparable across models. Downstream task evaluations will be added as results become available.*

## Architecture Comparison: ModernBERT Advantages

### Over RoBERTa

- **2x Longer Context**: 1024 vs 512 tokens for complete document processing
- **Flash Attention 2**: Memory-efficient processing of longer sequences
- **Advanced Optimizer**: StableAdamW vs standard AdamW
- **Optimized MLM**: 30% vs 15% masking probability

### Over DeBERTa

- **Faster Training**: 97h vs 146h with comparable context understanding
- **Memory Efficiency**: Better optimization for large-scale training
- **Stable Convergence**: Conservative hyperparameters with reliable results
- **Modern Optimizations**: Latest attention and memory management techniques

### Unique ModernBERT Features

- **Extended Context Processing**: Handles complete legal documents
- **Memory Optimization**: Advanced CUDA memory management
- **Training Stability**: StableAdamW with element-wise epsilon mode
- **Attention Efficiency**: Flash Attention 2 with custom optimizations

## Intended Uses

### Primary Use Cases

- **Long-form legal document analysis** (up to 1024 tokens)
- **Complete contract processing** without truncation
- **Parliamentary debate analysis** with full context
- **Legal precedent identification** across extended text
- **Regulatory compliance checking** with comprehensive document coverage
- **Legal question answering** with enhanced context understanding

### Enhanced Capabilities

- **Full legal articles** processed without chunking
- **Extended court decisions** analyzed with complete reasoning
- **Complex legislative texts** with multiple cross-references
- **Parliamentary proceedings** with speaker continuity
- **Legal research** with comprehensive document context

### Optimal Use Cases for the 1024-token Context

- **Complete legal contracts** (most fit within 1024 tokens)
- **Court decision summaries** with full reasoning
- **Parliamentary speeches** and debates
- **Legal article analysis** without truncation
- **Regulatory text processing** with full context

270
+
271
+ ## Performance Advantages
272
+
273
+ ### Speed and Efficiency
274
+ - **35% Faster Training**: 97h vs 146h (DeBERTa) with longer contexts
275
+ - **Memory Optimization**: Advanced CUDA allocation for large batches
276
+ - **Flash Attention 2**: Efficient processing of 1024-token sequences
277
+ - **Stable Convergence**: Reliable training with conservative settings
278
+
279
+ ### Quality Improvements
280
+ - **Extended Context**: 2x longer sequences capture complete documents
281
+ - **Better Representations**: 30% MLM probability for enhanced learning
282
+ - **Stable Training**: StableAdamW optimizer with element-wise epsilon
283
+ - **Optimized Architecture**: Modern attention mechanisms and memory management
## Limitations and Considerations

- The model may reflect biases present in Greek legal and governmental texts
- Quality-based repetition may amplify biases from the higher-quality sources
- **Higher memory requirements** for inference due to the 1024-token context
- **Longer processing time** for extended sequences compared to 512-token models
- Performance may degrade on informal or colloquial Greek text
- Limited knowledge of legal developments after the training data cutoff
- Optimized specifically for the Greek legal domain

## Technical Specifications

- **Model Size**: ~139M parameters
- **Architecture**: ModernBERT-base with Flash Attention 2
- **Context Length**: 1024 tokens (2x standard BERT models)
- **Training Time**: 97 hours 9 minutes on 8x H100 80GB GPUs
- **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- **Vocabulary Size**: 50,373 tokens
- **Memory Requirements**: Optimized for H100 GPUs with advanced allocation
- **Inference Speed**: Efficient with Flash Attention 2 optimizations

## Deployment Recommendations

### Hardware Requirements

- **GPU Memory**: Minimum 24GB for inference with long sequences
- **Optimal Hardware**: H100, A100, or other modern GPUs with Flash Attention support
- **Memory Configuration**: Use the CUDA memory-allocation settings listed above

### Performance Tuning

- **Enable Flash Attention 2** for optimal performance
- **Use BFloat16** precision on H100/A100 GPUs
- **Configure memory allocation** with the PYTORCH_CUDA_ALLOC_CONF value shown above
- **Batch sizing**: Adjust to the available GPU memory (an inference sketch follows this list)

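A minimal, illustrative inference sketch following these recommendations (bfloat16, Flash Attention 2, batched inputs on a single GPU; the batch size and inputs are placeholders):

```python
import torch
from transformers import pipeline

# Fill-mask pipeline in bfloat16 with Flash Attention 2 on GPU 0
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-modernbert-legal-21GB-1024",
    torch_dtype=torch.bfloat16,
    device=0,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# Reuse the example sentence from this card, substituting the tokenizer's mask token
text = (
    "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου "
    "αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."
).replace("<mask>", fill_mask.tokenizer.mask_token)

# Batch several passages per call; tune batch_size to the available GPU memory
print(fill_mask([text], batch_size=8))
```
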
## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_modernbert_21g,
  author       = {[Your Name/Organization]},
  title        = {Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/novelcore/themida-modernbert-legal-21GB-1024}},
}
```

## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model. Special recognition goes to Answer.AI for the ModernBERT architecture and to the open-source community for Flash Attention 2 and the StableAdamW optimizer. This model represents the culmination of our research into training strategies for Greek legal language understanding, combining proven data curation techniques with recent architectural innovations.