alexaapo committed on
Commit 3642ace · verified · Parent(s): 0c226d0

Update README.md

Files changed (1): README.md (+295 −154)

---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- modernbert
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
- flash-attention
- stable-adamw
base_model:
- answerdotai/ModernBERT-base
---

# Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization

## Model Description

**Themida-ModernBERT Legal 21G** is a ModernBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model leverages ModernBERT's architectural innovations, including **Flash Attention 2**, the **StableAdamW optimizer**, a **1024-token context length**, and **advanced memory optimization**, to improve performance on Greek legal document understanding tasks.

Building on our quality-based data repetition strategy, training follows ModernBERT's methodology with a **30% masking probability**, **trapezoidal learning rate scheduling**, and **tuned batch sizing** for stable convergence. The extended 1024-token context window lets the model handle longer legal documents while remaining computationally efficient.

This model is the culmination of our Greek legal language modeling research, combining domain expertise with recent architectural advances in transformer-based language models. It is intended as a base for downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal domain.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the model and its tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-modernbert-legal-21GB-1024",
    tokenizer="novelcore/themida-modernbert-legal-21GB-1024",
)

# Example from a legal context with longer sequence support.
# ("According to Article 15 of the Constitution, the <mask> of human rights
#  constitutes a fundamental obligation of the state within the democratic polity.")
text = "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."

# Get predictions for the masked token
predictions = fill_mask(text)
print(predictions)
```

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification with extended context
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

# The model supports up to 1024 tokens for longer legal documents
```

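Note that the classification head loaded this way is randomly initialized until the model is fine-tuned. As a hypothetical continuation of the snippet above (not part of the original card), a long legal passage can be encoded up to the 1024-token limit and passed through the model:

```python
import torch

# Placeholder for a long Greek legal text (hypothetical input)
long_document = "Το παρόν κείμενο ..."

# Encode up to the model's maximum context length of 1024 tokens
inputs = tokenizer(
    long_document,
    truncation=True,
    max_length=1024,
    return_tensors="pt",
)

with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest-scoring class; label names depend on your fine-tuning setup
print(logits.argmax(dim=-1).item())
```
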
## Training Data

The model was pre-trained on the same comprehensive corpus of Greek text used in our previous models, employing our quality-based data repetition strategy, which increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to 21.12GB through strategic repetition and is now processed as **1024-token sequences** for enhanced context understanding (a sketch of the repetition scheme appears after the table below).

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |

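The corpus-preparation code is not published with this card. Below is a minimal sketch, under the assumption that each source is available as a plain-text file (the file names and the `build_corpus` helper are hypothetical), of how the repetition factors above translate into the effective 21.12GB corpus:

```python
from pathlib import Path

# Hypothetical mapping of source files to the repetition factors in the table above
REPETITION_FACTORS = {
    "raptarchis_legal_dictionary.txt": 4,
    "supreme_court_reports.txt": 3,
    "eurlex_el.txt": 2,
    "fek_government_gazette.txt": 1,
    "parliament_proceedings.txt": 1,
    "europarl_el.txt": 1,
}

def build_corpus(source_dir: str, output_file: str) -> None:
    """Concatenate each source `factor` times to form the effective training corpus."""
    with open(output_file, "w", encoding="utf-8") as out:
        for name, factor in REPETITION_FACTORS.items():
            text = Path(source_dir, name).read_text(encoding="utf-8")
            for _ in range(factor):
                out.write(text)
                out.write("\n")

# build_corpus("raw_sources", "effective_corpus.txt")
```
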
### Enhanced Context Processing

With **1024-token sequences**, this model can process:

- **Complete legal articles** without truncation
- **Full court decisions** with extended reasoning
- **Complex legislative texts** with multiple references
- **Parliamentary debates** with comprehensive context

## Training Procedure

### Model Architecture

The model uses the ModernBERT-base architecture with the following configuration (a snippet for inspecting the published config follows the list):

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Parameters**: ~139M
- **Max Position Embeddings**: 1024
- **Vocabulary Size**: 50,373
- **Flash Attention 2**: Enabled
- **Context Length**: 1024 tokens (2x longer than previous models)

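As a quick sanity check (not part of the original card), these values can be read from the published checkpoint's configuration; the commented values are the ones reported above:

```python
from transformers import AutoConfig

# Load the configuration of the published checkpoint
config = AutoConfig.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

print(config.hidden_size)              # reported above: 768
print(config.num_attention_heads)      # reported above: 12
print(config.num_hidden_layers)        # reported above: 12
print(config.max_position_embeddings)  # reported above: 1024
print(config.vocab_size)               # reported above: 50,373
```
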
### Key Architectural Advantages

ModernBERT's innovations provide significant benefits for legal text processing:

1. **Flash Attention 2**: Memory-efficient attention computation for longer sequences
2. **Extended Context**: 1024-token sequences capture complete legal documents
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Optimized MLM**: 30% masking probability for improved representation learning
5. **Advanced Memory Management**: Optimized CUDA memory allocation for large batches

### Preprocessing

The text was processed into **1024-token chunks** using ModernBERT's tokenizer (vocabulary: 50,373 tokens), providing good coverage of Greek legal terminology while maintaining compatibility with the base architecture.

Higher-quality sources were repeated during the data preparation phase, and each sequence now captures considerably more context per training example; a minimal chunking sketch follows.

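The exact preprocessing pipeline is not included in this card. A minimal sketch, assuming the corpus is already concatenated into plain text, of splitting documents into fixed-length token chunks (the `chunk_token_ids` helper and file name are illustrative):

```python
from transformers import AutoTokenizer

# The tokenizer shipped with the released checkpoint
tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

def chunk_token_ids(text: str, max_length: int = 1024):
    """Split a document into consecutive chunks of at most `max_length` tokens."""
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    # Reserve two positions for the special tokens added around each training sequence
    stride = max_length - 2
    return [ids[i:i + stride] for i in range(0, len(ids), stride)]

# Example: one long legal document becomes several 1024-token training sequences
# chunks = chunk_token_ids(open("effective_corpus.txt", encoding="utf-8").read())
```
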
### Pre-training

The model was pre-trained from scratch for **150,000 steps** on 8x NVIDIA H100 80GB GPUs, using BFloat16 (`bf16`) mixed precision together with the optimization techniques described below. Training took approximately **97 hours and 9 minutes**.

#### Key Training Optimizations

**Batch Size Optimization** (a configuration sketch follows this list):

- **Per-device batch size**: 16 (tuned for H100 memory)
- **Gradient accumulation steps**: 8
- **Effective batch size**: 1,024 sequences (16 × 8 × 8 GPUs)
- **Context length**: 1024 tokens per sequence

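The original training scripts are not published here. The following 🤗 `TrainingArguments` sketch reproduces only the batch-size, precision, and logging settings reported in this card; every other argument is an illustrative assumption:

```python
from transformers import TrainingArguments

# Sketch of the batch, precision, and logging settings reported in this card
training_args = TrainingArguments(
    output_dir="themida-modernbert-legal",   # hypothetical output path
    per_device_train_batch_size=16,          # 16 sequences per GPU
    gradient_accumulation_steps=8,           # 16 x 8 x 8 GPUs = effective batch of 1,024
    max_steps=150_000,
    bf16=True,                               # BFloat16 mixed precision on H100
    evaluation_strategy="steps",
    eval_steps=5_000,
    save_steps=5_000,
    logging_steps=250,
    max_grad_norm=1.0,                       # gradient clipping
    weight_decay=0.1,
)
```
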
**StableAdamW Configuration:**

- **Learning Rate**: 0.0002 (conservative, for stable convergence)
- **Weight Decay**: 0.1
- **Adam Beta1**: 0.9
- **Adam Beta2**: 0.95
- **Adam Epsilon**: 1e-08
- **Gradient Clipping**: 1.0
- **Epsilon Mode**: element_wise

**Advanced Learning Rate Schedule** (sketched after this list):

- **Schedule Type**: Polynomial decay with trapezoidal warmup
- **Warmup Steps**: 9,000
- **Decay Power**: 0.5 (square-root decay)
- **Max Steps**: 150,000

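A minimal sketch of the schedule above using the standard 🤗 Transformers scheduler; the actual run used StableAdamW, which is approximated here with plain AdamW and the same hyperparameters:

```python
import torch
from transformers import AutoConfig, AutoModelForMaskedLM, get_polynomial_decay_schedule_with_warmup

# Freshly initialized ModernBERT-base masked LM (pre-training from scratch)
config = AutoConfig.from_pretrained("answerdotai/ModernBERT-base")
model = AutoModelForMaskedLM.from_config(config)

# Stand-in for StableAdamW, using the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(), lr=2e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

scheduler = get_polynomial_decay_schedule_with_warmup(
    optimizer,
    num_warmup_steps=9_000,      # warmup phase
    num_training_steps=150_000,  # total optimization steps
    power=0.5,                   # square-root polynomial decay after warmup
)
```
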
**ModernBERT Specifications** (a data-collator sketch follows this list):

- **MLM Probability**: 0.30 (higher than the traditional 15%)
- **Max Sequence Length**: 1024
- **Flash Attention 2**: Enabled with optimizations
- **Memory Optimization**: Advanced CUDA allocation strategies

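A minimal sketch (not the original training code) of how the 30% masking probability is typically configured for masked-language-model pre-training with 🤗 Transformers:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-modernbert-legal-21GB-1024")

# Mask 30% of input tokens, instead of the usual 15%, as reported above
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=True,
    mlm_probability=0.30,
)
```
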
### Training Results

The final pre-training metrics were:

- **Final Training Loss**: 0.7648
- **Final Evaluation Loss**: 0.7751
- **Training Infrastructure**: 8x NVIDIA H100 80GB GPUs
- **Total Training Steps**: 150,000
- **Total Training Time**: 97 hours 9 minutes
- **Train/Validation Split**: 90%/10%
- **Effective Training Data**: 21.12GB (with quality-based repetition)
- **Context Length**: 1024 tokens per sequence

### Advanced Training Infrastructure

The model was trained with the following optimizations:

**Flash Attention 2 Optimizations:**

```yaml
FLASH_ATTENTION_FORCE_FP16: "0"    # Use bfloat16
FLASH_ATTENTION_SKIP_RESHAPE: "1"  # Skip unnecessary reshapes
FLASH_ATTENTION_CAUSAL: "0"        # Non-causal attention for BERT-style models
FORCE_FLASH_ATTENTION: "1"         # Force Flash Attention usage
```

**Memory Optimization:**

```yaml
PYTORCH_CUDA_ALLOC_CONF: "max_split_size_mb:256,roundup_power2_divisions:16,expandable_segments:True,garbage_collection_threshold:0.8"
```

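A minimal sketch of applying the memory setting above before loading the checkpoint with Flash Attention 2 and BFloat16 (the `FLASH_ATTENTION_*` variables are reproduced from the training configuration and are not set here; a CUDA GPU and the `flash-attn` package are assumed):

```python
import os
import torch
from transformers import AutoModelForMaskedLM

# Apply the allocator setting before any CUDA memory is allocated
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "max_split_size_mb:256,roundup_power2_divisions:16,"
    "expandable_segments:True,garbage_collection_threshold:0.8"
)

# Load the checkpoint in bfloat16 with Flash Attention 2
model = AutoModelForMaskedLM.from_pretrained(
    "novelcore/themida-modernbert-legal-21GB-1024",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
).to("cuda")
```
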
**Distributed Training:**

- **Backend**: NCCL with extended timeout configurations
- **Mixed Precision**: BFloat16 for optimal H100 performance
- **Evaluation Frequency**: Every 5,000 steps
- **Checkpointing**: Every 5,000 steps
- **Logging**: Every 250 steps

## Key Innovations

### ModernBERT Architecture Benefits

1. **Extended Context Window**: 1024 tokens vs 512 in previous models
2. **Flash Attention 2**: Memory-efficient attention for longer sequences
3. **StableAdamW Optimizer**: Enhanced training stability and convergence
4. **Higher MLM Probability**: 30% masking for improved representation learning
5. **Trapezoidal LR Schedule**: Optimized learning rate progression

### Quality-Based Data Repetition

Consistent with our previous models:

1. **Highest-quality sources** (legal dictionaries) repeated 4x
2. **Medium-high-quality sources** (court reports) repeated 3x
3. **Medium-quality sources** (EU legal texts) repeated 2x
4. **Lower-quality sources** used once, for diversity

### Training Efficiency Improvements

- **Faster Training**: 97h vs 146h (DeBERTa) despite longer sequences
- **Better Convergence**: Optimized batch sizing and learning rate
- **Memory Efficiency**: Advanced CUDA memory management
- **Stable Training**: StableAdamW and conservative hyperparameters

## Evaluation Results

Pre-training loss and cost comparison across our model family:

| Model | Architecture | Context | Training Loss | Eval Loss | Training Time | Vocab Size |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| `Themida-ModernBERT Legal 21G` | ModernBERT-base | 1024 | 0.7648 | 0.7751 | 97h 9m | 50K |
| `Themida-DeBERTa Legal 21G` | DeBERTa-base | 512 | 0.7913 | 0.7314 | 146h 13m | 128K |
| `Themida-RoBERTa Legal 21G` | RoBERTa-base | 512 | 0.617 | 0.573 | 66h 39m | 50K |

*Loss values reflect different architectures, masking rates, and vocabularies and are not directly comparable across models. Downstream task evaluations will be added as results become available.*

## Architecture Comparison: ModernBERT Advantages

### Over RoBERTa

- **2x Longer Context**: 1024 vs 512 tokens for complete document processing
- **Flash Attention 2**: Memory-efficient processing of longer sequences
- **Advanced Optimizer**: StableAdamW vs standard AdamW
- **Optimized MLM**: 30% vs 15% masking probability

### Over DeBERTa

- **Faster Training**: 97h vs 146h with comparable context understanding
- **Memory Efficiency**: Better optimization for large-scale training
- **Stable Convergence**: Conservative hyperparameters with reliable results
- **Modern Optimizations**: Latest attention and memory management techniques

### Unique ModernBERT Features

- **Extended Context Processing**: Handles complete legal documents
- **Memory Optimization**: Advanced CUDA memory management
- **Training Stability**: StableAdamW with element-wise epsilon mode
- **Attention Efficiency**: Flash Attention 2 with custom optimizations

## Intended Uses

### Primary Use Cases

- **Long-form legal document analysis** (up to 1024 tokens)
- **Complete contract processing** without truncation
- **Parliamentary debate analysis** with full context
- **Legal precedent identification** across extended text
- **Regulatory compliance checking** with comprehensive document coverage
- **Legal question answering** with enhanced context understanding

### Enhanced Capabilities

- **Full legal articles** processed without chunking
- **Extended court decisions** analyzed with complete reasoning
- **Complex legislative texts** with multiple cross-references
- **Parliamentary proceedings** with speaker continuity
- **Legal research** with comprehensive document context

### Optimal Use Cases for the 1024-token Context

- **Complete legal contracts** (most fit within 1024 tokens)
- **Court decision summaries** with full reasoning
- **Parliamentary speeches** and debates
- **Legal article analysis** without truncation
- **Regulatory text processing** with full context

270
+
271
+ ## Performance Advantages
272
+
273
+ ### Speed and Efficiency
274
+ - **35% Faster Training**: 97h vs 146h (DeBERTa) with longer contexts
275
+ - **Memory Optimization**: Advanced CUDA allocation for large batches
276
+ - **Flash Attention 2**: Efficient processing of 1024-token sequences
277
+ - **Stable Convergence**: Reliable training with conservative settings
278
+
279
+ ### Quality Improvements
280
+ - **Extended Context**: 2x longer sequences capture complete documents
281
+ - **Better Representations**: 30% MLM probability for enhanced learning
282
+ - **Stable Training**: StableAdamW optimizer with element-wise epsilon
283
+ - **Optimized Architecture**: Modern attention mechanisms and memory management
## Limitations and Considerations

- The model may reflect biases present in Greek legal and governmental texts
- Quality-based repetition may amplify biases from the higher-quality sources
- **Higher memory requirements** for inference due to the 1024-token context
- **Longer processing time** for extended sequences compared to 512-token models
- Performance may degrade on informal or colloquial Greek text
- Limited knowledge of legal developments after the training data cutoff
- Optimized specifically for the Greek legal domain

## Technical Specifications

- **Model Size**: ~139M parameters
- **Architecture**: ModernBERT-base with Flash Attention 2
- **Context Length**: 1024 tokens (2x standard BERT models)
- **Training Time**: 97 hours 9 minutes on 8x H100 80GB GPUs
- **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- **Vocabulary Size**: 50,373 tokens
- **Memory Requirements**: Optimized for H100 GPUs with advanced allocation
- **Inference Speed**: Efficient with Flash Attention 2 optimizations

## Deployment Recommendations

### Hardware Requirements

- **GPU Memory**: Minimum 24GB for inference with long sequences
- **Optimal Hardware**: H100, A100, or other modern GPUs with Flash Attention support
- **Memory Configuration**: Use the CUDA memory-allocation settings listed above

### Performance Tuning

- **Enable Flash Attention 2** for optimal performance
- **Use BFloat16** precision on H100/A100 GPUs
- **Configure memory allocation** with the PYTORCH_CUDA_ALLOC_CONF value shown above
- **Batch sizing**: Adjust to the available GPU memory (an inference sketch follows this list)

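A minimal, illustrative inference sketch following these recommendations (bfloat16, Flash Attention 2, batched inputs on a single GPU; the batch size and inputs are placeholders):

```python
import torch
from transformers import pipeline

# Fill-mask pipeline in bfloat16 with Flash Attention 2 on GPU 0
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/themida-modernbert-legal-21GB-1024",
    torch_dtype=torch.bfloat16,
    device=0,
    model_kwargs={"attn_implementation": "flash_attention_2"},
)

# Reuse the example sentence from this card, substituting the tokenizer's mask token
text = (
    "Σύμφωνα με το άρθρο 15 του Συντάγματος, η <mask> των δικαιωμάτων του ανθρώπου "
    "αποτελεί βασική υποχρέωση του κράτους στο πλαίσιο της δημοκρατικής πολιτείας."
).replace("<mask>", fill_mask.tokenizer.mask_token)

# Batch several passages per call; tune batch_size to the available GPU memory
print(fill_mask([text], batch_size=8))
```
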
## Model Card Authors

[Your Name / Your Organization's Name]

## Citation

If you use this model in your research, please cite it as follows:

```bibtex
@misc{your_name_2025_themida_modernbert_21g,
  author       = {[Your Name/Organization]},
  title        = {Themida-ModernBERT Legal 21G: A Greek Legal Language Model with Advanced Optimization},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/novelcore/themida-modernbert-legal-21GB-1024}},
}
```

## Acknowledgments

We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model. Special recognition goes to Answer.AI for the ModernBERT architecture and to the open-source community for Flash Attention 2 and the StableAdamW optimizer. This model represents the culmination of our research into training strategies for Greek legal language understanding, combining proven data curation techniques with recent architectural innovations.