alexaapo committed
Commit 7687f52 · verified · 1 Parent(s): 2f30ab4

Update README.md

Files changed (1)
  1. README.md +7 -106
README.md CHANGED
@@ -15,11 +15,11 @@ base_model:
  - roberta-base
  ---

- # Themida-RoBERTa Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition
+ # GEM-RoBERTa HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition

  ## Model Description

- **Themida-RoBERTa Legal 21G** is a RoBERTa-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.
+ **GEM-RoBERTa HQ Legal** is a RoBERTa-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.

  This model was trained as part of a research project and has been optimized for downstream tasks such as Named Entity Recognition (NER), Text Classification, and Question Answering within the legal field. The RoBERTa architecture provides enhanced performance through dynamic masking and removal of the Next Sentence Prediction (NSP) task, focusing entirely on Masked Language Modeling (MLM).

@@ -33,8 +33,8 @@ from transformers import pipeline
  # Load the model
  fill_mask = pipeline(
  "fill-mask",
- model="novelcore/themida-roberta-legal-21G-8-gpu",
- tokenizer="novelcore/themida-roberta-legal-21G-8-gpu"
+ model="novelcore/gem-roberta-hq-legal",
+ tokenizer="novelcore/gem-roberta-hq-legal"
  )

  # Example from a legal context
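Since the hunk above is cut off before the README's actual masked-sentence example, here is a self-contained sketch of the same fill-mask usage for orientation; the Greek example sentence and the `top_k` value are illustrative assumptions, not taken from the model card.

```python
# Illustrative only: a self-contained version of the fill-mask usage shown in the diff.
# The Greek sentence below ("The court issued its <mask>.") is an assumed example.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-roberta-hq-legal",
    tokenizer="novelcore/gem-roberta-hq-legal",
)

text = "Το δικαστήριο εξέδωσε την <mask> του."  # hypothetical legal-domain sentence
for prediction in fill_mask(text, top_k=3):
    print(prediction["token_str"], round(prediction["score"], 3))
```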
@@ -51,8 +51,8 @@ For downstream tasks:
  from transformers import AutoTokenizer, AutoModelForSequenceClassification

  # For legal document classification
- tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-roberta-legal-21G-8-gpu")
- model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-roberta-legal-21G-8-gpu")
+ tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-roberta-hq-legal")
+ model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-roberta-hq-legal")
  ```

  ## Training Data
@@ -125,103 +125,4 @@ The model achieved the following performance metrics:
  - **Total Training Steps**: 150,000
  - **Total Training Time**: 66 hours 39 minutes
  - **Train/Validation Split**: 95%/5%
- - **Effective Training Data**: 21.12GB (with quality-based repetition)
-
- ### Training Infrastructure
-
- The model was trained using distributed training with the following optimizations:
-
- - **Backend**: NCCL for efficient multi-GPU communication
- - **Mixed Precision**: BFloat16 for improved training stability
- - **Evaluation Frequency**: Every 5,000 steps
- - **Checkpointing**: Every 5,000 steps
- - **Logging**: Every 250 steps
-
- ## Key Innovations
-
- ### Quality-Based Data Repetition
-
- This model introduces a novel **quality-based data repetition strategy** where:
-
- 1. **Highest quality sources** (legal dictionaries) are repeated 4x for maximum terminology exposure
- 2. **Medium-high quality sources** (court reports) are repeated 3x for judicial reasoning patterns
- 3. **Medium quality sources** (EU legal texts) are repeated 2x for regulatory language
- 4. **Lower quality sources** are used once to maintain diversity
-
- This approach resulted in **25% more effective training data** (21.12GB vs 16.75GB) while maintaining computational efficiency.
-
- ### Training Efficiency
-
- Despite the larger effective dataset, the model trained **47% faster** than the previous large variant (66h 39m vs 126h 19m) due to the more efficient RoBERTa-base architecture while maintaining comparable performance quality.
-
- ## Evaluation Results
-
- The model shows stable convergence with the quality-based repetition strategy, achieving competitive performance metrics:
-
- | Model | Architecture | Training Loss | Evaluation Loss | Training Time |
- | :--- | :--- | :--- | :--- | :--- |
- | `Themida-RoBERTa Legal 21G` (this model) | RoBERTa-base | 0.617 | 0.573035 | 66h 39m |
-
- *Performance on downstream tasks will be updated as evaluation results become available.*
-
- ## Intended Uses
-
- ### Primary Use Cases
- - Legal document analysis and classification
- - Named entity recognition in Greek legal texts
- - Legal question answering systems
- - Compliance monitoring and regulatory analysis
- - Legal text similarity and retrieval
- - Legal terminology extraction and understanding
-
- ### Secondary Use Cases
- - General Greek text understanding (with potential performance degradation)
- - Contract analysis and review
- - Legislative text analysis
- - Regulatory compliance checking
-
- ### Advantages of Quality-Based Training
- - **Enhanced legal vocabulary**: Better understanding of sophisticated legal terminology
- - **Improved judicial reasoning**: Stronger grasp of court decision patterns
- - **EU legal compliance**: Better handling of European regulatory language
- - **Computational efficiency**: Faster training than larger architectures
-
- ## Limitations and Bias
-
- - The model may reflect biases present in Greek legal and governmental texts
- - Quality-based repetition may amplify biases present in higher-quality sources
- - Performance may degrade on informal or colloquial Greek text
- - Limited knowledge of legal concepts post-training data cutoff
- - Optimized specifically for Greek legal domain; may not generalize well to other domains
-
- ## Technical Specifications
-
- - **Model Size**: ~125M parameters
- - **Architecture**: RoBERTa-base (12 layers, 12 attention heads)
- - **Training Time**: 66 hours 39 minutes on 8x A100 GPUs
- - **Effective Dataset Size**: 21.12GB (with quality-based repetition)
- - **Memory Requirements**: More efficient than large models for fine-tuning
- - **Inference Speed**: Faster than large models due to base architecture
-
- ## Model Card Authors
-
- [Your Name / Your Organization's Name]
-
- ## Citation
-
- If you use this model in your research, please cite it as follows:
-
- ```bibtex
- @misc{your_name_2025_themida_roberta_21g,
- author = {[Your Name/Organization]},
- title = {Themida-RoBERTa Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition},
- year = {2025},
- publisher = {Hugging Face},
- journal = {Hugging Face Hub},
- howpublished = {\url{https://huggingface.co/novelcore/themida-roberta-legal-21G-8-gpu}},
- }
- ```
-
- ## Acknowledgments
-
- We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain. Special acknowledgment for the innovative quality-based data repetition strategy that enhanced training efficiency while improving model performance on high-quality legal content.
+ - **Effective Training Data**: 21.12GB (with quality-based repetition)
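For context on the section removed above: the quality-based repetition it describes amounts to oversampling higher-quality sources before pre-training. Below is a minimal sketch of that idea. The 4x/3x/2x/1x repetition counts come from the removed README text; the directory layout, tier names, and helper function are hypothetical and only for illustration.

```python
# A minimal sketch of quality-based data repetition, assuming plain-text corpus
# files grouped into folders by quality tier. Only the repetition counts
# (4x/3x/2x/1x) come from the README; paths and tier names are hypothetical.
from pathlib import Path

REPETITIONS = {
    "legal_dictionaries": 4,  # highest quality: maximum terminology exposure
    "court_reports": 3,       # medium-high quality: judicial reasoning patterns
    "eu_legal_texts": 2,      # medium quality: regulatory language
    "other_sources": 1,       # lower quality: kept once for diversity
}

def build_training_file_list(corpus_root: str) -> list[Path]:
    """Return the pre-training file list with higher-quality sources repeated."""
    files: list[Path] = []
    for tier, repeats in REPETITIONS.items():
        tier_files = sorted(Path(corpus_root, tier).glob("*.txt"))
        files.extend(tier_files * repeats)  # repeat the whole tier N times
    return files

if __name__ == "__main__":
    file_list = build_training_file_list("greek_legal_corpus")
    print(f"{len(file_list)} files after quality-based repetition")
```

Repeating whole sources up front, rather than reweighting samples at batch time, is consistent with the removed text's framing of an enlarged "effective" corpus (21.12GB versus 16.75GB raw) fed to an otherwise standard MLM pre-training run.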
 
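The training-infrastructure bullets in the removed section (NCCL backend, BFloat16 mixed precision, evaluation and checkpointing every 5,000 steps, logging every 250 steps, 150,000 total steps) map fairly directly onto Hugging Face `TrainingArguments`. The sketch below reconstructs only those stated values; batch size, learning rate, and the output path are not given anywhere in this diff and are placeholders.

```python
# Plausible reconstruction of the training setup described in the removed README
# section. Values marked "from README" are stated there; everything else is a
# placeholder assumption, not the authors' actual configuration.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./gem-roberta-hq-legal",  # placeholder path
    max_steps=150_000,                    # from README: total training steps
    bf16=True,                            # from README: BFloat16 mixed precision
    eval_strategy="steps",
    eval_steps=5_000,                     # from README: evaluation every 5,000 steps
    save_steps=5_000,                     # from README: checkpoint every 5,000 steps
    logging_steps=250,                    # from README: logging every 250 steps
    ddp_backend="nccl",                   # from README: NCCL multi-GPU communication
    per_device_train_batch_size=32,       # placeholder: not stated in the diff
    learning_rate=1e-4,                   # placeholder: not stated in the diff
)
```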