Update README.md
README.md
CHANGED
````diff
@@ -16,11 +16,11 @@ base_model:
 - convbert-base
 ---
 
-#
+# GEM-ConvBERT HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition
 
 ## Model Description
 
-**
+**GEM-ConvBERT HQ Legal** is a ConvBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.
 
 ConvBERT combines the strengths of BERT with span-based dynamic convolution, replacing some self-attention heads with more efficient convolutional layers. This hybrid architecture provides better efficiency and performance, particularly suitable for understanding local patterns in legal text while maintaining global context awareness.
 
````
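As a quick sanity check on the description added above, here is a minimal sketch (not part of the diff) that loads the checkpoint and inspects its ConvBERT-specific hyperparameters. The repo id is taken from the "+" lines of this commit; the attribute names follow transformers' standard `ConvBertConfig` and are assumptions about this particular checkpoint.

```python
# Minimal sketch: confirm the hybrid convolution/attention setup described in the card.
from transformers import AutoConfig, AutoModel, AutoTokenizer

repo_id = "novelcore/gem-convbert-hq-legal"  # repo id introduced in this commit

config = AutoConfig.from_pretrained(repo_id)
print(config.model_type)           # expected "convbert"
print(config.num_hidden_layers)    # 12 for a ConvBERT-base model
print(config.num_attention_heads)  # 12 for a ConvBERT-base model
print(config.conv_kernel_size)     # span-based dynamic convolution kernel (9 by default)

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModel.from_pretrained(repo_id)
```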
````diff
@@ -36,8 +36,8 @@ from transformers import pipeline
 # Load the model
 fill_mask = pipeline(
     "fill-mask",
-    model="novelcore/
-    tokenizer="novelcore/
+    model="novelcore/gem-convbert-hq-legal",
+    tokenizer="novelcore/gem-convbert-hq-legal"
 )
 
 # Example from a legal context
````
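The hunk above only swaps the checkpoint name in the fill-mask snippet. For context, a hedged end-to-end sketch of how the updated pipeline would be called: the Greek example sentence is illustrative rather than the README's own example, and the mask token is read from the tokenizer instead of being assumed.

```python
# Hedged usage sketch mirroring the pipeline construction edited in this hunk.
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-convbert-hq-legal",
    tokenizer="novelcore/gem-convbert-hq-legal",
)

# "The court issued its [MASK]." Expect a legal term such as "απόφαση" (decision).
masked = f"Το δικαστήριο εξέδωσε την {fill_mask.tokenizer.mask_token} του."
for prediction in fill_mask(masked):
    print(prediction["token_str"], round(prediction["score"], 3))
```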
````diff
@@ -54,8 +54,8 @@ For downstream tasks:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
 # For legal document classification
-tokenizer = AutoTokenizer.from_pretrained("novelcore/
-model = AutoModelForSequenceClassification.from_pretrained("novelcore/
+tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
+model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-convbert-hq-legal")
 ```
 
 ## Training Data
````
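The downstream-task snippet above likewise only changes the checkpoint name. A hedged sketch of what it sets up: the sequence-classification head is newly initialised on top of the pre-trained encoder, so `num_labels` and the example text below are placeholders for whatever the fine-tuning task supplies.

```python
# Hedged sketch: outputs are meaningless until the classification head is fine-tuned.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

repo_id = "novelcore/gem-convbert-hq-legal"  # repo id introduced in this commit
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSequenceClassification.from_pretrained(repo_id, num_labels=3)  # e.g. 3 document classes

inputs = tokenizer("Παράδειγμα νομικού κειμένου.", return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.argmax(dim=-1))  # predicted class id per input
```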
````diff
@@ -175,79 +175,4 @@ The ConvBERT architecture is particularly well-suited for legal text processing:
 
 ### Training Efficiency
 
-The model achieved exceptional training efficiency, completing training in only **45 hours 32 minutes** - significantly faster than comparable architectures while processing the expanded 21.12GB dataset.
-
-## Evaluation Results
-
-The model shows stable convergence with the quality-based repetition strategy and ConvBERT architecture:
-
-| Model | Architecture | Training Loss | Evaluation Loss | Training Time |
-| :--- | :--- | :--- | :--- | :--- |
-| `Themida-ConvBERT Legal 21G` (this model) | ConvBERT-base | 0.6413 | 0.604455 | 45h 32m |
-
-*Performance on downstream tasks will be updated as evaluation results become available.*
-
-## Intended Uses
-
-### Primary Use Cases
-- Legal document analysis and classification
-- Named entity recognition in Greek legal texts
-- Legal question answering systems
-- Compliance monitoring and regulatory analysis
-- Legal text similarity and retrieval
-- Legal terminology extraction and understanding
-- Legal clause and entity span detection
-
-### Secondary Use Cases
-- General Greek text understanding (with potential performance degradation)
-- Contract analysis and review
-- Legislative text analysis
-- Regulatory compliance checking
-
-### Advantages of ConvBERT + Quality-Based Training
-- **Enhanced legal vocabulary**: Better understanding of sophisticated legal terminology
-- **Improved pattern recognition**: ConvBERT's convolutions excel at legal phrase patterns
-- **Efficient processing**: Faster training and inference than pure attention models
-- **Better span understanding**: Superior performance on legal entity and clause detection
-- **EU legal compliance**: Better handling of European regulatory language
-
-## Limitations and Bias
-
-- The model may reflect biases present in Greek legal and governmental texts
-- Quality-based repetition may amplify biases present in higher-quality sources
-- Performance may degrade on informal or colloquial Greek text
-- Limited knowledge of legal concepts post-training data cutoff
-- Optimized specifically for Greek legal domain; may not generalize well to other domains
-- ConvBERT architecture may require specific fine-tuning approaches different from BERT
-
-## Technical Specifications
-
-- **Model Size**: ~106M parameters
-- **Architecture**: ConvBERT-base (12 layers, 12 attention heads, conv kernel size 9)
-- **Training Time**: 45 hours 32 minutes on 8x A100 GPUs
-- **Effective Dataset Size**: 21.12GB (with quality-based repetition)
-- **Memory Requirements**: Efficient memory usage due to hybrid architecture
-- **Inference Speed**: Faster than pure attention models due to convolutional components
-
-## Model Card Authors
-
-[Your Name / Your Organization's Name]
-
-## Citation
-
-If you use this model in your research, please cite it as follows:
-
-```bibtex
-@misc{your_name_2025_themida_convbert_21g,
-author = {[Your Name/Organization]},
-title = {Themida-ConvBERT Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition},
-year = {2025},
-publisher = {Hugging Face},
-journal = {Hugging Face Hub},
-howpublished = {\url{https://huggingface.co/novelcore/themida-convbert-legal-21G-8-gpu}},
-}
-```
-
-## Acknowledgments
-
-We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain. Special recognition for the innovative combination of ConvBERT architecture with quality-based data repetition strategy, resulting in exceptional training efficiency and enhanced legal text understanding capabilities.
+The model achieved exceptional training efficiency, completing training in only **45 hours 32 minutes** - significantly faster than comparable architectures while processing the expanded 21.12GB dataset.
````