alexaapo committed
Commit 1f5fd5a · verified · 1 Parent(s): 76e3f60

Update README.md

Files changed (1): README.md (+7 −82)
README.md CHANGED
@@ -16,11 +16,11 @@ base_model:
 - convbert-base
 ---
 
-# Themida-ConvBERT Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition
+# GEM-ConvBERT HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition
 
 ## Model Description
 
-**Themida-ConvBERT Legal 21G** is a ConvBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.
+**GEM-ConvBERT HQ Legal** is a ConvBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. This model employs an innovative **quality-based data repetition strategy**, where higher-quality legal sources are repeated multiple times during training to enhance the model's understanding of premium legal terminology and concepts.
 
 ConvBERT combines the strengths of BERT with span-based dynamic convolution, replacing some self-attention heads with more efficient convolutional layers. This hybrid architecture provides better efficiency and performance, particularly suitable for understanding local patterns in legal text while maintaining global context awareness.
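The quality-based repetition strategy described in this hunk amounts to oversampling the best sources when the pre-training corpus is assembled. A minimal sketch of that idea follows; the quality tiers and repetition factors are hypothetical, since the card does not disclose the actual weighting behind the 21.12GB effective corpus:

```python
from pathlib import Path

# Hypothetical tiers and multipliers; the real weighting used for the
# 21.12GB effective corpus is not published in the model card.
REPEATS = {"high": 3, "medium": 2, "low": 1}

def build_corpus(sources: dict[str, list[Path]], out_file: Path) -> None:
    """Write each source file REPEATS[tier] times into one training corpus."""
    with out_file.open("w", encoding="utf-8") as out:
        for tier, files in sources.items():
            for _ in range(REPEATS[tier]):
                for path in files:
                    out.write(path.read_text(encoding="utf-8"))
                    out.write("\n")
```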
 
@@ -36,8 +36,8 @@ from transformers import pipeline
 # Load the model
 fill_mask = pipeline(
     "fill-mask",
-    model="novelcore/themida-convbert-legal-21G-8-gpu",
-    tokenizer="novelcore/themida-convbert-legal-21G-8-gpu"
+    model="novelcore/gem-convbert-hq-legal",
+    tokenizer="novelcore/gem-convbert-hq-legal"
 )
 
 # Example from a legal context
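For reference, here is a self-contained version of the renamed snippet. The masked sentence is an illustrative placeholder (the card's own example sits outside this hunk), and `[MASK]` assumes the tokenizer uses the BERT-style mask token:

```python
from transformers import pipeline

fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-convbert-hq-legal",
    tokenizer="novelcore/gem-convbert-hq-legal",
)

# "The court issued its [MASK]." -- expecting e.g. "απόφαση" (decision)
for pred in fill_mask("Το δικαστήριο εξέδωσε την [MASK] του."):
    print(f"{pred['token_str']:<15} {pred['score']:.3f}")
```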
@@ -54,8 +54,8 @@ For downstream tasks:
 from transformers import AutoTokenizer, AutoModelForSequenceClassification
 
 # For legal document classification
-tokenizer = AutoTokenizer.from_pretrained("novelcore/themida-convbert-legal-21G-8-gpu")
-model = AutoModelForSequenceClassification.from_pretrained("novelcore/themida-convbert-legal-21G-8-gpu")
+tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
+model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-convbert-hq-legal")
 ```
 
 ## Training Data
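A note on the classification snippet above: the checkpoint is a pre-trained encoder, so `AutoModelForSequenceClassification` attaches a randomly initialized head that still needs fine-tuning. A minimal forward-pass sketch under that assumption, with a hypothetical three-label setup:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
# num_labels=3 is an illustrative choice; the head is untrained until fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(
    "novelcore/gem-convbert-hq-legal", num_labels=3
)

inputs = tokenizer("Παράδειγμα νομικού εγγράφου.", return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)
print(probs)  # meaningless until the head is fine-tuned on labeled documents
```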
@@ -175,79 +175,4 @@ The ConvBERT architecture is particularly well-suited for legal text processing:
 
 ### Training Efficiency
 
-The model achieved exceptional training efficiency, completing training in only **45 hours 32 minutes** - significantly faster than comparable architectures while processing the expanded 21.12GB dataset.
-
-## Evaluation Results
-
-The model shows stable convergence with the quality-based repetition strategy and ConvBERT architecture:
-
-| Model | Architecture | Training Loss | Evaluation Loss | Training Time |
-| :--- | :--- | :--- | :--- | :--- |
-| `Themida-ConvBERT Legal 21G` (this model) | ConvBERT-base | 0.6413 | 0.604455 | 45h 32m |
-
-*Performance on downstream tasks will be updated as evaluation results become available.*
-
-## Intended Uses
-
-### Primary Use Cases
-- Legal document analysis and classification
-- Named entity recognition in Greek legal texts
-- Legal question answering systems
-- Compliance monitoring and regulatory analysis
-- Legal text similarity and retrieval
-- Legal terminology extraction and understanding
-- Legal clause and entity span detection
-
-### Secondary Use Cases
-- General Greek text understanding (with potential performance degradation)
-- Contract analysis and review
-- Legislative text analysis
-- Regulatory compliance checking
-
-### Advantages of ConvBERT + Quality-Based Training
-- **Enhanced legal vocabulary**: Better understanding of sophisticated legal terminology
-- **Improved pattern recognition**: ConvBERT's convolutions excel at legal phrase patterns
-- **Efficient processing**: Faster training and inference than pure attention models
-- **Better span understanding**: Superior performance on legal entity and clause detection
-- **EU legal compliance**: Better handling of European regulatory language
-
-## Limitations and Bias
-
-- The model may reflect biases present in Greek legal and governmental texts
-- Quality-based repetition may amplify biases present in higher-quality sources
-- Performance may degrade on informal or colloquial Greek text
-- Limited knowledge of legal concepts post-training data cutoff
-- Optimized specifically for Greek legal domain; may not generalize well to other domains
-- ConvBERT architecture may require specific fine-tuning approaches different from BERT
-
-## Technical Specifications
-
-- **Model Size**: ~106M parameters
-- **Architecture**: ConvBERT-base (12 layers, 12 attention heads, conv kernel size 9)
-- **Training Time**: 45 hours 32 minutes on 8x A100 GPUs
-- **Effective Dataset Size**: 21.12GB (with quality-based repetition)
-- **Memory Requirements**: Efficient memory usage due to hybrid architecture
-- **Inference Speed**: Faster than pure attention models due to convolutional components
-
-## Model Card Authors
-
-[Your Name / Your Organization's Name]
-
-## Citation
-
-If you use this model in your research, please cite it as follows:
-
-```bibtex
-@misc{your_name_2025_themida_convbert_21g,
-  author = {[Your Name/Organization]},
-  title = {Themida-ConvBERT Legal 21G: A Greek Legal Language Model with Quality-Based Data Repetition},
-  year = {2025},
-  publisher = {Hugging Face},
-  journal = {Hugging Face Hub},
-  howpublished = {\url{https://huggingface.co/novelcore/themida-convbert-legal-21G-8-gpu}},
-}
-```
-
-## Acknowledgments
-
-We thank the Greek government institutions for making their legal texts publicly available, enabling the creation of this specialized language model for the Greek legal domain. Special recognition for the innovative combination of ConvBERT architecture with quality-based data repetition strategy, resulting in exceptional training efficiency and enhanced legal text understanding capabilities.
+The model achieved exceptional training efficiency, completing training in only **45 hours 32 minutes** - significantly faster than comparable architectures while processing the expanded 21.12GB dataset.
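The removed Technical Specifications block pinned the architecture to ConvBERT-base: 12 layers, 12 attention heads, convolution kernel size 9, roughly 106M parameters. A sketch that rebuilds that layout with `transformers` and checks the count; the vocabulary size is an assumption, so the total will only approximate the quoted figure:

```python
from transformers import ConvBertConfig, ConvBertModel

config = ConvBertConfig(
    num_hidden_layers=12,   # per the removed spec section
    num_attention_heads=12,
    conv_kernel_size=9,
    vocab_size=30522,       # assumed; the Greek legal tokenizer may differ
)
model = ConvBertModel(config)
total = sum(p.numel() for p in model.parameters())
print(f"{total / 1e6:.0f}M parameters")
```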
 