---
license: apache-2.0
language:
- el
pipeline_tag: fill-mask
library_name: transformers
tags:
- convbert
- fill-mask
- greek
- legal
- masked-lm
- data-repetition
- convolution
base_model:
- convbert-base
---

# GEM-ConvBERT HQ Legal: A Greek Legal Language Model with Quality-Based Data Repetition

## Model Description

**GEM-ConvBERT HQ Legal** is a ConvBERT-base model pre-trained from scratch on a strategically curated 21GB corpus of Greek legal, parliamentary, and governmental text. The model employs a **quality-based data repetition strategy**: higher-quality legal sources are repeated several times during training to deepen the model's command of premium legal terminology and concepts.

ConvBERT combines the strengths of BERT with span-based dynamic convolution, replacing some self-attention heads with more efficient convolutional layers. This hybrid architecture improves both efficiency and performance, and is particularly well suited to capturing local patterns in legal text while maintaining global context awareness.

This model was trained as part of a research project and is intended for fine-tuning on downstream tasks such as Named Entity Recognition (NER), text classification, and question answering in the legal domain.

## How to Get Started

You can use this model directly with the `fill-mask` pipeline:

```python
from transformers import pipeline

# Load the fill-mask pipeline with the model and its matching tokenizer
fill_mask = pipeline(
    "fill-mask",
    model="novelcore/gem-convbert-hq-legal",
    tokenizer="novelcore/gem-convbert-hq-legal"
)

# Example from a legal/parliamentary context; in English:
# "Mr. Mitsotakis <mask> that the government fully respects
# the decisions of the Council of State."
text = "Ο κ. Μητσοτάκης <mask> ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."

# Get the top predictions for the masked token
predictions = fill_mask(text)
print(predictions)
```
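
If you prefer not to hard-code the mask token string, you can read it from the loaded tokenizer; a minimal sketch:

```python
# Build the prompt around the tokenizer's own mask token, so the
# string always matches what the model was actually trained with.
mask = fill_mask.tokenizer.mask_token
text = f"Ο κ. Μητσοτάκης {mask} ότι η κυβέρνηση σέβεται πλήρως τις αποφάσεις του Συμβουλίου της Επικρατείας."
for pred in fill_mask(text, top_k=5):
    print(pred["token_str"], round(pred["score"], 3))
```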

For downstream tasks:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# For legal document classification
tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
model = AutoModelForSequenceClassification.from_pretrained("novelcore/gem-convbert-hq-legal")
```
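
Since NER is one of the target tasks named above, a token-classification head can be attached the same way. A minimal sketch, with a purely illustrative label set (the base checkpoint ships no fine-tuned labels):

```python
from transformers import AutoModelForTokenClassification

# Hypothetical label scheme for illustration only; a real fine-tuning
# run would take its labels from an annotated Greek legal NER dataset.
labels = ["O", "B-LEG_REF", "I-LEG_REF", "B-ORG", "I-ORG"]

ner_model = AutoModelForTokenClassification.from_pretrained(
    "novelcore/gem-convbert-hq-legal",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
```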

## Training Data

The model was pre-trained on a comprehensive corpus of Greek text compiled from various legal and governmental sources, with a **quality-based data repetition strategy** that increases exposure to higher-quality legal content. The original 16.75GB corpus was expanded to an effective 21.12GB through strategic repetition.

### Quality-Based Data Repetition Strategy

| Dataset | Original Size (GB) | Quality Level | Repetition Factor | Effective Size (GB) |
| :--- | :--- | :--- | :--- | :--- |
| **Raptarchis Legal Dictionary** | 0.35 | **Best** | **4x** | **1.40** |
| **Political Reports of the Supreme Court** | 1.20 | **Medium-Best** | **3x** | **3.60** |
| **Eur-Lex (Greek Content)** | 0.92 | **Medium** | **2x** | **1.84** |
| FEK - Greek Government Gazette | 11.00 | Low | 1x | 11.00 |
| Greek Parliament Proceedings | 2.90 | Low-Medium | 1x | 2.90 |
| Europarl (Greek Content) | 0.38 | Low | 1x | 0.38 |
| **TOTAL** | **16.75** | **-** | **-** | **21.12** |

### Rationale for Data Repetition

The quality-based repetition strategy enhances the model's exposure to:

- **Premium legal terminology** from the Raptarchis Legal Dictionary (4x repetition)
- **High-quality judicial reasoning** from Supreme Court reports (3x repetition)
- **EU legal concepts** from Eur-Lex content (2x repetition)

This approach ensures the model develops a stronger understanding of sophisticated legal language while maintaining exposure to the broader legal corpus.
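
The card does not publish the corpus-assembly script, but the repetition factors above are simple to apply when building the training file. A minimal sketch, assuming one plain-text file per source (all file names here are hypothetical):

```python
from pathlib import Path

# Repetition factors from the table above; file names are illustrative.
REPETITION = {
    "raptarchis_dictionary.txt": 4,
    "supreme_court_reports.txt": 3,
    "eurlex_el.txt": 2,
    "fek_gazette.txt": 1,
    "parliament_proceedings.txt": 1,
    "europarl_el.txt": 1,
}

with open("corpus_expanded.txt", "w", encoding="utf-8") as out:
    for name, factor in REPETITION.items():
        text = Path(name).read_text(encoding="utf-8")
        for _ in range(factor):
            # Higher-quality sources are written out more than once.
            out.write(text)
            out.write("\n")
```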

## Training Procedure

### Model Architecture

The model uses the ConvBERT-base architecture with the following configuration (see the sketch after the list):

- **Hidden Size**: 768
- **Attention Heads**: 12
- **Hidden Layers**: 12
- **Intermediate Size**: 3072
- **Conv Kernel Size**: 9
- **Num Conv Groups**: 1
- **Parameters**: ~106M
- **Vocabulary Size**: 50,264
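
These values map directly onto `transformers.ConvBertConfig`; a minimal sketch, with all remaining fields left at library defaults (an assumption about the released checkpoint):

```python
from transformers import ConvBertConfig, ConvBertForMaskedLM

# Mirror the architecture listed above; unspecified fields
# (e.g. head_ratio, embedding_size) keep ConvBertConfig defaults.
config = ConvBertConfig(
    vocab_size=50264,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    intermediate_size=3072,
    conv_kernel_size=9,
    num_groups=1,
)
model = ConvBertForMaskedLM(config)
print(f"~{model.num_parameters() / 1e6:.0f}M parameters")
```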

### ConvBERT Architecture Advantages

ConvBERT's hybrid architecture provides several benefits for legal text processing:

- **Efficient Local Pattern Recognition**: Convolutional layers excel at capturing the local linguistic patterns common in legal terminology
- **Global Context Awareness**: Self-attention mechanisms maintain understanding of document-wide context
- **Computational Efficiency**: The mixed attention-convolution approach reduces computational complexity
- **Better Span Representation**: Dynamic convolution enhances understanding of legal entity spans and clause structures

### Preprocessing

The text was tokenized using a custom `WordPiece` tokenizer trained from scratch on the Greek legal corpus, with a vocabulary of 50,264 tokens optimized for Greek legal terminology.
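
The exact tokenizer-training setup is not published; below is a minimal sketch of training such a WordPiece tokenizer with the `tokenizers` library, assuming the corpus file from the repetition sketch above (the special-token names follow BERT conventions and are an assumption):

```python
from tokenizers import Tokenizer, normalizers, pre_tokenizers
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer

# Train a 50,264-token WordPiece vocabulary on the Greek legal corpus.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizers.NFC()
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = WordPieceTrainer(
    vocab_size=50264,
    # Special-token names here are an assumption, not confirmed by the card.
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["corpus_expanded.txt"], trainer=trainer)
tokenizer.save("tokenizer.json")
```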

The data was processed into fixed-size chunks of 512 tokens, respecting document boundaries to ensure contextual coherence. Higher-quality sources were strategically repeated during the data preparation phase, as described above.
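
A minimal sketch of this chunking step with the `datasets` library, assuming each line of the corpus file is one document (the real pipeline's document segmentation is not published):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("novelcore/gem-convbert-hq-legal")
dataset = load_dataset("text", data_files={"train": "corpus_expanded.txt"})

def chunk_documents(batch):
    encoded = tokenizer(batch["text"], add_special_tokens=False)
    chunks = []
    for ids in encoded["input_ids"]:
        # Split each document into 512-token blocks; no block ever
        # crosses a document boundary.
        chunks.extend(ids[i:i + 512] for i in range(0, len(ids), 512))
    return {"input_ids": chunks}

lm_dataset = dataset["train"].map(
    chunk_documents, batched=True, remove_columns=["text"]
)
```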

### Pre-training

The model was pre-trained from scratch for **200,000 steps** on 8x NVIDIA A100 40GB GPUs, using BFloat16 (`bf16`) mixed precision for stability and speed. Training took approximately **45 hours and 32 minutes** to complete.

The key hyperparameters were (see the sketch after the list):

- **Learning Rate**: 2e-4 (0.0002) with a linear warmup of 12,000 steps
- **Batch Size**: Effective batch size of 768 (`per_device_train_batch_size: 32`, `gradient_accumulation_steps: 3`, 8 GPUs)
- **Optimizer**: AdamW with standard parameters
- **Weight Decay**: 0.01
- **Max Sequence Length**: 512
- **Max Steps**: 200,000
- **Warmup Steps**: 12,000
- **MLM Probability**: 0.15
- **Max Gradient Norm**: 1.0
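
These settings map naturally onto `transformers.TrainingArguments` plus the standard MLM data collator; a minimal sketch of the wiring, reusing the `model`, `tokenizer`, and `lm_dataset` objects from the sketches above (the project's actual training script is not published):

```python
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Dynamic masking of 15% of input tokens for the MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# 90%/10% train/validation split, as reported under Training Results.
split = lm_dataset.train_test_split(test_size=0.1)

args = TrainingArguments(
    output_dir="gem-convbert-hq-legal",
    learning_rate=2e-4,
    lr_scheduler_type="linear",
    warmup_steps=12_000,
    max_steps=200_000,
    per_device_train_batch_size=32,
    gradient_accumulation_steps=3,  # 32 x 3 x 8 GPUs = 768 effective
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    eval_strategy="steps",  # "evaluation_strategy" on older transformers releases
    eval_steps=4_000,
    save_steps=4_000,
    logging_steps=200,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=split["train"],
    eval_dataset=split["test"],
    data_collator=collator,
)
trainer.train()
```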

### Training Results

The model achieved the following performance metrics:

- **Final Training Loss**: 0.6413
- **Final Evaluation Loss**: 0.604455
- **Training Infrastructure**: 8x NVIDIA A100 40GB GPUs
- **Total Training Steps**: 200,000
- **Total Training Time**: 45 hours 32 minutes
- **Train/Validation Split**: 90%/10%
- **Effective Training Data**: 21.12GB (with quality-based repetition)

### Training Infrastructure

The model was trained using distributed training with the following configuration:

- **Backend**: NCCL for efficient multi-GPU communication
- **Mixed Precision**: BFloat16 for improved training stability
- **Evaluation Frequency**: Every 4,000 steps
- **Checkpointing**: Every 4,000 steps
- **Logging**: Every 200 steps

## Key Innovations

### Quality-Based Data Repetition

This model introduces a novel **quality-based data repetition strategy** in which:

1. **Highest-quality sources** (legal dictionaries) are repeated 4x for maximum terminology exposure
2. **Medium-high-quality sources** (court reports) are repeated 3x for judicial reasoning patterns
3. **Medium-quality sources** (EU legal texts) are repeated 2x for regulatory language
4. **Lower-quality sources** are used once to maintain diversity

This approach yields roughly **26% more effective training data** (21.12GB vs. 16.75GB) while keeping the pipeline computationally efficient.

### ConvBERT Architecture for Legal Text

The ConvBERT architecture is particularly well suited to legal text processing:

- **Local Legal Pattern Recognition**: Convolutional layers efficiently capture recurring legal phrases and terminology patterns
- **Clause Structure Understanding**: Dynamic convolution helps the model capture legal document structures and clause relationships
- **Computational Efficiency**: The hybrid attention-convolution approach provides faster training and inference than pure-attention models
- **Enhanced Entity Recognition**: Better span-based representations improve legal entity and concept identification

### Training Efficiency

Training completed in **45 hours 32 minutes** while processing the expanded 21.12GB dataset, a pace that reflects ConvBERT's reduced computational complexity relative to pure-attention models of comparable size.