Migration from Hierarchical BERT to DeBERTa-base
Summary
Successfully migrated the codebase from using BERT-base-uncased to DeBERTa-base (microsoft/deberta-base).
Changes Made
1. Configuration (config.py)
- Changed model name: `bert_model_name` from `"bert-base-uncased"` to `"microsoft/deberta-base"` (sketched below)
- Updated documentation: references to "Legal-BERT" updated to "Legal-DeBERTa"
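A minimal sketch of the change, assuming `config.py` exposes the model name as a simple attribute (the repo's actual config layout may differ):

```python
# config.py -- minimal sketch; the real config layout may differ.
# The attribute keeps its historical name but now points at DeBERTa.
bert_model_name = "microsoft/deberta-base"  # previously "bert-base-uncased"
```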
2. Model Architecture (model.py)
- Updated imports and docstrings: Changed references from BERT to DeBERTa
- Modified forward pass: DeBERTa does not expose `pooler_output` like BERT, so the code now uses `last_hidden_state[:, 0, :]` (the [CLS] token) instead (see the sketch after this list)
- Updated both model classes:
  - `FullyLearningBasedLegalBERT`: now uses DeBERTa
  - `HierarchicalLegalBERT`: now uses DeBERTa hierarchically
- Fixed tokenizer: default model changed to `"microsoft/deberta-base"`
- Dynamic hidden size: the model now reads the hidden size from the encoder config (still 768 for DeBERTa-base)
3. Training Scripts (train.py, trainer.py)
- Updated documentation and print statements to reference DeBERTa instead of BERT
Key Technical Differences
BERT vs DeBERTa
| Feature | BERT | DeBERTa |
|---|---|---|
| Model | `bert-base-uncased` | `microsoft/deberta-base` |
| Hidden Size | 768 | 768 |
| Pooler Output | ✅ Available | ❌ Not available |
| CLS Token | `outputs.pooler_output` | `outputs.last_hidden_state[:, 0, :]` |
| Attention | Standard self-attention | Disentangled attention |
Why DeBERTa?
- Improved Performance: DeBERTa uses a disentangled attention mechanism
- Better Context Understanding: relative-position-aware attention that models content and position separately
- State-of-the-art: generally outperforms BERT on many benchmarks
No Breaking Changes
- ✅ Model architecture remains the same (hierarchical structure intact)
- ✅ Training pipeline unchanged
- ✅ All multi-task heads (classification, severity, importance) work as before
- ✅ Loss functions and optimization unchanged
- ✅ Data loading and preprocessing unchanged
Next Steps
Before Training
Ensure the `transformers` library is up to date (`pip install --upgrade transformers`). The first training run will download the DeBERTa-base model (~360 MB).
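If you want to verify the environment before kicking off training, an optional (hypothetical, not part of the repo) pre-flight check is:

```python
# Optional pre-flight check: fetch and cache the encoder and tokenizer so the
# first training run doesn't stall on the download.
from transformers import AutoModel, AutoTokenizer

AutoTokenizer.from_pretrained("microsoft/deberta-base")
AutoModel.from_pretrained("microsoft/deberta-base")
print("DeBERTa-base downloaded and cached.")
```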
Training
Simply run your existing training command:
`python train.py --epochs 20 --batch-size 16`
The model will automatically:
- Download DeBERTa-base from Hugging Face
- Use the hierarchical architecture with DeBERTa as encoder
- Save checkpoints with DeBERTa weights
Model Compatibility
- Old BERT checkpoints will NOT be compatible with the new DeBERTa model (see the illustration after this list)
- You'll need to retrain from scratch
- This is expected and necessary when changing the base encoder
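As a rough, standalone illustration (not repo code), comparing the two encoders' parameter names shows why a strict checkpoint load cannot succeed:

```python
from transformers import AutoModel

# Compare state_dict keys of the two encoders; this only illustrates the
# checkpoint incompatibility and is not part of the training code.
bert = AutoModel.from_pretrained("bert-base-uncased")
deberta = AutoModel.from_pretrained("microsoft/deberta-base")

bert_keys = set(bert.state_dict().keys())
deberta_keys = set(deberta.state_dict().keys())

print(f"shared names:       {len(bert_keys & deberta_keys)}")
print(f"BERT-only names:    {len(bert_keys - deberta_keys)}")    # e.g. the pooler
print(f"DeBERTa-only names: {len(deberta_keys - bert_keys)}")    # e.g. disentangled-attention projections

# Even shared names (such as the word-embedding matrix) differ in shape because
# the two models use different vocabularies, so retraining from scratch is required.
```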
Files Modified
- ✅ `config.py` - Model name and documentation
- ✅ `model.py` - Model architecture and forward pass
- ✅ `train.py` - Training script documentation
- ✅ `trainer.py` - Trainer documentation
Files NOT Modified (still work as-is)
- `data_loader.py` - No changes needed
- `evaluate.py` - Works with the new model
- `inference.py` - Works with the new model
- `risk_discovery.py` - Independent of encoder choice
- All other utility files
Performance Expectations
DeBERTa should provide:
- Similar or better accuracy on risk classification
- Better handling of legal text nuances
- Potentially faster convergence during training