π Legal-BERT Training Results & Improvements Summary
Executive Summary
Multi-task Legal-BERT model for contract clause analysis with dramatic improvements achieved through loss rebalancing and training optimization. Model performs risk pattern classification, severity scoring, and importance scoring simultaneously.
π― Training Configuration
Dataset
- Source: CUAD v1 (Contract Understanding Atticus Dataset)
- Total Clauses: ~19,598 from 510 commercial contracts
- Training Split: 70% train / 10% validation / 20% test
- Discovered Risk Patterns: 7 clusters via unsupervised TF-IDF + K-Means
Model Architecture
- Base Model: BERT (bert-base-uncased)
- Task Heads:
- Risk Classification (7 classes)
- Severity Regression (0-10 scale)
- Importance Regression (0-10 scale)
Training Parameters
Batch Size: 16
Learning Rate: 1e-5
Optimizer: AdamW
Device: CUDA
π Results Progression
Initial Results (FAILED)
Configuration: Loss weights 10:1:1, 1 epochs
| Metric | Value | Status |
|---|---|---|
| Classification Accuracy | 21.5% | β Failed |
| Precision | 4.7% | β Critical |
| Recall | 21.5% | β Poor |
| F1-Score | 7.8% | β Broken |
| Severity RΒ² | 0.747 | β Good |
| Importance RΒ² | 0.970 | β Excellent |
Problem Identified:
- Model collapsed into predicting almost exclusively Class 1 (98.8% of predictions)
- Classes 0, 2, 3, 5, 6 had 0% recall (never predicted)
- Regression tasks dominated gradient flow, sacrificing classification
Current Results (IMPROVED)
Configuration: Loss weights 10:1:1, 10 epochs (with class balancing)
| Metric | Value | Change | Status |
|---|---|---|---|
| Classification Accuracy | 38.9% | +81% β | β οΈ Improving |
| Precision | 31.6% | +567% β | β οΈ Better |
| Recall | 38.9% | +81% β | β οΈ Better |
| F1-Score | 34.2% | +340% β | β οΈ Better |
| Severity RΒ² | 0.929 | +24% β | β Excellent |
| Importance RΒ² | 0.994 | +2% β | β Near Perfect |
| Avg Confidence | 33.8% | +43% β | β οΈ Low |
Improvements Achieved:
- β Model now predicts 5 out of 7 classes (was 3)
- β No more extreme class collapse
- β Regression performance improved further
- β οΈ Classes 0 and 5 still have 0% recall
π Per-Class Performance Analysis
Current Performance by Risk Pattern
| Class | Pattern Name | Support | Precision | Recall | F1-Score | Status |
|---|---|---|---|---|---|---|
| 0 | LIABILITY (Insurance) | 444 | 0.0% | 0.0% | 0.00 | β FAILING |
| 1 | COMPLIANCE | 310 | 23.8% | 44.2% | 0.31 | β οΈ Poor |
| 2 | TERMINATION | 395 | 45.9% | 63.3% | 0.53 | β Best |
| 3 | AGREEMENT_PARTY | 634 | 56.2% | 59.9% | 0.58 | β Best |
| 4 | PAYMENT | 528 | 28.3% | 45.3% | 0.35 | β οΈ Poor |
| 5 | INTELLECTUAL_PROPERTY | 249 | 0.0% | 0.0% | 0.00 | β FAILING |
| 6 | LIABILITY (Breach) | 248 | 51.2% | 34.7% | 0.41 | β οΈ Moderate |
Key Observations
Strong Performance (F1 > 0.50):
- Class 2 (TERMINATION): Clear termination language patterns learned well
- Class 3 (AGREEMENT_PARTY): Largest cluster, consistent patterns
Moderate Performance (F1 = 0.30-0.50):
- Class 1 (COMPLIANCE): Overlaps with other regulatory language
- Class 4 (PAYMENT): Confused with general contractual obligations
- Class 6 (LIABILITY - Breach): Mixed with Class 0
Critical Failures (F1 = 0.00):
- Class 0 (LIABILITY - Insurance): Misclassified as Class 4 (56%)
- Class 5 (INTELLECTUAL_PROPERTY): Smallest cluster (8.6%), absorbed into Class 1
π Root Cause Analysis
Why Classes 0 and 5 Are Failing
1. Duplicate Topic Names
- Classes 0 and 6 both labeled "Topic_LIABILITY"
- Model cannot distinguish between:
- Class 0: Insurance, coverage, franchisee maintenance
- Class 6: Damages, breach, consequential loss
- Solution: Merge or rename to "LIABILITY_INSURANCE" vs "LIABILITY_BREACH"
2. Class Imbalance
Largest: Class 3 (634 samples, 22.6%)
Smallest: Class 5 (249 samples, 8.6%)
Ratio: 2.5:1
- Class 5 is 2.5x smaller than largest class
- Insufficient training examples for distinctive features
- Solution: Boost class weights by 1.8x for minority classes
3. Semantic Overlap
- IP clauses (Class 5) share keywords with licensing (Class 3):
- Both: "rights", "property", "agreement", "party"
- Payment clauses (Class 4) overlap with compliance (Class 1):
- Both: "shall", "products", "period", "audit"
- Solution: Use Focal Loss to focus on hard-to-classify examples
4. Gradient Dominance
- Regression RΒ² = 0.994 (nearly perfect)
- Classification Acc = 38.9% (still poor)
- Model optimizing for easy regression task
- Solution: Increase classification loss weight to 20-25x
π Recommended Improvements
Phase 1: Immediate Fixes (Expected: 48-52% Accuracy)
1.1 Aggressive Loss Reweighting
# Current: 10:1:1
# Recommended: 20:0.5:0.5
total_loss = (
20.0 * classification_loss + # Focus on classification
0.5 * severity_loss + # Reduce regression emphasis
0.5 * importance_loss
)
1.2 Implement Focal Loss
# Focus on hard-to-classify examples (Classes 0, 5)
criterion = FocalLoss(
alpha=class_weights, # Balanced class weights
gamma=2.5 # High focus on hard examples
)
1.3 Boost Minority Class Weights
class_weights = compute_class_weight('balanced', ...)
class_weights[0] *= 1.8 # Boost Class 0 by 80%
class_weights[5] *= 1.8 # Boost Class 5 by 80%
1.4 Extended Training
Current: 10 epochs (val_loss=1.80 still decreasing)
Recommended: 20 epochs with early stopping
Expected Results:
- Accuracy: 38.9% β 48-52%
- F1-Score: 0.34 β 0.42-0.46
- Class 0/5 Recall: 0% β 15-25%
Phase 2: Structural Fixes (Expected: 55-60% Accuracy)
2.1 Merge Duplicate LIABILITY Classes
# Consolidate Classes 0 and 6 into single LIABILITY class
# Reduces from 7 to 6 distinct patterns
# Combines insurance + breach liability concepts
2.2 Re-run Clustering with Validation
# Current: Fixed k=7
# Recommended: Optimize k using silhouette score
# Ensure minimum cluster size β₯ 200 samples
# Merge or remove clusters < 150 samples
2.3 Address Class 5 (Two Options)
Option A: Merge with Class 3 (AGREEMENT_PARTY)
- IP clauses often appear in licensing agreements
- Semantic overlap justifies consolidation
Option B: Keep but boost significantly
- Increase weight to 2.0x (100% boost)
- Add data augmentation for IP clauses
Expected Results:
- Accuracy: 52% β 55-60%
- F1-Score: 0.46 β 0.50-0.55
- All classes: >25% recall
Phase 3: Advanced Optimizations (Expected: 60-65% Accuracy)
3.1 Learning Rate Scheduling
# OneCycleLR for better convergence
scheduler = OneCycleLR(
optimizer,
max_lr=2e-5,
total_steps=num_epochs * len(train_loader),
pct_start=0.1 # 10% warmup
)
3.2 Differential Learning Rates
# Lower LR for BERT backbone (fine-tune carefully)
# Higher LR for task heads (learn faster)
{
'bert_params': lr=2e-5,
'task_heads': lr=1e-4 # 5x higher
}
3.3 Gradient Clipping
# Prevent gradient explosion with high classification weight
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
3.4 Better Feature Engineering
# Add domain-specific features to score calculation:
# - Contract type indicators
# - Clause position in document
# - Presence of monetary amounts ($)
# - Time-sensitive language density
Expected Results:
- Accuracy: 60% β 63-68%
- F1-Score: 0.55 β 0.58-0.62
- Balanced performance across all classes
π Calibration Analysis
Current Calibration Metrics
| Metric | Pre-Calibration | Post-Calibration | Status |
|---|---|---|---|
| ECE | 15.2% | 16.5% | β Worse |
| MCE | 41.7% | 46.8% | β Worse |
| Optimal Temp | 1.43 | - | β οΈ Suboptimal |
Problem Identified
- Calibration degraded confidence estimates (ECE increased by 1.3%)
- Temperature scaling insufficient for multi-task model
- Low confidence (33.8%) indicates model uncertainty
Recommended Calibration Improvements
# 1. Calibrate only after classification improves to >50%
# Current 38.9% accuracy makes calibration premature
# 2. Use separate temperature per task
temp_classification = 1.5
temp_severity = 1.0 # Don't scale regression
temp_importance = 1.0
# 3. Consider Platt Scaling instead of temperature scaling
from sklearn.calibration import CalibratedClassifierCV
π― Performance Targets
Short-term Goals (1-2 training runs)
- Fix class collapse (Classes 0-6 predicted)
- Achieve >45% classification accuracy
- All classes >10% recall
- Maintain regression RΒ² >0.92
Medium-term Goals (3-5 iterations)
- Achieve >55% classification accuracy
- F1-Score >0.50
- All classes >25% recall
- Balanced per-class F1 (std <0.15)
Long-term Goals (Production-ready)
- Achieve >65% classification accuracy
- F1-Score >0.60
- All classes >40% recall
- ECE <5% (well-calibrated)
- Inference latency <100ms per clause
π§ Implementation Checklist
Quick Wins (This Week)
- Change loss weights to 20:0.5:0.5
- Add class weight balancing with 1.8x boost for minorities
- Increase epochs to 20 with early stopping
- Add gradient clipping (max_norm=1.0)
- Implement Focal Loss (gamma=2.5)
Structural Changes (Next Sprint)
- Merge duplicate LIABILITY classes (0β6)
- Re-run clustering with optimal k selection
- Address Class 5 (merge or boost)
- Add learning rate scheduling
- Implement differential learning rates
Advanced Optimizations (Future)
- Data augmentation for minority classes
- Ensemble modeling (multiple seeds)
- Domain-specific feature engineering
- Better calibration methods
- Hyperparameter tuning (batch size, LR)
π Confusion Matrix Analysis
Class 0 Misclassifications (444 samples)
Predicted as Class 4 (PAYMENT): 251 samples (56.5%)
Predicted as Class 1 (COMPLIANCE): 94 samples (21.2%)
Predicted as Class 3 (PARTY): 49 samples (11.0%)
Correctly predicted: 0 samples (0.0%)
Why: Insurance liability shares "shall maintain", "period", "company" with payment obligations
Class 5 Misclassifications (249 samples)
Predicted as Class 1 (COMPLIANCE): ~100 samples (40%)
Predicted as Class 4 (PAYMENT): ~80 samples (32%)
Correctly predicted: 0 samples (0.0%)
Why: IP clauses in contracts overlap with general licensing and service terms
π‘ Key Insights
What's Working
- β Multi-task learning is viable: Regression tasks achieved near-perfect RΒ²
- β BERT fine-tuning effective: Model learns legal language patterns
- β Feature-based scoring works: Real features produce meaningful scores
- β No data leakage: Contract-level splitting properly implemented
- β Pipeline is sound: All 9 stages connected with real data flow
What's Not Working
- β Task imbalance: Regression dominates, classification suffers
- β Clustering quality: Duplicate topics and semantic overlap
- β Class imbalance: Smallest class 2.5x smaller than largest
- β Training duration: 10 epochs insufficient (val loss still decreasing)
- β Calibration: Premature given low classification accuracy
Critical Success Factors
- Loss weighting is paramount: 20:0.5:0.5 ratio needed
- Hard example mining: Focal Loss for Classes 0 and 5
- Longer training: 20 epochs minimum with early stopping
- Better clustering: Validate and merge duplicate/small clusters
- Monitor per-class metrics: Overall accuracy misleading with imbalance
π Discovered Risk Patterns
Pattern Descriptions
| ID | Name | Key Terms | Count | % | Quality |
|---|---|---|---|---|---|
| 0 | LIABILITY (Insurance) | insurance, franchisee, coverage, maintain | 1,306 | 13.3% | β οΈ Duplicate |
| 1 | COMPLIANCE | shall, laws, audit, state, governed | 1,678 | 17.0% | β Good |
| 2 | TERMINATION | term, termination, notice, expiration | 1,419 | 14.4% | β Strong |
| 3 | AGREEMENT_PARTY | agreement, party, license, rights, consent | 1,786 | 18.1% | β Strong |
| 4 | PAYMENT | shall, company, period, royalty, pay | 1,744 | 17.7% | β Good |
| 5 | INTELLECTUAL_PROPERTY | property, intellectual, software, consultant | 849 | 8.6% | β οΈ Too Small |
| 6 | LIABILITY (Breach) | damages, breach, liable, consequential | 1,072 | 10.9% | β οΈ Duplicate |
π Lessons Learned
Technical Lessons
- Multi-task loss balancing is critical - Easy tasks dominate if not weighted properly
- Unsupervised clustering needs validation - Manual review prevents duplicate/ambiguous categories
- Class imbalance requires multiple strategies - Weights + Focal Loss + potential merging
- Training convergence indicators matter - Don't stop when val loss still decreasing
- Calibration is premature at low accuracy - Fix classification first, calibrate later
Domain Lessons
- Legal language has semantic overlap - Liability, compliance, payment clauses share vocabulary
- Contract structure matters - Clause position and context affect classification
- Topic modeling benefits from constraints - Minimum cluster size prevents noise
- Feature-based scores are interpretable - Regression targets based on real features work well
- 7 categories may be too granular - Consider 5-6 well-separated patterns instead
π Next Steps Priority
Priority 1: Critical (Do Now)
- Update loss weights to 20:0.5:0.5
- Add Focal Loss with class weight boosting
- Train for 20 epochs with early stopping
- Monitor per-class recall each epoch
Priority 2: Important (This Week)
- Merge Classes 0 and 6 (LIABILITY)
- Decide on Class 5 (merge vs boost)
- Add gradient clipping
- Implement learning rate scheduling
Priority 3: Enhancement (Next Sprint)
- Re-run clustering with validation
- Add data augmentation
- Tune hyperparameters systematically
- Implement better calibration
π Conclusion
The Legal-BERT pipeline demonstrates strong technical foundation with proper data flow and no simulated data. The dramatic improvement from 21.5% to 38.9% accuracy (+81%) validates the approach.
Current bottleneck: Task imbalance causing regression to dominate classification learning.
Path forward: Aggressive classification loss weighting (20x), Focal Loss for hard examples, extended training (20 epochs), and clustering refinement will push accuracy to 55-60% range.
Timeline estimate:
- 48-52% accuracy achievable in 1 training run (with Phase 1 fixes)
- 55-60% accuracy achievable in 2-3 iterations (with Phase 2 fixes)
- 65%+ accuracy requires 5+ iterations with advanced optimizations
Model Status: β οΈ IMPROVING - On trajectory to production-ready performance with identified action plan.
Last Updated: 2025-11-05
Training Date: 2025-11-04
Model Version: v2 (38.9% accuracy baseline)