# Quick Start Guide - Legal-BERT

## Prerequisites
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Download CUAD dataset
# Place at: dataset/CUAD_v1/CUAD_v1.json
```
## Verify Setup

```bash
python test_setup.py
```

Expected output:

```
LEGAL-BERT PROJECT - QUICK TEST
Testing imports... PyTorch, Transformers, scikit-learn, Pandas, NumPy ...
ALL TESTS PASSED! Ready to train! Run: python train.py
```
## Training

```bash
python train.py
```
What it does:
- Loads CUAD dataset (19,598 clauses)
- Discovers 7 risk patterns automatically
- Trains Legal-BERT for 5 epochs (~2-4 hours on GPU)
- Saves checkpoints every epoch
- Generates training history plot
Output:

```
checkpoints/
├── legal_bert_epoch_1.pt
├── legal_bert_epoch_2.pt
├── ...
├── training_history.png
└── training_summary.json

models/legal_bert/
└── final_model.pt
```
Expected Results:
- Train Accuracy: >60%
- Val Accuracy: >55%
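The checkpoint-per-epoch flow above can be sketched without torch; the filenames and loss values here are illustrative stand-ins, since the real `train.py` saves model weights as `.pt` files:

```python
# Torch-free sketch of train.py's checkpointing flow. Real checkpoints are
# model weights (.pt), not JSON; paths and loss values are made up here.
import json
import os
import tempfile

def run_training(num_epochs=5, out_dir="checkpoints"):
    os.makedirs(out_dir, exist_ok=True)
    history = []
    for epoch in range(1, num_epochs + 1):
        train_loss = 1.0 / epoch  # stand-in for a real training epoch
        history.append({"epoch": epoch, "train_loss": train_loss})
        # One checkpoint per epoch, mirroring legal_bert_epoch_N.pt
        with open(os.path.join(out_dir, f"epoch_{epoch}.json"), "w") as f:
            json.dump(history[-1], f)
    # Summary written once at the end, mirroring training_summary.json
    with open(os.path.join(out_dir, "training_summary.json"), "w") as f:
        json.dump(history, f)
    return history

history = run_training(out_dir=tempfile.mkdtemp())
print([round(h["train_loss"], 2) for h in history])
```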
## Evaluation

```bash
python evaluate.py
```
What it does:
- Loads trained model
- Evaluates on test set
- Calculates comprehensive metrics
- Generates visualizations
- Saves detailed report
Output:

```
checkpoints/
├── evaluation_results.json
├── confusion_matrix.png
└── risk_distribution.png
evaluation_report.txt
```
Expected Results:
- Accuracy: >70%
- F1-Score: >0.65
- Precision: >0.60
- Recall: >0.60
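For reference, the metrics above are computed as follows. This is a generic stdlib illustration, not the code from `evaluator.py`, and the clause labels are invented:

```python
# Illustration of accuracy and macro-averaged precision/recall/F1.
# Label names are hypothetical; evaluator.py's real implementation may differ.
def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall, and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

y_true = ["indemnity", "liability", "liability", "termination", "indemnity"]
y_pred = ["indemnity", "liability", "termination", "termination", "indemnity"]
acc, prec, rec, f1 = macro_metrics(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```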
## Calibration

```bash
python calibrate.py
```
What it does:
- Loads trained model
- Applies temperature scaling
- Calculates ECE/MCE
- Saves calibrated model
- Exports results
Output:
checkpoints/
βββ calibration_results.json
models/legal_bert/
βββ calibrated_model.pt
Expected Results:
- ECE: 0.15 → <0.08
- MCE: 0.20 → <0.12
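Temperature scaling and ECE can be illustrated in a few lines of stdlib Python. This is only a sketch, not `calibrate.py`'s actual implementation (which may, for example, fit the temperature by gradient descent rather than by hand):

```python
# Sketch of temperature scaling and Expected Calibration Error (ECE).
# Logits and bin counts are illustrative; calibrate.py may differ.
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; T > 1 softens overconfident outputs."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def ece(confidences, correct, n_bins=10):
    """ECE: bin-weighted average of |accuracy - confidence| per bin."""
    total, err = len(confidences), 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if not idx:
            continue
        bin_acc = sum(correct[i] for i in idx) / len(idx)
        bin_conf = sum(confidences[i] for i in idx) / len(idx)
        err += (len(idx) / total) * abs(bin_acc - bin_conf)
    return err

# Toy example: the same logits give a lower top probability at T=2.
logits = [2.0, 0.5, -1.0]
print(max(softmax(logits, T=1.0)))  # higher confidence
print(max(softmax(logits, T=2.0)))  # lower confidence after scaling
```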
## Complete Pipeline

```bash
# Run everything in sequence
python train.py && python evaluate.py && python calibrate.py
```
## Configuration

Edit `config.py` to customize:

```python
# Model settings
bert_model_name = "bert-base-uncased"
num_risk_categories = 7
max_sequence_length = 512

# Training settings
batch_size = 16        # Reduce if GPU OOM
num_epochs = 5         # Increase for better results
learning_rate = 2e-5   # Adjust for convergence

# Paths
data_path = "dataset/CUAD_v1/CUAD_v1.json"
checkpoint_dir = "checkpoints"
```
## Troubleshooting

### GPU Out of Memory

```python
# In config.py, reduce:
batch_size = 8  # or even 4
```

### Missing Dataset

```
Error: Dataset not found
Solution: Download CUAD and place at:
dataset/CUAD_v1/CUAD_v1.json
```

### Import Errors

```bash
# Reinstall dependencies
pip install -r requirements.txt --upgrade
```

### Visualization Errors

```bash
# If matplotlib errors occur
pip install matplotlib seaborn
```

If matplotlib is unavailable, plots are skipped but everything else still works.
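The "plots will be skipped" behavior suggests a guarded import; a minimal sketch of that pattern (the flag name and function are hypothetical, and the real `utils.py` may do this differently):

```python
# Guarded-import pattern: plotting is optional, everything else still runs.
# HAS_MPL and maybe_plot are hypothetical names, not from the project.
try:
    import matplotlib  # noqa: F401
    HAS_MPL = True
except ImportError:
    HAS_MPL = False

def maybe_plot(history, path="checkpoints/training_history.png"):
    """Plot training history if matplotlib is available; otherwise skip."""
    if not HAS_MPL:
        print("matplotlib not installed - skipping plot")
        return False
    # ... actual plotting code would go here ...
    return True
```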
## Performance Tips

### Speed Up Training

- Use GPU (CUDA): automatic if available
- Increase batch size: `batch_size = 32`
- Use fewer epochs: `num_epochs = 3`

### Improve Accuracy

- Train longer: `num_epochs = 10`
- Adjust learning rate: `learning_rate = 3e-5`
- Use a larger BERT: `bert_model_name = "bert-large-uncased"`

### Better Calibration

- More validation data: adjust splits in `data_loader.py`
- More iterations: increase `max_iter` in `calibrate.py`
## File Structure

```
code2/
├── train.py              ← Run this first
├── evaluate.py           ← Then this
├── calibrate.py          ← Finally this
├── test_setup.py         ← Verify before training
│
├── config.py             ← Edit settings here
├── data_loader.py        ← Loads CUAD dataset
├── risk_discovery.py     ← Discovers patterns
├── model.py              ← Legal-BERT architecture
├── trainer.py            ← Training logic
├── evaluator.py          ← Evaluation logic
├── utils.py              ← Helper functions
│
├── README.md             ← Full documentation
├── IMPLEMENTATION.md     ← Implementation details
├── COMPLETION_SUMMARY.md ← What was done
└── QUICK_START.md        ← This file
```
## Common Commands

```bash
# Check setup
python test_setup.py

# Train model
python train.py

# Evaluate model
python evaluate.py

# Calibrate model
python calibrate.py

# Run all
python train.py && python evaluate.py && python calibrate.py
```

Python interactive (after training):

```python
>>> from evaluator import LegalBertEvaluator
>>> # Load and analyze results
```
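As a sketch of what that interactive analysis might look like: the key names below (`f1_macro`, `per_pattern_f1`) are assumptions about the structure of `evaluation_results.json`, and a sample dict stands in for reading the real file:

```python
# Hypothetical post-evaluation analysis. The JSON keys are assumed, not
# taken from evaluator.py; a sample dict stands in for the real file.
import json

sample = {
    "accuracy": 0.72,
    "f1_macro": 0.67,
    "per_pattern_f1": {"pattern_0": 0.71, "pattern_1": 0.58},
}
# In practice: text = open("checkpoints/evaluation_results.json").read()
text = json.dumps(sample)
results = json.loads(text)

# Flag patterns below the 0.65 F1 target for closer inspection.
weak = [p for p, f1 in results["per_pattern_f1"].items() if f1 < 0.65]
print("overall F1:", results["f1_macro"])
print("patterns needing work:", weak)
```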
## Expected Timeline
| Task | Time (GPU) | Time (CPU) |
|---|---|---|
| Setup verification | 30 seconds | 30 seconds |
| Training (5 epochs) | 2-4 hours | 8-12 hours |
| Evaluation | 10 minutes | 20 minutes |
| Calibration | 5 minutes | 10 minutes |
| Total | ~3 hours | ~10 hours |
## Success Indicators

### After Training

- Checkpoints saved in `checkpoints/`
- Training loss decreasing
- Validation accuracy >55%
- No CUDA errors

### After Evaluation

- Accuracy >70%
- F1-Score >0.65
- Confusion matrix generated
- Report saved

### After Calibration

- ECE <0.10
- Temperature ~1.5-2.5
- Calibrated model saved
## Getting Help

- Check `README.md` for detailed documentation
- Check `IMPLEMENTATION.md` for technical details
- Check `COMPLETION_SUMMARY.md` for what was implemented
- Review error messages carefully
- Verify setup with `python test_setup.py`
## Next Steps

After completing training, evaluation, and calibration:

- Analyze Results: check the evaluation report
- Tune Parameters: adjust `config.py` if needed
- Retrain: run `train.py` again with new settings
- Deploy (optional): create an API or web interface
## Key Metrics to Track

### Training

- Train Loss (should decrease)
- Val Loss (should decrease)
- Train Accuracy (should increase)
- Val Accuracy (should increase)

### Evaluation

- Overall Accuracy (>70%)
- F1-Score (>0.65)
- Per-pattern F1 (check which patterns need work)
- Regression R² (>0.60 for severity/importance)

### Calibration

- ECE (target: <0.08)
- MCE (target: <0.12)
- Temperature (typically 1.5-2.5)
Ready? Start with: `python test_setup.py`

Questions? Check `README.md` for comprehensive documentation.

Good luck with your Legal-BERT training!