# 🚀 QUICK START GUIDE - Legal-BERT

## Prerequisites

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Download CUAD dataset
# Place at: dataset/CUAD_v1/CUAD_v1.json
```
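
Before training, it can help to confirm that the dataset file is where the scripts expect it. The sketch below is not part of the project: `check_dataset` is a hypothetical helper, and it assumes the CUAD JSON uses a SQuAD-style layout with a top-level `data` list.

```python
import json
from pathlib import Path


def check_dataset(path="dataset/CUAD_v1/CUAD_v1.json"):
    """Return the number of top-level entries in the dataset file.

    Hypothetical helper: assumes a SQuAD-style file with a top-level "data" list.
    """
    dataset_path = Path(path)
    if not dataset_path.is_file():
        raise FileNotFoundError(f"Dataset not found at {dataset_path}")
    with dataset_path.open(encoding="utf-8") as handle:
        payload = json.load(handle)
    # Guard against files that parse as JSON but have an unexpected shape.
    if not isinstance(payload, dict) or not isinstance(payload.get("data"), list):
        raise ValueError("Expected a JSON object with a top-level 'data' list")
    return len(payload["data"])
```

Calling `check_dataset()` from the project root either returns a count or raises an error that names the missing path.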

## Verify Setup

```bash
python test_setup.py
```

Expected output:
```
🧪 LEGAL-BERT PROJECT - QUICK TEST
================================================================================
🔍 Testing imports...
  ✅ PyTorch
  ✅ Transformers
  ✅ scikit-learn
  ✅ Pandas
  ✅ NumPy
...
✅ ALL TESTS PASSED!
🚀 Ready to train! Run: python train.py
```
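
If `test_setup.py` reports a failed import, a standalone check can narrow down which package is missing. This is a minimal sketch, not the contents of `test_setup.py`; the import names in `REQUIRED_MODULES` are assumptions inferred from the packages listed in the expected output.

```python
import importlib.util

# Assumed import names for the packages shown in the expected output above.
REQUIRED_MODULES = ["torch", "transformers", "sklearn", "pandas", "numpy"]


def check_imports(module_names):
    """Map each top-level module name to True if it can be located."""
    return {
        name: importlib.util.find_spec(name) is not None
        for name in module_names
    }


if __name__ == "__main__":
    for name, available in check_imports(REQUIRED_MODULES).items():
        status = "OK" if available else "MISSING"
        print(f"{status:8s}{name}")
```

`find_spec` only locates the module without importing it, so the check is fast even for heavy packages.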

## Training

```bash
python train.py
```

**What it does**:
1. Loads CUAD dataset (19,598 clauses)
2. Discovers 7 risk patterns automatically
3. Trains Legal-BERT for 5 epochs (~2-4 hours on GPU)
4. Saves checkpoints every epoch
5. Generates training history plot

**Output**:
```
checkpoints/
  ├── legal_bert_epoch_1.pt
  ├── legal_bert_epoch_2.pt
  ├── ...
  ├── training_history.png
  └── training_summary.json
models/legal_bert/
  └── final_model.pt
```

**Expected Results**:
- Train Accuracy: >60%
- Val Accuracy: >55%
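
To pick up the most recent checkpoint after a run, the file names in the output listing above can be sorted by epoch number. The sketch below assumes only the `legal_bert_epoch_N.pt` naming shown there; `latest_checkpoint` is a hypothetical helper, not a function from the project.

```python
import re
from pathlib import Path

# Matches the checkpoint names shown in the output listing.
CHECKPOINT_PATTERN = re.compile(r"legal_bert_epoch_(\d+)\.pt")


def latest_checkpoint(checkpoint_dir="checkpoints"):
    """Return the path of the highest-numbered epoch checkpoint, or None."""
    best_epoch, best_path = -1, None
    for path in Path(checkpoint_dir).glob("legal_bert_epoch_*.pt"):
        match = CHECKPOINT_PATTERN.fullmatch(path.name)
        if match is None:
            continue
        epoch = int(match.group(1))
        # Compare numerically so that epoch 10 ranks above epoch 2.
        if epoch > best_epoch:
            best_epoch, best_path = epoch, path
    return best_path
```

The numeric comparison matters once training runs past nine epochs, because a plain string sort would place `epoch_10` before `epoch_2`.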

## Evaluation

```bash
python evaluate.py
```

**What it does**:
1. Loads trained model
2. Evaluates on test set
3. Calculates comprehensive metrics
4. Generates visualizations
5. Saves detailed report

**Output**:
```
checkpoints/
  ├── evaluation_results.json
  ├── confusion_matrix.png
  └── risk_distribution.png
evaluation_report.txt
```

**Expected Results**:
- Accuracy: >70%
- F1-Score: >0.65
- Precision: >0.60
- Recall: >0.60
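
For reference, the metrics above can be reproduced by hand from predicted and true labels. The sketch below uses macro averaging (an unweighted mean over classes); that choice is an assumption, since this guide does not say which averaging `evaluate.py` applies.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1_scores = [], [], []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1_scores.append(f1)
    count = len(labels)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / count,
        "recall": sum(recalls) / count,
        "f1": sum(f1_scores) / count,
    }
```

With imbalanced risk patterns, macro averages can sit well below accuracy, which is one reason the F1 threshold is lower than the accuracy threshold.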

## Calibration

```bash
python calibrate.py
```

**What it does**:
1. Loads trained model
2. Applies temperature scaling
3. Calculates ECE/MCE
4. Saves calibrated model
5. Exports results

**Output**:
```
checkpoints/
  └── calibration_results.json
models/legal_bert/
  └── calibrated_model.pt
```

**Expected Results**:
- ECE: 0.15 → <0.08
- MCE: 0.20 → <0.12
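
The two ideas behind these numbers can be sketched in a few lines: temperature scaling divides the logits by a scalar `T` before the softmax, and ECE/MCE compare average confidence with accuracy inside confidence bins. The equal-width binning and `num_bins=10` below are assumptions; `calibrate.py` may bin differently.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature; T > 1 flattens the distribution."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def calibration_errors(confidences, correct, num_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins (assumed scheme)."""
    bins = [[] for _ in range(num_bins)]
    for confidence, is_correct in zip(confidences, correct):
        index = min(int(confidence * num_bins), num_bins - 1)
        bins[index].append((confidence, is_correct))
    total = len(confidences)
    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        gap = abs(avg_conf - accuracy)
        ece += len(members) / total * gap  # weighted by bin size
        mce = max(mce, gap)                # worst single bin
    return ece, mce
```

A fitted temperature above 1 indicates the uncalibrated model was overconfident, which is the usual case for fine-tuned BERT classifiers.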

## Complete Pipeline

```bash
# Run everything in sequence
python train.py && python evaluate.py && python calibrate.py
```
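
On shells without `&&` chaining, the same stop-on-first-failure behaviour can be approximated from Python. This is a minimal sketch; `run_pipeline` is a hypothetical helper and is not shipped with the project.

```python
import subprocess
import sys

# Script order taken from the command above.
PIPELINE = ("train.py", "evaluate.py", "calibrate.py")


def run_pipeline(scripts=PIPELINE, python=sys.executable):
    """Run each script in order; stop at the first non-zero exit code."""
    for script in scripts:
        print(f"Running {script} ...")
        result = subprocess.run([python, script])
        if result.returncode != 0:
            # Mirror `&&`: later steps are skipped after a failure.
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(run_pipeline())
```

Using `sys.executable` keeps every step in the same interpreter and virtual environment as the runner itself.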

## Configuration

Edit `config.py` to customize:

```python
# Model settings
bert_model_name = "bert-base-uncased"
num_risk_categories = 7
max_sequence_length = 512

# Training settings
batch_size = 16          # Reduce if GPU OOM
num_epochs = 5           # More epochs may help; watch validation loss for overfitting
learning_rate = 2e-5     # Adjust for convergence

# Paths
data_path = "dataset/CUAD_v1/CUAD_v1.json"
checkpoint_dir = "checkpoints"
```

## Troubleshooting

### GPU Out of Memory
```python
# In config.py, reduce:
batch_size = 8  # or even 4
```

### Missing Dataset
```bash
# Error: Dataset not found
# Fix: download CUAD and place the JSON file at:
#   dataset/CUAD_v1/CUAD_v1.json
```

### Import Errors
```bash
# Reinstall dependencies
pip install -r requirements.txt --upgrade
```

### Visualization Errors
```bash
# If matplotlib errors occur, install the plotting libraries:
pip install matplotlib seaborn
# Without them, plots are skipped; everything else still runs.
```

## Performance Tips

### Speed Up Training
1. Use a GPU (CUDA): selected automatically if available
2. Increase batch size if GPU memory allows: `batch_size = 32`
3. Use fewer epochs: `num_epochs = 3` (may lower accuracy)

### Improve Accuracy
1. Train longer: `num_epochs = 10`
2. Adjust learning rate: `learning_rate = 3e-5`
3. Use a larger BERT model: `bert_model_name = "bert-large-uncased"` (needs more GPU memory; reduce `batch_size` if you hit OOM)

### Better Calibration
1. More validation data: Adjust splits in `data_loader.py`
2. More iterations: In `calibrate.py` increase `max_iter`

## File Structure

```
code2/
├── train.py           ← Run this first
├── evaluate.py        ← Then this
├── calibrate.py       ← Finally this
├── test_setup.py      ← Verify before training
│
├── config.py          ← Edit settings here
├── data_loader.py     ← Loads CUAD dataset
├── risk_discovery.py  ← Discovers patterns
├── model.py           ← Legal-BERT architecture
├── trainer.py         ← Training logic
├── evaluator.py       ← Evaluation logic
├── utils.py           ← Helper functions
│
├── README.md                 ← Full documentation
├── IMPLEMENTATION.md         ← Implementation details
├── COMPLETION_SUMMARY.md     ← What was done
└── QUICK_START.md            ← This file
```

## Common Commands

```bash
# Check setup
python test_setup.py

# Train model
python train.py

# Evaluate model  
python evaluate.py

# Calibrate model
python calibrate.py

# Run all
python train.py && python evaluate.py && python calibrate.py

# Interactive Python session (after training)
python
>>> from evaluator import LegalBertEvaluator
>>> # Load and analyze results
```

## Expected Timeline

| Task | Time (GPU) | Time (CPU) |
|------|-----------|-----------|
| Setup verification | 30 seconds | 30 seconds |
| Training (5 epochs) | 2-4 hours | 8-12 hours |
| Evaluation | 10 minutes | 20 minutes |
| Calibration | 5 minutes | 10 minutes |
| **Total** | **~3 hours** | **~10 hours** |

## Success Indicators

### After Training
✅ Checkpoints saved in `checkpoints/`  
✅ Training loss decreasing  
✅ Validation accuracy >55%  
✅ No CUDA errors  

### After Evaluation
✅ Accuracy >70%  
✅ F1-Score >0.65  
✅ Confusion matrix generated  
✅ Report saved  

### After Calibration
✅ ECE <0.10 (target: <0.08)  
✅ Temperature ~1.5-2.5  
✅ Calibrated model saved  
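
The file-based indicators above can be checked in one pass. The sketch below lists which of the output files named in this guide are absent; `missing_artifacts` is a hypothetical helper, and the plot files may legitimately be missing if plotting was skipped.

```python
from pathlib import Path

# Paths taken from the Output listings in this guide.
EXPECTED_ARTIFACTS = [
    "checkpoints/training_history.png",
    "checkpoints/training_summary.json",
    "models/legal_bert/final_model.pt",
    "checkpoints/evaluation_results.json",
    "checkpoints/confusion_matrix.png",
    "checkpoints/risk_distribution.png",
    "evaluation_report.txt",
    "checkpoints/calibration_results.json",
    "models/legal_bert/calibrated_model.pt",
]


def missing_artifacts(root="."):
    """Return the expected artifact paths that do not exist under root."""
    base = Path(root)
    return [name for name in EXPECTED_ARTIFACTS if not (base / name).is_file()]


if __name__ == "__main__":
    for name in missing_artifacts():
        print(f"missing: {name}")
```

An empty result means all three stages wrote their files; it says nothing about whether the metric thresholds were met.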

## Getting Help

1. Check `README.md` for detailed documentation
2. Check `IMPLEMENTATION.md` for technical details
3. Check `COMPLETION_SUMMARY.md` for what was implemented
4. Review error messages carefully
5. Verify setup with `python test_setup.py`

## Next Steps

After completing training, evaluation, and calibration:

1. **Analyze Results**: Check evaluation report
2. **Tune Parameters**: Adjust `config.py` if needed
3. **Retrain**: Run `train.py` again with new settings
4. **Deploy** (optional): Create API or web interface

## Key Metrics to Track

### Training
- Train Loss (should decrease)
- Val Loss (should decrease)
- Train Accuracy (should increase)
- Val Accuracy (should increase)
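
A quick way to read the validation-loss curve is to find the epoch where it was lowest and count how many epochs have passed since. The helper below is a hypothetical sketch that works on a plain list of per-epoch losses; how those values are stored in `training_summary.json` is not specified here.

```python
def summarize_val_loss(val_losses):
    """Return the best (lowest-loss) epoch and the epochs elapsed since it."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    best_index = min(range(len(val_losses)), key=val_losses.__getitem__)
    return {
        "best_epoch": best_index + 1,  # epochs are 1-indexed
        "epochs_since_best": len(val_losses) - 1 - best_index,
    }
```

If `epochs_since_best` keeps growing while train loss still falls, the model is probably overfitting and the checkpoint from `best_epoch` is the one to keep.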

### Evaluation
- Overall Accuracy (>70%)
- F1-Score (>0.65)
- Per-pattern F1 (check which patterns need work)
- Regression R² (>0.60 for severity/importance)
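
For the regression targets, R² is one minus the ratio of residual to total squared error. The sketch below implements that standard definition directly; whether `evaluate.py` computes it this way or through a library is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal length")
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot
```

A value of 0 means the model does no better than always predicting the mean target, so the 0.60 threshold asks for a clear improvement over that baseline.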

### Calibration
- ECE (target: <0.08)
- MCE (target: <0.12)
- Temperature (typically 1.5-2.5)

---

**Ready? Start with**: `python test_setup.py`

**Questions?** Check `README.md` for comprehensive documentation.

**🎉 Good luck with your Legal-BERT training! 🎉**