# 🚀 QUICK START GUIDE - Legal-BERT

## Prerequisites

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Download the CUAD dataset
# Place it at: dataset/CUAD_v1/CUAD_v1.json
```

## Verify Setup

```bash
python test_setup.py
```

Expected output:

```
🧪 LEGAL-BERT PROJECT - QUICK TEST
================================================================================
🔍 Testing imports...
✅ PyTorch
✅ Transformers
✅ scikit-learn
✅ Pandas
✅ NumPy
...
✅ ALL TESTS PASSED!
🚀 Ready to train! Run: python train.py
```

## Training

```bash
python train.py
```

**What it does**:
1. Loads the CUAD dataset (19,598 clauses)
2. Discovers 7 risk patterns automatically
3. Trains Legal-BERT for 5 epochs (~2-4 hours on GPU)
4. Saves a checkpoint every epoch
5. Generates a training-history plot

**Output**:
```
checkpoints/
├── legal_bert_epoch_1.pt
├── legal_bert_epoch_2.pt
├── ...
├── training_history.png
└── training_summary.json

models/legal_bert/
└── final_model.pt
```

**Expected Results**:
- Train Accuracy: >60%
- Val Accuracy: >55%

## Evaluation

```bash
python evaluate.py
```

**What it does**:
1. Loads the trained model
2. Evaluates it on the test set
3. Calculates comprehensive metrics
4. Generates visualizations
5. Saves a detailed report

**Output**:
```
checkpoints/
├── evaluation_results.json
├── confusion_matrix.png
└── risk_distribution.png

evaluation_report.txt
```

**Expected Results**:
- Accuracy: >70%
- F1-Score: >0.65
- Precision: >0.60
- Recall: >0.60

## Calibration

```bash
python calibrate.py
```

**What it does**:
1. Loads the trained model
2. Applies temperature scaling
3. Calculates ECE/MCE
4. Saves the calibrated model
5. Exports results

**Output**:
```
checkpoints/
└── calibration_results.json

models/legal_bert/
└── calibrated_model.pt
```

**Expected Results**:
- ECE: 0.15 → <0.08
- MCE: 0.20 → <0.12

## Complete Pipeline

```bash
# Run everything in sequence
python train.py && python evaluate.py && python calibrate.py
```

## Configuration

Edit `config.py` to customize:

```python
# Model settings
bert_model_name = "bert-base-uncased"
num_risk_categories = 7
max_sequence_length = 512

# Training settings
batch_size = 16        # Reduce if the GPU runs out of memory
num_epochs = 5         # Increase for better results
learning_rate = 2e-5   # Adjust for convergence

# Paths
data_path = "dataset/CUAD_v1/CUAD_v1.json"
checkpoint_dir = "checkpoints"
```

## Troubleshooting

### GPU Out of Memory

```python
# In config.py, reduce:
batch_size = 8  # or even 4
```

### Missing Dataset

```bash
# Error: Dataset not found
# Solution: download CUAD and place it at:
#   dataset/CUAD_v1/CUAD_v1.json
```

### Import Errors

```bash
# Reinstall dependencies
pip install -r requirements.txt --upgrade
```

### Visualization Errors

```bash
# If matplotlib errors occur
pip install matplotlib seaborn
# Otherwise plots are skipped (everything else still works)
```

## Performance Tips

### Speed Up Training
1. Use a GPU (CUDA): used automatically if available
2. Increase the batch size: `batch_size = 32`
3. Use fewer epochs: `num_epochs = 3`

### Improve Accuracy
1. Train longer: `num_epochs = 10`
2. Adjust the learning rate: `learning_rate = 3e-5`
3. Use a larger BERT: `bert_model_name = "bert-large-uncased"`

### Better Calibration
1. More validation data: adjust the splits in `data_loader.py`
2. More iterations: increase `max_iter` in `calibrate.py`

## File Structure

```
code2/
├── train.py              ← Run this first
├── evaluate.py           ← Then this
├── calibrate.py          ← Finally this
├── test_setup.py         ← Verify before training
│
├── config.py             ← Edit settings here
├── data_loader.py        ← Loads the CUAD dataset
├── risk_discovery.py     ← Discovers patterns
├── model.py              ← Legal-BERT architecture
├── trainer.py            ← Training logic
├── evaluator.py          ← Evaluation logic
├── utils.py              ← Helper functions
│
├── README.md             ← Full documentation
├── IMPLEMENTATION.md     ← Implementation details
├── COMPLETION_SUMMARY.md ← What was done
└── QUICK_START.md        ← This file
```

## Common Commands

```bash
# Check setup
python test_setup.py

# Train model
python train.py

# Evaluate model
python evaluate.py

# Calibrate model
python calibrate.py

# Run all
python train.py && python evaluate.py && python calibrate.py

# Python interactive (after training)
python
>>> from evaluator import LegalBertEvaluator
>>> # Load and analyze results
```

## Expected Timeline

| Task | Time (GPU) | Time (CPU) |
|------|------------|------------|
| Setup verification | 30 seconds | 30 seconds |
| Training (5 epochs) | 2-4 hours | 8-12 hours |
| Evaluation | 10 minutes | 20 minutes |
| Calibration | 5 minutes | 10 minutes |
| **Total** | **~3 hours** | **~10 hours** |

## Success Indicators

### After Training
✅ Checkpoints saved in `checkpoints/`
✅ Training loss decreasing
✅ Validation accuracy >55%
✅ No CUDA errors

### After Evaluation
✅ Accuracy >70%
✅ F1-Score >0.65
✅ Confusion matrix generated
✅ Report saved

### After Calibration
✅ ECE <0.10
✅ Temperature ~1.5-2.5
✅ Calibrated model saved

## Getting Help

1. Check `README.md` for detailed documentation
2. Check `IMPLEMENTATION.md` for technical details
3. Check `COMPLETION_SUMMARY.md` for what was implemented
4. Review error messages carefully
5. Verify your setup with `python test_setup.py`

## Next Steps

After completing training, evaluation, and calibration:

1. **Analyze Results**: Check the evaluation report
2. **Tune Parameters**: Adjust `config.py` if needed
3. **Retrain**: Run `train.py` again with the new settings
4. **Deploy** (optional): Create an API or web interface

## Key Metrics to Track

### Training
- Train loss (should decrease)
- Val loss (should decrease)
- Train accuracy (should increase)
- Val accuracy (should increase)

### Evaluation
- Overall accuracy (>70%)
- F1-Score (>0.65)
- Per-pattern F1 (check which patterns need work)
- Regression R² (>0.60 for severity/importance)

### Calibration
- ECE (target: <0.08)
- MCE (target: <0.12)
- Temperature (typically 1.5-2.5)

---

**Ready? Start with**: `python test_setup.py`

**Questions?** Check `README.md` for comprehensive documentation.

**🎉 Good luck with your Legal-BERT training! 🎉**
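For reference, the temperature scaling and ECE mentioned in the Calibration section can be sketched as below. This is a minimal, self-contained illustration of the technique, not the project's actual `calibrate.py`; the 7-way output shape, the bin count, and the toy data are assumptions made for the example.

```python
# Sketch of temperature scaling and Expected Calibration Error (ECE).
# Illustrative only -- not the project's calibrate.py. Shapes and bin
# count are assumptions for this example.
import numpy as np

def softmax(logits, temperature=1.0):
    """Softmax over the last axis, with logits divided by a temperature."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence, then take the weighted average
    of |accuracy - mean confidence| across the bins."""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the fraction of samples in the bin
    return ece

# Toy data: sharp (overconfident) logits for a 7-way classifier with
# random labels, so high confidence is miscalibrated by construction.
rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 7)) * 5.0
labels = rng.integers(0, 7, size=500)

ece_before = expected_calibration_error(softmax(logits), labels)
ece_after = expected_calibration_error(softmax(logits, temperature=2.0), labels)
print(f"ECE before: {ece_before:.3f}, after T=2.0: {ece_after:.3f}")
```

A temperature above 1 softens the probability distribution without changing which class is predicted, which is why scaling typically lowers ECE for overconfident models; the "Temperature ~1.5-2.5" success indicator above reflects that range.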