---
library_name: transformers
datasets:
- coastalchp/ledgar
language:
- en
base_model:
- nlpaueb/legal-bert-base-uncased
pipeline_tag: text-classification
---

# LegalBERT Fine-Tuned on the LEDGAR Dataset

This model is a fine-tuned version of **[LegalBERT](https://huggingface.co/nlpaueb/legal-bert-base-uncased)** on the **LEDGAR** dataset for **legal clause classification**. It classifies a legal clause into one of **100 clause types** (e.g., confidentiality, termination, liability).

---

## Model Overview

- **Base Model:** `nlpaueb/legal-bert-base-uncased`
- **Task:** Multi-class clause classification
- **Dataset:** LEDGAR
- **Language:** English
- **Number of labels:** 100
- **Fine-tuning epochs:** 4
- **Batch size:** 32
- **Optimizer:** AdamW
- **Mixed precision (FP16):** Enabled when CUDA is available

---

## Dataset Details

| Split | Samples | Description |
|-------|---------|-------------|
| Train | 60,000 | Used for model fine-tuning |
| Eval | 10,000 | Used for validation during training |
| Test | 10,000 | Held-out test set for final evaluation |

- **Total samples:** 80,000
- **Number of labels:** 100
- **Text column:** `text` (the clause text)
- **Label column:** `label`

---

## Evaluation Results (Test Set)

| Metric | Score |
|--------|-------|
| **Accuracy** | 0.8678 |
| **Macro F1** | 0.7779 |
| **Macro Precision** | 0.7917 |
| **Macro Recall** | 0.7763 |
| **Evaluation Time** | 38.37 s |

---

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load model and tokenizer
model_name = "FENTECH/Legal-BERT-Clause-Classification"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Example inference
text = "The contractor shall maintain confidentiality of all client information."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
predicted_label = outputs.logits.argmax(dim=-1).item()
print("Predicted label ID:", predicted_label)
```
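The snippet above prints only a numeric label ID. In practice you usually want a clause name and a confidence score: apply a softmax over the logits and, if the checkpoint's config carries an `id2label` mapping (as `transformers` sequence-classification checkpoints typically do), look the ID up there. Below is a minimal sketch with a stand-in logits tensor and a toy one-entry mapping so it runs without downloading the model; the class index and label name are illustrative, not taken from the real label set:

```python
import torch

# Stand-in for outputs.logits: batch of 1, 100 clause classes.
logits = torch.zeros(1, 100)
logits[0, 42] = 5.0  # pretend the model scored class 42 highest

# Toy stand-in for model.config.id2label (the real mapping has 100 entries).
id2label = {42: "Confidentiality"}

probs = torch.softmax(logits, dim=-1)   # logits -> probability distribution
pred_id = probs.argmax(dim=-1).item()   # index of the best-scoring class
confidence = probs[0, pred_id].item()   # probability assigned to that class

print(f"Predicted: {id2label.get(pred_id, pred_id)} ({confidence:.2%})")
```

With the real model, substitute `outputs.logits` for `logits` and `model.config.id2label` for the toy dictionary.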