# ⚖️ LawSummBart – Legal Document Summarization (ETA + LoRA Fine-Tuned BART)
LawSummBart is a domain-adapted abstractive summarization model fine-tuned for Indian legal documents, built using a novel Extract-Then-Assign (ETA) dataset generation pipeline and LoRA parameter-efficient fine-tuning on top of BART-Large CNN.
The model is optimized to handle long court judgments and produce concise, high-quality summaries suitable for downstream legal research applications.
## 🧠 Model Details
### Model Description
- Shared by: qwertyqwerty070806
- Model Type: Abstractive Summarization (Seq2Seq)
- Language: English (Indian Legal Domain)
- Fine-tuned From: facebook/bart-large-cnn
- Framework: 🤗 Transformers
- License: MIT
This model uses an expanded training dataset generated via seven extractive summarization techniques combined with ROUGE & BERTScore-based sentence assignment, enabling improved domain adaptation for lengthy legal text.
### 🔗 Model Sources
- GitHub: https://github.com/qwerty070806/LawSummBart
- Model Hub: https://huggingface.co/qwertyqwerty070806/LawSummBart
- Base Model Paper: https://arxiv.org/abs/1910.13461 (BART)
## 🧰 Uses
### ✔️ Direct Use
- Abstractive summarization of:
  - Court judgments
  - Legal case descriptions
  - Regulatory documents
  - Legal orders
### ✔️ Downstream Use
- Legal research assistants
- Automated judgment summarization
- Compliance document preprocessing
- NLP pipelines for legal analytics
### 🚫 Out-of-Scope Uses
- Providing legal advice
- Predicting legal outcomes
- Multi-lingual summarization (English only)
- Summaries requiring external legal expertise
## ⚠️ Bias, Risks & Limitations
### Limitations
- Max effective input length ≈ 1024 tokens (a BART constraint; see the chunking sketch below)
- Domain-specific model; may underperform on non-legal text
- Some legal nuances may be compressed or omitted
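For documents longer than the 1024-token window, a common workaround is to summarize overlapping chunks and concatenate the results. The helper below is a minimal sketch of that pattern, not part of the released code; the `window` and `stride` sizes are illustrative, and `tokenizer`/`model` are the objects loaded in the Getting Started snippet further down.

```python
# Minimal chunk-then-summarize sketch for inputs beyond BART's 1024-token
# limit. Not part of the released code; window/stride values are illustrative.
def summarize_long_document(text, tokenizer, model, window=1024, stride=896):
    ids = tokenizer(text, truncation=False, return_tensors="pt").input_ids[0]
    summaries = []
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + window].unsqueeze(0).to(model.device)
        out = model.generate(chunk, max_length=220, min_length=80, num_beams=5)
        summaries.append(tokenizer.decode(out[0], skip_special_tokens=True))
        if start + window >= len(ids):
            break
    return " ".join(summaries)
```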
### Potential Bias
- Court documents may contain systemic or institutional bias
- Extractive heuristics could overweight certain portions of text
### Recommendation
Use model outputs as assistive summaries, not authoritative interpretations.
Human verification is essential for professional use.
## 🚀 Getting Started
```python
import warnings
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

warnings.filterwarnings("ignore")

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True).to(device)

text = """On March 5, 2021, the Securities and Exchange Commission charged AT&T..."""

# Tokenize; inputs longer than max_length are truncated
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)

summary_ids = model.generate(
    **inputs,
    max_length=220,
    min_length=80,
    num_beams=5,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
### Default Generation Parameters
| Parameter | Value | Description |
|---|---|---|
| `max_length` | 220 | Maximum summary length (tokens) |
| `min_length` | 80 | Minimum summary length (tokens) |
| `num_beams` | 5 | Beam search width |
| `length_penalty` | 1.0 | Length-normalization exponent for beam scoring (1.0 is neutral) |
| `no_repeat_ngram_size` | 3 | Blocks repeated trigrams |
| `early_stopping` | True | Stops when all beams reach EOS |
## 🏋️ Training Details
### Training Data
Dataset generated using Extract-Then-Assign (ETA):
- Applied seven extractive summarizers to each legal document
- Mapped extractive chunks to summary sentences using (sketched below):
  - ROUGE scores
  - BERTScore semantic similarity
- Dataset expanded by 7× through extractive augmentation
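The assignment step can be pictured as scoring each extractive chunk against every reference-summary sentence and keeping the best match. The sketch below is an illustration under assumptions: it uses the `rouge_score` and `bert_score` packages and an equal-weight blend (`alpha=0.5`); the exact scoring and weighting used to build the dataset may differ.

```python
# Illustrative ETA assignment: pair each extractive chunk with the
# reference-summary sentence it matches best. The equal-weight blend
# (alpha=0.5) is an assumption, not the pipeline's documented setting.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def assign_chunks(chunks, summary_sentences, alpha=0.5):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    assignments = []
    for chunk in chunks:
        rouge = [scorer.score(sent, chunk)["rougeL"].fmeasure
                 for sent in summary_sentences]
        _, _, f1 = bert_score([chunk] * len(summary_sentences),
                              summary_sentences, lang="en", verbose=False)
        blended = [alpha * r + (1 - alpha) * b.item() for r, b in zip(rouge, f1)]
        assignments.append(max(range(len(blended)), key=blended.__getitem__))
    return assignments  # best-matching summary-sentence index per chunk
```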
### Preprocessing
- Normalized paragraphs & whitespace
- Removed boilerplate legal headers (case metadata, citations); a rough sketch follows
- Tokenized up to BART's limit (1024 tokens)
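As a rough illustration of this cleaning step, the snippet below strips a couple of header patterns and normalizes whitespace. The actual patterns used for training are not published, so these regexes are assumptions.

```python
import re

# Illustrative boilerplate patterns; the patterns actually used in
# training are not published.
BOILERPLATE = [
    r"(?im)^\s*IN THE (SUPREME|HIGH) COURT.*$",
    r"(?im)^\s*(CIVIL|CRIMINAL) APPEAL NO\..*$",
]

def preprocess(text: str) -> str:
    for pattern in BOILERPLATE:
        text = re.sub(pattern, "", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()
```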
### Training Procedure
#### LoRA Configuration
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_2_SEQ_LM",
)
```
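Applying this configuration with PEFT looks roughly like the following minimal sketch; `lora_config` refers to the object above, and the base checkpoint mirrors the card's stated starting model.

```python
from peft import get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model = get_peft_model(base, lora_config)  # lora_config as defined above
model.print_trainable_parameters()         # reports the small trainable fraction
```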
#### Training Arguments
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lawsummbart-lora",  # illustrative; not specified in the original card
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```
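A sketch of how these arguments plug into a standard `Trainer` loop; `train_dataset` and `eval_dataset` are placeholders for the tokenized ETA dataset, which is not bundled here.

```python
from transformers import Trainer, DataCollatorForSeq2Seq

# `model`, `tokenizer`, and `training_args` as defined above;
# train_dataset / eval_dataset are placeholders for the tokenized ETA data.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```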
#### Why LoRA?
- Trains only ~5–10M parameters (vs. 406M for full fine-tuning)
- Saves 60%+ GPU memory
- Enables training on free GPUs (Kaggle, Colab)
- Maintains strong performance while reducing compute cost
## 📊 Evaluation
### Metrics
- ROUGE-1, ROUGE-2, ROUGE-L
- BERTScore (Precision, Recall, F1) (see the computation sketch below)
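These metrics can be reproduced with the `evaluate` library along the following lines; `predictions` and `references` are placeholder lists of generated and gold summary strings.

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# predictions / references: parallel lists of generated and gold summaries
rouge_scores = rouge.compute(predictions=predictions, references=references)
bs = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bs["f1"]) / len(bs["f1"]))  # mean BERTScore F1
```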
### Results
| Metric | Score |
|---|---|
| ROUGE-1 | 0.5723 |
| ROUGE-2 | 0.3547 |
| ROUGE-L | 0.3169 |
| BERTScore Precision | 0.8453 |
| BERTScore Recall | 0.8535 |
| BERTScore F1 | 0.8493 |
### Summary
The model achieves strong performance on abstractive summarization of long-form legal text, showing improved coherence and domain fidelity compared to base BART.
## 🌱 Environmental Impact (Approx.)
- Hardware: NVIDIA T4 (Kaggle/Colab)
- Training Time: ~7–8 hours
- CO₂ Emissions: Low (single consumer-grade GPU)
## 🔬 Technical Specifications
### Model Architecture
- Transformer encoder–decoder (BART)
- Seq2Seq language modeling objective
- LoRA applied to: `q_proj`, `v_proj`
### Compute Infrastructure
- Hardware: NVIDIA T4 (16 GB)
- Software:
  - PyTorch
  - Transformers ≥ 4.53
  - PEFT (LoRA)
  - Accelerate
  - Datasets
## 📖 Citation
```bibtex
@misc{lawsummbart2024,
  title        = {LawSummBart: Legal Summarization with Extract-Then-Assign and LoRA Fine-Tuned BART},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/qwertyqwerty070806/LawSummBart}},
  note         = {Accessed: [DATE]}
}
```
## 💬 Contact
For issues or questions, please visit: 👉 https://github.com/qwerty070806/LawSummBart