# ⚖️ LawSummBart – Legal Document Summarization (ETA + LoRA Fine-Tuned BART)
LawSummBart is a domain-adapted abstractive summarization model fine-tuned for Indian legal documents, built using a novel Extract-Then-Assign (ETA) dataset generation pipeline and LoRA parameter-efficient fine-tuning on top of BART-Large CNN.
The model is optimized to handle long court judgments and produce concise, high-quality summaries suitable for downstream legal research applications.
## 🧠 Model Details
### Model Description
- Shared by: qwertyqwerty070806
- Model Type: Abstractive Summarization (Seq2Seq)
- Language: English (Indian Legal Domain)
- Fine-tuned From: facebook/bart-large-cnn
- Framework: 🤗 Transformers
- License: MIT
This model uses an expanded training dataset generated via seven extractive summarization techniques combined with ROUGE & BERTScore-based sentence assignment, enabling improved domain adaptation for lengthy legal text.
### 🔗 Model Sources
- GitHub: https://github.com/qwerty070806/LawSummBart
- Model Hub: https://huggingface.co/qwertyqwerty070806/LawSummBart
- Base Model Paper: https://arxiv.org/abs/1910.13461 (BART)
## 🧰 Uses
### ✔️ Direct Use
- Abstractive summarization of:
  - Court judgments
  - Legal case descriptions
  - Regulatory documents
  - Legal orders
### ✔️ Downstream Use
- Legal research assistants
- Automated judgment summarization
- Compliance document preprocessing
- NLP pipelines for legal analytics
### 🚫 Out-of-Scope Uses
- Providing legal advice
- Predicting legal outcomes
- Multi-lingual summarization (English only)
- Summaries requiring external legal expertise
## ⚠️ Bias, Risks & Limitations
### Limitations
- Max effective input length ≈ 1024 tokens (a BART constraint; see the chunking sketch below)
- Domain-specific model; may underperform on non-legal text
- Some legal nuances may be compressed or omitted
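For documents longer than the 1024-token window, a common workaround is to summarize overlapping chunks and concatenate the results. The helper below is a minimal sketch of that pattern, not part of the released code; the `window` and `stride` sizes are illustrative, and `tokenizer`/`model` are the objects loaded in the Getting Started snippet further down.

```python
# Minimal chunk-then-summarize sketch for inputs beyond BART's 1024-token
# limit. Not part of the released code; window/stride values are illustrative.
def summarize_long_document(text, tokenizer, model, window=1024, stride=896):
    ids = tokenizer(text, truncation=False, return_tensors="pt").input_ids[0]
    summaries = []
    for start in range(0, len(ids), stride):
        chunk = ids[start:start + window].unsqueeze(0).to(model.device)
        out = model.generate(chunk, max_length=220, min_length=80, num_beams=5)
        summaries.append(tokenizer.decode(out[0], skip_special_tokens=True))
        if start + window >= len(ids):
            break
    return " ".join(summaries)
```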
### Potential Bias
- Court documents may contain systemic or institutional bias
- Extractive heuristics could overweight certain portions of text
### Recommendation
Use model outputs as assistive summaries, not authoritative interpretations.
Human verification is essential for professional use.
## 🚀 Getting Started
```python
import warnings
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

warnings.filterwarnings("ignore")

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True).to(device)

text = """On March 5, 2021, the Securities and Exchange Commission charged AT&T..."""

# Tokenize; inputs longer than max_length are truncated
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)

summary_ids = model.generate(
    **inputs,
    max_length=220,
    min_length=80,
    num_beams=5,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
### Default Generation Parameters
| Parameter | Value | Description |
|---|---|---|
| `max_length` | 220 | Maximum summary length (tokens) |
| `min_length` | 80 | Minimum summary length (tokens) |
| `num_beams` | 5 | Beam search width |
| `length_penalty` | 1.0 | Length-normalization exponent for beam scoring (1.0 is neutral) |
| `no_repeat_ngram_size` | 3 | Blocks repeated trigrams |
| `early_stopping` | True | Stops when all beams reach EOS |
## 🏋️ Training Details
### Training Data
Dataset generated using Extract-Then-Assign (ETA):
- Applied seven extractive summarizers to each legal document
- Mapped extractive chunks to summary sentences using (sketched below):
  - ROUGE scores
  - BERTScore semantic similarity
- Dataset expanded by 7× through extractive augmentation
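The assignment step can be pictured as scoring each extractive chunk against every reference-summary sentence and keeping the best match. The sketch below is an illustration under assumptions: it uses the `rouge_score` and `bert_score` packages and an equal-weight blend (`alpha=0.5`); the exact scoring and weighting used to build the dataset may differ.

```python
# Illustrative ETA assignment: pair each extractive chunk with the
# reference-summary sentence it matches best. The equal-weight blend
# (alpha=0.5) is an assumption, not the pipeline's documented setting.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def assign_chunks(chunks, summary_sentences, alpha=0.5):
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    assignments = []
    for chunk in chunks:
        rouge = [scorer.score(sent, chunk)["rougeL"].fmeasure
                 for sent in summary_sentences]
        _, _, f1 = bert_score([chunk] * len(summary_sentences),
                              summary_sentences, lang="en", verbose=False)
        blended = [alpha * r + (1 - alpha) * b.item() for r, b in zip(rouge, f1)]
        assignments.append(max(range(len(blended)), key=blended.__getitem__))
    return assignments  # best-matching summary-sentence index per chunk
```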
### Preprocessing
- Normalized paragraphs & whitespace
- Removed boilerplate legal headers (case metadata, citations); a rough sketch follows
- Tokenized up to BART's limit (1024 tokens)
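As a rough illustration of this cleaning step, the snippet below strips a couple of header patterns and normalizes whitespace. The actual patterns used for training are not published, so these regexes are assumptions.

```python
import re

# Illustrative boilerplate patterns; the patterns actually used in
# training are not published.
BOILERPLATE = [
    r"(?im)^\s*IN THE (SUPREME|HIGH) COURT.*$",
    r"(?im)^\s*(CIVIL|CRIMINAL) APPEAL NO\..*$",
]

def preprocess(text: str) -> str:
    for pattern in BOILERPLATE:
        text = re.sub(pattern, "", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse blank-line runs
    return text.strip()
```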
### Training Procedure
#### LoRA Configuration
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="SEQ_2_SEQ_LM",
)
```
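Applying this configuration with PEFT looks roughly like the following minimal sketch; `lora_config` refers to the object above, and the base checkpoint mirrors the card's stated starting model.

```python
from peft import get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model = get_peft_model(base, lora_config)  # lora_config as defined above
model.print_trainable_parameters()         # reports the small trainable fraction
```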
#### Training Arguments
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="lawsummbart-lora",  # illustrative; not specified in the original card
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
)
```
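A sketch of how these arguments plug into a standard `Trainer` loop; `train_dataset` and `eval_dataset` are placeholders for the tokenized ETA dataset, which is not bundled here.

```python
from transformers import Trainer, DataCollatorForSeq2Seq

# `model`, `tokenizer`, and `training_args` as defined above;
# train_dataset / eval_dataset are placeholders for the tokenized ETA data.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```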
#### Why LoRA?
- Trains only ~5–10M parameters (vs. 406M for full fine-tuning)
- Saves 60%+ GPU memory
- Enables training on free GPUs (Kaggle, Colab)
- Maintains strong performance while reducing compute cost
## 📊 Evaluation
### Metrics
- ROUGE-1, ROUGE-2, ROUGE-L
- BERTScore (Precision, Recall, F1) (see the computation sketch below)
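These metrics can be reproduced with the `evaluate` library along the following lines; `predictions` and `references` are placeholder lists of generated and gold summary strings.

```python
import evaluate

rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

# predictions / references: parallel lists of generated and gold summaries
rouge_scores = rouge.compute(predictions=predictions, references=references)
bs = bertscore.compute(predictions=predictions, references=references, lang="en")

print(rouge_scores["rouge1"], rouge_scores["rouge2"], rouge_scores["rougeL"])
print(sum(bs["f1"]) / len(bs["f1"]))  # mean BERTScore F1
```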
### Results
| Metric | Score |
|---|---|
| ROUGE-1 | 0.5723 |
| ROUGE-2 | 0.3547 |
| ROUGE-L | 0.3169 |
| BERTScore Precision | 0.8453 |
| BERTScore Recall | 0.8535 |
| BERTScore F1 | 0.8493 |
### Summary
The model achieves strong performance on abstractive summarization of long-form legal text, showing improved coherence and domain fidelity compared to base BART.
## 🌱 Environmental Impact (Approx.)
- Hardware: NVIDIA T4 (Kaggle/Colab)
- Training Time: ~7–8 hours
- CO₂ Emissions: Low (single consumer-grade GPU)
## 🔬 Technical Specifications
### Model Architecture
- Transformer encoder–decoder (BART)
- Seq2Seq language modeling objective
- LoRA applied to: `q_proj`, `v_proj`
### Compute Infrastructure
- Hardware: NVIDIA T4 (16 GB)
- Software:
  - PyTorch
  - Transformers ≥ 4.53
  - PEFT (LoRA)
  - Accelerate
  - Datasets
## 📖 Citation
```bibtex
@misc{lawsummbart2024,
  title        = {LawSummBart: Legal Summarization with Extract-Then-Assign and LoRA Fine-Tuned BART},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/qwertyqwerty070806/LawSummBart}},
  note         = {Accessed: [DATE]}
}
```
## 💬 Contact
For issues or questions, please visit: 👉 https://github.com/qwerty070806/LawSummBart