βš–οΈ LawSummBart β€” Legal Document Summarization (ETA + LoRA Fine-Tuned BART)

LawSummBart is a domain-adapted abstractive summarization model fine-tuned for Indian legal documents, built using a novel Extract-Then-Assign (ETA) dataset generation pipeline and LoRA parameter-efficient fine-tuning on top of BART-Large CNN.

The model is optimized to handle long court judgments and produce concise, high-quality summaries suitable for downstream legal research applications.


🧠 Model Details

Model Description

  • Shared by: qwertyqwerty070806
  • Model Type: Abstractive Summarization (Seq2Seq)
  • Language: English (Indian Legal Domain)
  • Fine-tuned From: facebook/bart-large-cnn
  • Framework: πŸ€— Transformers
  • License: MIT

This model uses an expanded training dataset generated via seven extractive summarization techniques combined with ROUGE & BERTScore-based sentence assignment, enabling improved domain adaptation for lengthy legal text.


πŸ“‚ Model Sources

  β€’ Repository: https://github.com/qwerty070806/LawSummBart

🧰 Uses

βœ”οΈ Direct Use

  • Abstractive summarization of:
    • Court judgments
    • Legal case descriptions
    • Regulatory documents
    • Legal orders

βœ”οΈ Downstream Use

  • Legal research assistants
  • Automated judgment summarization
  • Compliance document preprocessing
  • NLP pipelines for legal analytics

🚫 Out-of-Scope Uses

  • Providing legal advice
  • Predicting legal outcomes
  • Multi-lingual summarization (English only)
  • Summaries requiring external legal expertise

⚠️ Bias, Risks & Limitations

Limitations

  • Max effective input length β‰ˆ 1024 tokens (BART constraint)
  • Domain-specific model; may underperform on non-legal text
  • Some legal nuances may be compressed or omitted

Potential Bias

  • Court documents may contain systemic or institutional bias
  • Extractive heuristics could overweight certain portions of text

Recommendation

Use model outputs as assistive summaries, not authoritative interpretations.
Human verification is essential for professional use.


πŸš€ Getting Started

import warnings
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

warnings.filterwarnings("ignore")

# Fall back to CPU when no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained("qwertyqwerty070806/LawSummBart", trust_remote_code=True).to(device)

text = """On March 5, 2021, the Securities and Exchange Commission charged AT&T..."""

# Truncate inputs to BART's 1024-token window
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024).to(device)

summary_ids = model.generate(
    **inputs,
    max_length=220,
    min_length=80,
    num_beams=5,
    length_penalty=1.0,
    no_repeat_ngram_size=3,
    early_stopping=True
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

Default Generation Parameters

Parameter              Value   Description
max_length             220     Maximum summary length (tokens)
min_length             80      Minimum summary length (tokens)
num_beams              5       Beam search width
length_penalty         1.0     Neutral length normalization (values > 1.0 favor longer outputs)
no_repeat_ngram_size   3       Blocks repeated 3-grams
early_stopping         True    Stops when all beams reach EOS

πŸ—οΈ Training Details

Training Data

Dataset generated using Extract-Then-Assign (ETA); the assignment step is sketched after this list:

  • Applied seven extractive summarizers to each legal document
  • Mapped extractive chunks to summary sentences using:
    • ROUGE scores
    • BERTScore semantic similarity
  • Dataset expanded by 7Γ— through extractive augmentation

Preprocessing

  • Normalized paragraphs & whitespace
  • Removed boilerplate legal headers (case metadata, citations)
  • Tokenized up to BART’s limit (1024 tokens)

Training Procedure

LoRA Configuration

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank update matrices
    lora_alpha=32,                        # scaling factor (alpha / r = 2)
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],  # attention query/value projections
    task_type="SEQ_2_SEQ_LM"
)
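
With PEFT, this adapter configuration is attached to the frozen base model in a single call; a minimal sketch (the variable names are ours):

from peft import get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
model = get_peft_model(base, lora_config)  # lora_config from the block above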

Training Arguments

from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,    # effective batch size of 8
    learning_rate=1e-4,
    eval_strategy="epoch",
    save_strategy="epoch",
    fp16=True,                        # mixed precision on the T4
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss"
)
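
These arguments plug into a standard πŸ€— Trainer; a hedged sketch of the wiring, where train_ds, eval_ds, and tokenizer are placeholders for the tokenized ETA splits and the BART tokenizer:

from transformers import Trainer, DataCollatorForSeq2Seq

trainer = Trainer(
    model=model,                      # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=train_ds,           # placeholder: tokenized ETA training split
    eval_dataset=eval_ds,             # placeholder: tokenized ETA validation split
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()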

Why LoRA?

  • Trains only 5–10M parameters (vs 406M full fine-tuning)
  • Saves 60%+ GPU memory
  • Enables training on free GPUs (Kaggle, Colab)
  • Maintains strong performance while reducing compute cost

πŸ“Š Evaluation

Metrics

  • ROUGE-1, ROUGE-2, ROUGE-L
  • BERTScore (Precision, Recall, F1)

Results

Metric                 Score
ROUGE-1                0.5723
ROUGE-2                0.3547
ROUGE-L                0.3169
BERTScore Precision    0.8453
BERTScore Recall       0.8535
BERTScore F1           0.8493

Summary

The model achieves strong performance on abstractive summarization of long-form legal text, showing improved coherence and domain fidelity compared to base BART.


🌱 Environmental Impact (Approx.)

  • Hardware: NVIDIA T4 (Kaggle/Colab)
  • Training Time: ~7–8 hours
  • COβ‚‚ Emissions: Low (consumer-grade GPU)

πŸ”¬ Technical Specifications

Model Architecture

  • Transformer encoder–decoder (BART)
  • Seq2Seq language modeling objective
  • LoRA applied to:
    • q_proj
    • v_proj

Compute Infrastructure

  • Hardware: NVIDIA T4 (16 GB)
  • Software:
    • PyTorch
    • Transformers β‰₯ 4.53
    • PEFT (LoRA)
    • Accelerate
    • Datasets

πŸ“š Citation

@misc{lawsummbart2024,
  title = {LawSummBart: Legal Summarization with Extract-Then-Assign and LoRA Fine-Tuned BART},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/qwertyqwerty070806/LawSummBart}},
  note = {Accessed: [DATE]}
}

πŸ“¬ Contact

For issues or questions, please visit: πŸ‘‰ https://github.com/qwerty070806/LawSummBart
