Carbon emissions can be estimated using the methodology of Quantifying the Carbon Emissions of Machine Learning (arXiv:1910.09700).
This model card describes a model fine-tuned for the legal domain of the Indian judiciary.
The model can be used directly to generate summaries of legal documents.
The model may not work for languages other than English or for legal domains outside India.
The model may not work for long documents; these need to be chunked using the passage-retrieval methods described in the GitHub repository (a simple fallback sketch appears below).
Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model.
Use the code below to get started with the model.
#!/usr/bin/env python3
"""Batch-summarize legal paragraphs with a fine-tuned legal-Pegasus model."""
import argparse
import json

import pandas as pd
from tqdm import tqdm
from transformers import pipeline


def load_jsonl(path):
    """Read a JSON-lines file into a DataFrame, one record per line."""
    rows = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
    return pd.DataFrame(rows)


def save_jsonl(records, path):
    """Write a list of dicts to a JSON-lines file."""
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Pegasus summarization using transformers.pipeline")
    parser.add_argument("--input", required=True, help="Input .jsonl or .csv")
    parser.add_argument("--output", required=True, help="Output JSONL file")
    parser.add_argument("--model", required=True,
                        help="HF repo id or local model folder containing the fine-tuned legal-Pegasus")
    parser.add_argument("--use_auth_token", default=None,
                        help="HF token for private models (newer transformers versions use token= instead)")
    args = parser.parse_args()

    print("Loading summarization pipeline...")
    summarizer = pipeline(
        "summarization",
        model=args.model,
        tokenizer=args.model,  # tokenizer is usually stored in the same repo
        use_auth_token=args.use_auth_token,
        device=-1,  # CPU; change to 0 for GPU
        framework="pt",
    )

    print("Loading dataset...")
    if args.input.endswith(".jsonl"):
        df = load_jsonl(args.input)
    else:
        df = pd.read_csv(args.input)
    if "ID" not in df.columns or "para_text" not in df.columns:
        raise ValueError("Input must contain columns: ID, para_text")

    results = []
    print("Running summarization...")
    for _, row in tqdm(df.iterrows(), total=len(df)):
        text = row["para_text"]
        summary = summarizer(
            text,
            max_length=256,
            min_length=20,
            do_sample=False,  # deterministic decoding
            num_beams=5,      # beam search
            truncation=True,  # clip inputs longer than the model's max length
        )[0]["summary_text"]
        results.append({
            "ID": str(row["ID"]),
            "Summary": summary,
        })

    save_jsonl(results, args.output)
    print(f"Summaries saved to {args.output}")
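Assuming the script above is saved as summarize.py (the filename and model path here are illustrative), a typical CPU run could look like this:

python summarize.py \
    --input paragraphs.jsonl \
    --output summaries.jsonl \
    --model path/to/fine-tuned-legal-pegasus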
Long documents require chunking using the passage-retrieval methods described in the GitHub repository, and raw data should be normalized; a simple chunking fallback is sketched below.
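The repository's passage-retrieval pipeline is not reproduced here; as a rough stand-in, overlapping word windows (the window and overlap sizes are assumptions) can keep each chunk within the model's input limit:

def chunk_text(text, max_words=400, overlap=50):
    # Naive overlapping word windows; the repository's passage-retrieval
    # method may differ. This is only an illustrative fallback.
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, max(len(words), 1), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# Summarize each chunk with the pipeline above, then join the partial summaries:
# partial = [summarizer(c, truncation=True)[0]["summary_text"]
#            for c in chunk_text(long_document)]
# final_summary = " ".join(partial)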
ROUGE Score
21.30 (average of ROUGE-2, ROUGE-L, and BLEU)
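Scores of this kind can be reproduced with Hugging Face's evaluate library; the prediction and reference lists below are placeholders:

import evaluate

rouge = evaluate.load("rouge")  # returns rouge1/rouge2/rougeL F-measures
bleu = evaluate.load("bleu")

predictions = ["generated summary ..."]      # model outputs (placeholder)
references = ["gold reference summary ..."]  # human-written summaries (placeholder)

r = rouge.compute(predictions=predictions, references=references)
b = bleu.compute(predictions=predictions, references=[[ref] for ref in references])

# Average of ROUGE-2, ROUGE-L, and BLEU on a 0-100 scale, as reported above.
print((r["rouge2"] + r["rougeL"] + b["bleu"]) * 100 / 3)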
The model was trained on a single NVIDIA H100 GPU with 94 GB of memory.
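Since the card cites Lacoste et al. (arXiv:1910.09700), a run's emissions can also be measured directly with the codecarbon package (the project name is illustrative):

from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="legal-pegasus-finetune")  # illustrative name
tracker.start()
# ... training or inference workload goes here ...
emissions_kg = tracker.stop()  # estimated kg CO2-eq for the tracked span
print(f"Estimated emissions: {emissions_kg:.4f} kg CO2-eq")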
Base model
nsi319/legal-pegasus