---
language:
- en
license: mit
tags:
- summarization
- dialogue-summarization
- bart
- lora
- merged
- finetuned
- seq2seq
- transformers
- highlightsum
pipeline_tag: summarization
library_name: transformers
metrics:
- rouge1
- rouge2
- rougeL
- bertscore
- bleu
model_name: bart-highlightsum-merged
base_model: facebook/bart-large-cnn
new_version: "1.0.0"
datasets:
- knkarthick/highlightsum
---
# BART-HighlightSum (Merged Model)
Fine-tuned BART-Large on the HighlightSum dialogue summarization dataset (Merged LoRA → Full Model)
**Model type:** Seq2Seq Summarization
**Base model:** facebook/bart-large-cnn
**Dataset:** HighlightSum (dialogue summarization)
**Finetuning method:** LoRA → merged into full FP16 BART
## Model Summary
This model is a merged BART-Large fine-tuned on 2,000 training and 200 validation samples from the HighlightSum dataset.
It produces concise, accurate summaries of multi-turn dialogues.
✔ LoRA fine-tuning
✔ LoRA weights merged into base BART
✔ No PEFT required for inference
✔ Lightweight, fast, and deployment-ready
This version is recommended for production: it achieves the highest ROUGE and BERTScore of all variants (Baseline, LoRA, Merged).
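Why no PEFT is required at inference follows from the LoRA algebra: the low-rank update `B @ A` can be folded directly into the frozen base weight, after which the model is a plain `transformers` checkpoint. A toy NumPy sketch (illustrative dimensions, not the actual BART weights; `alpha` is an assumed scaling factor):

```python
import numpy as np

# Toy illustration: LoRA adds a low-rank update (alpha/r) * B @ A on top of a
# frozen weight W. Merging folds that update into W, so inference needs only
# a standard matmul -- no adapter code path, no PEFT dependency.
rng = np.random.default_rng(0)
d, r, alpha = 16, 8, 16           # hidden size, LoRA rank 8, assumed scaling alpha
W = rng.normal(size=(d, d))       # frozen base weight
A = rng.normal(size=(r, d))       # LoRA down-projection (trained)
B = rng.normal(size=(d, r))       # LoRA up-projection (trained)
x = rng.normal(size=(d,))

adapter_out = W @ x + (alpha / r) * (B @ (A @ x))   # base + adapter path
W_merged = W + (alpha / r) * (B @ A)                # weights after merging
merged_out = W_merged @ x                           # single plain matmul

assert np.allclose(adapter_out, merged_out)         # identical outputs
```

The merged weight has the same shape as the original, so the checkpoint size and inference cost match vanilla BART.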
## Performance/Evaluation Results
Evaluation on HighlightSum (Validation 200 samples)
The following results were obtained using 200 validation samples from the HighlightSum dataset.
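As a rough intuition for the scores in the tables below, ROUGE-1 F1 measures unigram overlap between the generated and reference summaries. A simplified sketch (whitespace tokenization, no stemming; the reported numbers use the full standard ROUGE implementation):

```python
from collections import Counter

def rouge1_f1(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F1: unigram overlap between prediction and reference."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    # Clipped overlap: each reference token can be matched at most once
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge1_f1("the cat sat", "the cat ran")` shares two of three unigrams on each side, giving an F1 of about 0.67.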
### Merged Model Performance
| Metric | Score |
|--------|--------|
| ROUGE-1 | 0.383 |
| ROUGE-2 | 0.179 |
| ROUGE-L | 0.301 |
| BERTScore (F1) | 0.335 |
| BLEU | 0.0014 |
### Comparison with Baseline and LoRA Models
| Metric | Baseline BART | LoRA Model | Merged Model |
|--------|---------------|------------|--------------|
| ROUGE-1 | 0.275 | 0.337 | 0.383 |
| ROUGE-2 | 0.090 | 0.152 | 0.179 |
| ROUGE-L | 0.204 | 0.252 | 0.301 |
| BERTScore (F1) | 0.163 | 0.298 | 0.335 |
| BLEU | 0.0052 | 0.0111 | 0.0014 |
### Conclusion
The merged model performs best, achieving the highest ROUGE-1, ROUGE-2, ROUGE-L, and BERTScore among all variants. (All variants score near zero on BLEU, which rewards exact n-gram matches that abstractive summaries rarely reproduce, so BLEU is not a deciding metric here.)
It is therefore the recommended model for deployment, inference, and user-facing applications.
## 🧪 Example Input / Output
Using Example #1 from the HighlightSum dataset:
### Dialogue
```
A: What are you getting him?
B: Something cool.
A: What about a Lego?
B: He is too old for that now.
A: What about a book?
B: He hates reading.
A: Then I give up. I have no idea what to get him.
```
### Human Gold Summary
They discuss gift ideas for someone's son.
### Merged Model Summary
They talk about what to get a boy as a gift but can't decide.
→ The model captures the intent, context, and key meaning with improved fluency and coherence.
## Intended Use
### Suitable for
- Dialogue summarization
- Customer service chat compression
- Meeting note extraction
- Educational tools
### Not suitable for
- Factual QA
- Domain-specific technical summaries without fine-tuning
- Safety-critical use
## How to Use
### Python Inference
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "dlaima/bart-highlightsum-merged"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

text = """A: Hi Tom, are you busy tomorrow afternoon?
B: I think I am. Why?
A: I want to go to the animal shelter.
B: For what?
A: I'm getting a puppy for my son."""

# Truncate to the 768-token training limit; generate up to 192 summary tokens
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=768)
with torch.no_grad():
    summary_ids = model.generate(**inputs, max_new_tokens=192)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```
## Training Details
- **Method:** LoRA (rank 8)
- **Model:** BART-Large
- **Batch size:** 8 (micro-batch 4 × grad-accumulation 2)
- **Epochs:** ~2.4 (capped by 2000 examples)
- **Max input length:** 768 tokens
- **Max summary length:** 192 tokens
- **Precision:** FP16
- **Optimizer:** AdamW
- **Learning rate:** 3e-4
- **Hardware:** NVIDIA T4
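The hyperparameters above map onto a standard PEFT setup. A configuration sketch (not the exact training script; `lora_alpha`, `lora_dropout`, and the `target_modules` choice of attention projections are assumptions, as they are not listed above):

```python
# Configuration sketch only -- requires the `peft` and `transformers` packages
# and downloads the base checkpoint when run.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")
lora_cfg = LoraConfig(
    r=8,                                  # rank 8, as listed above
    lora_alpha=16,                        # assumed scaling factor
    target_modules=["q_proj", "v_proj"],  # assumed: common choice for BART attention
    lora_dropout=0.05,                    # assumed
    task_type="SEQ_2_SEQ_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # only the LoRA matrices are trained
```

After training, `model.merge_and_unload()` produces the standalone merged checkpoint distributed in this repo.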
## 📚 Dataset: HighlightSum
A dataset of dialogue → summary pairs from multiple conversational sources.
- Multi-turn dialogues
- Short, medium, and long conversations
- Realistic conversational structure
- Human-written summaries
### Subset used here
- 2,000 samples for training
- 200 samples for validation
## Files Included in This Repo
| File | Description |
|------|-------------|
| pytorch_model.bin | Final merged FP16 BART model |
| config.json | Standard HuggingFace config |
| generation_config.json | Beam search config |
| tokenizer.json / tokenizer.model | Tokenizer files |
| README.md | This model card |
## Limitations & Recommendations
### Limitations
- May over-compress very long dialogues
- Not designed for domain-specific jargon
- Occasionally omits rare names or details
- Not a factual QA model
- Can hallucinate minor details in complex dialogues
### Recommendations
- Use merged model for production
- Apply additional fine-tuning for domain-specific tasks
- For reproducibility, fix random seeds and pin the HF transformers version
- Consider quantization (INT8 or GGUF) for mobile deployment
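GGUF conversion is a separate workflow (via the llama.cpp ecosystem), but INT8 can be sketched with PyTorch's post-training dynamic quantization, shown here on a toy module standing in for the summarizer:

```python
import torch
import torch.nn as nn

# Toy stand-in for the model: dynamic quantization stores nn.Linear weights
# as INT8 and dequantizes on the fly, shrinking the model for CPU inference.
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
model.eval()

qmodel = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 64)
out = qmodel(x)   # INT8 weights, FP32 activations; same output shape as before
```

Expect a small accuracy drop; re-running the ROUGE evaluation on the quantized model before deployment is advisable.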
## Maintenance
This model will be updated as:
- Additional training data becomes available
- Larger LoRA variants are tested
- Better merging & evaluation pipelines are developed
## Contact
For questions, improvements, or collaboration, feel free to reach out via GitHub or HuggingFace (@dlaima).