bart-base-samsum-summarizer

facebook/bart-base fine-tuned on the SAMSum dialogue summarization corpus.

Note: Front-matter ROUGE scores reflect the champion decoding config (D27: beam=5, length_penalty=1.33). Default generation config (beam=4, lp=1.0) yields ROUGE-1=47.86, ROUGE-2=23.22, ROUGE-L=39.85.

⚠️ License: SAMSum is released under CC BY-NC-ND 4.0 (non-commercial, no derivatives). This model card, the model weights, and any outputs produced with them are subject to the same terms. Commercial use is prohibited.

Model Description

Field	Value
Base model	`facebook/bart-base` (139M parameters)
Task	Abstractive dialogue summarization
Language	English
License	cc-by-nc-nd-4.0
Dataset	SAMSum (`knkarthick/samsum`)
Hardware trained on	Apple M4 Pro, 24 GB UMA, MPS / BF16

Intended Use

Intended use: Summarizing short chat conversations (≤ 512 tokens) into 1–3 sentence abstractive summaries.
Out-of-scope: Real-time transcription, audio processing, multi-lingual dialogues, or any commercial product.
Not recommended for: Mission-critical applications where hallucinations cannot be tolerated. The model hallucinates entity-level details in ~10% of test examples.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "your-hf-username/bart-base-samsum-summarizer"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_id, dtype=torch.bfloat16)
model.eval()

dialogue = """
Amanda: I baked cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)
Jerry: Thanks! Do you know how to make the lemon ones?
Amanda: The biscuits? I'll send you the recipe. It's easy!
""".strip()

inputs = tokenizer(dialogue, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens = 128,
        num_beams      = 5,
        length_penalty = 1.33,  # D27 champion config (ROUGE-L 40.12)
        early_stopping = True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
# → "Amanda will bring Jerry some cookies tomorrow and send him the recipe."

Performance

All metrics are macro-averaged ROUGE F-measures × 100 on the 819-sample SAMSum test set.

Test-Set ROUGE

Metric	Value
ROUGE-1	48.48
ROUGE-2	23.55
ROUGE-L	40.12 (champion: D27 beam=5, lp=1.33)
ROUGE-L (training config: beam=4, lp=1.0)	39.92

Comparison: Fine-Tuned vs Zero-Shot

	ROUGE-L
BART-base zero-shot (100 samples)	19.89
BART-base fine-tuned (819 samples)	40.12 (+20.23)

Decoding Strategy Ablation (11 configs)

Config	ROUGE-L	Avg tokens	ms/sample
D1: beam=4, lp=0.8	39.49	15.2	138
D2: beam=4, lp=1.0	39.92	15.9	136
D3: beam=4, lp=1.2	39.97	16.7	136
D4: beam=8, lp=1.0	39.74	15.8	220
D5: nucleus p=0.9	35.93	18.8	92
D6: beam=4, lp=1.4	39.94	17.3	142
D7: beam=4, lp=1.25	40.01	16.8	136
D8: beam=4, lp=1.3	40.01	17.0	137
D9: beam=4, lp=1.2, nrng=3	39.97	16.7	136
D10: beam=6, lp=1.2	40.03	16.7	178
D11: beam=4, lp=1.2, min_len=5	39.97	16.7	136

Full 29-config sweep results in results/metrics/decoding_D*.json. Champion: D27 (beam=5, lp=1.33) at ROUGE-L 40.12 — see docs/EXPERIMENTS.md for complete E3 table.

Faithfulness Metrics

Metric	Value
Hallucination rate (spaCy NER)	10.1% (83 / 819)
Speaker preservation	75.5%
NLI faithfulness (DeBERTa-v3)	0.308
Length–ROUGE-L Pearson r	−0.25

LoRA Parameter-Efficient Fine-Tuning

Model	ROUGE-1	ROUGE-2	ROUGE-L	Trainable params
BART-base (full fine-tune)	48.04	23.33	39.92	139.4M (100%)
BART-base (LoRA r=16, α=32)	45.15	21.20	37.59	0.88M (0.63%)

LoRA achieves 94.2% of full fine-tune ROUGE-L with only 0.63% trainable parameters.

PEGASUS Cross-Domain Transfer

Condition	ROUGE-1	ROUGE-2	ROUGE-L	Notes
Zero-shot	1.85	0.00	1.60	news → dialogue domain mismatch
Fine-tuned	1.65	0.00	1.56	Convergence failure (see below)

Training failure: gradient_accumulation_steps=8 on MPS caused 8× gradient inflation (effective lr=1.6e-4). eval_loss=9.601 at epoch 3 ≈ random baseline. Fixed in script (grad_accum=1); ROUGE-L 40–44 expected on re-run.

Extended Training (E8 — 8 epochs, cosine LR, lr=3e-5)

Condition	ROUGE-1	ROUGE-2	ROUGE-L	Train time	Notes
Baseline (5ep, lr=5e-5)	47.86	23.22	39.85	168.4 min	E1 result
Extended (8ep, lr=3e-5, cosine)	46.45	22.05	38.46	259.6 min	Best epoch 4

Finding: Δ ROUGE-L = −1.39. Lower peak LR caused underfitting; baseline with lr=5e-5 linear decay converges to a better optimum. Hypothesis not supported.

Training Procedure

Dataset

Train: 14,731 examples
Validation: 818 examples
Test: 819 examples
Variant used: with_speakers — speaker attribution tags (Name: ) preserved. Ablation shows this contributes +6.62 ROUGE-L vs stripping tags.

Preprocessing

Dialogues are tokenized with AutoTokenizer from facebook/bart-base. max_source_length=512, max_target_length=128 (covers 99%+ of SAMSum examples at these lengths). No task prefix (BART does not require one; T5 uses "summarize: ").

Hyperparameters

Parameter	Value
Base model	`facebook/bart-base`
Optimizer	AdamW
Learning rate	5.0 × 10⁻⁵
LR schedule	Linear decay
Warmup steps	500
Weight decay	0.01
Batch size	8
Max epochs	5
Early stopping patience	2
Gradient clip norm	1.0
Precision	BF16
Best epoch	5
Best val ROUGE-L	41.57
Training time	72.4 min (M4 Pro MPS)

Compute

Trained on Apple M4 Pro (T6041), 24 GB Unified Memory, 20 GPU cores. PyTorch 2.10.0 MPS backend, BF16.

Limitations

Synthetic training data: SAMSum was constructed by human annotators writing fictional WhatsApp-style dialogues. The model has not been evaluated on real meeting transcripts or audio-derived text.
Two-speaker bias: ~75% of SAMSum examples involve exactly 2 participants. Summarization quality for 3+ speaker conversations is likely lower.
Hallucination: ~10.1% of test summaries contain at least one NER-detected hallucinated entity. The actual hallucination rate is higher for non-entity errors (e.g. fabricated scores, inverted speaker actions).
Speaker attribution errors: ~25% of summaries have at least one speaker attribution mistake (e.g. "X will call Y" when it is Y who called).
Non-commercial only: CC BY-NC-ND 4.0 applies to all outputs.

Citation

@inproceedings{gliwa-etal-2019-samsum,
    title     = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset
                 for Abstractive Summarization",
    author    = "Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej
                 and Wawer, Aleksander",
    booktitle = "Proceedings of the 2nd Workshop on New Frontiers in
                 Summarization",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    doi       = "10.18653/v1/D19-5409",
}

How to Push to HuggingFace Hub

# 1. Log in
huggingface-cli login

# 2. Create the repository (replace <username>)
huggingface-cli repo create bart-base-samsum-summarizer --type model

# 3. Push model weights + tokenizer
python3 - <<'EOF'
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_path = "models/best/facebook_bart-base_with_speakers"
repo_id    = "your-hf-username/bart-base-samsum-summarizer"   # ← replace

tok = AutoTokenizer.from_pretrained(model_path)
mdl = AutoModelForSeq2SeqLM.from_pretrained(model_path, dtype=torch.bfloat16)

tok.push_to_hub(repo_id)
mdl.push_to_hub(repo_id)
print(f"✅ Pushed to https://huggingface.co/{repo_id}")
EOF

# 4. Push model card
huggingface-cli upload your-hf-username/bart-base-samsum-summarizer \
    model_card.md README.md

# 5. Verify
huggingface-cli whoami
# → Opens https://huggingface.co/your-hf-username/bart-base-samsum-summarizer

Note: Do NOT push models/best/ to GitHub — model weights belong on the HuggingFace Hub only. The .gitignore should already exclude models/.

Downloads last month: 2

Safetensors

Model size

60.5M params

Tensor type

F32

Dataset used to train saione/meeting-summarizer-dev

Evaluation results

ROUGE-1 (D27 beam=5, lp=1.33) on SAMSum
test set self-reported

48.480
ROUGE-2 (D27 beam=5, lp=1.33) on SAMSum
test set self-reported

23.550
ROUGE-L (D27 beam=5, lp=1.33) on SAMSum
test set self-reported

40.120