bart-base-samsum-summarizer

facebook/bart-base fine-tuned on the SAMSum dialogue summarization corpus.

Note: Front-matter ROUGE scores reflect the champion decoding config (D27: beam=5, length_penalty=1.33). Default generation config (beam=4, lp=1.0) yields ROUGE-1=47.86, ROUGE-2=23.22, ROUGE-L=39.85.

⚠️ License: SAMSum is released under CC BY-NC-ND 4.0 (non-commercial, no derivatives). This model card, the model weights, and any outputs produced with them are subject to the same terms. Commercial use is prohibited.


Model Description

Field Value
Base model facebook/bart-base (139M parameters)
Task Abstractive dialogue summarization
Language English
License cc-by-nc-nd-4.0
Dataset SAMSum (knkarthick/samsum)
Hardware trained on Apple M4 Pro, 24 GB UMA, MPS / BF16

Intended Use

  • Intended use: Summarizing short chat conversations (≀ 512 tokens) into 1–3 sentence abstractive summaries.
  • Out-of-scope: Real-time transcription, audio processing, multi-lingual dialogues, or any commercial product.
  • Not recommended for: Mission-critical applications where hallucinations cannot be tolerated. The model hallucinates entity-level details in ~10% of test examples.

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_id = "your-hf-username/bart-base-samsum-summarizer"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model     = AutoModelForSeq2SeqLM.from_pretrained(model_id, dtype=torch.bfloat16)
model.eval()

dialogue = """
Amanda: I baked cookies. Do you want some?
Jerry: Sure!
Amanda: I'll bring you tomorrow :-)
Jerry: Thanks! Do you know how to make the lemon ones?
Amanda: The biscuits? I'll send you the recipe. It's easy!
""".strip()

inputs = tokenizer(dialogue, return_tensors="pt", max_length=512, truncation=True)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens = 128,
        num_beams      = 5,
        length_penalty = 1.33,  # D27 champion config (ROUGE-L 40.12)
        early_stopping = True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))
# β†’ "Amanda will bring Jerry some cookies tomorrow and send him the recipe."

Performance

All metrics are macro-averaged ROUGE F-measures Γ— 100 on the 819-sample SAMSum test set.

Test-Set ROUGE

Metric Value
ROUGE-1 48.48
ROUGE-2 23.55
ROUGE-L 40.12 (champion: D27 beam=5, lp=1.33)
ROUGE-L (training config: beam=4, lp=1.0) 39.92

Comparison: Fine-Tuned vs Zero-Shot

ROUGE-L
BART-base zero-shot (100 samples) 19.89
BART-base fine-tuned (819 samples) 40.12 (+20.23)

Decoding Strategy Ablation (11 configs)

Config ROUGE-L Avg tokens ms/sample
D1: beam=4, lp=0.8 39.49 15.2 138
D2: beam=4, lp=1.0 39.92 15.9 136
D3: beam=4, lp=1.2 39.97 16.7 136
D4: beam=8, lp=1.0 39.74 15.8 220
D5: nucleus p=0.9 35.93 18.8 92
D6: beam=4, lp=1.4 39.94 17.3 142
D7: beam=4, lp=1.25 40.01 16.8 136
D8: beam=4, lp=1.3 40.01 17.0 137
D9: beam=4, lp=1.2, nrng=3 39.97 16.7 136
D10: beam=6, lp=1.2 40.03 16.7 178
D11: beam=4, lp=1.2, min_len=5 39.97 16.7 136

Full 29-config sweep results in results/metrics/decoding_D*.json. Champion: D27 (beam=5, lp=1.33) at ROUGE-L 40.12 β€” see docs/EXPERIMENTS.md for complete E3 table.

Faithfulness Metrics

Metric Value
Hallucination rate (spaCy NER) 10.1% (83 / 819)
Speaker preservation 75.5%
NLI faithfulness (DeBERTa-v3) 0.308
Length–ROUGE-L Pearson r βˆ’0.25

LoRA Parameter-Efficient Fine-Tuning

Model ROUGE-1 ROUGE-2 ROUGE-L Trainable params
BART-base (full fine-tune) 48.04 23.33 39.92 139.4M (100%)
BART-base (LoRA r=16, Ξ±=32) 45.15 21.20 37.59 0.88M (0.63%)

LoRA achieves 94.2% of full fine-tune ROUGE-L with only 0.63% trainable parameters.

PEGASUS Cross-Domain Transfer

Condition ROUGE-1 ROUGE-2 ROUGE-L Notes
Zero-shot 1.85 0.00 1.60 news β†’ dialogue domain mismatch
Fine-tuned 1.65 0.00 1.56 Convergence failure (see below)

Training failure: gradient_accumulation_steps=8 on MPS caused 8Γ— gradient inflation (effective lr=1.6e-4). eval_loss=9.601 at epoch 3 β‰ˆ random baseline. Fixed in script (grad_accum=1); ROUGE-L 40–44 expected on re-run.

Extended Training (E8 β€” 8 epochs, cosine LR, lr=3e-5)

Condition ROUGE-1 ROUGE-2 ROUGE-L Train time Notes
Baseline (5ep, lr=5e-5) 47.86 23.22 39.85 168.4 min E1 result
Extended (8ep, lr=3e-5, cosine) 46.45 22.05 38.46 259.6 min Best epoch 4

Finding: Ξ” ROUGE-L = βˆ’1.39. Lower peak LR caused underfitting; baseline with lr=5e-5 linear decay converges to a better optimum. Hypothesis not supported.


Training Procedure

Dataset

  • Train: 14,731 examples
  • Validation: 818 examples
  • Test: 819 examples
  • Variant used: with_speakers β€” speaker attribution tags (Name: ) preserved. Ablation shows this contributes +6.62 ROUGE-L vs stripping tags.

Preprocessing

Dialogues are tokenized with AutoTokenizer from facebook/bart-base. max_source_length=512, max_target_length=128 (covers 99%+ of SAMSum examples at these lengths). No task prefix (BART does not require one; T5 uses "summarize: ").

Hyperparameters

Parameter Value
Base model facebook/bart-base
Optimizer AdamW
Learning rate 5.0 Γ— 10⁻⁡
LR schedule Linear decay
Warmup steps 500
Weight decay 0.01
Batch size 8
Max epochs 5
Early stopping patience 2
Gradient clip norm 1.0
Precision BF16
Best epoch 5
Best val ROUGE-L 41.57
Training time 72.4 min (M4 Pro MPS)

Compute

Trained on Apple M4 Pro (T6041), 24 GB Unified Memory, 20 GPU cores. PyTorch 2.10.0 MPS backend, BF16.


Limitations

  • Synthetic training data: SAMSum was constructed by human annotators writing fictional WhatsApp-style dialogues. The model has not been evaluated on real meeting transcripts or audio-derived text.
  • Two-speaker bias: ~75% of SAMSum examples involve exactly 2 participants. Summarization quality for 3+ speaker conversations is likely lower.
  • Hallucination: ~10.1% of test summaries contain at least one NER-detected hallucinated entity. The actual hallucination rate is higher for non-entity errors (e.g. fabricated scores, inverted speaker actions).
  • Speaker attribution errors: ~25% of summaries have at least one speaker attribution mistake (e.g. "X will call Y" when it is Y who called).
  • Non-commercial only: CC BY-NC-ND 4.0 applies to all outputs.

Citation

@inproceedings{gliwa-etal-2019-samsum,
    title     = "{SAMS}um Corpus: A Human-annotated Dialogue Dataset
                 for Abstractive Summarization",
    author    = "Gliwa, Bogdan and Mochol, Iwona and Biesek, Maciej
                 and Wawer, Aleksander",
    booktitle = "Proceedings of the 2nd Workshop on New Frontiers in
                 Summarization",
    year      = "2019",
    publisher = "Association for Computational Linguistics",
    doi       = "10.18653/v1/D19-5409",
}

How to Push to HuggingFace Hub

# 1. Log in
huggingface-cli login

# 2. Create the repository (replace <username>)
huggingface-cli repo create bart-base-samsum-summarizer --type model

# 3. Push model weights + tokenizer
python3 - <<'EOF'
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_path = "models/best/facebook_bart-base_with_speakers"
repo_id    = "your-hf-username/bart-base-samsum-summarizer"   # ← replace

tok = AutoTokenizer.from_pretrained(model_path)
mdl = AutoModelForSeq2SeqLM.from_pretrained(model_path, dtype=torch.bfloat16)

tok.push_to_hub(repo_id)
mdl.push_to_hub(repo_id)
print(f"βœ… Pushed to https://huggingface.co/{repo_id}")
EOF

# 4. Push model card
huggingface-cli upload your-hf-username/bart-base-samsum-summarizer \
    model_card.md README.md

# 5. Verify
huggingface-cli whoami
# β†’ Opens https://huggingface.co/your-hf-username/bart-base-samsum-summarizer

Note: Do NOT push models/best/ to GitHub β€” model weights belong on the HuggingFace Hub only. The .gitignore should already exclude models/.

Downloads last month
13
Safetensors
Model size
60.5M params
Tensor type
F32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Dataset used to train saione/meeting-summarizer-dev

Evaluation results