Evaluation

Metric Benchmarks

Aggregate performance data based on the ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scoring system.

ROUGE Metric Comparison

Understanding the Metrics

ROUGE-1

Measures the overlap of unigrams (single words) between the generated summary and the reference text. A high score indicates that the summary covers most of the reference's vocabulary, although it says nothing about word order.
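
As a rough illustration of the unigram overlap described above, the following is a minimal sketch, not the scoring code behind these benchmarks. The function name `rouge_1` and the lowercase, whitespace-split tokenization are assumptions made for the example; production scorers usually add stemming and more careful tokenization.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram-overlap precision, recall, and F1."""
    # Assumption: lowercasing and splitting on whitespace stands in for
    # a real tokenizer.
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())

    # Clipped overlap: a word is credited at most as many times as it
    # appears in both texts (Counter intersection takes the minimum).
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Five of the six words match ("sat" vs "lay" is the only difference).
scores = rouge_1("the cat sat on the mat", "the cat lay on the mat")
print(scores)
```

Here precision and recall are both 5/6, because the two sentences have the same length and share five word occurrences.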

ROUGE-2

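Measures the overlap of bigrams (pairs of consecutive words). Because a bigram matches only when two adjacent words appear in the same order in both texts, this score is a useful proxy for how closely the summary's phrasing follows the reference. It does not measure fluency directly.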
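
The bigram case can be sketched in the same spirit. This is a simplified illustration under stated assumptions: the helper names `bigrams` and `rouge_2` are hypothetical, and tokenization is plain lowercase whitespace splitting.

```python
from collections import Counter


def bigrams(text: str) -> Counter:
    """Count consecutive word pairs (assumes whitespace tokenization)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge_2(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-2: bigram-overlap precision, recall, and F1."""
    cand, ref = bigrams(candidate), bigrams(reference)

    # Clipped count of bigrams that occur in both texts.
    overlap = sum((cand & ref).values())

    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# One substituted word ("sat" vs "lay") breaks the two bigrams around it,
# leaving 3 of 5 bigrams in common.
scores_2 = rouge_2("the cat sat on the mat", "the cat lay on the mat")
print(scores_2)
```

A single substituted word removes two of the five bigrams, so precision and recall are both 3/5. This sensitivity to local word order is why bigram overlap tends to be lower, and harder to raise, than unigram overlap.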

ROUGE-L

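Based on the Longest Common Subsequence (LCS) between the generated summary and the reference text. Matching words must appear in the same order but need not be adjacent, so ROUGE-L rewards sentence-level structure and sequential flow that fixed-length n-gram overlap can miss.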
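
To make the subsequence idea concrete, here is a minimal sketch using the standard dynamic-programming LCS. The names `lcs_length` and `rouge_l` are hypothetical, tokenization is lowercase whitespace splitting, and the balanced F1 is an assumption; some scorers weight recall more heavily.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    # Row-by-row dynamic programming; only the previous row is kept.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-L: LCS-based precision, recall, and balanced F1."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)

    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# The inserted word "quietly" interrupts contiguous matches, but all six
# reference words still appear in order, so the LCS has length 6.
scores_l = rouge_l("the cat quietly sat on the mat", "the cat sat on the mat")
print(scores_l)
```

In this example recall is 1.0 and precision is 6/7: the extra word costs precision only, and the in-order match is otherwise preserved.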

MODEL INSIGHT

"BART and PEGASUS typically outperform TextRank in ROUGE-2 and ROUGE-L as they generate fluent, abstractive prose rather than just extracting source fragments."