MultiEvalSumViet2: Multi-Criteria Evaluation and Reward Modeling for Vietnamese Summarization

MultiEvalSumViet2 is a Vietnamese-native learned evaluator that scores a candidate summary given its source document on three criteria:

  • Faithfulness (F): factual consistency w.r.t. the source document
  • Coherence (C): readability and logical flow
  • Relevance (R): topical alignment and coverage of key information

The model outputs criterion-wise scores in [0, 1] for a (document, summary) pair and is designed to be used as:

  • an automatic Vietnamese summarization evaluator,
  • a scorer for dataset curation / preference construction,
  • a reward model in PPO/GRPO-style optimization.

What’s in this repository

This repo contains:

  • Backbone encoder weights (config.json, model.safetensors, tokenizer files)
  • Lightweight heads: trunk.pt, head_faith.pt, head_coh.pt, head_rel.pt
  • Configs: arch_config.json, training_args.json, loss_config.json, package_versions.json
  • Inference helpers: modeling_summary_evaluator.py (recommended loader + pair-encoding)

Output format

Given (doc, summary), the model returns:

  • pred_faith ∈ [0, 1]
  • pred_coherence ∈ [0, 1]
  • pred_relevance ∈ [0, 1]

Optional mappings:

  • Likert 1–5: score_1to5 = 4 * score_0to1 + 1
  • Aggregate (paper default):
    pred_overall = 0.5*pred_faith + 0.3*pred_relevance + 0.2*pred_coherence
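Both mappings are simple linear transforms; the helper names below are illustrative, not part of the repo's API:

```python
def to_1to5(score_0to1: float) -> float:
    """Map a [0, 1] criterion score onto the 1-5 Likert scale."""
    return 4.0 * score_0to1 + 1.0

def overall(faith: float, coherence: float, relevance: float) -> float:
    """Paper-default weighted aggregate (faithfulness-heavy)."""
    return 0.5 * faith + 0.3 * relevance + 0.2 * coherence

print(to_1to5(0.75))           # 4.0
print(overall(0.8, 0.6, 0.7))  # ~0.73
```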

Reproducibility-critical tokenization policy

To match training-time preprocessing, MultiEvalSumViet2 uses:

  • Summary pre-trim (default SUM_MAX_LEN = 256)
  • Pair encoding (doc, summary) with MAX_LEN = 512
  • truncation="only_first" so truncation affects only the document, never the pre-trimmed summary

The recommended encode_pair() helper in modeling_summary_evaluator.py implements the same policy.
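As a rough illustration of this policy (prefer the repo's own encode_pair(); the sketch below is an assumption about its behavior, and decoding trimmed token IDs back to text is a simplification):

```python
SUM_MAX_LEN = 256  # summary pre-trim budget, in tokens
MAX_LEN = 512      # joint (doc, summary) budget

def encode_pair_sketch(tokenizer, docs, sums):
    """Sketch of the training-matched pair encoding policy."""
    # 1) Pre-trim each summary to at most SUM_MAX_LEN tokens.
    trimmed = [
        tokenizer.decode(
            tokenizer(s, truncation=True, max_length=SUM_MAX_LEN,
                      add_special_tokens=False)["input_ids"]
        )
        for s in sums
    ]
    # 2) Pair-encode; "only_first" truncates the document, never the summary.
    return tokenizer(docs, trimmed, truncation="only_first",
                     max_length=MAX_LEN, padding=False)
```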


Quickstart (minimal usage)

Install

pip install -U torch transformers huggingface_hub numpy

Score a few pairs (minimal batch)

import os
import numpy as np
import torch
import importlib.util
from huggingface_hub import snapshot_download
from transformers import DataCollatorWithPadding

REPO_ID = "phuongntc/Multi_EvalSumViet2"
DEVICE  = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Download repo snapshot
repo_dir = snapshot_download(repo_id=REPO_ID, repo_type="model")

# 2) Import repo inference helper
loader_path = os.path.join(repo_dir, "modeling_summary_evaluator.py")
spec = importlib.util.spec_from_file_location("mse", loader_path)
mse = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mse)  # type: ignore

# 3) Load model + tokenizer
model, tokenizer, _ = mse.load_for_inference(repo_dir, device=DEVICE)
model.eval()

collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, pad_to_multiple_of=8)

@torch.inference_mode()
def score_pairs(docs, sums, batch_size=8):
    outs = []
    for i in range(0, len(docs), batch_size):
        d = docs[i:i+batch_size]
        s = sums[i:i+batch_size]

        enc = mse.encode_pair(tokenizer, d, s)  # training-matched encoding
        features = [{k: enc[k][j] for k in enc.keys()} for j in range(len(d))]
        batch = collator(features)

        y = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE))  # [B,3]
        outs.append(y.detach().cpu().numpy())

    y = np.clip(np.vstack(outs), 0.0, 1.0)
    return y  # columns: [faith, coherence, relevance]

docs = ["Văn bản gốc ..."]
sums = ["Bản tóm tắt ..."]

scores = score_pairs(docs, sums)
faith, coh, rel = scores[0].tolist()
print({"faith": faith, "coherence": coh, "relevance": rel})

Batch scoring (CSV/XLSX)

Input

A CSV/XLSX file with two required columns:

  • doc
  • summary

Output

An XLSX file adding:

  • pred_faith, pred_coherence, pred_relevance (0–1)
  • optional: pred_*_1to5
  • optional: pred_overall, pred_overall_1to5

Ready-to-run script

Copy this into a file like examples/score_batch_xlsx.py (recommended), or run directly in a notebook.

# pip install -U torch transformers huggingface_hub numpy pandas openpyxl tqdm

import os, math, importlib.util
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from huggingface_hub import snapshot_download
from transformers import DataCollatorWithPadding

REPO_ID     = "phuongntc/Multi_EvalSumViet2"
INPUT_FILE  = "/content/test.xlsx"            # .xlsx or .csv with columns: doc, summary
OUTPUT_XLSX = "/content/output_scored.xlsx"

BATCH_SIZE  = 8
DEVICE      = "cuda" if torch.cuda.is_available() else "cpu"

repo_dir = snapshot_download(repo_id=REPO_ID, repo_type="model")

loader_path = os.path.join(repo_dir, "modeling_summary_evaluator.py")
spec = importlib.util.spec_from_file_location("mse", loader_path)
mse = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mse)  # type: ignore

model, tokenizer, _ = mse.load_for_inference(repo_dir, device=DEVICE)
model.eval()

# Load data
df = pd.read_excel(INPUT_FILE) if INPUT_FILE.lower().endswith((".xlsx",".xls",".xlsm")) else pd.read_csv(INPUT_FILE)
df.columns = [c.strip() for c in df.columns]
for c in ["doc", "summary"]:
    if c not in df.columns:
        raise ValueError(f"Missing required column: {c}")
df = df.dropna(subset=["doc","summary"]).reset_index(drop=True)

collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, pad_to_multiple_of=8)

def iter_batches(n, bs):
    for i in range(0, n, bs):
        yield i, min(i+bs, n)

preds = []
with torch.inference_mode():
    for a, b in tqdm(iter_batches(len(df), BATCH_SIZE), total=math.ceil(len(df)/BATCH_SIZE), desc="Scoring"):
        docs = df.loc[a:b-1, "doc"].astype(str).tolist()
        sums = df.loc[a:b-1, "summary"].astype(str).tolist()

        enc = mse.encode_pair(tokenizer, docs, sums)  # training-matched encoding
        features = [{k: enc[k][i] for k in enc.keys()} for i in range(len(docs))]
        batch = collator(features)

        y = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE))  # [B,3]
        preds.append(y.detach().cpu().numpy())

preds = np.clip(np.vstack(preds), 0.0, 1.0)

out = df.copy()
out["pred_faith"]     = preds[:, 0]
out["pred_coherence"] = preds[:, 1]
out["pred_relevance"] = preds[:, 2]

# Optional: map to 1–5
out["pred_faith_1to5"]     = 4.0 * out["pred_faith"]     + 1.0
out["pred_coherence_1to5"] = 4.0 * out["pred_coherence"] + 1.0
out["pred_relevance_1to5"] = 4.0 * out["pred_relevance"] + 1.0

# Optional: aggregate score
out["pred_overall"]      = 0.5*out["pred_faith"] + 0.3*out["pred_relevance"] + 0.2*out["pred_coherence"]
out["pred_overall_1to5"] = 4.0*out["pred_overall"] + 1.0

out.to_excel(OUTPUT_XLSX, index=False)
print("Saved:", OUTPUT_XLSX)

Model description (paper-aligned)

Architecture

Vietnamese encoder backbone → masked mean pooling → shared MLP trunk → three regression heads (F/C/R).
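A minimal sketch of the pooling-and-heads stage described above (hidden sizes, module names, and the sigmoid output squash are assumptions; the repo's actual heads live in trunk.pt and head_*.pt):

```python
import torch
import torch.nn as nn

class ScoreHeads(nn.Module):
    """Masked mean pooling -> shared MLP trunk -> three regression heads (F/C/R)."""
    def __init__(self, hidden=768, trunk=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(hidden, trunk), nn.GELU())
        self.head_faith = nn.Linear(trunk, 1)
        self.head_coh   = nn.Linear(trunk, 1)
        self.head_rel   = nn.Linear(trunk, 1)

    def forward(self, token_states, attention_mask):
        # Masked mean pooling over encoder token states [B, T, H].
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        z = self.trunk(pooled)
        scores = torch.cat(
            [self.head_faith(z), self.head_coh(z), self.head_rel(z)], dim=-1
        )
        return torch.sigmoid(scores)  # [B, 3], each score in [0, 1]
```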

Training objective

Hybrid training:

  • multi-task regression (calibrated absolute scoring),
  • intra-document pairwise ranking (preserve within-document preferences among multiple candidate summaries).
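Assuming the regression term is MSE and the ranking term is a margin loss over same-document candidate pairs (the actual losses and weights are in loss_config.json), the hybrid objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, better_idx, worse_idx, alpha=0.5, margin=0.1):
    """pred, target: [B, 3] criterion scores in [0, 1].
    better_idx / worse_idx: aligned index tensors picking same-document
    candidate pairs where the first should outrank the second."""
    # Multi-task regression: calibrated absolute scoring on all criteria.
    reg = F.mse_loss(pred, target)
    # Intra-document pairwise ranking on the mean criterion score.
    s_better = pred[better_idx].mean(-1)
    s_worse  = pred[worse_idx].mean(-1)
    rank = F.relu(margin - (s_better - s_worse)).mean()
    return reg + alpha * rank
```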

Data & supervision

We construct a calibrated dataset of 80,856 labeled (document, summary) pairs from 13,476 Vietnamese news articles (2022–2024) using an LLM-assisted annotation pipeline with human verification, and normalize criterion-wise scores to [0,1].


Intended use

  • Vietnamese summarization evaluation beyond lexical overlap metrics
  • Preference dataset construction from multiple candidate summaries per document
  • Reward modeling for PPO/GRPO-style fine-tuning and iterative data curation

Reproducibility & version pinning

Pin a specific revision/commit when reproducing paper results:

from huggingface_hub import snapshot_download
repo_dir = snapshot_download(
    repo_id="phuongntc/Multi_EvalSumViet2",
    repo_type="model",
    revision="<COMMIT_HASH_OR_TAG>"
)

Repository license (code/model files)

This Hugging Face repository is released under Apache License 2.0 (apache-2.0).


Citation

Model DOI (recommended)


  • DOI: 10.57967/hf/7956

BibTeX (model)

@misc{multievalsumviet2_model,
  title        = {MultiEvalSumViet2: Vietnamese Multi-Criteria Summary Evaluator},
  author       = {Thu Phuong Tran Thi},
  year         = {2026},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/phuongntc/Multi_EvalSumViet2},
  doi          = {10.57967/hf/7956}
}

Contact

  • Maintainer: Thu Phuong Tran Thi
  • Affiliation: Hanoi Metropolitan University; VNU University of Engineering and Technology, Vietnam National University
  • Email: tttphuong2@daihocthudo.edu.vn
  • Please use the Hugging Face Community tab for questions and bug reports.


Model size: 0.2B parameters · Tensor type: F32 (Safetensors)