MultiEvalSumViet2: Multi-Criteria Evaluation and Reward Modeling for Vietnamese Summarization

MultiEvalSumViet2 is a Vietnamese-native learned evaluator that scores a candidate summary given its source document on three criteria:

  • Faithfulness (F): factual consistency w.r.t. the source document
  • Coherence (C): readability and logical flow
  • Relevance (R): topical alignment and coverage of key information

The model outputs criterion-wise scores in [0, 1] for a (document, summary) pair and is designed to be used as:

  • an automatic Vietnamese summarization evaluator,
  • a scorer for dataset curation / preference construction,
  • a reward model in PPO/GRPO-style optimization.

What’s in this repository

This repo contains:

  • Backbone encoder weights (config.json, model.safetensors, tokenizer files)
  • Lightweight heads: trunk.pt, head_faith.pt, head_coh.pt, head_rel.pt
  • Configs: arch_config.json, training_args.json, loss_config.json, package_versions.json
  • Inference helpers: modeling_summary_evaluator.py (recommended loader + pair-encoding)

Output format

Given (doc, summary), the model returns:

  • pred_faith ∈ [0, 1]
  • pred_coherence ∈ [0, 1]
  • pred_relevance ∈ [0, 1]

Optional mappings:

  • Likert 1–5: score_1to5 = 4 * score_0to1 + 1
  • Aggregate (paper default):
    pred_overall = 0.5*pred_faith + 0.3*pred_relevance + 0.2*pred_coherence
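Both mappings are simple linear transforms; the helper names below are illustrative, not part of the repo's API:

```python
def to_1to5(score_0to1: float) -> float:
    """Map a [0, 1] criterion score onto the 1-5 Likert scale."""
    return 4.0 * score_0to1 + 1.0

def overall(faith: float, coherence: float, relevance: float) -> float:
    """Paper-default weighted aggregate (faithfulness-heavy)."""
    return 0.5 * faith + 0.3 * relevance + 0.2 * coherence

print(to_1to5(0.75))           # 4.0
print(overall(0.8, 0.6, 0.7))  # ~0.73
```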

Reproducibility-critical tokenization policy

To match training-time preprocessing, MultiEvalSumViet2 uses:

  • Summary pre-trim (default SUM_MAX_LEN = 256)
  • Pair encoding (doc, summary) with MAX_LEN = 512
  • truncation="only_first" so truncation affects only the document, never the pre-trimmed summary

The recommended encode_pair() helper in modeling_summary_evaluator.py implements the same policy.
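As a rough illustration of this policy (prefer the repo's own encode_pair(); the sketch below is an assumption about its behavior, and decoding trimmed token IDs back to text is a simplification):

```python
SUM_MAX_LEN = 256  # summary pre-trim budget, in tokens
MAX_LEN = 512      # joint (doc, summary) budget

def encode_pair_sketch(tokenizer, docs, sums):
    """Sketch of the training-matched pair encoding policy."""
    # 1) Pre-trim each summary to at most SUM_MAX_LEN tokens.
    trimmed = [
        tokenizer.decode(
            tokenizer(s, truncation=True, max_length=SUM_MAX_LEN,
                      add_special_tokens=False)["input_ids"]
        )
        for s in sums
    ]
    # 2) Pair-encode; "only_first" truncates the document, never the summary.
    return tokenizer(docs, trimmed, truncation="only_first",
                     max_length=MAX_LEN, padding=False)
```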


Quickstart (minimal usage)

Install

pip install -U torch transformers huggingface_hub numpy

Score a few pairs (minimal batch)

import os
import numpy as np
import torch
import importlib.util
from huggingface_hub import snapshot_download
from transformers import DataCollatorWithPadding

REPO_ID = "phuongntc/Multi_EvalSumViet2"
DEVICE  = "cuda" if torch.cuda.is_available() else "cpu"

# 1) Download repo snapshot
repo_dir = snapshot_download(repo_id=REPO_ID, repo_type="model")

# 2) Import repo inference helper
loader_path = os.path.join(repo_dir, "modeling_summary_evaluator.py")
spec = importlib.util.spec_from_file_location("mse", loader_path)
mse = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mse)  # type: ignore

# 3) Load model + tokenizer
model, tokenizer, _ = mse.load_for_inference(repo_dir, device=DEVICE)
model.eval()

collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, pad_to_multiple_of=8)

@torch.inference_mode()
def score_pairs(docs, sums, batch_size=8):
    outs = []
    for i in range(0, len(docs), batch_size):
        d = docs[i:i+batch_size]
        s = sums[i:i+batch_size]

        enc = mse.encode_pair(tokenizer, d, s)  # training-matched encoding
        features = [{k: enc[k][j] for k in enc.keys()} for j in range(len(d))]
        batch = collator(features)

        y = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE))  # [B,3]
        outs.append(y.detach().cpu().numpy())

    y = np.clip(np.vstack(outs), 0.0, 1.0)
    return y  # columns: [faith, coherence, relevance]

docs = ["Văn bản gốc ..."]
sums = ["Bản tóm tắt ..."]

scores = score_pairs(docs, sums)
faith, coh, rel = scores[0].tolist()
print({"faith": faith, "coherence": coh, "relevance": rel})

Batch scoring (CSV/XLSX)

Input

A CSV/XLSX file with two required columns:

  • doc
  • summary

Output

An XLSX file adding:

  • pred_faith, pred_coherence, pred_relevance (0–1)
  • optional: pred_*_1to5
  • optional: pred_overall, pred_overall_1to5

Ready-to-run script

Copy this into a file like examples/score_batch_xlsx.py (recommended), or run directly in a notebook.

# pip install -U torch transformers huggingface_hub numpy pandas openpyxl tqdm

import os, math, importlib.util
import numpy as np
import pandas as pd
import torch
from tqdm import tqdm
from huggingface_hub import snapshot_download
from transformers import DataCollatorWithPadding

REPO_ID     = "phuongntc/Multi_EvalSumViet2"
INPUT_FILE  = "/content/test.xlsx"            # .xlsx or .csv with columns: doc, summary
OUTPUT_XLSX = "/content/output_scored.xlsx"

BATCH_SIZE  = 8
DEVICE      = "cuda" if torch.cuda.is_available() else "cpu"

repo_dir = snapshot_download(repo_id=REPO_ID, repo_type="model")

loader_path = os.path.join(repo_dir, "modeling_summary_evaluator.py")
spec = importlib.util.spec_from_file_location("mse", loader_path)
mse = importlib.util.module_from_spec(spec)
spec.loader.exec_module(mse)  # type: ignore

model, tokenizer, _ = mse.load_for_inference(repo_dir, device=DEVICE)
model.eval()

# Load data
df = pd.read_excel(INPUT_FILE) if INPUT_FILE.lower().endswith((".xlsx",".xls",".xlsm")) else pd.read_csv(INPUT_FILE)
df.columns = [c.strip() for c in df.columns]
for c in ["doc", "summary"]:
    if c not in df.columns:
        raise ValueError(f"Missing required column: {c}")
df = df.dropna(subset=["doc","summary"]).reset_index(drop=True)

collator = DataCollatorWithPadding(tokenizer=tokenizer, padding=True, pad_to_multiple_of=8)

def iter_batches(n, bs):
    for i in range(0, n, bs):
        yield i, min(i+bs, n)

preds = []
with torch.inference_mode():
    for a, b in tqdm(iter_batches(len(df), BATCH_SIZE), total=math.ceil(len(df)/BATCH_SIZE), desc="Scoring"):
        docs = df.loc[a:b-1, "doc"].astype(str).tolist()
        sums = df.loc[a:b-1, "summary"].astype(str).tolist()

        enc = mse.encode_pair(tokenizer, docs, sums)  # training-matched encoding
        features = [{k: enc[k][i] for k in enc.keys()} for i in range(len(docs))]
        batch = collator(features)

        y = model(batch["input_ids"].to(DEVICE), batch["attention_mask"].to(DEVICE))  # [B,3]
        preds.append(y.detach().cpu().numpy())

preds = np.clip(np.vstack(preds), 0.0, 1.0)

out = df.copy()
out["pred_faith"]     = preds[:, 0]
out["pred_coherence"] = preds[:, 1]
out["pred_relevance"] = preds[:, 2]

# Optional: map to 1–5
out["pred_faith_1to5"]     = 4.0 * out["pred_faith"]     + 1.0
out["pred_coherence_1to5"] = 4.0 * out["pred_coherence"] + 1.0
out["pred_relevance_1to5"] = 4.0 * out["pred_relevance"] + 1.0

# Optional: aggregate score
out["pred_overall"]      = 0.5*out["pred_faith"] + 0.3*out["pred_relevance"] + 0.2*out["pred_coherence"]
out["pred_overall_1to5"] = 4.0*out["pred_overall"] + 1.0

out.to_excel(OUTPUT_XLSX, index=False)
print("Saved:", OUTPUT_XLSX)

Model description (paper-aligned)

Architecture

Vietnamese encoder backbone → masked mean pooling → shared MLP trunk → three regression heads (F/C/R).
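A minimal sketch of the pooling-and-heads stage described above (hidden sizes, module names, and the sigmoid output squash are assumptions; the repo's actual heads live in trunk.pt and head_*.pt):

```python
import torch
import torch.nn as nn

class ScoreHeads(nn.Module):
    """Masked mean pooling -> shared MLP trunk -> three regression heads (F/C/R)."""
    def __init__(self, hidden=768, trunk=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(hidden, trunk), nn.GELU())
        self.head_faith = nn.Linear(trunk, 1)
        self.head_coh   = nn.Linear(trunk, 1)
        self.head_rel   = nn.Linear(trunk, 1)

    def forward(self, token_states, attention_mask):
        # Masked mean pooling over encoder token states [B, T, H].
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (token_states * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        z = self.trunk(pooled)
        scores = torch.cat(
            [self.head_faith(z), self.head_coh(z), self.head_rel(z)], dim=-1
        )
        return torch.sigmoid(scores)  # [B, 3], each score in [0, 1]
```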

Training objective

Hybrid training:

  • multi-task regression (calibrated absolute scoring),
  • intra-document pairwise ranking (preserve within-document preferences among multiple candidate summaries).
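Assuming the regression term is MSE and the ranking term is a margin loss over same-document candidate pairs (the actual losses and weights are in loss_config.json), the hybrid objective can be sketched as:

```python
import torch
import torch.nn.functional as F

def hybrid_loss(pred, target, better_idx, worse_idx, alpha=0.5, margin=0.1):
    """pred, target: [B, 3] criterion scores in [0, 1].
    better_idx / worse_idx: aligned index tensors picking same-document
    candidate pairs where the first should outrank the second."""
    # Multi-task regression: calibrated absolute scoring on all criteria.
    reg = F.mse_loss(pred, target)
    # Intra-document pairwise ranking on the mean criterion score.
    s_better = pred[better_idx].mean(-1)
    s_worse  = pred[worse_idx].mean(-1)
    rank = F.relu(margin - (s_better - s_worse)).mean()
    return reg + alpha * rank
```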

Data & supervision

We construct a calibrated dataset of 80,856 labeled (document, summary) pairs from 13,476 Vietnamese news articles (2022–2024) using an LLM-assisted annotation pipeline with human verification, and normalize criterion-wise scores to [0,1].


Intended use

  • Vietnamese summarization evaluation beyond lexical overlap metrics
  • Preference dataset construction from multiple candidate summaries per document
  • Reward modeling for PPO/GRPO-style fine-tuning and iterative data curation

Reproducibility & version pinning

Pin a specific revision/commit when reproducing paper results:

from huggingface_hub import snapshot_download
repo_dir = snapshot_download(
    repo_id="phuongntc/Multi_EvalSumViet2",
    repo_type="model",
    revision="<COMMIT_HASH_OR_TAG>"
)

Repository license (code/model files)

This Hugging Face repository is released under Apache License 2.0 (apache-2.0).


Citation

Model DOI (recommended)


  • DOI: 10.57967/hf/7956

BibTeX (model)

@misc{multievalsumviet2_model,
  title        = {MultiEvalSumViet2: Vietnamese Multi-Criteria Summary Evaluator},
  author       = {Thu Phuong Tran Thi},
  year         = {2026},
  howpublished = {Hugging Face Hub},
  url          = {https://huggingface.co/phuongntc/Multi_EvalSumViet2},
  doi          = {10.57967/hf/7956}
}

Contact

  • Maintainer: Thu Phuong Tran Thi
  • Affiliation: Hanoi Metropolitan University; VNU University of Engineering and Technology, Vietnam National University
  • Email: tttphuong2@daihocthudo.edu.vn
  • Please use the Hugging Face Community tab for questions and bug reports.


Model size: 0.2B parameters · Tensor type: F32 (Safetensors)