MEDEA: Multimodal Engagement-Driven Evaluation Architecture

This repository contains the MEDEA model, as described in:

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation

Tianjiao Li*, Kai Zhao*, Xiang Li, Yang Liu, Huyang Sun

ACL 2026

MEDEA is a multimodal large language model trained to assess whether User-Generated Content (UGC) achieves positive community resonance. Unlike traditional video quality assessment methods that focus on aesthetic fidelity, MEDEA evaluates content quality through social-cognitive reasoning — simulating diverse viewer personas and their empathetic reactions before making a quality judgment.

Model Description

Base model: Qwen3-VL-8B-Thinking
Training: Supervised Fine-Tuning + Process-Supervised Reinforcement Learning (GRPO with Social Alignment Reward)
Task: CASTER (Community-Aware Assessment of Social Textual Engagement and Resonance)
Input: Video frames, cover image, title, tags, category metadata, ASR transcripts
Output: Social-CoT reasoning paths + binary quality label (High-Quality / Low-Quality)
Languages: Chinese and English bilingual support

Key Features

Social Chain-of-Thought (Social-CoT): Instantiates diverse viewer personas to simulate authentic community reactions before aggregating them into a final quality judgment.
Social Alignment Reward: RL training ensures generated reasoning paths are grounded in genuine human social cognition rather than generic templates.
Community Perspective-Taking: The model reasons about why content resonates with specific audience segments, providing interpretable and empathetic evaluation traces.

Performance

Results on CASTER-Bench:

Model                       High-Quality F1  Low-Quality F1  Macro F1
MEDEA                       0.650            0.847           0.749
Claude-4.5-opus (Long-CoT)  0.528            0.517           0.522
Qwen3-VL-Plus (Social-CoT)  0.508            0.647           0.578
GPT-5.2                     0.506            0.489           0.498
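
The Macro F1 column is consistent with the unweighted mean of the two per-class F1 scores. The sketch below recomputes it from the reported values; the function name is illustrative and not part of the repository.

```python
def macro_f1(high_f1: float, low_f1: float) -> float:
    """Unweighted mean of the two per-class F1 scores."""
    return (high_f1 + low_f1) / 2

# (High-Quality F1, Low-Quality F1, reported Macro F1), copied from the table.
reported = {
    "MEDEA": (0.650, 0.847, 0.749),
    "Claude-4.5-opus (Long-CoT)": (0.528, 0.517, 0.522),
    "Qwen3-VL-Plus (Social-CoT)": (0.508, 0.647, 0.578),
    "GPT-5.2": (0.506, 0.489, 0.498),
}

for name, (hq, lq, macro) in reported.items():
    # Reported scores are rounded to three decimals, so allow a small tolerance.
    assert abs(macro_f1(hq, lq) - macro) < 1e-3, name
```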

Deployment with vLLM

vllm serve your-org/MEDEA \
    --served-model-name "MEDEA" \
    --max-model-len 20480 \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 0.8 \
    --dtype float16 \
    --port 8006

Inference

Input Format

Prepare a JSON file containing a list of samples, each with input_text and images fields:

{
  "input_text": "<任务定义>\n你是一个B站内容分析专家。请综合给定稿件的视觉特征和文本内容，然后根据视频内容推测评论区针对该视频创作质量可能出现的评论（至少15条）及其倾向，最后总结评论区的总倾向并直接输出结果。\n\n### 输入数据\n1. 封面: 稿件的封面图像\n2. 关键帧: 7张视频的关键帧图像\n3. 标题: {title}\n4. 标签: {tag}\n5. ASR: {asr}\n6. 一级分区: {new_tid_name}\n7. 二级分区: {new_sub_tid_name}\n8. 时长: {duration}\n9. 分辨率: {resolution}\n10. 是否竖屏: {vertical}\n\n### 评论区总倾向判断准则\n1. 模拟的评论必须至少有15条，不能重复，并且明确出现在思考过程中\n2. 假设在模拟的评论中，支持类评论共X条，反对类评论共Y条\n3. 计算 Sigma-标准化差额（Skellam z 分数）z = (X - Y) / sqrt(X + Y)\n4. 如果 z ≥ 1.5 则评论区总倾向为\"支持\"，否则为\"非明确支持\"\n5. 在输出时，z值必须用boxed{}包裹，例如\"z = boxed{-2}\"\n6. 总结时的支持类评论数和反对类评论数需要和思考过程中模拟的评论数严格匹配，不能随意虚构\n\n<当前任务>\n封面: <image>\n关键帧: <image><image><image><image><image><image><image>\n标题: \"给三角洲一点震撼\"\n标签: \"游戏,射击游戏,三角洲行动\"\nASR: \"\"\n一级分区: \"游戏\"\n二级分区: \"射击游戏\"\n时长: \"28秒\"\n分辨率: \"2160x3840\"\n是否竖屏: \"是\"\n\n请直接输出结果（支持/非明确支持），不要包含任何额外解释。",
  "images": [
    "/path/to/cover.jpg",
    "/path/to/frame_0.jpg",
    "/path/to/frame_1.jpg",
    "/path/to/frame_2.jpg",
    "/path/to/frame_3.jpg",
    "/path/to/frame_4.jpg",
    "/path/to/frame_5.jpg",
    "/path/to/frame_6.jpg"
  ]
}

Each <image> placeholder corresponds, in order, to one path in the images list (1 cover + 7 key frames). The {title}, {tag}, and similar placeholders in the task definition are left as literal template text; the actual values are supplied in the <当前任务> / <current_task> section.

MEDEA supports both Chinese and English prompts. Below is an English prompt example:

{
  "input_text": "<task_definition>\nYou are a Bilibili content analysis expert. Please integrate the visual features and text content of a given video, then infer at least 15 possible comments and their sentiment regarding the quality of the video creation that might appear in the comment section. Finally, summarize the overall sentiment of the comment section and output the result directly.\n\n### Input Data\n1. Cover: Cover image of the video\n2. Keyframes: 7 keyframe images from the video\n3. Title: {title}\n4. Tags: {tag}\n5. ASR: {asr}\n6. Primary Category: {new_tid_name}\n7. Secondary Category: {new_sub_tid_name}\n8. Duration: {duration}\n9. Resolution: {resolution}\n10. Is Vertical: {vertical}\n\n### Criteria for Overall Comment Section Sentiment\n1. There must be at least 15 simulated comments, they must not be repetitive, and they must be explicitly shown in the thinking process.\n2. Assume that among the simulated comments, there are X supportive comments and Y opposing comments.\n3. Calculate the Skellam z-score: z = (X - Y) / sqrt(X + Y)\n4. If z ≥ 1.5, the overall comment section sentiment is \"supportive\"; otherwise, it is \"not clearly supportive\".\n5. When outputting, the z-value must be enclosed in boxed{} brackets, e.g., \"z = boxed{-2}\".\n6. The number of supportive and opposing comments in the summary must strictly match the number of simulated comments in the thinking process and not be fabricated.\n\n<current_task>\nCover: <image>\nKeyframes: <image><image><image><image><image><image><image>\nTitle: \"Epic moment in Delta Force\"\nTags: \"gaming,shooter,Delta Force\"\nASR: \"\"\nPrimary Category: \"Gaming\"\nSecondary Category: \"Shooter\"\nDuration: \"28 seconds\"\nResolution: \"2160x3840\"\nIs Vertical: \"Yes\"\n\nPlease output the result directly (supportive/not clearly supportive), without any additional explanation.",
  "images": [
    "/path/to/cover.jpg",
    "/path/to/frame_0.jpg",
    "/path/to/frame_1.jpg",
    "/path/to/frame_2.jpg",
    "/path/to/frame_3.jpg",
    "/path/to/frame_4.jpg",
    "/path/to/frame_5.jpg",
    "/path/to/frame_6.jpg"
  ]
}
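
A sample in this format could be assembled and checked before inference. The sketch below is a minimal illustration, not part of the repository: make_sample is a hypothetical helper, and the prompt text is shortened for readability. It verifies that the number of <image> placeholders matches the number of image paths.

```python
import json

def make_sample(input_text: str, images: list[str]) -> dict:
    """Build one sample dict, checking placeholders against image paths."""
    n_placeholders = input_text.count("<image>")
    if n_placeholders != len(images):
        raise ValueError(
            f"{n_placeholders} <image> placeholders but {len(images)} image paths"
        )
    return {"input_text": input_text, "images": images}

# Shortened prompt for illustration; a real prompt carries the full task definition.
text = (
    "Cover: <image>\n"
    "Keyframes: " + "<image>" * 7 + "\n"
    'Title: "Epic moment in Delta Force"'
)
images = ["/path/to/cover.jpg"] + [f"/path/to/frame_{i}.jpg" for i in range(7)]
sample = make_sample(text, images)

# A JSON file would hold a list of such samples.
print(json.dumps([sample], ensure_ascii=False, indent=2))
```

Failing early on a placeholder mismatch avoids sending a request in which images are silently dropped or attached to the wrong position.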

Batch Inference Script

import base64
import json
import mimetypes
from concurrent.futures import ThreadPoolExecutor, as_completed
from openai import OpenAI
from tqdm import tqdm

client = OpenAI(base_url="http://localhost:8006/v1", api_key="EMPTY")

def file_to_data_url(path):
    mime, _ = mimetypes.guess_type(path)
    mime = mime or "image/jpeg"
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return f"data:{mime};base64,{b64}"

def build_multimodal_content(text, image_paths):
    # Split on <image> placeholders and interleave text segments with images.
    parts = text.split("<image>")
    content = []
    for i, seg in enumerate(parts):
        if seg.strip():
            content.append({"type": "text", "text": seg.strip()})
        # An image follows every segment except the last one.
        if i < len(parts) - 1 and i < len(image_paths):
            content.append({"type": "image_url", "image_url": {"url": file_to_data_url(image_paths[i])}})
    return content

def infer(sample):
    # Each sample follows the input format above: "input_text" plus "images".
    content = build_multimodal_content(sample["input_text"], sample.get("images", []))
    resp = client.chat.completions.create(
        model="MEDEA",
        messages=[{"role": "user", "content": content}],
        temperature=0.6,
        max_tokens=4096,
        top_p=0.95,
    )
    return resp.choices[0].message.content.strip()

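# Batch inference: test_data.json holds a list of samples
with open("test_data.json", encoding="utf-8") as f:
    samples = json.load(f)

with ThreadPoolExecutor(max_workers=32) as ex:
    futures = {ex.submit(infer, s): i for i, s in enumerate(samples)}
    for fut in tqdm(as_completed(futures), total=len(samples)):
        i = futures[fut]
        samples[i]["pred"] = fut.result()

# Write UTF-8 explicitly so Chinese text is saved unescaped on every platform.
with open("results.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)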

Output Format

The model generates structured Social-CoT reasoning with simulated community reactions, followed by a final quality label:

Full example output (Chinese prompt):

<think>
基于给定稿件信息，该视频属于「游戏」分区下的「射击游戏」类别，标题「给三角洲一点震撼，"真"（大红包）」明确指出视频主题是关于游戏内获得大量稀有物品。视频时长「28秒」，分辨率「2160x3840」，且为「竖屏」拍摄。ASR内容未提供。关键帧和封面清晰展示了视频内容，即通过「手机拍摄显示屏」的方式，呈现了《三角洲行动》游戏中一个「背包」被搜刮后，里面装满了「主战坦克模型、军用信息终端、显卡、医疗机器人、浮力设备、黄金眼玲、实验枪械、名贵怀表、盘缠等」这些稀有的「红色品质」游戏物资，与标题中的"大红包"概念完美契合。

观众在观看这类以游戏稀有物品展示为主题的视频时，通常会关注以下几个方面：首先是内容是否真实，其次是展示的物品稀有度是否足够震撼，再者是对up主运气或实力的评价，最后可能也会关注视频本身的制作质量。

当观众看到「视频标题"真"（大红包）与屏幕上充满红色稀有物品的背包」时，会因其视觉冲击力产生强烈的肯定。
【真的特别给力，放心冲[星星眼][星星眼][星星眼]】(支持评论1)

针对「视频中背包被稀有红色物品填满」的景象，观众会将其与标题「大红包」联系起来，表达对内容的认可和赞叹。
【牛逼，真红包🧧】(支持评论2)

当观众看到「如此之多的稀有物品汇集一包」时，会由衷地赞叹up主的实力或运气。
【我去[星星眼]这么有实力】(支持评论3)

「视频内容直接展现了一背包红色稀有物品的震撼效果」，这足以让多数观众对其稀有度和价值表示肯定。
【有点强】(支持评论4)

「视频直接展示了稀有物品的收集结果」，这种直观的震撼力，能轻易引发观众的赞叹。
【牛逼】(支持评论5)

「视频展示的『全红』背包内容极其夸张且罕见」，这会让一些观众感到难以置信或惊叹，并用"难绷"来表达这种复杂情绪，但整体倾向是正面的惊讶。
【难绷】(支持评论6)

面对「视频中前所未见的『全红』背包」，许多观众会感到极其惊讶，认为这是难以想象的。
【我做梦都不敢这么梦。】(支持评论7)

「视频直观地展示了整个背包都是稀有红色物品」，这种极为罕见的场景会让观众印象深刻，并强调其稀有性。
【666第一次见全红】(支持评论8)

「视频中满满当当的稀有高价值物品」，能够直接勾起游戏玩家的占有欲和羡慕。
【想要。馋】(支持评论9)

「视频中清晰可见的、铺满屏幕的红色物品」，其视觉效果本身就能让观众感到赏心悦目，并肯定其内容的稀有性。
【全是红的包就是好看[doge]】(支持评论10)

「视频核心内容是展示一背包红色稀有物品」，这种稀有性与up主的"牛逼"表现结合，让观众发出赞叹。
【全红啊nb】(支持评论11)

对于「这类展示罕见"大红包"的视频，且呈现出前所未有的『全红』背包」，观众会认为这极大地体现了up主的运气或实力。
【这个大红包全红视频没见过，是新的[doge]真有实力啊】(支持评论12)

「视频中的『全红』背包意味着极高的游戏资产价值」，观众会以夸张的语气来表达其带来的财富感。
【全红！！！！！，你直接发家致富了】(支持评论13)

「视频展示的『全红』背包是极其罕见的现象」，许多观众在游戏中可能从未见过类似情况，从而表达出极度的惊讶和难以置信。
【这辈子没见过3红，视频里直接一背包红，人看傻了[笑哭]】(支持评论14)

「视频展示的稀有物品和获得方式，体现了up主在游戏中的某种"实力"」，观众会直接对其表现进行肯定。
【实力这一块】(支持评论15)

然而，并非所有观众都会对「视频中展示的『大红包』」感到惊讶或支持，有些玩家可能会认为这只是普通操作，不值得大惊小怪。
【也就那样吧，我搜一个背包的速度比他快】(反对评论1)

当视频展示了「高价值物品的获取」，可能让一些观众误以为易于复制，从而引发盲目尝试，因此会有观众出于警示目的发出不同意见。
【不要学主播猛攻破产了[大哭]】(反对评论2)

面对「视频中过于完美和稀有的『全红』背包」，部分观众可能会出于怀疑，质疑其内容的真实性，认为可能存在作弊行为。
【物资透吧】(反对评论3)

尽管视频内容稀有，但其「竖屏且为手机拍摄显示屏」的制作方式，可能会被部分注重视频质量的观众批评。
【依旧手机拍显示屏】(反对评论4)

当观众对「视频展示的『全红』背包」不感到稀奇，或者认为其价值被夸大时，会以一种轻描淡写的方式表达反对或不屑。
【没心，一般[doge]】(反对评论5)

在模拟的评论中，支持类评论共15条，反对类评论共5条。
计算 Sigma-标准化差额（Skellam z 分数）z = (15 - 5) / sqrt(15 + 5) = 10 / sqrt(20) ≈ boxed{2.236}。
因 z ≥ 1.5，故整体倾向为支持。
</think>

支持

The output structure:

<think>...</think>: Social-CoT reasoning — content analysis, perspective-taking from diverse viewer personas, simulated supportive/opposing comments
Skellam scoring: Statistical consensus mechanism aggregating support vs. opposition counts
Final label: "支持" (High-Quality) or "非明确支持" (Low-Quality)
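
The Skellam scoring step follows directly from the formula and threshold stated in the prompt criteria. The sketch below reproduces that arithmetic; the function names are illustrative, and the English label strings are taken from the English prompt.

```python
import math

def skellam_z(support: int, oppose: int) -> float:
    """z = (X - Y) / sqrt(X + Y), as defined in the prompt criteria."""
    total = support + oppose
    if total == 0:
        raise ValueError("at least one simulated comment is required")
    return (support - oppose) / math.sqrt(total)

def overall_sentiment(support: int, oppose: int, threshold: float = 1.5) -> str:
    """Apply the z >= 1.5 rule from the prompt criteria."""
    if skellam_z(support, oppose) >= threshold:
        return "supportive"
    return "not clearly supportive"

# Counts from the example output: 15 supportive, 5 opposing comments.
print(round(skellam_z(15, 5), 3))   # 2.236
print(overall_sentiment(15, 5))     # supportive
```

Because the difference is normalized by sqrt(X + Y), a simple majority is not enough: with 12 supportive and 8 opposing comments, z is about 0.894 and the result is "not clearly supportive".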

English output example:

<think>
Based on the given video information, this video belongs to the "Gaming" > "Shooter" category. The title "Epic moment in Delta Force" indicates the video showcases a remarkable in-game achievement. The video is 28 seconds long, 2160x3840 resolution, and shot vertically. No ASR content is provided. The cover and key frames clearly show the content: a player's backpack in Delta Force filled entirely with rare red-tier items including tank models, military terminals, GPUs, medical robots, experimental weapons, and other high-value loot — perfectly matching the "epic moment" theme.

When viewers watch this type of rare loot showcase video, they typically focus on: whether the content is authentic, whether the items are rare enough to be impressive, the creator's luck or skill, and the video's production quality.

Upon seeing the backpack completely filled with red-tier rare items, viewers would be struck by the visual impact and express strong approval.
【This is absolutely insane, go for it!】(Supportive Comment 1)

The sight of a backpack overflowing with rare red items would prompt viewers to connect it with the "epic" title and express admiration.
【Holy cow, that's a real jackpot】(Supportive Comment 2)

Seeing so many rare items collected in one backpack, viewers would genuinely admire the creator's skill or luck.
【Wow, that's some serious skill】(Supportive Comment 3)

The video directly showcases a backpack full of red rare items — enough for most viewers to acknowledge the rarity and value.
【That's pretty strong】(Supportive Comment 4)

The straightforward display of rare loot collection easily triggers viewer admiration.
【Insane】(Supportive Comment 5)

The exaggerated and rare "all-red" backpack contents would leave some viewers in disbelief, expressing complex but ultimately positive surprise.
【I can't even...】(Supportive Comment 6)

Faced with an unprecedented "all-red" backpack, many viewers would feel extreme surprise, considering it unimaginable.
【I wouldn't even dare dream this.】(Supportive Comment 7)

The visual of an entire backpack filled with red rare items — an extremely uncommon scene — would impress viewers and emphasize the rarity.
【666 first time seeing all-red】(Supportive Comment 8)

The abundance of rare high-value items directly triggers gamers' desire and envy.
【Want. So jealous.】(Supportive Comment 9)

The clearly visible red items filling the screen create a visually pleasing effect, reinforcing the content's rarity.
【All red backpack just looks so good lol】(Supportive Comment 10)

The core content — a backpack of red rare items — combined with the creator's impressive performance, draws viewer praise.
【All red, amazing】(Supportive Comment 11)

For this type of rare loot showcase presenting an unprecedented "all-red" backpack, viewers would see it as a testament to the creator's luck or skill.
【Never seen an all-red video like this before, this is new — genuinely skilled】(Supportive Comment 12)

The "all-red" backpack implies extremely high in-game asset value; viewers would use exaggerated language to express the wealth it represents.
【All red!!!!! You just got rich】(Supportive Comment 13)

The "all-red" backpack is an extremely rare phenomenon; many viewers may have never encountered anything similar in-game, expressing extreme surprise and disbelief.
【Never seen 3 reds in my life, and here's a whole backpack of red — I'm speechless】(Supportive Comment 14)

The rare items and how they were obtained reflect the creator's in-game prowess; viewers would directly affirm their performance.
【Skills on point】(Supportive Comment 15)

However, not all viewers would be impressed by the "epic loot"; some players might consider it routine and unremarkable.
【Meh, I can loot a backpack faster than him】(Opposing Comment 1)

The display of high-value item acquisition might mislead some viewers into thinking it's easily replicable, prompting cautionary dissent.
【Don't copy the streamer, I went broke trying】(Opposing Comment 2)

Faced with the overly perfect and rare "all-red" backpack, some viewers might question the authenticity, suspecting cheating.
【Hacking for loot?】(Opposing Comment 3)

Despite the rare content, the vertical phone-filming-a-monitor production style might draw criticism from quality-conscious viewers.
【Still filming the screen with a phone】(Opposing Comment 4)

When viewers find the "all-red" backpack unremarkable or feel its value is exaggerated, they would express dismissal.
【Whatever, pretty average lol】(Opposing Comment 5)

Among the simulated comments, there are 15 supportive comments and 5 opposing comments.
Computing the Skellam z-score: z = (15 - 5) / sqrt(15 + 5) = 10 / sqrt(20) ≈ boxed{2.236}.
Since z ≥ 1.5, the overall comment section sentiment is supportive.
</think>

supportive

When using the English prompt, the model produces English Social-CoT reasoning and outputs "supportive" (High-Quality) or "not clearly supportive" (Low-Quality).
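
For downstream evaluation, the raw completion can be split into its reasoning, the boxed z value, and the final label. The sketch below is one possible parser, assuming the completion text contains the <think> block inline as in the examples above; parse_prediction and LABEL_TO_QUALITY are hypothetical names, not part of the repository.

```python
import re

# Mapping from the final labels to quality classes, as described above.
LABEL_TO_QUALITY = {
    "支持": "High-Quality",
    "非明确支持": "Low-Quality",
    "supportive": "High-Quality",
    "not clearly supportive": "Low-Quality",
}

def parse_prediction(raw: str) -> dict:
    """Split a raw completion into final label, quality class, and boxed z."""
    think = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    reasoning = think.group(1) if think else ""
    # Everything after the closing tag is treated as the final answer.
    answer = raw.split("</think>")[-1].strip()
    boxed = re.search(r"boxed\{(-?\d+(?:\.\d+)?)\}", reasoning)
    return {
        "label": answer,
        "quality": LABEL_TO_QUALITY.get(answer.lower()),  # None if unrecognized
        "z": float(boxed.group(1)) if boxed else None,
    }

raw = "<think>\nz = 10 / sqrt(20) ≈ boxed{2.236}.\n</think>\n\nsupportive"
print(parse_prediction(raw))
```

Returning None for an unrecognized label or a missing boxed value keeps malformed outputs visible instead of silently counting them as one class.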

Training Details

Stage 1 (SFT): Trained on 54k Gemini-labeled Social-CoT samples + 3k human-annotated expert data
Stage 2 (RL): GRPO with composite reward:
- Format reward (r_format): Structured output compliance
- Label reward (r_label): Prediction correctness
- Diversity reward (r_diversity): Prevents mode collapse in comment generation
- Social Alignment reward (r_social): Semantic similarity with real user comments via embedding model
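
One way to picture the composite reward is as a weighted sum of the four terms. The sketch below is illustrative only: the source does not state how the terms are computed or weighted, so the equal weights, the 0/1 format and label checks, and all function names are assumptions. r_diversity and r_social are taken as precomputed scores in [0, 1].

```python
import re

def format_reward(output: str) -> float:
    """Assumed check: 1.0 if a <think> block contains a boxed z value."""
    m = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    has_boxed = bool(m and re.search(r"boxed\{[^}]+\}", m.group(1)))
    return 1.0 if has_boxed else 0.0

def label_reward(output: str, gold_label: str) -> float:
    """Assumed check: 1.0 if the text after </think> equals the gold label."""
    predicted = output.split("</think>")[-1].strip()
    return 1.0 if predicted == gold_label else 0.0

def composite_reward(output: str, gold_label: str,
                     r_diversity: float, r_social: float,
                     weights=(0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four terms; the weights here are hypothetical."""
    terms = (
        format_reward(output),
        label_reward(output, gold_label),
        r_diversity,
        r_social,
    )
    return sum(w * t for w, t in zip(weights, terms))

out = "<think>z = boxed{2.236}</think>\n\nsupportive"
print(composite_reward(out, "supportive", r_diversity=1.0, r_social=1.0))
```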

Citation

@article{li2026community,
  title={Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation},
  author={Li, Tianjiao and Zhao, Kai and Li, Xiang and Liu, Yang and Sun, Huyang},
  journal={arXiv preprint arXiv:2606.01897},
  year={2026}
}

License

bilibili Model Use License Agreement (bilibili模型使用许可协议)

Safetensors

Model size: 9B params
Tensor type: BF16

Model tree for IndexTeam/MEDEA

Base model: Qwen/Qwen3-VL-8B-Thinking

Paper for IndexTeam/MEDEA

Community-Aware Assessment of Social Textual Engagement and Resonance: A Human-Centric Perspective on User-Generated Content Evaluation (arXiv 2606.01897, published Jun 4)