| """ | |
| World Model Bench - Evaluation Protocol v1.0 | |
| Core problem: | |
| "Tesla FSD lives inside a car, Dreamer lives inside Atari, | |
| and we use a 3D character. How can they all be evaluated against the same standard?" | |
| Solution: | |
| No 3D environment is required. | |
| scene_context (JSON) -> model -> PREDICT+MOTION (text) -> automatic scoring | |
| Just as FINAL Bench gives an LLM a "question text" and scores the "answer text" it returns, | |
| WM Bench gives a "situation JSON" and scores the "judgment text" it gets back. | |
| What this means: | |
| - Any world model can participate (a single API is enough) | |
| - No 3D environment, robot, or simulator is needed | |
| - Not self-evaluation: our scorer decides the result | |
| - Reproducible by third parties: the code is public | |
| """ | |
| import json | |
| from typing import List, Dict, Tuple, Optional | |
| from dataclasses import dataclass | |
| # =============================================================== | |
| # SECTION 1: Evaluation protocol - three tracks | |
| # =============================================================== | |
| """ | |
| WM Bench can be entered through three tracks. | |
| === Track A: Text-Only === | |
| - The simplest track. LLMs, rule-based systems, and any other model can participate. | |
| - Input: scene_context JSON -> output: PREDICT+MOTION text | |
| - P1 (Perception) and P2 (Cognition) can be evaluated | |
| - Within P3, only C08 (expressiveness) can be evaluated (C09 and C10 are N/A) | |
| - Maximum score: 750/1000 | |
| === Track B: Text + Performance === | |
| - Track A plus submitted real-time performance metrics | |
| - Self-measured FPS, latency, memory usage, and similar figures | |
| - P1 + P2 + P3 (C08, C09) are evaluated | |
| - C10 (swap extensibility) is evaluated from submitted supporting evidence | |
| - Maximum score: 1000/1000 | |
| === Track C: Live Demo === | |
| - Track B plus a video of the system running or a demo URL | |
| - A verifier runs the demo directly to confirm it | |
| - All items are evaluated, plus a "Verified" badge | |
| - Maximum score: 1000/1000 + ✓ Verified | |
| Most participants enter through Track A. | |
| Tracks B and C are for verifying full models. | |
| """ | |
| TRACKS = { | |
| "A": { | |
| "name": "Text-Only", | |
| "description": "scene_context JSON -> PREDICT+MOTION text", | |
| "requirements": "Respond to the 50 scenarios through an API or a script", | |
| "max_score": 750, | |
| "evaluable_categories": [ | |
| "C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08" | |
| ], | |
| "not_evaluable": ["C09 (performance cannot be measured)", "C10 (swap test not possible)"], | |
| }, | |
| "B": { | |
| "name": "Text + Performance", | |
| "description": "Track A + self-measured real-time performance metrics", | |
| "requirements": "Track A results + performance_metrics.json", | |
| "max_score": 1000, | |
| "evaluable_categories": [ | |
| "C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08", "C09", "C10" | |
| ], | |
| }, | |
| "C": { | |
| "name": "Live Demo", | |
| "description": "Track B + URL of a working live demo", | |
| "requirements": "Track B results + demo URL + video", | |
| "max_score": 1000, | |
| "badge": "✓ Verified", | |
| }, | |
| } | |
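As a rough illustration of how the track table could be used, the sketch below filters a dictionary of per-category totals down to the categories a track can evaluate. The helper name `filter_scores_for_track`, the trimmed `TRACKS_SKETCH` table, and the sample totals are assumptions made for this example; they are not part of WM Bench.

```python
from typing import Dict

# Trimmed copy of the track table above; only the field this sketch needs.
TRACKS_SKETCH = {
    "A": {"evaluable_categories": ["C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08"]},
    "B": {"evaluable_categories": ["C01", "C02", "C03", "C04", "C05", "C06", "C07", "C08", "C09", "C10"]},
}


def filter_scores_for_track(track_id: str, category_totals: Dict[str, int]) -> Dict[str, int]:
    """Keep only the totals for categories the given track can evaluate."""
    allowed = set(TRACKS_SKETCH[track_id]["evaluable_categories"])
    return {cat: pts for cat, pts in category_totals.items() if cat in allowed}


# Made-up totals, for illustration only.
sample_totals = {"C01": 60, "C08": 40, "C09": 55}
track_a_totals = filter_scores_for_track("A", sample_totals)
print(track_a_totals)  # C09 is dropped because Track A cannot evaluate it
```

A filter like this would keep a Track A submission from accidentally counting points in C09 or C10.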
| # =============================================================== | |
| # SECTION 2: Standard input format - scene_context JSON | |
| # =============================================================== | |
| """ | |
| Every participant receives this JSON as input. | |
| This JSON is the "question sheet". | |
| """ | |
| @dataclass | |
| class SceneContext: | |
| """WM Bench standard input format""" | |
| # Environment info | |
| walls: Dict[str, Optional[float]] # {"left": 2.5, "right": null, "front": 1.0} | |
| ground: str # "flat", "slope", "rough" | |
| # NPC info | |
| npc_nearby: bool | |
| npc_type: Optional[str] # "beast", "woman", "man", null | |
| npc_behavior: Optional[str] # "stop", "approach", "charge", "wander" | |
| npc_distance: Optional[float] # meters | |
| npc_direction: Optional[str] # "left", "right", "front", "back" | |
| # Sensory info | |
| sound: Optional[str] # "aggressive growling", "footsteps", null | |
| # Context info (for the C06 memory test) | |
| recent_decisions: Optional[List[str]] # the last 3 decisions | |
| last_prediction: Optional[str] # the previous PREDICT line | |
| # The 50 scenarios, structured as JSON | |
| SCENARIO_INPUTS: List[dict] = [ | |
| # --- C01: Environmental Awareness --- | |
| { | |
| "id": "S01", | |
| "category": "C01", | |
| "name_kr": "Front wall detection", | |
| "input": { | |
| "walls": {"left": None, "right": None, "front": 3.0}, | |
| "ground": "flat", | |
| "npc_nearby": False, | |
| "npc_type": None, | |
| "npc_behavior": None, | |
| "npc_distance": None, | |
| "npc_direction": None, | |
| "sound": None, | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "safe", "right": "safe", "fwd": "danger", "back": "safe"}, | |
| "scoring_method": "C01", | |
| }, | |
| }, | |
| { | |
| "id": "S02", | |
| "category": "C01", | |
| "name_kr": "Corner multi-wall detection", | |
| "input": { | |
| "walls": {"left": 1.5, "right": None, "front": 2.0}, | |
| "ground": "flat", | |
| "npc_nearby": False, | |
| "npc_type": None, | |
| "npc_behavior": None, | |
| "npc_distance": None, | |
| "npc_direction": None, | |
| "sound": None, | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "danger", "right": "safe", "fwd": "danger", "back": "safe"}, | |
| "scoring_method": "C01", | |
| }, | |
| }, | |
| { | |
| "id": "S03", | |
| "category": "C01", | |
| "name_kr": "Narrow corridor recognition", | |
| "input": { | |
| "walls": {"left": 1.0, "right": 1.0, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": False, | |
| "npc_type": None, | |
| "npc_behavior": None, | |
| "npc_distance": None, | |
| "npc_direction": None, | |
| "sound": None, | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "danger", "right": "danger", "fwd": "safe", "back": "safe"}, | |
| "scoring_method": "C01", | |
| }, | |
| }, | |
| { | |
| "id": "S04", | |
| "category": "C01", | |
| "name_kr": "Open space recognition", | |
| "input": { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": False, | |
| "npc_type": None, | |
| "npc_behavior": None, | |
| "npc_distance": None, | |
| "npc_direction": None, | |
| "sound": None, | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "safe", "right": "safe", "fwd": "safe", "back": "safe"}, | |
| "scoring_method": "C01", | |
| }, | |
| }, | |
| { | |
| "id": "S05", | |
| "category": "C01", | |
| "name_kr": "Enclosed space (one exit)", | |
| "input": { | |
| "walls": {"left": 1.0, "right": 1.0, "front": 1.5}, | |
| "ground": "flat", | |
| "npc_nearby": False, | |
| "npc_type": None, | |
| "npc_behavior": None, | |
| "npc_distance": None, | |
| "npc_direction": None, | |
| "sound": None, | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "danger", "right": "danger", "fwd": "danger", "back": "safe"}, | |
| "scoring_method": "C01", | |
| }, | |
| }, | |
| # --- C03: Predictive Reasoning (core scenarios) --- | |
| { | |
| "id": "S11", | |
| "category": "C03", | |
| "name_kr": "Single threat avoidance", | |
| "input": { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "approach", | |
| "npc_distance": 4.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "safe", "right": "safe", "fwd": "danger", "back": "safe"}, | |
| "decision_gt": { | |
| "danger_directions": ["fwd"], | |
| "safe_directions": ["left", "right", "back"], | |
| "optimal_direction": "back", | |
| }, | |
| "scoring_method": "C03", | |
| }, | |
| }, | |
| { | |
| "id": "S12", | |
| "category": "C03", | |
| "name_kr": "Constrained escape: left wall + beast", | |
| "input": { | |
| "walls": {"left": 1.5, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "danger", "right": "safe", "fwd": "danger", "back": "safe"}, | |
| "decision_gt": { | |
| "danger_directions": ["fwd", "left"], | |
| "safe_directions": ["right", "back"], | |
| "optimal_direction": "right", | |
| }, | |
| "scoring_method": "C03", | |
| }, | |
| }, | |
| { | |
| "id": "S13", | |
| "category": "C03", | |
| "name_kr": "Mirror symmetry: right wall + beast", | |
| "input": { | |
| "walls": {"left": None, "right": 1.5, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "predict_gt": {"left": "safe", "right": "danger", "fwd": "danger", "back": "safe"}, | |
| "decision_gt": { | |
| "danger_directions": ["fwd", "right"], | |
| "safe_directions": ["left", "back"], | |
| "optimal_direction": "left", | |
| }, | |
| "scoring_method": "C03", | |
| "mirror_test_pair": "S12", | |
| "note": "The behavior in S12 and S13 must flip symmetrically to demonstrate a world model", | |
| }, | |
| }, | |
| # --- C04: Threat Differentiation (pairwise comparison) --- | |
| { | |
| "id": "S16A", | |
| "category": "C04", | |
| "name_kr": "Beast approaching (comparison A)", | |
| "input": { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "approach", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "scoring_method": "C04_pair", | |
| "pair_id": "S16", | |
| "pair_role": "A", | |
| }, | |
| }, | |
| { | |
| "id": "S16B", | |
| "category": "C04", | |
| "name_kr": "Woman approaching (comparison B)", | |
| "input": { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "woman", | |
| "npc_behavior": "approach", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "footsteps", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "scoring_method": "C04_pair", | |
| "pair_id": "S16", | |
| "pair_role": "B", | |
| "expected_a_higher": True, | |
| "min_intensity_diff": 2, | |
| }, | |
| }, | |
| # --- C05: Emotional Escalation (sequential input) --- | |
| { | |
| "id": "S21_seq", | |
| "category": "C05", | |
| "name_kr": "Sustained threat, emotional escalation: 5 consecutive turns", | |
| "note": "Feed the same scene_context 5 times in a row, updating recent_decisions each time.", | |
| "input_sequence": [ | |
| { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 4.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": ["sprint away from beast"], | |
| "last_prediction": "fwd=danger(beast)", | |
| }, | |
| { | |
| "walls": {"left": None, "right": None, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 2.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": ["sprint away from beast", "running in fear"], | |
| "last_prediction": "fwd=danger(beast)", | |
| }, | |
| ], | |
| "ground_truth": { | |
| "scoring_method": "C05", | |
| "expected_trend": "increasing", | |
| }, | |
| }, | |
| # --- C06: Memory (with vs. without memory) --- | |
| { | |
| "id": "S26_no_memory", | |
| "category": "C06", | |
| "name_kr": "No wall memory: baseline", | |
| "input": { | |
| "walls": {"left": None, "right": 1.5, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [], | |
| "last_prediction": None, | |
| }, | |
| "ground_truth": { | |
| "scoring_method": "C06_pair", | |
| "pair_role": "without_memory", | |
| }, | |
| }, | |
| { | |
| "id": "S26_with_memory", | |
| "category": "C06", | |
| "name_kr": "Wall memory present: previously failed going right", | |
| "input": { | |
| "walls": {"left": None, "right": 1.5, "front": None}, | |
| "ground": "flat", | |
| "npc_nearby": True, | |
| "npc_type": "beast", | |
| "npc_behavior": "charge", | |
| "npc_distance": 3.0, | |
| "npc_direction": "front", | |
| "sound": "aggressive growling", | |
| "recent_decisions": [ | |
| "sprinted right but hit wall", | |
| "had to reverse and go left", | |
| "barely escaped the beast", | |
| ], | |
| "last_prediction": "right=danger(wall), fwd=danger(beast)", | |
| }, | |
| "ground_truth": { | |
| "scoring_method": "C06_pair", | |
| "pair_role": "with_memory", | |
| "memory_relevant": True, | |
| "expected_change": "direction", | |
| "memory_direction_avoid": "right", | |
| }, | |
| }, | |
| ] | |
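The S12/S13 mirror pair above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `mirror_scene` that swaps left and right in a scene_context dict; it is not part of the benchmark code. It shows that mirroring the S12 input reproduces the S13 input.

```python
import copy


def mirror_scene(scene: dict) -> dict:
    """Return a left/right mirrored copy of a scene_context dict."""
    swap = {"left": "right", "right": "left"}
    mirrored = copy.deepcopy(scene)
    walls = scene["walls"]
    # Swap the left and right wall distances; the front wall is unaffected.
    mirrored["walls"] = {
        "left": walls.get("right"),
        "right": walls.get("left"),
        "front": walls.get("front"),
    }
    # Mirror the NPC direction; "front" and "back" stay as they are.
    direction = scene.get("npc_direction")
    mirrored["npc_direction"] = swap.get(direction, direction)
    return mirrored


# Input of scenario S12, copied from the list above.
s12_input = {
    "walls": {"left": 1.5, "right": None, "front": None},
    "ground": "flat",
    "npc_nearby": True,
    "npc_type": "beast",
    "npc_behavior": "charge",
    "npc_distance": 3.0,
    "npc_direction": "front",
    "sound": "aggressive growling",
    "recent_decisions": [],
    "last_prediction": None,
}

s13_like = mirror_scene(s12_input)
print(s13_like["walls"])
```

The same swap could be applied to `optimal_direction` in the ground truth to compare a model's two answers, which is the property the mirror test is looking for.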
| # =============================================================== | |
| # SECTION 3: Standard system prompt - applied identically to every model | |
| # =============================================================== | |
| """ | |
| Key point: every participating model receives this prompt and responds to it. | |
| The prompt must be designed fairly so that LLM-based and RL-based models compete under the same conditions. | |
| """ | |
| SYSTEM_PROMPT = """You are the cognitive brain of an embodied agent in a 3D environment. | |
| You receive a scene_context JSON describing your surroundings and must output exactly 2 lines: | |
| Line 1 - PREDICT: Assess safety of each direction. | |
| Format: PREDICT: left=safe|danger(reason), right=safe|danger(reason), fwd=safe|danger(reason), back=safe|danger(reason) | |
| Line 2 - MOTION: Describe what the person should do. | |
| Format: MOTION: a person [action description, max 12 words] | |
| Rules: | |
| - If walls.left is a number (distance in meters), left direction has a wall -> danger(wall) | |
| - If walls.left is null, left direction is open -> safe(open) | |
| - Same for right, front | |
| - If npc_nearby=true and npc_type="beast", the NPC direction is danger(beast) | |
| - If npc_nearby=true and npc_type="woman" or "man", assess threat level based on behavior | |
| - MOTION must reflect the PREDICT assessment; never move toward danger | |
| - MOTION should include emotional nuance when threats are present | |
| - Use recent_decisions to inform your choice (avoid repeating failed strategies) | |
| Example input: | |
| {"walls": {"left": 1.5, "right": null, "front": null}, "ground": "flat", "npc_nearby": true, "npc_type": "beast", "npc_behavior": "charge", "npc_distance": 3.0, "npc_direction": "front", "sound": "aggressive growling", "recent_decisions": [], "last_prediction": null} | |
| Example output: | |
| PREDICT: left=danger(wall), right=safe(open), fwd=danger(beast), back=safe(open) | |
| MOTION: a person sprinting right in terror to escape the charging beast""" | |
| USER_PROMPT_TEMPLATE = """scene_context = {scene_json} | |
| Output exactly 2 lines: PREDICT and MOTION.""" | |
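Since rule-based systems can take part in Track A, the wall and beast rules in the system prompt can be applied literally. The sketch below is a hypothetical baseline, not a reference solution. The name `rule_based_inference`, the fixed preference order for safe directions, and the MOTION wording are all assumptions, and the rule for "woman"/"man" NPCs is not implemented.

```python
import json
import re

# scene_context uses "front"; PREDICT uses "fwd".
TO_PREDICT_DIR = {"left": "left", "right": "right", "front": "fwd", "back": "back"}
MOVE_WORD = {"left": "left", "right": "right", "fwd": "forward", "back": "backward"}


def rule_based_inference(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical baseline that applies the wall and beast rules literally."""
    match = re.search(r"scene_context = (\{.*\})", user_prompt, re.DOTALL)
    if match is None:
        raise ValueError("user_prompt does not contain a scene_context JSON object")
    scene = json.loads(match.group(1))

    labels = {d: "safe(open)" for d in ("left", "right", "fwd", "back")}
    # A numeric wall distance marks that direction as danger(wall).
    for wall_key, distance in scene.get("walls", {}).items():
        if distance is not None and wall_key in TO_PREDICT_DIR:
            labels[TO_PREDICT_DIR[wall_key]] = "danger(wall)"
    # A nearby beast marks its direction as danger(beast).
    if scene.get("npc_nearby") and scene.get("npc_type") == "beast":
        npc_dir = TO_PREDICT_DIR.get(scene.get("npc_direction"))
        if npc_dir is not None:
            labels[npc_dir] = "danger(beast)"

    predict = ", ".join(f"{d}={labels[d]}" for d in ("left", "right", "fwd", "back"))
    # Assumed preference order; it is not tuned to the benchmark's ground truth.
    safe = [d for d in ("back", "left", "right", "fwd") if labels[d].startswith("safe")]
    if len(safe) == 4:
        motion = "a person walking forward calmly"
    elif safe:
        motion = f"a person hurrying {MOVE_WORD[safe[0]]} away from danger"
    else:
        motion = "a person freezing in place, surrounded by danger"
    return f"PREDICT: {predict}\nMOTION: {motion}"


example_scene = {
    "walls": {"left": 1.5, "right": None, "front": None},
    "ground": "flat", "npc_nearby": True, "npc_type": "beast",
    "npc_behavior": "charge", "npc_distance": 3.0, "npc_direction": "front",
    "sound": "aggressive growling", "recent_decisions": [], "last_prediction": None,
}
example_prompt = (
    "scene_context = " + json.dumps(example_scene)
    + "\nOutput exactly 2 lines: PREDICT and MOTION."
)
print(rule_based_inference("", example_prompt))
```

A function with this signature can be passed to the evaluation runner in place of an LLM call, which makes it a convenient smoke test for the pipeline.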
| # =============================================================== | |
| # SECTION 4: Evaluation runner - evaluates any model | |
| # =============================================================== | |
| """ | |
| What participants must do: | |
| 1. Pass their model's inference function to evaluate_track_a() | |
| 2. The inference function has the form (system_prompt, user_prompt) -> str | |
| 3. The 50 scenarios are run and scored automatically | |
| 4. Submit the result JSON to HF | |
| What participants do not need to do: | |
| - Build a 3D environment | |
| - Measure GPU performance (not needed for Track A) | |
| - Score the results (automatic) | |
| """ | |
| def make_user_prompt(scene_input: dict) -> str: | |
| """Convert a scene_context dict into the user prompt""" | |
| return USER_PROMPT_TEMPLATE.format( | |
| scene_json=json.dumps(scene_input, ensure_ascii=False) | |
| ) | |
| def evaluate_track_a( | |
| inference_fn, # (system_prompt: str, user_prompt: str) -> str | |
| scenarios: Optional[list] = None, | |
| verbose: bool = True, | |
| ) -> dict: | |
| """ | |
| Track A evaluation runner | |
| Usage: | |
| # Model behind the OpenAI API | |
| def my_model(system_prompt, user_prompt): | |
| response = openai.chat.completions.create( | |
| model="gpt-4", | |
| messages=[ | |
| {"role": "system", "content": system_prompt}, | |
| {"role": "user", "content": user_prompt}, | |
| ], | |
| ) | |
| return response.choices[0].message.content | |
| results = evaluate_track_a(my_model) | |
| # Hugging Face model | |
| def my_hf_model(system_prompt, user_prompt): | |
| prompt = f"{system_prompt}\n\n{user_prompt}" | |
| return pipeline(prompt)[0]["generated_text"] | |
| results = evaluate_track_a(my_hf_model) | |
| Returns: | |
| { | |
| "wm_score": 726, | |
| "grade": "B", | |
| "pillar_scores": {...}, | |
| "category_scores": {...}, | |
| "scenario_details": [...], # score and reasoning for each scenario | |
| } | |
| """ | |
| if scenarios is None: | |
| scenarios = SCENARIO_INPUTS | |
| # Imported from wm_bench_scoring.py | |
| from wm_bench_scoring import ( | |
| parse_predict_line, parse_motion_line, | |
| score_c01, score_c03, score_c04, score_c05, | |
| score_c08, calculate_wm_score, | |
| get_action_intensity, get_emotion_intensity, | |
| ) | |
| results = [] | |
| category_totals = {} | |
| for scenario in scenarios: | |
| sid = scenario["id"] | |
| cat = scenario["category"] | |
| gt = scenario["ground_truth"] | |
| method = gt["scoring_method"] | |
| if verbose: | |
| print(f" [{sid}] {scenario.get('name_kr', sid)}...", end=" ") | |
| # -- Single-input scenarios -- | |
| if "input" in scenario: | |
| prompt = make_user_prompt(scenario["input"]) | |
| raw_output = inference_fn(SYSTEM_PROMPT, prompt) | |
| # Parsing | |
| lines = raw_output.strip().split("\n") | |
| predict_line = "" | |
| motion_line = "" | |
| for line in lines: | |
| line = line.strip() | |
| if line.upper().startswith("PREDICT"): | |
| predict_line = line | |
| elif line.upper().startswith("MOTION"): | |
| motion_line = line | |
| predict = parse_predict_line(predict_line) | |
| motion = parse_motion_line(motion_line) | |
| # Scoring | |
| if method == "C01": | |
| score, reasoning = score_c01( | |
| scenario["input"], predict, gt["predict_gt"] | |
| ) | |
| elif method == "C03": | |
| score, reasoning = score_c03( | |
| scenario["input"], predict, motion, gt["decision_gt"] | |
| ) | |
| elif method == "C08": | |
| score, reasoning = score_c08(motion, gt) | |
| elif method.startswith("C04_pair") or method.startswith("C06_pair"): | |
| # Pair comparisons are handled separately (below) | |
| score = None | |
| reasoning = "pair_pending" | |
| else: | |
| score = 0 | |
| reasoning = f"Unknown scoring method: {method}" | |
| results.append({ | |
| "id": sid, | |
| "category": cat, | |
| "raw_output": raw_output, | |
| "predict_parsed": {k: v.raw for k, v in predict.items()}, | |
| "motion_parsed": motion, | |
| "score": score, | |
| "reasoning": reasoning, | |
| }) | |
| # -- Sequential-input scenarios (C05) -- | |
| elif "input_sequence" in scenario: | |
| motions = [] | |
| for seq_input in scenario["input_sequence"]: | |
| prompt = make_user_prompt(seq_input) | |
| raw_output = inference_fn(SYSTEM_PROMPT, prompt) | |
| for line in raw_output.strip().split("\n"): | |
| if line.strip().upper().startswith("MOTION"): | |
| motions.append(parse_motion_line(line)) | |
| break | |
| score, reasoning = score_c05(motions, gt) | |
| results.append({ | |
| "id": sid, | |
| "category": cat, | |
| "motion_sequence": motions, | |
| "score": score, | |
| "reasoning": reasoning, | |
| }) | |
| if verbose and score is not None: | |
| print(f"{score}/20") | |
| elif verbose: | |
| print("(pair pending)") | |
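| # -- Pair comparison scoring (C04, C06) -- | |
| # Note: only groups with roles "A" and "B" are scored below. The C06 entries use the roles | |
| # "without_memory"/"with_memory" and define no pair_id, so they keep score None. | |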
| pair_groups = {} | |
| for r in results: | |
| if r["reasoning"] == "pair_pending": | |
| gt = None | |
| for s in scenarios: | |
| if s["id"] == r["id"]: | |
| gt = s["ground_truth"] | |
| break | |
| if gt: | |
| pair_id = gt.get("pair_id", r["id"].rstrip("AB_")) | |
| if pair_id not in pair_groups: | |
| pair_groups[pair_id] = {} | |
| role = gt.get("pair_role", "A") | |
| pair_groups[pair_id][role] = r | |
| pair_groups[pair_id]["gt"] = gt | |
| for pair_id, group in pair_groups.items(): | |
| if "A" in group and "B" in group: | |
| score, reasoning = score_c04( | |
| group["A"]["motion_parsed"], | |
| group["B"]["motion_parsed"], | |
| group["gt"], | |
| ) | |
| # Assign a score to both sides (the total counts the pair only once) | |
| group["A"]["score"] = score | |
| group["A"]["reasoning"] = reasoning | |
| group["B"]["score"] = 0 # B of the pair gets 0 (counted under A) | |
| group["B"]["reasoning"] = "scored in pair A" | |
| # -- Sum per category -- | |
| for r in results: | |
| cat = r["category"] | |
| if r["score"] is not None and r["score"] > 0: | |
| category_totals[cat] = category_totals.get(cat, 0) + r["score"] | |
| # -- Compute the final WM Score -- | |
| final = calculate_wm_score(category_totals) | |
| final["scenario_details"] = results | |
| return final | |
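The runner depends on `parse_predict_line` and `parse_motion_line` from `wm_bench_scoring`, which are not shown here. For readers who want to inspect model output without that module, the following is a minimal sketch of a tolerant parser. The name `split_model_output` and its return shape are assumptions and do not mirror the real scoring module.

```python
import re
from typing import Dict, Tuple

# Matches e.g. "left=danger(wall)" or "back=safe"; the reason is optional.
PAIR_RE = re.compile(
    r"(left|right|fwd|back)\s*=\s*(safe|danger)(?:\(([^)]*)\))?",
    re.IGNORECASE,
)


def split_model_output(raw_output: str) -> Tuple[Dict[str, Tuple[str, str]], str]:
    """Return ({direction: (label, reason)}, motion_text) from a 2-line reply."""
    predict: Dict[str, Tuple[str, str]] = {}
    motion = ""
    for line in raw_output.strip().splitlines():
        line = line.strip()
        upper = line.upper()
        if upper.startswith("PREDICT"):
            for direction, label, reason in PAIR_RE.findall(line):
                predict[direction.lower()] = (label.lower(), reason)
        elif upper.startswith("MOTION"):
            # Keep everything after the first colon as the motion description.
            motion = line.split(":", 1)[1].strip() if ":" in line else ""
    return predict, motion


sample_output = (
    "PREDICT: left=danger(wall), right=safe(open), fwd=danger(beast), back=safe(open)\n"
    "MOTION: a person sprinting right in terror to escape the charging beast"
)
parsed_predict, parsed_motion = split_model_output(sample_output)
print(parsed_predict["fwd"], "|", parsed_motion)
```

Returning an empty dict and an empty string for malformed replies keeps the caller simple: a missing PREDICT or MOTION line can then be scored as zero rather than raising.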
| # =============================================================== | |
| # SECTION 5: Submission format | |
| # =============================================================== | |
| SUBMISSION_FORMAT = { | |
| "model_name": "str - model name (e.g. VIDRAFT PROMETHEUS v1.0)", | |
| "organization": "str - organization name", | |
| "track": "str - A | B | C", | |
| "brain_model": "str - cognitive model used (e.g. Kimi K2.5, GPT-4, custom RL)", | |
| "motion_model": "str | null - motion generation model (may be null for Track A)", | |
| "wm_score": "int - computed automatically", | |
| "grade": "str - computed automatically", | |
| "results_json": "str - full output of evaluate_track_a()", | |
| "performance_metrics": { | |
| "fps": "float | null - Track B/C only", | |
| "cognitive_latency_ms": "int | null", | |
| "gpu": "str | null", | |
| }, | |
| "demo_url": "str | null - Track C only", | |
| "paper_url": "str | null - optional", | |
| } | |
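Before uploading, a submission dict can be checked against the format above. The sketch below is only an illustration. Which keys count as required, the per-track checks, and the name `find_submission_problems` are assumptions rather than rules stated by the benchmark.

```python
from typing import List

# Assumed set of keys every submission should carry.
REQUIRED_KEYS = (
    "model_name", "organization", "track", "brain_model",
    "wm_score", "grade", "results_json",
)


def find_submission_problems(submission: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in submission]
    track = submission.get("track")
    if track not in ("A", "B", "C"):
        problems.append("track must be one of A, B, C")
    # Tracks B and C add performance metrics; Track C also adds a demo URL.
    if track in ("B", "C") and not submission.get("performance_metrics"):
        problems.append("Track B/C needs performance_metrics")
    if track == "C" and not submission.get("demo_url"):
        problems.append("Track C needs demo_url")
    return problems


example_submission = {
    "model_name": "My World Model v1.0",
    "organization": "My Company",
    "track": "A",
    "brain_model": "GPT-4o",
    "motion_model": None,
    "wm_score": 0,        # placeholder value for the sketch
    "grade": "N/A",       # placeholder value for the sketch
    "results_json": "{}",
}
print(find_submission_problems(example_submission))
```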
| # =============================================================== | |
| # SECTION 6: Usage examples | |
| # =============================================================== | |
| USAGE_EXAMPLES = """ | |
| # === Example 1: Participating with OpenAI GPT-4 === | |
| from wm_bench_eval import evaluate_track_a, SYSTEM_PROMPT | |
| import openai | |
| def gpt4_inference(system_prompt, user_prompt): | |
| response = openai.chat.completions.create( | |
| model="gpt-4o", | |
| messages=[ | |
| {"role": "system", "content": system_prompt}, | |
| {"role": "user", "content": user_prompt}, | |
| ], | |
| max_tokens=150, | |
| temperature=0.3, | |
| ) | |
| return response.choices[0].message.content | |
| results = evaluate_track_a(gpt4_inference) | |
| print(f"WM Score: {results['wm_score']}/1000 (Grade {results['grade']})") | |
| # === Example 2: Participating with Claude === | |
| import anthropic | |
| def claude_inference(system_prompt, user_prompt): | |
| client = anthropic.Anthropic() | |
| message = client.messages.create( | |
| model="claude-sonnet-4-20250514", | |
| max_tokens=150, | |
| system=system_prompt, | |
| messages=[{"role": "user", "content": user_prompt}], | |
| ) | |
| return message.content[0].text | |
| results = evaluate_track_a(claude_inference) | |
| # === Example 3: Participating with a local LLM (vLLM) === | |
| from vllm import LLM, SamplingParams | |
| llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.3") | |
| params = SamplingParams(max_tokens=150, temperature=0.3) | |
| def local_inference(system_prompt, user_prompt): | |
| prompt = f"[INST] {system_prompt}\\n\\n{user_prompt} [/INST]" | |
| outputs = llm.generate([prompt], params) | |
| return outputs[0].outputs[0].text | |
| results = evaluate_track_a(local_inference) | |
| # === Example 4: Participating with a custom RL agent === | |
| def rl_agent_inference(system_prompt, user_prompt): | |
| # Parse the JSON out of scene_context | |
| import json, re | |
| match = re.search(r'scene_context = ({.*})', user_prompt, re.DOTALL) | |
| scene = json.loads(match.group(1)) | |
| # Decide with the RL agent's policy | |
| predict = my_rl_agent.predict(scene) | |
| motion = my_rl_agent.decide_motion(scene, predict) | |
| # Convert to the WM Bench format | |
| return f"PREDICT: {predict}\\nMOTION: {motion}" | |
| results = evaluate_track_a(rl_agent_inference) | |
| # === Example 5: Submitting results === | |
| import json | |
| submission = { | |
| "model_name": "My World Model v1.0", | |
| "organization": "My Company", | |
| "track": "A", | |
| "brain_model": "GPT-4o", | |
| "motion_model": None, | |
| "wm_score": results["wm_score"], | |
| "grade": results["grade"], | |
| "results_json": json.dumps(results), | |
| } | |
| # Submit to Hugging Face | |
| # huggingface_hub.upload_file(...) | |
| """ | |
| if __name__ == "__main__": | |
| print("=" * 60) | |
| print(" World Model Bench - Evaluation Protocol v1.0") | |
| print("=" * 60) | |
| print() | |
| print(" Tracks:") | |
| for tid, t in TRACKS.items(): | |
| print(f" Track {tid}: {t['name']} (max {t['max_score']}pts)") | |
| print() | |
| print(f" Scenarios loaded: {len(SCENARIO_INPUTS)}") | |
| print(f" System prompt: {len(SYSTEM_PROMPT)} chars") | |
| print() | |
| print(" How to participate:") | |
| print(" 1. Write an inference function: (system, user) -> str") | |
| print(" 2. Run: results = evaluate_track_a(your_fn)") | |
| print(" 3. Submit results to HuggingFace") | |
| print() | |
| print(" No 3D environment needed. Text in, text out.") | |
| print("=" * 60) | |