Spaces: comp5423/NewProject

PPP committed 24 days ago

Commit 1a91c20 (1 parent: f0690fd)

feat: add reproducible evaluation pipeline and structured interaction logging

Files changed (12)

.gitignore +6 -0
README.md +530 -3
app.py +330 -84
evaluation/datasets/branch_divergence.json +84 -0
evaluation/datasets/consistency.json +283 -0
evaluation/datasets/intent_accuracy.json +201 -0
evaluation/datasets/latency.json +79 -0
evaluation/run_evaluations.py +567 -0
nlu_engine.py +25 -22
story_engine.py +174 -94
telemetry.py +81 -0
utils.py +25 -14

.gitignore ADDED

	@@ -0,0 +1,6 @@

+__pycache__/
+*.py[cod]
+.env
+logs/
+evaluation/results/

README.md CHANGED

@@ -1,6 +1,6 @@
 ---
 title: StoryWeaver
-emoji: 🌍
 colorFrom: red
 colorTo: purple
 sdk: gradio
@@ -8,7 +8,534 @@ sdk_version: 6.7.0
 app_file: app.py
 pinned: false
 license: mit
-short_description: StoryWeaver
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

 ---
 title: StoryWeaver
+emoji: 📖
 colorFrom: red
 colorTo: purple
 sdk: gradio
 app_file: app.py
 pinned: false
 license: mit
+short_description: Interactive NLP story engine with evaluation and logging
 ---
+# StoryWeaver
+StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.
+This README is written for teammates who need to:
+- understand how the system is organized
+- run the app locally
+- know where to change prompts, rules, or UI
+- collect evaluation results for the report
+- debug a bad interaction without reading the whole codebase first
+## What This Repository Contains
+At a high level, the project has five responsibilities:
+1. parse player input into structured intent
+2. keep the world state consistent across turns
+3. generate the next story response and options
+4. expose the system through a Gradio UI
+5. export logs and run reproducible evaluation
+This makes the repo more than a game demo: it is also the evidence pipeline for the course deliverables.
+## Quick Start
+### 1. Install dependencies
+```bash
+pip install -r requirements.txt
+```
+### 2. Create `.env`
+Create a `.env` file in the project root:
+```env
+QWEN_API_KEY=your_api_key_here
+```
+Optional:
+```env
+STORYWEAVER_LOG_DIR=logs/interactions
+```
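The two variables above could be read at startup roughly as follows. This is a minimal sketch, not the repo's actual loader: `load_settings` is a hypothetical helper, the `logs/interactions` default is an assumption taken from the log location described in this README, and how the project actually loads `.env` is not shown in this commit.

```python
import os
from typing import Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read StoryWeaver settings from an environment mapping (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("QWEN_API_KEY", "").strip()
    return {
        # None signals that only fallback paths can run.
        "api_key": api_key or None,
        "llm_enabled": bool(api_key),
        # Assumed default, mirroring the optional value shown above.
        "log_dir": env.get("STORYWEAVER_LOG_DIR", "logs/interactions"),
    }
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without touching the real environment.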
+### 3. Run the app
+```bash
+python app.py
+```
+Default local URL:
+- `http://localhost:7860`
+### 4. Run evaluation
+```bash
+python evaluation/run_evaluations.py --task all --repeats 3
+```
+Useful variants:
+```bash
+python evaluation/run_evaluations.py --task intent
+python evaluation/run_evaluations.py --task consistency
+python evaluation/run_evaluations.py --task latency --repeats 5
+python evaluation/run_evaluations.py --task branch
+```
+## Recommended Reading Order
+If you are new to the repo, read files in this order:
+1. [state_manager.py](./state_manager.py)
+   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
+2. [nlu_engine.py](./nlu_engine.py)
+   Why: this shows how raw player text becomes structured intent.
+3. [story_engine.py](./story_engine.py)
+   Why: this is the main generation pipeline and fallback logic.
+4. [app.py](./app.py)
+   Why: this connects the UI with the engines and now also writes interaction logs.
+5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
+   Why: this shows how we measure the system for the report.
+If you only have 10 minutes, start with:
+- `GameState.pre_validate_action`
+- `GameState.check_consistency`
+- `GameState.apply_changes`
+- `NLUEngine.parse_intent`
+- `StoryEngine.generate_story_stream`
+- `process_user_input` in [app.py](./app.py)
+## Repository Map
+```text
+StoryWeaver/
+|-- app.py
+|-- nlu_engine.py
+|-- story_engine.py
+|-- state_manager.py
+|-- telemetry.py
+|-- utils.py
+|-- requirements.txt
+|-- evaluation/
+|   |-- run_evaluations.py
+|   |-- datasets/
+|   `-- results/
+`-- logs/
+    `-- interactions/
+```
+Core responsibilities by file:
+- [app.py](./app.py)
+  Gradio app, session lifecycle, UI callbacks, per-turn logging.
+- [state_manager.py](./state_manager.py)
+  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
+- [nlu_engine.py](./nlu_engine.py)
+  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
+- [story_engine.py](./story_engine.py)
+  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
+- [telemetry.py](./telemetry.py)
+  Session metadata and JSONL interaction log export.
+- [utils.py](./utils.py)
+  API client setup, Qwen calls, JSON extraction, retry helpers.
+- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
+  Reproducible experiment runner for the report.
+## System Architecture
+The main runtime path is:
+`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`
+There are two ideas that matter most in this codebase:
+### 1. `GameState` is the source of truth
+Almost everything meaningful lives in [state_manager.py](./state_manager.py):
+- player stats
+- location
+- time and weather
+- inventory and equipment
+- quests
+- NPC states
+- event history
+When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.
+### 2. The app is a coordinator, not the game logic
+[app.py](./app.py) should mostly:
+- receive user input
+- call NLU
+- call the story engine
+- update the chat UI
+- write telemetry logs
+If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.
+## Runtime Flow
+### Text input flow
+For normal text input, the path is:
+1. `process_user_input` receives raw text from the UI
+2. `NLUEngine.parse_intent` converts it into a structured intent dict
+3. `GameState.pre_validate_action` blocks clearly invalid actions early
+4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
+5. `GameState.check_consistency` and `apply_changes` update state
+6. UI is refreshed with story text, options, and status panel
+7. `_record_interaction_log` writes a JSONL record to disk
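The seven steps above can be sketched with stub stand-ins. `StubNLU`, `StubState`, `StubStory`, and `run_turn` are hypothetical names for illustration only; the real `generate_story_stream` yields streaming events instead of returning a single dict, and the real log record carries more fields.

```python
from time import perf_counter


class StubNLU:
    """Hypothetical stand-in for NLUEngine: keyword lookup only."""

    def parse_intent(self, text: str) -> dict:
        intent = "TALK" if "talk" in text.lower() else "EXPLORE"
        return {"intent": intent, "target": None, "raw_input": text,
                "parser_source": "keyword"}


class StubState:
    """Hypothetical stand-in for GameState."""

    def __init__(self) -> None:
        self.turn = 0

    def pre_validate_action(self, intent: dict) -> tuple[bool, str]:
        # Reject empty input early, before any generation happens.
        if not intent["raw_input"].strip():
            return False, "empty action"
        return True, ""

    def apply_changes(self, changes: dict) -> list[str]:
        self.turn += changes.get("turn", 0)
        return [f"turn -> {self.turn}"]


class StubStory:
    """Hypothetical stand-in for StoryEngine (non-streaming)."""

    def generate(self, intent: dict) -> dict:
        return {"story_text": f"You choose to {intent['intent'].lower()}.",
                "state_changes": {"turn": 1}}


def run_turn(text: str, nlu: StubNLU, state: StubState,
             story: StubStory, log: list) -> str:
    started = perf_counter()
    intent = nlu.parse_intent(text)                   # steps 1-2
    ok, reason = state.pre_validate_action(intent)    # step 3
    if not ok:
        output, changes = f"Action rejected: {reason}", {}
    else:
        result = story.generate(intent)               # step 4
        changes = result["state_changes"]
        state.apply_changes(changes)                  # step 5
        output = result["story_text"]                 # step 6
    log.append({                                      # step 7
        "input_source": "text_input",
        "user_input": text,
        "nlu_result": intent,
        "state_changes": changes,
        "output_text": output,
        "latency_ms": round((perf_counter() - started) * 1000, 2),
    })
    return output
```

Note that a rejected action is still logged, which matches how the app records pre-validation rejections in this commit.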
+### Option click flow
+Button clicks do not go through full free-text parsing. Instead:
+1. the selected option is converted to an intent-like dict by `_build_option_intent`, with `parser_source` set to `option_click`
+2. the story engine handles the option through `process_option_selection_stream`
+3. the result is rendered and logged
+This is useful because option interactions and free-text interactions now share the same evaluation and observability format.
+## Main Modules in More Detail
+### `state_manager.py`
+This file defines:
+- `PlayerState`
+- `WorldState`
+- `GameEvent`
+- `GameState`
+Important methods:
+- `pre_validate_action`
+  Rejects obviously invalid actions before calling the model.
+- `check_consistency`
+  Detects contradictions in proposed state changes.
+- `apply_changes`
+  Applies state changes and returns a readable change log.
+- `validate`
+  Makes sure the resulting state is legal.
+- `to_prompt`
+  Serializes the current game state into prompt-ready text.
+When to edit this file:
+- adding new items, NPCs, quests, or locations
+- adding deterministic rules
+- improving consistency checks
+- changing state serialization for prompts
+### `nlu_engine.py`
+This file is responsible for intent recognition.
+Current behavior:
+- try LLM parsing first
+- fall back to keyword rules if parsing fails
+- return a normalized intent dict with `parser_source`
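The LLM-first, keyword-fallback behavior can be sketched as below. This is an illustration under assumptions, not the code in `nlu_engine.py`: `KEYWORD_RULES`, `keyword_fallback`, the `llm_parse` callable, and the `"keyword"` source label are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical keyword table; the real rules live in nlu_engine.py.
KEYWORD_RULES = {
    "attack": "ATTACK",
    "talk": "TALK",
    "go to": "MOVE",
    "rest": "REST",
}


def keyword_fallback(text: str) -> dict:
    """Return the label of the first matching keyword, else CUSTOM."""
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES.items():
        if keyword in lowered:
            return {"intent": label, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "parser_source": "keyword"}


def parse_intent(text: str,
                 llm_parse: Optional[Callable[[str], dict]] = None) -> dict:
    """Try the LLM parser first; fall back to keyword rules on any failure."""
    result = None
    if llm_parse is not None:
        try:
            candidate = llm_parse(text)
            if isinstance(candidate, dict) and candidate.get("intent"):
                result = {"intent": candidate["intent"], "parser_source": "llm"}
        except Exception:
            result = None  # treat parser errors like a missing result
    if result is None:
        result = keyword_fallback(text)
    result["raw_input"] = text
    return result
```

Recording `parser_source` on every result is what lets the evaluation report a parser source breakdown later.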
+Current intent labels include:
+- `ATTACK`
+- `TALK`
+- `MOVE`
+- `EXPLORE`
+- `USE_ITEM`
+- `TRADE`
+- `EQUIP`
+- `REST`
+- `QUEST`
+- `SKILL`
+- `PICKUP`
+- `FLEE`
+- `CUSTOM`
+When to edit this file:
+- adding a new intent type
+- improving keyword fallback
+- adding target extraction logic
+- improving low-confidence handling
+### `story_engine.py`
+This is the main generation module.
+It currently handles:
+- opening generation
+- story generation for each turn
+- streaming and non-streaming paths
+- default/fallback outputs
+- consistency-aware regeneration
+- response telemetry such as fallback reason and engine mode
+Important methods:
+- `generate_opening_stream`
+- `generate_story`
+- `generate_story_stream`
+- `process_option_selection_stream`
+- `_fallback_response`
+When to edit this file:
+- changing prompts
+- changing multi-stage generation logic
+- changing fallback behavior
+- adding generation-side telemetry
+### `app.py`
+This file is the UI entry point and interaction orchestrator.
+Important responsibilities:
+- create a new game session
+- start and restart the app session
+- process text input
+- process option clicks
+- update Gradio components
+- write structured interaction logs
+When to edit this file:
+- changing UI flow
+- adding debug panels
+- changing how logs are written
+- changing how outputs are displayed
+### `telemetry.py`
+This file handles structured log export.
+It is intentionally simple and file-based:
+- one session gets one JSONL file
+- one turn becomes one JSON object line
+This is useful for:
+- report case studies
+- measuring fallback rate
+- debugging weird turns
+- collecting examples for later evaluation
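The one-file-per-session, one-line-per-turn layout can be sketched as follows. `append_turn` and `read_turns` are hypothetical helpers, and the `<session_id>.jsonl` file naming is an assumption; the repo's own functions are `append_turn_log` and `create_session_metadata`, whose bodies are not shown in this part of the diff.

```python
import json
from datetime import datetime
from pathlib import Path


def append_turn(log_dir: Path, session_id: str, record: dict) -> Path:
    """Append one turn as a single JSON line to the session's file."""
    log_dir.mkdir(parents=True, exist_ok=True)
    path = log_dir / f"{session_id}.jsonl"
    line = {
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "session_id": session_id,
        **record,
    }
    with path.open("a", encoding="utf-8") as handle:
        # ensure_ascii=False keeps Chinese story text readable in the file.
        handle.write(json.dumps(line, ensure_ascii=False) + "\n")
    return path


def read_turns(path: Path) -> list[dict]:
    """Load every turn of one session back into memory."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

Appending line by line means a crash mid-session loses at most the turn being written, and each line can be parsed on its own.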
+## Logging and Observability
+Interaction logs are written under:
+- [logs/interactions](./logs/interactions)
+Each turn record includes at least:
+- input source
+- user input
+- NLU result
+- latency
+- fallback metadata
+- state changes
+- consistency issues
+- final output text
+- post-turn state snapshot
+Example shape:
+```json
+{
+  "timestamp": "2026-03-14T18:55:00",
+  "session_id": "sw-20260314-185500-ab12cd34",
+  "turn_index": 3,
+  "input_source": "text_input",
+  "user_input": "和村长老伯谈谈最近森林里的怪事",
+  "nlu_result": {
+    "intent": "TALK",
+    "target": "村长老伯",
+    "parser_source": "llm"
+  },
+  "latency_ms": 842.13,
+  "used_fallback": false,
+  "state_changes": {},
+  "output_text": "...",
+  "post_turn_snapshot": {
+    "location": "村庄广场"
+  }
+}
+```
+If you need to debug a bad interaction, the fastest path is:
+1. check the log file
+2. inspect `nlu_result`
+3. inspect `used_fallback` and `fallback_reason`
+4. inspect `state_changes`
+5. inspect the post-turn snapshot
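The checklist above can be partly automated. The sketch below assumes the record fields shown in the example shape (`turn_index`, `used_fallback`) plus `consistency_issues` from the turn record; `load_turns` and `summarize` are hypothetical helper names, not part of the repo.

```python
import json
from pathlib import Path


def load_turns(path: Path) -> list[dict]:
    """Read one session's JSONL file into a list of turn records."""
    with path.open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize(turns: list[dict]) -> dict:
    """Point at the turns worth inspecting first."""
    fallback = [t for t in turns if t.get("used_fallback")]
    flagged = [t for t in turns if t.get("consistency_issues")]
    return {
        "turns": len(turns),
        "fallback_rate": len(fallback) / len(turns) if turns else 0.0,
        "fallback_turns": [t.get("turn_index") for t in fallback],
        "flagged_turns": [t.get("turn_index") for t in flagged],
    }
```

Running `summarize(load_turns(path))` on a session file gives a quick list of turn indices to open by hand.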
+## Evaluation Pipeline
+Evaluation entry point:
+- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
+Datasets:
+- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
+- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
+- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
+- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)
+Results:
+- [evaluation/results](./evaluation/results)
+### What each task measures
+#### Intent
+- labeled input -> predicted intent
+- optional target matching
+- parser source breakdown
+- per-example latency
+#### Consistency
+- action guard correctness via `pre_validate_action`
+- contradiction detection via `check_consistency`
+#### Latency
+- NLU latency
+- generation latency
+- total latency
+- fallback rate
+#### Branch divergence
+- same start state, different choices
+- compare resulting story text
+- compare option differences
+- compare state snapshot differences
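Comparing state snapshots between two branches can be sketched as a recursive dict diff. `diff_snapshots` is a hypothetical helper; how `run_evaluations.py` actually computes divergence is not shown in this part of the diff.

```python
def diff_snapshots(a: dict, b: dict, prefix: str = "") -> dict:
    """Return {dotted_key: (value_in_a, value_in_b)} for every differing leaf."""
    changes = {}
    for key in sorted(set(a) | set(b)):
        path = f"{prefix}{key}"
        left, right = a.get(key), b.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as the player block.
            changes.update(diff_snapshots(left, right, prefix=f"{path}."))
        elif left != right:
            changes[path] = (left, right)
    return changes
```

The number of differing keys gives a rough, model-independent signal of how far two choices pushed the game state apart.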
+## Common Development Tasks
+### Add a new intent
+You will usually need to touch:
+- [nlu_engine.py](./nlu_engine.py)
+- [state_manager.py](./state_manager.py)
+- [story_engine.py](./story_engine.py)
+- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
+Suggested checklist:
+1. add the label to the NLU logic
+2. decide whether it needs pre-validation
+3. make sure story prompts know how to handle it
+4. add at least a few evaluation examples
+### Add a new location, NPC, quest, or item
+Most of the time you only need:
+- [state_manager.py](./state_manager.py)
+That file contains the initial world setup and registry-style data.
+### Add more evaluation cases
+Edit files under:
+- [evaluation/datasets](./evaluation/datasets)
+This is the easiest way to improve the report without changing runtime logic.
+### Investigate a strange game turn
+Check in this order:
+1. interaction log under `logs/interactions`
+2. `parser_source` in the NLU result
+3. `telemetry` in the final story result
+4. whether `pre_validate_action` rejected or allowed the turn
+5. whether `check_consistency` flagged anything
+### Change UI behavior without touching gameplay
+Edit:
+- [app.py](./app.py)
+Try not to put game rules in the UI layer.
+## Environment Notes
+### If `QWEN_API_KEY` is missing
+- warning logs will appear
+- some paths will still run through fallback logic
+- evaluation can still execute, but model-quality conclusions are not meaningful
+### If `openai` is not installed
+- the repo can still import in some cases because the client is lazily initialized
+- full Qwen generation will not work
+- evaluation scripts will mostly reflect fallback behavior
+### If `gradio` is not installed
+- the app cannot launch
+- offline evaluation scripts can still be useful
+## Current Known Limitations
+These are the main gaps we still know about:
+- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
+- combat and trade are still more prompt-driven than rule-driven
+- branch divergence is much more meaningful with a real model than in fallback-only mode
+- evaluation quality depends on whether the real model environment is available
+## Suggested Team Workflow
+If multiple teammates are working in parallel, this split is usually clean:
+- gameplay/state teammate
+  Focus on [state_manager.py](./state_manager.py)
+- prompt/generation teammate
+  Focus on [story_engine.py](./story_engine.py)
+- NLU/evaluation teammate
+  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
+- UI/demo teammate
+  Focus on [app.py](./app.py)
+- report teammate
+  Focus on `evaluation/results`, `logs/interactions`, and case-study collection
+## What To Use in the Final Report
+For the course report, the most useful artifacts from this repo are:
+- evaluation JSON outputs under `evaluation/results`
+- interaction logs under `logs/interactions`
+- dataset files under `evaluation/datasets`
+- readable state transitions from `change_log`
+- fallback metadata from `telemetry`
+These can directly support:
+- experiment setup
+- metric definition
+- result tables
+- success cases
+- failure case analysis
+## License
+MIT

app.py CHANGED

@@ -13,14 +13,17 @@ app.py - StoryWeaver Gradio interactive UI
   Gradio UI  ←  state manager (validate + update)  ←  story engine (text + options)
 """
-import json
-import logging
-import gradio as gr
-from state_manager import GameState
-from nlu_engine import NLUEngine
-from story_engine import StoryEngine
-from utils import logger
 # ============================================================
 # Global game instance (independent per session)
@@ -30,18 +33,133 @@ from utils import logger
 # Define the factory function here first
-def create_new_game(player_name: str = "旅人") -> dict:
-    """Create a new game instance and return a dict containing all engines."""
-    game_state = GameState(player_name=player_name)
-    nlu = NLUEngine(game_state)
-    story = StoryEngine(game_state)
     return {
-        "game_state": game_state,
-        "nlu": nlu,
-        "story": story,
-        "current_options": [],
-        "started": False,
-    }
 def restart_game() -> tuple:
@@ -96,7 +214,8 @@ def start_game(player_name: str, game_session: dict):
     )
     # Stream the opening (options are extracted from the final event only after the stream ends; they are not parsed mid-stream)
-    story_text = ""
     final_result = None
     for update in game_session["story"].generate_opening_stream():
@@ -110,7 +229,9 @@ def start_game(player_name: str, game_session: dict):
                 gr.update(interactive=False),
             )
         elif update["type"] == "final":
-            final_result = update
     # ★ Extract options from final_result only after the stream has fully ended
     if final_result:
@@ -123,13 +244,36 @@ def start_game(player_name: str, game_session: dict):
     options = _ensure_min_options(options, 3)
     # Final yield: show full text + options + enable buttons
-    game_session["current_options"] = options
-    options_text = _format_options(options)
-    full_message = f"{story_text}\n\n{options_text}"
-    chat_history[-1]["content"] = full_message
-    status_text = _format_status_panel(game_session["game_state"])
-    btn_updates = _get_button_updates(options)
     yield (
         chat_history, status_text,
@@ -167,9 +311,10 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
         )
         return
-    gs: GameState = game_session["game_state"]
-    nlu: NLUEngine = game_session["nlu"]
-    story: StoryEngine = game_session["story"]
     # Check whether the game has already ended
     if gs.is_game_over():
@@ -185,8 +330,10 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
         )
         return
-    # 1. NLU parsing
-    intent = nlu.parse_intent(user_input)
     # 1.5 Pre-validation: immediately reject actions that violate consistency (no LLM call, no turn consumed)
     is_valid, rejection_msg = gs.pre_validate_action(intent)
@@ -199,8 +346,31 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
             f"⚠️ **行动被驳回**：{rejection_msg}\n\n"
             f"请重新选择行动，或输入其他指令。\n\n{options_text}"
         )
-        chat_history.append({"role": "assistant", "content": rejection_content})
-        btn_updates = _get_button_updates(options)
         yield (
             chat_history,
             _format_status_panel(gs),
@@ -223,18 +393,21 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
     )
     # 3. Stream story generation
-    final_result = None
-    for update in story.generate_story_stream(intent):
-        if update["type"] == "story_chunk":
-            chat_history[-1]["content"] = update["text"]
-            yield (
-                chat_history,
                 _format_status_panel(gs),
                 loading[0], loading[1], loading[2],
                 game_session,
             )
-        elif update["type"] == "final":
-            final_result = update
     # 4. Final update: full text + state changes + options + buttons
     if final_result:
@@ -256,13 +429,24 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
         options_text = _format_options(options)
         full_message = f"{final_result['story_text']}{log_text}{issues_text}\n\n{options_text}"
-        chat_history[-1]["content"] = full_message
-        status_text = _format_status_panel(gs)
-        btn_updates = _get_button_updates(options)
-        yield (
-            chat_history,
             status_text,
             btn_updates[0], btn_updates[1], btn_updates[2],
             game_session,
@@ -272,14 +456,37 @@ def process_user_input(user_input: str, chat_history: list, game_session: dict):
         logger.warning("流式生成未产生 final 事件，使用兜底文本")
         fallback_text = "你环顾四周，思考着接下来该做什么..."
         fallback_options = _ensure_min_options([], 3)
-        game_session["current_options"] = fallback_options
-        options_text = _format_options(fallback_options)
-        full_message = f"{fallback_text}\n\n{options_text}"
-        chat_history[-1]["content"] = full_message
-        status_text = _format_status_panel(gs)
-        btn_updates = _get_button_updates(fallback_options)
         yield (
             chat_history,
@@ -318,9 +525,11 @@ def process_option_click(option_idx: int, chat_history: list, game_session: dict
         )
         return
-    selected_option = options[option_idx]
-    gs: GameState = game_session["game_state"]
-    story: StoryEngine = game_session["story"]
     # Check for the special option: restart
     if selected_option.get("action_type") == "RESTART":
@@ -404,18 +613,21 @@ def process_option_click(option_idx: int, chat_history: list, game_session: dict
         game_session,
     )
-    final_result = None
-    for update in story.process_option_selection_stream(selected_option):
-        if update["type"] == "story_chunk":
-            chat_history[-1]["content"] = update["text"]
-            yield (
-                chat_history,
                 _format_status_panel(gs),
                 loading[0], loading[1], loading[2],
                 game_session,
             )
-        elif update["type"] == "final":
-            final_result = update
     if final_result:
         # ★ Safety fallback: force exactly 3 options
@@ -430,13 +642,24 @@ def process_option_click(option_idx: int, chat_history: list, game_session: dict
         options_text = _format_options(options)
         full_message = f"{final_result['story_text']}{log_text}\n\n{options_text}"
-        chat_history[-1]["content"] = full_message
-        status_text = _format_status_panel(gs)
-        btn_updates = _get_button_updates(options)
-        yield (
-            chat_history, status_text,
             btn_updates[0], btn_updates[1], btn_updates[2],
             game_session,
         )
@@ -445,14 +668,37 @@ def process_option_click(option_idx: int, chat_history: list, game_session: dict
         logger.warning("[选项点击] 流式生成未产生 final 事件，使用兜底文本")
         fallback_text = "你环顾四周，思考着接下来该做什么..."
         fallback_options = _ensure_min_options([], 3)
-        game_session["current_options"] = fallback_options
-        options_text = _format_options(fallback_options)
-        full_message = f"{fallback_text}\n\n{options_text}"
-        chat_history[-1]["content"] = full_message
-        status_text = _format_status_panel(gs)
-        btn_updates = _get_button_updates(fallback_options)
         yield (
             chat_history, status_text,

   Gradio UI  ←  state manager (validate + update)  ←  story engine (text + options)
 """
+import copy
+import json
+import logging
+from time import perf_counter
+import gradio as gr
+from state_manager import GameState
+from nlu_engine import NLUEngine
+from story_engine import StoryEngine
+from telemetry import append_turn_log, create_session_metadata
+from utils import logger
 # ============================================================
 # Global game instance (independent per session)
 # Define the factory function here first
+def create_new_game(player_name: str = "旅人") -> dict:
+    """Create a new game instance and return a dict containing all engines."""
+    game_state = GameState(player_name=player_name)
+    nlu = NLUEngine(game_state)
+    story = StoryEngine(game_state)
     return {
+        "game_state": game_state,
+        "nlu": nlu,
+        "story": story,
+        "current_options": [],
+        "started": False,
+        **create_session_metadata(),
+    }
+def _json_safe(value):
+    """Convert nested values into JSON-serializable data for logs."""
+    if value is None or isinstance(value, (str, int, float, bool)):
+        return value
+    if isinstance(value, dict):
+        return {str(key): _json_safe(val) for key, val in value.items()}
+    if isinstance(value, (list, tuple, set)):
+        return [_json_safe(item) for item in value]
+    if hasattr(value, "model_dump"):
+        return _json_safe(value.model_dump())
+    return str(value)
+def _build_state_snapshot(gs: GameState) -> dict:
+    """Build a compact state snapshot for reproducible evaluation logs."""
+    active_quests = []
+    for quest in gs.world.quests.values():
+        if quest.status == "active":
+            active_quests.append(
+                {
+                    "quest_id": quest.quest_id,
+                    "title": quest.title,
+                    "status": quest.status,
+                    "objectives": _json_safe(quest.objectives),
+                }
+            )
+    return {
+        "turn": gs.turn,
+        "game_mode": gs.game_mode,
+        "location": gs.player.location,
+        "scene": gs.world.current_scene,
+        "day": gs.world.day_count,
+        "time_of_day": gs.world.time_of_day,
+        "weather": gs.world.weather,
+        "player": {
+            "name": gs.player.name,
+            "level": gs.player.level,
+            "hp": gs.player.hp,
+            "max_hp": gs.player.max_hp,
+            "mp": gs.player.mp,
+            "max_mp": gs.player.max_mp,
+            "gold": gs.player.gold,
+            "morale": gs.player.morale,
+            "sanity": gs.player.sanity,
+            "hunger": gs.player.hunger,
+            "karma": gs.player.karma,
+            "inventory": list(gs.player.inventory),
+            "equipment": copy.deepcopy(gs.player.equipment),
+            "skills": list(gs.player.skills),
+            "status_effects": [effect.name for effect in gs.player.status_effects],
+        },
+        "active_quests": active_quests,
+        "event_log_size": len(gs.event_log),
+    }
+def _record_interaction_log(
+    game_session: dict,
+    *,
+    input_source: str,
+    user_input: str,
+    intent_result: dict | None,
+    output_text: str,
+    latency_ms: float,
+    nlu_latency_ms: float | None = None,
+    generation_latency_ms: float | None = None,
+    final_result: dict | None = None,
+    selected_option: dict | None = None,
+):
+    """Append a structured interaction log without affecting gameplay."""
+    if not game_session or "game_state" not in game_session:
+        return
+    final_result = final_result or {}
+    telemetry = _json_safe(final_result.get("telemetry", {})) or {}
+    record = {
+        "input_source": input_source,
+        "user_input": user_input,
+        "selected_option": _json_safe(selected_option),
+        "nlu_result": _json_safe(intent_result),
+        "latency_ms": round(latency_ms, 2),
+        "nlu_latency_ms": None if nlu_latency_ms is None else round(nlu_latency_ms, 2),
+        "generation_latency_ms": None if generation_latency_ms is None else round(generation_latency_ms, 2),
+        "used_fallback": bool(telemetry.get("used_fallback", False)),
+        "fallback_reason": telemetry.get("fallback_reason"),
+        "engine_mode": telemetry.get("engine_mode"),
+        "state_changes": _json_safe(final_result.get("state_changes", {})),
+        "change_log": _json_safe(final_result.get("change_log", [])),
+        "consistency_issues": _json_safe(final_result.get("consistency_issues", [])),
+        "output_text": output_text,
+        "story_text": final_result.get("story_text"),
+        "options": _json_safe(final_result.get("options", game_session.get("current_options", []))),
+        "post_turn_snapshot": _build_state_snapshot(game_session["game_state"]),
+    }
+    try:
+        append_turn_log(game_session, record)
+    except Exception as exc:
+        logger.warning(f"Failed to append interaction log: {exc}")
+def _build_option_intent(selected_option: dict) -> dict:
+    """Represent button clicks in the same schema as free-text NLU output."""
+    option_text = selected_option.get("text", "")
+    return {
+        "intent": selected_option.get("action_type", "EXPLORE"),
+        "target": None,
+        "details": option_text,
+        "raw_input": option_text,
+        "parser_source": "option_click",
+    }
 def restart_game() -> tuple:
     )
     # Stream the opening (options are extracted from the final event only after the stream ends; they are not parsed mid-stream)
+    turn_started = perf_counter()
+    story_text = ""
     final_result = None
     for update in game_session["story"].generate_opening_stream():
                 gr.update(interactive=False),
             )
         elif update["type"] == "final":
+            final_result = update
+    generation_latency_ms = (perf_counter() - turn_started) * 1000
     # ★ Extract options from final_result only after the stream has fully ended
     if final_result:
     options = _ensure_min_options(options, 3)
     # Final yield: show full text + options + enable buttons
+    game_session["current_options"] = options
+    options_text = _format_options(options)
+    full_message = f"{story_text}\n\n{options_text}"
+    if not final_result:
+        final_result = {
+            "story_text": story_text,
+            "options": options,
+            "state_changes": {},
+            "change_log": [],
+            "consistency_issues": [],
+            "telemetry": {
+                "engine_mode": "opening_app",
+                "used_fallback": True,
+                "fallback_reason": "missing_final_event",
+            },
+        }
+    chat_history[-1]["content"] = full_message
+    status_text = _format_status_panel(game_session["game_state"])
+    btn_updates = _get_button_updates(options)
+    _record_interaction_log(
+        game_session,
+        input_source="system_opening",
+        user_input="",
+        intent_result=None,
+        output_text=full_message,
+        latency_ms=generation_latency_ms,
+        generation_latency_ms=generation_latency_ms,
+        final_result=final_result,
+    )
     yield (
         chat_history, status_text,
         )
         return
+    gs: GameState = game_session["game_state"]
+    nlu: NLUEngine = game_session["nlu"]
+    story: StoryEngine = game_session["story"]
+    turn_started = perf_counter()
     # Check whether the game has already ended
     if gs.is_game_over():
         )
         return
+    # 1. NLU parsing
+    nlu_started = perf_counter()
+    intent = nlu.parse_intent(user_input)
+    nlu_latency_ms = (perf_counter() - nlu_started) * 1000
     # 1.5 Pre-validation: immediately reject actions that violate consistency (no LLM call, no turn consumed)
     is_valid, rejection_msg = gs.pre_validate_action(intent)
             f"⚠️ **行动被驳回**：{rejection_msg}\n\n"
             f"请重新选择行动，或输入其他指令。\n\n{options_text}"
         )
+        chat_history.append({"role": "assistant", "content": rejection_content})
+        rejection_result = {
+            "story_text": rejection_content,
+            "options": options,
+            "state_changes": {},
+            "change_log": [],
+            "consistency_issues": [],
+            "telemetry": {
+                "engine_mode": "pre_validation",
+                "used_fallback": False,
+                "fallback_reason": None,
+            },
+        }
+        _record_interaction_log(
+            game_session,
+            input_source="text_input",
+            user_input=user_input,
+            intent_result=intent,
+            output_text=rejection_content,
+            latency_ms=(perf_counter() - turn_started) * 1000,
+            nlu_latency_ms=nlu_latency_ms,
+            generation_latency_ms=0.0,
+            final_result=rejection_result,
+        )
+        btn_updates = _get_button_updates(options)
         yield (
             chat_history,
             _format_status_panel(gs),
     )
     # 3. Stream story generation
+    generation_started = perf_counter()
+    final_result = None
+    for update in story.generate_story_stream(intent):
+        if update["type"] == "story_chunk":
+            chat_history[-1]["content"] = update["text"]
+            yield (
+                chat_history,
                 _format_status_panel(gs),
                 loading[0], loading[1], loading[2],
                 game_session,
             )
+        elif update["type"] == "final":
+            final_result = update
+    generation_latency_ms = (perf_counter() - generation_started) * 1000
     # 4. Final update: full text + state changes + options + buttons
     if final_result:
         options_text = _format_options(options)
         full_message = f"{final_result['story_text']}{log_text}{issues_text}\n\n{options_text}"
+        chat_history[-1]["content"] = full_message
+        status_text = _format_status_panel(gs)
+        btn_updates = _get_button_updates(options)
+        _record_interaction_log(
+            game_session,
+            input_source="text_input",
+            user_input=user_input,
+            intent_result=intent,
+            output_text=full_message,
+            latency_ms=(perf_counter() - turn_started) * 1000,
+            nlu_latency_ms=nlu_latency_ms,
+            generation_latency_ms=generation_latency_ms,
+            final_result=final_result,
+        )
+        yield (
+            chat_history,
             status_text,
             btn_updates[0], btn_updates[1], btn_updates[2],
             game_session,
         logger.warning("Streaming generation produced no final event; using fallback text")
         fallback_text = "你环顾四周，思考着接下来该做什么..."
         fallback_options = _ensure_min_options([], 3)
+        game_session["current_options"] = fallback_options
+        options_text = _format_options(fallback_options)
+        full_message = f"{fallback_text}\n\n{options_text}"
+        fallback_result = {
+            "story_text": fallback_text,
+            "options": fallback_options,
+            "state_changes": {},
+            "change_log": [],
+            "consistency_issues": [],
+            "telemetry": {
+                "engine_mode": "app_fallback",
+                "used_fallback": True,
+                "fallback_reason": "missing_final_event",
+            },
+        }
+        chat_history[-1]["content"] = full_message
+        status_text = _format_status_panel(gs)
+        btn_updates = _get_button_updates(fallback_options)
+        _record_interaction_log(
+            game_session,
+            input_source="text_input",
+            user_input=user_input,
+            intent_result=intent,
+            output_text=full_message,
+            latency_ms=(perf_counter() - turn_started) * 1000,
+            nlu_latency_ms=nlu_latency_ms,
+            generation_latency_ms=generation_latency_ms,
+            final_result=fallback_result,
+        )
         yield (
             chat_history,
         )
         return
+    selected_option = options[option_idx]
+    gs: GameState = game_session["game_state"]
+    story: StoryEngine = game_session["story"]
+    option_intent = _build_option_intent(selected_option)
+    turn_started = perf_counter()
     # Check for the special option: restart
     if selected_option.get("action_type") == "RESTART":
         game_session,
     )
+    generation_started = perf_counter()
+    final_result = None
+    for update in story.process_option_selection_stream(selected_option):
+        if update["type"] == "story_chunk":
+            chat_history[-1]["content"] = update["text"]
+            yield (
+                chat_history,
                 _format_status_panel(gs),
                 loading[0], loading[1], loading[2],
                 game_session,
             )
+        elif update["type"] == "final":
+            final_result = update
+    generation_latency_ms = (perf_counter() - generation_started) * 1000
     if final_result:
         # ★ Safety fallback: force exactly 3 options
         options_text = _format_options(options)
         full_message = f"{final_result['story_text']}{log_text}\n\n{options_text}"
+        chat_history[-1]["content"] = full_message
+        status_text = _format_status_panel(gs)
+        btn_updates = _get_button_updates(options)
+        _record_interaction_log(
+            game_session,
+            input_source="option_click",
+            user_input=selected_option.get("text", ""),
+            intent_result=option_intent,
+            output_text=full_message,
+            latency_ms=(perf_counter() - turn_started) * 1000,
+            generation_latency_ms=generation_latency_ms,
+            final_result=final_result,
+            selected_option=selected_option,
+        )
+        yield (
+            chat_history, status_text,
             btn_updates[0], btn_updates[1], btn_updates[2],
             game_session,
         )
         logger.warning("[option click] Streaming generation produced no final event; using fallback text")
         fallback_text = "你环顾四周，思考着接下来该做什么..."
         fallback_options = _ensure_min_options([], 3)
+        game_session["current_options"] = fallback_options
+        options_text = _format_options(fallback_options)
+        full_message = f"{fallback_text}\n\n{options_text}"
+        fallback_result = {
+            "story_text": fallback_text,
+            "options": fallback_options,
+            "state_changes": {},
+            "change_log": [],
+            "consistency_issues": [],
+            "telemetry": {
+                "engine_mode": "app_fallback",
+                "used_fallback": True,
+                "fallback_reason": "missing_final_event",
+            },
+        }
+        chat_history[-1]["content"] = full_message
+        status_text = _format_status_panel(gs)
+        btn_updates = _get_button_updates(fallback_options)
+        _record_interaction_log(
+            game_session,
+            input_source="option_click",
+            user_input=selected_option.get("text", ""),
+            intent_result=option_intent,
+            output_text=full_message,
+            latency_ms=(perf_counter() - turn_started) * 1000,
+            generation_latency_ms=generation_latency_ms,
+            final_result=fallback_result,
+            selected_option=selected_option,
+        )
         yield (
             chat_history, status_text,

evaluation/datasets/branch_divergence.json ADDED

	@@ -0,0 +1,84 @@

+[
+  {
+    "id": "branch_001_village_square",
+    "setup": {
+      "player": {
+        "location": "村庄广场",
+        "inventory": ["面包", "小型治疗药水"]
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    },
+    "branches": [
+      {
+        "label": "talk_elder",
+        "input": "和村长老伯谈谈最近森林里的怪事"
+      },
+      {
+        "label": "go_inn",
+        "input": "前往村庄旅店"
+      },
+      {
+        "label": "explore_square",
+        "input": "探索一下村庄广场"
+      }
+    ]
+  },
+  {
+    "id": "branch_002_resource_management",
+    "setup": {
+      "player": {
+        "location": "村庄旅店",
+        "inventory": ["面包", "小型治疗药水"],
+        "hp": 58,
+        "morale": 65,
+        "sanity": 80
+      },
+      "world": {
+        "current_scene": "村庄旅店"
+      }
+    },
+    "branches": [
+      {
+        "label": "rest",
+        "input": "休息一会儿"
+      },
+      {
+        "label": "use_potion",
+        "input": "使用小型治疗药水"
+      },
+      {
+        "label": "talk_innkeeper",
+        "input": "和旅店老板娘莉娜聊聊"
+      }
+    ]
+  },
+  {
+    "id": "branch_003_roadside_choices",
+    "setup": {
+      "player": {
+        "location": "村口小路",
+        "inventory": ["面包", "小型治疗药水"],
+        "hp": 85
+      },
+      "world": {
+        "current_scene": "村口小路"
+      }
+    },
+    "branches": [
+      {
+        "label": "enter_forest",
+        "input": "前往黑暗森林入口"
+      },
+      {
+        "label": "return_square",
+        "input": "回村庄广场"
+      },
+      {
+        "label": "explore_road",
+        "input": "搜索附近有没有线索"
+      }
+    ]
+  }
+]
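
Reviewer note: a minimal sketch of how a branch group like the ones above could be scored, assuming every branch has already been run and yielded a story text plus a flat state dictionary. The `pairwise_divergence` helper and the sample outputs are hypothetical and not part of this commit. Text divergence uses `difflib.SequenceMatcher`; state divergence uses Jaccard distance over `key=value` strings.

```python
from difflib import SequenceMatcher
from itertools import combinations


def jaccard_distance(left: set[str], right: set[str]) -> float:
    """Return 1 - |A & B| / |A | B|; two empty sets count as identical."""
    union = left | right
    if not union:
        return 0.0
    return 1.0 - len(left & right) / len(union)


def pairwise_divergence(outputs: dict[str, dict]) -> list[dict]:
    """Compare every pair of branch outputs (hypothetical helper)."""
    pairs = []
    for (l_label, l_out), (r_label, r_out) in combinations(outputs.items(), 2):
        # 0.0 means identical text, 1.0 means nothing in common.
        text_div = 1.0 - SequenceMatcher(
            None, l_out["story_text"], r_out["story_text"]
        ).ratio()
        l_state = {f"{k}={v}" for k, v in l_out["state"].items()}
        r_state = {f"{k}={v}" for k, v in r_out["state"].items()}
        pairs.append(
            {
                "pair": (l_label, r_label),
                "text": text_div,
                "state": jaccard_distance(l_state, r_state),
            }
        )
    return pairs


# Made-up outputs for two branches of branch_001_village_square.
sample_outputs = {
    "talk_elder": {
        "story_text": "The elder speaks of the forest.",
        "state": {"location": "村庄广场", "hp": 100},
    },
    "go_inn": {
        "story_text": "You walk to the inn.",
        "state": {"location": "村庄旅店", "hp": 100},
    },
}
pair_scores = pairwise_divergence(sample_outputs)
```

Because all branches in a group share one `setup`, any divergence measured this way should come from the player's input rather than from the starting state.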

evaluation/datasets/consistency.json ADDED

	@@ -0,0 +1,283 @@

+{
+  "action_guard_cases": [
+    {
+      "id": "guard_001",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["面包", "小型治疗药水"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "USE_ITEM",
+        "target": "小型治疗药水",
+        "details": "喝掉药水",
+        "raw_input": "使用小型治疗药水"
+      },
+      "expected_valid": true
+    },
+    {
+      "id": "guard_002",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["面包"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "USE_ITEM",
+        "target": "火把",
+        "details": "点亮火把",
+        "raw_input": "使用火把"
+      },
+      "expected_valid": false
+    },
+    {
+      "id": "guard_003",
+      "setup": {
+        "player": {
+          "location": "村庄广场"
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "MOVE",
+        "target": "村庄旅店",
+        "details": "去旅店",
+        "raw_input": "前往村庄旅店"
+      },
+      "expected_valid": true
+    },
+    {
+      "id": "guard_004",
+      "setup": {
+        "player": {
+          "location": "村庄广场"
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "MOVE",
+        "target": "森林深处",
+        "details": "直接冲进森林深处",
+        "raw_input": "去森林深处"
+      },
+      "expected_valid": true
+    },
+    {
+      "id": "guard_005",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["铁剑", "面包"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "EQUIP",
+        "target": "铁剑",
+        "details": "装备武器",
+        "raw_input": "装备铁剑"
+      },
+      "expected_valid": true
+    },
+    {
+      "id": "guard_006",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["面包"],
+          "equipment": {
+            "weapon": "铁剑"
+          }
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "EQUIP",
+        "target": "铁剑",
+        "details": "再装备一次铁剑",
+        "raw_input": "装备铁剑"
+      },
+      "expected_valid": false
+    },
+    {
+      "id": "guard_007",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "skills": ["火球术"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "SKILL",
+        "target": "火球术",
+        "details": "施法",
+        "raw_input": "施放火球术"
+      },
+      "expected_valid": true
+    },
+    {
+      "id": "guard_008",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "skills": []
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "intent": {
+        "intent": "SKILL",
+        "target": "火球术",
+        "details": "施法",
+        "raw_input": "施放火球术"
+      },
+      "expected_valid": false
+    }
+  ],
+  "state_check_cases": [
+    {
+      "id": "state_001",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "gold": 50
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "new_location": "村庄旅店"
+      },
+      "expected_contradiction": false
+    },
+    {
+      "id": "state_002",
+      "setup": {
+        "player": {
+          "location": "村庄广场"
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "new_location": "森林深处"
+      },
+      "expected_contradiction": true,
+      "expected_contains": ["不相邻"]
+    },
+    {
+      "id": "state_003",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "gold": 50
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "gold_change": -80
+      },
+      "expected_contradiction": true,
+      "expected_contains": ["金币"]
+    },
+    {
+      "id": "state_004",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["面包"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "items_lost": ["火把"]
+      },
+      "expected_contradiction": true,
+      "expected_contains": ["未持有"]
+    },
+    {
+      "id": "state_005",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["小型治疗药水"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "items_lost": ["小型治疗药水"]
+      },
+      "expected_contradiction": false
+    },
+    {
+      "id": "state_006",
+      "setup": {
+        "player": {
+          "location": "村庄广场",
+          "inventory": ["铁剑"]
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        }
+      },
+      "proposed_changes": {
+        "items_lost": ["铁剑"]
+      },
+      "expected_contradiction": true,
+      "expected_contains": ["不是消耗品"]
+    },
+    {
+      "id": "state_007",
+      "setup": {
+        "player": {
+          "location": "村庄广场"
+        },
+        "world": {
+          "current_scene": "村庄广场"
+        },
+        "npc_overrides": {
+          "村长老伯": {
+            "is_alive": false
+          }
+        }
+      },
+      "proposed_changes": {
+        "npc_changes": {
+          "村长老伯": {
+            "attitude": "friendly"
+          }
+        }
+      },
+      "expected_contradiction": true,
+      "expected_contains": ["已经死亡"]
+    }
+  ]
+}
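
Reviewer note: a minimal sketch of how a `state_check_cases` entry can be judged, assuming the checker returns a list of human-readable issue strings. The `toy_gold_check` function is a hypothetical stand-in for the real consistency check, which is not shown here. A case passes only when the contradiction verdict matches and every `expected_contains` fragment appears in at least one reported issue.

```python
from typing import Callable


def score_state_case(case: dict, check: Callable[[dict], list[str]]) -> bool:
    """Return True when the checker's verdict matches the case's expectations."""
    issues = check(case["proposed_changes"])
    verdict_ok = bool(issues) == case["expected_contradiction"]
    fragments = case.get("expected_contains", [])
    # Every expected fragment must appear in at least one reported issue.
    fragments_ok = all(any(frag in issue for issue in issues) for frag in fragments)
    return verdict_ok and fragments_ok


def toy_gold_check(changes: dict, gold: int = 50) -> list[str]:
    """Hypothetical checker: flags spending more gold than the player holds."""
    if gold + changes.get("gold_change", 0) < 0:
        return ["金币不足"]
    return []


case_state_003 = {
    "proposed_changes": {"gold_change": -80},
    "expected_contradiction": True,
    "expected_contains": ["金币"],
}
case_affordable = {
    "proposed_changes": {"gold_change": -10},
    "expected_contradiction": False,
}
case_wrong_fragment = {
    "proposed_changes": {"gold_change": -80},
    "expected_contradiction": True,
    "expected_contains": ["不相邻"],
}

result_003 = score_state_case(case_state_003, toy_gold_check)
result_affordable = score_state_case(case_affordable, toy_gold_check)
result_wrong_fragment = score_state_case(case_wrong_fragment, toy_gold_check)
```

Matching on message fragments ties the dataset to the exact wording of the checker's messages, so rewording an issue string can fail a case even when the logic is unchanged.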

evaluation/datasets/intent_accuracy.json ADDED

	@@ -0,0 +1,201 @@

+[
+  {
+    "id": "intent_001",
+    "input": "和村长老伯谈谈最近森林里的怪事",
+    "intent": "TALK",
+    "target": "村长老伯",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_002",
+    "input": "前往村庄旅店",
+    "intent": "MOVE",
+    "target": "村庄旅店",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_003",
+    "input": "探索一下村庄广场",
+    "intent": "EXPLORE"
+  },
+  {
+    "id": "intent_004",
+    "input": "使用小型治疗药水",
+    "intent": "USE_ITEM",
+    "target": "小型治疗药水"
+  },
+  {
+    "id": "intent_005",
+    "input": "装备铁剑",
+    "intent": "EQUIP",
+    "target": "铁剑"
+  },
+  {
+    "id": "intent_006",
+    "input": "和铁匠格林交易",
+    "intent": "TRADE",
+    "target": "铁匠格林",
+    "setup": {
+      "player": {
+        "location": "村庄铁匠铺"
+      },
+      "world": {
+        "current_scene": "村庄铁匠铺"
+      }
+    }
+  },
+  {
+    "id": "intent_007",
+    "input": "休息一会儿",
+    "intent": "REST"
+  },
+  {
+    "id": "intent_008",
+    "input": "查看当前任务",
+    "intent": "QUEST"
+  },
+  {
+    "id": "intent_009",
+    "input": "施放火球术",
+    "intent": "SKILL",
+    "setup": {
+      "player": {
+        "location": "村庄广场",
+        "skills": [
+          "火球术"
+        ]
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_010",
+    "input": "拿起火把",
+    "intent": "PICKUP",
+    "target": "火把"
+  },
+  {
+    "id": "intent_011",
+    "input": "赶紧逃跑",
+    "intent": "FLEE"
+  },
+  {
+    "id": "intent_012",
+    "input": "和旅店老板娘莉娜聊聊",
+    "intent": "TALK",
+    "target": "旅店老板娘莉娜",
+    "setup": {
+      "player": {
+        "location": "村庄旅店"
+      },
+      "world": {
+        "current_scene": "村庄旅店"
+      }
+    }
+  },
+  {
+    "id": "intent_013",
+    "input": "买一瓶解毒药水",
+    "intent": "TRADE",
+    "target": "解毒药水"
+  },
+  {
+    "id": "intent_014",
+    "input": "去村口小路看看",
+    "intent": "MOVE",
+    "target": "村口小路",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_015",
+    "input": "吃一个面包",
+    "intent": "USE_ITEM",
+    "target": "面包"
+  },
+  {
+    "id": "intent_016",
+    "input": "接受这个任务",
+    "intent": "QUEST"
+  },
+  {
+    "id": "intent_017",
+    "input": "搜索附近有没有线索",
+    "intent": "EXPLORE"
+  },
+  {
+    "id": "intent_018",
+    "input": "穿上皮甲",
+    "intent": "EQUIP",
+    "target": "皮甲"
+  },
+  {
+    "id": "intent_019",
+    "input": "去村庄杂货铺买点东西",
+    "intent": "TRADE",
+    "target": "村庄杂货铺",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_020",
+    "input": "调查黑暗森林入口",
+    "intent": "EXPLORE",
+    "target": "黑暗森林入口",
+    "setup": {
+      "player": {
+        "location": "村口小路"
+      },
+      "world": {
+        "current_scene": "村口小路"
+      }
+    }
+  },
+  {
+    "id": "intent_021",
+    "input": "和村长老伯谈判",
+    "intent": "TALK",
+    "target": "村长老伯",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "intent_022",
+    "input": "我想扔石头试试看",
+    "intent": "CUSTOM"
+  }
+]
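
Reviewer note: a minimal sketch of scoring entries shaped like the ones above, assuming a parser that returns a dict with `intent` and `target` keys. The `toy_parse` function is a hypothetical keyword matcher used only to exercise the scorer; it is not the project's NLU engine. Targets are scored only for examples that specify one, and this sketch compares them exactly rather than after normalization.

```python
from collections import Counter, defaultdict
from typing import Callable


def score_intents(examples: list[dict], parse: Callable[[str], dict]) -> dict:
    """Compute intent accuracy, target accuracy, and a confusion table."""
    confusion: dict[str, Counter] = defaultdict(Counter)
    intent_hits = target_hits = target_total = 0
    for example in examples:
        predicted = parse(example["input"])
        intent_hits += int(predicted.get("intent") == example["intent"])
        confusion[example["intent"]][str(predicted.get("intent"))] += 1
        if "target" in example:  # only score targets the dataset specifies
            target_total += 1
            target_hits += int(predicted.get("target") == example["target"])
    return {
        "intent_accuracy": intent_hits / len(examples) if examples else 0.0,
        "target_accuracy": target_hits / target_total if target_total else None,
        "confusion": {key: dict(counts) for key, counts in confusion.items()},
    }


def toy_parse(text: str) -> dict:
    """Hypothetical parser: recognizes only the equip verb."""
    if text.startswith("装备"):
        return {"intent": "EQUIP", "target": text[2:]}
    return {"intent": "CUSTOM", "target": None}


examples = [
    {"id": "intent_005", "input": "装备铁剑", "intent": "EQUIP", "target": "铁剑"},
    {"id": "intent_007", "input": "休息一会儿", "intent": "REST"},
    {"id": "intent_022", "input": "我想扔石头试试看", "intent": "CUSTOM"},
]
report = score_intents(examples, toy_parse)
```

The confusion table is keyed by expected intent, so a row such as `REST -> CUSTOM` shows at a glance which intents fall through to the catch-all class.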

evaluation/datasets/latency.json ADDED

	@@ -0,0 +1,79 @@

+[
+  {
+    "id": "latency_001",
+    "input": "和村长老伯谈谈最近森林里的怪事",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "latency_002",
+    "input": "前往村庄旅店",
+    "setup": {
+      "player": {
+        "location": "村庄广场"
+      },
+      "world": {
+        "current_scene": "村庄广场"
+      }
+    }
+  },
+  {
+    "id": "latency_003",
+    "input": "使用小型治疗药水",
+    "setup": {
+      "player": {
+        "location": "村庄旅店",
+        "inventory": ["面包", "小型治疗药水"],
+        "hp": 65
+      },
+      "world": {
+        "current_scene": "村庄旅店"
+      }
+    }
+  },
+  {
+    "id": "latency_004",
+    "input": "探索一下村口小路",
+    "setup": {
+      "player": {
+        "location": "村口小路"
+      },
+      "world": {
+        "current_scene": "村口小路"
+      }
+    }
+  },
+  {
+    "id": "latency_005",
+    "input": "和铁匠格林交易",
+    "setup": {
+      "player": {
+        "location": "村庄铁匠铺"
+      },
+      "world": {
+        "current_scene": "村庄铁匠铺"
+      }
+    }
+  },
+  {
+    "id": "latency_006",
+    "input": "休息一会儿",
+    "setup": {
+      "player": {
+        "location": "村庄旅店",
+        "hp": 72,
+        "morale": 60,
+        "sanity": 82
+      },
+      "world": {
+        "current_scene": "村庄旅店"
+      }
+    }
+  }
+]
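
Reviewer note: a minimal sketch of summarizing repeated runs of one latency scenario with a mean and nearest-rank percentiles. The sample latencies are made up for illustration and are not measurements from this project. The percentile helper picks an existing sample rather than interpolating between two.

```python
import statistics


def nearest_rank_percentile(values: list[float], percentile: float) -> float:
    """Pick the sorted sample closest to the requested percentile rank."""
    if not values:
        return 0.0
    ordered = sorted(values)
    index = round((percentile / 100) * (len(ordered) - 1))
    # Clamp so out-of-range percentiles still map to a valid sample.
    index = max(0, min(len(ordered) - 1, index))
    return ordered[index]


# Made-up total latencies (ms) for five repeats of a single scenario.
sample_latencies = [120.0, 95.0, 410.0, 130.0, 101.0]
summary = {
    "avg_ms": round(statistics.mean(sample_latencies), 2),
    "p50_ms": nearest_rank_percentile(sample_latencies, 50),
    "p95_ms": nearest_rank_percentile(sample_latencies, 95),
}
```

With only a handful of repeats, the p95 value is simply the slowest run, so per-scenario p95 figures are best read alongside the repeat count.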

evaluation/run_evaluations.py ADDED

	@@ -0,0 +1,567 @@

+from __future__ import annotations
+import argparse
+import json
+import statistics
+import sys
+from collections import Counter, defaultdict
+from copy import deepcopy
+from datetime import datetime
+from difflib import SequenceMatcher
+from itertools import combinations
+from pathlib import Path
+from time import perf_counter
+from typing import Any
+PROJECT_ROOT = Path(__file__).resolve().parents[1]
+if str(PROJECT_ROOT) not in sys.path:
+    sys.path.insert(0, str(PROJECT_ROOT))
+from nlu_engine import NLUEngine
+from state_manager import GameState
+from story_engine import StoryEngine
+DATASET_DIR = PROJECT_ROOT / "evaluation" / "datasets"
+RESULTS_DIR = PROJECT_ROOT / "evaluation" / "results"
+def _json_safe(value: Any) -> Any:
+    if value is None or isinstance(value, (str, int, float, bool)):
+        return value
+    if isinstance(value, dict):
+        return {str(key): _json_safe(val) for key, val in value.items()}
+    if isinstance(value, (list, tuple, set)):
+        return [_json_safe(item) for item in value]
+    if hasattr(value, "model_dump"):
+        return _json_safe(value.model_dump())
+    return str(value)
+def _normalize_text(value: Any) -> str:
+    return str(value or "").strip().lower()
+def _load_dataset(name: str) -> Any:
+    with (DATASET_DIR / f"{name}.json").open("r", encoding="utf-8") as fh:
+        return json.load(fh)
+def _apply_setup(game_state: GameState, setup: dict[str, Any] | None) -> GameState:
+    if not setup:
+        game_state.player.location = game_state.world.current_scene
+        return game_state
+    player_setup = setup.get("player", {})
+    world_setup = setup.get("world", {})
+    for key, value in player_setup.items():
+        if key == "inventory":
+            game_state.player.inventory = list(value)
+        elif key == "skills":
+            game_state.player.skills = list(value)
+        elif key == "equipment":
+            updated = dict(game_state.player.equipment)
+            updated.update(dict(value))
+            game_state.player.equipment = updated
+        else:
+            setattr(game_state.player, key, deepcopy(value))
+    for key, value in world_setup.items():
+        if key == "discovered_locations":
+            game_state.world.discovered_locations = list(value)
+        elif key == "global_flags":
+            game_state.world.global_flags.update(dict(value))
+        else:
+            setattr(game_state.world, key, deepcopy(value))
+    for npc_name, overrides in setup.get("npc_overrides", {}).items():
+        npc = game_state.world.npcs.get(npc_name)
+        if npc is None:
+            continue
+        for key, value in overrides.items():
+            setattr(npc, key, deepcopy(value))
+    if "turn" in setup:
+        game_state.turn = int(setup["turn"])
+    if "location" not in player_setup and "current_scene" in world_setup:
+        game_state.player.location = game_state.world.current_scene
+    elif "location" in player_setup and "current_scene" not in world_setup:
+        game_state.world.current_scene = game_state.player.location
+    elif not player_setup and not world_setup:
+        game_state.player.location = game_state.world.current_scene
+    return game_state
+def _build_game_state(setup: dict[str, Any] | None = None) -> GameState:
+    game_state = GameState(player_name="Evaluator")
+    return _apply_setup(game_state, setup)
+def _state_snapshot(game_state: GameState) -> dict[str, Any]:
+    return {
+        "turn": game_state.turn,
+        "game_mode": game_state.game_mode,
+        "location": game_state.player.location,
+        "scene": game_state.world.current_scene,
+        "day": game_state.world.day_count,
+        "time_of_day": game_state.world.time_of_day,
+        "weather": game_state.world.weather,
+        "hp": game_state.player.hp,
+        "mp": game_state.player.mp,
+        "gold": game_state.player.gold,
+        "morale": game_state.player.morale,
+        "sanity": game_state.player.sanity,
+        "hunger": game_state.player.hunger,
+        "inventory": list(game_state.player.inventory),
+        "equipment": dict(game_state.player.equipment),
+        "skills": list(game_state.player.skills),
+        "active_quests": {
+            quest_id: {
+                "status": quest.status,
+                "objectives": dict(quest.objectives),
+            }
+            for quest_id, quest in game_state.world.quests.items()
+            if quest.status == "active"
+        },
+    }
+def _flatten(value: Any, prefix: str = "") -> set[str]:
+    flattened: set[str] = set()
+    if isinstance(value, dict):
+        for key, child in value.items():
+            child_prefix = f"{prefix}.{key}" if prefix else str(key)
+            flattened.update(_flatten(child, child_prefix))
+    elif isinstance(value, list):
+        list_prefix = prefix or "list"
+        for index, child in enumerate(value):
+            flattened.update(_flatten(child, f"{list_prefix}[{index}]"))
+        if not value:
+            flattened.add(f"{list_prefix}=[]")
+    else:
+        flattened.add(f"{prefix}={value}")
+    return flattened
+def _jaccard_distance(left: set[str], right: set[str]) -> float:
+    union = left | right
+    if not union:
+        return 0.0
+    intersection = left & right
+    return 1.0 - (len(intersection) / len(union))
+def _option_texts(options: list[dict[str, Any]]) -> set[str]:
+    texts = set()
+    for option in options or []:
+        if isinstance(option, dict):
+            texts.add(str(option.get("text", "")))
+        else:
+            texts.add(str(option))
+    return texts
+def _consume_story_stream(story_engine: StoryEngine, intent: dict[str, Any]) -> tuple[dict[str, Any], float]:
+    story_chunks: list[str] = []
+    final_result: dict[str, Any] | None = None
+    started = perf_counter()
+    for update in story_engine.generate_story_stream(intent):
+        if update["type"] == "story_chunk":
+            story_chunks.append(update["text"])
+        elif update["type"] == "final":
+            final_result = update
+    latency_ms = (perf_counter() - started) * 1000
+    if final_result is None:
+        final_result = {
+            "story_text": story_chunks[-1] if story_chunks else "",
+            "options": [],
+            "state_changes": {},
+            "change_log": [],
+            "consistency_issues": [],
+            "telemetry": {
+                "engine_mode": "evaluation_fallback",
+                "used_fallback": True,
+                "fallback_reason": "missing_final_event",
+            },
+        }
+    return final_result, latency_ms
+def _run_text_turn(user_input: str, setup: dict[str, Any] | None = None) -> dict[str, Any]:
+    game_state = _build_game_state(setup)
+    nlu = NLUEngine(game_state)
+    story = StoryEngine(game_state)
+    nlu_started = perf_counter()
+    intent = nlu.parse_intent(user_input)
+    nlu_latency_ms = (perf_counter() - nlu_started) * 1000
+    final_result, story_latency_ms = _consume_story_stream(story, intent)
+    return {
+        "user_input": user_input,
+        "intent": intent,
+        "nlu_latency_ms": nlu_latency_ms,
+        "story_latency_ms": story_latency_ms,
+        "total_latency_ms": nlu_latency_ms + story_latency_ms,
+        "final_result": final_result,
+        "state_snapshot": _state_snapshot(game_state),
+    }
+def _percentile(values: list[float], percentile: float) -> float:
+    if not values:
+        return 0.0
+    ordered = sorted(values)
+    index = max(0, min(len(ordered) - 1, round((percentile / 100) * (len(ordered) - 1))))
+    return ordered[index]
+def evaluate_intent_accuracy() -> dict[str, Any]:
+    dataset = _load_dataset("intent_accuracy")
+    details = []
+    parser_sources = Counter()
+    confusion = defaultdict(Counter)
+    intent_correct = 0
+    target_correct = 0
+    target_total = 0
+    latencies = []
+    for example in dataset:
+        game_state = _build_game_state(example.get("setup"))
+        nlu = NLUEngine(game_state)
+        started = perf_counter()
+        result = nlu.parse_intent(example["input"])
+        latency_ms = (perf_counter() - started) * 1000
+        expected_intent = example["intent"]
+        predicted_intent = result.get("intent")
+        is_intent_correct = predicted_intent == expected_intent
+        intent_correct += int(is_intent_correct)
+        latencies.append(latency_ms)
+        parser_sources[result.get("parser_source", "unknown")] += 1
+        confusion[expected_intent][str(predicted_intent)] += 1
+        expected_target = example.get("target")
+        predicted_target = result.get("target")
+        is_target_correct = None
+        if expected_target is not None:
+            target_total += 1
+            is_target_correct = _normalize_text(predicted_target) == _normalize_text(expected_target)
+            target_correct += int(bool(is_target_correct))
+        details.append(
+            {
+                "id": example["id"],
+                "input": example["input"],
+                "expected_intent": expected_intent,
+                "predicted_intent": predicted_intent,
+                "intent_correct": is_intent_correct,
+                "expected_target": expected_target,
+                "predicted_target": predicted_target,
+                "target_correct": is_target_correct,
+                "parser_source": result.get("parser_source"),
+                "latency_ms": round(latency_ms, 2),
+            }
+        )
+    return {
+        "task": "intent_accuracy",
+        "dataset_size": len(dataset),
+        "intent_accuracy": round(intent_correct / len(dataset), 4) if dataset else 0.0,
+        "target_accuracy": round(target_correct / target_total, 4) if target_total else None,
+        "avg_latency_ms": round(statistics.mean(latencies), 2) if latencies else 0.0,
+        "parser_source_breakdown": dict(parser_sources),
+        "confusion": {expected: dict(counts) for expected, counts in confusion.items()},
+        "details": details,
+    }
+def evaluate_consistency() -> dict[str, Any]:
+    dataset = _load_dataset("consistency")
+    guard_cases = dataset["action_guard_cases"]
+    state_cases = dataset["state_check_cases"]
+    guard_details = []
+    guard_correct = 0
+    for case in guard_cases:
+        game_state = _build_game_state(case.get("setup"))
+        is_valid, rejection_reason = game_state.pre_validate_action(case["intent"])
+        is_correct = is_valid == case["expected_valid"]
+        guard_correct += int(is_correct)
+        guard_details.append(
+            {
+                "id": case["id"],
+                "expected_valid": case["expected_valid"],
+                "predicted_valid": is_valid,
+                "correct": is_correct,
+                "rejection_reason": rejection_reason,
+                "intent": case["intent"],
+            }
+        )
+    state_details = []
+    state_correct = 0
+    for case in state_cases:
+        game_state = _build_game_state(case.get("setup"))
+        contradictions = game_state.check_consistency(case["proposed_changes"])
+        predicted_contradiction = bool(contradictions)
+        is_correct = predicted_contradiction == case["expected_contradiction"]
+        expected_contains = case.get("expected_contains", [])
+        if expected_contains:
+            is_correct = is_correct and all(
+                any(fragment in issue for issue in contradictions)
+                for fragment in expected_contains
+            )
+        state_correct += int(is_correct)
+        state_details.append(
+            {
+                "id": case["id"],
+                "expected_contradiction": case["expected_contradiction"],
+                "predicted_contradiction": predicted_contradiction,
+                "correct": is_correct,
+                "contradictions": contradictions,
+                "proposed_changes": case["proposed_changes"],
+            }
+        )
+    total_cases = len(guard_cases) + len(state_cases)
+    total_correct = guard_correct + state_correct
+    return {
+        "task": "consistency",
+        "guard_accuracy": round(guard_correct / len(guard_cases), 4) if guard_cases else 0.0,
+        "state_check_accuracy": round(state_correct / len(state_cases), 4) if state_cases else 0.0,
+        "overall_accuracy": round(total_correct / total_cases, 4) if total_cases else 0.0,
+        "action_guard_details": guard_details,
+        "state_check_details": state_details,
+    }
+def evaluate_latency(repeats: int) -> dict[str, Any]:
+    dataset = _load_dataset("latency")
+    scenario_summaries = []
+    all_nlu = []
+    all_story = []
+    all_total = []
+    fallback_total = 0
+    total_runs = 0
+    for scenario in dataset:
+        runs = []
+        for _ in range(repeats):
+            run_result = _run_text_turn(scenario["input"], scenario.get("setup"))
+            final_result = run_result["final_result"]
+            telemetry = final_result.get("telemetry", {})
+            used_fallback = bool(telemetry.get("used_fallback", False))
+            total_runs += 1
+            fallback_total += int(used_fallback)
+            all_nlu.append(run_result["nlu_latency_ms"])
+            all_story.append(run_result["story_latency_ms"])
+            all_total.append(run_result["total_latency_ms"])
+            runs.append(
+                {
+                    "nlu_latency_ms": round(run_result["nlu_latency_ms"], 2),
+                    "story_latency_ms": round(run_result["story_latency_ms"], 2),
+                    "total_latency_ms": round(run_result["total_latency_ms"], 2),
+                    "used_fallback": used_fallback,
+                    "fallback_reason": telemetry.get("fallback_reason"),
+                    "engine_mode": telemetry.get("engine_mode"),
+                }
+            )
+        total_values = [item["total_latency_ms"] for item in runs]
+        scenario_summaries.append(
+            {
+                "id": scenario["id"],
+                "input": scenario["input"],
+                "repeats": repeats,
+                "avg_total_latency_ms": round(statistics.mean(total_values), 2) if total_values else 0.0,
+                "p95_total_latency_ms": round(_percentile(total_values, 95), 2),
+                "fallback_rate": round(
+                    sum(1 for item in runs if item["used_fallback"]) / len(runs),
+                    4,
+                ) if runs else 0.0,
+                "runs": runs,
+            }
+        )
+    return {
+        "task": "latency",
+        "scenario_count": len(dataset),
+        "repeats": repeats,
+        "avg_nlu_latency_ms": round(statistics.mean(all_nlu), 2) if all_nlu else 0.0,
+        "avg_story_latency_ms": round(statistics.mean(all_story), 2) if all_story else 0.0,
+        "avg_total_latency_ms": round(statistics.mean(all_total), 2) if all_total else 0.0,
+        "p95_total_latency_ms": round(_percentile(all_total, 95), 2) if all_total else 0.0,
+        "fallback_rate": round(fallback_total / total_runs, 4) if total_runs else 0.0,
+        "scenarios": scenario_summaries,
+    }
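
Note on `_percentile`: `evaluate_latency` calls it, but its definition falls outside this part of the diff. The sketch below is only a hypothetical stand-in that assumes linear interpolation between the two closest ranks; the project's helper may use a different rule (for example nearest rank), so treat `percentile` and the sample latencies as illustrative.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Linear-interpolation percentile (hypothetical stand-in for `_percentile`)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    if len(ordered) == 1:
        return float(ordered[0])
    # Fractional position of the requested percentile in the sorted list.
    rank = (len(ordered) - 1) * (pct / 100.0)
    lower = math.floor(rank)
    upper = math.ceil(rank)
    if lower == upper:
        return float(ordered[lower])
    weight = rank - lower
    return ordered[lower] * (1.0 - weight) + ordered[upper] * weight


# Made-up latencies in milliseconds, with one slow outlier.
latencies_ms = [120.0, 135.0, 150.0, 980.0]
print(round(percentile(latencies_ms, 95), 2))
```

Under this interpolation rule, a p95 computed from only three repeats (the `--repeats` default) sits very close to the slowest run, so per-scenario p95 values are best read as a rough upper bound rather than a stable tail estimate.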
+def evaluate_branch_divergence() -> dict[str, Any]:
+    dataset = _load_dataset("branch_divergence")
+    group_summaries = []
+    pair_scores = []
+    for group in dataset:
+        branch_results = []
+        for branch in group["branches"]:
+            run_result = _run_text_turn(branch["input"], group.get("setup"))
+            branch_results.append(
+                {
+                    "label": branch["label"],
+                    "input": branch["input"],
+                    "story_text": run_result["final_result"].get("story_text", ""),
+                    "options": run_result["final_result"].get("options", []),
+                    "state_snapshot": run_result["state_snapshot"],
+                    "state_changes": run_result["final_result"].get("state_changes", {}),
+                    "telemetry": run_result["final_result"].get("telemetry", {}),
+                }
+            )
+        group_pairs = []
+        for left, right in combinations(branch_results, 2):
+            text_divergence = 1.0 - SequenceMatcher(
+                None,
+                left["story_text"],
+                right["story_text"],
+            ).ratio()
+            state_divergence = _jaccard_distance(
+                _flatten(left["state_snapshot"]),
+                _flatten(right["state_snapshot"]),
+            )
+            option_divergence = _jaccard_distance(
+                _option_texts(left["options"]),
+                _option_texts(right["options"]),
+            )
+            pair_score = round((text_divergence + state_divergence + option_divergence) / 3, 4)
+            pair_detail = {
+                "left": left["label"],
+                "right": right["label"],
+                "text_divergence": round(text_divergence, 4),
+                "state_divergence": round(state_divergence, 4),
+                "option_divergence": round(option_divergence, 4),
+                "pair_divergence_score": pair_score,
+                "meaningfully_divergent": pair_score >= 0.2,
+            }
+            pair_scores.append(pair_score)
+            group_pairs.append(pair_detail)
+        group_summaries.append(
+            {
+                "id": group["id"],
+                "avg_pair_divergence": round(
+                    statistics.mean([pair["pair_divergence_score"] for pair in group_pairs]),
+                    4,
+                ) if group_pairs else 0.0,
+                "branches": [
+                    {
+                        "label": branch["label"],
+                        "input": branch["input"],
+                        "telemetry": _json_safe(branch["telemetry"]),
+                        "state_changes": _json_safe(branch["state_changes"]),
+                    }
+                    for branch in branch_results
+                ],
+                "pair_details": group_pairs,
+            }
+        )
+    meaningful_pairs = sum(1 for score in pair_scores if score >= 0.2)
+    return {
+        "task": "branch_divergence",
+        "group_count": len(dataset),
+        "avg_pair_divergence": round(statistics.mean(pair_scores), 4) if pair_scores else 0.0,
+        "meaningfully_divergent_pair_rate": round(
+            meaningful_pairs / len(pair_scores),
+            4,
+        ) if pair_scores else 0.0,
+        "groups": group_summaries,
+    }
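
Note on the pair score: each pair of branches is scored as the unweighted mean of a text distance (`1 - SequenceMatcher.ratio()`), a state distance, and an option distance. The helpers `_jaccard_distance`, `_flatten`, and `_option_texts` are defined outside this part of the diff, so the sketch below substitutes a hypothetical `jaccard_distance` and assumes the state and options have already been flattened into lists of strings.

```python
from difflib import SequenceMatcher


def jaccard_distance(left: set[str], right: set[str]) -> float:
    """1 - |intersection| / |union|; two empty sets count as identical (assumption)."""
    union = left | right
    if not union:
        return 0.0
    return 1.0 - len(left & right) / len(union)


def pair_divergence(left: dict, right: dict) -> float:
    """Mean of text, state, and option divergence, rounded like the evaluation script."""
    text = 1.0 - SequenceMatcher(None, left["story_text"], right["story_text"]).ratio()
    state = jaccard_distance(set(left["state"]), set(right["state"]))
    options = jaccard_distance(set(left["options"]), set(right["options"]))
    return round((text + state + options) / 3, 4)


# Made-up branches: "state" holds flattened key=value strings.
attack = {"story_text": "abc", "state": ["hp=80"], "options": ["strike again"]}
talk = {"story_text": "xyz", "state": ["hp=100"], "options": ["ask about the road"]}
print(pair_divergence(attack, attack), pair_divergence(attack, talk))
```

Because the three components are averaged with equal weight, a pair can clear the `0.2` threshold on text differences alone even when the resulting state and options are identical, which is worth keeping in mind when reading `meaningfully_divergent_pair_rate`.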
+TASK_RUNNERS = {
+    "intent": lambda repeats: evaluate_intent_accuracy(),
+    "consistency": lambda repeats: evaluate_consistency(),
+    "latency": lambda repeats: evaluate_latency(repeats),
+    "branch": lambda repeats: evaluate_branch_divergence(),
+}
+def _build_summary(results: dict[str, Any]) -> dict[str, Any]:
+    summary = {}
+    if "intent" in results:
+        summary["intent_accuracy"] = results["intent"]["intent_accuracy"]
+    if "consistency" in results:
+        summary["consistency_overall_accuracy"] = results["consistency"]["overall_accuracy"]
+    if "latency" in results:
+        summary["avg_total_latency_ms"] = results["latency"]["avg_total_latency_ms"]
+        summary["latency_fallback_rate"] = results["latency"]["fallback_rate"]
+    if "branch" in results:
+        summary["avg_pair_divergence"] = results["branch"]["avg_pair_divergence"]
+    return summary
+def main() -> int:
+    parser = argparse.ArgumentParser(description="Run reproducible StoryWeaver evaluation tasks.")
+    parser.add_argument(
+        "--task",
+        choices=["all", *TASK_RUNNERS.keys()],
+        default="all",
+        help="Evaluation task to run.",
+    )
+    parser.add_argument(
+        "--repeats",
+        type=int,
+        default=3,
+        help="Repeat count for latency measurements.",
+    )
+    parser.add_argument(
+        "--output",
+        type=str,
+        default="",
+        help="Optional path for the output JSON file.",
+    )
+    args = parser.parse_args()
+    selected_tasks = list(TASK_RUNNERS.keys()) if args.task == "all" else [args.task]
+    task_results = {task: TASK_RUNNERS[task](args.repeats) for task in selected_tasks}
+    payload = {
+        "generated_at": datetime.now().isoformat(timespec="seconds"),
+        "task": args.task,
+        "summary": _build_summary(task_results),
+        "results": task_results,
+    }
+    RESULTS_DIR.mkdir(parents=True, exist_ok=True)
+    if args.output:
+        output_path = Path(args.output)
+        if not output_path.is_absolute():
+            output_path = PROJECT_ROOT / output_path
+    else:
+        timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+        suffix = args.task
+        output_path = RESULTS_DIR / f"{timestamp}-{suffix}.json"
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+    with output_path.open("w", encoding="utf-8") as fh:
+        json.dump(payload, fh, ensure_ascii=False, indent=2)
+    print(json.dumps(payload["summary"], ensure_ascii=False, indent=2))
+    print(f"Saved evaluation results to: {output_path}")
+    return 0
+if __name__ == "__main__":
+    raise SystemExit(main())

nlu_engine.py CHANGED

@@ -117,13 +117,14 @@ class NLUEngine:
                 "raw_input": "我想用剑攻击那个哥布林"
             }
         """
-        if not user_input or not user_input.strip():
-            return {
-                "intent": "EXPLORE",
-                "target": None,
-                "details": "玩家沉默不语",
-                "raw_input": "",
-            }
         user_input = user_input.strip()
         logger.info(f"NLU 解析输入: '{user_input}'")
@@ -168,16 +169,17 @@ class NLUEngine:
             max_retries=2,
         )
-        if result and isinstance(result, dict) and "intent" in result:
-            # 验证意图类型合法
-            valid_intents = {
-                "ATTACK", "TALK", "MOVE", "EXPLORE", "USE_ITEM",
-                "TRADE", "EQUIP", "REST", "QUEST", "SKILL",
-                "PICKUP", "FLEE", "CUSTOM",
-            }
-            if result["intent"] not in valid_intents:
-                result["intent"] = "CUSTOM"
-            return result
         return None
@@ -230,11 +232,12 @@ class NLUEngine:
         # 尝试提取目标
         target = self._extract_target_from_text(user_input)
-        return {
-            "intent": detected_intent,
-            "target": target,
-            "details": None,
-        }
     def _extract_target_from_text(self, text: str) -> Optional[str]:
         """

                 "raw_input": "我想用剑攻击那个哥布林"
             }
         """
+        if not user_input or not user_input.strip():
+            return {
+                "intent": "EXPLORE",
+                "target": None,
+                "details": "玩家沉默不语",
+                "raw_input": "",
+                "parser_source": "empty_input",
+            }
         user_input = user_input.strip()
         logger.info(f"NLU 解析输入: '{user_input}'")
             max_retries=2,
         )
+        if result and isinstance(result, dict) and "intent" in result:
+            # 验证意图类型合法
+            valid_intents = {
+                "ATTACK", "TALK", "MOVE", "EXPLORE", "USE_ITEM",
+                "TRADE", "EQUIP", "REST", "QUEST", "SKILL",
+                "PICKUP", "FLEE", "CUSTOM",
+            }
+            if result["intent"] not in valid_intents:
+                result["intent"] = "CUSTOM"
+            result.setdefault("parser_source", "llm")
+            return result
         return None
         # 尝试提取目标
         target = self._extract_target_from_text(user_input)
+        return {
+            "intent": detected_intent,
+            "target": target,
+            "details": None,
+            "parser_source": "keyword_fallback",
+        }
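
Note on `parser_source`: the three return paths now tag their result as `"empty_input"`, `"llm"`, or `"keyword_fallback"`. One way this tag could be used downstream is to report how often the keyword fallback handled a turn. The function name `parser_source_breakdown`, the `"unknown"` bucket, and the sample intents below are assumptions for illustration, not part of the commit.

```python
from collections import Counter


def parser_source_breakdown(intents: list[dict]) -> dict[str, float]:
    """Share of parsed intents per `parser_source`; untagged results count as 'unknown'."""
    counts = Counter(item.get("parser_source", "unknown") for item in intents)
    total = sum(counts.values())
    if not total:
        return {}
    return {source: round(count / total, 4) for source, count in counts.items()}


# Made-up parse results shaped like the dicts returned by the NLU engine.
parsed = [
    {"intent": "ATTACK", "parser_source": "llm"},
    {"intent": "TALK", "parser_source": "llm"},
    {"intent": "MOVE", "parser_source": "keyword_fallback"},
    {"intent": "EXPLORE"},  # result produced before the tag existed
]
print(parser_source_breakdown(parsed))
```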
     def _extract_target_from_text(self, text: str) -> Optional[str]:
         """

story_engine.py CHANGED

@@ -84,7 +84,7 @@ def _merge_change_logs(tick_log: list[str], action_log: list[str]) -> list[str]:
     return remaining_tick + merged_results
-def _normalize_markers(text: str) -> str:
     """
     标准化 LLM 输出中的分隔标记，处理常见变体格式。
@@ -97,9 +97,29 @@ def _normalize_markers(text: str) -> str:
     """
     text = re.sub(r'-{2,}\s*STORY[_ ]?TEXT\s*-{2,}', '---STORY_TEXT---', text, flags=re.IGNORECASE)
     text = re.sub(r'-{2,}\s*OPTIONS[_ ]?JSON\s*-{2,}', '---OPTIONS_JSON---', text, flags=re.IGNORECASE)
-    text = re.sub(r'-{2,}\s*STATE[_ ]?JSON\s*-{2,}', '---STATE_JSON---', text, flags=re.IGNORECASE)
-    text = re.sub(r'-{2,}\s*THINKING\s*-{2,}', '---THINKING---', text, flags=re.IGNORECASE)
-    return text
 # ============================================================
@@ -458,12 +478,13 @@ class StoryEngine:
         story_text, options = self._parse_story_response(raw_text)
         # 开场没有状态变更
-        return {
-            "story_text": story_text,
-            "options": options,
-            "state_changes": {},
-            "change_log": [],
-        }
     def generate_story(self, player_intent: dict) -> dict:
         """
@@ -494,7 +515,8 @@ class StoryEngine:
                 "consistency_issues": ["一致性问题"],
             }
         """
-        logger.info(f"生成故事响应，玩家意图: {player_intent}")
         # ============================================
         # 推进时间（行动前，时间自然流逝）
@@ -510,10 +532,15 @@ class StoryEngine:
         # ============================================
         outline = self._generate_outline(player_intent)
-        if outline is None:
-            # 大纲生成失败 —— 降级处理
-            logger.error("大纲生成失败，使用降级叙事")
-            return self._fallback_response(player_intent, tick_log)
         # ============================================
         # 处理时间冲突：如果大纲指定了 time_change（时间跳跃），
@@ -529,15 +556,21 @@ class StoryEngine:
         # ============================================
         consistency_issues = self.game_state.check_consistency(state_changes)
-        if consistency_issues:
-            logger.warning(f"发现一致性问题: {consistency_issues}")
-            # 尝试修复：重新生成大纲，附带一致性约束
-            outline = self._regenerate_outline_with_fixes(player_intent, consistency_issues)
-            if outline is None:
-                return self._fallback_response(player_intent, tick_log)
-            state_changes = outline.get("state_changes", {})
-            # 再次检查
-            consistency_issues = self.game_state.check_consistency(state_changes)
         # 移除与非法物品相关的状态变更（安全网）
         if consistency_issues:
@@ -588,14 +621,21 @@ class StoryEngine:
         # 合并 tick_log 和 change_log 中的重复属性条目
         merged_log = _merge_change_logs(tick_log, change_log + validation_issues)
-        return {
-            "story_text": story_text,
-            "options": options,
-            "state_changes": state_changes,
-            "change_log": merged_log,
-            "outline": outline,
-            "consistency_issues": consistency_issues,
-        }
     def _generate_outline(self, player_intent: dict) -> Optional[dict]:
         """
@@ -715,14 +755,15 @@ class StoryEngine:
         raw_text = call_qwen(messages, model=self.model, temperature=0.9, max_tokens=1500)
         story_text, options = self._parse_story_response(raw_text)
-        return {
-            "story_text": story_text,
-            "options": options,
-            "state_changes": {},
-            "change_log": ["游戏结束"],
-            "outline": None,
-            "consistency_issues": [],
-        }
     @staticmethod
     def _clean_story_text(story_text: str) -> str:
@@ -925,7 +966,14 @@ class StoryEngine:
             opt["id"] = i
         return options[:3]
-    def _fallback_response(self, player_intent: dict, tick_log: list[str] | None = None) -> dict:
         """
         降级响应：当大纲生成完全失败时，提供基本响应。
@@ -965,14 +1013,19 @@ class StoryEngine:
             options = self._generate_default_options()
         fallback_change_log = (tick_log or []) + ["（系统提示：本回合使用了降级响应）"]
-        return {
-            "story_text": story_text,
-            "options": options,
-            "state_changes": {},
-            "change_log": fallback_change_log,
-            "outline": None,
-            "consistency_issues": [],
-        }
     def _sanitize_state_changes(self, changes: dict, event_type: str = "") -> tuple[dict, list[str]]:
         """
@@ -1211,13 +1264,18 @@ class StoryEngine:
                 result["options"] = self._ensure_three_options(result.get("options", []))
                 yield {"type": "final", **result}
             except Exception:
-                yield {
-                    "type": "final",
-                    "story_text": "你踏上了一段新的旅程...",
-                    "options": self._generate_default_options(),
-                    "state_changes": {},
-                    "change_log": [],
-                }
             return
         # ★ 如果流式阶段未检测到标记但有累积文本，先 yield 给 UI 显示
@@ -1239,11 +1297,12 @@ class StoryEngine:
         yield {
             "type": "final",
-            "story_text": story_text,
-            "options": options,
-            "state_changes": {},
-            "change_log": [],
-        }
     def generate_story_stream(self, player_intent: dict):
         """
@@ -1315,18 +1374,23 @@ class StoryEngine:
                     if display_text.strip():
                         yield {"type": "story_chunk", "text": display_text.strip()}
-        except Exception as e:
-            logger.error(f"流式合并生成失败: {e}，降级为非流式两阶段")
-            try:
-                result = self.generate_story(player_intent)
                 # 降级结果也强制保证 3 个选项
                 result["options"] = self._ensure_three_options(result.get("options", []))
                 yield {"type": "final", **result}
-            except Exception:
-                fallback = self._fallback_response(player_intent, tick_log)
-                fallback["options"] = self._ensure_three_options(fallback.get("options", []))
-                yield {"type": "final", **fallback}
-            return
         # ★ 如果流式阶段未检测到标记但有累积文本，先 yield 给 UI 显示
         if not story_started and full_text.strip():
@@ -1345,21 +1409,31 @@ class StoryEngine:
             if story_text and story_text.strip():
                 logger.warning("大纲(STATE_JSON)解析失败，但故事文本已提取，跳过状态更新继续")
                 options = self._ensure_three_options(options)
-                yield {
-                    "type": "final",
-                    "story_text": story_text,
-                    "options": options,
-                    "state_changes": {},
-                    "change_log": tick_log + ["（系统提示：本回合状态解析失败，未更新状态）"],
-                    "outline": None,
-                    "consistency_issues": [],
-                }
-                return
-            else:
-                logger.error("合并响应解析完全失败，使用降级")
-                fallback = self._fallback_response(player_intent, tick_log)
-                yield {"type": "final", **fallback}
-                return
         # 处理时间冲突
         state_changes = outline.get("state_changes", {})
@@ -1410,15 +1484,21 @@ class StoryEngine:
         # 合并日志
         merged_log = _merge_change_logs(tick_log, change_log + validation_issues)
-        yield {
-            "type": "final",
-            "story_text": story_text,
-            "options": options,
-            "state_changes": state_changes,
-            "change_log": merged_log,
-            "outline": outline,
-            "consistency_issues": consistency_issues,
-        }
     def process_option_selection_stream(self, option: dict):
         """

     return remaining_tick + merged_results
+def _normalize_markers(text: str) -> str:
     """
     标准化 LLM 输出中的分隔标记，处理常见变体格式。
     """
     text = re.sub(r'-{2,}\s*STORY[_ ]?TEXT\s*-{2,}', '---STORY_TEXT---', text, flags=re.IGNORECASE)
     text = re.sub(r'-{2,}\s*OPTIONS[_ ]?JSON\s*-{2,}', '---OPTIONS_JSON---', text, flags=re.IGNORECASE)
+    text = re.sub(r'-{2,}\s*STATE[_ ]?JSON\s*-{2,}', '---STATE_JSON---', text, flags=re.IGNORECASE)
+    text = re.sub(r'-{2,}\s*THINKING\s*-{2,}', '---THINKING---', text, flags=re.IGNORECASE)
+    return text
+def _build_telemetry(
+    engine_mode: str,
+    *,
+    used_fallback: bool = False,
+    fallback_reason: str | None = None,
+    consistency_issues_count: int = 0,
+    validation_issues_count: int = 0,
+    outline_regenerated: bool = False,
+) -> dict:
+    """Build lightweight run metadata for logging and the evaluation scripts."""
+    return {
+        "engine_mode": engine_mode,
+        "used_fallback": used_fallback,
+        "fallback_reason": fallback_reason,
+        "consistency_issues_count": consistency_issues_count,
+        "validation_issues_count": validation_issues_count,
+        "outline_regenerated": outline_regenerated,
+    }
 # ============================================================
         story_text, options = self._parse_story_response(raw_text)
         # 开场没有状态变更
+        return {
+            "story_text": story_text,
+            "options": options,
+            "state_changes": {},
+            "change_log": [],
+            "telemetry": _build_telemetry(engine_mode="opening", used_fallback=False),
+        }
     def generate_story(self, player_intent: dict) -> dict:
         """
                 "consistency_issues": ["一致性问题"],
             }
         """
+        logger.info(f"生成故事响应，玩家意图: {player_intent}")
+        outline_regenerated = False
         # ============================================
         # 推进时间（行动前，时间自然流逝）
         # ============================================
         outline = self._generate_outline(player_intent)
+        if outline is None:
+            # 大纲生成失败 —— 降级处理
+            logger.error("大纲生成失败，使用降级叙事")
+            return self._fallback_response(
+                player_intent,
+                tick_log,
+                fallback_reason="outline_generation_failed",
+                engine_mode="two_stage",
+            )
         # ============================================
         # 处理时间冲突：如果大纲指定了 time_change（时间跳跃），
         # ============================================
         consistency_issues = self.game_state.check_consistency(state_changes)
+        if consistency_issues:
+            logger.warning(f"发现一致性问题: {consistency_issues}")
+            # 尝试修复：重新生成大纲，附带一致性约束
+            outline = self._regenerate_outline_with_fixes(player_intent, consistency_issues)
+            if outline is None:
+                return self._fallback_response(
+                    player_intent,
+                    tick_log,
+                    fallback_reason="outline_regeneration_failed",
+                    engine_mode="two_stage",
+                )
+            outline_regenerated = True
+            state_changes = outline.get("state_changes", {})
+            # 再次检查
+            consistency_issues = self.game_state.check_consistency(state_changes)
         # 移除与非法物品相关的状态变更（安全网）
         if consistency_issues:
         # 合并 tick_log 和 change_log 中的重复属性条目
         merged_log = _merge_change_logs(tick_log, change_log + validation_issues)
+        return {
+            "story_text": story_text,
+            "options": options,
+            "state_changes": state_changes,
+            "change_log": merged_log,
+            "outline": outline,
+            "consistency_issues": consistency_issues,
+            "telemetry": _build_telemetry(
+                engine_mode="two_stage",
+                used_fallback=False,
+                consistency_issues_count=len(consistency_issues),
+                validation_issues_count=len(validation_issues),
+                outline_regenerated=outline_regenerated,
+            ),
+        }
     def _generate_outline(self, player_intent: dict) -> Optional[dict]:
         """
         raw_text = call_qwen(messages, model=self.model, temperature=0.9, max_tokens=1500)
         story_text, options = self._parse_story_response(raw_text)
+        return {
+            "story_text": story_text,
+            "options": options,
+            "state_changes": {},
+            "change_log": ["游戏结束"],
+            "outline": None,
+            "consistency_issues": [],
+            "telemetry": _build_telemetry(engine_mode="death_narrative", used_fallback=False),
+        }
     @staticmethod
     def _clean_story_text(story_text: str) -> str:
             opt["id"] = i
         return options[:3]
+    def _fallback_response(
+        self,
+        player_intent: dict,
+        tick_log: list[str] | None = None,
+        *,
+        fallback_reason: str = "unknown",
+        engine_mode: str = "fallback",
+    ) -> dict:
         """
         降级响应：当大纲生成完全失败时，提供基本响应。
             options = self._generate_default_options()
         fallback_change_log = (tick_log or []) + ["（系统提示：本回合使用了降级响应）"]
+        return {
+            "story_text": story_text,
+            "options": options,
+            "state_changes": {},
+            "change_log": fallback_change_log,
+            "outline": None,
+            "consistency_issues": [],
+            "telemetry": _build_telemetry(
+                engine_mode=engine_mode,
+                used_fallback=True,
+                fallback_reason=fallback_reason,
+            ),
+        }
     def _sanitize_state_changes(self, changes: dict, event_type: str = "") -> tuple[dict, list[str]]:
         """
                 result["options"] = self._ensure_three_options(result.get("options", []))
                 yield {"type": "final", **result}
             except Exception:
+                yield {
+                    "type": "final",
+                    "story_text": "你踏上了一段新的旅程...",
+                    "options": self._generate_default_options(),
+                    "state_changes": {},
+                    "change_log": [],
+                    "telemetry": _build_telemetry(
+                        engine_mode="opening",
+                        used_fallback=True,
+                        fallback_reason="opening_stream_exception",
+                    ),
+                }
             return
         # ★ 如果流式阶段未检测到标记但有累积文本，先 yield 给 UI 显示
         yield {
             "type": "final",
+            "story_text": story_text,
+            "options": options,
+            "state_changes": {},
+            "change_log": [],
+            "telemetry": _build_telemetry(engine_mode="opening", used_fallback=False),
+        }
     def generate_story_stream(self, player_intent: dict):
         """
                     if display_text.strip():
                         yield {"type": "story_chunk", "text": display_text.strip()}
+        except Exception as e:
+            logger.error(f"流式合并生成失败: {e}，降级为非流式两阶段")
+            try:
+                result = self.generate_story(player_intent)
                 # 降级结果也强制保证 3 个选项
                 result["options"] = self._ensure_three_options(result.get("options", []))
                 yield {"type": "final", **result}
+            except Exception:
+                fallback = self._fallback_response(
+                    player_intent,
+                    tick_log,
+                    fallback_reason="stream_exception",
+                    engine_mode="stream_merged",
+                )
+                fallback["options"] = self._ensure_three_options(fallback.get("options", []))
+                yield {"type": "final", **fallback}
+            return
         # ★ 如果流式阶段未检测到标记但有累积文本，先 yield 给 UI 显示
         if not story_started and full_text.strip():
             if story_text and story_text.strip():
                 logger.warning("大纲(STATE_JSON)解析失败，但故事文本已提取，跳过状态更新继续")
                 options = self._ensure_three_options(options)
+                yield {
+                    "type": "final",
+                    "story_text": story_text,
+                    "options": options,
+                    "state_changes": {},
+                    "change_log": tick_log + ["（系统提示：本回合状态解析失败，未更新状态）"],
+                    "outline": None,
+                    "consistency_issues": [],
+                    "telemetry": _build_telemetry(
+                        engine_mode="stream_merged",
+                        used_fallback=True,
+                        fallback_reason="state_parse_failed",
+                    ),
+                }
+                return
+            else:
+                logger.error("合并响应解析完全失败，使用降级")
+                fallback = self._fallback_response(
+                    player_intent,
+                    tick_log,
+                    fallback_reason="merged_response_parse_failed",
+                    engine_mode="stream_merged",
+                )
+                yield {"type": "final", **fallback}
+                return
         # 处理时间冲突
         state_changes = outline.get("state_changes", {})
         # 合并日志
         merged_log = _merge_change_logs(tick_log, change_log + validation_issues)
+        yield {
+            "type": "final",
+            "story_text": story_text,
+            "options": options,
+            "state_changes": state_changes,
+            "change_log": merged_log,
+            "outline": outline,
+            "consistency_issues": consistency_issues,
+            "telemetry": _build_telemetry(
+                engine_mode="stream_merged",
+                used_fallback=False,
+                consistency_issues_count=len(consistency_issues),
+                validation_issues_count=len(validation_issues),
+            ),
+        }
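
Note on consuming `telemetry`: every final result now carries a dict with the keys produced by `_build_telemetry`. A minimal sketch of how a report script might aggregate those dicts follows; `summarize_telemetry` and the sample results are hypothetical, and only the key names and the `fallback_reason` / `engine_mode` values are taken from this diff.

```python
from collections import Counter


def summarize_telemetry(final_results: list[dict]) -> dict:
    """Aggregate the `telemetry` dicts attached to final story results."""
    items = [result.get("telemetry", {}) for result in final_results]
    total = len(items)
    fallbacks = [item for item in items if item.get("used_fallback")]
    return {
        "turns": total,
        "fallback_rate": round(len(fallbacks) / total, 4) if total else 0.0,
        "fallback_reasons": dict(Counter(item.get("fallback_reason") for item in fallbacks)),
        "engine_modes": dict(Counter(item.get("engine_mode") for item in items)),
        "outline_regenerated": sum(1 for item in items if item.get("outline_regenerated")),
    }


# Made-up results using engine modes and fallback reasons that appear in the diff.
results = [
    {"telemetry": {"engine_mode": "stream_merged", "used_fallback": False}},
    {"telemetry": {"engine_mode": "stream_merged", "used_fallback": True,
                   "fallback_reason": "state_parse_failed"}},
    {"telemetry": {"engine_mode": "two_stage", "used_fallback": False,
                   "outline_regenerated": True}},
    {"telemetry": {"engine_mode": "two_stage", "used_fallback": True,
                   "fallback_reason": "outline_generation_failed"}},
]
print(summarize_telemetry(results))
```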
     def process_option_selection_stream(self, option: dict):
         """

telemetry.py ADDED

	@@ -0,0 +1,81 @@

+"""
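+telemetry.py - StoryWeaver structured interaction logging utilities
+Responsibilities:
+1. Assign a stable session_id to each game session
+2. Persist one interaction record per turn as JSONL
+3. Provide a single log format for evaluation scripts and case analysis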
+"""
+from __future__ import annotations
+import json
+import os
+import uuid
+from datetime import datetime
+from pathlib import Path
+from typing import Any
+PROJECT_ROOT = Path(__file__).resolve().parent
+DEFAULT_LOG_DIR = PROJECT_ROOT / "logs" / "interactions"
+def _resolve_log_dir() -> Path:
+    custom_dir = os.getenv("STORYWEAVER_LOG_DIR", "").strip()
+    if custom_dir:
+        return Path(custom_dir).expanduser()
+    return DEFAULT_LOG_DIR
+def create_session_metadata(session_id: str | None = None) -> dict[str, Any]:
+    """
+    Create metadata for a new session.
+    Each session maps to its own JSONL file, which makes replay and analysis easier.
+    """
+    timestamp = datetime.now().strftime("%Y%m%d-%H%M%S")
+    new_session_id = session_id or f"sw-{timestamp}-{uuid.uuid4().hex[:8]}"
+    log_dir = _resolve_log_dir()
+    log_path = log_dir / f"{new_session_id}.jsonl"
+    return {
+        "session_id": new_session_id,
+        "turn_index": 0,
+        "interaction_log_path": str(log_path),
+    }
+def ensure_session_metadata(game_session: dict[str, Any]) -> dict[str, Any]:
+    """Ensure the game session carries the metadata required for logging."""
+    if "session_id" not in game_session or "interaction_log_path" not in game_session:
+        game_session.update(create_session_metadata())
+    if "turn_index" not in game_session:
+        game_session["turn_index"] = 0
+    return game_session
+def append_turn_log(game_session: dict[str, Any], record: dict[str, Any]) -> str:
+    """
+    Append one structured interaction log record.
+    Returns:
+        The log file path, for debugging and reuse by scripts.
+    """
+    ensure_session_metadata(game_session)
+    game_session["turn_index"] += 1
+    log_path = Path(game_session["interaction_log_path"])
+    log_path.parent.mkdir(parents=True, exist_ok=True)
+    payload = {
+        "timestamp": datetime.now().isoformat(timespec="seconds"),
+        "session_id": game_session["session_id"],
+        "turn_index": game_session["turn_index"],
+        **record,
+    }
+    with log_path.open("a", encoding="utf-8") as fh:
+        json.dump(payload, fh, ensure_ascii=False)
+        fh.write("\n")
+    return str(log_path)
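
Note on reading the logs back: `append_turn_log` writes one JSON object per line, so a session file can be replayed by parsing it line by line. The reader below is a sketch only; `load_turn_logs` is a hypothetical helper, and the demo writes two made-up records to a temporary directory instead of touching `logs/interactions`.

```python
import json
import tempfile
from pathlib import Path


def load_turn_logs(log_path: Path) -> list[dict]:
    """Read one JSON object per non-empty line from a session log."""
    records = []
    with Path(log_path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo with a throwaway file; the two records are made-up sample data.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "sw-demo.jsonl"
    with demo_path.open("a", encoding="utf-8") as fh:
        for turn_index in (1, 2):
            json.dump({"session_id": "sw-demo", "turn_index": turn_index}, fh, ensure_ascii=False)
            fh.write("\n")
    records = load_turn_logs(demo_path)

print([record["turn_index"] for record in records])
```

Skipping blank lines keeps the reader tolerant of a trailing newline or a manually edited file, while a malformed line still raises `json.JSONDecodeError` so corrupted logs are not silently ignored.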

utils.py CHANGED

@@ -10,11 +10,17 @@ utils.py - StoryWeaver 工具函数模块
 import os
 import re
 import json
-import time
-import logging
-from typing import Any, Optional
-from dotenv import load_dotenv
-from openai import OpenAI
 # ============================================================
 # 日志配置
@@ -43,17 +49,22 @@ if not QWEN_API_KEY or QWEN_API_KEY == "sk-xxxxxx":
 # 使用 OpenAI 兼容格式连接 Qwen API
 # base_url 指向通义千问的 OpenAI 兼容端点
-_client: Optional[OpenAI] = None
-def get_client() -> OpenAI:
     """
     获取全局 OpenAI 客户端（懒加载单例）。
-    使用兼容格式调用 Qwen API。
-    """
-    global _client
-    if _client is None:
-        _client = OpenAI(
             api_key=QWEN_API_KEY,
             base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
         )

 import os
 import re
 import json
+import time
+import logging
+from typing import Any, Optional
+from dotenv import load_dotenv
+try:
+    from openai import OpenAI
+    _OPENAI_IMPORT_ERROR: Optional[Exception] = None
+except ImportError as exc:  # pragma: no cover - depends on local env
+    OpenAI = None  # type: ignore[assignment]
+    _OPENAI_IMPORT_ERROR = exc
 # ============================================================
 # 日志配置
 # 使用 OpenAI 兼容格式连接 Qwen API
 # base_url 指向通义千问的 OpenAI 兼容端点
+_client: Optional[Any] = None
+def get_client() -> Any:
     """
     获取全局 OpenAI 客户端（懒加载单例）。
+    使用兼容格式调用 Qwen API。
+    """
+    global _client
+    if OpenAI is None:
+        raise RuntimeError(
+            "未安装 openai 依赖，无法初始化 Qwen 客户端。"
+            "请先执行 `pip install -r requirements.txt`。"
+        ) from _OPENAI_IMPORT_ERROR
+    if _client is None:
+        _client = OpenAI(
             api_key=QWEN_API_KEY,
             base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
         )
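
Note on the deferred import: wrapping `from openai import OpenAI` in `try/except ImportError` lets modules that import `utils.py` load without the dependency, and `get_client` raises only when a client is actually needed. The same pattern in generic form might look like the sketch below; `optional_import`, `require`, and the module name `storyweaver_missing_dependency` are hypothetical and exist only to demonstrate the error chaining.

```python
import importlib
from typing import Any, Optional


def optional_import(module_name: str) -> tuple[Optional[Any], Optional[Exception]]:
    """Return (module, None) on success or (None, error) if the import fails."""
    try:
        return importlib.import_module(module_name), None
    except ImportError as exc:
        return None, exc


def require(module: Optional[Any], error: Optional[Exception], hint: str) -> Any:
    """Raise a readable RuntimeError at use time, chained to the original ImportError."""
    if module is None:
        raise RuntimeError(hint) from error
    return module


# A stdlib module imports fine; the second name is deliberately nonexistent.
json_module, json_error = optional_import("json")
missing_module, missing_error = optional_import("storyweaver_missing_dependency")

try:
    require(missing_module, missing_error, "Install the dependency before creating a client.")
except RuntimeError as exc:
    print(f"{exc} (caused by {type(exc.__cause__).__name__})")
```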