wzh0617 committed on
Commit c0abd39 · verified · 1 Parent(s): 7d50051

Upload 3 files

Files changed (3)
  1. README.md +557 -558
  2. app.py +0 -0
  3. requirements.txt +4 -5
README.md CHANGED
@@ -1,558 +1,557 @@
- ---
- title: StoryWeaver
- emoji: 📖
- colorFrom: red
- colorTo: purple
- sdk: gradio
- sdk_version: 4.44.0
- python_version: "3.10"
- app_file: app.py
- pinned: false
- license: mit
- short_description: Interactive NLP story engine with evaluation and logging
- ---
- [... remainder of the previous README body: identical to the new version below except for the front matter above ...]
 
---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 6.7.0
app_file: app.py
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---

# StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first

## What This Repository Contains

At a high level, the project has five responsibilities:

1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation

This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Create `.env`

Create a `.env` file in the project root:

```env
QWEN_API_KEY=your_api_key_here
```

Optional:

```env
STORYWEAVER_LOG_DIR=logs/interactions
```
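
At startup the app reads these values from the environment. A minimal sketch of that pattern, assuming `python-dotenv` from `requirements.txt` is used for loading; the `load_config` helper and its return shape are illustrative, not the repo's actual API:

```python
import os

try:
    from dotenv import load_dotenv  # python-dotenv, listed in requirements.txt
except ImportError:  # keep the sketch importable without the package
    def load_dotenv() -> None:
        pass

def load_config() -> dict:
    """Read StoryWeaver settings from .env / the process environment."""
    load_dotenv()  # no-op if there is no .env file
    return {
        "api_key": os.getenv("QWEN_API_KEY"),  # required for real generation
        "log_dir": os.getenv("STORYWEAVER_LOG_DIR", "logs/interactions"),
    }
```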

### 3. Run the app

```bash
python app.py
```

Default local URL:

- `http://localhost:7860`

### 4. Run evaluation

```bash
python evaluation/run_evaluations.py --task all --repeats 3
```

Useful variants:

```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```

## Deploy to Hugging Face Spaces (Web Upload)

If you want to deploy quickly without using git commands, use this checklist:

1. Create a new Space on Hugging Face:
   - SDK: `Gradio`
   - Python version: default is fine (3.10+)
2. Upload project files from this repository root.
3. Do **not** upload local-only files/directories:
   - `venv/`, `.venv/`, `.env`, `__pycache__/`, `.gradio/`, `logs/`, `evaluation/results/`
4. In Space settings, add the secret:
   - `QWEN_API_KEY`
5. Wait for the build to finish, then open the Space URL.

This repository already uses the standard Gradio entrypoint in `app.py`, so Spaces will start the app automatically.

## Recommended Reading Order

If you are new to the repo, read files in this order:

1. [state_manager.py](./state_manager.py)
   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
   Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
   Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
   Why: this connects the UI with the engines and now also writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
   Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)

## Repository Map

```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```

Core responsibilities by file:

- [app.py](./app.py)
  Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
  Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
  API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
  Reproducible experiment runner for the report.

## System Architecture

The main runtime path is:

`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`

There are two ideas that matter most in this codebase:

### 1. `GameState` is the source of truth

Almost everything meaningful lives in [state_manager.py](./state_manager.py):

- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history

When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.

### 2. The app is a coordinator, not the game logic

[app.py](./app.py) should mostly:

- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs

If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.

## Runtime Flow

### Text input flow

For normal text input, the path is:

1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. the UI is refreshed with story text, options, and the status panel
7. `_record_interaction_log` writes a JSONL record to disk
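
The steps above can be sketched as one coordinator function. Everything here is a stand-in for illustration; the real signatures live in `app.py`, `nlu_engine.py`, `story_engine.py`, and `state_manager.py`:

```python
def run_turn(text, parse_intent, pre_validate, generate, apply_changes, write_log):
    """One turn of the text-input path (steps 1-7), with engines passed as callables."""
    intent = parse_intent(text)                # step 2: structured intent dict
    allowed, reason = pre_validate(intent)     # step 3: early rejection
    if not allowed:
        record = {"user_input": text, "nlu_result": intent,
                  "blocked": True, "output_text": reason}
    else:
        output, changes = generate(intent)     # step 4: narrative pipeline
        change_log = apply_changes(changes)    # step 5: state update
        record = {"user_input": text, "nlu_result": intent, "blocked": False,
                  "state_changes": changes, "change_log": change_log,
                  "output_text": output}
    write_log(record)                          # step 7: one JSONL record per turn
    return record
```

The point of the shape is that the coordinator owns no game rules: every decision is delegated to the engine callables.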

### Option click flow

Button clicks do not go through full free-text parsing. Instead:

1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged

This is useful because option interactions and free-text interactions now share the same evaluation and observability format.

## Main Modules in More Detail

### `state_manager.py`

This file defines:

- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`

Important methods:

- `pre_validate_action`
  Rejects obviously invalid actions before calling the model.
- `check_consistency`
  Detects contradictions in proposed state changes.
- `apply_changes`
  Applies state changes and returns a readable change log.
- `validate`
  Makes sure the resulting state is legal.
- `to_prompt`
  Serializes the current game state into prompt-ready text.

When to edit this file:

- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
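
As an illustration of the kind of deterministic rule that belongs here, a hypothetical pre-validation check; the real `GameState.pre_validate_action` has its own rules and signature:

```python
def pre_validate_action(intent: dict, inventory: set) -> tuple:
    """Reject clearly invalid actions before any model call (illustrative rules only)."""
    target = intent.get("target")
    if intent.get("intent") == "USE_ITEM" and target not in inventory:
        return False, f"You are not carrying: {target}"
    if intent.get("intent") == "EQUIP" and target not in inventory:
        return False, f"You cannot equip an item you do not have: {target}"
    return True, ""
```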

### `nlu_engine.py`

This file is responsible for intent recognition.

Current behavior:

- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`

Current intent labels include:

- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`

When to edit this file:

- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
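
The keyword fallback path might look roughly like this. The keyword lists are made up for illustration; only the intent labels and the `parser_source` field come from this repo:

```python
# Assumed keyword table; the real rules live in nlu_engine.py.
KEYWORD_RULES = [
    ("ATTACK", ("attack", "fight", "strike")),
    ("TALK", ("talk", "speak", "ask ")),
    ("FLEE", ("flee", "run away", "escape")),
    ("USE_ITEM", ("use ", "drink")),
]

def keyword_fallback(text: str) -> dict:
    """Map raw text to an intent dict when LLM parsing is unavailable."""
    lowered = text.lower()
    for intent, keywords in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return {"intent": intent, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "parser_source": "keyword"}  # nothing matched
```

First match wins, so rule order doubles as a crude priority scheme.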

### `story_engine.py`

This is the main generation module.

It currently handles:

- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode

Important methods:

- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`

When to edit this file:

- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry

### `app.py`

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs

When to edit this file:

- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed

### `telemetry.py`

This file handles structured log export.

It is intentionally simple and file-based:

- one session gets one JSONL file
- one turn becomes one JSON object line

This is useful for:

- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
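
The one-session-one-file, one-turn-one-line scheme can be sketched like this; the helper name and path layout are assumptions, so see `telemetry.py` for the real implementation:

```python
import json
from pathlib import Path

def append_turn(log_dir: str, session_id: str, record: dict) -> Path:
    """Append one turn as one JSON line to the session's JSONL file."""
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)  # create the log dir lazily
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")  # keep CJK text readable
    return path
```

Appending line-by-line means a crashed session still leaves every completed turn on disk.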

## Logging and Observability

Interaction logs are written under:

- [logs/interactions](./logs/interactions)

Each turn record includes at least:

- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot

Example shape:

```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```

If you need to debug a bad interaction, the fastest path is:

1. check the log file
2. inspect `nlu_result`
3. inspect `telemetry.used_fallback`
4. inspect `state_changes`
5. inspect the post-turn snapshot
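
For the fallback check in that debug path, a quick way to score a whole session file, assuming the flat `used_fallback` field shown in the example record:

```python
import json

def fallback_rate(jsonl_lines) -> float:
    """Fraction of turns that used fallback, computed from raw JSONL lines."""
    records = [json.loads(line) for line in jsonl_lines if line.strip()]
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("used_fallback")) / len(records)
```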

## Evaluation Pipeline

Evaluation entry point:

- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)

Datasets:

- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)

Results:

- [evaluation/results](./evaluation/results)

### What each task measures

#### Intent

- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency

#### Consistency

- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`

#### Latency

- NLU latency
- generation latency
- total latency
- fallback rate

#### Branch divergence

- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
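
The core of the intent task is a labeled-examples-versus-predictions comparison. A sketch of that scoring loop; the `text`/`intent` field names are assumptions about the dataset schema, not its documented format:

```python
def intent_accuracy(examples, predict) -> float:
    """Share of labeled examples whose predicted intent matches the gold label."""
    if not examples:
        return 0.0
    correct = sum(
        1 for ex in examples if predict(ex["text"])["intent"] == ex["intent"]
    )
    return correct / len(examples)
```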

## Common Development Tasks

### Add a new intent

You will usually need to touch:

- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)

Suggested checklist:

1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples

### Add a new location, NPC, quest, or item

Most of the time you only need:

- [state_manager.py](./state_manager.py)

That file contains the initial world setup and registry-style data.

### Add more evaluation cases

Edit files under:

- [evaluation/datasets](./evaluation/datasets)

This is the easiest way to improve the report without changing runtime logic.

### Investigate a strange game turn

Check in this order:

1. the interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything

### Change UI behavior without touching gameplay

Edit:

- [app.py](./app.py)

Try not to put game rules in the UI layer.

## Environment Notes

### If `QWEN_API_KEY` is missing

- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful

### If `openai` is not installed

- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior

### If `gradio` is not installed

- the app cannot launch
- offline evaluation scripts can still be useful

## Current Known Limitations

These are the main gaps we still know about:

- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available

## Suggested Team Workflow

If multiple teammates are working in parallel, this split is usually clean:

- gameplay/state teammate
  Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
  Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
  Focus on [app.py](./app.py)
- report teammate
  Focus on `evaluation/results`, `logs/interactions`, and case-study collection

## What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`

These can directly support:

- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis

## License

MIT
 
app.py CHANGED
The diff for this file is too large to render. See raw diff
 
requirements.txt CHANGED
@@ -1,5 +1,4 @@
- gradio==4.44.0
- huggingface_hub==0.25.2
- openai==1.51.2
- python-dotenv==1.0.1
- pydantic==2.9.2
+ openai>=1.0.0
+ gradio==6.7.0
+ python-dotenv>=1.0.0
+ pydantic>=2.0.0