---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---
# StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and the logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first
## What This Repository Contains

At a high level, the project has five responsibilities:

1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation

This means the repo is not only a "game demo": it is also the evidence pipeline for the course deliverables.
## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Create `.env`

Create a `.env` file in the project root:

```env
QWEN_API_KEY=your_api_key_here
```

Optional:

```env
STORYWEAVER_LOG_DIR=logs/interactions
```
### 3. Run the app

```bash
python app.py
```

Default local URL:

- `http://localhost:7860`

### 4. Run evaluation

```bash
python evaluation/run_evaluations.py --task all --repeats 3
```

Useful variants:

```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```
## Recommended Reading Order

If you are new to the repo, read files in this order:

1. [state_manager.py](./state_manager.py)
   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
   Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
   Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
   Why: this connects the UI with the engines and also writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
   Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)
## Repository Map

```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```
Core responsibilities by file:

- [app.py](./app.py)
  Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
  Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
  API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
  Reproducible experiment runner for the report.
## System Architecture

The main runtime path is:

`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`

Two ideas matter most in this codebase:

### 1. `GameState` is the source of truth

Almost everything meaningful lives in [state_manager.py](./state_manager.py):

- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history

When changing gameplay, keep state logic here instead of scattering it across prompts and UI code.

### 2. The app is a coordinator, not the game logic

[app.py](./app.py) should mostly:

- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs

If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.
## Runtime Flow

### Text input flow

For normal text input, the path is:

1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. the UI is refreshed with story text, options, and the status panel
7. `_record_interaction_log` writes a JSONL record to disk

### Option click flow

Button clicks do not go through full free-text parsing. Instead:

1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged

This is useful because option interactions and free-text interactions share the same evaluation and observability format.
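The option-click conversion can be sketched like this (field names are assumptions based on the log format in this README, not the exact code in app.py):

```python
def option_to_intent(option_text: str, option_index: int) -> dict:
    """Convert a clicked option into the same intent shape free text produces.

    Illustrative sketch only. The key idea is that both input paths share
    one schema, so logging and evaluation can treat them uniformly; the
    parser_source value distinguishes the path in the logs.
    """
    return {
        "intent": "CUSTOM",              # options carry their own action text
        "target": None,
        "raw_input": option_text,
        "parser_source": "option_click", # vs "llm" or "keyword" for free text
        "option_index": option_index,
    }
```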
## Main Modules in More Detail

### `state_manager.py`

This file defines:

- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`

Important methods:

- `pre_validate_action`
  Rejects obviously invalid actions before calling the model.
- `check_consistency`
  Detects contradictions in proposed state changes.
- `apply_changes`
  Applies state changes and returns a readable change log.
- `validate`
  Makes sure the resulting state is legal.
- `to_prompt`
  Serializes the current game state into prompt-ready text.

When to edit this file:

- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
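To illustrate the pre-validation idea (a toy stand-in, not the real `GameState` API), an early guard can reject impossible actions before any model call is made:

```python
from dataclasses import dataclass, field

@dataclass
class MiniGameState:
    """Toy stand-in for GameState, for illustration only."""
    hp: int = 100
    inventory: set = field(default_factory=set)

    def pre_validate_action(self, intent: dict) -> tuple[bool, str]:
        """Return (allowed, reason). Runs before any LLM call."""
        if self.hp <= 0:
            return False, "Player is incapacitated."
        if intent.get("intent") == "USE_ITEM" and intent.get("target") not in self.inventory:
            return False, f"Item not in inventory: {intent.get('target')}"
        return True, "ok"
```

Deterministic guards like this are cheap, testable, and keep impossible turns out of the generation pipeline entirely.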
### `nlu_engine.py`

This file is responsible for intent recognition.

Current behavior:

- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`

Current intent labels include:

- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`

When to edit this file:

- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
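The keyword fallback can be pictured like this (a hedged sketch; the real keyword tables in nlu_engine.py are richer and cover all the labels above):

```python
# Hypothetical keyword rules: surface keyword -> intent label. Used when
# LLM parsing is unavailable or fails; CUSTOM is the catch-all.
KEYWORD_RULES = {
    "attack": "ATTACK", "fight": "ATTACK",
    "talk": "TALK", "ask": "TALK",
    "go": "MOVE", "walk": "MOVE",
    "rest": "REST", "flee": "FLEE",
}

def keyword_fallback(text: str) -> dict:
    """Return a normalized intent dict tagged with its parser_source."""
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES.items():
        if keyword in lowered:
            return {"intent": label, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "parser_source": "keyword"}
```

Because the fallback emits the same dict shape as the LLM path, downstream code never needs to know which parser ran; only `parser_source` records it.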
### `story_engine.py`

This is the main generation module.

It currently handles:

- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode

Important methods:

- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`

When to edit this file:

- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry
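The fallback contract can be sketched as follows (the field names are assumptions mirroring the log format in this README; the real logic lives in `StoryEngine._fallback_response`):

```python
def fallback_response(reason: str) -> dict:
    """Hypothetical shape of a fallback turn result.

    The important part is that every fallback records *why* it happened,
    so the fallback rate and its causes can be measured from the logs
    rather than guessed.
    """
    return {
        "story": "The path ahead is unclear. You pause and look around.",
        "options": ["Explore the area", "Check your belongings", "Rest"],
        "telemetry": {"used_fallback": True, "fallback_reason": reason},
    }
```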
### `app.py`

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs

When to edit this file:

- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed
### `telemetry.py`

This file handles structured log export.

It is intentionally simple and file-based:

- one session gets one JSONL file
- one turn becomes one JSON object line

This is useful for:

- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
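The one-file-per-session, one-line-per-turn convention can be sketched like this (illustrative only; the real writer is in telemetry.py):

```python
import json
from pathlib import Path

def append_turn(log_dir: str, session_id: str, record: dict) -> Path:
    """Append one turn as one JSON line to the session's JSONL file."""
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII game text readable in the log
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

Append-only JSONL means a crash mid-session loses at most the current turn, and any line can be parsed independently of the rest of the file.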
## Logging and Observability

Interaction logs are written under:

- [logs/interactions](./logs/interactions)

Each turn record includes at least:

- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot

Example shape:
```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```
If you need to debug a bad interaction, the fastest path is:

1. check the log file
2. inspect `nlu_result`
3. inspect the fallback metadata (`used_fallback` and its reason)
4. inspect `state_changes`
5. inspect the post-turn snapshot
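A quick triage helper for step 3 might look like this (it assumes the top-level `used_fallback` field shown in the example record; adjust the field name if the real logs nest it differently):

```python
import json
from pathlib import Path

def find_fallback_turns(log_file: str) -> list[dict]:
    """Return all turn records in a session log where the engine fell back."""
    turns = []
    for line in Path(log_file).read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        if record.get("used_fallback"):
            turns.append(record)
    return turns
```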
## Evaluation Pipeline

Evaluation entry point:

- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)

Datasets:

- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)

Results:

- [evaluation/results](./evaluation/results)

### What each task measures

#### Intent

- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency

#### Consistency

- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`

#### Latency

- NLU latency
- generation latency
- total latency
- fallback rate

#### Branch divergence

- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
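The core intent metric reduces to a simple comparison (an illustrative sketch; run_evaluations.py additionally breaks results down by `parser_source` and optionally checks targets):

```python
def intent_accuracy(examples: list[dict], predictions: list[dict]) -> float:
    """Fraction of examples whose predicted intent matches the gold label."""
    assert len(examples) == len(predictions)
    if not examples:
        return 0.0
    correct = sum(
        1 for gold, pred in zip(examples, predictions)
        if gold["intent"] == pred["intent"]
    )
    return correct / len(examples)
```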
## Common Development Tasks

### Add a new intent

You will usually need to touch:

- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)

Suggested checklist:

1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples
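For step 4, an evaluation example might look like the following (the exact schema is defined by run_evaluations.py; this is an assumed shape that mirrors the log format above, so check an existing dataset entry before copying it):

```json
{
  "text": "ask the village elder about the forest",
  "intent": "TALK",
  "target": "village elder"
}
```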
### Add a new location, NPC, quest, or item

Most of the time you only need:

- [state_manager.py](./state_manager.py)

That file contains the initial world setup and registry-style data.

### Add more evaluation cases

Edit files under:

- [evaluation/datasets](./evaluation/datasets)

This is the easiest way to improve the report without changing runtime logic.

### Investigate a strange game turn

Check in this order:

1. the interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything

### Change UI behavior without touching gameplay

Edit:

- [app.py](./app.py)

Try not to put game rules in the UI layer.
## Environment Notes

### If `QWEN_API_KEY` is missing

- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful

### If `openai` is not installed

- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior

### If `gradio` is not installed

- the app cannot launch
- offline evaluation scripts can still be useful
## Current Known Limitations

These are the main gaps we still know about:

- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available
## Suggested Team Workflow

If multiple teammates are working in parallel, this split is usually clean:

- gameplay/state teammate
  Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
  Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
  Focus on [app.py](./app.py)
- report teammate
  Focus on `evaluation/results`, `logs/interactions`, and case-study collection
## What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`

These can directly support:

- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis

## License

MIT