---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---

# StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first

## What This Repository Contains

At a high level, the project has five responsibilities:

1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation

This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.

## Quick Start

### 1. Install dependencies

```bash
pip install -r requirements.txt
```

### 2. Create `.env`

Create a `.env` file in the project root:

```env
QWEN_API_KEY=your_api_key_here
```

Optional:

```env
STORYWEAVER_LOG_DIR=logs/interactions
```

### 3. Run the app

```bash
python app.py
```

Default local URL:

- `http://localhost:7860`

### 4. Run evaluation

```bash
python evaluation/run_evaluations.py --task all --repeats 3
```

Useful variants:

```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```

## Recommended Reading Order

If you are new to the repo, read files in this order:

1. [state_manager.py](./state_manager.py)
   Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
   Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
   Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
   Why: this connects the UI with the engines and now also writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
   Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)

## Repository Map

```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```

Core responsibilities by file:

- [app.py](./app.py)
  Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
  Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
  Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
  Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
  Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
  API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
  Reproducible experiment runner for the report.

## System Architecture

The main runtime path is:

`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`

There are two ideas that matter most in this codebase:

### 1. `GameState` is the source of truth

Almost everything meaningful lives in [state_manager.py](./state_manager.py):

- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history

When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.

### 2. The app is a coordinator, not the game logic

[app.py](./app.py) should mostly:

- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs

If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.

## Runtime Flow

### Text input flow

For normal text input, the path is:

1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. the UI is refreshed with story text, options, and status panel
7. `_record_interaction_log` writes a JSONL record to disk

### Option click flow

Button clicks do not go through full free-text parsing. Instead:

1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged

This is useful because option interactions and free-text interactions now share the same evaluation and observability format.

## Main Modules in More Detail

### `state_manager.py`

This file defines:

- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`

Important methods:

- `pre_validate_action`
  Rejects obviously invalid actions before calling the model.
- `check_consistency`
  Detects contradictions in proposed state changes.
- `apply_changes`
  Applies state changes and returns a readable change log.
- `validate`
  Makes sure the resulting state is legal.
- `to_prompt`
  Serializes the current game state into prompt-ready text.

When to edit this file:

- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts

### `nlu_engine.py`

This file is responsible for intent recognition.

Current behavior:

- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`

Current intent labels include:

- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`

When to edit this file:

- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling

### `story_engine.py`

This is the main generation module. It currently handles:

- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode

Important methods:

- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`

When to edit this file:

- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry

### `app.py`

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs

When to edit this file:

- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed

### `telemetry.py`

This file handles structured log export. It is intentionally simple and file-based:

- one session gets one JSONL file
- one turn becomes one JSON object line

This is useful for:

- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation

## Logging and Observability

Interaction logs are written under:

- [logs/interactions](./logs/interactions)

Each turn record includes at least:

- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot

Example shape:

```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```

If you need to debug a bad interaction, the fastest path is:

1. check the log file
2. inspect `nlu_result`
3. inspect `telemetry.used_fallback`
4. inspect `state_changes`
5. inspect the post-turn snapshot

## Evaluation Pipeline

Evaluation entry point:

- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)

Datasets:

- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)

Results:

- [evaluation/results](./evaluation/results)

### What each task measures

#### Intent

- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency

#### Consistency

- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`

#### Latency

- NLU latency
- generation latency
- total latency
- fallback rate

#### Branch divergence

- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences

## Common Development Tasks

### Add a new intent

You will usually need to touch:

- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)

Suggested checklist:

1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples

### Add a new location, NPC, quest, or item

Most of the time you only need:

- [state_manager.py](./state_manager.py)

That file contains the initial world setup and registry-style data.
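As a rough sketch of what "registry-style data" means here (the actual structure and field names in `state_manager.py` may differ; `ITEM_REGISTRY` and `register_item` are illustrative names, not the real API), adding an item is typically one new entry plus a duplicate-id guard:

```python
# Hypothetical sketch of registry-style item data; the real registry in
# state_manager.py may use different field names or dataclasses.
ITEM_REGISTRY = {
    "healing_herb": {"type": "consumable", "effect": {"hp": 15}},
}

def register_item(item_id: str, entry: dict) -> dict:
    """Add a new item entry; reject duplicate ids so state stays consistent."""
    if item_id in ITEM_REGISTRY:
        raise ValueError(f"item id already registered: {item_id}")
    ITEM_REGISTRY[item_id] = entry
    return entry

# Example: a new equippable weapon becomes one registry entry.
register_item("iron_sword", {"type": "weapon", "slot": "hand", "attack": 7})
```

Keeping additions in one place like this is what lets `pre_validate_action` and `check_consistency` see new content without any prompt or UI changes.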
### Add more evaluation cases

Edit files under:

- [evaluation/datasets](./evaluation/datasets)

This is the easiest way to improve the report without changing runtime logic.

### Investigate a strange game turn

Check in this order:

1. the interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything

### Change UI behavior without touching gameplay

Edit:

- [app.py](./app.py)

Try not to put game rules in the UI layer.

## Environment Notes

### If `QWEN_API_KEY` is missing

- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful

### If `openai` is not installed

- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior

### If `gradio` is not installed

- the app cannot launch
- offline evaluation scripts can still be useful

## Current Known Limitations

These are the main gaps we still know about:

- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available

## Suggested Team Workflow

If multiple teammates are working in parallel, this split is usually clean:

- gameplay/state teammate
  Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
  Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
  Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
  Focus on [app.py](./app.py)
- report teammate
  Focus on `evaluation/results`, `logs/interactions`, and case-study collection

## What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`

These can directly support:

- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis

## License

MIT