---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: "3.10"
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---
# StoryWeaver
StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.
This README is written for teammates who need to:
- understand how the system is organized
- run the app locally
- know where to change prompts, rules, or UI
- collect evaluation results for the report
- debug a bad interaction without reading the whole codebase first
## What This Repository Contains
At a high level, the project has five responsibilities:
1. parse player input into structured intent
2. keep the world state consistent across turns
3. generate the next story response and options
4. expose the system through a Gradio UI
5. export logs and run reproducible evaluation
This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.
## Quick Start
### 1. Install dependencies
```bash
pip install -r requirements.txt
```
### 2. Create `.env`
Create a `.env` file in the project root:
```env
QWEN_API_KEY=your_api_key_here
```
Optional:
```env
STORYWEAVER_LOG_DIR=logs/interactions
```
### 3. Run the app
```bash
python app.py
```
Default local URL:
- `http://localhost:7860`
### 4. Run evaluation
```bash
python evaluation/run_evaluations.py --task all --repeats 3
```
Useful variants:
```bash
python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch
```
## Recommended Reading Order
If you are new to the repo, read files in this order:
1. [state_manager.py](./state_manager.py)
Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
2. [nlu_engine.py](./nlu_engine.py)
Why: this shows how raw player text becomes structured intent.
3. [story_engine.py](./story_engine.py)
Why: this is the main generation pipeline and fallback logic.
4. [app.py](./app.py)
Why: this connects the UI with the engines and writes interaction logs.
5. [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Why: this shows how we measure the system for the report.
If you only have 10 minutes, start with:
- `GameState.pre_validate_action`
- `GameState.check_consistency`
- `GameState.apply_changes`
- `NLUEngine.parse_intent`
- `StoryEngine.generate_story_stream`
- `process_user_input` in [app.py](./app.py)
## Repository Map
```text
StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/
```
Core responsibilities by file:
- [app.py](./app.py)
Gradio app, session lifecycle, UI callbacks, per-turn logging.
- [state_manager.py](./state_manager.py)
Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
- [nlu_engine.py](./nlu_engine.py)
Intent parsing. Uses LLM parsing when available and keyword fallback when not.
- [story_engine.py](./story_engine.py)
Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
- [telemetry.py](./telemetry.py)
Session metadata and JSONL interaction log export.
- [utils.py](./utils.py)
API client setup, Qwen calls, JSON extraction, retry helpers.
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Reproducible experiment runner for the report.
## System Architecture
The main runtime path is:
`Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log`
There are two ideas that matter most in this codebase:
### 1. `GameState` is the source of truth
Almost everything meaningful lives in [state_manager.py](./state_manager.py):
- player stats
- location
- time and weather
- inventory and equipment
- quests
- NPC states
- event history
When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.
### 2. The app is a coordinator, not the game logic
[app.py](./app.py) should mostly:
- receive user input
- call NLU
- call the story engine
- update the chat UI
- write telemetry logs
If a new feature changes game rules, it probably belongs in [state_manager.py](./state_manager.py) or [story_engine.py](./story_engine.py), not in the UI layer.
## Runtime Flow
### Text input flow
For normal text input, the path is:
1. `process_user_input` receives raw text from the UI
2. `NLUEngine.parse_intent` converts it into a structured intent dict
3. `GameState.pre_validate_action` blocks clearly invalid actions early
4. `StoryEngine.generate_story_stream` runs the main narrative pipeline
5. `GameState.check_consistency` and `apply_changes` update state
6. UI is refreshed with story text, options, and status panel
7. `_record_interaction_log` writes a JSONL record to disk
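The seven steps above can be sketched as one orchestrator function. This is a simplified, non-streaming illustration: the method names match the ones documented in this README, but the glue code, signatures, and record fields are assumptions, not the actual `process_user_input` implementation.

```python
import time

def process_turn(text, nlu, state, engine, log):
    """Illustrative per-turn flow mirroring steps 1-7 (no streaming)."""
    start = time.perf_counter()
    intent = nlu.parse_intent(text)                      # step 2: text -> intent dict
    ok, reason = state.pre_validate_action(intent)       # step 3: cheap guard
    if not ok:
        return {"output": reason, "used_fallback": True}
    result = engine.generate_story(intent, state)        # step 4: narrative pipeline
    issues = state.check_consistency(result["changes"])  # step 5: contradiction check...
    change_log = state.apply_changes(result["changes"])  # ...then apply changes
    record = {                                           # step 7: one JSONL record
        "user_input": text,
        "nlu_result": intent,
        "latency_ms": (time.perf_counter() - start) * 1000,
        "consistency_issues": issues,
        "state_changes": result["changes"],
        "output_text": result["text"],
    }
    log.append(record)
    return {"output": result["text"], "changes": change_log}
```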
### Option click flow
Button clicks do not go through full free-text parsing. Instead:
1. the selected option is converted to an intent-like dict
2. the story engine processes it the same way as text input
3. the result is rendered and logged
This is useful because option interactions and free-text interactions share the same evaluation and observability format.
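A minimal sketch of that conversion; every field name here is an assumption modeled on the log record shown later, not the repo's actual shape:

```python
def option_to_intent(option_text: str, option_index: int) -> dict:
    """Illustrative: wrap a clicked option in the same intent-like dict shape
    that free-text parsing produces, so both paths log identically."""
    return {
        "intent": "CUSTOM",               # the option text is taken as-is, not re-parsed
        "target": None,
        "raw_text": option_text,
        "parser_source": "option_click",  # distinguishes clicks from free text in logs
        "option_index": option_index,
    }
```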
## Main Modules in More Detail
### `state_manager.py`
This file defines:
- `PlayerState`
- `WorldState`
- `GameEvent`
- `GameState`
Important methods:
- `pre_validate_action`
Rejects obviously invalid actions before calling the model.
- `check_consistency`
Detects contradictions in proposed state changes.
- `apply_changes`
Applies state changes and returns a readable change log.
- `validate`
Makes sure the resulting state is legal.
- `to_prompt`
Serializes the current game state into prompt-ready text.
When to edit this file:
- adding new items, NPCs, quests, or locations
- adding deterministic rules
- improving consistency checks
- changing state serialization for prompts
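A deterministic rule in the spirit of `pre_validate_action` might look like this. The real method lives on `GameState`; the signature, state fields, and rules below are assumptions for illustration only.

```python
def pre_validate_action(intent: dict, state: dict) -> tuple[bool, str]:
    """Illustrative guard: reject clearly impossible actions before any model call."""
    if intent["intent"] == "USE_ITEM" and intent.get("target") not in state.get("inventory", []):
        # Cheap deterministic check; no LLM tokens spent on an impossible action
        return False, f"You don't have {intent.get('target')}."
    if intent["intent"] == "MOVE" and intent.get("target") not in state.get("exits", []):
        return False, "You can't go that way."
    return True, ""
```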
### `nlu_engine.py`
This file is responsible for intent recognition.
Current behavior:
- try LLM parsing first
- fall back to keyword rules if parsing fails
- return a normalized intent dict with `parser_source`
Current intent labels include:
- `ATTACK`
- `TALK`
- `MOVE`
- `EXPLORE`
- `USE_ITEM`
- `TRADE`
- `EQUIP`
- `REST`
- `QUEST`
- `SKILL`
- `PICKUP`
- `FLEE`
- `CUSTOM`
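The keyword fallback can be pictured like this. These rules are illustrative, not the ones in `nlu_engine.py` (and they use English keywords for readability; the real fallback presumably also handles Chinese input, which does not split on spaces):

```python
# Illustrative keyword table mapping surface words to the labels above.
KEYWORD_RULES = [
    ("attack", "ATTACK"), ("fight", "ATTACK"),
    ("talk", "TALK"), ("ask", "TALK"),
    ("go", "MOVE"), ("move", "MOVE"),
    ("look", "EXPLORE"), ("search", "EXPLORE"),
    ("use", "USE_ITEM"),
    ("rest", "REST"),
]

def keyword_fallback(text: str) -> dict:
    """Rule-based backup for when LLM parsing fails or is unavailable."""
    words = text.lower().split()
    for keyword, label in KEYWORD_RULES:
        if keyword in words:
            return {"intent": label, "parser_source": "keyword", "raw_text": text}
    # Nothing matched: hand the raw text through as a free-form action
    return {"intent": "CUSTOM", "parser_source": "keyword", "raw_text": text}
```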
When to edit this file:
- adding a new intent type
- improving keyword fallback
- adding target extraction logic
- improving low-confidence handling
### `story_engine.py`
This is the main generation module.
It currently handles:
- opening generation
- story generation for each turn
- streaming and non-streaming paths
- default/fallback outputs
- consistency-aware regeneration
- response telemetry such as fallback reason and engine mode
Important methods:
- `generate_opening_stream`
- `generate_story`
- `generate_story_stream`
- `process_option_selection_stream`
- `_fallback_response`
When to edit this file:
- changing prompts
- changing multi-stage generation logic
- changing fallback behavior
- adding generation-side telemetry
### `app.py`
This file is the UI entry point and interaction orchestrator.
Important responsibilities:
- create a new game session
- start and restart the app session
- process text input
- process option clicks
- update Gradio components
- write structured interaction logs
When to edit this file:
- changing UI flow
- adding debug panels
- changing how logs are written
- changing how outputs are displayed
### `telemetry.py`
This file handles structured log export.
It is intentionally simple and file-based:
- one session gets one JSONL file
- one turn becomes one JSON object line
This is useful for:
- report case studies
- measuring fallback rate
- debugging weird turns
- collecting examples for later evaluation
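The "one session, one file; one turn, one line" idea fits in a few lines of append-only writing. This is a sketch of the pattern, not the actual `telemetry.py` code (function name and path layout are assumptions):

```python
import json
import os

def append_turn(log_dir: str, session_id: str, record: dict) -> str:
    """One session -> one JSONL file; one turn -> one JSON object line."""
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{session_id}.jsonl")
    # Append mode: a crash mid-session never corrupts earlier turns
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```

`ensure_ascii=False` keeps Chinese story text readable in the raw log files.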
## Logging and Observability
Interaction logs are written under:
- [logs/interactions](./logs/interactions)
Each turn record includes at least:
- input source
- user input
- NLU result
- latency
- fallback metadata
- state changes
- consistency issues
- final output text
- post-turn state snapshot
Example shape:
```json
{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}
```
If you need to debug a bad interaction, the fastest path is:
1. check the log file
2. inspect `nlu_result`
3. inspect the fallback metadata (`used_fallback`)
4. inspect `state_changes`
5. inspect the post-turn snapshot
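Since each session is plain JSONL, steps 1 through 5 usually start with a few lines of ad-hoc Python. A sketch, assuming the record fields shown in the example above:

```python
import json

def load_turns(path: str) -> list[dict]:
    """Read a session's JSONL log back into a list of turn records."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def suspicious_turns(turns: list[dict]) -> list[dict]:
    # Turns that used a fallback or tripped a consistency check are the
    # usual starting point for case-study debugging
    return [t for t in turns
            if t.get("used_fallback") or t.get("consistency_issues")]
```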
## Evaluation Pipeline
Evaluation entry point:
- [evaluation/run_evaluations.py](./evaluation/run_evaluations.py)
Datasets:
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
- [evaluation/datasets/consistency.json](./evaluation/datasets/consistency.json)
- [evaluation/datasets/latency.json](./evaluation/datasets/latency.json)
- [evaluation/datasets/branch_divergence.json](./evaluation/datasets/branch_divergence.json)
Results:
- [evaluation/results](./evaluation/results)
### What each task measures
#### Intent
- labeled input -> predicted intent
- optional target matching
- parser source breakdown
- per-example latency
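The core metric reduces to a labeled-example loop; this sketch assumes dataset fields named `input` and `intent`, which may not match the actual schema in `run_evaluations.py`:

```python
def intent_accuracy(examples: list, predict) -> float:
    """Illustrative scorer: fraction of examples whose predicted intent
    label matches the gold label."""
    if not examples:
        return 0.0
    hits = sum(1 for ex in examples
               if predict(ex["input"])["intent"] == ex["intent"])
    return hits / len(examples)
```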
#### Consistency
- action guard correctness via `pre_validate_action`
- contradiction detection via `check_consistency`
#### Latency
- NLU latency
- generation latency
- total latency
- fallback rate
#### Branch divergence
- same start state, different choices
- compare resulting story text
- compare option differences
- compare state snapshot differences
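One simple way to score "how different did the stories get" is set overlap; the sketch below uses character-level Jaccard distance so it works on Chinese text without tokenization. The metric actually used by `run_evaluations.py` may well differ:

```python
def text_divergence(a: str, b: str) -> float:
    """Illustrative divergence score: 1 - Jaccard overlap of character sets.
    0.0 means identical character inventories, 1.0 means fully disjoint."""
    ca, cb = set(a), set(b)
    if not ca and not cb:
        return 0.0
    return 1 - len(ca & cb) / len(ca | cb)
```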
## Common Development Tasks
### Add a new intent
You will usually need to touch:
- [nlu_engine.py](./nlu_engine.py)
- [state_manager.py](./state_manager.py)
- [story_engine.py](./story_engine.py)
- [evaluation/datasets/intent_accuracy.json](./evaluation/datasets/intent_accuracy.json)
Suggested checklist:
1. add the label to the NLU logic
2. decide whether it needs pre-validation
3. make sure story prompts know how to handle it
4. add at least a few evaluation examples
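For step 4, a new evaluation example might look like the entry below. The field names are guesses modeled on the interaction-log format, not the dataset's actual schema; check an existing entry in `intent_accuracy.json` before copying this shape.

```json
{
  "input": "向铁匠打听城里的传闻",
  "intent": "TALK",
  "target": "铁匠"
}
```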
### Add a new location, NPC, quest, or item
Most of the time you only need:
- [state_manager.py](./state_manager.py)
That file contains the initial world setup and registry-style data.
### Add more evaluation cases
Edit files under:
- [evaluation/datasets](./evaluation/datasets)
This is the easiest way to improve the report without changing runtime logic.
### Investigate a strange game turn
Check in this order:
1. interaction log under `logs/interactions`
2. `parser_source` in the NLU result
3. `telemetry` in the final story result
4. whether `pre_validate_action` rejected or allowed the turn
5. whether `check_consistency` flagged anything
### Change UI behavior without touching gameplay
Edit:
- [app.py](./app.py)
Try not to put game rules in the UI layer.
## Environment Notes
### If `QWEN_API_KEY` is missing
- warning logs will appear
- some paths will still run through fallback logic
- evaluation can still execute, but model-quality conclusions are not meaningful
### If `openai` is not installed
- the repo can still import in some cases because the client is lazily initialized
- full Qwen generation will not work
- evaluation scripts will mostly reflect fallback behavior
### If `gradio` is not installed
- the app cannot launch
- offline evaluation scripts can still be useful
## Current Known Limitations
These are the main gaps we still know about:
- some item and equipment effects are stored as metadata but not fully executed as deterministic rules
- combat and trade are still more prompt-driven than rule-driven
- branch divergence is much more meaningful with a real model than in fallback-only mode
- evaluation quality depends on whether the real model environment is available
## Suggested Team Workflow
If multiple teammates are working in parallel, this split is usually clean:
- gameplay/state teammate
Focus on [state_manager.py](./state_manager.py)
- prompt/generation teammate
Focus on [story_engine.py](./story_engine.py)
- NLU/evaluation teammate
Focus on [nlu_engine.py](./nlu_engine.py) and [evaluation](./evaluation)
- UI/demo teammate
Focus on [app.py](./app.py)
- report teammate
Focus on `evaluation/results`, `logs/interactions`, and case-study collection
## What To Use in the Final Report
For the course report, the most useful artifacts from this repo are:
- evaluation JSON outputs under `evaluation/results`
- interaction logs under `logs/interactions`
- dataset files under `evaluation/datasets`
- readable state transitions from `change_log`
- fallback metadata from `telemetry`
These can directly support:
- experiment setup
- metric definition
- result tables
- success cases
- failure case analysis
## License
MIT