---
title: StoryWeaver
emoji: 📖
colorFrom: red
colorTo: purple
sdk: gradio
sdk_version: 4.43.0
app_file: app.py
python_version: '3.10'
pinned: false
license: mit
short_description: Interactive NLP story engine with evaluation and logging
---

StoryWeaver

StoryWeaver is an interactive text-adventure system built for our NLP course project. The repo is structured as an engineering project first and a demo second: it contains the playable app, the state-management core, evaluation scripts, and logging utilities needed for report writing and team collaboration.

This README is written for teammates who need to:

  • understand how the system is organized
  • run the app locally
  • know where to change prompts, rules, or UI
  • collect evaluation results for the report
  • debug a bad interaction without reading the whole codebase first

What This Repository Contains

At a high level, the project has five responsibilities:

  1. parse player input into structured intent
  2. keep the world state consistent across turns
  3. generate the next story response and options
  4. expose the system through a Gradio UI
  5. export logs and run reproducible evaluation

This means the repo is not only a "game demo". It is also the evidence pipeline for the course deliverables.

Quick Start

1. Install dependencies

pip install -r requirements.txt

2. Create .env

Create a .env file in the project root:

QWEN_API_KEY=your_api_key_here

Optional:

STORYWEAVER_LOG_DIR=logs/interactions

3. Run the app

python app.py

Default local URL:

  • http://localhost:7860

4. Run evaluation

python evaluation/run_evaluations.py --task all --repeats 3

Useful variants:

python evaluation/run_evaluations.py --task intent
python evaluation/run_evaluations.py --task consistency
python evaluation/run_evaluations.py --task latency --repeats 5
python evaluation/run_evaluations.py --task branch

Recommended Reading Order

If you are new to the repo, read files in this order:

  1. state_manager.py Why: this is the single source of truth for player state, world state, quests, items, consistency checks, and state updates.
  2. nlu_engine.py Why: this shows how raw player text becomes structured intent.
  3. story_engine.py Why: this is the main generation pipeline and fallback logic.
  4. app.py Why: this connects the UI with the engines and now also writes interaction logs.
  5. evaluation/run_evaluations.py Why: this shows how we measure the system for the report.

If you only have 10 minutes, start with:

  • GameState.pre_validate_action
  • GameState.check_consistency
  • GameState.apply_changes
  • NLUEngine.parse_intent
  • StoryEngine.generate_story_stream
  • process_user_input in app.py

Repository Map

StoryWeaver/
|-- app.py
|-- nlu_engine.py
|-- story_engine.py
|-- state_manager.py
|-- telemetry.py
|-- utils.py
|-- requirements.txt
|-- evaluation/
|   |-- run_evaluations.py
|   |-- datasets/
|   `-- results/
`-- logs/
    `-- interactions/

Core responsibilities by file:

  • app.py Gradio app, session lifecycle, UI callbacks, per-turn logging.
  • state_manager.py Player/world models, item registry, NPC registry, quest registry, state validation, consistency checks, change application.
  • nlu_engine.py Intent parsing. Uses LLM parsing when available and keyword fallback when not.
  • story_engine.py Opening generation, main story generation, option generation, stream handling, fallback handling, telemetry tags.
  • telemetry.py Session metadata and JSONL interaction log export.
  • utils.py API client setup, Qwen calls, JSON extraction, retry helpers.
  • evaluation/run_evaluations.py Reproducible experiment runner for the report.

System Architecture

The main runtime path is:

Player Input -> NLU -> Validation -> Story Generation -> State Update -> UI Output -> Interaction Log

There are two ideas that matter most in this codebase:

1. GameState is the source of truth

Almost everything meaningful lives in state_manager.py:

  • player stats
  • location
  • time and weather
  • inventory and equipment
  • quests
  • NPC states
  • event history

When changing gameplay, try to keep state logic here instead of scattering it across prompts and UI code.

2. The app is a coordinator, not the game logic

app.py should mostly:

  • receive user input
  • call NLU
  • call the story engine
  • update the chat UI
  • write telemetry logs

If a new feature changes game rules, it probably belongs in state_manager.py or story_engine.py, not in the UI layer.

Runtime Flow

Text input flow

For normal text input, the path is:

  1. process_user_input receives raw text from the UI
  2. NLUEngine.parse_intent converts it into a structured intent dict
  3. GameState.pre_validate_action blocks clearly invalid actions early
  4. StoryEngine.generate_story_stream runs the main narrative pipeline
  5. GameState.check_consistency and apply_changes update state
  6. UI is refreshed with story text, options, and status panel
  7. _record_interaction_log writes a JSONL record to disk
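The turn loop above can be sketched as a small coordinator. Everything here is illustrative: run_turn and its callback parameters are invented stand-ins for the real functions in app.py, wired in the same step order.

```python
import json
import time

def run_turn(raw_text, parse_intent, pre_validate, generate, apply_changes, log_sink):
    """One turn: NLU -> validation -> generation -> state update -> log.

    The callbacks are hypothetical stand-ins for NLUEngine, GameState,
    and StoryEngine; only the ordering mirrors the real pipeline.
    """
    start = time.perf_counter()
    intent = parse_intent(raw_text)            # step 2: structured intent dict
    ok, reason = pre_validate(intent)          # step 3: cheap guard, no model call
    if not ok:
        output, changes = f"That doesn't work: {reason}", {}
    else:
        output = generate(intent)              # step 4: narrative pipeline
        changes = apply_changes(intent)        # step 5: state update
    record = {                                 # step 7: one JSON object per turn
        "user_input": raw_text,
        "nlu_result": intent,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "state_changes": changes,
        "output_text": output,
    }
    log_sink(json.dumps(record, ensure_ascii=False))
    return output
```

Keeping the coordinator this thin is what lets the UI layer stay free of game rules.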

Option click flow

Button clicks do not go through full free-text parsing. Instead:

  1. the selected option is converted to an intent-like dict
  2. the story engine processes it the same way as text input
  3. the result is rendered and logged

This is useful because option interactions and free-text interactions now share the same evaluation and observability format.
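A minimal sketch of that conversion, with invented field names (the actual option structure in app.py may differ):

```python
def option_to_intent(option_text: str, option_meta: dict) -> dict:
    """Map a clicked option to the same dict shape the NLU produces,
    so downstream code handles both input sources identically."""
    return {
        "intent": option_meta.get("intent", "CUSTOM"),
        "target": option_meta.get("target"),
        "raw_text": option_text,
        "parser_source": "option_click",  # lets logs distinguish input source
    }
```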

Main Modules in More Detail

state_manager.py

This file defines:

  • PlayerState
  • WorldState
  • GameEvent
  • GameState

Important methods:

  • pre_validate_action Rejects obviously invalid actions before calling the model.
  • check_consistency Detects contradictions in proposed state changes.
  • apply_changes Applies state changes and returns a readable change log.
  • validate Makes sure the resulting state is legal.
  • to_prompt Serializes the current game state into prompt-ready text.
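The call order of these methods can be illustrated with a toy class. The fields and rules below are made up for the example; the real GameState in state_manager.py is far richer.

```python
class ToyGameState:
    """Illustrative only: shows the pre_validate -> check -> apply sequence."""

    def __init__(self):
        self.state = {"hp": 10, "location": "village", "inventory": []}

    def pre_validate_action(self, intent):
        # Reject obviously invalid actions before any model call.
        if intent["intent"] == "USE_ITEM" and intent["target"] not in self.state["inventory"]:
            return False, f"player does not have {intent['target']}"
        return True, ""

    def check_consistency(self, changes):
        # Detect contradictions in proposed changes before applying them.
        issues = []
        if self.state["hp"] + changes.get("hp_delta", 0) < 0:
            issues.append("hp cannot drop below zero")
        return issues

    def apply_changes(self, changes):
        # Apply changes and return a readable change log.
        log = []
        if "hp_delta" in changes:
            self.state["hp"] += changes["hp_delta"]
            log.append(f"hp -> {self.state['hp']}")
        if "location" in changes:
            self.state["location"] = changes["location"]
            log.append(f"location -> {changes['location']}")
        return log
```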

When to edit this file:

  • adding new items, NPCs, quests, or locations
  • adding deterministic rules
  • improving consistency checks
  • changing state serialization for prompts

nlu_engine.py

This file is responsible for intent recognition.

Current behavior:

  • try LLM parsing first
  • fall back to keyword rules if parsing fails
  • return a normalized intent dict with parser_source

Current intent labels include:

  • ATTACK
  • TALK
  • MOVE
  • EXPLORE
  • USE_ITEM
  • TRADE
  • EQUIP
  • REST
  • QUEST
  • SKILL
  • PICKUP
  • FLEE
  • CUSTOM
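The keyword fallback path can be sketched like this. The rule table is illustrative, not the real one in nlu_engine.py, and naive substring matching is a known weakness of this style of fallback:

```python
# Hypothetical rule table: first matching label wins.
KEYWORD_RULES = [
    ("ATTACK", ("attack", "fight", "strike")),
    ("MOVE", ("go to", "walk", "enter")),
    ("TALK", ("talk", "speak", "ask")),
    ("PICKUP", ("take", "pick up", "grab")),
]

def keyword_fallback(text: str) -> dict:
    """Return a normalized intent dict tagged with parser_source,
    mirroring the shape the LLM parsing path produces."""
    lowered = text.lower()
    for label, keywords in KEYWORD_RULES:
        if any(k in lowered for k in keywords):
            return {"intent": label, "raw_text": text, "parser_source": "keyword"}
    return {"intent": "CUSTOM", "raw_text": text, "parser_source": "keyword"}
```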

When to edit this file:

  • adding a new intent type
  • improving keyword fallback
  • adding target extraction logic
  • improving low-confidence handling

story_engine.py

This is the main generation module.

It currently handles:

  • opening generation
  • story generation for each turn
  • streaming and non-streaming paths
  • default/fallback outputs
  • consistency-aware regeneration
  • response telemetry such as fallback reason and engine mode

Important methods:

  • generate_opening_stream
  • generate_story
  • generate_story_stream
  • process_option_selection_stream
  • _fallback_response

When to edit this file:

  • changing prompts
  • changing multi-stage generation logic
  • changing fallback behavior
  • adding generation-side telemetry

app.py

This file is the UI entry point and interaction orchestrator.

Important responsibilities:

  • create a new game session
  • start and restart the app session
  • process text input
  • process option clicks
  • update Gradio components
  • write structured interaction logs

When to edit this file:

  • changing UI flow
  • adding debug panels
  • changing how logs are written
  • changing how outputs are displayed

telemetry.py

This file handles structured log export.

It is intentionally simple and file-based:

  • one session gets one JSONL file
  • one turn becomes one JSON object line
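That convention fits in a few lines. This is a sketch, not the real telemetry.py API; append_turn is an invented name:

```python
import json
from pathlib import Path

def append_turn(log_dir: str, session_id: str, record: dict) -> Path:
    """Append one turn as one JSON line to the session's JSONL file."""
    path = Path(log_dir) / f"{session_id}.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path
```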

This is useful for:

  • report case studies
  • measuring fallback rate
  • debugging weird turns
  • collecting examples for later evaluation

Logging and Observability

Interaction logs are written under logs/interactions/ by default (override with STORYWEAVER_LOG_DIR), one JSONL file per session.

Each turn record includes at least:

  • input source
  • user input
  • NLU result
  • latency
  • fallback metadata
  • state changes
  • consistency issues
  • final output text
  • post-turn state snapshot

Example shape:

{
  "timestamp": "2026-03-14T18:55:00",
  "session_id": "sw-20260314-185500-ab12cd34",
  "turn_index": 3,
  "input_source": "text_input",
  "user_input": "和村长老伯谈谈最近森林里的怪事",
  "nlu_result": {
    "intent": "TALK",
    "target": "村长老伯",
    "parser_source": "llm"
  },
  "latency_ms": 842.13,
  "used_fallback": false,
  "state_changes": {},
  "output_text": "...",
  "post_turn_snapshot": {
    "location": "村庄广场"
  }
}

If you need to debug a bad interaction, the fastest path is:

  1. check the log file
  2. inspect nlu_result
  3. inspect telemetry.used_fallback
  4. inspect state_changes
  5. inspect the post-turn snapshot
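A small helper along these lines (not part of the repo) can pull the suspicious turns out of a session file in one pass:

```python
import json

def suspicious_turns(lines):
    """Given JSONL lines from one session, return (turn_index, intent)
    for turns that used fallback or raised consistency issues."""
    out = []
    for line in lines:
        rec = json.loads(line)
        if rec.get("used_fallback") or rec.get("consistency_issues"):
            out.append((rec.get("turn_index"), rec.get("nlu_result", {}).get("intent")))
    return out
```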

Evaluation Pipeline

Evaluation entry point:

evaluation/run_evaluations.py

Datasets:

evaluation/datasets/

Results:

evaluation/results/

What each task measures

Intent

  • labeled input -> predicted intent
  • optional target matching
  • parser source breakdown
  • per-example latency

Consistency

  • action guard correctness via pre_validate_action
  • contradiction detection via check_consistency

Latency

  • NLU latency
  • generation latency
  • total latency
  • fallback rate

Branch divergence

  • same start state, different choices
  • compare resulting story text
  • compare option differences
  • compare state snapshot differences
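Comparing two post-turn snapshots reduces to a shallow dict diff; the helper below is illustrative, and the snapshot keys are examples:

```python
def snapshot_diff(a: dict, b: dict) -> dict:
    """Return {key: (value_in_a, value_in_b)} for every top-level key
    where the two snapshots disagree."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}
```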

Common Development Tasks

Add a new intent

You will usually need to touch:

  • nlu_engine.py for the new label and its fallback keywords
  • state_manager.py if the intent needs pre-validation or new rules
  • story_engine.py if prompts must handle the intent explicitly
  • evaluation/datasets/ for labeled examples of the new intent

Suggested checklist:

  1. add the label to the NLU logic
  2. decide whether it needs pre-validation
  3. make sure story prompts know how to handle it
  4. add at least a few evaluation examples

Add a new location, NPC, quest, or item

Most of the time you only need:

state_manager.py

That file contains the initial world setup and registry-style data.

Add more evaluation cases

Edit files under:

evaluation/datasets/

This is the easiest way to improve the report without changing runtime logic.

Investigate a strange game turn

Check in this order:

  1. interaction log under logs/interactions
  2. parser_source in the NLU result
  3. telemetry in the final story result
  4. whether pre_validate_action rejected or allowed the turn
  5. whether check_consistency flagged anything

Change UI behavior without touching gameplay

Edit:

app.py

Try not to put game rules in the UI layer.

Environment Notes

If QWEN_API_KEY is missing

  • warning logs will appear
  • some paths will still run through fallback logic
  • evaluation can still execute, but model-quality conclusions are not meaningful

If openai is not installed

  • the repo can still import in some cases because the client is lazily initialized
  • full Qwen generation will not work
  • evaluation scripts will mostly reflect fallback behavior

If gradio is not installed

  • the app cannot launch
  • offline evaluation scripts can still be useful

Current Known Limitations

These are the main gaps we still know about:

  • some item and equipment effects are stored as metadata but not fully executed as deterministic rules
  • combat and trade are still more prompt-driven than rule-driven
  • branch divergence is much more meaningful with a real model than in fallback-only mode
  • evaluation quality depends on whether the real model environment is available

Suggested Team Workflow

If multiple teammates are working in parallel, a split along module boundaries is usually clean:

  • one person on state_manager.py: registries, rules, and consistency checks
  • one person on nlu_engine.py and story_engine.py: parsing, prompts, and generation
  • one person on app.py and telemetry.py: UI flow and logging
  • one person on evaluation/: datasets, the runner, and result analysis

What To Use in the Final Report

For the course report, the most useful artifacts from this repo are:

  • evaluation JSON outputs under evaluation/results
  • interaction logs under logs/interactions
  • dataset files under evaluation/datasets
  • readable state transitions from change_log
  • fallback metadata from telemetry

These can directly support:

  • experiment setup
  • metric definition
  • result tables
  • success cases
  • failure case analysis

License

MIT