
Context Window FAQ

Q: What is the maximum context size?

Model limit: 32,768 tokens (Qwen 3 4B)

Practical limit: ~1,000 tokens for acceptable performance on CPU

Q: Is it linear slowdown (4x context = 4x slower)?

No! It's worse than linear: quadratic, to be precise.

Context         Time    Slowdown
500 tokens      2s      1x (baseline)
1,000 tokens    5s      2.5x
2,000 tokens    15s     7.5x
4,000 tokens    60s     30x
8,000 tokens    240s    120x

Why? Transformer attention is O(n²) - every token attends to every other token.
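For intuition, here's a toy latency model in Python. The coefficients are eyeballed from the table above and are illustrative only, not measurements: the linear term stands in for per-token generation cost, the quadratic term for attention.

def estimate_seconds(tokens: int) -> float:
    # Coefficients eyeballed from the table above; illustrative only.
    linear = 0.0023 * tokens          # per-token generation cost
    quadratic = 3.5e-6 * tokens ** 2  # attention cost, O(n^2)
    return linear + quadratic

for n in (500, 1000, 2000, 4000, 8000):
    print(f"{n:>5} tokens -> ~{estimate_seconds(n):.0f}s")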

Q: My demo timed out after 100 seconds. Why?

Your context was too large BEFORE pruning was implemented.

Without pruning:

  • Turn 20: ~40 messages = ~2,500 tokens = 30s response
  • Turn 40: ~80 messages = ~4,500 tokens = 180s response ⚠️ TIMEOUT!

With pruning (now):

  • Turn 20: ~20 messages = ~700 tokens = 4s response ✅
  • Turn 100: ~20 messages = ~700 tokens = 4s response ✅

Q: How large is the context now?

With your current pruning settings:

MAX_MESSAGE_HISTORY = 20
RECENT_MESSAGES_FOR_PROMPT = 8

Typical context breakdown:

  • System prompt: ~100 tokens
  • Game state: ~125 tokens
  • Recent 8 messages: ~200 tokens
  • Instructions: ~100 tokens
  • TOTAL: ~525 tokens typical, up to ~850 with a long game state or verbose messages

That's only about 2-3% of the model's 32K capacity, so responses should stay fast.
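For reference, a minimal sketch of how those pieces might be assembled and sized. build_prompt and estimate_tokens are hypothetical names, not the project's actual API, and the ~4 characters per token figure is a common rough heuristic for English text:

def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for English text.
    return len(text) // 4

def build_prompt(system, state, messages, instructions, recent=8):
    # Keep only the most recent messages, mirroring RECENT_MESSAGES_FOR_PROMPT.
    parts = [system, state, *messages[-recent:], instructions]
    prompt = "\n\n".join(parts)
    print(f"Prompt size: ~{estimate_tokens(prompt)} tokens")
    return prompt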

Q: Will pruning solve my timeout issues?

Yes, but...

  1. Pruning prevents FUTURE timeouts by keeping context small
  2. Old sessions may still carry a huge context; restart Gradio to clear them
  3. Model loading adds a one-time delay: the first request after a restart takes 5-10s longer

Q: How do I know if it's working?

Check logs for pruning messages:

tail -f logs/gradio.log | grep "Pruned"

# You should see:
# 📝 Pruned 10 messages. History now: 20 messages
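If you want to emit the same kind of message from your own code, here's a sketch using Python's standard logging module (that the project logs this way is an assumption):

import logging

logger = logging.getLogger("gradio")

def log_prune(pruned_count: int, history: list) -> None:
    # Mirrors the log line shown above.
    logger.info("📝 Pruned %d messages. History now: %d messages",
                pruned_count, len(history))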

Q: What if it's still slow after pruning?

Run diagnostic:

python3 scripts/diagnose_performance.py

This will test:

  1. Ollama is running
  2. Inference speed (should be < 5s)
  3. Context settings
  4. GameMaster initialization

If the inference test takes > 15s, your hardware is likely too slow for interactive use.
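If you'd rather spot-check by hand, a minimal sketch of the same timing test against Ollama's REST API. The localhost:11434 endpoint is Ollama's default; the model tag here is an assumption, so use whatever settings.py points at:

import time
import requests

def time_short_inference(model="qwen3:4b"):  # model tag is an assumption
    start = time.monotonic()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": "Say hi.", "stream": False},
        timeout=120,
    )
    resp.raise_for_status()
    return time.monotonic() - start

elapsed = time_short_inference()
print(f"Inference took {elapsed:.1f}s -> {'OK' if elapsed < 5 else 'slow'}")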

Q: Can I make it faster?

Option 1: Use smaller model (fastest)

ollama pull qwen2.5:3b

Then update settings.py:

OLLAMA_MODEL_NAME = "qwen2.5:3b"

Option 2: Aggressive pruning

MAX_MESSAGE_HISTORY = 12
RECENT_MESSAGES_FOR_PROMPT = 6

Option 3: GPU acceleration (requires CUDA/ROCm)

  • RTX 3090: 10-30x faster
  • Apple M-series: 5x faster with Metal

Q: Will I lose game history with pruning?

No! Three-tier memory preserves everything:

  1. Short-term (message_history): Last 20 messages for immediate context
  2. Medium-term (conversation_summary): Compressed summaries of old messages
  3. Long-term (session.notes): Key events permanently recorded

Example summary:

⚔️ Combat: Defeated 2 goblins and 1 orc
🗺️ Travel: Arrived at Riverside Tavern
💰 Trade: Bought 3 healing potions for 150 gold
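A minimal sketch of how the short- and medium-term tiers could interact during pruning. message_history and conversation_summary are the names used above; the summarize helper is hypothetical:

MAX_MESSAGE_HISTORY = 20

def prune(message_history, conversation_summary, summarize):
    # Move overflow messages into the compressed medium-term tier.
    if len(message_history) <= MAX_MESSAGE_HISTORY:
        return message_history, conversation_summary
    overflow = message_history[:-MAX_MESSAGE_HISTORY]
    conversation_summary += "\n" + summarize(overflow)
    return message_history[-MAX_MESSAGE_HISTORY:], conversation_summary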

Q: How do I restart with clean slate?

# Stop Gradio
./stop_gradio.sh

# Clear old character sessions (optional)
rm -rf characters/*.json

# Restart
./start_gradio.sh

Q: Can I check context size in real-time?

Add to your code:

print(f"Context size: ~{len(gm.message_history) * 40} tokens")

Or check the debug logs when DEBUG_MODE = True.
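The 40-tokens-per-message constant is a guess; a character-based estimate (~4 characters per token for English) usually tracks closer. A sketch, assuming each entry in gm.message_history exposes a .content string (an assumption about the message type):

def estimate_context_tokens(messages):
    # ~4 characters per token is a common rough heuristic for English.
    return sum(len(m.content) for m in messages) // 4

print(f"Context size: ~{estimate_context_tokens(gm.message_history)} tokens")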

Q: What's the sweet spot for settings?

For 4B model on CPU:

MAX_MESSAGE_HISTORY = 20
RECENT_MESSAGES_FOR_PROMPT = 8
OLLAMA_TIMEOUT = 120

For 3B model (faster):

MAX_MESSAGE_HISTORY = 30
RECENT_MESSAGES_FOR_PROMPT = 12  
OLLAMA_TIMEOUT = 60

For 7B+ model (slower):

MAX_MESSAGE_HISTORY = 12
RECENT_MESSAGES_FOR_PROMPT = 6
OLLAMA_TIMEOUT = 180

Q: Is 32K context really usable?

Technically yes, practically no.

  • 32K tokens would take ~20-30 minutes per response on CPU
  • Even on high-end GPU: ~30-60 seconds
  • Only useful for batch processing, not interactive games

Keep it under 1K tokens for interactive use!