Spaces:
Build error
feat: Add D&D RAG system with ChromaDB integration
Browse filesImplements a Retrieval Augmented Generation system for D&D 5e content.
## What's New
### Core Infrastructure
- Configuration system (settings.py) for all parameters
- Base parser classes for extensible content parsing
- Base chunker classes for optimized RAG retrieval
- Unified ChromaDB manager for vector database operations
### Data Loading
- Initialize script to load spells, monsters, and classes
- Adapts proven parsing logic from existing notebooks
- Creates 423 chunks across 4 collections
- Support for selective loading (--only flag)
### Testing
- Comprehensive test suite for search functionality
- Tests spell, monster, and class searches
- Validates cross-collection queries
- Verified with actual data (86 spells, 332 monsters, 5 classes)
### Documentation
- Complete README with installation guide
- Step-by-step usage instructions
- Troubleshooting section
- Progress tracking in plan_progress.md
## Verified Features
โ
Semantic search across all D&D content
โ
ChromaDB persistence
โ
Sentence transformer embeddings (all-MiniLM-L6-v2)
โ
Cross-collection queries
โ
423 chunks successfully indexed
## Next Phase
- Query interface with entity recognition
- GM dialogue system with Ollama integration
- Character creation system
๐ค Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
- .gitignore +38 -0
- README.md +314 -14
- dnd_rag_system/__init__.py +0 -0
- dnd_rag_system/config/__init__.py +0 -0
- dnd_rag_system/config/settings.py +246 -0
- dnd_rag_system/core/__init__.py +0 -0
- dnd_rag_system/core/base_chunker.py +384 -0
- dnd_rag_system/core/base_parser.py +345 -0
- dnd_rag_system/core/chroma_manager.py +432 -0
- dnd_rag_system/parsers/__init__.py +0 -0
- dnd_rag_system/parsers/spell_parser.py +490 -0
- dnd_rag_system/systems/__init__.py +0 -0
- initialize_rag.py +423 -0
- plan_progress.md +290 -0
- requirements.txt +29 -0
- test_rag_search.py +187 -0
|
@@ -0,0 +1,38 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# Python
|
| 2 |
+
__pycache__/
|
| 3 |
+
*.py[cod]
|
| 4 |
+
*$py.class
|
| 5 |
+
*.so
|
| 6 |
+
.Python
|
| 7 |
+
env/
|
| 8 |
+
venv/
|
| 9 |
+
ENV/
|
| 10 |
+
*.egg-info/
|
| 11 |
+
dist/
|
| 12 |
+
build/
|
| 13 |
+
|
| 14 |
+
# ChromaDB - vector database (regenerate with initialize_rag.py)
|
| 15 |
+
chromadb/
|
| 16 |
+
|
| 17 |
+
# IDE
|
| 18 |
+
.idea/
|
| 19 |
+
.vscode/
|
| 20 |
+
.DS_Store
|
| 21 |
+
*.swp
|
| 22 |
+
*.swo
|
| 23 |
+
|
| 24 |
+
# Claude Code
|
| 25 |
+
.claude/
|
| 26 |
+
|
| 27 |
+
# Logs
|
| 28 |
+
*.log
|
| 29 |
+
dnd_rag_system.log
|
| 30 |
+
|
| 31 |
+
# Jupyter
|
| 32 |
+
.ipynb_checkpoints/
|
| 33 |
+
|
| 34 |
+
# Data files (too large for git)
|
| 35 |
+
*.pdf
|
| 36 |
+
|
| 37 |
+
# Environment
|
| 38 |
+
.env
|
|
@@ -1,22 +1,322 @@
|
|
| 1 |
-
|
| 2 |
|
| 3 |
-
|
| 4 |
-
db called - chromadb and a collection called spell_rag_v2
|
| 5 |
|
| 6 |
-
|
| 7 |
-
Some extra parsing will need to be done here as it adds non monster text as monsters etc.
|
| 8 |
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
|
|
|
|
|
|
| 12 |
|
| 13 |
-
|
| 14 |
-
ollama run hf.co/Chun121/Qwen3-4B-RPG-Roleplay-V2:Q4_K_M
|
| 15 |
|
| 16 |
-
|
| 17 |
|
| 18 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# D&D RAG System
|
| 2 |
|
| 3 |
+
An AI-powered Dungeon Master assistant using Retrieval Augmented Generation (RAG) with D&D 5e content.
|
|
|
|
| 4 |
|
| 5 |
+
## ๐ฏ Features
|
|
|
|
| 6 |
|
| 7 |
+
- **Semantic Search** across D&D spells, monsters, classes, and races
|
| 8 |
+
- **RAG-Enhanced GM Dialogue** with accurate rule retrieval
|
| 9 |
+
- **Character Creation** system
|
| 10 |
+
- **ChromaDB** vector database for fast retrieval
|
| 11 |
+
- **Ollama Integration** for local LLM inference
|
| 12 |
|
| 13 |
+
## ๐ Quick Start Guide
|
|
|
|
| 14 |
|
| 15 |
+
### Prerequisites
|
| 16 |
|
| 17 |
+
- Python 3.8 or higher
|
| 18 |
+
- pip (Python package manager)
|
| 19 |
+
- The following data files in the project root:
|
| 20 |
+
- `spells.txt`
|
| 21 |
+
- `all_spells.txt` (optional)
|
| 22 |
+
- `extracted_monsters.txt`
|
| 23 |
+
- `extracted_classes.txt`
|
| 24 |
|
| 25 |
+
### Installation Steps
|
| 26 |
|
| 27 |
+
#### 1. Install Python Dependencies
|
| 28 |
+
|
| 29 |
+
```bash
|
| 30 |
+
pip install -r requirements.txt
|
| 31 |
+
```
|
| 32 |
+
|
| 33 |
+
This installs:
|
| 34 |
+
- `chromadb` - Vector database
|
| 35 |
+
- `sentence-transformers` - Embedding models
|
| 36 |
+
- `pdfplumber` - PDF parsing (if needed)
|
| 37 |
+
- `ollama` - LLM client (for GM dialogue)
|
| 38 |
+
- Additional utilities
|
| 39 |
+
|
| 40 |
+
**Expected time:** 2-5 minutes (downloads ~500MB of models on first run)
|
| 41 |
+
|
| 42 |
+
#### 2. Verify Installation
|
| 43 |
+
|
| 44 |
+
```bash
|
| 45 |
+
python -c "import chromadb; import sentence_transformers; print('โ All dependencies installed')"
|
| 46 |
+
```
|
| 47 |
+
|
| 48 |
+
If this prints `โ All dependencies installed`, you're ready!
|
| 49 |
+
|
| 50 |
+
### Running the System
|
| 51 |
+
|
| 52 |
+
#### Step 1: Initialize the RAG Database
|
| 53 |
+
|
| 54 |
+
Load all D&D content into ChromaDB:
|
| 55 |
+
|
| 56 |
+
```bash
|
| 57 |
+
python initialize_rag.py
|
| 58 |
+
```
|
| 59 |
+
|
| 60 |
+
**What this does:**
|
| 61 |
+
- Parses spells from `spells.txt` (~86 spells)
|
| 62 |
+
- Parses monsters from `extracted_monsters.txt` (~332 monsters)
|
| 63 |
+
- Parses classes from `extracted_classes.txt` (~5 classes)
|
| 64 |
+
- Creates 4 ChromaDB collections
|
| 65 |
+
- Generates embeddings for semantic search
|
| 66 |
+
- Shows statistics
|
| 67 |
+
|
| 68 |
+
**Expected output:**
|
| 69 |
+
```
|
| 70 |
+
๐ฒ D&D RAG SYSTEM INITIALIZATION
|
| 71 |
+
...
|
| 72 |
+
โ
Total: 423 chunks loaded into ChromaDB
|
| 73 |
+
๐ Initialization complete!
|
| 74 |
+
```
|
| 75 |
+
|
| 76 |
+
**Time:** ~30 seconds on first run (downloads embedding model), ~5 seconds on subsequent runs
|
| 77 |
+
|
| 78 |
+
**Options:**
|
| 79 |
+
```bash
|
| 80 |
+
python initialize_rag.py --clear # Clear existing data and reload
|
| 81 |
+
python initialize_rag.py --only spells # Load only spells
|
| 82 |
+
python initialize_rag.py --only monsters,classes # Load specific collections
|
| 83 |
+
```
|
| 84 |
+
|
| 85 |
+
#### Step 2: Verify System is Working
|
| 86 |
+
|
| 87 |
+
Run the test suite to verify searches work:
|
| 88 |
+
|
| 89 |
+
```bash
|
| 90 |
+
python test_rag_search.py
|
| 91 |
+
```
|
| 92 |
+
|
| 93 |
+
**What this tests:**
|
| 94 |
+
- โ
Spell searches ("fireball spell", "cure wounds", etc.)
|
| 95 |
+
- โ
Monster searches ("goblin", "dragon fire breath", etc.)
|
| 96 |
+
- โ
Class searches ("wizard spellcasting", "fighter extra attack", etc.)
|
| 97 |
+
- โ
Cross-collection searches ("fire damage" across all content)
|
| 98 |
+
|
| 99 |
+
**Expected output:**
|
| 100 |
+
```
|
| 101 |
+
๐งช D&D RAG SEARCH TEST SUITE
|
| 102 |
+
...
|
| 103 |
+
โ
TEST SUITE COMPLETE
|
| 104 |
+
```
|
| 105 |
+
|
| 106 |
+
If all tests pass, your RAG system is fully operational! ๐
|
| 107 |
+
|
| 108 |
+
#### Step 3: Run Interactive Searches (Optional)
|
| 109 |
+
|
| 110 |
+
Test your own queries:
|
| 111 |
+
|
| 112 |
+
```bash
|
| 113 |
+
python -c "
|
| 114 |
+
from dnd_rag_system.core.chroma_manager import ChromaDBManager
|
| 115 |
+
from dnd_rag_system.config import settings
|
| 116 |
+
|
| 117 |
+
db = ChromaDBManager()
|
| 118 |
+
results = db.search(settings.COLLECTION_NAMES['spells'], 'healing spell', n_results=3)
|
| 119 |
+
|
| 120 |
+
print('Top 3 Healing Spells:')
|
| 121 |
+
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
|
| 122 |
+
print(f\" - {meta['name']}\")
|
| 123 |
+
"
|
| 124 |
+
```
|
| 125 |
+
|
| 126 |
+
### Troubleshooting
|
| 127 |
+
|
| 128 |
+
#### "ModuleNotFoundError: No module named 'chromadb'"
|
| 129 |
+
|
| 130 |
+
```bash
|
| 131 |
+
pip install chromadb sentence-transformers
|
| 132 |
+
```
|
| 133 |
+
|
| 134 |
+
#### "File not found: spells.txt"
|
| 135 |
+
|
| 136 |
+
Make sure these files exist in the project root:
|
| 137 |
+
```bash
|
| 138 |
+
ls spells.txt extracted_monsters.txt extracted_classes.txt
|
| 139 |
+
```
|
| 140 |
+
|
| 141 |
+
If missing, you need to extract them from your PDF files first.
|
| 142 |
+
|
| 143 |
+
#### "No results found" in searches
|
| 144 |
+
|
| 145 |
+
Re-initialize the database:
|
| 146 |
+
```bash
|
| 147 |
+
python initialize_rag.py --clear
|
| 148 |
+
```
|
| 149 |
+
|
| 150 |
+
#### Embedding model download is slow
|
| 151 |
+
|
| 152 |
+
The first run downloads ~80MB of models. This is normal. Subsequent runs are much faster.
|
| 153 |
+
|
| 154 |
+
### What's Working Now
|
| 155 |
+
|
| 156 |
+
โ
**Semantic Search**: Find D&D content by meaning
|
| 157 |
+
โ
**86 Spells**: Fireball, Cure Wounds, Magic Missile, etc.
|
| 158 |
+
โ
**332 Monsters**: Goblins, Dragons, Orcs, etc.
|
| 159 |
+
โ
**5 Classes**: Wizard, Fighter, Cleric, etc.
|
| 160 |
+
โ
**Cross-Collection**: Search all content at once
|
| 161 |
+
โ
**ChromaDB**: Persistent vector database
|
| 162 |
+
|
| 163 |
+
### What's Coming Soon
|
| 164 |
+
|
| 165 |
+
โณ **GM Dialogue System**: RAG-enhanced Ollama integration
|
| 166 |
+
โณ **Character Creator**: Interactive character building
|
| 167 |
+
โณ **Query Interface**: Smart entity recognition
|
| 168 |
+
|
| 169 |
+
### Next: Run GM Dialogue (Coming Soon)
|
| 170 |
+
|
| 171 |
+
```bash
|
| 172 |
+
python run_gm_dialogue.py
|
| 173 |
+
```
|
| 174 |
+
|
| 175 |
+
Will allow interactive D&D sessions with AI GM that knows all the rules!
|
| 176 |
+
|
| 177 |
+
## ๐ Project Structure
|
| 178 |
+
|
| 179 |
+
```
|
| 180 |
+
โโโ dnd_rag_system/ # Main package
|
| 181 |
+
โ โโโ config/ # Configuration
|
| 182 |
+
โ โ โโโ settings.py
|
| 183 |
+
โ โโโ core/ # Core infrastructure
|
| 184 |
+
โ โ โโโ base_parser.py # Parser framework
|
| 185 |
+
โ โ โโโ base_chunker.py # Chunking utilities
|
| 186 |
+
โ โ โโโ chroma_manager.py # Database interface
|
| 187 |
+
โ โโโ parsers/ # Content parsers (TBD)
|
| 188 |
+
โ โโโ systems/ # GM dialogue, character creator (TBD)
|
| 189 |
+
โ
|
| 190 |
+
โโโ chromadb/ # Vector database (created on init)
|
| 191 |
+
โโโ initialize_rag.py # Main initialization script โญ
|
| 192 |
+
โโโ test_rag_search.py # Test search functionality โญ
|
| 193 |
+
โโโ plan_progress.md # Development progress tracking
|
| 194 |
+
โโโ requirements.txt # Python dependencies
|
| 195 |
+
```
|
| 196 |
+
|
| 197 |
+
## ๐๏ธ Required Data Files
|
| 198 |
+
|
| 199 |
+
These should be in the project root:
|
| 200 |
+
|
| 201 |
+
- `spells.txt` - Spell descriptions (extracted from Player's Handbook)
|
| 202 |
+
- `all_spells.txt` - Spell class associations
|
| 203 |
+
- `extracted_monsters.txt` - Monster stats (from Monster Manual)
|
| 204 |
+
- `extracted_classes.txt` - Class features (from Player's Handbook)
|
| 205 |
+
|
| 206 |
+
## ๐ง Configuration
|
| 207 |
+
|
| 208 |
+
Edit `dnd_rag_system/config/settings.py` to customize:
|
| 209 |
+
|
| 210 |
+
- **Database Path**: `CHROMA_PERSIST_DIR`
|
| 211 |
+
- **Embedding Model**: `EMBEDDING_MODEL_NAME` (default: all-MiniLM-L6-v2)
|
| 212 |
+
- **Ollama Model**: `OLLAMA_MODEL_NAME`
|
| 213 |
+
- **Chunk Size**: `MAX_CHUNK_TOKENS`
|
| 214 |
+
- **Collection Names**: `COLLECTION_NAMES`
|
| 215 |
+
|
| 216 |
+
## ๐ Collections
|
| 217 |
+
|
| 218 |
+
The system creates 4 ChromaDB collections:
|
| 219 |
+
|
| 220 |
+
1. **dnd_spells** - D&D 5e spells with mechanics
|
| 221 |
+
2. **dnd_monsters** - Monster stats and abilities
|
| 222 |
+
3. **dnd_classes** - Class features by level
|
| 223 |
+
4. **dnd_races** - Race traits and subraces (TBD)
|
| 224 |
+
|
| 225 |
+
## ๐งช Development Status
|
| 226 |
+
|
| 227 |
+
### โ
Phase 1: Core Infrastructure (Complete)
|
| 228 |
+
- Configuration system
|
| 229 |
+
- Base parser and chunker classes
|
| 230 |
+
- ChromaDB manager
|
| 231 |
+
- Directory structure
|
| 232 |
+
|
| 233 |
+
### โ
Phase 2: Quick Integration (Complete)
|
| 234 |
+
- Initialize RAG script using notebook code
|
| 235 |
+
- Test search functionality
|
| 236 |
+
- Basic loaders for spells, monsters, classes
|
| 237 |
+
|
| 238 |
+
### โณ Phase 3: Systems Layer (In Progress)
|
| 239 |
+
- Query interface with entity recognition
|
| 240 |
+
- RAG-enhanced GM dialogue system
|
| 241 |
+
- Character creation system
|
| 242 |
+
|
| 243 |
+
### โณ Phase 4: Polish & Testing
|
| 244 |
+
- Comprehensive unit tests
|
| 245 |
+
- Integration tests
|
| 246 |
+
- Performance benchmarks
|
| 247 |
+
- Documentation
|
| 248 |
+
|
| 249 |
+
## ๐ฎ Usage Examples
|
| 250 |
+
|
| 251 |
+
### Search for a Spell
|
| 252 |
+
|
| 253 |
+
```python
|
| 254 |
+
from dnd_rag_system.core.chroma_manager import ChromaDBManager
|
| 255 |
+
from dnd_rag_system.config import settings
|
| 256 |
+
|
| 257 |
+
db = ChromaDBManager()
|
| 258 |
+
results = db.search(settings.COLLECTION_NAMES['spells'], "fireball", n_results=3)
|
| 259 |
+
|
| 260 |
+
for doc, meta in zip(results['documents'][0], results['metadatas'][0]):
|
| 261 |
+
print(f"{meta['name']}: {doc[:200]}...")
|
| 262 |
+
```
|
| 263 |
+
|
| 264 |
+
### Cross-Collection Search
|
| 265 |
+
|
| 266 |
+
```python
|
| 267 |
+
results = db.search_all("fire damage", n_results_per_collection=2)
|
| 268 |
+
|
| 269 |
+
for collection, col_results in results.items():
|
| 270 |
+
print(f"\n{collection}:")
|
| 271 |
+
for doc, meta in zip(col_results['documents'][0], col_results['metadatas'][0]):
|
| 272 |
+
print(f" - {meta.get('name', 'Unknown')}")
|
| 273 |
+
```
|
| 274 |
+
|
| 275 |
+
## ๐ค Contributing
|
| 276 |
+
|
| 277 |
+
This is a learning project! Key areas for improvement:
|
| 278 |
+
|
| 279 |
+
1. **Better Parsing**: Improve OCR error handling in text extraction
|
| 280 |
+
2. **More Chunks**: Create better chunk strategies (quick reference, by level, etc.)
|
| 281 |
+
3. **Entity Recognition**: Detect spell/monster names in player input
|
| 282 |
+
4. **GM System**: Build the RAG-enhanced dialogue system
|
| 283 |
+
5. **Character Creator**: Interactive character building with RAG lookup
|
| 284 |
+
|
| 285 |
+
## ๐ Notes
|
| 286 |
+
|
| 287 |
+
- **Embedding Model**: Uses `all-MiniLM-L6-v2` (fast, 384 dimensions)
|
| 288 |
+
- **Token Limit**: Chunks limited to ~400 tokens (~1600 characters)
|
| 289 |
+
- **Ollama Required**: For GM dialogue (download from ollama.ai)
|
| 290 |
+
- **Data Sources**: Requires extracted text files (not included in repo)
|
| 291 |
+
|
| 292 |
+
## ๐ Troubleshooting
|
| 293 |
+
|
| 294 |
+
### "ChromaDB not found"
|
| 295 |
+
```bash
|
| 296 |
+
pip install chromadb
|
| 297 |
+
```
|
| 298 |
+
|
| 299 |
+
### "No results found in search"
|
| 300 |
+
```bash
|
| 301 |
+
# Re-initialize the database
|
| 302 |
+
python initialize_rag.py --clear
|
| 303 |
+
```
|
| 304 |
+
|
| 305 |
+
### "File not found" errors
|
| 306 |
+
Make sure these files exist in the project root:
|
| 307 |
+
- `spells.txt`
|
| 308 |
+
- `extracted_monsters.txt`
|
| 309 |
+
- `extracted_classes.txt`
|
| 310 |
+
|
| 311 |
+
## ๐ References
|
| 312 |
+
|
| 313 |
+
- [D&D 5e SRD](https://dnd.wizards.com/resources/systems-reference-document)
|
| 314 |
+
- [ChromaDB Documentation](https://docs.trychroma.com/)
|
| 315 |
+
- [Sentence Transformers](https://www.sbert.net/)
|
| 316 |
+
- [Ollama](https://ollama.ai/)
|
| 317 |
+
|
| 318 |
+
---
|
| 319 |
+
|
| 320 |
+
**Status**: ๐ง In Active Development
|
| 321 |
+
|
| 322 |
+
See `plan_progress.md` for detailed development progress.
|
|
File without changes
|
|
File without changes
|
|
@@ -0,0 +1,246 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
D&D RAG System Configuration
|
| 3 |
+
|
| 4 |
+
Central configuration file for all system settings, paths, and parameters.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import os
|
| 8 |
+
from pathlib import Path
|
| 9 |
+
from typing import Dict, List
|
| 10 |
+
|
| 11 |
+
# ============================================================================
|
| 12 |
+
# PROJECT PATHS
|
| 13 |
+
# ============================================================================
|
| 14 |
+
|
| 15 |
+
# Root project directory
|
| 16 |
+
PROJECT_ROOT = Path(__file__).parent.parent.parent
|
| 17 |
+
|
| 18 |
+
# Data directories
|
| 19 |
+
DATA_DIR = PROJECT_ROOT / "data"
|
| 20 |
+
CHROMADB_DIR = PROJECT_ROOT / "chromadb"
|
| 21 |
+
|
| 22 |
+
# Source data files
|
| 23 |
+
SPELLS_TXT = PROJECT_ROOT / "spells.txt"
|
| 24 |
+
ALL_SPELLS_TXT = PROJECT_ROOT / "all_spells.txt"
|
| 25 |
+
MONSTER_MANUAL_PDF = PROJECT_ROOT / "Dungeons and Dragons - Monster Manual (Skip Williams, Jonathan Tweet, Monte Cook) (Z-Library).pdf"
|
| 26 |
+
PLAYERS_HANDBOOK_PDF = PROJECT_ROOT / "Dungeons Dragons 5e Players Handbook (Wizards RPG Team Wyatt James, Schwalb Robert J etc.) (Z-Library).pdf"
|
| 27 |
+
|
| 28 |
+
# Extracted text files (optional backups)
|
| 29 |
+
EXTRACTED_MONSTERS_TXT = PROJECT_ROOT / "extracted_monsters.txt"
|
| 30 |
+
EXTRACTED_CLASSES_TXT = PROJECT_ROOT / "extracted_classes.txt"
|
| 31 |
+
|
| 32 |
+
# ============================================================================
|
| 33 |
+
# CHROMADB CONFIGURATION
|
| 34 |
+
# ============================================================================
|
| 35 |
+
|
| 36 |
+
# ChromaDB settings
|
| 37 |
+
CHROMA_PERSIST_DIR = str(CHROMADB_DIR)
|
| 38 |
+
CHROMA_ALLOW_RESET = False # Set to True only for development
|
| 39 |
+
|
| 40 |
+
# Collection names (standardized naming convention)
|
| 41 |
+
COLLECTION_NAMES = {
|
| 42 |
+
"spells": "dnd_spells",
|
| 43 |
+
"monsters": "dnd_monsters",
|
| 44 |
+
"classes": "dnd_classes",
|
| 45 |
+
"races": "dnd_races"
|
| 46 |
+
}
|
| 47 |
+
|
| 48 |
+
# Collection metadata
|
| 49 |
+
COLLECTION_METADATA = {
|
| 50 |
+
"dnd_spells": {
|
| 51 |
+
"description": "D&D 5e spell descriptions, mechanics, and class associations",
|
| 52 |
+
"source": "Player's Handbook - Spells"
|
| 53 |
+
},
|
| 54 |
+
"dnd_monsters": {
|
| 55 |
+
"description": "D&D 5e monster stat blocks, abilities, and combat info",
|
| 56 |
+
"source": "Monster Manual"
|
| 57 |
+
},
|
| 58 |
+
"dnd_classes": {
|
| 59 |
+
"description": "D&D 5e class features, progressions, and subclasses",
|
| 60 |
+
"source": "Player's Handbook - Classes"
|
| 61 |
+
},
|
| 62 |
+
"dnd_races": {
|
| 63 |
+
"description": "D&D 5e race traits, subraces, and lore",
|
| 64 |
+
"source": "Player's Handbook - Races"
|
| 65 |
+
}
|
| 66 |
+
}
|
| 67 |
+
|
| 68 |
+
# ============================================================================
|
| 69 |
+
# EMBEDDING MODEL CONFIGURATION
|
| 70 |
+
# ============================================================================
|
| 71 |
+
|
| 72 |
+
# Sentence transformers model for embeddings
|
| 73 |
+
# all-MiniLM-L6-v2: Fast, good quality, 384 dimensions
|
| 74 |
+
# alternatives: all-mpnet-base-v2 (slower, better), paraphrase-MiniLM-L6-v2
|
| 75 |
+
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
|
| 76 |
+
EMBEDDING_DIMENSION = 384 # Dimension for all-MiniLM-L6-v2
|
| 77 |
+
|
| 78 |
+
# Embedding batch size
|
| 79 |
+
EMBEDDING_BATCH_SIZE = 50
|
| 80 |
+
|
| 81 |
+
# ============================================================================
|
| 82 |
+
# CHUNKING PARAMETERS
|
| 83 |
+
# ============================================================================
|
| 84 |
+
|
| 85 |
+
# Maximum tokens per chunk (rough estimate: 1 token โ 4 characters)
|
| 86 |
+
MAX_CHUNK_TOKENS = 400
|
| 87 |
+
MAX_CHUNK_CHARS = MAX_CHUNK_TOKENS * 4
|
| 88 |
+
|
| 89 |
+
# Overlap for text splitting (in tokens)
|
| 90 |
+
CHUNK_OVERLAP_TOKENS = 50
|
| 91 |
+
|
| 92 |
+
# Minimum chunk size (too small chunks are not useful)
|
| 93 |
+
MIN_CHUNK_TOKENS = 50
|
| 94 |
+
|
| 95 |
+
# ============================================================================
|
| 96 |
+
# PARSER CONFIGURATION
|
| 97 |
+
# ============================================================================
|
| 98 |
+
|
| 99 |
+
# PDF extraction settings
|
| 100 |
+
PDF_EXTRACT_PAGES = {
|
| 101 |
+
"races": (18, 46), # Player's Handbook pages for races
|
| 102 |
+
"classes": (46, 121), # Player's Handbook pages for classes
|
| 103 |
+
}
|
| 104 |
+
|
| 105 |
+
# Monster parsing
|
| 106 |
+
MONSTER_START_NAME = "ABOLETH" # First monster to parse in Monster Manual
|
| 107 |
+
|
| 108 |
+
# Spell parsing
|
| 109 |
+
SPELL_LEVELS = list(range(0, 10)) # Cantrips (0) through 9th level
|
| 110 |
+
|
| 111 |
+
# ============================================================================
|
| 112 |
+
# OLLAMA CONFIGURATION
|
| 113 |
+
# ============================================================================
|
| 114 |
+
|
| 115 |
+
# Ollama model for GM dialogue
|
| 116 |
+
OLLAMA_MODEL_NAME = "hf.co/Chun121/Qwen3-4B-RPG-Roleplay-V2:Q4_K_M"
|
| 117 |
+
OLLAMA_BASE_URL = "http://localhost:11434" # Default Ollama API endpoint
|
| 118 |
+
OLLAMA_TIMEOUT = 30 # Timeout in seconds for model responses
|
| 119 |
+
|
| 120 |
+
# ============================================================================
|
| 121 |
+
# QUERY INTERFACE SETTINGS
|
| 122 |
+
# ============================================================================
|
| 123 |
+
|
| 124 |
+
# Default number of results to retrieve from RAG
|
| 125 |
+
DEFAULT_RAG_RESULTS = 5
|
| 126 |
+
|
| 127 |
+
# Maximum context tokens for LLM (approximate)
|
| 128 |
+
MAX_CONTEXT_TOKENS = 2000
|
| 129 |
+
|
| 130 |
+
# Entity recognition patterns
|
| 131 |
+
ENTITY_PATTERNS = {
|
| 132 |
+
"spell_indicators": ["cast", "spell", "fireball", "magic missile", "cure wounds"],
|
| 133 |
+
"monster_indicators": ["attack", "fight", "goblin", "dragon", "zombie"],
|
| 134 |
+
"class_indicators": ["fighter", "wizard", "cleric", "rogue", "barbarian"],
|
| 135 |
+
"race_indicators": ["elf", "dwarf", "human", "halfling", "dragonborn"]
|
| 136 |
+
}
|
| 137 |
+
|
| 138 |
+
# ============================================================================
|
| 139 |
+
# CHARACTER CREATION SETTINGS
|
| 140 |
+
# ============================================================================
|
| 141 |
+
|
| 142 |
+
# Available D&D classes
|
| 143 |
+
DND_CLASSES = [
|
| 144 |
+
"Barbarian", "Bard", "Cleric", "Druid", "Fighter", "Monk",
|
| 145 |
+
"Paladin", "Ranger", "Rogue", "Sorcerer", "Warlock", "Wizard"
|
| 146 |
+
]
|
| 147 |
+
|
| 148 |
+
# Available D&D races
|
| 149 |
+
DND_RACES = [
|
| 150 |
+
"Dragonborn", "Dwarf", "Elf", "Gnome", "Half-Elf",
|
| 151 |
+
"Halfling", "Half-Orc", "Human", "Tiefling"
|
| 152 |
+
]
|
| 153 |
+
|
| 154 |
+
# Ability score generation methods
|
| 155 |
+
ABILITY_SCORE_METHODS = {
|
| 156 |
+
"standard_array": [15, 14, 13, 12, 10, 8],
|
| 157 |
+
"point_buy": 27, # Total points for point buy
|
| 158 |
+
"roll_4d6_drop_lowest": True
|
| 159 |
+
}
|
| 160 |
+
|
| 161 |
+
# ============================================================================
|
| 162 |
+
# LOGGING & DEBUG
|
| 163 |
+
# ============================================================================
|
| 164 |
+
|
| 165 |
+
# Logging configuration
|
| 166 |
+
LOG_LEVEL = "INFO" # DEBUG, INFO, WARNING, ERROR, CRITICAL
|
| 167 |
+
LOG_FILE = PROJECT_ROOT / "dnd_rag_system.log"
|
| 168 |
+
LOG_FORMAT = "%(asctime)s - %(name)s - %(levelname)s - %(message)s"
|
| 169 |
+
|
| 170 |
+
# Debug mode (verbose output, validation checks)
|
| 171 |
+
DEBUG_MODE = False
|
| 172 |
+
|
| 173 |
+
# ============================================================================
|
| 174 |
+
# PERFORMANCE SETTINGS
|
| 175 |
+
# ============================================================================
|
| 176 |
+
|
| 177 |
+
# Batch processing sizes
|
| 178 |
+
CHROMA_BATCH_SIZE = 100 # Documents to add in one ChromaDB batch
|
| 179 |
+
PARSER_BATCH_SIZE = 50 # Items to process before progress update
|
| 180 |
+
|
| 181 |
+
# Query caching
|
| 182 |
+
ENABLE_QUERY_CACHE = True
|
| 183 |
+
CACHE_SIZE = 100 # Number of queries to cache
|
| 184 |
+
|
| 185 |
+
# ============================================================================
|
| 186 |
+
# VALIDATION SETTINGS
|
| 187 |
+
# ============================================================================
|
| 188 |
+
|
| 189 |
+
# Enable validation checks during initialization
|
| 190 |
+
ENABLE_VALIDATION = True
|
| 191 |
+
|
| 192 |
+
# Minimum number of chunks expected per collection
|
| 193 |
+
MIN_CHUNKS = {
|
| 194 |
+
"dnd_spells": 400,
|
| 195 |
+
"dnd_monsters": 800,
|
| 196 |
+
"dnd_classes": 1500,
|
| 197 |
+
"dnd_races": 80
|
| 198 |
+
}
|
| 199 |
+
|
| 200 |
+
# ============================================================================
|
| 201 |
+
# HELPER FUNCTIONS
|
| 202 |
+
# ============================================================================
|
| 203 |
+
|
| 204 |
+
def get_collection_name(content_type: str) -> str:
|
| 205 |
+
"""Get standardized collection name for content type."""
|
| 206 |
+
return COLLECTION_NAMES.get(content_type.lower(), f"dnd_{content_type.lower()}")
|
| 207 |
+
|
| 208 |
+
def get_data_file(file_type: str) -> Path:
|
| 209 |
+
"""Get path to data file."""
|
| 210 |
+
file_map = {
|
| 211 |
+
"spells": SPELLS_TXT,
|
| 212 |
+
"all_spells": ALL_SPELLS_TXT,
|
| 213 |
+
"monster_manual": MONSTER_MANUAL_PDF,
|
| 214 |
+
"players_handbook": PLAYERS_HANDBOOK_PDF,
|
| 215 |
+
"extracted_monsters": EXTRACTED_MONSTERS_TXT,
|
| 216 |
+
"extracted_classes": EXTRACTED_CLASSES_TXT,
|
| 217 |
+
}
|
| 218 |
+
return file_map.get(file_type.lower(), DATA_DIR / file_type)
|
| 219 |
+
|
| 220 |
+
def validate_paths() -> List[str]:
|
| 221 |
+
"""Validate that all required paths and files exist."""
|
| 222 |
+
missing = []
|
| 223 |
+
|
| 224 |
+
# Check if data files exist
|
| 225 |
+
if not SPELLS_TXT.exists():
|
| 226 |
+
missing.append(f"Spells file: {SPELLS_TXT}")
|
| 227 |
+
if not ALL_SPELLS_TXT.exists():
|
| 228 |
+
missing.append(f"All spells file: {ALL_SPELLS_TXT}")
|
| 229 |
+
if not MONSTER_MANUAL_PDF.exists():
|
| 230 |
+
missing.append(f"Monster Manual PDF: {MONSTER_MANUAL_PDF}")
|
| 231 |
+
|
| 232 |
+
# ChromaDB directory will be created if it doesn't exist
|
| 233 |
+
|
| 234 |
+
return missing
|
| 235 |
+
|
| 236 |
+
def get_config_summary() -> Dict:
|
| 237 |
+
"""Get a summary of current configuration."""
|
| 238 |
+
return {
|
| 239 |
+
"project_root": str(PROJECT_ROOT),
|
| 240 |
+
"chroma_dir": CHROMA_PERSIST_DIR,
|
| 241 |
+
"embedding_model": EMBEDDING_MODEL_NAME,
|
| 242 |
+
"ollama_model": OLLAMA_MODEL_NAME,
|
| 243 |
+
"collections": list(COLLECTION_NAMES.values()),
|
| 244 |
+
"max_chunk_tokens": MAX_CHUNK_TOKENS,
|
| 245 |
+
"debug_mode": DEBUG_MODE
|
| 246 |
+
}
|
|
File without changes
|
|
@@ -0,0 +1,384 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Base Chunker Classes
|
| 3 |
+
|
| 4 |
+
Abstract base classes and utilities for chunking D&D content for RAG retrieval.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from abc import ABC, abstractmethod
|
| 8 |
+
from typing import List, Dict, Any, Set, Optional
|
| 9 |
+
from dataclasses import dataclass, field
|
| 10 |
+
import re
|
| 11 |
+
|
| 12 |
+
# Import settings
|
| 13 |
+
import sys
|
| 14 |
+
from pathlib import Path
|
| 15 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 16 |
+
from config import settings
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
@dataclass
|
| 20 |
+
class Chunk:
|
| 21 |
+
"""
|
| 22 |
+
Represents a single chunk of content for RAG retrieval.
|
| 23 |
+
"""
|
| 24 |
+
content: str
|
| 25 |
+
chunk_type: str # e.g., 'stats', 'description', 'mechanics', 'lore'
|
| 26 |
+
metadata: Dict[str, Any] = field(default_factory=dict)
|
| 27 |
+
tags: Set[str] = field(default_factory=set)
|
| 28 |
+
token_estimate: int = 0
|
| 29 |
+
|
| 30 |
+
def __post_init__(self):
|
| 31 |
+
"""Calculate token estimate if not provided."""
|
| 32 |
+
if self.token_estimate == 0:
|
| 33 |
+
self.token_estimate = estimate_tokens(self.content)
|
| 34 |
+
|
| 35 |
+
def get_retrieval_text(self) -> str:
|
| 36 |
+
"""
|
| 37 |
+
Get formatted text for embedding and retrieval.
|
| 38 |
+
|
| 39 |
+
Returns:
|
| 40 |
+
Formatted text combining metadata and content
|
| 41 |
+
"""
|
| 42 |
+
# Include key metadata in retrieval text for better semantic search
|
| 43 |
+
prefix_parts = []
|
| 44 |
+
|
| 45 |
+
if 'name' in self.metadata:
|
| 46 |
+
prefix_parts.append(f"**{self.metadata['name']}**")
|
| 47 |
+
|
| 48 |
+
if 'type' in self.metadata:
|
| 49 |
+
prefix_parts.append(f"({self.metadata['type']})")
|
| 50 |
+
|
| 51 |
+
prefix = " ".join(prefix_parts)
|
| 52 |
+
|
| 53 |
+
if prefix:
|
| 54 |
+
return f"{prefix}\n{self.content}"
|
| 55 |
+
return self.content
|
| 56 |
+
|
| 57 |
+
def to_dict(self) -> Dict[str, Any]:
|
| 58 |
+
"""Convert chunk to dictionary for storage."""
|
| 59 |
+
return {
|
| 60 |
+
'content': self.content,
|
| 61 |
+
'chunk_type': self.chunk_type,
|
| 62 |
+
'metadata': self.metadata,
|
| 63 |
+
'tags': list(self.tags),
|
| 64 |
+
'token_estimate': self.token_estimate
|
| 65 |
+
}
|
| 66 |
+
|
| 67 |
+
|
| 68 |
+
class BaseChunker(ABC):
|
| 69 |
+
"""
|
| 70 |
+
Abstract base class for all content chunkers.
|
| 71 |
+
|
| 72 |
+
Chunkers take parsed content and split it into optimized chunks
|
| 73 |
+
for RAG retrieval.
|
| 74 |
+
"""
|
| 75 |
+
|
| 76 |
+
def __init__(
|
| 77 |
+
self,
|
| 78 |
+
max_tokens: int = None,
|
| 79 |
+
overlap_tokens: int = None,
|
| 80 |
+
min_tokens: int = None
|
| 81 |
+
):
|
| 82 |
+
"""
|
| 83 |
+
Initialize chunker.
|
| 84 |
+
|
| 85 |
+
Args:
|
| 86 |
+
max_tokens: Maximum tokens per chunk (default from settings)
|
| 87 |
+
overlap_tokens: Overlap between chunks (default from settings)
|
| 88 |
+
min_tokens: Minimum tokens per chunk (default from settings)
|
| 89 |
+
"""
|
| 90 |
+
self.max_tokens = max_tokens or settings.MAX_CHUNK_TOKENS
|
| 91 |
+
self.overlap_tokens = overlap_tokens or settings.CHUNK_OVERLAP_TOKENS
|
| 92 |
+
self.min_tokens = min_tokens or settings.MIN_CHUNK_TOKENS
|
| 93 |
+
|
| 94 |
+
@abstractmethod
|
| 95 |
+
def create_chunks(self, parsed_content: Any) -> List[Chunk]:
|
| 96 |
+
"""
|
| 97 |
+
Create chunks from parsed content.
|
| 98 |
+
|
| 99 |
+
Args:
|
| 100 |
+
parsed_content: Parsed content object (type depends on parser)
|
| 101 |
+
|
| 102 |
+
Returns:
|
| 103 |
+
List of Chunk objects
|
| 104 |
+
"""
|
| 105 |
+
pass
|
| 106 |
+
|
| 107 |
+
def split_long_text(
|
| 108 |
+
self,
|
| 109 |
+
text: str,
|
| 110 |
+
chunk_type: str = "content",
|
| 111 |
+
base_metadata: Dict[str, Any] = None
|
| 112 |
+
) -> List[Chunk]:
|
| 113 |
+
"""
|
| 114 |
+
Split long text into multiple chunks with overlap.
|
| 115 |
+
|
| 116 |
+
Args:
|
| 117 |
+
text: Text to split
|
| 118 |
+
chunk_type: Type of chunk
|
| 119 |
+
base_metadata: Metadata to include in all chunks
|
| 120 |
+
|
| 121 |
+
Returns:
|
| 122 |
+
List of Chunk objects
|
| 123 |
+
"""
|
| 124 |
+
if base_metadata is None:
|
| 125 |
+
base_metadata = {}
|
| 126 |
+
|
| 127 |
+
# Check if splitting is needed
|
| 128 |
+
token_count = estimate_tokens(text)
|
| 129 |
+
if token_count <= self.max_tokens:
|
| 130 |
+
return [Chunk(
|
| 131 |
+
content=text,
|
| 132 |
+
chunk_type=chunk_type,
|
| 133 |
+
metadata=base_metadata.copy(),
|
| 134 |
+
token_estimate=token_count
|
| 135 |
+
)]
|
| 136 |
+
|
| 137 |
+
# Split by sentences
|
| 138 |
+
sentences = split_into_sentences(text)
|
| 139 |
+
chunks = []
|
| 140 |
+
current_chunk = ""
|
| 141 |
+
current_tokens = 0
|
| 142 |
+
|
| 143 |
+
for sentence in sentences:
|
| 144 |
+
sentence_tokens = estimate_tokens(sentence)
|
| 145 |
+
|
| 146 |
+
# Check if adding this sentence would exceed max tokens
|
| 147 |
+
if current_tokens + sentence_tokens > self.max_tokens and current_chunk:
|
| 148 |
+
# Save current chunk
|
| 149 |
+
chunks.append(Chunk(
|
| 150 |
+
content=current_chunk.strip(),
|
| 151 |
+
chunk_type=chunk_type,
|
| 152 |
+
metadata={**base_metadata, 'chunk_index': len(chunks)},
|
| 153 |
+
token_estimate=current_tokens
|
| 154 |
+
))
|
| 155 |
+
|
| 156 |
+
# Start new chunk with overlap
|
| 157 |
+
overlap_text = get_overlap_text(current_chunk, self.overlap_tokens)
|
| 158 |
+
current_chunk = overlap_text + " " + sentence
|
| 159 |
+
current_tokens = estimate_tokens(current_chunk)
|
| 160 |
+
else:
|
| 161 |
+
# Add sentence to current chunk
|
| 162 |
+
current_chunk += (" " if current_chunk else "") + sentence
|
| 163 |
+
current_tokens += sentence_tokens
|
| 164 |
+
|
| 165 |
+
# Don't forget the last chunk
|
| 166 |
+
if current_chunk.strip():
|
| 167 |
+
chunks.append(Chunk(
|
| 168 |
+
content=current_chunk.strip(),
|
| 169 |
+
chunk_type=chunk_type,
|
| 170 |
+
metadata={**base_metadata, 'chunk_index': len(chunks)},
|
| 171 |
+
token_estimate=current_tokens
|
| 172 |
+
))
|
| 173 |
+
|
| 174 |
+
return chunks
|
| 175 |
+
|
| 176 |
+
def validate_chunk(self, chunk: Chunk) -> bool:
|
| 177 |
+
"""
|
| 178 |
+
Validate that a chunk meets requirements.
|
| 179 |
+
|
| 180 |
+
Args:
|
| 181 |
+
chunk: Chunk to validate
|
| 182 |
+
|
| 183 |
+
Returns:
|
| 184 |
+
True if valid, False otherwise
|
| 185 |
+
"""
|
| 186 |
+
# Check minimum size
|
| 187 |
+
if chunk.token_estimate < self.min_tokens:
|
| 188 |
+
return False
|
| 189 |
+
|
| 190 |
+
# Check maximum size
|
| 191 |
+
if chunk.token_estimate > self.max_tokens:
|
| 192 |
+
return False
|
| 193 |
+
|
| 194 |
+
# Check that content exists
|
| 195 |
+
if not chunk.content or not chunk.content.strip():
|
| 196 |
+
return False
|
| 197 |
+
|
| 198 |
+
return True
|
| 199 |
+
|
| 200 |
+
def get_statistics(self, chunks: List[Chunk]) -> Dict[str, Any]:
|
| 201 |
+
"""
|
| 202 |
+
Get statistics about created chunks.
|
| 203 |
+
|
| 204 |
+
Args:
|
| 205 |
+
chunks: List of chunks to analyze
|
| 206 |
+
|
| 207 |
+
Returns:
|
| 208 |
+
Dictionary with statistics
|
| 209 |
+
"""
|
| 210 |
+
if not chunks:
|
| 211 |
+
return {'total': 0}
|
| 212 |
+
|
| 213 |
+
token_counts = [c.token_estimate for c in chunks]
|
| 214 |
+
chunk_types = {}
|
| 215 |
+
|
| 216 |
+
for chunk in chunks:
|
| 217 |
+
chunk_types[chunk.chunk_type] = chunk_types.get(chunk.chunk_type, 0) + 1
|
| 218 |
+
|
| 219 |
+
return {
|
| 220 |
+
'total': len(chunks),
|
| 221 |
+
'chunk_types': chunk_types,
|
| 222 |
+
'total_tokens': sum(token_counts),
|
| 223 |
+
'avg_tokens': sum(token_counts) // len(chunks),
|
| 224 |
+
'min_tokens': min(token_counts),
|
| 225 |
+
'max_tokens': max(token_counts),
|
| 226 |
+
'all_tags': list(set().union(*[c.tags for c in chunks]))
|
| 227 |
+
}
|
| 228 |
+
|
| 229 |
+
|
| 230 |
+
# ============================================================================
|
| 231 |
+
# UTILITY FUNCTIONS
|
| 232 |
+
# ============================================================================
|
| 233 |
+
|
| 234 |
+
def estimate_tokens(text: str) -> int:
|
| 235 |
+
"""
|
| 236 |
+
Estimate number of tokens in text.
|
| 237 |
+
|
| 238 |
+
Uses rough approximation: 1 token โ 4 characters
|
| 239 |
+
|
| 240 |
+
Args:
|
| 241 |
+
text: Text to estimate
|
| 242 |
+
|
| 243 |
+
Returns:
|
| 244 |
+
Estimated token count
|
| 245 |
+
"""
|
| 246 |
+
if not text:
|
| 247 |
+
return 0
|
| 248 |
+
return len(text) // 4
|
| 249 |
+
|
| 250 |
+
|
| 251 |
+
def split_into_sentences(text: str) -> List[str]:
|
| 252 |
+
"""
|
| 253 |
+
Split text into sentences.
|
| 254 |
+
|
| 255 |
+
Args:
|
| 256 |
+
text: Text to split
|
| 257 |
+
|
| 258 |
+
Returns:
|
| 259 |
+
List of sentences
|
| 260 |
+
"""
|
| 261 |
+
# Simple sentence splitter (can be improved with nltk if needed)
|
| 262 |
+
sentence_pattern = r'(?<=[.!?])\s+'
|
| 263 |
+
sentences = re.split(sentence_pattern, text)
|
| 264 |
+
return [s.strip() for s in sentences if s.strip()]
|
| 265 |
+
|
| 266 |
+
|
| 267 |
+
def get_overlap_text(text: str, overlap_tokens: int) -> str:
|
| 268 |
+
"""
|
| 269 |
+
Get the last N tokens from text for overlap.
|
| 270 |
+
|
| 271 |
+
Args:
|
| 272 |
+
text: Source text
|
| 273 |
+
overlap_tokens: Number of tokens for overlap
|
| 274 |
+
|
| 275 |
+
Returns:
|
| 276 |
+
Text containing approximately overlap_tokens tokens
|
| 277 |
+
"""
|
| 278 |
+
if not text:
|
| 279 |
+
return ""
|
| 280 |
+
|
| 281 |
+
# Rough estimation: take last N*4 characters
|
| 282 |
+
overlap_chars = overlap_tokens * 4
|
| 283 |
+
if len(text) <= overlap_chars:
|
| 284 |
+
return text
|
| 285 |
+
|
| 286 |
+
# Try to break at word boundary
|
| 287 |
+
overlap_text = text[-overlap_chars:]
|
| 288 |
+
first_space = overlap_text.find(' ')
|
| 289 |
+
|
| 290 |
+
if first_space > 0:
|
| 291 |
+
overlap_text = overlap_text[first_space + 1:]
|
| 292 |
+
|
| 293 |
+
return overlap_text
|
| 294 |
+
|
| 295 |
+
|
| 296 |
+
def generate_tags(content: str, metadata: Dict[str, Any]) -> Set[str]:
|
| 297 |
+
"""
|
| 298 |
+
Generate tags for a chunk based on content and metadata.
|
| 299 |
+
|
| 300 |
+
Args:
|
| 301 |
+
content: Chunk content
|
| 302 |
+
metadata: Chunk metadata
|
| 303 |
+
|
| 304 |
+
Returns:
|
| 305 |
+
Set of tags
|
| 306 |
+
"""
|
| 307 |
+
tags = set()
|
| 308 |
+
|
| 309 |
+
# Add tags from metadata
|
| 310 |
+
for key, value in metadata.items():
|
| 311 |
+
if key in ['name', 'type', 'category', 'level', 'school']:
|
| 312 |
+
if value:
|
| 313 |
+
tag_value = str(value).lower().replace(' ', '_')
|
| 314 |
+
tags.add(f"{key}_{tag_value}")
|
| 315 |
+
|
| 316 |
+
# Add content-based tags
|
| 317 |
+
content_lower = content.lower()
|
| 318 |
+
|
| 319 |
+
# Common D&D keywords
|
| 320 |
+
keywords = {
|
| 321 |
+
'combat': ['attack', 'damage', 'hit points', 'armor class', 'saving throw'],
|
| 322 |
+
'magic': ['spell', 'magic', 'cast', 'ritual', 'concentration'],
|
| 323 |
+
'ability': ['strength', 'dexterity', 'constitution', 'intelligence', 'wisdom', 'charisma'],
|
| 324 |
+
'condition': ['frightened', 'stunned', 'paralyzed', 'poisoned', 'charmed']
|
| 325 |
+
}
|
| 326 |
+
|
| 327 |
+
for tag, words in keywords.items():
|
| 328 |
+
if any(word in content_lower for word in words):
|
| 329 |
+
tags.add(tag)
|
| 330 |
+
|
| 331 |
+
return tags
|
| 332 |
+
|
| 333 |
+
|
| 334 |
+
def format_metadata_for_retrieval(metadata: Dict[str, Any]) -> str:
|
| 335 |
+
"""
|
| 336 |
+
Format metadata as text for inclusion in retrieval.
|
| 337 |
+
|
| 338 |
+
Args:
|
| 339 |
+
metadata: Metadata dictionary
|
| 340 |
+
|
| 341 |
+
Returns:
|
| 342 |
+
Formatted metadata string
|
| 343 |
+
"""
|
| 344 |
+
parts = []
|
| 345 |
+
|
| 346 |
+
# Priority fields to include in retrieval text
|
| 347 |
+
priority_fields = ['name', 'level', 'type', 'category', 'school', 'cr']
|
| 348 |
+
|
| 349 |
+
for field in priority_fields:
|
| 350 |
+
if field in metadata and metadata[field]:
|
| 351 |
+
value = metadata[field]
|
| 352 |
+
parts.append(f"{field.title()}: {value}")
|
| 353 |
+
|
| 354 |
+
return " | ".join(parts) if parts else ""
|
| 355 |
+
|
| 356 |
+
|
| 357 |
+
def truncate_to_tokens(text: str, max_tokens: int) -> str:
|
| 358 |
+
"""
|
| 359 |
+
Truncate text to approximately max_tokens.
|
| 360 |
+
|
| 361 |
+
Args:
|
| 362 |
+
text: Text to truncate
|
| 363 |
+
max_tokens: Maximum tokens
|
| 364 |
+
|
| 365 |
+
Returns:
|
| 366 |
+
Truncated text
|
| 367 |
+
"""
|
| 368 |
+
if estimate_tokens(text) <= max_tokens:
|
| 369 |
+
return text
|
| 370 |
+
|
| 371 |
+
# Approximate character count
|
| 372 |
+
max_chars = max_tokens * 4
|
| 373 |
+
|
| 374 |
+
if len(text) <= max_chars:
|
| 375 |
+
return text
|
| 376 |
+
|
| 377 |
+
# Truncate and try to break at sentence boundary
|
| 378 |
+
truncated = text[:max_chars]
|
| 379 |
+
last_period = truncated.rfind('.')
|
| 380 |
+
|
| 381 |
+
if last_period > max_chars * 0.8: # Only if we don't lose too much
|
| 382 |
+
truncated = truncated[:last_period + 1]
|
| 383 |
+
|
| 384 |
+
return truncated.strip()
|
|
@@ -0,0 +1,345 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Base Parser Classes
|
| 3 |
+
|
| 4 |
+
Abstract base classes and utilities for parsing D&D content from various sources.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
from abc import ABC, abstractmethod
|
| 8 |
+
from pathlib import Path
|
| 9 |
+
from typing import List, Dict, Any, Optional, Union
|
| 10 |
+
import re
|
| 11 |
+
import pdfplumber
|
| 12 |
+
from dataclasses import dataclass
|
| 13 |
+
|
| 14 |
+
|
| 15 |
+
@dataclass
|
| 16 |
+
class ParsedContent:
|
| 17 |
+
"""Container for parsed content from any source."""
|
| 18 |
+
content_type: str # 'spell', 'monster', 'class', 'race'
|
| 19 |
+
name: str
|
| 20 |
+
raw_text: str
|
| 21 |
+
metadata: Dict[str, Any]
|
| 22 |
+
chunks: List[Dict[str, Any]] = None
|
| 23 |
+
|
| 24 |
+
def __post_init__(self):
|
| 25 |
+
if self.chunks is None:
|
| 26 |
+
self.chunks = []
|
| 27 |
+
|
| 28 |
+
|
| 29 |
+
class BaseParser(ABC):
|
| 30 |
+
"""
|
| 31 |
+
Abstract base class for all content parsers.
|
| 32 |
+
|
| 33 |
+
Subclasses must implement:
|
| 34 |
+
- parse(): Main parsing logic
|
| 35 |
+
- validate(): Content validation
|
| 36 |
+
"""
|
| 37 |
+
|
| 38 |
+
def __init__(self, content_type: str):
|
| 39 |
+
"""
|
| 40 |
+
Initialize parser.
|
| 41 |
+
|
| 42 |
+
Args:
|
| 43 |
+
content_type: Type of content this parser handles ('spell', 'monster', 'class', 'race')
|
| 44 |
+
"""
|
| 45 |
+
self.content_type = content_type
|
| 46 |
+
self.parsed_items: List[ParsedContent] = []
|
| 47 |
+
|
| 48 |
+
@abstractmethod
|
| 49 |
+
def parse(self, source: Union[str, Path]) -> List[ParsedContent]:
|
| 50 |
+
"""
|
| 51 |
+
Parse content from source.
|
| 52 |
+
|
| 53 |
+
Args:
|
| 54 |
+
source: Path to source file or raw text
|
| 55 |
+
|
| 56 |
+
Returns:
|
| 57 |
+
List of ParsedContent objects
|
| 58 |
+
|
| 59 |
+
Raises:
|
| 60 |
+
ValueError: If source is invalid or parsing fails
|
| 61 |
+
"""
|
| 62 |
+
pass
|
| 63 |
+
|
| 64 |
+
@abstractmethod
|
| 65 |
+
def validate(self, content: ParsedContent) -> bool:
|
| 66 |
+
"""
|
| 67 |
+
Validate parsed content.
|
| 68 |
+
|
| 69 |
+
Args:
|
| 70 |
+
content: ParsedContent object to validate
|
| 71 |
+
|
| 72 |
+
Returns:
|
| 73 |
+
True if valid, False otherwise
|
| 74 |
+
"""
|
| 75 |
+
pass
|
| 76 |
+
|
| 77 |
+
def get_statistics(self) -> Dict[str, Any]:
|
| 78 |
+
"""
|
| 79 |
+
Get parsing statistics.
|
| 80 |
+
|
| 81 |
+
Returns:
|
| 82 |
+
Dictionary with statistics about parsed items
|
| 83 |
+
"""
|
| 84 |
+
return {
|
| 85 |
+
"content_type": self.content_type,
|
| 86 |
+
"total_items": len(self.parsed_items),
|
| 87 |
+
"item_names": [item.name for item in self.parsed_items]
|
| 88 |
+
}
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
class PDFParser(BaseParser):
|
| 92 |
+
"""
|
| 93 |
+
Base class for parsers that extract content from PDF files.
|
| 94 |
+
|
| 95 |
+
Provides common PDF extraction utilities using pdfplumber.
|
| 96 |
+
"""
|
| 97 |
+
|
| 98 |
+
def __init__(self, content_type: str):
|
| 99 |
+
super().__init__(content_type)
|
| 100 |
+
|
| 101 |
+
def extract_pdf_text(
|
| 102 |
+
self,
|
| 103 |
+
pdf_path: Union[str, Path],
|
| 104 |
+
start_page: Optional[int] = None,
|
| 105 |
+
end_page: Optional[int] = None
|
| 106 |
+
) -> str:
|
| 107 |
+
"""
|
| 108 |
+
Extract text from PDF file.
|
| 109 |
+
|
| 110 |
+
Args:
|
| 111 |
+
pdf_path: Path to PDF file
|
| 112 |
+
start_page: First page to extract (1-indexed, inclusive)
|
| 113 |
+
end_page: Last page to extract (1-indexed, inclusive)
|
| 114 |
+
|
| 115 |
+
Returns:
|
| 116 |
+
Extracted text
|
| 117 |
+
|
| 118 |
+
Raises:
|
| 119 |
+
FileNotFoundError: If PDF file doesn't exist
|
| 120 |
+
Exception: If PDF extraction fails
|
| 121 |
+
"""
|
| 122 |
+
pdf_path = Path(pdf_path)
|
| 123 |
+
|
| 124 |
+
if not pdf_path.exists():
|
| 125 |
+
raise FileNotFoundError(f"PDF file not found: {pdf_path}")
|
| 126 |
+
|
| 127 |
+
try:
|
| 128 |
+
full_text = ""
|
| 129 |
+
with pdfplumber.open(pdf_path) as pdf:
|
| 130 |
+
total_pages = len(pdf.pages)
|
| 131 |
+
|
| 132 |
+
# Determine page range
|
| 133 |
+
start_idx = (start_page - 1) if start_page else 0
|
| 134 |
+
end_idx = end_page if end_page else total_pages
|
| 135 |
+
|
| 136 |
+
# Extract pages
|
| 137 |
+
for page_num in range(start_idx, min(end_idx, total_pages)):
|
| 138 |
+
try:
|
| 139 |
+
page = pdf.pages[page_num]
|
| 140 |
+
page_text = page.extract_text()
|
| 141 |
+
|
| 142 |
+
if page_text:
|
| 143 |
+
full_text += page_text + "\n\n"
|
| 144 |
+
except Exception as e:
|
| 145 |
+
print(f"Warning: Could not extract page {page_num + 1}: {e}")
|
| 146 |
+
continue
|
| 147 |
+
|
| 148 |
+
return full_text
|
| 149 |
+
|
| 150 |
+
except Exception as e:
|
| 151 |
+
raise Exception(f"Failed to extract PDF {pdf_path}: {str(e)}")
|
| 152 |
+
|
| 153 |
+
def extract_pdf_pages_separately(
|
| 154 |
+
self,
|
| 155 |
+
pdf_path: Union[str, Path],
|
| 156 |
+
start_page: Optional[int] = None,
|
| 157 |
+
end_page: Optional[int] = None
|
| 158 |
+
) -> Dict[int, str]:
|
| 159 |
+
"""
|
| 160 |
+
Extract text from PDF, returning each page separately.
|
| 161 |
+
|
| 162 |
+
Args:
|
| 163 |
+
pdf_path: Path to PDF file
|
| 164 |
+
start_page: First page to extract (1-indexed)
|
| 165 |
+
end_page: Last page to extract (1-indexed)
|
| 166 |
+
|
| 167 |
+
Returns:
|
| 168 |
+
Dictionary mapping page numbers to extracted text
|
| 169 |
+
"""
|
| 170 |
+
pdf_path = Path(pdf_path)
|
| 171 |
+
pages_text = {}
|
| 172 |
+
|
| 173 |
+
try:
|
| 174 |
+
with pdfplumber.open(pdf_path) as pdf:
|
| 175 |
+
total_pages = len(pdf.pages)
|
| 176 |
+
start_idx = (start_page - 1) if start_page else 0
|
| 177 |
+
end_idx = end_page if end_page else total_pages
|
| 178 |
+
|
| 179 |
+
for page_num in range(start_idx, min(end_idx, total_pages)):
|
| 180 |
+
try:
|
| 181 |
+
page = pdf.pages[page_num]
|
| 182 |
+
page_text = page.extract_text()
|
| 183 |
+
|
| 184 |
+
if page_text:
|
| 185 |
+
pages_text[page_num + 1] = page_text # 1-indexed
|
| 186 |
+
except Exception as e:
|
| 187 |
+
print(f"Warning: Could not extract page {page_num + 1}: {e}")
|
| 188 |
+
continue
|
| 189 |
+
|
| 190 |
+
return pages_text
|
| 191 |
+
|
| 192 |
+
except Exception as e:
|
| 193 |
+
raise Exception(f"Failed to extract PDF pages from {pdf_path}: {str(e)}")
|
| 194 |
+
|
| 195 |
+
|
| 196 |
+
class TextParser(BaseParser):
|
| 197 |
+
"""
|
| 198 |
+
Base class for parsers that extract content from text files.
|
| 199 |
+
|
| 200 |
+
Provides common text file reading utilities.
|
| 201 |
+
"""
|
| 202 |
+
|
| 203 |
+
def __init__(self, content_type: str):
|
| 204 |
+
super().__init__(content_type)
|
| 205 |
+
|
| 206 |
+
def read_text_file(self, file_path: Union[str, Path], encoding: str = 'utf-8') -> str:
|
| 207 |
+
"""
|
| 208 |
+
Read text from file.
|
| 209 |
+
|
| 210 |
+
Args:
|
| 211 |
+
file_path: Path to text file
|
| 212 |
+
encoding: Text encoding (default: utf-8)
|
| 213 |
+
|
| 214 |
+
Returns:
|
| 215 |
+
File contents as string
|
| 216 |
+
|
| 217 |
+
Raises:
|
| 218 |
+
FileNotFoundError: If file doesn't exist
|
| 219 |
+
Exception: If file reading fails
|
| 220 |
+
"""
|
| 221 |
+
file_path = Path(file_path)
|
| 222 |
+
|
| 223 |
+
if not file_path.exists():
|
| 224 |
+
raise FileNotFoundError(f"Text file not found: {file_path}")
|
| 225 |
+
|
| 226 |
+
try:
|
| 227 |
+
with open(file_path, 'r', encoding=encoding) as f:
|
| 228 |
+
return f.read()
|
| 229 |
+
except Exception as e:
|
| 230 |
+
raise Exception(f"Failed to read text file {file_path}: {str(e)}")
|
| 231 |
+
|
| 232 |
+
def read_text_lines(self, file_path: Union[str, Path], encoding: str = 'utf-8') -> List[str]:
|
| 233 |
+
"""
|
| 234 |
+
Read text file as list of lines.
|
| 235 |
+
|
| 236 |
+
Args:
|
| 237 |
+
file_path: Path to text file
|
| 238 |
+
encoding: Text encoding (default: utf-8)
|
| 239 |
+
|
| 240 |
+
Returns:
|
| 241 |
+
List of lines from file
|
| 242 |
+
"""
|
| 243 |
+
text = self.read_text_file(file_path, encoding)
|
| 244 |
+
return text.split('\n')
|
| 245 |
+
|
| 246 |
+
|
| 247 |
+
# ============================================================================
|
| 248 |
+
# TEXT CLEANING UTILITIES
|
| 249 |
+
# ============================================================================
|
| 250 |
+
|
| 251 |
+
def clean_extracted_text(text: str) -> str:
|
| 252 |
+
"""
|
| 253 |
+
Clean and normalize extracted text.
|
| 254 |
+
|
| 255 |
+
Args:
|
| 256 |
+
text: Raw text to clean
|
| 257 |
+
|
| 258 |
+
Returns:
|
| 259 |
+
Cleaned text
|
| 260 |
+
"""
|
| 261 |
+
if not text:
|
| 262 |
+
return ""
|
| 263 |
+
|
| 264 |
+
# Remove excessive whitespace
|
| 265 |
+
text = re.sub(r'\s+', ' ', text)
|
| 266 |
+
|
| 267 |
+
# Fix common PDF extraction issues
|
| 268 |
+
text = text.replace('\r', '\n')
|
| 269 |
+
|
| 270 |
+
# Normalize line endings
|
| 271 |
+
text = '\n'.join(line.strip() for line in text.split('\n'))
|
| 272 |
+
|
| 273 |
+
# Remove empty lines
|
| 274 |
+
lines = [line for line in text.split('\n') if line.strip()]
|
| 275 |
+
text = '\n'.join(lines)
|
| 276 |
+
|
| 277 |
+
return text.strip()
|
| 278 |
+
|
| 279 |
+
|
| 280 |
+
def split_by_headers(text: str, header_pattern: str) -> List[Dict[str, str]]:
|
| 281 |
+
"""
|
| 282 |
+
Split text into sections based on headers.
|
| 283 |
+
|
| 284 |
+
Args:
|
| 285 |
+
text: Text to split
|
| 286 |
+
header_pattern: Regex pattern to match headers
|
| 287 |
+
|
| 288 |
+
Returns:
|
| 289 |
+
List of dictionaries with 'header' and 'content' keys
|
| 290 |
+
"""
|
| 291 |
+
sections = []
|
| 292 |
+
|
| 293 |
+
# Find all headers
|
| 294 |
+
matches = list(re.finditer(header_pattern, text, re.MULTILINE | re.IGNORECASE))
|
| 295 |
+
|
| 296 |
+
for i, match in enumerate(matches):
|
| 297 |
+
header = match.group(0).strip()
|
| 298 |
+
start_pos = match.end()
|
| 299 |
+
|
| 300 |
+
# Find end position (start of next header or end of text)
|
| 301 |
+
end_pos = matches[i + 1].start() if i + 1 < len(matches) else len(text)
|
| 302 |
+
|
| 303 |
+
content = text[start_pos:end_pos].strip()
|
| 304 |
+
|
| 305 |
+
sections.append({
|
| 306 |
+
'header': header,
|
| 307 |
+
'content': content,
|
| 308 |
+
'start_pos': match.start(),
|
| 309 |
+
'end_pos': end_pos
|
| 310 |
+
})
|
| 311 |
+
|
| 312 |
+
return sections
|
| 313 |
+
|
| 314 |
+
|
| 315 |
+
def extract_pattern(text: str, pattern: str, group: int = 1) -> Optional[str]:
|
| 316 |
+
"""
|
| 317 |
+
Extract text matching a regex pattern.
|
| 318 |
+
|
| 319 |
+
Args:
|
| 320 |
+
text: Text to search
|
| 321 |
+
pattern: Regex pattern
|
| 322 |
+
group: Group number to extract (default: 1)
|
| 323 |
+
|
| 324 |
+
Returns:
|
| 325 |
+
Matched text or None if not found
|
| 326 |
+
"""
|
| 327 |
+
match = re.search(pattern, text, re.IGNORECASE | re.MULTILINE)
|
| 328 |
+
if match and len(match.groups()) >= group:
|
| 329 |
+
return match.group(group).strip()
|
| 330 |
+
return None
|
| 331 |
+
|
| 332 |
+
|
| 333 |
+
def extract_all_patterns(text: str, pattern: str) -> List[str]:
|
| 334 |
+
"""
|
| 335 |
+
Extract all text matching a regex pattern.
|
| 336 |
+
|
| 337 |
+
Args:
|
| 338 |
+
text: Text to search
|
| 339 |
+
pattern: Regex pattern
|
| 340 |
+
|
| 341 |
+
Returns:
|
| 342 |
+
List of all matches
|
| 343 |
+
"""
|
| 344 |
+
matches = re.findall(pattern, text, re.IGNORECASE | re.MULTILINE)
|
| 345 |
+
return [m.strip() if isinstance(m, str) else m[0].strip() for m in matches]
|
|
@@ -0,0 +1,432 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
ChromaDB Manager
|
| 3 |
+
|
| 4 |
+
Unified interface for managing ChromaDB collections and operations.
|
| 5 |
+
"""
|
| 6 |
+
|
| 7 |
+
import chromadb
|
| 8 |
+
from chromadb.config import Settings as ChromaSettings
|
| 9 |
+
from typing import List, Dict, Any, Optional, Union
|
| 10 |
+
from pathlib import Path
|
| 11 |
+
import uuid
|
| 12 |
+
import json
|
| 13 |
+
|
| 14 |
+
# Import project settings and chunker
|
| 15 |
+
import sys
|
| 16 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 17 |
+
from config import settings
|
| 18 |
+
from core.base_chunker import Chunk
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
class ChromaDBManager:
|
| 22 |
+
"""
|
| 23 |
+
Manages all ChromaDB operations for the D&D RAG system.
|
| 24 |
+
|
| 25 |
+
Provides a unified interface for:
|
| 26 |
+
- Collection management
|
| 27 |
+
- Adding/updating chunks
|
| 28 |
+
- Querying across single or multiple collections
|
| 29 |
+
- Statistics and reporting
|
| 30 |
+
"""
|
| 31 |
+
|
| 32 |
+
def __init__(
|
| 33 |
+
self,
|
| 34 |
+
persist_dir: Optional[str] = None,
|
| 35 |
+
embedding_model: Optional[str] = None
|
| 36 |
+
):
|
| 37 |
+
"""
|
| 38 |
+
Initialize ChromaDB manager.
|
| 39 |
+
|
| 40 |
+
Args:
|
| 41 |
+
persist_dir: Directory for persistent storage (default from settings)
|
| 42 |
+
embedding_model: Embedding model name (default from settings)
|
| 43 |
+
"""
|
| 44 |
+
self.persist_dir = persist_dir or settings.CHROMA_PERSIST_DIR
|
| 45 |
+
self.embedding_model = embedding_model or settings.EMBEDDING_MODEL_NAME
|
| 46 |
+
|
| 47 |
+
# Ensure persist directory exists
|
| 48 |
+
Path(self.persist_dir).mkdir(parents=True, exist_ok=True)
|
| 49 |
+
|
| 50 |
+
# Initialize ChromaDB client
|
| 51 |
+
self.client = chromadb.PersistentClient(
|
| 52 |
+
path=self.persist_dir,
|
| 53 |
+
settings=ChromaSettings(allow_reset=settings.CHROMA_ALLOW_RESET)
|
| 54 |
+
)
|
| 55 |
+
|
| 56 |
+
# Cache for collections
|
| 57 |
+
self._collections = {}
|
| 58 |
+
|
| 59 |
+
print(f"ChromaDB Manager initialized:")
|
| 60 |
+
print(f" Persist dir: {self.persist_dir}")
|
| 61 |
+
print(f" Embedding model: {self.embedding_model}")
|
| 62 |
+
|
| 63 |
+
def get_or_create_collection(
|
| 64 |
+
self,
|
| 65 |
+
collection_name: str,
|
| 66 |
+
metadata: Optional[Dict[str, str]] = None
|
| 67 |
+
):
|
| 68 |
+
"""
|
| 69 |
+
Get existing collection or create new one.
|
| 70 |
+
|
| 71 |
+
Args:
|
| 72 |
+
collection_name: Name of the collection
|
| 73 |
+
metadata: Optional metadata for the collection
|
| 74 |
+
|
| 75 |
+
Returns:
|
| 76 |
+
ChromaDB collection object
|
| 77 |
+
"""
|
| 78 |
+
# Check cache first
|
| 79 |
+
if collection_name in self._collections:
|
| 80 |
+
return self._collections[collection_name]
|
| 81 |
+
|
| 82 |
+
# Get or create from ChromaDB
|
| 83 |
+
try:
|
| 84 |
+
collection = self.client.get_or_create_collection(
|
| 85 |
+
name=collection_name,
|
| 86 |
+
metadata=metadata or settings.COLLECTION_METADATA.get(collection_name, {})
|
| 87 |
+
)
|
| 88 |
+
self._collections[collection_name] = collection
|
| 89 |
+
print(f"โ Collection '{collection_name}' ready ({collection.count()} documents)")
|
| 90 |
+
return collection
|
| 91 |
+
except Exception as e:
|
| 92 |
+
raise Exception(f"Failed to get/create collection '{collection_name}': {e}")
|
| 93 |
+
|
| 94 |
+
def add_chunks(
|
| 95 |
+
self,
|
| 96 |
+
collection_name: str,
|
| 97 |
+
chunks: List[Chunk],
|
| 98 |
+
batch_size: Optional[int] = None
|
| 99 |
+
) -> int:
|
| 100 |
+
"""
|
| 101 |
+
Add chunks to a collection.
|
| 102 |
+
|
| 103 |
+
Args:
|
| 104 |
+
collection_name: Name of collection to add to
|
| 105 |
+
chunks: List of Chunk objects
|
| 106 |
+
batch_size: Batch size for adding (default from settings)
|
| 107 |
+
|
| 108 |
+
Returns:
|
| 109 |
+
Number of chunks added
|
| 110 |
+
|
| 111 |
+
Raises:
|
| 112 |
+
ValueError: If chunks is empty or invalid
|
| 113 |
+
"""
|
| 114 |
+
if not chunks:
|
| 115 |
+
raise ValueError("Cannot add empty chunks list")
|
| 116 |
+
|
| 117 |
+
batch_size = batch_size or settings.CHROMA_BATCH_SIZE
|
| 118 |
+
collection = self.get_or_create_collection(collection_name)
|
| 119 |
+
|
| 120 |
+
# Prepare data
|
| 121 |
+
documents = []
|
| 122 |
+
metadatas = []
|
| 123 |
+
ids = []
|
| 124 |
+
|
| 125 |
+
for chunk in chunks:
|
| 126 |
+
# Get retrieval text
|
| 127 |
+
documents.append(chunk.get_retrieval_text())
|
| 128 |
+
|
| 129 |
+
# Convert metadata to ChromaDB-compatible format
|
| 130 |
+
metadata = self._prepare_metadata(chunk.metadata)
|
| 131 |
+
metadata['chunk_type'] = chunk.chunk_type
|
| 132 |
+
metadata['token_estimate'] = chunk.token_estimate
|
| 133 |
+
metadata['tags'] = ','.join(sorted(chunk.tags)) if chunk.tags else ''
|
| 134 |
+
|
| 135 |
+
metadatas.append(metadata)
|
| 136 |
+
|
| 137 |
+
# Generate unique ID
|
| 138 |
+
ids.append(self._generate_chunk_id(collection_name, chunk))
|
| 139 |
+
|
| 140 |
+
# Add in batches
|
| 141 |
+
total_added = 0
|
| 142 |
+
for i in range(0, len(documents), batch_size):
|
| 143 |
+
batch_end = min(i + batch_size, len(documents))
|
| 144 |
+
|
| 145 |
+
try:
|
| 146 |
+
collection.add(
|
| 147 |
+
documents=documents[i:batch_end],
|
| 148 |
+
metadatas=metadatas[i:batch_end],
|
| 149 |
+
ids=ids[i:batch_end]
|
| 150 |
+
)
|
| 151 |
+
total_added += (batch_end - i)
|
| 152 |
+
except Exception as e:
|
| 153 |
+
print(f"Warning: Failed to add batch {i//batch_size + 1}: {e}")
|
| 154 |
+
continue
|
| 155 |
+
|
| 156 |
+
print(f"โ Added {total_added} chunks to '{collection_name}'")
|
| 157 |
+
return total_added
|
| 158 |
+
|
| 159 |
+
def search(
|
| 160 |
+
self,
|
| 161 |
+
collection_name: str,
|
| 162 |
+
query_text: str,
|
| 163 |
+
n_results: int = None,
|
| 164 |
+
where: Optional[Dict] = None,
|
| 165 |
+
where_document: Optional[Dict] = None
|
| 166 |
+
) -> Dict:
|
| 167 |
+
"""
|
| 168 |
+
Search a single collection.
|
| 169 |
+
|
| 170 |
+
Args:
|
| 171 |
+
collection_name: Name of collection to search
|
| 172 |
+
query_text: Query text
|
| 173 |
+
n_results: Number of results to return (default from settings)
|
| 174 |
+
where: Metadata filters
|
| 175 |
+
where_document: Document content filters
|
| 176 |
+
|
| 177 |
+
Returns:
|
| 178 |
+
Search results dictionary
|
| 179 |
+
"""
|
| 180 |
+
n_results = n_results or settings.DEFAULT_RAG_RESULTS
|
| 181 |
+
collection = self.get_or_create_collection(collection_name)
|
| 182 |
+
|
| 183 |
+
try:
|
| 184 |
+
results = collection.query(
|
| 185 |
+
query_texts=[query_text],
|
| 186 |
+
n_results=n_results,
|
| 187 |
+
where=where,
|
| 188 |
+
where_document=where_document
|
| 189 |
+
)
|
| 190 |
+
return results
|
| 191 |
+
except Exception as e:
|
| 192 |
+
print(f"Search error in '{collection_name}': {e}")
|
| 193 |
+
return {"documents": [[]], "metadatas": [[]], "distances": [[]], "ids": [[]]}
|
| 194 |
+
|
| 195 |
+
def search_all(
|
| 196 |
+
self,
|
| 197 |
+
query_text: str,
|
| 198 |
+
n_results_per_collection: int = 3,
|
| 199 |
+
collections: Optional[List[str]] = None
|
| 200 |
+
) -> Dict[str, Dict]:
|
| 201 |
+
"""
|
| 202 |
+
Search across multiple collections.
|
| 203 |
+
|
| 204 |
+
Args:
|
| 205 |
+
query_text: Query text
|
| 206 |
+
n_results_per_collection: Results per collection
|
| 207 |
+
collections: List of collection names (None = all)
|
| 208 |
+
|
| 209 |
+
Returns:
|
| 210 |
+
Dictionary mapping collection names to results
|
| 211 |
+
"""
|
| 212 |
+
if collections is None:
|
| 213 |
+
collections = list(settings.COLLECTION_NAMES.values())
|
| 214 |
+
|
| 215 |
+
all_results = {}
|
| 216 |
+
|
| 217 |
+
for collection_name in collections:
|
| 218 |
+
try:
|
| 219 |
+
results = self.search(
|
| 220 |
+
collection_name,
|
| 221 |
+
query_text,
|
| 222 |
+
n_results=n_results_per_collection
|
| 223 |
+
)
|
| 224 |
+
all_results[collection_name] = results
|
| 225 |
+
except Exception as e:
|
| 226 |
+
print(f"Warning: Could not search '{collection_name}': {e}")
|
| 227 |
+
continue
|
| 228 |
+
|
| 229 |
+
return all_results
|
| 230 |
+
|
| 231 |
+
def delete_collection(self, collection_name: str) -> bool:
|
| 232 |
+
"""
|
| 233 |
+
Delete a collection.
|
| 234 |
+
|
| 235 |
+
Args:
|
| 236 |
+
collection_name: Name of collection to delete
|
| 237 |
+
|
| 238 |
+
Returns:
|
| 239 |
+
True if successful, False otherwise
|
| 240 |
+
"""
|
| 241 |
+
try:
|
| 242 |
+
self.client.delete_collection(name=collection_name)
|
| 243 |
+
if collection_name in self._collections:
|
| 244 |
+
del self._collections[collection_name]
|
| 245 |
+
print(f"โ Deleted collection '{collection_name}'")
|
| 246 |
+
return True
|
| 247 |
+
except Exception as e:
|
| 248 |
+
print(f"Failed to delete collection '{collection_name}': {e}")
|
| 249 |
+
return False
|
| 250 |
+
|
| 251 |
+
def clear_collection(self, collection_name: str) -> bool:
|
| 252 |
+
"""
|
| 253 |
+
Clear all documents from a collection.
|
| 254 |
+
|
| 255 |
+
Args:
|
| 256 |
+
collection_name: Name of collection to clear
|
| 257 |
+
|
| 258 |
+
Returns:
|
| 259 |
+
True if successful
|
| 260 |
+
"""
|
| 261 |
+
try:
|
| 262 |
+
self.delete_collection(collection_name)
|
| 263 |
+
self.get_or_create_collection(collection_name)
|
| 264 |
+
print(f"โ Cleared collection '{collection_name}'")
|
| 265 |
+
return True
|
| 266 |
+
except Exception as e:
|
| 267 |
+
print(f"Failed to clear collection '{collection_name}': {e}")
|
| 268 |
+
return False
|
| 269 |
+
|
| 270 |
+
def get_collection_stats(self, collection_name: str) -> Dict[str, Any]:
|
| 271 |
+
"""
|
| 272 |
+
Get statistics for a collection.
|
| 273 |
+
|
| 274 |
+
Args:
|
| 275 |
+
collection_name: Name of collection
|
| 276 |
+
|
| 277 |
+
Returns:
|
| 278 |
+
Dictionary with statistics
|
| 279 |
+
"""
|
| 280 |
+
try:
|
| 281 |
+
collection = self.get_or_create_collection(collection_name)
|
| 282 |
+
total_docs = collection.count()
|
| 283 |
+
|
| 284 |
+
if total_docs == 0:
|
| 285 |
+
return {
|
| 286 |
+
'collection_name': collection_name,
|
| 287 |
+
'total_documents': 0,
|
| 288 |
+
'chunk_types': {},
|
| 289 |
+
'sample_items': []
|
| 290 |
+
}
|
| 291 |
+
|
| 292 |
+
# Sample some documents for analysis
|
| 293 |
+
sample_size = min(100, total_docs)
|
| 294 |
+
sample = collection.get(limit=sample_size)
|
| 295 |
+
|
| 296 |
+
# Analyze chunk types
|
| 297 |
+
chunk_types = {}
|
| 298 |
+
items = set()
|
| 299 |
+
|
| 300 |
+
if sample['metadatas']:
|
| 301 |
+
for metadata in sample['metadatas']:
|
| 302 |
+
chunk_type = metadata.get('chunk_type', 'unknown')
|
| 303 |
+
chunk_types[chunk_type] = chunk_types.get(chunk_type, 0) + 1
|
| 304 |
+
|
| 305 |
+
# Collect item names
|
| 306 |
+
if 'name' in metadata:
|
| 307 |
+
items.add(metadata['name'])
|
| 308 |
+
|
| 309 |
+
return {
|
| 310 |
+
'collection_name': collection_name,
|
| 311 |
+
'total_documents': total_docs,
|
| 312 |
+
'chunk_types': chunk_types,
|
| 313 |
+
'unique_items': len(items),
|
| 314 |
+
'sample_items': sorted(list(items))[:10]
|
| 315 |
+
}
|
| 316 |
+
|
| 317 |
+
except Exception as e:
|
| 318 |
+
print(f"Error getting stats for '{collection_name}': {e}")
|
| 319 |
+
return {'collection_name': collection_name, 'error': str(e)}
|
| 320 |
+
|
| 321 |
+
def get_all_stats(self) -> Dict[str, Any]:
|
| 322 |
+
"""
|
| 323 |
+
Get statistics for all collections.
|
| 324 |
+
|
| 325 |
+
Returns:
|
| 326 |
+
Dictionary with overall statistics
|
| 327 |
+
"""
|
| 328 |
+
stats = {
|
| 329 |
+
'persist_dir': self.persist_dir,
|
| 330 |
+
'embedding_model': self.embedding_model,
|
| 331 |
+
'collections': {}
|
| 332 |
+
}
|
| 333 |
+
|
| 334 |
+
for collection_name in settings.COLLECTION_NAMES.values():
|
| 335 |
+
stats['collections'][collection_name] = self.get_collection_stats(collection_name)
|
| 336 |
+
|
| 337 |
+
# Calculate totals
|
| 338 |
+
stats['total_documents'] = sum(
|
| 339 |
+
col_stats.get('total_documents', 0)
|
| 340 |
+
for col_stats in stats['collections'].values()
|
| 341 |
+
)
|
| 342 |
+
|
| 343 |
+
return stats
|
| 344 |
+
|
| 345 |
+
def export_collection_metadata(self, collection_name: str, output_file: Path) -> bool:
|
| 346 |
+
"""
|
| 347 |
+
Export collection metadata to JSON file.
|
| 348 |
+
|
| 349 |
+
Args:
|
| 350 |
+
collection_name: Name of collection
|
| 351 |
+
output_file: Path to output JSON file
|
| 352 |
+
|
| 353 |
+
Returns:
|
| 354 |
+
True if successful
|
| 355 |
+
"""
|
| 356 |
+
try:
|
| 357 |
+
stats = self.get_collection_stats(collection_name)
|
| 358 |
+
collection = self.get_or_create_collection(collection_name)
|
| 359 |
+
|
| 360 |
+
# Get all metadata
|
| 361 |
+
all_data = collection.get()
|
| 362 |
+
|
| 363 |
+
export_data = {
|
| 364 |
+
'collection_name': collection_name,
|
| 365 |
+
'statistics': stats,
|
| 366 |
+
'metadata': all_data['metadatas'] if all_data['metadatas'] else []
|
| 367 |
+
}
|
| 368 |
+
|
| 369 |
+
with open(output_file, 'w', encoding='utf-8') as f:
|
| 370 |
+
json.dump(export_data, f, indent=2, ensure_ascii=False)
|
| 371 |
+
|
| 372 |
+
print(f"โ Exported collection metadata to {output_file}")
|
| 373 |
+
return True
|
| 374 |
+
|
| 375 |
+
except Exception as e:
|
| 376 |
+
print(f"Failed to export collection metadata: {e}")
|
| 377 |
+
return False
|
| 378 |
+
|
| 379 |
+
# ========================================================================
|
| 380 |
+
# PRIVATE HELPER METHODS
|
| 381 |
+
# ========================================================================
|
| 382 |
+
|
| 383 |
+
def _prepare_metadata(self, metadata: Dict[str, Any]) -> Dict[str, Union[str, int, float, bool]]:
|
| 384 |
+
"""
|
| 385 |
+
Prepare metadata for ChromaDB (only allows simple types).
|
| 386 |
+
|
| 387 |
+
Args:
|
| 388 |
+
metadata: Original metadata
|
| 389 |
+
|
| 390 |
+
Returns:
|
| 391 |
+
ChromaDB-compatible metadata
|
| 392 |
+
"""
|
| 393 |
+
prepared = {}
|
| 394 |
+
|
| 395 |
+
for key, value in metadata.items():
|
| 396 |
+
if value is None:
|
| 397 |
+
prepared[key] = "unknown"
|
| 398 |
+
elif isinstance(value, (list, tuple)):
|
| 399 |
+
# Convert lists to comma-separated strings
|
| 400 |
+
prepared[key] = ','.join(str(v) for v in value) if value else ""
|
| 401 |
+
elif isinstance(value, dict):
|
| 402 |
+
# Convert dicts to JSON strings
|
| 403 |
+
prepared[key] = json.dumps(value)
|
| 404 |
+
elif isinstance(value, (str, int, float, bool)):
|
| 405 |
+
prepared[key] = value
|
| 406 |
+
else:
|
| 407 |
+
# Convert everything else to string
|
| 408 |
+
prepared[key] = str(value)
|
| 409 |
+
|
| 410 |
+
return prepared
|
| 411 |
+
|
| 412 |
+
def _generate_chunk_id(self, collection_name: str, chunk: Chunk) -> str:
|
| 413 |
+
"""
|
| 414 |
+
Generate unique ID for a chunk.
|
| 415 |
+
|
| 416 |
+
Args:
|
| 417 |
+
collection_name: Name of collection
|
| 418 |
+
chunk: Chunk object
|
| 419 |
+
|
| 420 |
+
Returns:
|
| 421 |
+
Unique ID string
|
| 422 |
+
"""
|
| 423 |
+
# Use name from metadata if available, otherwise use UUID
|
| 424 |
+
base_name = chunk.metadata.get('name', 'chunk')
|
| 425 |
+
base_name = base_name.lower().replace(' ', '_').replace("'", "")
|
| 426 |
+
|
| 427 |
+
chunk_type = chunk.chunk_type.replace(' ', '_')
|
| 428 |
+
|
| 429 |
+
# Add a short random suffix for uniqueness
|
| 430 |
+
suffix = uuid.uuid4().hex[:8]
|
| 431 |
+
|
| 432 |
+
return f"{collection_name}_{base_name}_{chunk_type}_{suffix}"
|
|
File without changes
|
|
@@ -0,0 +1,490 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""
|
| 2 |
+
Spell Parser
|
| 3 |
+
|
| 4 |
+
Parses D&D spells from spells.txt and all_spells.txt files.
|
| 5 |
+
Handles OCR errors and text formatting issues from PDF extraction.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import re
|
| 9 |
+
from typing import List, Dict, Any, Optional
|
| 10 |
+
from dataclasses import dataclass, field
|
| 11 |
+
from pathlib import Path
|
| 12 |
+
|
| 13 |
+
# Import base classes
|
| 14 |
+
import sys
|
| 15 |
+
sys.path.insert(0, str(Path(__file__).parent.parent))
|
| 16 |
+
from core.base_parser import TextParser, ParsedContent, clean_extracted_text
|
| 17 |
+
from core.base_chunker import BaseChunker, Chunk, estimate_tokens
|
| 18 |
+
from config import settings
|
| 19 |
+
|
| 20 |
+
|
| 21 |
+
@dataclass
|
| 22 |
+
class SpellData:
|
| 23 |
+
"""Container for spell information."""
|
| 24 |
+
name: str
|
| 25 |
+
level: int
|
| 26 |
+
school: str
|
| 27 |
+
casting_time: str
|
| 28 |
+
range: str
|
| 29 |
+
components: str
|
| 30 |
+
duration: str
|
| 31 |
+
description: str
|
| 32 |
+
classes: List[str] = field(default_factory=list)
|
| 33 |
+
ritual: bool = False
|
| 34 |
+
concentration: bool = False
|
| 35 |
+
higher_levels: Optional[str] = None
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
class SpellParser(TextParser):
|
| 39 |
+
"""
|
| 40 |
+
Parser for D&D 5e spells.
|
| 41 |
+
|
| 42 |
+
Extracts spells from two sources:
|
| 43 |
+
1. spells.txt - Detailed spell descriptions
|
| 44 |
+
2. all_spells.txt - Class/level associations
|
| 45 |
+
"""
|
| 46 |
+
|
| 47 |
+
def __init__(self):
|
| 48 |
+
super().__init__(content_type='spell')
|
| 49 |
+
self.spells: Dict[str, SpellData] = {}
|
| 50 |
+
|
| 51 |
+
def parse(self, source: Path = None) -> List[ParsedContent]:
|
| 52 |
+
"""
|
| 53 |
+
Parse spells from files.
|
| 54 |
+
|
| 55 |
+
Args:
|
| 56 |
+
source: Not used, files are from settings
|
| 57 |
+
|
| 58 |
+
Returns:
|
| 59 |
+
List of ParsedContent objects
|
| 60 |
+
"""
|
| 61 |
+
print("๐ Parsing D&D spells...")
|
| 62 |
+
|
| 63 |
+
# Parse detailed descriptions
|
| 64 |
+
self._parse_spells_txt(settings.SPELLS_TXT)
|
| 65 |
+
|
| 66 |
+
# Parse class associations
|
| 67 |
+
self._parse_all_spells_txt(settings.ALL_SPELLS_TXT)
|
| 68 |
+
|
| 69 |
+
# Convert to ParsedContent
|
| 70 |
+
parsed_items = []
|
| 71 |
+
for spell_name, spell_data in self.spells.items():
|
| 72 |
+
parsed_items.append(ParsedContent(
|
| 73 |
+
content_type='spell',
|
| 74 |
+
name=spell_name,
|
| 75 |
+
raw_text=spell_data.description,
|
| 76 |
+
metadata=self._spell_to_metadata(spell_data)
|
| 77 |
+
))
|
| 78 |
+
|
| 79 |
+
self.parsed_items = parsed_items
|
| 80 |
+
print(f"โ Parsed {len(parsed_items)} spells")
|
| 81 |
+
return parsed_items
|
| 82 |
+
|
| 83 |
+
def validate(self, content: ParsedContent) -> bool:
|
| 84 |
+
"""Validate spell content."""
|
| 85 |
+
# Check required fields
|
| 86 |
+
if not content.name or not content.raw_text:
|
| 87 |
+
return False
|
| 88 |
+
|
| 89 |
+
metadata = content.metadata
|
| 90 |
+
required_fields = ['level', 'school']
|
| 91 |
+
|
| 92 |
+
for field in required_fields:
|
| 93 |
+
if field not in metadata:
|
| 94 |
+
return False
|
| 95 |
+
|
| 96 |
+
return True
|
| 97 |
+
|
| 98 |
+
def _parse_spells_txt(self, file_path: Path):
|
| 99 |
+
"""
|
| 100 |
+
Parse spells.txt file with detailed descriptions.
|
| 101 |
+
|
| 102 |
+
Handles OCR errors and formatting issues.
|
| 103 |
+
"""
|
| 104 |
+
if not file_path.exists():
|
| 105 |
+
print(f"Warning: {file_path} not found")
|
| 106 |
+
return
|
| 107 |
+
|
| 108 |
+
text = self.read_text_file(file_path)
|
| 109 |
+
|
| 110 |
+
# Clean OCR issues
|
| 111 |
+
text = self._clean_spell_text(text)
|
| 112 |
+
|
| 113 |
+
# Split into individual spells
|
| 114 |
+
spell_blocks = self._split_spell_blocks(text)
|
| 115 |
+
|
| 116 |
+
print(f" Found {len(spell_blocks)} spell blocks in {file_path.name}")
|
| 117 |
+
|
| 118 |
+
for block in spell_blocks:
|
| 119 |
+
spell_data = self._parse_spell_block(block)
|
| 120 |
+
if spell_data and spell_data.name:
|
| 121 |
+
self.spells[spell_data.name.upper()] = spell_data
|
| 122 |
+
|
| 123 |
+
def _parse_all_spells_txt(self, file_path: Path):
|
| 124 |
+
"""
|
| 125 |
+
Parse all_spells.txt file for class associations.
|
| 126 |
+
|
| 127 |
+
Format: Class name followed by spell lists by level.
|
| 128 |
+
"""
|
| 129 |
+
if not file_path.exists():
|
| 130 |
+
print(f"Warning: {file_path} not found")
|
| 131 |
+
return
|
| 132 |
+
|
| 133 |
+
text = self.read_text_file(file_path)
|
| 134 |
+
text = self._clean_spell_text(text)
|
| 135 |
+
|
| 136 |
+
# Parse by class sections
|
| 137 |
+
class_sections = self._split_by_class(text)
|
| 138 |
+
|
| 139 |
+
for class_name, spells_by_level in class_sections.items():
|
| 140 |
+
for level, spell_names in spells_by_level.items():
|
| 141 |
+
for spell_name in spell_names:
|
| 142 |
+
spell_key = spell_name.upper()
|
| 143 |
+
if spell_key in self.spells:
|
| 144 |
+
if class_name not in self.spells[spell_key].classes:
|
| 145 |
+
self.spells[spell_key].classes.append(class_name)
|
| 146 |
+
else:
|
| 147 |
+
# Create minimal entry for spells only in all_spells.txt
|
| 148 |
+
self.spells[spell_key] = SpellData(
|
| 149 |
+
name=spell_name,
|
| 150 |
+
level=level,
|
| 151 |
+
school="Unknown",
|
| 152 |
+
casting_time="",
|
| 153 |
+
range="",
|
| 154 |
+
components="",
|
| 155 |
+
duration="",
|
| 156 |
+
description="",
|
| 157 |
+
classes=[class_name]
|
| 158 |
+
)
|
| 159 |
+
|
| 160 |
+
def _clean_spell_text(self, text: str) -> str:
|
| 161 |
+
"""
|
| 162 |
+
Clean OCR errors and formatting issues from spell text.
|
| 163 |
+
|
| 164 |
+
Common issues:
|
| 165 |
+
- 'l' replaced with 'I' or '1'
|
| 166 |
+
- 'O' replaced with '0'
|
| 167 |
+
- Missing spaces between words
|
| 168 |
+
- Extra whitespace
|
| 169 |
+
- Broken words across lines
|
| 170 |
+
"""
|
| 171 |
+
if not text:
|
| 172 |
+
return ""
|
| 173 |
+
|
| 174 |
+
# Fix common OCR errors
|
| 175 |
+
ocr_fixes = {
|
| 176 |
+
r'\blevel\b': 'level', # Ensure 'level' not 'leveI'
|
| 177 |
+
r'\bcall\b': 'call', # Ensure 'call' not 'caIl'
|
| 178 |
+
r'\btotal\b': 'total',
|
| 179 |
+
r'\bspell\b': 'spell',
|
| 180 |
+
r'\barea\b': 'area',
|
| 181 |
+
r'(\d+)st-level': r'\1st-level', # Fix level formatting
|
| 182 |
+
r'(\d+)nd-level': r'\1nd-level',
|
| 183 |
+
r'(\d+)rd-level': r'\1rd-level',
|
| 184 |
+
r'(\d+)th-level': r'\1th-level',
|
| 185 |
+
}
|
| 186 |
+
|
| 187 |
+
for pattern, replacement in ocr_fixes.items():
|
| 188 |
+
text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
|
| 189 |
+
|
| 190 |
+
# Fix missing spaces after periods
|
| 191 |
+
text = re.sub(r'\.([A-Z])', r'. \1', text)
|
| 192 |
+
|
| 193 |
+
# Fix excessive whitespace
|
| 194 |
+
text = re.sub(r'\s+', ' ', text)
|
| 195 |
+
|
| 196 |
+
# Fix line breaks in middle of words (common OCR issue)
|
| 197 |
+
text = re.sub(r'(\w)-\s+(\w)', r'\1\2', text)
|
| 198 |
+
|
| 199 |
+
return text.strip()
|
| 200 |
+
|
| 201 |
+
def _split_spell_blocks(self, text: str) -> List[str]:
|
| 202 |
+
"""
|
| 203 |
+
Split text into individual spell blocks.
|
| 204 |
+
|
| 205 |
+
Spells typically start with NAME in caps/title case followed by level/school.
|
| 206 |
+
"""
|
| 207 |
+
# Pattern: SPELL NAME\nLevel + school
|
| 208 |
+
pattern = r'([A-Z][A-Z\s\'\-]+)\n([A-Za-z]+(?:\s+\d+[a-z]{2}-level)?\s+[a-z]+)'
|
| 209 |
+
|
| 210 |
+
matches = list(re.finditer(pattern, text))
|
| 211 |
+
blocks = []
|
| 212 |
+
|
| 213 |
+
for i, match in enumerate(matches):
|
| 214 |
+
start_pos = match.start()
|
| 215 |
+
end_pos = matches[i + 1].start() if i + 1 < len(matches) else len(text)
|
| 216 |
+
blocks.append(text[start_pos:end_pos].strip())
|
| 217 |
+
|
| 218 |
+
return blocks
|
| 219 |
+
|
| 220 |
+
def _parse_spell_block(self, block: str) -> Optional[SpellData]:
|
| 221 |
+
"""Parse a single spell block into SpellData."""
|
| 222 |
+
try:
|
| 223 |
+
lines = [l.strip() for l in block.split('\n') if l.strip()]
|
| 224 |
+
if len(lines) < 3:
|
| 225 |
+
return None
|
| 226 |
+
|
| 227 |
+
# First line is spell name
|
| 228 |
+
name = lines[0].strip()
|
| 229 |
+
|
| 230 |
+
# Second line is level and school
|
| 231 |
+
level_school = lines[1]
|
| 232 |
+
level, school = self._parse_level_school(level_school)
|
| 233 |
+
|
| 234 |
+
# Parse remaining lines for spell details
|
| 235 |
+
casting_time = ""
|
| 236 |
+
range_str = ""
|
| 237 |
+
components = ""
|
| 238 |
+
duration = ""
|
| 239 |
+
description_lines = []
|
| 240 |
+
higher_levels = None
|
| 241 |
+
|
| 242 |
+
in_description = False
|
| 243 |
+
|
| 244 |
+
for line in lines[2:]:
|
| 245 |
+
line_lower = line.lower()
|
| 246 |
+
|
| 247 |
+
if line_lower.startswith('casting time:'):
|
| 248 |
+
casting_time = line.split(':', 1)[1].strip()
|
| 249 |
+
elif line_lower.startswith('range:'):
|
| 250 |
+
range_str = line.split(':', 1)[1].strip()
|
| 251 |
+
elif line_lower.startswith('components:'):
|
| 252 |
+
components = line.split(':', 1)[1].strip()
|
| 253 |
+
elif line_lower.startswith('duration:'):
|
| 254 |
+
duration = line.split(':', 1)[1].strip()
|
| 255 |
+
in_description = True
|
| 256 |
+
elif 'at higher levels' in line_lower:
|
| 257 |
+
higher_levels = line
|
| 258 |
+
in_description = False
|
| 259 |
+
elif in_description:
|
| 260 |
+
description_lines.append(line)
|
| 261 |
+
|
| 262 |
+
description = ' '.join(description_lines).strip()
|
| 263 |
+
|
| 264 |
+
# Check for ritual and concentration
|
| 265 |
+
ritual = 'ritual' in block.lower()
|
| 266 |
+
concentration = 'concentration' in duration.lower()
|
| 267 |
+
|
| 268 |
+
return SpellData(
|
| 269 |
+
name=name,
|
| 270 |
+
level=level,
|
| 271 |
+
school=school,
|
| 272 |
+
casting_time=casting_time,
|
| 273 |
+
range=range_str,
|
| 274 |
+
components=components,
|
| 275 |
+
duration=duration,
|
| 276 |
+
description=description,
|
| 277 |
+
ritual=ritual,
|
| 278 |
+
concentration=concentration,
|
| 279 |
+
higher_levels=higher_levels
|
| 280 |
+
)
|
| 281 |
+
|
| 282 |
+
except Exception as e:
|
| 283 |
+
print(f"Warning: Failed to parse spell block: {e}")
|
| 284 |
+
return None
|
| 285 |
+
|
| 286 |
+
def _parse_level_school(self, text: str) -> tuple:
|
| 287 |
+
"""
|
| 288 |
+
Parse spell level and school from text.
|
| 289 |
+
|
| 290 |
+
Examples:
|
| 291 |
+
- "1st-level evocation"
|
| 292 |
+
- "Evocation cantrip"
|
| 293 |
+
- "3rd-level illusion"
|
| 294 |
+
"""
|
| 295 |
+
text = text.lower()
|
| 296 |
+
|
| 297 |
+
# Determine level
|
| 298 |
+
if 'cantrip' in text:
|
| 299 |
+
level = 0
|
| 300 |
+
else:
|
| 301 |
+
level_match = re.search(r'(\d+)(?:st|nd|rd|th)-level', text)
|
| 302 |
+
if level_match:
|
| 303 |
+
level = int(level_match.group(1))
|
| 304 |
+
else:
|
| 305 |
+
level = 0
|
| 306 |
+
|
| 307 |
+
# Determine school
|
| 308 |
+
schools = ['abjuration', 'conjuration', 'divination', 'enchantment',
|
| 309 |
+
'evocation', 'illusion', 'necromancy', 'transmutation']
|
| 310 |
+
|
| 311 |
+
school = 'unknown'
|
| 312 |
+
for s in schools:
|
| 313 |
+
if s in text:
|
| 314 |
+
school = s.capitalize()
|
| 315 |
+
break
|
| 316 |
+
|
| 317 |
+
return level, school
|
| 318 |
+
|
| 319 |
+
def _split_by_class(self, text: str) -> Dict[str, Dict[int, List[str]]]:
|
| 320 |
+
"""
|
| 321 |
+
Split all_spells.txt by class and level.
|
| 322 |
+
|
| 323 |
+
Returns:
|
| 324 |
+
Dict mapping class_name -> {level -> [spell_names]}
|
| 325 |
+
"""
|
| 326 |
+
class_sections = {}
|
| 327 |
+
current_class = None
|
| 328 |
+
current_level = None
|
| 329 |
+
|
| 330 |
+
lines = text.split('\n')
|
| 331 |
+
|
| 332 |
+
for line in lines:
|
| 333 |
+
line = line.strip()
|
| 334 |
+
if not line:
|
| 335 |
+
continue
|
| 336 |
+
|
| 337 |
+
# Check if this is a class header
|
| 338 |
+
if any(cls in line.upper() for cls in settings.DND_CLASSES):
|
| 339 |
+
# Extract class name
|
| 340 |
+
for cls in settings.DND_CLASSES:
|
| 341 |
+
if cls.upper() in line.upper():
|
| 342 |
+
current_class = cls
|
| 343 |
+
class_sections[current_class] = {}
|
| 344 |
+
break
|
| 345 |
+
|
| 346 |
+
# Check if this is a level header
|
| 347 |
+
elif 'level' in line.lower() or 'cantrip' in line.lower():
|
| 348 |
+
if current_class:
|
| 349 |
+
level_match = re.search(r'(\d+)(?:st|nd|rd|th)?\s+level', line, re.IGNORECASE)
|
| 350 |
+
if level_match:
|
| 351 |
+
current_level = int(level_match.group(1))
|
| 352 |
+
elif 'cantrip' in line.lower():
|
| 353 |
+
current_level = 0
|
| 354 |
+
|
| 355 |
+
if current_level is not None and current_level not in class_sections[current_class]:
|
| 356 |
+
class_sections[current_class][current_level] = []
|
| 357 |
+
|
| 358 |
+
# Otherwise, this should be spell names
|
| 359 |
+
elif current_class and current_level is not None:
|
| 360 |
+
# Split by commas and clean
|
| 361 |
+
spell_names = [s.strip() for s in line.split(',') if s.strip()]
|
| 362 |
+
class_sections[current_class][current_level].extend(spell_names)
|
| 363 |
+
|
| 364 |
+
return class_sections
|
| 365 |
+
|
| 366 |
+
def _spell_to_metadata(self, spell: SpellData) -> Dict[str, Any]:
|
| 367 |
+
"""Convert SpellData to metadata dictionary."""
|
| 368 |
+
return {
|
| 369 |
+
'name': spell.name,
|
| 370 |
+
'level': spell.level,
|
| 371 |
+
'school': spell.school,
|
| 372 |
+
'casting_time': spell.casting_time,
|
| 373 |
+
'range': spell.range,
|
| 374 |
+
'components': spell.components,
|
| 375 |
+
'duration': spell.duration,
|
| 376 |
+
'classes': spell.classes,
|
| 377 |
+
'ritual': spell.ritual,
|
| 378 |
+
'concentration': spell.concentration,
|
| 379 |
+
'has_higher_levels': spell.higher_levels is not None
|
| 380 |
+
}
|
| 381 |
+
|
| 382 |
+
|
| 383 |
+
class SpellChunker(BaseChunker):
|
| 384 |
+
"""
|
| 385 |
+
Creates optimized chunks for spell RAG retrieval.
|
| 386 |
+
|
| 387 |
+
Creates multiple chunk types:
|
| 388 |
+
- full_spell: Complete spell with all details
|
| 389 |
+
- quick_reference: Concise mechanical summary
|
| 390 |
+
- by_class: Class-specific reference
|
| 391 |
+
- by_level: Level-specific reference
|
| 392 |
+
"""
|
| 393 |
+
|
| 394 |
+
def create_chunks(self, parsed_content: ParsedContent) -> List[Chunk]:
|
| 395 |
+
"""Create spell chunks from parsed content."""
|
| 396 |
+
chunks = []
|
| 397 |
+
metadata = parsed_content.metadata
|
| 398 |
+
|
| 399 |
+
# 1. Full spell chunk
|
| 400 |
+
full_chunk = self._create_full_spell_chunk(parsed_content)
|
| 401 |
+
if full_chunk:
|
| 402 |
+
chunks.append(full_chunk)
|
| 403 |
+
|
| 404 |
+
# 2. Quick reference chunk
|
| 405 |
+
quick_ref_chunk = self._create_quick_reference_chunk(parsed_content)
|
| 406 |
+
if quick_ref_chunk:
|
| 407 |
+
chunks.append(quick_ref_chunk)
|
| 408 |
+
|
| 409 |
+
# 3. Class-specific chunks (one per class)
|
| 410 |
+
if metadata.get('classes'):
|
| 411 |
+
for class_name in metadata['classes']:
|
| 412 |
+
class_chunk = self._create_class_chunk(parsed_content, class_name)
|
| 413 |
+
if class_chunk:
|
| 414 |
+
chunks.append(class_chunk)
|
| 415 |
+
|
| 416 |
+
return chunks
|
| 417 |
+
|
| 418 |
+
def _create_full_spell_chunk(self, parsed_content: ParsedContent) -> Chunk:
|
| 419 |
+
"""Create full spell description chunk."""
|
| 420 |
+
meta = parsed_content.metadata
|
| 421 |
+
|
| 422 |
+
content_parts = [
|
| 423 |
+
f"**{meta['name']}**",
|
| 424 |
+
f"Level {meta['level']} {meta['school']}",
|
| 425 |
+
f"**Casting Time:** {meta['casting_time']}",
|
| 426 |
+
f"**Range:** {meta['range']}",
|
| 427 |
+
f"**Components:** {meta['components']}",
|
| 428 |
+
f"**Duration:** {meta['duration']}",
|
| 429 |
+
"",
|
| 430 |
+
parsed_content.raw_text
|
| 431 |
+
]
|
| 432 |
+
|
| 433 |
+
if meta.get('classes'):
|
| 434 |
+
content_parts.insert(2, f"**Classes:** {', '.join(meta['classes'])}")
|
| 435 |
+
|
| 436 |
+
content = "\n".join(content_parts)
|
| 437 |
+
|
| 438 |
+
tags = {
|
| 439 |
+
'spell',
|
| 440 |
+
'full_description',
|
| 441 |
+
f"level_{meta['level']}",
|
| 442 |
+
f"school_{meta['school'].lower()}"
|
| 443 |
+
}
|
| 444 |
+
|
| 445 |
+
if meta.get('ritual'):
|
| 446 |
+
tags.add('ritual')
|
| 447 |
+
if meta.get('concentration'):
|
| 448 |
+
tags.add('concentration')
|
| 449 |
+
|
| 450 |
+
return Chunk(
|
| 451 |
+
content=content,
|
| 452 |
+
chunk_type='full_spell',
|
| 453 |
+
metadata=meta.copy(),
|
| 454 |
+
tags=tags
|
| 455 |
+
)
|
| 456 |
+
|
| 457 |
+
def _create_quick_reference_chunk(self, parsed_content: ParsedContent) -> Chunk:
|
| 458 |
+
"""Create quick reference chunk with just mechanics."""
|
| 459 |
+
meta = parsed_content.metadata
|
| 460 |
+
|
| 461 |
+
content = f"**{meta['name']}** - Level {meta['level']} {meta['school']}\n"
|
| 462 |
+
content += f"Cast: {meta['casting_time']} | Range: {meta['range']} | "
|
| 463 |
+
content += f"Components: {meta['components']} | Duration: {meta['duration']}\n"
|
| 464 |
+
|
| 465 |
+
# Add first sentence of description
|
| 466 |
+
first_sentence = parsed_content.raw_text.split('.')[0] + '.'
|
| 467 |
+
content += f"\n{first_sentence}"
|
| 468 |
+
|
| 469 |
+
return Chunk(
|
| 470 |
+
content=content,
|
| 471 |
+
chunk_type='quick_reference',
|
| 472 |
+
metadata=meta.copy(),
|
| 473 |
+
tags={'spell', 'quick_ref', f"level_{meta['level']}"}
|
| 474 |
+
)
|
| 475 |
+
|
| 476 |
+
def _create_class_chunk(self, parsed_content: ParsedContent, class_name: str) -> Chunk:
|
| 477 |
+
"""Create class-specific spell chunk."""
|
| 478 |
+
meta = parsed_content.metadata.copy()
|
| 479 |
+
meta['for_class'] = class_name
|
| 480 |
+
|
| 481 |
+
content = f"**{class_name} Spell: {meta['name']}** (Level {meta['level']})\n"
|
| 482 |
+
content += f"{meta['school']} | {meta['casting_time']} | {meta['range']}\n\n"
|
| 483 |
+
content += parsed_content.raw_text[:300] + "..." # Truncate for class-specific
|
| 484 |
+
|
| 485 |
+
return Chunk(
|
| 486 |
+
content=content,
|
| 487 |
+
chunk_type='by_class',
|
| 488 |
+
metadata=meta,
|
| 489 |
+
tags={'spell', 'class_specific', f"class_{class_name.lower()}", f"level_{meta['level']}"}
|
| 490 |
+
)
|
|
File without changes
|
|
@@ -0,0 +1,423 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
D&D RAG System Initialization Script
|
| 4 |
+
|
| 5 |
+
Loads all D&D content into ChromaDB using existing notebook parsing code.
|
| 6 |
+
This is a pragmatic wrapper that uses proven parsing logic.
|
| 7 |
+
|
| 8 |
+
Usage:
|
| 9 |
+
python initialize_rag.py [--clear] [--only spells,monsters,classes,races]
|
| 10 |
+
|
| 11 |
+
Examples:
|
| 12 |
+
python initialize_rag.py # Load all content
|
| 13 |
+
python initialize_rag.py --clear # Clear and reload all
|
| 14 |
+
python initialize_rag.py --only spells # Load only spells
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
import argparse
|
| 18 |
+
import sys
|
| 19 |
+
from pathlib import Path
|
| 20 |
+
from typing import List, Dict, Any
|
| 21 |
+
import re
|
| 22 |
+
|
| 23 |
+
# Add project to path
|
| 24 |
+
project_root = Path(__file__).parent
|
| 25 |
+
sys.path.insert(0, str(project_root))
|
| 26 |
+
|
| 27 |
+
# Import our core infrastructure
|
| 28 |
+
from dnd_rag_system.core.chroma_manager import ChromaDBManager
|
| 29 |
+
from dnd_rag_system.core.base_chunker import Chunk
|
| 30 |
+
from dnd_rag_system.config import settings
|
| 31 |
+
|
| 32 |
+
|
| 33 |
+
# =============================================================================
|
| 34 |
+
# SPELL LOADER (adapted from rag_spells2.ipynb)
|
| 35 |
+
# =============================================================================
|
| 36 |
+
|
| 37 |
+
def load_spells(db_manager: ChromaDBManager, clear: bool = False):
|
| 38 |
+
"""Load spells from spells.txt and all_spells.txt into ChromaDB."""
|
| 39 |
+
|
| 40 |
+
print("\n" + "="*70)
|
| 41 |
+
print("๐ LOADING SPELLS")
|
| 42 |
+
print("="*70)
|
| 43 |
+
|
| 44 |
+
if clear:
|
| 45 |
+
db_manager.clear_collection(settings.COLLECTION_NAMES['spells'])
|
| 46 |
+
|
| 47 |
+
# Read spells.txt
|
| 48 |
+
print(f"๐ Reading {settings.SPELLS_TXT}")
|
| 49 |
+
with open(settings.SPELLS_TXT, 'r', encoding='utf-8') as f:
|
| 50 |
+
spells_content = f.read()
|
| 51 |
+
|
| 52 |
+
# Simple spell parsing (adapted from your notebook)
|
| 53 |
+
spell_blocks = _split_spell_blocks(spells_content)
|
| 54 |
+
print(f"โ Found {len(spell_blocks)} spell blocks")
|
| 55 |
+
|
| 56 |
+
# Create chunks
|
| 57 |
+
chunks = []
|
| 58 |
+
for i, block in enumerate(spell_blocks):
|
| 59 |
+
try:
|
| 60 |
+
spell_chunk = _parse_spell_to_chunk(block)
|
| 61 |
+
if spell_chunk:
|
| 62 |
+
chunks.append(spell_chunk)
|
| 63 |
+
|
| 64 |
+
if (i + 1) % 50 == 0:
|
| 65 |
+
print(f" Processed {i + 1}/{len(spell_blocks)} spells...")
|
| 66 |
+
except Exception as e:
|
| 67 |
+
print(f" Warning: Failed to parse spell {i+1}: {e}")
|
| 68 |
+
continue
|
| 69 |
+
|
| 70 |
+
print(f"โ Created {len(chunks)} spell chunks")
|
| 71 |
+
|
| 72 |
+
# Add to ChromaDB
|
| 73 |
+
if chunks:
|
| 74 |
+
db_manager.add_chunks(settings.COLLECTION_NAMES['spells'], chunks)
|
| 75 |
+
print(f"โ
Loaded {len(chunks)} spells into ChromaDB")
|
| 76 |
+
|
| 77 |
+
return len(chunks)
|
| 78 |
+
|
| 79 |
+
|
| 80 |
+
def _split_spell_blocks(content: str) -> List[str]:
|
| 81 |
+
"""Split spell text into individual spell blocks."""
|
| 82 |
+
# Pattern: UPPERCASE SPELL NAME followed by spell details
|
| 83 |
+
spell_pattern = r'\n(?=[A-Z][A-Z\s\']{2,}\s*\n)'
|
| 84 |
+
blocks = re.split(spell_pattern, content)
|
| 85 |
+
|
| 86 |
+
# Filter valid blocks (must contain "level" or "cantrip")
|
| 87 |
+
valid_blocks = []
|
| 88 |
+
for block in blocks:
|
| 89 |
+
block = block.strip()
|
| 90 |
+
if len(block) > 100 and ('level' in block.lower() or 'cantrip' in block.lower()):
|
| 91 |
+
valid_blocks.append(block)
|
| 92 |
+
|
| 93 |
+
return valid_blocks
|
| 94 |
+
|
| 95 |
+
|
| 96 |
+
def _parse_spell_to_chunk(block: str) -> Chunk:
|
| 97 |
+
"""Parse a spell block into a Chunk object."""
|
| 98 |
+
lines = [l.strip() for l in block.split('\n') if l.strip()]
|
| 99 |
+
|
| 100 |
+
if len(lines) < 3:
|
| 101 |
+
return None
|
| 102 |
+
|
| 103 |
+
# Extract spell name (first line, uppercase)
|
| 104 |
+
name = lines[0].strip()
|
| 105 |
+
|
| 106 |
+
# Extract level and school (second line)
|
| 107 |
+
level_school_line = lines[1].lower()
|
| 108 |
+
level = 0
|
| 109 |
+
if 'cantrip' in level_school_line:
|
| 110 |
+
level = 0
|
| 111 |
+
else:
|
| 112 |
+
level_match = re.search(r'(\d+)(?:st|nd|rd|th)', level_school_line)
|
| 113 |
+
if level_match:
|
| 114 |
+
level = int(level_match.group(1))
|
| 115 |
+
|
| 116 |
+
# Determine school
|
| 117 |
+
schools = ['abjuration', 'conjuration', 'divination', 'enchantment',
|
| 118 |
+
'evocation', 'illusion', 'necromancy', 'transmutation']
|
| 119 |
+
school = 'unknown'
|
| 120 |
+
for s in schools:
|
| 121 |
+
if s in level_school_line:
|
| 122 |
+
school = s.capitalize()
|
| 123 |
+
break
|
| 124 |
+
|
| 125 |
+
# Rest is the description
|
| 126 |
+
description = '\n'.join(lines[2:])
|
| 127 |
+
|
| 128 |
+
# Create full spell text
|
| 129 |
+
content = f"**{name}**\n"
|
| 130 |
+
content += f"Level {level} {school}\n\n"
|
| 131 |
+
content += description
|
| 132 |
+
|
| 133 |
+
metadata = {
|
| 134 |
+
'name': name,
|
| 135 |
+
'level': level,
|
| 136 |
+
'school': school,
|
| 137 |
+
'content_type': 'spell'
|
| 138 |
+
}
|
| 139 |
+
|
| 140 |
+
tags = {
|
| 141 |
+
'spell',
|
| 142 |
+
f'level_{level}',
|
| 143 |
+
f'school_{school.lower()}'
|
| 144 |
+
}
|
| 145 |
+
|
| 146 |
+
return Chunk(
|
| 147 |
+
content=content,
|
| 148 |
+
chunk_type='full_spell',
|
| 149 |
+
metadata=metadata,
|
| 150 |
+
tags=tags
|
| 151 |
+
)
|
| 152 |
+
|
| 153 |
+
|
| 154 |
+
# =============================================================================
|
| 155 |
+
# MONSTER LOADER (adapted from monster_to_rag.ipynb)
|
| 156 |
+
# =============================================================================
|
| 157 |
+
|
| 158 |
+
def load_monsters(db_manager: ChromaDBManager, clear: bool = False):
|
| 159 |
+
"""Load monsters from extracted_monsters.txt into ChromaDB."""
|
| 160 |
+
|
| 161 |
+
print("\n" + "="*70)
|
| 162 |
+
print("๐น LOADING MONSTERS")
|
| 163 |
+
print("="*70)
|
| 164 |
+
|
| 165 |
+
if clear:
|
| 166 |
+
db_manager.clear_collection(settings.COLLECTION_NAMES['monsters'])
|
| 167 |
+
|
| 168 |
+
# Read extracted monsters
|
| 169 |
+
print(f"๐ Reading {settings.EXTRACTED_MONSTERS_TXT}")
|
| 170 |
+
|
| 171 |
+
if not settings.EXTRACTED_MONSTERS_TXT.exists():
|
| 172 |
+
print("โ ๏ธ Monster file not found, skipping")
|
| 173 |
+
return 0
|
| 174 |
+
|
| 175 |
+
with open(settings.EXTRACTED_MONSTERS_TXT, 'r', encoding='utf-8') as f:
|
| 176 |
+
monsters_content = f.read()
|
| 177 |
+
|
| 178 |
+
# Simple monster parsing
|
| 179 |
+
monster_blocks = _split_monster_blocks(monsters_content)
|
| 180 |
+
print(f"โ Found {len(monster_blocks)} monster blocks")
|
| 181 |
+
|
| 182 |
+
# Create chunks
|
| 183 |
+
chunks = []
|
| 184 |
+
for i, block in enumerate(monster_blocks):
|
| 185 |
+
try:
|
| 186 |
+
monster_chunk = _parse_monster_to_chunk(block)
|
| 187 |
+
if monster_chunk:
|
| 188 |
+
chunks.append(monster_chunk)
|
| 189 |
+
|
| 190 |
+
if (i + 1) % 50 == 0:
|
| 191 |
+
print(f" Processed {i + 1}/{len(monster_blocks)} monsters...")
|
| 192 |
+
except Exception as e:
|
| 193 |
+
print(f" Warning: Failed to parse monster {i+1}: {e}")
|
| 194 |
+
continue
|
| 195 |
+
|
| 196 |
+
print(f"โ Created {len(chunks)} monster chunks")
|
| 197 |
+
|
| 198 |
+
# Add to ChromaDB
|
| 199 |
+
if chunks:
|
| 200 |
+
db_manager.add_chunks(settings.COLLECTION_NAMES['monsters'], chunks)
|
| 201 |
+
print(f"โ
Loaded {len(chunks)} monsters into ChromaDB")
|
| 202 |
+
|
| 203 |
+
return len(chunks)
|
| 204 |
+
|
| 205 |
+
|
| 206 |
+
def _split_monster_blocks(content: str) -> List[str]:
|
| 207 |
+
"""Split monster text into individual blocks."""
|
| 208 |
+
# Pattern: MONSTER NAME (often all caps or title case)
|
| 209 |
+
blocks = content.split('\n\n')
|
| 210 |
+
valid_blocks = [b.strip() for b in blocks if len(b.strip()) > 200]
|
| 211 |
+
return valid_blocks
|
| 212 |
+
|
| 213 |
+
|
| 214 |
+
def _parse_monster_to_chunk(block: str) -> Chunk:
|
| 215 |
+
"""Parse a monster block into a Chunk object."""
|
| 216 |
+
lines = [l.strip() for l in block.split('\n') if l.strip()]
|
| 217 |
+
|
| 218 |
+
if not lines:
|
| 219 |
+
return None
|
| 220 |
+
|
| 221 |
+
# Extract name (usually first line)
|
| 222 |
+
name = lines[0].strip()
|
| 223 |
+
|
| 224 |
+
# Full content
|
| 225 |
+
content = block
|
| 226 |
+
|
| 227 |
+
# Try to extract CR
|
| 228 |
+
cr = "Unknown"
|
| 229 |
+
cr_match = re.search(r'Challenge(?:\s+Rating)?[:\s]+([^\s\(]+)', block, re.IGNORECASE)
|
| 230 |
+
if cr_match:
|
| 231 |
+
cr = cr_match.group(1).strip()
|
| 232 |
+
|
| 233 |
+
metadata = {
|
| 234 |
+
'name': name,
|
| 235 |
+
'challenge_rating': cr,
|
| 236 |
+
'content_type': 'monster'
|
| 237 |
+
}
|
| 238 |
+
|
| 239 |
+
tags = {'monster', f'cr_{cr.replace("/", "_")}'}
|
| 240 |
+
|
| 241 |
+
return Chunk(
|
| 242 |
+
content=content,
|
| 243 |
+
chunk_type='monster_stats',
|
| 244 |
+
metadata=metadata,
|
| 245 |
+
tags=tags
|
| 246 |
+
)
|
| 247 |
+
|
| 248 |
+
|
| 249 |
+
# =============================================================================
|
| 250 |
+
# CLASS LOADER (adapted from classes_to_rag.ipynb)
|
| 251 |
+
# =============================================================================
|
| 252 |
+
|
| 253 |
+
def load_classes(db_manager: ChromaDBManager, clear: bool = False):
|
| 254 |
+
"""Load classes from extracted_classes.txt into ChromaDB."""
|
| 255 |
+
|
| 256 |
+
print("\n" + "="*70)
|
| 257 |
+
print("โ๏ธ LOADING CLASSES")
|
| 258 |
+
print("="*70)
|
| 259 |
+
|
| 260 |
+
if clear:
|
| 261 |
+
db_manager.clear_collection(settings.COLLECTION_NAMES['classes'])
|
| 262 |
+
|
| 263 |
+
# Read extracted classes
|
| 264 |
+
print(f"๐ Reading {settings.EXTRACTED_CLASSES_TXT}")
|
| 265 |
+
|
| 266 |
+
if not settings.EXTRACTED_CLASSES_TXT.exists():
|
| 267 |
+
print("โ ๏ธ Classes file not found, skipping")
|
| 268 |
+
return 0
|
| 269 |
+
|
| 270 |
+
with open(settings.EXTRACTED_CLASSES_TXT, 'r', encoding='utf-8') as f:
|
| 271 |
+
classes_content = f.read()
|
| 272 |
+
|
| 273 |
+
# Simple class parsing - split by known class names
|
| 274 |
+
class_blocks = _split_class_blocks(classes_content)
|
| 275 |
+
print(f"โ Found {len(class_blocks)} class blocks")
|
| 276 |
+
|
| 277 |
+
# Create chunks
|
| 278 |
+
chunks = []
|
| 279 |
+
for class_name, content in class_blocks.items():
|
| 280 |
+
try:
|
| 281 |
+
class_chunk = _parse_class_to_chunk(class_name, content)
|
| 282 |
+
if class_chunk:
|
| 283 |
+
chunks.append(class_chunk)
|
| 284 |
+
except Exception as e:
|
| 285 |
+
print(f" Warning: Failed to parse class {class_name}: {e}")
|
| 286 |
+
continue
|
| 287 |
+
|
| 288 |
+
print(f"โ Created {len(chunks)} class chunks")
|
| 289 |
+
|
| 290 |
+
# Add to ChromaDB
|
| 291 |
+
if chunks:
|
| 292 |
+
db_manager.add_chunks(settings.COLLECTION_NAMES['classes'], chunks)
|
| 293 |
+
print(f"โ
Loaded {len(chunks)} classes into ChromaDB")
|
| 294 |
+
|
| 295 |
+
return len(chunks)
|
| 296 |
+
|
| 297 |
+
|
| 298 |
+
def _split_class_blocks(content: str) -> Dict[str, str]:
|
| 299 |
+
"""Split content by class names."""
|
| 300 |
+
class_blocks = {}
|
| 301 |
+
|
| 302 |
+
for i, class_name in enumerate(settings.DND_CLASSES):
|
| 303 |
+
# Find this class
|
| 304 |
+
pattern = rf'\b{class_name.upper()}\b'
|
| 305 |
+
matches = list(re.finditer(pattern, content, re.IGNORECASE))
|
| 306 |
+
|
| 307 |
+
if matches:
|
| 308 |
+
start = matches[0].start()
|
| 309 |
+
# Find end (next class or end of text)
|
| 310 |
+
end = len(content)
|
| 311 |
+
for next_class in settings.DND_CLASSES[i+1:]:
|
| 312 |
+
next_pattern = rf'\b{next_class.upper()}\b'
|
| 313 |
+
next_matches = re.search(next_pattern, content[start+100:], re.IGNORECASE)
|
| 314 |
+
if next_matches:
|
| 315 |
+
end = start + 100 + next_matches.start()
|
| 316 |
+
break
|
| 317 |
+
|
| 318 |
+
class_content = content[start:end].strip()
|
| 319 |
+
if len(class_content) > 500: # Substantial content
|
| 320 |
+
class_blocks[class_name] = class_content
|
| 321 |
+
|
| 322 |
+
return class_blocks
|
| 323 |
+
|
| 324 |
+
|
| 325 |
+
def _parse_class_to_chunk(class_name: str, content: str) -> Chunk:
|
| 326 |
+
"""Parse a class block into a Chunk object."""
|
| 327 |
+
metadata = {
|
| 328 |
+
'name': class_name,
|
| 329 |
+
'content_type': 'class'
|
| 330 |
+
}
|
| 331 |
+
|
| 332 |
+
tags = {'class', f'class_{class_name.lower()}'}
|
| 333 |
+
|
| 334 |
+
# Format content
|
| 335 |
+
formatted_content = f"**{class_name}**\n\n{content[:2000]}" # Limit size
|
| 336 |
+
|
| 337 |
+
return Chunk(
|
| 338 |
+
content=formatted_content,
|
| 339 |
+
chunk_type='class_features',
|
| 340 |
+
metadata=metadata,
|
| 341 |
+
tags=tags
|
| 342 |
+
)
|
| 343 |
+
|
| 344 |
+
|
| 345 |
+
# =============================================================================
|
| 346 |
+
# RACE LOADER (adapted from races_to_rag.ipynb)
|
| 347 |
+
# =============================================================================
|
| 348 |
+
|
| 349 |
+
def load_races(db_manager: ChromaDBManager, clear: bool = False):
|
| 350 |
+
"""Load races - placeholder for now."""
|
| 351 |
+
|
| 352 |
+
print("\n" + "="*70)
|
| 353 |
+
print("๐ง LOADING RACES")
|
| 354 |
+
print("="*70)
|
| 355 |
+
print("โ ๏ธ Race loader not yet implemented (can add later)")
|
| 356 |
+
|
| 357 |
+
return 0
|
| 358 |
+
|
| 359 |
+
|
| 360 |
+
# =============================================================================
|
| 361 |
+
# MAIN INITIALIZATION
|
| 362 |
+
# =============================================================================
|
| 363 |
+
|
| 364 |
+
def main():
|
| 365 |
+
"""Main initialization function."""
|
| 366 |
+
parser = argparse.ArgumentParser(description='Initialize D&D RAG System')
|
| 367 |
+
parser.add_argument('--clear', action='store_true', help='Clear existing data')
|
| 368 |
+
parser.add_argument('--only', type=str, help='Load only specific collections (comma-separated)')
|
| 369 |
+
args = parser.parse_args()
|
| 370 |
+
|
| 371 |
+
print("\n" + "="*70)
|
| 372 |
+
print("๐ฒ D&D RAG SYSTEM INITIALIZATION")
|
| 373 |
+
print("="*70)
|
| 374 |
+
|
| 375 |
+
# Initialize ChromaDB
|
| 376 |
+
print("\n๐ง Initializing ChromaDB...")
|
| 377 |
+
db_manager = ChromaDBManager()
|
| 378 |
+
|
| 379 |
+
# Determine what to load
|
| 380 |
+
load_all = args.only is None
|
| 381 |
+
to_load = args.only.split(',') if args.only else ['spells', 'monsters', 'classes', 'races']
|
| 382 |
+
|
| 383 |
+
# Load each collection
|
| 384 |
+
results = {}
|
| 385 |
+
|
| 386 |
+
if load_all or 'spells' in to_load:
|
| 387 |
+
results['spells'] = load_spells(db_manager, args.clear)
|
| 388 |
+
|
| 389 |
+
if load_all or 'monsters' in to_load:
|
| 390 |
+
results['monsters'] = load_monsters(db_manager, args.clear)
|
| 391 |
+
|
| 392 |
+
if load_all or 'classes' in to_load:
|
| 393 |
+
results['classes'] = load_classes(db_manager, args.clear)
|
| 394 |
+
|
| 395 |
+
if load_all or 'races' in to_load:
|
| 396 |
+
results['races'] = load_races(db_manager, args.clear)
|
| 397 |
+
|
| 398 |
+
# Summary
|
| 399 |
+
print("\n" + "="*70)
|
| 400 |
+
print("๐ INITIALIZATION SUMMARY")
|
| 401 |
+
print("="*70)
|
| 402 |
+
|
| 403 |
+
total_chunks = sum(results.values())
|
| 404 |
+
for content_type, count in results.items():
|
| 405 |
+
print(f" {content_type.capitalize()}: {count} chunks")
|
| 406 |
+
|
| 407 |
+
print(f"\nโ
Total: {total_chunks} chunks loaded into ChromaDB")
|
| 408 |
+
|
| 409 |
+
# Show collection stats
|
| 410 |
+
print("\n๐ Collection Statistics:")
|
| 411 |
+
stats = db_manager.get_all_stats()
|
| 412 |
+
for collection_name, col_stats in stats['collections'].items():
|
| 413 |
+
print(f" {collection_name}: {col_stats.get('total_documents', 0)} documents")
|
| 414 |
+
|
| 415 |
+
print("\n๐ Initialization complete!")
|
| 416 |
+
print(f" Database: {db_manager.persist_dir}")
|
| 417 |
+
print("\n๐ก Next steps:")
|
| 418 |
+
print(" - Test searches: python test_rag_search.py")
|
| 419 |
+
print(" - Run GM dialogue: python run_gm_dialogue.py")
|
| 420 |
+
|
| 421 |
+
|
| 422 |
+
if __name__ == '__main__':
|
| 423 |
+
main()
|
|
@@ -0,0 +1,290 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# D&D RAG System - Implementation Progress
|
| 2 |
+
|
| 3 |
+
**Project Start Date**: November 6, 2024
|
| 4 |
+
**Status**: ๐ง In Progress
|
| 5 |
+
|
| 6 |
+
---
|
| 7 |
+
|
| 8 |
+
## ๐ Overall Progress
|
| 9 |
+
|
| 10 |
+
| Phase | Status | Progress | Notes |
|
| 11 |
+
|-------|--------|----------|-------|
|
| 12 |
+
| **Phase 1: Core Infrastructure** | ๐ง In Progress | 1/4 | Directory structure created |
|
| 13 |
+
| **Phase 2: Data Processors** | โณ Pending | 0/4 | Waiting for Phase 1 |
|
| 14 |
+
| **Phase 3: Initialization** | โณ Pending | 0/2 | Waiting for Phase 2 |
|
| 15 |
+
| **Phase 4: Query Interface** | โณ Pending | 0/1 | Waiting for Phase 3 |
|
| 16 |
+
| **Phase 5: GM Dialogue** | โณ Pending | 0/2 | Waiting for Phase 4 |
|
| 17 |
+
| **Phase 6: Character Creation** | โณ Pending | 0/2 | Waiting for Phase 4 |
|
| 18 |
+
|
| 19 |
+
**Legend**: โ
Complete | ๐ง In Progress | โณ Pending | โ Blocked
|
| 20 |
+
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## ๐ Phase 1: Core Infrastructure
|
| 24 |
+
|
| 25 |
+
### โ
1.1 Project Structure
|
| 26 |
+
- [x] Created `dnd_rag_system/` directory
|
| 27 |
+
- [x] Created `config/` subdirectory
|
| 28 |
+
- [x] Created `core/` subdirectory
|
| 29 |
+
- [x] Created `parsers/` subdirectory
|
| 30 |
+
- [x] Created `systems/` subdirectory
|
| 31 |
+
- [x] Created `data/` subdirectory
|
| 32 |
+
- [x] Created `__init__.py` files for all packages
|
| 33 |
+
|
| 34 |
+
### โณ 1.2 Configuration System
|
| 35 |
+
**File**: `config/settings.py`
|
| 36 |
+
- [ ] ChromaDB configuration
|
| 37 |
+
- [ ] Ollama model settings
|
| 38 |
+
- [ ] Embedding model settings
|
| 39 |
+
- [ ] Collection naming conventions
|
| 40 |
+
- [ ] Data source paths
|
| 41 |
+
- [ ] Chunk size parameters
|
| 42 |
+
|
| 43 |
+
### โณ 1.3 Base Parser
|
| 44 |
+
**File**: `core/base_parser.py`
|
| 45 |
+
- [ ] `BaseParser` abstract class
|
| 46 |
+
- [ ] PDF extraction utilities (pdfplumber)
|
| 47 |
+
- [ ] Text extraction utilities
|
| 48 |
+
- [ ] Common validation methods
|
| 49 |
+
- [ ] Error handling framework
|
| 50 |
+
|
| 51 |
+
### โณ 1.4 Base Chunker
|
| 52 |
+
**File**: `core/base_chunker.py`
|
| 53 |
+
- [ ] `BaseChunker` abstract class
|
| 54 |
+
- [ ] Token estimation function
|
| 55 |
+
- [ ] Chunk splitting with overlap
|
| 56 |
+
- [ ] Metadata generation helpers
|
| 57 |
+
- [ ] Chunk validation
|
| 58 |
+
|
| 59 |
+
### โณ 1.5 ChromaDB Manager
|
| 60 |
+
**File**: `core/chroma_manager.py`
|
| 61 |
+
- [ ] `ChromaDBManager` class
|
| 62 |
+
- [ ] Collection management (create, get, delete)
|
| 63 |
+
- [ ] Batch add operations
|
| 64 |
+
- [ ] Single/multi-collection search
|
| 65 |
+
- [ ] Statistics and reporting
|
| 66 |
+
- [ ] Connection pooling
|
| 67 |
+
|
| 68 |
+
---
|
| 69 |
+
|
| 70 |
+
## ๐ Phase 2: Data Processors
|
| 71 |
+
|
| 72 |
+
### โณ 2.1 Spell Parser
|
| 73 |
+
**File**: `parsers/spell_parser.py`
|
| 74 |
+
- [ ] Parse `spells.txt` (descriptions)
|
| 75 |
+
- [ ] Parse `all_spells.txt` (class/level info)
|
| 76 |
+
- [ ] Merge spell data
|
| 77 |
+
- [ ] Create spell chunks (full, quick_ref, by_class, by_level)
|
| 78 |
+
- [ ] Generate spell metadata
|
| 79 |
+
- [ ] Unit tests
|
| 80 |
+
|
| 81 |
+
### โณ 2.2 Monster Parser
|
| 82 |
+
**File**: `parsers/monster_parser.py`
|
| 83 |
+
- [ ] PDF extraction from Monster Manual
|
| 84 |
+
- [ ] Monster stat block parsing
|
| 85 |
+
- [ ] Combat stats extraction
|
| 86 |
+
- [ ] Special abilities parsing
|
| 87 |
+
- [ ] Create monster chunks (stats, combat, abilities, lore)
|
| 88 |
+
- [ ] Generate monster metadata
|
| 89 |
+
- [ ] Unit tests
|
| 90 |
+
|
| 91 |
+
### โณ 2.3 Class Parser
|
| 92 |
+
**File**: `parsers/class_parser.py`
|
| 93 |
+
- [ ] PDF extraction from Player's Handbook (pages 46-121)
|
| 94 |
+
- [ ] Class feature extraction by level
|
| 95 |
+
- [ ] Subclass parsing
|
| 96 |
+
- [ ] Proficiencies and equipment
|
| 97 |
+
- [ ] Create class chunks (overview, features, subclass)
|
| 98 |
+
- [ ] Generate class metadata
|
| 99 |
+
- [ ] Unit tests
|
| 100 |
+
|
| 101 |
+
### โณ 2.4 Race Parser
|
| 102 |
+
**File**: `parsers/race_parser.py`
|
| 103 |
+
- [ ] PDF extraction from Player's Handbook (pages 18-46)
|
| 104 |
+
- [ ] Race traits extraction
|
| 105 |
+
- [ ] Ability score bonuses
|
| 106 |
+
- [ ] Subrace parsing
|
| 107 |
+
- [ ] Create race chunks (traits, lore, subrace, quick_ref)
|
| 108 |
+
- [ ] Generate race metadata
|
| 109 |
+
- [ ] Unit tests
|
| 110 |
+
|
| 111 |
+
---
|
| 112 |
+
|
| 113 |
+
## ๐ Phase 3: Initialization System
|
| 114 |
+
|
| 115 |
+
### โณ 3.1 Master Init Script
|
| 116 |
+
**File**: `initialize_rag.py`
|
| 117 |
+
- [ ] Command-line argument parsing
|
| 118 |
+
- [ ] ChromaDB initialization
|
| 119 |
+
- [ ] Collection creation/verification
|
| 120 |
+
- [ ] Selective data loading (--only flag)
|
| 121 |
+
- [ ] Clear existing data (--clear flag)
|
| 122 |
+
- [ ] Progress reporting with progress bars
|
| 123 |
+
- [ ] Error handling and recovery
|
| 124 |
+
- [ ] Validation checks after loading
|
| 125 |
+
- [ ] Summary statistics report
|
| 126 |
+
- [ ] Save metadata JSON
|
| 127 |
+
|
| 128 |
+
### โณ 3.2 Data Migration
|
| 129 |
+
- [ ] Move source files to `data/` directory
|
| 130 |
+
- [ ] Verify all source files present
|
| 131 |
+
- [ ] Create data manifest file
|
| 132 |
+
- [ ] Test full initialization
|
| 133 |
+
- [ ] Benchmark loading times
|
| 134 |
+
|
| 135 |
+
---
|
| 136 |
+
|
| 137 |
+
## ๐ Phase 4: Query Interface
|
| 138 |
+
|
| 139 |
+
### โณ 4.1 Unified Query System
|
| 140 |
+
**File**: `systems/query_interface.py`
|
| 141 |
+
- [ ] `QueryRouter` class (entity recognition)
|
| 142 |
+
- [ ] `ResultAggregator` class (multi-collection search)
|
| 143 |
+
- [ ] `ContextBuilder` class (format for LLM)
|
| 144 |
+
- [ ] Entity extraction (spells, monsters, classes, races)
|
| 145 |
+
- [ ] Relevance scoring
|
| 146 |
+
- [ ] Result ranking
|
| 147 |
+
- [ ] Context assembly for prompts
|
| 148 |
+
- [ ] Query caching
|
| 149 |
+
- [ ] Unit tests
|
| 150 |
+
|
| 151 |
+
---
|
| 152 |
+
|
| 153 |
+
## ๐ฎ Phase 5: GM Dialogue System
|
| 154 |
+
|
| 155 |
+
### โณ 5.1 RAG-Enhanced GM
|
| 156 |
+
**File**: `systems/gm_dialogue.py`
|
| 157 |
+
- [ ] `EntityExtractor` component
|
| 158 |
+
- [ ] `RuleRetriever` component
|
| 159 |
+
- [ ] `PromptBuilder` component
|
| 160 |
+
- [ ] `OllamaClient` interface
|
| 161 |
+
- [ ] `ResponseFormatter` component
|
| 162 |
+
- [ ] Session state management
|
| 163 |
+
- [ ] Context window management
|
| 164 |
+
- [ ] Dice roll handling
|
| 165 |
+
- [ ] Integration tests
|
| 166 |
+
|
| 167 |
+
### โณ 5.2 Dialogue Manager
|
| 168 |
+
- [ ] Turn tracking
|
| 169 |
+
- [ ] Initiative order management
|
| 170 |
+
- [ ] Scene state persistence
|
| 171 |
+
- [ ] Character tracking
|
| 172 |
+
- [ ] Combat management helpers
|
| 173 |
+
|
| 174 |
+
---
|
| 175 |
+
|
| 176 |
+
## ๐ค Phase 6: Character Creation
|
| 177 |
+
|
| 178 |
+
### โณ 6.1 Character Creator
|
| 179 |
+
**File**: `systems/character_creator.py`
|
| 180 |
+
- [ ] Interactive CLI interface
|
| 181 |
+
- [ ] Race selection with RAG lookup
|
| 182 |
+
- [ ] Class selection with RAG lookup
|
| 183 |
+
- [ ] Ability score generation
|
| 184 |
+
- [ ] Background selection
|
| 185 |
+
- [ ] Equipment selection
|
| 186 |
+
- [ ] Spell selection (for casters)
|
| 187 |
+
- [ ] Character validation
|
| 188 |
+
- [ ] JSON export
|
| 189 |
+
- [ ] Character sheet display
|
| 190 |
+
|
| 191 |
+
### โณ 6.2 Character Management
|
| 192 |
+
- [ ] Save/load character files
|
| 193 |
+
- [ ] Character progression (leveling)
|
| 194 |
+
- [ ] Character sheet viewer
|
| 195 |
+
- [ ] Integration with GM dialogue system
|
| 196 |
+
|
| 197 |
+
---
|
| 198 |
+
|
| 199 |
+
## ๐ฆ Supporting Files
|
| 200 |
+
|
| 201 |
+
### โณ Dependencies
|
| 202 |
+
**File**: `requirements.txt`
|
| 203 |
+
- [ ] chromadb
|
| 204 |
+
- [ ] sentence-transformers
|
| 205 |
+
- [ ] pdfplumber
|
| 206 |
+
- [ ] ollama (Python client)
|
| 207 |
+
- [ ] rich (for CLI formatting)
|
| 208 |
+
- [ ] tqdm (for progress bars)
|
| 209 |
+
- [ ] pytest (for testing)
|
| 210 |
+
- [ ] Version pinning
|
| 211 |
+
|
| 212 |
+
### โณ Documentation
|
| 213 |
+
- [ ] README.md with installation instructions
|
| 214 |
+
- [ ] API documentation
|
| 215 |
+
- [ ] Usage examples
|
| 216 |
+
- [ ] Architecture diagram
|
| 217 |
+
|
| 218 |
+
---
|
| 219 |
+
|
| 220 |
+
## ๐งช Testing & Validation
|
| 221 |
+
|
| 222 |
+
### โณ Unit Tests
|
| 223 |
+
- [ ] Core infrastructure tests
|
| 224 |
+
- [ ] Parser tests
|
| 225 |
+
- [ ] Query interface tests
|
| 226 |
+
- [ ] Character creator tests
|
| 227 |
+
|
| 228 |
+
### โณ Integration Tests
|
| 229 |
+
- [ ] Full initialization test
|
| 230 |
+
- [ ] End-to-end query test
|
| 231 |
+
- [ ] GM dialogue scenario tests
|
| 232 |
+
- [ ] Character creation flow test
|
| 233 |
+
|
| 234 |
+
### โณ Performance Tests
|
| 235 |
+
- [ ] RAG query latency (<500ms target)
|
| 236 |
+
- [ ] Initialization time benchmarks
|
| 237 |
+
- [ ] Memory usage profiling
|
| 238 |
+
- [ ] Collection size validation
|
| 239 |
+
|
| 240 |
+
---
|
| 241 |
+
|
| 242 |
+
## ๐ฏ Success Metrics
|
| 243 |
+
|
| 244 |
+
| Metric | Target | Current | Status |
|
| 245 |
+
|--------|--------|---------|--------|
|
| 246 |
+
| Init Time (full) | < 5 min | - | โณ |
|
| 247 |
+
| Query Latency | < 500ms | - | โณ |
|
| 248 |
+
| Rule Accuracy | > 95% | - | โณ |
|
| 249 |
+
| Character Validity | 100% | - | โณ |
|
| 250 |
+
| Code Coverage | > 80% | - | โณ |
|
| 251 |
+
| Total Chunks | ~3500 | - | โณ |
|
| 252 |
+
|
| 253 |
+
---
|
| 254 |
+
|
| 255 |
+
## ๐ Notes & Decisions
|
| 256 |
+
|
| 257 |
+
### Design Decisions
|
| 258 |
+
- **Database**: ChromaDB for persistence and semantic search
|
| 259 |
+
- **Embeddings**: sentence-transformers/all-MiniLM-L6-v2 for speed/quality balance
|
| 260 |
+
- **LLM**: Ollama with Qwen3-4B-RPG-Roleplay-V2 for D&D-tuned responses
|
| 261 |
+
- **Collection Strategy**: Separate collections per content type for clean organization
|
| 262 |
+
|
| 263 |
+
### Known Issues
|
| 264 |
+
- None yet
|
| 265 |
+
|
| 266 |
+
### Future Enhancements
|
| 267 |
+
- Web UI for GM dialogue
|
| 268 |
+
- Multiplayer support
|
| 269 |
+
- Custom content import
|
| 270 |
+
- Voice interface
|
| 271 |
+
- Map/battle grid integration
|
| 272 |
+
|
| 273 |
+
---
|
| 274 |
+
|
| 275 |
+
## ๐
Timeline
|
| 276 |
+
|
| 277 |
+
| Date | Milestone |
|
| 278 |
+
|------|-----------|
|
| 279 |
+
| 2024-11-06 | Project started, directory structure created |
|
| 280 |
+
| TBD | Phase 1 complete |
|
| 281 |
+
| TBD | Phase 2 complete |
|
| 282 |
+
| TBD | Phase 3 complete |
|
| 283 |
+
| TBD | Phase 4 complete |
|
| 284 |
+
| TBD | Phase 5 complete |
|
| 285 |
+
| TBD | Phase 6 complete |
|
| 286 |
+
| TBD | **Project complete** |
|
| 287 |
+
|
| 288 |
+
---
|
| 289 |
+
|
| 290 |
+
**Last Updated**: November 6, 2024
|
|
@@ -0,0 +1,29 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
# D&D RAG System Dependencies
|
| 2 |
+
# Installation: pip install -r requirements.txt
|
| 3 |
+
|
| 4 |
+
# Core dependencies
|
| 5 |
+
chromadb>=0.4.18
|
| 6 |
+
sentence-transformers>=2.2.0
|
| 7 |
+
pdfplumber>=0.10.0
|
| 8 |
+
|
| 9 |
+
# Ollama Python client
|
| 10 |
+
ollama>=0.1.0
|
| 11 |
+
|
| 12 |
+
# Rich console output
|
| 13 |
+
rich>=13.0.0
|
| 14 |
+
|
| 15 |
+
# Progress bars
|
| 16 |
+
tqdm>=4.66.0
|
| 17 |
+
|
| 18 |
+
# Testing
|
| 19 |
+
pytest>=7.4.0
|
| 20 |
+
pytest-cov>=4.1.0
|
| 21 |
+
|
| 22 |
+
# Optional: For better NLP processing
|
| 23 |
+
# nltk>=3.8.0
|
| 24 |
+
|
| 25 |
+
# Standard library enhancements
|
| 26 |
+
python-dotenv>=1.0.0
|
| 27 |
+
|
| 28 |
+
# Data handling
|
| 29 |
+
dataclasses-json>=0.6.0
|
|
@@ -0,0 +1,187 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
#!/usr/bin/env python3
|
| 2 |
+
"""
|
| 3 |
+
Test RAG Search Functionality
|
| 4 |
+
|
| 5 |
+
Tests that spells, monsters, classes, and races can be found via semantic search.
|
| 6 |
+
"""
|
| 7 |
+
|
| 8 |
+
import sys
|
| 9 |
+
from pathlib import Path
|
| 10 |
+
|
| 11 |
+
# Add project to path
|
| 12 |
+
sys.path.insert(0, str(Path(__file__).parent))
|
| 13 |
+
|
| 14 |
+
from dnd_rag_system.core.chroma_manager import ChromaDBManager
|
| 15 |
+
from dnd_rag_system.config import settings
|
| 16 |
+
|
| 17 |
+
|
| 18 |
+
def test_spell_search(db: ChromaDBManager):
|
| 19 |
+
"""Test spell searches."""
|
| 20 |
+
print("\n" + "="*70)
|
| 21 |
+
print("๐ฎ TESTING SPELL SEARCHES")
|
| 22 |
+
print("="*70)
|
| 23 |
+
|
| 24 |
+
test_queries = [
|
| 25 |
+
"fireball spell",
|
| 26 |
+
"healing magic",
|
| 27 |
+
"wizard cantrip",
|
| 28 |
+
"magic missile damage",
|
| 29 |
+
"cure wounds"
|
| 30 |
+
]
|
| 31 |
+
|
| 32 |
+
for query in test_queries:
|
| 33 |
+
print(f"\n๐ Query: '{query}'")
|
| 34 |
+
results = db.search(settings.COLLECTION_NAMES['spells'], query, n_results=3)
|
| 35 |
+
|
| 36 |
+
if results['documents'] and results['documents'][0]:
|
| 37 |
+
print(f"โ Found {len(results['documents'][0])} results")
|
| 38 |
+
# Show top result
|
| 39 |
+
top_doc = results['documents'][0][0]
|
| 40 |
+
top_meta = results['metadatas'][0][0]
|
| 41 |
+
distance = results['distances'][0][0]
|
| 42 |
+
|
| 43 |
+
print(f" Top result: {top_meta.get('name', 'Unknown')}")
|
| 44 |
+
print(f" Distance: {distance:.3f}")
|
| 45 |
+
print(f" Preview: {top_doc[:100]}...")
|
| 46 |
+
else:
|
| 47 |
+
print("โ No results found")
|
| 48 |
+
|
| 49 |
+
|
| 50 |
+
def test_monster_search(db: ChromaDBManager):
|
| 51 |
+
"""Test monster searches."""
|
| 52 |
+
print("\n" + "="*70)
|
| 53 |
+
print("๐น TESTING MONSTER SEARCHES")
|
| 54 |
+
print("="*70)
|
| 55 |
+
|
| 56 |
+
test_queries = [
|
| 57 |
+
"goblin",
|
| 58 |
+
"dragon fire breath",
|
| 59 |
+
"undead creature",
|
| 60 |
+
"challenge rating 5",
|
| 61 |
+
"orc warrior"
|
| 62 |
+
]
|
| 63 |
+
|
| 64 |
+
for query in test_queries:
|
| 65 |
+
print(f"\n๐ Query: '{query}'")
|
| 66 |
+
results = db.search(settings.COLLECTION_NAMES['monsters'], query, n_results=3)
|
| 67 |
+
|
| 68 |
+
if results['documents'] and results['documents'][0]:
|
| 69 |
+
print(f"โ Found {len(results['documents'][0])} results")
|
| 70 |
+
# Show top result
|
| 71 |
+
top_doc = results['documents'][0][0]
|
| 72 |
+
top_meta = results['metadatas'][0][0]
|
| 73 |
+
distance = results['distances'][0][0]
|
| 74 |
+
|
| 75 |
+
print(f" Top result: {top_meta.get('name', 'Unknown')}")
|
| 76 |
+
print(f" CR: {top_meta.get('challenge_rating', 'Unknown')}")
|
| 77 |
+
print(f" Distance: {distance:.3f}")
|
| 78 |
+
print(f" Preview: {top_doc[:100]}...")
|
| 79 |
+
else:
|
| 80 |
+
print("โ No results found")
|
| 81 |
+
|
| 82 |
+
|
| 83 |
+
def test_class_search(db: ChromaDBManager):
|
| 84 |
+
"""Test class searches."""
|
| 85 |
+
print("\n" + "="*70)
|
| 86 |
+
print("โ๏ธ TESTING CLASS SEARCHES")
|
| 87 |
+
print("="*70)
|
| 88 |
+
|
| 89 |
+
test_queries = [
|
| 90 |
+
"wizard spellcasting",
|
| 91 |
+
"fighter extra attack",
|
| 92 |
+
"rogue sneak attack",
|
| 93 |
+
"barbarian rage",
|
| 94 |
+
"cleric healing"
|
| 95 |
+
]
|
| 96 |
+
|
| 97 |
+
for query in test_queries:
|
| 98 |
+
print(f"\n๐ Query: '{query}'")
|
| 99 |
+
results = db.search(settings.COLLECTION_NAMES['classes'], query, n_results=3)
|
| 100 |
+
|
| 101 |
+
if results['documents'] and results['documents'][0]:
|
| 102 |
+
print(f"โ Found {len(results['documents'][0])} results")
|
| 103 |
+
# Show top result
|
| 104 |
+
top_doc = results['documents'][0][0]
|
| 105 |
+
top_meta = results['metadatas'][0][0]
|
| 106 |
+
distance = results['distances'][0][0]
|
| 107 |
+
|
| 108 |
+
print(f" Top result: {top_meta.get('name', 'Unknown')}")
|
| 109 |
+
print(f" Distance: {distance:.3f}")
|
| 110 |
+
print(f" Preview: {top_doc[:100]}...")
|
| 111 |
+
else:
|
| 112 |
+
print("โ No results found")
|
| 113 |
+
|
| 114 |
+
|
| 115 |
+
def test_cross_collection_search(db: ChromaDBManager):
|
| 116 |
+
"""Test searching across multiple collections."""
|
| 117 |
+
print("\n" + "="*70)
|
| 118 |
+
print("๐ TESTING CROSS-COLLECTION SEARCH")
|
| 119 |
+
print("="*70)
|
| 120 |
+
|
| 121 |
+
query = "fire damage"
|
| 122 |
+
print(f"\nQuery: '{query}' (searching all collections)")
|
| 123 |
+
|
| 124 |
+
results = db.search_all(query, n_results_per_collection=2)
|
| 125 |
+
|
| 126 |
+
for collection_name, col_results in results.items():
|
| 127 |
+
if col_results['documents'] and col_results['documents'][0]:
|
| 128 |
+
print(f"\n {collection_name}:")
|
| 129 |
+
for doc, meta in zip(col_results['documents'][0], col_results['metadatas'][0]):
|
| 130 |
+
print(f" - {meta.get('name', 'Unknown')}")
|
| 131 |
+
|
| 132 |
+
|
| 133 |
+
def test_stats(db: ChromaDBManager):
|
| 134 |
+
"""Show collection statistics."""
|
| 135 |
+
print("\n" + "="*70)
|
| 136 |
+
print("๐ COLLECTION STATISTICS")
|
| 137 |
+
print("="*70)
|
| 138 |
+
|
| 139 |
+
stats = db.get_all_stats()
|
| 140 |
+
|
| 141 |
+
print(f"\nTotal documents: {stats['total_documents']}")
|
| 142 |
+
print(f"Database: {stats['persist_dir']}")
|
| 143 |
+
print(f"Embedding model: {stats['embedding_model']}")
|
| 144 |
+
|
| 145 |
+
print("\nCollections:")
|
| 146 |
+
for collection_name, col_stats in stats['collections'].items():
|
| 147 |
+
total = col_stats.get('total_documents', 0)
|
| 148 |
+
print(f" {collection_name}: {total} documents")
|
| 149 |
+
|
| 150 |
+
if 'chunk_types' in col_stats:
|
| 151 |
+
for chunk_type, count in col_stats['chunk_types'].items():
|
| 152 |
+
print(f" - {chunk_type}: {count}")
|
| 153 |
+
|
| 154 |
+
|
| 155 |
+
def main():
|
| 156 |
+
"""Run all tests."""
|
| 157 |
+
print("\n" + "="*70)
|
| 158 |
+
print("๐งช D&D RAG SEARCH TEST SUITE")
|
| 159 |
+
print("="*70)
|
| 160 |
+
|
| 161 |
+
# Initialize database connection
|
| 162 |
+
print("\n๐ง Connecting to ChromaDB...")
|
| 163 |
+
db = ChromaDBManager()
|
| 164 |
+
|
| 165 |
+
# Run tests
|
| 166 |
+
try:
|
| 167 |
+
test_stats(db)
|
| 168 |
+
test_spell_search(db)
|
| 169 |
+
test_monster_search(db)
|
| 170 |
+
test_class_search(db)
|
| 171 |
+
test_cross_collection_search(db)
|
| 172 |
+
|
| 173 |
+
print("\n" + "="*70)
|
| 174 |
+
print("โ
TEST SUITE COMPLETE")
|
| 175 |
+
print("="*70)
|
| 176 |
+
|
| 177 |
+
except Exception as e:
|
| 178 |
+
print(f"\nโ Test failed: {e}")
|
| 179 |
+
import traceback
|
| 180 |
+
traceback.print_exc()
|
| 181 |
+
return 1
|
| 182 |
+
|
| 183 |
+
return 0
|
| 184 |
+
|
| 185 |
+
|
| 186 |
+
if __name__ == '__main__':
|
| 187 |
+
sys.exit(main())
|