abc123 / crossword-app /backend-py /CROSSWORD_GENERATION_WALKTHROUGH.md
vimalk78's picture
feat(crossword): generated crosswords with clues
486eff6
# Crossword Generation Code Walkthrough
This document provides a detailed, line-by-line walkthrough of how crossword puzzle generation works in the Python backend, tracing the complete flow from API request to finalized words and clues.
## Overview
The Python backend implements AI-powered crossword generation using:
- **FastAPI** for web framework and API endpoints
- **Vector similarity search** with sentence-transformers and FAISS for intelligent word discovery
- **Backtracking algorithm** for word placement in the crossword grid
- **Multi-layer caching system** for performance and fallback mechanisms
## 1. Application Startup (`app.py`)
### Entry Point and Service Initialization
```python
# Lines 66-95: Async lifespan context manager
@asynccontextmanager
async def lifespan(app: FastAPI):
global vector_service
# Startup initialization
vector_service = VectorSearchService()
await vector_service.initialize()
app.state.vector_service = vector_service
```
**Flow:**
1. FastAPI application starts with lifespan context manager
2. Creates global `VectorSearchService` instance (line 79)
3. Calls `vector_service.initialize()` (line 82) which:
- Loads sentence-transformer model (~30-60 seconds)
- Builds or loads cached FAISS index for word embeddings
- Initializes word cache manager for fallbacks
4. Makes vector service available to all API routes via `app.state`
## 2. API Request Handling (`src/routes/api.py`)
### Crossword Generation Endpoint
```python
# Lines 77-118: Main crossword generation endpoint
@router.post("/generate", response_model=PuzzleResponse)
async def generate_puzzle(
request: GeneratePuzzleRequest,
crossword_gen: CrosswordGenerator = Depends(get_crossword_generator)
):
```
**Flow:**
1. Client POST request to `/api/generate` with JSON body:
```json
{
"topics": ["animals", "science"],
"difficulty": "medium",
}
```
2. FastAPI validates request against `GeneratePuzzleRequest` model (lines 20-23)
3. Dependency injection calls `get_crossword_generator()` (lines 57-63):
```python
def get_crossword_generator(request: Request) -> CrosswordGenerator:
global generator
if generator is None:
vector_service = getattr(request.app.state, 'vector_service', None)
generator = CrosswordGenerator(vector_service)
return generator
```
4. Creates or reuses `CrosswordGenerator` wrapper with vector service reference
## 3. Crossword Generation Wrapper (`src/services/crossword_generator_wrapper.py`)
### Simple Delegation Layer
```python
# Lines 20-51: Generate puzzle method
async def generate_puzzle(
self,
topics: List[str],
difficulty: str = "medium",
) -> Dict[str, Any]:
```
**Flow:**
1. Wrapper receives request from API route
2. **Line 41**: Imports actual generator to avoid circular imports:
```python
from .crossword_generator import CrosswordGenerator as ActualGenerator
```
3. **Line 42**: Creates actual generator instance with vector service
4. **Line 44**: Delegates to actual generator's `generate_puzzle()` method
## 4. Core Crossword Generation (`src/services/crossword_generator.py`)
### 4.1 Main Generation Method
```python
# Lines 22-66: Core puzzle generation
async def generate_puzzle(self, topics: List[str], difficulty: str = "medium", = False):
```
**Key steps:**
1. **Line 37**: Select words using AI or static sources:
```python
words = await self._select_words(topics, difficulty, use_ai)
```
2. **Line 44**: Create crossword grid from selected words:
```python
grid_result = self._create_grid(words)
```
3. **Lines 52-62**: Return structured response with grid, clues, and metadata
### 4.2 Word Selection Process
```python
# Lines 68-97: Word selection logic
async def _select_words(self, topics: List[str], difficulty: str, ):
```
**Flow branches based on `use_ai` parameter:**
#### AI-Powered Word Selection (Lines 72-83):
```python
if use_ai and self.vector_service:
for topic in topics:
ai_words = await self.vector_service.find_similar_words(
topic, difficulty, self.max_words // len(topics)
)
all_words.extend(ai_words)
```
#### Fallback to Cached Words (Lines 86-91):
```python
if self.vector_service:
for topic in topics:
cached_words = await self.vector_service._get_cached_fallback(
topic, difficulty, self.max_words // len(topics)
)
```
#### Final Fallback to Static JSON Files (Lines 93-95):
```python
else:
all_words = await self._get_static_words(topics, difficulty)
```
### 4.3 Word Sorting for Crossword Viability
```python
# Lines 129-168: Sort words by crossword suitability
def _sort_words_for_crossword(self, words: List[Dict[str, Any]]):
```
**Scoring algorithm:**
- **Lines 138-147**: Length-based scoring (shorter words preferred)
- **Lines 150-153**: Common letter bonus (E, A, R, I, O, T, N, S)
- **Lines 156-158**: Vowel distribution bonus
- **Lines 161-162**: Penalty for very long words
## 5. AI Word Discovery (`src/services/vector_search.py`)
### 5.1 Vector Search Initialization
```python
# Lines 71-143: Service initialization
async def initialize(self):
```
**Initialization flow:**
1. **Lines 90-92**: Load sentence-transformer model
2. **Lines 95-119**: Build or load cached FAISS index
3. **Lines 124-134**: Initialize word cache manager
### 5.2 Core Word Finding Algorithm
```python
# Lines 279-374: Main word discovery method
async def find_similar_words(
self,
topic: str,
difficulty: str = "medium",
max_words: int = 15
):
```
**Search strategy branches:**
#### Hierarchical Search (Lines 296-325):
```python
if self.use_hierarchical_search:
all_candidates = await self._hierarchical_search(topic, difficulty, max_words)
combined_results = self._combine_hierarchical_results(all_candidates, max_words * 2)
```
#### Traditional Single Search (Lines 328-337):
```python
else:
traditional_results = await self._traditional_single_search(topic, difficulty, max_words * 2)
```
### 5.3 Hierarchical Search Process
```python
# Lines 639-748: Multi-phase hierarchical search
async def _hierarchical_search(self, topic: str, difficulty: str, max_words: int):
```
**Three-phase approach:**
#### Phase 1: Topic Variations (Lines 652-694)
```python
topic_variations = self._expand_topic_variations(topic) # "Animal" β†’ ["Animal", "Animals"]
for variation in topic_variations:
topic_embedding = self.model.encode([variation], convert_to_numpy=True)
scores, indices = self.faiss_index.search(topic_embedding, search_size)
variation_candidates = self._collect_candidates_with_threshold(scores, indices, threshold, variation, difficulty)
```
#### Phase 2: Subcategory Identification (Lines 697-700)
```python
subcategories = self._identify_subcategories(main_topic_candidates, topic)
```
#### Phase 3: Subcategory Search (Lines 703-733)
```python
for subcategory in subcategories:
subcat_embedding = self.model.encode([subcategory], convert_to_numpy=True)
sub_scores, sub_indices = self.faiss_index.search(subcat_embedding, sub_search_size)
```
### 5.4 Word Quality Filtering
```python
# Lines 1164-1215: Candidate collection with filtering
def _collect_candidates_with_threshold(self, scores, indices, threshold, topic, difficulty):
```
**Multi-stage filtering:**
1. **Line 1176**: Similarity threshold check
2. **Line 1183**: Difficulty matching (word length)
3. **Line 1185**: Interest and topic relevance check:
```python
if self._is_interesting_word(word, topic) and self._is_topic_relevant(word, topic):
```
## 6. Grid Creation and Word Placement
### 6.1 Grid Generation Entry Point
```python
# Lines 170-243: Main grid creation method
def _create_grid(self, words: List[Dict[str, Any]]):
```
**Flow:**
1. **Lines 184-203**: Process and sort words by length (longest first)
2. **Lines 209-213**: Calculate appropriate grid size
3. **Lines 212-237**: Multiple placement attempts with increasing grid size
4. **Lines 240-241**: Fallback to simple two-word cross
### 6.2 Word Placement Algorithm
```python
# Lines 259-295: Backtracking word placement
def _place_words_in_grid(self, word_list: List[str], word_objs: List[Dict[str, Any]], size: int):
```
**Setup:**
- **Line 263**: Initialize empty grid with dots
- **Line 270**: Call recursive backtracking algorithm
- **Lines 272-287**: Generate clues and assign crossword numbers
### 6.3 Backtracking Algorithm
```python
# Lines 297-357: Recursive backtracking placement
def _backtrack_placement(self, grid, word_list, word_objs, word_index, placed_words, start_time, timeout):
```
**Algorithm flow:**
#### Base Cases:
- **Lines 302-303**: Timeout check every 50 calls
- **Lines 305-306**: Success when all words placed
#### First Word Placement (Lines 312-332):
```python
if word_index == 0:
center_row = size // 2
center_col = (size - len(word)) // 2
if self._can_place_word(grid, word, center_row, center_col, "horizontal"):
original_state = self._place_word(grid, word, center_row, center_col, "horizontal")
```
#### Subsequent Word Placement (Lines 334-356):
```python
all_placements = self._find_all_intersection_placements(grid, word, placed_words)
all_placements.sort(key=lambda p: p["score"], reverse=True)
for placement in all_placements:
if self._can_place_word(grid, word, row, col, direction):
# Try placement and recurse
if self._backtrack_placement(...):
return True
# Backtrack if failed
self._remove_word(grid, original_state)
```
### 6.4 Word Placement Validation
```python
# Lines 359-417: Comprehensive placement validation
def _can_place_word(self, grid: List[List[str]], word: str, row: int, col: int, direction: str):
```
**Critical validation checks:**
1. **Lines 364-365**: Boundary checks
2. **Lines 372-375**: Word boundary enforcement (no adjacent letters)
3. **Lines 378-390**: Letter-by-letter placement validation
4. **Lines 388-390**: Perpendicular intersection validation
### 6.5 Intersection Finding
```python
# Lines 486-505: Find all possible intersections
def _find_all_intersection_placements(self, grid, word, placed_words):
```
**Process:**
1. **Lines 491-502**: For each placed word, find letter intersections
2. **Lines 496-502**: Calculate placement position for each intersection
3. **Lines 499-502**: Score placement quality
## 7. Clue Generation and Final Assembly
### 7.1 Grid Trimming and Optimization
```python
# Lines 589-642: Remove excess empty space
def _trim_grid(self, grid, placed_words):
```
**Process:**
1. **Lines 595-610**: Find bounding box of all placed words
2. **Lines 612-631**: Create trimmed grid with padding
3. **Lines 634-641**: Update word positions relative to new grid
### 7.2 Crossword Numbering
```python
# Lines 698-750: Assign proper crossword numbers and create clues
def _assign_numbers_and_clues(self, placed_words, clues_data):
```
**Crossword numbering rules:**
1. **Lines 710-714**: Group words by starting position
2. **Lines 716**: Sort by reading order (top-to-bottom, left-to-right)
3. **Lines 725-749**: Assign shared numbers for words starting at same position
4. **Lines 738-745**: Create clue objects with proper formatting
### 7.3 Final Response Assembly
**Back in `crossword_generator.py` lines 52-62:**
```python
return {
"grid": grid_result["grid"],
"clues": grid_result["clues"],
"metadata": {
"topics": topics,
"difficulty": difficulty,
"wordCount": len(grid_result["placed_words"]),
"size": len(grid_result["grid"]),
"aiGenerated": }
}
```
## 8. Caching System (`src/services/word_cache.py`)
### 8.1 Cache Initialization
```python
# Lines 75-113: Load existing cache files
async def initialize(self):
```
**Process:**
1. **Lines 86-108**: Load all `.json` cache files from disk
2. **Lines 99-102**: Validate cache structure and load into memory
3. **Lines 110**: Report loaded cache statistics
### 8.2 Word Caching
```python
# Lines 166-224: Cache successful word discoveries
async def cache_words(self, topic, difficulty, words, source="vector_search"):
```
**Storage process:**
1. **Lines 186-193**: Enhance words with caching metadata
2. **Lines 196-207**: Create structured cache data with expiration
3. **Lines 210-213**: Save to disk (if permissions allow)
4. **Lines 216-217**: Update in-memory cache
## Complete Data Flow Summary
1. **API Request** β†’ `/api/generate` with topics, difficulty, 2. **Route Handler** β†’ Validates request, injects dependencies
3. **Wrapper** β†’ Delegates to actual generator with vector service
4. **Word Selection** β†’ AI vector search OR cached words OR static JSON fallback
5. **Vector Search** (if AI enabled):
- Load sentence-transformer model
- Perform hierarchical semantic search
- Filter by similarity threshold, difficulty, relevance
- Apply word exclusions and variety filtering
6. **Grid Creation**:
- Sort words by crossword viability
- Calculate appropriate grid size
- Use backtracking algorithm to place words
- Validate word boundaries and intersections
7. **Grid Optimization**:
- Trim excess empty space
- Assign proper crossword numbers
- Generate clue objects
8. **Response Assembly** β†’ Return grid, clues, and metadata
9. **Caching** β†’ Store successful AI discoveries for future use
The system gracefully degrades from AI β†’ cached words β†’ static words, ensuring crossword generation always succeeds even when AI components fail.