Spaces:

vimalk78
/

abc123

Sleeping

App Files Files Community

abc123 / crossword-app /backend-py /CROSSWORD_GENERATION_WALKTHROUGH.md

vimalk78

feat(crossword): generated crosswords with clues

486eff6 4 months ago

preview code

raw

history blame contribute delete

13.6 kB

	# Crossword Generation Code Walkthrough

	This document provides a detailed, line-by-line walkthrough of how crossword puzzle generation works in the Python backend, tracing the complete flow from API request to finalized words and clues.

	## Overview

	The Python backend implements AI-powered crossword generation using:
	- FastAPI for web framework and API endpoints
	- Vector similarity search with sentence-transformers and FAISS for intelligent word discovery
	- Backtracking algorithm for word placement in the crossword grid
	- Multi-layer caching system for performance and fallback mechanisms

	## 1. Application Startup (`app.py`)

	### Entry Point and Service Initialization

	```python
	# Lines 66-95: Async lifespan context manager
	@asynccontextmanager
	async def lifespan(app: FastAPI):
	global vector_service

	# Startup initialization
	vector_service = VectorSearchService()
	await vector_service.initialize()
	app.state.vector_service = vector_service
	```

	Flow:
	1. FastAPI application starts with lifespan context manager
	2. Creates global `VectorSearchService` instance (line 79)
	3. Calls `vector_service.initialize()` (line 82) which:
	- Loads sentence-transformer model (~30-60 seconds)
	- Builds or loads cached FAISS index for word embeddings
	- Initializes word cache manager for fallbacks
	4. Makes vector service available to all API routes via `app.state`

	## 2. API Request Handling (`src/routes/api.py`)

	### Crossword Generation Endpoint

	```python
	# Lines 77-118: Main crossword generation endpoint
	@router.post("/generate", response_model=PuzzleResponse)
	async def generate_puzzle(
	request: GeneratePuzzleRequest,
	crossword_gen: CrosswordGenerator = Depends(get_crossword_generator)
	):
	```

	Flow:
	1. Client POST request to `/api/generate` with JSON body:
	```json
	{
	"topics": ["animals", "science"],
	"difficulty": "medium",
	}
	```

	2. FastAPI validates request against `GeneratePuzzleRequest` model (lines 20-23)

	3. Dependency injection calls `get_crossword_generator()` (lines 57-63):
	```python
	def get_crossword_generator(request: Request) -> CrosswordGenerator:
	global generator
	if generator is None:
	vector_service = getattr(request.app.state, 'vector_service', None)
	generator = CrosswordGenerator(vector_service)
	return generator
	```

	4. Creates or reuses `CrosswordGenerator` wrapper with vector service reference

	## 3. Crossword Generation Wrapper (`src/services/crossword_generator_wrapper.py`)

	### Simple Delegation Layer

	```python
	# Lines 20-51: Generate puzzle method
	async def generate_puzzle(
	self,
	topics: List[str],
	difficulty: str = "medium",
	) -> Dict[str, Any]:
	```

	Flow:
	1. Wrapper receives request from API route
	2. Line 41: Imports actual generator to avoid circular imports:
	```python
	from .crossword_generator import CrosswordGenerator as ActualGenerator
	```
	3. Line 42: Creates actual generator instance with vector service
	4. Line 44: Delegates to actual generator's `generate_puzzle()` method

	## 4. Core Crossword Generation (`src/services/crossword_generator.py`)

	### 4.1 Main Generation Method

	```python
	# Lines 22-66: Core puzzle generation
	async def generate_puzzle(self, topics: List[str], difficulty: str = "medium", = False):
	```

	Key steps:

	1. Line 37: Select words using AI or static sources:
	```python
	words = await self._select_words(topics, difficulty, use_ai)
	```

	2. Line 44: Create crossword grid from selected words:
	```python
	grid_result = self._create_grid(words)
	```

	3. Lines 52-62: Return structured response with grid, clues, and metadata

	### 4.2 Word Selection Process

	```python
	# Lines 68-97: Word selection logic
	async def _select_words(self, topics: List[str], difficulty: str, ):
	```

	Flow branches based on `use_ai` parameter:

	#### AI-Powered Word Selection (Lines 72-83):
	```python
	if use_ai and self.vector_service:
	for topic in topics:
	ai_words = await self.vector_service.find_similar_words(
	topic, difficulty, self.max_words // len(topics)
	)
	all_words.extend(ai_words)
	```

	#### Fallback to Cached Words (Lines 86-91):
	```python
	if self.vector_service:
	for topic in topics:
	cached_words = await self.vector_service._get_cached_fallback(
	topic, difficulty, self.max_words // len(topics)
	)
	```

	#### Final Fallback to Static JSON Files (Lines 93-95):
	```python
	else:
	all_words = await self._get_static_words(topics, difficulty)
	```

	### 4.3 Word Sorting for Crossword Viability

	```python
	# Lines 129-168: Sort words by crossword suitability
	def _sort_words_for_crossword(self, words: List[Dict[str, Any]]):
	```

	Scoring algorithm:
	- Lines 138-147: Length-based scoring (shorter words preferred)
	- Lines 150-153: Common letter bonus (E, A, R, I, O, T, N, S)
	- Lines 156-158: Vowel distribution bonus
	- Lines 161-162: Penalty for very long words

	## 5. AI Word Discovery (`src/services/vector_search.py`)

	### 5.1 Vector Search Initialization

	```python
	# Lines 71-143: Service initialization
	async def initialize(self):
	```

	Initialization flow:
	1. Lines 90-92: Load sentence-transformer model
	2. Lines 95-119: Build or load cached FAISS index
	3. Lines 124-134: Initialize word cache manager

	### 5.2 Core Word Finding Algorithm

	```python
	# Lines 279-374: Main word discovery method
	async def find_similar_words(
	self,
	topic: str,
	difficulty: str = "medium",
	max_words: int = 15
	):
	```

	Search strategy branches:

	#### Hierarchical Search (Lines 296-325):
	```python
	if self.use_hierarchical_search:
	all_candidates = await self._hierarchical_search(topic, difficulty, max_words)
	combined_results = self._combine_hierarchical_results(all_candidates, max_words * 2)
	```

	#### Traditional Single Search (Lines 328-337):
	```python
	else:
	traditional_results = await self._traditional_single_search(topic, difficulty, max_words * 2)
	```

	### 5.3 Hierarchical Search Process

	```python
	# Lines 639-748: Multi-phase hierarchical search
	async def _hierarchical_search(self, topic: str, difficulty: str, max_words: int):
	```

	Three-phase approach:

	#### Phase 1: Topic Variations (Lines 652-694)
	```python
	topic_variations = self._expand_topic_variations(topic) # "Animal" → ["Animal", "Animals"]

	for variation in topic_variations:
	topic_embedding = self.model.encode([variation], convert_to_numpy=True)
	scores, indices = self.faiss_index.search(topic_embedding, search_size)
	variation_candidates = self._collect_candidates_with_threshold(scores, indices, threshold, variation, difficulty)
	```

	#### Phase 2: Subcategory Identification (Lines 697-700)
	```python
	subcategories = self._identify_subcategories(main_topic_candidates, topic)
	```

	#### Phase 3: Subcategory Search (Lines 703-733)
	```python
	for subcategory in subcategories:
	subcat_embedding = self.model.encode([subcategory], convert_to_numpy=True)
	sub_scores, sub_indices = self.faiss_index.search(subcat_embedding, sub_search_size)
	```

	### 5.4 Word Quality Filtering

	```python
	# Lines 1164-1215: Candidate collection with filtering
	def _collect_candidates_with_threshold(self, scores, indices, threshold, topic, difficulty):
	```

	Multi-stage filtering:
	1. Line 1176: Similarity threshold check
	2. Line 1183: Difficulty matching (word length)
	3. Line 1185: Interest and topic relevance check:
	```python
	if self._is_interesting_word(word, topic) and self._is_topic_relevant(word, topic):
	```

	## 6. Grid Creation and Word Placement

	### 6.1 Grid Generation Entry Point

	```python
	# Lines 170-243: Main grid creation method
	def _create_grid(self, words: List[Dict[str, Any]]):
	```

	Flow:
	1. Lines 184-203: Process and sort words by length (longest first)
	2. Lines 209-213: Calculate appropriate grid size
	3. Lines 212-237: Multiple placement attempts with increasing grid size
	4. Lines 240-241: Fallback to simple two-word cross

	### 6.2 Word Placement Algorithm

	```python
	# Lines 259-295: Backtracking word placement
	def _place_words_in_grid(self, word_list: List[str], word_objs: List[Dict[str, Any]], size: int):
	```

	Setup:
	- Line 263: Initialize empty grid with dots
	- Line 270: Call recursive backtracking algorithm
	- Lines 272-287: Generate clues and assign crossword numbers

	### 6.3 Backtracking Algorithm

	```python
	# Lines 297-357: Recursive backtracking placement
	def _backtrack_placement(self, grid, word_list, word_objs, word_index, placed_words, start_time, timeout):
	```

	Algorithm flow:

	#### Base Cases:
	- Lines 302-303: Timeout check every 50 calls
	- Lines 305-306: Success when all words placed

	#### First Word Placement (Lines 312-332):
	```python
	if word_index == 0:
	center_row = size // 2
	center_col = (size - len(word)) // 2

	if self._can_place_word(grid, word, center_row, center_col, "horizontal"):
	original_state = self._place_word(grid, word, center_row, center_col, "horizontal")
	```

	#### Subsequent Word Placement (Lines 334-356):
	```python
	all_placements = self._find_all_intersection_placements(grid, word, placed_words)
	all_placements.sort(key=lambda p: p["score"], reverse=True)

	for placement in all_placements:
	if self._can_place_word(grid, word, row, col, direction):
	# Try placement and recurse
	if self._backtrack_placement(...):
	return True
	# Backtrack if failed
	self._remove_word(grid, original_state)
	```

	### 6.4 Word Placement Validation

	```python
	# Lines 359-417: Comprehensive placement validation
	def _can_place_word(self, grid: List[List[str]], word: str, row: int, col: int, direction: str):
	```

	Critical validation checks:
	1. Lines 364-365: Boundary checks
	2. Lines 372-375: Word boundary enforcement (no adjacent letters)
	3. Lines 378-390: Letter-by-letter placement validation
	4. Lines 388-390: Perpendicular intersection validation

	### 6.5 Intersection Finding

	```python
	# Lines 486-505: Find all possible intersections
	def _find_all_intersection_placements(self, grid, word, placed_words):
	```

	Process:
	1. Lines 491-502: For each placed word, find letter intersections
	2. Lines 496-502: Calculate placement position for each intersection
	3. Lines 499-502: Score placement quality

	## 7. Clue Generation and Final Assembly

	### 7.1 Grid Trimming and Optimization

	```python
	# Lines 589-642: Remove excess empty space
	def _trim_grid(self, grid, placed_words):
	```

	Process:
	1. Lines 595-610: Find bounding box of all placed words
	2. Lines 612-631: Create trimmed grid with padding
	3. Lines 634-641: Update word positions relative to new grid

	### 7.2 Crossword Numbering

	```python
	# Lines 698-750: Assign proper crossword numbers and create clues
	def _assign_numbers_and_clues(self, placed_words, clues_data):
	```

	Crossword numbering rules:
	1. Lines 710-714: Group words by starting position
	2. Lines 716: Sort by reading order (top-to-bottom, left-to-right)
	3. Lines 725-749: Assign shared numbers for words starting at same position
	4. Lines 738-745: Create clue objects with proper formatting

	### 7.3 Final Response Assembly

	Back in `crossword_generator.py` lines 52-62:
	```python
	return {
	"grid": grid_result["grid"],
	"clues": grid_result["clues"],
	"metadata": {
	"topics": topics,
	"difficulty": difficulty,
	"wordCount": len(grid_result["placed_words"]),
	"size": len(grid_result["grid"]),
	"aiGenerated": }
	}
	```

	## 8. Caching System (`src/services/word_cache.py`)

	### 8.1 Cache Initialization

	```python
	# Lines 75-113: Load existing cache files
	async def initialize(self):
	```

	Process:
	1. Lines 86-108: Load all `.json` cache files from disk
	2. Lines 99-102: Validate cache structure and load into memory
	3. Lines 110: Report loaded cache statistics

	### 8.2 Word Caching

	```python
	# Lines 166-224: Cache successful word discoveries
	async def cache_words(self, topic, difficulty, words, source="vector_search"):
	```

	Storage process:
	1. Lines 186-193: Enhance words with caching metadata
	2. Lines 196-207: Create structured cache data with expiration
	3. Lines 210-213: Save to disk (if permissions allow)
	4. Lines 216-217: Update in-memory cache

	## Complete Data Flow Summary

	1. API Request → `/api/generate` with topics, difficulty, 2. Route Handler → Validates request, injects dependencies
	3. Wrapper → Delegates to actual generator with vector service
	4. Word Selection → AI vector search OR cached words OR static JSON fallback
	5. Vector Search (if AI enabled):
	- Load sentence-transformer model
	- Perform hierarchical semantic search
	- Filter by similarity threshold, difficulty, relevance
	- Apply word exclusions and variety filtering
	6. Grid Creation:
	- Sort words by crossword viability
	- Calculate appropriate grid size
	- Use backtracking algorithm to place words
	- Validate word boundaries and intersections
	7. Grid Optimization:
	- Trim excess empty space
	- Assign proper crossword numbers
	- Generate clue objects
	8. Response Assembly → Return grid, clues, and metadata
	9. Caching → Store successful AI discoveries for future use

	The system gracefully degrades from AI → cached words → static words, ensuring crossword generation always succeeds even when AI components fail.