abc123 / crossword-app /backend-py /CROSSWORD_GENERATION_WALKTHROUGH.md
vimalk78's picture
feat(crossword): generated crosswords with clues
486eff6

Crossword Generation Code Walkthrough

This document provides a detailed, line-by-line walkthrough of how crossword puzzle generation works in the Python backend, tracing the complete flow from API request to finalized words and clues.

Overview

The Python backend implements AI-powered crossword generation using:

  • FastAPI for web framework and API endpoints
  • Vector similarity search with sentence-transformers and FAISS for intelligent word discovery
  • Backtracking algorithm for word placement in the crossword grid
  • Multi-layer caching system for performance and fallback mechanisms

1. Application Startup (app.py)

Entry Point and Service Initialization

# Lines 66-95: Async lifespan context manager
@asynccontextmanager
async def lifespan(app: FastAPI):
    global vector_service
    
    # Startup initialization
    vector_service = VectorSearchService()
    await vector_service.initialize()
    app.state.vector_service = vector_service

Flow:

  1. FastAPI application starts with lifespan context manager
  2. Creates global VectorSearchService instance (line 79)
  3. Calls vector_service.initialize() (line 82) which:
    • Loads sentence-transformer model (~30-60 seconds)
    • Builds or loads cached FAISS index for word embeddings
    • Initializes word cache manager for fallbacks
  4. Makes vector service available to all API routes via app.state

2. API Request Handling (src/routes/api.py)

Crossword Generation Endpoint

# Lines 77-118: Main crossword generation endpoint
@router.post("/generate", response_model=PuzzleResponse)
async def generate_puzzle(
    request: GeneratePuzzleRequest,
    crossword_gen: CrosswordGenerator = Depends(get_crossword_generator)
):

Flow:

  1. Client POST request to /api/generate with JSON body:

    {
      "topics": ["animals", "science"], 
      "difficulty": "medium",
         }
    
  2. FastAPI validates request against GeneratePuzzleRequest model (lines 20-23)

  3. Dependency injection calls get_crossword_generator() (lines 57-63):

    def get_crossword_generator(request: Request) -> CrosswordGenerator:
        global generator
        if generator is None:
            vector_service = getattr(request.app.state, 'vector_service', None)
            generator = CrosswordGenerator(vector_service)
        return generator
    
  4. Creates or reuses CrosswordGenerator wrapper with vector service reference

3. Crossword Generation Wrapper (src/services/crossword_generator_wrapper.py)

Simple Delegation Layer

# Lines 20-51: Generate puzzle method
async def generate_puzzle(
    self, 
    topics: List[str], 
    difficulty: str = "medium", 
    ) -> Dict[str, Any]:

Flow:

  1. Wrapper receives request from API route
  2. Line 41: Imports actual generator to avoid circular imports:
    from .crossword_generator import CrosswordGenerator as ActualGenerator
    
  3. Line 42: Creates actual generator instance with vector service
  4. Line 44: Delegates to actual generator's generate_puzzle() method

4. Core Crossword Generation (src/services/crossword_generator.py)

4.1 Main Generation Method

# Lines 22-66: Core puzzle generation
async def generate_puzzle(self, topics: List[str], difficulty: str = "medium",  = False):

Key steps:

  1. Line 37: Select words using AI or static sources:

    words = await self._select_words(topics, difficulty, use_ai)
    
  2. Line 44: Create crossword grid from selected words:

    grid_result = self._create_grid(words)
    
  3. Lines 52-62: Return structured response with grid, clues, and metadata

4.2 Word Selection Process

# Lines 68-97: Word selection logic
async def _select_words(self, topics: List[str], difficulty: str, ):

Flow branches based on use_ai parameter:

AI-Powered Word Selection (Lines 72-83):

if use_ai and self.vector_service:
    for topic in topics:
        ai_words = await self.vector_service.find_similar_words(
            topic, difficulty, self.max_words // len(topics)
        )
        all_words.extend(ai_words)

Fallback to Cached Words (Lines 86-91):

if self.vector_service:
    for topic in topics:
        cached_words = await self.vector_service._get_cached_fallback(
            topic, difficulty, self.max_words // len(topics)
        )

Final Fallback to Static JSON Files (Lines 93-95):

else:
    all_words = await self._get_static_words(topics, difficulty)

4.3 Word Sorting for Crossword Viability

# Lines 129-168: Sort words by crossword suitability
def _sort_words_for_crossword(self, words: List[Dict[str, Any]]):

Scoring algorithm:

  • Lines 138-147: Length-based scoring (shorter words preferred)
  • Lines 150-153: Common letter bonus (E, A, R, I, O, T, N, S)
  • Lines 156-158: Vowel distribution bonus
  • Lines 161-162: Penalty for very long words

5. AI Word Discovery (src/services/vector_search.py)

5.1 Vector Search Initialization

# Lines 71-143: Service initialization
async def initialize(self):

Initialization flow:

  1. Lines 90-92: Load sentence-transformer model
  2. Lines 95-119: Build or load cached FAISS index
  3. Lines 124-134: Initialize word cache manager

5.2 Core Word Finding Algorithm

# Lines 279-374: Main word discovery method
async def find_similar_words(
    self, 
    topic: str, 
    difficulty: str = "medium", 
    max_words: int = 15
):

Search strategy branches:

Hierarchical Search (Lines 296-325):

if self.use_hierarchical_search:
    all_candidates = await self._hierarchical_search(topic, difficulty, max_words)
    combined_results = self._combine_hierarchical_results(all_candidates, max_words * 2)

Traditional Single Search (Lines 328-337):

else:
    traditional_results = await self._traditional_single_search(topic, difficulty, max_words * 2)

5.3 Hierarchical Search Process

# Lines 639-748: Multi-phase hierarchical search
async def _hierarchical_search(self, topic: str, difficulty: str, max_words: int):

Three-phase approach:

Phase 1: Topic Variations (Lines 652-694)

topic_variations = self._expand_topic_variations(topic)  # "Animal" β†’ ["Animal", "Animals"]

for variation in topic_variations:
    topic_embedding = self.model.encode([variation], convert_to_numpy=True)
    scores, indices = self.faiss_index.search(topic_embedding, search_size)
    variation_candidates = self._collect_candidates_with_threshold(scores, indices, threshold, variation, difficulty)

Phase 2: Subcategory Identification (Lines 697-700)

subcategories = self._identify_subcategories(main_topic_candidates, topic)

Phase 3: Subcategory Search (Lines 703-733)

for subcategory in subcategories:
    subcat_embedding = self.model.encode([subcategory], convert_to_numpy=True)
    sub_scores, sub_indices = self.faiss_index.search(subcat_embedding, sub_search_size)

5.4 Word Quality Filtering

# Lines 1164-1215: Candidate collection with filtering
def _collect_candidates_with_threshold(self, scores, indices, threshold, topic, difficulty):

Multi-stage filtering:

  1. Line 1176: Similarity threshold check
  2. Line 1183: Difficulty matching (word length)
  3. Line 1185: Interest and topic relevance check:
    if self._is_interesting_word(word, topic) and self._is_topic_relevant(word, topic):
    

6. Grid Creation and Word Placement

6.1 Grid Generation Entry Point

# Lines 170-243: Main grid creation method
def _create_grid(self, words: List[Dict[str, Any]]):

Flow:

  1. Lines 184-203: Process and sort words by length (longest first)
  2. Lines 209-213: Calculate appropriate grid size
  3. Lines 212-237: Multiple placement attempts with increasing grid size
  4. Lines 240-241: Fallback to simple two-word cross

6.2 Word Placement Algorithm

# Lines 259-295: Backtracking word placement
def _place_words_in_grid(self, word_list: List[str], word_objs: List[Dict[str, Any]], size: int):

Setup:

  • Line 263: Initialize empty grid with dots
  • Line 270: Call recursive backtracking algorithm
  • Lines 272-287: Generate clues and assign crossword numbers

6.3 Backtracking Algorithm

# Lines 297-357: Recursive backtracking placement
def _backtrack_placement(self, grid, word_list, word_objs, word_index, placed_words, start_time, timeout):

Algorithm flow:

Base Cases:

  • Lines 302-303: Timeout check every 50 calls
  • Lines 305-306: Success when all words placed

First Word Placement (Lines 312-332):

if word_index == 0:
    center_row = size // 2
    center_col = (size - len(word)) // 2
    
    if self._can_place_word(grid, word, center_row, center_col, "horizontal"):
        original_state = self._place_word(grid, word, center_row, center_col, "horizontal")

Subsequent Word Placement (Lines 334-356):

all_placements = self._find_all_intersection_placements(grid, word, placed_words)
all_placements.sort(key=lambda p: p["score"], reverse=True)

for placement in all_placements:
    if self._can_place_word(grid, word, row, col, direction):
        # Try placement and recurse
        if self._backtrack_placement(...):
            return True
        # Backtrack if failed
        self._remove_word(grid, original_state)

6.4 Word Placement Validation

# Lines 359-417: Comprehensive placement validation
def _can_place_word(self, grid: List[List[str]], word: str, row: int, col: int, direction: str):

Critical validation checks:

  1. Lines 364-365: Boundary checks
  2. Lines 372-375: Word boundary enforcement (no adjacent letters)
  3. Lines 378-390: Letter-by-letter placement validation
  4. Lines 388-390: Perpendicular intersection validation

6.5 Intersection Finding

# Lines 486-505: Find all possible intersections
def _find_all_intersection_placements(self, grid, word, placed_words):

Process:

  1. Lines 491-502: For each placed word, find letter intersections
  2. Lines 496-502: Calculate placement position for each intersection
  3. Lines 499-502: Score placement quality

7. Clue Generation and Final Assembly

7.1 Grid Trimming and Optimization

# Lines 589-642: Remove excess empty space
def _trim_grid(self, grid, placed_words):

Process:

  1. Lines 595-610: Find bounding box of all placed words
  2. Lines 612-631: Create trimmed grid with padding
  3. Lines 634-641: Update word positions relative to new grid

7.2 Crossword Numbering

# Lines 698-750: Assign proper crossword numbers and create clues
def _assign_numbers_and_clues(self, placed_words, clues_data):

Crossword numbering rules:

  1. Lines 710-714: Group words by starting position
  2. Lines 716: Sort by reading order (top-to-bottom, left-to-right)
  3. Lines 725-749: Assign shared numbers for words starting at same position
  4. Lines 738-745: Create clue objects with proper formatting

7.3 Final Response Assembly

Back in crossword_generator.py lines 52-62:

return {
    "grid": grid_result["grid"],
    "clues": grid_result["clues"],
    "metadata": {
        "topics": topics,
        "difficulty": difficulty,
        "wordCount": len(grid_result["placed_words"]),
        "size": len(grid_result["grid"]),
        "aiGenerated":     }
}

8. Caching System (src/services/word_cache.py)

8.1 Cache Initialization

# Lines 75-113: Load existing cache files
async def initialize(self):

Process:

  1. Lines 86-108: Load all .json cache files from disk
  2. Lines 99-102: Validate cache structure and load into memory
  3. Lines 110: Report loaded cache statistics

8.2 Word Caching

# Lines 166-224: Cache successful word discoveries
async def cache_words(self, topic, difficulty, words, source="vector_search"):

Storage process:

  1. Lines 186-193: Enhance words with caching metadata
  2. Lines 196-207: Create structured cache data with expiration
  3. Lines 210-213: Save to disk (if permissions allow)
  4. Lines 216-217: Update in-memory cache

Complete Data Flow Summary

  1. API Request β†’ /api/generate with topics, difficulty, 2. Route Handler β†’ Validates request, injects dependencies
  2. Wrapper β†’ Delegates to actual generator with vector service
  3. Word Selection β†’ AI vector search OR cached words OR static JSON fallback
  4. Vector Search (if AI enabled):
    • Load sentence-transformer model
    • Perform hierarchical semantic search
    • Filter by similarity threshold, difficulty, relevance
    • Apply word exclusions and variety filtering
  5. Grid Creation:
    • Sort words by crossword viability
    • Calculate appropriate grid size
    • Use backtracking algorithm to place words
    • Validate word boundaries and intersections
  6. Grid Optimization:
    • Trim excess empty space
    • Assign proper crossword numbers
    • Generate clue objects
  7. Response Assembly β†’ Return grid, clues, and metadata
  8. Caching β†’ Store successful AI discoveries for future use

The system gracefully degrades from AI β†’ cached words β†’ static words, ensuring crossword generation always succeeds even when AI components fail.