Debug Tab Feature Documentation
Overview
The debug tab feature exposes the internals of the thematic word generation process when enabled via the ENABLE_DEBUG_TAB environment variable. It is intended for developers who want to understand how the ML-based word selection algorithm works.
Environment Variable
ENABLE_DEBUG_TAB=true # Enable debug data collection and return
ENABLE_DEBUG_TAB=false # Disable debug data (default)
When Debug Data is Collected
Debug data is collected during the thematic word generation process in find_words_for_crossword() when:
- The ENABLE_DEBUG_TAB=true environment variable is set
- The thematic word service is properly initialized
- Any API endpoint that generates crossword puzzles is called
API Response Structure
Without Debug (Default)
{
"grid": [...],
"clues": {...},
"metadata": {...}
}
With Debug Enabled
{
"grid": [...],
"clues": {...},
"metadata": {...},
"debug": {
"enabled": true,
"generation_params": {
"topics": ["animals"],
"difficulty": "easy",
"requested_words": 10,
"custom_sentence": null,
"multi_theme": true,
"thematic_pool_size": 150,
"min_similarity": 0.25
},
"thematic_pool": [
{
"word": "CAT",
"similarity": 0.834,
"tier": "tier_5_common",
"percentile": 0.952,
"tier_description": "Common (Top 8%)"
}
],
"candidate_words": [
{
"word": "CAT",
"similarity": 0.834,
"tier": "tier_5_common",
"percentile": 0.952,
"clue": "Feline pet",
"semantic_neighbors": ["dog", "kitten", "feline", "pet", "animal"]
}
],
"selection_method": "softmax",
"selection_params": {
"use_softmax_selection": true,
"similarity_temperature": 0.2,
"difficulty_weight": 0.5
},
"selected_words": [
{
"word": "CAT",
"similarity": 0.834,
"tier": "tier_5_common",
"percentile": 0.952,
"clue": "Feline pet"
}
]
}
}
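Because the "debug" key is only present when the feature is enabled, client code should treat it as optional. The following is a minimal sketch of a consumer, assuming the response shape shown above; summarize_debug is a hypothetical helper name, not part of the project.

```python
import json


def summarize_debug(response: dict) -> dict:
    """Summarize the optional "debug" section of a generate response.

    Hypothetical helper: field names follow the example response above.
    """
    debug = response.get("debug")
    if not debug or not debug.get("enabled"):
        # Default case: the server returned no debug data.
        return {"debug_available": False}
    return {
        "debug_available": True,
        "pool_size": len(debug.get("thematic_pool", [])),
        "candidates": len(debug.get("candidate_words", [])),
        "selected": [w["word"] for w in debug.get("selected_words", [])],
        "selection_method": debug.get("selection_method"),
    }


if __name__ == "__main__":
    sample = {
        "grid": [],
        "clues": {},
        "metadata": {},
        "debug": {
            "enabled": True,
            "thematic_pool": [{"word": "CAT", "similarity": 0.834}],
            "candidate_words": [{"word": "CAT", "clue": "Feline pet"}],
            "selection_method": "softmax",
            "selected_words": [{"word": "CAT", "clue": "Feline pet"}],
        },
    }
    print(json.dumps(summarize_debug(sample), indent=2))
```

Using dict.get throughout keeps the helper safe against responses produced with debug disabled.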
Debug Data Structure
generation_params
- topics: Input topics provided by user
- difficulty: Selected difficulty level
- requested_words: Number of words requested
- custom_sentence: Custom sentence input (if any)
- multi_theme: Whether multi-theme processing was used
- thematic_pool_size: Size of initial thematic pool generated
- min_similarity: Minimum similarity threshold used
thematic_pool
Array of all words generated thematically (typically 150 words):
- word: The word in uppercase
- similarity: Cosine similarity score to theme (0.0-1.0)
- tier: Frequency tier (tier_1_ultra_common to tier_10_very_rare)
- percentile: Word frequency percentile (0.0-1.0, higher = more common)
- tier_description: Human-readable tier description
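To make the similarity field concrete, here is a small sketch of how a cosine similarity score is computed and how a pool could be filtered by min_similarity. It uses plain Python lists as stand-in vectors; the real service presumably works on model embeddings, and both function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pool(pool: list[dict], min_similarity: float = 0.25) -> list[dict]:
    """Keep entries at or above the threshold, most similar first.

    ASSUMPTION: entries look like the thematic_pool items documented above.
    """
    kept = [entry for entry in pool if entry["similarity"] >= min_similarity]
    return sorted(kept, key=lambda entry: entry["similarity"], reverse=True)
```

The same filter_pool call can be applied to a captured thematic_pool array to see which words would survive a stricter threshold.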
candidate_words
Array of words that passed filtering and got clues generated:
- All fields from thematic_pool, plus:
- clue: Generated crossword clue
- semantic_neighbors: List of semantically similar words from embeddings
selection_method
- "softmax": Uses ML-based probabilistic selection
- "random": Uses traditional random selection
selection_params
Current algorithm parameters:
- use_softmax_selection: Whether softmax selection is enabled
- similarity_temperature: Temperature for softmax (lower = more deterministic)
- difficulty_weight: Balance between similarity and frequency (0.0-1.0)
selected_words
Final words chosen for crossword generation with their scores and metadata.
Use Cases for Debug Data
1. Algorithm Performance Analysis
- Compare similarity scores across difficulty levels
- Analyze frequency distribution in selections
- Understand why certain words were selected/rejected
2. Difficulty Tuning
- Verify easy mode selects common words (high percentiles)
- Verify hard mode selects rare words (low percentiles)
- Check if composite scoring is working as expected
3. Clue Quality Assessment
- Review generated clues for accuracy
- Analyze semantic neighbors for clue generation
- Identify words that need manual clue overrides
4. Theme Relevance Debugging
- Check similarity scores for theme matching
- Identify words that don't match intended theme
- Analyze multi-theme vs single-theme behavior
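The difficulty tuning checks listed above can be automated against a captured debug payload. The sketch below averages the percentile of selected_words and compares it to a threshold; the threshold values 0.7 and 0.3 are illustrative assumptions, not values taken from the project, and check_difficulty is a hypothetical helper.

```python
from statistics import mean


def mean_percentile(words: list[dict]) -> float:
    """Average frequency percentile (higher = more common words)."""
    return mean(w["percentile"] for w in words)


def check_difficulty(debug: dict, difficulty: str,
                     easy_min: float = 0.7, hard_max: float = 0.3) -> bool:
    """Return True if the selection looks consistent with the difficulty.

    ASSUMPTION: easy_min and hard_max are placeholder thresholds; tune them
    against real debug output before relying on this check.
    """
    selected = debug.get("selected_words", [])
    if not selected:
        raise ValueError("debug payload contains no selected_words")
    avg = mean_percentile(selected)
    if difficulty == "easy":
        return avg >= easy_min
    if difficulty == "hard":
        return avg <= hard_max
    # No expectation is encoded for other difficulty levels.
    return True
```

Running this over several generated puzzles per difficulty level gives a quick signal on whether the frequency weighting behaves as intended.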
Frontend Integration
The debug data can be displayed in a separate "Debug" or "Peek-in" tab that shows:
- Generation Overview: Parameters and summary statistics
- Thematic Pool: Sortable table of all generated words (typically 150)
- Selection Process: Visualization of softmax probabilities
- Semantic Analysis: Word neighbors and clue generation details
- Performance Metrics: Timing and efficiency data (not yet part of the debug payload; see Future Enhancements)
Performance Impact
- Disabled (default): No performance impact
- Enabled: Minimal impact (~5-10ms additional processing)
- Semantic neighbor computation: ~1-2ms per word
- Debug data structure creation: ~3-5ms total
- JSON serialization: Negligible
Security Considerations
- Debug data exposes internal ML model behavior
- Should only be enabled in development/staging environments
- No sensitive user data is exposed
- All data is derived from public word frequencies and embeddings
Testing
# Test debug disabled
python test_debug_feature.py
# Test with full model loading (takes time)
ENABLE_DEBUG_TAB=true python -c "import asyncio; from test_debug_feature import full_integration_test; asyncio.run(full_integration_test())"
# Test API endpoint with debug enabled (start the server, then run curl from a second terminal)
ENABLE_DEBUG_TAB=true uvicorn app:app --host 0.0.0.0 --port 7860
curl -X POST http://localhost:7860/api/generate -H "Content-Type: application/json" -d '{"topics": ["animals"], "difficulty": "easy", "wordCount": 8}'
Implementation Details
Files Modified
- src/services/thematic_word_service.py: Debug data collection
- src/services/crossword_generator.py: Debug data pass-through
- src/routes/api.py: Debug data inclusion in responses
Key Functions
- ThematicWordService.find_words_for_crossword(): Now returns {"words": [...], "debug": {...}}
- CrosswordGenerator._select_words(): Now returns (words, debug_data) tuple
- CrosswordGenerator.generate_puzzle(): Includes debug data in response
Environment Variable Parsing
self.enable_debug_tab = os.getenv("ENABLE_DEBUG_TAB", "false").lower() == "true"
Future Enhancements
- Performance Metrics: Add timing data for each processing stage
- Grid Placement Debug: Include grid placement algorithm details
- Clue Generation Debug: Detailed clue generation process insights
- Statistical Analysis: Word distribution charts and analytics
- Export Functionality: Export debug data as CSV/JSON for analysis
This feature was implemented in August 2025 as part of the crossword generation optimization project.