| # Python Backend with Thematic AI Word Generation | |
| This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings. | |
| ## Features | |
| - **Thematic Word Generation**: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary | |
| - **319K+ Word Database**: Comprehensive vocabulary from WordFreq with frequency data | |
| - **10-Tier Difficulty System**: Smart word selection based on frequency tiers | |
| - **Environment Variable Configuration**: Flexible cache and model configuration | |
| - **FastAPI**: Modern, fast Python web framework | |
| - **Same API**: Compatible with existing React frontend | |
| ## Differences from JavaScript Backend | |
| | Feature | JavaScript Backend | Python Backend | | |
| |---------|-------------------|----------------| | |
| | **Word Generation** | Static word lists | Thematic AI word generation from 319K vocabulary | | |
| | **Vocabulary Size** | ~100 words per topic | Filtered from 319K WordFreq database | | |
| | **AI Approach** | Basic filtering | Semantic similarity with frequency tiers | | |
| | **Performance** | Fast but limited | Slower startup, richer word selection | | |
| | **Dependencies** | Node.js + static files | Python + ML libraries | | |
| ## 🛠️ Setup & Installation | |
| ### Prerequisites | |
| - Python 3.11+ (3.11 recommended for Docker compatibility) | |
| - pip (Python package manager) | |
| ### Basic Setup (Core Functionality) | |
| ```bash | |
| # Clone and navigate to backend directory | |
| cd crossword-app/backend-py | |
| # Create virtual environment (recommended) | |
| python -m venv venv | |
| source venv/bin/activate # On Windows: venv\Scripts\activate | |
| # Install core dependencies | |
| pip install -r requirements.txt | |
| # Start the server | |
| python app.py | |
| ``` | |
| ### Full Development Setup (with AI features) | |
| ```bash | |
| # Install development dependencies including AI/ML libraries | |
| pip install -r requirements-dev.txt | |
| # This includes: | |
| # - All core dependencies | |
| # - AI/ML libraries (torch, sentence-transformers, etc.) | |
| # - Development tools (pytest, coverage, etc.) | |
| ``` | |
| ### Requirements Files | |
| - **`requirements.txt`**: Core dependencies for basic functionality | |
| - **`requirements-dev.txt`**: Full development environment with AI features | |
| > **Note**: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use `requirements.txt` only. | |
| > **Python Version**: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility. | |
| ## Structure | |
| ``` | |
| backend-py/ | |
| ├── app.py # FastAPI application entry point | |
| ├── requirements.txt # Core Python dependencies | |
| ├── requirements-dev.txt # Full development dependencies | |
| ├── src/ | |
| │   ├── services/ | |
| │   │   ├── thematic_word_service.py # Thematic AI word generation | |
| │   │   ├── crossword_generator.py # Puzzle generation logic | |
| │   │   └── crossword_generator_wrapper.py # Service wrapper | |
| │   └── routes/ | |
| │       └── api.py # API endpoints (matches JS backend) | |
| ├── test-unit/ # Unit tests (pytest framework) - 5 files | |
| │   ├── test_crossword_generator.py | |
| │   ├── test_api_routes.py | |
| │   └── test_vector_search.py | |
| ├── test-integration/ # Integration tests (standalone scripts) - 16 files | |
| │   ├── test_simple_generation.py | |
| │   ├── test_boundary_fix.py | |
| │   └── test_local.py # (+ 13 more test files) | |
| ├── data/ -> ../backend/data/ # Symlink to shared word data | |
| └── public/ # Frontend static files (copied during build) | |
| ``` | |
| ## Dependencies | |
| ### Core ML Stack | |
| - `sentence-transformers`: Local model loading and embeddings | |
| - `wordfreq`: 319K word vocabulary with frequency data | |
| - `torch`: PyTorch for model inference | |
| - `scikit-learn`: Cosine similarity and clustering | |
| - `numpy`: Vector operations | |
| ### Web Framework | |
| - `fastapi`: Modern Python web framework | |
| - `uvicorn`: ASGI server | |
| - `pydantic`: Data validation | |
| ### Testing | |
| - `pytest`: Testing framework | |
| - `pytest-asyncio`: Async test support | |
| ## 🧪 Testing | |
| ### Test Organization (Reorganized for Clarity) | |
| **We've reorganized the test structure for better developer experience:** | |
| | Test Type | Location | Purpose | Framework | Count | | |
| |-----------|----------|---------|-----------|-------| | |
| | **Unit Tests** | `test-unit/` | Test individual components in isolation | pytest | 5 files | | |
| | **Integration Tests** | `test-integration/` | Test complete workflows end-to-end | Standalone scripts | 16 files | | |
| **Benefits of this structure:** | |
| - ✅ **Clear separation** between unit and integration testing | |
| - ✅ **Intuitive naming** - developers immediately understand test types | |
| - ✅ **Better tooling** - can run different test types independently | |
| - ✅ **Easier maintenance** - organized by testing strategy | |
| > **Note**: Previously, tests were split between a `tests/` folder and root-level `test_*.py` files. The new structure separates them by test type. | |
| ### Unit Tests Details (`test-unit/`) | |
| **What they test:** Individual components with mocking and isolation | |
| - `test_crossword_generator.py` - Core crossword generation logic | |
| - `test_api_routes.py` - FastAPI endpoint handlers | |
| - `test_crossword_generator_wrapper.py` - Service wrapper layer | |
| - `test_index_bug_fix.py` - Specific bug fix validations | |
| - `test_vector_search.py` - AI vector search functionality (requires torch) | |
| ### Run Unit Tests (Formal Test Suite) | |
| ```bash | |
| # Run all unit tests | |
| python run_tests.py | |
| # Run specific test modules | |
| python run_tests.py crossword_generator | |
| pytest test-unit/test_crossword_generator.py -v | |
| # Run core tests (excluding AI dependencies) | |
| pytest test-unit/ -v --ignore=test-unit/test_vector_search.py | |
| # Run individual unit test classes | |
| pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v | |
| ``` | |
| ### Integration Tests Details (`test-integration/`) | |
| **What they test:** Complete workflows without mocking - real functionality | |
| - `test_simple_generation.py` - End-to-end crossword generation | |
| - `test_boundary_fix.py` - Word boundary validation (our major fix!) | |
| - `test_local.py` - Local environment and dependencies | |
| - `test_word_boundaries.py` - Comprehensive boundary testing | |
| - `test_bounds_comprehensive.py` - Advanced bounds checking | |
| - `test_final_validation.py` - API integration testing | |
| - And 10 more specialized feature tests... | |
| ### Run Integration Tests (End-to-End Scripts) | |
| ```bash | |
| # Test core functionality | |
| python test-integration/test_simple_generation.py | |
| python test-integration/test_boundary_fix.py | |
| python test-integration/test_local.py | |
| # Test specific features | |
| python test-integration/test_word_boundaries.py | |
| python test-integration/test_bounds_comprehensive.py | |
| # Test API integration | |
| python test-integration/test_final_validation.py | |
| ``` | |
| ### Test Coverage | |
| ```bash | |
| # Run core tests with coverage (requires requirements-dev.txt) | |
| pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html | |
| pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term | |
| # Full coverage report (may fail without AI dependencies) | |
| pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py | |
| ``` | |
| ### Test Status | |
| - ✅ **Core crossword generation**: 15/19 unit tests passing | |
| - ✅ **Boundary validation**: All integration tests passing | |
| - ⚠️ **AI/Vector search**: Requires torch dependencies | |
| - ⚠️ **Some async mocking**: Minor test infrastructure issues | |
| ### Migration Guide (For Existing Developers) | |
| **If you had previous commands, update them:** | |
| | Old Command | New Command | | |
| |-------------|-------------| | |
| | `pytest tests/` | `pytest test-unit/` | | |
| | `python test_simple_generation.py` | `python test-integration/test_simple_generation.py` | | |
| | `pytest tests/ --cov=src` | `pytest test-unit/ --cov=src` | | |
| **All functionality is preserved** - just organized better! | |
| ## 🔧 Configuration | |
| ### Environment Variables | |
| The backend supports flexible configuration via environment variables: | |
| ```bash | |
| # Cache Configuration | |
| CACHE_DIR=/app/cache # Cache directory for all service files | |
| THEMATIC_VOCAB_SIZE_LIMIT=50000 # Maximum vocabulary size (default: 100000) | |
| THEMATIC_MODEL_NAME=all-mpnet-base-v2 # Sentence transformer model | |
| # Core Application Settings | |
| PORT=7860 # Server port | |
| NODE_ENV=production # Environment mode | |
| # Optional | |
| LOG_LEVEL=INFO # Logging level | |
| ``` | |
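As a rough illustration of how a service could consume the variables above, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical names, not taken from the backend's code. Only the `100000` default for `THEMATIC_VOCAB_SIZE_LIMIT` comes from the comment above; the other fallback values are assumptions borrowed from the example values.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    cache_dir: str
    vocab_size_limit: int
    model_name: str
    port: int
    log_level: str


def load_settings(env=None) -> Settings:
    """Read the documented variables, falling back to defaults when unset."""
    env = os.environ if env is None else env
    return Settings(
        # Assumed fallback; the real service may use a different location.
        cache_dir=env.get("CACHE_DIR", "./cache"),
        # 100000 is the default stated in the configuration comment.
        vocab_size_limit=int(env.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        # Assumed fallbacks, copied from the example values.
        model_name=env.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        port=int(env.get("PORT", "7860")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )


if __name__ == "__main__":
    # Passing a plain dict makes the function easy to try without exporting anything.
    print(load_settings({"THEMATIC_VOCAB_SIZE_LIMIT": "50000"}))
```

Accepting an optional mapping instead of reading `os.environ` directly keeps the function easy to exercise in unit tests.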
| ### Cache Structure | |
| The service creates the following cache files: | |
| ``` | |
| {CACHE_DIR}/ | |
| ├── vocabulary_{size}.pkl # Processed vocabulary words | |
| ├── frequencies_{size}.pkl # Word frequency data | |
| ├── embeddings_{model}_{size}.npy # Word embeddings | |
| └── sentence-transformers/ # Hugging Face model cache | |
| ``` | |
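The naming scheme above can be reproduced in a few lines, which is handy for checking whether a pre-built cache is complete. This is a sketch only: `cache_paths` and `missing_cache_files` are hypothetical helpers, and it assumes `{size}` is the vocabulary size limit and `{model}` is the model name with any `/` replaced.

```python
from pathlib import Path


def cache_paths(cache_dir: str, model_name: str, vocab_size: int) -> dict:
    """Build the expected cache locations from the documented naming scheme."""
    root = Path(cache_dir)
    # Assumption: model names containing "/" are flattened so they fit in a file name.
    safe_model = model_name.replace("/", "_")
    return {
        "vocabulary": root / f"vocabulary_{vocab_size}.pkl",
        "frequencies": root / f"frequencies_{vocab_size}.pkl",
        "embeddings": root / f"embeddings_{safe_model}_{vocab_size}.npy",
        "model_cache": root / "sentence-transformers",
    }


def missing_cache_files(paths: dict) -> list:
    """Return the names of cache entries that do not exist yet, sorted."""
    return sorted(name for name, path in paths.items() if not path.exists())


if __name__ == "__main__":
    paths = cache_paths("/app/cache", "all-mpnet-base-v2", 50000)
    print("missing:", missing_cache_files(paths))
```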
| ## 🎯 Thematic Word Generation Process | |
| 1. **Initialization**: | |
| - Load WordFreq vocabulary database (319K words) | |
| - Filter words for crossword suitability (length, content) | |
| - Load sentence-transformers model locally | |
| - Pre-compute embeddings for filtered vocabulary | |
| - Create 10-tier frequency classification system | |
| 2. **Word Generation**: | |
| - Get topic embedding: `"Animals" → [768-dim vector]` | |
| - Compute cosine similarity with all vocabulary embeddings | |
| - Filter by similarity threshold and difficulty tier | |
| - Filter by crossword-specific criteria (length, etc.) | |
| - Return top matches with generated clues | |
| 3. **Multi-Theme Support**: | |
| - Detect multiple themes using clustering | |
| - Generate words that relate to combined themes | |
| - Balance word selection across different topics | |
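The initialization and word-generation steps above can be sketched with toy data. This is not the service's implementation: the function names, the tier rule (equal-sized rank buckets), and the thresholds are all assumptions, and cosine similarity is written in plain Python so the example runs without `numpy` or `scikit-learn`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_tiers(frequencies, n_tiers=10):
    """Map each word to a tier in 1..n_tiers; tier 1 holds the most frequent words.

    Assumption: tiers are equal-sized buckets over the frequency ranking.
    """
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    return {word: rank * n_tiers // len(ranked) + 1 for rank, word in enumerate(ranked)}


def thematic_words(topic_vec, embeddings, tiers, allowed_tiers,
                   min_similarity=0.3, min_len=3, max_len=15, top_k=10):
    """Rank vocabulary words by similarity to the topic, after tier and length filters."""
    scored = []
    for word, vec in embeddings.items():
        # Crossword suitability: alphabetic words within the length limits.
        if not word.isalpha() or not (min_len <= len(word) <= max_len):
            continue
        # Difficulty filter: keep only words in the requested frequency tiers.
        if tiers.get(word) not in allowed_tiers:
            continue
        score = cosine_similarity(topic_vec, vec)
        if score >= min_similarity:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy 3-dimensional vectors stand in for the real 768-dimensional embeddings.
    embeddings = {"tiger": [0.9, 0.1, 0.0], "zebra": [0.8, 0.2, 0.1],
                  "piano": [0.0, 0.1, 0.9], "ox": [1.0, 0.0, 0.0]}
    frequencies = {"tiger": 5e-6, "zebra": 2e-6, "piano": 8e-6, "ox": 1e-6}
    tiers = assign_tiers(frequencies, n_tiers=2)
    for word, score in thematic_words([1.0, 0.0, 0.0], embeddings, tiers, {1, 2}):
        print(f"{word}: {score:.3f}")
```

In a real deployment the per-word loop would normally be replaced by one matrix product over the pre-computed embedding array, which is where the listed `numpy` dependency would come in.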
| ## 🧪 Testing | |
| ```bash | |
| # Local testing (without full vector search) | |
| cd backend-py | |
| python test-integration/test_local.py | |
| # Start development server | |
| python app.py | |
| ``` | |
| ## 🐳 Container Deployment | |
| ### Docker Run with Cache Configuration | |
| ```bash | |
| # Basic deployment | |
| docker run -e CACHE_DIR=/app/cache \ | |
| -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \ | |
| -v /host/cache:/app/cache \ | |
| -p 7860:7860 \ | |
| your-crossword-app | |
| # With all configuration options | |
| docker run -e CACHE_DIR=/app/cache \ | |
| -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \ | |
| -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \ | |
| -e NODE_ENV=production \ | |
| -v /host/cache:/app/cache \ | |
| -p 7860:7860 \ | |
| your-crossword-app | |
| ``` | |
| ### Docker Compose | |
| ```yaml | |
| version: '3.8' | |
| services: | |
|   crossword-backend: | |
|     image: your-crossword-app | |
|     environment: | |
|       - CACHE_DIR=/app/cache | |
|       - THEMATIC_VOCAB_SIZE_LIMIT=50000 | |
|       - THEMATIC_MODEL_NAME=all-mpnet-base-v2 | |
|       - NODE_ENV=production | |
|     volumes: | |
|       - ./cache:/app/cache | |
|     ports: | |
|       - "7860:7860" | |
|     restart: unless-stopped | |
| ``` | |
| ### Pre-built Cache Strategy (Recommended) | |
| For production deployments, pre-build the cache to avoid long startup times: | |
| ```bash | |
| # 1. Build cache locally or in a build container | |
| export CACHE_DIR=/local/cache | |
| export THEMATIC_VOCAB_SIZE_LIMIT=50000 | |
| python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()" | |
| # 2. Deploy with pre-built cache (read-only mount) | |
| docker run -e CACHE_DIR=/app/cache \ | |
| -v /local/cache:/app/cache:ro \ | |
| -p 7860:7860 \ | |
| your-crossword-app | |
| ``` | |
| ### Debugging Cache Issues | |
| If cache files are not being created in your container: | |
| 1. **Check Health Endpoints:** | |
| ```bash | |
| # Basic health check | |
| curl http://localhost:7860/api/health | |
| # Detailed cache status | |
| curl http://localhost:7860/api/health/cache | |
| # Force cache re-initialization | |
| curl -X POST http://localhost:7860/api/health/cache/reinitialize | |
| ``` | |
| 2. **Check Container Logs:** | |
| ```bash | |
| docker logs your-container-name | |
| ``` | |
| Look for cache directory permissions and initialization messages. | |
| 3. **Test Cache Directory:** | |
| ```bash | |
| # Run test script to verify cache setup | |
| docker exec your-container python test_cache_startup.py | |
| ``` | |
| 4. **Common Issues:** | |
| - **Permission denied**: Container user can't write to mounted volume | |
| - **Missing dependencies**: ML libraries not installed in container | |
| - **Volume not mounted**: Cache directory not properly mounted | |
| - **Environment variables**: `CACHE_DIR` not set correctly | |
| 5. **Fix Permission Issues:** | |
| ```bash | |
| # Option 1: Change ownership of host directory | |
| sudo chown -R 1000:1000 /host/cache | |
| # Option 2: Run container with specific user | |
| docker run --user 1000:1000 ... | |
| # Option 3: Set permissions in the Dockerfile (a Dockerfile instruction, not a shell command) | |
| # RUN mkdir -p /app/cache && chmod 777 /app/cache | |
| ``` | |
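The "permission denied" and "volume not mounted" cases from the list above can be told apart from inside the container with a short check. This is a sketch under assumptions: `check_cache_dir` is a hypothetical helper and is not the project's `test_cache_startup.py`.

```python
import os
import tempfile


def check_cache_dir(cache_dir: str) -> str:
    """Return "ok", or a short description of what is wrong with cache_dir."""
    if not os.path.isdir(cache_dir):
        return "missing: directory does not exist (is the volume mounted?)"
    try:
        # Creating a real file is the most direct test; os.access() can report
        # success where an actual write still fails, e.g. on network filesystems.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError as exc:
        return f"not writable: {exc}"
    return "ok"


if __name__ == "__main__":
    # Falls back to a local directory when CACHE_DIR is unset (an assumed default).
    print(check_cache_dir(os.environ.get("CACHE_DIR", "./cache")))
```

A "not writable" result is expected when the cache is mounted read-only, as in the pre-built cache strategy.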
| ### Kubernetes Deployment | |
| ```yaml | |
| apiVersion: v1 | |
| kind: ConfigMap | |
| metadata: | |
|   name: crossword-config | |
| data: | |
|   CACHE_DIR: "/app/cache" | |
|   THEMATIC_VOCAB_SIZE_LIMIT: "50000" | |
|   THEMATIC_MODEL_NAME: "all-mpnet-base-v2" | |
|   NODE_ENV: "production" | |
| --- | |
| apiVersion: v1 | |
| kind: PersistentVolumeClaim | |
| metadata: | |
|   name: crossword-cache | |
| spec: | |
|   accessModes: | |
|     - ReadWriteOnce | |
|   resources: | |
|     requests: | |
|       storage: 5Gi | |
| --- | |
| apiVersion: apps/v1 | |
| kind: Deployment | |
| metadata: | |
|   name: crossword-backend | |
| spec: | |
|   replicas: 1 | |
|   selector: | |
|     matchLabels: | |
|       app: crossword-backend | |
|   template: | |
|     metadata: | |
|       labels: | |
|         app: crossword-backend | |
|     spec: | |
|       containers: | |
|         - name: backend | |
|           image: your-crossword-app | |
|           envFrom: | |
|             - configMapRef: | |
|                 name: crossword-config | |
|           volumeMounts: | |
|             - name: cache-volume | |
|               mountPath: /app/cache | |
|           ports: | |
|             - containerPort: 7860 | |
|       volumes: | |
|         - name: cache-volume | |
|           persistentVolumeClaim: | |
|             claimName: crossword-cache | |
| ``` | |
| ## 🧪 Testing | |
| ### Quick Test | |
| ```bash | |
| # Basic functionality test (no model download) | |
| python test-integration/test_local.py | |
| ``` | |
| ### Comprehensive Unit Tests | |
| ```bash | |
| # Run all unit tests | |
| python run_tests.py | |
| # Or use pytest directly | |
| pytest test-unit/ -v | |
| # Run specific test file | |
| python run_tests.py crossword_generator | |
| pytest test-unit/test_crossword_generator.py -v | |
| # Run with coverage | |
| pytest test-unit/ --cov=src --cov-report=html | |
| ``` | |
| ### Test Structure | |
| - `test-unit/test_crossword_generator.py` - Core grid generation logic | |
| - `test-unit/test_vector_search.py` - Vector similarity search | |
| - `test-unit/test_crossword_generator_wrapper.py` - Service wrapper | |
| - `test-unit/test_api_routes.py` - FastAPI endpoints | |
| ### Key Test Features | |
| - ✅ **Index alignment fix**: Tests the list index out of range bug fix | |
| - ✅ **Mocked vector search**: Tests without downloading models | |
| - ✅ **API validation**: Tests all endpoints and error cases | |
| - ✅ **Async support**: Full pytest-asyncio integration | |
| - ✅ **Error handling**: Tests malformed inputs and edge cases | |
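To illustrate the "mocked vector search" idea, the sketch below replaces the embedding model with a `unittest.mock.MagicMock`, so the test never downloads a model. `embed_topic` is a hypothetical helper written for this example and is not part of the project. The only real API assumed is that the model object exposes an `encode` method that takes a list of strings, as sentence-transformers models do.

```python
from unittest.mock import MagicMock


def embed_topic(model, topic):
    """Hypothetical helper: ask the model for the embedding of one topic string."""
    return model.encode([topic])[0]


def test_embed_topic_uses_model_without_download():
    # The fake model stands in for a real SentenceTransformer instance.
    fake_model = MagicMock()
    fake_model.encode.return_value = [[0.1, 0.2, 0.3]]

    assert embed_topic(fake_model, "Animals") == [0.1, 0.2, 0.3]
    # Verify the helper passed the topic through as a one-element batch.
    fake_model.encode.assert_called_once_with(["Animals"])


if __name__ == "__main__":
    test_embed_topic_uses_model_without_download()
    print("mocked embedding test passed")
```

Because the function name starts with `test_`, pytest would also collect it if the snippet were saved under `test-unit/`.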
| ## Performance Comparison | |
| **Startup Time**: | |
| - JavaScript: ~2 seconds | |
| - Python: ~30-60 seconds (model download + embedding generation) | |
| - Python (with cache): ~5-10 seconds | |
| **Word Quality**: | |
| - JavaScript: Limited by static word lists (~100 words/topic) | |
| - Python: Rich thematic generation from 319K word database | |
| **Memory Usage**: | |
| - JavaScript: ~100MB | |
| - Python: ~500MB-1GB (model + embeddings) | |
| - Cache Size: ~50-200MB per 50K vocabulary | |
| **API Response Time**: | |
| - JavaScript: ~100ms (static word lookup) | |
| - Python: ~200-500ms (semantic similarity computation) | |
| **Cache Performance**: | |
| - Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation | |
| - Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation | |
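The gap between cached and uncached startup comes from a load-or-build pattern: reuse the file if it exists, otherwise compute once and save. Below is a minimal sketch for the pickled vocabulary files. `load_or_build` is a hypothetical helper, and the service's actual caching code may differ.

```python
import pickle
from pathlib import Path


def load_or_build(path, build):
    """Return the object cached at path, building and saving it on a cache miss."""
    path = Path(path)
    if path.exists():
        # Only unpickle files you created yourself; pickle can execute code on load.
        with path.open("rb") as fh:
            return pickle.load(fh)
    value = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write under a temporary name first so a crash cannot leave a half-written cache.
    tmp = path.with_name(path.name + ".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(value, fh)
    tmp.replace(path)
    return value


if __name__ == "__main__":
    words = load_or_build("cache/vocabulary_3.pkl", lambda: ["cat", "dog", "owl"])
    print(words)
```

The second call with the same path skips `build` entirely, which is the fast path that the cached startup figures above refer to.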
| ## Migration Strategy | |
| 1. **Phase 1** ✅: Basic Python backend structure | |
| 2. **Phase 2**: Test vector search functionality | |
| 3. **Phase 3**: Docker deployment and production testing | |
| 4. **Phase 4**: Compare with JavaScript backend | |
| 5. **Phase 5**: Production switch with rollback capability | |
| ## 🎯 Next Steps | |
| - [x] Replace vector search with thematic word generation | |
| - [x] Implement environment variable cache configuration | |
| - [x] Add 10-tier difficulty system based on word frequency | |
| - [ ] Optimize embedding computation performance | |
| - [ ] Add more sophisticated crossword grid generation | |
| - [ ] Implement LLM-based clue generation | |
| - [ ] Add cache warming strategies for production deployment |