# Python Backend with Thematic AI Word Generation

This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings.

## 🚀 Features

- **Thematic Word Generation**: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary
- **319K+ Word Database**: Comprehensive vocabulary from WordFreq with frequency data
- **10-Tier Difficulty System**: Smart word selection based on frequency tiers
- **Environment Variable Configuration**: Flexible cache and model configuration
- **FastAPI**: Modern, fast Python web framework
- **Same API**: Compatible with existing React frontend

## 🔄 Differences from JavaScript Backend

| Feature | JavaScript Backend | Python Backend |
|---------|-------------------|----------------|
| **Word Generation** | Static word lists | Thematic AI word generation from 319K vocabulary |
| **Vocabulary Size** | ~100 words per topic | Filtered from 319K WordFreq database |
| **AI Approach** | Basic filtering | Semantic similarity with frequency tiers |
| **Performance** | Fast but limited | Slower startup, richer word selection |
| **Dependencies** | Node.js + static files | Python + ML libraries |

## 🛠️ Setup & Installation

### Prerequisites

- Python 3.11+ (3.11 recommended for Docker compatibility)
- pip (Python package manager)

### Basic Setup (Core Functionality)

```bash
# Clone and navigate to backend directory
cd crossword-app/backend-py

# Create virtual environment (recommended)
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install core dependencies
pip install -r requirements.txt

# Start the server
python app.py
```

### Full Development Setup (with AI features)

```bash
# Install development dependencies including AI/ML libraries
pip install -r requirements-dev.txt

# This includes:
# - All core dependencies
# - AI/ML libraries (torch, sentence-transformers, etc.)
# - Development tools (pytest, coverage, etc.)
```

### Requirements Files

- **`requirements.txt`**: Core dependencies for basic functionality
- **`requirements-dev.txt`**: Full development environment with AI features

> **Note**: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use `requirements.txt` only.

> **Python Version**: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility.

## 📁 Structure

```
backend-py/
├── app.py                      # FastAPI application entry point
├── requirements.txt            # Core Python dependencies
├── requirements-dev.txt        # Full development dependencies
├── src/
│   ├── services/
│   │   ├── thematic_word_service.py        # Thematic AI word generation
│   │   ├── crossword_generator.py          # Puzzle generation logic
│   │   └── crossword_generator_wrapper.py  # Service wrapper
│   └── routes/
│       └── api.py              # API endpoints (matches JS backend)
├── test-unit/                  # Unit tests (pytest framework) - 5 files
│   ├── test_crossword_generator.py
│   ├── test_api_routes.py
│   └── test_vector_search.py
├── test-integration/           # Integration tests (standalone scripts) - 16 files
│   ├── test_simple_generation.py
│   ├── test_boundary_fix.py
│   └── test_local.py           # (+ 13 more test files)
├── data/ -> ../backend/data/   # Symlink to shared word data
└── public/                     # Frontend static files (copied during build)
```

## 🛠 Dependencies

### Core ML Stack

- `sentence-transformers`: Local model loading and embeddings
- `wordfreq`: 319K word vocabulary with frequency data
- `torch`: PyTorch for model inference
- `scikit-learn`: Cosine similarity and clustering
- `numpy`: Vector operations

### Web Framework

- `fastapi`: Modern Python web framework
- `uvicorn`: ASGI server
- `pydantic`: Data validation

### Testing

- `pytest`: Testing framework
- `pytest-asyncio`: Async test support

## 🧪 Testing
### 📁 Test Organization (Reorganized for Clarity)

**We've reorganized the test structure for better developer experience:**

| Test Type | Location | Purpose | Framework | Count |
|-----------|----------|---------|-----------|-------|
| **Unit Tests** | `test-unit/` | Test individual components in isolation | pytest | 5 files |
| **Integration Tests** | `test-integration/` | Test complete workflows end-to-end | Standalone scripts | 16 files |

**Benefits of this structure:**

- ✅ **Clear separation** between unit and integration testing
- ✅ **Intuitive naming** - developers immediately understand test types
- ✅ **Better tooling** - can run different test types independently
- ✅ **Easier maintenance** - organized by testing strategy

> **Note**: Previously tests were mixed in `tests/` folder and root-level `test_*.py` files. The new structure provides much better organization.

### Unit Tests Details (`test-unit/`)

**What they test:** Individual components with mocking and isolation

- `test_crossword_generator.py` - Core crossword generation logic
- `test_api_routes.py` - FastAPI endpoint handlers
- `test_crossword_generator_wrapper.py` - Service wrapper layer
- `test_index_bug_fix.py` - Specific bug fix validations
- `test_vector_search.py` - AI vector search functionality (requires torch)

### Run Unit Tests (Formal Test Suite)

```bash
# Run all unit tests
python run_tests.py

# Run specific test modules
python run_tests.py crossword_generator
pytest test-unit/test_crossword_generator.py -v

# Run core tests (excluding AI dependencies)
pytest test-unit/ -v --ignore=test-unit/test_vector_search.py

# Run individual unit test classes
pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v
```

### Integration Tests Details (`test-integration/`)

**What they test:** Complete workflows without mocking - real functionality

- `test_simple_generation.py` - End-to-end crossword generation
- `test_boundary_fix.py` - Word boundary validation (our major fix!)
- `test_local.py` - Local environment and dependencies
- `test_word_boundaries.py` - Comprehensive boundary testing
- `test_bounds_comprehensive.py` - Advanced bounds checking
- `test_final_validation.py` - API integration testing
- And 10 more specialized feature tests...

### Run Integration Tests (End-to-End Scripts)

```bash
# Test core functionality
python test-integration/test_simple_generation.py
python test-integration/test_boundary_fix.py
python test-integration/test_local.py

# Test specific features
python test-integration/test_word_boundaries.py
python test-integration/test_bounds_comprehensive.py

# Test API integration
python test-integration/test_final_validation.py
```

### Test Coverage

```bash
# Run core tests with coverage (requires requirements-dev.txt)
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html
pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term

# Full coverage report (may fail without AI dependencies)
pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py
```

### Test Status

- ✅ **Core crossword generation**: 15/19 unit tests passing
- ✅ **Boundary validation**: All integration tests passing
- ⚠️ **AI/Vector search**: Requires torch dependencies
- ⚠️ **Some async mocking**: Minor test infrastructure issues

### 🔄 Migration Guide (For Existing Developers)

**If you had previous commands, update them:**

| Old Command | New Command |
|-------------|-------------|
| `pytest tests/` | `pytest test-unit/` |
| `python test_simple_generation.py` | `python test-integration/test_simple_generation.py` |
| `pytest tests/ --cov=src` | `pytest test-unit/ --cov=src` |

**All functionality is preserved** - just organized better!
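
To make the "mocking and isolation" style of the unit tests concrete, here is a minimal sketch of a pytest-style test that replaces the embedding model with a mock, so nothing has to be downloaded. `TopicWordFinder` and `best_match` are hypothetical stand-ins invented for this illustration and are not taken from `src/`; only `unittest.mock.Mock` is a real library API.

```python
from unittest.mock import Mock


class TopicWordFinder:
    """Hypothetical stand-in for a service that depends on an embedding model."""

    def __init__(self, model):
        self.model = model  # anything with an encode(text) -> vector method

    def best_match(self, topic, vocabulary):
        """Return the vocabulary word whose vector has the largest dot product with the topic."""
        topic_vec = self.model.encode(topic)

        def score(word):
            return sum(a * b for a, b in zip(topic_vec, self.model.encode(word)))

        return max(vocabulary, key=score)


def test_best_match_uses_mocked_model():
    # Hand-written 2-dimensional vectors replace real model output.
    fake_vectors = {
        "animals": [1.0, 0.0],
        "tiger": [0.9, 0.1],
        "piano": [0.1, 0.9],
    }
    model = Mock()
    model.encode.side_effect = lambda text: fake_vectors[text]

    finder = TopicWordFinder(model)

    assert finder.best_match("animals", ["piano", "tiger"]) == "tiger"
    # One encode call for the topic plus one per vocabulary word.
    assert model.encode.call_count == 3
```

A file like this placed under `test-unit/` would be collected by `pytest test-unit/ -v`. An integration script, by contrast, would load the real model and exercise the full workflow.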
## 🔧 Configuration

### Environment Variables

The backend supports flexible configuration via environment variables:

```bash
# Cache Configuration
CACHE_DIR=/app/cache                    # Cache directory for all service files
THEMATIC_VOCAB_SIZE_LIMIT=50000         # Maximum vocabulary size (default: 100000)
THEMATIC_MODEL_NAME=all-mpnet-base-v2   # Sentence transformer model

# Core Application Settings
PORT=7860                               # Server port
NODE_ENV=production                     # Environment mode

# Optional
LOG_LEVEL=INFO                          # Logging level
```

### Cache Structure

The service creates the following cache files:

```
{CACHE_DIR}/
├── vocabulary_{size}.pkl           # Processed vocabulary words
├── frequencies_{size}.pkl          # Word frequency data
├── embeddings_{model}_{size}.npy   # Word embeddings
└── sentence-transformers/          # Hugging Face model cache
```

## 🎯 Thematic Word Generation Process

1. **Initialization**:
   - Load WordFreq vocabulary database (319K words)
   - Filter words for crossword suitability (length, content)
   - Load sentence-transformers model locally
   - Pre-compute embeddings for filtered vocabulary
   - Create 10-tier frequency classification system

2. **Word Generation**:
   - Get topic embedding: `"Animals" → [768-dim vector]`
   - Compute cosine similarity with all vocabulary embeddings
   - Filter by similarity threshold and difficulty tier
   - Filter by crossword-specific criteria (length, etc.)
   - Return top matches with generated clues

3. **Multi-Theme Support**:
   - Detect multiple themes using clustering
   - Generate words that relate to combined themes
   - Balance word selection across different topics

## 🧪 Testing

```bash
# Local testing (without full vector search)
cd backend-py
python test-integration/test_local.py

# Start development server
python app.py
```

## 🐳 Container Deployment

### Docker Run with Cache Configuration

```bash
# Basic deployment
docker run -e CACHE_DIR=/app/cache \
  -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \
  -v /host/cache:/app/cache \
  -p 7860:7860 \
  your-crossword-app

# With all configuration options
docker run -e CACHE_DIR=/app/cache \
  -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \
  -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \
  -e NODE_ENV=production \
  -v /host/cache:/app/cache \
  -p 7860:7860 \
  your-crossword-app
```

### Docker Compose

```yaml
version: '3.8'
services:
  crossword-backend:
    image: your-crossword-app
    environment:
      - CACHE_DIR=/app/cache
      - THEMATIC_VOCAB_SIZE_LIMIT=50000
      - THEMATIC_MODEL_NAME=all-mpnet-base-v2
      - NODE_ENV=production
    volumes:
      - ./cache:/app/cache
    ports:
      - "7860:7860"
    restart: unless-stopped
```

### Pre-built Cache Strategy (Recommended)

For production deployments, pre-build the cache to avoid long startup times:

```bash
# 1. Build cache locally or in a build container
export CACHE_DIR=/local/cache
export THEMATIC_VOCAB_SIZE_LIMIT=50000
python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()"

# 2. Deploy with pre-built cache (read-only mount)
docker run -e CACHE_DIR=/app/cache \
  -v /local/cache:/app/cache:ro \
  -p 7860:7860 \
  your-crossword-app
```

### Debugging Cache Issues

If cache files are not being created in your container:

1. **Check Health Endpoints:**

   ```bash
   # Basic health check
   curl http://localhost:7860/api/health

   # Detailed cache status
   curl http://localhost:7860/api/health/cache

   # Force cache re-initialization
   curl -X POST http://localhost:7860/api/health/cache/reinitialize
   ```

2. **Check Container Logs:**

   ```bash
   docker logs your-container-name
   ```

   Look for cache directory permissions and initialization messages.

3. **Test Cache Directory:**

   ```bash
   # Run test script to verify cache setup
   docker exec your-container python test_cache_startup.py
   ```

4. **Common Issues:**
   - **Permission denied**: Container user can't write to mounted volume
   - **Missing dependencies**: ML libraries not installed in container
   - **Volume not mounted**: Cache directory not properly mounted
   - **Environment variables**: `CACHE_DIR` not set correctly

5. **Fix Permission Issues:**

   ```bash
   # Option 1: Change ownership of host directory
   sudo chown -R 1000:1000 /host/cache

   # Option 2: Run container with specific user
   docker run --user 1000:1000 ...

   # Option 3: Set permissions in the Dockerfile (this line is a Dockerfile instruction, not a shell command)
   RUN mkdir -p /app/cache && chmod 777 /app/cache
   ```

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: crossword-config
data:
  CACHE_DIR: "/app/cache"
  THEMATIC_VOCAB_SIZE_LIMIT: "50000"
  THEMATIC_MODEL_NAME: "all-mpnet-base-v2"
  NODE_ENV: "production"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: crossword-cache
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: crossword-backend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: crossword-backend
  template:
    metadata:
      labels:
        app: crossword-backend
    spec:
      containers:
        - name: backend
          image: your-crossword-app
          envFrom:
            - configMapRef:
                name: crossword-config
          volumeMounts:
            - name: cache-volume
              mountPath: /app/cache
          ports:
            - containerPort: 7860
      volumes:
        - name: cache-volume
          persistentVolumeClaim:
            claimName: crossword-cache
```

## 🧪 Testing

### Quick Test

```bash
# Basic functionality test (no model download)
python test-integration/test_local.py
```

### Comprehensive Unit Tests

```bash
# Run all unit tests
python run_tests.py

# Or use pytest directly
pytest test-unit/ -v

# Run specific test file
python run_tests.py crossword_generator_fixed
pytest test-unit/test_crossword_generator_fixed.py -v

# Run with coverage
pytest test-unit/ --cov=src --cov-report=html
```

### Test Structure

- `test-unit/test_crossword_generator_fixed.py` - Core grid generation logic
- `test-unit/test_vector_search.py` - Vector similarity search
- `test-unit/test_crossword_generator_wrapper.py` - Service wrapper
- `test-unit/test_api_routes.py` - FastAPI endpoints

### Key Test Features

- ✅ **Index alignment fix**: Tests the list index out of range bug fix
- ✅ **Mocked vector search**: Tests without downloading models
- ✅ **API validation**: Tests all endpoints and error cases
- ✅ **Async support**: Full pytest-asyncio integration
- ✅ **Error handling**: Tests malformed inputs and edge cases

## 📊 Performance Comparison

**Startup Time**:
- JavaScript: ~2 seconds
- Python: ~30-60 seconds (model download + embedding generation)
- Python (with cache): ~5-10 seconds

**Word Quality**:
- JavaScript: Limited by static word lists (~100 words/topic)
- Python: Rich thematic generation from 319K word database

**Memory Usage**:
- JavaScript: ~100MB
- Python: ~500MB-1GB (model + embeddings)
- Cache Size: ~50-200MB per 50K vocabulary

**API Response Time**:
- JavaScript: ~100ms (static word lookup)
- Python: ~200-500ms (semantic similarity computation)

**Cache Performance**:
- Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation
- Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation

## 🔄 Migration Strategy

1. **Phase 1** ✅: Basic Python backend structure
2. **Phase 2**: Test vector search functionality
3. **Phase 3**: Docker deployment and production testing
4. **Phase 4**: Compare with JavaScript backend
5. **Phase 5**: Production switch with rollback capability

## 🎯 Next Steps

- [x] Replace vector search with thematic word generation
- [x] Implement environment variable cache configuration
- [x] Add 10-tier difficulty system based on word frequency
- [ ] Optimize embedding computation performance
- [ ] Add more sophisticated crossword grid generation
- [ ] Implement LLM-based clue generation
- [ ] Add cache warming strategies for production deployment
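
## Appendix: Word Selection Sketch

The word generation step described under "Thematic Word Generation Process" (topic embedding, cosine similarity, tier and length filters, top matches) can be sketched with NumPy alone. This is an illustrative sketch under stated assumptions, not the code in `thematic_word_service.py`: the function names, the equal-sized rank buckets used for the 10 tiers, the thresholds, and the toy 2-dimensional vectors are all hypothetical. In the actual service the vectors would presumably come from the configured sentence-transformers model.

```python
import numpy as np


def cosine_similarities(topic_vec, vocab_matrix):
    """Cosine similarity between one topic vector and each row of vocab_matrix.

    Assumes no vector is all zeros.
    """
    topic_unit = topic_vec / np.linalg.norm(topic_vec)
    row_norms = np.linalg.norm(vocab_matrix, axis=1, keepdims=True)
    return (vocab_matrix / row_norms) @ topic_unit


def frequency_tiers(frequencies, n_tiers=10):
    """Map each word to a tier from 1 (most frequent) to n_tiers (rarest).

    Assumption: tiers are equal-sized buckets over the frequency ranking.
    """
    order = np.argsort(-frequencies, kind="stable")  # indices, most frequent first
    ranks = np.empty(len(frequencies), dtype=int)
    ranks[order] = np.arange(len(frequencies))
    return ranks * n_tiers // len(frequencies) + 1


def select_words(words, embeddings, frequencies, topic_vec, allowed_tiers,
                 min_similarity=0.3, min_len=3, max_len=12, top_k=5):
    """Return up to top_k (word, similarity, tier) tuples, best match first."""
    sims = cosine_similarities(topic_vec, embeddings)
    tiers = frequency_tiers(frequencies)
    candidates = [
        (words[i], float(sims[i]), int(tiers[i]))
        for i in range(len(words))
        if sims[i] >= min_similarity
        and int(tiers[i]) in allowed_tiers
        and min_len <= len(words[i]) <= max_len
    ]
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:top_k]


if __name__ == "__main__":
    # Toy data standing in for model embeddings and WordFreq frequencies.
    words = ["cat", "dog", "car", "ox"]
    embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [1.0, 0.0]])
    frequencies = np.array([0.4, 0.3, 0.2, 0.1])
    topic = np.array([1.0, 0.0])
    for word, sim, tier in select_words(words, embeddings, frequencies, topic, {1, 2, 3}):
        print(f"{word}: similarity={sim:.3f}, tier={tier}")
```

Restricting `allowed_tiers` to low numbers keeps only common words, which is one plausible way a frequency tier could map to puzzle difficulty.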