
	# Python Backend with Thematic AI Word Generation

	This is the Python implementation of the crossword generator backend, featuring AI-powered thematic word generation using WordFreq vocabulary and semantic embeddings.

	## 🚀 Features

	- Thematic Word Generation: Uses sentence-transformers for semantic word discovery from WordFreq vocabulary
	- 319K+ Word Database: Comprehensive vocabulary from WordFreq with frequency data
	- 10-Tier Difficulty System: Smart word selection based on frequency tiers
	- Environment Variable Configuration: Flexible cache and model configuration
	- FastAPI: Modern, fast Python web framework
	- Same API: Compatible with existing React frontend

	## 🔄 Differences from JavaScript Backend

	| Feature | JavaScript Backend | Python Backend |
	|---------|-------------------|----------------|
	| Word Generation | Static word lists | Thematic AI word generation from 319K vocabulary |
	| Vocabulary Size | ~100 words per topic | Filtered from 319K WordFreq database |
	| AI Approach | Basic filtering | Semantic similarity with frequency tiers |
	| Performance | Fast but limited | Slower startup, richer word selection |
	| Dependencies | Node.js + static files | Python + ML libraries |

	## 🛠️ Setup & Installation

	### Prerequisites
	- Python 3.11+ (3.11 recommended for Docker compatibility)
	- pip (Python package manager)

	### Basic Setup (Core Functionality)
	```bash
	# Clone and navigate to backend directory
	cd crossword-app/backend-py

	# Create virtual environment (recommended)
	python -m venv venv
	source venv/bin/activate # On Windows: venv\Scripts\activate

	# Install core dependencies
	pip install -r requirements.txt

	# Start the server
	python app.py
	```

	### Full Development Setup (with AI features)
	```bash
	# Install development dependencies including AI/ML libraries
	pip install -r requirements-dev.txt

	# This includes:
	# - All core dependencies
	# - AI/ML libraries (torch, sentence-transformers, etc.)
	# - Development tools (pytest, coverage, etc.)
	```

	### Requirements Files
	- `requirements.txt`: Core dependencies for basic functionality
	- `requirements-dev.txt`: Full development environment with AI features

	> Note: The AI/ML dependencies are large (~2GB). For basic testing without AI features, use `requirements.txt` only.

	> Python Version: Both local development and Docker use Python 3.11+ for optimal performance and latest package compatibility.

	## 📁 Structure

	```
	backend-py/
	├── app.py                                  # FastAPI application entry point
	├── requirements.txt                        # Core Python dependencies
	├── requirements-dev.txt                    # Full development dependencies
	├── src/
	│   ├── services/
	│   │   ├── thematic_word_service.py        # Thematic AI word generation
	│   │   ├── crossword_generator.py          # Puzzle generation logic
	│   │   └── crossword_generator_wrapper.py  # Service wrapper
	│   └── routes/
	│       └── api.py                          # API endpoints (matches JS backend)
	├── test-unit/                              # Unit tests (pytest framework) - 5 files
	│   ├── test_crossword_generator.py
	│   ├── test_api_routes.py
	│   └── test_vector_search.py
	├── test-integration/                       # Integration tests (standalone scripts) - 16 files
	│   ├── test_simple_generation.py
	│   ├── test_boundary_fix.py
	│   └── test_local.py                       # (+ 13 more test files)
	├── data/ -> ../backend/data/               # Symlink to shared word data
	└── public/                                 # Frontend static files (copied during build)
	```

	## 🛠 Dependencies

	### Core ML Stack
	- `sentence-transformers`: Local model loading and embeddings
	- `wordfreq`: 319K word vocabulary with frequency data
	- `torch`: PyTorch for model inference
	- `scikit-learn`: Cosine similarity and clustering
	- `numpy`: Vector operations

	### Web Framework
	- `fastapi`: Modern Python web framework
	- `uvicorn`: ASGI server
	- `pydantic`: Data validation

	### Testing
	- `pytest`: Testing framework
	- `pytest-asyncio`: Async test support

	## 🧪 Testing

	### 📁 Test Organization (Reorganized for Clarity)

	We've reorganized the test structure for better developer experience:

	| Test Type | Location | Purpose | Framework | Count |
	|-----------|----------|---------|-----------|-------|
	| Unit Tests | `test-unit/` | Test individual components in isolation | pytest | 5 files |
	| Integration Tests | `test-integration/` | Test complete workflows end-to-end | Standalone scripts | 16 files |

	Benefits of this structure:
	- ✅ Clear separation between unit and integration testing
	- ✅ Intuitive naming - developers immediately understand test types
	- ✅ Better tooling - can run different test types independently
	- ✅ Easier maintenance - organized by testing strategy

	> Note: Previously, tests were split between the `tests/` folder and root-level `test_*.py` files. The new structure groups them by test type.

	### Unit Tests Details (`test-unit/`)

	What they test: Individual components with mocking and isolation
	- `test_crossword_generator.py` - Core crossword generation logic
	- `test_api_routes.py` - FastAPI endpoint handlers
	- `test_crossword_generator_wrapper.py` - Service wrapper layer
	- `test_index_bug_fix.py` - Specific bug fix validations
	- `test_vector_search.py` - AI vector search functionality (requires torch)

	### Run Unit Tests (Formal Test Suite)
	```bash
	# Run all unit tests
	python run_tests.py

	# Run specific test modules
	python run_tests.py crossword_generator
	pytest test-unit/test_crossword_generator.py -v

	# Run core tests (excluding AI dependencies)
	pytest test-unit/ -v --ignore=test-unit/test_vector_search.py

	# Run individual unit test classes
	pytest test-unit/test_crossword_generator.py::TestCrosswordGenerator::test_init -v
	```

	### Integration Tests Details (`test-integration/`)

	What they test: Complete workflows without mocking - real functionality
	- `test_simple_generation.py` - End-to-end crossword generation
	- `test_boundary_fix.py` - Word boundary validation (our major fix!)
	- `test_local.py` - Local environment and dependencies
	- `test_word_boundaries.py` - Comprehensive boundary testing
	- `test_bounds_comprehensive.py` - Advanced bounds checking
	- `test_final_validation.py` - API integration testing
	- And 10 more specialized feature tests...

	### Run Integration Tests (End-to-End Scripts)
	```bash
	# Test core functionality
	python test-integration/test_simple_generation.py
	python test-integration/test_boundary_fix.py
	python test-integration/test_local.py

	# Test specific features
	python test-integration/test_word_boundaries.py
	python test-integration/test_bounds_comprehensive.py

	# Test API integration
	python test-integration/test_final_validation.py
	```

	### Test Coverage
	```bash
	# Run core tests with coverage (requires requirements-dev.txt)
	pytest test-unit/test_crossword_generator.py --cov=src --cov-report=html
	pytest test-unit/test_crossword_generator.py --cov=src --cov-report=term

	# Full coverage report (may fail without AI dependencies)
	pytest test-unit/ --cov=src --cov-report=html --ignore=test-unit/test_vector_search.py
	```

	### Test Status
	- ✅ Core crossword generation: 15/19 unit tests passing
	- ✅ Boundary validation: All integration tests passing
	- ⚠️ AI/Vector search: Requires torch dependencies
	- ⚠️ Some async mocking: Minor test infrastructure issues

	### 🔄 Migration Guide (For Existing Developers)

	If you had previous commands, update them:

	| Old Command | New Command |
	|-------------|-------------|
	| `pytest tests/` | `pytest test-unit/` |
	| `python test_simple_generation.py` | `python test-integration/test_simple_generation.py` |
	| `pytest tests/ --cov=src` | `pytest test-unit/ --cov=src` |

	All functionality is preserved - just organized better!

	## 🔧 Configuration

	### Environment Variables

	The backend supports flexible configuration via environment variables:

	```bash
	# Cache Configuration
	CACHE_DIR=/app/cache # Cache directory for all service files
	THEMATIC_VOCAB_SIZE_LIMIT=50000 # Maximum vocabulary size (default: 100000)
	THEMATIC_MODEL_NAME=all-mpnet-base-v2 # Sentence transformer model

	# Core Application Settings
	PORT=7860 # Server port
	NODE_ENV=production # Environment mode

	# Optional
	LOG_LEVEL=INFO # Logging level
	```
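As a rough illustration of how these variables might be consumed, the sketch below reads them into a small settings object. The variable names and the `100000` vocabulary default come from the block above; the `Settings` class, the `load_settings` helper, and the remaining fallback values are assumptions made for this example, not the backend's actual code.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the documented environment variables."""
    cache_dir: str
    vocab_size_limit: int
    model_name: str
    port: int
    log_level: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    return Settings(
        # Fallback path is an assumption; the README only shows /app/cache.
        cache_dir=env.get("CACHE_DIR", "./cache"),
        # Default of 100000 is taken from the comment in the block above.
        vocab_size_limit=int(env.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        # The fallbacks below reuse the example values and are assumptions.
        model_name=env.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        port=int(env.get("PORT", "7860")),
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
    )
```

Passing a plain dictionary instead of `os.environ` keeps the helper easy to exercise in unit tests without touching the real environment.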

	### Cache Structure

	The service creates the following cache files:

	```
	{CACHE_DIR}/
	├── vocabulary_{size}.pkl # Processed vocabulary words
	├── frequencies_{size}.pkl # Word frequency data
	├── embeddings_{model}_{size}.npy # Word embeddings
	└── sentence-transformers/ # Hugging Face model cache
	```
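The naming pattern above can be expressed as a small helper that derives the expected file locations for a given configuration. This is only a sketch: `cache_paths` and `missing_cache_files` are hypothetical names, and replacing `/` in model names is an assumption about how the `{model}` placeholder would be made filesystem-safe.

```python
from pathlib import Path


def cache_paths(cache_dir, model_name, vocab_size):
    """Return the expected cache locations for one configuration."""
    root = Path(cache_dir)
    # Assumption: model names such as "org/model" are flattened for file names.
    safe_model = model_name.replace("/", "_")
    return {
        "vocabulary": root / f"vocabulary_{vocab_size}.pkl",
        "frequencies": root / f"frequencies_{vocab_size}.pkl",
        "embeddings": root / f"embeddings_{safe_model}_{vocab_size}.npy",
        "model_cache": root / "sentence-transformers",
    }


def missing_cache_files(paths):
    """List the cache entries that do not exist on disk yet."""
    return [name for name, path in paths.items() if not path.exists()]
```

Because the vocabulary size and model name are part of the file names, changing `THEMATIC_VOCAB_SIZE_LIMIT` or `THEMATIC_MODEL_NAME` would point at a different set of files rather than reusing stale ones.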

	## 🎯 Thematic Word Generation Process

	1. Initialization:
	   - Load WordFreq vocabulary database (319K words)
	   - Filter words for crossword suitability (length, content)
	   - Load sentence-transformers model locally
	   - Pre-compute embeddings for filtered vocabulary
	   - Create 10-tier frequency classification system

	2. Word Generation:
	   - Get topic embedding: `"Animals" → [768-dim vector]`
	   - Compute cosine similarity with all vocabulary embeddings
	   - Filter by similarity threshold and difficulty tier
	   - Filter by crossword-specific criteria (length, etc.)
	   - Return top matches with generated clues

	3. Multi-Theme Support:
	   - Detect multiple themes using clustering
	   - Generate words that relate to combined themes
	   - Balance word selection across different topics
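The word-generation step can be sketched in plain Python as follows. This is a simplified illustration, not the service's implementation: it uses tiny hand-written vectors instead of 768-dimensional sentence-transformers embeddings, and the function names, the rank-based tier assignment, the similarity threshold, and the length limits are all assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_tiers(frequencies, n_tiers=10):
    """Map each word to a tier 1..n_tiers by frequency rank (1 = most common)."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    return {word: (i * n_tiers) // len(ranked) + 1 for i, word in enumerate(ranked)}


def thematic_words(topic_vec, embeddings, tiers, allowed_tiers,
                   min_similarity=0.3, min_len=3, max_len=15, top_k=10):
    """Return up to top_k (word, score) pairs, best match first."""
    scored = []
    for word, vec in embeddings.items():
        # Crossword-specific filter: alphabetic words within the length bounds.
        if not (word.isalpha() and min_len <= len(word) <= max_len):
            continue
        # Difficulty filter: keep only words in the requested frequency tiers.
        if tiers[word] not in allowed_tiers:
            continue
        score = cosine_similarity(topic_vec, vec)
        if score >= min_similarity:
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

In a real deployment the similarity computation would more likely be vectorized with `numpy` or `scikit-learn`, both of which are listed in the dependencies; the loop here only makes the filtering order explicit.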

	## 🧪 Testing

	```bash
	# Local testing (without full vector search)
	cd backend-py
	python test-integration/test_local.py

	# Start development server
	python app.py
	```

	## 🐳 Container Deployment

	### Docker Run with Cache Configuration

	```bash
	# Basic deployment
	docker run -e CACHE_DIR=/app/cache \
	  -e THEMATIC_VOCAB_SIZE_LIMIT=50000 \
	  -v /host/cache:/app/cache \
	  -p 7860:7860 \
	  your-crossword-app

	# With all configuration options
	docker run -e CACHE_DIR=/app/cache \
	  -e THEMATIC_VOCAB_SIZE_LIMIT=25000 \
	  -e THEMATIC_MODEL_NAME=all-mpnet-base-v2 \
	  -e NODE_ENV=production \
	  -v /host/cache:/app/cache \
	  -p 7860:7860 \
	  your-crossword-app
	```

	### Docker Compose

	```yaml
	version: '3.8'
	services:
	  crossword-backend:
	    image: your-crossword-app
	    environment:
	      - CACHE_DIR=/app/cache
	      - THEMATIC_VOCAB_SIZE_LIMIT=50000
	      - THEMATIC_MODEL_NAME=all-mpnet-base-v2
	      - NODE_ENV=production
	    volumes:
	      - ./cache:/app/cache
	    ports:
	      - "7860:7860"
	    restart: unless-stopped
	```

	### Pre-built Cache Strategy (Recommended)

	For production deployments, pre-build the cache to avoid long startup times:

	```bash
	# 1. Build cache locally or in a build container
	export CACHE_DIR=/local/cache
	export THEMATIC_VOCAB_SIZE_LIMIT=50000
	python -c "from src.services.thematic_word_service import ThematicWordService; s=ThematicWordService(); s.initialize()"

	# 2. Deploy with pre-built cache (read-only mount)
	docker run -e CACHE_DIR=/app/cache \
	  -v /local/cache:/app/cache:ro \
	  -p 7860:7860 \
	  your-crossword-app
	```

	### Debugging Cache Issues

	If cache files are not being created in your container:

	1. Check Health Endpoints:
	```bash
	# Basic health check
	curl http://localhost:7860/api/health

	# Detailed cache status
	curl http://localhost:7860/api/health/cache

	# Force cache re-initialization
	curl -X POST http://localhost:7860/api/health/cache/reinitialize
	```

	2. Check Container Logs:
	```bash
	docker logs your-container-name
	```
	Look for cache directory permissions and initialization messages.

	3. Test Cache Directory:
	```bash
	# Run test script to verify cache setup
	docker exec your-container python test_cache_startup.py
	```

	4. Common Issues:
	- Permission denied: Container user can't write to mounted volume
	- Missing dependencies: ML libraries not installed in container
	- Volume not mounted: Cache directory not properly mounted
	- Environment variables: `CACHE_DIR` not set correctly

	5. Fix Permission Issues:
	```bash
	# Option 1: Change ownership of host directory
	sudo chown -R 1000:1000 /host/cache

	# Option 2: Run container with specific user
	docker run --user 1000:1000 ...

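	# Option 3: Set permissions in the Dockerfile.
	# This is a Dockerfile instruction, not a shell command, and chmod 777
	# makes the directory world-writable:
	#   RUN mkdir -p /app/cache && chmod 777 /app/cache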
	```
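To confirm a permission problem from inside the container, a small check like the one below can help. It is a minimal sketch and not the repository's `test_cache_startup.py`; the `check_cache_dir` name and the returned messages are assumptions for illustration.

```python
import os
import tempfile


def check_cache_dir(cache_dir):
    """Return (ok, message) describing whether cache_dir is usable."""
    if not os.path.isdir(cache_dir):
        return False, f"{cache_dir} does not exist or is not a directory"
    try:
        # Creating a real file is more reliable than os.access(), which can
        # be misleading on read-only or network-backed mounts.
        with tempfile.NamedTemporaryFile(dir=cache_dir):
            pass
    except OSError as exc:
        return False, f"{cache_dir} is not writable: {exc}"
    return True, f"{cache_dir} is writable"
```

Calling `check_cache_dir(os.environ.get("CACHE_DIR", "/app/cache"))` would report a failure for a volume mounted with `:ro`, which is expected when the cache was pre-built and only needs to be read.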

	### Kubernetes Deployment

	```yaml
	apiVersion: v1
	kind: ConfigMap
	metadata:
	  name: crossword-config
	data:
	  CACHE_DIR: "/app/cache"
	  THEMATIC_VOCAB_SIZE_LIMIT: "50000"
	  THEMATIC_MODEL_NAME: "all-mpnet-base-v2"
	  NODE_ENV: "production"
	---
	apiVersion: v1
	kind: PersistentVolumeClaim
	metadata:
	  name: crossword-cache
	spec:
	  accessModes:
	    - ReadWriteOnce
	  resources:
	    requests:
	      storage: 5Gi
	---
	apiVersion: apps/v1
	kind: Deployment
	metadata:
	  name: crossword-backend
	spec:
	  replicas: 1
	  selector:
	    matchLabels:
	      app: crossword-backend
	  template:
	    metadata:
	      labels:
	        app: crossword-backend
	    spec:
	      containers:
	        - name: backend
	          image: your-crossword-app
	          envFrom:
	            - configMapRef:
	                name: crossword-config
	          volumeMounts:
	            - name: cache-volume
	              mountPath: /app/cache
	          ports:
	            - containerPort: 7860
	      volumes:
	        - name: cache-volume
	          persistentVolumeClaim:
	            claimName: crossword-cache
	```

	## 🧪 Testing

	### Quick Test
	```bash
	# Basic functionality test (no model download)
	python test-integration/test_local.py
	```

	### Comprehensive Unit Tests
	```bash
	# Run all unit tests
	python run_tests.py

	# Or use pytest directly
	pytest test-unit/ -v

	# Run specific test file
	python run_tests.py crossword_generator
	pytest test-unit/test_crossword_generator.py -v

	# Run with coverage
	pytest test-unit/ --cov=src --cov-report=html
	```

	### Test Structure
	- `test-unit/test_crossword_generator.py` - Core grid generation logic
	- `test-unit/test_vector_search.py` - Vector similarity search
	- `test-unit/test_crossword_generator_wrapper.py` - Service wrapper
	- `test-unit/test_api_routes.py` - FastAPI endpoints

	### Key Test Features
	- ✅ Index alignment fix: Tests the list index out of range bug fix
	- ✅ Mocked vector search: Tests without downloading models
	- ✅ API validation: Tests all endpoints and error cases
	- ✅ Async support: Full pytest-asyncio integration
	- ✅ Error handling: Tests malformed inputs and edge cases
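As an illustration of the mocking approach, the sketch below tests a word-selection step against a fake embedding service, so no model download is needed. `TopicWordFinder` and its `similar_words` dependency are hypothetical stand-ins invented for this example; they are not classes from this repository.

```python
from unittest.mock import Mock


class TopicWordFinder:
    """Hypothetical consumer of an embedding service (not from this repo)."""

    def __init__(self, embedder):
        self.embedder = embedder

    def find(self, topic, limit=5):
        # Keep alphabetic words only and upper-case them for the grid.
        words = self.embedder.similar_words(topic)
        return [w.upper() for w in words if w.isalpha()][:limit]


def test_find_uses_mocked_embedder():
    # The mock replaces the expensive model-backed service.
    embedder = Mock()
    embedder.similar_words.return_value = ["cat", "dog", "x-ray", "lion"]
    finder = TopicWordFinder(embedder)
    assert finder.find("Animals", limit=2) == ["CAT", "DOG"]
    embedder.similar_words.assert_called_once_with("Animals")
```

Injecting the dependency through the constructor is what makes the substitution possible; the same test would run under `pytest` unchanged.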

	## 📊 Performance Comparison

	Startup Time:
	- JavaScript: ~2 seconds
	- Python: ~30-60 seconds (model download + embedding generation)
	- Python (with cache): ~5-10 seconds

	Word Quality:
	- JavaScript: Limited by static word lists (~100 words/topic)
	- Python: Rich thematic generation from 319K word database

	Memory Usage:
	- JavaScript: ~100MB
	- Python: ~500MB-1GB (model + embeddings)
	- Cache Size: ~50-200MB per 50K vocabulary

	API Response Time:
	- JavaScript: ~100ms (static word lookup)
	- Python: ~200-500ms (semantic similarity computation)

	Cache Performance:
	- Vocabulary loading: ~1-2 seconds from cache vs 30+ seconds generation
	- Embeddings loading: ~2-5 seconds from cache vs 60+ seconds generation
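The gap between cached and uncached startup follows from a load-or-build pattern, sketched below with `pickle` as used for the `.pkl` cache files. The `load_or_build` helper and its temp-file-then-rename write are assumptions for illustration; the actual service may cache differently.

```python
import pickle
from pathlib import Path


def load_or_build(path, build):
    """Return the cached object at path, or build it, cache it, and return it."""
    path = Path(path)
    if path.exists():
        # Fast path: nothing is written, so a read-only mount works here.
        with path.open("rb") as fh:
            return pickle.load(fh)
    value = build()  # Slow path: e.g. filtering the vocabulary from scratch.
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(path.suffix + ".tmp")
    with tmp.open("wb") as fh:
        pickle.dump(value, fh)
    tmp.replace(path)  # Rename last so readers never see a partial file.
    return value
```

Only the first call for a given path pays the build cost; later calls, including those after a container restart with the same volume, take the fast path.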

	## 🔄 Migration Strategy

	1. Phase 1 ✅: Basic Python backend structure
	2. Phase 2: Test vector search functionality
	3. Phase 3: Docker deployment and production testing
	4. Phase 4: Compare with JavaScript backend
	5. Phase 5: Production switch with rollback capability

	## 🎯 Next Steps

	- [x] Replace vector search with thematic word generation
	- [x] Implement environment variable cache configuration
	- [x] Add 10-tier difficulty system based on word frequency
	- [ ] Optimize embedding computation performance
	- [ ] Add more sophisticated crossword grid generation
	- [ ] Implement LLM-based clue generation
	- [ ] Add cache warming strategies for production deployment