# Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

## Required Variables

### Core Application Settings

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```

### Cache Configuration

```env
CACHE_DIR=/app/cache
```

### AI/ML Model Configuration

```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```

## Optional Variables (with defaults)

### Word Selection & Quality Control

```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```

### Multi-Topic Intersection Configuration

```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```

### Distribution Normalization (Experimental)

```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```

### Debug & Development

```env
ENABLE_DEBUG_TAB=false
```

## Variable Explanations

### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments

### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence transformer model for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Affects the quality vs. performance trade-off

### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety, but more memory usage
- The Norvig vocabulary contains ~100K words

### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses the Norvig word frequency dataset

### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top
similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0

### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs. frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection

### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics

### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for the soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true

### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in the frontend
- Displays the word selection process and parameters
- Useful for development and analysis

### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Generally disabled for better semantic authenticity
- See docs/distribution_normalization_analysis.md

## Recommended HF Spaces Configuration

**Minimal Setup (Core functionality):**

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```

**Optimized Setup (Better performance & debugging):**

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```

## Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:

- `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with old vector search)
- `USE_AI_WORDS` (always
true now)
- `FALLBACK_TO_STATIC` (no static fallback in the current system)
- `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
- `MAX_CACHED_WORDS` (deprecated with old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with old word tracking)

## Performance Notes

- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for the full model cache (vocabulary, embeddings, models)

## Troubleshooting

**If puzzle generation fails:**
1. Check that CACHE_DIR is writable and has sufficient space
2. Monitor startup logs for cache creation progress
3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive

**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT for frequency-based selection
3. Inspect the selection process via ENABLE_DEBUG_TAB=true

**If multi-topic queries return too few words:**
1. Enable SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try a different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

**If startup is too slow:**
1. Use the smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce the vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. The cache should speed up subsequent startups significantly
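## Appendix: How SOFT_MIN_BETA Shapes Multi-Topic Scoring

To build intuition for tuning SOFT_MIN_BETA, the soft-minimum intersection can be sketched in Python. This is a generic log-mean-exp formulation, not necessarily the backend's exact formula; it assumes each candidate word has one similarity score per topic, and shows why higher beta enforces a stricter intersection while lower beta drifts toward a permissive average.

```python
import math

def soft_minimum(scores, beta=10.0):
    """Smooth approximation of min() over per-topic similarity scores.

    As beta grows, the result approaches min(scores): a word must score
    well against EVERY topic to rank highly (strict intersection).
    As beta shrinks, the result approaches the plain average (permissive).
    """
    # log-mean-exp soft minimum: -(1/beta) * log(mean(exp(-beta * s)))
    m = min(scores)  # shift by the minimum for numerical stability
    total = sum(math.exp(-beta * (s - m)) for s in scores)
    return m - math.log(total / len(scores)) / beta

# A word moderately similar to BOTH topics outranks a word that is
# highly similar to only one of them:
both_topics = soft_minimum([0.8, 0.7], beta=10.0)   # high combined score
one_topic = soft_minimum([0.9, 0.1], beta=10.0)     # pulled down near 0.1
```

This illustrates the troubleshooting advice above: lowering beta (e.g. from 10.0 to 5.0) moves scores away from the hard minimum, so single-topic words are penalized less and more candidates survive the cut.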