# Environment Configuration for Hugging Face Spaces

This document lists the environment variables used by the crossword generator backend when it is deployed on Hugging Face Spaces.
## Required Variables

### Core Application Settings

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```

### Cache Configuration

```env
CACHE_DIR=/app/cache
```

### AI/ML Model Configuration

```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```
## Optional Variables (with defaults)

### Word Selection & Quality Control

```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```

### Multi-Topic Intersection Configuration

```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```

### Distribution Normalization (Experimental)

```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```

### Debug & Development

```env
ENABLE_DEBUG_TAB=false
```
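These settings are typically read once at startup. A minimal sketch of how a backend might consume them via `os.environ` with the documented defaults (the helper names `env_float`/`env_bool` are illustrative, not the actual backend API):

```python
import os

def env_float(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to a default."""
    return float(os.environ.get(name, default))

def env_bool(name: str, default: bool) -> bool:
    """Read a boolean setting; accepts 'true'/'false' in any case."""
    return os.environ.get(name, str(default)).lower() == "true"

# Defaults mirror the values documented above.
SIMILARITY_TEMPERATURE = env_float("SIMILARITY_TEMPERATURE", 0.2)
USE_SOFTMAX_SELECTION = env_bool("USE_SOFTMAX_SELECTION", True)
DIFFICULTY_WEIGHT = env_float("DIFFICULTY_WEIGHT", 0.5)
```

Centralizing the parsing like this keeps the defaults in one place, so the listings above and the running code cannot silently drift apart.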
## Variable Explanations

### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments
### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence-transformer model used for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Controls the quality vs. performance trade-off

### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher values give more word variety but use more memory
- The Norvig vocabulary contains ~100K words

### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses the Norvig word frequency dataset
### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (favors the highest-similarity words)
- Higher = more random selection from among similar words
- Range: 0.1-2.0
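The temperature enters through softmax-weighted sampling (cf. USE_SOFTMAX_SELECTION). A sketch of the idea, assuming a standard softmax over similarity scores (not the backend's exact implementation):

```python
import math
import random

def softmax_sample(words, similarities, temperature=0.2, k=5, seed=None):
    """Sample k words with probability proportional to exp(similarity / T).

    Low temperature concentrates probability mass on the top-similarity
    words; high temperature flattens the distribution toward uniform.
    """
    rng = random.Random(seed)
    m = max(similarities)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((s - m) / temperature) for s in similarities]
    return rng.choices(words, weights=weights, k=k)
```

With `temperature=0.1` the sampler behaves almost like a deterministic top-N pick; with `temperature=2.0` even weakly related words get a meaningful chance of selection.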
### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs. frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
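The 0.0-1.0 range suggests a linear blend of the two signals; a minimal sketch of that interpretation (the backend's actual scoring may differ):

```python
def difficulty_score(similarity: float, frequency: float,
                     difficulty_weight: float = 0.5) -> float:
    """Blend semantic similarity and corpus frequency into one ranking score.

    difficulty_weight=0.0 ranks purely by similarity;
    difficulty_weight=1.0 ranks purely by frequency (common words first,
    which tends toward easier puzzles).
    """
    return (1.0 - difficulty_weight) * similarity + difficulty_weight * frequency
```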
### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for combining per-topic scores in multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum favors words relevant to ALL topics
### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for the soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
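One common formulation of a soft minimum is a log-sum-exp over negated scores; a sketch under that assumption (the backend's exact formula may differ):

```python
import math

def soft_minimum(scores, beta: float = 10.0) -> float:
    """Smooth approximation of min(scores).

    Computes -(1/beta) * log(mean(exp(-beta * s))). As beta grows, the
    result approaches the true minimum, so a word must score well against
    EVERY topic to rank highly; small beta behaves more like an average.
    """
    n = len(scores)
    m = min(scores)  # shift for numerical stability
    total = sum(math.exp(-beta * (s - m)) for s in scores)
    return m - math.log(total / n) / beta
```

This is also why SOFT_MIN_BETA_DECAY helps when too few words survive: each retry multiplies beta by the decay factor, relaxing the intersection toward an average until SOFT_MIN_MIN_WORDS can be met.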
### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in the frontend
- Displays the word selection process and parameters
- Useful for development and analysis

### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Disabled by default to preserve the raw semantic similarity scores
- See docs/distribution_normalization_analysis.md
## Recommended HF Spaces Configuration

**Minimal Setup (Core functionality):**

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```

**Optimized Setup (Better performance & debugging):**

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```
## Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:

- `EMBEDDING_MODEL` (replaced by `THEMATIC_MODEL_NAME`)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with the old vector search)
- `USE_AI_WORDS` (always true now)
- `FALLBACK_TO_STATIC` (no static fallback in the current system)
- `SEARCH_RANDOMNESS` (replaced by `SIMILARITY_TEMPERATURE`)
- `MAX_CACHED_WORDS` (deprecated with the old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with the old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with the old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with the old word tracking)
## Performance Notes

- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for the full model cache (vocabulary, embeddings, models)
## Troubleshooting

**If puzzle generation fails:**
1. Check that CACHE_DIR is writable and has sufficient space
2. Monitor the startup logs for cache creation progress
3. Verify that THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive

**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT to favor frequency-based selection
3. Inspect the selection process with ENABLE_DEBUG_TAB=true

**If multi-topic queries return too few words:**
1. Set SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try a different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

**If startup is too slow:**
1. Use a smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce the vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. The cache should speed up subsequent startups significantly