# Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.
## Required Variables

### Core Application Settings

```
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```

### Cache Configuration

```
CACHE_DIR=/app/cache
```

### AI/ML Model Configuration

```
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```
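A Python sketch of how a backend might read these variables with the documented defaults (the `load_settings` helper is illustrative, not the actual backend API):

```python
import os

def load_settings() -> dict:
    """Read the documented environment variables, falling back to the
    defaults listed in this document when a variable is unset."""
    return {
        "port": int(os.getenv("PORT", "7860")),
        "cache_dir": os.getenv("CACHE_DIR", "/app/cache"),
        "model_name": os.getenv("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        "vocab_limit": int(os.getenv("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        "vocab_source": os.getenv("VOCAB_SOURCE", "norvig"),
    }

settings = load_settings()
```

Casting numeric values with `int(...)` at startup surfaces a misconfigured variable immediately rather than mid-request.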
## Optional Variables (with defaults)

### Word Selection & Quality Control

```
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```
### Multi-Topic Intersection Configuration

```
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```
### Distribution Normalization (Experimental)

```
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```

### Debug & Development

```
ENABLE_DEBUG_TAB=false
```
## Variable Explanations

### CACHE_DIR (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments
### THEMATIC_MODEL_NAME (Default: all-mpnet-base-v2)
- Sentence transformer model for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Affects the quality vs. performance trade-off

### THEMATIC_VOCAB_SIZE_LIMIT (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety but more memory usage
- The Norvig vocabulary contains ~100K words

### VOCAB_SOURCE (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses the Norvig word frequency dataset
### SIMILARITY_TEMPERATURE (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top-similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0
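Given USE_SOFTMAX_SELECTION=true, a plausible reading is temperature-scaled softmax sampling over similarity scores. A minimal sketch under that assumption (function and variable names are illustrative, not the backend's actual API):

```python
import math
import random

def softmax_sample(words, similarities, temperature=0.2):
    """Sample one word; a lower temperature concentrates probability
    mass on the highest-similarity candidates."""
    m = max(similarities)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in similarities]
    return random.choices(words, weights=weights, k=1)[0]

words = ["ocean", "tide", "pebble"]
sims = [0.9, 0.7, 0.3]
pick = softmax_sample(words, sims, temperature=0.2)
```

At temperature 0.2 a similarity gap of 0.2 becomes a logit gap of 1.0, so top words dominate without excluding close runners-up; near the 2.0 end of the range the choice approaches uniform.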
### DIFFICULTY_WEIGHT (Default: 0.5)
- Balances similarity vs. frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
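The 0.0/1.0 endpoints above suggest a linear blend. A sketch under that assumption (the scoring function is hypothetical, with similarity and frequency both normalized to [0, 1]):

```python
def blended_score(similarity: float, frequency: float,
                  difficulty_weight: float = 0.5) -> float:
    """Linear blend matching the documented endpoints:
    0.0 = pure similarity, 1.0 = pure frequency."""
    return (1.0 - difficulty_weight) * similarity + difficulty_weight * frequency

# A rare but on-theme word (high similarity, low frequency) scores high
# at weight 0.0 and low at weight 1.0, shifting selection toward common
# words as the weight grows.
```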
### MULTI_TOPIC_METHOD (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics
### SOFT_MIN_BETA (Default: 10.0)
- Beta parameter for the soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
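The backend's exact formula isn't shown here, but a standard log-sum-exp soft minimum illustrates why a higher beta is stricter (hypothetical sketch, not the backend's implementation):

```python
import math

def soft_minimum(similarities, beta=10.0):
    """Smooth approximation of min(): as beta grows, the result
    approaches the word's worst per-topic similarity, so a word
    weak on any single topic scores poorly."""
    n = len(similarities)
    return -math.log(sum(math.exp(-beta * s) for s in similarities) / n) / beta
```

For per-topic similarities [0.8, 0.3], beta=10 yields roughly 0.37 (near the minimum of 0.3), while beta=1 lands near 0.52, much closer to the mean of 0.55; raising beta therefore tightens the "relevant to ALL topics" requirement.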
### ENABLE_DEBUG_TAB (Default: false)
- Shows debug information in the frontend
- Displays the word selection process and parameters
- Useful for development and analysis

### ENABLE_DISTRIBUTION_NORMALIZATION (Default: false)
- Experimental feature for normalizing similarity distributions
- Generally disabled for better semantic authenticity
- See docs/distribution_normalization_analysis.md
## Recommended HF Spaces Configuration

### Minimal Setup (core functionality)

```
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```
### Optimized Setup (better performance & debugging)

```
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```
## Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:

- EMBEDDING_MODEL (replaced by THEMATIC_MODEL_NAME)
- WORD_SIMILARITY_THRESHOLD (deprecated with old vector search)
- USE_AI_WORDS (always true now)
- FALLBACK_TO_STATIC (no static fallback in current system)
- SEARCH_RANDOMNESS (replaced by SIMILARITY_TEMPERATURE)
- MAX_CACHED_WORDS (deprecated with old caching)
- CACHE_EXPIRY_HOURS (deprecated with old caching)
- USE_HIERARCHICAL_SEARCH (deprecated with old vector search)
- MAX_USED_WORDS_MEMORY (deprecated with old word tracking)
## Performance Notes
- Startup Time: ~30-60 seconds (model download + cache creation)
- Memory Usage: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- Response Time: ~200-500ms (word generation + clue creation + grid fitting)
- Disk Usage: ~500MB for full model cache (vocabulary, embeddings, models)
## Troubleshooting

### If puzzle generation fails
- Check that CACHE_DIR is writable and has sufficient free space
- Monitor startup logs for cache creation progress
- Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
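A quick shell check for the first two items above (the CACHE_DIR path is the one documented here):

```shell
# Health check for the cache directory.
CACHE_DIR="${CACHE_DIR:-/app/cache}"
if mkdir -p "$CACHE_DIR" 2>/dev/null && [ -w "$CACHE_DIR" ]; then
    echo "cache dir OK: $CACHE_DIR"
    df -h "$CACHE_DIR"    # confirm free space for the ~500MB model cache
else
    echo "cache dir NOT writable: $CACHE_DIR"
fi
```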
### If words seem too random
- Lower SIMILARITY_TEMPERATURE (try 0.1)
- Increase DIFFICULTY_WEIGHT for frequency-based selection
- Check debug tab with ENABLE_DEBUG_TAB=true
### If multi-topic queries return too few words
- Enable SOFT_MIN_ADAPTIVE=true for automatic beta adjustment
- Lower SOFT_MIN_BETA manually (try 5.0)
- Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)
### If startup is too slow
- Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
- Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
- Cache should speed up subsequent startups significantly