# Environment Configuration for Hugging Face Spaces
This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.
## Required Variables
### Core Application Settings
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```
### Cache Configuration
```env
CACHE_DIR=/app/cache
```
### AI/ML Model Configuration
```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```
## Optional Variables (with defaults)
### Word Selection & Quality Control
```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```
### Multi-Topic Intersection Configuration
```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```
### Distribution Normalization (Experimental)
```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```
### Debug & Development
```env
ENABLE_DEBUG_TAB=false
```
## Variable Explanations
### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments
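How the backend resolves this directory is not shown here; as a minimal sketch (the fallback path is assumed from the recommended setup below, not taken from the actual code), it amounts to:

```python
import os
from pathlib import Path

# Illustrative only: resolve CACHE_DIR with a fallback and ensure the
# directory exists before any model download or embedding cache is written.
cache_dir = Path(os.environ.get("CACHE_DIR", "/app/cache"))
cache_dir.mkdir(parents=True, exist_ok=True)
```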
### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence transformer model for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Affects quality vs performance trade-off
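For illustration, loading the configured model with the sentence-transformers library might look like the following (the environment-variable handling is an assumption, not the backend's actual code):

```python
import os
from sentence_transformers import SentenceTransformer

# Load the configured model, caching its weights under CACHE_DIR so
# subsequent startups skip the download.
model_name = os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2")
model = SentenceTransformer(model_name, cache_folder=os.environ.get("CACHE_DIR", "/app/cache"))

# all-mpnet-base-v2 yields 768-dimensional embeddings; all-MiniLM-L6-v2 yields 384.
embeddings = model.encode(["ocean", "tide", "crossword"])
```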
### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety, more memory usage
- Norvig vocabulary contains ~100K words
### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses Norvig word frequency dataset
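Assuming the dataset is Norvig's tab-separated word/count list (e.g. count_1w100k.txt from norvig.com/ngrams; the exact file and path the backend uses are assumptions), loading it is a single pass that also honors the size limit:

```python
import os

# Sketch: parse "WORD<TAB>COUNT" lines into a frequency dict, stopping once
# THEMATIC_VOCAB_SIZE_LIMIT entries have been read.
limit = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
frequencies = {}
with open("count_1w100k.txt", encoding="utf-8") as f:  # assumed filename
    for line in f:
        word, count = line.strip().split("\t")
        frequencies[word.lower()] = int(count)
        if len(frequencies) >= limit:
            break
```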
### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0
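Together with USE_SOFTMAX_SELECTION, this suggests softmax sampling over similarity scores. A minimal sketch of the technique (the function name and signature are hypothetical, not the backend's API):

```python
import numpy as np

def sample_words(words, similarities, k, temperature=0.2, rng=None):
    """Sample k distinct words with probability softmax(similarity / temperature).
    Low temperature concentrates mass on the top-similarity words; high
    temperature flattens the distribution toward uniform."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(similarities, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    idx = rng.choice(len(words), size=k, replace=False, p=probs)
    return [words[i] for i in idx]
```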
### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
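The 0.0/1.0 endpoints imply a linear blend; a plausible form (the exact formula is an assumption):

```python
def difficulty_score(similarity, frequency, weight=0.5):
    """Blend topic similarity with normalized corpus frequency (both in [0, 1]).
    weight=0.0 ranks purely by similarity; weight=1.0 purely by frequency."""
    return (1.0 - weight) * similarity + weight * frequency
```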
### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics
### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
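A standard soft-minimum formulation consistent with this description is `softmin(s) = -(1/beta) * log(mean(exp(-beta * s)))`, which always lies between min(s) and mean(s) and approaches the true minimum as beta grows (whether the backend uses exactly this variant is an assumption):

```python
import numpy as np

def soft_minimum(scores, beta=10.0):
    """Smooth approximation of min(scores): the larger beta is, the more a
    single weak per-topic score drags the combined score down."""
    scores = np.asarray(scores, dtype=float)
    return -np.log(np.mean(np.exp(-beta * scores))) / beta

print(round(soft_minimum([0.80, 0.70, 0.75]), 2))  # 0.74 -- strong on all topics
print(round(soft_minimum([0.80, 0.10, 0.75]), 2))  # 0.21 -- one weak topic dominates
```

With SOFT_MIN_ADAPTIVE=true, the remaining knobs suggest a retry loop: if fewer than SOFT_MIN_MIN_WORDS candidates qualify, beta is presumably multiplied by SOFT_MIN_BETA_DECAY and selection is retried, up to SOFT_MIN_MAX_RETRIES times.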
### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in frontend
- Displays word selection process and parameters
- Useful for development and analysis
### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Disabled by default: rescaling can mask genuine differences in semantic similarity
- See docs/distribution_normalization_analysis.md
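One common reading of NORMALIZATION_METHOD=similarity_range is per-batch min-max rescaling (an assumption here; the analysis doc above describes the project's actual method):

```python
import numpy as np

def normalize_similarity_range(scores):
    """Min-max rescale a batch of similarity scores into [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi - lo < 1e-9:  # degenerate batch: all scores identical
        return np.ones_like(scores)
    return (scores - lo) / (hi - lo)
```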
## Recommended HF Spaces Configuration
**Minimal Setup (Core functionality):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```
**Optimized Setup (Better performance & debugging):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```
## Deprecated Variables (Safe to Remove)
These variables are no longer used and can be deleted from HF Spaces:
- `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with old vector search)
- `USE_AI_WORDS` (always true now)
- `FALLBACK_TO_STATIC` (no static fallback in current system)
- `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
- `MAX_CACHED_WORDS` (deprecated with old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with old word tracking)
## Performance Notes
- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for full model cache (vocabulary, embeddings, models)
## Troubleshooting
**If puzzle generation fails:**
1. Check CACHE_DIR is writable and has sufficient space
2. Monitor startup logs for cache creation progress
3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT for frequency-based selection
3. Check debug tab with ENABLE_DEBUG_TAB=true
**If multi-topic queries return too few words:**
1. Enable SOFT_MIN_ADAPTIVE=true for automatic beta adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)
**If startup is too slow:**
1. Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. Cache should speed up subsequent startups significantly