# Environment Configuration for Hugging Face Spaces
This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.
## Required Variables
### Core Application Settings
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```
### Cache Configuration
```env
CACHE_DIR=/app/cache
```
### AI/ML Model Configuration
```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```
## Optional Variables (with defaults)
### Word Selection & Quality Control
```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```
### Multi-Topic Intersection Configuration
```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```
### Distribution Normalization (Experimental)
```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```
### Debug & Development
```env
ENABLE_DEBUG_TAB=false
```
## Variable Explanations
### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments
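How the backend resolves this directory is not shown here; as a minimal sketch (the fallback path is assumed from the recommended setup below, not taken from the actual code), it amounts to:

```python
import os
from pathlib import Path

# Illustrative only: resolve CACHE_DIR with a fallback and ensure the
# directory exists before any model download or embedding cache is written.
cache_dir = Path(os.environ.get("CACHE_DIR", "/app/cache"))
cache_dir.mkdir(parents=True, exist_ok=True)
```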
### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence transformer model for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Affects quality vs performance trade-off
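For illustration, loading the configured model with the sentence-transformers library might look like the following (the environment-variable handling is an assumption, not the backend's actual code):

```python
import os
from sentence_transformers import SentenceTransformer

# Load the configured model, caching its weights under CACHE_DIR so
# subsequent startups skip the download.
model_name = os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2")
model = SentenceTransformer(model_name, cache_folder=os.environ.get("CACHE_DIR", "/app/cache"))

# all-mpnet-base-v2 yields 768-dimensional embeddings; all-MiniLM-L6-v2 yields 384.
embeddings = model.encode(["ocean", "tide", "crossword"])
```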
### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety, more memory usage
- Norvig vocabulary contains ~100K words
### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses Norvig word frequency dataset
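Assuming the dataset is Norvig's tab-separated word/count list (e.g. count_1w100k.txt from norvig.com/ngrams; the exact file and path the backend uses are assumptions), loading it is a single pass that also honors the size limit:

```python
import os

# Sketch: parse "WORD<TAB>COUNT" lines into a frequency dict, stopping once
# THEMATIC_VOCAB_SIZE_LIMIT entries have been read.
limit = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
frequencies = {}
with open("count_1w100k.txt", encoding="utf-8") as f:  # assumed filename
    for line in f:
        word, count = line.strip().split("\t")
        frequencies[word.lower()] = int(count)
        if len(frequencies) >= limit:
            break
```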
### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0
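Together with USE_SOFTMAX_SELECTION, this suggests softmax sampling over similarity scores. A minimal sketch of the technique (the function name and signature are hypothetical, not the backend's API):

```python
import numpy as np

def sample_words(words, similarities, k, temperature=0.2, rng=None):
    """Sample k distinct words with probability softmax(similarity / temperature).
    Low temperature concentrates mass on the top-similarity words; high
    temperature flattens the distribution toward uniform."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(similarities, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    idx = rng.choice(len(words), size=k, replace=False, p=probs)
    return [words[i] for i in idx]
```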
### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
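The 0.0/1.0 endpoints imply a linear blend; a plausible form (the exact formula is an assumption):

```python
def difficulty_score(similarity, frequency, weight=0.5):
    """Blend topic similarity with normalized corpus frequency (both in [0, 1]).
    weight=0.0 ranks purely by similarity; weight=1.0 purely by frequency."""
    return (1.0 - weight) * similarity + weight * frequency
```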
### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics
### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
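A standard soft-minimum formulation consistent with this description is `softmin(s) = -(1/beta) * log(mean(exp(-beta * s)))`, which always lies between min(s) and mean(s) and approaches the true minimum as beta grows (whether the backend uses exactly this variant is an assumption):

```python
import numpy as np

def soft_minimum(scores, beta=10.0):
    """Smooth approximation of min(scores): the larger beta is, the more a
    single weak per-topic score drags the combined score down."""
    scores = np.asarray(scores, dtype=float)
    return -np.log(np.mean(np.exp(-beta * scores))) / beta

print(round(soft_minimum([0.80, 0.70, 0.75]), 2))  # 0.74 -- strong on all topics
print(round(soft_minimum([0.80, 0.10, 0.75]), 2))  # 0.21 -- one weak topic dominates
```

With SOFT_MIN_ADAPTIVE=true, the remaining knobs suggest a retry loop: if fewer than SOFT_MIN_MIN_WORDS candidates qualify, beta is presumably multiplied by SOFT_MIN_BETA_DECAY and selection is retried, up to SOFT_MIN_MAX_RETRIES times.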
### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in frontend
- Displays word selection process and parameters
- Useful for development and analysis
### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Disabled by default: rescaling can mask genuine differences in semantic similarity
- See docs/distribution_normalization_analysis.md
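One common reading of NORMALIZATION_METHOD=similarity_range is per-batch min-max rescaling (an assumption here; the analysis doc above describes the project's actual method):

```python
import numpy as np

def normalize_similarity_range(scores):
    """Min-max rescale a batch of similarity scores into [0, 1]."""
    scores = np.asarray(scores, dtype=float)
    lo, hi = scores.min(), scores.max()
    if hi - lo < 1e-9:  # degenerate batch: all scores identical
        return np.ones_like(scores)
    return (scores - lo) / (hi - lo)
```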
## Recommended HF Spaces Configuration
**Minimal Setup (Core functionality):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```
**Optimized Setup (Better performance & debugging):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```
## Deprecated Variables (Safe to Remove)
These variables are no longer used and can be deleted from HF Spaces:
- `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with old vector search)
- `USE_AI_WORDS` (always true now)
- `FALLBACK_TO_STATIC` (no static fallback in current system)
- `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
- `MAX_CACHED_WORDS` (deprecated with old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with old word tracking)
## Performance Notes
- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for full model cache (vocabulary, embeddings, models)
## Troubleshooting
**If puzzle generation fails:**
1. Check CACHE_DIR is writable and has sufficient space
2. Monitor startup logs for cache creation progress
3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT for frequency-based selection
3. Check debug tab with ENABLE_DEBUG_TAB=true
**If multi-topic queries return too few words:**
1. Enable SOFT_MIN_ADAPTIVE=true for automatic beta adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)
**If startup is too slow:**
1. Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. Cache should speed up subsequent startups significantly