
Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

Required Variables

Core Application Settings

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1

Cache Configuration

CACHE_DIR=/app/cache

AI/ML Model Configuration

THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig

Optional Variables (with defaults)

Word Selection & Quality Control

SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150

Multi-Topic Intersection Configuration

MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7

Distribution Normalization (Experimental)

ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range

Debug & Development

ENABLE_DEBUG_TAB=false

Variable Explanations

CACHE_DIR (Required)

  • Directory for caching models, embeddings, and vocabulary
  • Contains sentence-transformer models, word embeddings, and NLTK data
  • Should be persistent across deployments
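
At startup the backend reads this configuration from the environment. A minimal sketch of the pattern (the variable names match this document; the parsing code itself is illustrative, not the actual backend source):

import os

# Illustrative startup snippet: read the documented variables with their defaults.
CACHE_DIR = os.environ.get("CACHE_DIR", "/app/cache")
VOCAB_LIMIT = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
USE_SOFTMAX = os.environ.get("USE_SOFTMAX_SELECTION", "true").lower() == "true"

# The cache directory must exist and be writable before models are downloaded.
os.makedirs(CACHE_DIR, exist_ok=True)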

THEMATIC_MODEL_NAME (Default: all-mpnet-base-v2)

  • Sentence transformer model for semantic embeddings
  • Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
  • Affects quality vs performance trade-off
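
As an illustration of how the model and cache directory interact, a sentence-transformers model can be loaded into CACHE_DIR like this (a sketch, not the project's actual loader):

import os
from sentence_transformers import SentenceTransformer

# Downloads on first run, then loads from the cache on later starts.
model = SentenceTransformer(
    os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
    cache_folder=os.environ.get("CACHE_DIR", "/app/cache"),
)
vectors = model.encode(["ocean", "coral reef"])  # shape (2, 768) for all-mpnet-base-v2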

THEMATIC_VOCAB_SIZE_LIMIT (Default: 100000)

  • Maximum vocabulary size for word generation
  • Higher = more word variety, more memory usage
  • Norvig vocabulary contains ~100K words
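
Conceptually, the limit truncates a frequency-sorted word list. A tiny sketch (word_freq stands in for the Norvig word-count data; only three sample entries are shown):

import os

# Hypothetical: keep the most frequent words up to the configured limit.
word_freq = {"the": 23135851162, "of": 13151942776, "and": 12997637966}
limit = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
vocab = sorted(word_freq, key=word_freq.get, reverse=True)[:limit]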

VOCAB_SOURCE (Default: norvig)

  • Vocabulary source for word generation
  • Currently only "norvig" is supported
  • Uses Norvig word frequency dataset

SIMILARITY_TEMPERATURE (Default: 0.2)

  • Controls randomness in word selection
  • Lower = more deterministic (favors top-similarity words)
  • Higher = more random selection from similar words
  • Range: 0.1-2.0
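
A sketch of how temperature-scaled selection typically works, assuming a softmax over similarity scores as USE_SOFTMAX_SELECTION suggests (the function below is illustrative, not the backend's actual code):

import numpy as np

def sample_words(words, similarities, temperature=0.2, k=10, seed=None):
    """Sample k words without replacement; lower temperature favors top scores."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(similarities, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    picks = rng.choice(len(words), size=min(k, len(words)), replace=False, p=probs)
    return [words[i] for i in picks]

At 0.1 the distribution is sharply peaked on the best matches; at 2.0 it is much flatter, approaching uniform sampling over the candidate pool.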

DIFFICULTY_WEIGHT (Default: 0.5)

  • Balances similarity vs frequency for difficulty levels
  • 0.0 = pure similarity, 1.0 = pure frequency
  • Affects easy/medium/hard word selection
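
One common way to combine the two signals is a weighted sum; a sketch (the project's exact formula may differ):

def difficulty_score(similarity, frequency, weight=0.5):
    # weight=0.0 -> rank purely by topic similarity;
    # weight=1.0 -> rank purely by word frequency (commonness).
    # Both inputs are assumed normalized to [0, 1].
    return (1.0 - weight) * similarity + weight * frequency

print(difficulty_score(0.82, 0.40))  # 0.61 with the default weight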

MULTI_TOPIC_METHOD (Default: soft_minimum)

  • Method for multi-topic word intersection
  • Options: soft_minimum, geometric_mean, harmonic_mean, averaging
  • soft_minimum finds words relevant to ALL topics

SOFT_MIN_BETA (Default: 10.0)

  • Beta parameter for soft minimum calculation
  • Higher = stricter intersection requirement
  • Automatically adjusted if SOFT_MIN_ADAPTIVE=true
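
The soft minimum is a smooth stand-in for min() over a word's per-topic similarity scores; the standard formulation is sketched below (assumed here, not confirmed against the source):

import numpy as np

def soft_minimum(similarities, beta=10.0):
    """Smooth approximation of min(); larger beta tracks the true minimum more closely."""
    s = np.asarray(similarities, dtype=float)
    return -np.log(np.mean(np.exp(-beta * s))) / beta

print(soft_minimum([0.8, 0.7, 0.6]))  # ~0.67, pulled toward the weakest topic
print(soft_minimum([0.9, 0.9, 0.1]))  # ~0.21, one weak topic sinks the word

This is why soft_minimum favors words relevant to ALL topics: a single low score dominates the result. geometric_mean and harmonic_mean behave similarly but less strictly, while plain averaging lets one strong topic compensate for a weak one.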

ENABLE_DEBUG_TAB (Default: false)

  • Shows debug information in frontend
  • Displays word selection process and parameters
  • Useful for development and analysis

ENABLE_DISTRIBUTION_NORMALIZATION (Default: false)

  • Experimental feature for normalizing similarity distributions
  • Disabled by default; normalization can distort authentic similarity scores
  • See docs/distribution_normalization_analysis.md

Recommended HF Spaces Configuration

Minimal Setup (Core functionality):

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig

Optimized Setup (Better performance & debugging):

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true

Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:

  • EMBEDDING_MODEL (replaced by THEMATIC_MODEL_NAME)
  • WORD_SIMILARITY_THRESHOLD (deprecated with old vector search)
  • USE_AI_WORDS (always true now)
  • FALLBACK_TO_STATIC (no static fallback in current system)
  • SEARCH_RANDOMNESS (replaced by SIMILARITY_TEMPERATURE)
  • MAX_CACHED_WORDS (deprecated with old caching)
  • CACHE_EXPIRY_HOURS (deprecated with old caching)
  • USE_HIERARCHICAL_SEARCH (deprecated with old vector search)
  • MAX_USED_WORDS_MEMORY (deprecated with old word tracking)

Performance Notes

  • Startup Time: ~30-60 seconds on first start (model download + cache creation)
  • Memory Usage: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
  • Response Time: ~200-500ms (word generation + clue creation + grid fitting)
  • Disk Usage: ~500MB for full model cache (vocabulary, embeddings, models)

Troubleshooting

If puzzle generation fails:

  1. Check that CACHE_DIR is writable and has sufficient space (see the probe below)
  2. Monitor startup logs for cache creation progress
  3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
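
For the first check, a quick probe from a Python shell inside the container (illustrative):

import os, shutil

cache_dir = os.environ.get("CACHE_DIR", "/app/cache")
print("writable:", os.access(cache_dir, os.W_OK))
total, used, free = shutil.disk_usage(cache_dir)
print(f"free: {free / 1e9:.1f} GB")  # the full model cache needs ~500MB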

If words seem too random:

  1. Lower SIMILARITY_TEMPERATURE (try 0.1)
  2. Increase DIFFICULTY_WEIGHT for frequency-based selection
  3. Check debug tab with ENABLE_DEBUG_TAB=true

If multi-topic queries return too few words:

  1. Enable SOFT_MIN_ADAPTIVE=true so beta is relaxed automatically across retries
  2. Lower SOFT_MIN_BETA manually (try 5.0)
  3. Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

If startup is too slow:

  1. Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
  2. Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
  3. Cache should speed up subsequent startups significantly