
Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

Required Variables

Core Application Settings

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1

Cache Configuration

CACHE_DIR=/app/cache

AI/ML Model Configuration

THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig

Optional Variables (with defaults)

Word Selection & Quality Control

SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150

Multi-Topic Intersection Configuration

MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7

Distribution Normalization (Experimental)

ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range

Debug & Development

ENABLE_DEBUG_TAB=false

Variable Explanations

CACHE_DIR (Required)

  • Directory for caching models, embeddings, and vocabulary
  • Contains sentence-transformer models, word embeddings, and NLTK data
  • Should be persistent across deployments
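
At startup the backend reads this configuration from the environment. A minimal sketch of the pattern (the variable names match this document; the parsing code itself is illustrative, not the actual backend source):

import os

# Illustrative startup snippet: read the documented variables with their defaults.
CACHE_DIR = os.environ.get("CACHE_DIR", "/app/cache")
VOCAB_LIMIT = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
USE_SOFTMAX = os.environ.get("USE_SOFTMAX_SELECTION", "true").lower() == "true"

# The cache directory must exist and be writable before models are downloaded.
os.makedirs(CACHE_DIR, exist_ok=True)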

THEMATIC_MODEL_NAME (Default: all-mpnet-base-v2)

  • Sentence transformer model for semantic embeddings
  • Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
  • Affects quality vs performance trade-off
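
As an illustration of how the model and cache directory interact, a sentence-transformers model can be loaded into CACHE_DIR like this (a sketch, not the project's actual loader):

import os
from sentence_transformers import SentenceTransformer

# Downloads on first run, then loads from the cache on later starts.
model = SentenceTransformer(
    os.environ.get("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
    cache_folder=os.environ.get("CACHE_DIR", "/app/cache"),
)
vectors = model.encode(["ocean", "coral reef"])  # shape (2, 768) for all-mpnet-base-v2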

THEMATIC_VOCAB_SIZE_LIMIT (Default: 100000)

  • Maximum vocabulary size for word generation
  • Higher = more word variety, more memory usage
  • Norvig vocabulary contains ~100K words
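
Conceptually, the limit truncates a frequency-sorted word list. A tiny sketch (word_freq stands in for the Norvig word-count data; only three sample entries are shown):

import os

# Hypothetical: keep the most frequent words up to the configured limit.
word_freq = {"the": 23135851162, "of": 13151942776, "and": 12997637966}
limit = int(os.environ.get("THEMATIC_VOCAB_SIZE_LIMIT", "100000"))
vocab = sorted(word_freq, key=word_freq.get, reverse=True)[:limit]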

VOCAB_SOURCE (Default: norvig)

  • Vocabulary source for word generation
  • Currently only "norvig" is supported
  • Uses Norvig word frequency dataset

SIMILARITY_TEMPERATURE (Default: 0.2)

  • Controls randomness in word selection
  • Lower = more deterministic (favors top-similarity words)
  • Higher = more random selection from similar words
  • Range: 0.1-2.0
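
A sketch of how temperature-scaled selection typically works, assuming a softmax over similarity scores as USE_SOFTMAX_SELECTION suggests (the function below is illustrative, not the backend's actual code):

import numpy as np

def sample_words(words, similarities, temperature=0.2, k=10, seed=None):
    """Sample k words without replacement; lower temperature favors top scores."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(similarities, dtype=float) / temperature
    probs = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs /= probs.sum()
    picks = rng.choice(len(words), size=min(k, len(words)), replace=False, p=probs)
    return [words[i] for i in picks]

At 0.1 the distribution is sharply peaked on the best matches; at 2.0 it is much flatter, approaching uniform sampling over the candidate pool.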

DIFFICULTY_WEIGHT (Default: 0.5)

  • Balances similarity vs frequency for difficulty levels
  • 0.0 = pure similarity, 1.0 = pure frequency
  • Affects easy/medium/hard word selection
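
One common way to combine the two signals is a weighted sum; a sketch (the project's exact formula may differ):

def difficulty_score(similarity, frequency, weight=0.5):
    # weight=0.0 -> rank purely by topic similarity;
    # weight=1.0 -> rank purely by word frequency (commonness).
    # Both inputs are assumed normalized to [0, 1].
    return (1.0 - weight) * similarity + weight * frequency

print(difficulty_score(0.82, 0.40))  # 0.61 with the default weight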

MULTI_TOPIC_METHOD (Default: soft_minimum)

  • Method for multi-topic word intersection
  • Options: soft_minimum, geometric_mean, harmonic_mean, averaging
  • soft_minimum finds words relevant to ALL topics

SOFT_MIN_BETA (Default: 10.0)

  • Beta parameter for soft minimum calculation
  • Higher = stricter intersection requirement
  • Automatically adjusted if SOFT_MIN_ADAPTIVE=true
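
The soft minimum is a smooth stand-in for min() over a word's per-topic similarity scores; the standard formulation is sketched below (assumed here, not confirmed against the source):

import numpy as np

def soft_minimum(similarities, beta=10.0):
    """Smooth approximation of min(); larger beta tracks the true minimum more closely."""
    s = np.asarray(similarities, dtype=float)
    return -np.log(np.mean(np.exp(-beta * s))) / beta

print(soft_minimum([0.8, 0.7, 0.6]))  # ~0.67, pulled toward the weakest topic
print(soft_minimum([0.9, 0.9, 0.1]))  # ~0.21, one weak topic sinks the word

This is why soft_minimum favors words relevant to ALL topics: a single low score dominates the result. geometric_mean and harmonic_mean behave similarly but less strictly, while plain averaging lets one strong topic compensate for a weak one.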

ENABLE_DEBUG_TAB (Default: false)

  • Shows debug information in frontend
  • Displays word selection process and parameters
  • Useful for development and analysis

ENABLE_DISTRIBUTION_NORMALIZATION (Default: false)

  • Experimental feature for normalizing similarity distributions
  • Disabled by default; normalization can distort authentic similarity scores
  • See docs/distribution_normalization_analysis.md

Recommended HF Spaces Configuration

Minimal Setup (Core functionality):

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig

Optimized Setup (Better performance & debugging):

NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true

Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:

  • EMBEDDING_MODEL (replaced by THEMATIC_MODEL_NAME)
  • WORD_SIMILARITY_THRESHOLD (deprecated with old vector search)
  • USE_AI_WORDS (always true now)
  • FALLBACK_TO_STATIC (no static fallback in current system)
  • SEARCH_RANDOMNESS (replaced by SIMILARITY_TEMPERATURE)
  • MAX_CACHED_WORDS (deprecated with old caching)
  • CACHE_EXPIRY_HOURS (deprecated with old caching)
  • USE_HIERARCHICAL_SEARCH (deprecated with old vector search)
  • MAX_USED_WORDS_MEMORY (deprecated with old word tracking)

Performance Notes

  • Startup Time: ~30-60 seconds on first start (model download + cache creation)
  • Memory Usage: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
  • Response Time: ~200-500ms (word generation + clue creation + grid fitting)
  • Disk Usage: ~500MB for full model cache (vocabulary, embeddings, models)

Troubleshooting

If puzzle generation fails:

  1. Check that CACHE_DIR is writable and has sufficient space (see the probe below)
  2. Monitor startup logs for cache creation progress
  3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
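
For the first check, a quick probe from a Python shell inside the container (illustrative):

import os, shutil

cache_dir = os.environ.get("CACHE_DIR", "/app/cache")
print("writable:", os.access(cache_dir, os.W_OK))
total, used, free = shutil.disk_usage(cache_dir)
print(f"free: {free / 1e9:.1f} GB")  # the full model cache needs ~500MB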

If words seem too random:

  1. Lower SIMILARITY_TEMPERATURE (try 0.1)
  2. Increase DIFFICULTY_WEIGHT for frequency-based selection
  3. Check debug tab with ENABLE_DEBUG_TAB=true

If multi-topic queries return too few words:

  1. Enable SOFT_MIN_ADAPTIVE=true so beta is relaxed automatically across retries
  2. Lower SOFT_MIN_BETA manually (try 5.0)
  3. Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

If startup is too slow:

  1. Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
  2. Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
  3. Cache should speed up subsequent startups significantly