vimalk78 committed on
Commit 27a60ec · 1 Parent(s): f0e5a34

docs: update CONFIG.md with current env vars and document transfer learning failure


- Replace deprecated env variables with current ThematicWordService config
- Add detailed explanations for all 16 active environment variables
- Document 9 deprecated variables that can be safely removed from HF Spaces
- Record FLAN-T5 transfer learning approach as failed/discarded in strategy doc
- Preserve theoretical analysis for historical context with clear warnings

Signed-off-by: Vimal Kumar <vimal78@gmail.com>

crossword-app/backend-py/CONFIG.md CHANGED
@@ -1,6 +1,6 @@
  # Environment Configuration for Hugging Face Spaces

- This document lists all environment variables needed for the crossword generator backend when deployed on Hugging Face Spaces.

  ## Required Variables

@@ -8,67 +8,105 @@ This document lists all environment variables needed for the crossword generator
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
  ```

  ### AI/ML Model Configuration
  ```env
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
- USE_HIERARCHICAL_SEARCH=true
  ```

  ## Optional Variables (with defaults)

- ### Performance & Caching
  ```env
- MAX_CACHED_WORDS=150
- SEARCH_RANDOMNESS=0.02
- FAISS_CACHE_DIR=/tmp/faiss_cache
  ```

- ### Word Variety & Quality Control
  ```env
- MAX_USED_WORDS_MEMORY=50
- EXCLUDED_WORDS=WORD,THING,STUFF,GENERIC
  ```

- ### Advanced Configuration
  ```env
- MAX_RESULTS=40
- MIN_SIMILARITY_THRESHOLD=0.45
- WORD_CACHE_DIR=/tmp/word_cache
  ```

- ## Variable Explanations
-
- ### **WORD_SIMILARITY_THRESHOLD** (Default: 0.55)
- - Controls semantic similarity requirement for AI-generated words
- - Range: 0.3-0.7 (higher = stricter quality, fewer words)
- - System uses adaptive thresholds if insufficient words found
-
- ### **USE_HIERARCHICAL_SEARCH** (Default: true)
- - Enables advanced semantic search with topic variations and subcategories
- - Significantly improves word diversity and topic coverage
- - Set to `false` to use simpler single-search approach
-
- ### **MAX_USED_WORDS_MEMORY** (Default: 50)
- - Number of previously used words to remember per topic
- - Prevents repetition across multiple puzzle generations
- - Higher values = better variety but more memory usage
-
- ### **EXCLUDED_WORDS** (Optional)
- - Comma-separated list of words to never include in puzzles
- - Blocks overly generic or inappropriate terms
- - Example: `WORD,THING,STUFF,DATA,INFO`
-
- ### **FALLBACK_TO_STATIC** (Default: true)
- - Falls back to static word lists if AI generation fails
- - Ensures puzzle generation always succeeds
- - Recommended to keep as `true` for production reliability

  ## Recommended HF Spaces Configuration

@@ -76,49 +114,69 @@ WORD_CACHE_DIR=/tmp/word_cache
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
  ```

- **Optimized Setup (Better performance & variety):**
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
- USE_HIERARCHICAL_SEARCH=true
- MAX_USED_WORDS_MEMORY=50
- MAX_CACHED_WORDS=150
- SEARCH_RANDOMNESS=0.02
  ```

  ## Performance Notes

- - **Startup Time**: ~30-60 seconds with AI models, ~2 seconds without
- - **Memory Usage**: ~500MB-1GB with AI, ~100MB without
- - **First Request**: May take longer due to model initialization
- - **FAISS Cache**: Speeds up subsequent startups significantly

  ## Troubleshooting

  **If puzzle generation fails:**
- 1. Check `WORD_SIMILARITY_THRESHOLD` (try lowering to 0.5 or 0.45)
- 2. Ensure `FALLBACK_TO_STATIC=true`
- 3. Monitor logs for "Not enough words" errors

- **If words seem too generic:**
- 1. Raise `WORD_SIMILARITY_THRESHOLD` to 0.6 or 0.65
- 2. Add problematic words to `EXCLUDED_WORDS`
- 3. Enable `USE_HIERARCHICAL_SEARCH=true`

  **If startup is too slow:**
- 1. FAISS index caching should help after first run
- 2. Consider smaller embedding model for faster startup (trade-off with quality)
 
  # Environment Configuration for Hugging Face Spaces

+ This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

  ## Required Variables

  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
  ```

+ ### Cache Configuration
+ ```env
+ CACHE_DIR=/app/cache
+ ```
+
  ### AI/ML Model Configuration
  ```env
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ VOCAB_SOURCE=norvig
  ```

  ## Optional Variables (with defaults)

+ ### Word Selection & Quality Control
  ```env
+ SIMILARITY_TEMPERATURE=0.2
+ USE_SOFTMAX_SELECTION=true
+ DIFFICULTY_WEIGHT=0.5
+ THEMATIC_POOL_SIZE=150
  ```

+ ### Multi-Topic Intersection Configuration
  ```env
+ MULTI_TOPIC_METHOD=soft_minimum
+ SOFT_MIN_BETA=10.0
+ SOFT_MIN_ADAPTIVE=true
+ SOFT_MIN_MIN_WORDS=15
+ SOFT_MIN_MAX_RETRIES=5
+ SOFT_MIN_BETA_DECAY=0.7
  ```

+ ### Distribution Normalization (Experimental)
  ```env
+ ENABLE_DISTRIBUTION_NORMALIZATION=false
+ NORMALIZATION_METHOD=similarity_range
  ```

+ ### Debug & Development
+ ```env
+ ENABLE_DEBUG_TAB=false
+ ```

+ ## Variable Explanations

+ ### **CACHE_DIR** (Required)
+ - Directory for caching models, embeddings, and vocabulary
+ - Contains sentence-transformer models, word embeddings, and NLTK data
+ - Should be persistent across deployments
+
+ ### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
+ - Sentence-transformer model used for semantic embeddings
+ - Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
+ - Controls the quality vs. performance trade-off
+
+ ### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
+ - Maximum vocabulary size for word generation
+ - Higher = more word variety but more memory usage
+ - The Norvig vocabulary contains ~100K words
+
+ ### **VOCAB_SOURCE** (Default: norvig)
+ - Vocabulary source for word generation
+ - Currently only "norvig" is supported
+ - Uses the Norvig word-frequency dataset
+
+ ### **SIMILARITY_TEMPERATURE** (Default: 0.2)
+ - Controls randomness in word selection
+ - Lower = more deterministic (top-similarity words)
+ - Higher = more random selection from similar words
+ - Range: 0.1-2.0
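Under `USE_SOFTMAX_SELECTION`, the temperature scales a softmax over candidate similarities. A minimal sketch of that mechanism, assuming the standard temperature-scaled softmax (this is illustrative only, not the ThematicWordService's actual code, and the candidate words and scores are invented):

```python
import math
import random

def softmax_select(candidates, temperature=0.2, k=5, seed=None):
    """Sample k words from (word, similarity) pairs with probability
    proportional to softmax(similarity / temperature).

    Low temperature concentrates mass on the top-similarity words
    (near-deterministic); high temperature flattens the distribution
    toward uniform random selection.
    """
    rng = random.Random(seed)
    words, sims = zip(*candidates)
    # Subtract the max scaled score before exponentiating, for stability.
    m = max(s / temperature for s in sims)
    weights = [math.exp(s / temperature - m) for s in sims]
    total = sum(weights)
    pool = [(w, wt / total) for w, wt in zip(words, weights)]
    chosen = []
    # Weighted sampling without replacement.
    for _ in range(min(k, len(pool))):
        r = rng.random() * sum(p for _, p in pool)
        acc = 0.0
        for i, (word, p) in enumerate(pool):
            acc += p
            if acc >= r:
                chosen.append(word)
                pool.pop(i)
                break
    return chosen
```

With a temperature near 0.1 the highest-similarity candidates are chosen almost deterministically; near 2.0 the selection approaches uniform sampling over the pool.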
+
+ ### **DIFFICULTY_WEIGHT** (Default: 0.5)
+ - Balances similarity vs. frequency for difficulty levels
+ - 0.0 = pure similarity, 1.0 = pure frequency
+ - Affects easy/medium/hard word selection
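The 0.0/1.0 endpoints described above suggest a linear blend of the two signals. A hedged sketch of that reading (the service's actual scoring code is not shown in this document):

```python
def blended_score(similarity, frequency, difficulty_weight=0.5):
    """Blend topic similarity with corpus frequency, both assumed in [0, 1].

    difficulty_weight=0.0 ranks purely by similarity;
    difficulty_weight=1.0 ranks purely by frequency (common, "easier" words).
    """
    return (1.0 - difficulty_weight) * similarity + difficulty_weight * frequency
```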
+
+ ### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
+ - Method for multi-topic word intersection
+ - Options: soft_minimum, geometric_mean, harmonic_mean, averaging
+ - soft_minimum favors words relevant to ALL topics
+
+ ### **SOFT_MIN_BETA** (Default: 10.0)
+ - Beta parameter for the soft-minimum calculation
+ - Higher = stricter intersection requirement
+ - Automatically adjusted when SOFT_MIN_ADAPTIVE=true
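A standard way to realize a beta-parameterized soft minimum is the log-sum-exp smooth minimum; this sketch assumes that formulation (the backend's exact formula may differ):

```python
import math

def soft_minimum(similarities, beta=10.0):
    """Smooth approximation of min() over per-topic similarity scores.

    Computed as -(1/beta) * log(mean(exp(-beta * s))). As beta grows,
    the result hugs the true minimum (strict intersection); as beta
    shrinks, it drifts toward the mean (more permissive).
    """
    n = len(similarities)
    m = min(similarities)
    # Factor out exp(-beta * m) so the exponentials stay in [0, 1].
    return m - (1.0 / beta) * math.log(
        sum(math.exp(-beta * (s - m)) for s in similarities) / n
    )
```

Read this way, the retry knobs suggest a loop that multiplies beta by `SOFT_MIN_BETA_DECAY` (0.7) and re-scores, up to `SOFT_MIN_MAX_RETRIES` times, whenever fewer than `SOFT_MIN_MIN_WORDS` candidates qualify; that is an inference from the variable names rather than documented behavior.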
+
+ ### **ENABLE_DEBUG_TAB** (Default: false)
+ - Shows a debug tab in the frontend
+ - Displays the word-selection process and its parameters
+ - Useful for development and analysis
+
+ ### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
+ - Experimental feature for normalizing similarity distributions
+ - Generally left disabled for better semantic authenticity
+ - See docs/distribution_normalization_analysis.md
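Taken together, the variables above can be read with `os.environ` plus small type-coercion helpers. This is an illustrative sketch of how the backend might load them, with defaults matching this document; it is not the service's actual configuration code:

```python
import os

def env_str(name, default):
    return os.environ.get(name, default)

def env_int(name, default):
    return int(os.environ.get(name, str(default)))

def env_float(name, default):
    return float(os.environ.get(name, str(default)))

def env_bool(name, default):
    # Accept common truthy spellings; anything else is False.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

CONFIG = {
    "cache_dir": env_str("CACHE_DIR", "/app/cache"),
    "model_name": env_str("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
    "vocab_size_limit": env_int("THEMATIC_VOCAB_SIZE_LIMIT", 100_000),
    "similarity_temperature": env_float("SIMILARITY_TEMPERATURE", 0.2),
    "use_softmax_selection": env_bool("USE_SOFTMAX_SELECTION", True),
    "difficulty_weight": env_float("DIFFICULTY_WEIGHT", 0.5),
    "multi_topic_method": env_str("MULTI_TOPIC_METHOD", "soft_minimum"),
    "soft_min_beta": env_float("SOFT_MIN_BETA", 10.0),
    "enable_debug_tab": env_bool("ENABLE_DEBUG_TAB", False),
}
```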

  ## Recommended HF Spaces Configuration

  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
+ CACHE_DIR=/app/cache
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ VOCAB_SOURCE=norvig
  ```

+ **Optimized Setup (Better performance & debugging):**
  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
+ CACHE_DIR=/app/cache
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ VOCAB_SOURCE=norvig
+ SIMILARITY_TEMPERATURE=0.2
+ DIFFICULTY_WEIGHT=0.5
+ ENABLE_DEBUG_TAB=true
+ MULTI_TOPIC_METHOD=soft_minimum
+ SOFT_MIN_ADAPTIVE=true
  ```

+ ## Deprecated Variables (Safe to Remove)
+
+ These variables are no longer used and can be deleted from HF Spaces:
+ - `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
+ - `WORD_SIMILARITY_THRESHOLD` (deprecated with the old vector search)
+ - `USE_AI_WORDS` (always true now)
+ - `FALLBACK_TO_STATIC` (no static fallback in the current system)
+ - `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
+ - `MAX_CACHED_WORDS` (deprecated with the old caching)
+ - `CACHE_EXPIRY_HOURS` (deprecated with the old caching)
+ - `USE_HIERARCHICAL_SEARCH` (deprecated with the old vector search)
+ - `MAX_USED_WORDS_MEMORY` (deprecated with the old word tracking)
+
  ## Performance Notes

+ - **Startup Time**: ~30-60 seconds (model download + cache creation)
+ - **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
+ - **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
+ - **Disk Usage**: ~500MB for the full model cache (vocabulary, embeddings, models)

  ## Troubleshooting

  **If puzzle generation fails:**
+ 1. Check that CACHE_DIR is writable and has sufficient space
+ 2. Monitor startup logs for cache-creation progress
+ 3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
+
+ **If words seem too random:**
+ 1. Lower SIMILARITY_TEMPERATURE (try 0.1)
+ 2. Increase DIFFICULTY_WEIGHT for more frequency-based selection
+ 3. Inspect the debug tab with ENABLE_DEBUG_TAB=true

+ **If multi-topic queries return too few words:**
+ 1. Enable SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
+ 2. Lower SOFT_MIN_BETA manually (try 5.0)
+ 3. Try a different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

  **If startup is too slow:**
+ 1. Use a smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
+ 2. Reduce the vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
+ 3. The cache should speed up subsequent startups significantly
crossword-app/backend-py/docs/advanced_clue_generation_strategy.md CHANGED
@@ -138,7 +138,25 @@ Context-based learning:
  → Model learns: accident, discovery, positive outcomes, unexpected events
  ```

- ## Recommended Architecture: Context-First Transfer Learning

  ### Core Philosophy

@@ -406,15 +424,17 @@ Once basic quality is achieved, explore:
  ## Conclusion

- The context-based transfer learning approach offers the most promising path to universal, high-quality clue generation. By leveraging FLAN-T5's existing contextual knowledge and training it to reformulate that knowledge as crossword clues, we can achieve:

- 1. **Universal coverage** - clues for every word
- 2. **Quality improvement** - especially for rare and proper nouns
- 3. **Scalable approach** - automated training data generation
- 4. **Practical implementation** - manageable computational requirements

- This strategy moves beyond the limitations of surface-pattern embeddings to tap into the rich contextual understanding that large language models have acquired during pre-training, directing that knowledge toward the specific stylistic and functional requirements of crossword clue generation.

  ---

- *This analysis builds on the comprehensive discussion of clue generation approaches and represents the consensus strategy for implementing universal crossword clue generation capabilities.*

  → Model learns: accident, discovery, positive outcomes, unexpected events
  ```

+ ## Attempted Approaches and Results
+
+ ### Context-Based Transfer Learning (FAILED)
+
+ **Status**: ❌ ATTEMPTED AND DISCARDED
+
+ **Implementation**: FLAN-T5 context-based transfer learning was implemented using the approach described below, including:
+ - Wikipedia abstracts for entity-based clues
+ - Etymology databases for origin-based clues
+ - Usage-based corpora for context patterns
+ - Fine-tuning on 500K+ training pairs
+
+ **Results**: The approach generated poor-quality clues that were not suitable for crosswords. Despite its theoretical soundness, the practical implementation failed to produce the expected improvements in clue quality.
+
+ **Conclusion**: Transfer learning with FLAN-T5 is not a viable solution for crossword clue generation. Alternative approaches should be explored.
+
+ ## Theoretical Architecture: Context-First Transfer Learning (DISCARDED)
+
+ **⚠️ NOTE: This section is preserved for historical context. The approach was tried and failed in practice.**

  ### Core Philosophy

  ## Conclusion

+ **Current Status**: The transfer learning approach described above was implemented and failed to produce clues suitable for crosswords.
+
+ **Next Steps**: Alternative approaches need to be explored, such as:

+ 1. **Semantic Concept Extraction with Rule Engines**: Transform dictionary entries into crossword-style variations using pattern matching and linguistic rules
+ 2. **Hybrid WordNet + Post-Processing**: Use WordNet as a base, but apply aggressive post-processing to create concise, crossword-appropriate clues
+ 3. **Template-Based Generation**: Create crossword-style templates and populate them with extracted semantic information
+ 4. **Curated Knowledge Base**: Build a targeted database of crossword-suitable clues for high-frequency vocabulary

+ **Lessons Learned**: While theoretically sound, transfer learning with language models may not be well suited to the highly constrained, stylized requirements of crossword clues. The gap between natural language generation and crossword convention may be too large to bridge through fine-tuning alone.

  ---

+ *This analysis documents both theoretical approaches and practical implementation results for crossword clue generation. The transfer learning approach described in detail was attempted but failed in practice; it serves as a guide for future research directions.*