docs: update CONFIG.md with current env vars and document transfer learning failure
- Replace deprecated env variables with current ThematicWordService config
- Add detailed explanations for all 16 active environment variables
- Document 9 deprecated variables that can be safely removed from HF Spaces
- Record FLAN-T5 transfer learning approach as failed/discarded in strategy doc
- Preserve theoretical analysis for historical context with clear warnings
Signed-off-by: Vimal Kumar <vimal78@gmail.com>
crossword-app/backend-py/CONFIG.md
CHANGED
# Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

## Required Variables

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```

### Cache Configuration

```env
CACHE_DIR=/app/cache
```

### AI/ML Model Configuration

```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```
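The backend reads these variables at startup. A minimal sketch of the pattern, assuming the usual `os.getenv` approach — the helper name, dict layout, and fallback values shown here are illustrative, not the actual ThematicWordService code:

```python
import os

# Illustrative only: read the documented variables with their documented
# defaults. The real service code may structure this differently.
def load_thematic_config() -> dict:
    return {
        "cache_dir": os.getenv("CACHE_DIR", "/app/cache"),
        "model_name": os.getenv("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
        "vocab_size_limit": int(os.getenv("THEMATIC_VOCAB_SIZE_LIMIT", "100000")),
        "vocab_source": os.getenv("VOCAB_SOURCE", "norvig"),
    }

config = load_thematic_config()
```

Note that `THEMATIC_VOCAB_SIZE_LIMIT` arrives as a string and must be converted before use; misconfiguring it to a non-numeric value would fail at startup rather than silently.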
## Optional Variables (with defaults)

### Word Selection & Quality Control
```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```

### Multi-Topic Intersection Configuration
```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```

### Distribution Normalization (Experimental)
```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```

### Debug & Development
```env
ENABLE_DEBUG_TAB=false
```

## Variable Explanations

### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments

### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence-transformer model used for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Controls the quality vs. performance trade-off

### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety but more memory usage
- The Norvig vocabulary contains ~100K words

### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses the Norvig word-frequency dataset

### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top-similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0
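With `USE_SOFTMAX_SELECTION=true`, the temperature behaves like a standard softmax temperature over similarity scores. A minimal sketch of that behavior, assuming the usual formulation (the actual ThematicWordService scoring code may differ):

```python
import math
import random

def softmax_sample(words, similarities, temperature=0.2, k=5, seed=42):
    """Sample k words (with replacement), weighted by softmax(similarity / T).

    Lower temperature sharpens the distribution toward the top-similarity
    words; higher temperature flattens it toward a uniform random pick.
    """
    rng = random.Random(seed)
    m = max(similarities)  # subtract the max for numerical stability
    weights = [math.exp((s - m) / temperature) for s in similarities]
    return rng.choices(words, weights=weights, k=k)

words = ["planet", "orbit", "galaxy", "banana"]
sims  = [0.82, 0.79, 0.75, 0.10]
print(softmax_sample(words, sims, temperature=0.2))  # strongly favors the top words
print(softmax_sample(words, sims, temperature=2.0))  # much flatter distribution
```

At the default of 0.2, probability mass concentrates on the highest-similarity candidates, which is why raising the temperature is the lever for more variety.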
### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs. frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
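The balance described above can be pictured as a simple convex combination. This is an illustrative sketch only; the service's actual scoring function may differ:

```python
def difficulty_score(similarity: float, frequency: float, weight: float = 0.5) -> float:
    """Blend topic similarity with corpus frequency.

    weight = 0.0 -> rank purely by similarity to the topic
    weight = 1.0 -> rank purely by frequency (common words first, i.e. easier)
    """
    return (1.0 - weight) * similarity + weight * frequency

# A rare but on-topic word vs. a common, loosely related word:
rare_on_topic  = difficulty_score(similarity=0.9, frequency=0.1, weight=0.0)  # similarity wins
common_generic = difficulty_score(similarity=0.4, frequency=0.9, weight=1.0)  # frequency wins
```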
### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics

### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for the soft-minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
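A soft minimum is commonly implemented as a beta-weighted log-sum-exp over the per-topic similarities. A sketch under that assumption (the exact formulation in the service may differ), with the geometric mean shown for comparison:

```python
import math

def soft_minimum(scores, beta=10.0):
    """Smooth approximation of min(scores).

    As beta grows this approaches the hard minimum (strict intersection);
    smaller beta lets one strong topic partially compensate for a weak one.
    """
    m = min(scores)  # shift by the min for numerical stability
    return m - (1.0 / beta) * math.log(
        sum(math.exp(-beta * (s - m)) for s in scores) / len(scores)
    )

def geometric_mean(scores):
    p = 1.0
    for s in scores:
        p *= s
    return p ** (1.0 / len(scores))
```

For a word scoring 0.8 on one topic and 0.3 on another, the geometric mean yields a higher combined score than the soft minimum at the default beta, consistent with the troubleshooting note that geometric_mean is the more permissive method.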
### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in the frontend
- Displays the word selection process and parameters
- Useful for development and analysis

### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Generally disabled for better semantic authenticity
- See docs/distribution_normalization_analysis.md

## Recommended HF Spaces Configuration

```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```

**Optimized Setup (Better performance & debugging):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```

## Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:
- `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with the old vector search)
- `USE_AI_WORDS` (always true now)
- `FALLBACK_TO_STATIC` (no static fallback in the current system)
- `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
- `MAX_CACHED_WORDS` (deprecated with the old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with the old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with the old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with the old word tracking)

## Performance Notes

- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for the full model cache (vocabulary, embeddings, models)

## Troubleshooting

**If puzzle generation fails:**
1. Check that CACHE_DIR is writable and has sufficient space
2. Monitor startup logs for cache-creation progress
3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive

**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT for frequency-based selection
3. Check the debug tab with ENABLE_DEBUG_TAB=true

**If multi-topic queries return too few words:**
1. Enable SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try a different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

**If startup is too slow:**
1. Use a smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce the vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. The cache should speed up subsequent startups significantly
crossword-app/backend-py/docs/advanced_clue_generation_strategy.md
CHANGED
## Attempted Approaches and Results

### Context-Based Transfer Learning (FAILED)

**Status**: ❌ ATTEMPTED AND DISCARDED

**Implementation**: FLAN-T5 context-based transfer learning was implemented using the approach described below, including:

- Wikipedia abstracts for entity-based clues
- Etymology databases for origin-based clues
- Usage-based corpora for context patterns
- Fine-tuning on 500K+ training pairs

**Results**: The approach generated poor-quality clues that were not suitable for crosswords. Despite the theoretical soundness of the approach, the practical implementation failed to produce the expected improvements in clue quality.

**Conclusion**: Transfer learning with FLAN-T5 is not a viable solution for crossword clue generation. Alternative approaches should be explored.

## Theoretical Architecture: Context-First Transfer Learning (DISCARDED)

**⚠️ NOTE: This section is preserved for historical context. This approach was tried and failed in practice.**

### Core Philosophy

…

## Conclusion

**Current Status**: The transfer learning approach described above was implemented and failed to produce quality clues suitable for crosswords.

**Next Steps**: Alternative approaches need to be explored, such as:

1. **Semantic Concept Extraction with Rule Engines**: Transform dictionary entries into crossword-style variations using pattern matching and linguistic rules
2. **Hybrid WordNet + Post-Processing**: Use WordNet as a base but apply aggressive post-processing to create concise, crossword-appropriate clues
3. **Template-Based Generation**: Create crossword-style templates and populate them with extracted semantic information
4. **Curated Knowledge Base**: Build a targeted database of crossword-suitable clues for high-frequency vocabulary
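The template-based direction could be as simple as pattern-filling from extracted semantic facts. A hypothetical sketch only — the templates, the field names, and the sample entry below are all invented for illustration, not part of the project:

```python
# Hypothetical template-based clue generation (see "Template-Based Generation"
# above). Templates, fields, and the sample entry are invented for illustration.
TEMPLATES = {
    "category": "{category}, for one",
    "function": "It might {function}",
    "origin": "Word from {origin}",
}

def generate_clues(entry: dict) -> list:
    """Emit one clue per semantic field present in the entry."""
    clues = []
    for field, template in TEMPLATES.items():
        if field in entry:
            clues.append(template.format(**{field: entry[field]}))
    return clues

entry = {"word": "SERENDIPITY", "category": "happy accident", "origin": "a Persian fairy tale"}
for clue in generate_clues(entry):
    print(clue)
```

Unlike fine-tuning, this keeps output style fully under control, which is exactly the constraint the FLAN-T5 attempt failed to satisfy.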
**Lessons Learned**: While theoretically sound, transfer learning with language models may not be well suited to the highly constrained and stylistic requirements of crossword clues. The gap between natural language generation and crossword convention may be too large to bridge effectively through fine-tuning alone.

---

*This analysis documents both theoretical approaches and practical implementation results for crossword clue generation. The transfer learning approach described in detail was attempted but failed in practice, serving as a guide for future research directions.*