vimalk78 committed on
Commit 27a60ec · 1 Parent(s): f0e5a34

docs: update CONFIG.md with current env vars and document transfer learning failure


- Replace deprecated env variables with current ThematicWordService config
- Add detailed explanations for all 16 active environment variables
- Document 9 deprecated variables that can be safely removed from HF Spaces
- Record FLAN-T5 transfer learning approach as failed/discarded in strategy doc
- Preserve theoretical analysis for historical context with clear warnings

Signed-off-by: Vimal Kumar <vimal78@gmail.com>

crossword-app/backend-py/CONFIG.md CHANGED
@@ -1,6 +1,6 @@
  # Environment Configuration for Hugging Face Spaces

- This document lists all environment variables needed for the crossword generator backend when deployed on Hugging Face Spaces.

  ## Required Variables

@@ -8,67 +8,105 @@ This document lists all environment variables needed for the crossword generator
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
  ```

  ### AI/ML Model Configuration
  ```env
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
- USE_HIERARCHICAL_SEARCH=true
  ```

  ## Optional Variables (with defaults)

- ### Performance & Caching
  ```env
- MAX_CACHED_WORDS=150
- SEARCH_RANDOMNESS=0.02
- FAISS_CACHE_DIR=/tmp/faiss_cache
  ```

- ### Word Variety & Quality Control
  ```env
- MAX_USED_WORDS_MEMORY=50
- EXCLUDED_WORDS=WORD,THING,STUFF,GENERIC
  ```

- ### Advanced Configuration
  ```env
- MAX_RESULTS=40
- MIN_SIMILARITY_THRESHOLD=0.45
- WORD_CACHE_DIR=/tmp/word_cache
  ```

- ## Variable Explanations
-
- ### **WORD_SIMILARITY_THRESHOLD** (Default: 0.55)
- - Controls semantic similarity requirement for AI-generated words
- - Range: 0.3-0.7 (higher = stricter quality, fewer words)
- - System uses adaptive thresholds if insufficient words found
-
- ### **USE_HIERARCHICAL_SEARCH** (Default: true)
- - Enables advanced semantic search with topic variations and subcategories
- - Significantly improves word diversity and topic coverage
- - Set to `false` to use simpler single-search approach
-
- ### **MAX_USED_WORDS_MEMORY** (Default: 50)
- - Number of previously used words to remember per topic
- - Prevents repetition across multiple puzzle generations
- - Higher values = better variety but more memory usage
-
- ### **EXCLUDED_WORDS** (Optional)
- - Comma-separated list of words to never include in puzzles
- - Blocks overly generic or inappropriate terms
- - Example: `WORD,THING,STUFF,DATA,INFO`
-
- ### **FALLBACK_TO_STATIC** (Default: true)
- - Falls back to static word lists if AI generation fails
- - Ensures puzzle generation always succeeds
- - Recommended to keep as `true` for production reliability

  ## Recommended HF Spaces Configuration

@@ -76,49 +114,69 @@ WORD_CACHE_DIR=/tmp/word_cache
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
  ```

- **Optimized Setup (Better performance & variety):**
  ```env
  NODE_ENV=production
  PORT=7860
- PYTHONPATH=/app/backend-py
  PYTHONUNBUFFERED=1
- EMBEDDING_MODEL=sentence-transformers/all-mpnet-base-v2
- WORD_SIMILARITY_THRESHOLD=0.55
- USE_AI_WORDS=true
- FALLBACK_TO_STATIC=true
- USE_HIERARCHICAL_SEARCH=true
- MAX_USED_WORDS_MEMORY=50
- MAX_CACHED_WORDS=150
- SEARCH_RANDOMNESS=0.02
  ```

  ## Performance Notes

- - **Startup Time**: ~30-60 seconds with AI models, ~2 seconds without
- - **Memory Usage**: ~500MB-1GB with AI, ~100MB without
- - **First Request**: May take longer due to model initialization
- - **FAISS Cache**: Speeds up subsequent startups significantly

  ## Troubleshooting

  **If puzzle generation fails:**
- 1. Check `WORD_SIMILARITY_THRESHOLD` (try lowering to 0.5 or 0.45)
- 2. Ensure `FALLBACK_TO_STATIC=true`
- 3. Monitor logs for "Not enough words" errors

- **If words seem too generic:**
- 1. Raise `WORD_SIMILARITY_THRESHOLD` to 0.6 or 0.65
- 2. Add problematic words to `EXCLUDED_WORDS`
- 3. Enable `USE_HIERARCHICAL_SEARCH=true`

  **If startup is too slow:**
- 1. FAISS index caching should help after first run
- 2. Consider smaller embedding model for faster startup (trade-off with quality)
 
  # Environment Configuration for Hugging Face Spaces

+ This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

  ## Required Variables

  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
  ```

+ ### Cache Configuration
+ ```env
+ CACHE_DIR=/app/cache
+ ```
+
  ### AI/ML Model Configuration
  ```env
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ VOCAB_SOURCE=norvig
  ```

  ## Optional Variables (with defaults)

+ ### Word Selection & Quality Control
  ```env
+ SIMILARITY_TEMPERATURE=0.2
+ USE_SOFTMAX_SELECTION=true
+ DIFFICULTY_WEIGHT=0.5
+ THEMATIC_POOL_SIZE=150
  ```

+ ### Multi-Topic Intersection Configuration
  ```env
+ MULTI_TOPIC_METHOD=soft_minimum
+ SOFT_MIN_BETA=10.0
+ SOFT_MIN_ADAPTIVE=true
+ SOFT_MIN_MIN_WORDS=15
+ SOFT_MIN_MAX_RETRIES=5
+ SOFT_MIN_BETA_DECAY=0.7
  ```

+ ### Distribution Normalization (Experimental)
  ```env
+ ENABLE_DISTRIBUTION_NORMALIZATION=false
+ NORMALIZATION_METHOD=similarity_range
  ```

+ ### Debug & Development
+ ```env
+ ENABLE_DEBUG_TAB=false
+ ```

+ ## Variable Explanations

+ ### **CACHE_DIR** (Required)
+ - Directory for caching models, embeddings, and vocabulary
+ - Contains sentence-transformer models, word embeddings, and NLTK data
+ - Should be persistent across deployments
+
+ ### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
+ - Sentence-transformer model used for semantic embeddings
+ - Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
+ - Controls the quality vs. performance trade-off
+
+ ### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
+ - Maximum vocabulary size for word generation
+ - Higher = more word variety but more memory usage
+ - The Norvig vocabulary contains ~100K words
+
+ ### **VOCAB_SOURCE** (Default: norvig)
+ - Vocabulary source for word generation
+ - Currently only "norvig" is supported
+ - Uses the Norvig word-frequency dataset
+
+ ### **SIMILARITY_TEMPERATURE** (Default: 0.2)
+ - Controls randomness in word selection
+ - Lower = more deterministic (top-similarity words)
+ - Higher = more random selection from similar words
+ - Range: 0.1-2.0
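Under `USE_SOFTMAX_SELECTION`, the temperature scales a softmax over candidate similarities. A minimal sketch of that mechanism, assuming the standard temperature-scaled softmax (this is illustrative only, not the ThematicWordService's actual code, and the candidate words and scores are invented):

```python
import math
import random

def softmax_select(candidates, temperature=0.2, k=5, seed=None):
    """Sample k words from (word, similarity) pairs with probability
    proportional to softmax(similarity / temperature).

    Low temperature concentrates mass on the top-similarity words
    (near-deterministic); high temperature flattens the distribution
    toward uniform random selection.
    """
    rng = random.Random(seed)
    words, sims = zip(*candidates)
    # Subtract the max scaled score before exponentiating, for stability.
    m = max(s / temperature for s in sims)
    weights = [math.exp(s / temperature - m) for s in sims]
    total = sum(weights)
    pool = [(w, wt / total) for w, wt in zip(words, weights)]
    chosen = []
    # Weighted sampling without replacement.
    for _ in range(min(k, len(pool))):
        r = rng.random() * sum(p for _, p in pool)
        acc = 0.0
        for i, (word, p) in enumerate(pool):
            acc += p
            if acc >= r:
                chosen.append(word)
                pool.pop(i)
                break
    return chosen
```

With a temperature near 0.1 the highest-similarity candidates are chosen almost deterministically; near 2.0 the selection approaches uniform sampling over the pool.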
+
+ ### **DIFFICULTY_WEIGHT** (Default: 0.5)
+ - Balances similarity vs. frequency for difficulty levels
+ - 0.0 = pure similarity, 1.0 = pure frequency
+ - Affects easy/medium/hard word selection
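The 0.0/1.0 endpoints described above suggest a linear blend of the two signals. A hedged sketch of that reading (the service's actual scoring code is not shown in this document):

```python
def blended_score(similarity, frequency, difficulty_weight=0.5):
    """Blend topic similarity with corpus frequency, both assumed in [0, 1].

    difficulty_weight=0.0 ranks purely by similarity;
    difficulty_weight=1.0 ranks purely by frequency (common, "easier" words).
    """
    return (1.0 - difficulty_weight) * similarity + difficulty_weight * frequency
```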
+
+ ### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
+ - Method for multi-topic word intersection
+ - Options: soft_minimum, geometric_mean, harmonic_mean, averaging
+ - soft_minimum favors words relevant to ALL topics
+
+ ### **SOFT_MIN_BETA** (Default: 10.0)
+ - Beta parameter for the soft-minimum calculation
+ - Higher = stricter intersection requirement
+ - Automatically adjusted when SOFT_MIN_ADAPTIVE=true
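A standard way to realize a beta-parameterized soft minimum is the log-sum-exp smooth minimum; this sketch assumes that formulation (the backend's exact formula may differ):

```python
import math

def soft_minimum(similarities, beta=10.0):
    """Smooth approximation of min() over per-topic similarity scores.

    Computed as -(1/beta) * log(mean(exp(-beta * s))). As beta grows,
    the result hugs the true minimum (strict intersection); as beta
    shrinks, it drifts toward the mean (more permissive).
    """
    n = len(similarities)
    m = min(similarities)
    # Factor out exp(-beta * m) so the exponentials stay in [0, 1].
    return m - (1.0 / beta) * math.log(
        sum(math.exp(-beta * (s - m)) for s in similarities) / n
    )
```

Read this way, the retry knobs suggest a loop that multiplies beta by `SOFT_MIN_BETA_DECAY` (0.7) and re-scores, up to `SOFT_MIN_MAX_RETRIES` times, whenever fewer than `SOFT_MIN_MIN_WORDS` candidates qualify; that is an inference from the variable names rather than documented behavior.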
+
+ ### **ENABLE_DEBUG_TAB** (Default: false)
+ - Shows a debug tab in the frontend
+ - Displays the word-selection process and its parameters
+ - Useful for development and analysis
+
+ ### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
+ - Experimental feature for normalizing similarity distributions
+ - Generally left disabled for better semantic authenticity
+ - See docs/distribution_normalization_analysis.md
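Taken together, the variables above can be read with `os.environ` plus small type-coercion helpers. This is an illustrative sketch of how the backend might load them, with defaults matching this document; it is not the service's actual configuration code:

```python
import os

def env_str(name, default):
    return os.environ.get(name, default)

def env_int(name, default):
    return int(os.environ.get(name, str(default)))

def env_float(name, default):
    return float(os.environ.get(name, str(default)))

def env_bool(name, default):
    # Accept common truthy spellings; anything else is False.
    return os.environ.get(name, str(default)).strip().lower() in ("1", "true", "yes")

CONFIG = {
    "cache_dir": env_str("CACHE_DIR", "/app/cache"),
    "model_name": env_str("THEMATIC_MODEL_NAME", "all-mpnet-base-v2"),
    "vocab_size_limit": env_int("THEMATIC_VOCAB_SIZE_LIMIT", 100_000),
    "similarity_temperature": env_float("SIMILARITY_TEMPERATURE", 0.2),
    "use_softmax_selection": env_bool("USE_SOFTMAX_SELECTION", True),
    "difficulty_weight": env_float("DIFFICULTY_WEIGHT", 0.5),
    "multi_topic_method": env_str("MULTI_TOPIC_METHOD", "soft_minimum"),
    "soft_min_beta": env_float("SOFT_MIN_BETA", 10.0),
    "enable_debug_tab": env_bool("ENABLE_DEBUG_TAB", False),
}
```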

  ## Recommended HF Spaces Configuration

  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
+ CACHE_DIR=/app/cache
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ VOCAB_SOURCE=norvig
  ```

+ **Optimized Setup (Better performance & debugging):**
  ```env
  NODE_ENV=production
  PORT=7860
+ PYTHONPATH=/app/crossword-app/backend-py
  PYTHONUNBUFFERED=1
+ CACHE_DIR=/app/cache
+ THEMATIC_VOCAB_SIZE_LIMIT=100000
+ THEMATIC_MODEL_NAME=all-mpnet-base-v2
+ VOCAB_SOURCE=norvig
+ SIMILARITY_TEMPERATURE=0.2
+ DIFFICULTY_WEIGHT=0.5
+ ENABLE_DEBUG_TAB=true
+ MULTI_TOPIC_METHOD=soft_minimum
+ SOFT_MIN_ADAPTIVE=true
  ```

+ ## Deprecated Variables (Safe to Remove)
+
+ These variables are no longer used and can be deleted from HF Spaces:
+ - `EMBEDDING_MODEL` (replaced by THEMATIC_MODEL_NAME)
+ - `WORD_SIMILARITY_THRESHOLD` (deprecated with the old vector search)
+ - `USE_AI_WORDS` (always true now)
+ - `FALLBACK_TO_STATIC` (no static fallback in the current system)
+ - `SEARCH_RANDOMNESS` (replaced by SIMILARITY_TEMPERATURE)
+ - `MAX_CACHED_WORDS` (deprecated with the old caching)
+ - `CACHE_EXPIRY_HOURS` (deprecated with the old caching)
+ - `USE_HIERARCHICAL_SEARCH` (deprecated with the old vector search)
+ - `MAX_USED_WORDS_MEMORY` (deprecated with the old word tracking)
+
  ## Performance Notes

+ - **Startup Time**: ~30-60 seconds (model download + cache creation)
+ - **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
+ - **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
+ - **Disk Usage**: ~500MB for the full model cache (vocabulary, embeddings, models)

  ## Troubleshooting

  **If puzzle generation fails:**
+ 1. Check that CACHE_DIR is writable and has sufficient space
+ 2. Monitor startup logs for cache-creation progress
+ 3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
+
+ **If words seem too random:**
+ 1. Lower SIMILARITY_TEMPERATURE (try 0.1)
+ 2. Increase DIFFICULTY_WEIGHT for more frequency-based selection
+ 3. Inspect the debug tab with ENABLE_DEBUG_TAB=true

+ **If multi-topic queries return too few words:**
+ 1. Enable SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
+ 2. Lower SOFT_MIN_BETA manually (try 5.0)
+ 3. Try a different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

  **If startup is too slow:**
+ 1. Use a smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
+ 2. Reduce the vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
+ 3. The cache should speed up subsequent startups significantly
crossword-app/backend-py/docs/advanced_clue_generation_strategy.md CHANGED
@@ -138,7 +138,25 @@ Context-based learning:
  → Model learns: accident, discovery, positive outcomes, unexpected events
  ```

- ## Recommended Architecture: Context-First Transfer Learning

  ### Core Philosophy

@@ -406,15 +424,17 @@ Once basic quality is achieved, explore:
  ## Conclusion

- The context-based transfer learning approach offers the most promising path to universal, high-quality clue generation. By leveraging FLAN-T5's existing contextual knowledge and training it to reformulate that knowledge as crossword clues, we can achieve:

- 1. **Universal coverage** - clues for every word
- 2. **Quality improvement** - especially for rare and proper nouns
- 3. **Scalable approach** - automated training data generation
- 4. **Practical implementation** - manageable computational requirements

- This strategy moves beyond the limitations of surface-pattern embeddings to tap into the rich contextual understanding that large language models have acquired during pre-training, directing that knowledge toward the specific stylistic and functional requirements of crossword clue generation.

  ---

- *This analysis builds on the comprehensive discussion of clue generation approaches and represents the consensus strategy for implementing universal crossword clue generation capabilities.*

  → Model learns: accident, discovery, positive outcomes, unexpected events
  ```

+ ## Attempted Approaches and Results
+
+ ### Context-Based Transfer Learning (FAILED)
+
+ **Status**: ❌ ATTEMPTED AND DISCARDED
+
+ **Implementation**: FLAN-T5 context-based transfer learning was implemented using the approach described below, including:
+ - Wikipedia abstracts for entity-based clues
+ - Etymology databases for origin-based clues
+ - Usage-based corpora for context patterns
+ - Fine-tuning on 500K+ training pairs
+
+ **Results**: The approach generated poor-quality clues that were not suitable for crosswords. Despite its theoretical soundness, the practical implementation failed to produce the expected improvements in clue quality.
+
+ **Conclusion**: Transfer learning with FLAN-T5 is not a viable solution for crossword clue generation. Alternative approaches should be explored.
+
+ ## Theoretical Architecture: Context-First Transfer Learning (DISCARDED)
+
+ **⚠️ NOTE: This section is preserved for historical context. The approach was tried and failed in practice.**

  ### Core Philosophy

  ## Conclusion

+ **Current Status**: The transfer learning approach described above was implemented and failed to produce clues suitable for crosswords.
+
+ **Next Steps**: Alternative approaches need to be explored, such as:

+ 1. **Semantic Concept Extraction with Rule Engines**: Transform dictionary entries into crossword-style variations using pattern matching and linguistic rules
+ 2. **Hybrid WordNet + Post-Processing**: Use WordNet as a base, but apply aggressive post-processing to create concise, crossword-appropriate clues
+ 3. **Template-Based Generation**: Create crossword-style templates and populate them with extracted semantic information
+ 4. **Curated Knowledge Base**: Build a targeted database of crossword-suitable clues for high-frequency vocabulary

+ **Lessons Learned**: While theoretically sound, transfer learning with language models may not be well suited to the highly constrained, stylized requirements of crossword clues. The gap between natural language generation and crossword convention may be too large to bridge through fine-tuning alone.

  ---

+ *This analysis documents both theoretical approaches and practical implementation results for crossword clue generation. The transfer learning approach described in detail was attempted but failed in practice; it serves as a guide for future research directions.*