# Environment Configuration for Hugging Face Spaces

This document lists all environment variables for the crossword generator backend when deployed on Hugging Face Spaces.

## Required Variables

### Core Application Settings
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
```

### Cache Configuration
```env
CACHE_DIR=/app/cache
```

### AI/ML Model Configuration
```env
THEMATIC_MODEL_NAME=all-mpnet-base-v2
THEMATIC_VOCAB_SIZE_LIMIT=100000
VOCAB_SOURCE=norvig
```

## Optional Variables (with defaults)

### Word Selection & Quality Control
```env
SIMILARITY_TEMPERATURE=0.2
USE_SOFTMAX_SELECTION=true
DIFFICULTY_WEIGHT=0.5
THEMATIC_POOL_SIZE=150
```

### Multi-Topic Intersection Configuration
```env
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_BETA=10.0
SOFT_MIN_ADAPTIVE=true
SOFT_MIN_MIN_WORDS=15
SOFT_MIN_MAX_RETRIES=5
SOFT_MIN_BETA_DECAY=0.7
```

### Distribution Normalization (Experimental)
```env
ENABLE_DISTRIBUTION_NORMALIZATION=false
NORMALIZATION_METHOD=similarity_range
```

### Debug & Development
```env
ENABLE_DEBUG_TAB=false
```

## Variable Explanations

### **CACHE_DIR** (Required)
- Directory for caching models, embeddings, and vocabulary
- Contains sentence-transformer models, word embeddings, and NLTK data
- Should be persistent across deployments

### **THEMATIC_MODEL_NAME** (Default: all-mpnet-base-v2)
- Sentence transformer model for semantic embeddings
- Options: all-mpnet-base-v2, all-MiniLM-L6-v2 (smaller/faster)
- Affects quality vs performance trade-off

### **THEMATIC_VOCAB_SIZE_LIMIT** (Default: 100000)
- Maximum vocabulary size for word generation
- Higher = more word variety but higher memory usage
- Norvig vocabulary contains ~100K words

### **VOCAB_SOURCE** (Default: norvig)
- Vocabulary source for word generation
- Currently only "norvig" is supported
- Uses Norvig word frequency dataset

### **SIMILARITY_TEMPERATURE** (Default: 0.2)
- Controls randomness in word selection
- Lower = more deterministic (top similarity words)
- Higher = more random selection from similar words
- Range: 0.1-2.0
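
To make the temperature knob concrete, here is a minimal, hypothetical sketch of softmax-weighted selection over similarity scores (the backend's real selector may differ in detail):

```python
import math
import random

def softmax_sample(candidates, temperature=0.2, k=5, rng=None):
    """Sample k words from (word, similarity) pairs, weighted by softmax.

    Lower temperature concentrates probability on the highest-similarity
    words; higher temperature flattens the distribution toward uniform.
    """
    rng = rng or random.Random()
    words, sims = zip(*candidates)
    top = max(sims)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp((s - top) / temperature) for s in sims]
    return rng.choices(words, weights=weights, k=k)
```

At `temperature=0.1` the picks are close to a deterministic top-similarity list; at `2.0` they approach a uniform draw from the candidate pool.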

### **DIFFICULTY_WEIGHT** (Default: 0.5)
- Balances similarity vs frequency for difficulty levels
- 0.0 = pure similarity, 1.0 = pure frequency
- Affects easy/medium/hard word selection
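
The document doesn't spell out the blending formula, but a linear interpolation consistent with the description above (assuming both scores are normalized to [0, 1]) would be:

```python
def difficulty_score(similarity, frequency, weight=0.5):
    """Blend topic similarity with word frequency.

    weight=0.0 ranks purely by similarity; weight=1.0 purely by frequency.
    """
    return (1.0 - weight) * similarity + weight * frequency
```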

### **MULTI_TOPIC_METHOD** (Default: soft_minimum)
- Method for multi-topic word intersection
- Options: soft_minimum, geometric_mean, harmonic_mean, averaging
- soft_minimum finds words relevant to ALL topics

### **SOFT_MIN_BETA** (Default: 10.0)
- Beta parameter for soft minimum calculation
- Higher = stricter intersection requirement
- Automatically adjusted if SOFT_MIN_ADAPTIVE=true
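
The standard log-sum-exp "soft minimum" matches this description; whether the backend uses exactly this form is an assumption, but it illustrates the role of beta:

```python
import math

def soft_minimum(scores, beta=10.0):
    """Smooth approximation of min() over per-topic similarity scores.

    Large beta approaches the true minimum (strict: a word must score
    well on every topic); small beta approaches the mean (lenient).
    """
    n = len(scores)
    return -math.log(sum(math.exp(-beta * s) for s in scores) / n) / beta
```

Lowering `SOFT_MIN_BETA` (or letting `SOFT_MIN_BETA_DECAY` do so across retries) therefore relaxes the intersection toward an average of the per-topic similarities.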

### **ENABLE_DEBUG_TAB** (Default: false)
- Shows debug information in frontend
- Displays word selection process and parameters
- Useful for development and analysis

### **ENABLE_DISTRIBUTION_NORMALIZATION** (Default: false)
- Experimental feature for normalizing similarity distributions
- Generally disabled for better semantic authenticity
- See docs/distribution_normalization_analysis.md

## Recommended HF Spaces Configuration

**Minimal Setup (Core functionality):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
```

**Optimized Setup (Better performance & debugging):**
```env
NODE_ENV=production
PORT=7860
PYTHONPATH=/app/crossword-app/backend-py
PYTHONUNBUFFERED=1
CACHE_DIR=/app/cache
THEMATIC_VOCAB_SIZE_LIMIT=100000
THEMATIC_MODEL_NAME=all-mpnet-base-v2
VOCAB_SOURCE=norvig
SIMILARITY_TEMPERATURE=0.2
DIFFICULTY_WEIGHT=0.5
ENABLE_DEBUG_TAB=true
MULTI_TOPIC_METHOD=soft_minimum
SOFT_MIN_ADAPTIVE=true
```

## Deprecated Variables (Safe to Remove)

These variables are no longer used and can be deleted from HF Spaces:
- `EMBEDDING_MODEL` (replaced by `THEMATIC_MODEL_NAME`)
- `WORD_SIMILARITY_THRESHOLD` (deprecated with old vector search)
- `USE_AI_WORDS` (always true now)
- `FALLBACK_TO_STATIC` (no static fallback in current system)
- `SEARCH_RANDOMNESS` (replaced by `SIMILARITY_TEMPERATURE`)
- `MAX_CACHED_WORDS` (deprecated with old caching)
- `CACHE_EXPIRY_HOURS` (deprecated with old caching)
- `USE_HIERARCHICAL_SEARCH` (deprecated with old vector search)
- `MAX_USED_WORDS_MEMORY` (deprecated with old word tracking)

## Performance Notes

- **Startup Time**: ~30-60 seconds (model download + cache creation)
- **Memory Usage**: ~500MB-1GB (sentence-transformers + embeddings + vocabulary)
- **Response Time**: ~200-500ms (word generation + clue creation + grid fitting)
- **Disk Usage**: ~500MB for full model cache (vocabulary, embeddings, models)

## Troubleshooting

**If puzzle generation fails:**
1. Check CACHE_DIR is writable and has sufficient space
2. Monitor startup logs for cache creation progress
3. Verify THEMATIC_VOCAB_SIZE_LIMIT isn't too restrictive
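
A quick way to verify step 1 is a hypothetical helper like this, run from the Space's terminal or a startup hook:

```python
import os
import tempfile

def cache_dir_writable(path="/app/cache"):
    """Return True if the cache dir exists (or can be created) and is writable."""
    try:
        os.makedirs(path, exist_ok=True)
        # Creating and discarding a temp file proves write permission.
        with tempfile.TemporaryFile(dir=path):
            pass
        return True
    except OSError:
        return False
```

`shutil.disk_usage(path)` can similarly confirm there is enough free space for the ~500MB model cache.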

**If words seem too random:**
1. Lower SIMILARITY_TEMPERATURE (try 0.1)
2. Increase DIFFICULTY_WEIGHT for frequency-based selection
3. Check debug tab with ENABLE_DEBUG_TAB=true

**If multi-topic queries return too few words:**
1. Enable SOFT_MIN_ADAPTIVE=true for automatic threshold adjustment
2. Lower SOFT_MIN_BETA manually (try 5.0)
3. Try different MULTI_TOPIC_METHOD (geometric_mean is more permissive)

**If startup is too slow:**
1. Use smaller model: THEMATIC_MODEL_NAME=all-MiniLM-L6-v2
2. Reduce vocabulary: THEMATIC_VOCAB_SIZE_LIMIT=50000
3. Cache should speed up subsequent startups significantly