# Probability Distribution Analysis: Theory vs. Practice

## Executive Summary

This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.

### Key Findings
- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Using normal distribution concepts (μ ± σ) on exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur

## Observed Probability Distributions

### Data Source: Technology Topic Analysis
Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:

```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```

### Empirical Results

#### Easy Difficulty
```
Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%
```

#### Medium Difficulty  
```
Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass  
σ Range: Words #17-#103
Top Probability: 1.5%
```

#### Hard Difficulty
```
Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77  
Top Probability: 4.1%
```

### Critical Observation
**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.

## Statistical Misconceptions in Current Approach

### The Mode-Mean Divergence Problem

The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This illustrates a basic property of highly skewed distributions:

```
Distribution Type: Exponentially Decaying (Highly Skewed)

Mode (Peak):     Position 0-3     (2-4% probability)
Median:          Position ~15     (Where 50% of probability mass is reached)  
Mean (μ):        Position 37-60   (Weighted average position)
```
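These three statistics can be reproduced on a synthetic distribution. The sketch below assumes an illustrative exponential decay over 150 ranked candidates; the decay rate is made up for demonstration, not measured from the real system:

```python
import numpy as np

# Illustrative softmax-like probabilities decaying exponentially over 150 ranked words
# (decay rate 0.05 is an assumption, not a measurement)
positions = np.arange(150)
p = np.exp(-0.05 * positions)
p /= p.sum()

mode = int(np.argmax(p))                           # position of the tallest bar
median = int(np.searchsorted(np.cumsum(p), 0.5))   # where cumulative mass crosses 50%
mean = float((positions * p).sum())                # probability-weighted average position
```

For this synthetic decay the mode sits at position 0 while the mean lands well to its right, reproducing the mode < median < mean ordering the visualization exhibits.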

### Why μ is "Wrong" for Understanding Selection

In an exponential distribution with long tail:

1. **Mode (0-3)**: Where individual words have highest probability
2. **Practical sampling zone**: First 10-20 words contain ~60-80% of probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities

The mean doesn't represent where sampling actually occurs—it's mathematically correct but practically misleading.

### Standard Deviation Misapplication

The σ visualization assumes a normal distribution where:
- **Normal assumption**: μ ± σ contains ~68% of probability mass
- **Our reality**: an exponential distribution whose μ ± σ band can miss the highest-probability words entirely

For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
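The mismatch can be quantified directly: for an exponential decay, the mass inside μ ± σ can deviate substantially from the 68% a normal distribution would predict. The sketch below uses the same kind of synthetic decay (the rate is illustrative, not measured):

```python
import numpy as np

positions = np.arange(150)
p = np.exp(-0.05 * positions)   # illustrative exponential decay, not measured data
p /= p.sum()

mu = (positions * p).sum()
sigma = np.sqrt((((positions - mu) ** 2) * p).sum())

# Probability mass inside mu ± sigma: ~68% for a normal distribution,
# noticeably different for a skewed one
mass_in_band = p[(positions >= mu - sigma) & (positions <= mu + sigma)].sum()
```

For this synthetic decay the band captures noticeably more than 68% of the mass, which matches the varying σ-zone percentages (61-82%) observed in the empirical data above.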

## Actual vs. Expected Behavior Analysis

### What Should Happen (Theory)
According to the composite scoring algorithm:

- **Easy**: Gaussian peak at 90th percentile → common words dominate
- **Medium**: Gaussian peak at 50th percentile → balanced selection  
- **Hard**: Gaussian peak at 20th percentile → rare words favored

### What Actually Happens (Empirical)
```
Easy:   MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard:   TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)
```

**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.

### Root Cause Analysis

The problem isn't in the Gaussian curves—they work correctly. The issue is in the composite formula:

```python
# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score

# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
# → composite = 0.5*0.9 + 0.5*0.1 = 0.50

# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0  
# → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```

Only the extreme case shown here, a perfect frequency score against a near-zero one, flips the ranking. With the moderate frequency gaps that occur in practice, a word still needs **very high similarity** to compete, so high-similarity words with wrong frequency profiles dominate the selection.

## Sampling Mechanics Deep Dive

### np.random.choice Behavior
The selection uses `np.random.choice` with:
- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities  
- **Sample size**: 10 words from 150 candidates

### Where Selections Actually Occur
Despite μ being at position 37-60, most actual selections come from positions 0-30 because:

1. **High probabilities concentrate early**: First 20 words often have 60%+ of total probability
2. **Without replacement effect**: Once high-probability words are chosen, selection moves to next-highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely

This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
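The concentration effect is easy to demonstrate. The sketch below draws 10-word samples without replacement from an assumed exponential probability vector (the decay rate is illustrative) and measures how often selections land in the first 30 positions:

```python
import numpy as np

rng = np.random.default_rng(42)
p = np.exp(-0.05 * np.arange(150))   # illustrative decay, not the real distribution
p /= p.sum()

# Fraction of selections landing in the first 30 positions, averaged over 200 draws
frac_early = np.mean([
    (rng.choice(150, size=10, replace=False, p=p) < 30).mean()
    for _ in range(200)
])
```

Well over half of the picks land in the first 30 positions, even though the mean of this distribution sits far to the right of the peak.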

## Better Visualization Approaches

### Current Problems
- **μ ± σ assumes normality**: not applicable to exponential distributions
- **Mean position is misleading**: it doesn't show where selection actually occurs
- **Standard deviation is uninformative**: for highly skewed distributions it says little about the actual sampling zone

### Recommended Alternatives

#### 1. Cumulative Probability Visualization
```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass  
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```

#### 2. Percentile Markers Instead of μ ± σ
```
P50 (Median):  Position where 50% of probability mass is reached
P75:           Position where 75% of probability mass is reached  
P90:           Position where 90% of probability mass is reached
```
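Both of these alternatives fall out of a single cumulative sum over the probability vector. A sketch under the same synthetic-decay assumption (the rate is illustrative):

```python
import numpy as np

p = np.exp(-0.05 * np.arange(150))   # synthetic exponential decay, rate is illustrative
p /= p.sum()
cum = np.cumsum(p)

# Cumulative mass covered by the first n words (alternative 1)
coverage = {n: float(cum[n - 1]) for n in (10, 20, 30, 50)}

# First position at which each mass threshold is reached (alternative 2)
markers = {q: int(np.searchsorted(cum, q)) for q in (0.50, 0.75, 0.90)}
```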

#### 3. Mode Annotation
- Show the actual peak (mode) position
- Mark the top-5 highest probability words
- Distinguish between statistical mean and practical selection zone

#### 4. Selection Concentration Metric
```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
```

## Difficulty Differentiation Failure

### Expected Pattern
Different difficulty levels should show visually distinct probability distribution patterns:
- **Easy**: Steep peak at common words, rapid falloff  
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words

### Observed Pattern  
All difficulties show similar exponential decay curves with:
- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position

### Quantitative Evidence
```
Similarity scores of top words (all difficulties):
TECHNOLOGY:     0.95+ similarity to "technology" 
TECH:           0.90+ similarity to "technology"
MULTIMEDIA:     0.85+ similarity to "technology"

These high semantic matches dominate regardless of their frequency percentiles.
```

## Recommended Fixes

### 1. Multiplicative Scoring (Immediate Fix)
Replace additive formula with multiplicative gates:

```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# Proposed (multiplicative)  
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier

# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```

**Effect**: Frequency acts as a gate rather than just another score component.
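A runnable sketch of the gate follows. The `get_frequency_modifier` body here is hypothetical: the Gaussian width, the target percentiles, and the exact 0.1-1.2 rescaling are assumptions for illustration, not the real implementation:

```python
import math

def get_frequency_modifier(percentile: float, difficulty: str) -> float:
    """Hypothetical gate: a Gaussian bump at each difficulty's target percentile,
    rescaled into [0.1, 1.2] so frequency can suppress or mildly boost a word."""
    targets = {"easy": 0.9, "medium": 0.5, "hard": 0.2}  # targets from the theory section
    gauss = math.exp(-((percentile - targets[difficulty]) ** 2) / (2 * 0.2 ** 2))
    return 0.1 + 1.1 * gauss

# High similarity in the wrong frequency band vs. lower similarity in the right band
gated_wrong = 0.9 * get_frequency_modifier(0.1, "easy")
gated_right = 0.7 * get_frequency_modifier(0.9, "easy")
```

With these assumed numbers the off-band word is suppressed almost to zero rather than merely penalized, which is the gating behavior the additive formula cannot produce.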

### 2. Two-Stage Filtering (Structural Fix)
```python
# Stage 1: Partition candidates by frequency percentile range
pools = {
    "easy":   [w for w in candidates if w.percentile > 0.7],           # common words
    "medium": [w for w in candidates if 0.3 < w.percentile <= 0.7],    # mid-frequency words
    "hard":   [w for w in candidates if w.percentile <= 0.3],          # rare words
}

# Stage 2: Rank the pool for the requested difficulty by similarity alone
selected = softmax_selection(pools[difficulty], similarity_only=True)
```

**Effect**: Guarantees different frequency pools for each difficulty, then optimizes within each pool.

### 3. Exponential Temperature Scaling (Parameter Fix)
Use different temperature values by difficulty to create more distinct distributions:

```python
easy_temperature = 0.1    # Very deterministic (sharp peak)
medium_temperature = 0.3  # Moderate randomness
hard_temperature = 0.2    # Deterministic but different peak
```
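To see why temperature changes the distribution's shape, here is a minimal softmax with the standard numerical-stability trick; the score values are illustrative:

```python
import numpy as np

def softmax(scores: np.ndarray, temperature: float) -> np.ndarray:
    z = scores / temperature
    z -= z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([0.95, 0.90, 0.85, 0.70])    # illustrative similarity scores
sharp = softmax(scores, temperature=0.1)        # low temperature: top word dominates
flat = softmax(scores, temperature=0.3)         # higher temperature: mass spreads out
```

Lower temperature concentrates probability on the top-scoring word, which is what makes the easy distribution sharper than the medium one.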

### 4. Adaptive Frequency Weights (Dynamic Fix)
```python
# Calculate frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity  # e.g., 0.95 - 0.6 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff)  # e.g., 0.35/0.65 ≈ 0.54

# Use higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```
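The rule above can be packaged as a small helper. The cap of 0.8 and the spread arithmetic come from the snippet; the function name and signature are illustrative:

```python
def adaptive_frequency_weight(similarities, cap=0.8):
    """Widen the frequency weight when the similarity spread is wide.
    Mirrors the arithmetic above; the name is illustrative."""
    spread = max(similarities) - min(similarities)
    if spread >= 1.0:                 # degenerate spread: fall back to the cap
        return cap
    return min(cap, spread / (1.0 - spread))
```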

## Empirical Data Summary

### Word Selection Patterns (Technology Topic)
```
Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high) 
- TECHNOLOGICAL (percentile: ?, similarity: high)

Hard Mode Top Selections:  
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)
```

### Statistical Summary
- **σ Width Variation**: Easy (33.4) vs Medium (42.9) vs Hard (40.2) - only 28% difference
- **Peak Variation**: 1.5% to 4.1% - moderate difference
- **Mean Position Variation**: Position 37 to 60 - 62% range but all in middle zone
- **Selection Concentration**: Most selections from first 30 words in all difficulties

## Conclusions

### The Core Problem
The difficulty-aware word selection system is theoretically sound but practically ineffective because:

1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness  
3. **Statistical visualization assumes normal distributions** but data is exponentially skewed

### Success Metrics for Fixes
A working system should show:

1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones

### Next Steps
1. Implement multiplicative scoring or two-stage filtering
2. Update visualization to use percentiles instead of μ ± σ
3. Collect empirical data on word frequency percentiles in actual selections
4. Validate fixes show distinct patterns across difficulties

---

*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*