# Probability Distribution Analysis: Theory vs. Practice

## Executive Summary

This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.

### Key Findings

- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Applying normal-distribution concepts (μ ± σ) to exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur

## Observed Probability Distributions

### Data Source: Technology Topic Analysis

Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:

```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```

### Empirical Results

#### Easy Difficulty

```
Mean Position:          Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone:        70.5% of probability mass
σ Range:                Words #9-#76
Top Probability:        2.3%
```

#### Medium Difficulty

```
Mean Position:          Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone:        61.0% of probability mass
σ Range:                Words #17-#103
Top Probability:        1.5%
```

#### Hard Difficulty

```
Mean Position:          Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone:        82.1% of probability mass
σ Range:                Words #1-#77
Top Probability:        4.1%
```

### Critical Observation

**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates that the frequency-based difficulty targeting is not working as intended.
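The exponential decay itself is expected once you work through the math: softmax over roughly linearly decreasing composite scores produces geometrically decaying probabilities. A minimal sketch on synthetic scores (the score range and helper name here are illustrative, not the project's actual API) shows the characteristic shape:

```python
import numpy as np

def softmax_probs(scores, temperature=0.2):
    """Convert composite scores to selection probabilities (illustrative)."""
    z = (scores - np.max(scores)) / temperature  # stabilize before exponentiating
    expz = np.exp(z)
    return expz / expz.sum()

# 150 candidates whose composite scores fall off roughly linearly,
# mimicking a ranked similarity list (synthetic data, not real output).
scores = np.linspace(0.95, 0.40, 150)
probs = softmax_probs(scores, temperature=0.2)

print(f"Top probability:   {probs[0]:.3f}")
print(f"Mass in first 20:  {probs[:20].sum():.2f}")
print(f"Mean position (μ): {np.sum(np.arange(150) * probs):.1f}")
```

Even with these made-up scores, the mass concentrates in the first few dozen ranks while μ lands well to the right of the peak, which is exactly the pattern reported above.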
## Statistical Misconceptions in Current Approach

### The Mode-Mean Divergence Problem

The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This reveals a fundamental statistical concept:

```
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median:      Position ~15 (where 50% of probability mass is reached)
Mean (μ):    Position 37-60 (weighted average position)
```

### Why μ is "Wrong" for Understanding Selection

In an exponential distribution with a long tail:

1. **Mode (0-3)**: Where individual words have the highest probability
2. **Practical sampling zone**: The first 10-20 words contain ~60-80% of the probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities

The mean doesn't represent where sampling actually occurs; it is mathematically correct but practically misleading.

### Standard Deviation Misapplication

The σ visualization assumes a normal distribution:

- **Normal assumption**: μ ± σ contains ~68% of the probability mass
- **Our reality**: An exponential distribution where μ ± σ often misses the high-probability words entirely

For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.

## Actual vs. Expected Behavior Analysis

### What Should Happen (Theory)

According to the composite scoring algorithm:

- **Easy**: Gaussian peak at the 90th percentile → common words dominate
- **Medium**: Gaussian peak at the 50th percentile → balanced selection
- **Hard**: Gaussian peak at the 20th percentile → rare words favored

### What Actually Happens (Empirical)

```
Easy:   MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING...    (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard:   TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN...    (still similar)
```

**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.
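The mode-median-mean divergence described above is easy to reproduce on a synthetic exponentially decaying distribution (illustrative values, not the debug tab's real data):

```python
import numpy as np

# Synthetic exponentially decaying probabilities over 150 ranked words.
ranks = np.arange(150)
probs = 0.98 ** ranks
probs /= probs.sum()

mode = int(np.argmax(probs))                          # highest single-word probability
median = int(np.searchsorted(np.cumsum(probs), 0.5))  # rank where 50% of mass is reached
mean = float(np.sum(ranks * probs))                   # probability-weighted average rank

print(mode, median, mean)
```

The mode sits at rank 0, the median lands in the low-to-mid ranks, and the mean is dragged far right by the long tail: three different "centers" for the same distribution.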
### Root Cause Analysis

The problem isn't in the Gaussian curves; they work correctly. The issue is in the composite formula:

```python
# Current approach (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# What happens with idealized data:
# High-similarity word:   similarity=0.9, wrong_freq_score=0.1
#   → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
#   → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```

Taken at face value, these numbers suggest a perfectly frequency-aligned word can win. In practice it rarely does: the Gaussian frequency curves are broad, so even frequency-inappropriate words score well above 0.1, while the similarity gaps between on-topic and off-topic words are large. Under the additive blend, a word therefore still needs **very high similarity** to compete with high-similarity words that have wrong frequency profiles.

## Sampling Mechanics Deep Dive

### np.random.choice Behavior

The selection uses `np.random.choice` with:

- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities
- **Sample size**: 10 words from 150 candidates

### Where Selections Actually Occur

Despite μ being at position 37-60, most actual selections come from positions 0-30 because:

1. **High probabilities concentrate early**: The first 20 words often hold 60%+ of the total probability
2. **Without-replacement effect**: Once high-probability words are chosen, selection moves to the next highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely

This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.

## Better Visualization Approaches

### Current Problems

- **μ ± σ assumes normality**: Not applicable to exponential distributions
- **Mean position is misleading**: It doesn't show where selection actually occurs
- **Standard deviation is meaningless**: For highly skewed distributions
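How little μ ± σ tells you here can be checked directly: for a geometrically decaying distribution (synthetic data below, chosen to roughly match the observed top probabilities), the μ ± σ band starts past the highest-probability ranks and excludes the mode entirely:

```python
import numpy as np

# Synthetic exponentially decaying probabilities (illustrative, not real output).
ranks = np.arange(150)
probs = 0.98 ** ranks
probs /= probs.sum()

mu = np.sum(ranks * probs)
sigma = np.sqrt(np.sum((ranks - mu) ** 2 * probs))

lo, hi = max(0, int(mu - sigma)), min(149, int(mu + sigma))
in_band = probs[lo:hi + 1].sum()  # mass inside μ ± σ
head = probs[:lo].sum()           # mass before the band even starts

print(f"μ ± σ = [{lo}, {hi}] holds {in_band:.0%} of the mass")
print(f"{head:.0%} of the mass sits before position {lo}")
```

The band does hold a normal-looking share of the mass, but the words it skips at the front are precisely the ones most likely to be sampled, which is why μ ± σ is the wrong summary for this data.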
### Recommended Alternatives

#### 1. Cumulative Probability Visualization

```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```

#### 2. Percentile Markers Instead of μ ± σ

```
P50 (Median): Position where 50% of probability mass is reached
P75:          Position where 75% of probability mass is reached
P90:          Position where 90% of probability mass is reached
```

#### 3. Mode Annotation

- Show the actual peak (mode) position
- Mark the top-5 highest-probability words
- Distinguish between the statistical mean and the practical selection zone

#### 4. Selection Concentration Metric

```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration:   Gini coefficient of the probability distribution
```

## Difficulty Differentiation Failure

### Expected Pattern

Different difficulty levels should show visually distinct probability distributions:

- **Easy**: Steep peak at common words, rapid falloff
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words

### Observed Pattern

All difficulties show similar exponential decay curves with:

- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position

### Quantitative Evidence

```
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH:       0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"

These high semantic matches dominate regardless of their frequency percentiles.
```
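The percentile markers proposed earlier are one `np.cumsum`/`np.searchsorted` away; a sketch on the same kind of synthetic exponentially decaying probabilities:

```python
import numpy as np

# Synthetic exponentially decaying selection probabilities (illustrative).
probs = 0.98 ** np.arange(150)
probs /= probs.sum()

cumulative = np.cumsum(probs)
# First rank at which each cumulative-mass threshold is reached.
p50, p75, p90 = (int(np.searchsorted(cumulative, q)) for q in (0.50, 0.75, 0.90))

print(f"P50 at position {p50}, P75 at {p75}, P90 at {p90}")
```

Unlike μ ± σ, these markers are meaningful for any distribution shape, and they directly answer the practical question "how far into the ranked list does selection actually reach?"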
## Recommended Fixes

### 1. Multiplicative Scoring (Immediate Fix)

Replace the additive formula with multiplicative gates:

```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```

**Effect**: Frequency acts as a gate rather than just another score component.

### 2. Two-Stage Filtering (Structural Fix)

```python
# Stage 1: filter by frequency percentile range
easy_candidates   = [w for w in candidates if w.percentile >= 0.7]        # common words
medium_candidates = [w for w in candidates if 0.3 <= w.percentile < 0.7]  # mid-frequency words
hard_candidates   = [w for w in candidates if w.percentile < 0.3]         # rare words

# Stage 2: rank the pool for the requested difficulty by similarity alone
pools = {"easy": easy_candidates, "medium": medium_candidates, "hard": hard_candidates}
selected = softmax_selection(pools[difficulty], similarity_only=True)
```

**Effect**: Guarantees a different frequency pool for each difficulty, then optimizes within that pool.

### 3. Exponential Temperature Scaling (Parameter Fix)

Use different temperature values per difficulty to create more distinct distributions:

```python
easy_temperature   = 0.1  # very deterministic (sharp peak)
medium_temperature = 0.3  # moderate randomness
hard_temperature   = 0.2  # deterministic, but over a different peak
```
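The effect of temperature is easy to demonstrate: lower temperature concentrates the softmax mass into the first few ranks. A sketch on synthetic ranked scores (the temperatures mirror the values above; the score range is an assumption):

```python
import numpy as np

def softmax(scores, temperature):
    """Temperature-scaled softmax over a score vector."""
    z = np.exp((scores - scores.max()) / temperature)
    return z / z.sum()

scores = np.linspace(0.95, 0.40, 150)  # synthetic ranked composite scores

for t in (0.1, 0.2, 0.3):
    probs = softmax(scores, t)
    print(f"T={t}: top-10 words hold {probs[:10].sum():.0%} of the mass")
```

Temperature alone sharpens or flattens the same ranking, though; it cannot move the peak to different words, so it complements rather than replaces fixes 1 and 2.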
### 4. Adaptive Frequency Weights (Dynamic Fix)

```python
# Calculate the frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity  # e.g., 0.95 - 0.60 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff)  # e.g., 0.35 / 0.65 ≈ 0.54

# Use a higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```

## Empirical Data Summary

### Word Selection Patterns (Technology Topic)

```
Easy Mode Top Selections:
- MULTIMEDIA    (percentile: ?, similarity: high)
- IMPLEMENT     (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)

Hard Mode Top Selections:
- TECH          (percentile: ?, similarity: very high)
- DIGITISATION  (percentile: likely low, similarity: high)
- TECHNICIAN    (percentile: ?, similarity: high)
```

### Statistical Summary

- **σ width variation**: Easy (33.4) vs. Medium (42.9) vs. Hard (40.2) - only a 28% spread
- **Peak variation**: 1.5% to 4.1% - a moderate difference
- **Mean position variation**: Position 37 to 60 - a 62% range, but all in the middle zone
- **Selection concentration**: Most selections come from the first 30 words at every difficulty

## Conclusions

### The Core Problem

The difficulty-aware word selection system is theoretically sound but practically ineffective because:

1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness
3. **The statistical visualization assumes normal distributions**, but the data is exponentially skewed

### Success Metrics for Fixes

A working system should show:

1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones
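The additive-dominance problem and the proposed multiplicative fix can be contrasted with a small numerical check. All scores and modifiers below are hypothetical: in particular, the `0.8` frequency score assumes that the broad Gaussian curves rarely score an off-target word much lower, which is an assumption about the data, not a measured value:

```python
# Hypothetical scores: similarity spans a wide range while the broad
# Gaussian frequency curve keeps even "wrong" words near the middle.
sim_hi, freq_hi = 0.95, 0.8  # very similar word, wrong frequency band
sim_lo, freq_lo = 0.60, 1.0  # less similar word, perfect frequency fit

# Additive 50/50 blend: the similarity gap outweighs the frequency penalty
additive_hi = 0.5 * sim_hi + 0.5 * freq_hi  # high-similarity word wins
additive_lo = 0.5 * sim_lo + 0.5 * freq_lo

# Multiplicative gate (hypothetical modifiers in the proposed 0.1-1.2 range)
mult_hi = sim_hi * 0.3  # wrong frequency band suppresses the word
mult_lo = sim_lo * 1.2  # perfect frequency fit boosts it

print(additive_hi, additive_lo, mult_hi, mult_lo)
```

Under the additive blend the ranking is unchanged by frequency; under the multiplicative gate it flips, which is exactly the behavior the difficulty targeting needs.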
### Next Steps

1. Implement multiplicative scoring or two-stage filtering
2. Update the visualization to use percentiles instead of μ ± σ
3. Collect empirical data on the frequency percentiles of actually selected words
4. Validate that the fixes produce distinct patterns across difficulties

---

*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*