# Probability Distribution Analysis: Theory vs. Practice
## Executive Summary
This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.
### Key Findings
- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Using normal distribution concepts (μ ± σ) on exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur
## Observed Probability Distributions
### Data Source: Technology Topic Analysis
Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:
```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```
### Empirical Results
#### Easy Difficulty
```
Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%
```
#### Medium Difficulty
```
Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass
σ Range: Words #17-#103
Top Probability: 1.5%
```
#### Hard Difficulty
```
Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77
Top Probability: 4.1%
```
### Critical Observation
**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.
## Statistical Misconceptions in Current Approach
### The Mode-Mean Divergence Problem
The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This gap is exactly what a highly skewed distribution produces:
```
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median: Position ~15 (Where 50% of probability mass is reached)
Mean (μ): Position 37-60 (Weighted average position)
```
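This mode/median/mean divergence can be reproduced on synthetic data. The sketch below assumes an illustrative decay rate of 0.05; it is not the system's measured distribution:

```python
import numpy as np

# Illustrative sketch: a synthetic exponentially decaying probability
# vector (the decay rate is an assumption, not measured data).
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()                     # normalize into a distribution

mode = int(np.argmax(probs))             # highest-probability position
cdf = np.cumsum(probs)
median = int(np.searchsorted(cdf, 0.5))  # where 50% of mass is reached
mean = float(positions @ probs)          # probability-weighted average
# mode << median << mean, as in the observed distributions
```

Even with this mild decay, the mode sits at position 0 while the mean lands near position 19, mirroring the divergence seen in the debug visualization.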
### Why μ is "Wrong" for Understanding Selection
In an exponential distribution with long tail:
1. **Mode (0-3)**: Where individual words have highest probability
2. **Practical sampling zone**: First 10-20 words contain ~60-80% of probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities
The mean doesn't represent where sampling actually occurs—it's mathematically correct but practically misleading.
### Standard Deviation Misapplication
The σ visualization assumes a normal distribution where:
- **Normal assumption**: μ ± σ contains ~68% of probability mass
- **Our reality**: An exponential distribution in which μ ± σ often misses the high-probability words entirely
For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
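This can be checked directly on a synthetic exponential-decay vector (illustrative parameters, not the system's data): the mass inside μ ± σ lands nowhere near the ~68% a normal distribution would predict.

```python
import numpy as np

# Sketch with an assumed decay rate of 0.05: how much probability
# mass actually falls inside mean ± one standard deviation?
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()

mean = float(positions @ probs)
std = float(np.sqrt(((positions - mean) ** 2) @ probs))
in_band = float(probs[(positions >= mean - std) & (positions <= mean + std)].sum())
# in_band is ~0.87 here, not the 0.68 a normal distribution implies
```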
## Actual vs. Expected Behavior Analysis
### What Should Happen (Theory)
According to the composite scoring algorithm:
- **Easy**: Gaussian peak at 90th percentile → common words dominate
- **Medium**: Gaussian peak at 50th percentile → balanced selection
- **Hard**: Gaussian peak at 20th percentile → rare words favored
### What Actually Happens (Empirical)
```
Easy: MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard: TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)
```
**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.
### Root Cause Analysis
The problem isn't in the Gaussian curves—they work correctly. The issue is in the composite formula:
```python
# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score
# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
# → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
# → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```
In this idealized arithmetic the frequency-aligned word wins, but empirically the top candidates all carry very high similarity (0.85-0.95, per the quantitative evidence below), so similarity differences swamp the frequency term and high-similarity words dominate regardless of their frequency profiles.
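The observed dominance is easy to reproduce once score ranges are compressed. The values below are illustrative assumptions, not measurements:

```python
# Illustrative sketch (scores are assumed, not measured): when
# frequency scores occupy a narrow band while similarities differ,
# the additive composite is ordered almost entirely by similarity.
def composite(similarity, frequency_score, w=0.5):
    return w * similarity + (1 - w) * frequency_score

# (word, similarity, frequency_score) with a compressed frequency range
words = [("TECHNOLOGY", 0.95, 0.55),
         ("TECH", 0.90, 0.45),
         ("DIGITISATION", 0.70, 0.60)]
ranked = sorted(words, key=lambda t: composite(t[1], t[2]), reverse=True)
# ranked order matches similarity order: TECHNOLOGY, TECH, DIGITISATION
```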
## Sampling Mechanics Deep Dive
### np.random.choice Behavior
The selection uses `np.random.choice` with:
- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities
- **Sample size**: 10 words from 150 candidates
### Where Selections Actually Occur
Despite μ being at position 37-60, most actual selections come from positions 0-30 because:
1. **High probabilities concentrate early**: First 20 words often have 60%+ of total probability
2. **Without replacement effect**: Once high-probability words are chosen, selection moves to next-highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely
This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
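The left-clustering of selections can be simulated with the same sampling call. The decay rate and seed below are assumptions for illustration:

```python
import numpy as np

# Sketch: 10 draws without replacement from a synthetic exponentially
# decaying distribution, mirroring the selection mechanics described above.
rng = np.random.default_rng(0)
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()

picks = rng.choice(positions, size=10, replace=False, p=probs)
mean_position = float(positions @ probs)  # ~19 for this synthetic data
# Most picks land well left of where the mean would suggest
```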
## Better Visualization Approaches
### Current Problems
- **μ ± σ assumes normality**: Not applicable to exponential distributions
- **Mean position misleading**: Doesn't show where selection actually occurs
- **Standard deviation meaningless**: For highly skewed distributions
### Recommended Alternatives
#### 1. Cumulative Probability Visualization
```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```
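A table like the one above is a one-line cumulative sum. The sketch uses synthetic probabilities; the real numbers would come from the debug tab's distribution:

```python
import numpy as np

# Sketch: cumulative probability mass covered by the first N
# candidate positions (assumed decay rate, not the system's data).
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
cdf = np.cumsum(probs)
coverage = {n: round(float(cdf[n - 1]), 2) for n in (10, 20, 30, 50)}
```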
#### 2. Percentile Markers Instead of μ ± σ
```
P50 (Median): Position where 50% of probability mass is reached
P75: Position where 75% of probability mass is reached
P90: Position where 90% of probability mass is reached
```
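Percentile markers fall out of the same cumulative distribution; a sketch on the synthetic probabilities used above:

```python
import numpy as np

# Sketch: percentile marker positions for a synthetic exponentially
# decaying distribution (assumed decay rate, not measured data).
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
cdf = np.cumsum(probs)
# First position at which each cumulative threshold is reached
p50, p75, p90 = (int(np.searchsorted(cdf, q)) for q in (0.5, 0.75, 0.9))
```

Unlike μ ± σ, these markers track the actual selection zone: for this synthetic data, half the probability mass sits in the first 14 positions.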
#### 3. Mode Annotation
- Show the actual peak (mode) position
- Mark the top-5 highest probability words
- Distinguish between statistical mean and practical selection zone
#### 4. Selection Concentration Metric
```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
```
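A minimal sketch of the Gini-based concentration metric, using the standard sorted-values formula on a probability vector:

```python
import numpy as np

# Sketch: Gini coefficient of a probability vector as a concentration
# metric (0 = perfectly uniform, approaching 1 = all mass on one word).
def gini(p):
    p = np.sort(np.asarray(p, dtype=float))   # ascending
    n = len(p)
    index = np.arange(1, n + 1)
    return float(((2 * index - n - 1) @ p) / (n * p.sum()))

uniform = np.full(150, 1 / 150)
decaying = np.exp(-0.05 * np.arange(150))
decaying /= decaying.sum()
# gini(uniform) ≈ 0; gini(decaying) is substantially higher
```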
## Difficulty Differentiation Failure
### Expected Pattern
Different difficulty levels should show visually distinct probability distribution patterns:
- **Easy**: Steep peak at common words, rapid falloff
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words
### Observed Pattern
All difficulties show similar exponential decay curves with:
- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position
### Quantitative Evidence
```
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH: 0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"
```

These high semantic matches dominate regardless of their frequency percentiles.
## Recommended Fixes
### 1. Multiplicative Scoring (Immediate Fix)
Replace additive formula with multiplicative gates:
```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score
# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```
**Effect**: Frequency acts as a gate rather than just another score component.
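One plausible shape for the gate is sketched below. The Gaussian form, targets, and width are assumptions; the real `get_frequency_modifier` may differ:

```python
import math

# Hypothetical sketch of the frequency gate. Targets follow the
# documented percentile goals (easy 90th, medium 50th, hard 20th);
# the Gaussian shape and width are assumptions.
DIFFICULTY_TARGETS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}

def get_frequency_modifier(percentile, difficulty, width=0.25):
    """Gaussian gate around the difficulty's target percentile,
    rescaled into the 0.1-1.2 range mentioned above."""
    target = DIFFICULTY_TARGETS[difficulty]
    gate = math.exp(-((percentile - target) ** 2) / (2 * width ** 2))
    return 0.1 + 1.1 * gate   # 0.1 = bad fit, 1.2 = perfect fit

# A frequency-appropriate word now beats a misfit despite lower similarity
misfit = 0.90 * get_frequency_modifier(0.10, "easy")  # gated hard
fit = 0.70 * get_frequency_modifier(0.90, "easy")     # boosted
```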
### 2. Two-Stage Filtering (Structural Fix)
```python
# Stage 1: partition candidates into frequency-percentile pools
easy_candidates = [w for w in candidates if w.percentile > 0.7]           # common words
medium_candidates = [w for w in candidates if 0.3 < w.percentile <= 0.7]  # medium words
hard_candidates = [w for w in candidates if w.percentile <= 0.3]          # rare words

# Stage 2: rank the pool matching the requested difficulty by similarity
filtered_candidates = easy_candidates  # or medium/hard, per request
selected = softmax_selection(filtered_candidates, similarity_only=True)
```
**Effect**: Guarantees different frequency pools for each difficulty, then optimizes within each pool.
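A self-contained sketch of the two-stage idea follows. The `Word` class, its fields, and the example data are hypothetical, not the project's actual types:

```python
import math
import random
from dataclasses import dataclass

# Hypothetical types and data, for illustration only.
@dataclass
class Word:
    text: str
    similarity: float
    percentile: float

def pool_for(difficulty, candidates):
    """Stage 1: difficulty-specific frequency pool."""
    if difficulty == "easy":
        return [w for w in candidates if w.percentile > 0.7]
    if difficulty == "hard":
        return [w for w in candidates if w.percentile <= 0.3]
    return [w for w in candidates if 0.3 < w.percentile <= 0.7]

def softmax_select(pool, k, temperature=0.2, seed=0):
    """Stage 2: similarity-only softmax sampling without replacement."""
    rng = random.Random(seed)
    pool = list(pool)
    weights = [math.exp(w.similarity / temperature) for w in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        i = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)   # without replacement
    return chosen

candidates = [Word("TECHNOLOGY", 0.95, 0.95), Word("TECH", 0.90, 0.90),
              Word("DIGITISATION", 0.70, 0.10), Word("TELEMATICS", 0.65, 0.05)]
hard_picks = softmax_select(pool_for("hard", candidates), k=2)
# hard picks can only come from the rare-word pool
```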
### 3. Exponential Temperature Scaling (Parameter Fix)
Use different temperature values by difficulty to create more distinct distributions:
```python
easy_temperature = 0.1 # Very deterministic (sharp peak)
medium_temperature = 0.3 # Moderate randomness
hard_temperature = 0.2 # Deterministic but different peak
```
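The sharpening effect of a lower temperature is easy to see on a small example (scores below are illustrative, not measured):

```python
import numpy as np

# Sketch: the same composite scores under different temperatures.
def softmax(scores, temperature):
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [0.9, 0.8, 0.7, 0.6]
sharp = softmax(scores, 0.1)  # easy: most mass on the top word
soft = softmax(scores, 0.3)   # medium: flatter distribution
```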
### 4. Adaptive Frequency Weights (Dynamic Fix)
```python
# Calculate frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity # e.g., 0.95 - 0.6 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff) # e.g., 0.35/0.65 ≈ 0.54
# Use higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```
## Empirical Data Summary
### Word Selection Patterns (Technology Topic)
```
Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)
Hard Mode Top Selections:
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)
```
### Statistical Summary
- **σ Width Variation**: Easy (33.4) vs Medium (42.9) vs Hard (40.2) - only 28% difference
- **Peak Variation**: 1.5% to 4.1% - moderate difference
- **Mean Position Variation**: Position 37 to 60 - 62% range but all in middle zone
- **Selection Concentration**: Most selections from first 30 words in all difficulties
## Conclusions
### The Core Problem
The difficulty-aware word selection system is theoretically sound but practically ineffective because:
1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness
3. **Statistical visualization assumes normal distributions** but data is exponentially skewed
### Success Metrics for Fixes
A working system should show:
1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones
### Next Steps
1. Implement multiplicative scoring or two-stage filtering
2. Update visualization to use percentiles instead of μ ± σ
3. Collect empirical data on word frequency percentiles in actual selections
4. Validate fixes show distinct patterns across difficulties
---
*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*