# Probability Distribution Analysis: Theory vs. Practice
## Executive Summary
This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.
### Key Findings
- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Applying normal-distribution concepts (μ ± σ) to exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur
## Observed Probability Distributions
### Data Source: Technology Topic Analysis
Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:
```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```
### Empirical Results
#### Easy Difficulty
```
Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%
```
#### Medium Difficulty
```
Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass
σ Range: Words #17-#103
Top Probability: 1.5%
```
#### Hard Difficulty
```
Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77
Top Probability: 4.1%
```
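The summary statistics above can be reproduced from any probability vector. A minimal sketch of the computation, using synthetic composite scores rather than the actual Technology data (the score range and seed are illustrative assumptions):

```python
import numpy as np

def distribution_stats(probs):
    """Mean position, distribution width (sigma), and mass inside mu +/- sigma."""
    positions = np.arange(len(probs))
    mean = float(np.sum(positions * probs))
    sigma = float(np.sqrt(np.sum(probs * (positions - mean) ** 2)))
    lo, hi = max(0, int(mean - sigma)), min(len(probs) - 1, int(mean + sigma))
    zone = float(np.sum(probs[lo:hi + 1]))  # the "sigma sampling zone" mass
    return mean, sigma, zone

# Synthetic composite scores, sorted descending as in the debug view
rng = np.random.default_rng(0)
scores = np.sort(rng.uniform(0.4, 0.95, 150))[::-1]
probs = np.exp(scores / 0.2 - (scores / 0.2).max())  # softmax at T=0.2
probs /= probs.sum()

mean, sigma, zone = distribution_stats(probs)
print(f"Mean Position: {mean:.1f}, sigma: {sigma:.1f}, sigma zone: {zone:.1%}")
```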
### Critical Observation
**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.
## Statistical Misconceptions in Current Approach
### The Mode-Mean Divergence Problem
The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This reveals a fundamental statistical concept:
```
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median: Position ~15 (where 50% of probability mass is reached)
Mean (μ): Position 37-60 (weighted average position)
```
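The divergence is easy to reproduce on a synthetic exponentially decaying distribution (the decay rate here is an assumption chosen for illustration, not fitted to the measured data):

```python
import numpy as np

positions = np.arange(150)
probs = np.exp(-0.03 * positions)  # assumed decay rate
probs /= probs.sum()

mode = int(np.argmax(probs))                          # peak position
median = int(np.searchsorted(np.cumsum(probs), 0.5))  # where 50% mass is reached
mean = float(np.sum(positions * probs))               # weighted average position

print(f"mode={mode}, median={median}, mean={mean:.1f}")  # mode < median < mean
```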
### Why μ Is "Wrong" for Understanding Selection
In an exponential distribution with a long tail:
1. **Mode (0-3)**: Where individual words have the highest probability
2. **Practical sampling zone**: The first 10-20 words contain ~60-80% of the probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities
The mean doesn't represent where sampling actually occurs; it is mathematically correct but practically misleading.
### Standard Deviation Misapplication
The σ visualization assumes a normal distribution, where:
- **Normal assumption**: μ ± σ contains ~68% of the probability mass
- **Our reality**: An exponential distribution where the μ ± σ band often misses the highest-probability words entirely
For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
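A quick check makes this concrete: on the same kind of synthetic exponential decay, the μ ± σ band neither contains ~68% of the mass nor includes the highest-probability positions:

```python
import numpy as np

positions = np.arange(150)
probs = np.exp(-0.03 * positions)  # assumed decay rate
probs /= probs.sum()

mean = float(np.sum(positions * probs))
sigma = float(np.sqrt(np.sum(probs * (positions - mean) ** 2)))
band = (positions >= mean - sigma) & (positions <= mean + sigma)
in_band = float(probs[band].sum())

# A normal distribution would put ~68% here; the exponential gives a
# different mass, and the band starts above position 0 (the mode).
print(f"mu={mean:.1f}, sigma={sigma:.1f}, mass in band={in_band:.1%}")
```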
## Actual vs. Expected Behavior Analysis
### What Should Happen (Theory)
According to the composite scoring algorithm:
- **Easy**: Gaussian peak at 90th percentile → common words dominate
- **Medium**: Gaussian peak at 50th percentile → balanced selection
- **Hard**: Gaussian peak at 20th percentile → rare words favored
### What Actually Happens (Empirical)
```
Easy: MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard: TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)
```
**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.
### Root Cause Analysis
The problem isn't in the Gaussian curves; they work correctly. The issue is in the composite formula:
```python
# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score

# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
#   → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
#   → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```
In this contrived case the frequency term does flip the ranking, but only because the frequency scores sit at the extremes (0.1 vs. 1.0). In real candidate pools the Gaussian frequency scores rarely spread that widely, so frequency differences are usually too small to offset similarity differences, and high-similarity words with wrong frequency profiles still dominate.
## Sampling Mechanics Deep Dive
### np.random.choice Behavior
The selection uses `np.random.choice` with:
- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities
- **Sample size**: 10 words from 150 candidates
### Where Selections Actually Occur
Despite μ being at position 37-60, most actual selections come from positions 0-30 because:
1. **High probabilities concentrate early**: First 20 words often have 60%+ of total probability
2. **Without replacement effect**: Once high-probability words are chosen, selection moves to next-highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely
This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
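This concentration effect can be simulated with the same kind of `np.random.choice` draw; the probabilities below are synthetic (assumed decay rate), but the behavior is generic:

```python
import numpy as np

rng = np.random.default_rng(42)
positions = np.arange(150)
probs = np.exp(-0.03 * positions)  # assumed decay rate
probs /= probs.sum()

# Repeat the 10-from-150 without-replacement draw many times
draws = [rng.choice(positions, size=10, replace=False, p=probs) for _ in range(1000)]
selected = np.concatenate(draws)

frac_first_30 = float(np.mean(selected < 30))
print(f"fraction of selections from positions 0-29: {frac_first_30:.1%}")
```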
## Better Visualization Approaches
### Current Problems
- **μ ± σ assumes normality**: Not applicable to exponential distributions
- **Mean position misleading**: Doesn't show where selection actually occurs
- **Standard deviation meaningless**: For highly skewed distributions
### Recommended Alternatives
#### 1. Cumulative Probability Visualization
```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```
#### 2. Percentile Markers Instead of μ ± σ
```
P50 (Median): Position where 50% of probability mass is reached
P75: Position where 75% of probability mass is reached
P90: Position where 90% of probability mass is reached
```
#### 3. Mode Annotation
- Show the actual peak (mode) position
- Mark the top-5 highest probability words
- Distinguish between statistical mean and practical selection zone
#### 4. Selection Concentration Metric
```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
```
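These alternatives are all cheap to compute from the cumulative distribution; #3, the mode, is simply `np.argmax(probs)`. A sketch over the same synthetic probabilities (the Gini computation is the standard discrete formula):

```python
import numpy as np

probs = np.exp(-0.03 * np.arange(150))  # assumed decay rate
probs /= probs.sum()
cum = np.cumsum(probs)

# 1. Cumulative probability at fixed cutoffs
for n in (10, 20, 30, 50):
    print(f"First {n} words: {cum[n - 1]:.0%} of total probability mass")

# 2. Percentile markers: first position where each mass threshold is reached
p50, p75, p90 = (int(np.searchsorted(cum, q)) for q in (0.5, 0.75, 0.9))
print(f"P50={p50}, P75={p75}, P90={p90}")

# 4. Selection concentration: Gini coefficient of the probability vector
sorted_p = np.sort(probs)  # ascending
n = len(sorted_p)
gini = (2 * np.sum(np.arange(1, n + 1) * sorted_p) - (n + 1)) / n
print(f"Gini: {gini:.2f}")
```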
## Difficulty Differentiation Failure
### Expected Pattern
Different difficulty levels should show visually distinct probability distribution patterns:
- **Easy**: Steep peak at common words, rapid falloff
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words
### Observed Pattern
All difficulties show similar exponential decay curves with:
- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position
### Quantitative Evidence
```
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH: 0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"
```
These high semantic matches dominate regardless of their frequency percentiles.
## Recommended Fixes
### 1. Multiplicative Scoring (Immediate Fix)
Replace additive formula with multiplicative gates:
```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```
**Effect**: Frequency acts as a gate rather than just another score component.
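`get_frequency_modifier` is left undefined above; one hypothetical shape for it, keeping the proposed 0.1-1.2 range, is a Gaussian bump around each difficulty's target percentile (the targets and width below are illustrative assumptions, not the project's actual values):

```python
import math

# Hypothetical per-difficulty target percentiles
DIFFICULTY_TARGETS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}

def get_frequency_modifier(percentile, difficulty, width=0.25):
    """Gaussian bump from ~0.1 (far off-target) up to ~1.2 (on-target)."""
    target = DIFFICULTY_TARGETS[difficulty]
    bump = math.exp(-((percentile - target) ** 2) / (2 * width ** 2))
    return 0.1 + 1.1 * bump

# A rare, high-similarity word on easy mode is gated down hard...
high_sim = 0.9 * get_frequency_modifier(0.1, "easy")
# ...so a common, medium-similarity word now outscores it
good_fit = 0.7 * get_frequency_modifier(0.9, "easy")
print(round(high_sim, 3), round(good_fit, 3))
```

With multiplication, a near-zero modifier suppresses even the highest-similarity off-profile words, which is the gating behavior the additive blend cannot provide.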
### 2. Two-Stage Filtering (Structural Fix)
```python
# Stage 1: Filter by frequency percentile ranges
easy_candidates = [w for w in candidates if w.percentile > 0.7]           # Common words
medium_candidates = [w for w in candidates if 0.3 < w.percentile < 0.7]  # Medium words
hard_candidates = [w for w in candidates if w.percentile < 0.3]          # Rare words

# Stage 2: Rank the pool for the requested difficulty by similarity alone
selected = softmax_selection(easy_candidates, similarity_only=True)
```
**Effect**: Guarantees different frequency pools for each difficulty, then optimizes within each pool.
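A self-contained sketch of both stages (the `Word` container, the percentile thresholds, and the softmax helper are illustrative assumptions, not the project's actual API):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    similarity: float
    percentile: float  # frequency percentile: 0 = rarest, 1 = most common

def softmax_selection(words, k, temperature=0.2, rng=None):
    """Stage 2: sample k words within the filtered pool by similarity alone."""
    rng = rng or np.random.default_rng()
    sims = np.array([w.similarity for w in words])
    probs = np.exp(sims / temperature - (sims / temperature).max())
    probs /= probs.sum()
    idx = rng.choice(len(words), size=min(k, len(words)), replace=False, p=probs)
    return [words[i] for i in idx]

def select_for_difficulty(candidates, difficulty, k=10):
    """Stage 1: hard frequency gate, then similarity-only ranking."""
    bands = {
        "easy": lambda w: w.percentile > 0.7,          # common words
        "medium": lambda w: 0.3 < w.percentile < 0.7,  # medium words
        "hard": lambda w: w.percentile < 0.3,          # rare words
    }
    pool = [w for w in candidates if bands[difficulty](w)]
    return softmax_selection(pool, k)
```

Because the frequency gate runs before any similarity scoring, no amount of semantic similarity can pull a word across difficulty pools.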
### 3. Exponential Temperature Scaling (Parameter Fix)
Use different temperature values by difficulty to create more distinct distributions:
```python
easy_temperature = 0.1    # Very deterministic (sharp peak)
medium_temperature = 0.3  # Moderate randomness
hard_temperature = 0.2    # Deterministic but different peak
```
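The sharpening effect of a lower temperature is straightforward to verify on synthetic sorted scores:

```python
import numpy as np

def softmax(scores, temperature):
    logits = np.asarray(scores) / temperature
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

scores = np.linspace(0.95, 0.4, 150)  # synthetic composite scores, descending
for t in (0.1, 0.2, 0.3):
    p = softmax(scores, t)
    print(f"T={t}: top word {p[0]:.1%}, top 10 words {p[:10].sum():.1%}")
```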
### 4. Adaptive Frequency Weights (Dynamic Fix)
```python
# Calculate the frequency weight needed to overcome similarity differences
max_similarity, min_similarity = 0.95, 0.6  # example similarity range
max_similarity_diff = max_similarity - min_similarity                   # 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff)  # ≈ 0.54

# Use a higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```
## Empirical Data Summary
### Word Selection Patterns (Technology Topic)
```
Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)

Hard Mode Top Selections:
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)
```
### Statistical Summary
- **σ Width Variation**: Easy (33.4) vs. Medium (42.9) vs. Hard (40.2): only a 28% spread
- **Peak Variation**: 1.5% to 4.1%: a moderate difference
- **Mean Position Variation**: Positions 37 to 60, a 62% relative spread, but all in the middle zone
- **Selection Concentration**: Most selections come from the first 30 words at all difficulties
## Conclusions
### The Core Problem
The difficulty-aware word selection system is theoretically sound but practically ineffective because:
1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness
3. **Statistical visualization assumes normal distributions** but the data is exponentially skewed
### Success Metrics for Fixes
A working system should show:
1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones
### Next Steps
1. Implement multiplicative scoring or two-stage filtering
2. Update the visualization to use percentiles instead of μ ± σ
3. Collect empirical data on word frequency percentiles in actual selections
4. Validate that fixes show distinct patterns across difficulties
---
*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*