# Probability Distribution Analysis: Theory vs. Practice

## Executive Summary

This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.

### Key Findings

- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Applying normal-distribution concepts (μ ± σ) to exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur

## Observed Probability Distributions

### Data Source: Technology Topic Analysis

Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:

```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```

### Empirical Results

#### Easy Difficulty

```
Mean Position:          Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone:        70.5% of probability mass
σ Range:                Words #9-#76
Top Probability:        2.3%
```

#### Medium Difficulty

```
Mean Position:          Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone:        61.0% of probability mass
σ Range:                Words #17-#103
Top Probability:        1.5%
```

#### Hard Difficulty

```
Mean Position:          Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone:        82.1% of probability mass
σ Range:                Words #1-#77
Top Probability:        4.1%
```

### Critical Observation

**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates that the frequency-based difficulty targeting is not working as intended.
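The exponential decay itself is expected once you work through the math: softmax over roughly linearly decreasing composite scores produces geometrically decaying probabilities. A minimal sketch on synthetic scores (the score range and helper name here are illustrative, not the project's actual API) shows the characteristic shape:

```python
import numpy as np

def softmax_probs(scores, temperature=0.2):
    """Convert composite scores to selection probabilities (illustrative)."""
    z = (scores - np.max(scores)) / temperature  # stabilize before exponentiating
    expz = np.exp(z)
    return expz / expz.sum()

# 150 candidates whose composite scores fall off roughly linearly,
# mimicking a ranked similarity list (synthetic data, not real output).
scores = np.linspace(0.95, 0.40, 150)
probs = softmax_probs(scores, temperature=0.2)

print(f"Top probability:   {probs[0]:.3f}")
print(f"Mass in first 20:  {probs[:20].sum():.2f}")
print(f"Mean position (μ): {np.sum(np.arange(150) * probs):.1f}")
```

Even with these made-up scores, the mass concentrates in the first few dozen ranks while μ lands well to the right of the peak, which is exactly the pattern reported above.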
## Statistical Misconceptions in Current Approach

### The Mode-Mean Divergence Problem

The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This reveals a fundamental statistical concept:

```
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median:      Position ~15 (where 50% of probability mass is reached)
Mean (μ):    Position 37-60 (weighted average position)
```

### Why μ is "Wrong" for Understanding Selection

In an exponential distribution with a long tail:

1. **Mode (0-3)**: Where individual words have the highest probability
2. **Practical sampling zone**: The first 10-20 words contain ~60-80% of the probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities

The mean doesn't represent where sampling actually occurs; it is mathematically correct but practically misleading.

### Standard Deviation Misapplication

The σ visualization assumes a normal distribution:

- **Normal assumption**: μ ± σ contains ~68% of the probability mass
- **Our reality**: An exponential distribution where μ ± σ often misses the high-probability words entirely

For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.

## Actual vs. Expected Behavior Analysis

### What Should Happen (Theory)

According to the composite scoring algorithm:

- **Easy**: Gaussian peak at the 90th percentile → common words dominate
- **Medium**: Gaussian peak at the 50th percentile → balanced selection
- **Hard**: Gaussian peak at the 20th percentile → rare words favored

### What Actually Happens (Empirical)

```
Easy:   MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING...    (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard:   TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN...    (still similar)
```

**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.
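The mode-median-mean divergence described above is easy to reproduce on a synthetic exponentially decaying distribution (illustrative values, not the debug tab's real data):

```python
import numpy as np

# Synthetic exponentially decaying probabilities over 150 ranked words.
ranks = np.arange(150)
probs = 0.98 ** ranks
probs /= probs.sum()

mode = int(np.argmax(probs))                          # highest single-word probability
median = int(np.searchsorted(np.cumsum(probs), 0.5))  # rank where 50% of mass is reached
mean = float(np.sum(ranks * probs))                   # probability-weighted average rank

print(mode, median, mean)
```

The mode sits at rank 0, the median lands in the low-to-mid ranks, and the mean is dragged far right by the long tail: three different "centers" for the same distribution.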
### Root Cause Analysis

The problem isn't in the Gaussian curves; they work correctly. The issue is in the composite formula:

```python
# Current approach (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# What happens with idealized data:
# High-similarity word:   similarity=0.9, wrong_freq_score=0.1
#   → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
#   → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```

Taken at face value, these numbers suggest a perfectly frequency-aligned word can win. In practice it rarely does: the Gaussian frequency curves are broad, so even frequency-inappropriate words score well above 0.1, while the similarity gaps between on-topic and off-topic words are large. Under the additive blend, a word therefore still needs **very high similarity** to compete with high-similarity words that have wrong frequency profiles.

## Sampling Mechanics Deep Dive

### np.random.choice Behavior

The selection uses `np.random.choice` with:

- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities
- **Sample size**: 10 words from 150 candidates

### Where Selections Actually Occur

Despite μ being at position 37-60, most actual selections come from positions 0-30 because:

1. **High probabilities concentrate early**: The first 20 words often hold 60%+ of the total probability
2. **Without-replacement effect**: Once high-probability words are chosen, selection moves to the next highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely

This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.

## Better Visualization Approaches

### Current Problems

- **μ ± σ assumes normality**: Not applicable to exponential distributions
- **Mean position is misleading**: It doesn't show where selection actually occurs
- **Standard deviation is meaningless**: For highly skewed distributions
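How little μ ± σ tells you here can be checked directly: for a geometrically decaying distribution (synthetic data below, chosen to roughly match the observed top probabilities), the μ ± σ band starts past the highest-probability ranks and excludes the mode entirely:

```python
import numpy as np

# Synthetic exponentially decaying probabilities (illustrative, not real output).
ranks = np.arange(150)
probs = 0.98 ** ranks
probs /= probs.sum()

mu = np.sum(ranks * probs)
sigma = np.sqrt(np.sum((ranks - mu) ** 2 * probs))

lo, hi = max(0, int(mu - sigma)), min(149, int(mu + sigma))
in_band = probs[lo:hi + 1].sum()  # mass inside μ ± σ
head = probs[:lo].sum()           # mass before the band even starts

print(f"μ ± σ = [{lo}, {hi}] holds {in_band:.0%} of the mass")
print(f"{head:.0%} of the mass sits before position {lo}")
```

The band does hold a normal-looking share of the mass, but the words it skips at the front are precisely the ones most likely to be sampled, which is why μ ± σ is the wrong summary for this data.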
### Recommended Alternatives

#### 1. Cumulative Probability Visualization

```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```

#### 2. Percentile Markers Instead of μ ± σ

```
P50 (Median): Position where 50% of probability mass is reached
P75:          Position where 75% of probability mass is reached
P90:          Position where 90% of probability mass is reached
```

#### 3. Mode Annotation

- Show the actual peak (mode) position
- Mark the top-5 highest-probability words
- Distinguish between the statistical mean and the practical selection zone

#### 4. Selection Concentration Metric

```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration:   Gini coefficient of the probability distribution
```

## Difficulty Differentiation Failure

### Expected Pattern

Different difficulty levels should show visually distinct probability distributions:

- **Easy**: Steep peak at common words, rapid falloff
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words

### Observed Pattern

All difficulties show similar exponential decay curves with:

- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position

### Quantitative Evidence

```
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH:       0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"

These high semantic matches dominate regardless of their frequency percentiles.
```
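The percentile markers proposed earlier are one `np.cumsum`/`np.searchsorted` away; a sketch on the same kind of synthetic exponentially decaying probabilities:

```python
import numpy as np

# Synthetic exponentially decaying selection probabilities (illustrative).
probs = 0.98 ** np.arange(150)
probs /= probs.sum()

cumulative = np.cumsum(probs)
# First rank at which each cumulative-mass threshold is reached.
p50, p75, p90 = (int(np.searchsorted(cumulative, q)) for q in (0.50, 0.75, 0.90))

print(f"P50 at position {p50}, P75 at {p75}, P90 at {p90}")
```

Unlike μ ± σ, these markers are meaningful for any distribution shape, and they directly answer the practical question "how far into the ranked list does selection actually reach?"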
## Recommended Fixes

### 1. Multiplicative Scoring (Immediate Fix)

Replace the additive formula with multiplicative gates:

```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score

# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```

**Effect**: Frequency acts as a gate rather than just another score component.

### 2. Two-Stage Filtering (Structural Fix)

```python
# Stage 1: filter by frequency percentile range
easy_candidates   = [w for w in candidates if w.percentile >= 0.7]        # common words
medium_candidates = [w for w in candidates if 0.3 <= w.percentile < 0.7]  # mid-frequency words
hard_candidates   = [w for w in candidates if w.percentile < 0.3]         # rare words

# Stage 2: rank the pool for the requested difficulty by similarity alone
pools = {"easy": easy_candidates, "medium": medium_candidates, "hard": hard_candidates}
selected = softmax_selection(pools[difficulty], similarity_only=True)
```

**Effect**: Guarantees a different frequency pool for each difficulty, then optimizes within that pool.

### 3. Exponential Temperature Scaling (Parameter Fix)

Use different temperature values per difficulty to create more distinct distributions:

```python
easy_temperature   = 0.1  # very deterministic (sharp peak)
medium_temperature = 0.3  # moderate randomness
hard_temperature   = 0.2  # deterministic, but over a different peak
```
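The effect of temperature is easy to demonstrate: lower temperature concentrates the softmax mass into the first few ranks. A sketch on synthetic ranked scores (the temperatures mirror the values above; the score range is an assumption):

```python
import numpy as np

def softmax(scores, temperature):
    """Temperature-scaled softmax over a score vector."""
    z = np.exp((scores - scores.max()) / temperature)
    return z / z.sum()

scores = np.linspace(0.95, 0.40, 150)  # synthetic ranked composite scores

for t in (0.1, 0.2, 0.3):
    probs = softmax(scores, t)
    print(f"T={t}: top-10 words hold {probs[:10].sum():.0%} of the mass")
```

Temperature alone sharpens or flattens the same ranking, though; it cannot move the peak to different words, so it complements rather than replaces fixes 1 and 2.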
### 4. Adaptive Frequency Weights (Dynamic Fix)

```python
# Calculate the frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity  # e.g., 0.95 - 0.60 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff)  # e.g., 0.35 / 0.65 ≈ 0.54

# Use a higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```

## Empirical Data Summary

### Word Selection Patterns (Technology Topic)

```
Easy Mode Top Selections:
- MULTIMEDIA    (percentile: ?, similarity: high)
- IMPLEMENT     (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)

Hard Mode Top Selections:
- TECH          (percentile: ?, similarity: very high)
- DIGITISATION  (percentile: likely low, similarity: high)
- TECHNICIAN    (percentile: ?, similarity: high)
```

### Statistical Summary

- **σ width variation**: Easy (33.4) vs. Medium (42.9) vs. Hard (40.2) - only a 28% spread
- **Peak variation**: 1.5% to 4.1% - a moderate difference
- **Mean position variation**: Position 37 to 60 - a 62% range, but all in the middle zone
- **Selection concentration**: Most selections come from the first 30 words at every difficulty

## Conclusions

### The Core Problem

The difficulty-aware word selection system is theoretically sound but practically ineffective because:

1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness
3. **The statistical visualization assumes normal distributions**, but the data is exponentially skewed

### Success Metrics for Fixes

A working system should show:

1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones
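The additive-dominance problem and the proposed multiplicative fix can be contrasted with a small numerical check. All scores and modifiers below are hypothetical: in particular, the `0.8` frequency score assumes that the broad Gaussian curves rarely score an off-target word much lower, which is an assumption about the data, not a measured value:

```python
# Hypothetical scores: similarity spans a wide range while the broad
# Gaussian frequency curve keeps even "wrong" words near the middle.
sim_hi, freq_hi = 0.95, 0.8  # very similar word, wrong frequency band
sim_lo, freq_lo = 0.60, 1.0  # less similar word, perfect frequency fit

# Additive 50/50 blend: the similarity gap outweighs the frequency penalty
additive_hi = 0.5 * sim_hi + 0.5 * freq_hi  # high-similarity word wins
additive_lo = 0.5 * sim_lo + 0.5 * freq_lo

# Multiplicative gate (hypothetical modifiers in the proposed 0.1-1.2 range)
mult_hi = sim_hi * 0.3  # wrong frequency band suppresses the word
mult_lo = sim_lo * 1.2  # perfect frequency fit boosts it

print(additive_hi, additive_lo, mult_hi, mult_lo)
```

Under the additive blend the ranking is unchanged by frequency; under the multiplicative gate it flips, which is exactly the behavior the difficulty targeting needs.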
### Next Steps

1. Implement multiplicative scoring or two-stage filtering
2. Update the visualization to use percentiles instead of μ ± σ
3. Collect empirical data on the frequency percentiles of actually selected words
4. Validate that the fixes produce distinct patterns across difficulties

---

*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*