# Probability Distribution Analysis: Theory vs. Practice
## Executive Summary
This document analyzes the **actual behavior** of the crossword word selection system, complementing the theoretical framework described in [`composite_scoring_algorithm.md`](composite_scoring_algorithm.md). While the composite scoring theory is sound, empirical analysis reveals significant discrepancies between intended and actual behavior.
### Key Findings
- **Similarity dominates**: Difficulty-based frequency preferences are too weak to create distinct selection patterns
- **Exponential distributions**: Actual probability distributions follow exponential decay, not normal distributions
- **Statistical misconceptions**: Using normal distribution concepts (μ ± σ) on exponentially decaying data is misleading
- **Mode-mean divergence**: Statistical measures don't represent where selections actually occur
## Observed Probability Distributions
### Data Source: Technology Topic Analysis
Using the debug visualization with `ENABLE_DEBUG_TAB=true`, we analyzed the actual probability distributions for different difficulties:
```
Topic: Technology
Candidates: 150 words
Temperature: 0.2
Selection method: Softmax with composite scoring
```
### Empirical Results
#### Easy Difficulty
```
Mean Position: Word #42 (IMPLEMENT)
Distribution Width (σ): 33.4 words
σ Sampling Zone: 70.5% of probability mass
σ Range: Words #9-#76
Top Probability: 2.3%
```
#### Medium Difficulty
```
Mean Position: Word #60 (COMPUTERIZED)
Distribution Width (σ): 42.9 words
σ Sampling Zone: 61.0% of probability mass
σ Range: Words #17-#103
Top Probability: 1.5%
```
#### Hard Difficulty
```
Mean Position: Word #37 (DIGITISATION)
Distribution Width (σ): 40.2 words
σ Sampling Zone: 82.1% of probability mass
σ Range: Words #1-#77
Top Probability: 4.1%
```
### Critical Observation
**All three difficulty levels show similar exponential decay patterns**, with only minor variations in peak height and mean position. This indicates the frequency-based difficulty targeting is not working as intended.
## Statistical Misconceptions in Current Approach
### The Mode-Mean Divergence Problem
The visualization shows a red line (μ) at positions 37-60, but the highest probability bars are at positions 0-5. This gap is exactly what a highly skewed distribution produces:
```
Distribution Type: Exponentially Decaying (Highly Skewed)
Mode (Peak): Position 0-3 (2-4% probability)
Median: Position ~15 (Where 50% of probability mass is reached)
Mean (μ): Position 37-60 (Weighted average position)
```
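This mode/median/mean divergence can be reproduced on synthetic data. The sketch below assumes an illustrative decay rate of 0.05; it is not the system's measured distribution:

```python
import numpy as np

# Illustrative sketch: a synthetic exponentially decaying probability
# vector (the decay rate is an assumption, not measured data).
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()                     # normalize into a distribution

mode = int(np.argmax(probs))             # highest-probability position
cdf = np.cumsum(probs)
median = int(np.searchsorted(cdf, 0.5))  # where 50% of mass is reached
mean = float(positions @ probs)          # probability-weighted average
# mode << median << mean, as in the observed distributions
```

Even with this mild decay, the mode sits at position 0 while the mean lands near position 19, mirroring the divergence seen in the debug visualization.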
### Why μ is "Wrong" for Understanding Selection
In an exponential distribution with long tail:
1. **Mode (0-3)**: Where individual words have highest probability
2. **Practical sampling zone**: First 10-20 words contain ~60-80% of probability mass
3. **Mean (37-60)**: Pulled far right by 100+ words with tiny probabilities
The mean doesn't represent where sampling actually occurs—it's mathematically correct but practically misleading.
### Standard Deviation Misapplication
The σ visualization assumes a normal distribution where:
- **Normal assumption**: μ ± σ contains ~68% of probability mass
- **Our reality**: An exponential distribution in which μ ± σ often misses the high-probability words entirely
For exponential distributions, percentiles or cumulative probability are more meaningful than standard deviation.
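This can be checked directly on a synthetic exponential-decay vector (illustrative parameters, not the system's data): the mass inside μ ± σ lands nowhere near the ~68% a normal distribution would predict.

```python
import numpy as np

# Sketch with an assumed decay rate of 0.05: how much probability
# mass actually falls inside mean ± one standard deviation?
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()

mean = float(positions @ probs)
std = float(np.sqrt(((positions - mean) ** 2) @ probs))
in_band = float(probs[(positions >= mean - std) & (positions <= mean + std)].sum())
# in_band is ~0.87 here, not the 0.68 a normal distribution implies
```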
## Actual vs. Expected Behavior Analysis
### What Should Happen (Theory)
According to the composite scoring algorithm:
- **Easy**: Gaussian peak at 90th percentile → common words dominate
- **Medium**: Gaussian peak at 50th percentile → balanced selection
- **Hard**: Gaussian peak at 20th percentile → rare words favored
### What Actually Happens (Empirical)
```
Easy: MULTIMEDIA, TECH, TECHNOLOGY, IMPLEMENTING... (similar to others)
Medium: TECH, TECHNOLOGY, COMPUTERIZED, TECHNOLOGICAL... (similar pattern)
Hard: TECH, TECHNOLOGY, DIGITISATION, TECHNICIAN... (still similar)
```
**All difficulties select similar high-similarity technology words**, regardless of their frequency percentiles.
### Root Cause Analysis
The problem isn't in the Gaussian curves—they work correctly. The issue is in the composite formula:
```python
# Current approach
composite = 0.5 * similarity + 0.5 * frequency_score
# What happens with real data:
# High-similarity word: similarity=0.9, wrong_freq_score=0.1
# → composite = 0.5*0.9 + 0.5*0.1 = 0.50
# Medium-similarity word: similarity=0.7, perfect_freq_score=1.0
# → composite = 0.5*0.7 + 0.5*1.0 = 0.85
```
In this idealized arithmetic the frequency-aligned word wins, but empirically the top candidates all carry very high similarity (0.85-0.95, per the quantitative evidence below), so similarity differences swamp the frequency term and high-similarity words dominate regardless of their frequency profiles.
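The observed dominance is easy to reproduce once score ranges are compressed. The values below are illustrative assumptions, not measurements:

```python
# Illustrative sketch (scores are assumed, not measured): when
# frequency scores occupy a narrow band while similarities differ,
# the additive composite is ordered almost entirely by similarity.
def composite(similarity, frequency_score, w=0.5):
    return w * similarity + (1 - w) * frequency_score

# (word, similarity, frequency_score) with a compressed frequency range
words = [("TECHNOLOGY", 0.95, 0.55),
         ("TECH", 0.90, 0.45),
         ("DIGITISATION", 0.70, 0.60)]
ranked = sorted(words, key=lambda t: composite(t[1], t[2]), reverse=True)
# ranked order matches similarity order: TECHNOLOGY, TECH, DIGITISATION
```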
## Sampling Mechanics Deep Dive
### np.random.choice Behavior
The selection uses `np.random.choice` with:
- **Without replacement**: Each word can only be selected once
- **Probability weighting**: Based on computed probabilities
- **Sample size**: 10 words from 150 candidates
### Where Selections Actually Occur
Despite μ being at position 37-60, most actual selections come from positions 0-30 because:
1. **High probabilities concentrate early**: First 20 words often have 60%+ of total probability
2. **Without replacement effect**: Once high-probability words are chosen, selection moves to next-highest
3. **Exponential decay**: Probability drops rapidly, making later positions unlikely
This explains why the green bars (selected words) appear mostly in the left portion of all distributions, regardless of where μ is located.
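The left-clustering of selections can be simulated with the same sampling call. The decay rate and seed below are assumptions for illustration:

```python
import numpy as np

# Sketch: 10 draws without replacement from a synthetic exponentially
# decaying distribution, mirroring the selection mechanics described above.
rng = np.random.default_rng(0)
positions = np.arange(150)
probs = np.exp(-0.05 * positions)
probs /= probs.sum()

picks = rng.choice(positions, size=10, replace=False, p=probs)
mean_position = float(positions @ probs)  # ~19 for this synthetic data
# Most picks land well left of where the mean would suggest
```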
## Better Visualization Approaches
### Current Problems
- **μ ± σ assumes normality**: Not applicable to exponential distributions
- **Mean position misleading**: Doesn't show where selection actually occurs
- **Standard deviation meaningless**: For highly skewed distributions
### Recommended Alternatives
#### 1. Cumulative Probability Visualization
```
First 10 words: 45% of total probability mass
First 20 words: 65% of total probability mass
First 30 words: 78% of total probability mass
First 50 words: 90% of total probability mass
```
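A table like the one above is a one-line cumulative sum. The sketch uses synthetic probabilities; the real numbers would come from the debug tab's distribution:

```python
import numpy as np

# Sketch: cumulative probability mass covered by the first N
# candidate positions (assumed decay rate, not the system's data).
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
cdf = np.cumsum(probs)
coverage = {n: round(float(cdf[n - 1]), 2) for n in (10, 20, 30, 50)}
```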
#### 2. Percentile Markers Instead of μ ± σ
```
P50 (Median): Position where 50% of probability mass is reached
P75: Position where 75% of probability mass is reached
P90: Position where 90% of probability mass is reached
```
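Percentile markers fall out of the same cumulative distribution; a sketch on the synthetic probabilities used above:

```python
import numpy as np

# Sketch: percentile marker positions for a synthetic exponentially
# decaying distribution (assumed decay rate, not measured data).
probs = np.exp(-0.05 * np.arange(150))
probs /= probs.sum()
cdf = np.cumsum(probs)
# First position at which each cumulative threshold is reached
p50, p75, p90 = (int(np.searchsorted(cdf, q)) for q in (0.5, 0.75, 0.9))
```

Unlike μ ± σ, these markers track the actual selection zone: for this synthetic data, half the probability mass sits in the first 14 positions.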
#### 3. Mode Annotation
- Show the actual peak (mode) position
- Mark the top-5 highest probability words
- Distinguish between statistical mean and practical selection zone
#### 4. Selection Concentration Metric
```
Effective Selection Range: Positions covering 80% of selection probability
Selection Concentration: Gini coefficient of probability distribution
```
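A minimal sketch of the Gini-based concentration metric, using the standard sorted-values formula on a probability vector:

```python
import numpy as np

# Sketch: Gini coefficient of a probability vector as a concentration
# metric (0 = perfectly uniform, approaching 1 = all mass on one word).
def gini(p):
    p = np.sort(np.asarray(p, dtype=float))   # ascending
    n = len(p)
    index = np.arange(1, n + 1)
    return float(((2 * index - n - 1) @ p) / (n * p.sum()))

uniform = np.full(150, 1 / 150)
decaying = np.exp(-0.05 * np.arange(150))
decaying /= decaying.sum()
# gini(uniform) ≈ 0; gini(decaying) is substantially higher
```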
## Difficulty Differentiation Failure
### Expected Pattern
Different difficulty levels should show visually distinct probability distribution patterns:
- **Easy**: Steep peak at common words, rapid falloff
- **Medium**: Moderate peak, balanced distribution
- **Hard**: Peak shifted toward rare words
### Observed Pattern
All difficulties show similar exponential decay curves with:
- Similar-shaped distributions
- Similar high-probability words (TECH, TECHNOLOGY, etc.)
- Only minor differences in peak height and position
### Quantitative Evidence
```
Similarity scores of top words (all difficulties):
TECHNOLOGY: 0.95+ similarity to "technology"
TECH: 0.90+ similarity to "technology"
MULTIMEDIA: 0.85+ similarity to "technology"
```

These high semantic matches dominate regardless of their frequency percentiles.
## Recommended Fixes
### 1. Multiplicative Scoring (Immediate Fix)
Replace additive formula with multiplicative gates:
```python
# Current (additive)
composite = 0.5 * similarity + 0.5 * frequency_score
# Proposed (multiplicative)
frequency_modifier = get_frequency_modifier(percentile, difficulty)
composite = similarity * frequency_modifier
# Where frequency_modifier ranges 0.1-1.2 instead of 0.0-1.0
```
**Effect**: Frequency acts as a gate rather than just another score component.
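One plausible shape for the gate is sketched below. The Gaussian form, targets, and width are assumptions; the real `get_frequency_modifier` may differ:

```python
import math

# Hypothetical sketch of the frequency gate. Targets follow the
# documented percentile goals (easy 90th, medium 50th, hard 20th);
# the Gaussian shape and width are assumptions.
DIFFICULTY_TARGETS = {"easy": 0.9, "medium": 0.5, "hard": 0.2}

def get_frequency_modifier(percentile, difficulty, width=0.25):
    """Gaussian gate around the difficulty's target percentile,
    rescaled into the 0.1-1.2 range mentioned above."""
    target = DIFFICULTY_TARGETS[difficulty]
    gate = math.exp(-((percentile - target) ** 2) / (2 * width ** 2))
    return 0.1 + 1.1 * gate   # 0.1 = bad fit, 1.2 = perfect fit

# A frequency-appropriate word now beats a misfit despite lower similarity
misfit = 0.90 * get_frequency_modifier(0.10, "easy")  # gated hard
fit = 0.70 * get_frequency_modifier(0.90, "easy")     # boosted
```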
### 2. Two-Stage Filtering (Structural Fix)
```python
# Stage 1: partition candidates into frequency-percentile pools
easy_candidates = [w for w in candidates if w.percentile > 0.7]           # common words
medium_candidates = [w for w in candidates if 0.3 < w.percentile <= 0.7]  # medium words
hard_candidates = [w for w in candidates if w.percentile <= 0.3]          # rare words

# Stage 2: rank the pool matching the requested difficulty by similarity
filtered_candidates = easy_candidates  # or medium/hard, per request
selected = softmax_selection(filtered_candidates, similarity_only=True)
```
**Effect**: Guarantees different frequency pools for each difficulty, then optimizes within each pool.
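A self-contained sketch of the two-stage idea follows. The `Word` class, its fields, and the example data are hypothetical, not the project's actual types:

```python
import math
import random
from dataclasses import dataclass

# Hypothetical types and data, for illustration only.
@dataclass
class Word:
    text: str
    similarity: float
    percentile: float

def pool_for(difficulty, candidates):
    """Stage 1: difficulty-specific frequency pool."""
    if difficulty == "easy":
        return [w for w in candidates if w.percentile > 0.7]
    if difficulty == "hard":
        return [w for w in candidates if w.percentile <= 0.3]
    return [w for w in candidates if 0.3 < w.percentile <= 0.7]

def softmax_select(pool, k, temperature=0.2, seed=0):
    """Stage 2: similarity-only softmax sampling without replacement."""
    rng = random.Random(seed)
    pool = list(pool)
    weights = [math.exp(w.similarity / temperature) for w in pool]
    chosen = []
    for _ in range(min(k, len(pool))):
        i = rng.choices(range(len(pool)), weights=weights)[0]
        chosen.append(pool.pop(i))
        weights.pop(i)   # without replacement
    return chosen

candidates = [Word("TECHNOLOGY", 0.95, 0.95), Word("TECH", 0.90, 0.90),
              Word("DIGITISATION", 0.70, 0.10), Word("TELEMATICS", 0.65, 0.05)]
hard_picks = softmax_select(pool_for("hard", candidates), k=2)
# hard picks can only come from the rare-word pool
```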
### 3. Exponential Temperature Scaling (Parameter Fix)
Use different temperature values by difficulty to create more distinct distributions:
```python
easy_temperature = 0.1 # Very deterministic (sharp peak)
medium_temperature = 0.3 # Moderate randomness
hard_temperature = 0.2 # Deterministic but different peak
```
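The sharpening effect of a lower temperature is easy to see on a small example (scores below are illustrative, not measured):

```python
import numpy as np

# Sketch: the same composite scores under different temperatures.
def softmax(scores, temperature):
    z = np.asarray(scores, dtype=float) / temperature
    z -= z.max()              # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = [0.9, 0.8, 0.7, 0.6]
sharp = softmax(scores, 0.1)  # easy: most mass on the top word
soft = softmax(scores, 0.3)   # medium: flatter distribution
```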
### 4. Adaptive Frequency Weights (Dynamic Fix)
```python
# Calculate frequency dominance needed to overcome similarity differences
max_similarity_diff = max_similarity - min_similarity # e.g., 0.95 - 0.6 = 0.35
required_freq_weight = max_similarity_diff / (1 - max_similarity_diff) # e.g., 0.35/0.65 ≈ 0.54
# Use higher frequency weight when similarity ranges are wide
adaptive_weight = min(0.8, required_freq_weight)
```
## Empirical Data Summary
### Word Selection Patterns (Technology Topic)
```
Easy Mode Top Selections:
- MULTIMEDIA (percentile: ?, similarity: high)
- IMPLEMENT (percentile: ?, similarity: high)
- TECHNOLOGICAL (percentile: ?, similarity: high)
Hard Mode Top Selections:
- TECH (percentile: ?, similarity: very high)
- DIGITISATION (percentile: likely low, similarity: high)
- TECHNICIAN (percentile: ?, similarity: high)
```
### Statistical Summary
- **σ Width Variation**: Easy (33.4) vs Medium (42.9) vs Hard (40.2) - only 28% difference
- **Peak Variation**: 1.5% to 4.1% - moderate difference
- **Mean Position Variation**: Position 37 to 60 - 62% range but all in middle zone
- **Selection Concentration**: Most selections from first 30 words in all difficulties
## Conclusions
### The Core Problem
The difficulty-aware word selection system is theoretically sound but practically ineffective because:
1. **Semantic similarity signals are too strong** compared to frequency signals
2. **Additive scoring allows high-similarity words to dominate** regardless of frequency appropriateness
3. **Statistical visualization assumes normal distributions** but data is exponentially skewed
### Success Metrics for Fixes
A working system should show:
1. **Visually distinct probability distributions** for each difficulty
2. **Different word frequency profiles** in actual selections
3. **Mode and mean alignment** with intended difficulty targets
4. **Meaningful σ ranges** that represent actual selection zones
### Next Steps
1. Implement multiplicative scoring or two-stage filtering
2. Update visualization to use percentiles instead of μ ± σ
3. Collect empirical data on word frequency percentiles in actual selections
4. Validate fixes show distinct patterns across difficulties
---
*This analysis represents empirical findings from the debug visualization system, revealing gaps between the theoretical composite scoring model and its practical implementation.*