vimalk78 committed on
Commit
53e35dc
·
1 Parent(s): 2ecccdf

feat: implement distribution normalization with default disabled


- Add distribution normalization to ensure consistent difficulty across topics
- Support three methods: similarity_range, composite_zscore, percentile_recentering
- Set default to disabled based on analysis showing natural semantic relationships are preferable
- Add comprehensive analysis documentation and test suite

Signed-off-by: Vimal Kumar <vimal78@gmail.com>

crossword-app/backend-py/docs/distribution_normalization_analysis.md ADDED
@@ -0,0 +1,176 @@
# Distribution Normalization Analysis

## Overview

Distribution normalization is a feature implemented to ensure consistent difficulty levels across different topics in the crossword generator. This document analyzes the trade-offs between the normalized and non-normalized approaches and provides a recommendation.

## The Problem

The original question was: *"Should we normalize the distribution before display? Perhaps the distribution will be centered at the same position for a difficulty level irrespective of topic."*

Different topics naturally have different semantic similarity ranges:
- **"Animals"**: Rich vocabulary; similarities often range 0.4-0.9
- **"Philosophy"**: Abstract concepts; similarities might range 0.1-0.6
- **"Technology"**: Mixed range; similarities around 0.2-0.8

This led to a perception of inconsistent difficulty: an "Easy Animals" crossword felt easier than an "Easy Philosophy" one.

## Current Implementation

### Composite Score Formula
```
composite = (1 - difficulty_weight) * similarity + difficulty_weight * freq_score
```

With the default `difficulty_weight = 0.5`:
```
composite = 0.5 * similarity + 0.5 * freq_score
```

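As a worked example, here is a minimal sketch (not part of the codebase) that plugs the illustrative similarity ranges from "The Problem" into this formula; `freq_score` is held fixed at an assumed 0.6 so that only the similarity term varies:

```python
# Hypothetical numbers: similarity ranges from "The Problem", fixed freq_score.
difficulty_weight = 0.5
freq_score = 0.6

topic_ranges = {"animals": (0.4, 0.9), "philosophy": (0.1, 0.6)}

for topic, (lo, hi) in topic_ranges.items():
    lo_c = (1 - difficulty_weight) * lo + difficulty_weight * freq_score
    hi_c = (1 - difficulty_weight) * hi + difficulty_weight * freq_score
    print(f"{topic}: composite range {lo_c:.2f}-{hi_c:.2f}")

# animals:    composite range 0.50-0.75
# philosophy: composite range 0.35-0.60
```

The same difficulty label therefore samples from different composite bands on different topics, which is exactly the inconsistency normalization tries to remove.
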
### Normalization Methods

Three methods are supported (each sketched after this list):

1. **`similarity_range`** (default): Normalizes similarities to [0,1] before the composite calculation
2. **`composite_zscore`**: Z-score normalization (unbounded, typically -3 to +3)
3. **`percentile_recentering`**: Boosts scores based on proximity to the target percentile (can exceed 1.0)

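The sketch below shows the core transform behind each method on a toy score array. It is a simplified standalone illustration, not the service implementation (which, for `similarity_range`, recomputes composites from normalized similarities rather than rescaling composites directly):

```python
import numpy as np

scores = np.array([0.32, 0.41, 0.48, 0.55, 0.61])  # toy composite scores

# similarity_range (simplified): min-max rescale to [0, 1]
rescaled = (scores - scores.min()) / (scores.max() - scores.min())

# composite_zscore: center at 0 with unit variance (output is unbounded)
zscores = (scores - scores.mean()) / scores.std()

# percentile_recentering: boost scores whose word percentile is near the
# difficulty target (0.5 for medium); the percentiles here are made up
percentiles = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
alignment = np.exp(-((percentiles - 0.5) ** 2) / (2 * 0.2 ** 2))
boosted = scores * (1 + 0.5 * alignment)  # can exceed 1.0 for strong candidates

print(rescaled, zscores, boosted, sep="\n")
```
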
### Configuration
- `ENABLE_DISTRIBUTION_NORMALIZATION=false` (default: disabled; see Recommendation below)
- `NORMALIZATION_METHOD=similarity_range` (default)

## Trade-offs Analysis

### Before Normalization (Original System)

#### Advantages ✅
1. **Natural semantic relationships preserved**
   - Topics with broader vocabulary naturally had higher similarity ranges
   - Reflected genuine linguistic density differences
   - Authentic representation of the semantic space

2. **Simpler and more predictable**
   - Straightforward composite score calculation
   - Always naturally bounded to [0,1]
   - No artificial transformations

3. **Semantic honesty**
   - Some topics ARE inherently harder to generate crosswords for
   - The system reflected this reality rather than masking it
   - Valuable information for both the system and its users

4. **Computational efficiency**
   - No additional normalization calculations
   - Cleaner code path

#### Disadvantages ❌
1. **Inconsistent difficulty across topics**
   - "Easy" for animals was genuinely easier than "Easy" for philosophy
   - Could confuse users expecting uniform difficulty

2. **User expectation mismatch**
   - Players might expect the same difficulty label to mean the same challenge level

### After Normalization (Current System)

#### Advantages ✅
1. **Consistent difficulty intent**
   - Attempts to make "Easy" equally easy across all topics
   - Meets user expectations for uniform difficulty labels

2. **Debug visualization enhancements**
   - Shows normalization effects in the debug tab
   - Helpful for analysis and understanding

#### Disadvantages ❌
1. **Artificial stretching of similarity ranges**
   - Forces sparse topics to use the full [0,1] range
   - Genuinely dissimilar words appear artificially similar
   - Loss of semantic authenticity

2. **Implementation complexity and bugs**
   - Different methods produce different output ranges
   - Z-score normalization is unbounded
   - Percentile recentering can exceed 1.0
   - Softmax is sensitive to inconsistent ranges

3. **Loss of valuable information**
   - Masks natural vocabulary density differences
   - Hides genuine topic difficulty characteristics
   - Makes debugging harder (what's "real" vs. "normalized"?)

4. **Computational overhead**
   - Additional calculations for normalization
   - More complex code paths
   - Potential for numerical issues

## Composite Score Ranges

### Without Normalization
- **Theoretical range**: [0, 1]
- **Practical range**: Depends on the actual similarities in the 150-word thematic pool
- **Example**: If similarities range 0.3-0.7 (and `freq_score` spans its full [0, 1] range), composite ≈ [0.15, 0.85]

### With Normalization
- **`similarity_range`**: ~[0, 1] (most consistent)
- **`composite_zscore`**: Unbounded (typically [-3, +3])
- **`percentile_recentering`**: Can exceed 1.0 due to boosting

## Problems with Current Implementation

1. **Range inconsistency**: Different normalization methods produce different ranges
2. **Unbounded z-scores**: Affect softmax probability calculations unpredictably (illustrated below)
3. **Values exceeding [0,1]**: Break assumptions about composite score bounds
4. **Complexity without clear benefit**: Added complexity for questionable gains

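To see why unbounded z-scores are a problem for selection, the sketch below feeds the same five composite scores through a temperature softmax before and after z-score normalization. The temperature of 0.7 is an assumed value for illustration:

```python
import numpy as np

def softmax_with_temperature(scores: np.ndarray, temperature: float = 0.7) -> np.ndarray:
    # Standard temperature softmax; subtracting the max keeps exp() stable
    scaled = scores / temperature
    exp = np.exp(scaled - np.max(scaled))
    return exp / exp.sum()

composites = np.array([0.35, 0.45, 0.50, 0.55, 0.65])          # bounded scores
zscored = (composites - composites.mean()) / composites.std()  # unbounded

print(softmax_with_temperature(composites))
# ~[0.16, 0.18, 0.20, 0.21, 0.25]  -> gentle preference for top words
print(softmax_with_temperature(zscored))
# ~[0.01, 0.04, 0.08, 0.17, 0.70]  -> top word dominates selection
```

Because the z-transform stretches scores onto a much wider scale, the same temperature produces a far sharper distribution, so a temperature tuned for raw composites behaves very differently once normalization is enabled.
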
## Recommendation

### **Revert to Non-Normalized Approach**

The original system was **better** for these reasons:

1. **The "problem" wasn't really a problem**
   - Different topics having different difficulty distributions is natural and informative
   - Philosophy IS harder to make crosswords for than animals; this is linguistic reality

2. **Normalization introduces distortions**
   - Stretching narrow ranges doesn't make words more semantically similar
   - Creates artificial relationships that don't exist

3. **Alternative solutions are better** (see the sketch after this list):
   - Show users the natural difficulty of each topic
   - Adjust the word count based on topic vocabulary density
   - Provide topic difficulty ratings to set expectations
   - Use adaptive difficulty within topics rather than across them

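One possible shape for the topic-rating alternative. This is a hedged sketch, not an implemented feature; the `rate_topic_difficulty` helper and its thresholds are hypothetical:

```python
import numpy as np

def rate_topic_difficulty(similarities: list[float]) -> str:
    """Hypothetical helper: rate a topic from its raw similarity distribution.

    Rather than normalizing scores, surface the topic's natural difficulty
    so users know what to expect. The thresholds are illustrative guesses.
    """
    mean_sim = float(np.mean(similarities))
    if mean_sim >= 0.55:
        return "dense vocabulary - easier grids"
    if mean_sim >= 0.35:
        return "moderate vocabulary"
    return "sparse vocabulary - expect a tougher grid"

# Illustrative ranges from "The Problem" section
print(rate_topic_difficulty([0.4, 0.65, 0.9]))  # animals -> dense vocabulary
print(rate_topic_difficulty([0.1, 0.3, 0.6]))   # philosophy -> sparse vocabulary
```
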
### If Normalization is Kept

If normalization must be retained:

1. **Make it opt-in, not the default**: `ENABLE_DISTRIBUTION_NORMALIZATION=false`
2. **Fix range consistency**: Ensure all methods produce [0,1] outputs
3. **Add proper bounds checking**: Clamp scores to [0,1] after normalization
4. **Document the trade-offs clearly**: Let users make informed choices

## Proposed Implementation Fixes

If keeping normalization, fix these issues:

```python
import numpy as np

# After normalization, ensure a consistent [0,1] range.
# `method`, `normalized_scores`, `boosted_scores`, and `composite_scores`
# refer to the variables inside _apply_distribution_normalization.
if method == "composite_zscore":
    # Map unbounded z-scores into [0,1] using a sigmoid
    scores = 1 / (1 + np.exp(-normalized_scores))
elif method == "percentile_recentering":
    # Clamp boosted scores back to the valid range
    scores = np.clip(boosted_scores, 0, 1)

# Final safety clamp for all methods
composite_scores = np.clip(composite_scores, 0, 1)
```

## Conclusion

The **non-normalized approach respects semantic reality** and provides more honest, interpretable results. The "inconsistency" across topics is actually valuable information about linguistic structure, not a bug to be fixed.

**Recommendation**: Disable normalization by default (`ENABLE_DISTRIBUTION_NORMALIZATION=false`) and let natural semantic relationships guide the difficulty distribution. This preserves the system's authenticity while maintaining simplicity and predictability.

The original system's variation across topics was a **feature representing real linguistic diversity**, not a problem requiring artificial correction.
crossword-app/backend-py/src/services/thematic_word_service.py CHANGED
@@ -288,6 +288,13 @@ class ThematicWordService:
         self.difficulty_weight = float(os.getenv("DIFFICULTY_WEIGHT", "0.5"))
         self.thematic_pool_size = int(os.getenv("THEMATIC_POOL_SIZE", "150"))
 
+        # Distribution normalization configuration
+        # Default: DISABLED based on analysis showing the non-normalized approach is better
+        # See docs/distribution_normalization_analysis.md for detailed reasoning
+        # Preserves natural semantic relationships and avoids artificial distortions
+        self.enable_distribution_normalization = os.getenv("ENABLE_DISTRIBUTION_NORMALIZATION", "false").lower() == "true"
+        self.normalization_method = os.getenv("NORMALIZATION_METHOD", "similarity_range").lower()  # "similarity_range", "composite_zscore", "percentile_recentering"
+
         # Debug tab configuration
         self.enable_debug_tab = os.getenv("ENABLE_DEBUG_TAB", "false").lower() == "true"
 
@@ -359,6 +366,9 @@
         logger.info(f"🎲 Softmax selection: {'ENABLED' if self.use_softmax_selection else 'DISABLED'}")
         if self.use_softmax_selection:
             logger.info(f"🌡️ Similarity temperature: {self.similarity_temperature}")
+        logger.info(f"🎯 Distribution normalization: {'ENABLED' if self.enable_distribution_normalization else 'DISABLED'}")
+        if self.enable_distribution_normalization:
+            logger.info(f"🔧 Normalization method: {self.normalization_method}")
 
     async def initialize_async(self):
         """Initialize the generator (async version for backend compatibility)."""
@@ -716,6 +726,85 @@
         composite = final_alpha * similarity + final_beta * freq_score
         return composite
 
+    def _apply_distribution_normalization(self, composite_scores: np.ndarray, candidates: List[Dict[str, Any]], difficulty: str) -> np.ndarray:
+        """
+        Apply distribution normalization to ensure consistent difficulty distributions across topics.
+
+        This method normalizes the composite score distribution to ensure that the same difficulty level
+        produces consistent selection patterns regardless of the topic's inherent semantic similarity range.
+
+        Args:
+            composite_scores: Raw composite scores from similarity + frequency alignment
+            candidates: List of candidate word dictionaries
+            difficulty: Difficulty level for target percentile calculation
+
+        Returns:
+            Normalized composite scores with consistent distribution shape
+        """
+        if len(composite_scores) <= 1:
+            return composite_scores
+
+        method = self.normalization_method.lower()
+
+        if method == "similarity_range":
+            # Method 1: Normalize similarity ranges to [0,1] before composite scoring
+            # This ensures all topics use the full similarity spectrum
+            similarities = np.array([c['similarity'] for c in candidates])
+            if len(similarities) > 1 and np.std(similarities) > 0:
+                min_sim, max_sim = np.min(similarities), np.max(similarities)
+                if max_sim > min_sim:  # Avoid division by zero
+                    # Recalculate composite scores with normalized similarities
+                    normalized_scores = []
+                    for i, candidate in enumerate(candidates):
+                        normalized_sim = (candidate['similarity'] - min_sim) / (max_sim - min_sim)
+                        word = candidate['word']
+                        # Recompute composite score with normalized similarity
+                        percentile = self.word_percentiles.get(word.lower(), 0.0)
+
+                        # Calculate difficulty alignment score (same as _compute_composite_score)
+                        if difficulty == "easy":
+                            freq_score = np.exp(-((percentile - 0.9) ** 2) / (2 * 0.1 ** 2))
+                        elif difficulty == "hard":
+                            freq_score = np.exp(-((percentile - 0.2) ** 2) / (2 * 0.15 ** 2))
+                        else:  # medium
+                            freq_score = 0.5 + 0.5 * np.exp(-((percentile - 0.5) ** 2) / (2 * 0.3 ** 2))
+
+                        # Apply difficulty weight with normalized similarity
+                        final_alpha = 1.0 - self.difficulty_weight
+                        final_beta = self.difficulty_weight
+                        composite = final_alpha * normalized_sim + final_beta * freq_score
+                        normalized_scores.append(composite)
+
+                    return np.array(normalized_scores)
+
+        elif method == "composite_zscore":
+            # Method 2: Z-score normalization of composite scores
+            # Centers distribution at 0 with unit variance
+            mean_score = np.mean(composite_scores)
+            std_score = np.std(composite_scores)
+            if std_score > 0:
+                return (composite_scores - mean_score) / std_score
+
+        elif method == "percentile_recentering":
+            # Method 3: Force distribution center to match target percentile
+            target_percentiles = {"easy": 0.9, "medium": 0.5, "hard": 0.2}
+            target = target_percentiles.get(difficulty, 0.5)
+
+            # Calculate current probability-weighted percentile center
+            percentiles = np.array([self.word_percentiles.get(c['word'].lower(), 0.0) for c in candidates])
+
+            # Simple linear transformation to center distribution
+            current_center = np.mean(percentiles)  # Simplified: use mean percentile
+            shift = target - current_center
+
+            # Apply proportional boost to scores based on how close they are to target
+            percentile_alignment = np.exp(-((percentiles - target) ** 2) / (2 * 0.2 ** 2))
+            boosted_scores = composite_scores * (1 + 0.5 * percentile_alignment)
+            return boosted_scores
+
+        # If no valid method or normalization not needed, return original scores
+        return composite_scores
+
     def _softmax_with_temperature(self, scores: np.ndarray, temperature: float = 1.0) -> np.ndarray:
         """
         Apply softmax with temperature control to similarity scores.
@@ -821,6 +910,12 @@
 
         composite_scores = np.array(composite_scores)
 
+        # Apply distribution normalization if enabled
+        original_composite_scores = composite_scores.copy()  # Keep for debug comparison
+        if self.enable_distribution_normalization:
+            composite_scores = self._apply_distribution_normalization(composite_scores, candidates, difficulty)
+            logger.info(f"🎯 Applied distribution normalization ({self.normalization_method})")
+
         # Log debug information
         logger.info(f"🔍 Debug: Top 10 composite scores for difficulty={difficulty}:")
         for info in debug_info:
@@ -856,7 +951,7 @@
         # Create probability distribution data for debug visualization
         prob_distribution = []
        for i, candidate in enumerate(candidates):
-            prob_distribution.append({
+            prob_item = {
                 "word": candidate["word"],
                 "probability": float(probabilities[i]),
                 "composite_score": float(composite_scores[i]),
@@ -865,7 +960,17 @@
                 "similarity": candidate["similarity"],
                 "tier": candidate.get("tier", "unknown"),
                 "percentile": self.word_percentiles.get(candidate["word"].lower(), 0.0)
-            })
+            }
+
+            # Add normalization debug data if normalization was applied
+            if self.enable_distribution_normalization and 'original_composite_scores' in locals():
+                prob_item["original_composite_score"] = float(original_composite_scores[i])
+                prob_item["normalization_applied"] = True
+                prob_item["normalization_method"] = self.normalization_method
+            else:
+                prob_item["normalization_applied"] = False
+
+            prob_distribution.append(prob_item)
 
         # Sort by probability descending for display
         prob_distribution.sort(key=lambda x: x["probability"], reverse=True)
@@ -879,7 +984,9 @@
             "temperature": temperature,
             "difficulty": difficulty,
             "total_candidates": len(candidates),
-            "selected_count": len(selected_candidates)
+            "selected_count": len(selected_candidates),
+            "normalization_enabled": self.enable_distribution_normalization,
+            "normalization_method": self.normalization_method if self.enable_distribution_normalization else None
         }
 
         return selected_candidates, prob_data
crossword-app/backend-py/test_distribution_normalization.py ADDED
@@ -0,0 +1,219 @@
#!/usr/bin/env python3
"""
Test script for the distribution normalization feature.

This script demonstrates how distribution normalization ensures consistent
difficulty levels across different topics by normalizing similarity ranges
and standardizing distribution shapes.
"""

import os
import sys
import numpy as np
from collections import defaultdict

# Add src directory to path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), 'src'))

def test_normalization_across_topics():
    """Test normalization consistency across different topics."""
    print("🧪 Testing distribution normalization across topics...")

    # Set up environment for testing normalization
    os.environ['SIMILARITY_TEMPERATURE'] = '0.7'
    os.environ['USE_SOFTMAX_SELECTION'] = 'true'
    os.environ['DIFFICULTY_WEIGHT'] = '0.3'
    os.environ['ENABLE_DEBUG_TAB'] = 'true'

    # Test with normalization ENABLED
    os.environ['ENABLE_DISTRIBUTION_NORMALIZATION'] = 'true'
    os.environ['NORMALIZATION_METHOD'] = 'similarity_range'

    from services.thematic_word_service import ThematicWordService

    # Create service instance
    service = ThematicWordService()
    service.initialize()

    # Test topics with expected different similarity ranges
    test_topics = [
        ("animals", "Expected high similarity range - many animals in vocabulary"),
        ("technology", "Expected medium similarity range - some tech words"),
        ("geology", "Expected low similarity range - fewer geology terms"),
        ("food", "Expected high similarity range - many food words"),
        ("philosophy", "Expected very low similarity range - abstract concepts")
    ]

    difficulty = "medium"  # Use medium difficulty for consistent comparison
    num_words = 15

    print(f"\n🎯 Testing normalization for difficulty: {difficulty.upper()}")
    print(f"📊 Requesting {num_words} words per topic")
    print(f"🔧 Normalization: {service.enable_distribution_normalization} ({service.normalization_method})")

    results = {}

    for topic, description in test_topics:
        print(f"\n📚 Topic: {topic.upper()}")
        print(f"   {description}")

        try:
            # Generate words using crossword-specific method to get debug data
            result = service.find_words_for_crossword([topic], difficulty, num_words)
            words = result["words"]
            debug_data = result.get("debug", {})

            if debug_data and "probability_distribution" in debug_data:
                prob_data = debug_data["probability_distribution"]
                probabilities = prob_data["probabilities"]

                # Calculate distribution statistics
                similarities = [p["similarity"] for p in probabilities]
                percentiles = [p["percentile"] for p in probabilities]
                composite_scores = [p["composite_score"] for p in probabilities]
                probs = [p["probability"] for p in probabilities]

                # Check for normalization data
                has_normalization_data = any(p.get("normalization_applied", False) for p in probabilities)
                original_scores = []
                if has_normalization_data:
                    original_scores = [p.get("original_composite_score", p["composite_score"]) for p in probabilities]

                stats = {
                    "topic": topic,
                    "word_count": len(words),
                    "similarity_range": (min(similarities), max(similarities)),
                    "similarity_mean": np.mean(similarities),
                    "similarity_std": np.std(similarities),
                    "percentile_mean": np.mean(percentiles),
                    "percentile_std": np.std(percentiles),
                    "composite_mean": np.mean(composite_scores),
                    "composite_std": np.std(composite_scores),
                    "prob_entropy": -sum(p * np.log(p + 1e-10) for p in probs),  # Selection entropy
                    "selected_words": [w["word"] for w in words[:5]],  # First 5 words
                    "normalization_applied": has_normalization_data
                }

                if original_scores:
                    stats["original_composite_mean"] = np.mean(original_scores)
                    stats["original_composite_std"] = np.std(original_scores)
                    stats["normalization_effect"] = abs(stats["composite_mean"] - stats["original_composite_mean"])

                results[topic] = stats

                # Display key statistics
                print(f"   ✅ Generated {len(words)} words")
                print(f"   📊 Similarity range: {stats['similarity_range'][0]:.3f} - {stats['similarity_range'][1]:.3f}")
                print(f"   📈 Similarity mean±std: {stats['similarity_mean']:.3f}±{stats['similarity_std']:.3f}")
                print(f"   🎯 Percentile mean±std: {stats['percentile_mean']:.3f}±{stats['percentile_std']:.3f}")
                print(f"   🔒 Composite mean±std: {stats['composite_mean']:.3f}±{stats['composite_std']:.3f}")
                if has_normalization_data:
                    print(f"   🎯 Normalization applied: original composite mean was {stats['original_composite_mean']:.3f}")
                    print(f"   📈 Normalization effect: {stats['normalization_effect']:.3f} change in mean")
                print(f"   📝 Selected words: {', '.join(stats['selected_words'])}")

            else:
                print(f"   ❌ No debug data available for {topic}")

        except Exception as e:
            print(f"   ❌ Error testing {topic}: {e}")
            continue

    # Analyze consistency across topics
    if len(results) >= 3:
        print(f"\n📊 NORMALIZATION CONSISTENCY ANALYSIS")
        print("=" * 60)

        # Compare similarity ranges (should be more consistent after normalization)
        sim_ranges = [stats['similarity_range'][1] - stats['similarity_range'][0] for stats in results.values()]
        sim_means = [stats['similarity_mean'] for stats in results.values()]
        composite_stds = [stats['composite_std'] for stats in results.values()]
        percentile_means = [stats['percentile_mean'] for stats in results.values()]

        print(f"🎯 Similarity Range Consistency:")
        print(f"   Range spread: {np.std(sim_ranges):.4f} (lower = more consistent)")
        print(f"   Mean variation: {np.std(sim_means):.4f} (lower = more consistent)")

        print(f"\n🎲 Selection Distribution Consistency:")
        print(f"   Composite score std variation: {np.std(composite_stds):.4f} (lower = more consistent)")
        print(f"   Percentile targeting consistency: {np.std(percentile_means):.4f} (should be near 0.5 for medium)")

        print(f"\n🏆 Normalization Effectiveness:")
        if any(stats.get('normalization_applied', False) for stats in results.values()):
            normalization_effects = [stats.get('normalization_effect', 0) for stats in results.values() if stats.get('normalization_effect') is not None]
            if normalization_effects:
                avg_effect = np.mean(normalization_effects)
                print(f"   Average normalization effect: {avg_effect:.4f}")
                print(f"   Normalization was {'SIGNIFICANT' if avg_effect > 0.05 else 'MINIMAL'}")
            print("   ✅ Normalization data found in debug output")
        else:
            print("   ⚠️ No normalization data found - check ENABLE_DISTRIBUTION_NORMALIZATION")

        # Ideal targets for medium difficulty
        target_percentile = 0.5
        percentile_deviation = np.mean([abs(pm - target_percentile) for pm in percentile_means])
        print(f"\n🎯 Difficulty Targeting Accuracy:")
        print(f"   Target percentile (medium): {target_percentile}")
        print(f"   Average deviation: {percentile_deviation:.4f}")
        print(f"   Targeting accuracy: {'EXCELLENT' if percentile_deviation < 0.05 else 'GOOD' if percentile_deviation < 0.1 else 'NEEDS IMPROVEMENT'}")

    print(f"\n✅ Distribution normalization test completed!")
    return results

def test_normalization_methods():
    """Test different normalization methods."""
    print(f"\n🧪 Testing different normalization methods...")

    methods = ["similarity_range", "composite_zscore", "percentile_recentering"]
    topic = "animals"     # Use consistent topic
    difficulty = "easy"   # Use easy difficulty to see clear effects

    for method in methods:
        print(f"\n🔧 Testing method: {method.upper()}")

        os.environ['NORMALIZATION_METHOD'] = method

        from services.thematic_word_service import ThematicWordService

        service = ThematicWordService()
        service.initialize()

        try:
            result = service.find_words_for_crossword([topic], difficulty, 10)
            words = result["words"]
            debug_data = result.get("debug", {})

            if debug_data and "probability_distribution" in debug_data:
                prob_data = debug_data["probability_distribution"]
                probabilities = prob_data["probabilities"]

                similarities = [p["similarity"] for p in probabilities]
                percentiles = [p["percentile"] for p in probabilities]

                print(f"   📊 Similarity range: {min(similarities):.3f} - {max(similarities):.3f}")
                print(f"   🎯 Mean percentile: {np.mean(percentiles):.3f} (target for easy: 0.9)")
                print(f"   📈 Selected words: {', '.join([w['word'] for w in words[:5]])}")

                if any(p.get("normalization_applied", False) for p in probabilities):
                    print(f"   ✅ Normalization applied successfully")
                else:
                    print(f"   ⚠️ Normalization not detected in debug data")
            else:
                print(f"   ❌ No debug data available")

        except Exception as e:
            print(f"   ❌ Error with method {method}: {e}")

if __name__ == "__main__":
    print("🎯 Distribution Normalization Test Suite")
    print("=" * 50)

    test_normalization_across_topics()
    test_normalization_methods()

    print(f"\n🎉 All tests completed!")
    print(f"\n💡 To see normalization effects in the UI:")
    print(f"   1. Set ENABLE_DISTRIBUTION_NORMALIZATION=true")
    print(f"   2. Set ENABLE_DEBUG_TAB=true")
    print(f"   3. Generate crosswords with different topics at the same difficulty")
    print(f"   4. Check the Debug tab for normalization indicators and tooltips")
crossword-app/frontend/src/components/DebugTab.jsx CHANGED
@@ -313,13 +313,21 @@ const DebugTab = ({ debugData }) => {
           },
           label: function(context) {
             const item = sortedByPercentile[context.dataIndex];
-            return [
+            const labels = [
               `Probability: ${(item.probability * 100).toFixed(2)}%`,
               `Composite Score: ${item.composite_score.toFixed(3)}`,
               `Similarity: ${item.similarity.toFixed(3)}`,
               `Percentile: ${(item.percentile * 100).toFixed(1)}%`,
               `Tier: ${item.tier.replace('tier_', '').replace('_', ' ')}`
             ];
+
+            // Add normalization data if available
+            if (item.normalization_applied && item.original_composite_score !== undefined) {
+              labels.splice(2, 0, `Original Score: ${item.original_composite_score.toFixed(3)}`);
+              labels.splice(3, 0, `🎯 Normalized: ${item.normalization_method}`);
+            }
+
+            return labels;
           }
         },
         backgroundColor: 'rgba(0, 0, 0, 0.8)',
@@ -394,13 +402,21 @@ const DebugTab = ({ debugData }) => {
           },
           label: function(context) {
             const item = sortedByPercentile[context.dataIndex];
-            return [
+            const labels = [
               `Probability: ${(item.probability * 100).toFixed(2)}%`,
               `Composite Score: ${item.composite_score.toFixed(3)}`,
               `Similarity: ${item.similarity.toFixed(3)}`,
               `Percentile: ${(item.percentile * 100).toFixed(1)}%`,
               `Tier: ${item.tier.replace('tier_', '').replace('_', ' ')}`
             ];
+
+            // Add normalization data if available
+            if (item.normalization_applied && item.original_composite_score !== undefined) {
+              labels.splice(2, 0, `Original Score: ${item.original_composite_score.toFixed(3)}`);
+              labels.splice(3, 0, `🎯 Normalized: ${item.normalization_method}`);
+            }
+
+            return labels;
           }
         },
         backgroundColor: 'rgba(0, 0, 0, 0.8)',
@@ -500,6 +516,11 @@ const DebugTab = ({ debugData }) => {
           <div><strong>Top Probability:</strong> {(Math.max(...sortedByPercentile.map(p => p.probability)) * 100).toFixed(1)}%</div>
           <div><strong>Average:</strong> {((1/probData.total_candidates) * 100).toFixed(1)}%</div>
           <div><strong>Temperature Effect:</strong> {probData.temperature < 1 ? 'More deterministic' : probData.temperature > 1 ? 'More random' : 'Balanced'}</div>
+          {probData.normalization_enabled && (
+            <div style={{backgroundColor: '#e8f5e8', padding: '4px', borderRadius: '4px'}}>
+              <strong>🎯 Distribution Normalization:</strong> ENABLED ({probData.normalization_method})
+            </div>
+          )}
           <div><strong>Mean Position:</strong> Word #{meanWordIndex + 1} ({sortedByPercentile[meanWordIndex]?.word})</div>
           <div><strong>Distribution Width (σ):</strong> {sigma.toFixed(1)} words</div>
           <div><strong>σ Sampling Zone:</strong> {(sigmaRangeProbMass * 100).toFixed(1)}% of probability mass</div>
@@ -516,6 +537,9 @@ const DebugTab = ({ debugData }) => {
           frequency percentile (100% → 0%, common → rare). This reveals whether the Gaussian frequency targeting
           is working correctly for your selected difficulty level. Look for probability peaks at the intended percentile ranges:
           <strong> Easy (90%+), Medium (50%), Hard (20%)</strong>.
+          {probData.normalization_enabled && (
+            <> <strong>🎯 Distribution normalization is ENABLED</strong> to ensure consistent difficulty across topics.</>
+          )}
         </p>
       </div>
545