# Vocabulary Optimization & Unification

## Problem Solved

Previously, the crossword system had **vocabulary redundancy** with 3 separate sources:

- **SentenceTransformer Model Vocabulary**: ~30K tokens → ~8-12K actual words after filtering
- **NLTK Words Corpus**: 41,998 words for embeddings in the thematic generator
- **WordFreq Database**: 319,938 words for frequency data

This created inconsistencies, memory waste, and limited vocabulary coverage.

## Solution: Unified Architecture

### New Design

- **Single Vocabulary Source**: WordFreq database (319,938 words)
- **Single Embedding Model**: all-mpnet-base-v2 (generates embeddings for any text)
- **Unified Filtering**: Consistent crossword-suitable word filtering
- **Shared Caching**: Single vocabulary + embeddings + frequency cache

### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)

- Loads and filters the WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words)
- Generates frequency data with 10-tier classification
- Handles caching for performance
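The filtering rules above can be sketched as follows. This is a minimal illustration, not the real implementation: the exclusion set and helper name are hypothetical, and in the actual VocabularyManager the candidate words come from the WordFreq database rather than a hardcoded list.

```python
# Sketch of the crossword-suitability filter: 3-12 characters, purely
# alphabetic, and not in a "boring words" exclusion set. BORING_WORDS and
# is_crossword_suitable are illustrative names, not the real code.
BORING_WORDS = {"the", "and"}  # hypothetical exclusion set

def is_crossword_suitable(word: str) -> bool:
    """Return True if the word passes the documented filtering rules."""
    return 3 <= len(word) <= 12 and word.isalpha() and word not in BORING_WORDS

# Tiny stand-in for the WordFreq candidate list:
candidates = ["the", "ox", "quasar", "don't", "electromagnetism", "nebula"]
print([w for w in candidates if is_crossword_suitable(w)])  # ['quasar', 'nebula']
```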
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)

- Uses the WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for WordFreq words
- Maintains the 10-tier frequency classification system
- Provides both the hack tool API and a backend-compatible API
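The 10-tier classification might look like the sketch below. The tier boundaries, Zipf-based scoring, and most tier names here are assumptions (only `tier_5_common` appears elsewhere in this document), so treat this as an illustration of the idea rather than the actual mapping.

```python
# Hedged sketch of a 10-tier frequency classification based on Zipf scores
# (wordfreq's scale, roughly 0-8, where higher means more frequent).
TIER_NAMES = [
    "tier_1_ultra_common", "tier_2_very_common", "tier_3_highly_common",
    "tier_4_quite_common", "tier_5_common", "tier_6_somewhat_common",
    "tier_7_uncommon", "tier_8_rare", "tier_9_very_rare", "tier_10_ultra_rare",
]

def classify_tier(zipf: float) -> str:
    """Map a Zipf frequency score onto one of ten tiers.

    Assumes Zipf >= 7.0 is tier 1 and each tier below spans 0.5 Zipf units.
    """
    index = min(9, max(0, int((7.0 - zipf) / 0.5)))
    return TIER_NAMES[index]

print(classify_tier(7.3))  # tier_1_ultra_common
print(classify_tier(4.7))  # tier_5_common
```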
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)

- Bridge adapter for backend integration
- Compatible with the existing VectorSearchService interface
- Uses the comprehensive WordFreq vocabulary instead of the limited model vocabulary
## Usage

### For Hack Tools

```python
from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with the desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # Optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration

#### Option 1: Replace VectorSearchService

```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service

# Initialize
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```

#### Option 2: Direct Usage

```python
from .unified_word_service import UnifiedWordService

service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()

# Compatible with the existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements

### Memory Usage

- **Before**: 3 separate vocabularies + embeddings (~500MB+)
- **After**: Single vocabulary + embeddings (~200MB)
- **Reduction**: ~60% less memory

### Vocabulary Coverage

- **Before**: Limited to ~8-12K words from the model tokenizer
- **After**: Up to 100K+ filtered words from the WordFreq database
- **Improvement**: 10x+ vocabulary coverage

### Consistency

- **Before**: Different words available in hack tools vs. the backend
- **After**: The same comprehensive vocabulary across all components
- **Benefit**: Consistent word quality and availability
## Configuration

### Environment Variables

- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: 100000)
- `EMBEDDING_MODEL`: Model name (default: all-mpnet-base-v2)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity (default: 0.3)
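Reading these variables with their documented defaults might look like the sketch below; the variable names and defaults come from this document, but the `load_config` helper is illustrative, not the real services' loader.

```python
# Hedged sketch: read the documented environment variables with their defaults.
import os

def load_config(env=os.environ):
    """Return the configuration, falling back to the documented defaults."""
    return {
        "max_vocabulary_size": int(env.get("MAX_VOCABULARY_SIZE", "100000")),
        "embedding_model": env.get("EMBEDDING_MODEL", "all-mpnet-base-v2"),
        "word_similarity_threshold": float(env.get("WORD_SIMILARITY_THRESHOLD", "0.3")),
    }

config = load_config()
```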
### Vocabulary Size Options

- **Small (10K)**: Fast initialization, basic vocabulary
- **Medium (50K)**: Balanced performance and coverage
- **Large (100K)**: Comprehensive coverage, slower initialization
- **Full (319K)**: Complete WordFreq database, longest initialization
## Migration Guide

### For Existing Hack Tools

1. Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
2. Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator`
3. The API remains compatible but now uses the comprehensive WordFreq vocabulary

### For Backend Services

1. Import: `from .unified_word_service import UnifiedWordService`
2. Replace the `VectorSearchService` initialization with `UnifiedWordService`
3. All existing methods remain compatible
4. Benefits: better vocabulary coverage and consistent frequency data

### Backwards Compatibility

- All existing APIs are maintained
- Same method signatures and return formats
- Gradual migration is possible: both systems can run in parallel
## Benefits Summary

✅ **Eliminates Redundancy**: Single vocabulary source instead of 3 separate ones
✅ **Improves Coverage**: 100K+ words vs. the previous 8-12K words
✅ **Reduces Memory**: ~60% reduction in memory usage
✅ **Ensures Consistency**: Same vocabulary across hack tools and the backend
✅ **Maintains Performance**: Smart caching and batch processing
✅ **Preserves Features**: 10-tier frequency classification, difficulty filtering
✅ **Enables Growth**: Easy to add new features with a unified architecture
## Cache Management

### Cache Locations

- **Hack tools**: `hack/model_cache/`
- **Backend**: `crossword-app/backend-py/cache/unified_generator/`

### Cache Files

- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings

### Cache Invalidation

Caches are automatically rebuilt if:

- The vocabulary size limit changes
- The embedding model changes
- The WordFreq database updates (rare)
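The invalidation scheme falls out of the file-naming convention: because the cache file name embeds the vocabulary size (and, for embeddings, the model name), changing either parameter points at a different file and triggers a rebuild. The sketch below illustrates this for the vocabulary cache; the helper name and load-or-build flow are assumptions, not the real implementation.

```python
# Hedged sketch of the load-or-build cache pattern using the documented
# unified_vocabulary_<size>.pkl naming. Changing `size` changes the file
# name, so a stale cache for a different size is simply never consulted.
import pickle
from pathlib import Path

def load_or_build_vocabulary(cache_dir: Path, size: int, build_fn):
    """Return the cached vocabulary for `size`, building and caching it if absent."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"unified_vocabulary_{size}.pkl"
    if path.exists():
        return pickle.loads(path.read_bytes())
    vocab = build_fn(size)          # expensive: filter the WordFreq list
    path.write_bytes(pickle.dumps(vocab))
    return vocab
```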
## Future Enhancements

1. **Semantic Clustering**: Group words by semantic similarity
2. **Dynamic Difficulty**: Real-time difficulty adjustment based on user performance
3. **Topic Expansion**: Automatic topic discovery and expansion
4. **Multilingual Support**: Extend to other languages using WordFreq
5. **Custom Vocabularies**: Allow domain-specific vocabulary additions