# Vocabulary Optimization & Unification
## Problem Solved
Previously, the crossword system had **vocabulary redundancy** with 3 separate sources:
- **SentenceTransformer Model Vocabulary**: ~30K tokens → ~8-12K actual words after filtering
- **NLTK Words Corpus**: 41,998 words for embeddings in thematic generator
- **WordFreq Database**: 319,938 words for frequency data
This created inconsistencies, memory waste, and limited vocabulary coverage.
## Solution: Unified Architecture
### New Design
- **Single Vocabulary Source**: WordFreq database (319,938 words)
- **Single Embedding Model**: all-mpnet-base-v2 (generates embeddings for any text)
- **Unified Filtering**: Consistent crossword-suitable word filtering
- **Shared Caching**: Single vocabulary + embeddings + frequency cache
### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)
- Loads and filters WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words)
- Generates frequency data with 10-tier classification
- Handles caching for performance
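The filtering and tier rules above can be sketched as follows. This is an illustrative reconstruction, not the actual `VocabularyManager` code: the function names, the exclusion list, and the exact tier cutoffs are assumptions (the real tier names carry descriptive suffixes such as `tier_5_common`).

```python
# Hypothetical sketch of crossword-suitable filtering and 10-tier
# classification. BORING_WORDS stands in for the real exclusion list.
BORING_WORDS = {"the", "and", "very"}

def is_crossword_suitable(word: str) -> bool:
    """Keep 3-12 character, purely alphabetic words not on the boring list."""
    return (3 <= len(word) <= 12
            and word.isalpha()
            and word.lower() not in BORING_WORDS)

def frequency_tier(zipf: float) -> str:
    """Map a Zipf frequency (roughly 0-8, higher = more common) onto one of
    10 tiers; tier_1 is the most common band, tier_10 the rarest. The real
    system attaches suffixes like "_common" to these names."""
    bucket = min(9, int(zipf / 8 * 10))  # 0..9
    return f"tier_{10 - bucket}"
```

Because the vocabulary size limit is applied after this filter, a 100K limit yields 100K crossword-usable words rather than 100K raw tokens.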
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)
- Uses WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for WordFreq words
- Maintains 10-tier frequency classification system
- Provides both hack tool API and backend-compatible API
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)
- Bridge adapter for backend integration
- Compatible with existing VectorSearchService interface
- Uses comprehensive WordFreq vocabulary instead of limited model vocabulary
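The bridge adapter's shape can be sketched like this. Method names mirror the usage examples later in this document, but the difficulty-to-tier mapping and the generator calls are assumptions for illustration, not the actual `UnifiedWordService` implementation.

```python
import asyncio

class UnifiedWordServiceSketch:
    """Illustrative adapter exposing a VectorSearchService-style async API
    on top of a synchronous thematic word generator."""

    def __init__(self, generator, vocab_size_limit=100_000):
        self._generator = generator
        self._vocab_size_limit = vocab_size_limit

    async def initialize(self):
        # Run the heavy vocabulary/embedding load off the event loop.
        await asyncio.to_thread(self._generator.initialize)

    async def find_similar_words(self, topic, difficulty, max_words=15):
        # Assumed mapping from backend difficulty labels to frequency tiers.
        tier = {"easy": "tier_3_common",
                "medium": "tier_5_common",
                "hard": "tier_8_rare"}.get(difficulty)
        results = await asyncio.to_thread(
            self._generator.generate_thematic_words,
            topic, num_words=max_words, difficulty_tier=tier,
        )
        # Return only the words, matching the existing service interface.
        return [word for word, _similarity, _tier in results]
```

Keeping the adapter async-compatible lets the backend swap it in without touching any `await` call sites.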
## Usage
### For Hack Tools
```python
from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # Optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration
#### Option 1: Replace VectorSearchService
```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service
# Initialize
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```
#### Option 2: Direct Usage
```python
from .unified_word_service import UnifiedWordService
service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()
# Compatible with existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements
### Memory Usage
- **Before**: 3 separate vocabularies + embeddings (~500MB+)
- **After**: Single vocabulary + embeddings (~200MB)
- **Reduction**: ~60% lower memory usage
### Vocabulary Coverage
- **Before**: Limited to ~8-12K words from model tokenizer
- **After**: Up to 100K+ filtered words from WordFreq database
- **Improvement**: 10x+ vocabulary coverage
### Consistency
- **Before**: Different words available in hack tools vs backend
- **After**: Same comprehensive vocabulary across all components
- **Benefit**: Consistent word quality and availability
## Configuration
### Environment Variables
- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: 100000)
- `EMBEDDING_MODEL`: Model name (default: all-mpnet-base-v2)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity (default: 0.3)
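One way these variables could be read at startup is sketched below. The variable names and defaults come from the list above; the function name `load_vocab_config` is illustrative.

```python
import os

def load_vocab_config():
    """Read vocabulary settings from the environment, falling back to the
    documented defaults."""
    return {
        "max_vocabulary_size": int(os.environ.get("MAX_VOCABULARY_SIZE", "100000")),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "all-mpnet-base-v2"),
        "similarity_threshold": float(os.environ.get("WORD_SIMILARITY_THRESHOLD", "0.3")),
    }
```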
### Vocabulary Size Options
- **Small (10K)**: Fast initialization, basic vocabulary
- **Medium (50K)**: Balanced performance and coverage
- **Large (100K)**: Comprehensive coverage, slower initialization
- **Full (319K)**: Complete WordFreq database, longest initialization
## Migration Guide
### For Existing Hack Tools
1. Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
2. Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator`
3. The API remains compatible but now draws on the comprehensive WordFreq vocabulary
### For Backend Services
1. Import: `from .unified_word_service import UnifiedWordService`
2. Replace `VectorSearchService` initialization with `UnifiedWordService`
3. All existing methods remain compatible
4. Benefits: Better vocabulary coverage, consistent frequency data
### Backwards Compatibility
- All existing APIs maintained
- Same method signatures and return formats
- Gradual migration is possible; both systems can run in parallel
## Benefits Summary
✅ **Eliminates Redundancy**: Single vocabulary source instead of 3 separate ones
✅ **Improves Coverage**: 100K+ words vs previous 8-12K words
✅ **Reduces Memory**: ~60% reduction in memory usage
✅ **Ensures Consistency**: Same vocabulary across hack tools and backend
✅ **Maintains Performance**: Smart caching and batch processing
✅ **Preserves Features**: 10-tier frequency classification, difficulty filtering
✅ **Enables Growth**: Easy to add new features with unified architecture
## Cache Management
### Cache Locations
- **Hack tools**: `hack/model_cache/`
- **Backend**: `crossword-app/backend-py/cache/unified_generator/`
### Cache Files
- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings
### Cache Invalidation
Caches are automatically rebuilt if:
- Vocabulary size limit changes
- Embedding model changes
- WordFreq database updates (rare)
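Because the size limit and model name are baked into the cache file names listed above, a change to either simply misses the old files and triggers a rebuild. A minimal load-or-rebuild sketch, assuming pickle-serialized vocabularies (the helper names are illustrative, not the actual implementation):

```python
from pathlib import Path
import pickle

def cache_paths(cache_dir: Path, model: str, size: int) -> dict:
    """File names follow the patterns listed under "Cache Files"."""
    return {
        "vocab": cache_dir / f"unified_vocabulary_{size}.pkl",
        "freq": cache_dir / f"unified_frequencies_{size}.pkl",
        "emb": cache_dir / f"unified_embeddings_{model}_{size}.npy",
    }

def load_or_build_vocab(cache_dir: Path, model: str, size: int, build):
    """Return the cached vocabulary, or build and cache it on a miss."""
    vocab_path = cache_paths(cache_dir, model, size)["vocab"]
    if vocab_path.exists():
        # A changed size or model produces a different file name, so a
        # stale cache is never hit; old files can be cleaned up lazily.
        with vocab_path.open("rb") as f:
            return pickle.load(f)
    vocab = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with vocab_path.open("wb") as f:
        pickle.dump(vocab, f)
    return vocab
```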
## Future Enhancements
1. **Semantic Clustering**: Group words by semantic similarity
2. **Dynamic Difficulty**: Real-time difficulty adjustment based on user performance
3. **Topic Expansion**: Automatic topic discovery and expansion
4. **Multilingual Support**: Extend to other languages using WordFreq
5. **Custom Vocabularies**: Allow domain-specific vocabulary additions