# Vocabulary Optimization & Unification
## Problem Solved
Previously, the crossword system had **vocabulary redundancy** with 3 separate sources:
- **SentenceTransformer Model Vocabulary**: ~30K tokens → ~8-12K actual words after filtering
- **NLTK Words Corpus**: 41,998 words for embeddings in thematic generator
- **WordFreq Database**: 319,938 words for frequency data
This created inconsistencies, memory waste, and limited vocabulary coverage.
## Solution: Unified Architecture
### New Design
- **Single Vocabulary Source**: WordFreq database (319,938 words)
- **Single Embedding Model**: all-mpnet-base-v2 (generates embeddings for any text)
- **Unified Filtering**: Consistent crossword-suitable word filtering
- **Shared Caching**: Single vocabulary + embeddings + frequency cache
### Key Components
#### 1. VocabularyManager (`hack/thematic_word_generator.py`)
- Loads and filters WordFreq vocabulary
- Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words)
- Generates frequency data with 10-tier classification
- Handles caching for performance
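The filtering and tier rules above can be sketched as follows. This is an illustrative reconstruction, not the actual `VocabularyManager` code: the function names, the exclusion list, and the exact tier cutoffs are assumptions (the real tier names carry descriptive suffixes such as `tier_5_common`).

```python
# Hypothetical sketch of crossword-suitable filtering and 10-tier
# classification. BORING_WORDS stands in for the real exclusion list.
BORING_WORDS = {"the", "and", "very"}

def is_crossword_suitable(word: str) -> bool:
    """Keep 3-12 character, purely alphabetic words not on the boring list."""
    return (3 <= len(word) <= 12
            and word.isalpha()
            and word.lower() not in BORING_WORDS)

def frequency_tier(zipf: float) -> str:
    """Map a Zipf frequency (roughly 0-8, higher = more common) onto one of
    10 tiers; tier_1 is the most common band, tier_10 the rarest. The real
    system attaches suffixes like "_common" to these names."""
    bucket = min(9, int(zipf / 8 * 10))  # 0..9
    return f"tier_{10 - bucket}"
```

Because the vocabulary size limit is applied after this filter, a 100K limit yields 100K crossword-usable words rather than 100K raw tokens.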
#### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`)
- Uses WordFreq vocabulary instead of NLTK words
- Generates all-mpnet-base-v2 embeddings for WordFreq words
- Maintains 10-tier frequency classification system
- Provides both hack tool API and backend-compatible API
#### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`)
- Bridge adapter for backend integration
- Compatible with existing VectorSearchService interface
- Uses comprehensive WordFreq vocabulary instead of limited model vocabulary
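The bridge adapter's shape can be sketched like this. Method names mirror the usage examples later in this document, but the difficulty-to-tier mapping and the generator calls are assumptions for illustration, not the actual `UnifiedWordService` implementation.

```python
import asyncio

class UnifiedWordServiceSketch:
    """Illustrative adapter exposing a VectorSearchService-style async API
    on top of a synchronous thematic word generator."""

    def __init__(self, generator, vocab_size_limit=100_000):
        self._generator = generator
        self._vocab_size_limit = vocab_size_limit

    async def initialize(self):
        # Run the heavy vocabulary/embedding load off the event loop.
        await asyncio.to_thread(self._generator.initialize)

    async def find_similar_words(self, topic, difficulty, max_words=15):
        # Assumed mapping from backend difficulty labels to frequency tiers.
        tier = {"easy": "tier_3_common",
                "medium": "tier_5_common",
                "hard": "tier_8_rare"}.get(difficulty)
        results = await asyncio.to_thread(
            self._generator.generate_thematic_words,
            topic, num_words=max_words, difficulty_tier=tier,
        )
        # Return only the words, matching the existing service interface.
        return [word for word, _similarity, _tier in results]
```

Keeping the adapter async-compatible lets the backend swap it in without touching any `await` call sites.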
## Usage
### For Hack Tools
```python
from thematic_word_generator import UnifiedThematicWordGenerator

# Initialize with desired vocabulary size
generator = UnifiedThematicWordGenerator(vocab_size_limit=100000)
generator.initialize()

# Generate thematic words with tier info
results = generator.generate_thematic_words(
    topic="science",
    num_words=10,
    difficulty_tier="tier_5_common",  # Optional tier filtering
)

for word, similarity, tier in results:
    print(f"{word}: {similarity:.3f} ({tier})")
```
### For Backend Integration
#### Option 1: Replace VectorSearchService
```python
# In crossword_generator.py
from .unified_word_service import create_unified_word_service
# Initialize
vector_service = await create_unified_word_service(vocab_size_limit=100000)
crossword_gen = CrosswordGenerator(vector_service=vector_service)
```
#### Option 2: Direct Usage
```python
from .unified_word_service import UnifiedWordService
service = UnifiedWordService(vocab_size_limit=100000)
await service.initialize()
# Compatible with existing interface
words = await service.find_similar_words("animal", "medium", max_words=15)
```
## Performance Improvements
### Memory Usage
- **Before**: 3 separate vocabularies + embeddings (~500MB+)
- **After**: Single vocabulary + embeddings (~200MB)
- **Reduction**: ~60% lower memory usage
### Vocabulary Coverage
- **Before**: Limited to ~8-12K words from model tokenizer
- **After**: Up to 100K+ filtered words from WordFreq database
- **Improvement**: 10x+ vocabulary coverage
### Consistency
- **Before**: Different words available in hack tools vs backend
- **After**: Same comprehensive vocabulary across all components
- **Benefit**: Consistent word quality and availability
## Configuration
### Environment Variables
- `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: 100000)
- `EMBEDDING_MODEL`: Model name (default: all-mpnet-base-v2)
- `WORD_SIMILARITY_THRESHOLD`: Minimum similarity (default: 0.3)
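One way these variables could be read at startup is sketched below. The variable names and defaults come from the list above; the function name `load_vocab_config` is illustrative.

```python
import os

def load_vocab_config():
    """Read vocabulary settings from the environment, falling back to the
    documented defaults."""
    return {
        "max_vocabulary_size": int(os.environ.get("MAX_VOCABULARY_SIZE", "100000")),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "all-mpnet-base-v2"),
        "similarity_threshold": float(os.environ.get("WORD_SIMILARITY_THRESHOLD", "0.3")),
    }
```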
### Vocabulary Size Options
- **Small (10K)**: Fast initialization, basic vocabulary
- **Medium (50K)**: Balanced performance and coverage
- **Large (100K)**: Comprehensive coverage, slower initialization
- **Full (319K)**: Complete WordFreq database, longest initialization
## Migration Guide
### For Existing Hack Tools
1. Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator`
2. Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator`
3. The API remains compatible but now draws on the comprehensive WordFreq vocabulary
### For Backend Services
1. Import: `from .unified_word_service import UnifiedWordService`
2. Replace `VectorSearchService` initialization with `UnifiedWordService`
3. All existing methods remain compatible
4. Benefits: Better vocabulary coverage, consistent frequency data
### Backwards Compatibility
- All existing APIs maintained
- Same method signatures and return formats
- Gradual migration is possible; both systems can run in parallel
## Benefits Summary
✅ **Eliminates Redundancy**: Single vocabulary source instead of 3 separate ones
✅ **Improves Coverage**: 100K+ words vs previous 8-12K words
✅ **Reduces Memory**: ~60% reduction in memory usage
✅ **Ensures Consistency**: Same vocabulary across hack tools and backend
✅ **Maintains Performance**: Smart caching and batch processing
✅ **Preserves Features**: 10-tier frequency classification, difficulty filtering
✅ **Enables Growth**: Easy to add new features with unified architecture
## Cache Management
### Cache Locations
- **Hack tools**: `hack/model_cache/`
- **Backend**: `crossword-app/backend-py/cache/unified_generator/`
### Cache Files
- `unified_vocabulary_<size>.pkl`: Filtered vocabulary
- `unified_frequencies_<size>.pkl`: Frequency data
- `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings
### Cache Invalidation
Caches are automatically rebuilt if:
- Vocabulary size limit changes
- Embedding model changes
- WordFreq database updates (rare)
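Because the size limit and model name are baked into the cache file names listed above, a change to either simply misses the old files and triggers a rebuild. A minimal load-or-rebuild sketch, assuming pickle-serialized vocabularies (the helper names are illustrative, not the actual implementation):

```python
from pathlib import Path
import pickle

def cache_paths(cache_dir: Path, model: str, size: int) -> dict:
    """File names follow the patterns listed under "Cache Files"."""
    return {
        "vocab": cache_dir / f"unified_vocabulary_{size}.pkl",
        "freq": cache_dir / f"unified_frequencies_{size}.pkl",
        "emb": cache_dir / f"unified_embeddings_{model}_{size}.npy",
    }

def load_or_build_vocab(cache_dir: Path, model: str, size: int, build):
    """Return the cached vocabulary, or build and cache it on a miss."""
    vocab_path = cache_paths(cache_dir, model, size)["vocab"]
    if vocab_path.exists():
        # A changed size or model produces a different file name, so a
        # stale cache is never hit; old files can be cleaned up lazily.
        with vocab_path.open("rb") as f:
            return pickle.load(f)
    vocab = build()
    cache_dir.mkdir(parents=True, exist_ok=True)
    with vocab_path.open("wb") as f:
        pickle.dump(vocab, f)
    return vocab
```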
## Future Enhancements
1. **Semantic Clustering**: Group words by semantic similarity
2. **Dynamic Difficulty**: Real-time difficulty adjustment based on user performance
3. **Topic Expansion**: Automatic topic discovery and expansion
4. **Multilingual Support**: Extend to other languages using WordFreq
5. **Custom Vocabularies**: Allow domain-specific vocabulary additions