# Vocabulary Alternatives Analysis: Beyond WordFreq
## Executive Summary
WordFreq, while useful for general frequency analysis, produces vocabulary quality issues for crossword generation due to its web-scraped, uncurated nature. Hands-on evaluation of alternatives showed that most "curated" crossword lists have significant quality issues of their own and would require substantial cleanup effort.
### **Updated Recommendations (Post-Evaluation):**
1. **Primary**: COCA free sample (6K high-quality words with rich metadata) + Peter Norvig's clean 100K list
2. **Quality Leader**: COCA full version (if budget allows) - 14 billion words, sophisticated metadata
3. **Fallback**: SUBTLEX (reasonable quality, needs programming to parse properly)
4. **Avoid**: Most crossword-specific lists contain junk data requiring extensive cleanup
5. **Semantic Processing**: Keep all-mpnet-base-v2 (working well)
## Current Issues with WordFreq Vocabulary
### Problems Identified:
1. **Web-based contamination**: Includes Reddit, Twitter, and web crawl data with typos, slang, and internet-specific language
2. **No quality filtering**: Purely frequency-based without considering appropriateness for crosswords
3. **Mixed registers**: Combines formal and informal language indiscriminately
4. **Problematic intersections**: Generates words like "ethology", "guns", "porn" for topics like "Art+Books"
5. **Limited metadata**: No information about word suitability, part-of-speech, or crossword usage
6. **AI contamination risk**: WordFreq author stopped updates in 2024 due to generative AI polluting data sources
### Impact on Crossword Generation:
- Lower quality semantic intersections
- Inappropriate words for family-friendly puzzles
- Poor difficulty calibration
- Reduced solver experience quality
## Superior Alternatives
### 1. Crossword-Specific Word Lists (⚠️ QUALITY ISSUES FOUND)
#### A. Collaborative Word List (❌ NOT RECOMMENDED)
- **Source**: https://github.com/Crossword-Nexus/collaborative-word-list
- **Size**: 114,000+ words
- **Direct download**: `https://raw.githubusercontent.com/Crossword-Nexus/collaborative-word-list/main/xwordlist.dict`
- **QUALITY PROBLEMS IDENTIFIED**:
  - Nonsensical entries: `10THGENCONSOLE`, `1STGENERATIONCONSOLES`, `4XGAMES`
  - Single letters and letter runs: `A`, `AA`, `AAA`, `AAAA`
  - Meaningless sequences: `AAAAH`, `AAAAUTOCLUB`
- **Verdict**: Requires extensive cleanup before use
#### B. Spread the Word(list) (❌ NOT RECOMMENDED)
- **Source**: https://www.spreadthewordlist.com
- **Size**: 114,000+ answers with scores
- **QUALITY PROBLEMS IDENTIFIED**:
  - Garbage entries: `zzzzzzzzzzzzzzz`, `zzzquil`
  - Malformed words: `aaaaddress`, `aabb`, `aabba`
  - Random sequences: `aaiiiiiiiiiiiii`
- **Verdict**: Same quality issues as the Collaborative List
#### C. Christopher Jones' Crossword Wordlist (⚠️ NEEDS CLEANUP)
- **Source**: https://github.com/christophsjones/crossword-wordlist
- **QUALITY PROBLEMS IDENTIFIED**:
  - Long phrases: `"a week from now"`, `"a recipe for disaster"`
  - Absurdly long compounds: `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`, `ABLEBODIEDSEAMAN`
  - Arbitrary scoring: Many words with score 50 don't match the claimed "common words you wouldn't hesitate to use"
- **Verdict**: Contains good data but needs significant filtering and rescoring
### 2. SUBTLEX Psycholinguistic Databases (✅ REASONABLE QUALITY)
#### SUBTLEX-US (American English)
- **Source**: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
- **Size**: 74,000+ words
- **Quality**: Based on film/TV subtitles (natural language exposure)
- **Scoring**: Zipf scale 1-7, contextual diversity metrics
- **License**: Free for research
#### EVALUATION RESULTS:
- **✅ Better quality**: Words are generally reasonable and appropriate
- **⚠️ Contains some phrases**: A few entries are short multi-word strings rather than single words
- **⚠️ Requires programming**: The numerical data needs to be parsed and filtered properly (see the loading sketch after the advantages list below)
- **✅ Rich metadata**: Includes frequency, Zipf scores, part-of-speech, contextual diversity
- **✅ Research backing**: Proven to predict word processing difficulty better than traditional corpora
#### Advantages:
- **Psycholinguistic validity**: Better predictor of word processing difficulty
- **Clean vocabulary**: Professional media content (edited, appropriate)
- **Good difficulty calibration**: Zipf 1-3 = rare/hard, 4-7 = common/easy
- **Multiple languages**: Available for US, UK, Chinese, Welsh, Spanish
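The snippet below is a minimal sketch of that parsing step. It assumes a tab-delimited SUBTLEX-US export with `Word` and `Zipf` columns and that `pandas` is available; the file name, delimiter, and column names are assumptions to verify against the actual release.
```python
import pandas as pd

# Minimal sketch: load SUBTLEX-US and keep clean single-word entries.
# Column names ("Word", "Zipf") and the tab delimiter are assumptions --
# verify them against the actual SUBTLEX-US download.
def load_subtlex_zipf(path="SUBTLEX-US.txt"):
    df = pd.read_csv(path, sep="\t")
    df = df[df["Word"].str.fullmatch(r"[A-Za-z]+", na=False)]  # drop phrases/junk
    return dict(zip(df["Word"].str.lower(), df["Zipf"]))

subtlex_zipf = load_subtlex_zipf()
# Example: restrict to words common enough for easy/medium puzzles
common_words = {w for w, z in subtlex_zipf.items() if z >= 2.5}
```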
### 3. COCA (Corpus of Contemporary American English) (🌟 EXCELLENT QUALITY)
#### Available Data:
- **Free tier**: ~6,000 words with rich metadata and collocates
- **Full version**: word data derived from a 14-billion-word corpus, with sophisticated metadata (paid)
- **Source**: https://www.wordfrequency.info/ and https://github.com/brucewlee/COCA-WordFrequency
- **Composition**: Balanced across news, fiction, academic, spoken
#### EVALUATION RESULTS:
- **🌟 Excellent quality**: "Phew, this is good" - the professional curation shows
- **✅ Rich metadata**: Frequency, part-of-speech, genre distribution, collocates
- **✅ Clean vocabulary**: Academic-standard filtering
- **✅ Balanced representation**: Multiple text types ensure comprehensive coverage
- **💰 Premium option**: Full version provides data from a 14-billion-word corpus with sophisticated metadata
- **✅ Free sample sufficient**: The 6K-word sample could serve as a high-quality core vocabulary (see the loading sketch at the end of this section)
#### Advantages:
- **Academic gold standard**: Most accurate and reliable word frequency data
- **Professional curation**: High editorial and scholarly standards
- **Balanced corpus**: News, fiction, academic, spoken genres represented
- **Collocate data**: Helps understand word usage patterns and context
- **Research proven**: Widely used and validated in linguistics research
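As a sketch of how the free sample could be wired in, the loader below assumes a simple tab-delimited export with rank, word, part-of-speech, and frequency columns; the file name, delimiter, and column names are assumptions and should be checked against the actual download.
```python
import csv

# Hypothetical parser for the COCA free sample (column names are assumptions;
# adjust them to the actual header of the downloaded file).
def parse_coca_free_sample(path="coca_5000.txt"):
    core = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            core[row["word"].lower()] = {
                "rank": int(row["rank"]),
                "pos": row["pos"],
                "freq": float(row["freq"]),
            }
    return core

coca_frequencies = parse_coca_free_sample()  # e.g. coca_frequencies["house"]["rank"]
```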
### 4. Peter Norvig's Clean Word Lists (🌟 EXCELLENT DISCOVERY)
#### Norvig's Word Count Lists
- **Source**: https://norvig.com/ngrams/
- **Key Resource**: `count_1w100k.txt` - 100,000 most popular words, all uppercase
- **Quality**: Very clean vocabulary without junk entries
- **Problem**: No frequency information included
#### EVALUATION RESULTS:
- **✅ Very clean**: Properly curated, no garbage like the other sources
- **✅ Good coverage**: 100K words should provide sufficient vocabulary
- **✅ Reliable source**: Curated by Peter Norvig (longtime Director of Research at Google)
- **❌ Missing frequencies**: Would need to cross-reference with other sources for difficulty grading
- **💡 Hybrid opportunity**: Could combine Norvig's clean words with frequency data from SUBTLEX or COCA
#### Potential Implementation:
```python
# Use Norvig's clean word list as the vocabulary base.
# Assumes each line of count_1w100k.txt starts with the word itself
# (whitespace-separated); verify against the actual file. Words are
# lowercased for matching against other sources.
def load_norvig_100k(path="count_1w100k.txt"):
    with open(path) as f:
        return {line.split()[0].lower() for line in f if line.strip()}

norvig_words = load_norvig_100k()
# Cross-reference with SUBTLEX for frequency data
subtlex_freq = load_subtlex_frequencies()
# Result: clean vocabulary + reliable frequency information
```
### 5. Premium Options (For Comparison - Not Evaluated)
#### XWordInfo (NYT-focused)
- **Cost**: $50 Angel membership
- **Quality**: Every NYT crossword ever published
- **Size**: 200,000+ words
- **Note**: Not evaluated in this analysis
#### Cruciverb
- **Cost**: $35 Gold membership
- **Quality**: Multiple publication sources
- **Note**: Not evaluated in this analysis
## Detailed Comparison Analysis (Updated with Evaluation Results)
| Source | Size | Quality Score | Frequency Data | Evaluated Quality | Cost | Recommendation |
|--------|------|---------------|----------------|------------------|------|----------------|
| **WordFreq** | 100K+ | ❌ Web-scraped | ✅ Frequency | ❌ Original issues | Free | ⚠️ Current baseline |
| **Collaborative List** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `10THGENCONSOLE`, `AAAA` | Free | ❌ **AVOID** |
| **Spread Wordlist** | 114K+ | ❌ Junk entries | ❌ Arbitrary scoring | ❌ `zzzzzzzzzzzzzzz`, `aabb` | Free | ❌ **AVOID** |
| **C. Jones Wordlist** | ~50K | ⚠️ Needs filtering | ⚠️ Arbitrary scoring | ⚠️ Long phrases, compounds | Free | ⚠️ **CLEANUP REQUIRED** |
| **SUBTLEX-US** | 74K | ✅ Reasonable quality | ✅ Zipf 1-7 | ✅ Clean, some phrases | Free | ✅ **VIABLE** |
| **COCA (free)** | 6K | 🌟 Excellent | ✅ Rich metadata | 🌟 "Phew, this is good" | Free | 🌟 **RECOMMENDED** |
| **COCA (full)** | 1M+ | 🌟 Excellent | ✅ Rich metadata | 🌟 Sophisticated metadata | $$$ | 🌟 **PREMIUM CHOICE** |
| **Norvig 100K** | 100K | 🌟 Very clean | ❌ None included | 🌟 Clean, no garbage | Free | 🌟 **HYBRID BASE** |
## Updated Implementation Recommendations (Post-Evaluation)
### Recommended Approach: Hybrid COCA + Norvig System
Based on hands-on evaluation, the cleanest approach combines the best of multiple sources:
#### Option A: COCA Free + Extended Coverage (Recommended)
```python
# 1. Load COCA 6K words as the high-quality core
def load_coca_core():
    """Load 6K high-quality words from the COCA free sample"""
    # Excellent quality, rich metadata, reliable frequencies
    return parse_coca_free_sample()

# 2. Extend with filtered SUBTLEX for broader coverage
def extend_with_subtlex():
    """Add clean words from SUBTLEX for broader coverage"""
    # Filter out phrases, keep single words only
    # Use Zipf scores for difficulty grading
    return filtered_subtlex_words()

# 3. Cross-reference with Norvig's clean list for validation
def validate_with_norvig(vocabulary):
    """Use Norvig's 100K list to validate word cleanliness"""
    norvig_clean = load_norvig_100k()
    # Only keep words that also appear in Norvig's curated list
    return {word: meta for word, meta in vocabulary.items()
            if word in norvig_clean}
```
#### Option B: Norvig Base + Frequency Cross-Reference (Alternative)
```python
# 1. Start with Norvig's clean 100K vocabulary
norvig_words = load_norvig_100k()

# 2. Cross-reference with COCA and SUBTLEX for frequency data
coca_freq = load_coca_frequencies()        # Free 6K sample
subtlex_freq = load_subtlex_frequencies()  # Broader coverage

# 3. Assign frequencies with a fallback chain
default_difficulty = "medium"  # conservative default for words in neither source

def get_word_difficulty(word):
    if word in coca_freq:
        return coca_freq[word]        # Highest quality
    elif word in subtlex_freq:
        return subtlex_freq[word]     # Good quality
    else:
        return default_difficulty     # Fallback
```
### Why This Hybrid Approach Works
#### Problems with "Crossword-Specific" Lists:
- **Collaborative Word List**: Contains `10THGENCONSOLE`, `AAAA`, `AAAAUTOCLUB`
- **Spread the Wordlist**: Contains `zzzzzzzzzzzzzzz`, `aaaaddress`, `aabba`
- **Christopher Jones**: Contains `ABIRDINTHEHANDISWORTHTWOINTHEBUSH`
- **Verdict**: All require extensive cleanup, defeating their supposed advantage
#### Advantages of COCA + Norvig Hybrid:
- **COCA Free**: 6K professionally curated, academically validated words
- **Norvig 100K**: Clean vocabulary curated by Google's longtime Director of Research
- **SUBTLEX**: Reasonable quality with psycholinguistic validity
- **No garbage**: Avoids the cleanup nightmare of "crossword-specific" lists (a minimal filter sketch follows this list)
- **Research backing**: Academic and industry validation
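For illustration, the check below shows how little code is needed to reject the kinds of entries quoted above once a clean reference list is available. `norvig_words` is assumed to be the lowercased set loaded earlier, and the 3-15 length bounds are arbitrary placeholders rather than a recommendation from this analysis.
```python
import re

# Illustrative filter: letters only, crossword-friendly length, and present
# in Norvig's curated list (norvig_words: lowercased set loaded earlier).
ALPHA_ONLY = re.compile(r"[a-z]{3,15}")  # length bounds are placeholders

def is_clean_entry(entry, norvig_words):
    word = entry.strip().lower()
    if not ALPHA_ONLY.fullmatch(word):
        return False             # drops "10THGENCONSOLE", "a recipe for disaster"
    return word in norvig_words  # drops junk strings absent from the curated list

# e.g. is_clean_entry("AAAAUTOCLUB", norvig_words) would return False,
# assuming that string does not appear in Norvig's list.
```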
### Updated Difficulty Grading System
```python
def classify_word_difficulty(word):
    """Updated difficulty classification using clean sources"""
    # Priority 1: COCA data (highest quality)
    if word in coca_frequencies:
        freq_rank = coca_frequencies[word]['rank']
        if freq_rank <= 1000:
            return "easy"
        elif freq_rank <= 3000:
            return "medium"
        else:
            return "hard"
    # Priority 2: SUBTLEX Zipf score
    elif word in subtlex_zipf:
        zipf = subtlex_zipf[word]
        if zipf >= 4.5:
            return "easy"      # Very common
        elif zipf >= 2.5:
            return "medium"    # Moderately common
        else:
            return "hard"      # Rare
    # Fallback: Conservative classification
    else:
        return "medium"        # Unknown words default to medium
```
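A quick, hypothetical spot-check of the classifier, assuming `coca_frequencies` and `subtlex_zipf` have been loaded as sketched earlier; the example words echo the Zipf bands listed in the scoring comparison later in this document.
```python
# Hypothetical spot-check; output depends on the actual loaded data.
for word in ["water", "eloquent", "perspicacious"]:
    print(word, classify_word_difficulty(word))
```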
## Updated Technical Integration Steps
### 1. Data Download and Preprocessing (Revised)
```bash
# Download COCA free sample (6K high-quality words)
wget https://raw.githubusercontent.com/brucewlee/COCA-WordFrequency/master/coca_5000.txt

# Download Peter Norvig's clean 100K word list
wget https://norvig.com/ngrams/count_1w100k.txt

# Download SUBTLEX-US (requires academic access)
# Available at: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus

# AVOID these due to quality issues:
# ❌ Collaborative Word List (contains garbage)
# ❌ Spread the Wordlist (contains garbage)
# ❌ Christopher Jones (needs extensive cleanup)
```
### 2. Data Structure Migration
```python
class EnhancedVocabulary:
    def __init__(self):
        self.coca_metadata = {}    # word -> rank, part-of-speech, genre data
        self.subtlex_zipf = {}     # word -> Zipf score (1-7)
        self.subtlex_pos = {}      # word -> part of speech
        self.norvig_words = set()  # clean 100K vocabulary for validation
        self.word_embeddings = {}  # word -> embedding vector

    def load_all_sources(self):
        """Load and integrate all vocabulary sources"""
        self.load_coca_core()
        self.load_subtlex_data()
        self.load_norvig_100k()
        self.compute_embeddings()  # Keep existing all-mpnet-base-v2

    def is_crossword_suitable(self, word):
        """Filter based on vocabulary cleanliness (Norvig/COCA membership)"""
        word = word.lower()
        return word in self.norvig_words or word in self.coca_metadata
```
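A minimal usage sketch, assuming the loader methods above are implemented as described:
```python
vocab = EnhancedVocabulary()
vocab.load_all_sources()

# Filter candidate answers before they reach the grid generator
candidates = [w for w in ("house", "ethology", "aaaautoclub")
              if vocab.is_crossword_suitable(w)]
```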
### 3. Configuration Updates
```python
# Environment variables to add
VOCAB_SOURCE = "hybrid"             # "coca", "subtlex", "norvig", or "hybrid"
COCA_SAMPLE_PATH = "/path/to/coca_5000.txt"
NORVIG_WORDLIST_PATH = "/path/to/count_1w100k.txt"
SUBTLEX_DATA_PATH = "/path/to/subtlex_us.txt"
MIN_ZIPF_SCORE = 2.0                # Minimum SUBTLEX frequency for inclusion
```
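If these values are read from the environment at startup, a minimal sketch might look like the following (variable names as above; the default paths are placeholders):
```python
import os

# Read vocabulary configuration from the environment; defaults are placeholders.
VOCAB_SOURCE = os.environ.get("VOCAB_SOURCE", "hybrid")
COCA_SAMPLE_PATH = os.environ.get("COCA_SAMPLE_PATH", "data/coca_5000.txt")
NORVIG_WORDLIST_PATH = os.environ.get("NORVIG_WORDLIST_PATH", "data/count_1w100k.txt")
SUBTLEX_DATA_PATH = os.environ.get("SUBTLEX_DATA_PATH", "data/subtlex_us.txt")
MIN_ZIPF_SCORE = float(os.environ.get("MIN_ZIPF_SCORE", "2.0"))
```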
## Quality Scoring Systems Comparison
### WordFreq (Current)
- **Scale**: Frequency values (logarithmic)
- **Basis**: Web text frequency
- **Issues**: No quality filtering, includes inappropriate content
### Collaborative Word List (not recommended after evaluation)
- **Scale**: 10-100 quality score
- **Basis**: Crossword constructor consensus
- **Interpretation**:
  - 70-100: Excellent crossword words (common, clean)
  - 40-69: Good crossword words (moderate difficulty)
  - 10-39: Challenging words (obscure, specialized)
### SUBTLEX Zipf Scale
- **Scale**: 1-7 (logarithmic)
- **Basis**: Psycholinguistic word processing research
- **Interpretation**:
  - 6-7: Ultra common (THE, AND, OF)
  - 4-5: Common (HOUSE, WATER, FRIEND)
  - 2-3: Uncommon (BIZARRE, ELOQUENT)
  - 1: Rare (OBSEQUIOUS, PERSPICACIOUS)
## Expected Benefits
### Immediate Quality Improvements:
1. **Cleaner intersections**: No more "ethology/guns/porn" issues
2. **Family-friendly vocabulary**: Professionally edited source material instead of raw web text
3. **Better difficulty calibration**: Psycholinguistically validated scales
4. **No junk entries**: Curated sources avoid the nonsense strings found in the crossword-specific lists
### Long-term Advantages:
1. **Stable, maintained sources**: COCA and SUBTLEX are maintained by established research groups
2. **Research backing**: SUBTLEX has extensive academic validation
3. **Hybrid flexibility**: Can combine multiple quality signals
4. **Scalability**: Easy to add new vocabulary sources
## Migration Strategy
### Week 1: Data Integration
- Download and preprocess the COCA free sample, Norvig's 100K list, and SUBTLEX-US
- Create vocabulary loading pipeline
- Implement basic quality filtering
### Week 2: Scoring System
- Implement hybrid quality scoring
- Map quality scores to difficulty levels
- Test with existing multi-topic intersection methods
### Week 3: Performance Validation
- A/B test against WordFreq baseline
- Measure semantic intersection quality
- Validate difficulty calibration
### Week 4: Production Deployment
- Update environment configuration
- Monitor vocabulary coverage
- Collect user feedback on word quality
## Alternative Implementation: Gradual Migration
For lower risk, implement a gradual migration:
```python
from wordfreq import word_frequency

def get_word_quality(word):
    """Gradual migration: prefer curated sources, fall back to WordFreq."""
    if word in coca_frequencies:
        # Use COCA rank if available (rank 1 = best), normalised against
        # the ~6K-word free sample
        return 1.0 - (coca_frequencies[word]['rank'] / 6000.0)
    elif word in subtlex_zipf:
        # Fall back to SUBTLEX Zipf (1-7), normalised to 0-1
        return subtlex_zipf[word] / 7.0
    else:
        # Final fallback to the existing WordFreq baseline
        return word_frequency(word, 'en')
```
This allows testing new vocabulary sources while maintaining compatibility for words not found in the curated lists.
## Conclusion (Updated After Hands-On Evaluation)
**Key Finding**: Most "crossword-specific" vocabulary lists contain significant amounts of junk data that require extensive cleanup, defeating their supposed advantage over general-purpose sources.
**Recommended Solution**: Combine high-quality general sources instead:
1. **COCA free sample** (6K words) for core high-quality vocabulary
2. **Peter Norvig's 100K list** for clean, broad coverage
3. **SUBTLEX** for psycholinguistically validated difficulty grading
4. **Avoid crossword-specific lists** until their curation improves
This hybrid approach provides:
- **Clean vocabulary**: No `10THGENCONSOLE`, `zzzzzzzzzzzzzzz`, or `AAAAUTOCLUB` garbage
- **Academic validation**: COCA and SUBTLEX are research-proven
- **Industry credibility**: Norvig's list comes from Google's longtime Director of Research
- **Reasonable coverage**: 6K-100K words should handle most crossword needs
- **Better difficulty calibration**: Psycholinguistic frequency data beats arbitrary scores
**Next Steps**:
1. Start with the COCA free sample as a proof of concept
2. Extend with filtered SUBTLEX for broader coverage
3. Validate against Norvig's clean list
4. Consider the COCA full version if budget allows
The investment in clean, research-backed vocabulary data will dramatically improve puzzle quality without the cleanup nightmare of supposedly "crossword-specific" sources.