Vocabulary Alternatives Analysis: Beyond WordFreq
Executive Summary
WordFreq, while useful for general frequency analysis, produces vocabulary quality issues for crossword generation due to its web-scraped, uncurated nature. After hands-on evaluation of alternatives, most "curated" crossword lists have significant quality issues requiring substantial cleanup effort.
Updated Recommendations (Post-Evaluation):
- Primary: COCA free sample (6K high-quality words with rich metadata) + Peter Norvig's clean 100K list
- Quality Leader: COCA full version (if budget allows) - built on a corpus of about one billion words, sophisticated metadata
- Fallback: SUBTLEX (reasonable quality, needs programming to parse properly)
- Avoid: Most crossword-specific lists contain junk data requiring extensive cleanup
- Semantic Processing: Keep all-mpnet-base-v2 (working well)
Current Issues with WordFreq Vocabulary
Problems Identified:
- Web-based contamination: Includes Reddit, Twitter, and web crawl data with typos, slang, and internet-specific language
- No quality filtering: Purely frequency-based without considering appropriateness for crosswords
- Mixed registers: Combines formal and informal language indiscriminately
- Problematic intersections: Generates words like "ethology", "guns", "porn" for topics like "Art+Books"
- Limited metadata: No information about word suitability, part-of-speech, or crossword usage
- AI contamination risk: WordFreq author stopped updates in 2024 due to generative AI polluting data sources
Impact on Crossword Generation:
- Lower quality semantic intersections
- Inappropriate words for family-friendly puzzles
- Poor difficulty calibration
- Reduced solver experience quality
Superior Alternatives
1. Crossword-Specific Word Lists (QUALITY ISSUES FOUND)
A. Collaborative Word List (NOT RECOMMENDED)
- Source: https://github.com/Crossword-Nexus/collaborative-word-list
- Size: 114,000+ words
- Direct download: https://raw.githubusercontent.com/Crossword-Nexus/collaborative-word-list/main/xwordlist.dict
- QUALITY PROBLEMS IDENTIFIED:
- Contains nonsensical entries: 10THGENCONSOLE, 1STGENERATIONCONSOLES, 4XGAMES
- Single letters: A, AA, AAA, AAAA
- Meaningless sequences: AAAAH, AAAAUTOCLUB
- Verdict: Requires extensive cleanup before use
B. Spread the Word(list) (NOT RECOMMENDED)
- Source: https://www.spreadthewordlist.com
- Size: 114,000+ answers with scores
- QUALITY PROBLEMS IDENTIFIED:
- Garbage entries: zzzzzzzzzzzzzzz, zzzquil
- Malformed words: aaaaddress, aabb, aabba
- Random sequences: aaiiiiiiiiiiiii
- Verdict: Same quality issues as Collaborative List
C. Christopher Jones' Crossword Wordlist (NEEDS CLEANUP)
- Source: https://github.com/christophsjones/crossword-wordlist
- QUALITY PROBLEMS IDENTIFIED:
- Long phrases: "a week from now", "a recipe for disaster"
- Absurdly long compounds: ABIRDINTHEHANDISWORTHTWOINTHEBUSH, ABLEBODIEDSEAMAN
- Arbitrary scoring: Many words with score 50 don't match the claimed "common words you wouldn't hesitate to use"
- Verdict: Contains good data but needs significant filtering and rescoring
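The kinds of entries flagged in the three lists above (digits, embedded spaces, long runs of one letter, over-long compounds) can be screened with a few cheap heuristics. The following is a minimal sketch only; the function name and both thresholds are illustrative assumptions, not part of any of these projects.

```python
import re

# Hypothetical thresholds; tune them against your own grids and word lists.
MAX_LENGTH = 15        # assumes a 15x15 grid, so longer entries cannot fit
MAX_LETTER_RUN = 2     # three or more identical letters in a row is suspicious

_LETTERS_ONLY = re.compile(r"^[A-Z]+$")
_LONG_RUN = re.compile(r"(.)\1{%d,}" % MAX_LETTER_RUN)


def looks_like_junk(entry: str) -> bool:
    """Return True if an entry trips any of the simple junk heuristics."""
    word = entry.strip().upper()
    if not _LETTERS_ONLY.match(word):       # digits, spaces, punctuation
        return True
    if len(word) < 3 or len(word) > MAX_LENGTH:
        return True
    if _LONG_RUN.search(word):              # e.g. AAAA, AAAAUTOCLUB
        return True
    return False
```

Heuristics like these catch 10THGENCONSOLE, AAAAUTOCLUB, and multi-word phrases, but not short pattern strings such as aabb, which is one reason the sections below lean on cross-referencing against a cleaner list instead.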
2. SUBTLEX Psycholinguistic Databases (REASONABLE QUALITY)
SUBTLEX-US (American English)
- Source: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
- Size: 74,000+ words
- Quality: Based on film/TV subtitles (natural language exposure)
- Scoring: Zipf scale 1-7, contextual diversity metrics
- License: Free for research
EVALUATION RESULTS:
- Better quality: Words are generally reasonable and appropriate
- Contains short phrases: Some entries are short phrases rather than single words
- Requires programming: Need to parse and filter the numerical data properly
- Rich metadata: Includes frequency, Zipf scores, part-of-speech, contextual diversity
- Research backing: Proven to predict word processing difficulty better than traditional corpora
Advantages:
- Psycholinguistic validity: Better predictor of word processing difficulty
- Clean vocabulary: Professional media content (edited, appropriate)
- Good difficulty calibration: Zipf 1-3 = rare/hard, 4-7 = common/easy
- Multiple languages: Available for US, UK, Chinese, Welsh, Spanish
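The "requires programming" caveat mostly comes down to two steps: dropping entries that are not single words, and putting frequencies on the Zipf scale. A Zipf value is the base-10 logarithm of frequency per billion words, which equals log10(frequency per million) + 3. The sketch below runs on a tiny inline sample; the column names `Word` and `SUBTLWF` are assumptions about the downloaded file's header, and the published Zipf columns may apply smoothing that this ignores.

```python
import csv
import io
import math

# Tiny inline stand-in for the SUBTLEX-US file. The column names "Word" and
# "SUBTLWF" (frequency per million words) are assumptions; check the header
# of the file you actually download.
SAMPLE_TSV = "Word\tSUBTLWF\nwater\t250.0\nbizarre\t5.0\nice cream\t20.0\nx2\t1.0\n"


def load_subtlex_zipf(handle, word_col="Word", freq_col="SUBTLWF"):
    """Map lowercase single words to Zipf values (log10 per-million + 3)."""
    zipf = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        word = row[word_col].strip().lower()
        if not word.isalpha():           # drops phrases, digits, apostrophes
            continue
        per_million = float(row[freq_col])
        if per_million > 0:
            zipf[word] = math.log10(per_million) + 3.0
    return zipf


subtlex_zipf = load_subtlex_zipf(io.StringIO(SAMPLE_TSV))
```

With the sample above, only `water` and `bizarre` survive the filter; for the real file, pass an open file handle in place of the `io.StringIO` object.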
3. COCA (Corpus of Contemporary American English) (EXCELLENT QUALITY)
Available Data:
- Free tier: ~6,000 words with rich metadata and collocates
- Full version: frequency data drawn from the complete corpus of about one billion words, with sophisticated metadata (paid)
- Source: https://www.wordfrequency.info/ and https://github.com/brucewlee/COCA-WordFrequency
- Composition: Balanced across news, fiction, academic, spoken
EVALUATION RESULTS:
- Excellent quality: "Phew, this is good" - professional curation shows
- Rich metadata: Frequency, part-of-speech, genre distribution, collocates
- Clean vocabulary: Academic standard filtering
- Balanced representation: Multiple text types ensure comprehensive coverage
- Premium option: Full version draws on the complete corpus of about one billion words, with sophisticated metadata
- Free sample sufficient: 6K words could serve as high-quality core vocabulary
Advantages:
- Academic gold standard: Most accurate and reliable word frequency data
- Professional curation: High editorial and scholarly standards
- Balanced corpus: News, fiction, academic, spoken genres represented
- Collocate data: Helps understand word usage patterns and context
- Research proven: Widely used and validated in linguistics research
4. Peter Norvig's Clean Word Lists (EXCELLENT DISCOVERY)
Norvig's Word Count Lists
- Source: https://norvig.com/ngrams/
- Key Resource: count_1w100k.txt - 100,000 most popular words, all uppercase
- Quality: Really clean vocabulary without junk entries
- Problem: No frequency information included
EVALUATION RESULTS:
- Very clean: Properly curated, no garbage like other sources
- Good coverage: 100K words should provide sufficient vocabulary
- Reliable source: Peter Norvig (Google's Director of Research) ensures quality
- Missing frequencies: Would need to cross-reference with other sources for difficulty grading
- Hybrid opportunity: Could combine Norvig's clean words with frequency data from SUBTLEX or COCA
Potential Implementation:
# Use Norvig's clean word list as vocabulary base
norvig_words = load_norvig_100k()
# Cross-reference with SUBTLEX for frequency data
subtlex_freq = load_subtlex_frequencies()
# Result: Clean vocabulary + reliable frequency information
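One possible shape for that cross-reference is sketched here. The helper names and sample values are hypothetical, and the loader reads only the first token of each line, so it makes no assumption about whatever else a line in the word-list file contains. Because the Norvig list is uppercase and frequency tables are usually lowercase, the join normalizes case explicitly.

```python
import io

# Inline stand-ins for the real files; all values here are made up.
NORVIG_SAMPLE = "THE\t100\nWATER\t50\nQUIXOTIC\t1\n"
SUBTLEX_ZIPF_SAMPLE = {"the": 7.3, "water": 5.4}


def load_word_list(handle):
    """Read one word per line, keeping only the first whitespace-separated token."""
    words = []
    for line in handle:
        tokens = line.split()
        if tokens:
            words.append(tokens[0].upper())
    return words


def attach_frequencies(words, zipf_by_word):
    """Pair each uppercase word with its Zipf value, or None when unknown."""
    return {w: zipf_by_word.get(w.lower()) for w in words}


vocabulary = attach_frequencies(
    load_word_list(io.StringIO(NORVIG_SAMPLE)), SUBTLEX_ZIPF_SAMPLE
)
```

Words that come back with `None` are the ones that still need a frequency source or a default difficulty.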
5. Premium Options (For Comparison - Not Evaluated)
XWordInfo (NYT-focused)
- Cost: $50 Angel membership
- Quality: Every NYT crossword ever published
- Size: 200,000+ words
- Note: Not evaluated in this analysis
Cruciverb
- Cost: $35 Gold membership
- Quality: Multiple publication sources
- Note: Not evaluated in this analysis
Detailed Comparison Analysis (Updated with Evaluation Results)
| Source | Size | Quality | Frequency Data | Evaluation Notes | Cost | Recommendation |
|---|---|---|---|---|---|---|
| WordFreq | 100K+ | Web-scraped | Frequency | Original issues | Free | Current baseline |
| Collaborative List | 114K+ | Junk entries | Arbitrary scoring | 10THGENCONSOLE, AAAA | Free | AVOID |
| Spread Wordlist | 114K+ | Junk entries | Arbitrary scoring | zzzzzzzzzzzzzzz, aabb | Free | AVOID |
| C. Jones Wordlist | ~50K | Needs filtering | Arbitrary scoring | Long phrases, compounds | Free | CLEANUP REQUIRED |
| SUBTLEX-US | 74K | Reasonable quality | Zipf 1-7 | Clean, some phrases | Free | VIABLE |
| COCA (free) | 6K | Excellent | Rich metadata | "Phew, this is good" | Free | RECOMMENDED |
| COCA (full) | 1M+ | Excellent | Rich metadata | Sophisticated metadata | $$$ | PREMIUM CHOICE |
| Norvig 100K | 100K | Very clean | None included | Clean, no garbage | Free | HYBRID BASE |
Updated Implementation Recommendations (Post-Evaluation)
Recommended Approach: Hybrid COCA + Norvig System
Based on hands-on evaluation, the cleanest approach combines the best of multiple sources:
Option A: COCA Free + Extended Coverage (Recommended)
# parse_coca_free_sample(), filtered_subtlex_words() and load_norvig_100k()
# are placeholder loaders to be implemented per data source.
# 1. Load COCA 6K words as high-quality core
def load_coca_core():
    """Load 6K high-quality words from COCA free sample"""
    # Excellent quality, rich metadata, reliable frequencies
    return parse_coca_free_sample()
# 2. Extend with filtered SUBTLEX for broader coverage
def extend_with_subtlex():
    """Add clean words from SUBTLEX for broader coverage"""
    # Filter out phrases, keep single words only
    # Use Zipf scores for difficulty grading
    return filtered_subtlex_words()
# 3. Cross-reference with Norvig's clean list for validation
def validate_with_norvig(candidate_words):
    """Use Norvig's 100K list to validate word cleanliness"""
    norvig_clean = set(load_norvig_100k())
    # Only include words that appear in Norvig's curated list
    # (the list is uppercase, so compare in uppercase)
    return [w for w in candidate_words if w.upper() in norvig_clean]
Option B: Norvig Base + Frequency Cross-Reference (Alternative)
# 1. Start with Norvig's clean 100K vocabulary
norvig_words = load_norvig_100k()
# 2. Cross-reference with COCA for frequency data
coca_freq = load_coca_frequencies()  # Free 6K sample
subtlex_freq = load_subtlex_frequencies()  # Broader coverage
# 3. Assign frequencies with fallback chain
DEFAULT_DIFFICULTY = None  # Used when neither source knows the word
def get_word_difficulty(word):
    if word in coca_freq:
        return coca_freq[word]  # Highest quality
    elif word in subtlex_freq:
        return subtlex_freq[word]  # Good quality
    else:
        return DEFAULT_DIFFICULTY  # Fallback
Why This Hybrid Approach Works
Problems with "Crossword-Specific" Lists:
- Collaborative Word List: Contains 10THGENCONSOLE, AAAA, AAAAUTOCLUB
- Spread the Wordlist: Contains zzzzzzzzzzzzzzz, aaaaddress, aabba
- Christopher Jones: Contains ABIRDINTHEHANDISWORTHTWOINTHEBUSH
- Verdict: All require extensive cleanup, defeating their supposed advantage
Advantages of COCA + Norvig Hybrid:
- COCA Free: 6K professionally curated, academically validated words
- Norvig 100K: Clean vocabulary from Google's Director of Research
- SUBTLEX: Reasonable quality with psycholinguistic validity
- No garbage: Avoid the cleanup nightmare of "crossword-specific" lists
- Research backing: Academic and industry validation
Updated Difficulty Grading System
def classify_word_difficulty(word):
    """Updated difficulty classification using clean sources"""
    # Priority 1: COCA data (highest quality)
    if word in coca_frequencies:
        freq_rank = coca_frequencies[word]['rank']
        if freq_rank <= 1000:
            return "easy"
        elif freq_rank <= 3000:
            return "medium"
        else:
            return "hard"
    # Priority 2: SUBTLEX Zipf score
    elif word in subtlex_zipf:
        zipf = subtlex_zipf[word]
        if zipf >= 4.5:
            return "easy"  # Very common
        elif zipf >= 2.5:
            return "medium"  # Moderately common
        else:
            return "hard"  # Rare
    # Fallback: Conservative classification
    else:
        return "medium"  # Unknown words default to medium
Updated Technical Integration Steps
1. Data Download and Preprocessing (Revised)
# Download COCA free sample (6K high-quality words)
wget https://raw.githubusercontent.com/brucewlee/COCA-WordFrequency/master/coca_5000.txt
# Download Peter Norvig's clean 100K word list
wget https://norvig.com/ngrams/count_1w100k.txt
# Download SUBTLEX-US (requires academic access)
# Available at: https://www.ugent.be/pp/experimentele-psychologie/en/research/documents/subtlexus
# AVOID these due to quality issues:
# - Collaborative Word List (contains garbage)
# - Spread the Wordlist (contains garbage)
# - Christopher Jones (needs extensive cleanup)
2. Data Structure Migration
class EnhancedVocabulary:
    def __init__(self):
        self.norvig_words = set()  # clean base vocabulary (uppercase)
        self.coca_rank = {}  # word -> COCA frequency rank (1 = most frequent)
        self.subtlex_zipf = {}  # word -> zipf score (1-7)
        self.subtlex_pos = {}  # word -> part of speech
        self.word_embeddings = {}  # word -> embedding vector
    def load_all_sources(self):
        """Load and integrate all vocabulary sources"""
        self.load_norvig_wordlist()
        self.load_coca_sample()
        self.load_subtlex_data()
        self.compute_embeddings()  # Keep existing all-mpnet-base-v2
    def is_crossword_suitable(self, word):
        """Filter based on membership in the clean base vocabulary"""
        return word.upper() in self.norvig_words
3. Configuration Updates
# Environment variables to add
VOCAB_SOURCE = "hybrid" # "coca", "subtlex", "hybrid"
NORVIG_WORDLIST_URL = "https://norvig.com/ngrams/count_1w100k.txt"
COCA_DATA_PATH = "/path/to/coca_5000.txt"
SUBTLEX_DATA_PATH = "/path/to/subtlex_us.txt"
MIN_ZIPF_SCORE = 2.0 # Minimum SUBTLEX frequency
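Reading these settings at startup might look like the sketch below. The `VocabConfig` class and `load_vocab_config` function are hypothetical names introduced for illustration; only the environment variable names and default values come from the configuration above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VocabConfig:
    source: str
    subtlex_path: str
    min_zipf: float


def load_vocab_config(env=None):
    """Build a config from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    min_zipf = float(env.get("MIN_ZIPF_SCORE", "2.0"))
    if not 0.0 <= min_zipf <= 8.0:
        raise ValueError(f"MIN_ZIPF_SCORE out of range: {min_zipf}")
    return VocabConfig(
        source=env.get("VOCAB_SOURCE", "hybrid"),
        subtlex_path=env.get("SUBTLEX_DATA_PATH", "/path/to/subtlex_us.txt"),
        min_zipf=min_zipf,
    )
```

Passing a plain dictionary as `env` keeps the function easy to test without touching the real process environment.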
Quality Scoring Systems Comparison
WordFreq (Current)
- Scale: Frequency values (logarithmic)
- Basis: Web text frequency
- Issues: No quality filtering, includes inappropriate content
Collaborative Word List
- Scale: 10-100 quality score
- Basis: Crossword constructor consensus
- Interpretation:
- 70-100: Excellent crossword words (common, clean)
- 40-69: Good crossword words (moderate difficulty)
- 10-39: Challenging words (obscure, specialized)
SUBTLEX Zipf Scale
- Scale: 1-7 (logarithmic)
- Basis: Psycholinguistic word processing research
- Interpretation:
- 6-7: Ultra common (THE, AND, OF)
- 4-5: Common (HOUSE, WATER, FRIEND)
- 2-3: Uncommon (BIZARRE, ELOQUENT)
- 1: Rare (OBSEQUIOUS, PERSPICACIOUS)
Expected Benefits
Immediate Quality Improvements:
- Cleaner intersections: No more "ethology/guns/porn" issues
- Family-friendly vocabulary: Professionally edited sources instead of raw web text
- Better difficulty calibration: Psycholinguistically validated scales
- Crossword-optimized: Words chosen for puzzle suitability
Long-term Advantages:
- Community support: Active maintenance by crossword constructors
- Research backing: SUBTLEX has extensive academic validation
- Hybrid flexibility: Can combine multiple quality signals
- Scalability: Easy to add new vocabulary sources
Migration Strategy
Week 1: Data Integration
- Download and preprocess the COCA free sample, Norvig's 100K list, and SUBTLEX-US
- Create vocabulary loading pipeline
- Implement basic quality filtering
Week 2: Scoring System
- Implement hybrid quality scoring
- Map quality scores to difficulty levels
- Test with existing multi-topic intersection methods
Week 3: Performance Validation
- A/B test against WordFreq baseline
- Measure semantic intersection quality
- Validate difficulty calibration
Week 4: Production Deployment
- Update environment configuration
- Monitor vocabulary coverage
- Collect user feedback on word quality
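For the coverage monitoring and baseline comparison steps in this plan, a simple starting metric is the share of a reference word set that the new vocabulary still contains, plus a sample of what went missing. This is a minimal sketch with hypothetical function names; the reference set could be, for example, words used in previously generated puzzles.

```python
def vocabulary_coverage(candidate_words, reference_words):
    """Fraction of reference words present in the candidate vocabulary."""
    candidate = {w.upper() for w in candidate_words}
    reference = {w.upper() for w in reference_words}
    if not reference:
        return 0.0
    return len(candidate & reference) / len(reference)


def missing_words(candidate_words, reference_words, limit=20):
    """Sorted sample of reference words the candidate vocabulary lacks."""
    candidate = {w.upper() for w in candidate_words}
    missing = sorted({w.upper() for w in reference_words} - candidate)
    return missing[:limit]
```

Reviewing the `missing_words` output by hand is a quick way to tell whether lost coverage is junk being filtered out or legitimate vocabulary that needs another source.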
Alternative Implementation: Gradual Migration
For lower risk, implement gradual migration:
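from wordfreq import zipf_frequency
# coca_zipf is assumed to map word -> Zipf value derived from COCA
# per-million frequency (log10 + 3), so all three tiers share one scale.
def get_word_quality(word):
    """Gradual migration approach: return a score in [0, 1]"""
    if word in coca_zipf:
        # Use the COCA-derived value if available
        return min(coca_zipf[word] / 7.0, 1.0)
    elif word in subtlex_zipf:
        # Fallback to SUBTLEX
        return min(subtlex_zipf[word] / 7.0, 1.0)
    else:
        # Final fallback to WordFreq, on the same Zipf scale
        return min(zipf_frequency(word, 'en') / 7.0, 1.0)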
This allows testing new vocabulary sources while maintaining compatibility with existing words not found in curated lists.
Conclusion (Updated After Hands-On Evaluation)
Key Finding: Most "crossword-specific" vocabulary lists contain significant amounts of junk data that require extensive cleanup, defeating their supposed advantage over general-purpose sources.
Recommended Solution: Combine high-quality general sources instead:
- COCA free sample (6K words) for core high-quality vocabulary
- Peter Norvig's 100K list for clean, broad coverage
- SUBTLEX for psycholinguistically validated difficulty grading
- Avoid crossword-specific lists until they improve their curation
This hybrid approach provides:
- Clean vocabulary: No 10THGENCONSOLE, zzzzzzzzzzzzzzz, or AAAAUTOCLUB garbage
- Academic validation: COCA and SUBTLEX are research-proven
- Industry credibility: Norvig's list comes from Google's Director of Research
- Reasonable coverage: 6K-100K words should handle most crossword needs
- Better difficulty calibration: Psycholinguistic frequency data beats arbitrary scores
Next Steps:
- Start with COCA free sample as proof of concept
- Extend with filtered SUBTLEX for broader coverage
- Validate against Norvig's clean list
- Consider COCA full version if budget allows
The investment in clean, research-backed vocabulary data will dramatically improve puzzle quality without the cleanup nightmare of supposedly "crossword-specific" sources.