Embedding Limitations and Clue Generation Analysis
Executive Summary
This document analyzes why our current semantic neighbor approach for crossword clue generation produces suboptimal results, explores the fundamental limitations of sentence transformers for entity relationships, and proposes practical solutions for better crossword clues.
The Problem: Poor Quality Clues from Semantic Neighbors
Current Clue Examples
PANESAR → "Associated with pandya, parmar and pankaj"
RAJOURI → "Associated with raji, rajini and rajni"
RAJPUTANA → "Related to rajput (a member of the dominant hindu military caste...)"
DRAVIDA → "Related to dravidian (a member of one of the aboriginal races...)"
TENDULKAR → "Associated with ganguly, sachin and dravid"
Why These Are Poor Crossword Clues
- PANESAR: Semantic neighbors are just phonetically similar Indian names
- RAJPUTANA: The clue contains "rajput" which is part of the answer
- Generic formatting: "Associated with X, Y, Z" is not crossword-style
- Missing entity context: No indication that PANESAR is a cricketer or that RAJOURI is a place
Root Cause Analysis: Sentence Transformer Limitations
The PANESAR Case Study
Expected neighbors for a crossword:
- cricket, england, spinner, bowler
Actual neighbors from embeddings:
PANESAR similarities:
cricket : 0.526 (moderate)
england : 0.264 (very low!)
spinner : 0.361 (low)
bowler : 0.476 (moderate)
pandya : 0.788 (very high!)
parmar : 0.731 (very high!)
pankaj : 0.702 (very high!)
panaji : 0.696 (very high!)
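These scores are plain cosine similarities between embedding vectors. A minimal sketch of the computation, using hypothetical toy 3-dimensional vectors rather than the model's real 768-dimensional embeddings (the values below are illustrative, not actual model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings: "panesar" sits much closer to "pandya"
# (shared surface pattern) than to "england" (factual relation only).
emb = {
    "panesar": np.array([0.9, 0.3, 0.1]),
    "pandya":  np.array([0.8, 0.4, 0.1]),
    "england": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(emb["panesar"], emb["pandya"]))   # high (~0.99)
print(cosine_similarity(emb["panesar"], emb["england"]))  # low (~0.27)
```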
Why This Happens: What Embeddings Actually Encode
Sentence transformers like all-mpnet-base-v2 are trained to encode sentence-level semantics, not entity relationships. When extracting single-word embeddings, they capture:
✓ What They Capture Well:
- Morphological similarity: Words with similar spelling/phonetics
- Syntactic patterns: How words are used grammatically
- Distributional similarity: Words appearing in similar sentence contexts
✗ What They Miss:
- Encyclopedic knowledge: "Panesar is a cricketer"
- Entity relationships: "Panesar played for England"
- Factual attributes: "Rajouri is in Kashmir"
The 768-Dimensional Problem
For PANESAR, the embedding dimensions are encoding:
- High weight: "Sounds like an Indian surname" (pan- prefix pattern)
- High weight: "Appears with other Indian names in text"
- Medium weight: "Sometimes mentioned with cricket terms"
- Low weight: "Played for England team"
The model learned surface patterns rather than semantic facts.
Training Data Distribution Effects
Why Phonetic Similarity Dominates
The training corpus likely contained:
"Indian names like Pandya, Parmar, and Patel..." (frequent)
"Panesar and Pankaj are common surnames..." (frequent)
vs.
"Panesar bowled for England in the 2007 series..." (infrequent)
Result: Phonetic/cultural patterns get higher weight than factual relationships.
Fundamental Issue: Wrong Type of Similarity
What We Need vs What We Get
For crosswords, we need:
- PANESAR → cricketer, spinner, England-born
- RAJOURI → district, Kashmir, border region
- TENDULKAR → batsman, records, Mumbai
What embeddings give us:
- PANESAR → pandya, parmar (phonetic similarity)
- RAJOURI → raji, rajini (name pattern similarity)
- TENDULKAR → ganguly, dravid (co-occurrence similarity)
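The neighbor lists above come from ranking the whole vocabulary by cosine similarity to the target word and keeping the top-k. A minimal sketch with toy 2-dimensional vectors (the words and values are illustrative stand-ins for the real embedding table):

```python
import numpy as np

def top_k_neighbors(target, embeddings, k=2):
    """Return the k words whose vectors are closest to `target` by cosine similarity."""
    t = embeddings[target]
    scores = {}
    for word, vec in embeddings.items():
        if word == target:
            continue
        scores[word] = np.dot(t, vec) / (np.linalg.norm(t) * np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy vectors where surname-like words cluster along one axis
# and cricket vocabulary along the other.
embeddings = {
    "panesar": np.array([1.0, 0.2]),
    "pandya":  np.array([0.9, 0.3]),
    "parmar":  np.array([0.95, 0.25]),
    "cricket": np.array([0.3, 1.0]),
}
print(top_k_neighbors("panesar", embeddings))  # surname-like words rank first
```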
Knowledge-Augmented Embedding Solutions
Available Models with Entity Knowledge
1. Wikipedia2Vec
- Pros: Trained on Wikipedia with entity linking, knows factual relationships
- Cons: Complex setup, requires Wikipedia dump download
- Example: Would know "Monty Panesar" β "English cricketer"
2. BERT-Entity / LUKE
- Pros: Specifically designed for entity understanding
- Cons: Heavier model, requires entity recognition pipeline
- Example: Understands entity types and relationships
3. ConceptNet Numberbatch
- Pros: Combines word embeddings with knowledge graph
- Cons: Large download (several GB), complex integration
- Example: Knows factual relationships like "cricket player from England"
4. ERNIE (Enhanced Representation through kNowledge IntEgration)
- Pros: Integrates knowledge graphs during training
- Cons: Primarily Chinese focus, complex setup
- Example: Better entity-relationship understanding
5. KnowBERT
- Pros: BERT + Knowledge bases (WordNet, Wikipedia)
- Cons: Multiple components, heavy setup
- Example: Combines language understanding with encyclopedic knowledge
Practical Solutions for Our System
Option 1: Hybrid Approach (Recommended)
Keep current embeddings but augment with lightweight knowledge base:
```python
# Small knowledge file
entity_facts = {
    "panesar": {
        "type": "person",
        "domain": "cricket",
        "attributes": ["spinner", "england", "monty"],
        "clue_template": "English {domain} player known as {nickname}",
    },
    "rajouri": {
        "type": "place",
        "domain": "geography",
        "attributes": ["district", "kashmir", "border"],
        "clue_template": "{domain} district in disputed region",
    },
}

def generate_hybrid_clue(word):
    if word in entity_facts:
        return generate_factual_clue(word, entity_facts[word])
    return generate_semantic_neighbor_clue(word)
```
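The `generate_factual_clue` helper is left undefined above. One minimal sketch, assuming a hypothetical schema convention in which the last attribute doubles as a nickname (that convention is an assumption, not part of the current system):

```python
def generate_factual_clue(word, facts):
    """Fill an entity's clue template from its stored facts."""
    # Assumed convention: the last attribute doubles as a nickname.
    # str.format ignores unused keyword arguments, so templates may
    # use any subset of the fields provided here.
    nickname = facts["attributes"][-1].title()
    return facts["clue_template"].format(domain=facts["domain"], nickname=nickname)

facts = {
    "type": "person",
    "domain": "cricket",
    "attributes": ["spinner", "england", "monty"],
    "clue_template": "English {domain} player known as {nickname}",
}
print(generate_factual_clue("panesar", facts))  # English cricket player known as Monty
```

The unused `word` parameter is kept only to match the call signature shown in the hybrid-clue snippet.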
Option 2: Entity Type Classification
Use embedding clusters to identify entity types:
```python
# Pre-compute clusters of known entity types
person_cluster = words_near(["gandhi", "nehru", "shakespeare"])
place_cluster = words_near(["delhi", "mumbai", "london"])
sport_cluster = words_near(["cricket", "football", "tennis"])

# Classify and generate an appropriate clue
def classify_clue(word):
    if word in person_cluster and word in sport_cluster:
        return "Sports personality"
    elif word in place_cluster:
        return "Geographic location"
```
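The `words_near` helper above is left undefined. A minimal sketch, assuming a global embedding table and a cosine-similarity threshold against the seed words' centroid (the toy 2-dimensional vectors and the 0.9 threshold are illustrative assumptions):

```python
import numpy as np

# Toy embedding table standing in for the real 768-d vectors.
embeddings = {
    "delhi":   np.array([1.0, 0.1]),
    "mumbai":  np.array([0.9, 0.2]),
    "london":  np.array([0.95, 0.15]),
    "rajouri": np.array([0.85, 0.1]),
    "cricket": np.array([0.1, 1.0]),
}

def words_near(seeds, threshold=0.9):
    """Vocabulary words within `threshold` cosine similarity of the seeds' centroid."""
    centroid = np.mean([embeddings[s] for s in seeds], axis=0)
    near = set()
    for word, vec in embeddings.items():
        sim = float(np.dot(centroid, vec) /
                    (np.linalg.norm(centroid) * np.linalg.norm(vec)))
        if sim >= threshold:
            near.add(word)
    return near

place_cluster = words_near(["delhi", "mumbai", "london"])
# rajouri falls into the place cluster; cricket does not
```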
Option 3: Knowledge Graph from Co-occurrences
Build relationships from the training corpus:
```python
# Extract from embedding neighborhoods
def build_knowledge_graph():
    knowledge = {}
    for word in vocabulary:
        neighbors = get_semantic_neighbors(word)
        entry = knowledge.setdefault(word, {})
        # Identify patterns in the neighbor lists
        if any(n in cricket_terms for n in neighbors):
            entry["domain"] = "cricket"
        if any(n in place_names for n in neighbors):
            entry["type"] = "place"
    return knowledge
```
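A self-contained toy run of this extraction, with illustrative stand-ins for the names the snippet leaves undefined (`vocabulary`, `get_semantic_neighbors`, `cricket_terms`, `place_names` — all hypothetical data, not the real system's):

```python
# Illustrative stand-in data for the co-occurrence extraction.
cricket_terms = {"spinner", "bowler", "wicket"}
place_names = {"kashmir", "district", "punjab"}
vocabulary = ["panesar", "rajouri"]
neighbor_table = {
    "panesar": ["pandya", "spinner", "bowler"],
    "rajouri": ["raji", "kashmir", "district"],
}

def get_semantic_neighbors(word):
    return neighbor_table[word]

def build_knowledge_graph():
    knowledge = {}
    for word in vocabulary:
        neighbors = get_semantic_neighbors(word)
        entry = knowledge.setdefault(word, {})
        if any(n in cricket_terms for n in neighbors):
            entry["domain"] = "cricket"
        if any(n in place_names for n in neighbors):
            entry["type"] = "place"
    return knowledge

print(build_knowledge_graph())
# {'panesar': {'domain': 'cricket'}, 'rajouri': {'type': 'place'}}
```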
Implementation Recommendations
Phase 1: Immediate Improvement
- Add entity knowledge file for top 1000 words in vocabulary
- Implement hybrid clue generation (facts first, then neighbors)
- Better clue formatting (proper crossword style)
Phase 2: Enhanced System
- Entity type classification using embedding clustering
- Automated knowledge extraction from neighbor patterns
- Domain-specific clue templates
Phase 3: Advanced Solutions
- Evaluate Wikipedia2Vec for full factual embeddings
- Build comprehensive knowledge base for crossword entities
- Train custom embeddings on crossword-specific data
Current System Status
What Works
- ✓ Proper difficulty-based word selection (rare words for hard mode)
- ✓ Fast performance using existing embeddings
- ✓ Better than generic templates (slight improvement)
What Needs Improvement
- ✗ Clue quality still poor for domain-specific entities
- ✗ Phonetic similarity dominates factual relationships
- ✗ No understanding of entity types or attributes
Conclusion
The semantic neighbor approach revealed fundamental limitations of sentence transformers for entity-relationship understanding. While it's better than generic templates, it's insufficient for quality crossword clues.
The recommended path forward is a hybrid approach that augments current embeddings with a lightweight knowledge base, providing factual context for common crossword entities while maintaining system performance and simplicity.
Technical Notes
- Current model: sentence-transformers/all-mpnet-base-v2 (768 dimensions)
- Vocabulary size: ~30,000 words
- Performance impact: Semantic neighbor lookup adds ~50ms per word
- Storage requirements: Current approach uses existing embeddings (~500MB)
This analysis was conducted during the crossword generation optimization project, August 2025