Embedding Limitations and Clue Generation Analysis
Executive Summary
This document analyzes why our current semantic neighbor approach for crossword clue generation produces suboptimal results, explores the fundamental limitations of sentence transformers for entity relationships, and proposes practical solutions for better crossword clues.
The Problem: Poor Quality Clues from Semantic Neighbors
Current Clue Examples
PANESAR → "Associated with pandya, parmar and pankaj"
RAJOURI → "Associated with raji, rajini and rajni"
RAJPUTANA → "Related to rajput (a member of the dominant hindu military caste...)"
DRAVIDA → "Related to dravidian (a member of one of the aboriginal races...)"
TENDULKAR → "Associated with ganguly, sachin and dravid"
Why These Are Poor Crossword Clues
- PANESAR: Semantic neighbors are just phonetically similar Indian names
- RAJPUTANA: The clue contains "rajput" which is part of the answer
- Generic formatting: "Associated with X, Y, Z" is not crossword-style
- Missing entity context: No indication that PANESAR is a cricketer or that RAJOURI is a place
Root Cause Analysis: Sentence Transformer Limitations
The PANESAR Case Study
Expected neighbors for a crossword:
- cricket, england, spinner, bowler
Actual neighbors from embeddings:
PANESAR similarities:
cricket : 0.526 (moderate)
england : 0.264 (very low!)
spinner : 0.361 (low)
bowler : 0.476 (moderate)
pandya : 0.788 (very high!)
parmar : 0.731 (very high!)
pankaj : 0.702 (very high!)
panaji : 0.696 (very high!)
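These scores are plain cosine similarities between embedding vectors. A minimal sketch of the computation, using hypothetical toy 3-dimensional vectors rather than the model's real 768-dimensional embeddings (the values below are illustrative, not actual model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings: "panesar" sits much closer to "pandya"
# (shared surface pattern) than to "england" (factual relation only).
emb = {
    "panesar": np.array([0.9, 0.3, 0.1]),
    "pandya":  np.array([0.8, 0.4, 0.1]),
    "england": np.array([0.1, 0.2, 0.9]),
}

print(cosine_similarity(emb["panesar"], emb["pandya"]))   # high (~0.99)
print(cosine_similarity(emb["panesar"], emb["england"]))  # low (~0.27)
```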
Why This Happens: What Embeddings Actually Encode
Sentence transformers like all-mpnet-base-v2 are trained to encode sentence-level semantics, not entity relationships. When extracting single-word embeddings, they capture:
✓ What They Capture Well:
- Morphological similarity: Words with similar spelling/phonetics
- Syntactic patterns: How words are used grammatically
- Distributional similarity: Words appearing in similar sentence contexts
✗ What They Miss:
- Encyclopedic knowledge: "Panesar is a cricketer"
- Entity relationships: "Panesar played for England"
- Factual attributes: "Rajouri is in Kashmir"
The 768-Dimensional Problem
For PANESAR, the embedding dimensions are encoding:
- High weight: "Sounds like an Indian surname" (pan- prefix pattern)
- High weight: "Appears with other Indian names in text"
- Medium weight: "Sometimes mentioned with cricket terms"
- Low weight: "Played for England team"
The model learned surface patterns rather than semantic facts.
Training Data Distribution Effects
Why Phonetic Similarity Dominates
The training corpus likely contained:
"Indian names like Pandya, Parmar, and Patel..." (frequent)
"Panesar and Pankaj are common surnames..." (frequent)
vs.
"Panesar bowled for England in the 2007 series..." (infrequent)
Result: Phonetic/cultural patterns get higher weight than factual relationships.
Fundamental Issue: Wrong Type of Similarity
What We Need vs What We Get
For crosswords, we need:
- PANESAR → cricketer, spinner, England-born
- RAJOURI → district, Kashmir, border region
- TENDULKAR → batsman, records, Mumbai
What embeddings give us:
- PANESAR → pandya, parmar (phonetic similarity)
- RAJOURI → raji, rajini (name pattern similarity)
- TENDULKAR → ganguly, dravid (co-occurrence similarity)
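The neighbor lists above come from ranking the whole vocabulary by cosine similarity to the target word and keeping the top-k. A minimal sketch with toy 2-dimensional vectors (the words and values are illustrative stand-ins for the real embedding table):

```python
import numpy as np

def top_k_neighbors(target, embeddings, k=2):
    """Return the k words whose vectors are closest to `target` by cosine similarity."""
    t = embeddings[target]
    scores = {}
    for word, vec in embeddings.items():
        if word == target:
            continue
        scores[word] = np.dot(t, vec) / (np.linalg.norm(t) * np.linalg.norm(vec))
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy vectors where surname-like words cluster along one axis
# and cricket vocabulary along the other.
embeddings = {
    "panesar": np.array([1.0, 0.2]),
    "pandya":  np.array([0.9, 0.3]),
    "parmar":  np.array([0.95, 0.25]),
    "cricket": np.array([0.3, 1.0]),
}
print(top_k_neighbors("panesar", embeddings))  # surname-like words rank first
```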
Knowledge-Augmented Embedding Solutions
Available Models with Entity Knowledge
1. Wikipedia2Vec
- Pros: Trained on Wikipedia with entity linking, knows factual relationships
- Cons: Complex setup, requires Wikipedia dump download
- Example: Would know "Monty Panesar" β "English cricketer"
2. BERT-Entity / LUKE
- Pros: Specifically designed for entity understanding
- Cons: Heavier model, requires entity recognition pipeline
- Example: Understands entity types and relationships
3. ConceptNet Numberbatch
- Pros: Combines word embeddings with knowledge graph
- Cons: Large download (several GB), complex integration
- Example: Knows factual relationships like "cricket player from England"
4. ERNIE (Enhanced Representation through kNowledge IntEgration)
- Pros: Integrates knowledge graphs during training
- Cons: Primarily Chinese focus, complex setup
- Example: Better entity-relationship understanding
5. KnowBERT
- Pros: BERT + Knowledge bases (WordNet, Wikipedia)
- Cons: Multiple components, heavy setup
- Example: Combines language understanding with encyclopedic knowledge
Practical Solutions for Our System
Option 1: Hybrid Approach (Recommended)
Keep current embeddings but augment with lightweight knowledge base:
```python
# Small knowledge file
entity_facts = {
    "panesar": {
        "type": "person",
        "domain": "cricket",
        "attributes": ["spinner", "england", "monty"],
        "clue_template": "English {domain} player known as {nickname}",
    },
    "rajouri": {
        "type": "place",
        "domain": "geography",
        "attributes": ["district", "kashmir", "border"],
        "clue_template": "{domain} district in disputed region",
    },
}

def generate_hybrid_clue(word):
    if word in entity_facts:
        return generate_factual_clue(word, entity_facts[word])
    return generate_semantic_neighbor_clue(word)
```
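The `generate_factual_clue` helper is left undefined above. One minimal sketch, assuming a hypothetical schema convention in which the last attribute doubles as a nickname (that convention is an assumption, not part of the current system):

```python
def generate_factual_clue(word, facts):
    """Fill an entity's clue template from its stored facts."""
    # Assumed convention: the last attribute doubles as a nickname.
    # str.format ignores unused keyword arguments, so templates may
    # use any subset of the fields provided here.
    nickname = facts["attributes"][-1].title()
    return facts["clue_template"].format(domain=facts["domain"], nickname=nickname)

facts = {
    "type": "person",
    "domain": "cricket",
    "attributes": ["spinner", "england", "monty"],
    "clue_template": "English {domain} player known as {nickname}",
}
print(generate_factual_clue("panesar", facts))  # English cricket player known as Monty
```

The unused `word` parameter is kept only to match the call signature shown in the hybrid-clue snippet.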
Option 2: Entity Type Classification
Use embedding clusters to identify entity types:
```python
# Pre-compute clusters of known entity types
person_cluster = words_near(["gandhi", "nehru", "shakespeare"])
place_cluster = words_near(["delhi", "mumbai", "london"])
sport_cluster = words_near(["cricket", "football", "tennis"])

# Classify and generate an appropriate clue
def classify_clue(word):
    if word in person_cluster and word in sport_cluster:
        return "Sports personality"
    elif word in place_cluster:
        return "Geographic location"
```
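The `words_near` helper above is left undefined. A minimal sketch, assuming a global embedding table and a cosine-similarity threshold against the seed words' centroid (the toy 2-dimensional vectors and the 0.9 threshold are illustrative assumptions):

```python
import numpy as np

# Toy embedding table standing in for the real 768-d vectors.
embeddings = {
    "delhi":   np.array([1.0, 0.1]),
    "mumbai":  np.array([0.9, 0.2]),
    "london":  np.array([0.95, 0.15]),
    "rajouri": np.array([0.85, 0.1]),
    "cricket": np.array([0.1, 1.0]),
}

def words_near(seeds, threshold=0.9):
    """Vocabulary words within `threshold` cosine similarity of the seeds' centroid."""
    centroid = np.mean([embeddings[s] for s in seeds], axis=0)
    near = set()
    for word, vec in embeddings.items():
        sim = float(np.dot(centroid, vec) /
                    (np.linalg.norm(centroid) * np.linalg.norm(vec)))
        if sim >= threshold:
            near.add(word)
    return near

place_cluster = words_near(["delhi", "mumbai", "london"])
# rajouri falls into the place cluster; cricket does not
```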
Option 3: Knowledge Graph from Co-occurrences
Build relationships from the training corpus:
```python
# Extract from embedding neighborhoods
def build_knowledge_graph():
    knowledge = {}
    for word in vocabulary:
        neighbors = get_semantic_neighbors(word)
        entry = knowledge.setdefault(word, {})
        # Identify patterns in the neighbor lists
        if any(n in cricket_terms for n in neighbors):
            entry["domain"] = "cricket"
        if any(n in place_names for n in neighbors):
            entry["type"] = "place"
    return knowledge
```
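A self-contained toy run of this extraction, with illustrative stand-ins for the names the snippet leaves undefined (`vocabulary`, `get_semantic_neighbors`, `cricket_terms`, `place_names` — all hypothetical data, not the real system's):

```python
# Illustrative stand-in data for the co-occurrence extraction.
cricket_terms = {"spinner", "bowler", "wicket"}
place_names = {"kashmir", "district", "punjab"}
vocabulary = ["panesar", "rajouri"]
neighbor_table = {
    "panesar": ["pandya", "spinner", "bowler"],
    "rajouri": ["raji", "kashmir", "district"],
}

def get_semantic_neighbors(word):
    return neighbor_table[word]

def build_knowledge_graph():
    knowledge = {}
    for word in vocabulary:
        neighbors = get_semantic_neighbors(word)
        entry = knowledge.setdefault(word, {})
        if any(n in cricket_terms for n in neighbors):
            entry["domain"] = "cricket"
        if any(n in place_names for n in neighbors):
            entry["type"] = "place"
    return knowledge

print(build_knowledge_graph())
# {'panesar': {'domain': 'cricket'}, 'rajouri': {'type': 'place'}}
```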
Implementation Recommendations
Phase 1: Immediate Improvement
- Add entity knowledge file for top 1000 words in vocabulary
- Implement hybrid clue generation (facts first, then neighbors)
- Better clue formatting (proper crossword style)
Phase 2: Enhanced System
- Entity type classification using embedding clustering
- Automated knowledge extraction from neighbor patterns
- Domain-specific clue templates
Phase 3: Advanced Solutions
- Evaluate Wikipedia2Vec for full factual embeddings
- Build comprehensive knowledge base for crossword entities
- Train custom embeddings on crossword-specific data
Current System Status
What Works
- ✓ Proper difficulty-based word selection (rare words for hard mode)
- ✓ Fast performance using existing embeddings
- ✓ Better than generic templates (slight improvement)
What Needs Improvement
- ✗ Clue quality still poor for domain-specific entities
- ✗ Phonetic similarity dominates factual relationships
- ✗ No understanding of entity types or attributes
Conclusion
The semantic neighbor approach revealed fundamental limitations of sentence transformers for entity-relationship understanding. While it's better than generic templates, it's insufficient for quality crossword clues.
The recommended path forward is a hybrid approach that augments current embeddings with a lightweight knowledge base, providing factual context for common crossword entities while maintaining system performance and simplicity.
Technical Notes
- Current model: sentence-transformers/all-mpnet-base-v2 (768 dimensions)
- Vocabulary size: ~30,000 words
- Performance impact: Semantic neighbor lookup adds ~50ms per word
- Storage requirements: Current approach uses existing embeddings (~500MB)
This analysis was conducted during the crossword generation optimization project, August 2025