# Embedding Limitations and Clue Generation Analysis

## Executive Summary

This document analyzes why our current semantic neighbor approach for crossword clue generation produces suboptimal results, explores the fundamental limitations of sentence transformers for entity relationships, and proposes practical solutions for better crossword clues.

## The Problem: Poor Quality Clues from Semantic Neighbors

### Current Clue Examples
```
PANESAR    → "Associated with pandya, parmar and pankaj"
RAJOURI    → "Associated with raji, rajini and rajni"
RAJPUTANA  → "Related to rajput (a member of the dominant hindu military caste...)"
DRAVIDA    → "Related to dravidian (a member of one of the aboriginal races...)"
TENDULKAR  → "Associated with ganguly, sachin and dravid"
```

### Why These Are Poor Crossword Clues

1. **PANESAR**: Semantic neighbors are just phonetically similar Indian names
2. **RAJPUTANA**: The clue contains "rajput" which is part of the answer
3. **Generic formatting**: "Associated with X, Y, Z" is not crossword-style
4. **Missing entity context**: No indication that PANESAR is a cricketer or that RAJOURI is a place

## Root Cause Analysis: Sentence Transformer Limitations

### The PANESAR Case Study

**Expected neighbors for a crossword:**
- cricket, england, spinner, bowler

**Actual neighbors from embeddings:**
```
PANESAR similarities:
  cricket   : 0.526 (moderate)
  england   : 0.264 (very low!)
  spinner   : 0.361 (low)
  bowler    : 0.476 (moderate)
  
  pandya    : 0.788 (very high!)
  parmar    : 0.731 (very high!)
  pankaj    : 0.702 (very high!)
  panaji    : 0.696 (very high!)
```
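The scores above are cosine similarities between single-word embedding vectors. A minimal sketch of the metric itself (with real embeddings the vectors would come from the sentence transformer, e.g. `model.encode(["panesar", "cricket"])`; the toy vectors below are purely illustrative):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for 768-d model embeddings
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # identical direction -> 1.0
```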

### Why This Happens: What Embeddings Actually Encode

Sentence transformers like `all-mpnet-base-v2` are trained to encode **sentence-level semantics**, not **entity relationships**. When extracting single-word embeddings, they capture:

#### ✅ What They Capture Well:
1. **Morphological similarity**: Words with similar spelling/phonetics
2. **Syntactic patterns**: How words are used grammatically 
3. **Distributional similarity**: Words appearing in similar sentence contexts

#### ❌ What They Miss:
1. **Encyclopedic knowledge**: "Panesar is a cricketer"
2. **Entity relationships**: "Panesar played for England"
3. **Factual attributes**: "Rajouri is in Kashmir"

### The 768-Dimensional Problem

For PANESAR, the embedding dimensions are encoding:
- **High weight**: "Sounds like an Indian surname" (pan- prefix pattern)
- **High weight**: "Appears with other Indian names in text"
- **Medium weight**: "Sometimes mentioned with cricket terms"
- **Low weight**: "Played for England team"

The model learned **surface patterns** rather than **semantic facts**.

## Training Data Distribution Effects

### Why Phonetic Similarity Dominates

The training corpus likely contained:
```
"Indian names like Pandya, Parmar, and Patel..." (frequent)
"Panesar and Pankaj are common surnames..." (frequent)

vs.

"Panesar bowled for England in the 2007 series..." (infrequent)
```

**Result**: Phonetic/cultural patterns get higher weight than factual relationships.

## Fundamental Issue: Wrong Type of Similarity

### What We Need vs What We Get

**For crosswords, we need:**
- PANESAR → cricketer, spinner, England-born
- RAJOURI → district, Kashmir, border region
- TENDULKAR → batsman, records, Mumbai

**What embeddings give us:**
- PANESAR → pandya, parmar (phonetic similarity)
- RAJOURI → raji, rajini (name pattern similarity)
- TENDULKAR → ganguly, dravid (co-occurrence similarity)

## Knowledge-Augmented Embedding Solutions

### Available Models with Entity Knowledge

#### 1. Wikipedia2Vec
- **Pros**: Trained on Wikipedia with entity linking, knows factual relationships
- **Cons**: Complex setup, requires Wikipedia dump download
- **Example**: Would know "Monty Panesar" → "English cricketer"

#### 2. BERT-Entity / LUKE
- **Pros**: Specifically designed for entity understanding
- **Cons**: Heavier model, requires entity recognition pipeline
- **Example**: Understands entity types and relationships

#### 3. ConceptNet Numberbatch
- **Pros**: Combines word embeddings with knowledge graph
- **Cons**: Large download (several GB), complex integration
- **Example**: Knows factual relationships like "cricket player from England"

#### 4. ERNIE (Enhanced Representation through kNowledge IntEgration)
- **Pros**: Integrates knowledge graphs during training
- **Cons**: Primarily Chinese focus, complex setup
- **Example**: Better entity-relationship understanding

#### 5. KnowBERT
- **Pros**: BERT + Knowledge bases (WordNet, Wikipedia)
- **Cons**: Multiple components, heavy setup
- **Example**: Combines language understanding with encyclopedic knowledge

## Practical Solutions for Our System

### Option 1: Hybrid Approach (Recommended)

Keep current embeddings but augment with lightweight knowledge base:

```python
# Small, hand-curated knowledge file for common crossword entities
entity_facts = {
    "panesar": {
        "type": "person",
        "domain": "cricket",
        "nickname": "monty",
        "attributes": ["spinner", "england", "monty"],
        "clue_template": "English {domain} player known as {nickname}"
    },
    "rajouri": {
        "type": "place",
        "domain": "geography",
        "attributes": ["district", "kashmir", "border"],
        "clue_template": "{domain} district in disputed region"
    }
}

def generate_factual_clue(word, facts):
    # Fill the entity's clue template from its known facts
    return facts["clue_template"].format(**facts)

def generate_hybrid_clue(word):
    # Prefer curated facts; fall back to the existing neighbor-based clue
    if word in entity_facts:
        return generate_factual_clue(word, entity_facts[word])
    return generate_semantic_neighbor_clue(word)
```

### Option 2: Entity Type Classification

Use embedding clusters to identify entity types:

```python
# Pre-compute clusters of words near hand-picked seed terms
# (words_near is a hypothetical helper over the embedding space)
person_cluster = words_near(["gandhi", "nehru", "shakespeare"])
place_cluster = words_near(["delhi", "mumbai", "london"])
sport_cluster = words_near(["cricket", "football", "tennis"])

# Classify and generate an appropriate clue stem
if word in person_cluster and word in sport_cluster:
    return "Sports personality"
elif word in place_cluster:
    return "Geographic location"
```
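A runnable sketch of a centroid-based variant of the same idea: average the seed words' vectors into a per-type centroid, then classify a word by nearest centroid under cosine similarity. The toy 2-D `embedding` table stands in for real 768-d model vectors and is purely illustrative:

```python
import numpy as np

# Toy embedding table; in practice these are 768-d sentence-transformer vectors
embedding = {
    "gandhi": np.array([1.0, 0.1]),
    "nehru":  np.array([0.9, 0.2]),
    "delhi":  np.array([0.1, 1.0]),
    "mumbai": np.array([0.2, 0.9]),
}

def centroid(words):
    # Mean vector of the seed words for one entity type
    return np.mean([embedding[w] for w in words], axis=0)

clusters = {
    "person": centroid(["gandhi", "nehru"]),
    "place":  centroid(["delhi", "mumbai"]),
}

def classify(vec):
    # Nearest centroid by cosine similarity
    def cos(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(clusters, key=lambda label: cos(vec, clusters[label]))

print(classify(np.array([0.95, 0.15])))  # lies near the person seeds -> "person"
```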

### Option 3: Knowledge Graph from Co-occurrences

Build relationships from the training corpus:

```python
from collections import defaultdict

# Extract coarse facts from embedding neighborhoods
def build_knowledge_graph():
    knowledge = defaultdict(dict)  # avoids KeyError on first write
    for word in vocabulary:
        neighbors = get_semantic_neighbors(word)

        # Identify domain/type patterns from curated term lists
        if any(n in cricket_terms for n in neighbors):
            knowledge[word]["domain"] = "cricket"
        if any(n in place_names for n in neighbors):
            knowledge[word]["type"] = "place"
    return knowledge
```

## Implementation Recommendations

### Phase 1: Immediate Improvement
1. **Add entity knowledge file** for top 1000 words in vocabulary
2. **Implement hybrid clue generation** (facts first, then neighbors)
3. **Better clue formatting** (proper crossword style)
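On the formatting point, one common crossword convention is to drop the "Associated with" boilerplate and append the enumeration (letter count) to the definition. A hypothetical helper (`format_clue` is not part of the current system):

```python
def format_clue(definition, answer):
    """Append the crossword enumeration, e.g. 'English spinner Monty (7)'."""
    return f"{definition} ({len(answer)})"

print(format_clue("English spinner Monty", "PANESAR"))  # -> English spinner Monty (7)
```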

### Phase 2: Enhanced System
1. **Entity type classification** using embedding clustering
2. **Automated knowledge extraction** from neighbor patterns
3. **Domain-specific clue templates** 

### Phase 3: Advanced Solutions
1. **Evaluate Wikipedia2Vec** for full factual embeddings
2. **Build comprehensive knowledge base** for crossword entities
3. **Train custom embeddings** on crossword-specific data

## Current System Status

### What Works
- ✅ Proper difficulty-based word selection (rare words for hard mode)
- ✅ Fast performance using existing embeddings
- ✅ Better than generic templates (slight improvement)

### What Needs Improvement
- ❌ Clue quality still poor for domain-specific entities
- ❌ Phonetic similarity dominates factual relationships
- ❌ No understanding of entity types or attributes

## Conclusion

The semantic neighbor approach revealed fundamental limitations of sentence transformers for entity-relationship understanding. While it's better than generic templates, it's insufficient for quality crossword clues.

The recommended path forward is a **hybrid approach** that augments current embeddings with a lightweight knowledge base, providing factual context for common crossword entities while maintaining system performance and simplicity.

## Technical Notes

- **Current model**: `sentence-transformers/all-mpnet-base-v2` (768 dimensions)
- **Vocabulary size**: ~30,000 words
- **Performance impact**: Semantic neighbor lookup adds ~50ms per word
- **Storage requirements**: Current approach uses existing embeddings (~500MB)

---

*This analysis was conducted during the crossword generation optimization project, August 2025*