---
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- static-embedding
- chess
- retrieval
- exploratory
datasets:
- Lichess/chess-puzzles
- Lichess/chess-openings
---

# Chess Static Embedding (v4-C2): Open Exploration

A 4M-parameter `StaticEmbedding` model for chess content retrieval, plus the
full **open-science methodology document** describing what we tried, what
worked, what failed, and why.

This repo is **exploratory experimental work**, published as-is. The model is
genuinely useful (NDCG@10 = 0.12 on a compositional held-out eval, 50× smaller
than typical retrieval encoders), but the bigger contribution is the
**methodology narrative** below, particularly the *LLM-bridge* and
*deterministic-bridge* findings.

---

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oneryalcin/static-embedding-chess")
query = "fork endgame short"
docs = [
    "themes crushing endgame fork short opening Sicilian Defense moves f2g3 e6e7",
    "themes mate mateIn1 oneMove opening Caro-Kann moves d2d4 e7e5",
]
sims = model.encode(query) @ model.encode(docs).T
```

Static embedding: lookup table + average. Sub-millisecond CPU inference. No GPU
required.
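
To make "lookup table + average" concrete, here is a minimal pure-Python sketch of what a static embedding does at inference time. The vocabulary, the 2-dimensional table, and the `embed` helper are illustrative assumptions, not the model's actual weights or the library's implementation.

```python
# Toy vocabulary and embedding table (illustrative values, not the real weights).
vocab = {"fork": 0, "endgame": 1, "short": 2}
table = [
    [1.0, 0.0],  # fork
    [0.0, 1.0],  # endgame
    [1.0, 1.0],  # short
]

def embed(text):
    """Look up each known token's vector and average them; unknown tokens are skipped."""
    rows = [table[vocab[tok]] for tok in text.split() if tok in vocab]
    if not rows:
        return [0.0] * len(table[0])
    return [sum(col) / len(rows) for col in zip(*rows)]

query_vec = embed("fork endgame short")
```

Because there is no attention or other token interaction, the cost per query is one table lookup per token plus a mean, which is why CPU inference is so cheap.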

---

## Headline result

| Variant | NDCG@10 | vs v3 baseline |
|---------|---------|---------------|
| v3 baseline (random init + MNRL) | 0.0801 | n/a |
| v4-A hard-neg only | 0.1000 | +25% |
| v4-B theme distill only | 0.0112 | -86% (regression; see methodology) |
| v4-C multitask 500× | 0.1154 | +44% |
| **v4-C2 multitask 5000× (this model)** | **0.1202** | **+50%** |

Held-out eval: 200 unseen anchor combinations × 600-doc corpus. This tests
compositional generalization: the model never saw these exact theme combinations
during training, only the individual tokens in other combos.
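
For readers unfamiliar with the metric, this is a minimal sketch of NDCG@10 under the assumption of binary relevance (a document is either relevant to the query or not). The function names are hypothetical; the repo's eval scripts may compute it differently.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1), ranks starting at 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the ranking divided by the best achievable DCG."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k(["a", "b", "c"], {"a"})   # relevant doc ranked first
second = ndcg_at_k(["x", "a", "c"], {"a"})    # relevant doc ranked second
missed = ndcg_at_k(["x", "y", "z"], {"a"})    # relevant doc not retrieved
```

A query whose relevant documents all fall outside the top 10 scores exactly 0, which is how a median of 0 can coexist with a nonzero mean.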

For **production-ready** chess search, see the **two-stage architecture** below
(static + BM25 over English-bridged docs) that delivers NDCG@10 = 0.59-0.87.

---

## What's in this repo

```
model.safetensors                     # 4M-param StaticEmbedding weights (~9MB)
chess_tokenizer.json                  # WordLevel chess tokenizer (4,336 tokens)
tokenizer.json                        # Same, in HF format for ST loading
config_sentence_transformers.json     # Module config
modules.json                          # Module pipeline

data/
├── theme_definitions.parquet         # 73 chess themes + LLM-generated English defs + MPNet embeddings (the LLM-bridge teacher signal)
├── hard_negatives_chess.parquet      # 1.6M (anchor, positive, negative) triplets, chess-token format
└── hard_negatives_english.parquet    # Same, English-bridged via deterministic conversion

scripts/
├── train_chess_static.py             # Main training entrypoint (multi-version, env-flag controlled)
├── train_chess_multitask.py          # The v4-C2 winning recipe (theme distill + hard-neg MNRL)
├── convert_to_english.py             # Deterministic chess→English (no LLM needed; python-chess + regex)
├── mine_hard_negs_v2.py              # Memory-bounded custom hard-negative miner
├── generate_theme_defs.py            # LLM-bridge: DeepSeek-v4-flash writes chess concept definitions
├── compare_variants.py               # Side-by-side eval framework across all variants
└── diag_ce_vs_bm25.py                # The critical "is your CE really helping" diagnostic
```

---

## Methodology: the full experimental journey

This was 36+ hours of iterative exploration. The model is the small visible
output; the methodology is the bigger contribution.

### 1. Problem and approach

**Task:** Free-text search over a chess puzzle corpus. User types something
like `"fork endgame short"` and gets matching Lichess puzzles.

**Why static embedding:** Tom Aarsen's
[static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1)
showed StaticEmbedding can be a useful retrieval primitive with the right
training. We adapted the recipe for a chess-specific domain with a custom
WordLevel tokenizer so chess tokens (UCI moves, theme names, ECO codes) are
first-class.

**Data:** Lichess/chess-puzzles (5.8M puzzles, CC0) + Lichess/chess-openings
(3.6K openings, CC0).

### 2. Eval design: the hardest part

**Initial mistake:** First eval used top-200 most-common theme strings as
queries. The model had seen each of these ~50,000 times in training. Baseline
NDCG@10 was inflated to 0.81 by lexical overlap before any training. Useless.

**Fixed eval (used throughout):** *Compositional held-out anchors*. Pick 200
theme-combination strings that appear exactly 3 times in the data
(rare-but-multi-relevant), remove all matching pairs from train, use those rare
combos as queries. Tests whether the model can compose meaning from individual
theme tokens it learned, without having seen the specific combination.

This is harsh (the model can never "memorize" the eval queries), and that's
the point. Random-init baseline drops to NDCG@10 ≈ 0.01.
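
The split described above can be sketched as follows. This is a simplified illustration, assuming the data is a list of `(theme_combo, document)` pairs; `split_compositional` is a hypothetical helper, and taking the first 200 combos in sorted order is an assumption made for determinism (the actual selection may have been sampled).

```python
from collections import Counter

def split_compositional(pairs, target_count=3, max_queries=200):
    """Hold out theme combos that occur exactly `target_count` times; drop them from train."""
    counts = Counter(combo for combo, _ in pairs)
    rare = sorted(c for c, n in counts.items() if n == target_count)[:max_queries]
    held_out = set(rare)
    train = [(c, d) for c, d in pairs if c not in held_out]
    eval_pairs = [(c, d) for c, d in pairs if c in held_out]
    return train, eval_pairs

# Toy data: "fork endgame" occurs exactly 3 times, so it becomes an eval query.
pairs = [
    ("fork endgame", "doc1"), ("fork endgame", "doc2"), ("fork endgame", "doc3"),
    ("pin", "doc4"), ("pin", "doc5"),
    ("mate short", "doc6"),
]
train, eval_pairs = split_compositional(pairs)
```

The individual tokens of a held-out combo can still appear in other training combos; only the exact combination is withheld.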

### 3. Phase 1: diagnostic of the v3 model (0.08 NDCG@10)

A working baseline existed. Question: **why isn't it better?**

Token-similarity probe revealed the core issue:

| Pair | v3 cosine similarity |
|---|---|
| `fork` ↔ `pin` | +0.01 |
| `fork` ↔ `skewer` | -0.12 |
| `endgame` ↔ `middlegame` | -0.30 |

**Token embeddings were essentially orthogonal.** The model learned per-token
mappings to chess-content clusters but no relationships *between* tokens.
Compositional generalization (the eval task) requires those relationships.

Also discovered: 51% of held-out queries returned zero relevant documents in
the top 10 (median NDCG@10 = 0), a bimodal failure pattern.

Also discovered: the model beat BM25 over chess-format docs by 7.5× (0.08 vs
0.01), confirming it does real semantic work beyond keyword matching.
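
A token-similarity probe of this kind needs nothing more than cosine similarity between rows of the embedding table. The sketch below uses made-up 3-dimensional vectors purely for illustration; `cosine` and `probe` are hypothetical helpers, and the real probe would read vectors from the trained model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def probe(token_vectors, token_pairs):
    """Return cosine similarity for each (token_a, token_b) pair."""
    return {(a, b): cosine(token_vectors[a], token_vectors[b]) for a, b in token_pairs}

# Illustrative vectors only: nearly orthogonal, like the v3 tokens in the table above.
token_vectors = {
    "fork": [1.0, 0.0, 0.0],
    "pin": [0.0, 1.0, 0.0],
    "skewer": [0.0, 0.0, 1.0],
}
similarities = probe(token_vectors, [("fork", "pin"), ("fork", "skewer")])
```

Values near zero across related concepts indicate that tokens were learned independently, with no shared structure for the model to compose.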

### 4. Phase 2: distillation from raw MPNet (DEAD END)

Hypothesis: distill student token embeddings to match teacher (MPNet)
embeddings. The teacher knows English, so it should know that `fork ≈ pin`.

**Result:** REGRESSION. Why? **MPNet itself scores NDCG@10 = 0.0094 on our
eval.** 95.5% of queries get zero in top-10. MPNet doesn't know chess: UCI
moves are character soup to its WordPiece tokenizer.

**You can't distill what the teacher doesn't know.** This was the first key
lesson.

### 5. Phase 3: LLM-bridge for theme distillation (BREAKTHROUGH)

Key insight: an LLM can read both chess (in camelCase) AND English. Use it as
a **translator** to put chess concepts into language MPNet *can* understand
semantically.

**Steps:**

1. DeepSeek-v4-flash writes English definitions for 73 Lichess themes:
   - `fork` → "A tactical motif where a single piece attacks two or more
     enemy pieces simultaneously, forcing a material gain."
2. MPNet embeds the *English definitions* (it knows English fluently).
3. Distill the student's per-token embedding to match the definition embedding.

After step 2 alone, MPNet's `fork ↔ skewer` similarity jumps from 0.39 (raw
camelCase) to **0.87** (via definitions). Real semantic structure.

Combined with hard-negative MNRL training (v4-C2): **NDCG@10 = 0.1202**, +50%
over v3.

Cost: 73 themes × DeepSeek API ≈ $0.01 + ~1 minute generation.

This is the **LLM-bridge** pattern: when system A doesn't speak system B's
language, use an LLM as a translator. The LLM is one-shot work, not part of
inference.
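
Step 3 (distilling a student token embedding toward the teacher's definition embedding) can be illustrated with a toy gradient-descent loop. This is a sketch under stated assumptions: it uses mean squared error as the objective and assumes student and teacher vectors share a dimensionality. The exact form of the repo's `EmbedDistillLoss` is not described here, so treat `mse` and `distill_step` as hypothetical stand-ins.

```python
def mse(student, teacher):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def distill_step(student, teacher, lr=0.1):
    """One gradient-descent step on the MSE, moving the student toward the teacher."""
    n = len(student)
    return [s - lr * 2.0 * (s - t) / n for s, t in zip(student, teacher)]

student = [0.0, 0.0, 0.0]     # freshly initialised token embedding (toy values)
teacher = [0.3, -0.1, 0.8]    # embedding of the theme's English definition (toy values)

loss_before = mse(student, teacher)
for _ in range(50):
    student = distill_step(student, teacher)
loss_after = mse(student, teacher)
```

Because two themes with similar definitions have nearby teacher vectors, pulling each student token toward its own target also pulls related tokens toward each other.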

### 6. Phase 4: hard-negative mining

Used the v3 model to mine confusable documents per anchor. We wrote a custom
memory-bounded miner because the sentence-transformers built-in miner runs out
of memory on an M4 at 327k unique anchors × 327k positives. See
`scripts/mine_hard_negs_v2.py`.

1.6M triplets mined. Positive-negative margin: 0.135 mean (good signal for
training).
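
The memory-bounded idea is to never materialise the full anchor × document similarity matrix: score the corpus one chunk at a time and keep only the current top-k candidates per anchor. The following is a simplified pure-Python sketch of that idea, not the contents of `scripts/mine_hard_negs_v2.py`; the function name, the dot-product scoring, and the chunking scheme are assumptions.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mine_hard_negatives(anchor_vec, doc_vecs, positive_ids, k=2, chunk_size=1000):
    """Return ids of the k highest-scoring non-positive docs, scanning the corpus in chunks."""
    best = []  # min-heap of (score, doc_id) holding at most k entries
    for start in range(0, len(doc_vecs), chunk_size):
        chunk = doc_vecs[start:start + chunk_size]  # only this slice is scored at once
        for offset, vec in enumerate(chunk):
            doc_id = start + offset
            if doc_id in positive_ids:
                continue  # a known positive can never be a negative
            score = dot(anchor_vec, vec)
            if len(best) < k:
                heapq.heappush(best, (score, doc_id))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, doc_id))
    return [doc_id for _, doc_id in sorted(best, reverse=True)]

doc_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
hard_negs = mine_hard_negatives([1.0, 0.0], doc_vecs, positive_ids={0}, k=2, chunk_size=2)
```

Peak memory is bounded by one chunk plus k heap entries per anchor, regardless of corpus size.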

### 7. Phase 5: multi-task training (v4-C2 winner)

Multi-dataset trainer combining:
- **Chess triplets** (1.6M, MNRL loss): teaches content associations
- **Theme distillation** (73 themes × 5000 replicas via `EmbedDistillLoss`):
  injects semantic structure between tokens

With proportional sampling, theme tokens see ~500 gradient updates per epoch
(via replication) vs chess pairs once. Theme distillation oversampling matters:

| Theme replicas | NDCG@10 |
|---|---|
| 500× | 0.1154 |
| 5000× | 0.1202 |
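
As a rough worked example of why the replica count matters: if datasets are sampled in proportion to their row counts (an assumption about how proportional sampling behaves here), the share of training rows that come from theme distillation follows directly from the figures above.

```python
N_TRIPLETS = 1_600_000  # chess triplets, from the list above
N_THEMES = 73           # Lichess themes with LLM-written definitions

def theme_share(replicas):
    """Fraction of all training rows that are theme-distillation rows."""
    theme_rows = N_THEMES * replicas
    return theme_rows / (theme_rows + N_TRIPLETS)

share_500 = theme_share(500)    # 36,500 theme rows
share_5000 = theme_share(5000)  # 365,000 theme rows
```

Under that assumption, theme rows make up about 2% of the data at 500 replicas and about 19% at 5000 replicas.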

### 8. Phase 6: cross-encoder reranker attempts (ALL FAILED)

Tried three variants:
- MS-MARCO MiniLM (English-pretrained, 22M params) on chess-format docs
- Same, with theme echo stripped from training docs
- Fresh-init tiny BERT (5M params) with our chess tokenizer

**All regressed below static-only.** Diagnosis: trained CEs operate at
random-ordering level on the eval. Inspection of training predictions showed
the trained CE got pair-ordering wrong 2/3 of the time on sample inputs.

**Root cause:** documents are UCI move sequences (`f2g3 e6e7 ...`). To
English-pretrained CE tokenizers these are character fragments with no
meaningful representation. The CE can't learn what makes a "fork-y" move
sequence from sparse labels alone. Static embedding worked because token-bag
averaging is sample-efficient (each `fork` token gets gradients from many
examples → converges to a useful cluster); the CE's pair-level processing
needs more signal than our data provides.

### 9. Phase 7: deterministic English bridge for documents (REVEALED THE TRUTH)

Insight: we don't need an LLM to translate documents either. `python-chess`
deterministically converts UCI → SAN with board context (`f2g3` → `Bxg3`).
A regex decamelizes themes (`backRankMate` → `back rank mate`). Free, instant,
reproducible. The `convert_to_english.py` script converts the full 5.8M-puzzle
corpus in ~3 minutes.
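
The regex half of the bridge is small enough to sketch in a few lines. This is an illustrative version using only the standard library; the `decamelize` name and the exact pattern are assumptions, and `scripts/convert_to_english.py` may differ (the UCI → SAN half needs a board position and `python-chess`, so it is not shown).

```python
import re

# Insert a space at lowercase->uppercase/digit and digit->letter boundaries.
_BOUNDARY = re.compile(r"(?<=[a-z])(?=[A-Z0-9])|(?<=[0-9])(?=[A-Za-z])")

def decamelize(theme):
    """Turn a camelCase theme tag into lowercase space-separated words."""
    return _BOUNDARY.sub(" ", theme).lower()

converted = [decamelize(t) for t in ["backRankMate", "mateIn1", "oneMove", "fork"]]
```

After conversion, a query word such as `mate` lexically matches every theme containing it, which is exactly what a keyword scorer needs.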

Re-ran reranker training on English-bridged docs. **Untrained MS-MARCO CE hit
the oracle ceiling (0.5947 at top-100).** Massive jump.

But: ran a final diagnostic comparing trained CE vs **BM25** over the same
English docs. They were *identical*:

| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | **0.5947** | **0.5947** | 0.5947 |
| 200 | 0.1202 | 0.7706 | 0.7706 | 0.7706 |
| 300 | 0.1202 | 0.8718 | 0.8718 | 0.8718 |

The "LLM-bridge effect" we observed was **lexical match enabled by the
English conversion**, not semantic CE understanding. BM25 over English docs
does the same job.
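
To see why plain lexical matching is enough once documents contain English theme words, here is a minimal pure-Python BM25 scorer. It is a sketch under assumptions: it uses the non-negative `log(1 + ...)` IDF variant and default `k1`/`b` values chosen for illustration, so its scores will not match `rank_bm25` exactly, and the two documents are made-up examples.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenised document against the query with Okapi-style BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: in how many documents each token occurs.
    df = Counter(tok for doc in docs_tokens for tok in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for tok in query_tokens:
            if tok not in tf:
                continue
            idf = math.log(1 + (n_docs - df[tok] + 0.5) / (df[tok] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / (tf[tok] + length_norm)
        scores.append(score)
    return scores

docs = [
    "short endgame puzzle with fork".split(),
    "mate in 1 puzzle".split(),
]
scores = bm25_scores("fork endgame".split(), docs)
```

The first document scores above zero because it literally contains `fork` and `endgame`; the second scores zero. No learned representation is involved.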

**Stress test**: stripped theme tokens from English docs too. This forces the
CE to genuinely understand "fork query ↔ fork-pattern moves":

| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | 0.0726 | 0.4327 | 0.5947 |
| 300 | 0.1202 | 0.0706 | 0.6252 | 0.8718 |

CE drops below static (negative transfer: it memorized "theme overlap = match"
during training and can't generalize). BM25 still partially works via opening
name overlap.

**True semantic CE chess understanding is not achievable** with 22M-param
English-pretrained models on our training signal.

---

## Production recommendation, and a surprising honest finding

**The static embedding model is not needed for this task.** A direct comparison:

| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
|---|---|
| Static (v4-C2) alone | 0.1202 |
| BM25 alone over chess-format docs | 0.0107 |
| **BM25 alone over English-bridged docs** | **1.0000** |
| Static + BM25 RRF fusion | 0.4940 |

**BM25 over deterministically-English-converted documents achieves PERFECT
ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training.
No GPU.

Why: our queries are theme tokens (`fork endgame`), and the English-bridged
docs explicitly contain those words (`"Short endgame puzzle with fork..."`).
This is BM25's natural strength: keyword overlap detection. The static model
labors to learn token-cluster mappings; BM25 just reads the words directly.
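
For reference, the RRF (reciprocal rank fusion) row in the table combines two rankings by summing `1 / (k + rank)` per document. The sketch below assumes the conventional constant `k = 60`, which may not be the value used for the reported number; `rrf_fuse` is a hypothetical helper.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc ids; a higher summed reciprocal rank comes first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], str(d)))

static_ranking = ["a", "b", "c"]
bm25_ranking = ["b", "c", "a"]
fused = rrf_fuse([static_ranking, bm25_ranking])
```

Fusion gives each input ranking equal weight, so when one ranker is already perfect and the other is weak, the fused list can only be pulled away from the perfect order. That is consistent with fusion scoring below BM25 alone in the table.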

### Actual production architecture (the simple answer)

```python
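from rank_bm25 import BM25Okapi

# One-time: convert all puzzles to English (use scripts/convert_to_english.py).
# `corpus` holds the converted documents; the two entries here are placeholders.
corpus = [
    "short endgame puzzle with fork opening Sicilian Defense moves Bxg3 e7",
    "mate in 1 puzzle opening Caro-Kann Defense moves d4 e5",
]
corpus_ids = list(range(len(corpus)))

# Build the BM25 index over the English-converted corpus (lowercased so that
# queries match regardless of capitalisation)
bm25 = BM25Okapi([english_doc.lower().split() for english_doc in corpus])

# Query
query = "fork endgame short"  # or any theme combo / opening name
# get_top_n returns entries of its second argument, best match first
top_indices = bm25.get_top_n(query.lower().split(), corpus_ids, n=10)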
```

**Total: <10ms/query, $0 cost, no model, no GPU, no training.**

### When the static embedding would actually help

1. **Natural-language paraphrased queries**: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.**
2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers.
3. **Very large corpora**: where BM25 index size becomes an issue, embeddings are more storage-efficient per doc.

For our actual eval setup (theme-token queries on Lichess puzzles), the static
model loses by 8× to BM25-over-English-bridged. The static training exercise
produced valuable methodology insights (especially the LLM-bridge pattern) but
was the wrong tool for the actual production problem.

---

## Key learnings worth keeping (general, not chess-specific)

1. **Eval methodology dominates.** Most time spent debugging the "model isn't
   improving" turned out to be eval issues, not training issues. Compositional
   held-out > top-frequent-string eval. Strip lexical leakage between query
   and corpus when testing generalization.

2. **Sentence-transformers' `NoDuplicatesBatchSampler` is O(epoch-progress)
   per batch.** It walks a linked-list of deferred conflicts. For datasets
   with limited unique anchors (our ~327k anchors over 5.8M pairs), this
   creates monotonic step-time blowup. Switch to `BatchSamplers.BATCH_SAMPLER`.

3. **`CachedMultipleNegativesRankingLoss` is incompatible with
   `StaticEmbedding`** and raises an explicit error. Token-bag has no
   transformer activations to GradCache through.

4. **Trackio crashes on first checkpoint push** with sentence-transformers
   due to an empty `router_mapping` struct that pyarrow can't write. Use
   `report_to="none"`.

5. **The "LLM-bridge" pattern**: when system A speaks language X and system
   B speaks language Y, use an LLM to translate B→X once (not at inference).
   For chess: LLM writes English definitions of themes → general English
   teacher can now embed them → distill into chess-specific model.

6. **Deterministic translation often suffices** for the bridge. Don't pay LLM
   API costs if `python-chess` and regex can produce the same English text.
   Reserve LLMs for the parts that genuinely need understanding (concept
   definitions, paraphrases, strategic narratives).

7. **Compare your trained model against BM25** on the actual eval. If they
   tie, your model is doing keyword matching, not semantic work. Diagnostic
   in `scripts/diag_ce_vs_bm25.py`.

8. **Modal `.spawn()` only survives entrypoint exit on deployed apps.** For
   ephemeral `modal run`, the app dies when the entrypoint returns, including
   spawned calls. Use `.remote()` with `--detach`.

9. **Apple Silicon M4 is competitive with cloud A100** for tiny models. Token
   bag + small batch easily hits 17 it/s on MPS. GPU cost is wasted unless
   the model is compute-bound.

---

## Reproducibility

Clone this repo, then with sentence-transformers v5.5+:

```bash
# Inspect the recipe
cat scripts/train_chess_multitask.py

# Reproduce the data prep (one-time, ~10 min)
python scripts/generate_theme_defs.py        # Needs DeepSeek API key in macOS keychain
python scripts/convert_to_english.py         # python-chess + regex, $0
python scripts/mine_hard_negs_v2.py          # ~10 min on M4 MPS

# Reproduce the winning training
python scripts/train_chess_multitask.py      # ~5 min on M4 MPS

# Verify
python scripts/compare_variants.py           # Side-by-side eval table
python scripts/diag_ce_vs_bm25.py            # Is the rerank doing real work?
```

---

## Limitations and honest caveats

- **NDCG@10 = 0.12 is modest in absolute terms.** Industry retrieval encoders
  reach 0.4-0.6 on similar tasks. This model is competitive on size/speed,
  not absolute quality.
- **The two-stage architecture (NDCG@10 ≈ 0.6) is the production answer**
  but relies on BM25 over English-converted docs, not on the cross-encoder.
- **Cross-encoder didn't add semantic value** in our setup; results came from
  lexical match enabled by the English bridge.
- **Bimodal failure**: even the best model misses half of queries entirely
  (median NDCG@10 = 0). The architecture has fundamental limits for chess
  reasoning.
- **English-pretrained models don't know chess.** Tried MPNet, MiniLM,
  Jina-v5; all fail on UCI moves. Bigger English models won't fix this; only
  chess-pretrained or deterministic conversion helps.
- **No engine evaluation.** "Is this puzzle a fork?" was determined by
  Lichess theme tags; we never ran a chess engine. A real production system
  would integrate Stockfish for ground-truth tactical pattern detection.

---

## What this is NOT

- Not a chess engine. See [`thomasahle/fastchess`](https://github.com/thomasahle/fastchess)
  for FastText-based move prediction (closest related work).
- Not a position similarity model. See `chess2vec` lineage on GitHub for
  position-level embeddings.
- Not a state-of-the-art retrieval model. It's a tiny first-stage filter
  designed to pair with a reranker.

---

## License

Apache 2.0 (model + scripts). Data is derived from Lichess/chess-puzzles, which
is CC0; derived parquets in this repo are also released under CC0.

## Acknowledgments

- [Lichess](https://lichess.org) for releasing puzzles + openings under CC0.
- [Tom Aarsen](https://huggingface.co/tomaarsen) for the
  `train-sentence-transformers` skill and `StaticEmbedding` recipe.
- DeepSeek for the v4-flash API used for theme definitions.

## Citation

If this work is useful, please link to this repo. The scientific findings
(particularly the deterministic-bridge insight that BM25 over English-bridged
docs equals a trained cross-encoder for this task) are the main contribution.