---
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- static-embedding
- chess
- retrieval
- exploratory
datasets:
- Lichess/chess-puzzles
- Lichess/chess-openings
---
# Chess Static Embedding (v4-C2) — Open Exploration
A 4M-parameter `StaticEmbedding` model for chess content retrieval, plus the
full **open-science methodology document** describing what we tried, what
worked, what failed, and why.
This repo is **exploratory experimental work**, published as-is. The model is
genuinely useful (NDCG@10 = 0.12 on a compositional held-out eval, 50× smaller
than typical retrieval encoders) but the bigger contribution is the
**methodology narrative** below — particularly the *LLM-bridge* and
*deterministic-bridge* findings.
---
## Quick start
```python
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("oneryalcin/static-embedding-chess")
query = "fork endgame short"
docs = [
"themes crushing endgame fork short opening Sicilian Defense moves f2g3 e6e7",
"themes mate mateIn1 oneMove opening Caro-Kann moves d2d4 e7e5",
]
sims = model.encode(query) @ model.encode(docs).T
```
Static embedding: lookup table + average. Sub-millisecond CPU inference. No GPU
required.
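
To make "lookup table + average" concrete, here is a minimal pure-Python sketch of what a static embedding does at inference. The toy vocabulary, the 3-dimensional vectors, and the `embed` helper are assumptions for illustration only; they are not the model's weights or the sentence-transformers implementation.

```python
# Toy static embedding: one fixed vector per token, mean-pooled per text.
# TABLE and its 3-d vectors are made up for illustration.
TABLE = {
    "fork":    [1.0, 0.0, 0.0],
    "endgame": [0.0, 1.0, 0.0],
    "short":   [0.0, 0.0, 1.0],
}

def embed(text):
    """Average the vectors of all known whitespace-separated tokens."""
    vectors = [TABLE[tok] for tok in text.split() if tok in TABLE]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def dot(a, b):
    """Plain dot product, the same operation as `@` in the quick start."""
    return sum(x * y for x, y in zip(a, b))

query_vec = embed("fork endgame")
doc_vec = embed("endgame fork short")
score = dot(query_vec, doc_vec)
```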
---
## Headline result
| Variant | NDCG@10 | vs random init |
|---------|---------|---------------|
| v3 baseline (random init + MNRL) | 0.0801 | — |
| v4-A hard-neg only | 0.1000 | +25% |
| v4-B theme distill only | 0.0112 | -86% (regression — see methodology) |
| v4-C multitask 500× | 0.1154 | +44% |
| **v4-C2 multitask 5000× (this model)** | **0.1202** | **+50%** |
Held-out eval: 200 unseen anchor combinations × 600-doc corpus. Compositional
generalization — the model never saw these exact theme combinations during
training, only the individual tokens in other combos.
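
For readers unfamiliar with the metric, the following is a minimal sketch of NDCG@10 for a single query, assuming binary relevance (a document either matches the query's theme combination or it does not). The function names are illustrative, and the repo's eval scripts may compute the metric differently.

```python
import math

def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k gains (rank 1 has discount 1)."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for one query."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_ids]
    # Ideal ranking: all relevant docs first, capped at k positions.
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy query with one relevant puzzle, retrieved at rank 2.
example = ndcg_at_k(["p7", "p1", "p9"], {"p1"})
```

The reported numbers would then be the mean of this value over the 200 held-out queries.
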
For **production-ready** chess search, see the **two-stage architecture** below
(static first stage + BM25 over English-bridged docs), which reaches
NDCG@10 = 0.59-0.87 depending on candidate depth (K = 100-300).
---
## What's in this repo
```
model.safetensors # 4M-param StaticEmbedding weights (~9MB)
chess_tokenizer.json # WordLevel chess tokenizer (4,336 tokens)
tokenizer.json # Same, in HF format for ST loading
config_sentence_transformers.json # Module config
modules.json # Module pipeline
data/
├── theme_definitions.parquet # 73 chess themes + LLM-generated English defs + MPNet embeddings (the LLM-bridge teacher signal)
├── hard_negatives_chess.parquet # 1.6M (anchor, positive, negative) triplets, chess-token format
└── hard_negatives_english.parquet # Same, English-bridged via deterministic conversion
scripts/
├── train_chess_static.py # Main training entrypoint (multi-version, env-flag controlled)
├── train_chess_multitask.py # The v4-C2 winning recipe (theme distill + hard-neg MNRL)
├── convert_to_english.py # Deterministic chess→English (no LLM needed; python-chess + regex)
├── mine_hard_negs_v2.py # Memory-bounded custom hard-negative miner
├── generate_theme_defs.py # LLM-bridge: DeepSeek-v4-flash writes chess concept definitions
├── compare_variants.py # Side-by-side eval framework across all variants
└── diag_ce_vs_bm25.py # The critical "is your CE really helping" diagnostic
```
---
## Methodology — the full experimental journey
This was 36+ hours of iterative exploration. The model is the small visible
output; the methodology is the bigger contribution.
### 1. Problem and approach
**Task:** Free-text search over a chess puzzle corpus. User types something
like `"fork endgame short"` and gets matching Lichess puzzles.
**Why static embedding:** Tom Aarsen's
[static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1)
showed StaticEmbedding can be a useful retrieval primitive with the right
training. We adapted the recipe for a chess-specific domain with a custom
WordLevel tokenizer so chess tokens (UCI moves, theme names, ECO codes) are
first-class.
**Data:** Lichess/chess-puzzles (5.8M puzzles, CC0) + Lichess/chess-openings
(3.6K openings, CC0).
### 2. Eval design — the hardest part
**Initial mistake:** First eval used top-200 most-common theme strings as
queries. The model had seen each of these ~50,000 times in training. Baseline
NDCG@10 was inflated to 0.81 by lexical overlap before any training. Useless.
**Fixed eval (used throughout):** *Compositional held-out anchors*. Pick 200
theme-combination strings that appear exactly 3 times in the data
(rare-but-multi-relevant), remove all matching pairs from train, use those rare
combos as queries. Tests whether the model can compose meaning from individual
theme tokens it learned, without having seen the specific combination.
This is harsh — the model can never "memorize" the eval queries — and that's
the point. Random-init baseline drops to NDCG@10 ≈ 0.01.
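
The split described above can be sketched as follows. This is a simplified illustration, assuming training data is a list of `(theme_combo, document)` pairs; the function name is hypothetical, and taking the first 200 combos in sorted order is an assumption, since the text does not say how the 200 were chosen.

```python
from collections import Counter

def split_compositional(pairs, target_count=3, max_queries=200):
    """Hold out theme combos that occur exactly `target_count` times.

    pairs: list of (theme_combo, document) tuples.
    Returns (train_pairs, heldout_pairs).
    """
    counts = Counter(combo for combo, _ in pairs)
    # Sorted only to make the selection deterministic (an assumption).
    rare = sorted(c for c, n in counts.items() if n == target_count)[:max_queries]
    held = set(rare)
    train = [p for p in pairs if p[0] not in held]      # combo never seen in training
    heldout = [p for p in pairs if p[0] in held]        # combo becomes an eval query
    return train, heldout

toy_pairs = (
    [("fork endgame", f"d{i}") for i in range(3)]   # exactly 3 occurrences: held out
    + [("mate", f"m{i}") for i in range(4)]         # too frequent: stays in train
    + [("pin short", "s0")]                         # too rare: stays in train
)
train_pairs, heldout_pairs = split_compositional(toy_pairs)
```
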
### 3. Phase 1 — diagnostic of the v3 model (0.08 NDCG@10)
A working baseline existed. Question: **why isn't it better?**
Token-similarity probe revealed the core issue:
| Pair | v3 cosine similarity |
|---|---|
| `fork` ↔ `pin` | +0.01 |
| `fork` ↔ `skewer` | -0.12 |
| `endgame` ↔ `middlegame` | -0.30 |
**Token embeddings were essentially orthogonal.** The model learned per-token
mappings to chess-content clusters but no relationships *between* tokens.
Compositional generalization (the eval task) requires those relationships.
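
A probe of this kind can be sketched in a few lines. The helper names and the toy 2-d vectors are assumptions for illustration; with the real model, one plausible way to obtain per-token vectors is to encode single-token strings, since mean pooling over one token returns that token's embedding.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def probe(token_vectors, pairs):
    """Return {(token_a, token_b): cosine} for each requested pair."""
    return {(a, b): cosine(token_vectors[a], token_vectors[b]) for a, b in pairs}

# Toy vectors, not the model's weights.
toy_vectors = {"fork": [1.0, 0.0], "pin": [0.0, 1.0], "skewer": [1.0, 1.0]}
sims = probe(toy_vectors, [("fork", "pin"), ("fork", "skewer")])
```
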
Also discovered: 51% of held-out queries returned zero relevant in top-10
(median NDCG@10 = 0). Bimodal failure pattern.
Also discovered: model beat BM25 by 7.5× (0.08 vs 0.01), confirming it does
real semantic work beyond keyword match.
### 4. Phase 2 — distillation from raw MPNet (DEAD END)
Hypothesis: distill student token embeddings to match teacher (MPNet)
embeddings. Teacher knows English; should know that `fork ≈ pin`.
**Result:** REGRESSION. Why? **MPNet itself scores NDCG@10 = 0.0094 on our
eval.** 95.5% of queries get zero in top-10. MPNet doesn't know chess: UCI
moves are character soup to its WordPiece tokenizer.
**You can't distill what the teacher doesn't know.** This was the first key
lesson.
### 5. Phase 3 — LLM-bridge for theme distillation (BREAKTHROUGH)
Key insight: an LLM can read both chess (in camelCase) AND English. Use it as
a **translator** to put chess concepts into language MPNet *can* understand
semantically.
**Steps:**
1. DeepSeek-v4-flash writes English definitions for 73 Lichess themes:
- `fork` → "A tactical motif where a single piece attacks two or more
enemy pieces simultaneously, forcing a material gain."
2. MPNet embeds the *English definitions* (it knows English fluently).
3. Distill the student's per-token embedding to match the definition embedding.
After step 2 alone, MPNet's `fork ↔ skewer` similarity jumps from 0.39 (raw
camelCase) to **0.87** (via definitions). Real semantic structure.
Combined with hard-negative MNRL training (v4-C2): **NDCG@10 = 0.1202**, +50%
over v3.
Cost: 73 themes × DeepSeek API ≈ $0.01 + ~1 minute generation.
This is the **LLM-bridge** pattern: when system A doesn't speak system B's
language, use an LLM as a translator. The LLM is one-shot work, not part of
inference.
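
Step 3 can be illustrated with a toy gradient-descent loop that pulls one student token vector toward its teacher vector. This is a sketch under stated assumptions: it uses plain MSE, assumes student and teacher share a dimension, and uses made-up 3-d vectors. The loss, projection, and optimizer actually used in the repo's training scripts are not reproduced here.

```python
def mse(student, teacher):
    """Mean squared error between two equal-length vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def distill_step(student, teacher, lr=0.1):
    """One gradient-descent step on MSE(student, teacher) with respect to student."""
    n = len(student)
    return [s - lr * 2.0 * (s - t) / n for s, t in zip(student, teacher)]

teacher = [0.2, -0.4, 0.9]  # stand-in for the embedding of an English definition
student = [0.0, 0.0, 0.0]   # stand-in for the student's token vector

loss_before = mse(student, teacher)
for _ in range(200):
    student = distill_step(student, teacher)
loss_after = mse(student, teacher)
```

Because the teacher vectors for related themes are close to each other, pulling each student token toward its own teacher vector also pulls related student tokens toward each other.
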
### 6. Phase 4 — hard-negative mining
Used the v3 model to mine confusable documents per anchor. Custom
memory-bounded miner because the sentence-transformers built-in OOMs on M4 at
327k unique anchors × 327k positives. See `scripts/mine_hard_negs_v2.py`.
1.6M triplets mined. Positive-negative margin: 0.135 mean (good signal for
training).
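
The memory-bounding idea is to score anchors against the corpus one chunk at a time, so the full anchors × documents similarity matrix never exists at once. The sketch below illustrates only that idea in pure Python; it is not the logic of `scripts/mine_hard_negs_v2.py`, and the function names, the single-positive-per-anchor assumption, and the toy vectors are all illustrative.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mine_hard_negatives(anchors, docs, positives, chunk_size=2, n_neg=1):
    """Yield (anchor_idx, positive_idx, negative_idx) triplets.

    anchors, docs: lists of vectors. positives[i]: index of anchor i's positive doc.
    Only `chunk_size` rows of anchor-vs-doc scores are held in memory at a time.
    """
    for start in range(0, len(anchors), chunk_size):
        chunk = anchors[start:start + chunk_size]
        # Score matrix for this chunk only: len(chunk) x len(docs).
        scores = [[dot(a, d) for d in docs] for a in chunk]
        for offset, row in enumerate(scores):
            i = start + offset
            ranked = sorted(range(len(docs)), key=row.__getitem__, reverse=True)
            # Hard negatives: the highest-scoring docs that are not the positive.
            for j in [j for j in ranked if j != positives[i]][:n_neg]:
                yield i, positives[i], j

toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
toy_docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
triplets = list(mine_hard_negatives(toy_anchors, toy_docs, positives=[0, 2]))
```
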
### 7. Phase 5 — multi-task training (v4-C2 winner)
Multi-dataset trainer combining:
- **Chess triplets** (1.6M, MNRL loss): teaches content associations
- **Theme distillation** (73 themes × 5000 replicas via `EmbedDistillLoss`):
injects semantic structure between tokens
With proportional sampling, each theme definition is seen once per replica per
epoch (~500 times at 500× replication), whereas each chess triplet is seen once.
Oversampling the theme distillation data matters:
| Theme replicas | NDCG@10 |
|---|---|
| 500× | 0.1154 |
| 5000× | 0.1202 |
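
As a worked example of what the replica count changes, the arithmetic below estimates the share of training samples that come from theme distillation. It assumes proportional sampling draws from each dataset in proportion to its size and treats the 1.6M triplet count from the text as exact.

```python
N_TRIPLETS = 1_600_000  # chess triplets (rounded figure from the text)
N_THEMES = 73           # theme definitions

def theme_share(replicas):
    """Fraction of all training samples that are theme-distillation samples."""
    theme_samples = N_THEMES * replicas
    return theme_samples / (theme_samples + N_TRIPLETS)

share_500 = theme_share(500)    # 36,500 theme samples
share_5000 = theme_share(5000)  # 365,000 theme samples
```

Under these assumptions, theme distillation accounts for roughly 2% of samples at 500× and roughly 19% at 5000×.
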
### 8. Phase 6 — cross-encoder reranker attempts (ALL FAILED)
Tried three variants:
- MS-MARCO MiniLM (English-pretrained, 22M params) on chess-format docs
- Same, with theme echo stripped from training docs
- Fresh-init tiny BERT (5M params) with our chess tokenizer
**All regressed below static-only.** Diagnosis: trained CEs operate at
random-ordering level on the eval. Inspection of training predictions showed
the trained CE got pair-ordering wrong 2/3 of the time on sample inputs.
**Root cause:** documents are UCI move sequences (`f2g3 e6e7 ...`). To
English-pretrained CE tokenizers these are character fragments with no
meaningful representation. The CE can't learn what makes a "fork-y" move
sequence from sparse labels alone. Static embedding worked because token-bag
averaging is sample-efficient (each `fork` token gets gradients from many
examples → converges to a useful cluster); the CE's pair-level processing is
hungrier for signal not available in our data.
### 9. Phase 7 — deterministic English bridge for documents (REVEALED THE TRUTH)
Insight: we don't need an LLM to translate documents either. `python-chess`
deterministically converts UCI → SAN with board context (`f2g3` → `Bxg3`).
Regex decamelizes themes (`backRankMate` → `back rank mate`). Free, instant,
reproducible. The `convert_to_english.py` script does the full 5.8M corpus in
~3 minutes.
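
A minimal sketch of the two conversions follows. The function names and the exact regex are assumptions and may differ from `convert_to_english.py`. `uci_to_san` needs the third-party `python-chess` package, which is imported lazily so the theme helper works without it.

```python
import re

def decamelize(theme):
    """Split a camelCase theme tag into lowercase words."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", theme).lower()

def uci_to_san(fen, uci_moves):
    """Convert UCI moves to SAN by replaying them from the position `fen`."""
    import chess  # python-chess, imported here so decamelize() has no dependency

    board = chess.Board(fen)
    san_moves = []
    for uci in uci_moves:
        move = chess.Move.from_uci(uci)
        san_moves.append(board.san(move))  # SAN depends on the current position
        board.push(move)
    return san_moves

theme_text = decamelize("backRankMate")
```

The board replay is what makes the conversion context-dependent: the same UCI string can map to different SAN strings depending on which piece stands on the origin square and whether the move captures.
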
Re-ran reranker training on English-bridged docs. **Untrained MS-MARCO CE hit
the oracle ceiling (0.5947 at top-100).** Massive jump.
But: ran a final diagnostic comparing trained CE vs **BM25** over the same
English docs. They were *identical*:
| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | **0.5947** | **0.5947** | 0.5947 |
| 200 | 0.1202 | 0.7706 | 0.7706 | 0.7706 |
| 300 | 0.1202 | 0.8718 | 0.8718 | 0.8718 |
The "LLM-bridge effect" we observed was **lexical match enabled by the
English conversion**, not semantic CE understanding. BM25 over English docs
does the same job.
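
The Oracle column can be read as the best NDCG@10 any reranker could reach given the first-stage top-K candidates. The sketch below assumes that interpretation, binary relevance, and unique candidate IDs; the function name is illustrative.

```python
import math

def oracle_ndcg_at_10(candidates, relevant_ids):
    """Best NDCG@10 achievable by reordering `candidates` (the first-stage top-K)."""
    hits = sum(1 for doc_id in candidates if doc_id in relevant_ids)
    # An oracle puts every retrieved relevant doc at the top of the list.
    best_dcg = sum(1.0 / math.log2(rank + 2) for rank in range(min(hits, 10)))
    ideal_dcg = sum(
        1.0 / math.log2(rank + 2) for rank in range(min(len(relevant_ids), 10))
    )
    return best_dcg / ideal_dcg if ideal_dcg > 0 else 0.0

# Toy case: 3 relevant docs exist, but the first stage retrieved only 2 of them.
ceiling = oracle_ndcg_at_10(["x", "a", "y", "b"], {"a", "b", "c"})
```

A reranker that matches this ceiling at every K is ordering the retrieved candidates as well as possible, so further gains can only come from better first-stage recall.
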
**Stress test**: stripped theme tokens from English docs too. Forces the CE
to genuinely understand "fork query ↔ fork-pattern moves":
| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | 0.0726 | 0.4327 | 0.5947 |
| 300 | 0.1202 | 0.0706 | 0.6252 | 0.8718 |
CE drops below static (negative transfer — memorized "theme overlap = match"
during training; can't generalize). BM25 still partially works via opening
name overlap.
**We could not get true semantic chess understanding out of a CE** using
22M-param English-pretrained models and our training signal.
---
## Production recommendation — and a surprising honest finding
**The static embedding model is not needed for this task.** A direct comparison:
| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
|---|---|
| Static (v4-C2) alone | 0.1202 |
| BM25 alone over chess-format docs | 0.0107 |
| **BM25 alone over English-bridged docs** | **1.0000** |
| Static + BM25 RRF fusion | 0.4940 |
**BM25 over deterministically-English-converted documents achieves PERFECT
ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training.
No GPU.
Why: our queries are theme tokens (`fork endgame`), and the English-bridged
docs explicitly contain those words (`"Short endgame puzzle with fork..."`).
This is BM25's natural strength — keyword overlap detection. The static model
labors to learn token-cluster mappings; BM25 just reads the words directly.
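
The "Static + BM25 RRF fusion" row in the table above combines two rankings by reciprocal rank fusion. A minimal sketch, assuming the commonly used constant `k = 60` (the constant behind the reported 0.4940 is not stated) and toy puzzle IDs:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

static_ranking = ["p3", "p1", "p2"]  # toy output of the static model
bm25_ranking = ["p1", "p2", "p3"]    # toy output of BM25
fused = rrf_fuse([static_ranking, bm25_ranking])
```

Fusion weights both rankers equally, which is one plausible reason a perfect BM25 ranking is degraded when fused with a much weaker one.
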
### Actual production architecture (the simple answer)
```python
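from rank_bm25 import BM25Okapi

# One-time: convert all puzzles to English (use scripts/convert_to_english.py).
# `corpus` is assumed to be the resulting list of English document strings.
# Build BM25 index over the English-converted corpus
# (lowercased on both sides, since BM25 matches tokens exactly).
bm25 = BM25Okapi([english_doc.lower().split() for english_doc in corpus])

# Query
query = "fork endgame short"  # or any theme combo / opening name
# get_top_n returns items from the list it is given, so pass positions to get indices.
top_indices = bm25.get_top_n(query.lower().split(), list(range(len(corpus))), n=10)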
```
**Total: <10ms/query, $0 cost, no model, no GPU, no training.**
### When the static embedding would actually help
1. **Natural-language paraphrased queries**: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.**
2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers.
3. **Very large corpora**, where BM25 index size becomes an issue and embeddings can be more storage-efficient per doc.
For our actual eval setup (theme-token queries on Lichess puzzles), the static
model loses by 8× to BM25-over-English-bridged. The static training exercise
produced valuable methodology insights (especially the LLM-bridge pattern) but
was the wrong tool for the actual production problem.
---
## Key learnings worth keeping (general, not chess-specific)
1. **Eval methodology dominates.** Most of the time spent debugging "the model
isn't improving" went to eval issues, not training issues. A compositional
held-out eval beats a top-frequent-string eval. Strip lexical leakage between
query and corpus when testing generalization.
2. **Sentence-transformers' `NoDuplicatesBatchSampler` is O(epoch-progress)
per batch.** It walks a linked-list of deferred conflicts. For datasets
with limited unique anchors (our ~327k anchors over 5.8M pairs), this
creates monotonic step-time blowup. Switch to `BatchSamplers.BATCH_SAMPLER`.
3. **`CachedMultipleNegativesRankingLoss` is incompatible with
`StaticEmbedding`** — explicit error. Token-bag has no transformer
activations to GradCache through.
4. **Trackio crashes on first checkpoint push** with sentence-transformers
due to an empty `router_mapping` struct that pyarrow can't write. Use
`report_to="none"`.
5. **The "LLM-bridge" pattern**: when system A speaks language X and system
B speaks language Y, use an LLM to translate B→X once (not at inference).
For chess: LLM writes English definitions of themes → general English
teacher can now embed them → distill into chess-specific model.
6. **Deterministic translation often suffices** for the bridge. Don't pay LLM
API costs if `python-chess` and regex can produce the same English text.
Reserve LLMs for the parts that genuinely need understanding (concept
definitions, paraphrases, strategic narratives).
7. **Compare your trained model against BM25** on the actual eval. If they
tie, your model is doing keyword matching, not semantic work. Diagnostic
in `scripts/diag_ce_vs_bm25.py`.
8. **Modal `.spawn()` only survives entrypoint exit on deployed apps.** For
ephemeral `modal run`, the app dies when entrypoint returns — including
spawned calls. Use `.remote()` with `--detach`.
9. **Apple Silicon M4 is competitive with cloud A100** for tiny models. Token
bag + small batch easily hits 17 it/s on MPS. GPU cost is wasted unless
the model is compute-bound.
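
Learnings 2 and 4 translate into training arguments roughly as follows. This is a configuration sketch, not the repo's actual settings: the output path is hypothetical, and it assumes a sentence-transformers version that exposes these enums.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers, MultiDatasetBatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="runs/chess-static",  # hypothetical path
    # Learning 2: plain batch sampler instead of the no-duplicates sampler.
    batch_sampler=BatchSamplers.BATCH_SAMPLER,
    # Draw from each dataset in proportion to its size (multi-task training).
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.PROPORTIONAL,
    # Learning 4: disable experiment-tracker reporting.
    report_to="none",
)
```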
---
## Reproducibility
Clone this repo, then with sentence-transformers v5.5+:
```bash
# Inspect the recipe
cat scripts/train_chess_multitask.py
# Reproduce the data prep (one-time, ~10 min)
python scripts/generate_theme_defs.py # Needs DeepSeek API key in macOS keychain
python scripts/convert_to_english.py # python-chess + regex, $0
python scripts/mine_hard_negs_v2.py # ~10 min on M4 MPS
# Reproduce the winning training
python scripts/train_chess_multitask.py # ~5 min on M4 MPS
# Verify
python scripts/compare_variants.py # Side-by-side eval table
python scripts/diag_ce_vs_bm25.py # Is the rerank doing real work?
```
---
## Limitations and honest caveats
- **NDCG@10 = 0.12 is modest in absolute terms.** Industry retrieval encoders
reach 0.4-0.6 on similar tasks. This model is competitive on size/speed,
not absolute quality.
- **The two-stage architecture (NDCG@10 ≈ 0.6) is the production answer**
but relies on BM25 over English-converted docs, not on the cross-encoder.
- **Cross-encoder didn't add semantic value** in our setup; results came from
lexical match enabled by the English bridge.
- **Bimodal failure**: even the best model misses half of queries entirely
(median NDCG@10 = 0). The architecture has fundamental limits for chess
reasoning.
- **English-pretrained models don't know chess.** Tried MPNet, MiniLM,
Jina-v5; all fail on UCI moves. Bigger English models won't fix this; only
chess-pretrained or deterministic conversion helps.
- **No engine evaluation.** "Is this puzzle a fork?" was determined by
Lichess theme tags; we never ran a chess engine. A real production system
would integrate Stockfish for ground-truth tactical pattern detection.
---
## What this is NOT
- Not a chess engine. See [`thomasahle/fastchess`](https://github.com/thomasahle/fastchess)
for FastText-based move prediction (closest related work).
- Not a position similarity model. See `chess2vec` lineage on GitHub for
position-level embeddings.
- Not a state-of-the-art retrieval model. It's a tiny first-stage filter
designed to pair with a reranker.
---
## License
Apache 2.0 (model + scripts). Data derived from Lichess/chess-puzzles which is
CC0 — derived parquets in this repo are also released under CC0.
## Acknowledgments
- [Lichess](https://lichess.org) for releasing puzzles + openings under CC0.
- [Tom Aarsen](https://huggingface.co/tomaarsen) for the
`train-sentence-transformers` skill and `StaticEmbedding` recipe.
- DeepSeek for the v4-flash API used for theme definitions.
## Citation
If this work is useful, please link to this repo. The scientific findings
(particularly the deterministic-bridge insight that BM25 over English-bridged
docs equals a trained cross-encoder for this task) are the main contribution.