| language: en | |
| license: apache-2.0 | |
| library_name: sentence-transformers | |
| pipeline_tag: sentence-similarity | |
| tags: | |
| - sentence-transformers | |
| - static-embedding | |
| - chess | |
| - retrieval | |
| - exploratory | |
| datasets: | |
| - Lichess/chess-puzzles | |
| - Lichess/chess-openings | |
| # Chess Static Embedding (v4-C2) — Open Exploration | |
| A 4M-parameter `StaticEmbedding` model for chess content retrieval, plus the | |
| full **open-science methodology document** describing what we tried, what | |
| worked, what failed, and why. | |
| This repo is **exploratory experimental work**, published as-is. The model is | |
| genuinely useful (NDCG@10 = 0.12 on a compositional held-out eval, 50× smaller | |
| than typical retrieval encoders) but the bigger contribution is the | |
| **methodology narrative** below — particularly the *LLM-bridge* and | |
| *deterministic-bridge* findings. | |
| --- | |
| ## Quick start | |
| ```python | |
| from sentence_transformers import SentenceTransformer | |
| model = SentenceTransformer("oneryalcin/static-embedding-chess") | |
| query = "fork endgame short" | |
| docs = [ | |
|     "themes crushing endgame fork short opening Sicilian Defense moves f2g3 e6e7", | |
|     "themes mate mateIn1 oneMove opening Caro-Kann moves d2d4 e7e5", | |
| ] | |
| # Dot-product score of the query against each doc (shape: (2,)); higher is closer. | |
| sims = model.encode(query) @ model.encode(docs).T | |
| ``` | |
| Static embedding: lookup table + average. Sub-millisecond CPU inference. No GPU | |
| required. | |
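To make "lookup table + average" concrete, here is a minimal sketch in plain Python. The three-token vocabulary and the 3-dimensional vectors are invented for illustration; the real model uses the 4,336-token chess tokenizer and learned weights.

```python
# Toy illustration of "lookup table + average". The vocabulary and vectors
# below are made up; they are not the model's real weights.
DIM = 3
EMBEDDING_TABLE = {
    "fork": [1.0, 0.0, 2.0],
    "endgame": [0.0, 1.0, 0.0],
    "short": [2.0, 2.0, 1.0],
}


def embed(text: str) -> list[float]:
    """Average the vectors of all known whitespace-separated tokens."""
    vectors = [EMBEDDING_TABLE[tok] for tok in text.split() if tok in EMBEDDING_TABLE]
    if not vectors:
        return [0.0] * DIM  # no known tokens: fall back to a zero vector
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(DIM)]


query_vector = embed("fork endgame short")
```

There is no attention and no positional information: token order is ignored, which is why inference needs only a table lookup and a mean.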
| --- | |
| ## Headline result | |
| | Variant | NDCG@10 | vs random init | | |
| |---------|---------|---------------| | |
| | v3 baseline (random init + MNRL) | 0.0801 | — | | |
| | v4-A hard-neg only | 0.1000 | +25% | | |
| | v4-B theme distill only | 0.0112 | -86% (regression — see methodology) | | |
| | v4-C multitask 500× | 0.1154 | +44% | | |
| | **v4-C2 multitask 5000× (this model)** | **0.1202** | **+50%** | | |
| Held-out eval: 200 unseen anchor combinations × 600-doc corpus. Compositional | |
| generalization — the model never saw these exact theme combinations during | |
| training, only the individual tokens in other combos. | |
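For readers unfamiliar with the metric, the sketch below computes NDCG@10 with binary relevance and the standard log2 discount. This is an assumption about the metric's form; the evaluation code in `scripts/compare_variants.py` may differ in details.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(rank + 2)  # rank is 0-based, so the top slot gets 1/log2(2) = 1
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

A query whose relevant documents all miss the top 10 scores exactly 0, which is how a median of 0 can coexist with a nonzero mean.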
| For **production-ready** chess search, see the **two-stage architecture** below | |
| (static + BM25 over English-bridged docs) that delivers NDCG@10 = 0.59-0.87. | |
| --- | |
| ## What's in this repo | |
| ``` | |
| model.safetensors # 4M-param StaticEmbedding weights (~9MB) | |
| chess_tokenizer.json # WordLevel chess tokenizer (4,336 tokens) | |
| tokenizer.json # Same, in HF format for ST loading | |
| config_sentence_transformers.json # Module config | |
| modules.json # Module pipeline | |
| data/ | |
| ├── theme_definitions.parquet # 73 chess themes + LLM-generated English defs + MPNet embeddings (the LLM-bridge teacher signal) | |
| ├── hard_negatives_chess.parquet # 1.6M (anchor, positive, negative) triplets, chess-token format | |
| └── hard_negatives_english.parquet # Same, English-bridged via deterministic conversion | |
| scripts/ | |
| ├── train_chess_static.py # Main training entrypoint (multi-version, env-flag controlled) | |
| ├── train_chess_multitask.py # The v4-C2 winning recipe (theme distill + hard-neg MNRL) | |
| ├── convert_to_english.py # Deterministic chess→English (no LLM needed; python-chess + regex) | |
| ├── mine_hard_negs_v2.py # Memory-bounded custom hard-negative miner | |
| ├── generate_theme_defs.py # LLM-bridge: DeepSeek-v4-flash writes chess concept definitions | |
| ├── compare_variants.py # Side-by-side eval framework across all variants | |
| └── diag_ce_vs_bm25.py # The critical "is your CE really helping" diagnostic | |
| ``` | |
| --- | |
| ## Methodology — the full experimental journey | |
| This was 36+ hours of iterative exploration. The model is the small visible | |
| output; the methodology is the bigger contribution. | |
| ### 1. Problem and approach | |
| **Task:** Free-text search over a chess puzzle corpus. User types something | |
| like `"fork endgame short"` and gets matching Lichess puzzles. | |
| **Why static embedding:** Tom Aarsen's | |
| [static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1) | |
| showed StaticEmbedding can be a useful retrieval primitive with the right | |
| training. We adapted the recipe for a chess-specific domain with a custom | |
| WordLevel tokenizer so chess tokens (UCI moves, theme names, ECO codes) are | |
| first-class. | |
| **Data:** Lichess/chess-puzzles (5.8M puzzles, CC0) + Lichess/chess-openings | |
| (3.6K openings, CC0). | |
| ### 2. Eval design — the hardest part | |
| **Initial mistake:** First eval used top-200 most-common theme strings as | |
| queries. The model had seen each of these ~50,000 times in training. Baseline | |
| NDCG@10 was inflated to 0.81 by lexical overlap before any training. Useless. | |
| **Fixed eval (used throughout):** *Compositional held-out anchors*. Pick 200 | |
| theme-combination strings that appear exactly 3 times in the data | |
| (rare-but-multi-relevant), remove all matching pairs from train, use those rare | |
| combos as queries. Tests whether the model can compose meaning from individual | |
| theme tokens it learned, without having seen the specific combination. | |
| This is harsh — the model can never "memorize" the eval queries — and that's | |
| the point. Random-init baseline drops to NDCG@10 ≈ 0.01. | |
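The split described above can be sketched as follows. The function name, the random choice among candidates, and the `(anchor, doc)` pair layout are assumptions for illustration, not the repo's actual code.

```python
import random
from collections import Counter


def split_compositional(pairs, target_count=3, n_heldout=200, seed=0):
    """Split (anchor, doc) pairs into train pairs and held-out eval queries.

    Anchors that occur exactly `target_count` times are candidates. Up to
    `n_heldout` of them become eval queries, and every pair that uses one of
    those anchors is removed from the training set.
    """
    counts = Counter(anchor for anchor, _ in pairs)
    candidates = sorted(a for a, c in counts.items() if c == target_count)
    random.Random(seed).shuffle(candidates)
    heldout = set(candidates[:n_heldout])

    train, eval_qrels = [], {}
    for anchor, doc in pairs:
        if anchor in heldout:
            eval_qrels.setdefault(anchor, []).append(doc)  # relevant docs per query
        else:
            train.append((anchor, doc))
    return train, eval_qrels
```

With 200 held-out anchors of exactly 3 documents each, the relevant set totals 600 documents, matching the corpus size quoted in the headline result.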
| ### 3. Phase 1 — diagnostic of the v3 model (0.08 NDCG@10) | |
| A working baseline existed. Question: **why isn't it better?** | |
| Token-similarity probe revealed the core issue: | |
| | Pair | v3 cosine similarity | | |
| |---|---| | |
| | `fork` ↔ `pin` | +0.01 | | |
| | `fork` ↔ `skewer` | -0.12 | | |
| | `endgame` ↔ `middlegame` | -0.30 | | |
| **Token embeddings were essentially orthogonal.** The model learned per-token | |
| mappings to chess-content clusters but no relationships *between* tokens. | |
| Compositional generalization (the eval task) requires those relationships. | |
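A token-similarity probe of this kind needs only cosine similarity between rows of the embedding table. The sketch below uses made-up 2-dimensional vectors; in practice the vectors would be read from the trained model's weights.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def probe(token_vectors, pairs):
    """Return {(token_a, token_b): cosine} for each requested token pair."""
    return {(a, b): cosine(token_vectors[a], token_vectors[b]) for a, b in pairs}


# Made-up vectors: "fork" and "pin" are orthogonal, "fork" and "skewer" are parallel.
token_vectors = {"fork": [1.0, 0.0], "pin": [0.0, 1.0], "skewer": [2.0, 0.0]}
report = probe(token_vectors, [("fork", "pin"), ("fork", "skewer")])
```

Values near 0 for tokens that a chess player would consider related are the symptom reported in the table above.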
| Two further findings: 51% of held-out queries returned zero relevant documents | |
| in the top 10 (median NDCG@10 = 0), a bimodal failure pattern; and the model | |
| beat BM25 by 7.5× (0.08 vs 0.01), confirming it does real semantic work beyond | |
| keyword matching. | |
| ### 4. Phase 2 — distillation from raw MPNet (DEAD END) | |
| Hypothesis: distill student token embeddings to match teacher (MPNet) | |
| embeddings. Teacher knows English; should know that `fork ≈ pin`. | |
| **Result:** REGRESSION. Why? **MPNet itself scores NDCG@10 = 0.0094 on our | |
| eval.** 95.5% of queries get zero relevant documents in the top 10. MPNet | |
| doesn't know chess: UCI moves are character soup to its WordPiece tokenizer. | |
| **You can't distill what the teacher doesn't know.** This was the first key | |
| lesson. | |
| ### 5. Phase 3 — LLM-bridge for theme distillation (BREAKTHROUGH) | |
| Key insight: an LLM can read both chess (in camelCase) AND English. Use it as | |
| a **translator** to put chess concepts into language MPNet *can* understand | |
| semantically. | |
| **Steps:** | |
| 1. DeepSeek-v4-flash writes English definitions for 73 Lichess themes: | |
| - `fork` → "A tactical motif where a single piece attacks two or more | |
| enemy pieces simultaneously, forcing a material gain." | |
| 2. MPNet embeds the *English definitions* (it knows English fluently). | |
| 3. Distill the student's per-token embedding to match the definition embedding. | |
| After step 2 alone, MPNet's `fork ↔ skewer` similarity jumps from 0.39 (raw | |
| camelCase) to **0.87** (via definitions). Real semantic structure. | |
| Combined with hard-negative MNRL training (v4-C2): **NDCG@10 = 0.1202**, +50% | |
| over v3. | |
| Cost: 73 themes × DeepSeek API ≈ $0.01 + ~1 minute generation. | |
| This is the **LLM-bridge** pattern: when system A doesn't speak system B's | |
| language, use an LLM as a translator. The LLM is one-shot work, not part of | |
| inference. | |
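Step 3 of the list above (pulling a student token vector toward the teacher's definition embedding) can be sketched as gradient descent on a squared-error loss. This is an assumption: the repo's `EmbedDistillLoss` may use a different objective (for example cosine distance) or a projection between dimensions. The vectors are made up and share one dimensionality for simplicity.

```python
def squared_error(student, teacher):
    """Loss ||student - teacher||^2."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher))


def distill_step(student, teacher, lr=0.1):
    """One gradient-descent step; the gradient of the loss is 2 * (student - teacher)."""
    return [s - lr * 2.0 * (s - t) for s, t in zip(student, teacher)]


teacher_fork = [0.8, 0.1, 0.3]  # stand-in for the teacher's embedding of the "fork" definition
student_fork = [0.0, 0.0, 0.0]  # stand-in for the student's initial "fork" token vector

history = []
for _ in range(20):
    history.append(squared_error(student_fork, teacher_fork))
    student_fork = distill_step(student_fork, teacher_fork)
final_error = squared_error(student_fork, teacher_fork)
```

Because every theme token is pulled toward a teacher vector from the same English embedding space, tokens with similar definitions end up close to each other, which is the structure the v3 probe found missing.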
| ### 6. Phase 4 — hard-negative mining | |
| Used the v3 model to mine confusable documents per anchor. Custom | |
| memory-bounded miner because the sentence-transformers built-in OOMs on M4 at | |
| 327k unique anchors × 327k positives. See `scripts/mine_hard_negs_v2.py`. | |
| 1.6M triplets mined. Positive-negative margin: 0.135 mean (good signal for | |
| training). | |
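The idea behind a memory-bounded miner can be sketched as below: score one chunk of anchors at a time instead of materializing the full anchor × document score matrix. This is an illustrative pure-Python version with hypothetical names, not the code in `scripts/mine_hard_negs_v2.py`; a tensor implementation would score each chunk with a single matrix multiply.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(anchor_vecs, doc_vecs, positives, chunk_size=2):
    """Return {anchor_id: highest-scoring doc_id that is not a known positive}.

    `positives` maps anchor_id -> set of doc_ids that must never be chosen.
    Only the scores for the current chunk exist at any time, which bounds
    peak memory regardless of how many anchors there are.
    """
    anchor_ids = list(anchor_vecs)
    mined = {}
    for start in range(0, len(anchor_ids), chunk_size):
        chunk = anchor_ids[start:start + chunk_size]
        chunk_scores = {
            anchor_id: {
                doc_id: dot(anchor_vecs[anchor_id], vec)
                for doc_id, vec in doc_vecs.items()
                if doc_id not in positives.get(anchor_id, set())
            }
            for anchor_id in chunk
        }
        for anchor_id, scores in chunk_scores.items():
            if scores:
                mined[anchor_id] = max(scores, key=scores.get)
        # chunk_scores is rebuilt on the next iteration, releasing this chunk
    return mined
```

The mined negative is the document the current model confuses most with the positive, so the margin between the two is the training signal mentioned above.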
| ### 7. Phase 5 — multi-task training (v4-C2 winner) | |
| Multi-dataset trainer combining: | |
| - **Chess triplets** (1.6M, MNRL loss): teaches content associations | |
| - **Theme distillation** (73 themes × 5000 replicas via `EmbedDistillLoss`): | |
| injects semantic structure between tokens | |
| With proportional sampling, each theme receives ~500 gradient updates per epoch | |
| (via replication), while each chess pair is seen once. The amount of theme | |
| oversampling matters: | |
| | Theme replicas | NDCG@10 | | |
| |---|---| | |
| | 500× | 0.1154 | | |
| | 5000× | 0.1202 | | |
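To see what replication does to the training mix, the arithmetic below estimates the share of rows that come from the theme-distillation dataset. It assumes proportional sampling draws rows in proportion to dataset size and uses the rounded 1.6M triplet count, so the percentages are approximate.

```python
N_TRIPLETS = 1_600_000  # approximate number of chess triplets quoted above
N_THEMES = 73


def theme_share(replicas: int) -> float:
    """Fraction of all training rows that belong to the theme-distillation dataset."""
    theme_rows = N_THEMES * replicas
    return theme_rows / (theme_rows + N_TRIPLETS)


share_500 = theme_share(500)    # 36,500 theme rows
share_5000 = theme_share(5000)  # 365,000 theme rows
```

Under these assumptions the theme task goes from roughly 2% of rows at 500× to roughly 19% at 5000×.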
| ### 8. Phase 6 — cross-encoder reranker attempts (ALL FAILED) | |
| Tried three variants: | |
| - MS-MARCO MiniLM (English-pretrained, 22M params) on chess-format docs | |
| - Same, with theme echo stripped from training docs | |
| - Fresh-init tiny BERT (5M params) with our chess tokenizer | |
| **All regressed below static-only.** Diagnosis: trained CEs operate at | |
| random-ordering level on the eval. Inspection of training predictions showed | |
| the trained CE got pair-ordering wrong 2/3 of the time on sample inputs. | |
| **Root cause:** documents are UCI move sequences (`f2g3 e6e7 ...`). To | |
| English-pretrained CE tokenizers these are character fragments with no | |
| meaningful representation. The CE can't learn what makes a "fork-y" move | |
| sequence from sparse labels alone. Static embedding worked because token-bag | |
| averaging is sample-efficient (each `fork` token gets gradients from many | |
| examples → converges to a useful cluster); the CE's pair-level processing is | |
| hungrier for signal not available in our data. | |
| ### 9. Phase 7 — deterministic English bridge for documents (REVEALED THE TRUTH) | |
| Insight: we don't need an LLM to translate documents either. `python-chess` | |
| deterministically converts UCI → SAN with board context (`f2g3` → `Bxg3`). | |
| Regex decamelizes themes (`backRankMate` → `back rank mate`). Free, instant, | |
| reproducible. The `convert_to_english.py` script does the full 5.8M corpus in | |
| ~3 minutes. | |
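The regex half of the bridge can be sketched in a few lines. The exact pattern in `convert_to_english.py` is not shown here, so the one below is an assumption; in particular, splitting digits off (`mateIn1` → `mate in 1`) is a choice made for this example. The UCI → SAN half is omitted because it needs the board state that `python-chess` provides.

```python
import re

# Insert a space at lower/digit -> upper boundaries and at letter -> digit boundaries.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])")


def decamelize(theme: str) -> str:
    """Turn a camelCase Lichess theme tag into lowercase English words."""
    return _CAMEL_BOUNDARY.sub(" ", theme).lower()


english_themes = [decamelize(t) for t in ["backRankMate", "mateIn1", "oneMove", "fork"]]
```

Once theme tags are plain words, any lexical matcher can line a query such as `back rank mate` up with the document text.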
| Re-ran reranker training on English-bridged docs. **Untrained MS-MARCO CE hit | |
| the oracle ceiling (0.5947 at top-100).** Massive jump. | |
| But: ran a final diagnostic comparing trained CE vs **BM25** over the same | |
| English docs. They were *identical*: | |
| | K | Static | +CE | +BM25 | Oracle | | |
| |---|---|---|---|---| | |
| | 100 | 0.1202 | **0.5947** | **0.5947** | 0.5947 | | |
| | 200 | 0.1202 | 0.7706 | 0.7706 | 0.7706 | | |
| | 300 | 0.1202 | 0.8718 | 0.8718 | 0.8718 | | |
| The "bridge effect" we observed was **lexical match enabled by the | |
| English conversion**, not semantic understanding by the CE. BM25 over the | |
| English docs does the same job. | |
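The Oracle column can be read as the best NDCG@10 any reranker could reach given only the first stage's top-K candidates. The sketch below assumes that definition and binary relevance; the repo's diagnostic script may compute it differently.

```python
import math


def dcg_of_hits(n_hits: int) -> float:
    """DCG when the first `n_hits` ranks are all relevant (binary gains)."""
    return sum(1.0 / math.log2(rank + 2) for rank in range(n_hits))


def oracle_ndcg_at_10(candidate_ids, relevant_ids):
    """Upper bound on NDCG@10 for a perfect reranker restricted to the candidates."""
    relevant = set(relevant_ids)
    retrievable = len(relevant & set(candidate_ids))  # relevant docs the first stage kept
    ideal = dcg_of_hits(min(len(relevant), 10))
    return dcg_of_hits(min(retrievable, 10)) / ideal if ideal else 0.0
```

A reranker that matches the oracle at every K, as both the CE and BM25 do in the table above, is placing every retrievable relevant document at the top, so the remaining loss comes from first-stage recall.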
| **Stress test**: stripped theme tokens from English docs too. Forces the CE | |
| to genuinely understand "fork query ↔ fork-pattern moves": | |
| | K | Static | +CE | +BM25 | Oracle | | |
| |---|---|---|---|---| | |
| | 100 | 0.1202 | 0.0726 | 0.4327 | 0.5947 | | |
| | 300 | 0.1202 | 0.0706 | 0.6252 | 0.8718 | | |
| CE drops below static (negative transfer — memorized "theme overlap = match" | |
| during training; can't generalize). BM25 still partially works via opening | |
| name overlap. | |
| **True semantic chess understanding from the CE was not achievable** with | |
| 22M-param English-pretrained models on our training signal. | |
| --- | |
| ## Production recommendation — and a surprising honest finding | |
| **The static embedding model is not needed for this task.** A direct comparison: | |
| | Approach | NDCG@10 (200 unseen-combo queries × 600 docs) | | |
| |---|---| | |
| | Static (v4-C2) alone | 0.1202 | | |
| | BM25 alone over chess-format docs | 0.0107 | | |
| | **BM25 alone over English-bridged docs** | **1.0000** | | |
| | Static + BM25 RRF fusion | 0.4940 | | |
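For reference, the RRF fusion row in the table above combines two rankings by reciprocal rank. A minimal sketch follows; the constant `k=60` is the conventional default and an assumption here, since the value used in the repo is not stated.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(doc) = sum over rankings of 1 / (k + rank).

    `rankings` is a list of ranked doc-id lists, best first. Returns doc ids
    sorted by fused score, with ties broken alphabetically for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


static_ranking = ["d3", "d1", "d2"]  # made-up first-stage output
bm25_ranking = ["d1", "d2", "d3"]    # made-up BM25 output
fused = rrf_fuse([static_ranking, bm25_ranking])
```

Because RRF weights both inputs equally, a weak ranking can drag a perfect one down, which is consistent with fusion scoring below BM25 alone in the table.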
| **BM25 over deterministically-English-converted documents achieves PERFECT | |
| ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training. | |
| No GPU. | |
| Why: our queries are theme tokens (`fork endgame`), and the English-bridged | |
| docs explicitly contain those words (`"Short endgame puzzle with fork..."`). | |
| This is BM25's natural strength — keyword overlap detection. The static model | |
| labors to learn token-cluster mappings; BM25 just reads the words directly. | |
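A stripped-down Okapi BM25 scorer shows why keyword overlap is enough here. This is a sketch, not the `rank_bm25` implementation: it uses the non-negative `log(1 + ...)` IDF variant, default-style parameters `k1=1.5` and `b=0.75`, and made-up English-bridged documents.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for one tokenized query."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            tf = term_freq[term]
            if tf == 0:
                continue  # terms absent from the doc contribute nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores


docs = [
    "short endgame puzzle with fork",
    "mate in 1 puzzle in the caro-kann",
    "long middlegame puzzle with pin",
]
scores = bm25_scores("fork endgame short".split(), [d.split() for d in docs])
```

Only the first document shares any term with the query, so it is the only one with a nonzero score. When queries and documents draw from the same small theme vocabulary, this overlap already determines the ranking.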
| ### Actual production architecture (the simple answer) | |
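| ```python | |
| from rank_bm25 import BM25Okapi | |
| # One-time: convert all puzzles to English (use scripts/convert_to_english.py). | |
| # `corpus` is the resulting list of English doc strings and `corpus_ids` the | |
| # parallel list of puzzle IDs; both are assumed to be loaded already. | |
| # Lowercase both sides so "Short" in a doc matches "short" in a query. | |
| bm25 = BM25Okapi([english_doc.lower().split() for english_doc in corpus]) | |
| # Query | |
| query = "fork endgame short" # or any theme combo / opening name | |
| # get_top_n returns the entries of its second argument, here the puzzle IDs. | |
| top_ids = bm25.get_top_n(query.lower().split(), corpus_ids, n=10) | |
| ``` | |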
| **Total: <10ms/query, $0 cost, no model, no GPU, no training.** | |
| ### When the static embedding would actually help | |
| 1. **Natural-language paraphrased queries**: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.** | |
| 2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers. | |
| 3. **Very large corpora**: where BM25 index size becomes an issue, embeddings are more storage-efficient per doc. | |
| For our actual eval setup (theme-token queries on Lichess puzzles), the static | |
| model loses by 8× to BM25-over-English-bridged. The static training exercise | |
| produced valuable methodology insights (especially the LLM-bridge pattern) but | |
| was the wrong tool for the actual production problem. | |
| --- | |
| ## Key learnings worth keeping (general, not chess-specific) | |
| 1. **Eval methodology dominates.** Most time spent debugging the "model isn't | |
| improving" turned out to be eval issues, not training issues. Compositional | |
| held-out > top-frequent-string eval. Strip lexical leakage between query | |
| and corpus when testing generalization. | |
| 2. **Sentence-transformers' `NoDuplicatesBatchSampler` is O(epoch-progress) | |
| per batch.** It walks a linked-list of deferred conflicts. For datasets | |
| with limited unique anchors (our ~327k anchors over 5.8M pairs), this | |
| creates monotonic step-time blowup. Switch to `BatchSamplers.BATCH_SAMPLER`. | |
| 3. **`CachedMultipleNegativesRankingLoss` is incompatible with | |
| `StaticEmbedding`** — explicit error. Token-bag has no transformer | |
| activations to GradCache through. | |
| 4. **Trackio crashes on first checkpoint push** with sentence-transformers | |
| due to an empty `router_mapping` struct that pyarrow can't write. Use | |
| `report_to="none"`. | |
| 5. **The "LLM-bridge" pattern**: when system A speaks language X and system | |
| B speaks language Y, use an LLM to translate B's content from Y into X once | |
| (not at inference). For chess: LLM writes English definitions of themes → | |
| general English teacher can now embed them → distill into chess-specific model. | |
| 6. **Deterministic translation often suffices** for the bridge. Don't pay LLM | |
| API costs if `python-chess` and regex can produce the same English text. | |
| Reserve LLMs for the parts that genuinely need understanding (concept | |
| definitions, paraphrases, strategic narratives). | |
| 7. **Compare your trained model against BM25** on the actual eval. If they | |
| tie, your model is doing keyword matching, not semantic work. Diagnostic | |
| in `scripts/diag_ce_vs_bm25.py`. | |
| 8. **Modal `.spawn()` only survives entrypoint exit on deployed apps.** For | |
| ephemeral `modal run`, the app dies when entrypoint returns — including | |
| spawned calls. Use `.remote()` with `--detach`. | |
| 9. **Apple Silicon M4 is competitive with cloud A100** for tiny models. Token | |
| bag + small batch easily hits 17 it/s on MPS. GPU cost is wasted unless | |
| the model is compute-bound. | |
| --- | |
| ## Reproducibility | |
| Clone this repo, then with sentence-transformers v5.5+: | |
| ```bash | |
| # Inspect the recipe | |
| cat scripts/train_chess_multitask.py | |
| # Reproduce the data prep (one-time, ~10 min) | |
| python scripts/generate_theme_defs.py # Needs DeepSeek API key in macOS keychain | |
| python scripts/convert_to_english.py # python-chess + regex, $0 | |
| python scripts/mine_hard_negs_v2.py # ~10 min on M4 MPS | |
| # Reproduce the winning training | |
| python scripts/train_chess_multitask.py # ~5 min on M4 MPS | |
| # Verify | |
| python scripts/compare_variants.py # Side-by-side eval table | |
| python scripts/diag_ce_vs_bm25.py # Is the rerank doing real work? | |
| ``` | |
| --- | |
| ## Limitations and honest caveats | |
| - **NDCG@10 = 0.12 is modest in absolute terms.** Industry retrieval encoders | |
| reach 0.4-0.6 on similar tasks. This model is competitive on size/speed, | |
| not absolute quality. | |
| - **The two-stage architecture (NDCG@10 ≈ 0.6) is the production answer** | |
| but relies on BM25 over English-converted docs, not on the cross-encoder. | |
| - **Cross-encoder didn't add semantic value** in our setup; results came from | |
| lexical match enabled by the English bridge. | |
| - **Bimodal failure**: even the best model misses half of queries entirely | |
| (median NDCG@10 = 0). The architecture has fundamental limits for chess | |
| reasoning. | |
| - **English-pretrained models don't know chess.** Tried MPNet, MiniLM, | |
| Jina-v5; all fail on UCI moves. Bigger English models won't fix this; only | |
| chess-pretrained or deterministic conversion helps. | |
| - **No engine evaluation.** "Is this puzzle a fork?" was determined by | |
| Lichess theme tags; we never ran a chess engine. A real production system | |
| would integrate Stockfish for ground-truth tactical pattern detection. | |
| --- | |
| ## What this is NOT | |
| - Not a chess engine. See [`thomasahle/fastchess`](https://github.com/thomasahle/fastchess) | |
| for FastText-based move prediction (closest related work). | |
| - Not a position similarity model. See `chess2vec` lineage on GitHub for | |
| position-level embeddings. | |
| - Not a state-of-the-art retrieval model. It's a tiny first-stage filter | |
| designed to pair with a reranker. | |
| --- | |
| ## License | |
| Apache 2.0 (model + scripts). Data is derived from Lichess/chess-puzzles, which | |
| is CC0; the derived parquets in this repo are also released under CC0. | |
| ## Acknowledgments | |
| - [Lichess](https://lichess.org) for releasing puzzles + openings under CC0. | |
| - [Tom Aarsen](https://huggingface.co/tomaarsen) for the | |
| `train-sentence-transformers` skill and `StaticEmbedding` recipe. | |
| - DeepSeek for the v4-flash API used for theme definitions. | |
| ## Citation | |
| If this work is useful, please link to this repo. The scientific findings | |
| (particularly the deterministic-bridge insight that BM25 over English-bridged | |
| docs equals a trained cross-encoder for this task) are the main contribution. | |