---
language: en
license: apache-2.0
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- static-embedding
- chess
- retrieval
- exploratory
datasets:
- Lichess/chess-puzzles
- Lichess/chess-openings
---

# Chess Static Embedding (v4-C2): Open Exploration

A 4M-parameter `StaticEmbedding` model for chess content retrieval, plus the full **open-science methodology document** describing what we tried, what worked, what failed, and why.

This repo is **exploratory experimental work**, published as-is. The model is genuinely useful (NDCG@10 = 0.12 on a compositional held-out eval, 50× smaller than typical retrieval encoders) but the bigger contribution is the **methodology narrative** below, particularly the *LLM-bridge* and *deterministic-bridge* findings.

---

## Quick start

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("oneryalcin/static-embedding-chess")

query = "fork endgame short"
docs = [
    "themes crushing endgame fork short opening Sicilian Defense moves f2g3 e6e7",
    "themes mate mateIn1 oneMove opening Caro-Kann moves d2d4 e7e5",
]
sims = model.encode(query) @ model.encode(docs).T
```

Static embedding: lookup table + average. Sub-millisecond CPU inference. No GPU required.

---

## Headline result

| Variant | NDCG@10 | vs v3 baseline |
|---------|---------|---------------|
| v3 baseline (random init + MNRL) | 0.0801 | n/a |
| v4-A hard-neg only | 0.1000 | +25% |
| v4-B theme distill only | 0.0112 | -86% (regression, see methodology) |
| v4-C multitask 500× | 0.1154 | +44% |
| **v4-C2 multitask 5000× (this model)** | **0.1202** | **+50%** |

Held-out eval: 200 unseen anchor combinations × 600-doc corpus. This tests compositional generalization: the model never saw these exact theme combinations during training, only the individual tokens in other combos.
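
For readers unfamiliar with the metric, here is a minimal sketch of NDCG@10 for a single query. It assumes binary relevance (a document either matches the query's theme combination or it does not); the repo's actual eval code is not reproduced here, and the function names `dcg_at_k` and `ndcg_at_k` as well as the toy document IDs are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k for one query (assumed setup, not the repo's code)."""
    gains = [1.0 if doc_id in relevant_ids else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every relevant document first.
    ideal = [1.0] * min(len(relevant_ids), k)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Toy query with three relevant documents scattered through the top 10.
ranking = ["d7", "d2", "d9", "d1", "d5", "d3", "d8", "d4", "d6", "d0"]
relevant = {"d1", "d2", "d3"}
score = ndcg_at_k(ranking, relevant)
perfect = ndcg_at_k(["d1", "d2", "d3", "d4"], relevant)
```

The reported numbers would then be the mean of this per-query score over the 200 held-out queries, which is why a median of 0 and a mean of 0.12 can coexist.
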

For **production-ready** chess search, see the production recommendation below: the **two-stage architecture** (static + BM25 over English-bridged docs) reaches NDCG@10 = 0.59-0.87 depending on K, and BM25 alone over English-bridged docs scores higher still on this eval.

---

## What's in this repo

```
model.safetensors                    # 4M-param StaticEmbedding weights (~9MB)
chess_tokenizer.json                 # WordLevel chess tokenizer (4,336 tokens)
tokenizer.json                       # Same, in HF format for ST loading
config_sentence_transformers.json    # Module config
modules.json                         # Module pipeline
data/
├── theme_definitions.parquet        # 73 chess themes + LLM-generated English defs + MPNet embeddings (the LLM-bridge teacher signal)
├── hard_negatives_chess.parquet     # 1.6M (anchor, positive, negative) triplets, chess-token format
└── hard_negatives_english.parquet   # Same, English-bridged via deterministic conversion
scripts/
├── train_chess_static.py            # Main training entrypoint (multi-version, env-flag controlled)
├── train_chess_multitask.py         # The v4-C2 winning recipe (theme distill + hard-neg MNRL)
├── convert_to_english.py            # Deterministic chess→English (no LLM needed; python-chess + regex)
├── mine_hard_negs_v2.py             # Memory-bounded custom hard-negative miner
├── generate_theme_defs.py           # LLM-bridge: DeepSeek-v4-flash writes chess concept definitions
├── compare_variants.py              # Side-by-side eval framework across all variants
└── diag_ce_vs_bm25.py               # The critical "is your CE really helping" diagnostic
```

---

## Methodology: the full experimental journey

This was 36+ hours of iterative exploration. The model is the small visible output; the methodology is the bigger contribution.

### 1. Problem and approach

**Task:** Free-text search over a chess puzzle corpus. The user types something like `"fork endgame short"` and gets matching Lichess puzzles.

**Why static embedding:** Tom Aarsen's [static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1) showed StaticEmbedding can be a useful retrieval primitive with the right training.
We adapted the recipe for a chess-specific domain with a custom WordLevel tokenizer so chess tokens (UCI moves, theme names, ECO codes) are first-class.

**Data:** Lichess/chess-puzzles (5.8M puzzles, CC0) + Lichess/chess-openings (3.6K openings, CC0).

### 2. Eval design: the hardest part

**Initial mistake:** The first eval used the 200 most common theme strings as queries. The model had seen each of these ~50,000 times in training. Baseline NDCG@10 was inflated to 0.81 by lexical overlap before any training. Useless.

**Fixed eval (used throughout):** *Compositional held-out anchors*. Pick 200 theme-combination strings that appear exactly 3 times in the data (rare-but-multi-relevant), remove all matching pairs from train, and use those rare combos as queries. This tests whether the model can compose meaning from individual theme tokens it learned, without having seen the specific combination.

This is harsh (the model can never "memorize" the eval queries) and that's the point. The random-init baseline drops to NDCG@10 ≈ 0.01.

### 3. Phase 1: diagnostic of the v3 model (0.08 NDCG@10)

A working baseline existed. Question: **why isn't it better?**

A token-similarity probe revealed the core issue:

| Pair | v3 cosine similarity |
|---|---|
| `fork` ↔ `pin` | +0.01 |
| `fork` ↔ `skewer` | -0.12 |
| `endgame` ↔ `middlegame` | -0.30 |

**Token embeddings were essentially orthogonal.** The model learned per-token mappings to chess-content clusters but no relationships *between* tokens. Compositional generalization (the eval task) requires those relationships.

Also discovered: 51% of held-out queries returned zero relevant documents in the top 10 (median NDCG@10 = 0). Bimodal failure pattern.

Also discovered: the model beat BM25 over chess-format docs by 7.5× (0.08 vs 0.01), confirming it does real semantic work beyond keyword match.

### 4. Phase 2: distillation from raw MPNet (DEAD END)

Hypothesis: distill student token embeddings to match teacher (MPNet) embeddings. The teacher knows English, so it should know that `fork ≈ pin`.

**Result:** REGRESSION. Why? **MPNet itself scores NDCG@10 = 0.0094 on our eval.** 95.5% of queries get zero relevant documents in the top 10. MPNet doesn't know chess: UCI moves are character soup to its WordPiece tokenizer.

**You can't distill what the teacher doesn't know.** This was the first key lesson.

### 5. Phase 3: LLM-bridge for theme distillation (BREAKTHROUGH)

Key insight: an LLM can read both chess theme names (in camelCase) AND English. Use it as a **translator** to put chess concepts into language MPNet *can* understand semantically.

**Steps:**

1. DeepSeek-v4-flash writes English definitions for 73 Lichess themes:
   - `fork` → "A tactical motif where a single piece attacks two or more enemy pieces simultaneously, forcing a material gain."
2. MPNet embeds the *English definitions* (it knows English fluently).
3. Distill the student's per-token embedding to match the definition embedding.

After step 2 alone, MPNet's `fork ↔ skewer` similarity jumps from 0.39 (raw theme names) to **0.87** (via definitions). Real semantic structure.

Combined with hard-negative MNRL training (v4-C2): **NDCG@10 = 0.1202**, +50% over v3.

Cost: 73 themes × DeepSeek API ≈ $0.01 + ~1 minute generation.

This is the **LLM-bridge** pattern: when system A doesn't speak system B's language, use an LLM as a translator. The LLM does one-shot work and is not part of inference.

### 6. Phase 4: hard-negative mining

Used the v3 model to mine confusable documents per anchor. We wrote a custom memory-bounded miner because the sentence-transformers built-in one runs out of memory on an M4 at 327k unique anchors × 327k positives. See `scripts/mine_hard_negs_v2.py`.

1.6M triplets mined. Positive-negative margin: 0.135 mean (good signal for training).

### 7. Phase 5: multi-task training (v4-C2 winner)

Multi-dataset trainer combining:

- **Chess triplets** (1.6M, MNRL loss): teaches content associations
- **Theme distillation** (73 themes × 5000 replicas via `EmbedDistillLoss`): injects semantic structure between tokens

With proportional sampling, each theme definition is seen once per replica per epoch (500 or 5000 times, depending on the variant), whereas each chess pair is seen once.

Theme distillation oversampling matters:

| Theme replicas | NDCG@10 |
|---|---|
| 500× | 0.1154 |
| 5000× | 0.1202 |

### 8. Phase 6: cross-encoder reranker attempts (ALL FAILED)

Tried three variants:

- MS-MARCO MiniLM (English-pretrained, 22M params) on chess-format docs
- Same, with theme echo stripped from training docs
- Fresh-init tiny BERT (5M params) with our chess tokenizer

**All regressed below static-only.** Diagnosis: the trained CEs operate at random-ordering level on the eval. Inspection of training predictions showed the trained CE got pair ordering wrong 2/3 of the time on sample inputs.

**Root cause:** documents are UCI move sequences (`f2g3 e6e7 ...`). To English-pretrained CE tokenizers these are character fragments with no meaningful representation. The CE can't learn what makes a "fork-y" move sequence from sparse labels alone.

Static embedding worked because token-bag averaging is sample-efficient (each `fork` token gets gradients from many examples and converges to a useful cluster); the CE's pair-level processing needs more signal than our data provides.

### 9. Phase 7: deterministic English bridge for documents (REVEALED THE TRUTH)

Insight: we don't need an LLM to translate documents either. `python-chess` deterministically converts UCI → SAN with board context (`f2g3` → `Bxg3`). Regex decamelizes themes (`backRankMate` → `back rank mate`). Free, instant, reproducible. The `convert_to_english.py` script converts the full 5.8M-puzzle corpus in ~3 minutes.

Re-ran reranker training on English-bridged docs.
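
The theme half of the deterministic bridge described above can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `scripts/convert_to_english.py`: the function names `decamelize` and `themes_to_english` are hypothetical, and splitting digits off (`mateIn1` → `mate in 1`) is a guess at the desired output. The UCI → SAN half is not shown because it needs `python-chess` to replay the moves on a board.

```python
import re

# Zero-width boundary between a lowercase letter or digit and an uppercase
# letter, or between a letter and a digit.
_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])")


def decamelize(theme: str) -> str:
    """Turn one camelCase theme tag into lowercase English words."""
    return _CAMEL_BOUNDARY.sub(" ", theme).lower()


def themes_to_english(themes: str) -> str:
    """Decamelize a whitespace-separated theme string (hypothetical helper)."""
    return " ".join(decamelize(tag) for tag in themes.split())


english = themes_to_english("crushing endgame fork short backRankMate")
print(english)
```

Because the output depends only on the input string, the conversion is reproducible and costs nothing per document, which is the property the bridge relies on.
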
**Untrained MS-MARCO CE hit the oracle ceiling (0.5947 at top-100).** Massive jump.

But: we ran a final diagnostic comparing the trained CE vs **BM25** over the same English docs. They were *identical*:

| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | **0.5947** | **0.5947** | 0.5947 |
| 200 | 0.1202 | 0.7706 | 0.7706 | 0.7706 |
| 300 | 0.1202 | 0.8718 | 0.8718 | 0.8718 |

The "LLM-bridge effect" we observed was **lexical match enabled by the English conversion**, not semantic CE understanding. BM25 over English docs does the same job.

**Stress test**: stripped theme tokens from the English docs too. This forces the CE to genuinely understand "fork query ↔ fork-pattern moves":

| K | Static | +CE | +BM25 | Oracle |
|---|---|---|---|---|
| 100 | 0.1202 | 0.0726 | 0.4327 | 0.5947 |
| 300 | 0.1202 | 0.0706 | 0.6252 | 0.8718 |

The CE drops below static (negative transfer: it memorized "theme overlap = match" during training and can't generalize). BM25 still partially works via opening-name overlap. **True semantic CE chess understanding is not achievable** with 22M-param English-pretrained models on our training signal.

---

## Production recommendation, and a surprising honest finding

**The static embedding model is not needed for this task.** A direct comparison:

| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
|---|---|
| Static (v4-C2) alone | 0.1202 |
| BM25 alone over chess-format docs | 0.0107 |
| **BM25 alone over English-bridged docs** | **1.0000** |
| Static + BM25 RRF fusion | 0.4940 |

**BM25 over deterministically-English-converted documents achieves PERFECT ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training. No GPU.

Why: our queries are theme tokens (`fork endgame`), and the English-bridged docs explicitly contain those words (`"Short endgame puzzle with fork..."`). This is BM25's natural strength: keyword overlap detection.
The static model labors to learn token-cluster mappings; BM25 just reads the words directly.

### Actual production architecture (the simple answer)

```python
from rank_bm25 import BM25Okapi

# One-time: convert all puzzles to English (use scripts/convert_to_english.py).
# `corpus` (English-converted puzzle strings) and `corpus_ids` (parallel list
# of puzzle IDs) are assumed to be loaded already.

# Build the BM25 index over the English-converted corpus.
# Lowercase both sides so "Short" in a doc matches "short" in a query.
bm25 = BM25Okapi([english_doc.lower().split() for english_doc in corpus])

# Query
query = "fork endgame short"  # or any theme combo / opening name
top_ids = bm25.get_top_n(query.lower().split(), corpus_ids, n=10)
```

**Total: <10ms/query, $0 cost, no model, no GPU, no training.**

### When the static embedding would actually help

1. **Natural-language paraphrased queries**: the user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.**
2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers.
3. **Very large corpora** where BM25 index size becomes an issue: embeddings can be more storage-efficient per doc.

For our actual eval setup (theme-token queries on Lichess puzzles), the static model loses by 8× to BM25 over English-bridged docs. The static training exercise produced valuable methodology insights (especially the LLM-bridge pattern) but was the wrong tool for the actual production problem.

---

## Key learnings worth keeping (general, not chess-specific)

1. **Eval methodology dominates.** Most of the time spent debugging "the model isn't improving" turned out to be eval issues, not training issues. Prefer a compositional held-out eval over a top-frequent-string eval. Strip lexical leakage between query and corpus when testing generalization.
2. **Sentence-transformers' `NoDuplicatesBatchSampler` is O(epoch-progress) per batch.** It walks a linked list of deferred conflicts.
   For datasets with limited unique anchors (our ~327k anchors over 5.8M pairs), this makes step time grow monotonically over the epoch. Switch to `BatchSamplers.BATCH_SAMPLER`.
3. **`CachedMultipleNegativesRankingLoss` is incompatible with `StaticEmbedding`**: it raises an explicit error. A token bag has no transformer activations to GradCache through.
4. **Trackio crashes on first checkpoint push** with sentence-transformers due to an empty `router_mapping` struct that pyarrow can't write. Use `report_to="none"`.
5. **The "LLM-bridge" pattern**: when system A speaks language X and system B speaks language Y, use an LLM to translate B's content into X once (not at inference). For chess: the LLM writes English definitions of themes → a general English teacher can now embed them → distill into the chess-specific model.
6. **Deterministic translation often suffices** for the bridge. Don't pay LLM API costs if `python-chess` and regex can produce the same English text. Reserve LLMs for the parts that genuinely need understanding (concept definitions, paraphrases, strategic narratives).
7. **Compare your trained model against BM25** on the actual eval. If they tie, your model is doing keyword matching, not semantic work. Diagnostic in `scripts/diag_ce_vs_bm25.py`.
8. **Modal `.spawn()` only survives entrypoint exit on deployed apps.** For ephemeral `modal run`, the app dies when the entrypoint returns, including spawned calls. Use `.remote()` with `--detach`.
9. **Apple Silicon M4 is competitive with cloud A100** for tiny models. Token bag + small batch easily hits 17 it/s on MPS. GPU cost is wasted unless the model is compute-bound.
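
Learning 7 is cheap to act on. Below is a dependency-free sketch of a BM25 baseline to run next to any trained ranker. It is a plain Okapi BM25 with a non-negative idf variant, not the code in `scripts/diag_ce_vs_bm25.py` and not `rank_bm25`'s implementation; the toy documents, the `model_ranking` placeholder, and the function names are all hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, tokenized_docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for one query (non-negative idf)."""
    n_docs = len(tokenized_docs)
    avg_len = sum(len(doc) for doc in tokenized_docs) / n_docs
    # Number of documents containing each term.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    scores = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


def rank(scores):
    """Document indices sorted by descending score (ties keep input order)."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy English-bridged documents, made up for illustration.
docs = [
    "short endgame puzzle with fork in the sicilian defense",
    "mate in 1 puzzle in the caro-kann",
    "long middlegame puzzle with pin",
]
tokenized = [doc.split() for doc in docs]
scores = bm25_scores("fork endgame short".split(), tokenized)
bm25_ranking = rank(scores)

# Placeholder for the ranking produced by the model under test.
model_ranking = [0, 2, 1]
agrees_at_top = bm25_ranking[0] == model_ranking[0]
```

If the two rankings agree on nearly every query, the trained model is probably rewarding the same keyword overlap BM25 sees, which is the tie this learning warns about.
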

---

## Reproducibility

Clone this repo, then with sentence-transformers v5.5+:

```bash
# Inspect the recipe
cat scripts/train_chess_multitask.py

# Reproduce the data prep (one-time, ~10 min)
python scripts/generate_theme_defs.py    # Needs DeepSeek API key in macOS keychain
python scripts/convert_to_english.py     # python-chess + regex, $0
python scripts/mine_hard_negs_v2.py      # ~10 min on M4 MPS

# Reproduce the winning training
python scripts/train_chess_multitask.py  # ~5 min on M4 MPS

# Verify
python scripts/compare_variants.py       # Side-by-side eval table
python scripts/diag_ce_vs_bm25.py        # Is the rerank doing real work?
```

---

## Limitations and honest caveats

- **NDCG@10 = 0.12 is modest in absolute terms.** Industry retrieval encoders reach 0.4-0.6 on similar tasks. This model is competitive on size/speed, not absolute quality.
- **The two-stage architecture reaches NDCG@10 ≈ 0.6 at K=100** (0.87 at K=300), but that score comes from BM25 over English-converted docs, not from the cross-encoder. BM25 alone over the same docs scores 1.0000 on this eval.
- **Cross-encoder didn't add semantic value** in our setup; results came from lexical match enabled by the English bridge.
- **Bimodal failure**: even the best model misses half of queries entirely (median NDCG@10 = 0). The architecture has fundamental limits for chess reasoning.
- **English-pretrained models don't know chess.** Tried MPNet, MiniLM, Jina-v5; all fail on UCI moves. Bigger English models won't fix this; only chess-pretrained models or deterministic conversion help.
- **No engine evaluation.** "Is this puzzle a fork?" was determined by Lichess theme tags; we never ran a chess engine. A real production system would integrate Stockfish for ground-truth tactical pattern detection.

---

## What this is NOT

- Not a chess engine. See [`thomasahle/fastchess`](https://github.com/thomasahle/fastchess) for FastText-based move prediction (closest related work).
- Not a position similarity model. See the `chess2vec` lineage on GitHub for position-level embeddings.
- Not a state-of-the-art retrieval model.
  It's a tiny first-stage filter designed to pair with a reranker.

---

## License

Apache 2.0 (model + scripts). Data is derived from Lichess/chess-puzzles, which is CC0; the derived parquets in this repo are also released under CC0.

## Acknowledgments

- [Lichess](https://lichess.org) for releasing puzzles + openings under CC0.
- [Tom Aarsen](https://huggingface.co/tomaarsen) for the `train-sentence-transformers` skill and `StaticEmbedding` recipe.
- DeepSeek for the v4-flash API used for theme definitions.

## Citation

If this work is useful, please link to this repo. The scientific findings (particularly the deterministic-bridge insight that BM25 over English-bridged docs equals a trained cross-encoder for this task) are the main contribution.