	---
	language: en
	license: apache-2.0
	library_name: sentence-transformers
	pipeline_tag: sentence-similarity
	tags:
	- sentence-transformers
	- static-embedding
	- chess
	- retrieval
	- exploratory
	datasets:
	- Lichess/chess-puzzles
	- Lichess/chess-openings
	---

	# Chess Static Embedding (v4-C2) — Open Exploration

	A 4M-parameter `StaticEmbedding` model for chess content retrieval, plus the
	full open-science methodology document describing what we tried, what
	worked, what failed, and why.

	This repo is exploratory experimental work, published as-is. The model is
	genuinely useful (NDCG@10 = 0.12 on a compositional held-out eval, 50× smaller
	than typical retrieval encoders) but the bigger contribution is the
	methodology narrative below — particularly the LLM-bridge and
	deterministic-bridge findings.

	---

	## Quick start

	```python
	from sentence_transformers import SentenceTransformer

	model = SentenceTransformer("oneryalcin/static-embedding-chess")
	query = "fork endgame short"
	docs = [
	    "themes crushing endgame fork short opening Sicilian Defense moves f2g3 e6e7",
	    "themes mate mateIn1 oneMove opening Caro-Kann moves d2d4 e7e5",
	]
	sims = model.encode(query) @ model.encode(docs).T
	```

	Static embedding: lookup table + average. Sub-millisecond CPU inference. No GPU
	required.
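
The "lookup table + average" description can be made concrete with a minimal sketch. The three-dimensional table and the `embed` helper below are toy stand-ins, not the released weights or the library's implementation, and skipping unknown tokens is an assumption made for illustration.

```python
# Toy static embedding: one vector per token, mean-pooled over the input.
# The vectors are made up; the released model has its own 4,336-token
# vocabulary and learned weights.
TABLE = {
    "fork":    [1.0, 0.0, 0.0],
    "endgame": [0.0, 1.0, 0.0],
    "short":   [0.0, 0.0, 1.0],
}


def embed(text: str) -> list[float]:
    """Look up each whitespace token and average the vectors found."""
    vectors = [TABLE[tok] for tok in text.split() if tok in TABLE]
    if not vectors:
        return [0.0, 0.0, 0.0]  # assumption: all-zero vector for empty input
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


query_vec = embed("fork endgame")  # [0.5, 0.5, 0.0]
```

Because inference is only a table lookup and a mean, there is no attention or matrix multiplication to run, which is why CPU latency stays so low.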

	---

	## Headline result

	| Variant | NDCG@10 | Change vs v3 baseline |
	|---------|---------|---------------|
	| v3 baseline (random init + MNRL) | 0.0801 | (baseline) |
	| v4-A hard-neg only | 0.1000 | +25% |
	| v4-B theme distill only | 0.0112 | -86% (regression; see methodology) |
	| v4-C multitask 500× | 0.1154 | +44% |
	| v4-C2 multitask 5000× (this model) | 0.1202 | +50% |

	Held-out eval: 200 unseen anchor combinations × 600-doc corpus. Compositional
	generalization — the model never saw these exact theme combinations during
	training, only the individual tokens in other combos.
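
For readers who want to check the metric, here is a minimal sketch of NDCG@10. It assumes binary relevance (a document either matches the query's theme combination or it does not); the repo's eval script may weight relevance differently, and the `ndcg_at_k` name is ours.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG@k: discounted gain of hits over the ideal ordering."""
    relevant = set(relevant_ids)
    # Rank 0 gets discount 1/log2(2) = 1, rank 1 gets 1/log2(3), and so on.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# One relevant doc, retrieved at rank 2 of the list.
score = ndcg_at_k(["x", "a", "y"], ["a"])
```

A query whose relevant documents all fall outside the top 10 scores exactly 0.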

	For production-ready chess search, see the two-stage architecture below
	(static retrieval of the top 100-300 candidates, reranked by BM25 over
	English-bridged docs), which delivers NDCG@10 = 0.59-0.87 depending on K.

	---

	## What's in this repo

	```
	model.safetensors # 4M-param StaticEmbedding weights (~9MB)
	chess_tokenizer.json # WordLevel chess tokenizer (4,336 tokens)
	tokenizer.json # Same, in HF format for ST loading
	config_sentence_transformers.json # Module config
	modules.json # Module pipeline

	data/
	├── theme_definitions.parquet # 73 chess themes + LLM-generated English defs + MPNet embeddings (the LLM-bridge teacher signal)
	├── hard_negatives_chess.parquet # 1.6M (anchor, positive, negative) triplets, chess-token format
	└── hard_negatives_english.parquet # Same, English-bridged via deterministic conversion

	scripts/
	├── train_chess_static.py # Main training entrypoint (multi-version, env-flag controlled)
	├── train_chess_multitask.py # The v4-C2 winning recipe (theme distill + hard-neg MNRL)
	├── convert_to_english.py # Deterministic chess→English (no LLM needed; python-chess + regex)
	├── mine_hard_negs_v2.py # Memory-bounded custom hard-negative miner
	├── generate_theme_defs.py # LLM-bridge: DeepSeek-v4-flash writes chess concept definitions
	├── compare_variants.py # Side-by-side eval framework across all variants
	└── diag_ce_vs_bm25.py # The critical "is your CE really helping" diagnostic
	```

	---

	## Methodology — the full experimental journey

	This was 36+ hours of iterative exploration. The model is the small visible
	output; the methodology is the bigger contribution.

	### 1. Problem and approach

	Task: Free-text search over a chess puzzle corpus. User types something
	like `"fork endgame short"` and gets matching Lichess puzzles.

	Why static embedding: Tom Aarsen's
	[static-retrieval-mrl-en-v1](https://huggingface.co/sentence-transformers/static-retrieval-mrl-en-v1)
	showed StaticEmbedding can be a useful retrieval primitive with the right
	training. We adapted the recipe for a chess-specific domain with a custom
	WordLevel tokenizer so chess tokens (UCI moves, theme names, ECO codes) are
	first-class.

	Data: Lichess/chess-puzzles (5.8M puzzles, CC0) + Lichess/chess-openings
	(3.6K openings, CC0).

	### 2. Eval design — the hardest part

	Initial mistake: First eval used top-200 most-common theme strings as
	queries. The model had seen each of these ~50,000 times in training. Baseline
	NDCG@10 was inflated to 0.81 by lexical overlap before any training. Useless.

	Fixed eval (used throughout): Compositional held-out anchors. Pick 200
	theme-combination strings that appear exactly 3 times in the data
	(rare-but-multi-relevant), remove all matching pairs from train, use those rare
	combos as queries. Tests whether the model can compose meaning from individual
	theme tokens it learned, without having seen the specific combination.

	This is harsh — the model can never "memorize" the eval queries — and that's
	the point. Random-init baseline drops to NDCG@10 ≈ 0.01.
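
The split can be sketched as follows. This is a minimal illustration under stated assumptions: `pairs` is a hypothetical list of `(anchor, doc)` tuples, and picking the first 200 rare anchors in sorted order is our choice, since the text does not say how the 200 were selected.

```python
from collections import Counter


def split_compositional(pairs, n_occurrences=3, max_queries=200):
    """Hold out anchors that occur exactly n_occurrences times.

    pairs: list of (anchor, doc) tuples, anchor being a theme-combination string.
    Returns (train_pairs, eval_qrels) where eval_qrels maps anchor -> relevant docs.
    """
    counts = Counter(anchor for anchor, _ in pairs)
    # Sorted for determinism (an assumption, not the author's stated procedure).
    rare = sorted(a for a, c in counts.items() if c == n_occurrences)[:max_queries]
    held_out = set(rare)

    train = [(a, d) for a, d in pairs if a not in held_out]
    eval_qrels = {}
    for anchor, doc in pairs:
        if anchor in held_out:
            eval_qrels.setdefault(anchor, []).append(doc)
    return train, eval_qrels


pairs = (
    [("fork endgame", f"doc{i}") for i in range(3)]
    + [("fork", f"doc{i}") for i in range(3, 8)]
    + [("pin short", "doc8")]
)
train, qrels = split_compositional(pairs)
```

In this toy input only `"fork endgame"` occurs exactly three times, so it becomes the single held-out query, while the individual token `fork` still appears in training through other anchors.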

	### 3. Phase 1 — diagnostic of the v3 model (0.08 NDCG@10)

	A working baseline existed. Question: why isn't it better?

	Token-similarity probe revealed the core issue:

	| Pair | v3 cosine similarity |
	|---|---|
	| `fork` ↔ `pin` | +0.01 |
	| `fork` ↔ `skewer` | -0.12 |
	| `endgame` ↔ `middlegame` | -0.30 |

	Token embeddings were essentially orthogonal. The model learned per-token
	mappings to chess-content clusters but no relationships between tokens.
	Compositional generalization (the eval task) requires those relationships.
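
A probe of this kind is only a few lines. The sketch below uses a made-up two-dimensional table in place of the model's real token embeddings; the `cosine` and `probe` helpers are illustrative names, not functions from this repo.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def probe(table, token_pairs):
    """Return {(tok_a, tok_b): cosine} for each requested pair of tokens."""
    return {(a, b): cosine(table[a], table[b]) for a, b in token_pairs}


# Toy vectors standing in for rows of the embedding matrix.
toy_table = {"fork": [1.0, 0.0], "pin": [0.0, 1.0], "skewer": [1.0, 1.0]}
report = probe(toy_table, [("fork", "pin"), ("fork", "skewer")])
```

Values near zero for tactically related tokens, as in the table above, indicate that the embedding space encodes no relationship between them.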

	Also discovered: 51% of held-out queries returned zero relevant documents
	in the top 10 (median NDCG@10 = 0). The failure pattern is bimodal.

	Also discovered: model beat BM25 by 7.5× (0.08 vs 0.01), confirming it does
	real semantic work beyond keyword match.

	### 4. Phase 2 — distillation from raw MPNet (DEAD END)

	Hypothesis: distill student token embeddings to match teacher (MPNet)
	embeddings. Teacher knows English; should know that `fork ≈ pin`.

	Result: REGRESSION. Why? **MPNet itself scores NDCG@10 = 0.0094 on our
	eval.** 95.5% of queries get zero in top-10. MPNet doesn't know chess: UCI
	moves are character soup to its WordPiece tokenizer.

	You can't distill what the teacher doesn't know. This was the first key
	lesson.

	### 5. Phase 3 — LLM-bridge for theme distillation (BREAKTHROUGH)

	Key insight: an LLM can read both chess (in camelCase) AND English. Use it as
	a translator to put chess concepts into language MPNet can understand
	semantically.

	Steps:

	1. DeepSeek-v4-flash writes English definitions for 73 Lichess themes:
	   - `fork` → "A tactical motif where a single piece attacks two or more
	     enemy pieces simultaneously, forcing a material gain."
	2. MPNet embeds the English definitions (it knows English fluently).
	3. Distill the student's per-token embedding to match the definition embedding.

	After step 2 alone, MPNet's `fork ↔ skewer` similarity jumps from 0.39 (raw
	camelCase) to 0.87 (via definitions). Real semantic structure.

	Combined with hard-negative MNRL training (v4-C2): NDCG@10 = 0.1202, +50%
	over v3.

	Cost: 73 themes × DeepSeek API ≈ $0.01 + ~1 minute generation.

	This is the LLM-bridge pattern: when system A doesn't speak system B's
	language, use an LLM as a translator. The LLM is one-shot work, not part of
	inference.
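
Step 3 can be illustrated with a minimal sketch. It assumes a plain squared-error objective and a student with the same dimensionality as the teacher; the repo's `EmbedDistillLoss` may use a different objective or a projection, and the vectors below are invented.

```python
def distill_step(student, teacher, lr=0.1):
    """One gradient step on 0.5 * ||s - t||^2 per token (gradient is s - t)."""
    for token, target in teacher.items():
        current = student[token]
        student[token] = [s - lr * (s - t) for s, t in zip(current, target)]


# Student starts with orthogonal token vectors, as the v3 probe found.
student = {"fork": [1.0, 0.0], "skewer": [0.0, 1.0]}
# Made-up stand-ins for MPNet embeddings of the two theme definitions.
teacher = {"fork": [0.9, 0.4], "skewer": [0.8, 0.6]}

for _ in range(200):
    distill_step(student, teacher)
```

After enough steps the student's `fork` and `skewer` vectors inherit whatever similarity the teacher assigns to the two definitions, which is the structure the v3 model lacked.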

	### 6. Phase 4 — hard-negative mining

	Used the v3 model to mine confusable documents per anchor. Custom
	memory-bounded miner because the sentence-transformers built-in OOMs on M4 at
	327k unique anchors × 327k positives. See `scripts/mine_hard_negs_v2.py`.

	1.6M triplets mined. Positive-negative margin: 0.135 mean (good signal for
	training).
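
One way to keep mining memory-bounded is to stream over anchors and keep only the top few candidates per anchor in a small heap, so the full anchor × document similarity matrix is never materialized. The sketch below shows that idea in plain Python; it is not the code in `scripts/mine_hard_negs_v2.py`, which may batch the work differently, and all names here are hypothetical.

```python
import heapq


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mine_hard_negatives(anchor_vecs, doc_vecs, positives, n_neg=1):
    """Yield (anchor_idx, positive_idx, negative_idx) triplets.

    positives[i] is the set of doc indices relevant to anchor i. For each
    anchor only n_neg candidates are kept at a time, so memory stays small.
    """
    triplets = []
    for idx, anchor in enumerate(anchor_vecs):
        # Generator: scores are produced lazily and never stored in full.
        candidates = (
            (dot(anchor, doc), j)
            for j, doc in enumerate(doc_vecs)
            if j not in positives[idx]
        )
        hardest = heapq.nlargest(n_neg, candidates)  # most similar non-positives
        for pos in sorted(positives[idx]):
            for _, neg in hardest:
                triplets.append((idx, pos, neg))
    return triplets


anchors = [[1.0, 0.0]]
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
triplets = mine_hard_negatives(anchors, docs, positives=[{0}])
```

Here document 1 is the hardest negative for the single anchor because it is the most similar document that is not a labeled positive.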

	### 7. Phase 5 — multi-task training (v4-C2 winner)

	Multi-dataset trainer combining:
	- Chess triplets (1.6M, MNRL loss): teaches content associations
	- Theme distillation (73 themes × 5000 replicas via `EmbedDistillLoss`):
	injects semantic structure between tokens

	With proportional sampling, replication means each theme definition is drawn
	once per replica per epoch (500 or 5000 times, depending on the setting),
	whereas each chess triplet is drawn once. The amount of theme-distillation
	oversampling matters:

	| Theme replicas | NDCG@10 |
	|---|---|
	| 500× | 0.1154 |
	| 5000× | 0.1202 |
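
The effect of the replica count on the training mix is simple arithmetic. The sketch below assumes that proportional sampling draws from each dataset in proportion to its row count, and uses the 73 themes and 1.6M triplets quoted above.

```python
N_THEMES = 73
N_TRIPLETS = 1_600_000


def theme_share(replicas):
    """Return (theme_rows, fraction of all training samples that are theme rows)."""
    theme_rows = N_THEMES * replicas
    return theme_rows, theme_rows / (theme_rows + N_TRIPLETS)


for replicas in (500, 5000):
    rows, share = theme_share(replicas)
    print(f"{replicas}x: {rows:,} theme rows, {share:.1%} of samples")
# 500x: 36,500 theme rows, 2.2% of samples
# 5000x: 365,000 theme rows, 18.6% of samples
```

Under that assumption, moving from 500× to 5000× raises the distillation task from roughly 2% to roughly 19% of all samples.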

	### 8. Phase 6 — cross-encoder reranker attempts (ALL FAILED)

	Tried three variants:
	- MS-MARCO MiniLM (English-pretrained, 22M params) on chess-format docs
	- Same, with theme echo stripped from training docs
	- Fresh-init tiny BERT (5M params) with our chess tokenizer

	All regressed below static-only. Diagnosis: trained CEs operate at
	random-ordering level on the eval. Inspection of training predictions showed
	the trained CE got pair-ordering wrong 2/3 of the time on sample inputs.

	Root cause: documents are UCI move sequences (`f2g3 e6e7 ...`). To
	English-pretrained CE tokenizers these are character fragments with no
	meaningful representation. The CE can't learn what makes a "fork-y" move
	sequence from sparse labels alone. Static embedding worked because token-bag
	averaging is sample-efficient (each `fork` token gets gradients from many
	examples → converges to a useful cluster); the CE's pair-level processing is
	hungrier for signal not available in our data.

	### 9. Phase 7 — deterministic English bridge for documents (REVEALED THE TRUTH)

	Insight: we don't need an LLM to translate documents either. `python-chess`
	deterministically converts UCI → SAN with board context (`f2g3` → `Bxg3`).
	Regex decamelizes themes (`backRankMate` → `back rank mate`). Free, instant,
	reproducible. The `convert_to_english.py` script does the full 5.8M corpus in
	~3 minutes.
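
The two conversions can be sketched as below. This is an illustration, not the contents of `convert_to_english.py`: the function names are ours, the regex is one plausible way to decamelize, and `chess` (python-chess) is imported lazily so the regex half runs without that dependency.

```python
import re


def decamelize(theme: str) -> str:
    """Turn a camelCase theme tag into lowercase words: backRankMate -> back rank mate."""
    # Insert a space wherever a lowercase letter is followed by an uppercase
    # letter or a digit, then lowercase the whole string.
    return re.sub(r"(?<=[a-z])(?=[A-Z0-9])", " ", theme).lower()


def uci_line_to_san(fen: str, uci_moves: list[str]) -> list[str]:
    """Convert a UCI move sequence to SAN, replaying it from the given position."""
    import chess  # python-chess, required only for this function

    board = chess.Board(fen)
    san_moves = []
    for uci in uci_moves:
        move = chess.Move.from_uci(uci)
        san_moves.append(board.san(move))  # SAN depends on the current position
        board.push(move)
    return san_moves


themes = [decamelize(t) for t in ["backRankMate", "mateIn1", "fork"]]
```

SAN needs the board state because the same UCI string can be a capture, a check, or a quiet move depending on the position, which is why the moves are replayed one by one.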

	Re-ran reranker training on English-bridged docs. **Untrained MS-MARCO CE hit
	the oracle ceiling (0.5947 at top-100).** Massive jump.

	But: ran a final diagnostic comparing trained CE vs BM25 over the same
	English docs. They were identical:

	| K | Static | +CE | +BM25 | Oracle |
	|---|---|---|---|---|
	| 100 | 0.1202 | 0.5947 | 0.5947 | 0.5947 |
	| 200 | 0.1202 | 0.7706 | 0.7706 | 0.7706 |
	| 300 | 0.1202 | 0.8718 | 0.8718 | 0.8718 |

	The "LLM-bridge effect" we observed was **lexical match enabled by the
	English conversion**, not semantic CE understanding. BM25 over English docs
	does the same job.
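
The Oracle column is the ceiling any reranker can reach given the static model's top-K candidates: relevant documents that were retrieved are moved to the front, and relevant documents that were not retrieved stay lost. A minimal sketch, assuming binary relevance and hypothetical document ids:

```python
import math


def ndcg10(ranked_ids, relevant):
    """Binary-relevance NDCG@10."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:10])
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), 10)))
    return dcg / ideal if ideal else 0.0


def oracle_rerank(candidates, relevant):
    """Best possible reordering of the candidates: retrieved relevant docs first."""
    return sorted(candidates, key=lambda doc_id: doc_id in relevant, reverse=True)


candidates = ["d3", "d1", "d7"]   # first-stage top-K (toy)
relevant = {"d1", "d9"}           # d9 was never retrieved
ceiling = ndcg10(oracle_rerank(candidates, relevant), relevant)
```

When a reranker's score equals this ceiling at every K, as both +CE and +BM25 do in the table, it is ordering every retrieved relevant document first, so the remaining loss comes entirely from first-stage recall.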

	Stress test: stripped theme tokens from English docs too. Forces the CE
	to genuinely understand "fork query ↔ fork-pattern moves":

	| K | Static | +CE | +BM25 | Oracle |
	|---|---|---|---|---|
	| 100 | 0.1202 | 0.0726 | 0.4327 | 0.5947 |
	| 300 | 0.1202 | 0.0706 | 0.6252 | 0.8718 |

	CE drops below static (negative transfer — memorized "theme overlap = match"
	during training; can't generalize). BM25 still partially works via opening
	name overlap.

	True semantic CE chess understanding is not achievable with 22M-param
	English-pretrained models on our training signal.

	---

	## Production recommendation — and a surprising honest finding

	The static embedding model is not needed for this task. A direct comparison:

	| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
	|---|---|
	| Static (v4-C2) alone | 0.1202 |
	| BM25 alone over chess-format docs | 0.0107 |
	| BM25 alone over English-bridged docs | 1.0000 |
	| Static + BM25 RRF fusion | 0.4940 |

	**BM25 over deterministically-English-converted documents achieves PERFECT
	ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training.
	No GPU.

	Why: our queries are theme tokens (`fork endgame`), and the English-bridged
	docs explicitly contain those words (`"Short endgame puzzle with fork..."`).
	This is BM25's natural strength — keyword overlap detection. The static model
	labors to learn token-cluster mappings; BM25 just reads the words directly.
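
For reference, the fusion row in the table uses reciprocal rank fusion (RRF). A minimal sketch follows; the constant `k=60` is the conventional default and an assumption here, since the value used in the experiments is not stated.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank(d))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


static_ranking = ["d2", "d5", "d1"]   # toy first-stage order
bm25_ranking = ["d1", "d2", "d9"]     # toy BM25 order
fused = rrf_fuse([static_ranking, bm25_ranking])
```

RRF weights both input rankings equally, so a plausible reading of the 0.4940 result is that the weaker static ranking dilutes an already perfect BM25 ranking.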

	### Actual production architecture (the simple answer)

	```python
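	from rank_bm25 import BM25Okapi

	# One-time: convert all puzzles to English (use scripts/convert_to_english.py).
	# `corpus` is the list of English-bridged document strings; the three rows
	# below are illustrative placeholders, not real converter output.
	corpus = [
	    "short endgame puzzle with fork opening Sicilian Defense",
	    "mate in 1 puzzle opening Caro-Kann Defense",
	    "middlegame puzzle with pin opening Italian Game",
	]
	# Build the BM25 index over lowercased whitespace tokens (matching is exact)
	bm25 = BM25Okapi([doc.lower().split() for doc in corpus])

	# Query
	query = "fork endgame short"  # or any theme combo / opening name
	top_docs = bm25.get_top_n(query.lower().split(), corpus, n=10)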
	```

	Total: <10ms/query, $0 cost, no model, no GPU, no training.

	### When the static embedding would actually help

	1. Natural-language paraphrased queries: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. We never tested this.
	2. Cross-lingual queries: BM25 needs exact lexical overlap; embeddings can cross language barriers.
	3. Very large corpora: where BM25 index size becomes an issue, embeddings can be more storage-efficient per doc.

	For our actual eval setup (theme-token queries on Lichess puzzles), the static
	model loses by 8× to BM25-over-English-bridged. The static training exercise
	produced valuable methodology insights (especially the LLM-bridge pattern) but
	was the wrong tool for the actual production problem.

	---

	## Key learnings worth keeping (general, not chess-specific)

	1. Eval methodology dominates. Most time spent debugging the "model isn't
	improving" turned out to be eval issues, not training issues. Compositional
	held-out > top-frequent-string eval. Strip lexical leakage between query
	and corpus when testing generalization.

	2. **Sentence-transformers' `NoDuplicatesBatchSampler` is O(epoch-progress)
	per batch.** It walks a linked-list of deferred conflicts. For datasets
	with limited unique anchors (our ~327k anchors over 5.8M pairs), this
	creates monotonic step-time blowup. Switch to `BatchSamplers.BATCH_SAMPLER`.

	3. **`CachedMultipleNegativesRankingLoss` is incompatible with
	`StaticEmbedding`** — explicit error. Token-bag has no transformer
	activations to GradCache through.

	4. Trackio crashes on first checkpoint push with sentence-transformers
	due to an empty `router_mapping` struct that pyarrow can't write. Use
	`report_to="none"`.

	5. The "LLM-bridge" pattern: when system A speaks language X and system
	B speaks language Y, use an LLM to translate B's content from Y into X
	once, offline (not at inference).
	For chess: LLM writes English definitions of themes → general English
	teacher can now embed them → distill into chess-specific model.

	6. Deterministic translation often suffices for the bridge. Don't pay LLM
	API costs if `python-chess` and regex can produce the same English text.
	Reserve LLMs for the parts that genuinely need understanding (concept
	definitions, paraphrases, strategic narratives).

	7. Compare your trained model against BM25 on the actual eval. If they
	tie, your model is doing keyword matching, not semantic work. Diagnostic
	in `scripts/diag_ce_vs_bm25.py`.

	8. Modal `.spawn()` only survives entrypoint exit on deployed apps. For
	ephemeral `modal run`, the app dies when entrypoint returns — including
	spawned calls. Use `.remote()` with `--detach`.

	9. Apple Silicon M4 is competitive with cloud A100 for tiny models. Token
	bag + small batch easily hits 17 it/s on MPS. GPU cost is wasted unless
	the model is compute-bound.
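
Learnings 2 and 4 each come down to one training argument. The fragment below is a configuration sketch rather than the repo's actual settings: the output path is hypothetical and every other argument is left at its default.

```python
from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs/chess-static",          # hypothetical path
    batch_sampler=BatchSamplers.BATCH_SAMPLER,  # learning 2: avoid the no-duplicates sampler
    report_to="none",                           # learning 4: disable experiment trackers
)
```

Note that giving up the no-duplicates sampler means a batch can contain repeated anchors, which in-batch-negative losses such as MNRL will treat as negatives of each other.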

	---

	## Reproducibility

	Clone this repo, then with sentence-transformers v5.5+:

	```bash
	# Inspect the recipe
	cat scripts/train_chess_multitask.py

	# Reproduce the data prep (one-time, ~10 min)
	python scripts/generate_theme_defs.py # Needs DeepSeek API key in macOS keychain
	python scripts/convert_to_english.py # python-chess + regex, $0
	python scripts/mine_hard_negs_v2.py # ~10 min on M4 MPS

	# Reproduce the winning training
	python scripts/train_chess_multitask.py # ~5 min on M4 MPS

	# Verify
	python scripts/compare_variants.py # Side-by-side eval table
	python scripts/diag_ce_vs_bm25.py # Is the rerank doing real work?
	```

	---

	## Limitations and honest caveats

	- NDCG@10 = 0.12 is modest in absolute terms. Industry retrieval encoders
	reach 0.4-0.6 on similar tasks. This model is competitive on size/speed,
	not absolute quality.
	- The two-stage architecture (NDCG@10 ≈ 0.6) is the production answer
	but relies on BM25 over English-converted docs, not on the cross-encoder.
	- Cross-encoder didn't add semantic value in our setup; results came from
	lexical match enabled by the English bridge.
	- Bimodal failure: even the best model misses half of queries entirely
	(median NDCG@10 = 0). The architecture has fundamental limits for chess
	reasoning.
	- English-pretrained models don't know chess. Tried MPNet, MiniLM,
	Jina-v5; all fail on UCI moves. Bigger English models won't fix this; only
	chess-pretrained or deterministic conversion helps.
	- No engine evaluation. "Is this puzzle a fork?" was determined by
	Lichess theme tags; we never ran a chess engine. A real production system
	would integrate Stockfish for ground-truth tactical pattern detection.

	---

	## What this is NOT

	- Not a chess engine. See [`thomasahle/fastchess`](https://github.com/thomasahle/fastchess)
	for FastText-based move prediction (closest related work).
	- Not a position similarity model. See `chess2vec` lineage on GitHub for
	position-level embeddings.
	- Not a state-of-the-art retrieval model. It's a tiny first-stage filter
	designed to pair with a reranker.

	---

	## License

	Apache 2.0 (model + scripts). Data derived from Lichess/chess-puzzles which is
	CC0 — derived parquets in this repo are also released under CC0.

	## Acknowledgments

	- [Lichess](https://lichess.org) for releasing puzzles + openings under CC0.
	- [Tom Aarsen](https://huggingface.co/tomaarsen) for the
	`train-sentence-transformers` skill and `StaticEmbedding` recipe.
	- DeepSeek for the v4-flash API used for theme definitions.

	## Citation

	If this work is useful, please link to this repo. The scientific findings
	(particularly the deterministic-bridge insight that BM25 over English-bridged
	docs matches a trained cross-encoder on this task) are the main contribution.