oneryalcin/static-embedding-chess

@@ -271,30 +271,53 @@ English-pretrained models on our training signal.
 ---
-## Production recommendation
-For a real chess search system, the winning architecture is:
 ```
-Stage 1: Static embedding (this model)
-  - Encode chess-format corpus (4M params, ~9MB)
-  - Sub-millisecond CPU inference
-  - Retrieve top-200 candidates via cosine similarity
-  - Recall@200 = 93.5%
-Stage 2: BM25 over English-bridged corpus
-  - python-chess + regex (one-time, $0)
-  - Index the English versions of all docs
-  - Rerank top-200 candidates to top-10
-  - NDCG@10 ≈ 0.55-0.62
-```
-**Total: <10ms/query, $0 inference cost, no GPU.**
-The cross-encoder is only worth adding if you have GPU available AND you train
-it on a fundamentally different signal (e.g., human-annotated relevance,
-chess-engine strategic descriptions, or much more parameters with chess in
-pretraining).
 ---

 ---
+## Production recommendation — and a surprising honest finding
+**The static embedding model is not needed for this task.** A direct comparison:
+| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
+|---|---|
+| Static (v4-C2) alone | 0.1202 |
+| BM25 alone over chess-format docs | 0.0107 |
+| **BM25 alone over English-bridged docs** | **1.0000** |
+| Static + BM25 RRF fusion | 0.4940 |
+**BM25 over deterministically-English-converted documents achieves PERFECT
+ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training.
+No GPU.
+Why: our queries are theme tokens (`fork endgame`), and the English-bridged
+docs explicitly contain those words (`"Short endgame puzzle with fork..."`).
+This is BM25's natural strength — keyword overlap detection. The static model
+labors to learn token-cluster mappings; BM25 just reads the words directly.
+### Actual production architecture (the simple answer)
+```python
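+import re
+from rank_bm25 import BM25Okapi
+def tokenize(text):
+    # Lowercase and drop punctuation so "Short" and "fork..." match query tokens
+    return re.findall(r"[a-z0-9]+", text.lower())
+# One-time: convert all puzzles to English (use scripts/convert_to_english.py)
+# `corpus` is the list of English-converted docs; `corpus_ids` is the parallel list of puzzle IDs
+# Build BM25 index over the English-converted corpus
+bm25 = BM25Okapi([tokenize(english_doc) for english_doc in corpus])
+# Query
+query = "fork endgame short"  # or any theme combo / opening name
+# get_top_n returns the entries of corpus_ids for the 10 best-scoring docs
+top_ids = bm25.get_top_n(tokenize(query), corpus_ids, n=10)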
 ```
+**Total: <10ms/query, $0 cost, no model, no GPU, no training.**
+### When the static embedding would actually help
+1. **Natural-language paraphrased queries**: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.**
+2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers.
+3. **Very large corpora**: where BM25 index size becomes an issue, embeddings may be more storage-efficient per doc.
+For our actual eval setup (theme-token queries on Lichess puzzles), the static
+model scores roughly 8× lower than BM25 over English-bridged docs (0.1202 vs
+1.0000 NDCG@10). The static training exercise produced valuable methodology
+insights (especially the LLM-bridge pattern) but was the wrong tool for the
+actual production problem.
 ---
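
The diff does not show how the `Static + BM25 RRF fusion` row or the NDCG@10 scores in the comparison table were computed. The following is only a minimal sketch of the two standard techniques, not the repository's evaluation code. Everything in it is an assumption made for illustration: the function names `rrf_fuse` and `ndcg_at_k`, the RRF constant `k=60`, the linear-gain form of DCG, and the toy puzzle IDs and relevance labels.

```python
import math
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank_of_d)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the order is deterministic
    return sorted(scores, key=lambda d: (-scores[d], d))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k with linear gain; `relevance` maps doc id -> graded relevance (missing = 0)."""
    def dcg(gains):
        # Position i (0-based) is discounted by log2(i + 2)
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    gains = [relevance.get(d, 0.0) for d in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy data: two rankings of four made-up puzzle IDs for one query
static_ranking = ["p3", "p1", "p2", "p4"]
bm25_ranking = ["p1", "p2", "p3", "p4"]
relevance = {"p1": 1.0, "p2": 1.0}  # assumed ground-truth labels

fused = rrf_fuse([static_ranking, bm25_ranking])
print("fused order:", fused)
print("BM25 NDCG@10:", ndcg_at_k(bm25_ranking, relevance))
print("fused NDCG@10:", ndcg_at_k(fused, relevance))
```

In this toy case the BM25 ranking is already ideal, and fusing it with a weaker ranking pulls an irrelevant document up the list. That is the same qualitative effect as in the table, where fusion scores below BM25 over English-bridged docs.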