Sentence Similarity
sentence-transformers
Safetensors
English
static-embedding
chess
retrieval
exploratory
Upload README.md with huggingface_hub
README.md
CHANGED
@@ -271,30 +271,53 @@ English-pretrained models on our training signal.

 ---

-## Production recommendation

-

 ```
-Stage 1: Static embedding (this model)
-  - Encode chess-format corpus (4M params, ~9MB)
-  - Sub-millisecond CPU inference
-  - Retrieve top-200 candidates via cosine similarity
-  - Recall@200 = 93.5%
-
-Stage 2: BM25 over English-bridged corpus
-  - python-chess + regex (one-time, $0)
-  - Index the English versions of all docs
-  - Rerank top-200 candidates to top-10
-  - NDCG@10 ≈ 0.55-0.62
-```

-**Total: <10ms/query, $0

-
-
-
-

 ---

 ---

+## Production recommendation — and a surprising honest finding

+**The static embedding model is not needed for this task.** A direct comparison:

+| Approach | NDCG@10 (200 unseen-combo queries × 600 docs) |
+|---|---|
+| Static (v4-C2) alone | 0.1202 |
+| BM25 alone over chess-format docs | 0.0107 |
+| **BM25 alone over English-bridged docs** | **1.0000** |
+| Static + BM25 RRF fusion | 0.4940 |
+
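For readers comparing the rows above, here is a minimal sketch of how an NDCG@10 value is computed. It assumes binary relevance labels and the usual log2 discount; the README does not state how relevance was graded, so `dcg_at_k`, `ndcg_at_k`, and the toy rankings are illustrative and not the project's evaluation code.

```python
import math

def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked items."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical rankings: 1 = relevant doc, 0 = irrelevant doc.
perfect = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]  # every relevant doc ranked first
mixed = [0, 1, 0, 0, 1, 0, 0, 0, 1, 0]    # relevant docs scattered down the list

print(ndcg_at_k(perfect))  # 1.0
print(ndcg_at_k(mixed))
```

A score of 1.0000 therefore means every query's relevant documents were ranked ahead of all irrelevant ones within the top 10.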
+**BM25 over deterministically-English-converted documents achieves PERFECT
+ranking (1.0000 NDCG@10) on this eval.** No embedding model needed. No training.
+No GPU.
+
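One way to see why the fusion row can score below BM25 alone: rank fusion averages a strong ranker with a weak one. The sketch below assumes "RRF" means reciprocal rank fusion with the commonly used constant `k=60`; the README gives neither the formula nor the constant, and the document ids and rankings are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids; each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings where "d1" is the only relevant doc.
bm25_ranking = ["d1", "d2", "d3", "d4", "d5"]    # strong ranker puts d1 first
static_ranking = ["d3", "d4", "d5", "d2", "d1"]  # weak ranker puts d1 last

fused = rrf_fuse([bm25_ranking, static_ranking])
print(fused)
```

In this toy case the relevant document falls from first place in the BM25 list to second place after fusion, which is the kind of dilution that would lower NDCG@10.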
+Why: our queries are theme tokens (`fork endgame`), and the English-bridged
+docs explicitly contain those words (`"Short endgame puzzle with fork..."`).
+This is BM25's natural strength — keyword overlap detection. The static model
+labors to learn token-cluster mappings; BM25 just reads the words directly.
+
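The keyword-overlap argument can be illustrated with a small, self-contained BM25 scorer. This is a simplified Okapi-style sketch written for illustration, not `rank_bm25`'s exact formula. The documents are hypothetical, and the chess-format strings (FEN-like fragments plus UCI moves) are an assumption about what such a corpus looks like.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized doc against the query with an Okapi-style BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    doc_freq = Counter(tok for d in docs_tokens for tok in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for tok in query_tokens:
            if tok not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[tok] + 0.5) / (doc_freq[tok] + 0.5) + 1.0)
            norm = tf[tok] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[tok] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical corpora for the same two puzzles.
english_docs = ["short endgame puzzle with fork", "long middlegame puzzle with pin"]
chess_docs = ["8/5pk1/6p1/8 w - - e4e5 g1f3", "r1bqkb1r/pppp1ppp b kq - d7d5 c4d5"]

query = "fork endgame".split()
english_scores = bm25_scores(query, [d.split() for d in english_docs])
chess_scores = bm25_scores(query, [d.split() for d in chess_docs])

print(english_scores)  # first doc scores above zero, second scores 0.0
print(chess_scores)    # [0.0, 0.0]
```

With no shared tokens, every chess-format document ties at zero, which is consistent with the near-zero NDCG@10 reported for BM25 over chess-format docs.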
+### Actual production architecture (the simple answer)
+
+```python
+import chess, re
+from rank_bm25 import BM25Okapi
+
+# One-time: convert all puzzles to English (use scripts/convert_to_english.py)
+# Build BM25 index over the English-converted corpus
+bm25 = BM25Okapi([english_doc.split() for english_doc in corpus])
+
+# Query
+query = "fork endgame short"  # or any theme combo / opening name
+top_indices = bm25.get_top_n(query.split(), corpus_ids, n=10)
 ```

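One detail worth checking in the snippet above: `str.split()` is case- and punctuation-sensitive, and BM25 matches tokens exactly. With the README's own example document, `"Short"` and `"fork..."` would not match the query tokens `short` and `fork`. The sketch below shows a normalizing tokenizer; `tokenize` is a hypothetical helper, and whether `scripts/convert_to_english.py` already lowercases its output is not stated.

```python
import re

def tokenize(text):
    """Lowercase and keep alphanumeric runs so 'Short' and 'fork...' match query terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

doc = "Short endgame puzzle with fork..."
query = "fork endgame short"

# Overlap with plain whitespace splitting versus normalized tokens.
naive_overlap = set(query.split()) & set(doc.split())
normalized_overlap = set(tokenize(query)) & set(tokenize(doc))

print(sorted(naive_overlap))       # ['endgame']
print(sorted(normalized_overlap))  # ['endgame', 'fork', 'short']
```

If such a helper is used, it would need to be applied to both the corpus passed to `BM25Okapi` and the query passed to `get_top_n`.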
+**Total: <10ms/query, $0 cost, no model, no GPU, no training.**
+
+### When the static embedding would actually help
+
+1. **Natural-language paraphrased queries**: user types `"two-piece tactical in late game"` instead of `"fork endgame"`. BM25 wouldn't match those words. Static (trained with paraphrase augmentation) could match via learned semantic similarity. **We never tested this.**
+2. **Cross-lingual queries**: BM25 needs exact lexical overlap; embeddings can cross language barriers.
+3. **Very large corpora** where BM25 index size becomes an issue, embeddings are more storage-efficient per doc.

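The first case in the list above can be made concrete with a vocabulary check: a paraphrased query that shares no token with the indexed documents gives BM25 nothing to score. The helper name `vocabulary_hits` and the one-document vocabulary are hypothetical, chosen only to mirror the README's example strings.

```python
def vocabulary_hits(query, vocabulary):
    """Return the query tokens that appear in the indexed vocabulary."""
    return [tok for tok in query.lower().split() if tok in vocabulary]

# Hypothetical vocabulary built from one English-bridged doc.
vocabulary = set("short endgame puzzle with fork".split())

theme_hits = vocabulary_hits("fork endgame", vocabulary)
paraphrase_hits = vocabulary_hits("two-piece tactical in late game", vocabulary)

print(theme_hits)       # ['fork', 'endgame']
print(paraphrase_hits)  # []
```

An empty hit list is the situation where an embedding model could in principle help, although, as the README notes, that scenario was never tested.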
+For our actual eval setup (theme-token queries on Lichess puzzles), the static
+model loses by 8× to BM25-over-English-bridged. The static training exercise
+produced valuable methodology insights (especially the LLM-bridge pattern) but
+was the wrong tool for the actual production problem.

 ---