gemma-4-E2B-it solitaire-advisor LoRA

A LoRA adapter that distils a 31B Gemma Klondike Solitaire advisor into the ~2B-effective Gemma 4 E2B text model (4-bit, MLX), runnable locally on a 16 GB Apple Silicon Mac.

This is the Gemma 4 E2B successor to chayuto/gemma-3n-e2b-it-solitaire-advisor-lora. That earlier card promised a Gemma 4 E2B variant "once mlx-lm ships the missing architecture support". The base is now loadable via a small local sanitize() patch (see Usage), so this is that promised release: same project, same teacher, same data pipeline.

It is the project's strongest student and its current lead. Under a trusted full-game evaluation it beats the untuned base both on held-out decks and on fresh, never-seen decks (so it generalizes), not just on a single-turn bench. Headline numbers, all adjudicated by exact engine replay plus a sound solver:

In-distribution (13 held-out solver-winnable decks): 5 wins vs the base's 1, mean foundation cards 27.7 vs 14.2.
Out-of-distribution (12 fresh, solver-winnable decks with zero corpus overlap): 5 wins vs the base's 1, +12.9 mean paired foundation-card delta, better than base on 9 of 12 decks.

The teacher is gemma-4-31b-it (Google's Gemma 4 31B, accessed through a separate harvester app). The teacher itself wins roughly 31% of games, so this adapter's ceiling is teacher-level imitation, not optimal play.

Model details


Base model	`mlx-community/Gemma4-E2B-IT-Text-int4` (Gemma 4 E2B IT, text-only, int4)
Adapter type	LoRA over a 4-bit-quantised base
LoRA rank	16 (scale 2.0, dropout 0.05)
LoRA target modules	`self_attn.{q,k,v,o}_proj`, `mlp.{gate,up,down}_proj`
LoRA layers	top 16 layers
Training framework	`mlx-lm`
Hardware	Apple M5, 16 GB unified memory (Metal GPU)
Adapter size on disk	~51 MB per checkpoint (bf16 LoRA weights over the 4-bit base)
Iterations trained	1,000 (shipped weights = iter 1,000)
Quantisation	base remains int4; LoRA weights bfloat16
Decoding used for eval	greedy

The shipped adapters.safetensors is byte-identical to checkpoints/0001000_adapters.safetensors (iter 1,000), selected over the earlier checkpoints by the checkpoint-selection pass (below).
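The byte-identity claim can be checked locally. The following is a minimal sketch. It assumes you have downloaded both files, including the checkpoints/ folder, into the current directory under their repo-relative paths. It streams each file through SHA-256 from Python's standard library:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: Path, b: Path) -> bool:
    """True when two files have identical contents (compared via SHA-256)."""
    return sha256_of(a) == sha256_of(b)
```

Usage: `same_bytes(Path("adapters.safetensors"), Path("checkpoints/0001000_adapters.safetensors"))` should return True if the shipped weights are the iter-1,000 checkpoint.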

Intended use

In scope. Acting as a move-selection advisor inside a Klondike Solitaire client that already enforces game rules:

Imperfect-information draw-1 Klondike (one card flipped from stock per draw); the advisor is shown the full visible state plus the count of face-down cards.
Per-turn decisions: given the harvester prompt (with a LEGAL MOVES block), emit a single JSON object choosing one offered legal move by index, or move_index: -1 to resign.
Local inference on Apple Silicon via mlx-lm.

Out of scope.

Open-ended chat or general text generation. The model is fine-tuned to a narrow JSON-emitting role and is expected to be worse than the base at unrelated tasks.
Game-rule enforcement. It selects from a legalMoves array supplied in the prompt; it does not verify legality from first principles.
Optimal play. The teacher wins ~31% of games; this adapter inherits that ceiling.
Other Solitaire variants (Spider, FreeCell), which are out of distribution.

Evaluation

Method

Unlike the 3n predecessor, which was scored on a 20-state single-turn tier bench, this adapter is evaluated on full games, the eval the project now trusts. Each game is played turn-by-turn on a fixed deck under the faithful production prompt (hybrid-v1.6). Decoding is greedy, games are capped at 200 turns, and a tiered JSON parse-rescue (temp-0.3 retry) matches production. Every finished or resigned game is then exact-adjudicated. The recorded decisions are replayed through the engine with zero drift. The final or resign position is then handed to a sound best-first solver, which returns SOLVED, UNSOLVABLE, or UNKNOWN (at a 300k-node cap). This distinguishes a real win from a cap-truncated one, and a correct resign on a dead board from a quit on a winnable one. The two eval sets:

In-distribution: 13 winnable decks held out by seed from the harvester pool that the training data was drawn from.
Generalization: 12 freshly dealt, solver-confirmed-winnable decks (drawn from seeds 9000002..9000026) with verified zero overlap with the benchmark or any training corpus, so they are guaranteed unseen by teacher and student.

In the tables below, meanFC is the mean number of cards on the foundations (out of 52) at game end across the deck set; 52 is a win.
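The tiered parse-rescue in the method above (greedy first, then a temp-0.3 retry) can be sketched as follows. The card does not describe the production tiers beyond the temp-0.3 retry, so the retry limit and the brace-slicing extractor are assumptions. `generate_fn` is a hypothetical callable that takes a temperature and returns the model's text, for example a thin wrapper around mlx-lm's `generate`:

```python
import json
from typing import Callable, Optional


def extract_json(text: str) -> Optional[dict]:
    """Parse the outermost {...} span of a completion; None if it is not a JSON object."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def decide_with_rescue(
    generate_fn: Callable[[float], str],
    max_retries: int = 2,      # assumption: the card does not state a retry limit
    rescue_temp: float = 0.3,  # the temp-0.3 retry described above
) -> tuple[Optional[dict], int]:
    """Greedy first; on a parse failure, resample at rescue_temp.

    Returns (parsed decision or None, number of temperature rescues used).
    """
    parsed = extract_json(generate_fn(0.0))  # tier 0: greedy decode
    rescues = 0
    while parsed is None and rescues < max_retries:
        rescues += 1
        parsed = extract_json(generate_fn(rescue_temp))  # tier 1: sampled retry
    return parsed, rescues
```

Summing the second return value over a deck set is one way to produce a count like the "temp parse-rescues" column in the checkpoint table.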

In-distribution (13 held-out winnable decks)

arm	corpus	meanFC	wins	resigns
base (untuned)	none	14.2	1	0
gate	2,492 rows, 100% won	27.5	4	0
allsucc	2,500 rows, 38% won	25.5	3	1
volume (this adapter)	6,859 rows, 36% won	27.7	5	1

This established two facts. First, training beats the untuned base under trusted eval, the first time that has happened in the project. Second, the lever is data volume, not the won-only filter. The won-only gate arm and the natural-mix allsucc arm are roughly tied at matched size (4 vs 3 wins, 27.5 vs 25.5 meanFC), and the full-volume arm here is the strongest student, so "collect and train on more data" is the validated recipe. Volume's single in-distribution resign (#4221577640) came on a deal that was winnable from the start. Whether its board was already dead at the moment of resignation is UNKNOWN at the 300k-node cap, so that resign is claimed neither correct nor incorrect.

Generalization (12 fresh, never-seen, solver-winnable decks)

The decisive question was whether the student learned to play Klondike or memorized the harvester's deck distribution. On fresh decks with zero corpus overlap, paired against base:

arm	wins/12	meanFC	mean paired delta vs base	better than base
base	1	15.6	---	---
gate (won-only)	3	21.2	+5.6	7/12
volume (this adapter)	5	28.5	+12.9	9/12

Verdict: GENERALIZES, decisively. Both trained arms show a positive paired delta and win multiple fresh, never-seen decks that base cannot. The student learned to play; it did not memorize the deck pool. Win counts are exact (no cap-truncated win above fc=40). Read the paired delta, not absolute fc. The fresh set is biased toward easy deals (only deals the solver cracked under a 200k-node cap were kept), which lifts both arms and compresses the gap.
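The summary columns in the generalization table can be computed from per-deck final foundation counts. The following is a minimal sketch. It assumes each arm's results are available as a mapping from deck seed to final fc. That input format is hypothetical, and the toy numbers in the usage are not project data:

```python
from statistics import mean


def paired_summary(base_fc: dict[int, int], arm_fc: dict[int, int]) -> dict:
    """Summarise an arm against base on the same decks (keyed by deck seed).

    fc is foundation cards at game end (0..52); 52 counts as a win.
    """
    seeds = sorted(base_fc.keys() & arm_fc.keys())  # pair only decks both arms played
    deltas = [arm_fc[s] - base_fc[s] for s in seeds]
    return {
        "n": len(seeds),
        "arm_wins": sum(arm_fc[s] == 52 for s in seeds),
        "arm_meanFC": mean(arm_fc[s] for s in seeds),
        "base_meanFC": mean(base_fc[s] for s in seeds),
        "mean_paired_delta": mean(deltas),
        "better_than_base": sum(d > 0 for d in deltas),
    }
```

When both arms cover every deck, the mean paired delta equals the difference of the two meanFC values (28.5 - 15.6 = 12.9 in the table above). Pairing adds the per-deck sign, which is what the "better than base" count reports.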

Resigns on the fresh set are informative because every fresh deck is winnable. Volume resigned twice, and both resigns were adjudicated correct. On #9000021 (fc18/fd3) and #9000024 (fc11/fd12), the solver returned UNSOLVABLE for the position at resign. Both boards were already structurally dead at the moment of resignation, so each resign saved flailing to the 200-turn cap rather than throwing away a winnable game. Where there is an error, it lies in earlier play, not in the resign itself. Resign correctness is judged on the board at resign, not the deck at deal.
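The resign rule above can be written as a small classifier. The following is a minimal sketch. The three verdict names come from the solver description in Method, and the output labels are illustrative:

```python
from enum import Enum


class Verdict(Enum):
    SOLVED = "SOLVED"          # solver found a win from this position
    UNSOLVABLE = "UNSOLVABLE"  # solver proved no win exists
    UNKNOWN = "UNKNOWN"        # node cap hit before a proof either way


def classify_resign(verdict_at_resign: Verdict) -> str:
    """Label a resign from the solver verdict on the board at resign, not at deal."""
    if verdict_at_resign is Verdict.UNSOLVABLE:
        return "correct"           # board already dead; resigning saved a flail
    if verdict_at_resign is Verdict.SOLVED:
        return "quit_on_winnable"  # a win was still available when the model quit
    return "unclassified"          # e.g. #4221577640 at the 300k-node cap
```

Both fresh-set resigns fall in the first branch. The in-distribution resign falls in the last.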

Checkpoint selection (why iter 1,000)

Quality across the four saved checkpoints is non-monotonic, so the final checkpoint is not automatically the best; it was selected on the 13 held-out decks. JSON discipline is measured as temp-0.3 parse-rescues (lower is cleaner):

checkpoint	wins	meanFC	temp parse-rescues
iter 250	5	31.7	102
iter 500	2	21.3	126
iter 1,000 (shipped)	5	27.7	34

Iter 1,000 keeps the win count of the best early checkpoint while being roughly 3x cleaner on JSON, and iter 500 is a deep trough. The remaining JSON-discipline gap is best closed by constrained decoding at inference, not by an earlier checkpoint (see Limitations). The 250/500/750 checkpoints are included under checkpoints/ for reproducibility of this table.
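The choice can be reproduced with a simple two-key rule: most wins first, then fewest parse-rescues. This is a hypothetical reconstruction that fits the table, not a documented selection procedure. The numbers are copied from the table above:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointResult:
    iteration: int
    wins: int
    mean_fc: float
    parse_rescues: int  # temp-0.3 rescues on the 13 held-out decks (lower is cleaner)


RESULTS = [
    CheckpointResult(250, 5, 31.7, 102),
    CheckpointResult(500, 2, 21.3, 126),
    CheckpointResult(1000, 5, 27.7, 34),
]


def select_checkpoint(results: list[CheckpointResult]) -> CheckpointResult:
    """Most wins first; among ties, the cleanest JSON (fewest rescues)."""
    return max(results, key=lambda r: (r.wins, -r.parse_rescues))
```

Under this rule iter 1,000 wins the tie with iter 250 despite its lower meanFC, and 102 / 34 gives the "roughly 3x cleaner" figure.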

Note on comparing to the 3n predecessor

The 3n adapter's headline numbers come from a 20-state single-turn tier bench; this adapter's come from full-game play with exact adjudication. They are not directly comparable. The shift to full-game adjudicated eval is deliberate: in this project, raw harness scores have repeatedly disagreed with what an exact replay plus a sound solver show, so single-turn or unadjudicated numbers are treated as unreliable. The full-game result here is the stronger and more honest claim.

Training data

Source. Per-decision play logs from a Klondike Solitaire client where the 31B gemma-4-31b-it teacher chose moves turn-by-turn, published as the chayuto/klondike-llm-decisions dataset (CC-BY-4.0). Logs span multiple app builds and prompt-template versions.

Selection (this adapter, the "volume" arm). The entire non-eval success pool: 6,859 decisions across 77 games (36% of which were won), with the 13 eval seeds held out. Split at the game level into 5,663 train / 531 validation / 665 test. At iters=1,000 with batch size 1 the model sees ~1,000 examples (~0.18 epoch), so this arm tests "more unique, diverse data at a fixed gradient budget", and it won: full volume beat a matched 2,500-row arm on wins (5 vs 3).
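A game-level split keeps all turns of one game in the same partition, so the validation and test splits never contain positions from a game the model trained on. The following is a minimal sketch. The `game_id` field name, the 80/10/10 game fractions, and the seed are assumptions, not the project's settings. The epoch arithmetic uses the card's own numbers:

```python
import random
from collections import defaultdict


def split_by_game(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign whole games to train/valid/test so no game's turns leak across splits.

    rows: iterable of dicts with a "game_id" key (hypothetical field name).
    """
    by_game = defaultdict(list)
    for row in rows:
        by_game[row["game_id"]].append(row)
    game_ids = sorted(by_game)
    random.Random(seed).shuffle(game_ids)  # deterministic shuffle of games, not rows
    n_train = round(len(game_ids) * fractions[0])
    n_valid = round(len(game_ids) * fractions[1])
    buckets = (game_ids[:n_train],
               game_ids[n_train:n_train + n_valid],
               game_ids[n_train + n_valid:])
    return tuple([r for g in ids for r in by_game[g]] for ids in buckets)


def epochs_seen(iters: int, batch_size: int, n_train: int) -> float:
    """Fraction of the training split visited at a fixed gradient budget."""
    return iters * batch_size / n_train
```

With the card's numbers, `epochs_seen(1000, 1, 5663)` is about 0.177, which is the ~0.18 epoch quoted above.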

Known data-quality issues the adapter inherits:

Mixed prompt-template formats across the corpus (v1.0 through v1.6); eval is on the v1.6 production template only, so cross-template generalization is a confound that the held-out and fresh-deck results only partially control for.
Lost-game trajectories in the corpus teach a resign capability. On the evidence above this is net useful (correct resigns on dead boards), but it is a behaviour the won-only gate arm does not have.
The teacher itself wins ~31% of games, so imitation targets are imperfect.

Training procedure

```yaml
model: mlx-community/Gemma4-E2B-IT-Text-int4
max_seq_length: 2048
batch_size: 1
num_layers: 16
grad_checkpoint: true
learning_rate: 2.0e-4
iters: 1000
save_every: 250
val_batches: 25

lora_parameters:
  rank: 16
  scale: 2.0
  dropout: 0.05
  keys:
    - self_attn.q_proj
    - self_attn.k_proj
    - self_attn.v_proj
    - self_attn.o_proj
    - mlp.gate_proj
    - mlp.up_proj
    - mlp.down_proj
```

Hyperparameters are identical to the project's other Gemma 4 E2B arms (gate, allsucc, v2, v5), so arm-to-arm differences are attributable to the corpus, not the optimiser.
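For readers unfamiliar with the `lora_parameters` block, the following NumPy sketch shows what one adapted projection computes. As we understand mlx-lm's LoRA layer, `scale` multiplies the low-rank update directly rather than being divided by the rank, and the B factor starts at zero. Treat that as an assumption to check against the mlx-lm source. The toy dimensions in the test are arbitrary:

```python
import numpy as np


def lora_linear(x, w, a, b, scale=2.0):
    """Frozen linear layer plus a scaled low-rank update: x W^T + scale * (x A) B.

    x: (batch, in_dim); w: (out_dim, in_dim), frozen (int4 in the real base);
    a: (in_dim, rank) and b: (rank, out_dim) are the trainable LoRA factors.
    Dropout on x (0.05 during training) is omitted for clarity.
    """
    return x @ w.T + scale * ((x @ a) @ b)


def lora_param_count(in_dim, out_dim, rank=16):
    """Trainable parameters one adapted matrix adds: rank * (in_dim + out_dim)."""
    return rank * (in_dim + out_dim)
```

Because only `a` and `b` are trained, the adapter file stays small relative to the base. Starting `b` at zero makes the adapted model identical to the base at iteration 0.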

Usage

Install

```bash
# Apple Silicon, Python 3.12 venv recommended
python3.12 -m venv venv && source venv/bin/activate
pip install mlx mlx-lm huggingface-hub
```

Loading the base (important)

The Gemma4-E2B-IT-Text-int4 base needs a small sanitize() patch to load on current mlx-lm (it strips the alternating-attention k_norm/k_proj/v_proj keys that the loader does not yet implement). The 6-line patch used here is gemma4_finetune/gemma4_text_patch.py in the source repo (github.com/chayuto/solitaire-analytics); apply it (or use an mlx-lm version that has since merged the support) before loading. Peak memory to load the patched base is ~2.7 GB.
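The following generic sketch illustrates a weight-dict filter of the kind a sanitize() hook performs. mlx-lm model classes commonly post-process checkpoint weights this way before loading. This sketch is not the repo's gemma4_text_patch.py. The `drop_layers` argument is an assumption standing in for whatever alternating-attention rule the real patch encodes, and the key layout in the test is illustrative:

```python
import re

# Module names taken from the description above; which layers to strip is an input.
DROP_MODULES = ("self_attn.k_norm", "self_attn.k_proj", "self_attn.v_proj")
LAYER_RE = re.compile(r"layers\.(\d+)\.")


def sanitize_weights(weights: dict, drop_layers: set[int], modules=DROP_MODULES) -> dict:
    """Return a copy of `weights` without the listed modules on `drop_layers`."""
    kept = {}
    for key, value in weights.items():
        m = LAYER_RE.search(key)
        in_drop_layer = m is not None and int(m.group(1)) in drop_layers
        if in_drop_layer and any(f".{mod}." in key for mod in modules):
            continue  # also catches int4 .scales/.biases of the same module
        kept[key] = value
    return kept
```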

Quick start

```python
from pathlib import Path

from huggingface_hub import snapshot_download
from mlx_lm import load, generate
# import gemma4_text_patch  # apply the base-loading patch first (see above)

adapter_path = snapshot_download(
    repo_id="chayuto/gemma-4-e2b-it-solitaire-advisor-lora",
    allow_patterns=["adapters.safetensors", "adapter_config.json"],
)

model, tokenizer = load(
    "mlx-community/Gemma4-E2B-IT-Text-int4",
    adapter_path=adapter_path,
)

# The v1.6 harvester prompt, including its LEGAL MOVES block.
solitaire_prompt = Path("your_solitaire_prompt.txt").read_text()
wrapped = tokenizer.apply_chat_template(
    [{"role": "user", "content": solitaire_prompt}],
    tokenize=False, add_generation_prompt=True,
)
# With no sampler supplied, mlx-lm decodes greedily, as in the eval.
print(generate(model, tokenizer, prompt=wrapped, max_tokens=512))
```

The model emits a JSON object whose final_decision.move_index is a 0-based index into the prompt's legalMoves array (or -1 to resign). Trust move_index, not the prose. The exact prompt renderer, the play harness (tournament_A.py, play_deck_with_student.py), and the adjudicator (adjudicate_final_position.py) that produced every number in this card are in the source repo.
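Because only move_index should be trusted, a client needs a small validator between the model and the game. The following is a minimal sketch. It assumes the completion contains one JSON object with a final_decision.move_index field as described above. The brace-slicing extraction is an illustrative choice, not the project's harness. So is the None-on-failure convention, which lets the caller fall back to a parse-rescue or a default move:

```python
import json
from typing import Optional

RESIGN = -1


def parse_move_index(completion: str, n_legal_moves: int) -> Optional[int]:
    """Return final_decision.move_index if it is a valid choice, else None.

    Valid means -1 (resign) or a 0-based index into the prompt's legalMoves array.
    The prose fields of the JSON are ignored on purpose.
    """
    start, end = completion.find("{"), completion.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(completion[start:end + 1])
        idx = obj["final_decision"]["move_index"]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if isinstance(idx, bool) or not isinstance(idx, int):
        return None  # reject true/false and non-integers
    if idx == RESIGN or 0 <= idx < n_legal_moves:
        return idx
    return None
```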

Limitations

Small n. 13 in-distribution and 12 fresh decks, with high per-deck variance. Read the paired deltas; do not over-interpret a single deck.
Easy-deck bias on the fresh set. Only deals the solver cracked under a 200k-node cap were kept, so harder-but-winnable deals are under-represented.
JSON discipline is imperfect. The eval used a tiered parse-rescue (temp-0.3 retry); raw greedy output is not 100% valid JSON. For production, use constrained decoding or a JSON grammar at inference. This is a known gap, not yet fixed in the weights.
Teacher ceiling. Trained to imitate a teacher that wins ~31% of games; this is an advisor, not a solver, for imperfect-information draw-1 only.
Greedy eval only. Behaviour under temperature sampling is untested.
Base-loading patch required. See Usage; the int4 Gemma 4 E2B text base does not load on stock mlx-lm without the sanitize() patch.
Apple-Silicon / MLX only. CUDA/CPU inference via transformers/PEFT is not validated here.
Loads only onto this exact base. The LoRA was trained against mlx-community/Gemma4-E2B-IT-Text-int4; applying it to a different quant or the full-precision base is unvalidated.

License

The adapter is released under the Gemma Terms of Use (inherited from the base model). Use, redistribution, and modification require compliance with the Gemma Prohibited Use Policy. The training and evaluation code in the source repository is MIT. The training data (chayuto/klondike-llm-decisions) is CC-BY-4.0.

Citation

```bibtex
@misc{orapinpatipat2026solitaireadvisorgemma4,
  title  = {Distilling a 31B Klondike Solitaire advisor into Gemma 4 E2B via MLX LoRA},
  author = {Orapinpatipat, Chayut},
  year   = {2026},
  month  = jun,
  howpublished = {\url{https://huggingface.co/chayuto/gemma-4-e2b-it-solitaire-advisor-lora}},
  note   = {LoRA adapter; volume arm, iter-1000 checkpoint},
}
```

Acknowledgements

Base model mlx-community/Gemma4-E2B-IT-Text-int4 from the mlx-community team.
Training framework mlx-lm from Apple Machine Learning Research.
Teacher model gemma-4-31b-it from Google DeepMind.

Project status

This is the Gemma 4 E2B milestone of an ongoing project (solitaire-analytics): the strongest student so far, beating the untuned base under trusted full-game eval and generalizing to fresh, unseen decks. It supersedes the gemma-3n adapter as the project's lead student; the 3n repo remains as the v1 baseline.

Next directions: constrained decoding for JSON robustness, a resign-calibration pass (loop-compressed training reaches winnable endgames but can resign too early), and solver-as-teacher trajectories to break the ~31% teacher imitation ceiling.
