zeb-42: Neural Network for Texas 42 Dominoes
A 557K-parameter transformer trained entirely through AlphaZero-style self-play with MCTS to play Texas 42, a 4-player partnership trick-taking game played with dominoes, often called "the national game of Texas."
This is a hobby/research project. The model is not state-of-the-art; it is an ongoing experiment in applying self-play RL to a traditional game with imperfect information.
Evaluation Results
All evaluations use 1,000 games with temperature=0.1 (near-greedy play). Bidding is skipped (random declaration) to isolate trick-play skill.
Theoretical ceiling: The E[Q] oracle (our strongest player) achieves 73.2% vs random; this is the practical ceiling for a single strong player carrying a random partner. Zeb at 69–71% is within 2–4 points of that ceiling.
| Matchup | Win Rate | Avg Point Margin |
|---|---|---|
| E[Q] oracle vs Random | 73.2% | – |
| Zeb vs Random | 69.1% | +10.29 |
| Zeb vs Heuristic | 60.8% | +6.18 |
| Heuristic vs Random | 53.2% | +1.58 |
Zeb's edge over the heuristic (+7.6% absolute, +6.18 margin) is more than double the heuristic's edge over random (+3.2%, +1.58 margin).
Heuristic baseline: Lead highest trump (or highest domino); follow with lowest-point domino when losing.
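That baseline can be sketched in a few lines. This is a hypothetical reconstruction from the one-line description above; the real heuristic's tie-breaking, suit-following, and "when losing" logic are not specified in this card, and `heuristic_play` and its parameters are assumed names:

```python
def heuristic_play(legal_plays, trump, leading):
    """Baseline sketch: lead the highest trump (or highest domino);
    otherwise dump the lowest-point domino.

    legal_plays: list of (high, low) pip tuples (assumed representation).
    trump: trump pip value, or None if no trump applies.
    leading: True when this player leads the trick.
    """
    def is_trump(d):
        return trump is not None and trump in d

    def pip_total(d):
        return d[0] + d[1]

    def count_value(d):
        s = pip_total(d)
        return s if s in (5, 10) else 0  # counting dominoes are worth 5 or 10

    if leading:
        trumps = [d for d in legal_plays if is_trump(d)]
        pool = trumps if trumps else legal_plays
        return max(pool, key=pip_total)   # lead the biggest available
    return min(legal_plays, key=count_value)  # give away the fewest points

# Leading with twos as trump: the only trump (2,2) is led.
lead = heuristic_play([(6, 4), (5, 0), (2, 2)], trump=2, leading=True)
# Following: (3,3) carries no count points, so it is dumped.
follow = heuristic_play([(6, 4), (5, 0), (3, 3)], trump=None, leading=False)
```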
vs E[Q] Oracle (head-to-head)
The E[Q] player uses a 3.3M-parameter oracle model (trained on perfect-play minimax solutions) and samples N hypothetical worlds per decision to handle hidden information. This is a fundamentally different, and computationally expensive, approach.
| E[Q] Samples | E[Q] Win Rate | Avg Margin | Speed |
|---|---|---|---|
| N=10 | 60.9% | +6.3 | 12.7 g/s |
| N=50 | 58.8% | +6.4 | 8.8 g/s |
| N=100 | 58.7% | +6.5 | 5.5 g/s |
E[Q] consistently beats Zeb ~59% of the time, regardless of sample count; the oracle's access to perfect-play solutions matters more than sampling precision. But the 557K self-play model holds its own against a roughly 6× larger oracle-distilled model backed by world sampling, despite having learned everything from scratch.
Model Details
| Property | Value |
|---|---|
| Architecture | Transformer encoder (pre-LN) + policy/value heads |
| Parameters | 556,970 |
| Embedding dim | 128 |
| Attention heads | 4 |
| Transformer layers | 4 |
| Feedforward dim | 256 |
| Framework | PyTorch |
Architecture
```
Input tokens [batch, seq, 8]
        │
        ▼
Per-feature embeddings (8 learned tables) → concat → Linear → 128d
        │
        ▼
+ Learned positional embeddings (up to 36 positions)
        │
        ▼
4× TransformerEncoderLayer (pre-LN, 4 heads, FF=256, dropout=0.1)
        │
        ├──▶ Policy head: gather 7 hand-slot embeddings → Linear(128,1) → logits
        │
        └──▶ Value head: mean-pool valid tokens → Linear(128,64) → GELU → Linear(64,1) → Tanh
```
Policy output: 7 logits (one per hand slot), masked for legality, then softmax-sampled. Value output: a scalar in [-1, 1] predicting the game outcome from the current player's team perspective.
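The mask-then-sample step can be illustrated as follows. This is a minimal sketch, not the project's actual API; `select_action` and the tensor shapes are assumptions based on the description above:

```python
import torch
import torch.nn.functional as F

def select_action(policy_logits, legal_mask, temperature=1.0):
    """Mask illegal hand slots, then sample from the tempered softmax.

    policy_logits: [7] raw scores, one per hand slot (assumed shape).
    legal_mask:    [7] bool, True where the slot holds a playable domino.
    """
    masked = policy_logits.masked_fill(~legal_mask, float("-inf"))
    probs = F.softmax(masked / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

logits = torch.tensor([0.2, 1.5, -0.3, 0.0, 0.9, -1.0, 0.4])
legal = torch.tensor([True, True, False, True, False, False, True])
action = select_action(logits, legal, temperature=0.1)
assert legal[action]  # an illegal slot can never be sampled
```

Because illegal slots are set to negative infinity before the softmax, they receive exactly zero probability regardless of temperature.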
Observation Format
The model sees a sequence of up to 36 tokens, each with 8 integer features:
| Token type | Count | Description |
|---|---|---|
| Declaration | 1 | Trump suit context |
| Hand | 7 | Player's own dominoes (fixed slots; played dominoes masked) |
| Play history | 0–28 | Previously played dominoes with relative player IDs |
Features per token: high pip (0–6), low pip (0–6), is_double (0/1), count value (0/5/10 pts), relative player (0–3), is_in_hand (0/1), declaration ID (0–9), token type (decl/hand/play).
The model sees only its own hand and public play history β never opponents' hands. This is an imperfect information game.
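A single play-history token might be encoded like this. This is an illustrative sketch only: the feature order follows the list above, but `encode_play_token`, the token-type codes, and the exact layout are assumptions, not the project's real encoding:

```python
def encode_play_token(high, low, relative_player, declaration_id):
    """Encode one played domino as its 8 integer features (assumed layout)."""
    is_double = int(high == low)
    pip_sum = high + low
    # Counting dominoes: pips summing to 5 are worth 5, to 10 are worth 10.
    count_value = pip_sum if pip_sum in (5, 10) else 0
    return [
        high,             # high pip (0-6)
        low,              # low pip (0-6)
        is_double,        # 0/1
        count_value,      # 0/5/10 points
        relative_player,  # 0-3, relative to the current player
        0,                # is_in_hand: 0 for an already-played domino
        declaration_id,   # 0-9 trump declaration
        2,                # token type (assumed: 0=decl, 1=hand, 2=play)
    ]

token = encode_play_token(5, 5, 2, 3)  # the double-five, worth 10 points
```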
Training
Method
AlphaZero-style self-play: the model plays against itself with Monte Carlo Tree Search (200 simulations/move), then trains on the search results.
- Policy targets: MCTS visit-count distributions at root (soft cross-entropy)
- Value targets: Actual game outcomes (+1/β1 from team perspective, MSE loss)
- Total loss: policy_loss + value_loss
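The combined objective can be written in a few lines. This is a sketch of the standard AlphaZero loss as described above; the function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, visit_probs, outcome):
    """Soft cross-entropy against MCTS visit distributions + MSE on outcomes.

    policy_logits: [B, 7] raw policy head outputs
    value_pred:    [B] tanh-squashed value head outputs
    visit_probs:   [B, 7] normalized MCTS root visit counts
    outcome:       [B] +1/-1 from the acting team's perspective
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(visit_probs * log_probs).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred, outcome)
    return policy_loss + value_loss

# With uniform targets, zero logits, and perfect value predictions,
# the loss reduces to the entropy of a uniform 7-way distribution, ln(7).
logits = torch.zeros(2, 7)
targets = torch.full((2, 7), 1.0 / 7.0)
loss = alphazero_loss(logits, torch.tensor([1.0, -1.0]),
                      targets, torch.tensor([1.0, -1.0]))
```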
Training History
The model went through five phases, including two failures that shaped the final approach.
| Phase | Method | Games | vs Random | Outcome |
|---|---|---|---|---|
| 0. REINFORCE | Policy gradient, no search | – | 51% | Failed: entropy collapsed to 0.07, no learning signal. |
| 1. Oracle MCTS | MCTS with perfect oracle at leaves | 10K | 53.5% | Barely above random; the oracle sees all hands, the model doesn't. |
| 2. GPU MCTS | GPU-accelerated oracle MCTS | 20K | ~50% | Faster but same ceiling. The oracle cheats; the model can't use that. |
| 3. Self-play | AlphaZero (model evaluates own leaves) | 1M | ~70% | Breakthrough. The flywheel starts spinning. |
| 4. Distributed | Workers on Vast.ai + local learner | 1.5M | 69.1%* | Scale-up to 1.5M games over 8 hours. |
* Formal 1,000-game evaluation at temperature=0.1 (see eval table above). In-training eval showed ~71% but used different conditions.
Phase 0 lesson: Pure REINFORCE without search produces no useful gradient signal in a game this complex. Entropy collapses before the policy learns anything. You need search.
Phase 3 breakthrough: Switching from oracle leaf evaluation to the model's own value head. Despite the oracle being "perfect," self-play produced a stronger player, because it learns to handle the imperfect information it actually faces at test time. The oracle knows all hands; the deployed model never will.
The key regime change came from adding a 50K replay buffer + 1,000 training steps/epoch. Value loss crashed from 0.37 → 0.17 in a single session; the model finally had enough diverse data to learn from. This got the flywheel spinning: better value estimates → sharper MCTS search → better policy targets → repeat.
Entropy as a flywheel indicator: We tracked target entropy (from MCTS visit distributions) dropping from 0.415 → 0.313 → 0.255 as training progressed, consistent with the loop: better model → more decisive search → lower entropy floor → policy sharpens to follow. This was visible in real time on the WandB dashboard.
Value loss floor: Value loss converged to ~0.134, close to our estimated information-theoretic floor of ~0.15 for this game. You can't predict outcomes better than hidden information allows; the model is extracting nearly all learnable signal from its observations.
Distributed Architecture (Phase 4)
```
Workers (4–6 Vast.ai GPUs)                Learner (local RTX 3050 Ti)
┌────────────────────────┐                ┌─────────────────────────┐
│ MCTS Self-Play         │──examples.pt─▶│ GPU Replay Buffer       │
│ 200 sims/move          │    (HF Hub)    │ (200K examples)         │
│ 128 parallel games     │                │ Train 100 steps/cycle   │
│ ~4 games/sec/GPU       │◀──weights.pt──│ Push weights/25 cycles  │
│                        │    (HF Hub)    │ Eval/50 cycles          │
└────────────────────────┘                └─────────────────────────┘
```
HuggingFace Hub serves as the single source of truth: workers pull fresh weights and upload game examples. The learner consumes examples and pushes updated weights. This loop ran continuously for ~8 hours.
Configuration
| Parameter | Value |
|---|---|
| MCTS simulations | 200/move |
| MCTS c_puct | 1.414 |
| MCTS max nodes | 512 |
| Training temperature | 1.0 |
| Eval temperature | 0.1 |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.01) |
| Batch size | 64 |
| Replay buffer | 200K examples |
| Parallel games (workers) | 128 |
| Total training cycles | 11,325 |
| Total self-play games | 1,511,424 |
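The `c_puct` value above parameterizes the standard PUCT child-selection rule used in AlphaZero-style MCTS. A minimal sketch of that rule (the node representation and `puct_select` are assumptions, not the project's implementation):

```python
import math

def puct_select(children, c_puct=1.414):
    """Pick the child index maximizing Q + U, the standard PUCT rule.

    children: list of dicts with visit count N, total value W, and
    prior probability P from the policy head (hypothetical node layout).
    """
    total_n = sum(c["N"] for c in children)

    def score(c):
        q = c["W"] / c["N"] if c["N"] > 0 else 0.0          # mean value
        u = c_puct * c["P"] * math.sqrt(total_n) / (1 + c["N"])  # exploration
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

# An unvisited child with a decent prior outranks a well-explored one:
children = [{"N": 10, "W": 6.0, "P": 0.5},
            {"N": 0,  "W": 0.0, "P": 0.3},
            {"N": 5,  "W": -1.0, "P": 0.2}]
choice = puct_select(children)
```

Higher `c_puct` weights the exploration term more heavily; 1.414 (√2) is a common default.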
Final Training Metrics (WandB)
| Metric | Value |
|---|---|
| Policy loss | 0.313 |
| Value loss | 0.134 |
| Policy entropy | 0.316 |
| Policy top-1 accuracy | 89.0% |
| Value mean | 0.019 (near zero, indicating balanced self-play) |
| Value std | 0.926 |
| Eval vs random (in-training) | 71.2%* |
* In-training eval uses 200 games with temperature=0.3. Formal eval (1,000 games, temperature=0.1) gives 69.1%, the canonical number.
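In AlphaZero-style setups, the move-selection temperature τ converts root visit counts into an action distribution, π(a) ∝ N(a)^(1/τ), so τ=0.1 is near-greedy. Whether zeb applies temperature to visit counts or to policy logits is not stated in this card; the sketch below assumes the visit-count convention, and `visit_policy` is an assumed name:

```python
def visit_policy(visits, temperature):
    """Convert root visit counts into sampling probabilities:
    pi(a) proportional to N(a)^(1/temperature)."""
    powered = [n ** (1.0 / temperature) for n in visits]
    total = sum(powered)
    return [p / total for p in powered]

# At temperature 1.0 the policy mirrors the visit counts exactly;
# at 0.1 nearly all mass collapses onto the most-visited move.
train_pi = visit_policy([120, 50, 30], 1.0)
eval_pi = visit_policy([120, 50, 30], 0.1)
```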
Compute
The distributed training phase cost approximately $2.88 over 8 hours using 4 Vast.ai workers (RTX 3060/3070 Ti at $0.05β0.10/hr). The learner ran on a local RTX 3050 Ti.
Timeline
| Date | Milestone |
|---|---|
| Dec 25 | Perfect-play oracle cracked (GPU minimax solver for all 42 game states) |
| Jan 31 | First Zeb commit: REINFORCE attempt, fails immediately |
| Feb 2 | Oracle MCTS → self-play pivot, first real learning signal |
| Feb 5 | CUDA MCTS breakthrough (Claude Opus 4.6 rewrites GPU kernel overnight); E[Q] ceiling discovered at 73.2% |
| Feb 6 | 1M games milestone, 70% vs random |
| Feb 7 | Distributed training on Vast.ai, 1.5M games, model card published |
5 days from first commit to 70% win rate. The Christmas oracle was the foundation: without perfect-play solutions, neither the E[Q] pipeline nor the self-play reward signal would exist.
The GPU Saga
The MCTS kernel was the bottleneck. An NVIDIA B200 (a $30K datacenter GPU) achieved only a 2.2× speedup over a $250 RTX 3050 Ti; a sawtooth pattern in profiling showed that kernel launch overhead, not compute, dominated. Both GPUs spent most of their time on dispatch, not math.
Claude Opus 4.6 diagnosed the problem and rewrote the CUDA MCTS overnight with three optimizations:
- Vectorized backprop: replaced Python `for i in range(N)` loops with batched `gather`/`scatter_add_`
- Depth-variant CUDA graphs: pre-capture graphs at [28, 8, 1] active nodes, pick the smallest sufficient graph per move
- Multi-step capture: loop K simulation steps inside a single `graph.capture()`, replay N/K times
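The first optimization can be illustrated on CPU: instead of walking each simulation's path in a Python loop, all backprop updates land in one batched `scatter_add_`. This is a simplified sketch with a toy tree; the real kernel's data layout and sign handling are not shown:

```python
import torch

# Flat per-node MCTS statistics for a small 6-node tree (illustrative sizes).
n_nodes = 6
visit_counts = torch.zeros(n_nodes)
value_sums = torch.zeros(n_nodes)

# Three finished simulations: node indices along each root-to-leaf path,
# and the leaf value to back up along that path (sign flips omitted).
paths = torch.tensor([0, 2, 5,    # sim A
                      0, 1, 4,    # sim B
                      0, 2, 3])   # sim C
values = torch.tensor([1.0, 1.0, 1.0,
                       -1.0, -1.0, -1.0,
                       0.5, 0.5, 0.5])

# One batched update per statistic replaces the per-simulation Python loop.
visit_counts.scatter_add_(0, paths, torch.ones_like(values))
value_sums.scatter_add_(0, paths, values)
```

Duplicate indices accumulate, which is exactly the backprop semantics: the root (node 0) appears in all three paths and receives all three value updates.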
Result: the 3050 Ti went from 3.3 → 3.6 games/sec at production settings (n_sims=200), with the B200 reaching 10.1 games/sec at 100% GPU utilization, a 2.8× speedup over the 3050 Ti, confirming MCTS is inherently depth-sequential.
But the real lesson was cost-effectiveness: at $6.25/hr the B200 produces ~5,800 games per dollar, while a fleet of RTX 3060s at $0.05/hr produces ~288,000 games per dollar, roughly 50× more cost-effective. We ran the distributed phase on cheap consumer GPUs.
Intended Use
- Research and experimentation with self-play RL on imperfect information games
- Baseline opponent for testing other Texas 42 AI approaches
- Educational reference for AlphaZero-style training on a novel domain
Limitations
- No bidding: The model only plays the trick-taking phase. Bidding (choosing whether to commit to a contract) is skipped. A complete Texas 42 AI would need a bidding policy.
- Naive imperfect information: The model plays from its observation without explicitly reasoning about opponent hands. Information set MCTS or belief tracking would likely improve play.
- Near the ceiling but not there: At 69–71% vs random, Zeb is within 2–4 points of the E[Q] oracle ceiling (73.2%). Closing that gap likely requires explicit reasoning about hidden information, not just more self-play games.
- No human evaluation: 60.8% vs a basic heuristic is meaningful but the heuristic itself is simple. Evaluation against experienced human players would better characterize strength.
- Trick-play only evaluation: Random declaration assignment means some hands are inherently unwinnable regardless of skill.
About Texas 42
Texas 42 is a 4-player partnership trick-taking game played with a standard set of 28 dominoes (double-six). Invented in 1887 in Garner, Texas as a way to play "cards" without actual playing cards (which were considered sinful), it remains widely played in Texas today.
- 28 dominoes dealt 7 to each of 4 players in fixed partnerships (seats 0+2 vs 1+3)
- Bidding determines who chooses the trump suit and must win a minimum point total
- 7 tricks per hand; winner of each trick leads the next
- Follow suit required (suit determined by trump declaration)
- 42 total points: five counting dominoes worth 5 or 10 points each, plus 1 point per trick
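The point structure above can be verified with a quick enumeration over the double-six set (a standalone sanity check, not project code):

```python
# All 28 dominoes of a double-six set: (high, low) with low <= high.
dominoes = [(h, l) for h in range(7) for l in range(h + 1)]
assert len(dominoes) == 28

def count_value(high, low):
    s = high + low
    return s if s in (5, 10) else 0  # 5-0, 4-1, 3-2 -> 5; 5-5, 6-4 -> 10

# Five counting dominoes total 35 points; 7 tricks at 1 point each make 42.
count_points = sum(count_value(h, l) for h, l in dominoes)
trick_points = 7
assert count_points + trick_points == 42
```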
Files
| File | Description |
|---|---|
| `model.pt` | Model state dict (latest, 1.5M games) |
| `config.json` | Architecture config: `{"embed_dim": 128, "n_heads": 4, "n_layers": 4, "ff_dim": 256}` |
| `training_state.json` | Training progress: `{"step": 11325, "total_games": 1511424}` |
| `zeb-557k-1m.pt` | Earlier checkpoint (~1M games, bootstrap for distributed phase) |
Loading the Model
```python
import torch
from forge.zeb.model import ZebModel

model = ZebModel(embed_dim=128, n_heads=4, n_layers=4, ff_dim=256)
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
Citation
This is an informal hobby project. If you reference it, a link to this repo is sufficient.
Source Code
Training pipeline, game engine, and evaluation code:
github.com/jasonyandell/texas-42 (forge/zeb/)
WandB project: jasonyandell-forge42/zeb-mcts