zeb-42: Neural Network for Texas 42 Dominoes
A 557K-parameter transformer trained entirely through AlphaZero-style self-play with MCTS to play Texas 42, a 4-player partnership trick-taking game played with dominoes, often called "the national game of Texas."
This is a hobby/research project. The model is not state-of-the-art; it is an ongoing experiment in applying self-play RL to a traditional game with imperfect information.
Evaluation Results
All evaluations use 1,000 games with temperature=0.1 (near-greedy play). Bidding is skipped (random declaration) to isolate trick-play skill.
Theoretical ceiling: The E[Q] oracle (our strongest player) achieves 73.2% vs random; this is the practical ceiling for a single strong player carrying a random partner. Zeb at 69–71% is within 2–4 points of that ceiling.
| Matchup | Win Rate | Avg Point Margin |
|---|---|---|
| E[Q] oracle vs Random | 73.2% | – |
| Zeb vs Random | 69.1% | +10.29 |
| Zeb vs Heuristic | 60.8% | +6.18 |
| Heuristic vs Random | 53.2% | +1.58 |
Zeb's edge over the heuristic (+7.6% absolute, +6.18 margin) is more than double the heuristic's edge over random (+3.2%, +1.58 margin).
Heuristic baseline: Lead highest trump (or highest domino); follow with lowest-point domino when losing.
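That baseline can be sketched in a few lines. This is a hypothetical reconstruction from the one-line description above; the real heuristic's tie-breaking, suit-following, and "when losing" logic are not specified in this card, and `heuristic_play` and its parameters are assumed names:

```python
def heuristic_play(legal_plays, trump, leading):
    """Baseline sketch: lead the highest trump (or highest domino);
    otherwise dump the lowest-point domino.

    legal_plays: list of (high, low) pip tuples (assumed representation).
    trump: trump pip value, or None if no trump applies.
    leading: True when this player leads the trick.
    """
    def is_trump(d):
        return trump is not None and trump in d

    def pip_total(d):
        return d[0] + d[1]

    def count_value(d):
        s = pip_total(d)
        return s if s in (5, 10) else 0  # counting dominoes are worth 5 or 10

    if leading:
        trumps = [d for d in legal_plays if is_trump(d)]
        pool = trumps if trumps else legal_plays
        return max(pool, key=pip_total)   # lead the biggest available
    return min(legal_plays, key=count_value)  # give away the fewest points

# Leading with twos as trump: the only trump (2,2) is led.
lead = heuristic_play([(6, 4), (5, 0), (2, 2)], trump=2, leading=True)
# Following: (3,3) carries no count points, so it is dumped.
follow = heuristic_play([(6, 4), (5, 0), (3, 3)], trump=None, leading=False)
```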
vs E[Q] Oracle (head-to-head)
The E[Q] player uses a 3.3M-parameter oracle model (trained on perfect-play minimax solutions) and samples N hypothetical worlds per decision to handle hidden information. This is a fundamentally different, and computationally expensive, approach.
| E[Q] Samples | E[Q] Win Rate | Avg Margin | Speed |
|---|---|---|---|
| N=10 | 60.9% | +6.3 | 12.7 g/s |
| N=50 | 58.8% | +6.4 | 8.8 g/s |
| N=100 | 58.7% | +6.5 | 5.5 g/s |
E[Q] consistently beats Zeb ~59% of the time, regardless of sample count; the oracle's access to perfect-play solutions matters more than sampling precision. But the 557K self-play model holds its own against a roughly 6× larger oracle-distilled model backed by world sampling, despite having learned everything from scratch.
Model Details
| Property | Value |
|---|---|
| Architecture | Transformer encoder (pre-LN) + policy/value heads |
| Parameters | 556,970 |
| Embedding dim | 128 |
| Attention heads | 4 |
| Transformer layers | 4 |
| Feedforward dim | 256 |
| Framework | PyTorch |
Architecture
```
Input tokens [batch, seq, 8]
        │
        ▼
Per-feature embeddings (8 learned tables) → concat → Linear → 128d
        │
        ▼
+ Learned positional embeddings (up to 36 positions)
        │
        ▼
4× TransformerEncoderLayer (pre-LN, 4 heads, FF=256, dropout=0.1)
        │
        ├──▶ Policy head: gather 7 hand-slot embeddings → Linear(128,1) → logits
        │
        └──▶ Value head: mean-pool valid tokens → Linear(128,64) → GELU → Linear(64,1) → Tanh
```
Policy output: 7 logits (one per hand slot), masked for legality, then softmax-sampled. Value output: a scalar in [-1, 1] predicting the game outcome from the current player's team perspective.
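The mask-then-sample step can be illustrated as follows. This is a minimal sketch, not the project's actual API; `select_action` and the tensor shapes are assumptions based on the description above:

```python
import torch
import torch.nn.functional as F

def select_action(policy_logits, legal_mask, temperature=1.0):
    """Mask illegal hand slots, then sample from the tempered softmax.

    policy_logits: [7] raw scores, one per hand slot (assumed shape).
    legal_mask:    [7] bool, True where the slot holds a playable domino.
    """
    masked = policy_logits.masked_fill(~legal_mask, float("-inf"))
    probs = F.softmax(masked / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

logits = torch.tensor([0.2, 1.5, -0.3, 0.0, 0.9, -1.0, 0.4])
legal = torch.tensor([True, True, False, True, False, False, True])
action = select_action(logits, legal, temperature=0.1)
assert legal[action]  # an illegal slot can never be sampled
```

Because illegal slots are set to negative infinity before the softmax, they receive exactly zero probability regardless of temperature.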
Observation Format
The model sees a sequence of up to 36 tokens, each with 8 integer features:
| Token type | Count | Description |
|---|---|---|
| Declaration | 1 | Trump suit context |
| Hand | 7 | Player's own dominoes (fixed slots; played dominoes masked) |
| Play history | 0–28 | Previously played dominoes with relative player IDs |
Features per token: high pip (0–6), low pip (0–6), is_double (0/1), count value (0/5/10 pts), relative player (0–3), is_in_hand (0/1), declaration ID (0–9), token type (decl/hand/play).
The model sees only its own hand and public play history β never opponents' hands. This is an imperfect information game.
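A single play-history token might be encoded like this. This is an illustrative sketch only: the feature order follows the list above, but `encode_play_token`, the token-type codes, and the exact layout are assumptions, not the project's real encoding:

```python
def encode_play_token(high, low, relative_player, declaration_id):
    """Encode one played domino as its 8 integer features (assumed layout)."""
    is_double = int(high == low)
    pip_sum = high + low
    # Counting dominoes: pips summing to 5 are worth 5, to 10 are worth 10.
    count_value = pip_sum if pip_sum in (5, 10) else 0
    return [
        high,             # high pip (0-6)
        low,              # low pip (0-6)
        is_double,        # 0/1
        count_value,      # 0/5/10 points
        relative_player,  # 0-3, relative to the current player
        0,                # is_in_hand: 0 for an already-played domino
        declaration_id,   # 0-9 trump declaration
        2,                # token type (assumed: 0=decl, 1=hand, 2=play)
    ]

token = encode_play_token(5, 5, 2, 3)  # the double-five, worth 10 points
```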
Training
Method
AlphaZero-style self-play: the model plays against itself with Monte Carlo Tree Search (200 simulations/move), then trains on the search results.
- Policy targets: MCTS visit-count distributions at root (soft cross-entropy)
- Value targets: Actual game outcomes (+1/β1 from team perspective, MSE loss)
- Total loss: policy_loss + value_loss
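The combined objective can be written in a few lines. This is a sketch of the standard AlphaZero loss as described above; the function and variable names are assumptions:

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, value_pred, visit_probs, outcome):
    """Soft cross-entropy against MCTS visit distributions + MSE on outcomes.

    policy_logits: [B, 7] raw policy head outputs
    value_pred:    [B] tanh-squashed value head outputs
    visit_probs:   [B, 7] normalized MCTS root visit counts
    outcome:       [B] +1/-1 from the acting team's perspective
    """
    log_probs = F.log_softmax(policy_logits, dim=-1)
    policy_loss = -(visit_probs * log_probs).sum(dim=-1).mean()
    value_loss = F.mse_loss(value_pred, outcome)
    return policy_loss + value_loss

# With uniform targets, zero logits, and perfect value predictions,
# the loss reduces to the entropy of a uniform 7-way distribution, ln(7).
logits = torch.zeros(2, 7)
targets = torch.full((2, 7), 1.0 / 7.0)
loss = alphazero_loss(logits, torch.tensor([1.0, -1.0]),
                      targets, torch.tensor([1.0, -1.0]))
```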
Training History
The model went through five phases, including two failures that shaped the final approach.
| Phase | Method | Games | vs Random | Outcome |
|---|---|---|---|---|
| 0. REINFORCE | Policy gradient, no search | – | 51% | Failed: entropy collapsed to 0.07, no learning signal. |
| 1. Oracle MCTS | MCTS with perfect oracle at leaves | 10K | 53.5% | Barely above random; the oracle sees all hands, the model doesn't. |
| 2. GPU MCTS | GPU-accelerated oracle MCTS | 20K | ~50% | Faster but same ceiling. The oracle cheats; the model can't use that. |
| 3. Self-play | AlphaZero (model evaluates own leaves) | 1M | ~70% | Breakthrough. The flywheel starts spinning. |
| 4. Distributed | Workers on Vast.ai + local learner | 1.5M | 69.1%* | Scale-up to 1.5M games over 8 hours. |
* Formal 1,000-game evaluation at temperature=0.1 (see eval table above). In-training eval showed ~71% but used different conditions.
Phase 0 lesson: Pure REINFORCE without search produces no useful gradient signal in a game this complex. Entropy collapses before the policy learns anything. You need search.
Phase 3 breakthrough: Switching from oracle leaf evaluation to the model's own value head. Despite the oracle being "perfect," self-play produced a stronger player, because it learns to handle the imperfect information it actually faces at test time. The oracle knows all hands; the deployed model never will.
The key regime change came from adding a 50K replay buffer + 1,000 training steps/epoch. Value loss crashed from 0.37 → 0.17 in a single session; the model finally had enough diverse data to learn from. This got the flywheel spinning: better value estimates → sharper MCTS search → better policy targets → repeat.
Entropy as a flywheel indicator: We tracked target entropy (from MCTS visit distributions) dropping from 0.415 → 0.313 → 0.255 as training progressed, consistent with the loop: better model → more decisive search → lower entropy floor → policy sharpens to follow. This was visible in real time on the WandB dashboard.
Value loss floor: Value loss converged to ~0.134, close to our estimated information-theoretic floor of ~0.15 for this game. You can't predict outcomes better than hidden information allows; the model is extracting nearly all learnable signal from its observations.
Distributed Architecture (Phase 4)
```
Workers (4–6 Vast.ai GPUs)                Learner (local RTX 3050 Ti)
┌────────────────────────┐                ┌─────────────────────────┐
│ MCTS Self-Play         │──examples.pt─▶│ GPU Replay Buffer       │
│ 200 sims/move          │    (HF Hub)    │ (200K examples)         │
│ 128 parallel games     │                │ Train 100 steps/cycle   │
│ ~4 games/sec/GPU       │◀──weights.pt──│ Push weights/25 cycles  │
│                        │    (HF Hub)    │ Eval/50 cycles          │
└────────────────────────┘                └─────────────────────────┘
```
HuggingFace Hub serves as the single source of truth: workers pull fresh weights and upload game examples. The learner consumes examples and pushes updated weights. This loop ran continuously for ~8 hours.
Configuration
| Parameter | Value |
|---|---|
| MCTS simulations | 200/move |
| MCTS c_puct | 1.414 |
| MCTS max nodes | 512 |
| Training temperature | 1.0 |
| Eval temperature | 0.1 |
| Optimizer | AdamW (lr=1e-4, weight_decay=0.01) |
| Batch size | 64 |
| Replay buffer | 200K examples |
| Parallel games (workers) | 128 |
| Total training cycles | 11,325 |
| Total self-play games | 1,511,424 |
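The `c_puct` value above parameterizes the standard PUCT child-selection rule used in AlphaZero-style MCTS. A minimal sketch of that rule (the node representation and `puct_select` are assumptions, not the project's implementation):

```python
import math

def puct_select(children, c_puct=1.414):
    """Pick the child index maximizing Q + U, the standard PUCT rule.

    children: list of dicts with visit count N, total value W, and
    prior probability P from the policy head (hypothetical node layout).
    """
    total_n = sum(c["N"] for c in children)

    def score(c):
        q = c["W"] / c["N"] if c["N"] > 0 else 0.0          # mean value
        u = c_puct * c["P"] * math.sqrt(total_n) / (1 + c["N"])  # exploration
        return q + u

    return max(range(len(children)), key=lambda i: score(children[i]))

# An unvisited child with a decent prior outranks a well-explored one:
children = [{"N": 10, "W": 6.0, "P": 0.5},
            {"N": 0,  "W": 0.0, "P": 0.3},
            {"N": 5,  "W": -1.0, "P": 0.2}]
choice = puct_select(children)
```

Higher `c_puct` weights the exploration term more heavily; 1.414 (√2) is a common default.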
Final Training Metrics (WandB)
| Metric | Value |
|---|---|
| Policy loss | 0.313 |
| Value loss | 0.134 |
| Policy entropy | 0.316 |
| Policy top-1 accuracy | 89.0% |
| Value mean | 0.019 (near zero, indicating balanced self-play) |
| Value std | 0.926 |
| Eval vs random (in-training) | 71.2%* |
* In-training eval uses 200 games with temperature=0.3. Formal eval (1,000 games, temperature=0.1) gives 69.1%, the canonical number.
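In AlphaZero-style setups, the move-selection temperature τ converts root visit counts into an action distribution, π(a) ∝ N(a)^(1/τ), so τ=0.1 is near-greedy. Whether zeb applies temperature to visit counts or to policy logits is not stated in this card; the sketch below assumes the visit-count convention, and `visit_policy` is an assumed name:

```python
def visit_policy(visits, temperature):
    """Convert root visit counts into sampling probabilities:
    pi(a) proportional to N(a)^(1/temperature)."""
    powered = [n ** (1.0 / temperature) for n in visits]
    total = sum(powered)
    return [p / total for p in powered]

# At temperature 1.0 the policy mirrors the visit counts exactly;
# at 0.1 nearly all mass collapses onto the most-visited move.
train_pi = visit_policy([120, 50, 30], 1.0)
eval_pi = visit_policy([120, 50, 30], 0.1)
```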
Compute
The distributed training phase cost approximately $2.88 over 8 hours using 4 Vast.ai workers (RTX 3060/3070 Ti at $0.05β0.10/hr). The learner ran on a local RTX 3050 Ti.
Timeline
| Date | Milestone |
|---|---|
| Dec 25 | Perfect-play oracle cracked (GPU minimax solver for all 42 game states) |
| Jan 31 | First Zeb commit: REINFORCE attempt, fails immediately |
| Feb 2 | Oracle MCTS → self-play pivot, first real learning signal |
| Feb 5 | CUDA MCTS breakthrough (Claude Opus 4.6 rewrites GPU kernel overnight); E[Q] ceiling discovered at 73.2% |
| Feb 6 | 1M games milestone, 70% vs random |
| Feb 7 | Distributed training on Vast.ai, 1.5M games, model card published |
5 days from first commit to 70% win rate. The Christmas oracle was the foundation: without perfect-play solutions, neither the E[Q] pipeline nor the self-play reward signal would exist.
The GPU Saga
The MCTS kernel was the bottleneck. An NVIDIA B200 (a $30K datacenter GPU) achieved only a 2.2× speedup over a $250 RTX 3050 Ti; a sawtooth pattern in profiling showed that kernel launch overhead, not compute, dominated. Both GPUs spent most of their time on dispatch, not math.
Claude Opus 4.6 diagnosed the problem and rewrote the CUDA MCTS overnight with three optimizations:
- Vectorized backprop: replaced Python `for i in range(N)` loops with batched `gather`/`scatter_add_`
- Depth-variant CUDA graphs: pre-capture graphs at [28, 8, 1] active nodes, pick the smallest sufficient graph per move
- Multi-step capture: loop K simulation steps inside a single `graph.capture()`, replay N/K times
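The first optimization can be illustrated on CPU: instead of walking each simulation's path in a Python loop, all backprop updates land in one batched `scatter_add_`. This is a simplified sketch with a toy tree; the real kernel's data layout and sign handling are not shown:

```python
import torch

# Flat per-node MCTS statistics for a small 6-node tree (illustrative sizes).
n_nodes = 6
visit_counts = torch.zeros(n_nodes)
value_sums = torch.zeros(n_nodes)

# Three finished simulations: node indices along each root-to-leaf path,
# and the leaf value to back up along that path (sign flips omitted).
paths = torch.tensor([0, 2, 5,    # sim A
                      0, 1, 4,    # sim B
                      0, 2, 3])   # sim C
values = torch.tensor([1.0, 1.0, 1.0,
                       -1.0, -1.0, -1.0,
                       0.5, 0.5, 0.5])

# One batched update per statistic replaces the per-simulation Python loop.
visit_counts.scatter_add_(0, paths, torch.ones_like(values))
value_sums.scatter_add_(0, paths, values)
```

Duplicate indices accumulate, which is exactly the backprop semantics: the root (node 0) appears in all three paths and receives all three value updates.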
Result: the 3050 Ti went from 3.3 → 3.6 games/sec at production settings (n_sims=200), with the B200 reaching 10.1 games/sec at 100% GPU utilization, a 2.8× speedup over the 3050 Ti, confirming MCTS is inherently depth-sequential.
But the real lesson was cost-effectiveness: at $6.25/hr the B200 produces ~5,800 games per dollar, while a fleet of RTX 3060s at $0.05/hr produces ~288,000 games per dollar, roughly 50× more cost-effective. We ran the distributed phase on cheap consumer GPUs.
Intended Use
- Research and experimentation with self-play RL on imperfect information games
- Baseline opponent for testing other Texas 42 AI approaches
- Educational reference for AlphaZero-style training on a novel domain
Limitations
- No bidding: The model only plays the trick-taking phase. Bidding (choosing whether to commit to a contract) is skipped. A complete Texas 42 AI would need a bidding policy.
- Naive imperfect information: The model plays from its observation without explicitly reasoning about opponent hands. Information set MCTS or belief tracking would likely improve play.
- Near the ceiling but not there: At 69–71% vs random, Zeb is within 2–4 points of the E[Q] oracle ceiling (73.2%). Closing that gap likely requires explicit reasoning about hidden information, not just more self-play games.
- No human evaluation: 60.8% vs a basic heuristic is meaningful but the heuristic itself is simple. Evaluation against experienced human players would better characterize strength.
- Trick-play only evaluation: Random declaration assignment means some hands are inherently unwinnable regardless of skill.
About Texas 42
Texas 42 is a 4-player partnership trick-taking game played with a standard set of 28 dominoes (double-six). Invented in 1887 in Garner, Texas as a way to play "cards" without actual playing cards (which were considered sinful), it remains widely played in Texas today.
- 28 dominoes dealt 7 to each of 4 players in fixed partnerships (seats 0+2 vs 1+3)
- Bidding determines who chooses the trump suit and must win a minimum point total
- 7 tricks per hand; winner of each trick leads the next
- Follow suit required (suit determined by trump declaration)
- 42 total points: five counting dominoes worth 5 or 10 points each, plus 1 point per trick
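The point structure above can be verified with a quick enumeration over the double-six set (a standalone sanity check, not project code):

```python
# All 28 dominoes of a double-six set: (high, low) with low <= high.
dominoes = [(h, l) for h in range(7) for l in range(h + 1)]
assert len(dominoes) == 28

def count_value(high, low):
    s = high + low
    return s if s in (5, 10) else 0  # 5-0, 4-1, 3-2 -> 5; 5-5, 6-4 -> 10

# Five counting dominoes total 35 points; 7 tricks at 1 point each make 42.
count_points = sum(count_value(h, l) for h, l in dominoes)
trick_points = 7
assert count_points + trick_points == 42
```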
Files
| File | Description |
|---|---|
| `model.pt` | Model state dict (latest, 1.5M games) |
| `config.json` | Architecture config: `{"embed_dim": 128, "n_heads": 4, "n_layers": 4, "ff_dim": 256}` |
| `training_state.json` | Training progress: `{"step": 11325, "total_games": 1511424}` |
| `zeb-557k-1m.pt` | Earlier checkpoint (~1M games, bootstrap for distributed phase) |
Loading the Model
```python
import torch
from forge.zeb.model import ZebModel

model = ZebModel(embed_dim=128, n_heads=4, n_layers=4, ff_dim=256)
state_dict = torch.load("model.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()
```
Citation
This is an informal hobby project. If you reference it, a link to this repo is sufficient.
Source Code
Training pipeline, game engine, and evaluation code:
github.com/jasonyandell/texas-42 (forge/zeb/)
WandB project: jasonyandell-forge42/zeb-mcts