---
base_model: unsloth/qwen3-4b-instruct-2507-unsloth-bnb-4bit
tags:
- text-generation-inference
- transformers
- unsloth
- qwen3
- chess
- reasoning
- sft
license: apache-2.0
language:
- en
datasets:
- nuriyev/chess-reasoning
pipeline_tag: text-generation
---

# Chess Reasoner

A chess move prediction model fine-tuned from **Qwen3-4B-Instruct** to output structured reasoning before selecting moves.
## Overview

This model is **Phase 1** of a two-stage training pipeline:

1. **SFT (this model)** — align the model to output in a specific `<think>` + `<uci_move>` format
2. **GRPO (next step)** — reinforce with Stockfish rewards for stronger play

## Output Format

```
<think>brief reasoning (1-2 sentences)</think>
<uci_move>e2e4</uci_move>
```
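
Because the move is wrapped in a fixed tag pair, it is easy to pull out of a completion with a regular expression. A minimal sketch (the `parse_move` helper is illustrative, not part of any shipped tooling):

```python
import re

# Matches a UCI move (from-square, to-square, optional promotion piece)
# inside the <uci_move> tags the model is trained to emit.
UCI_MOVE_RE = re.compile(r"<uci_move>\s*([a-h][1-8][a-h][1-8][qrbn]?)\s*</uci_move>")

def parse_move(completion):
    """Extract the UCI move from a model completion, or None if absent."""
    match = UCI_MOVE_RE.search(completion)
    return match.group(1) if match else None

completion = "<think>Control the center.</think>\n<uci_move>e2e4</uci_move>"
print(parse_move(completion))  # e2e4
```

The pattern also rejects malformed moves (wrong squares, stray text), which matters because an SFT-only checkpoint can occasionally produce off-format output.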

## Usage

### System Prompt

```python
SYSTEM_PROMPT = """You are an expert chess player.

Given a current game state, you must select the best legal next move. Think in 1-2 sentences, then output your chosen move.

Output format:
<think>brief thinking (2 sentences max)</think>
<uci_move>your_move</uci_move>"""
```

### User Prompt Template

The model expects the board state in the following format:

```
Here is the current game state
Board (Fen): <FEN string>
Turn: It is your turn (<white/black>)
Legal Moves: <comma-separated UCI moves>
Board:
<board representation>
```
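
Filling the template reduces to plain string formatting. A sketch with a hypothetical `build_user_prompt` helper; in practice the field values come from a `python-chess` `Board`, as in the full example below:

```python
def build_user_prompt(fen, turn, legal_moves, board_str):
    """Render the board state in the exact template the model was trained on."""
    return (
        "Here is the current game state\n"
        f"Board (Fen): {fen}\n"
        f"Turn: It is your turn ({turn})\n"
        f"Legal Moves: {', '.join(legal_moves)}\n"
        "Board:\n"
        f"{board_str}"
    )

prompt = build_user_prompt(
    fen="rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1",
    turn="black",
    legal_moves=["g8f6", "e7e5", "c7c5"],
    board_str="<ASCII board here>",
)
print(prompt)
```

Keeping the wording and field order identical to the training template matters for an SFT model; paraphrasing the prompt can degrade format adherence.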

### Full Inference Example

```python
import chess
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load model
model = AutoModelForCausalLM.from_pretrained("nuriyev/chess-reasoner")
tokenizer = AutoTokenizer.from_pretrained("nuriyev/chess-reasoner")

# System prompt
SYSTEM_PROMPT = """You are an expert chess player.

Given a current game state, you must select the best legal next move. Think in 1-2 sentences, then output your chosen move.

Output format:
<think>brief thinking (2 sentences max)</think>
<uci_move>your_move</uci_move>"""

# Example position: after 1. e4
fen = "rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq - 0 1"
board = chess.Board(fen)

# Build user prompt
user_content = f"""Here is the current game state
Board (Fen): {fen}
Turn: It is your turn ({'white' if board.turn else 'black'})
Legal Moves: {', '.join([move.uci() for move in board.legal_moves])}
Board:
{board}"""

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": user_content},
]

# Generate
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    inputs,
    max_new_tokens=128,
    do_sample=False,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
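
Since this checkpoint is format-aligned rather than strength-optimized, the generated move should be validated before it is played. A minimal sketch that checks the extracted move against the legal-move list already computed for the prompt (the helper name is illustrative):

```python
import re

def extract_legal_move(completion, legal_moves):
    """Return the generated UCI move if it appears in legal_moves, else None."""
    match = re.search(r"<uci_move>\s*(\S+)\s*</uci_move>", completion)
    if match and match.group(1) in legal_moves:
        return match.group(1)
    return None

# legal_moves would normally be [m.uci() for m in board.legal_moves]
legal_moves = ["g8f6", "e7e5", "c7c5"]
print(extract_legal_move("<think>...</think>\n<uci_move>c7c5</uci_move>", legal_moves))  # c7c5
print(extract_legal_move("<uci_move>e2e4</uci_move>", legal_moves))  # None (not legal here)
```

On a `None` result, a caller might resample with `do_sample=True` or fall back to a random legal move.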

### Example Output

```
<think>The e5 pawn is undefended, so I will move my knight to e5 to challenge the center and set up a queen attack on g7.</think>
<uci_move>c7c5</uci_move>
```

## Training Details

| Parameter | Value |
|-----------|-------|
| Base Model | Qwen/Qwen3-4B-Instruct-2507 |
| Method | SFT with LoRA (r=32, α=64) |
| Dataset | [nuriyev/chess-reasoning](https://huggingface.co/datasets/nuriyev/chess-reasoning) |
| Epochs | 2 |
| Learning Rate | 2e-4 |
| Batch Size | 16 |
| Max Seq Length | 1024 |
|
| | Trained using [Unsloth](https://github.com/unslothai/unsloth) with response-only loss masking. |
| |
|
| | ## Limitations |
| |
|
| | This SFT checkpoint is format-aligned but not yet optimized for move quality. The upcoming GRPO stage will train against Stockfish evaluations to improve actual chess performance. |
| |
|
| | ## LoRA Adapter |
| |
|
| | Also available: [nuriyev/chess-reasoner-lora](https://huggingface.co/nuriyev/chess-reasoner-lora) |
| |
|
| | [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth) |