iubns committed
Commit 6bc5d83 · verified · 1 Parent(s): 072f81d

Update README.md

Files changed (1): README.md (+143, −31)
---
tags:
- reinforcement-learning
- maskable-ppo
- korean-chess
- 12janggi
- self-play
- stable-baselines3
- sb3-contrib
library_name: stable-baselines3
license: mit
language:
- ko
---

# 🎮 십이장기 (12-Janggi) Reinforcement Learning Models

Deep reinforcement learning agents trained to play 십이장기 (12-Janggi), a simplified Korean chess variant played on an 8×3 board, using self-play and the Maskable PPO algorithm.

## 📋 Model Overview

| Property | Value |
|----------|-------|
| **Algorithm** | Maskable PPO (Proximal Policy Optimization) |
| **Framework** | Stable-Baselines3-Contrib |
| **Training Method** | Self-play with adaptive timesteps |
| **Board Dimensions** | 8 rows × 3 columns |
| **Total Actions** | 576 (24×24 position combinations) |
| **Game Type** | 십이장기 (Korean Chess Variant) |

## 🎯 십이장기 Game Rules

**Pieces (기물):**
- 🤴 **왕 (King)**: Moves one step in any direction (horizontal, vertical, diagonal)
- 💎 **상 (Advisor/Elephant)**: Moves one step diagonally
- 🐘 **장 (General)**: Moves one step horizontally or vertically
- 🚗 **자 (Chariot/Soldier)**: Moves one step forward (promotes to 후 on reaching the last row)
- ⭐ **후 (Promoted Chariot)**: Enhanced movement after promotion

**Victory Conditions:**
- Capture the opponent's King (왕)
- Move your King to the opponent's back row (UP: row 5, DOWN: row 2)

**Game Features:**
- Simplified Korean chess with only 12 pieces (6 per player)
- Fast-paced gameplay on a compact 8×3 board
- Captured pieces can be placed back on the board

## 📦 Available Models

- **`model_up.zip`**: Standard (256 neurons) - UP player (上家)
- **`model_up_512.zip`**: Large (512 neurons) - UP player (上家)
- **`model_down.zip`**: Standard (256 neurons) - DOWN player (下家)
- **`model_down_512.zip`**: Large (512 neurons) - DOWN player (下家)

## 🚀 Quick Start

### Installation

```bash
pip install sb3-contrib gymnasium numpy
```

### Load and Use Model

```python
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker
import gymnasium as gym

# Load 십이장기 environment
env = gym.make("MiniChess-v0")  # Environment name kept for compatibility
env = ActionMasker(env, lambda e: e.unwrapped.get_valid_actions())

# Load trained model
model = MaskablePPO.load("model_up.zip")

# Play
obs, info = env.reset()
done = False

while not done:
    # Get valid actions
    action_masks = env.unwrapped.get_valid_actions()

    # Predict with masking
    action, _states = model.predict(obs, action_masks=action_masks, deterministic=False)

    # Execute action
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

print(f"Game ended with reward: {reward}")
```

### Download from Hub

```python
from huggingface_hub import hf_hub_download
from sb3_contrib import MaskablePPO

model_path = hf_hub_download(
    repo_id="SoonchunhyangUniversity/12-RN",
    filename="model_up.zip"
)

model = MaskablePPO.load(model_path)
```

## 🎓 Training Details

### Hyperparameters

- **Learning Rate**: 5e-4 with adaptive adjustment
- **Batch Size**: 2048
- **N-Steps**: 2048
- **Entropy Coefficient**: 0.05 (exploration vs. exploitation balance)
- **Clip Range**: 0.2
- **N-Epochs**: 10
- **Network Architecture**: [256, 256] or [512, 512] (policy and value networks)

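Assuming the standard Stable-Baselines3 keyword names, the hyperparameters above translate to a constructor call roughly like this (a sketch, not the repository's actual training script):

```python
# Hypothetical MaskablePPO configuration mirroring the hyperparameters listed
# above; keyword names follow the Stable-Baselines3 convention.
ppo_kwargs = dict(
    learning_rate=5e-4,   # adjusted adaptively during training
    n_steps=2048,
    batch_size=2048,
    ent_coef=0.05,        # entropy coefficient for exploration
    clip_range=0.2,
    n_epochs=10,
    policy_kwargs=dict(net_arch=[256, 256]),  # [512, 512] for the *_512 models
)
# model = MaskablePPO("MultiInputPolicy", env, **ppo_kwargs)  # requires sb3-contrib
```
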
### Training Strategy

- **Self-play**: Models alternate training against each other
- **Adaptive Training**: Timesteps adjusted based on win-rate performance
- **Action Masking**: Only legal moves are considered during training
- **Parallel Environments**: 48 simultaneous game instances for efficient data collection

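An action mask of the kind used during training can be built from the legal (start, target) square pairs for the current position. This is only an illustrative sketch; the repository's `get_valid_actions` helper is the authoritative implementation:

```python
import numpy as np

N_SQUARES = 24                     # 8 rows x 3 columns
N_ACTIONS = N_SQUARES * N_SQUARES  # 576 discrete actions

def build_action_mask(legal_moves):
    """Return a boolean mask over the 576 discrete actions.

    legal_moves: iterable of (start_square, target_square) pairs,
    each square indexed 0-23 (row-major ordering assumed).
    """
    mask = np.zeros(N_ACTIONS, dtype=bool)
    for start, target in legal_moves:
        mask[start * N_SQUARES + target] = True
    return mask

# Example: two legal moves leave exactly two actions unmasked
mask = build_action_mask([(0, 3), (5, 8)])
```
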

### Observation Space

Multi-channel board representation (Dict):
- **Board**: `(10, 8, 3)` tensor - 10 channels for different piece types and players
- **Turn**: Scalar indicating the current player (0 or 1)
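
An empty observation in this layout could be constructed as follows. The key names and channel ordering are assumptions; consult `MiniChessEnv` in the repository for the authoritative layout:

```python
import numpy as np

# Sketch of the Dict observation described above (key names assumed).
obs = {
    "board": np.zeros((10, 8, 3), dtype=np.int8),  # one channel per piece type/player
    "turn": np.array(0, dtype=np.int8),            # 0 or 1, the player to move
}
```
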

### Action Space

- **Type**: Discrete(576)
- **Encoding**: `action = start_position * 24 + target_position`
- **Masking**: Invalid moves filtered via action masks
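
The encoding formula above is invertible with integer division; a pair of hypothetical helpers (not part of the repository's API):

```python
N_SQUARES = 24  # squares on the 8x3 board

def encode_action(start: int, target: int) -> int:
    """Pack a (start, target) square pair into one of the 576 action ids."""
    return start * N_SQUARES + target

def decode_action(action: int) -> tuple[int, int]:
    """Unpack an action id back into its (start, target) square pair."""
    return divmod(action, N_SQUARES)
```
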

## 📊 Performance

Models achieve strong performance through self-play:
- Average episode reward: roughly in the 200-400 range
- Episode length: 15-25 moves on average
- Win rate stabilizes around 45-55% when evenly matched

## 🔗 Resources

- **GitHub Repository**: [LikeLionSCH/12-RN](https://github.com/LikeLionSCH/12-RN)
- **Environment Code**: See the repository for the `MiniChessEnv` implementation
- **Training Script**: `new_learn.py` in the repository

## 📄 Citation

If you use these models in your research, please cite:

```bibtex
@misc{12janggi_rl_2026,
  title={십이장기 (12-Janggi) Reinforcement Learning Models},
  author={Soonchunhyang University},
  year={2026},
  publisher={Hugging Face},
  url={https://huggingface.co/SoonchunhyangUniversity/12-RN}
}
```

## 📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

---

**Developed by**: Soonchunhyang University (순천향대학교)
**Contact**: For questions or collaborations, please open an issue on GitHub