---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
- choonsik
- VLA
- Minecraft
- vision-language-action
- qwen3.5
- image-text-to-text
datasets:
- CraftJarvis/minecraft-vla-sft
library_name: transformers
language:
- en
pipeline_tag: image-text-to-text
---

# Choonsik: Minecraft Vision-Language-Action Model

Choonsik is a **Vision-Language-Action (VLA)** model for Minecraft, built on [Qwen/Qwen3.5-9B]() and trained with the three-stage **ActVLP** pipeline from [JARVIS-VLA](https://arxiv.org/abs/2503.16365).

Given a Minecraft observation frame and a natural-language task instruction, Choonsik outputs keyboard + mouse action tokens that can be executed directly in the game, covering 1,000+ atomic tasks (crafting, mining, smelting, combat, navigation, etc.).

| | |
|---|---|
| **Base model** | [Qwen/Qwen3.5-9B]() |
| **Training data** | [CraftJarvis/minecraft-vla-sft]() (3.78M examples) |
| **Training stages** | Language → Vision-Language → Imitation Learning |
| **License** | MIT |

## Usage

```python
from choonsik.inference import ChoonsikInferenceRunner
from PIL import Image

runner = ChoonsikInferenceRunner("Infinity08/Choonsik-Qwen3.5-9B")
frame = Image.open("minecraft_frame.png")
action = runner.predict(frame, task="craft a wooden pickaxe")
# action = {"forward": 0, "attack": 1, ..., "camera": [0.0, 0.3]}
```

## Action Space

Choonsik predicts actions using **mu-law discretized tokens**:

| Token type | Count | Description |
|---|---|---|
| Keyboard | 29 | `forward`, `attack`, `use`, `jump`, hotbar 1–9, … |
| Mouse X | 21 | Horizontal camera rotation (mu-law bins) |
| Mouse Y | 21 | Vertical camera rotation (mu-law bins) |

## Training

Three-stage ActVLP pipeline (following JARVIS-VLA):

1. **Stage 1: Language post-training**: Minecraft world knowledge (text-only SFT)
2. **Stage 2: Vision-language alignment**: Image captioning and VQA on gameplay frames
3. **Stage 3: Imitation learning**: Action prediction on 3.78M trajectory examples

Training hardware: L40S (48 GB VRAM). Inference: RTX 5080 with 4-bit NF4 quantization.

## Citation

If you use Choonsik or the underlying JARVIS-VLA methodology, please cite:

```bibtex
@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and others},
  journal = {arXiv preprint arXiv:2503.16365},
  year    = {2025}
}
```
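
## Appendix: Mu-law camera binning sketch

The Action Space section says each camera axis is split into 21 mu-law bins. The sketch below shows how such a discretization could work. It is an illustration, not Choonsik's actual tokenizer: only the bin count of 21 comes from this card, while the compression constant `MU`, the clipping range `MAX_DELTA`, and the function names `mu_law_encode` and `mu_law_decode` are assumptions chosen for the example.

```python
import math

N_BINS = 21       # from the Action Space table: 21 bins per camera axis
MU = 10.0         # ASSUMPTION: compression constant, not stated in this card
MAX_DELTA = 10.0  # ASSUMPTION: largest camera movement per step, in degrees


def mu_law_encode(delta: float) -> int:
    """Map a camera delta (degrees) to a bin index in [0, N_BINS - 1]."""
    # Clip to the supported range and normalise to [-1, 1].
    x = max(-MAX_DELTA, min(MAX_DELTA, delta)) / MAX_DELTA
    # Mu-law compression: small movements are spread over more bins.
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Uniformly quantise the compressed value.
    return round((y + 1.0) / 2.0 * (N_BINS - 1))


def mu_law_decode(index: int) -> float:
    """Map a bin index back to the camera delta at the centre of that bin."""
    y = index / (N_BINS - 1) * 2.0 - 1.0
    # Inverse mu-law expansion.
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return x * MAX_DELTA
```

Under these assumed parameters the middle bin (index 10) corresponds to no camera movement, and bins near the centre cover smaller angular steps than bins near the edges, which is the usual motivation for mu-law binning of mouse deltas.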