Choonsik — Minecraft Vision-Language-Action Model

Choonsik is a Vision-Language-Action (VLA) model for Minecraft, built on Qwen/Qwen3.5-9B and trained with the three-stage ActVLP pipeline from JARVIS-VLA.

Given a Minecraft observation frame and a natural-language task instruction, Choonsik outputs keyboard and mouse action tokens that can be executed directly in the game. It covers 1,000+ atomic tasks (crafting, mining, smelting, combat, navigation, etc.).

Base model        Qwen/Qwen3.5-9B
Training data     CraftJarvis/minecraft-vla-sft (3.78M examples)
Training stages   Language → Vision-Language → Imitation Learning
License           MIT

Usage

from choonsik.inference import ChoonsikInferenceRunner
from PIL import Image

# Load the model by its repo ID.
runner = ChoonsikInferenceRunner("Infinity08/Choonsik-Qwen3.5-9B")

# A single Minecraft observation frame.
frame = Image.open("minecraft_frame.png")

# Predict the next action for the given task instruction.
action = runner.predict(frame, task="craft a wooden pickaxe")
# action = {"forward": 0, "attack": 1, ..., "camera": [0.0, 0.3]}
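
The returned dictionary mixes binary key states with a two-element camera entry. A minimal sketch of separating the two is shown below. The helper name `split_action` is hypothetical and not part of the `choonsik` package, and it assumes every non-camera entry is a 0/1 flag, as in the example comment above.

```python
from typing import Any, Dict, List, Tuple


def split_action(action: Dict[str, Any]) -> Tuple[List[str], Tuple[float, float]]:
    """Separate pressed keys from the camera entry of an action dict.

    Hypothetical helper: assumes non-camera values are 0/1 flags.
    """
    camera = action.get("camera", [0.0, 0.0])
    pressed = [
        name
        for name, value in action.items()
        if name != "camera" and value == 1
    ]
    return pressed, (float(camera[0]), float(camera[1]))


# Shortened version of the example action shown above.
example = {"forward": 0, "attack": 1, "camera": [0.0, 0.3]}
keys, camera = split_action(example)
print(keys, camera)
```

Keeping the camera entry apart from the key flags makes it easier to hand each part to whatever input layer drives the game.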

Action Space

Choonsik predicts actions as discrete tokens, with camera movement on each mouse axis discretized into mu-law bins:

Token type   Count   Description
Keyboard     29      forward, attack, use, jump, hotbar 1–9, …
Mouse X      21      Horizontal camera rotation (mu-law bins)
Mouse Y      21      Vertical camera rotation (mu-law bins)
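
To illustrate what mu-law binning does to a camera delta, here is a minimal sketch with 21 bins per axis. The compression constant `MU` and the clipping range `MAX_DEG` are assumptions chosen for illustration; the values Choonsik actually uses are not stated here. The function names are likewise hypothetical.

```python
import math

N_BINS = 21      # bins per mouse axis, from the table above
MU = 10.0        # assumed compression constant (not confirmed by the model card)
MAX_DEG = 10.0   # assumed clipping range in degrees (not confirmed)


def mu_law_encode(delta_deg: float) -> int:
    """Map a camera delta in degrees to a bin index in [0, N_BINS - 1]."""
    clipped = max(-MAX_DEG, min(MAX_DEG, delta_deg))
    v = clipped / MAX_DEG  # normalize to [-1, 1]
    # Mu-law compression: fine resolution near zero, coarse at the extremes.
    y = math.copysign(math.log1p(MU * abs(v)) / math.log1p(MU), v)
    return int(round((y + 1.0) / 2.0 * (N_BINS - 1)))


def mu_law_decode(index: int) -> float:
    """Map a bin index back to the camera delta at the bin center."""
    y = index / (N_BINS - 1) * 2.0 - 1.0
    v = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return v * MAX_DEG


center = mu_law_encode(0.0)            # middle bin represents "no movement"
small_step = mu_law_decode(center + 1)  # smallest positive movement
large_step = mu_law_decode(N_BINS - 1) - mu_law_decode(N_BINS - 2)
print(center, small_step, large_step)
```

Under these assumed constants, bins near the center are much narrower than bins near the edge, so small aiming corrections keep more precision than large turns.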

Training

Three-stage ActVLP pipeline (following JARVIS-VLA):

  1. Stage 1 — Language post-training: Minecraft world knowledge (text-only SFT)
  2. Stage 2 — Vision-language alignment: Image captioning and VQA on gameplay frames
  3. Stage 3 — Imitation learning: Action prediction on 3.78M trajectory examples

Training hardware: L40S (48 GB VRAM). Inference: RTX 5080 with 4-bit NF4 quantization.
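
A rough back-of-the-envelope estimate shows why 4-bit quantization matters for inference on a consumer GPU. The sketch below counts weight storage only, assuming 2 bytes per parameter for BF16 and 0.5 bytes per parameter for NF4. It ignores quantization constants, activations, and the KV cache, so real memory use will be higher.

```python
PARAMS = 9e9             # "9B" model size
BYTES_PER_PARAM_BF16 = 2.0
BYTES_PER_PARAM_NF4 = 0.5  # 4 bits per weight, excluding quantization constants


def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return n_params * bytes_per_param / 1e9


bf16_gb = weight_gb(PARAMS, BYTES_PER_PARAM_BF16)
nf4_gb = weight_gb(PARAMS, BYTES_PER_PARAM_NF4)
print(f"BF16 weights: {bf16_gb:.1f} GB, NF4 weights: {nf4_gb:.1f} GB")
```

By this estimate the BF16 weights alone fit within the 48 GB of an L40S but would exceed the 16 GB of an RTX 5080, while the NF4 weights leave headroom on that card for activations.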

Citation

If you use Choonsik or the underlying JARVIS-VLA methodology, please cite:

@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models
             to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and others},
  journal = {arXiv preprint arXiv:2503.16365},
  year    = {2025}
}

Model size: 9B params (Safetensors, tensor type BF16)