license: gemma
base_model: unsloth/gemma-3-270m-it
language:
  - en
pipeline_tag: text-generation
tags:
  - robotics
  - text-to-json
  - instruction-following
  - mujoco
  - gemma3
library_name: transformers

LLM-Tank — Gemma-3 270M → robot JSON

Source-code: https://codeberg.org/imperius/llm-tank

Fine-tuned Gemma-3 270M that translates one free-form English instruction for a tracked robot with a gripper arm into a strict JSON command list, executed in a MuJoCo simulation.

Full pipeline: text → this model → valid JSON → controller → robot drives / grasps. Code & sim: see the source repository.

LLM-Tank demo

What it outputs

A single JSON object {"commands": [ ... ]}. Actions:

  • move: direction (forward|backward), distance_m, speed?
  • turn: direction (left|right), angle_deg, speed?
  • stop; wait: duration_s
  • grasp / release: optional cell, one of front|front_left|front_right|left|right (discrete, relative to the robot; IK is solved by the controller, not the model)
  • out-of-scope / nonsense → {"commands": []}

The model emits no coordinates — only discrete actions/enums (this keeps generation reliable and schema-checkable).
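Because the output is limited to discrete actions and enums, it can be checked with a few lines of plain Python. The following is a minimal sketch, not the JSON Schema used in the source repository: the enum values come from the action list above, while the key names "cell" and "speed", the numeric types, and the decision to tolerate extra keys are assumptions.

```python
from typing import Any

# Enum values are taken from the action list above; the key names "cell" and
# "speed" and their types are assumptions of this sketch.
DIRECTIONS = {"move": {"forward", "backward"}, "turn": {"left", "right"}}
AMOUNT_KEY = {"move": "distance_m", "turn": "angle_deg"}
CELLS = {"front", "front_left", "front_right", "left", "right"}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so reject it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_enum(value: Any, allowed: set) -> bool:
    return isinstance(value, str) and value in allowed


def is_valid_command(cmd: Any) -> bool:
    """Check one command dict against the action list."""
    if not isinstance(cmd, dict):
        return False
    action = cmd.get("action")
    if action in ("move", "turn"):
        return (_is_enum(cmd.get("direction"), DIRECTIONS[action])
                and _is_number(cmd.get(AMOUNT_KEY[action]))
                and ("speed" not in cmd or _is_number(cmd["speed"])))
    if action == "wait":
        return _is_number(cmd.get("duration_s"))
    if action == "stop":
        return True
    if action in ("grasp", "release"):
        return "cell" not in cmd or _is_enum(cmd["cell"], CELLS)
    return False  # unknown or missing action


def is_valid_output(obj: Any) -> bool:
    """True if obj looks like {"commands": [...]} with only known commands."""
    return (isinstance(obj, dict)
            and isinstance(obj.get("commands"), list)
            and all(is_valid_command(c) for c in obj["commands"]))
```

An empty command list passes this check, which matches the out-of-scope convention {"commands": []}.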

Required input format (IMPORTANT)

The model was trained with the same prompt layout that must be used at inference (train == infer): a fixed, short system prompt folded together with the instruction into ONE user turn. You must use exactly this:

import json
from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM = ("You translate ONE English instruction for a tracked robot "
          "with a gripper arm into a single JSON object "
          '{"commands":[...]} using actions: move, turn, stop, wait, '
          "grasp, release. Output ONLY the JSON object, no prose, no "
          'markdown. If the instruction is out of scope or nonsense, '
          'output {"commands": []}.')

tok = AutoTokenizer.from_pretrained("PATH_OR_REPO")
model = AutoModelForCausalLM.from_pretrained("PATH_OR_REPO",
                                             torch_dtype="auto",
                                             device_map="auto")

def translate(instruction: str) -> dict:
    user = SYSTEM + "\n\n---\nINSTRUCTION: " + instruction.strip()
    enc = tok.apply_chat_template(
        [{"role": "user", "content": user}],
        tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt").to(model.device)
    out = model.generate(**enc, max_new_tokens=160, do_sample=False)
    txt = tok.decode(out[0][enc["input_ids"].shape[1]:],
                     skip_special_tokens=True)
    i, j = txt.find("{"), txt.rfind("}")
    try:
        return json.loads(txt[i:j + 1])
    except Exception:
        return {"commands": []}  # safe fallback

print(translate("go forward 2 meters then turn left"))
# {"commands": [{"action": "move", "direction": "forward",
#   "distance_m": 2.0}, {"action": "turn", "direction": "left",
#   "angle_deg": 90}]}
print(translate("pick it up"))        # {"commands": [{"action": "grasp"}]}
print(translate("make me a coffee"))  # {"commands": []}

Use greedy decoding (do_sample=False). Roughly 99% of outputs are schema-valid without constrained decoding, which still leaves about 1 in 100 malformed, so always keep the safe fallback.
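The fallback can be made slightly stricter than the first-brace/last-brace slice used above. The sketch below is a variation, not the repository's parser: it uses the standard library's json.JSONDecoder.raw_decode to read the first complete JSON object and also rejects anything that is not a dict with a list under "commands". The function name parse_commands is hypothetical.

```python
import json


def _empty() -> dict:
    # Build a fresh object each time so callers can mutate it safely.
    return {"commands": []}


def parse_commands(txt: str) -> dict:
    """Parse the first JSON object in txt; return an empty plan on any failure."""
    start = txt.find("{")
    if start == -1:
        return _empty()
    try:
        # raw_decode stops at the end of the first valid JSON value,
        # so trailing text (even text containing braces) is ignored.
        obj, _end = json.JSONDecoder().raw_decode(txt, start)
    except json.JSONDecodeError:
        return _empty()
    if isinstance(obj, dict) and isinstance(obj.get("commands"), list):
        return obj
    return _empty()
```

Applied to the decoded text inside translate(), this would replace the find/rfind slice and the try/except block while keeping the same fallback value.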

Metrics (held-out val, 352 examples: locomotion + manipulation + OOD)

| metric | value |
|---|---|
| schema_valid_rate | 0.991 |
| exact_match_rate | 0.943 |
| action_seq_accuracy | 0.980 |
| ood_f1 | 0.857 |
| task_success (MuJoCo, 40) | 0.975 |
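This card does not define the metrics, so the following sketch only shows one plausible reading of three of them. It assumes exact_match means the predicted object equals the reference, action_seq_accuracy compares only the ordered action names, and ood_f1 treats "reference is an empty command list" as the positive class. The evaluation code in the source repository may differ on every point.

```python
from typing import Dict, List, Optional


def action_sequence(plan: dict) -> List[Optional[str]]:
    """Ordered action names of a plan, ignoring all parameters."""
    return [cmd.get("action") if isinstance(cmd, dict) else None
            for cmd in plan.get("commands", [])]


def is_empty(plan: dict) -> bool:
    return plan.get("commands") == []


def score(preds: List[dict], refs: List[dict]) -> Dict[str, float]:
    """Compute the assumed metric definitions over aligned predictions/references."""
    if len(preds) != len(refs) or not refs:
        raise ValueError("preds and refs must be non-empty and the same length")
    pairs = list(zip(preds, refs))
    exact = sum(p == r for p, r in pairs)
    seq = sum(action_sequence(p) == action_sequence(r) for p, r in pairs)
    # OOD detection as binary classification: positive = empty command list.
    tp = sum(is_empty(p) and is_empty(r) for p, r in pairs)
    fp = sum(is_empty(p) and not is_empty(r) for p, r in pairs)
    fn = sum(not is_empty(p) and is_empty(r) for p, r in pairs)
    denom = 2 * tp + fp + fn
    return {
        "exact_match_rate": exact / len(refs),
        "action_seq_accuracy": seq / len(refs),
        "ood_f1": 2 * tp / denom if denom else 0.0,
    }
```

Under these definitions a prediction with the right actions but a wrong distance counts toward action_seq_accuracy and not toward exact_match_rate, which is consistent with the gap between the two values in the table.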

Training

Full fine-tuning (not LoRA) of unsloth/gemma-3-270m-it on ~3.5k synthetic instruction→JSON pairs (generated with 120B models, validated against a JSON Schema). Trained in fp32 on a Kaggle T4. Two phases: locomotion first, then the arm (grasp/release) added. Details in the source repo (docs/).
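Since training and inference share one prompt layout, a single instruction→JSON pair can be turned into a two-message chat example as sketched below. The system prompt and the "---\nINSTRUCTION:" separator are copied from the inference snippet in this card; the helper names, the "assistant" role label, and serializing the target with json.dumps defaults are assumptions, and the actual training script lives in the source repository.

```python
import json
from typing import Dict, List

# Copied from the inference snippet in this card.
SYSTEM = ("You translate ONE English instruction for a tracked robot "
          "with a gripper arm into a single JSON object "
          '{"commands":[...]} using actions: move, turn, stop, wait, '
          "grasp, release. Output ONLY the JSON object, no prose, no "
          'markdown. If the instruction is out of scope or nonsense, '
          'output {"commands": []}.')


def build_user_turn(instruction: str) -> str:
    """Fold the system prompt and the instruction into one user message."""
    return SYSTEM + "\n\n---\nINSTRUCTION: " + instruction.strip()


def to_chat_example(instruction: str, target: dict) -> List[Dict[str, str]]:
    """Hypothetical helper: one training pair as a list of chat messages."""
    return [
        {"role": "user", "content": build_user_turn(instruction)},
        # Assumption: the target is stored as default-formatted JSON text.
        {"role": "assistant", "content": json.dumps(target)},
    ]
```

A message list in this shape can then be passed through the tokenizer's chat template, the same way the inference snippet does for the user turn alone.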

Demo

demo.mp4 (in this repo) — ~1 min, two panes: left = command + model JSON output, right = the robot acting in MuJoCo (real model + real physics, not staged).

Limitations

  • No perception: the model can't target objects by name/color, only by discrete relative cell. Object resolution is spatial (controller grabs the nearest graspable body in the chosen cell).
  • English only. Single fixed gripper, minimal custom arm.
  • Designed for the accompanying controller/sim; raw JSON is meaningless without it.
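To give a feel for what consuming the JSON involves, here is a minimal dispatcher sketch. LoggingBackend is a hypothetical stand-in that only records calls; it is not the repository's controller, and the method names and signatures are assumptions that simply mirror the action names and fields listed in this card.

```python
from typing import List, Optional, Tuple

ACTIONS = {"move", "turn", "stop", "wait", "grasp", "release"}


class LoggingBackend:
    """Hypothetical backend: records each call instead of driving a robot."""

    def __init__(self) -> None:
        self.log: List[Tuple] = []

    def move(self, direction: str, distance_m: float, speed: Optional[float] = None) -> None:
        self.log.append(("move", direction, distance_m, speed))

    def turn(self, direction: str, angle_deg: float, speed: Optional[float] = None) -> None:
        self.log.append(("turn", direction, angle_deg, speed))

    def stop(self) -> None:
        self.log.append(("stop",))

    def wait(self, duration_s: float) -> None:
        self.log.append(("wait", duration_s))

    def grasp(self, cell: Optional[str] = None) -> None:
        self.log.append(("grasp", cell))

    def release(self, cell: Optional[str] = None) -> None:
        self.log.append(("release", cell))


def execute(plan: dict, backend: LoggingBackend) -> None:
    """Run each command in order; refuse anything outside the known actions."""
    for cmd in plan.get("commands", []):
        action = cmd.get("action")
        if action not in ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        params = {k: v for k, v in cmd.items() if k != "action"}
        # Unexpected parameter names raise TypeError from the method call.
        getattr(backend, action)(**params)
```

The explicit ACTIONS allow-list keeps model output from reaching arbitrary attributes of the backend; a real controller would additionally need the IK and object resolution described above.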

License

Weights are a derivative of Google Gemma-3 — use is governed by the Gemma Terms of Use. Accompanying code is under its own license (see the source repository).