ViLM-0.8b (ASCII SFT)
Model Details
ViLM-0.8b-SFT (Vi Language Model) is a highly experimental 0.8-billion-parameter causal language model fine-tuned to act as a Vim/Vi text-editing agent within a simulated 2D grid. Built on top of Qwen/Qwen3.5-0.8B-Base, it was trained via Supervised Fine-Tuning (SFT) on ASCII-based manipulation tasks: it reads a simulated text-editor state (Vi Gym) and predicts the exact keystrokes required to navigate and modify the grid.
- Base Model: Qwen/Qwen3.5-0.8B-Base
- Language: English / Vi Keystrokes
- License: Apache 2.0 (Inherited from Qwen)
- Training Dataset:
Antix5/vi-gym-causal-ascii
What This Model Is (and Isn't)
What it IS:
- A Behavioral Cloning Baseline for Grid Navigation: ViLM successfully learned the syntax, modes, and basic grammar of Vim within the context of an ASCII grid. It understands how to navigate (`h`, `j`, `k`, `l`), use quantifiers (`10j`, `12o`), and the relationship between entering Insert Mode (`i`, `a`, `o`) and escaping it (`<Esc>`).
- An Environment-Aware Agent: It reads an XML-like observation state and formats its outputs to interact with a deterministic Rust-based Vi backend.
- A Potential Starting Point for Reinforcement Learning: While currently limited by its supervised training data, this SFT checkpoint provides a structurally valid baseline that could theoretically be used to initialize RL (e.g., PPO or GRPO) to teach it actual spatial reasoning.
What it ISN'T:
- A General Text Editor: It has only been trained on ASCII-related manipulation tasks in this phase. It is not currently equipped to fix typos, refactor code, or act as a general-purpose writing assistant.
- A Zero-Shot ASCII Artist: While the model attempts to build scaffolding and prepare a canvas for drawing objects, the SFT phase alone did not instill complex geometric reasoning. It understands how to use Vi commands to draw lines or pad spaces, but struggles to translate unprompted 2D concepts (like "cat" or "rose") into coordinates without getting stuck in repetitive loops.
Input & Output Format
The model expects a strictly formatted XML-like string representing the environment state. It was trained to output the exact keystrokes and terminate with a <|im_end|> token.
Input State Example
The state is rendered by the backend and passed to the model:
<|im_start|>
<notepad>
0 |<cursor>
</notepad>
<mode>Normal</mode>
<prompt>Draw a horizontal line</prompt>
<command>
Output Action Example
10i_<Esc><|im_end|>
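For illustration, the observation string can be assembled from a grid state like this. This is a minimal sketch based only on the example above: the helper name `build_prompt` and the list-of-strings grid representation are assumptions, and the actual backend serialization may differ in details.

```python
def build_prompt(grid_lines, mode, task):
    """Assemble the XML-like observation string the model expects.

    Sketch based on the sample state above: each notepad row is
    rendered as "<index> |<content>"; the string ends with an open
    <command> tag for the model to complete.
    """
    notepad = "\n".join(f"{i} |{line}" for i, line in enumerate(grid_lines))
    return (
        "<|im_start|>\n"
        "<notepad>\n"
        f"{notepad}\n"
        "</notepad>\n"
        f"<mode>{mode}</mode>\n"
        f"<prompt>{task}</prompt>\n"
        "<command>\n"
    )

print(build_prompt(["<cursor>"], "Normal", "Draw a horizontal line"))
```

With the arguments shown, this reproduces the input state example above.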
Key Learnings & Emergent Behaviors
During inference testing, several interesting behavioral patterns and biases were observed:
- "The Canvas Builder" Routine: When asked to draw an object, the model often recognizes it needs space. It attempts to autonomously use commands like `12o<Esc>` to open blank lines, followed by `10i <Esc>` to pad spaces, effectively building a blank 2D canvas before it attempts to place specific ASCII characters.
- The "Opening Move" Reflex (Overfitting): Because many training trajectories naturally began with resetting the cursor to the start of the line, the model developed a strong reflex to output `0` as its first move. If the cursor is already at column 0, the state does not change, which can trap the model in an infinite loop if sampling is not used.
- Strict Mode Grammar: The model rarely hallucinates raw text outside of Insert Mode. It learned the rule that `i`, `a`, `A`, `o`, and `O` must eventually be closed by `<Esc>`.
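The mode-grammar rule can be checked mechanically. The sketch below is a simplified validator, not part of the released tooling: it only knows the five insert-entry keys listed above (ignoring others such as `I`, `s`, or `c`) and treats everything between an insert key and the next `<Esc>` as literal text.

```python
import re

INSERT_KEYS = set("iaAoO")

def insert_blocks_closed(action: str) -> bool:
    """Return True if every Insert-Mode entry (i, a, A, o, O) in the
    action is eventually closed by an <Esc> before the action ends."""
    # Split so that "<Esc>" survives as a single token.
    tokens = re.split(r"(<Esc>)", action)
    in_insert = False
    for tok in tokens:
        if tok == "<Esc>":
            in_insert = False
        elif not in_insert:
            # Scan Normal-mode keystrokes for an insert-mode entry;
            # anything after it (until the next <Esc>) is literal text.
            for ch in tok:
                if ch in INSERT_KEYS:
                    in_insert = True
                    break
    return not in_insert
```

For example, `10i_<Esc>` and `12o<Esc>10i <Esc>` pass, while a bare `12o` (insert mode never closed) fails.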
Limitations and Known Traps
- Macro-Loops: If the model makes a mistake (e.g., tries to go up when already at the top), the deterministic environment rejects the move and returns an identical state. Under `do_sample=False` (greedy decoding), the model will repeat the exact same invalid move forever. Mitigation: inference must be run with temperature sampling (e.g., `temperature=0.7`) to force the agent to explore alternate predictions when stuck.
- Overfitting to Demonstrations: The model relies heavily on the specific spatial structures seen in the `vi-gym-causal-ascii` dataset and cannot yet generalize to complex zero-shot artistic creations.
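One way to operationalize the macro-loop mitigation is a resample-on-no-op wrapper. The sketch below is an assumption, not shipped code: `generate_action` and `apply_action` are stand-ins for your own sampled inference call and Vi Gym environment step.

```python
def step_with_retry(generate_action, apply_action, state, max_retries=4):
    """Resample when the environment returns an identical state.

    generate_action(state) -> one sampled action string.
    apply_action(state, action) -> the new environment state.
    If an action is a no-op (state unchanged), sample again instead of
    looping forever; give up after max_retries attempts.
    """
    action, new_state = None, state
    for _ in range(max_retries):
        action = generate_action(state)
        new_state = apply_action(state, action)
        if new_state != state:
            break
    return action, new_state
```

Under greedy decoding this wrapper would retry the same move; it only helps when `generate_action` samples with temperature, which is exactly the mitigation described above.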
Example Inference Code
Companion repo: https://github.com/Antix5/ViLM
To run this model effectively, you must use sampling with an appropriate temperature to prevent state-stagnation loops.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_id = "Antix5/ViLM-0.8b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype=torch.bfloat16)
prompt = """<|im_start|>
<notepad>
0 |<cursor>
</notepad>
<mode>Normal</mode>
<prompt>Draw a horizontal line</prompt>
<command>
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Crucial generation parameters to avoid Vi loops
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    eos_token_id=tokenizer.convert_tokens_to_ids("<|im_end|>"),
)
generated_tokens = outputs[0][inputs["input_ids"].shape[1]:]
action = tokenizer.decode(generated_tokens, skip_special_tokens=True)
print(f"Action: {action}")
Future Work
- Reinforcement Learning (RL): Transitioning from SFT to PPO/GRPO using the deterministic Rust Vi Gym environment to penalize illegal moves and reward successful ASCII pattern completion.
- Extend to other simple tasks: A further SFT pass on text editing, a task for which synthetic data is easy to generate programmatically.
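As a toy illustration of that programmatic generation idea, the sketch below builds one synthetic (state, keystrokes) pair for a typo-fixing task. Everything here is an assumption for illustration: the word list, the `#` corruption, the cursor placement in the rendered state, and the restriction to a single `r`-replace edit; a real pipeline would cover many more edit types.

```python
import random

def make_typo_sample(rng):
    """Generate one synthetic (state, target-keystrokes) pair.

    Corrupts one character of a known word with '#', then emits the Vi
    keystrokes that repair it: '0' to reach the line start, '{col}l'
    to move right, and 'r<char>' to replace the bad character.
    """
    word = rng.choice(["horizontal", "vertical", "diagonal"])
    col = rng.randrange(len(word))
    bad = word[:col] + "#" + word[col + 1:]
    keys = ("0" if col == 0 else f"0{col}l") + f"r{word[col]}"
    # State layout mirrors the ASCII tasks; cursor placement is a guess.
    state = f"0 |{bad}<cursor>"
    return state, keys

rng = random.Random(0)
print(make_typo_sample(rng))
```

Each pair corrupts exactly one character, and the target keystrokes always start with a cursor reset and end with a replace, matching the "opening move" pattern already present in the SFT data.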