A 1B model can explain the correct browser action
before it can reliably choose it.

Fine-tuning MiniCPM5-1B with LoRA to study what a small model learns about planning.

The Problem

Large browser agents (7B+) are expensive and slow. Can a 1B-parameter model learn to plan browser actions (search, navigate, extract, backtrack), or does it just memorize heuristics?

Action space: search, open_page, extract, back, refine_search, finish
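As a minimal sketch, the six actions listed above can be modeled as a closed set so that any model output outside the action space is rejected instead of silently accepted. The names `Action` and `parse_action` are hypothetical helpers for illustration, not part of the project's code.

```python
from enum import Enum
from typing import Optional


class Action(str, Enum):
    """The six planner actions named in the text."""
    SEARCH = "search"
    OPEN_PAGE = "open_page"
    EXTRACT = "extract"
    BACK = "back"
    REFINE_SEARCH = "refine_search"
    FINISH = "finish"


def parse_action(text: str) -> Optional[Action]:
    """Map raw model text to an Action, or None if it is not a known action."""
    try:
        return Action(text.strip().lower())
    except ValueError:
        return None


print(parse_action(" Extract "))  # Action.EXTRACT
print(parse_action("scroll"))     # None
```

Returning `None` for unknown strings keeps malformed generations visible during evaluation rather than counting them as some default action.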

Experiment Timeline

v1: Linear data (1142 samples, perfect regularity)
Learned patterns, couldn't generalize

v2/v3: More data (1504 → 1948 samples)
Still 72%; more data didn't help

v4: Hard samples (243 targeted examples, 89 decision points)
200 quality examples > 2000 generic ones

Ablation: The Action Space Paradox

Adding the back action gave the model a powerful tool, but the model over-generalized it. Both models (with and without back) learned a single default heuristic instead of conditional action selection.

Model          | Wrong page | Paywall   | Standard   | Refine    | Total
vA (no back)   | search ✅  | search ✅ | extract ✅ | search ✅ | 11/11
vB (with back) | back ✅    | back ✅   | back ❌    | back ❌   | 4/12
More actions → more capability → more confusion. The model gains expressive power at the cost of decision quality.
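One rough way to quantify "a single heuristic" is the share of predictions taken by the most common action. The sketch below uses only the per-scenario actions shown in the ablation table (one representative action per scenario column, not the full 11 or 12 test cases); the function name `dominant_action_share` is a hypothetical helper.

```python
from collections import Counter


def dominant_action_share(predictions):
    """Return (most common action, fraction of predictions equal to it)."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    action, count = Counter(predictions).most_common(1)[0]
    return action, count / len(predictions)


# Per-scenario actions copied from the ablation table:
# wrong page, paywall, standard, refine.
vA = ["search", "search", "extract", "search"]
vB = ["back", "back", "back", "back"]

print(dominant_action_share(vA))  # ('search', 0.75)
print(dominant_action_share(vB))  # ('back', 1.0)
```

A share close to 1.0 means the model's choice barely depends on the page state. A high share alone is not proof of a shortcut, since one action can legitimately be correct in most scenarios, so it should be read alongside per-scenario accuracy.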
Reason-First: Fixed with 40 samples and 7.8 seconds of training

Hypothesis: the model does understand the page state, but the action-only output format encourages shortcutting. The fix: force the model to write its reasoning before it outputs the action.

Observation → Reason: ... → Action: ...
Scenario      | vB (action-only) | Reason-First
Wrong page    | back ✅          | back ✅
Paywall       | 1/2              | back ✅
Standard page | back ❌          | extract ✅
Needs refine  | back ❌          | refine_search ✅
Total         | 4/12             | 10/12
Reasoning before acting removes most of the heuristic shortcutting (4/12 → 10/12).
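The difference between the two training formats can be sketched as two sample builders. This is an illustration under assumptions: the system prompt is taken from the inference example on this page, while the function names, the chat-message layout of the training data, and the absence of a system prompt in the action-only variant are guesses and not the project's confirmed data pipeline.

```python
SYSTEM_PROMPT = "First reason, then output the action."


def make_reason_first_sample(observation, reason, action):
    """Build a chat sample whose target states the reason before the action."""
    target = f"Reason: {reason}\nAction: {action}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": observation},
        {"role": "assistant", "content": target},
    ]


def make_action_only_sample(observation, action):
    """Build a chat sample whose target is the bare action (assumed vB format)."""
    return [
        {"role": "user", "content": observation},
        {"role": "assistant", "content": action},
    ]


obs = "Task: Find Apple stock price\n\nPage shows price prominently"
sample = make_reason_first_sample(
    obs, "Target information is present on the page", "extract"
)
print(sample[-1]["content"])
# Reason: Target information is present on the page
# Action: extract
```

Because the action token comes after the reason, the model has to condition its choice on an explicit description of the state instead of mapping the observation straight to a habitual action.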

Why It Works

Action-Only (vB)

Input: Task: Find Apple stock price — Page shows price prominently

Output: back

Heuristic: "obstacle → back" applied to everything

Reason-First

Input: Task: Find Apple stock price — Page shows price prominently

Output: Reason: Target information is present on the page

Output: Action: extract ✅

The model already understood the state. The reason-first output format exposes that knowledge.
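To score reason-first outputs, the reason and the action have to be separated again. The parser below is a minimal sketch that assumes the output follows the "Reason: ... Action: ..." layout shown above; `parse_reason_first` is a hypothetical helper name.

```python
import re

# Non-greedy reason text, then the action name (letters and underscores).
_REASON_ACTION = re.compile(
    r"Reason:\s*(?P<reason>.*?)\s*Action:\s*(?P<action>[a-z_]+)",
    re.DOTALL | re.IGNORECASE,
)


def parse_reason_first(output):
    """Return (reason, action) from a reason-first output, or None if malformed."""
    match = _REASON_ACTION.search(output)
    if match is None:
        return None
    return match.group("reason").strip(), match.group("action").lower()


text = "Reason: Target information is present on the page\nAction: extract"
print(parse_reason_first(text))
# ('Target information is present on the page', 'extract')
print(parse_reason_first("back"))  # None
```

Keeping the reason as a separate field makes it possible to check reasoning and action selection independently, which is the comparison the argument above relies on.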

Key Insights

📊

Data Quality > Quantity

Roughly 200 targeted hard samples (243 in v4) outperformed roughly 2000 generic ones (1948 in v3).

🎯

Reasoning > Data

40 reason-first samples outperformed 88 action-only samples by fixing heuristic shortcutting.

🧠

State Understanding

Correct reasoning in all passing cases suggests the model understands state. The bottleneck is action selection, not understanding.

Try It Yourself

1. Open the notebook → click the ⬇ Download button
2. Go to colab.research.google.com
3. File → Upload notebook → select the downloaded file
4. Runtime → Run all (free GPU included)

Or run locally:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained(
    "Georgefifth/tiny-browser-planner-reason",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
FastLanguageModel.for_inference(model)

msgs = [
    {"role": "system", "content": "First reason, then output the action."},
    {"role": "user", "content": "Task: Find Apple stock price\n\nHistory:\n[search] Search completed.\n[open_page] Price displayed prominently at $198\n\nWhat is the next action?"},
]
prompt = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
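inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outs = model.generate(**inputs, max_new_tokens=48)
# Decode only the newly generated tokens; outs[0] also contains the prompt.
new_tokens = outs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Output:
# Reason: Target information is present on the page
# Action: extract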