A 1B model can explain the correct browser action before it can reliably choose it.

Fine-tuning MiniCPM5-1B with LoRA to study what a small model learns about planning.

The Problem

Large browser agents (7B+) are expensive and slow. Can a 1B-parameter model learn to plan browser actions (search, navigate, extract, backtrack), or does it just memorize heuristics?

Action space: search, open_page, extract, back, refine_search, finish
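
Because the action space is a closed set of six names, any other string the model emits can be treated as invalid. A minimal sketch, assuming the action names are exactly the six listed above; the `Action` enum and `parse_action` helper are hypothetical names for illustration, not part of the project's code:

```python
from enum import Enum
from typing import Optional


class Action(str, Enum):
    """The six browser actions listed above (names taken from the text)."""
    SEARCH = "search"
    OPEN_PAGE = "open_page"
    EXTRACT = "extract"
    BACK = "back"
    REFINE_SEARCH = "refine_search"
    FINISH = "finish"


def parse_action(text: str) -> Optional[Action]:
    """Return the matching Action, or None if the text is not a known action."""
    try:
        return Action(text.strip().lower())
    except ValueError:
        return None
```

Returning `None` rather than raising keeps an evaluation loop simple: an unknown action can just be counted as a wrong answer.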

Experiment Timeline

Linear data — 1142 samples, perfect regularity

Learned patterns, couldn't generalize

v2/v3

More data — 1504 → 1948 samples

Still 72% — more data didn't help

Hard samples — 243 targeted examples, 89 decision points

200 quality examples > 2000 generic ones

Ablation

The Action Space Paradox

Adding the back action gave the model a powerful tool, but the model over-generalized it. Both models (with and without back) learned a single heuristic instead of conditional selection.

Model	Wrong page	Paywall	Standard	Refine	Total
vA (no back)	search ✅	search ✅	extract ✅	search ✅	11/11
vB (with back)	back ✅	back ✅	back ❌	back ❌	4/12

More actions → more capability → more confusion. The model trades decision quality for expressive power.
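
One way to spot this kind of collapse is to check how the predicted actions are distributed across scenarios whose correct answers differ. The sketch below is illustrative only: `dominant_action_share` is a hypothetical helper, and the two prediction lists simply restate the rows of the table above (one prediction per scenario column), not the full evaluation set.

```python
from collections import Counter
from typing import Sequence, Tuple


def dominant_action_share(predictions: Sequence[str]) -> Tuple[str, float]:
    """Return the most frequent predicted action and its share of all predictions."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    action, count = Counter(predictions).most_common(1)[0]
    return action, count / len(predictions)


# vB row of the table: the same action regardless of scenario.
vb_predictions = ["back", "back", "back", "back"]
# vA row of the table: mostly search, with extract on the standard page.
va_predictions = ["search", "search", "extract", "search"]

print(dominant_action_share(vb_predictions))  # ('back', 1.0)
print(dominant_action_share(va_predictions))  # ('search', 0.75)
```

A share of 1.0 across scenarios that need different actions is a strong hint that the model is applying one rule everywhere rather than reading the state.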

Reason-First

Fixed with 40 samples and 7.8 seconds of training

Hypothesis: the model does understand state, but the action-only format encourages shortcutting. The fix: force the model to write its reasoning before it outputs the action.

Observation → Reason: ... → Action: ...
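
To score outputs in this format, the reason and the action have to be separated again. A minimal sketch, assuming the model writes a `Reason:` line followed by an `Action:` line as in the examples on this page; `Step` and `parse_reason_first` are hypothetical names, not the project's own code:

```python
import re
from typing import NamedTuple, Optional

# Reason text is matched lazily up to the first "Action:" label.
_PATTERN = re.compile(
    r"Reason:\s*(?P<reason>.*?)\s*Action:\s*(?P<action>[a-z_]+)",
    re.DOTALL,
)


class Step(NamedTuple):
    reason: str
    action: str


def parse_reason_first(output: str) -> Optional[Step]:
    """Extract the reason and action; return None if the format is not followed."""
    match = _PATTERN.search(output)
    if match is None:
        return None
    return Step(match.group("reason"), match.group("action"))
```

For example, `parse_reason_first("Reason: Target information is present on the page\nAction: extract")` yields a `Step` whose `action` is `"extract"`, while a bare `"back"` returns `None` because no reasoning was given.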

Scenario	vB (action-only)	Reason-First
Wrong page	back ✅	back ✅
Paywall	1/2	back ✅
Standard page	back ❌	extract ✅
Needs refine	back ❌	refine_search ✅
Total	4/12	10/12

Reasoning before acting removed the "always back" shortcut: the total rose from 4/12 to 10/12.

Why It Works

Action-Only (vB)

Input: Task: Find Apple stock price — Page shows price prominently

Output: back

Heuristic: "obstacle → back" applied to everything

Reason-First

Input: Task: Find Apple stock price — Page shows price prominently

Output: Reason: Target information is present on the page

Output: Action: extract ✅

The model already understood the state. The reason-first output format exposes that knowledge.
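
Both formats receive the same input; only the expected output changes. The sketch below shows how such an input could be assembled, assuming the prompt layout used in the local inference snippet on this page (a system instruction, then the task, the action history, and a closing question). `build_messages` is a hypothetical helper, and the actual training data may have been built differently.

```python
from typing import Dict, List, Tuple

SYSTEM_PROMPT = "First reason, then output the action."


def build_messages(task: str, history: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Build chat messages from a task and a list of (action, observation) pairs."""
    history_lines = "\n".join(f"[{action}] {obs}" for action, obs in history)
    user = f"Task: {task}\n\nHistory:\n{history_lines}\n\nWhat is the next action?"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ]
```

The returned list can be passed directly to a tokenizer's `apply_chat_template`.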

Key Insights

📊

Data Quality > Quantity

Roughly 200 targeted hard samples outperformed roughly 2,000 generic ones (243 vs. 1,948 in these experiments).

🎯

Reasoning > Data

40 reason-first samples outperformed 88 action-only samples by fixing heuristic shortcutting.

🧠

State Understanding

Correct reasoning in all passing cases confirms the model understands state — the bottleneck is action selection, not understanding.

Try It Yourself

1. Open the notebook → click the ⬇ Download button
2. Go to colab.research.google.com
3. File → Upload notebook → select the downloaded file
4. Runtime → Run all (free GPU included)

Or run locally:

from unsloth import FastLanguageModel
import torch

# Load the fine-tuned reason-first planner in 4-bit.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Georgefifth/tiny-browser-planner-reason",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,  # use torch.float16 on GPUs without bfloat16 support
)
FastLanguageModel.for_inference(model)  # switch the model to inference mode

msgs = [
    {"role": "system", "content": "First reason, then output the action."},
    {"role": "user", "content": "Task: Find Apple stock price\n\nHistory:\n[search] Search completed.\n[open_page] Price displayed prominently at $198\n\nWhat is the next action?"},
]
prompt = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outs = model.generate(**inputs, max_new_tokens=48)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Expected output:
# Reason: Target information is present on the page
# Action: extract