Fine-tuning MiniCPM5-1B with LoRA to study what a small model learns about planning.
Large browser agents (7B+) are expensive and slow. Can a 1B parameter model learn to plan browser actions — search, navigate, extract, backtrack — or does it just memorize heuristics?
Adding a `back` action gave the model a powerful tool, but the model over-generalized it. Both models (with and without `back`) learned a single heuristic instead of conditional action selection.
| Model | Wrong page | Paywall | Standard | Refine | Total |
|---|---|---|---|---|---|
| vA (no back) | search ✅ | search ✅ | extract ✅ | search ✅ | 11/11 |
| vB (with back) | back ✅ | back ✅ | back ❌ | back ❌ | 4/12 |
Hypothesis: the model does understand state, but the action-only output format encourages shortcutting. Fix: force the model to write its reasoning before it emits the action.
Observation → Reason: ... → Action: ...
| Scenario | vB (action-only) | Reason-First |
|---|---|---|
| Wrong page | back ✅ | back ✅ |
| Paywall | 1/2 | back ✅ |
| Standard page | back ❌ | extract ✅ |
| Needs refine | back ❌ | refine_search ✅ |
| Total | 4/12 | 10/12 |
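The totals in these tables are pass counts over individual scenarios. A minimal sketch of how such a tally could be computed follows; the `Case` record, the `tally` helper, and the three example cases are illustrative assumptions, not the evaluation set used here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Case:
    scenario: str   # scenario label, e.g. "Standard page"
    expected: str   # action a correct planner should pick
    predicted: str  # action the model actually produced


def tally(cases):
    """Return ({scenario: (passed, total)}, (passed, total) overall)."""
    per = defaultdict(lambda: [0, 0])
    for c in cases:
        per[c.scenario][1] += 1
        if c.predicted == c.expected:
            per[c.scenario][0] += 1
    by_scenario = {name: tuple(counts) for name, counts in per.items()}
    overall = (
        sum(passed for passed, _ in by_scenario.values()),
        sum(total for _, total in by_scenario.values()),
    )
    return by_scenario, overall


# Hypothetical cases that mimic an "always go back" heuristic.
cases = [
    Case("Wrong page", expected="back", predicted="back"),
    Case("Standard page", expected="extract", predicted="back"),
    Case("Needs refine", expected="refine_search", predicted="back"),
]
by_scenario, overall = tally(cases)
print(by_scenario)
print(overall)
```

A model that always answers `back` passes only the scenario where `back` happens to be correct, which is the failure pattern the action-only column shows.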
Action-only (vB):
Input: Task: Find Apple stock price. Page shows price prominently.
Output: back ❌
Heuristic: "obstacle → back", applied to every state
Reason-first:
Input: Task: Find Apple stock price. Page shows price prominently.
Output: Reason: Target information is present on the page
Output: Action: extract ✅
The model already understood the state. The explicit reasoning step exposes that knowledge.
200 targeted hard samples outperformed 2000 generic ones.
40 reason-first samples outperformed 88 action-only samples by fixing heuristic shortcutting.
Correct reasoning in all passing cases confirms the model understands state — the bottleneck is action selection, not understanding.
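To make the reason-first format concrete, here is a sketch of how one training sample might be assembled as chat messages. The system prompt and the user-message layout are copied from the inference example in this README; the helper name `build_reason_first_sample` and the choice to store samples as a `messages` list are assumptions, since the training data format itself is not shown here.

```python
SYSTEM_PROMPT = "First reason, then output the action."


def build_reason_first_sample(task, history, reason, action):
    """Build one chat-style sample whose target is 'Reason: ...' then 'Action: ...'.

    history is a list of (action_name, result_text) pairs.
    """
    history_text = "\n".join(f"[{name}] {result}" for name, result in history)
    user = f"Task: {task}\n\nHistory:\n{history_text}\n\nWhat is the next action?"
    # The reasoning line comes first so the action is conditioned on it.
    assistant = f"Reason: {reason}\nAction: {action}"
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user},
            {"role": "assistant", "content": assistant},
        ]
    }


sample = build_reason_first_sample(
    task="Find Apple stock price",
    history=[
        ("search", "Search completed."),
        ("open_page", "Price displayed prominently at $198"),
    ],
    reason="Target information is present on the page",
    action="extract",
)
print(sample["messages"][-1]["content"])
```

An action-only sample would differ only in the assistant turn, which would contain the bare action name.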
1. Open the notebook → click the ⬇ Download button
2. Go to colab.research.google.com
3. File → Upload notebook → select the downloaded file
4. Runtime → Run all (free GPU included)
Or run locally:
```python
from unsloth import FastLanguageModel
import torch

# Load the fine-tuned planner in 4-bit precision.
model, tokenizer = FastLanguageModel.from_pretrained(
    "Georgefifth/tiny-browser-planner-reason",
    max_seq_length=2048,
    load_in_4bit=True,
    dtype=torch.bfloat16,
)
FastLanguageModel.for_inference(model)  # switch to inference mode

msgs = [
    {"role": "system", "content": "First reason, then output the action."},
    {"role": "user", "content": "Task: Find Apple stock price\n\nHistory:\n[search] Search completed.\n[open_page] Price displayed prominently at $198\n\nWhat is the next action?"},
]
prompt = tokenizer.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outs = model.generate(**inputs, max_new_tokens=48)

# Decode only the newly generated tokens; outs[0] also contains the prompt.
new_tokens = outs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
# Output:
# Reason: Target information is present on the page
# Action: extract
```
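The generated text still has to be turned into an action the browser loop can execute. A minimal parsing sketch is below. The `parse_plan` helper and the `KNOWN_ACTIONS` set are assumptions: the set is collected from the action names that appear in this README and may not match the full action space the model was trained on.

```python
import re

# Assumed action vocabulary, collected from names mentioned in this README.
KNOWN_ACTIONS = {"search", "open_page", "extract", "back", "refine_search"}


def parse_plan(text):
    """Split model output into (reason, action).

    Either element is None when the corresponding line is missing,
    and action is None when the name is not in KNOWN_ACTIONS.
    """
    reason_match = re.search(r"^Reason:\s*(.+)$", text, flags=re.MULTILINE)
    action_match = re.search(r"^Action:\s*(\w+)", text, flags=re.MULTILINE)
    reason = reason_match.group(1).strip() if reason_match else None
    action = action_match.group(1) if action_match else None
    if action not in KNOWN_ACTIONS:
        action = None  # reject malformed or unknown actions instead of guessing
    return reason, action


reason, action = parse_plan(
    "Reason: Target information is present on the page\nAction: extract"
)
print(reason)
print(action)
```

Returning `None` for an unknown action lets the caller decide whether to retry generation or fall back to a default, rather than executing something the model did not clearly ask for.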