---
license: apache-2.0
language: [en]
library_name: safetensors
pipeline_tag: text-generation
tags: [hobbylm, mixture-of-experts, moe, sparse-moe]
---

# HobbyLM-Computer-Use (500M MoE, GUI agent / tool use)

HobbyLM-Computer-Use is the agentic variant of HobbyLM. It combines function calling with a **text-only GUI agent** that reads a serialized accessibility tree (no pixels, no screenshots) and emits a grounded UI action. It can also decompose a multi-step goal and drive it to completion, emitting `finish` when it judges the goal done.

It's part of the **HobbyLM** family — a 500M sparse-MoE model (and its variants) built from scratch on a
hobby budget: FineWeb, a handful of Modal H100 hours, a lot of ablations, and a from-scratch Rust engine
([`hobby-rs`](https://github.com/harishsg993010/HobbyLM)) to run it on a laptop CPU.

## Intended use

Computer-use / GUI automation over a UI-Automation accessibility tree, and general tool / function calling. Serialize the screen as `SCREEN:\n[ControlType] "Name" (state) …`, give the model the 12-action UI schema (multi-step goals add `finish` as a 13th action), and it returns a grounded action as JSON. This model powers the Computer panel in the hobby-chat app.
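
The `SCREEN:` serialization can be sketched as follows. This is a minimal illustration, not the project's serializer: the `Element` dataclass and `serialize_screen` helper are hypothetical names, and both the one-element-per-line layout and the state strings (`focused`, `enabled`) are assumptions. The agent loop in the repo is the authoritative reference for the exact format.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One accessibility-tree node (hypothetical container, not a HobbyLM type)."""
    control_type: str   # e.g. "Button", "Edit"
    name: str           # accessible name shown to the model
    state: str = ""     # e.g. "enabled"; empty means no "(state)" suffix


def serialize_screen(elements: list[Element]) -> str:
    """Render a SCREEN: header followed by one line per element."""
    lines = ["SCREEN:"]
    for el in elements:
        line = f'[{el.control_type}] "{el.name}"'
        if el.state:
            line += f" ({el.state})"
        lines.append(line)
    return "\n".join(lines)


screen = serialize_screen([
    Element("Edit", "Search", "focused"),
    Element("Button", "Go", "enabled"),
])
print(screen)
```

Keeping the serializer a pure function of a flat element list makes it easy to filter or reorder elements before the text reaches the model.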

## Architecture

Every HobbyLM variant shares one core: a **sparse Mixture-of-Experts (MoE)** decoder in the modern
small-MoE style (DeepSeek-V3 / OLMoE lineage), where each design choice was picked by ablation rather
than by guesswork.

| Component | Value |
|---|---|
| Total parameters | ~500M (only a fraction is active per token) |
| Hidden size / layers | 768 / 16 (first FFN dense, the rest MoE) |
| Routed experts / active | 36 / top-6 (+ 1 always-on shared expert) |
| Attention | GQA, 12 query / 3 KV heads, decoupled head-dim 128, per-head QK-norm |
| Router | sigmoid gating, DeepSeek-V3 aux-loss-free load balancing, no top-k renorm |
| Positional | RoPE (θ up to 1e6 for the 8k-context checkpoints) |
| Tokenizer | GPT-2 byte-level BPE (50,304 vocab, sentinel-padded) |
| Optimizer | Muon on the 2-D + per-expert matrices, AdamW on everything else |
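
To make the attention row concrete, here is a small shape calculation. It assumes the usual GQA layout (separate Q, K and V projections from the hidden state, with query heads split evenly across KV heads) and reads "decoupled" as meaning the head dimension is chosen independently of `hidden_size / n_q_heads`. The variable names are illustrative and are not taken from the HobbyLM code.

```python
hidden_size = 768
n_q_heads, n_kv_heads, head_dim = 12, 3, 128

# A "coupled" head-dim would be hidden_size // n_q_heads = 64; here it is 128.
coupled_head_dim = hidden_size // n_q_heads

q_width = n_q_heads * head_dim        # width of the query projection output
kv_width = n_kv_heads * head_dim      # width of the key (and of the value) projection output
group_size = n_q_heads // n_kv_heads  # query heads sharing each KV head

# KV-cache values stored per token per layer (keys + values),
# compared with full multi-head attention using 12 KV heads.
kv_cache_gqa = 2 * kv_width
kv_cache_mha = 2 * q_width

print(coupled_head_dim, q_width, kv_width, group_size, kv_cache_mha // kv_cache_gqa)
```

Under these assumptions each KV head serves four query heads, so the KV cache is a quarter of what 12 full KV heads would need.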

The full ablation log lives in the project's architecture notes. Headline findings: QK-norm is the single
biggest lever; aux-loss-free balancing beats the classic auxiliary loss; ≥32 experts and top-6 routing help;
embedding scaling hurts.
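
The router row in the table can be sketched in plain Python. This is an illustration of the general technique (sigmoid gating, bias-based aux-loss-free balancing in the DeepSeek-V3 style, no renormalization over the selected experts), not the HobbyLM implementation: `route`, `update_bias`, and the `step` size are hypothetical, and the always-on shared expert is left out.

```python
import math

N_EXPERTS, TOP_K = 36, 6


def route(logits: list[float], bias: list[float]) -> list[tuple[int, float]]:
    """Return (expert index, gate weight) pairs for one token."""
    scores = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # sigmoid gating
    # The balancing bias only influences WHICH experts are selected ...
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    # ... while the gate weight stays the raw sigmoid score (no top-k renormalization).
    return [(i, scores[i]) for i in ranked[:TOP_K]]


def update_bias(bias: list[float], load: list[int], step: float = 1e-3) -> list[float]:
    """Nudge under-loaded experts up and over-loaded experts down; no auxiliary loss term."""
    mean_load = sum(load) / len(load)
    new_bias = []
    for b, count in zip(bias, load):
        if count > mean_load:
            new_bias.append(b - step)
        elif count < mean_load:
            new_bias.append(b + step)
        else:
            new_bias.append(b)
    return new_bias


logits = [0.1 * i for i in range(N_EXPERTS)]  # toy logits, largest for expert 35
picked = route(logits, bias=[0.0] * N_EXPERTS)
print([i for i, _ in picked])
print(round(sum(w for _, w in picked), 3))    # gate weights need not sum to 1
```

Because the bias enters only the ranking, load balancing changes which experts fire without distorting the weights used to mix their outputs.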

## Benchmarks

Held-out evaluation of the v4 checkpoint on accessibility-tree grounding and multi-step planning. `Param-halluc`
(param-hallucination) is the rate of invented element names or arguments; strict tree-grounding in the data drives it to **0**.

| Split | JSON-parse | Name-F1 | Value-acc | Exact-match | Param-halluc |
|---|---|---|---|---|---|
| Planning (multi-step goals) | 96.5% | 94.7% | — | 82.6% | 0.0% |
| Grounding (real app trees) | ~96% | 95.5% | 91% | 78.4% | 0.0% |
| Grounding (synthetic screens) | 100% | 90.7% | 88.6% | 72.5% | 0.0% |

For general (non-GUI) function calling, the HobbyLM tool-use lineage scores **~24% average on BFCL v3**
(grammar-constrained). Relevance/abstention is strong (relevance 77.8, beating the needle reference's 61.1);
parallel multi-call is weaker, which is the 500M ceiling. Exact-match understates real quality: many "misses"
are ambiguous numerics (e.g. *"give it a minute"* → `wait(60)` vs the reference `wait(7)`).
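
To illustrate why exact-match is the strictest column, the sketch below scores a single prediction under one plausible reading of the metric names. The `score_action` helper, the `{"name": ..., "arguments": ...}` action shape, the `seconds` and `target` argument names, and the hallucination rule are all assumptions for illustration; this is not the project's evaluation code.

```python
import json


def score_action(pred_text: str, ref: dict, screen_names: set[str]) -> dict:
    """Score one predicted action string against a reference action."""
    try:
        pred = json.loads(pred_text)
    except json.JSONDecodeError:
        pred = None
    if not isinstance(pred, dict):
        return {"json_parse": False, "name_match": False,
                "exact_match": False, "hallucinated": False}

    args = pred.get("arguments")
    target = args.get("target") if isinstance(args, dict) else None
    return {
        "json_parse": True,
        "name_match": pred.get("name") == ref["name"],
        "exact_match": pred == ref,  # action name AND every argument must agree
        # Assumed rule: a named target that is absent from the screen counts as invented.
        "hallucinated": target is not None and target not in screen_names,
    }


ref = {"name": "wait", "arguments": {"seconds": 7}}
pred_text = '{"name": "wait", "arguments": {"seconds": 60}}'
result = score_action(pred_text, ref, screen_names={"Go", "Search"})
print(result)
```

Here the prediction parses, picks the right action, and invents nothing, yet still counts as an exact-match miss because the numeric argument differs.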

> **How these were measured.** All language-model scores are **0-shot** through our own port of
> EleutherAI's `lm-evaluation-harness` (a custom `MoELMWrapper` that runs log-likelihood scoring over the
> HobbyLM MoE + GPT-2 tokenizer). Reference models in the comparison table were run through the **identical
> harness and task set**, so the numbers are apples-to-apples with ours — they are *not* copied from other
> model cards. We validated the harness against published cards (e.g. TinyLlama 52.75 vs card 52.99). These
> are small research models: read the numbers in context, not as leaderboard claims.

## Usage

### Python (PyTorch reference implementation)

HobbyLM is a custom sparse-MoE architecture — there's no `transformers` `AutoModel` for it, so load it with
the small reference implementation from the [GitHub repo](https://github.com/harishsg993010/HobbyLM):

```python
# HobbyLM is a CUSTOM sparse-MoE architecture, so load it with the reference implementation —
# NOT transformers.AutoModelForCausalLM (there is no AutoModel mapping for this arch).
# pip install torch safetensors tiktoken huggingface_hub
# git clone https://github.com/harishsg993010/HobbyLM && cd HobbyLM

import json, torch, tiktoken
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file
from hobbylm.config import ModelConfig
from hobbylm.model import MoETransformer
from hobbylm.generate import generate

repo = "rootxhacker/HobbyLM-Computer-Use"
# Build the config from config.json, skipping its "preset" key.
with open(hf_hub_download(repo, "config.json")) as f:
    cfg = ModelConfig(**{k: v for k, v in json.load(f).items() if k != "preset"})
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cfg.expert_backend = "grouped" if device.type == "cuda" else "bmm"

model = MoETransformer(cfg).to(device).eval()
model.load_state_dict(load_file(hf_hub_download(repo, "model.safetensors")))

enc = tiktoken.get_encoding("gpt2")
prompt = "USER: What is 7 plus 2?\nASSISTANT:"
ids = torch.tensor([enc.encode_ordinary(prompt)], device=device)
out = generate(model, ids, max_new_tokens=64, temperature=0.7, top_k=0, device=device,
               repetition_penalty=1.3)               # temperature=0.0 for greedy
print(enc.decode(out[0].tolist()))
```

> For GUI / tool use, the real prompt format is `TOOLS: [<schema>]\nSCREEN:\n[ControlType] "Name" (state) …\nUSER: <instruction>\nASSISTANT:` and the model replies with a JSON action. The end-to-end agent loop lives in `agents/` in the repo.
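
A minimal sketch of assembling that prompt and parsing the reply, using only the standard library. `build_prompt` and `parse_action` are hypothetical helpers, the single `click` schema entry and its fields are invented for illustration, and the hard-coded completion stands in for real model output. The agent loop in `agents/` remains the reference for the actual schema and parsing.

```python
import json
from typing import Optional


def build_prompt(tools: list[dict], screen: str, instruction: str) -> str:
    """Assemble TOOLS / SCREEN / USER / ASSISTANT in the order described above."""
    return f"TOOLS: {json.dumps(tools)}\n{screen}\nUSER: {instruction}\nASSISTANT:"


def parse_action(completion: str) -> Optional[dict]:
    """Return the first JSON object in the completion, or None if none parses."""
    start = completion.find("{")
    if start == -1:
        return None
    try:
        # raw_decode stops after the first complete JSON value and ignores trailing text.
        obj, _end = json.JSONDecoder().raw_decode(completion[start:])
    except json.JSONDecodeError:
        return None
    return obj


tools = [{"name": "click", "parameters": {"target": "string"}}]  # illustrative entry only
screen = 'SCREEN:\n[Edit] "Search" (focused)\n[Button] "Go" (enabled)'
prompt = build_prompt(tools, screen, "Press Go")

completion = ' {"name": "click", "arguments": {"target": "Go"}}\nUSER:'  # stand-in output
action = parse_action(completion)
print(prompt)
print(action)
```

In a real loop, `prompt` would be tokenized and passed to `generate` as in the Python example, and the parsed action checked against the element names on screen before it is executed.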

### GGUF + hobby-rs (CPU)

GGUF builds (architecture `hobbylm`) live in [`rootxhacker/HobbyLM-gguf`](https://huggingface.co/rootxhacker/HobbyLM-gguf). They load
directly in the from-scratch `hobby-rs` CPU engine — **stock llama.cpp won't load them** without registering
the `hobbylm` architecture first.

```bash
hobby-rs --model HobbyLM-Computer-Use.gguf --prompt "..." --n 64
```

## Training

Continued SFT from the combined tool checkpoint on three data sources: synthetic accessibility-tree data (Gemini-generated, strictly tree-validated), real-app UI trees, and planning trajectories, trained with a weighted loss. The action vocabulary has 13 entries: 12 UI actions plus `finish`.

## Limitations

- Per-step grounding is ~80% accurate; on **long** goals those errors compound (short tasks usually complete, long ones can drift) and there is no per-step recovery.
- Trained on trees capped at ~45 elements (2k-context era); very large raw UI trees should be filtered.
- Near-identical controls (e.g. digit buttons) occasionally mis-ground.
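
As a rough back-of-the-envelope for the compounding claim, assume each step succeeds independently with probability `p` and nothing is recovered after a miss; an `n`-step goal then completes with probability `p ** n`. Independence is a simplifying assumption, and `0.8` is only the approximate per-step figure quoted in the first bullet.

```python
def completion_probability(per_step_accuracy: float, n_steps: int) -> float:
    """Chance that every one of n steps is grounded correctly, assuming independent steps."""
    return per_step_accuracy ** n_steps


for n in (1, 3, 5, 10):
    print(n, round(completion_probability(0.8, n), 3))
```

Under this simple model a 3-step task completes about half the time (0.512), a 5-step task about a third of the time (0.328), and a 10-step task about one time in ten (0.107).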

## License

Apache-2.0. Weights aren't a substitute for judgement — this is a research / hobby model at the 500M scale,
not a production system.