Northstar CUA Fast
A 4B-parameter computer-use agent model trained with GUI reinforcement learning. It recovers from mistakes, generalizes across environments, and outperforms open-source models twice its size on single-app tasks.
Built for agentic loops where every step is a model call.
| Spec | Value |
|---|---|
| Parameters | 4B |
| Context | 64K tokens |
| Training | GUI reinforcement learning |
| Input | Text + screenshot |
| Output | GUI actions (click, type, scroll, key, drag, ...) |
| Coordinates | 0-999 normalized (model) / pixel-scaled (Responses API) |
| Pricing | < $1/M tokens |
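When calling the model directly (rather than through the Responses API, which already returns pixel-scaled coordinates), the 0-999 normalized outputs must be mapped onto the actual screen. A minimal sketch, assuming a simple linear mapping; the exact rounding convention is an assumption:

```python
# Map model-space coordinates (0-999) onto a concrete display size.
# Only needed when consuming raw model output; the Responses API
# does this scaling for you.
def to_pixels(x_norm: int, y_norm: int, width: int, height: int) -> tuple[int, int]:
    """Convert a 0-999 normalized coordinate to a pixel coordinate."""
    return (
        round(x_norm / 999 * (width - 1)),
        round(y_norm / 999 * (height - 1)),
    )
```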
Highlights
- RL-trained, not just SFT. Trained with GRPO on synthetic GUI environments, producing behaviors that generalize rather than memorize.
- Recovery over raw accuracy. Multi-turn RL training teaches the model to detect failures from history and adapt, which matters far more than single-step click precision for long-horizon tasks.
- Competitive at 4B. Matches or exceeds open-source models with twice the parameter count on single-app desktop tasks.
- Production-ready API. OpenAI-compatible chat completions and a Responses API with pixel-scaled coordinates.
How It Was Trained
The Problem with SFT for Computer Use
Supervised fine-tuning on GUI data saturates after 100-1000 examples per task and degrades other abilities. More critically, SFT improvements do not generalize: the model memorizes state-action pairs rather than learning why an action should be taken. Coordinate prediction under SFT also suffers because all incorrect coordinates are penalized uniformly, so clicking 1 pixel away from ground truth is treated the same as clicking on the opposite side of the screen.
Reinforcement Learning on Synthetic Environments
Using a GRPO loss adapted for multi-modal inputs (built on prime-rl), Northstar CUA Fast was trained on synthetic GUI environments with bounding-box-based reward signals. Key findings:
- Generalization from abstract environments. Training exclusively on simplified, fabricated test environments improved performance on real UI benchmarks (0.39 to 0.53 on an aggregated UI benchmark), surpassing SFT on actual UI datasets.
- Multi-turn RL is critical. Training on ~100 environments requiring 3-15 click interactions produced a 20% absolute improvement on the OSWorld Chrome category, despite zero resemblance between training and evaluation environments.
- Emergent self-correction. The model learns to detect failed interactions from its history and either retry with adjustments or try entirely different approaches. This cannot be systematically derived from SFT because it depends on the model's own action distribution.
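The bounding-box reward mentioned above can be sketched in a few lines. The actual reward shaping used in training is not public, so treat this as illustrative: a click anywhere inside the target's bounding box earns full reward, so near-misses are not punished as harshly as under a uniform SFT coordinate loss.

```python
# Illustrative bounding-box click reward for a synthetic GUI environment.
# bbox = (x_min, y_min, x_max, y_max) of the target element.
def click_reward(x: float, y: float, bbox: tuple[float, float, float, float]) -> float:
    """Return 1.0 if the click lands inside the target's bounding box, else 0.0."""
    x_min, y_min, x_max, y_max = bbox
    return 1.0 if x_min <= x <= x_max and y_min <= y <= y_max else 0.0
```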
Positional Encoding Insights
Analysis of the vision encoder revealed that absolute positional information decays exponentially through attention layers due to normalization. Since 2D-RoPE only encodes relative position, the additive patch embedding (added once at input) is the sole source of absolute coordinate information, and it degrades with depth. Scaling the positional embedding by 3x improved click accuracy from 40% to 80% on a simple red-ball benchmark without any retraining.
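The rescaling itself is a one-line, retraining-free intervention applied at load time. A minimal sketch, where the array name and shape are illustrative and do not reflect the model's real checkpoint layout:

```python
import numpy as np

# The additive patch embedding is the encoder's only source of absolute
# position, and its contribution decays through normalized attention
# layers. Scaling it up at load time strengthens the absolute-coordinate
# signal without any retraining.
POS_SCALE = 3.0

def scale_positional_embedding(pos_embed: np.ndarray, scale: float = POS_SCALE) -> np.ndarray:
    """Amplify the additive absolute positional embedding."""
    return scale * pos_embed

pos_embed = np.ones((256, 768), dtype=np.float32)  # (num_patches, hidden_dim)
scaled = scale_positional_embedding(pos_embed)
```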
OSWorld Benchmark (pass@1, 50 steps)
Evaluated on OSWorld across 369 real-world desktop tasks.
| Domain | UI-TARS 2 | Qwen3 Flash | Northstar CUA Fast (4B) |
|---|---|---|---|
| Chrome | 62.96% | 56.43% | 55.30% |
| Thunderbird | 73.33% | 66.67% | 62.40% |
| LibreOffice Writer | 60.87% | 56.52% | 56.94% |
| OS | 41.67% | 54.17% | 46.26% |
| VLC | 49.94% | 34.41% | 43.87% |
| Overall | 53.1% | 41.6% | 37.01% |
At 4B parameters, Northstar CUA Fast is competitive with open-source models twice its size on single-app tasks. Under the EVOCUA agent harness, EVOCUA-8B averages 32.5% versus 37.0% for Northstar CUA Fast (RL).
Why Recovery Matters More Than Accuracy
For multi-step agentic tasks, the per-step accuracy required to reach a target trajectory success rate scales harshly:
| Trajectory length | For 50% success | For 80% success | For 95% success |
|---|---|---|---|
| 1 | 0.50 | 0.80 | 0.95 |
| 4 | 0.84 | 0.95 | 0.99 |
| 16 | 0.96 | 0.99 | 1.00 |
| 32 | 0.98 | 0.99 | 1.00 |
Even with retry tolerance, the required per-step accuracy for long trajectories becomes impractical. The model's ability to recover from failures and handle out-of-distribution variation matters far more than raw single-step precision.
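The table follows from requiring p_step^n ≥ p_traj for an n-step trajectory with independent steps, i.e. p_step = p_traj^(1/n):

```python
# Required per-step accuracy to reach a target trajectory success rate,
# assuming steps succeed or fail independently.
def required_step_accuracy(traj_success: float, n_steps: int) -> float:
    """Solve p_step ** n_steps == traj_success for p_step."""
    return traj_success ** (1 / n_steps)
```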
Supported Actions
click · double_click · triple_click · right_click · drag · type · key · scroll · hscroll · navigate (browser only) · wait · terminate
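As an illustration only, a client might represent these actions as JSON-style payloads. The field names below are assumptions for the sake of the example, not the documented wire format:

```python
# Hypothetical action payloads covering a few of the supported actions.
click_action = {"type": "click", "x": 512, "y": 384}
type_action = {"type": "type", "text": "hello world"}
scroll_action = {"type": "scroll", "dx": 0, "dy": -3}
```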
Quickstart
Install
```bash
pip install tzafon
```
Responses API (recommended)
```python
import os
from tzafon import Lightcone

client = Lightcone(api_key=os.environ["TZAFON_API_KEY"])

response = client.responses.create(
    model="tzafon.northstar-cua-fast",
    instructions="Click on the Firefox icon.",
    tools=[{
        "type": "computer_use",
        "display_width": 1024,
        "display_height": 768,
        "environment": "browser",
    }],
)
print(response.output)
```
OpenAI-compatible Chat Completions
```python
from openai import OpenAI

client = OpenAI(
    api_key="your-api-key",
    base_url="https://api.tzafon.ai/v1",
)

response = client.chat.completions.create(
    model="tzafon.northstar-cua-fast",
    messages=[
        {"role": "user", "content": [
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}},
            {"type": "text", "text": "Click on the Firefox icon."},
        ]},
    ],
    temperature=0,
    max_tokens=512,
)
print(response.choices[0].message.content)
```
cURL
```bash
curl -X POST https://api.tzafon.ai/v1/responses \
  -H "Authorization: Bearer $TZAFON_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tzafon.northstar-cua-fast",
    "instructions": "Click on the Firefox icon.",
    "tools": [{"type": "computer_use", "display_width": 1024, "display_height": 768}]
  }'
```
Lightcone Agent Harness
Lightcone wraps Northstar CUA Fast into a full desktop automation loop: screenshot, think, act, repeat.
screenshot → Northstar CUA Fast → parse action → execute on computer → repeat
Features: pure-async FastAPI server with SSE streaming, sliding-window context management, Rust-accelerated image processing, and an auto-discovering tool registry.
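The loop above can be sketched as follows. The three helpers are stand-ins for real screen-capture, model-call, and input-injection code, not Lightcone's actual API:

```python
# Minimal sketch of the screenshot → think → act loop that Lightcone
# wraps around Northstar CUA Fast. All three helpers are placeholders.
def take_screenshot() -> bytes:
    return b"<png bytes>"           # placeholder: capture the current screen

def query_model(task: str, screenshot: bytes) -> dict:
    return {"type": "terminate"}    # placeholder: call Northstar CUA Fast

def execute(action: dict) -> None:
    pass                            # placeholder: inject click/type/scroll

def agent_loop(task: str, max_steps: int = 50) -> int:
    """Run the loop until the model terminates or max_steps is reached."""
    for step in range(max_steps):
        action = query_model(task, take_screenshot())
        if action["type"] == "terminate":
            return step             # model decided the task is done
        execute(action)
    return max_steps
```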
```bash
git clone https://github.com/tzafon/lightcone.git
cd lightcone
uv venv && uv sync --extra dev
uv run maturin develop -m native/Cargo.toml
export TZAFON_API_KEY="your-api-key"
lightcone run --task "Open Firefox and search for 'hello world'"
```
What's Open Source vs Hosted
| Component | License | Status |
|---|---|---|
| Lightcone agent harness | Apache 2.0 | GitHub |
| Python SDK (tzafon) | MIT | PyPI |
| Model weights | Apache 2.0 | Tzafon API |
Citation
```bibtex
@misc{tzafon2026northstarcuafast,
  title={Northstar CUA Fast: Lightweight Computer-Use Agent Model},
  author={Tzafon Team},
  year={2026},
  url={https://github.com/tzafon/lightcone},
}
```