---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
- vla
datasets:
- TESS-Computer/tess-agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS** is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click type and coordinates) or a keyboard action (typed text or keyboard shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable
- **Training Data**: [tess-agentnet](https://huggingface.co/datasets/TESS-Computer/tess-agentnet) (~312K samples)

## Usage

```python
from PIL import Image

# Clone the TESS repo so that `test_checkpoint` is importable:
# git clone https://github.com/husseinlezzaik/TESS.git
# cd TESS/model
from test_checkpoint import load_model, predict

# Load the checkpoint and processor
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference on a screenshot
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Example mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Example keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

**Mouse actions:**
```python
{
    'action_type': 'mouse',
    'xy': [x, y],  # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```
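
The `xy` output is relative to the screenshot, so scale it back to pixels before clicking; a minimal sketch:

```python
# Scale normalized model output back to pixel coordinates
width, height = image.size  # the PIL screenshot from the usage example above
x_norm, y_norm = result['xy']
x_px, y_px = int(x_norm * width), int(y_norm * height)
```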

**Keyboard actions:**
```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```
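
The keyboard `value` strings use an angle-bracket token format for special keys. A hedged sketch of mapping them onto `pyautogui`; the parsing rules are assumptions inferred from the examples above, and `'super'` may need remapping to `'win'` or `'command'` depending on the platform:

```python
import pyautogui

def execute_keyboard(result):
    """Sketch: map a TESS keyboard action onto pyautogui (not part of the repo)."""
    if result['action'] == 'type':
        pyautogui.write(result['value'])                        # literal text
    elif result['action'] == 'press':
        pyautogui.press(result['value'].strip('<>').lower())    # '<ENTER>' -> 'enter'
    elif result['action'] == 'hotkey':
        keys = result['value'].strip('<>').lower().split('+')   # '<SUPER+C>' -> ['super', 'c']
        pyautogui.hotkey(*keys)
```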

## Architecture

```
Screenshot + Instruction β†’ SmolVLM2 β†’ Shared MLP β†’ Router
                                                    ↓
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    ↓                               ↓
                              Mouse Branch                   Keyboard Branch
                              (XY + Click heads)            (VLM text generation)
```
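
A hedged PyTorch sketch of the shared MLP, router, and mouse heads; the layer names and sizes are illustrative assumptions, not the actual TESS implementation. The keyboard branch reuses the VLM's own text generation, so it is not shown:

```python
import torch
import torch.nn as nn

class ActionHeads(nn.Module):
    """Illustrative sketch of the shared MLP, router, and mouse heads (sizes assumed)."""
    def __init__(self, hidden_dim=960, num_click_types=5):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        self.router = nn.Linear(hidden_dim, 2)             # mouse vs. keyboard branch
        self.xy_head = nn.Linear(hidden_dim, 2)            # normalized (x, y)
        self.click_head = nn.Linear(hidden_dim, num_click_types)

    def forward(self, vlm_features):
        h = self.shared(vlm_features)
        branch_logits = self.router(h)                     # which branch to take
        xy = torch.sigmoid(self.xy_head(h))                # coordinates in [0, 1]
        click_logits = self.click_head(h)                  # click-type logits
        return branch_logits, xy, click_logits
```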

## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for heads, 5e-4 for embeddings; see the sketch below)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours
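
The two learning rates above imply separate AdamW parameter groups. A minimal sketch of that setup, assuming the trainable parts are exposed as `model.heads` and `model.embeddings` (both attribute names are hypothetical):

```python
import torch

# One parameter group per learning rate, as listed above
optimizer = torch.optim.AdamW([
    {"params": model.heads.parameters(), "lr": 2e-4},        # router + mouse heads
    {"params": model.embeddings.parameters(), "lr": 5e-4},   # trainable embeddings
])
```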

## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```