---
license: apache-2.0
language:
- en
tags:
- computer-use
- gui-agent
- vision-language-model
- screen-understanding
- vla
datasets:
- TESS-Computer/tess-agentnet
base_model: HuggingFaceTB/SmolVLM2-500M-Instruct
pipeline_tag: image-text-to-text
---

# TESS-500M

**TESS** is a Vision-Language-Action (VLA) model for computer use, inspired by robotic VLAs. Given a screenshot and a natural-language instruction, it predicts either a mouse action (click type and coordinates) or a keyboard action (typed text or keyboard shortcuts).

## Model Description

- **Base Model**: SmolVLM2-500M-Instruct
- **Architecture**: SmolVLM + Router + Mouse/Keyboard heads
- **Parameters**: 508M total, 48M trainable
- **Training Data**: [tess-agentnet](https://huggingface.co/datasets/TESS-Computer/tess-agentnet) (~312K samples)

## Usage

```python
from PIL import Image

# Clone the TESS repo so that `test_checkpoint` is importable:
# git clone https://github.com/husseinlezzaik/TESS.git
# cd TESS/model
from test_checkpoint import load_model, predict

# Load the checkpoint and processor
model, processor = load_model("path/to/checkpoint.pt", device="cuda")

# Run inference on a screenshot
image = Image.open("screenshot.png")
result = predict(model, processor, image, "Click the search button")

print(result)
# Example mouse action:    {'action_type': 'mouse', 'xy': array([0.45, 0.32]), 'click_type': 'LEFT_CLICK'}
# Example keyboard action: {'action_type': 'keyboard', 'action': 'type', 'value': 'hello world'}
```

## Output Format

**Mouse actions:**
```python
{
    'action_type': 'mouse',
    'xy': [x, y],  # Normalized coordinates (0-1)
    'click_type': 'LEFT_CLICK' | 'RIGHT_CLICK' | 'DOUBLE_CLICK' | ...
}
```
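
The `xy` output is relative to the screenshot, so scale it back to pixels before clicking; a minimal sketch:

```python
# Scale normalized model output back to pixel coordinates
width, height = image.size  # the PIL screenshot from the usage example above
x_norm, y_norm = result['xy']
x_px, y_px = int(x_norm * width), int(y_norm * height)
```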

**Keyboard actions:**
```python
{
    'action_type': 'keyboard',
    'action': 'type' | 'press' | 'hotkey',
    'value': 'text to type' | '<ENTER>' | '<SUPER+C>'
}
```
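
The keyboard `value` strings use an angle-bracket token format for special keys. A hedged sketch of mapping them onto `pyautogui`; the parsing rules are assumptions inferred from the examples above, and `'super'` may need remapping to `'win'` or `'command'` depending on the platform:

```python
import pyautogui

def execute_keyboard(result):
    """Sketch: map a TESS keyboard action onto pyautogui (not part of the repo)."""
    if result['action'] == 'type':
        pyautogui.write(result['value'])                        # literal text
    elif result['action'] == 'press':
        pyautogui.press(result['value'].strip('<>').lower())    # '<ENTER>' -> 'enter'
    elif result['action'] == 'hotkey':
        keys = result['value'].strip('<>').lower().split('+')   # '<SUPER+C>' -> ['super', 'c']
        pyautogui.hotkey(*keys)
```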

## Architecture

```
Screenshot + Instruction β†’ SmolVLM2 β†’ Shared MLP β†’ Router
                                                    ↓
                                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                                    ↓                               ↓
                              Mouse Branch                   Keyboard Branch
                              (XY + Click heads)            (VLM text generation)
```
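
A hedged PyTorch sketch of the shared MLP, router, and mouse heads; the layer names and sizes are illustrative assumptions, not the actual TESS implementation. The keyboard branch reuses the VLM's own text generation, so it is not shown:

```python
import torch
import torch.nn as nn

class ActionHeads(nn.Module):
    """Illustrative sketch of the shared MLP, router, and mouse heads (sizes assumed)."""
    def __init__(self, hidden_dim=960, num_click_types=5):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.GELU())
        self.router = nn.Linear(hidden_dim, 2)             # mouse vs. keyboard branch
        self.xy_head = nn.Linear(hidden_dim, 2)            # normalized (x, y)
        self.click_head = nn.Linear(hidden_dim, num_click_types)

    def forward(self, vlm_features):
        h = self.shared(vlm_features)
        branch_logits = self.router(h)                     # which branch to take
        xy = torch.sigmoid(self.xy_head(h))                # coordinates in [0, 1]
        click_logits = self.click_head(h)                  # click-type logits
        return branch_logits, xy, click_logits
```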

## Training

- **Epochs**: 3
- **Batch Size**: 48
- **Optimizer**: AdamW (LR 2e-4 for heads, 5e-4 for embeddings; see the sketch below)
- **Hardware**: NVIDIA H100 80GB
- **Training Time**: ~8 hours
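
The two learning rates above imply separate AdamW parameter groups. A minimal sketch of that setup, assuming the trainable parts are exposed as `model.heads` and `model.embeddings` (both attribute names are hypothetical):

```python
import torch

# One parameter group per learning rate, as listed above
optimizer = torch.optim.AdamW([
    {"params": model.heads.parameters(), "lr": 2e-4},        # router + mouse heads
    {"params": model.embeddings.parameters(), "lr": 5e-4},   # trainable embeddings
])
```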

## Limitations

- Trained primarily on desktop/web screenshots
- English instructions only
- May struggle with unusual UI layouts not seen in training

## License

Apache 2.0

## Citation

```bibtex
@misc{tess2025,
  title={TESS: A Vision-Language-Action Model for Computer Use},
  author={Hussein Lezzaik},
  year={2025},
  url={https://github.com/husseinlezzaik/TESS}
}
```