---
language:
- en
license: apache-2.0
tags:
- merge
- mergekit
- slerp
- agent
- gui-automation
- vision
- multimodal
- far-7b
- ui-tars
base_model:
- microsoft/Fara-7B
- ByteDance-Seed/UI-TARS-1.5-7B
library_name: transformers
pipeline_tag: image-text-to-text
---

# Fara-TARS-7B: The Hybrid Reasoning & GUI Agent

**Fara-TARS-7B** is a merged model that combines the high-level reasoning and planning capabilities of **Microsoft Fara-7B** with the precise GUI grounding and agentic capabilities of **ByteDance UI-TARS-1.5-7B**.

The model supports a **Hybrid Mode**: depending on the user prompt, it either writes a complex text plan (Reasoning) or emits precise coordinate actions (Agentic Tool Calls).

## Key Capabilities

| Capability | Performance | Description |
| :--- | :--- | :--- |
| **GUI Grounding** | 🟢 **SOTA** | Accurately maps text instructions to `[x, y]` coordinates (e.g., "Click Submit" -> `[1200, 800]`). |
| **Reasoning** | 🟢 **Excellent** | Can generate long-form plans (e.g., "Weekly Python Learning Plan") without hallucinating clicks. |
| **Language** | 🟢 **English-Only** | Tuned to strictly follow English instructions, avoiding the language bleeding common in TARS merges. |
| **Agentic Output** | 🟢 **Structured** | Outputs actions as strict JSON inside tool-call tags: `<tool_call>{"name": "left_click", ...}</tool_call>`. |
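
As a small illustration of consuming a grounding result, the sketch below checks that a predicted `[x, y]` point lies inside the screenshot before anything is clicked. The helper name `point_in_bounds` is hypothetical, and the sketch assumes coordinates are absolute pixels in the screenshot's coordinate space, which this card does not state explicitly.

```python
def point_in_bounds(point, width, height):
    """Return True if `point` is an [x, y] pair inside a width x height screen.

    Assumption: coordinates are absolute pixels with the origin at the top-left.
    """
    if not isinstance(point, (list, tuple)) or len(point) != 2:
        return False
    x, y = point
    # Reject non-numeric values (bool is excluded even though it subclasses int)
    for value in (x, y):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
    return 0 <= x < width and 0 <= y < height


# Example: the point from the table above on a 1920x1080 screenshot
print(point_in_bounds([1200, 800], 1920, 1080))
```

A check like this is cheap to run before dispatching a click and catches malformed or off-screen points early.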

## How to Use (Inference Code)

To get both Agent Mode and Text Mode working reliably, **use the inference class and generation configuration below**. The class injects the tool schema into the system prompt, and the generation settings prevent repetition loops.

### Installation
```bash
# accelerate is needed because the class below loads the model with device_map
pip install torch transformers accelerate pillow
```

### Python Inference Class
Use this class to interact with the model. It handles the system prompt injection and JSON parsing automatically.

```python
import json
import re

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoTokenizer, GenerationConfig


class FaraAgent:
    def __init__(self, model_path, device="auto"):
        print(f"Loading Fara-TARS from {model_path}...")
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map=device,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

        # Safety fix for padding: fall back to EOS when no pad token is defined
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        # Define Agent Tools
        self.tools_schema = [
            {"name": "left_click", "description": "Click coordinate [x, y]", "parameters": {"type": "object", "properties": {"point": {"type": "array"}}}},
            {"name": "type_text", "description": "Type text", "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
            {"name": "scroll", "description": "Scroll screen", "parameters": {"type": "object", "properties": {"pixels": {"type": "integer"}}}},
            {"name": "terminate", "description": "Task done", "parameters": {"type": "object", "properties": {"status": {"type": "string"}}}},
        ]

    def _format_prompt(self, user_prompt):
        # Injects the tool schema and the output-format instructions
        tools_json = json.dumps(self.tools_schema, indent=2)
        system = (
            "You are Fara-TARS, a GUI automation agent.\n"
            f"AVAILABLE TOOLS:\n{tools_json}\n\n"
            "INSTRUCTIONS:\n"
            "1. Reason first, then act.\n"
            "2. Output valid JSON inside <tool_call> tags.\n"
            # Plain (non-f) string, so the braces are written once
            '3. Format: <tool_call>{"name": "left_click", ...}</tool_call>'
        )
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
            "<|im_start|>assistant\n"
        )

    def _repair_json(self, json_str):
        # Auto-fixes common LLM JSON errors: smart/single quotes and keys
        # that are missing their opening quote (e.g. {name": ...})
        json_str = json_str.replace("“", '"').replace("”", '"').replace("'", '"')
        # The lookbehind skips keys that are already correctly quoted
        json_str = re.sub(r'(?<!["\w])(\w+)"\s*:', r'"\1":', json_str)
        return json_str

    def run(self, prompt, image_path=None):
        formatted_prompt = self._format_prompt(prompt)

        # Handle Image Input (Optional)
        if image_path:
            image = Image.open(image_path).convert("RGB")
            # NOTE: build_conversation_input_ids is not a standard transformers
            # method; this branch only works if the checkpoint's remote code defines it.
            inputs = self.model.build_conversation_input_ids(
                tokenizer=self.tokenizer, query=formatted_prompt, image=image
            )
        else:
            inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)

        # Critical: Stop generation at tool close to prevent loops
        stop_strings = ["</tool_call>", "<|im_end|>"]

        # Optimized Config
        config = GenerationConfig(
            max_new_tokens=2048,
            do_sample=True,
            temperature=0.4,
            top_p=0.95,
            repetition_penalty=1.15,  # Prevents "United.com" loops
            no_repeat_ngram_size=0,   # Must be 0 to allow JSON keys
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
        )

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                generation_config=config,
                tokenizer=self.tokenizer,  # required when stop_strings is used
                stop_strings=stop_strings,
            )

        # Decode only the newly generated tokens
        input_len = inputs["input_ids"].shape[1]
        raw_response = self.tokenizer.decode(output[0][input_len:], skip_special_tokens=True)

        # Parse Output
        tool_action = None
        text_content = raw_response

        if "<tool_call>" in raw_response:
            parts = raw_response.split("<tool_call>")
            text_content = parts[0].strip()
            tool_str = parts[1].split("</tool_call>")[0].strip()
            try:
                # Try the raw string first so valid JSON is never altered
                tool_action = json.loads(tool_str)
            except json.JSONDecodeError:
                try:
                    tool_action = json.loads(self._repair_json(tool_str))
                except json.JSONDecodeError:
                    tool_action = {"error": "malformed_json", "raw": tool_str}

        return {"thought": text_content, "action": tool_action}


# Usage
agent = FaraAgent("your-username/Fara-TARS-7B")
result = agent.run("Click the Submit button at (1200, 800)")
print(result)
```

## Benchmark Performance

The model was evaluated on a suite of prompts covering GUI Grounding, Web Automation, Complex Reasoning, Hybrid output, and Safety.

| Category | Task | Result Type | Performance |
| :--- | :--- | :--- | :--- |
| **GUI Grounding** | "Click Submit at (1200, 800)" | **Tool Call** | ✅ Correct JSON: `{"point": [1200, 800]}` |
| **Web Automation** | "Type 'Hello World' in search" | **Tool Call** | ✅ Correct JSON: `{"name": "type", "text": "Hello World"}` |
| **Reasoning** | "Design a Weekly Python Plan" | **Text** | ✅ Generates full Markdown plan (900+ tokens) |
| **Hybrid** | "Compare Selenium vs Playwright" | **Agentic Text** | ✅ Uses `type` tool to output a Markdown table |
| **Safety** | "Stop at critical payment point" | **Tool Call** | ✅ Uses `terminate` tool with status `stop_confirm` |
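
The result types in the table suggest that a caller needs to branch on what the model returned. The sketch below shows one way to do that for the `{"thought": ..., "action": ...}` dictionary produced by the inference class. The function name `classify_result` and the four labels are hypothetical and chosen only for illustration.

```python
def classify_result(result):
    """Label a {'thought': ..., 'action': ...} dict by how it should be handled."""
    action = result.get("action")
    if action is None:
        return "text"       # pure reasoning output, no tool call
    if "error" in action:
        return "error"      # e.g. {"error": "malformed_json", "raw": "..."}
    if action.get("name") == "terminate":
        return "terminate"  # the agent asked to stop, e.g. before a payment step
    return "tool_call"      # any other action should be executed by the caller


# Example with a hand-written result (no model needed)
example = {"thought": "Payment page reached.", "action": {"name": "terminate", "status": "stop_confirm"}}
print(classify_result(example))
```

Handling `terminate` as its own case keeps the stop condition explicit instead of treating it as just another action to execute.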

## Merge Details

This model was merged using **Mergekit**.

### Configuration
```yaml
models:
  - model: microsoft/Fara-7B
  - model: ByteDance-Seed/UI-TARS-1.5-7B
merge_method: slerp
base_model: microsoft/Fara-7B
dtype: bfloat16
parameters:
  t:
    # 5-point gradient:
    # 0.1 (Start): Mostly Fara -> Ensures input understanding and English grammar.
    # 0.3 -> 0.5 (Middle): Blends TARS capability for reasoning and logic.
    # 0.1 (End): Mostly Fara -> Ensures the output stops correctly and doesn't loop.
    - value: [0.1, 0.3, 0.5, 0.3, 0.1]
```
*(Note: the `slerp` merge alone does not produce stable output. The inference parameters `temperature=0.4` and `repetition_penalty=1.15` from the How to Use section are required to stabilize it.)*
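
To make the `t` gradient more concrete, the sketch below shows spherical linear interpolation (slerp) between two weight vectors, and how a 5-point gradient could be spread across layer depth. It is a simplified illustration in plain Python, not Mergekit's implementation: the helper names `layer_t` and `slerp` are hypothetical, and linear interpolation of the gradient over relative layer depth is an assumption about how the list is applied. With `t = 0` the result equals the base model (Fara) and with `t = 1` it equals the other model (UI-TARS).

```python
import math


def layer_t(layer_idx, num_layers, gradient):
    """Interpolate the gradient anchor points at a layer's relative depth (assumed behavior)."""
    if num_layers == 1:
        return gradient[0]
    pos = layer_idx / (num_layers - 1) * (len(gradient) - 1)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(gradient) - 1)
    frac = pos - lo
    return gradient[lo] * (1 - frac) + gradient[hi] * frac


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation between two equal-length vectors."""
    n0 = math.sqrt(sum(a * a for a in v0))
    n1 = math.sqrt(sum(b * b for b in v1))
    if n0 < eps or n1 < eps:
        # Degenerate input: fall back to plain linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if math.sin(theta) < eps:
        # Nearly parallel vectors: slerp reduces to linear interpolation
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


gradient = [0.1, 0.3, 0.5, 0.3, 0.1]
# Hypothetical 9-layer model: t rises toward the middle and falls again
print([round(layer_t(i, 9, gradient), 2) for i in range(9)])
print(slerp(0.5, [1.0, 0.0], [0.0, 1.0]))
```

Under these assumptions the first and last layers stay close to Fara, while the middle layers take the largest share from UI-TARS, which matches the intent stated in the config comments.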

## Limitations

1. **Strict Prompting:** The model expects the specific System Prompt defined in the usage class. Without it, it may hallucinate tool names.
2. **Repetition:** In extremely long lists (100+ items), the model may repeat. The recommended `repetition_penalty=1.15` fixes this for 99% of cases.
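
Because tool names can be hallucinated, it may help to validate a parsed action before executing it. The sketch below checks the `name` field against the four tools defined in the usage class. The names `ALLOWED_TOOLS` and `validate_action` are hypothetical, and the sketch assumes the action is a flat JSON object with a `name` key.

```python
ALLOWED_TOOLS = {"left_click", "type_text", "scroll", "terminate"}


def validate_action(action, allowed=ALLOWED_TOOLS):
    """Return (ok, reason) for a parsed tool-call dict."""
    if not isinstance(action, dict):
        return False, "action is not a JSON object"
    name = action.get("name")
    # Guard against missing or non-string names before the set lookup
    if not isinstance(name, str) or name not in allowed:
        return False, f"unknown tool: {name!r}"
    return True, "ok"


# Example: a hallucinated tool name is rejected, a schema tool is accepted
print(validate_action({"name": "click", "point": [1200, 800]}))
print(validate_action({"name": "left_click", "point": [1200, 800]}))
```

Rejected actions can be logged or sent back to the model in a retry prompt rather than executed.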

## License

Apache 2.0

---