Fara-TARS-7B / README.md
yasserrmd's picture
Update README.md
b4cd15b verified
---
language:
- en
license: apache-2.0
tags:
- merge
- mergekit
- slerp
- agent
- gui-automation
- vision
- multimodal
- far-7b
- ui-tars
base_model:
- microsoft/Fara-7B
- ByteDance-Seed/UI-TARS-1.5-7B
library_name: transformers
pipeline_tag: image-text-to-text
---
# Fara-TARS-7B: The Hybrid Reasoning & GUI Agent
**Fara-TARS-7B** is a state-of-the-art merged model that combines the high-level reasoning and planning capabilities of **Microsoft Fara-7B** with the precise GUI grounding and agentic capabilities of **ByteDance UI-TARS-7B**.
This model achieves a **Hybrid Mode**: it can seamlessly switch between writing complex text plans (Reasoning) and executing precise coordinate actions (Agentic Tool Calls) based on the user prompt.
## Key Capabilities
| Capability | Performance | Description |
| :--- | :--- | :--- |
| **GUI Grounding** | 🟢 **SOTA** | Accurately maps text instructions to `[x, y]` coordinates (e.g., "Click Submit" -> `[1200, 800]`). |
| **Reasoning** | 🟢 **Excellent** | Can generate long-form plans (e.g., "Weekly Python Learning Plan") without hallucinating clicks. |
| **Language** | 🟢 **English-Only** | Tuned to strictly follow English instructions, eliminating language bleeding common in TARS merges. |
| **Agentic Output** | 🟢 **Structured** | Outputs actions in strict JSON format: `<tool_call>{"name": "click", ...}</tool_call>`. |
## How to Use (Inference Code)
To unlock the full potential of this model (Agent Mode vs Text Mode), **you must use the specific generation configuration below**. This handles the tool schema injection and prevents repetition loops.
### Installation
```bash
pip install torch transformers pillow
```
### Python Inference Class
Use this class to interact with the model. It handles the system prompt injection and JSON parsing automatically.
```python
import torch
from transformers import AutoModelForVision2Seq, AutoTokenizer, GenerationConfig
from PIL import Image
import json
import re
class FaraAgent:
def __init__(self, model_path, device="auto"):
print(f"Loading Fara-TARS from {model_path}...")
self.model = AutoModelForVision2Seq.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map=device,
trust_remote_code=True,
low_cpu_mem_usage=True
)
self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
# Safety fix for padding
if self.tokenizer.pad_token is None:
self.tokenizer.pad_token = self.tokenizer.eos_token
self.tokenizer.pad_token_id = self.tokenizer.eos_token_id
# Define Agent Tools
self.tools_schema = [
{"name": "left_click", "description": "Click coordinate [x, y]", "parameters": {"type": "object", "properties": {"point": {"type": "array"}}}},
{"name": "type_text", "description": "Type text", "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
{"name": "scroll", "description": "Scroll screen", "parameters": {"type": "object", "properties": {"pixels": {"type": "integer"}}}},
{"name": "terminate", "description": "Task done", "parameters": {"type": "object", "properties": {"status": {"type": "string"}}}}
]
def _format_prompt(self, user_prompt):
# Injects the schema and strict English/Format instructions
tools_json = json.dumps(self.tools_schema, indent=2)
system = (
f"You are Fara-TARS, a GUI automation agent.\n"
f"AVAILABLE TOOLS:\n{tools_json}\n\n"
"INSTRUCTIONS:\n"
"1. Reason first, then act.\n"
"2. Output valid JSON inside <tool_call> tags.\n"
"3. Format: <tool_call>{{\"name\": \"left_click\", ...}}</tool_call>"
)
return (
f"<|im_start|>system\n{system}<|im_end|>\n"
f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
f"<|im_start|>assistant\n"
)
def _repair_json(self, json_str):
# Auto-fixes common LLM JSON errors (smart quotes, missing keys)
json_str = json_str.replace("“", '"').replace("”", '"').replace("'", '"')
json_str = re.sub(r'(\w+)"\s*:', r'"\1":', json_str)
return json_str
def run(self, prompt, image_path=None):
formatted_prompt = self._format_prompt(prompt)
# Handle Image Input (Optional)
if image_path:
image = Image.open(image_path).convert("RGB")
inputs = self.model.build_conversation_input_ids(
tokenizer=self.tokenizer, query=formatted_prompt, image=image
)
else:
inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)
# Critical: Stop generation at tool close to prevent loops
stop_strings = ["</tool_call>", "<|im_end|>"]
# Optimized Config
config = GenerationConfig(
max_new_tokens=2048,
do_sample=True,
temperature=0.4,
top_p=0.95,
repetition_penalty=1.15, # Prevents "United.com" loops
no_repeat_ngram_size=0, # Must be 0 to allow JSON keys
pad_token_id=self.tokenizer.pad_token_id,
eos_token_id=self.tokenizer.eos_token_id
)
with torch.no_grad():
output = self.model.generate(
**inputs,
generation_config=config,
tokenizer=self.tokenizer,
stop_strings=stop_strings
)
input_len = inputs['input_ids'].shape[1]
raw_response = self.tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
# Parse Output
tool_action = None
text_content = raw_response
if "<tool_call>" in raw_response:
parts = raw_response.split("<tool_call>")
text_content = parts[0].strip()
tool_str = parts[1].split("</tool_call>")[0].strip()
try:
tool_action = json.loads(self._repair_json(tool_str))
except:
tool_action = {"error": "malformed_json", "raw": tool_str}
return {"thought": text_content, "action": tool_action}
# Usage
agent = FaraAgent("your-username/Fara-TARS-7B")
result = agent.run("Click the Submit button at (1200, 800)")
print(result)
```
## Benchmark Performance
The model was evaluated on a comprehensive suite covering Web Automation, GUI Grounding, and Complex Reasoning.
| Category | Task | Result Type | Performance |
| :--- | :--- | :--- | :--- |
| **GUI Grounding** | "Click Submit at (1200, 800)" | **Tool Call** | ✅ Correct JSON: `{"point": [1200, 800]}` |
| **Web Automation** | "Type 'Hello World' in search" | **Tool Call** | ✅ Correct JSON: `{"name": "type", "text": "Hello World"}` |
| **Reasoning** | "Design a Weekly Python Plan" | **Text** | ✅ Generates full Markdown plan (900+ tokens) |
| **Hybrid** | "Compare Selenium vs Playwright" | **Agentic Text** | ✅ Uses `type` tool to output a Markdown table |
| **Safety** | "Stop at critical payment point" | **Tool Call** | ✅ Uses `terminate` tool with status `stop_confirm` |
## Merge Details
This model was merged using **Mergekit**.
### Configuration
```yaml
models:
- model: microsoft/Fara-7B
- model: ByteDance-Seed/UI-TARS-1.5-7B
merge_method: slerp
base_model: microsoft/Fara-7B
dtype: bfloat16
parameters:
t:
# 5-point gradient:
# 0.1 (Start): Mostly Fara -> Ensures input understanding and English grammar.
# 0.3 -> 0.5 (Middle): Blends TARS capability for reasoning and logic.
# 0.1 (End): Mostly Fara -> Ensures the output stops correctly and doesn't loop.
- value: [0.1, 0.3, 0.5, 0.3, 0.1]
```
*(Note: While `slerp` was used, specific inference parameters (temp=0.4, rep_penalty=1.15) are required to stabilize the output, as documented in the Usage section).*
## Limitations
1. **Strict Prompting:** The model expects the specific System Prompt defined in the usage class. Without it, it may hallucinate tool names.
2. **Repetition:** In extremely long lists (100+ items), the model may repeat. The recommended `repetition_penalty=1.15` fixes this for 99% of cases.
## License
Apache 2.0
---