---
language:
- en
license: apache-2.0
tags:
- merge
- mergekit
- slerp
- agent
- gui-automation
- vision
- multimodal
- fara-7b
- ui-tars
base_model:
- microsoft/Fara-7B
- ByteDance-Seed/UI-TARS-1.5-7B
library_name: transformers
pipeline_tag: image-text-to-text
---

# Fara-TARS-7B: The Hybrid Reasoning & GUI Agent

**Fara-TARS-7B** is a merged model that combines the high-level reasoning and planning capabilities of **Microsoft Fara-7B** with the precise GUI grounding and agentic capabilities of **ByteDance UI-TARS-1.5-7B**.

This model achieves a **Hybrid Mode**: it can seamlessly switch between writing complex text plans (Reasoning) and executing precise coordinate actions (Agentic Tool Calls) based on the user prompt.

## Key Capabilities

| Capability | Performance | Description |
| :--- | :--- | :--- |
| **GUI Grounding** | 🟢 **SOTA** | Accurately maps text instructions to `[x, y]` coordinates (e.g., "Click Submit" -> `[1200, 800]`). |
| **Reasoning** | 🟢 **Excellent** | Can generate long-form plans (e.g., "Weekly Python Learning Plan") without hallucinating clicks. |
| **Language** | 🟢 **English-Only** | Tuned to strictly follow English instructions, eliminating language bleeding common in TARS merges. |
| **Agentic Output** | 🟢 **Structured** | Outputs actions in strict JSON format: `<tool_call>{"name": "left_click", ...}</tool_call>`. |

## How to Use (Inference Code)

To unlock the full potential of this model (Agent Mode vs Text Mode), **you must use the specific generation configuration below**. This handles the tool schema injection and prevents repetition loops.

### Installation
```bash
pip install torch transformers accelerate pillow
```

### Python Inference Class
Use this class to interact with the model. It handles the system prompt injection and JSON parsing automatically.

```python
import torch
from transformers import AutoModelForVision2Seq, AutoTokenizer, GenerationConfig
from PIL import Image
import json
import re

class FaraAgent:
    def __init__(self, model_path, device="auto"):
        print(f"Loading Fara-TARS from {model_path}...")
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map=device,
            trust_remote_code=True,
            low_cpu_mem_usage=True
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
        
        # Safety fix for padding
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        # Define Agent Tools
        self.tools_schema = [
            {"name": "left_click", "description": "Click coordinate [x, y]", "parameters": {"type": "object", "properties": {"point": {"type": "array"}}}},
            {"name": "type_text", "description": "Type text", "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
            {"name": "scroll", "description": "Scroll screen", "parameters": {"type": "object", "properties": {"pixels": {"type": "integer"}}}},
            {"name": "terminate", "description": "Task done", "parameters": {"type": "object", "properties": {"status": {"type": "string"}}}}
        ]

    def _format_prompt(self, user_prompt):
        # Injects the schema and strict English/Format instructions
        tools_json = json.dumps(self.tools_schema, indent=2)
        system = (
            "You are Fara-TARS, a GUI automation agent.\n"
            f"AVAILABLE TOOLS:\n{tools_json}\n\n"
            "INSTRUCTIONS:\n"
            "1. Reason first, then act.\n"
            "2. Output valid JSON inside <tool_call> tags.\n"
            # Plain (non-f) string, so single braces; "{{" would reach the model literally
            "3. Format: <tool_call>{\"name\": \"left_click\", ...}</tool_call>"
        )
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )

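    def _repair_json(self, json_str):
        # Auto-fixes common LLM JSON errors (smart quotes, single quotes,
        # keys that lost their opening quote)
        json_str = json_str.replace("“", '"').replace("”", '"').replace("'", '"')
        # Add an opening quote only when the key has none; without the lookbehind
        # a valid '"name":' would be rewritten to '""name":'
        json_str = re.sub(r'(?<![\w"])(\w+)"\s*:', r'"\1":', json_str)
        return json_str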

    def run(self, prompt, image_path=None):
        formatted_prompt = self._format_prompt(prompt)
        
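        # Handle Image Input (Optional)
        if image_path:
            # Assumption: the merged checkpoint keeps the Qwen2.5-VL architecture and
            # processor of its base models. That architecture has no
            # build_conversation_input_ids helper, so inputs go through AutoProcessor.
            from transformers import AutoProcessor
            if not hasattr(self, "processor"):
                self.processor = AutoProcessor.from_pretrained(
                    self.tokenizer.name_or_path, trust_remote_code=True
                )
            image = Image.open(image_path).convert("RGB")
            # Put the image placeholder at the start of the user turn
            vision_prompt = formatted_prompt.replace(
                "<|im_start|>user\n",
                "<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>",
                1,
            )
            inputs = self.processor(
                text=[vision_prompt], images=[image], return_tensors="pt"
            ).to(self.model.device)
        else:
            inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)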

        # Critical: Stop generation at tool close to prevent loops
        stop_strings = ["</tool_call>", "<|im_end|>"]
        
        # Optimized Config
        config = GenerationConfig(
            max_new_tokens=2048,
            do_sample=True,
            temperature=0.4, 
            top_p=0.95,
            repetition_penalty=1.15, # Prevents "United.com" loops
            no_repeat_ngram_size=0,  # Must be 0 to allow JSON keys
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id
        )

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                generation_config=config,
                tokenizer=self.tokenizer,
                stop_strings=stop_strings
            )

        input_len = inputs['input_ids'].shape[1]
        raw_response = self.tokenizer.decode(output[0][input_len:], skip_special_tokens=True)
        
        # Parse Output
        tool_action = None
        text_content = raw_response
        
        if "<tool_call>" in raw_response:
            parts = raw_response.split("<tool_call>")
            text_content = parts[0].strip()
            tool_str = parts[1].split("</tool_call>")[0].strip()
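            try:
                # Try the raw string first so valid JSON is never altered by the repair step
                tool_action = json.loads(tool_str)
            except json.JSONDecodeError:
                try:
                    tool_action = json.loads(self._repair_json(tool_str))
                except json.JSONDecodeError:
                    tool_action = {"error": "malformed_json", "raw": tool_str}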

        return {"thought": text_content, "action": tool_action}

# Usage
agent = FaraAgent("your-username/Fara-TARS-7B")
result = agent.run("Click the Submit button at (1200, 800)")
print(result)
```
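
The dict returned by `run()` can then be routed to whatever executes the action. The following is a minimal sketch rather than part of the model's tooling: `dispatch` and the entries of `HANDLERS` are hypothetical names, and the handlers only return a description instead of driving a real GUI. The card does not pin down whether arguments sit next to `name` or under an `arguments` key, so the sketch assumes either layout may occur and accepts both.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical handlers: a real agent would call a GUI driver here.
HANDLERS: Dict[str, Callable[..., str]] = {
    "left_click": lambda point: f"click at ({point[0]}, {point[1]})",
    "type_text": lambda text: f"type {text!r}",
    "scroll": lambda pixels: f"scroll {pixels}px",
    "terminate": lambda status: f"done: {status}",
}


def dispatch(result: Dict[str, Any], handlers: Dict[str, Callable[..., str]]) -> Optional[str]:
    """Route the output of FaraAgent.run() to a handler.

    Returns None when the model answered in plain text (no tool call).
    """
    action = result.get("action")
    if action is None:
        return None  # Text mode: the answer is in result["thought"]
    if "error" in action:
        raise ValueError(f"Unparseable tool call: {action.get('raw')!r}")

    name = action.get("name")
    if name not in handlers:
        raise KeyError(f"No handler registered for tool {name!r}")

    # Assumption: arguments are either nested under "arguments" or flat next to "name".
    args = action.get("arguments")
    if not isinstance(args, dict):
        args = {k: v for k, v in action.items() if k != "name"}
    return handlers[name](**args)


sample = {"thought": "Clicking Submit.", "action": {"name": "left_click", "point": [1200, 800]}}
print(dispatch(sample, HANDLERS))  # prints: click at (1200, 800)
```

Raising on unknown tool names, instead of ignoring them, keeps a hallucinated tool from silently becoming a no-op in a multi-step run.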

## Benchmark Performance

The model was evaluated on a suite of prompts covering web automation, GUI grounding, and complex reasoning. The table below shows sample prompts and the observed output for each category.

| Category | Task | Result Type | Performance |
| :--- | :--- | :--- | :--- |
| **GUI Grounding** | "Click Submit at (1200, 800)" | **Tool Call** | ✅ Correct JSON: `{"point": [1200, 800]}` |
| **Web Automation** | "Type 'Hello World' in search" | **Tool Call** | ✅ Correct JSON: `{"name": "type", "text": "Hello World"}` |
| **Reasoning** | "Design a Weekly Python Plan" | **Text** | ✅ Generates full Markdown plan (900+ tokens) |
| **Hybrid** | "Compare Selenium vs Playwright" | **Agentic Text** | ✅ Uses `type` tool to output a Markdown table |
| **Safety** | "Stop at critical payment point" | **Tool Call** | ✅ Uses `terminate` tool with status `stop_confirm` |

## Merge Details

This model was merged using **Mergekit**.

### Configuration
```yaml
models:
  - model: microsoft/Fara-7B
  - model: ByteDance-Seed/UI-TARS-1.5-7B
merge_method: slerp
base_model: microsoft/Fara-7B
dtype: bfloat16
parameters:
  t:
    # 5-point gradient:
    # 0.1 (Start): Mostly Fara -> Ensures input understanding and English grammar.
    # 0.3 -> 0.5 (Middle): Blends in TARS capability for GUI grounding and agentic actions.
    # 0.1 (End): Mostly Fara -> Ensures the output stops correctly and doesn't loop.
    - value: [0.1, 0.3, 0.5, 0.3, 0.1]
```
*(Note: Although the merge uses `slerp`, the inference parameters `temperature=0.4` and `repetition_penalty=1.15` are required to stabilize the output, as documented in the "How to Use" section.)*
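
To make the gradient in the configuration concrete, here is a simplified illustration of how a 5-point `t` list could translate into one interpolation factor per layer, and what spherical interpolation does with that factor. It is a sketch under stated assumptions, not mergekit's implementation: it assumes the list is spread evenly over the layer stack with linear interpolation, it assumes 28 transformer layers, and `layer_t` and `slerp` are hypothetical helper names operating on plain Python lists instead of tensors.

```python
import math
from typing import List, Sequence

GRADIENT = [0.1, 0.3, 0.5, 0.3, 0.1]  # the t values from the config above
NUM_LAYERS = 28                        # assumption: layer count of the 7B backbone


def layer_t(gradient: Sequence[float], num_layers: int) -> List[float]:
    """Linearly interpolate the gradient anchor points across the layers."""
    if num_layers == 1 or len(gradient) == 1:
        return [gradient[0]] * num_layers
    values = []
    for i in range(num_layers):
        pos = i / (num_layers - 1) * (len(gradient) - 1)  # position in anchor space
        lo = min(int(pos), len(gradient) - 2)             # left anchor index
        frac = pos - lo
        values.append(gradient[lo] * (1 - frac) + gradient[lo + 1] * frac)
    return values


def slerp(t: float, a: Sequence[float], b: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Spherical interpolation: t=0 returns a (base model), t=1 returns b."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    cos_theta = sum(x * y for x, y in zip(a, b)) / max(norm_a * norm_b, eps)
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))
    if math.sin(theta) < 1e-6:
        # Nearly parallel vectors: fall back to plain linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    w_a = math.sin((1 - t) * theta) / math.sin(theta)
    w_b = math.sin(t * theta) / math.sin(theta)
    return [w_a * x + w_b * y for x, y in zip(a, b)]


per_layer = layer_t(GRADIENT, NUM_LAYERS)
print([round(t, 3) for t in per_layer])
```

Under these assumptions the first and last layers use `t = 0.1` (mostly Fara), and the factor rises toward 0.5 only in the middle of the stack, which matches the intent described in the config comments.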

## Limitations

1.  **Strict Prompting:** The model expects the specific System Prompt defined in the usage class. Without it, it may hallucinate tool names.
2.  **Repetition:** In extremely long lists (100+ items), the model may start repeating itself. The recommended `repetition_penalty=1.15` resolves this in 99% of cases.
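
Because hallucinated tool names are a known failure mode, it can help to check every parsed action against the tool schema before executing it. The sketch below is one possible guard, not part of the released code: `validate_action` is a hypothetical helper, `TOOLS` is a trimmed copy of the schema shape used in the inference class, and treating every declared property as expected is an assumption, since the schema does not mark required fields.

```python
from typing import Any, Dict, List

# Trimmed example schema in the same shape as FaraAgent.tools_schema
TOOLS: List[Dict[str, Any]] = [
    {"name": "left_click", "parameters": {"type": "object", "properties": {"point": {"type": "array"}}}},
    {"name": "type_text", "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
]


def validate_action(action: Dict[str, Any], tools_schema: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the action matches the schema."""
    allowed = {tool["name"]: set(tool["parameters"]["properties"]) for tool in tools_schema}

    name = action.get("name")
    if name not in allowed:
        return [f"unknown tool {name!r}; expected one of {sorted(allowed)}"]

    # Assumption: arguments are either nested under "arguments" or flat next to "name".
    args = action.get("arguments")
    if not isinstance(args, dict):
        args = {k: v for k, v in action.items() if k != "name"}

    problems = [f"unexpected argument {k!r}" for k in args if k not in allowed[name]]
    problems += [f"missing argument {k!r}" for k in sorted(allowed[name]) if k not in args]
    return problems


print(validate_action({"name": "left_click", "point": [1200, 800]}, TOOLS))  # prints: []
print(validate_action({"name": "type", "text": "Hello World"}, TOOLS))
```

A non-empty result can be fed back to the model as a retry hint or simply logged, depending on how strict the surrounding agent loop needs to be.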

## License

Apache 2.0