---
language:
- en
license: apache-2.0
tags:
- merge
- mergekit
- slerp
- agent
- gui-automation
- vision
- multimodal
- far-7b
- ui-tars
base_model:
- microsoft/Fara-7B
- ByteDance-Seed/UI-TARS-1.5-7B
library_name: transformers
pipeline_tag: image-text-to-text
---

# Fara-TARS-7B: The Hybrid Reasoning & GUI Agent

**Fara-TARS-7B** is a state-of-the-art merged model that combines the high-level reasoning and planning capabilities of **Microsoft Fara-7B** with the precise GUI grounding and agentic capabilities of **ByteDance UI-TARS-1.5-7B**.

The model supports a **Hybrid Mode**: depending on the user prompt, it switches between writing long-form text plans (Reasoning) and emitting precise coordinate actions (Agentic Tool Calls).

## Key Capabilities

| Capability | Performance | Description |
| :--- | :--- | :--- |
| **GUI Grounding** | 🟢 **SOTA** | Accurately maps text instructions to `[x, y]` coordinates (e.g., "Click Submit" -> `[1200, 800]`). |
| **Reasoning** | 🟢 **Excellent** | Can generate long-form plans (e.g., "Weekly Python Learning Plan") without hallucinating clicks. |
| **Language** | 🟢 **English-Only** | Tuned to strictly follow English instructions, eliminating the language bleeding common in TARS merges. |
| **Agentic Output** | 🟢 **Structured** | Outputs actions in strict JSON format: `{"name": "click", ...}`. |

## How to Use (Inference Code)

To switch reliably between Agent Mode and Text Mode, **you must use the specific generation configuration below**. It handles the tool schema injection and prevents repetition loops.

### Installation

```bash
pip install torch transformers pillow
```

### Python Inference Class

Use this class to interact with the model. It handles the system prompt injection and JSON parsing automatically.

```python
import json
import re

import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoTokenizer, GenerationConfig

# NOTE: the tag names were stripped from the original card. "<tool_call>" is an
# assumption (the Qwen-style convention); adjust if the model emits other tags.
TOOL_OPEN = "<tool_call>"
TOOL_CLOSE = "</tool_call>"


class FaraAgent:
    def __init__(self, model_path, device="auto"):
        print(f"Loading Fara-TARS from {model_path}...")
        self.model = AutoModelForVision2Seq.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map=device,
            trust_remote_code=True,
            low_cpu_mem_usage=True,
        )
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

        # Safety fix for padding
        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token
            self.tokenizer.pad_token_id = self.tokenizer.eos_token_id

        # Define Agent Tools
        self.tools_schema = [
            {"name": "left_click", "description": "Click coordinate [x, y]", "parameters": {"type": "object", "properties": {"point": {"type": "array"}}}},
            {"name": "type_text", "description": "Type text", "parameters": {"type": "object", "properties": {"text": {"type": "string"}}}},
            {"name": "scroll", "description": "Scroll screen", "parameters": {"type": "object", "properties": {"pixels": {"type": "integer"}}}},
            {"name": "terminate", "description": "Task done", "parameters": {"type": "object", "properties": {"status": {"type": "string"}}}},
        ]

    def _format_prompt(self, user_prompt):
        # Injects the schema and strict English/Format instructions
        tools_json = json.dumps(self.tools_schema, indent=2)
        system = (
            "You are Fara-TARS, a GUI automation agent.\n"
            f"AVAILABLE TOOLS:\n{tools_json}\n\n"
            "INSTRUCTIONS:\n"
            "1. Reason first, then act.\n"
            f"2. Output valid JSON inside {TOOL_OPEN} tags.\n"
            f'3. Format: {TOOL_OPEN}{{"name": "left_click", ...}}{TOOL_CLOSE}'
        )
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )

    def _repair_json(self, json_str):
        # Auto-fixes common LLM JSON errors: smart or single quotes, and keys
        # that are missing their opening quote (e.g. name": -> "name":).
        json_str = json_str.replace("“", '"').replace("”", '"').replace("'", '"')
        # The lookbehind skips keys that are already quoted correctly.
        json_str = re.sub(r'(?<![\w"])(\w+)"\s*:', r'"\1":', json_str)
        return json_str

    def run(self, prompt, image_path=None):
        formatted_prompt = self._format_prompt(prompt)

        # Handle Image Input (Optional)
        if image_path:
            # NOTE: build_conversation_input_ids is not part of the standard
            # transformers API; this branch works only if the checkpoint's
            # remote code defines it.
            image = Image.open(image_path).convert("RGB")
            inputs = self.model.build_conversation_input_ids(
                tokenizer=self.tokenizer, query=formatted_prompt, image=image
            )
        else:
            inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)

        # Critical: Stop generation at tool close to prevent loops
        stop_strings = [TOOL_CLOSE, "<|im_end|>"]

        # Optimized Config
        config = GenerationConfig(
            max_new_tokens=2048,
            do_sample=True,
            temperature=0.4,
            top_p=0.95,
            repetition_penalty=1.15,  # Prevents "United.com" loops
            no_repeat_ngram_size=0,   # Must be 0 to allow JSON keys
            pad_token_id=self.tokenizer.pad_token_id,
            eos_token_id=self.tokenizer.eos_token_id,
        )

        with torch.no_grad():
            output = self.model.generate(
                **inputs,
                generation_config=config,
                tokenizer=self.tokenizer,  # required when stop_strings is set
                stop_strings=stop_strings,
            )

        input_len = inputs["input_ids"].shape[1]
        raw_response = self.tokenizer.decode(output[0][input_len:], skip_special_tokens=True)

        # Parse Output: text before the opening tag is the thought,
        # the JSON between the tags is the action.
        tool_action = None
        text_content = raw_response
        if TOOL_OPEN in raw_response:
            text_content, _, rest = raw_response.partition(TOOL_OPEN)
            text_content = text_content.strip()
            tool_str = rest.split(TOOL_CLOSE)[0].strip()
            try:
                tool_action = json.loads(tool_str)
            except json.JSONDecodeError:
                # Only attempt a repair when the raw string fails to parse.
                try:
                    tool_action = json.loads(self._repair_json(tool_str))
                except json.JSONDecodeError:
                    tool_action = {"error": "malformed_json", "raw": tool_str}

        return {"thought": text_content, "action": tool_action}


# Usage
if __name__ == "__main__":
    agent = FaraAgent("your-username/Fara-TARS-7B")
    result = agent.run("Click the Submit button at (1200, 800)")
    print(result)
```

## Benchmark Performance

The model was evaluated on a suite covering Web Automation, GUI Grounding, and Complex Reasoning.

| Category | Task | Result Type | Performance |
| :--- | :--- | :--- | :--- |
| **GUI Grounding** | "Click Submit at (1200, 800)" | **Tool Call** | ✅ Correct JSON: `{"point": [1200, 800]}` |
| **Web Automation** | "Type 'Hello World' in search" | **Tool Call** | ✅ Correct JSON: `{"name": "type", "text": "Hello World"}` |
| **Reasoning** | "Design a Weekly Python Plan" | **Text** | ✅ Generates full Markdown plan (900+ tokens) |
| **Hybrid** | "Compare Selenium vs Playwright" | **Agentic Text** | ✅ Uses `type` tool to output a Markdown table |
| **Safety** | "Stop at critical payment point" | **Tool Call** | ✅ Uses `terminate` tool with status `stop_confirm` |

## Merge Details

This model was merged using **Mergekit**.

### Configuration

```yaml
models:
  - model: microsoft/Fara-7B
  - model: ByteDance-Seed/UI-TARS-1.5-7B
merge_method: slerp
base_model: microsoft/Fara-7B
dtype: bfloat16
parameters:
  t:
    # 5-point gradient:
    # 0.1 (Start): Mostly Fara -> Ensures input understanding and English grammar.
    # 0.3 -> 0.5 (Middle): Blends TARS capability for reasoning and logic.
    # 0.1 (End): Mostly Fara -> Ensures the output stops correctly and doesn't loop.
    - value: [0.1, 0.3, 0.5, 0.3, 0.1]
```

*(Note: While `slerp` was used, specific inference parameters (temp=0.4, rep_penalty=1.15) are required to stabilize the output, as documented in the How to Use section.)*

## Limitations

1. **Strict Prompting:** The model expects the specific system prompt defined in the usage class. Without it, the model may hallucinate tool names.
2. **Repetition:** In extremely long lists (100+ items), the model may repeat itself. The recommended `repetition_penalty=1.15` fixes this in 99% of cases.

## License

Apache 2.0

---
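
## Appendix: How the Layer Gradient Is Applied

The `t` gradient in the merge configuration can be easier to follow with a small numeric sketch. The code below is an illustration, not Mergekit's own implementation. It assumes the five anchor values are linearly interpolated over layer depth, and that `t = 0` returns the base model (Fara) while `t = 1` returns the other model (UI-TARS). The helper names `layer_t` and `slerp` and the layer count of 28 are assumptions made for this example.

```python
import math


def layer_t(anchors, layer_idx, num_layers):
    """Linearly interpolate the gradient anchors over layer depth (assumed behaviour)."""
    if num_layers == 1 or len(anchors) == 1:
        return anchors[0]
    # Position of this layer on the anchor axis, in [0, len(anchors) - 1].
    pos = layer_idx / (num_layers - 1) * (len(anchors) - 1)
    lo = min(int(pos), len(anchors) - 2)
    frac = pos - lo
    return anchors[lo] * (1 - frac) + anchors[lo + 1] * frac


def slerp(t, v0, v1, eps=1e-8):
    """Spherical interpolation between two flat weight vectors (lists of floats)."""
    n0 = math.sqrt(sum(x * x for x in v0))
    n1 = math.sqrt(sum(x * x for x in v1))
    if n0 < eps or n1 < eps:
        # Degenerate vector: fall back to plain linear interpolation.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    # Angle between the two vectors, computed on their normalized directions.
    dot = sum(a * b for a, b in zip(v0, v1)) / (n0 * n1)
    theta = math.acos(max(-1.0, min(1.0, dot)))
    if math.sin(theta) < eps:
        # Nearly parallel vectors: linear interpolation is numerically safer.
        return [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    s0 = math.sin((1 - t) * theta) / math.sin(theta)
    s1 = math.sin(t * theta) / math.sin(theta)
    return [s0 * a + s1 * b for a, b in zip(v0, v1)]


ANCHORS = [0.1, 0.3, 0.5, 0.3, 0.1]  # from the merge configuration
NUM_LAYERS = 28                      # assumption, for illustration only

# Per-layer blend factor: low at both ends (mostly Fara), highest mid-stack.
schedule = [layer_t(ANCHORS, i, NUM_LAYERS) for i in range(NUM_LAYERS)]
for i in (0, NUM_LAYERS // 2, NUM_LAYERS - 1):
    print(f"layer {i:2d}: t = {schedule[i]:.3f}")

# Toy 2-D "weights" standing in for one tensor from each model.
fara_w, tars_w = [1.0, 0.0], [0.0, 1.0]
print(slerp(schedule[0], fara_w, tars_w))
```

Under these assumptions, the first and last layers stay close to Fara's weights and the middle layers take the largest share from UI-TARS, which matches the comments in the configuration.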