factorstudios committed on
Commit
27744f8
·
verified ·
1 Parent(s): 90ae3fe

Update README.md

Files changed (1)
  1. README.md +0 -64
README.md CHANGED
@@ -1,64 +0,0 @@
---
tags:
- vision-language-action
- vla
- multimodal
- factorstudios
- tida
- curfy
- foundation-model
---

# factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model

This repository hosts **TIDA_T1**, a **Vision-Language-Action (VLA) model** developed by FactorStudios.

TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) and real-time control systems. It is a direct continuation of the `curfy_v2` training line.

## Model Architecture Overview

TIDA_T1 is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams, then passes the combined representation through a deep reasoning layer to predict the next action.

| Stream | Component | Purpose | Pre-trained Base |
| :--- | :--- | :--- | :--- |
| **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
| **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
| **Context** | GPT-2-XL | Processes the long-term history and task context. | GPT-2-XL (355M, frozen) |
| **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
| **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |

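As a rough illustration of the fusion step, the sketch below concatenates one embedding per stream and projects the result into a shared representation. This is not the actual implementation: the per-stream widths for the spatial/temporal MLPs, the 2048-wide reasoning layer, and the single `tanh` projection are all assumptions made for the sketch (only the ViT-L, BERT-large, and GPT-2-XL hidden sizes are standard values).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding widths for the five streams (illustrative only).
STREAM_DIMS = {
    "vision": 1024,    # ViT-L/14 pooled output
    "caption": 1024,   # BERT-large pooled output
    "context": 1600,   # GPT-2-XL hidden size
    "spatial": 128,    # small MLP over cursor history (assumed width)
    "temporal": 128,   # small MLP over frame-embedding history (assumed width)
}
FUSED_DIM = 2048  # assumed width of the shared reasoning representation

def fuse_streams(streams: dict[str, np.ndarray], w: np.ndarray) -> np.ndarray:
    """Concatenate the per-stream embeddings and project them into one
    shared representation for the reasoning layer."""
    x = np.concatenate([streams[k] for k in STREAM_DIMS], axis=-1)
    # A single nonlinearity stands in for the deep reasoning stack.
    return np.tanh(x @ w)

streams = {k: rng.standard_normal(d) for k, d in STREAM_DIMS.items()}
w = rng.standard_normal((sum(STREAM_DIMS.values()), FUSED_DIM)) * 0.01
fused = fuse_streams(streams, w)
print(fused.shape)  # (2048,)
```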
## Decision Outputs

The reasoning layer outputs a single embedding, which is fed into six specialized decision heads to predict a complete action:

1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
3. **Duration**: Predicts how long the action should take (e.g., for a drag or wait).
4. **Parameters**: A 32-dimensional vector of action-specific parameters (e.g., scroll amount, keypress).
5. **Confidence**: A score indicating the model's certainty in its prediction.
6. **Explanation Logits**: Token logits for generating a natural-language explanation of the decision.

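The six heads can be pictured as independent projections of the fused embedding. In this sketch the fused width, action count, and explanation vocabulary size are stand-ins; only the 4 coordinate values and the 32-dimensional parameter vector come from the list above, and the sigmoid squashing of coordinates and confidence is an assumption to match their documented ranges.

```python
import numpy as np

rng = np.random.default_rng(1)
FUSED_DIM = 2048   # assumed reasoning-layer width
N_ACTIONS = 4      # click, drag, type, scroll (illustrative)
VOCAB = 100        # small stand-in vocabulary for the explanation head

# One linear head per decision output (hypothetical shapes).
heads = {
    "action_logits": rng.standard_normal((FUSED_DIM, N_ACTIONS)),
    "coordinates": rng.standard_normal((FUSED_DIM, 4)),   # x1, y1, x2, y2
    "duration": rng.standard_normal((FUSED_DIM, 1)),
    "parameters": rng.standard_normal((FUSED_DIM, 32)),
    "confidence": rng.standard_normal((FUSED_DIM, 1)),
    "explanation_logits": rng.standard_normal((FUSED_DIM, VOCAB)),
}

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def decide(fused: np.ndarray) -> dict[str, np.ndarray]:
    """Apply every head to the fused embedding to produce a complete action."""
    out = {name: fused @ w for name, w in heads.items()}
    # Coordinates are documented as normalized, and confidence is a score,
    # so both are squashed into [0, 1] here.
    out["coordinates"] = sigmoid(out["coordinates"])
    out["confidence"] = sigmoid(out["confidence"])
    return out

action = decide(rng.standard_normal(FUSED_DIM) * 0.01)
print(action["parameters"].shape)  # (32,)
```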
## Usage

This repository contains the model weights (`model.safetensors`) and the configuration files (`config.json`, tokenizer files) needed to load the model with the Hugging Face `transformers` library.

To load the tokenizer (and, once the custom class is available, the model):

```python
from transformers import AutoTokenizer

# The model is a custom architecture, so direct AutoModel loading may require
# custom code or a registered class. Refer to the original training script
# for the exact class definition.

# Load the tokenizer for the text streams.
tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")

# Load the model weights (assuming the custom class is defined):
# model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
```

**Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (as provided in the accompanying `inference_script.py`) to load the weights correctly.

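Before wiring up the custom class, it can help to check which tensor names `model.safetensors` actually contains and compare them against the class's state dict. The safetensors format begins with an 8-byte little-endian header length followed by that many bytes of JSON, so the header can be read with the standard library alone. The sketch below builds a tiny stand-in file so it is self-contained; the tensor name in it is invented for the demo.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file.

    Layout: 8-byte little-endian header length, then that many bytes of JSON
    mapping tensor names to dtype/shape/offset records."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))

# Build a tiny stand-in file; a real model.safetensors has the same layout.
header = {"vision.proj.weight": {"dtype": "F32", "shape": [4, 4],
                                 "data_offsets": [0, 64]}}
payload = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(payload)))
    f.write(payload)
    f.write(b"\x00" * 64)  # dummy tensor bytes

print(sorted(read_safetensors_header("demo.safetensors")))  # ['vision.proj.weight']
```

Listing the header keys of the real checkpoint this way shows exactly which modules the custom class must define before `from_pretrained` can succeed.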
---
*Generated by Manus AI based on analysis of `train3-v4.py`.*