---
tags:
- vision-language-action
- vla
- multimodal
- factorstudios
- tida
- curfy
- foundation-model
---

# factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model

This repository hosts **TIDA_T1**, a **Vision-Language-Action (VLA) model** developed by FactorStudios.

TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) and real-time control systems. It is a direct continuation of the `curfy_v2` training line.

## Model Architecture Overview

The TIDA_T1 model is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams before passing the combined representation through a deep reasoning layer to predict the next action.

| Stream | Component | Purpose | Pre-trained Base |
| :--- | :--- | :--- | :--- |
| **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
| **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
| **Context** | GPT-2-XL | Processes the long-term history and task context. | GPT-2-XL (355M, frozen) |
| **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
| **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |
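
The five-stream fusion can be sketched as follows. This is an illustrative sketch only, not the actual TIDA_T1 implementation: the class name, the shared hidden width of 512, and the per-stream embedding widths are all assumptions for demonstration.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative five-stream fusion: project each stream's embedding
    to a shared width, concatenate, and pass through a reasoning MLP."""

    def __init__(self, stream_dims, hidden=512):
        super().__init__()
        # One projection per input stream: vision, caption, context, spatial, temporal
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in stream_dims)
        self.reason = nn.Sequential(
            nn.Linear(hidden * len(stream_dims), hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
        )

    def forward(self, streams):
        # Project each stream, concatenate along the feature axis, then reason
        fused = torch.cat([p(s) for p, s in zip(self.proj, streams)], dim=-1)
        return self.reason(fused)

# Hypothetical per-stream embedding widths (vision, caption, context, spatial, temporal)
dims = [1024, 1024, 1600, 64, 64]
model = FusionSketch(dims)
streams = [torch.randn(2, d) for d in dims]  # batch of 2
out = model(streams)
print(out.shape)  # torch.Size([2, 512])
```

The resulting shared embedding is what feeds the decision heads described in the next section.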

## Decision Outputs

The model's reasoning layer outputs a single embedding, which is fed into six specialized decision heads to predict a complete action:

1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
3. **Duration**: Predicts how long the action should take (e.g., for a drag or wait).
4. **Parameters**: A 32-dimensional vector for action-specific parameters (e.g., scroll amount, keypress).
5. **Confidence**: A score indicating the model's certainty in its prediction.
6. **Explanation Logits**: Token logits for generating a natural-language explanation of the decision.
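
The six heads above can be sketched as projections from the shared reasoning embedding. Again, this is a hedged sketch: the class name, hidden width, action count, and vocabulary size are assumptions, and only the 4-dimensional coordinates and 32-dimensional parameter vector come from the description above.

```python
import torch
import torch.nn as nn

NUM_ACTIONS, PARAM_DIM, VOCAB = 8, 32, 50257  # NUM_ACTIONS and VOCAB are illustrative

class DecisionHeads(nn.Module):
    """Six heads over one shared embedding, mirroring the outputs listed above."""

    def __init__(self, hidden=512):
        super().__init__()
        self.action = nn.Linear(hidden, NUM_ACTIONS)   # 1. action-type logits
        self.coords = nn.Linear(hidden, 4)             # 2. normalized x1, y1, x2, y2
        self.duration = nn.Linear(hidden, 1)           # 3. action duration
        self.params = nn.Linear(hidden, PARAM_DIM)     # 4. action-specific parameters
        self.confidence = nn.Linear(hidden, 1)         # 5. certainty score
        self.explain = nn.Linear(hidden, VOCAB)        # 6. explanation token logits

    def forward(self, h):
        return {
            "action_logits": self.action(h),
            "coordinates": torch.sigmoid(self.coords(h)),    # squash into [0, 1]
            "duration": self.duration(h),
            "parameters": self.params(h),
            "confidence": torch.sigmoid(self.confidence(h)),  # squash into [0, 1]
            "explanation_logits": self.explain(h),
        }

heads = DecisionHeads()
out = heads(torch.randn(2, 512))  # batch of 2 reasoning embeddings
print(out["coordinates"].shape)  # torch.Size([2, 4])
```

Squashing the coordinate and confidence heads through a sigmoid is one common way to keep those outputs normalized; whether TIDA_T1 does this is an assumption here.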

## Usage

This repository contains the model weights (`model.safetensors`) and the configuration files (`config.json`, tokenizer files) needed to load the model with the Hugging Face `transformers` library.

To load the model and tokenizer:

```python
from transformers import AutoModel, AutoTokenizer

# The model is a custom architecture, so direct AutoModel loading may require
# custom code or a registered class. Refer to the original training script
# for the exact class definition.

# Load the tokenizer for the text streams
tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")

# Load the model weights (assuming you have the custom class defined)
# model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
```

**Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (provided in `inference_script.py`) to load the weights correctly.

---

*Generated by Manus AI based on analysis of `train3-v4.py`.*