factorstudios committed on
Commit
27744f8
·
verified ·
1 Parent(s): 90ae3fe

Update README.md

Files changed (1)
  1. README.md +0 -64
README.md CHANGED
@@ -1,64 +0,0 @@
---
tags:
- vision-language-action
- vla
- multimodal
- factorstudios
- tida
- curfy
- foundation-model
---

# factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model

This repository hosts **TIDA_T1**, a **Vision-Language-Action (VLA) model** developed by FactorStudios.

TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) and real-time control systems. It is a direct continuation of the `curfy_v2` training line.

## Model Architecture Overview

TIDA_T1 is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams, then passes the combined representation through a deep reasoning layer to predict the next action.

| Stream | Component | Purpose | Pre-trained Base |
| :--- | :--- | :--- | :--- |
| **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
| **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
| **Context** | GPT-2-XL | Processes the long-term history and task context. | GPT-2-XL (355M, frozen) |
| **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
| **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |

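As a rough illustration of the fusion step, the sketch below concatenates one embedding per stream and projects the result into a shared representation. This is not the actual implementation: the per-stream widths for the spatial/temporal MLPs, the 2048-wide reasoning layer, and the single `tanh` projection are all assumptions made for the sketch (only the ViT-L, BERT-large, and GPT-2-XL hidden sizes are standard values).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical embedding widths for the five streams (illustrative only).
STREAM_DIMS = {
    "vision": 1024,    # ViT-L/14 pooled output
    "caption": 1024,   # BERT-large pooled output
    "context": 1600,   # GPT-2-XL hidden size
    "spatial": 128,    # small MLP over cursor history (assumed width)
    "temporal": 128,   # small MLP over frame-embedding history (assumed width)
}
FUSED_DIM = 2048  # assumed width of the shared reasoning representation

def fuse_streams(streams: dict[str, np.ndarray], w: np.ndarray) -> np.ndarray:
    """Concatenate the per-stream embeddings and project them into one
    shared representation for the reasoning layer."""
    x = np.concatenate([streams[k] for k in STREAM_DIMS], axis=-1)
    # A single nonlinearity stands in for the deep reasoning stack.
    return np.tanh(x @ w)

streams = {k: rng.standard_normal(d) for k, d in STREAM_DIMS.items()}
w = rng.standard_normal((sum(STREAM_DIMS.values()), FUSED_DIM)) * 0.01
fused = fuse_streams(streams, w)
print(fused.shape)  # (2048,)
```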
## Decision Outputs

The reasoning layer outputs a single embedding, which is fed into six specialized decision heads to predict a complete action:

1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
3. **Duration**: Predicts how long the action should take (e.g., for a drag or wait).
4. **Parameters**: A 32-dimensional vector of action-specific parameters (e.g., scroll amount, keypress).
5. **Confidence**: A score indicating the model's certainty in its prediction.
6. **Explanation Logits**: Token logits for generating a natural-language explanation of the decision.

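The six heads can be pictured as independent projections of the fused embedding. In this sketch the fused width, action count, and explanation vocabulary size are stand-ins; only the 4 coordinate values and the 32-dimensional parameter vector come from the list above, and the sigmoid squashing of coordinates and confidence is an assumption to match their documented ranges.

```python
import numpy as np

rng = np.random.default_rng(1)
FUSED_DIM = 2048   # assumed reasoning-layer width
N_ACTIONS = 4      # click, drag, type, scroll (illustrative)
VOCAB = 100        # small stand-in vocabulary for the explanation head

# One linear head per decision output (hypothetical shapes).
heads = {
    "action_logits": rng.standard_normal((FUSED_DIM, N_ACTIONS)),
    "coordinates": rng.standard_normal((FUSED_DIM, 4)),   # x1, y1, x2, y2
    "duration": rng.standard_normal((FUSED_DIM, 1)),
    "parameters": rng.standard_normal((FUSED_DIM, 32)),
    "confidence": rng.standard_normal((FUSED_DIM, 1)),
    "explanation_logits": rng.standard_normal((FUSED_DIM, VOCAB)),
}

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def decide(fused: np.ndarray) -> dict[str, np.ndarray]:
    """Apply every head to the fused embedding to produce a complete action."""
    out = {name: fused @ w for name, w in heads.items()}
    # Coordinates are documented as normalized, and confidence is a score,
    # so both are squashed into [0, 1] here.
    out["coordinates"] = sigmoid(out["coordinates"])
    out["confidence"] = sigmoid(out["confidence"])
    return out

action = decide(rng.standard_normal(FUSED_DIM) * 0.01)
print(action["parameters"].shape)  # (32,)
```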
## Usage

This repository contains the model weights (`model.safetensors`) and the configuration files (`config.json`, tokenizer files) needed to load the model with the Hugging Face `transformers` library.

To load the tokenizer (and, once the custom class is available, the model):

```python
from transformers import AutoTokenizer

# The model is a custom architecture, so direct AutoModel loading may require
# custom code or a registered class. Refer to the original training script
# for the exact class definition.

# Load the tokenizer for the text streams.
tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")

# Load the model weights (assuming the custom class is defined):
# model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
```

**Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (as provided in the accompanying `inference_script.py`) to load the weights correctly.

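Before wiring up the custom class, it can help to check which tensor names `model.safetensors` actually contains and compare them against the class's state dict. The safetensors format begins with an 8-byte little-endian header length followed by that many bytes of JSON, so the header can be read with the standard library alone. The sketch below builds a tiny stand-in file so it is self-contained; the tensor name in it is invented for the demo.

```python
import json
import struct

def read_safetensors_header(path: str) -> dict:
    """Read only the JSON header of a .safetensors file.

    Layout: 8-byte little-endian header length, then that many bytes of JSON
    mapping tensor names to dtype/shape/offset records."""
    with open(path, "rb") as f:
        (n,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(n))

# Build a tiny stand-in file; a real model.safetensors has the same layout.
header = {"vision.proj.weight": {"dtype": "F32", "shape": [4, 4],
                                 "data_offsets": [0, 64]}}
payload = json.dumps(header).encode()
with open("demo.safetensors", "wb") as f:
    f.write(struct.pack("<Q", len(payload)))
    f.write(payload)
    f.write(b"\x00" * 64)  # dummy tensor bytes

print(sorted(read_safetensors_header("demo.safetensors")))  # ['vision.proj.weight']
```

Listing the header keys of the real checkpoint this way shows exactly which modules the custom class must define before `from_pretrained` can succeed.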
---
*Generated by Manus AI based on analysis of `train3-v4.py`.*