factorstudios committed
Commit 90ae3fe · verified · 1 parent: 6e90922

Add configuration files, tokenizer, and README.md for inference setup

Files changed (6)
  1. README.md +64 -0
  2. config.json +60 -0
  3. special_tokens_map.json +7 -0
  4. tokenizer.json +0 -0
  5. tokenizer_config.json +56 -0
  6. vocab.txt +0 -0
README.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ tags:
+ - vision-language-action
+ - vla
+ - multimodal
+ - factorstudios
+ - tida
+ - curfy
+ - foundation-model
+ ---
+
+ # factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model
+
+ This repository hosts **TIDA_T1**, a complete **Vision-Language-Action (VLA) model** developed by FactorStudios.
+
+ TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) or real-time control systems. It is a direct continuation of the `curfy_v2` training line.
+
+ ## Model Architecture Overview
+
+ TIDA_T1 is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams, then passes the combined representation through a deep reasoning layer to predict the next action.
+
+ | Stream | Component | Purpose | Pre-trained Base |
+ | :--- | :--- | :--- | :--- |
+ | **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
+ | **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
+ | **Context** | GPT-2-Medium | Processes the long-term history and task context. | GPT-2-Medium (355M, frozen) |
+ | **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
+ | **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |
+
+ ## Decision Outputs
+
+ The model's reasoning layer outputs a single embedding which is fed into six specialized decision heads to predict a complete action:
+
+ 1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
+ 2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
+ 3. **Duration**: Predicts the time the action should take (e.g., for a drag or wait).
+ 4. **Parameters**: A 32-dimensional vector for action-specific parameters (e.g., scroll amount, keypress).
+ 5. **Confidence**: A score indicating the model's certainty in its prediction.
+ 6. **Explanation Logits**: Token logits for generating a natural-language explanation of the decision.
+
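As a reading aid, the fusion-and-heads wiring described above (five 768-d streams concatenated to 3840 dims, an 8-layer reasoning transformer, six decision heads) can be sketched in PyTorch. This is a minimal sketch inferred from config.json: the class name `TIDAFusionSketch`, the single-token reasoning pass, and all layer choices are assumptions, not the repository's actual `VisionLanguageActionModel` implementation.

```python
import torch
import torch.nn as nn

class TIDAFusionSketch(nn.Module):
    """Illustrative sketch only -- not the repo's VisionLanguageActionModel."""

    def __init__(self, hidden=768, num_actions=8, vocab_size=30522):
        super().__init__()
        # Five projected streams (768 each) concatenated: 5 * 768 = 3840,
        # matching fusion_config.input_dim in config.json.
        self.fusion = nn.Linear(5 * hidden, hidden)
        # reasoning_config: d_model=768, nhead=12, num_layers=8.
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
        self.reasoning = nn.TransformerEncoder(layer, num_layers=8)
        # Six decision heads, mirroring the output list above.
        self.action_head = nn.Linear(hidden, num_actions)      # action logits
        self.coord_head = nn.Linear(hidden, 4)                 # x1, y1, x2, y2
        self.duration_head = nn.Linear(hidden, 1)              # duration
        self.param_head = nn.Linear(hidden, 32)                # parameters
        self.confidence_head = nn.Linear(hidden, 1)            # confidence
        self.explanation_head = nn.Linear(hidden, vocab_size)  # explanation logits

    def forward(self, vision, caption, context, spatial, temporal):
        # Concatenate the five stream embeddings and fuse down to 768.
        fused = self.fusion(torch.cat([vision, caption, context, spatial, temporal], dim=-1))
        # Run the fused embedding through the reasoning transformer
        # (treated here as a single-token sequence).
        h = self.reasoning(fused.unsqueeze(1)).squeeze(1)
        return {
            "action_logits": self.action_head(h),
            "coordinates": self.coord_head(h),
            "duration": self.duration_head(h),
            "parameters": self.param_head(h),
            "confidence": self.confidence_head(h),
            "explanation_logits": self.explanation_head(h),
        }
```

With batch size 2, each head returns a tensor shaped by its output width above (e.g. `action_logits` is `(2, 8)`, `parameters` is `(2, 32)`).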
+ ## Usage
+
+ This repository contains the model weights (`model.safetensors`) and the configuration files (`config.json`, tokenizer files) needed to load the model with the Hugging Face `transformers` library.
+
+ To load the model and tokenizer:
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # The model is a custom architecture, so direct AutoModel loading may require
+ # custom code or a registered class. Refer to the original training script
+ # for the exact class definition.
+
+ # Load the tokenizer for the text streams
+ tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")
+
+ # Load the model weights (assuming you have the custom class defined)
+ # model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
+ ```
+
+ **Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (see `inference_script.py`) to load the weights correctly.
+
+ ---
+ *Generated by Manus AI based on analysis of `train3-v4.py`.*
config.json ADDED
@@ -0,0 +1,60 @@
+ {
+   "_class_name": "VisionLanguageActionModel",
+   "architectures": [
+     "VisionLanguageActionModel"
+   ],
+   "model_type": "vla-model",
+   "hidden_size": 768,
+   "num_tasks": 6,
+   "vision_config": {
+     "model_type": "vit",
+     "image_size": 224,
+     "patch_size": 14,
+     "hidden_size": 1024,
+     "num_hidden_layers": 24,
+     "num_attention_heads": 16,
+     "intermediate_size": 4096,
+     "projection_dim": 768
+   },
+   "caption_config": {
+     "model_type": "bert",
+     "vocab_size": 30522,
+     "hidden_size": 1024,
+     "num_hidden_layers": 24,
+     "num_attention_heads": 16,
+     "intermediate_size": 4096,
+     "projection_dim": 768
+   },
+   "context_config": {
+     "model_type": "gpt2",
+     "vocab_size": 50257,
+     "n_positions": 1024,
+     "n_embd": 1024,
+     "n_layer": 24,
+     "n_head": 16,
+     "projection_dim": 768
+   },
+   "spatial_config": {
+     "input_dim": 10,
+     "output_dim": 768
+   },
+   "temporal_config": {
+     "input_dim": 1280,
+     "output_dim": 768
+   },
+   "fusion_config": {
+     "input_dim": 3840,
+     "output_dim": 768
+   },
+   "reasoning_config": {
+     "d_model": 768,
+     "nhead": 12,
+     "num_layers": 8
+   },
+   "action_head_config": {
+     "num_actions": 8
+   },
+   "explanation_head_config": {
+     "vocab_size": 30522
+   }
+ }
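A quick sanity check one can run on the dimensions above (a plain-Python sketch; the inlined fragment simply mirrors the relevant keys of this config.json): the fusion input width should equal the five projected stream widths concatenated, and the reasoning width must divide evenly across its attention heads.

```python
import json

# Relevant fragment of config.json, inlined so the check is self-contained.
config = json.loads("""
{
  "hidden_size": 768,
  "vision_config":   {"projection_dim": 768},
  "caption_config":  {"projection_dim": 768},
  "context_config":  {"projection_dim": 768},
  "spatial_config":  {"output_dim": 768},
  "temporal_config": {"output_dim": 768},
  "fusion_config":   {"input_dim": 3840, "output_dim": 768},
  "reasoning_config": {"d_model": 768, "nhead": 12, "num_layers": 8}
}
""")

stream_dims = [
    config["vision_config"]["projection_dim"],
    config["caption_config"]["projection_dim"],
    config["context_config"]["projection_dim"],
    config["spatial_config"]["output_dim"],
    config["temporal_config"]["output_dim"],
]
# 5 streams x 768 = 3840 must match the fusion layer's input width.
assert sum(stream_dims) == config["fusion_config"]["input_dim"]
# 768 must split evenly over 12 attention heads (64 dims per head).
assert config["reasoning_config"]["d_model"] % config["reasoning_config"]["nhead"] == 0
```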
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
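The `added_tokens_decoder` above pins the special tokens to the standard BERT-uncased vocabulary ids (`[PAD]`=0, `[UNK]`=100, `[CLS]`=101, `[SEP]`=102, `[MASK]`=103). As an illustration of how a `BertTokenizer` frames a single sequence with these ids, here is a minimal sketch; the `frame` helper and the example token ids are assumptions for illustration, not repository code.

```python
# Standard BERT special-token ids, matching added_tokens_decoder above.
SPECIALS = {"[PAD]": 0, "[UNK]": 100, "[CLS]": 101, "[SEP]": 102, "[MASK]": 103}

def frame(token_ids, max_len=8):
    """Wrap token ids as [CLS] ... [SEP], then pad with [PAD] to max_len.

    Truncation to max_len - 2 mirrors how model_max_length (512 above)
    bounds a single BERT sequence after adding the two special tokens.
    """
    ids = [SPECIALS["[CLS]"]] + list(token_ids)[: max_len - 2] + [SPECIALS["[SEP]"]]
    ids += [SPECIALS["[PAD]"]] * (max_len - len(ids))
    return ids

# e.g. frame([2023, 2003]) -> [101, 2023, 2003, 102, 0, 0, 0, 0]
```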
vocab.txt ADDED
The diff for this file is too large to render. See raw diff