Add configuration files, tokenizer, and README.md for inference setup
- README.md +64 -0
- config.json +60 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +56 -0
- vocab.txt +0 -0
README.md
ADDED
---
tags:
- vision-language-action
- vla
- multimodal
- factorstudios
- tida
- curfy
- foundation-model
---

# factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model

This repository hosts the **TIDA_T1** model, a complete **Vision-Language-Action (VLA) model** developed by FactorStudios.

TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) or real-time control systems. It is a direct continuation of the `curfy_v2` training line.

## Model Architecture Overview

TIDA_T1 is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams before passing the combined representation through a deep reasoning layer to predict the next action.

| Stream | Component | Purpose | Pre-trained Base |
| :--- | :--- | :--- | :--- |
| **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
| **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
| **Context** | GPT-2-XL | Processes the long-term history and task context. | GPT-2-XL (355M, frozen) |
| **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
| **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |

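Each stream is projected to the shared 768-dimensional width and concatenated before the reasoning layer. The following is a minimal sketch of that fusion stage, not the actual implementation: the class and variable names are hypothetical, with the dimensions inferred from `fusion_config` in `config.json` (5 × 768 = 3840).

```python
import torch
import torch.nn as nn

HIDDEN = 768      # shared projection width ("hidden_size" in config.json)
NUM_STREAMS = 5   # vision, caption, context, spatial, temporal

class FusionSketch(nn.Module):
    """Hypothetical fusion stage: concatenate five 768-d stream embeddings
    (5 * 768 = 3840, matching fusion_config.input_dim) and project to 768."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(NUM_STREAMS * HIDDEN, HIDDEN)

    def forward(self, streams):
        # streams: five (batch, 768) tensors, one per input stream
        fused = torch.cat(streams, dim=-1)  # (batch, 3840)
        return self.fuse(fused)             # (batch, 768)

streams = [torch.randn(2, HIDDEN) for _ in range(NUM_STREAMS)]
out = FusionSketch()(streams)
print(out.shape)  # torch.Size([2, 768])
```

The fused 768-d vector is what the transformer reasoning layer (`reasoning_config`: d_model 768, 12 heads, 8 layers) operates on.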
## Decision Outputs

The model's reasoning layer outputs a single embedding which is fed into six specialized decision heads to predict a complete action:

1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
3. **Duration**: Predicts the time the action should take (e.g., for a drag or wait).
4. **Parameters**: A 32-dimensional vector for action-specific parameters (e.g., scroll amount, keypress).
5. **Confidence**: A score indicating the model's certainty in its prediction.
6. **Explanation Logits**: Token logits for generating a natural language explanation of the decision.

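The six heads above can be sketched as parallel linear projections from the reasoning embedding. This is an illustrative guess, not the repository's code: the head names and activations are assumptions, with output widths taken from `config.json` (`num_actions: 8`, explanation vocab 30522) and from the list above.

```python
import torch
import torch.nn as nn

HIDDEN = 768  # reasoning-layer output width

class DecisionHeadsSketch(nn.Module):
    """Hypothetical decision heads matching the six outputs described above."""
    def __init__(self, num_actions=8, explain_vocab=30522):
        super().__init__()
        self.action = nn.Linear(HIDDEN, num_actions)         # action-type logits
        self.coords = nn.Linear(HIDDEN, 4)                   # (x1, y1, x2, y2)
        self.duration = nn.Linear(HIDDEN, 1)                 # action duration
        self.params = nn.Linear(HIDDEN, 32)                  # action parameters
        self.confidence = nn.Linear(HIDDEN, 1)               # certainty score
        self.explanation = nn.Linear(HIDDEN, explain_vocab)  # explanation logits

    def forward(self, h):
        return {
            "action_logits": self.action(h),
            "coordinates": torch.sigmoid(self.coords(h)),    # normalized to [0, 1]
            "duration": self.duration(h),
            "parameters": self.params(h),
            "confidence": torch.sigmoid(self.confidence(h)),
            "explanation_logits": self.explanation(h),
        }

heads = DecisionHeadsSketch()
out = heads(torch.randn(2, HIDDEN))
print(out["action_logits"].shape)  # torch.Size([2, 8])
```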
## Usage

This repository contains the model weights (`model.safetensors`) and the necessary configuration files (`config.json`, tokenizer files) to load the model using the Hugging Face `transformers` library.

To load the tokenizer:

```python
from transformers import AutoTokenizer

# The model is a custom architecture, so direct AutoModel loading may require
# custom code or a registered class. Refer to the original training script
# for the exact class definition.

# Load the tokenizer for the text streams
tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")

# Load the model weights (assuming you have the custom class defined)
# model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
```

**Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (as provided in the previously delivered `inference_script.py`) to load the weights correctly.

---

*Generated by Manus AI based on analysis of `train3-v4.py`.*
config.json
ADDED
{
  "_class_name": "VisionLanguageActionModel",
  "architectures": [
    "VisionLanguageActionModel"
  ],
  "model_type": "vla-model",
  "hidden_size": 768,
  "num_tasks": 6,
  "vision_config": {
    "model_type": "vit",
    "image_size": 224,
    "patch_size": 14,
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 4096,
    "projection_dim": 768
  },
  "caption_config": {
    "model_type": "bert",
    "vocab_size": 30522,
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 4096,
    "projection_dim": 768
  },
  "context_config": {
    "model_type": "gpt2",
    "vocab_size": 50257,
    "n_positions": 1024,
    "n_embd": 1024,
    "n_layer": 24,
    "n_head": 16,
    "projection_dim": 768
  },
  "spatial_config": {
    "input_dim": 10,
    "output_dim": 768
  },
  "temporal_config": {
    "input_dim": 1280,
    "output_dim": 768
  },
  "fusion_config": {
    "input_dim": 3840,
    "output_dim": 768
  },
  "reasoning_config": {
    "d_model": 768,
    "nhead": 12,
    "num_layers": 8
  },
  "action_head_config": {
    "num_actions": 8
  },
  "explanation_head_config": {
    "vocab_size": 30522
  }
}
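The dimensions in this config are internally consistent: the fusion input width equals the five per-stream projection widths combined. A small sanity-check sketch, assuming the streams are concatenated at their projected widths (a subset of the config is inlined here; normally it would come from `json.load(open("config.json"))`):

```python
import json

# Inlined subset of config.json for illustration
config = json.loads("""{
  "hidden_size": 768,
  "vision_config": {"projection_dim": 768},
  "caption_config": {"projection_dim": 768},
  "context_config": {"projection_dim": 768},
  "spatial_config": {"output_dim": 768},
  "temporal_config": {"output_dim": 768},
  "fusion_config": {"input_dim": 3840, "output_dim": 768}
}""")

stream_dims = [
    config["vision_config"]["projection_dim"],
    config["caption_config"]["projection_dim"],
    config["context_config"]["projection_dim"],
    config["spatial_config"]["output_dim"],
    config["temporal_config"]["output_dim"],
]
# Five 768-d streams concatenate to the fusion input width: 5 * 768 == 3840
assert sum(stream_dims) == config["fusion_config"]["input_dim"]
print("fusion input width consistent:", sum(stream_dims))
```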
special_tokens_map.json
ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
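The special-token ids here (0, 100, 101, 102, 103) follow the standard `bert-base-uncased` layout, and every special token named in `tokenizer_config.json` also appears in `added_tokens_decoder`. A hedged consistency sketch (the config is inlined for illustration rather than read from disk):

```python
# Inlined subset of tokenizer_config.json for illustration
tokenizer_config = {
    "added_tokens_decoder": {
        "0": {"content": "[PAD]"},
        "100": {"content": "[UNK]"},
        "101": {"content": "[CLS]"},
        "102": {"content": "[SEP]"},
        "103": {"content": "[MASK]"},
    },
    "pad_token": "[PAD]", "unk_token": "[UNK]",
    "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]",
}

# Every special token referenced by name must exist in added_tokens_decoder
decoded = {d["content"] for d in tokenizer_config["added_tokens_decoder"].values()}
for key in ("pad_token", "unk_token", "cls_token", "sep_token", "mask_token"):
    assert tokenizer_config[key] in decoded
print("special tokens consistent:", sorted(decoded))
```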
vocab.txt
ADDED
The diff for this file is too large to render.