Add configuration files, tokenizer, and README.md for inference setup
- README.md +64 -0
- config.json +60 -0
- special_tokens_map.json +7 -0
- tokenizer.json +0 -0
- tokenizer_config.json +56 -0
- vocab.txt +0 -0
README.md
ADDED
---
tags:
- vision-language-action
- vla
- multimodal
- factorstudios
- tida
- curfy
- foundation-model
---

# factorstudios/TIDA_T1: Vision-Language-Action (VLA) Model

This repository hosts the **TIDA_T1** model, a complete **Vision-Language-Action (VLA) model** developed by FactorStudios.

TIDA_T1 is a monolithic, multimodal foundation model designed for complex sequential decision-making tasks, such as automated interaction with graphical user interfaces (GUIs) or real-time control systems. It is a direct continuation of the `curfy_v2` training line.

## Model Architecture Overview

TIDA_T1 is a **1.575-billion-parameter** architecture that fuses information from five distinct input streams before passing the combined representation through a deep reasoning layer to predict the next action.

| Stream | Component | Purpose | Pre-trained Base |
| :--- | :--- | :--- | :--- |
| **Vision** | ViT-L/14 | Processes the current screen frame (image). | ViT-Large (308M, frozen) |
| **Caption** | BERT-large | Processes the textual description of the current state or goal. | BERT-Large (340M, frozen) |
| **Context** | GPT-2-XL | Processes the long-term history and task context. | GPT-2-XL (355M, frozen) |
| **Spatial** | MLP | Encodes the recent cursor trajectory and position history. | Trainable (small) |
| **Temporal** | MLP | Encodes the history of frame embeddings (what the screen looked like). | Trainable (small) |

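Each stream is projected to the shared 768-dimensional width and concatenated before the reasoning layer. The following is a minimal sketch of that fusion stage, not the actual implementation: the class and variable names are hypothetical, with the dimensions inferred from `fusion_config` in `config.json` (5 × 768 = 3840).

```python
import torch
import torch.nn as nn

HIDDEN = 768      # shared projection width ("hidden_size" in config.json)
NUM_STREAMS = 5   # vision, caption, context, spatial, temporal

class FusionSketch(nn.Module):
    """Hypothetical fusion stage: concatenate five 768-d stream embeddings
    (5 * 768 = 3840, matching fusion_config.input_dim) and project to 768."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(NUM_STREAMS * HIDDEN, HIDDEN)

    def forward(self, streams):
        # streams: five (batch, 768) tensors, one per input stream
        fused = torch.cat(streams, dim=-1)  # (batch, 3840)
        return self.fuse(fused)             # (batch, 768)

streams = [torch.randn(2, HIDDEN) for _ in range(NUM_STREAMS)]
out = FusionSketch()(streams)
print(out.shape)  # torch.Size([2, 768])
```

The fused 768-d vector is what the transformer reasoning layer (`reasoning_config`: d_model 768, 12 heads, 8 layers) operates on.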
## Decision Outputs

The model's reasoning layer outputs a single embedding which is fed into six specialized decision heads to predict a complete action:

1. **Action Logits**: Predicts the type of action (e.g., `click`, `drag`, `type`, `scroll`).
2. **Coordinates**: Predicts the normalized bounding box or point (x1, y1, x2, y2) for the action.
3. **Duration**: Predicts the time the action should take (e.g., for a drag or wait).
4. **Parameters**: A 32-dimensional vector for action-specific parameters (e.g., scroll amount, keypress).
5. **Confidence**: A score indicating the model's certainty in its prediction.
6. **Explanation Logits**: Token logits for generating a natural language explanation of the decision.

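The six heads above can be sketched as parallel linear projections from the reasoning embedding. This is an illustrative guess, not the repository's code: the head names and activations are assumptions, with output widths taken from `config.json` (`num_actions: 8`, explanation vocab 30522) and from the list above.

```python
import torch
import torch.nn as nn

HIDDEN = 768  # reasoning-layer output width

class DecisionHeadsSketch(nn.Module):
    """Hypothetical decision heads matching the six outputs described above."""
    def __init__(self, num_actions=8, explain_vocab=30522):
        super().__init__()
        self.action = nn.Linear(HIDDEN, num_actions)         # action-type logits
        self.coords = nn.Linear(HIDDEN, 4)                   # (x1, y1, x2, y2)
        self.duration = nn.Linear(HIDDEN, 1)                 # action duration
        self.params = nn.Linear(HIDDEN, 32)                  # action parameters
        self.confidence = nn.Linear(HIDDEN, 1)               # certainty score
        self.explanation = nn.Linear(HIDDEN, explain_vocab)  # explanation logits

    def forward(self, h):
        return {
            "action_logits": self.action(h),
            "coordinates": torch.sigmoid(self.coords(h)),    # normalized to [0, 1]
            "duration": self.duration(h),
            "parameters": self.params(h),
            "confidence": torch.sigmoid(self.confidence(h)),
            "explanation_logits": self.explanation(h),
        }

heads = DecisionHeadsSketch()
out = heads(torch.randn(2, HIDDEN))
print(out["action_logits"].shape)  # torch.Size([2, 8])
```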
## Usage

This repository contains the model weights (`model.safetensors`) and the necessary configuration files (`config.json`, tokenizer files) to load the model using the Hugging Face `transformers` library.

To load the tokenizer:

```python
from transformers import AutoTokenizer

# The model is a custom architecture, so direct AutoModel loading may require
# custom code or a registered class. Refer to the original training script
# for the exact class definition.

# Load the tokenizer for the text streams
tokenizer = AutoTokenizer.from_pretrained("factorstudios/TIDA_T1")

# Load the model weights (assuming you have the custom class defined)
# model = VisionLanguageActionModel.from_pretrained("factorstudios/TIDA_T1")
```

**Note**: The model's custom architecture (`VisionLanguageActionModel`) is not a standard Hugging Face class. You will need the class definition (as provided in the previously delivered `inference_script.py`) to load the weights correctly.

---

*Generated by Manus AI based on analysis of `train3-v4.py`.*
config.json
ADDED
{
  "_class_name": "VisionLanguageActionModel",
  "architectures": [
    "VisionLanguageActionModel"
  ],
  "model_type": "vla-model",
  "hidden_size": 768,
  "num_tasks": 6,
  "vision_config": {
    "model_type": "vit",
    "image_size": 224,
    "patch_size": 14,
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 4096,
    "projection_dim": 768
  },
  "caption_config": {
    "model_type": "bert",
    "vocab_size": 30522,
    "hidden_size": 1024,
    "num_hidden_layers": 24,
    "num_attention_heads": 16,
    "intermediate_size": 4096,
    "projection_dim": 768
  },
  "context_config": {
    "model_type": "gpt2",
    "vocab_size": 50257,
    "n_positions": 1024,
    "n_embd": 1024,
    "n_layer": 24,
    "n_head": 16,
    "projection_dim": 768
  },
  "spatial_config": {
    "input_dim": 10,
    "output_dim": 768
  },
  "temporal_config": {
    "input_dim": 1280,
    "output_dim": 768
  },
  "fusion_config": {
    "input_dim": 3840,
    "output_dim": 768
  },
  "reasoning_config": {
    "d_model": 768,
    "nhead": 12,
    "num_layers": 8
  },
  "action_head_config": {
    "num_actions": 8
  },
  "explanation_head_config": {
    "vocab_size": 30522
  }
}
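The dimensions in this config are internally consistent: the fusion input width equals the five per-stream projection widths combined. A small sanity-check sketch, assuming the streams are concatenated at their projected widths (a subset of the config is inlined here; normally it would come from `json.load(open("config.json"))`):

```python
import json

# Inlined subset of config.json for illustration
config = json.loads("""{
  "hidden_size": 768,
  "vision_config": {"projection_dim": 768},
  "caption_config": {"projection_dim": 768},
  "context_config": {"projection_dim": 768},
  "spatial_config": {"output_dim": 768},
  "temporal_config": {"output_dim": 768},
  "fusion_config": {"input_dim": 3840, "output_dim": 768}
}""")

stream_dims = [
    config["vision_config"]["projection_dim"],
    config["caption_config"]["projection_dim"],
    config["context_config"]["projection_dim"],
    config["spatial_config"]["output_dim"],
    config["temporal_config"]["output_dim"],
]
# Five 768-d streams concatenate to the fusion input width: 5 * 768 == 3840
assert sum(stream_dims) == config["fusion_config"]["input_dim"]
print("fusion input width consistent:", sum(stream_dims))
```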
special_tokens_map.json
ADDED
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
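The special-token ids here (0, 100, 101, 102, 103) follow the standard `bert-base-uncased` layout, and every special token named in `tokenizer_config.json` also appears in `added_tokens_decoder`. A hedged consistency sketch (the config is inlined for illustration rather than read from disk):

```python
# Inlined subset of tokenizer_config.json for illustration
tokenizer_config = {
    "added_tokens_decoder": {
        "0": {"content": "[PAD]"},
        "100": {"content": "[UNK]"},
        "101": {"content": "[CLS]"},
        "102": {"content": "[SEP]"},
        "103": {"content": "[MASK]"},
    },
    "pad_token": "[PAD]", "unk_token": "[UNK]",
    "cls_token": "[CLS]", "sep_token": "[SEP]", "mask_token": "[MASK]",
}

# Every special token referenced by name must exist in added_tokens_decoder
decoded = {d["content"] for d in tokenizer_config["added_tokens_decoder"].values()}
for key in ("pad_token", "unk_token", "cls_token", "sep_token", "mask_token"):
    assert tokenizer_config[key] in decoded
print("special tokens consistent:", sorted(decoded))
```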
vocab.txt
ADDED
The diff for this file is too large to render.