---
language:
- en
license: apache-2.0
tags:
- unsloth
- transformers
- qwen3_5
- image-text-to-text
- multimodal
- vision-language
- reasoning
- pytorch
base_model: unsloth/Qwen3.5-2B
datasets:
- Phase-Technologies/claude-reasoning-super
- Xerv-AI/TART
pipeline_tag: image-text-to-text
library_name: transformers
metrics:
- accuracy
---

# 🌌 tarn (tarn-2b-vision-reasoning)
Developed by **Xerv-AI**, `tarn` is a compact 2-billion-parameter multimodal vision-language model built on the **Qwen 3.5 VL** architecture. It combines visual perception with chain-of-thought reasoning and targets resource-constrained hardware, local deployments, and streaming applications that need detailed visual understanding.

---
## 📋 Table of Contents
1. [Model Overview](#model-overview)
2. [Intended Architectural Uses & Scope](#intended-architectural-uses--scope)
3. [Memory & VRAM Footprint Benchmarks](#memory--vram-footprint-benchmarks)
4. [Step-by-Step Google Colab Implementation](#step-by-step-google-colab-implementation)
5. [Streaming & Production Pipeline Setup](#streaming--production-pipeline-setup)
6. [Training Topology & Data Lineage](#training-topology--data-lineage)
7. [Ethical Guardrails & Systemic Limitations](#ethical-guardrails--systemic-limitations)
---
## 🧠 Model Overview
Unlike basic image classifiers, `tarn` is trained to produce **Chain-of-Thought (CoT)** reasoning. Given an image-text query, it first works through intermediate reasoning steps, checking its own conclusions and locating the relevant spatial elements, before writing its final answer.
### Key Technical Enhancements
* **Architectural Blueprint:** Fine-tuned with Low-Rank Adaptation (LoRA) on top of the `unsloth/Qwen3.5-2B` base model, so the base architecture is unchanged.
* **Dynamic Resolution Windowing:** Supports bounded image tokenization through the adjustable `min_pixels` and `max_pixels` limits, which helps avoid sudden GPU out-of-memory (OOM) faults.
* **Advanced Token Processing:** Uses multimodal token embeddings to align image feature vectors with the language model's embedding space.
---
## 🎯 Intended Architectural Uses & Scope
### Recommended Core Tasks
* **Visual Problem-Solving:** Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
* **Nuanced Image-Text Analysis:** Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
* **Complex Physics & Abstract Querying:** Responding to interleaved queries that combine text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).
### Out-of-Scope Deployments
* Medical diagnostic automation without expert human verification loops.
* Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
* Generation of biometric verification data or high-stakes demographic filtering.
---
## 📊 Memory & VRAM Footprint Benchmarks
Because the number of vision patches grows with image resolution, unconstrained inference on Qwen 3.5 can cause large VRAM spikes. `tarn` addresses this by bounding the image resolution with `min_pixels` and `max_pixels`.

| Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
| :--- | :--- | :--- | :--- | :--- |
| **Float16 (`fp16`)** | None | ~4.55 GB | ~14.6 GB (OOM Risk) | **~9.83 GB (Safe for T4)** |
| **Int4 (`4-bit`)** | BitsAndBytes | ~1.85 GB | ~6.20 GB | **~3.95 GB** |
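
As a rough cross-check of the loading column, the footprint of the weights alone can be estimated from the parameter count and the bits stored per parameter. The sketch below assumes exactly 2 billion parameters (the card's nominal figure, not an exact count) and ignores any layers left unquantized, framework buffers, and CUDA runtime overhead, which is presumably why the measured values in the table are higher. The function name is illustrative.

```python
def weight_footprint_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: the nominal "2B" figure from this card, not an exact count.
NOMINAL_PARAMS = 2e9

fp16_gb = weight_footprint_gb(NOMINAL_PARAMS, 16)
int4_gb = weight_footprint_gb(NOMINAL_PARAMS, 4)

print(f"fp16 weights only:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights only: ~{int4_gb:.2f} GB")
```

The gap between these lower bounds and the measured loading figures is a reminder to leave headroom rather than sizing a deployment from the parameter count alone.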

> 💡 **Core Recommendation:** For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15 GB VRAM), set the image pixel limits between $256 \times 28 \times 28$ and $512 \times 28 \times 28$ pixels to keep VRAM usage stable and predictable.
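
The two bounds in the recommendation are plain pixel counts. The sketch below works out the arithmetic and the largest square image that fits inside each budget. The helper names are illustrative and not part of any library, and the square-image view is only a convenient way to picture the budget, since real images keep their aspect ratio.

```python
import math

PATCH_SIDE = 28  # side length of the 28 x 28 unit used in the recommendation


def pixel_budget(num_units: int, patch_side: int = PATCH_SIDE) -> int:
    """Total pixels covered by `num_units` blocks of patch_side x patch_side."""
    return num_units * patch_side * patch_side


def largest_square_side(pixels: int) -> int:
    """Side of the largest square image whose area fits in the budget."""
    return math.isqrt(pixels)


min_pixels = pixel_budget(256)
max_pixels = pixel_budget(512)

print(f"min_pixels = {min_pixels} (about {largest_square_side(min_pixels)} px square)")
print(f"max_pixels = {max_pixels} (about {largest_square_side(max_pixels)} px square)")
```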
---
## 🚀 Step-by-Step Google Colab Implementation
To verify and run this model within a standard hardware sandbox environment, execute the blocks below.
### 1. Environment Initialization
Ensure your runtime uses a hardware accelerator (T4 GPU). Then install `transformers` from source, which provides the `qwen3_5` configuration, along with the supporting packages:
```bash
# Force-install source versions supporting the qwen3_5 structural configuration
pip install -q git+https://github.com/huggingface/transformers.git
pip install -q accelerate bitsandbytes torchvision qwen-vl-utils
```
*Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.*
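
After the restart, it can help to confirm that the packages from the install step are visible to the new session. The following is a minimal sketch using only the standard library; the helper name is illustrative, and the package list simply mirrors the `pip install` commands above.

```python
from importlib import metadata

# Distribution names taken from the install commands above.
REQUIRED = ["transformers", "accelerate", "bitsandbytes", "torchvision", "qwen-vl-utils"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:>15}: {version or 'NOT INSTALLED'}")
```

Any line reporting `NOT INSTALLED` suggests the install step did not complete or the session was not restarted.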
### 2. Loading the Model Weights
```python
import torch
from transformers import pipeline
model_id = "Xerv-AI/tarn"
print("Initializing tarn architecture pipelines...")
pipe = pipeline(
    "image-text-to-text", 
    model=model_id, 
    torch_dtype=torch.float16, 
    device_map="auto"
)
print("tarn is loaded and standing by.")
```
## ⚡ Streaming & Production Pipeline Setup
For real-time, user-facing conversational products, waiting for the full response before displaying anything hurts the user experience. Use the `TextStreamer` setup below to stream output token by token to standard output:
```python
from transformers import TextStreamer
# Attach the text streamer interface to the pipeline core
streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
# Build a composite multimodal user payload
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
            },
            {
                "type": "text", 
                "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
            }
        ]
    },
]
print("=== Initiating Real-Time Telemetry Stream ===")
outputs = pipe(
    text=messages, 
    max_new_tokens=1024,  # Leave room for a long chain-of-thought answer
    min_pixels=256*28*28,  # Lower bound on image pixels after resizing
    max_pixels=512*28*28,  # Upper bound on image pixels; limits peak VRAM
    generate_kwargs={"streamer": streamer}
)
```
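When the same call is repeated for different images and questions, the message payload can be built by a small helper. This is a sketch only: `build_messages` is a hypothetical name, and the URL in the usage lines is a placeholder rather than a real image.

```python
def build_messages(image_url: str, question: str) -> list:
    """Build a single-turn user message with one image and one text part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": question},
            ],
        }
    ]


# Placeholder URL for illustration only.
messages = build_messages(
    "https://example.com/wiring-diagram.png",
    "Which terminal is the ground wire connected to?",
)
print(messages[0]["content"][1]["text"])
```

The returned list has the same shape as the `messages` variable in the streaming block, so it can be passed to `pipe(text=messages, ...)` in the same way.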
## 🧬 Training Topology & Data Lineage
The training protocol for `tarn` was designed to go beyond superficial visual question answering. It uses a two-stage distillation and alignment process.

### 1. Dataset Dependencies
* **Xerv-AI/TART (344k records):** Provides core alignment on basic physics, electromagnetism, electrostatics, and everyday real-world sensory scenarios. It grounds the model's factual accuracy in these core domains.
* **Phase-Technologies/claude-reasoning-super (47.8k records):** Teaches the model to work through intermediate reasoning steps. Instead of outputting an immediate guess, the model structures its response with logical markdown hierarchies, self-corrections, and explicit calculations.
### 2. Hyperparameter Settings
* **Optimizer:** AdamW (learning rate: $2 \times 10^{-4}$)
* **Weight Decay Coefficient:** 0.01
* **LR Scheduler:** Linear warmup followed by cosine decay.
* **LoRA Rank ($r$):** 64
* **LoRA Alpha ($\alpha$):** 16
* **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
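
To make these settings concrete, the sketch below computes the LoRA scaling factor and the learning rate at a few points of a warmup-plus-cosine schedule. It assumes the standard LoRA scaling $\alpha / r$ (not the rank-stabilized variant) and a schedule that decays to zero. The step counts are hypothetical, since the card does not state the warmup length or total number of steps.

```python
import math

PEAK_LR = 2e-4   # from the hyperparameter list
LORA_R = 64      # from the hyperparameter list
LORA_ALPHA = 16  # from the hyperparameter list


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / r


def lr_at(step: int, total_steps: int, warmup_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical step counts, chosen only to illustrate the schedule's shape.
TOTAL_STEPS, WARMUP_STEPS = 1000, 100

print(f"LoRA scale alpha/r = {lora_scale(LORA_ALPHA, LORA_R)}")
for step in (0, 50, 100, 550, 1000):
    print(f"step {step:>4}: lr = {lr_at(step, TOTAL_STEPS, WARMUP_STEPS):.2e}")
```

With $\alpha = 16$ and $r = 64$, the adapter update is scaled by 0.25, so the low-rank branch contributes a damped correction relative to a configuration where $\alpha \geq r$.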
## 🛡️ Ethical Guardrails & Systemic Limitations
* **Hallucination Vectors:** As with all generative vision systems, compressing visual input into discrete text can cause hallucinations, especially when the image resolution is constrained too low (e.g., misreading small fonts or dense numbers).
* **Bias Propagation:** `tarn` can inherit societal, technical, and taxonomic biases present in the open-source web crawl data underlying its base model.
* **Sycophancy Risks:** Due to alignment patterns, if a prompt aggressively asserts a falsehood (*"Why is there a dog in this picture of an ocean?"*), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.
## πŸ“œ Citation & Attributions
```bibtex
@misc{tarn2026,
  author       = {Soham Pal and the Xerv-AI Research Team},
  title        = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
  year         = {2026},
  publisher    = {Hugging Face Hub},
  howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
}
```
If you integrate `tarn` or your own derivatives of it into enterprise frameworks, please attribute **Xerv-AI** accordingly. For additional questions or model contributions, open a pull request in the community repository.