Update README.md
README.md
CHANGED
|
@@ -1,21 +1,146 @@
|
| 1 |
---
|
| 2 |
-
|
| 3 |
tags:
|
| 4 |
-
- text-generation-inference
|
| 5 |
-
- transformers
|
| 6 |
- unsloth
|
| 7 |
- qwen3_5
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
-
|
| 11 |
---
|
| 12 |
|
| 13 |
-
#
|
| 14 |
|
| 15 |
-
|
| 16 |
-
-
|
| 17 |
-
|
| 18 |
|
| 19 |
-
|
| 20 |
|
| 21 |
-
|
| 1 |
+
|
| 2 |
---
|
| 3 |
+
language:
|
| 4 |
+
- en
|
| 5 |
+
license: apache-2.0
|
| 6 |
tags:
|
| 7 |
- unsloth
|
| 8 |
+
- transformers
|
| 9 |
- qwen3_5
|
| 10 |
+
- image-text-to-text
|
| 11 |
+
- multimodal
|
| 12 |
+
- vision-language
|
| 13 |
+
- reasoning
|
| 14 |
+
- pytorch
|
| 15 |
+
base_model: unsloth/Qwen3.5-2B
|
| 16 |
+
datasets:
|
| 17 |
+
- Phase-Technologies/claude-reasoning-super
|
| 18 |
+
- xerv-ai/tart
|
| 19 |
+
pipeline_tag: image-text-to-text
|
| 20 |
+
library_name: transformers
|
| 21 |
+
metrics:
|
| 22 |
+
- accuracy
|
| 23 |
---
|
| 24 |
|
| 25 |
+
# tarn (tarn-2b-vision-reasoning)
|
| 26 |
+
Developed by **Xerv-AI**, `tarn` is a compact 2-billion-parameter multimodal vision-language model built on the **Qwen 3.5 VL** architecture. It combines visual perception with chain-of-thought reasoning and targets resource-constrained hardware, local deployments, and low-latency streaming applications that need contextual visual understanding.
|
| 27 |
+
---
|
| 28 |
+
## Table of Contents
|
| 29 |
+
1. [Model Overview](#model-overview)
|
| 30 |
+
2. [Intended Architectural Uses & Scope](#intended-architectural-uses--scope)
|
| 31 |
+
3. [Memory & VRAM Footprint Benchmarks](#memory--vram-footprint-benchmarks)
|
| 32 |
+
4. [Step-by-Step Google Colab Implementation](#step-by-step-google-colab-implementation)
|
| 33 |
+
5. [Streaming & Production Pipeline Setup](#streaming--production-pipeline-setup)
|
| 34 |
+
6. [Training Topology & Data Lineage](#training-topology--data-lineage)
|
| 35 |
+
7. [Ethical Guardrails & Systemic Limitations](#ethical-guardrails--systemic-limitations)
|
| 36 |
+
---
|
| 37 |
+
## Model Overview
|
| 38 |
+
Unlike basic image-classification systems, `tarn` is trained to produce **Chain-of-Thought (CoT)** reasoning. Given an image-text query, it works through intermediate analysis steps, locating spatial elements and checking its own conclusions, before writing its final answer.
|
| 39 |
+
### Key Technical Enhancements
|
| 40 |
+
* **Architectural Blueprint:** Fine-tuned with Low-Rank Adaptation (LoRA) on the `unsloth/Qwen3.5-2B` base model.
|
| 41 |
+
* **Dynamic Resolution Windowing:** Supports bounded image tokenization via the adjustable `min_pixels` and `max_pixels` limits, which reduces the risk of sudden GPU out-of-memory (OOM) failures.
|
| 42 |
+
* **Advanced Token Processing:** Uses dedicated multimodal token embeddings to map image feature vectors into the language model's embedding space.
|
| 43 |
+
---
|
| 44 |
+
## Intended Architectural Uses & Scope
|
| 45 |
+
### Recommended Core Tasks
|
| 46 |
+
* **Visual Problem-Solving:** Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
|
| 47 |
+
* **Nuanced Image-Text Analysis:** Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
|
| 48 |
+
* **Complex Physics & Abstract Querying:** Responding to interleaved queries that require text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).
|
| 49 |
+
### Out-of-Scope Deployments
|
| 50 |
+
* Medical diagnostic automation without expert human verification loops.
|
| 51 |
+
* Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
|
| 52 |
+
* Generation of biometric verification data or high-stakes demographic filtering.
|
| 53 |
+
---
|
| 54 |
+
## Memory & VRAM Footprint Benchmarks
|
| 55 |
+
Because the number of vision patches grows with image resolution, unconstrained generation with Qwen 3.5 can cause large VRAM spikes. `tarn` mitigates this by bounding the image resolution with the `min_pixels` and `max_pixels` limits.
|
| 56 |
|
| 57 |
+
| Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
|
| 58 |
+
| :--- | :--- | :--- | :--- | :--- |
|
| 59 |
+
| **Float16 (`fp16`)** | None | ~4.55 GB | ~14.6 GB (OOM Risk) | **~9.83 GB (Safe for T4)** |
|
| 60 |
+
| **Int4 (`4-bit`)** | BitsAndBytes | ~1.85 GB | ~6.20 GB | **~3.95 GB** |
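
As a rough cross-check of the table, the sketch below compares the reported loading figures with a back-of-the-envelope weight estimate and computes how much inference VRAM the bounded setting saves. It assumes the parameter count is exactly 2e9 and that weights cost 2 bytes (fp16) or 0.5 bytes (4-bit) each; runtime buffers and quantization metadata are ignored, which is one plausible reason the reported numbers are higher.

```python
# Figures copied from the table above, in GB.
TABLE = {
    "fp16": {"load": 4.55, "unbounded": 14.6, "bounded": 9.83},
    "int4": {"load": 1.85, "unbounded": 6.20, "bounded": 3.95},
}

N_PARAMS = 2e9  # assumption: "2-billion" taken as exactly 2e9 parameters


def weight_gb(n_params: float, bytes_per_param: float) -> float:
    """Raw weight storage in decimal GB, ignoring buffers and runtime overhead."""
    return n_params * bytes_per_param / 1e9


def bounded_saving(row: dict) -> float:
    """Fractional drop in inference VRAM when the pixel bounds are applied."""
    return (row["unbounded"] - row["bounded"]) / row["unbounded"]


for name, bytes_per_param in (("fp16", 2.0), ("int4", 0.5)):
    row = TABLE[name]
    print(
        f"{name}: raw weights ~{weight_gb(N_PARAMS, bytes_per_param):.2f} GB, "
        f"reported load {row['load']:.2f} GB, "
        f"bounded saving {bounded_saving(row):.1%}"
    )
```

In both rows the bounded configuration removes roughly a third of the unbounded inference footprint.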
|
| 61 |
|
| 62 |
+
> **Core Recommendation:** For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15 GB VRAM), set `min_pixels` to $256 \times 28 \times 28$ and `max_pixels` to $512 \times 28 \times 28$ to keep peak memory use bounded and predictable.
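
To make the recommended limits concrete, here is a small sketch that evaluates the two pixel budgets and shows how an image could be rescaled to fit inside them. The function `fit_to_budget` is hypothetical and is modeled on the smart-resize logic used in Qwen2-VL preprocessing; whether the Qwen 3.5 processor resizes in exactly this way is an assumption.

```python
import math

PATCH_UNIT = 28                              # the 28 x 28 factor from the recommendation
MIN_PIXELS = 256 * PATCH_UNIT * PATCH_UNIT   # 200,704 pixels
MAX_PIXELS = 512 * PATCH_UNIT * PATCH_UNIT   # 401,408 pixels


def fit_to_budget(width, height, min_pixels=MIN_PIXELS,
                  max_pixels=MAX_PIXELS, unit=PATCH_UNIT):
    """Return (width, height) snapped to multiples of `unit` and scaled so the
    area falls inside [min_pixels, max_pixels]. Hypothetical helper; extreme
    aspect ratios are not handled."""
    new_w = max(unit, round(width / unit) * unit)
    new_h = max(unit, round(height / unit) * unit)
    if new_w * new_h > max_pixels:
        # Shrink: flooring keeps the area at or below the upper bound.
        beta = math.sqrt(width * height / max_pixels)
        new_w = max(unit, math.floor(width / beta / unit) * unit)
        new_h = max(unit, math.floor(height / beta / unit) * unit)
    elif new_w * new_h < min_pixels:
        # Enlarge: ceiling keeps the area at or above the lower bound.
        beta = math.sqrt(min_pixels / (width * height))
        new_w = math.ceil(width * beta / unit) * unit
        new_h = math.ceil(height * beta / unit) * unit
    return new_w, new_h


w, h = fit_to_budget(4032, 3024)  # a typical 12-megapixel photo
print(w, h, w * h, (w // PATCH_UNIT) * (h // PATCH_UNIT))
```

The last printed value is the number of 28 x 28 blocks in the resized image, which stays at or below 512 for the upper bound used here.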
|
| 63 |
+
---
|
| 64 |
+
## Step-by-Step Google Colab Implementation
|
| 65 |
+
To verify and run this model within a standard hardware sandbox environment, execute the blocks below.
|
| 66 |
+
### 1. Environment Initialization
|
| 67 |
+
Ensure your runtime is pointing to a hardware accelerator backend (T4 GPU). Install the bleeding-edge architecture updates from source:
|
| 68 |
+
```bash
|
| 69 |
+
# Force-install source versions supporting the qwen3_5 structural configuration
|
| 70 |
+
pip install -q git+https://github.com/huggingface/transformers.git
|
| 71 |
+
pip install -q accelerate bitsandbytes torchvision qwen-vl-utils
|
| 72 |
+
```
|
| 73 |
+
*Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.*
|
| 74 |
+
### 2. Loading the Model Weights
|
| 75 |
+
```python
|
| 76 |
+
import torch
|
| 77 |
+
from transformers import pipeline
|
| 78 |
+
model_id = "Xerv-AI/tarn"
|
| 79 |
+
print("Initializing tarn architecture pipelines...")
|
| 80 |
+
pipe = pipeline(
|
| 81 |
+
"image-text-to-text",
|
| 82 |
+
model=model_id,
|
| 83 |
+
torch_dtype=torch.float16,
|
| 84 |
+
device_map="auto"
|
| 85 |
+
)
|
| 86 |
+
print("tarn is loaded and standing by.")
|
| 87 |
+
```
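
The Int4 row of the VRAM table implies a 4-bit BitsAndBytes load, which the block above does not show. A minimal sketch follows, assuming NF4 quantization with float16 compute (the card only says "4-bit") and a CUDA GPU with `bitsandbytes` installed; the function name `load_tarn_4bit` is hypothetical.

```python
MODEL_ID = "Xerv-AI/tarn"


def load_tarn_4bit(model_id: str = MODEL_ID):
    """Build an image-text-to-text pipeline with 4-bit BitsAndBytes weights.

    Heavy imports live inside the function so that defining it does not
    require a GPU stack; calling it does.
    """
    import torch
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
        pipeline,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # assumption: NF4, not stated in the card
        bnb_4bit_compute_dtype=torch.float16,  # matches the fp16 setting used above
    )
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    processor = AutoProcessor.from_pretrained(model_id)
    return pipeline("image-text-to-text", model=model, processor=processor)
```

Calling `pipe = load_tarn_4bit()` would then stand in for the fp16 `pipeline(...)` call when memory is tight.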
|
| 88 |
+
## Streaming & Production Pipeline Setup
|
| 89 |
+
For real-time, user-facing conversational products, waiting for the full response before displaying it hurts the user experience. Use the `TextStreamer` implementation below to print tokens to standard output as they are generated:
|
| 90 |
+
```python
|
| 91 |
+
from transformers import TextStreamer
|
| 92 |
+
# Attach the text streamer interface to the pipeline core
|
| 93 |
+
streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
|
| 94 |
+
# Build a composite multimodal user payload
|
| 95 |
+
messages = [
|
| 96 |
+
{
|
| 97 |
+
"role": "user",
|
| 98 |
+
"content": [
|
| 99 |
+
{
|
| 100 |
+
"type": "image",
|
| 101 |
+
"url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
|
| 102 |
+
},
|
| 103 |
+
{
|
| 104 |
+
"type": "text",
|
| 105 |
+
"text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
|
| 106 |
+
}
|
| 107 |
+
]
|
| 108 |
+
},
|
| 109 |
+
]
|
| 110 |
+
print("=== Initiating Real-Time Telemetry Stream ===")
|
| 111 |
+
outputs = pipe(
|
| 112 |
+
text=messages,
|
| 113 |
+
max_new_tokens=1024, # Extend depth capability safely
|
| 114 |
+
min_pixels=256*28*28, # Set baseline feature extraction map
|
| 115 |
+
max_pixels=512*28*28, # Cap peak VRAM consumption upper bound
|
| 116 |
+
generate_kwargs={"streamer": streamer}
|
| 117 |
+
)
|
| 118 |
+
```
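
When the same payload shape is built repeatedly, a small helper keeps it consistent. The sketch below is a hypothetical convenience function, not part of the `transformers` API; it rejects strings that are not plain http(s) URLs, such as a URL accidentally wrapped in Markdown link syntax.

```python
from urllib.parse import urlparse


def build_messages(image_url: str, prompt: str) -> list:
    """Return a single-turn chat payload with one image and one text part."""
    parsed = urlparse(image_url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a plain http(s) URL: {image_url!r}")
    if not prompt.strip():
        raise ValueError("prompt must not be empty")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


messages = build_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "What animal is on the candy?",
)
print(messages[0]["content"][1]["text"])
```

The returned list can be passed as `text=messages` in the pipeline call shown above.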
|
| 119 |
+
## 𧬠Training Topology & Data Lineage
|
| 120 |
+
`tarn` was trained to go beyond superficial visual question answering, using a two-stage distillation and alignment process.
|
| 121 |
|
| 122 |
+
### 1. Dataset Dependencies
|
| 123 |
+
* **xerv-ai/tart (344k records):** Provides core alignments on basic physics, electromagnetism, electrostatics, and real-world everyday sensory scenarios. It grounds the model's factual accuracy in high-density core domains.
|
| 124 |
+
* **Phase-Technologies/claude-reasoning-super (47.8k records):** Teaches the model to work through intermediate reasoning steps. Instead of outputting an immediate guess, it structures the response using logical markdown hierarchies, self-corrections, and explicit calculations.
|
| 125 |
+
### 2. Hyperparameter Settings
|
| 126 |
+
* **Optimizer:** AdamW (Learning Rate: $2 \times 10^{-4}$)
|
| 127 |
+
* **Weight Decay:** 0.01
|
| 128 |
+
* **LR Scheduler:** Linear warmup followed by cosine decay.
|
| 129 |
+
* **LoRA Rank (r):** 64
|
| 130 |
+
* **LoRA Alpha ($\alpha$):** 16
|
| 131 |
+
* **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
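
The settings above can be illustrated with a short sketch of the LoRA scaling factor and the learning-rate schedule. It assumes the standard LoRA scaling $\alpha / r$ (not rank-stabilized LoRA), and the warmup length, total step count, and layer width used below are hypothetical values chosen only for illustration, since the card does not state them.

```python
import math

LORA_R = 64
LORA_ALPHA = 16
PEAK_LR = 2e-4


def lora_scale(alpha: float, r: int) -> float:
    """Standard LoRA scaling applied to the low-rank update: alpha / r."""
    return alpha / r


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


def lr_at(step: int, total_steps: int, warmup_steps: int,
          peak_lr: float = PEAK_LR) -> float:
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


print(lora_scale(LORA_ALPHA, LORA_R))    # 0.25
print(lora_params(2048, 2048, LORA_R))   # hypothetical 2048-wide projection
print([lr_at(s, 1000, 100) for s in (0, 50, 100, 550, 1000)])  # hypothetical steps
```

With rank 64 and alpha 16, the low-rank update is scaled by 0.25 before it is added to the frozen weights.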
|
| 132 |
+
## Ethical Guardrails & Systemic Limitations
|
| 133 |
+
* **Hallucination Vectors:** Like all generative vision systems, `tarn` compresses visual input into discrete text and can hallucinate when the image resolution is constrained too low (e.g., misreading small fonts or densely packed numbers).
|
| 134 |
+
* **Bias Propagation:** `tarn` can inherit societal, technical, and taxonomic biases present in the open web crawl data used to train its base model.
|
| 135 |
+
* **Sycophancy Risks:** Due to alignment patterns, if a prompt aggressively asserts a falsehood (*"Why is there a dog in this picture of an ocean?"*), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.
|
| 136 |
+
## Citation & Attributions
|
| 137 |
+
```bibtex
|
| 138 |
+
@misc{tarn2026,
|
| 139 |
+
author = {Soham Pal and the Xerv-AI Research Team},
|
| 140 |
+
title = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
|
| 141 |
+
year = {2026},
|
| 142 |
+
publisher = {Hugging Face Hub},
|
| 143 |
+
howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
|
| 144 |
+
}
|
| 145 |
+
```
|
| 146 |
+
If you integrate `tarn` or a model derived from it into your own products, please attribute **Xerv-AI** accordingly. For questions or contributions, open a pull request or discussion in the model's community repository.
|