Instructions to use Xerv-AI/tarn with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Xerv-AI/tarn with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Xerv-AI/tarn")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Xerv-AI/tarn")
model = AutoModelForImageTextToText.from_pretrained("Xerv-AI/tarn")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
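
The final `print` line decodes only the newly generated tokens. For a decoder-only model, `generate` returns the prompt tokens followed by the continuation, so the slice drops the first `inputs["input_ids"].shape[-1]` positions. A minimal sketch of that slicing, with plain Python lists standing in for tensors (the token ids and the `strip_prompt` helper are made up for illustration):

```python
# Illustrative only: plain lists stand in for the tensors returned by generate().
def strip_prompt(output_ids, prompt_length):
    """Return only the tokens that come after the first prompt_length positions."""
    return output_ids[prompt_length:]

prompt_ids = [101, 7, 8, 9]            # hypothetical prompt token ids
output_ids = prompt_ids + [42, 43, 2]  # prompt followed by generated ids
new_ids = strip_prompt(output_ids, len(prompt_ids))
print(new_ids)
```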

Local Apps

vLLM

How to use Xerv-AI/tarn with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Xerv-AI/tarn"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
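
The same request can be sent from Python instead of curl. The sketch below uses only the standard library and assumes the server started above is listening on `localhost:8000`; the helper names `build_chat_payload` and `post_chat` are made up for this example and are not part of vLLM.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload, base_url="http://localhost:8000"):
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_chat_payload(
    "Xerv-AI/tarn",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)

# Requires a running server; uncomment to send the request:
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

Passing `base_url="http://localhost:30000"` should point the same helper at a server on port 30000, since the endpoint path is the same.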

Use Docker

docker model run hf.co/Xerv-AI/tarn

SGLang

How to use Xerv-AI/tarn with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Xerv-AI/tarn" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Xerv-AI/tarn" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Xerv-AI/tarn",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Unsloth Studio

How to use Xerv-AI/tarn with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Xerv-AI/tarn to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Xerv-AI/tarn to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Xerv-AI/tarn to start chatting

Load model with FastModel

# Install Unsloth from pip:
pip install unsloth

# Then load the model in Python:
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="Xerv-AI/tarn",
    max_seq_length=2048,
)

Docker Model Runner
How to use Xerv-AI/tarn with Docker Model Runner:
```
docker model run hf.co/Xerv-AI/tarn
```

Phase-Technologies committed 3 days ago

Commit 0c27eda (verified) · 1 Parent(s): 076408b

Update README.md

Files changed (1)

README.md +137 -12

@@ -1,21 +1,146 @@
 ---
-base_model: unsloth/Qwen3.5-2B
 tags:
-- text-generation-inference
-- transformers
 - unsloth
 - qwen3_5
-license: apache-2.0
-language:
-- en
 ---
-# Uploaded finetuned  model
-- **Developed by:** Phase-Technologies
-- **License:** apache-2.0
-- **Finetuned from model :** unsloth/Qwen3.5-2B
-This qwen3_5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
-[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)

 ---
+language:
+- en
+license: apache-2.0
 tags:
 - unsloth
+- transformers
 - qwen3_5
+- image-text-to-text
+- multimodal
+- vision-language
+- reasoning
+- pytorch
+base_model: unsloth/Qwen3.5-2B
+datasets:
+- Phase-Technologies/claude-reasoning-super
+- xerv-ai/tart
+pipeline_tag: image-text-to-text
+library_name: transformers
+metrics:
+- accuracy
 ---
+# 🌌 tarn (tarn-2b-vision-reasoning)
+Developed by **Xerv-AI**, `tarn` is an optimized, ultra-compact 2-Billion parameter multimodal vision-language engine built upon the **Qwen 3.5 VL** architecture. By merging core perception mechanics with complex chain-of-thought data processing topologies, `tarn` is uniquely tailored for resource-constrained architectures, local deployments, and high-velocity streaming infrastructures requiring deep contextual visual comprehension.
+---
+## 📋 Table of Contents
+1. [Model Overview](#model-overview)
+2. [Intended Architectural Uses & Scope](#intended-architectural-uses--scope)
+3. [Memory & VRAM Footprint Benchmarks](#memory--vram-footprint-benchmarks)
+4. [Step-by-Step Google Colab Implementation](#step-by-step-google-colab-implementation)
+5. [Streaming & Production Pipeline Setup](#streaming--production-pipeline-setup)
+6. [Training Topology & Data Lineage](#training-topology--data-lineage)
+7. [Ethical Guardrails & Systemic Limitations](#ethical-guardrails--systemic-limitations)
+---
+## 🧠 Model Overview
+Unlike basic classification vision systems, `tarn` incorporates a native **Chain-of-Thought (CoT)** reasoning matrix. When faced with an image-text query, it executes an internal multi-layered analytical pass to self-correct and map spatial elements before formatting its final output.
+### Key Technical Enhancements
+* **Architectural Blueprint:** Fine-tuned via Low-Rank Adaptation (LoRA) over the `unsloth/Qwen3.5-2B` base framework, maintaining architectural elasticity.
+* **Dynamic Resolution Windowing:** Supports bounded image tokenization via adjustable `min_pixels` and `max_pixels` limits, which helps avoid sudden GPU out-of-memory (OOM) faults.
+* **Advanced Token Processing:** Utilizes specialized multimodal token sequence embeddings to seamlessly align image feature vectors into the foundational language space.
+---
+## 🎯 Intended Architectural Uses & Scope
+### Recommended Core Tasks
+* **Visual Problem-Solving:** Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
+* **Nuanced Image-Text Analysis:** Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
+* **Complex Physics & Abstract Querying:** Responding to interleaved queries that require text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).
+### Out-of-Scope Deployments
+* Medical diagnostic automation without expert human verification loops.
+* Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
+* Generation of biometric verification data or high-stakes demographic filtering.
+---
+## 📊 Memory & VRAM Footprint Benchmarks
+Due to the intense multi-dimensional matrix layout of Qwen 3.5's vision patches, native unconstrained generation can result in extreme VRAM spikes. `tarn` solves this by introducing dynamic spatial constraints.
+| Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
+| :--- | :--- | :--- | :--- | :--- |
+| **Float16 (`fp16`)** | None | ~4.55 GB | ~14.6 GB (OOM Risk) | **~9.83 GB (Safe for T4)** |
+| **Int4 (`4-bit`)** | BitsAndBytes | ~1.85 GB | ~6.20 GB | **~3.95 GB** |
+> 💡 **Core Recommendation:** For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15GB VRAM), set `min_pixels` to $256 \times 28 \times 28$ and `max_pixels` to $512 \times 28 \times 28$ so that the per-image pixel budget, and with it peak VRAM use, stays bounded.
+---
+## 🚀 Step-by-Step Google Colab Implementation
+To verify and run this model within a standard hardware sandbox environment, execute the blocks below.
+### 1. Environment Initialization
+Ensure your runtime is pointing to a hardware accelerator backend (T4 GPU). Install the bleeding-edge architecture updates from source:
+```bash
+# Force-install source versions supporting the qwen3_5 structural configuration
+pip install -q git+https://github.com/huggingface/transformers.git
+pip install -q accelerate bitsandbytes torchvision qwen-vl-utils
+```
+*Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.*
+### 2. Loading the Model Weights
+```python
+import torch
+from transformers import pipeline
+model_id = "Xerv-AI/tarn"
+print("Initializing tarn architecture pipelines...")
+pipe = pipeline(
+    "image-text-to-text",
+    model=model_id,
+    torch_dtype=torch.float16,
+    device_map="auto"
+)
+print("tarn is loaded and standing by.")
+```
+## ⚡ Streaming & Production Pipeline Setup
+For real-time, user-facing conversational products, waiting for the full generation before displaying anything hurts the user experience. Use the `TextStreamer` implementation below to stream output token by token to standard output:
+```python
+from transformers import TextStreamer
+# Attach the text streamer interface to the pipeline core
+streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
+# Build a composite multimodal user payload
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
+            },
+            {
+                "type": "text",
+                "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
+            }
+        ]
+    },
+]
+print("=== Initiating Real-Time Telemetry Stream ===")
+outputs = pipe(
+    text=messages,
+    max_new_tokens=1024, # Leave room for long reasoning traces
+    min_pixels=256*28*28, # Lower bound on the per-image pixel budget
+    max_pixels=512*28*28, # Upper bound on the per-image pixel budget; caps peak VRAM
+    generate_kwargs={"streamer": streamer}
+)
+```
+## 🧬 Training Topology & Data Lineage
+The training protocol of tarn was heavily engineered to break the paradigm of superficial visual question answering. It is optimized through a two-stage distillation and alignment process.
+### 1. Dataset Dependencies
+ * **xerv-ai/tart (344k records):** Provides core alignments on basic physics, electromagnetism, electrostatics, and real-world everyday sensory scenarios. It grounds the model's factual accuracy in high-density core domains.
+ * **Phase-Technologies/claude-reasoning-super (47.8k records):** Instructs the model's internal decoder to prioritize complex hidden steps. Instead of outputting an immediately available guess, it structures the response using logical markdown hierarchies, self-corrections, and explicit calculations.
+### 2. Hyperparameter Settings
+ * **Optimizer:** AdamW (Learning Rate: $2 \times 10^{-4}$)
+ * **Weight Decay:** 0.01
+ * **LR Scheduler:** Linear warmup followed by cosine decay.
+ * **LoRA Rank ($r$):** 64
+ * **LoRA Alpha ($\alpha$):** 16
+ * **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
+## 🛡️ Ethical Guardrails & Systemic Limitations
+ * **Hallucination Vectors:** Like all generative vision systems, compressing multi-dimensional visual spaces into discrete texts can cause hallucinations if the image resolution is constrained too low (e.g., misreading small font sizes or highly dense numbers).
+ * **Bias Propagations:** tarn can inherit underlying societal, technical, and taxonomic biases hidden inside the open source web data crawls forming its initial foundations.
+ * **Sycophancy Risks:** Due to alignment patterns, if a prompt aggressively asserts a falsehood (*"Why is there a dog in this picture of an ocean?"*), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.
+## 📜 Citation & Attributions
+```bibtex
+@misc{tarn2026,
+  author       = {Soham Pal and the Xerv-AI Research Team},
+  title        = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
+  year         = {2026},
+  publisher    = {Hugging Face Hub},
+  howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
+}
+```
+If you integrate tarn or your custom structural derivatives into enterprise frameworks, please attribute **Xerv-AI** accordingly. For additional questions or model contributions, open a pull request directly in the community repository channel.
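
For reference, the `min_pixels` and `max_pixels` bounds used in the README's streaming example work out as follows. This is plain arithmetic on the values quoted in the diff. The helper name `pixel_budget` is made up for this sketch. Reading `N * 28 * 28` as a budget of N patches of 28×28 pixels is an interpretation of the expression, not something the README states.

```python
import math

PATCH_SIDE = 28  # side length used in the README's min_pixels / max_pixels expressions


def pixel_budget(num_patches, patch_side=PATCH_SIDE):
    """Total pixel count for a budget of num_patches square patches."""
    return num_patches * patch_side * patch_side


min_pixels = pixel_budget(256)
max_pixels = pixel_budget(512)
print(min_pixels, max_pixels)  # 200704 401408

# Largest square image (in whole pixels per side) that fits within each budget.
print(math.isqrt(min_pixels), math.isqrt(max_pixels))  # 448 633
```

Under that reading, a square input larger than roughly 633×633 pixels would exceed the upper bound and would have to be downscaled.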