--- language: - en license: apache-2.0 tags: - unsloth - transformers - qwen3_5 - image-text-to-text - multimodal - vision-language - reasoning - pytorch base_model: unsloth/Qwen3.5-2B datasets: - Phase-Technologies/claude-reasoning-super - Xerv-AI/TART pipeline_tag: image-text-to-text library_name: transformers metrics: - accuracy --- # 🌌 tarn (tarn-2b-vision-reasoning) Developed by **Xerv-AI**, `tarn` is an optimized, ultra-compact 2-Billion parameter multimodal vision-language engine built upon the **Qwen 3.5 VL** architecture. By merging core perception mechanics with complex chain-of-thought data processing topologies, `tarn` is uniquely tailored for resource-constrained architectures, local deployments, and high-velocity streaming infrastructures requiring deep contextual visual comprehension. --- ## 📋 Table of Contents 1. [Model Overview](#model-overview) 2. [Intended Architectural Uses & Scope](#intended-architectural-uses--scope) 3. [Memory & VRAM Footprint Benchmarks](#memory--vram-footprint-benchmarks) 4. [Step-by-Step Google Colab Implementation](#step-by-step-google-colab-implementation) 5. [Streaming & Production Pipeline Setup](#streaming--production-pipeline-setup) 6. [Training Topology & Data Lineage](#training-topology--data-lineage) 7. [Ethical Guardrails & Systemic Limitations](#ethical-guardrails--systemic-limitations) --- ## 🧠 Model Overview Unlike basic classification vision systems, `tarn` incorporates a native **Chain-of-Thought (CoT)** reasoning matrix. When faced with an image-text query, it executes an internal multi-layered analytical pass to self-correct and map spatial elements before formatting its final output. ### Key Technical Enhancements * **Architectural Blueprint:** Fine-tuned via Low-Rank Adaptation (LoRA) over the `unsloth/Qwen3.5-2B` base framework, maintaining architectural elasticity. * **Dynamic Resolution Windowing:** Supports bounded image tokenization via adjustable `min_pixels` and `max_pixels` scaling layers, eliminating sudden GPU out-of-memory (OOM) faults. * **Advanced Token Processing:** Utilizes specialized multimodal token sequence embeddings to seamlessly align image feature vectors into the foundational language space. --- ## 🎯 Intended Architectural Uses & Scope ### Recommended Core Tasks * **Visual Problem-Solving:** Breaking down multi-step actions inside an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials). * **Nuanced Image-Text Analysis:** Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags. * **Complex Physics & Abstract Querying:** Responding to interleaved queries requiring both text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics). ### Out-of-Scope Deployments * Medical diagnostic automation without expert human verification loops. * Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems). * Generation of biometric verification data or high-stakes demographic filtering. --- ## 📊 Memory & VRAM Footprint Benchmarks Due to the intense multi-dimensional matrix layout of Qwen 3.5's vision patches, native unconstrained generation can result in extreme VRAM spikes. `tarn` solves this by introducing dynamic spatial constraints. | Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM | | :--- | :--- | :--- | :--- | :--- | | **Float16 (`fp16`)** | None | ~4.55 GB | ~14.6 GB (OOM Risk) | **~9.83 GB (Safe for T4)** | | **Int4 (`4-bit`)** | BitsAndBytes | ~1.85 GB | ~6.20 GB | **~3.95 GB** | > 💡 **Core Recommendation:** For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15GB VRAM), always set execution patch limits between $256 \times 28 \times 28$ and $512 \times 28 \times 28$ pixels to guarantee stable, deterministic execution boundaries. --- ## 🚀 Step-by-Step Google Colab Implementation To verify and run this model within a standard hardware sandbox environment, execute the blocks below. ### 1. Environment Initialization Ensure your runtime is pointing to a hardware accelerator backend (T4 GPU). Install the bleeding-edge architecture updates from source: ```bash # Force-install source versions supporting the qwen3_5 structural configuration pip install -q git+[https://github.com/huggingface/transformers.git](https://github.com/huggingface/transformers.git) pip install -q accelerate bitsandbytes torchvision qwen-vl-utils ``` *Note: Make sure to navigate to Runtime -> Restart session after installation to initialize the new environment context.* ### 2. Loading the Model Weights ```python import torch from transformers import pipeline model_id = "Xerv-AI/tarn" print("Initializing tarn architecture pipelines...") pipe = pipeline( "image-text-to-text", model=model_id, torch_dtype=torch.float16, device_map="auto" ) print("tarn is loaded and standing by.") ``` ## ⚡ Streaming & Production Pipeline Setup For real-time user-facing conversational products, buffering text generation hurts user experience. Use the TextStreamer implementation below to stream outputs token-by-token directly to your standard output array: ```python from transformers import TextStreamer # Attach the text streamer interface to the pipeline core streamer = TextStreamer(pipe.tokenizer, skip_prompt=True) # Build a composite multimodal user payload messages = [ { "role": "user", "content": [ { "type": "image", "url": "[https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG)" }, { "type": "text", "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity." } ] }, ] print("=== Initiating Real-Time Telemetry Stream ===") outputs = pipe( text=messages, max_new_tokens=1024, # Extend depth capability safely min_pixels=256*28*28, # Set baseline feature extraction map max_pixels=512*28*28, # Cap peak VRAM consumption upper bound generate_kwargs={"streamer": streamer} ) ``` ## 🧬 Training Topology & Data Lineage The training protocol of tarn was heavily engineered to break the paradigm of superficial visual question answering. It is optimized through a two-stage distillation and alignment process. ### 1. Dataset Dependencies * **xerv-ai/tart (344k records):** Provides core alignments on basic physics, electromagnetism, electrostatics, and real-world everyday sensory scenarios. It grounds the model's factual accuracy in high-density core domains. * **Phase-Technologies/claude-reasoning-super (47.8k records):** Instructs the model's internal decoder to prioritize complex hidden steps. Instead of outputting an immediately available guess, it structures the response using logical markdown hierarchies, self-corrections, and explicit calculations. ### 2. Hyperparameter Settings * **Optimizer:** AdamW (Learning Rate: 2 \times 10^{-4}) * **Weight Decay Coefficients:** 0.01 * **Lr Scheduler Sequence:** Linear warmup followed by cosine attenuation. * **LoRA Rank (r):** 64 * **LoRA Alpha (\alpha):** 16 * **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj ## 🛡️ Ethical Guardrails & Systemic Limitations * **Hallucination Vectors:** Like all generative vision systems, compressing multi-dimensional visual spaces into discrete texts can cause hallucinations if the image resolution is constrained too low (e.g., misreading small font sizes or highly dense numbers). * **Bias Propagations:** tarn can inherit underlying societal, technical, and taxonomic biases hidden inside the open source web data crawls forming its initial foundations. * **Sycophancy Risks:** Due to alignment patterns, if a prompt aggressively asserts a falsehood (*"Why is there a dog in this picture of a ocean?"*), the model may spend its initial reasoning block trying to justify the user's premise before correcting it. ## 📜 Citation & Attributions ```latex @misc{tarn2026, author = {Soham Pal and the Xerv-AI Research Team}, title = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine}, year = {2026}, publisher = {Hugging Face Hub}, howpublished = {\url{[https://huggingface.co/Xerv-AI/tarn](https://huggingface.co/Xerv-AI/tarn)}} } ``` If you integrate tarn or your custom structural derivatives into enterprise frameworks, please attribute **Xerv-AI** accordingly. For additional questions or model contributions, open a pull request directly in the community repository channel.