Phase-Technologies committed
Commit 0c27eda · verified
1 Parent(s): 076408b

Update README.md

  1. README.md +137 -12
README.md CHANGED
@@ -1,21 +1,146 @@
  ---
- base_model: unsloth/Qwen3.5-2B
  tags:
- - text-generation-inference
- - transformers
  - unsloth
  - qwen3_5
- license: apache-2.0
- language:
- - en
  ---
 
- # Uploaded finetuned model
 
- - **Developed by:** Phase-Technologies
- - **License:** apache-2.0
- - **Finetuned from model :** unsloth/Qwen3.5-2B
 
- This qwen3_5 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Huggingface's TRL library.
 
- [<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)
+
  ---
+ language:
+ - en
+ license: apache-2.0
  tags:
  - unsloth
+ - transformers
  - qwen3_5
+ - image-text-to-text
+ - multimodal
+ - vision-language
+ - reasoning
+ - pytorch
+ base_model: unsloth/Qwen3.5-2B
+ datasets:
+ - Phase-Technologies/claude-reasoning-super
+ - xerv-ai/tart
+ pipeline_tag: image-text-to-text
+ library_name: transformers
+ metrics:
+ - accuracy
  ---
 
+ # 🌌 tarn (tarn-2b-vision-reasoning)
+ Developed by **Xerv-AI**, `tarn` is a compact 2-billion-parameter multimodal vision-language model built on the **Qwen 3.5 VL** architecture. It combines image understanding with chain-of-thought reasoning and targets resource-constrained hardware, local deployments, and streaming applications that need detailed visual comprehension.
+ ---
+ ## 📋 Table of Contents
+ 1. [Model Overview](#model-overview)
+ 2. [Intended Architectural Uses & Scope](#intended-architectural-uses--scope)
+ 3. [Memory & VRAM Footprint Benchmarks](#memory--vram-footprint-benchmarks)
+ 4. [Step-by-Step Google Colab Implementation](#step-by-step-google-colab-implementation)
+ 5. [Streaming & Production Pipeline Setup](#streaming--production-pipeline-setup)
+ 6. [Training Topology & Data Lineage](#training-topology--data-lineage)
+ 7. [Ethical Guardrails & Systemic Limitations](#ethical-guardrails--systemic-limitations)
+ ---
+ ## 🧠 Model Overview
+ Unlike vision systems that only classify images, `tarn` is trained to produce **Chain-of-Thought (CoT)** reasoning. Given an image-text query, it first works through intermediate reasoning steps, locating the relevant spatial elements and correcting itself, before writing its final output.
+ ### Key Technical Enhancements
+ * **Architectural Blueprint:** Fine-tuned with Low-Rank Adaptation (LoRA) on the `unsloth/Qwen3.5-2B` base model, leaving the base architecture unchanged.
+ * **Dynamic Resolution Windowing:** Bounds image tokenization through the adjustable `min_pixels` and `max_pixels` parameters, which reduces the risk of sudden GPU out-of-memory (OOM) errors.
+ * **Advanced Token Processing:** Uses multimodal token embeddings to align image feature vectors with the language model's embedding space.
+ ---
+ ## 🎯 Intended Architectural Uses & Scope
+ ### Recommended Core Tasks
+ * **Visual Problem-Solving:** Breaking down multi-step problems shown in an image (e.g., troubleshooting complex wiring diagrams, reading mechanical dials).
+ * **Nuanced Image-Text Analysis:** Generating dense, conceptually accurate descriptions of visual phenomena rather than superficial tags.
+ * **Complex Physics & Abstract Querying:** Responding to interleaved queries that require text extraction (OCR), deep domain-specific knowledge, and physical reasoning (e.g., electrostatic properties, mechanics).
+ ### Out-of-Scope Deployments
+ * Medical diagnostic automation without expert human verification loops.
+ * Real-time automated safety-critical processing (autonomous vehicle controls, live weapons systems).
+ * Generation of biometric verification data or high-stakes demographic filtering.
+ ---
+ ## 📊 Memory & VRAM Footprint Benchmarks
+ Because high-resolution images expand into many vision patches, unconstrained generation can cause large VRAM spikes. `tarn` addresses this by bounding the image resolution with `min_pixels` and `max_pixels`.
 
+ | Precision Level | Quantization State | Active Loading VRAM | Inference VRAM (Unbounded) | Optimized Bounded VRAM |
+ | :--- | :--- | :--- | :--- | :--- |
+ | **Float16 (`fp16`)** | None | ~4.55 GB | ~14.6 GB (OOM Risk) | **~9.83 GB (Safe for T4)** |
+ | **Int4 (`4-bit`)** | BitsAndBytes | ~1.85 GB | ~6.20 GB | **~3.95 GB** |
 
+ > 💡 **Core Recommendation:** For edge deployments or free-tier Google Colab instances (Tesla T4 GPU with 15 GB VRAM), set the image pixel limits between $256 \times 28 \times 28$ and $512 \times 28 \times 28$ pixels to keep memory use within a stable, predictable bound.
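
For reference, the two pixel bounds in the recommendation above work out to 200,704 and 401,408 pixels. The sketch below computes them and shows one way an image size could be rescaled to fit that budget while roughly preserving aspect ratio. The helper `fit_to_budget` is hypothetical and only illustrates the arithmetic; the actual image processor may round dimensions differently (for example, to patch-size multiples).

```python
import math

PATCH_AREA = 28 * 28            # factor used in the recommendation above
MIN_PIXELS = 256 * PATCH_AREA   # 200704
MAX_PIXELS = 512 * PATCH_AREA   # 401408


def fit_to_budget(width, height, min_pixels=MIN_PIXELS, max_pixels=MAX_PIXELS):
    """Return (width, height) rescaled so the area lies inside the pixel budget.

    Hypothetical helper for illustration only: it ignores any
    patch-alignment rounding that a real image processor applies.
    """
    area = width * height
    if area > max_pixels:
        # Shrink, rounding down so the area never exceeds the upper bound.
        scale = math.sqrt(max_pixels / area)
        return max(1, math.floor(width * scale)), max(1, math.floor(height * scale))
    if area < min_pixels:
        # Enlarge, rounding up so the area reaches the lower bound.
        scale = math.sqrt(min_pixels / area)
        return math.ceil(width * scale), math.ceil(height * scale)
    return width, height


if __name__ == "__main__":
    print(MIN_PIXELS, MAX_PIXELS)
    print(fit_to_budget(1920, 1080))
```

A 1920x1080 input is scaled down so that its area does not exceed `MAX_PIXELS`, while an image already inside the budget is returned unchanged.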
+ ---
+ ## 🚀 Step-by-Step Google Colab Implementation
+ To run and verify this model in a standard Google Colab environment, execute the blocks below.
+ ### 1. Environment Initialization
+ Make sure your runtime uses a GPU hardware accelerator (T4 GPU), then install the latest `transformers` from source:
+ ```bash
+ # Install transformers from source for qwen3_5 configuration support
+ pip install -q git+https://github.com/huggingface/transformers.git
+ pip install -q accelerate bitsandbytes torchvision qwen-vl-utils
+ ```
+ *Note: After installation, select Runtime -> Restart session so the newly installed packages are loaded.*
+ ### 2. Loading the Model Weights
+ ```python
+ import torch
+ from transformers import pipeline
+ model_id = "Xerv-AI/tarn"
+ print("Initializing tarn architecture pipelines...")
+ # Load the model in half precision and let accelerate place it on the GPU
+ pipe = pipeline(
+     "image-text-to-text",
+     model=model_id,
+     torch_dtype=torch.float16,
+     device_map="auto"
+ )
+ print("tarn is loaded and standing by.")
+ ```
+ ## ⚡ Streaming & Production Pipeline Setup
+ For real-time, user-facing conversational products, waiting for the full generation before displaying any text hurts the user experience. Use `TextStreamer` as shown below to stream output token by token to standard output:
+ ```python
+ from transformers import TextStreamer
+ # Attach the text streamer to the pipeline's tokenizer
+ streamer = TextStreamer(pipe.tokenizer, skip_prompt=True)
+ # Build a multimodal user message (one image plus one text prompt)
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"
+             },
+             {
+                 "type": "text",
+                 "text": "Analyze the visual artifacts present in this image and define the principles of triboelectricity."
+             }
+         ]
+     },
+ ]
+ print("=== Initiating Real-Time Telemetry Stream ===")
+ outputs = pipe(
+     text=messages,
+     max_new_tokens=1024,   # Leave room for long reasoning chains
+     min_pixels=256*28*28,  # Lower bound on image pixels
+     max_pixels=512*28*28,  # Upper bound on image pixels; caps peak VRAM
+     generate_kwargs={"streamer": streamer}
+ )
+ ```
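
When several images or prompts need to be processed, the payload layout used above can be produced by a small helper. The sketch below is a hypothetical convenience function (`build_messages` is not part of `transformers`); it only reproduces the single-turn message structure from the example, and the prompt text is a placeholder.

```python
def build_messages(image_url, prompt):
    """Build a single-turn chat payload with one image and one text prompt.

    Hypothetical helper: mirrors the message layout shown above.
    """
    if not image_url or not prompt:
        raise ValueError("both image_url and prompt are required")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "url": image_url},
                {"type": "text", "text": prompt},
            ],
        }
    ]


# Example payload reusing the image URL from the snippet above
messages = build_messages(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    "Describe the objects in this image.",
)
```

The resulting `messages` list can be passed as the `text` argument in the same way as the hand-written payload.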
+ ## 🧬 Training Topology & Data Lineage
+ tarn was trained to go beyond superficial visual question answering, using a two-stage distillation and alignment process.
 
+ ### 1. Dataset Dependencies
+ * **xerv-ai/tart (344k records):** Covers basic physics, electromagnetism, electrostatics, and everyday real-world sensory scenarios. It grounds the model's factual accuracy in these core domains.
+ * **Phase-Technologies/claude-reasoning-super (47.8k records):** Teaches the model to work through intermediate reasoning steps. Instead of outputting an immediate guess, the model structures its response using logical markdown hierarchies, self-corrections, and explicit calculations.
+ ### 2. Hyperparameter Settings
+ * **Optimizer:** AdamW (Learning Rate: $2 \times 10^{-4}$)
+ * **Weight Decay:** 0.01
+ * **LR Scheduler:** Linear warmup followed by cosine decay.
+ * **LoRA Rank ($r$):** 64
+ * **LoRA Alpha ($\alpha$):** 16
+ * **Target Modules:** `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`
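
To make the settings above concrete, the sketch below computes the LoRA scaling factor and a learning-rate curve with linear warmup followed by cosine decay. It assumes the standard LoRA scaling of alpha divided by rank and a decay to zero; the warmup length and total step count are not stated in this README, so the values used in the example call are hypothetical.

```python
import math

PEAK_LR = 2e-4      # learning rate from the list above
LORA_RANK = 64      # r
LORA_ALPHA = 16     # alpha

# Standard LoRA scales the low-rank update by alpha / r (assumed here).
LORA_SCALE = LORA_ALPHA / LORA_RANK  # 0.25


def lr_at(step, total_steps, warmup_steps, peak_lr=PEAK_LR):
    """Learning rate at `step`: linear warmup, then cosine decay to zero.

    `total_steps` and `warmup_steps` are hypothetical inputs; the README
    does not state the values used in training.
    """
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


if __name__ == "__main__":
    # Illustrative schedule: 1000 steps with 100 warmup steps (assumed values)
    for step in (0, 50, 100, 550, 1000):
        print(step, lr_at(step, total_steps=1000, warmup_steps=100))
```

With a rank of 64 and an alpha of 16, the adapter update is scaled by 0.25 under this convention, so the learned low-rank matrices contribute a quarter of their raw magnitude.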
+ ## 🛡️ Ethical Guardrails & Systemic Limitations
+ * **Hallucination Vectors:** As with all generative vision systems, compressing visual input into discrete text can cause hallucinations, especially if the image resolution is constrained too low (e.g., misreading small font sizes or highly dense numbers).
+ * **Bias Propagations:** tarn can inherit societal, technical, and taxonomic biases present in the open-source web crawl data underlying its base model.
+ * **Sycophancy Risks:** Due to alignment patterns, if a prompt aggressively asserts a falsehood (*"Why is there a dog in this picture of an ocean?"*), the model may spend its initial reasoning block trying to justify the user's premise before correcting it.
+ ## 📜 Citation & Attributions
+ ```bibtex
+ @misc{tarn2026,
+   author = {Soham Pal and {the Xerv-AI Research Team}},
+   title = {tarn: Optimized Compact Multimodal Vision-Reasoning Engine},
+   year = {2026},
+   publisher = {Hugging Face Hub},
+   howpublished = {\url{https://huggingface.co/Xerv-AI/tarn}}
+ }
+ ```
+ If you integrate tarn or your own derivatives of it into enterprise products, please attribute **Xerv-AI** accordingly. For questions or model contributions, open a pull request in the repository's community tab.