rezanaltjetlink committed on Apr 14

Commit 399a42c (verified) · 1 Parent(s): fdd95c2

Upload README.md

README.md ADDED (+680 -0)

	@@ -0,0 +1,680 @@

+---
+license: apache-2.0
+library_name: transformers
+tags:
+  - qwen
+  - multimodal
+  - moe
+  - vision-language
+  - conversational
+  - transformers
+  - vllm
+  - sglang
+  - ktransformers
+  - function-calling
+  - reasoning
+pipeline_tag: image-text-to-text
+base_model: Qwen/Qwen3.5-122B-A10B
+---
+# JetLLMPlus-3.5
+**JetLLMPlus-3.5** is a multimodal Mixture-of-Experts model published by **Jetlink**.
+It is intended for teams that want to manage deployment, access, and internal distribution from their own namespace while preserving compatibility with the original upstream model ecosystem.
+## Model Summary
+JetLLMPlus-3.5 is a 122B total / 10B active parameter multimodal MoE model with:
+- **122B total parameters, 10B activated per token**
+- **Causal Language Model with Vision Encoder**
+- **Hybrid architecture: Gated DeltaNet (36 layers) + Full Attention (12 layers) + Sparse MoE**
+- **256 routed experts + 1 shared expert per layer**
+- **262,144 tokens native context length**
+- **Extensible context up to 1,010,000 tokens via YaRN**
+- **Support for 201 languages and dialects**
+- Compatibility with **Transformers**, **vLLM**, **SGLang**, and **KTransformers**
+## Intended Use
+This model is suitable for advanced workloads such as:
+- multimodal chat assistants
+- long-context document and PDF understanding
+- OCR, chart comprehension, and document extraction pipelines
+- reasoning and step-by-step problem solving
+- agentic workflows with function calling
+- coding assistants and code generation
+- GUI automation and screen understanding
+- multilingual enterprise assistants
+- research and benchmarking
+## Model Details
+### Architecture
+- **Model type:** Causal Language Model with Vision Encoder
+- **Training stage:** Pre-training & Post-training
+- **Total parameters:** 122B
+- **Activated parameters:** 10B per token
+- **Hidden dimension:** 3,072
+- **Number of layers:** 48 (36 GatedDeltaNet linear attention + 12 full attention)
+- **MoE experts:** 256 routed + 1 shared per layer
+- **Activated experts:** 8 routed + 1 shared
+- **Expert FFN dimension:** 1,024
+- **Vocabulary size:** 248,320
+- **Native context length:** 262,144 tokens
+- **Extended context capability:** up to 1,010,000 tokens via YaRN
+### Architecture Note: Hybrid Attention (GatedDeltaNet + MoE)
+JetLLMPlus-3.5 uses a novel hybrid attention design unique to the Qwen3.5 architecture. Unlike standard transformer MoE models, it combines:
+- **GatedDeltaNet linear attention** (36 out of 48 layers) for efficient long-context processing with sub-quadratic complexity
+- **Full global attention** (12 layers) for high-quality token interactions
+- **Sparse MoE** routing in feed-forward layers for parameter efficiency
+This design delivers high-throughput inference with significantly lower latency than pure full-attention models of comparable total parameter count.
+> ⚠️ **Deployment note:** The GatedDeltaNet layers impose additional constraints compared to standard MoE models. When serving with SGLang, `--attention-backend triton` and `--kv-cache-dtype bf16` are required. FP8 KV cache is not recommended due to potential output corruption on this architecture. CUDA graph and HiCache (prefix caching) are currently incompatible with DeltaNet layers, which is why the SGLang commands in this README also pass `--disable-cuda-graph` and `--disable-radix-cache`.
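The layer and expert counts above imply a few simple ratios. The sketch below recomputes them and lays out a 48-layer schedule. The interleaving pattern (one full-attention layer after every three GatedDeltaNet layers) is an assumption made for illustration, since the card only states the 36/12 split, and `layer_schedule` is a hypothetical helper.

```python
# Figures taken from the model card.
TOTAL_LAYERS = 48
FULL_ATTENTION_LAYERS = 12
ROUTED_EXPERTS = 256
ACTIVE_ROUTED = 8
SHARED_EXPERTS = 1
TOTAL_PARAMS_B = 122
ACTIVE_PARAMS_B = 10


def layer_schedule(total=TOTAL_LAYERS, full=FULL_ATTENTION_LAYERS):
    """Return a list of layer types, assuming evenly interleaved full-attention layers."""
    period = total // full  # 4: three linear-attention layers, then one full-attention layer
    return [
        "full_attention" if (i + 1) % period == 0 else "gated_deltanet"
        for i in range(total)
    ]


schedule = layer_schedule()
# Share of experts consulted per token in each MoE layer (routed + shared).
active_expert_fraction = (ACTIVE_ROUTED + SHARED_EXPERTS) / (ROUTED_EXPERTS + SHARED_EXPERTS)
# Share of all parameters that participate in a single forward step.
active_param_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B

print(schedule.count("gated_deltanet"), schedule.count("full_attention"))
print(f"experts used per token: {active_expert_fraction:.1%}")
print(f"parameters used per token: {active_param_fraction:.1%}")
```

Only the counts and ratios come from the card; the exact position of each full-attention layer should be read from the model configuration rather than from this sketch.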
+### Ecosystem Compatibility
+- Hugging Face Transformers
+- vLLM
+- SGLang
+- KTransformers
+## Hardware Requirements
+> JetLLMPlus-3.5 sits between the lightweight 35B-A3B and the flagship 397B-A17B, requiring multi-GPU infrastructure at full precision but manageable on 2–4 datacenter GPUs.
+### Reference Hardware
+Approximate GPU memory requirements:
+- **Unquantized (BF16):** ~244GB VRAM, so 4× A100 80GB or equivalent (3× 80GB provides only 240GB)
+- **FP8:** ~127GB, so 2× A100 80GB or equivalent
+- **GPTQ-Int4:** ~79GB, so 1× H100 80GB or 2× A100 40GB (either option leaves very little headroom for KV cache)
+- **Multi-GPU:** tensor parallelism is recommended (`--tensor-parallel-size` in vLLM, `--tp-size` in SGLang; typically 4 or 8)
+> Note: requirements vary significantly based on context length, KV cache settings, and batch size. FP8 KV cache should be avoided for this model due to DeltaNet architecture constraints — use BF16 KV.
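The figures above can be sanity-checked with a weight-only estimate: parameter count times bytes per parameter. The sketch below is a rough calculation under that assumption. It ignores KV cache, activations, and any tensors kept in higher precision, so the card's FP8 and Int4 figures come out higher than these lower bounds. `weight_gb` and `min_gpus` are hypothetical helpers.

```python
import math

PARAMS = 122e9  # total parameters, from the model card

# Bytes stored per parameter for each weight format.
BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_gb(params, precision):
    """Weight-only footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return params * BYTES_PER_PARAM[precision] / 1e9


def min_gpus(required_gb, gpu_gb=80):
    """Smallest GPU count whose combined memory covers required_gb."""
    return math.ceil(required_gb / gpu_gb)


for precision in BYTES_PER_PARAM:
    print(precision, weight_gb(PARAMS, precision))

# GPU counts for the card's reported totals, assuming 80GB devices.
print(min_gpus(244), min_gpus(127), min_gpus(79))
```

Under this estimate the BF16 weights alone need four 80GB GPUs, since three provide only 240GB.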
+### Recommendation
+For most production teams:
+1. use **FP8 weights + BF16 KV** for the best balance of memory and quality
+2. use **GPTQ-Int4** for single-GPU or memory-constrained deployments
+3. enable **MTP (Multi-Token Prediction)** for the highest throughput gains — this is the primary optimization path for this model's architecture
+4. use vLLM's `--language-model-only` flag when vision is not needed, to free KV cache memory
+## Software Requirements
+Recommended environment:
+- Python 3.10+
+- Linux
+- CUDA-enabled GPU infrastructure
+- One of the following runtimes:
+  - Transformers (latest from `main` branch)
+  - vLLM
+  - SGLang
+  - KTransformers
+Common dependencies:
+- `torch`
+- `transformers`
+- `torchvision`
+- `pillow`
+- `accelerate`
+## Quickstart
+Install Transformers:
+    pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"
+### Basic text inference
+    import torch
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+    model_id = "Jetlink/JetLLMPlus-3.5"
+    processor = AutoProcessor.from_pretrained(model_id)
+    # device_map="auto" spreads the weights across all visible GPUs.
+    model = AutoModelForImageTextToText.from_pretrained(
+        model_id,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+    )
+    messages = [
+        {"role": "user", "content": [{"type": "text", "text": "Explain the difference between MoE and dense models."}]}
+    ]
+    # return_dict=True returns a mapping (input_ids, attention_mask, ...) that can be unpacked into generate().
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+    ).to(model.device)
+    output = model.generate(**inputs, max_new_tokens=512)
+    # output[0] contains the prompt followed by the generated tokens.
+    print(processor.decode(output[0], skip_special_tokens=True))
+### Thinking mode (deep reasoning)
+Enable step-by-step reasoning with `enable_thinking=True`:
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        enable_thinking=True,
+    ).to(model.device)
+### Non-thinking mode (direct response)
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        enable_thinking=False,
+    ).to(model.device)
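In thinking mode the decoded text usually contains the reasoning followed by the final answer. The helper below is a minimal sketch for separating the two. It assumes the reasoning ends with a `</think>` tag, as in earlier Qwen3 chat templates. Both that tag and the `split_thinking` name are assumptions, not something this card specifies.

```python
def split_thinking(decoded, close_tag="</think>", open_tag="<think>"):
    """Split decoded output into (reasoning, answer).

    If close_tag is absent, the whole text is treated as the answer.
    """
    head, sep, tail = decoded.partition(close_tag)
    if not sep:
        return "", decoded.strip()
    # Drop the opening tag if the template emitted one.
    reasoning = head.replace(open_tag, "", 1).strip()
    return reasoning, tail.strip()


sample = (
    "<think>\nMoE routes each token to a few experts.\n</think>\n\n"
    "MoE models activate only part of their weights."
)
reasoning, answer = split_thinking(sample)
print(reasoning)
print(answer)
```

When the model is served with a reasoning parser, the server may already perform this separation, in which case the helper is unnecessary.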
+## Serving Examples
+### vLLM
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3
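Once the server is running it can be queried through its OpenAI-compatible chat completions endpoint. The following is a minimal client sketch using only the standard library. It assumes the server is reachable at `http://localhost:8000` (matching `--port 8000` above), and `build_payload` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: matches --port 8000 above
MODEL = "Jetlink/JetLLMPlus-3.5"


def build_payload(prompt, image_url=None, max_tokens=256):
    """Build an OpenAI-style chat completion request body."""
    content = [{"type": "text", "text": prompt}]
    if image_url is not None:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
    }


def chat(prompt, image_url=None):
    """POST the request to the running server and return the reply text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt, image_url)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Only the payload is built here; call chat(...) once the server is up.
    print(json.dumps(build_payload("Describe this model in one sentence."), indent=2))
```

Passing `image_url` adds an image part to the same user message, which is how multimodal requests are expressed in this API format.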
+### vLLM with Tool Use
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --enable-auto-tool-choice \
+      --tool-call-parser qwen3_coder
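With automatic tool choice enabled, a request can carry a `tools` list in the OpenAI function-calling format, and the reply may contain `tool_calls`. The sketch below builds such a request body and parses a reply of that shape. The `get_weather` tool, the helper names, and the example assistant message are all hypothetical and shown only to illustrate the data layout.

```python
import json


def make_tool(name, description, properties, required):
    """Describe one callable function in OpenAI tool format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


# Hypothetical tool used only for illustration.
weather_tool = make_tool(
    "get_weather",
    "Return the current weather for a city.",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)

request_body = {
    "model": "Jetlink/JetLLMPlus-3.5",
    "messages": [{"role": "user", "content": "What is the weather in Istanbul?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}


def parse_tool_calls(message):
    """Extract (name, arguments) pairs from an assistant message dict."""
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        # Arguments arrive as a JSON-encoded string.
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written example of an assistant message in the OpenAI-compatible shape.
example_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [
        {
            "id": "call_0",
            "type": "function",
            "function": {"name": "get_weather", "arguments": '{"city": "Istanbul"}'},
        }
    ],
}
print(parse_tool_calls(example_message))
```

Validating the parsed arguments before executing any tool is advisable, since the model produces them as free-form JSON.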
+### vLLM with MTP (Multi-Token Prediction)
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
+### vLLM text-only mode
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --language-model-only
+### SGLang
+> ⚠️ DeltaNet layers require additional flags. Use `--attention-backend triton` and `--kv-cache-dtype bf16`.
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache
+### SGLang with Tool Use
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --tool-call-parser qwen3_coder \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache
+### SGLang with Multi-Token Prediction (MTP)
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache \
+      --speculative-algo NEXTN \
+      --speculative-num-steps 3 \
+      --speculative-eagle-topk 1 \
+      --speculative-num-draft-tokens 4
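The three SGLang commands above share the same DeltaNet-related flags. A small launcher helper can keep them in one place and refuse overrides that contradict the card's guidance. This is a sketch, not part of SGLang. `build_sglang_command` is a hypothetical helper, and it uses only flags that already appear in the commands above.

```python
import shlex

# Flags the card lists for serving the DeltaNet layers with SGLang.
DELTANET_FLAGS = [
    "--attention-backend", "triton",
    "--kv-cache-dtype", "bf16",
    "--disable-cuda-graph",
    "--disable-radix-cache",
]


def build_sglang_command(tp_size=4, port=8000, context_length=262144,
                         tool_call_parser=None, extra=()):
    """Assemble the launch command as an argument list."""
    extra = list(extra)
    # The card advises against FP8 KV cache, so the dtype is not overridable here.
    if "--kv-cache-dtype" in extra:
        raise ValueError("KV cache dtype is pinned to bf16 for this model")
    args = [
        "python", "-m", "sglang.launch_server",
        "--model-path", "Jetlink/JetLLMPlus-3.5",
        "--port", str(port),
        "--tp-size", str(tp_size),
        "--mem-fraction-static", "0.80",
        "--context-length", str(context_length),
        "--reasoning-parser", "qwen3",
    ]
    if tool_call_parser is not None:
        args += ["--tool-call-parser", tool_call_parser]
    return args + DELTANET_FLAGS + extra


command = build_sglang_command(tool_call_parser="qwen3_coder")
print(shlex.join(command))
```

The returned list can be passed directly to `subprocess.run`, and the MTP flags from the last command can be supplied through `extra`.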
+## Long Context Notes
+JetLLMPlus-3.5 natively supports **262,144 tokens**.
+For tasks exceeding this window, the upstream documentation recommends YaRN-based long-context scaling, supported in Transformers, vLLM, KTransformers, and SGLang, extending context up to **1,010,000 tokens**.
+The hybrid GatedDeltaNet + full-attention architecture provides sub-quadratic scaling for long-context inputs on the linear attention layers, making long-context processing more efficient than pure full-attention models of similar scale.
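The scaling factor implied by these two context lengths is simple arithmetic. The sketch below computes the minimum ratio and rounds it up to a whole number. The `rope_scaling` key names follow the convention used by other YaRN-enabled configurations and are an assumption here, so the upstream documentation should be checked before applying them.

```python
import math

NATIVE_CONTEXT = 262_144    # from the model card
TARGET_CONTEXT = 1_010_000  # from the model card


def min_yarn_factor(target, native=NATIVE_CONTEXT):
    """Smallest scaling factor that stretches the native window to the target."""
    return target / native


factor = min_yarn_factor(TARGET_CONTEXT)
rounded = float(math.ceil(factor))

# Assumption: key names mirror the rope_scaling convention of other YaRN configs.
rope_scaling = {
    "rope_type": "yarn",
    "factor": rounded,
    "original_max_position_embeddings": NATIVE_CONTEXT,
}
print(round(factor, 3), rope_scaling)
```

Rounding up to 4.0 covers 1,048,576 positions, slightly more than the 1,010,000 tokens quoted above.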
+## Strengths
+- very strong knowledge and vision benchmarks in the open-weight mid-tier class
+- best-in-class document understanding (OCRBench 92.1, OmniDocBench 89.8)
+- leading function calling performance in the Qwen3.5 lineup (BFCL-V4 72.2)
+- strong GUI and screen automation capabilities (ScreenSpot Pro 70.4)
+- highly efficient inference thanks to MoE — only 10B parameters activate per token
+- hybrid DeltaNet attention for efficient long-context processing
+- 262K native context, extensible to 1M via YaRN
+- support for 201 languages and dialects
+- Apache 2.0 license
+## Limitations
+- all 122B parameters must reside in memory, regardless of how few are active per token
+- GatedDeltaNet layers impose framework-specific constraints (no FP8 KV, no CUDA graph, no prefix caching in SGLang)
+- multi-GPU deployment required for unquantized serving
+- long context significantly increases KV cache memory pressure
+- multimodal usage adds further overhead
+- deployment characteristics vary significantly by framework and configuration
+## Out-of-Scope / Cautionary Use
+As with other frontier-scale multimodal language models, outputs should be reviewed before use in:
+- medical decision-making
+- legal advice
+- safety-critical automation
+- high-stakes financial decisions
+- fully autonomous customer actions without guardrails
+Human review, policy controls, and tool-level validation are strongly recommended.
+## License
+This repository follows the same license as the upstream release.
+- **License:** Apache-2.0
+- See the upstream Qwen repository and included license text for the governing terms.
+If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.
+## Attribution
+Original model and research release by the **Qwen** team.
+Upstream model:
+- `Qwen/Qwen3.5-122B-A10B`
+This repository is an organization-managed copy and is **not the original upstream source**.
+## Citation
+Please cite the original Qwen release when using this model in research, evaluation, or production documentation.
+```bibtex
+@misc{qwen3.5,
+  title        = {Qwen3.5 Technical Report},
+  author       = {Qwen Team},
+  year         = {2026},
+  publisher    = {Alibaba Cloud},
+  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}}
+}
+```
+---
+# JetLLMPlus-3.5 (Turkish)
+**JetLLMPlus-3.5** is a multimodal Mixture-of-Experts model published by **Jetlink**.
+This repository is intended for teams that want to manage the model under their own namespace, control access, and simplify distribution.
+## Model Summary
+JetLLMPlus-3.5 is a multimodal MoE model with 122B total parameters that activates 10B parameters per token:
+- **122B total parameters, 10B active per token**
+- **Causal Language Model with Vision Encoder**
+- **Hybrid architecture: Gated DeltaNet (36 layers) + Full Attention (12 layers) + Sparse MoE**
+- **256 routed experts + 1 shared expert per layer**
+- **262,144 tokens native context length**
+- **Context extensible up to 1,010,000 tokens via YaRN**
+- **Support for 201 languages and dialects**
+- Compatibility with **Transformers**, **vLLM**, **SGLang**, and **KTransformers**
+## Intended Use
+This model is suitable for advanced use cases such as:
+- multimodal chat assistants
+- long-context document and PDF understanding
+- OCR, chart comprehension, and document extraction pipelines
+- step-by-step reasoning and problem solving
+- agentic workflows with function calling
+- coding assistants and code generation
+- GUI automation and screen understanding
+- multilingual enterprise assistants
+- research and benchmarking
+## Model Details
+### Architecture
+- **Model type:** Causal Language Model with Vision Encoder
+- **Training stage:** Pre-training and Post-training
+- **Total parameters:** 122B
+- **Active parameters:** 10B per token
+- **Hidden dimension:** 3,072
+- **Number of layers:** 48 (36 GatedDeltaNet linear attention + 12 full attention)
+- **MoE experts:** 256 routed + 1 shared per layer
+- **Active experts:** 8 routed + 1 shared
+- **Expert FFN dimension:** 1,024
+- **Vocabulary size:** 248,320
+- **Native context length:** 262,144 tokens
+- **Extended context capability:** up to 1,010,000 tokens via YaRN
+### Architecture Note: Hybrid Attention (GatedDeltaNet + MoE)
+JetLLMPlus-3.5 uses a novel hybrid attention design specific to the Qwen3.5 architecture. Unlike standard transformer MoE models, it combines:
+- **GatedDeltaNet linear attention** (36 of 48 layers): efficient long-context processing with sub-quadratic complexity
+- **Full global attention** (12 layers): high-quality token interactions
+- **Sparse MoE** routing: in the feed-forward layers, for parameter efficiency
+This design delivers high-throughput inference with much lower latency than full-attention models of comparable total parameter count.
+> ⚠️ **Deployment note:** The GatedDeltaNet layers impose additional constraints compared to standard MoE models. When serving with SGLang, `--attention-backend triton` and `--kv-cache-dtype bf16` are required. FP8 KV cache is not recommended because it can corrupt output on this architecture. CUDA graph and HiCache (prefix caching) must be disabled because they are incompatible with the DeltaNet layers.
+### Ecosystem Compatibility
+- Hugging Face Transformers
+- vLLM
+- SGLang
+- KTransformers
+## Hardware Requirements
+> JetLLMPlus-3.5 sits between the lightweight 35B-A3B and the flagship 397B-A17B. It requires multi-GPU infrastructure at full precision but is manageable on 2–4 datacenter GPUs.
+### Reference Hardware
+Approximate GPU memory requirements:
+- **Unquantized (BF16):** ~244GB VRAM, so 4× A100 80GB or equivalent (3× 80GB provides only 240GB)
+- **FP8:** ~127GB, so 2× A100 80GB or equivalent
+- **GPTQ-Int4:** ~79GB, so 1× H100 80GB or 2× A100 40GB (either option leaves very little headroom for KV cache)
+- **Multi-GPU:** tensor parallelism is recommended (`--tensor-parallel-size` in vLLM, `--tp-size` in SGLang; typically 4 or 8)
+> Note: Requirements vary significantly with context length, KV cache settings, and batch size. FP8 KV cache is not recommended for this model because of DeltaNet architecture constraints; use BF16 KV.
+### Recommendation
+For most production teams, the most sensible approach is to:
+1. use **FP8 weights + BF16 KV** for the best balance of memory and quality
+2. use **GPTQ-Int4** for single-GPU or memory-constrained deployments
+3. enable **MTP (Multi-Token Prediction)** for the highest throughput gains; this is the primary optimization path for this model's architecture
+4. use vLLM's `--language-model-only` flag when vision is not needed, to free KV cache memory
+## Software Requirements
+Recommended environment:
+- Python 3.10+
+- Linux
+- CUDA-enabled GPU infrastructure
+- One of the following runtimes:
+  - Transformers (latest from `main` branch)
+  - vLLM
+  - SGLang
+  - KTransformers
+Common dependencies:
+- `torch`
+- `transformers`
+- `torchvision`
+- `pillow`
+- `accelerate`
+## Quickstart
+Install Transformers:
+    pip install "transformers[serving] @ git+https://github.com/huggingface/transformers.git@main"
+### Basic text inference
+    import torch
+    from transformers import AutoModelForImageTextToText, AutoProcessor
+    model_id = "Jetlink/JetLLMPlus-3.5"
+    processor = AutoProcessor.from_pretrained(model_id)
+    # device_map="auto" spreads the weights across all visible GPUs.
+    model = AutoModelForImageTextToText.from_pretrained(
+        model_id,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+    )
+    messages = [
+        {"role": "user", "content": [{"type": "text", "text": "Explain the difference between MoE and dense models."}]}
+    ]
+    # return_dict=True returns a mapping (input_ids, attention_mask, ...) that can be unpacked into generate().
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+    ).to(model.device)
+    output = model.generate(**inputs, max_new_tokens=512)
+    # output[0] contains the prompt followed by the generated tokens.
+    print(processor.decode(output[0], skip_special_tokens=True))
+### Thinking mode (deep reasoning)
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        enable_thinking=True,
+    ).to(model.device)
+### Non-thinking mode (direct response)
+    inputs = processor.apply_chat_template(
+        messages,
+        add_generation_prompt=True,
+        tokenize=True,
+        return_dict=True,
+        return_tensors="pt",
+        enable_thinking=False,
+    ).to(model.device)
+## Serving Examples
+### vLLM
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3
+### vLLM with Tool Use
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --enable-auto-tool-choice \
+      --tool-call-parser qwen3_coder
+### vLLM with MTP (Multi-Token Prediction)
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
+### vLLM text-only mode
+    vllm serve Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tensor-parallel-size 4 \
+      --max-model-len 262144 \
+      --reasoning-parser qwen3 \
+      --language-model-only
+### SGLang
+> ⚠️ DeltaNet layers require additional flags. `--attention-backend triton` and `--kv-cache-dtype bf16` are required.
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache
+### SGLang with Tool Use
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --tool-call-parser qwen3_coder \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache
+### SGLang with Multi-Token Prediction (MTP)
+    python -m sglang.launch_server \
+      --model-path Jetlink/JetLLMPlus-3.5 \
+      --port 8000 \
+      --tp-size 4 \
+      --mem-fraction-static 0.80 \
+      --context-length 262144 \
+      --reasoning-parser qwen3 \
+      --attention-backend triton \
+      --kv-cache-dtype bf16 \
+      --disable-cuda-graph \
+      --disable-radix-cache \
+      --speculative-algo NEXTN \
+      --speculative-num-steps 3 \
+      --speculative-eagle-topk 1 \
+      --speculative-num-draft-tokens 4
+## Long Context Notes
+JetLLMPlus-3.5 natively supports **262,144 tokens**.
+For tasks exceeding this window, context can be extended up to **1,010,000 tokens** with YaRN-based long-context scaling, which is supported in Transformers, vLLM, KTransformers, and SGLang.
+The hybrid GatedDeltaNet + full-attention architecture provides sub-quadratic scaling for long-context inputs on the linear attention layers, making long-context processing more efficient than pure full-attention models of similar scale.
+## Strengths
+- very strong knowledge and vision benchmarks in the open-weight mid-tier class
+- best-in-class document understanding (OCRBench 92.1, OmniDocBench 89.8)
+- leading function calling performance in the Qwen3.5 lineup (BFCL-V4 72.2)
+- strong GUI and screen automation capabilities (ScreenSpot Pro 70.4)
+- highly efficient inference thanks to MoE: only 10B parameters are activated per token
+- hybrid DeltaNet attention for efficient long-context processing
+- 262K native context, extensible to 1M via YaRN
+- support for 201 languages
+- Apache 2.0 license
+## Limitations
+- all 122B parameters must reside in memory, regardless of how few are active per token
+- GatedDeltaNet layers impose framework-specific constraints (no FP8 KV, no CUDA graph, no prefix caching in SGLang)
+- multi-GPU deployment is required for unquantized serving
+- long context significantly increases KV cache memory pressure
+- multimodal usage adds further overhead
+- deployment characteristics vary significantly by framework and configuration
+## Out-of-Scope / Cautionary Use
+As with other frontier-scale multimodal language models, model outputs should not be used without human oversight in:
+- medical decision-making
+- legal advice
+- safety-critical automation
+- high-stakes financial decisions
+- fully autonomous customer actions without guardrails
+Human review, policy controls, and tool-level validation are strongly recommended.
+## License
+This repository follows the same license as the upstream release.
+- **License:** Apache-2.0
+- See the upstream Qwen repository and license text for the governing terms.
+If you redistribute, fine-tune, quantize, or otherwise modify this model, make sure your usage remains compliant with the upstream license and attribution requirements.
+## Attribution
+The original model and research release belong to the **Qwen** team.
+Upstream model:
+- `Qwen/Qwen3.5-122B-A10B`
+This repository is an organization-managed copy and is **not the original upstream source**.
+## Citation
+If you use this model in research, evaluation, or production documentation, please cite the original Qwen release.
+```bibtex
+@misc{qwen3.5,
+  title        = {Qwen3.5 Technical Report},
+  author       = {Qwen Team},
+  year         = {2026},
+  publisher    = {Alibaba Cloud},
+  howpublished = {\url{https://huggingface.co/Qwen/Qwen3.5-122B-A10B}}
+}
+```