	---
	library_name: transformers
	license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507/blob/main/LICENSE
	pipeline_tag: text-generation
	tags:
	- AWQ
	- 量化修复
	- vLLM
	base_model:
	- Kwaipilot/KAT-V1-40B
	base_model_relation: quantized
	---
	# KAT-V1-40B-AWQ
	Base model: [Kwaipilot/KAT-V1-40B](https://huggingface.co/Kwaipilot/KAT-V1-40B)


	### 【vLLM Single Node with 4 GPUs Startup Command】
	```bash
	CONTEXT_LENGTH=32768

	vllm serve \
	QuantTrio/KAT-V1-40B-AWQ \
	--served-model-name KAT-V1-40B-AWQ \
	--swap-space 16 \
	--max-num-seqs 512 \
	--max-model-len $CONTEXT_LENGTH \
	--max-seq-len-to-capture $CONTEXT_LENGTH \
	--gpu-memory-utilization 0.9 \
	--tensor-parallel-size 4 \
	--trust-remote-code \
	--disable-log-requests \
	--host 0.0.0.0 \
	--port 8000
	```
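
Once the server is running, `vllm serve` exposes an OpenAI-compatible HTTP API. The sketch below shows one way to send a chat request using only the Python standard library. It assumes the server is reachable at `localhost:8000` and reuses the `--served-model-name` from the command above. The sampling values mirror the Quick Start section, while the `max_tokens` default and the helper names `build_chat_request` and `chat` are illustrative choices, not part of this repository.

```python
import json
import urllib.request

# Assumption: the server started above is reachable locally on port 8000.
BASE_URL = "http://localhost:8000/v1"
# Must match the value passed to --served-model-name.
MODEL = "KAT-V1-40B-AWQ"


def build_chat_request(prompt, max_tokens=1024):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.6,        # same sampling values as the Quick Start example
        "top_p": 0.95,
        "max_tokens": max_tokens,  # illustrative default, tune for your workload
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the prompt to the running server and return the generated text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server up, `print(chat("Give me a short introduction to large language models."))` should print the model's structured response.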

	### 【Dependencies】

	```
	vllm==0.10.0
	```

	### 【Model Update Date】
	```
	2025-07-31
	1. fast commit
	```

	### 【Model Files】
	| File Size | Last Updated |
	|-----------|--------------|
	| `22GB` | `2025-07-31` |


	### 【Model Download】

	```python
	from huggingface_hub import snapshot_download
	snapshot_download('QuantTrio/KAT-V1-40B-AWQ', cache_dir="your_local_path")
	```

	### 【Overview】
	<div align="center">
	<img src="https://raw.githubusercontent.com/Anditty/OASIS/refs/heads/main/Group.svg" width="60%" alt="Kwaipilot" />
	</div>

	<hr>

	<div align="center" style="line-height: 1;">
	<a href="https://huggingface.co/Kwaipilot/KAT-V1-40B" target="_blank">
	<img alt="Hugging Face" src="https://img.shields.io/badge/HuggingFace-fcd022?style=for-the-badge&logo=huggingface&logoColor=000&labelColor"/>
	</a>

	<a href="https://arxiv.org/pdf/2507.08297" target="_blank">
	<img alt="arXiv" src="https://img.shields.io/badge/arXiv-2507.08297-b31b1b.svg?style=for-the-badge"/>
	</a>
	</div>

	# News

	- Kwaipilot-AutoThink ranks first among all open-source models on [LiveCodeBench Pro](https://livecodebenchpro.com/), a challenging benchmark explicitly designed to prevent data leakage, and even surpasses strong proprietary systems such as Seed and o3-mini.

	***

	# Introduction

	KAT (Kwaipilot-AutoThink) is an open-source large language model that mitigates overthinking by learning when to produce an explicit chain of thought and when to answer directly.

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/zdnsvBmv6hWIC2Qxxy1fD.png)

	Its development follows a concise two-stage training pipeline:

	<table>
	<thead>
	<tr>
	<th style="text-align:left; width:18%;">Stage</th>
	<th style="text-align:left;">Core Idea</th>
	<th style="text-align:left;">Key Techniques</th>
	<th style="text-align:left;">Outcome</th>
	</tr>
	</thead>
	<tbody>
	<tr>
	<td><strong>1. Pre-training</strong></td>
	<td>Inject knowledge while separating “reasoning” from “direct answering”.</td>
	<td>
	<em>Dual-regime data</em><br>
	• <strong>Think-off</strong> queries labeled via a custom tagging system.<br>
	• <strong>Think-on</strong> queries generated by a multi-agent solver.<br><br>
	<em>Knowledge Distillation + Multi-Token Prediction</em> for fine-grained utility.
	</td>
	<td>Base model attains strong factual and reasoning skills without full-scale pre-training costs.</td>
	</tr>
	<tr>
	<td><strong>2. Post-training</strong></td>
	<td>Make reasoning optional and efficient.</td>
	<td>
	<em>Cold-start AutoThink</em> — majority vote sets the initial thinking mode.<br>
	<em>Step-SRPO</em> — intermediate supervision rewards correct <strong>mode selection</strong> and <strong>answer accuracy</strong> under that mode.
	</td>
	<td>Model triggers CoT only when beneficial, reducing token use and speeding inference.</td>
	</tr>
	</tbody>
	</table>
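
To make the cold-start step in the table more concrete, here is a minimal sketch of a majority vote over candidate mode labels. This card does not say where the votes come from or how ties are resolved, so the `votes` input, the `initial_mode` name, and the tie-break default are all assumptions made for illustration. This is not the authors' implementation.

```python
from collections import Counter


def initial_mode(votes, tie_break="think_on"):
    """Pick the initial thinking mode for one query by majority vote.

    `votes` is a list of "think_on" / "think_off" labels. How those labels
    are produced is not described in this card (assumption: several
    independent judgments per query). Ties fall back to `tie_break`,
    which is also an assumption.
    """
    counts = Counter(votes)
    on, off = counts["think_on"], counts["think_off"]
    if on == off:
        return tie_break
    return "think_on" if on > off else "think_off"
```

For example, `initial_mode(["think_off", "think_off", "think_on"])` selects `think_off` because two of the three votes favor answering directly.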

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/cwFAEh7Rl3f4FU46z8gBZ.png)


	***

	# Data Format


	KAT produces responses in a structured template that makes the reasoning path explicit and machine-parsable.
	Two modes are supported:


	![image/jpeg](https://cdn-uploads.huggingface.co/production/uploads/61ee40a269351366e29972ad/H8iAvQMMT02nyvlYnI5q1.jpeg)


	## Special Tokens

	| Token | Description |
	|-------|-------------|
	| `<judge>` | Analyzes the input to decide whether explicit reasoning is needed. |
	| `<think_on>` / `<think_off>` | Indicates whether reasoning is activated (“on”) or skipped (“off”). |
	| `<think>` | Marks the start of the chain-of-thought segment when `think_on` is chosen. |
	| `<answer>` | Marks the start of the final user-facing answer. |
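
Because the template is tag-delimited, a response can be split into its parts with a few regular expressions. The sketch below is one possible parser, not an official utility. It assumes the `</judge>` and `</answer>` closing tags seen in the Quick Start sample output, and it treats a `</think>` closing tag as optional since this card does not show one. The names `KatResponse` and `parse_kat_response` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class KatResponse:
    judge: Optional[str]   # text of the <judge> segment
    mode: Optional[str]    # "think_on", "think_off", or None if absent
    think: Optional[str]   # chain-of-thought text, None in think_off mode
    answer: Optional[str]  # final user-facing answer


def _section(tag, text, stop=None):
    """Return the text after <tag>, up to </tag>, an optional stop pattern,
    or the end of the string (covers truncated generations)."""
    end = rf"</{tag}>" + (rf"|{stop}" if stop else "") + r"|\Z"
    match = re.search(rf"<{tag}>(.*?)(?:{end})", text, flags=re.DOTALL)
    return match.group(1).strip() if match else None


def parse_kat_response(text):
    """Split a raw KAT generation into judge, mode, think, and answer parts."""
    mode_match = re.search(r"<(think_on|think_off)>", text)
    return KatResponse(
        judge=_section("judge", text),
        mode=mode_match.group(1) if mode_match else None,
        # Assumption: if </think> is missing, the segment ends at <answer>.
        think=_section("think", text, stop="<answer>"),
        answer=_section("answer", text),
    )
```

Passing the decoded model output to `parse_kat_response` gives direct access to `.answer` for display while keeping `.judge` and `.think` available for logging.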


	***

	# 🔧 Quick Start

	```python
	from transformers import AutoTokenizer, AutoModelForCausalLM

	model_name = "Kwaipilot/KAT-V1-40B"

	# load the tokenizer and the model
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto"
	)

	# prepare the model input
	prompt = "Give me a short introduction to large language model."
	messages = [
	    {"role": "user", "content": prompt}
	]
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True
	)
	model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

	# conduct text completion
	generated_ids = model.generate(
	    **model_inputs,
	    max_new_tokens=65536,
	    temperature=0.6,
	    top_p=0.95,
	)
	output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
	content = tokenizer.decode(output_ids, skip_special_tokens=True).strip("\n")
	print("prompt:\n", prompt)
	print("content:\n", content)
	"""
	prompt:
	Give me a short introduction to large language model.
	content:
	<judge>
	The user's request is to provide a concise factual introduction to large language models, which involves retrieving and summarizing basic information. This task is straightforward as it only requires recalling and presenting well-known details without deeper analysis. No complex reasoning is needed here—just a simple explanation will suffice.
	</judge>

	<think_off>
	<answer>
	A Large Language Model (LLM) is an advanced AI system trained on vast amounts of text data to understand, generate, and process human-like language. Here’s a concise introduction:

	### Key Points:
	1. Training: Trained on diverse text sources (books, websites, etc.) using deep learning.
	2. Capabilities:
	- Answer questions, generate text, summarize content, translate languages.
	- Understand context, sentiment, and nuances in language.
	3. Architecture: Often based on transformer models (e.g., BERT, GPT, LLaMA).
	4. Scale: Billions of parameters, requiring massive computational resources.
	5. Applications: Chatbots, content creation, coding assistance, research, and more.

	### Examples:
	- OpenAI’s GPT-4: Powers ChatGPT.
	- Google’s Gemini: Used in Bard.
	- Meta’s LLaMA: Open-source alternative.

	### Challenges:
	- Bias: Can reflect biases in training data.
	- Accuracy: May hallucinate "facts" not grounded in reality.
	- Ethics: Raises concerns about misinformation and job displacement.

	LLMs represent a leap forward in natural language processing, enabling machines to interact with humans in increasingly sophisticated ways. 🌐🤖
	</answer>
	"""
	```

	***

	# Future Releases

	Looking ahead, we will publish a companion paper that fully documents the AutoThink training framework, covering:

	* Cold-start initialization procedures
	* Reinforcement-learning (Step-SRPO) strategies
	* Data curation and reward design details

	At the same time, we will open-source:

	* Training resources – the curated dual-regime datasets and RL codebase
	* Model suite – checkpoints at 1.5B, 7B, and 13B parameters, all trained with AutoThink gating