Create README.md

e3d1326 verified 9 days ago

5.9 kB

	---
	license: mit
	---
	# DeepSeek-V3.2-Retro

	This repository hosts the model weights for DeepSeek-V3.2-Retro. For instructions and details, please refer to the [GitHub](https://github.com/zhejianglab/DeepSeek-V3.2-Retro).

	## 1. Introduction
	[DeepSeek-V3.2](https://huggingface.co/deepseek-ai/DeepSeek-V3.2)
	introduces the DeepSeek Sparse Attention (DSA) architecture, representing a significant architectural evolution over [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3) and [DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1). However, as of now, an official open-source implementation compatible with Ampere-series GPUs has not been released.

	To address this gap, we introduce DeepSeek-V3.2-Retro, targeting the following user groups:

	- Ampere GPU users who do not have access to Hopper or Blackwell architectures.
	- Users of general-purpose GPU platforms where DSA is not yet supported.

	Key features of DeepSeek-V3.2-Retro include:

	- Removal of the DSA modules from the original V3.2 architecture.
	- Conversion of model parameters and computation to the BF16 data format.
	- Broad Compatibility: runs on any hardware platform that supports the V3 architecture.
	- Validated Performance: achieves performance on multiple benchmarks that is close to the [officially reported results](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf).

	## 2. Performance Evaluation
	As our primary target scenario is reasoning-oriented usage, we report accuracy results on several representative benchmarks after enabling the thinking feature. All evaluation metrics are taken from the corresponding official technical reports for consistency.

	<div align="center">

	\| Benchmark \| [DeepSeek-V3.2-Retro](https://github.com/zhejianglab/DeepSeek-V3.2-Retro) \| [DeepSeek-V3.2-Thinking](https://huggingface.co/deepseek-ai/DeepSeek-V3.2/blob/main/assets/paper.pdf) \|
	\| :---: \| :---: \| :---: \|
	\| MMLU-Pro \| 86.4 \| 85.0 \|
	\| GPQA Diamond \| 82.12 \| 82.4 \|
	\| AIME 2025 \| 93.67 \| 93.1 \|
	\| LiveCodeBench \| 80.72 \| 83.3 \|

	</div>

	In addition, we evaluate inference efficiency. Using SGLang v0.5.6 under identical settings, we observe that the throughput of DeepSeek-V3.2-Retro is on par with DeepSeek-V3.1. Output throughput is reported in tokens/s.

	<div align="center">

	\| Model \| Output Throughput (qps=512, input=1k, output=10k) \|
	\| :---: \| :---: \|
	\| [DeepSeek-V3.2-Retro](https://github.com/zhejianglab/DeepSeek-V3.2-Retro) \| 2510.27 \|
	\| [DeepSeek-V3.1](https://huggingface.co/deepseek-ai/DeepSeek-V3.1) \| 2515.34 \|

	</div>

	These results indicate that removing the DSA structure and reverting to a V3-compatible architecture does not introduce noticeable performance regression in either reasoning accuracy or inference throughput on Ampere-class hardware.

	## 3. Model Download
	DeepSeek-V3.2-Retro model is available for download from [Hugging Face](https://huggingface.co/ZhejiangLab/DeepSeek-V3.2-Retro) and [ModelScope](https://modelscope.cn/models/zhejianglab/DeepSeek-V3.2-Retro). Please ensure that you have at least 1.5 TB of available disk space before downloading the model.

	<div align="center">

	\| Model \| Total Params \| Hugging Face \| ModelScope \|
	\|:---------:\|:----------------:\|:----------------:\|:--------------:\|
	\| DeepSeek-V3.2-Retro \| 684 B \| [🤗 Hugging Face](https://huggingface.co/ZhejiangLab/DeepSeek-V3.2-Retro) \|[🤖 ModelScope](https://modelscope.cn/models/zhejianglab/DeepSeek-V3.2-Retro) \|

	</div>

	## 4. Quickstart

	We strongly recommend using SGLang for efficient inference of the DeepSeek series models. We provide example configurations for SGLang serving on four A100*8 nodes.

	### SGLang

	#### Using Docker (Recommended)

	```docker
	# Pull latest image on four nodes and ensure RDMA network connectivity between the 4 nodes.
	# https://hub.docker.com/r/lmsysorg/sglang/tags
	docker pull lmsysorg/sglang:latest
	```

	#### Launch Command

	```python
	# For high QPS scenarios, add --enable-dp-attention and --ep-size arguments to boost throughput, and use mtp to boost decoding speed.
	# node 1
	python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3.2-Retro --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 30000 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

	# node 2
	python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3.2-Retro --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 1 --trust-remote-code --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

	# node 3
	python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3.2-Retro --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 2 --trust-remote-code --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head

	# node 4
	python3 -m sglang.launch_server --model-path /path/to/DeepSeek-V3.2-Retro --tp 32 --dist-init-addr 10.0.0.1:5000 --nnodes 4 --node-rank 3 --trust-remote-code --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --enable-dp-attention --dp 8 --ep-size 32 --enable-dp-lm-head
	```

	## 5. License
	This repository and the model weights are licensed under the MIT License, following the license of DeepSeek-V3.2. In addition, if you use DeepSeek-V3.2, you shall also comply with the terms and conditions of DeepSeek-V3.2.

	## 6. Contact
	If you have any questions, please raise an [issue](https://github.com/zhejianglab/DeepSeek-V3.2-Retro/issues) or contact us at opensource@zhejianglab.org.