DeepSeek V4 Flash dsv4_int INT4/INT8
This checkpoint is experimental and under active development. It is intended for the AppMana Ampere vLLM fork and is not a general-purpose Hugging Face Transformers checkpoint.
Generated on 2026-05-11 from the clean source checkpoint:
deepseek-ai/DeepSeek-V4-Flash@fd53f944496234770ba80e15004f9b6d269a71f5
Conversion command:
```bash
CUDA_VISIBLE_DEVICES=1 python tools/ampere/dsv4_requant_checkpoint.py \
  --src /home/administrator/inference/.cache/huggingface/models--deepseek-ai--DeepSeek-V4-Flash/snapshots/fd53f944496234770ba80e15004f9b6d269a71f5 \
  --dst /home/administrator/inference/deepseek-v4-flash-dsv4-int-channel-vllm \
  --device cuda:0 \
  --dense-int8-strategy channel \
  --overwrite
```
Quantization format (a minimal sketch of both schemes follows the list):
- Routed MoE experts: MXFP4 source weights converted to packed symmetric INT4 W4A16, group size 32, for Marlin MoE.
- Dense FP8 linears: channelwise biased UINT8 W8A16 format for the Ampere AllSpark path where supported.
- Preserved precision: embeddings, norms, gates, attention sinks, HC tensors, and other tensors marked BF16/F32 in the source.
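The sketch below illustrates the two schemes named above: symmetric group-wise INT4 and channelwise biased (zero-point) UINT8. The shapes and the skipped packing step are illustrative assumptions, not the exact on-disk format emitted by dsv4_requant_checkpoint.py or consumed by the Marlin/AllSpark kernels.

```python
import torch

def quantize_int4_symmetric_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric INT4 quantization with one scale per group of 32 input features (W4A16)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    g = w.reshape(out_features, in_features // group_size, group_size)
    scale = g.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0   # symmetric range [-7, 7]
    q = torch.clamp(torch.round(g / scale), -8, 7).to(torch.int8)
    # Real checkpoints pack eight 4-bit values into each int32 word; that step is omitted here.
    return q.reshape(out_features, in_features), scale.squeeze(-1)

def quantize_uint8_biased_channelwise(w: torch.Tensor):
    """Biased (asymmetric) UINT8 quantization, one scale/zero-point per output channel (W8A16)."""
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / 255.0
    zero = torch.round(-w_min / scale)          # the "bias", i.e. the per-channel zero point
    q = torch.clamp(torch.round(w / scale) + zero, 0, 255).to(torch.uint8)
    return q, scale.squeeze(1), zero.squeeze(1)
```

In both cases activations stay in BF16 (W4A16 / W8A16); dequantization happens inside the Marlin MoE and AllSpark kernels at runtime.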
Structural audit (a cross-check sketch follows the list):
- Safetensor shards: 46
- Size: about 157 GiB
- Expert INT4 tensors: 33,792
- Dense INT8 tensors: 375
- Preserved tensors: 853
- Missing expert scales: 0
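The counts above come from the conversion tool's own audit. A rough cross-check can be scripted by walking the shard headers with safetensors; the name patterns used to classify tensors below are assumptions about this checkpoint's layout, not guaranteed field names.

```python
from collections import Counter
from pathlib import Path
from safetensors import safe_open

ckpt_dir = Path("deepseek-v4-flash-dsv4-int-channel-vllm")
shards = sorted(ckpt_dir.glob("*.safetensors"))
counts = Counter()
for shard in shards:
    with safe_open(shard, framework="pt", device="cpu") as f:    # reads only the shard header
        for name in f.keys():
            if ".experts." in name and "weight_packed" in name:  # hypothetical INT4 expert suffix
                counts["expert_int4"] += 1
            elif name.endswith("qweight"):                        # hypothetical dense INT8 suffix
                counts["dense_int8"] += 1
            else:
                counts["preserved"] += 1

print(f"shards={len(shards)}", dict(counts))
```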
Known status:
- The 2-layer version of this conversion path loads and generates locally under vLLM with compile and CUDA graphs enabled.
- The full 43-layer checkpoint has been converted and structurally audited, but it still needs to pass a full distributed vLLM load/generation test before it can be treated as usable; a minimal smoke test of that kind is sketched after this list.
- Quality/perplexity is not yet validated. Do not assume this matches the original FP4/FP8 DeepSeek checkpoint until evaluation says so.
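For reference, a minimal smoke test of the kind the full checkpoint still needs, written against the standard vLLM Python API; the AppMana fork is assumed to expose the same entry points, and the tensor-parallel size is an assumption about the target Ampere node.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="appmana/deepseek-v4-int4-int8",  # or the local --dst directory from the conversion
    tensor_parallel_size=4,                 # assumption: adjust to the target node
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Briefly explain mixture-of-experts routing."],
    SamplingParams(max_tokens=64, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```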
Model: appmana/deepseek-v4-int4-int8
Base model: deepseek-ai/DeepSeek-V4-Flash