DeepSeek-V4-Flash-TQ-Q3.4-MLX

osmapi/DeepSeek-V4-Flash-TQ-Q3.4-MLX is an Apple-Silicon MLX TurboQuant/JANGTQ quantization of deepseek-ai/DeepSeek-V4-Flash.

No fine-tuning, distillation, or retraining was applied. The official mixed FP4/FP8 source weights were converted locally. The MTP (multi-token prediction) head was dropped because normal decoding does not use it, and the router, mHC, and other control tensors were kept at source precision rather than quantized.

Model Details

Property | Value
Base model | deepseek-ai/DeepSeek-V4-Flash
Architecture | DeepSeek-V4 Flash MoE, 284B total / 13B active parameters, 1M-token context
Local profile | JANGTQ-Q3.4
Bundle size | 126.14 GB
Layout | Pre-stacked MLX switch_mlp layout
MTP head | Dropped
Validation | Safetensors header/index validation, metadata validation
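
The header and index validation listed above can be reproduced locally with a short script. The following is a minimal sketch, not the script used to produce this bundle. It assumes the standard safetensors layout (an 8-byte little-endian header length followed by a JSON header) and the usual weight_map key in model.safetensors.index.json; read_safetensors_header and check_bundle are hypothetical helper names.

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Read only the JSON header of a .safetensors file (no tensor data)."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def check_bundle(model_dir):
    """Return a sorted list of problems found via the shard index (empty means OK)."""
    model_dir = Path(model_dir)
    index = json.loads((model_dir / "model.safetensors.index.json").read_text())
    headers = {}  # shard file name -> parsed header, or None if the shard is missing
    problems = set()
    for tensor_name, shard in index["weight_map"].items():
        if shard not in headers:
            shard_path = model_dir / shard
            headers[shard] = (
                read_safetensors_header(shard_path) if shard_path.is_file() else None
            )
        if headers[shard] is None:
            problems.add(f"missing shard: {shard}")
        elif tensor_name not in headers[shard]:
            problems.add(f"tensor {tensor_name} not in {shard}")
    return sorted(problems)
```

Calling check_bundle("<your-model-dir>") returns an empty list when every tensor named in the index is present in the header of its shard. It reads headers only, so it stays fast even on a bundle of this size.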

Required Sidecar

This is a JANGTQ/TurboQuant bundle and requires jangtq_runtime.safetensors from this repository. The sidecar stores the deterministic codebooks and Hadamard rotation signs used to decode the .tq_packed expert weights. If it is missing, re-download the full repository or fetch that file explicitly:

hf download osmapi/DeepSeek-V4-Flash-TQ-Q3.4-MLX jangtq_runtime.safetensors --local-dir <your-model-dir>
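
The same check-and-fetch can be done from Python. This is a sketch under assumptions: it relies on hf_hub_download from the huggingface_hub package, ensure_sidecar is a hypothetical helper name, and the download runs only when the file is absent from the given directory.

```python
from pathlib import Path

REPO_ID = "osmapi/DeepSeek-V4-Flash-TQ-Q3.4-MLX"
SIDECAR = "jangtq_runtime.safetensors"


def ensure_sidecar(model_dir):
    """Return the sidecar path, downloading the file only if it is missing."""
    path = Path(model_dir) / SIDECAR
    if path.is_file():
        return path
    # Imported lazily so the presence check works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=REPO_ID, filename=SIDECAR, local_dir=model_dir))
```

Running ensure_sidecar("<your-model-dir>") before loading the model turns a missing sidecar into an explicit download step rather than a failure at load time.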

Quantization Recipe

Tensor class | Codec | Bits / handling
Routed experts | TurboQuant MXTQ | 39 routed layer/projection groups at 2-bit MXTQ and 90 at 4-bit MXTQ
Routed effective bits | MXTQ | 3.3953 bits
Attention, shared experts, compressor, indexer, embed, lm head | MLX affine | 8-bit, group size 32
Norms, router, mHC, sinks, integer routing tables | passthrough | source precision preserved

The fractional target (about 3.4 bits) is implemented as a mix of power-of-two lanes, here 2-bit and 4-bit, because the current JANGTQ vectorized packer is stable on 2/4/8-bit lanes for DeepSeek-V4 expert dimensions.
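
The routed effective-bit figure in the recipe can be checked with simple arithmetic. The sketch below assumes every routed layer/projection group holds the same number of weights, so the effective width is an unweighted average over the 129 groups. The card does not state this, but the average reproduces the listed 3.3953 bits.

```python
# Lane assignment from the recipe table: bit width -> number of routed groups.
lane_groups = {2: 39, 4: 90}

total_groups = sum(lane_groups.values())
total_bits = sum(bits * count for bits, count in lane_groups.items())

# Unweighted average bit width, assuming equally sized groups (an assumption).
effective_bits = total_bits / total_groups
print(f"{total_groups} groups, {effective_bits:.4f} effective bits")
# prints "129 groups, 3.3953 effective bits"
```

Moving groups between the 2-bit and 4-bit lanes is the only knob in this scheme, so the reachable effective widths are multiples of 2/129 bits between 2 and 4.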

Use

Install the JANG loader/runtime and MLX LM:

pip install mlx-lm jang-tools

Example:

from jang_tools.load_jangtq import load_jangtq_model
from mlx_lm import generate

model, tokenizer = load_jangtq_model("osmapi/DeepSeek-V4-Flash-TQ-Q3.4-MLX")
prompt = "Write a short note about MLX quantization."
text = generate(model, tokenizer, prompt=prompt, verbose=False)  # verbose=True would stream the text, which print() below would then repeat
print(text)

Files

  • model-*.safetensors: pre-stacked JANGTQ/MLX shards
  • model.safetensors.index.json: shard index
  • jangtq_runtime.safetensors: required TurboQuant runtime sidecar
  • config.json, jang_config.json: MLX/JANGTQ metadata
  • encoding/: upstream DeepSeek-V4 prompt encoding reference

Notes

This upload follows the same broad model-card shape as the public OsaurusAI/JANGQ DeepSeek-V4-Flash JANGTQ uploads: a sidecar warning, an explicit recipe table, and minimal reproducible loading instructions.

License

MIT, following the upstream DeepSeek-V4-Flash release.
