---
quantized_by: sealad886
license_link: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
tags:
- chat
- mlx
- conversations
---

# mlx-community/DeepSeek-R1-Distill-Qwen-7B
This model, [mlx-community/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Qwen-7B), contains multiple quantized variants of the base model [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B). The model was converted to MLX format using mlx-lm version 0.21.5.

The conversion process applied different quantization strategies to produce variants that trade off memory footprint, inference speed, and accuracy. In addition to the default 4-bit conversion, you will find both uniform and mixed quantized files at various bit widths (2-bit, 3-bit, 6-bit, and 8-bit). This lets you select the variant that best fits your deployment scenario, balancing precision and performance.
## Quantization Configurations

The model conversion uses a range of quantization configurations defined via `mlx_lm.convert`. These configurations fall into three main categories (a conversion sketch follows the list below):
1. **Uniform Quantization:**
   Applies the same bit width to all layers.
   - **3bit:** Uniform 3-bit quantization.
   - **4bit:** Uniform 4-bit quantization (default).
   - **6bit:** Uniform 6-bit quantization.
   - **8bit:** Uniform 8-bit quantization.

2. **Mixed Quantization:**
   Uses a predicate function to choose the bit width for each layer, so different layers can use different precisions.
   - **2,6_mixed:** Uses the `mixed_2_6` predicate to choose between 2-bit and 6-bit quantization.
   - **3,6_mixed:** Uses the `mixed_3_6` predicate to choose between 3-bit and 6-bit quantization.
   - **3,4_mixed:** Built via `mixed_quant_predicate_builder(3, 4, group_size)`, it mixes 3-bit and 4-bit precision.
   - **4,6_mixed:** Built via `mixed_quant_predicate_builder(4, 6, group_size)`, it mixes 4-bit and 6-bit precision.
   - **4,8_mixed:** Built via `mixed_quant_predicate_builder(4, 8, group_size)`, it mixes 4-bit and 8-bit precision.

   Here `group_size = 64`, which is also the default group size used by the other quantization configurations.

3. **Non-Quantized Conversions:**
   Converts the model to a different floating-point precision without quantizing the weights.
   - **bfloat16:** Model converted to bfloat16 precision.
   - **float16:** Model converted to float16 precision.
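
Below is a minimal sketch of how variants like these can be produced with mlx-lm's Python `convert` API. The keyword arguments and the module location of the mixed-precision helpers are assumptions based on mlx-lm 0.21.x and may differ between releases, so treat it as illustrative rather than the exact commands used:

```python
# Sketch only: producing uniform and mixed-precision MLX variants with mlx-lm.
# Argument names and helper locations are assumed from mlx-lm 0.21.x and may vary.
from mlx_lm import convert

# Uniform 4-bit quantization (the default configuration), group size 64.
convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)

# Mixed 4-bit/8-bit quantization: a predicate picks the precision per layer.
from mlx_lm.convert import mixed_quant_predicate_builder

predicate = mixed_quant_predicate_builder(4, 8, 64)  # low bits, high bits, group size

convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-4,8_mixed",
    quantize=True,
    quant_predicate=predicate,
)
```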
## Use with mlx

Install the `mlx-lm` package:

```bash
pip install mlx-lm
```
Load the model and generate text:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-7B")

prompt = "hello"

# Use the tokenizer's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
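For longer outputs you can stream tokens as they are generated instead of waiting for the full response. The sketch below uses mlx-lm's `stream_generate` helper; it assumes a recent mlx-lm release (0.21.x) where each yielded response exposes a `text` field, which may differ in older versions:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-7B")

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print each chunk of text as it is produced.
for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```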
Each configuration targets a specific balance of memory footprint, inference speed, and accuracy, so you can match the variant to your deployment's resource constraints and performance requirements.