---
quantized_by: sealad886
license_link: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
tags:
- chat
- mlx
- conversations
---

# mlx-community/DeepSeek-R1-Distill-Qwen-7B
This model, [mlx-community/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/mlx-community/DeepSeek-R1-Distill-Qwen-7B), contains multiple quantized variants of the base model [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B). The model was converted to MLX format using mlx-lm version 0.21.5.

The conversion process applied different quantization strategies to produce variants that trade off memory footprint, inference speed, and accuracy. In addition to the default 4-bit conversion, you will find both uniform and mixed quantized files at various bit widths (2-bit, 3-bit, 6-bit, and 8-bit). This lets you select the variant that best fits your deployment scenario, balancing precision and performance.
## Quantization Configurations

The model conversion uses a range of quantization configurations defined via `mlx_lm.convert`. These configurations fall into three main categories (a conversion sketch follows the list below):
1. **Uniform Quantization:**
   Applies the same bit width to all layers.
   - **3bit:** Uniform 3-bit quantization.
   - **4bit:** Uniform 4-bit quantization (default).
   - **6bit:** Uniform 6-bit quantization.
   - **8bit:** Uniform 8-bit quantization.

2. **Mixed Quantization:**
   Uses a predicate function to choose the bit width for each layer, so different layers can use different precisions.
   - **2,6_mixed:** Uses the `mixed_2_6` predicate to choose between 2-bit and 6-bit quantization.
   - **3,6_mixed:** Uses the `mixed_3_6` predicate to choose between 3-bit and 6-bit quantization.
   - **3,4_mixed:** Built via `mixed_quant_predicate_builder(3, 4, group_size)`, it mixes 3-bit and 4-bit precision.
   - **4,6_mixed:** Built via `mixed_quant_predicate_builder(4, 6, group_size)`, it mixes 4-bit and 6-bit precision.
   - **4,8_mixed:** Built via `mixed_quant_predicate_builder(4, 8, group_size)`, it mixes 4-bit and 8-bit precision.

   Here `group_size = 64`, which is also the default group size used by the other quantization configurations.

3. **Non-Quantized Conversions:**
   Converts the model to a different floating-point precision without quantizing the weights.
   - **bfloat16:** Model converted to bfloat16 precision.
   - **float16:** Model converted to float16 precision.
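
Below is a minimal sketch of how variants like these can be produced with mlx-lm's Python `convert` API. The keyword arguments and the module location of the mixed-precision helpers are assumptions based on mlx-lm 0.21.x and may differ between releases, so treat it as illustrative rather than the exact commands used:

```python
# Sketch only: producing uniform and mixed-precision MLX variants with mlx-lm.
# Argument names and helper locations are assumed from mlx-lm 0.21.x and may vary.
from mlx_lm import convert

# Uniform 4-bit quantization (the default configuration), group size 64.
convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-4bit",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)

# Mixed 4-bit/8-bit quantization: a predicate picks the precision per layer.
from mlx_lm.convert import mixed_quant_predicate_builder

predicate = mixed_quant_predicate_builder(4, 8, 64)  # low bits, high bits, group size

convert(
    hf_path="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    mlx_path="DeepSeek-R1-Distill-Qwen-7B-4,8_mixed",
    quantize=True,
    quant_predicate=predicate,
)
```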
## Use with mlx

Install the `mlx-lm` package:

```bash
pip install mlx-lm
```
Load the model and generate text:

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-7B")

prompt = "hello"

# Use the tokenizer's chat template when one is available.
if tokenizer.chat_template is not None:
    messages = [{"role": "user", "content": prompt}]
    prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, verbose=True)
```
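For longer outputs you can stream tokens as they are generated instead of waiting for the full response. The sketch below uses mlx-lm's `stream_generate` helper; it assumes a recent mlx-lm release (0.21.x) where each yielded response exposes a `text` field, which may differ in older versions:

```python
from mlx_lm import load, stream_generate

model, tokenizer = load("mlx-community/DeepSeek-R1-Distill-Qwen-7B")

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

# Print each chunk of text as it is produced.
for response in stream_generate(model, tokenizer, prompt, max_tokens=512):
    print(response.text, end="", flush=True)
print()
```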
Each configuration targets a specific balance of memory footprint, inference speed, and accuracy, so you can match the variant to your deployment's resource constraints and performance requirements.