# DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML

This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model optimized for Apple Silicon devices. The conversion features stateful key-value (KV) caching for efficient text generation.
## Model Description

[DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) is an 8-billion-parameter language model from the DeepSeek-AI team. It is built on the Llama architecture and was distilled from the much larger DeepSeek-R1 model to retain strong reasoning performance at a far smaller parameter count.

This CoreML conversion provides:

- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV caching for efficient text generation
- Optimized performance for on-device deployment
## Technical Specifications

- **Base Model**: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- **Parameters**: 8 billion
- **Context Length**: Configurable (default: 64 tokens, expandable within memory constraints)
- **Precision**: FP16
- **File Format**: .mlpackage
- **Deployment Target**: macOS 15+
- **Architecture**: Stateful LLM with key-value caching
- **Input Features**: Flexible input size with dynamic shape handling
## Key Features

- **Stateful Inference**: A custom SliceUpdateKeyValueCache maintains the KV cache between inference calls, avoiding recomputation of past keys and values and significantly speeding up generation.
- **Dynamic Input Shapes**: Variable input lengths are supported through a RangeDim specification.
- **Optimized Memory Usage**: The key-value cache is managed efficiently to minimize memory footprint.
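The slice-update idea behind the cache can be illustrated with a minimal, framework-free sketch. The `SliceUpdateCache` class below is hypothetical (the real SliceUpdateKeyValueCache operates on tensors inside the model); it only shows the principle: a fixed-size buffer is allocated once, and each step writes its new entries into the next free slice instead of reallocating.

```python
class SliceUpdateCache:
    """Minimal sketch of a slice-update KV cache: a fixed-size buffer is
    filled in place, one slice per inference step."""

    def __init__(self, max_len: int):
        self.keys = [None] * max_len    # preallocated buffer (stands in for a tensor)
        self.values = [None] * max_len
        self.pos = 0                    # number of cached positions so far

    def update(self, new_keys, new_values):
        n = len(new_keys)
        # Write the new entries into the slice [pos : pos + n].
        self.keys[self.pos:self.pos + n] = new_keys
        self.values[self.pos:self.pos + n] = new_values
        self.pos += n
        # Attention only ever sees the filled prefix of the buffer.
        return self.keys[:self.pos], self.values[:self.pos]


cache = SliceUpdateCache(max_len=8)
cache.update(["k0", "k1"], ["v0", "v1"])     # prompt: two tokens
keys, values = cache.update(["k2"], ["v2"])  # one generated token
print(keys)  # ['k0', 'k1', 'k2']
```

Because only the new slice is written on each call, the per-token cost of maintaining the cache stays constant rather than growing with the sequence.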
## Implementation Details

This conversion uses:

- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Proper buffer registration to ensure state persistence
- Dynamic tensor shapes to accommodate various input and context lengths
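The buffer-registration point can be sketched as follows. The `KvCacheWrapper` class, its shapes, and its buffer names are illustrative assumptions, not the repository's actual code; the sketch only shows why `register_buffer` matters: registered buffers belong to the module's state, which is what lets the converter expose them as persistent CoreML state.

```python
import torch

class KvCacheWrapper(torch.nn.Module):
    """Hypothetical sketch of the wrapper idea: KV caches are registered
    as buffers so they persist across forward calls. Shapes are toy-sized."""

    def __init__(self, num_layers=2, num_heads=4, max_len=64, head_dim=8):
        super().__init__()
        shape = (num_layers, 1, num_heads, max_len, head_dim)
        # register_buffer (not a plain attribute) makes the cache part of
        # the module's state, so the converter can map it to CoreML state.
        self.register_buffer("keyCache", torch.zeros(shape))
        self.register_buffer("valueCache", torch.zeros(shape))

    def forward(self, new_k, new_v, pos):
        n = new_k.shape[-2]
        # Slice-update: write the new entries in place at position `pos`.
        self.keyCache[:, :, :, pos:pos + n, :] = new_k
        self.valueCache[:, :, :, pos:pos + n, :] = new_v
        return self.keyCache, self.valueCache
```

A plain Python attribute would be traced away during conversion; a registered buffer survives as named, mutable state.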
## Usage

The model can be loaded and used in Swift via the CoreML framework, or in Python via coremltools:

```python
import coremltools as ct

# Load the model
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create the state object that holds the KV cache between calls
state = model.make_state()

# Prepare inputs for inference
# ...

# Run inference, passing the state so the KV cache persists
output = model.predict({
    "inputIds": input_ids,
    "causalMask": causal_mask
}, state)
```
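The exact shape and dtype expected by the `causalMask` input are not documented here, but a typical additive causal mask can be built without any dependencies. This is a sketch under that assumption: 0.0 where position `i` may attend to position `j` (`j <= i`), negative infinity where it would attend to the future.

```python
def make_causal_mask(seq_len: int):
    """Additive causal mask: 0.0 where attention is allowed (j <= i),
    -inf where a position would attend to the future (j > i)."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]

mask = make_causal_mask(4)
print(mask[0])  # [0.0, -inf, -inf, -inf]
```

In practice the nested list would be converted to the array shape and precision the model's input description declares (e.g. FP16) before being passed to `predict`.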
## Conversion Process

The model was converted using CoreML Tools with the following steps:

1. Loading the original model from Hugging Face
2. Wrapping it with custom state management
3. Tracing the wrapper with PyTorch's JIT (`torch.jit.trace`)
4. Converting to CoreML format with state specifications
5. Saving in the .mlpackage format
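Steps 2–5 can be sketched with coremltools' stateful conversion configuration. Everything below is illustrative, not the repository's actual conversion script: `traced_model` stands for the traced wrapper from step 3 and is not constructed here, and the tensor names, cache shape, and dimensions are assumptions.

```python
import numpy as np
import coremltools as ct

context_len = 64
# Illustrative cache shape: (layers, batch, kv_heads, context, head_dim)
kv_cache_shape = (32, 1, 8, context_len, 128)

# `traced_model` is assumed to be the torch.jit.trace output of the
# state-managed wrapper (step 3); it is not built in this sketch.
mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="inputIds",
                      shape=(1, ct.RangeDim(1, context_len)),
                      dtype=np.int32),
        ct.TensorType(name="causalMask",
                      shape=(1, 1, ct.RangeDim(1, context_len), context_len),
                      dtype=np.float16),
    ],
    # Each StateType maps a registered buffer in the wrapper to a
    # persistent CoreML state by name (step 4).
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape,
                                                dtype=np.float16),
                     name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape,
                                                dtype=np.float16),
                     name="valueCache"),
    ],
    minimum_deployment_target=ct.target.macOS15,
)
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")  # step 5
```

The `RangeDim` inputs give the dynamic shapes mentioned above, and `minimum_deployment_target=ct.target.macOS15` is required because CoreML state support begins with macOS 15.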
## Requirements

To use this model:

- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- At least 16GB of RAM; more is recommended for longer contexts
## Limitations

- The model requires significant memory for inference, especially with longer contexts
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens, but this can be adjusted
## License

This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.

## Acknowledgments

- [DeepSeek-AI](https://github.com/deepseek-ai) for creating and releasing the original model
- [Hugging Face](https://huggingface.co/) for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework

## Citation

If you use this model in your research, please cite both the original DeepSeek model and this conversion.