# DeepSeek-R1-Distill-Llama-8B-Stateful-CoreML

This repository contains a CoreML conversion of the DeepSeek-R1-Distill-Llama-8B model optimized for Apple Silicon devices. The conversion features stateful key-value (KV) caching for efficient text generation.
## Model Description

[DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B) is an 8-billion-parameter language model from the DeepSeek-AI team. It is built on the Llama architecture and was distilled from the much larger DeepSeek-R1 model to retain strong reasoning performance at a far smaller parameter count.

This CoreML conversion provides:

- Full compatibility with Apple Silicon devices (M1, M2, M3 series)
- Stateful inference with KV caching for efficient text generation
- Optimized performance for on-device deployment
## Technical Specifications

- **Base Model**: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
- **Parameters**: 8 billion
- **Context Length**: Configurable (default: 64 tokens, expandable within memory constraints)
- **Precision**: FP16
- **File Format**: .mlpackage
- **Deployment Target**: macOS 15+
- **Architecture**: Stateful LLM with key-value caching
- **Input Features**: Flexible input size with dynamic shape handling
## Key Features

- **Stateful Inference**: A custom SliceUpdateKeyValueCache maintains the KV cache between inference calls, avoiding recomputation of past keys and values and significantly speeding up generation.
- **Dynamic Input Shapes**: Variable input lengths are supported through a RangeDim specification.
- **Optimized Memory Usage**: The key-value cache is managed efficiently to minimize memory footprint.
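The slice-update idea behind the cache can be illustrated with a minimal, framework-free sketch. The `SliceUpdateCache` class below is hypothetical (the real SliceUpdateKeyValueCache operates on tensors inside the model); it only shows the principle: a fixed-size buffer is allocated once, and each step writes its new entries into the next free slice instead of reallocating.

```python
class SliceUpdateCache:
    """Minimal sketch of a slice-update KV cache: a fixed-size buffer is
    filled in place, one slice per inference step."""

    def __init__(self, max_len: int):
        self.keys = [None] * max_len    # preallocated buffer (stands in for a tensor)
        self.values = [None] * max_len
        self.pos = 0                    # number of cached positions so far

    def update(self, new_keys, new_values):
        n = len(new_keys)
        # Write the new entries into the slice [pos : pos + n].
        self.keys[self.pos:self.pos + n] = new_keys
        self.values[self.pos:self.pos + n] = new_values
        self.pos += n
        # Attention only ever sees the filled prefix of the buffer.
        return self.keys[:self.pos], self.values[:self.pos]


cache = SliceUpdateCache(max_len=8)
cache.update(["k0", "k1"], ["v0", "v1"])     # prompt: two tokens
keys, values = cache.update(["k2"], ["v2"])  # one generated token
print(keys)  # ['k0', 'k1', 'k2']
```

Because only the new slice is written on each call, the per-token cost of maintaining the cache stays constant rather than growing with the sequence.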
## Implementation Details

This conversion uses:

- A custom KvCacheStateLlamaForCausalLM wrapper around the Hugging Face Transformers implementation
- CoreML's state management capabilities for maintaining KV caches between inference calls
- Proper buffer registration to ensure state persistence
- Dynamic tensor shapes to accommodate various input and context lengths
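The buffer-registration point can be sketched as follows. The `KvCacheWrapper` class, its shapes, and its buffer names are illustrative assumptions, not the repository's actual code; the sketch only shows why `register_buffer` matters: registered buffers belong to the module's state, which is what lets the converter expose them as persistent CoreML state.

```python
import torch

class KvCacheWrapper(torch.nn.Module):
    """Hypothetical sketch of the wrapper idea: KV caches are registered
    as buffers so they persist across forward calls. Shapes are toy-sized."""

    def __init__(self, num_layers=2, num_heads=4, max_len=64, head_dim=8):
        super().__init__()
        shape = (num_layers, 1, num_heads, max_len, head_dim)
        # register_buffer (not a plain attribute) makes the cache part of
        # the module's state, so the converter can map it to CoreML state.
        self.register_buffer("keyCache", torch.zeros(shape))
        self.register_buffer("valueCache", torch.zeros(shape))

    def forward(self, new_k, new_v, pos):
        n = new_k.shape[-2]
        # Slice-update: write the new entries in place at position `pos`.
        self.keyCache[:, :, :, pos:pos + n, :] = new_k
        self.valueCache[:, :, :, pos:pos + n, :] = new_v
        return self.keyCache, self.valueCache
```

A plain Python attribute would be traced away during conversion; a registered buffer survives as named, mutable state.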
## Usage

The model can be loaded and used in Swift via the CoreML framework, or in Python via coremltools:

```python
import coremltools as ct

# Load the model
model = ct.models.MLModel("DeepSeek-R1-Distill-Llama-8B.mlpackage")

# Create the state object that holds the KV cache between calls
state = model.make_state()

# Prepare inputs for inference
# ...

# Run inference, passing the state so the KV cache persists
output = model.predict({
    "inputIds": input_ids,
    "causalMask": causal_mask
}, state)
```
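The exact shape and dtype expected by the `causalMask` input are not documented here, but a typical additive causal mask can be built without any dependencies. This is a sketch under that assumption: 0.0 where position `i` may attend to position `j` (`j <= i`), negative infinity where it would attend to the future.

```python
def make_causal_mask(seq_len: int):
    """Additive causal mask: 0.0 where attention is allowed (j <= i),
    -inf where a position would attend to the future (j > i)."""
    neg_inf = float("-inf")
    return [[0.0 if j <= i else neg_inf for j in range(seq_len)]
            for i in range(seq_len)]

mask = make_causal_mask(4)
print(mask[0])  # [0.0, -inf, -inf, -inf]
```

In practice the nested list would be converted to the array shape and precision the model's input description declares (e.g. FP16) before being passed to `predict`.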
## Conversion Process

The model was converted using CoreML Tools with the following steps:

1. Loading the original model from Hugging Face
2. Wrapping it with custom state management
3. Tracing the wrapper with PyTorch's JIT (`torch.jit.trace`)
4. Converting to CoreML format with state specifications
5. Saving in the .mlpackage format
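Steps 2–5 can be sketched with coremltools' stateful conversion configuration. Everything below is illustrative, not the repository's actual conversion script: `traced_model` stands for the traced wrapper from step 3 and is not constructed here, and the tensor names, cache shape, and dimensions are assumptions.

```python
import numpy as np
import coremltools as ct

context_len = 64
# Illustrative cache shape: (layers, batch, kv_heads, context, head_dim)
kv_cache_shape = (32, 1, 8, context_len, 128)

# `traced_model` is assumed to be the torch.jit.trace output of the
# state-managed wrapper (step 3); it is not built in this sketch.
mlmodel = ct.convert(
    traced_model,
    inputs=[
        ct.TensorType(name="inputIds",
                      shape=(1, ct.RangeDim(1, context_len)),
                      dtype=np.int32),
        ct.TensorType(name="causalMask",
                      shape=(1, 1, ct.RangeDim(1, context_len), context_len),
                      dtype=np.float16),
    ],
    # Each StateType maps a registered buffer in the wrapper to a
    # persistent CoreML state by name (step 4).
    states=[
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape,
                                                dtype=np.float16),
                     name="keyCache"),
        ct.StateType(wrapped_type=ct.TensorType(shape=kv_cache_shape,
                                                dtype=np.float16),
                     name="valueCache"),
    ],
    minimum_deployment_target=ct.target.macOS15,
)
mlmodel.save("DeepSeek-R1-Distill-Llama-8B.mlpackage")  # step 5
```

The `RangeDim` inputs give the dynamic shapes mentioned above, and `minimum_deployment_target=ct.target.macOS15` is required because CoreML state support begins with macOS 15.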
## Requirements

To use this model:

- Apple Silicon Mac (M1/M2/M3 series)
- macOS 15 or later
- At least 16GB of RAM; more is recommended for longer contexts
## Limitations

- The model requires significant memory for inference, especially with longer contexts
- Performance is highly dependent on the device's Neural Engine capabilities
- The default configuration supports a context length of 64 tokens, but this can be adjusted
## License

This model conversion inherits the license of the original DeepSeek-R1-Distill-Llama-8B model.

## Acknowledgments

- [DeepSeek-AI](https://github.com/deepseek-ai) for creating and releasing the original model
- [Hugging Face](https://huggingface.co/) for hosting the model and providing the Transformers library
- Apple for developing the CoreML framework

## Citation

If you use this model in your research, please cite both the original DeepSeek model and this conversion.