Mixtral-8x7B-DeepSeek-R1-Distill
A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.
Model Details
Model Description
This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.
- Developed by: ykarout
- Model type: Mixture of Experts (MoE) Language Model
- Language(s) (NLP): English, Arabic, French, Spanish (inherited from base model)
- License: Apache 2.0
- Finetuned from model: mistralai/Mixtral-8x7B-Instruct-v0.1
Model Sources
- Base Repository: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- Training Dataset: open-r1/Mixture-of-Thoughts
Uses
Direct Use
This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:
- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces
Downstream Use
The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.
Out-of-Scope Use
- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning)
Bias, Risks, and Limitations
- Reasoning Overhead: Generates longer responses due to explicit thinking processes
- Inherited Biases: Retains biases from the base Mixtral model and training data
- Hallucination Risk: May generate plausible but incorrect reasoning steps
- Language Bias: Reasoning capabilities may be stronger in English than other supported languages
Recommendations
Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."
How to Get Started with the Model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the bf16 weights, placing layers on the available devices
tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt. The tokenizer adds the <s> BOS token itself,
# so it is not repeated in the prompt string.
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

# Move the input tensors to the same device as the model before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
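For multi-turn conversations, the same `[INST] ... [/INST]` layout can be assembled programmatically. The following is a minimal sketch, not the tokenizer's own implementation: `build_mixtral_prompt` is a hypothetical helper, and the exact whitespace around the tags is an assumption. When the tokenizer is loaded, its `apply_chat_template` method is the safer way to produce the prompt.

```python
def build_mixtral_prompt(messages):
    """Assemble an [INST]-style prompt from alternating user/assistant turns.

    Hypothetical helper for illustration. The <s> BOS token is left out on
    purpose because the tokenizer normally adds it during encoding.
    """
    if not messages or messages[0]["role"] != "user":
        raise ValueError("conversation must start with a user message")

    parts = []
    for index, message in enumerate(messages):
        expected_role = "user" if index % 2 == 0 else "assistant"
        if message["role"] != expected_role:
            raise ValueError("roles must alternate user/assistant")
        if message["role"] == "user":
            parts.append(f"[INST] {message['content']} [/INST]")
        else:
            # Assistant turns are closed with the end-of-sequence token.
            parts.append(f" {message['content']}</s>")
    return "".join(parts)


prompt = build_mixtral_prompt([
    {"role": "user", "content": "Solve this step by step: what is 300 / 5?"},
])
print(prompt)
```

The helper rejects conversations that do not alternate roles, since the instruct format has no slot for consecutive turns from the same speaker.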
Training Details
Training Data
The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.
Training Procedure
Training Hyperparameters
- Training regime: bf16 mixed precision
- Optimizer: AdamW with fused implementation
- Learning rate: 5e-6 (reduced from initial 1e-5 for stability)
- Batch size: 8 per device
- Gradient accumulation steps: 1
- Max sequence length: 8192 tokens
- Epochs: 1
- Gradient clipping: 0.1 (tightened for stability)
- Learning rate scheduler: Cosine with 10% warmup
- Weight decay: 0.01
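To make the scheduler setting concrete, here is a small sketch of a cosine schedule with 10% linear warmup and the 5e-6 peak learning rate listed above. It assumes linear warmup from zero and decay to zero at the final step, which is how cosine-with-warmup schedules are commonly implemented; the card does not state the total number of steps, so `total_steps` below is a hypothetical value.

```python
import math


def lr_at_step(step, total_steps, peak_lr=5e-6, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not stated in the card
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The learning rate reaches its peak at the end of warmup, falls to half the peak midway through the decay phase, and ends at zero.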
Training Infrastructure
- Hardware: Single NVIDIA H200 GPU
- Framework: Transformers + TRL SFTTrainer
- Gradient checkpointing: Enabled
- Memory optimizations: Remove unused columns, persistent data loaders
Speeds, Sizes, Times
- Training time: Approximately 15 hours for full epoch
- Peak memory usage: ~140GB on H200
- Tokens processed: ~15M tokens
- Final model size: ~90GB (bf16 precision)
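As a rough sanity check, the figures above imply an average training throughput of about one million tokens per hour. The arithmetic below uses only the approximate token count and training time from this list, so the result is an order-of-magnitude estimate.

```python
tokens_processed = 15_000_000  # "~15M tokens" from the list above
training_hours = 15            # "approximately 15 hours"

tokens_per_hour = tokens_processed / training_hours
tokens_per_second = tokens_per_hour / 3600

print(f"{tokens_per_hour:,.0f} tokens/hour")
print(f"{tokens_per_second:.0f} tokens/second")
```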
Evaluation
Testing Data, Factors & Metrics
Testing Data
Evaluation pending on standard reasoning benchmarks including:
- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks
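For benchmarks such as GSM8K, scoring usually comes down to comparing the final number in a generation with the reference answer, which GSM8K stores after a `####` marker. The sketch below shows one way that could look. It is not the evaluation code for this model: the function names are hypothetical, "last number in the generation" is a heuristic, and the two examples are toy data rather than benchmark items or model outputs.

```python
import re

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution):
    """GSM8K reference solutions end with '#### <answer>'."""
    return float(solution.split("####")[-1].strip().replace(",", ""))


def predicted_answer(generation):
    """Heuristic: treat the last number in the generation as the final answer."""
    matches = NUMBER.findall(generation)
    return float(matches[-1].replace(",", "")) if matches else None


def exact_match(generations, solutions):
    """Fraction of generations whose final number equals the reference."""
    hits = sum(
        predicted_answer(g) == reference_answer(s)
        for g, s in zip(generations, solutions)
    )
    return hits / len(solutions)


# Toy data for illustration only.
generations = [
    "Total distance is 300 km over 5 hours, so the average speed is 60 km/h.",
    "Adding the two amounts, the answer is 12.",
]
solutions = ["300 / 5 = 60\n#### 60", "5 + 8 = 13\n#### 13"]
print(exact_match(generations, solutions))
```

A stricter evaluation would ask the model to emit its answer in a fixed format, because long reasoning traces often contain intermediate numbers after the final result.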
Metrics
- Primary: Token-level accuracy during training
- Secondary: Loss convergence and gradient stability
- Planned: Human evaluation of reasoning quality
Results
Training Metrics:
- Final training loss: ~0.6 (converged from ~0.85)
- Token accuracy: Stabilized around 78-84%
- Training stability: Achieved after hyperparameter tuning
Comprehensive results on reasoning benchmarks will be added once evaluation is complete.
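To put the training metrics in more familiar terms, the sketch below converts the reported loss values to perplexity and shows how token-level accuracy is typically computed. It assumes the loss is the mean per-token cross-entropy in nats and that label positions marked `-100` are ignored, as is conventional in PyTorch training loops; the token IDs are toy values, not data from this run.

```python
import math


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)


def token_accuracy(predicted_ids, label_ids, ignore_index=-100):
    """Share of non-ignored label positions where the prediction matches."""
    pairs = [(p, y) for p, y in zip(predicted_ids, label_ids) if y != ignore_index]
    if not pairs:
        return 0.0
    return sum(p == y for p, y in pairs) / len(pairs)


for loss in (0.85, 0.6):  # starting and final loss reported above
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")

# Toy example: the first position is masked out, 3 of the remaining 4 match.
print(token_accuracy([7, 3, 9, 4, 5], [-100, 3, 9, 4, 8]))
```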
Model Examination
The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.
Environmental Impact
Estimated Training Impact:
- Hardware Type: NVIDIA H200 (141GB HBM3e)
- Hours used: ~15 hours
- Cloud Provider: Academic cluster
- Compute Region: [Location specific]
- Estimated Carbon Emitted: ~2-3 kg CO2eq (approximate)
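The carbon estimate can be reproduced with back-of-the-envelope arithmetic. Only the 15 hours come from this card. The sustained GPU power draw of about 700 W and the grid carbon intensity of 0.2 to 0.3 kg CO2eq per kWh are assumptions chosen for illustration, and the calculation ignores host, cooling, and other datacenter overhead.

```python
hours_used = 15                             # from the list above
assumed_gpu_power_kw = 0.7                  # assumption: ~700 W sustained draw
assumed_intensity_kg_per_kwh = (0.2, 0.3)   # assumption: grid carbon intensity range

energy_kwh = hours_used * assumed_gpu_power_kw
low_kg, high_kg = (energy_kwh * i for i in assumed_intensity_kg_per_kwh)

print(f"energy: {energy_kwh:.1f} kWh")
print(f"emissions: {low_kg:.1f} to {high_kg:.1f} kg CO2eq")
```

Under these assumptions the result lands in the same range as the estimate above; a different region or power draw would shift it proportionally.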
Technical Specifications
Model Architecture and Objective
- Base Architecture: Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- Active Parameters: ~13B (2 experts activated per token)
- Total Parameters: ~47B
- Training Objective: Causal language modeling with reasoning supervision
- Attention: Grouped-query attention with a 32k-token context window
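The parameter counts above also explain the size of the checkpoint and the per-token compute. The short calculation below uses the approximate figures from this list and two bytes per parameter for bf16. It gives about 94 GB (roughly 87.5 GiB) of weights, in line with the approximate bf16 model size reported for this model, with a little over a quarter of the parameters active for any given token.

```python
total_params = 47e9    # "~47B" total parameters
active_params = 13e9   # "~13B" active per token (2 of 8 experts)
bytes_per_param = 2    # bf16 stores each parameter in 2 bytes

weight_bytes = total_params * bytes_per_param
active_fraction = active_params / total_params

print(f"weights: {weight_bytes / 1e9:.0f} GB ({weight_bytes / 2**30:.1f} GiB)")
print(f"active fraction per token: {active_fraction:.1%}")
```

All experts must still be resident in memory, so the memory footprint follows the total parameter count while the per-token compute follows the active count.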
Compute Infrastructure
Hardware
- Training: NVIDIA H200 (141GB HBM3e)
- Memory: 139GB peak utilization
- Precision: bfloat16
Software
- Framework: PyTorch + Transformers + TRL
- CUDA: Compatible with latest versions
- Optimization: Flash Attention, gradient checkpointing
Citation
BibTeX:
@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
Model Card Contact
For questions or issues, please get in touch through Hugging Face.