---
license: apache-2.0
datasets:
- open-r1/Mixture-of-Thoughts
language:
- en
- ar
- fr
- es
base_model:
- mistralai/Mixtral-8x7B-Instruct-v0.1
pipeline_tag: text-generation
library_name: transformers
tags:
- reasoning
- r1
- deepseek
- mixtral
- MoE
- thinking
- code
- science
- math
metrics:
- accuracy
new_version: ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit
---

# Mixtral-8x7B-DeepSeek-R1-Distill

A reasoning-enhanced version of Mixtral-8x7B-Instruct-v0.1, fine-tuned on reasoning responses generated by DeepSeek's reasoning model.
## Model Details

### Model Description

This model is a fine-tuned version of Mixtral-8x7B-Instruct-v0.1 that has been trained on reasoning-rich datasets to improve its step-by-step thinking and problem-solving capabilities. The model learns to generate explicit reasoning traces similar to those produced by advanced reasoning models like DeepSeek-R1.

- **Developed by:** ykarout
- **Model type:** Mixture of Experts (MoE) Language Model
- **Language(s) (NLP):** English, Arabic, French, Spanish (inherited from base model)
- **License:** Apache 2.0
- **Finetuned from model:** mistralai/Mixtral-8x7B-Instruct-v0.1

### Model Sources

- **Base Repository:** https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
- **Training Dataset:** open-r1/Mixture-of-Thoughts

## Uses

### Direct Use

This model is designed for tasks requiring explicit reasoning and step-by-step problem solving, including:

- Mathematical problem solving with detailed explanations
- Logical reasoning tasks
- Code generation with explanatory comments
- Scientific analysis and hypothesis formation
- Complex question answering with reasoning traces
### Downstream Use

The model can be further fine-tuned for domain-specific reasoning tasks or integrated into applications requiring transparent AI reasoning processes.
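For example, a parameter-efficient route to domain adaptation could look roughly like the sketch below. This is illustrative only; the dataset name, LoRA settings, and output directory are placeholders and are not part of this model's training recipe.

```python
# Illustrative LoRA fine-tuning sketch; dataset name and hyperparameters are placeholders
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

train_dataset = load_dataset("your-org/your-domain-reasoning-dataset", split="train")  # hypothetical dataset

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(
        output_dir="mixtral-r1-domain-lora",  # placeholder
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        learning_rate=1e-4,
        num_train_epochs=1,
        bf16=True,
    ),
)
trainer.train()
```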
### Out-of-Scope Use

- Real-time applications requiring sub-second responses (due to reasoning overhead)
- Tasks where reasoning explanations are not desired
- Applications requiring factual accuracy without verification (model may hallucinate during reasoning)

## Bias, Risks, and Limitations

- **Reasoning Overhead:** Generates longer responses due to explicit thinking processes
- **Inherited Biases:** Retains biases from the base Mixtral model and training data
- **Hallucination Risk:** May generate plausible but incorrect reasoning steps
- **Language Bias:** Reasoning capabilities may be stronger in English than other supported languages

### Recommendations

Users should validate reasoning outputs, especially for critical applications. The model works best when prompted to "think step by step" or "show your reasoning."

## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
model = AutoModelForCausalLM.from_pretrained(
    "ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Example reasoning prompt (the tokenizer prepends the <s> BOS token automatically)
prompt = """[INST] Solve this step by step: If a train travels 120 km in 2 hours, and then 180 km in 3 hours, what is its average speed for the entire journey? [/INST]"""

# Move the inputs to the same device as the model before generating
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    do_sample=True
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
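Alternatively, continuing from the snippet above, the tokenizer's chat template can build the `[INST]` formatting and special tokens for you. A minimal sketch, assuming the repository ships the standard Mixtral-Instruct chat template:

```python
# Let the chat template handle [INST] tags and special tokens
messages = [
    {"role": "user", "content": "Solve this step by step: a train covers 120 km in 2 hours, then 180 km in 3 hours. What is its average speed overall?"}
]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```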
## Training Details

### Training Data

The model was fine-tuned on the open-r1/Mixture-of-Thoughts dataset, which contains reasoning responses generated by DeepSeek's reasoning model across various domains including mathematics, science, coding, and logical reasoning.

### Training Procedure
#### Training Hyperparameters

- **Training regime:** bf16 mixed precision
- **Optimizer:** AdamW with fused implementation
- **Learning rate:** 5e-6 (reduced from initial 1e-5 for stability)
- **Batch size:** 8 per device
- **Gradient accumulation steps:** 1
- **Max sequence length:** 8192 tokens
- **Epochs:** 1
- **Gradient clipping:** 0.1 (tightened for stability)
- **Learning rate scheduler:** Cosine with 10% warmup
- **Weight decay:** 0.01
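A minimal sketch of how these settings might be expressed with TRL. Argument names follow recent TRL/Transformers releases and can differ between versions; the output directory is a placeholder, and this is a reconstruction rather than the original training script.

```python
from trl import SFTConfig

# Reconstruction of the reported hyperparameters; not the original training script
training_args = SFTConfig(
    output_dir="mixtral-8x7b-deepseek-r1-distill",  # placeholder
    bf16=True,                           # bf16 mixed precision
    optim="adamw_torch_fused",           # AdamW, fused implementation
    learning_rate=5e-6,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=1,
    max_seq_length=8192,                 # may be called max_length in newer TRL releases
    num_train_epochs=1,
    max_grad_norm=0.1,                   # gradient clipping
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
    gradient_checkpointing=True,
    remove_unused_columns=True,          # memory optimization noted in the infrastructure section below
    dataloader_persistent_workers=True,  # persistent data loaders (requires dataloader_num_workers > 0)
)
```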
#### Training Infrastructure

- **Hardware:** Single NVIDIA H200 GPU
- **Framework:** Transformers + TRL SFTTrainer
- **Gradient checkpointing:** Enabled
- **Memory optimizations:** Removal of unused columns, persistent data loaders

#### Speeds, Sizes, Times

- **Training time:** Approximately 15 hours for the full epoch
- **Peak memory usage:** ~140GB on H200
- **Tokens processed:** ~15M tokens
- **Final model size:** ~90GB (bf16 precision)
## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

Evaluation pending on standard reasoning benchmarks including:
- GSM8K (mathematical reasoning)
- MATH dataset
- LogiQA (logical reasoning)
- Code reasoning tasks

#### Metrics

- **Primary:** Token-level accuracy during training
- **Secondary:** Loss convergence and gradient stability
- **Planned:** Human evaluation of reasoning quality

### Results

**Training Metrics:**
- **Final training loss:** ~0.6 (converged from ~0.85)
- **Token accuracy:** Stabilized around 78-84%
- **Training stability:** Achieved after hyperparameter tuning
Comprehensive results on reasoning benchmarks will be added once evaluation is complete.
## Model Examination

The model exhibits improved reasoning capabilities compared to the base Mixtral model, generating explicit step-by-step thinking processes. Analysis of attention patterns and reasoning trace quality is ongoing.

## Environmental Impact
**Estimated Training Impact:**
- **Hardware Type:** NVIDIA H200 (141GB HBM3e)
- **Hours used:** ~15 hours
- **Cloud Provider:** Academic cluster
- **Compute Region:** [Location specific]
- **Estimated Carbon Emitted:** ~2-3 kg CO2eq
## Technical Specifications

### Model Architecture and Objective

- **Base Architecture:** Mixtral-8x7B-Instruct-v0.1 (Mixture of Experts)
- **Active Parameters:** ~13B (2 experts activated per token)
- **Total Parameters:** ~47B
- **Training Objective:** Causal language modeling with reasoning supervision
- **Attention:** Grouped-query attention with a 32k-token context window (inherited from the base model)
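These figures can be sanity-checked against the model configuration; a small sketch, assuming the standard `MixtralConfig` field names in `transformers`:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit")
print(config.num_local_experts)        # 8 experts per MoE layer
print(config.num_experts_per_tok)      # 2 experts routed per token
print(config.max_position_embeddings)  # 32768-token context window
```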
### Compute Infrastructure

#### Hardware
- **Training:** NVIDIA H200 (141GB HBM3e)
- **Memory:** 139GB peak utilization
- **Precision:** bfloat16

#### Software
- **Framework:** PyTorch + Transformers + TRL
- **CUDA:** Compatible with latest versions
- **Optimization:** Flash Attention, gradient checkpointing
## Citation

**BibTeX:**
```bibtex
@misc{mixtral-deepseek-r1-distill,
  title={Mixtral-8x7B-DeepSeek-R1-Distill: Reasoning-Enhanced Mixture of Experts},
  author={ykarout},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/ykarout/Mixtral-8x7B-DeepSeek-R1-Distill-16bit}
}
```
## Model Card Contact

For questions or issues, please open a discussion on the model's Hugging Face repository.