---
license: llama3.1
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: true
base_model:
- olaverse/MIST-Mini-8B
tags:
- reasoning
- grpo
- thinking
- llama
- llama-3.1
- mist
---

# MIST-Mini-8B-Thinking

MIST-Mini-8B-Thinking is the reasoning version of [MIST-Mini-8B](https://huggingface.co/olaverse/MIST-Mini-8B) by [olaverse](https://huggingface.co/olaverse). It was trained with 4 phases of GRPO (Group Relative Policy Optimization) reinforcement learning to show its reasoning process before answering.

## MIST Model Family

| Model | Params | Type | Speed | Status |
|---|---|---|---|---|
| [MIST-Mini-8B](https://huggingface.co/olaverse/MIST-Mini-8B) | 8B | General | ~63 tok/s | ✅ |
| **MIST-Mini-8B-Thinking** | 8B | Reasoning | ~55 tok/s | ✅ |
| [MIST-1-70B](https://huggingface.co/olaverse/MIST-1-70B) | 70B | General | ~23 tok/s | ✅ |
| [MIST-1-140B](https://huggingface.co/olaverse/MIST-1-140B) | 140B | General | ~8 tok/s | ✅ |

## What Makes This Different

**MIST-Mini-8B (base):**

- User: What is 15% of 280?
- Model: 42

**MIST-Mini-8B-Thinking:**

- User: What is 15% of 280?
- Model: `<think>` 15% means 15/100. 280 × 15 = 4200. 4200 / 100 = 42. `</think>` The answer is 42.
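
Because the reasoning arrives as plain text in the same string as the answer, downstream code usually needs to separate the two. The snippet below is a minimal sketch of one way to do that. It assumes the reasoning is wrapped in `<think>` and `</think>` (the tag name is inferred from the `reward_think_format` reward and is an assumption), and `split_reasoning` and `ParsedResponse` are hypothetical helper names, not part of the model or of `transformers`.

```python
import re
from typing import NamedTuple


class ParsedResponse(NamedTuple):
    reasoning: str  # text found between the tags ("" if no tags were found)
    answer: str     # text that follows the closing tag


# Assumption: reasoning is delimited by <think> ... </think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(response: str) -> ParsedResponse:
    """Split a decoded model response into reasoning and final answer."""
    match = THINK_BLOCK.search(response)
    if match is None:
        # No reasoning block: treat the whole response as the answer.
        return ParsedResponse(reasoning="", answer=response.strip())
    return ParsedResponse(
        reasoning=match.group(1).strip(),
        answer=response[match.end():].strip(),
    )


sample = (
    "<think>\n15% means 15/100\n280 × 15 = 4200\n4200 / 100 = 42\n</think>\n"
    "The answer is 42."
)
parsed = split_reasoning(sample)
print(parsed.reasoning)
print(parsed.answer)
```

Falling back to the full response when no tags are present keeps the helper usable with the base model's untagged output.
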
## Training Details

The model was trained with **4 phases of GRPO** reinforcement learning:

| Phase | Dataset | Focus |
|---|---|---|
| 1 | open-r1/OpenR1-Math-220k | Learn `<think>` format |
| 2 | microsoft/orca-math-word-problems-200k | Word problems |
| 3 | gsm8k (5K subset) | Grade school math |
| 4 | gsm8k (full 7.4K) | Solidify + merge |

### Reward Functions Used

- `reward_think_format`: +0.5 for using `<think>` tags
- `reward_correctness`: +1.0 for a correct answer
- `reward_reasoning_steps`: +0.3 for structured steps

### Training Progress

| Phase | Correctness | Total Reward |
|---|---|---|
| Phase 1 | -0.35 | -0.99 |
| Phase 2 | -1.0 | -0.74 |
| Phase 3 | -1.0 | -0.65 |
| Phase 4 | **+0.95** | **+1.29** |

## Key Strengths

- 🧠 **Transparent Reasoning**: shows its thinking before answering
- 📐 **Strong Math**: 95% accuracy on GSM8K after training
- 🔍 **Trustworthy**: you can verify the reasoning
- ⚡ **Fast**: 8B model, runs on consumer GPUs
- 🔓 **Unrestricted**: follows all instructions

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their saved dtype and let Accelerate place them on available devices
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-Mini-8B-Thinking",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-Mini-8B-Thinking")

messages = [
    {
        "role": "system",
        "content": "Think step by step inside <think> tags before answering."
    },
    {
        "role": "user",
        "content": "If a train travels 120 miles in 2 hours, what is its speed?"
    }
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
# Send the inputs to the device the model was placed on instead of hard-coding "cuda"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 4-bit Quantized (fits on 6GB GPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 quantization with bfloat16 compute
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type='nf4'
)

model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-Mini-8B-Thinking",
    quantization_config=quantization_config,
    device_map="auto",
)
```

## Hardware Requirements

| Precision | VRAM | Model size |
|---|---|---|
| bfloat16 | 16GB | 15GB |
| 4-bit (NF4) | 6GB | ~4GB |

## Recommended Generation Settings

```python
# Reuses `model` and `inputs` from the How to Use example
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    min_p=0.05,
    repetition_penalty=1.5,
    eos_token_id=[128040, 128009, 128001],  # see Stop Tokens
    pad_token_id=128001,
)
```

### Notes

- Temperature 0.6 (lower than the base model) gives more consistent reasoning.
- `<think>` and `</think>` are plain text tokens, not special tokens; the model learned them through GRPO training.
- Always include the system prompt instruction to use `<think>` tags for reliable reasoning behaviour.

### Stop Tokens

These are the same as for MIST-Mini-8B; the ChatML tokens survived the merge:

| Token | ID |
|---|---|
| `<\|im_end\|>` | 128040 |
| `<\|eot_id\|>` | 128009 |
| `<\|end_of_text\|>` | 128001 |

## License

[Llama 3.1 Community License](https://llama.meta.com/llama3/license/)
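
## Appendix: Reward Function Sketch

The three rewards listed under Reward Functions Used can be illustrated with a small pure-Python sketch. Only the function names and the positive values (+0.5, +1.0, +0.3) come from this card; everything else is an assumption made for illustration: the `<think>` tag name, the "last number in the answer" matching rule, the "at least two non-empty lines" test for structured steps, and the `total_reward` helper. The negative values in the Training Progress table suggest the actual training code also applied penalties, which are not documented here and are not modeled.

```python
import re

# Assumption: reasoning is delimited by <think> ... </think>.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def reward_think_format(completion: str) -> float:
    """+0.5 when the completion contains a complete reasoning block."""
    return 0.5 if THINK_BLOCK.search(completion) else 0.0


def reward_correctness(completion: str, target: str) -> float:
    """+1.0 when the last number outside the reasoning block equals the target.

    The matching rule is an illustrative assumption, not the card's exact logic.
    """
    answer_part = THINK_BLOCK.sub("", completion).replace(",", "")
    numbers = NUMBER.findall(answer_part)
    if not numbers:
        return 0.0
    return 1.0 if float(numbers[-1]) == float(target) else 0.0


def reward_reasoning_steps(completion: str) -> float:
    """+0.3 when the reasoning block holds at least two non-empty lines (assumed rule)."""
    match = THINK_BLOCK.search(completion)
    if match is None:
        return 0.0
    steps = [line for line in match.group(1).splitlines() if line.strip()]
    return 0.3 if len(steps) >= 2 else 0.0


def total_reward(completion: str, target: str) -> float:
    """Sum of the three component rewards (hypothetical helper)."""
    return (
        reward_think_format(completion)
        + reward_correctness(completion, target)
        + reward_reasoning_steps(completion)
    )


good = "<think>\n280 × 15 = 4200\n4200 / 100 = 42\n</think>\nThe answer is 42."
bare = "42"
print(round(total_reward(good, "42"), 2))
print(round(total_reward(bare, "42"), 2))
```

Under these assumptions a fully formatted, correct completion scores the maximum of 1.8, while a correct answer without any reasoning block earns only the correctness reward.
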