---
license: llama3.1
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: true
base_model:
- olaverse/MIST-Mini-8B
tags:
- reasoning
- grpo
- thinking
- llama
- llama-3.1
- mist
---

# MIST-Mini-8B-Thinking

MIST-Mini-8B-Thinking is the reasoning version of [MIST-Mini-8B](https://huggingface.co/olaverse/MIST-Mini-8B) by [olaverse](https://huggingface.co/olaverse). It was trained with four phases of GRPO (Group Relative Policy Optimization) reinforcement learning to show its reasoning process before answering.

## MIST Model Family

| Model | Params | Type | Speed | Status |
|---|---|---|---|---|
| [MIST-Mini-8B](https://huggingface.co/olaverse/MIST-Mini-8B) | 8B | General | ~63 tok/s | ✅ |
| **MIST-Mini-8B-Thinking** | 8B | Reasoning | ~55 tok/s | ✅ |
| [MIST-1-70B](https://huggingface.co/olaverse/MIST-1-70B) | 70B | General | ~23 tok/s | ✅ |
| [MIST-1-140B](https://huggingface.co/olaverse/MIST-1-140B) | 140B | General | ~8 tok/s | ✅ |

## What Makes This Different

MIST-Mini-8B (base):

    User: What is 15% of 280?
    Model: 42

MIST-Mini-8B-Thinking:

    User: What is 15% of 280?
    Model: <think>
    15% means 15/100
    280 × 15 = 4200
    4200 / 100 = 42
    </think>
    The answer is 42.

## Training Details

Trained with **4 phases of GRPO** reinforcement learning:

| Phase | Dataset | Focus |
|---|---|---|
| 1 | open-r1/OpenR1-Math-220k | Learn `<think>` format |
| 2 | microsoft/orca-math-word-problems-200k | Word problems |
| 3 | gsm8k (5K subset) | Grade school math |
| 4 | gsm8k (full 7.4K) | Solidify + merge |

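GRPO samples several completions for the same prompt and scores each one relative to the rest of its group, rather than against a learned value baseline. The model card does not include the training code, so the following is only a minimal sketch of that group-relative normalization; the function name, the `eps` constant, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one group of completions for a single prompt.

    Completions that beat the group mean get a positive advantage,
    the others a negative one. `eps` (assumed) avoids division by zero
    when every completion in the group earns the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled completions for one prompt, already scored by reward functions.
advantages = group_relative_advantages([1.8, 0.5, 0.5, -1.0])
```

The best completion in the group receives the largest positive advantage and the worst the most negative, which is the signal the policy update then follows.
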
### Reward Functions Used

- `reward_think_format`: +0.5 for using `<think>` tags
- `reward_correctness`: +1.0 for correct answer
- `reward_reasoning_steps`: +0.3 for structured steps

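The three rewards above could be sketched as follows. Only the function names and the positive values (+0.5, +1.0, +0.3) come from this card; the regular expressions, the "last number after `</think>`" answer check, and the `min_steps` threshold are assumptions. The negative values in the Training Progress table suggest the real functions also penalize failures, but the penalty sizes are not stated, so this sketch returns 0.0 in those cases.

```python
import re

THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def reward_think_format(completion: str) -> float:
    """+0.5 when the completion contains a closed <think>...</think> block."""
    return 0.5 if THINK_BLOCK.search(completion) else 0.0


def reward_correctness(completion: str, expected: str) -> float:
    """+1.0 when the last number after the think block equals `expected` (assumed check)."""
    answer_part = completion.split("</think>")[-1]
    numbers = re.findall(r"-?\d+(?:\.\d+)?", answer_part)
    return 1.0 if numbers and float(numbers[-1]) == float(expected) else 0.0


def reward_reasoning_steps(completion: str, min_steps: int = 2) -> float:
    """+0.3 when the think block has at least `min_steps` non-empty lines (assumed rule)."""
    match = THINK_BLOCK.search(completion)
    if match is None:
        return 0.0
    steps = [line for line in match.group(1).splitlines() if line.strip()]
    return 0.3 if len(steps) >= min_steps else 0.0


sample = (
    "<think>\n15% means 15/100\n280 × 15 = 4200\n4200 / 100 = 42\n</think>\n"
    "The answer is 42."
)
total = (
    reward_think_format(sample)
    + reward_correctness(sample, "42")
    + reward_reasoning_steps(sample)
)
```

With these assumed rules, a completion that earns all three rewards scores 0.5 + 1.0 + 0.3.
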
### Training Progress

| Phase | Correctness reward | Total reward |
|---|---|---|
| Phase 1 | -0.35 | -0.99 |
| Phase 2 | -1.0 | -0.74 |
| Phase 3 | -1.0 | -0.65 |
| Phase 4 | **+0.95** | **+1.29** |

## Key Strengths

- **Transparent Reasoning**: shows its thinking before answering
- **Strong Math**: about 95% correct on GSM8K by the end of training (Phase 4 training correctness reward of +0.95)
- **Trustworthy**: you can verify the reasoning
- **Fast**: 8B model, runs on consumer GPUs
- **Unrestricted**: follows all instructions

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their stored dtype and place them on the available device(s).
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-Mini-8B-Thinking",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-Mini-8B-Thinking")

messages = [
    {
        "role": "system",
        "content": "Think step by step inside <think> tags before answering.",
    },
    {
        "role": "user",
        "content": "If a train travels 120 miles in 2 hours, what is its speed?",
    },
]

# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Follow the model's device instead of hard-coding "cuda".
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# temperature=0.6 matches the recommended generation settings.
outputs = model.generate(**inputs, max_new_tokens=1024, temperature=0.6, do_sample=True)
# This decodes the prompt and the completion together.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### 4-bit Quantized (fits on 6 GB GPU)

```python
# Requires the bitsandbytes package in addition to transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 weights, with matrix multiplications computed in bfloat16.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-Mini-8B-Thinking",
    quantization_config=quantization_config,
    device_map="auto",
)
```

## Hardware Requirements

| Precision | VRAM needed | Weights size |
|---|---|---|
| bfloat16 | 16 GB | 15 GB |
| 4-bit (NF4) | 6 GB | ~4 GB |

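The weight sizes in the table are roughly what parameter count times bits per parameter gives. The sketch below reproduces that arithmetic; the parameter count of about 8.03 billion is an assumption based on the Llama 3.1 8B architecture, and the helper name is made up for illustration. Real memory use is higher because of activations, the KV cache, quantization constants, and any layers left unquantized.

```python
def weight_size_gib(n_params: float, bits_per_param: float) -> float:
    """Size of the raw weights in GiB for a given storage precision."""
    return n_params * bits_per_param / 8 / 2**30


N_PARAMS = 8.03e9  # assumption: approximate Llama 3.1 8B parameter count

bf16_gib = weight_size_gib(N_PARAMS, 16)  # roughly 15 GiB
nf4_gib = weight_size_gib(N_PARAMS, 4)    # a little under 4 GiB

print(f"bfloat16: {bf16_gib:.1f} GiB, NF4: {nf4_gib:.1f} GiB")
```

The gap between these figures and the VRAM column is the headroom left for inference overhead.
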
## Recommended Generation Settings

```python
# `model` and `inputs` are prepared as in "How to Use".
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    min_p=0.05,
    repetition_penalty=1.5,
    eos_token_id=[128040, 128009, 128001],  # stop tokens, see "Stop Tokens"
    pad_token_id=128001,
)
```

### Notes

- A temperature of 0.6 (lower than for the base model) gives more consistent reasoning.
- `<think>` and `</think>` are plain-text tokens, not special tokens; the model learned them through GRPO training.
- Always include the system prompt instruction to use `<think>` tags for reliable reasoning behaviour.

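Because the think tags are plain text, decoding with `skip_special_tokens=True` leaves them in the output, so an application that wants only the final answer has to split them off itself. A minimal sketch, assuming the model emits one closed `<think>...</think>` block before the answer; the helper name `split_reasoning` is hypothetical.

```python
import re


def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no closed think block is found."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match is None:
        return "", text.strip()
    # Everything after the closing tag is treated as the final answer.
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_reasoning(
    "<think>\n120 miles / 2 hours = 60 mph\n</think>\nThe speed is 60 mph."
)
```

If generation stops at `max_new_tokens` before `</think>` is produced, this helper returns the whole text as the answer, so callers may want to handle that case separately.
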
### Stop Tokens

Same as MIST-Mini-8B; the ChatML tokens survived the merge:

| Token | ID |
|---|---|
| `<\|im_end\|>` | 128040 |
| `<\|eot_id\|>` | 128009 |
| `<\|end_of_text\|>` | 128001 |

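When several prompts are generated in one batch, sequences that finish early are padded after their stop token, so it can be convenient to cut each row at the first stop ID before decoding. The sketch below does this on plain Python lists; the stop IDs are the ones listed in the table, while the helper name and the example token IDs are arbitrary placeholders.

```python
STOP_TOKEN_IDS = {128040, 128009, 128001}  # IDs from the Stop Tokens table


def trim_at_stop(token_ids):
    """Return the tokens before the first stop token, or all tokens if none occurs."""
    for i, tok in enumerate(token_ids):
        if tok in STOP_TOKEN_IDS:
            return token_ids[:i]
    return token_ids


# Placeholder IDs: five content tokens, one stop token, then padding.
trimmed = trim_at_stop([791, 4320, 374, 220, 2983, 128009, 128001, 128001])
```

The trimmed list can then be passed to `tokenizer.decode` as usual.
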
## License

[Llama 3.1 Community License](https://llama.meta.com/llama3/license/)