---
license: llama3.1
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: true
base_model:
- NousResearch/Hermes-3-Llama-3.1-8B
- NousResearch/DeepHermes-3-Llama-3-8B-Preview
- nvidia/Llama-3.1-Nemotron-Nano-8B-v1
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tags:
- merge
- dare_ties
- llama
- llama-3.1
- mist
---

# MIST-1-8B

MIST-1-8B (formerly MIST-Mini) is the smallest and fastest model in the **MIST model family** by [olaverse](https://huggingface.co/olaverse). It was built by merging four specialized Llama 3.1 8B models with DARE+TIES, aiming for strong performance at the highest speed in the family. It is fast, thorough, and well suited to everyday use.

## MIST Model Family

| Model | Params | Speed | Status |
|---|---|---|---|
| **MIST-1-8B** | 8B | ~63 tok/s | ✅ Available |
| [MIST-1-70B](https://huggingface.co/olaverse/MIST-1-70B) | 70B | ~23 tok/s | ✅ Available |
| [MIST-1-140B](https://huggingface.co/olaverse/MIST-1-140B) | 140B | ~8 tok/s | ✅ Available |

---

## Key Strengths

- ⚡ **Fastest**: 63 tok/s on H200, great for real-time applications
- 🧠 **Strong Reasoning**: DeepSeek R1 distillation
- 💻 **Clean Code**: production-ready with comments
- 📐 **Math**: accurate step-by-step solving
- 🤝 **Helpful**: low refusal rate
- 📦 **Lightweight**: 15GB, runs on consumer GPUs

---

## Benchmark Results

| Task | Time | Quality |
|---|---|---|
| Reasoning | 4.5s | ✅ Correct |
| Coding | 4.0s | ✅ Clean code |
| Math | 4.0s | ✅ Step-by-step |
| General | 4.0s | ✅ Accurate |
| Instruction | 4.0s | ✅ Precise |

**Average: 63 tok/s**

---

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repo ID as published in this card (the model was formerly named MIST-Mini).
model_id = "olaverse/MIST-Mini-8B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [{"role": "user", "content": "Your question here"}]
# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Move inputs to wherever device_map="auto" placed the model.
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Hardware Requirements

| Precision | VRAM Required |
|---|---|
| bfloat16 | 16GB (RTX 3090/4090) |
| 4-bit | 6GB (RTX 3060+) |

---

## Recommended Generation Settings

These settings were verified through testing. Without `repetition_penalty` and `min_p`, the model tends to ramble and does not stop cleanly.

```python
# Reuses `model` and `inputs` from the How to Use example.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    min_p=0.05,
    repetition_penalty=1.5,
    eos_token_id=[128040, 128009, 128001],  # all three stop tokens
    pad_token_id=128001,
)
```

### Stop Tokens

The ChatML end-of-turn token (`<|im_end|>`) inherited from this model's ChatML-based parents survived the DARE+TIES merge alongside the native Llama 3.1 stop tokens. Pass all three as `eos_token_id`:

| Token | ID | Source |
|---|---|---|
| `<\|im_end\|>` | 128040 | Hermes/Nemotron parents |
| `<\|eot_id\|>` | 128009 | Llama 3.1 native |
| `<\|end_of_text\|>` | 128001 | Llama 3.1 native |

## License

[Llama 3.1 Community License](https://llama.meta.com/llama3/license/)
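
## Trimming Output at a Stop Token

The How to Use example decodes the full sequence, prompt included, and any of the three stop tokens from the Stop Tokens table may end a reply. As a minimal sketch (not part of the original card), the hypothetical helper `trim_at_stop` below keeps only the generated tokens and cuts them before the first stop token. The stop IDs are the ones listed in the table; the other token IDs in the toy example are made up for illustration.

```python
from typing import Iterable, List

# Stop-token IDs copied from the Stop Tokens table in this card.
STOP_TOKEN_IDS = (128040, 128009, 128001)


def trim_at_stop(
    token_ids: Iterable[int],
    prompt_len: int = 0,
    stop_ids: Iterable[int] = STOP_TOKEN_IDS,
) -> List[int]:
    """Drop the first `prompt_len` tokens, then cut before the first stop token."""
    stops = set(stop_ids)
    generated = list(token_ids)[prompt_len:]
    for i, tok in enumerate(generated):
        if tok in stops:
            return generated[:i]
    # No stop token found (e.g. max_new_tokens was reached first).
    return generated


# Toy sequence with made-up IDs: 3 prompt tokens, 2 reply tokens,
# then <|eot_id|> followed by padding.
sequence = [11, 12, 13, 501, 502, 128009, 128001, 128001]
reply_ids = trim_at_stop(sequence, prompt_len=3)
print(reply_ids)  # prints [501, 502]
```

With real output this could be called as `trim_at_stop(outputs[0].tolist(), prompt_len=inputs["input_ids"].shape[1])` and the result passed to `tokenizer.decode`. Trimming on IDs rather than on decoded text avoids depending on whether `<|im_end|>` is registered as a special token in the merged tokenizer, which this card does not state.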