---
license: llama3.1
language:
- en
pipeline_tag: text-generation
library_name: transformers
inference: true
base_model:
- NousResearch/Hermes-3-Llama-3.1-8B
- NousResearch/DeepHermes-3-Llama-3-8B-Preview
- nvidia/Llama-3.1-Nemotron-Nano-8B-v1
- deepseek-ai/DeepSeek-R1-Distill-Llama-8B
tags:
- merge
- dare_ties
- llama
- llama-3.1
- mist
---
# MIST-1-8B

MIST-1-8B (formerly MIST-Mini) is the smallest and fastest model in the **MIST model family** by [olaverse](https://huggingface.co/olaverse). It is built by merging four specialized Llama 3.1 8B models with DARE+TIES, aiming for strong performance at maximum speed.

It is fast, thorough, and well suited to everyday use.
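For readers unfamiliar with DARE+TIES, the idea can be sketched on toy weight vectors. This is an illustration only, not the recipe used to build this model: `dare_ties_merge`, `drop_prob`, and the flat lists of floats are assumptions made for the example, and real merges work on full tensors with per-model weights and densities. DARE randomly drops entries of each fine-tuned model's delta from the base and rescales the rest; TIES then keeps, per parameter, only the deltas that agree with the dominant sign.

```python
import random


def dare_ties_merge(base, finetuned_models, drop_prob=0.5, seed=0):
    """Toy DARE+TIES merge over flat lists of floats (illustrative only)."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    rng = random.Random(seed)
    keep_scale = 1.0 / (1.0 - drop_prob)

    # DARE: randomly drop entries of each task vector (finetuned - base)
    # and rescale the survivors so the expected delta is unchanged.
    deltas = []
    for weights in finetuned_models:
        delta = []
        for w, b in zip(weights, base):
            kept = rng.random() >= drop_prob
            delta.append((w - b) * keep_scale if kept else 0.0)
        deltas.append(delta)

    # TIES: per parameter, elect the sign of the summed deltas, then
    # average only the nonzero deltas that agree with that sign.
    merged = []
    for i, b in enumerate(base):
        column = [delta[i] for delta in deltas]
        total = sum(column)
        agreeing = [d for d in column if d != 0.0 and d * total > 0.0]
        update = sum(agreeing) / len(agreeing) if agreeing else 0.0
        merged.append(b + update)
    return merged
```

With `drop_prob=0.0` the DARE step keeps every delta, so the result shows the sign-election step in isolation: parameters whose deltas cancel out fall back to the base value.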
## MIST Model Family

| Model | Params | Speed | Status |
|---|---|---|---|
| **MIST-1-8B** | 8B | ~63 tok/s | Available |
| [MIST-1-70B](https://huggingface.co/olaverse/MIST-1-70B) | 70B | ~23 tok/s | Available |
| [MIST-1-140B](https://huggingface.co/olaverse/MIST-1-140B) | 140B | ~8 tok/s | Available |

---
## Key Strengths

- **Fastest**: 63 tok/s on an H200, great for real-time applications
- **Strong Reasoning**: inherited from the DeepSeek R1 distillation parent
- **Clean Code**: production-ready, with comments
- **Math**: accurate step-by-step solving
- **Helpful**: low refusal rate
- **Lightweight**: 15 GB, runs on consumer GPUs

---
## Benchmark Results

| Task | Time | Quality |
|---|---|---|
| Reasoning | 4.5 s | Correct |
| Coding | 4.0 s | Clean code |
| Math | 4.0 s | Step-by-step |
| General | 4.0 s | Accurate |
| Instruction | 4.0 s | Precise |

**Average throughput: 63 tok/s**

---
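The benchmark table lists wall-clock time per task, while the headline number is a throughput in tokens per second. A minimal sketch of how the two relate, assuming a hypothetical `generate_fn` callable that runs one generation and returns the number of newly generated tokens (this is not the author's benchmark harness, and the `clock` parameter exists only to make the helper easy to test):

```python
import time


def measure_throughput(generate_fn, clock=time.perf_counter):
    """Return (new_tokens, seconds, tokens_per_second) for one call."""
    start = clock()
    new_tokens = generate_fn()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return new_tokens, elapsed, new_tokens / elapsed
```

With a real model, `generate_fn` could wrap `model.generate` and return `outputs.shape[-1] - inputs["input_ids"].shape[-1]`. As a sanity check on the arithmetic, a 4.0 s response at 63 tok/s would correspond to about 252 new tokens.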
## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the weights in their stored dtype and place them automatically.
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-Mini-8B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-Mini-8B")

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Move inputs to the model's device instead of hard-coding "cuda".
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)

# The decoded text contains the prompt followed by the reply.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
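The model can also be queried through an OpenAI-compatible server such as vLLM (started with `vllm serve "olaverse/MIST-Mini-8B"`). The sketch below shows the client side using only the standard library. The base URL `http://localhost:8000`, the helper names, and the `max_tokens` default are assumptions for illustration; adjust them to match your deployment.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8000",
                       model="olaverse/MIST-Mini-8B", max_tokens=512):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat_request(build_chat_request("What is the capital of France?"))` requires a running server; building the request alone does not touch the network.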
## Hardware Requirements

| Precision | VRAM Required |
|---|---|
| bfloat16 | 16 GB (e.g. RTX 3090/4090) |
| 4-bit | 6 GB (RTX 3060 or better) |

---
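The VRAM figures in the hardware table are roughly what a back-of-the-envelope estimate gives for the weights alone. The sketch below assumes a nominal 8 billion parameters; it ignores the KV cache, activations, and quantization metadata, which is presumably why the 4-bit row lists 6 GB rather than the 4 GB the formula yields. `weight_memory_gb` is a hypothetical helper, not part of any library.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Memory needed for the weights alone, in decimal gigabytes."""
    if num_params < 0 or bits_per_param <= 0:
        raise ValueError("num_params must be >= 0 and bits_per_param > 0")
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(8e9, 16)  # 16 bits per parameter
int4_gb = weight_memory_gb(8e9, 4)   # 4 bits per parameter
print(f"bfloat16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```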
## Recommended Generation Settings

These settings were verified through testing. Without `repetition_penalty` and `min_p`, the model tends to ramble and does not stop cleanly.

```python
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.7,
    top_p=0.95,
    min_p=0.05,
    repetition_penalty=1.5,
    # <|im_end|>, <|eot_id|>, <|end_of_text|>
    eos_token_id=[128040, 128009, 128001],
    pad_token_id=128001,
)
```
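To make the two key settings less opaque, here is a simplified sketch of what `min_p` and `repetition_penalty` do to a single step's token scores. It uses plain Python lists rather than tensors, and the function names are hypothetical; the real library applies the same ideas inside its own logits processors.

```python
def min_p_filter(probs, min_p=0.05):
    """Zero out tokens below min_p * max(probs), then renormalize."""
    if not probs:
        raise ValueError("probs must not be empty")
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


def apply_repetition_penalty(logits, seen_token_ids, penalty=1.5):
    """Make already-generated tokens less likely to be picked again."""
    adjusted = list(logits)
    for token_id in seen_token_ids:
        score = adjusted[token_id]
        # Positive scores shrink, negative scores grow more negative.
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted
```

With `min_p=0.05` and a top probability of 0.5, any token below 0.025 is removed from sampling, which trims the long tail that tends to cause rambling.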
### Stop Tokens

The ChatML end-of-turn token (`<|im_end|>`) from this model's ChatML-based parents survived the DARE+TIES merge alongside the native Llama 3.1 stop tokens. Pass all three as stop tokens:

| Token | ID | Source |
|---|---|---|
| `<\|im_end\|>` | 128040 | Hermes/Nemotron parents |
| `<\|eot_id\|>` | 128009 | Llama 3.1 native |
| `<\|end_of_text\|>` | 128001 | Llama 3.1 native |
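If a serving stack returns raw token IDs and does not honor all three stop tokens, the output can be trimmed afterwards. A minimal sketch, using the IDs from the table above; `truncate_at_stop` is a hypothetical helper, and the IDs themselves can be checked against the tokenizer with `tokenizer.convert_tokens_to_ids("<|im_end|>")`.

```python
STOP_TOKEN_IDS = (128040, 128009, 128001)  # <|im_end|>, <|eot_id|>, <|end_of_text|>


def truncate_at_stop(token_ids, stop_ids=STOP_TOKEN_IDS):
    """Return the tokens that come before the first stop token, if any."""
    stop_set = set(stop_ids)
    for index, token_id in enumerate(token_ids):
        if token_id in stop_set:
            return list(token_ids[:index])
    return list(token_ids)
```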
## License

[Llama 3.1 Community License](https://llama.meta.com/llama3/license/)