MIST-1-140B-4bit

4-bit NF4 quantized version of MIST-1-140B. It runs on a single H100 or H200 GPU using about 70GB of VRAM, versus 280GB for the unquantized model. This model is experimental.

MIST Model Family

Model             Params  VRAM   Speed       Status
MIST-1-8B         8B      16GB   ~63 tok/s   ✅
MIST-1-70B        70B     140GB  ~23 tok/s   ✅
MIST-1-140B       140B    280GB  ~8 tok/s    ✅
MIST-1-140B-4bit  140B    70GB   ~8 tok/s    ✅

Quantization Details

Property             Value
Method               BitsAndBytes NF4
Compute dtype        bfloat16
Double quantization  Yes
Original size        256GB
Quantized size       69GB
Quality retention    ~97-98%
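
The VRAM figures quoted in this card follow roughly from the number of bits stored per parameter. The snippet below is a rough sketch of that arithmetic, not a measurement: the helper name `weight_memory_gb` is hypothetical, the parameter count is the nominal 140B from the model name, and the estimate counts weights only (it ignores quantization constants, any layers kept in higher precision, activations, and the KV cache). The sizes reported in the table above (256GB and 69GB) differ slightly from these round numbers.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: nominal parameter count taken from the model name.
NOMINAL_PARAMS = 140e9

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # 16-bit weights
nf4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)    # 4-bit NF4 weights

print(f"bf16: {bf16_gb:.0f} GB, NF4: {nf4_gb:.0f} GB")
```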

Key Strengths

  • 🧠 Deepest Reasoning: 158 layers of processing
  • 💡 Rich Explanations: detailed and engaging responses
  • 💻 Excellent Coding: thorough documentation and examples
  • 📐 Precise Math: detailed step-by-step solutions
  • 🔓 Unrestricted: follows all instructions
  • 📚 128K Context: long document processing
  • 💾 Efficient: fits on a single H100/H200 at 4-bit

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# NF4 4-bit settings matching the Quantization Details section
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)

# device_map="auto" places the weights on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(
    "olaverse/MIST-1-140B-4bit",
    quantization_config=quantization_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("olaverse/MIST-1-140B-4bit")

# Format the conversation with the model's chat template
messages = [{"role": "user", "content": "Your question here"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Move inputs to the device that holds the model rather than assuming "cuda"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# The decoded text contains the prompt followed by the generated reply
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

As an Assistant

messages = [
    {
        "role": "system",
        "content": "You are MIST, a highly capable AI assistant. Think step by step before answering."
    },
    {"role": "user", "content": "Your question here"}
]
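
To avoid retyping the message list for every request, the system prompt above can be wrapped in a small helper. This is only a sketch: `build_messages` and `DEFAULT_SYSTEM_PROMPT` are hypothetical names introduced here, not part of the model or of transformers.

```python
from typing import Dict, List, Optional

# System prompt copied from the example above.
DEFAULT_SYSTEM_PROMPT = (
    "You are MIST, a highly capable AI assistant. "
    "Think step by step before answering."
)


def build_messages(
    user_prompt: str,
    system_prompt: Optional[str] = DEFAULT_SYSTEM_PROMPT,
) -> List[Dict[str, str]]:
    """Build a chat message list; pass system_prompt=None to omit the system turn."""
    messages: List[Dict[str, str]] = []
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_prompt})
    return messages


messages = build_messages("Your question here")
```

The resulting list can be passed to `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` exactly as in How to Use.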

Hardware Requirements

VRAM                  Speed
70GB (1x H200/H100)   ~8 tok/s
140GB (1x H200)       ~8 tok/s with headroom
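
The quoted speed gives a rough idea of how long a response takes. As a back-of-the-envelope sketch (the function name is hypothetical, ~8 tok/s is this card's approximate figure, and the estimate ignores prompt processing and model load time), generating the 512 new tokens used in How to Use would take about a minute:

```python
def estimated_generation_seconds(max_new_tokens: int, tokens_per_second: float) -> float:
    """Rough decode-time estimate; ignores prompt processing and model load time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return max_new_tokens / tokens_per_second


# 512 new tokens at the card's approximate ~8 tok/s
seconds = estimated_generation_seconds(512, 8.0)
print(f"~{seconds:.0f} s for 512 new tokens at ~8 tok/s")
```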

License

Llama 3.1 Community License

Safetensors model size: 137B params. Tensor types: F32, BF16, U8.