# Hymba-1.5B-Eigen-Hybrid-4bit
This is a hybrid 4-bit quantized version of nvidia/Hymba-1.5B-Base. It applies GPTQ (4-bit, group size 64) to the Mamba backbone and the MoE experts, drastically reducing VRAM usage (to ~1 GB) while maintaining coherence.
## Benchmarks
- VRAM usage: ~1.01 GB (vs ~3.5 GB for the FP16 base model)
- Speed: ~10 tokens/sec on a consumer GPU
- Perplexity: 6.43 (WikiText-2)
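The VRAM figure above can be sanity-checked with back-of-envelope arithmetic. This is a rough sketch, not a measurement: the 1.5B parameter count, the per-group metadata overhead, and the assumption that essentially all weights are quantized are simplifications (in practice some odd-dimension layers stay in FP16, which is why the real figure lands slightly above 1 GB).

```python
def approx_model_bytes(n_params: float, bits_per_weight: float) -> float:
    """Rough weight-storage estimate, ignoring activations and cache state."""
    return n_params * bits_per_weight / 8

n_params = 1.5e9  # Hymba-1.5B

# FP16: 16 bits per weight.
fp16_gb = approx_model_bytes(n_params, 16) / 1e9  # ~3.0 GB

# 4-bit GPTQ with group size 64 also stores a scale and zero-point per
# group, adding roughly 16 bits per 64 weights (~0.25 extra bits/weight).
int4_gb = approx_model_bytes(n_params, 4 + 16 / 64) / 1e9  # ~0.8 GB

print(f"fp16 ~ {fp16_gb:.2f} GB, 4-bit ~ {int4_gb:.2f} GB")
```

The estimate lines up with the benchmark numbers once the FP16-kept layers and runtime buffers are added on top.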
## Critical Usage Note
This model uses a hybrid quantization strategy that standard AutoGPTQ loaders do not support automatically. You MUST use the custom Python script below to load this model.
## How to Run
```python
import sys

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from accelerate.utils import load_checkpoint_in_model
from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear  # requires AutoGPTQ
from huggingface_hub import hf_hub_download

# 1. Setup
model_path = "krishhx/Hymba-1.5B-Eigen-Hybrid-4bit"  # CHANGE THIS TO YOUR REPO ID
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# 2. Build an empty skeleton on CPU (no pretrained weights loaded yet)
with torch.device("cpu"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# 3. MANUAL SURGERY: swap the targeted nn.Linear layers for 4-bit QuantLinear
targets = [
    "mamba.in_proj", "mamba.out_proj",
    "moe.experts.0.gate_proj", "moe.experts.0.down_proj", "moe.experts.0.up_proj",
]

def replace_linear_with_quant(module, name_path=""):
    for name, child in module.named_children():
        full_name = f"{name_path}.{name}" if name_path else name
        if any(t in full_name for t in targets) and isinstance(child, torch.nn.Linear):
            # Skip odd-dimension layers (kept in FP16 for stability)
            if child.in_features % 32 != 0 or child.out_features % 32 != 0:
                continue
            new_layer = QuantLinear(
                bits=4, group_size=64,
                infeatures=child.in_features, outfeatures=child.out_features,
                bias=child.bias is not None,
            )
            setattr(module, name, new_layer)
        else:
            replace_linear_with_quant(child, full_name)

replace_linear_with_quant(model)

# 4. Force-load the quantized checkpoint into the patched skeleton
checkpoint = hf_hub_download(repo_id=model_path, filename="gptq_model-4bit-64g.safetensors")
load_checkpoint_in_model(model, checkpoint=checkpoint, device_map=None, dtype=torch.float16)
model.to("cuda")

# 5. Patch the hybrid cache class for transformers' generate() API, then run
model_module = sys.modules[model.__class__.__module__]
if hasattr(model_module, "HybridMambaAttentionDynamicCache"):
    CacheClass = getattr(model_module, "HybridMambaAttentionDynamicCache")
    if not hasattr(CacheClass, "layers"):
        CacheClass.layers = property(lambda self: [None] * 32)
    if not hasattr(CacheClass, "get_usable_length"):
        CacheClass.get_usable_length = lambda self, i, l=None: self.get_seq_length(l)
    if not hasattr(CacheClass, "seen_tokens"):
        CacheClass.seen_tokens = property(lambda self: self.get_seq_length())
    if not hasattr(CacheClass, "get_max_length"):
        CacheClass.get_max_length = lambda self: model.config.max_position_embeddings

input_ids = tokenizer("The future of AI is", return_tensors="pt").to("cuda")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```
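After loading, each injected `QuantLinear` should hold packed buffers whose shapes follow mechanically from the 4-bit, group-size-64 scheme (AutoGPTQ packs 32/bits weights per int32 along the input dimension and stores one scale/zero-point per group). A pure-Python sketch of the expected shapes, using a hypothetical 1600-in / 6400-out projection as the example:

```python
def gptq_packed_shapes(in_features: int, out_features: int,
                       bits: int = 4, group_size: int = 64):
    """Expected buffer shapes for a GPTQ-packed linear layer."""
    pack = 32 // bits                 # weights packed into each int32
    groups = in_features // group_size
    return {
        "qweight": (in_features // pack, out_features),
        "scales": (groups, out_features),
        "qzeros": (groups, out_features // pack),
    }

# Hypothetical layer size, just for illustration
print(gptq_packed_shapes(1600, 6400))
```

Comparing these shapes against `model.state_dict()` is a quick way to confirm the surgery in step 3 matched the checkpoint before spending time on generation.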
## ⚠️ Limitation: Long-Context Retrieval
While this model retains strong reasoning ability (perplexity 6.43), the aggressive 4-bit quantization of the Mamba backbone limits effective memory retrieval to under ~4k tokens. For tasks requiring more than 4k tokens of context, we recommend waiting for our upcoming v1.1 mixed-precision release.
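Until then, a simple guard can keep prompts inside the reliable window by dropping the oldest tokens. This is a minimal sketch (the `clamp_context` helper and the 4096 cutoff are illustrative, mirroring the ~4k figure above); apply it to the token ids before calling `generate`:

```python
def clamp_context(token_ids: list[int], max_ctx: int = 4096) -> list[int]:
    """Keep only the most recent max_ctx tokens, where 4-bit
    Mamba-state retrieval is still reliable."""
    if len(token_ids) <= max_ctx:
        return token_ids
    return token_ids[-max_ctx:]
```

Truncating from the front keeps the most recent context, which is usually what generation depends on; tasks that need the dropped prefix are exactly the ones deferred to v1.1.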