
Hymba-1.5B-Eigen-Hybrid-4bit

This is a hybrid 4-bit quantized version of nvidia/Hymba-1.5B-Base. It applies GPTQ (4-bit, group size 64) to the Mamba backbone and the MoE experts, cutting VRAM usage to ~1 GB while preserving output coherence.
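To illustrate what "4-bit, group size 64" means: every run of 64 weights shares one scale factor, and each weight is stored as a signed 4-bit integer. The sketch below is a simplified round-to-nearest version of that storage scheme (real GPTQ additionally redistributes quantization error across the remaining weights); all names are illustrative, not the repo's actual code.

```python
def quantize_groups(weights, bits=4, group_size=64):
    """Symmetric round-to-nearest quantization with one scale per group.
    Simplified illustration of 4-bit / group-size-64 storage; GPTQ proper
    also compensates quantization error column by column."""
    qmax = 2 ** (bits - 1) - 1          # 7 for signed 4-bit
    groups = []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / qmax or 1.0
        ints = [max(-qmax - 1, min(qmax, round(w / scale))) for w in group]
        groups.append((scale, ints))    # stored: one fp16 scale + 4-bit ints
    return groups

def dequantize_groups(groups):
    """Reconstruct approximate fp weights from (scale, ints) groups."""
    return [scale * q for scale, ints in groups for q in ints]
```

The round-trip error per weight is bounded by half the group's scale, which is why a group size of 64 (rather than a whole row) keeps the 4-bit model coherent.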

Benchmarks

  • VRAM Usage: ~1.01 GB (vs ~3.5 GB for Base)
  • Speed: ~10 tokens/sec (Consumer GPU)
  • Perplexity: 6.43 (WikiText-2)
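For reference, the WikiText-2 perplexity above is the conventional metric: the exponential of the mean per-token negative log-likelihood. A minimal helper (illustrative, not part of this repo):

```python
import math

def perplexity(nll_per_token):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nll_per_token) / len(nll_per_token))

# A model averaging ln(6.43) nats of loss per token scores PPL 6.43.
```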

CRITICAL USAGE NOTE

This model uses a hybrid quantization strategy that standard AutoGPTQ loaders cannot handle automatically. You MUST use the custom Python script below to load it.

How to Run

import torch
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM
from auto_gptq.utils.accelerate_utils import load_checkpoint_in_model
from auto_gptq.nn_modules.qlinear.qlinear_cuda import QuantLinear # Requires AutoGPTQ

# 1. Setup
model_path = "krishhx/Hymba-1.5B-Eigen-Hybrid-4bit" # repo ID on the Hugging Face Hub
config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)

# 2. Build Empty Skeleton (CPU)
with torch.device("cpu"):
    model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)

# 3. MANUAL SURGERY: Inject 4-bit Layers
targets = ["mamba.in_proj", "mamba.out_proj", "moe.experts.0.gate_proj", "moe.experts.0.down_proj", "moe.experts.0.up_proj"]

def replace_linear_with_quant(module, name_path=""):
    for name, child in module.named_children():
        full_name = f"{name_path}.{name}" if name_path else name
        is_target = any(t in full_name for t in targets)
        if is_target and isinstance(child, torch.nn.Linear):
            # Skip odd-dimension layers (kept in FP16 for stability)
            if child.in_features % 32 != 0 or child.out_features % 32 != 0:
                continue

            new_layer = QuantLinear(
                bits=4,
                group_size=64,
                infeatures=child.in_features,
                outfeatures=child.out_features,
                bias=child.bias is not None,
            )
            setattr(module, name, new_layer)
        else:
            replace_linear_with_quant(child, full_name)

replace_linear_with_quant(model)

# 4. Force-Load Weights
from huggingface_hub import hf_hub_download
checkpoint = hf_hub_download(repo_id=model_path, filename="gptq_model-4bit-64g.safetensors")

load_checkpoint_in_model(model, checkpoint=checkpoint, device_map=None, dtype=torch.float16)
model.to("cuda")

# 5. Patch Cache & Run
import sys
model_module = sys.modules[model.__class__.__module__]
if hasattr(model_module, "HybridMambaAttentionDynamicCache"):
    CacheClass = getattr(model_module, "HybridMambaAttentionDynamicCache")
    if not hasattr(CacheClass, "layers"): CacheClass.layers = property(lambda self: [None] * 32)
    if not hasattr(CacheClass, "get_usable_length"): CacheClass.get_usable_length = lambda self, i, l=None: self.get_seq_length(l)
    if not hasattr(CacheClass, "seen_tokens"): CacheClass.seen_tokens = property(lambda self: self.get_seq_length())
    if not hasattr(CacheClass, "get_max_length"): CacheClass.get_max_length = lambda self: model.config.max_position_embeddings

input_ids = tokenizer("The future of AI is", return_tensors="pt").to("cuda")["input_ids"]
output = model.generate(input_ids, max_new_tokens=50)
print(tokenizer.decode(output[0]))
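To sanity-check decode throughput against the ~10 tokens/sec figure on your own hardware, you can time the generate call with a generic helper (a sketch, not part of the repo; the `generate_fn` callable is whatever you want timed):

```python
import time

def tokens_per_second(generate_fn, n_new_tokens):
    """Time a generation call and return decode throughput in tokens/sec."""
    start = time.perf_counter()
    generate_fn()
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# e.g. tokens_per_second(lambda: model.generate(input_ids, max_new_tokens=50), 50)
```

Note this folds prefill time into the average, so short prompts give numbers closest to the steady-state decode rate.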

⚠️ LIMITATION: Long-Context Retrieval

While this model retains strong reasoning ability (perplexity 6.43 on WikiText-2), the aggressive 4-bit quantization of the Mamba backbone limits effective memory retrieval to under 4k tokens. For tasks requiring more than 4k tokens of context, we recommend waiting for our upcoming v1.1 mixed-precision release.
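Until then, a practical guard is to keep prompts inside the reliable window by retaining only the most recent ~4k tokens. A simple sketch (the 4096 cutoff follows the limitation above; names are illustrative):

```python
MAX_RELIABLE_CONTEXT = 4096  # effective retrieval window noted above

def clamp_context(token_ids, max_len=MAX_RELIABLE_CONTEXT):
    """Keep only the most recent tokens so retrieval stays in-window."""
    return token_ids if len(token_ids) <= max_len else token_ids[-max_len:]
```

Apply it to the `input_ids` list before calling `model.generate`; dropping the oldest tokens matches how the quantized Mamba state degrades (oldest context first).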
