---
license: apache-2.0
tags:
- nebula-s
- svms
- math-reasoning
- competition-math
- 4bit
- quantized
- bitsandbytes
library_name: transformers
---

# Nebula-S-v1-4bit

4-bit quantized version of [Nebula-S-v1](https://huggingface.co/punitdecomp/Nebula-S-v1).

**Nebula-S-v1** is a reasoning-enhanced language model built on the **SVMS (Structured-Vector Multi-Stream)** architecture.

## What's different from Nebula-S-v1?

| | Nebula-S-v1 | Nebula-S-v1-4bit |
|---|---|---|
| Backbone precision | bf16 | **4-bit (nf4)** |
| Adapter precision | bf16 | bf16 |
| Backbone size | ~8 GB | **~2 GB** |
| Total size | ~9 GB | **~3 GB** |
| VRAM needed | ~18 GB | **~6 GB** |
| Requires | CUDA / MPS / CPU | **CUDA only** (bitsandbytes) |

## Quick Start

```bash
pip install torch "transformers>=4.51.0" bitsandbytes accelerate huggingface-hub
```

### Option 1: Using huggingface_hub

```python
import sys

from huggingface_hub import snapshot_download

# Download the repo, then make its bundled nebula_s module importable
snapshot_download("punitdecomp/Nebula-S-v1-4bit", local_dir="./Nebula-S-v1-4bit")
sys.path.insert(0, "./Nebula-S-v1-4bit")

from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Option 2: Using git clone

```bash
git lfs install
git clone https://huggingface.co/punitdecomp/Nebula-S-v1-4bit
```

```python
import sys

sys.path.insert(0, "./Nebula-S-v1-4bit")

from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Generate a response

```python
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 23?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")

response = model.generate(
    inputs["input_ids"],
    inputs["attention_mask"],
    tokenizer,
    max_new_tokens=2048,
    temperature=0.7,
)
print(response)
```

## License

Apache 2.0. The backbone is derived from an Apache-2.0-licensed base model.
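## Appendix: where the size numbers come from

The sizes in the comparison table follow directly from the precision change: a bf16 weight occupies 16 bits, an nf4 weight 4 bits (plus a small overhead for quantization constants that the estimate below ignores). A minimal back-of-the-envelope sketch, assuming a roughly 4B-parameter backbone (inferred from the ~8 GB bf16 size, not stated by the model authors):

```python
def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough model size: parameter count times bits per weight, in gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

# Assumption: ~8 GB bf16 backbone implies roughly 4e9 parameters.
n_params = 4e9

bf16_gb = approx_size_gb(n_params, 16)  # bf16: 16 bits per weight
nf4_gb = approx_size_gb(n_params, 4)    # nf4: 4 bits per weight

print(f"bf16 backbone: ~{bf16_gb:.0f} GB, nf4 backbone: ~{nf4_gb:.0f} GB")
```

The ~1 GB gap between backbone size and total size in both columns is the bf16 adapter, which is not quantized in either variant.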