---
license: apache-2.0
tags:
- nebula-s
- svms
- math-reasoning
- competition-math
- 4bit
- quantized
- bitsandbytes
library_name: transformers
---

# Nebula-S-v1-4bit

4-bit quantized version of [Nebula-S-v1](https://huggingface.co/punitdecomp/Nebula-S-v1).

**Nebula-S-v1** is a reasoning-enhanced language model using the **SVMS (Structured-Vector Multi-Stream)** architecture.

## What's different from Nebula-S-v1?

| | Nebula-S-v1 | Nebula-S-v1-4bit |
|---|---|---|
| Backbone precision | bf16 | **4-bit (nf4)** |
| Adapter precision | bf16 | bf16 |
| Backbone size | ~8 GB | **~2 GB** |
| Total size | ~9 GB | **~3 GB** |
| VRAM needed | ~18 GB | **~6 GB** |
| Requires | CUDA / MPS / CPU | **CUDA only** (bitsandbytes) |
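The size reduction in the table is what you'd expect from bits per weight alone: nf4 stores 4 bits per weight versus 16 for bf16, so the backbone shrinks roughly 4x, with quantization constants adding a small overhead on top. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope size check (figures from the table above;
# ignores the small overhead of quantization constants).
bf16_backbone_gb = 8                          # ~8 GB at 16 bits per weight
nf4_backbone_gb = bf16_backbone_gb * 4 / 16   # 4 bits per weight
print(nf4_backbone_gb)  # 2.0 -> matches the ~2 GB row
```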
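For reference, an nf4-quantized backbone as described in the table corresponds roughly to the following bitsandbytes configuration in transformers. This is a sketch only: the bundled `load_nebula_s` helper applies its own settings, and the compute dtype and double-quantization flag here are assumptions, not the repo's confirmed values.

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of an nf4 quantization config matching the table above
# (compute dtype and double quantization are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```

In a plain transformers workflow such a config would be passed as `quantization_config=bnb_config` to `from_pretrained`; it is shown here only to make the table's "nf4" row concrete.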
## Quick Start

```bash
pip install torch "transformers>=4.51.0" bitsandbytes accelerate huggingface-hub
```

### Option 1: Using huggingface_hub

```python
import sys

from huggingface_hub import snapshot_download

# Download the repo (weights plus the bundled nebula_s loader code)
snapshot_download("punitdecomp/Nebula-S-v1-4bit", local_dir="./Nebula-S-v1-4bit")

# Make the bundled nebula_s module importable
sys.path.insert(0, "./Nebula-S-v1-4bit")
from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Option 2: Using git clone

```bash
git lfs install
git clone https://huggingface.co/punitdecomp/Nebula-S-v1-4bit
```

```python
import sys

sys.path.insert(0, "./Nebula-S-v1-4bit")
from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Generate a response

```python
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 23?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
response = model.generate(
    inputs["input_ids"], inputs["attention_mask"],
    tokenizer, max_new_tokens=2048, temperature=0.7
)
print(response)
```

## License

Apache 2.0. The backbone is derived from an Apache-2.0-licensed base model.