---
license: apache-2.0
tags:
- nebula-s
- svms
- math-reasoning
- competition-math
- 4bit
- quantized
- bitsandbytes
library_name: transformers
---

# Nebula-S-v1-4bit

4-bit quantized version of [Nebula-S-v1](https://huggingface.co/punitdecomp/Nebula-S-v1).

**Nebula-S-v1** is a reasoning-enhanced language model using the **SVMS (Structured-Vector Multi-Stream)** architecture.

## What's different from Nebula-S-v1?

| | Nebula-S-v1 | Nebula-S-v1-4bit |
|---|---|---|
| Backbone precision | bf16 | **4-bit (nf4)** |
| Adapter precision | bf16 | bf16 |
| Backbone size | ~8 GB | **~2 GB** |
| Total size | ~9 GB | **~3 GB** |
| VRAM needed | ~18 GB | **~6 GB** |
| Requires | CUDA / MPS / CPU | **CUDA only** (bitsandbytes) |
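The size reduction in the table is what you'd expect from bits per weight alone: nf4 stores 4 bits per weight versus 16 for bf16, so the backbone shrinks roughly 4x, with quantization constants adding a small overhead on top. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope size check (figures from the table above;
# ignores the small overhead of quantization constants).
bf16_backbone_gb = 8                          # ~8 GB at 16 bits per weight
nf4_backbone_gb = bf16_backbone_gb * 4 / 16   # 4 bits per weight
print(nf4_backbone_gb)  # 2.0 -> matches the ~2 GB row
```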
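For reference, an nf4-quantized backbone as described in the table corresponds roughly to the following bitsandbytes configuration in transformers. This is a sketch only: the bundled `load_nebula_s` helper applies its own settings, and the compute dtype and double-quantization flag here are assumptions, not the repo's confirmed values.

```python
import torch
from transformers import BitsAndBytesConfig

# Sketch of an nf4 quantization config matching the table above
# (compute dtype and double quantization are assumptions).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 weight format
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
)
```

In a plain transformers workflow such a config would be passed as `quantization_config=bnb_config` to `from_pretrained`; it is shown here only to make the table's "nf4" row concrete.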
## Quick Start

```bash
pip install torch "transformers>=4.51.0" bitsandbytes accelerate huggingface-hub
```

### Option 1: Using huggingface_hub

```python
import sys

from huggingface_hub import snapshot_download

# Download the repo (weights plus the bundled nebula_s loader code)
snapshot_download("punitdecomp/Nebula-S-v1-4bit", local_dir="./Nebula-S-v1-4bit")

# Make the bundled nebula_s module importable
sys.path.insert(0, "./Nebula-S-v1-4bit")
from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Option 2: Using git clone

```bash
git lfs install
git clone https://huggingface.co/punitdecomp/Nebula-S-v1-4bit
```

```python
import sys

sys.path.insert(0, "./Nebula-S-v1-4bit")
from nebula_s import load_nebula_s

model, tokenizer = load_nebula_s("./Nebula-S-v1-4bit", device="cuda")
```

### Generate a response

```python
messages = [{"role": "user", "content": "Solve step by step: what is 17 * 23?"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to("cuda")
response = model.generate(
    inputs["input_ids"], inputs["attention_mask"],
    tokenizer, max_new_tokens=2048, temperature=0.7
)
print(response)
```

## License

Apache 2.0. The backbone is derived from an Apache-2.0-licensed base model.