---
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a **4-bit quantized version** of the Llama-3.2-1B-Instruct model.
It was quantized with **bitsandbytes NF4** for very low VRAM consumption and fast
inference, making it well suited to edge devices, low-resource systems, and fast
evaluation pipelines (e.g., Interview Thinker modules).

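For reference, a checkpoint like this one can be produced in a few lines with `transformers` and `bitsandbytes`. The sketch below is illustrative rather than the exact script used for this repo (the output directory and compute dtype are assumptions), and serializing 4-bit weights requires reasonably recent `transformers`/`bitsandbytes` releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base = "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls (assumption)
)

model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Serialize the quantized weights so they can be re-loaded directly.
model.save_pretrained("llama-1b-4bit")
tokenizer.save_pretrained("llama-1b-4bit")
```
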
---

## Model Features

- **Base model:** Llama-3.2-1B-Instruct
- **Quantization:** 4-bit (NF4) using `bitsandbytes`
- **VRAM requirement:** ~1.0 GB
- **Perfect for:**
  - Lightweight chatbots
  - Reasoning/evaluation agents
  - Interview Thinker modules
  - Local inference on small GPUs
  - Low-latency systems
- **Compatible with:**
  - LoRA fine-tuning (see the sketch after this list)
  - Hugging Face Transformers
  - Text-generation inference engines

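Since the weights are already NF4-quantized, parameter-efficient fine-tuning follows the standard QLoRA recipe. A minimal sketch using `peft`, assuming `model` is this repo's checkpoint loaded in 4-bit as shown under "How To Load This Model" (the rank, alpha, dropout, and target modules below are illustrative placeholders, not tuned values):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Casts norms/embeddings appropriately and enables gradient checkpointing hooks
# so the frozen 4-bit base can be trained with adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (placeholder)
    lora_alpha=32,                         # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],   # Llama attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```
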
---

## Files Included

- `config.json`
- `generation_config.json`
- `model.safetensors` (4-bit quantized weights)
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `chat_template.jinja`

These files allow you to load the model directly with `load_in_4bit=True`.

---

## How To Load This Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loads the pre-quantized NF4 weights; `bitsandbytes` must be installed.
# On recent transformers releases you may prefer to pass an explicit
# BitsAndBytesConfig via `quantization_config` instead of `load_in_4bit=True`.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
```
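
Once loaded, the model generates like any Transformers causal LM, and the bundled `chat_template.jinja` is applied through the tokenizer. A minimal usage sketch (the prompt and `max_new_tokens` value are arbitrary):

```python
messages = [
    {"role": "user", "content": "Summarize NF4 quantization in one sentence."}
]

# Formats the conversation with the bundled chat template and tokenizes it.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```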