---
license: mit
language:
- en
base_model:
- meta-llama/Llama-3.2-1B-Instruct
pipeline_tag: text-generation
---

# Llama-3.2-1B-Instruct (4-bit Quantized)

This repository contains a **4-bit quantized version** of the Llama-3.2-1B-Instruct model.
It was quantized with **bitsandbytes NF4** for very low VRAM consumption and fast
inference, making it well suited to edge devices, low-resource systems, and fast
evaluation pipelines (e.g., Interview Thinker modules).

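For reference, a checkpoint like this one can be produced in a few lines with `transformers` and `bitsandbytes`. The sketch below is illustrative rather than the exact script used for this repo (the output directory and compute dtype are assumptions), and serializing 4-bit weights requires reasonably recent `transformers`/`bitsandbytes` releases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base = "meta-llama/Llama-3.2-1B-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matmuls (assumption)
)

model = AutoModelForCausalLM.from_pretrained(
    base, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(base)

# Serialize the quantized weights so they can be re-loaded directly.
model.save_pretrained("llama-1b-4bit")
tokenizer.save_pretrained("llama-1b-4bit")
```
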
---

## Model Features

- **Base model:** Llama-3.2-1B-Instruct
- **Quantization:** 4-bit (NF4) using `bitsandbytes`
- **VRAM requirement:** ~1.0 GB
- **Perfect for:**
  - Lightweight chatbots
  - Reasoning/evaluation agents
  - Interview Thinker modules
  - Local inference on small GPUs
  - Low-latency systems
- **Compatible with:**
  - LoRA fine-tuning (see the sketch after this list)
  - Hugging Face Transformers
  - Text-generation inference engines

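Since the weights are already NF4-quantized, parameter-efficient fine-tuning follows the standard QLoRA recipe. A minimal sketch using `peft`, assuming `model` is this repo's checkpoint loaded in 4-bit as shown under "How To Load This Model" (the rank, alpha, dropout, and target modules below are illustrative placeholders, not tuned values):

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Casts norms/embeddings appropriately and enables gradient checkpointing hooks
# so the frozen 4-bit base can be trained with adapters.
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                  # adapter rank (placeholder)
    lora_alpha=32,                         # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],   # Llama attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```
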
---

## Files Included

- `config.json`
- `generation_config.json`
- `model.safetensors` (4-bit quantized weights)
- `tokenizer.json`
- `tokenizer_config.json`
- `special_tokens_map.json`
- `chat_template.jinja`

These files allow you to load the model directly with `load_in_4bit=True`.

---

## How To Load This Model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Shlok307/llama-1b-4bit"

tokenizer = AutoTokenizer.from_pretrained(model_name)

# Loads the pre-quantized NF4 weights; `bitsandbytes` must be installed.
# On recent transformers releases you may prefer to pass an explicit
# BitsAndBytesConfig via `quantization_config` instead of `load_in_4bit=True`.
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    device_map="auto",
)
```
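
Once loaded, the model generates like any Transformers causal LM, and the bundled `chat_template.jinja` is applied through the tokenizer. A minimal usage sketch (the prompt and `max_new_tokens` value are arbitrary):

```python
messages = [
    {"role": "user", "content": "Summarize NF4 quantization in one sentence."}
]

# Formats the conversation with the bundled chat template and tokenizes it.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=128)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```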