---
license: mit
language:
- en
base_model:
- mehta/CooperLM-354M
pipeline_tag: text-generation
library_name: transformers
tags:
- toy-llm
- gpt2
- 4bit
- quantized
- causal-lm
- transformers
- small-llm
---

# 🧠 CooperLM-354M (4-bit Quantized)

This is a 4-bit quantized version of [CooperLM-354M](https://huggingface.co/mehta/CooperLM-354M), a 354M-parameter GPT-2 style language model trained from scratch on a subset of Wikipedia, BookCorpus, and OpenWebText.

The quantized model is intended for faster inference and a smaller memory footprint, especially on CPU or limited-GPU setups.
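
To see the savings concretely, you can compare the two checkpoints with the `get_memory_footprint()` helper that `transformers` exposes on loaded models. A minimal sketch, assuming the repo ids from this card, enough memory to load both checkpoints, and a CUDA GPU for the 4-bit one (GPTQ checkpoints need the `optimum` + `auto-gptq` integration):

```python
from transformers import AutoModelForCausalLM

# Full-precision baseline vs. the 4-bit GPTQ checkpoint from this card.
base = AutoModelForCausalLM.from_pretrained("mehta/CooperLM-354M")
quant = AutoModelForCausalLM.from_pretrained(
    "mehta/CooperLM-354M-4bit",
    device_map="auto",  # GPTQ weights are placed on the available GPU
)

# get_memory_footprint() reports parameter + buffer memory in bytes.
print(f"base:  {base.get_memory_footprint() / 1e6:.0f} MB")
print(f"4-bit: {quant.get_memory_footprint() / 1e6:.0f} MB")
```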

---

## 📌 Model Details

- **Base Model**: [mehta/CooperLM-354M](https://huggingface.co/mehta/CooperLM-354M)
- **Architecture**: GPT-2 (24 layers, 16 attention heads, hidden size 1024)
- **Quantization**: 4-bit integer weights via `AutoGPTQ`, saved as safetensors (a sketch of one possible conversion follows below)
- **Precision**: INT4
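
The card does not publish the exact conversion script; the sketch below shows one plausible route using the `GPTQConfig` integration in `transformers`. The calibration dataset (`"c4"`) and the default group size are assumptions, not the published recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "mehta/CooperLM-354M"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Calibrate on a built-in dataset and pack weights to 4-bit GPTQ.
# bits/dataset are assumptions; group_size falls back to the default (128).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=gptq_config,
    device_map="auto",  # GPTQ quantization runs on GPU
)

# save_pretrained serializes the packed weights as safetensors by default.
quantized.save_pretrained("CooperLM-354M-4bit")
tokenizer.save_pretrained("CooperLM-354M-4bit")
```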

---

## 🛠️ How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPTQ checkpoints rely on the optimum + auto-gptq integration and are
# typically run on a CUDA GPU.
tokenizer = AutoTokenizer.from_pretrained("mehta/CooperLM-354M-4bit")
model = AutoModelForCausalLM.from_pretrained(
    "mehta/CooperLM-354M-4bit",
    device_map="auto",  # place the quantized weights automatically
)

prompt = "In the distant future,"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_length=100,   # total length, prompt included
    temperature=0.8,  # sampling randomness
    top_p=0.95,       # nucleus sampling cutoff
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
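
Equivalently, the high-level `pipeline` API wraps the same tokenize/generate/decode steps. A sketch, assuming the same repo id and that the GPTQ dependencies above are installed:

```python
from transformers import pipeline

# The pipeline resolves the tokenizer and model from the repo
# and handles device placement via device_map.
generator = pipeline(
    "text-generation",
    model="mehta/CooperLM-354M-4bit",
    device_map="auto",
)

result = generator(
    "In the distant future,",
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(result[0]["generated_text"])
```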