---
license: mit
language:
- en
base_model:
- mehta/CooperLM-354M
pipeline_tag: text-generation
library_name: transformers
tags:
- toy-llm
- gpt2
- 4bit
- quantized
- causal-lm
- transformers
- small-llm
---

# 🧠 CooperLM-354M (4-bit Quantized)

This is a 4-bit quantized version of [CooperLM-354M](https://huggingface.co/mehta/CooperLM-354M), a 354M-parameter GPT-2 style language model trained from scratch on a subset of Wikipedia, BookCorpus, and OpenWebText.

The quantized model is intended for faster inference and a smaller memory footprint, especially on CPU or limited-GPU setups.
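
To see the savings concretely, you can compare the two checkpoints with the `get_memory_footprint()` helper that `transformers` exposes on loaded models. A minimal sketch, assuming the repo ids from this card, enough memory to load both checkpoints, and a CUDA GPU for the 4-bit one (GPTQ checkpoints need the `optimum` + `auto-gptq` integration):

```python
from transformers import AutoModelForCausalLM

# Full-precision baseline vs. the 4-bit GPTQ checkpoint from this card.
base = AutoModelForCausalLM.from_pretrained("mehta/CooperLM-354M")
quant = AutoModelForCausalLM.from_pretrained(
    "mehta/CooperLM-354M-4bit",
    device_map="auto",  # GPTQ weights are placed on the available GPU
)

# get_memory_footprint() reports parameter + buffer memory in bytes.
print(f"base:  {base.get_memory_footprint() / 1e6:.0f} MB")
print(f"4-bit: {quant.get_memory_footprint() / 1e6:.0f} MB")
```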

---

## 📌 Model Details

- **Base Model**: [mehta/CooperLM-354M](https://huggingface.co/mehta/CooperLM-354M)
- **Architecture**: GPT-2 (24 layers, 16 attention heads, hidden size 1024)
- **Quantization**: 4-bit integer weights via `AutoGPTQ`, saved as safetensors (a sketch of one possible conversion follows below)
- **Precision**: INT4
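
The card does not publish the exact conversion script; the sketch below shows one plausible route using the `GPTQConfig` integration in `transformers`. The calibration dataset (`"c4"`) and the default group size are assumptions, not the published recipe:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

base_id = "mehta/CooperLM-354M"
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Calibrate on a built-in dataset and pack weights to 4-bit GPTQ.
# bits/dataset are assumptions; group_size falls back to the default (128).
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

quantized = AutoModelForCausalLM.from_pretrained(
    base_id,
    quantization_config=gptq_config,
    device_map="auto",  # GPTQ quantization runs on GPU
)

# save_pretrained serializes the packed weights as safetensors by default.
quantized.save_pretrained("CooperLM-354M-4bit")
tokenizer.save_pretrained("CooperLM-354M-4bit")
```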

---

## 🛠️ How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# GPTQ checkpoints rely on the optimum + auto-gptq integration and are
# typically run on a CUDA GPU.
tokenizer = AutoTokenizer.from_pretrained("mehta/CooperLM-354M-4bit")
model = AutoModelForCausalLM.from_pretrained(
    "mehta/CooperLM-354M-4bit",
    device_map="auto",  # place the quantized weights automatically
)

prompt = "In the distant future,"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_length=100,   # total length, prompt included
    temperature=0.8,  # sampling randomness
    top_p=0.95,       # nucleus sampling cutoff
    do_sample=True,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
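
Equivalently, the high-level `pipeline` API wraps the same tokenize/generate/decode steps. A sketch, assuming the same repo id and that the GPTQ dependencies above are installed:

```python
from transformers import pipeline

# The pipeline resolves the tokenizer and model from the repo
# and handles device placement via device_map.
generator = pipeline(
    "text-generation",
    model="mehta/CooperLM-354M-4bit",
    device_map="auto",
)

result = generator(
    "In the distant future,",
    max_length=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
)
print(result[0]["generated_text"])
```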