gpt2-test-quantization - Quanto int8

This is an automatically quantized version of Sambhavnoobcoder/gpt2-test-quantization using Quanto int8 quantization.

⚡ Quick Start

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "Sambhavnoobcoder/gpt2-test-quantization-Quanto-int8",
    device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("Sambhavnoobcoder/gpt2-test-quantization-Quanto-int8")

# Generate text
inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0]))

🔧 Quantization Details

Method: Quanto (HuggingFace native)
Precision: int8 (8-bit integer weights)
Quality: 99%+ retention vs FP16
Memory: ~2x smaller than original
Speed: 2-4x faster inference

📈 Performance

Metric	Value
Memory Reduction	~50%
Quality Retention	99%+
Inference Speed	2-4x faster

🤖 Automatic Quantization

This model was automatically quantized by the Auto-Quantization Service.

Want your models automatically quantized?

Set up a webhook in your HuggingFace settings
Point to: https://Sambhavnoobcoder-quantization-mvp.hf.space/webhook
Upload a model - it will be automatically quantized!

📚 Learn More

Original Model: Sambhavnoobcoder/gpt2-test-quantization
Quantization Method: Quanto Documentation
Service Code: GitHub Repository

📝 Citation

@software{quanto_quantization,
  title = {Quanto: PyTorch Quantization Toolkit},
  author = {HuggingFace Team},
  year = {2024},
  url = {https://github.com/huggingface/optimum-quanto}
}

Generated on 2026-01-10 21:37:02 by Auto-Quantization MVP

Downloads last month: 4

Safetensors

Model size

0.2B params

Tensor type

F16

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Sambhavnoobcoder/gpt2-test-quantization-Quanto-int8

Base model

Sambhavnoobcoder/gpt2-test-quantization

Finetuned

(1)

this model

Finetunes

1 model