Gemma-2-2B Trading Summarizer (8-bit Quantized)

Model Description

This is an 8-bit quantized version of the fine-tuned Gemma-2-2B trading journal summarizer. Compared to the fp16 version, it cuts model size and memory usage by roughly 50% with minimal quality loss.

Quantization Details

  • Method: bitsandbytes 8-bit quantization
  • Original Precision: fp16
  • Quantized Precision: int8
  • Size Reduction: ~50%
  • Quality Impact: Typically <2% degradation
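The ~50% figure follows directly from bytes per parameter: fp16 stores two bytes per weight, int8 one. A quick back-of-the-envelope check (the parameter count below is approximate, and real checkpoints keep some tensors in higher precision, so actual files are slightly larger):

```python
def weight_size_gb(n_params: int, bits_per_param: int) -> float:
    """Raw weight storage in decimal GB, ignoring tensors kept in higher precision."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 2_600_000_000  # Gemma-2-2B has roughly 2.6B parameters

print(f"fp16: {weight_size_gb(N_PARAMS, 16):.1f} GB")  # ~5.2 GB
print(f"int8: {weight_size_gb(N_PARAMS, 8):.1f} GB")   # ~2.6 GB
```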

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Requires the bitsandbytes package and a CUDA-capable GPU.
# Passing load_in_8bit directly to from_pretrained is deprecated;
# use a BitsAndBytesConfig instead.
model = AutoModelForCausalLM.from_pretrained(
    "./gemma-2b-trader-8bit",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("./gemma-2b-trader-8bit")

# Same usage as fp16 version
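As a minimal generation sketch continuing from the code above (the prompt text and template here are illustrative assumptions, not the exact format used during fine-tuning; generation parameters are examples):

```python
# Illustrative only: the real prompt template from fine-tuning may differ.
prompt = "Summarize this trading journal entry:\n2024-03-01: Long NVDA at open, stopped out for -1R."

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
summary = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
)
print(summary)
```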

When to Use This Version

  • Limited GPU memory (<8GB VRAM)
  • Faster loading times needed
  • Deployment on edge devices
  • When memory footprint matters more than marginal quality (note: 8-bit inference is not always faster than fp16; the main win is memory)

When to Use FP16 Version

  • Maximum quality required
  • Sufficient GPU memory available
  • Fine-tuning or further training needed
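The trade-offs above can be condensed into a small decision helper. The thresholds are illustrative assumptions drawn from the bullets, not measured requirements, and the function name is hypothetical:

```python
def pick_variant(free_vram_gb: float, need_training: bool = False) -> str:
    """Choose between the fp16 and 8-bit checkpoints.

    Rough assumption: fp16 weights alone take ~5 GB, 8-bit ~2.6 GB
    (excluding activations and KV cache), so 8 GB VRAM is the cutoff
    used in this card.
    """
    if need_training:
        return "fp16"  # fine-tuning or further training needs full precision
    if free_vram_gb < 8:
        return "8bit"  # fits comfortably under 8 GB VRAM
    return "fp16"      # enough memory: prefer maximum quality

print(pick_variant(6.0))         # 8bit
print(pick_variant(24.0))        # fp16
print(pick_variant(24.0, True))  # fp16
```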