Added model card (#1)
- Added model card (b46a87ba8c3e0f7b1652ec2059925866811534f9)
Co-authored-by: Lynn Langit <lynnlangit@users.noreply.huggingface.co>
README.md
ADDED
@@ -0,0 +1,116 @@
---
license: gemma
base_model: google/gemma-2-2b
library_name: transformers
tags:
- text-generation
- gemma2
- local-inference
- bitsandbytes
- fine-tuned
pipeline_tag: text-generation
---

# Gemma-2-Racer

`gemma2racer` is a specialized optimization of Google's **Gemma 2** architecture. This model is fine-tuned and configured specifically for "racing" performance: prioritizing high-speed token generation and low memory overhead for local LLM deployment.

---

## Model Summary

The following table outlines the core technical specifications for the Gemma-2-Racer model.

| Feature | Details |
| :--- | :--- |
| **Developed by** | [Rabimba Karanjai](https://huggingface.co/rabimba) |
| **Model Type** | Causal Language Model (Transformer-based) |
| **Base Model** | [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) |
| **Architecture** | Gemma-2 |
| **Optimization Strategy** | 4-bit Quantization, `torch.compile`, and BitsAndBytes |
| **Primary Language** | English |
| **License** | [Gemma Terms of Use](https://ai.google.dev/gemma/terms) |

---

## Intended Use

This model is designed for developers and researchers who need fast, responsive inference on consumer-grade hardware. It is specifically optimized for:

* **Real-time Interaction:** Minimized time-to-first-token (TTFT) for chat applications.
* **Local Privacy:** Small enough to run entirely offline on standard laptops or edge devices.
* **Efficient Inference:** Fits into roughly 2-4 GB of VRAM, depending on quantization settings.
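The 2-4 GB figure can be sanity-checked with back-of-the-envelope arithmetic: at 4-bit precision each parameter costs half a byte, before the KV cache and activations are counted. A quick estimate (the ~2.6B parameter count is assumed from the gemma-2-2b base, not measured):

```python
PARAMS = 2.6e9  # approximate parameter count of the gemma-2-2b base

def weight_gib(bytes_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return PARAMS * bytes_per_param / 1024**3

for name, bpp in [("bf16", 2.0), ("8-bit", 1.0), ("4-bit", 0.5)]:
    print(f"{name:>5}: ~{weight_gib(bpp):.1f} GiB of weights")
```

At 4-bit this works out to roughly 1.2 GiB of weights, which leaves headroom for the KV cache and runtime overhead within the 2-4 GB envelope.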

---

## Quickstart Guide

To get the model running with the "Racer" performance presets, follow these steps:

1. **Install Requirements:**
   Update your environment with the necessary libraries for quantization and acceleration.

   ```bash
   pip install -U transformers accelerate bitsandbytes
   ```

2. **Login to Hugging Face:**
   Ensure you have accepted the Gemma license on the official Google repository, then authenticate locally.

   ```bash
   huggingface-cli login
   ```

3. **Python Implementation:**
   Use the following snippet to load the model in its optimized 4-bit state.

   ```python
   from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
   import torch

   model_id = "rabimba/gemma2racer"

   # 4-bit quantization via bitsandbytes (the dedicated config object
   # replaces the deprecated load_in_4bit kwarg on from_pretrained)
   quant_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_compute_dtype=torch.bfloat16,
   )

   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(
       model_id,
       device_map="auto",
       quantization_config=quant_config,
   )

   prompt = "Explain quantum physics like I'm a race car driver."
   # Send inputs to wherever accelerate placed the model, not a hard-coded "cuda"
   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

   outputs = model.generate(**inputs, max_new_tokens=150)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
   ```
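The snippet above waits for the entire completion before printing. For the chat-style, low-TTFT use case, `transformers` also provides `TextStreamer`, which prints tokens as they are generated. A minimal sketch, reusing `model` and `tokenizer` from step 3 (the prompt is illustrative):

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are produced, so the first
# words appear after roughly the time-to-first-token instead of after
# the full generation finishes.
streamer = TextStreamer(tokenizer, skip_prompt=True)

inputs = tokenizer(
    "Give me three racing metaphors for caching.", return_tensors="pt"
).to(model.device)
model.generate(**inputs, max_new_tokens=100, streamer=streamer)
```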

---

## Performance Profiles

The "Racer" moniker refers to the model's ability to be tuned for different hardware constraints:

* **The Speedster (Linux/CUDA):** After loading, wrap the model with `model = torch.compile(model)` to exploit kernel fusion for significantly higher throughput.
* **The Daily Driver (Standard GPU):** Standard 4-bit loading via BitsAndBytes balances speed with the capability of the full 2.6B-parameter model.
* **The Endurance Run (Low VRAM):** Runs with heavy CPU offloading via `accelerate` on systems with limited or no dedicated graphics memory.
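The Endurance Run profile can be made concrete with accelerate's `max_memory` map, which caps how much of the model lands on the GPU and spills the remaining layers to CPU RAM. A sketch with illustrative, untuned limits:

```python
from transformers import AutoModelForCausalLM

# Cap GPU 0 at 2 GiB and allow up to 12 GiB of CPU RAM for offloaded
# layers; accelerate assigns each layer a device automatically.
# The memory limits here are examples, not recommended values.
model = AutoModelForCausalLM.from_pretrained(
    "rabimba/gemma2racer",
    device_map="auto",
    max_memory={0: "2GiB", "cpu": "12GiB"},
)
```

Offloaded layers are transferred to the GPU on demand during the forward pass, so expect lower throughput than a fully resident model.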

---

## Limitations and Ethical Considerations

* **Accuracy:** Like all large language models, this model may hallucinate. Users should verify critical information.
* **Bias:** This model inherits biases present in the Gemma-2 base training data.
* **Safety:** While safety filters are present, users should add their own moderation layers for public-facing deployments.

---

## Citation

If you use this model in your research or commercial projects, please cite it as follows:

```bibtex
@misc{gemma2racer2024,
  author       = {Rabimba Karanjai},
  title        = {Gemma-2-Racer: Optimized Local Inference},
  year         = {2024},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rabimba/gemma2racer}}
}
```