rabimba lynnlangit committed
Commit 6e6da40 · verified · 1 Parent(s): 7e430db

Added model card (#1)


- Added model card (b46a87ba8c3e0f7b1652ec2059925866811534f9)


Co-authored-by: Lynn Langit <lynnlangit@users.noreply.huggingface.co>

Files changed (1)
  1. README.md +116 -0
README.md ADDED
@@ -0,0 +1,116 @@
---
license: gemma
base_model: google/gemma-2-2b
library_name: transformers
tags:
- text-generation
- gemma2
- local-inference
- bitsandbytes
- fine-tuned
pipeline_tag: text-generation
---

# Gemma-2-Racer

`gemma2racer` is a specialized build of Google's **Gemma 2** architecture, fine-tuned and configured for "racing" performance: fast token generation with low memory overhead for local LLM deployment.

---

## Model Summary

The following table outlines the core technical specifications for the Gemma-2-Racer model.

| Feature | Details |
| :--- | :--- |
| **Developed by** | [Rabimba Karanjai](https://huggingface.co/rabimba) |
| **Model Type** | Causal Language Model (Transformer-based) |
| **Base Model** | [google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b) |
| **Architecture** | Gemma-2 |
| **Optimization Strategy** | 4-bit Quantization, `torch.compile`, and BitsAndBytes |
| **Primary Language** | English |
| **License** | [Gemma Terms of Use](https://ai.google.dev/gemma/terms) |

---

## Intended Use

This model is designed for developers and researchers who want responsive LLM performance on consumer-grade hardware. It is specifically optimized for:

* **Real-time Interaction:** Minimized "Time To First Token" (TTFT) for chat applications.
* **Local Privacy:** Small enough to run entirely offline on standard laptops or edge devices.
* **Efficient Inference:** Optimized to fit into 2-4 GB of VRAM, depending on your quantization settings.

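The 2-4 GB figure can be sanity-checked with back-of-envelope arithmetic. This is a sketch only: the ~2.6B parameter count comes from the gemma-2-2b base, and real usage adds KV-cache and runtime overhead on top of the weights.

```python
# Back-of-envelope VRAM estimate for the model's weights at various
# precisions. Approximations only; actual usage is higher once the
# KV-cache and framework overhead are included.

def estimate_weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * (bits_per_param / 8) / 1e9

n_params = 2.6e9  # approximate gemma-2-2b parameter count

for bits, label in [(4, "4-bit (NF4)"), (8, "8-bit"), (16, "bf16")]:
    gb = estimate_weight_memory_gb(n_params, bits)
    print(f"{label}: ~{gb:.1f} GB for weights alone")
# 4-bit: ~1.3 GB, 8-bit: ~2.6 GB, bf16: ~5.2 GB
```

At 4-bit the weights alone need roughly 1.3 GB, which is consistent with the 2-4 GB envelope once the cache and overhead are added.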
---

## Quickstart Guide

To get the model running with the "Racer" performance presets, follow these steps:

1. **Install Requirements:**
   Update your environment with the necessary libraries for quantization and acceleration.
   ```bash
   pip install -U transformers accelerate bitsandbytes
   ```

2. **Login to Hugging Face:**
   Ensure you have accepted the Gemma license on the official Google repository and authenticate locally.
   ```bash
   huggingface-cli login
   ```

3. **Python Implementation:**
   Use the following code snippet to load the model in its optimized 4-bit state.
   ```python
   from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
   import torch

   model_id = "rabimba/gemma2racer"

   # Configure 4-bit loading explicitly; passing load_in_4bit directly to
   # from_pretrained is deprecated in recent transformers releases.
   quantization_config = BitsAndBytesConfig(
       load_in_4bit=True,
       bnb_4bit_compute_dtype=torch.bfloat16,
   )

   tokenizer = AutoTokenizer.from_pretrained(model_id)
   model = AutoModelForCausalLM.from_pretrained(
       model_id,
       device_map="auto",
       quantization_config=quantization_config,
   )

   prompt = "Explain quantum physics like I'm a race car driver."
   # Use model.device rather than hard-coding "cuda" so the snippet also
   # works when accelerate places layers on other devices.
   inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

   outputs = model.generate(**inputs, max_new_tokens=150)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))
   ```
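Since the card's headline claim is generation speed, it is worth measuring throughput directly. A small stdlib timing helper is sketched below; the function name is mine, not part of this repository's API.

```python
import time

def tokens_per_second(generate_fn, n_new_tokens: int) -> float:
    """Time a zero-argument generate call and report token throughput."""
    start = time.perf_counter()
    generate_fn()  # run the wrapped generation
    elapsed = time.perf_counter() - start
    return n_new_tokens / elapsed

# Hypothetical usage with the Quickstart snippet above:
# tps = tokens_per_second(
#     lambda: model.generate(**inputs, max_new_tokens=150), 150
# )
# print(f"~{tps:.1f} tokens/sec")
```

Wall-clock timing like this includes tokenization-free generate time only; for TTFT specifically you would time until the first token is emitted, e.g. with a streamer.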

---

## Performance Profiles

The "Racer" moniker refers to the model's ability to be tuned for different hardware constraints:

* **The Speedster (Linux/CUDA):** After loading, use `model = torch.compile(model)` to utilize kernel fusion for significantly higher throughput.
* **The Daily Driver (Standard GPU):** Standard 4-bit loading via BitsAndBytes balances speed against the full 2.6B-parameter model quality.
* **The Endurance Run (Low VRAM):** Can be run with heavy CPU offloading via `accelerate` for systems with limited or no dedicated graphics memory.

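The three profiles above amount to a simple selection rule over available hardware. The helper below is purely illustrative: the function name, dictionary keys, and VRAM thresholds are my own assumptions, not an API of this repository.

```python
def pick_profile(vram_gb: float, has_cuda: bool) -> dict:
    """Map available hardware to one of the three loading profiles above."""
    if not has_cuda or vram_gb < 2:
        # The Endurance Run: 4-bit weights with heavy CPU offload via accelerate.
        return {"name": "endurance", "load_in_4bit": True,
                "cpu_offload": True, "compile": False}
    if vram_gb < 6:
        # The Daily Driver: plain 4-bit BitsAndBytes loading on the GPU.
        return {"name": "daily", "load_in_4bit": True,
                "cpu_offload": False, "compile": False}
    # The Speedster: headroom for torch.compile's extra memory during warm-up.
    return {"name": "speedster", "load_in_4bit": True,
            "cpu_offload": False, "compile": True}

print(pick_profile(4.0, True)["name"])  # a 4 GB GPU lands on "daily"
```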

---

## Limitations and Ethical Considerations

* **Accuracy:** Like all large language models, this model may hallucinate. Users should verify critical information.
* **Bias:** This model inherits biases present in the Gemma-2 base training data.
* **Safety:** While safety filters are present, it is recommended that users implement their own moderation layers for public-facing deployments.

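As a minimal illustration of the moderation-layer recommendation, a prompt pre-filter can gate what reaches the model. This is a toy sketch with an invented blocklist and function name; a real deployment should use a trained safety classifier, not string matching.

```python
# Toy moderation pre-filter (illustrative only; the blocklist and function
# are examples, not part of this model or repository).

BLOCKLIST = {"build a weapon", "credit card numbers"}

def passes_moderation(prompt: str) -> bool:
    """Reject prompts containing any blocklisted phrase (case-insensitive)."""
    lowered = prompt.lower()
    return not any(phrase in lowered for phrase in BLOCKLIST)

# Only prompts that pass the filter would be forwarded to model.generate(...).
```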

---

## Citation

If you use this model in your research or commercial projects, please cite it as follows:

```bibtex
@misc{gemma2racer2024,
  author = {Rabimba Karanjai},
  title = {Gemma-2-Racer: Optimized Local Inference},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rabimba/gemma2racer}}
}
```