Gemma-2-Racer

gemma2racer is a specialized build of Google's Gemma 2 architecture. This model is fine-tuned and configured specifically for "racing" performance: prioritizing high-speed token generation and low memory overhead for local LLM deployment.


Model Summary

The following table outlines the core technical specifications for the Gemma-2-Racer model.

Feature                 Details
Developed by            Rabimba Karanjai
Model Type              Causal Language Model (Transformer-based)
Base Model              google/gemma-2-2b
Architecture            Gemma-2
Optimization Strategy   4-bit Quantization, torch.compile, and BitsAndBytes
Primary Language        English
License                 Gemma Terms of Use

Intended Use

This model is designed for developers and researchers who need responsive LLM performance on consumer-grade hardware. It is specifically optimized for:

  • Real-time Interaction: Minimized Time To First Token (TTFT) for chat applications; a quick measurement sketch follows this list.
  • Local Privacy: Small enough to run entirely offline on standard laptops or edge devices.
  • Efficient Inference: Fits in roughly 2-4 GB of VRAM, depending on your quantization settings.
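
To sanity-check the TTFT claim on your own machine, here is a minimal, approximate measurement sketch. It assumes the model and tokenizer loaded in the Quickstart below; measure_ttft is a hypothetical helper, not part of this repository, and because the streamer yields decoded text chunks, the result approximates rather than exactly timestamps the first token.

    import time
    from threading import Thread

    from transformers import TextIteratorStreamer

    def measure_ttft(model, tokenizer, prompt: str) -> float:
        """Return approximate seconds until the first generated chunk arrives."""
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
        # Run generation in a background thread so we can time the stream.
        thread = Thread(
            target=model.generate,
            kwargs={**inputs, "streamer": streamer, "max_new_tokens": 32},
        )
        start = time.perf_counter()
        thread.start()
        next(iter(streamer))  # blocks until the first decoded chunk is ready
        elapsed = time.perf_counter() - start
        thread.join()
        return elapsed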

Quickstart Guide

To get the model running with the "Racer" performance presets, follow these steps:

  1. Install Requirements: Update your environment with the necessary libraries for quantization and acceleration.

    pip install -U transformers accelerate bitsandbytes
    
  2. Log in to Hugging Face: Ensure you have accepted the Gemma license on the official Google repository, then authenticate locally.

    huggingface-cli login
    
  3. Python Implementation: Use the following code snippet to load the model in its optimized 4-bit state.

    from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
    import torch

    model_id = "rabimba/gemma2racer"

    # 4-bit quantization via bitsandbytes, with bfloat16 compute for speed.
    # (Passing load_in_4bit directly to from_pretrained is deprecated in
    # recent transformers releases; BitsAndBytesConfig is the supported path.)
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quantization_config,
    )

    prompt = "Explain quantum physics like I'm a race car driver."
    # model.device follows the device_map, so this also works with CPU offload.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=150)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
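
For chat-style prompting, and only if this checkpoint ships a chat template (the official Gemma 2 instruct checkpoints do; treat its presence here as an assumption), the tokenizer's apply_chat_template method formats the conversation for you:

    # Assumes the model and tokenizer from the snippet above are loaded,
    # and that tokenizer.chat_template is defined for this checkpoint.
    messages = [{"role": "user", "content": "Give me three tips for faster local inference."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    outputs = model.generate(input_ids, max_new_tokens=150)
    # Decode only the newly generated tokens, skipping the prompt.
    print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))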
    

Performance Profiles

The "Racer" moniker refers to the model's ability to be tuned for different hardware constraints:

  • The Speedster (Linux/CUDA): After loading, use model = torch.compile(model) to apply kernel fusion for significantly higher throughput (see the first sketch after this list).
  • The Daily Driver (Standard GPU): Standard 4-bit loading via BitsAndBytes balances speed against the quality of the full 2.6B-parameter model.
  • The Endurance Run (Low VRAM): Heavy CPU offloading via accelerate supports systems with limited or no dedicated graphics memory (see the second sketch after this list).
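
A minimal sketch of the Speedster compile step, assuming the 4-bit model from the Quickstart is already loaded. The reduce-overhead mode and the static-cache setting are optional tuning choices, not requirements, and the static cache needs a recent transformers release:

    import torch

    # Either compile the whole module (model = torch.compile(model)) or just
    # the forward pass, as below. The first few generations pay a one-time
    # compilation warm-up cost before the throughput gains appear.
    model.forward = torch.compile(model.forward, mode="reduce-overhead")

    # Optional, recent transformers releases only: a static KV cache
    # compiles more cleanly than the default dynamic cache.
    model.generation_config.cache_implementation = "static"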
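
For the Endurance Run, accelerate can cap GPU memory and spill the remaining layers to system RAM through the max_memory argument of from_pretrained. The memory budgets below are illustrative placeholders, not recommended values; tune them to your machine:

    import torch
    from transformers import AutoModelForCausalLM

    # Layers that exceed the GPU budget are placed in system RAM by
    # accelerate; generation still works, just more slowly.
    model = AutoModelForCausalLM.from_pretrained(
        "rabimba/gemma2racer",
        device_map="auto",
        max_memory={0: "2GiB", "cpu": "12GiB"},  # illustrative budgets
        torch_dtype=torch.bfloat16,
    )

On a machine with no dedicated GPU at all, device_map="auto" simply places every layer on the CPU, so the same loading code covers both cases.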

Limitations and Ethical Considerations

  • Accuracy: Like all large language models, this model may hallucinate. Users should verify critical information.
  • Bias: This model inherits biases present in the Gemma-2 base training data.
  • Safety: While safety filters are present, it is recommended that users implement their own moderation layers for public-facing deployments.

Citation

If you use this model in your research or commercial projects, please cite it as follows:

@misc{gemma2racer2024,
  author = {Rabimba Karanjai},
  title = {Gemma-2-Racer: Optimized Local Inference},
  year = {2024},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/rabimba/gemma2racer}}
}