Instructions to use RichardErkhov/google_-_recurrentgemma-2b-it-4bits with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use RichardErkhov/google_-_recurrentgemma-2b-it-4bits with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="RichardErkhov/google_-_recurrentgemma-2b-it-4bits")
messages = [
    {"role": "user", "content": "Who are you?"},
]
# Print the result; a bare call shows nothing when run as a script
print(pipe(messages))

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("RichardErkhov/google_-_recurrentgemma-2b-it-4bits")
model = AutoModelForCausalLM.from_pretrained("RichardErkhov/google_-_recurrentgemma-2b-it-4bits")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use RichardErkhov/google_-_recurrentgemma-2b-it-4bits with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "RichardErkhov/google_-_recurrentgemma-2b-it-4bits"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RichardErkhov/google_-_recurrentgemma-2b-it-4bits",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
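
The curl call above can also be made from Python. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `ask` are hypothetical, and the reply parsing assumes the server answers in the usual OpenAI-style shape (`choices[0].message.content`). For a server on another port, such as the SGLang one on 30000, pass a different `base_url`.

```python
import json
import urllib.request

MODEL_ID = "RichardErkhov/google_-_recurrentgemma-2b-it-4bits"


def build_chat_request(prompt, base_url="http://localhost:8000", model=MODEL_ID):
    """Build the same POST request as the curl command (nothing is sent yet)."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, **kwargs):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-compatible response body.
    """
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(ask("What is the capital of France?"))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a running server.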

Use Docker

docker model run hf.co/RichardErkhov/google_-_recurrentgemma-2b-it-4bits

SGLang

How to use RichardErkhov/google_-_recurrentgemma-2b-it-4bits with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "RichardErkhov/google_-_recurrentgemma-2b-it-4bits" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RichardErkhov/google_-_recurrentgemma-2b-it-4bits",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "RichardErkhov/google_-_recurrentgemma-2b-it-4bits" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "RichardErkhov/google_-_recurrentgemma-2b-it-4bits",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'


Request for <4B linear attention quants

by TomLucidor - opened Feb 13

Discussion

TomLucidor

Feb 13

Could you do Q8/Q6/Q4/Adaptive quants on Jet-Nemotron-2B / Nemotron-Flash-3B-Instruct / Jet-Nemotron-4B / Nemotron-H-4B-Instruct-128K (ideally MLX-compatible)?

RichardErkhov

Owner Feb 14

Hi, we only do GGUF static and imatrix quants; we don't do other formats. If you want GGUF, you can just send links to the models here so I can process them.

TomLucidor

Feb 14

Here you go, and then the MLX community can convert GGUF later on

RichardErkhov

Owner Feb 14

So I just realised that this comment was left on my page and not mradermacher's. As much as I want to help, Hugging Face is blocking me from uploading to my account (because whatever can go wrong will go wrong in my life), and given my forgetfulness I have fully joined the mradermacher team instead, so you will need to find the quants on their page. I queued them there; here's the message I usually leave on model requests for mradermacher =)

It's queued!

You can check for progress at http://hf.tst.eu/status.html or regularly check the model summary pages for quants to appear:
https://hf.tst.eu/model#Jet-Nemotron-4B-GGUF
https://hf.tst.eu/model#Jet-Nemotron-2B-GGUF
https://hf.tst.eu/model#Nemotron-Flash-3B-Instruct-GGUF
https://hf.tst.eu/model#Nemotron-H-4B-Instruct-128K-GGUF
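
The summary links in this message follow a visible pattern: the model name, a `-GGUF` suffix, and the `https://hf.tst.eu/model#` prefix. A small sketch that reproduces it is below. The function name `status_url` is hypothetical, and the pattern is inferred only from the four links listed here, so it is not a documented API of that site.

```python
STATUS_BASE = "https://hf.tst.eu/model"


def status_url(repo_id):
    """Return the summary-page link for a model, following the pattern seen above.

    Accepts either a bare model name or an "org/name" repo id; only the
    name part is used. The "-GGUF" suffix is an assumption taken from the links.
    """
    name = repo_id.rstrip("/").split("/")[-1]
    return f"{STATUS_BASE}#{name}-GGUF"


for model in [
    "Jet-Nemotron-4B",
    "Jet-Nemotron-2B",
    "Nemotron-Flash-3B-Instruct",
    "nvidia/Nemotron-H-4B-Instruct-128K",
]:
    print(status_url(model))
```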

https://huggingface.co/nvidia/Nemotron-H-4B-Instruct-128K

The queue gave me this error. You may want to check it to understand why your Nemotrons might not be quantized:

model broken, max arrogance. https://huggingface.co/nvidia/NVIDIA-Nemotron-Nano-9B-v2/discussions/5

Sorry if something does not get quantized; it is out of my control.

TomLucidor

Feb 14

WTF from NVIDIA! We definitely need something functional in the linear attention sphere...
