Instructions to use lunahr/SystemGemma2-2b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use lunahr/SystemGemma2-2b-it with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="lunahr/SystemGemma2-2b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("lunahr/SystemGemma2-2b-it")
model = AutoModelForCausalLM.from_pretrained("lunahr/SystemGemma2-2b-it")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use lunahr/SystemGemma2-2b-it with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "lunahr/SystemGemma2-2b-it"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lunahr/SystemGemma2-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/lunahr/SystemGemma2-2b-it

SGLang

How to use lunahr/SystemGemma2-2b-it with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "lunahr/SystemGemma2-2b-it" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lunahr/SystemGemma2-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "lunahr/SystemGemma2-2b-it" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "lunahr/SystemGemma2-2b-it",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Hallucination

by cnmoro - opened Aug 10, 2024

Discussion

cnmoro

Aug 10, 2024

On prompts with more than 8k tokens of context, it simply starts outputting nonsense text.

I took the raw text of a few books and tried to generate summaries. I have tried all sorts of prompting techniques, but it does not answer correctly: whatever my question is, it tries to continue the text of the book instead of answering. With under 8k tokens of context, it answers correctly.

I am using the updated tokenizer.
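For anyone hitting the same limit, one possible workaround is to keep every request under the 8k-token mark by splitting the tokenized book into chunks and summarizing each chunk separately. The sketch below is only an illustration: the function name `split_into_chunks`, the 8192-token limit, and the 512-token reserve for the instructions and the answer are assumptions, not anything the model card specifies.

```python
def split_into_chunks(token_ids, max_tokens=8192, reserve=512):
    """Split a list of token ids into consecutive chunks.

    Each chunk holds at most `max_tokens - reserve` tokens, so that the
    instructions and the generated answer still fit under `max_tokens`.
    Both numbers are assumptions chosen for illustration.
    """
    budget = max_tokens - reserve
    if budget <= 0:
        raise ValueError("reserve must be smaller than max_tokens")
    return [token_ids[i:i + budget] for i in range(0, len(token_ids), budget)]


# Stand-in for real token ids; with Transformers you would obtain them via
# tokenizer(text)["input_ids"] instead.
fake_ids = list(range(20_000))
chunks = split_into_chunks(fake_ids)
print(len(chunks), [len(c) for c in chunks])
```

Each chunk can then be decoded back to text and summarized on its own, with the partial summaries combined in a final pass.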

lunahr

Owner Aug 10, 2024

I have found out just now: RoPE scaling has no effect here. The RoPE frequency base has to be inherently set to 160000 to accept this many tokens. The models will be updated.

Source: https://www.reddit.com/r/LocalLLaMA/s/g9SNNrfbpK
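As a rough illustration of why the frequency base matters: in standard RoPE, dimension pair i rotates with inverse frequency base^(-2i/d), so a larger base stretches the slowest rotations over more token positions. The sketch below just compares the wavelengths for the two bases. The head dimension of 256 and the default base of 10000 are assumptions about this model's configuration, and nothing here shows that changing the base alone fixes long-context behaviour.

```python
import math


def rope_wavelengths(head_dim, base):
    """Wavelength, in token positions, of each rotary frequency pair.

    Standard RoPE uses inv_freq_i = base ** (-2 * i / head_dim), so the
    wavelength of pair i is 2 * pi * base ** (2 * i / head_dim).
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 256  # assumption: head dimension of the model
for base in (10_000, 160_000):  # assumed default base vs. the value quoted above
    longest = rope_wavelengths(HEAD_DIM, base)[-1]
    print(f"base={base}: longest wavelength is about {longest:,.0f} positions")
```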

lunahr

Owner Aug 10, 2024

Please recheck with the correct RoPE technique applied.

cnmoro

Aug 10, 2024

The issue persists :(

lunahr

Owner Aug 10, 2024 · edited Aug 10, 2024

The model weights will have to be updated. Stay tuned while I figure out how to get LongLM to cooperate.

This is a reference: https://www.reddit.com/r/LocalLLaMA/s/NTHnMabIDS

lunahr

Owner Aug 11, 2024 · edited Aug 11, 2024

I don't have enough memory for this, but I am going to try.

lunahr

Owner Aug 11, 2024 · edited Aug 11, 2024

Maybe this is going to work, but I don't know:

llama-cli -m <gguf> -c 32768 -n 1024 --temp 0 -gan 8 -gaw 4096
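To give a feel for what the group-attention flags are meant to achieve, here is a minimal sketch of the position remapping as I understand it from the Self-Extend (LongLM) description: nearby tokens keep their exact relative distance, while distant tokens use floor-divided "grouped" positions, shifted so the two ranges join up. Treating `-gan` as the group size and `-gaw` as the neighbor window is an assumption, and llama.cpp's actual implementation may differ in detail.

```python
def self_extend_rel_pos(q_pos, k_pos, group_size, window):
    """Relative position for a (query, key) pair under Self-Extend-style grouping."""
    distance = q_pos - k_pos
    if distance < 0:
        raise ValueError("causal attention: the key must not come after the query")
    if distance < window:
        # Neighbor attention: close tokens keep their exact relative distance.
        return distance
    # Grouped attention: positions are floor-divided by the group size and
    # shifted so the grouped range continues where the neighbor window ends.
    shift = window - window // group_size
    return q_pos // group_size - k_pos // group_size + shift


def max_rel_pos(context_len, group_size, window):
    """Largest relative position seen in a context of `context_len` tokens."""
    return self_extend_rel_pos(context_len - 1, 0, group_size, window)


# Values taken from the command above: -c 32768 -gan 8 -gaw 4096
print(max_rel_pos(32768, group_size=8, window=4096))  # 7679
```

Under these assumptions the largest relative position stays below 8192, which is the point of the technique: a 32k context is squeezed into the position range the model handles correctly.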

lunahr

Owner Aug 11, 2024 · edited Aug 11, 2024

@cnmoro I changed some more things about the sliding window, apart from applying the RoPE techniques.

If you still get hallucinations, sorry, I can't fix it due to a lack of resources at hand.

This model will not be advertised as having a large context size, to avoid any confusion that something is off.
