Instructions to use google/gemma-2-9b-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use google/gemma-2-9b-it with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("text-generation", model="google/gemma-2-9b-it") messages = [ {"role": "user", "content": "Who are you?"}, ] pipe(messages)# Load model directly from transformers import AutoTokenizer, AutoModelForCausalLM tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it") model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it") messages = [ {"role": "user", "content": "Who are you?"}, ] inputs = tokenizer.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Inference
- Local Apps Settings
- vLLM
How to use google/gemma-2-9b-it with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "google/gemma-2-9b-it" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-2-9b-it", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker
docker model run hf.co/google/gemma-2-9b-it
- SGLang
How to use google/gemma-2-9b-it with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "google/gemma-2-9b-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-2-9b-it", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "google/gemma-2-9b-it" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "google/gemma-2-9b-it", "messages": [ { "role": "user", "content": "What is the capital of France?" } ] }' - Docker Model Runner
How to use google/gemma-2-9b-it with Docker Model Runner:
docker model run hf.co/google/gemma-2-9b-it
Deployment to Inference Endpoints failed
This error is similar to what being faced with llama-3-1, the endpoint creation on Hub fails. I had previously deployed the model on dedicated endpoint and it was working fine, now its not working.
Exit code: 1. Reason: /opt/conda/lib/python3.10/site-packages/text_generation_server/models/init.py", line 610, in get_model\n raise NotImplementedError("sharded is not supported for AutoModel")\n\nNotImplementedError: sharded is not supported for AutoModel\n"},"target":"text_generation_launcher","span":{"rank":0,"name":"shard-manager"},"spans":[{"rank":0,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:52.910377Z","level":"ERROR","fields":{"message":"Shard 3 failed to start"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-07T08:25:52.910402Z","level":"INFO","fields":{"message":"Shutting down shards"},"target":"text_generation_launcher"}
{"timestamp":"2024-08-07T08:25:52.912665Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:52.912699Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:52.912698Z","level":"INFO","fields":{"message":"Terminating shard"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:52.912712Z","level":"INFO","fields":{"message":"Waiting for shard to gracefully shutdown"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:53.012880Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":1,"name":"shard-manager"},"spans":[{"rank":1,"name":"shard-manager"}]}
{"timestamp":"2024-08-07T08:25:53.012883Z","level":"INFO","fields":{"message":"shard terminated"},"target":"text_generation_launcher","span":{"rank":2,"name":"shard-manager"},"spans":[{"rank":2,"name":"shard-manager"}]}
Error: ShardCannotStart
I tried GPU · Nvidia L4 · 4x GPUs · 96 GB, with TGI, on all settings left on default. Can someone else facing this issue too? Any help is appreciated.
Thanks
I still can't deploy this to inference endpoints. Does this require flash attention with Ampere GPU? It can't fit within 24GB if that's the case and the next tier up is 96GB, which is too expensive