---
library_name: transformers
pipeline_tag: text-generation
---

# Model card

This is Dippy AI's reference Gemma 2 27B model.

#### Optimizations

* _Flash Attention 2_

First, make sure to install `flash-attn` in your environment: `pip install flash-attn`. Then enable it when loading the model:

```diff
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
+   attn_implementation="flash_attention_2"
).to(0)
```
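
Putting the pieces together, a complete load with Flash Attention 2 enabled might look like the sketch below. It assumes a CUDA GPU, that `flash-attn` is installed, and uses the `google/gemma-2-27b-it` id from the example further down; substitute this repository's own model id as appropriate.

```py
# Minimal sketch: load the model with Flash Attention 2 enabled.
# Assumes a CUDA GPU and that `flash-attn` is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumption: swap in this repository's id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(0)
```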

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
dtype = torch.bfloat16

# Load the tokenizer and the model on a CUDA device in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

# A single-turn conversation; apply_chat_template adds the Gemma turn markers.
chat = [
    {"role": "user", "content": "Write a hello world program"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template.
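
For example, the single-turn prompt above could be assembled by hand with this minimal sketch, which simply reproduces the template text shown earlier (including the `<bos>` token, so it can be encoded with `add_special_tokens=False` as in the next snippet):

```py
# Minimal sketch: build the prompt by hand instead of using apply_chat_template.
# The string mirrors the template shown above, including the <bos> token.
user_message = "Write a hello world program"
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
```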

After the prompt is ready, generation can be performed like this:

```py
# Encode the prompt without adding special tokens, since the chat template
# already includes <bos>, then generate and decode the result.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
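
The decoded output also contains the original prompt. If you only want the model's reply, one option is to slice off the prompt tokens before decoding, as in this small sketch:

```py
# Sketch: decode only the newly generated tokens, skipping special tokens.
prompt_length = inputs.shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```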