---
library_name: transformers
pipeline_tag: text-generation
---

# Model card

This is Dippy AI's reference Gemma 2 27B model.

#### Optimizations

* _Flash Attention 2_

First, make sure to install `flash-attn` in your environment: `pip install flash-attn`. Then enable it when loading the model:

```diff
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
+   attn_implementation="flash_attention_2"
).to(0)
```
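
Putting the pieces together, a complete load with Flash Attention 2 enabled might look like the sketch below. It assumes a CUDA GPU, that `flash-attn` is installed, and uses the `google/gemma-2-27b-it` id from the example further down; substitute this repository's own model id as appropriate.

```py
# Minimal sketch: load the model with Flash Attention 2 enabled.
# Assumes a CUDA GPU and that `flash-attn` is installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-2-27b-it"  # assumption: swap in this repository's id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to(0)
```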

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2-27b-it"
dtype = torch.bfloat16

# Load the tokenizer and the model on a CUDA device in bfloat16.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

# A single-turn conversation; apply_chat_template adds the Gemma turn markers.
chat = [
    {"role": "user", "content": "Write a hello world program"},
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template.
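
For example, the single-turn prompt above could be assembled by hand with this minimal sketch, which simply reproduces the template text shown earlier (including the `<bos>` token, so it can be encoded with `add_special_tokens=False` as in the next snippet):

```py
# Minimal sketch: build the prompt by hand instead of using apply_chat_template.
# The string mirrors the template shown above, including the <bos> token.
user_message = "Write a hello world program"
prompt = (
    "<bos><start_of_turn>user\n"
    f"{user_message}<end_of_turn>\n"
    "<start_of_turn>model\n"
)
```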

After the prompt is ready, generation can be performed like this:

```py
# Encode the prompt without adding special tokens, since the chat template
# already includes <bos>, then generate and decode the result.
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
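
The decoded output also contains the original prompt. If you only want the model's reply, one option is to slice off the prompt tokens before decoding, as in this small sketch:

```py
# Sketch: decode only the newly generated tokens, skipping special tokens.
prompt_length = inputs.shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
```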