---
base_model: unsloth/gemma-2-2b-it-bnb-4bit
language:
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- gemma2
- trl
---

# Athena-codegemma-2-2b-it for coding

Supervised fine-tuned (SFT with Unsloth) for coding on the EpistemeAI coding dataset.
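
As a quick illustration of the intended coding use case, here is a minimal sketch using the `pipeline` API (the prompt is illustrative, not from the original card; see the Usage section below for installation and more options):

```python
import torch
from transformers import pipeline

# Load the fine-tuned coding model (assumes transformers and accelerate are installed)
pipe = pipeline(
    "text-generation",
    model="EpistemeAI/Athena-codegemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" on a Mac
)

messages = [
    {"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."},
]
outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"].strip())
```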

# Original Model card

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
They are text-to-text, decoder-only large language models, available in English,
with open weights for both pre-trained variants and instruction-tuned variants.
Gemma models are well-suited for a variety of text generation tasks, including
question answering, summarization, and reasoning. Their relatively small size
makes it possible to deploy them in environments with limited resources such as
a laptop, desktop or your own cloud infrastructure, democratizing access to
state of the art AI models and helping foster innovation for everyone.

### Usage

Below we share some code snippets on how to quickly get started with running the model. First, install the Transformers library with:

```sh
pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.

#### Running with the `pipeline` API

```python
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="EpistemeAI/Athena-codegemma-2-2b-it",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",  # replace with "mps" to run on a Mac device
)

messages = [
    {"role": "user", "content": "Who are you? Please, answer in pirate-speak."},
]
outputs = pipe(messages, max_new_tokens=256)
assistant_response = outputs[0]["generated_text"][-1]["content"].strip()
print(assistant_response)
# Ahoy, matey! I be Gemma, a digital scallywag, a language-slingin' parrot of the digital seas. I be here to help ye with yer wordy woes, answer yer questions, and spin ye yarns of the digital world. So, what be yer pleasure, eh? 🦜
```

#### Running the model on a single / multi GPU

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

You can ensure the correct chat template is applied by using `tokenizer.apply_chat_template` as follows:

```python
messages = [
    {"role": "user", "content": "Write me a poem about Machine Learning."},
]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt", return_dict=True).to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0]))
```

<a name="precisions"></a>
#### Running the model on a GPU using different precisions

The native weights of this model were exported in `bfloat16` precision.

You can also use `float32` if you omit the dtype, but no precision increase will occur (the model weights will simply be upcast to `float32`). See the example below.

* _Upcasting to `torch.float32`_

```python
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    device_map="auto",
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```

#### Running the model through a CLI

The [local-gemma](https://github.com/huggingface/local-gemma) repository contains a lightweight wrapper around Transformers
for running Gemma 2 through a command line interface, or CLI. Follow the [installation instructions](https://github.com/huggingface/local-gemma#cli-usage)
to get started, then launch the CLI with the following command:

```shell
local-gemma --model 2b --preset speed
```

#### Quantized Versions through `bitsandbytes`

<details>
<summary>
Using 8-bit precision (int8)
</summary>

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
</details>

<details>
<summary>
Using 4-bit precision
</summary>

```python
# pip install bitsandbytes accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = AutoModelForCausalLM.from_pretrained(
    "EpistemeAI/Athena-codegemma-2-2b-it",
    quantization_config=quantization_config,
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```
</details>

#### Advanced Usage

<details>
<summary>
Torch compile
</summary>

[Torch compile](https://pytorch.org/tutorials/intermediate/torch_compile_tutorial.html) is a method for speeding up the
inference of PyTorch modules. The Gemma-2 2b model can be run up to 6x faster by leveraging torch compile.

Note that two warm-up steps are required before the full inference speed is realised:

```python
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from transformers import AutoTokenizer, Gemma2ForCausalLM
from transformers.cache_utils import HybridCache
import torch

torch.set_float32_matmul_precision("high")

# load the model + tokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
model = Gemma2ForCausalLM.from_pretrained("EpistemeAI/Athena-codegemma-2-2b-it", torch_dtype=torch.bfloat16)
model.to("cuda")

# apply the torch compile transformation
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

# pre-process inputs
input_text = "The theory of special relativity states "
model_inputs = tokenizer(input_text, return_tensors="pt").to("cuda")
prompt_length = model_inputs.input_ids.shape[1]

# set-up k/v cache
past_key_values = HybridCache(
    config=model.config,
    max_batch_size=1,
    max_cache_len=model.config.max_position_embeddings,
    device=model.device,
    dtype=model.dtype
)

# enable passing kv cache to generate
model._supports_cache_class = True
model.generation_config.cache_implementation = None

# two warm-up steps
for idx in range(2):
    outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
    past_key_values.reset()

# fast run
outputs = model.generate(**model_inputs, past_key_values=past_key_values, do_sample=True, temperature=1.0, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For more details, refer to the [Transformers documentation](https://huggingface.co/docs/transformers/main/en/llm_optims?static-kv=basic+usage%3A+generation_config).

</details>

### Chat Template

The instruction-tuned models use a chat template that must be adhered to for conversational use.
The easiest way to apply it is using the tokenizer's built-in chat template, as shown in the following snippet.

Let's load the model and apply the chat template to a conversation. In this example, we'll start with a single user interaction:

```py
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-2-2b-it"
dtype = torch.bfloat16

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda",
    torch_dtype=dtype,
)

chat = [
    { "role": "user", "content": "Write a hello world program" },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
```

At this point, the prompt contains the following text:

```
<bos><start_of_turn>user
Write a hello world program<end_of_turn>
<start_of_turn>model
```

As you can see, each turn is preceded by a `<start_of_turn>` delimiter and then the role of the entity
(either `user`, for content supplied by the user, or `model` for LLM responses). Turns finish with
the `<end_of_turn>` token.

You can follow this format to build the prompt manually, if you need to do it without the tokenizer's
chat template; a minimal sketch of doing so follows.
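
The helper below is purely illustrative (it is not part of the Transformers API); it reproduces the same string that `apply_chat_template` builds above, including the leading `<bos>` token and the trailing open `model` turn:

```py
def build_gemma_prompt(messages):
    # Reproduce the Gemma 2 chat format by hand:
    # <bos>, then one <start_of_turn>{role}\n{content}<end_of_turn> block per turn,
    # and finally an open model turn so generation continues as the assistant.
    prompt = "<bos>"
    for message in messages:
        prompt += f"<start_of_turn>{message['role']}\n{message['content']}<end_of_turn>\n"
    prompt += "<start_of_turn>model\n"
    return prompt

prompt = build_gemma_prompt(chat)  # same conversation as above
```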

After the prompt is ready, generation can be performed like this:

```py
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```
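
To continue the conversation, one option is to append the model's reply and the next user message to `chat`, then re-apply the chat template and generate again. A sketch (the follow-up question is illustrative):

```py
# Keep only the newly generated tokens as the model's reply.
reply = tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

# Extend the conversation; the chat template renders the "assistant" role as Gemma's "model" turn.
chat.append({"role": "assistant", "content": reply})
chat.append({"role": "user", "content": "Now write the same program in C."})

prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0]))
```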

### Inputs and outputs

* **Input:** Text string, such as a question, a prompt, or a document to be
  summarized.
* **Output:** Generated English-language text in response to the input, such
  as an answer to a question, or a summary of a document.

### Citation

```none
@article{gemma_2024,
    title={Gemma},
    url={https://www.kaggle.com/m/3301},
    DOI={10.34740/KAGGLE/M/3301},
    publisher={Kaggle},
    author={Gemma Team},
    year={2024}
}
```

# Uploaded model

- **Developed by:** EpistemeAI
- **License:** apache-2.0
- **Finetuned from model:** unsloth/gemma-2-2b-it-bnb-4bit

This gemma2 model was trained 2x faster with [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face's TRL library.
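
The training script itself is not published in this card; the following is only a rough sketch of what SFT with Unsloth and TRL typically looks like for this base model. The dataset path, LoRA settings, and hyperparameters are placeholders, not the actual recipe used:

```python
# Hypothetical sketch, not the actual training code for this model.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the 4-bit base model listed above together with its tokenizer.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-it-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small fraction of the weights is trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)

# Placeholder dataset: a JSONL file with a "text" column of formatted coding examples.
dataset = load_dataset("json", data_files="coding_sft.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        learning_rate=2e-4,
        num_train_epochs=1,
        output_dir="outputs",
    ),
)
trainer.train()
```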

[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)