We thank everyone who contributed to data collection and preparation, which enabled us to train our model. Special thanks go to Nikola Ljubešić, Taja Kuzman, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec, Iztok Kosem and Tomaž Savodnik.
The model's development was supported by NVIDIA as part of their Sovereign AI initiative. We are grateful for the access to [NVIDIA DGX Cloud Lepton](https://developer.nvidia.com/blog/introducing-nvidia-dgx-cloud-lepton-a-unified-ai-platform-built-for-developers/), and for all the support and help we received from a group of exceptional people at NVIDIA.

## Basic information

- **Developed by:** a team of researchers at the University of Ljubljana, Faculty of Computer and Information Science. Team members: Domen Vreš, Iztok Lebar Bajec, Tjaša Arčon, Timotej Petrič, Dario Vajda and Marko Robnik-Šikonja.
- **Languages:** Slovene and English (primary); Croatian, Bosnian and Serbian (secondary). The model might also work for other languages supported by Gemma 3, even though it was not continually pretrained on them.
- **Base model:** [google/gemma-3-12b-pt](https://huggingface.co/google/gemma-3-12b-pt)
- **License:** [Gemma](https://ai.google.dev/gemma/terms)

## Usage

### Transformers library

The model can be run through the `pipeline` API using the following code:

```python
from transformers import pipeline

model_id = "cjvt/GaMS3-12B-Instruct"

model = pipeline(
    "text-generation",
    model=model_id,
    device_map="cuda"  # replace with "mps" to run on a Mac device
)

# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = model(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])

# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = model(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```
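
The conversation-chain pattern above relies on the shape of the pipeline's chat output: for chat-style inputs, `generated_text` holds the full message list (the input turns plus the model's reply), so appending the next user turn continues the same conversation. A minimal sketch of that structure, with a placeholder standing in for real model output:

```python
# Illustrative structure of the pipeline's chat output; "..." is a placeholder
# for real model output, not something the model actually returned.
response = [{
    "generated_text": [
        {"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"},
        {"role": "assistant", "content": "..."},  # model's reply
    ]
}]

# The assistant's reply is the last message in the returned conversation
reply = response[0]["generated_text"][-1]["content"]

# Continuing the chat means appending the next user turn to that same list
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
```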

For multi-GPU inference, set `device_map` to `"auto"` (the `accelerate` library is required):

```python
from transformers import pipeline

model_id = "cjvt/GaMS3-12B-Instruct"

model = pipeline(
    "text-generation",
    model=model_id,
    device_map="auto"
)

# Example of response generation
message = [{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]
response = model(message, max_new_tokens=512)
print("Model's response:", response[0]["generated_text"][-1]["content"])

# Example of conversation chain
new_message = response[0]["generated_text"]
new_message.append({"role": "user", "content": "Lahko bolj podrobno opišeš ta dogodek?"})
response = model(new_message, max_new_tokens=1024)
print("Model's response:", response[0]["generated_text"][-1]["content"])
```

### vLLM library

Since the Gemma 3 architecture is supported in vLLM, our model is supported as well.

**NOTE:** We noticed a degradation in performance when the FlashInfer attention backend is used. For optimal performance, please use the Flash Attention backend.
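
One way to pin the backend is vLLM's `VLLM_ATTENTION_BACKEND` environment variable; the value must be set before vLLM is imported or the engine is created, and the accepted value names (e.g. `FLASH_ATTN`, `FLASHINFER`) may differ between vLLM versions:

```python
import os

# Force the Flash Attention backend; must be set before importing vLLM.
# (Accepted value names may differ across vLLM versions.)
os.environ["VLLM_ATTENTION_BACKEND"] = "FLASH_ATTN"
```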

Example vLLM code:

```python
from vllm import LLM, SamplingParams

model = LLM("cjvt/GaMS3-12B-Instruct")

sampling_params = SamplingParams(
    n=1,
    temperature=0.6,
    top_p=0.9,
    max_tokens=1024
)

messages = [[{"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}]]
response = model.chat(messages, sampling_params)
print("Model's response:", response[0].outputs[0].text)
```
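
Besides offline inference, vLLM can also expose the model through its OpenAI-compatible HTTP server (started with `vllm serve cjvt/GaMS3-12B-Instruct`). A request is then a standard chat-completions payload; the sketch below only builds the JSON body (the port and the choice of sampling values mirroring the `SamplingParams` above are illustrative assumptions):

```python
import json

# Chat-completions request body for a locally running vLLM server,
# e.g. started with: vllm serve cjvt/GaMS3-12B-Instruct --port 8000
# Sampling values mirror the SamplingParams example above.
payload = {
    "model": "cjvt/GaMS3-12B-Instruct",
    "messages": [
        {"role": "user", "content": "Kateri je najpomembnejši dogodek v slovenski zgodovini?"}
    ],
    "temperature": 0.6,
    "top_p": 0.9,
    "max_tokens": 1024,
}

body = json.dumps(payload, ensure_ascii=False)
# POST this body to http://localhost:8000/v1/chat/completions
```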

## Training Infrastructure

The model was trained on the following HPC infrastructure:

- EuroHPC supercomputer [LEONARDO](https://www.hpc.cineca.it/systems/hardware/leonardo/): We managed to scale the training across 128 nodes on LEONARDO's Booster partition. We used approximately **150k GPU hours** on LEONARDO for the development of this model (including data preparation such as translation and web rewrite, and model training).
- Faculty's B200 node: With 8 B200 GPUs, our faculty's node represents modern infrastructure for AI development. However, as we have only a single such node, the majority of the training was done elsewhere. In total, around **1000 GPU hours** were used on the B200 node.
- [NVIDIA DGX Cloud Lepton](https://developer.nvidia.com/blog/introducing-nvidia-dgx-cloud-lepton-a-unified-ai-platform-built-for-developers/): A unified AI platform that connects developers to tens of thousands of GPUs from a global network of cloud providers, accelerating developer productivity by providing access to GPU capacity and AI services across the NVIDIA compute ecosystem. It integrates seamlessly with the NVIDIA software stack, enabling developers to build, train, and deploy AI applications quickly and at scale. We spent approximately **40k GPU hours** on Lepton.

## Data and benchmark information