dvres committed on
Commit e5803ef · verified · 1 Parent(s): ffe0f28

Update README.md

Files changed (1)
  1. README.md +33 -3

README.md CHANGED
@@ -23,7 +23,7 @@ The model was developed within the [PoVeJMo](https://www.cjvt.si/povejmo/en/proj

We thank everyone who contributed to data collection and preparation, which enabled us to train our model. Special thanks go to Nikola Ljubešić, Taja Kuzman, Tjaša Arčon, Jaka Čibej, Simon Krek, Tomaž Erjavec, Iztok Kosem and Tomaž Savodnik.

- The model's development was supported by NVIDIA as a part of their Sovereign AI initiative. We are thankful for the access to [NVIDIA DGX Cloud Lepton](https://developer.nvidia.com/blog/introducing-nvidia-dgx-cloud-lepton-a-unified-ai-platform-built-for-developers/). We are also extremely grateful for all the support and help we received from a group of exceptional people at NVIDIA.
+ The model's development was supported by NVIDIA as part of their Sovereign AI initiative. We are thankful for access to [NVIDIA DGX Cloud Lepton](https://developer.nvidia.com/blog/introducing-nvidia-dgx-cloud-lepton-a-unified-ai-platform-built-for-developers/). We are also extremely grateful for all the support and help we received from a group of exceptional people at NVIDIA: Anna Louise Ollerenshaw, Meriem Bendris, Oleg Sudakov, Benedetta Delfino, Rita Fernandes Neves, Andrea Pilzer, Miguel Martinez, Noel Osagie, Adam Henryk Grzywaczewski and Aleks Polak.

## Basic information

@@ -111,13 +111,43 @@ response = model.chat(messages, sampling_params)
print("Model's response:", response[0].outputs[0].text)
```

- ## Training Infrastructure
+ ## Training
+
+ The training was performed in 3 CPT (continued pretraining) and 2 SFT (supervised fine-tuning) stages.
+
+ CPT stages:
+ - **Parallel alignment**: the model was pretrained on parallel English and Slovene texts with a context window of 65,536 tokens;
+ - **Base CPT**: the model was pretrained on a combination of Slovene, English, Croatian, Bosnian and Serbian corpora with a context window of 65,536 tokens;
+ - **Long CPT**: the model was pretrained on a combination of high-quality Slovene, English, Croatian, Bosnian and Serbian corpora with a context window of 131,072 tokens.
+
+ SFT stages:
+ - **Base instruction-following SFT**: the model was trained on a dataset spanning a variety of tasks (open/closed question answering, writing, math, code) and topics;
+ - **Chat and safety tuning**: the model was trained on a combination of chat-oriented examples and a small set of safety prompts.
+
+ ### Infrastructure

The model was trained on the following HPC infrastructure:
- EuroHPC supercomputer [LEONARDO](https://www.hpc.cineca.it/systems/hardware/leonardo/): We scaled the training across 128 nodes on LEONARDO's booster partition. We used approximately **150k GPU hours** on LEONARDO for the development of this model (including data preparation such as translation and web rewrite, and model training).
- - Faculty's B200 node: With 8 B200 GPUs, our faculty's node represents a modern infrastructure for AI development. However, as we have only a single such node, the majority of the training was done elsewhere. In total around **1000 GPU hours** were used on B200 node.
+ - Faculty's B200 node: With 8 B200 GPUs, our faculty's node represents modern infrastructure for AI development. However, as we have only a single such node, the majority of the training was done elsewhere. In total, around **1000 GPU hours** were used on the B200 node.
- [NVIDIA DGX Cloud Lepton](https://developer.nvidia.com/blog/introducing-nvidia-dgx-cloud-lepton-a-unified-ai-platform-built-for-developers/): A unified AI platform that connects developers to tens of thousands of GPUs from a global network of cloud providers. It addresses a critical need: accelerating AI developer productivity by providing access to GPU capacity and AI services across the NVIDIA compute ecosystem. It integrates seamlessly with the NVIDIA software stack, enabling developers to build, train, and deploy AI applications quickly and at scale. We spent approximately **40k Lepton GPU hours**.

+ ### Software
+
+ - **CPT**: [NVIDIA NeMo Framework 2.0](https://github.com/NVIDIA-NeMo/NeMo) (container version [25.07](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo?version=25.07));
+ - **SFT**: the [Transformers library](https://huggingface.co/docs/transformers/index) in combination with [DeepSpeed](https://huggingface.co/docs/transformers/main_classes/deepspeed) and [TRL](https://huggingface.co/docs/trl/index).
+
+ ### Training hyperparameters
+
+ In line with our commitment to transparency, open science, and the sharing of knowledge, we openly disclose all training hyperparameters used in developing this model. All training stages were performed with **bfloat16** precision and the **Adam** optimizer.
+
+ | Stage | Model Parallelism | Data Parallelism | Batch Size | Micro Batch Size | LR Scheduler | Min LR | Max LR | Warmup Steps | Constant Steps | Epochs |
+ |-------|-------------------|------------------|------------|------------------|--------------|--------|--------|--------------|----------------|--------|
+ | Parallel alignment | TP 8 | 64 | 128 | 1 | Cosine with warmup | 5e-7 | 5e-6 | 150 | 200 | 1 |
+ | Base CPT | TP 8 | 64 | Ramp-up: 128 (961 steps) -> 192 (600 steps) -> 256 | 1 | Cosine with warmup | 5e-7 | 5e-6 | 1000 | 1000 | 1 |
+ | Long CPT | TP 8 | 16 | 64 | 1 | Constant with warmup | / | 5e-6 | 500 | / | 1 |
+ | Base instruction-following SFT | DeepSpeed ZeRO Stage 2 | 8 | 64 | 8 | Cosine with warmup | 1e-6 | 5e-6 | 1000 | 0 | 3 (checkpoint after epoch 2 was selected) |
+ | Chat and safety tuning | DeepSpeed ZeRO Stage 2 | 8 | 64 | 8 | Cosine with warmup | 1e-6 | 5e-6 | 1000 | 0 | 3 (checkpoint after epoch 2 was selected) |
+
## Data and benchmark information

Coming soon!
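As a worked illustration of the "Cosine with warmup" schedules named in the hyperparameter table, the sketch below computes the learning rate at a given step under the standard definition of that schedule family with an optional constant phase. This is a generic sketch, not the project's actual NeMo or DeepSpeed scheduler code; the values plugged in are the table's Base CPT settings (max LR 5e-6, min LR 5e-7, 1000 warmup steps, 1000 constant steps), while `total_steps` here is purely illustrative.

```python
import math

def lr_at_step(step, max_lr, min_lr, warmup_steps, constant_steps, total_steps):
    """Cosine decay with linear warmup and an optional constant phase.

    Generic sketch of the schedule family named in the table, not the
    exact framework implementation used for training.
    """
    if step < warmup_steps:
        # Linear warmup from 0 up to max_lr.
        return max_lr * step / warmup_steps
    if step < warmup_steps + constant_steps:
        # Hold at max_lr during the constant phase.
        return max_lr
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    decay_steps = total_steps - warmup_steps - constant_steps
    progress = (step - warmup_steps - constant_steps) / decay_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Base CPT values from the table; total step count is illustrative.
print(lr_at_step(500, 5e-6, 5e-7, 1000, 1000, 10_000))    # mid-warmup
print(lr_at_step(1500, 5e-6, 5e-7, 1000, 1000, 10_000))   # constant phase
print(lr_at_step(10_000, 5e-6, 5e-7, 1000, 1000, 10_000)) # fully decayed
```

Setting `constant_steps` to 0 gives the plain cosine-with-warmup used in the SFT stages.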
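The SFT rows of the table combine DeepSpeed ZeRO Stage 2, bfloat16, a global batch size of 64, micro batch size 8, and data parallelism 8. A minimal sketch of how those numbers fit together in a DeepSpeed configuration (field names follow DeepSpeed's public config schema; this is an assumed illustration, not the project's actual configuration file):

```python
# Hypothetical DeepSpeed config reflecting the table's SFT settings.
# global batch = micro batch per GPU * data-parallel ranks * grad accumulation
data_parallel_ranks = 8
ds_config = {
    "bf16": {"enabled": True},                 # bfloat16 precision, as stated above
    "zero_optimization": {"stage": 2},         # ZeRO Stage 2
    "train_batch_size": 64,                    # global batch size from the table
    "train_micro_batch_size_per_gpu": 8,       # micro batch size from the table
    "gradient_accumulation_steps": 64 // (8 * data_parallel_ranks),  # = 1
}
```

With 8 ranks and micro batch 8, no gradient accumulation is needed to reach the global batch of 64.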