Update README.md
# Command-R 35B — SFT (Supervised Fine-Tuning on Synthetic QA)

**Model type:** Causal Language Model
**Base model:** [CohereLabs/c4ai-command-r-v01](https://huggingface.co/CohereLabs/c4ai-command-r-v01)

## Overview

`commandr-35b-sft` is a **supervised fine-tuned** variant of Cohere’s Command-R 35B model.
Fine-tuning was performed on a high-quality instruction-following dataset using LoRA adapters, enabling improved conversational reasoning and question answering.

Training was conducted on the **Leonardo EuroHPC** system.
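
If the fine-tuned weights are published as a standard Hugging Face checkpoint, loading them follows the usual `transformers` pattern. A minimal sketch, assuming the LoRA adapters were merged into the base model; the repo id is a placeholder, not a confirmed location:

```python
# Minimal inference sketch; assumes merged LoRA weights pushed to the Hub.
# The repo id below is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-org>/commandr-35b-sft"  # placeholder, not a published repo

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision below
    device_map="auto",
)

# Command-R ships a chat template, so the prompt is built with apply_chat_template.
messages = [{"role": "user", "content": "Explain supervised fine-tuning in two sentences."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```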

---

## Training Setup

**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
**Framework:** DeepSpeed ZeRO-1, Axolotl, PyTorch 2.5.1+cu121
**Runtime:** ~6 hours
**Dataset split:** 70% train / 30% validation
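
Axolotl consumes the ZeRO-1 settings from a DeepSpeed JSON file (its repository bundles one as `deepspeed_configs/zero1.json`). The exact file used for this run is not included here; an illustrative minimal ZeRO-1 config consistent with this card (bf16, micro batch 1, gradient accumulation 2) would be:

```json
{
  "zero_optimization": { "stage": 1 },
  "bf16": { "enabled": true },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 2,
  "gradient_clipping": 1.0
}
```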

---

## Dataset

**Name:** `axolotl_deduplicated_synthetic_qa.jsonl`
**Type:** Instruction-following synthetic QA dataset

Each sample follows the QA/chat format used by Axolotl's `alpaca_chat.load_qa` schema.
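
For reference, Axolotl's `alpaca_chat.load_qa` loader expects one JSON object per line with `question` and `answer` fields; a representative (invented) record:

```json
{"question": "What is LoRA fine-tuning?", "answer": "LoRA freezes the base model weights and trains small low-rank update matrices on selected projection layers, which greatly reduces the number of trainable parameters."}
```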

---

## Training Hyperparameters

| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 2 |
| Epochs | 1 |
| Learning rate | 0.0001 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 20 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Auto resume | ✅ |
| Loss watchdog threshold | 8.0 |
| Loss watchdog patience | 20 |
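
As a rough cross-check, the values above map onto an Axolotl config along the following lines. This is a reconstruction from the table, not the original file; such a config would typically be launched with something like `accelerate launch -m axolotl.cli.train config.yaml`:

```yaml
# Reconstructed from the table above; not the original training config.
base_model: CohereLabs/c4ai-command-r-v01

adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj

datasets:
  - path: axolotl_deduplicated_synthetic_qa.jsonl
    type: alpaca_chat.load_qa
val_set_size: 0.3  # 70% train / 30% validation

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 2
num_epochs: 1
learning_rate: 0.0001
lr_scheduler: cosine
optimizer: adamw_bnb_8bit  # 8-bit AdamW
warmup_steps: 20
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: true
auto_resume_from_checkpoints: true
loss_watchdog_threshold: 8.0
loss_watchdog_patience: 20

deepspeed: deepspeed_configs/zero1.json
```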

---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`
**Special token:** `<|end_of_text|>` as `pad_token`
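
Because the pad token is assigned rather than native to the base tokenizer, downstream code that batches inputs should mirror this setup. A minimal sketch, with a placeholder repo id:

```python
# Mirror the card's padding setup when batching inputs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("<your-org>/commandr-35b-sft")  # placeholder id
if tokenizer.pad_token is None:
    tokenizer.pad_token = "<|end_of_text|>"  # pad token used for this SFT run

batch = tokenizer(
    ["Short prompt", "A somewhat longer prompt"],
    padding=True,
    return_tensors="pt",
)
print(batch["input_ids"].shape)
```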
|