kosmylo1992 committed · verified · Commit 88d141c · Parent(s): 94d7f26

Update README.md

Files changed (1): README.md (+106 -3)
---
language:
  - en
license: apache-2.0
tags:
  - text-generation
  - causal-lm
  - continual-pretraining
  - lora
  - axolotl
  - deepspeed
  - transformers
  - mistral
  - nemo
  - eu-hpc
datasets:
  - arxiv
  - gov
  - news
  - wikipedia
metrics:
  - loss
library_name: transformers
framework: pytorch
base_model: mistralai/Mistral-Nemo-Instruct-2407
model_name: mistral-12b-cpt
pipeline_tag: text-generation
task_categories:
  - text-generation
model_type: AutoModelForCausalLM
inference:
  parameters:
    max_new_tokens: 512
    temperature: 0.7
    top_p: 0.9
trained_on:
  - Leonardo EuroHPC
description: >-
  Continual pretraining (CPT) of Mistral 12B Nemo Instruct using Axolotl and
  DeepSpeed ZeRO-1. Trained on scientific, government, news, and Wikipedia text
  with LoRA adapters.
---

# Mistral 12B — CPT (Continual Pretraining with LoRA)

**Model type:** Causal Language Model
**Base model:** [mistralai/Mistral-Nemo-Instruct-2407](https://huggingface.co/mistralai/Mistral-Nemo-Instruct-2407)
**License:** Apache 2.0
**Framework:** [Axolotl](https://github.com/OpenAccess-AI-Collective/axolotl)

---

## Overview

`mistral-12b-cpt` is a **continual-pretrained** version of the Mistral-12B Nemo Instruct model.
This CPT phase extends the model’s factual knowledge and energy-domain understanding using scientific, governmental, news, and encyclopedic text.

Training was executed on the **Leonardo EuroHPC** system using Axolotl with DeepSpeed ZeRO-1 for efficient large-scale distributed fine-tuning.

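For quick experimentation, the sketch below shows one way to run inference with the adapter. It is a minimal, unofficial example: the adapter repo id `kosmylo1992/mistral-12b-cpt` is an assumption, and the generation parameters are taken from the card metadata (max_new_tokens 512, temperature 0.7, top_p 0.9).

```python
# Minimal inference sketch (not an official example).
# Assumption: the LoRA adapter is hosted at "kosmylo1992/mistral-12b-cpt"
# (placeholder repo id) and can be attached to the base model with PEFT.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "mistralai/Mistral-Nemo-Instruct-2407"
ADAPTER = "kosmylo1992/mistral-12b-cpt"  # assumed adapter repo id

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)  # attach the CPT LoRA adapter

prompt = "Summarize the role of demand response in modern power grids."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Generation parameters mirror the card metadata.
output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, top_p=0.9, do_sample=True)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```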
---

## Training Setup

**Objective:** Unsupervised continual pretraining (language modeling)
**Adapter type:** LoRA
**Precision:** bfloat16
**Hardware:** 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total)
**Framework:** Axolotl + DeepSpeed + PyTorch 2.5.1 + CUDA 12.1
**Runtime:** 24 h
**Checkpoints:** 5 per epoch

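DeepSpeed ZeRO-1 shards only the optimizer states across GPUs, which keeps memory overhead low while the LoRA adapters themselves stay small. As a rough illustration (a sketch, not the configuration file actually used), a minimal ZeRO-1 setup consistent with the values in this card could look like:

```python
# Sketch of a DeepSpeed ZeRO-1 configuration consistent with this card.
# Not the original config; batch settings come from the Hyperparameters table below.
deepspeed_config = {
    "bf16": {"enabled": True},            # bfloat16 precision, as stated above
    "zero_optimization": {"stage": 1},    # ZeRO stage 1: shard optimizer states only
    "train_micro_batch_size_per_gpu": 2,  # micro batch size per GPU
    "gradient_accumulation_steps": 2,     # gradient accumulation steps
}
```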
---

## Dataset

| Dataset | Description |
|----------|-------------|
| `arxiv.jsonl` | Scientific and technical papers |
| `gov.jsonl` | Government and policy documents |
| `news.jsonl` | News articles |
| `wiki.jsonl` | Wikipedia text |

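The record schema of these files is not documented here; a common convention for continual-pretraining corpora (and the assumption in this sketch) is one JSON object per line carrying a single `text` field:

```python
# Sketch of the assumed JSONL layout: one {"text": ...} object per line.
# The actual field names used during training are not stated in this card.
import json

record = {"text": "Demand-side flexibility lets grid operators shift consumption ..."}

with open("example.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

with open("example.jsonl", encoding="utf-8") as f:
    for line in f:
        print(json.loads(line)["text"][:80])
```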
---

## Hyperparameters

| Parameter | Value |
|------------|-------|
| Sequence length | 2048 |
| Micro batch size | 2 |
| Gradient accumulation | 2 |
| Epochs | 10 |
| Max steps | 10000 |
| Learning rate | 0.0002 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA targets | q_proj, k_proj, v_proj, o_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ✅ |
| Loss watchdog (threshold/patience) | 5.0 / 3 |

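The LoRA rows in the table map directly onto a PEFT `LoraConfig`; the sketch below restates them in code (an assumed reconstruction, not the original Axolotl config) and works out the effective batch size implied by the values above.

```python
# Sketch: the table's LoRA hyperparameters expressed as a PEFT LoraConfig.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                      # LoRA rank
    lora_alpha=32,             # LoRA alpha
    lora_dropout=0.05,         # LoRA dropout
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Effective batch size implied by the setup:
# 2 (micro batch) x 2 (grad accumulation) x 16 GPUs (8 nodes x 2) = 64 sequences per step,
# i.e. 64 x 2048 tokens = 131,072 tokens per optimizer step.
micro_batch, grad_accum, gpus, seq_len = 2, 2, 16, 2048
print(micro_batch * grad_accum * gpus * seq_len)  # 131072
```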
---

## Tokenizer

**Tokenizer type:** `AutoTokenizer`
**Pad token:** `<|end_of_text|>`
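
A minimal sketch of loading the tokenizer and registering the pad token listed above (only the base-model repo id and the pad token string come from this card):

```python
# Sketch: load the base tokenizer and register the pad token stated above.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-Nemo-Instruct-2407")

# add_special_tokens adds the pad token to the vocabulary if it is not already present;
# if a token is actually added, the model embeddings would need resizing
# (model.resize_token_embeddings(len(tokenizer))).
tokenizer.add_special_tokens({"pad_token": "<|end_of_text|>"})
print(tokenizer.pad_token, tokenizer.pad_token_id)
```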