Update README.md
README.md CHANGED
@@ -2,7 +2,7 @@
license: apache-2.0
---
# Introduction
-CSMPT7b is a large Czech language model continously pretrained from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model
+CSMPT7b is a large Czech language model continuously pretrained for 272b training tokens from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) with a Czech tokenizer obtained through our vocabulary swap method (see below).

# Eval
Dev eval on CS-HellaSwag (an automatically translated HellaSwag benchmark).
@@ -48,7 +48,33 @@ Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. S


## Training Method
-
+### Vocabulary Swap
+The vocabulary swap was done the same way as for our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (see its model card for a comprehensive description).
+We managed to align 4,177 English tokens with their corresponding Czech tokens; a simplified sketch of the embedding transfer is shown below this hunk.
+
+## Hyperparameters
+Hyperparameters not mentioned here were kept the same as for MPT.
+| **Name**                   | **Value**   | **Note**                                                                                              |
+|----------------------------|-------------|-------------------------------------------------------------------------------------------------------|
+| training sw                | llm-foundry | We've done some minor patching (e.g., to allow DDP sync over a file)                                  |
+| dataset_type               | Concat      | Sequences at the model's input were concatenated up to `$max_seq_len` and separated by the EOS token (sketched below). |
+| tokenizer_size             | 64k         | Same as in [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)                          |
+| max_seq_len                | 2048        |                                                                                                       |
+| batch_size                 | 1024        |                                                                                                       |
+| learning_rate              | 1.0e-4      |                                                                                                       |
+| optimizer                  | LionW       |                                                                                                       |
+| optimizer_betas            | 0.9/0.95    |                                                                                                       |
+| optimizer_weight_decay     | 0           |                                                                                                       |
+| optimizer_eps              | 1.0e-08     |                                                                                                       |
+| gradient_clipping_max_norm | 1.0         |                                                                                                       |
+| attn_impl                  | flash2      | We used the Triton flash-attn 1 implementation for the initial ~60k steps                             |
+| positional_encoding        | alibi       |                                                                                                       |
+| fsdp                       | FULL_SHARD  | (we had implementation issues with hybrid sharding in llm-foundry)                                    |
+| precision                  | bf16        |                                                                                                       |
+| scheduler                  | cosine      |                                                                                                       |
+| scheduler_warmup           | 100 steps   |                                                                                                       |
+| scheduler_steps            | 170,000     |                                                                                                       |
+| scheduler_alpha            | 0.1         | So the LR on the last step is 0.1 * (vanilla LR)                                                      |


# Usage
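As a rough illustration of the vocabulary swap mentioned in the hunk above: the idea is to keep the pretrained embeddings of tokens that exist in both the English and the Czech vocabulary and to initialize the remaining Czech embeddings from scratch. The sketch below assumes a target tokenizer repo id and a simple exact-string matching criterion; the authoritative description is the Czech-GPT-2 model card linked above.

```python
# Illustrative vocabulary-swap style embedding transfer (a sketch, not the exact CSMPT7b recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src_name = "mosaicml/mpt-7b"   # English source model (from the README)
tgt_name = "BUT-FIT/csmpt7b"   # assumed repo id for the Czech tokenizer

src_tok = AutoTokenizer.from_pretrained(src_name)
tgt_tok = AutoTokenizer.from_pretrained(tgt_name)
model = AutoModelForCausalLM.from_pretrained(src_name, trust_remote_code=True)

src_emb = model.get_input_embeddings().weight.data   # [src_vocab_size, d_model]
d_model = src_emb.shape[1]
src_vocab = src_tok.get_vocab()                       # token string -> id
tgt_vocab = tgt_tok.get_vocab()

# Random init for the new (Czech) embedding matrix, then copy rows for shared tokens.
new_emb = torch.empty(len(tgt_vocab), d_model)
torch.nn.init.normal_(new_emb, std=0.02)
shared = [(tgt_id, src_vocab[tok]) for tok, tgt_id in tgt_vocab.items() if tok in src_vocab]
for tgt_id, src_id in shared:
    new_emb[tgt_id] = src_emb[src_id]
print(f"aligned {len(shared)} tokens")                # the README reports 4,177 aligned tokens

# Install the new embedding matrix (MPT ties input embeddings and LM head).
model.resize_token_embeddings(len(tgt_vocab))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Exact string matching is only the simplest possible alignment criterion; how many tokens it recovers depends on how the two BPE vocabularies were built, so treat the Czech-GPT-2 write-up as the definitive description.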
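The `Concat` dataset type in the hyperparameter table is the usual sequence-packing scheme: tokenized documents are joined with the EOS token and cut into fixed-length blocks of `max_seq_len`. A minimal sketch of that packing, independent of llm-foundry's actual data loader (which is not shown here):

```python
# Illustrative sequence packing ("Concat"): tokenized documents are joined with EOS
# and chunked into fixed-length training examples.
from typing import Iterable, Iterator

def pack_sequences(docs: Iterable[list[int]], eos_id: int, max_seq_len: int = 2048) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)           # EOS separates concatenated documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]  # one fixed-length training example
            buffer = buffer[max_seq_len:]
    # the trailing partial block is dropped here (it could also be padded)

# Example: three tiny "documents" packed into blocks of 8 tokens.
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(pack_sequences(docs, eos_id=0, max_seq_len=8)))
# -> [[1, 2, 3, 0, 4, 5, 6, 7]]
```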
@@ -92,7 +118,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):

```
# Training Data
-We release most (95.79%) of our training data corpus [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
+We release most (95.79%) of our training data corpus as the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).


# Our Release Plan
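Since the released corpus above is hosted on the Hugging Face Hub, it can be inspected with the standard `datasets` library. A minimal streaming example; the split name and the column layout are assumptions to verify against the dataset card:

```python
# Stream a few records from the released corpus without downloading it fully.
# The dataset id comes from the link above; split name and columns are assumptions.
from datasets import load_dataset

lcc = load_dataset("BUT-FIT/but_lcc", split="train", streaming=True)
for i, record in enumerate(lcc):
    print(record.keys())   # inspect the schema rather than assuming a "text" field
    if i >= 2:
        break
```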