Update README.md
README.md CHANGED
@@ -2,7 +2,7 @@
license: apache-2.0
---
# Introduction
-CSMPT7b is a large Czech language model continously pretrained from English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. Model
+CSMPT7b is a large Czech language model continuously pretrained for 272b training tokens from the English [MPT7b](https://huggingface.co/mosaicml/mpt-7b) model. The model was pretrained on the ~67b-token [Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc) with a Czech tokenizer obtained through our vocabulary swap method (see below).

# Eval
Dev eval on CS-HellaSwag (an automatically translated HellaSwag benchmark).
@@ -48,7 +48,33 @@ Figure 3: Test loss closeup, testing performed on split of internal-corpus #1. S


## Training Method
-
+### Vocabulary Swap
+The vocabulary swap was done the same way as for our [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k) model (see its model card for a comprehensive description).
+We managed to align 4,177 English tokens with their corresponding Czech tokens; a simplified sketch of the embedding transfer is shown below this hunk.
+
+## Hyperparameters
+Hyperparameters not mentioned here were kept the same as for MPT.
+| **Name**                   | **Value**   | **Note**                                                                                              |
+|----------------------------|-------------|-------------------------------------------------------------------------------------------------------|
+| training sw                | llm-foundry | We've done some minor patching (e.g., to allow DDP sync over a file)                                  |
+| dataset_type               | Concat      | Sequences at the model's input were concatenated up to `$max_seq_len` and separated by the EOS token (sketched below). |
+| tokenizer_size             | 64k         | Same as in [Czech-GPT-2](https://huggingface.co/BUT-FIT/Czech-GPT-2-XL-133k)                          |
+| max_seq_len                | 2048        |                                                                                                       |
+| batch_size                 | 1024        |                                                                                                       |
+| learning_rate              | 1.0e-4      |                                                                                                       |
+| optimizer                  | LionW       |                                                                                                       |
+| optimizer_betas            | 0.9/0.95    |                                                                                                       |
+| optimizer_weight_decay     | 0           |                                                                                                       |
+| optimizer_eps              | 1.0e-08     |                                                                                                       |
+| gradient_clipping_max_norm | 1.0         |                                                                                                       |
+| attn_impl                  | flash2      | We used the Triton flash-attn 1 implementation for the initial ~60k steps                             |
+| positional_encoding        | alibi       |                                                                                                       |
+| fsdp                       | FULL_SHARD  | (we had implementation issues with hybrid sharding in llm-foundry)                                    |
+| precision                  | bf16        |                                                                                                       |
+| scheduler                  | cosine      |                                                                                                       |
+| scheduler_warmup           | 100 steps   |                                                                                                       |
+| scheduler_steps            | 170,000     |                                                                                                       |
+| scheduler_alpha            | 0.1         | So the LR on the last step is 0.1 * (vanilla LR)                                                      |


# Usage
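As a rough illustration of the vocabulary swap mentioned in the hunk above: the idea is to keep the pretrained embeddings of tokens that exist in both the English and the Czech vocabulary and to initialize the remaining Czech embeddings from scratch. The sketch below assumes a target tokenizer repo id and a simple exact-string matching criterion; the authoritative description is the Czech-GPT-2 model card linked above.

```python
# Illustrative vocabulary-swap style embedding transfer (a sketch, not the exact CSMPT7b recipe).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

src_name = "mosaicml/mpt-7b"   # English source model (from the README)
tgt_name = "BUT-FIT/csmpt7b"   # assumed repo id for the Czech tokenizer

src_tok = AutoTokenizer.from_pretrained(src_name)
tgt_tok = AutoTokenizer.from_pretrained(tgt_name)
model = AutoModelForCausalLM.from_pretrained(src_name, trust_remote_code=True)

src_emb = model.get_input_embeddings().weight.data   # [src_vocab_size, d_model]
d_model = src_emb.shape[1]
src_vocab = src_tok.get_vocab()                       # token string -> id
tgt_vocab = tgt_tok.get_vocab()

# Random init for the new (Czech) embedding matrix, then copy rows for shared tokens.
new_emb = torch.empty(len(tgt_vocab), d_model)
torch.nn.init.normal_(new_emb, std=0.02)
shared = [(tgt_id, src_vocab[tok]) for tok, tgt_id in tgt_vocab.items() if tok in src_vocab]
for tgt_id, src_id in shared:
    new_emb[tgt_id] = src_emb[src_id]
print(f"aligned {len(shared)} tokens")                # the README reports 4,177 aligned tokens

# Install the new embedding matrix (MPT ties input embeddings and LM head).
model.resize_token_embeddings(len(tgt_vocab))
model.get_input_embeddings().weight.data.copy_(new_emb)
```

Exact string matching is only the simplest possible alignment criterion; how many tokens it recovers depends on how the two BPE vocabularies were built, so treat the Czech-GPT-2 write-up as the definitive description.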
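The `Concat` dataset type in the hyperparameter table is the usual sequence-packing scheme: tokenized documents are joined with the EOS token and cut into fixed-length blocks of `max_seq_len`. A minimal sketch of that packing, independent of llm-foundry's actual data loader (which is not shown here):

```python
# Illustrative sequence packing ("Concat"): tokenized documents are joined with EOS
# and chunked into fixed-length training examples.
from typing import Iterable, Iterator

def pack_sequences(docs: Iterable[list[int]], eos_id: int, max_seq_len: int = 2048) -> Iterator[list[int]]:
    buffer: list[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)           # EOS separates concatenated documents
        while len(buffer) >= max_seq_len:
            yield buffer[:max_seq_len]  # one fixed-length training example
            buffer = buffer[max_seq_len:]
    # the trailing partial block is dropped here (it could also be padded)

# Example: three tiny "documents" packed into blocks of 8 tokens.
docs = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(pack_sequences(docs, eos_id=0, max_seq_len=8)))
# -> [[1, 2, 3, 0, 4, 5, 6, 7]]
```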
@@ -92,7 +118,7 @@ with torch.autocast('cuda', dtype=torch.bfloat16):

```
# Training Data
-We release most (95.79%) of our training data corpus [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).
+We release most (95.79%) of our training data corpus as the [BUT-Large Czech Collection](https://huggingface.co/datasets/BUT-FIT/but_lcc).


# Our Release Plan
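Since the released corpus above is hosted on the Hugging Face Hub, it can be inspected with the standard `datasets` library. A minimal streaming example; the split name and the column layout are assumptions to verify against the dataset card:

```python
# Stream a few records from the released corpus without downloading it fully.
# The dataset id comes from the link above; split name and columns are assumptions.
from datasets import load_dataset

lcc = load_dataset("BUT-FIT/but_lcc", split="train", streaming=True)
for i, record in enumerate(lcc):
    print(record.keys())   # inspect the schema rather than assuming a "text" field
    if i >= 2:
        break
```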