---
license: apache-2.0
datasets:
- cerebras/SlimPajama-627B
language:
- en
---

# Overview

This repo hosts intermediate checkpoints for my upcoming **MicroLlama V2** model with 500 million parameters, based on **Llama 3.2**.
They are pretrained entirely from scratch on **SlimPajama-627B**.
This project is still a work in progress: I have only trained on 5B tokens so far, and I will keep running the training process until I run out of funds.

Some reasons for using these checkpoints:

- You can use them as a starting point to train your own small language model.
- More interestingly, you can probe the learning process of these models to understand how an LLM learns to mimic humans.

# How to use these checkpoints

These checkpoints are compatible with [litgpt](https://github.com/Lightning-AI/litgpt) with slight modifications (see the section below).

To load them into Hugging Face transformers, you will need to convert the litgpt pretraining checkpoint into a litgpt inference-only checkpoint (no code modification is required):
```
# Install litgpt
pip install 'litgpt[all]'

# litgpt pretrain checkpoint to inference checkpoint
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

# litgpt inference checkpoint to HF checkpoints
litgpt convert_from_litgpt <LOCAL_PATH_TO_INFERENCE_CHECKPOINT> <LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>
```

References:

1. litgpt pretrain checkpoint to inference checkpoint: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#export-checkpoints
2. litgpt inference checkpoint to HF checkpoints: https://github.com/Lightning-AI/litgpt/blob/main/tutorials/convert_lit_models.md

**Caveat**: for some reason the auto-generated config.json in the checkpoint is incorrect. You will need to replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve any inference or evaluation errors.
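
Once the corrected config.json is in place, the converted weights can be loaded with transformers. A minimal sketch following the litgpt conversion tutorial (the path is a placeholder; `model.pth` is the file produced by `convert_from_litgpt`):

```python
# Minimal sketch: load the converted checkpoint with transformers.
# Assumes the directory contains model.pth from convert_from_litgpt
# plus the corrected config.json from the caveat above.
import torch
from transformers import AutoModelForCausalLM

converted_dir = "<LOCAL_OUTPUT_PATH_TO_CONVERTED_HF_CHECKPOINT>"
state_dict = torch.load(f"{converted_dir}/model.pth")
model = AutoModelForCausalLM.from_pretrained(
    converted_dir, local_files_only=True, state_dict=state_dict
)
```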

# Advanced usage - pretraining with litgpt

For folks who are familiar with [litgpt](https://github.com/Lightning-AI/litgpt), you can add the following entry to your config.py to use these checkpoints and continue training the model.

```python
# based on Llama-3.2-1B
dict(
    name="micro-llama-300M-v2",
    hf_config=dict(org="keeeeenw", name="MicroLlamaV2"),
    block_size=131072,  # Stable choice for Llama model training
    # This vocab size contributes to the 300M -> 500M parameter increase.
    # Note that we cannot change this number because the llama3
    # tokenizer is hardcoded to support this vocab size.
    vocab_size=128000,
    padded_vocab_size=128256,
    n_layer=12,
    n_embd=1024,
    n_head=16,
    n_query_groups=4,
    rotary_percentage=1.0,
    parallel_residual=False,
    bias=False,
    norm_class_name="RMSNorm",
    mlp_class_name="LLaMAMLP",
    intermediate_size=5632,
    rope_base=500000,  # Scaling for long sequence support
    # RoPE adjustments for block size of 131072
    rope_adjustments=dict(
        factor=16.0,  # Matches block_size=131072 (8192 * 16)
        low_freq_factor=1.0,
        high_freq_factor=4.0,
        original_max_seq_len=8192,  # Base max seq length before RoPE scaling
    ),
),
```
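
To sanity-check the new entry, you can instantiate the model from it and count parameters, which also confirms the ~500M size. A small sketch (assumes litgpt is installed and the dict above has been added to the config registry in litgpt's config.py):

```python
# Sketch: verify the config resolves and check the parameter count.
from litgpt.config import Config
from litgpt.model import GPT

config = Config.from_name("micro-llama-300M-v2")
model = GPT(config)
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")  # expect roughly 500M
```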

You will need to preprocess your data with the **meta-llama/Llama-3.2-1B** tokenizer, similar to [prepare-the-tinyllama-1t-token-dataset](https://github.com/Lightning-AI/litgpt/blob/main/tutorials/pretrain_tinyllama.md#download-datasets), which uses the Llama 2 tokenizer.

Assuming you have litgpt installed already:
```
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B data/slimpajama-raw

litgpt download meta-llama/Llama-3.2-1B \
  --access_token your_hf_token \
  --tokenizer_only true

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/train \
  --output_dir data/slimpajama/train \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B

python litgpt/data/prepare_slimpajama.py \
  --input_dir data/slimpajama-raw/validation \
  --output_dir data/slimpajama/val \
  --tokenizer_path checkpoints/meta-llama/Llama-3.2-1B
```

Please note that this data preprocessing runs on CPU only and will take a long time if you don't have a CPU with 96+ cores.
I have tried to share the converted data as an HF dataset,
but HF does not support having too many files within the same directory. I will figure out how to distribute the converted dataset later.

Finally, you can use my config to start training: https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/microllama_v2.yaml

Note: the config has 300M in the model name, but the model is actually 500M due to the vocab size increase from Llama 2 to Llama 3:
```
litgpt pretrain \
  --config microllama_v2.yaml \
  --resume <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO>
```

**IMPORTANT NOTE**
I have had various issues resuming training from checkpoints when moving from server to server, specifically when I switched from Lightning AI Studio to a private server. For example, Lightning AI Studio may look for your preprocessed data under `/root/.lightning/chunks/` if you store the preprocessed data on S3 and allow Lightning AI Studio to stream it while training. When I moved to a private server, litgpt tried to look for the same data under `/cache/chunks/`.
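
One possible workaround, which I have not verified, is to point the path litgpt expects at the real location of the preprocessed data; the `actual` path below is a hypothetical placeholder:

```python
# Untested sketch: symlink the expected cache path to the real data location.
import os

expected = "/cache/chunks"             # where litgpt looked on the private server
actual = "/path/to/preprocessed/data"  # hypothetical real location of the chunks
os.makedirs(os.path.dirname(expected), exist_ok=True)
if not os.path.exists(expected):
    os.symlink(actual, expected)
```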

If you run into any issues with resuming training, just convert the checkpoint to an inference checkpoint, and then you can load it again:
```
litgpt convert_pretrained_checkpoint <LOCAL_PATH_TO_CHECKPOINT_FROM_THIS_REPO> \
  --output_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>

litgpt pretrain \
  --config microllama_v2.yaml \
  --initial_checkpoint_dir <LOCAL_PATH_TO_INFERENCE_CHECKPOINT>
```

You will lose the index into the training dataset as well as other hyperparameters such as the learning rate, but this allows you to restart your pretraining quickly.

# Evaluation results

**Note**: these numbers do not represent the final performance of the model and should only serve as a reference for my training progress.
```
checkpoint: step-00088000

| Tasks       |Version|Filter|n-shot| Metric |Value |   |Stderr|
|-------------|------:|------|-----:|--------|-----:|---|-----:|
|piqa         |      1|none  |     0|acc     |0.6202|±  |0.0113|
|             |       |none  |     0|acc_norm|0.6213|±  |0.0113|
|boolq        |      2|none  |     0|acc     |0.5875|±  |0.0086|
|arc_challenge|      1|none  |     0|acc     |0.1980|±  |0.0116|
|             |       |none  |     0|acc_norm|0.2201|±  |0.0121|
|arc_easy     |      1|none  |     0|acc     |0.4373|±  |0.0102|
|             |       |none  |     0|acc_norm|0.3935|±  |0.0100|
|winogrande   |      1|none  |     0|acc     |0.5004|±  |0.0141|
|openbookqa   |      1|none  |     0|acc     |0.1760|±  |0.0170|
|             |       |none  |     0|acc_norm|0.2680|±  |0.0198|
|hellaswag    |      1|none  |     0|acc     |0.2893|±  |0.0045|
|             |       |none  |     0|acc_norm|0.3125|±  |0.0046|
```

You can use the following script to reproduce the results (assuming you have installed litgpt):
```
MODEL_NAME="step-00088000"
MODEL_OUTPUT_ROOT="MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2"
MODEL_OUTPUT_REL="${MODEL_OUTPUT_ROOT}/${MODEL_NAME}"

# HuggingFace
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/lit_model.pth --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/generation_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/hyperparameters.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/model_config.yaml --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/
huggingface-cli download keeeeenw/MicroLlama2-checkpoints ${MODEL_NAME}/tokenizer_config.json --local-dir checkpoints/${MODEL_OUTPUT_ROOT}/

# Copy config, see "caveat" below
cp -r <local_path>/config.json checkpoints/${MODEL_OUTPUT_REL}/

# AWS
# aws s3 cp s3://microllama-v2/checkpoints/out/pretrain/micro-llama-v2/${MODEL_NAME} checkpoints/${MODEL_OUTPUT_REL} --recursive

litgpt evaluate \
  checkpoints/${MODEL_OUTPUT_REL} \
  --tasks "hellaswag,openbookqa,winogrande,arc_easy,arc_challenge,boolq,piqa" \
  --device cuda:0 \
  --batch_size 16
```
**Caveat**: as noted above, the auto-generated config.json in the checkpoint is incorrect; replace it with https://huggingface.co/keeeeenw/MicroLlama2-checkpoints/blob/main/config.json to resolve the evaluation error.
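
If you prefer to fetch the corrected config.json programmatically, here is a small sketch using huggingface_hub (the destination path matches the variables in the script above):

```python
# Sketch: download the corrected config.json and copy it into the
# checkpoint directory (equivalent to the cp step in the script above).
import shutil
from huggingface_hub import hf_hub_download

path = hf_hub_download("keeeeenw/MicroLlama2-checkpoints", "config.json")
dest = "checkpoints/MicroLlamaV2-VastAI-Checkpoints/out/pretrain/micro-llama-v2/step-00088000/"
shutil.copy(path, dest + "config.json")
```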