| # Andromeda Model Training Standard Operating Procedure | |
| This document provides instructions on how to train the Andromeda model end-to-end using the provided code. The training procedure consists of three main scripts: `build_dataset.py`, `model.py`, and `train_distributed.py`. Follow the steps below to train the Andromeda model. | |
| ## Prerequisites | |
| Before starting the training process, ensure that you have the following requirements: | |
| - Python 3.7 or higher | |
| - PyTorch 1.9 or higher | |
| - Transformers library | |
| - Datasets library | |
| - Accelerate library | |
| - Wandb library (optional, for logging) | |
| ## Step 1: Building the Dataset | |
| The first step is to build the dataset required for training. The `build_dataset.py` script processes the training data and prepares it for training. Follow the instructions below to build the dataset: | |
| 1. Open the `build_dataset.py` script. | |
| 2. Set the configuration parameters in the `CFG` class according to your requirements: | |
| - `HF_ACCOUNT_REPO`: Replace with your Hugging Face API key. | |
| - `TOKENIZER`: Choose the tokenizer model to use (e.g., "EleutherAI/gpt-neox-20b"). | |
| - `DATASET_NAME`: Choose the dataset to process (e.g., "tiiuae/falcon-refinedweb"). | |
| - `SEQ_LEN`: Set the desired sequence length. | |
| 3. Save the changes to the script. | |
| 4. Open a terminal or command prompt and navigate to the directory containing the `build_dataset.py` script. | |
| 5. Run the following command to execute the script: | |
| ``` | |
| python build_dataset.py | |
| ``` | |
| 6. The script will process the dataset and push it to your Hugging Face account repository specified by `HF_ACCOUNT_REPO`. | |
| ## Step 2: Defining the Andromeda Model | |
| The second step is to define the Andromeda model architecture. The `model.py` script contains the model definition and configuration. Follow the instructions below to configure the Andromeda model: | |
| 1. Open the `model.py` script. | |
| 2. Set the configuration parameters in the `AndromedaTokenizer` and `Andromeda` classes according to your requirements: | |
| - `tokenizer`: Configure the tokenizer with the desired parameters. | |
| - `Andromeda`: Configure the Andromeda model with the desired architecture. | |
| 3. Save the changes to the script. | |
| ## Step 3: Training the Andromeda Model | |
| The final step is to train the Andromeda model using the `train_distributed.py` script. Follow the instructions below to start the training process: | |
| 1. Open the `train_distributed.py` script. | |
| 2. Set the configuration parameters in the `TrainAndromeda.CFG` class according to your requirements: | |
| - `BATCH_SIZE`: Set the batch size for training. | |
| - `GRADIENT_ACCUMULATE_EVERY`: Set the number of gradient accumulation steps. | |
| - `LEARNING_RATE`: Set the learning rate for the optimizer. | |
| - `WEIGHT_DECAY`: Set the weight decay for the optimizer. | |
| - `SEQ_LEN`: Set the desired sequence length. | |
| - `USE_DEEPSPEED`: Set to `True` if using DeepSpeed for optimization. | |
| - `USE_FSDP`: Set to `True` if using Fully Sharded Data Parallelism. | |
| - `USE_PRETOKENIZED`: Set to `True` if using a pre-tokenized dataset. | |
| - `USE_ACTIVATION_CHECKPOINTING`: Set to `True` if using activation checkpointing. | |
| - `RESUME_FROM_CHECKPOINT`: Set to the path of a checkpoint to resume training from. | |
| - `CHECKPOINTING_STEPS`: Set the number of steps between checkpoints. | |
| - `OUTPUT_DIR`: Set the output directory for saving the model checkpoints and logs. | |
| - `ENTITY_NAME`: Set the Wandb entity name for logging (optional). | |
| 3. Save the changes to the script. | |
| 4. Open a terminal or command prompt and navigate to the directory containing the `train_distributed.py` script. | |
| 5. Run the following command to start the training: | |
| ``` | |
| python train_distributed.py | |
| ``` | |
| 6. The script will train the Andromeda model using the specified configuration and dataset. | |
| 7. During training, the progress will be displayed in the terminal, and logs will be saved to the specified output directory. | |
| # Other Training methods | |
| First: | |
| `Accelerate Config` | |
| Enable Deepspeed 3: | |
| `Accelerate launch train_distributed_accelerate.py` | |