---
license: other
license_name: ngen2-community-license
license_link: https://tnsaai-builds.framer.website/community/licenses/ngen2
language:
- en
- hi
- te
metrics:
- bleu
- perplexity
- accuracy
base_model:
- TNSA/NGen2-15M
pipeline_tag: text-generation
library_name: transformers
model_type: safetensors
new_version: TNSA/NGen3-15M
---

# NGen 2

When using NGen 2 with the Transformers library, only the 15M variant (`TNSA/NGen2-15M`) is currently supported.
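
For example, a minimal sketch of loading that checkpoint, assuming it exposes the standard auto classes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TNSA/NGen2-15M")
model = AutoModelForCausalLM.from_pretrained("TNSA/NGen2-15M")

# Generate a short continuation from a prompt.
inputs = tokenizer("NGen 2 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```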

NGen 2 is an advanced Transformer model training pipeline that supports multiple model variants, ranging from a **nano** variant (approximately 120M parameters) to a **foundational** variant (approximately 1B parameters). The pipeline incorporates modern architectural improvements such as rotary positional embeddings, RMSNorm, and GEGLU activations to boost performance and training efficiency.
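
For context, GEGLU replaces the usual feed-forward activation with a gated GELU. A minimal sketch of such a block (illustrative, not necessarily NGen 2's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLU(nn.Module):
    """Gated-GELU feed-forward block: one half of the input
    projection gates the other half through a GELU."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # value and gate halves
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))

# Quick shape check.
block = GEGLU(d_model=512, d_ff=2048)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```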

> **Note:** Although NGen 2 tops out at a 1B-parameter model, its advanced architecture pushes its performance closer to that of much larger models.
## Model Variants

NGen 2 supports the following variants via the `--variant` flag:

- **nano**: ~120M parameters
- **small**: ~300M parameters
- **medium**: ~500M parameters
- **large**: ~700M parameters
- **foundational**: ~1B parameters

Each variant adjusts key hyperparameters such as the number of layers, model dimension (`d_model`), number of attention heads (`n_heads`), and the feed-forward dimension (`d_ff`).
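
To illustrate the shape of such a variant table (every number below is a hypothetical placeholder; the authoritative values live in `train.py`):

```python
# Hypothetical variant table -- illustrative placeholders only,
# NOT NGen 2's real hyperparameters.
VARIANT_CONFIGS = {
    "nano":         dict(n_layers=12, d_model=768,  n_heads=12, d_ff=3072),
    "small":        dict(n_layers=24, d_model=1024, n_heads=16, d_ff=4096),
    "medium":       dict(n_layers=24, d_model=1280, n_heads=20, d_ff=5120),
    "large":        dict(n_layers=32, d_model=1536, n_heads=24, d_ff=6144),
    "foundational": dict(n_layers=32, d_model=2048, n_heads=32, d_ff=8192),
}

def get_config(variant: str) -> dict:
    """Look up the hyperparameters for a --variant value."""
    return VARIANT_CONFIGS[variant]

print(get_config("nano"))
```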

## Requirements

- Python 3.8+
- PyTorch
- Transformers
- Datasets
- DeepSpeed (optional, for efficient training)
- Azure ML SDK (for distributed training on Azure)

Install dependencies using pip (adjust as needed):

```bash
pip install torch transformers datasets deepspeed azureml-core
```

## Usage

### 1. Data Preparation

First, download and preprocess the OpenWebText dataset:

```bash
python prepare.py --output_dir ./_data_ --max_length 4096
```

This script downloads, tokenizes, and saves the dataset in Arrow format to the `./_data_` directory.
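
Assuming `prepare.py` saves with the `datasets` library's `save_to_disk`, you can sanity-check the output like this:

```python
from datasets import load_from_disk

# Load the Arrow-format dataset produced by prepare.py.
ds = load_from_disk("./_data_")
print(ds)  # shows columns and row counts; the exact schema depends on prepare.py
```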

### 2. Local Training

The main training script is `train.py`. It loads the processed dataset (by default from `./_data_`), instantiates the desired model variant, and starts training.

Example CLI commands:

- Train the nano (120M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_nano --batch_size 4 --epochs 3 --variant nano
```

- Train the small (300M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_small --batch_size 4 --epochs 3 --variant small
```

- Train the medium (500M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_medium --batch_size 4 --epochs 3 --variant medium
```

- Train the large (700M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_large --batch_size 4 --epochs 3 --variant large
```

- Train the foundational (1B) variant with rotary embeddings enabled:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_foundational --batch_size 4 --epochs 3 --variant foundational --use_rotary
```

### 3. Training on Azure ML

- Step 1: Set Up Azure ML Resources

Use `azure_setup.py` to create or connect to your Azure ML workspace and set up a compute cluster:

```bash
python azure_setup.py \
  --workspace_name MyWorkspace \
  --resource_group MyResourceGroup \
  --subscription_id YOUR_SUBSCRIPTION_ID \
  --location eastus \
  --compute_name gpu-cluster \
  --vm_size Standard_NC6 \
  --max_nodes 4 \
  --min_nodes 0
```
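
For reference, a script like `azure_setup.py` presumably wraps the Azure ML SDK along these lines (a sketch under that assumption, not the actual script):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Create the workspace, or attach to it if it already exists.
ws = Workspace.create(
    name="MyWorkspace",
    subscription_id="YOUR_SUBSCRIPTION_ID",
    resource_group="MyResourceGroup",
    location="eastus",
    exist_ok=True,
)

# Provision a GPU cluster that autoscales between 0 and 4 nodes.
config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6", min_nodes=0, max_nodes=4
)
cluster = ComputeTarget.create(ws, "gpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```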

- Step 2: Submit a Training Job to Azure ML

Use `submit_train.py` to submit your training script to Azure ML:

```bash
python submit_train.py \
  --experiment_name ngen3-experiment \
  --compute_target gpu-cluster \
  --script train.py \
  --dataset_dir ./_data_ \
  --output_dir ./checkpoints_foundational \
  --batch_size 4 \
  --epochs 3 \
  --variant foundational \
  --use_rotary
```
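
Internally, `submit_train.py` likely builds a `ScriptRunConfig` and submits it to an experiment. A minimal sketch (environment setup, which a real run would need, is omitted):

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads the config.json written during setup

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    arguments=[
        "--dataset_dir", "./_data_",
        "--output_dir", "./checkpoints_foundational",
        "--batch_size", "4",
        "--epochs", "3",
        "--variant", "foundational",
        "--use_rotary",
    ],
    compute_target="gpu-cluster",
)

run = Experiment(ws, "ngen3-experiment").submit(src)
run.wait_for_completion(show_output=True)
```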

### 4. DeepSpeed Integration

The `deepspeed.json` file configures mixed-precision training and ZeRO optimizations. To leverage DeepSpeed, ensure it is installed and adjust your training script or submission command to enable DeepSpeed support.
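
A minimal sketch of what wiring DeepSpeed into a training step looks like; the config dict mirrors what `deepspeed.json` might contain (the ZeRO stage and optimizer settings here are assumptions, not the repository's actual config):

```python
import torch
import deepspeed

# A toy model stands in for an NGen 2 variant; the config below is
# illustrative (fp16 + a ZeRO stage), not the repo's real deepspeed.json.
model = torch.nn.Linear(512, 512)
ds_config = {
    "train_batch_size": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# One training step: forward, DeepSpeed-managed backward, optimizer step.
x = torch.randn(4, 512, device=model_engine.device, dtype=torch.half)
loss = model_engine(x).float().pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```

Scripts built around `deepspeed.initialize` are normally launched with the `deepspeed` CLI (e.g. `deepspeed train.py`) so the distributed environment is configured for you.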

## License

The NGen 2 project is developed and maintained by TNSA AI. The licensing model is dual:

- The nano and small variants are open source and released under the MIT License.
- The medium, large, and foundational variants are proprietary and are not open source. Use of these proprietary components is subject to TNSA AI's proprietary licensing terms.

## Copyright

© 2023 TNSA AI. All rights reserved. For usage terms, see the `LICENSE` file.