---
license: other
license_name: ngen2-community-license
license_link: https://tnsaai-builds.framer.website/community/licenses/ngen2
language:
- en
- hi
- te
metrics:
- bleu
- perplexity
- accuracy
base_model:
- TNSA/NGen2-15M
pipeline_tag: text-generation
library_name: transformers
model_type: safetensors
new_version: TNSA/NGen3-15M
---

# NGen 2

When using NGen 2 with the Transformers library, only the 15M variant (`TNSA/NGen2-15M`) is currently supported.
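
For example, a minimal sketch of loading that checkpoint, assuming it exposes the standard auto classes:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TNSA/NGen2-15M")
model = AutoModelForCausalLM.from_pretrained("TNSA/NGen2-15M")

# Generate a short continuation from a prompt.
inputs = tokenizer("NGen 2 is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```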

NGen 2 is an advanced Transformer model training pipeline that supports multiple model variants, ranging from a **nano** variant (approximately 120M parameters) to a **foundational** variant (approximately 1B parameters). The pipeline incorporates modern architectural improvements such as rotary positional embeddings, RMSNorm, and GEGLU activations to boost performance and training efficiency.
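
For context, GEGLU replaces the usual feed-forward activation with a gated GELU. A minimal sketch of such a block (illustrative, not necessarily NGen 2's exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEGLU(nn.Module):
    """Gated-GELU feed-forward block: one half of the input
    projection gates the other half through a GELU."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.proj_in = nn.Linear(d_model, 2 * d_ff)  # value and gate halves
        self.proj_out = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        value, gate = self.proj_in(x).chunk(2, dim=-1)
        return self.proj_out(value * F.gelu(gate))

# Quick shape check.
block = GEGLU(d_model=512, d_ff=2048)
print(block(torch.randn(2, 16, 512)).shape)  # torch.Size([2, 16, 512])
```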

> **Note:** Although NGen 2 tops out at a 1B-parameter model, its advanced architecture pushes its performance closer to that of much larger models.
## Model Variants

NGen 2 supports the following variants via the `--variant` flag:

- **nano**: ~120M parameters
- **small**: ~300M parameters
- **medium**: ~500M parameters
- **large**: ~700M parameters
- **foundational**: ~1B parameters

Each variant adjusts key hyperparameters such as the number of layers, model dimension (`d_model`), number of attention heads (`n_heads`), and the feed-forward dimension (`d_ff`).
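
To illustrate the shape of such a variant table (every number below is a hypothetical placeholder; the authoritative values live in `train.py`):

```python
# Hypothetical variant table -- illustrative placeholders only,
# NOT NGen 2's real hyperparameters.
VARIANT_CONFIGS = {
    "nano":         dict(n_layers=12, d_model=768,  n_heads=12, d_ff=3072),
    "small":        dict(n_layers=24, d_model=1024, n_heads=16, d_ff=4096),
    "medium":       dict(n_layers=24, d_model=1280, n_heads=20, d_ff=5120),
    "large":        dict(n_layers=32, d_model=1536, n_heads=24, d_ff=6144),
    "foundational": dict(n_layers=32, d_model=2048, n_heads=32, d_ff=8192),
}

def get_config(variant: str) -> dict:
    """Look up the hyperparameters for a --variant value."""
    return VARIANT_CONFIGS[variant]

print(get_config("nano"))
```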

## Requirements

- Python 3.8+
- PyTorch
- Transformers
- Datasets
- DeepSpeed (optional, for efficient training)
- Azure ML SDK (for distributed training on Azure)

Install dependencies using pip (adjust as needed):

```bash
pip install torch transformers datasets deepspeed azureml-core
```

## Usage

### 1. Data Preparation

First, download and preprocess the OpenWebText dataset:

```bash
python prepare.py --output_dir ./_data_ --max_length 4096
```

This script downloads, tokenizes, and saves the dataset in Arrow format to the `./_data_` directory.
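
Assuming `prepare.py` saves with the `datasets` library's `save_to_disk`, you can sanity-check the output like this:

```python
from datasets import load_from_disk

# Load the Arrow-format dataset produced by prepare.py.
ds = load_from_disk("./_data_")
print(ds)  # shows columns and row counts; the exact schema depends on prepare.py
```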

### 2. Local Training

The main training script is `train.py`. It loads the processed dataset (by default from `./_data_`), instantiates the desired model variant, and starts training.

Example CLI commands:

- Train the nano (120M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_nano --batch_size 4 --epochs 3 --variant nano
```

- Train the small (300M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_small --batch_size 4 --epochs 3 --variant small
```

- Train the medium (500M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_medium --batch_size 4 --epochs 3 --variant medium
```

- Train the large (700M) variant:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_large --batch_size 4 --epochs 3 --variant large
```

- Train the foundational (1B) variant with rotary embeddings enabled:

```bash
python train.py --dataset_dir ./_data_ --output_dir ./checkpoints_foundational --batch_size 4 --epochs 3 --variant foundational --use_rotary
```

### 3. Training on Azure ML

- Step 1: Set Up Azure ML Resources

Use `azure_setup.py` to create or connect to your Azure ML workspace and set up a compute cluster:

```bash
python azure_setup.py \
  --workspace_name MyWorkspace \
  --resource_group MyResourceGroup \
  --subscription_id YOUR_SUBSCRIPTION_ID \
  --location eastus \
  --compute_name gpu-cluster \
  --vm_size Standard_NC6 \
  --max_nodes 4 \
  --min_nodes 0
```
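
For reference, a script like `azure_setup.py` presumably wraps the Azure ML SDK along these lines (a sketch under that assumption, not the actual script):

```python
from azureml.core import Workspace
from azureml.core.compute import AmlCompute, ComputeTarget

# Create the workspace, or attach to it if it already exists.
ws = Workspace.create(
    name="MyWorkspace",
    subscription_id="YOUR_SUBSCRIPTION_ID",
    resource_group="MyResourceGroup",
    location="eastus",
    exist_ok=True,
)

# Provision a GPU cluster that autoscales between 0 and 4 nodes.
config = AmlCompute.provisioning_configuration(
    vm_size="Standard_NC6", min_nodes=0, max_nodes=4
)
cluster = ComputeTarget.create(ws, "gpu-cluster", config)
cluster.wait_for_completion(show_output=True)
```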

- Step 2: Submit a Training Job to Azure ML

Use `submit_train.py` to submit your training script to Azure ML:

```bash
python submit_train.py \
  --experiment_name ngen3-experiment \
  --compute_target gpu-cluster \
  --script train.py \
  --dataset_dir ./_data_ \
  --output_dir ./checkpoints_foundational \
  --batch_size 4 \
  --epochs 3 \
  --variant foundational \
  --use_rotary
```
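
Internally, `submit_train.py` likely builds a `ScriptRunConfig` and submits it to an experiment. A minimal sketch (environment setup, which a real run would need, is omitted):

```python
from azureml.core import Experiment, ScriptRunConfig, Workspace

ws = Workspace.from_config()  # reads the config.json written during setup

src = ScriptRunConfig(
    source_directory=".",
    script="train.py",
    arguments=[
        "--dataset_dir", "./_data_",
        "--output_dir", "./checkpoints_foundational",
        "--batch_size", "4",
        "--epochs", "3",
        "--variant", "foundational",
        "--use_rotary",
    ],
    compute_target="gpu-cluster",
)

run = Experiment(ws, "ngen3-experiment").submit(src)
run.wait_for_completion(show_output=True)
```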

### 4. DeepSpeed Integration

The `deepspeed.json` file configures mixed-precision training and ZeRO optimizations. To leverage DeepSpeed, ensure it is installed and adjust your training script or submission command to enable DeepSpeed support.
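
A minimal sketch of what wiring DeepSpeed into a training step looks like; the config dict mirrors what `deepspeed.json` might contain (the ZeRO stage and optimizer settings here are assumptions, not the repository's actual config):

```python
import torch
import deepspeed

# A toy model stands in for an NGen 2 variant; the config below is
# illustrative (fp16 + a ZeRO stage), not the repo's real deepspeed.json.
model = torch.nn.Linear(512, 512)
ds_config = {
    "train_batch_size": 4,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

# One training step: forward, DeepSpeed-managed backward, optimizer step.
x = torch.randn(4, 512, device=model_engine.device, dtype=torch.half)
loss = model_engine(x).float().pow(2).mean()
model_engine.backward(loss)
model_engine.step()
```

Scripts built around `deepspeed.initialize` are normally launched with the `deepspeed` CLI (e.g. `deepspeed train.py`) so the distributed environment is configured for you.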

## License

The NGen 2 project is developed and maintained by TNSA AI. The licensing model is dual:

- The nano and small variants are open source and released under the MIT License.
- The medium, large, and foundational variants are proprietary and are not open source. Use of these proprietary components is subject to TNSA AI's proprietary licensing terms.

## Copyright

© 2023 TNSA AI. All rights reserved. For usage terms, see the `LICENSE` file.