---
license: other
license_name: ngen-2-community-license
license_link: https://tnsaai-builds.framer.website/community/licenses/ngen2
library_name: transformers
datasets:
- TNSA/TCorpus
language:
- en
pipeline_tag: text-generation
---
|
|
|
|
|
# NGen3: Next-Generation Foundational Model |
|
|
|
|
|
NGen3 is a production-level foundational language model inspired by state-of-the-art architectures such as GPT-4, Claude-3, and Llama 2. It is designed for both research and production and supports model variants ranging from 7M to 1B parameters. The model is built with a modular transformer decoder architecture and provides a comprehensive command-line interface (CLI) for tokenization, training, sampling, exporting, knowledge distillation, and fine-tuning on conversational data. |
|
|
|
|
|
|
|
|
|
|
## Table of Contents |
|
|
|
|
|
- [Overview](#overview) |
|
|
- [Model Architecture](#model-architecture) |
|
|
- [Installation](#installation) |
|
|
- [Usage](#usage) |
|
|
- [Tokenization](#tokenization) |
|
|
- [Training](#training) |
|
|
- [Sampling](#sampling) |
|
|
- [Exporting](#exporting) |
|
|
- [Knowledge Distillation](#knowledge-distillation) |
|
|
- [Fine-Tuning](#fine-tuning) |
|
|
- [Local Fine-Tuning](#local-fine-tuning) |
|
|
- [Hugging Face Fine-Tuning](#hugging-face-fine-tuning) |
|
|
- [Hyperparameters](#hyperparameters) |
|
|
- [Acknowledgements](#acknowledgements) |
|
|
|
|
|
|
|
|
|
|
|
## Overview |
|
|
|
|
|
NGen3 is a flexible, self-contained implementation of a foundational language model built on a transformer decoder architecture. It enables users to: |
|
|
|
|
|
- **Tokenize** text from local files, URLs, or directly from Hugging Face datasets. |
|
|
- **Train** the model on tokenized datasets. |
|
|
- **Generate** text samples from trained models. |
|
|
- **Export** models (with minimal tokenizer configurations) to formats compatible with Hugging Face. |
|
|
- **Distill** knowledge from larger teacher models into smaller student models. |
|
|
- **Fine-Tune** on conversational datasets (using local files or datasets from Hugging Face). |
|
|
|
|
|
--- |
|
|
|
|
|
## Model Architecture |
|
|
|
|
|
NGen3 uses a decoder-only transformer design with the following components: |
|
|
|
|
|
- **Token & Positional Embeddings:** Learnable embeddings for tokens and their positions. |
|
|
- **Transformer Blocks:** A stack of blocks, each containing: |
|
|
- **Causal Self-Attention:** Multi-head attention with a lower-triangular mask to prevent attention to future tokens. |
|
|
- **Feed-Forward Network (MLP):** With GELU activation. |
|
|
- **Residual Connections & Layer Normalization:** To stabilize training. |
|
|
- **Final Projection Layer:** Projects the hidden states to logits over the vocabulary. |
|
|
|
|
|
The model comes in several variants: |
|
|
- **7M Variant:** 4 layers, 4 heads, 128-dimensional embeddings. |
|
|
- **120M Variant:** 12 layers, 8 heads, 512-dimensional embeddings. |
|
|
- **300M, 500M, 700M, and 1B Variants:** Increasing in depth and width. |
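
For illustration, the sketch below shows this decoder-only design in plain PyTorch, with defaults matching the documented 7M variant (4 layers, 4 heads, 128-dimensional embeddings, block size 128). It is a minimal, hedged rendering of the architecture described above, not the actual NGen3 source code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalSelfAttention(nn.Module):
    """Multi-head attention with a lower-triangular mask so tokens cannot attend to the future."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.n_head = n_head
        self.qkv = nn.Linear(n_embd, 3 * n_embd)
        self.proj = nn.Linear(n_embd, n_embd)
        mask = torch.tril(torch.ones(block_size, block_size))
        self.register_buffer("mask", mask.view(1, 1, block_size, block_size))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).split(C, dim=2)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in (q, k, v))
        att = (q @ k.transpose(-2, -1)) / (k.size(-1) ** 0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        y = (F.softmax(att, dim=-1) @ v).transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)

class Block(nn.Module):
    """One decoder block: causal self-attention and a GELU MLP, each with a residual connection."""
    def __init__(self, n_embd, n_head, block_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd), nn.GELU(), nn.Linear(4 * n_embd, n_embd)
        )

    def forward(self, x):
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x

class DecoderLM(nn.Module):
    """Token + positional embeddings, a stack of blocks, and a final projection to vocabulary logits."""
    def __init__(self, vocab_size, n_layer=4, n_head=4, n_embd=128, block_size=128):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, n_embd)
        self.pos_emb = nn.Embedding(block_size, n_embd)
        self.blocks = nn.ModuleList(Block(n_embd, n_head, block_size) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))
```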
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## Installation |
|
|
|
|
|
Ensure you have Python 3.8+ installed and install the necessary dependencies: |
|
|
|
|
|
```bash |
|
|
pip install torch transformers datasets tqdm safetensors |
|
|
``` |
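
To confirm the environment is set up, a quick import check can be run (optional):

```python
# Optional sanity check: all NGen3 dependencies should import cleanly.
import torch, transformers, datasets, safetensors, tqdm

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```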
|
|
## Usage |
|
|
|
|
|
NGen3 is fully managed via a CLI. Below are examples for each command. |
|
|
## Tokenization

**Local Text File or URL:**
|
|
```bash |
|
|
python _model_.py tokenize --dataset tinyshakespeare --txt "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt" |
|
|
``` |
|
|
|
|
|
**Hugging Face Dataset:**
|
|
```bash |
|
|
python _model_.py hf_tokenize --hf_dataset roskoN/dailydialog --hf_split train --hf_text_column utterances --dataset dailydialog_train |
|
|
``` |
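
For reference, the dataset and text column named above can be inspected directly with the `datasets` library; this snippet only illustrates the inputs `hf_tokenize` consumes and is not part of the CLI:

```python
from datasets import load_dataset

# Load the same split used above and look at the column passed via --hf_text_column.
ds = load_dataset("roskoN/dailydialog", split="train")
print(ds.column_names)        # should include "utterances"
print(ds[0]["utterances"])    # raw text of the first example
```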
|
|
|
|
|
## Training |
|
|
Train a model variant (e.g., 7M): |
|
|
```bash |
|
|
python _model_.py train --variant 7M --data _data_tinyshakespeare_/data.bin |
|
|
``` |
|
|
|
|
|
## Sampling |
|
|
Generate text samples from a trained model: |
|
|
```bash |
|
|
python _model_.py sample --variant 7M --model_checkpoint 7M_model.pt --prompt "To be, or not to be" --length 100 --temperature 1.0 |
|
|
``` |
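
Here, `--temperature` controls how sharply the next-token distribution is peaked: values below 1.0 make sampling more deterministic, values above 1.0 make it more diverse. A minimal sketch of temperature sampling over vocabulary logits (illustrative, not the NGen3 sampler itself):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample_next_token(logits, temperature=1.0):
    """Draw one token id from the logits of the last position."""
    if temperature <= 0:
        return int(torch.argmax(logits))                 # greedy decoding
    probs = F.softmax(logits / temperature, dim=-1)      # rescale, then normalize
    return int(torch.multinomial(probs, num_samples=1))
```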
|
|
## Exporting |
|
|
Export a trained model (and its tokenizer configuration) for Hugging Face: |
|
|
|
|
|
```bash |
|
|
python _model_.py export --variant 7M --model_path 7M_model.pt --output_dir exported_7M |
|
|
``` |
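
If the export step writes a Hugging Face-compatible config, weights, and tokenizer files into the output directory, the result can typically be loaded with `transformers`. The following is an illustrative sketch under that assumption:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes exported_7M/ contains HF-compatible config, weight, and tokenizer files.
tokenizer = AutoTokenizer.from_pretrained("exported_7M")
model = AutoModelForCausalLM.from_pretrained("exported_7M")

inputs = tokenizer("To be, or not to be", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```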
|
|
|
|
|
## Knowledge Distillation |
|
|
Distill a larger teacher model (e.g., GPT-2 120M from HF) into a smaller student model (e.g., 7M): |
|
|
|
|
|
```bash |
|
|
python _model_.py distill --teacher_model_path hf --teacher_variant 120M --student_variant 7M --data _data_tinyshakespeare_/data.bin --temperature 2.0 --alpha 0.5 |
|
|
``` |
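
Conceptually, `--alpha` blends a soft loss against the teacher's temperature-scaled distribution with the usual hard-label cross-entropy, and `--temperature` softens both distributions. A hedged sketch of such a distillation loss (illustrative, not the exact NGen3 implementation):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, temperature=2.0, alpha=0.5):
    """alpha * soft KL term (teacher -> student) + (1 - alpha) * hard cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # conventional scaling to keep gradient magnitudes comparable
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1.0 - alpha) * hard
```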
|
|
|
|
|
## Fine-Tuning |
|
|
### Local Fine-Tuning on Conversational Data
|
|
Fine-tune a distilled model using local conversation data: |
|
|
|
|
|
```bash |
|
|
|
|
|
python _model_.py finetune --variant 120M --model_checkpoint distilled_120M_model.pt --data _data_conversations_/data.bin --finetune_iters 1000 --prompt "Hello, how are you?" --sample_length 100 --sample_temperature 1.0 |
|
|
``` |
|
|
### Hugging Face Fine-Tuning on a Conversational Dataset
|
|
Fine-tune on a conversational dataset from Hugging Face (e.g., roskoN/dailydialog): |
|
|
|
|
|
```bash |
|
|
|
|
|
python _model_.py hf_finetune --variant 120M --model_checkpoint distilled_120M_model.pt --hf_dataset roskoN/dailydialog --hf_split train --hf_text_column utterances --finetune_iters 1000 --prompt "Hello, how are you?" --sample_length 100 --sample_temperature 1.0 |
|
|
``` |
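
Under the hood, fine-tuning continues next-token prediction on the new token stream. A minimal sketch of a single training step (illustrative; `model`, `optimizer`, `x`, and `y` are assumed stand-ins, not NGen3 APIs):

```python
import torch.nn.functional as F

def finetune_step(model, optimizer, x, y):
    """One next-token step; x and y are (batch, block_size) token ids, with y offset by one position."""
    logits = model(x)                                    # (batch, block_size, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```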
|
|
|
|
|
## Sampling and Exporting Fine-Tuned Models |
|
|
After fine-tuning, you can sample from or export the fine-tuned model just as with any other checkpoint. For example, if your fine-tuned model is saved as `finetuned_120M_model.pt`:
|
|
|
|
|
**Sampling:**
|
|
|
|
|
```bash |
|
|
python _model_.py sample --variant 120M --model_checkpoint finetuned_120M_model.pt --prompt "What do you think about AI?" --length 100 --temperature 1.0 |
|
|
``` |
|
|
**Exporting:**
|
|
|
|
|
```bash |
|
|
python _model_.py export --variant 120M --model_path finetuned_120M_model.pt --output_dir exported_finetuned_120M |
|
|
``` |
|
|
## Hyperparameters |
|
|
Each model variant comes with predefined hyperparameters. For example: |
|
|
|
|
|
- **7M Variant:** 4 layers, 4 heads, 128-dimensional embeddings; block size 128, batch size 16, learning rate 3e-4.
- **120M Variant:** 12 layers, 8 heads, 512-dimensional embeddings; block size 256, batch size 32, learning rate 3e-4.
- **300M, 500M, 700M, and 1B Variants:** Progressively more layers, heads, and larger embedding dimensions for better performance.
|
|
|
|
|
Adjust `max_iters`, `log_interval`, and `eval_interval` to suit your dataset size and computational resources.
|
|
|
|
|
|
|
|
## Acknowledgements |
|
|
NGen3 is inspired by leading models including GPT-4, Claude-3, and Llama 2. Special thanks to the open-source community for: |
|
|
|
|
|
- PyTorch |
|
|
- Hugging Face Transformers |
|
|
- Hugging Face Datasets |
|
|
- safetensors |