# Leap-0
This repository contains the implementation of **Leap-0**, a lightweight, modified version of the GPT architecture trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.
<div align="center">
  <img src="LLM.drawio.png" alt="Leap-0 architecture diagram" width="300">
  <p><strong>Figure 1: Architecture of Leap-0</strong></p>
</div>
## Features
- **Custom GPT Architecture**: A miniaturized version of the GPT model tailored for efficient training on limited hardware.
- **Local Training**: Complete model training executed on local resources, enabling cost-effective development.
- **Open-Source Dataset**: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
- **Scalable Design**: Architecture optimized for experimentation and scalability while maintaining resource efficiency.
## Implementation Details
1. **Model Architecture**
   - A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
   - Incorporates modifications to parameter scaling to suit resource-constrained environments.
2. **Training**
   - Training executed locally in PyTorch on an NVIDIA RTX 4500 Ada Generation GPU (24 GB).
3. **Testing**
   - A simple Streamlit UI for testing the model's text-generation capability (a minimal sketch follows below).
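The sketch below illustrates what such a test UI could look like. It is an illustration only: the `load_model` and `generate` helpers and the checkpoint name `leap0.pt` are hypothetical stand-ins, not the repository's actual interfaces.

```python
# app.py -- minimal Streamlit sketch for interactive generation.
# `load_model`, `generate`, and "leap0.pt" are hypothetical placeholders.
import streamlit as st

st.title("Leap-0 Playground")

prompt = st.text_area("Prompt", "The water cycle begins when")
max_new_tokens = st.slider("Max new tokens", min_value=16, max_value=256, value=64)

if st.button("Generate"):
    model = load_model("leap0.pt")                  # assumed helper: load the trained checkpoint
    text = generate(model, prompt, max_new_tokens)  # assumed helper: autoregressive sampling
    st.write(text)
```

Launched with `streamlit run app.py`, this gives a prompt box, a token-count slider, and a generate button.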
## Model Architecture
### Configuration
- **Sequence Length:** 512 tokens
- **Vocabulary Size:** 48,951 tokens
  - Includes 50,000 BPE merges, 256 special byte tokens, and 1 `<|endoftext|>` token.
- **Number of Layers:** 4 transformer blocks
- **Attention Heads:** 8 per block
- **Embedding Dimension:** 512
- **Dropout:** 0.1
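As a quick reference, these hyperparameters map onto a configuration object along the following lines. This is a sketch; the class and field names follow common GPT-style conventions and are not necessarily the repository's exact identifiers.

```python
from dataclasses import dataclass

@dataclass
class Leap0Config:
    block_size: int = 512     # maximum sequence length (tokens)
    vocab_size: int = 48_951  # BPE merges + byte tokens + <|endoftext|>
    n_layer: int = 4          # number of transformer blocks
    n_head: int = 8           # attention heads per block
    n_embd: int = 512         # embedding dimension
    dropout: float = 0.1
```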
### Components
1. **Embeddings:**
   - **Word Embeddings (`wte`):** Learnable token embeddings of size `n_embd`.
   - **Position Embeddings (`wpe`):** Learnable positional embeddings for sequences up to `block_size`.
2. **Transformer Blocks:**
   - A stack of 4 transformer blocks, each comprising:
     - Multi-head self-attention mechanisms.
     - Feedforward networks for feature transformation.
3. **Output Head:**
   - **Linear Layer (`lm_head`):** Maps hidden states to logits for token predictions.
   - Implements weight sharing between token embeddings (`wte`) and output projection for parameter efficiency.
4. **Layer Normalization:**
   - Final layer normalization (`ln_f`) ensures stable optimization.
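Put together, the components above correspond to a module along these lines. This is a minimal PyTorch sketch, not the repository's actual implementation: it reuses `Leap0Config` from the configuration sketch, the `wte`/`wpe`/`ln_f`/`lm_head` names follow the identifiers above, and the block is assumed to be a standard pre-LayerNorm attention + feedforward pair built on `nn.MultiheadAttention`.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """One transformer block: causal multi-head self-attention + feedforward."""
    def __init__(self, config):
        super().__init__()
        self.ln_1 = nn.LayerNorm(config.n_embd)
        self.attn = nn.MultiheadAttention(config.n_embd, config.n_head,
                                          dropout=config.dropout, batch_first=True)
        self.ln_2 = nn.LayerNorm(config.n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(config.n_embd, 4 * config.n_embd),
            nn.GELU(),
            nn.Linear(4 * config.n_embd, config.n_embd),
            nn.Dropout(config.dropout),
        )

    def forward(self, x):
        T = x.size(1)
        # Causal mask: each position attends only to itself and earlier positions.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.ln_2(x))
        return x

class Leap0(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.wte = nn.Embedding(config.vocab_size, config.n_embd)  # token embeddings
        self.wpe = nn.Embedding(config.block_size, config.n_embd)  # positional embeddings
        self.drop = nn.Dropout(config.dropout)
        self.blocks = nn.ModuleList([Block(config) for _ in range(config.n_layer)])
        self.ln_f = nn.LayerNorm(config.n_embd)                    # final layer norm
        self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                      # weight sharing with wte

    def forward(self, idx):
        # idx: (batch, time) token ids, with time <= block_size
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        for block in self.blocks:
            x = block(x)
        return self.lm_head(self.ln_f(x))                          # logits over the vocabulary
```

A forward pass over token ids of shape `(batch, block_size)` then yields logits of shape `(batch, block_size, vocab_size)`.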
## Current Status
1. Dataset: FineWeb-Edu (18.5 GB), used in its entirety.
2. Training steps: 5,000
3. Training time: ~7 hours
4. Checkpoint format: `.pt`
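Loading such a `.pt` checkpoint for inference could look like the sketch below. The file name `leap0.pt` and the assumption that it stores a plain `state_dict` are illustrative; `Leap0` and `Leap0Config` refer to the sketches above.

```python
import torch

# Assumes the checkpoint was written with torch.save(model.state_dict(), "leap0.pt").
config = Leap0Config()
model = Leap0(config)
state_dict = torch.load("leap0.pt", map_location="cpu")
model.load_state_dict(state_dict)
model.eval()  # disable dropout for generation
```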
## Requirements
- Python 3.8+
- PyTorch 2.0+ or TensorFlow 2.10+
- CUDA-enabled GPU with at least 4 GB of VRAM (recommended)
- Dependencies listed in `requirements.txt`
- **Note**: Different operating systems support different versions of PyTorch/TensorFlow for CUDA (local GPU) use; verify compatibility with your OS before installing.
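One quick way to confirm that the installed PyTorch build can actually see the local GPU before starting a training run:

```python
import torch

print(torch.__version__)                   # expect 2.0 or newer
print(torch.cuda.is_available())           # True only if the CUDA build matches your driver
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # name of the local GPU
```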