# LangPWT

This repository contains the implementation of **LangPWT**, a lightweight, modified version of the GPT architecture, trained from scratch on FineWeb-Edu, an open-source dataset. The project demonstrates the design, training, and optimization of a custom language model on local hardware.

## Features

- **Custom GPT Architecture**: A miniaturized version of the GPT model, tailored for efficient training on limited hardware.
- **Local Training**: Complete model training executed on local resources, enabling cost-effective development.
- **Open-Source Datasets**: Trained on the publicly available FineWeb-Edu dataset to ensure accessibility and reproducibility.
- **Scalable Design**: Architecture optimized for experimentation and scalability while maintaining resource efficiency.

<div align="center">
  <img src="LLM.drawio.png" alt="Architecture diagram of LangPWT" width="300">
  <p><strong>Figure 1: Architecture of LangPWT</strong></p>
</div>

## Implementation Details

1. **Model Architecture**
   - A streamlined GPT-based architecture designed for reduced complexity and improved training efficiency.
   - Incorporates modifications to parameter scaling to suit resource-constrained environments.

2. **Training**
   - Training executed locally on an NVIDIA GeForce RTX 3050 (Laptop) GPU with 4 GB VRAM, using PyTorch; a generic training-step sketch follows this list.

3. **Testing**
   - A simple Streamlit UI for testing the model's text-generation capability.
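
The sketch below illustrates the kind of plain single-GPU training step such a setup runs. It is a generic PyTorch pattern, not code taken from `trainpwt.py`; the `model` and `optimizer` names are placeholders.

```python
import torch
import torch.nn.functional as F

# Generic local-GPU training-step pattern (illustrative, not from trainpwt.py).
device = "cuda" if torch.cuda.is_available() else "cpu"

def train_step(model, optimizer, idx, targets):
    """One forward/backward pass with next-token cross-entropy loss."""
    logits = model(idx.to(device))                       # (B, T, vocab_size)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                           targets.to(device).view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
    return loss.item()
```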

## Model Architecture

### Configuration

- **Sequence Length:** 512 tokens
- **Vocabulary Size:** 48,951 tokens
  - Includes 50,000 BPE merges, 256 special byte tokens, and 1 `<|endoftext|>` token.
- **Number of Layers:** 4 transformer blocks
- **Attention Heads:** 8 per block
- **Embedding Dimension:** 512
- **Dropout:** 0.1
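
For reference, this configuration maps onto a small config object like the sketch below. The class name and field layout are assumptions (the field names `block_size` and `n_embd` follow the conventions used in the Components section); `trainpwt.py` may organize its hyperparameters differently.

```python
from dataclasses import dataclass

@dataclass
class LangPWTConfig:
    # Values taken from the configuration above; the class itself is a sketch.
    block_size: int = 512      # maximum sequence length in tokens
    vocab_size: int = 48_951   # tokenizer vocabulary size
    n_layer: int = 4           # number of transformer blocks
    n_head: int = 8            # attention heads per block
    n_embd: int = 512          # embedding dimension
    dropout: float = 0.1
```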

### Components

1. **Embeddings:**
   - **Word Embeddings (`wte`):** Learnable token embeddings of size `n_embd`.
   - **Position Embeddings (`wpe`):** Learnable positional embeddings for sequences up to `block_size`.

2. **Transformer Blocks:**
   - A stack of 4 transformer blocks, each comprising:
     - A multi-head self-attention mechanism.
     - A feedforward network for feature transformation.

3. **Output Head:**
   - **Linear Layer (`lm_head`):** Maps hidden states to logits for token predictions.
   - Implements weight sharing between the token embeddings (`wte`) and the output projection for parameter efficiency.

4. **Layer Normalization:**
   - Final layer normalization (`ln_f`) ensures stable optimization.
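
Taken together, these components suggest a module layout like the following minimal sketch, built on the `LangPWTConfig` sketch above. `nn.TransformerEncoderLayer` stands in for the repository's custom attention/feedforward blocks, so details such as pre- vs. post-normalization may differ from `trainpwt.py`.

```python
import torch
import torch.nn as nn

class LangPWT(nn.Module):
    """Minimal GPT-style skeleton matching the components above (a sketch)."""

    def __init__(self, cfg: LangPWTConfig):
        super().__init__()
        self.wte = nn.Embedding(cfg.vocab_size, cfg.n_embd)  # word embeddings
        self.wpe = nn.Embedding(cfg.block_size, cfg.n_embd)  # position embeddings
        self.drop = nn.Dropout(cfg.dropout)
        # Stack of transformer blocks: self-attention + feedforward in each.
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=cfg.n_embd, nhead=cfg.n_head,
                dim_feedforward=4 * cfg.n_embd,
                dropout=cfg.dropout, batch_first=True)
            for _ in range(cfg.n_layer)
        ])
        self.ln_f = nn.LayerNorm(cfg.n_embd)                 # final layer norm
        self.lm_head = nn.Linear(cfg.n_embd, cfg.vocab_size, bias=False)
        self.lm_head.weight = self.wte.weight                # weight sharing

    def forward(self, idx: torch.Tensor) -> torch.Tensor:
        t = idx.size(1)
        pos = torch.arange(t, device=idx.device)
        x = self.drop(self.wte(idx) + self.wpe(pos))
        # Causal mask: each position attends only to itself and earlier tokens.
        mask = torch.triu(torch.full((t, t), float("-inf"), device=idx.device),
                          diagonal=1)
        for block in self.blocks:
            x = block(x, src_mask=mask)
        return self.lm_head(self.ln_f(x))                    # (B, T, vocab_size)
```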

## Current Status

1. Dataset Used: FineWeb-Edu (18.5 GB), used in its entirety.
2. Training Steps: 5,000
3. Time Taken: ~7 hours
4. File Format: `.pt` (PyTorch checkpoint)
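
Such a `.pt` checkpoint can typically be reloaded for inference as sketched below. The filename `langpwt.pt` is a placeholder, and the sketch assumes the file stores a plain `state_dict` (the actual format written by `trainpwt.py` may differ); `LangPWT` and `LangPWTConfig` refer to the sketches above.

```python
import torch

# "langpwt.pt" is a placeholder path; use the checkpoint trainpwt.py writes.
state_dict = torch.load("langpwt.pt", map_location="cpu")
model = LangPWT(LangPWTConfig())   # sketch classes from the sections above
model.load_state_dict(state_dict)  # assumes the file holds a plain state_dict
model.eval()                       # switch to inference mode for generation
```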

## Requirements

- Python 3.8+
- PyTorch 2.0+ or TensorFlow 2.10+
- CUDA-enabled GPU with at least 4 GB VRAM (recommended)
- Dependencies listed in `requirements.txt`
- **Note**: CUDA-enabled builds of PyTorch/TensorFlow differ across operating systems. Verify the correct build for your OS before installing.

## Usage

1. Clone the repository:
```bash
git clone https://github.com/pulkundwar29/LangPWT
cd LangPWT
```
2. Create and activate a virtual environment:
```bash
python -m venv env
env\Scripts\activate   # Windows; on Linux/macOS use: source env/bin/activate
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Run the training script **trainpwt.py**.
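A typical invocation, assuming the script runs with its built-in defaults and takes no required arguments, is:
```bash
python trainpwt.py
```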
5. Launch the Streamlit app **trial_pwt.py**.
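Streamlit apps are started with the `streamlit run` command:
```bash
streamlit run trial_pwt.py
```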
6. Enter your prompt and hit the **Generate** button.

<div align="center">
  <img src="ex1.png" alt="Example of text generated by LangPWT">
  <p><strong>Figure 2: Example of Text Generated using LangPWT</strong></p>
</div>