mirajnair committed on
Commit bd8e93b · verified · 1 Parent(s): ef3d128

Update README.md

Files changed (1):
  1. README.md +125 -133
README.md CHANGED
---
{}
---

# Simple LLM Training with GPT-2 Architecture

This repository demonstrates how to train a large language model (LLM) from scratch using the GPT-2 architecture. The model is trained on numerical sequences so that it learns to predict arithmetic patterns.

## 📌 Overview

This project implements a full machine learning pipeline:

- 📊 **Synthetic dataset generation** (number sequences)
- 🔤 **Custom tokenizer training**
- 🧠 **Model training** using GPT-2
- 🤖 **Inference capabilities**

---

## 🚧 Progress So Far

We have trained a **6.4 million parameter** model that:

- Uses **base-16 (hexadecimal)** conversion for tokenization.
- Can **add numbers of up to 4 digits with 100% accuracy**.
- Is publicly available on GitHub: 🔗 [Rajesh-Nair/simple-llm](https://github.com/Rajesh-Nair/simple-llm)

---

## 🏗️ Dataset Generator

Synthetic number sequences are generated based on the parameters defined in `data_config.yaml`.

**Example Configuration:**

- **Number range:** `0 - 9999`
- **Number of sequences:** `100,000`
- **Output path:** `../simple-llm-data/sequences.txt`
- **Delimiters:** `|` (columns), `\n` (rows)

### 🔧 To Generate the Dataset

1. Update `data_config.yaml` with your desired parameters.
2. Run the generator:

```bash
python3 data_generator.py
```
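
For illustration, the sketch below shows what such a generator could look like. The config keys (`min_value`, `max_value`, `num_sequences`, `output_path`) and the three-column row layout are assumptions made for this example, not the actual contents of `data_config.yaml`.

```python
# Illustrative sketch only; the real data_generator.py reads its parameters
# from data_config.yaml and may use a different row layout.
import random

config = {                       # assumed keys, for illustration
    "min_value": 0,
    "max_value": 9999,
    "num_sequences": 100_000,
    "output_path": "sequences.txt",
}

with open(config["output_path"], "w") as f:
    for _ in range(config["num_sequences"]):
        a = random.randint(config["min_value"], config["max_value"])
        b = random.randint(config["min_value"], config["max_value"])
        # columns separated by '|', one example per row
        f.write(f"{a}|{b}|{a + b}\n")
```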

---

## 🎯 Training

### Step 1: Train the Tokenizer

```bash
python3 tokenizer.py
```

### Step 2: Train the Model

```bash
python3 trainer.py
```

Training configuration is managed in `train_config.yaml` (a rough illustration follows the list below), including:

- 🔧 Model architecture (layers, heads, embedding size)
- ⚙️ Training hyperparameters (batch size, learning rate)
- 💾 Checkpointing and saving
- ☁️ Hugging Face Hub integration
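
As a rough illustration of the architecture settings such a config controls, the snippet below builds a small GPT-2 model with Hugging Face `transformers` and counts its parameters. The specific values are placeholders chosen for this example, not the settings in `train_config.yaml`.

```python
# Illustrative only: a small GPT-2 configuration built with Hugging Face
# transformers. The values are placeholders, not the train_config.yaml settings.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=32,    # small vocab for digits, operators and padding (assumed)
    n_positions=64,   # maximum sequence length (assumed)
    n_embd=128,       # embedding size
    n_layer=4,        # number of transformer layers
    n_head=4,         # attention heads per layer
)
model = GPT2LMHeadModel(config)
print(f"parameters: {sum(p.numel() for p in model.parameters()):,}")
```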

---

## 🔢 Position Embeddings

### 📏 Learnable vs. Sinusoidal Embeddings

- **Learnable embeddings**: trained position vectors that can adapt to the numeric patterns in the data.
- **Sinusoidal embeddings**: fixed, mathematically defined encodings of position.
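
For reference, sinusoidal embeddings follow the fixed formulation from the original Transformer paper ("Attention Is All You Need"); a minimal sketch:

```python
# Sinusoidal position embeddings (Vaswani et al., 2017), minimal sketch.
import numpy as np

def sinusoidal_embeddings(max_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(max_len)[:, None]      # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))
    emb = np.zeros((max_len, d_model))
    emb[:, 0::2] = np.sin(angles)                # even dimensions
    emb[:, 1::2] = np.cos(angles)                # odd dimensions
    return emb

print(sinusoidal_embeddings(max_len=16, d_model=8).shape)  # (16, 8)
```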

---

### 🧮 Block Position IDs (Abacus Embedding)

Inspired by the [Abacus Embedding paper](https://arxiv.org/pdf/2405.17399), we use **block position IDs**: the position counter restarts at every `+` delimiter, so each digit is indexed by its place within its own number.

**Example:**

- Input: `+1342+879+2221+`
- Block IDs: `012340123012340`
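
A minimal sketch of how these IDs can be derived from a token string (assuming one character per token):

```python
# Block position IDs: reset the position counter at every '+' delimiter.
def block_position_ids(tokens: str) -> list[int]:
    ids, pos = [], 0
    for tok in tokens:
        if tok == "+":
            pos = 0          # a delimiter starts a new block
        ids.append(pos)
        pos += 1
    return ids

print("".join(map(str, block_position_ids("+1342+879+2221+"))))
# -> 012340123012340
```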

#### 🔍 Why Block Position IDs?

1. ✅ **Commutative support**: `a + b = b + a`; block IDs give both operands the same per-digit positions, reinforcing this symmetry.
2. 🧠 **Digit alignment**: helps align the units, tens, and hundreds positions across numbers for easier digit-wise processing.

---

### 🔄 Digit Reversal

As part of preprocessing, every number is written least-significant digit first:

- `5672 → 2765` (reversed)
- The model's output is reversed back during evaluation.
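
A minimal sketch of the reversal applied to one training example (the exact prompt formatting lives in the repository's code and may differ from this):

```python
# Digit reversal: write numbers least-significant digit first, so the model
# produces the units digit before any digits that depend on its carry.
def reverse_digits(n: int) -> str:
    return str(n)[::-1]

a, b = 5672, 489
prompt = f"+{reverse_digits(a)}+{reverse_digits(b)}+"   # model input
target = reverse_digits(a + b)                           # expected completion

print(prompt)             # +2765+984+
print(target)             # 1616  (5672 + 489 = 6161, reversed)
print(int(target[::-1]))  # 6161  -> reversed back for evaluation
```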

#### 📈 Benefits of Reversal

1. 🧒 **Human-like computation**: mirrors how humans add starting from the least-significant digit; after reversal, the model generates digits in that same order.
2. 🎯 **Causal attention compatibility**: each digit can attend to the lower-order digits it needs for carry handling.
3. 📚 **Research-backed approach**: digit reversal has been used successfully in several papers, including:
   - [Transformers Can Do Arithmetic with the Right Embeddings](https://arxiv.org/pdf/2405.17399) (which also introduces the Abacus embedding)
   - [Transformers Can Achieve Length Generalization But Not Robustly](https://arxiv.org/pdf/2402.09371)

---

## 🧩 Tokenization Strategy

Tokenization is **critical** for arithmetic modeling. Our approach:

1. 📏 **Shortens sequences**: makes better use of the context window.
2. 🧬 **Boosts generalization**: encourages learning across number patterns.
3. 🔄 **Uses base conversion** (e.g., decimal → hexadecimal) for compact, arithmetic-aware tokens (see the example below).
4. 🧠 **Preserves arithmetic logic**: place value and carrying work the same way in higher bases.

_We're experimenting with different bases to improve efficiency further._
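
For example, converting operands to base 16 before tokenization shortens each number while keeping the arithmetic structure intact; a minimal sketch (the actual conversion in `tokenizer.py` may be implemented differently):

```python
# Base conversion before tokenization: hexadecimal digits make numbers
# shorter while preserving place value and carrying.
def to_hex_digits(n: int) -> str:
    return format(n, "x")    # e.g. 4095 -> "fff"

a, b = 4095, 255
print(to_hex_digits(a), to_hex_digits(b), to_hex_digits(a + b))
# fff ff 10fe   (4095 + 255 = 4350 = 0x10FE)
```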

---

## 🔁 Multi-token Prediction

Predicting **multiple tokens at once** increases efficiency. This is possible because all the numbers are reversed.

### Example: predicting two tokens at a time, the output `99` appears in the first iteration

```
Input (reversed):  +12+873+PPPPPPPP (P = padding tokens)
Output (reversed): PPPPPP99PPPPPPPP (P = padding tokens)
Position IDs:      0120123000000000
```
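
To unpack the example: `+12+873+` encodes the reversed operands of `21 + 378`; the sum is `399`, which is `993` when reversed, so the first two predicted tokens are `99`. A quick check:

```python
# Sanity check for the multi-token prediction example above.
a_rev, b_rev = "12", "873"                  # reversed operands from the prompt
a, b = int(a_rev[::-1]), int(b_rev[::-1])   # 21 and 378
answer_rev = str(a + b)[::-1]               # 399 -> "993"
print(a, b, a + b, answer_rev[:2])          # 21 378 399 99
```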

We currently support **2-token prediction**, and it works well:
🔗 [mirajnair/simple-llm-gpt2-v2.0](https://huggingface.co/mirajnair/simple-llm-gpt2-v2.0)

We are also working on generalizing this method, i.e. emitting each output token at the earliest opportunity so that two or more tokens can be predicted in one pass.
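
The published checkpoint can be pulled from the Hub like any other `transformers` model; a minimal sketch (the custom tokenizer, prompt format, and multi-token decoding follow the repository's own inference scripts, which this does not reproduce):

```python
# Load the published checkpoint from the Hugging Face Hub (sketch only).
# Tokenization and multi-token decoding are handled by the repository's
# own inference code and are not shown here.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("mirajnair/simple-llm-gpt2-v2.0")
print(model.config)
```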

## 📊 Attention Visualization

Visualizing attention patterns reveals how the model processes arithmetic operations. Below is an example for the addition problem `101 + 1002 = 1103`, represented in reversed form as `+101+2001+3011+`.

### Layer 1 Attention Patterns

![Layer 1 Attention Visualization](https://github.com/Rajesh-Nair/simple-llm/blob/master/attention_visualizations/layer_1_attention.png)

In this visualization:

- **Bright vertical bars** at positions 1, 5, and 10 show the model focusing on the unit digits of both inputs and the output.
- The model learns to align corresponding digit positions (units with units, tens with tens, and so on).
- The attention patterns show how information flows during the addition, including carry operations.

This supports our block position ID approach: it helps the model exploit the commutative nature of addition and align digits correctly for arithmetic. The model has learned to focus on the digits relevant to each step of the calculation, much as humans do when adding by hand.
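
Attention maps like this can be extracted from a GPT-2 model in `transformers` by requesting attention outputs and then plotting them (e.g. with matplotlib); a minimal sketch using a stand-in checkpoint, since the repository's own visualization script may differ:

```python
# Extract layer-1 attention weights from a GPT-2 model (sketch only).
import torch
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("gpt2")  # stand-in checkpoint
model.eval()

input_ids = torch.tensor([[0, 1, 2, 3]])         # placeholder token IDs
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

layer_1 = outputs.attentions[0]                  # shape: (batch, heads, seq, seq)
print(layer_1.shape)
```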

## 🎯 Performance Results

We have tested the model's addition capability extensively:

### Addition Performance Test

- **Test set**: 10,000 random pairs of 4-digit numbers
- **Accuracy**: 100%
- **Consistency**: perfect accuracy maintained across multiple test runs
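
The evaluation itself is a simple loop over random problems; a hedged sketch is below, where `predict_sum` is a hypothetical stand-in for the repository's actual inference routine (with the placeholder the loop trivially reports 100%; the real figure comes from the model):

```python
# Evaluation sketch: 10,000 random 4-digit addition problems.
import random

def predict_sum(a: int, b: int) -> int:
    # Hypothetical placeholder: replace with the model's inference routine.
    return a + b

correct = 0
for _ in range(10_000):
    a, b = random.randint(1000, 9999), random.randint(1000, 9999)
    correct += int(predict_sum(a, b) == a + b)

print(f"accuracy: {correct / 10_000:.2%}")
```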

This perfect accuracy demonstrates that our approach successfully teaches the model to perform addition with complete reliability, even on previously unseen number combinations. The combination of our specialized tokenization strategy, position encoding, and multi-token prediction enables the model to generalize arithmetic rules effectively.

These results validate our architectural choices and confirm that transformer-based models can master fundamental arithmetic operations when properly designed.

## 🚀 Next Steps

1. **Multi-token generation**:
   - We've shown that the model can output more than one token at a time.
   - Test whether the model can generate all tokens in one go (greedy generation).

2. **Scale up**:
   - Increase the length/number of digits in the operations.
   - Scale up the model size for more complex operations.

3. **Length generalization**:
   - Implement and test length generalization techniques as described in [Transformers Can Achieve Length Generalization But Not Robustly](https://arxiv.org/pdf/2402.09371).
   - Explore methods to improve the model's ability to handle varying sequence lengths.

4. **Batch prediction**:
   - Implement parallel processing of multiple arithmetic operations.
   - Optimize throughput by processing multiple sequences simultaneously.
   - Reduce overall inference time for bulk operations.

5. **KV cache**:
   - Implement key-value caching to avoid redundant computation.
   - Optimize memory usage during autoregressive generation.
   - Speed up sequential token generation by reusing previous computations.