---
language: en
license: mit
library_name: pytorch
tags:
- transformer
- adapters
- continual-learning
- dual-memory
- minimal
- educational
- nlp
- language-model
- online-learning
datasets:
- text8
- tinyshakespeare
model_name: "Microformer"
model_type: "stacked-adapter-transformer"
pipeline_tag: text-generation
widget:
- text: "Describe the internet"
- text: "Who is Buck?"
- text: "Call me Ishmael."
---
# Microformer
**Microformer** is a minimal, educational-scale transformer language model built from scratch in PyTorch.
Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and OpenAI's GPT-1, Microformer is designed for learning, experimentation, and prototyping on lightweight datasets like [text8](https://mattmahoney.net/dc/textdata.html) or Tiny Shakespeare.
---
## Features
- Decoder-only transformer (GPT-style) architecture
- **Stacked adapters per layer for dual-memory:**
- **Long-term adapters** (for corpus/knowledge facts)
- **Session adapters** (for rapid, online, user/session-specific learning)
- Choice of character-level **or** subword/BPE tokenization (configurable)
- Learnable positional encoding
- Multi-head self-attention
- Configurable depth, embedding size, sequence length, and attention heads
- Simple end-to-end pipeline: preprocessing, training, and text generation
- Modular, readable code ideal for educational use and tinkering
- Temperature and multinomial sampling in text generation
---
## What's Unique: Stacked Adapters for Dual-Memory Learning
Microformer implements **two adapters in every transformer block**:
- **Long-term adapter:**
Trained with your full corpus during batch/corpus training.
Stores stable, general "knowledge" (e.g., literary style, factual info).
- **Session adapter:**
Starts blank and is trained *on the fly* during chat or interactive teaching.
Lets you rapidly "teach" new facts, styles, or user preferences without overwriting core knowledge.
At inference, the outputs of both adapters (plus the core transformer) are combined, giving the model both stable long-term memory and flexible, session-specific memory, loosely like a human brain's "temporal lobe" and "core memory".
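A minimal sketch of what one such block might look like (names and wiring are illustrative; the actual `models/model.py` may differ):
```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project (Houlsby-style)."""
    def __init__(self, embed_dim: int, adapter_dim: int):
        super().__init__()
        self.down = nn.Linear(embed_dim, adapter_dim)
        self.up = nn.Linear(adapter_dim, embed_dim)

    def forward(self, x):
        return self.up(torch.relu(self.down(x)))  # residual connection is added by the caller

class DualMemoryBlock(nn.Module):
    """Decoder block with a long-term adapter and a session adapter stacked alongside the FFN."""
    def __init__(self, embed_dim=128, num_heads=4, ff_dim=256, adapter_dim=32):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(embed_dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, embed_dim))
        self.ln1, self.ln2 = nn.LayerNorm(embed_dim), nn.LayerNorm(embed_dim)
        self.long_term = Adapter(embed_dim, adapter_dim)  # trained during corpus training
        self.session = Adapter(embed_dim, adapter_dim)    # trained on the fly at chat time

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask)
        x = x + a
        h = self.ln2(x)
        # Core feed-forward output plus both adapter outputs, all added as residuals.
        return x + self.ff(h) + self.long_term(h) + self.session(h)
```
Because each adapter is its own `nn.Module`, its parameters can be frozen, trained, or reset independently of the core weights; that separation is what makes the dual-memory workflow possible.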
---
## Project Structure
```
microformer/
├── config.py              # Hyperparameters and model settings
├── data/
│   ├── corpus.txt         # Raw training text
│   ├── train.pt           # Preprocessed training tensor (token IDs)
│   ├── val.pt             # Validation tensor (token IDs)
│   ├── vocab.json         # Vocabulary (char or subword, stoi/itos mapping)
│   └── tokenizer.json     # (optional) BPE tokenizer file if using subwords
├── models/
│   └── model.py           # Transformer model definition (Microformer)
├── scripts/
│   ├── prepare_data.py    # Data preprocessing/tokenization
│   ├── train.py           # Training script (trains long-term adapters)
│   ├── generate_text.py   # Inference/generation + online learning (session adapters)
│   └── tokenizer_setup.py # BPE tokenizer setup
└── README.md
```
---
## Quickstart
1. **Prepare your corpus**
Place your text data in `data/corpus.txt`.
2. **Choose your tokenizer:**
- **Character-level (default):**
No extra steps needed.
- **BPE/Subword (recommended for rich/modern text):**
```bash
python scripts/tokenizer_setup.py --input data/corpus.txt --vocab_size 1000
```
3. **Prepare the dataset**
```bash
python scripts/prepare_data.py
```
4. **Train the model (long-term knowledge)**
```bash
python scripts/train.py
```
- This trains only the **long-term adapters** and core weights.
- Session adapters remain untrained (blank) until chat time.
5. **Generate text and teach interactively (session memory)**
```bash
python scripts/generate_text.py
```
- Loads your trained model.
- Prompts for a seed string and temperature.
- **Allows you to "teach" new facts on the fly!**
- New knowledge is stored in the session adapters; it does *not* overwrite long-term knowledge.
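Under the hood, generation with temperature and multinomial sampling boils down to a loop like the following (a simplified sketch, not the exact `generate_text.py` code; the model's output shape is assumed):
```python
import torch

@torch.no_grad()
def sample(model, ids, max_new_tokens=200, temperature=0.8, max_seq_len=128):
    """Autoregressive sampling sketch; `ids` is a (1, T) tensor of token IDs."""
    model.eval()
    for _ in range(max_new_tokens):
        context = ids[:, -max_seq_len:]                     # crop to the context window
        logits = model(context)                             # assumed shape: (1, T, vocab_size)
        logits = logits[:, -1, :] / temperature             # last position; temperature sharpens/flattens
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample instead of taking the argmax
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```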
---
## Example Config (`config.py`)
```python
EMBED_DIM = 128
NUM_HEADS = 4
NUM_LAYERS = 2
FF_DIM = 256
MAX_SEQ_LEN = 128
BATCH_SIZE = 32
ADAPTER_DIM = 32 # Used for both long-term and session adapters
VOCAB_SIZE = 100 # Set automatically from tokenizer/vocab
```
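`VOCAB_SIZE` is filled in from the vocabulary produced during preprocessing; for the character-level case that amounts to something like the following (the exact layout of `vocab.json` is an assumption based on the project structure above):
```python
import json

# Assumed layout: {"stoi": {token: id, ...}, "itos": {id: token, ...}}
with open("data/vocab.json") as f:
    vocab = json.load(f)

VOCAB_SIZE = len(vocab["stoi"])  # on the order of 100 for char-level text
```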
---
## Using the Dual-Memory System
- **Long-term adapters:**
Learned during `train.py`; they persist between runs.
- **Session adapters:**
Learned during interactive chat in `generate_text.py`; optionally resettable between users/sessions.
- **Teach new facts by entering a prompt and providing your ideal answer.**
The model will "remember" this during the session, even if it wasn't present in the training corpus.
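In code, a teaching step amounts to a few gradient updates applied only to the session adapters. A minimal sketch, assuming the session adapter parameters can be recognized by name (the real `generate_text.py` may differ):
```python
import torch
import torch.nn.functional as F

def teach(model, prompt_ids, answer_ids, steps=20, lr=1e-3):
    """Fine-tune only the session adapters on one (prompt, ideal answer) pair."""
    # Freeze everything except the session adapters (assumed naming convention).
    for name, p in model.named_parameters():
        p.requires_grad = "session" in name
    opt = torch.optim.Adam([p for p in model.parameters() if p.requires_grad], lr=lr)

    ids = torch.cat([prompt_ids, answer_ids], dim=1)   # (1, T) token IDs
    inputs, targets = ids[:, :-1], ids[:, 1:]          # next-token prediction
    for _ in range(steps):
        logits = model(inputs)                          # assumed shape: (1, T-1, vocab_size)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()
```
Resetting the session memory between users is then just re-initializing (or reloading) those same session adapter parameters.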
---
## Customization & Ideas
- Use BPE/subword tokenization for more expressive modeling (recommended for non-trivial datasets)
- Add more adapters or experiment with gating (e.g., blend adapters by context; see the sketch after this list)
- Combine with a key-value retrieval or buffer for truly persistent "user memory"
- Visualize training with TensorBoard or wandb
- Tinker with alternative attention or memory mechanisms
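For the gating idea, one simple option is a small learned gate that decides, per token, how much each adapter contributes; a purely illustrative sketch (the `Adapter` modules are assumed to match the block sketch above):
```python
import torch
import torch.nn as nn

class GatedAdapters(nn.Module):
    """Blend long-term and session adapter outputs with a context-dependent gate."""
    def __init__(self, embed_dim, long_term, session):
        super().__init__()
        self.long_term, self.session = long_term, session
        self.gate = nn.Linear(embed_dim, 1)  # one mixing weight per token

    def forward(self, h):
        g = torch.sigmoid(self.gate(h))      # (batch, seq, 1): 0 = long-term only, 1 = session only
        return (1 - g) * self.long_term(h) + g * self.session(h)
```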
---
## Requirements
- Python 3.8+
- [PyTorch](https://pytorch.org/)
- [tokenizers](https://github.com/huggingface/tokenizers) (for BPE/subword)
Install dependencies with:
```bash
pip install torch tokenizers
```
---
## Credits
- Inspired by [nanoGPT](https://github.com/karpathy/nanoGPT) and [minGPT](https://github.com/karpathy/minGPT) by Andrej Karpathy
- Adapter and continual-learning inspiration from recent NLP research ([Houlsby et al. 2019](https://arxiv.org/abs/1902.00751))
- Built using concepts from the original [GPT-1 paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)
---
## License
MIT License – Use freely for learning and experimentation.
---
**Happy tinkering with dual-memory transformers!**