# Portulino
A project that trains a language model completely from scratch, using free Google Colab T4 GPUs.
## 📋 Model Specifications
```python
GPTConfig(
    block_size=1024,            # Context size
    vocab_size=32006,           # Vocabulary size
    n_layer=20,                 # Number of transformer layers
    n_head=16,                  # Number of attention heads
    n_kv_head=4,                # Key-value heads (GQA)
    n_embd=1024,                # Embedding dimension
    dropout=0.05,               # Dropout rate
    bias=False,                 # No bias in linear layers
    tie_word_embeddings=True    # Share input/output embeddings
)
```
**Total parameters:** ~280M
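As a rough sanity check, the parameter count can be estimated directly from the config. This sketch assumes details the config doesn't pin down (a standard non-gated 4x MLP, two norms per block, no learned positional table); a gated MLP variant or extra embeddings would shift the total toward the ~280M figure above.

```python
def estimate_params(vocab_size=32006, n_layer=20, n_head=16,
                    n_kv_head=4, n_embd=1024, tie_word_embeddings=True):
    """Back-of-envelope parameter count for a bias-free GQA transformer."""
    head_dim = n_embd // n_head                      # 64
    # Attention: full Q projection, smaller K/V projections (GQA), output proj.
    attn = n_embd * n_embd                           # Q
    attn += 2 * n_embd * (n_kv_head * head_dim)      # K and V
    attn += n_embd * n_embd                          # output projection
    # MLP: assuming the common 4x expansion with up + down projections.
    mlp = 2 * n_embd * (4 * n_embd)
    # Two norms per block plus a final norm (weight vectors only).
    norms = n_layer * 2 * n_embd + n_embd
    embeddings = vocab_size * n_embd                 # counted once if tied
    if not tie_word_embeddings:
        embeddings *= 2
    return n_layer * (attn + mlp) + norms + embeddings

print(f"~{estimate_params() / 1e6:.0f}M")  # ~253M under these assumptions
```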
## 🎯 Features
- ✅ **Training from scratch** - no pre-trained model
- ✅ **Free-tier friendly** - optimized for the 15 GB T4
- ✅ **Grouped Query Attention (GQA)** - improved efficiency
- ✅ **Automatic checkpointing** - saves every 100 iterations
- ✅ **Mixed-precision training** - efficient memory usage
- ✅ **PyTorch 2.0 compile** - performance optimizations
- ✅ **Batch size of 9** - the most that fits in T4 memory
## 📈 Training Progress
Current status:

- Iteration: 38,543
- Loss: 2.0038
- Learning rate: 0.0003
## How to Run the Chat
```shell
# Launch the chat interface (loads the latest checkpoint)
python chat.py
```
Then open http://127.0.0.1:8088 in your browser.
There are three prompt formats:

- `<|query|><|answer|>` - the traditional instruct format: the user makes a request and expects a response.
- `<|query|><|hole|><|answer|>` - thinking mode: the model reasons first and then generates the response (the `<|hole|>` token was reused because the vocabulary had already been created before the thinking-generation method emerged).
- Auto-complete - the old way: the model simply continues the provided text.
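A small helper makes the three formats concrete. The exact token layout the model expects is an assumption here (in particular, where generation stops in thinking mode), and `build_prompt` is an illustrative name, not a function from `chat.py`.

```python
def build_prompt(user_text: str, mode: str = "instruct") -> str:
    """Assemble a prompt in one of the three formats described above."""
    if mode == "instruct":      # plain request -> response
        return f"<|query|>{user_text}<|answer|>"
    if mode == "thinking":      # model emits reasoning, then <|answer|>, then the reply
        return f"<|query|>{user_text}<|hole|>"
    if mode == "autocomplete":  # raw continuation, no special tokens
        return user_text
    raise ValueError(f"unknown mode: {mode}")
```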
## Screenshot
## Project Structure
```
.
├── chat.py                                   # Main Python/Flask script to chat with the model
├── model3.py                                 # The model itself
├── chekpoint/ckpt.pt                         # Current checkpoint
├── vocab/byte-level-bpe.tokenizer.32k.json   # 32k-token vocabulary
├── templates/chat.html                       # Flask template for the page
└── README.md
```
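Resuming from `chekpoint/ckpt.pt` is then a few lines. The checkpoint's key names (`"model"`, `"optimizer"`, `"iter"`) are assumed here, not taken from the actual `chat.py`.

```python
import torch

def load_checkpoint(model, optimizer=None, path="chekpoint/ckpt.pt"):
    """Restore model (and optionally optimizer) state saved during training."""
    ckpt = torch.load(path, map_location="cpu")  # load to CPU, move later
    model.load_state_dict(ckpt["model"])
    if optimizer is not None and "optimizer" in ckpt:
        optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt.get("iter", 0)  # iteration to resume counting from
```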
## 📚 Learnings
This project demonstrates:
- How to train language models on limited hardware
- Memory optimization techniques for small GPUs
- Implementation of GQA (Grouped Query Attention)
- Effective use of PyTorch 2.0 compilation
- Checkpointing strategies for free sessions
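For the GQA point, the core idea is that the 4 K/V heads are each shared by 4 of the 16 query heads, shrinking the K/V projections and cache. A minimal sketch follows (shapes match the config above; the exact projection layout in `model3.py` is an assumption):

```python
import torch
import torch.nn.functional as F

def gqa_attention(x, wq, wk, wv, n_head=16, n_kv_head=4):
    """Grouped Query Attention: fewer K/V heads, repeated to match Q heads."""
    B, T, C = x.shape
    hd = C // n_head                                       # per-head dim
    q = (x @ wq).view(B, T, n_head, hd).transpose(1, 2)    # (B, 16, T, hd)
    k = (x @ wk).view(B, T, n_kv_head, hd).transpose(1, 2) # (B, 4, T, hd)
    v = (x @ wv).view(B, T, n_kv_head, hd).transpose(1, 2)
    # Each K/V head serves n_head // n_kv_head query heads.
    k = k.repeat_interleave(n_head // n_kv_head, dim=1)
    v = v.repeat_interleave(n_head // n_kv_head, dim=1)
    y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return y.transpose(1, 2).reshape(B, T, C)
```

With `n_kv_head=4` instead of 16, the K and V projection matrices are 4x smaller, which is part of how the model squeezes into the T4's memory.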
## ⚠️ Limitations
- Session time: Colab Free has a 12h limit
- Disconnections: Requires frequent checkpointing
- Small batch size: Limited by T4 memory
- Speed: ~40x slower than A100/H100 GPUs
Occasionally I pay for an A100, which gives training a good boost; so far I've only done that twice for this model. Otherwise, I run training on Colab's free T4s almost religiously every day, averaging about 150 iterations per day.
## Dataset
Most of the datasets are in Portuguese:
- Sites I've been scraping for many years, so there's content spanning several decades; this unfortunately skews the vocabulary - Dilma, for example, is in it even though she has since fallen into obscurity.
- Random books found on Google Drive folders and torrents.
- Scientific articles
- Reddit, early on, but its share was reduced over time because it 'dirtied' the model with profanity, ideology, and politics.
- Blogspot and WordPress blogs, mostly in Portuguese; these skew heavily toward religious, women's-interest, and left-wing content.
- A Portuguese Wikipedia dump from around 2015.
- Random niche PDFs found on the internet: legal documents, open magazines, study guides, manuals, civil service exam prep...
- CulturaX
- Instruct datasets such as Alpaca, OpenAssistant, Cabrita, GSM8K, and thinking-style data.
- Many instruct sets I generate automatically using ChatGPT, Gemini, and others; in fact, I run one of these generators every day.
- Overall, the corpus has strong Legal and Humanities content.
No, I will not provide the datasets.
## Safety
This is not a safe model! I'm not a company with the infrastructure to clean texts, and since the model is small, it can inevitably generate discriminatory, offensive, sexual, and horror content. However, thanks to strongly safety-oriented instructs, both externally sourced and written by me, the model has shown awareness around self-harm and dangerous generalizations.
## 🤝 Contributions
Contributions are welcome! Especially:
- Memory optimizations
- Data augmentation techniques
- Code improvements
- Documentation
## 📄 License
MIT License - Feel free to use in your projects!
## 🙏 Acknowledgments
- Google Colab for free infrastructure
- PyTorch community
- Reference papers: GPT-2, GPT-3, LLaMA
⚡ Trained with determination and free GPUs! ⚡
