Portulino

A language model trained completely from scratch, using free Google Colab T4 GPUs.

📊 Model Specifications

GPTConfig(
    block_size=1024,      # Context size
    vocab_size=32006,     # Vocabulary size
    n_layer=20,           # Number of transformer layers
    n_head=16,            # Number of attention heads
    n_kv_head=4,          # Key-value heads (GQA)
    n_embd=1024,          # Embedding dimension
    dropout=0.05,         # Dropout rate
    bias=False,           # No bias in linear layers
    tie_word_embeddings=True  # Share input/output embeddings
)

Total parameters: ~280M
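The parameter count can be sanity-checked from the config above. A rough back-of-the-envelope estimate, assuming a standard GPT-2-style MLP with a 4x expansion and learned position embeddings (the actual MLP width in model3.py may differ, which would shift the total toward the stated ~280M):

```python
# Rough parameter estimate from the GPTConfig values above.
# The MLP shape (plain 4x expansion) is an assumption; a wider or
# gated MLP would raise the total toward the stated ~280M.
n_layer, n_head, n_kv_head, n_embd = 20, 16, 4, 1024
vocab_size, block_size = 32006, 1024

head_dim = n_embd // n_head        # 64
kv_dim = n_kv_head * head_dim      # 256: GQA shrinks the K/V projections

embeddings = vocab_size * n_embd   # tied with the output head, counted once
positions = block_size * n_embd

attn = n_embd * n_embd             # Q projection
attn += 2 * n_embd * kv_dim        # K and V projections (grouped)
attn += n_embd * n_embd            # output projection
mlp = 2 * n_embd * (4 * n_embd)    # up + down projection, no bias
per_layer = attn + mlp

total = embeddings + positions + n_layer * per_layer
print(f"~{total / 1e6:.0f}M parameters")  # → ~254M parameters
```

With GQA, the K/V projections cost only a quarter of what full multi-head attention would, which is where most of the efficiency gain in the config comes from.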

🎯 Features

  • ✅ Training from scratch - No pre-trained model
  • ✅ Free-tier friendly - Optimized for the 15GB T4
  • ✅ Grouped Query Attention (GQA) - Improved efficiency
  • ✅ Automatic checkpointing - Saves every 100 iterations
  • ✅ Mixed precision training - Efficient memory usage
  • ✅ PyTorch 2.0 compile - Performance optimizations
  • ✅ Batch size - 9 on the T4
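Grouped Query Attention keeps the full set of query heads but lets groups of them share a smaller set of key/value heads, shrinking the K/V projections and cache. A minimal single-sequence NumPy sketch of the idea (causal mask and batching omitted for brevity; the real implementation lives in model3.py):

```python
import numpy as np

def gqa_attention(q, k, v, n_head=16, n_kv_head=4):
    """GQA sketch: each group of query heads shares one K/V head."""
    T = q.shape[0]
    d = q.shape[1] // n_head
    group = n_head // n_kv_head                      # 4 query heads per K/V head
    q = q.reshape(T, n_head, d).transpose(1, 0, 2)   # (n_head, T, d)
    k = k.reshape(T, n_kv_head, d).transpose(1, 0, 2)
    v = v.reshape(T, n_kv_head, d).transpose(1, 0, 2)
    k = np.repeat(k, group, axis=0)                  # expand K/V to all heads
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)   # (n_head, T, T)
    scores -= scores.max(axis=-1, keepdims=True)     # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    out = attn @ v                                   # (n_head, T, d)
    return out.transpose(1, 0, 2).reshape(T, n_head * d)

T, d = 8, 64
out = gqa_attention(np.random.randn(T, 16 * d),
                    np.random.randn(T, 4 * d),
                    np.random.randn(T, 4 * d))
# out has shape (T, n_embd) == (8, 1024)
```

Note the inputs: queries carry all 16 heads, but K and V carry only 4, so the K/V cache during generation is 4x smaller than standard multi-head attention.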

📈 Training Progress

Current status: ~38,500 iterations

Iteration: 38543
Current loss: 2.0038
Learning rate: 0.0003

How to Run the Chat

# Start the chat server
python chat.py

Then open http://127.0.0.1:8088 in your browser.

There are three prompt formats:

  • <|query|><|answer|> - The traditional instruct format: the user makes a request and expects a response.
  • <|query|><|hole|><|answer|> - The thinking mode: the model reasons first, then generates a response (the <|hole|> token is reused because the vocabulary was created before the thinking-style generation method emerged).
  • Auto-Complete - The old way of completing the provided text with model-generated text.
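The three formats above can be assembled with a small helper. A sketch, assuming the special tokens are concatenated directly against the text (the exact template used in chat.py is an assumption):

```python
def build_prompt(user_text, mode="instruct"):
    """Assemble the model's expected prompt for each of the three formats."""
    if mode == "instruct":
        # Model generates the answer right after <|answer|>
        return f"<|query|>{user_text}<|answer|>"
    if mode == "thinking":
        # Model fills in its reasoning after <|hole|>, then emits <|answer|>
        return f"<|query|>{user_text}<|hole|>"
    if mode == "autocomplete":
        return user_text  # plain continuation of the provided text
    raise ValueError(f"unknown mode: {mode}")

print(build_prompt("Qual a capital do Brasil?", "thinking"))
# → <|query|>Qual a capital do Brasil?<|hole|>
```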

Screenshot

Chat page

Project Structure

.
├── chat.py                                    # Main Python/Flask script to chat with the model
├── model3.py                                  # The model itself
├── chekpoint/ckpt.pt                          # Current checkpoint
├── vocab/byte-level-bpe.tokenizer.32k.json    # 32k-token vocabulary
├── templates/chat.html                        # Flask template for the page
└── README.md

🎓 Learnings

This project demonstrates:

  • How to train language models on limited hardware
  • Memory optimization techniques for small GPUs
  • Implementation of GQA (Grouped Query Attention)
  • Effective use of PyTorch 2.0 compilation
  • Checkpointing strategies for free sessions
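One detail that matters on free sessions: a disconnect in the middle of a save can corrupt the only checkpoint copy. A sketch of a save-every-100-iterations helper that writes atomically (the real project presumably uses torch.save; pickle stands in here so the example is self-contained):

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path="chekpoint/ckpt.pt", every=100):
    """Save every `every` iterations, writing via a temp file + rename."""
    if state["iter_num"] % every != 0:
        return False
    d = os.path.dirname(path) or "."
    os.makedirs(d, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=d)  # temp file on the same filesystem
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename: old checkpoint survives a crash mid-write
    return True

def load_checkpoint(path="chekpoint/ckpt.pt"):
    with open(path, "rb") as f:
        return pickle.load(f)
```

Because `os.replace` is atomic on a single filesystem, an interrupted session leaves either the old checkpoint or the new one in place, never a half-written file.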

⚠️ Limitations

  • Session time: Colab Free has a 12h limit
  • Disconnections: Requires frequent checkpointing
  • Small batch size: Limited by T4 memory
  • Speed: ~40x slower than A100/H100 GPUs

Occasionally I pay for an A100, which gives training a good boost; so far I've only done that twice for this model. Almost religiously, I start a training run on a free Colab T4 every day, averaging about 150 iterations per day.

Dataset

Most of the datasets are in Portuguese.

  • I've been scraping sites for many years, so there is content spanning several decades; unfortunately this dated the vocabulary (Dilma appears in it, for example, even though she has long since faded from the news).
  • Random books found on Google Drives and Torrents.
  • Scientific articles
  • Reddit, early on, but its share was reduced over time because it 'dirtied' the model with profanity, ideology and politics.
  • Blogspot and WordPress blogs, mostly in Portuguese; these blogs are largely religious, women's-interest and left-wing content.
  • Wikipedia dump in Portuguese around 2015.
  • Random PDFs from niches found on the internet such as legal documents, open magazines, study guides, manuals, civil service exam prep...
  • CulturaX
  • Instructs like Alpaca, OpenAssistant, Cabrita, gsm8k, thinking
  • Many instruct datasets that I generate automatically using ChatGPT, Gemini and others; in fact, I run one of these generators every day.
  • Strong legal and humanities content overall.

No, I will not provide the datasets.

Safety

This is not a safe model! I am not a company with the infrastructure to clean training text, and the model is small, so it can inevitably generate discriminatory, offensive, sexual and horror content. However, thanks to the strongly safety-oriented instruct data, both externally sourced and written by me, the model has shown awareness around self-harm and dangerous generalizations.

🤝 Contributions

Contributions are welcome! Especially:

  • Memory optimizations
  • Data augmentation techniques
  • Code improvements
  • Documentation

📜 License

MIT License - Feel free to use in your projects!

🙏 Acknowledgments

  • Google Colab for free infrastructure
  • PyTorch community
  • Reference papers: GPT-2, GPT-3, LLaMA

⚡ Trained with determination and free GPUs! ⚡

