---
language:
- tl
datasets:
- MaAIos/culturax-filipino-subset
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- henyo
license: mit
---

# Henyo-153M-CulturaX

**Henyo** is a 153M-parameter Tagalog language model trained on the `MaAIos/culturax-filipino-subset` dataset. It uses a custom, efficiency-oriented architecture inspired by Llama 2/3 and PaLM.

## Architecture Details

This model uses a custom decoder-only Transformer architecture built from scratch in PyTorch.

| Hyperparameter | Value |
| :--- | :--- |
| **Parameters** | ~153M |
| **Context Window** | 1024 tokens |
| **Embedding Dim** | 768 |
| **Layers (Depth)** | 12 |
| **Attention Heads** | 12 |
| **KV Heads (GQA)** | 4 |
| **Vocab Size** | 50,257 (GPT-2 tokenizer) |
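
For orientation, the table above corresponds to a configuration object along these lines. This is a minimal sketch: the class and field names are illustrative, not taken from `train_henyo.py`.

```python
from dataclasses import dataclass

@dataclass
class HenyoConfig:
    # Illustrative field names; see train_henyo.py for the actual config.
    vocab_size: int = 50257    # GPT-2 tokenizer vocabulary
    max_seq_len: int = 1024    # context window
    d_model: int = 768         # embedding dimension
    n_layers: int = 12         # transformer depth
    n_heads: int = 12          # query heads
    n_kv_heads: int = 4        # shared KV heads (GQA, 3:1 ratio)
```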

### Key Features

1. **SwiGLU Activation**: Gated (SiLU) linear-unit activation in the feed-forward blocks.
2. **Grouped Query Attention (GQA)**: 12 query heads share 4 KV heads (a 3:1 ratio) for more efficient inference; see the sketch below.
3. **Rotary Positional Embeddings (RoPE)**: Better generalization across sequence lengths.
4. **RMSNorm**: Pre-normalization for training stability.
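
To make the head-sharing and the gated FFN concrete, here is a minimal PyTorch sketch under the hyperparameters above. It is illustrative only: RoPE and the KV cache are omitted, and the FFN hidden size (2048) is an assumed value, not read from the training script.

```python
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """12 query heads share 4 KV heads (3:1). RoPE/KV cache omitted for brevity."""
    def __init__(self, d_model=768, n_heads=12, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads  # 64
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each KV head serves n_heads // n_kv_heads = 3 query heads.
        k = k.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        v = v.repeat_interleave(self.n_heads // self.n_kv_heads, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))

class SwiGLU(nn.Module):
    """swish(x @ W_gate) * (x @ W_up), followed by a down-projection."""
    def __init__(self, d_model=768, hidden=2048):  # hidden size is assumed
        super().__init__()
        self.gate = nn.Linear(d_model, hidden, bias=False)
        self.up = nn.Linear(d_model, hidden, bias=False)
        self.down = nn.Linear(hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```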

## Training Configuration

- **Dataset**: [MaAIos/culturax-filipino-subset](https://huggingface.co/datasets/MaAIos/culturax-filipino-subset)
- **Mode**: Streaming (iterable dataset)
- **Optimizer**: AdamW
- **Scheduler**: Cosine decay
- **Gradient Accumulation**: 8 steps (effective batch size ~32); see the loop sketch below
- **Precision**: Mixed precision (FP16)
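
The pieces above combine roughly as in the loop below. This is a sketch, not the published recipe: the learning rate, weight decay, and total step count are assumptions, since they are not listed in this card.

```python
import torch
import torch.nn.functional as F
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, loader, total_steps, accum_steps=8):
    # lr and weight_decay are assumed values, not taken from train_henyo.py.
    optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)  # cosine decay
    scaler = torch.cuda.amp.GradScaler()  # FP16 mixed precision

    for step, (inputs, targets) in enumerate(loader):
        with torch.cuda.amp.autocast(dtype=torch.float16):
            logits = model(inputs)
            # Scale the loss so gradients average over the accumulation window.
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            ) / accum_steps
        scaler.scale(loss).backward()
        if (step + 1) % accum_steps == 0:  # effective batch = 8 x micro-batch
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad(set_to_none=True)
            scheduler.step()
```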

## Usage

Since this model uses a custom architecture, you must include the class definitions (provided in the `train_henyo.py` file in this repo) or use the inference script below.

```python
# See inference_henyo.py in this repo for the full class definitions.
from transformers import AutoTokenizer

model_id = "marcuscedricridia/Henyo-153M-CulturaX"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# Load the model with the custom class wrapper...
```
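
A fuller loading path might look like the sketch below. The `Henyo` class name, the checkpoint filename, the constructor signature, and the `generate` method are all assumptions for illustration; check `inference_henyo.py` for the real ones.

```python
# Hypothetical sketch: the Henyo class, checkpoint filename, and generate()
# method are assumptions; consult inference_henyo.py for the real names.
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

from inference_henyo import Henyo  # assumed class name

model_id = "marcuscedricridia/Henyo-153M-CulturaX"
tokenizer = AutoTokenizer.from_pretrained(model_id)

ckpt = hf_hub_download(model_id, "pytorch_model.bin")  # assumed filename
model = Henyo()                                        # assumed signature
model.load_state_dict(torch.load(ckpt, map_location="cpu"))
model.eval()

ids = tokenizer("Ang Pilipinas ay", return_tensors="pt").input_ids
with torch.no_grad():
    out = model.generate(ids, max_new_tokens=50)       # assumed method
print(tokenizer.decode(out[0]))
```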

### Reproducibility

The full training script (`train_henyo.py`) is included in this repository's file listing.