---
language:
- tl
datasets:
- MaAIos/culturax-filipino-subset
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- henyo
license: mit
---

# Henyo-153M-CulturaX

**Henyo** is a 153M-parameter Tagalog language model trained on the `MaAIos/culturax-filipino-subset` dataset. It uses a custom, efficiency-oriented architecture heavily inspired by Llama 2/3 and PaLM.

## Architecture Details

This model uses a custom decoder-only Transformer architecture built from scratch in PyTorch.

| Hyperparameter | Value |
| :--- | :--- |
| **Parameters** | ~153M |
| **Context Window** | 1024 tokens |
| **Embedding Dim** | 768 |
| **Layers (Depth)** | 12 |
| **Attention Heads** | 12 |
| **KV Heads (GQA)** | 4 |
| **Vocab Size** | 50,257 (GPT-2 tokenizer) |

### Key Features

1. **SwiGLU Activation**: SiLU-gated linear unit activation in the feed-forward layers.
2. **Grouped Query Attention (GQA)**: 12 query heads share 4 KV heads (3:1 ratio) for efficient inference.
3. **Rotary Positional Embeddings (RoPE)**: Positions are encoded by rotating query/key vectors, for better generalization across sequence lengths.
4. **RMSNorm**: Pre-normalization for training stability.

A minimal PyTorch sketch of one decoder block with these components is given under *Illustrative Sketches* at the end of this card.

## Training Configuration

- **Dataset**: [MaAIos/culturax-filipino-subset](https://huggingface.co/datasets/MaAIos/culturax-filipino-subset)
- **Mode**: Streaming (iterable dataset)
- **Optimizer**: AdamW
- **Scheduler**: Cosine decay
- **Gradient Accumulation**: 8 steps (effective batch size ~32)
- **Precision**: Mixed precision (FP16)

An illustrative sketch of this optimization setup is also included under *Illustrative Sketches* below.

## Usage

Since this model uses a custom architecture, you must include the class definitions (provided in `train_henyo.py` in this repo) or use the inference script (`inference_henyo.py`); the snippet below shows the starting point.

```python
# See inference_henyo.py in this repo for the full class definitions.
from transformers import AutoTokenizer

model_id = "marcuscedricridia/Henyo-153M-CulturaX"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model using the custom class wrapper...
```

An illustrative greedy-decoding example appears under *Illustrative Sketches* below.

### Reproducibility

The full training script (`train_henyo.py`) is included in the file listing of this repository.
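## Illustrative Sketches

The snippets below are minimal, non-authoritative sketches of the components and procedures described above. They use only public PyTorch and Hugging Face `transformers` APIs; class names, hidden sizes, learning rates, and data handling are illustrative assumptions, not code taken from `train_henyo.py` or `inference_henyo.py`.

### Decoder block (RMSNorm, GQA, RoPE, SwiGLU)

A single pre-norm decoder block with the hyperparameters from the table above (embedding dim 768, 12 query heads, 4 KV heads, head dim 64). The SwiGLU hidden size of 2048 is an assumption; the actual value in `train_henyo.py` may differ.

```python
# Hypothetical, simplified sketch -- NOT the exact classes from train_henyo.py.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # Normalize by the root-mean-square of the activations (no mean subtraction).
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight


def rope(x, base: float = 10000.0):
    # x: (batch, heads, seq, head_dim). Rotate channel pairs by position-dependent angles.
    _, _, t, d = x.shape
    half = d // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device).float() / half))
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]  # (t, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GQAttention(nn.Module):
    """Grouped Query Attention: 12 query heads share 4 KV heads (3:1)."""

    def __init__(self, dim=768, n_heads=12, n_kv_heads=4):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        q, k = rope(q), rope(k)
        # Repeat each KV head 3x so every query head has a matching KV head.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))


class SwiGLU(nn.Module):
    def __init__(self, dim=768, hidden=2048):  # hidden size is an assumed value
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # SiLU-gated feed-forward: silu(gate(x)) * up(x), projected back down.
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Block(nn.Module):
    """Pre-norm decoder block: RMSNorm -> GQA -> residual, RMSNorm -> SwiGLU -> residual."""

    def __init__(self, dim=768):
        super().__init__()
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.attn, self.ffn = GQAttention(dim), SwiGLU(dim)

    def forward(self, x):
        x = x + self.attn(self.attn_norm(x))
        x = x + self.ffn(self.ffn_norm(x))
        return x
```

Stacking 12 such blocks between a 50,257-token embedding and an output projection lands in the neighborhood of the ~153M parameters listed above.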
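### Training loop (AdamW, cosine decay, accumulation, FP16)

A hypothetical sketch of the optimization setup described under Training Configuration. `TinyLM`, the learning rate, `T_max`, and the synthetic data stream are placeholders rather than values from `train_henyo.py`; a CUDA device is assumed for FP16 autocast.

```python
# Reuses Block and RMSNorm from the decoder-block sketch above.
import torch
import torch.nn.functional as F


class TinyLM(torch.nn.Module):
    """Stand-in LM: embedding -> one decoder block -> RMSNorm -> vocabulary logits."""

    def __init__(self, vocab=50257, dim=768):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab, dim)
        self.block = Block(dim)
        self.norm = RMSNorm(dim)
        self.lm_head = torch.nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):
        return self.lm_head(self.norm(self.block(self.embed(idx))))


def fake_stream(num_batches=64, batch_size=4, seq_len=256, vocab=50257):
    # Stand-in for the streaming CulturaX iterator used in training.
    for _ in range(num_batches):
        tokens = torch.randint(0, vocab, (batch_size, seq_len + 1))
        yield tokens[:, :-1], tokens[:, 1:]


device = "cuda"
model = TinyLM().to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1_000)
scaler = torch.cuda.amp.GradScaler()
accum_steps = 8  # 8 micro-batches of 4 per optimizer step -> effective batch ~32

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(fake_stream()):
    inputs, targets = inputs.to(device), targets.to(device)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.reshape(-1))
    # Divide by accum_steps so gradients average over the accumulation window.
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()
```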
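### Greedy generation

The usage snippet above leaves the model-loading step open because it depends on the custom classes in `inference_henyo.py`. Purely to illustrate what generation looks like once a model object exists, here is a minimal greedy-decoding loop; the untrained `TinyLM` stand-in from the training sketch is used in place of the real checkpoint, so the output is gibberish until real weights are loaded.

```python
import torch
from transformers import AutoTokenizer

model_id = "marcuscedricridia/Henyo-153M-CulturaX"
tokenizer = AutoTokenizer.from_pretrained(model_id)

model = TinyLM().eval()  # replace with the actual Henyo model once loaded
ids = tokenizer("Ang Pilipinas ay", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(30):  # generate 30 tokens greedily
        logits = model(ids)                              # (batch, seq, vocab)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```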