|
|
--- |
|
|
license: apache-2.0 |
|
|
language: |
|
|
- id |
|
|
- en |
|
|
tags: |
|
|
- text-generation |
|
|
- pytorch |
|
|
- causal-lm |
|
|
- transformer |
|
|
- untrained |
|
|
- gqa |
|
|
- rope |
|
|
- swiglu |
|
|
- rmsnorm |
|
|
- flash-attention |
|
|
- indonesian |
|
|
- bilingual |
|
|
library_name: transformers |
|
|
pipeline_tag: text-generation |
|
|
widget: |
|
|
- text: "Jakarta adalah ibu kota" |
|
|
example_title: "๐ฎ๐ฉ Pelengkapan Teks (ID)" |
|
|
- text: | |
|
|
Pertanyaan: Apa itu kecerdasan buatan? |
|
|
Jawaban: |
|
|
example_title: "๐ฎ๐ฉ Tanya Jawab (ID)" |
|
|
- text: | |
|
|
Tulis cerita pendek tentang robot yang belajar mencintai. |
|
|
example_title: "๐ฎ๐ฉ Penulisan Kreatif (ID)" |
|
|
- text: "The capital of Indonesia is" |
|
|
example_title: "๐ฌ๐ง Text Completion (EN)" |
|
|
- text: | |
|
|
Question: What is artificial intelligence? |
|
|
Answer: |
|
|
example_title: "๐ฌ๐ง Question Answering (EN)" |
|
|
- text: | |
|
|
def fibonacci(n): |
|
|
"""Hitung bilangan fibonacci ke-n""" |
|
|
example_title: "๐ป Pelengkapan Kode" |
|
|
- text: | |
|
|
# Fungsi untuk mengurutkan array |
|
|
def sort_array(arr): |
|
|
example_title: "๐ป Generasi Kode" |
|
|
- text: | |
|
|
User: Halo! Siapa kamu? |
|
|
Assistant: |
|
|
example_title: "๐ฌ Format Chat (ID)" |
|
|
- text: | |
|
|
User: Jelaskan tentang machine learning dalam 2 kalimat. |
|
|
Assistant: |
|
|
example_title: "๐ฌ Conversational (ID)" |
|
|
inference: |
|
|
parameters: |
|
|
max_new_tokens: 100 |
|
|
temperature: 0.7 |
|
|
top_p: 0.9 |
|
|
top_k: 50 |
|
|
do_sample: true |
|
|
repetition_penalty: 1.1 |
|
|
num_beams: 1 |
|
|
datasets: [] |
|
|
metrics: |
|
|
- perplexity |
|
|
- accuracy |
|
|
model-index: |
|
|
- name: caca-1M |
|
|
results: [] |
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-1M"/> |
|
|
|
|
|
# ๐ค caca-1M |
|
|
|
|
|
### Arsitektur Transformer Modern dengan Fitur Canggih |
|
|
|
|
|
[](https://opensource.org/licenses/Apache-2.0) |
|
|
[](https://www.python.org/downloads/) |
|
|
[](https://pytorch.org/) |
|
|
[](https://github.com/huggingface/transformers) |
|
|
[]() |
|
|
[]() |
|
|
[]() |
|
|
|
|
|
**3,524,608** parameters โข **3.52M** โข **6 layers** โข **1,024 tokens** |
|
|
|
|
|
[๐ Documentation](#-dokumentasi) โข [๐ป Usage](#-cara-penggunaan) โข [โ๏ธ Configuration](#๏ธ-konfigurasi-detail) โข [๐ฌ Architecture](#-arsitektur) |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## โ ๏ธ PENTING: Model Belum Dilatih (Untrained) |
|
|
|
|
|
<div style="background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin: 16px 0;"> |
|
|
<strong>โ ๏ธ PERHATIAN</strong>: Ini adalah model yang <strong>belum melalui proses training</strong>. Bobot model masih dalam kondisi <strong>random initialization</strong>. Output yang dihasilkan akan <strong>tidak bermakna dan acak</strong>. |
|
|
</div> |
|
|
|
|
|
**Status Model:** |
|
|
- ๐ด **Belum dilatih** - Bobot masih random (Kaiming/Xavier init) |
|
|
- ๐ก **Untuk riset & eksperimen** - Arsitektur sudah siap, tinggal train |
|
|
- ๐ข **Production-ready architecture** - Teruji dan optimal |
|
|
|
|
|
Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas tinggi. |
|
|
|
|
|
### ๐ฏ Apa yang Bisa Dilakukan? |
|
|
|
|
|
| โ
Bisa | โ Belum Bisa | |
|
|
|---------|----------------| |
|
|
| Load model architecture | Generate teks bermakna | |
|
|
| Test forward pass | Menjawab pertanyaan | |
|
|
| Measure memory & speed | Reasoning & understanding | |
|
|
| Start training | Production deployment | |
|
|
| Fine-tuning experiments | Real-world applications | |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Deskripsi |
|
|
|
|
|
**CACA** (Collaborative Architecture for Contextual AI) adalah arsitektur Large Language Model (LLM) yang menggabungkan **best practices** dari berbagai model State-of-the-Art (SOTA) seperti **LLaMA**, **GPT-4**, **Gemini**, **Qwen**, dan **Gemma**. |
|
|
|
|
|
Model ini dirancang dengan fokus pada **efisiensi komputasi**, **skalabilitas**, dan **performa tinggi** โ menjadikannya **modular**, **production-ready**, dan mendukung **multimodal** (teks, gambar, audio). |
|
|
|
|
|
<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; background: #f8f9fa; padding: 12px;"> |
|
|
<p><strong>๐ Tentang Project Caca</strong></p> |
|
|
<p><em>Caca</em> adalah eksperimen open-source Indonesian LLM yang dibuat dari nol secara individual dan bertahap. Bukan kompetitor siapa-siapa, cuma pengen eksplorasi apa yang bisa dilakukan dengan budget terbatas, passion unlimited, dan mindset collaborative.</p> |
|
|
<p>Kalau berguna buat orang lain, alhamdulillah. Kalau enggak, ya tetap fun kok. Ini proyek eksplorasi, jadi kalau gagal ya bagian dari proses belajar. Kalau berhasil, itu bonus.</p> |
|
|
<p>โ <strong>Lyon</strong>, Creator</p> |
|
|
</blockquote> |
|
|
|
|
|
### โจ **Highlights** |
|
|
|
|
|
- ๐ง **Hybrid Architecture** โ Kombinasi teknik terbaik dari 5+ model SOTA |
|
|
- ๐ญ **Multimodal Native** โ Support teks, gambar, dan audio dalam satu model |
|
|
- โก **High Performance** โ Flash Attention, MoE, dan optimasi modern |
|
|
- ๐ **Indonesian-First** โ Dikembangkan dengan fokus pada Bahasa Indonesia |
|
|
- ๐ **Open Source** โ Transparent, reproducible, collaborative |
|
|
|
|
|
### ๐ Mengapa Caca? |
|
|
|
|
|
1. **๐ฎ๐ฉ Fokus pada Bahasa Indonesia** - Dirancang dengan mempertimbangkan karakteristik bahasa Indonesia |
|
|
2. **โก Efisiensi Tinggi** - GQA & Flash Attention untuk inferensi 3-5x lebih cepat |
|
|
3. **๐พ Memory Efficient** - Hemat 50% memory untuk KV cache |
|
|
4. **๐ง Modular & Extensible** - Mudah dikustomisasi untuk berbagai use case |
|
|
5. **๐ Bilingual** - Support optimal untuk Indonesia & English |
|
|
|
|
|
**CACA** hadir dengan filosofi berbeda: |
|
|
- โ
**Fully open-source** โ dari architecture sampai training code |
|
|
- โ
**Modular & scalable** โ bisa disesuaikan dari 1B sampai 70B+ parameters |
|
|
- โ
**Resource-efficient** โ optimized untuk budget terbatas |
|
|
- โ
**Indonesian-centric** โ prioritas pada Bahasa Indonesia |
|
|
- โ
**Community-driven** โ open for contributions & collaborations |
|
|
|
|
|
## ๐ Perbandingan dengan Model Lain |
|
|
|
|
|
| Fitur | LLaMA | GPT-4 | Gemini | Qwen | CACA | |
|
|
|-------|-------|-------|--------|------|------| |
|
|
| **RMSNorm** | โ
| โ | โ | โ
| โ
| |
|
|
| **RoPE** | โ
| โ | โ | โ
| โ
| |
|
|
| **GQA** | โ
| โ | โ | โ
| โ
| |
|
|
| **MoE** | โ | โ
| โ
| โ | โ
| |
|
|
| **Multimodal** | โ | โ
| โ
| โ
| โ
| |
|
|
| **Flash Attention** | โ
| โ
| โ
| โ
| โ
| |
|
|
| **Sliding Window** | โ | โ | โ | โ
| โ
| |
|
|
| **Attention Sinks** | โ | โ | โ | โ | โ
| |
|
|
| **MoD** | โ | โ | โ | โ | โ
| |
|
|
| **Expert Choice** | โ | โ | โ | โ | โ
| |
|
|
| **YARN Scaling** | โ | โ | โ | โ
| โ
| |
|
|
| **Quantization** | โ
| โ | โ | โ
| โ
| |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ฏ Use Cases & Applications |
|
|
|
|
|
### โ
Cocok Untuk |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<td width="50%"> |
|
|
|
|
|
**๐ฌ Research & Development** |
|
|
- Eksperimen arsitektur transformer |
|
|
- Ablation studies |
|
|
- Novel training techniques |
|
|
- Architecture search |
|
|
|
|
|
**๐ Academic & Education** |
|
|
- Thesis & research papers |
|
|
- Teaching materials |
|
|
- Student projects |
|
|
- LLM internals understanding |
|
|
|
|
|
</td> |
|
|
<td width="50%"> |
|
|
|
|
|
**๐ Base Model for Fine-tuning** |
|
|
- Task-specific models |
|
|
- Domain adaptation |
|
|
- Instruction tuning |
|
|
- RLHF experiments |
|
|
|
|
|
**๐ก Prototyping** |
|
|
- Proof of concept |
|
|
- Feature testing |
|
|
- A/B testing architectures |
|
|
- Benchmark comparisons |
|
|
|
|
|
</td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
### โ Tidak Cocok Untuk |
|
|
|
|
|
<div style="background: #ffe6e6; border-left: 4px solid #ff4444; padding: 12px; margin: 16px 0;"> |
|
|
|
|
|
- ๐ซ **Production Applications** - Model belum dilatih, output random |
|
|
- ๐ซ **Real-world Deployment** - Perlu training & safety alignment dulu |
|
|
- ๐ซ **Safety-critical Systems** - Tidak ada safety guardrails |
|
|
- ๐ซ **Direct User-facing Apps** - Output tidak dapat diprediksi |
|
|
- ๐ซ **Commercial Use (as-is)** - Harus dilatih terlebih dahulu |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Spesifikasi Model |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<td><strong>Parameter</strong></td> |
|
|
<td><strong>Value</strong></td> |
|
|
<td><strong>Parameter</strong></td> |
|
|
<td><strong>Value</strong></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Total Parameters</td> |
|
|
<td><code>3,524,608</code></td> |
|
|
<td>Vocab Size</td> |
|
|
<td><code>8,000</code></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Hidden Size</td> |
|
|
<td><code>128</code></td> |
|
|
<td>Intermediate Size</td> |
|
|
<td><code>512</code></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Num Layers</td> |
|
|
<td><code>6</code></td> |
|
|
<td>Attention Heads</td> |
|
|
<td><code>4</code></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>KV Heads (GQA)</td> |
|
|
<td><code>2</code></td> |
|
|
<td>Head Dimension</td> |
|
|
<td><code>32</code></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Max Context Length</td> |
|
|
<td><code>1,024</code></td> |
|
|
<td>RoPE Base (ฮธ)</td> |
|
|
<td><code>10,000</code></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td>Model Size (FP16)</td> |
|
|
<td><code>0.01 GB</code></td> |
|
|
<td>Formatted Size</td> |
|
|
<td><code>3.52M</code></td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ฏ Core Features |
|
|
|
|
|
<details open> |
|
|
<summary><b>๐ Klik untuk expand/collapse</b></summary> |
|
|
|
|
|
- โ
**Grouped Query Attention (GQA)** - Efisiensi memori dan komputasi superior |
|
|
- Query heads: **4** |
|
|
- KV heads: **2** |
|
|
- Ratio: **2:1** (hemat ~50% memory KV cache) |
|
|
- **Benefit**: Inferensi lebih cepat dengan memory footprint lebih kecil |
|
|
|
|
|
- โ
**Rotary Position Embeddings (RoPE)** - Generalisasi konteks panjang lebih baik |
|
|
- Theta (ฮธ): **10,000** |
|
|
- Support extrapolation untuk konteks > training length |
|
|
- **Benefit**: Performa stabil pada sequence length yang belum pernah dilihat saat training |
|
|
|
|
|
- โ
**RMSNorm** - Normalisasi lebih stabil dan ~50% lebih cepat dari LayerNorm |
|
|
- Epsilon: **1e-06** |
|
|
- **Benefit**: Training lebih stabil, inference lebih cepat, gradient flow lebih baik |
|
|
|
|
|
- โ
**SwiGLU Activation** - Performa 10-15% lebih baik dari ReLU/GELU |
|
|
- Intermediate size: **512** (4.0x hidden) |
|
|
- **Benefit**: Kapasitas model lebih besar tanpa menambah parameter signifikan |
|
|
|
|
|
- โ
**Flash Attention 2** - Akselerasi hingga 3x dengan memory efficiency |
|
|
- Otomatis aktif jika tersedia CUDA device |
|
|
- IO-aware algorithm untuk minimal HBM access |
|
|
- **Benefit**: Training & inference jauh lebih cepat, support batch size lebih besar |
|
|
|
|
|
- โ
**Hybrid Architecture** - Kombinasi teknik terbaik dari 5+ model SOTA |
|
|
- โ
**Multimodal Support** - Native support untuk Vision dan Audio |
|
|
- โ
**Mixture of Experts (MoE)** - Sparse activation untuk efisiensi |
|
|
- โ
**Long Context** - Support hingga 8K+ tokens dengan YARN scaling |
|
|
- โ
**Advanced Attention** - Flash Attention, Sliding Window, Attention Sinks |
|
|
- โ
**Quantization Ready** - Support 4-bit dan 8-bit quantization |
|
|
- โ
**Production Features** - Extensive error handling & monitoring |
|
|
|
|
|
</details> |
|
|
|
|
|
### ๐ฅ Advanced Features |
|
|
|
|
|
### ๐ฏ Mekanisme Attention |
|
|
|
|
|
- โก **Flash Attention v2** - Algoritma IO-aware yang 3x lebih cepat dari attention standar |
|
|
- ๐ **Grouped Query Attention (GQA)** - 4 Query heads : 2 KV heads |
|
|
- Rasio kompresi: **2:1** (hemat ~50% memory KV cache) |
|
|
- ๐ **xFormers Support** - Fallback memory-efficient attention |
|
|
- ๐ฏ **PyTorch SDPA** - Native scaled dot product attention |
|
|
|
|
|
### ๐ Position Encodings |
|
|
|
|
|
- ๐ **RoPE (Rotary Position Embeddings)** - Base frequency ฮธ=10,000 |
|
|
- Generalisasi lebih baik untuk sequence panjang dibanding absolute PE |
|
|
|
|
|
### ๐ Optimisasi Training |
|
|
|
|
|
- ๐พ **Gradient Checkpointing** - Trade compute for memory (support model hingga 100B+ params) |
|
|
- ๐ฏ **Mixed Precision Training** - Support FP16, BF16, dan TF32 |
|
|
- ๐ **Dropout Regularization** |
|
|
- Hidden dropout: 0.1 |
|
|
- Attention dropout: 0.0 |
|
|
- Residual dropout: 0.1 |
|
|
|
|
|
### ๐ฆ Dukungan Quantization |
|
|
|
|
|
- 4๏ธโฃ **4-bit Quantization** - NF4 & FP4 via bitsandbytes |
|
|
- Memory reduction: ~**75%** (4GB โ 1GB) |
|
|
- Accuracy loss: <2% pada kebanyakan tasks |
|
|
- Support double quantization untuk kompresi maksimal |
|
|
- 8๏ธโฃ **8-bit Quantization** - LLM.int8() dengan outlier handling |
|
|
- Memory reduction: ~**50%** (4GB โ 2GB) |
|
|
- Accuracy loss: <1% |
|
|
- ๐ **Dynamic Quantization** - Runtime quantization tanpa calibration |
|
|
|
|
|
### ๐ฌ Advanced Features |
|
|
|
|
|
- ๐ **Automatic Mixed Precision (AMP)** - Dynamic loss scaling |
|
|
- ๐ฏ **Gradient Clipping** - Stabilitas training dengan max norm clipping |
|
|
- ๐ **Learning Rate Scheduling** - Support cosine, linear, warmup |
|
|
- ๐ก **Smart Memory Management** - Auto cache clearing & monitoring |
|
|
- ๐ **Metrics Tracking** - Real-time perplexity, loss, gradient norms |
|
|
- ๐ก๏ธ **NaN/Inf Detection** - Automatic recovery dari numerical instability |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐งฉ Komponen Arsitektur |
|
|
|
|
|
### 1๏ธโฃ **Dari LLaMA (Meta)** |
|
|
|
|
|
CACA mengadopsi komponen efisien dari LLaMA untuk performa optimal: |
|
|
|
|
|
```python |
|
|
โ RMSNorm # Normalisasi lebih efisien dari LayerNorm |
|
|
โ Rotary Position Embeddings # Positional encoding yang lebih baik |
|
|
โ SwiGLU Activation # Activation function dengan gating mechanism |
|
|
โ Grouped-Query Attention # Hemat memory dengan shared K/V heads |
|
|
โ Pre-normalization # Stabilitas training yang lebih baik |
|
|
``` |
|
|
|
|
|
- RMSNorm **30% lebih cepat** dari LayerNorm |
|
|
- RoPE membuat model bisa **extrapolate ke context lebih panjang** |
|
|
- GQA **hemat 30-40% memory** dibanding Multi-Head Attention |
|
|
- SwiGLU **meningkatkan performa 3-5%** dibanding ReLU/GELU |
|
|
|
|
|
--- |
|
|
|
|
|
### 2๏ธโฃ **Dari GPT-4 (OpenAI)** |
|
|
|
|
|
Implementasi Mixture of Experts untuk skalabilitas: |
|
|
|
|
|
```python |
|
|
โ Mixture of Experts (MoE) # Sparse activation dengan multiple expert networks |
|
|
โ Top-K Router # Routing token ke K expert terbaik |
|
|
โ Auxiliary Loss # Load balancing antar experts |
|
|
โ Z-Loss # Stabilisasi router logits |
|
|
โ Expert Usage Tracking # Monitoring penggunaan setiap expert |
|
|
``` |
|
|
|
|
|
``` |
|
|
Input Token |
|
|
โ |
|
|
[Router] โ Pilih Top-K Experts (misal K=2 dari 8 experts) |
|
|
โ |
|
|
Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4) |
|
|
โ |
|
|
Weighted Sum Output |
|
|
``` |
|
|
|
|
|
**Keuntungan:** |
|
|
- Model bisa **10x lebih besar** dengan compute cost yang sama |
|
|
- Setiap token hanya activate **12.5% parameters** (jika K=2, N=8) |
|
|
- Parallel processing antar experts |
|
|
|
|
|
--- |
|
|
|
|
|
### 3๏ธโฃ **Dari Gemini (Google)** |
|
|
|
|
|
Multimodal native dengan cross-modal fusion: |
|
|
|
|
|
```python |
|
|
โ Vision Encoder (ViT) # Process gambar dengan Vision Transformer |
|
|
โ Audio Encoder (Conv1D + Trans) # Process audio dengan CNN + Transformer |
|
|
โ Cross-Attention Mechanism # Fuse multimodal features |
|
|
โ Multiple Projector Types: |
|
|
- Linear Projector # Simple & cepat |
|
|
- MLP Projector # Non-linear mapping |
|
|
- Perceiver Resampler # Compress dengan latent queries |
|
|
- Q-Former # Query-based projection (BLIP-2 style) |
|
|
โ Logit Soft-Capping # Clip extreme values untuk stabilitas |
|
|
``` |
|
|
|
|
|
**Alur Multimodal:** |
|
|
``` |
|
|
[Image] โ Vision Encoder โ [2D patches โ 1D tokens] |
|
|
โ |
|
|
Projector โ [Hidden dim = text dim] |
|
|
โ |
|
|
[Text] + [Image tokens] โ Cross-Attention โ Fused representation |
|
|
``` |
|
|
|
|
|
**Support format:** |
|
|
- Images: JPEG, PNG (224x224 default) |
|
|
- Audio: Mel-spectrogram (80 bins) |
|
|
|
|
|
--- |
|
|
|
|
|
### 4๏ธโฃ **Dari Qwen (Alibaba)** |
|
|
|
|
|
Long context optimization: |
|
|
|
|
|
```python |
|
|
โ YARN Scaling # Yet Another RoPE extensioN |
|
|
โ Dynamic Position Scaling # Auto-adjust untuk sequence lebih panjang |
|
|
โ Sliding Window Attention # Local attention pattern |
|
|
โ Context Window 8K-128K # Flexible context length |
|
|
``` |
|
|
|
|
|
**YARN vs Standard RoPE:** |
|
|
``` |
|
|
Standard RoPE: [====] 4K context โ [====????] 8K (error naik) |
|
|
YARN: [====] 4K context โ [========] 8K (smooth extrapolation) |
|
|
``` |
|
|
|
|
|
**Sliding Window Mechanism:** |
|
|
``` |
|
|
Token 0: attend ke [0] |
|
|
Token 1: attend ke [0, 1] |
|
|
Token 2: attend ke [0, 1, 2] |
|
|
Token 10: attend ke [0, 6, 7, 8, 9, 10] โ sliding window = 4 |
|
|
(keep attention sink di token 0) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### 5๏ธโฃ **Dari Gemma (Google)** |
|
|
|
|
|
Optimization techniques: |
|
|
|
|
|
```python |
|
|
โ Layer Scale # Learnable scaling per layer |
|
|
โ Stochastic Depth # Random layer dropping saat training |
|
|
โ Normalized Attention # QK normalization untuk stabilitas |
|
|
โ Knowledge Distillation # Transfer knowledge dari model besar |
|
|
``` |
|
|
|
|
|
**Layer Scale formula:** |
|
|
```python |
|
|
output = input + gamma * layer(input) |
|
|
# gamma diinit sangat kecil (1e-5) lalu di-learn |
|
|
``` |
|
|
|
|
|
**Stochastic Depth:** |
|
|
- Training: 20% chance layer di-skip (drop_prob=0.2) |
|
|
- Inference: semua layer aktif |
|
|
- Benefit: **regularization** + **faster training** |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Fitur Eksperimental & Unik |
|
|
|
|
|
### A) **Mixture of Depths (MoD)** |
|
|
|
|
|
Token bisa "skip" layer tertentu untuk efisiensi: |
|
|
|
|
|
```python |
|
|
class MixtureOfDepthsRouter: |
|
|
# Pilih top 50% tokens paling "penting" untuk di-process |
|
|
capacity_factor = 0.5 |
|
|
|
|
|
# Method: learned, random, atau heuristic |
|
|
route_method = "learned" |
|
|
``` |
|
|
|
|
|
**Ilustrasi:** |
|
|
``` |
|
|
Layer 1: [All 100 tokens processed] |
|
|
Layer 2: [Top 50 tokens processed, 50 skipped] โ MoD |
|
|
Layer 3: [All 100 tokens processed] |
|
|
Layer 4: [Top 50 tokens processed, 50 skipped] โ MoD |
|
|
``` |
|
|
|
|
|
**Benefit:** |
|
|
- **30-40% faster inference** dengan minimal accuracy drop |
|
|
- Dynamic computation based on token importance |
|
|
|
|
|
**Paper:** [Mixture-of-Depths (2024)](https://arxiv.org/abs/2404.02258) |
|
|
|
|
|
--- |
|
|
|
|
|
### B) **Attention Sinks** |
|
|
|
|
|
Keep token awal selalu di-attend untuk stabilitas: |
|
|
|
|
|
```python |
|
|
attention_sink_size = 4 # Keep first 4 tokens |
|
|
attention_sink_window = 512 # Sliding window size |
|
|
``` |
|
|
|
|
|
**Attention Pattern:** |
|
|
``` |
|
|
Query Token 1000: |
|
|
โโ Attend to: [0, 1, 2, 3] โ attention sinks (always) |
|
|
โโ Attend to: [488, 489, ..., 1000] โ sliding window |
|
|
``` |
|
|
|
|
|
**Benefit:** |
|
|
- Prevent attention collapse di long sequences |
|
|
- Better streaming generation |
|
|
- Inspired by [StreamingLLM (2023)](https://arxiv.org/abs/2309.17453) |
|
|
|
|
|
--- |
|
|
|
|
|
### C) **Expert Choice Routing** |
|
|
|
|
|
Alternatif dari Top-K routing: |
|
|
|
|
|
```python |
|
|
# Top-K: Token pilih expert |
|
|
Token โ Router โ "Saya mau Expert 2 dan 5" |
|
|
|
|
|
# Expert Choice: Expert pilih token |
|
|
Expert 1 โ "Saya mau process Token 3, 7, 12, ..." |
|
|
Expert 2 โ "Saya mau process Token 1, 5, 9, ..." |
|
|
``` |
|
|
|
|
|
**Keuntungan:** |
|
|
- **Better load balancing** (setiap expert process jumlah token yang sama) |
|
|
- **Lebih stable training** (no expert collapse) |
|
|
- Trade-off: sedikit lebih complex implementasi |
|
|
|
|
|
--- |
|
|
|
|
|
### D) **Multi-Backend Attention** |
|
|
|
|
|
Automatic fallback untuk compatibility: |
|
|
|
|
|
```python |
|
|
if HAS_FLASH_ATTN and device == "cuda": |
|
|
use flash_attn_func() # โ Fastest (2-4x speedup) |
|
|
elif HAS_XFORMERS and device == "cuda": |
|
|
use memory_efficient_attention() # โ Fallback 1 |
|
|
elif HAS_SDPA: |
|
|
use F.scaled_dot_product_attention() # โ Fallback 2 (PyTorch 2.0+) |
|
|
else: |
|
|
use standard_attention() # โ Safe fallback |
|
|
``` |
|
|
|
|
|
**Performa Comparison:** |
|
|
``` |
|
|
Flash Attention: 100ms (baseline) |
|
|
xFormers: 150ms (1.5x slower) |
|
|
SDPA: 180ms (1.8x slower) |
|
|
Standard: 400ms (4x slower) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
|
|
|
## ๐๏ธ CACA Model Family |
|
|
|
|
|
| Model | Parameters | Vocab Size | Hidden Size | Intermediate Size | Layers | Attention Heads | KV Heads | Head Dim | Max Position | |
|
|
|-------|------------|------------|-------------|-------------------|--------|-----------------|----------|----------|--------------| |
|
|
| caca-1M-untrained | 2.50M | 8,000 | 128 | 512 | 6 | 4 | 2 | 32 | 1,024 | |
|
|
| caca-3M-untrained | 6.63M | 12,000 | 192 | 768 | 8 | 6 | 2 | 32 | 2,048 | |
|
|
| caca-4M-untrained | 4.02M | 16,000 | 128 | 512 | 8 | 4 | 2 | 32 | 2,048 | |
|
|
| caca-6M-untrained | 11.96M | 16,000 | 256 | 1024 | 8 | 4 | 2 | 64 | 2,048 | |
|
|
| caca-10M-untrained | 21.25M | 20,000 | 320 | 1280 | 10 | 8 | 2 | 40 | 2,048 | |
|
|
| caca-15M-untrained | 35.18M | 24,000 | 384 | 1536 | 12 | 6 | 2 | 64 | 2,048 | |
|
|
| caca-25M-untrained | 67.57M | 28,000 | 512 | 2048 | 14 | 8 | 2 | 64 | 4,096 | |
|
|
| caca-35M-untrained | 95.42M | 32,000 | 576 | 2304 | 16 | 8 | 2 | 72 | 4,096 | |
|
|
| caca-50M-untrained | 138.47M | 32,000 | 640 | 2560 | 20 | 10 | 2 | 64 | 4,096 | |
|
|
| caca-75M-untrained | 178.55M | 32,000 | 768 | 3072 | 18 | 12 | 3 | 64 | 4,096 | |
|
|
| caca-100M-untrained | 232.23M | 32,000 | 768 | 3072 | 24 | 12 | 4 | 64 | 4,096 | |
|
|
| caca-150M-untrained | 336.90M | 32,000 | 1024 | 4096 | 20 | 16 | 4 | 64 | 4,096 | |
|
|
| caca-200M-untrained | 458.55M | 32,000 | 1024 | 4096 | 28 | 16 | 4 | 64 | 4,096 | |
|
|
| caca-250M-untrained | 569.54M | 32,000 | 1152 | 4608 | 28 | 18 | 3 | 64 | 8,192 | |
|
|
| caca-300M-untrained | 701.64M | 32,000 | 1280 | 5120 | 28 | 20 | 4 | 64 | 8,192 | |
|
|
| caca-400M-untrained | 956.36M | 32,000 | 1408 | 5632 | 32 | 22 | 4 | 64 | 8,192 | |
|
|
| caca-500M-untrained | 1.27B | 32,000 | 1536 | 6144 | 36 | 24 | 4 | 64 | 8,192 | |
|
|
| caca-600M-untrained | 1.48B | 32,000 | 1664 | 6656 | 36 | 26 | 4 | 64 | 8,192 | |
|
|
| caca-700M-untrained | 1.71B | 32,000 | 1792 | 7168 | 36 | 28 | 4 | 64 | 8,192 | |
|
|
| caca-800M-untrained | 1.96B | 32,000 | 1920 | 7680 | 36 | 30 | 5 | 64 | 8,192 | |
|
|
| caca-900M-untrained | 2.01B | 32,000 | 2048 | 8192 | 32 | 32 | 8 | 64 | 8,192 | |
|
|
| caca-1B-untrained | 2.26B | 32,000 | 2048 | 8192 | 36 | 32 | 8 | 64 | 8,192 | |
|
|
| caca-1.5B-untrained | 2.98B | 32,000 | 2048 | 8192 | 48 | 32 | 8 | 64 | 8,192 | |
|
|
| caca-2B-untrained | 3.15B | 32,000 | 2304 | 9216 | 40 | 32 | 8 | 72 | 8,192 | |
|
|
| caca-2.5B-untrained | 3.12B | 32,000 | 2560 | 10240 | 32 | 32 | 8 | 80 | 8,192 | |
|
|
| caca-3B-untrained | 3.88B | 32,000 | 2560 | 10240 | 40 | 32 | 8 | 80 | 8,192 | |
|
|
| caca-3.5B-untrained | 4.69B | 32,000 | 2816 | 11264 | 40 | 32 | 8 | 88 | 8,192 | |
|
|
| caca-4B-untrained | 5.02B | 32,000 | 3072 | 12288 | 36 | 32 | 8 | 96 | 8,192 | |
|
|
| caca-4.5B-untrained | 5.45B | 32,000 | 3200 | 12800 | 36 | 32 | 8 | 100 | 8,192 | |
|
|
| caca-5B-untrained | 6.53B | 32,000 | 3328 | 13312 | 40 | 32 | 8 | 104 | 8,192 | |
|
|
| caca-6B-untrained | 8.31B | 32,000 | 3584 | 14336 | 44 | 32 | 8 | 112 | 8,192 | |
|
|
| caca-7B-untrained | 7.11B | 32,000 | 4096 | 14336 | 32 | 32 | 8 | 128 | 8,192 | |
|
|
| caca-8B-untrained | 7.98B | 32,000 | 4096 | 14336 | 36 | 32 | 8 | 128 | 8,192 | |
|
|
| caca-9B-untrained | 9.09B | 32,000 | 4608 | 16384 | 32 | 36 | 9 | 128 | 8,192 | |
|
|
| caca-10B-untrained | 11.23B | 32,000 | 4608 | 18432 | 36 | 32 | 8 | 144 | 8,192 | |
|
|
| caca-12B-untrained | 15.26B | 32,000 | 5120 | 20480 | 40 | 40 | 8 | 128 | 8,192 | |
|
|
| caca-13B-untrained | 13.38B | 32,000 | 5120 | 13824 | 48 | 40 | 8 | 128 | 8,192 | |
|
|
| caca-14B-untrained | 13.40B | 32,000 | 5376 | 14464 | 44 | 48 | 8 | 112 | 8,192 | |
|
|
| caca-15B-untrained | 14.90B | 32,000 | 5632 | 15104 | 44 | 32 | 8 | 176 | 8,192 | |
|
|
| caca-18B-untrained | 18.92B | 32,000 | 6144 | 16384 | 48 | 48 | 8 | 128 | 8,192 | |
|
|
| caca-20B-untrained | 20.48B | 32,000 | 6144 | 16384 | 52 | 48 | 8 | 128 | 8,192 | |
|
|
| caca-24B-untrained | 25.83B | 32,000 | 6656 | 17920 | 56 | 64 | 8 | 104 | 8,192 | |
|
|
| caca-30B-untrained | 32.24B | 32,000 | 6656 | 17920 | 70 | 64 | 8 | 104 | 8,192 | |
|
|
| caca-35B-untrained | 39.02B | 32,000 | 8192 | 22016 | 56 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-40B-untrained | 44.56B | 32,000 | 8192 | 22016 | 64 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-45B-untrained | 50.09B | 32,000 | 8192 | 22016 | 72 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-50B-untrained | 55.63B | 32,000 | 8192 | 22016 | 80 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-60B-untrained | 72.14B | 32,000 | 8192 | 28672 | 84 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-70B-untrained | 68.71B | 32,000 | 8192 | 28672 | 80 | 64 | 8 | 128 | 8,192 | |
|
|
| caca-80B-untrained | 101.77B | 32,000 | 9216 | 36864 | 84 | 72 | 8 | 128 | 8,192 | |
|
|
| caca-100B-untrained | 137.32B | 32,000 | 10240 | 40960 | 92 | 80 | 8 | 128 | 8,192 | |
|
|
| caca-120B-untrained | 173.10B | 32,000 | 11264 | 45056 | 96 | 88 | 8 | 128 | 8,192 | |
|
|
| caca-150B-untrained | 214.31B | 32,000 | 12288 | 49152 | 100 | 96 | 8 | 128 | 8,192 | |
|
|
| caca-175B-untrained | 248.53B | 32,000 | 12288 | 49152 | 116 | 96 | 8 | 128 | 8,192 | |
|
|
| caca-200B-untrained | 324.80B | 128,000 | 14336 | 57344 | 110 | 112 | 16 | 128 | 16,384 | |
|
|
| caca-250B-untrained | 419.35B | 128,000 | 15360 | 61440 | 124 | 120 | 16 | 128 | 16,384 | |
|
|
| caca-300B-untrained | 507.03B | 128,000 | 16384 | 65536 | 132 | 128 | 16 | 128 | 16,384 | |
|
|
| caca-350B-untrained | 591.18B | 128,000 | 16384 | 65536 | 154 | 128 | 16 | 128 | 16,384 | |
|
|
| caca-400B-untrained | 675.34B | 128,000 | 16384 | 65536 | 176 | 128 | 16 | 128 | 16,384 | |
|
|
| caca-500B-untrained | 852.77B | 128,000 | 18432 | 73728 | 176 | 144 | 16 | 128 | 16,384 | |
|
|
| caca-600B-untrained | 1.07T | 128,000 | 20480 | 81920 | 180 | 160 | 16 | 128 | 16,384 | |
|
|
| caca-700B-untrained | 1.23T | 128,000 | 21504 | 86016 | 186 | 168 | 24 | 128 | 16,384 | |
|
|
| caca-800B-untrained | 1.38T | 128,000 | 22528 | 90112 | 192 | 176 | 16 | 128 | 16,384 | |
|
|
| caca-900B-untrained | 1.65T | 128,000 | 24576 | 94208 | 198 | 192 | 24 | 128 | 16,384 | |
|
|
| caca-1T-untrained | 1.75T | 128,000 | 24576 | 98304 | 204 | 192 | 16 | 128 | 16,384 | |
|
|
|
|
|
|
|
|
--- |
|
|
|
|
|
## ๐พ Kebutuhan Memory |
|
|
|
|
|
### Training Requirements |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<th>Configuration</th> |
|
|
<th>Model Weights</th> |
|
|
<th>+ Optimizer States</th> |
|
|
<th>Total Training</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>FP32 (AdamW)</strong></td> |
|
|
<td>0.01 GB</td> |
|
|
<td>+0.04 GB</td> |
|
|
<td><strong>0.06 GB</strong></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>Mixed Precision</strong></td> |
|
|
<td>0.01 GB</td> |
|
|
<td>+0.05 GB</td> |
|
|
<td><strong>0.06 GB</strong></td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>+ Gradient Checkpointing</strong></td> |
|
|
<td colspan="2">Menghemat ~30-50% activation memory</td> |
|
|
<td><strong>~0.03 GB</strong></td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
### Inference Requirements |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<th>Precision</th> |
|
|
<th>Model Size</th> |
|
|
<th>KV Cache (2K ctx)</th> |
|
|
<th>Total Memory</th> |
|
|
<th>Memory Saving</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>FP16 / BF16</strong></td> |
|
|
<td>0.01 GB</td> |
|
|
<td>0.00 GB</td> |
|
|
<td><strong>0.01 GB</strong></td> |
|
|
<td>Baseline</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>INT8</strong></td> |
|
|
<td>0.00 GB</td> |
|
|
<td>0.00 GB</td> |
|
|
<td><strong>0.01 GB</strong></td> |
|
|
<td>~50% โ</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>INT4 (NF4)</strong></td> |
|
|
<td>0.00 GB</td> |
|
|
<td>0.00 GB</td> |
|
|
<td><strong>0.00 GB</strong></td> |
|
|
<td>~75% โ</td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
> ๐ก **Note**: KV cache bertambah secara linear dengan panjang sequence. Untuk context 8K, kalikan nilai KV cache dengan 4. |
|
|
|
|
|
### Performance Estimates |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<th>Metric</th> |
|
|
<th>Value</th> |
|
|
<th>Notes</th> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>FLOPs per Token</strong></td> |
|
|
<td>7,049,216</td> |
|
|
<td>Forward pass only</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>TFLOPs per Token</strong></td> |
|
|
<td>0.0000</td> |
|
|
<td>โ 6ร untuk backward</td> |
|
|
</tr> |
|
|
<tr> |
|
|
<td><strong>Bandwidth (FP16)</strong></td> |
|
|
<td>0.01 GB/token</td> |
|
|
<td>Memory bandwidth requirement</td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ Struktur Arsitektur Lengkap |
|
|
|
|
|
|
|
|
<details> |
|
|
<summary><b>๐ Klik untuk lihat detail arsitektur</b></summary> |
|
|
|
|
|
``` |
|
|
CACA Architecture |
|
|
โ |
|
|
โโโโ ๐ฅ INPUT PROCESSING |
|
|
โ โ |
|
|
โ โโโโ Text Input |
|
|
โ โ โโโโ Tokenization (BPE/WordPiece/SentencePiece) |
|
|
โ โ โโโโ Token Embeddings (vocab_size ร hidden_size) |
|
|
โ โ โโโโ Output: [batch_size, seq_len, hidden_size] |
|
|
โ โ |
|
|
โ โโโโ Vision Input (Optional) |
|
|
โ โ โโโโ Image Preprocessing (resize ke 224ร224) |
|
|
โ โ โโโโ Vision Encoder (ViT) |
|
|
โ โ โ โโโโ Patch Embedding (Conv2D: 14ร14 patches) |
|
|
โ โ โ โโโโ CLS Token + Positional Embeddings |
|
|
โ โ โ โโโโ Vision Transformer Blocks (24 layers) |
|
|
โ โ โ โ โโโโ LayerNorm |
|
|
โ โ โ โ โโโโ Multi-Head Attention |
|
|
โ โ โ โ โโโโ MLP (GELU activation) |
|
|
โ โ โ โ โโโโ Residual Connections |
|
|
โ โ โ โโโโ Final LayerNorm |
|
|
โ โ โโโโ Vision Projector |
|
|
โ โ โ โโโโ Type: Linear / MLP / Perceiver / Q-Former |
|
|
โ โ โ โโโโ Output: [batch_size, num_patches, hidden_size] |
|
|
โ โ โโโโ Output: Vision embeddings aligned to text space |
|
|
โ โ |
|
|
โ โโโโ Audio Input (Optional) |
|
|
โ โโโโ Audio Preprocessing (Mel-spectrogram, 80 bins) |
|
|
โ โโโโ Audio Encoder |
|
|
โ โ โโโโ Conv1D Layers (feature extraction) |
|
|
โ โ โ โโโโ Conv1D (80 โ hidden_size, kernel=3) |
|
|
โ โ โ โโโโ Conv1D (stride=2 untuk downsampling) |
|
|
โ โ โโโโ Positional Embeddings (interpolated) |
|
|
โ โ โโโโ Audio Transformer Blocks (12 layers) |
|
|
โ โ โ โโโโ LayerNorm |
|
|
โ โ โ โโโโ Multi-Head Attention |
|
|
โ โ โ โโโโ MLP (GELU activation) |
|
|
โ โ โ โโโโ Residual Connections |
|
|
โ โ โโโโ Final LayerNorm |
|
|
โ โโโโ Audio Projector |
|
|
โ โ โโโโ Type: Linear / MLP / Perceiver / Q-Former |
|
|
โ โ โโโโ Output: [batch_size, audio_len, hidden_size] |
|
|
โ โโโโ Output: Audio embeddings aligned to text space |
|
|
โ |
|
|
โโโโ ๐ MULTIMODAL FUSION |
|
|
โ โ |
|
|
โ โโโโ Early Fusion (jika tidak pakai Cross-Attention) |
|
|
โ โ โโโโ Concatenate: [vision_tokens + audio_tokens + text_tokens] |
|
|
โ โ โโโโ Update attention mask |
|
|
โ โ โโโโ Output: Combined sequence untuk decoder |
|
|
โ โ |
|
|
โ โโโโ Late Fusion (jika pakai Cross-Attention) |
|
|
โ โโโโ Text tokens โ Query untuk cross-attention |
|
|
โ โโโโ Vision+Audio tokens โ Key/Value untuk cross-attention |
|
|
โ โโโโ Fusion dilakukan di dalam decoder layers |
|
|
โ |
|
|
โโโโ ๐๏ธ DECODER STACK (N=32 layers) |
|
|
โ โ |
|
|
โ โโโโ ๐ DECODER LAYER i (repeated N times) |
|
|
โ โ |
|
|
โ โโโโ [OPTIONAL] Mixture of Depths (MoD) |
|
|
โ โ โโโโ Input: Hidden states [batch, seq_len, hidden] |
|
|
โ โ โโโโ MoD Router |
|
|
โ โ โ โโโโ Method: learned / random / heuristic |
|
|
โ โ โ โโโโ Score computation per token |
|
|
โ โ โ โโโโ Top-K selection (K = capacity_factor ร seq_len) |
|
|
โ โ โโโโ Process Mask Generation |
|
|
โ โ โ โโโโ Binary mask [batch, seq_len] (1=process, 0=skip) |
|
|
โ โ โโโโ Token Selection |
|
|
โ โ โโโโ Selected tokens: processed through layer |
|
|
โ โ โโโโ Skipped tokens: bypass layer (identity) |
|
|
โ โ |
|
|
โ โโโโ ๐ฏ SELF-ATTENTION PATH |
|
|
โ โ โ |
|
|
โ โ โโโโ Input Normalization |
|
|
โ โ โ โโโโ RMSNorm (Root Mean Square Layer Normalization) |
|
|
โ โ โ โโโโ Formula: x * rsqrt(mean(xยฒ) + ฮต) * ฮณ |
|
|
โ โ โ โโโโ More efficient than LayerNorm (no mean centering) |
|
|
โ โ โ |
|
|
โ โ โโโโ Attention Computation |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Query/Key/Value Projections |
|
|
โ โ โ โ โโโโ Q: Linear(hidden_size โ num_heads ร head_dim) |
|
|
โ โ โ โ โโโโ K: Linear(hidden_size โ num_kv_heads ร head_dim) |
|
|
โ โ โ โ โโโโ V: Linear(hidden_size โ num_kv_heads ร head_dim) |
|
|
โ โ โ โ โโโโ Reshape: [batch, seq, heads, head_dim] |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ [OPTIONAL] QK Normalization |
|
|
โ โ โ โ โโโโ Q = RMSNorm(Q) |
|
|
โ โ โ โ โโโโ K = RMSNorm(K) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Rotary Position Embeddings (RoPE) |
|
|
โ โ โ โ โโโโ Compute frequencies: ฮธ_i = base^(-2i/dim) |
|
|
โ โ โ โ โโโโ Position indices: t โ [0, seq_len) |
|
|
โ โ โ โ โโโโ Rotation matrix: cos(tยทฮธ), sin(tยทฮธ) |
|
|
โ โ โ โ โโโโ Apply rotation: Q, K = rotate(Q, K, cos, sin) |
|
|
โ โ โ โ โโโโ YARN Scaling (jika enabled) |
|
|
โ โ โ โ โโโโ Type: linear / dynamic / yarn |
|
|
โ โ โ โ โโโโ Scaling factor per frequency band |
|
|
โ โ โ โ โโโโ Better extrapolation ke context panjang |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Grouped-Query Attention (GQA) |
|
|
โ โ โ โ โโโโ num_kv_groups = num_heads / num_kv_heads |
|
|
โ โ โ โ โโโโ Repeat K, V: [num_kv_heads โ num_heads] |
|
|
โ โ โ โ โโโโ Memory saving: 30-40% vs full MHA |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Attention Score Computation |
|
|
โ โ โ โ โโโโ scores = (Q @ K.T) / sqrt(head_dim) |
|
|
โ โ โ โ โโโโ Logit clamping: [-50, 50] untuk stabilitas |
|
|
โ โ โ โ โโโโ [OPTIONAL] Soft-capping |
|
|
โ โ โ โ โโโโ scores = tanh(scores / cap) * cap |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Attention Masking |
|
|
โ โ โ โ โโโโ Causal Mask (autoregressive) |
|
|
โ โ โ โ โโโโ Sliding Window Mask (jika enabled) |
|
|
โ โ โ โ โ โโโโ Window size (misal: 512 tokens) |
|
|
โ โ โ โ โ โโโโ Attend hanya ke window terdekat |
|
|
โ โ โ โ โโโโ Attention Sinks (jika enabled) |
|
|
โ โ โ โ โ โโโโ Always attend to first K tokens |
|
|
โ โ โ โ โ โโโโ Prevent attention collapse |
|
|
โ โ โ โ โ โโโโ Better streaming generation |
|
|
โ โ โ โ โโโโ [OPTIONAL] ALiBi Bias |
|
|
โ โ โ โ โโโโ Linear bias based on distance |
|
|
โ โ โ โ โโโโ Alternative/complement to RoPE |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Backend Selection (automatic fallback) |
|
|
โ โ โ โ โโโโ 1๏ธโฃ Flash Attention 2 (PREFERRED) |
|
|
โ โ โ โ โ โโโโ Requirements: CUDA + FP16/BF16 |
|
|
โ โ โ โ โ โโโโ Speedup: 2-4x faster |
|
|
โ โ โ โ โ โโโโ Memory: 10-20x less |
|
|
โ โ โ โ โ โโโโ Sliding window support |
|
|
โ โ โ โ โ โโโโ IO-aware algorithm |
|
|
โ โ โ โ โโโโ 2๏ธโฃ xFormers Memory Efficient (FALLBACK 1) |
|
|
โ โ โ โ โ โโโโ Requirements: CUDA |
|
|
โ โ โ โ โ โโโโ Block-sparse attention |
|
|
โ โ โ โ โ โโโโ Custom attention patterns |
|
|
โ โ โ โ โโโโ 3๏ธโฃ PyTorch SDPA (FALLBACK 2) |
|
|
โ โ โ โ โ โโโโ Requirements: PyTorch 2.0+ |
|
|
โ โ โ โ โ โโโโ Built-in scaled_dot_product_attention |
|
|
โ โ โ โ โ โโโโ Hardware-agnostic |
|
|
โ โ โ โ โโโโ 4๏ธโฃ Standard Attention (SAFE FALLBACK) |
|
|
โ โ โ โ โโโโ Pure PyTorch implementation |
|
|
โ โ โ โ โโโโ Always available |
|
|
โ โ โ โ โโโโ Slower but stable |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Softmax + Dropout |
|
|
โ โ โ โ โโโโ attn_weights = softmax(scores, dim=-1) |
|
|
โ โ โ โ โโโโ attn_weights = dropout(attn_weights) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Value Aggregation |
|
|
โ โ โ โ โโโโ output = attn_weights @ V |
|
|
โ โ โ โ โโโโ Reshape: [batch, seq, num_heads ร head_dim] |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Output Projection |
|
|
โ โ โ โโโโ O: Linear(num_heads ร head_dim โ hidden_size) |
|
|
โ โ โ โโโโ Output: [batch, seq, hidden_size] |
|
|
โ โ โ |
|
|
โ โ โโโโ [OPTIONAL] Layer Scale |
|
|
โ โ โ โโโโ Learnable per-layer scaling: ฮณ |
|
|
โ โ โ โโโโ Initialize: ฮณ = 1e-5 (very small) |
|
|
โ โ โ โโโโ output = ฮณ * output |
|
|
โ โ โ โโโโ Improves training stability |
|
|
โ โ โ |
|
|
โ โ โโโโ [OPTIONAL] Stochastic Depth |
|
|
โ โ โ โโโโ Training: Random layer dropping |
|
|
โ โ โ โโโโ drop_prob = layer_idx / num_layers ร base_prob |
|
|
โ โ โ โโโโ if random() > drop_prob: return output |
|
|
โ โ โ โโโโ else: return 0 |
|
|
โ โ โ โโโโ Inference: Always apply (no dropping) |
|
|
โ โ โ |
|
|
โ โ โโโโ Residual Dropout |
|
|
โ โ โ โโโโ output = dropout(output) |
|
|
โ โ โ |
|
|
โ โ โโโโ Residual Connection |
|
|
โ โ โโโโ hidden_states = hidden_states + output |
|
|
โ โ โโโโ [Training] Gradient clipping: [-1e4, 1e4] |
|
|
โ โ |
|
|
โ โโโโ ๐ [OPTIONAL] CROSS-ATTENTION PATH (untuk Multimodal) |
|
|
โ โ โ |
|
|
โ โ โโโโ Conditional: Hanya jika encoder_hidden_states != None |
|
|
โ โ โโโโ Frequency: Setiap cross_attention_frequency layers |
|
|
โ โ โ |
|
|
โ โ โโโโ Input Normalization |
|
|
โ โ โ โโโโ RMSNorm(hidden_states) |
|
|
โ โ โ |
|
|
โ โ โโโโ Cross-Attention Computation |
|
|
โ โ โ โโโโ Query: dari text hidden states |
|
|
โ โ โ โ โโโโ Q: Linear(hidden_size โ num_heads ร head_dim) |
|
|
โ โ โ โโโโ Key/Value: dari encoder_hidden_states (vision+audio) |
|
|
โ โ โ โ โโโโ K: Linear(hidden_size โ num_kv_heads ร head_dim) |
|
|
โ โ โ โ โโโโ V: Linear(hidden_size โ num_kv_heads ร head_dim) |
|
|
โ โ โ โโโโ Attention: Q @ K.T / sqrt(head_dim) |
|
|
โ โ โ โโโโ Softmax + Dropout |
|
|
โ โ โ โโโโ Output: attn_weights @ V |
|
|
โ โ โ โโโโ Output Projection |
|
|
โ โ โ |
|
|
โ โ โโโโ [OPTIONAL] Layer Scale |
|
|
โ โ โโโโ [OPTIONAL] Stochastic Depth |
|
|
โ โ โโโโ Residual Dropout |
|
|
โ โ โโโโ Residual Connection |
|
|
โ โ โโโโ hidden_states = hidden_states + cross_attn_output |
|
|
โ โ |
|
|
โ โโโโ ๐ฎ FEED-FORWARD PATH |
|
|
โ โ |
|
|
โ โโโโ Input Normalization |
|
|
โ โ โโโโ RMSNorm(hidden_states) |
|
|
โ โ |
|
|
โ โโโโ Feed-Forward Network |
|
|
โ โ โ |
|
|
โ โ โโโโ โโโโโ STANDARD MLP โโโโโ |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Gate Projection |
|
|
โ โ โ โ โโโโ gate: Linear(hidden_size โ intermediate_size) |
|
|
โ โ โ โ โโโโ Typical: intermediate_size = 4 ร hidden_size |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Up Projection |
|
|
โ โ โ โ โโโโ up: Linear(hidden_size โ intermediate_size) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ SwiGLU Activation |
|
|
โ โ โ โ โโโโ gate = silu(gate) # Swish activation |
|
|
โ โ โ โ โโโโ hidden = gate * up # Gating mechanism |
|
|
โ โ โ โ โโโโ Formula: silu(x) = x * sigmoid(x) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Dropout |
|
|
โ โ โ โ โโโโ hidden = dropout(hidden) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Down Projection |
|
|
โ โ โ โโโโ down: Linear(intermediate_size โ hidden_size) |
|
|
โ โ โ โโโโ Output: [batch, seq, hidden_size] |
|
|
โ โ โ |
|
|
โ โ โโโโ โโโโโ MIXTURE OF EXPERTS (MoE) โโโโโ |
|
|
โ โ โ |
|
|
โ โ โโโโ Conditional: use_moe AND (layer_idx % moe_frequency == 0) |
|
|
โ โ โ |
|
|
โ โ โโโโ Router Network |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Router Type Selection |
|
|
โ โ โ โ โโโโ Top-K Router (default) |
|
|
โ โ โ โ โโโโ Expert Choice Router (alternative) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ โโโ TOP-K ROUTER โโโ |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ Gate Normalization |
|
|
โ โ โ โ โ โโโโ hidden = LayerNorm(hidden) |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ Router Logits |
|
|
โ โ โ โ โ โโโโ logits: Linear(hidden_size โ num_experts) |
|
|
โ โ โ โ โ โโโโ Clamping: [-20, 20] |
|
|
โ โ โ โ โ โโโโ Temperature scaling: logits / temp |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ [Training] Jitter Noise |
|
|
โ โ โ โ โ โโโโ noise = randn_like(logits) ร 0.01 |
|
|
โ โ โ โ โ โโโโ logits = logits + noise |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ Routing Weights |
|
|
โ โ โ โ โ โโโโ weights = softmax(logits) |
|
|
โ โ โ โ โ โโโโ top_k_weights, top_k_indices = topk(weights, k) |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ Weight Normalization |
|
|
โ โ โ โ โ โโโโ top_k_weights = top_k_weights / sum(top_k_weights) |
|
|
โ โ โ โ โ |
|
|
โ โ โ โ โโโโ Loss Computation |
|
|
โ โ โ โ โโโโ Auxiliary Loss (load balancing) |
|
|
โ โ โ โ โ โโโโ expert_usage = mean(weights, dim=0) |
|
|
โ โ โ โ โ โโโโ mean_usage = mean(expert_usage) |
|
|
โ โ โ โ โ โโโโ aux_loss = std(expert_usage) / mean_usage |
|
|
โ โ โ โ โโโโ Z-Loss (router stability) |
|
|
โ โ โ โ โ โโโโ z_loss = mean(logsumexp(logits)ยฒ) |
|
|
โ โ โ โ โ โโโโ Prevents logits explosion |
|
|
โ โ โ โ โโโโ Entropy Loss (diversity) |
|
|
โ โ โ โ โโโโ entropy_loss = -mean(weights ร log(weights)) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ โโโ EXPERT CHOICE ROUTER โโโ |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Router Logits |
|
|
โ โ โ โ โโโโ logits: Linear(hidden โ num_experts) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Expert-wise Token Selection |
|
|
โ โ โ โ โโโโ Transpose: [batchรseq, experts] |
|
|
โ โ โ โ โโโโ capacity = expert_choice_k ร total_tokens / num_experts |
|
|
โ โ โ โ โโโโ Per expert: topk(logits, k=capacity) |
|
|
โ โ โ โ โโโโ Expert mask: [experts, batchรseq] |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Routing weights from mask |
|
|
โ โ โ |
|
|
โ โ โโโโ Expert Networks (N experts, misal N=8) |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Expert i (i = 0 to N-1) |
|
|
โ โ โ โโโโ Same structure as Standard MLP |
|
|
โ โ โ โโโโ gate_proj: Linear(hidden โ intermediate) |
|
|
โ โ โ โโโโ up_proj: Linear(hidden โ intermediate) |
|
|
โ โ โ โโโโ SwiGLU activation |
|
|
โ โ โ โโโโ Dropout |
|
|
โ โ โ โโโโ down_proj: Linear(intermediate โ hidden) |
|
|
โ โ โ |
|
|
โ โ โโโโ Expert Execution |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ For each expert: |
|
|
โ โ โ โ โโโโ Get tokens routed to this expert |
|
|
โ โ โ โ โโโโ If no tokens: skip |
|
|
โ โ โ โ โโโโ Run expert forward pass |
|
|
โ โ โ โ โโโโ [Training] Track expert usage |
|
|
โ โ โ โ โโโโ [Safety] NaN/Inf detection |
|
|
โ โ โ โ |
|
|
โ โ โ โโโโ Combine Expert Outputs |
|
|
โ โ โ โโโโ Weighted sum by router weights |
|
|
โ โ โ โโโโ final_output = ฮฃ(weight_i ร expert_i(x)) |
|
|
โ โ โ |
|
|
โ โ โโโโ Output: [batch, seq, hidden_size] |
|
|
โ โ |
|
|
โ โโโโ [OPTIONAL] Layer Scale |
|
|
โ โ โโโโ output = ฮณ * output |
|
|
โ โ |
|
|
โ โโโโ [OPTIONAL] Stochastic Depth |
|
|
โ โ โโโโ Probabilistic dropping (training only) |
|
|
โ โ |
|
|
โ โโโโ Residual Dropout |
|
|
โ โ โโโโ output = dropout(output) |
|
|
โ โ |
|
|
โ โโโโ Residual Connection |
|
|
โ โโโโ hidden_states = hidden_states + output |
|
|
โ โโโโ [Training] Gradient clipping: [-1e4, 1e4] |
|
|
โ |
|
|
โโโโ ๐ค OUTPUT HEAD |
|
|
โ โ |
|
|
โ โโโโ Final Normalization |
|
|
โ โ โโโโ RMSNorm(hidden_states) |
|
|
โ โ โโโโ Output: [batch, seq, hidden_size] |
|
|
โ โ |
|
|
โ โโโโ Language Modeling Head |
|
|
โ โ โโโโ Linear Projection |
|
|
โ โ โ โโโโ lm_head: Linear(hidden_size โ vocab_size, bias=False) |
|
|
โ โ โ โโโโ Output: [batch, seq, vocab_size] |
|
|
โ โ โ |
|
|
โ โ โโโโ [OPTIONAL] Logit Soft-Capping |
|
|
โ โ โโโโ Clamp extreme values: [-capร0.99, capร0.99] |
|
|
โ โ โโโโ Formula: tanh(logits / cap) ร cap |
|
|
โ โ โโโโ Prevents numerical instability |
|
|
โ โ โโโโ Typical cap value: 30.0 |
|
|
โ โ |
|
|
โ โโโโ Output: Logits [batch, seq, vocab_size] |
|
|
โ |
|
|
โโโโ ๐ LOSS COMPUTATION (Training Only) |
|
|
โ โ |
|
|
โ โโโโ Shift for Autoregressive |
|
|
โ โ โโโโ shift_logits = logits[:, :-1, :] |
|
|
โ โ โโโโ shift_labels = labels[:, 1:] |
|
|
โ โ |
|
|
โ โโโโ Language Modeling Loss |
|
|
โ โ โโโโ CrossEntropyLoss(ignore_index=-100) |
|
|
โ โ โโโโ [OPTIONAL] Label Smoothing |
|
|
โ โ โ โโโโ Reduces overconfidence |
|
|
โ โ โโโโ lm_loss = CE(shift_logits, shift_labels) |
|
|
โ โ |
|
|
โ โโโโ [OPTIONAL] MoE Auxiliary Losses |
|
|
โ โ โโโโ Router Auxiliary Loss (load balancing) |
|
|
โ โ โ โโโโ aux_loss ร router_aux_loss_coef (default: 0.01) |
|
|
โ โ โโโโ Router Z-Loss (stability) |
|
|
โ โ โ โโโโ z_loss ร router_z_loss_coef (default: 0.001) |
|
|
โ โ โโโโ Sum across all MoE layers |
|
|
โ โ |
|
|
โ โโโโ Total Loss |
|
|
โ โโโโ total = lm_loss + aux_losses |
|
|
โ |
|
|
โโโโ ๐ MONITORING & METRICS |
|
|
โ โ |
|
|
โ โโโโ MetricsTracker |
|
|
โ โ โโโโ Loss tracking (LM, aux, z-loss) |
|
|
โ โ โโโโ Perplexity: exp(lm_loss) |
|
|
โ โ โโโโ Gradient norms per layer |
|
|
โ โ โโโโ GPU memory usage |
|
|
โ โ โโโโ Expert usage statistics |
|
|
โ โ โโโโ Attention cache hit rate |
|
|
โ โ โโโโ Periodic summary & clearing |
|
|
โ โ |
|
|
โ โโโโ Gradient Monitoring |
|
|
โ โ โโโโ Max gradient norm per layer |
|
|
โ โ โโโโ Mean gradient norm (EMA) |
|
|
โ โ โโโโ Gradient clipping count |
|
|
โ โ โโโโ NaN/Inf detection |
|
|
โ โ |
|
|
โ โโโโ Memory Monitoring |
|
|
โ โโโโ GPU memory allocated |
|
|
โ โโโโ GPU memory reserved |
|
|
โ โโโโ Automatic cache clearing |
|
|
โ โโโโ Per-layer memory checkpoints |
|
|
โ |
|
|
โโโโ ๐ง OPTIMIZATION FEATURES |
|
|
โ |
|
|
โโโโ Gradient Checkpointing |
|
|
โ โโโโ Trade: 30% slower, 50% less memory |
|
|
โ โโโโ Recompute activations during backward |
|
|
โ โโโโ Enable: model.gradient_checkpointing_enable() |
|
|
โ |
|
|
โโโโ Mixed Precision Training (AMP) |
|
|
โ โโโโ FP16/BF16 forward pass |
|
|
โ โโโโ FP32 master weights |
|
|
โ โโโโ Dynamic loss scaling |
|
|
โ โโโโ 2x speedup, 50% memory reduction |
|
|
โ |
|
|
โโโโ Gradient Accumulation |
|
|
โ โโโโ Simulate larger batch size |
|
|
โ โโโโ loss = loss / accumulation_steps |
|
|
โ โโโโ optimizer.step() every N steps |
|
|
โ |
|
|
โโโโ KV Cache (Inference) |
|
|
โ โโโโ Cache Key/Value tensors |
|
|
โ โโโโ Reuse for autoregressive generation |
|
|
โ โโโโ Memory: O(num_layers ร seq_len ร hidden_size) |
|
|
โ โโโโ Speedup: ~10x untuk long sequences |
|
|
โ |
|
|
โโโโ Quantization Support |
|
|
โโโโ 8-bit (LLM.int8) |
|
|
โ โโโโ bitsandbytes integration |
|
|
โ โโโโ Mixed precision (outliers in FP16) |
|
|
โ โโโโ 2x memory reduction |
|
|
โโโโ 4-bit (QLoRA) |
|
|
โโโโ NF4 quantization (normal float 4-bit) |
|
|
โโโโ Double quantization |
|
|
โโโโ BF16 compute dtype |
|
|
โโโโ 4x memory reduction |
|
|
|
|
|
CacaForCausalLM (3.52M) |
|
|
โ |
|
|
โโ Embedding: 8,000 ร 128 |
|
|
โ |
|
|
โโ Transformer Layers (6x) |
|
|
โ โโ RMSNorm |
|
|
โ โโ Attention (GQA) |
|
|
โ โ โโ Q: 4 heads ร 32 dim |
|
|
โ โ โโ KV: 2 heads ร 32 dim |
|
|
โ โ โโ RoPE (ฮธ=10,000) |
|
|
โ โ โโ Flash Attention v2 |
|
|
โ โโ Residual |
|
|
โ โโ RMSNorm |
|
|
โ โโ FFN (SwiGLU) |
|
|
โ โ โโ Gate: 128 โ 512 |
|
|
โ โ โโ Up: 128 โ 512 |
|
|
โ โ โโ Down: 512 โ 128 |
|
|
โ โโ Residual |
|
|
โ |
|
|
โโ Final RMSNorm |
|
|
โโ LM Head: 128 โ 8,000 |
|
|
|
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
๐ PARAMETER BREAKDOWN: |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
Embeddings: 1,024,000 ( 29.1%) |
|
|
Transformer Layers: 1,474,560 ( 41.8%) |
|
|
โโ Attention: 294,912 |
|
|
โโ FFN: 1,179,648 |
|
|
Final Norm: 128 ( 0.0%) |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
TOTAL: 3,524,608 (100.0%) |
|
|
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ |
|
|
``` |
|
|
|
|
|
**Key Design Decisions:** |
|
|
|
|
|
1. **GQA over MHA**: Hemat 50% KV cache memory dengan minimal accuracy loss |
|
|
2. **SwiGLU over GELU**: ~10% better performance pada language modeling |
|
|
3. **RMSNorm over LayerNorm**: Lebih cepat & stabil, tanpa bias term |
|
|
4. **RoPE over Learned**: Better extrapolation untuk sequence length > training |
|
|
5. **No Bias in Linear**: Mengikuti modern LLM best practices (LLaMA-style) |
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Dokumentasi |
|
|
|
|
|
### ๐ฆ Instalasi Dependencies |
|
|
|
|
|
```bash |
|
|
# Core dependencies (REQUIRED) |
|
|
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors |
|
|
|
|
|
# Optional: Untuk performa maksimal |
|
|
pip install flash-attn --no-build-isolation # Flash Attention 2 (3x speedup) |
|
|
pip install xformers # Memory efficient attention |
|
|
pip install bitsandbytes # 4/8-bit quantization |
|
|
|
|
|
# Optional: Untuk monitoring & profiling |
|
|
pip install tensorboard wandb # Training monitoring |
|
|
pip install gputil psutil # Resource monitoring |
|
|
``` |
|
|
|
|
|
**Compatibility Matrix:** |
|
|
|
|
|
| Component | Version | Note | |
|
|
|-----------|---------|------| |
|
|
| Python | 3.8 - 3.11 | 3.11 recommended | |
|
|
| PyTorch | โฅ 2.0.0 | 2.1+ untuk SDPA optimal | |
|
|
| CUDA | 11.8 / 12.1 | Untuk Flash Attention | |
|
|
| Transformers | โฅ 4.35.0 | Untuk AutoModel support | |
|
|
|
|
|
### Cara Penggunaan |
|
|
|
|
|
#### 1๏ธโฃ Basic Loading |
|
|
|
|
|
```python |
|
|
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer |
|
|
import torch |
|
|
|
|
|
# Load configuration |
|
|
config = AutoConfig.from_pretrained( |
|
|
"Lyon28/caca-1M-untrained", |
|
|
trust_remote_code=True |
|
|
) |
|
|
|
|
|
# Load model (FP16 untuk efisiensi) |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"Lyon28/caca-1M-untrained", |
|
|
config=config, |
|
|
trust_remote_code=True, |
|
|
torch_dtype=torch.float16, |
|
|
device_map="auto" # Automatic device placement |
|
|
) |
|
|
|
|
|
# Model ini UNTRAINED - butuh training dulu! |
|
|
print(f"Model loaded: {model.num_parameters():,} parameters") |
|
|
print("โ ๏ธ Model ini belum dilatih dan belum bisa digunakan untuk inference") |
|
|
``` |
|
|
|
|
|
#### 2๏ธโฃ Quantized Loading (4-bit/8-bit) |
|
|
|
|
|
```python |
|
|
from transformers import AutoModelForCausalLM, BitsAndBytesConfig |
|
|
import torch |
|
|
|
|
|
# 4-bit quantization config |
|
|
bnb_config = BitsAndBytesConfig( |
|
|
load_in_4bit=True, |
|
|
bnb_4bit_quant_type="nf4", |
|
|
bnb_4bit_compute_dtype=torch.bfloat16, |
|
|
bnb_4bit_use_double_quant=True |
|
|
) |
|
|
|
|
|
# Load model dengan quantization |
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
|
"Lyon28/caca-1M-untrained", |
|
|
trust_remote_code=True, |
|
|
quantization_config=bnb_config, |
|
|
device_map="auto" |
|
|
) |
|
|
|
|
|
print(f"Memory footprint: ~0.00GB (4-bit)") |
|
|
``` |
|
|
|
|
|
#### 3๏ธโฃ Training Setup |
|
|
|
|
|
```python |
|
|
from transformers import TrainingArguments, Trainer |
|
|
|
|
|
# Training configuration |
|
|
training_args = TrainingArguments( |
|
|
output_dir="./output", |
|
|
per_device_train_batch_size=1, |
|
|
gradient_accumulation_steps=16, |
|
|
learning_rate=2e-4, |
|
|
max_steps=10000, |
|
|
lr_scheduler_type="cosine", |
|
|
warmup_steps=500, |
|
|
logging_steps=10, |
|
|
save_steps=500, |
|
|
fp16=True, # Mixed precision |
|
|
gradient_checkpointing=True, # Memory efficient |
|
|
) |
|
|
|
|
|
# Initialize trainer |
|
|
trainer = Trainer( |
|
|
model=model, |
|
|
args=training_args, |
|
|
train_dataset=train_dataset, |
|
|
) |
|
|
|
|
|
# Start training |
|
|
trainer.train() |
|
|
``` |
|
|
|
|
|
### Advanced Usage |
|
|
|
|
|
#### Gradient Checkpointing (Memory Efficient) |
|
|
|
|
|
```python |
|
|
model.gradient_checkpointing_enable() |
|
|
print("โ
Gradient checkpointing enabled - saves ~40% memory") |
|
|
``` |
|
|
|
|
|
#### Custom Training Loop |
|
|
|
|
|
```python |
|
|
from torch.optim import AdamW |
|
|
from torch.cuda.amp import autocast, GradScaler |
|
|
|
|
|
optimizer = AdamW(model.parameters(), lr=2e-4) |
|
|
scaler = GradScaler() |
|
|
|
|
|
for batch in dataloader: |
|
|
# Mixed precision forward |
|
|
with autocast(dtype=torch.bfloat16): |
|
|
outputs = model(**batch) |
|
|
loss = outputs.loss |
|
|
|
|
|
# Backward with gradient scaling |
|
|
scaler.scale(loss).backward() |
|
|
scaler.step(optimizer) |
|
|
scaler.update() |
|
|
optimizer.zero_grad() |
|
|
``` |
|
|
|
|
|
#### Multi-GPU Training (DDP) |
|
|
|
|
|
```python |
|
|
import torch.distributed as dist |
|
|
from torch.nn.parallel import DistributedDataParallel |
|
|
|
|
|
# Initialize process group |
|
|
dist.init_process_group(backend="nccl") |
|
|
|
|
|
# Wrap model |
|
|
model = DistributedDataParallel( |
|
|
model, |
|
|
device_ids=[local_rank], |
|
|
find_unused_parameters=False |
|
|
) |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## โ๏ธ Konfigurasi Detail |
|
|
|
|
|
### Full Configuration JSON |
|
|
|
|
|
```json |
|
|
{ |
|
|
"architectures": ["CacaForCausalLM"], |
|
|
"model_type": "caca", |
|
|
"vocab_size": 8000, |
|
|
"hidden_size": 128, |
|
|
"intermediate_size": 512, |
|
|
"num_hidden_layers": 6, |
|
|
"num_attention_heads": 4, |
|
|
"num_key_value_heads": 2, |
|
|
"head_dim": 32, |
|
|
"max_position_embeddings": 1024, |
|
|
"rope_theta": 10000, |
|
|
"rms_norm_eps": 1e-06, |
|
|
"use_cache": true, |
|
|
"use_qk_norm": true, |
|
|
"use_flash_attn": true, |
|
|
"attention_dropout": 0.0, |
|
|
"hidden_dropout": 0.1, |
|
|
"torch_dtype": "float16" |
|
|
} |
|
|
``` |
|
|
|
|
|
### Custom Configuration |
|
|
|
|
|
```python |
|
|
from transformers import AutoConfig |
|
|
|
|
|
# Load dan modifikasi config |
|
|
config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained") |
|
|
|
|
|
# Custom modifications |
|
|
config.max_position_embeddings = 16384 # Extend context |
|
|
config.rope_scaling = {"type": "linear", "factor": 2.0} |
|
|
config.use_flash_attn = True |
|
|
config.hidden_dropout = 0.05 |
|
|
|
|
|
# Save custom config |
|
|
config.save_pretrained("./custom_config") |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ฌ Arsitektur |
|
|
|
|
|
<details> |
|
|
<summary><b>Layer Structure</b></summary> |
|
|
|
|
|
**Input Tokens** |
|
|
โ |
|
|
**Embedding Layer** (vocab_size โ hidden_size) |
|
|
โ |
|
|
**Decoder Block ร N** |
|
|
- RMSNorm |
|
|
- Multi-Head Attention (GQA) |
|
|
- Flash Attention v2 |
|
|
- Query heads, KV heads |
|
|
- RoPE position encoding |
|
|
- Residual Connection |
|
|
- RMSNorm |
|
|
- Feed-Forward Network (SwiGLU) |
|
|
- Gate: hidden โ intermediate |
|
|
- Up: hidden โ intermediate |
|
|
- Down: intermediate โ hidden |
|
|
- Residual Connection |
|
|
|
|
|
โ |
|
|
**RMSNorm (Final)** |
|
|
โ |
|
|
**LM Head** (hidden โ vocab_size) |
|
|
โ |
|
|
**Output Logits** |
|
|
|
|
|
</details> |
|
|
|
|
|
### Attention Mechanism (GQA) |
|
|
|
|
|
``` |
|
|
Query: [4 heads ร 32 dim] = 128 |
|
|
Key: [2 heads ร 32 dim] = 64 |
|
|
Value: [2 heads ร 32 dim] = 64 |
|
|
|
|
|
Grouped Query Attention: |
|
|
- Setiap 2 query heads berbagi 1 KV head |
|
|
- Memory KV cache: 50% lebih kecil dari Multi-Head Attention |
|
|
- Kualitas mendekati MHA, speed mendekati MQA |
|
|
``` |
|
|
|
|
|
### Feed-Forward Network (SwiGLU) |
|
|
|
|
|
``` |
|
|
FFN(x) = (SiLU(xW_gate) โ xW_up) W_down |
|
|
|
|
|
Where: |
|
|
- W_gate: 128 ร 512 |
|
|
- W_up: 128 ร 512 |
|
|
- W_down: 512 ร 128 |
|
|
- SiLU(x) = x ยท sigmoid(x) |
|
|
- โ = element-wise multiplication |
|
|
``` |
|
|
|
|
|
## ๐ฌ Format Chat & Prompt Engineering |
|
|
|
|
|
### ๐ Chat Template |
|
|
|
|
|
Model mendukung format chat standar untuk conversational AI: |
|
|
|
|
|
```python |
|
|
# Format chat template bawaan |
|
|
chat_template = """ |
|
|
{% for message in messages %} |
|
|
{% if message['role'] == 'system' %} |
|
|
System: {{ message['content'] }} |
|
|
|
|
|
{% elif message['role'] == 'user' %} |
|
|
User: {{ message['content'] }} |
|
|
|
|
|
{% elif message['role'] == 'assistant' %} |
|
|
Assistant: {{ message['content'] }} |
|
|
|
|
|
{% endif %} |
|
|
{% endfor %} |
|
|
{% if add_generation_prompt %}Assistant:{% endif %} |
|
|
""" |
|
|
|
|
|
# Contoh penggunaan |
|
|
messages = [ |
|
|
{"role": "system", "content": "Kamu adalah asisten AI yang membantu dan ramah."}, |
|
|
{"role": "user", "content": "Jelaskan tentang fotosintesis"}, |
|
|
{"role": "assistant", "content": "Fotosintesis adalah proses di mana tumbuhan mengubah cahaya matahari menjadi energi kimia..."}, |
|
|
{"role": "user", "content": "Apa manfaatnya bagi manusia?"}, |
|
|
] |
|
|
|
|
|
# Apply template |
|
|
formatted = tokenizer.apply_chat_template( |
|
|
messages, |
|
|
tokenize=False, |
|
|
add_generation_prompt=True |
|
|
) |
|
|
|
|
|
print(formatted) |
|
|
# Output: |
|
|
# System: Kamu adalah asisten AI yang membantu dan ramah. |
|
|
# |
|
|
# User: Jelaskan tentang fotosintesis |
|
|
# Assistant: Fotosintesis adalah proses di mana tumbuhan... |
|
|
# User: Apa manfaatnya bagi manusia? |
|
|
# Assistant: |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ฏ Use Cases |
|
|
|
|
|
Model ini dirancang untuk berbagai aplikasi NLP setelah melalui proses training: |
|
|
|
|
|
### Text Generation |
|
|
- โ๏ธ Creative writing & storytelling |
|
|
- ๐ฐ Article generation |
|
|
- ๐ฌ Conversational AI |
|
|
- ๐ Text completion |
|
|
|
|
|
### Language Understanding |
|
|
- ๐ Text classification |
|
|
- ๐ท๏ธ Named Entity Recognition (NER) |
|
|
- โ Question Answering |
|
|
- ๐ Summarization |
|
|
|
|
|
### Code Generation |
|
|
- ๐ป Code completion |
|
|
- ๐ Bug fixing suggestions |
|
|
- ๐ Documentation generation |
|
|
- ๐ Code translation |
|
|
|
|
|
### Multilingual Tasks |
|
|
- ๐ Translation (ID โ EN) |
|
|
- ๐ฃ๏ธ Cross-lingual understanding |
|
|
- ๐ Multilingual classification |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Benchmark & Evaluation |
|
|
|
|
|
> โ ๏ธ Model belum melalui evaluasi karena status untrained |
|
|
|
|
|
Setelah training, model akan dievaluasi pada: |
|
|
|
|
|
### Indonesian Benchmarks |
|
|
- **IndoNLU**: Comprehensive Indonesian NLU tasks |
|
|
- **IndoQA**: Indonesian Question Answering |
|
|
- **IndoSum**: Summarization |
|
|
- **IndoNER**: Named Entity Recognition |
|
|
|
|
|
### Multilingual Benchmarks |
|
|
- **MMLU**: Massive Multitask Language Understanding |
|
|
- **HellaSwag**: Common sense reasoning |
|
|
- **ARC**: Science QA |
|
|
- **TruthfulQA**: Truthfulness evaluation |
|
|
|
|
|
### Generation Quality |
|
|
- **Perplexity**: Language modeling quality |
|
|
- **BLEU/ROUGE**: Translation & summarization |
|
|
- **Human Evaluation**: Fluency, coherence, factuality |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ ๏ธ Development & Training Tips |
|
|
|
|
|
### Optimal Batch Size |
|
|
|
|
|
```python |
|
|
# Rule of thumb untuk 3.52M model |
|
|
# GPU Memory โ Batch size per device |
|
|
|
|
|
if gpu_memory >= 80: # A100 80GB |
|
|
batch_size = 4539 |
|
|
gradient_accumulation = 1 |
|
|
elif gpu_memory >= 40: # A100 40GB |
|
|
batch_size = 2269 |
|
|
gradient_accumulation = 1 |
|
|
elif gpu_memory >= 24: # RTX 3090/4090 |
|
|
batch_size = 1 |
|
|
gradient_accumulation = 1 |
|
|
|
|
|
# Effective batch size = batch_size ร gradient_accumulation ร num_gpus |
|
|
``` |
|
|
|
|
|
### Learning Rate Scheduling |
|
|
|
|
|
```python |
|
|
# Recommended untuk 3.52M model |
|
|
learning_rate = 0.0005 # Base LR |
|
|
warmup_ratio = 0.05 # 5% of total steps |
|
|
lr_scheduler = "cosine" # atau "linear" |
|
|
|
|
|
# Learning rate scaling rule: |
|
|
# LR โ sqrt(batch_size) |
|
|
# Untuk batch size 256: LR = 0.0005 |
|
|
# Untuk batch size 512: LR = 7.07e-04 |
|
|
``` |
|
|
|
|
|
### Gradient Clipping |
|
|
|
|
|
```python |
|
|
# Prevent gradient explosion |
|
|
max_grad_norm = 1.0 # Clip at 1.0 |
|
|
|
|
|
# Monitor gradients |
|
|
from torch.nn.utils import clip_grad_norm_ |
|
|
|
|
|
grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm) |
|
|
if grad_norm > 10.0: |
|
|
print(f"โ ๏ธ High gradient norm: {grad_norm:.2f}") |
|
|
``` |
|
|
|
|
|
### Training Stability |
|
|
|
|
|
```python |
|
|
# Tips untuk stable training: |
|
|
|
|
|
1. **Warmup**: Mulai dengan LR rendah |
|
|
2. **Gradient Checkpointing**: Kurangi memory footprint |
|
|
3. **Mixed Precision**: Gunakan BF16 jika tersedia (lebih stable dari FP16) |
|
|
4. **Batch Size**: Start small, increase gradually |
|
|
5. **Monitor**: Track loss, perplexity, gradient norms |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ง Troubleshooting |
|
|
|
|
|
### Out of Memory (OOM) |
|
|
|
|
|
```python |
|
|
# Solusi OOM saat training: |
|
|
|
|
|
โ
1. Enable gradient checkpointing |
|
|
model.gradient_checkpointing_enable() |
|
|
|
|
|
โ
2. Reduce batch size |
|
|
per_device_train_batch_size = 1 |
|
|
|
|
|
โ
3. Increase gradient accumulation |
|
|
gradient_accumulation_steps = 32 |
|
|
|
|
|
โ
4. Use quantization |
|
|
load_in_8bit = True # atau load_in_4bit |
|
|
|
|
|
โ
5. Reduce sequence length |
|
|
max_length = 1024 # Start dengan ini |
|
|
|
|
|
โ
6. CPU offloading (jika perlu) |
|
|
device_map = "auto" |
|
|
offload_folder = "offload" |
|
|
``` |
|
|
|
|
|
### Slow Training |
|
|
|
|
|
```python |
|
|
# Optimasi kecepatan training: |
|
|
|
|
|
โ
1. Flash Attention |
|
|
config.use_flash_attn = True # 2-3x speedup |
|
|
|
|
|
โ
2. Compile model (PyTorch 2.0+) |
|
|
model = torch.compile(model, mode="reduce-overhead") |
|
|
|
|
|
โ
3. DataLoader optimization |
|
|
dataloader = DataLoader( |
|
|
dataset, |
|
|
batch_size=batch_size, |
|
|
num_workers=4, # Parallel data loading |
|
|
pin_memory=True, # Faster GPU transfer |
|
|
prefetch_factor=2 |
|
|
) |
|
|
|
|
|
โ
4. Mixed precision |
|
|
use_fp16 = True # atau bf16 |
|
|
|
|
|
โ
5. Optimize communication (multi-GPU) |
|
|
find_unused_parameters = False |
|
|
gradient_as_bucket_view = True |
|
|
``` |
|
|
|
|
|
### NaN Loss |
|
|
|
|
|
```python |
|
|
# Jika loss menjadi NaN: |
|
|
|
|
|
โ
1. Reduce learning rate |
|
|
learning_rate = learning_rate * 0.1 |
|
|
|
|
|
โ
2. Check gradient norms |
|
|
clip_grad_norm_(model.parameters(), 1.0) |
|
|
|
|
|
โ
3. Use BF16 instead of FP16 |
|
|
torch_dtype = torch.bfloat16 # Lebih stable |
|
|
|
|
|
โ
4. Add epsilon to RMSNorm |
|
|
rms_norm_eps = 1e-5 # Increase jika perlu |
|
|
|
|
|
โ
5. Check data |
|
|
# Pastikan tidak ada inf/nan di dataset |
|
|
assert not torch.isnan(input_ids).any() |
|
|
assert not torch.isinf(attention_mask).any() |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ซ Prohibited Uses |
|
|
|
|
|
<div style="background: #ffebee; border-left: 4px solid #f44336; padding: 12px; margin: 16px 0;"> |
|
|
|
|
|
Model ini **TIDAK BOLEH** digunakan untuk: |
|
|
|
|
|
- ๐ซ **Harmful content generation** (violence, self-harm, illegal acts) |
|
|
- ๐ซ **Misinformation/disinformation campaigns** |
|
|
- ๐ซ **Harassment or hate speech** |
|
|
- ๐ซ **Impersonation or identity theft** |
|
|
- ๐ซ **Child safety violations** (CSAM, grooming, exploitation) |
|
|
- ๐ซ **Privacy violations** (doxxing, stalking, surveillance abuse) |
|
|
- ๐ซ **Malicious code generation** (malware, exploits, etc) |
|
|
- ๐ซ **Spam or manipulation** (fake reviews, astroturfing) |
|
|
- ๐ซ **Medical/legal advice** (tanpa disclaimer & expert review) |
|
|
- ๐ซ **Financial fraud** (scams, market manipulation) |
|
|
|
|
|
**Violation consequences:** Model access revocation + legal action jika applicable |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ References & Papers |
|
|
|
|
|
### Core Architecture |
|
|
1. **LLaMA** - [Touvron et al., 2023](https://arxiv.org/abs/2302.13971) |
|
|
- RMSNorm, RoPE, SwiGLU, GQA |
|
|
|
|
|
2. **GPT-4** - [OpenAI Technical Report, 2023](https://arxiv.org/abs/2303.08774) |
|
|
- Mixture of Experts (speculated) |
|
|
|
|
|
3. **Gemini** - [Google DeepMind, 2023](https://arxiv.org/abs/2312.11805) |
|
|
- Multimodal architecture, soft-capping |
|
|
|
|
|
4. **Qwen** - [Alibaba Cloud, 2023](https://arxiv.org/abs/2309.16609) |
|
|
- YARN, long context |
|
|
|
|
|
5. **Gemma** - [Google, 2024](https://arxiv.org/abs/2403.08295) |
|
|
- Layer scaling, normalization |
|
|
|
|
|
### Advanced Techniques |
|
|
6. **Flash Attention 2** - [Dao, 2023](https://arxiv.org/abs/2307.08691) |
|
|
7. **Mixture-of-Depths** - [Raposo et al., 2024](https://arxiv.org/abs/2404.02258) |
|
|
8. **StreamingLLM** - [Xiao et al., 2023](https://arxiv.org/abs/2309.17453) |
|
|
9. **YARN** - [Peng et al., 2023](https://arxiv.org/abs/2309.00071) |
|
|
10. **QLoRA** - [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314) |
|
|
|
|
|
--- |
|
|
|
|
|
## โ ๏ธ Known Limitations |
|
|
|
|
|
1. **Training Cost** - MoE + Multimodal = expensive |
|
|
2. **Complex Debugging** - Banyak fallback systems |
|
|
3. **Memory Hungry** - Jika semua fitur enabled |
|
|
4. **Dependency Hell** - Butuh flash-attn, xformers, bitsandbytes |
|
|
5. **Expert Balancing** - MoE butuh careful tuning untuk load balancing |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ License & Citation |
|
|
|
|
|
### ๐ License |
|
|
|
|
|
<div style="background: #e8f5e9; border-left: 4px solid #4caf50; padding: 12px; margin: 16px 0;"> |
|
|
|
|
|
Model ini dirilis di bawah **Apache License 2.0** |
|
|
|
|
|
โ
**Anda BEBAS untuk:** |
|
|
- โ๏ธ Gunakan secara komersial |
|
|
- โ๏ธ Modifikasi sesuka hati |
|
|
- โ๏ธ Distribusi ulang |
|
|
- โ๏ธ Patent use |
|
|
- โ๏ธ Private use |
|
|
|
|
|
โ ๏ธ **Dengan syarat:** |
|
|
- ๐ Include license & copyright notice |
|
|
- ๐ State changes yang dibuat |
|
|
- ๐ Disclaimer of warranty |
|
|
|
|
|
โ **Tanpa jaminan apapun** (use at your own risk) |
|
|
|
|
|
</div> |
|
|
|
|
|
**Full license text**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) |
|
|
|
|
|
## ๐ Citation |
|
|
|
|
|
Jika Anda menggunakan model ini dalam penelitian, mohon sitasi: |
|
|
|
|
|
```bibtex |
|
|
@misc{cacacaca1m, |
|
|
author = {Lyon}, |
|
|
title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention}, |
|
|
year = {2026}, |
|
|
publisher = {Hugging Face}, |
|
|
journal = {Hugging Face Model Hub}, |
|
|
howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}}, |
|
|
note = {Untrained model with 3,524,608 parameters} |
|
|
} |
|
|
``` |
|
|
|
|
|
**APA Style:** |
|
|
``` |
|
|
Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped |
|
|
Query Attention [Untrained model]. Hugging Face. |
|
|
https://huggingface.co/Lyon28/caca-1M-untrained |
|
|
``` |
|
|
|
|
|
**MLA Style:** |
|
|
``` |
|
|
Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention." |
|
|
Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained. |
|
|
``` |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ Acknowledgments |
|
|
|
|
|
Model ini berdiri di pundak para raksasa! Terima kasih kepada: |
|
|
|
|
|
<details> |
|
|
<summary><b>๐๏ธ Klik untuk daftar lengkap acknowledgments</b></summary> |
|
|
|
|
|
#### ๐๏ธ **Core Architecture** |
|
|
- **LLaMA/LLaMA 2** (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU |
|
|
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) |
|
|
- Authors: Hugo Touvron et al. |
|
|
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm |
|
|
- **PaLM** (Google, 2022) - SwiGLU activation insights |
|
|
|
|
|
#### ๐ฏ **Attention Mechanisms** |
|
|
- **Flash Attention v2** (Tri Dao et al., Stanford, 2023) |
|
|
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691) |
|
|
- 3x speedup dengan IO-aware algorithm |
|
|
- **Grouped Query Attention** (Joshua Ainslie et al., Google, 2023) |
|
|
- Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245) |
|
|
- Memory-efficient KV cache |
|
|
- **Multi-Query Attention** (Noam Shazeer, Google, 2019) |
|
|
- Fast inference dengan shared K/V |
|
|
- **xFormers** (Meta AI, 2022) - Memory efficient attention |
|
|
- **PyTorch SDPA** (PyTorch Team, 2023) - Native attention optimization |
|
|
|
|
|
#### ๐ **Position Encodings** |
|
|
- **RoPE** (Jianlin Su et al., EleutherAI, 2021) |
|
|
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) |
|
|
- Superior length extrapolation |
|
|
- **ALiBI** (Ofir Press et al., 2022) |
|
|
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409) |
|
|
- Length generalization without retraining |
|
|
- **YaRN** (Bowen Peng et al., 2023) |
|
|
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071) |
|
|
|
|
|
#### ๐ช **Long Context & Efficiency** |
|
|
- **Sliding Window Attention** (Albert Gu et al., Mistral AI, 2023) |
|
|
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825) |
|
|
- **StreamingLLM** (Guangxuan Xiao et al., MIT, 2023) |
|
|
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) |
|
|
- Infinite sequence length! |
|
|
- **Logit Softcapping** (Google Gemma Team, 2024) |
|
|
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295) |
|
|
|
|
|
#### ๐ง **Mixture of Experts** |
|
|
- **Mixtral 8x7B** (Albert Jiang et al., Mistral AI, 2024) |
|
|
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088) |
|
|
- State-of-the-art sparse MoE |
|
|
- **Switch Transformers** (William Fedus et al., Google, 2021) |
|
|
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) |
|
|
- Expert scaling insights |
|
|
- **GLaM** (Nan Du et al., Google, 2021) - Generalist Language Model |
|
|
- **Expert Choice Routing** (Yanqi Zhou et al., Google, 2022) |
|
|
- Better load balancing |
|
|
|
|
|
#### ๐ **Training Optimizations** |
|
|
- **Layer Scale** (Hugo Touvron et al., Meta, 2021) |
|
|
- Paper: [Going Deeper with Image Transformers](https://arxiv.org/abs/2103.17239) |
|
|
- Training stability untuk deep networks |
|
|
- **Stochastic Depth** (Gao Huang et al., 2016) |
|
|
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382) |
|
|
- **Mixture of Depths** (David Raposo et al., DeepMind, 2024) |
|
|
- Paper: [Mixture-of-Depths: Dynamically allocating compute](https://arxiv.org/abs/2404.02258) |
|
|
- Dynamic compute allocation |
|
|
- **Gradient Checkpointing** (Tianqi Chen et al., 2016) |
|
|
|
|
|
#### ๐ฆ **Quantization** |
|
|
- **LLM.int8()** (Tim Dettmers et al., 2022) |
|
|
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339) |
|
|
- **QLoRA** (Tim Dettmers et al., 2023) |
|
|
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) |
|
|
- 4-bit efficient fine-tuning |
|
|
- **bitsandbytes** (Tim Dettmers) - Quantization library |
|
|
|
|
|
#### ๐จ **Multimodal** |
|
|
- **Vision Transformer** (Alexey Dosovitskiy et al., Google, 2020) |
|
|
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929) |
|
|
- **Flamingo** (Jean-Baptiste Alayrac et al., DeepMind, 2022) |
|
|
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198) |
|
|
- Perceiver Resampler |
|
|
- **BLIP-2** (Junnan Li et al., Salesforce, 2023) |
|
|
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597) |
|
|
- Q-Former architecture |
|
|
- **Whisper** (Alec Radford et al., OpenAI, 2022) - Audio encoding |
|
|
|
|
|
#### ๐ ๏ธ **Normalization & Activations** |
|
|
- **RMSNorm** (Biao Zhang, Rico Sennrich, 2019) |
|
|
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467) |
|
|
- **SwiGLU** (Noam Shazeer, Google, 2020) |
|
|
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) |
|
|
|
|
|
#### ๐ง **Tools & Frameworks** |
|
|
- **๐ค Hugging Face** - Transformers, Accelerate, PEFT |
|
|
- Making NLP accessible to everyone |
|
|
- **PyTorch** - Deep learning framework |
|
|
- Facebook AI Research team |
|
|
- **Safetensors** - Secure serialization |
|
|
- Hugging Face team |
|
|
- **DeepSpeed** - Distributed training |
|
|
- Microsoft Research |
|
|
- **Flash Attention Implementation** - Tri Dao & team |
|
|
|
|
|
#### ๐ฎ๐ฉ **Indonesian NLP Community** |
|
|
Special thanks to Indonesian NLP researchers & practitioners yang telah membangun foundation untuk Indonesian language AI. |
|
|
|
|
|
</details> |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ License |
|
|
|
|
|
Model ini dirilis di bawah **Apache License 2.0**. |
|
|
|
|
|
### Ketentuan Penggunaan: |
|
|
- โ
**Bebas digunakan** untuk keperluan komersial dan non-komersial |
|
|
- โ
**Modifikasi** diperbolehkan |
|
|
- โ
**Distribusi** diperbolehkan dengan attribution |
|
|
- โ ๏ธ **No Warranty** - model disediakan "as is" |
|
|
- ๐ **Attribution Required** - sertakan copyright notice |
|
|
|
|
|
Lihat [LICENSE](LICENSE) untuk detail lengkap. |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ค Contributing |
|
|
|
|
|
Kami sangat terbuka untuk kontribusi! Berikut cara Anda bisa berkontribusi: |
|
|
|
|
|
### Training & Fine-tuning |
|
|
- ๐ Train model ini dengan dataset Anda |
|
|
- ๐ Share benchmark results |
|
|
- ๐ฌ Experiment dengan hyperparameters |
|
|
|
|
|
### Code & Architecture |
|
|
- ๐ Report bugs atau issues |
|
|
- ๐ก Suggest improvements |
|
|
- ๐ง Submit pull requests |
|
|
|
|
|
### Documentation |
|
|
- ๐ Improve documentation |
|
|
- ๐ Add translations |
|
|
- โ๏ธ Write tutorials & guides |
|
|
|
|
|
### Dataset & Evaluation |
|
|
- ๐ Contribute training data |
|
|
- ๐งช Create evaluation benchmarks |
|
|
- ๐ฏ Share fine-tuned versions |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ฅ Team & Acknowledgments |
|
|
|
|
|
### Core Team |
|
|
- **LyonPoy** - Architecture design & implementation |
|
|
|
|
|
### Special Thanks |
|
|
- ๐ค **Hugging Face** - Infrastructure & community |
|
|
- โก **FlashAttention Team** - Efficient attention implementation |
|
|
- ๐ง **Anthropic, Google, Meta, openAI, etc** - Research inspirations |
|
|
- Meta AI (LLaMA) |
|
|
- OpenAI (GPT series) |
|
|
- Google DeepMind (Gemini, Gemma) |
|
|
- Alibaba Cloud (Qwen) |
|
|
- HuggingFace (Transformers library) |
|
|
- Tri Dao (Flash Attention) |
|
|
- Tim Chen (bitsandbytes) |
|
|
|
|
|
### Community |
|
|
Terima kasih kepada komunitas open-source yang telah berkontribusi pada: |
|
|
- Transformers library |
|
|
- PyTorch framework |
|
|
- Datasets & evaluation tools |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Contact & Support |
|
|
|
|
|
### Community |
|
|
- ๐ฌ [Discussions](https://huggingface.co/Lyon28/caca-1M-untrained/discussions) - Ask questions |
|
|
- ๐ [Issues](https://github.com/Lyon-28/caca-transformers/issues) - Report bugs |
|
|
- ๐ง Email : cacatransformers@gmail.com |
|
|
|
|
|
--- |
|
|
|
|
|
## ๐ Star History |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
[](https://star-history.com/#Lyon-28/caca-transformers&Date) |
|
|
|
|
|
</div> |
|
|
|
|
|
## ๐ Dibuat dengan โค๏ธ untuk Komunitas AI Indonesia |
|
|
|
|
|
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="200" alt="Caca Logo"/> |
|
|
|
|
|
### **Terima kasih telah menggunakan Caca!** |
|
|
|
|
|
Jika model ini berguna, jangan lupa โญ repository kami! |
|
|
|
|
|
<div align="center"> |
|
|
|
|
|
<table> |
|
|
<tr> |
|
|
<td align="center">โญ<br/><b>Star Repo</b><br/><sub>Show your support</sub></td> |
|
|
<td align="center">๐<br/><b>Share</b><br/><sub>Tell your friends</sub></td> |
|
|
<td align="center">๐ฌ<br/><b>Join Discussion</b><br/><sub>Ask questions</sub></td> |
|
|
<td align="center">๐ค<br/><b>Contribute</b><br/><sub>Make it better</sub></td> |
|
|
</tr> |
|
|
</table> |
|
|
|
|
|
### ๐ Happy Training! ๐ |
|
|
|
|
|
**Model ini menunggu untuk dilatih dan menjadi foundation untuk aplikasi AI Anda.** |
|
|
|
|
|
[๐ฅ Download Model](#) โข [๐ Read Docs](https://github.com/Lyon-28/caca-transformers) โข [๐ฌ Join Community](https://github.com/Lyon-28/caca-transformers) |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ Model Statistics |
|
|
|
|
|
<img src="https://img.shields.io/badge/Parameters-3.52M-blue?style=for-the-badge" alt="Parameters"/> |
|
|
<img src="https://img.shields.io/badge/Status-Untrained-orange?style=for-the-badge" alt="Status"/> |
|
|
<img src="https://img.shields.io/badge/License-Apache%202.0-green?style=for-the-badge" alt="License"/> |
|
|
|
|
|
<img src="https://img.shields.io/badge/Architecture-Transformer-purple?style=for-the-badge" alt="Architecture"/> |
|
|
<img src="https://img.shields.io/badge/Type-Causal%20LM-red?style=for-the-badge" alt="Type"/> |
|
|
<img src="https://img.shields.io/badge/Context-1,024%20tokens-cyan?style=for-the-badge" alt="Context"/> |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐จ Daily Inspiration |
|
|
|
|
|
<div align="center"> |
|
|
<img src="https://quotes-caca.vercel.app/api/SvgQuote" alt="Daily Quote" width="600" /> |
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
### ๐ Quick Stats |
|
|
|
|
|
| Metric | Value | |
|
|
|--------|-------| |
|
|
| ๐ Total Parameters | 3,524,608 | |
|
|
| ๐๏ธ Layers | 6 | |
|
|
| ๐ฏ Attention Heads | 4 | |
|
|
| ๐ Max Context | 1,024 tokens | |
|
|
| ๐พ Size (FP16) | 0.01 GB | |
|
|
| ๐พ Size (INT4) | 0.00 GB | |
|
|
|
|
|
--- |
|
|
|
|
|
<sub> |
|
|
Model ini adalah bagian dari <b>Caca Project</b> - Open source initiative untuk membangun Indonesian LLM ecosystem.<br/> |
|
|
Created with ๐ป by <a href="https://huggingface.co/Lyon28">@Lyon28</a> | |
|
|
Licensed under <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0</a> | |
|
|
Built with <a href="https://huggingface.co">๐ค HuggingFace</a> |
|
|
</sub> |
|
|
|
|
|
<br/><br/> |
|
|
|
|
|
**๐ "Dari nol, untuk semua" ๐** |
|
|
|
|
|
<sub>Last updated: january 2026</sub> |
|
|
|
|
|
</div> |
|
|
|
|
|
--- |
|
|
|
|
|
<div align="center"> |
|
|
<sub>Built with โค๏ธ by Caca Transformers Team</sub><br> |
|
|
<sub>Powered by ๐ค Transformers โข โก PyTorch โข ๐ฅ Flash Attention</sub> |
|
|
</div> |
|
|
|