---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
- bilingual
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
example_title: "🇮🇩 Text Completion (ID)"
- text: |
Pertanyaan: Apa itu kecerdasan buatan?
Jawaban:
example_title: "🇮🇩 Question Answering (ID)"
- text: |
Tulis cerita pendek tentang robot yang belajar mencintai.
example_title: "🇮🇩 Creative Writing (ID)"
- text: "The capital of Indonesia is"
example_title: "🇬🇧 Text Completion (EN)"
- text: |
Question: What is artificial intelligence?
Answer:
example_title: "🇬🇧 Question Answering (EN)"
- text: |
def fibonacci(n):
"""Hitung bilangan fibonacci ke-n"""
example_title: "💻 Code Completion"
- text: |
# Fungsi untuk mengurutkan array
def sort_array(arr):
example_title: "💻 Code Generation"
- text: |
User: Halo! Siapa kamu?
Assistant:
example_title: "💬 Chat Format (ID)"
- text: |
User: Jelaskan tentang machine learning dalam 2 kalimat.
Assistant:
example_title: "💬 Conversational (ID)"
inference:
parameters:
max_new_tokens: 100
temperature: 0.7
top_p: 0.9
top_k: 50
do_sample: true
repetition_penalty: 1.1
num_beams: 1
datasets: []
metrics:
- perplexity
- accuracy
model-index:
- name: caca-1M
results: []
---
<div align="center">
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-1M"/>
# 🤖 caca-1M
### A Modern Transformer Architecture with Advanced Features
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)
[![Model Type](https://img.shields.io/badge/Model-Causal%20LM-green.svg)]()
[![Parameters](https://img.shields.io/badge/Parameters-3.52M-orange.svg)]()
[![Status](https://img.shields.io/badge/Status-Untrained-red.svg)]()
**3,524,608** parameters • **3.52M** • **6 layers** • **1,024 tokens**
[📚 Documentation](#-documentation) • [💻 Usage](#usage) • [⚙️ Configuration](#️-configuration-details) • [🔬 Architecture](#-architecture)
</div>
---
## โš ๏ธ PENTING: Model Belum Dilatih (Untrained)
<div style="background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin: 16px 0;">
<strong>โš ๏ธ PERHATIAN</strong>: Ini adalah model yang <strong>belum melalui proses training</strong>. Bobot model masih dalam kondisi <strong>random initialization</strong>. Output yang dihasilkan akan <strong>tidak bermakna dan acak</strong>.
</div>
**Status Model:**
- ๐Ÿ”ด **Belum dilatih** - Bobot masih random (Kaiming/Xavier init)
- ๐ŸŸก **Untuk riset & eksperimen** - Arsitektur sudah siap, tinggal train
- ๐ŸŸข **Production-ready architecture** - Teruji dan optimal
Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas tinggi.
### 🎯 What Can It Do?
| ✅ Can | ❌ Cannot (yet) |
|---------|----------------|
| Load the model architecture | Generate meaningful text |
| Test forward passes | Answer questions |
| Measure memory & speed | Reasoning & understanding |
| Start training | Production deployment |
| Fine-tuning experiments | Real-world applications |
---
## 📋 Description
**CACA** (Collaborative Architecture for Contextual AI) is a Large Language Model (LLM) architecture that combines **best practices** from State-of-the-Art (SOTA) models such as **LLaMA**, **GPT-4**, **Gemini**, **Qwen**, and **Gemma**.
The model is designed with a focus on **computational efficiency**, **scalability**, and **high performance**, making it **modular**, **production-ready**, and **multimodal** (text, images, audio).
<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; background: #f8f9fa; padding: 12px;">
<p><strong>📖 About the Caca Project</strong></p>
<p><em>Caca</em> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not competing with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset.</p>
<p>If it turns out useful to others, alhamdulillah. If not, it is still fun. This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
<p>— <strong>Lyon</strong>, Creator</p>
</blockquote>
### ✨ **Highlights**
- 🧠 **Hybrid Architecture** - combines the best techniques from 5+ SOTA models
- 🎭 **Multimodal Native** - supports text, images, and audio in a single model
- ⚡ **High Performance** - Flash Attention, MoE, and modern optimizations
- 🌏 **Indonesian-First** - developed with a focus on the Indonesian language
- 🔓 **Open Source** - transparent, reproducible, collaborative
### 🌟 Why Caca?
1. **🇮🇩 Indonesian-language focus** - designed around the characteristics of Indonesian
2. **⚡ High efficiency** - GQA & Flash Attention for 3-5x faster inference
3. **💾 Memory efficient** - saves 50% of KV-cache memory
4. **🔧 Modular & extensible** - easy to customize for different use cases
5. **🌏 Bilingual** - optimized support for Indonesian & English
**CACA** comes with a different philosophy:
- ✅ **Fully open-source** - from the architecture to the training code
- ✅ **Modular & scalable** - configurable from 1B up to 70B+ parameters
- ✅ **Resource-efficient** - optimized for limited budgets
- ✅ **Indonesian-centric** - Indonesian is the first priority
- ✅ **Community-driven** - open to contributions & collaborations
## 📈 Comparison with Other Models
| Feature | LLaMA | GPT-4 | Gemini | Qwen | CACA |
|-------|-------|-------|--------|------|------|
| **RMSNorm** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **RoPE** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **GQA** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **MoE** | ❌ | ✅ | ✅ | ❌ | ✅ |
| **Multimodal** | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Flash Attention** | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Sliding Window** | ❌ | ❌ | ❌ | ✅ | ✅ |
| **Attention Sinks** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **MoD** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **Expert Choice** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **YARN Scaling** | ❌ | ❌ | ❌ | ✅ | ✅ |
| **Quantization** | ✅ | ❌ | ❌ | ✅ | ✅ |
---
## 🎯 Use Cases & Applications
### ✅ Suitable For
<table>
<tr>
<td width="50%">
**🔬 Research & Development**
- Transformer architecture experiments
- Ablation studies
- Novel training techniques
- Architecture search
**📚 Academic & Education**
- Theses & research papers
- Teaching materials
- Student projects
- Understanding LLM internals
</td>
<td width="50%">
**🚀 Base Model for Fine-tuning**
- Task-specific models
- Domain adaptation
- Instruction tuning
- RLHF experiments
**💡 Prototyping**
- Proof of concept
- Feature testing
- A/B testing architectures
- Benchmark comparisons
</td>
</tr>
</table>
### โŒ Tidak Cocok Untuk
<div style="background: #ffe6e6; border-left: 4px solid #ff4444; padding: 12px; margin: 16px 0;">
- ๐Ÿšซ **Production Applications** - Model belum dilatih, output random
- ๐Ÿšซ **Real-world Deployment** - Perlu training & safety alignment dulu
- ๐Ÿšซ **Safety-critical Systems** - Tidak ada safety guardrails
- ๐Ÿšซ **Direct User-facing Apps** - Output tidak dapat diprediksi
- ๐Ÿšซ **Commercial Use (as-is)** - Harus dilatih terlebih dahulu
</div>
---
## 📊 Model Specifications
<table>
<tr>
<td><strong>Parameter</strong></td>
<td><strong>Value</strong></td>
<td><strong>Parameter</strong></td>
<td><strong>Value</strong></td>
</tr>
<tr>
<td>Total Parameters</td>
<td><code>3,524,608</code></td>
<td>Vocab Size</td>
<td><code>8,000</code></td>
</tr>
<tr>
<td>Hidden Size</td>
<td><code>128</code></td>
<td>Intermediate Size</td>
<td><code>512</code></td>
</tr>
<tr>
<td>Num Layers</td>
<td><code>6</code></td>
<td>Attention Heads</td>
<td><code>4</code></td>
</tr>
<tr>
<td>KV Heads (GQA)</td>
<td><code>2</code></td>
<td>Head Dimension</td>
<td><code>32</code></td>
</tr>
<tr>
<td>Max Context Length</td>
<td><code>1,024</code></td>
<td>RoPE Base (θ)</td>
<td><code>10,000</code></td>
</tr>
<tr>
<td>Model Size (FP16)</td>
<td><code>0.01 GB</code></td>
<td>Formatted Size</td>
<td><code>3.52M</code></td>
</tr>
</table>
---
### 🎯 Core Features
<details open>
<summary><b>๐Ÿ” Klik untuk expand/collapse</b></summary>
- โœ… **Grouped Query Attention (GQA)** - Efisiensi memori dan komputasi superior
- Query heads: **4**
- KV heads: **2**
- Ratio: **2:1** (hemat ~50% memory KV cache)
- **Benefit**: Inferensi lebih cepat dengan memory footprint lebih kecil
- โœ… **Rotary Position Embeddings (RoPE)** - Generalisasi konteks panjang lebih baik
- Theta (ฮธ): **10,000**
- Support extrapolation untuk konteks > training length
- **Benefit**: Performa stabil pada sequence length yang belum pernah dilihat saat training
- โœ… **RMSNorm** - Normalisasi lebih stabil dan ~50% lebih cepat dari LayerNorm
- Epsilon: **1e-06**
- **Benefit**: Training lebih stabil, inference lebih cepat, gradient flow lebih baik
- โœ… **SwiGLU Activation** - Performa 10-15% lebih baik dari ReLU/GELU
- Intermediate size: **512** (4.0x hidden)
- **Benefit**: Kapasitas model lebih besar tanpa menambah parameter signifikan
- โœ… **Flash Attention 2** - Akselerasi hingga 3x dengan memory efficiency
- Otomatis aktif jika tersedia CUDA device
- IO-aware algorithm untuk minimal HBM access
- **Benefit**: Training & inference jauh lebih cepat, support batch size lebih besar
- โœ… **Hybrid Architecture** - Kombinasi teknik terbaik dari 5+ model SOTA
- โœ… **Multimodal Support** - Native support untuk Vision dan Audio
- โœ… **Mixture of Experts (MoE)** - Sparse activation untuk efisiensi
- โœ… **Long Context** - Support hingga 8K+ tokens dengan YARN scaling
- โœ… **Advanced Attention** - Flash Attention, Sliding Window, Attention Sinks
- โœ… **Quantization Ready** - Support 4-bit dan 8-bit quantization
- โœ… **Production Features** - Extensive error handling & monitoring
</details>
### 🔥 Advanced Features
### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - IO-aware algorithm, 3x faster than standard attention
- 🔑 **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
  - Compression ratio: **2:1** (saves ~50% of KV-cache memory)
- 🚀 **xFormers Support** - memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - native scaled dot-product attention
### 📏 Position Encodings
- 🔄 **RoPE (Rotary Position Embeddings)** - base frequency θ=10,000
  - Generalizes better to long sequences than absolute position embeddings
### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - trades compute for memory (supports models up to 100B+ params)
- 🎯 **Mixed Precision Training** - supports FP16, BF16, and TF32
- 📉 **Dropout Regularization**
  - Hidden dropout: 0.1
  - Attention dropout: 0.0
  - Residual dropout: 0.1
### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4 & FP4 via bitsandbytes
  - Memory reduction: ~**75%** (4GB → 1GB)
  - Accuracy loss: <2% on most tasks
  - Supports double quantization for maximum compression
- 8️⃣ **8-bit Quantization** - LLM.int8() with outlier handling
  - Memory reduction: ~**50%** (4GB → 2GB)
  - Accuracy loss: <1%
- 🔄 **Dynamic Quantization** - runtime quantization without calibration
### 🔬 Advanced Features
- 📊 **Automatic Mixed Precision (AMP)** - dynamic loss scaling
- 🎯 **Gradient Clipping** - training stability via max-norm clipping
- 📈 **Learning Rate Scheduling** - supports cosine, linear, warmup
- 💡 **Smart Memory Management** - automatic cache clearing & monitoring
- 🔍 **Metrics Tracking** - real-time perplexity, loss, gradient norms
- 🛡️ **NaN/Inf Detection** - automatic recovery from numerical instability
---
## 🧩 Architecture Components
### 1️⃣ **From LLaMA (Meta)**
CACA adopts LLaMA's efficiency-oriented components for optimal performance:
```python
✓ RMSNorm                     # More efficient normalization than LayerNorm
✓ Rotary Position Embeddings  # Better positional encoding
✓ SwiGLU Activation           # Activation function with a gating mechanism
✓ Grouped-Query Attention     # Saves memory via shared K/V heads
✓ Pre-normalization           # Better training stability
```
- RMSNorm is **~30% faster** than LayerNorm (a minimal sketch follows below)
- RoPE lets the model **extrapolate to longer contexts**
- GQA **saves 30-40% memory** compared to Multi-Head Attention
- SwiGLU **improves performance by 3-5%** over ReLU/GELU
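
To make the RMSNorm entry concrete, here is a minimal PyTorch sketch of the formula used throughout this card (illustrative only, not the repository's own implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: x * rsqrt(mean(x^2) + eps) * gamma."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # gamma
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean-centering and no bias term, unlike LayerNorm.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```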
---
### 2๏ธโƒฃ **Dari GPT-4 (OpenAI)**
Implementasi Mixture of Experts untuk skalabilitas:
```python
โœ“ Mixture of Experts (MoE) # Sparse activation dengan multiple expert networks
โœ“ Top-K Router # Routing token ke K expert terbaik
โœ“ Auxiliary Loss # Load balancing antar experts
โœ“ Z-Loss # Stabilisasi router logits
โœ“ Expert Usage Tracking # Monitoring penggunaan setiap expert
```
```
Input Token
↓
[Router] → pick top-K experts (e.g., K=2 of 8 experts)
↓
Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4)
↓
Weighted Sum Output
```
**Advantages** (a minimal router sketch follows below):
- The model can be **10x larger** at the same compute cost
- Each token activates only **2 of 8 experts (≈25% of expert parameters)** when K=2, N=8
- Experts run in parallel
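
A minimal sketch of the Top-K router described above, reusing this card's own load-balancing signal `std(expert_usage) / mean_usage`; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Pick the top-k experts per token and renormalize their weights."""
    def __init__(self, hidden_size: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                      # [tokens, num_experts]
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize
        # Load-balancing auxiliary loss: how evenly are experts used?
        usage = weights.mean(dim=0)
        aux_loss = usage.std() / usage.mean()
        return topk_w, topk_idx, aux_loss
```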
---
### 3๏ธโƒฃ **Dari Gemini (Google)**
Multimodal native dengan cross-modal fusion:
```python
โœ“ Vision Encoder (ViT) # Process gambar dengan Vision Transformer
โœ“ Audio Encoder (Conv1D + Trans) # Process audio dengan CNN + Transformer
โœ“ Cross-Attention Mechanism # Fuse multimodal features
โœ“ Multiple Projector Types:
- Linear Projector # Simple & cepat
- MLP Projector # Non-linear mapping
- Perceiver Resampler # Compress dengan latent queries
- Q-Former # Query-based projection (BLIP-2 style)
โœ“ Logit Soft-Capping # Clip extreme values untuk stabilitas
```
**Multimodal flow** (a minimal projector sketch follows below):
```
[Image] → Vision Encoder → [2D patches → 1D tokens]
↓
Projector → [Hidden dim = text dim]
↓
[Text] + [Image tokens] → Cross-Attention → Fused representation
```
**Supported formats:**
- Images: JPEG, PNG (224x224 default)
- Audio: Mel-spectrogram (80 bins)
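
As a sketch of the projector step in the flow above, a simple MLP projector that maps encoder features into the text embedding space; the dimensions and names here are hypothetical:

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """Map encoder features (e.g., ViT patch embeddings) to the text hidden size."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_feats):      # [batch, num_patches, vision_dim]
        return self.proj(vision_feats)    # [batch, num_patches, text_dim]
```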
---
### 4๏ธโƒฃ **Dari Qwen (Alibaba)**
Long context optimization:
```python
โœ“ YARN Scaling # Yet Another RoPE extensioN
โœ“ Dynamic Position Scaling # Auto-adjust untuk sequence lebih panjang
โœ“ Sliding Window Attention # Local attention pattern
โœ“ Context Window 8K-128K # Flexible context length
```
**YARN vs. standard RoPE (a RoPE-scaling sketch follows below):**
```
Standard RoPE: [====] 4K context → [====????] 8K (error grows)
YARN:          [====] 4K context → [========] 8K (smooth extrapolation)
```
**Sliding-window mechanism:**
```
Token 0:  attends to [0]
Token 1:  attends to [0, 1]
Token 2:  attends to [0, 1, 2]
Token 10: attends to [0, 6, 7, 8, 9, 10]  ← sliding window = 4
          (the attention sink at token 0 is kept)
```
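
A runnable sketch of RoPE frequency scaling under the simple "linear" scheme; full YARN additionally applies a per-frequency-band correction, which is omitted here, and the function name is illustrative:

```python
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10_000.0,
                     scale: float = 1.0):
    """RoPE angle table; scale > 1 stretches positions (linear interpolation)."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(max_pos).float() / scale   # compress positions
    angles = torch.outer(positions, inv_freq)           # [max_pos, head_dim/2]
    return angles.cos(), angles.sin()

# Doubling the context of a model trained at 4K: scale = 8192 / 4096 = 2.0
cos, sin = rope_frequencies(head_dim=32, max_pos=8192, scale=2.0)
```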
---
### 5๏ธโƒฃ **Dari Gemma (Google)**
Optimization techniques:
```python
โœ“ Layer Scale # Learnable scaling per layer
โœ“ Stochastic Depth # Random layer dropping saat training
โœ“ Normalized Attention # QK normalization untuk stabilitas
โœ“ Knowledge Distillation # Transfer knowledge dari model besar
```
**Layer Scale formula:**
```python
output = input + gamma * layer(input)
# gamma is initialized very small (1e-5) and then learned
```
**Stochastic Depth:**
- Training: each layer is skipped with 20% probability (drop_prob=0.2)
- Inference: all layers are active
- Benefit: **regularization** + **faster training** (see the sketch below)
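
A minimal sketch of stochastic depth as described above (also known as DropPath); it assumes a `[batch, seq, hidden]` branch output and the names are illustrative:

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Randomly skip a residual branch during training."""
    def __init__(self, drop_prob: float = 0.2):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch_output: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return branch_output        # inference: the layer always applies
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, rescaled to keep the expectation.
        mask = (torch.rand(branch_output.shape[0], 1, 1,
                           device=branch_output.device) < keep_prob).float()
        return branch_output * mask / keep_prob
```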
---
## 🆕 Experimental & Unique Features
### A) **Mixture of Depths (MoD)**
Tokens can "skip" certain layers for efficiency:
```python
class MixtureOfDepthsRouter:
    # Process only the top 50% most "important" tokens
    capacity_factor = 0.5
    # Method: learned, random, or heuristic
    route_method = "learned"
```
**Illustration:**
```
Layer 1: [All 100 tokens processed]
Layer 2: [Top 50 tokens processed, 50 skipped] ← MoD
Layer 3: [All 100 tokens processed]
Layer 4: [Top 50 tokens processed, 50 skipped] ← MoD
```
**Benefit:**
- **30-40% faster inference** with minimal accuracy drop
- Dynamic computation based on token importance
**Paper:** [Mixture-of-Depths (2024)](https://arxiv.org/abs/2404.02258)
---
### B) **Attention Sinks**
Keep the initial tokens always attended, for stability:
```python
attention_sink_size = 4 # Keep first 4 tokens
attention_sink_window = 512 # Sliding window size
```
**Attention Pattern:**
```
Query Token 1000:
├─ Attend to: [0, 1, 2, 3]           ← attention sinks (always)
└─ Attend to: [488, 489, ..., 1000]  ← sliding window
```
**Benefit:**
- Prevents attention collapse in long sequences (see the mask sketch below)
- Better streaming generation
- Inspired by [StreamingLLM (2023)](https://arxiv.org/abs/2309.17453)
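
A minimal sketch of the combined sink + sliding-window mask shown in the pattern above (a boolean mask where True means "may attend"; names are illustrative):

```python
import torch

def sink_window_mask(seq_len: int, sink_size: int = 4, window: int = 512):
    """Causal mask that keeps the first `sink_size` tokens always visible
    and otherwise restricts attention to a sliding window."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    in_window = (q - k) < window
    is_sink = k < sink_size
    return causal & (in_window | is_sink)    # [seq_len, seq_len] bool

mask = sink_window_mask(seq_len=1024, sink_size=4, window=512)
```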
---
### C) **Expert Choice Routing**
An alternative to Top-K routing:
```python
# Top-K: tokens pick experts
Token → Router → "I want Expert 2 and Expert 5"
# Expert Choice: experts pick tokens
Expert 1 → "I'll process Tokens 3, 7, 12, ..."
Expert 2 → "I'll process Tokens 1, 5, 9, ..."
```
**Advantages:**
- **Better load balancing** (every expert processes the same number of tokens)
- **More stable training** (no expert collapse)
- Trade-off: slightly more complex to implement; see the sketch below
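
A minimal sketch of expert-choice selection, where each expert takes its own top tokens; shapes and names are illustrative:

```python
import torch

def expert_choice_routing(logits: torch.Tensor, capacity: int):
    """Each expert selects its own top-`capacity` tokens.

    logits: [num_tokens, num_experts] router scores.
    Returns weights and token indices, each [num_experts, capacity].
    """
    probs = logits.softmax(dim=-1)          # token → expert affinities
    scores = probs.transpose(0, 1)          # [num_experts, num_tokens]
    weights, token_idx = scores.topk(capacity, dim=-1)
    return weights, token_idx

# Example: 16 tokens, 4 experts; capacity = k * tokens / experts = 2*16/4 = 8
logits = torch.randn(16, 4)
w, idx = expert_choice_routing(logits, capacity=8)
```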
---
### D) **Multi-Backend Attention**
Automatic fallback for compatibility:
```python
if HAS_FLASH_ATTN and device == "cuda":
    out = flash_attn_func(q, k, v)                 # ← fastest (2-4x speedup)
elif HAS_XFORMERS and device == "cuda":
    out = memory_efficient_attention(q, k, v)      # ← fallback 1
elif HAS_SDPA:
    out = F.scaled_dot_product_attention(q, k, v)  # ← fallback 2 (PyTorch 2.0+)
else:
    out = standard_attention(q, k, v)              # ← safe fallback
```
**Performance comparison** (a runnable fallback sketch follows below):
```
Flash Attention: 100ms (baseline)
xFormers: 150ms (1.5x slower)
SDPA: 180ms (1.8x slower)
Standard: 400ms (4x slower)
```
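
A runnable sketch of this fallback chain using only calls known to exist (`flash_attn.flash_attn_func` and PyTorch's SDPA; the xFormers step is omitted for brevity). It assumes q/k/v shaped `[batch, heads, seq, head_dim]`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func   # optional dependency
except ImportError:
    flash_attn_func = None

def attention(q, k, v, causal: bool = True):
    """Pick the fastest available attention backend."""
    if flash_attn_func is not None and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash-attn expects [batch, seq, heads, head_dim]
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    if hasattr(F, "scaled_dot_product_attention"):   # PyTorch 2.0+
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    # Safe fallback: standard attention in pure PyTorch.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    if causal:
        n = q.shape[-2]
        scores = scores.masked_fill(
            torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1),
            float("-inf"))
    return scores.softmax(dim=-1) @ v
```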
---
## ๐Ÿ—๏ธ CACA Model Family
| Model | Parameters | Vocab Size | Hidden Size | Intermediate Size | Layers | Attention Heads | KV Heads | Head Dim | Max Position |
|-------|------------|------------|-------------|-------------------|--------|-----------------|----------|----------|--------------|
| caca-1M-untrained | 2.50M | 8,000 | 128 | 512 | 6 | 4 | 2 | 32 | 1,024 |
| caca-3M-untrained | 6.63M | 12,000 | 192 | 768 | 8 | 6 | 2 | 32 | 2,048 |
| caca-4M-untrained | 4.02M | 16,000 | 128 | 512 | 8 | 4 | 2 | 32 | 2,048 |
| caca-6M-untrained | 11.96M | 16,000 | 256 | 1024 | 8 | 4 | 2 | 64 | 2,048 |
| caca-10M-untrained | 21.25M | 20,000 | 320 | 1280 | 10 | 8 | 2 | 40 | 2,048 |
| caca-15M-untrained | 35.18M | 24,000 | 384 | 1536 | 12 | 6 | 2 | 64 | 2,048 |
| caca-25M-untrained | 67.57M | 28,000 | 512 | 2048 | 14 | 8 | 2 | 64 | 4,096 |
| caca-35M-untrained | 95.42M | 32,000 | 576 | 2304 | 16 | 8 | 2 | 72 | 4,096 |
| caca-50M-untrained | 138.47M | 32,000 | 640 | 2560 | 20 | 10 | 2 | 64 | 4,096 |
| caca-75M-untrained | 178.55M | 32,000 | 768 | 3072 | 18 | 12 | 3 | 64 | 4,096 |
| caca-100M-untrained | 232.23M | 32,000 | 768 | 3072 | 24 | 12 | 4 | 64 | 4,096 |
| caca-150M-untrained | 336.90M | 32,000 | 1024 | 4096 | 20 | 16 | 4 | 64 | 4,096 |
| caca-200M-untrained | 458.55M | 32,000 | 1024 | 4096 | 28 | 16 | 4 | 64 | 4,096 |
| caca-250M-untrained | 569.54M | 32,000 | 1152 | 4608 | 28 | 18 | 3 | 64 | 8,192 |
| caca-300M-untrained | 701.64M | 32,000 | 1280 | 5120 | 28 | 20 | 4 | 64 | 8,192 |
| caca-400M-untrained | 956.36M | 32,000 | 1408 | 5632 | 32 | 22 | 4 | 64 | 8,192 |
| caca-500M-untrained | 1.27B | 32,000 | 1536 | 6144 | 36 | 24 | 4 | 64 | 8,192 |
| caca-600M-untrained | 1.48B | 32,000 | 1664 | 6656 | 36 | 26 | 4 | 64 | 8,192 |
| caca-700M-untrained | 1.71B | 32,000 | 1792 | 7168 | 36 | 28 | 4 | 64 | 8,192 |
| caca-800M-untrained | 1.96B | 32,000 | 1920 | 7680 | 36 | 30 | 5 | 64 | 8,192 |
| caca-900M-untrained | 2.01B | 32,000 | 2048 | 8192 | 32 | 32 | 8 | 64 | 8,192 |
| caca-1B-untrained | 2.26B | 32,000 | 2048 | 8192 | 36 | 32 | 8 | 64 | 8,192 |
| caca-1.5B-untrained | 2.98B | 32,000 | 2048 | 8192 | 48 | 32 | 8 | 64 | 8,192 |
| caca-2B-untrained | 3.15B | 32,000 | 2304 | 9216 | 40 | 32 | 8 | 72 | 8,192 |
| caca-2.5B-untrained | 3.12B | 32,000 | 2560 | 10240 | 32 | 32 | 8 | 80 | 8,192 |
| caca-3B-untrained | 3.88B | 32,000 | 2560 | 10240 | 40 | 32 | 8 | 80 | 8,192 |
| caca-3.5B-untrained | 4.69B | 32,000 | 2816 | 11264 | 40 | 32 | 8 | 88 | 8,192 |
| caca-4B-untrained | 5.02B | 32,000 | 3072 | 12288 | 36 | 32 | 8 | 96 | 8,192 |
| caca-4.5B-untrained | 5.45B | 32,000 | 3200 | 12800 | 36 | 32 | 8 | 100 | 8,192 |
| caca-5B-untrained | 6.53B | 32,000 | 3328 | 13312 | 40 | 32 | 8 | 104 | 8,192 |
| caca-6B-untrained | 8.31B | 32,000 | 3584 | 14336 | 44 | 32 | 8 | 112 | 8,192 |
| caca-7B-untrained | 7.11B | 32,000 | 4096 | 14336 | 32 | 32 | 8 | 128 | 8,192 |
| caca-8B-untrained | 7.98B | 32,000 | 4096 | 14336 | 36 | 32 | 8 | 128 | 8,192 |
| caca-9B-untrained | 9.09B | 32,000 | 4608 | 16384 | 32 | 36 | 9 | 128 | 8,192 |
| caca-10B-untrained | 11.23B | 32,000 | 4608 | 18432 | 36 | 32 | 8 | 144 | 8,192 |
| caca-12B-untrained | 15.26B | 32,000 | 5120 | 20480 | 40 | 40 | 8 | 128 | 8,192 |
| caca-13B-untrained | 13.38B | 32,000 | 5120 | 13824 | 48 | 40 | 8 | 128 | 8,192 |
| caca-14B-untrained | 13.40B | 32,000 | 5376 | 14464 | 44 | 48 | 8 | 112 | 8,192 |
| caca-15B-untrained | 14.90B | 32,000 | 5632 | 15104 | 44 | 32 | 8 | 176 | 8,192 |
| caca-18B-untrained | 18.92B | 32,000 | 6144 | 16384 | 48 | 48 | 8 | 128 | 8,192 |
| caca-20B-untrained | 20.48B | 32,000 | 6144 | 16384 | 52 | 48 | 8 | 128 | 8,192 |
| caca-24B-untrained | 25.83B | 32,000 | 6656 | 17920 | 56 | 64 | 8 | 104 | 8,192 |
| caca-30B-untrained | 32.24B | 32,000 | 6656 | 17920 | 70 | 64 | 8 | 104 | 8,192 |
| caca-35B-untrained | 39.02B | 32,000 | 8192 | 22016 | 56 | 64 | 8 | 128 | 8,192 |
| caca-40B-untrained | 44.56B | 32,000 | 8192 | 22016 | 64 | 64 | 8 | 128 | 8,192 |
| caca-45B-untrained | 50.09B | 32,000 | 8192 | 22016 | 72 | 64 | 8 | 128 | 8,192 |
| caca-50B-untrained | 55.63B | 32,000 | 8192 | 22016 | 80 | 64 | 8 | 128 | 8,192 |
| caca-60B-untrained | 72.14B | 32,000 | 8192 | 28672 | 84 | 64 | 8 | 128 | 8,192 |
| caca-70B-untrained | 68.71B | 32,000 | 8192 | 28672 | 80 | 64 | 8 | 128 | 8,192 |
| caca-80B-untrained | 101.77B | 32,000 | 9216 | 36864 | 84 | 72 | 8 | 128 | 8,192 |
| caca-100B-untrained | 137.32B | 32,000 | 10240 | 40960 | 92 | 80 | 8 | 128 | 8,192 |
| caca-120B-untrained | 173.10B | 32,000 | 11264 | 45056 | 96 | 88 | 8 | 128 | 8,192 |
| caca-150B-untrained | 214.31B | 32,000 | 12288 | 49152 | 100 | 96 | 8 | 128 | 8,192 |
| caca-175B-untrained | 248.53B | 32,000 | 12288 | 49152 | 116 | 96 | 8 | 128 | 8,192 |
| caca-200B-untrained | 324.80B | 128,000 | 14336 | 57344 | 110 | 112 | 16 | 128 | 16,384 |
| caca-250B-untrained | 419.35B | 128,000 | 15360 | 61440 | 124 | 120 | 16 | 128 | 16,384 |
| caca-300B-untrained | 507.03B | 128,000 | 16384 | 65536 | 132 | 128 | 16 | 128 | 16,384 |
| caca-350B-untrained | 591.18B | 128,000 | 16384 | 65536 | 154 | 128 | 16 | 128 | 16,384 |
| caca-400B-untrained | 675.34B | 128,000 | 16384 | 65536 | 176 | 128 | 16 | 128 | 16,384 |
| caca-500B-untrained | 852.77B | 128,000 | 18432 | 73728 | 176 | 144 | 16 | 128 | 16,384 |
| caca-600B-untrained | 1.07T | 128,000 | 20480 | 81920 | 180 | 160 | 16 | 128 | 16,384 |
| caca-700B-untrained | 1.23T | 128,000 | 21504 | 86016 | 186 | 168 | 24 | 128 | 16,384 |
| caca-800B-untrained | 1.38T | 128,000 | 22528 | 90112 | 192 | 176 | 16 | 128 | 16,384 |
| caca-900B-untrained | 1.65T | 128,000 | 24576 | 94208 | 198 | 192 | 24 | 128 | 16,384 |
| caca-1T-untrained | 1.75T | 128,000 | 24576 | 98304 | 204 | 192 | 16 | 128 | 16,384 |
---
## 💾 Memory Requirements
### Training Requirements
<table>
<tr>
<th>Configuration</th>
<th>Model Weights</th>
<th>+ Optimizer States</th>
<th>Total Training</th>
</tr>
<tr>
<td><strong>FP32 (AdamW)</strong></td>
<td>0.01 GB</td>
<td>+0.04 GB</td>
<td><strong>0.06 GB</strong></td>
</tr>
<tr>
<td><strong>Mixed Precision</strong></td>
<td>0.01 GB</td>
<td>+0.05 GB</td>
<td><strong>0.06 GB</strong></td>
</tr>
<tr>
<td><strong>+ Gradient Checkpointing</strong></td>
<td colspan="2">Saves ~30-50% of activation memory</td>
<td><strong>~0.03 GB</strong></td>
</tr>
</table>
### Inference Requirements
<table>
<tr>
<th>Precision</th>
<th>Model Size</th>
<th>KV Cache (2K ctx)</th>
<th>Total Memory</th>
<th>Memory Saving</th>
</tr>
<tr>
<td><strong>FP16 / BF16</strong></td>
<td>0.01 GB</td>
<td>0.00 GB</td>
<td><strong>0.01 GB</strong></td>
<td>Baseline</td>
</tr>
<tr>
<td><strong>INT8</strong></td>
<td>0.00 GB</td>
<td>0.00 GB</td>
<td><strong>0.01 GB</strong></td>
<td>~50% ↓</td>
</tr>
<tr>
<td><strong>INT4 (NF4)</strong></td>
<td>0.00 GB</td>
<td>0.00 GB</td>
<td><strong>0.00 GB</strong></td>
<td>~75% ↓</td>
</tr>
</table>
> 💡 **Note**: The KV cache grows linearly with sequence length. For an 8K context, multiply the KV-cache values by 4 (see the helper below).
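
A quick back-of-the-envelope helper for that scaling, using this model's GQA layout (2 KV heads × 32 dims × 6 layers); the function itself is illustrative:

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 6, num_kv_heads: int = 2,
                   head_dim: int = 32, bytes_per_param: int = 2) -> int:
    """KV-cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

print(kv_cache_bytes(2048) / 1e6, "MB")   # ~3.1 MB at FP16 for a 2K context
```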
### Performance Estimates
<table>
<tr>
<th>Metric</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td><strong>FLOPs per Token</strong></td>
<td>7,049,216</td>
<td>Forward pass only</td>
</tr>
<tr>
<td><strong>TFLOPs per Token</strong></td>
<td>0.0000</td>
<td>Forward only; ≈ 3× when including the backward pass</td>
</tr>
<tr>
<td><strong>Bandwidth (FP16)</strong></td>
<td>0.01 GB/token</td>
<td>Memory bandwidth requirement</td>
</tr>
</table>
---
### ๐Ÿ“ Struktur Arsitektur Lengkap
<details>
<summary><b>๐Ÿ” Klik untuk lihat detail arsitektur</b></summary>
```
CACA Architecture
│
├─── 📥 INPUT PROCESSING
│    │
│    ├─── Text Input
│    │    ├─── Tokenization (BPE/WordPiece/SentencePiece)
│    │    ├─── Token Embeddings (vocab_size × hidden_size)
│    │    └─── Output: [batch_size, seq_len, hidden_size]
│    │
│    ├─── Vision Input (Optional)
│    │    ├─── Image Preprocessing (resize to 224×224)
│    │    ├─── Vision Encoder (ViT)
│    │    │    ├─── Patch Embedding (Conv2D: 14×14 patches)
│    │    │    ├─── CLS Token + Positional Embeddings
│    │    │    ├─── Vision Transformer Blocks (24 layers)
│    │    │    │    ├─── LayerNorm
│    │    │    │    ├─── Multi-Head Attention
│    │    │    │    ├─── MLP (GELU activation)
│    │    │    │    └─── Residual Connections
│    │    │    └─── Final LayerNorm
│    │    ├─── Vision Projector
│    │    │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│    │    │    └─── Output: [batch_size, num_patches, hidden_size]
│    │    └─── Output: Vision embeddings aligned to the text space
│    │
│    └─── Audio Input (Optional)
│         ├─── Audio Preprocessing (Mel-spectrogram, 80 bins)
│         ├─── Audio Encoder
│         │    ├─── Conv1D Layers (feature extraction)
│         │    │    ├─── Conv1D (80 → hidden_size, kernel=3)
│         │    │    └─── Conv1D (stride=2 for downsampling)
│         │    ├─── Positional Embeddings (interpolated)
│         │    ├─── Audio Transformer Blocks (12 layers)
│         │    │    ├─── LayerNorm
│         │    │    ├─── Multi-Head Attention
│         │    │    ├─── MLP (GELU activation)
│         │    │    └─── Residual Connections
│         │    └─── Final LayerNorm
│         ├─── Audio Projector
│         │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│         │    └─── Output: [batch_size, audio_len, hidden_size]
│         └─── Output: Audio embeddings aligned to the text space
│
├─── 🔄 MULTIMODAL FUSION
│    │
│    ├─── Early Fusion (when not using Cross-Attention)
│    │    ├─── Concatenate: [vision_tokens + audio_tokens + text_tokens]
│    │    ├─── Update attention mask
│    │    └─── Output: combined sequence for the decoder
│    │
│    └─── Late Fusion (when using Cross-Attention)
│         ├─── Text tokens → queries for cross-attention
│         ├─── Vision+Audio tokens → keys/values for cross-attention
│         └─── Fusion happens inside the decoder layers
│
├─── 🏗️ DECODER STACK (N layers; N=6 in this model)
│    │
│    └─── 🔁 DECODER LAYER i (repeated N times)
│         │
│         ├─── [OPTIONAL] Mixture of Depths (MoD)
│         │    ├─── Input: Hidden states [batch, seq_len, hidden]
│         │    ├─── MoD Router
│         │    │    ├─── Method: learned / random / heuristic
│         │    │    ├─── Score computation per token
│         │    │    └─── Top-K selection (K = capacity_factor × seq_len)
│         │    ├─── Process Mask Generation
│         │    │    └─── Binary mask [batch, seq_len] (1=process, 0=skip)
│         │    └─── Token Selection
│         │         ├─── Selected tokens: processed through the layer
│         │         └─── Skipped tokens: bypass the layer (identity)
│         │
│         ├─── 🎯 SELF-ATTENTION PATH
│         │    │
│         │    ├─── Input Normalization
│         │    │    ├─── RMSNorm (Root Mean Square Layer Normalization)
│         │    │    ├─── Formula: x * rsqrt(mean(x²) + ε) * γ
│         │    │    └─── More efficient than LayerNorm (no mean centering)
│         │    │
│         │    ├─── Attention Computation
│         │    │    │
│         │    │    ├─── Query/Key/Value Projections
│         │    │    │    ├─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    ├─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── Reshape: [batch, seq, heads, head_dim]
│         │    │    │
│         │    │    ├─── [OPTIONAL] QK Normalization
│         │    │    │    ├─── Q = RMSNorm(Q)
│         │    │    │    └─── K = RMSNorm(K)
│         │    │    │
│         │    │    ├─── Rotary Position Embeddings (RoPE)
│         │    │    │    ├─── Compute frequencies: θ_i = base^(-2i/dim)
│         │    │    │    ├─── Position indices: t ∈ [0, seq_len)
│         │    │    │    ├─── Rotation matrix: cos(t·θ), sin(t·θ)
│         │    │    │    ├─── Apply rotation: Q, K = rotate(Q, K, cos, sin)
│         │    │    │    └─── YARN Scaling (when enabled)
│         │    │    │         ├─── Type: linear / dynamic / yarn
│         │    │    │         ├─── Scaling factor per frequency band
│         │    │    │         └─── Better extrapolation to long contexts
│         │    │    │
│         │    │    ├─── Grouped-Query Attention (GQA)
│         │    │    │    ├─── num_kv_groups = num_heads / num_kv_heads
│         │    │    │    ├─── Repeat K, V: [num_kv_heads → num_heads]
│         │    │    │    └─── Memory saving: 30-40% vs full MHA
│         │    │    │
│         │    │    ├─── Attention Score Computation
│         │    │    │    ├─── scores = (Q @ K.T) / sqrt(head_dim)
│         │    │    │    ├─── Logit clamping: [-50, 50] for stability
│         │    │    │    └─── [OPTIONAL] Soft-capping
│         │    │    │         └─── scores = tanh(scores / cap) * cap
│         │    │    │
│         │    │    ├─── Attention Masking
│         │    │    │    ├─── Causal Mask (autoregressive)
│         │    │    │    ├─── Sliding Window Mask (when enabled)
│         │    │    │    │    ├─── Window size (e.g., 512 tokens)
│         │    │    │    │    └─── Attend only within the nearest window
│         │    │    │    ├─── Attention Sinks (when enabled)
│         │    │    │    │    ├─── Always attend to the first K tokens
│         │    │    │    │    ├─── Prevents attention collapse
│         │    │    │    │    └─── Better streaming generation
│         │    │    │    └─── [OPTIONAL] ALiBi Bias
│         │    │    │         ├─── Linear bias based on distance
│         │    │    │         └─── Alternative/complement to RoPE
│         │    │    │
│         │    │    ├─── Backend Selection (automatic fallback)
│         │    │    │    ├─── 1️⃣ Flash Attention 2 (PREFERRED)
│         │    │    │    │    ├─── Requirements: CUDA + FP16/BF16
│         │    │    │    │    ├─── Speedup: 2-4x faster
│         │    │    │    │    ├─── Memory: 10-20x less
│         │    │    │    │    ├─── Sliding window support
│         │    │    │    │    └─── IO-aware algorithm
│         │    │    │    ├─── 2️⃣ xFormers Memory Efficient (FALLBACK 1)
│         │    │    │    │    ├─── Requirements: CUDA
│         │    │    │    │    ├─── Block-sparse attention
│         │    │    │    │    └─── Custom attention patterns
│         │    │    │    ├─── 3️⃣ PyTorch SDPA (FALLBACK 2)
│         │    │    │    │    ├─── Requirements: PyTorch 2.0+
│         │    │    │    │    ├─── Built-in scaled_dot_product_attention
│         │    │    │    │    └─── Hardware-agnostic
│         │    │    │    └─── 4️⃣ Standard Attention (SAFE FALLBACK)
│         │    │    │         ├─── Pure PyTorch implementation
│         │    │    │         ├─── Always available
│         │    │    │         └─── Slower but stable
│         │    │    │
│         │    │    ├─── Softmax + Dropout
│         │    │    │    ├─── attn_weights = softmax(scores, dim=-1)
│         │    │    │    └─── attn_weights = dropout(attn_weights)
│         │    │    │
│         │    │    ├─── Value Aggregation
│         │    │    │    ├─── output = attn_weights @ V
│         │    │    │    └─── Reshape: [batch, seq, num_heads × head_dim]
│         │    │    │
│         │    │    └─── Output Projection
│         │    │         ├─── O: Linear(num_heads × head_dim → hidden_size)
│         │    │         └─── Output: [batch, seq, hidden_size]
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    │    ├─── Learnable per-layer scaling: γ
│         │    │    ├─── Initialize: γ = 1e-5 (very small)
│         │    │    ├─── output = γ * output
│         │    │    └─── Improves training stability
│         │    │
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    │    ├─── Training: random layer dropping
│         │    │    ├─── drop_prob = layer_idx / num_layers × base_prob
│         │    │    ├─── if random() > drop_prob: return output
│         │    │    ├─── else: return 0
│         │    │    └─── Inference: always apply (no dropping)
│         │    │
│         │    ├─── Residual Dropout
│         │    │    └─── output = dropout(output)
│         │    │
│         │    └─── Residual Connection
│         │         ├─── hidden_states = hidden_states + output
│         │         └─── [Training] Gradient clipping: [-1e4, 1e4]
│         │
│         ├─── 🌐 [OPTIONAL] CROSS-ATTENTION PATH (for multimodal)
│         │    │
│         │    ├─── Conditional: only when encoder_hidden_states != None
│         │    ├─── Frequency: every cross_attention_frequency layers
│         │    │
│         │    ├─── Input Normalization
│         │    │    └─── RMSNorm(hidden_states)
│         │    │
│         │    ├─── Cross-Attention Computation
│         │    │    ├─── Query: from the text hidden states
│         │    │    │    └─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    ├─── Key/Value: from encoder_hidden_states (vision+audio)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    ├─── Attention: Q @ K.T / sqrt(head_dim)
│         │    │    ├─── Softmax + Dropout
│         │    │    ├─── Output: attn_weights @ V
│         │    │    └─── Output Projection
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    ├─── Residual Dropout
│         │    └─── Residual Connection
│         │         └─── hidden_states = hidden_states + cross_attn_output
│         │
│         └─── 🔮 FEED-FORWARD PATH
│              │
│              ├─── Input Normalization
│              │    └─── RMSNorm(hidden_states)
│              │
│              ├─── Feed-Forward Network
│              │    │
│              │    ├─── ━━━━━ STANDARD MLP ━━━━━
│              │    │    │
│              │    │    ├─── Gate Projection
│              │    │    │    ├─── gate: Linear(hidden_size → intermediate_size)
│              │    │    │    └─── Typical: intermediate_size = 4 × hidden_size
│              │    │    │
│              │    │    ├─── Up Projection
│              │    │    │    └─── up: Linear(hidden_size → intermediate_size)
│              │    │    │
│              │    │    ├─── SwiGLU Activation
│              │    │    │    ├─── gate = silu(gate)   # Swish activation
│              │    │    │    ├─── hidden = gate * up  # Gating mechanism
│              │    │    │    └─── Formula: silu(x) = x * sigmoid(x)
│              │    │    │
│              │    │    ├─── Dropout
│              │    │    │    └─── hidden = dropout(hidden)
│              │    │    │
│              │    │    └─── Down Projection
│              │    │         ├─── down: Linear(intermediate_size → hidden_size)
│              │    │         └─── Output: [batch, seq, hidden_size]
│              │    │
│              │    └─── ━━━━━ MIXTURE OF EXPERTS (MoE) ━━━━━
│              │         │
│              │         ├─── Conditional: use_moe AND (layer_idx % moe_frequency == 0)
│              │         │
│              │         ├─── Router Network
│              │         │    │
│              │         │    ├─── Router Type Selection
│              │         │    │    ├─── Top-K Router (default)
│              │         │    │    └─── Expert Choice Router (alternative)
│              │         │    │
│              │         │    ├─── ━━━ TOP-K ROUTER ━━━
│              │         │    │    │
│              │         │    │    ├─── Gate Normalization
│              │         │    │    │    └─── hidden = LayerNorm(hidden)
│              │         │    │    │
│              │         │    │    ├─── Router Logits
│              │         │    │    │    ├─── logits: Linear(hidden_size → num_experts)
│              │         │    │    │    ├─── Clamping: [-20, 20]
│              │         │    │    │    └─── Temperature scaling: logits / temp
│              │         │    │    │
│              │         │    │    ├─── [Training] Jitter Noise
│              │         │    │    │    ├─── noise = randn_like(logits) × 0.01
│              │         │    │    │    └─── logits = logits + noise
│              │         │    │    │
│              │         │    │    ├─── Routing Weights
│              │         │    │    │    ├─── weights = softmax(logits)
│              │         │    │    │    └─── top_k_weights, top_k_indices = topk(weights, k)
│              │         │    │    │
│              │         │    │    ├─── Weight Normalization
│              │         │    │    │    └─── top_k_weights = top_k_weights / sum(top_k_weights)
│              │         │    │    │
│              │         │    │    └─── Loss Computation
│              │         │    │         ├─── Auxiliary Loss (load balancing)
│              │         │    │         │    ├─── expert_usage = mean(weights, dim=0)
│              │         │    │         │    ├─── mean_usage = mean(expert_usage)
│              │         │    │         │    └─── aux_loss = std(expert_usage) / mean_usage
│              │         │    │         ├─── Z-Loss (router stability)
│              │         │    │         │    ├─── z_loss = mean(logsumexp(logits)²)
│              │         │    │         │    └─── Prevents logit explosion
│              │         │    │         └─── Entropy Loss (diversity)
│              │         │    │              └─── entropy_loss = -mean(weights × log(weights))
│              │         │    │
│              │         │    └─── ━━━ EXPERT CHOICE ROUTER ━━━
│              │         │         │
│              │         │         ├─── Router Logits
│              │         │         │    └─── logits: Linear(hidden → num_experts)
│              │         │         │
│              │         │         ├─── Expert-wise Token Selection
│              │         │         │    ├─── Transpose: [batch×seq, experts]
│              │         │         │    ├─── capacity = expert_choice_k × total_tokens / num_experts
│              │         │         │    ├─── Per expert: topk(logits, k=capacity)
│              │         │         │    └─── Expert mask: [experts, batch×seq]
│              │         │         │
│              │         │         └─── Routing weights from the mask
│              │         │
│              │         ├─── Expert Networks (N experts, e.g., N=8)
│              │         │    │
│              │         │    └─── Expert i (i = 0 to N-1)
│              │         │         ├─── Same structure as the Standard MLP
│              │         │         ├─── gate_proj: Linear(hidden → intermediate)
│              │         │         ├─── up_proj: Linear(hidden → intermediate)
│              │         │         ├─── SwiGLU activation
│              │         │         ├─── Dropout
│              │         │         └─── down_proj: Linear(intermediate → hidden)
│              │         │
│              │         ├─── Expert Execution
│              │         │    │
│              │         │    ├─── For each expert:
│              │         │    │    ├─── Get the tokens routed to this expert
│              │         │    │    ├─── If no tokens: skip
│              │         │    │    ├─── Run the expert's forward pass
│              │         │    │    ├─── [Training] Track expert usage
│              │         │    │    └─── [Safety] NaN/Inf detection
│              │         │    │
│              │         │    └─── Combine Expert Outputs
│              │         │         ├─── Weighted sum by the router weights
│              │         │         └─── final_output = Σ(weight_i × expert_i(x))
│              │         │
│              │         └─── Output: [batch, seq, hidden_size]
│              │
│              ├─── [OPTIONAL] Layer Scale
│              │    └─── output = γ * output
│              │
│              ├─── [OPTIONAL] Stochastic Depth
│              │    └─── Probabilistic dropping (training only)
│              │
│              ├─── Residual Dropout
│              │    └─── output = dropout(output)
│              │
│              └─── Residual Connection
│                   ├─── hidden_states = hidden_states + output
│                   └─── [Training] Gradient clipping: [-1e4, 1e4]
│
├─── 📤 OUTPUT HEAD
│    │
│    ├─── Final Normalization
│    │    ├─── RMSNorm(hidden_states)
│    │    └─── Output: [batch, seq, hidden_size]
│    │
│    ├─── Language Modeling Head
│    │    ├─── Linear Projection
│    │    │    ├─── lm_head: Linear(hidden_size → vocab_size, bias=False)
│    │    │    └─── Output: [batch, seq, vocab_size]
│    │    │
│    │    └─── [OPTIONAL] Logit Soft-Capping
│    │         ├─── Clamp extreme values: [-cap×0.99, cap×0.99]
│    │         ├─── Formula: tanh(logits / cap) × cap
│    │         ├─── Prevents numerical instability
│    │         └─── Typical cap value: 30.0
│    │
│    └─── Output: Logits [batch, seq, vocab_size]
│
├─── 📉 LOSS COMPUTATION (Training Only)
│    │
│    ├─── Shift for Autoregression
│    │    ├─── shift_logits = logits[:, :-1, :]
│    │    └─── shift_labels = labels[:, 1:]
│    │
│    ├─── Language Modeling Loss
│    │    ├─── CrossEntropyLoss(ignore_index=-100)
│    │    ├─── [OPTIONAL] Label Smoothing
│    │    │    └─── Reduces overconfidence
│    │    └─── lm_loss = CE(shift_logits, shift_labels)
│    │
│    ├─── [OPTIONAL] MoE Auxiliary Losses
│    │    ├─── Router Auxiliary Loss (load balancing)
│    │    │    └─── aux_loss × router_aux_loss_coef (default: 0.01)
│    │    ├─── Router Z-Loss (stability)
│    │    │    └─── z_loss × router_z_loss_coef (default: 0.001)
│    │    └─── Summed across all MoE layers
│    │
│    └─── Total Loss
│         └─── total = lm_loss + aux_losses
│
├─── 📊 MONITORING & METRICS
│    │
│    ├─── MetricsTracker
│    │    ├─── Loss tracking (LM, aux, z-loss)
│    │    ├─── Perplexity: exp(lm_loss)
│    │    ├─── Gradient norms per layer
│    │    ├─── GPU memory usage
│    │    ├─── Expert usage statistics
│    │    ├─── Attention cache hit rate
│    │    └─── Periodic summary & clearing
│    │
│    ├─── Gradient Monitoring
│    │    ├─── Max gradient norm per layer
│    │    ├─── Mean gradient norm (EMA)
│    │    ├─── Gradient clipping count
│    │    └─── NaN/Inf detection
│    │
│    └─── Memory Monitoring
│         ├─── GPU memory allocated
│         ├─── GPU memory reserved
│         ├─── Automatic cache clearing
│         └─── Per-layer memory checkpoints
│
└─── 🔧 OPTIMIZATION FEATURES
     │
     ├─── Gradient Checkpointing
     │    ├─── Trade-off: 30% slower, 50% less memory
     │    ├─── Recomputes activations during backward
     │    └─── Enable: model.gradient_checkpointing_enable()
     │
     ├─── Mixed Precision Training (AMP)
     │    ├─── FP16/BF16 forward pass
     │    ├─── FP32 master weights
     │    ├─── Dynamic loss scaling
     │    └─── 2x speedup, 50% memory reduction
     │
     ├─── Gradient Accumulation
     │    ├─── Simulates a larger batch size
     │    ├─── loss = loss / accumulation_steps
     │    └─── optimizer.step() every N steps
     │
     ├─── KV Cache (Inference)
     │    ├─── Caches Key/Value tensors
     │    ├─── Reused for autoregressive generation
     │    ├─── Memory: O(num_layers × seq_len × hidden_size)
     │    └─── Speedup: ~10x for long sequences
     │
     └─── Quantization Support
          ├─── 8-bit (LLM.int8)
          │    ├─── bitsandbytes integration
          │    ├─── Mixed precision (outliers in FP16)
          │    └─── 2x memory reduction
          └─── 4-bit (QLoRA)
               ├─── NF4 quantization (normal-float 4-bit)
               ├─── Double quantization
               ├─── BF16 compute dtype
               └─── 4x memory reduction

CacaForCausalLM (3.52M)
│
├─ Embedding: 8,000 × 128
│
├─ Transformer Layers (6x)
│  ├─ RMSNorm
│  ├─ Attention (GQA)
│  │  ├─ Q: 4 heads × 32 dim
│  │  ├─ KV: 2 heads × 32 dim
│  │  ├─ RoPE (θ=10,000)
│  │  └─ Flash Attention v2
│  ├─ Residual
│  ├─ RMSNorm
│  ├─ FFN (SwiGLU)
│  │  ├─ Gate: 128 → 512
│  │  ├─ Up: 128 → 512
│  │  └─ Down: 512 → 128
│  └─ Residual
│
├─ Final RMSNorm
└─ LM Head: 128 → 8,000
═══════════════════════════════════════════════════════════
📊 PARAMETER BREAKDOWN:
═══════════════════════════════════════════════════════════
Embeddings:           1,024,000  ( 29.1%)
Transformer Layers:   1,474,560  ( 41.8%)
├─ Attention:           294,912
└─ FFN:               1,179,648
LM Head & norms:      1,025,920  ( 29.1%)
Final Norm:                 128  (  0.0%)
───────────────────────────────────────────────────────────
TOTAL:                3,524,608  (100.0%)
═══════════════════════════════════════════════════════════
```
**Key Design Decisions:**
1. **GQA over MHA**: saves 50% of KV-cache memory with minimal accuracy loss
2. **SwiGLU over GELU**: ~10% better performance on language modeling
3. **RMSNorm over LayerNorm**: faster & more stable, with no bias term
4. **RoPE over learned embeddings**: better extrapolation to sequence lengths beyond training
5. **No bias in Linear layers**: follows modern LLM best practice (LLaMA-style)
</details>
---
## 📚 Documentation
### 📦 Installing Dependencies
```bash
# Core dependencies (REQUIRED)
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2 (3x speedup)
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # 4/8-bit quantization

# Optional: for monitoring & profiling
pip install tensorboard wandb                # Training monitoring
pip install gputil psutil                    # Resource monitoring
```
**Compatibility Matrix:**
| Component | Version | Note |
|-----------|---------|------|
| Python | 3.8 - 3.11 | 3.11 recommended |
| PyTorch | ≥ 2.0.0 | 2.1+ for optimal SDPA |
| CUDA | 11.8 / 12.1 | For Flash Attention |
| Transformers | ≥ 4.35.0 | For AutoModel support |
### Usage
#### 1️⃣ Basic Loading
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True
)

# Load the model (FP16 for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"  # automatic device placement
)

# This model is UNTRAINED - it needs training first!
print(f"Model loaded: {model.num_parameters():,} parameters")
print("⚠️ This model is untrained and cannot be used for inference yet")
```
#### 2๏ธโƒฃ Quantized Loading (4-bit/8-bit)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

print("Memory footprint: ~0.00GB (4-bit)")
```
#### 3๏ธโƒฃ Training Setup
```python
from transformers import TrainingArguments, Trainer
# Training configuration
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=2e-4,
max_steps=10000,
lr_scheduler_type="cosine",
warmup_steps=500,
logging_steps=10,
save_steps=500,
fp16=True, # Mixed precision
gradient_checkpointing=True, # Memory efficient
)
# Initialize trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
# Start training
trainer.train()
```
### Advanced Usage
#### Gradient Checkpointing (Memory Efficient)
```python
model.gradient_checkpointing_enable()
print("โœ… Gradient checkpointing enabled - saves ~40% memory")
```
#### Custom Training Loop
```python
import torch
from torch.optim import AdamW
from torch.cuda.amp import autocast, GradScaler
optimizer = AdamW(model.parameters(), lr=2e-4)
scaler = GradScaler()
for batch in dataloader:
# Mixed precision forward
with autocast(dtype=torch.bfloat16):
outputs = model(**batch)
loss = outputs.loss
# Backward with gradient scaling
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```
#### Multi-GPU Training (DDP)
```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
# Initialize process group
dist.init_process_group(backend="nccl")
# Wrap model
model = DistributedDataParallel(
model,
device_ids=[local_rank],
find_unused_parameters=False
)
```
---
## โš™๏ธ Konfigurasi Detail
### Full Configuration JSON
```json
{
"architectures": ["CacaForCausalLM"],
"model_type": "caca",
"vocab_size": 8000,
"hidden_size": 128,
"intermediate_size": 512,
"num_hidden_layers": 6,
"num_attention_heads": 4,
"num_key_value_heads": 2,
"head_dim": 32,
"max_position_embeddings": 1024,
"rope_theta": 10000,
"rms_norm_eps": 1e-06,
"use_cache": true,
"use_qk_norm": true,
"use_flash_attn": true,
"attention_dropout": 0.0,
"hidden_dropout": 0.1,
"torch_dtype": "float16"
}
```
### Custom Configuration
```python
from transformers import AutoConfig
# Load and modify the config
config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained")
# Custom modifications
config.max_position_embeddings = 16384 # Extend context
config.rope_scaling = {"type": "linear", "factor": 2.0}
config.use_flash_attn = True
config.hidden_dropout = 0.05
# Save custom config
config.save_pretrained("./custom_config")
```
---
## 🔬 Architecture
<details>
<summary><b>Layer Structure</b></summary>
**Input Tokens**
↓
**Embedding Layer** (vocab_size → hidden_size)
↓
**Decoder Block × N**
- RMSNorm
- Multi-Head Attention (GQA)
  - Flash Attention v2
  - Query heads, KV heads
  - RoPE position encoding
- Residual Connection
- RMSNorm
- Feed-Forward Network (SwiGLU)
  - Gate: hidden → intermediate
  - Up: hidden → intermediate
  - Down: intermediate → hidden
- Residual Connection
↓
**RMSNorm (Final)**
↓
**LM Head** (hidden → vocab_size)
↓
**Output Logits**
</details>
### Attention Mechanism (GQA)
```
Query:  [4 heads × 32 dim] = 128
Key:    [2 heads × 32 dim] = 64
Value:  [2 heads × 32 dim] = 64

Grouped Query Attention:
- Every 2 query heads share 1 KV head
- KV-cache memory: 50% smaller than Multi-Head Attention
- Quality close to MHA, speed close to MQA
```
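
For concreteness, a minimal PyTorch sketch of this K/V sharing; shapes follow the `[batch, heads, seq, head_dim]` convention and SDPA stands in for the attention kernel (illustrative, not the repository's code):

```python
import torch

def repeat_kv(kv: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Expand 2 KV heads to match 4 query heads (groups = 4 / 2 = 2)."""
    return kv.repeat_interleave(num_groups, dim=1)

batch, seq = 1, 16
q = torch.randn(batch, 4, seq, 32)   # 4 query heads × 32 dims
k = torch.randn(batch, 2, seq, 32)   # 2 KV heads × 32 dims
v = torch.randn(batch, 2, seq, 32)

k, v = repeat_kv(k, 2), repeat_kv(v, 2)   # now 4 heads each
attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(attn.shape)  # torch.Size([1, 4, 16, 32])
```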
### Feed-Forward Network (SwiGLU)
```
FFN(x) = (SiLU(x·W_gate) ⊙ x·W_up)·W_down

Where:
- W_gate: 128 × 512
- W_up:   128 × 512
- W_down: 512 × 128
- SiLU(x) = x · sigmoid(x)
- ⊙ = element-wise multiplication
```
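
The same FFN as a minimal PyTorch module, with this model's 128/512 sizes as defaults (an illustrative sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """FFN(x) = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down, as above."""
    def __init__(self, hidden: int = 128, intermediate: int = 512):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)  # W_gate
        self.up   = nn.Linear(hidden, intermediate, bias=False)  # W_up
        self.down = nn.Linear(intermediate, hidden, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU()
print(ffn(torch.randn(1, 16, 128)).shape)  # torch.Size([1, 16, 128])
```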
## 💬 Chat Format & Prompt Engineering
### 📝 Chat Template
The model supports a standard chat format for conversational AI:
```python
# Built-in chat template format
chat_template = """
{% for message in messages %}
{% if message['role'] == 'system' %}
System: {{ message['content'] }}
{% elif message['role'] == 'user' %}
User: {{ message['content'] }}
{% elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}Assistant:{% endif %}
"""
# Example usage
messages = [
{"role": "system", "content": "Kamu adalah asisten AI yang membantu dan ramah."},
{"role": "user", "content": "Jelaskan tentang fotosintesis"},
{"role": "assistant", "content": "Fotosintesis adalah proses di mana tumbuhan mengubah cahaya matahari menjadi energi kimia..."},
{"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]
# Apply template
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu dan ramah.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses di mana tumbuhan...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
---
## 🎯 Use Cases
Once trained, this model is designed for a range of NLP applications:
### Text Generation
- ✍️ Creative writing & storytelling
- 📰 Article generation
- 💬 Conversational AI
- 🔄 Text completion
### Language Understanding
- 📊 Text classification
- 🏷️ Named Entity Recognition (NER)
- ❓ Question Answering
- 📝 Summarization
### Code Generation
- 💻 Code completion
- 🐛 Bug-fixing suggestions
- 📚 Documentation generation
- 🔄 Code translation
### Multilingual Tasks
- 🌏 Translation (ID ↔ EN)
- 🗣️ Cross-lingual understanding
- 🌐 Multilingual classification
---
## 📈 Benchmarks & Evaluation
> ⚠️ The model has not been evaluated yet because it is untrained.
After training, the model will be evaluated on:
### Indonesian Benchmarks
- **IndoNLU**: Comprehensive Indonesian NLU tasks
- **IndoQA**: Indonesian Question Answering
- **IndoSum**: Summarization
- **IndoNER**: Named Entity Recognition
### Multilingual Benchmarks
- **MMLU**: Massive Multitask Language Understanding
- **HellaSwag**: Common sense reasoning
- **ARC**: Science QA
- **TruthfulQA**: Truthfulness evaluation
### Generation Quality
- **Perplexity**: Language modeling quality
- **BLEU/ROUGE**: Translation & summarization
- **Human Evaluation**: Fluency, coherence, factuality
---
## ๐Ÿ› ๏ธ Development & Training Tips
### Optimal Batch Size
```python
# Rule of thumb for this 3.52M-parameter model
# GPU memory → batch size per device
if gpu_memory >= 80:    # A100 80GB
    batch_size = 4539
    gradient_accumulation = 1
elif gpu_memory >= 40:  # A100 40GB
    batch_size = 2269
    gradient_accumulation = 1
elif gpu_memory >= 24:  # RTX 3090/4090
    batch_size = 1361   # rough proportional estimate (~57 per GB)
    gradient_accumulation = 1

# Effective batch size = batch_size × gradient_accumulation × num_gpus
```
### Learning Rate Scheduling
```python
# Recommended for the 3.52M model
learning_rate = 0.0005   # base LR
warmup_ratio = 0.05      # 5% of total steps
lr_scheduler = "cosine"  # or "linear"

# Learning-rate scaling rule:
# LR ∝ sqrt(batch_size)
# For batch size 256: LR = 5.00e-04
# For batch size 512: LR = 7.07e-04
```
### Gradient Clipping
```python
# Prevent gradient explosion
max_grad_norm = 1.0 # Clip at 1.0
# Monitor gradients
from torch.nn.utils import clip_grad_norm_
grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm)
if grad_norm > 10.0:
print(f"โš ๏ธ High gradient norm: {grad_norm:.2f}")
```
### Training Stability
Tips for stable training:
1. **Warmup**: start with a low learning rate
2. **Gradient checkpointing**: reduces the memory footprint
3. **Mixed precision**: use BF16 when available (more stable than FP16)
4. **Batch size**: start small, increase gradually
5. **Monitoring**: track loss, perplexity, and gradient norms
---
## 🔧 Troubleshooting
### Out of Memory (OOM)
```python
# Fixes for OOM during training:

# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Reduce the batch size
per_device_train_batch_size = 1

# 3. Increase gradient accumulation
gradient_accumulation_steps = 32

# 4. Use quantization
load_in_8bit = True  # or load_in_4bit

# 5. Reduce the sequence length
max_length = 1024  # start here

# 6. CPU offloading (if needed)
device_map = "auto"
offload_folder = "offload"
```
### Slow Training
```python
# Speed optimizations for training:

# 1. Flash Attention
config.use_flash_attn = True  # 2-3x speedup

# 2. Compile the model (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

# 3. DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,    # parallel data loading
    pin_memory=True,  # faster GPU transfer
    prefetch_factor=2
)

# 4. Mixed precision
use_fp16 = True  # or bf16

# 5. Optimize communication (multi-GPU)
find_unused_parameters = False
gradient_as_bucket_view = True
```
### NaN Loss
```python
# If the loss becomes NaN:

# 1. Reduce the learning rate
learning_rate = learning_rate * 0.1

# 2. Clip gradient norms
clip_grad_norm_(model.parameters(), 1.0)

# 3. Use BF16 instead of FP16 (more stable)
torch_dtype = torch.bfloat16

# 4. Add epsilon to RMSNorm
rms_norm_eps = 1e-5  # increase if needed

# 5. Check the data (see the guard below)
# Make sure there are no inf/nan values in the dataset
assert not torch.isnan(input_ids).any()
assert not torch.isinf(attention_mask).any()
```
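For fix 5, a cheap guard (same assumed names as above) that stops the run the moment the loss goes non-finite instead of training on garbage for hours:
```python
import torch

loss = model(**batch, labels=batch["input_ids"]).loss
if not torch.isfinite(loss):
    # Halt and inspect the batch rather than continue a poisoned run
    raise RuntimeError(f"Non-finite loss: {loss.item()}")
loss.backward()
```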
---
### ๐Ÿšซ Prohibited Uses
<div style="background: #ffebee; border-left: 4px solid #f44336; padding: 12px; margin: 16px 0;">
This model **MUST NOT** be used for:
- ๐Ÿšซ **Harmful content generation** (violence, self-harm, illegal acts)
- ๐Ÿšซ **Misinformation/disinformation campaigns**
- ๐Ÿšซ **Harassment or hate speech**
- ๐Ÿšซ **Impersonation or identity theft**
- ๐Ÿšซ **Child safety violations** (CSAM, grooming, exploitation)
- ๐Ÿšซ **Privacy violations** (doxxing, stalking, surveillance abuse)
- ๐Ÿšซ **Malicious code generation** (malware, exploits, etc)
- ๐Ÿšซ **Spam or manipulation** (fake reviews, astroturfing)
- ๐Ÿšซ **Medical/legal advice** (tanpa disclaimer & expert review)
- ๐Ÿšซ **Financial fraud** (scams, market manipulation)
**Violation consequences:** model access revocation + legal action where applicable
</div>
---
## ๐Ÿ“š References & Papers
### Core Architecture
1. **LLaMA** - [Touvron et al., 2023](https://arxiv.org/abs/2302.13971)
- RMSNorm, RoPE, SwiGLU, GQA
2. **GPT-4** - [OpenAI Technical Report, 2023](https://arxiv.org/abs/2303.08774)
- Mixture of Experts (speculated)
3. **Gemini** - [Google DeepMind, 2023](https://arxiv.org/abs/2312.11805)
- Multimodal architecture, soft-capping
4. **Qwen** - [Alibaba Cloud, 2023](https://arxiv.org/abs/2309.16609)
- YaRN, long context
5. **Gemma** - [Google, 2024](https://arxiv.org/abs/2403.08295)
- Layer scaling, normalization
### Advanced Techniques
6. **Flash Attention 2** - [Dao, 2023](https://arxiv.org/abs/2307.08691)
7. **Mixture-of-Depths** - [Raposo et al., 2024](https://arxiv.org/abs/2404.02258)
8. **StreamingLLM** - [Xiao et al., 2023](https://arxiv.org/abs/2309.17453)
9. **YaRN** - [Peng et al., 2023](https://arxiv.org/abs/2309.00071)
10. **QLoRA** - [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314)
---
## โš ๏ธ Known Limitations
1. **Training Cost** - MoE + multimodal = expensive
2. **Complex Debugging** - Many fallback systems
3. **Memory Hungry** - When all features are enabled
4. **Dependency Hell** - Requires flash-attn, xformers, bitsandbytes
5. **Expert Balancing** - MoE needs careful tuning for load balancing
---
## ๐Ÿ“œ License & Citation
### ๐Ÿ“„ License
<div style="background: #e8f5e9; border-left: 4px solid #4caf50; padding: 12px; margin: 16px 0;">
This model is released under the **Apache License 2.0**
✅ **You are FREE to:**
- ✔️ Use it commercially
- ✔️ Modify it as you wish
- ✔️ Redistribute it
- ✔️ Make patent use of it
- ✔️ Use it privately
⚠️ **Provided that you:**
- 📄 Include the license & copyright notice
- 📝 State the changes you made
- 📋 Include the disclaimer of warranty
❌ **No warranty whatsoever** (use at your own risk)
</div>
**Full license text**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
## ๐Ÿ“– Citation
If you use this model in your research, please cite:
```bibtex
@misc{cacacaca1m,
author = {Lyon},
title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}},
note = {Untrained model with 3,524,608 parameters}
}
```
**APA Style:**
```
Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped
Query Attention [Untrained model]. Hugging Face.
https://huggingface.co/Lyon28/caca-1M-untrained
```
**MLA Style:**
```
Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention."
Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained.
```
---
### ๐Ÿ™ Acknowledgments
This model stands on the shoulders of giants! Thanks to:
<details>
<summary><b>๐Ÿ›๏ธ Klik untuk daftar lengkap acknowledgments</b></summary>
#### ๐Ÿ—๏ธ **Core Architecture**
- **LLaMA/LLaMA 2** (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- Authors: Hugo Touvron et al.
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation insights
#### ๐ŸŽฏ **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., Stanford, 2023)
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- 3x speedup with an IO-aware algorithm
- **Grouped Query Attention** (Joshua Ainslie et al., Google, 2023)
- Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- Memory-efficient KV cache
- **Multi-Query Attention** (Noam Shazeer, Google, 2019)
- Fast inference with shared K/V
- **xFormers** (Meta AI, 2022) - Memory efficient attention
- **PyTorch SDPA** (PyTorch Team, 2023) - Native attention optimization
#### ๐Ÿ“ **Position Encodings**
- **RoPE** (Jianlin Su et al., Zhuiyi Technology, 2021)
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Superior length extrapolation
- **ALiBI** (Ofir Press et al., 2022)
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- Length generalization without retraining
- **YaRN** (Bowen Peng et al., 2023)
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)
#### ๐ŸชŸ **Long Context & Efficiency**
- **Sliding Window Attention** (Albert Jiang et al., Mistral AI, 2023)
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM** (Guangxuan Xiao et al., MIT, 2023)
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- Infinite sequence length!
- **Logit Softcapping** (Google Gemma Team, 2024)
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)
#### ๐Ÿง  **Mixture of Experts**
- **Mixtral 8x7B** (Albert Jiang et al., Mistral AI, 2024)
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- State-of-the-art sparse MoE
- **Switch Transformers** (William Fedus et al., Google, 2021)
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- Expert scaling insights
- **GLaM** (Nan Du et al., Google, 2021) - Generalist Language Model
- **Expert Choice Routing** (Yanqi Zhou et al., Google, 2022)
- Better load balancing
#### ๐ŸŽ“ **Training Optimizations**
- **Layer Scale** (Hugo Touvron et al., Meta, 2021)
- Paper: [Going Deeper with Image Transformers](https://arxiv.org/abs/2103.17239)
- Training stability for deep networks
- **Stochastic Depth** (Gao Huang et al., 2016)
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths** (David Raposo et al., DeepMind, 2024)
- Paper: [Mixture-of-Depths: Dynamically allocating compute](https://arxiv.org/abs/2404.02258)
- Dynamic compute allocation
- **Gradient Checkpointing** (Tianqi Chen et al., 2016)
#### ๐Ÿ“ฆ **Quantization**
- **LLM.int8()** (Tim Dettmers et al., 2022)
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Tim Dettmers et al., 2023)
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- 4-bit efficient fine-tuning
- **bitsandbytes** (Tim Dettmers) - Quantization library
#### ๐ŸŽจ **Multimodal**
- **Vision Transformer** (Alexey Dosovitskiy et al., Google, 2020)
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Flamingo** (Jean-Baptiste Alayrac et al., DeepMind, 2022)
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- Perceiver Resampler
- **BLIP-2** (Junnan Li et al., Salesforce, 2023)
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- Q-Former architecture
- **Whisper** (Alec Radford et al., OpenAI, 2022) - Audio encoding
#### ๐Ÿ› ๏ธ **Normalization & Activations**
- **RMSNorm** (Biao Zhang, Rico Sennrich, 2019)
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Noam Shazeer, Google, 2020)
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
#### ๐Ÿ”ง **Tools & Frameworks**
- **๐Ÿค— Hugging Face** - Transformers, Accelerate, PEFT
- Making NLP accessible to everyone
- **PyTorch** - Deep learning framework
- Facebook AI Research team
- **Safetensors** - Secure serialization
- Hugging Face team
- **DeepSpeed** - Distributed training
- Microsoft Research
- **Flash Attention Implementation** - Tri Dao & team
#### ๐Ÿ‡ฎ๐Ÿ‡ฉ **Indonesian NLP Community**
Special thanks to the Indonesian NLP researchers & practitioners who have built the foundation for Indonesian-language AI.
</details>
---
## ๐Ÿค Contributing
We warmly welcome contributions! Here's how you can contribute:
### Training & Fine-tuning
- ๐ŸŽ“ Train model ini dengan dataset Anda
- ๐Ÿ“Š Share benchmark results
- ๐Ÿ”ฌ Experiment dengan hyperparameters
### Code & Architecture
- ๐Ÿ› Report bugs atau issues
- ๐Ÿ’ก Suggest improvements
- ๐Ÿ”ง Submit pull requests
### Documentation
- ๐Ÿ“š Improve documentation
- ๐ŸŒ Add translations
- โœ๏ธ Write tutorials & guides
### Dataset & Evaluation
- ๐Ÿ“ Contribute training data
- ๐Ÿงช Create evaluation benchmarks
- ๐ŸŽฏ Share fine-tuned versions
---
## ๐Ÿ‘ฅ Team & Acknowledgments
### Core Team
- **LyonPoy** - Architecture design & implementation
### Special Thanks
- ๐Ÿค— **Hugging Face** - Infrastructure & community
- โšก **FlashAttention Team** - Efficient attention implementation
- ๐Ÿง  **Anthropic, Google, Meta, openAI, etc** - Research inspirations
- Meta AI (LLaMA)
- OpenAI (GPT series)
- Google DeepMind (Gemini, Gemma)
- Alibaba Cloud (Qwen)
- HuggingFace (Transformers library)
- Tri Dao (Flash Attention)
- Tim Dettmers (bitsandbytes)
### Community
Thanks to the open-source community for its contributions to:
- Transformers library
- PyTorch framework
- Datasets & evaluation tools
---
## ๐Ÿ“ž Contact & Support
### Community
- ๐Ÿ’ฌ [Discussions](https://huggingface.co/Lyon28/caca-1M-untrained/discussions) - Ask questions
- ๐Ÿ› [Issues](https://github.com/Lyon-28/caca-transformers/issues) - Report bugs
- ๐Ÿ“ง Email : cacatransformers@gmail.com
---
## ๐ŸŒŸ Star History
<div align="center">
[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)
</div>
## ๐Ÿ’ Dibuat dengan โค๏ธ untuk Komunitas AI Indonesia
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="200" alt="Caca Logo"/>
### **Thank you for using Caca!**
If this model is useful to you, don't forget to ⭐ our repository!
<div align="center">
<table>
<tr>
<td align="center">โญ<br/><b>Star Repo</b><br/><sub>Show your support</sub></td>
<td align="center">๐Ÿ”—<br/><b>Share</b><br/><sub>Tell your friends</sub></td>
<td align="center">๐Ÿ’ฌ<br/><b>Join Discussion</b><br/><sub>Ask questions</sub></td>
<td align="center">๐Ÿค<br/><b>Contribute</b><br/><sub>Make it better</sub></td>
</tr>
</table>
### ๐Ÿš€ Happy Training! ๐Ÿš€
**This model is waiting to be trained into the foundation for your AI applications.**
[๐Ÿ“ฅ Download Model](#) โ€ข [๐Ÿ“– Read Docs](https://github.com/Lyon-28/caca-transformers) โ€ข [๐Ÿ’ฌ Join Community](https://github.com/Lyon-28/caca-transformers)
</div>
---
### ๐Ÿ“Š Model Statistics
<img src="https://img.shields.io/badge/Parameters-3.52M-blue?style=for-the-badge" alt="Parameters"/>
<img src="https://img.shields.io/badge/Status-Untrained-orange?style=for-the-badge" alt="Status"/>
<img src="https://img.shields.io/badge/License-Apache%202.0-green?style=for-the-badge" alt="License"/>
<img src="https://img.shields.io/badge/Architecture-Transformer-purple?style=for-the-badge" alt="Architecture"/>
<img src="https://img.shields.io/badge/Type-Causal%20LM-red?style=for-the-badge" alt="Type"/>
<img src="https://img.shields.io/badge/Context-1,024%20tokens-cyan?style=for-the-badge" alt="Context"/>
---
### ๐ŸŽจ Daily Inspiration
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SvgQuote" alt="Daily Quote" width="600" />
</div>
---
### ๐Ÿ“ˆ Quick Stats
| Metric | Value |
|--------|-------|
| ๐Ÿ’Ž Total Parameters | 3,524,608 |
| ๐Ÿ—๏ธ Layers | 6 |
| ๐ŸŽฏ Attention Heads | 4 |
| ๐Ÿ“– Max Context | 1,024 tokens |
| ๐Ÿ’พ Size (FP16) | 0.01 GB |
| ๐Ÿ’พ Size (INT4) | 0.00 GB |
---
<sub>
This model is part of the <b>Caca Project</b>, an open-source initiative to build the Indonesian LLM ecosystem.<br/>
Created with ๐Ÿ’ป by <a href="https://huggingface.co/Lyon28">@Lyon28</a> |
Licensed under <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0</a> |
Built with <a href="https://huggingface.co">๐Ÿค— HuggingFace</a>
</sub>
<br/><br/>
**๐ŸŒŸ "Dari nol, untuk semua" ๐ŸŒŸ**
<sub>Last updated: January 2026</sub>
</div>
---
<div align="center">
<sub>Built with โค๏ธ by Caca Transformers Team</sub><br>
<sub>Powered by ๐Ÿค— Transformers โ€ข โšก PyTorch โ€ข ๐Ÿ”ฅ Flash Attention</sub>
</div>