---
license: apache-2.0
language:
- id
- en
tags:
- text-generation
- pytorch
- causal-lm
- transformer
- untrained
- gqa
- rope
- swiglu
- rmsnorm
- flash-attention
- indonesian
- bilingual
library_name: transformers
pipeline_tag: text-generation
widget:
- text: "Jakarta adalah ibu kota"
example_title: "🇮🇩 Text Completion (ID)"
- text: |
Pertanyaan: Apa itu kecerdasan buatan?
Jawaban:
example_title: "🇮🇩 Question Answering (ID)"
- text: |
Tulis cerita pendek tentang robot yang belajar mencintai.
example_title: "🇮🇩 Creative Writing (ID)"
- text: "The capital of Indonesia is"
example_title: "🇬🇧 Text Completion (EN)"
- text: |
Question: What is artificial intelligence?
Answer:
example_title: "🇬🇧 Question Answering (EN)"
- text: |
def fibonacci(n):
"""Hitung bilangan fibonacci ke-n"""
example_title: "💻 Code Completion"
- text: |
# Fungsi untuk mengurutkan array
def sort_array(arr):
example_title: "💻 Code Generation"
- text: |
User: Halo! Siapa kamu?
Assistant:
example_title: "💬 Chat Format (ID)"
- text: |
User: Jelaskan tentang machine learning dalam 2 kalimat.
Assistant:
example_title: "💬 Conversational (ID)"
inference:
parameters:
max_new_tokens: 100
temperature: 0.7
top_p: 0.9
top_k: 50
do_sample: true
repetition_penalty: 1.1
num_beams: 1
datasets: []
metrics:
- perplexity
- accuracy
model-index:
- name: caca-1M
results: []
---
<div align="center">
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="400" alt="caca-1M"/>
# 🤖 caca-1M
### A Modern Transformer Architecture with Advanced Features
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/)
[![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/)
[![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers)
[![Model Type](https://img.shields.io/badge/Model-Causal%20LM-green.svg)]()
[![Parameters](https://img.shields.io/badge/Parameters-3.52M-orange.svg)]()
[![Status](https://img.shields.io/badge/Status-Untrained-red.svg)]()
**3,524,608** parameters • **3.52M** • **6 layers** • **1,024 tokens**
[📚 Documentation](#-documentation) • [💻 Usage](#usage) • [⚙️ Configuration](#️-configuration-details) • [🔬 Architecture](#-architecture)
</div>
---
## โš ๏ธ PENTING: Model Belum Dilatih (Untrained)
<div style="background: #fff3cd; border-left: 4px solid #ffc107; padding: 12px; margin: 16px 0;">
<strong>โš ๏ธ PERHATIAN</strong>: Ini adalah model yang <strong>belum melalui proses training</strong>. Bobot model masih dalam kondisi <strong>random initialization</strong>. Output yang dihasilkan akan <strong>tidak bermakna dan acak</strong>.
</div>
**Status Model:**
- ๐Ÿ”ด **Belum dilatih** - Bobot masih random (Kaiming/Xavier init)
- ๐ŸŸก **Untuk riset & eksperimen** - Arsitektur sudah siap, tinggal train
- ๐ŸŸข **Production-ready architecture** - Teruji dan optimal
Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas tinggi.
### 🎯 What Can It Do?
| ✅ Can | ❌ Cannot (yet) |
|---------|----------------|
| Load the model architecture | Generate meaningful text |
| Test forward passes | Answer questions |
| Measure memory & speed | Reasoning & understanding |
| Start training | Production deployment |
| Fine-tuning experiments | Real-world applications |
---
## 📋 Description
**CACA** (Collaborative Architecture for Contextual AI) is a Large Language Model (LLM) architecture that combines **best practices** from State-of-the-Art (SOTA) models such as **LLaMA**, **GPT-4**, **Gemini**, **Qwen**, and **Gemma**.
The model is designed with a focus on **computational efficiency**, **scalability**, and **high performance**, making it **modular**, **production-ready**, and **multimodal** (text, images, audio).
<blockquote style="border-left: 4px solid #4A90E2; padding-left: 16px; margin: 16px 0; background: #f8f9fa; padding: 12px;">
<p><strong>📖 About the Caca Project</strong></p>
<p><em>Caca</em> is an open-source Indonesian LLM experiment, built from scratch by one person, step by step. It is not competing with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset.</p>
<p>If it turns out useful to others, alhamdulillah. If not, it is still fun. This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.</p>
<p>— <strong>Lyon</strong>, Creator</p>
</blockquote>
### ✨ **Highlights**
- 🧠 **Hybrid Architecture** - combines the best techniques from 5+ SOTA models
- 🎭 **Multimodal Native** - supports text, images, and audio in a single model
- ⚡ **High Performance** - Flash Attention, MoE, and modern optimizations
- 🌏 **Indonesian-First** - developed with a focus on the Indonesian language
- 🔓 **Open Source** - transparent, reproducible, collaborative
### 🌟 Why Caca?
1. **🇮🇩 Indonesian-language focus** - designed around the characteristics of Indonesian
2. **⚡ High efficiency** - GQA & Flash Attention for 3-5x faster inference
3. **💾 Memory efficient** - saves 50% of KV-cache memory
4. **🔧 Modular & extensible** - easy to customize for different use cases
5. **🌏 Bilingual** - optimized support for Indonesian & English
**CACA** comes with a different philosophy:
- ✅ **Fully open-source** - from the architecture to the training code
- ✅ **Modular & scalable** - configurable from 1B up to 70B+ parameters
- ✅ **Resource-efficient** - optimized for limited budgets
- ✅ **Indonesian-centric** - Indonesian is the first priority
- ✅ **Community-driven** - open to contributions & collaborations
## 📈 Comparison with Other Models
| Feature | LLaMA | GPT-4 | Gemini | Qwen | CACA |
|-------|-------|-------|--------|------|------|
| **RMSNorm** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **RoPE** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **GQA** | ✅ | ❌ | ❌ | ✅ | ✅ |
| **MoE** | ❌ | ✅ | ✅ | ❌ | ✅ |
| **Multimodal** | ❌ | ✅ | ✅ | ✅ | ✅ |
| **Flash Attention** | ✅ | ✅ | ✅ | ✅ | ✅ |
| **Sliding Window** | ❌ | ❌ | ❌ | ✅ | ✅ |
| **Attention Sinks** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **MoD** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **Expert Choice** | ❌ | ❌ | ❌ | ❌ | ✅ |
| **YARN Scaling** | ❌ | ❌ | ❌ | ✅ | ✅ |
| **Quantization** | ✅ | ❌ | ❌ | ✅ | ✅ |
---
## 🎯 Use Cases & Applications
### ✅ Suitable For
<table>
<tr>
<td width="50%">
**🔬 Research & Development**
- Transformer architecture experiments
- Ablation studies
- Novel training techniques
- Architecture search
**📚 Academic & Education**
- Theses & research papers
- Teaching materials
- Student projects
- Understanding LLM internals
</td>
<td width="50%">
**🚀 Base Model for Fine-tuning**
- Task-specific models
- Domain adaptation
- Instruction tuning
- RLHF experiments
**💡 Prototyping**
- Proof of concept
- Feature testing
- A/B testing architectures
- Benchmark comparisons
</td>
</tr>
</table>
### โŒ Tidak Cocok Untuk
<div style="background: #ffe6e6; border-left: 4px solid #ff4444; padding: 12px; margin: 16px 0;">
- ๐Ÿšซ **Production Applications** - Model belum dilatih, output random
- ๐Ÿšซ **Real-world Deployment** - Perlu training & safety alignment dulu
- ๐Ÿšซ **Safety-critical Systems** - Tidak ada safety guardrails
- ๐Ÿšซ **Direct User-facing Apps** - Output tidak dapat diprediksi
- ๐Ÿšซ **Commercial Use (as-is)** - Harus dilatih terlebih dahulu
</div>
---
## 📊 Model Specifications
<table>
<tr>
<td><strong>Parameter</strong></td>
<td><strong>Value</strong></td>
<td><strong>Parameter</strong></td>
<td><strong>Value</strong></td>
</tr>
<tr>
<td>Total Parameters</td>
<td><code>3,524,608</code></td>
<td>Vocab Size</td>
<td><code>8,000</code></td>
</tr>
<tr>
<td>Hidden Size</td>
<td><code>128</code></td>
<td>Intermediate Size</td>
<td><code>512</code></td>
</tr>
<tr>
<td>Num Layers</td>
<td><code>6</code></td>
<td>Attention Heads</td>
<td><code>4</code></td>
</tr>
<tr>
<td>KV Heads (GQA)</td>
<td><code>2</code></td>
<td>Head Dimension</td>
<td><code>32</code></td>
</tr>
<tr>
<td>Max Context Length</td>
<td><code>1,024</code></td>
<td>RoPE Base (θ)</td>
<td><code>10,000</code></td>
</tr>
<tr>
<td>Model Size (FP16)</td>
<td><code>0.01 GB</code></td>
<td>Formatted Size</td>
<td><code>3.52M</code></td>
</tr>
</table>
---
### 🎯 Core Features
<details open>
<summary><b>๐Ÿ” Klik untuk expand/collapse</b></summary>
- โœ… **Grouped Query Attention (GQA)** - Efisiensi memori dan komputasi superior
- Query heads: **4**
- KV heads: **2**
- Ratio: **2:1** (hemat ~50% memory KV cache)
- **Benefit**: Inferensi lebih cepat dengan memory footprint lebih kecil
- โœ… **Rotary Position Embeddings (RoPE)** - Generalisasi konteks panjang lebih baik
- Theta (ฮธ): **10,000**
- Support extrapolation untuk konteks > training length
- **Benefit**: Performa stabil pada sequence length yang belum pernah dilihat saat training
- โœ… **RMSNorm** - Normalisasi lebih stabil dan ~50% lebih cepat dari LayerNorm
- Epsilon: **1e-06**
- **Benefit**: Training lebih stabil, inference lebih cepat, gradient flow lebih baik
- โœ… **SwiGLU Activation** - Performa 10-15% lebih baik dari ReLU/GELU
- Intermediate size: **512** (4.0x hidden)
- **Benefit**: Kapasitas model lebih besar tanpa menambah parameter signifikan
- โœ… **Flash Attention 2** - Akselerasi hingga 3x dengan memory efficiency
- Otomatis aktif jika tersedia CUDA device
- IO-aware algorithm untuk minimal HBM access
- **Benefit**: Training & inference jauh lebih cepat, support batch size lebih besar
- โœ… **Hybrid Architecture** - Kombinasi teknik terbaik dari 5+ model SOTA
- โœ… **Multimodal Support** - Native support untuk Vision dan Audio
- โœ… **Mixture of Experts (MoE)** - Sparse activation untuk efisiensi
- โœ… **Long Context** - Support hingga 8K+ tokens dengan YARN scaling
- โœ… **Advanced Attention** - Flash Attention, Sliding Window, Attention Sinks
- โœ… **Quantization Ready** - Support 4-bit dan 8-bit quantization
- โœ… **Production Features** - Extensive error handling & monitoring
</details>
### 🔥 Advanced Features
### 🎯 Attention Mechanisms
- ⚡ **Flash Attention v2** - IO-aware algorithm, 3x faster than standard attention
- 🔑 **Grouped Query Attention (GQA)** - 4 query heads : 2 KV heads
  - Compression ratio: **2:1** (saves ~50% of KV-cache memory)
- 🚀 **xFormers Support** - memory-efficient attention fallback
- 🎯 **PyTorch SDPA** - native scaled dot-product attention
### 📏 Position Encodings
- 🔄 **RoPE (Rotary Position Embeddings)** - base frequency θ=10,000
  - Generalizes better to long sequences than absolute position embeddings
### 🎓 Training Optimizations
- 💾 **Gradient Checkpointing** - trades compute for memory (supports models up to 100B+ params)
- 🎯 **Mixed Precision Training** - supports FP16, BF16, and TF32
- 📉 **Dropout Regularization**
  - Hidden dropout: 0.1
  - Attention dropout: 0.0
  - Residual dropout: 0.1
### 📦 Quantization Support
- 4️⃣ **4-bit Quantization** - NF4 & FP4 via bitsandbytes
  - Memory reduction: ~**75%** (4GB → 1GB)
  - Accuracy loss: <2% on most tasks
  - Supports double quantization for maximum compression
- 8️⃣ **8-bit Quantization** - LLM.int8() with outlier handling
  - Memory reduction: ~**50%** (4GB → 2GB)
  - Accuracy loss: <1%
- 🔄 **Dynamic Quantization** - runtime quantization without calibration
### 🔬 Advanced Features
- 📊 **Automatic Mixed Precision (AMP)** - dynamic loss scaling
- 🎯 **Gradient Clipping** - training stability via max-norm clipping
- 📈 **Learning Rate Scheduling** - supports cosine, linear, warmup
- 💡 **Smart Memory Management** - automatic cache clearing & monitoring
- 🔍 **Metrics Tracking** - real-time perplexity, loss, gradient norms
- 🛡️ **NaN/Inf Detection** - automatic recovery from numerical instability
---
## 🧩 Architecture Components
### 1️⃣ **From LLaMA (Meta)**
CACA adopts LLaMA's efficiency-oriented components for optimal performance:
```python
✓ RMSNorm                     # More efficient normalization than LayerNorm
✓ Rotary Position Embeddings  # Better positional encoding
✓ SwiGLU Activation           # Activation function with a gating mechanism
✓ Grouped-Query Attention     # Saves memory via shared K/V heads
✓ Pre-normalization           # Better training stability
```
- RMSNorm is **~30% faster** than LayerNorm (a minimal sketch follows below)
- RoPE lets the model **extrapolate to longer contexts**
- GQA **saves 30-40% memory** compared to Multi-Head Attention
- SwiGLU **improves performance by 3-5%** over ReLU/GELU
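
To make the RMSNorm entry concrete, here is a minimal PyTorch sketch of the formula used throughout this card (illustrative only, not the repository's own implementation):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square norm: x * rsqrt(mean(x^2) + eps) * gamma."""
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # gamma
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No mean-centering and no bias term, unlike LayerNorm.
        variance = x.pow(2).mean(dim=-1, keepdim=True)
        return self.weight * x * torch.rsqrt(variance + self.eps)
```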
---
### 2๏ธโƒฃ **Dari GPT-4 (OpenAI)**
Implementasi Mixture of Experts untuk skalabilitas:
```python
โœ“ Mixture of Experts (MoE) # Sparse activation dengan multiple expert networks
โœ“ Top-K Router # Routing token ke K expert terbaik
โœ“ Auxiliary Loss # Load balancing antar experts
โœ“ Z-Loss # Stabilisasi router logits
โœ“ Expert Usage Tracking # Monitoring penggunaan setiap expert
```
```
Input Token
↓
[Router] → pick top-K experts (e.g., K=2 of 8 experts)
↓
Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4)
↓
Weighted Sum Output
```
**Advantages** (a minimal router sketch follows below):
- The model can be **10x larger** at the same compute cost
- Each token activates only **2 of 8 experts (≈25% of expert parameters)** when K=2, N=8
- Experts run in parallel
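
A minimal sketch of the Top-K router described above, reusing this card's own load-balancing signal `std(expert_usage) / mean_usage`; the class and variable names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouter(nn.Module):
    """Pick the top-k experts per token and renormalize their weights."""
    def __init__(self, hidden_size: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                      # [tokens, num_experts]
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.k, dim=-1)
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)  # renormalize
        # Load-balancing auxiliary loss: how evenly are experts used?
        usage = weights.mean(dim=0)
        aux_loss = usage.std() / usage.mean()
        return topk_w, topk_idx, aux_loss
```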
---
### 3๏ธโƒฃ **Dari Gemini (Google)**
Multimodal native dengan cross-modal fusion:
```python
โœ“ Vision Encoder (ViT) # Process gambar dengan Vision Transformer
โœ“ Audio Encoder (Conv1D + Trans) # Process audio dengan CNN + Transformer
โœ“ Cross-Attention Mechanism # Fuse multimodal features
โœ“ Multiple Projector Types:
- Linear Projector # Simple & cepat
- MLP Projector # Non-linear mapping
- Perceiver Resampler # Compress dengan latent queries
- Q-Former # Query-based projection (BLIP-2 style)
โœ“ Logit Soft-Capping # Clip extreme values untuk stabilitas
```
**Multimodal flow** (a minimal projector sketch follows below):
```
[Image] → Vision Encoder → [2D patches → 1D tokens]
↓
Projector → [Hidden dim = text dim]
↓
[Text] + [Image tokens] → Cross-Attention → Fused representation
```
**Supported formats:**
- Images: JPEG, PNG (224x224 default)
- Audio: Mel-spectrogram (80 bins)
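
As a sketch of the projector step in the flow above, a simple MLP projector that maps encoder features into the text embedding space; the dimensions and names here are hypothetical:

```python
import torch.nn as nn

class MLPProjector(nn.Module):
    """Map encoder features (e.g., ViT patch embeddings) to the text hidden size."""
    def __init__(self, vision_dim: int = 1024, text_dim: int = 128):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_feats):      # [batch, num_patches, vision_dim]
        return self.proj(vision_feats)    # [batch, num_patches, text_dim]
```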
---
### 4๏ธโƒฃ **Dari Qwen (Alibaba)**
Long context optimization:
```python
โœ“ YARN Scaling # Yet Another RoPE extensioN
โœ“ Dynamic Position Scaling # Auto-adjust untuk sequence lebih panjang
โœ“ Sliding Window Attention # Local attention pattern
โœ“ Context Window 8K-128K # Flexible context length
```
**YARN vs. standard RoPE (a RoPE-scaling sketch follows below):**
```
Standard RoPE: [====] 4K context → [====????] 8K (error grows)
YARN:          [====] 4K context → [========] 8K (smooth extrapolation)
```
**Sliding-window mechanism:**
```
Token 0:  attends to [0]
Token 1:  attends to [0, 1]
Token 2:  attends to [0, 1, 2]
Token 10: attends to [0, 6, 7, 8, 9, 10]  ← sliding window = 4
          (the attention sink at token 0 is kept)
```
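
A runnable sketch of RoPE frequency scaling under the simple "linear" scheme; full YARN additionally applies a per-frequency-band correction, which is omitted here, and the function name is illustrative:

```python
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10_000.0,
                     scale: float = 1.0):
    """RoPE angle table; scale > 1 stretches positions (linear interpolation)."""
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    positions = torch.arange(max_pos).float() / scale   # compress positions
    angles = torch.outer(positions, inv_freq)           # [max_pos, head_dim/2]
    return angles.cos(), angles.sin()

# Doubling the context of a model trained at 4K: scale = 8192 / 4096 = 2.0
cos, sin = rope_frequencies(head_dim=32, max_pos=8192, scale=2.0)
```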
---
### 5๏ธโƒฃ **Dari Gemma (Google)**
Optimization techniques:
```python
โœ“ Layer Scale # Learnable scaling per layer
โœ“ Stochastic Depth # Random layer dropping saat training
โœ“ Normalized Attention # QK normalization untuk stabilitas
โœ“ Knowledge Distillation # Transfer knowledge dari model besar
```
**Layer Scale formula:**
```python
output = input + gamma * layer(input)
# gamma is initialized very small (1e-5) and then learned
```
**Stochastic Depth:**
- Training: each layer is skipped with 20% probability (drop_prob=0.2)
- Inference: all layers are active
- Benefit: **regularization** + **faster training** (see the sketch below)
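
A minimal sketch of stochastic depth as described above (also known as DropPath); it assumes a `[batch, seq, hidden]` branch output and the names are illustrative:

```python
import torch
import torch.nn as nn

class StochasticDepth(nn.Module):
    """Randomly skip a residual branch during training."""
    def __init__(self, drop_prob: float = 0.2):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, branch_output: torch.Tensor) -> torch.Tensor:
        if not self.training or self.drop_prob == 0.0:
            return branch_output        # inference: the layer always applies
        keep_prob = 1.0 - self.drop_prob
        # One Bernoulli draw per sample, rescaled to keep the expectation.
        mask = (torch.rand(branch_output.shape[0], 1, 1,
                           device=branch_output.device) < keep_prob).float()
        return branch_output * mask / keep_prob
```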
---
## 🆕 Experimental & Unique Features
### A) **Mixture of Depths (MoD)**
Tokens can "skip" certain layers for efficiency:
```python
class MixtureOfDepthsRouter:
    # Process only the top 50% most "important" tokens
    capacity_factor = 0.5
    # Method: learned, random, or heuristic
    route_method = "learned"
```
**Illustration:**
```
Layer 1: [All 100 tokens processed]
Layer 2: [Top 50 tokens processed, 50 skipped] ← MoD
Layer 3: [All 100 tokens processed]
Layer 4: [Top 50 tokens processed, 50 skipped] ← MoD
```
**Benefit:**
- **30-40% faster inference** with minimal accuracy drop
- Dynamic computation based on token importance
**Paper:** [Mixture-of-Depths (2024)](https://arxiv.org/abs/2404.02258)
---
### B) **Attention Sinks**
Keep the initial tokens always attended, for stability:
```python
attention_sink_size = 4 # Keep first 4 tokens
attention_sink_window = 512 # Sliding window size
```
**Attention Pattern:**
```
Query Token 1000:
├─ Attend to: [0, 1, 2, 3]           ← attention sinks (always)
└─ Attend to: [488, 489, ..., 1000]  ← sliding window
```
**Benefit:**
- Prevents attention collapse in long sequences (see the mask sketch below)
- Better streaming generation
- Inspired by [StreamingLLM (2023)](https://arxiv.org/abs/2309.17453)
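
A minimal sketch of the combined sink + sliding-window mask shown in the pattern above (a boolean mask where True means "may attend"; names are illustrative):

```python
import torch

def sink_window_mask(seq_len: int, sink_size: int = 4, window: int = 512):
    """Causal mask that keeps the first `sink_size` tokens always visible
    and otherwise restricts attention to a sliding window."""
    q = torch.arange(seq_len).unsqueeze(1)   # query positions
    k = torch.arange(seq_len).unsqueeze(0)   # key positions
    causal = k <= q
    in_window = (q - k) < window
    is_sink = k < sink_size
    return causal & (in_window | is_sink)    # [seq_len, seq_len] bool

mask = sink_window_mask(seq_len=1024, sink_size=4, window=512)
```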
---
### C) **Expert Choice Routing**
An alternative to Top-K routing:
```python
# Top-K: tokens pick experts
Token → Router → "I want Expert 2 and Expert 5"
# Expert Choice: experts pick tokens
Expert 1 → "I'll process Tokens 3, 7, 12, ..."
Expert 2 → "I'll process Tokens 1, 5, 9, ..."
```
**Advantages:**
- **Better load balancing** (every expert processes the same number of tokens)
- **More stable training** (no expert collapse)
- Trade-off: slightly more complex to implement; see the sketch below
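
A minimal sketch of expert-choice selection, where each expert takes its own top tokens; shapes and names are illustrative:

```python
import torch

def expert_choice_routing(logits: torch.Tensor, capacity: int):
    """Each expert selects its own top-`capacity` tokens.

    logits: [num_tokens, num_experts] router scores.
    Returns weights and token indices, each [num_experts, capacity].
    """
    probs = logits.softmax(dim=-1)          # token → expert affinities
    scores = probs.transpose(0, 1)          # [num_experts, num_tokens]
    weights, token_idx = scores.topk(capacity, dim=-1)
    return weights, token_idx

# Example: 16 tokens, 4 experts; capacity = k * tokens / experts = 2*16/4 = 8
logits = torch.randn(16, 4)
w, idx = expert_choice_routing(logits, capacity=8)
```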
---
### D) **Multi-Backend Attention**
Automatic fallback for compatibility:
```python
if HAS_FLASH_ATTN and device == "cuda":
    out = flash_attn_func(q, k, v)                 # ← fastest (2-4x speedup)
elif HAS_XFORMERS and device == "cuda":
    out = memory_efficient_attention(q, k, v)      # ← fallback 1
elif HAS_SDPA:
    out = F.scaled_dot_product_attention(q, k, v)  # ← fallback 2 (PyTorch 2.0+)
else:
    out = standard_attention(q, k, v)              # ← safe fallback
```
**Performance comparison** (a runnable fallback sketch follows below):
```
Flash Attention: 100ms (baseline)
xFormers: 150ms (1.5x slower)
SDPA: 180ms (1.8x slower)
Standard: 400ms (4x slower)
```
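
A runnable sketch of this fallback chain using only calls known to exist (`flash_attn.flash_attn_func` and PyTorch's SDPA; the xFormers step is omitted for brevity). It assumes q/k/v shaped `[batch, heads, seq, head_dim]`:

```python
import torch
import torch.nn.functional as F

try:
    from flash_attn import flash_attn_func   # optional dependency
except ImportError:
    flash_attn_func = None

def attention(q, k, v, causal: bool = True):
    """Pick the fastest available attention backend."""
    if flash_attn_func is not None and q.is_cuda and q.dtype in (torch.float16, torch.bfloat16):
        # flash-attn expects [batch, seq, heads, head_dim]
        out = flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                              v.transpose(1, 2), causal=causal)
        return out.transpose(1, 2)
    if hasattr(F, "scaled_dot_product_attention"):   # PyTorch 2.0+
        return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    # Safe fallback: standard attention in pure PyTorch.
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    if causal:
        n = q.shape[-2]
        scores = scores.masked_fill(
            torch.triu(torch.ones(n, n, dtype=torch.bool, device=q.device), 1),
            float("-inf"))
    return scores.softmax(dim=-1) @ v
```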
---
## ๐Ÿ—๏ธ CACA Model Family
| Model | Parameters | Vocab Size | Hidden Size | Intermediate Size | Layers | Attention Heads | KV Heads | Head Dim | Max Position |
|-------|------------|------------|-------------|-------------------|--------|-----------------|----------|----------|--------------|
| caca-1M-untrained | 2.50M | 8,000 | 128 | 512 | 6 | 4 | 2 | 32 | 1,024 |
| caca-3M-untrained | 6.63M | 12,000 | 192 | 768 | 8 | 6 | 2 | 32 | 2,048 |
| caca-4M-untrained | 4.02M | 16,000 | 128 | 512 | 8 | 4 | 2 | 32 | 2,048 |
| caca-6M-untrained | 11.96M | 16,000 | 256 | 1024 | 8 | 4 | 2 | 64 | 2,048 |
| caca-10M-untrained | 21.25M | 20,000 | 320 | 1280 | 10 | 8 | 2 | 40 | 2,048 |
| caca-15M-untrained | 35.18M | 24,000 | 384 | 1536 | 12 | 6 | 2 | 64 | 2,048 |
| caca-25M-untrained | 67.57M | 28,000 | 512 | 2048 | 14 | 8 | 2 | 64 | 4,096 |
| caca-35M-untrained | 95.42M | 32,000 | 576 | 2304 | 16 | 8 | 2 | 72 | 4,096 |
| caca-50M-untrained | 138.47M | 32,000 | 640 | 2560 | 20 | 10 | 2 | 64 | 4,096 |
| caca-75M-untrained | 178.55M | 32,000 | 768 | 3072 | 18 | 12 | 3 | 64 | 4,096 |
| caca-100M-untrained | 232.23M | 32,000 | 768 | 3072 | 24 | 12 | 4 | 64 | 4,096 |
| caca-150M-untrained | 336.90M | 32,000 | 1024 | 4096 | 20 | 16 | 4 | 64 | 4,096 |
| caca-200M-untrained | 458.55M | 32,000 | 1024 | 4096 | 28 | 16 | 4 | 64 | 4,096 |
| caca-250M-untrained | 569.54M | 32,000 | 1152 | 4608 | 28 | 18 | 3 | 64 | 8,192 |
| caca-300M-untrained | 701.64M | 32,000 | 1280 | 5120 | 28 | 20 | 4 | 64 | 8,192 |
| caca-400M-untrained | 956.36M | 32,000 | 1408 | 5632 | 32 | 22 | 4 | 64 | 8,192 |
| caca-500M-untrained | 1.27B | 32,000 | 1536 | 6144 | 36 | 24 | 4 | 64 | 8,192 |
| caca-600M-untrained | 1.48B | 32,000 | 1664 | 6656 | 36 | 26 | 4 | 64 | 8,192 |
| caca-700M-untrained | 1.71B | 32,000 | 1792 | 7168 | 36 | 28 | 4 | 64 | 8,192 |
| caca-800M-untrained | 1.96B | 32,000 | 1920 | 7680 | 36 | 30 | 5 | 64 | 8,192 |
| caca-900M-untrained | 2.01B | 32,000 | 2048 | 8192 | 32 | 32 | 8 | 64 | 8,192 |
| caca-1B-untrained | 2.26B | 32,000 | 2048 | 8192 | 36 | 32 | 8 | 64 | 8,192 |
| caca-1.5B-untrained | 2.98B | 32,000 | 2048 | 8192 | 48 | 32 | 8 | 64 | 8,192 |
| caca-2B-untrained | 3.15B | 32,000 | 2304 | 9216 | 40 | 32 | 8 | 72 | 8,192 |
| caca-2.5B-untrained | 3.12B | 32,000 | 2560 | 10240 | 32 | 32 | 8 | 80 | 8,192 |
| caca-3B-untrained | 3.88B | 32,000 | 2560 | 10240 | 40 | 32 | 8 | 80 | 8,192 |
| caca-3.5B-untrained | 4.69B | 32,000 | 2816 | 11264 | 40 | 32 | 8 | 88 | 8,192 |
| caca-4B-untrained | 5.02B | 32,000 | 3072 | 12288 | 36 | 32 | 8 | 96 | 8,192 |
| caca-4.5B-untrained | 5.45B | 32,000 | 3200 | 12800 | 36 | 32 | 8 | 100 | 8,192 |
| caca-5B-untrained | 6.53B | 32,000 | 3328 | 13312 | 40 | 32 | 8 | 104 | 8,192 |
| caca-6B-untrained | 8.31B | 32,000 | 3584 | 14336 | 44 | 32 | 8 | 112 | 8,192 |
| caca-7B-untrained | 7.11B | 32,000 | 4096 | 14336 | 32 | 32 | 8 | 128 | 8,192 |
| caca-8B-untrained | 7.98B | 32,000 | 4096 | 14336 | 36 | 32 | 8 | 128 | 8,192 |
| caca-9B-untrained | 9.09B | 32,000 | 4608 | 16384 | 32 | 36 | 9 | 128 | 8,192 |
| caca-10B-untrained | 11.23B | 32,000 | 4608 | 18432 | 36 | 32 | 8 | 144 | 8,192 |
| caca-12B-untrained | 15.26B | 32,000 | 5120 | 20480 | 40 | 40 | 8 | 128 | 8,192 |
| caca-13B-untrained | 13.38B | 32,000 | 5120 | 13824 | 48 | 40 | 8 | 128 | 8,192 |
| caca-14B-untrained | 13.40B | 32,000 | 5376 | 14464 | 44 | 48 | 8 | 112 | 8,192 |
| caca-15B-untrained | 14.90B | 32,000 | 5632 | 15104 | 44 | 32 | 8 | 176 | 8,192 |
| caca-18B-untrained | 18.92B | 32,000 | 6144 | 16384 | 48 | 48 | 8 | 128 | 8,192 |
| caca-20B-untrained | 20.48B | 32,000 | 6144 | 16384 | 52 | 48 | 8 | 128 | 8,192 |
| caca-24B-untrained | 25.83B | 32,000 | 6656 | 17920 | 56 | 64 | 8 | 104 | 8,192 |
| caca-30B-untrained | 32.24B | 32,000 | 6656 | 17920 | 70 | 64 | 8 | 104 | 8,192 |
| caca-35B-untrained | 39.02B | 32,000 | 8192 | 22016 | 56 | 64 | 8 | 128 | 8,192 |
| caca-40B-untrained | 44.56B | 32,000 | 8192 | 22016 | 64 | 64 | 8 | 128 | 8,192 |
| caca-45B-untrained | 50.09B | 32,000 | 8192 | 22016 | 72 | 64 | 8 | 128 | 8,192 |
| caca-50B-untrained | 55.63B | 32,000 | 8192 | 22016 | 80 | 64 | 8 | 128 | 8,192 |
| caca-60B-untrained | 72.14B | 32,000 | 8192 | 28672 | 84 | 64 | 8 | 128 | 8,192 |
| caca-70B-untrained | 68.71B | 32,000 | 8192 | 28672 | 80 | 64 | 8 | 128 | 8,192 |
| caca-80B-untrained | 101.77B | 32,000 | 9216 | 36864 | 84 | 72 | 8 | 128 | 8,192 |
| caca-100B-untrained | 137.32B | 32,000 | 10240 | 40960 | 92 | 80 | 8 | 128 | 8,192 |
| caca-120B-untrained | 173.10B | 32,000 | 11264 | 45056 | 96 | 88 | 8 | 128 | 8,192 |
| caca-150B-untrained | 214.31B | 32,000 | 12288 | 49152 | 100 | 96 | 8 | 128 | 8,192 |
| caca-175B-untrained | 248.53B | 32,000 | 12288 | 49152 | 116 | 96 | 8 | 128 | 8,192 |
| caca-200B-untrained | 324.80B | 128,000 | 14336 | 57344 | 110 | 112 | 16 | 128 | 16,384 |
| caca-250B-untrained | 419.35B | 128,000 | 15360 | 61440 | 124 | 120 | 16 | 128 | 16,384 |
| caca-300B-untrained | 507.03B | 128,000 | 16384 | 65536 | 132 | 128 | 16 | 128 | 16,384 |
| caca-350B-untrained | 591.18B | 128,000 | 16384 | 65536 | 154 | 128 | 16 | 128 | 16,384 |
| caca-400B-untrained | 675.34B | 128,000 | 16384 | 65536 | 176 | 128 | 16 | 128 | 16,384 |
| caca-500B-untrained | 852.77B | 128,000 | 18432 | 73728 | 176 | 144 | 16 | 128 | 16,384 |
| caca-600B-untrained | 1.07T | 128,000 | 20480 | 81920 | 180 | 160 | 16 | 128 | 16,384 |
| caca-700B-untrained | 1.23T | 128,000 | 21504 | 86016 | 186 | 168 | 24 | 128 | 16,384 |
| caca-800B-untrained | 1.38T | 128,000 | 22528 | 90112 | 192 | 176 | 16 | 128 | 16,384 |
| caca-900B-untrained | 1.65T | 128,000 | 24576 | 94208 | 198 | 192 | 24 | 128 | 16,384 |
| caca-1T-untrained | 1.75T | 128,000 | 24576 | 98304 | 204 | 192 | 16 | 128 | 16,384 |
---
## 💾 Memory Requirements
### Training Requirements
<table>
<tr>
<th>Configuration</th>
<th>Model Weights</th>
<th>+ Optimizer States</th>
<th>Total Training</th>
</tr>
<tr>
<td><strong>FP32 (AdamW)</strong></td>
<td>0.01 GB</td>
<td>+0.04 GB</td>
<td><strong>0.06 GB</strong></td>
</tr>
<tr>
<td><strong>Mixed Precision</strong></td>
<td>0.01 GB</td>
<td>+0.05 GB</td>
<td><strong>0.06 GB</strong></td>
</tr>
<tr>
<td><strong>+ Gradient Checkpointing</strong></td>
<td colspan="2">Saves ~30-50% of activation memory</td>
<td><strong>~0.03 GB</strong></td>
</tr>
</table>
### Inference Requirements
<table>
<tr>
<th>Precision</th>
<th>Model Size</th>
<th>KV Cache (2K ctx)</th>
<th>Total Memory</th>
<th>Memory Saving</th>
</tr>
<tr>
<td><strong>FP16 / BF16</strong></td>
<td>0.01 GB</td>
<td>0.00 GB</td>
<td><strong>0.01 GB</strong></td>
<td>Baseline</td>
</tr>
<tr>
<td><strong>INT8</strong></td>
<td>0.00 GB</td>
<td>0.00 GB</td>
<td><strong>0.01 GB</strong></td>
<td>~50% ↓</td>
</tr>
<tr>
<td><strong>INT4 (NF4)</strong></td>
<td>0.00 GB</td>
<td>0.00 GB</td>
<td><strong>0.00 GB</strong></td>
<td>~75% ↓</td>
</tr>
</table>
> 💡 **Note**: The KV cache grows linearly with sequence length. For an 8K context, multiply the KV-cache values by 4 (see the helper below).
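
A quick back-of-the-envelope helper for that scaling, using this model's GQA layout (2 KV heads × 32 dims × 6 layers); the function itself is illustrative:

```python
def kv_cache_bytes(seq_len: int, num_layers: int = 6, num_kv_heads: int = 2,
                   head_dim: int = 32, bytes_per_param: int = 2) -> int:
    """KV-cache size = 2 (K and V) * layers * kv_heads * head_dim * seq_len."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_param

print(kv_cache_bytes(2048) / 1e6, "MB")   # ~3.1 MB at FP16 for a 2K context
```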
### Performance Estimates
<table>
<tr>
<th>Metric</th>
<th>Value</th>
<th>Notes</th>
</tr>
<tr>
<td><strong>FLOPs per Token</strong></td>
<td>7,049,216</td>
<td>Forward pass only</td>
</tr>
<tr>
<td><strong>TFLOPs per Token</strong></td>
<td>0.0000</td>
<td>Forward only; ≈ 3× when including the backward pass</td>
</tr>
<tr>
<td><strong>Bandwidth (FP16)</strong></td>
<td>0.01 GB/token</td>
<td>Memory bandwidth requirement</td>
</tr>
</table>
---
### ๐Ÿ“ Struktur Arsitektur Lengkap
<details>
<summary><b>๐Ÿ” Klik untuk lihat detail arsitektur</b></summary>
```
CACA Architecture
│
├─── 📥 INPUT PROCESSING
│    │
│    ├─── Text Input
│    │    ├─── Tokenization (BPE/WordPiece/SentencePiece)
│    │    ├─── Token Embeddings (vocab_size × hidden_size)
│    │    └─── Output: [batch_size, seq_len, hidden_size]
│    │
│    ├─── Vision Input (Optional)
│    │    ├─── Image Preprocessing (resize to 224×224)
│    │    ├─── Vision Encoder (ViT)
│    │    │    ├─── Patch Embedding (Conv2D: 14×14 patches)
│    │    │    ├─── CLS Token + Positional Embeddings
│    │    │    ├─── Vision Transformer Blocks (24 layers)
│    │    │    │    ├─── LayerNorm
│    │    │    │    ├─── Multi-Head Attention
│    │    │    │    ├─── MLP (GELU activation)
│    │    │    │    └─── Residual Connections
│    │    │    └─── Final LayerNorm
│    │    ├─── Vision Projector
│    │    │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│    │    │    └─── Output: [batch_size, num_patches, hidden_size]
│    │    └─── Output: Vision embeddings aligned to the text space
│    │
│    └─── Audio Input (Optional)
│         ├─── Audio Preprocessing (Mel-spectrogram, 80 bins)
│         ├─── Audio Encoder
│         │    ├─── Conv1D Layers (feature extraction)
│         │    │    ├─── Conv1D (80 → hidden_size, kernel=3)
│         │    │    └─── Conv1D (stride=2 for downsampling)
│         │    ├─── Positional Embeddings (interpolated)
│         │    ├─── Audio Transformer Blocks (12 layers)
│         │    │    ├─── LayerNorm
│         │    │    ├─── Multi-Head Attention
│         │    │    ├─── MLP (GELU activation)
│         │    │    └─── Residual Connections
│         │    └─── Final LayerNorm
│         ├─── Audio Projector
│         │    ├─── Type: Linear / MLP / Perceiver / Q-Former
│         │    └─── Output: [batch_size, audio_len, hidden_size]
│         └─── Output: Audio embeddings aligned to the text space
│
├─── 🔄 MULTIMODAL FUSION
│    │
│    ├─── Early Fusion (when not using Cross-Attention)
│    │    ├─── Concatenate: [vision_tokens + audio_tokens + text_tokens]
│    │    ├─── Update attention mask
│    │    └─── Output: combined sequence for the decoder
│    │
│    └─── Late Fusion (when using Cross-Attention)
│         ├─── Text tokens → queries for cross-attention
│         ├─── Vision+Audio tokens → keys/values for cross-attention
│         └─── Fusion happens inside the decoder layers
│
├─── 🏗️ DECODER STACK (N layers; N=6 in this model)
│    │
│    └─── 🔁 DECODER LAYER i (repeated N times)
│         │
│         ├─── [OPTIONAL] Mixture of Depths (MoD)
│         │    ├─── Input: Hidden states [batch, seq_len, hidden]
│         │    ├─── MoD Router
│         │    │    ├─── Method: learned / random / heuristic
│         │    │    ├─── Score computation per token
│         │    │    └─── Top-K selection (K = capacity_factor × seq_len)
│         │    ├─── Process Mask Generation
│         │    │    └─── Binary mask [batch, seq_len] (1=process, 0=skip)
│         │    └─── Token Selection
│         │         ├─── Selected tokens: processed through the layer
│         │         └─── Skipped tokens: bypass the layer (identity)
│         │
│         ├─── 🎯 SELF-ATTENTION PATH
│         │    │
│         │    ├─── Input Normalization
│         │    │    ├─── RMSNorm (Root Mean Square Layer Normalization)
│         │    │    ├─── Formula: x * rsqrt(mean(x²) + ε) * γ
│         │    │    └─── More efficient than LayerNorm (no mean centering)
│         │    │
│         │    ├─── Attention Computation
│         │    │    │
│         │    │    ├─── Query/Key/Value Projections
│         │    │    │    ├─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    ├─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── Reshape: [batch, seq, heads, head_dim]
│         │    │    │
│         │    │    ├─── [OPTIONAL] QK Normalization
│         │    │    │    ├─── Q = RMSNorm(Q)
│         │    │    │    └─── K = RMSNorm(K)
│         │    │    │
│         │    │    ├─── Rotary Position Embeddings (RoPE)
│         │    │    │    ├─── Compute frequencies: θ_i = base^(-2i/dim)
│         │    │    │    ├─── Position indices: t ∈ [0, seq_len)
│         │    │    │    ├─── Rotation matrix: cos(t·θ), sin(t·θ)
│         │    │    │    ├─── Apply rotation: Q, K = rotate(Q, K, cos, sin)
│         │    │    │    └─── YARN Scaling (when enabled)
│         │    │    │         ├─── Type: linear / dynamic / yarn
│         │    │    │         ├─── Scaling factor per frequency band
│         │    │    │         └─── Better extrapolation to long contexts
│         │    │    │
│         │    │    ├─── Grouped-Query Attention (GQA)
│         │    │    │    ├─── num_kv_groups = num_heads / num_kv_heads
│         │    │    │    ├─── Repeat K, V: [num_kv_heads → num_heads]
│         │    │    │    └─── Memory saving: 30-40% vs full MHA
│         │    │    │
│         │    │    ├─── Attention Score Computation
│         │    │    │    ├─── scores = (Q @ K.T) / sqrt(head_dim)
│         │    │    │    ├─── Logit clamping: [-50, 50] for stability
│         │    │    │    └─── [OPTIONAL] Soft-capping
│         │    │    │         └─── scores = tanh(scores / cap) * cap
│         │    │    │
│         │    │    ├─── Attention Masking
│         │    │    │    ├─── Causal Mask (autoregressive)
│         │    │    │    ├─── Sliding Window Mask (when enabled)
│         │    │    │    │    ├─── Window size (e.g., 512 tokens)
│         │    │    │    │    └─── Attend only within the nearest window
│         │    │    │    ├─── Attention Sinks (when enabled)
│         │    │    │    │    ├─── Always attend to the first K tokens
│         │    │    │    │    ├─── Prevents attention collapse
│         │    │    │    │    └─── Better streaming generation
│         │    │    │    └─── [OPTIONAL] ALiBi Bias
│         │    │    │         ├─── Linear bias based on distance
│         │    │    │         └─── Alternative/complement to RoPE
│         │    │    │
│         │    │    ├─── Backend Selection (automatic fallback)
│         │    │    │    ├─── 1️⃣ Flash Attention 2 (PREFERRED)
│         │    │    │    │    ├─── Requirements: CUDA + FP16/BF16
│         │    │    │    │    ├─── Speedup: 2-4x faster
│         │    │    │    │    ├─── Memory: 10-20x less
│         │    │    │    │    ├─── Sliding window support
│         │    │    │    │    └─── IO-aware algorithm
│         │    │    │    ├─── 2️⃣ xFormers Memory Efficient (FALLBACK 1)
│         │    │    │    │    ├─── Requirements: CUDA
│         │    │    │    │    ├─── Block-sparse attention
│         │    │    │    │    └─── Custom attention patterns
│         │    │    │    ├─── 3️⃣ PyTorch SDPA (FALLBACK 2)
│         │    │    │    │    ├─── Requirements: PyTorch 2.0+
│         │    │    │    │    ├─── Built-in scaled_dot_product_attention
│         │    │    │    │    └─── Hardware-agnostic
│         │    │    │    └─── 4️⃣ Standard Attention (SAFE FALLBACK)
│         │    │    │         ├─── Pure PyTorch implementation
│         │    │    │         ├─── Always available
│         │    │    │         └─── Slower but stable
│         │    │    │
│         │    │    ├─── Softmax + Dropout
│         │    │    │    ├─── attn_weights = softmax(scores, dim=-1)
│         │    │    │    └─── attn_weights = dropout(attn_weights)
│         │    │    │
│         │    │    ├─── Value Aggregation
│         │    │    │    ├─── output = attn_weights @ V
│         │    │    │    └─── Reshape: [batch, seq, num_heads × head_dim]
│         │    │    │
│         │    │    └─── Output Projection
│         │    │         ├─── O: Linear(num_heads × head_dim → hidden_size)
│         │    │         └─── Output: [batch, seq, hidden_size]
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    │    ├─── Learnable per-layer scaling: γ
│         │    │    ├─── Initialize: γ = 1e-5 (very small)
│         │    │    ├─── output = γ * output
│         │    │    └─── Improves training stability
│         │    │
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    │    ├─── Training: random layer dropping
│         │    │    ├─── drop_prob = layer_idx / num_layers × base_prob
│         │    │    ├─── if random() > drop_prob: return output
│         │    │    ├─── else: return 0
│         │    │    └─── Inference: always apply (no dropping)
│         │    │
│         │    ├─── Residual Dropout
│         │    │    └─── output = dropout(output)
│         │    │
│         │    └─── Residual Connection
│         │         ├─── hidden_states = hidden_states + output
│         │         └─── [Training] Gradient clipping: [-1e4, 1e4]
│         │
│         ├─── 🌐 [OPTIONAL] CROSS-ATTENTION PATH (for multimodal)
│         │    │
│         │    ├─── Conditional: only when encoder_hidden_states != None
│         │    ├─── Frequency: every cross_attention_frequency layers
│         │    │
│         │    ├─── Input Normalization
│         │    │    └─── RMSNorm(hidden_states)
│         │    │
│         │    ├─── Cross-Attention Computation
│         │    │    ├─── Query: from the text hidden states
│         │    │    │    └─── Q: Linear(hidden_size → num_heads × head_dim)
│         │    │    ├─── Key/Value: from encoder_hidden_states (vision+audio)
│         │    │    │    ├─── K: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    │    └─── V: Linear(hidden_size → num_kv_heads × head_dim)
│         │    │    ├─── Attention: Q @ K.T / sqrt(head_dim)
│         │    │    ├─── Softmax + Dropout
│         │    │    ├─── Output: attn_weights @ V
│         │    │    └─── Output Projection
│         │    │
│         │    ├─── [OPTIONAL] Layer Scale
│         │    ├─── [OPTIONAL] Stochastic Depth
│         │    ├─── Residual Dropout
│         │    └─── Residual Connection
│         │         └─── hidden_states = hidden_states + cross_attn_output
│         │
│         └─── 🔮 FEED-FORWARD PATH
│              │
│              ├─── Input Normalization
│              │    └─── RMSNorm(hidden_states)
│              │
│              ├─── Feed-Forward Network
│              │    │
│              │    ├─── ━━━━━ STANDARD MLP ━━━━━
│              │    │    │
│              │    │    ├─── Gate Projection
│              │    │    │    ├─── gate: Linear(hidden_size → intermediate_size)
│              │    │    │    └─── Typical: intermediate_size = 4 × hidden_size
│              │    │    │
│              │    │    ├─── Up Projection
│              │    │    │    └─── up: Linear(hidden_size → intermediate_size)
│              │    │    │
│              │    │    ├─── SwiGLU Activation
│              │    │    │    ├─── gate = silu(gate)   # Swish activation
│              │    │    │    ├─── hidden = gate * up  # Gating mechanism
│              │    │    │    └─── Formula: silu(x) = x * sigmoid(x)
│              │    │    │
│              │    │    ├─── Dropout
│              │    │    │    └─── hidden = dropout(hidden)
│              │    │    │
│              │    │    └─── Down Projection
│              │    │         ├─── down: Linear(intermediate_size → hidden_size)
│              │    │         └─── Output: [batch, seq, hidden_size]
│              │    │
│              │    └─── ━━━━━ MIXTURE OF EXPERTS (MoE) ━━━━━
│              │         │
│              │         ├─── Conditional: use_moe AND (layer_idx % moe_frequency == 0)
│              │         │
│              │         ├─── Router Network
│              │         │    │
│              │         │    ├─── Router Type Selection
│              │         │    │    ├─── Top-K Router (default)
│              │         │    │    └─── Expert Choice Router (alternative)
│              │         │    │
│              │         │    ├─── ━━━ TOP-K ROUTER ━━━
│              │         │    │    │
│              │         │    │    ├─── Gate Normalization
│              │         │    │    │    └─── hidden = LayerNorm(hidden)
│              │         │    │    │
│              │         │    │    ├─── Router Logits
│              │         │    │    │    ├─── logits: Linear(hidden_size → num_experts)
│              │         │    │    │    ├─── Clamping: [-20, 20]
│              │         │    │    │    └─── Temperature scaling: logits / temp
│              │         │    │    │
│              │         │    │    ├─── [Training] Jitter Noise
│              │         │    │    │    ├─── noise = randn_like(logits) × 0.01
│              │         │    │    │    └─── logits = logits + noise
│              │         │    │    │
│              │         │    │    ├─── Routing Weights
│              │         │    │    │    ├─── weights = softmax(logits)
│              │         │    │    │    └─── top_k_weights, top_k_indices = topk(weights, k)
│              │         │    │    │
│              │         │    │    ├─── Weight Normalization
│              │         │    │    │    └─── top_k_weights = top_k_weights / sum(top_k_weights)
│              │         │    │    │
│              │         │    │    └─── Loss Computation
│              │         │    │         ├─── Auxiliary Loss (load balancing)
│              │         │    │         │    ├─── expert_usage = mean(weights, dim=0)
│              │         │    │         │    ├─── mean_usage = mean(expert_usage)
│              │         │    │         │    └─── aux_loss = std(expert_usage) / mean_usage
│              │         │    │         ├─── Z-Loss (router stability)
│              │         │    │         │    ├─── z_loss = mean(logsumexp(logits)²)
│              │         │    │         │    └─── Prevents logit explosion
│              │         │    │         └─── Entropy Loss (diversity)
│              │         │    │              └─── entropy_loss = -mean(weights × log(weights))
│              │         │    │
│              │         │    └─── ━━━ EXPERT CHOICE ROUTER ━━━
│              │         │         │
│              │         │         ├─── Router Logits
│              │         │         │    └─── logits: Linear(hidden → num_experts)
│              │         │         │
│              │         │         ├─── Expert-wise Token Selection
│              │         │         │    ├─── Transpose: [batch×seq, experts]
│              │         │         │    ├─── capacity = expert_choice_k × total_tokens / num_experts
│              │         │         │    ├─── Per expert: topk(logits, k=capacity)
│              │         │         │    └─── Expert mask: [experts, batch×seq]
│              │         │         │
│              │         │         └─── Routing weights from the mask
│              │         │
│              │         ├─── Expert Networks (N experts, e.g., N=8)
│              │         │    │
│              │         │    └─── Expert i (i = 0 to N-1)
│              │         │         ├─── Same structure as the Standard MLP
│              │         │         ├─── gate_proj: Linear(hidden → intermediate)
│              │         │         ├─── up_proj: Linear(hidden → intermediate)
│              │         │         ├─── SwiGLU activation
│              │         │         ├─── Dropout
│              │         │         └─── down_proj: Linear(intermediate → hidden)
│              │         │
│              │         ├─── Expert Execution
│              │         │    │
│              │         │    ├─── For each expert:
│              │         │    │    ├─── Get the tokens routed to this expert
│              │         │    │    ├─── If no tokens: skip
│              │         │    │    ├─── Run the expert's forward pass
│              │         │    │    ├─── [Training] Track expert usage
│              │         │    │    └─── [Safety] NaN/Inf detection
│              │         │    │
│              │         │    └─── Combine Expert Outputs
│              │         │         ├─── Weighted sum by the router weights
│              │         │         └─── final_output = Σ(weight_i × expert_i(x))
│              │         │
│              │         └─── Output: [batch, seq, hidden_size]
│              │
│              ├─── [OPTIONAL] Layer Scale
│              │    └─── output = γ * output
│              │
│              ├─── [OPTIONAL] Stochastic Depth
│              │    └─── Probabilistic dropping (training only)
│              │
│              ├─── Residual Dropout
│              │    └─── output = dropout(output)
│              │
│              └─── Residual Connection
│                   ├─── hidden_states = hidden_states + output
│                   └─── [Training] Gradient clipping: [-1e4, 1e4]
│
├─── 📤 OUTPUT HEAD
│    │
│    ├─── Final Normalization
│    │    ├─── RMSNorm(hidden_states)
│    │    └─── Output: [batch, seq, hidden_size]
│    │
│    ├─── Language Modeling Head
│    │    ├─── Linear Projection
│    │    │    ├─── lm_head: Linear(hidden_size → vocab_size, bias=False)
│    │    │    └─── Output: [batch, seq, vocab_size]
│    │    │
│    │    └─── [OPTIONAL] Logit Soft-Capping
│    │         ├─── Clamp extreme values: [-cap×0.99, cap×0.99]
│    │         ├─── Formula: tanh(logits / cap) × cap
│    │         ├─── Prevents numerical instability
│    │         └─── Typical cap value: 30.0
│    │
│    └─── Output: Logits [batch, seq, vocab_size]
│
├─── 📉 LOSS COMPUTATION (Training Only)
│    │
│    ├─── Shift for Autoregression
│    │    ├─── shift_logits = logits[:, :-1, :]
│    │    └─── shift_labels = labels[:, 1:]
│    │
│    ├─── Language Modeling Loss
│    │    ├─── CrossEntropyLoss(ignore_index=-100)
│    │    ├─── [OPTIONAL] Label Smoothing
│    │    │    └─── Reduces overconfidence
│    │    └─── lm_loss = CE(shift_logits, shift_labels)
│    │
│    ├─── [OPTIONAL] MoE Auxiliary Losses
│    │    ├─── Router Auxiliary Loss (load balancing)
│    │    │    └─── aux_loss × router_aux_loss_coef (default: 0.01)
│    │    ├─── Router Z-Loss (stability)
│    │    │    └─── z_loss × router_z_loss_coef (default: 0.001)
│    │    └─── Summed across all MoE layers
│    │
│    └─── Total Loss
│         └─── total = lm_loss + aux_losses
│
├─── 📊 MONITORING & METRICS
│    │
│    ├─── MetricsTracker
│    │    ├─── Loss tracking (LM, aux, z-loss)
│    │    ├─── Perplexity: exp(lm_loss)
│    │    ├─── Gradient norms per layer
│    │    ├─── GPU memory usage
│    │    ├─── Expert usage statistics
│    │    ├─── Attention cache hit rate
│    │    └─── Periodic summary & clearing
│    │
│    ├─── Gradient Monitoring
│    │    ├─── Max gradient norm per layer
│    │    ├─── Mean gradient norm (EMA)
│    │    ├─── Gradient clipping count
│    │    └─── NaN/Inf detection
│    │
│    └─── Memory Monitoring
│         ├─── GPU memory allocated
│         ├─── GPU memory reserved
│         ├─── Automatic cache clearing
│         └─── Per-layer memory checkpoints
│
└─── 🔧 OPTIMIZATION FEATURES
     │
     ├─── Gradient Checkpointing
     │    ├─── Trade-off: 30% slower, 50% less memory
     │    ├─── Recomputes activations during backward
     │    └─── Enable: model.gradient_checkpointing_enable()
     │
     ├─── Mixed Precision Training (AMP)
     │    ├─── FP16/BF16 forward pass
     │    ├─── FP32 master weights
     │    ├─── Dynamic loss scaling
     │    └─── 2x speedup, 50% memory reduction
     │
     ├─── Gradient Accumulation
     │    ├─── Simulates a larger batch size
     │    ├─── loss = loss / accumulation_steps
     │    └─── optimizer.step() every N steps
     │
     ├─── KV Cache (Inference)
     │    ├─── Caches Key/Value tensors
     │    ├─── Reused for autoregressive generation
     │    ├─── Memory: O(num_layers × seq_len × hidden_size)
     │    └─── Speedup: ~10x for long sequences
     │
     └─── Quantization Support
          ├─── 8-bit (LLM.int8)
          │    ├─── bitsandbytes integration
          │    ├─── Mixed precision (outliers in FP16)
          │    └─── 2x memory reduction
          └─── 4-bit (QLoRA)
               ├─── NF4 quantization (normal-float 4-bit)
               ├─── Double quantization
               ├─── BF16 compute dtype
               └─── 4x memory reduction

CacaForCausalLM (3.52M)
│
├─ Embedding: 8,000 × 128
│
├─ Transformer Layers (6x)
│  ├─ RMSNorm
│  ├─ Attention (GQA)
│  │  ├─ Q: 4 heads × 32 dim
│  │  ├─ KV: 2 heads × 32 dim
│  │  ├─ RoPE (θ=10,000)
│  │  └─ Flash Attention v2
│  ├─ Residual
│  ├─ RMSNorm
│  ├─ FFN (SwiGLU)
│  │  ├─ Gate: 128 → 512
│  │  ├─ Up: 128 → 512
│  │  └─ Down: 512 → 128
│  └─ Residual
│
├─ Final RMSNorm
└─ LM Head: 128 → 8,000
═══════════════════════════════════════════════════════════
📊 PARAMETER BREAKDOWN:
═══════════════════════════════════════════════════════════
Embeddings:           1,024,000  ( 29.1%)
Transformer Layers:   1,474,560  ( 41.8%)
├─ Attention:           294,912
└─ FFN:               1,179,648
LM Head & norms:      1,025,920  ( 29.1%)
Final Norm:                 128  (  0.0%)
───────────────────────────────────────────────────────────
TOTAL:                3,524,608  (100.0%)
═══════════════════════════════════════════════════════════
```
**Key Design Decisions:**
1. **GQA over MHA**: saves 50% of KV-cache memory with minimal accuracy loss
2. **SwiGLU over GELU**: ~10% better performance on language modeling
3. **RMSNorm over LayerNorm**: faster & more stable, with no bias term
4. **RoPE over learned embeddings**: better extrapolation to sequence lengths beyond training
5. **No bias in Linear layers**: follows modern LLM best practice (LLaMA-style)
</details>
---
## 📚 Documentation
### 📦 Installing Dependencies
```bash
# Core dependencies (REQUIRED)
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors

# Optional: for maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2 (3x speedup)
pip install xformers                         # Memory-efficient attention
pip install bitsandbytes                     # 4/8-bit quantization

# Optional: for monitoring & profiling
pip install tensorboard wandb                # Training monitoring
pip install gputil psutil                    # Resource monitoring
```
**Compatibility Matrix:**
| Component | Version | Note |
|-----------|---------|------|
| Python | 3.8 - 3.11 | 3.11 recommended |
| PyTorch | ≥ 2.0.0 | 2.1+ for optimal SDPA |
| CUDA | 11.8 / 12.1 | For Flash Attention |
| Transformers | ≥ 4.35.0 | For AutoModel support |
### Usage
#### 1️⃣ Basic Loading
```python
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer
import torch

# Load the configuration
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True
)

# Load the model (FP16 for efficiency)
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"  # automatic device placement
)

# This model is UNTRAINED - it needs training first!
print(f"Model loaded: {model.num_parameters():,} parameters")
print("⚠️ This model is untrained and cannot be used for inference yet")
```
#### 2๏ธโƒฃ Quantized Loading (4-bit/8-bit)
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

# Load the model with quantization
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1M-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)

print("Memory footprint: ~0.00GB (4-bit)")
```
#### 3๏ธโƒฃ Training Setup
```python
from transformers import TrainingArguments, Trainer
# Training configuration
training_args = TrainingArguments(
output_dir="./output",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=2e-4,
max_steps=10000,
lr_scheduler_type="cosine",
warmup_steps=500,
logging_steps=10,
save_steps=500,
fp16=True, # Mixed precision
gradient_checkpointing=True, # Memory efficient
)
# Initialize trainer
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
# Start training
trainer.train()
```
### Advanced Usage
#### Gradient Checkpointing (Memory Efficient)
```python
model.gradient_checkpointing_enable()
print("โœ… Gradient checkpointing enabled - saves ~40% memory")
```
#### Custom Training Loop
```python
import torch
from torch.optim import AdamW
from torch.cuda.amp import autocast, GradScaler
optimizer = AdamW(model.parameters(), lr=2e-4)
scaler = GradScaler()
for batch in dataloader:
# Mixed precision forward
with autocast(dtype=torch.bfloat16):
outputs = model(**batch)
loss = outputs.loss
# Backward with gradient scaling
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
optimizer.zero_grad()
```
#### Multi-GPU Training (DDP)
```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel
# Initialize process group
dist.init_process_group(backend="nccl")
# Wrap model
model = DistributedDataParallel(
model,
device_ids=[local_rank],
find_unused_parameters=False
)
```
---
## โš™๏ธ Konfigurasi Detail
### Full Configuration JSON
```json
{
"architectures": ["CacaForCausalLM"],
"model_type": "caca",
"vocab_size": 8000,
"hidden_size": 128,
"intermediate_size": 512,
"num_hidden_layers": 6,
"num_attention_heads": 4,
"num_key_value_heads": 2,
"head_dim": 32,
"max_position_embeddings": 1024,
"rope_theta": 10000,
"rms_norm_eps": 1e-06,
"use_cache": true,
"use_qk_norm": true,
"use_flash_attn": true,
"attention_dropout": 0.0,
"hidden_dropout": 0.1,
"torch_dtype": "float16"
}
```
### Custom Configuration
```python
from transformers import AutoConfig
# Load and modify the config
config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained")
# Custom modifications
config.max_position_embeddings = 16384 # Extend context
config.rope_scaling = {"type": "linear", "factor": 2.0}
config.use_flash_attn = True
config.hidden_dropout = 0.05
# Save custom config
config.save_pretrained("./custom_config")
```
---
## 🔬 Architecture
<details>
<summary><b>Layer Structure</b></summary>
**Input Tokens**
↓
**Embedding Layer** (vocab_size → hidden_size)
↓
**Decoder Block × N**
- RMSNorm
- Multi-Head Attention (GQA)
  - Flash Attention v2
  - Query heads, KV heads
  - RoPE position encoding
- Residual Connection
- RMSNorm
- Feed-Forward Network (SwiGLU)
  - Gate: hidden → intermediate
  - Up: hidden → intermediate
  - Down: intermediate → hidden
- Residual Connection
↓
**RMSNorm (Final)**
↓
**LM Head** (hidden → vocab_size)
↓
**Output Logits**
</details>
### Attention Mechanism (GQA)
```
Query:  [4 heads × 32 dim] = 128
Key:    [2 heads × 32 dim] = 64
Value:  [2 heads × 32 dim] = 64

Grouped Query Attention:
- Every 2 query heads share 1 KV head
- KV-cache memory: 50% smaller than Multi-Head Attention
- Quality close to MHA, speed close to MQA
```
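
For concreteness, a minimal PyTorch sketch of this K/V sharing; shapes follow the `[batch, heads, seq, head_dim]` convention and SDPA stands in for the attention kernel (illustrative, not the repository's code):

```python
import torch

def repeat_kv(kv: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Expand 2 KV heads to match 4 query heads (groups = 4 / 2 = 2)."""
    return kv.repeat_interleave(num_groups, dim=1)

batch, seq = 1, 16
q = torch.randn(batch, 4, seq, 32)   # 4 query heads × 32 dims
k = torch.randn(batch, 2, seq, 32)   # 2 KV heads × 32 dims
v = torch.randn(batch, 2, seq, 32)

k, v = repeat_kv(k, 2), repeat_kv(v, 2)   # now 4 heads each
attn = torch.nn.functional.scaled_dot_product_attention(q, k, v, is_causal=True)
print(attn.shape)  # torch.Size([1, 4, 16, 32])
```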
### Feed-Forward Network (SwiGLU)
```
FFN(x) = (SiLU(x·W_gate) ⊙ x·W_up)·W_down

Where:
- W_gate: 128 × 512
- W_up:   128 × 512
- W_down: 512 × 128
- SiLU(x) = x · sigmoid(x)
- ⊙ = element-wise multiplication
```
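
The same FFN as a minimal PyTorch module, with this model's 128/512 sizes as defaults (an illustrative sketch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """FFN(x) = (SiLU(x @ W_gate) * (x @ W_up)) @ W_down, as above."""
    def __init__(self, hidden: int = 128, intermediate: int = 512):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)  # W_gate
        self.up   = nn.Linear(hidden, intermediate, bias=False)  # W_up
        self.down = nn.Linear(intermediate, hidden, bias=False)  # W_down

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))

ffn = SwiGLU()
print(ffn(torch.randn(1, 16, 128)).shape)  # torch.Size([1, 16, 128])
```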
## 💬 Chat Format & Prompt Engineering
### 📝 Chat Template
The model supports a standard chat format for conversational AI:
```python
# Built-in chat template format
chat_template = """
{% for message in messages %}
{% if message['role'] == 'system' %}
System: {{ message['content'] }}
{% elif message['role'] == 'user' %}
User: {{ message['content'] }}
{% elif message['role'] == 'assistant' %}
Assistant: {{ message['content'] }}
{% endif %}
{% endfor %}
{% if add_generation_prompt %}Assistant:{% endif %}
"""
# Example usage
messages = [
{"role": "system", "content": "Kamu adalah asisten AI yang membantu dan ramah."},
{"role": "user", "content": "Jelaskan tentang fotosintesis"},
{"role": "assistant", "content": "Fotosintesis adalah proses di mana tumbuhan mengubah cahaya matahari menjadi energi kimia..."},
{"role": "user", "content": "Apa manfaatnya bagi manusia?"},
]
# Apply template
formatted = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
print(formatted)
# Output:
# System: Kamu adalah asisten AI yang membantu dan ramah.
#
# User: Jelaskan tentang fotosintesis
# Assistant: Fotosintesis adalah proses di mana tumbuhan...
# User: Apa manfaatnya bagi manusia?
# Assistant:
```
---
## 🎯 Use Cases
Once trained, this model is designed for a range of NLP applications:
### Text Generation
- ✍️ Creative writing & storytelling
- 📰 Article generation
- 💬 Conversational AI
- 🔄 Text completion
### Language Understanding
- 📊 Text classification
- 🏷️ Named Entity Recognition (NER)
- ❓ Question Answering
- 📝 Summarization
### Code Generation
- 💻 Code completion
- 🐛 Bug-fixing suggestions
- 📚 Documentation generation
- 🔄 Code translation
### Multilingual Tasks
- 🌏 Translation (ID ↔ EN)
- 🗣️ Cross-lingual understanding
- 🌐 Multilingual classification
---
## 📈 Benchmarks & Evaluation
> ⚠️ The model has not been evaluated yet because it is untrained.
After training, the model will be evaluated on:
### Indonesian Benchmarks
- **IndoNLU**: Comprehensive Indonesian NLU tasks
- **IndoQA**: Indonesian Question Answering
- **IndoSum**: Summarization
- **IndoNER**: Named Entity Recognition
### Multilingual Benchmarks
- **MMLU**: Massive Multitask Language Understanding
- **HellaSwag**: Common sense reasoning
- **ARC**: Science QA
- **TruthfulQA**: Truthfulness evaluation
### Generation Quality
- **Perplexity**: Language modeling quality
- **BLEU/ROUGE**: Translation & summarization
- **Human Evaluation**: Fluency, coherence, factuality
---
## ๐Ÿ› ๏ธ Development & Training Tips
### Optimal Batch Size
```python
# Rule of thumb for this 3.52M-parameter model
# GPU memory → batch size per device
if gpu_memory >= 80:    # A100 80GB
    batch_size = 4539
    gradient_accumulation = 1
elif gpu_memory >= 40:  # A100 40GB
    batch_size = 2269
    gradient_accumulation = 1
elif gpu_memory >= 24:  # RTX 3090/4090
    batch_size = 1361   # rough proportional estimate (~57 per GB)
    gradient_accumulation = 1

# Effective batch size = batch_size × gradient_accumulation × num_gpus
```
### Learning Rate Scheduling
```python
# Recommended for the 3.52M model
learning_rate = 0.0005   # base LR
warmup_ratio = 0.05      # 5% of total steps
lr_scheduler = "cosine"  # or "linear"

# Learning-rate scaling rule:
# LR ∝ sqrt(batch_size)
# For batch size 256: LR = 5.00e-04
# For batch size 512: LR = 7.07e-04
```
### Gradient Clipping
```python
# Prevent gradient explosion
max_grad_norm = 1.0 # Clip at 1.0
# Monitor gradients
from torch.nn.utils import clip_grad_norm_
grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm)
if grad_norm > 10.0:
print(f"โš ๏ธ High gradient norm: {grad_norm:.2f}")
```
### Training Stability
Tips for stable training:
1. **Warmup**: start with a low learning rate
2. **Gradient checkpointing**: reduces the memory footprint
3. **Mixed precision**: use BF16 when available (more stable than FP16)
4. **Batch size**: start small, increase gradually
5. **Monitoring**: track loss, perplexity, and gradient norms
---
## 🔧 Troubleshooting
### Out of Memory (OOM)
```python
# Fixes for OOM during training:

# 1. Enable gradient checkpointing
model.gradient_checkpointing_enable()

# 2. Reduce the batch size
per_device_train_batch_size = 1

# 3. Increase gradient accumulation
gradient_accumulation_steps = 32

# 4. Use quantization
load_in_8bit = True  # or load_in_4bit

# 5. Reduce the sequence length
max_length = 1024  # start here

# 6. CPU offloading (if needed)
device_map = "auto"
offload_folder = "offload"
```
### Slow Training
```python
# Speed optimizations for training:

# 1. Flash Attention
config.use_flash_attn = True  # 2-3x speedup

# 2. Compile the model (PyTorch 2.0+)
model = torch.compile(model, mode="reduce-overhead")

# 3. DataLoader optimization
dataloader = DataLoader(
    dataset,
    batch_size=batch_size,
    num_workers=4,    # parallel data loading
    pin_memory=True,  # faster GPU transfer
    prefetch_factor=2
)

# 4. Mixed precision
use_fp16 = True  # or bf16

# 5. Optimize communication (multi-GPU)
find_unused_parameters = False
gradient_as_bucket_view = True
```
### NaN Loss
```python
# If the loss becomes NaN:

# 1. Reduce the learning rate
learning_rate = learning_rate * 0.1

# 2. Clip gradient norms
clip_grad_norm_(model.parameters(), 1.0)

# 3. Use BF16 instead of FP16 (more stable)
torch_dtype = torch.bfloat16

# 4. Add epsilon to RMSNorm
rms_norm_eps = 1e-5  # increase if needed

# 5. Check the data (see the guard below)
# Make sure there are no inf/nan values in the dataset
assert not torch.isnan(input_ids).any()
assert not torch.isinf(attention_mask).any()
```
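For fix 5, a cheap guard (same assumed names as above) that stops the run the moment the loss goes non-finite instead of training on garbage for hours:
```python
import torch

loss = model(**batch, labels=batch["input_ids"]).loss
if not torch.isfinite(loss):
    # Halt and inspect the batch rather than continue a poisoned run
    raise RuntimeError(f"Non-finite loss: {loss.item()}")
loss.backward()
```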
---
### ๐Ÿšซ Prohibited Uses
<div style="background: #ffebee; border-left: 4px solid #f44336; padding: 12px; margin: 16px 0;">
This model **MUST NOT** be used for:
- ๐Ÿšซ **Harmful content generation** (violence, self-harm, illegal acts)
- ๐Ÿšซ **Misinformation/disinformation campaigns**
- ๐Ÿšซ **Harassment or hate speech**
- ๐Ÿšซ **Impersonation or identity theft**
- ๐Ÿšซ **Child safety violations** (CSAM, grooming, exploitation)
- ๐Ÿšซ **Privacy violations** (doxxing, stalking, surveillance abuse)
- ๐Ÿšซ **Malicious code generation** (malware, exploits, etc)
- ๐Ÿšซ **Spam or manipulation** (fake reviews, astroturfing)
- ๐Ÿšซ **Medical/legal advice** (tanpa disclaimer & expert review)
- ๐Ÿšซ **Financial fraud** (scams, market manipulation)
**Violation consequences:** model access revocation + legal action where applicable
</div>
---
## ๐Ÿ“š References & Papers
### Core Architecture
1. **LLaMA** - [Touvron et al., 2023](https://arxiv.org/abs/2302.13971)
- RMSNorm, RoPE, SwiGLU, GQA
2. **GPT-4** - [OpenAI Technical Report, 2023](https://arxiv.org/abs/2303.08774)
- Mixture of Experts (speculated)
3. **Gemini** - [Google DeepMind, 2023](https://arxiv.org/abs/2312.11805)
- Multimodal architecture, soft-capping
4. **Qwen** - [Alibaba Cloud, 2023](https://arxiv.org/abs/2309.16609)
- YaRN, long context
5. **Gemma** - [Google, 2024](https://arxiv.org/abs/2403.08295)
- Layer scaling, normalization
### Advanced Techniques
6. **Flash Attention 2** - [Dao, 2023](https://arxiv.org/abs/2307.08691)
7. **Mixture-of-Depths** - [Raposo et al., 2024](https://arxiv.org/abs/2404.02258)
8. **StreamingLLM** - [Xiao et al., 2023](https://arxiv.org/abs/2309.17453)
9. **YaRN** - [Peng et al., 2023](https://arxiv.org/abs/2309.00071)
10. **QLoRA** - [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314)
---
## โš ๏ธ Known Limitations
1. **Training Cost** - MoE + multimodal = expensive
2. **Complex Debugging** - Many fallback systems
3. **Memory Hungry** - When all features are enabled
4. **Dependency Hell** - Requires flash-attn, xformers, bitsandbytes
5. **Expert Balancing** - MoE needs careful tuning for load balancing
---
## ๐Ÿ“œ License & Citation
### ๐Ÿ“„ License
<div style="background: #e8f5e9; border-left: 4px solid #4caf50; padding: 12px; margin: 16px 0;">
This model is released under the **Apache License 2.0**
✅ **You are FREE to:**
- ✔️ Use it commercially
- ✔️ Modify it as you wish
- ✔️ Redistribute it
- ✔️ Make patent use of it
- ✔️ Use it privately
⚠️ **Provided that you:**
- 📄 Include the license & copyright notice
- 📝 State the changes you made
- 📋 Include the disclaimer of warranty
❌ **No warranty whatsoever** (use at your own risk)
</div>
**Full license text**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)
## ๐Ÿ“– Citation
If you use this model in your research, please cite:
```bibtex
@misc{cacacaca1m,
author = {Lyon},
title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention},
year = {2026},
publisher = {Hugging Face},
journal = {Hugging Face Model Hub},
howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}},
note = {Untrained model with 3,524,608 parameters}
}
```
**APA Style:**
```
Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped
Query Attention [Untrained model]. Hugging Face.
https://huggingface.co/Lyon28/caca-1M-untrained
```
**MLA Style:**
```
Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention."
Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained.
```
---
### ๐Ÿ™ Acknowledgments
This model stands on the shoulders of giants! Thanks to:
<details>
<summary><b>๐Ÿ›๏ธ Klik untuk daftar lengkap acknowledgments</b></summary>
#### ๐Ÿ—๏ธ **Core Architecture**
- **LLaMA/LLaMA 2** (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU
- Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971)
- Authors: Hugo Touvron et al.
- **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm
- **PaLM** (Google, 2022) - SwiGLU activation insights
#### ๐ŸŽฏ **Attention Mechanisms**
- **Flash Attention v2** (Tri Dao et al., Stanford, 2023)
- Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691)
- 3x speedup with an IO-aware algorithm
- **Grouped Query Attention** (Joshua Ainslie et al., Google, 2023)
- Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245)
- Memory-efficient KV cache
- **Multi-Query Attention** (Noam Shazeer, Google, 2019)
- Fast inference with shared K/V
- **xFormers** (Meta AI, 2022) - Memory efficient attention
- **PyTorch SDPA** (PyTorch Team, 2023) - Native attention optimization
#### ๐Ÿ“ **Position Encodings**
- **RoPE** (Jianlin Su et al., Zhuiyi Technology, 2021)
- Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864)
- Superior length extrapolation
- **ALiBI** (Ofir Press et al., 2022)
- Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409)
- Length generalization without retraining
- **YaRN** (Bowen Peng et al., 2023)
- Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071)
#### ๐ŸชŸ **Long Context & Efficiency**
- **Sliding Window Attention** (Albert Jiang et al., Mistral AI, 2023)
- Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825)
- **StreamingLLM** (Guangxuan Xiao et al., MIT, 2023)
- Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453)
- Infinite sequence length!
- **Logit Softcapping** (Google Gemma Team, 2024)
- Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295)
#### ๐Ÿง  **Mixture of Experts**
- **Mixtral 8x7B** (Albert Jiang et al., Mistral AI, 2024)
- Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088)
- State-of-the-art sparse MoE
- **Switch Transformers** (William Fedus et al., Google, 2021)
- Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961)
- Expert scaling insights
- **GLaM** (Nan Du et al., Google, 2021) - Generalist Language Model
- **Expert Choice Routing** (Yanqi Zhou et al., Google, 2022)
- Better load balancing
#### ๐ŸŽ“ **Training Optimizations**
- **Layer Scale** (Hugo Touvron et al., Meta, 2021)
- Paper: [Going Deeper with Image Transformers](https://arxiv.org/abs/2103.17239)
- Training stability for deep networks
- **Stochastic Depth** (Gao Huang et al., 2016)
- Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382)
- **Mixture of Depths** (David Raposo et al., DeepMind, 2024)
- Paper: [Mixture-of-Depths: Dynamically allocating compute](https://arxiv.org/abs/2404.02258)
- Dynamic compute allocation
- **Gradient Checkpointing** (Tianqi Chen et al., 2016)
#### ๐Ÿ“ฆ **Quantization**
- **LLM.int8()** (Tim Dettmers et al., 2022)
- Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339)
- **QLoRA** (Tim Dettmers et al., 2023)
- Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314)
- 4-bit efficient fine-tuning
- **bitsandbytes** (Tim Dettmers) - Quantization library
#### ๐ŸŽจ **Multimodal**
- **Vision Transformer** (Alexey Dosovitskiy et al., Google, 2020)
- Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929)
- **Flamingo** (Jean-Baptiste Alayrac et al., DeepMind, 2022)
- Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198)
- Perceiver Resampler
- **BLIP-2** (Junnan Li et al., Salesforce, 2023)
- Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597)
- Q-Former architecture
- **Whisper** (Alec Radford et al., OpenAI, 2022) - Audio encoding
#### ๐Ÿ› ๏ธ **Normalization & Activations**
- **RMSNorm** (Biao Zhang, Rico Sennrich, 2019)
- Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467)
- **SwiGLU** (Noam Shazeer, Google, 2020)
- Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202)
#### ๐Ÿ”ง **Tools & Frameworks**
- **๐Ÿค— Hugging Face** - Transformers, Accelerate, PEFT
- Making NLP accessible to everyone
- **PyTorch** - Deep learning framework
- Facebook AI Research team
- **Safetensors** - Secure serialization
- Hugging Face team
- **DeepSpeed** - Distributed training
- Microsoft Research
- **Flash Attention Implementation** - Tri Dao & team
#### ๐Ÿ‡ฎ๐Ÿ‡ฉ **Indonesian NLP Community**
Special thanks to the Indonesian NLP researchers & practitioners who have built the foundation for Indonesian-language AI.
</details>
---
## ๐Ÿค Contributing
We warmly welcome contributions! Here's how you can contribute:
### Training & Fine-tuning
- ๐ŸŽ“ Train model ini dengan dataset Anda
- ๐Ÿ“Š Share benchmark results
- ๐Ÿ”ฌ Experiment dengan hyperparameters
### Code & Architecture
- ๐Ÿ› Report bugs atau issues
- ๐Ÿ’ก Suggest improvements
- ๐Ÿ”ง Submit pull requests
### Documentation
- ๐Ÿ“š Improve documentation
- ๐ŸŒ Add translations
- โœ๏ธ Write tutorials & guides
### Dataset & Evaluation
- ๐Ÿ“ Contribute training data
- ๐Ÿงช Create evaluation benchmarks
- ๐ŸŽฏ Share fine-tuned versions
---
## ๐Ÿ‘ฅ Team & Acknowledgments
### Core Team
- **LyonPoy** - Architecture design & implementation
### Special Thanks
- ๐Ÿค— **Hugging Face** - Infrastructure & community
- โšก **FlashAttention Team** - Efficient attention implementation
- ๐Ÿง  **Anthropic, Google, Meta, openAI, etc** - Research inspirations
- Meta AI (LLaMA)
- OpenAI (GPT series)
- Google DeepMind (Gemini, Gemma)
- Alibaba Cloud (Qwen)
- HuggingFace (Transformers library)
- Tri Dao (Flash Attention)
- Tim Dettmers (bitsandbytes)
### Community
Thanks to the open-source community for its contributions to:
- Transformers library
- PyTorch framework
- Datasets & evaluation tools
---
## ๐Ÿ“ž Contact & Support
### Community
- ๐Ÿ’ฌ [Discussions](https://huggingface.co/Lyon28/caca-1M-untrained/discussions) - Ask questions
- ๐Ÿ› [Issues](https://github.com/Lyon-28/caca-transformers/issues) - Report bugs
- ๐Ÿ“ง Email : cacatransformers@gmail.com
---
## ๐ŸŒŸ Star History
<div align="center">
[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)
</div>
## ๐Ÿ’ Dibuat dengan โค๏ธ untuk Komunitas AI Indonesia
<img src="https://i.postimg.cc/MTSj073X/logo.png" width="200" alt="Caca Logo"/>
### **Thank you for using Caca!**
If this model is useful to you, don't forget to ⭐ our repository!
<div align="center">
<table>
<tr>
<td align="center">โญ<br/><b>Star Repo</b><br/><sub>Show your support</sub></td>
<td align="center">๐Ÿ”—<br/><b>Share</b><br/><sub>Tell your friends</sub></td>
<td align="center">๐Ÿ’ฌ<br/><b>Join Discussion</b><br/><sub>Ask questions</sub></td>
<td align="center">๐Ÿค<br/><b>Contribute</b><br/><sub>Make it better</sub></td>
</tr>
</table>
### ๐Ÿš€ Happy Training! ๐Ÿš€
**This model is waiting to be trained into the foundation for your AI applications.**
[๐Ÿ“ฅ Download Model](#) โ€ข [๐Ÿ“– Read Docs](https://github.com/Lyon-28/caca-transformers) โ€ข [๐Ÿ’ฌ Join Community](https://github.com/Lyon-28/caca-transformers)
</div>
---
### ๐Ÿ“Š Model Statistics
<img src="https://img.shields.io/badge/Parameters-3.52M-blue?style=for-the-badge" alt="Parameters"/>
<img src="https://img.shields.io/badge/Status-Untrained-orange?style=for-the-badge" alt="Status"/>
<img src="https://img.shields.io/badge/License-Apache%202.0-green?style=for-the-badge" alt="License"/>
<img src="https://img.shields.io/badge/Architecture-Transformer-purple?style=for-the-badge" alt="Architecture"/>
<img src="https://img.shields.io/badge/Type-Causal%20LM-red?style=for-the-badge" alt="Type"/>
<img src="https://img.shields.io/badge/Context-1,024%20tokens-cyan?style=for-the-badge" alt="Context"/>
---
### ๐ŸŽจ Daily Inspiration
<div align="center">
<img src="https://quotes-caca.vercel.app/api/SvgQuote" alt="Daily Quote" width="600" />
</div>
---
### ๐Ÿ“ˆ Quick Stats
| Metric | Value |
|--------|-------|
| ๐Ÿ’Ž Total Parameters | 3,524,608 |
| ๐Ÿ—๏ธ Layers | 6 |
| ๐ŸŽฏ Attention Heads | 4 |
| ๐Ÿ“– Max Context | 1,024 tokens |
| ๐Ÿ’พ Size (FP16) | 0.01 GB |
| ๐Ÿ’พ Size (INT4) | 0.00 GB |
---
<sub>
This model is part of the <b>Caca Project</b>, an open-source initiative to build the Indonesian LLM ecosystem.<br/>
Created with ๐Ÿ’ป by <a href="https://huggingface.co/Lyon28">@Lyon28</a> |
Licensed under <a href="https://www.apache.org/licenses/LICENSE-2.0">Apache 2.0</a> |
Built with <a href="https://huggingface.co">๐Ÿค— HuggingFace</a>
</sub>
<br/><br/>
**๐ŸŒŸ "Dari nol, untuk semua" ๐ŸŒŸ**
<sub>Last updated: January 2026</sub>
</div>
---
<div align="center">
<sub>Built with โค๏ธ by Caca Transformers Team</sub><br>
<sub>Powered by ๐Ÿค— Transformers โ€ข โšก PyTorch โ€ข ๐Ÿ”ฅ Flash Attention</sub>
</div>