--- license: apache-2.0 language: - id - en tags: - text-generation - pytorch - causal-lm - transformer - untrained - gqa - rope - swiglu - rmsnorm - flash-attention - indonesian - bilingual library_name: transformers pipeline_tag: text-generation widget: - text: "Jakarta adalah ibu kota" example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Pelengkapan Teks (ID)" - text: | Pertanyaan: Apa itu kecerdasan buatan? Jawaban: example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Tanya Jawab (ID)" - text: | Tulis cerita pendek tentang robot yang belajar mencintai. example_title: "๐Ÿ‡ฎ๐Ÿ‡ฉ Penulisan Kreatif (ID)" - text: "The capital of Indonesia is" example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Text Completion (EN)" - text: | Question: What is artificial intelligence? Answer: example_title: "๐Ÿ‡ฌ๐Ÿ‡ง Question Answering (EN)" - text: | def fibonacci(n): """Hitung bilangan fibonacci ke-n""" example_title: "๐Ÿ’ป Pelengkapan Kode" - text: | # Fungsi untuk mengurutkan array def sort_array(arr): example_title: "๐Ÿ’ป Generasi Kode" - text: | User: Halo! Siapa kamu? Assistant: example_title: "๐Ÿ’ฌ Format Chat (ID)" - text: | User: Jelaskan tentang machine learning dalam 2 kalimat. Assistant: example_title: "๐Ÿ’ฌ Conversational (ID)" inference: parameters: max_new_tokens: 100 temperature: 0.7 top_p: 0.9 top_k: 50 do_sample: true repetition_penalty: 1.1 num_beams: 1 datasets: [] metrics: - perplexity - accuracy model-index: - name: caca-1M results: [] ---
caca-1M # ๐Ÿค– caca-1M ### Arsitektur Transformer Modern dengan Fitur Canggih [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/) [![Transformers](https://img.shields.io/badge/๐Ÿค—%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers) [![Model Type](https://img.shields.io/badge/Model-Causal%20LM-green.svg)]() [![Parameters](https://img.shields.io/badge/Parameters-3.52M-orange.svg)]() [![Status](https://img.shields.io/badge/Status-Untrained-red.svg)]() **3,524,608** parameters โ€ข **3.52M** โ€ข **6 layers** โ€ข **1,024 tokens** [๐Ÿ“š Documentation](#-dokumentasi) โ€ข [๐Ÿ’ป Usage](#-cara-penggunaan) โ€ข [โš™๏ธ Configuration](#๏ธ-konfigurasi-detail) โ€ข [๐Ÿ”ฌ Architecture](#-arsitektur)
--- ## โš ๏ธ PENTING: Model Belum Dilatih (Untrained)
โš ๏ธ PERHATIAN: Ini adalah model yang belum melalui proses training. Bobot model masih dalam kondisi random initialization. Output yang dihasilkan akan tidak bermakna dan acak.
**Status Model:** - ๐Ÿ”ด **Belum dilatih** - Bobot masih random (Kaiming/Xavier init) - ๐ŸŸก **Untuk riset & eksperimen** - Arsitektur sudah siap, tinggal train - ๐ŸŸข **Production-ready architecture** - Teruji dan optimal Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas tinggi. ### ๐ŸŽฏ Apa yang Bisa Dilakukan? | โœ… Bisa | โŒ Belum Bisa | |---------|----------------| | Load model architecture | Generate teks bermakna | | Test forward pass | Menjawab pertanyaan | | Measure memory & speed | Reasoning & understanding | | Start training | Production deployment | | Fine-tuning experiments | Real-world applications | --- ## ๐Ÿ“‹ Deskripsi **CACA** (Collaborative Architecture for Contextual AI) adalah arsitektur Large Language Model (LLM) yang menggabungkan **best practices** dari berbagai model State-of-the-Art (SOTA) seperti **LLaMA**, **GPT-4**, **Gemini**, **Qwen**, dan **Gemma**. Model ini dirancang dengan fokus pada **efisiensi komputasi**, **skalabilitas**, dan **performa tinggi** โ€” menjadikannya **modular**, **production-ready**, dan mendukung **multimodal** (teks, gambar, audio).

๐Ÿ“– Tentang Project Caca

Caca adalah eksperimen open-source Indonesian LLM yang dibuat dari nol secara individual dan bertahap. Bukan kompetitor siapa-siapa, cuma pengen eksplorasi apa yang bisa dilakukan dengan budget terbatas, passion unlimited, dan mindset collaborative.

Kalau berguna buat orang lain, alhamdulillah. Kalau enggak, ya tetap fun kok. Ini proyek eksplorasi, jadi kalau gagal ya bagian dari proses belajar. Kalau berhasil, itu bonus.

โ€” Lyon, Creator

### โœจ **Highlights** - ๐Ÿง  **Hybrid Architecture** โ€” Kombinasi teknik terbaik dari 5+ model SOTA - ๐ŸŽญ **Multimodal Native** โ€” Support teks, gambar, dan audio dalam satu model - โšก **High Performance** โ€” Flash Attention, MoE, dan optimasi modern - ๐ŸŒ **Indonesian-First** โ€” Dikembangkan dengan fokus pada Bahasa Indonesia - ๐Ÿ”“ **Open Source** โ€” Transparent, reproducible, collaborative ### ๐ŸŒŸ Mengapa Caca? 1. **๐Ÿ‡ฎ๐Ÿ‡ฉ Fokus pada Bahasa Indonesia** - Dirancang dengan mempertimbangkan karakteristik bahasa Indonesia 2. **โšก Efisiensi Tinggi** - GQA & Flash Attention untuk inferensi 3-5x lebih cepat 3. **๐Ÿ’พ Memory Efficient** - Hemat 50% memory untuk KV cache 4. **๐Ÿ”ง Modular & Extensible** - Mudah dikustomisasi untuk berbagai use case 5. **๐ŸŒ Bilingual** - Support optimal untuk Indonesia & English **CACA** hadir dengan filosofi berbeda: - โœ… **Fully open-source** โ€” dari architecture sampai training code - โœ… **Modular & scalable** โ€” bisa disesuaikan dari 1B sampai 70B+ parameters - โœ… **Resource-efficient** โ€” optimized untuk budget terbatas - โœ… **Indonesian-centric** โ€” prioritas pada Bahasa Indonesia - โœ… **Community-driven** โ€” open for contributions & collaborations ## ๐Ÿ“ˆ Perbandingan dengan Model Lain | Fitur | LLaMA | GPT-4 | Gemini | Qwen | CACA | |-------|-------|-------|--------|------|------| | **RMSNorm** | โœ… | โŒ | โŒ | โœ… | โœ… | | **RoPE** | โœ… | โŒ | โŒ | โœ… | โœ… | | **GQA** | โœ… | โŒ | โŒ | โœ… | โœ… | | **MoE** | โŒ | โœ… | โœ… | โŒ | โœ… | | **Multimodal** | โŒ | โœ… | โœ… | โœ… | โœ… | | **Flash Attention** | โœ… | โœ… | โœ… | โœ… | โœ… | | **Sliding Window** | โŒ | โŒ | โŒ | โœ… | โœ… | | **Attention Sinks** | โŒ | โŒ | โŒ | โŒ | โœ… | | **MoD** | โŒ | โŒ | โŒ | โŒ | โœ… | | **Expert Choice** | โŒ | โŒ | โŒ | โŒ | โœ… | | **YARN Scaling** | โŒ | โŒ | โŒ | โœ… | โœ… | | **Quantization** | โœ… | โŒ | โŒ | โœ… | โœ… | --- ## ๐ŸŽฏ Use Cases & Applications ### โœ… Cocok Untuk
**๐Ÿ”ฌ Research & Development** - Eksperimen arsitektur transformer - Ablation studies - Novel training techniques - Architecture search **๐Ÿ“š Academic & Education** - Thesis & research papers - Teaching materials - Student projects - LLM internals understanding **๐Ÿš€ Base Model for Fine-tuning** - Task-specific models - Domain adaptation - Instruction tuning - RLHF experiments **๐Ÿ’ก Prototyping** - Proof of concept - Feature testing - A/B testing architectures - Benchmark comparisons
### โŒ Tidak Cocok Untuk
- ๐Ÿšซ **Production Applications** - Model belum dilatih, output random - ๐Ÿšซ **Real-world Deployment** - Perlu training & safety alignment dulu - ๐Ÿšซ **Safety-critical Systems** - Tidak ada safety guardrails - ๐Ÿšซ **Direct User-facing Apps** - Output tidak dapat diprediksi - ๐Ÿšซ **Commercial Use (as-is)** - Harus dilatih terlebih dahulu
--- ## ๐Ÿ“Š Spesifikasi Model
Parameter Value Parameter Value
Total Parameters 3,524,608 Vocab Size 8,000
Hidden Size 128 Intermediate Size 512
Num Layers 6 Attention Heads 4
KV Heads (GQA) 2 Head Dimension 32
Max Context Length 1,024 RoPE Base (ฮธ) 10,000
Model Size (FP16) 0.01 GB Formatted Size 3.52M
--- ### ๐ŸŽฏ Core Features
๐Ÿ” Klik untuk expand/collapse - โœ… **Grouped Query Attention (GQA)** - Efisiensi memori dan komputasi superior - Query heads: **4** - KV heads: **2** - Ratio: **2:1** (hemat ~50% memory KV cache) - **Benefit**: Inferensi lebih cepat dengan memory footprint lebih kecil - โœ… **Rotary Position Embeddings (RoPE)** - Generalisasi konteks panjang lebih baik - Theta (ฮธ): **10,000** - Support extrapolation untuk konteks > training length - **Benefit**: Performa stabil pada sequence length yang belum pernah dilihat saat training - โœ… **RMSNorm** - Normalisasi lebih stabil dan ~50% lebih cepat dari LayerNorm - Epsilon: **1e-06** - **Benefit**: Training lebih stabil, inference lebih cepat, gradient flow lebih baik - โœ… **SwiGLU Activation** - Performa 10-15% lebih baik dari ReLU/GELU - Intermediate size: **512** (4.0x hidden) - **Benefit**: Kapasitas model lebih besar tanpa menambah parameter signifikan - โœ… **Flash Attention 2** - Akselerasi hingga 3x dengan memory efficiency - Otomatis aktif jika tersedia CUDA device - IO-aware algorithm untuk minimal HBM access - **Benefit**: Training & inference jauh lebih cepat, support batch size lebih besar - โœ… **Hybrid Architecture** - Kombinasi teknik terbaik dari 5+ model SOTA - โœ… **Multimodal Support** - Native support untuk Vision dan Audio - โœ… **Mixture of Experts (MoE)** - Sparse activation untuk efisiensi - โœ… **Long Context** - Support hingga 8K+ tokens dengan YARN scaling - โœ… **Advanced Attention** - Flash Attention, Sliding Window, Attention Sinks - โœ… **Quantization Ready** - Support 4-bit dan 8-bit quantization - โœ… **Production Features** - Extensive error handling & monitoring
### ๐Ÿ”ฅ Advanced Features ### ๐ŸŽฏ Mekanisme Attention - โšก **Flash Attention v2** - Algoritma IO-aware yang 3x lebih cepat dari attention standar - ๐Ÿ”‘ **Grouped Query Attention (GQA)** - 4 Query heads : 2 KV heads - Rasio kompresi: **2:1** (hemat ~50% memory KV cache) - ๐Ÿš€ **xFormers Support** - Fallback memory-efficient attention - ๐ŸŽฏ **PyTorch SDPA** - Native scaled dot product attention ### ๐Ÿ“ Position Encodings - ๐Ÿ”„ **RoPE (Rotary Position Embeddings)** - Base frequency ฮธ=10,000 - Generalisasi lebih baik untuk sequence panjang dibanding absolute PE ### ๐ŸŽ“ Optimisasi Training - ๐Ÿ’พ **Gradient Checkpointing** - Trade compute for memory (support model hingga 100B+ params) - ๐ŸŽฏ **Mixed Precision Training** - Support FP16, BF16, dan TF32 - ๐Ÿ“‰ **Dropout Regularization** - Hidden dropout: 0.1 - Attention dropout: 0.0 - Residual dropout: 0.1 ### ๐Ÿ“ฆ Dukungan Quantization - 4๏ธโƒฃ **4-bit Quantization** - NF4 & FP4 via bitsandbytes - Memory reduction: ~**75%** (4GB โ†’ 1GB) - Accuracy loss: <2% pada kebanyakan tasks - Support double quantization untuk kompresi maksimal - 8๏ธโƒฃ **8-bit Quantization** - LLM.int8() dengan outlier handling - Memory reduction: ~**50%** (4GB โ†’ 2GB) - Accuracy loss: <1% - ๐Ÿ”„ **Dynamic Quantization** - Runtime quantization tanpa calibration ### ๐Ÿ”ฌ Advanced Features - ๐Ÿ“Š **Automatic Mixed Precision (AMP)** - Dynamic loss scaling - ๐ŸŽฏ **Gradient Clipping** - Stabilitas training dengan max norm clipping - ๐Ÿ“ˆ **Learning Rate Scheduling** - Support cosine, linear, warmup - ๐Ÿ’ก **Smart Memory Management** - Auto cache clearing & monitoring - ๐Ÿ” **Metrics Tracking** - Real-time perplexity, loss, gradient norms - ๐Ÿ›ก๏ธ **NaN/Inf Detection** - Automatic recovery dari numerical instability --- ## ๐Ÿงฉ Komponen Arsitektur ### 1๏ธโƒฃ **Dari LLaMA (Meta)** CACA mengadopsi komponen efisien dari LLaMA untuk performa optimal: ```python โœ“ RMSNorm # Normalisasi lebih efisien dari LayerNorm โœ“ Rotary Position Embeddings # Positional encoding yang lebih baik โœ“ SwiGLU Activation # Activation function dengan gating mechanism โœ“ Grouped-Query Attention # Hemat memory dengan shared K/V heads โœ“ Pre-normalization # Stabilitas training yang lebih baik ``` - RMSNorm **30% lebih cepat** dari LayerNorm - RoPE membuat model bisa **extrapolate ke context lebih panjang** - GQA **hemat 30-40% memory** dibanding Multi-Head Attention - SwiGLU **meningkatkan performa 3-5%** dibanding ReLU/GELU --- ### 2๏ธโƒฃ **Dari GPT-4 (OpenAI)** Implementasi Mixture of Experts untuk skalabilitas: ```python โœ“ Mixture of Experts (MoE) # Sparse activation dengan multiple expert networks โœ“ Top-K Router # Routing token ke K expert terbaik โœ“ Auxiliary Loss # Load balancing antar experts โœ“ Z-Loss # Stabilisasi router logits โœ“ Expert Usage Tracking # Monitoring penggunaan setiap expert ``` ``` Input Token โ†“ [Router] โ†’ Pilih Top-K Experts (misal K=2 dari 8 experts) โ†“ Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4) โ†“ Weighted Sum Output ``` **Keuntungan:** - Model bisa **10x lebih besar** dengan compute cost yang sama - Setiap token hanya activate **12.5% parameters** (jika K=2, N=8) - Parallel processing antar experts --- ### 3๏ธโƒฃ **Dari Gemini (Google)** Multimodal native dengan cross-modal fusion: ```python โœ“ Vision Encoder (ViT) # Process gambar dengan Vision Transformer โœ“ Audio Encoder (Conv1D + Trans) # Process audio dengan CNN + Transformer โœ“ Cross-Attention Mechanism # Fuse multimodal features โœ“ Multiple Projector Types: - Linear Projector # Simple & cepat - MLP Projector # Non-linear mapping - Perceiver Resampler # Compress dengan latent queries - Q-Former # Query-based projection (BLIP-2 style) โœ“ Logit Soft-Capping # Clip extreme values untuk stabilitas ``` **Alur Multimodal:** ``` [Image] โ†’ Vision Encoder โ†’ [2D patches โ†’ 1D tokens] โ†“ Projector โ†’ [Hidden dim = text dim] โ†“ [Text] + [Image tokens] โ†’ Cross-Attention โ†’ Fused representation ``` **Support format:** - Images: JPEG, PNG (224x224 default) - Audio: Mel-spectrogram (80 bins) --- ### 4๏ธโƒฃ **Dari Qwen (Alibaba)** Long context optimization: ```python โœ“ YARN Scaling # Yet Another RoPE extensioN โœ“ Dynamic Position Scaling # Auto-adjust untuk sequence lebih panjang โœ“ Sliding Window Attention # Local attention pattern โœ“ Context Window 8K-128K # Flexible context length ``` **YARN vs Standard RoPE:** ``` Standard RoPE: [====] 4K context โ†’ [====????] 8K (error naik) YARN: [====] 4K context โ†’ [========] 8K (smooth extrapolation) ``` **Sliding Window Mechanism:** ``` Token 0: attend ke [0] Token 1: attend ke [0, 1] Token 2: attend ke [0, 1, 2] Token 10: attend ke [0, 6, 7, 8, 9, 10] โ† sliding window = 4 (keep attention sink di token 0) ``` --- ### 5๏ธโƒฃ **Dari Gemma (Google)** Optimization techniques: ```python โœ“ Layer Scale # Learnable scaling per layer โœ“ Stochastic Depth # Random layer dropping saat training โœ“ Normalized Attention # QK normalization untuk stabilitas โœ“ Knowledge Distillation # Transfer knowledge dari model besar ``` **Layer Scale formula:** ```python output = input + gamma * layer(input) # gamma diinit sangat kecil (1e-5) lalu di-learn ``` **Stochastic Depth:** - Training: 20% chance layer di-skip (drop_prob=0.2) - Inference: semua layer aktif - Benefit: **regularization** + **faster training** --- ## ๐Ÿ†• Fitur Eksperimental & Unik ### A) **Mixture of Depths (MoD)** Token bisa "skip" layer tertentu untuk efisiensi: ```python class MixtureOfDepthsRouter: # Pilih top 50% tokens paling "penting" untuk di-process capacity_factor = 0.5 # Method: learned, random, atau heuristic route_method = "learned" ``` **Ilustrasi:** ``` Layer 1: [All 100 tokens processed] Layer 2: [Top 50 tokens processed, 50 skipped] โ† MoD Layer 3: [All 100 tokens processed] Layer 4: [Top 50 tokens processed, 50 skipped] โ† MoD ``` **Benefit:** - **30-40% faster inference** dengan minimal accuracy drop - Dynamic computation based on token importance **Paper:** [Mixture-of-Depths (2024)](https://arxiv.org/abs/2404.02258) --- ### B) **Attention Sinks** Keep token awal selalu di-attend untuk stabilitas: ```python attention_sink_size = 4 # Keep first 4 tokens attention_sink_window = 512 # Sliding window size ``` **Attention Pattern:** ``` Query Token 1000: โ”œโ”€ Attend to: [0, 1, 2, 3] โ† attention sinks (always) โ””โ”€ Attend to: [488, 489, ..., 1000] โ† sliding window ``` **Benefit:** - Prevent attention collapse di long sequences - Better streaming generation - Inspired by [StreamingLLM (2023)](https://arxiv.org/abs/2309.17453) --- ### C) **Expert Choice Routing** Alternatif dari Top-K routing: ```python # Top-K: Token pilih expert Token โ†’ Router โ†’ "Saya mau Expert 2 dan 5" # Expert Choice: Expert pilih token Expert 1 โ†’ "Saya mau process Token 3, 7, 12, ..." Expert 2 โ†’ "Saya mau process Token 1, 5, 9, ..." ``` **Keuntungan:** - **Better load balancing** (setiap expert process jumlah token yang sama) - **Lebih stable training** (no expert collapse) - Trade-off: sedikit lebih complex implementasi --- ### D) **Multi-Backend Attention** Automatic fallback untuk compatibility: ```python if HAS_FLASH_ATTN and device == "cuda": use flash_attn_func() # โ† Fastest (2-4x speedup) elif HAS_XFORMERS and device == "cuda": use memory_efficient_attention() # โ† Fallback 1 elif HAS_SDPA: use F.scaled_dot_product_attention() # โ† Fallback 2 (PyTorch 2.0+) else: use standard_attention() # โ† Safe fallback ``` **Performa Comparison:** ``` Flash Attention: 100ms (baseline) xFormers: 150ms (1.5x slower) SDPA: 180ms (1.8x slower) Standard: 400ms (4x slower) ``` --- ## ๐Ÿ—๏ธ CACA Model Family | Model | Parameters | Vocab Size | Hidden Size | Intermediate Size | Layers | Attention Heads | KV Heads | Head Dim | Max Position | |-------|------------|------------|-------------|-------------------|--------|-----------------|----------|----------|--------------| | caca-1M-untrained | 2.50M | 8,000 | 128 | 512 | 6 | 4 | 2 | 32 | 1,024 | | caca-3M-untrained | 6.63M | 12,000 | 192 | 768 | 8 | 6 | 2 | 32 | 2,048 | | caca-4M-untrained | 4.02M | 16,000 | 128 | 512 | 8 | 4 | 2 | 32 | 2,048 | | caca-6M-untrained | 11.96M | 16,000 | 256 | 1024 | 8 | 4 | 2 | 64 | 2,048 | | caca-10M-untrained | 21.25M | 20,000 | 320 | 1280 | 10 | 8 | 2 | 40 | 2,048 | | caca-15M-untrained | 35.18M | 24,000 | 384 | 1536 | 12 | 6 | 2 | 64 | 2,048 | | caca-25M-untrained | 67.57M | 28,000 | 512 | 2048 | 14 | 8 | 2 | 64 | 4,096 | | caca-35M-untrained | 95.42M | 32,000 | 576 | 2304 | 16 | 8 | 2 | 72 | 4,096 | | caca-50M-untrained | 138.47M | 32,000 | 640 | 2560 | 20 | 10 | 2 | 64 | 4,096 | | caca-75M-untrained | 178.55M | 32,000 | 768 | 3072 | 18 | 12 | 3 | 64 | 4,096 | | caca-100M-untrained | 232.23M | 32,000 | 768 | 3072 | 24 | 12 | 4 | 64 | 4,096 | | caca-150M-untrained | 336.90M | 32,000 | 1024 | 4096 | 20 | 16 | 4 | 64 | 4,096 | | caca-200M-untrained | 458.55M | 32,000 | 1024 | 4096 | 28 | 16 | 4 | 64 | 4,096 | | caca-250M-untrained | 569.54M | 32,000 | 1152 | 4608 | 28 | 18 | 3 | 64 | 8,192 | | caca-300M-untrained | 701.64M | 32,000 | 1280 | 5120 | 28 | 20 | 4 | 64 | 8,192 | | caca-400M-untrained | 956.36M | 32,000 | 1408 | 5632 | 32 | 22 | 4 | 64 | 8,192 | | caca-500M-untrained | 1.27B | 32,000 | 1536 | 6144 | 36 | 24 | 4 | 64 | 8,192 | | caca-600M-untrained | 1.48B | 32,000 | 1664 | 6656 | 36 | 26 | 4 | 64 | 8,192 | | caca-700M-untrained | 1.71B | 32,000 | 1792 | 7168 | 36 | 28 | 4 | 64 | 8,192 | | caca-800M-untrained | 1.96B | 32,000 | 1920 | 7680 | 36 | 30 | 5 | 64 | 8,192 | | caca-900M-untrained | 2.01B | 32,000 | 2048 | 8192 | 32 | 32 | 8 | 64 | 8,192 | | caca-1B-untrained | 2.26B | 32,000 | 2048 | 8192 | 36 | 32 | 8 | 64 | 8,192 | | caca-1.5B-untrained | 2.98B | 32,000 | 2048 | 8192 | 48 | 32 | 8 | 64 | 8,192 | | caca-2B-untrained | 3.15B | 32,000 | 2304 | 9216 | 40 | 32 | 8 | 72 | 8,192 | | caca-2.5B-untrained | 3.12B | 32,000 | 2560 | 10240 | 32 | 32 | 8 | 80 | 8,192 | | caca-3B-untrained | 3.88B | 32,000 | 2560 | 10240 | 40 | 32 | 8 | 80 | 8,192 | | caca-3.5B-untrained | 4.69B | 32,000 | 2816 | 11264 | 40 | 32 | 8 | 88 | 8,192 | | caca-4B-untrained | 5.02B | 32,000 | 3072 | 12288 | 36 | 32 | 8 | 96 | 8,192 | | caca-4.5B-untrained | 5.45B | 32,000 | 3200 | 12800 | 36 | 32 | 8 | 100 | 8,192 | | caca-5B-untrained | 6.53B | 32,000 | 3328 | 13312 | 40 | 32 | 8 | 104 | 8,192 | | caca-6B-untrained | 8.31B | 32,000 | 3584 | 14336 | 44 | 32 | 8 | 112 | 8,192 | | caca-7B-untrained | 7.11B | 32,000 | 4096 | 14336 | 32 | 32 | 8 | 128 | 8,192 | | caca-8B-untrained | 7.98B | 32,000 | 4096 | 14336 | 36 | 32 | 8 | 128 | 8,192 | | caca-9B-untrained | 9.09B | 32,000 | 4608 | 16384 | 32 | 36 | 9 | 128 | 8,192 | | caca-10B-untrained | 11.23B | 32,000 | 4608 | 18432 | 36 | 32 | 8 | 144 | 8,192 | | caca-12B-untrained | 15.26B | 32,000 | 5120 | 20480 | 40 | 40 | 8 | 128 | 8,192 | | caca-13B-untrained | 13.38B | 32,000 | 5120 | 13824 | 48 | 40 | 8 | 128 | 8,192 | | caca-14B-untrained | 13.40B | 32,000 | 5376 | 14464 | 44 | 48 | 8 | 112 | 8,192 | | caca-15B-untrained | 14.90B | 32,000 | 5632 | 15104 | 44 | 32 | 8 | 176 | 8,192 | | caca-18B-untrained | 18.92B | 32,000 | 6144 | 16384 | 48 | 48 | 8 | 128 | 8,192 | | caca-20B-untrained | 20.48B | 32,000 | 6144 | 16384 | 52 | 48 | 8 | 128 | 8,192 | | caca-24B-untrained | 25.83B | 32,000 | 6656 | 17920 | 56 | 64 | 8 | 104 | 8,192 | | caca-30B-untrained | 32.24B | 32,000 | 6656 | 17920 | 70 | 64 | 8 | 104 | 8,192 | | caca-35B-untrained | 39.02B | 32,000 | 8192 | 22016 | 56 | 64 | 8 | 128 | 8,192 | | caca-40B-untrained | 44.56B | 32,000 | 8192 | 22016 | 64 | 64 | 8 | 128 | 8,192 | | caca-45B-untrained | 50.09B | 32,000 | 8192 | 22016 | 72 | 64 | 8 | 128 | 8,192 | | caca-50B-untrained | 55.63B | 32,000 | 8192 | 22016 | 80 | 64 | 8 | 128 | 8,192 | | caca-60B-untrained | 72.14B | 32,000 | 8192 | 28672 | 84 | 64 | 8 | 128 | 8,192 | | caca-70B-untrained | 68.71B | 32,000 | 8192 | 28672 | 80 | 64 | 8 | 128 | 8,192 | | caca-80B-untrained | 101.77B | 32,000 | 9216 | 36864 | 84 | 72 | 8 | 128 | 8,192 | | caca-100B-untrained | 137.32B | 32,000 | 10240 | 40960 | 92 | 80 | 8 | 128 | 8,192 | | caca-120B-untrained | 173.10B | 32,000 | 11264 | 45056 | 96 | 88 | 8 | 128 | 8,192 | | caca-150B-untrained | 214.31B | 32,000 | 12288 | 49152 | 100 | 96 | 8 | 128 | 8,192 | | caca-175B-untrained | 248.53B | 32,000 | 12288 | 49152 | 116 | 96 | 8 | 128 | 8,192 | | caca-200B-untrained | 324.80B | 128,000 | 14336 | 57344 | 110 | 112 | 16 | 128 | 16,384 | | caca-250B-untrained | 419.35B | 128,000 | 15360 | 61440 | 124 | 120 | 16 | 128 | 16,384 | | caca-300B-untrained | 507.03B | 128,000 | 16384 | 65536 | 132 | 128 | 16 | 128 | 16,384 | | caca-350B-untrained | 591.18B | 128,000 | 16384 | 65536 | 154 | 128 | 16 | 128 | 16,384 | | caca-400B-untrained | 675.34B | 128,000 | 16384 | 65536 | 176 | 128 | 16 | 128 | 16,384 | | caca-500B-untrained | 852.77B | 128,000 | 18432 | 73728 | 176 | 144 | 16 | 128 | 16,384 | | caca-600B-untrained | 1.07T | 128,000 | 20480 | 81920 | 180 | 160 | 16 | 128 | 16,384 | | caca-700B-untrained | 1.23T | 128,000 | 21504 | 86016 | 186 | 168 | 24 | 128 | 16,384 | | caca-800B-untrained | 1.38T | 128,000 | 22528 | 90112 | 192 | 176 | 16 | 128 | 16,384 | | caca-900B-untrained | 1.65T | 128,000 | 24576 | 94208 | 198 | 192 | 24 | 128 | 16,384 | | caca-1T-untrained | 1.75T | 128,000 | 24576 | 98304 | 204 | 192 | 16 | 128 | 16,384 | --- ## ๐Ÿ’พ Kebutuhan Memory ### Training Requirements
Configuration Model Weights + Optimizer States Total Training
FP32 (AdamW) 0.01 GB +0.04 GB 0.06 GB
Mixed Precision 0.01 GB +0.05 GB 0.06 GB
+ Gradient Checkpointing Menghemat ~30-50% activation memory ~0.03 GB
### Inference Requirements
Precision Model Size KV Cache (2K ctx) Total Memory Memory Saving
FP16 / BF16 0.01 GB 0.00 GB 0.01 GB Baseline
INT8 0.00 GB 0.00 GB 0.01 GB ~50% โ†“
INT4 (NF4) 0.00 GB 0.00 GB 0.00 GB ~75% โ†“
> ๐Ÿ’ก **Note**: KV cache bertambah secara linear dengan panjang sequence. Untuk context 8K, kalikan nilai KV cache dengan 4. ### Performance Estimates
Metric Value Notes
FLOPs per Token 7,049,216 Forward pass only
TFLOPs per Token 0.0000 โ‰ˆ 6ร— untuk backward
Bandwidth (FP16) 0.01 GB/token Memory bandwidth requirement
--- ### ๐Ÿ“ Struktur Arsitektur Lengkap
๐Ÿ” Klik untuk lihat detail arsitektur ``` CACA Architecture โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ“ฅ INPUT PROCESSING โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Text Input โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Tokenization (BPE/WordPiece/SentencePiece) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Token Embeddings (vocab_size ร— hidden_size) โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch_size, seq_len, hidden_size] โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Vision Input (Optional) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Image Preprocessing (resize ke 224ร—224) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Vision Encoder (ViT) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Patch Embedding (Conv2D: 14ร—14 patches) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ CLS Token + Positional Embeddings โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Vision Transformer Blocks (24 layers) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ LayerNorm โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Multi-Head Attention โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ MLP (GELU activation) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Residual Connections โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Final LayerNorm โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Vision Projector โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Type: Linear / MLP / Perceiver / Q-Former โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch_size, num_patches, hidden_size] โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: Vision embeddings aligned to text space โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Audio Input (Optional) โ”‚ โ”œโ”€โ”€โ”€ Audio Preprocessing (Mel-spectrogram, 80 bins) โ”‚ โ”œโ”€โ”€โ”€ Audio Encoder โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Conv1D Layers (feature extraction) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Conv1D (80 โ†’ hidden_size, kernel=3) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Conv1D (stride=2 untuk downsampling) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Positional Embeddings (interpolated) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Audio Transformer Blocks (12 layers) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ LayerNorm โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Multi-Head Attention โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ MLP (GELU activation) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Residual Connections โ”‚ โ”‚ โ””โ”€โ”€โ”€ Final LayerNorm โ”‚ โ”œโ”€โ”€โ”€ Audio Projector โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Type: Linear / MLP / Perceiver / Q-Former โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch_size, audio_len, hidden_size] โ”‚ โ””โ”€โ”€โ”€ Output: Audio embeddings aligned to text space โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ”„ MULTIMODAL FUSION โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Early Fusion (jika tidak pakai Cross-Attention) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Concatenate: [vision_tokens + audio_tokens + text_tokens] โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Update attention mask โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: Combined sequence untuk decoder โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Late Fusion (jika pakai Cross-Attention) โ”‚ โ”œโ”€โ”€โ”€ Text tokens โ†’ Query untuk cross-attention โ”‚ โ”œโ”€โ”€โ”€ Vision+Audio tokens โ†’ Key/Value untuk cross-attention โ”‚ โ””โ”€โ”€โ”€ Fusion dilakukan di dalam decoder layers โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ—๏ธ DECODER STACK (N=32 layers) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ ๐Ÿ” DECODER LAYER i (repeated N times) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Mixture of Depths (MoD) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Input: Hidden states [batch, seq_len, hidden] โ”‚ โ”‚ โ”œโ”€โ”€โ”€ MoD Router โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Method: learned / random / heuristic โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Score computation per token โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Top-K selection (K = capacity_factor ร— seq_len) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Process Mask Generation โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Binary mask [batch, seq_len] (1=process, 0=skip) โ”‚ โ”‚ โ””โ”€โ”€โ”€ Token Selection โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Selected tokens: processed through layer โ”‚ โ”‚ โ””โ”€โ”€โ”€ Skipped tokens: bypass layer (identity) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ ๐ŸŽฏ SELF-ATTENTION PATH โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Input Normalization โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ RMSNorm (Root Mean Square Layer Normalization) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Formula: x * rsqrt(mean(xยฒ) + ฮต) * ฮณ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ More efficient than LayerNorm (no mean centering) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention Computation โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Query/Key/Value Projections โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Q: Linear(hidden_size โ†’ num_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ K: Linear(hidden_size โ†’ num_kv_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ V: Linear(hidden_size โ†’ num_kv_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Reshape: [batch, seq, heads, head_dim] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] QK Normalization โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Q = RMSNorm(Q) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ K = RMSNorm(K) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Rotary Position Embeddings (RoPE) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Compute frequencies: ฮธ_i = base^(-2i/dim) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Position indices: t โˆˆ [0, seq_len) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Rotation matrix: cos(tยทฮธ), sin(tยทฮธ) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Apply rotation: Q, K = rotate(Q, K, cos, sin) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ YARN Scaling (jika enabled) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Type: linear / dynamic / yarn โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Scaling factor per frequency band โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Better extrapolation ke context panjang โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Grouped-Query Attention (GQA) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ num_kv_groups = num_heads / num_kv_heads โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Repeat K, V: [num_kv_heads โ†’ num_heads] โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Memory saving: 30-40% vs full MHA โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention Score Computation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ scores = (Q @ K.T) / sqrt(head_dim) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Logit clamping: [-50, 50] untuk stabilitas โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ [OPTIONAL] Soft-capping โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ scores = tanh(scores / cap) * cap โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention Masking โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Causal Mask (autoregressive) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Sliding Window Mask (jika enabled) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Window size (misal: 512 tokens) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Attend hanya ke window terdekat โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention Sinks (jika enabled) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Always attend to first K tokens โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Prevent attention collapse โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Better streaming generation โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ [OPTIONAL] ALiBi Bias โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Linear bias based on distance โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Alternative/complement to RoPE โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Backend Selection (automatic fallback) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ 1๏ธโƒฃ Flash Attention 2 (PREFERRED) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Requirements: CUDA + FP16/BF16 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Speedup: 2-4x faster โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Memory: 10-20x less โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Sliding window support โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ IO-aware algorithm โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ 2๏ธโƒฃ xFormers Memory Efficient (FALLBACK 1) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Requirements: CUDA โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Block-sparse attention โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Custom attention patterns โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ 3๏ธโƒฃ PyTorch SDPA (FALLBACK 2) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Requirements: PyTorch 2.0+ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Built-in scaled_dot_product_attention โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Hardware-agnostic โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ 4๏ธโƒฃ Standard Attention (SAFE FALLBACK) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Pure PyTorch implementation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Always available โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Slower but stable โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Softmax + Dropout โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ attn_weights = softmax(scores, dim=-1) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ attn_weights = dropout(attn_weights) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Value Aggregation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ output = attn_weights @ V โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Reshape: [batch, seq, num_heads ร— head_dim] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output Projection โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ O: Linear(num_heads ร— head_dim โ†’ hidden_size) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch, seq, hidden_size] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Layer Scale โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Learnable per-layer scaling: ฮณ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Initialize: ฮณ = 1e-5 (very small) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ output = ฮณ * output โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Improves training stability โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Stochastic Depth โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Training: Random layer dropping โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ drop_prob = layer_idx / num_layers ร— base_prob โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ if random() > drop_prob: return output โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ else: return 0 โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Inference: Always apply (no dropping) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Residual Dropout โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ output = dropout(output) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Residual Connection โ”‚ โ”‚ โ”œโ”€โ”€โ”€ hidden_states = hidden_states + output โ”‚ โ”‚ โ””โ”€โ”€โ”€ [Training] Gradient clipping: [-1e4, 1e4] โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ ๐ŸŒ [OPTIONAL] CROSS-ATTENTION PATH (untuk Multimodal) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Conditional: Hanya jika encoder_hidden_states != None โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Frequency: Setiap cross_attention_frequency layers โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Input Normalization โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ RMSNorm(hidden_states) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Cross-Attention Computation โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Query: dari text hidden states โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Q: Linear(hidden_size โ†’ num_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Key/Value: dari encoder_hidden_states (vision+audio) โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ K: Linear(hidden_size โ†’ num_kv_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ V: Linear(hidden_size โ†’ num_kv_heads ร— head_dim) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention: Q @ K.T / sqrt(head_dim) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Softmax + Dropout โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Output: attn_weights @ V โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output Projection โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Layer Scale โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Stochastic Depth โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Residual Dropout โ”‚ โ”‚ โ””โ”€โ”€โ”€ Residual Connection โ”‚ โ”‚ โ””โ”€โ”€โ”€ hidden_states = hidden_states + cross_attn_output โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ ๐Ÿ”ฎ FEED-FORWARD PATH โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Input Normalization โ”‚ โ”‚ โ””โ”€โ”€โ”€ RMSNorm(hidden_states) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Feed-Forward Network โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ โ”โ”โ”โ”โ” STANDARD MLP โ”โ”โ”โ”โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Gate Projection โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ gate: Linear(hidden_size โ†’ intermediate_size) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Typical: intermediate_size = 4 ร— hidden_size โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Up Projection โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ up: Linear(hidden_size โ†’ intermediate_size) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ SwiGLU Activation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ gate = silu(gate) # Swish activation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ hidden = gate * up # Gating mechanism โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Formula: silu(x) = x * sigmoid(x) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Dropout โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ hidden = dropout(hidden) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Down Projection โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ down: Linear(intermediate_size โ†’ hidden_size) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch, seq, hidden_size] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ โ”โ”โ”โ”โ” MIXTURE OF EXPERTS (MoE) โ”โ”โ”โ”โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Conditional: use_moe AND (layer_idx % moe_frequency == 0) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Network โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Type Selection โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Top-K Router (default) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Expert Choice Router (alternative) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ โ”โ”โ” TOP-K ROUTER โ”โ”โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Gate Normalization โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ hidden = LayerNorm(hidden) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Logits โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ logits: Linear(hidden_size โ†’ num_experts) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Clamping: [-20, 20] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Temperature scaling: logits / temp โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [Training] Jitter Noise โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ noise = randn_like(logits) ร— 0.01 โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ logits = logits + noise โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Routing Weights โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ weights = softmax(logits) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ top_k_weights, top_k_indices = topk(weights, k) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Weight Normalization โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ top_k_weights = top_k_weights / sum(top_k_weights) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Loss Computation โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Auxiliary Loss (load balancing) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ expert_usage = mean(weights, dim=0) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ mean_usage = mean(expert_usage) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ aux_loss = std(expert_usage) / mean_usage โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Z-Loss (router stability) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ z_loss = mean(logsumexp(logits)ยฒ) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Prevents logits explosion โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Entropy Loss (diversity) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ entropy_loss = -mean(weights ร— log(weights)) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ โ”โ”โ” EXPERT CHOICE ROUTER โ”โ”โ” โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Logits โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ logits: Linear(hidden โ†’ num_experts) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Expert-wise Token Selection โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Transpose: [batchร—seq, experts] โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ capacity = expert_choice_k ร— total_tokens / num_experts โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Per expert: topk(logits, k=capacity) โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Expert mask: [experts, batchร—seq] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Routing weights from mask โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Expert Networks (N experts, misal N=8) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Expert i (i = 0 to N-1) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Same structure as Standard MLP โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ gate_proj: Linear(hidden โ†’ intermediate) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ up_proj: Linear(hidden โ†’ intermediate) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ SwiGLU activation โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Dropout โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ down_proj: Linear(intermediate โ†’ hidden) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Expert Execution โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ For each expert: โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Get tokens routed to this expert โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ If no tokens: skip โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Run expert forward pass โ”‚ โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [Training] Track expert usage โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ [Safety] NaN/Inf detection โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Combine Expert Outputs โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Weighted sum by router weights โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ final_output = ฮฃ(weight_i ร— expert_i(x)) โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch, seq, hidden_size] โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Layer Scale โ”‚ โ”‚ โ””โ”€โ”€โ”€ output = ฮณ * output โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Stochastic Depth โ”‚ โ”‚ โ””โ”€โ”€โ”€ Probabilistic dropping (training only) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Residual Dropout โ”‚ โ”‚ โ””โ”€โ”€โ”€ output = dropout(output) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Residual Connection โ”‚ โ”œโ”€โ”€โ”€ hidden_states = hidden_states + output โ”‚ โ””โ”€โ”€โ”€ [Training] Gradient clipping: [-1e4, 1e4] โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ“ค OUTPUT HEAD โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Final Normalization โ”‚ โ”‚ โ”œโ”€โ”€โ”€ RMSNorm(hidden_states) โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch, seq, hidden_size] โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Language Modeling Head โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Linear Projection โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ lm_head: Linear(hidden_size โ†’ vocab_size, bias=False) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: [batch, seq, vocab_size] โ”‚ โ”‚ โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ [OPTIONAL] Logit Soft-Capping โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Clamp extreme values: [-capร—0.99, capร—0.99] โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Formula: tanh(logits / cap) ร— cap โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Prevents numerical instability โ”‚ โ”‚ โ””โ”€โ”€โ”€ Typical cap value: 30.0 โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Output: Logits [batch, seq, vocab_size] โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ“‰ LOSS COMPUTATION (Training Only) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Shift for Autoregressive โ”‚ โ”‚ โ”œโ”€โ”€โ”€ shift_logits = logits[:, :-1, :] โ”‚ โ”‚ โ””โ”€โ”€โ”€ shift_labels = labels[:, 1:] โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Language Modeling Loss โ”‚ โ”‚ โ”œโ”€โ”€โ”€ CrossEntropyLoss(ignore_index=-100) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] Label Smoothing โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Reduces overconfidence โ”‚ โ”‚ โ””โ”€โ”€โ”€ lm_loss = CE(shift_logits, shift_labels) โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ [OPTIONAL] MoE Auxiliary Losses โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Auxiliary Loss (load balancing) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ aux_loss ร— router_aux_loss_coef (default: 0.01) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Router Z-Loss (stability) โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ z_loss ร— router_z_loss_coef (default: 0.001) โ”‚ โ”‚ โ””โ”€โ”€โ”€ Sum across all MoE layers โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Total Loss โ”‚ โ””โ”€โ”€โ”€ total = lm_loss + aux_losses โ”‚ โ”œโ”€โ”€โ”€ ๐Ÿ“Š MONITORING & METRICS โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ MetricsTracker โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Loss tracking (LM, aux, z-loss) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Perplexity: exp(lm_loss) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Gradient norms per layer โ”‚ โ”‚ โ”œโ”€โ”€โ”€ GPU memory usage โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Expert usage statistics โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Attention cache hit rate โ”‚ โ”‚ โ””โ”€โ”€โ”€ Periodic summary & clearing โ”‚ โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Gradient Monitoring โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Max gradient norm per layer โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Mean gradient norm (EMA) โ”‚ โ”‚ โ”œโ”€โ”€โ”€ Gradient clipping count โ”‚ โ”‚ โ””โ”€โ”€โ”€ NaN/Inf detection โ”‚ โ”‚ โ”‚ โ””โ”€โ”€โ”€ Memory Monitoring โ”‚ โ”œโ”€โ”€โ”€ GPU memory allocated โ”‚ โ”œโ”€โ”€โ”€ GPU memory reserved โ”‚ โ”œโ”€โ”€โ”€ Automatic cache clearing โ”‚ โ””โ”€โ”€โ”€ Per-layer memory checkpoints โ”‚ โ””โ”€โ”€โ”€ ๐Ÿ”ง OPTIMIZATION FEATURES โ”‚ โ”œโ”€โ”€โ”€ Gradient Checkpointing โ”‚ โ”œโ”€โ”€โ”€ Trade: 30% slower, 50% less memory โ”‚ โ”œโ”€โ”€โ”€ Recompute activations during backward โ”‚ โ””โ”€โ”€โ”€ Enable: model.gradient_checkpointing_enable() โ”‚ โ”œโ”€โ”€โ”€ Mixed Precision Training (AMP) โ”‚ โ”œโ”€โ”€โ”€ FP16/BF16 forward pass โ”‚ โ”œโ”€โ”€โ”€ FP32 master weights โ”‚ โ”œโ”€โ”€โ”€ Dynamic loss scaling โ”‚ โ””โ”€โ”€โ”€ 2x speedup, 50% memory reduction โ”‚ โ”œโ”€โ”€โ”€ Gradient Accumulation โ”‚ โ”œโ”€โ”€โ”€ Simulate larger batch size โ”‚ โ”œโ”€โ”€โ”€ loss = loss / accumulation_steps โ”‚ โ””โ”€โ”€โ”€ optimizer.step() every N steps โ”‚ โ”œโ”€โ”€โ”€ KV Cache (Inference) โ”‚ โ”œโ”€โ”€โ”€ Cache Key/Value tensors โ”‚ โ”œโ”€โ”€โ”€ Reuse for autoregressive generation โ”‚ โ”œโ”€โ”€โ”€ Memory: O(num_layers ร— seq_len ร— hidden_size) โ”‚ โ””โ”€โ”€โ”€ Speedup: ~10x untuk long sequences โ”‚ โ””โ”€โ”€โ”€ Quantization Support โ”œโ”€โ”€โ”€ 8-bit (LLM.int8) โ”‚ โ”œโ”€โ”€โ”€ bitsandbytes integration โ”‚ โ”œโ”€โ”€โ”€ Mixed precision (outliers in FP16) โ”‚ โ””โ”€โ”€โ”€ 2x memory reduction โ””โ”€โ”€โ”€ 4-bit (QLoRA) โ”œโ”€โ”€โ”€ NF4 quantization (normal float 4-bit) โ”œโ”€โ”€โ”€ Double quantization โ”œโ”€โ”€โ”€ BF16 compute dtype โ””โ”€โ”€โ”€ 4x memory reduction CacaForCausalLM (3.52M) โ”‚ โ”œโ”€ Embedding: 8,000 ร— 128 โ”‚ โ”œโ”€ Transformer Layers (6x) โ”‚ โ”œโ”€ RMSNorm โ”‚ โ”œโ”€ Attention (GQA) โ”‚ โ”‚ โ”œโ”€ Q: 4 heads ร— 32 dim โ”‚ โ”‚ โ”œโ”€ KV: 2 heads ร— 32 dim โ”‚ โ”‚ โ”œโ”€ RoPE (ฮธ=10,000) โ”‚ โ”‚ โ””โ”€ Flash Attention v2 โ”‚ โ”œโ”€ Residual โ”‚ โ”œโ”€ RMSNorm โ”‚ โ”œโ”€ FFN (SwiGLU) โ”‚ โ”‚ โ”œโ”€ Gate: 128 โ†’ 512 โ”‚ โ”‚ โ”œโ”€ Up: 128 โ†’ 512 โ”‚ โ”‚ โ””โ”€ Down: 512 โ†’ 128 โ”‚ โ””โ”€ Residual โ”‚ โ”œโ”€ Final RMSNorm โ””โ”€ LM Head: 128 โ†’ 8,000 โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• ๐Ÿ“Š PARAMETER BREAKDOWN: โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• Embeddings: 1,024,000 ( 29.1%) Transformer Layers: 1,474,560 ( 41.8%) โ”œโ”€ Attention: 294,912 โ””โ”€ FFN: 1,179,648 Final Norm: 128 ( 0.0%) โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ TOTAL: 3,524,608 (100.0%) โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ•โ• ``` **Key Design Decisions:** 1. **GQA over MHA**: Hemat 50% KV cache memory dengan minimal accuracy loss 2. **SwiGLU over GELU**: ~10% better performance pada language modeling 3. **RMSNorm over LayerNorm**: Lebih cepat & stabil, tanpa bias term 4. **RoPE over Learned**: Better extrapolation untuk sequence length > training 5. **No Bias in Linear**: Mengikuti modern LLM best practices (LLaMA-style)
--- ## ๐Ÿ“š Dokumentasi ### ๐Ÿ“ฆ Instalasi Dependencies ```bash # Core dependencies (REQUIRED) pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors # Optional: Untuk performa maksimal pip install flash-attn --no-build-isolation # Flash Attention 2 (3x speedup) pip install xformers # Memory efficient attention pip install bitsandbytes # 4/8-bit quantization # Optional: Untuk monitoring & profiling pip install tensorboard wandb # Training monitoring pip install gputil psutil # Resource monitoring ``` **Compatibility Matrix:** | Component | Version | Note | |-----------|---------|------| | Python | 3.8 - 3.11 | 3.11 recommended | | PyTorch | โ‰ฅ 2.0.0 | 2.1+ untuk SDPA optimal | | CUDA | 11.8 / 12.1 | Untuk Flash Attention | | Transformers | โ‰ฅ 4.35.0 | Untuk AutoModel support | ### Cara Penggunaan #### 1๏ธโƒฃ Basic Loading ```python from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer import torch # Load configuration config = AutoConfig.from_pretrained( "Lyon28/caca-1M-untrained", trust_remote_code=True ) # Load model (FP16 untuk efisiensi) model = AutoModelForCausalLM.from_pretrained( "Lyon28/caca-1M-untrained", config=config, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto" # Automatic device placement ) # Model ini UNTRAINED - butuh training dulu! print(f"Model loaded: {model.num_parameters():,} parameters") print("โš ๏ธ Model ini belum dilatih dan belum bisa digunakan untuk inference") ``` #### 2๏ธโƒฃ Quantized Loading (4-bit/8-bit) ```python from transformers import AutoModelForCausalLM, BitsAndBytesConfig import torch # 4-bit quantization config bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True ) # Load model dengan quantization model = AutoModelForCausalLM.from_pretrained( "Lyon28/caca-1M-untrained", trust_remote_code=True, quantization_config=bnb_config, device_map="auto" ) print(f"Memory footprint: ~0.00GB (4-bit)") ``` #### 3๏ธโƒฃ Training Setup ```python from transformers import TrainingArguments, Trainer # Training configuration training_args = TrainingArguments( output_dir="./output", per_device_train_batch_size=1, gradient_accumulation_steps=16, learning_rate=2e-4, max_steps=10000, lr_scheduler_type="cosine", warmup_steps=500, logging_steps=10, save_steps=500, fp16=True, # Mixed precision gradient_checkpointing=True, # Memory efficient ) # Initialize trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, ) # Start training trainer.train() ``` ### Advanced Usage #### Gradient Checkpointing (Memory Efficient) ```python model.gradient_checkpointing_enable() print("โœ… Gradient checkpointing enabled - saves ~40% memory") ``` #### Custom Training Loop ```python from torch.optim import AdamW from torch.cuda.amp import autocast, GradScaler optimizer = AdamW(model.parameters(), lr=2e-4) scaler = GradScaler() for batch in dataloader: # Mixed precision forward with autocast(dtype=torch.bfloat16): outputs = model(**batch) loss = outputs.loss # Backward with gradient scaling scaler.scale(loss).backward() scaler.step(optimizer) scaler.update() optimizer.zero_grad() ``` #### Multi-GPU Training (DDP) ```python import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel # Initialize process group dist.init_process_group(backend="nccl") # Wrap model model = DistributedDataParallel( model, device_ids=[local_rank], find_unused_parameters=False ) ``` --- ## โš™๏ธ Konfigurasi Detail ### Full Configuration JSON ```json { "architectures": ["CacaForCausalLM"], "model_type": "caca", "vocab_size": 8000, "hidden_size": 128, "intermediate_size": 512, "num_hidden_layers": 6, "num_attention_heads": 4, "num_key_value_heads": 2, "head_dim": 32, "max_position_embeddings": 1024, "rope_theta": 10000, "rms_norm_eps": 1e-06, "use_cache": true, "use_qk_norm": true, "use_flash_attn": true, "attention_dropout": 0.0, "hidden_dropout": 0.1, "torch_dtype": "float16" } ``` ### Custom Configuration ```python from transformers import AutoConfig # Load dan modifikasi config config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained") # Custom modifications config.max_position_embeddings = 16384 # Extend context config.rope_scaling = {"type": "linear", "factor": 2.0} config.use_flash_attn = True config.hidden_dropout = 0.05 # Save custom config config.save_pretrained("./custom_config") ``` --- ## ๐Ÿ”ฌ Arsitektur
Layer Structure **Input Tokens** โ†“ **Embedding Layer** (vocab_size โ†’ hidden_size) โ†“ **Decoder Block ร— N** - RMSNorm - Multi-Head Attention (GQA) - Flash Attention v2 - Query heads, KV heads - RoPE position encoding - Residual Connection - RMSNorm - Feed-Forward Network (SwiGLU) - Gate: hidden โ†’ intermediate - Up: hidden โ†’ intermediate - Down: intermediate โ†’ hidden - Residual Connection โ†“ **RMSNorm (Final)** โ†“ **LM Head** (hidden โ†’ vocab_size) โ†“ **Output Logits**
### Attention Mechanism (GQA) ``` Query: [4 heads ร— 32 dim] = 128 Key: [2 heads ร— 32 dim] = 64 Value: [2 heads ร— 32 dim] = 64 Grouped Query Attention: - Setiap 2 query heads berbagi 1 KV head - Memory KV cache: 50% lebih kecil dari Multi-Head Attention - Kualitas mendekati MHA, speed mendekati MQA ``` ### Feed-Forward Network (SwiGLU) ``` FFN(x) = (SiLU(xW_gate) โŠ™ xW_up) W_down Where: - W_gate: 128 ร— 512 - W_up: 128 ร— 512 - W_down: 512 ร— 128 - SiLU(x) = x ยท sigmoid(x) - โŠ™ = element-wise multiplication ``` ## ๐Ÿ’ฌ Format Chat & Prompt Engineering ### ๐Ÿ“ Chat Template Model mendukung format chat standar untuk conversational AI: ```python # Format chat template bawaan chat_template = """ {% for message in messages %} {% if message['role'] == 'system' %} System: {{ message['content'] }} {% elif message['role'] == 'user' %} User: {{ message['content'] }} {% elif message['role'] == 'assistant' %} Assistant: {{ message['content'] }} {% endif %} {% endfor %} {% if add_generation_prompt %}Assistant:{% endif %} """ # Contoh penggunaan messages = [ {"role": "system", "content": "Kamu adalah asisten AI yang membantu dan ramah."}, {"role": "user", "content": "Jelaskan tentang fotosintesis"}, {"role": "assistant", "content": "Fotosintesis adalah proses di mana tumbuhan mengubah cahaya matahari menjadi energi kimia..."}, {"role": "user", "content": "Apa manfaatnya bagi manusia?"}, ] # Apply template formatted = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) print(formatted) # Output: # System: Kamu adalah asisten AI yang membantu dan ramah. # # User: Jelaskan tentang fotosintesis # Assistant: Fotosintesis adalah proses di mana tumbuhan... # User: Apa manfaatnya bagi manusia? # Assistant: ``` --- ## ๐ŸŽฏ Use Cases Model ini dirancang untuk berbagai aplikasi NLP setelah melalui proses training: ### Text Generation - โœ๏ธ Creative writing & storytelling - ๐Ÿ“ฐ Article generation - ๐Ÿ’ฌ Conversational AI - ๐Ÿ”„ Text completion ### Language Understanding - ๐Ÿ“Š Text classification - ๐Ÿท๏ธ Named Entity Recognition (NER) - โ“ Question Answering - ๐Ÿ“ Summarization ### Code Generation - ๐Ÿ’ป Code completion - ๐Ÿ› Bug fixing suggestions - ๐Ÿ“š Documentation generation - ๐Ÿ”„ Code translation ### Multilingual Tasks - ๐ŸŒ Translation (ID โ†” EN) - ๐Ÿ—ฃ๏ธ Cross-lingual understanding - ๐ŸŒ Multilingual classification --- ## ๐Ÿ“ˆ Benchmark & Evaluation > โš ๏ธ Model belum melalui evaluasi karena status untrained Setelah training, model akan dievaluasi pada: ### Indonesian Benchmarks - **IndoNLU**: Comprehensive Indonesian NLU tasks - **IndoQA**: Indonesian Question Answering - **IndoSum**: Summarization - **IndoNER**: Named Entity Recognition ### Multilingual Benchmarks - **MMLU**: Massive Multitask Language Understanding - **HellaSwag**: Common sense reasoning - **ARC**: Science QA - **TruthfulQA**: Truthfulness evaluation ### Generation Quality - **Perplexity**: Language modeling quality - **BLEU/ROUGE**: Translation & summarization - **Human Evaluation**: Fluency, coherence, factuality --- ## ๐Ÿ› ๏ธ Development & Training Tips ### Optimal Batch Size ```python # Rule of thumb untuk 3.52M model # GPU Memory โ†’ Batch size per device if gpu_memory >= 80: # A100 80GB batch_size = 4539 gradient_accumulation = 1 elif gpu_memory >= 40: # A100 40GB batch_size = 2269 gradient_accumulation = 1 elif gpu_memory >= 24: # RTX 3090/4090 batch_size = 1 gradient_accumulation = 1 # Effective batch size = batch_size ร— gradient_accumulation ร— num_gpus ``` ### Learning Rate Scheduling ```python # Recommended untuk 3.52M model learning_rate = 0.0005 # Base LR warmup_ratio = 0.05 # 5% of total steps lr_scheduler = "cosine" # atau "linear" # Learning rate scaling rule: # LR โˆ sqrt(batch_size) # Untuk batch size 256: LR = 0.0005 # Untuk batch size 512: LR = 7.07e-04 ``` ### Gradient Clipping ```python # Prevent gradient explosion max_grad_norm = 1.0 # Clip at 1.0 # Monitor gradients from torch.nn.utils import clip_grad_norm_ grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm) if grad_norm > 10.0: print(f"โš ๏ธ High gradient norm: {grad_norm:.2f}") ``` ### Training Stability ```python # Tips untuk stable training: 1. **Warmup**: Mulai dengan LR rendah 2. **Gradient Checkpointing**: Kurangi memory footprint 3. **Mixed Precision**: Gunakan BF16 jika tersedia (lebih stable dari FP16) 4. **Batch Size**: Start small, increase gradually 5. **Monitor**: Track loss, perplexity, gradient norms ``` --- ## ๐Ÿ”ง Troubleshooting ### Out of Memory (OOM) ```python # Solusi OOM saat training: โœ… 1. Enable gradient checkpointing model.gradient_checkpointing_enable() โœ… 2. Reduce batch size per_device_train_batch_size = 1 โœ… 3. Increase gradient accumulation gradient_accumulation_steps = 32 โœ… 4. Use quantization load_in_8bit = True # atau load_in_4bit โœ… 5. Reduce sequence length max_length = 1024 # Start dengan ini โœ… 6. CPU offloading (jika perlu) device_map = "auto" offload_folder = "offload" ``` ### Slow Training ```python # Optimasi kecepatan training: โœ… 1. Flash Attention config.use_flash_attn = True # 2-3x speedup โœ… 2. Compile model (PyTorch 2.0+) model = torch.compile(model, mode="reduce-overhead") โœ… 3. DataLoader optimization dataloader = DataLoader( dataset, batch_size=batch_size, num_workers=4, # Parallel data loading pin_memory=True, # Faster GPU transfer prefetch_factor=2 ) โœ… 4. Mixed precision use_fp16 = True # atau bf16 โœ… 5. Optimize communication (multi-GPU) find_unused_parameters = False gradient_as_bucket_view = True ``` ### NaN Loss ```python # Jika loss menjadi NaN: โœ… 1. Reduce learning rate learning_rate = learning_rate * 0.1 โœ… 2. Check gradient norms clip_grad_norm_(model.parameters(), 1.0) โœ… 3. Use BF16 instead of FP16 torch_dtype = torch.bfloat16 # Lebih stable โœ… 4. Add epsilon to RMSNorm rms_norm_eps = 1e-5 # Increase jika perlu โœ… 5. Check data # Pastikan tidak ada inf/nan di dataset assert not torch.isnan(input_ids).any() assert not torch.isinf(attention_mask).any() ``` --- ### ๐Ÿšซ Prohibited Uses
Model ini **TIDAK BOLEH** digunakan untuk: - ๐Ÿšซ **Harmful content generation** (violence, self-harm, illegal acts) - ๐Ÿšซ **Misinformation/disinformation campaigns** - ๐Ÿšซ **Harassment or hate speech** - ๐Ÿšซ **Impersonation or identity theft** - ๐Ÿšซ **Child safety violations** (CSAM, grooming, exploitation) - ๐Ÿšซ **Privacy violations** (doxxing, stalking, surveillance abuse) - ๐Ÿšซ **Malicious code generation** (malware, exploits, etc) - ๐Ÿšซ **Spam or manipulation** (fake reviews, astroturfing) - ๐Ÿšซ **Medical/legal advice** (tanpa disclaimer & expert review) - ๐Ÿšซ **Financial fraud** (scams, market manipulation) **Violation consequences:** Model access revocation + legal action jika applicable
--- ## ๐Ÿ“š References & Papers ### Core Architecture 1. **LLaMA** - [Touvron et al., 2023](https://arxiv.org/abs/2302.13971) - RMSNorm, RoPE, SwiGLU, GQA 2. **GPT-4** - [OpenAI Technical Report, 2023](https://arxiv.org/abs/2303.08774) - Mixture of Experts (speculated) 3. **Gemini** - [Google DeepMind, 2023](https://arxiv.org/abs/2312.11805) - Multimodal architecture, soft-capping 4. **Qwen** - [Alibaba Cloud, 2023](https://arxiv.org/abs/2309.16609) - YARN, long context 5. **Gemma** - [Google, 2024](https://arxiv.org/abs/2403.08295) - Layer scaling, normalization ### Advanced Techniques 6. **Flash Attention 2** - [Dao, 2023](https://arxiv.org/abs/2307.08691) 7. **Mixture-of-Depths** - [Raposo et al., 2024](https://arxiv.org/abs/2404.02258) 8. **StreamingLLM** - [Xiao et al., 2023](https://arxiv.org/abs/2309.17453) 9. **YARN** - [Peng et al., 2023](https://arxiv.org/abs/2309.00071) 10. **QLoRA** - [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314) --- ## โš ๏ธ Known Limitations 1. **Training Cost** - MoE + Multimodal = expensive 2. **Complex Debugging** - Banyak fallback systems 3. **Memory Hungry** - Jika semua fitur enabled 4. **Dependency Hell** - Butuh flash-attn, xformers, bitsandbytes 5. **Expert Balancing** - MoE butuh careful tuning untuk load balancing --- ## ๐Ÿ“œ License & Citation ### ๐Ÿ“„ License
Model ini dirilis di bawah **Apache License 2.0** โœ… **Anda BEBAS untuk:** - โœ”๏ธ Gunakan secara komersial - โœ”๏ธ Modifikasi sesuka hati - โœ”๏ธ Distribusi ulang - โœ”๏ธ Patent use - โœ”๏ธ Private use โš ๏ธ **Dengan syarat:** - ๐Ÿ“„ Include license & copyright notice - ๐Ÿ“ State changes yang dibuat - ๐Ÿ“‹ Disclaimer of warranty โŒ **Tanpa jaminan apapun** (use at your own risk)
**Full license text**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) ## ๐Ÿ“– Citation Jika Anda menggunakan model ini dalam penelitian, mohon sitasi: ```bibtex @misc{cacacaca1m, author = {Lyon}, title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention}, year = {2026}, publisher = {Hugging Face}, journal = {Hugging Face Model Hub}, howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}}, note = {Untrained model with 3,524,608 parameters} } ``` **APA Style:** ``` Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention [Untrained model]. Hugging Face. https://huggingface.co/Lyon28/caca-1M-untrained ``` **MLA Style:** ``` Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention." Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained. ``` --- ### ๐Ÿ™ Acknowledgments Model ini berdiri di pundak para raksasa! Terima kasih kepada:
๐Ÿ›๏ธ Klik untuk daftar lengkap acknowledgments #### ๐Ÿ—๏ธ **Core Architecture** - **LLaMA/LLaMA 2** (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) - Authors: Hugo Touvron et al. - **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm - **PaLM** (Google, 2022) - SwiGLU activation insights #### ๐ŸŽฏ **Attention Mechanisms** - **Flash Attention v2** (Tri Dao et al., Stanford, 2023) - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691) - 3x speedup dengan IO-aware algorithm - **Grouped Query Attention** (Joshua Ainslie et al., Google, 2023) - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245) - Memory-efficient KV cache - **Multi-Query Attention** (Noam Shazeer, Google, 2019) - Fast inference dengan shared K/V - **xFormers** (Meta AI, 2022) - Memory efficient attention - **PyTorch SDPA** (PyTorch Team, 2023) - Native attention optimization #### ๐Ÿ“ **Position Encodings** - **RoPE** (Jianlin Su et al., EleutherAI, 2021) - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) - Superior length extrapolation - **ALiBI** (Ofir Press et al., 2022) - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409) - Length generalization without retraining - **YaRN** (Bowen Peng et al., 2023) - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071) #### ๐ŸชŸ **Long Context & Efficiency** - **Sliding Window Attention** (Albert Gu et al., Mistral AI, 2023) - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825) - **StreamingLLM** (Guangxuan Xiao et al., MIT, 2023) - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) - Infinite sequence length! - **Logit Softcapping** (Google Gemma Team, 2024) - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295) #### ๐Ÿง  **Mixture of Experts** - **Mixtral 8x7B** (Albert Jiang et al., Mistral AI, 2024) - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088) - State-of-the-art sparse MoE - **Switch Transformers** (William Fedus et al., Google, 2021) - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) - Expert scaling insights - **GLaM** (Nan Du et al., Google, 2021) - Generalist Language Model - **Expert Choice Routing** (Yanqi Zhou et al., Google, 2022) - Better load balancing #### ๐ŸŽ“ **Training Optimizations** - **Layer Scale** (Hugo Touvron et al., Meta, 2021) - Paper: [Going Deeper with Image Transformers](https://arxiv.org/abs/2103.17239) - Training stability untuk deep networks - **Stochastic Depth** (Gao Huang et al., 2016) - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382) - **Mixture of Depths** (David Raposo et al., DeepMind, 2024) - Paper: [Mixture-of-Depths: Dynamically allocating compute](https://arxiv.org/abs/2404.02258) - Dynamic compute allocation - **Gradient Checkpointing** (Tianqi Chen et al., 2016) #### ๐Ÿ“ฆ **Quantization** - **LLM.int8()** (Tim Dettmers et al., 2022) - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339) - **QLoRA** (Tim Dettmers et al., 2023) - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) - 4-bit efficient fine-tuning - **bitsandbytes** (Tim Dettmers) - Quantization library #### ๐ŸŽจ **Multimodal** - **Vision Transformer** (Alexey Dosovitskiy et al., Google, 2020) - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929) - **Flamingo** (Jean-Baptiste Alayrac et al., DeepMind, 2022) - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198) - Perceiver Resampler - **BLIP-2** (Junnan Li et al., Salesforce, 2023) - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597) - Q-Former architecture - **Whisper** (Alec Radford et al., OpenAI, 2022) - Audio encoding #### ๐Ÿ› ๏ธ **Normalization & Activations** - **RMSNorm** (Biao Zhang, Rico Sennrich, 2019) - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467) - **SwiGLU** (Noam Shazeer, Google, 2020) - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) #### ๐Ÿ”ง **Tools & Frameworks** - **๐Ÿค— Hugging Face** - Transformers, Accelerate, PEFT - Making NLP accessible to everyone - **PyTorch** - Deep learning framework - Facebook AI Research team - **Safetensors** - Secure serialization - Hugging Face team - **DeepSpeed** - Distributed training - Microsoft Research - **Flash Attention Implementation** - Tri Dao & team #### ๐Ÿ‡ฎ๐Ÿ‡ฉ **Indonesian NLP Community** Special thanks to Indonesian NLP researchers & practitioners yang telah membangun foundation untuk Indonesian language AI.
--- ## ๐Ÿ“„ License Model ini dirilis di bawah **Apache License 2.0**. ### Ketentuan Penggunaan: - โœ… **Bebas digunakan** untuk keperluan komersial dan non-komersial - โœ… **Modifikasi** diperbolehkan - โœ… **Distribusi** diperbolehkan dengan attribution - โš ๏ธ **No Warranty** - model disediakan "as is" - ๐Ÿ“ **Attribution Required** - sertakan copyright notice Lihat [LICENSE](LICENSE) untuk detail lengkap. --- ## ๐Ÿค Contributing Kami sangat terbuka untuk kontribusi! Berikut cara Anda bisa berkontribusi: ### Training & Fine-tuning - ๐ŸŽ“ Train model ini dengan dataset Anda - ๐Ÿ“Š Share benchmark results - ๐Ÿ”ฌ Experiment dengan hyperparameters ### Code & Architecture - ๐Ÿ› Report bugs atau issues - ๐Ÿ’ก Suggest improvements - ๐Ÿ”ง Submit pull requests ### Documentation - ๐Ÿ“š Improve documentation - ๐ŸŒ Add translations - โœ๏ธ Write tutorials & guides ### Dataset & Evaluation - ๐Ÿ“ Contribute training data - ๐Ÿงช Create evaluation benchmarks - ๐ŸŽฏ Share fine-tuned versions --- ## ๐Ÿ‘ฅ Team & Acknowledgments ### Core Team - **LyonPoy** - Architecture design & implementation ### Special Thanks - ๐Ÿค— **Hugging Face** - Infrastructure & community - โšก **FlashAttention Team** - Efficient attention implementation - ๐Ÿง  **Anthropic, Google, Meta, openAI, etc** - Research inspirations - Meta AI (LLaMA) - OpenAI (GPT series) - Google DeepMind (Gemini, Gemma) - Alibaba Cloud (Qwen) - HuggingFace (Transformers library) - Tri Dao (Flash Attention) - Tim Chen (bitsandbytes) ### Community Terima kasih kepada komunitas open-source yang telah berkontribusi pada: - Transformers library - PyTorch framework - Datasets & evaluation tools --- ## ๐Ÿ“ž Contact & Support ### Community - ๐Ÿ’ฌ [Discussions](https://huggingface.co/Lyon28/caca-1M-untrained/discussions) - Ask questions - ๐Ÿ› [Issues](https://github.com/Lyon-28/caca-transformers/issues) - Report bugs - ๐Ÿ“ง Email : cacatransformers@gmail.com --- ## ๐ŸŒŸ Star History
[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)
## ๐Ÿ’ Dibuat dengan โค๏ธ untuk Komunitas AI Indonesia Caca Logo ### **Terima kasih telah menggunakan Caca!** Jika model ini berguna, jangan lupa โญ repository kami!
โญ
Star Repo
Show your support
๐Ÿ”—
Share
Tell your friends
๐Ÿ’ฌ
Join Discussion
Ask questions
๐Ÿค
Contribute
Make it better
### ๐Ÿš€ Happy Training! ๐Ÿš€ **Model ini menunggu untuk dilatih dan menjadi foundation untuk aplikasi AI Anda.** [๐Ÿ“ฅ Download Model](#) โ€ข [๐Ÿ“– Read Docs](https://github.com/Lyon-28/caca-transformers) โ€ข [๐Ÿ’ฌ Join Community](https://github.com/Lyon-28/caca-transformers)
--- ### ๐Ÿ“Š Model Statistics Parameters Status License Architecture Type Context --- ### ๐ŸŽจ Daily Inspiration
Daily Quote
--- ### ๐Ÿ“ˆ Quick Stats | Metric | Value | |--------|-------| | ๐Ÿ’Ž Total Parameters | 3,524,608 | | ๐Ÿ—๏ธ Layers | 6 | | ๐ŸŽฏ Attention Heads | 4 | | ๐Ÿ“– Max Context | 1,024 tokens | | ๐Ÿ’พ Size (FP16) | 0.01 GB | | ๐Ÿ’พ Size (INT4) | 0.00 GB | --- Model ini adalah bagian dari Caca Project - Open source initiative untuk membangun Indonesian LLM ecosystem.
Created with ๐Ÿ’ป by @Lyon28 | Licensed under Apache 2.0 | Built with ๐Ÿค— HuggingFace


**๐ŸŒŸ "Dari nol, untuk semua" ๐ŸŒŸ** Last updated: january 2026 ---
Built with โค๏ธ by Caca Transformers Team
Powered by ๐Ÿค— Transformers โ€ข โšก PyTorch โ€ข ๐Ÿ”ฅ Flash Attention