--- license: apache-2.0 language: - id - en tags: - text-generation - pytorch - causal-lm - transformer - untrained - gqa - rope - swiglu - rmsnorm - flash-attention - indonesian - bilingual library_name: transformers pipeline_tag: text-generation widget: - text: "Jakarta adalah ibu kota" example_title: "🇮🇩 Pelengkapan Teks (ID)" - text: | Pertanyaan: Apa itu kecerdasan buatan? Jawaban: example_title: "🇮🇩 Tanya Jawab (ID)" - text: | Tulis cerita pendek tentang robot yang belajar mencintai. example_title: "🇮🇩 Penulisan Kreatif (ID)" - text: "The capital of Indonesia is" example_title: "🇬🇧 Text Completion (EN)" - text: | Question: What is artificial intelligence? Answer: example_title: "🇬🇧 Question Answering (EN)" - text: | def fibonacci(n): """Hitung bilangan fibonacci ke-n""" example_title: "💻 Pelengkapan Kode" - text: | # Fungsi untuk mengurutkan array def sort_array(arr): example_title: "💻 Generasi Kode" - text: | User: Halo! Siapa kamu? Assistant: example_title: "💬 Format Chat (ID)" - text: | User: Jelaskan tentang machine learning dalam 2 kalimat. Assistant: example_title: "💬 Conversational (ID)" inference: parameters: max_new_tokens: 100 temperature: 0.7 top_p: 0.9 top_k: 50 do_sample: true repetition_penalty: 1.1 num_beams: 1 datasets: [] metrics: - perplexity - accuracy model-index: - name: caca-1M results: [] ---

# 🤖 caca-1M ### Arsitektur Transformer Modern dengan Fitur Canggih [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Python 3.8+](https://img.shields.io/badge/python-3.8+-blue.svg)](https://www.python.org/downloads/) [![PyTorch](https://img.shields.io/badge/PyTorch-2.0+-red.svg)](https://pytorch.org/) [![Transformers](https://img.shields.io/badge/🤗%20Transformers-4.35+-yellow.svg)](https://github.com/huggingface/transformers) [![Model Type](https://img.shields.io/badge/Model-Causal%20LM-green.svg)]() [![Parameters](https://img.shields.io/badge/Parameters-3.52M-orange.svg)]() [![Status](https://img.shields.io/badge/Status-Untrained-red.svg)]() **3,524,608** parameters • **3.52M** • **6 layers** • **1,024 tokens** [📚 Documentation](#-dokumentasi) • [💻 Usage](#-cara-penggunaan) • [⚙️ Configuration](#️-konfigurasi-detail) • [🔬 Architecture](#-arsitektur)

--- ## ⚠️ PENTING: Model Belum Dilatih (Untrained)

⚠️ PERHATIAN: Ini adalah model yang belum melalui proses training. Bobot model masih dalam kondisi random initialization. Output yang dihasilkan akan tidak bermakna dan acak.

**Status Model:** - 🔴 **Belum dilatih** - Bobot masih random (Kaiming/Xavier init) - 🟡 **Untuk riset & eksperimen** - Arsitektur sudah siap, tinggal train - 🟢 **Production-ready architecture** - Teruji dan optimal Widget di atas hanya menunjukkan **format input yang diharapkan**. Setelah model dilatih dengan dataset yang tepat, format yang sama akan menghasilkan output berkualitas tinggi. ### 🎯 Apa yang Bisa Dilakukan? | ✅ Bisa | ❌ Belum Bisa | |---------|----------------| | Load model architecture | Generate teks bermakna | | Test forward pass | Menjawab pertanyaan | | Measure memory & speed | Reasoning & understanding | | Start training | Production deployment | | Fine-tuning experiments | Real-world applications | --- ## 📋 Deskripsi **CACA** (Collaborative Architecture for Contextual AI) adalah arsitektur Large Language Model (LLM) yang menggabungkan **best practices** dari berbagai model State-of-the-Art (SOTA) seperti **LLaMA**, **GPT-4**, **Gemini**, **Qwen**, dan **Gemma**. Model ini dirancang dengan fokus pada **efisiensi komputasi**, **skalabilitas**, dan **performa tinggi** — menjadikannya **modular**, **production-ready**, dan mendukung **multimodal** (teks, gambar, audio).

📖 Tentang Project Caca

Caca adalah eksperimen open-source Indonesian LLM yang dibuat dari nol secara individual dan bertahap. Bukan kompetitor siapa-siapa, cuma pengen eksplorasi apa yang bisa dilakukan dengan budget terbatas, passion unlimited, dan mindset collaborative.

Kalau berguna buat orang lain, alhamdulillah. Kalau enggak, ya tetap fun kok. Ini proyek eksplorasi, jadi kalau gagal ya bagian dari proses belajar. Kalau berhasil, itu bonus.

— Lyon, Creator

### ✨ **Highlights** - 🧠 **Hybrid Architecture** — Kombinasi teknik terbaik dari 5+ model SOTA - 🎭 **Multimodal Native** — Support teks, gambar, dan audio dalam satu model - ⚡ **High Performance** — Flash Attention, MoE, dan optimasi modern - 🌍 **Indonesian-First** — Dikembangkan dengan fokus pada Bahasa Indonesia - 🔓 **Open Source** — Transparent, reproducible, collaborative ### 🌟 Mengapa Caca? 1. **🇮🇩 Fokus pada Bahasa Indonesia** - Dirancang dengan mempertimbangkan karakteristik bahasa Indonesia 2. **⚡ Efisiensi Tinggi** - GQA & Flash Attention untuk inferensi 3-5x lebih cepat 3. **💾 Memory Efficient** - Hemat 50% memory untuk KV cache 4. **🔧 Modular & Extensible** - Mudah dikustomisasi untuk berbagai use case 5. **🌐 Bilingual** - Support optimal untuk Indonesia & English **CACA** hadir dengan filosofi berbeda: - ✅ **Fully open-source** — dari architecture sampai training code - ✅ **Modular & scalable** — bisa disesuaikan dari 1B sampai 70B+ parameters - ✅ **Resource-efficient** — optimized untuk budget terbatas - ✅ **Indonesian-centric** — prioritas pada Bahasa Indonesia - ✅ **Community-driven** — open for contributions & collaborations ## 📈 Perbandingan dengan Model Lain | Fitur | LLaMA | GPT-4 | Gemini | Qwen | CACA | |-------|-------|-------|--------|------|------| | **RMSNorm** | ✅ | ❌ | ❌ | ✅ | ✅ | | **RoPE** | ✅ | ❌ | ❌ | ✅ | ✅ | | **GQA** | ✅ | ❌ | ❌ | ✅ | ✅ | | **MoE** | ❌ | ✅ | ✅ | ❌ | ✅ | | **Multimodal** | ❌ | ✅ | ✅ | ✅ | ✅ | | **Flash Attention** | ✅ | ✅ | ✅ | ✅ | ✅ | | **Sliding Window** | ❌ | ❌ | ❌ | ✅ | ✅ | | **Attention Sinks** | ❌ | ❌ | ❌ | ❌ | ✅ | | **MoD** | ❌ | ❌ | ❌ | ❌ | ✅ | | **Expert Choice** | ❌ | ❌ | ❌ | ❌ | ✅ | | **YARN Scaling** | ❌ | ❌ | ❌ | ✅ | ✅ | | **Quantization** | ✅ | ❌ | ❌ | ✅ | ✅ | --- ## 🎯 Use Cases & Applications ### ✅ Cocok Untuk

**🔬 Research & Development** - Eksperimen arsitektur transformer - Ablation studies - Novel training techniques - Architecture search **📚 Academic & Education** - Thesis & research papers - Teaching materials - Student projects - LLM internals understanding

**🚀 Base Model for Fine-tuning** - Task-specific models - Domain adaptation - Instruction tuning - RLHF experiments **💡 Prototyping** - Proof of concept - Feature testing - A/B testing architectures - Benchmark comparisons

### ❌ Tidak Cocok Untuk

- 🚫 **Production Applications** - Model belum dilatih, output random - 🚫 **Real-world Deployment** - Perlu training & safety alignment dulu - 🚫 **Safety-critical Systems** - Tidak ada safety guardrails - 🚫 **Direct User-facing Apps** - Output tidak dapat diprediksi - 🚫 **Commercial Use (as-is)** - Harus dilatih terlebih dahulu

--- ## 📊 Spesifikasi Model

Parameter	Value	Parameter	Value
Total Parameters	`3,524,608`	Vocab Size	`8,000`
Hidden Size	`128`	Intermediate Size	`512`
Num Layers	`6`	Attention Heads	`4`
KV Heads (GQA)	`2`	Head Dimension	`32`
Max Context Length	`1,024`	RoPE Base (θ)	`10,000`
Model Size (FP16)	`0.01 GB`	Formatted Size	`3.52M`

--- ### 🎯 Core Features

🔍 Klik untuk expand/collapse

- ✅ **Grouped Query Attention (GQA)** - Efisiensi memori dan komputasi superior - Query heads: **4** - KV heads: **2** - Ratio: **2:1** (hemat ~50% memory KV cache) - **Benefit**: Inferensi lebih cepat dengan memory footprint lebih kecil - ✅ **Rotary Position Embeddings (RoPE)** - Generalisasi konteks panjang lebih baik - Theta (θ): **10,000** - Support extrapolation untuk konteks > training length - **Benefit**: Performa stabil pada sequence length yang belum pernah dilihat saat training - ✅ **RMSNorm** - Normalisasi lebih stabil dan ~50% lebih cepat dari LayerNorm - Epsilon: **1e-06** - **Benefit**: Training lebih stabil, inference lebih cepat, gradient flow lebih baik - ✅ **SwiGLU Activation** - Performa 10-15% lebih baik dari ReLU/GELU - Intermediate size: **512** (4.0x hidden) - **Benefit**: Kapasitas model lebih besar tanpa menambah parameter signifikan - ✅ **Flash Attention 2** - Akselerasi hingga 3x dengan memory efficiency - Otomatis aktif jika tersedia CUDA device - IO-aware algorithm untuk minimal HBM access - **Benefit**: Training & inference jauh lebih cepat, support batch size lebih besar - ✅ **Hybrid Architecture** - Kombinasi teknik terbaik dari 5+ model SOTA - ✅ **Multimodal Support** - Native support untuk Vision dan Audio - ✅ **Mixture of Experts (MoE)** - Sparse activation untuk efisiensi - ✅ **Long Context** - Support hingga 8K+ tokens dengan YARN scaling - ✅ **Advanced Attention** - Flash Attention, Sliding Window, Attention Sinks - ✅ **Quantization Ready** - Support 4-bit dan 8-bit quantization - ✅ **Production Features** - Extensive error handling & monitoring

### 🔥 Advanced Features ### 🎯 Mekanisme Attention - ⚡ **Flash Attention v2** - Algoritma IO-aware yang 3x lebih cepat dari attention standar - 🔑 **Grouped Query Attention (GQA)** - 4 Query heads : 2 KV heads - Rasio kompresi: **2:1** (hemat ~50% memory KV cache) - 🚀 **xFormers Support** - Fallback memory-efficient attention - 🎯 **PyTorch SDPA** - Native scaled dot product attention ### 📍 Position Encodings - 🔄 **RoPE (Rotary Position Embeddings)** - Base frequency θ=10,000 - Generalisasi lebih baik untuk sequence panjang dibanding absolute PE ### 🎓 Optimisasi Training - 💾 **Gradient Checkpointing** - Trade compute for memory (support model hingga 100B+ params) - 🎯 **Mixed Precision Training** - Support FP16, BF16, dan TF32 - 📉 **Dropout Regularization** - Hidden dropout: 0.1 - Attention dropout: 0.0 - Residual dropout: 0.1 ### 📦 Dukungan Quantization - 4️⃣ **4-bit Quantization** - NF4 & FP4 via bitsandbytes - Memory reduction: ~**75%** (4GB → 1GB) - Accuracy loss: <2% pada kebanyakan tasks - Support double quantization untuk kompresi maksimal - 8️⃣ **8-bit Quantization** - LLM.int8() dengan outlier handling - Memory reduction: ~**50%** (4GB → 2GB) - Accuracy loss: <1% - 🔄 **Dynamic Quantization** - Runtime quantization tanpa calibration ### 🔬 Advanced Features - 📊 **Automatic Mixed Precision (AMP)** - Dynamic loss scaling - 🎯 **Gradient Clipping** - Stabilitas training dengan max norm clipping - 📈 **Learning Rate Scheduling** - Support cosine, linear, warmup - 💡 **Smart Memory Management** - Auto cache clearing & monitoring - 🔍 **Metrics Tracking** - Real-time perplexity, loss, gradient norms - 🛡️ **NaN/Inf Detection** - Automatic recovery dari numerical instability --- ## 🧩 Komponen Arsitektur ### 1️⃣ **Dari LLaMA (Meta)** CACA mengadopsi komponen efisien dari LLaMA untuk performa optimal: ```python ✓ RMSNorm # Normalisasi lebih efisien dari LayerNorm ✓ Rotary Position Embeddings # Positional encoding yang lebih baik ✓ SwiGLU Activation # Activation function dengan gating mechanism ✓ Grouped-Query Attention # Hemat memory dengan shared K/V heads ✓ Pre-normalization # Stabilitas training yang lebih baik ``` - RMSNorm **30% lebih cepat** dari LayerNorm - RoPE membuat model bisa **extrapolate ke context lebih panjang** - GQA **hemat 30-40% memory** dibanding Multi-Head Attention - SwiGLU **meningkatkan performa 3-5%** dibanding ReLU/GELU --- ### 2️⃣ **Dari GPT-4 (OpenAI)** Implementasi Mixture of Experts untuk skalabilitas: ```python ✓ Mixture of Experts (MoE) # Sparse activation dengan multiple expert networks ✓ Top-K Router # Routing token ke K expert terbaik ✓ Auxiliary Loss # Load balancing antar experts ✓ Z-Loss # Stabilisasi router logits ✓ Expert Usage Tracking # Monitoring penggunaan setiap expert ``` ``` Input Token ↓ [Router] → Pilih Top-K Experts (misal K=2 dari 8 experts) ↓ Expert_1 (weight: 0.6) + Expert_3 (weight: 0.4) ↓ Weighted Sum Output ``` **Keuntungan:** - Model bisa **10x lebih besar** dengan compute cost yang sama - Setiap token hanya activate **12.5% parameters** (jika K=2, N=8) - Parallel processing antar experts --- ### 3️⃣ **Dari Gemini (Google)** Multimodal native dengan cross-modal fusion: ```python ✓ Vision Encoder (ViT) # Process gambar dengan Vision Transformer ✓ Audio Encoder (Conv1D + Trans) # Process audio dengan CNN + Transformer ✓ Cross-Attention Mechanism # Fuse multimodal features ✓ Multiple Projector Types: - Linear Projector # Simple & cepat - MLP Projector # Non-linear mapping - Perceiver Resampler # Compress dengan latent queries - Q-Former # Query-based projection (BLIP-2 style) ✓ Logit Soft-Capping # Clip extreme values untuk stabilitas ``` **Alur Multimodal:** ``` [Image] → Vision Encoder → [2D patches → 1D tokens] ↓ Projector → [Hidden dim = text dim] ↓ [Text] + [Image tokens] → Cross-Attention → Fused representation ``` **Support format:** - Images: JPEG, PNG (224x224 default) - Audio: Mel-spectrogram (80 bins) --- ### 4️⃣ **Dari Qwen (Alibaba)** Long context optimization: ```python ✓ YARN Scaling # Yet Another RoPE extensioN ✓ Dynamic Position Scaling # Auto-adjust untuk sequence lebih panjang ✓ Sliding Window Attention # Local attention pattern ✓ Context Window 8K-128K # Flexible context length ``` **YARN vs Standard RoPE:** ``` Standard RoPE: [====] 4K context → [====????] 8K (error naik) YARN: [====] 4K context → [========] 8K (smooth extrapolation) ``` **Sliding Window Mechanism:** ``` Token 0: attend ke [0] Token 1: attend ke [0, 1] Token 2: attend ke [0, 1, 2] Token 10: attend ke [0, 6, 7, 8, 9, 10] ← sliding window = 4 (keep attention sink di token 0) ``` --- ### 5️⃣ **Dari Gemma (Google)** Optimization techniques: ```python ✓ Layer Scale # Learnable scaling per layer ✓ Stochastic Depth # Random layer dropping saat training ✓ Normalized Attention # QK normalization untuk stabilitas ✓ Knowledge Distillation # Transfer knowledge dari model besar ``` **Layer Scale formula:** ```python output = input + gamma * layer(input) # gamma diinit sangat kecil (1e-5) lalu di-learn ``` **Stochastic Depth:** - Training: 20% chance layer di-skip (drop_prob=0.2) - Inference: semua layer aktif - Benefit: **regularization** + **faster training** --- ## 🆕 Fitur Eksperimental & Unik ### A) **Mixture of Depths (MoD)** Token bisa "skip" layer tertentu untuk efisiensi: ```python class MixtureOfDepthsRouter: # Pilih top 50% tokens paling "penting" untuk di-process capacity_factor = 0.5 # Method: learned, random, atau heuristic route_method = "learned" ``` **Ilustrasi:** ``` Layer 1: [All 100 tokens processed] Layer 2: [Top 50 tokens processed, 50 skipped] ← MoD Layer 3: [All 100 tokens processed] Layer 4: [Top 50 tokens processed, 50 skipped] ← MoD ``` **Benefit:** - **30-40% faster inference** dengan minimal accuracy drop - Dynamic computation based on token importance **Paper:** [Mixture-of-Depths (2024)](https://arxiv.org/abs/2404.02258) --- ### B) **Attention Sinks** Keep token awal selalu di-attend untuk stabilitas: ```python attention_sink_size = 4 # Keep first 4 tokens attention_sink_window = 512 # Sliding window size ``` **Attention Pattern:** ``` Query Token 1000: ├─ Attend to: [0, 1, 2, 3] ← attention sinks (always) └─ Attend to: [488, 489, ..., 1000] ← sliding window ``` **Benefit:** - Prevent attention collapse di long sequences - Better streaming generation - Inspired by [StreamingLLM (2023)](https://arxiv.org/abs/2309.17453) --- ### C) **Expert Choice Routing** Alternatif dari Top-K routing: ```python # Top-K: Token pilih expert Token → Router → "Saya mau Expert 2 dan 5" # Expert Choice: Expert pilih token Expert 1 → "Saya mau process Token 3, 7, 12, ..." Expert 2 → "Saya mau process Token 1, 5, 9, ..." ``` **Keuntungan:** - **Better load balancing** (setiap expert process jumlah token yang sama) - **Lebih stable training** (no expert collapse) - Trade-off: sedikit lebih complex implementasi --- ### D) **Multi-Backend Attention** Automatic fallback untuk compatibility: ```python if HAS_FLASH_ATTN and device == "cuda": use flash_attn_func() # ← Fastest (2-4x speedup) elif HAS_XFORMERS and device == "cuda": use memory_efficient_attention() # ← Fallback 1 elif HAS_SDPA: use F.scaled_dot_product_attention() # ← Fallback 2 (PyTorch 2.0+) else: use standard_attention() # ← Safe fallback ``` **Performa Comparison:** ``` Flash Attention: 100ms (baseline) xFormers: 150ms (1.5x slower) SDPA: 180ms (1.8x slower) Standard: 400ms (4x slower) ``` --- ## 🏗️ CACA Model Family | Model | Parameters | Vocab Size | Hidden Size | Intermediate Size | Layers | Attention Heads | KV Heads | Head Dim | Max Position | |-------|------------|------------|-------------|-------------------|--------|-----------------|----------|----------|--------------| | caca-1M-untrained | 2.50M | 8,000 | 128 | 512 | 6 | 4 | 2 | 32 | 1,024 | | caca-3M-untrained | 6.63M | 12,000 | 192 | 768 | 8 | 6 | 2 | 32 | 2,048 | | caca-4M-untrained | 4.02M | 16,000 | 128 | 512 | 8 | 4 | 2 | 32 | 2,048 | | caca-6M-untrained | 11.96M | 16,000 | 256 | 1024 | 8 | 4 | 2 | 64 | 2,048 | | caca-10M-untrained | 21.25M | 20,000 | 320 | 1280 | 10 | 8 | 2 | 40 | 2,048 | | caca-15M-untrained | 35.18M | 24,000 | 384 | 1536 | 12 | 6 | 2 | 64 | 2,048 | | caca-25M-untrained | 67.57M | 28,000 | 512 | 2048 | 14 | 8 | 2 | 64 | 4,096 | | caca-35M-untrained | 95.42M | 32,000 | 576 | 2304 | 16 | 8 | 2 | 72 | 4,096 | | caca-50M-untrained | 138.47M | 32,000 | 640 | 2560 | 20 | 10 | 2 | 64 | 4,096 | | caca-75M-untrained | 178.55M | 32,000 | 768 | 3072 | 18 | 12 | 3 | 64 | 4,096 | | caca-100M-untrained | 232.23M | 32,000 | 768 | 3072 | 24 | 12 | 4 | 64 | 4,096 | | caca-150M-untrained | 336.90M | 32,000 | 1024 | 4096 | 20 | 16 | 4 | 64 | 4,096 | | caca-200M-untrained | 458.55M | 32,000 | 1024 | 4096 | 28 | 16 | 4 | 64 | 4,096 | | caca-250M-untrained | 569.54M | 32,000 | 1152 | 4608 | 28 | 18 | 3 | 64 | 8,192 | | caca-300M-untrained | 701.64M | 32,000 | 1280 | 5120 | 28 | 20 | 4 | 64 | 8,192 | | caca-400M-untrained | 956.36M | 32,000 | 1408 | 5632 | 32 | 22 | 4 | 64 | 8,192 | | caca-500M-untrained | 1.27B | 32,000 | 1536 | 6144 | 36 | 24 | 4 | 64 | 8,192 | | caca-600M-untrained | 1.48B | 32,000 | 1664 | 6656 | 36 | 26 | 4 | 64 | 8,192 | | caca-700M-untrained | 1.71B | 32,000 | 1792 | 7168 | 36 | 28 | 4 | 64 | 8,192 | | caca-800M-untrained | 1.96B | 32,000 | 1920 | 7680 | 36 | 30 | 5 | 64 | 8,192 | | caca-900M-untrained | 2.01B | 32,000 | 2048 | 8192 | 32 | 32 | 8 | 64 | 8,192 | | caca-1B-untrained | 2.26B | 32,000 | 2048 | 8192 | 36 | 32 | 8 | 64 | 8,192 | | caca-1.5B-untrained | 2.98B | 32,000 | 2048 | 8192 | 48 | 32 | 8 | 64 | 8,192 | | caca-2B-untrained | 3.15B | 32,000 | 2304 | 9216 | 40 | 32 | 8 | 72 | 8,192 | | caca-2.5B-untrained | 3.12B | 32,000 | 2560 | 10240 | 32 | 32 | 8 | 80 | 8,192 | | caca-3B-untrained | 3.88B | 32,000 | 2560 | 10240 | 40 | 32 | 8 | 80 | 8,192 | | caca-3.5B-untrained | 4.69B | 32,000 | 2816 | 11264 | 40 | 32 | 8 | 88 | 8,192 | | caca-4B-untrained | 5.02B | 32,000 | 3072 | 12288 | 36 | 32 | 8 | 96 | 8,192 | | caca-4.5B-untrained | 5.45B | 32,000 | 3200 | 12800 | 36 | 32 | 8 | 100 | 8,192 | | caca-5B-untrained | 6.53B | 32,000 | 3328 | 13312 | 40 | 32 | 8 | 104 | 8,192 | | caca-6B-untrained | 8.31B | 32,000 | 3584 | 14336 | 44 | 32 | 8 | 112 | 8,192 | | caca-7B-untrained | 7.11B | 32,000 | 4096 | 14336 | 32 | 32 | 8 | 128 | 8,192 | | caca-8B-untrained | 7.98B | 32,000 | 4096 | 14336 | 36 | 32 | 8 | 128 | 8,192 | | caca-9B-untrained | 9.09B | 32,000 | 4608 | 16384 | 32 | 36 | 9 | 128 | 8,192 | | caca-10B-untrained | 11.23B | 32,000 | 4608 | 18432 | 36 | 32 | 8 | 144 | 8,192 | | caca-12B-untrained | 15.26B | 32,000 | 5120 | 20480 | 40 | 40 | 8 | 128 | 8,192 | | caca-13B-untrained | 13.38B | 32,000 | 5120 | 13824 | 48 | 40 | 8 | 128 | 8,192 | | caca-14B-untrained | 13.40B | 32,000 | 5376 | 14464 | 44 | 48 | 8 | 112 | 8,192 | | caca-15B-untrained | 14.90B | 32,000 | 5632 | 15104 | 44 | 32 | 8 | 176 | 8,192 | | caca-18B-untrained | 18.92B | 32,000 | 6144 | 16384 | 48 | 48 | 8 | 128 | 8,192 | | caca-20B-untrained | 20.48B | 32,000 | 6144 | 16384 | 52 | 48 | 8 | 128 | 8,192 | | caca-24B-untrained | 25.83B | 32,000 | 6656 | 17920 | 56 | 64 | 8 | 104 | 8,192 | | caca-30B-untrained | 32.24B | 32,000 | 6656 | 17920 | 70 | 64 | 8 | 104 | 8,192 | | caca-35B-untrained | 39.02B | 32,000 | 8192 | 22016 | 56 | 64 | 8 | 128 | 8,192 | | caca-40B-untrained | 44.56B | 32,000 | 8192 | 22016 | 64 | 64 | 8 | 128 | 8,192 | | caca-45B-untrained | 50.09B | 32,000 | 8192 | 22016 | 72 | 64 | 8 | 128 | 8,192 | | caca-50B-untrained | 55.63B | 32,000 | 8192 | 22016 | 80 | 64 | 8 | 128 | 8,192 | | caca-60B-untrained | 72.14B | 32,000 | 8192 | 28672 | 84 | 64 | 8 | 128 | 8,192 | | caca-70B-untrained | 68.71B | 32,000 | 8192 | 28672 | 80 | 64 | 8 | 128 | 8,192 | | caca-80B-untrained | 101.77B | 32,000 | 9216 | 36864 | 84 | 72 | 8 | 128 | 8,192 | | caca-100B-untrained | 137.32B | 32,000 | 10240 | 40960 | 92 | 80 | 8 | 128 | 8,192 | | caca-120B-untrained | 173.10B | 32,000 | 11264 | 45056 | 96 | 88 | 8 | 128 | 8,192 | | caca-150B-untrained | 214.31B | 32,000 | 12288 | 49152 | 100 | 96 | 8 | 128 | 8,192 | | caca-175B-untrained | 248.53B | 32,000 | 12288 | 49152 | 116 | 96 | 8 | 128 | 8,192 | | caca-200B-untrained | 324.80B | 128,000 | 14336 | 57344 | 110 | 112 | 16 | 128 | 16,384 | | caca-250B-untrained | 419.35B | 128,000 | 15360 | 61440 | 124 | 120 | 16 | 128 | 16,384 | | caca-300B-untrained | 507.03B | 128,000 | 16384 | 65536 | 132 | 128 | 16 | 128 | 16,384 | | caca-350B-untrained | 591.18B | 128,000 | 16384 | 65536 | 154 | 128 | 16 | 128 | 16,384 | | caca-400B-untrained | 675.34B | 128,000 | 16384 | 65536 | 176 | 128 | 16 | 128 | 16,384 | | caca-500B-untrained | 852.77B | 128,000 | 18432 | 73728 | 176 | 144 | 16 | 128 | 16,384 | | caca-600B-untrained | 1.07T | 128,000 | 20480 | 81920 | 180 | 160 | 16 | 128 | 16,384 | | caca-700B-untrained | 1.23T | 128,000 | 21504 | 86016 | 186 | 168 | 24 | 128 | 16,384 | | caca-800B-untrained | 1.38T | 128,000 | 22528 | 90112 | 192 | 176 | 16 | 128 | 16,384 | | caca-900B-untrained | 1.65T | 128,000 | 24576 | 94208 | 198 | 192 | 24 | 128 | 16,384 | | caca-1T-untrained | 1.75T | 128,000 | 24576 | 98304 | 204 | 192 | 16 | 128 | 16,384 | --- ## 💾 Kebutuhan Memory ### Training Requirements

Configuration	Model Weights	+ Optimizer States	Total Training
FP32 (AdamW)	0.01 GB	+0.04 GB	0.06 GB
Mixed Precision	0.01 GB	+0.05 GB	0.06 GB
+ Gradient Checkpointing	Menghemat ~30-50% activation memory		~0.03 GB

### Inference Requirements

Precision	Model Size	Total Memory	Memory Saving
FP16 / BF16	0.01 GB	0.01 GB	Baseline
INT8	0.00 GB	0.01 GB	~50% ↓
INT4 (NF4)	0.00 GB	0.00 GB	~75% ↓

> 💡 **Note**: KV cache bertambah secara linear dengan panjang sequence. Untuk context 8K, kalikan nilai KV cache dengan 4. ### Performance Estimates

Metric	Value	Notes
FLOPs per Token	7,049,216	Forward pass only
TFLOPs per Token	0.0000	≈ 6× untuk backward
Bandwidth (FP16)	0.01 GB/token	Memory bandwidth requirement

--- ### 📐 Struktur Arsitektur Lengkap

🔍 Klik untuk lihat detail arsitektur

``` CACA Architecture │ ├─── 📥 INPUT PROCESSING │ │ │ ├─── Text Input │ │ ├─── Tokenization (BPE/WordPiece/SentencePiece) │ │ ├─── Token Embeddings (vocab_size × hidden_size) │ │ └─── Output: [batch_size, seq_len, hidden_size] │ │ │ ├─── Vision Input (Optional) │ │ ├─── Image Preprocessing (resize ke 224×224) │ │ ├─── Vision Encoder (ViT) │ │ │ ├─── Patch Embedding (Conv2D: 14×14 patches) │ │ │ ├─── CLS Token + Positional Embeddings │ │ │ ├─── Vision Transformer Blocks (24 layers) │ │ │ │ ├─── LayerNorm │ │ │ │ ├─── Multi-Head Attention │ │ │ │ ├─── MLP (GELU activation) │ │ │ │ └─── Residual Connections │ │ │ └─── Final LayerNorm │ │ ├─── Vision Projector │ │ │ ├─── Type: Linear / MLP / Perceiver / Q-Former │ │ │ └─── Output: [batch_size, num_patches, hidden_size] │ │ └─── Output: Vision embeddings aligned to text space │ │ │ └─── Audio Input (Optional) │ ├─── Audio Preprocessing (Mel-spectrogram, 80 bins) │ ├─── Audio Encoder │ │ ├─── Conv1D Layers (feature extraction) │ │ │ ├─── Conv1D (80 → hidden_size, kernel=3) │ │ │ └─── Conv1D (stride=2 untuk downsampling) │ │ ├─── Positional Embeddings (interpolated) │ │ ├─── Audio Transformer Blocks (12 layers) │ │ │ ├─── LayerNorm │ │ │ ├─── Multi-Head Attention │ │ │ ├─── MLP (GELU activation) │ │ │ └─── Residual Connections │ │ └─── Final LayerNorm │ ├─── Audio Projector │ │ ├─── Type: Linear / MLP / Perceiver / Q-Former │ │ └─── Output: [batch_size, audio_len, hidden_size] │ └─── Output: Audio embeddings aligned to text space │ ├─── 🔄 MULTIMODAL FUSION │ │ │ ├─── Early Fusion (jika tidak pakai Cross-Attention) │ │ ├─── Concatenate: [vision_tokens + audio_tokens + text_tokens] │ │ ├─── Update attention mask │ │ └─── Output: Combined sequence untuk decoder │ │ │ └─── Late Fusion (jika pakai Cross-Attention) │ ├─── Text tokens → Query untuk cross-attention │ ├─── Vision+Audio tokens → Key/Value untuk cross-attention │ └─── Fusion dilakukan di dalam decoder layers │ ├─── 🏗️ DECODER STACK (N=32 layers) │ │ │ └─── 🔁 DECODER LAYER i (repeated N times) │ │ │ ├─── [OPTIONAL] Mixture of Depths (MoD) │ │ ├─── Input: Hidden states [batch, seq_len, hidden] │ │ ├─── MoD Router │ │ │ ├─── Method: learned / random / heuristic │ │ │ ├─── Score computation per token │ │ │ └─── Top-K selection (K = capacity_factor × seq_len) │ │ ├─── Process Mask Generation │ │ │ └─── Binary mask [batch, seq_len] (1=process, 0=skip) │ │ └─── Token Selection │ │ ├─── Selected tokens: processed through layer │ │ └─── Skipped tokens: bypass layer (identity) │ │ │ ├─── 🎯 SELF-ATTENTION PATH │ │ │ │ │ ├─── Input Normalization │ │ │ ├─── RMSNorm (Root Mean Square Layer Normalization) │ │ │ ├─── Formula: x * rsqrt(mean(x²) + ε) * γ │ │ │ └─── More efficient than LayerNorm (no mean centering) │ │ │ │ │ ├─── Attention Computation │ │ │ │ │ │ │ ├─── Query/Key/Value Projections │ │ │ │ ├─── Q: Linear(hidden_size → num_heads × head_dim) │ │ │ │ ├─── K: Linear(hidden_size → num_kv_heads × head_dim) │ │ │ │ ├─── V: Linear(hidden_size → num_kv_heads × head_dim) │ │ │ │ └─── Reshape: [batch, seq, heads, head_dim] │ │ │ │ │ │ │ ├─── [OPTIONAL] QK Normalization │ │ │ │ ├─── Q = RMSNorm(Q) │ │ │ │ └─── K = RMSNorm(K) │ │ │ │ │ │ │ ├─── Rotary Position Embeddings (RoPE) │ │ │ │ ├─── Compute frequencies: θ_i = base^(-2i/dim) │ │ │ │ ├─── Position indices: t ∈ [0, seq_len) │ │ │ │ ├─── Rotation matrix: cos(t·θ), sin(t·θ) │ │ │ │ ├─── Apply rotation: Q, K = rotate(Q, K, cos, sin) │ │ │ │ └─── YARN Scaling (jika enabled) │ │ │ │ ├─── Type: linear / dynamic / yarn │ │ │ │ ├─── Scaling factor per frequency band │ │ │ │ └─── Better extrapolation ke context panjang │ │ │ │ │ │ │ ├─── Grouped-Query Attention (GQA) │ │ │ │ ├─── num_kv_groups = num_heads / num_kv_heads │ │ │ │ ├─── Repeat K, V: [num_kv_heads → num_heads] │ │ │ │ └─── Memory saving: 30-40% vs full MHA │ │ │ │ │ │ │ ├─── Attention Score Computation │ │ │ │ ├─── scores = (Q @ K.T) / sqrt(head_dim) │ │ │ │ ├─── Logit clamping: [-50, 50] untuk stabilitas │ │ │ │ └─── [OPTIONAL] Soft-capping │ │ │ │ └─── scores = tanh(scores / cap) * cap │ │ │ │ │ │ │ ├─── Attention Masking │ │ │ │ ├─── Causal Mask (autoregressive) │ │ │ │ ├─── Sliding Window Mask (jika enabled) │ │ │ │ │ ├─── Window size (misal: 512 tokens) │ │ │ │ │ └─── Attend hanya ke window terdekat │ │ │ │ ├─── Attention Sinks (jika enabled) │ │ │ │ │ ├─── Always attend to first K tokens │ │ │ │ │ ├─── Prevent attention collapse │ │ │ │ │ └─── Better streaming generation │ │ │ │ └─── [OPTIONAL] ALiBi Bias │ │ │ │ ├─── Linear bias based on distance │ │ │ │ └─── Alternative/complement to RoPE │ │ │ │ │ │ │ ├─── Backend Selection (automatic fallback) │ │ │ │ ├─── 1️⃣ Flash Attention 2 (PREFERRED) │ │ │ │ │ ├─── Requirements: CUDA + FP16/BF16 │ │ │ │ │ ├─── Speedup: 2-4x faster │ │ │ │ │ ├─── Memory: 10-20x less │ │ │ │ │ ├─── Sliding window support │ │ │ │ │ └─── IO-aware algorithm │ │ │ │ ├─── 2️⃣ xFormers Memory Efficient (FALLBACK 1) │ │ │ │ │ ├─── Requirements: CUDA │ │ │ │ │ ├─── Block-sparse attention │ │ │ │ │ └─── Custom attention patterns │ │ │ │ ├─── 3️⃣ PyTorch SDPA (FALLBACK 2) │ │ │ │ │ ├─── Requirements: PyTorch 2.0+ │ │ │ │ │ ├─── Built-in scaled_dot_product_attention │ │ │ │ │ └─── Hardware-agnostic │ │ │ │ └─── 4️⃣ Standard Attention (SAFE FALLBACK) │ │ │ │ ├─── Pure PyTorch implementation │ │ │ │ ├─── Always available │ │ │ │ └─── Slower but stable │ │ │ │ │ │ │ ├─── Softmax + Dropout │ │ │ │ ├─── attn_weights = softmax(scores, dim=-1) │ │ │ │ └─── attn_weights = dropout(attn_weights) │ │ │ │ │ │ │ ├─── Value Aggregation │ │ │ │ ├─── output = attn_weights @ V │ │ │ │ └─── Reshape: [batch, seq, num_heads × head_dim] │ │ │ │ │ │ │ └─── Output Projection │ │ │ ├─── O: Linear(num_heads × head_dim → hidden_size) │ │ │ └─── Output: [batch, seq, hidden_size] │ │ │ │ │ ├─── [OPTIONAL] Layer Scale │ │ │ ├─── Learnable per-layer scaling: γ │ │ │ ├─── Initialize: γ = 1e-5 (very small) │ │ │ ├─── output = γ * output │ │ │ └─── Improves training stability │ │ │ │ │ ├─── [OPTIONAL] Stochastic Depth │ │ │ ├─── Training: Random layer dropping │ │ │ ├─── drop_prob = layer_idx / num_layers × base_prob │ │ │ ├─── if random() > drop_prob: return output │ │ │ ├─── else: return 0 │ │ │ └─── Inference: Always apply (no dropping) │ │ │ │ │ ├─── Residual Dropout │ │ │ └─── output = dropout(output) │ │ │ │ │ └─── Residual Connection │ │ ├─── hidden_states = hidden_states + output │ │ └─── [Training] Gradient clipping: [-1e4, 1e4] │ │ │ ├─── 🌐 [OPTIONAL] CROSS-ATTENTION PATH (untuk Multimodal) │ │ │ │ │ ├─── Conditional: Hanya jika encoder_hidden_states != None │ │ ├─── Frequency: Setiap cross_attention_frequency layers │ │ │ │ │ ├─── Input Normalization │ │ │ └─── RMSNorm(hidden_states) │ │ │ │ │ ├─── Cross-Attention Computation │ │ │ ├─── Query: dari text hidden states │ │ │ │ └─── Q: Linear(hidden_size → num_heads × head_dim) │ │ │ ├─── Key/Value: dari encoder_hidden_states (vision+audio) │ │ │ │ ├─── K: Linear(hidden_size → num_kv_heads × head_dim) │ │ │ │ └─── V: Linear(hidden_size → num_kv_heads × head_dim) │ │ │ ├─── Attention: Q @ K.T / sqrt(head_dim) │ │ │ ├─── Softmax + Dropout │ │ │ ├─── Output: attn_weights @ V │ │ │ └─── Output Projection │ │ │ │ │ ├─── [OPTIONAL] Layer Scale │ │ ├─── [OPTIONAL] Stochastic Depth │ │ ├─── Residual Dropout │ │ └─── Residual Connection │ │ └─── hidden_states = hidden_states + cross_attn_output │ │ │ └─── 🔮 FEED-FORWARD PATH │ │ │ ├─── Input Normalization │ │ └─── RMSNorm(hidden_states) │ │ │ ├─── Feed-Forward Network │ │ │ │ │ ├─── ━━━━━ STANDARD MLP ━━━━━ │ │ │ │ │ │ │ ├─── Gate Projection │ │ │ │ ├─── gate: Linear(hidden_size → intermediate_size) │ │ │ │ └─── Typical: intermediate_size = 4 × hidden_size │ │ │ │ │ │ │ ├─── Up Projection │ │ │ │ └─── up: Linear(hidden_size → intermediate_size) │ │ │ │ │ │ │ ├─── SwiGLU Activation │ │ │ │ ├─── gate = silu(gate) # Swish activation │ │ │ │ ├─── hidden = gate * up # Gating mechanism │ │ │ │ └─── Formula: silu(x) = x * sigmoid(x) │ │ │ │ │ │ │ ├─── Dropout │ │ │ │ └─── hidden = dropout(hidden) │ │ │ │ │ │ │ └─── Down Projection │ │ │ ├─── down: Linear(intermediate_size → hidden_size) │ │ │ └─── Output: [batch, seq, hidden_size] │ │ │ │ │ └─── ━━━━━ MIXTURE OF EXPERTS (MoE) ━━━━━ │ │ │ │ │ ├─── Conditional: use_moe AND (layer_idx % moe_frequency == 0) │ │ │ │ │ ├─── Router Network │ │ │ │ │ │ │ ├─── Router Type Selection │ │ │ │ ├─── Top-K Router (default) │ │ │ │ └─── Expert Choice Router (alternative) │ │ │ │ │ │ │ ├─── ━━━ TOP-K ROUTER ━━━ │ │ │ │ │ │ │ │ │ ├─── Gate Normalization │ │ │ │ │ └─── hidden = LayerNorm(hidden) │ │ │ │ │ │ │ │ │ ├─── Router Logits │ │ │ │ │ ├─── logits: Linear(hidden_size → num_experts) │ │ │ │ │ ├─── Clamping: [-20, 20] │ │ │ │ │ └─── Temperature scaling: logits / temp │ │ │ │ │ │ │ │ │ ├─── [Training] Jitter Noise │ │ │ │ │ ├─── noise = randn_like(logits) × 0.01 │ │ │ │ │ └─── logits = logits + noise │ │ │ │ │ │ │ │ │ ├─── Routing Weights │ │ │ │ │ ├─── weights = softmax(logits) │ │ │ │ │ └─── top_k_weights, top_k_indices = topk(weights, k) │ │ │ │ │ │ │ │ │ ├─── Weight Normalization │ │ │ │ │ └─── top_k_weights = top_k_weights / sum(top_k_weights) │ │ │ │ │ │ │ │ │ └─── Loss Computation │ │ │ │ ├─── Auxiliary Loss (load balancing) │ │ │ │ │ ├─── expert_usage = mean(weights, dim=0) │ │ │ │ │ ├─── mean_usage = mean(expert_usage) │ │ │ │ │ └─── aux_loss = std(expert_usage) / mean_usage │ │ │ │ ├─── Z-Loss (router stability) │ │ │ │ │ ├─── z_loss = mean(logsumexp(logits)²) │ │ │ │ │ └─── Prevents logits explosion │ │ │ │ └─── Entropy Loss (diversity) │ │ │ │ └─── entropy_loss = -mean(weights × log(weights)) │ │ │ │ │ │ │ └─── ━━━ EXPERT CHOICE ROUTER ━━━ │ │ │ │ │ │ │ ├─── Router Logits │ │ │ │ └─── logits: Linear(hidden → num_experts) │ │ │ │ │ │ │ ├─── Expert-wise Token Selection │ │ │ │ ├─── Transpose: [batch×seq, experts] │ │ │ │ ├─── capacity = expert_choice_k × total_tokens / num_experts │ │ │ │ ├─── Per expert: topk(logits, k=capacity) │ │ │ │ └─── Expert mask: [experts, batch×seq] │ │ │ │ │ │ │ └─── Routing weights from mask │ │ │ │ │ ├─── Expert Networks (N experts, misal N=8) │ │ │ │ │ │ │ └─── Expert i (i = 0 to N-1) │ │ │ ├─── Same structure as Standard MLP │ │ │ ├─── gate_proj: Linear(hidden → intermediate) │ │ │ ├─── up_proj: Linear(hidden → intermediate) │ │ │ ├─── SwiGLU activation │ │ │ ├─── Dropout │ │ │ └─── down_proj: Linear(intermediate → hidden) │ │ │ │ │ ├─── Expert Execution │ │ │ │ │ │ │ ├─── For each expert: │ │ │ │ ├─── Get tokens routed to this expert │ │ │ │ ├─── If no tokens: skip │ │ │ │ ├─── Run expert forward pass │ │ │ │ ├─── [Training] Track expert usage │ │ │ │ └─── [Safety] NaN/Inf detection │ │ │ │ │ │ │ └─── Combine Expert Outputs │ │ │ ├─── Weighted sum by router weights │ │ │ └─── final_output = Σ(weight_i × expert_i(x)) │ │ │ │ │ └─── Output: [batch, seq, hidden_size] │ │ │ ├─── [OPTIONAL] Layer Scale │ │ └─── output = γ * output │ │ │ ├─── [OPTIONAL] Stochastic Depth │ │ └─── Probabilistic dropping (training only) │ │ │ ├─── Residual Dropout │ │ └─── output = dropout(output) │ │ │ └─── Residual Connection │ ├─── hidden_states = hidden_states + output │ └─── [Training] Gradient clipping: [-1e4, 1e4] │ ├─── 📤 OUTPUT HEAD │ │ │ ├─── Final Normalization │ │ ├─── RMSNorm(hidden_states) │ │ └─── Output: [batch, seq, hidden_size] │ │ │ ├─── Language Modeling Head │ │ ├─── Linear Projection │ │ │ ├─── lm_head: Linear(hidden_size → vocab_size, bias=False) │ │ │ └─── Output: [batch, seq, vocab_size] │ │ │ │ │ └─── [OPTIONAL] Logit Soft-Capping │ │ ├─── Clamp extreme values: [-cap×0.99, cap×0.99] │ │ ├─── Formula: tanh(logits / cap) × cap │ │ ├─── Prevents numerical instability │ │ └─── Typical cap value: 30.0 │ │ │ └─── Output: Logits [batch, seq, vocab_size] │ ├─── 📉 LOSS COMPUTATION (Training Only) │ │ │ ├─── Shift for Autoregressive │ │ ├─── shift_logits = logits[:, :-1, :] │ │ └─── shift_labels = labels[:, 1:] │ │ │ ├─── Language Modeling Loss │ │ ├─── CrossEntropyLoss(ignore_index=-100) │ │ ├─── [OPTIONAL] Label Smoothing │ │ │ └─── Reduces overconfidence │ │ └─── lm_loss = CE(shift_logits, shift_labels) │ │ │ ├─── [OPTIONAL] MoE Auxiliary Losses │ │ ├─── Router Auxiliary Loss (load balancing) │ │ │ └─── aux_loss × router_aux_loss_coef (default: 0.01) │ │ ├─── Router Z-Loss (stability) │ │ │ └─── z_loss × router_z_loss_coef (default: 0.001) │ │ └─── Sum across all MoE layers │ │ │ └─── Total Loss │ └─── total = lm_loss + aux_losses │ ├─── 📊 MONITORING & METRICS │ │ │ ├─── MetricsTracker │ │ ├─── Loss tracking (LM, aux, z-loss) │ │ ├─── Perplexity: exp(lm_loss) │ │ ├─── Gradient norms per layer │ │ ├─── GPU memory usage │ │ ├─── Expert usage statistics │ │ ├─── Attention cache hit rate │ │ └─── Periodic summary & clearing │ │ │ ├─── Gradient Monitoring │ │ ├─── Max gradient norm per layer │ │ ├─── Mean gradient norm (EMA) │ │ ├─── Gradient clipping count │ │ └─── NaN/Inf detection │ │ │ └─── Memory Monitoring │ ├─── GPU memory allocated │ ├─── GPU memory reserved │ ├─── Automatic cache clearing │ └─── Per-layer memory checkpoints │ └─── 🔧 OPTIMIZATION FEATURES │ ├─── Gradient Checkpointing │ ├─── Trade: 30% slower, 50% less memory │ ├─── Recompute activations during backward │ └─── Enable: model.gradient_checkpointing_enable() │ ├─── Mixed Precision Training (AMP) │ ├─── FP16/BF16 forward pass │ ├─── FP32 master weights │ ├─── Dynamic loss scaling │ └─── 2x speedup, 50% memory reduction │ ├─── Gradient Accumulation │ ├─── Simulate larger batch size │ ├─── loss = loss / accumulation_steps │ └─── optimizer.step() every N steps │ ├─── KV Cache (Inference) │ ├─── Cache Key/Value tensors │ ├─── Reuse for autoregressive generation │ ├─── Memory: O(num_layers × seq_len × hidden_size) │ └─── Speedup: ~10x untuk long sequences │ └─── Quantization Support ├─── 8-bit (LLM.int8) │ ├─── bitsandbytes integration │ ├─── Mixed precision (outliers in FP16) │ └─── 2x memory reduction └─── 4-bit (QLoRA) ├─── NF4 quantization (normal float 4-bit) ├─── Double quantization ├─── BF16 compute dtype └─── 4x memory reduction CacaForCausalLM (3.52M) │ ├─ Embedding: 8,000 × 128 │ ├─ Transformer Layers (6x) │ ├─ RMSNorm │ ├─ Attention (GQA) │ │ ├─ Q: 4 heads × 32 dim │ │ ├─ KV: 2 heads × 32 dim │ │ ├─ RoPE (θ=10,000) │ │ └─ Flash Attention v2 │ ├─ Residual │ ├─ RMSNorm │ ├─ FFN (SwiGLU) │ │ ├─ Gate: 128 → 512 │ │ ├─ Up: 128 → 512 │ │ └─ Down: 512 → 128 │ └─ Residual │ ├─ Final RMSNorm └─ LM Head: 128 → 8,000 ═══════════════════════════════════════════════════════════ 📊 PARAMETER BREAKDOWN: ═══════════════════════════════════════════════════════════ Embeddings: 1,024,000 ( 29.1%) Transformer Layers: 1,474,560 ( 41.8%) ├─ Attention: 294,912 └─ FFN: 1,179,648 Final Norm: 128 ( 0.0%) ─────────────────────────────────────────────────────────── TOTAL: 3,524,608 (100.0%) ═══════════════════════════════════════════════════════════ ``` **Key Design Decisions:** 1. **GQA over MHA**: Hemat 50% KV cache memory dengan minimal accuracy loss 2. **SwiGLU over GELU**: ~10% better performance pada language modeling 3. **RMSNorm over LayerNorm**: Lebih cepat & stabil, tanpa bias term 4. **RoPE over Learned**: Better extrapolation untuk sequence length > training 5. **No Bias in Linear**: Mengikuti modern LLM best practices (LLaMA-style)

--- ## 📚 Dokumentasi ### 📦 Instalasi Dependencies ```bash # Core dependencies (REQUIRED) pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors # Optional: Untuk performa maksimal pip install flash-attn --no-build-isolation # Flash Attention 2 (3x speedup) pip install xformers # Memory efficient attention pip install bitsandbytes # 4/8-bit quantization # Optional: Untuk monitoring & profiling pip install tensorboard wandb # Training monitoring pip install gputil psutil # Resource monitoring ``` **Compatibility Matrix:** | Component | Version | Note | |-----------|---------|------| | Python | 3.8 - 3.11 | 3.11 recommended | | PyTorch | ≥ 2.0.0 | 2.1+ untuk SDPA optimal | | CUDA | 11.8 / 12.1 | Untuk Flash Attention | | Transformers | ≥ 4.35.0 | Untuk AutoModel support | ### Cara Penggunaan #### 1️⃣ Basic Loading ```python from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer import torch # Load configuration config = AutoConfig.from_pretrained( "Lyon28/caca-1M-untrained", trust_remote_code=True ) # Load model (FP16 untuk efisiensi) model = AutoModelForCausalLM.from_pretrained( "Lyon28/caca-1M-untrained", config=config, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto" # Automatic device placement ) # Model ini UNTRAINED - butuh training dulu! print(f"Model loaded: {model.num_parameters():,} parameters") print("⚠️ Model ini belum dilatih dan belum bisa digunakan untuk inference") ``` #### 2️⃣ Quantized Loading (4-bit/8-bit) ```python from transformers import AutoModelForCausalLM, BitsAndBytesConfig import torch # 4-bit quantization config bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_use_double_quant=True ) # Load model dengan quantization model = AutoModelForCausalLM.from_pretrained( "Lyon28/caca-1M-untrained", trust_remote_code=True, quantization_config=bnb_config, device_map="auto" ) print(f"Memory footprint: ~0.00GB (4-bit)") ``` #### 3️⃣ Training Setup ```python from transformers import TrainingArguments, Trainer # Training configuration training_args = TrainingArguments( output_dir="./output", per_device_train_batch_size=1, gradient_accumulation_steps=16, learning_rate=2e-4, max_steps=10000, lr_scheduler_type="cosine", warmup_steps=500, logging_steps=10, save_steps=500, fp16=True, # Mixed precision gradient_checkpointing=True, # Memory efficient ) # Initialize trainer trainer = Trainer( model=model, args=training_args, train_dataset=train_dataset, ) # Start training trainer.train() ``` ### Advanced Usage #### Gradient Checkpointing (Memory Efficient) ```python model.gradient_checkpointing_enable() print("✅ Gradient checkpointing enabled - saves ~40% memory") ``` #### Custom Training Loop ```python from torch.optim import AdamW from torch.cuda.amp import autocast, GradScaler optimizer = AdamW(model.parameters(), lr=2e-4) scaler = GradScaler() for batch in dataloader: # Mixed precision forward with autocast(dtype=torch.bfloat16): outputs = model(**batch) loss = outputs.loss # Backward with gradient scaling scaler.scale(loss).backward() scaler.step(optimizer) scaler.update() optimizer.zero_grad() ``` #### Multi-GPU Training (DDP) ```python import torch.distributed as dist from torch.nn.parallel import DistributedDataParallel # Initialize process group dist.init_process_group(backend="nccl") # Wrap model model = DistributedDataParallel( model, device_ids=[local_rank], find_unused_parameters=False ) ``` --- ## ⚙️ Konfigurasi Detail ### Full Configuration JSON ```json { "architectures": ["CacaForCausalLM"], "model_type": "caca", "vocab_size": 8000, "hidden_size": 128, "intermediate_size": 512, "num_hidden_layers": 6, "num_attention_heads": 4, "num_key_value_heads": 2, "head_dim": 32, "max_position_embeddings": 1024, "rope_theta": 10000, "rms_norm_eps": 1e-06, "use_cache": true, "use_qk_norm": true, "use_flash_attn": true, "attention_dropout": 0.0, "hidden_dropout": 0.1, "torch_dtype": "float16" } ``` ### Custom Configuration ```python from transformers import AutoConfig # Load dan modifikasi config config = AutoConfig.from_pretrained("Lyon28/caca-1M-untrained") # Custom modifications config.max_position_embeddings = 16384 # Extend context config.rope_scaling = {"type": "linear", "factor": 2.0} config.use_flash_attn = True config.hidden_dropout = 0.05 # Save custom config config.save_pretrained("./custom_config") ``` --- ## 🔬 Arsitektur

Layer Structure

**Input Tokens** ↓ **Embedding Layer** (vocab_size → hidden_size) ↓ **Decoder Block × N** - RMSNorm - Multi-Head Attention (GQA) - Flash Attention v2 - Query heads, KV heads - RoPE position encoding - Residual Connection - RMSNorm - Feed-Forward Network (SwiGLU) - Gate: hidden → intermediate - Up: hidden → intermediate - Down: intermediate → hidden - Residual Connection ↓ **RMSNorm (Final)** ↓ **LM Head** (hidden → vocab_size) ↓ **Output Logits**

### Attention Mechanism (GQA) ``` Query: [4 heads × 32 dim] = 128 Key: [2 heads × 32 dim] = 64 Value: [2 heads × 32 dim] = 64 Grouped Query Attention: - Setiap 2 query heads berbagi 1 KV head - Memory KV cache: 50% lebih kecil dari Multi-Head Attention - Kualitas mendekati MHA, speed mendekati MQA ``` ### Feed-Forward Network (SwiGLU) ``` FFN(x) = (SiLU(xW_gate) ⊙ xW_up) W_down Where: - W_gate: 128 × 512 - W_up: 128 × 512 - W_down: 512 × 128 - SiLU(x) = x · sigmoid(x) - ⊙ = element-wise multiplication ``` ## 💬 Format Chat & Prompt Engineering ### 📝 Chat Template Model mendukung format chat standar untuk conversational AI: ```python # Format chat template bawaan chat_template = """ {% for message in messages %} {% if message['role'] == 'system' %} System: {{ message['content'] }} {% elif message['role'] == 'user' %} User: {{ message['content'] }} {% elif message['role'] == 'assistant' %} Assistant: {{ message['content'] }} {% endif %} {% endfor %} {% if add_generation_prompt %}Assistant:{% endif %} """ # Contoh penggunaan messages = [ {"role": "system", "content": "Kamu adalah asisten AI yang membantu dan ramah."}, {"role": "user", "content": "Jelaskan tentang fotosintesis"}, {"role": "assistant", "content": "Fotosintesis adalah proses di mana tumbuhan mengubah cahaya matahari menjadi energi kimia..."}, {"role": "user", "content": "Apa manfaatnya bagi manusia?"}, ] # Apply template formatted = tokenizer.apply_chat_template( messages, tokenize=False, add_generation_prompt=True ) print(formatted) # Output: # System: Kamu adalah asisten AI yang membantu dan ramah. # # User: Jelaskan tentang fotosintesis # Assistant: Fotosintesis adalah proses di mana tumbuhan... # User: Apa manfaatnya bagi manusia? # Assistant: ``` --- ## 🎯 Use Cases Model ini dirancang untuk berbagai aplikasi NLP setelah melalui proses training: ### Text Generation - ✍️ Creative writing & storytelling - 📰 Article generation - 💬 Conversational AI - 🔄 Text completion ### Language Understanding - 📊 Text classification - 🏷️ Named Entity Recognition (NER) - ❓ Question Answering - 📝 Summarization ### Code Generation - 💻 Code completion - 🐛 Bug fixing suggestions - 📚 Documentation generation - 🔄 Code translation ### Multilingual Tasks - 🌏 Translation (ID ↔ EN) - 🗣️ Cross-lingual understanding - 🌐 Multilingual classification --- ## 📈 Benchmark & Evaluation > ⚠️ Model belum melalui evaluasi karena status untrained Setelah training, model akan dievaluasi pada: ### Indonesian Benchmarks - **IndoNLU**: Comprehensive Indonesian NLU tasks - **IndoQA**: Indonesian Question Answering - **IndoSum**: Summarization - **IndoNER**: Named Entity Recognition ### Multilingual Benchmarks - **MMLU**: Massive Multitask Language Understanding - **HellaSwag**: Common sense reasoning - **ARC**: Science QA - **TruthfulQA**: Truthfulness evaluation ### Generation Quality - **Perplexity**: Language modeling quality - **BLEU/ROUGE**: Translation & summarization - **Human Evaluation**: Fluency, coherence, factuality --- ## 🛠️ Development & Training Tips ### Optimal Batch Size ```python # Rule of thumb untuk 3.52M model # GPU Memory → Batch size per device if gpu_memory >= 80: # A100 80GB batch_size = 4539 gradient_accumulation = 1 elif gpu_memory >= 40: # A100 40GB batch_size = 2269 gradient_accumulation = 1 elif gpu_memory >= 24: # RTX 3090/4090 batch_size = 1 gradient_accumulation = 1 # Effective batch size = batch_size × gradient_accumulation × num_gpus ``` ### Learning Rate Scheduling ```python # Recommended untuk 3.52M model learning_rate = 0.0005 # Base LR warmup_ratio = 0.05 # 5% of total steps lr_scheduler = "cosine" # atau "linear" # Learning rate scaling rule: # LR ∝ sqrt(batch_size) # Untuk batch size 256: LR = 0.0005 # Untuk batch size 512: LR = 7.07e-04 ``` ### Gradient Clipping ```python # Prevent gradient explosion max_grad_norm = 1.0 # Clip at 1.0 # Monitor gradients from torch.nn.utils import clip_grad_norm_ grad_norm = clip_grad_norm_(model.parameters(), max_grad_norm) if grad_norm > 10.0: print(f"⚠️ High gradient norm: {grad_norm:.2f}") ``` ### Training Stability ```python # Tips untuk stable training: 1. **Warmup**: Mulai dengan LR rendah 2. **Gradient Checkpointing**: Kurangi memory footprint 3. **Mixed Precision**: Gunakan BF16 jika tersedia (lebih stable dari FP16) 4. **Batch Size**: Start small, increase gradually 5. **Monitor**: Track loss, perplexity, gradient norms ``` --- ## 🔧 Troubleshooting ### Out of Memory (OOM) ```python # Solusi OOM saat training: ✅ 1. Enable gradient checkpointing model.gradient_checkpointing_enable() ✅ 2. Reduce batch size per_device_train_batch_size = 1 ✅ 3. Increase gradient accumulation gradient_accumulation_steps = 32 ✅ 4. Use quantization load_in_8bit = True # atau load_in_4bit ✅ 5. Reduce sequence length max_length = 1024 # Start dengan ini ✅ 6. CPU offloading (jika perlu) device_map = "auto" offload_folder = "offload" ``` ### Slow Training ```python # Optimasi kecepatan training: ✅ 1. Flash Attention config.use_flash_attn = True # 2-3x speedup ✅ 2. Compile model (PyTorch 2.0+) model = torch.compile(model, mode="reduce-overhead") ✅ 3. DataLoader optimization dataloader = DataLoader( dataset, batch_size=batch_size, num_workers=4, # Parallel data loading pin_memory=True, # Faster GPU transfer prefetch_factor=2 ) ✅ 4. Mixed precision use_fp16 = True # atau bf16 ✅ 5. Optimize communication (multi-GPU) find_unused_parameters = False gradient_as_bucket_view = True ``` ### NaN Loss ```python # Jika loss menjadi NaN: ✅ 1. Reduce learning rate learning_rate = learning_rate * 0.1 ✅ 2. Check gradient norms clip_grad_norm_(model.parameters(), 1.0) ✅ 3. Use BF16 instead of FP16 torch_dtype = torch.bfloat16 # Lebih stable ✅ 4. Add epsilon to RMSNorm rms_norm_eps = 1e-5 # Increase jika perlu ✅ 5. Check data # Pastikan tidak ada inf/nan di dataset assert not torch.isnan(input_ids).any() assert not torch.isinf(attention_mask).any() ``` --- ### 🚫 Prohibited Uses

Model ini **TIDAK BOLEH** digunakan untuk: - 🚫 **Harmful content generation** (violence, self-harm, illegal acts) - 🚫 **Misinformation/disinformation campaigns** - 🚫 **Harassment or hate speech** - 🚫 **Impersonation or identity theft** - 🚫 **Child safety violations** (CSAM, grooming, exploitation) - 🚫 **Privacy violations** (doxxing, stalking, surveillance abuse) - 🚫 **Malicious code generation** (malware, exploits, etc) - 🚫 **Spam or manipulation** (fake reviews, astroturfing) - 🚫 **Medical/legal advice** (tanpa disclaimer & expert review) - 🚫 **Financial fraud** (scams, market manipulation) **Violation consequences:** Model access revocation + legal action jika applicable

--- ## 📚 References & Papers ### Core Architecture 1. **LLaMA** - [Touvron et al., 2023](https://arxiv.org/abs/2302.13971) - RMSNorm, RoPE, SwiGLU, GQA 2. **GPT-4** - [OpenAI Technical Report, 2023](https://arxiv.org/abs/2303.08774) - Mixture of Experts (speculated) 3. **Gemini** - [Google DeepMind, 2023](https://arxiv.org/abs/2312.11805) - Multimodal architecture, soft-capping 4. **Qwen** - [Alibaba Cloud, 2023](https://arxiv.org/abs/2309.16609) - YARN, long context 5. **Gemma** - [Google, 2024](https://arxiv.org/abs/2403.08295) - Layer scaling, normalization ### Advanced Techniques 6. **Flash Attention 2** - [Dao, 2023](https://arxiv.org/abs/2307.08691) 7. **Mixture-of-Depths** - [Raposo et al., 2024](https://arxiv.org/abs/2404.02258) 8. **StreamingLLM** - [Xiao et al., 2023](https://arxiv.org/abs/2309.17453) 9. **YARN** - [Peng et al., 2023](https://arxiv.org/abs/2309.00071) 10. **QLoRA** - [Dettmers et al., 2023](https://arxiv.org/abs/2305.14314) --- ## ⚠️ Known Limitations 1. **Training Cost** - MoE + Multimodal = expensive 2. **Complex Debugging** - Banyak fallback systems 3. **Memory Hungry** - Jika semua fitur enabled 4. **Dependency Hell** - Butuh flash-attn, xformers, bitsandbytes 5. **Expert Balancing** - MoE butuh careful tuning untuk load balancing --- ## 📜 License & Citation ### 📄 License

Model ini dirilis di bawah **Apache License 2.0** ✅ **Anda BEBAS untuk:** - ✔️ Gunakan secara komersial - ✔️ Modifikasi sesuka hati - ✔️ Distribusi ulang - ✔️ Patent use - ✔️ Private use ⚠️ **Dengan syarat:** - 📄 Include license & copyright notice - 📝 State changes yang dibuat - 📋 Disclaimer of warranty ❌ **Tanpa jaminan apapun** (use at your own risk)

**Full license text**: [Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0) ## 📖 Citation Jika Anda menggunakan model ini dalam penelitian, mohon sitasi: ```bibtex @misc{cacacaca1m, author = {Lyon}, title = {Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention}, year = {2026}, publisher = {Hugging Face}, journal = {Hugging Face Model Hub}, howpublished = {\url{https://huggingface.co/Lyon28/caca-1M-untrained}}, note = {Untrained model with 3,524,608 parameters} } ``` **APA Style:** ``` Lyon. (2026). Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention [Untrained model]. Hugging Face. https://huggingface.co/Lyon28/caca-1M-untrained ``` **MLA Style:** ``` Lyon. "Caca-caca-1M: Modern Transformer Architecture with Grouped Query Attention." Hugging Face, 2026, huggingface.co/Lyon28/caca-1M-untrained. ``` --- ### 🙏 Acknowledgments Model ini berdiri di pundak para raksasa! Terima kasih kepada:

🏛️ Klik untuk daftar lengkap acknowledgments

#### 🏗️ **Core Architecture** - **LLaMA/LLaMA 2** (Meta AI, 2023) - Decoder-only architecture, RMSNorm, SwiGLU - Paper: [LLaMA: Open and Efficient Foundation Language Models](https://arxiv.org/abs/2302.13971) - Authors: Hugo Touvron et al. - **GPT-3** (OpenAI, 2020) - Transformer language modeling paradigm - **PaLM** (Google, 2022) - SwiGLU activation insights #### 🎯 **Attention Mechanisms** - **Flash Attention v2** (Tri Dao et al., Stanford, 2023) - Paper: [FlashAttention-2: Faster Attention with Better Parallelism](https://arxiv.org/abs/2307.08691) - 3x speedup dengan IO-aware algorithm - **Grouped Query Attention** (Joshua Ainslie et al., Google, 2023) - Paper: [GQA: Training Generalized Multi-Query Transformer](https://arxiv.org/abs/2305.13245) - Memory-efficient KV cache - **Multi-Query Attention** (Noam Shazeer, Google, 2019) - Fast inference dengan shared K/V - **xFormers** (Meta AI, 2022) - Memory efficient attention - **PyTorch SDPA** (PyTorch Team, 2023) - Native attention optimization #### 📍 **Position Encodings** - **RoPE** (Jianlin Su et al., EleutherAI, 2021) - Paper: [RoFormer: Enhanced Transformer with Rotary Position Embedding](https://arxiv.org/abs/2104.09864) - Superior length extrapolation - **ALiBI** (Ofir Press et al., 2022) - Paper: [Train Short, Test Long: Attention with Linear Biases](https://arxiv.org/abs/2108.12409) - Length generalization without retraining - **YaRN** (Bowen Peng et al., 2023) - Paper: [YaRN: Efficient Context Window Extension](https://arxiv.org/abs/2309.00071) #### 🪟 **Long Context & Efficiency** - **Sliding Window Attention** (Albert Gu et al., Mistral AI, 2023) - Paper: [Mistral 7B](https://arxiv.org/abs/2310.06825) - **StreamingLLM** (Guangxuan Xiao et al., MIT, 2023) - Paper: [Efficient Streaming Language Models with Attention Sinks](https://arxiv.org/abs/2309.17453) - Infinite sequence length! - **Logit Softcapping** (Google Gemma Team, 2024) - Paper: [Gemma: Open Models Based on Gemini](https://arxiv.org/abs/2403.08295) #### 🧠 **Mixture of Experts** - **Mixtral 8x7B** (Albert Jiang et al., Mistral AI, 2024) - Paper: [Mixtral of Experts](https://arxiv.org/abs/2401.04088) - State-of-the-art sparse MoE - **Switch Transformers** (William Fedus et al., Google, 2021) - Paper: [Switch Transformers: Scaling to Trillion Parameter Models](https://arxiv.org/abs/2101.03961) - Expert scaling insights - **GLaM** (Nan Du et al., Google, 2021) - Generalist Language Model - **Expert Choice Routing** (Yanqi Zhou et al., Google, 2022) - Better load balancing #### 🎓 **Training Optimizations** - **Layer Scale** (Hugo Touvron et al., Meta, 2021) - Paper: [Going Deeper with Image Transformers](https://arxiv.org/abs/2103.17239) - Training stability untuk deep networks - **Stochastic Depth** (Gao Huang et al., 2016) - Paper: [Deep Networks with Stochastic Depth](https://arxiv.org/abs/1603.09382) - **Mixture of Depths** (David Raposo et al., DeepMind, 2024) - Paper: [Mixture-of-Depths: Dynamically allocating compute](https://arxiv.org/abs/2404.02258) - Dynamic compute allocation - **Gradient Checkpointing** (Tianqi Chen et al., 2016) #### 📦 **Quantization** - **LLM.int8()** (Tim Dettmers et al., 2022) - Paper: [LLM.int8(): 8-bit Matrix Multiplication for Transformers](https://arxiv.org/abs/2208.07339) - **QLoRA** (Tim Dettmers et al., 2023) - Paper: [QLoRA: Efficient Finetuning of Quantized LLMs](https://arxiv.org/abs/2305.14314) - 4-bit efficient fine-tuning - **bitsandbytes** (Tim Dettmers) - Quantization library #### 🎨 **Multimodal** - **Vision Transformer** (Alexey Dosovitskiy et al., Google, 2020) - Paper: [An Image is Worth 16x16 Words](https://arxiv.org/abs/2010.11929) - **Flamingo** (Jean-Baptiste Alayrac et al., DeepMind, 2022) - Paper: [Flamingo: a Visual Language Model](https://arxiv.org/abs/2204.14198) - Perceiver Resampler - **BLIP-2** (Junnan Li et al., Salesforce, 2023) - Paper: [BLIP-2: Bootstrapping Language-Image Pre-training](https://arxiv.org/abs/2301.12597) - Q-Former architecture - **Whisper** (Alec Radford et al., OpenAI, 2022) - Audio encoding #### 🛠️ **Normalization & Activations** - **RMSNorm** (Biao Zhang, Rico Sennrich, 2019) - Paper: [Root Mean Square Layer Normalization](https://arxiv.org/abs/1910.07467) - **SwiGLU** (Noam Shazeer, Google, 2020) - Paper: [GLU Variants Improve Transformer](https://arxiv.org/abs/2002.05202) #### 🔧 **Tools & Frameworks** - **🤗 Hugging Face** - Transformers, Accelerate, PEFT - Making NLP accessible to everyone - **PyTorch** - Deep learning framework - Facebook AI Research team - **Safetensors** - Secure serialization - Hugging Face team - **DeepSpeed** - Distributed training - Microsoft Research - **Flash Attention Implementation** - Tri Dao & team #### 🇮🇩 **Indonesian NLP Community** Special thanks to Indonesian NLP researchers & practitioners yang telah membangun foundation untuk Indonesian language AI.

--- ## 📄 License Model ini dirilis di bawah **Apache License 2.0**. ### Ketentuan Penggunaan: - ✅ **Bebas digunakan** untuk keperluan komersial dan non-komersial - ✅ **Modifikasi** diperbolehkan - ✅ **Distribusi** diperbolehkan dengan attribution - ⚠️ **No Warranty** - model disediakan "as is" - 📝 **Attribution Required** - sertakan copyright notice Lihat [LICENSE](LICENSE) untuk detail lengkap. --- ## 🤝 Contributing Kami sangat terbuka untuk kontribusi! Berikut cara Anda bisa berkontribusi: ### Training & Fine-tuning - 🎓 Train model ini dengan dataset Anda - 📊 Share benchmark results - 🔬 Experiment dengan hyperparameters ### Code & Architecture - 🐛 Report bugs atau issues - 💡 Suggest improvements - 🔧 Submit pull requests ### Documentation - 📚 Improve documentation - 🌐 Add translations - ✍️ Write tutorials & guides ### Dataset & Evaluation - 📝 Contribute training data - 🧪 Create evaluation benchmarks - 🎯 Share fine-tuned versions --- ## 👥 Team & Acknowledgments ### Core Team - **LyonPoy** - Architecture design & implementation ### Special Thanks - 🤗 **Hugging Face** - Infrastructure & community - ⚡ **FlashAttention Team** - Efficient attention implementation - 🧠 **Anthropic, Google, Meta, openAI, etc** - Research inspirations - Meta AI (LLaMA) - OpenAI (GPT series) - Google DeepMind (Gemini, Gemma) - Alibaba Cloud (Qwen) - HuggingFace (Transformers library) - Tri Dao (Flash Attention) - Tim Chen (bitsandbytes) ### Community Terima kasih kepada komunitas open-source yang telah berkontribusi pada: - Transformers library - PyTorch framework - Datasets & evaluation tools --- ## 📞 Contact & Support ### Community - 💬 [Discussions](https://huggingface.co/Lyon28/caca-1M-untrained/discussions) - Ask questions - 🐛 [Issues](https://github.com/Lyon-28/caca-transformers/issues) - Report bugs - 📧 Email : cacatransformers@gmail.com --- ## 🌟 Star History

[![Star History Chart](https://api.star-history.com/svg?repos=Lyon-28/caca-transformers&type=Date)](https://star-history.com/#Lyon-28/caca-transformers&Date)

## 💝 Dibuat dengan ❤️ untuk Komunitas AI Indonesia

### **Terima kasih telah menggunakan Caca!** Jika model ini berguna, jangan lupa ⭐ repository kami!

⭐
Star Repo
_{Show your support}

🔗
Share
_{Tell your friends}

💬
Join Discussion
_{Ask questions}

🤝
Contribute
_{Make it better}

### 🚀 Happy Training! 🚀 **Model ini menunggu untuk dilatih dan menjadi foundation untuk aplikasi AI Anda.** [📥 Download Model](#) • [📖 Read Docs](https://github.com/Lyon-28/caca-transformers) • [💬 Join Community](https://github.com/Lyon-28/caca-transformers)

--- ### 📊 Model Statistics

--- ### 🎨 Daily Inspiration

--- ### 📈 Quick Stats | Metric | Value | |--------|-------| | 💎 Total Parameters | 3,524,608 | | 🏗️ Layers | 6 | | 🎯 Attention Heads | 4 | | 📖 Max Context | 1,024 tokens | | 💾 Size (FP16) | 0.01 GB | | 💾 Size (INT4) | 0.00 GB | --- _{Model ini adalah bagian dari Caca Project - Open source initiative untuk membangun Indonesian LLM ecosystem.

Created with 💻 by @Lyon28 |
Licensed under Apache 2.0 |
Built with 🤗 HuggingFace}

**🌟 "Dari nol, untuk semua" 🌟** _{Last updated: january 2026} ---

_{Built with ❤️ by Caca Transformers Team}
_{Powered by 🤗 Transformers • ⚡ PyTorch • 🔥 Flash Attention}