---
title: README
emoji: 🐨
colorFrom: purple
colorTo: indigo
sdk: static
pinned: true
license: mit
---

# caca-AI

## 📋 Description

Caca is a next-generation Large Language Model (LLM) architecture that combines a range of state-of-the-art deep-learning techniques. The model is designed with a focus on efficiency, scalability, and high performance.

Caca is an open-source Indonesian LLM experiment, built from scratch, individually and incrementally. It is not trying to compete with anyone; it is simply an exploration of what can be done with a limited budget, unlimited passion, and a collaborative mindset. If it turns out to be useful to others, wonderful. If not, it is still fun.

This is an exploratory project, so if it fails, that is part of the learning process. If it succeeds, that is a bonus.

## 📊 Comparison with Other Architectures

The tables below compare Caca against LLaMA 2, Mistral, IndoGPT, and GPT-2.
πŸ—οΈ Arsitektur Dasar
Status ⚠️ Untrained βœ… Trained βœ… Trained βœ… Trained βœ… Trained
Ukuran Model 60+ variant
1M - 1T (semoga)
7B / 13B / 70B 7B 117M 117M - 1.5B
Tipe Arsitektur Decoder-only Decoder-only Decoder-only Decoder-only Decoder-only
Fungsi Aktivasi SwiGLU SwiGLU SwiGLU GELU GELU
Normalisasi RMSNorm RMSNorm RMSNorm LayerNorm LayerNorm
Tahun Release 2025 2023 2023 2020 2019
πŸ‘οΈ Mekanisme Attention
Tipe Attention GQA (configurable) GQA GQA MHA MHA
Position Encoding RoPE + variants RoPE RoPE Learned Learned
Max Context 8K - 16K 4K 32K 1K 1K
Sliding Window βœ… Optional ❌ βœ… 4K window ❌ ❌
Flash Attention βœ… Flash Attn 2 βœ… Supported βœ… Supported ❌ ❌
KV Cache Efficiency 75% reduction
(GQA 4:1)
~60% reduction 75% reduction No optimization No optimization
### 🚀 Advanced Features

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| Mixture of Experts | ✅ Optional (TopK + ExpertChoice) | ❌ | ❌ (Mixtral variant) | ❌ | ❌ |
| Multimodal | ✅ Native (Vision + Audio) | ❌ (LLaVA separate) | ❌ | ❌ | ❌ |
| Config flexibility | ✅ 50+ parameters, toggle every feature | ⚠️ Limited | ⚠️ Limited | ❌ Fixed | ❌ Fixed |
| Layer Scale | ✅ Optional | ❌ | ❌ | ❌ | ❌ |
| Stochastic Depth | ✅ Optional | ❌ | ❌ | ❌ | ❌ |
### ⚡ Performance & Optimization

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| Inference speed (7B model, A100) | ⚠️ TBD (not yet trained) | ~75 tok/s | ~78 tok/s | ~150 tok/s (much smaller model) | ~120 tok/s (much smaller model) |
| Memory footprint (7B, BF16) | ~14 GB (with GQA) | ~14 GB | ~14 GB | ~500 MB | ~500 MB |
| Gradient checkpointing | ✅ Full support | ✅ Supported | ✅ Supported | ⚠️ Manual | ⚠️ Manual |
| Quantization | ✅ 8-bit/4-bit built-in | ⚠️ Via external tools | ⚠️ Via external tools | ❌ Limited support | ❌ Limited support |
| Multi-backend support | ✅ 4 backends (Flash/xFormers/SDPA/Standard) | ⚠️ 2 backends | ⚠️ 2 backends | ❌ Standard only | ❌ Standard only |
### 🌏 Language Support

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| Indonesian | ⚠️ Not yet trained (designed for ID) | ❌ Poor (English-heavy) | ❌ Poor (English-heavy) | ✅ Native | ❌ Minimal |
| English | ⚠️ TBD (bilingual design) | ✅ Excellent | ✅ Excellent | ⚠️ Limited | ✅ Good |
| Training data | ⚠️ To be trained (user's choice) | 2T tokens (English-heavy) | Unknown (English-heavy) | 23 GB Indonesian | 40 GB WebText |
| Vocab size | 32K (configurable) | 32K | 32K | 50K | 50K |
### 👨‍💻 Developer Experience

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| Error messages | ✅ Helpful, with solutions and detailed debugging | ⚠️ Standard PyTorch | ⚠️ Standard PyTorch | ❌ Basic errors | ❌ Basic errors |
| Config validation | ✅ Comprehensive (auto-checks conflicts) | ⚠️ Basic | ⚠️ Basic | ❌ Minimal | ❌ Minimal |
| Documentation | ✅ Extensive (ID + EN, with examples) | ✅ Good (official docs) | ⚠️ Medium (community-driven) | ❌ Limited (minimal docs) | ✅ Extensive (OpenAI docs) |
| Code examples | ✅ 50+ examples (training to deployment) | ✅ Many examples | ⚠️ Some examples | ❌ Few examples | ✅ Many examples |
| HuggingFace integration | ✅ Full native (auto-registered) | ✅ Official | ✅ Official | ✅ Available | ✅ Standard |
### 🌍 Availability & License

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| License | ✅ Apache 2.0 (fully permissive) | ⚠️ LLaMA 2 License (commercial OK) | ✅ Apache 2.0 | ✅ MIT | ✅ MIT |
| Commercial use | ✅ Allowed (no restrictions) | ✅ Allowed | ✅ Allowed | ✅ Allowed | ✅ Allowed |
| Weights available | ❌ Not trained (architecture only) | ✅ All sizes (7B/13B/70B) | ✅ 7B | ✅ 117M | ✅ All sizes |
| Self-hosting | ✅ Designed for it (full control) | ✅ Yes | ✅ Yes | ✅ Yes | ✅ Yes |
| Training required | ❌ Yes (from scratch) | ✅ No (ready to use) | ✅ No (ready to use) | ✅ No (ready to use) | ✅ No (ready to use) |
### 🎯 Use Cases

| Feature | Caca | LLaMA 2 | Mistral | IndoGPT | GPT-2 |
|---|---|---|---|---|---|
| Production ready | ❌ Not yet (after training) | ✅ Yes | ✅ Yes | ⚠️ Limited (too small) | ⚠️ Limited (outdated) |
| Research | ✅ Excellent (modular design) | ✅ Good | ✅ Good | ⚠️ Limited | ✅ Classic baseline |
| Indonesian NLP | ⚠️ After training (high potential) | ❌ Poor (needs fine-tuning) | ❌ Poor (needs fine-tuning) | ✅ Native (but limited) | ❌ Poor |
| Education | ✅ Excellent (learn modern LLMs) | ✅ Good | ⚠️ Medium | ✅ Good (simple architecture) | ✅ Classic (well-documented) |
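The "75% reduction (GQA 4:1)" claim in the attention table can be sanity-checked with back-of-envelope arithmetic. The configuration below (32 layers, 32 query heads, head dim 128, 4K context) is a typical 7B-class setup chosen for illustration, not Caca's actual config:

```python
# Sketch: KV cache size under MHA vs. grouped-query attention (GQA).
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_param=2):
    # K and V each store [n_kv_heads, seq_len, head_dim] per layer (BF16 = 2 bytes).
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_param

n_layers, n_heads, head_dim, seq_len = 32, 32, 128, 4096

mha = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)       # MHA: one KV head per query head
gqa = kv_cache_bytes(n_layers, n_heads // 4, head_dim, seq_len)  # GQA 4:1: one KV head per 4 query heads

print(f"MHA KV cache: {mha / 2**30:.1f} GiB")   # -> 2.0 GiB
print(f"GQA KV cache: {gqa / 2**30:.1f} GiB")   # -> 0.5 GiB
print(f"Reduction: {1 - gqa / mha:.0%}")        # -> 75%
```

Sharing each KV head across 4 query heads shrinks only the cache (and KV projections), not the rest of the model, which is why the memory-footprint row still shows ~14 GB for weights.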

πŸ“ Catatan Penting:

  • Caca adalah arsitektur modern yang belum dilatih - perlu training dari nol dengan dataset Indonesian
  • LLaMA 2 & Mistral sangat bagus untuk English, tapi poor untuk Indonesian tanpa fine-tuning
  • IndoGPT adalah satu-satunya dedicated Indonesian LLM, tapi arsitektur sudah outdated (GPT-2 era)
  • GPT-2 dimasukkan sebagai baseline klasik - arsitektur yang sudah proven tapi tidak modern

## ✨ Caca's Unique Strengths

- 🎯 **Modular design**: toggle 50+ features without rewriting code
- 🔧 **Developer-friendly**: helpful error messages plus config validation
- 🚀 **Modern architecture**: GQA + Flash Attention + SwiGLU + RMSNorm
- 🎨 **Native multimodal**: vision & audio built in (not an add-on)
- 📚 **Extensive docs**: Indonesian + English, with many examples
- ⚡ **Optimization focus**: 4 attention backends, auto-fallback, quantization ready
- 🔬 **Research-oriented**: MoE, Mixture of Depths, Layer Scale, etc.
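For readers unfamiliar with the building blocks named above, here is a minimal NumPy sketch of RMSNorm and a SwiGLU feed-forward block. It is illustrative only, not Caca's actual implementation:

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Normalize by the root-mean-square of the features; no mean subtraction
    # and no bias -- the key simplification over LayerNorm.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def swiglu(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: SiLU(x @ W_gate) elementwise-gates (x @ W_up),
    # then the result is projected back to the model dimension.
    gate = x @ w_gate
    silu = gate / (1.0 + np.exp(-gate))  # SiLU (swish) activation
    return (silu * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.normal(size=(2, d_model))
y = rms_norm(x, np.ones(d_model))
z = swiglu(y, rng.normal(size=(d_model, d_ff)),
           rng.normal(size=(d_model, d_ff)),
           rng.normal(size=(d_ff, d_model)))
print(y.shape, z.shape)  # -> (2, 8) (2, 8)
```

Note that SwiGLU uses three weight matrices where a GELU MLP uses two, which is why SwiGLU models typically shrink `d_ff` to keep the parameter count comparable.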
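The "modular design" and "config validation" points can be pictured as a single config object that toggles features and checks itself before any model is built. Everything below, including the class and field names, is a hypothetical sketch, not Caca's real API:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    # Invented field names for illustration only.
    hidden_size: int = 512
    num_attention_heads: int = 8
    num_kv_heads: int = 2          # 4:1 GQA by default
    use_flash_attention: bool = True
    use_moe: bool = False
    use_layer_scale: bool = False

    def validate(self):
        # The kind of auto-check the comparison table credits Caca with:
        # fail early with an actionable message instead of a deep stack trace.
        if self.num_attention_heads % self.num_kv_heads != 0:
            raise ValueError(
                f"num_attention_heads ({self.num_attention_heads}) must be "
                f"divisible by num_kv_heads ({self.num_kv_heads}); "
                f"try num_kv_heads = 1, 2, 4, or 8."
            )
        return self

cfg = ModelConfig(use_moe=True).validate()
print(cfg.num_attention_heads // cfg.num_kv_heads)  # -> 4 (GQA ratio)
```

Enabling a feature then means flipping one flag and re-running validation, rather than editing model code.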

## ⚠️ Realistic Limitations

- ❌ Not yet trained: output will be random until the model is trained
- ❌ No tokenizer yet: an Indonesian tokenizer must be trained separately
- ❌ Heavy resource requirements: training a 7B model needs A100-class GPUs
- ❌ Untested: extensive evaluation is needed after training
- ❌ Small community: nowhere near the LLaMA/Mistral ecosystem

## 🔗 Links