---
language: id
license: apache-2.0
library_name: transformers
tags:
- pytorch
- safetensors
- deeplm
- bitnet
- moe
- mla
- mtp
- indonesian
pipeline_tag: text-generation
---

# Deeplm: 108M BitNet MoE Language Model

Deeplm is a language model with ~105M parameters that applies **BitNet b1.58 ternary quantization** from initialization. Its architecture is inspired by **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7**.

## 🏗️ Architecture

| Component | Detail |
|---|---|
| **Total Parameters** | ~104.7M |
| **Architecture** | Decoder-only Transformer |
| **Layers** | 10 |
| **Hidden Size** | 512 |
| **Vocab Size** | 32,000 (BPETokenizer) |
| **Max Seq Length** | 4,096 |
| **Attention Heads** | 8 (MQA, 1 KV head) |
| **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
| **Dtype** | float32 (weights quantized to ternary values) |

## ✨ Innovative Features

| Feature | Source | Notes |
|---|---|---|
| **MLA** | DeepSeek V4 | Multi-head Latent Attention, 24x KV cache compression |
| **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
| **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
| **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces the standard residual connection |
| **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
| **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
| **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
| **Curriculum Router** | Deeplm | Phase-based category weighting |
| **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |

## 📊 Model Specification

```json
{
  "architectures": ["DeeplmModel"],
  "model_type": "deeplm",
  "vocab_size": 32000,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "num_hidden_layers": 10,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "max_position_embeddings": 4096,
  "rms_norm_eps": 1e-06,
  "rope_theta": 50000.0,
  "rope_dim": 64,
  "tie_word_embeddings": true,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "q_lora_rank": 192,
  "kv_lora_rank": 64,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128,
  "mtp_depth": 2,
  "mtp_num_layers": 2,
  "bitnet_quantized": true,
  "bitnet_scale": "absmean"
}
```

## 🚀 Usage

### Inference

```python
import sys

import torch
from safetensors.torch import load_file

# Make the deeplm package importable; adjust this path to your checkout.
sys.path.insert(0, "deeplm")

from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Load config (default values)
config = DeeplmConfig()

# Build model
model = DeeplmModel(config)

# Load BitNet quantized weights.
# strict=False tolerates missing or unexpected keys instead of raising.
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict, strict=False)
model.eval()  # switch off training-only behavior such as dropout

# Generate
input_ids = torch.tensor([[1, 2, 3]])  # bos + tokens
output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)
```

### Training

```bash
# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

# Custom config
python train.py \
  --max_steps 100000 \
  --batch_size 4 \
  --seq_len 512 \
  --lr 3e-4 \
  --no_auto_tuner
```

## 📁 Project Structure

```
deeplm-108m/
├── config.json                  # Model config
├── generation_config.json       # Generation params
├── model.safetensors            # BitNet quantized weights (419MB)
├── tokenizer.json               # BPETokenizer
├── tokenizer_config.json        # Tokenizer config
├── train.py                     # Training script (all features)
├── init_model.py                # Model initialization script
├── deeplm_modal.py              # Modal.com build script
└── deeplm/                      # Source code
    ├── config.py                # Dataclass configs
    ├── model/
    │   ├── deeplm.py            # Main model
    │   ├── mla.py               # Multi-head Latent Attention
    │   ├── moe.py               # Mixture of Experts
    │   ├── hybrid_attention.py  # Softmax + Lightning
    │   ├── hyper_connections.py # Sinkhorn routing
    │   ├── mtp.py               # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py           # Training loop
    │   ├── auto_tuner.py        # Adaptive training controller
    │   ├── curriculum_router.py # Phase-based routing
    │   ├── data_pipeline.py     # Bucket dataset + sampler
    │   ├── logger.py            # SmartLogger + anomaly detection
    │   └── control/             # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py         # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py   # BitNet b1.58
        └── gguf_export.py
```

## 📈 Training

| Parameter | Value |
|---|---|
| **Dataset** | afrizalha/KamusOne-28M-Indonesian |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| **LR** | 6e-4 (cosine schedule, warmup=150) |
| **Batch Size** | 3 x grad_accum 2 = 6 effective |
| **Weight Decay** | 0.1 |
| **Max Grad Norm** | 1.0 |
| **Max Steps** | 31,250 |

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

The architecture is inspired by:

- **DeepSeek V4**: MLA, Hyper-Connections, MTP, MoE routing
- **Kimi K2.6**: Shared Expert, Agent Swarm
- **MiniMax M2.7**: Self-Evolution Framework, Hybrid Attention, Agent Harness
- **BitNet**: b1.58 ternary quantization
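
## 🧮 Appendix: absmean ternary quantization sketch

The model card names "BitNet b1.58 ternary {-1, 0, +1}, absmean" as the quantization scheme. The snippet below is a minimal, pure-Python sketch of that idea as it is usually described: divide each weight by the mean absolute value of the tensor, round to the nearest integer, and clip to [-1, 1]. The function names `absmean_ternary` and `dequantize` and the `eps` default are illustrative assumptions; this is not the code in `deeplm/quantization/bitnet_quantize.py`, which may differ in detail.

```python
from typing import List, Tuple


def absmean_ternary(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Quantize weights to {-1, 0, +1} with an absmean scale (illustrative sketch)."""
    # Scale = mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def dequantize(ternary: List[int], scale: float) -> List[float]:
    """Approximate the original weights from ternary values and the scale."""
    return [t * scale for t in ternary]


# Hypothetical example weights, chosen only for illustration.
weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.25]
q, gamma = absmean_ternary(weights)
approx = dequantize(q, gamma)

print(q)  # [1, 0, 1, -1, 0, 1]
```

Small weights collapse to 0 and larger ones to ±1, so each weight carries one of three values plus a single shared scale per tensor. That the checkpoint is still stored as float32 matches the Dtype row in the architecture table.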