---
language: id
license: apache-2.0
library_name: transformers
tags:
- pytorch
- safetensors
- deeplm
- bitnet
- moe
- mla
- mtp
- indonesian
pipeline_tag: text-generation
---

# Deeplm — 108M BitNet MoE Language Model

Deeplm is a language model with ~105M parameters that applies **BitNet b1.58 ternary quantization** from initialization. Its architecture is inspired by **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7**.

## 🏗️ Architecture

| Component | Detail |
|---|---|
| **Total Parameters** | ~104.7M |
| **Architecture** | Decoder-only Transformer |
| **Layers** | 10 |
| **Hidden Size** | 512 |
| **Vocab Size** | 32,000 (BPETokenizer) |
| **Max Seq Length** | 4,096 |
| **Attention Heads** | 8 (MQA, 1 KV head) |
| **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
| **Dtype** | float32 (weights quantized to ternary values) |
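
As a rough illustration of the absmean scheme named in the table, the sketch below follows the published BitNet b1.58 formulation: divide a weight matrix by the mean of its absolute values, round, and clip to {-1, 0, +1}. The function names and the `eps` value are illustrative assumptions and are not taken from `deeplm/quantization/bitnet_quantize.py`.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize a 2-D list of floats to ternary values plus one scale.

    Assumption: a single per-matrix scale, as in the BitNet b1.58 paper.
    """
    flat = [w for row in weights for w in row]
    scale = sum(abs(w) for w in flat) / len(flat)  # absmean
    quantized = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return quantized, scale


def dequantize(quantized, scale):
    """Map ternary values back to floats: W is approximated by W_q * scale."""
    return [[q * scale for q in row] for row in quantized]


example = [[0.9, -0.05, -1.2], [0.3, 0.0, 0.6]]
ternary, scale = absmean_ternary(example)
approx = dequantize(ternary, scale)
```

Small weights collapse to 0 and large ones saturate at ±1, so all magnitude information lives in the single `scale` value.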

## ✨ Innovative Features

| Feature | Source | Description |
|---|---|---|
| **MLA** | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
| **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
| **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
| **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
| **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
| **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
| **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
| **Curriculum Router** | Deeplm | Phase-based category weighting |
| **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
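
The MoE row above (4 routed experts, 1 shared expert, top-k=2) can be sketched as follows. This is a minimal scalar illustration, assuming softmax router scores that are renormalized over the selected experts and a shared expert that always runs. The actual gating in `deeplm/model/moe.py` may differ, and `route_top_k` and `moe_forward` are hypothetical names.

```python
import math


def route_top_k(router_logits, k=2):
    """Return {expert_index: gate} for the k highest-scoring routed experts."""
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    # Assumption: gates are renormalized so the selected experts sum to 1.
    return {i: probs[i] / norm for i in top}


def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    """Combine the always-on shared expert with the top-k routed experts."""
    gates = route_top_k(router_logits, k)
    routed = sum(gate * routed_experts[i](x) for i, gate in gates.items())
    return shared_expert(x) + routed
```

With this layout each token touches 3 of the 5 expert networks: the shared expert plus 2 of the 4 routed ones.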

## 📊 Model Specification

```json
{
  "architectures": ["DeeplmModel"],
  "model_type": "deeplm",
  "vocab_size": 32000,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "num_hidden_layers": 10,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "max_position_embeddings": 4096,
  "rms_norm_eps": 1e-06,
  "rope_theta": 50000.0,
  "rope_dim": 64,
  "tie_word_embeddings": true,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "q_lora_rank": 192,
  "kv_lora_rank": 64,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128,
  "mtp_depth": 2,
  "mtp_num_layers": 2,
  "bitnet_quantized": true,
  "bitnet_scale": "absmean"
}
```
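
A few quantities follow directly from the fields above. The sketch below derives them from a copy of the relevant keys. It assumes the DeepSeek-style convention that each query/key head concatenates a non-rotary part and a rotary part, which the config does not state explicitly. `derived_sizes` is a hypothetical helper.

```python
import json

# Subset of the config shown above, copied verbatim.
CONFIG_SNIPPET = """
{
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64
}
"""


def derived_sizes(cfg):
    """Compute values implied by the config under the stated assumption."""
    return {
        # Assumption: per-head query/key width = non-rotary part + rotary part.
        "qk_head_dim": cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"],
        # Experts evaluated per token: top-k routed plus every shared expert.
        "active_experts_per_token": cfg["expert_topk"] + cfg["num_shared_experts"],
        # Fraction of routed experts that fire for a given token.
        "routed_fraction": cfg["expert_topk"] / cfg["num_routed_experts"],
    }


sizes = derived_sizes(json.loads(CONFIG_SNIPPET))
```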

## 🚀 Usage

### Inference

```python
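import sys

import torch
from safetensors.torch import load_file

# Make the local `deeplm` package importable.
# Assumes the script is run from the repository root.
sys.path.insert(0, ".")
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Build the model from the default config
config = DeeplmConfig()
model = DeeplmModel(config)

# Load the BitNet-quantized weights. strict=False tolerates key mismatches,
# so print the result to see any missing or unexpected keys.
state_dict = load_file("model.safetensors")
print(model.load_state_dict(state_dict, strict=False))
model.eval()  # switch to inference mode

# Generate from raw token IDs (BOS followed by prompt tokens)
input_ids = torch.tensor([[1, 2, 3]])
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)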
```
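
For readers unfamiliar with the `temperature=0.7` argument, the sketch below shows the standard convention of dividing logits by the temperature before the softmax. It assumes `model.generate` follows that convention, which has not been checked against the Deeplm source. The helper names are hypothetical.

```python
import math
import random


def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

A temperature below 1.0, such as the 0.7 used above, puts more probability on the highest-scoring tokens than plain softmax does.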

### Training

```bash
# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

# Custom config
python train.py \
  --max_steps 100000 \
  --batch_size 4 \
  --seq_len 512 \
  --lr 3e-4 \
  --no_auto_tuner
```

## 📁 Project Structure

```
deeplm-108m/
├── config.json              # Model config
├── generation_config.json   # Generation params
├── model.safetensors        # BitNet quantized weights (419MB)
├── tokenizer.json           # BPETokenizer
├── tokenizer_config.json    # Tokenizer config
├── train.py                 # Training script (all features)
├── init_model.py            # Model initialization script
├── deeplm_modal.py          # Modal.com build script
└── deeplm/                  # Source code
    ├── config.py            # Dataclass configs
    ├── model/
    │   ├── deeplm.py        # Main model
    │   ├── mla.py           # Multi-head Latent Attention
    │   ├── moe.py           # Mixture of Experts
    │   ├── hybrid_attention.py  # Softmax + Lightning
    │   ├── hyper_connections.py # Sinkhorn routing
    │   ├── mtp.py           # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py       # Training loop
    │   ├── auto_tuner.py    # Adaptive training controller
    │   ├── curriculum_router.py  # Phase-based routing
    │   ├── data_pipeline.py # Bucket dataset + sampler
    │   ├── logger.py        # SmartLogger + anomaly detection
    │   └── control/         # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py     # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py  # BitNet b1.58
        └── gguf_export.py
```
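
The 419MB listed for `model.safetensors` can be cross-checked against the parameter count. The sketch below assumes that MB means 10^6 bytes, that every tensor is stored as float32, and that the safetensors header is negligible. It also estimates the size if the ternary weights were packed at the theoretical log2(3) bits each, which is an idealized figure and not something this repository claims to ship.

```python
import math


def params_from_file_size(size_bytes, bytes_per_param=4):
    """Estimate the parameter count from a checkpoint stored at a fixed width."""
    return size_bytes / bytes_per_param


def ideal_ternary_size(num_params):
    """Bytes needed if every weight took log2(3) bits (about 1.58)."""
    return num_params * math.log2(3) / 8


n_params = params_from_file_size(419e6)
packed_bytes = ideal_ternary_size(n_params)
print(f"Estimated parameters: {n_params / 1e6:.2f}M")
print(f"Ideal packed ternary size: {packed_bytes / 1e6:.1f} MB")
```

Under these assumptions the estimate agrees with the ~104.7M figure in the architecture table, and it shows that float32 storage does not yet realize the size savings of ternary weights.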

## 📈 Training

| Parameter | Value |
|---|---|
| **Dataset** | afrizalha/KamusOne-28M-Indonesian |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| **LR** | 6e-4 (cosine, warmup=150) |
| **Batch Size** | 3 per step × 2 gradient accumulation steps = 6 effective |
| **Weight Decay** | 0.1 |
| **Max Grad Norm** | 1.0 |
| **Max Steps** | 31,250 |
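
The learning-rate row above can be sketched as a function of the step. This is a minimal version assuming linear warmup from zero and cosine decay to a minimum of zero. The table gives only the peak rate, the schedule type, and the warmup length, so the warmup shape and `min_lr` are assumptions, and `lr_at` is a hypothetical name.

```python
import math


def lr_at(step, peak_lr=6e-4, warmup_steps=150, max_steps=31250, min_lr=0.0):
    """Learning rate at a given optimizer step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Assumption: warmup rises linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


schedule = [lr_at(s) for s in (0, 150, 15700, 31250)]
```

With 150 warmup steps out of 31,250, the rate reaches its peak after less than 0.5% of training and spends the rest of the run decaying.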

## 📄 License

Apache 2.0

## 🙏 Acknowledgments

The architecture is inspired by:
- **DeepSeek V4** — MLA, Hyper-Connections, MTP, MoE routing
- **Kimi K2.6** — Shared Expert, Agent Swarm
- **MiniMax M2.7** — Self-Evolution Framework, Hybrid Attention, Agent Harness
- **BitNet** — b1.58 ternary quantization