Deeplm — 108M BitNet MoE Language Model

Deeplm is a language model with ~105M parameters that applies BitNet b1.58 ternary quantization from initialization. Its architecture is inspired by DeepSeek V4, Kimi K2.6, and MiniMax M2.7.

🏗️ Architecture

Component	Detail
Total Parameters	~104.7M
Architecture	Decoder-only Transformer
Layers	10
Hidden Size	512
Vocab Size	32,000 (BPETokenizer)
Max Seq Length	4,096
Attention Heads	8 (MQA, 1 KV head)
Quantization	BitNet b1.58 ternary {-1, 0, +1}, absmean
Dtype	float32 (weights quantized to ternary values)
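
As a rough illustration of the "ternary {-1, 0, +1}, absmean" row, the sketch below applies the absmean recipe described for BitNet b1.58 to a flat list of weights: divide by the mean absolute value, round, then clip to [-1, 1]. The function name `absmean_ternary` and the `eps` default are illustrative assumptions; the Deeplm source may quantize per tensor or per group and differ in detail.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} using an absmean scale.

    Hypothetical helper for illustration; not taken from the Deeplm source.
    """
    # Scale = mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
ternary, scale = absmean_ternary(weights)

# Approximate reconstruction: each ternary value times the shared scale.
dequantized = [t * scale for t in ternary]

print(ternary)          # [1, 0, 1, -1, 0, -1]
print(round(scale, 4))  # 0.5283
```

Small weights collapse to 0 and large ones saturate at ±1, so only the sign pattern and a single scale per weight group survive quantization.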

✨ Innovative Features

Feature	Source	Description
MLA	DeepSeek V4	Multi-head Latent Attention, KV cache compression 24x
MoE	DeepSeek V4 + Kimi K2.6	4 routed + 1 shared expert, top-k=2
Hybrid Attention	MiniMax M2.7	Softmax + Lightning v2 linear attention
Hyper-Connections	DeepSeek V4	Sinkhorn routing, replaces standard residual connections
MTP	DeepSeek V4	Multi-Token Prediction, depth=2
BitNet b1.58	BitNet	Ternary quantization {-1, 0, +1} from initialization
AutoTuner	Deeplm	Adaptive LR, gradient norm (GN), weight decay (WD), momentum, revive, trajectory prediction
Curriculum Router	Deeplm	Phase-based category weighting
Self-Evolution	MiniMax M2.7	Autonomous hypothesis → experiment → decision loop
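
To make the MoE row concrete ("4 routed + 1 shared expert, top-k=2"), here is a minimal single-token sketch using scalars in place of tensors. It assumes softmax gating with the selected gate weights renormalized to sum to 1, and a shared expert that always runs. Those choices, along with the names `moe_forward`, `routed_experts`, and `shared_expert`, are assumptions for illustration and may not match `deeplm/model/moe.py`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, routed_experts, shared_expert, top_k=2):
    """Combine the shared expert with the top_k routed experts for one token."""
    probs = softmax(router_logits)
    # Indices of the top_k experts by router probability.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize the selected gate weights so they sum to 1 (assumption).
    norm = sum(probs[i] for i in top)
    out = shared_expert(x)  # the shared expert is always active
    for i in top:
        out += (probs[i] / norm) * routed_experts[i](x)
    return out, top


# Toy experts: routed expert k multiplies its input by k; the shared one halves it.
routed_experts = [lambda x, k=k: k * x for k in (1, 2, 3, 4)]
shared_expert = lambda x: 0.5 * x

out, top = moe_forward(1.0, [0.1, 2.0, -1.0, 1.0], routed_experts, shared_expert)
print(top)  # [1, 3]
print(out)
```

Only two of the four routed experts execute per token, so compute per token is lower than the total parameter count suggests.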

📊 Model Specification

{
  "architectures": ["DeeplmModel"],
  "model_type": "deeplm",
  "vocab_size": 32000,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "num_hidden_layers": 10,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "max_position_embeddings": 4096,
  "rms_norm_eps": 1e-06,
  "rope_theta": 50000.0,
  "rope_dim": 64,
  "tie_word_embeddings": true,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "q_lora_rank": 192,
  "kv_lora_rank": 64,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128,
  "mtp_depth": 2,
  "mtp_num_layers": 2,
  "bitnet_quantized": true,
  "bitnet_scale": "absmean"
}
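
A few quantities follow directly from this config. The sketch below parses a subset of the fields and derives them. It assumes the DeepSeek-style MLA convention in which each query/key head concatenates a non-rotary part (`qk_nope_head_dim`) and a rotary part (`qk_rope_head_dim`), and it assumes shared experts are active for every token. Neither assumption has been checked against the Deeplm source.

```python
import json

# Subset of config.json, copied from the specification above.
config = json.loads("""
{
  "num_attention_heads": 8,
  "rope_dim": 64,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128
}
""")

# Per-head query/key width: non-rotary part + rotary part (assumed convention).
qk_head_dim = config["qk_nope_head_dim"] + config["qk_rope_head_dim"]

# Experts that exist versus experts that run for each token.
total_experts = config["num_routed_experts"] + config["num_shared_experts"]
active_experts = config["expert_topk"] + config["num_shared_experts"]

# The two rotary-dimension fields should agree.
rope_consistent = config["rope_dim"] == config["qk_rope_head_dim"]

print(qk_head_dim)                    # 128
print(active_experts, total_experts)  # 3 5
print(rope_consistent)                # True
```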

🚀 Usage

Inference

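import sys

# Make the `deeplm` package importable; adjust this path so it points at
# the directory that contains the `deeplm/` package.
sys.path.insert(0, "deeplm")
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel
from safetensors.torch import load_file
import torch

# Load config (default values)
config = DeeplmConfig()

# Build model
model = DeeplmModel(config)

# Load BitNet quantized weights. strict=False tolerates key mismatches,
# so print them instead of ignoring them silently.
state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)
model.eval()  # switch off training-only behavior such as dropout

# Generate
input_ids = torch.tensor([[1, 2, 3]])  # placeholder IDs: bos + prompt tokens
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)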
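
For readers unfamiliar with the `temperature=0.7` argument, the following standalone sketch shows the usual meaning of temperature sampling: logits are divided by the temperature before the softmax, so values below 1 sharpen the distribution. It is not the implementation of `model.generate`; the function name `sample_with_temperature` and its `rng` parameter are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample one token index from logits after temperature scaling."""
    # Dividing by a temperature below 1 widens the gaps between logits.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw a single index according to the resulting probabilities.
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs


logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # seeded for repeatable draws
token_id, probs = sample_with_temperature(logits, temperature=0.7, rng=rng)
print(token_id, [round(p, 3) for p in probs])
```

At temperature 1.0 the same logits give a flatter distribution, and as the temperature approaches 0 sampling approaches greedy decoding.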

Training

# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

# Custom config
python train.py \
  --max_steps 100000 \
  --batch_size 4 \
  --seq_len 512 \
  --lr 3e-4 \
  --no_auto_tuner

📁 Project Structure

deeplm-108m/
├── config.json              # Model config
├── generation_config.json   # Generation params
├── model.safetensors        # BitNet quantized weights (419MB)
├── tokenizer.json           # BPETokenizer
├── tokenizer_config.json    # Tokenizer config
├── train.py                 # Training script (all features)
├── init_model.py            # Model initialization script
├── deeplm_modal.py          # Modal.com build script
└── deeplm/                  # Source code
    ├── config.py            # Dataclass configs
    ├── model/
    │   ├── deeplm.py        # Main model
    │   ├── mla.py           # Multi-head Latent Attention
    │   ├── moe.py           # Mixture of Experts
    │   ├── hybrid_attention.py  # Softmax + Lightning
    │   ├── hyper_connections.py # Sinkhorn routing
    │   ├── mtp.py           # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py       # Training loop
    │   ├── auto_tuner.py    # Adaptive training controller
    │   ├── curriculum_router.py  # Phase-based routing
    │   ├── data_pipeline.py # Bucket dataset + sampler
    │   ├── logger.py        # SmartLogger + anomaly detection
    │   └── control/         # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py     # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py  # BitNet b1.58
        └── gguf_export.py
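
The 419MB size listed for `model.safetensors` is consistent with storing ~104.7M parameters as float32. The arithmetic below checks that and, as a rough upper bound on the savings, estimates the size if every parameter were packed as a ternary value. This overstates the benefit if embeddings and normalization weights are kept at higher precision, which is common for BitNet-style models. The helper name `megabytes` is illustrative.

```python
import math

PARAMS = 104.7e6  # total parameters, from the architecture table


def megabytes(bits_per_param, params=PARAMS):
    """Storage in decimal megabytes for a given bit width per parameter."""
    return params * bits_per_param / 8 / 1e6


fp32_mb = megabytes(32)            # current float32 checkpoint
two_bit_mb = megabytes(2)          # ternary packed at 2 bits per weight
ideal_mb = megabytes(math.log2(3)) # information-theoretic limit, ~1.58 bits

print(f"float32: {fp32_mb:.1f} MB")
print(f"2-bit packed: {two_bit_mb:.1f} MB")
print(f"log2(3)-bit ideal: {ideal_mb:.1f} MB")
```

Under these assumptions the float32 file comes to about 419 MB, while a packed ternary export would land somewhere around 21 to 26 MB.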

📈 Training

Parameter	Value
Dataset	afrizalha/KamusOne-28M-Indonesian
Optimizer	AdamW (β1=0.9, β2=0.95, ε=1e-8)
LR	6e-4 (cosine, warmup=150)
Batch Size	3 x grad_accum=2 = 6 effective
Weight Decay	0.1
Max Grad Norm	1.0
Max Steps	31,250
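
The learning-rate row ("6e-4 (cosine, warmup=150)") can be sketched as a function of the step index. The version below assumes linear warmup over the first 150 steps and cosine decay to a floor of 0 at step 31,250. The warmup shape, the floor, and the function name `lr_at` are assumptions, and the schedule in `trainer.py` may differ, particularly when the AutoTuner is active.

```python
import math


def lr_at(step, peak_lr=6e-4, warmup=150, max_steps=31250, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step in (0, 149, 150, 15700, 31250):
    print(step, f"{lr_at(step):.2e}")

# Effective batch size from the table: 3 sequences x 2 accumulation steps.
effective_batch = 3 * 2
```

Step 15,700 is the midpoint of the decay phase, where the rate is half the peak.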

📄 License

Apache 2.0

🙏 Acknowledgments

The architecture is inspired by:

DeepSeek V4 — MLA, Hyper-Connections, MTP, MoE routing
Kimi K2.6 — Shared Expert, Agent Swarm
MiniMax M2.7 — Self-Evolution Framework, Hybrid Attention, Agent Harness
BitNet — b1.58 ternary quantization
