Deeplm: 108M BitNet MoE Language Model

Deeplm is a language model with ~105M parameters that uses BitNet b1.58 ternary quantization from initialization. Its architecture is inspired by DeepSeek V4, Kimi K2.6, and MiniMax M2.7.

πŸ—οΈ Arsitektur

| Component | Detail |
| --- | --- |
| Total Parameters | ~104.7M |
| Architecture | Decoder-only Transformer |
| Layers | 10 |
| Hidden Size | 512 |
| Vocab Size | 32,000 (BPETokenizer) |
| Max Seq Length | 4,096 |
| Attention Heads | 8 (MQA, 1 KV head) |
| Quantization | BitNet b1.58 ternary {-1, 0, +1}, absmean |
| Dtype | float32 (weights quantized to ternary values) |
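
The absmean scheme named in the table can be illustrated with a minimal sketch. It follows the published BitNet b1.58 formulation (scale by the mean absolute weight, round, then clip to [-1, 1]). The function names and the `eps` value are assumptions for illustration and are not taken from Deeplm's own `bitnet_quantize.py`.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} with one absmean scale."""
    # gamma is the mean absolute value of the weights (the "absmean" scale).
    gamma = sum(abs(w) for w in weights) / len(weights)
    # Scale, round to the nearest integer, then clip into [-1, 1].
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


def dequantize(ternary, gamma):
    """Approximate the original weights as ternary value times scale."""
    return [t * gamma for t in ternary]


weights = [0.4, -1.2, 0.05, 0.9]
ternary, gamma = absmean_ternary(weights)
print(ternary, gamma)
```

Because each weight keeps only its sign (or becomes zero), the float32 tensor in the checkpoint can hold just three distinct values per scale, which is how a float32 dtype and ternary weights coexist.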

✨ Innovative Features

| Feature | Source | Description |
| --- | --- | --- |
| MLA | DeepSeek V4 | Multi-head Latent Attention, 24x KV cache compression |
| MoE | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
| Hybrid Attention | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
| Hyper-Connections | DeepSeek V4 | Sinkhorn routing, replacing standard residual connections |
| MTP | DeepSeek V4 | Multi-Token Prediction, depth=2 |
| BitNet b1.58 | BitNet | Ternary quantization {-1, 0, +1} from initialization |
| AutoTuner | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
| Curriculum Router | Deeplm | Phase-based category weighting |
| Self-Evolution | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
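
The MoE row (4 routed experts, 1 shared expert, top-k=2) can be sketched with scalar toy experts. This is a minimal illustration and not the code in `moe.py`: the softmax gate, the renormalization of the two selected gate weights, and the names `moe_forward` and `make_scaling_expert` are all assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, router_logits, routed_experts, shared_expert, top_k=2):
    """Combine the top-k routed experts with an always-on shared expert."""
    probs = softmax(router_logits)
    # Indices of the top_k highest-probability routed experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Assumption: renormalize the selected gate weights so they sum to 1.
    gate_sum = sum(probs[i] for i in chosen)
    routed_out = sum(probs[i] / gate_sum * routed_experts[i](x) for i in chosen)
    # The shared expert processes every token regardless of routing.
    return shared_expert(x) + routed_out, chosen


def make_scaling_expert(scale):
    """Toy expert that multiplies its input by a constant."""
    return lambda x: scale * x


routed = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
shared = make_scaling_expert(0.5)

output, chosen = moe_forward(2.0, [0.1, 2.0, 0.3, 2.0], routed, shared)
print(chosen, output)
```

Only two of the four routed experts run per token, so the routed compute per token is roughly half of what a dense layer with all four experts would need, while the shared expert gives every token a common path.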

📊 Model Specification

{
  "architectures": ["DeeplmModel"],
  "model_type": "deeplm",
  "vocab_size": 32000,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "num_hidden_layers": 10,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "max_position_embeddings": 4096,
  "rms_norm_eps": 1e-06,
  "rope_theta": 50000.0,
  "rope_dim": 64,
  "tie_word_embeddings": true,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "q_lora_rank": 192,
  "kv_lora_rank": 64,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128,
  "mtp_depth": 2,
  "mtp_num_layers": 2,
  "bitnet_quantized": true,
  "bitnet_scale": "absmean"
}
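
The config values allow a rough check of the 24x KV-cache compression figure quoted for MLA. The sketch below is an inference, not something the source states: it assumes the ratio compares, per token and per layer, the uncompressed per-head no-RoPE keys and values against a cached latent of width `kv_lora_rank`. Under that assumption the numbers match. Counting the RoPE key on both sides gives a smaller ratio.

```python
# Values copied from the config above.
num_attention_heads = 8
qk_nope_head_dim = 64
qk_rope_head_dim = 64
v_head_dim = 128
kv_lora_rank = 64

# Per token and per layer: uncompressed no-RoPE keys plus values for all heads.
uncompressed = num_attention_heads * (qk_nope_head_dim + v_head_dim)

# Assumption: only the compressed latent is cached for these components.
ratio_latent_only = uncompressed / kv_lora_rank

# Stricter accounting: the RoPE key part is counted on both sides.
full_keys_values = num_attention_heads * (
    qk_nope_head_dim + qk_rope_head_dim + v_head_dim
)
ratio_with_rope = full_keys_values / (kv_lora_rank + qk_rope_head_dim)

print(ratio_latent_only, ratio_with_rope)
```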

🚀 Usage

Inference

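import sys
sys.path.insert(0, "deeplm")
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel
from safetensors.torch import load_file
import torch

# Load config (default values)
config = DeeplmConfig()

# Build model and switch to inference mode
model = DeeplmModel(config)
model.eval()

# Load BitNet quantized weights.
# strict=False ignores mismatched keys, so report them instead of failing silently.
state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)
print("missing:", result.missing_keys, "unexpected:", result.unexpected_keys)

# Generate (no gradients are needed at inference time)
input_ids = torch.tensor([[1, 2, 3]])  # BOS + token IDs
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)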
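
To make the `temperature=0.7` argument concrete, here is a minimal sketch of temperature sampling over a toy logit vector. How `DeeplmModel.generate` applies temperature internally is not shown in the source, so `temperature_probs` and `sample_token` are hypothetical helpers that only illustrate the general technique.

```python
import math
import random


def temperature_probs(logits, temperature=0.7):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature=0.7, rng=random):
    """Draw one token index in proportion to the tempered probabilities."""
    probs = temperature_probs(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.0]
print(temperature_probs(logits, 1.0))
print(temperature_probs(logits, 0.7))
print(sample_token(logits, rng=random.Random(0)))
```

A temperature below 1 sharpens the distribution toward the highest-scoring token, so 0.7 trades some diversity for more predictable output.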

Training

# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

# Custom config
python train.py \
  --max_steps 100000 \
  --batch_size 4 \
  --seq_len 512 \
  --lr 3e-4 \
  --no_auto_tuner

πŸ“ Struktur Project

deeplm-108m/
├── config.json              # Model config
├── generation_config.json   # Generation params
├── model.safetensors        # BitNet quantized weights (419MB)
├── tokenizer.json           # BPETokenizer
├── tokenizer_config.json    # Tokenizer config
├── train.py                 # Training script (all features)
├── init_model.py            # Model initialization script
├── deeplm_modal.py          # Modal.com build script
└── deeplm/                  # Source code
    ├── config.py            # Dataclass configs
    ├── model/
    │   ├── deeplm.py        # Main model
    │   ├── mla.py           # Multi-head Latent Attention
    │   ├── moe.py           # Mixture of Experts
    │   ├── hybrid_attention.py  # Softmax + Lightning
    │   ├── hyper_connections.py # Sinkhorn routing
    │   ├── mtp.py           # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py       # Training loop
    │   ├── auto_tuner.py    # Adaptive training controller
    │   ├── curriculum_router.py  # Phase-based routing
    │   ├── data_pipeline.py # Bucket dataset + sampler
    │   ├── logger.py        # SmartLogger + anomaly detection
    │   └── control/         # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py     # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py  # BitNet b1.58
        └── gguf_export.py
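
The 419MB checkpoint size is consistent with storing roughly 104.7M parameters as float32. The quick check below also estimates a hypothetical packed size at log2(3) ≈ 1.58 bits per weight. That second figure is an illustration only and assumes every parameter is ternary, which the source does not state.

```python
import math

total_params = 104.7e6  # "~104.7M" from the architecture table
bytes_per_float32 = 4

float32_mb = total_params * bytes_per_float32 / 1e6

# Hypothetical: information-theoretic size if every weight were packed ternary.
bits_per_ternary = math.log2(3)
packed_mb = total_params * bits_per_ternary / 8 / 1e6

print(f"float32: {float32_mb:.1f} MB, packed ternary: {packed_mb:.1f} MB")
```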

📈 Training

| Parameter | Value |
| --- | --- |
| Dataset | afrizalha/KamusOne-28M-Indonesian |
| Optimizer | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| LR | 6e-4 (cosine, warmup=150) |
| Batch Size | 3 x grad_accum=2 = 6 effective |
| Weight Decay | 0.1 |
| Max Grad Norm | 1.0 |
| Max Steps | 31,250 |
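
The learning-rate row (peak 6e-4, cosine, warmup=150, 31,250 steps) can be sketched as a schedule function. The source gives only those values, so the linear warmup shape, the decay floor `min_lr=0.0`, the reading of the warmup as 150 optimizer steps, and the name `lr_at` are assumptions. The actual schedule in `trainer.py` may differ, particularly when the AutoTuner is enabled.

```python
import math


def lr_at(step, peak_lr=6e-4, warmup_steps=150, max_steps=31250, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Ramp linearly so the peak is reached on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 149, 150, 15700, 31250):
    print(step, lr_at(step))
```

With a micro-batch of 3 and 2 accumulation steps, each optimizer step in this schedule corresponds to 6 sequences.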

📄 License

Apache 2.0

πŸ™ Acknowledgments

The architecture is inspired by:

  • DeepSeek V4: MLA, Hyper-Connections, MTP, MoE routing
  • Kimi K2.6: Shared Expert, Agent Swarm
  • MiniMax M2.7: Self-Evolution Framework, Hybrid Attention, Agent Harness
  • BitNet: b1.58 ternary quantization