---
language: id
license: apache-2.0
library_name: transformers
tags:
- pytorch
- safetensors
- deeplm
- bitnet
- moe
- mla
- mtp
- indonesian
pipeline_tag: text-generation
---
# Deeplm: 108M BitNet MoE Language Model
Deeplm is a language model of roughly 105M parameters that applies **BitNet b1.58 ternary quantization** from initialization, with an architecture inspired by **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7**.
## 🏗️ Architecture
| Component | Detail |
|---|---|
| **Total Parameters** | ~104.7M |
| **Architecture** | Decoder-only Transformer |
| **Layers** | 10 |
| **Hidden Size** | 512 |
| **Vocab Size** | 32,000 (BPETokenizer) |
| **Max Seq Length** | 4,096 |
| **Attention Heads** | 8 (MQA, 1 KV head) |
| **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
| **Dtype** | float32 (weights quantized to ternary) |
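
The table lists absmean-scaled ternary quantization. As a rough illustration of that scheme, the sketch below scales a weight list by its mean absolute value, rounds, and clips to {-1, 0, +1}. It is plain Python on a flat list; the function names and the per-tensor scale are assumptions made for this example, and the actual code in `deeplm/quantization/bitnet_quantize.py` may differ.

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, +1} with an absmean scale."""
    # The scale gamma is the mean absolute value of the weights.
    gamma = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]
    return ternary, gamma


def dequantize(ternary, gamma):
    """Approximate reconstruction: each ternary value times the shared scale."""
    return [t * gamma for t in ternary]


weights = [0.9, -0.05, -1.2, 0.3]
ternary, gamma = absmean_ternary(weights)
print(ternary)          # [1, 0, -1, 0]
print(round(gamma, 4))  # 0.6125
```

Small weights collapse to 0 and large ones saturate at ±1, so the float32 tensor ends up holding only three distinct values times one scale.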
## ✨ Innovative Features
| Feature | Source | Description |
|---|---|---|
| **MLA** | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
| **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
| **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
| **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
| **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
| **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
| **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
| **Curriculum Router** | Deeplm | Phase-based category weighting |
| **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
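
To make the MoE row concrete (4 routed experts, 1 shared expert, top-k=2), here is a minimal routing sketch in plain Python. The names `route_token` and `moe_output` are hypothetical, the experts are toy scalar functions, and renormalizing the kept gates to sum to 1 is an assumption of this sketch; the README does not describe how `deeplm/model/moe.py` computes its gates.

```python
import math


def route_token(router_logits, top_k=2):
    """Return {expert_index: gate_weight} for the top_k routed experts."""
    # Softmax over the routed experts' logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the top_k experts by probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the kept gates sum to 1 (an assumption of this sketch).
    kept = sum(probs[i] for i in chosen)
    return {i: probs[i] / kept for i in chosen}


def moe_output(x, routed_experts, shared_expert, router_logits, top_k=2):
    """Shared expert output plus the gate-weighted routed expert outputs."""
    gates = route_token(router_logits, top_k)
    routed = sum(g * routed_experts[i](x) for i, g in gates.items())
    return shared_expert(x) + routed


# Toy scalar "experts" standing in for feed-forward blocks.
routed_experts = [lambda x, k=k: k * x for k in (1.0, 2.0, 3.0, 4.0)]
shared_expert = lambda x: 0.5 * x

gates = route_token([0.1, 2.0, -1.0, 2.0])
print(sorted(gates))  # [1, 3]
print(moe_output(1.0, routed_experts, shared_expert, [0.1, 2.0, -1.0, 2.0]))  # 3.5
```

The shared expert runs for every token regardless of the router, while only two of the four routed experts contribute.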
## 📊 Model Specification
```json
{
"architectures": ["DeeplmModel"],
"model_type": "deeplm",
"vocab_size": 32000,
"hidden_size": 512,
"intermediate_size": 2048,
"num_hidden_layers": 10,
"num_attention_heads": 8,
"num_key_value_heads": 1,
"max_position_embeddings": 4096,
"rms_norm_eps": 1e-06,
"rope_theta": 50000.0,
"rope_dim": 64,
"tie_word_embeddings": true,
"num_routed_experts": 4,
"num_shared_experts": 1,
"expert_topk": 2,
"q_lora_rank": 192,
"kv_lora_rank": 64,
"qk_rope_head_dim": 64,
"qk_nope_head_dim": 64,
"v_head_dim": 128,
"mtp_depth": 2,
"mtp_num_layers": 2,
"bitnet_quantized": true,
"bitnet_scale": "absmean"
}
```
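
The feature table quotes a 24x KV cache compression for MLA. The README does not say how that figure was computed; the arithmetic below is one reading of the config values that reproduces it, by comparing the per-token, per-layer cache of full per-head keys (non-rotary part) and values against a single latent of size `kv_lora_rank`. Treat the accounting itself as an assumption.

```python
# Values copied from the config above.
num_attention_heads = 8
qk_nope_head_dim = 64
qk_rope_head_dim = 64
v_head_dim = 128
kv_lora_rank = 64

# Per token and per layer, uncompressed attention would cache the non-rotary
# key part and the value for every head.
uncompressed = num_attention_heads * (qk_nope_head_dim + v_head_dim)  # 1536

# MLA caches one shared latent vector of size kv_lora_rank instead.
latent = kv_lora_rank  # 64

print(uncompressed // latent)  # 24

# If the rotary key part is counted on both sides, the ratio is lower.
full_per_head = qk_nope_head_dim + qk_rope_head_dim + v_head_dim
with_rope = (num_attention_heads * full_per_head) / (kv_lora_rank + qk_rope_head_dim)
print(with_rope)  # 16.0
```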
## 🚀 Usage
### Inference
```python
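import sys
sys.path.insert(0, "deeplm")

import torch
from safetensors.torch import load_file

from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel

# Load the default config and build the model
config = DeeplmConfig()
model = DeeplmModel(config)

# Load BitNet quantized weights; strict=False tolerates key mismatches,
# so report them instead of ignoring them silently
state_dict = load_file("model.safetensors")
result = model.load_state_dict(state_dict, strict=False)
print("missing keys:", result.missing_keys)
print("unexpected keys:", result.unexpected_keys)

# Generate without tracking gradients
model.eval()
input_ids = torch.tensor([[1, 2, 3]])  # bos + tokens
with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)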
```
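
The `temperature=0.7` argument above controls how sharply sampling favors high-scoring tokens. The README does not show how `model.generate` applies it, so the following is only a generic sketch of temperature sampling in plain Python; `sample_with_temperature` is a hypothetical helper, not part of the Deeplm code.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample one token id from logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by a temperature below 1 widens the gaps between logits.
    scaled = [x / temperature for x in logits]
    # Numerically stable softmax.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Draw one index according to the resulting distribution.
    token_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token_id, probs


logits = [2.0, 1.0, 0.1]
_, sharp = sample_with_temperature(logits, temperature=0.7, rng=random.Random(0))
_, flat = sample_with_temperature(logits, temperature=1.0, rng=random.Random(0))
print(sharp[0] > flat[0])  # True
```

At 0.7 the most likely token gets more probability mass than at 1.0, which makes output less varied.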
### Training
```bash
# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors
# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250
# Custom config
python train.py \
--max_steps 100000 \
--batch_size 4 \
--seq_len 512 \
--lr 3e-4 \
--no_auto_tuner
```
## 📁 Project Structure
```
deeplm-108m/
├── config.json                  # Model config
├── generation_config.json       # Generation params
├── model.safetensors            # BitNet quantized weights (419MB)
├── tokenizer.json               # BPETokenizer
├── tokenizer_config.json        # Tokenizer config
├── train.py                     # Training script (all features)
├── init_model.py                # Model initialization script
├── deeplm_modal.py              # Modal.com build script
└── deeplm/                      # Source code
    ├── config.py                # Dataclass configs
    ├── model/
    │   ├── deeplm.py            # Main model
    │   ├── mla.py               # Multi-head Latent Attention
    │   ├── moe.py               # Mixture of Experts
    │   ├── hybrid_attention.py  # Softmax + Lightning
    │   ├── hyper_connections.py # Sinkhorn routing
    │   ├── mtp.py               # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py           # Training loop
    │   ├── auto_tuner.py        # Adaptive training controller
    │   ├── curriculum_router.py # Phase-based routing
    │   ├── data_pipeline.py     # Bucket dataset + sampler
    │   ├── logger.py            # SmartLogger + anomaly detection
    │   └── control/             # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py         # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py   # BitNet b1.58
        └── gguf_export.py
```
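
The 419MB size listed for `model.safetensors` matches the stated parameter count stored as float32, which a quick calculation confirms. The packed-ternary figure at the end is a hypothetical lower bound that assumes every parameter is ternary; the README does not claim the repo ships such a format.

```python
import math

total_params = 104.7e6  # "Total Parameters" from the architecture table
bytes_per_param = 4     # float32 storage

size_mb = total_params * bytes_per_param / 1e6
print(round(size_mb))  # 419

# Hypothetical: log2(3) bits per weight if all weights were packed as ternary.
packed_mb = total_params * math.log2(3) / 8 / 1e6
print(round(packed_mb, 1))  # 20.7
```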
## 📈 Training
| Parameter | Value |
|---|---|
| **Dataset** | afrizalha/KamusOne-28M-Indonesian |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| **LR** | 6e-4 (cosine, warmup=150) |
| **Batch Size** | 3 x grad_accum=2 = 6 effective |
| **Weight Decay** | 0.1 |
| **Max Grad Norm** | 1.0 |
| **Max Steps** | 31,250 |
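
The learning-rate row (6e-4, cosine, warmup=150) can be sketched as a schedule function. The linear warmup shape, the decay to a minimum of 0, and the name `lr_at` are assumptions of this example; the table only gives the peak value, the schedule type, and the warmup length.

```python
import math


def lr_at(step, peak_lr=6e-4, warmup_steps=150, max_steps=31_250, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


print(lr_at(149))     # peak value, reached at the end of warmup
print(lr_at(31_250))  # fully decayed

# Effective batch size from the table: micro-batch 3 with 2 accumulation steps.
effective_batch = 3 * 2
print(effective_batch)  # 6
```

If `max_steps` counts optimizer steps, the run processes 6 × 31,250 = 187,500 sequences in total.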
## 📄 License
Apache 2.0
## 🙏 Acknowledgments
The architecture is inspired by:
- **DeepSeek V4**: MLA, Hyper-Connections, MTP, MoE routing
- **Kimi K2.6**: Shared Expert, Agent Swarm
- **MiniMax M2.7**: Self-Evolution Framework, Hybrid Attention, Agent Harness
- **BitNet**: b1.58 ternary quantization