
	---
	language: id
	license: apache-2.0
	library_name: transformers
	tags:
	- pytorch
	- safetensors
	- deeplm
	- bitnet
	- moe
	- mla
	- mtp
	- indonesian
	pipeline_tag: text-generation
	---

	# Deeplm — 108M BitNet MoE Language Model

	Deeplm is a language model with roughly 105M parameters that uses BitNet b1.58 ternary quantization from the start of training. Its architecture is inspired by DeepSeek V4, Kimi K2.6, and MiniMax M2.7.

	## 🏗️ Architecture

	| Component | Detail |
	|---|---|
	| Total Parameters | ~104.7M |
	| Architecture | Decoder-only Transformer |
	| Layers | 10 |
	| Hidden Size | 512 |
	| Vocab Size | 32,000 (BPETokenizer) |
	| Max Seq Length | 4,096 |
	| Attention Heads | 8 (MQA, 1 KV head) |
	| Quantization | BitNet b1.58 ternary {-1, 0, +1}, absmean |
	| Dtype | float32 (weights quantized to ternary values) |
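
The absmean ternary scheme named in the table can be sketched in a few lines. This is a minimal pure-Python illustration of the usual BitNet b1.58 formulation (scale by the mean absolute weight, round, clip to [-1, 1]); the function names and the `eps` value are assumptions for illustration and are not taken from `bitnet_quantize.py`.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} with an absmean scale."""
    # Per-tensor scale: mean absolute value of the weights.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate the original weights as ternary value times scale."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.45]
ternary, scale = absmean_ternary(weights)
print(ternary)                     # ternary codes, one per weight
print(dequantize(ternary, scale))  # coarse reconstruction of the weights
```

Small weights collapse to 0 and everything else keeps only its sign, so a tensor is described by its ternary codes plus a single float scale.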

	## ✨ Key Features

	| Feature | Source | Description |
	|---|---|---|
	| MLA | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
	| MoE | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
	| Hybrid Attention | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
	| Hyper-Connections | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
	| MTP | DeepSeek V4 | Multi-Token Prediction, depth=2 |
	| BitNet b1.58 | BitNet | Ternary quantization {-1, 0, +1} from initialization |
	| AutoTuner | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
	| Curriculum Router | Deeplm | Phase-based category weighting |
	| Self-Evolution | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
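
To make the "4 routed + 1 shared expert, top-k=2" row concrete, here is a toy sketch of how such a layer could combine experts for one token. It assumes a softmax router whose top-2 gates are renormalized to sum to 1 and a shared expert that always runs; the real `moe.py` may gate differently, and `route_token` and `moe_forward` are hypothetical names.

```python
import math


def route_token(router_logits, top_k=2):
    """Return (expert_index, gate_weight) pairs for the top-k routed experts."""
    # Softmax over the routed experts' logits.
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts and renormalize their gates to sum to 1.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    gate_sum = sum(probs[i] for i in chosen)
    return [(i, probs[i] / gate_sum) for i in chosen]


def moe_forward(x, router_logits, routed_experts, shared_expert, top_k=2):
    """Combine the always-on shared expert with the top-k routed experts."""
    out = shared_expert(x)
    for idx, gate in route_token(router_logits, top_k):
        out += gate * routed_experts[idx](x)
    return out


# Toy scalar "experts": routed expert i multiplies its input by (i + 1).
routed = [lambda x, k=i + 1: k * x for i in range(4)]
shared = lambda x: x
logits = [0.1, 2.0, -1.0, 2.0]
gates = route_token(logits)
result = moe_forward(1.0, logits, routed, shared)
print(gates, result)
```

Only two of the four routed experts run per token, which is what keeps the active parameter count below the total.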

	## 📊 Model Specification

	```json
	{
	  "architectures": ["DeeplmModel"],
	  "model_type": "deeplm",
	  "vocab_size": 32000,
	  "hidden_size": 512,
	  "intermediate_size": 2048,
	  "num_hidden_layers": 10,
	  "num_attention_heads": 8,
	  "num_key_value_heads": 1,
	  "max_position_embeddings": 4096,
	  "rms_norm_eps": 1e-06,
	  "rope_theta": 50000.0,
	  "rope_dim": 64,
	  "tie_word_embeddings": true,
	  "num_routed_experts": 4,
	  "num_shared_experts": 1,
	  "expert_topk": 2,
	  "q_lora_rank": 192,
	  "kv_lora_rank": 64,
	  "qk_rope_head_dim": 64,
	  "qk_nope_head_dim": 64,
	  "v_head_dim": 128,
	  "mtp_depth": 2,
	  "mtp_num_layers": 2,
	  "bitnet_quantized": true,
	  "bitnet_scale": "absmean"
	}
	```
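
The card does not show how the 24x KV cache compression figure is derived. The arithmetic below is one possible reading, using only values from the config above. It assumes the usual MLA layout, where each token caches one shared latent of size `kv_lora_rank` plus one shared RoPE key, and compares that with caching full per-head keys and values. Whether Deeplm counts it this way is an assumption.

```python
# Values copied from the config above.
num_heads = 8
qk_nope_head_dim = 64
qk_rope_head_dim = 64
v_head_dim = 128
kv_lora_rank = 64

# Uncompressed cache per token per layer: full K and V for every head.
full_k = num_heads * (qk_nope_head_dim + qk_rope_head_dim)
full_v = num_heads * v_head_dim
full_cache = full_k + full_v

# Assumed MLA cache per token per layer: shared latent plus shared RoPE key.
mla_cache = kv_lora_rank + qk_rope_head_dim

# Content-only comparison: leave the RoPE key out on both sides.
content_full = num_heads * (qk_nope_head_dim + v_head_dim)
content_mla = kv_lora_rank

print(full_cache, mla_cache, full_cache / mla_cache)
print(content_full, content_mla, content_full / content_mla)
```

Under these assumptions the content-only ratio is 1536 / 64 = 24, which matches the stated figure, while the ratio that also counts the RoPE key is 2048 / 128 = 16.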

	## 🚀 Usage

	### Inference

	```python
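	import sys

	import torch
	from safetensors.torch import load_file

	# Put the repository root (the directory that contains the deeplm/ package)
	# on the import path; adjust "deeplm" if your checkout lives elsewhere.
	sys.path.insert(0, "deeplm")
	from deeplm.config import DeeplmConfig
	from deeplm.model.deeplm import DeeplmModel

	# Build the model from the default config (assumed to match config.json)
	config = DeeplmConfig()
	model = DeeplmModel(config)

	# Load BitNet quantized weights. strict=False tolerates key mismatches,
	# so print them rather than letting a partial load pass silently.
	state_dict = load_file("model.safetensors")
	result = model.load_state_dict(state_dict, strict=False)
	print("missing keys:", result.missing_keys)
	print("unexpected keys:", result.unexpected_keys)
	model.eval()

	# Generate from placeholder token ids (BOS + tokens); encode real text
	# with tokenizer.json instead.
	input_ids = torch.tensor([[1, 2, 3]])
	with torch.no_grad():
	    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)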
	```

	### Training

	```bash
	# Install dependencies
	pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

	# Train with all features
	python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

	# Custom config
	python train.py \
	  --max_steps 100000 \
	  --batch_size 4 \
	  --seq_len 512 \
	  --lr 3e-4 \
	  --no_auto_tuner
	```

	## 📁 Project Structure

	```
	deeplm-108m/
	├── config.json                  # Model config
	├── generation_config.json       # Generation params
	├── model.safetensors            # BitNet quantized weights (419MB)
	├── tokenizer.json               # BPETokenizer
	├── tokenizer_config.json        # Tokenizer config
	├── train.py                     # Training script (all features)
	├── init_model.py                # Model initialization script
	├── deeplm_modal.py              # Modal.com build script
	└── deeplm/                      # Source code
	    ├── config.py                # Dataclass configs
	    ├── model/
	    │   ├── deeplm.py            # Main model
	    │   ├── mla.py               # Multi-head Latent Attention
	    │   ├── moe.py               # Mixture of Experts
	    │   ├── hybrid_attention.py  # Softmax + Lightning
	    │   ├── hyper_connections.py # Sinkhorn routing
	    │   ├── mtp.py               # Multi-Token Prediction
	    │   └── transformer_block.py
	    ├── training/
	    │   ├── trainer.py           # Training loop
	    │   ├── auto_tuner.py        # Adaptive training controller
	    │   ├── curriculum_router.py # Phase-based routing
	    │   ├── data_pipeline.py     # Bucket dataset + sampler
	    │   ├── logger.py            # SmartLogger + anomaly detection
	    │   └── control/             # TrainingControl plane
	    ├── self_evolution/
	    │   └── framework.py         # Autonomous evolution loop
	    └── quantization/
	        ├── bitnet_quantize.py   # BitNet b1.58
	        └── gguf_export.py
	```
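
The 419MB size listed for `model.safetensors` is consistent with storing every parameter as float32, as the Dtype row states. The short check below reproduces that number from the ~104.7M parameter count. The two packed sizes are hypothetical: they show what a compact ternary encoding could reach if every parameter were ternary, which is an upper bound on the savings since embeddings and norms typically stay in higher precision.

```python
import math

total_params = 104.7e6  # "~104.7M" from the architecture table

# Current checkpoint: 4 bytes per parameter (float32).
bytes_fp32 = total_params * 4
print(f"float32: {bytes_fp32 / 1e6:.0f} MB")

# Hypothetical compact encodings of ternary weights.
bytes_2bit = total_params * 2 / 8                # 2 bits per weight
bytes_ideal = total_params * math.log2(3) / 8    # ~1.58 bits per weight
print(f"2-bit packed: {bytes_2bit / 1e6:.1f} MB")
print(f"log2(3)-bit bound: {bytes_ideal / 1e6:.1f} MB")
```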

	## 📈 Training

	| Parameter | Value |
	|---|---|
	| Dataset | afrizalha/KamusOne-28M-Indonesian |
	| Optimizer | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
	| LR | 6e-4 (cosine, warmup=150) |
	| Batch Size | 3 × 2 gradient accumulation steps = 6 effective |
	| Weight Decay | 0.1 |
	| Max Grad Norm | 1.0 |
	| Max Steps | 31,250 |
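
The schedule in the table (peak LR 6e-4, cosine, warmup=150, 31,250 steps) can be sketched as follows. This is an illustration, not the code in `trainer.py`: it assumes the warmup is linear and measured in optimizer steps, and that the cosine decays to a floor of 0, none of which the table states.

```python
import math

PEAK_LR = 6e-4
WARMUP_STEPS = 150
MAX_STEPS = 31_250
MIN_LR = 0.0  # assumption: the card does not state a floor


def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to MIN_LR at MAX_STEPS."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, MAX_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for step in (0, 149, 150, 15_700, 31_250):
    print(step, f"{lr_at(step):.2e}")

# Effective batch size and total sequences processed over the run.
effective_batch = 3 * 2
total_sequences = effective_batch * MAX_STEPS
print(effective_batch, total_sequences)
```

With an effective batch of 6, the full 31,250 steps process 187,500 sequences.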

	## 📄 License

	Apache 2.0

	## 🙏 Acknowledgments

	The architecture is inspired by:
	- DeepSeek V4 — MLA, Hyper-Connections, MTP, MoE routing
	- Kimi K2.6 — Shared Expert, Agent Swarm
	- MiniMax M2.7 — Self-Evolution Framework, Hybrid Attention, Agent Harness
	- BitNet — b1.58 ternary quantization