Text Generation
Transformers
Safetensors
PyTorch
Indonesian
deeplm
bitnet
Mixture of Experts
mla
mtp
indonesian
Instructions to use samcheng0/deeplm-108m with libraries, inference providers, notebooks, and local apps.

- Libraries: Transformers
- Notebooks: Google Colab, Kaggle
- Local Apps: vLLM, SGLang, Docker Model Runner

How to use samcheng0/deeplm-108m with Transformers:

    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-generation", model="samcheng0/deeplm-108m")

    # Load model directly
    from transformers import AutoModel

    model = AutoModel.from_pretrained("samcheng0/deeplm-108m", dtype="auto")

How to use samcheng0/deeplm-108m with vLLM:

Install from pip and serve model

    # Install vLLM from pip:
    pip install vllm

    # Start the vLLM server:
    vllm serve "samcheng0/deeplm-108m"

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:8000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "samcheng0/deeplm-108m",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'

Use Docker

    docker model run hf.co/samcheng0/deeplm-108m

How to use samcheng0/deeplm-108m with SGLang:

Install from pip and serve model

    # Install SGLang from pip:
    pip install sglang

    # Start the SGLang server:
    python3 -m sglang.launch_server \
      --model-path "samcheng0/deeplm-108m" \
      --host 0.0.0.0 \
      --port 30000

    # Call the server using curl (OpenAI-compatible API):
    curl -X POST "http://localhost:30000/v1/completions" \
      -H "Content-Type: application/json" \
      --data '{
        "model": "samcheng0/deeplm-108m",
        "prompt": "Once upon a time,",
        "max_tokens": 512,
        "temperature": 0.5
      }'

Use Docker images

    docker run --gpus all \
      --shm-size 32g \
      -p 30000:30000 \
      -v ~/.cache/huggingface:/root/.cache/huggingface \
      --env "HF_TOKEN=<secret>" \
      --ipc=host \
      lmsysorg/sglang:latest \
      python3 -m sglang.launch_server \
        --model-path "samcheng0/deeplm-108m" \
        --host 0.0.0.0 \
        --port 30000

The server started this way accepts the same curl request shown above for the pip install.

How to use samcheng0/deeplm-108m with Docker Model Runner:

    docker model run hf.co/samcheng0/deeplm-108m
Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -3,251 +3,182 @@ language: id
|
|
| 3 |
license: apache-2.0
|
| 4 |
library_name: transformers
|
| 5 |
tags:
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
| 11 |
-
|
| 12 |
-
|
| 13 |
-
|
| 14 |
-
|
| 15 |
-
- deeplm
|
| 16 |
-
base_model: "none"
|
| 17 |
---
|
| 18 |
|
| 19 |
-
# Deeplm
|
| 20 |
|
| 21 |
-
|
| 22 |
|
| 23 |
-
|
| 24 |
|
| 25 |
-
|
| 26 |
-
|
| 27 |
-
|
|
| 28 |
-
|
|
| 29 |
-
| **
|
| 30 |
-
| **
|
| 31 |
-
| **
|
| 32 |
-
| **
|
| 33 |
-
| **
|
| 34 |
-
| **
|
| 35 |
-
| **
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
| 43 |
-
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
-
##
|
| 53 |
|

| | |
|-----------|--------|
| **Total Parameters** | 104,747,048 (~105M) |
| **Vocabulary** | 32,000 (BBPE) |
| **Layers** | 10 Transformer blocks |
| **Hidden Size** | 512 |
| **Feed-Forward** | 2048 (SwiGLU, 4× hidden) |
| **Attention Heads** | 8 query heads, 1 KV head (MQA) |
| **Head Dim** | 128 (64 RoPE + 64 NoPE) |
| **Max Seq Length** | 4096 |
| **RoPE Theta** | 50,000 |
| **Attention** | MLA (Multi-head Latent Attention) |
| **FFN** | MoE (4 routed + 1 shared experts, top-k=2) |
| **Residual** | Hyper-Connections with Sinkhorn routing |
| **Hybrid Attention** | 3 softmax + 7 Lightning layers |
| **Prediction** | MTP (Multi-Token Prediction, depth=2, 2 MTP layers) |
| **Self-Evolution** | Autonomous research loop (100+ rounds) |
| **Embeddings** | Tied (shared between input/output) |
| **AutoTuner** | Adaptive energy-based optimizer scheduler |
| **Dtype** | float32 (Hyper-Connections stability) |

### Key Innovations

<details>
<summary>Click to expand architecture details</summary>

### 1. Multi-head Latent Attention (MLA) - *DeepSeek V4 / Kimi K2.6*
- Q compressed: hidden → q_lora_rank(192) → LayerNorm → q_up(8 × 128)
- KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
- Entire KV cache per token: just 128 dims (64 latent + 64 rope), **~8× smaller** than standard MHA
- Decoupled RoPE applied only to the 64-dim k_pe; the content path stays RoPE-free
- Absorption trick pre-computes W_UK @ W_UV for faster inference
- MQA-style: KV decomposed once, expanded to all query heads

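The cache-size claim in the MLA list can be checked with a little arithmetic. The sketch below assumes the comparison baseline is standard multi-head attention with 8 heads of 64 dimensions each (hidden size 512 split evenly across heads); the source does not state which baseline it uses, so treat that as an assumption.

```python
# Per-token KV cache size: MLA latent cache vs. a standard MHA baseline.
# Assumption (not stated in the source): the MHA baseline has 8 heads
# of 64 dims each, i.e. hidden size 512 split evenly across heads.

def mla_cache_dims(kv_latent: int, k_rope: int) -> int:
    """MLA caches one shared latent vector plus one shared RoPE key per token."""
    return kv_latent + k_rope


def mha_cache_dims(num_heads: int, head_dim: int) -> int:
    """Standard MHA caches one key and one value vector for every head."""
    return 2 * num_heads * head_dim


mla = mla_cache_dims(kv_latent=64, k_rope=64)
mha = mha_cache_dims(num_heads=8, head_dim=64)
ratio = mha / mla
print(f"MLA: {mla} dims, MHA: {mha} dims, reduction: {ratio:.0f}x")
```

Under a different baseline (for example 128-dim heads) the ratio changes, so the figure depends on what "standard MHA" is taken to mean.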
### 2. Mixture of Experts (MoE) - *DeepSeek V4 / Kimi K2.6*
- 4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
- Top-k=2 routing: each token activates only 2 routed experts, plus the shared expert
- **sqrt(softplus(x))** scoring for numerical stability (DeepSeek V4)
- **Bias-based load balancing** (no auxiliary loss, no gradient interference)
- Per-expert routing bias auto-updates to balance token assignments
- SwiGLU activation in every expert (fused gate+up projection)
- Expert affinity memory tracks token-expert history

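The routing bullets above can be sketched in plain Python. This is a minimal illustration, not the repository's implementation: the fixed bias step of 0.01, the sign-based update rule, and normalising gate weights over the selected experts are all assumptions made for the example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route(logits, bias, top_k=2):
    """Select top_k experts by (score + bias); gate weights use the unbiased scores."""
    scores = [math.sqrt(softplus(x)) for x in logits]
    ranked = sorted(range(len(scores)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[e] for e in chosen)
    return {e: scores[e] / total for e in chosen}


def update_bias(bias, counts, step=0.01):
    """Lower the bias of overloaded experts and raise it for underloaded ones."""
    mean = sum(counts) / len(counts)
    return [b - step if c > mean else b + step if c < mean else b
            for b, c in zip(bias, counts)]


logits = [2.0, 1.0, -1.0, 0.5]  # hypothetical router outputs for one token
bias = [0.0, 0.0, 0.0, 0.0]     # one routing bias per routed expert
gates = route(logits, bias)
new_bias = update_bias(bias, counts=[10, 4, 1, 5])  # hypothetical token counts
print(gates, new_bias)
```

Because the bias only affects which experts are selected and never enters the gate weights, it steers load without adding a term to the loss, which is the point of the "no auxiliary loss" bullet.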
### 3. Hyper-Connections with Sinkhorn Routing - *DeepSeek V4*
- Replaces standard residual connections with learned routing
- 4 connection types: **identity**, **transform**, **gate**, **skip**
- Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
- Input-dependent routing via gating network
- Type-specific learnable biases initialized per config
- Pre-LayerNorm on layer output before routing

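Sinkhorn-Knopp normalization, as named in the list above, alternates row and column normalisation until a positive matrix is approximately doubly stochastic. The sketch below runs the 2 iterations mentioned in the source on a hypothetical 4×4 matrix of routing logits; the matrix shape and values are assumptions, and how the resulting weights are applied to the four connection types is not shown in the source.

```python
import math


def sinkhorn(logits, iterations=2):
    """Alternate row and column normalisation on exp(logits)."""
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iterations):
        # Make every row sum to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Make every column sum to 1 (rows drift slightly away from 1 again).
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / col_sums[j] for j, v in enumerate(row)] for row in m]
    return m


# Hypothetical routing logits, one row per connection type.
logits = [
    [0.5, 0.1, -0.3, 0.0],
    [0.0, 0.4, 0.2, -0.1],
    [-0.2, 0.0, 0.6, 0.1],
    [0.1, -0.1, 0.0, 0.3],
]
weights = sinkhorn(logits, iterations=2)
row_sums = [sum(row) for row in weights]
col_sums = [sum(col) for col in zip(*weights)]
print(row_sums, col_sums)
```

With only two iterations the columns sum to 1 exactly (up to rounding) and the rows only approximately, which is usually close enough for well-conditioned logits.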
### 4. Hybrid Attention - *MiniMax M2.7*
- 3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
- 7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 **50/50 blend**
- LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
- Incremental KV state for efficient autoregressive generation
- ReLU/Swish activation replaces softmax in linear path

### 5. Multi-Token Prediction (MTP) - *DeepSeek V4*
- 2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
- Projection block: Linear → LayerNorm → GELU → Linear + residual skip
- RoPE positional encoding on reduced dim (hidden/4) for efficiency
- Tied LM head shares parameters with main embedding layer
- Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
- Loss weight: 0.3 × cross-entropy of future token predictions

### 6. Self-Evolution Framework - *MiniMax M2.7 / Deeplm*
- Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
- 100+ autonomous optimization rounds per training cycle
- 3 feedback chain episodes for meta-learning

### 7. AutoTuner - *Deeplm custom*
- Energy-based adaptive hyperparameter controller
- Phase-aware dynamics (warmup → exploration → balanced → exploitation)
- Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
- Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
- Gradient noise scale monitoring
- Cosine similarity for gradient direction tracking
- Layer health monitoring with per-group gradient ratios
- Failure-aware rollback with revive mechanism
- Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
- Dual-window trajectory predictor: regime change detection, convergence estimation

</details>

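The multi-timescale loss EMAs listed for the AutoTuner can be sketched as follows. This assumes the three numbers (0.9, 0.98, 0.995) are decay factors and that each average is seeded with the first observed loss; neither detail is confirmed by the source, and the class name is hypothetical.

```python
class LossEMAs:
    """Track one loss signal at several smoothing horizons."""

    def __init__(self, decays=None):
        # Assumed decay factors, taken from the short/med/long values in the list above.
        self.decays = decays or {"short": 0.9, "med": 0.98, "long": 0.995}
        self.values = {}

    def update(self, loss: float) -> dict:
        for name, decay in self.decays.items():
            prev = self.values.get(name, loss)  # seed with the first observation
            self.values[name] = decay * prev + (1.0 - decay) * loss
        return dict(self.values)


emas = LossEMAs()
for _ in range(50):
    emas.update(10.0)            # a flat loss keeps all three averages together
after_spike = emas.update(20.0)  # a spike moves the short horizon the most
print(after_spike)
```

When the three averages agree, the loss is flat at every horizon; a gap between the short and long averages signals a recent change that the slower averages have not yet absorbed.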
## Training Configuration

| Config | Value |
|--------|-------|
| **Dataset** | Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia |
| **Tokenizer** | 32K BBPE |
| **Optimizer** | SGD Nesterov (momentum=0.9, weight_decay=0.1) |
| **LR Schedule** | Cosine (warmup 3%) |
| **Base LR** | 3e-4 |
| **Effective Batch** | 36 (12 × 3 grad_accum) |
| **Sequence Length** | 2048 |
| **Max Grad Norm** | 1.0 (auto-tuned) |
| **Total Steps** | 19,500 |
| **GPU** | A10G (24GB) |
| **Dtype** | float32 |
| **Curriculum** | 4-tier (easy → medium → hard → reasoning), current: **hard** |
| **Dynamic Mix** | Adaptive per-category sampling weights, applied via WeightedBucketSampler |
| **Tokenization** | Disk-cached (SHA-256 keyed), no re-tokenization per epoch |
| **Filtering** | StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words |
| **Batching** | BucketDataset: groups by length for efficient padding |

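The "Cosine (warmup 3%)" schedule in the table can be sketched as below. The linear warmup shape, a minimum learning rate of 0, and using the 19,500 steps from the table as the schedule horizon are assumptions for illustration; the function name is hypothetical and the AutoTuner multiplier is not modelled.

```python
import math


def lr_at(step, base_lr=3e-4, total_steps=19_500, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed schedule shape)."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


effective_batch = 12 * 3  # micro-batch × gradient accumulation steps
print(effective_batch, lr_at(0), lr_at(585), lr_at(10_000), lr_at(19_500))
```

An adaptive controller such as the AutoTuner would multiply this scheduled value by its own factor at each step.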
### Training Algorithms

| Algorithm | Status | Description |
|-----------|--------|-------------|
| Curriculum Learning | Active | 4-tier easy→hard progression by text length |
| Dynamic Sampling | Active | Adaptive category mix based on per-category loss |
| Difficulty Scheduling | Active | 4 phases: Token Learning → Syntax → Reasoning → Expert |
| MoE Balancing | Active | Bias-based load-balanced routing |
| AutoTuner | Active | AI adaptive hyperparameter control |
| MTP | Active | Auxiliary multi-token prediction loss |
| Curriculum Scheduling | Active | Loss-based adaptive difficulty |
| Reflection Training | **Active** | High-loss example replay (1,500 stored) |
| Memory Algorithms | **Active** | 1,500 stored, avg loss 10.1 |
| Tool Routing | **Active** | Code=706, Math=205, Formal=587 routed |
| Synthetic Evolution | Inactive | Model-generated training data (potential A10G bottleneck) |

## AutoTuner State (Step 19,500)

| Metric | Step 18,000 | Step 19,500 | Change |
|--------|-------------|-------------|--------|
| **Phase** | Balanced | **Exploitation** | ↓ aggressiveness |
| **LR Multiplier** | 0.78× | **0.64×** | ↓ 18% |
| **Grad Norm Multiplier** | 0.76× | **0.64×** | ↓ 16% |
| **Weight Decay Mult** | 1.60× | **active** | regularization |
| **Best (smoothed loss)** | 28.84 | **4.01** | ↓ (different scale) |
| **Best Eval Loss** | 56.07 | **56.07** | unchanged (no new eval) |
| **Adjustments Made** | - | **152** | learned control |
| **Degeneracy Reductions** | - | **2** | prevented divergence |
| **Cosine Similarity EMA** | - | **0.15** | moderate direction stability |
| **Gradient Noise EMA** | - | **0.10** | low noise |
| **Gradient Norm (avg)** | - | **3.45** | well-controlled |
| **Diagnosis** | Overfitting | **Exploitation** | phase-consistent |
| **Plan Strategy** | Regularize | **regularization ongoing** | - |
| **Plan Accuracy** | 0.04 | - | exploratory phase |
| **Trajectory Slope** | +1.85 (r²=0.08) | - | high variance |
| **Mix Weights** | short=5.6%, med=40.8%, long=30.4%, vlong=23.2% | **short=43.3%, med=24.5%, long=18.3%, vlong=13.9%** | shifted to short |
| **Curriculum** | medium | **hard** | ↑ difficulty |

The AutoTuner has entered the **exploitation** phase at step 19,500, reducing the LR to 6.35e-5 (a 0.64× multiplier), scaling the grad clip by 0.64×, and increasing weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence in the underlying dynamics, even though curriculum tier transitions cause high variance in the surface loss.

## Routing Activity (Step 19,500)

| Route | Count | Avg Performance |
|-------|-------|----------------|
| Code | 706 | 10.36 |
| Math | 205 | 10.29 |
| Formal | 587 | 10.36 |
| Creative | 1 | 10.83 |
| Dialog | 1 | 9.48 |

Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.

## Data Pipeline (New in v2)

- **StrictFilter**: Multi-layer text quality filter: URL/HTML/emoji stripping → char ratio ≥0.25 → language score ≥0.001 → 4-gram repetition ≤0.4 → min 10 words
- **TokenCache**: SHA-256 keyed disk cache; tokenize once per unique text, no re-tokenization across epochs
- **BucketDataset**: Groups texts by similar length (bucket_size=64) to minimize padding waste
- **WeightedBucketSampler**: Importance sampling by category weights, synced from DynamicSampler every 500 steps

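Two of the pipeline steps above lend themselves to a short sketch: the SHA-256 cache key and the 4-gram repetition check. The function names are hypothetical, and defining the repetition ratio as the share of 4-grams that duplicate an earlier 4-gram is an assumption; the source gives only the thresholds.

```python
import hashlib
from collections import Counter


def cache_key(text: str) -> str:
    """Stable key for a text, so each unique text is tokenized only once."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def repetition_ratio(text: str, n: int = 4) -> float:
    """Share of word n-grams that repeat an earlier n-gram (assumed definition)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not grams:
        return 0.0
    repeated = sum(count - 1 for count in Counter(grams).values())
    return repeated / len(grams)


def passes(text: str, max_repetition: float = 0.4, min_words: int = 10) -> bool:
    """Apply only the repetition and minimum-length checks from the filter chain."""
    return len(text.split()) >= min_words and repetition_ratio(text) <= max_repetition


clean = "satu dua tiga empat lima enam tujuh delapan sembilan sepuluh"
spam = "a b c d " * 5
print(cache_key(clean)[:12], passes(clean), passes(spam))
```

The remaining filter stages (character ratio, language score, URL/HTML/emoji stripping) would slot into `passes` in the same way.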
## Files

| File | Description |
|------|-------------|
| `model.pt` | Model weights (~105M params, 419MB), **step 19,500** |
| `best.pt` | Best checkpoint by eval loss |
| `training_state.json` | Full training state including AutoTuner state |
| `tokenizer.json` | BBPE tokenizer (32K vocab) |
| `tokenizer_config.json` | Tokenizer configuration |
| `config.yaml` | Model configuration (DeeplmConfig defaults) |
| `training_curve_8k_20k.png` | Updated training curves: step 8,010 → 19,690 |

| 231 |
-
## Usage
|
| 232 |
|
| 233 |
```python
|
| 234 |
-
import
|
|
|
|
| 235 |
from deeplm.config import DeeplmConfig
|
| 236 |
from deeplm.model.deeplm import DeeplmModel
|
|
|
|
|
|
|
| 237 |
|
|
|
|
| 238 |
config = DeeplmConfig()
|
|
|
|
|
|
|
| 239 |
model = DeeplmModel(config)
|
| 240 |
-
|
| 241 |
-
|
| 242 |
-
|
| 243 |
-
|
| 244 |
-
|
| 245 |
-
|
| 246 |
-
|
| 247 |
-
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
|
| 251 |
-
|
| 252 |
-
|
| 253 |
```
|
| 3 |
license: apache-2.0
|
| 4 |
library_name: transformers
|
| 5 |
tags:
|
| 6 |
+
- pytorch
|
| 7 |
+
- safetensors
|
| 8 |
+
- deeplm
|
| 9 |
+
- bitnet
|
| 10 |
+
- moe
|
| 11 |
+
- mla
|
| 12 |
+
- mtp
|
| 13 |
+
- indonesian
|
| 14 |
+
pipeline_tag: text-generation
|
|
|
|
|
|
|
| 15 |
---
|
| 16 |
|
# Deeplm: 108M BitNet MoE Language Model

Deeplm is a language model with ~105M parameters that uses **BitNet b1.58 ternary quantization** from the start, inspired by the **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7** architectures.

## Architecture

| Component | Detail |
|---|---|
| **Total Parameters** | ~104.7M |
| **Architecture** | Decoder-only Transformer |
| **Layers** | 10 |
| **Hidden Size** | 512 |
| **Vocab Size** | 32,000 (BPETokenizer) |
| **Max Seq Length** | 4,096 |
| **Attention Heads** | 8 (MQA, 1 KV head) |
| **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
| **Dtype** | float32 (weights quantized to ternary) |

## Innovative Features

| Feature | Source | Description |
|---|---|---|
| **MLA** | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
| **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
| **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
| **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
| **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
| **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
| **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
| **Curriculum Router** | Deeplm | Phase-based category weighting |
| **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |

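The absmean ternary quantization named in the tables can be illustrated with the BitNet b1.58 rule: scale the weights by their mean absolute value, round, and clip to {-1, 0, +1}. The sketch below works on a flat list of hypothetical weights and is not taken from the repository's `bitnet_quantize.py`, which may differ (for example in per-tensor vs. per-group scaling).

```python
def absmean_ternary(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as the scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary codes and the shared scale."""
    return [q * scale for q in quantized]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]  # hypothetical float weights
q, scale = absmean_ternary(weights)
approx = dequantize(q, scale)
print(q, round(scale, 4), [round(a, 3) for a in approx])
```

Small weights collapse to 0 and large ones saturate at ±1, so each weight needs only about 1.58 bits (log2 of 3) plus one shared scale.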
## Model Specification

```json
{
  "architectures": ["DeeplmModel"],
  "model_type": "deeplm",
  "vocab_size": 32000,
  "hidden_size": 512,
  "intermediate_size": 2048,
  "num_hidden_layers": 10,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "max_position_embeddings": 4096,
  "rms_norm_eps": 1e-06,
  "rope_theta": 50000.0,
  "rope_dim": 64,
  "tie_word_embeddings": true,
  "num_routed_experts": 4,
  "num_shared_experts": 1,
  "expert_topk": 2,
  "q_lora_rank": 192,
  "kv_lora_rank": 64,
  "qk_rope_head_dim": 64,
  "qk_nope_head_dim": 64,
  "v_head_dim": 128,
  "mtp_depth": 2,
  "mtp_num_layers": 2,
  "bitnet_quantized": true,
  "bitnet_scale": "absmean"
}
```

## Usage

### Inference

```python
import sys

# Make the repository checkout importable (adjust the path to your clone).
sys.path.insert(0, "deeplm")
from deeplm.config import DeeplmConfig
from deeplm.model.deeplm import DeeplmModel
from safetensors.torch import load_file
import torch

# Load config
config = DeeplmConfig()

# Build model
model = DeeplmModel(config)

# Load BitNet quantized weights
state_dict = load_file("model.safetensors")
model.load_state_dict(state_dict, strict=False)  # strict=False skips missing/unexpected keys

# Generate
input_ids = torch.tensor([[1, 2, 3]])  # bos + tokens (placeholder IDs)
output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)
```

### Training

```bash
# Install dependencies
pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors

# Train with all features
python train.py --batch_size 3 --grad_accum 2 --max_steps 31250

# Custom config
python train.py \
  --max_steps 100000 \
  --batch_size 4 \
  --seq_len 512 \
  --lr 3e-4 \
  --no_auto_tuner
```

## Project Structure

```
deeplm-108m/
├── config.json               # Model config
├── generation_config.json    # Generation params
├── model.safetensors         # BitNet quantized weights (419MB)
├── tokenizer.json            # BPETokenizer
├── tokenizer_config.json     # Tokenizer config
├── train.py                  # Training script (all features)
├── init_model.py             # Model initialization script
├── deeplm_modal.py           # Modal.com build script
└── deeplm/                   # Source code
    ├── config.py             # Dataclass configs
    ├── model/
    │   ├── deeplm.py             # Main model
    │   ├── mla.py                # Multi-head Latent Attention
    │   ├── moe.py                # Mixture of Experts
    │   ├── hybrid_attention.py   # Softmax + Lightning
    │   ├── hyper_connections.py  # Sinkhorn routing
    │   ├── mtp.py                # Multi-Token Prediction
    │   └── transformer_block.py
    ├── training/
    │   ├── trainer.py            # Training loop
    │   ├── auto_tuner.py         # Adaptive training controller
    │   ├── curriculum_router.py  # Phase-based routing
    │   ├── data_pipeline.py      # Bucket dataset + sampler
    │   ├── logger.py             # SmartLogger + anomaly detection
    │   └── control/              # TrainingControl plane
    ├── self_evolution/
    │   └── framework.py          # Autonomous evolution loop
    └── quantization/
        ├── bitnet_quantize.py    # BitNet b1.58
        └── gguf_export.py
```

## Training

| Parameter | Value |
|---|---|
| **Dataset** | afrizalha/KamusOne-28M-Indonesian |
| **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
| **LR** | 6e-4 (cosine, warmup=150) |
| **Batch Size** | 3 x grad_accum=2 = 6 effective |
| **Weight Decay** | 0.1 |
| **Max Grad Norm** | 1.0 |
| **Max Steps** | 31,250 |

## License

Apache 2.0

## Acknowledgments

The architecture is inspired by:
- **DeepSeek V4**: MLA, Hyper-Connections, MTP, MoE routing
- **Kimi K2.6**: Shared Expert, Agent Swarm
- **MiniMax M2.7**: Self-Evolution Framework, Hybrid Attention, Agent Harness
- **BitNet**: b1.58 ternary quantization