samcheng0/deeplm-108m

@@ -3,251 +3,182 @@ language: id
 license: apache-2.0
 library_name: transformers
 tags:
-  - indonesian
-  - language-model
-  - moe
-  - mla
-  - hyper-connections
-  - lightning-attention
-  - multi-token-prediction
-  - self-evolution
-  - autotuner
-  - deeplm
-base_model: "none"
 ---
-# Deeplm-105M v2 (Step 19,500)
-Indonesian language model with novel architecture combining MLA, MoE, Hyper-Connections, Hybrid Attention, Multi-Token Prediction, Self-Evolution, and autonomous AutoTuner.
-Trained on A10G (24GB) for **19,500 steps** (~24h) with progressive curriculum, dynamic category sampling, activated reflection+memory+routing algorithms, and energy-based hyperparameter control.
-## Training Progress (Step 8,000 → 19,500)
-| Metric | Step 8,000 | Step 18,000 | Step 19,500 | Delta (8k→19.5k) |
-|--------|-----------|-------------|-------------|-------------------|
-| **Loss (range)** | 59.6 ± 31.2 | 29.3 – 83.2 | **31.29** (at step) | — |
-| **Mean Loss (8k+)** | — | 53.5 | **50.68** | -2.8 |
-| **Best Loss (eval)** | — | 56.07 | **56.07** | — |
-| **Curriculum** | balanced | medium | **hard** | ↑ |
-| **Learning Rate** | 3.84e-04 | 9.80e-05 | **6.47e-05** | 6.0× ↓ |
-| **Gradient Norm (avg)** | 21.3 | 16.8 | **15.4** | -5.9 |
-| **Throughput** | 262 tok/s | 262 tok/s | **249 tok/s** | -5% |
-| **Total Tokens** | — | ~263K | **~381K** | +45% |
-![Training Curves](training_curve_8k_20k.png)
-*4-panel training curves: loss (with 20-step moving average), learning rate (cosine, log scale), gradient norm (MA-30, log scale), and throughput tokens/s (MA-50). Green = previous upload at step 18,000, Red = current upload at step 19,500. Data logged every 10 steps from step 8,010 to 19,690.*
-### Key Observations (8k → 19.5k)
-- **Curriculum progression**: balanced → medium → hard — loss variance reflects tier transitions
-- **Loss range**: 28.84 – 85.48 (mean 50.68) — diverse curriculum tiers (easy → hard reasoning)
-- **Best eval loss**: 56.07 held steady (no new eval between 9500→19500)
-- **LR decay**: Cosine schedule from 2.95e-04 → 6.29e-05 at step 19690
-- **AutoTuner**: Phase changed from `balanced` → `exploitation` — actively reducing LR/wd for regularization
-- **Reflection + Memory + Routing**: Activated after step ~15,000 — adds overhead (~249 tok/s vs 262)
-- **Gradient norm**: Stable at 5–20 range with fewer extreme spikes as training progresses
-## Architecture
-| Component | Detail |
-|-----------|--------|
-| **Total Parameters** | 104,747,048 (~105M) |
-| **Vocabulary** | 32,000 (BBPE) |
-| **Layers** | 10 Transformer blocks |
-| **Hidden Size** | 512 |
-| **Feed-Forward** | 2048 (SwiGLU, 4× hidden) |
-| **Attention Heads** | 8 query heads, 1 KV head (MQA) |
-| **Head Dim** | 128 (64 RoPE + 64 NoPE) |
-| **Max Seq Length** | 4096 |
-| **RoPE Theta** | 50,000 |
-| **Attention** | MLA (Multi-head Latent Attention) |
-| **FFN** | MoE (4 routed + 1 shared experts, top-k=2) |
-| **Residual** | Hyper-Connections with Sinkhorn routing |
-| **Hybrid Attention** | 3 softmax + 7 Lightning layers |
-| **Prediction** | MTP (Multi-Token Prediction, depth=2, 2 MTP layers) |
-| **Self-Evolution** | Autonomous research loop (100+ rounds) |
-| **Embeddings** | Tied (shared between input/output) |
-| **AutoTuner** | Adaptive energy-based optimizer scheduler |
-| **Dtype** | float32 (Hyper-Connections stability) |
-### Key Innovations
-<details>
-<summary>Click to expand architecture details</summary>
-### 1. Multi-head Latent Attention (MLA) — *DeepSeek V4 / Kimi K2.6*
-- Q compressed: hidden → q_lora_rank(192) → Layernorm → q_up(8 × 128)
-- KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
-- Entire KV cache per token: just 128 dims (64 latent + 64 rope) — **~8× smaller** than standard MHA
-- Decoupled RoPE applied only to 64-dim k_pe, content path stays RoPE-free
-- Absorption trick pre-computes W_UK @ W_UV for faster inference
-- MQA-style: KV decomposed once, expanded to all query heads
-### 2. Mixture of Experts (MoE) — *DeepSeek V4 / Kimi K2.6*
-- 4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
-- Top-k=2 routing: each token activates only 2 experts
-- **sqrt(softplus(x))** scoring for numerical stability (DeepSeek V4)
-- **Bias-based load balancing** (no auxiliary loss, no gradient interference)
-- Per-expert routing bias auto-updates to balance token assignments
-- SwiGLU activation in every expert (fused gate+up projection)
-- Expert affinity memory tracks token-expert history
-### 3. Hyper-Connections with Sinkhorn Routing — *DeepSeek V4*
-- Replaces standard residual connections with learned routing
-- 4 connection types: **identity**, **transform**, **gate**, **skip**
-- Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
-- Input-dependent routing via gating network
-- Type-specific learnable biases initialized per config
-- Pre-LayerNorm on layer output before routing
-### 4. Hybrid Attention — *MiniMax M2.7*
-- 3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
-- 7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 **50/50 blend**
-- LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
-- Incremental KV state for efficient autoregressive generation
-- ReLU/Swish activation replaces softmax in linear path
-### 5. Multi-Token Prediction (MTP) — *DeepSeek V4*
-- 2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
-- Projection block: Linear → LayerNorm → GELU → Linear + residual skip
-- RoPE positional encoding on reduced dim (hidden/4) for efficiency
-- Tied LM head shares parameters with main embedding layer
-- Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
-- Loss weight: 0.3 × cross-entropy of future token predictions
-### 6. Self-Evolution Framework — *MiniMax M2.7 / Deeplm*
-- Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
-- 100+ autonomous optimization rounds per training cycle
-- 3 feedback chain episodes for meta-learning
-### 7. AutoTuner — *Deeplm custom*
-- Energy-based adaptive hyperparameter controller
-- Phase-aware dynamics (warmup → exploration → balanced → exploitation)
-- Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
-- Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
-- Gradient noise scale monitoring
-- Cosine similarity for gradient direction tracking
-- Layer health monitoring with per-group gradient ratios
-- Failure-aware rollback with revive mechanism
-- Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
-- Dual-window trajectory predictor: regime change detection, convergence estimation
-</details>
-## Training Configuration
-| Config | Value |
-|--------|-------|
-| **Dataset** | Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia |
-| **Tokenizer** | 32K BBPE |
-| **Optimizer** | SGD Nesterov (momentum=0.9, weight_decay=0.1) |
-| **LR Schedule** | Cosine (warmup 3%) |
-| **Base LR** | 3e-4 |
-| **Effective Batch** | 36 (12 × 3 grad_accum) |
-| **Sequence Length** | 2048 |
-| **Max Grad Norm** | 1.0 (auto-tuned) |
-| **Total Steps** | 19,500 |
-| **GPU** | A10G (24GB) |
-| **Dtype** | float32 |
-| **Curriculum** | 4-tier (easy → medium → hard → reasoning), current: **hard** |
-| **Dynamic Mix** | Adaptive per-category sampling weights, applied via WeightedBucketSampler |
-| **Tokenization** | Disk-cached (SHA-256 keyed), no re-tokenization per epoch |
-| **Filtering** | StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words |
-| **Batching** | BucketDataset: groups by length for efficient padding |
-### Training Algorithms
-| Algorithm | Status | Description |
-|-----------|--------|-------------|
-| Curriculum Learning | Active | 4-tier easy→hard progression by text length |
-| Dynamic Sampling | Active | Adaptive category mix based on per-category loss |
-| Difficulty Scheduling | Active | 4 phases: Token Learning → Syntax → Reasoning → Expert |
-| MoE Balancing | Active | Bias-based load-balanced routing |
-| AutoTuner | Active | AI adaptive hyperparameter control |
-| MTP | Active | Auxiliary multi-token prediction loss |
-| Curriculum Scheduling | Active | Loss-based adaptive difficulty |
-| Reflection Training | **Active** | High-loss example replay (1,500 stored) |
-| Memory Algorithms | **Active** | 1,500 stored, avg loss 10.1 |
-| Tool Routing | **Active** | Code=706, Math=205, Formal=587 routed |
-| Synthetic Evolution | Inactive | Model-generated training data (potential A10G bottleneck) |
-## AutoTuner State (Step 19,500)
-| Metric | Step 18,000 | Step 19,500 | Change |
-|--------|-------------|-------------|--------|
-| **Phase** | Balanced | **Exploitation** | ↑ aggressiveness |
-| **LR Multiplier** | 0.78× | **0.64×** | ↓ 18% |
-| **Grad Norm Multiplier** | 0.76× | **0.64×** | ↓ 16% |
-| **Weight Decay Mult** | 1.60× | **active** | regularization |
-| **Best (smoothed loss)** | 28.84 | **4.01** | ↓ (different scale) |
-| **Best Eval Loss** | 56.07 | **56.07** | — (no new eval) |
-| **Adjustments Made** | — | **152** | learned control |
-| **Degeneracy Reductions** | — | **2** | prevented divergence |
-| **Cosine Similarity EMA** | — | **0.15** | moderate direction stability |
-| **Gradient Noise EMA** | — | **0.10** | low noise |
-| **Gradient Norm (avg)** | — | **3.45** | well-controlled |
-| **Diagnosis** | Overfitting | **Exploitation** | phase-consistent |
-| **Plan Strategy** | Regularize | **regularization ongoing** | — |
-| **Plan Accuracy** | 0.04 | — | exploratory phase |
-| **Trajectory Slope** | +1.85 (r²=0.08) | — | high variance |
-| **Mix Weights** | short=5.6%, med=40.8%, long=30.4%, vlong=23.2% | **short=43.3%, med=24.5%, long=18.3%, vlong=13.9%** | shifted to short |
-| **Curriculum** | medium | **hard** | ↑ difficulty |
-The AutoTuner has entered **exploitation** phase at step 19,500 — reducing LR to 6.35e-5 (0.64× base), grad clip to 0.64×, increasing weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence at the underlying dynamics level despite curriculum tier transitions causing high surface loss variance.
-## Routing Activity (Step 19,500)
-| Route | Count | Avg Performance |
-|-------|-------|----------------|
-| Code | 706 | 10.36 |
-| Math | 205 | 10.29 |
-| Formal | 587 | 10.36 |
-| Creative | 1 | 10.83 |
-| Dialog | 1 | 9.48 |
-Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.
-## Data Pipeline (New in v2)
-- **StrictFilter**: Multi-layer text quality filter — URL/HTML/emoji stripping → char ratio ≥0.25 → language score ≥0.001 → 4-gram repetition ≤0.4 → min 10 words
-- **TokenCache**: SHA-256 keyed disk cache — tokenize once per unique text, no re-tokenization across epochs
-- **BucketDataset**: Groups texts by similar length (bucket_size=64) to minimize padding waste
-- **WeightedBucketSampler**: Importance sampling by category weights, synced from DynamicSampler every 500 steps
-## Files
-| File | Description |
-|------|-------------|
-| `model.pt` | Model weights (~105M params, 419MB) — **step 19,500** |
-| `best.pt` | Best checkpoint by eval loss |
-| `training_state.json` | Full training state including AutoTuner state |
-| `tokenizer.json` | BBPE tokenizer (32K vocab) |
-| `tokenizer_config.json` | Tokenizer configuration |
-| `config.yaml` | Model configuration (DeeplmConfig defaults) |
-| `training_curve_8k_20k.png` | Updated training curves: step 8,010 → 19,690 |
-## Usage
 ```python
-import torch
 from deeplm.config import DeeplmConfig
 from deeplm.model.deeplm import DeeplmModel
 config = DeeplmConfig()
 model = DeeplmModel(config)
-model.load_state_dict(torch.load("model.pt", map_location="cpu"), strict=False)
-model.eval()
-input_ids = torch.tensor([[1, 2, 3]])
-output = model.generate(
-    input_ids,
-    max_new_tokens=128,
-    do_sample=True,
-    temperature=0.7,
-    top_k=50,
-    top_p=0.9,
-)
-print(output)
 ```

 license: apache-2.0
 library_name: transformers
 tags:
+- pytorch
+- safetensors
+- deeplm
+- bitnet
+- moe
+- mla
+- mtp
+- indonesian
+pipeline_tag: text-generation
 ---
+# Deeplm — 108M BitNet MoE Language Model
+Deeplm is a language model with ~105M parameters that uses **BitNet b1.58 ternary quantization** from initialization, with an architecture inspired by **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7**.
+## 🏗️ Architecture
+| Component | Detail |
+|---|---|
+| **Total Parameters** | ~104.7M |
+| **Architecture** | Decoder-only Transformer |
+| **Layers** | 10 |
+| **Hidden Size** | 512 |
+| **Vocab Size** | 32,000 (BPETokenizer) |
+| **Max Seq Length** | 4,096 |
+| **Attention Heads** | 8 (MQA, 1 KV head) |
+| **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
+| **Dtype** | float32 (weights quantized to ternary) |
+## ✨ Innovative Features
+| Feature | Source | Notes |
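The absmean scheme named in the Quantization row can be sketched as follows. This is a minimal, dependency-free illustration of the published BitNet b1.58 recipe (scale by the mean absolute weight, round, clip to [-1, 1]). The function name `absmean_ternary` and the `eps` value are assumptions for illustration, and the repository's `bitnet_quantize.py` may differ in detail.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize a flat list of floats to {-1, 0, +1} with absmean scaling.

    Hypothetical helper for illustration; not taken from the repository.
    """
    # Scale is the mean absolute value of the weights (plus eps for safety).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer, then clip to [-1, 1].
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.3]
ternary, scale = absmean_ternary(weights)

# Approximate reconstruction: each ternary value times the shared scale.
dequantized = [t * scale for t in ternary]
print(ternary, round(scale, 5))
```

Small weights collapse to 0 while larger ones keep only their sign, so the single float `scale` is the only full-precision value stored per weight group in this sketch.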
+|---|---|---|
+| **MLA** | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
+| **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
+| **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
+| **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
+| **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
+| **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
+| **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
+| **Curriculum Router** | Deeplm | Phase-based category weighting |
+| **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
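The MoE row (4 routed + 1 shared expert, top-k=2) can be illustrated with a toy router. This is a sketch under stated assumptions, not the code in `moe.py`: the `sqrt(softplus(x))` scoring follows the earlier revision of this card, while renormalizing the gate weights over the selected experts, the scalar "experts", and the names `route_top_k` and `moe_forward` are hypothetical choices made for illustration.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def route_top_k(router_logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k best experts."""
    scores = [math.sqrt(softplus(z)) for z in router_logits]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    total = sum(scores[i] for i in top)
    # Assumption: gate weights are renormalized over the selected experts.
    return [(i, scores[i] / total) for i in top]


def moe_forward(x, router_logits, routed_experts, shared_expert, k=2):
    out = shared_expert(x)  # the shared expert sees every token
    for i, gate in route_top_k(router_logits, k):
        out += gate * routed_experts[i](x)
    return out


# Toy scalar experts standing in for the SwiGLU feed-forward blocks.
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x

logits = [0.1, 2.0, -1.0, 1.5]
selected = route_top_k(logits)
y = moe_forward(1.0, logits, routed, shared)
print(selected, y)
```

Only two of the four routed experts run per token, so per-token compute stays close to that of a dense model with three feed-forward blocks (two routed plus one shared).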
+## 📊 Model Specification
+```json
+{
+  "architectures": ["DeeplmModel"],
+  "model_type": "deeplm",
+  "vocab_size": 32000,
+  "hidden_size": 512,
+  "intermediate_size": 2048,
+  "num_hidden_layers": 10,
+  "num_attention_heads": 8,
+  "num_key_value_heads": 1,
+  "max_position_embeddings": 4096,
+  "rms_norm_eps": 1e-06,
+  "rope_theta": 50000.0,
+  "rope_dim": 64,
+  "tie_word_embeddings": true,
+  "num_routed_experts": 4,
+  "num_shared_experts": 1,
+  "expert_topk": 2,
+  "q_lora_rank": 192,
+  "kv_lora_rank": 64,
+  "qk_rope_head_dim": 64,
+  "qk_nope_head_dim": 64,
+  "v_head_dim": 128,
+  "mtp_depth": 2,
+  "mtp_num_layers": 2,
+  "bitnet_quantized": true,
+  "bitnet_scale": "absmean"
+}
+```
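The cache saving implied by these values can be checked with a little arithmetic. The sketch below assumes, as the earlier revision of this card stated, that MLA caches only the KV latent plus the shared RoPE key per token, and it compares against a standard multi-head attention baseline that caches full keys and values for all 8 heads. The choice of baseline is an assumption; the card does not say which one its compression figure uses.

```python
# Values copied from the config above.
cfg = {
    "num_attention_heads": 8,
    "kv_lora_rank": 64,
    "qk_rope_head_dim": 64,
    "qk_nope_head_dim": 64,
    "v_head_dim": 128,
}

# MLA: cached dims per token per layer = KV latent + shared RoPE key.
mla_dims = cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]

# Assumed baseline: standard MHA caching a full key and value for every head.
key_dim = cfg["qk_nope_head_dim"] + cfg["qk_rope_head_dim"]
mha_dims = cfg["num_attention_heads"] * (key_dim + cfg["v_head_dim"])

ratio = mha_dims / mla_dims
print(mla_dims, mha_dims, ratio)
```

Against this particular baseline the ratio is 16x. That differs from the 24x quoted in the features table, so that figure presumably rests on a different baseline or measurement that the card does not state.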
+## 🚀 Usage
+### Inference
 ```python
+import sys
+sys.path.insert(0, "deeplm")
 from deeplm.config import DeeplmConfig
 from deeplm.model.deeplm import DeeplmModel
+from safetensors.torch import load_file
+import torch
+# Load config
 config = DeeplmConfig()
+# Build model
 model = DeeplmModel(config)
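+# Load BitNet quantized weights
+state_dict = load_file("model.safetensors")
+# strict=False skips mismatched keys silently, so report them explicitly
+missing, unexpected = model.load_state_dict(state_dict, strict=False)
+print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
+model.eval()  # disable training-only behavior such as dropout
+# Generate
+input_ids = torch.tensor([[1, 2, 3]])  # placeholder token IDs (bos + tokens)
+with torch.no_grad():
+    output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)
+print(output)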
+```
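The `temperature=0.7` argument above controls how sharply sampling favors the most likely token. The card does not show how `model.generate` applies it, so the following is only a generic sketch of temperature sampling (softmax over `logits / temperature`); the function name `sample_with_temperature` and the toy logits are hypothetical.

```python
import math
import random


def sample_with_temperature(logits, temperature=0.7, rng=random):
    """Sample one token index from softmax(logits / temperature)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, probs


logits = [2.0, 1.0, 0.0]
rng = random.Random(0)  # seeded for repeatable draws
token, sharp = sample_with_temperature(logits, temperature=0.7, rng=rng)
_, flat = sample_with_temperature(logits, temperature=1.0, rng=rng)
print(token, [round(p, 3) for p in sharp], [round(p, 3) for p in flat])
```

Temperatures below 1.0 concentrate probability on the top token, and values above 1.0 flatten the distribution.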
+### Training
+```bash
+# Install dependencies
+pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors
+# Train with all features
+python train.py --batch_size 3 --grad_accum 2 --max_steps 31250
+# Custom config
+python train.py \
+  --max_steps 100000 \
+  --batch_size 4 \
+  --seq_len 512 \
+  --lr 3e-4 \
+  --no_auto_tuner
+```
+## 📁 Project Structure
 ```
+deeplm-108m/
+├── config.json              # Model config
+├── generation_config.json   # Generation params
+├── model.safetensors        # BitNet quantized weights (419MB)
+├── tokenizer.json           # BPETokenizer
+├── tokenizer_config.json    # Tokenizer config
+├── train.py                 # Training script (all features)
+├── init_model.py            # Model initialization script
+├── deeplm_modal.py          # Modal.com build script
+└── deeplm/                  # Source code
+    ├── config.py            # Dataclass configs
+    ├── model/
+    │   ├── deeplm.py        # Main model
+    │   ├── mla.py           # Multi-head Latent Attention
+    │   ├── moe.py           # Mixture of Experts
+    │   ├── hybrid_attention.py  # Softmax + Lightning
+    │   ├── hyper_connections.py # Sinkhorn routing
+    │   ├── mtp.py           # Multi-Token Prediction
+    │   └── transformer_block.py
+    ├── training/
+    │   ├── trainer.py       # Training loop
+    │   ├── auto_tuner.py    # Adaptive training controller
+    │   ├── curriculum_router.py  # Phase-based routing
+    │   ├── data_pipeline.py # Bucket dataset + sampler
+    │   ├── logger.py        # SmartLogger + anomaly detection
+    │   └── control/         # TrainingControl plane
+    ├── self_evolution/
+    │   └── framework.py     # Autonomous evolution loop
+    └── quantization/
+        ├── bitnet_quantize.py  # BitNet b1.58
+        └── gguf_export.py
+```
+## 📈 Training
+| Parameter | Value |
+|---|---|
+| **Dataset** | afrizalha/KamusOne-28M-Indonesian |
+| **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
+| **LR** | 6e-4 (cosine, warmup=150) |
+| **Batch Size** | 3 x grad_accum=2 = 6 effective |
+| **Weight Decay** | 0.1 |
+| **Max Grad Norm** | 1.0 |
+| **Max Steps** | 31,250 |
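The schedule and batch settings in this table can be sketched numerically. This is an illustration under assumptions rather than the trainer's code: it treats `warmup=150` as a linear warmup measured in steps, assumes the cosine decays to a minimum learning rate of 0, and counts one step as one optimizer update. The name `lr_at` is hypothetical.

```python
import math


def lr_at(step, base_lr=6e-4, warmup=150, max_steps=31_250, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed shape)."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Effective batch: micro-batch size times gradient accumulation steps.
effective_batch = 3 * 2
sequences_seen = effective_batch * 31_250

for step in (0, 149, 15_700, 31_250):
    print(step, f"{lr_at(step):.2e}")
print(effective_batch, sequences_seen)
```

Under these assumptions the learning rate peaks at 6e-4 when warmup ends, halves at the midpoint of the decay, and the run processes 187,500 sequences in total.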
+## 📄 License
+Apache 2.0
+## 🙏 Acknowledgments
+Architecture inspired by:
+- **DeepSeek V4** — MLA, Hyper-Connections, MTP, MoE routing
+- **Kimi K2.6** — Shared Expert, Agent Swarm
+- **MiniMax M2.7** — Self-Evolution Framework, Hybrid Attention, Agent Harness
+- **BitNet** — b1.58 ternary quantization