samcheng0 committed
Commit 5d28f48 · verified · 1 Parent(s): 121f32b

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +164 -233
README.md CHANGED
@@ -3,251 +3,182 @@ language: id
3
  license: apache-2.0
4
  library_name: transformers
5
  tags:
6
- - indonesian
7
- - language-model
8
- - moe
9
- - mla
10
- - hyper-connections
11
- - lightning-attention
12
- - multi-token-prediction
13
- - self-evolution
14
- - autotuner
15
- - deeplm
16
- base_model: "none"
17
  ---
18
 
19
- # Deeplm-105M v2 (Step 19,500)
20
 
21
- Indonesian language model with a novel architecture combining MLA, MoE, Hyper-Connections, Hybrid Attention, Multi-Token Prediction, Self-Evolution, and an autonomous AutoTuner.
22
 
23
- Trained on an A10G (24GB) for **19,500 steps** (~24h) with a progressive curriculum, dynamic category sampling, activated reflection, memory, and routing algorithms, and energy-based hyperparameter control.
24
 
25
- ## Training Progress (Step 8,000 → 19,500)
26
-
27
- | Metric | Step 8,000 | Step 18,000 | Step 19,500 | Delta (8k→19.5k) |
28
- |--------|-----------|-------------|-------------|-------------------|
29
- | **Loss (range)** | 59.6 ± 31.2 | 29.3 – 83.2 | **31.29** (at step) | - |
30
- | **Mean Loss (8k+)** | - | 53.5 | **50.68** | -2.8 |
31
- | **Best Loss (eval)** | - | 56.07 | **56.07** | - |
32
- | **Curriculum** | balanced | medium | **hard** | ↑ |
33
- | **Learning Rate** | 3.84e-04 | 9.80e-05 | **6.47e-05** | 6.0× ↓ |
34
- | **Gradient Norm (avg)** | 21.3 | 16.8 | **15.4** | -5.9 |
35
- | **Throughput** | 262 tok/s | 262 tok/s | **249 tok/s** | -5% |
36
- | **Total Tokens** | - | ~263K | **~381K** | +45% |
37
-
38
- ![Training Curves](training_curve_8k_20k.png)
39
-
40
- *4-panel training curves: loss (with 20-step moving average), learning rate (cosine, log scale), gradient norm (MA-30, log scale), and throughput tokens/s (MA-50). Green = previous upload at step 18,000, Red = current upload at step 19,500. Data logged every 10 steps from step 8,010 to 19,690.*
41
-
42
- ### Key Observations (8k → 19.5k)
43
-
44
- - **Curriculum progression**: balanced → medium → hard; loss variance reflects tier transitions
45
- - **Loss range**: 28.84 – 85.48 (mean 50.68), spanning diverse curriculum tiers (easy → hard reasoning)
46
- - **Best eval loss**: 56.07 held steady (no new eval between 9500→19500)
47
- - **LR decay**: Cosine schedule from 2.95e-04 → 6.29e-05 at step 19690
48
- - **AutoTuner**: Phase changed from `balanced` → `exploitation`, actively reducing LR/wd for regularization
49
- - **Reflection + Memory + Routing**: Activated after step ~15,000; adds overhead (~249 tok/s vs 262)
50
- - **Gradient norm**: Stable at 5–20 range with fewer extreme spikes as training progresses
51
 
52
- ## Architecture
53
 
54
- | Component | Detail |
55
- |-----------|--------|
56
- | **Total Parameters** | 104,747,048 (~105M) |
57
- | **Vocabulary** | 32,000 (BBPE) |
58
- | **Layers** | 10 Transformer blocks |
59
- | **Hidden Size** | 512 |
60
- | **Feed-Forward** | 2048 (SwiGLU, 4× hidden) |
61
- | **Attention Heads** | 8 query heads, 1 KV head (MQA) |
62
- | **Head Dim** | 128 (64 RoPE + 64 NoPE) |
63
- | **Max Seq Length** | 4096 |
64
- | **RoPE Theta** | 50,000 |
65
- | **Attention** | MLA (Multi-head Latent Attention) |
66
- | **FFN** | MoE (4 routed + 1 shared experts, top-k=2) |
67
- | **Residual** | Hyper-Connections with Sinkhorn routing |
68
- | **Hybrid Attention** | 3 softmax + 7 Lightning layers |
69
- | **Prediction** | MTP (Multi-Token Prediction, depth=2, 2 MTP layers) |
70
- | **Self-Evolution** | Autonomous research loop (100+ rounds) |
71
- | **Embeddings** | Tied (shared between input/output) |
72
- | **AutoTuner** | Adaptive energy-based optimizer scheduler |
73
- | **Dtype** | float32 (Hyper-Connections stability) |
74
-
75
- ### Key Innovations
76
-
77
- <details>
78
- <summary>Click to expand architecture details</summary>
79
-
80
- ### 1. Multi-head Latent Attention (MLA) - *DeepSeek V4 / Kimi K2.6*
81
- - Q compressed: hidden → q_lora_rank(192) → Layernorm → q_up(8 × 128)
82
- - KV compressed: hidden → [kv_latent(64) + k_rope(64)] → kv_up → [k_nope(64) + v(128)] × 8 heads
83
- - Entire KV cache per token: just 128 dims (64 latent + 64 rope), **~8× smaller** than standard MHA
84
- - Decoupled RoPE applied only to 64-dim k_pe, content path stays RoPE-free
85
- - Absorption trick pre-computes W_UK @ W_UV for faster inference
86
- - MQA-style: KV decomposed once, expanded to all query heads
87
-
88
- ### 2. Mixture of Experts (MoE) - *DeepSeek V4 / Kimi K2.6*
89
- - 4 routed experts + 1 shared expert (always active, Kimi K2.6 style)
90
- - Top-k=2 routing: each token activates only 2 experts
91
- - **sqrt(softplus(x))** scoring for numerical stability (DeepSeek V4)
92
- - **Bias-based load balancing** (no auxiliary loss, no gradient interference)
93
- - Per-expert routing bias auto-updates to balance token assignments
94
- - SwiGLU activation in every expert (fused gate+up projection)
95
- - Expert affinity memory tracks token-expert history
96
-
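The scoring and bias-based balancing described in the MoE list can be sketched in plain Python. This is a minimal illustration, not the repository's `moe.py`: the function names, the rule that the bias affects only expert selection (not the gate weights), and the sign-based bias update with `step_size` are all assumptions made for the example.

```python
import math

def softplus(x: float) -> float:
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def route_token(logits, bias, top_k=2):
    """Select top_k experts by biased score; gate weights use the unbiased score."""
    scores = [math.sqrt(softplus(z)) for z in logits]  # sqrt(softplus(x)) scoring
    ranked = sorted(range(len(scores)), key=lambda i: scores[i] + bias[i], reverse=True)
    chosen = ranked[:top_k]
    total = sum(scores[i] for i in chosen)
    return {i: scores[i] / total for i in chosen}  # gates sum to 1

def update_bias(bias, counts, step_size=0.01):
    """Assumed rule: lower the bias of overloaded experts, raise it for underloaded ones."""
    mean = sum(counts) / len(counts)
    return [b - step_size * ((c > mean) - (c < mean)) for b, c in zip(bias, counts)]

# One token routed over 4 experts, then one balancing step.
bias = [0.0, 0.0, 0.0, 0.0]
gates = route_token([2.0, 0.5, -1.0, 1.5], bias)
new_bias = update_bias(bias, counts=[10, 2, 4, 4])
```

Because the bias is adjusted outside the loss, balancing adds no auxiliary gradient term, which matches the "no auxiliary loss" bullet above.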
97
- ### 3. Hyper-Connections with Sinkhorn Routing - *DeepSeek V4*
98
- - Replaces standard residual connections with learned routing
99
- - 4 connection types: **identity**, **transform**, **gate**, **skip**
100
- - Sinkhorn-Knopp normalization (2 iterations) for doubly-stochastic weights
101
- - Input-dependent routing via gating network
102
- - Type-specific learnable biases initialized per config
103
- - Pre-LayerNorm on layer output before routing
104
-
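To make the Sinkhorn-Knopp step concrete, here is a small pure-Python sketch that alternates row and column normalization for 2 iterations. The 4×4 shape, the use of `exp` to make the entries positive, and the function name are assumptions for illustration; the actual `hyper_connections.py` may differ.

```python
import math

def sinkhorn_knopp(logits, iterations=2):
    """Alternate row and column normalization of exp(logits)."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]  # strictly positive entries
    for _ in range(iterations):
        # Row step: each row sums to 1.
        m = [[v / sum(row) for v in row] for row in m]
        # Column step: each column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m

logits = [
    [0.0, 1.0, 0.5, 0.2],
    [0.3, 0.1, 0.9, 0.4],
    [0.8, 0.2, 0.0, 0.6],
    [0.5, 0.7, 0.3, 1.0],
]
weights = sinkhorn_knopp(logits)
```

With only 2 iterations the result is approximately, not exactly, doubly stochastic: the columns sum to 1 after the last step, while the rows are only close to 1.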
105
- ### 4. Hybrid Attention - *MiniMax M2.7*
106
- - 3 softmax layers (indices 0, 4, 8): Standard MLA with full causal attention
107
- - 7 linear layers (1, 2, 3, 5, 6, 7, 9): MLA + LightningAttentionV2 **50/50 blend**
108
- - LightningAttentionV2: O(n) complexity with intra-block softmax + inter-block KV product
109
- - Incremental KV state for efficient autoregressive generation
110
- - ReLU/Swish activation replaces softmax in linear path
111
-
112
- ### 5. Multi-Token Prediction (MTP) - *DeepSeek V4*
113
- - 2 MTP layers, each predicting 2 tokens ahead (mtp_depth=2)
114
- - Projection block: Linear β†’ LayerNorm β†’ GELU β†’ Linear + residual skip
115
- - RoPE positional encoding on reduced dim (hidden/4) for efficiency
116
- - Tied LM head shares parameters with main embedding layer
117
- - Chunked computation (chunk_size=16) to avoid full (B, S, V) logits
118
- - Loss weight: 0.3 × cross-entropy of future token predictions
119
-
120
- ### 6. Self-Evolution Framework - *MiniMax M2.7 / Deeplm*
121
- - Autonomous 8-phase research loop: hypothesis → design → execute → analyze → diagnose → fix → evaluate → decide
122
- - 100+ autonomous optimization rounds per training cycle
123
- - 3 feedback chain episodes for meta-learning
124
-
125
- ### 7. AutoTuner - *Deeplm custom*
126
- - Energy-based adaptive hyperparameter controller
127
- - Phase-aware dynamics (warmup → exploration → balanced → exploitation)
128
- - Bayesian dynamics model: uncertainty-aware lr/wd sensitivity (Welford variance)
129
- - Multi-timescale loss EMAs (short=0.9, med=0.98, long=0.995)
130
- - Gradient noise scale monitoring
131
- - Cosine similarity for gradient direction tracking
132
- - Layer health monitoring with per-group gradient ratios
133
- - Failure-aware rollback with revive mechanism
134
- - Strategic planner: multi-step scheduled adjustments with plan accuracy tracking
135
- - Dual-window trajectory predictor: regime change detection, convergence estimation
136
-
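The multi-timescale loss EMAs can be illustrated with a short sketch. The decay values come from the list above; the class name, the dictionary layout, and seeding each EMA with the first observed loss are assumptions, not the `auto_tuner.py` implementation.

```python
class MultiTimescaleEMA:
    """Track one loss signal at several smoothing horizons."""

    def __init__(self, decays=None):
        # Decay values taken from the README; larger decay means a longer memory.
        self.decays = decays or {"short": 0.9, "med": 0.98, "long": 0.995}
        self.values = {}

    def update(self, loss: float) -> dict:
        for name, decay in self.decays.items():
            prev = self.values.get(name, loss)  # assumed: seed with first observation
            self.values[name] = decay * prev + (1.0 - decay) * loss
        return dict(self.values)

ema = MultiTimescaleEMA()
ema.update(10.3)          # first observation seeds all three averages
after = ema.update(5.0)   # a sudden drop moves the short EMA the most
```

When the three averages agree, the smoothed loss is flat at every horizon; a gap between `short` and `long` indicates a recent trend.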
137
- </details>
138
-
139
- ## Training Configuration
140
-
141
- | Config | Value |
142
- |--------|-------|
143
- | **Dataset** | Wikipedia-id (Indonesian) + GLM-5.1 (English reasoning) + English Wikipedia |
144
- | **Tokenizer** | 32K BBPE |
145
- | **Optimizer** | SGD Nesterov (momentum=0.9, weight_decay=0.1) |
146
- | **LR Schedule** | Cosine (warmup 3%) |
147
- | **Base LR** | 3e-4 |
148
- | **Effective Batch** | 36 (12 × 3 grad_accum) |
149
- | **Sequence Length** | 2048 |
150
- | **Max Grad Norm** | 1.0 (auto-tuned) |
151
- | **Total Steps** | 19,500 |
152
- | **GPU** | A10G (24GB) |
153
- | **Dtype** | float32 |
154
- | **Curriculum** | 4-tier (easy → medium → hard → reasoning), current: **hard** |
155
- | **Dynamic Mix** | Adaptive per-category sampling weights, applied via WeightedBucketSampler |
156
- | **Tokenization** | Disk-cached (SHA-256 keyed), no re-tokenization per epoch |
157
- | **Filtering** | StrictFilter: URL/HTML/emoji stripping + char ratio + language score + repetition + min words |
158
- | **Batching** | BucketDataset: groups by length for efficient padding |
159
-
160
- ### Training Algorithms
161
-
162
- | Algorithm | Status | Description |
163
- |-----------|--------|-------------|
164
- | Curriculum Learning | Active | 4-tier easy→hard progression by text length |
165
- | Dynamic Sampling | Active | Adaptive category mix based on per-category loss |
166
- | Difficulty Scheduling | Active | 4 phases: Token Learning → Syntax → Reasoning → Expert |
167
- | MoE Balancing | Active | Bias-based load-balanced routing |
168
- | AutoTuner | Active | AI adaptive hyperparameter control |
169
- | MTP | Active | Auxiliary multi-token prediction loss |
170
- | Curriculum Scheduling | Active | Loss-based adaptive difficulty |
171
- | Reflection Training | **Active** | High-loss example replay (1,500 stored) |
172
- | Memory Algorithms | **Active** | 1,500 stored, avg loss 10.1 |
173
- | Tool Routing | **Active** | Code=706, Math=205, Formal=587 routed |
174
- | Synthetic Evolution | Inactive | Model-generated training data (potential A10G bottleneck) |
175
-
176
- ## AutoTuner State (Step 19,500)
177
-
178
- | Metric | Step 18,000 | Step 19,500 | Change |
179
- |--------|-------------|-------------|--------|
180
- | **Phase** | Balanced | **Exploitation** | ↑ aggressiveness |
181
- | **LR Multiplier** | 0.78× | **0.64×** | ↓ 18% |
182
- | **Grad Norm Multiplier** | 0.76× | **0.64×** | ↓ 16% |
183
- | **Weight Decay Mult** | 1.60× | **active** | regularization |
184
- | **Best (smoothed loss)** | 28.84 | **4.01** | ↓ (different scale) |
185
- | **Best Eval Loss** | 56.07 | **56.07** | - (no new eval) |
186
- | **Adjustments Made** | - | **152** | learned control |
187
- | **Degeneracy Reductions** | - | **2** | prevented divergence |
188
- | **Cosine Similarity EMA** | - | **0.15** | moderate direction stability |
189
- | **Gradient Noise EMA** | - | **0.10** | low noise |
190
- | **Gradient Norm (avg)** | - | **3.45** | well-controlled |
191
- | **Diagnosis** | Overfitting | **Exploitation** | phase-consistent |
192
- | **Plan Strategy** | Regularize | **regularization ongoing** | - |
193
- | **Plan Accuracy** | 0.04 | - | exploratory phase |
194
- | **Trajectory Slope** | +1.85 (r²=0.08) | - | high variance |
195
- | **Mix Weights** | short=5.6%, med=40.8%, long=30.4%, vlong=23.2% | **short=43.3%, med=24.5%, long=18.3%, vlong=13.9%** | shifted to short |
196
- | **Curriculum** | medium | **hard** | ↑ difficulty |
197
-
198
- The AutoTuner entered the **exploitation** phase at step 19,500: it reduces the LR to 6.35e-5 (0.64× base) and the gradient clip to 0.64×, and increases weight decay for regularization. The multi-timescale EMAs (short=10.3, med=10.3, long=10.3) indicate stable convergence in the underlying dynamics, even though curriculum tier transitions cause high surface loss variance.
199
-
200
- ## Routing Activity (Step 19,500)
201
-
202
- | Route | Count | Avg Performance |
203
- |-------|-------|----------------|
204
- | Code | 706 | 10.36 |
205
- | Math | 205 | 10.29 |
206
- | Formal | 587 | 10.36 |
207
- | Creative | 1 | 10.83 |
208
- | Dialog | 1 | 9.48 |
209
-
210
- Routing algorithms are actively classifying training examples by type, with code and formal reasoning dominating the mix.
211
-
212
- ## Data Pipeline (New in v2)
213
-
214
- - **StrictFilter**: Multi-layer text quality filter: URL/HTML/emoji stripping → char ratio ≥0.25 → language score ≥0.001 → 4-gram repetition ≤0.4 → min 10 words
215
- - **TokenCache**: SHA-256 keyed disk cache; tokenize once per unique text, no re-tokenization across epochs
216
- - **BucketDataset**: Groups texts by similar length (bucket_size=64) to minimize padding waste
217
- - **WeightedBucketSampler**: Importance sampling by category weights, synced from DynamicSampler every 500 steps
218
-
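As an illustration of the TokenCache idea, the sketch below keys each text by its SHA-256 digest and stores the token ids on disk, so a repeated text is tokenized only once. The class layout, the JSON file format, and the `tokenize` callable are assumptions for the example; the repository's `data_pipeline.py` may store tokens differently.

```python
import hashlib
import json
from pathlib import Path

class TokenCache:
    """Disk cache mapping SHA-256(text) to a list of token ids."""

    def __init__(self, cache_dir, tokenize):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.tokenize = tokenize  # hypothetical callable: str -> list[int]

    def get(self, text: str):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        path = self.cache_dir / f"{key}.json"
        if path.exists():
            # Cache hit: reuse ids from a previous epoch or run.
            return json.loads(path.read_text())
        ids = self.tokenize(text)
        path.write_text(json.dumps(ids))
        return ids
```

Keying by content hash rather than by dataset index means the cache stays valid when the sampling order or category mix changes between epochs.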
219
- ## Files
220
-
221
- | File | Description |
222
- |------|-------------|
223
- | `model.pt` | Model weights (~105M params, 419MB), **step 19,500** |
224
- | `best.pt` | Best checkpoint by eval loss |
225
- | `training_state.json` | Full training state including AutoTuner state |
226
- | `tokenizer.json` | BBPE tokenizer (32K vocab) |
227
- | `tokenizer_config.json` | Tokenizer configuration |
228
- | `config.yaml` | Model configuration (DeeplmConfig defaults) |
229
- | `training_curve_8k_20k.png` | Updated training curves: step 8,010 → 19,690 |
230
-
231
- ## Usage
232
 
233
  ```python
234
- import torch
 
235
  from deeplm.config import DeeplmConfig
236
  from deeplm.model.deeplm import DeeplmModel
 
 
237
 
 
238
  config = DeeplmConfig()
 
 
239
  model = DeeplmModel(config)
240
- model.load_state_dict(torch.load("model.pt", map_location="cpu"), strict=False)
241
- model.eval()
242
-
243
- input_ids = torch.tensor([[1, 2, 3]])
244
- output = model.generate(
245
- input_ids,
246
- max_new_tokens=128,
247
- do_sample=True,
248
- temperature=0.7,
249
- top_k=50,
250
- top_p=0.9,
251
- )
252
- print(output)
253
  ```
3
  license: apache-2.0
4
  library_name: transformers
5
  tags:
6
+ - pytorch
7
+ - safetensors
8
+ - deeplm
9
+ - bitnet
10
+ - moe
11
+ - mla
12
+ - mtp
13
+ - indonesian
14
+ pipeline_tag: text-generation
 
 
15
  ---
16
 
17
+ # Deeplm: 108M BitNet MoE Language Model
18
 
19
+ Deeplm is a ~105M-parameter language model with **BitNet b1.58 ternary quantization** from the start, inspired by the **DeepSeek V4**, **Kimi K2.6**, and **MiniMax M2.7** architectures.
20
 
21
+ ## 🏗️ Architecture
22
 
23
+ | Component | Detail |
24
+ |---|---|
25
+ | **Total Parameters** | ~104.7M |
26
+ | **Architecture** | Decoder-only Transformer |
27
+ | **Layers** | 10 |
28
+ | **Hidden Size** | 512 |
29
+ | **Vocab Size** | 32,000 (BPETokenizer) |
30
+ | **Max Seq Length** | 4,096 |
31
+ | **Attention Heads** | 8 (MQA, 1 KV head) |
32
+ | **Quantization** | BitNet b1.58 ternary {-1, 0, +1}, absmean |
33
+ | **Dtype** | float32 (weights quantized to ternary) |
34
+
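For readers unfamiliar with the absmean scheme named in the Quantization row, the following is a minimal sketch of ternary quantization on a flat list of weights: scale by the mean absolute value, round, and clip to {-1, 0, +1}. The function name and the `eps` value are assumptions; the repository's `bitnet_quantize.py` operates on tensors and may differ in detail.

```python
def absmean_ternary(weights, eps=1e-5):
    """Quantize weights to {-1, 0, +1} using the mean absolute value as scale."""
    scale = sum(abs(w) for w in weights) / len(weights)
    # Round to the nearest integer, then clip into the ternary range.
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

q, scale = absmean_ternary([0.9, -0.05, 0.4, -1.2])
# Approximate reconstruction of the original weights.
dequantized = [v * scale for v in q]
```

Small weights collapse to 0 and large ones saturate at ±1, so only the sign pattern and a single scale per weight group are kept.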
35
+ ## ✨ Innovative Features
36
+
37
+ | Feature | Source | Description |
38
+ |---|---|---|
39
+ | **MLA** | DeepSeek V4 | Multi-head Latent Attention, KV cache compression 24x |
40
+ | **MoE** | DeepSeek V4 + Kimi K2.6 | 4 routed + 1 shared expert, top-k=2 |
41
+ | **Hybrid Attention** | MiniMax M2.7 | Softmax + Lightning v2 linear attention |
42
+ | **Hyper-Connections** | DeepSeek V4 | Sinkhorn routing, replaces standard residual connections |
43
+ | **MTP** | DeepSeek V4 | Multi-Token Prediction, depth=2 |
44
+ | **BitNet b1.58** | BitNet | Ternary quantization {-1, 0, +1} from initialization |
45
+ | **AutoTuner** | Deeplm | Adaptive LR, GN, WD, momentum, revive, trajectory prediction |
46
+ | **Curriculum Router** | Deeplm | Phase-based category weighting |
47
+ | **Self-Evolution** | MiniMax M2.7 | Autonomous hypothesis → experiment → decision loop |
48
+
49
+ ## 📊 Model Specification
50
+
51
+ ```json
52
+ {
53
+ "architectures": ["DeeplmModel"],
54
+ "model_type": "deeplm",
55
+ "vocab_size": 32000,
56
+ "hidden_size": 512,
57
+ "intermediate_size": 2048,
58
+ "num_hidden_layers": 10,
59
+ "num_attention_heads": 8,
60
+ "num_key_value_heads": 1,
61
+ "max_position_embeddings": 4096,
62
+ "rms_norm_eps": 1e-06,
63
+ "rope_theta": 50000.0,
64
+ "rope_dim": 64,
65
+ "tie_word_embeddings": true,
66
+ "num_routed_experts": 4,
67
+ "num_shared_experts": 1,
68
+ "expert_topk": 2,
69
+ "q_lora_rank": 192,
70
+ "kv_lora_rank": 64,
71
+ "qk_rope_head_dim": 64,
72
+ "qk_nope_head_dim": 64,
73
+ "v_head_dim": 128,
74
+ "mtp_depth": 2,
75
+ "mtp_num_layers": 2,
76
+ "bitnet_quantized": true,
77
+ "bitnet_scale": "absmean"
78
+ }
79
+ ```
80
 
81
+ ## 🚀 Usage
82
 
83
+ ### Inference
84
 
85
  ```python
86
+ import sys
87
+ sys.path.insert(0, "deeplm")
88
  from deeplm.config import DeeplmConfig
89
  from deeplm.model.deeplm import DeeplmModel
90
+ from safetensors.torch import load_file
91
+ import torch
92
 
93
+ # Load config
94
  config = DeeplmConfig()
95
+
96
+ # Build model
97
  model = DeeplmModel(config)
98
+
99
+ # Load BitNet quantized weights
100
+ state_dict = load_file("model.safetensors")
101
+ model.load_state_dict(state_dict, strict=False)
102
+
103
+ # Generate
104
+ input_ids = torch.tensor([[1, 2, 3]]) # bos + tokens
105
+ output = model.generate(input_ids, max_new_tokens=100, temperature=0.7)
106
+ ```
107
+
108
+ ### Training
109
+
110
+ ```bash
111
+ # Install dependencies
112
+ pip install torch datasets tokenizers pyyaml einops huggingface-hub safetensors
113
+
114
+ # Train with all features
115
+ python train.py --batch_size 3 --grad_accum 2 --max_steps 31250
116
+
117
+ # Custom config
118
+ python train.py \
119
+ --max_steps 100000 \
120
+ --batch_size 4 \
121
+ --seq_len 512 \
122
+ --lr 3e-4 \
123
+ --no_auto_tuner
124
+ ```
125
+
126
+ ## 📁 Project Structure
127
+
128
  ```
129
+ deeplm-108m/
130
+ ├── config.json                  # Model config
131
+ ├── generation_config.json       # Generation params
132
+ ├── model.safetensors            # BitNet quantized weights (419MB)
133
+ ├── tokenizer.json               # BPETokenizer
134
+ ├── tokenizer_config.json        # Tokenizer config
135
+ ├── train.py                     # Training script (all features)
136
+ ├── init_model.py                # Model initialization script
137
+ ├── deeplm_modal.py              # Modal.com build script
138
+ └── deeplm/                      # Source code
139
+     ├── config.py                # Dataclass configs
140
+     ├── model/
141
+     │   ├── deeplm.py            # Main model
142
+     │   ├── mla.py               # Multi-head Latent Attention
143
+     │   ├── moe.py               # Mixture of Experts
144
+     │   ├── hybrid_attention.py  # Softmax + Lightning
145
+     │   ├── hyper_connections.py # Sinkhorn routing
146
+     │   ├── mtp.py               # Multi-Token Prediction
147
+     │   └── transformer_block.py
148
+     ├── training/
149
+     │   ├── trainer.py           # Training loop
150
+     │   ├── auto_tuner.py        # Adaptive training controller
151
+     │   ├── curriculum_router.py # Phase-based routing
152
+     │   ├── data_pipeline.py     # Bucket dataset + sampler
153
+     │   ├── logger.py            # SmartLogger + anomaly detection
154
+     │   └── control/             # TrainingControl plane
155
+     ├── self_evolution/
156
+     │   └── framework.py         # Autonomous evolution loop
157
+     └── quantization/
158
+         ├── bitnet_quantize.py   # BitNet b1.58
159
+         └── gguf_export.py
160
+ ```
161
+
162
+ ## 📈 Training
163
+
164
+ | Parameter | Value |
165
+ |---|---|
166
+ | **Dataset** | afrizalha/KamusOne-28M-Indonesian |
167
+ | **Optimizer** | AdamW (β1=0.9, β2=0.95, ε=1e-8) |
168
+ | **LR** | 6e-4 (cosine, warmup=150) |
169
+ | **Batch Size** | 3 x grad_accum=2 = 6 effective |
170
+ | **Weight Decay** | 0.1 |
171
+ | **Max Grad Norm** | 1.0 |
172
+ | **Max Steps** | 31,250 |
173
+
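The LR row above (6e-4, cosine, warmup=150) together with the 31,250 max steps can be sketched as a schedule function. The linear warmup shape, the decay to `min_lr=0.0`, and the function name are assumptions for illustration; the actual `trainer.py` schedule, and any AutoTuner multiplier applied on top of it, may differ.

```python
import math

def lr_at(step, base_lr=6e-4, warmup_steps=150, max_steps=31_250, min_lr=0.0):
    """Linear warmup followed by cosine decay (assumed shape)."""
    if step < warmup_steps:
        # Ramp up linearly and reach base_lr on the last warmup step.
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

peak = lr_at(150)       # start of the cosine phase
middle = lr_at(15_700)  # halfway through the decay
final = lr_at(31_250)   # end of training
```

Under these assumptions the rate peaks at the end of warmup, falls to half the base rate at the midpoint of the decay, and reaches `min_lr` at the final step.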
174
+ ## 📄 License
175
+
176
+ Apache 2.0
177
+
178
+ ## 🙏 Acknowledgments
179
+
180
+ Architecture inspired by:
181
+ - **DeepSeek V4**: MLA, Hyper-Connections, MTP, MoE routing
182
+ - **Kimi K2.6**: Shared Expert, Agent Swarm
183
+ - **MiniMax M2.7**: Self-Evolution Framework, Hybrid Attention, Agent Harness
184
+ - **BitNet**: b1.58 ternary quantization