# AAM Diffusion LLM Framework

> **"AAM = 1 Mind + 1 Body"**

A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.

---

## Philosophy

### Why Not a General-Purpose LLM?

The earlier concept, "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)", was **badly wrong**.

| Aspect | General-Purpose LLM (Rented) | AAM Diffusion LLM (Owned) |
|--------|------------------------------|---------------------------|
| Input | Text prompt | Graph conditioning (evidence, anomaly, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | Trained not to: narrates only what the graph knows |
| Purpose | General purpose | Specialized in composing sentences from a graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM itself |

### Why Diffusion (Not Autoregressive)?

1. **Non-sequential.** It can revise earlier parts while generating later parts, much like how Jin Soun forms a thought: vague → clearer → explicit.

2. **Graph conditioning.** The whole graph can be encoded as conditioning, not just as a prefix. A decoder-only autoregressive model attends only to its prompt prefix and "what has already been generated."

3. **Coherent long-form.** Diffusion is expected to produce more coherent long narratives because every token "knows about" every other token during denoising.

4. **Anti-hallucination.** The model is trained SPECIFICALLY for Graph→Narrative, so it is not meant to invent information outside the graph.
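
The difference between the two approaches largely comes down to which positions each token may attend to. The following is a minimal sketch, not code from this framework (the function names are illustrative assumptions): it contrasts the causal mask of an autoregressive decoder with the full mask a diffusion denoiser can use.

```python
def causal_mask(n: int) -> list[list[bool]]:
    """Autoregressive: position i may attend only to positions j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def full_mask(n: int) -> list[list[bool]]:
    """Diffusion denoiser: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_counts(mask: list[list[bool]]) -> list[int]:
    """Number of positions each token can see under the given mask."""
    return [sum(row) for row in mask]


print(visible_counts(causal_mask(4)))  # [1, 2, 3, 4]
print(visible_counts(full_mask(4)))    # [4, 4, 4, 4]
```

Under the full mask, the first token sees the last one at every denoising step, which is what allows early parts of a narrative to be revised late in generation.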

---

## Architecture

```
┌───────────────────────────────────────────────────────────┐
│  AAM = 1 Mind + 1 Body                                    │
│                                                           │
│  Mind = RSVS Knowledge Graph                              │
│    - Structural memory: remembers EVERYTHING              │
│    - Relational: understands links between concepts       │
│    - Perfect recall: never forgets                        │
│    - Confidence scores: knows certain vs. uncertain       │
│                                                           │
│  Body = AAM Diffusion LLM                                 │
│    ┌─────────────────────────────────────────────┐        │
│    │  Graph Conditioning Encoder                 │        │
│    │  ├─ Evidence Node Encoder                   │        │
│    │  ├─ Composition Encoder                     │        │
│    │  ├─ Anomaly Encoder                         │        │
│    │  ├─ Reasoning Chain Encoder                 │        │
│    │  ├─ Confidence Embedding                    │        │
│    │  ├─ Temporal Embedding                      │        │
│    │  └─ Graph Attention Layers                  │        │
│    │         ↓ (cross-attention keys/values)     │        │
│    ├─────────────────────────────────────────────┤        │
│    │  Diffusion Transformer (Denoiser)           │        │
│    │  ├─ Token Embedding                         │        │
│    │  ├─ Timestep Embedding (sinusoidal)         │        │
│    │  ├─ N × TransformerBlock:                   │        │
│    │  │   ├─ AdaptiveLayerNorm + Self-Attention  │        │
│    │  │   ├─ AdaptiveLayerNorm + Cross-Attention │        │
│    │  │   └─ AdaptiveLayerNorm + Feed-Forward    │        │
│    │  └─ Output Projection                       │        │
│    │         ↓ (predicted noise)                 │        │
│    ├─────────────────────────────────────────────┤        │
│    │  Noise Scheduler                            │        │
│    │  ├─ Forward: x_0 + noise → x_t              │        │
│    │  └─ Reverse: x_t → denoise → x_{t-1}        │        │
│    └─────────────────────────────────────────────┘        │
│                                                           │
│  Training: Graph→Narrative pairs                          │
│  Inference: Noise → N denoising steps → Narrative         │
└───────────────────────────────────────────────────────────┘
```
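
The "Forward: x_0 + noise → x_t" line of the Noise Scheduler can be made concrete with a small sketch. It assumes Gaussian diffusion over continuous token embeddings and a cosine schedule with offset `s = 0.008`; the function names are hypothetical and are not taken from `noise_scheduler.py`.

```python
import math
import random


def cosine_alpha_bar(t: int, n_timesteps: int, s: float = 0.008) -> float:
    """Cumulative signal level alpha_bar(t) under a cosine schedule."""
    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return f(t / n_timesteps) / f(0.0)


def add_noise(x0: list[float], eps: list[float], alpha_bar: float) -> list[float]:
    """Forward process: x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + b * e for x, e in zip(x0, eps)]


def recover_x0(xt: list[float], eps: list[float], alpha_bar: float) -> list[float]:
    """Invert the forward process when the noise is known (or predicted)."""
    a, b = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - b * e) / a for x, e in zip(xt, eps)]


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                    # stand-in for one token embedding
eps = [rng.gauss(0.0, 1.0) for _ in x0]  # Gaussian noise
ab = cosine_alpha_bar(500, 1000)         # halfway through the schedule
xt = add_noise(x0, eps, ab)
x0_hat = recover_x0(xt, eps, ab)         # matches x0 up to rounding error
```

During training the denoiser would see `xt`, the timestep, and the graph conditioning, and would be asked to predict `eps`; `recover_x0` shows why a good noise prediction is enough to reconstruct the clean embedding.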

---

## Folder Structure

```
diffusion_llm/
├── __init__.py                 # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py         # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py        # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py      # Forward/reverse diffusion process
│   ├── graph_encoder.py        # Graph conditioning encoder
│   ├── diffusion_transformer.py # Core denoising transformer
│   └── aam_diffusion_model.py  # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py               # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py              # GraphNarrative dataset
│   └── trainer.py              # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py            # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py  # Synthetic training data
│   └── data_pipeline.py        # Data preparation pipeline
├── scripts/
│   ├── train.py                # Training entry point
│   ├── evaluate.py             # Evaluation & generation
│   └── export.py               # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py       # Noise scheduler tests
│   └── test_model.py           # Model component tests
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```

---

## Quick Start

### 1. Install Dependencies

```bash
pip install torch numpy pytest
```

### 2. Generate Synthetic Data

```python
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator

generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
    output_dir="./data",
    n_train=10000,
    n_val=500,
)
```

### 3. Train the Model

```bash
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100

# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
```

### 4. Generate Narratives

```bash
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate

# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
```

### 5. Programmatic Usage

```python
from diffusion_llm import (
    AamDiffusionConfig, get_default_config,
    AamDiffusionModel, AamTokenizer, AamGenerator,
)

# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")

# Create generator
generator = AamGenerator(model, tokenizer, config)

# Generate narrative from graph conditioning
result = generator.generate(
    trigger="Siapa yang mencuri Snow Plum Pill?",
    evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
    reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
    source_trust=0.85,
)

print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
```

---

## Model Sizes

| Size | d_model | Layers | Heads | Params | Recommended For |
|------|---------|--------|-------|--------|----------------|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| **base** | **768** | **12** | **12** | **~170M** | **Recommended for training** |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |

---

## Configuration

### Model Config

```python
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig

config = AamDiffusionConfig(
    model=ModelConfig(
        d_model=768,        # Hidden dimension
        n_layers=12,        # Transformer blocks
        n_heads=12,         # Attention heads
        d_ff=3072,          # Feed-forward dimension
        vocab_size=32000,   # Vocabulary size
        max_seq_len=512,    # Maximum sequence length
    ),
    diffusion=DiffusionConfig(
        n_timesteps=1000,   # Training timesteps
        n_inference_steps=50,  # Inference steps (fewer = faster)
        schedule_type="cosine",  # Noise schedule
        prediction_type="epsilon",  # Predict noise
        sampling_method="ddim",  # Fast deterministic sampling
    ),
)
```
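
To illustrate how `n_timesteps=1000`, `n_inference_steps=50`, and `sampling_method="ddim"` could fit together, here is a hedged sketch of deterministic DDIM sampling (eta = 0) over an evenly spaced subset of the training timesteps. It is not the framework's sampler: `predict_eps` is a hypothetical stand-in for the denoiser (graph conditioning omitted), and clamping `alpha_bar` at `1e-5` is an assumption made here for numerical stability.

```python
import math


def alpha_bar(t: int, n_timesteps: int, s: float = 0.008) -> float:
    """Cosine-schedule cumulative signal level, clamped away from zero."""
    def f(u: float) -> float:
        return math.cos((u + s) / (1 + s) * math.pi / 2) ** 2
    return max(f(t / n_timesteps) / f(0.0), 1e-5)


def ddim_timesteps(n_timesteps: int, n_steps: int) -> list[int]:
    """Evenly spaced subset of training timesteps, noisiest first."""
    return [n_timesteps * k // n_steps for k in range(n_steps, 0, -1)]


def ddim_step(xt, eps_hat, ab_t, ab_prev):
    """One deterministic DDIM update from noise level ab_t to ab_prev."""
    a_t, b_t = math.sqrt(ab_t), math.sqrt(1.0 - ab_t)
    x0_hat = [(x - b_t * e) / a_t for x, e in zip(xt, eps_hat)]
    a_p, b_p = math.sqrt(ab_prev), math.sqrt(1.0 - ab_prev)
    return [a_p * x0 + b_p * e for x0, e in zip(x0_hat, eps_hat)]


def sample(predict_eps, x_start, n_timesteps=1000, n_steps=50):
    """Run the reverse process with any callable predict_eps(x_t, t)."""
    x = list(x_start)
    steps = ddim_timesteps(n_timesteps, n_steps)
    for t, t_prev in zip(steps, steps[1:] + [0]):
        eps_hat = predict_eps(x, t)
        x = ddim_step(x, eps_hat, alpha_bar(t, n_timesteps),
                      alpha_bar(t_prev, n_timesteps))
    return x


# Demo with an "oracle" predictor that knows the clean target, so the
# sampler should land back on it. A trained denoiser replaces this.
target = [0.5, -1.0, 2.0]


def oracle(xt, t):
    ab = alpha_bar(t, 1000)
    return [(x - math.sqrt(ab) * x0) / math.sqrt(1.0 - ab)
            for x, x0 in zip(xt, target)]


ab_T = alpha_bar(1000, 1000)
noise = [0.3, -0.7, 1.1]
x_T = [math.sqrt(ab_T) * x0 + math.sqrt(1.0 - ab_T) * e
       for x0, e in zip(target, noise)]
result = sample(oracle, x_T)
```

With 50 steps the sampler visits timesteps 1000, 980, ..., 20 and then jumps to the clean level, which is why fewer inference steps translate directly into fewer denoiser calls.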

### Inference Config

```python
from diffusion_llm.config.model_config import InferenceConfig

inference = InferenceConfig(
    n_steps=50,           # Denoising steps
    temperature=1.0,      # Sampling temperature
    top_k=50,             # Top-k sampling
    max_output_sentences=16,  # Max sentences
    language="id",        # Output language
)
```
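
How the generator applies `temperature` and `top_k` is not shown in this README. One common reading, sketched below purely as an assumption, is that they control how token ids are drawn from per-position logits when the denoised representation is mapped back to the vocabulary. The helper `top_k_sample` is hypothetical and not part of `diffusion_llm`.

```python
import math
import random


def top_k_sample(logits, top_k=50, temperature=1.0, rng=None):
    """Sample one index from the top_k highest logits after temperature scaling.

    temperature must be > 0; lower values concentrate mass on the best token.
    """
    rng = rng or random.Random()
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    kept = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in kept]
    # Subtract the max before exponentiating for numerical stability.
    m = max(scaled)
    weights = [math.exp(v - m) for v in scaled]
    return rng.choices(kept, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)
greedy = top_k_sample(logits, top_k=1, rng=rng)  # top_k=1 reduces to argmax
```

Setting `top_k=1` makes the choice deterministic, while larger `top_k` with `temperature=1.0` samples in proportion to the softmax over the kept logits.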

---

## Integration with the AAM Pipeline

This framework is designed to be the "body" of AAM. Once the model is trained,
integrating it with `pipeline.py` is straightforward:

```python
# In pipeline.py, replace the fallback:
from diffusion_llm import (
    AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator,
)

class AamPipeline:
    def __init__(self, ...):  # existing parameters elided
        # Load the trained diffusion model, tokenizer, and config
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)
```

---

## Training Data Format

Training data is stored as JSONL, one example per line (the example below is pretty-printed across several lines for readability):

```json
{
  "narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
  "trigger": "Siapa yang mencuri Snow Plum Pill?",
  "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
  "compositions": [],
  "confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
  "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
  "reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
  "source_trust": 0.85,
  "temporal_context": [],
  "language": "id",
  "source": "synthetic"
}
```
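
A file in this format can be read and sanity-checked with a few lines of standard-library Python. The sketch below is illustrative only: `parse_jsonl` is a hypothetical helper, and the choice of which fields are mandatory is an assumption, not the behavior of `dataset.py`.

```python
import json

# Assumption: these three fields are treated as mandatory; the real dataset
# loader may require more or fewer.
REQUIRED_KEYS = {"narrative", "trigger", "evidence_nodes"}


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines into a list of example dicts."""
    examples = []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_no}: missing {sorted(missing)}")
        trust = record.get("source_trust", 1.0)
        if not 0.0 <= trust <= 1.0:
            raise ValueError(f"line {line_no}: source_trust out of range: {trust}")
        examples.append(record)
    return examples


sample = {
    "narrative": "Diancang Five Swords mencuri Snow Plum Pill.",
    "trigger": "Siapa yang mencuri Snow Plum Pill?",
    "evidence_nodes": ["Hefei", "Diancang Five Swords"],
    "source_trust": 0.85,
}
# json.dumps without indent emits a single line, which is what JSONL expects.
examples = parse_jsonl([json.dumps(sample, ensure_ascii=False), ""])
print(len(examples), examples[0]["evidence_nodes"])
```

For a file on disk, the same function can be called as `parse_jsonl(f)` inside `with open(path, encoding="utf-8") as f:`, since a file object iterates line by line.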

---

## Running Tests

```bash
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v

# Run specific test
python -m pytest tests/test_model.py -v

# Run with coverage (requires the pytest-cov plugin)
python -m pytest tests/ --cov=diffusion_llm
```

---

## Roadmap

- [x] **Phase 1: Framework Design** - Architecture, config, interface
- [x] **Phase 2: Core Components** - Noise scheduler, transformer, graph encoder, tokenizer
- [x] **Phase 3: Training Infrastructure** - Trainer, dataset, loss functions, synthetic data
- [x] **Phase 4: Inference Pipeline** - Generator, batch generation, interactive mode
- [ ] **Phase 5: Training Execution** - Train on synthetic data, iterate
- [ ] **Phase 6: Real Data** - Collect real Graph→Narrative pairs from AAM usage
- [ ] **Phase 7: Optimization** - Quantization, distillation, flash attention
- [ ] **Phase 8: Integration** - Plug trained model into AAM pipeline

---

## Novel Analogy

> Jin Soun does not rent someone else's body in order to speak.
> Jin Soun's body is weak and third-rate, but it is Jin Soun's OWN.
> Because that body is trained specifically to execute commands from
> its own mind (not someone else's), its output is more focused
> than that of someone with a stronger body but a weaker mind.
>
> **AAM = 1 mind + 1 body. Not 1 mind + a rented body.**