# AAM Diffusion LLM Framework

> **"AAM = 1 Mind + 1 Body"**

A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.

---

## Philosophy

### Why Not a General-Purpose LLM?

The earlier concept, "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)", was **badly mistaken**.

| Aspect | General-Purpose LLM (Rented) | AAM Diffusion LLM (Self-Owned) |
|--------|------------------------------|--------------------------------|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN make things up | CANNOT by design: it only narrates what the graph knows |
| Purpose | General purpose | Specialized in composing sentences from a graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM itself |

### Why Diffusion (Not Autoregressive)?

1. **Non-sequential**: The model can revise early parts of the text while generating later parts. This mirrors how Jin Soun forms thoughts: vague → clearer → explicit.

2. **Graph conditioning**: The entire graph can be encoded as conditioning, not just as a prefix. An autoregressive model only sees "what has already been generated."

3. **Coherent long-form**: Diffusion produces more coherent text for long narratives because every token position is denoised with "knowledge" of all the other tokens.

4. **Anti-hallucination**: The model is trained SPECIFICALLY for Graph→Narrative, so it has no capability to invent information that lies outside the graph.
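
Point 2 can be made concrete with a toy cross-attention step. The sketch below is a minimal pure-Python illustration, not the framework's implementation: it assumes single-head attention with no learned projections, and the names `cross_attention`, `token_states`, and `graph_nodes` are hypothetical. It shows that every token position reads from every encoded graph node at once, rather than from a text prefix.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(token_states, graph_nodes):
    """Toy single-head cross-attention (assumption: no learned projections).

    token_states: list of d-dim vectors, one per noisy token position.
    graph_nodes:  list of d-dim vectors, one per encoded graph node.
    Returns (outputs, weights); weights[i][j] is how strongly token i
    attends to graph node j.
    """
    d = len(graph_nodes[0])
    outputs, weights = [], []
    for q in token_states:
        # Scaled dot-product score of this token against every graph node.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in graph_nodes]
        w = softmax(scores)
        # Weighted sum of node vectors (keys double as values here).
        out = [sum(w[j] * graph_nodes[j][c] for j in range(len(graph_nodes)))
               for c in range(d)]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Three hypothetical graph-node embeddings and two token positions.
graph_nodes = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
token_states = [[2.0, 0.0], [0.0, 2.0]]
outputs, weights = cross_attention(token_states, graph_nodes)
```

In a real denoiser the queries, keys, and values would come from learned linear projections and multiple heads; the point here is only that the conditioning is visible to all positions in a single step.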

---

## Architecture

```
┌──────────────────────────────────────────────────────────┐
│  AAM = 1 Mind + 1 Body                                    │
│                                                           │
│  Mind = RSVS Knowledge Graph                              │
│    - Structural memory: remembers EVERYTHING              │
│    - Relational: understands links between concepts       │
│    - Perfect recall: never forgets                        │
│    - Confidence scores: knows certain vs. uncertain       │
│                                                           │
│  Body = AAM Diffusion LLM                                 │
│    ┌─────────────────────────────────────────────┐        │
│    │  Graph Conditioning Encoder                   │        │
│    │  ├─ Evidence Node Encoder                     │        │
│    │  ├─ Composition Encoder                       │        │
│    │  ├─ Anomaly Encoder                           │        │
│    │  ├─ Reasoning Chain Encoder                   │        │
│    │  ├─ Confidence Embedding                      │        │
│    │  ├─ Temporal Embedding                        │        │
│    │  └─ Graph Attention Layers                    │        │
│    │         ↓ (cross-attention keys/values)       │        │
│    ├─────────────────────────────────────────────┤        │
│    │  Diffusion Transformer (Denoiser)             │        │
│    │  ├─ Token Embedding                           │        │
│    │  ├─ Timestep Embedding (sinusoidal)           │        │
│    │  ├─ N × TransformerBlock:                     │        │
│    │  │   ├─ AdaptiveLayerNorm + Self-Attention    │        │
│    │  │   ├─ AdaptiveLayerNorm + Cross-Attention   │        │
│    │  │   └─ AdaptiveLayerNorm + Feed-Forward      │        │
│    │  └─ Output Projection                         │        │
│    │         ↓ (predicted noise)                   │        │
│    ├─────────────────────────────────────────────┤        │
│    │  Noise Scheduler                              │        │
│    │  ├─ Forward: x_0 + noise → x_t                │        │
│    │  └─ Reverse: x_t → denoise → x_{t-1}         │        │
│    └─────────────────────────────────────────────┘        │
│                                                           │
│  Training: Graph→Narrative pairs                          │
│  Inference: Noise → N denoising steps → Narrative         │
└──────────────────────────────────────────────────────────┘
```
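
The noise scheduler's forward process and the "predicted noise" training target in the diagram can be sketched in a few lines. This is a hedged illustration, not the code in `noise_scheduler.py`: it assumes continuous diffusion over token embeddings and the widely used cosine formula for the cumulative signal fraction. The names `cosine_alpha_bar`, `q_sample`, and `mse` are hypothetical.

```python
import math
import random

def cosine_alpha_bar(t, n_timesteps, s=0.008):
    """Cumulative signal fraction alpha_bar(t) under a cosine schedule."""
    def f(u):
        return math.cos((u / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def q_sample(x0, t, n_timesteps, rng):
    """Forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    a = cosine_alpha_bar(t, n_timesteps)
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

def mse(pred, target):
    """Mean squared error between predicted and true noise."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]            # stand-in for one clean token embedding
xt, eps = q_sample(x0, t=500, n_timesteps=1000, rng=rng)

# Placeholder "denoiser" that always predicts zero noise. Training would
# adjust a real denoiser so that this loss shrinks.
eps_pred = [0.0] * len(x0)
loss = mse(eps_pred, eps)
```

Given the true noise, the clean sample is exactly recoverable from `xt`, which is why predicting the noise is enough to run the reverse process.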

---

## Folder Structure

```
diffusion_llm/
├── __init__.py                 # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py         # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py        # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py      # Forward/reverse diffusion process
│   ├── graph_encoder.py        # Graph conditioning encoder
│   ├── diffusion_transformer.py # Core denoising transformer
│   └── aam_diffusion_model.py  # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py               # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py              # GraphNarrative dataset
│   └── trainer.py              # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py            # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py  # Synthetic training data
│   └── data_pipeline.py        # Data preparation pipeline
├── scripts/
│   ├── train.py                # Training entry point
│   ├── evaluate.py             # Evaluation & generation
│   └── export.py               # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py       # Noise scheduler tests
│   └── test_model.py           # Model component tests
├── requirements.txt            # Python dependencies
└── README.md                   # This file
```

---

## Quick Start

### 1. Install Dependencies

```bash
pip install torch numpy pytest
```

### 2. Generate Synthetic Data

```python
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator

generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
    output_dir="./data",
    n_train=10000,
    n_val=500,
)
```

### 3. Train the Model

```bash
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100

# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
```

### 4. Generate Narratives

```bash
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate

# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
```

### 5. Programmatic Usage

```python
from diffusion_llm import (
    AamDiffusionConfig, get_default_config,
    AamDiffusionModel, AamTokenizer, AamGenerator,
)

# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")

# Create generator
generator = AamGenerator(model, tokenizer, config)

# Generate narrative from graph conditioning
result = generator.generate(
    trigger="Siapa yang mencuri Snow Plum Pill?",
    evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
    reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
    source_trust=0.85,
)

print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
```

---

## Model Sizes

| Size | d_model | Layers | Heads | Params | Recommended For |
|------|---------|--------|-------|--------|----------------|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| **base** | **768** | **12** | **12** | **~170M** | **Recommended for training** |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
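
As a rough sanity check on the `Params` column, the weight count of the transformer core can be estimated from its dimensions. The sketch below is a back-of-the-envelope estimate under stated assumptions, not how the table was produced: it assumes an untied token embedding and output projection, four `d_model × d_model` matrices per attention sublayer, `d_ff=3072` and `vocab_size=32000` for the `base` row, and it ignores biases, normalization layers, timestep embeddings, and the graph encoder. `estimate_params` is a hypothetical helper.

```python
def estimate_params(d_model, n_layers, d_ff, vocab_size):
    """Rough weight count; ignores biases, norms, and the graph encoder."""
    embeddings = 2 * vocab_size * d_model      # token embedding + output projection
    attention = 2 * 4 * d_model * d_model      # self- and cross-attention (Q, K, V, O)
    feed_forward = 2 * d_model * d_ff          # two linear layers
    return embeddings + n_layers * (attention + feed_forward)

base = estimate_params(d_model=768, n_layers=12, d_ff=3072, vocab_size=32000)
print(f"base: ~{base / 1e6:.0f}M parameters")  # prints "base: ~162M parameters"
```

The estimate is in the same range as the ~170M listed for `base`; the components left out of this sketch would plausibly account for the difference.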

---

## Configuration

### Model Config

```python
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig

config = AamDiffusionConfig(
    model=ModelConfig(
        d_model=768,        # Hidden dimension
        n_layers=12,        # Transformer blocks
        n_heads=12,         # Attention heads
        d_ff=3072,          # Feed-forward dimension
        vocab_size=32000,   # Vocabulary size
        max_seq_len=512,    # Maximum sequence length
    ),
    diffusion=DiffusionConfig(
        n_timesteps=1000,   # Training timesteps
        n_inference_steps=50,  # Inference steps (fewer = faster)
        schedule_type="cosine",  # Noise schedule
        prediction_type="epsilon",  # Predict noise
        sampling_method="ddim",  # Fast deterministic sampling
    ),
)
```
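
To show how `n_timesteps=1000`, `n_inference_steps=50`, `prediction_type="epsilon"`, and `sampling_method="ddim"` fit together, here is a minimal sketch of deterministic DDIM sampling (eta = 0). It is an illustration under assumptions, not the framework's sampler: the cosine formula, the even timestep spacing, and the names `inference_timesteps` and `ddim_step` are all hypothetical, and the true noise stands in for the model's prediction.

```python
import math

def cosine_alpha_bar(t, n_timesteps, s=0.008):
    """Cumulative signal fraction alpha_bar(t) under a cosine schedule."""
    def f(u):
        return math.cos((u / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def inference_timesteps(n_timesteps, n_inference_steps):
    """Evenly spaced subset of training timesteps, highest noise first."""
    stride = n_timesteps // n_inference_steps
    return list(range(n_timesteps - 1, -1, -stride))[:n_inference_steps]

def ddim_step(x_t, eps_pred, t, t_prev, n_timesteps):
    """One deterministic DDIM update (eta = 0) for epsilon prediction."""
    a_t = cosine_alpha_bar(t, n_timesteps)
    # t_prev < 0 marks the final step, where the target is the clean sample.
    a_prev = cosine_alpha_bar(t_prev, n_timesteps) if t_prev >= 0 else 1.0
    # Clean sample implied by the predicted noise.
    x0_pred = [(x - math.sqrt(1 - a_t) * e) / math.sqrt(a_t)
               for x, e in zip(x_t, eps_pred)]
    # Re-noise that estimate to the lower noise level of t_prev.
    return [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * e
            for x0, e in zip(x0_pred, eps_pred)]

steps = inference_timesteps(1000, 50)
x0 = [1.0, -2.0]                       # toy "clean" sample
eps = [0.3, -0.7]                      # toy noise, used as an oracle prediction
a = cosine_alpha_bar(steps[0], 1000)
x = [math.sqrt(a) * v + math.sqrt(1 - a) * e for v, e in zip(x0, eps)]
for i, t in enumerate(steps):
    t_prev = steps[i + 1] if i + 1 < len(steps) else -1
    x = ddim_step(x, eps, t, t_prev, 1000)   # a trained model would predict eps here
```

With an oracle noise prediction the loop recovers `x0` up to rounding error. With a learned denoiser each step only approximates this, which is why fewer inference steps trade quality for speed.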

### Inference Config

```python
from diffusion_llm.config.model_config import InferenceConfig

inference = InferenceConfig(
    n_steps=50,           # Denoising steps
    temperature=1.0,      # Sampling temperature
    top_k=50,             # Top-k sampling
    max_output_sentences=16,  # Max sentences
    language="id",        # Output language
)
```
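
The README does not say where `temperature` and `top_k` act inside the diffusion pipeline. One plausible reading, shown here purely as an assumption, is that they apply when per-position scores over the vocabulary are turned into concrete tokens. `sample_top_k` is a hypothetical helper, not part of the framework's API.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one index from the top_k highest logits after temperature scaling."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    k = min(top_k, len(logits))
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    # Unnormalized softmax weights; random.choices normalizes them.
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]

rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]
picks = {sample_top_k(logits, temperature=1.0, top_k=2, rng=rng) for _ in range(200)}
```

With `top_k=2` only the two highest-scoring indices (3 and 0) can ever be drawn. Lower temperatures concentrate the draws on index 3, and higher temperatures flatten the choice between the two.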

---

## Integration with the AAM Pipeline

This framework is designed to be the "body" of AAM. Once the model is trained,
integrating it with `pipeline.py` is straightforward:

```python
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator

class AamPipeline:
    def __init__(self, *args, **kwargs):  # existing constructor arguments elided
        # Load trained diffusion model
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)
```

---

## Training Data Format

Training data is stored as JSONL, one example per line (pretty-printed below for readability):

```json
{
  "narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
  "trigger": "Siapa yang mencuri Snow Plum Pill?",
  "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
  "compositions": [],
  "confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
  "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
  "reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
  "source_trust": 0.85,
  "temporal_context": [],
  "language": "id",
  "source": "synthetic"
}
```
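
A small loader makes the format easier to check before training. The sketch below is an illustration only, not the code in `dataset.py`: it assumes that `narrative`, `trigger`, and `evidence_nodes` are mandatory and that trust and confidence values lie in [0, 1]. `parse_example` and `iter_examples` are hypothetical names.

```python
import json

# Assumption: these three fields are mandatory; the framework may differ.
REQUIRED_FIELDS = ("narrative", "trigger", "evidence_nodes")

def parse_example(line):
    """Parse one JSONL line and run light sanity checks."""
    example = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in example]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Assumption: trust and confidence values are probabilities in [0, 1].
    scores = [example.get("source_trust", 1.0),
              *example.get("confidence_map", {}).values()]
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("trust/confidence values must lie in [0, 1]")
    return example

def iter_examples(lines):
    """Yield parsed examples from an iterable of JSONL lines, skipping blanks."""
    for line in lines:
        if line.strip():
            yield parse_example(line)

# One record in the documented shape, serialized onto a single line.
sample = json.dumps({
    "narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
    "trigger": "Siapa yang mencuri Snow Plum Pill?",
    "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    "confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85},
    "source_trust": 0.85,
})
examples = list(iter_examples([sample, "\n"]))
```

Because a text file object is itself an iterable of lines, `iter_examples` can be handed an open training file directly.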

---

## Running Tests

```bash
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v

# Run specific test
python -m pytest tests/test_model.py -v

# Run with coverage (requires the pytest-cov plugin)
python -m pytest tests/ --cov=diffusion_llm
```

---

## Roadmap

- [x] **Phase 1: Framework Design**: Architecture, config, interfaces
- [x] **Phase 2: Core Components**: Noise scheduler, transformer, graph encoder, tokenizer
- [x] **Phase 3: Training Infrastructure**: Trainer, dataset, loss functions, synthetic data
- [x] **Phase 4: Inference Pipeline**: Generator, batch generation, interactive mode
- [ ] **Phase 5: Training Execution**: Train on synthetic data, iterate
- [ ] **Phase 6: Real Data**: Collect real Graph→Narrative pairs from AAM usage
- [ ] **Phase 7: Optimization**: Quantization, distillation, flash attention
- [ ] **Phase 8: Integration**: Plug trained model into AAM pipeline

---

## Novel Analogy

> Jin Soun is not someone who rents another person's body in order to speak.
> Jin Soun's body is weak and third-rate, but it is Jin Soun's OWN.
> Because that body is trained specifically to carry out commands from
> its own mind (not someone else's), its output is more focused
> than that of someone with a stronger body but a weaker mind.
>
> **AAM = 1 mind + 1 body. Not 1 mind + a rented body.**