# AAM Diffusion LLM Framework
> **"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)**
A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.
---
## Philosophy
### Why Not a General-Purpose LLM?
An earlier concept held that "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)". That idea was **fundamentally wrong**:
| Aspect | General-Purpose LLM (Rented) | AAM Diffusion LLM (Owned) |
|--------|------------------------------|---------------------------|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT; it only narrates what the graph knows |
| Purpose | General purpose | Solely composing sentences from a graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM itself |
### Why Diffusion (Not Autoregressive)?
1. **Non-sequential** — The model can revise earlier parts while generating later parts, much like how Jin Soun forms a thought: vague → clearer → explicit.
2. **Graph conditioning** — The entire graph can be encoded as conditioning, not just a prefix (see the sketch after this list). An autoregressive model only sees "what has already been generated."
3. **Coherent long-form** — Diffusion produces more coherent text for long narratives because every token "knows" about every other token.
4. **Anti-hallucination** — The model is trained EXCLUSIVELY for Graph→Narrative and has no capability to invent information outside the graph.
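To make points 1 and 2 concrete, here is a minimal sketch of a reverse (denoising) loop, assuming an epsilon-prediction denoiser. `model` and `graph_cond` are hypothetical stand-ins, and the update rule is deliberately simplified (the real scheduler uses DDIM with proper noise-schedule terms):
```python
import torch

def denoise(model, graph_cond, seq_len, d_model, n_steps=50):
    # Start from pure noise: every position exists from the first step,
    # so "early" tokens can still be revised while "late" ones take shape.
    x = torch.randn(1, seq_len, d_model)
    for t in reversed(range(n_steps)):
        timestep = torch.full((1,), t)
        # The whole graph encoding conditions every position at once via
        # cross-attention, rather than only a left-to-right prefix.
        predicted_noise = model(x, timestep, graph_cond)
        x = x - predicted_noise / n_steps  # toy update; DDIM rescales properly
    return x
```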
---
## Arsitektur
```
┌──────────────────────────────────────────────────────────┐
│                   AAM = 1 Mind + 1 Body                  │
│ │
│  Mind = RSVS Knowledge Graph                             │
│   - Structural memory — remembers EVERYTHING             │
│   - Relational — understands connections between concepts│
│   - Perfect recall — never forgets                       │
│   - Confidence scores — knows what is certain vs doubtful│
│ │
│  Body = AAM Diffusion LLM                                │
│ ┌─────────────────────────────────────────────┐ │
│ │ Graph Conditioning Encoder │ │
│ │ ├─ Evidence Node Encoder │ │
│ │ ├─ Composition Encoder │ │
│ │ ├─ Anomaly Encoder │ │
│ │ ├─ Reasoning Chain Encoder │ │
│ │ ├─ Confidence Embedding │ │
│ │ ├─ Temporal Embedding │ │
│ │ └─ Graph Attention Layers │ │
│ │ ↓ (cross-attention keys/values) │ │
│ ├─────────────────────────────────────────────┤ │
│ │ Diffusion Transformer (Denoiser) │ │
│ │ ├─ Token Embedding │ │
│ │ ├─ Timestep Embedding (sinusoidal) │ │
│ │ ├─ N × TransformerBlock: │ │
│ │ │ ├─ AdaptiveLayerNorm + Self-Attention │ │
│ │ │ ├─ AdaptiveLayerNorm + Cross-Attention │ │
│ │ │ └─ AdaptiveLayerNorm + Feed-Forward │ │
│ │ └─ Output Projection │ │
│ │ ↓ (predicted noise) │ │
│ ├─────────────────────────────────────────────┤ │
│ │ Noise Scheduler │ │
│ │ ├─ Forward: x_0 + noise → x_t │ │
│ │ └─ Reverse: x_t → denoise → x_{t-1} │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Training: Graph→Narrative pairs │
│ Inference: Noise → N denoising steps → Narrative │
└──────────────────────────────────────────────────────────┘
```
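As a concrete reference for the Noise Scheduler box, here is a minimal sketch of the forward process (x_0 + noise → x_t) under the cosine schedule named in the config below. The function names are illustrative, not the actual `noise_scheduler.py` API:
```python
import math
import torch

def cosine_alpha_bar(t: int, n_timesteps: int, s: float = 0.008) -> float:
    # Cumulative signal fraction at step t for a cosine schedule,
    # normalized so the value at t = 0 is 1.
    f = math.cos((t / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def add_noise(x0: torch.Tensor, t: int, n_timesteps: int = 1000):
    # Forward process q(x_t | x_0): scale the clean embeddings down
    # and mix Gaussian noise in.
    a_bar = cosine_alpha_bar(t, n_timesteps)
    noise = torch.randn_like(x0)
    x_t = math.sqrt(a_bar) * x0 + math.sqrt(1 - a_bar) * noise
    return x_t, noise  # the denoiser is trained to recover `noise` from x_t
```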
---
## Folder Structure
```
diffusion_llm/
├── __init__.py # Package init with public API
├── config/
│ ├── __init__.py
│ └── model_config.py # All configuration dataclasses
├── tokenizer/
│ ├── __init__.py
│ └── aam_tokenizer.py # Sentence-level + BPE hybrid tokenizer
├── model/
│ ├── __init__.py
│ ├── noise_scheduler.py # Forward/reverse diffusion process
│ ├── graph_encoder.py # Graph conditioning encoder
│ ├── diffusion_transformer.py # Core denoising transformer
│ └── aam_diffusion_model.py # Complete model (combines all)
├── training/
│ ├── __init__.py
│ ├── losses.py # Loss functions (MSE, MAE, Huber, weighted)
│ ├── dataset.py # GraphNarrative dataset
│ └── trainer.py # Training loop with AMP, EMA, etc.
├── inference/
│ ├── __init__.py
│ └── generator.py # Inference pipeline
├── data/
│ ├── __init__.py
│ ├── synthetic_generator.py # Synthetic training data
│ └── data_pipeline.py # Data preparation pipeline
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation & generation
│ └── export.py # Model export
├── tests/
│ ├── __init__.py
│ ├── test_scheduler.py # Noise scheduler tests
│ └── test_model.py # Model component tests
├── requirements.txt # Python dependencies
└── README.md # This file
```
---
## Quick Start
### 1. Install Dependencies
```bash
pip install torch numpy pytest
```
### 2. Generate Synthetic Data
```python
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator
generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
output_dir="./data",
n_train=10000,
n_val=500,
)
```
### 3. Train the Model
```bash
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100
# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
```
### 4. Generate Narratives
```bash
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate
# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
```
### 5. Programmatic Usage
```python
from diffusion_llm import (
AamDiffusionConfig, get_default_config,
AamDiffusionModel, AamTokenizer, AamGenerator,
)
# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")
# Create generator
generator = AamGenerator(model, tokenizer, config)
# Generate narrative from graph conditioning
result = generator.generate(
trigger="Siapa yang mencuri Snow Plum Pill?",
evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
source_trust=0.85,
)
print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
```
---
## Model Sizes
| Size | d_model | Layers | Heads | Params | Recommended For |
|------|---------|--------|-------|--------|----------------|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| **base** | **768** | **12** | **12** | **~170M** | **Recommended for training** |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
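These counts can be sanity-checked with a back-of-the-envelope estimate. The sketch below covers only the denoiser core (self-attention, cross-attention, feed-forward, token embeddings) and is an approximation, not the framework's actual counting code:
```python
def approx_params(d_model, n_layers, vocab_size=32000, d_ff_mult=4):
    per_layer = (
        4 * d_model * d_model                  # self-attention (Q, K, V, O)
        + 4 * d_model * d_model                # cross-attention to the graph
        + 2 * d_model * (d_ff_mult * d_model)  # feed-forward up/down
    )
    embeddings = vocab_size * d_model          # token embedding table
    return n_layers * per_layer + embeddings

print(f"base core: ~{approx_params(768, 12) / 1e6:.0f}M")
# ~138M; the graph encoder, AdaLN modulation, and output projection
# account for the remainder of the ~170M figure in the table.
```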
---
## Configuration
### Model Config
```python
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig
config = AamDiffusionConfig(
model=ModelConfig(
d_model=768, # Hidden dimension
n_layers=12, # Transformer blocks
n_heads=12, # Attention heads
d_ff=3072, # Feed-forward dimension
vocab_size=32000, # Vocabulary size
max_seq_len=512, # Maximum sequence length
),
diffusion=DiffusionConfig(
n_timesteps=1000, # Training timesteps
n_inference_steps=50, # Inference steps (fewer = faster)
schedule_type="cosine", # Noise schedule
prediction_type="epsilon", # Predict noise
sampling_method="ddim", # Fast deterministic sampling
),
)
```
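With `prediction_type="epsilon"`, training minimizes the error between the injected noise and the denoiser's prediction. Below is a minimal MSE sketch of that objective; `scheduler.add_noise` and the `model(x_t, t, cond)` signature are assumptions, and `losses.py` also provides MAE, Huber, and weighted variants:
```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, scheduler, x0, graph_cond, n_timesteps=1000):
    # Sample a random timestep per example and noise the clean embeddings.
    t = torch.randint(0, n_timesteps, (x0.shape[0],))
    x_t, noise = scheduler.add_noise(x0, t)  # forward process q(x_t | x_0)
    # The denoiser predicts the injected noise, conditioned on the graph.
    predicted_noise = model(x_t, t, graph_cond)
    return F.mse_loss(predicted_noise, noise)
```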
### Inference Config
```python
from diffusion_llm.config.model_config import InferenceConfig
inference = InferenceConfig(
n_steps=50, # Denoising steps
temperature=1.0, # Sampling temperature
top_k=50, # Top-k sampling
max_output_sentences=16, # Max sentences
language="id", # Output language
)
```
---
## Integration with the AAM Pipeline
This framework is designed to be the "body" of AAM. Once the model is trained,
integrating it with `pipeline.py` is straightforward:
```python
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator
class AamPipeline:
def __init__(self, ...):
# Load trained diffusion model
diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
diffusion_model = AamDiffusionModel.load("path/to/best.pt")
diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)
```
---
## Training Data Format
Training data is stored as JSONL, one example per line:
```json
{
"narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
"trigger": "Siapa yang mencuri Snow Plum Pill?",
"evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
"compositions": [],
"confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
"anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
"reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
"source_trust": 0.85,
"temporal_context": [],
"language": "id",
"source": "synthetic"
}
```
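Since the format is plain JSON Lines, it can be read with the standard library alone. This loader is a sketch that follows the field names above, not the framework's `dataset.py`:
```python
import json

def iter_examples(path: str):
    # One JSON object per line; yields (target narrative, conditioning dict).
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            conditioning = {key: ex[key] for key in (
                "trigger", "evidence_nodes", "compositions", "confidence_map",
                "anomalies", "reasoning_steps", "source_trust", "temporal_context",
            )}
            yield ex["narrative"], conditioning
```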
---
## Running Tests
```bash
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v
# Run specific test
python -m pytest tests/test_model.py -v
# Run with coverage
python -m pytest tests/ --cov=diffusion_llm
```
---
## Roadmap
- [x] **Phase 1: Framework Design** — Architecture, config, interfaces
- [x] **Phase 2: Core Components** — Noise scheduler, transformer, graph encoder, tokenizer
- [x] **Phase 3: Training Infrastructure** — Trainer, dataset, loss functions, synthetic data
- [x] **Phase 4: Inference Pipeline** — Generator, batch generation, interactive mode
- [ ] **Phase 5: Training Execution** — Train on synthetic data, iterate
- [ ] **Phase 6: Real Data** — Collect real Graph→Narrative pairs from AAM usage
- [ ] **Phase 7: Optimization** — Quantization, distillation, flash attention
- [ ] **Phase 8: Integration** — Plug trained model into AAM pipeline
---
## Novel Analogy
> Jin Soun is not someone who rents another person's body in order to speak.
> He has a body of his own: weak, third-rate, but HIS.
> Because that body was trained specifically to execute commands from
> his own mind (not someone else's), his output is more precise than
> that of someone with a stronger body but a weaker mind.
>
> **AAM = 1 mind + 1 body. Not 1 mind + a rented body.**