# AAM Diffusion LLM Framework
> **"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)**
A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.
---
## Philosophy
### Why Not a General-Purpose LLM?
An earlier concept held that "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)". That idea was **fundamentally wrong**:
| Aspect | General-Purpose LLM (Rented) | AAM Diffusion LLM (Owned) |
|--------|------------------------------|---------------------------|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT; it only narrates what the graph knows |
| Purpose | General purpose | Solely composing sentences from a graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM itself |
### Why Diffusion (Not Autoregressive)?
1. **Non-sequential** — The model can revise earlier parts while generating later parts, much like how Jin Soun forms a thought: vague → clearer → explicit.
2. **Graph conditioning** — The entire graph can be encoded as conditioning, not just a prefix (see the sketch after this list). An autoregressive model only sees "what has already been generated."
3. **Coherent long-form** — Diffusion produces more coherent text for long narratives because every token "knows" about every other token.
4. **Anti-hallucination** — The model is trained EXCLUSIVELY for Graph→Narrative and has no capability to invent information outside the graph.
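To make points 1 and 2 concrete, here is a minimal sketch of a reverse (denoising) loop, assuming an epsilon-prediction denoiser. `model` and `graph_cond` are hypothetical stand-ins, and the update rule is deliberately simplified (the real scheduler uses DDIM with proper noise-schedule terms):
```python
import torch

def denoise(model, graph_cond, seq_len, d_model, n_steps=50):
    # Start from pure noise: every position exists from the first step,
    # so "early" tokens can still be revised while "late" ones take shape.
    x = torch.randn(1, seq_len, d_model)
    for t in reversed(range(n_steps)):
        timestep = torch.full((1,), t)
        # The whole graph encoding conditions every position at once via
        # cross-attention, rather than only a left-to-right prefix.
        predicted_noise = model(x, timestep, graph_cond)
        x = x - predicted_noise / n_steps  # toy update; DDIM rescales properly
    return x
```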
---
## Arsitektur
```
┌──────────────────────────────────────────────────────────┐
│                   AAM = 1 Mind + 1 Body                  │
│ │
│  Mind = RSVS Knowledge Graph                             │
│   - Structural memory — remembers EVERYTHING             │
│   - Relational — understands connections between concepts│
│   - Perfect recall — never forgets                       │
│   - Confidence scores — knows what is certain vs doubtful│
│ │
│  Body = AAM Diffusion LLM                                │
│ ┌─────────────────────────────────────────────┐ │
│ │ Graph Conditioning Encoder │ │
│ │ ├─ Evidence Node Encoder │ │
│ │ ├─ Composition Encoder │ │
│ │ ├─ Anomaly Encoder │ │
│ │ ├─ Reasoning Chain Encoder │ │
│ │ ├─ Confidence Embedding │ │
│ │ ├─ Temporal Embedding │ │
│ │ └─ Graph Attention Layers │ │
│ │ ↓ (cross-attention keys/values) │ │
│ ├─────────────────────────────────────────────┤ │
│ │ Diffusion Transformer (Denoiser) │ │
│ │ ├─ Token Embedding │ │
│ │ ├─ Timestep Embedding (sinusoidal) │ │
│ │ ├─ N × TransformerBlock: │ │
│ │ │ ├─ AdaptiveLayerNorm + Self-Attention │ │
│ │ │ ├─ AdaptiveLayerNorm + Cross-Attention │ │
│ │ │ └─ AdaptiveLayerNorm + Feed-Forward │ │
│ │ └─ Output Projection │ │
│ │ ↓ (predicted noise) │ │
│ ├─────────────────────────────────────────────┤ │
│ │ Noise Scheduler │ │
│ │ ├─ Forward: x_0 + noise → x_t │ │
│ │ └─ Reverse: x_t → denoise → x_{t-1} │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Training: Graph→Narrative pairs │
│ Inference: Noise → N denoising steps → Narrative │
└──────────────────────────────────────────────────────────┘
```
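As a concrete reference for the Noise Scheduler box, here is a minimal sketch of the forward process (x_0 + noise → x_t) under the cosine schedule named in the config below. The function names are illustrative, not the actual `noise_scheduler.py` API:
```python
import math
import torch

def cosine_alpha_bar(t: int, n_timesteps: int, s: float = 0.008) -> float:
    # Cumulative signal fraction at step t for a cosine schedule,
    # normalized so the value at t = 0 is 1.
    f = math.cos((t / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def add_noise(x0: torch.Tensor, t: int, n_timesteps: int = 1000):
    # Forward process q(x_t | x_0): scale the clean embeddings down
    # and mix Gaussian noise in.
    a_bar = cosine_alpha_bar(t, n_timesteps)
    noise = torch.randn_like(x0)
    x_t = math.sqrt(a_bar) * x0 + math.sqrt(1 - a_bar) * noise
    return x_t, noise  # the denoiser is trained to recover `noise` from x_t
```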
---
## Folder Structure
```
diffusion_llm/
├── __init__.py # Package init with public API
├── config/
│ ├── __init__.py
│ └── model_config.py # All configuration dataclasses
├── tokenizer/
│ ├── __init__.py
│ └── aam_tokenizer.py # Sentence-level + BPE hybrid tokenizer
├── model/
│ ├── __init__.py
│ ├── noise_scheduler.py # Forward/reverse diffusion process
│ ├── graph_encoder.py # Graph conditioning encoder
│ ├── diffusion_transformer.py # Core denoising transformer
│ └── aam_diffusion_model.py # Complete model (combines all)
├── training/
│ ├── __init__.py
│ ├── losses.py # Loss functions (MSE, MAE, Huber, weighted)
│ ├── dataset.py # GraphNarrative dataset
│ └── trainer.py # Training loop with AMP, EMA, etc.
├── inference/
│ ├── __init__.py
│ └── generator.py # Inference pipeline
├── data/
│ ├── __init__.py
│ ├── synthetic_generator.py # Synthetic training data
│ └── data_pipeline.py # Data preparation pipeline
├── scripts/
│ ├── train.py # Training entry point
│ ├── evaluate.py # Evaluation & generation
│ └── export.py # Model export
├── tests/
│ ├── __init__.py
│ ├── test_scheduler.py # Noise scheduler tests
│ └── test_model.py # Model component tests
├── requirements.txt # Python dependencies
└── README.md # This file
```
---
## Quick Start
### 1. Install Dependencies
```bash
pip install torch numpy pytest
```
### 2. Generate Synthetic Data
```python
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator
generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
output_dir="./data",
n_train=10000,
n_val=500,
)
```
### 3. Train the Model
```bash
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100
# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
```
### 4. Generate Narratives
```bash
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate
# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
```
### 5. Programmatic Usage
```python
from diffusion_llm import (
AamDiffusionConfig, get_default_config,
AamDiffusionModel, AamTokenizer, AamGenerator,
)
# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")
# Create generator
generator = AamGenerator(model, tokenizer, config)
# Generate narrative from graph conditioning
result = generator.generate(
trigger="Siapa yang mencuri Snow Plum Pill?",
evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
source_trust=0.85,
)
print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
```
---
## Model Sizes
| Size | d_model | Layers | Heads | Params | Recommended For |
|------|---------|--------|-------|--------|----------------|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| **base** | **768** | **12** | **12** | **~170M** | **Recommended for training** |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
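These counts can be sanity-checked with a back-of-the-envelope estimate. The sketch below covers only the denoiser core (self-attention, cross-attention, feed-forward, token embeddings) and is an approximation, not the framework's actual counting code:
```python
def approx_params(d_model, n_layers, vocab_size=32000, d_ff_mult=4):
    per_layer = (
        4 * d_model * d_model                  # self-attention (Q, K, V, O)
        + 4 * d_model * d_model                # cross-attention to the graph
        + 2 * d_model * (d_ff_mult * d_model)  # feed-forward up/down
    )
    embeddings = vocab_size * d_model          # token embedding table
    return n_layers * per_layer + embeddings

print(f"base core: ~{approx_params(768, 12) / 1e6:.0f}M")
# ~138M; the graph encoder, AdaLN modulation, and output projection
# account for the remainder of the ~170M figure in the table.
```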
---
## Configuration
### Model Config
```python
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig
config = AamDiffusionConfig(
model=ModelConfig(
d_model=768, # Hidden dimension
n_layers=12, # Transformer blocks
n_heads=12, # Attention heads
d_ff=3072, # Feed-forward dimension
vocab_size=32000, # Vocabulary size
max_seq_len=512, # Maximum sequence length
),
diffusion=DiffusionConfig(
n_timesteps=1000, # Training timesteps
n_inference_steps=50, # Inference steps (fewer = faster)
schedule_type="cosine", # Noise schedule
prediction_type="epsilon", # Predict noise
sampling_method="ddim", # Fast deterministic sampling
),
)
```
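With `prediction_type="epsilon"`, training minimizes the error between the injected noise and the denoiser's prediction. Below is a minimal MSE sketch of that objective; `scheduler.add_noise` and the `model(x_t, t, cond)` signature are assumptions, and `losses.py` also provides MAE, Huber, and weighted variants:
```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, scheduler, x0, graph_cond, n_timesteps=1000):
    # Sample a random timestep per example and noise the clean embeddings.
    t = torch.randint(0, n_timesteps, (x0.shape[0],))
    x_t, noise = scheduler.add_noise(x0, t)  # forward process q(x_t | x_0)
    # The denoiser predicts the injected noise, conditioned on the graph.
    predicted_noise = model(x_t, t, graph_cond)
    return F.mse_loss(predicted_noise, noise)
```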
### Inference Config
```python
from diffusion_llm.config.model_config import InferenceConfig
inference = InferenceConfig(
n_steps=50, # Denoising steps
temperature=1.0, # Sampling temperature
top_k=50, # Top-k sampling
max_output_sentences=16, # Max sentences
language="id", # Output language
)
```
---
## Integration with the AAM Pipeline
This framework is designed to be the "body" of AAM. Once the model is trained,
integrating it with `pipeline.py` is straightforward:
```python
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator
class AamPipeline:
def __init__(self, ...):
# Load trained diffusion model
diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
diffusion_model = AamDiffusionModel.load("path/to/best.pt")
diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)
```
---
## Training Data Format
Training data is stored as JSONL, one example per line:
```json
{
"narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
"trigger": "Siapa yang mencuri Snow Plum Pill?",
"evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
"compositions": [],
"confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
"anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
"reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
"source_trust": 0.85,
"temporal_context": [],
"language": "id",
"source": "synthetic"
}
```
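Since the format is plain JSON Lines, it can be read with the standard library alone. This loader is a sketch that follows the field names above, not the framework's `dataset.py`:
```python
import json

def iter_examples(path: str):
    # One JSON object per line; yields (target narrative, conditioning dict).
    with open(path, encoding="utf-8") as f:
        for line in f:
            ex = json.loads(line)
            conditioning = {key: ex[key] for key in (
                "trigger", "evidence_nodes", "compositions", "confidence_map",
                "anomalies", "reasoning_steps", "source_trust", "temporal_context",
            )}
            yield ex["narrative"], conditioning
```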
---
## Running Tests
```bash
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v
# Run specific test
python -m pytest tests/test_model.py -v
# Run with coverage
python -m pytest tests/ --cov=diffusion_llm
```
---
## Roadmap
- [x] **Phase 1: Framework Design** — Architecture, config, interfaces
- [x] **Phase 2: Core Components** — Noise scheduler, transformer, graph encoder, tokenizer
- [x] **Phase 3: Training Infrastructure** — Trainer, dataset, loss functions, synthetic data
- [x] **Phase 4: Inference Pipeline** — Generator, batch generation, interactive mode
- [ ] **Phase 5: Training Execution** — Train on synthetic data, iterate
- [ ] **Phase 6: Real Data** — Collect real Graph→Narrative pairs from AAM usage
- [ ] **Phase 7: Optimization** — Quantization, distillation, flash attention
- [ ] **Phase 8: Integration** — Plug trained model into AAM pipeline
---
## Novel Analogy
> Jin Soun is not someone who rents another person's body in order to speak.
> He has a body of his own: weak, third-rate, but HIS.
> Because that body was trained specifically to execute commands from
> his own mind (not someone else's), his output is more precise than
> that of someone with a stronger body but a weaker mind.
>
> **AAM = 1 mind + 1 body. Not 1 mind + a rented body.**