AAM Diffusion LLM Framework
"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)
A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.
Philosophy
Why Not a General LLM?
The earlier concept, "Jin Soun's body = a general LLM (GPT, Claude, etc.)", was a fundamental mistake.
| Aspect | General LLM (Rented) | AAM Diffusion LLM (Its Own) |
|---|---|---|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT: it only narrates what the graph knows |
| Purpose | General purpose | Specialized: composing sentences from the graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | The AAM's OWN |
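To make the input difference concrete, here is an illustrative sketch; the field names mirror the generator API shown later in this README, but the exact shapes are hypothetical:

# A general LLM receives one flat string:
prompt = "Siapa yang mencuri Snow Plum Pill? Bukti: Hefei, Diancang Five Swords."

# The AAM Diffusion LLM instead receives structured graph conditioning;
# each field is handled by a dedicated encoder (see Architecture below):
graph_conditioning = {
    "trigger": "Siapa yang mencuri Snow Plum Pill?",
    "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap"],
    "reasoning_steps": ["Cross-reference tanggal kejadian"],
    "source_trust": 0.85,
}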
Why Diffusion (Not Autoregressive)?
Non-sequential: the model can revise earlier parts while generating later ones, mirroring how Jin Soun forms a thought: vague → clearer → explicit (see the sketch after this list).
Graph conditioning: the entire graph can be encoded as conditioning, not merely a prefix. An autoregressive model only ever sees "what has already been generated."
Coherent long-form: diffusion produces more coherent text for long narratives because every token "knows about" every other token.
Anti-hallucination: the model is trained EXCLUSIVELY for Graph→Narrative generation, so it has no capacity to fabricate information outside the graph.
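The non-sequential point shows up in how inference loops over noise levels rather than token positions. A minimal sketch, assuming a model(x, t, cond) denoiser and a hypothetical scheduler.step() helper (the framework's real loop lives in inference/generator.py):

import torch

def generate(model, scheduler, graph_cond, seq_len=512, d_model=768, n_steps=50):
    # Start from pure Gaussian noise over the ENTIRE sequence at once.
    x = torch.randn(1, seq_len, d_model)
    for t in reversed(range(n_steps)):
        # All positions are denoised together, conditioned on the full
        # graph via cross-attention, so "early" tokens can still change
        # while "late" tokens take shape: vague → clearer → explicit.
        predicted_noise = model(x, torch.tensor([t]), graph_cond)
        x = scheduler.step(predicted_noise, t, x)  # one reverse-diffusion step
    return x  # latent sequence; the tokenizer decodes it to text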
Architecture
┌────────────────────────────────────────────────────────────┐
│                   AAM = 1 Mind + 1 Body                    │
│                                                            │
│  Mind = RSVS Knowledge Graph                               │
│  - Structural memory: remembers EVERYTHING                 │
│  - Relational: understands connections between concepts    │
│  - Perfect recall: never forgets                           │
│  - Confidence scores: knows what is certain vs. uncertain  │
│                                                            │
│  Body = AAM Diffusion LLM                                  │
│  ┌─────────────────────────────────────────────┐           │
│  │ Graph Conditioning Encoder                  │           │
│  │  ├─ Evidence Node Encoder                   │           │
│  │  ├─ Composition Encoder                     │           │
│  │  ├─ Anomaly Encoder                         │           │
│  │  ├─ Reasoning Chain Encoder                 │           │
│  │  ├─ Confidence Embedding                    │           │
│  │  ├─ Temporal Embedding                      │           │
│  │  └─ Graph Attention Layers                  │           │
│  │      ↓ (cross-attention keys/values)        │           │
│  ├─────────────────────────────────────────────┤           │
│  │ Diffusion Transformer (Denoiser)            │           │
│  │  ├─ Token Embedding                         │           │
│  │  ├─ Timestep Embedding (sinusoidal)         │           │
│  │  ├─ N × TransformerBlock:                   │           │
│  │  │   ├─ AdaptiveLayerNorm + Self-Attention  │           │
│  │  │   ├─ AdaptiveLayerNorm + Cross-Attention │           │
│  │  │   └─ AdaptiveLayerNorm + Feed-Forward    │           │
│  │  └─ Output Projection                       │           │
│  │      ↓ (predicted noise)                    │           │
│  ├─────────────────────────────────────────────┤           │
│  │ Noise Scheduler                             │           │
│  │  ├─ Forward: x_0 + noise → x_t              │           │
│  │  └─ Reverse: x_t → denoise → x_{t-1}        │           │
│  └─────────────────────────────────────────────┘           │
│                                                            │
│  Training: Graph→Narrative pairs                           │
│  Inference: Noise → N denoising steps → Narrative          │
└────────────────────────────────────────────────────────────┘
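For reference, the Noise Scheduler's forward branch is the standard diffusion corruption x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. A self-contained sketch using the cosine schedule named in the config (illustrative; the framework's own version lives in model/noise_scheduler.py):

import math
import torch

def cosine_alpha_bar(t: int, n_timesteps: int = 1000, s: float = 0.008) -> float:
    # Cumulative signal level abar_t under the cosine schedule.
    f = math.cos((t / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def add_noise(x_0: torch.Tensor, t: int) -> tuple[torch.Tensor, torch.Tensor]:
    # Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    abar = cosine_alpha_bar(t)
    eps = torch.randn_like(x_0)
    x_t = math.sqrt(abar) * x_0 + math.sqrt(1 - abar) * eps
    return x_t, eps  # eps is the regression target when prediction_type="epsilon"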
Folder Structure
diffusion_llm/
├── __init__.py                   # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py           # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py          # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py        # Forward/reverse diffusion process
│   ├── graph_encoder.py          # Graph conditioning encoder
│   ├── diffusion_transformer.py  # Core denoising transformer
│   └── aam_diffusion_model.py    # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py                 # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py                # GraphNarrative dataset
│   └── trainer.py                # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py              # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py    # Synthetic training data
│   └── data_pipeline.py          # Data preparation pipeline
├── scripts/
│   ├── train.py                  # Training entry point
│   ├── evaluate.py               # Evaluation & generation
│   └── export.py                 # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py         # Noise scheduler tests
│   └── test_model.py             # Model component tests
├── requirements.txt              # Python dependencies
└── README.md                     # This file
Quick Start
1. Install Dependencies
pip install torch numpy pytest
2. Generate Synthetic Data
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator
generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
output_dir="./data",
n_train=10000,
n_val=500,
)
3. Train the Model
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100
# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
4. Generate Narratives
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate
# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
5. Programmatic Usage
from diffusion_llm import (
    AamDiffusionConfig, AamDiffusionModel,
    AamTokenizer, AamGenerator,
)
# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")
# Create generator
generator = AamGenerator(model, tokenizer, config)
# Generate narrative from graph conditioning
result = generator.generate(
trigger="Siapa yang mencuri Snow Plum Pill?",
evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
source_trust=0.85,
)
print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
Model Sizes
| Size | d_model | Layers | Heads | Params | Recommended For |
|---|---|---|---|---|---|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| base | 768 | 12 | 12 | ~170M | Recommended for training |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
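As a rough sanity check on the table: a standard transformer block carries about 12 × d_model² weights (4·d_model² in self-attention plus 8·d_model² in a 4× feed-forward), and this architecture adds a cross-attention sublayer per block. A back-of-the-envelope estimate (my own approximation, not the framework's accounting; it excludes the graph encoder):

def approx_params(d_model: int, n_layers: int, vocab_size: int = 32000) -> int:
    self_attn = 4 * d_model ** 2   # Q, K, V, and output projections
    cross_attn = 4 * d_model ** 2  # same shape; keys/values come from the graph encoder
    ffn = 8 * d_model ** 2         # two matrices with d_ff = 4 * d_model
    embeddings = vocab_size * d_model
    return n_layers * (self_attn + cross_attn + ffn) + embeddings

print(f"base: ~{approx_params(768, 12) / 1e6:.0f}M")
# ~138M; the graph encoder and remaining parameters close the gap to the table's ~170M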
Configuration
Model Config
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig
config = AamDiffusionConfig(
model=ModelConfig(
d_model=768, # Hidden dimension
n_layers=12, # Transformer blocks
n_heads=12, # Attention heads
d_ff=3072, # Feed-forward dimension
vocab_size=32000, # Vocabulary size
max_seq_len=512, # Maximum sequence length
),
diffusion=DiffusionConfig(
n_timesteps=1000, # Training timesteps
n_inference_steps=50, # Inference steps (fewer = faster)
schedule_type="cosine", # Noise schedule
prediction_type="epsilon", # Predict noise
sampling_method="ddim", # Fast deterministic sampling
),
)
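One detail worth noting: with sampling_method="ddim", the 50 inference steps stride through the 1,000 training timesteps rather than visiting all of them. An illustrative one-liner (my sketch, not the framework's API):

import numpy as np

# 50 timesteps selected from [0, 999], visited from most to least noisy.
ddim_timesteps = np.linspace(0, 999, num=50, dtype=int)[::-1]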
Inference Config
from diffusion_llm.config.model_config import InferenceConfig
inference = InferenceConfig(
n_steps=50, # Denoising steps
temperature=1.0, # Sampling temperature
top_k=50, # Top-k sampling
max_output_sentences=16, # Max sentences
language="id", # Output language
)
Integration with the AAM Pipeline
This framework is designed to be the "body" of the AAM. Once the model is trained,
integrating it with pipeline.py is straightforward:
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator

class AamPipeline:
    def __init__(self, ...):
        # Load the trained diffusion model
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(
            diffusion_model, diffusion_tokenizer, diffusion_config
        )
Training Data Format
Training data is stored in JSONL format, one example per line:
{
"narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
"trigger": "Siapa yang mencuri Snow Plum Pill?",
"evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
"compositions": [],
"confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
"anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
"reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
"source_trust": 0.85,
"temporal_context": [],
"language": "id",
"source": "synthetic"
}
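Because each line is an independent JSON object, training files can be inspected or validated without the framework. A minimal sketch, assuming the required keys shown above and a hypothetical data/train.jsonl path:

import json

REQUIRED_KEYS = {"narrative", "trigger", "evidence_nodes", "language"}

with open("data/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # malformed JSON raises here with line context
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise ValueError(f"line {i}: missing fields {sorted(missing)}")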
Running Tests
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v
# Run specific test
python -m pytest tests/test_model.py -v
# Run with coverage
python -m pytest tests/ --cov=diffusion_llm
Roadmap
- Phase 1: Framework Design (architecture, config, interfaces)
- Phase 2: Core Components (noise scheduler, transformer, graph encoder, tokenizer)
- Phase 3: Training Infrastructure (trainer, dataset, loss functions, synthetic data)
- Phase 4: Inference Pipeline (generator, batch generation, interactive mode)
- Phase 5: Training Execution (train on synthetic data, iterate)
- Phase 6: Real Data (collect real Graph→Narrative pairs from AAM usage)
- Phase 7: Optimization (quantization, distillation, flash attention)
- Phase 8: Integration (plug the trained model into the AAM pipeline)
Novel Analogy
Jin Soun is not someone who rents another person's body in order to speak. He has a body of his own: weak, third-rate, but HIS. And because that body is trained specifically to execute the commands of his own mind (not someone else's), its output is more precise than that of someone with a stronger body but a weaker mind.
AAM = 1 mind + 1 body. Not 1 mind + a rented body.