AAM Diffusion LLM v1.0 — The Body of the Aphantasic Abstraction Model

AAM Diffusion LLM Framework

"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)

A purpose-built framework for training a Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.


Philosophy

Why Not a General-Purpose LLM?

The earlier concept was "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)". That was a serious mistake.

| Aspect | General LLM (Rented) | AAM Diffusion LLM (Self-Owned) |
|---|---|---|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT: only narrates what the graph knows |
| Purpose | General purpose | Composing sentences from a graph, nothing else |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM |

Why Diffusion (Not Autoregressive)?

  1. Non-sequential — it can revise earlier parts while generating later ones, much like how Jin Soun forms a thought: vague → clearer → explicit.

  2. Graph conditioning — the entire graph can be encoded as conditioning, not just a prefix. An autoregressive model only sees "what has been generated so far."

  3. Coherent long-form — diffusion produces more coherent text for long narratives because every token "knows about" every other token.

  4. Anti-hallucination — the model is trained EXCLUSIVELY for Graph→Narrative, so it has no capability to invent information outside the graph.


Architecture

┌───────────────────────────────────────────────────────────┐
│  AAM = 1 Mind + 1 Body                                    │
│                                                           │
│  Mind = RSVS Knowledge Graph                              │
│    - Structural memory: remembers EVERYTHING              │
│    - Relational: understands connections among concepts   │
│    - Perfect recall: never forgets                        │
│    - Confidence scores: knows what is certain vs doubtful │
│                                                           │
│  Body = AAM Diffusion LLM                                 │
│    ┌───────────────────────────────────────────────┐      │
│    │  Graph Conditioning Encoder                   │      │
│    │  ├─ Evidence Node Encoder                     │      │
│    │  ├─ Composition Encoder                       │      │
│    │  ├─ Anomaly Encoder                           │      │
│    │  ├─ Reasoning Chain Encoder                   │      │
│    │  ├─ Confidence Embedding                      │      │
│    │  ├─ Temporal Embedding                        │      │
│    │  └─ Graph Attention Layers                    │      │
│    │         ↓ (cross-attention keys/values)       │      │
│    ├───────────────────────────────────────────────┤      │
│    │  Diffusion Transformer (Denoiser)             │      │
│    │  ├─ Token Embedding                           │      │
│    │  ├─ Timestep Embedding (sinusoidal)           │      │
│    │  ├─ N × TransformerBlock:                     │      │
│    │  │   ├─ AdaptiveLayerNorm + Self-Attention    │      │
│    │  │   ├─ AdaptiveLayerNorm + Cross-Attention   │      │
│    │  │   └─ AdaptiveLayerNorm + Feed-Forward      │      │
│    │  └─ Output Projection                         │      │
│    │         ↓ (predicted noise)                   │      │
│    ├───────────────────────────────────────────────┤      │
│    │  Noise Scheduler                              │      │
│    │  ├─ Forward: x_0 + noise → x_t                │      │
│    │  └─ Reverse: x_t → denoise → x_{t-1}          │      │
│    └───────────────────────────────────────────────┘      │
│                                                           │
│  Training: Graph→Narrative pairs                          │
│  Inference: Noise → N denoising steps → Narrative         │
└───────────────────────────────────────────────────────────┘
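The Noise Scheduler box can be summarized in a few lines. Below is a minimal, dependency-free sketch of the forward process under a cosine schedule (the `schedule_type` used by the default config); the function names are illustrative, not the repo's actual API.

```python
import math
import random

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal fraction of the cosine noise schedule:
    ~1.0 at t=0 (clean data), decaying to ~0.0 at t=T (pure noise)."""
    f = lambda u: math.cos((u / T + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def forward_diffuse(x0, t, T):
    """Forward step x_0 -> x_t: scale the signal down and mix in
    Gaussian noise according to alpha_bar(t)."""
    ab = cosine_alpha_bar(t, T)
    return [math.sqrt(ab) * x + math.sqrt(1 - ab) * random.gauss(0, 1)
            for x in x0]

# Signal fraction decays monotonically over the T training timesteps.
T = 1000
for t in (0, 250, 500, 750, 1000):
    print(t, round(cosine_alpha_bar(t, T), 4))
```

The reverse process runs this in the other direction: the denoiser predicts the noise at each timestep and the scheduler uses that prediction to step from x_t to x_{t-1}.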

Folder Structure

diffusion_llm/
├── __init__.py                 # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py         # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py        # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py      # Forward/reverse diffusion process
│   ├── graph_encoder.py        # Graph conditioning encoder
│   ├── diffusion_transformer.py # Core denoising transformer
│   └── aam_diffusion_model.py  # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py               # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py              # GraphNarrative dataset
│   └── trainer.py              # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py            # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py  # Synthetic training data
│   └── data_pipeline.py        # Data preparation pipeline
├── scripts/
│   ├── train.py                # Training entry point
│   ├── evaluate.py             # Evaluation & generation
│   └── export.py               # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py       # Noise scheduler tests
│   └── test_model.py           # Model component tests
├── requirements.txt            # Python dependencies
└── README.md                   # This file

Quick Start

1. Install Dependencies

pip install torch numpy pytest

2. Generate Synthetic Data

from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator

generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
    output_dir="./data",
    n_train=10000,
    n_val=500,
)

3. Train the Model

# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100

# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000

4. Generate Narratives

# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate

# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive

5. Programmatic Usage

from diffusion_llm import (
    AamDiffusionConfig, get_default_config,
    AamDiffusionModel, AamTokenizer, AamGenerator,
)

# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")

# Create generator
generator = AamGenerator(model, tokenizer, config)

# Generate narrative from graph conditioning
result = generator.generate(
    trigger="Siapa yang mencuri Snow Plum Pill?",
    evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
    reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
    source_trust=0.85,
)

print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")

Model Sizes

| Size | d_model | Layers | Heads | Params | Recommended For |
|---|---|---|---|---|---|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| base | 768 | 12 | 12 | ~170M | Recommended for training |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
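The parameter counts above can be sanity-checked with a back-of-the-envelope estimate. The sketch below is not part of the repo: it counts only the dominant weight matrices of a transformer with self-attention, cross-attention, and a feed-forward block per layer, plus untied input/output embeddings, ignoring biases, norms, and the graph encoder, so it lands slightly below the table's figures.

```python
def estimate_params(d_model, n_layers, d_ff, vocab_size):
    """Rough parameter count: embeddings plus per-block attention
    and feed-forward weight matrices (biases and norms ignored)."""
    attn = 4 * d_model * d_model            # Q, K, V, output projections
    block = 2 * attn + 2 * d_model * d_ff   # self-attn + cross-attn + FFN
    embeddings = 2 * vocab_size * d_model   # input + untied output head
    return embeddings + n_layers * block

# 'base' config from the table: d_model=768, 12 layers, d_ff=3072, vocab 32k
base = estimate_params(768, 12, 3072, 32000)
print(f"base ~= {base / 1e6:.0f}M parameters")
```

The estimate for `base` comes out in the low 160M range, consistent with the ~170M in the table once the graph encoder and smaller tensors are included.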

Configuration

Model Config

from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig

config = AamDiffusionConfig(
    model=ModelConfig(
        d_model=768,        # Hidden dimension
        n_layers=12,        # Transformer blocks
        n_heads=12,         # Attention heads
        d_ff=3072,          # Feed-forward dimension
        vocab_size=32000,   # Vocabulary size
        max_seq_len=512,    # Maximum sequence length
    ),
    diffusion=DiffusionConfig(
        n_timesteps=1000,   # Training timesteps
        n_inference_steps=50,  # Inference steps (fewer = faster)
        schedule_type="cosine",  # Noise schedule
        prediction_type="epsilon",  # Predict noise
        sampling_method="ddim",  # Fast deterministic sampling
    ),
)
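The relationship between `n_timesteps` and `n_inference_steps` is plain subsampling: a DDIM-style sampler visits only an evenly strided subset of the 1000 training timesteps, which is why fewer steps means proportionally faster generation. A minimal sketch (illustrative, not the repo's scheduler code):

```python
def inference_timesteps(n_timesteps=1000, n_inference_steps=50):
    """Pick an evenly strided, descending subset of the training
    timesteps for fast DDIM-style sampling."""
    stride = n_timesteps // n_inference_steps
    return list(range(0, n_timesteps, stride))[::-1]

steps = inference_timesteps()
print(len(steps), steps[0], steps[-1])  # 50 steps, from t=980 down to t=0
```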

Inference Config

from diffusion_llm.config.model_config import InferenceConfig

inference = InferenceConfig(
    n_steps=50,           # Denoising steps
    temperature=1.0,      # Sampling temperature
    top_k=50,             # Top-k sampling
    max_output_sentences=16,  # Max sentences
    language="id",        # Output language
)
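For reference, `top_k` and `temperature` interact as in a standard top-k sampler: only the k highest-scoring tokens are candidates, and lower temperature sharpens the choice toward the argmax. A stdlib-only sketch of that common heuristic (not necessarily this repo's exact implementation):

```python
import math
import random

def top_k_sample(logits, k=50, temperature=1.0):
    """Sample a token index from the k highest logits; lower
    temperature concentrates probability on the best candidates."""
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:k]
    weights = [math.exp(logits[i] / temperature) for i in top]
    return random.choices(top, weights=weights)[0]

# With k=1 this degenerates to greedy decoding (always the argmax).
print(top_k_sample([0.1, 2.0, -1.0, 5.0], k=1))  # 3
```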

Integration with the AAM Pipeline

This framework is designed to be the "body" of AAM. Once the model is trained, integrating it with pipeline.py is straightforward:

# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator

class AamPipeline:
    def __init__(self, ...):
        # Load trained diffusion model
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)

Training Data Format

Training data is in JSONL format, one example per line:

{
  "narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
  "trigger": "Siapa yang mencuri Snow Plum Pill?",
  "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
  "compositions": [],
  "confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
  "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
  "reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
  "source_trust": 0.85,
  "temporal_context": [],
  "language": "id",
  "source": "synthetic"
}
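A loader for this format needs only the standard library. The sketch below is illustrative, not a repo module: it parses one record per line and skips blank lines and records missing a required field (the field set mirrors the example above).

```python
import json

REQUIRED_FIELDS = {
    "narrative", "trigger", "evidence_nodes", "confidence_map",
    "anomalies", "reasoning_steps", "source_trust", "language",
}

def load_training_examples(lines):
    """Parse JSONL training data (one Graph->Narrative example per
    line), skipping blank lines and incomplete records."""
    examples = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if REQUIRED_FIELDS <= record.keys():
            examples.append(record)
    return examples

# Usage: with open("data/train.jsonl", encoding="utf-8") as f:
#            examples = load_training_examples(f)
```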

Running Tests

# Run all tests
cd diffusion_llm
python -m pytest tests/ -v

# Run specific test
python -m pytest tests/test_model.py -v

# Run with coverage
python -m pytest tests/ --cov=diffusion_llm

Roadmap

  • Phase 1: Framework Design β€” Arsitektur, config, interface
  • Phase 2: Core Components β€” Noise scheduler, transformer, graph encoder, tokenizer
  • Phase 3: Training Infrastructure β€” Trainer, dataset, loss functions, synthetic data
  • Phase 4: Inference Pipeline β€” Generator, batch generation, interactive mode
  • Phase 5: Training Execution β€” Train on synthetic data, iterate
  • Phase 6: Real Data β€” Collect real Graphβ†’Narrative pairs from AAM usage
  • Phase 7: Optimization β€” Quantization, distillation, flash attention
  • Phase 8: Integration β€” Plug trained model into AAM pipeline

The Novel Analogy

Jin Soun is not someone who rents another person's body in order to speak. He has his own body: weak, third-rate, but HIS OWN. Because that body was trained specifically to execute commands from his own mind (not someone else's), his output is more precise than that of someone with a stronger body but a weaker mind.

AAM = 1 mind + 1 body. Not 1 mind + a rented body.