AAM Diffusion LLM Framework
"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)
A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM: it is a model trained SPECIFICALLY to compose sentences from structured graph data.
Philosophy
Why Not a General LLM?
The earlier concept, "Jin Soun's body = a general LLM (GPT, Claude, etc.)", was a fundamental mistake.
| Aspect | General LLM (Rented) | AAM Diffusion LLM (Its Own) |
|---|---|---|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT: it only narrates what the graph knows |
| Purpose | General purpose | Specialized: composing sentences from the graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | The AAM's OWN |
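To make the input difference concrete, here is an illustrative sketch; the field names mirror the generator API shown later in this README, but the exact shapes are hypothetical:

# A general LLM receives one flat string:
prompt = "Siapa yang mencuri Snow Plum Pill? Bukti: Hefei, Diancang Five Swords."

# The AAM Diffusion LLM instead receives structured graph conditioning;
# each field is handled by a dedicated encoder (see Architecture below):
graph_conditioning = {
    "trigger": "Siapa yang mencuri Snow Plum Pill?",
    "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap"],
    "reasoning_steps": ["Cross-reference tanggal kejadian"],
    "source_trust": 0.85,
}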
Why Diffusion (Not Autoregressive)?
Non-sequential: the model can revise earlier parts while generating later ones, mirroring how Jin Soun forms a thought: vague → clearer → explicit (see the sketch after this list).
Graph conditioning: the entire graph can be encoded as conditioning, not merely a prefix. An autoregressive model only ever sees "what has already been generated."
Coherent long-form: diffusion produces more coherent text for long narratives because every token "knows about" every other token.
Anti-hallucination: the model is trained EXCLUSIVELY for Graph→Narrative generation, so it has no capacity to fabricate information outside the graph.
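The non-sequential point shows up in how inference loops over noise levels rather than token positions. A minimal sketch, assuming a model(x, t, cond) denoiser and a hypothetical scheduler.step() helper (the framework's real loop lives in inference/generator.py):

import torch

def generate(model, scheduler, graph_cond, seq_len=512, d_model=768, n_steps=50):
    # Start from pure Gaussian noise over the ENTIRE sequence at once.
    x = torch.randn(1, seq_len, d_model)
    for t in reversed(range(n_steps)):
        # All positions are denoised together, conditioned on the full
        # graph via cross-attention, so "early" tokens can still change
        # while "late" tokens take shape: vague → clearer → explicit.
        predicted_noise = model(x, torch.tensor([t]), graph_cond)
        x = scheduler.step(predicted_noise, t, x)  # one reverse-diffusion step
    return x  # latent sequence; the tokenizer decodes it to text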
Architecture
┌────────────────────────────────────────────────────────────┐
│                   AAM = 1 Mind + 1 Body                    │
│                                                            │
│  Mind = RSVS Knowledge Graph                               │
│  - Structural memory: remembers EVERYTHING                 │
│  - Relational: understands connections between concepts    │
│  - Perfect recall: never forgets                           │
│  - Confidence scores: knows what is certain vs. uncertain  │
│                                                            │
│  Body = AAM Diffusion LLM                                  │
│  ┌─────────────────────────────────────────────┐           │
│  │ Graph Conditioning Encoder                  │           │
│  │  ├─ Evidence Node Encoder                   │           │
│  │  ├─ Composition Encoder                     │           │
│  │  ├─ Anomaly Encoder                         │           │
│  │  ├─ Reasoning Chain Encoder                 │           │
│  │  ├─ Confidence Embedding                    │           │
│  │  ├─ Temporal Embedding                      │           │
│  │  └─ Graph Attention Layers                  │           │
│  │      ↓ (cross-attention keys/values)        │           │
│  ├─────────────────────────────────────────────┤           │
│  │ Diffusion Transformer (Denoiser)            │           │
│  │  ├─ Token Embedding                         │           │
│  │  ├─ Timestep Embedding (sinusoidal)         │           │
│  │  ├─ N × TransformerBlock:                   │           │
│  │  │   ├─ AdaptiveLayerNorm + Self-Attention  │           │
│  │  │   ├─ AdaptiveLayerNorm + Cross-Attention │           │
│  │  │   └─ AdaptiveLayerNorm + Feed-Forward    │           │
│  │  └─ Output Projection                       │           │
│  │      ↓ (predicted noise)                    │           │
│  ├─────────────────────────────────────────────┤           │
│  │ Noise Scheduler                             │           │
│  │  ├─ Forward: x_0 + noise → x_t              │           │
│  │  └─ Reverse: x_t → denoise → x_{t-1}        │           │
│  └─────────────────────────────────────────────┘           │
│                                                            │
│  Training: Graph→Narrative pairs                           │
│  Inference: Noise → N denoising steps → Narrative          │
└────────────────────────────────────────────────────────────┘
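For reference, the Noise Scheduler's forward branch is the standard diffusion corruption x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. A self-contained sketch using the cosine schedule named in the config (illustrative; the framework's own version lives in model/noise_scheduler.py):

import math
import torch

def cosine_alpha_bar(t: int, n_timesteps: int = 1000, s: float = 0.008) -> float:
    # Cumulative signal level abar_t under the cosine schedule.
    f = math.cos((t / n_timesteps + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def add_noise(x_0: torch.Tensor, t: int) -> tuple[torch.Tensor, torch.Tensor]:
    # Forward process: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    abar = cosine_alpha_bar(t)
    eps = torch.randn_like(x_0)
    x_t = math.sqrt(abar) * x_0 + math.sqrt(1 - abar) * eps
    return x_t, eps  # eps is the regression target when prediction_type="epsilon"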
Folder Structure
diffusion_llm/
├── __init__.py                   # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py           # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py          # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py        # Forward/reverse diffusion process
│   ├── graph_encoder.py          # Graph conditioning encoder
│   ├── diffusion_transformer.py  # Core denoising transformer
│   └── aam_diffusion_model.py    # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py                 # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py                # GraphNarrative dataset
│   └── trainer.py                # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py              # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py    # Synthetic training data
│   └── data_pipeline.py          # Data preparation pipeline
├── scripts/
│   ├── train.py                  # Training entry point
│   ├── evaluate.py               # Evaluation & generation
│   └── export.py                 # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py         # Noise scheduler tests
│   └── test_model.py             # Model component tests
├── requirements.txt              # Python dependencies
└── README.md                     # This file
Quick Start
1. Install Dependencies
pip install torch numpy pytest
2. Generate Synthetic Data
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator
generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
output_dir="./data",
n_train=10000,
n_val=500,
)
3. Train the Model
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100
# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
4. Generate Narratives
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate
# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
5. Programmatic Usage
from diffusion_llm import (
    AamDiffusionConfig, AamDiffusionModel,
    AamTokenizer, AamGenerator,
)
# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")
# Create generator
generator = AamGenerator(model, tokenizer, config)
# Generate narrative from graph conditioning
result = generator.generate(
trigger="Siapa yang mencuri Snow Plum Pill?",
evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
source_trust=0.85,
)
print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
Model Sizes
| Size | d_model | Layers | Heads | Params | Recommended For |
|---|---|---|---|---|---|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| base | 768 | 12 | 12 | ~170M | Recommended for training |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
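As a rough sanity check on the table: a standard transformer block carries about 12 × d_model² weights (4·d_model² in self-attention plus 8·d_model² in a 4× feed-forward), and this architecture adds a cross-attention sublayer per block. A back-of-the-envelope estimate (my own approximation, not the framework's accounting; it excludes the graph encoder):

def approx_params(d_model: int, n_layers: int, vocab_size: int = 32000) -> int:
    self_attn = 4 * d_model ** 2   # Q, K, V, and output projections
    cross_attn = 4 * d_model ** 2  # same shape; keys/values come from the graph encoder
    ffn = 8 * d_model ** 2         # two matrices with d_ff = 4 * d_model
    embeddings = vocab_size * d_model
    return n_layers * (self_attn + cross_attn + ffn) + embeddings

print(f"base: ~{approx_params(768, 12) / 1e6:.0f}M")
# ~138M; the graph encoder and remaining parameters close the gap to the table's ~170M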
Configuration
Model Config
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig
config = AamDiffusionConfig(
model=ModelConfig(
d_model=768, # Hidden dimension
n_layers=12, # Transformer blocks
n_heads=12, # Attention heads
d_ff=3072, # Feed-forward dimension
vocab_size=32000, # Vocabulary size
max_seq_len=512, # Maximum sequence length
),
diffusion=DiffusionConfig(
n_timesteps=1000, # Training timesteps
n_inference_steps=50, # Inference steps (fewer = faster)
schedule_type="cosine", # Noise schedule
prediction_type="epsilon", # Predict noise
sampling_method="ddim", # Fast deterministic sampling
),
)
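One detail worth noting: with sampling_method="ddim", the 50 inference steps stride through the 1,000 training timesteps rather than visiting all of them. An illustrative one-liner (my sketch, not the framework's API):

import numpy as np

# 50 timesteps selected from [0, 999], visited from most to least noisy.
ddim_timesteps = np.linspace(0, 999, num=50, dtype=int)[::-1]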
Inference Config
from diffusion_llm.config.model_config import InferenceConfig
inference = InferenceConfig(
n_steps=50, # Denoising steps
temperature=1.0, # Sampling temperature
top_k=50, # Top-k sampling
max_output_sentences=16, # Max sentences
language="id", # Output language
)
Integration with the AAM Pipeline
This framework is designed to be the "body" of the AAM. Once the model is trained,
integrating it with pipeline.py is straightforward:
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator

class AamPipeline:
    def __init__(self, ...):
        # Load the trained diffusion model
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(
            diffusion_model, diffusion_tokenizer, diffusion_config
        )
Training Data Format
Training data is stored in JSONL format, one example per line:
{
"narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
"trigger": "Siapa yang mencuri Snow Plum Pill?",
"evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
"compositions": [],
"confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
"anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
"reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
"source_trust": 0.85,
"temporal_context": [],
"language": "id",
"source": "synthetic"
}
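Because each line is an independent JSON object, training files can be inspected or validated without the framework. A minimal sketch, assuming the required keys shown above and a hypothetical data/train.jsonl path:

import json

REQUIRED_KEYS = {"narrative", "trigger", "evidence_nodes", "language"}

with open("data/train.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        example = json.loads(line)  # malformed JSON raises here with line context
        missing = REQUIRED_KEYS - example.keys()
        if missing:
            raise ValueError(f"line {i}: missing fields {sorted(missing)}")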
Running Tests
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v
# Run specific test
python -m pytest tests/test_model.py -v
# Run with coverage
python -m pytest tests/ --cov=diffusion_llm
Roadmap
- Phase 1: Framework Design (architecture, config, interfaces)
- Phase 2: Core Components (noise scheduler, transformer, graph encoder, tokenizer)
- Phase 3: Training Infrastructure (trainer, dataset, loss functions, synthetic data)
- Phase 4: Inference Pipeline (generator, batch generation, interactive mode)
- Phase 5: Training Execution (train on synthetic data, iterate)
- Phase 6: Real Data (collect real Graph→Narrative pairs from AAM usage)
- Phase 7: Optimization (quantization, distillation, flash attention)
- Phase 8: Integration (plug the trained model into the AAM pipeline)
Novel Analogy
Jin Soun is not someone who rents another person's body in order to speak. He has a body of his own: weak, third-rate, but HIS. And because that body is trained specifically to execute the commands of his own mind (not someone else's), its output is more precise than that of someone with a stronger body but a weaker mind.
AAM = 1 mind + 1 body. Not 1 mind + a rented body.