# AAM Diffusion LLM Framework

> **"AAM = 1 Pikiran + 1 Tubuh" (1 Mind + 1 Body)**

A dedicated framework for training the Diffusion LLM that serves as the "body" of the Aphantasic Abstraction Model (AAM). This is NOT a general-purpose LLM; it is a model trained SPECIFICALLY to compose sentences from structured graph data.

---

## Philosophy

### Why Not a General-Purpose LLM?

An earlier concept held that "Jin Soun's body = a general-purpose LLM (GPT, Claude, etc.)". That idea is **fundamentally wrong**.

| Aspect | General LLM (rented) | AAM Diffusion LLM (self-owned) |
|--------|----------------------|--------------------------------|
| Input | Text prompt | Graph conditioning (evidence, anomalies, etc.) |
| Output | Probabilistic text | Narrative grounded in the graph |
| Hallucination | CAN fabricate | CANNOT fabricate; it only narrates what the graph knows |
| Purpose | General purpose | Dedicated to composing sentences from a graph |
| Size | 7B-175B params | 100M-500M params |
| Method | Autoregressive | Diffusion (non-sequential) |
| Identity | Rented | OWNED by AAM |

### Why Diffusion (Not Autoregressive)?

1. **Non-sequential**: the model denoises the whole sequence at once, so it can revise earlier parts while refining later ones, much as Jin Soun forms a thought: vague → clearer → explicit (see the noising sketch after this list).

2. **Graph conditioning**: the entire graph can be encoded as conditioning, not just as a prefix. An autoregressive model can only see "what has already been generated."

3. **Coherent long-form**: diffusion produces more coherent long narratives because every token is denoised with awareness of every other token.

4. **Anti-hallucination**: the model is trained EXCLUSIVELY for Graph→Narrative, so it has no capacity to invent information outside the graph.
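
The "vague → clearer → explicit" behaviour comes from the forward/reverse diffusion process over token embeddings. Below is a minimal sketch of the forward (noising) step with a cosine schedule; the function and tensor names are illustrative and may differ from what `model/noise_scheduler.py` actually implements.

```python
import math

import torch

def cosine_alpha_bar(t: int, T: int, s: float = 0.008) -> float:
    # Cumulative signal level alpha_bar(t) for the cosine noise schedule.
    f = math.cos(((t / T) + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos((s / (1 + s)) * math.pi / 2) ** 2
    return f / f0

def forward_noise(x0: torch.Tensor, t: int, T: int = 1000):
    # q(x_t | x_0): mix clean token embeddings with Gaussian noise.
    # x0 shape: (batch, seq_len, d_model)
    alpha_bar = cosine_alpha_bar(t, T)
    noise = torch.randn_like(x0)
    xt = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * noise
    return xt, noise  # the denoiser is trained to predict `noise` from (xt, t, graph)

# Near t=0 the embeddings are almost clean; near t=T they are almost pure noise.
x0 = torch.randn(2, 16, 768)
xt, eps = forward_noise(x0, t=500)
```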

---

## Architecture

```
┌───────────────────────────────────────────────────────────────┐
│                     AAM = 1 Mind + 1 Body                      │
│                                                                │
│  Mind = RSVS Knowledge Graph                                   │
│   - Structural memory: remembers EVERYTHING                    │
│   - Relational: understands connections between concepts       │
│   - Perfect recall: never forgets                              │
│   - Confidence scores: knows what is certain vs uncertain      │
│                                                                │
│  Body = AAM Diffusion LLM                                      │
│  ┌───────────────────────────────────────────────┐            │
│  │ Graph Conditioning Encoder                    │            │
│  │ ├─ Evidence Node Encoder                      │            │
│  │ ├─ Composition Encoder                        │            │
│  │ ├─ Anomaly Encoder                            │            │
│  │ ├─ Reasoning Chain Encoder                    │            │
│  │ ├─ Confidence Embedding                       │            │
│  │ ├─ Temporal Embedding                         │            │
│  │ └─ Graph Attention Layers                     │            │
│  │      ↓ (cross-attention keys/values)          │            │
│  ├───────────────────────────────────────────────┤            │
│  │ Diffusion Transformer (Denoiser)              │            │
│  │ ├─ Token Embedding                            │            │
│  │ ├─ Timestep Embedding (sinusoidal)            │            │
│  │ ├─ N × TransformerBlock:                      │            │
│  │ │   ├─ AdaptiveLayerNorm + Self-Attention     │            │
│  │ │   ├─ AdaptiveLayerNorm + Cross-Attention    │            │
│  │ │   └─ AdaptiveLayerNorm + Feed-Forward       │            │
│  │ └─ Output Projection                          │            │
│  │      ↓ (predicted noise)                      │            │
│  ├───────────────────────────────────────────────┤            │
│  │ Noise Scheduler                               │            │
│  │ ├─ Forward: x_0 + noise → x_t                 │            │
│  │ └─ Reverse: x_t → denoise → x_{t-1}           │            │
│  └───────────────────────────────────────────────┘            │
│                                                                │
│  Training:  Graph→Narrative pairs                              │
│  Inference: Noise → N denoising steps → Narrative              │
└───────────────────────────────────────────────────────────────┘
```
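
To make the denoiser block concrete, here is a minimal PyTorch sketch of one transformer block: adaptive LayerNorm modulated by the timestep embedding, self-attention over the noisy narrative tokens, and cross-attention into the graph encoding. Class and argument names are illustrative; the real block lives in `model/diffusion_transformer.py` and may differ in detail.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """LayerNorm whose scale and shift are predicted from the timestep embedding."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(d_model, 2 * d_model)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class DenoiserBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 12, d_ff: int = 3072):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = AdaLN(d_model), AdaLN(d_model), AdaLN(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x, graph_ctx, t_emb):
        # Self-attention over the (noisy) narrative embeddings.
        h = self.norm1(x, t_emb)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention: queries from the tokens, keys/values from the graph encoding.
        h = self.norm2(x, t_emb)
        x = x + self.cross_attn(h, graph_ctx, graph_ctx, need_weights=False)[0]
        # Position-wise feed-forward.
        return x + self.ff(self.norm3(x, t_emb))
```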

---

## Folder Structure

```
diffusion_llm/
├── __init__.py                  # Package init with public API
├── config/
│   ├── __init__.py
│   └── model_config.py          # All configuration dataclasses
├── tokenizer/
│   ├── __init__.py
│   └── aam_tokenizer.py         # Sentence-level + BPE hybrid tokenizer
├── model/
│   ├── __init__.py
│   ├── noise_scheduler.py       # Forward/reverse diffusion process
│   ├── graph_encoder.py         # Graph conditioning encoder
│   ├── diffusion_transformer.py # Core denoising transformer
│   └── aam_diffusion_model.py   # Complete model (combines all)
├── training/
│   ├── __init__.py
│   ├── losses.py                # Loss functions (MSE, MAE, Huber, weighted)
│   ├── dataset.py               # GraphNarrative dataset
│   └── trainer.py               # Training loop with AMP, EMA, etc.
├── inference/
│   ├── __init__.py
│   └── generator.py             # Inference pipeline
├── data/
│   ├── __init__.py
│   ├── synthetic_generator.py   # Synthetic training data
│   └── data_pipeline.py         # Data preparation pipeline
├── scripts/
│   ├── train.py                 # Training entry point
│   ├── evaluate.py              # Evaluation & generation
│   └── export.py                # Model export
├── tests/
│   ├── __init__.py
│   ├── test_scheduler.py        # Noise scheduler tests
│   └── test_model.py            # Model component tests
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```

---

## Quick Start

### 1. Install Dependencies

```bash
pip install torch numpy pytest
```

### 2. Generate Synthetic Data

```python
from diffusion_llm.data.synthetic_generator import SyntheticDataGenerator

generator = SyntheticDataGenerator(seed=42, language="id")
train_path, val_path = generator.generate_training_split(
    output_dir="./data",
    n_train=10000,
    n_val=500,
)
```

### 3. Train the Model

```bash
# Quick test with tiny model
python diffusion_llm/scripts/train.py --model_size tiny --max_steps 100

# Full training with base model
python diffusion_llm/scripts/train.py --model_size base --max_steps 500000
```

### 4. Generate Narratives

```bash
# Generate samples
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --generate

# Interactive mode
python diffusion_llm/scripts/evaluate.py --checkpoint output/best.pt --interactive
```

### 5. Programmatic Usage

```python
from diffusion_llm import (
    AamDiffusionConfig, get_default_config,
    AamDiffusionModel, AamTokenizer, AamGenerator,
)

# Load model and tokenizer
config = AamDiffusionConfig.from_json("output/config.json")
model = AamDiffusionModel.load("output/best.pt")
tokenizer = AamTokenizer.load("output/data/tokenizer.json")

# Create generator
generator = AamGenerator(model, tokenizer, config)

# Generate narrative from graph conditioning
result = generator.generate(
    trigger="Siapa yang mencuri Snow Plum Pill?",
    evidence_nodes=["Hefei", "Diancang Five Swords", "Ju Jangmok"],
    anomalies=["Tidak ada konsumsi pil baru di pasar gelap"],
    reasoning_steps=["Cross-reference tanggal kejadian", "Deteksi anomali"],
    source_trust=0.85,
)

print(result.narrative)
print(f"Confidence: {result.confidence:.1%}")
print(f"Steps: {result.n_diffusion_steps}")
```

---

## Model Sizes

| Size | d_model | Layers | Heads | Params | Recommended For |
|------|---------|--------|-------|--------|-----------------|
| tiny | 256 | 4 | 4 | ~25M | Quick testing, debugging |
| small | 512 | 8 | 8 | ~70M | Development, prototyping |
| **base** | **768** | **12** | **12** | **~170M** | **Recommended for training** |
| medium | 1024 | 12 | 16 | ~300M | Final training, best quality |
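
As a rough sanity check on the parameter counts above, the transformer trunk can be estimated from the table's dimensions plus `d_ff` and `vocab_size` from the Model Config section below. This is a back-of-envelope estimate only (it ignores the graph encoder, AdaLN modulation, and biases), not a measurement of the actual model:

```python
def approx_params(d_model: int, n_layers: int, d_ff: int, vocab_size: int) -> int:
    # Per block: self-attention (4*d^2) + cross-attention (4*d^2) + feed-forward (2*d*d_ff).
    per_block = 4 * d_model**2 + 4 * d_model**2 + 2 * d_model * d_ff
    # Token embedding plus an untied output projection, counted once.
    return n_layers * per_block + 2 * vocab_size * d_model

print(f"base ≈ {approx_params(768, 12, 3072, 32000) / 1e6:.0f}M")  # ~162M, close to the ~170M above
```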

---

## Configuration

### Model Config

```python
from diffusion_llm.config.model_config import AamDiffusionConfig, ModelConfig, DiffusionConfig

config = AamDiffusionConfig(
    model=ModelConfig(
        d_model=768,                 # Hidden dimension
        n_layers=12,                 # Transformer blocks
        n_heads=12,                  # Attention heads
        d_ff=3072,                   # Feed-forward dimension
        vocab_size=32000,            # Vocabulary size
        max_seq_len=512,             # Maximum sequence length
    ),
    diffusion=DiffusionConfig(
        n_timesteps=1000,            # Training timesteps
        n_inference_steps=50,        # Inference steps (fewer = faster)
        schedule_type="cosine",      # Noise schedule
        prediction_type="epsilon",   # Predict noise
        sampling_method="ddim",      # Fast deterministic sampling
    ),
)
```

### Inference Config

```python
from diffusion_llm.config.model_config import InferenceConfig

inference = InferenceConfig(
    n_steps=50,                  # Denoising steps
    temperature=1.0,             # Sampling temperature
    top_k=50,                    # Top-k sampling
    max_output_sentences=16,     # Max sentences
    language="id",               # Output language
)
```
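
Fewer inference steps are faster because DDIM denoises along a subsequence of the training timesteps rather than all of them. The snippet below illustrates one common way of spacing such a subsequence; it is an illustration of the idea, not necessarily how `model/noise_scheduler.py` chooses its steps.

```python
import numpy as np

def ddim_timesteps(n_timesteps: int = 1000, n_steps: int = 50) -> np.ndarray:
    # Evenly spaced subsequence of the training timesteps, highest noise first.
    steps = np.linspace(0, n_timesteps - 1, n_steps).round().astype(int)
    return steps[::-1]

print(ddim_timesteps())  # 50 timesteps, from 999 down to 0
```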

---

## Integration with the AAM Pipeline

This framework is designed to become the "body" of AAM. Once the model is trained,
integrating it with `pipeline.py` is straightforward:

```python
# In pipeline.py, replace the fallback:
from diffusion_llm import AamDiffusionConfig, AamDiffusionModel, AamTokenizer, AamGenerator

class AamPipeline:
    def __init__(self, ...):
        # Load the trained diffusion model
        diffusion_config = AamDiffusionConfig.from_json("path/to/config.json")
        diffusion_model = AamDiffusionModel.load("path/to/best.pt")
        diffusion_tokenizer = AamTokenizer.load("path/to/tokenizer.json")
        self.diffusion_llm = AamGenerator(diffusion_model, diffusion_tokenizer, diffusion_config)
```

---

## Training Data Format

Training data is stored as JSONL, one example per line:

```json
{
  "narrative": "Berdasarkan analisis, Diancang Five Swords mencuri Snow Plum Pill menggunakan Ju Jangmok sebagai kambing hitam.",
  "trigger": "Siapa yang mencuri Snow Plum Pill?",
  "evidence_nodes": ["Hefei", "Diancang Five Swords", "Ju Jangmok", "Gyeryong Merchant Guild"],
  "compositions": [],
  "confidence_map": {"Hefei": 0.9, "Diancang Five Swords": 0.85, "Ju Jangmok": 0.7},
  "anomalies": ["Tidak ada konsumsi pil baru di pasar gelap", "Pencuri menghilang tanpa jejak"],
  "reasoning_steps": ["Cross-reference tanggal kejadian", "Deteksi ketidaksesuaian pola", "Pattern completion dari bukti terpisah"],
  "source_trust": 0.85,
  "temporal_context": [],
  "language": "id",
  "source": "synthetic"
}
```
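
Because the file is plain JSONL, it can be inspected or validated without loading the framework. A minimal sketch, assuming a file at `./data/train.jsonl`; the path and the required-field list are illustrative, taken from the example above:

```python
import json

REQUIRED_FIELDS = {
    "narrative", "trigger", "evidence_nodes", "confidence_map",
    "anomalies", "reasoning_steps", "source_trust", "language",
}

with open("./data/train.jsonl", encoding="utf-8") as f:
    for line_no, line in enumerate(f, start=1):
        example = json.loads(line)
        missing = REQUIRED_FIELDS - example.keys()
        if missing:
            raise ValueError(f"line {line_no} is missing fields: {sorted(missing)}")
```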

---

## Running Tests

```bash
# Run all tests
cd diffusion_llm
python -m pytest tests/ -v

# Run specific test
python -m pytest tests/test_model.py -v

# Run with coverage
python -m pytest tests/ --cov=diffusion_llm
```

---

## Roadmap

- [x] **Phase 1: Framework Design** - architecture, config, interfaces
- [x] **Phase 2: Core Components** - noise scheduler, transformer, graph encoder, tokenizer
- [x] **Phase 3: Training Infrastructure** - trainer, dataset, loss functions, synthetic data
- [x] **Phase 4: Inference Pipeline** - generator, batch generation, interactive mode
- [ ] **Phase 5: Training Execution** - train on synthetic data, iterate
- [ ] **Phase 6: Real Data** - collect real Graph→Narrative pairs from AAM usage
- [ ] **Phase 7: Optimization** - quantization, distillation, flash attention
- [ ] **Phase 8: Integration** - plug the trained model into the AAM pipeline

---

## Analogy from the Novel

> Jin Soun is not someone who rents another person's body in order to speak.
> He has a body of his own: weak, third-rate, but HIS.
> Because that body is trained specifically to execute commands from
> his own mind (not someone else's), its output is more focused than
> that of someone with a stronger body but a weaker mind.
>
> **AAM = 1 mind + 1 body. Not 1 mind + a rented body.**