# Inference Guide
## Quick Start
```bash
python generate.py
```
**Default Output:**
```
Prompt: 'The history of '
--------------------------------------------------
The history of Tomadination of the [[New Gouple de aparty]]...
```
---
## Basic Usage
### 1. Load Model
```python
from src.models.agiformer import AGIFORMER
import torch
model = AGIFORMER(d_model=512, n_layers=6, patch_size=4, thinking_steps=3)
model.load_state_dict(torch.load("best_model.pth"))
model.eval()
```
### 2. Prepare Input
```python
prompt = "The history of artificial intelligence"
input_bytes = [ord(c) for c in prompt]
# Pad to patch_size boundary
pad = (4 - len(input_bytes) % 4) % 4
input_bytes.extend([32] * pad)
x = torch.tensor(input_bytes).unsqueeze(0)  # (1, seq_len)
```
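The encode-and-pad logic above is reused throughout this guide, so it can be handy to wrap it in a small helper. A minimal sketch (the `encode_prompt` name is illustrative, not part of the codebase, and it assumes ASCII prompts so each character fits in one byte):
```python
import torch

def encode_prompt(prompt: str, patch_size: int = 4) -> torch.Tensor:
    """Encode an ASCII prompt as bytes and pad with spaces to a patch boundary."""
    input_bytes = [ord(c) for c in prompt]
    pad = (patch_size - len(input_bytes) % patch_size) % patch_size
    input_bytes.extend([32] * pad)  # 32 = ASCII space
    return torch.tensor(input_bytes).unsqueeze(0)  # (1, seq_len)
```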
### 3. Generate
```python
with torch.no_grad():
    output = model(x, temperature=0.7)  # (1, num_patches, patch_size)
# Decode the last generated patch
generated_bytes = output[0, -1, :].tolist()
text = ''.join([chr(b) for b in generated_bytes if 32 <= b <= 126])
```
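The snippet above decodes only the last patch. To inspect everything the model emitted, a small helper can flatten all patches back into printable text (a sketch; `decode_patches` is an illustrative name and assumes the `(1, num_patches, patch_size)` inference output shape shown above):
```python
def decode_patches(output: torch.Tensor) -> str:
    """Flatten (1, num_patches, patch_size) byte predictions into printable text."""
    flat_bytes = output[0].reshape(-1).tolist()
    return ''.join(chr(int(b)) for b in flat_bytes if 32 <= int(b) <= 126)

text = decode_patches(output)
```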
---
## Temperature Sampling
### Greedy (Temperature = 0)
```python
output = model(x, temperature=0.0)
```
- Picks the most likely byte at every step
- **Deterministic** (same output each run)
- Prone to repetition loops
**Example:**
```
The history of of of of of...
```
### Low Temperature (0.3 - 0.5)
```python
output = model(x, temperature=0.3)
```
- Slightly random, still conservative
- Good for **coherent** text
- Reduces repetition
**Example:**
```
The history of the computer system...
```
### Medium Temperature (0.7 - 0.9)
```python
output = model(x, temperature=0.7)  # Default
```
- Balanced creativity/coherence
- **Recommended** for exploration
**Example:**
```
The history of Tomadination of the [[New Gouple]]...
```
### High Temperature (1.0+)
```python
output = model(x, temperature=1.2)
```
- Very random
- Incoherent but diverse
- Good for **probing** what the model has learned
**Example:**
```
The history qw8#$x [[zap]] nullification...
```
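When tuning a new checkpoint, it can help to sweep several temperatures over the same prompt and compare the continuations side by side. A sketch reusing the `encode_prompt` helper from above and the last-patch decoding convention used throughout this guide:
```python
prompt = "The history of artificial intelligence"
x = encode_prompt(prompt)

for temp in (0.0, 0.3, 0.7, 1.2):
    with torch.no_grad():
        out = model(x, temperature=temp)
    new_bytes = out[0, -1, :].tolist()
    continuation = ''.join(chr(int(b)) for b in new_bytes if 32 <= int(b) <= 126)
    print(f"temp={temp}: {prompt}{continuation}")
```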
---
## Advanced: Token-by-Token Generation
For streaming output:
```python
def generate_stream(model, prompt, max_tokens=200, temperature=0.7):
    # Encode prompt
    context = [ord(c) for c in prompt]
    pad = (4 - len(context) % 4) % 4
    context.extend([32] * pad)
    for _ in range(max_tokens // 4):  # Generate patch-by-patch
        x = torch.tensor(context[-1024:]).unsqueeze(0)  # Sliding window
        with torch.no_grad():
            pred = model(x, temperature=temperature)
        # Get last patch
        new_bytes = pred[0, -1, :].cpu().tolist()
        context.extend(new_bytes)
        # Decode and print
        chunk = ''.join([chr(b) for b in new_bytes if 32 <= b <= 126])
        print(chunk, end='', flush=True)
```
**Usage:**
```python
generate_stream(model, "The history of ", max_tokens=200)
```
---
## System 2 Control
### Disable Thinking (Baseline)
```python
model = AGIFORMER(thinking_steps=0)  # Skip System 2
```
- Faster inference (~2× speedup)
- Lower quality output
### Increase Thinking
```python
model = AGIFORMER(thinking_steps=5)  # More refinement
```
- Slower inference
- Potentially better coherence
### Runtime Control
System 2 is part of the model, so you must reinitialize (see the sketch below):
```python
# Not possible to change thinking_steps after model creation
# Must create a new model with the desired config
```
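A common workaround is to build a fresh model with the desired `thinking_steps` and reload the same checkpoint. Note this assumes `thinking_steps` only controls how many refinement iterations run and does not change any parameter shapes; if it does in your build, the state dict will not load:
```python
# Same architecture, fewer System 2 refinement steps (assumes weight shapes are unchanged)
fast_model = AGIFORMER(d_model=512, n_layers=6, patch_size=4, thinking_steps=1)
fast_model.load_state_dict(torch.load("best_model.pth"))
fast_model.eval()
```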
---
## Batch Inference
Process multiple prompts:
```python
import torch.nn.functional as F
prompts = ["The history of ", "In the year 2050, ", "Once upon a time, "]
batch = []
for prompt in prompts:
    prompt_bytes = [ord(c) for c in prompt]
    pad = (4 - len(prompt_bytes) % 4) % 4
    prompt_bytes.extend([32] * pad)
    batch.append(torch.tensor(prompt_bytes))
# Pad to the same length (with spaces, matching the padding above)
max_len = max(t.size(0) for t in batch)
batch_tensor = torch.stack([
    F.pad(t, (0, max_len - t.size(0)), value=32)
    for t in batch
])
# Generate
with torch.no_grad():
    outputs = model(batch_tensor, temperature=0.7)
```
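Each row of `outputs` corresponds to one prompt. Decoding follows the same last-patch convention used in the single-prompt example (a sketch):
```python
for prompt, row in zip(prompts, outputs):
    new_bytes = row[-1, :].tolist()  # last patch predicted for this prompt
    continuation = ''.join(chr(int(b)) for b in new_bytes if 32 <= int(b) <= 126)
    print(f"{prompt!r} -> {continuation!r}")
```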
---
## Debugging Output
### Check Raw Bytes
```python
generated = model(x, temperature=0.0)
raw_bytes = generated[0, -1, :].tolist()
print(f"Raw: {raw_bytes}")  # e.g., [116, 104, 101, 32]
```
### Detect Non-Printables
```python
for b in raw_bytes:
    if not (32 <= b <= 126):
        print(f"Warning: non-printable byte {b}")
```
### Measure Entropy
```python
import torch.nn.functional as F
logits = model.head(latents)  # Raw logits; `latents` here stands for the model's internal patch representation
probs = F.softmax(logits, dim=-1)
entropy = -(probs * torch.log2(probs + 1e-10)).sum(dim=-1).mean()
print(f"Avg Entropy: {entropy.item():.2f} bits")
# Low (<2 bits): confident, may repeat
# High (>6 bits): confused, output will be near-random
```
---
## Common Issues
### Repetition Loops
**Problem:**
```
of of of of of...
```
**Solutions:**
1. Increase temperature: `0.0 → 0.7`
2. Use nucleus sampling (top-p):
```python
probs = F.softmax(logits / temp, dim=-1)
sorted_probs, indices = torch.sort(probs, descending=True)
cumsum = torch.cumsum(sorted_probs, dim=-1)
mask = cumsum - sorted_probs > 0.9  # Drop tokens outside the top-90% nucleus, always keep the top-1
sorted_probs[mask] = 0
next_byte = indices.gather(-1, torch.multinomial(sorted_probs, 1))  # Map back to byte ids
```
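Because sampling normally happens inside the model's forward pass, applying top-p yourself means working on raw logits (for example from `model.head`, as in the entropy example above). A reusable sketch under that assumption; `sample_top_p` is an illustrative name, not an existing utility:
```python
def sample_top_p(logits: torch.Tensor, temp: float = 0.7, top_p: float = 0.9) -> torch.Tensor:
    """Sample byte ids from (batch, 256) logits, restricted to the top-p nucleus."""
    probs = F.softmax(logits / temp, dim=-1)
    sorted_probs, indices = torch.sort(probs, descending=True)
    cumsum = torch.cumsum(sorted_probs, dim=-1)
    sorted_probs[cumsum - sorted_probs > top_p] = 0  # drop the tail, always keep the top-1
    picked = torch.multinomial(sorted_probs, 1)      # index into the sorted order
    return indices.gather(-1, picked)                # map back to byte ids
```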
### Gibberish Output
**Problem:**
```
xq#$8z [[nullification]]...
```
**Causes:**
- Temperature too high
- Model undertrained
**Solutions:**
- Lower temperature: `1.2 → 0.5`
- Train longer (20k+ steps)
### Slow Inference
**Problem:** >1s per token
**Solutions:**
- Use GPU: `model.cuda()` (see the sketch below)
- Reduce `thinking_steps`: `3 → 1`
- Disable System 2: `thinking_steps=0`
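Moving both the model and the input tensor to the GPU is usually the largest single win. A rough timing sketch (CUDA assumed available):
```python
import time

model = model.cuda()
x = x.cuda()

torch.cuda.synchronize()
start = time.time()
with torch.no_grad():
    _ = model(x, temperature=0.7)
torch.cuda.synchronize()
print(f"Forward pass: {(time.time() - start) * 1000:.1f} ms")
```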
---
## Performance Benchmarks
**GPU:** NVIDIA T4
**Prompt Length:** 100 bytes
**Generation Length:** 200 bytes

| Config | Latency (per token) | Throughput |
|--------|---------------------|------------|
| Greedy (temp=0) | 45 ms | 22 tokens/s |
| Sampling (temp=0.7) | 52 ms | 19 tokens/s |
| System 2 disabled | 28 ms | 36 tokens/s |
---
## API Reference
### Model Forward
```python
def forward(
    x: torch.Tensor,                              # (Batch, Seq_Len) bytes
    target_bytes: Optional[torch.Tensor] = None,  # For training
    temperature: float = 0.0                      # Sampling temp (0 = greedy)
) -> torch.Tensor:
    # Returns: (Batch, Num_Patches, Patch_Size, 256) if training
    #          (Batch, Num_Patches, Patch_Size) if inference
```
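The two return shapes correspond to the two ways the model is called. A hedged example based on the signature above (`targets` stands in for a tensor of ground-truth bytes with the same layout as `x`):
```python
# Inference: sampled bytes, shape (Batch, Num_Patches, Patch_Size)
pred_bytes = model(x, temperature=0.7)

# Training: per-byte logits, shape (Batch, Num_Patches, Patch_Size, 256)
logits = model(x, target_bytes=targets)
```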
### Generation Utilities
See `generate.py` for the full implementation:
- `generate_text(model_path, prompt, max_tokens, temperature)`
- Automatic padding and decoding
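Based on the signature above, a typical call looks like the following (paths and values are illustrative, and this assumes `generate_text` returns the generated string):
```python
from generate import generate_text

text = generate_text("best_model.pth", "The history of ", max_tokens=200, temperature=0.7)
print(text)
```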
---
## Next Steps
1. **Experiment with Prompts:** Try different domains
2. **Tune Temperature:** Find the sweet spot for your use case
3. **Extend Context:** Modify `generate.py` to use longer contexts
4. **Fine-tune:** Retrain on domain-specific data