---
title: EAM 100M Agentic Kernel v1.2
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: hf_app.py
pinned: true
---

# 100M Parameter Agentic Model Walkthrough

This project implements an agentic model architecture of roughly 100M parameters by combining several existing techniques on top of a minimal Transformer.

## 🚀 Architectural Stack

### 1. Core: nanoGPT
We use a minimalist Transformer architecture based on Karpathy's `nanoGPT`. It provides the foundational attention and MLP blocks, which have been heavily modified for agentic performance.

### 2. Residuals: AttenRes (Attention Residuals)
Instead of standard additive residuals (`x + f(x)`), we implement **Attention Residuals**.
- **File**: `model/attenres.py`
- **Logic**: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This is intended to prevent information dilution and allow deeper reasoning.

### 3. Weights: BitNet b1.58 (QVAC Fabric / Static Sparse)
To ensure efficiency on consumer hardware (QVAC style), we use **Ternary Weights** ({-1, 0, 1}).
- **File**: `model/bitnet.py`
- **Efficiency**: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.

### 4. Attention: Memory Sparse Attention (MSA) ⭐ NEW
Replaces the standard causal attention with a triple-mechanism attention layer.
- **File**: `model/memory_sparse_attention.py`
- **Mechanism 1 (Persistent Memory Tokens)**: Each layer holds `n_memory_tokens=32` learnable `(K, V)` parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
- **Mechanism 2 (IndexCache Sparse Top-K)**: Full layers (even `layer_idx`) compute top-K attention indices over the sequence and cache them. Shared layers (odd `layer_idx`) reuse the cached indices, reducing attention cost from O(T²) to O(T · `sparse_topk`). Memory slots are always kept regardless of the sparse mask.
- **Mechanism 3 (Interleaved Head Attention)**: The first half of the heads use a local sliding-window mask (`local_window_size=256`); the second half retain unrestricted global access. Memory slots are exempt from this masking too.

### 5. Reasoning: Tiny Recursive Loop
The "agentic" part of the model comes from a recursive inference loop.
- **File**: `agent/recursive_reasoning.py`
- **Process**: The model generates a ``, critiques it, and refines it up to $N$ times before producing the final answer.

### 7. Teacher: NIM Distillation (N3S) ⭐ NEW
The model was distilled using **NVIDIA Nemotron-3 Super (N3S)** as a high-fidelity teacher.
- **Method**: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
- **Alignment**: Alignment-aware distillation is used so that the kernel follows workspace safety and grounding protocols.

### 8. Ecosystem: Model Context Protocol (MCP) ⭐ EXPANDED
Natively orchestrates cloud and local tools via MCP connectors.
- **Integrations**: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
- **Orchestration**: The recursive loop manages authentication signals and tool execution results.

## 📊 Model Statistics
- **Layers**: 10
- **Embedding Dim**: 640
- **Heads**: 10
- **Memory Slots / Layer**: 32 (K+V, persistent, learnable)
- **Sparse Top-K**: 128 tokens per head (IndexCache)
- **Local Window**: 256 tokens (Interleaved Attention)
- **Total Parameters**: ~94.9M (includes memory K/V params)
- **Precision**: 1.58-bit (Ternary)

## 🛠️ Usage
To view the architecture and verify parameters:
```bash
python main.py
```
*(Requires `torch` to be installed.)*
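
To make the Attention Residuals idea from section 2 concrete, here is a minimal pure-Python sketch (no `torch` needed). It mixes all previous layer outputs with softmax weights instead of adding them. The function name `attention_residual`, the scaled dot-product scoring, and the way the query is supplied are assumptions made for illustration; `model/attenres.py` may differ.

```python
import math


def attention_residual(query, layer_outputs):
    """Mix previous layer outputs with softmax(query . output) weights.

    Hypothetical helper: scaled dot-product scoring is an assumption.
    """
    dim = len(query)
    scores = [
        sum(q * h for q, h in zip(query, out)) / math.sqrt(dim)
        for out in layer_outputs
    ]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum over layers, computed one feature at a time.
    mixed = [
        sum(w * out[i] for w, out in zip(weights, layer_outputs))
        for i in range(dim)
    ]
    return mixed, weights


# Three earlier layer outputs for a single token, embedding dim 2.
outputs = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
mixed, weights = attention_residual([1.0, 0.0], outputs)
print(weights, mixed)
```

The layer output most aligned with the query receives the largest weight, so earlier representations can be retrieved selectively rather than summed uniformly.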
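
The ternary weights of section 3 can be illustrated with the absmean scheme described for BitNet b1.58: divide by the mean absolute weight, then round and clip to {-1, 0, 1}. This is a pure-Python sketch; whether `model/bitnet.py` uses exactly this scheme is an assumption, and `absmean_ternarize` is a hypothetical name.

```python
def absmean_ternarize(weights, eps=1e-8):
    """Quantize float weights to {-1, 0, 1} plus one shared scale.

    Assumed scheme: scale by the mean absolute value, round, then clip.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    ternary = [max(-1, min(1, round(w / scale))) for w in weights]
    return ternary, scale


def dequantize(ternary, scale):
    """Approximate reconstruction of the original weights."""
    return [t * scale for t in ternary]


weights = [0.9, -0.05, 0.4, -1.2, 0.02, 0.0]
ternary, scale = absmean_ternarize(weights)
sparsity = ternary.count(0) / len(ternary)  # fraction of pruned connections
print(ternary, sparsity)
```

Small weights collapse to 0, which is what lets the matrix behave like a static sparse one.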
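
The masking rules of MSA mechanisms 1 and 3 in section 4 can be sketched as a per-head boolean mask. The key layout `[memory slots | sequence positions]`, the causal constraint on sequence keys, and the name `build_head_mask` are assumptions for this illustration, not taken from `model/memory_sparse_attention.py`.

```python
def build_head_mask(head_idx, n_heads, seq_len, n_memory, window):
    """Return mask[q][k] for one head, where True means "may attend".

    Assumed key layout: memory slots first, then sequence positions.
    """
    is_local = head_idx < n_heads // 2  # first half of heads: sliding window
    mask = []
    for q in range(seq_len):
        row = [True] * n_memory  # memory slots are never masked
        for k in range(seq_len):
            causal = k <= q
            in_window = (q - k) < window
            row.append(causal and (in_window or not is_local))
        mask.append(row)
    return mask


# Tiny configuration so the result is easy to inspect by hand.
local = build_head_mask(head_idx=0, n_heads=4, seq_len=6, n_memory=2, window=3)
global_ = build_head_mask(head_idx=3, n_heads=4, seq_len=6, n_memory=2, window=3)
print(sum(local[5]), sum(global_[5]))
```

For the last query position, the local head sees the two memory slots plus the three most recent tokens, while the global head sees the memory slots plus the full causal prefix.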
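
The index sharing of MSA mechanism 2 (IndexCache) in section 4 might look roughly like the following pure-Python sketch. The `IndexCache` class and its `select` method are hypothetical; only the even/odd layer split comes from the description above.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, highest first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


class IndexCache:
    """Holds the most recent top-K key indices per query position."""

    def __init__(self):
        self.indices = None

    def select(self, layer_idx, scores_per_query, k):
        is_full_layer = layer_idx % 2 == 0
        if is_full_layer or self.indices is None:
            # Full layer: score every key, then cache the selection.
            self.indices = [topk_indices(row, k) for row in scores_per_query]
        # Shared layers fall through and reuse the cached selection.
        return self.indices


cache = IndexCache()
scores_layer0 = [[0.1, 0.9, 0.3], [0.8, 0.2, 0.5]]
idx0 = cache.select(0, scores_layer0, k=2)  # computed and cached
idx1 = cache.select(1, None, k=2)           # reused, no scoring needed
print(idx0, idx1)
```

A shared layer never needs the full score matrix, which is where the saving comes from: it only attends to the cached `k` keys per query.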
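
The generate, critique, refine cycle of section 5 can be sketched as a bounded loop. The callables `generate`, `critique`, and `refine`, as well as the early exit when the critic has no feedback, are assumptions for illustration; `agent/recursive_reasoning.py` may structure this differently.

```python
def recursive_refine(prompt, generate, critique, refine, max_rounds=3):
    """Draft an answer, then critique and refine it up to max_rounds times."""
    draft = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, draft)
        if feedback is None:  # assumed convention: None means "no objections"
            break
        draft = refine(prompt, draft, feedback)
    return draft


# Toy stand-ins for the model calls, so the loop can run without a model.
def toy_generate(prompt):
    return "2 + 2 = 5"


def toy_critique(prompt, draft):
    return None if draft.endswith("4") else "the sum is wrong"


def toy_refine(prompt, draft, feedback):
    return "2 + 2 = 4"


answer = recursive_refine("What is 2 + 2?", toy_generate, toy_critique, toy_refine)
print(answer)
```

Capping the loop at `max_rounds` keeps inference cost bounded even when the critic never approves a draft.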