---
title: EAM 100M Agentic Kernel v1.2
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: hf_app.py
pinned: true
---
# 100M Parameter Agentic Model Walkthrough
This project implements an agentic model architecture of roughly 100M parameters by combining several existing techniques, each described below.
## 🚀 Architectural Stack
### 1. Core: nanoGPT
We use a minimalist Transformer architecture based on Karpathy's `nanoGPT`. It provides the foundational attention and MLP blocks, which are then heavily modified (residuals, weights, and attention, as described in the following sections) for agentic performance.
### 2. Residuals: AttenRes (Attention Residuals)
Instead of standard additive residuals (`x + f(x)`), we implement **Attention Residuals**.
- **File**: `model/attenres.py`
- **Logic**: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This prevents information dilution and allows deeper reasoning.
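The retrieval over earlier layer outputs can be illustrated with a minimal, plain-Python sketch. It is not the contents of `model/attenres.py`: the per-layer `query` vector, the scaled dot-product scoring, and the function name `attention_residual` are all assumptions made for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_residual(query, history):
    """Mix all earlier layer outputs, weighted by similarity to `query`.

    query:   list[float], a hypothetical per-layer query vector.
    history: earlier layer outputs, each a list[float] of the same length.
    """
    scale = 1.0 / math.sqrt(len(query))
    # One score per earlier layer: scaled dot product with the query.
    scores = [scale * sum(q * h for q, h in zip(query, out)) for out in history]
    weights = softmax(scores)
    # Weighted sum of the earlier outputs replaces the plain `x + f(x)` carry.
    mixed = [sum(w * out[i] for w, out in zip(weights, history))
             for i in range(len(query))]
    return mixed, weights

history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attention_residual([2.0, 0.0], history)
```

In this toy input the first and third layer outputs align equally with the query, so they receive equal weight and the second receives less.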
### 3. Weights: BitNet b1.58 (QVAC Fabric / Static Sparse)
To ensure efficiency on consumer hardware (QVAC style), we use **Ternary Weights** ({-1, 0, 1}).
- **File**: `model/bitnet.py`
- **Efficiency**: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.
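As a rough illustration of ternary weights, the sketch below applies the absmean rounding described for BitNet b1.58 (scale by the mean absolute weight, round, clip to {-1, 0, 1}) to a small list. Whether `model/bitnet.py` uses exactly this scheme is an assumption, and `ternarize` is a hypothetical name. The last two lines are plain arithmetic: a ternary value carries log2(3) ≈ 1.58 bits, and a simple 2-bit packing of the quantized matrices alone would take 1/8 the space of FP16. The saving for a whole model also depends on any tensors kept in higher precision, which this README does not break down.

```python
import math

def ternarize(weights, eps=1e-8):
    """Absmean ternarization: returns (values in {-1, 0, 1}, scale)."""
    scale = sum(abs(w) for w in weights) / len(weights)
    q = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return q, scale

q, scale = ternarize([0.9, -0.05, 0.4, -1.2])

# Information content of one ternary weight, in bits.
bits_per_weight = math.log2(3)
# Fraction saved versus FP16 if each ternary weight is stored in 2 bits
# (an assumed packing; applies to the quantized matrices only).
packed_saving_vs_fp16 = 1 - 2 / 16
```

The small weight `-0.05` rounds to 0, which is the "pruned connection" case described above.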
### 4. Attention: Memory Sparse Attention (MSA) ⭐ NEW
Replaces the standard causal attention with a triple-mechanism attention layer.
- **File**: `model/memory_sparse_attention.py`
- **Mechanism 1 (Persistent Memory Tokens)**: Each layer holds `n_memory_tokens=32` learnable `(K, V)` parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
- **Mechanism 2 (IndexCache Sparse Top-K)**: Full layers (even `layer_idx`) compute top-K attention indices over the sequence and cache them. Shared layers (odd `layer_idx`) reuse the cached indices, reducing O(T²) → O(T · sparse_topk). Memory slots are always kept regardless of the sparse mask.
- **Mechanism 3 (Interleaved Head Attention)**: The first half of heads use a local sliding-window mask (`local_window_size=256`); the second half retain unrestricted global access. Memory slots are exempt from this masking too.
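The three mechanisms can be sketched for a single query position in plain Python. This is an illustration under stated assumptions, not the code in `model/memory_sparse_attention.py`: the helper names, the `IndexCache` class shape, the choice to put memory slots in the first key columns, and a window that includes the current token are all hypothetical.

```python
def head_is_local(head, n_heads):
    """First half of the heads are local, second half global."""
    return head < n_heads // 2

def allowed_keys(t, head, n_heads, window, n_memory):
    """Key columns visible to query position t for one head.

    Columns 0..n_memory-1 are memory slots (always visible);
    column n_memory + s is sequence position s (causal: s <= t).
    """
    first = max(0, t - window + 1) if head_is_local(head, n_heads) else 0
    memory = list(range(n_memory))
    sequence = [n_memory + s for s in range(first, t + 1)]
    return memory + sequence

def topk_indices(scores, k):
    """Indices of the k largest scores, returned in ascending order."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(order[:k])

class IndexCache:
    """Even layers compute top-K indices; odd layers reuse the last ones."""
    def __init__(self):
        self._cached = None

    def select(self, layer_idx, scores, k):
        if layer_idx % 2 == 0 or self._cached is None:
            self._cached = topk_indices(scores, k)
        return self._cached

local_cols = allowed_keys(t=5, head=0, n_heads=4, window=3, n_memory=2)
global_cols = allowed_keys(t=5, head=3, n_heads=4, window=3, n_memory=2)

cache = IndexCache()
full = cache.select(0, [0.1, 0.9, 0.3, 0.8], k=2)    # computed on a full layer
shared = cache.select(1, [0.9, 0.1, 0.1, 0.1], k=2)  # reused on a shared layer
```

The shared layer ignores its own scores and returns the indices cached by the preceding full layer, which is where the compute saving comes from.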
### 5. Reasoning: Tiny Recursive Loop
The "agentic" part of the model comes from a recursive inference loop.
- **File**: `agent/recursive_reasoning.py`
- **Process**: The model generates a `<thought>`, critiques it, and refines it up to $N$ times before producing the final answer.
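The generate, critique, refine cycle could look roughly like the sketch below. The three callables stand in for model calls and are hypothetical, as is the early exit when the critic returns `None`; the actual interface in `agent/recursive_reasoning.py` is not shown in this README.

```python
def recursive_reason(prompt, generate, critique, refine, max_rounds=3):
    """Draft a thought, then critique and refine it up to `max_rounds` times."""
    thought = generate(prompt)
    trace = [thought]
    for _ in range(max_rounds):
        feedback = critique(prompt, thought)
        if feedback is None:  # assumed convention: no objections, stop early
            break
        thought = refine(prompt, thought, feedback)
        trace.append(thought)
    return thought, trace

# Toy stand-ins so the loop can run without a model.
def toy_generate(prompt):
    return "<thought>draft</thought>"

def toy_critique(prompt, thought):
    return None if thought.count("+") >= 2 else "add detail"

def toy_refine(prompt, thought, feedback):
    return thought.replace("</thought>", "+</thought>")

answer, trace = recursive_reason("q", toy_generate, toy_critique, toy_refine,
                                 max_rounds=5)
```

With these stand-ins the loop refines twice and then stops on the third critique, well before the `max_rounds` limit.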
### 7. Teacher: NIM Distillation (N3S) ⭐ NEW
The model was distilled using **NVIDIA Nemotron-3 Super (N3S)** as a high-fidelity teacher.
- **Method**: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
- **Alignment**: Alignment-aware distillation ensures the kernel follows workspace safety and grounding protocols.
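This README does not spell out the MTD objective. As a generic point of reference only, the sketch below shows a common distillation loss: the KL divergence between temperature-softened teacher and student distributions, averaged over several token positions. The function names and the temperature value are assumptions, and this should not be read as the loss actually used here.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean per-position KL(teacher || student) over a token sequence."""
    per_token = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)

teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]  # two positions, three-token vocab
matched = distillation_loss(teacher, teacher)
mismatched = distillation_loss(teacher, [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise.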
### 8. Ecosystem: Model Context Protocol (MCP) ⭐ EXPANDED
Natively orchestrates cloud and local tools via MCP connectors.
- **Integrations**: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
- **Orchestration**: The recursive loop manages authentication signals and tool execution results.
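For orientation, MCP tool invocations are JSON-RPC 2.0 messages using the `tools/call` method. The sketch below builds such a request and reads the text parts of a result so they can be fed back into the reasoning loop. The tool name `calendar_list_events`, its arguments, and both helper functions are hypothetical; transport and authentication are omitted, and the connectors this project actually ships are not shown in this README.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for the MCP `tools/call` method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

def extract_text(response):
    """Join the text parts of a tool result; raise if an error was reported."""
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "unknown error"))
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "\n".join(part["text"] for part in result["content"]
                     if part.get("type") == "text")

# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "calendar_list_events", {"date": "2025-01-01"})
wire = json.dumps(request)

# A hand-written reply in the shape of an MCP tool result.
reply = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "2 events"}],
               "isError": False},
}
observation = extract_text(reply)
```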
## 📊 Model Statistics
- **Layers**: 10
- **Embedding Dim**: 640
- **Heads**: 10
- **Memory Slots / Layer**: 32 (K+V, persistent, learnable)
- **Sparse Top-K**: 128 tokens per head (IndexCache)
- **Local Window**: 256 tokens (Interleaved Attention)
- **Total Parameters**: ~94.9M (includes memory K/V params)
- **Precision**: 1.58-bit (Ternary)
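Some of these figures can be cross-checked with quick arithmetic. The sketch below assumes each memory slot stores one key and one value of width `d_model` per layer, and that each block keeps the usual GPT-2 style layout of about `12 * d_model**2` weights (attention plus a 4x MLP, ignoring biases and norms). Both are assumptions. The remainder of the ~94.9M total (embeddings, AttenRes parameters, and so on) cannot be derived from the numbers listed here, since the vocabulary size is not stated.

```python
n_layer, d_model, n_head, n_memory = 10, 640, 10, 32

# Width of each attention head.
head_dim = d_model // n_head

# Persistent memory: one K and one V vector per slot, per layer (assumed shape).
memory_params = n_layer * n_memory * 2 * d_model

# Standard Transformer block estimate: 4*d^2 for attention + 8*d^2 for the MLP.
dense_block_params = n_layer * 12 * d_model ** 2

print(f"head_dim={head_dim}")
print(f"memory K/V params={memory_params:,}")
print(f"block weights (estimate)={dense_block_params:,}")
```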
## 🛠️ Usage
To view the architecture and verify parameters:
```bash
python main.py
```
*(Requires `torch` to be installed.)*