title: EAM 100M Agentic Kernel v1.2
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: hf_app.py
pinned: true
100M Parameter Agentic Model Walkthrough
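This project implements a roughly 100M-parameter agentic model by combining several existing techniques: a nanoGPT core, attention residuals, ternary BitNet weights, memory sparse attention, a recursive reasoning loop, teacher distillation, and MCP tool integration.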
Architectural Stack
1. Core: nanoGPT
We use a minimalist Transformer architecture based on Karpathy's nanoGPT. It provides the foundational attention and MLP blocks but has been heavily modified for agentic performance.
2. Residuals: AttenRes (Attention Residuals)
Instead of standard additive residuals (x + f(x)), we implement Attention Residuals.
- File: model/attenres.py
- Logic: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This prevents information dilution and allows deeper reasoning.
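The retrieval over previous layer outputs can be sketched as follows. This is a minimal illustration in plain Python, not the contents of model/attenres.py: the function names, the scaled dot-product scoring, and the use of the current hidden vector as the query are all assumptions made for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_residual(query, history):
    """Mix all earlier layer outputs, weighted by similarity to `query`.

    query:   hidden vector of the current layer (hypothetical choice of query)
    history: list of output vectors from all previous layers
    """
    scale = 1.0 / math.sqrt(len(query))
    # One score per previous layer: scaled dot product with the query.
    scores = [scale * sum(q * h for q, h in zip(query, vec)) for vec in history]
    weights = softmax(scores)
    # Weighted sum of the previous layer outputs replaces the plain x + f(x).
    mixed = [
        sum(w * vec[i] for w, vec in zip(weights, history))
        for i in range(len(query))
    ]
    return mixed, weights


history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # outputs of three earlier layers
mixed, weights = attention_residual([2.0, 0.0], history)
```

Layers whose outputs align with the query receive larger weights, so earlier information is selected rather than summed uniformly.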
3. Weights: BitNet 1.58b (QVAC Fabric / Static Sparse)
To ensure efficiency on consumer hardware (QVAC style), we use Ternary Weights ({-1, 0, 1}).
- File: model/bitnet.py
- Efficiency: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.
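To illustrate how full-precision weights can be mapped to {-1, 0, 1}, here is a small sketch of absmean quantization, the scheme commonly associated with 1.58-bit BitNet. Whether model/bitnet.py uses exactly this rule is an assumption; the function name and the sample weights are hypothetical.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean quantization: divide by mean |w|, round, clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


weights = [0.9, -0.05, 0.4, -1.2, 0.02]  # hypothetical FP weights
q, scale = ternary_quantize(weights)
print(q)  # [1, 0, 1, -1, 0]

# Small weights collapse to 0, which is what makes the matrix behave
# like a static sparse one.
sparsity = q.count(0) / len(q)

# Approximate reconstruction used at compute time: q * scale.
dequantized = [v * scale for v in q]
```

The single per-tensor `scale` is kept in higher precision, so only the ternary values need compact storage.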
4. Attention: Memory Sparse Attention (MSA) (NEW)
Replaces the standard causal attention with a triple-mechanism attention layer.
- File: model/memory_sparse_attention.py
- Mechanism 1 (Persistent Memory Tokens): Each layer holds n_memory_tokens=32 learnable (K, V) parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
- Mechanism 2 (IndexCache Sparse Top-K): Full layers (even layer_idx) compute top-K attention indices over the sequence and cache them. Shared layers (odd layer_idx) reuse the cached indices, reducing attention cost from O(T²) to O(T · sparse_topk). Memory slots are always kept regardless of the sparse mask.
- Mechanism 3 (Interleaved Head Attention): The first half of heads use a local sliding-window mask (local_window_size=256); the second half retain unrestricted global access. Memory slots are exempt from this masking too.
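The masking and index-reuse rules above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in model/memory_sparse_attention.py: the key layout (memory slots first, then sequence tokens), the causal limit on global heads, and the names `allowed_keys` and `IndexCache.select` are all hypothetical.

```python
def allowed_keys(query_pos, head, n_heads, n_memory, window):
    """Key indices one query position may attend to in one head.

    Assumed layout: keys = [memory slots] + [sequence tokens].
    """
    memory = list(range(n_memory))  # memory slots are never masked
    if head < n_heads // 2:
        start = max(0, query_pos - window + 1)  # local sliding window
    else:
        start = 0  # global head (still causal in this sketch)
    tokens = [n_memory + t for t in range(start, query_pos + 1)]
    return memory + tokens


class IndexCache:
    """Even layers compute top-K indices; odd layers reuse them."""

    def __init__(self):
        self.indices = None

    def select(self, layer_idx, scores, top_k):
        if layer_idx % 2 == 0 or self.indices is None:
            ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
            self.indices = sorted(ranked[:top_k])
        return self.indices


local = allowed_keys(query_pos=5, head=0, n_heads=4, n_memory=2, window=3)
global_ = allowed_keys(query_pos=5, head=3, n_heads=4, n_memory=2, window=3)
print(local)    # [0, 1, 5, 6, 7]
print(global_)  # [0, 1, 2, 3, 4, 5, 6, 7]

cache = IndexCache()
even = cache.select(0, [0.1, 0.9, 0.3, 0.7], top_k=2)  # computed: [1, 3]
odd = cache.select(1, [0.9, 0.1, 0.1, 0.1], top_k=2)   # reused:   [1, 3]
```

The odd layer skips the ranking step entirely, which is where the saving over full O(T²) attention comes from.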
5. Reasoning: Tiny Recursive Loop
The "agentic" part of the model comes from a recursive inference loop.
- File: agent/recursive_reasoning.py
- Process: The model generates a <thought>, critiques it, and refines it up to N times before producing the final answer.
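The generate, critique, refine cycle can be sketched as a bounded loop. This is a minimal illustration, not the code in agent/recursive_reasoning.py: the callables `generate`, `critique`, and `refine` stand in for model calls, and the early exit when the critic has no objection is an assumption based on the "up to N times" wording.

```python
def recursive_reason(prompt, generate, critique, refine, max_rounds=3):
    """Draft a thought, then critique and refine it at most `max_rounds` times."""
    thought = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, thought)
        if feedback is None:  # no objections: stop early (assumed behavior)
            break
        thought = refine(prompt, thought, feedback)
    return thought


# Toy stand-ins for the model calls, for illustration only.
def generate(prompt):
    return "2 + 2 = 5"


def critique(prompt, thought):
    return "arithmetic error" if thought.endswith("5") else None


def refine(prompt, thought, feedback):
    return thought[:-1] + "4"


answer = recursive_reason("What is 2 + 2?", generate, critique, refine)
print(answer)  # 2 + 2 = 4
```

Bounding the loop with `max_rounds` keeps inference cost predictable even when the critic never approves a draft.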
7. Teacher: NIM Distillation (N3S) (NEW)
The model was distilled using NVIDIA Nemotron-3 Super (N3S) as a high-fidelity teacher.
- Method: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
- Alignment: Alignment-aware distillation ensures the kernel follows workspace safety and grounding protocols.
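The source does not define Multi-Token Distillation precisely. One plausible reading, sketched below purely as an assumption, is the standard distillation objective (KL divergence from the teacher's softened token distribution to the student's) averaged over several token positions. The function names, the temperature value, and the toy logits are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distill_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over a list of token positions."""
    per_position = [
        kl_divergence(softmax(t, temperature), softmax(s, temperature))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)


teacher = [[2.0, 0.5, -1.0], [0.0, 1.0, 3.0]]  # two positions, vocab of 3
uniform_student = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

loss_matched = distill_loss(teacher, teacher)
loss_uniform = distill_loss(teacher, uniform_student)
```

A student that reproduces the teacher's logits gets zero loss, while an uninformed (uniform) student is penalized at every position.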
8. Ecosystem: Model Context Protocol (MCP) (EXPANDED)
The model natively orchestrates cloud and local tools via MCP connectors.
- Integrations: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
- Orchestration: The recursive loop manages authentication signals and tool execution results.
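As a rough sketch of what the loop exchanges with a connector: MCP tool invocations are JSON-RPC 2.0 messages using the `tools/call` method. The helper names, the tool name `calendar.create_event`, and its arguments below are hypothetical and only show the message shape, not this project's actual connector code.

```python
import itertools
import json

_request_ids = itertools.count(1)


def tool_call_request(name, arguments):
    """Build a JSON-RPC 2.0 request that asks an MCP server to run a tool."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }


def extract_text(response):
    """Return (ok, text) from a tool result so the loop can reason about it."""
    if "error" in response:  # protocol-level failure
        return False, response["error"].get("message", "")
    result = response["result"]
    text = " ".join(
        part["text"] for part in result.get("content", []) if part.get("type") == "text"
    )
    return not result.get("isError", False), text


# Hypothetical tool name and arguments.
request = tool_call_request("calendar.create_event", {"title": "Design review"})
wire = json.dumps(request)

# A hand-written response of the kind a server could send back.
response = {
    "jsonrpc": "2.0",
    "id": request["id"],
    "result": {"content": [{"type": "text", "text": "Event created"}], "isError": False},
}
ok, text = extract_text(response)
```

The `(ok, text)` pair is what a recursive loop could feed into its next critique step, for example to retry after an authentication failure.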
Model Statistics
- Layers: 10
- Embedding Dim: 640
- Heads: 10
- Memory Slots / Layer: 32 (K+V, persistent, learnable)
- Sparse Top-K: 128 tokens per head (IndexCache)
- Local Window: 256 tokens (Interleaved Attention)
- Total Parameters: ~94.9M (includes memory K/V params)
- Precision: 1.58-bit (Ternary)
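A few quantities follow directly from the figures above. The short calculation below assumes each memory slot stores one K vector and one V vector of width equal to the embedding dimension; that shape is an assumption, and the variable names are illustrative.

```python
n_layer, n_embd, n_head = 10, 640, 10
n_memory_tokens = 32
total_params = 94.9e6

head_dim = n_embd // n_head    # 64 dimensions per head
local_heads = n_head // 2      # 5 sliding-window heads
global_heads = n_head - local_heads  # 5 global heads

# Assumed shape: K and V are each (n_memory_tokens, n_embd) per layer.
memory_params_per_layer = 2 * n_memory_tokens * n_embd   # 40,960
memory_params_total = n_layer * memory_params_per_layer  # 409,600

memory_share = memory_params_total / total_params
print(f"memory K/V share of parameters: {memory_share:.2%}")
```

Under that assumption the persistent memory accounts for well under 1% of the parameter budget.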
🛠️ Usage
To view the architecture and verify parameters:
python main.py
(Requires PyTorch, i.e. the torch package, to be installed.)