title: EAM 100M Agentic Kernel v1.2
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: hf_app.py
pinned: true

100M Parameter Agentic Model Walkthrough

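This project implements a roughly 100M-parameter agentic model by combining several existing techniques: a nanoGPT core, attention residuals, ternary BitNet weights, memory sparse attention, a recursive reasoning loop, teacher distillation, and MCP tool connectors.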

🚀 Architectural Stack

1. Core: nanoGPT

We use a minimalist Transformer architecture based on Karpathy's nanoGPT. It provides the foundational attention and MLP blocks but has been heavily modified for agentic performance.

2. Residuals: AttenRes (Attention Residuals)

Instead of standard additive residuals (x + f(x)), we implement Attention Residuals.

  • File: model/attenres.py
  • Logic: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This is intended to limit information dilution across depth and to let later layers draw directly on earlier representations.
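
The retrieval over earlier layer outputs can be sketched in plain Python. This is an illustration of the idea only, not the contents of model/attenres.py: the function name `attention_residual` is hypothetical, and the raw vectors stand in for the learned query/key projections a real layer would likely use.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_residual(current, previous_outputs):
    """Mix all earlier layer outputs, weighted by similarity to `current`.

    Hypothetical helper: scaled dot-product scores over the stack of
    previous layer outputs replace the usual x + f(x) shortcut.
    """
    dim = len(current)
    scores = [
        sum(c * p for c, p in zip(current, prev)) / math.sqrt(dim)
        for prev in previous_outputs
    ]
    weights = softmax(scores)
    mixed = [
        sum(w * prev[i] for w, prev in zip(weights, previous_outputs))
        for i in range(dim)
    ]
    return mixed, weights

# Three earlier layer outputs for one token, each of width 2.
layer_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attention_residual([1.0, 0.0], layer_outputs)
```

Here the first and third outputs align equally well with the current state, so they receive equal weight and the second receives less.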

3. Weights: BitNet b1.58 (QVAC Fabric / Static Sparse)

To ensure efficiency on consumer hardware (QVAC style), we use Ternary Weights ({-1, 0, 1}).

  • File: model/bitnet.py
  • Efficiency: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.
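
A three-valued weight carries log2(3) ≈ 1.58 bits, which is where the name comes from. One common way to obtain ternary weights is the absmean scheme described for BitNet b1.58: divide by the mean absolute weight, round, and clip. The sketch below assumes that scheme. Whether model/bitnet.py does exactly this is not confirmed here, and the function names are hypothetical.

```python
def ternary_quantize(weights, eps=1e-8):
    """Quantize a flat list of floats to {-1, 0, 1} plus one shared scale.

    Absmean scheme (assumed): scale by the mean absolute value,
    round to the nearest integer, then clip to [-1, 1].
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Approximate reconstruction of the original weights."""
    return [q * scale for q in quantized]

w = [0.9, -0.05, 0.4, -1.2, 0.02]
q, scale = ternary_quantize(w)
print(q)  # [1, 0, 1, -1, 0]
```

The small weights collapse to 0, which is the "pruned connection" behaviour described above.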

4. Attention: Memory Sparse Attention (MSA) ⭐ NEW

Replaces the standard causal attention with a triple-mechanism attention layer.

  • File: model/memory_sparse_attention.py
  • Mechanism 1 - Persistent Memory Tokens: Each layer holds n_memory_tokens=32 learnable (K, V) parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
  • Mechanism 2 - IndexCache Sparse Top-K: Full layers (even layer_idx) compute top-K attention indices over the sequence and cache them. Shared layers (odd layer_idx) reuse the cached indices, reducing attention cost from O(T²) to O(T · sparse_topk). Memory slots are always kept regardless of the sparse mask.
  • Mechanism 3 - Interleaved Head Attention: The first half of heads use a local sliding-window mask (local_window_size=256); the second half retain unrestricted global access. Memory slots are exempt from this masking too.
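
The three mechanisms can be sketched as index bookkeeping in plain Python. This is a simplified illustration, not the code in model/memory_sparse_attention.py. The helper names, the key layout (memory slots first, then sequence positions), the causal window that includes the query itself, and the single shared cache entry are all assumptions of the sketch.

```python
def visible_keys(head_idx, query_pos, n_heads=10, n_memory_tokens=32,
                 local_window_size=256):
    """Return the key indices one query position may attend to.

    Assumed layout: indices [0, n_memory_tokens) are memory slots and
    sequence position t sits at index n_memory_tokens + t.
    """
    memory = list(range(n_memory_tokens))  # always visible, never masked
    if head_idx < n_heads // 2:
        # Local head: causal sliding window ending at the query.
        start = max(0, query_pos - local_window_size + 1)
    else:
        # Global head: full causal access.
        start = 0
    sequence = [n_memory_tokens + t for t in range(start, query_pos + 1)]
    return memory + sequence

def top_k_indices(scores, k):
    """Indices of the k largest scores, returned in ascending order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

index_cache = {}

def sparse_indices(layer_idx, scores, sparse_topk=128):
    """Even layers compute and cache top-K indices; odd layers reuse them."""
    if layer_idx % 2 == 0:
        index_cache["topk"] = top_k_indices(scores, sparse_topk)
    return index_cache["topk"]

local_keys = visible_keys(head_idx=0, query_pos=300)    # 32 memory + 256 window
global_keys = visible_keys(head_idx=9, query_pos=300)   # 32 memory + 301 causal
full = sparse_indices(0, [0.1, 0.9, 0.5, 0.7], sparse_topk=2)
shared = sparse_indices(1, [0.9, 0.1, 0.1, 0.1], sparse_topk=2)
```

In this toy run the odd layer ignores its own scores and returns the indices cached by the even layer before it, which is where the saving comes from.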

5. Reasoning: Tiny Recursive Loop

The "agentic" part of the model comes from a recursive inference loop.

  • File: agent/recursive_reasoning.py
  • Process: The model generates a <thought>, critiques it, and refines it up to $N$ times before producing the final answer.
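
The generate, critique, refine cycle can be sketched as a bounded loop. The code below is a hedged illustration rather than the contents of agent/recursive_reasoning.py: `recursive_reason` and its three callables are hypothetical names, and stopping early when the critique raises no objection is an assumption.

```python
def recursive_reason(prompt, generate, critique, refine, max_rounds=3):
    """Draft a thought, then critique and refine it up to `max_rounds` times.

    `generate`, `critique`, and `refine` are hypothetical callables that
    would wrap model calls; `critique` returns None when it has no objection.
    """
    thought = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, thought)
        if feedback is None:
            break  # nothing left to fix, stop early
        thought = refine(prompt, thought, feedback)
    return thought

# Toy stand-ins so the loop can run without a model.
answer = recursive_reason(
    "2 + 2",
    generate=lambda p: "<thought>5</thought>",
    critique=lambda p, t: "recheck the sum" if "5" in t else None,
    refine=lambda p, t, fb: t.replace("5", "4"),
)
print(answer)  # <thought>4</thought>
```

Capping the loop at `max_rounds` keeps inference cost bounded even when the critique never settles.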

7. Teacher: NIM Distillation (N3S) ⭐ NEW

The model was distilled using NVIDIA Nemotron-3 Super (N3S) as a high-fidelity teacher.

  • Method: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
  • Alignment: Alignment-aware distillation ensures the kernel follows workspace safety and grounding protocols.

8. Ecosystem: Model Context Protocol (MCP) ⭐ EXPANDED

Natively orchestrates cloud and local tools via MCP connectors.

  • Integrations: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
  • Orchestration: The recursive loop manages authentication signals and tool execution results.

📊 Model Statistics

  • Layers: 10
  • Embedding Dim: 640
  • Heads: 10
  • Memory Slots / Layer: 32 (K+V, persistent, learnable)
  • Sparse Top-K: 128 tokens per head (IndexCache)
  • Local Window: 256 tokens (Interleaved Attention)
  • Total Parameters: ~94.9M (includes memory K/V params)
  • Precision: 1.58-bit (Ternary)
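
For reference, the figures above can be collected in a small config object. This is a sketch under stated assumptions: `KernelConfig` is a hypothetical name, the field names `n_layer`, `n_embd`, and `n_head` follow nanoGPT conventions rather than this repository's confirmed code, and the memory parameter count assumes each slot stores one K and one V vector of full embedding width.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KernelConfig:
    """Hypothetical container for the statistics listed above."""
    n_layer: int = 10
    n_embd: int = 640
    n_head: int = 10
    n_memory_tokens: int = 32
    sparse_topk: int = 128
    local_window_size: int = 256

    @property
    def head_dim(self) -> int:
        # Embedding width split evenly across heads.
        return self.n_embd // self.n_head

    @property
    def memory_kv_params(self) -> int:
        # Assumes one K and one V vector of width n_embd per slot per layer.
        return self.n_layer * self.n_memory_tokens * 2 * self.n_embd

cfg = KernelConfig()
print(cfg.head_dim)          # 64
print(cfg.memory_kv_params)  # 409600
```

Under that assumption the memory slots account for roughly 0.41M of the ~94.9M total parameters.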

🛠️ Usage

To view the architecture and verify parameters:

python main.py

(Requires torch to be installed.)