EAM-100M-Agentic-Kernel / README.md

saur7764

Upload README.md with huggingface_hub

43b0fee verified about 1 month ago

title: EAM 100M Agentic Kernel v1.2
emoji: 🧬
colorFrom: blue
colorTo: indigo
sdk: gradio
app_file: hf_app.py
pinned: true

100M Parameter Agentic Model Walkthrough

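This project implements an agentic model architecture of roughly 100M parameters by combining several existing techniques: a nanoGPT core, attention residuals, ternary (BitNet 1.58b) weights, memory sparse attention, and a recursive reasoning loop.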

🚀 Architectural Stack

1. Core: nanoGPT

We use a minimalist Transformer architecture based on Karpathy's nanoGPT. It provides the foundational attention and MLP blocks but has been heavily modified for agentic performance.

2. Residuals: AttenRes (Attention Residuals)

Instead of standard additive residuals (x + f(x)), we implement Attention Residuals.

File: model/attenres.py
Logic: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This prevents information dilution and allows deeper reasoning.
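The retrieval over earlier layer outputs can be sketched in plain Python for a single token position. This is a minimal illustration, not the code in model/attenres.py: the function name attention_residual, the per-layer query vector, and the scaled dot-product scoring are assumptions made for the example.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_residual(query, prev_outputs):
    """Mix earlier layer outputs with attention weights instead of summing them.

    query: hypothetical query vector for the current layer, length d.
    prev_outputs: one vector per earlier layer, each of length d.
    Returns (mixed_vector, weights).
    """
    d = len(query)
    # Scaled dot-product score between the query and each earlier output.
    scores = [
        sum(q * h for q, h in zip(query, out)) / math.sqrt(d)
        for out in prev_outputs
    ]
    weights = softmax(scores)
    # Weighted sum of the earlier outputs, one component at a time.
    mixed = [
        sum(w * out[k] for w, out in zip(weights, prev_outputs))
        for k in range(d)
    ]
    return mixed, weights
```

With a standard additive residual every earlier output contributes with weight 1; here the weights sum to 1 and depend on the query, so a layer can emphasize whichever earlier representation is most relevant.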

3. Weights: BitNet 1.58b (QVAC Fabric / Static Sparse)

To ensure efficiency on consumer hardware (QVAC style), we use Ternary Weights ({-1, 0, 1}).

File: model/bitnet.py
Efficiency: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.
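For reference, the absmean scheme described for BitNet b1.58 scales each weight by the mean absolute value, then rounds and clips to {-1, 0, 1}. The sketch below applies it to a flat list of floats; whether model/bitnet.py follows exactly this recipe is an assumption, and ternary_quantize is a hypothetical name. The "1.58" in the name comes from log2(3) ≈ 1.58 bits needed to encode one of three values.

```python
def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization of a flat list of floats.

    Returns (codes, scale) where every code is -1, 0, or 1 and
    scale * code approximates the original weight.
    """
    scale = sum(abs(w) for w in weights) / len(weights)
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    codes = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return codes, scale

codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(codes)  # [1, 0, 1, -1]
```

The small weight (-0.05) becomes 0, which is the "pruned connection" behaviour described above.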

4. Attention: Memory Sparse Attention (MSA) ⭐ NEW

MSA replaces the standard causal attention with an attention layer that combines three mechanisms.

File: model/memory_sparse_attention.py
Mechanism 1 (Persistent Memory Tokens): Each layer holds n_memory_tokens=32 learnable (K, V) parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
Mechanism 2 (IndexCache Sparse Top-K): Full layers (even layer_idx) compute top-K attention indices over the sequence and cache them. Shared layers (odd layer_idx) reuse the cached indices, which reduces their attention cost from O(T²) to O(T · sparse_topk). Memory slots are always kept regardless of the sparse mask.
Mechanism 3 (Interleaved Head Attention): The first half of the heads use a local sliding-window mask (local_window_size=256); the second half retain unrestricted global access. Memory slots are exempt from this masking too.
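The masking rules above can be sketched with plain Python lists. This is an illustration under stated assumptions, not the contents of model/memory_sparse_attention.py: interleaved_mask and IndexCache.lookup are hypothetical names, memory slots are assumed to occupy the first n_memory key columns, global heads are assumed to stay causal, and the window is assumed to include the current position.

```python
def interleaved_mask(n_heads, seq_len, n_memory, window):
    """Return mask[h][i][j]: may query i in head h attend to key column j?

    Columns 0..n_memory-1 are memory slots; the rest are sequence positions.
    """
    masks = []
    for h in range(n_heads):
        is_local = h < n_heads // 2  # first half of heads: sliding window
        head_mask = []
        for i in range(seq_len):
            row = [True] * n_memory  # memory slots are never masked
            for j in range(seq_len):
                allowed = j <= i  # causal constraint
                if is_local:
                    allowed = allowed and (i - j) < window
                row.append(allowed)
            head_mask.append(row)
        masks.append(head_mask)
    return masks


class IndexCache:
    """Compute top-k key indices on even layers, reuse them on odd layers."""

    def __init__(self, topk):
        self.topk = topk
        self.indices = None

    def lookup(self, layer_idx, scores):
        # scores[i][j] is the attention score of query i against key j.
        if layer_idx % 2 == 0 or self.indices is None:
            self.indices = [
                sorted(
                    sorted(range(i + 1), key=lambda j: row[j], reverse=True)[: self.topk]
                )
                for i, row in enumerate(scores)
            ]
        return self.indices
```

A shared (odd) layer that calls lookup gets the cached indices back without scoring every key again, which is where the saving over full attention comes from.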

5. Reasoning: Tiny Recursive Loop

The "agentic" part of the model comes from a recursive inference loop.

File: agent/recursive_reasoning.py
Process: The model generates a <thought>, critiques it, and refines it up to N times before producing the final answer.
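The generate, critique, refine cycle can be sketched as a small driver function. The callables generate, critique, and refine are hypothetical stand-ins for model calls, and stopping early when the critic returns None is an assumption; agent/recursive_reasoning.py may be structured differently.

```python
def recursive_answer(prompt, generate, critique, refine, max_rounds=3):
    """Draft a thought, then critique and refine it up to max_rounds times."""
    thought = generate(prompt)
    for _ in range(max_rounds):
        feedback = critique(prompt, thought)
        if feedback is None:  # assumed convention: the critic has no objections
            break
        thought = refine(prompt, thought, feedback)
    return thought
```

Passing the three steps in as functions keeps the loop independent of how the model is invoked, so it can be exercised with simple stubs before wiring in the real kernel.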

7. Teacher: NIM Distillation (N3S) ⭐ NEW

The model was distilled using NVIDIA Nemotron-3 Super (N3S) as a high-fidelity teacher.

Method: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
Alignment: Alignment-aware distillation ensures the kernel follows workspace safety and grounding protocols.
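The README does not define the distillation loss. One plausible reading of multi-token distillation is that the student matches the teacher's output distribution at several predicted token positions at once. The sketch below averages KL(teacher || student) over those positions; the function names and the choice of KL divergence are assumptions, not the project's training code.

```python
import math

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for two probability lists over the same vocabulary."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher_probs, student_probs)
    )

def multi_token_distill_loss(teacher_steps, student_steps):
    """Average the per-position KL over several predicted token positions.

    teacher_steps / student_steps: one probability list per position.
    """
    losses = [kl_divergence(t, s) for t, s in zip(teacher_steps, student_steps)]
    return sum(losses) / len(losses)
```

The loss is zero when the student reproduces the teacher's distributions exactly and grows as they diverge.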

8. Ecosystem: Model Context Protocol (MCP) ⭐ EXPANDED

The kernel natively orchestrates cloud and local tools via MCP connectors.

Integrations: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
Orchestration: The recursive loop manages authentication signals and tool execution results.
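MCP messages are JSON-RPC 2.0, and a tool is invoked with the tools/call method. The sketch below builds such a request and reads the text content out of a response so it can be fed back into the reasoning loop. The helper names and the tool name calendar_create_event are hypothetical; the connectors this project actually exposes are not listed in the README.

```python
import json

def build_tool_call(request_id, tool_name, arguments):
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })

def extract_text(response_json):
    """Return (is_error, text) from a tools/call response."""
    result = json.loads(response_json)["result"]
    texts = [
        item["text"]
        for item in result.get("content", [])
        if item.get("type") == "text"
    ]
    return result.get("isError", False), "\n".join(texts)

# Hypothetical tool name and arguments, for illustration only.
request = build_tool_call(1, "calendar_create_event", {"title": "Standup"})
```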

📊 Model Statistics

Layers: 10
Embedding Dim: 640
Heads: 10
Memory Slots / Layer: 32 (K+V, persistent, learnable)
Sparse Top-K: 128 tokens per head (IndexCache)
Local Window: 256 tokens (Interleaved Attention)
Total Parameters: ~94.9M (includes memory K/V params)
Precision: 1.58-bit (Ternary)
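Two derived quantities follow directly from the figures above: the per-head dimension and the number of parameters held in the memory slots. The sketch below collects the statistics in a dataclass; KernelConfig is a hypothetical name, and the memory count assumes each slot stores one key and one value vector of total width equal to the embedding dimension.

```python
from dataclasses import dataclass

@dataclass
class KernelConfig:  # hypothetical name, values taken from the list above
    n_layer: int = 10
    n_embd: int = 640
    n_head: int = 10
    n_memory_tokens: int = 32
    sparse_topk: int = 128
    local_window_size: int = 256

    @property
    def head_dim(self):
        return self.n_embd // self.n_head

    @property
    def memory_params(self):
        # Assumed layout: one K and one V vector of width n_embd per slot, per layer.
        return 2 * self.n_memory_tokens * self.n_embd * self.n_layer

cfg = KernelConfig()
print(cfg.head_dim)       # 64
print(cfg.memory_params)  # 409600
```

Under that assumption the memory slots account for about 0.41M parameters, a small fraction of the ~94.9M total.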

🛠️ Usage

To view the architecture and verify parameters:

python main.py

(Requires torch to be installed.)