	---
	title: EAM 100M Agentic Kernel v1.2
	emoji: 🧬
	colorFrom: blue
	colorTo: indigo
	sdk: gradio
	app_file: hf_app.py
	pinned: true
	---

	# 100M Parameter Agentic Model Walkthrough

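	This project implements a ~95M-parameter agentic language model that combines a nanoGPT core with attention residuals, ternary BitNet weights, memory sparse attention, a recursive reasoning loop, teacher distillation, and MCP tool connectors.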

	## 🚀 Architectural Stack

	### 1. Core: nanoGPT
	We use a minimalist Transformer architecture based on Karpathy's `nanoGPT`. It provides the foundational attention and MLP blocks, which are then modified by the components described in the sections below.

	### 2. Residuals: AttenRes (Attention Residuals)
	Instead of standard additive residuals (`x + f(x)`), we implement Attention Residuals.
	- File: `model/attenres.py`
	- Logic: Each layer performs a dynamic retrieval (attention) over all previous layer outputs. This prevents information dilution and allows deeper reasoning.
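
The retrieval step described above can be sketched for a single token position in plain Python. This is only an illustration under stated assumptions, not the code in `model/attenres.py`: the per-layer `query` vector, the scaled dot-product scoring, and the function names are all hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_residual(prev_outputs, query):
    """Mix all previous layer outputs for one token position.

    prev_outputs: list of vectors (one per earlier layer), each of length d.
    query: hypothetical learned per-layer query vector of length d.
    Returns (mixed_vector, weights).
    """
    d = len(query)
    # Score every earlier layer output against this layer's query.
    scores = [sum(q * h for q, h in zip(query, out)) / math.sqrt(d)
              for out in prev_outputs]
    weights = softmax(scores)
    # Weighted sum replaces the fixed "x + f(x)" accumulation.
    mixed = [sum(w * out[i] for w, out in zip(weights, prev_outputs))
             for i in range(d)]
    return mixed, weights

# Toy example: three earlier layer outputs with d = 2.
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = attention_residual(history, query=[0.0, 0.0])
```

With an all-zero query every earlier layer gets the same weight, so the result is a plain average. A trained query would shift weight toward the layers most relevant to the current one.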

	### 3. Weights: BitNet b1.58 (QVAC Fabric / Static Sparse)
	To ensure efficiency on consumer hardware (QVAC style), we use Ternary Weights ({-1, 0, 1}).
	- File: `model/bitnet.py`
	- Efficiency: This mimics a static sparse matrix where 0s act as pruned connections. It reduces the memory footprint by ~70% compared to FP16.
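
As a rough illustration of ternary weights, the sketch below applies the absmean rounding scheme described for BitNet b1.58 to a flat list of weights. Whether `model/bitnet.py` uses exactly this scheme is an assumption, and the function name is hypothetical.

```python
import math

def ternary_quantize(weights, eps=1e-8):
    """Absmean ternary quantization of a flat list of weights.

    Returns (codes, scale), where every code is -1, 0, or 1 and
    code * scale approximates the original weight.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale

weights = [0.9, -0.05, 0.4, -1.2, 0.02, 0.6]
codes, scale = ternary_quantize(weights)

# Fraction of weights rounded to zero, i.e. the "pruned" connections.
sparsity = codes.count(0) / len(codes)

# Three possible values carry log2(3) bits of information per weight,
# which is where the "1.58-bit" name comes from.
bits_per_weight = math.log2(3)
```

Small weights collapse to 0 and drop out of the matrix product entirely, which is the static-sparsity effect described above.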

	### 4. Attention: Memory Sparse Attention (MSA) ⭐ NEW
	Replaces the standard causal attention with a triple-mechanism attention layer.
	- File: `model/memory_sparse_attention.py`
	- Mechanism 1 — Persistent Memory Tokens: Each layer holds `n_memory_tokens=32` learnable `(K, V)` parameter pairs. Every query position attends to these slots without any causal or sparse masking, giving the model a dedicated working-memory scratchpad that persists across positions within a forward pass.
	- Mechanism 2 — IndexCache Sparse Top-K: Full layers (even `layer_idx`) compute top-K attention indices over the sequence and cache them. Shared layers (odd `layer_idx`) reuse the cached indices, reducing O(T²) → O(T · sparse_topk). Memory slots are always kept regardless of the sparse mask.
	- Mechanism 3 — Interleaved Head Attention: The first half of heads use a local sliding-window mask (`local_window_size=256`); the second half retain unrestricted global access. Memory slots are exempt from this masking too.
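
Mechanisms 2 and 3 can be sketched without any tensor library. The names `visible_positions` and `IndexCache`, the causal masking, and the window that includes the current token are assumptions made for illustration; they are not taken from `model/memory_sparse_attention.py`. Note that in this sketch only the shared (odd) layers skip the ranking step, since the full (even) layers still score every causal position to pick the top-K.

```python
N_MEMORY = 32  # memory slots are added on top of whatever the masks allow

def visible_positions(t, head, n_heads, window):
    """Sequence positions query t may attend to under interleaved heads."""
    if head < n_heads // 2:
        start = max(0, t - window + 1)   # local head: sliding window
    else:
        start = 0                        # global head: full causal prefix
    return list(range(start, t + 1))

class IndexCache:
    """Even layers compute top-K key indices; odd layers reuse them."""

    def __init__(self):
        self.indices = None

    def indices_for(self, layer_idx, scores, k):
        # scores[t][j] is the attention score of query t against key j.
        if layer_idx % 2 == 0 or self.indices is None:
            self.indices = [
                sorted(sorted(range(t + 1), key=lambda j: -row[j])[:k])
                for t, row in enumerate(scores)
            ]
        return self.indices

# Mechanism 3: head 0 is local, head 5 is global (10 heads, window 256).
local = visible_positions(t=300, head=0, n_heads=10, window=256)
global_ = visible_positions(t=300, head=5, n_heads=10, window=256)
local_slots = N_MEMORY + len(local)  # memory slots are never masked

# Mechanism 2: layer 0 ranks keys, layer 1 reuses the cached indices.
cache = IndexCache()
scores = [[0.1, 0.0, 0.0], [0.5, 0.2, 0.0], [0.3, 0.9, 0.4]]
full = cache.indices_for(0, scores, k=2)
shared = cache.indices_for(1, [[0.0] * 3] * 3, k=2)
```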

	### 5. Reasoning: Tiny Recursive Loop
	The "agentic" part of the model comes from a recursive inference loop.
	- File: `agent/recursive_reasoning.py`
	- Process: The model generates a `<thought>`, critiques it, and refines it up to $N$ times before producing the final answer.
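
The generate, critique, refine cycle can be sketched as a small loop. The callables `generate` and `critique`, the closing `</thought>` tag, the prompt layout, and the early stop when the critic returns `None` are all assumptions for illustration; `agent/recursive_reasoning.py` may differ.

```python
def recursive_answer(prompt, generate, critique, max_rounds=3):
    """Draft a thought, then critique and refine it up to max_rounds times."""
    thought = generate(f"{prompt}\n<thought>")
    for _ in range(max_rounds):
        feedback = critique(prompt, thought)
        if feedback is None:  # critic has no objections: stop early
            break
        thought = generate(
            f"{prompt}\n<thought>{thought}</thought>\n"
            f"Critique: {feedback}\n<thought>"
        )
    return generate(f"{prompt}\n<thought>{thought}</thought>\nFinal answer:")

# Stand-ins for the model so the loop can be run without any weights.
calls = []

def fake_generate(text):
    calls.append(text)
    return f"draft{len(calls)}"

def fake_critique(prompt, thought):
    return "be more specific" if thought == "draft1" else None

answer = recursive_answer("2+2?", fake_generate, fake_critique, max_rounds=3)
```

Here the first draft is critiqued once, the second draft passes, and the third generation call produces the final answer.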

	### 6. Teacher: NIM Distillation (N3S) ⭐ NEW
	The model was distilled using NVIDIA Nemotron-3 Super (N3S) as a high-fidelity teacher.
	- Method: Multi-Token Distillation (MTD) focused on agentic reasoning trajectories.
	- Alignment: Alignment-aware distillation ensures the kernel follows workspace safety and grounding protocols.

	### 7. Ecosystem: Model Context Protocol (MCP) ⭐ EXPANDED
	The kernel natively orchestrates cloud and local tools via MCP connectors.
	- Integrations: Figma (Design), Google Calendar, Notion, Google Sheets/Slides.
	- Orchestration: The recursive loop manages authentication signals and tool execution results.
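
MCP frames tool invocations as JSON-RPC 2.0 messages using the `tools/call` method. The sketch below shows how a loop might build such a request and read back a text result. The helper names and the tool name `calendar_list_events` are hypothetical, and the response is hand-written rather than returned by a real server.

```python
import json
from itertools import count

_ids = count(1)

def mcp_tool_call(name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": next(_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }

def tool_result_text(response):
    """Extract text content from a tools/call response, raising on errors."""
    if "error" in response:
        raise RuntimeError(response["error"].get("message", "MCP error"))
    result = response["result"]
    if result.get("isError"):
        raise RuntimeError("tool reported an error")
    return "".join(
        item["text"] for item in result.get("content", [])
        if item.get("type") == "text"
    )

# Hypothetical tool name and arguments, for illustration only.
request = mcp_tool_call("calendar_list_events", {"date": "2025-01-01"})
wire = json.dumps(request)

# Hand-written response in the shape a server would send back.
response = {
    "jsonrpc": "2.0",
    "id": request["id"],
    "result": {"content": [{"type": "text", "text": "No events"}],
               "isError": False},
}
observation = tool_result_text(response)
```

The extracted `observation` string is what a reasoning loop would feed into its next thought.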

	## 📊 Model Statistics
	- Layers: 10
	- Embedding Dim: 640
	- Heads: 10
	- Memory Slots / Layer: 32 (K+V, persistent, learnable)
	- Sparse Top-K: 128 tokens per head (IndexCache)
	- Local Window: 256 tokens (Interleaved Attention)
	- Total Parameters: ~94.9M (includes memory K/V params)
	- Precision: 1.58-bit (Ternary)
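
A few of these figures can be cross-checked with simple arithmetic. The sketch assumes each memory slot stores one key and one value of full embedding width, and a standard GPT block with a 4x MLP and no biases. It does not try to reproduce the ~94.9M total, which also depends on embeddings and components not itemized here.

```python
n_layer, n_embd, n_head, n_memory = 10, 640, 10, 32

# Per-head width: 640 / 10.
head_dim = n_embd // n_head                      # 64

# Assumption: one key and one value vector of width n_embd per memory slot.
memory_params = n_layer * 2 * n_memory * n_embd  # 409_600

# Standard GPT block: 4*d^2 for attention projections plus 8*d^2 for a
# 4x MLP, biases and norms ignored.
block_params = n_layer * 12 * n_embd ** 2        # 49_152_000
```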

	## 🛠️ Usage
	To view the architecture and verify parameters:
	```bash
	python main.py
	```
	(Requires `torch` to be installed.)