---
language: en
license: mit
tags:
  - pytorch
  - mixture-of-experts
  - language-model
  - reasoning
  - grpo
---

SHOREKEEPER-4B

A 4-billion-parameter language model built around a Council of Experts architecture: 12 specialized expert modules routed by a learned gating network, layered on top of 28 transformer blocks with Grouped Query Attention and RoPE positional encoding. Designed for reasoning, code generation, and long-term memory across conversations.


Architecture

| Component | Details |
|---|---|
| Parameters | ~4B |
| Layers | 28 transformer blocks |
| Attention | Grouped Query Attention (24 heads, 6 KV heads, head_dim 128) |
| Positional encoding | RoPE (θ = 1,000,000) |
| Experts | 12 specialists, 2 activated per token |
| Expert routing | Sentinel (learned gating with load-balance loss) |
| Expert dim | 2048 |
| Hidden dim | 3072 |
| Vocab size | 50,304 |
| Max sequence length | 8,192 |
| Quantization | 4-bit NF4 (bitsandbytes) |

Each transformer block applies attention → MoE FFN with pre-norm and residual connections. The 12 experts share weights across layers (cross-layer parameter sharing), keeping the model compact while preserving specialization.


The Council of Experts

The Sentinel router selects 2 experts per token based on learned routing logits. Each expert is a gated feed-forward network (SiLU gate × value projection) with a role-specific bias term.
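A minimal sketch of one such gated expert, using the dimensions from the architecture table (the `role_bias` parameter is an assumption about how the role-specific bias is realized):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpert(nn.Module):
    """SwiGLU-style expert: SiLU(gate(x)) * value(x), projected back to hidden_dim."""

    def __init__(self, hidden_dim: int = 3072, expert_dim: int = 2048):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.value = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.out = nn.Linear(expert_dim, hidden_dim, bias=False)
        # role-specific bias term (assumption: one learned vector per expert)
        self.role_bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.gate(x)) * self.value(x)) + self.role_bias
```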

| Expert | Role | Specialization |
|---|---|---|
| Asmoday | Code | Python development, debugging |
| Istaroth | Systems | OS, networking, deployment |
| Ronova | Reasoning | Math, logic, step-by-step problems |
| Naberius | Memory | Long-term retrieval |
| Phanes | Creation | Writing, generation |
| Barbeloth | Analysis | Data patterns, insights |
| Tacet | Silence | Noise filtering, summarization |
| Abby | Empathy | User context, preferences |
| Reindoter | Validation | Testing, verification |
| Zestial | Vision | Visualization, diagrams |
| Alice | Exploration | Novel solutions, experiments |
| Rover | Execution | Terminal commands, sandbox |
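Top-2 routing over a council like this can be sketched as below; the load-balance penalty shown is one common formulation and may differ from the repository's Sentinel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Sentinel(nn.Module):
    """Top-k router: picks k of n experts per token from learned gating logits."""

    def __init__(self, dim: int, n_experts: int = 12, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                           # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)   # per-token expert choices
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
        # auxiliary load-balance loss: penalizes uneven average expert usage,
        # which is what prevents expert collapse during training
        load = probs.mean(dim=0)                        # mean routing mass per expert
        aux_loss = (load * load).sum() * probs.size(-1)  # minimized when uniform
        return weights, idx, aux_loss
```

The expert outputs for each token are then combined as a weighted sum using `weights` over the experts selected by `idx`.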

Persistent Memory

SHOREKEEPER maintains a JSON-based memory store across conversations, organized into six categories:

  • user_preferences: learned user settings and habits
  • project_context: active project information
  • conversation_history: past exchanges (capped at 1,000 entries per category)
  • important_facts: stored knowledge
  • code_patterns: learned code conventions
  • learned_skills: acquired capabilities

Memory context is automatically injected into each chat() call. Use /remember and /recall commands to interact with it directly.
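A minimal sketch of a JSON-backed store with these six categories; the method names and file path are assumptions, not the actual json_library.py API:

```python
import json
from pathlib import Path

CATEGORIES = [
    "user_preferences", "project_context", "conversation_history",
    "important_facts", "code_patterns", "learned_skills",
]
MAX_ENTRIES = 1000  # cap per category

class MemoryStore:
    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {c: [] for c in CATEGORIES}

    def remember(self, category: str, entry: str) -> None:
        """Append an entry, evict the oldest past the cap, persist to disk."""
        self.data[category].append(entry)
        self.data[category] = self.data[category][-MAX_ENTRIES:]
        self.path.write_text(json.dumps(self.data, indent=2))

    def recall(self, query: str) -> list[str]:
        """Naive substring search across all categories."""
        q = query.lower()
        return [e for entries in self.data.values()
                for e in entries if q in str(e).lower()]
```

A real implementation would likely use embedding-based retrieval rather than substring matching, but the persistence shape is the same.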


Training

Training happens in two stages:

Stage 1: Supervised Fine-Tuning. A mixed STEM dataset of GSM8K, CodeAlpaca, OpenOrca, and MathInstruct (~50K examples), trained with the standard causal language modeling loss using AdamW and cosine annealing.

Stage 2: GRPO. Group Relative Policy Optimization on math reasoning prompts. Reward signal: +2.0 for a correct answer, plus a +0.5 bonus for chain-of-thought reasoning steps. The load-balance loss is applied at every step to prevent expert collapse.
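The reward described above could be sketched like this; the answer-matching and chain-of-thought checks here are simple heuristics standing in for whatever the actual trainer uses:

```python
def math_reward(completion: str, gold_answer: str) -> float:
    """+2.0 for the correct final answer, +0.5 bonus for visible reasoning steps."""
    reward = 0.0
    lines = completion.strip().splitlines()
    # correctness: compare the final line against the gold answer (assumption)
    if lines and gold_answer in lines[-1]:
        reward += 2.0
    # chain-of-thought bonus: multiple lines of work before the answer (assumption)
    if len(lines) > 2:
        reward += 0.5
    return reward
```

In GRPO, rewards like this are computed for a group of sampled completions per prompt, and each completion's advantage is its reward relative to the group mean.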


Sandboxed Execution

SHOREKEEPER can execute terminal commands inside a Docker container with:

  • Command whitelist (python3, pip, git, ls, cat, mkdir, touch, echo)
  • 30-second timeout
  • 4GB memory / 2 CPU limit
  • No interactive shell access
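A wrapper enforcing those constraints might look like the sketch below; the Docker image name and exact flags are assumptions, and the repository's sandbox/terminal.py may differ:

```python
import shlex
import subprocess

ALLOWED = {"python3", "pip", "git", "ls", "cat", "mkdir", "touch", "echo"}

def run_sandboxed(command: str, image: str = "python:3.10-slim") -> str:
    """Run a whitelisted command in a resource-limited, non-interactive container."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command not whitelisted: {command!r}")
    docker_cmd = [
        "docker", "run", "--rm",
        "--memory", "4g", "--cpus", "2",   # 4 GB RAM / 2 CPU limit
        "--network", "none",               # assumption: no network access
        image, *argv,                      # no -it flag: no interactive shell
    ]
    result = subprocess.run(docker_cmd, capture_output=True, text=True,
                            timeout=30)    # 30-second timeout
    return result.stdout + result.stderr
```

`subprocess.run` raises `TimeoutExpired` when the limit is hit, which the caller can surface to the model as a failed execution.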

Quick Start

pip install -r requirements.txt
python scripts/07_run_shorekeeper.py

Available commands in the CLI:

/remember <fact>    Store something in long-term memory
/recall <query>     Search memory
/run <command>      Execute in sandbox
/project <name>     Create a new project
/exit               Quit

Project Structure

src/
├── shorekeeper.py          Main model class
├── council/
│   ├── attention.py        GQA + RoPE attention layer
│   ├── sentinel.py         Expert router
│   ├── experts.py          12 expert modules
│   └── base_expert.py      Shared expert base class
├── memory/
│   └── json_library.py     Persistent memory system
├── sandbox/
│   └── terminal.py         Docker-based execution
└── training/
    └── grpo.py             GRPO trainer

configs/                    YAML configs (model, training, memory, sandbox)
scripts/                    Training and inference scripts
tests/                      Unit tests

Requirements

  • Python 3.10+
  • PyTorch 2.5+
  • CUDA-capable GPU recommended for full-precision inference
  • Docker (optional, for sandbox execution)
Install dependencies with pip install -r requirements.txt.

Variants

A 15B variant config is available at configs/model_15b.yaml (dim 6144, 48 layers, 16 experts).