---
language: en
license: mit
tags:
  - pytorch
  - mixture-of-experts
  - language-model
  - reasoning
  - grpo
---

SHOREKEEPER-4B

A 4-billion-parameter language model built around a Council of Experts architecture: 12 specialized expert modules routed by a learned gating network, layered on top of 28 transformer blocks with Grouped Query Attention and RoPE positional encoding. Designed for reasoning, code generation, and long-term memory across conversations.


Architecture

| Component | Details |
|---|---|
| Parameters | ~4B |
| Layers | 28 transformer blocks |
| Attention | Grouped Query Attention (24 heads, 6 KV heads, head_dim 128) |
| Positional encoding | RoPE (θ = 1,000,000) |
| Experts | 12 specialists, 2 activated per token |
| Expert routing | Sentinel (learned gating with load-balance loss) |
| Expert dim | 2048 |
| Hidden dim | 3072 |
| Vocab size | 50,304 |
| Max sequence length | 8,192 |
| Quantization | 4-bit NF4 (bitsandbytes) |

Each transformer block applies attention → MoE FFN with pre-norm and residual connections. The 12 experts share weights across layers (cross-layer parameter sharing), keeping the model compact while preserving specialization.


The Council of Experts

The Sentinel router selects 2 experts per token based on learned routing logits. Each expert is a gated feed-forward network (SiLU gate × value projection) with a role-specific bias term.
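A minimal sketch of one such gated expert, using the dimensions from the architecture table (the `role_bias` parameter is an assumption about how the role-specific bias is realized):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpert(nn.Module):
    """SwiGLU-style expert: SiLU(gate(x)) * value(x), projected back to hidden_dim."""

    def __init__(self, hidden_dim: int = 3072, expert_dim: int = 2048):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.value = nn.Linear(hidden_dim, expert_dim, bias=False)
        self.out = nn.Linear(expert_dim, hidden_dim, bias=False)
        # role-specific bias term (assumption: one learned vector per expert)
        self.role_bias = nn.Parameter(torch.zeros(hidden_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(F.silu(self.gate(x)) * self.value(x)) + self.role_bias
```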

| Expert | Role | Specialization |
|---|---|---|
| Asmoday | Code | Python development, debugging |
| Istaroth | Systems | OS, networking, deployment |
| Ronova | Reasoning | Math, logic, step-by-step problems |
| Naberius | Memory | Long-term retrieval |
| Phanes | Creation | Writing, generation |
| Barbeloth | Analysis | Data patterns, insights |
| Tacet | Silence | Noise filtering, summarization |
| Abby | Empathy | User context, preferences |
| Reindoter | Validation | Testing, verification |
| Zestial | Vision | Visualization, diagrams |
| Alice | Exploration | Novel solutions, experiments |
| Rover | Execution | Terminal commands, sandbox |
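Top-2 routing over a council like this can be sketched as below; the load-balance penalty shown is one common formulation and may differ from the repository's Sentinel:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Sentinel(nn.Module):
    """Top-k router: picks k of n experts per token from learned gating logits."""

    def __init__(self, dim: int, n_experts: int = 12, top_k: int = 2):
        super().__init__()
        self.gate = nn.Linear(dim, n_experts, bias=False)
        self.top_k = top_k

    def forward(self, x: torch.Tensor):
        logits = self.gate(x)                           # (tokens, n_experts)
        probs = F.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)   # per-token expert choices
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over top-k
        # auxiliary load-balance loss: penalizes uneven average expert usage,
        # which is what prevents expert collapse during training
        load = probs.mean(dim=0)                        # mean routing mass per expert
        aux_loss = (load * load).sum() * probs.size(-1)  # minimized when uniform
        return weights, idx, aux_loss
```

The expert outputs for each token are then combined as a weighted sum using `weights` over the experts selected by `idx`.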

Persistent Memory

SHOREKEEPER maintains a JSON-based memory store across conversations, organized into six categories:

  • user_preferences: learned user settings and habits
  • project_context: active project information
  • conversation_history: past exchanges (capped at 1,000 entries per category)
  • important_facts: stored knowledge
  • code_patterns: learned code conventions
  • learned_skills: acquired capabilities

Memory context is automatically injected into each chat() call. Use /remember and /recall commands to interact with it directly.
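A minimal sketch of a JSON-backed store with these six categories; the method names and file path are assumptions, not the actual json_library.py API:

```python
import json
from pathlib import Path

CATEGORIES = [
    "user_preferences", "project_context", "conversation_history",
    "important_facts", "code_patterns", "learned_skills",
]
MAX_ENTRIES = 1000  # cap per category

class MemoryStore:
    def __init__(self, path: str = "memory.json"):
        self.path = Path(path)
        if self.path.exists():
            self.data = json.loads(self.path.read_text())
        else:
            self.data = {c: [] for c in CATEGORIES}

    def remember(self, category: str, entry: str) -> None:
        """Append an entry, evict the oldest past the cap, persist to disk."""
        self.data[category].append(entry)
        self.data[category] = self.data[category][-MAX_ENTRIES:]
        self.path.write_text(json.dumps(self.data, indent=2))

    def recall(self, query: str) -> list[str]:
        """Naive substring search across all categories."""
        q = query.lower()
        return [e for entries in self.data.values()
                for e in entries if q in str(e).lower()]
```

A real implementation would likely use embedding-based retrieval rather than substring matching, but the persistence shape is the same.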


Training

Training happens in two stages:

Stage 1: Supervised Fine-Tuning. A mixed STEM dataset of GSM8K, CodeAlpaca, OpenOrca, and MathInstruct (~50K examples), trained with the standard causal language modeling loss using AdamW and cosine annealing.

Stage 2: GRPO. Group Relative Policy Optimization on math reasoning prompts. Reward signal: +2.0 for a correct answer, plus a +0.5 bonus for chain-of-thought reasoning steps. The load-balance loss is applied at every step to prevent expert collapse.
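The reward described above could be sketched like this; the answer-matching and chain-of-thought checks here are simple heuristics standing in for whatever the actual trainer uses:

```python
def math_reward(completion: str, gold_answer: str) -> float:
    """+2.0 for the correct final answer, +0.5 bonus for visible reasoning steps."""
    reward = 0.0
    lines = completion.strip().splitlines()
    # correctness: compare the final line against the gold answer (assumption)
    if lines and gold_answer in lines[-1]:
        reward += 2.0
    # chain-of-thought bonus: multiple lines of work before the answer (assumption)
    if len(lines) > 2:
        reward += 0.5
    return reward
```

In GRPO, rewards like this are computed for a group of sampled completions per prompt, and each completion's advantage is its reward relative to the group mean.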


Sandboxed Execution

SHOREKEEPER can execute terminal commands inside a Docker container with:

  • Command whitelist (python3, pip, git, ls, cat, mkdir, touch, echo)
  • 30-second timeout
  • 4GB memory / 2 CPU limit
  • No interactive shell access
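A wrapper enforcing those constraints might look like the sketch below; the Docker image name and exact flags are assumptions, and the repository's sandbox/terminal.py may differ:

```python
import shlex
import subprocess

ALLOWED = {"python3", "pip", "git", "ls", "cat", "mkdir", "touch", "echo"}

def run_sandboxed(command: str, image: str = "python:3.10-slim") -> str:
    """Run a whitelisted command in a resource-limited, non-interactive container."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED:
        raise PermissionError(f"command not whitelisted: {command!r}")
    docker_cmd = [
        "docker", "run", "--rm",
        "--memory", "4g", "--cpus", "2",   # 4 GB RAM / 2 CPU limit
        "--network", "none",               # assumption: no network access
        image, *argv,                      # no -it flag: no interactive shell
    ]
    result = subprocess.run(docker_cmd, capture_output=True, text=True,
                            timeout=30)    # 30-second timeout
    return result.stdout + result.stderr
```

`subprocess.run` raises `TimeoutExpired` when the limit is hit, which the caller can surface to the model as a failed execution.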

Quick Start

pip install -r requirements.txt
python scripts/07_run_shorekeeper.py

Available commands in the CLI:

/remember <fact>    Store something in long-term memory
/recall <query>     Search memory
/run <command>      Execute in sandbox
/project <name>     Create a new project
/exit               Quit

Project Structure

src/
├── shorekeeper.py          Main model class
├── council/
│   ├── attention.py        GQA + RoPE attention layer
│   ├── sentinel.py         Expert router
│   ├── experts.py          12 expert modules
│   └── base_expert.py      Shared expert base class
├── memory/
│   └── json_library.py     Persistent memory system
├── sandbox/
│   └── terminal.py         Docker-based execution
└── training/
    └── grpo.py             GRPO trainer

configs/                    YAML configs (model, training, memory, sandbox)
scripts/                    Training and inference scripts
tests/                      Unit tests

Requirements

  • Python 3.10+
  • PyTorch 2.5+
  • CUDA-capable GPU recommended for full-precision inference
  • Docker (optional, for sandbox execution)
Install dependencies with pip install -r requirements.txt.

Variants

A 15B variant config is available at configs/model_15b.yaml (dim 6144, 48 layers, 16 experts).