---
language: en
license: mit
tags:
- pytorch
- mixture-of-experts
- language-model
- reasoning
- grpo
---
# SHOREKEEPER-4B
A 4-billion-parameter language model built around a **Council of Experts** architecture: 12 specialized expert modules routed by a learned gating network, layered on top of 28 transformer blocks with Grouped Query Attention and RoPE positional encoding. Designed for reasoning, code generation, and long-term memory across conversations.
---
## Architecture
| Component | Details |
|---|---|
| Parameters | ~4B |
| Layers | 28 transformer blocks |
| Attention | Grouped Query Attention (24 heads, 6 KV heads, head_dim 128) |
| Positional encoding | RoPE (θ = 1,000,000) |
| Experts | 12 specialists, 2 activated per token |
| Expert routing | Sentinel (learned gating with load-balance loss) |
| Expert dim | 2048 |
| Hidden dim | 3072 |
| Vocab size | 50,304 |
| Max sequence length | 8,192 |
| Quantization | 4-bit NF4 (bitsandbytes) |
Each transformer block applies **attention → MoE FFN** with pre-norm and residual connections. The 12 experts share weights across layers (cross-layer parameter sharing), keeping the model compact while preserving specialization.
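The per-block data flow can be sketched as follows. This is a minimal illustration with made-up names (`CouncilBlock` is not a class from this repo, and the real model may use RMSNorm rather than LayerNorm); the actual modules live under `src/council/`:

```python
import torch.nn as nn

class CouncilBlock(nn.Module):
    """Pre-norm residual block: attention first, then the MoE feed-forward."""
    def __init__(self, dim, attn, moe):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)  # pre-norm before attention
        self.norm2 = nn.LayerNorm(dim)  # pre-norm before the expert FFN
        self.attn = attn                # e.g. a GQA + RoPE attention module
        self.moe = moe                  # Sentinel-routed mixture of experts

    def forward(self, x):
        x = x + self.attn(self.norm1(x))  # residual around attention
        x = x + self.moe(self.norm2(x))   # residual around the MoE FFN
        return x
```

Stacking 28 of these blocks, all sharing one set of 12 expert modules, gives the cross-layer parameter sharing described above.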
---
## The Council of Experts
The Sentinel router selects 2 experts per token based on learned routing logits. Each expert is a gated feed-forward network (SiLU gate × value projection) with a role-specific bias term.
| Expert | Role | Specialization |
|---|---|---|
| **Asmoday** | Code | Python development, debugging |
| **Istaroth** | Systems | OS, networking, deployment |
| **Ronova** | Reasoning | Math, logic, step-by-step problems |
| **Naberius** | Memory | Long-term retrieval |
| **Phanes** | Creation | Writing, generation |
| **Barbeloth** | Analysis | Data patterns, insights |
| **Tacet** | Silence | Noise filtering, summarization |
| **Abby** | Empathy | User context, preferences |
| **Reindoter** | Validation | Testing, verification |
| **Zestial** | Vision | Visualization, diagrams |
| **Alice** | Exploration | Novel solutions, experiments |
| **Rover** | Execution | Terminal commands, sandbox |
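The top-2 selection can be sketched in plain Python. This is a simplified stand-in for the Sentinel, not its actual implementation; renormalizing the softmax over only the selected experts is a common MoE convention and an assumption here:

```python
import math

def sentinel_route(logits, k=2):
    """Pick the top-k experts for one token and renormalize their gate weights."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    exps = [math.exp(logits[i]) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]  # (expert index, weight) pairs
```

The token's output is then the weighted sum of the two selected experts' FFN outputs; the load-balance loss mentioned above pushes these selections to spread across all 12 experts during training.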
---
## Persistent Memory
SHOREKEEPER maintains a JSON-based memory store across conversations, organized into six categories:
- `user_preferences` – learned user settings and habits
- `project_context` – active project information
- `conversation_history` – past exchanges (capped at 1,000 entries per category)
- `important_facts` – stored knowledge
- `code_patterns` – learned code conventions
- `learned_skills` – acquired capabilities
Memory context is automatically injected into each `chat()` call. Use `/remember` and `/recall` commands to interact with it directly.
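A minimal sketch of such a store, assuming the details not stated above: the class and method names are hypothetical (the real implementation is `src/memory/json_library.py`), and applying the 1,000-entry cap to every category is an assumption:

```python
import json
from pathlib import Path

CATEGORIES = ("user_preferences", "project_context", "conversation_history",
              "important_facts", "code_patterns", "learned_skills")
MAX_ENTRIES = 1000  # cap per category, per the README

class JsonMemory:
    def __init__(self, path):
        self.path = Path(path)
        if self.path.exists():
            self.store = json.loads(self.path.read_text())
        else:
            self.store = {c: [] for c in CATEGORIES}

    def remember(self, category, fact):
        entries = self.store[category]
        entries.append(fact)
        del entries[:-MAX_ENTRIES]  # keep only the newest 1,000 entries
        self.path.write_text(json.dumps(self.store))

    def recall(self, query):
        """Case-insensitive substring search across every category."""
        return [f for entries in self.store.values() for f in entries
                if query.lower() in f.lower()]
```

Because the store is plain JSON on disk, it survives process restarts, which is what lets memory persist across conversations.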
---
## Training
Training happens in two stages:
**Stage 1 – Supervised Fine-Tuning**
Mixed STEM dataset: GSM8K, CodeAlpaca, OpenOrca, MathInstruct (~50K examples). Standard causal language modeling loss with AdamW + cosine annealing.
**Stage 2 – GRPO**
Group Relative Policy Optimization on math reasoning prompts. Reward signal: +2.0 for correct answer, +0.5 bonus for chain-of-thought reasoning steps. Load-balance loss applied every step to prevent expert collapse.
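The reward shaping and GRPO's group-relative normalization can be sketched as below. The answer extraction and the "step" keyword check are illustrative assumptions, not the repo's actual parsing logic; the reward magnitudes (+2.0, +0.5) are from the README:

```python
import re
from statistics import mean, pstdev

def extract_answer(text):
    """Pull the last number from a completion (simplified answer parsing)."""
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None

def reward(completion, gold):
    r = 2.0 if extract_answer(completion) == gold else 0.0
    if "step" in completion.lower():  # crude chain-of-thought check (assumption)
        r += 0.5
    return r

def group_advantages(rewards):
    """GRPO replaces a learned critic with within-group reward normalization."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]
```

Each prompt is sampled several times; completions scoring above their group's mean get positive advantage and are reinforced, those below are penalized.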
---
## Sandboxed Execution
SHOREKEEPER can execute terminal commands inside a Docker container with:
- Command whitelist (python3, pip, git, ls, cat, mkdir, touch, echo)
- 30-second timeout
- 4GB memory / 2 CPU limit
- No interactive shell access
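The guard rails above can be sketched with `subprocess` and the Docker CLI. The function name and base image are assumptions for illustration; the whitelist, timeout, and resource limits are the ones listed:

```python
import shlex
import subprocess

WHITELIST = {"python3", "pip", "git", "ls", "cat", "mkdir", "touch", "echo"}

def run_sandboxed(command, timeout=30):
    """Refuse non-whitelisted binaries, then run inside a resource-limited container."""
    argv = shlex.split(command)
    if not argv or argv[0] not in WHITELIST:
        return f"blocked: {argv[0] if argv else ''} is not whitelisted"
    docker_cmd = ["docker", "run", "--rm",
                  "--memory=4g", "--cpus=2",      # 4GB memory / 2 CPU limit
                  "python:3.10-slim"] + argv      # base image is an assumption
    try:
        out = subprocess.run(docker_cmd, capture_output=True, text=True,
                             timeout=timeout)    # 30-second timeout
        return out.stdout
    except subprocess.TimeoutExpired:
        return f"timed out after {timeout}s"
```

Checking the whitelist before ever touching Docker means a disallowed command is rejected without spawning a container at all.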
---
## Quick Start
```bash
pip install -r requirements.txt
python scripts/07_run_shorekeeper.py
```
**Available commands in the CLI:**
```
/remember <fact> Store something in long-term memory
/recall <query> Search memory
/run <command> Execute in sandbox
/project <name> Create a new project
/exit Quit
```
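A CLI loop over these commands reduces to a small dispatch table. The handler method names here are hypothetical stand-ins for whatever the model object exposes:

```python
def dispatch(line, model):
    """Route a slash-command to its handler; plain text goes to chat()."""
    cmd, _, arg = line.partition(" ")
    handlers = {
        "/remember": model.remember,       # hypothetical handler names
        "/recall": model.recall,
        "/run": model.run_sandboxed,
        "/project": model.new_project,
    }
    if cmd == "/exit":
        return None                        # signal the loop to quit
    if cmd in handlers:
        return handlers[cmd](arg)
    return model.chat(line)                # everything else is a chat turn
```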
---
## Project Structure
```
src/
├── shorekeeper.py        Main model class
├── council/
│   ├── attention.py      GQA + RoPE attention layer
│   ├── sentinel.py       Expert router
│   ├── experts.py        12 expert modules
│   └── base_expert.py    Shared expert base class
├── memory/
│   └── json_library.py   Persistent memory system
├── sandbox/
│   └── terminal.py       Docker-based execution
└── training/
    └── grpo.py           GRPO trainer
configs/                  YAML configs (model, training, memory, sandbox)
scripts/                  Training and inference scripts
tests/                    Unit tests
```
---
## Requirements
- Python 3.10+
- PyTorch 2.5+
- CUDA-capable GPU recommended for full-precision inference
- Docker (optional, for sandbox execution)
```bash
pip install -r requirements.txt
```
---
## Variants
A **15B variant** config is available at `configs/model_15b.yaml` (dim 6144, 48 layers, 16 experts).