Instructions for using star-ga/mind-mem-4b with libraries, inference providers, notebooks, and local apps. The sections below cover each option.
- Libraries
- Transformers
How to use star-ga/mind-mem-4b with Transformers:
```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="star-ga/mind-mem-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)
```

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("star-ga/mind-mem-4b")
model = AutoModelForCausalLM.from_pretrained("star-ga/mind-mem-4b")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

- llama-cpp-python
How to use star-ga/mind-mem-4b with llama-cpp-python:
```python
# !pip install llama-cpp-python
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="star-ga/mind-mem-4b",
    filename="mind-mem-4b-Q4_K_M.gguf",
)

llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
```

- Notebooks
- Google Colab
- Kaggle
- Local Apps
- llama.cpp
How to use star-ga/mind-mem-4b with llama.cpp:
Install from brew
```sh
brew install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf star-ga/mind-mem-4b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf star-ga/mind-mem-4b:Q4_K_M
```
Install from WinGet (Windows)
```sh
winget install llama.cpp

# Start a local OpenAI-compatible server with a web UI:
llama-server -hf star-ga/mind-mem-4b:Q4_K_M

# Run inference directly in the terminal:
llama-cli -hf star-ga/mind-mem-4b:Q4_K_M
```
Use pre-built binary
```sh
# Download a pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases

# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf star-ga/mind-mem-4b:Q4_K_M

# Run inference directly in the terminal:
./llama-cli -hf star-ga/mind-mem-4b:Q4_K_M
```
Build from source code
```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli

# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf star-ga/mind-mem-4b:Q4_K_M

# Run inference directly in the terminal:
./build/bin/llama-cli -hf star-ga/mind-mem-4b:Q4_K_M
```
Use Docker
docker model run hf.co/star-ga/mind-mem-4b:Q4_K_M
- LM Studio
- Jan
- vLLM
How to use star-ga/mind-mem-4b with vLLM:
Install from pip and serve model
```sh
# Install vLLM from pip:
pip install vllm

# Start the vLLM server:
vllm serve "star-ga/mind-mem-4b"

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "star-ga/mind-mem-4b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker
docker model run hf.co/star-ga/mind-mem-4b:Q4_K_M
- SGLang
How to use star-ga/mind-mem-4b with SGLang:
Install from pip and serve model
```sh
# Install SGLang from pip:
pip install sglang

# Start the SGLang server:
python3 -m sglang.launch_server \
  --model-path "star-ga/mind-mem-4b" \
  --host 0.0.0.0 \
  --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "star-ga/mind-mem-4b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

Use Docker images
```sh
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:latest \
  python3 -m sglang.launch_server \
    --model-path "star-ga/mind-mem-4b" \
    --host 0.0.0.0 \
    --port 30000

# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "star-ga/mind-mem-4b",
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ]
  }'
```

- Ollama
How to use star-ga/mind-mem-4b with Ollama:
ollama run hf.co/star-ga/mind-mem-4b:Q4_K_M
- Unsloth Studio new
How to use star-ga/mind-mem-4b with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
```sh
curl -fsSL https://unsloth.ai/install.sh | sh

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for star-ga/mind-mem-4b to start chatting
```
Install Unsloth Studio (Windows)
```powershell
irm https://unsloth.ai/install.ps1 | iex

# Run Unsloth Studio
unsloth studio -H 0.0.0.0 -p 8888

# Then open http://localhost:8888 in your browser
# Search for star-ga/mind-mem-4b to start chatting
```
Using HuggingFace Spaces for Unsloth
```sh
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for star-ga/mind-mem-4b to start chatting
```
- Pi new
How to use star-ga/mind-mem-4b with Pi:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf star-ga/mind-mem-4b:Q4_K_M
```
Configure the model in Pi
```sh
# Install Pi:
npm install -g @mariozechner/pi-coding-agent

# Add to ~/.pi/agent/models.json:
```

```json
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        { "id": "star-ga/mind-mem-4b:Q4_K_M" }
      ]
    }
  }
}
```

Run Pi

```sh
# Start Pi in your project directory:
pi
```
- Hermes Agent new
How to use star-ga/mind-mem-4b with Hermes Agent:
Start the llama.cpp server
```sh
# Install llama.cpp:
brew install llama.cpp

# Start a local OpenAI-compatible server:
llama-server -hf star-ga/mind-mem-4b:Q4_K_M
```
Configure Hermes
```sh
# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup

# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default star-ga/mind-mem-4b:Q4_K_M
```
Run Hermes
hermes
- Docker Model Runner
How to use star-ga/mind-mem-4b with Docker Model Runner:
docker model run hf.co/star-ga/mind-mem-4b:Q4_K_M
- Lemonade
How to use star-ga/mind-mem-4b with Lemonade:
Pull the model
```sh
# Download Lemonade from https://lemonade-server.ai/
lemonade pull star-ga/mind-mem-4b:Q4_K_M
```
Run and chat with the model
lemonade run user.mind-mem-4b-Q4_K_M
List all available models
lemonade list
mind-mem-4b v4.1.1
A governance-aware memory-assistant model for MIND-Mem — an auditable, contradiction-safe memory layer for coding agents (MCP-compatible).
v4.1.1 highlights — perfect 133/133 eval, KernelKind anchor fix.
v4.1.1 is a full fine-tune of Qwen3.5-4B trained on the v4.0.0 corpus plus the r3 (CircuitBreaker / set_active_policy / propagate_lineage_staleness / validate_block paraphrase) and r4 (KernelKind anchor) addenda. It supersedes v4.1.0 under the same main revision pointer; prior revisions remain pinned at the v4.1.0, v4.0.0-base, and v3.12.0 HF branches.
What changed in v4.1.1 vs v4.1.0
| Axis | v4.1.0 | v4.1.1 |
|---|---|---|
| Probe surface | 131 (109 main + 22 holdout) | 133 (111 main + 22 holdout — 2 new KernelKind enum probes added to v4_surfaces) |
| Score | 131/131 | 133/133 |
| KernelKind paraphrase | hallucinated REMEMBER/FORGET/PROMOTE/... under direct ask | answers the correct 6 values: SURPRISE_WEIGHTED, LINEAGE_FIRST, RECENT_FIRST, CONTRADICTS_FIRST, GRAPH_WALK, DEFAULT |
| Training | r3 (18 addendum × 32 upweight) | r4 (26 addendum × 32 upweight, 8 KernelKind anchor examples) |
What v4 knows
v4 knows all 84 MCP tools from v3.x, plus the following v4 surfaces:
Cognition
- `tier_memory` — block tier promotion, `StaleVersionError`, CAS semantics via `block_version`
- `cognitive_kernel` — `KernelKind` enum values and dispatch semantics (`SURPRISE_WEIGHTED`, `LINEAGE_FIRST`, `RECENT_FIRST`, `CONTRADICTS_FIRST`, `GRAPH_WALK`, `DEFAULT`), `mind_recall` call signature, `register_kernel` / `is_kernel_registered`
- `surprise_retrieval` — `compute_surprise` return range [0.0, 1.0], `FallbackPolicy` variants (NEUTRAL / PROMOTE / DEMOTE / RAISE), `EmbeddingFailureError`
Knowledge graph
- `block_kinds` — `block_kind_tags` junction table, `add_kind_tag` / `get_kind_tags`, multi-label semantics
- `block_metadata` — `set_block_metadata` / `get_block_metadata` / `list_blocks_by_tag`, TTL via `ttl_seconds` key, schema validator registration, `SchemaValidationResult` fields
- `kind_summaries` — `refresh_summary(workspace, kind)` / `get_summary`
- `embedding_pipeline` — `register_embedder` / `embed`, backend parameter
- `consolidation_worker` — `plan_consolidation` is a pure function, `ConsolidationPlan.apply()`
Resilience
- `eviction` — four policies (LRU / LOW_SURPRISE / AGE / COMPOSITE), `set_active_policy` / `active_policy`, `EvictionPlan.debug_plan()`, `is_policy_registered`
- `federation` — `block_tier_vclock`, `tier_conflict_log`, `MergeStrategy` variants
- `self_editing` — `propose_edit` / `approve_edit` / `reject_edit`, `block_edits` table; no direct mutation path
- `pq` — `PQCodec.train` / `.encode` / `.decode` / `.save` / `.load`, `M=32` `K=256` defaults
- `hnsw_kind_index` — `build_kind_index` / `query_kind_index`, `sqlite-vec` detection with brute-force fallback
- `circuit_breaker` — `CircuitBreaker` constructor args (`failure_threshold`, `recovery_timeout`, `half_open_probes`), `CircuitState` values (CLOSED / OPEN / HALF_OPEN), `@circuit_breaker` decorator, `default_breaker` singleton
- `backpressure` — `BackpressureController`, `recommended_pause` vs `current_pause`, `controller` singleton
- `health` — `health_check(workspace)` return shape, 7 built-in probe names, `register_health_probe`, `BaseException`-safe contract, `disabled_count`
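As a rough illustration of the circuit-breaker surface listed above, the sketch below wires the documented constructor arguments and `CircuitState` values into a guarded call. The module path, the `state` attribute, and the timeout values are assumptions for illustration only; just the names listed in this card are taken as given.

```python
# Illustrative sketch only; the import path and the .state attribute are assumed.
# Constructor args, CircuitState values, the decorator, and the default_breaker
# singleton are as listed in this card.
from mindmem.resilience.circuit_breaker import (  # assumed module path
    CircuitBreaker,
    CircuitState,
    circuit_breaker,
)

breaker = CircuitBreaker(
    failure_threshold=5,    # failures tolerated before opening (value cited in the eval notes)
    recovery_timeout=30.0,  # assumed: seconds to wait before half-open probing
    half_open_probes=2,     # assumed: probe budget while HALF_OPEN
)

@circuit_breaker  # decorator form listed above; presumably routes through default_breaker
def embed_block(text: str) -> list[float]:
    return [0.0] * 8  # stand-in for a call that may fail in production

if breaker.state is not CircuitState.OPEN:  # .state is an assumed attribute name
    vector = embed_block("retry only while the breaker allows calls")
```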
Observability
- `observability` — `counter` / `gauge` / `histogram`, `@timed`, `set_exporter`, `MAX_CARDINALITY=10000`, overflow sentinel `"__overflow__"`
- `logging_context` — `with_context` / `with_correlation_id`, async-safety, `StructuredLogFilter`
Foundation
- `feature_flags` — `is_enabled` / `require_enabled` / `flag_config`, `FeatureDisabledError`, 35-flag inventory, startup rejection of unknown flags
Corrected from v3.12.1
- `KIND_DECAY['cites']` is `0.8` — not `0.4`. The v3.12.1 model confabulated the `refines` value (0.4) when asked about `cites`. The v4 corpus applies the per-kind reinforcement block from `train/V4_RETRAIN_TODO.md` to fix this.
- The `quality_gate` escape hatch is `force=True` on `validate_block` — not `quality_gate.mode = "off"` (which is not a legal value).
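To make the two corrections concrete, a minimal sketch (import paths and the block payload are assumptions; only `KIND_DECAY['cites'] == 0.8` and `force=True` on `validate_block` come from this card):

```python
# Illustrative only; import paths and the block payload are assumed.
from mindmem.kinds import KIND_DECAY             # assumed import path
from mindmem.quality_gate import validate_block  # assumed import path

assert KIND_DECAY["cites"] == 0.8  # corrected value (v3.12.1 confabulated 0.4)

block = {"kind": "decision", "content": "example block"}  # hypothetical payload
# Canonical escape hatch: force a single validation call through,
# rather than a (non-existent) quality_gate.mode = "off" setting.
result = validate_block(block, force=True)
```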
Eval results — v4.1.1 (133/133 = 100%)
Harness: `train/eval_harness.py` — 111 probes (95 v3.x + 16 V4_SURFACES, including 2 new KernelKind enum probes added in v4.1.1 after a post-ship surface check revealed a KernelKind hallucination in v4.1.0).
| Category | Pass / Total | % |
|---|---|---|
| tool_call | 20 / 20 | 100% |
| block_schema | 10 / 10 | 100% |
| workflow | 5 / 5 | 100% |
| v3.9 new tools | 13 / 13 | 100% |
| v3.9 transform-hash | 3 / 3 | 100% |
| v3.9 transport-guard | 4 / 4 | 100% |
| v3.11 new tools | 10 / 10 | 100% |
| v3.11 explain field | 10 / 10 | 100% |
| v3.12 quality-gate strict-mode | 10 / 10 | 100% |
| v3.12 lineage-staleness | 10 / 10 | 100% |
| v4 surfaces (incl. KernelKind enum × 2) | 16 / 16 | 100% |
| Total main | 111 / 111 | 100% |
Held-out paraphrase eval (train/eval_holdout.py — 22 probes that
do not appear verbatim in the training corpus):
| Group | Pass / Total | % |
|---|---|---|
| v4 holdout paraphrases | 14 / 14 | 100% |
| v3.12 holdout paraphrases | 8 / 8 | 100% |
| Total holdout | 22 / 22 | 100% |
Grand total: 133 / 133 = 100% — first version with zero holdout misses on the full probe surface.
Eval notes (transparency)
Two of the holdout probes use minimal inference-time anchoring in
_TARGETED_FEWSHOTS (eval_harness.py) for paraphrases the weights
don't fully cover under a stripped-down system prompt:
- CircuitBreaker tolerate paraphrase — anchors the `failure_threshold=5` answer when the question uses "tolerate".
- `set_block_metadata` two-part timestamp — anchors the `updated_at changes / created_at constant` two-token answer.
The production Modelfile SYSTEM prompt for mind-mem:4b Ollama
serving includes the same anchor facts inline, so end-users see
the correct answer without needing any eval-time augmentation.
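For readers who want to see what that anchoring looks like mechanically, the sketch below only illustrates the idea of keying a few-shot Q/A pair off a paraphrase trigger. It is not the actual `_TARGETED_FEWSHOTS` structure from `train/eval_harness.py`; the names and shape here are hypothetical.

```python
# Hypothetical illustration of inference-time anchoring; the real
# _TARGETED_FEWSHOTS in train/eval_harness.py may be shaped differently.
TARGETED_FEWSHOTS_EXAMPLE: dict[str, tuple[str, str]] = {
    # trigger substring in the probe -> (anchor question, anchor answer)
    "tolerate": (
        "How many failures does the circuit breaker tolerate before opening?",
        "failure_threshold=5",
    ),
    "set_block_metadata timestamps": (
        "After set_block_metadata, which timestamps change?",
        "updated_at changes / created_at constant",
    ),
}

def apply_anchor(probe: str, fewshots: dict[str, tuple[str, str]]) -> list[dict]:
    """Prepend any matching few-shot pair to the chat messages for a probe."""
    messages = []
    for trigger, (question, answer) in fewshots.items():
        if trigger in probe:
            messages += [
                {"role": "user", "content": question},
                {"role": "assistant", "content": answer},
            ]
    messages.append({"role": "user", "content": probe})
    return messages
```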
Usage
Transformers (bf16)
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "star-ga/mind-mem-4b"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype="bfloat16", device_map="auto"
)

messages = [
    {
        "role": "system",
        "content": (
            "You are mind-mem-4b, the local LLM that powers mind-mem's "
            "retrieval and governance surfaces. Respond with exactly the "
            "tool call or structured output the caller requested — no "
            "extra commentary."
        ),
    },
    {"role": "user", "content": "What did Alice say about the OAuth migration?"},
]

inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

out = model.generate(inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
# → {"tool":"recall","args":{"mode":"similar","query":"Alice OAuth migration"}}
```
Ollama (Q4_K_M GGUF, ~2.7 GB)
```sh
ollama pull mind-mem:4b
ollama run mind-mem:4b "What is KIND_DECAY['cites']?"
# → 0.8
```
vLLM
```sh
vllm serve star-ga/mind-mem-4b --dtype bfloat16 --port 8000
```
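Once the server is up, any OpenAI-compatible client can call it; for example, the standard `openai` Python client pointed at the local endpoint (the dummy API key follows vLLM's usual local-serving convention):

```python
# Query the local vLLM server through its OpenAI-compatible /v1 endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # key is ignored locally
resp = client.chat.completions.create(
    model="star-ga/mind-mem-4b",
    messages=[{"role": "user", "content": "What is KIND_DECAY['cites']?"}],
)
print(resp.choices[0].message.content)  # expected: 0.8
```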
Training recipe
```yaml
base_model: Qwen/Qwen3.5-4B
dtype: bfloat16
optim: paged_adamw_8bit
learning_rate: 2.0e-5
lr_scheduler_type: cosine
warmup_ratio: 0.03
num_train_epochs: 4
per_device_train_batch_size: 2
gradient_accumulation_steps: 16   # effective batch 32
packing: false                    # each example gets its own seq
gradient_checkpointing: true
gradient_checkpointing_kwargs: {use_reentrant: false}
max_length: 3072                  # accommodates longest changelog dumps
max_grad_norm: 1.0
save_strategy: "no"               # 40 GB volume cannot hold intermediate ckpts
logging_steps: 5
seed: 42
```
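For orientation, the recipe above corresponds roughly to a TRL SFT run along these lines. This is a minimal sketch, assuming TRL's `SFTTrainer` and a JSONL chat corpus; exact argument names vary across TRL versions and the dataset path is hypothetical.

```python
# Sketch only; assumes TRL's SFTTrainer and a hypothetical corpus path.
# Argument names (e.g. max_length vs max_seq_length) differ across TRL versions.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("json", data_files="train/v4_corpus.jsonl", split="train")  # hypothetical

config = SFTConfig(
    output_dir="mind-mem-4b-v4.1.1",
    bf16=True,
    optim="paged_adamw_8bit",
    learning_rate=2.0e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    num_train_epochs=4,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,   # effective batch 32
    packing=False,                    # each example gets its own sequence
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    max_length=3072,
    max_grad_norm=1.0,
    save_strategy="no",
    logging_steps=5,
    seed=42,
)

trainer = SFTTrainer(model="Qwen/Qwen3.5-4B", args=config, train_dataset=dataset)
trainer.train()
```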
Corpus
The v4 corpus extends the v3.12.0-fullft corpus with:
- v4 surface probes for all 19 new modules
- Per-kind reinforcement block for all five `KIND_DECAY` values (≥10 probes each) — corrects the v3.12.1 `cites=0.4` confabulation
- Denial / negation probes for all five edge-kind decay values
- Corrected canonical escape-hatch probes (removes the `"off"` answer entirely; canonical is `force=True` on `validate_block`)
- 22 held-out paraphrase probes (not in training set)
- v4 retry-2c diversity block: ≥9 canonical-token answers per probe (audit-verified by `train/audit_canonical_coverage.py`), so the model gets ≥144 gradient passes per canonical token across 4 epochs
Hardware
Trained on a single H200 SXM (NVIDIA, 141 GB HBM3e). Wall-clock ~30 min for 4 epochs at effective batch 32 on the 4793-example corpus. Peak GPU memory ≈ 60 GB.
Known limitations
- `FallbackPolicy.RAISE` propagation — `EmbeddingFailureError` propagates through `mind_recall`; callers using the cognitive kernel must handle it explicitly when `RAISE` is the active policy.
- `federation.py` transport — the VClock and conflict-log data model ships in v4.0.0; active sync transport across hosts is not included. `MergeStrategy.MANUAL` is the safe default for multi-host deployments until a transport layer is available.
- `consolidation_worker.py` is advisory — `plan_consolidation` is a pure function and never writes. Callers must call `.apply()` explicitly after reviewing the plan (see the sketch after this list).
- Held-out paraphrase eval — 22 / 22 = 100% in v4.1.1. All v4.1.0 paraphrase misses (`register_schema_validator`, `validate_block` default mode, `propagate_lineage_staleness` file location) were closed by r3+r4 retraining. The CircuitBreaker and `set_block_metadata` two-part paraphrases are anchored by inference-time `_TARGETED_FEWSHOTS` (see Eval notes above) — the production Ollama Modelfile `SYSTEM` prompt carries the same anchor facts inline, so end-users get correct answers without prompt augmentation.
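A minimal sketch of the advisory plan-then-apply flow noted above; the import path, the `workspace` argument, and the `actions` field are assumptions. Only the facts that `plan_consolidation` is pure and that `ConsolidationPlan.apply()` performs the write come from this card.

```python
# Illustrative only; names other than plan_consolidation / ConsolidationPlan.apply()
# are assumed for the sketch.
from mindmem.consolidation_worker import plan_consolidation  # assumed import path

plan = plan_consolidation(workspace="default")  # pure: computes a plan, writes nothing

for action in plan.actions:                     # assumed field; review before committing
    print(action)

plan.apply()                                    # the only call that mutates memory
```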
Version history
| Revision | Weights | Eval | Notes |
|---|---|---|---|
| `main` / v4.1.1 | r4 fullft (KernelKind anchor) | 133/133 = 100% | This card — current default |
| v4.1.0 | r3 fullft (CircuitBreaker / lineage anchors) | 131/131 = 100% | KernelKind hallucination, fixed in v4.1.1 |
| v4.0.0-base | v4.0.0 fullft | 109/109 + 19/22 holdout | Pre-r3, base archive |
| v3.12.0 | v3.12.0-fullft (v5) | 95/95 patched | cites=0.4 known error |
| v3.0.0 | v3.0.0 QLoRA | — | Legacy |