Instructions to use Truthseeker87/solarhive-e4b-gguf with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Truthseeker87/solarhive-e4b-gguf with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

# Load the text GGUF. The mmproj file is a projector companion, not a standalone model.
llm = Llama.from_pretrained(
    repo_id="Truthseeker87/solarhive-e4b-gguf",
    filename="solarhive-e4b-q4_k_m.gguf",
)

# Text-only request; image and audio input need the mmproj companion (see Path 2 under "How to Use").
llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": "At what battery SOC should we stop exporting to the grid?",
        }
    ]
)


llama.cpp

How to use Truthseeker87/solarhive-e4b-gguf with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Truthseeker87/solarhive-e4b-gguf:BF16
# Run inference directly in the terminal:
llama cli -hf Truthseeker87/solarhive-e4b-gguf:BF16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf Truthseeker87/solarhive-e4b-gguf:BF16
# Run inference directly in the terminal:
llama cli -hf Truthseeker87/solarhive-e4b-gguf:BF16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf Truthseeker87/solarhive-e4b-gguf:BF16
# Run inference directly in the terminal:
./llama-cli -hf Truthseeker87/solarhive-e4b-gguf:BF16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf Truthseeker87/solarhive-e4b-gguf:BF16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf Truthseeker87/solarhive-e4b-gguf:BF16

Use Docker

docker model run hf.co/Truthseeker87/solarhive-e4b-gguf:BF16


vLLM

How to use Truthseeker87/solarhive-e4b-gguf with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Truthseeker87/solarhive-e4b-gguf"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Truthseeker87/solarhive-e4b-gguf",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'


Ollama
How to use Truthseeker87/solarhive-e4b-gguf with Ollama:
```
ollama run hf.co/Truthseeker87/solarhive-e4b-gguf:BF16
```

Unsloth Studio

How to use Truthseeker87/solarhive-e4b-gguf with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Truthseeker87/solarhive-e4b-gguf to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Truthseeker87/solarhive-e4b-gguf to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Truthseeker87/solarhive-e4b-gguf to start chatting

Pi

How to use Truthseeker87/solarhive-e4b-gguf with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Truthseeker87/solarhive-e4b-gguf:BF16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "Truthseeker87/solarhive-e4b-gguf:BF16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use Truthseeker87/solarhive-e4b-gguf with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Truthseeker87/solarhive-e4b-gguf:BF16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default Truthseeker87/solarhive-e4b-gguf:BF16

Run Hermes

hermes

OpenClaw

How to use Truthseeker87/solarhive-e4b-gguf with OpenClaw:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf Truthseeker87/solarhive-e4b-gguf:BF16

Configure OpenClaw

# Install OpenClaw:
npm install -g openclaw@latest
# Register the local server and set it as the default model:
openclaw onboard --non-interactive --mode local \
  --auth-choice custom-api-key \
  --custom-base-url http://127.0.0.1:8080/v1 \
  --custom-model-id "Truthseeker87/solarhive-e4b-gguf:BF16" \
  --custom-provider-id llama-cpp \
  --custom-compatibility openai \
  --custom-text-input \
  --accept-risk \
  --skip-health

Run OpenClaw

openclaw agent --local --agent main --message "Hello from Hugging Face"

Docker Model Runner
How to use Truthseeker87/solarhive-e4b-gguf with Docker Model Runner:
```
docker model run hf.co/Truthseeker87/solarhive-e4b-gguf:BF16
```

Lemonade

How to use Truthseeker87/solarhive-e4b-gguf with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull Truthseeker87/solarhive-e4b-gguf:BF16

Run and chat with the model

lemonade run user.solarhive-e4b-gguf-BF16

List all available models

lemonade list

SolarHive E4B GGUF — Edge Solar Energy Intelligence

Fine-tuned Gemma 4 E4B (8B) in GGUF format for Ollama and llama.cpp edge deployment. The repo ships a standard Q4_K_M text quantization (under two interchangeable filenames, each with its own Modelfile) plus one multimodal projector companion file. On the single-pass local-laptop project-held-out check it scores 10/10 + 2/3 When2Call, matching the cloud E4B LoRA baseline as the joint best across all six measured deployment variants. Validated end-to-end on a CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7, 16 GB RAM) with the GGUF and Ollama blob cache stored on an external USB drive.

Built for the Gemma 4 Good Hackathon (Google DeepMind × Kaggle).


Base Model	google/gemma-4-e4b-it
Architecture	Dense + PLE — 8B total, 4.5B effective
Quantization	Q4_K_M — standard recipe (Q6_K on the PLE tensor)
Files	5.34 GB Q4_K_M text (×2 filenames, same recipe) · 992 MB mmproj
Total footprint	6.3 GB (text + mmproj)
Modalities	Text, Image, Audio (via mmproj companion)
Function Calling	Native Gemma 4 protocol
Project-held-out check	10/10 parity (5/5 Q&A + 5/5 tool routing) + 2/3 When2Call — joint best in the 6-variant Multi-Variant Deployment Validation table
Local deployment	CPU-only Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 16 GB RAM, Intel Iris Xe unused), Ollama 0.21.0 with llama.cpp `ggml-cpu-icelake.dll` (AVX2 + AVX512 + VNNI). External USB drive holds the 5.34 GB GGUF + Ollama blob cache.
Fine-Tuning	LoRA via Unsloth (BF16)
Training Data	1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~150M params for E4B), unmodified by our LoRA per the Vertex AI SFT recipe. The 992 MB `mmproj-solarhive-e4b-BF16.gguf` companion file packages this base vision encoder + audio encoder for `llama-server --mmproj`.
Converged Loss	0.9218
Training Time	420 seconds on RTX PRO 6000 Blackwell
License	MIT (adapters) / Gemma Terms (base model)

Model Overview

SolarHive is an AI energy advisor for community solar microgrids. It helps suburban neighborhoods collectively optimize distributed solar generation and shared battery storage through natural language conversation, visual inspection, and live data integration.

This is the edge deployment artifact. The 5.34 GB Q4_K_M text GGUF pairs with the 992 MB multimodal projector and runs on a 16 GB laptop CPU with no GPU, no cloud dependency, and no internet requirement. Companion to SolarHive 26B A4B LoRA (cloud inference with full multimodal VQA) and SolarHive E4B Ollama (merged safetensors for transformers research).

Privacy-first edge deployment. Community energy data never leaves the neighborhood. A village in rural India, a suburb in Michigan, and a coastal town recovering from a hurricane all get the same intelligence with no cloud round-trips.

Files in This Repository

File	Size	Use
`solarhive-e4b-q4_k_m.gguf`	5.34 GB	Text + tool calling — standard Q4_K_M (Q6_K-PLE)
`solarhive-e4b-q4_k_m-standard.gguf`	5.34 GB	Text + tool calling — same standard Q4_K_M (Q6_K-PLE), explicit `-standard` name
`mmproj-solarhive-e4b-BF16.gguf`	992 MB	Vision (SigLIP, 658 tensors) + audio (Conformer, 751 tensors) projector — pairs with the text GGUF
`Modelfile`	—	Ollama recipe pointing at `solarhive-e4b-q4_k_m.gguf`
`Modelfile.standard`	—	Ollama recipe pointing at `solarhive-e4b-q4_k_m-standard.gguf`

Quantization

The repo ships the SolarHive E4B fine-tune as a standard llama.cpp Q4_K_M GGUF (5.34 GB). The Q4_K_M mixed-precision recipe assigns Q4_K to most tensors and Q6_K to the largest / most sensitive ones — including the PLE tensor per_layer_token_embd.weight [10752, 262144] (2.82 B params, the largest single tensor in the model). The quant was produced on a high-RAM cloud notebook from the merged safetensors.

Two interchangeable copies ship so you can ollama create against a ready-made Modelfile without editing paths:

File	Recipe	Modelfile
`solarhive-e4b-q4_k_m.gguf`	standard Q4_K_M (Q6_K-PLE)	`Modelfile`
`solarhive-e4b-q4_k_m-standard.gguf`	standard Q4_K_M (Q6_K-PLE)	`Modelfile.standard`

Both files are the same Q4_K_M recipe and behave identically at inference time (≥7 GB RAM at 4K context).

Quantizing on 16 GB RAM (reproducibility recipe)

Quantizing the PLE tensor to Q6_K under the standard Q4_K_M recipe needs a float32 intermediate buffer of about 11.3 GB (10752 × 262144 values × 4 bytes, or 10.5 GiB), which causes llama-quantize to run out of memory on 16 GB hardware (bad allocation at tensor 4/720). To reproduce the quant on laptop-class hardware, override that one tensor to Q4_0:

llama-quantize --tensor-type per_layer_token_embd.weight=q4_0 \
  solarhive-e4b-bf16.gguf solarhive-e4b-q4_k_m.gguf Q4_K_M

This bypasses the buffer and yields a smaller (~4.6 GB) GGUF. The override was validated quality-safe during development (no measurable regression on the project-held-out check); the artifacts shipped in this repo use the standard recipe.
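The sizes involved can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope estimate, not output from llama-quantize: it assumes the ggml block layouts for these formats (Q6_K: 210 bytes per 256 weights; Q4_0: 18 bytes per 32 weights) and treats 1 GB as 10^9 bytes.

```python
# Rough sizes for per_layer_token_embd.weight [10752, 262144].
ROWS, COLS = 10752, 262144
n_params = ROWS * COLS                      # number of weights in the PLE tensor

f32_buffer_bytes = n_params * 4             # float32 staging buffer

# Bits per weight implied by the block layouts (assumed, see lead-in).
Q6_K_BPW = 210 * 8 / 256                    # 6.5625
Q4_0_BPW = 18 * 8 / 32                      # 4.5

q6_k_bytes = n_params * Q6_K_BPW / 8
q4_0_bytes = n_params * Q4_0_BPW / 8
saved_gb = (q6_k_bytes - q4_0_bytes) / 1e9

print(f"params:     {n_params / 1e9:.2f} B")
print(f"f32 buffer: {f32_buffer_bytes / 1e9:.2f} GB ({f32_buffer_bytes / 2**30:.1f} GiB)")
print(f"Q6_K -> Q4_0 on this tensor saves about {saved_gb:.2f} GB")
```

A saving of roughly 0.7 GB on this one tensor is consistent with the drop from 5.34 GB to about 4.6 GB.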

One mmproj, one text GGUF

The mmproj-solarhive-e4b-BF16.gguf companion is independent of the text model's quantization. llama-quantize operates exclusively on the text tower — vision and audio tensors were split out into the mmproj file during convert_hf_to_gguf.py --mmproj. At inference time, llama-server --model X --mmproj Y pairs them on demand. One mmproj serves any current or future text variant (Q4/Q5/Q6/Q8) with no per-variant duplication.

Inference Path Selection

This repo ships alongside solarhive_inference_e4b_gguf_ollama.py (in the GitHub repo), which implements two inference approaches against Ollama. The recommended demo path is the /api/generate raw mode + manual prompt builder approach. Both are documented for transparency and reproducibility.

Path A — OpenAI-compatible `/api/chat` + Ollama's built-in `gemma4.go` parser

Definition. Send messages and OpenAI-style tool schemas to Ollama's standard /api/chat endpoint. Ollama internally renders the chat template, parses model output, and extracts tool calls via its dedicated Gemma 4 parser (gemma4.go). This is the simplest path and the closest to a drop-in replacement for any OpenAI-compatible client.

import requests

resp = requests.post("http://127.0.0.1:11434/api/chat", json={
    "model": "solarhive",
    "messages": [{"role": "user", "content": "What's the current battery state?"}],
    # {...} is a placeholder for each tool's JSON schema (see solarhive_inference_e4b_gguf_ollama.py).
    "tools": [{"type": "function", "function": {...}}],
    "stream": False,
    "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})

Rationale. This is the path most developers will reach for first. OpenAI-compatible JSON, no manual prompt construction, leverages Ollama's first-party Gemma 4 support.

Why it does not score 10/10 on our project-held-out check. Ollama 0.21.0's gemma4.go:306 parser detects the model's native call:fn{} output but rejects it because the arguments use bare unquoted keys and <|"|> string delimiters (the Gemma 4 native format the model was fine-tuned on) instead of strict JSON. Server log evidence:

level=WARN source=gemma4.go:306 msg="gemma4 tool call parsing failed"
error="invalid character '\'' looking for beginning of object key string
       repair failed to produce valid JSON arguments"
content="call:get_status{battery:{level:65,pct:77,kwh:77,...,dispatched:<|\"|>OK<|\"|>},...}"

The repair logic also fails to recover valid JSON. As a result, tool calls are silently dropped and the returned content is empty, so scoring on the project-held-out 10-prompt check is inconsistent and capped by this upstream issue. Path A is not viable for the demo without an upstream Ollama patch.
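The rejection is easy to reproduce outside Ollama. The sketch below uses Python's json module as a stand-in for a strict JSON parser (Ollama's parser is written in Go, so the exact error text differs) and a shortened, hypothetical argument string in the native style.

```python
import json

# Shortened, hypothetical argument payload in the Gemma 4 native style:
# bare keys and <|"|> string delimiters instead of JSON quotes.
native_args = '{battery:{level:65,dispatched:<|"|>OK<|"|>}}'
strict_args = '{"battery":{"level":65,"dispatched":"OK"}}'

def is_strict_json(text):
    """Return True if `text` parses as strict JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(is_strict_json(native_args))   # False
print(is_strict_json(strict_args))   # True
```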

Path B — `/api/generate` raw mode + manual Gemma 4 prompt builder + native parser (recommended)

Definition. Bypass gemma4.go entirely. Build the Gemma 4 native prompt manually from chat_template.jinja, send via Ollama's /api/generate endpoint with raw: True (no template rendering by Ollama), parse the raw output ourselves using regex matching tokenizer_config.json's response_schema.

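import requests

# build_gemma4_prompt and parse_gemma4_output are defined in solarhive_inference_e4b_gguf_ollama.py.
prompt = build_gemma4_prompt(messages, tools)   # byte-identical to apply_chat_template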
resp = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "solarhive",
    "prompt": prompt,
    "raw": True,
    "stream": False,
    "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})
content, tool_calls = parse_gemma4_output(resp.json()["response"])

Rationale. We control both ends of the prompt cycle:

The prompt structure exactly matches what the model was fine-tuned on. build_gemma4_prompt produces byte-identical output to apply_chat_template(messages, tools, enable_thinking=False, add_generation_prompt=True), the same call used by the cloud 26B A4B path.
The output parser uses the exact regexes from the response_schema in tokenizer_config.json: r"<\|tool_call>(.*?)<tool_call\|>" for tool blocks and r"call:(\w+)\{(.*)\}$" for argument extraction.
Tool responses are fed back as <|turn>tool\n{json}<turn|>, which matches what the training pipeline rendered from the {role:"tool", content:"{...}"} OpenAI-style messages in the training data.

Result. Score on the single-pass project-held-out 10-prompt check: 10/10 + 2/3 on the When2Call probes. Matches the cloud E4B LoRA baseline as the joint best across all 6 measured variants in the Multi-Variant Deployment Validation table below.

Trade-off. ~30 lines of additional Python in solarhive_inference_e4b_gguf_ollama.py to build the prompt and parse the output. Worth it to bypass the upstream gemma4.go parser issue.
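To make the parsing half of Path B concrete, here is a minimal sketch built around the two regexes quoted above. It is not the reference parse_gemma4_output from the script: the function names, the bare-key rule, and the returned dict shape are assumptions made for illustration.

```python
import json
import re

# Patterns quoted from the response_schema description above.
TOOL_BLOCK = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)
CALL = re.compile(r"call:(\w+)\{(.*)\}$", re.DOTALL)
# Assumption: bare keys are identifiers that directly follow "{" or ",".
BARE_KEY = re.compile(r"([{,]\s*)([A-Za-z_]\w*)\s*:")
STRING_DELIM = '<|"|>'

def native_args_to_dict(body):
    """Convert native arguments (bare keys, <|"|> strings) into a dict."""
    parts = ("{" + body + "}").split(STRING_DELIM)
    out = []
    for i, part in enumerate(parts):
        if i % 2 == 0:
            # Structural text outside strings: quote the bare keys.
            out.append(BARE_KEY.sub(r'\1"\2":', part))
        else:
            # String content: json.dumps adds the quotes and escapes.
            out.append(json.dumps(part))
    return json.loads("".join(out))

def parse_gemma4_output_sketch(raw):
    """Return (content, tool_calls) extracted from raw model output."""
    tool_calls = []
    for block in TOOL_BLOCK.findall(raw):
        match = CALL.search(block.strip())
        if match:
            name, body = match.groups()
            tool_calls.append({"name": name, "arguments": native_args_to_dict(body)})
    content = TOOL_BLOCK.sub("", raw).strip()
    return content, tool_calls

raw = 'Checking now. <|tool_call>call:get_weather{location:<|"|>Ann Arbor, MI<|"|>}<tool_call|>'
print(parse_gemma4_output_sketch(raw))
```

Splitting on the string delimiter first keeps commas and colons inside string values from being mistaken for keys. Malformed output (for example an unbalanced delimiter) still raises json.JSONDecodeError, which a real harness would need to handle.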

Recommended path

For the SolarHive Ollama + llama.cpp demo we use Path B (/api/generate raw mode + manual prompt builder). The solarhive_inference_e4b_gguf_ollama.py script in the GitHub repo provides a complete reference implementation of build_gemma4_prompt and parse_gemma4_output. Set OLLAMA_MODEL=solarhive or OLLAMA_MODEL=solarhive-standard (same Q4_K_M quant, two Modelfiles) and run the script — both score 10/10 on the project-held-out check.

For multimodal use cases (vision, audio), use llama-server --mmproj from llama.cpp directly — see "How to Use" below. Ollama 0.21.0's Modelfile syntax for Gemma 4 multimodal projector declaration is still evolving; first-class Ollama vision support for Gemma 4 will arrive in a future release.

Path C — Interactive chat via Unsloth Studio

For community users who want to chat with the SolarHive E4B GGUF without writing a Modelfile or running the HTTP harness by hand, Unsloth Studio provides an open-source no-code local web UI:

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex
unsloth studio -H 0.0.0.0 -p 8888

Studio supports running GGUF + safetensor models locally with self-healing tool calling and code execution, model arena (side-by-side comparison), and saves to GGUF / 16-bit safetensor formats. Use Studio for interactive exploration; use Path B (/api/generate raw mode + solarhive_inference_e4b_gguf_ollama.py) for reproducible evaluation — Studio doesn't expose the per-question evaluation harness needed to reproduce the 10/10 SolarHive parity score.

Project-Held-Out Results

Single-pass project-held-out 10-prompt parity check — 5 domain Q&A (no tool expected) + 5 tool-calling — plus 3 When2Call probes. Run via the /api/generate raw mode + manual Gemma 4 prompt builder path (Path B above), on Ollama 0.21.0, CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Intel Iris Xe unused), with the 5.34 GB GGUF + Ollama blob cache stored on an external USB drive. Same 10 prompts and same 3 When2Call probes as the cloud checks for direct cross-variant comparison.

Domain Q&A (5/5)

Question	Expected behavior	Result
Solar production when humidity exceeds 80%?	Direct answer, no tool call	✅
At what battery SOC should we stop exporting to the grid?	Direct answer, no tool call	✅
Home #3 underperforming 22% — diagnostic checklist?	Direct answer, no tool call	✅
Winter snow on panels — prioritize actions?	Direct answer, no tool call	✅
Grid frequency 59.8 Hz — microgrid implications?	Direct answer, no tool call	✅

Tool Calling (5/5)

Question	Expected tool	Result
What's the current battery state?	`get_battery_state`	✅ fired + synthesized
Current weather and how does it affect solar production?	`get_weather`	✅ fired + synthesized
What are general maintenance tips for panels?	None (no tool needed)	✅ correctly no tool call
What is the weather expected to be like this week?	`get_weather`	✅ fired + synthesized
How should we plan energy consumption and storage given the weather forecast?	`get_weather` (+ `get_grid_status`)	✅ fired + synthesized

Quantization-precision note. The shipped Q4_K_M (Q6_K-PLE) GGUF matches the cloud BF16 E4B LoRA variant on the project-held-out check (see the Multi-Variant Deployment Validation table below). On this small check, Q4_K_M shows no measurable loss for SolarHive's task distribution at this model size.

When2Call probes (2/3; 3 categories per Ross et al. 2025)

Three held-out probes from When2Call: When (not) to Call Tools cover 3 of the 4 failure-mode categories in the paper, which reports 9–67% tool-hallucination rates on (c) and (d) in untrained community models:

Category	Question	Expected behavior	Result on this GGUF
(b) Well-specified, in-scope	"What's the current grid rate?"	Call `get_grid_status`	PASS — called `get_grid_status`
(c) Under-specified	"How much will a 10 kW array produce today?"	Follow-up question (does NOT auto-fill location default)	PASS — asked for current weather conditions
(d) Out-of-scope	"What's the current air quality index in Ann Arbor?"	Refusal + redirect (does NOT hallucinate a tool)	FAIL — called `get_weather` (a known E4B-family failure mode also seen on the cloud E4B merged variant; the larger A4B family scores 3/3 on this same probe)

Headline: 2/3 nominal, the same profile as the cloud E4B merged variant (transformers BF16). This indicates that GGUF Q4_K_M quantization does not shift the When2Call refusal/follow-up decision boundary within the E4B family. The 1/3 W2C gap to the A4B family (2/3 vs 3/3) persists across runtimes (cloud transformers and local Ollama produce identical W2C scores within each family), which points to a model-size signature rather than a precision artifact.
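The pass/fail logic behind the table can be sketched as follows. The classifier is a deliberately crude, hypothetical heuristic (any tool call counts as a call; text ending in a question mark counts as a follow-up) and the response texts are placeholders; the actual harness may judge responses differently.

```python
# Expected behaviour per probe category, taken from the table above.
PROBES = {
    "b": {"expect": "tool_call", "tool": "get_grid_status"},
    "c": {"expect": "follow_up", "tool": None},
    "d": {"expect": "refusal", "tool": None},
}

def classify(content, tool_names):
    """Crude illustrative classifier of a model response."""
    if tool_names:
        return "tool_call"
    if content.rstrip().endswith("?"):
        return "follow_up"
    return "refusal"

def score(observed):
    """`observed` maps category -> (content, [tool names]); returns pass flags."""
    results = {}
    for category, spec in PROBES.items():
        content, tool_names = observed[category]
        passed = classify(content, tool_names) == spec["expect"]
        if passed and spec["tool"] is not None:
            passed = spec["tool"] in tool_names
        results[category] = passed
    return results

# Outcomes reported for this GGUF; the response texts are placeholders.
observed = {
    "b": ("", ["get_grid_status"]),
    "c": ("What are the current weather conditions?", []),
    "d": ("", ["get_weather"]),
}
results = score(observed)
print(results, f"{sum(results.values())}/{len(results)}")
```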

End-to-end agentic loop probe

A multi-tool community-energy-audit query is run through the full agentic loop (extract → execute → feed back, max 3 rounds), mirroring the cloud notebook's §13g cell. Headline trace:

"Full community energy audit — check current weather, solar production, battery state, and grid pricing. Give a 3-sentence status report."

Rounds completed: 2
Tools executed: get_weather, get_solar_production, get_battery_state, get_grid_status (4 of 5 tools called in parallel in round 1)
Final answer (excerpt): "Community Energy Status Report (Midday): Partly cloudy with 30% cloud cover and 72°F. Array is producing 56% of capacity at 40.4 kW. Battery is at 72% SOC and actively charging from surplus. Grid pricing is Peak rates ($0.28/kWh). Status: Excellent. Maximize self-consumption by running heavy loads now…"

All four tool results are correctly synthesized into a coherent status report — confirming the GGUF Q4_K_M quantization preserves end-to-end agentic reasoning quality, not just single-shot tool routing.
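The extract → execute → feed back loop can be sketched in a few lines. This is an illustration under stated assumptions, not the script's implementation: generate is a hypothetical callable that wraps the model call and parser and returns (content, tool_calls), and the message shapes are simplified.

```python
import json

def run_agent_loop(generate, tools, messages, max_rounds=3):
    """Run extract -> execute -> feed back for at most `max_rounds` rounds.

    `generate(messages)` returns (content, tool_calls); each call is
    {"name": str, "arguments": dict}. `tools` maps tool name -> callable.
    Returns (final content, executed tool names, rounds used).
    """
    executed = []
    content = ""
    for round_no in range(1, max_rounds + 1):
        content, tool_calls = generate(messages)
        if not tool_calls:
            return content, executed, round_no
        messages.append({"role": "assistant", "content": content, "tool_calls": tool_calls})
        for call in tool_calls:
            result = tools[call["name"]](**call["arguments"])
            executed.append(call["name"])
            messages.append({"role": "tool", "name": call["name"], "content": json.dumps(result)})
    return content, executed, max_rounds

# Stubbed demo: the fake model requests one tool, then answers. Values are placeholders.
def fake_generate(messages):
    if not any(m["role"] == "tool" for m in messages):
        return "", [{"name": "get_battery_state", "arguments": {}}]
    return "Battery is at 72% SOC.", []

tools = {"get_battery_state": lambda: {"soc_pct": 72}}
answer, executed, rounds = run_agent_loop(
    fake_generate, tools, [{"role": "user", "content": "Battery status?"}]
)
print(rounds, executed, answer)
```

Capping the rounds guards against a model that keeps requesting tools without ever producing a final answer.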

Multi-Variant Deployment Validation — All 6 variants now measured

Cross-variant table covering all 5 cloud transformers variants (validated on Colab Pro G4 with NVIDIA RTX PRO 6000 Blackwell) plus this GGUF variant (validated on the CPU-only Surface Pro 8 reference hardware described above):

Variant	Q&A	Tool	W2C	Total	Backend	Hardware
a4b_lora (baseline)	5/5	4/5	3/3	9/10	transformers + Unsloth	Colab Pro G4 GPU
e4b_lora	5/5	5/5	2/3	10/10	transformers + Unsloth	Colab Pro G4 GPU
e4b_merged	5/5	4/5	2/3	9/10	transformers BF16	Colab Pro G4 GPU
a4b_merged	5/5	4/5	3/3	9/10	transformers BF16	Colab Pro G4 GPU
a4b_nf4	5/5	4/5	3/3	9/10	transformers NF4 (BnB)	Colab Pro G4 GPU
e4b_gguf (this artifact)	5/5	5/5	2/3	10/10	Ollama HTTP raw + manual prompt builder	CPU-only Surface Pro 8 (i5-1135G7, 16 GB RAM)

Key empirical findings:

Joint best variant in the table: this GGUF ties the cloud E4B LoRA baseline at 10/10 + 2/3 W2C, despite running on a 4-year-old consumer laptop with no GPU rather than an RTX PRO 6000 Blackwell cloud GPU.
GGUF Q4_K_M quantization shows no measurable loss within the E4B family on this check. Tool routing 5/5 + W2C 2/3 exactly matches the BF16 LoRA, with the same (b)+(c) PASS, (d) FAIL profile. The 5.34 GB CPU-only deployment reaches the same scores as the BF16 + GPU pipeline.
The 1/3 W2C gap to the A4B family is a model-size signature, not a runtime artifact: it is reproduced across both cloud transformers and local Ollama runtimes within each family.

Reproduce locally:

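# PowerShell syntax; in bash or zsh use `export OLLAMA_HOST=http://localhost:11434` and `export OLLAMA_MODEL=solarhive` instead
$env:OLLAMA_HOST  = 'http://localhost:11434'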
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short

The evaluation run auto-writes a markdown summary to archive/ollama_local_e4b_gguf_results_YYYYMMDD_HHMMSS.md with the full per-question / per-probe trace.
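For readers scripting around the results, the filename pattern maps directly onto a strftime format. The helper below is a hypothetical sketch of that naming scheme, not code from the evaluation script.

```python
from datetime import datetime
from pathlib import PurePosixPath

def results_path(now=None, archive_dir="archive"):
    """Return the summary path for a run started at `now` (default: current time)."""
    now = now or datetime.now()
    stamp = now.strftime("%Y%m%d_%H%M%S")   # YYYYMMDD_HHMMSS
    return PurePosixPath(archive_dir) / f"ollama_local_e4b_gguf_results_{stamp}.md"

print(results_path(datetime(2026, 1, 2, 3, 4, 5)))
# archive/ollama_local_e4b_gguf_results_20260102_030405.md
```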

How to Use

Path 1 — Text + tool calling via Ollama (10/10 demo path)

# Download the text GGUF + Modelfile + LICENSE
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf Modelfile LICENSE \
  --local-dir ./solarhive-gguf

cd ./solarhive-gguf

# Register with Ollama (Modelfile uses ./solarhive-e4b-q4_k_m.gguf relative path)
ollama create solarhive -f Modelfile

# Quick sanity check
ollama run solarhive "What is the current solar production for our community?"

# Full 10/10 + 2/3 W2C agentic check — uses /api/generate raw mode + manual prompt builder
git clone https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive
cd the-gemma4-good-hackathon-solarhive
# PowerShell syntax; in bash or zsh use: export OLLAMA_MODEL=solarhive
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short

To use the -standard-named copy instead (same Q4_K_M quant), download solarhive-e4b-q4_k_m-standard.gguf and Modelfile.standard in place of solarhive-e4b-q4_k_m.gguf and Modelfile, run ollama create solarhive-standard -f Modelfile.standard, and set OLLAMA_MODEL=solarhive-standard.

Path 2 — Text + image + audio via llama.cpp `llama-server`

# Download a text variant + the mmproj companion
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
  --local-dir ./solarhive-gguf

# Start llama-server with the mmproj
llama-server \
  --model   ./solarhive-gguf/solarhive-e4b-q4_k_m.gguf \
  --mmproj  ./solarhive-gguf/mmproj-solarhive-e4b-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8080

# POST chat-completion requests with image_url to http://localhost:8080/v1/chat/completions

The mmproj bundles both vision (SigLIP, for sky photos and panel inspection) and audio (Conformer, for voice queries up to 30 s), so the same file serves both modalities.
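A request to the running server might look like the sketch below. It assumes the server from Path 2 is listening on localhost:8080 and inlines the image as a base64 data URL so no external hosting is needed; the helper names and the sky.jpg filename are hypothetical.

```python
import base64
import json
import urllib.request

def build_image_request(image_bytes, question, mime="image/jpeg"):
    """Build an OpenAI-style chat payload with the image inlined as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
    }

def post_chat(payload, base_url="http://localhost:8080"):
    """POST the payload to /v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["message"]["content"]

# Example (requires the server to be running and a local sky.jpg):
# with open("sky.jpg", "rb") as f:
#     print(post_chat(build_image_request(f.read(), "Estimate cloud cover in this sky photo.")))
```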

Path 3 — Community microgrid hub on Jetson Orin Nano Super (llama.cpp + CUDA)

The same GGUF runs CUDA-accelerated on a Jetson Orin Nano Super Developer Kit ($249, 8 GB LPDDR5, 1024 CUDA cores, 7–25 W power envelope), making it a natural community-ownable, solar-powerable microgrid hub that serves a whole neighborhood. Build llama.cpp with CUDA per NVIDIA's official Gemma 4 Jetson recipe:

# On the Jetson (Ubuntu via JetPack SDK)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4

# Download the same GGUFs as Path 1 / Path 2
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
  --local-dir ~/models

# Run with full GPU offload (-ngl 99) for maximum throughput
./build/bin/llama-server \
  --model   ~/models/solarhive-e4b-q4_k_m.gguf \
  --mmproj  ~/models/mmproj-solarhive-e4b-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8080 --host 0.0.0.0 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1

Why this matters for community deployment: at 7–25 W, a single Jetson Orin Nano Super can be powered directly from a small solar panel during the day and a modest battery overnight — the SolarHive intelligence runs on the same energy infrastructure it advises. Mobile clients (Path 4 below) hit the Jetson at http://<hub-ip>:8080 over the local network when tool-calling responses with live API data are needed.

Modelfile reference

Both Modelfiles use a relative path so the same file works regardless of where you cloned the repo:

FROM ./solarhive-e4b-q4_k_m.gguf

SYSTEM """You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan. Use the available tools to get real-time data before answering. Be specific, reference actual data, and keep responses concise (3-5 sentences)."""

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 4096

Core Capabilities

1. Multimodal Visual Question Answering (3 Modes)

Available via llama-server --mmproj (Path 2 above). Tensor inventory of mmproj-solarhive-e4b-BF16.gguf: 1411 tensors total — 658 vision (SigLIP), 751 audio (Conformer), 2 multimodal projectors (mm.input_projection.weight, mm.a.input_projection.weight).

Mode	Input	Output
Sky Analysis	Sky photograph	Cloud coverage %, production forecast, storage recommendation
Panel Inspection	Panel photograph	Dirt/damage/shading detection, efficiency impact estimate
Neighborhood Assessment	Aerial/satellite image	Panel inventory, expansion priorities, shading analysis

2. Native Function Calling (5 Tools — all 3 keyed APIs wired)

Available via either Path 1 (Ollama) or Path 2 (llama-server). Tool schemas and reference implementations are in solarhive_inference_e4b_gguf_ollama.py (project root). The §13g cell of solarhive_inference.py runs an end-to-end agentic-loop probe via Ollama HTTP raw mode using these same 5 tools.

Tool	API	Returns
`get_weather(location)`	OpenWeatherMap (`OWM_API_KEY`)	Temperature, clouds %, wind, humidity, sunrise/sunset
`get_solar_production(clouds_pct, temp_f)`	Open-Meteo GHI (keyless)	Production kW, efficiency %, GHI W/m², temp derating
`get_battery_state()`	Community BMS (simulated)	State of charge, capacity, charging status
`get_grid_status()`	EIA Open Data (`EIA_API_KEY`)	Pricing period, rate/kWh, renewable %, CO2 intensity
`get_nrel_pvwatts_baseline()`	NREL PVWatts v8 (`NREL_API_KEY`)	Annual + current-month typical kWh + avg kW for the 72 kW array

Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)} per call. The _build_gemma4_prompt() helper in solarhive_inference_e4b_gguf_ollama.py renders this format byte-identically — same as solarhive_datagen.py (training-data generation) and solarhive_finetune.py (SFT preprocessing). Inference matches training distribution exactly.

3. Selective Tool Reasoning

The model decides when to call tools — not blindly invoking all of them. Validated by the 5/5 tool-calling sub-check above:

"What time does peak pricing start?"
→ Calls: get_grid_status() only

"Is today's production above typical for January?"
→ Calls: get_solar_production() + get_nrel_pvwatts_baseline()

"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()

"What are general maintenance tips for panels?"
→ Calls: none (answers from training knowledge)

4. Inference-time When2Call Validation (`solarhive_inference.py` §11b)

Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851 — the paper documents 9–67% tool-hallucination rates on (c) and (d) in untrained community models because public tool-calling datasets typically lack follow-up and unable-to-answer examples:

(b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
(c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
(d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)

A baseline community model trained without these categories typically fails (c) + (d) (per the paper's 9–67% hallucination rates on untrained models). With the _UNABLE_TO_ANSWER + _FOLLOW_UP_QUESTIONS corpus categories included in solarhive_datagen.py, the A4B family scores 3/3 and the E4B family scores 2/3 (passes (b) + (c), fails (d); see Project-Held-Out Results above). The same When2Call probes run end-to-end against this GGUF artifact via the /api/generate raw mode path for edge-deployment validation.

Training Details

Parameter	Value
Method	LoRA via Unsloth `FastVisionModel` (BF16, RTX PRO 6000 96 GB)
LoRA rank	16
LoRA alpha	16
LoRA dropout	0
Target modules	All linear layers
Learning rate	2e-4
Optimizer	AdamW 8-bit
Warmup steps	5
Epochs	3
Max sequence length	2048
Precision	BF16
Seed	3407
Trainable parameters	41.2 M / 8.0 B (0.51%)

Training Data — 1,727 Examples

Same canonical training corpus as the 26B A4B model — solarhive-community-solar-multimodal, 1,727 rows:

413 hand-crafted examples spanning 15+ US cities and 9 energy domains (sky conditions, battery management, panel health, consumption optimization, community/grid coordination, emergency resilience, seasonal planning, multi-step reasoning, alternative storage)
~1,117 API-grounded examples from live Open-Meteo (GHI/DNI/DHI, low/mid/high cloud cover), PVWatts, OpenWeatherMap, and EIA APIs — every numeric claim traces to a real API response, joined on (location, hourly timestamp) for cross-source coherence
183 tool-calling examples following the When2Call taxonomy — 106 should-call, 53 should-not-call, 10 unable-to-answer, 6 follow-up clarification, 8 failure-recovery
14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs — paired with the same temperature-derated GHI formula used in text rows

See the SolarHive Dataset for full documentation.

Fine-tuning is text-only on the multimodal-capable corpus (image rows skipped at the data-prep layer). VQA at inference uses the base Gemma 4 E4B model's pretrained vision encoder (~150M params per the official model card). Our LoRA targets only the language-model linear layers (target=all-linear); the vision tower is unmodified, matching the Vertex AI Gemma 4 SFT recipe documented in the Hugging Face blog, which explicitly freezes both vision and audio towers during text-focused fine-tuning. The 992 MB mmproj-solarhive-e4b-BF16.gguf companion file packages this base vision encoder (658 SigLIP tensors) plus the base audio encoder (751 Conformer tensors) for llama-server --mmproj, giving the deployed GGUF full multimodal capability.
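
The corpus breakdown and the text-only filter can be cross-checked with a few lines of arithmetic. The sketch below uses only the row counts listed above; the effective batch size of 16 is an assumption (it is not stated in this card), chosen because it reproduces the 324 total steps reported in the Training Loss table.

```python
import math

# Row counts from the corpus breakdown above (the API-grounded count is approximate).
rows = {
    "hand_crafted": 413,
    "api_grounded": 1117,
    "tool_calling": 183,
    "image_grounded": 14,
}
tool_calling_split = {
    "should_call": 106,
    "should_not_call": 53,
    "unable_to_answer": 10,
    "follow_up": 6,
    "failure_recovery": 8,
}

total_rows = sum(rows.values())
# Image rows are skipped at the data-prep layer, so only text rows are trained on.
text_rows = total_rows - rows["image_grounded"]

# ASSUMPTION: effective batch size is not documented here; 16 is a guess.
assumed_effective_batch = 16
epochs = 3
steps = math.ceil(text_rows / assumed_effective_batch) * epochs

print(total_rows, text_rows, sum(tool_calling_split.values()), steps)
```

If the real batch configuration differs, only the final step estimate changes; the row totals follow directly from the published counts.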

Training Loss

Metric	Value
Converged loss (last 20 steps)	0.9218
Final step loss	0.0635
Minimum loss	0.0635
Total steps	324
Training time	420 seconds

Canonical metric: Converged loss (last 20 steps) is the only smoothed convergence indicator. Final step and Minimum are single-batch point statistics; mini-batch loss is noisy from step to step, so one easy batch can drop a point estimate well below the rolling-average trend.
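
To make the distinction concrete, the following sketch computes a windowed mean alongside the two point statistics. The `converged_loss` helper is a hypothetical name and the loss values are synthetic, chosen only to show how a single easy batch can sit far below the windowed average; they are not the real training log.

```python
from statistics import mean


def converged_loss(losses: list[float], window: int = 20) -> float:
    """Mean of the last `window` per-step losses (a simple smoothed indicator)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    return mean(losses[-window:])


# Synthetic per-step losses for illustration only.
synthetic = [1.0, 0.9, 1.1, 0.95, 0.2]

print(converged_loss(synthetic, window=5))  # smoothed value over the window
print(synthetic[-1], min(synthetic))        # single-step point statistics
```

In this toy series the final step is also the minimum, while the windowed mean stays close to the earlier steps.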

Hardware

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Platform: Google Colab Pro (G4 VM)
Precision: BF16 (no quantization during training)

GGUF Production Pipeline

The shipped Q4_K_M GGUF was produced on a high-RAM cloud notebook:

Step	Command	Notes
Source	merged safetensors from solarhive-e4b-ollama	—
Convert to BF16 GGUF	`convert_hf_to_gguf.py --outtype bf16`	~30 min, 14 GB intermediate
Quantize text tower	`llama-quantize ... Q4_K_M`	~23 sec, 5.34 GB (needs ≥30 GB RAM)
Reproducibility notebook	`solarhive_colab_quantize_e4b.ipynb` in GitHub repo	—

To reproduce on a 16 GB laptop instead of a high-RAM notebook, add --tensor-type per_layer_token_embd.weight=q4_0 (see "Quantizing on 16 GB RAM" above) — this bypasses the Q6_K-PLE intermediate buffer and yields a ~4.6 GB GGUF.

The mmproj companion is produced once via convert_hf_to_gguf.py --mmproj on the merged safetensors — independent of the text quantization step.
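
The pipeline steps above can be sketched as argument lists built in Python. This is a hedged outline rather than the project's notebook: the helper names and all file paths are hypothetical, the use of `--outtype bf16` for the mmproj export is an assumption based on the BF16 file name, and the exact argument order should be checked against the local llama.cpp checkout before running anything.

```python
import shlex


def convert_cmd(model_dir: str, outfile: str, mmproj: bool = False) -> list[str]:
    """Build argv for convert_hf_to_gguf.py (BF16 text GGUF, or the mmproj companion)."""
    cmd = ["python", "convert_hf_to_gguf.py", model_dir,
           "--outtype", "bf16", "--outfile", outfile]
    if mmproj:
        cmd.append("--mmproj")
    return cmd


def quantize_cmd(src: str, dst: str, low_ram: bool = False) -> list[str]:
    """Build argv for llama-quantize, optionally with the 16 GB RAM tensor override."""
    cmd = ["llama-quantize"]
    if low_ram:
        cmd += ["--tensor-type", "per_layer_token_embd.weight=q4_0"]
    cmd += [src, dst, "Q4_K_M"]
    return cmd


# Hypothetical paths, printed rather than executed.
print(shlex.join(convert_cmd("merged/", "solarhive-e4b-BF16.gguf")))
print(shlex.join(quantize_cmd("solarhive-e4b-BF16.gguf", "solarhive-e4b-Q4_K_M.gguf", low_ram=True)))
```

Building the commands as lists keeps them safe to pass to `subprocess.run` without shell quoting issues.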

Community Model

Parameter	Value
Location	Ann Arbor, Michigan (42.2808°N, 83.7430°W)
Community size	12 homes
Total panel capacity	72 kW
Shared battery storage	100 kWh
Grid region	MISO (Midcontinent Independent System Operator)
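
For reference, the community parameters above fit in a small configuration record. The `CommunityModel` class and its field names are hypothetical (they are not taken from the SolarHive source); the derived per-home figures are simple divisions of the table values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunityModel:
    homes: int = 12
    panel_capacity_kw: float = 72.0
    battery_kwh: float = 100.0
    latitude: float = 42.2808
    longitude: float = -83.7430  # west longitude expressed as a negative value
    grid_region: str = "MISO"

    @property
    def kw_per_home(self) -> float:
        """Average panel capacity per home."""
        return self.panel_capacity_kw / self.homes

    @property
    def battery_kwh_per_home(self) -> float:
        """Average share of the shared battery per home."""
        return self.battery_kwh / self.homes


community = CommunityModel()
print(community.kw_per_home, round(community.battery_kwh_per_home, 2))
```

Holding the capacity in one place is one way to avoid hard-coding figures such as the 72 kW total in prompts.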

Runtime Performance

CPU-only inference on Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Ollama 0.21.0:

Phase	Time	Speed	Notes
First query (cold)	~65 s	~2.2 tok/s	Includes ~55–60 s model load
Warm advisory query	~10 s	~9–10 tok/s	Single forward pass, no tools
Warm tool-calling loop	25–60 s	~9–10 tok/s	2–3 rounds with live API latency

GGUF blob ingestion (one-time, after ollama create): ~3–5 min for the 5.34 GB variant on a typical laptop SSD.

Technical Notes

PLE tensor override. The Q4_K_M mixed-precision strategy assigns Q6_K to per_layer_token_embd.weight [10752, 262144] (2.82 B params, the largest single tensor in the model). The float32 intermediate buffer for Q6_K conversion (~10.7 GB) causes llama-quantize to run out of memory on 16 GB RAM. The --tensor-type per_layer_token_embd.weight=q4_0 override eliminates the buffer; project-held-out scores indicate, but do not prove, that the override is quality-safe.
One mmproj for any text quant. The mmproj companion file is independent of text-model quantization — llama-quantize only sees the text tower. The same 992 MB mmproj pairs with the shipped Q4_K_M or any future Q5/Q6/Q8 quant.
Ollama /api/chat content-drop issue. Ollama 0.21.0's gemma4.go:306 parser detects but rejects fine-tuned Gemma 4's native tool-call format (bare keys + <|"|> delimiters). Use the /api/generate raw mode + manual prompt builder path (Path B in "Inference Path Selection" above) for tool-calling workloads.
Ollama multimodal support is evolving. Ollama 0.21.0's Modelfile syntax for Gemma 4 mmproj projector declaration is not yet first-class. For text + image / text + audio today, use llama-server --mmproj from llama.cpp b8863+ directly.
Sampling defaults. temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended Gemma 4 defaults). Set in both Modelfile and Modelfile.standard.
Context window. Modelfiles set num_ctx=4096. The base architecture supports up to 128 K; raise num_ctx for longer multi-round agentic loops at the cost of more RAM at inference.
No Unsloth dependency at inference. Once quantized, the GGUF files run via stock Ollama or llama.cpp. Unsloth was used only during fine-tuning.
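
The size of the PLE tensor in the first note can be checked with plain arithmetic. The sketch below only multiplies the published shape by 4 bytes per float32 value; it does not model how llama-quantize actually allocates its buffers, so it lands in the same ballpark as the ~10.7 GB quoted above rather than reproducing it exactly.

```python
# Shape of per_layer_token_embd.weight as given in the PLE note.
rows, cols = 10752, 262144

params = rows * cols
f32_bytes = params * 4  # 4 bytes per float32 value

print(f"{params / 1e9:.2f} B params")
print(f"{f32_bytes / 1e9:.2f} GB ({f32_bytes / 2**30:.2f} GiB) as float32")
```

A single buffer of this size, on top of the model already held in memory, is what makes the step tight on a 16 GB machine.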

Limitations

Prototype tested on a single community model (12 homes, Ann Arbor, MI). Real-world deployment requires validation across diverse geographies and community sizes.
The OpenAI-compatible /api/chat path is not the demo path — see "Inference Path Selection" above for the gemma4.go content-drop reasoning. Use the /api/generate raw mode + manual prompt builder path for production inference.
Image and audio modalities require llama-server --mmproj; Ollama-native multimodal recipe is pending upstream Ollama support.
The model occasionally uses "60 kW" instead of the correct 72 kW community capacity in direct (no-tool) responses. This is a base-model tendency; the tool-calling path avoids it by querying the actual capacity.
Tool responses depend on external API availability. Open-Meteo and EIA have rate limits; OpenWeatherMap free tier allows 1,000 calls/day.
The battery state is currently a deterministic simulator (get_battery_state() in solarhive_inference.py) — real deployment requires integration with actual battery management systems.
The shipped GGUF uses the standard Q4_K_M (Q6_K-PLE) recipe and scores 10/10 on the single-pass SolarHive project-held-out 10-prompt check. Quantizing on 16 GB hardware via the per_layer_token_embd.weight=q4_0 override trades a small quality margin on one tensor for laptop-class quantization — validated quality-safe in development, but a reproducibility recipe rather than the shipped artifact.
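
The /api/generate raw mode path recommended above can be sketched as a request body. This is a minimal outline, not the project's client: the `build_generate_payload` helper and the model tag are hypothetical, the prompt is assumed to come from the manual Gemma 4 prompt builder (its template is deliberately not reproduced here), and the sampling values are the defaults listed under Technical Notes.

```python
import json


def build_generate_payload(model: str, prompt: str, num_ctx: int = 4096) -> dict:
    """Build a JSON body for Ollama's /api/generate in raw mode (no server-side template)."""
    return {
        "model": model,
        "prompt": prompt,   # already formatted by the caller's prompt builder
        "raw": True,        # bypass Ollama's chat template and tool-call parser
        "stream": False,
        "options": {
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 64,
            "num_ctx": num_ctx,
        },
    }


# Hypothetical model tag and placeholder prompt; nothing is sent over the network here.
payload = build_generate_payload("solarhive-e4b", "<prompt from the manual prompt builder>")
print(json.dumps(payload, indent=2))
```

The body would be POSTed to the local Ollama server (by default `http://localhost:11434/api/generate`), and the raw completion text then goes through the caller's own tool-call parser.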

Future Iteration — Multi-Token Prediction (MTP) Drafters on Edge GGUF Runtimes

Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's project-held-out check was captured. The numbers above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.

Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, Fast Inference from Transformers via Speculative Decoding, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under sampling as well as argmax (greedy) decoding: a smaller drafter proposes γ candidate tokens, the target scores all γ in a single parallel forward pass, accepted tokens are kept, and the first rejected token is replaced by a sample from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only the acceptance rate α, and therefore the walltime speedup, varies.
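
The accept/reject rule behind that guarantee can be sketched for a single drafted token. This toy version works on explicit probability lists rather than model logits; `p` (target distribution), `q` (drafter distribution), and the helper names are illustrative, and the full algorithm repeats this check across all γ drafted positions.

```python
import random


def residual(p: list[float], q: list[float]) -> list[float]:
    """Normalized max(0, p - q): the corrected distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # If p equals q everywhere, a rejection cannot occur; fall back to p.
    return [d / total for d in diff] if total > 0 else list(p)


def speculative_step(p: list[float], q: list[float], rng: random.Random) -> tuple[int, bool]:
    """Draft a token from q, accept with probability min(1, p/q), else resample."""
    vocab = range(len(p))
    x = rng.choices(vocab, weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    return rng.choices(vocab, weights=residual(p, q))[0], False


# Toy distributions over a 3-token vocabulary.
p = [0.6, 0.3, 0.1]
q = [0.2, 0.5, 0.3]
rng = random.Random(0)
samples = [speculative_step(p, q, rng) for _ in range(20000)]
freq = [sum(1 for tok, _ in samples if tok == i) / len(samples) for i in range(3)]
alpha = sum(1 for _, ok in samples if ok) / len(samples)
print(freq, alpha)
```

The empirical token frequencies should track `p` even though drafts come from `q`, and the measured acceptance rate should be close to the sum of min(p, q) over the vocabulary (0.6 for these toy values).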

Released drafter for E4B. google/gemma-4-E4B-it-assistant (~78.8 M params) is the canonical pair for google/gemma-4-E4B-it. Per the MTP overview, the drafter shares the input embedding table with the target and consumes the target's last-layer activations. Google reports up to 3× decode speedup on the 26B-A4B configuration; per-variant E4B numbers were not enumerated in the announcement.

Runtime support is partial for GGUF deployments.

Runtime	Listed by Google?	Source
Ollama	✅ Tested-runtime list	Google blog
llama.cpp	⚠️ Appears in docs runtime nav but not in the blog's tested-runtime list	Docs nav
LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang	✅ Tested-runtime list	Google blog

Implementation paths on this edge GGUF tier (post-hackathon):

Drafter GGUF conversion. Google ships the drafter as HF safetensors. To use against this Q4_K_M target via Ollama or llama.cpp, the drafter weights would need conversion through convert_hf_to_gguf.py — feasible reuse of the same toolchain that produced this target's GGUF, but the conversion is not in the canonical SolarHive registry today.
llama.cpp speculative decoding. llama-speculative and llama-server --model-draft support vanilla speculative decoding per the 2023 paper. Whether Gemma 4 MTP drafters' embedding-sharing + last-layer-activation conditioning architecture maps cleanly to llama.cpp's existing --model-draft plumbing is unverified: Google's docs list the runtime, but the blog omits it from the tested set.
Ollama paired drafter. Ollama is in Google's tested-runtime list; the exact CLI/API surface for drafter pairing is not yet documented in Ollama's public docs as of writing.

Planned measurement (post-hackathon). (a) Convert google/gemma-4-E4B-it-assistant → Q4_K_M GGUF via convert_hf_to_gguf.py + llama-quantize. (b) Re-run the parity check with the drafter paired via llama.cpp's --model-draft flag. (c) Capture acceptance rate α + decode-tps + walltime. (d) Cross-check against the Ollama paired-drafter API once documented. Correctness is invariant per the 2023 speculative-sampling guarantee; only α varies under target × drafter distribution mismatch.
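
For step (c), the measured acceptance rate can be turned into an expected yield per target forward pass using the closed form from the speculative decoding paper, (1 − α^(γ+1)) / (1 − α). The sketch below assumes acceptances are independent with a constant rate α, which is the paper's simplifying assumption; the function names and the example counts are hypothetical, not measured values.

```python
def acceptance_rate(accepted: int, proposed: int) -> float:
    """Fraction of drafted tokens that the target accepted."""
    if proposed <= 0:
        raise ValueError("proposed must be positive")
    return accepted / proposed


def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens produced per target pass for acceptance rate alpha and draft length gamma."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one token from the target
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


# Hypothetical counts for illustration only.
alpha = acceptance_rate(accepted=800, proposed=1000)
print(alpha, expected_tokens_per_pass(alpha, gamma=4))
```

Actual walltime speedup also depends on the drafter's own cost per token, so this figure is an upper-level estimate of yield rather than a predicted decode-tps.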

Companion Repositories

Model	Repository	Purpose
SolarHive 26B A4B LoRA	solarhive-26b-a4b-lora	Cloud inference with full multimodal + function calling (LoRA adapters + Unsloth)
SolarHive 26B A4B NF4	solarhive-26b-a4b-nf4	Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs
SolarHive E4B LoRA	solarhive-e4b-lora	E4B adapter weights (~200 MB) — apply over base via Unsloth
SolarHive E4B Safetensors	solarhive-e4b-ollama	Merged safetensors for transformers-native multimodal research use
SolarHive E4B GGUF	This repo	Edge deployment — 2 text quants + 1 mmproj for Ollama / llama.cpp
SolarHive Dataset	solarhive-community-solar-multimodal	1,727 training examples (1,713 text + 14 image-grounded)
LiteRT-LM Python edge runtime	`solarhive_e4b_litert_v3.1.ipynb`	LiteRT Special Tech Track entry — runs upstream base `litert-community/gemma-4-E4B-it-litert-lm` `.litertlm` (3.66 GB) + SolarHive UX layer + on-device agentic loop. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream `gemma4` example module lands in `ai_edge_torch.generative.examples/`.
GitHub (source)	the-gemma4-good-hackathon-solarhive	Full source code, training notebooks, `solarhive_inference.py` (cloud), `solarhive_inference_e4b_gguf_ollama.py` (local laptop)

Citation

@misc{solarhive2026,
  title={SolarHive: AI-Powered Community Solar Energy Intelligence},
  author={Youshen Lim},
  year={2026},
  url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
  note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}

Papers for Truthseeker87/solarhive-e4b-gguf

When2Call: When (not) to Call Tools (arXiv:2504.18851, published Apr 26, 2025)
Fast Inference from Transformers via Speculative Decoding (arXiv:2211.17192, published Nov 30, 2022)

