- SolarHive E4B GGUF — Edge Solar Energy Intelligence
- Model Overview
- Files in This Repository
- Quantization Variants
- Inference Path Selection
- Benchmark Results
- How to Use
- Core Capabilities
- Training Details
- GGUF Production Pipeline
- Community Model
- Runtime Performance
- Technical Notes
- Limitations
- Future Iteration — Multi-Token Prediction (MTP) Drafters on Edge GGUF Runtimes
- Companion Repositories
- Citation
- Links
SolarHive E4B GGUF — Edge Solar Energy Intelligence
Fine-tuned Gemma 4 E4B (8B) in GGUF format for Ollama and llama.cpp edge deployment. Two text-model quantization variants plus one multimodal projector companion file. 10/10 + 2/3 When2Call on the local-laptop benchmark — matching the cloud E4B LoRA baseline as the joint best-in-class across all six measured deployment variants. Validated end-to-end on a CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7, 16 GB RAM) with the GGUF and Ollama blob cache stored on an external USB drive.
Built for the Gemma 4 Good Hackathon (Google DeepMind × Kaggle).
| Property | Value |
|---|---|
| Base Model | google/gemma-4-e4b-it |
| Architecture | Dense + PLE — 8B total, 4.5B effective |
| Quantization | Q4_K_M (two variants — see below) |
| Files | 5.34 GB text (PLE-Q4_0) · 5.34 GB text (Standard Q6_K PLE) · 992 MB mmproj |
| Total footprint | 6.3 GB (text + mmproj) |
| Modalities | Text, Image, Audio (via mmproj companion) |
| Function Calling | Native Gemma 4 protocol |
| Benchmark | 10/10 parity (5/5 Q&A + 5/5 tool routing) + 2/3 When2Call — joint best in the 6-variant Multi-Variant Deployment Validation table |
| Local deployment | CPU-only Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 16 GB RAM, Intel Iris Xe unused), Ollama 0.21.0 with llama.cpp ggml-cpu-icelake.dll (AVX2 + AVX512 + VNNI). External USB drive holds the 5.3 GB GGUF + Ollama blob cache. |
| Fine-Tuning | LoRA via Unsloth (BF16) |
| Training Data | 1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~150M params for E4B), unmodified by our LoRA per the Vertex AI SFT recipe. The 992 MB mmproj-solarhive-e4b-BF16.gguf companion file packages this base vision encoder + audio encoder for llama-server --mmproj. |
| Converged Loss | 0.9218 |
| Training Time | 420 seconds on RTX PRO 6000 Blackwell |
| License | MIT (adapters) / Gemma Terms (base model) |
Model Overview
SolarHive is an AI energy advisor for community solar microgrids. It helps suburban neighborhoods collectively optimize distributed solar generation and shared battery storage through natural language conversation, visual inspection, and live data integration.
This is the edge deployment artifact. The 5.34 GB text GGUF + 992 MB multimodal projector run on a 16 GB laptop CPU with no GPU, no cloud dependency, and no internet requirement. Companion to SolarHive 26B A4B LoRA (cloud inference with full multimodal VQA) and SolarHive E4B Ollama (merged safetensors for transformers research).
Privacy-first edge deployment. Community energy data never leaves the neighborhood. A village in rural India, a suburb in Michigan, and a coastal town recovering from a hurricane all get the same intelligence with no cloud round-trips.
Files in This Repository
| File | Size | Use |
|---|---|---|
| solarhive-e4b-q4_k_m.gguf | 5.34 GB | Text + tool calling (PLE-Q4_0 quantization recipe) |
| solarhive-e4b-q4_k_m-standard.gguf | 5.34 GB | Text + tool calling (Standard Q6_K-PLE quantization recipe; llama.cpp default Q4_K_M preset) |
| mmproj-solarhive-e4b-BF16.gguf | 992 MB | Vision (SigLIP, 658 tensors) + audio (Conformer, 751 tensors) projector; pairs with either text variant |
| Modelfile | n/a | Ollama recipe pointing at the PLE-Q4_0 text variant |
| Modelfile.standard | n/a | Ollama recipe pointing at the Standard text variant |
Quantization Variants
Both text variants are 5.34 GB Q4_K_M quantizations of the same 720 transformer tensors. They were produced via two different quantization recipes — preserved as separate files for reproducibility / methodology transparency:
| Variant | Quantization recipe | Total size | Production hardware | Inference hardware |
|---|---|---|---|---|
| solarhive-e4b-q4_k_m.gguf | PLE-Q4_0 override (--tensor-type per_layer_token_embd.weight=q4_0) | 5.34 GB | Laptop (≥16 GB RAM); bypasses the Q6_K-PLE intermediate buffer | ≥7 GB RAM at 4K context |
| solarhive-e4b-q4_k_m-standard.gguf | Standard Q6_K-PLE (llama.cpp default Q4_K_M preset) | 5.34 GB | High-RAM cloud notebook (≥30 GB RAM); needs the ~10.7 GB float32 intermediate buffer | ≥7 GB RAM at 4K context |
Why preserve both recipes? The standard Q4_K_M mixed-precision strategy assigns Q6_K to the PLE tensor per_layer_token_embd.weight [10752, 262144] (2.82 B params, the largest single tensor in the model). Converting that tensor to Q6_K requires a ~10.7 GB float32 intermediate buffer during quantization, which runs out of memory on 16 GB hardware (bad allocation at tensor 4/720). The PLE-Q4_0 recipe uses --tensor-type per_layer_token_embd.weight=q4_0 to bypass the buffer, enabling laptop-class quantization. Both recipes are documented for transparency about the methodological tradeoff.
Are they interchangeable? Yes — both load and run identically on a 16 GB laptop at inference time. Quality benchmarks below confirm parity. Pick by quantization-time hardware constraint; the inference-time behavior is the same.
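As a rough check on the figures above, the size of the PLE tensor and of a float32 copy of it can be estimated from the stated shape alone. This is a back-of-envelope sketch: it assumes 4 bytes per float32 element and ignores allocator overhead, so it only approximates the ~10.7 GB buffer quoted above.

```python
# Back-of-envelope estimate for per_layer_token_embd.weight [10752, 262144].
# Assumption: the intermediate buffer holds one float32 (4-byte) copy of the tensor.
rows, cols = 10752, 262144
n_params = rows * cols
f32_bytes = n_params * 4

print(f"parameters:   {n_params / 1e9:.2f} B")
print(f"float32 copy: {f32_bytes / 1e9:.2f} GB ({f32_bytes / 2**30:.2f} GiB)")
```

The estimate lands between roughly 10.5 and 11.3 depending on whether binary or decimal gigabytes are used, which is the same order as the quoted buffer size and well beyond what a 16 GB machine can spare once the rest of the quantizer's working set is loaded.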
One mmproj, multiple text-GGUF variants
The mmproj-solarhive-e4b-BF16.gguf companion is independent of the text model's quantization. llama-quantize operates exclusively on the text tower — vision and audio tensors were split out into the mmproj file during convert_hf_to_gguf.py --mmproj. At inference time, llama-server --model X --mmproj Y pairs them on demand. One mmproj serves any current or future text variant (Q4/Q5/Q6/Q8) with no per-variant duplication.
Inference Path Selection
This repo ships alongside solarhive_inference_e4b_gguf_ollama.py (in the GitHub repo), which implements two inference approaches against Ollama. The recommended demo path is the /api/generate raw mode + manual prompt builder approach. Both are documented for transparency and reproducibility.
Path A — OpenAI-compatible /api/chat + Ollama's built-in gemma4.go parser
Definition. Send messages and OpenAI-style tool schemas to Ollama's standard /api/chat endpoint. Ollama internally renders the chat template, parses model output, and extracts tool calls via its dedicated Gemma 4 parser (gemma4.go). The simplest path — closest to "drop-in replacement" for any OpenAI-compatible client.
import requests, json
resp = requests.post("http://127.0.0.1:11434/api/chat", json={
"model": "solarhive",
"messages": [{"role": "user", "content": "What's the current battery state?"}],
"tools": [{"type": "function", "function": {...}}],
"stream": False,
"options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})
Rationale. This is the path most developers will reach for first. OpenAI-compatible JSON, no manual prompt construction, leverages Ollama's first-party Gemma 4 support.
Why it does not score 10/10 in our benchmark. Ollama 0.21.0's gemma4.go:306 parser detects the model's native call:fn{} output but rejects it because the arguments use bare unquoted keys and <|"|> string delimiters (the Gemma 4 native format the model was fine-tuned on) instead of strict JSON. Server log evidence:
level=WARN source=gemma4.go:306 msg="gemma4 tool call parsing failed"
error="invalid character '\'' looking for beginning of object key string
repair failed to produce valid JSON arguments"
content="call:get_status{battery:{level:65,pct:77,kwh:77,...,dispatched:<|\"|>OK<|\"|>},...}"
The repair logic also fails to recover valid JSON. As a result, tool calls are silently dropped and the returned content is empty, so scoring on the 10-prompt benchmark is inconsistent and capped by this upstream issue. Path A is not viable for the demo without an upstream Ollama patch.
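The rejection can be illustrated in isolation. The sketch below is not Ollama's parser; it only uses Python's standard json module and a made-up argument string in the same style as the log above to show why bare keys and <|"|> delimiters fail strict JSON decoding. The helper name is_strict_json is hypothetical.

```python
import json

# Hypothetical argument payload in the Gemma 4 native style seen in the log:
# bare (unquoted) keys and <|"|> used as the string delimiter.
native_args = '{battery:{level:65,status:<|"|>OK<|"|>}}'


def is_strict_json(text: str) -> bool:
    """Return True only if `text` parses as strict JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(is_strict_json(native_args))                                   # False
print(is_strict_json('{"battery": {"level": 65, "status": "OK"}}'))  # True
```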
Path B — /api/generate raw mode + manual Gemma 4 prompt builder + native parser (recommended)
Definition. Bypass gemma4.go entirely. Build the Gemma 4 native prompt manually from chat_template.jinja, send via Ollama's /api/generate endpoint with raw: True (no template rendering by Ollama), parse the raw output ourselves using regex matching tokenizer_config.json's response_schema.
prompt = build_gemma4_prompt(messages, tools) # byte-identical to apply_chat_template
resp = requests.post("http://127.0.0.1:11434/api/generate", json={
"model": "solarhive",
"prompt": prompt,
"raw": True,
"stream": False,
"options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})
content, tool_calls = parse_gemma4_output(resp.json()["response"])
Rationale. We control both endpoints of the prompt cycle:
- Prompt structure matches what the model was fine-tuned on, exactly. build_gemma4_prompt produces byte-identical output to apply_chat_template(messages, tools, enable_thinking=False, add_generation_prompt=True), the same call used by the cloud 26B A4B path.
- Output parser matches the exact regex from tokenizer_config.json response_schema: r"<\|tool_call>(.*?)<tool_call\|>" for tool blocks, r"call:(\w+)\{(.*)\}$" for argument extraction.
- Tool responses are fed back as <|turn>tool\n{json}<turn|>, which matches what the training pipeline rendered from the {role:"tool", content:"{...}"} OpenAI-style messages in the training data.
Result. Score on the 10-prompt benchmark: 10/10 on both text variants + 2/3 on the When2Call probes. Matches the cloud E4B LoRA baseline as the joint best across all 6 measured variants in the Multi-Variant Deployment Validation table below.
Trade-off. ~30 lines of additional Python in solarhive_inference_e4b_gguf_ollama.py to build the prompt and parse the output. Worth it to bypass the upstream gemma4.go parser issue.
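To give a sense of what the output-parsing half involves, here is a minimal sketch built around the two regexes quoted above. It is not the reference implementation (that lives in solarhive_inference_e4b_gguf_ollama.py); the function name parse_gemma4_output_sketch, the re.DOTALL flag, and the decision to return the argument body as an unparsed string are all assumptions made for illustration.

```python
import re

# Regexes quoted in the Path B rationale; re.DOTALL is an assumption so that
# multi-line tool blocks also match.
TOOL_BLOCK_RE = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)
CALL_RE = re.compile(r"call:(\w+)\{(.*)\}$", re.DOTALL)


def parse_gemma4_output_sketch(raw: str):
    """Split raw model output into (plain-text content, list of tool calls)."""
    tool_calls = []
    for block in TOOL_BLOCK_RE.findall(raw):
        match = CALL_RE.search(block.strip())
        if match:
            # The argument body is kept as the raw native string; converting it
            # to a dict is left to the real parser.
            tool_calls.append({"name": match.group(1), "raw_arguments": match.group(2)})
    content = TOOL_BLOCK_RE.sub("", raw).strip()
    return content, tool_calls


sample = '<|tool_call>call:get_weather{location:<|"|>Ann Arbor<|"|>}<tool_call|>'
print(parse_gemma4_output_sketch(sample))
```

Because the tool block is stripped from the content, a response that consists only of a tool call yields an empty content string, which is the signal to execute the tool and continue the loop.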
Recommended path
For the SolarHive Ollama + llama.cpp demo we use Path B (/api/generate raw mode + manual prompt builder). The solarhive_inference_e4b_gguf_ollama.py script in the GitHub repo provides a complete reference implementation of build_gemma4_prompt and parse_gemma4_output. Set OLLAMA_MODEL=solarhive (PLE-Q4_0) or OLLAMA_MODEL=solarhive-standard (Standard) and run the script — both score 10/10.
For multimodal use cases (vision, audio), use llama-server --mmproj from llama.cpp directly; see "How to Use" below. Ollama 0.21.0's Modelfile syntax for declaring a Gemma 4 multimodal projector is still evolving, so first-class Ollama vision support for Gemma 4 is expected only in a future release.
Path C — Interactive chat via Unsloth Studio
For community users who want to chat with the SolarHive E4B GGUF without writing a Modelfile or running the HTTP harness by hand, Unsloth Studio provides an open-source no-code local web UI:
# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888
# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex
unsloth studio -H 0.0.0.0 -p 8888
Studio supports running GGUF + safetensor models locally with self-healing tool calling and code execution, model arena (side-by-side comparison), and saves to GGUF / 16-bit safetensor formats. Use Studio for interactive exploration; use Path B (/api/generate raw mode + solarhive_inference_e4b_gguf_ollama.py) for benchmark-grade evaluation — Studio doesn't expose the per-question evaluation harness needed to reproduce the 10/10 SolarHive parity score.
Benchmark Results
10-prompt parity benchmark — 5 domain Q&A (no tool expected) + 5 tool-calling — plus 3 When2Call probes. Run via the /api/generate raw mode + manual Gemma 4 prompt builder path (Path B above), on Ollama 0.21.0, CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Intel Iris Xe unused), with the 5.3 GB GGUF + Ollama blob cache stored on an external USB drive. Same 10 prompts and same 3 When2Call probes as the cloud benchmarks for direct cross-variant comparison.
Domain Q&A (5/5 on both variants)
| Question | Expected behavior | PLE-Q4_0 | Standard |
|---|---|---|---|
| Solar production when humidity exceeds 80%? | Direct answer, no tool call | ✅ | ✅ |
| At what battery SOC should we stop exporting to the grid? | Direct answer, no tool call | ✅ | ✅ |
| Home #3 underperforming 22% — diagnostic checklist? | Direct answer, no tool call | ✅ | ✅ |
| Winter snow on panels — prioritize actions? | Direct answer, no tool call | ✅ | ✅ |
| Grid frequency 59.8 Hz — microgrid implications? | Direct answer, no tool call | ✅ | ✅ |
Tool Calling (5/5 on both variants)
| Question | Expected tool | PLE-Q4_0 | Standard |
|---|---|---|---|
| What's the current battery state? | get_battery_state | ✅ fired + synthesized | ✅ fired + synthesized |
| Current weather and how does it affect solar production? | get_weather | ✅ fired + synthesized | ✅ fired + synthesized |
| What are general maintenance tips for panels? | None (no tool needed) | ✅ correctly no tool call | ✅ correctly no tool call |
| What is the weather expected to be like this week? | get_weather | ✅ fired + synthesized | ✅ fired + synthesized |
| How should we plan energy consumption and storage given the weather forecast? | get_weather (+ get_grid_status) | ✅ fired + synthesized | ✅ fired + synthesized |
Quantization-precision independence confirmed. The PLE-Q4_0 override is quality-safe — no measurable regression on the 10-prompt benchmark vs the standard Q6_K-PLE. Both variants are interchangeable shipping artifacts; pick by quantization-time hardware constraint.
When2Call probes (3 of 4 categories per Ross et al. 2025)
Three held-out probes from When2Call: When (not) to Call Tools cover 3 of the 4 failure-mode categories the paper documents (the paper documents 9–67% tool-hallucination rates on (c) and (d) in untrained community models):
| Category | Question | Expected behavior | Result on this GGUF |
|---|---|---|---|
| (b) Well-specified, in-scope | "What's the current grid rate?" | Call get_grid_status | PASS: called get_grid_status |
| (c) Under-specified | "How much will a 10 kW array produce today?" | Follow-up question (does NOT auto-fill location default) | PASS: asked for current weather conditions |
| (d) Out-of-scope | "What's the current air quality index in Ann Arbor?" | Refusal + redirect (does NOT hallucinate a tool) | FAIL: called get_weather (a known E4B-family failure mode also seen on the cloud E4B merged variant; the larger A4B family passes this probe and scores 3/3 overall) |
Headline: 2/3 nominal, the same profile as the cloud E4B merged variant (transformers BF16). On these probes, GGUF Q4_K_M quantization shows no measured loss at the When2Call refusal/follow-up decision boundary within the E4B family. The 1/3 W2C gap relative to the A4B family (2/3 vs. 3/3) persists across runtimes (cloud transformers and local Ollama produce identical W2C scores within each family), which suggests it is a model-size signature rather than a precision artifact.
End-to-end agentic loop probe
A multi-tool community-energy-audit query is run through the full agentic loop (extract → execute → feed back, max 3 rounds), mirroring the cloud notebook's §13g cell. Headline trace:
"Full community energy audit — check current weather, solar production, battery state, and grid pricing. Give a 3-sentence status report."
- Rounds completed: 2
- Tools executed: get_weather, get_solar_production, get_battery_state, get_grid_status (4 of 5 tools called in parallel in round 1)
- Final answer (excerpt): "Community Energy Status Report (Midday): Partly cloudy with 30% cloud cover and 72°F. Array is producing 56% of capacity at 40.4 kW. Battery is at 72% SOC and actively charging from surplus. Grid pricing is Peak rates ($0.28/kWh). Status: Excellent. Maximize self-consumption by running heavy loads now…"
All four tool results are correctly synthesized into a coherent status report — confirming the GGUF Q4_K_M quantization preserves end-to-end agentic reasoning quality, not just single-shot tool routing.
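The extract → execute → feed back cycle can be sketched independently of any model. The example below is an illustration under stated assumptions, not the notebook's §13g cell: run_agentic_loop, fake_generate, and fake_tools are hypothetical names, the model is replaced by a stub, and the battery value is synthetic.

```python
import json


def run_agentic_loop(generate, tools, messages, max_rounds=3):
    """Extract tool calls, execute them, feed results back; stop on a text answer.

    `generate(messages)` is a hypothetical callable returning (content, tool_calls),
    where each tool call is {"name": str, "arguments": dict}.
    """
    executed = []
    for round_no in range(1, max_rounds + 1):
        content, tool_calls = generate(messages)
        if not tool_calls:  # a plain-text answer ends the loop
            return content, executed, round_no
        messages.append({"role": "assistant", "tool_calls": tool_calls})
        for call in tool_calls:  # execute every tool requested in this round
            result = tools[call["name"]](**call["arguments"])
            executed.append(call["name"])
            messages.append(
                {"role": "tool", "name": call["name"], "content": json.dumps(result)}
            )
    return None, executed, max_rounds  # round budget used up without a final answer


# Stub model and tool (synthetic values, for illustration only).
def fake_generate(messages):
    if not any(m["role"] == "tool" for m in messages):
        return "", [{"name": "get_battery_state", "arguments": {}}]
    return "Battery report ready.", []


fake_tools = {"get_battery_state": lambda: {"soc_pct": 72}}

answer, executed, rounds = run_agentic_loop(
    fake_generate, fake_tools, [{"role": "user", "content": "Battery status?"}]
)
print(answer, executed, rounds)
```

With the stub, round 1 requests a tool and round 2 produces the text answer, mirroring the two-round shape of the trace above.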
Multi-Variant Deployment Validation — All 6 variants now measured
Cross-variant table covering all 5 cloud transformers variants (validated on Colab Pro G4 with NVIDIA RTX PRO 6000 Blackwell) plus this GGUF variant (validated on the CPU-only Surface Pro 8 reference hardware described above):
| Variant | Q&A | Tool | W2C | Total | Backend | Hardware |
|---|---|---|---|---|---|---|
| a4b_lora (baseline) | 5/5 | 4/5 | 3/3 | 9/10 | transformers + Unsloth | Colab Pro G4 GPU |
| e4b_lora | 5/5 | 5/5 | 2/3 | 10/10 | transformers + Unsloth | Colab Pro G4 GPU |
| e4b_merged | 5/5 | 4/5 | 2/3 | 9/10 | transformers BF16 | Colab Pro G4 GPU |
| a4b_merged | 5/5 | 4/5 | 3/3 | 9/10 | transformers BF16 | Colab Pro G4 GPU |
| a4b_nf4 | 5/5 | 4/5 | 3/3 | 9/10 | transformers NF4 (BnB) | Colab Pro G4 GPU |
| e4b_gguf (this artifact) | 5/5 | 5/5 | 2/3 | 10/10 | Ollama HTTP raw + manual prompt builder | CPU-only Surface Pro 8 (i5-1135G7, 16 GB RAM) |
Key empirical findings:
- Joint best variant in the table: this GGUF ties the cloud E4B LoRA baseline at 10/10 + 2/3 W2C, despite running on a 4-year-old consumer laptop with no GPU rather than the cloud RTX PRO 6000 Blackwell GPU used for the other variants.
- GGUF Q4_K_M quantization shows no measured loss within the E4B family. Tool routing 5/5 + W2C 2/3 exactly matches the BF16 LoRA, with the same (b)+(c) PASS, (d) FAIL profile. On these benchmarks, the 5.3 GB CPU-only deployment produces identical decisions to the BF16 + GPU pipeline.
- The 1/3 W2C gap relative to the A4B family looks like a model-size signature, not a runtime artifact: it is reproduced across both cloud transformers and local Ollama runtimes within each family.
Reproduce locally:
$env:OLLAMA_HOST = 'http://localhost:11434'
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short
The benchmark run auto-writes a markdown summary to archive/ollama_local_e4b_gguf_results_YYYYMMDD_HHMMSS.md with the full per-question / per-probe trace.
How to Use
Path 1 — Text + tool calling via Ollama (10/10 demo path)
# Download the PLE-Q4_0 text variant + Modelfile + LICENSE
hf download Truthseeker87/solarhive-e4b-gguf \
solarhive-e4b-q4_k_m.gguf Modelfile LICENSE \
--local-dir ./solarhive-gguf
cd ./solarhive-gguf
# Register with Ollama (Modelfile uses ./solarhive-e4b-q4_k_m.gguf relative path)
ollama create solarhive -f Modelfile
# Quick sanity check
ollama run solarhive "What is the current solar production for our community?"
# Full 10/10 + 2/3 W2C agentic benchmark — uses /api/generate raw mode + manual prompt builder
git clone https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive
cd the-gemma4-good-hackathon-solarhive
# PowerShell syntax; in bash/zsh use: export OLLAMA_MODEL=solarhive
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short
To use the standard variant instead, swap solarhive-e4b-q4_k_m.gguf and Modelfile for solarhive-e4b-q4_k_m-standard.gguf and Modelfile.standard, then ollama create solarhive-standard -f Modelfile.standard and OLLAMA_MODEL=solarhive-standard.
Path 2 — Text + image + audio via llama.cpp llama-server
# Download a text variant + the mmproj companion
hf download Truthseeker87/solarhive-e4b-gguf \
solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
--local-dir ./solarhive-gguf
# Start llama-server with the mmproj
llama-server \
--model ./solarhive-gguf/solarhive-e4b-q4_k_m.gguf \
--mmproj ./solarhive-gguf/mmproj-solarhive-e4b-BF16.gguf \
--temp 1.0 --top-p 0.95 --top-k 64 \
--port 8080
# POST chat-completion requests with image_url to http://localhost:8080/v1/chat/completions
The mmproj bundles both vision (SigLIP, sky photos and panel inspection) and audio (Conformer, voice queries up to 30 s) — same recipe serves both modalities.
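A client for the server started above might look like the following sketch. It assumes llama-server accepts OpenAI-style image_url parts carrying a base64 data URI; build_vision_payload and post_chat are hypothetical helper names, and sky.jpg is a placeholder file.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes: bytes, question: str, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat payload with the image inlined as a base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": question},
                    {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
                ],
            }
        ]
    }


def post_chat(payload: dict, url: str = "http://localhost:8080/v1/chat/completions") -> dict:
    """POST the payload to a running llama-server and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


# Usage (needs the llama-server from the commands above to be running):
# with open("sky.jpg", "rb") as f:
#     reply = post_chat(build_vision_payload(f.read(), "Estimate the cloud cover in this photo."))
# print(reply["choices"][0]["message"]["content"])
```

Sampling parameters are left out of the payload on purpose, since the server command above already sets --temp, --top-p, and --top-k.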
Path 3 — Community microgrid hub on Jetson Orin Nano Super (llama.cpp + CUDA)
The same GGUF runs CUDA-accelerated on a Jetson Orin Nano Super Developer Kit ($249, 8 GB LPDDR5, 1024 CUDA cores, 7–25 W power envelope), making it a natural community-ownable, solar-powerable microgrid hub that serves a whole neighborhood. Build llama.cpp with CUDA per NVIDIA's official Gemma 4 Jetson recipe:
# On the Jetson (Ubuntu via JetPack SDK)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="87" \
-DGGML_NATIVE=ON \
-DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
# Download the same GGUFs as Path 1 / Path 2
hf download Truthseeker87/solarhive-e4b-gguf \
solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
--local-dir ~/models
# Run with full GPU offload (-ngl 99) for maximum throughput
./build/bin/llama-server \
--model ~/models/solarhive-e4b-q4_k_m.gguf \
--mmproj ~/models/mmproj-solarhive-e4b-BF16.gguf \
--temp 1.0 --top-p 0.95 --top-k 64 \
--port 8080 --host 0.0.0.0 \
-ngl 99 --flash-attn on \
--no-mmproj-offload --jinja -np 1
Why this matters for community deployment: at 7–25 W, a single Jetson Orin Nano Super can be powered directly from a small solar panel during the day and a modest battery overnight — the SolarHive intelligence runs on the same energy infrastructure it advises. Mobile clients (Path 4 below) hit the Jetson at http://<hub-ip>:8080 over the local network when tool-calling responses with live API data are needed.
Modelfile reference
Both Modelfiles use a relative path so the same file works regardless of where you cloned the repo:
FROM ./solarhive-e4b-q4_k_m.gguf
SYSTEM """You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan. Use the available tools to get real-time data before answering. Be specific, reference actual data, and keep responses concise (3-5 sentences)."""
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 4096
Core Capabilities
1. Multimodal Visual Question Answering (3 Modes)
Available via llama-server --mmproj (Path 2 above). Tensor inventory of mmproj-solarhive-e4b-BF16.gguf: 1411 tensors total — 658 vision (SigLIP), 751 audio (Conformer), 2 multimodal projectors (mm.input_projection.weight, mm.a.input_projection.weight).
| Mode | Input | Output |
|---|---|---|
| Sky Analysis | Sky photograph | Cloud coverage %, production forecast, storage recommendation |
| Panel Inspection | Panel photograph | Dirt/damage/shading detection, efficiency impact estimate |
| Neighborhood Assessment | Aerial/satellite image | Panel inventory, expansion priorities, shading analysis |
2. Native Function Calling (5 Tools — all 3 keyed APIs wired)
Available via either Path 1 (Ollama) or Path 2 (llama-server). Tool schemas and reference implementations are in solarhive_inference_e4b_gguf_ollama.py (project root). The §13g cell of solarhive_inference.py runs an end-to-end agentic-loop probe via Ollama HTTP raw mode using these same 5 tools.
| Tool | API | Returns |
|---|---|---|
| get_weather(location) | OpenWeatherMap (OWM_API_KEY) | Temperature, clouds %, wind, humidity, sunrise/sunset |
| get_solar_production(clouds_pct, temp_f) | Open-Meteo GHI (keyless) | Production kW, efficiency %, GHI W/m², temp derating |
| get_battery_state() | Community BMS (simulated) | State of charge, capacity, charging status |
| get_grid_status() | EIA Open Data (EIA_API_KEY) | Pricing period, rate/kWh, renewable %, CO2 intensity |
| get_nrel_pvwatts_baseline() | NREL PVWatts v8 (NREL_API_KEY) | Annual + current-month typical kWh + avg kW for the 72 kW array |
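For orientation, an OpenAI-style schema for the first tool in the table could look like the sketch below. The actual schemas ship in solarhive_inference_e4b_gguf_ollama.py; the description strings and the choice to mark location as required are assumptions made here for illustration.

```python
import json

# Hypothetical schema for get_weather(location); wording and the "required"
# list are illustrative, not copied from the project.
get_weather_schema = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a location: temperature, clouds %, "
                       "wind, humidity, sunrise/sunset.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
            },
            "required": ["location"],
        },
    },
}

print(json.dumps(get_weather_schema, indent=2))
```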
Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)} per call. The _build_gemma4_prompt() helper in solarhive_inference_e4b_gguf_ollama.py renders this format byte-identically — same as solarhive_datagen.py (training-data generation) and solarhive_finetune.py (SFT preprocessing). Inference matches training distribution exactly.
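The two-message sequence and the rendered tool turn can be sketched as follows. The helper names are hypothetical, the inner layout of each tool_calls entry is an assumption (the text above elides it), and the battery values are synthetic; only the {"role": "tool", ...} message shape and the <|turn>tool\n{json}<turn|> wrapper come from this card.

```python
import json


def tool_feedback_messages(name: str, arguments: dict, result: dict) -> list:
    """Return the assistant/tool message pair for a single executed call."""
    return [
        # Assumed entry layout; the card only specifies "tool_calls": [...].
        {"role": "assistant", "tool_calls": [
            {"type": "function", "function": {"name": name, "arguments": arguments}},
        ]},
        {"role": "tool", "name": name, "content": json.dumps(result)},
    ]


def render_tool_turn(tool_message: dict) -> str:
    """Wrap a tool message's JSON content in the tool-turn markers quoted in Path B."""
    return f"<|turn>tool\n{tool_message['content']}<turn|>"


pair = tool_feedback_messages("get_battery_state", {}, {"soc_pct": 72, "charging": True})
print(render_tool_turn(pair[1]))
```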
3. Selective Tool Reasoning
The model decides when to call tools — not blindly invoking all of them. Validated by the 5/5 tool-calling sub-benchmark above:
"What time does peak pricing start?"
→ Calls: get_grid_status() only
"Is today's production above typical for January?"
→ Calls: get_solar_production() + get_nrel_pvwatts_baseline()
"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()
"What are general maintenance tips for panels?"
→ Calls: none (answers from training knowledge)
4. Inference-time When2Call Validation (solarhive_inference.py §11b)
Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851. The paper documents 9–67% tool-hallucination rates on (c) and (d) in untrained community models because public tool-calling datasets typically lack follow-up and unable-to-answer examples:
- (b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
- (c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
- (d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)
A baseline community model trained without these categories typically fails (c) + (d) (per the paper's 9–67% hallucination rates on untrained models). With the _UNABLE_TO_ANSWER + _FOLLOW_UP_QUESTIONS corpus categories included in solarhive_datagen.py, the A4B family scores 3/3 and the E4B family scores 2/3 (passes (b) + (c), fails (d); see Benchmark Results above). The same When2Call probes run end-to-end against this GGUF artifact via the /api/generate raw mode path for edge-deployment validation.
Training Details
| Parameter | Value |
|---|---|
| Method | LoRA via Unsloth FastVisionModel (BF16, RTX PRO 6000 96 GB) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | All linear layers |
| Learning rate | 2e-4 |
| Optimizer | AdamW 8-bit |
| Warmup steps | 5 |
| Epochs | 3 |
| Max sequence length | 2048 |
| Precision | BF16 |
| Seed | 3407 |
| Trainable parameters | 41.2 M / 8.0 B (0.51%) |
Training Data — 1,727 Examples
Same canonical training corpus as the 26B A4B model — solarhive-community-solar-multimodal, 1,727 rows:
- 413 hand-crafted examples spanning 15+ US cities and 9 energy domains (sky conditions, battery management, panel health, consumption optimization, community/grid coordination, emergency resilience, seasonal planning, multi-step reasoning, alternative storage)
- ~1,117 API-grounded examples from live Open-Meteo (GHI/DNI/DHI, low/mid/high cloud cover), PVWatts, OpenWeatherMap, and EIA APIs; every numeric claim traces to a real API response, joined on `(location, hourly timestamp)` for cross-source coherence
- 183 tool-calling examples following the When2Call taxonomy: 106 should-call, 53 should-not-call, 10 unable-to-answer, 6 follow-up clarification, 8 failure-recovery
- 14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs, paired with the same temperature-derated GHI formula used in text rows
See the SolarHive Dataset for full documentation.
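The four slices and the tool-calling breakdown can be cross-checked with a few lines of arithmetic. The dictionary names below are illustrative, and the API-grounded count is treated as exactly 1,117 even though the list gives it as approximate.

```python
CORPUS = {
    "hand_crafted": 413,
    "api_grounded": 1_117,    # listed as "~1,117"; treated as exact here
    "tool_calling": 183,
    "image_grounded": 14,
}

TOOL_CALLING = {
    "should_call": 106,
    "should_not_call": 53,
    "unable_to_answer": 10,
    "follow_up": 6,
    "failure_recovery": 8,
}

total_rows = sum(CORPUS.values())
text_rows = total_rows - CORPUS["image_grounded"]

# Both checks hold for the figures quoted in this card.
assert total_rows == 1_727
assert sum(TOOL_CALLING.values()) == CORPUS["tool_calling"]
print(total_rows, text_rows)   # 1727 1713
```

The unable-to-answer and follow-up categories together account for 16 of the 183 tool-calling rows.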
Fine-tuning is text-only on the multimodal-capable corpus (image rows skipped at the data-prep layer). VQA at inference uses the base Gemma 4 E4B model's pretrained vision encoder (~150M params per the official model card). Our LoRA targets only the language-model linear layers (`target=all-linear`); the vision tower is unmodified, matching the Vertex AI Gemma 4 SFT recipe documented in the Hugging Face blog, which explicitly freezes both vision and audio towers during text-focused fine-tuning. The 992 MB `mmproj-solarhive-e4b-BF16.gguf` companion file packages this base vision encoder (658 SigLIP tensors) plus the base audio encoder (751 Conformer tensors) for `llama-server --mmproj`, giving the deployed GGUF full multimodal capability.
Training Loss
| Metric | Value |
|---|---|
| **Converged loss (last 20 steps)** | **0.9218** |
| Final step loss | 0.0635 |
| Minimum loss | 0.0635 |
| Total steps | 324 |
| Training time | 420 seconds |
Canonical metric: the bolded Converged loss (last 20 steps) is the only smoothed convergence indicator. Final step and Minimum are single-batch point statistics — mini-batch loss is noisy step-to-step, so one easy batch can drop a point estimate well below the rolling-average trend.
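To make the distinction concrete, the sketch below computes all three statistics from a per-step loss list. The loss values are synthetic and generated only for illustration; they are not the run's logged losses, and the function name is hypothetical.

```python
from statistics import fmean


def loss_summary(losses: list[float], window: int = 20) -> dict[str, float]:
    """Smoothed tail mean plus the two single-step point statistics."""
    tail = losses[-window:]
    return {
        "converged": fmean(tail),   # mean over the last `window` steps
        "final": losses[-1],        # single batch
        "minimum": min(losses),     # single batch
    }


# Synthetic series: a noisy plateau around 0.9 with one unusually easy final batch.
losses = [0.9 + 0.05 * ((i % 5) - 2) for i in range(323)] + [0.06]
summary = loss_summary(losses)
print(summary)
```

In this synthetic series the final and minimum values coincide at 0.06 while the 20-step mean stays above 0.8, the same pattern the table shows.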
Hardware
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
- Platform: Google Colab Pro (G4 VM)
- Precision: BF16 (no quantization during training)
GGUF Production Pipeline
The two text variants differ in where they were quantized and in the quantization type applied to the per-layer embedding (PLE) tensor:
| Step | PLE-Q4_0 (laptop) | Standard (Colab) |
|---|---|---|
| Source | merged safetensors from solarhive-e4b-ollama | same |
| Convert to BF16 GGUF | `convert_hf_to_gguf.py --outtype bf16` (~30 min, 14 GB) | same |
| Quantize text tower | `llama-quantize --tensor-type per_layer_token_embd.weight=q4_0 ... Q4_K_M` (~15 min, 4.61 GB) | `llama-quantize ... Q4_K_M` (~23 sec, 5.34 GB) |
| Hardware needed | Intel i5-1135G7 + 16 GB RAM | High-RAM cloud notebook (≥30 GB RAM) |
| Reproducibility notebook | `solarhive_quantize_nf4.py`-style recipe in GitHub repo | `solarhive_colab_quantize_e4b.ipynb` in GitHub repo |
The mmproj companion is produced once via convert_hf_to_gguf.py --mmproj on the merged safetensors — independent of the text quantization step.
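The 0.73 GB gap between the two text variants (5.34 GB vs 4.61 GB) is consistent with re-typing a single tensor. The sketch below assumes the GGML block layouts for Q6_K (210 bytes per 256 weights) and Q4_0 (18 bytes per 32 weights), and takes the tensor shape [10752, 262144] and the Q6_K default for `per_layer_token_embd.weight` from the Technical Notes section; the function name is illustrative.

```python
ROWS, COLS = 10_752, 262_144     # per_layer_token_embd.weight shape from this card
N_PARAMS = ROWS * COLS           # 2,818,572,288 weights

# ASSUMED GGML layouts: (bytes per block, weights per block).
BLOCKS = {
    "Q6_K": (210, 256),
    "Q4_0": (18, 32),
}


def tensor_bytes(n_params: int, qtype: str) -> int:
    """Storage size of a tensor whose weight count is a multiple of the block size."""
    block_bytes, block_weights = BLOCKS[qtype]
    return n_params // block_weights * block_bytes


q6k = tensor_bytes(N_PARAMS, "Q6_K")
q4_0 = tensor_bytes(N_PARAMS, "Q4_0")
print(f"Q6_K:  {q6k / 1e9:.2f} GB")             # Q6_K:  2.31 GB
print(f"Q4_0:  {q4_0 / 1e9:.2f} GB")            # Q4_0:  1.59 GB
print(f"saved: {(q6k - q4_0) / 1e9:.2f} GB")    # saved: 0.73 GB
```

Under these assumptions the override alone accounts for roughly 0.73 GB, which matches the difference between the two file sizes in the table.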
Community Model
| Parameter | Value |
|---|---|
| Location | Ann Arbor, Michigan (42.2808°N, 83.7430°W) |
| Community size | 12 homes |
| Total panel capacity | 72 kW |
| Shared battery storage | 100 kWh |
| Grid region | MISO (Midcontinent Independent System Operator) |
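For scripts that need these parameters, one option is a small immutable config object. This is a sketch only: the class and field names are hypothetical and are not taken from `solarhive_inference.py`, and the derived figures are idealized (full rated output, no losses).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CommunityConfig:
    latitude: float
    longitude: float           # negative values are west of Greenwich
    homes: int
    panel_capacity_kw: float
    battery_kwh: float
    grid_region: str

    @property
    def capacity_per_home_kw(self) -> float:
        return self.panel_capacity_kw / self.homes

    @property
    def hours_to_fill_battery(self) -> float:
        """Idealized: empty battery, full rated output, zero losses or load."""
        return self.battery_kwh / self.panel_capacity_kw


ANN_ARBOR = CommunityConfig(42.2808, -83.7430, 12, 72.0, 100.0, "MISO")
print(ANN_ARBOR.capacity_per_home_kw)               # 6.0
print(round(ANN_ARBOR.hours_to_fill_battery, 2))    # 1.39
```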
Runtime Performance
CPU-only inference on Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Ollama 0.21.0:
| Phase | Time | Speed | Notes |
|---|---|---|---|
| First query (cold) | ~65 s | ~2.2 tok/s | Includes ~55–60 s model load |
| Warm advisory query | ~10 s | ~9–10 tok/s | Single forward pass, no tools |
| Warm tool-calling loop | 25–60 s | ~9–10 tok/s | 2–3 rounds with live API latency |
GGUF blob ingestion (one-time, after ollama create): ~3–5 min for the 4.61 GB variant on a typical laptop SSD.
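Figures like these can be derived from the timing fields Ollama returns with each non-streamed response (`total_duration`, `load_duration`, `eval_count`, `eval_duration`; durations are in nanoseconds). A minimal sketch follows; the helper name and the sample values are illustrative, not measurements from this benchmark.

```python
NS_PER_S = 1_000_000_000


def summarize_timing(resp: dict) -> dict[str, float]:
    """Convert Ollama's nanosecond timing fields into seconds and tokens/second."""
    decode_s = resp["eval_duration"] / NS_PER_S
    return {
        "load_s": resp.get("load_duration", 0) / NS_PER_S,
        "decode_tok_per_s": resp["eval_count"] / decode_s,
        "total_s": resp["total_duration"] / NS_PER_S,
    }


# Made-up warm-query response, shaped like an Ollama /api/generate reply.
sample = {
    "total_duration": 10_500_000_000,
    "load_duration": 50_000_000,
    "eval_count": 95,
    "eval_duration": 10_000_000_000,
}
print(summarize_timing(sample))
```

Separating `load_s` from the decode rate keeps cold-start cost from being folded into the tokens-per-second figure.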
Technical Notes
- PLE tensor override. The `Q4_K_M` mixed-precision strategy assigns Q6_K to `per_layer_token_embd.weight [10752, 262144]` (2.82 B params, the largest single tensor in the model). The float32 intermediate buffer for Q6_K conversion (~10.7 GB) OOMs `llama-quantize` on 16 GB RAM. The `--tensor-type per_layer_token_embd.weight=q4_0` override eliminates the buffer; benchmark scores indicate the override is quality-safe.
- One mmproj for many text variants. The mmproj companion file is independent of text-model quantization: `llama-quantize` only sees the text tower. The same 992 MB mmproj pairs with PLE-Q4_0, Standard, or any future Q5/Q6/Q8 variant.
- Ollama `/api/chat` content-drop issue. Ollama 0.21.0's `gemma4.go:306` parser detects but rejects fine-tuned Gemma 4's native tool-call format (bare keys + `<|"|>` delimiters). Use the `/api/generate` raw mode + manual prompt builder path (Path B in "Inference Path Selection" above) for tool-calling workloads.
- Ollama multimodal support is evolving. Ollama 0.21.0's Modelfile syntax for Gemma 4 mmproj projector declaration is not yet first-class. For text + image / text + audio today, use `llama-server --mmproj` from llama.cpp b8863+ directly.
- Sampling defaults. `temperature=1.0, top_p=0.95, top_k=64` (Kaggle-recommended Gemma 4 defaults). Set in both `Modelfile` and `Modelfile.standard`.
- Context window. Modelfiles set `num_ctx=4096`. The base architecture supports up to 128 K; raise `num_ctx` for longer multi-round agentic loops at the cost of more RAM at inference.
- No Unsloth dependency at inference. Once quantized, the GGUF files run via stock Ollama or llama.cpp. Unsloth was used only during fine-tuning.
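As a rough illustration of the raw-mode path with these defaults, the sketch below builds and sends an `/api/generate` request using only the standard library. The model tag `solarhive-e4b` and the host URL are placeholders. The prompt must already be fully formatted by your own prompt builder, since raw mode bypasses Ollama's template; the Gemma 4 turn format itself is not reproduced here.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"   # default local Ollama address


def build_payload(prompt: str, model: str = "solarhive-e4b") -> dict:
    """Request body for raw-mode generation with the card's sampling defaults."""
    return {
        "model": model,       # placeholder tag; use the name given to `ollama create`
        "prompt": prompt,     # must already contain the full chat formatting
        "raw": True,          # skip Ollama's prompt template
        "stream": False,
        "options": {
            "temperature": 1.0,
            "top_p": 0.95,
            "top_k": 64,
            "num_ctx": 4096,
        },
    }


def generate(prompt: str, timeout: float = 300.0) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())["response"]
```

Keeping payload construction separate from the network call makes the sampling options easy to inspect without a running server.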
Limitations
- Prototype tested on a single community model (12 homes, Ann Arbor, MI). Real-world deployment requires validation across diverse geographies and community sizes.
- Ollama's native `/api/chat` path is not the demo path; see "Inference Path Selection" above for the `gemma4.go` content-drop reasoning. Use the `/api/generate` raw mode + manual prompt builder path for production inference.
- Image and audio modalities require `llama-server --mmproj`; an Ollama-native multimodal recipe is pending upstream Ollama support.
- The model occasionally uses "60 kW" instead of the correct 72 kW community capacity in direct (no-tool) responses. This is a base-model tendency, addressed by the tool-calling path, which queries actual capacity.
- Tool responses depend on external API availability. Open-Meteo and EIA have rate limits; OpenWeatherMap free tier allows 1,000 calls/day.
- The battery state is currently a deterministic simulator (`get_battery_state()` in `solarhive_inference.py`); real deployment requires integration with actual battery management systems.
- The PLE-Q4_0 override trades a small quality margin on one tensor for laptop-class deployability. The standard variant exists as a higher-precision reference; both score 10/10 on the SolarHive 10-prompt benchmark.
Future Iteration — Multi-Token Prediction (MTP) Drafters on Edge GGUF Runtimes
Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's benchmark was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.
Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under greedy (argmax) and sampled decoding alike: a smaller drafter proposes γ candidate tokens, the target scores all γ in a single parallel forward pass, accepted tokens are kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only acceptance rate α, and therefore walltime speedup, varies.
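The accept/resample rule can be checked numerically for a single token position. The sketch below uses toy distributions over a three-token vocabulary (the numbers are invented for illustration) and confirms that accepting a drafted token x with probability min(1, p(x)/q(x)), and otherwise resampling from the normalized residual max(0, p − q), yields exactly the target distribution p.

```python
def residual(p: list[float], q: list[float]) -> list[float]:
    """Corrected distribution used after a rejection: norm(max(0, p - q))."""
    gaps = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(gaps)
    return [g / total for g in gaps]


def output_distribution(p: list[float], q: list[float]) -> list[float]:
    """Distribution of the emitted token under the speculative sampling rule."""
    # Probability that token i is drafted AND accepted: q_i * min(1, p_i / q_i).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    if reject_mass <= 1e-12:        # drafter matches target; nothing to correct
        return accepted
    fix = residual(p, q)
    return [a + reject_mass * f for a, f in zip(accepted, fix)]


target = [0.6, 0.3, 0.1]     # p: target model (toy values)
drafter = [0.3, 0.3, 0.4]    # q: drafter (toy values)

out = output_distribution(target, drafter)
alpha = sum(min(pi, qi) for pi, qi in zip(target, drafter))
print([round(x, 6) for x in out])   # [0.6, 0.3, 0.1]
print(round(alpha, 6))              # 0.7
```

Here the drafter is a poor match for the target, so only 70% of drafts are accepted, yet the emitted distribution is still the target's.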
Released drafter for E4B. google/gemma-4-E4B-it-assistant (~78.8 M params) is the canonical pair for google/gemma-4-E4B-it. Per the MTP overview, the drafter shares the input embedding table with the target and consumes the target's last-layer activations. Google reports up to 3× decode speedup on the 26B-A4B configuration; per-variant E4B numbers were not enumerated in the announcement.
Runtime support is partial for GGUF deployments.
| Runtime | Listed by Google? | Source |
|---|---|---|
| Ollama | ✅ Tested-runtime list | Google blog |
| llama.cpp | ⚠️ Appears in docs runtime nav but not in the blog's tested-runtime list | Docs nav |
| LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang | ✅ Tested-runtime list | Google blog |
Implementation paths on this edge GGUF tier (post-hackathon):
- Drafter GGUF conversion. Google ships the drafter as HF safetensors. To use against this Q4_K_M target via Ollama or llama.cpp, the drafter weights would need conversion through `convert_hf_to_gguf.py`. This is a feasible reuse of the same toolchain that produced this target's GGUF, but the conversion is not in the canonical SolarHive registry today.
- llama.cpp speculative decoding. `llama-speculative` and `llama-server --draft-model` support vanilla speculative decoding per the 2023 paper. Whether Gemma 4 MTP drafters' embedding-sharing + last-layer-activation conditioning architecture maps cleanly to llama.cpp's existing `--draft-model` plumbing is unverified: Google's docs list the runtime but the blog omits it from the tested set.
- Ollama paired drafter. Ollama is in Google's tested-runtime list; the exact CLI/API surface for drafter pairing is not yet documented in Ollama's public docs as of writing.
Planned measurement (post-hackathon). (a) Convert google/gemma-4-E4B-it-assistant → Q4_K_M GGUF via convert_hf_to_gguf.py + llama-quantize. (b) Re-run the parity benchmark with the drafter paired via llama.cpp's --draft-model flag. (c) Capture acceptance rate α + decode-tps + walltime. (d) Cross-check against the Ollama paired-drafter API once documented. Correctness is invariant per the 2023 speculative-sampling guarantee — only α varies under target × drafter distribution mismatch.
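Once α is measured, the paper's closed-form estimates give a quick sanity check on the observed speedup. The sketch below implements the expected tokens per target pass, (1 − α^(γ+1)) / (1 − α), and the expected walltime improvement, which divides that by (γc + 1), where c is the cost of one drafter pass relative to one target pass. Both formulas assume independent, identically distributed acceptances, and the α, γ, and c values in the example are placeholders, not measurements.

```python
def expected_tokens_per_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (i.i.d. acceptance assumed)."""
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_walltime_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected speedup over plain autoregressive decoding.

    cost_ratio is c: time of one drafter pass divided by time of one target pass.
    """
    return expected_tokens_per_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Placeholder inputs for illustration only.
print(round(expected_tokens_per_pass(0.8, 4), 3))           # 3.362
print(round(expected_walltime_speedup(0.8, 4, 0.05), 3))    # 2.801
```

A measured speedup well below this estimate would point to drafter overhead or runtime plumbing rather than a low acceptance rate.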
Companion Repositories
| Model | Repository | Purpose |
|---|---|---|
| SolarHive 26B A4B LoRA | solarhive-26b-a4b-lora | Cloud inference with full multimodal + function calling (LoRA adapters + Unsloth) |
| SolarHive 26B A4B NF4 | solarhive-26b-a4b-nf4 | Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs |
| SolarHive E4B LoRA | solarhive-e4b-lora | E4B adapter weights (~200 MB) — apply over base via Unsloth |
| SolarHive E4B Safetensors | solarhive-e4b-ollama | Merged safetensors for transformers-native multimodal research use |
| SolarHive E4B GGUF | This repo | Edge deployment — 2 text quants + 1 mmproj for Ollama / llama.cpp |
| SolarHive Dataset | solarhive-community-solar-multimodal | 1,727 training examples (1,713 text + 14 image-grounded) |
| LiteRT-LM Python edge runtime | solarhive_e4b_litert_v3.1.ipynb | LiteRT Special Tech Track entry: runs upstream base litert-community/gemma-4-E4B-it-litert-lm .litertlm (3.66 GB) + SolarHive UX layer + on-device agentic loop. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream gemma4 example module lands in ai_edge_torch.generative.examples/. |
| GitHub (source) | the-gemma4-good-hackathon-solarhive | Full source code, training notebooks, solarhive_inference.py (cloud), solarhive_inference_e4b_gguf_ollama.py (local laptop) |
Citation
@misc{solarhive2026,
title={SolarHive: AI-Powered Community Solar Energy Intelligence},
author={Youshen Lim},
year={2026},
url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}
Links
- GitHub: youshen-lim/the-gemma4-good-hackathon-solarhive
- Kaggle: The Gemma 4 Good Hackathon
- Base Model: google/gemma-4-e4b-it
- Unsloth Gemma 4 docs: unsloth.ai/docs/models/gemma-4
- llama.cpp: ggerganov/llama.cpp
Built with Gemma 4 in Ann Arbor, Michigan. April 2026.
Gemma is a trademark of Google LLC.