How to use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="Truthseeker87/solarhive-e4b-gguf",
	filename="",
)
llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

SolarHive

SolarHive E4B GGUF — Edge Solar Energy Intelligence

Fine-tuned Gemma 4 E4B (8B) in GGUF format for Ollama and llama.cpp edge deployment. Two text-model quantization variants plus one multimodal projector companion file. 10/10 + 2/3 When2Call on the local-laptop benchmark — matching the cloud E4B LoRA baseline as the joint best-in-class across all six measured deployment variants. Validated end-to-end on a CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7, 16 GB RAM) with the GGUF and Ollama blob cache stored on an external USB drive.

Built for the Gemma 4 Good Hackathon (Google DeepMind × Kaggle).

Base Model google/gemma-4-e4b-it
Architecture Dense + PLE — 8B total, 4.5B effective
Quantization Q4_K_M (two variants — see below)
Files 5.34 GB text (PLE-Q4_0) · 5.34 GB text (Standard Q6_K PLE) · 992 MB mmproj
Total footprint 6.3 GB (text + mmproj)
Modalities Text, Image, Audio (via mmproj companion)
Function Calling Native Gemma 4 protocol
Benchmark 10/10 parity (5/5 Q&A + 5/5 tool routing) + 2/3 When2Call — joint best in the 6-variant Multi-Variant Deployment Validation table
Local deployment CPU-only Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 16 GB RAM, Intel Iris Xe unused), Ollama 0.21.0 with llama.cpp ggml-cpu-icelake.dll (AVX2 + AVX512 + VNNI). External USB drive holds the 5.3 GB GGUF + Ollama blob cache.
Fine-Tuning LoRA via Unsloth (BF16)
Training Data 1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~150M params for E4B), unmodified by our LoRA per the Vertex AI SFT recipe. The 992 MB mmproj-solarhive-e4b-BF16.gguf companion file packages this base vision encoder + audio encoder for llama-server --mmproj.
Converged Loss 0.9218
Training Time 420 seconds on RTX PRO 6000 Blackwell
License MIT (adapters) / Gemma Terms (base model)

Model Overview

SolarHive is an AI energy advisor for community solar microgrids. It helps suburban neighborhoods collectively optimize distributed solar generation and shared battery storage through natural language conversation, visual inspection, and live data integration.

This is the edge deployment artifact. The 4.61 GB text GGUF + 992 MB multimodal projector run on a 16 GB laptop CPU with no GPU, no cloud dependency, and no internet requirement. Companion to SolarHive 26B A4B LoRA (cloud inference with full multimodal VQA) and SolarHive E4B Ollama (merged safetensors for transformers research).

Privacy-first edge deployment. Community energy data never leaves the neighborhood. A village in rural India, a suburb in Michigan, and a coastal town recovering from a hurricane all get the same intelligence with no cloud round-trips.


Files in This Repository

| File | Size | Use |
| --- | --- | --- |
| solarhive-e4b-q4_k_m.gguf | 5.34 GB | Text + tool calling (PLE-Q4_0 quantization recipe) |
| solarhive-e4b-q4_k_m-standard.gguf | 5.34 GB | Text + tool calling (Standard Q6_K-PLE quantization recipe, the llama.cpp default Q4_K_M preset) |
| mmproj-solarhive-e4b-BF16.gguf | 992 MB | Vision (SigLIP, 658 tensors) + audio (Conformer, 751 tensors) projector; pairs with either text variant |
| Modelfile | | Ollama recipe pointing at the PLE-Q4_0 text variant |
| Modelfile.standard | | Ollama recipe pointing at the Standard text variant |

Quantization Variants

Both text variants are 5.34 GB Q4_K_M quantizations of the same 720 transformer tensors. They were produced via two different quantization recipes — preserved as separate files for reproducibility / methodology transparency:

| Variant | Quantization recipe | Total size | Production hardware | Inference hardware |
| --- | --- | --- | --- | --- |
| solarhive-e4b-q4_k_m.gguf | PLE-Q4_0 override (--tensor-type per_layer_token_embd.weight=q4_0) | 5.34 GB | Laptop (≥16 GB RAM); bypasses the Q6_K-PLE intermediate buffer | ≥7 GB RAM at 4K context |
| solarhive-e4b-q4_k_m-standard.gguf | Standard Q6_K-PLE (llama.cpp default Q4_K_M preset) | 5.34 GB | High-RAM cloud notebook (≥30 GB RAM); needs the ~10.7 GB float32 intermediate buffer | ≥7 GB RAM at 4K context |

Why preserve both recipes? The standard Q4_K_M mixed-precision strategy assigns Q6_K to the PLE tensor per_layer_token_embd.weight [10752, 262144] (2.82 B params, the largest single tensor in the model). Converting that tensor to Q6_K requires a ~10.7 GB float32 intermediate buffer during quantization — OOMs on 16 GB hardware (bad allocation at tensor 4/720). The PLE-Q4_0 recipe uses --tensor-type per_layer_token_embd.weight=q4_0 to bypass the buffer, enabling laptop-class quantization. Both recipes are documented for transparency about the methodological tradeoff.
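The buffer size quoted above can be sanity-checked from the tensor shape alone. A rough sketch, assuming the whole tensor is held as float32 (4 bytes per element) at once; the helper name is illustrative and not part of llama.cpp:

```python
# Back-of-envelope check of the PLE float32 buffer (illustrative helper, not a llama.cpp API).
PLE_SHAPE = (10752, 262144)  # per_layer_token_embd.weight, shape as stated above


def float32_buffer_gib(shape):
    """Return (element count, GiB needed to hold the tensor as float32)."""
    n = 1
    for dim in shape:
        n *= dim
    return n, n * 4 / 2**30


params, gib = float32_buffer_gib(PLE_SHAPE)
print(f"{params / 1e9:.2f} B params, {gib:.1f} GiB as float32")
# prints: 2.82 B params, 10.5 GiB as float32
```

That is about 10.5 GiB (roughly 11.3 GB in decimal units), in the same range as the ~10.7 GB figure above and more than a 16 GB machine can comfortably spare during quantization.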

Are they interchangeable? Yes — both load and run identically on a 16 GB laptop at inference time. Quality benchmarks below confirm parity. Pick by quantization-time hardware constraint; the inference-time behavior is the same.

One mmproj, multiple text-GGUF variants

The mmproj-solarhive-e4b-BF16.gguf companion is independent of the text model's quantization. llama-quantize operates exclusively on the text tower — vision and audio tensors were split out into the mmproj file during convert_hf_to_gguf.py --mmproj. At inference time, llama-server --model X --mmproj Y pairs them on demand. One mmproj serves any current or future text variant (Q4/Q5/Q6/Q8) with no per-variant duplication.


Inference Path Selection

This repo ships alongside solarhive_inference_e4b_gguf_ollama.py (in the GitHub repo), which implements two inference approaches against Ollama. The recommended demo path is the /api/generate raw mode + manual prompt builder approach. Both are documented for transparency and reproducibility.

Path A — OpenAI-compatible /api/chat + Ollama's built-in gemma4.go parser

Definition. Send messages and OpenAI-style tool schemas to Ollama's standard /api/chat endpoint. Ollama internally renders the chat template, parses model output, and extracts tool calls via its dedicated Gemma 4 parser (gemma4.go). The simplest path — closest to "drop-in replacement" for any OpenAI-compatible client.

import requests, json
resp = requests.post("http://127.0.0.1:11434/api/chat", json={
    "model": "solarhive",
    "messages": [{"role": "user", "content": "What's the current battery state?"}],
    "tools": [{"type": "function", "function": {...}}],
    "stream": False,
    "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})

Rationale. This is the path most developers will reach for first. OpenAI-compatible JSON, no manual prompt construction, leverages Ollama's first-party Gemma 4 support.
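The {...} placeholder in the request above stands for an OpenAI-style function schema. A minimal sketch for get_weather(location), the signature listed in this card's tool table; the description strings and the build_chat_payload helper are illustrative assumptions, not copied from the SolarHive repo:

```python
import json

# Hypothetical schema for get_weather(location); description text is a placeholder.
GET_WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Current weather for a location (temperature, clouds, wind, humidity).",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. Ann Arbor, MI"},
            },
            "required": ["location"],
        },
    },
}


def build_chat_payload(user_text, tools, model="solarhive"):
    """Assemble the /api/chat request body shown above (no network call is made)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_text}],
        "tools": tools,
        "stream": False,
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
    }


payload = build_chat_payload(
    "Current weather and how does it affect solar production?", [GET_WEATHER_TOOL]
)
print(json.dumps(payload, indent=2)[:300])
```

The resulting dict can be passed as the json= argument of the requests.post call above.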

Why it does not score 10/10 in our benchmark. Ollama 0.21.0's gemma4.go:306 parser detects the model's native call:fn{} output but rejects it because the arguments use bare unquoted keys and <|"|> string delimiters (the Gemma 4 native format the model was fine-tuned on) instead of strict JSON. Server log evidence:

level=WARN source=gemma4.go:306 msg="gemma4 tool call parsing failed"
error="invalid character '\'' looking for beginning of object key string
       repair failed to produce valid JSON arguments"
content="call:get_status{battery:{level:65,pct:77,kwh:77,...,dispatched:<|\"|>OK<|\"|>},...}"

The repair logic also fails to recover valid JSON. As a result, tool calls are silently dropped and the returned content is empty, so scores on the 10-prompt benchmark are inconsistent and capped by this upstream issue. This path is not viable for the demo without an upstream Ollama patch.
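The mismatch is easy to reproduce outside Ollama. The sketch below is a Python analogue, not the gemma4.go code: it shows a strict JSON parser rejecting native-format arguments, then one possible conversion that quotes bare keys and swaps the <|"|> delimiters for JSON strings. The sample string and the helper name are illustrative.

```python
import json
import re

DELIM = '<|"|>'                                          # Gemma 4 native string delimiter
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_]\w*)\s*:')    # a key directly after "{" or ","


def native_args_to_dict(args: str) -> dict:
    """Convert '{key:1,name:<|"|>x<|"|>}' into a Python dict (illustrative only)."""
    parts = args.split(DELIM)
    if len(parts) % 2 == 0:
        raise ValueError("unbalanced string delimiters")
    out = []
    for i, part in enumerate(parts):
        if i % 2:   # odd segments sit between delimiters: they are string values
            out.append(json.dumps(part))
        else:       # even segments are structure: quote the bare keys
            out.append(BARE_KEY.sub(r'\1"\2":', part))
    return json.loads("".join(out))


native = '{battery:{level:65,pct:77},dispatched:<|"|>OK<|"|>}'

try:
    json.loads(native)
except json.JSONDecodeError as err:
    print("strict JSON rejects it:", err.msg)

print(native_args_to_dict(native))
```

Splitting on the delimiter first keeps the key-quoting regex away from string contents, so a value that happens to contain a comma and a colon is not rewritten.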

Path B — /api/generate raw mode + manual Gemma 4 prompt builder + native parser (recommended)

Definition. Bypass gemma4.go entirely. Build the Gemma 4 native prompt manually from chat_template.jinja, send via Ollama's /api/generate endpoint with raw: True (no template rendering by Ollama), parse the raw output ourselves using regex matching tokenizer_config.json's response_schema.

import requests

# build_gemma4_prompt and parse_gemma4_output come from solarhive_inference_e4b_gguf_ollama.py
prompt = build_gemma4_prompt(messages, tools)   # byte-identical to apply_chat_template
resp = requests.post("http://127.0.0.1:11434/api/generate", json={
    "model": "solarhive",
    "prompt": prompt,
    "raw": True,        # Ollama applies no chat template of its own
    "stream": False,
    "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
})
content, tool_calls = parse_gemma4_output(resp.json()["response"])

Rationale. We control both ends of the prompt/response cycle:

  1. Prompt structure matches what the model was fine-tuned on, exactly. build_gemma4_prompt produces byte-identical output to apply_chat_template(messages, tools, enable_thinking=False, add_generation_prompt=True) — the same call used by the cloud 26B A4B path.
  2. Output parser matches the exact regex from tokenizer_config.json response_schema: r"<\|tool_call>(.*?)<tool_call\|>" for tool blocks, r"call:(\w+)\{(.*)\}$" for argument extraction.
  3. Tool responses are fed back as <|turn>tool\n{json}<turn|>, matching what the training pipeline rendered from the {role:"tool", content:"{...}"} OpenAI-style messages in the training data.

Result. Score on the 10-prompt benchmark: 10/10 on both text variants + 2/3 on the When2Call probes. Matches the cloud E4B LoRA baseline as the joint best across all 6 measured variants in the Multi-Variant Deployment Validation table below.

Trade-off. ~30 lines of additional Python in solarhive_inference_e4b_gguf_ollama.py to build the prompt and parse the output. Worth it to bypass the upstream gemma4.go parser issue.
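For orientation, the parsing half can be sketched with the two regexes quoted above. This is a simplified stand-in, not the reference parse_gemma4_output from the repo: the function name and return shape are assumptions, and arguments are left as raw native-format strings.

```python
import re

# The two patterns quoted in the rationale above; DOTALL lets blocks span lines.
TOOL_BLOCK = re.compile(r"<\|tool_call>(.*?)<tool_call\|>", re.DOTALL)
CALL = re.compile(r"call:(\w+)\{(.*)\}$", re.DOTALL)


def parse_tool_calls(raw: str):
    """Split raw model output into (plain text, [(tool name, raw argument string), ...])."""
    calls = []
    for block in TOOL_BLOCK.findall(raw):
        match = CALL.search(block.strip())
        if match:
            calls.append((match.group(1), match.group(2)))
    content = TOOL_BLOCK.sub("", raw).strip()   # whatever is left is the visible answer
    return content, calls


raw = 'Checking now.<|tool_call>call:get_weather{location:<|"|>Ann Arbor<|"|>}<tool_call|>'
content, calls = parse_tool_calls(raw)
print(content)
print(calls)
```

A full implementation would also convert the raw argument string into a dict before dispatching the tool.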

Recommended path

For the SolarHive Ollama + llama.cpp demo we use Path B (/api/generate raw mode + manual prompt builder). The solarhive_inference_e4b_gguf_ollama.py script in the GitHub repo provides a complete reference implementation of build_gemma4_prompt and parse_gemma4_output. Set OLLAMA_MODEL=solarhive (PLE-Q4_0) or OLLAMA_MODEL=solarhive-standard (Standard) and run the script — both score 10/10.

For multimodal use cases (vision, audio), use llama-server --mmproj from llama.cpp directly; see "How to Use" below. Ollama 0.21.0's Modelfile syntax for declaring a Gemma 4 multimodal projector is still evolving, and first-class Ollama vision support for Gemma 4 is expected in a future release.

Path C — Interactive chat via Unsloth Studio

For community users who want to chat with the SolarHive E4B GGUF without writing a Modelfile or running the HTTP harness by hand, Unsloth Studio provides an open-source no-code local web UI:

# macOS / Linux / WSL
curl -fsSL https://unsloth.ai/install.sh | sh
unsloth studio -H 0.0.0.0 -p 8888

# Windows PowerShell
irm https://unsloth.ai/install.ps1 | iex
unsloth studio -H 0.0.0.0 -p 8888

Studio supports running GGUF + safetensor models locally with self-healing tool calling and code execution, model arena (side-by-side comparison), and saves to GGUF / 16-bit safetensor formats. Use Studio for interactive exploration; use Path B (/api/generate raw mode + solarhive_inference_e4b_gguf_ollama.py) for benchmark-grade evaluation — Studio doesn't expose the per-question evaluation harness needed to reproduce the 10/10 SolarHive parity score.


Benchmark Results

10-prompt parity benchmark — 5 domain Q&A (no tool expected) + 5 tool-calling — plus 3 When2Call probes. Run via the /api/generate raw mode + manual Gemma 4 prompt builder path (Path B above), on Ollama 0.21.0, CPU-only Microsoft Surface Pro 8 (Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Intel Iris Xe unused), with the 5.3 GB GGUF + Ollama blob cache stored on an external USB drive. Same 10 prompts and same 3 When2Call probes as the cloud benchmarks for direct cross-variant comparison.

Domain Q&A (5/5 on both variants)

| Question | Expected behavior | PLE-Q4_0 | Standard |
| --- | --- | --- | --- |
| Solar production when humidity exceeds 80%? | Direct answer, no tool call | ✅ | ✅ |
| At what battery SOC should we stop exporting to the grid? | Direct answer, no tool call | ✅ | ✅ |
| Home #3 underperforming 22%: diagnostic checklist? | Direct answer, no tool call | ✅ | ✅ |
| Winter snow on panels: prioritize actions? | Direct answer, no tool call | ✅ | ✅ |
| Grid frequency 59.8 Hz: microgrid implications? | Direct answer, no tool call | ✅ | ✅ |

Tool Calling (5/5 on both variants)

| Question | Expected tool | PLE-Q4_0 | Standard |
| --- | --- | --- | --- |
| What's the current battery state? | get_battery_state | ✅ fired + synthesized | ✅ fired + synthesized |
| Current weather and how does it affect solar production? | get_weather | ✅ fired + synthesized | ✅ fired + synthesized |
| What are general maintenance tips for panels? | None (no tool needed) | ✅ correctly no tool call | ✅ correctly no tool call |
| What is the weather expected to be like this week? | get_weather | ✅ fired + synthesized | ✅ fired + synthesized |
| How should we plan energy consumption and storage given the weather forecast? | get_weather (+ get_grid_status) | ✅ fired + synthesized | ✅ fired + synthesized |

Quantization-precision independence confirmed. The PLE-Q4_0 override is quality-safe — no measurable regression on the 10-prompt benchmark vs the standard Q6_K-PLE. Both variants are interchangeable shipping artifacts; pick by quantization-time hardware constraint.

When2Call probes (3 of the 4 categories per Ross et al. 2025)

Three held-out probes based on When2Call: When (not) to Call Tools cover 3 of the paper's 4 failure-mode categories. The paper reports 9–67% tool-hallucination rates on (c) and (d) in untrained community models:

| Category | Question | Expected behavior | Result on this GGUF |
| --- | --- | --- | --- |
| (b) Well-specified, in-scope | "What's the current grid rate?" | Call get_grid_status | PASS: called get_grid_status |
| (c) Under-specified | "How much will a 10 kW array produce today?" | Follow-up question (does NOT auto-fill location default) | PASS: asked for current weather conditions |
| (d) Out-of-scope | "What's the current air quality index in Ann Arbor?" | Refusal + redirect (does NOT hallucinate a tool) | FAIL: called get_weather (a known E4B-family failure mode also seen on the cloud E4B merged variant; the larger A4B family scores 3/3 on this same probe) |

Headline: 2/3 nominal, the same profile as the cloud E4B merged variant (transformers BF16). On these probes, GGUF Q4_K_M quantization does not change the refusal/follow-up decisions within the E4B family. The 1/3 W2C gap to the A4B family (3/3 vs 2/3) persists across runtimes (cloud transformers and local Ollama produce identical W2C scores within each family), which points to a model-size effect rather than a precision artifact.

End-to-end agentic loop probe

A multi-tool community-energy-audit query is run through the full agentic loop (extract → execute → feed back, max 3 rounds), mirroring the cloud notebook's §13g cell. Headline trace:

"Full community energy audit — check current weather, solar production, battery state, and grid pricing. Give a 3-sentence status report."

  • Rounds completed: 2
  • Tools executed: get_weather, get_solar_production, get_battery_state, get_grid_status (4 of 5 tools called in parallel in round 1)
  • Final answer (excerpt): "Community Energy Status Report (Midday): Partly cloudy with 30% cloud cover and 72°F. Array is producing 56% of capacity at 40.4 kW. Battery is at 72% SOC and actively charging from surplus. Grid pricing is Peak rates ($0.28/kWh). Status: Excellent. Maximize self-consumption by running heavy loads now…"

All four tool results are correctly synthesized into a coherent status report — confirming the GGUF Q4_K_M quantization preserves end-to-end agentic reasoning quality, not just single-shot tool routing.
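The extract → execute → feed back loop described above can be sketched as follows. This is a minimal illustration, not the repo's harness: run_agent_loop, the generate callable, and the stub model and tool are hypothetical, and the tool_calls entry shape is assumed to follow the usual OpenAI layout.

```python
import json


def run_agent_loop(generate, tools, messages, max_rounds=3):
    """Extract -> execute -> feed back, for at most max_rounds model turns.

    generate(messages) must return (content, calls), where calls is a list of
    (tool_name, kwargs) pairs; tools maps tool names to Python callables.
    """
    content = ""
    for _ in range(max_rounds):
        content, calls = generate(messages)
        if not calls:                        # no tool requested: this is the final answer
            return content
        messages.append({
            "role": "assistant",
            "tool_calls": [
                {"type": "function", "function": {"name": name, "arguments": kwargs}}
                for name, kwargs in calls
            ],
        })
        for name, kwargs in calls:           # execute each call and feed the result back
            result = tools[name](**kwargs)
            messages.append({"role": "tool", "name": name, "content": json.dumps(result)})
    return content                           # round budget exhausted


# Stub model and tool so the sketch runs without Ollama (values are made up).
def fake_generate(messages):
    if not any(m["role"] == "tool" for m in messages):
        return "", [("get_battery_state", {})]
    soc = json.loads(messages[-1]["content"])["soc_pct"]
    return f"Battery is at {soc}% SOC.", []


tools = {"get_battery_state": lambda: {"soc_pct": 72}}
history = [{"role": "user", "content": "What's the current battery state?"}]
answer = run_agent_loop(fake_generate, tools, history)
print(answer)
```

In a real run, generate would wrap the /api/generate request and the output parser, and the round cap keeps a model that keeps requesting tools from looping indefinitely.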

Multi-Variant Deployment Validation — All 6 variants now measured

Cross-variant table covering all 5 cloud transformers variants (validated on Colab Pro G4 with NVIDIA RTX PRO 6000 Blackwell) plus this GGUF variant (validated on the CPU-only Surface Pro 8 reference hardware described above):

| Variant | Q&A | Tool | W2C | Total (Q&A + Tool) | Backend | Hardware |
| --- | --- | --- | --- | --- | --- | --- |
| a4b_lora (baseline) | 5/5 | 4/5 | 3/3 | 9/10 | transformers + Unsloth | Colab Pro G4 GPU |
| e4b_lora | 5/5 | 5/5 | 2/3 | 10/10 | transformers + Unsloth | Colab Pro G4 GPU |
| e4b_merged | 5/5 | 4/5 | 2/3 | 9/10 | transformers BF16 | Colab Pro G4 GPU |
| a4b_merged | 5/5 | 4/5 | 3/3 | 9/10 | transformers BF16 | Colab Pro G4 GPU |
| a4b_nf4 | 5/5 | 4/5 | 3/3 | 9/10 | transformers NF4 (BnB) | Colab Pro G4 GPU |
| e4b_gguf (this artifact) | 5/5 | 5/5 | 2/3 | 10/10 | Ollama HTTP raw + manual prompt builder | CPU-only Surface Pro 8 (i5-1135G7, 16 GB RAM) |

Key empirical findings:

  1. Joint best variant in the table. This GGUF ties the cloud E4B LoRA baseline at 10/10 + 2/3 W2C, despite running on a 4-year-old consumer laptop with no GPU rather than the cloud RTX PRO 6000 Blackwell GPU.
  2. GGUF Q4_K_M quantization shows no measurable loss within the E4B family on this benchmark. Tool routing 5/5 + W2C 2/3 exactly matches the BF16 LoRA, with the same (b)+(c) PASS, (d) FAIL profile. The 5.3 GB CPU-only deployment reaches the same pass/fail outcomes as the BF16 + GPU pipeline.
  3. The 1/3 W2C gap to the A4B family tracks model size, not runtime: it is reproduced on both cloud transformers and local Ollama runtimes within each family.

Reproduce locally:

# Windows PowerShell syntax; in bash/zsh use: export OLLAMA_HOST=http://localhost:11434
$env:OLLAMA_HOST  = 'http://localhost:11434'
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short

The benchmark run auto-writes a markdown summary to archive/ollama_local_e4b_gguf_results_YYYYMMDD_HHMMSS.md with the full per-question / per-probe trace.


How to Use

Path 1 — Text + tool calling via Ollama (10/10 demo path)

# Download the PLE-Q4_0 text variant + Modelfile + LICENSE
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf Modelfile LICENSE \
  --local-dir ./solarhive-gguf

cd ./solarhive-gguf

# Register with Ollama (Modelfile uses ./solarhive-e4b-q4_k_m.gguf relative path)
ollama create solarhive -f Modelfile

# Quick sanity check
ollama run solarhive "What is the current solar production for our community?"

# Full 10/10 + 2/3 W2C agentic benchmark — uses /api/generate raw mode + manual prompt builder
git clone https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive
cd the-gemma4-good-hackathon-solarhive
# PowerShell syntax; in bash/zsh use: export OLLAMA_MODEL=solarhive
$env:OLLAMA_MODEL = 'solarhive'
python -m pytest solarhive_inference_e4b_gguf_ollama.py -v --tb=short

To use the standard variant instead, swap solarhive-e4b-q4_k_m.gguf and Modelfile for solarhive-e4b-q4_k_m-standard.gguf and Modelfile.standard, then ollama create solarhive-standard -f Modelfile.standard and OLLAMA_MODEL=solarhive-standard.

Path 2 — Text + image + audio via llama.cpp llama-server

# Download a text variant + the mmproj companion
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
  --local-dir ./solarhive-gguf

# Start llama-server with the mmproj
llama-server \
  --model   ./solarhive-gguf/solarhive-e4b-q4_k_m.gguf \
  --mmproj  ./solarhive-gguf/mmproj-solarhive-e4b-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8080

# POST chat-completion requests with image_url to http://localhost:8080/v1/chat/completions

The mmproj bundles both vision (SigLIP, for sky photos and panel inspection) and audio (Conformer, for voice queries up to 30 s), so the same file serves both modalities.
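A minimal client for the chat-completions endpoint mentioned above might look like the sketch below. It uses only the standard library and sends the image inline as a base64 data URI; the helper names, the prompt text, and the placeholder image bytes are assumptions, and the actual POST is left commented out because it needs the server to be running.

```python
import base64
import json
import urllib.request


def build_vision_payload(prompt: str, image_bytes: bytes, mime: str = "image/jpeg") -> dict:
    """Build an OpenAI-style chat body with one inline image (data URI)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        "temperature": 1.0,
        "top_p": 0.95,
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8080") -> dict:
    """POST to llama-server; requires the server started as shown above."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


# Placeholder bytes keep the sketch runnable; pass the bytes of a real sky photo in practice.
payload = build_vision_payload("Estimate cloud cover in this sky photo.", b"\xff\xd8\xff")
# reply = post_chat(payload)
# print(reply["choices"][0]["message"]["content"])
```

The generous timeout is deliberate: CPU-only image encoding plus generation can take far longer than a typical HTTP default.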

Path 3 — Community microgrid hub on Jetson Orin Nano Super (llama.cpp + CUDA)

The same GGUF runs CUDA-accelerated on a Jetson Orin Nano Super Developer Kit ($249, 8 GB LPDDR5, 1024 CUDA cores, 7–25 W power envelope) — making it a natural community-ownable, solar-powerable microgrid hub that serves a whole neighborhood. Build llama.cpp with CUDA per Nvidia's official Gemma 4 Jetson recipe:

# On the Jetson (Ubuntu via JetPack SDK)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4

# Download the same GGUFs as Path 1 / Path 2
hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf mmproj-solarhive-e4b-BF16.gguf \
  --local-dir ~/models

# Run with full GPU offload (-ngl 99) for maximum throughput
./build/bin/llama-server \
  --model   ~/models/solarhive-e4b-q4_k_m.gguf \
  --mmproj  ~/models/mmproj-solarhive-e4b-BF16.gguf \
  --temp 1.0 --top-p 0.95 --top-k 64 \
  --port 8080 --host 0.0.0.0 \
  -ngl 99 --flash-attn on \
  --no-mmproj-offload --jinja -np 1

Why this matters for community deployment: at 7–25 W, a single Jetson Orin Nano Super can be powered directly from a small solar panel during the day and a modest battery overnight — the SolarHive intelligence runs on the same energy infrastructure it advises. Mobile clients (Path 4 below) hit the Jetson at http://<hub-ip>:8080 over the local network when tool-calling responses with live API data are needed.

Modelfile reference

Both Modelfiles use a relative path so the same file works regardless of where you cloned the repo:

FROM ./solarhive-e4b-q4_k_m.gguf

SYSTEM """You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan. Use the available tools to get real-time data before answering. Be specific, reference actual data, and keep responses concise (3-5 sentences)."""

PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 4096

Core Capabilities

1. Multimodal Visual Question Answering (3 Modes)

Available via llama-server --mmproj (Path 2 above). Tensor inventory of mmproj-solarhive-e4b-BF16.gguf: 1411 tensors total — 658 vision (SigLIP), 751 audio (Conformer), 2 multimodal projectors (mm.input_projection.weight, mm.a.input_projection.weight).

| Mode | Input | Output |
| --- | --- | --- |
| Sky Analysis | Sky photograph | Cloud coverage %, production forecast, storage recommendation |
| Panel Inspection | Panel photograph | Dirt/damage/shading detection, efficiency impact estimate |
| Neighborhood Assessment | Aerial/satellite image | Panel inventory, expansion priorities, shading analysis |

2. Native Function Calling (5 Tools — all 3 keyed APIs wired)

Available via either Path 1 (Ollama) or Path 2 (llama-server). Tool schemas and reference implementations are in solarhive_inference_e4b_gguf_ollama.py (project root). The §13g cell of solarhive_inference.py runs an end-to-end agentic-loop probe via Ollama HTTP raw mode using these same 5 tools.

| Tool | API | Returns |
| --- | --- | --- |
| get_weather(location) | OpenWeatherMap (OWM_API_KEY) | Temperature, clouds %, wind, humidity, sunrise/sunset |
| get_solar_production(clouds_pct, temp_f) | Open-Meteo GHI (keyless) | Production kW, efficiency %, GHI W/m², temp derating |
| get_battery_state() | Community BMS (simulated) | State of charge, capacity, charging status |
| get_grid_status() | EIA Open Data (EIA_API_KEY) | Pricing period, rate/kWh, renewable %, CO2 intensity |
| get_nrel_pvwatts_baseline() | NREL PVWatts v8 (NREL_API_KEY) | Annual + current-month typical kWh + avg kW for the 72 kW array |

Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)} per call. The _build_gemma4_prompt() helper in solarhive_inference_e4b_gguf_ollama.py renders this format byte-identically — same as solarhive_datagen.py (training-data generation) and solarhive_finetune.py (SFT preprocessing). Inference matches training distribution exactly.

3. Selective Tool Reasoning

The model decides when to call tools — not blindly invoking all of them. Validated by the 5/5 tool-calling sub-benchmark above:

"What time does peak pricing start?"
→ Calls: get_grid_status() only

"Is today's production above typical for January?"
→ Calls: get_solar_production() + get_nrel_pvwatts_baseline()

"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()

"What are general maintenance tips for panels?"
→ Calls: none (answers from training knowledge)

4. Inference-time When2Call Validation (solarhive_inference.py §11b)

Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851 — the paper documents 9–67% tool-hallucination rates on (c) and (d) in untrained community models because public tool-calling datasets typically lack follow-up and unable-to-answer examples:

  • (b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
  • (c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
  • (d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)

A baseline community model trained without these categories typically fails (c) + (d) (per the paper's 9-67% hallucination rates on untrained models). With the _UNABLE_TO_ANSWER + _FOLLOW_UP_QUESTIONS corpus categories included in solarhive_datagen.py, the A4B family scores 3/3 and the E4B family scores 2/3 (passes (b) + (c), fails (d) — see Benchmark Results above). The same When2Call probes run end-to-end against this GGUF artifact via the /api/generate raw mode path for edge-deployment validation.
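One way to turn the three expected behaviors into a pass/fail check is sketched below. The rules are assumed heuristics for illustration (for example, treating any question mark as a follow-up), not the grader used in solarhive_inference.py, and the reply text and tool arguments are paraphrased placeholders.

```python
def grade_probe(category: str, content: str, calls: list, expected_tool: str = "") -> bool:
    """Heuristic pass/fail for one probe (assumed rules, not the repo's grader)."""
    called = [name for name, _ in calls]
    if category == "b":   # well-specified, in-scope: the expected tool must fire
        return expected_tool in called
    if category == "c":   # under-specified: no tool call, and the reply asks something
        return not called and "?" in content
    if category == "d":   # out-of-scope: no tool call at all
        return not called
    raise ValueError(f"unknown category: {category}")


# The three outcomes reported for this GGUF, re-expressed as inputs.
results = {
    "b": grade_probe("b", "", [("get_grid_status", {})], expected_tool="get_grid_status"),
    "c": grade_probe("c", "What are the current weather conditions?", []),
    "d": grade_probe("d", "", [("get_weather", {"location": "Ann Arbor"})]),
}
print(results, f"{sum(results.values())}/3")
```

With these inputs the tally reproduces the 2/3 profile reported for the E4B family: (b) and (c) pass, (d) fails because a tool fired.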


Training Details

Parameter Value
Method LoRA via Unsloth FastVisionModel (BF16, RTX PRO 6000 96 GB)
LoRA rank 16
LoRA alpha 16
LoRA dropout 0
Target modules All linear layers
Learning rate 2e-4
Optimizer AdamW 8-bit
Warmup steps 5
Epochs 3
Max sequence length 2048
Precision BF16
Seed 3407
Trainable parameters 41.2 M / 8.0 B (0.51%)

Training Data — 1,727 Examples

Same canonical training corpus as the 26B A4B model — solarhive-community-solar-multimodal, 1,727 rows:

  • 413 hand-crafted examples spanning 15+ US cities and 9 energy domains (sky conditions, battery management, panel health, consumption optimization, community/grid coordination, emergency resilience, seasonal planning, multi-step reasoning, alternative storage)
  • ~1,117 API-grounded examples from live Open-Meteo (GHI/DNI/DHI, low/mid/high cloud cover), PVWatts, OpenWeatherMap, and EIA APIs — every numeric claim traces to a real API response, joined on (location, hourly timestamp) for cross-source coherence
  • 183 tool-calling examples following the When2Call taxonomy — 106 should-call, 53 should-not-call, 10 unable-to-answer, 6 follow-up clarification, 8 failure-recovery
  • 14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs — paired with the same temperature-derated GHI formula used in text rows

See the SolarHive Dataset for full documentation.

Fine-tuning is text-only on the multimodal-capable corpus (image rows skipped at the data-prep layer). VQA at inference uses the base Gemma 4 E4B model's pretrained vision encoder (~150M params per the official model card). Our LoRA targets only the language-model linear layers (target=all-linear); the vision tower is unmodified, matching the Vertex AI Gemma 4 SFT recipe documented in the Hugging Face blog, which explicitly freezes both vision and audio towers during text-focused fine-tuning. The 992 MB mmproj-solarhive-e4b-BF16.gguf companion file packages this base vision encoder (658 SigLIP tensors) plus the base audio encoder (751 Conformer tensors) for llama-server --mmproj, giving the deployed GGUF full multimodal capability.

Training Loss

Metric Value
Converged loss (last 20 steps) 0.9218
Final step loss 0.0635
Minimum loss 0.0635
Total steps 324
Training time 420 seconds

Canonical metric: Converged loss (last 20 steps) is the only smoothed convergence indicator. Final step and Minimum are single-batch point statistics. Mini-batch loss is noisy from step to step, so one easy batch can drop a point estimate well below the rolling-average trend.
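The difference between the three statistics is easy to see on a toy log. The sketch below uses synthetic losses, not the actual SolarHive training log, and loss_summary is an illustrative helper:

```python
from statistics import mean


def loss_summary(losses, window=20):
    """Return the trailing-window mean plus the final and minimum single-step losses."""
    if len(losses) < window:
        raise ValueError("need at least `window` logged steps")
    return {
        "converged": mean(losses[-window:]),
        "final": losses[-1],
        "minimum": min(losses),
    }


# Synthetic log: a noisy plateau near 0.9 to 1.0 that ends on one unusually easy batch.
losses = [1.5] * 10 + [0.9, 1.0] * 15 + [0.95] * 19 + [0.06]
summary = loss_summary(losses)
print(summary)
```

Here the final and minimum values both equal the single easy batch, while the 20-step mean stays close to the plateau, which is the same pattern as the table above.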

Hardware

  • GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
  • Platform: Google Colab Pro (G4 VM)
  • Precision: BF16 (no quantization during training)

GGUF Production Pipeline

The two text variants differ in the quantization recipe applied to the PLE tensor and in the hardware on which they were quantized:

| Step | PLE-Q4_0 (laptop) | Standard (Colab) |
| --- | --- | --- |
| Source | merged safetensors from solarhive-e4b-ollama | same |
| Convert to BF16 GGUF | convert_hf_to_gguf.py --outtype bf16 (~30 min, 14 GB) | same |
| Quantize text tower | llama-quantize --tensor-type per_layer_token_embd.weight=q4_0 ... Q4_K_M (~15 min, 4.61 GB) | llama-quantize ... Q4_K_M (~23 sec, 5.34 GB) |
| Hardware needed | Intel i5-1135G7 + 16 GB RAM | High-RAM cloud notebook (≥30 GB RAM) |
| Reproducibility notebook | solarhive_quantize_nf4.py-style recipe in GitHub repo | solarhive_colab_quantize_e4b.ipynb in GitHub repo |

The mmproj companion is produced once via convert_hf_to_gguf.py --mmproj on the merged safetensors — independent of the text quantization step.
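The pipeline steps can be written down as command lines. The sketch below only assembles and prints them; the directory and file names are placeholders rather than the ones used for this release, and passing --outtype bf16 for the mmproj conversion is an assumption based on the companion file's name.

```python
import shlex

PLE_TENSOR = "per_layer_token_embd.weight"


def convert_cmd(model_dir, outfile, mmproj=False):
    """convert_hf_to_gguf.py invocation: BF16 text GGUF, or the mmproj companion."""
    cmd = ["python", "convert_hf_to_gguf.py", model_dir,
           "--outfile", outfile, "--outtype", "bf16"]
    if mmproj:
        cmd.append("--mmproj")
    return cmd


def quantize_cmd(src, dst, ple_q4_0=False):
    """llama-quantize invocation for Q4_K_M, optionally with the PLE-Q4_0 override."""
    cmd = ["llama-quantize"]
    if ple_q4_0:
        cmd += ["--tensor-type", f"{PLE_TENSOR}=q4_0"]
    return cmd + [src, dst, "Q4_K_M"]


# Placeholder paths for illustration only.
steps = [
    convert_cmd("./solarhive-e4b-merged", "text-bf16.gguf"),
    convert_cmd("./solarhive-e4b-merged", "mmproj-bf16.gguf", mmproj=True),
    quantize_cmd("text-bf16.gguf", "text-q4_k_m.gguf", ple_q4_0=True),   # laptop recipe
    quantize_cmd("text-bf16.gguf", "text-q4_k_m-standard.gguf"),         # standard recipe
]
for step in steps:
    print(shlex.join(step))
```

Each list can be handed to subprocess.run(step, check=True) from a llama.cpp checkout once the paths point at real files.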


Community Model

Parameter Value
Location Ann Arbor, Michigan (42.2808°N, 83.7430°W)
Community size 12 homes
Total panel capacity 72 kW
Shared battery storage 100 kWh
Grid region MISO (Midcontinent Independent System Operator)

Runtime Performance

CPU-only inference on Intel i5-1135G7 @ 2.4 GHz, 4 cores, 16 GB RAM, Ollama 0.21.0:

| Phase | Time | Speed | Notes |
| --- | --- | --- | --- |
| First query (cold) | ~65 s | ~2.2 tok/s | Includes ~55–60 s model load |
| Warm advisory query | ~10 s | ~9–10 tok/s | Single forward pass, no tools |
| Warm tool-calling loop | 25–60 s | ~9–10 tok/s | 2–3 rounds with live API latency |
GGUF blob ingestion (one-time, after ollama create): ~3–5 min for the 4.61 GB variant on a typical laptop SSD.


Technical Notes

  • PLE tensor override. The Q4_K_M mixed-precision strategy assigns Q6_K to per_layer_token_embd.weight [10752, 262144] (2.82 B params, the largest single tensor in the model). The float32 intermediate buffer for Q6_K conversion (~10.7 GB) OOMs llama-quantize on 16 GB RAM. The --tensor-type per_layer_token_embd.weight=q4_0 override eliminates the buffer; the benchmark scores show no measurable quality regression from the override on the 10-prompt benchmark.
  • One mmproj for many text variants. The mmproj companion file is independent of text-model quantization — llama-quantize only sees the text tower. The same 992 MB mmproj pairs with PLE-Q4_0, Standard, or any future Q5/Q6/Q8 variant.
  • Ollama /api/chat content-drop issue. Ollama 0.21.0's gemma4.go:306 parser detects but rejects fine-tuned Gemma 4's native tool-call format (bare keys + <|"|> delimiters). Use the /api/generate raw mode + manual prompt builder path (Path B in "Inference Path Selection" above) for tool-calling workloads.
  • Ollama multimodal support is evolving. Ollama 0.21.0's Modelfile syntax for Gemma 4 mmproj projector declaration is not yet first-class. For text + image / text + audio today, use llama-server --mmproj from llama.cpp b8863+ directly.
  • Sampling defaults. temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended Gemma 4 defaults). Set in both Modelfile and Modelfile.standard.
  • Context window. Modelfiles set num_ctx=4096. The base architecture supports up to 128 K; raise num_ctx for longer multi-round agentic loops at the cost of more RAM at inference.
  • No Unsloth dependency at inference. Once quantized, the GGUF files run via stock Ollama or llama.cpp. Unsloth was used only during fine-tuning.

Limitations

  • Prototype tested on a single community model (12 homes, Ann Arbor, MI). Real-world deployment requires validation across diverse geographies and community sizes.
  • The OpenAI-compatible /api/chat path is not the demo path — see "Inference Path Selection" above for the gemma4.go content-drop reasoning. Use the /api/generate raw mode + manual prompt builder path for production inference.
  • Image and audio modalities require llama-server --mmproj; Ollama-native multimodal recipe is pending upstream Ollama support.
  • The model occasionally uses "60 kW" instead of the correct 72 kW community capacity in direct (no-tool) responses — base-model tendency, addressed by tool-calling path which queries actual capacity.
  • Tool responses depend on external API availability. Open-Meteo and EIA have rate limits; OpenWeatherMap free tier allows 1,000 calls/day.
  • The battery state is currently a deterministic simulator (get_battery_state() in solarhive_inference.py) — real deployment requires integration with actual battery management systems.
  • The PLE-Q4_0 override trades a small quality margin on one tensor for laptop-class deployability. The standard variant exists as a higher-precision reference; both score 10/10 on the SolarHive 10-prompt benchmark.
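
To illustrate the rate-limit point, the sketch below turns a daily quota into the shortest safe refresh interval. The function name and the one-call-per-refresh default are assumptions for illustration; only the 1,000 calls/day figure comes from the list above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def min_poll_interval_seconds(calls_per_day, calls_per_refresh=1):
    """Smallest refresh interval (seconds) that stays within a daily call quota."""
    if calls_per_day <= 0 or calls_per_refresh <= 0:
        raise ValueError("quota and calls per refresh must be positive")
    return SECONDS_PER_DAY * calls_per_refresh / calls_per_day


# OpenWeatherMap free tier as quoted above: 1,000 calls/day.
print(min_poll_interval_seconds(1000))
```

With one call per refresh, the quota allows a refresh roughly every minute and a half; an agentic loop that issues several tool calls per turn shortens that budget proportionally.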

Future Iteration — Multi-Token Prediction (MTP) Drafters on Edge GGUF Runtimes

Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's benchmark was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.

Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, for sampling as well as greedy decoding: a smaller drafter proposes γ candidate tokens, the target scores all γ in a single parallel forward pass, accepted tokens are kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only the acceptance rate α, and therefore the walltime speedup, varies.
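
The accept/reject rule can be sketched for a single token position as below. This is a toy illustration over a three-token vocabulary, not the SolarHive or llama.cpp implementation; the distributions `p` (target) and `q` (drafter) and all function names are made up for the example.

```python
import random


def residual_distribution(p, q):
    """Normalised max(0, p - q): the distribution sampled after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        return list(p)  # p == q, so a rejection can never occur
    return [d / total for d in diff]


def speculative_step(p, q, rng):
    """Draft one token from q, then accept it or resample so the result follows p."""
    token = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True  # drafter token accepted
    corrected = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=corrected)[0], False


def output_distribution(p, q):
    """Exact distribution of the token returned by speculative_step."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q(x) * min(1, p(x)/q(x))
    reject_prob = 1.0 - sum(accepted)
    corrected = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accepted, corrected)]


p = [0.5, 0.3, 0.2]  # toy target distribution
q = [0.2, 0.5, 0.3]  # toy drafter distribution, deliberately mismatched
print(output_distribution(p, q))  # equals p up to floating-point rounding
print(speculative_step(p, q, random.Random(0)))
```

`output_distribution` computes the result analytically rather than by sampling, which makes the guarantee visible: the accepted mass min(p, q) plus the rejected mass redistributed over max(0, p − q) sums back to p for every token.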

Released drafter for E4B. google/gemma-4-E4B-it-assistant (~78.8 M params) is the canonical pair for google/gemma-4-E4B-it. Per the MTP overview, the drafter shares the input embedding table with the target and consumes the target's last-layer activations. Google reports up to 3× decode speedup on the 26B-A4B configuration; per-variant E4B numbers were not enumerated in the announcement.

Runtime support is partial for GGUF deployments.

Runtime | Listed by Google? | Source
Ollama | ✅ Tested-runtime list | Google blog
llama.cpp | ⚠️ Appears in docs runtime nav but not in the blog's tested-runtime list | Docs nav
LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang | ✅ Tested-runtime list | Google blog

Implementation paths on this edge GGUF tier (post-hackathon):

  1. Drafter GGUF conversion. Google ships the drafter as HF safetensors. To use against this Q4_K_M target via Ollama or llama.cpp, the drafter weights would need conversion through convert_hf_to_gguf.py — feasible reuse of the same toolchain that produced this target's GGUF, but the conversion is not in the canonical SolarHive registry today.
  2. llama.cpp speculative decoding. llama-speculative and llama-server --model-draft (-md) support vanilla speculative decoding per the 2023 paper. Whether the Gemma 4 MTP drafters' embedding-sharing + last-layer-activation conditioning architecture maps cleanly to llama.cpp's existing draft-model plumbing is unverified: Google's docs list the runtime, but the blog omits it from the tested set.
  3. Ollama paired drafter. Ollama is in Google's tested-runtime list; the exact CLI/API surface for drafter pairing is not yet documented in Ollama's public docs as of writing.
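
The conversion in path 1 might look like the sketch below, which only builds the two command lines and does not run them. The directory names and output file names are placeholders, and whether convert_hf_to_gguf.py accepts the drafter's architecture is exactly the open question noted above.

```python
import shlex
from pathlib import Path


def drafter_conversion_commands(llama_cpp_dir, hf_model_dir, out_dir, quant="Q4_K_M"):
    """Return [convert_cmd, quantize_cmd] as argument lists (nothing is executed)."""
    llama_cpp_dir, out_dir = Path(llama_cpp_dir), Path(out_dir)
    bf16_gguf = out_dir / "drafter-bf16.gguf"       # placeholder file name
    quant_gguf = out_dir / f"drafter-{quant}.gguf"  # placeholder file name
    convert = [
        "python", str(llama_cpp_dir / "convert_hf_to_gguf.py"),
        str(hf_model_dir),
        "--outfile", str(bf16_gguf),
        "--outtype", "bf16",
    ]
    # llama-quantize takes: input GGUF, output GGUF, quantization type.
    quantize = ["llama-quantize", str(bf16_gguf), str(quant_gguf), quant]
    return [convert, quantize]


# Placeholder paths; point these at a local llama.cpp checkout and a
# downloaded copy of the drafter's safetensors directory.
for cmd in drafter_conversion_commands("llama.cpp", "gemma-4-E4B-it-assistant", "out"):
    print(shlex.join(cmd))
```

Returning argument lists rather than shell strings keeps the commands safe to pass to `subprocess.run` once the conversion has been confirmed to work for this architecture.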

Planned measurement (post-hackathon). (a) Convert google/gemma-4-E4B-it-assistant to Q4_K_M GGUF via convert_hf_to_gguf.py + llama-quantize. (b) Re-run the parity benchmark with the drafter paired via llama.cpp's --model-draft flag. (c) Capture acceptance rate α, decode tokens per second, and walltime. (d) Cross-check against the Ollama paired-drafter API once documented. Correctness is invariant per the 2023 speculative-sampling guarantee; only α varies under target × drafter distribution mismatch.
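
Once α is measured, the 2023 paper's closed-form estimates give a sanity check on the observed speedup. The sketch below assumes, as the paper does, that acceptances are independent with a constant rate α, and that c is the drafter-to-target cost ratio per forward pass. The sample values of α, γ, and c are illustrative, not measurements from this release.

```python
def expected_tokens_per_target_pass(alpha, gamma):
    """Expected tokens emitted per target forward pass: (1 - a^(g+1)) / (1 - a)."""
    if not 0.0 <= alpha <= 1.0 or gamma < 0:
        raise ValueError("alpha must be in [0, 1] and gamma must be >= 0")
    if alpha == 1.0:
        return float(gamma + 1)  # every draft accepted, plus one target token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_walltime_speedup(alpha, gamma, cost_ratio):
    """Expected speedup over plain autoregressive decoding.

    Each iteration costs gamma drafter passes (cost_ratio each) plus one target pass.
    """
    return expected_tokens_per_target_pass(alpha, gamma) / (gamma * cost_ratio + 1.0)


# Illustrative inputs only: alpha=0.8, gamma=4, drafter at 5% of target cost.
print(round(expected_tokens_per_target_pass(0.8, 4), 4))
print(round(expected_walltime_speedup(0.8, 4, 0.05), 4))
```

Comparing the measured decode-tps ratio against this estimate helps separate a low acceptance rate from runtime overhead when the observed speedup falls short.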


Companion Repositories

Model | Repository | Purpose
SolarHive 26B A4B LoRA | solarhive-26b-a4b-lora | Cloud inference with full multimodal + function calling (LoRA adapters + Unsloth)
SolarHive 26B A4B NF4 | solarhive-26b-a4b-nf4 | Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs
SolarHive E4B LoRA | solarhive-e4b-lora | E4B adapter weights (~200 MB); apply over base via Unsloth
SolarHive E4B Safetensors | solarhive-e4b-ollama | Merged safetensors for transformers-native multimodal research use
SolarHive E4B GGUF | This repo | Edge deployment: 2 text quants + 1 mmproj for Ollama / llama.cpp
SolarHive Dataset | solarhive-community-solar-multimodal | 1,727 training examples (1,713 text + 14 image-grounded)
LiteRT-LM Python edge runtime | solarhive_e4b_litert_v3.1.ipynb | LiteRT Special Tech Track entry: runs upstream base litert-community/gemma-4-E4B-it-litert-lm .litertlm (3.66 GB) + SolarHive UX layer + on-device agentic loop. Q&A 8/8 on Colab Pro CPU + High-RAM. A fine-tuned LiteRT-LM bundle is a planned next iteration once the upstream gemma4 example module lands in ai_edge_torch.generative.examples/.
GitHub (source) | the-gemma4-good-hackathon-solarhive | Full source code, training notebooks, solarhive_inference.py (cloud), solarhive_inference_e4b_gguf_ollama.py (local laptop)

Citation

@misc{solarhive2026,
  title={SolarHive: AI-Powered Community Solar Energy Intelligence},
  author={Youshen Lim},
  year={2026},
  url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
  note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}

Links

Built with Gemma 4 in Ann Arbor, Michigan. April 2026.

Gemma is a trademark of Google LLC.
