Instructions to use Truthseeker87/solarhive-e4b-ollama with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Truthseeker87/solarhive-e4b-ollama with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Truthseeker87/solarhive-e4b-ollama")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Truthseeker87/solarhive-e4b-ollama")
model = AutoModelForMultimodalLM.from_pretrained("Truthseeker87/solarhive-e4b-ollama")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use Truthseeker87/solarhive-e4b-ollama with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Truthseeker87/solarhive-e4b-ollama"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Truthseeker87/solarhive-e4b-ollama",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Truthseeker87/solarhive-e4b-ollama

SGLang

How to use Truthseeker87/solarhive-e4b-ollama with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Truthseeker87/solarhive-e4b-ollama" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Truthseeker87/solarhive-e4b-ollama",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Truthseeker87/solarhive-e4b-ollama" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Truthseeker87/solarhive-e4b-ollama",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Unsloth Studio

How to use Truthseeker87/solarhive-e4b-ollama with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Truthseeker87/solarhive-e4b-ollama to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for Truthseeker87/solarhive-e4b-ollama to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for Truthseeker87/solarhive-e4b-ollama to start chatting

Load model with FastModel

# Install first (shell command, not Python): pip install unsloth
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="Truthseeker87/solarhive-e4b-ollama",
    max_seq_length=2048,
)

Docker Model Runner
How to use Truthseeker87/solarhive-e4b-ollama with Docker Model Runner:
```
docker model run hf.co/Truthseeker87/solarhive-e4b-ollama
```

SolarHive E4B — BF16 Merged Safetensors

LoRA fine-tuned Gemma 4 E4B (8B), merged to 16-bit safetensors. This is the source artifact for direct transformers inference and for the llama.cpp convert_hf_to_gguf.py → Q4_K_M GGUF conversion, which powers Ollama and llama.cpp edge deployment via the solarhive-e4b-gguf companion repo.

For Ollama or llama.cpp edge deployment on a 16 GB CPU laptop, use the solarhive-e4b-gguf repo instead — it ships the 5.34 GB Q4_K_M GGUF (standard Q6_K-PLE recipe) plus the 992 MB mmproj-BF16.gguf companion (vision + audio), with Modelfiles ready for ollama create and a 10/10 score on the single-pass SolarHive project-held-out 10-prompt parity check.

The --experimental Ollama path documented previously causes ollama create to run out of memory on machines with ≤16 GB RAM (the 16 GB BF16 safetensors blob does not fit during ingestion). On hardware with ≥24 GB RAM the experimental import works, but the GGUF pipeline (built using llama.cpp convert_hf_to_gguf.py + llama-quantize) is the recommended edge deployment path for everyone else.

This repository now serves three roles:

Source for GGUF conversion via llama.cpp's convert_hf_to_gguf.py (text tower) and convert_hf_to_gguf.py --mmproj (vision + audio projector). See solarhive-e4b-gguf for the produced GGUF artifacts.
Transformers-native multimodal use — load with AutoModelForCausalLM for full image + audio + text in Python (requires ≥24 GB RAM or A100-class GPU).
Reference for further fine-tuning — extend the LoRA on additional data using Unsloth FastVisionModel.

Built for the Gemma 4 Good Hackathon (Google DeepMind x Kaggle).


Base Model	google/gemma-4-e4b-it
Architecture	Dense + PLE — 8B total, 4.5B effective
Fine-Tuning	LoRA via Unsloth (BF16)
Training Data	1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~150M params), unmodified by our LoRA per the Vertex AI SFT recipe
Converged Loss	0.9218
Project-held-out check	9/10 (5/5 domain Q&A + 4/5 tool calling) — May 2026 final run, multi-call regression on TQ5 (see Multi-Variant Deployment Validation below)
Training Time	420 seconds (~7 minutes)
Compute	Google Colab Pro
License	MIT (adapters) / Gemma Terms (base model)

Model Overview

SolarHive E4B is the edge companion to SolarHive 26B A4B. While the 26B model powers cloud inference with full multimodal VQA, the E4B model is optimized for local deployment via Ollama on consumer hardware.

Privacy-first: Running Gemma 4 locally means community energy data never leaves the neighborhood. No cloud dependency, no internet requirement, no data privacy concerns. A village in rural India, a suburb in Michigan, and a coastal town recovering from a hurricane all get the same intelligence.

This repository contains the fully merged model (base + LoRA baked together) — no separate base model download needed.

Training Details

Parameter	Value
Method	LoRA via Unsloth `FastVisionModel` (BF16, RTX PRO 6000 96 GB)
LoRA rank	16
LoRA alpha	16
LoRA dropout	0
Target modules	All linear layers
Learning rate	2e-4
Optimizer	AdamW 8-bit
Warmup steps	5
Epochs	3
Max sequence length	2048
Precision	BF16
Seed	3407
Trainable parameters	41.2M / 8.0B (0.51%)

Training Loss

Metric	Value
Converged loss (last 20 steps)	0.9218
Final step loss	0.0635
Minimum loss	0.0635
Total steps	324
Training time	420 seconds

Canonical metric: Converged loss (last 20 steps) is the only smoothed convergence indicator. Final step and Minimum are single-batch point statistics; mini-batch loss is noisy step-to-step, so one easy batch can drop a point estimate well below the rolling-average trend.
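The difference between a trailing average and a single-step reading can be sketched as follows. This is a minimal illustration only: the helper name `converged_loss` and the loss values are hypothetical and are not taken from the actual training log.

```python
from statistics import mean


def converged_loss(step_losses, window=20):
    """Mean of the last `window` per-step losses (a simple trailing average)."""
    if not step_losses:
        raise ValueError("step_losses is empty")
    return mean(step_losses[-window:])


# Hypothetical per-step losses, NOT the real training log: a noisy plateau
# followed by one unusually easy final batch.
losses = [0.95, 0.90, 0.92, 0.88, 0.97, 0.91, 0.06]

smoothed = converged_loss(losses, window=5)
print(f"trailing mean: {smoothed:.3f}")
print(f"final step:    {losses[-1]:.3f}")
print(f"minimum:       {min(losses):.3f}")
```

A single easy batch pulls the final-step and minimum values far below the trailing mean, which is why the smoothed figure is the one to compare across runs.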

Training Data

Same canonical training corpus as the 26B A4B model — solarhive-community-solar-multimodal, 1,727 rows:

413 hand-crafted examples spanning 15+ US cities and 9 energy domains
~1,117 API-grounded examples from live Open-Meteo, PVWatts, OWM, and EIA data
183 tool-calling examples (positive, negative refusals, follow-up clarifications, failure-recovery)
14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs
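As a sanity check, the row counts above can be summed and related to the step count in the Training Loss table. The sketch assumes that only the text rows are trained on (the card describes the fine-tune as text-only) and that a final partial batch counts as a step; the effective batch size is not stated in the card, so the value derived here is an inference.

```python
import math

# Row counts as listed above (the API-grounded count is given as approximate).
corpus = {
    "hand_crafted": 413,
    "api_grounded": 1117,
    "tool_calling": 183,
    "image_grounded": 14,
}
total_rows = sum(corpus.values())
text_rows = total_rows - corpus["image_grounded"]

# From the Training Details and Training Loss tables.
epochs, total_steps = 3, 324
steps_per_epoch = total_steps // epochs

# Assumption: text rows only, last partial batch counted as one step.
# The batch size is inferred here, not documented.
implied_batch = [
    b for b in range(1, 65) if math.ceil(text_rows / b) == steps_per_epoch
]

print(total_rows, text_rows, steps_per_epoch, implied_batch)
```

Under those assumptions the totals reconcile with the 1,727-row corpus (1,713 text + 14 image-grounded), and the 324 steps are consistent with a single effective batch size.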

Hardware

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB GDDR7 total, 94.97 GB max usable per Unsloth)
Platform: Google Colab Pro (G4 VM)

Project-Held-Out Results

Domain Q&A (5/5)

Question	Result
Solar production when humidity exceeds 80%?	Correct
Battery SOC threshold for grid export?	Correct
Home #3 underperforming 22% — diagnostic checklist?	Correct
Winter snow on panels — prioritize actions?	Correct
Grid frequency 59.8 Hz — microgrid implications?	Correct

Note on validation history: the 5/5 Q&A above is from the initial 8-question validation harness used during fine-tune development. The canonical headline number is the May 2026 final-run multi-variant validation (single-pass project-held-out 10-question parity check) — see below.

Tool inventory + inference-time When2Call validation

solarhive_inference.py exposes 5 tools to the model — all three keyed APIs (OWM_API_KEY, EIA_API_KEY, NREL_API_KEY) actively wired:

Tool	API	Returns
`get_weather(location)`	OpenWeatherMap (`OWM_API_KEY`)	Temperature, clouds %, wind, humidity, sunrise/sunset
`get_solar_production(clouds_pct, temp_f)`	Open-Meteo GHI (keyless)	Production kW, efficiency %, GHI W/m², temp derating
`get_battery_state()`	Community BMS (sim)	State of charge, capacity, charging status
`get_grid_status()`	EIA Open Data (`EIA_API_KEY`)	Pricing period, rate/kWh, renewable %, CO2 intensity
`get_nrel_pvwatts_baseline()`	NREL PVWatts v8 (`NREL_API_KEY`)	Annual + current-month typical kWh + avg kW for the 72 kW array

Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)} per call. Shared across the data-generation pipeline, the fine-tune SFT preprocessing layer, and the inference agentic loop — inference matches training distribution exactly.
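A minimal sketch of that message sequence is shown below. The outer shapes follow the description above; the inner layout of each `tool_calls` entry (an OpenAI-style `function` object) and the example result payload are assumptions, and `tool_turn_messages` is a hypothetical helper rather than a function from solarhive_inference.py.

```python
import json


def tool_turn_messages(calls):
    """Build one assistant tool_calls message plus one tool message per call.

    `calls` is a list of (name, arguments, result) triples.
    The inner tool_calls entry layout is an assumption (OpenAI-style).
    """
    assistant = {
        "role": "assistant",
        "tool_calls": [
            {"type": "function", "function": {"name": name, "arguments": args}}
            for name, args, _ in calls
        ],
    }
    tool_msgs = [
        {"role": "tool", "name": name, "content": json.dumps(result)}
        for name, _, result in calls
    ]
    return [assistant, *tool_msgs]


# Hypothetical result payload; the field names are illustrative only.
msgs = tool_turn_messages(
    [("get_grid_status", {}, {"pricing_period": "off-peak", "rate_per_kwh": 0.12})]
)
for m in msgs:
    print(m)
```

Note that the assistant message carries no content key, which is the case the two-step tokenization note in Technical Notes is concerned with.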

When2Call probes. Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851 — the paper documents 9–67% tool-hallucination rates on (c)+(d) in untrained community models:

(b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
(c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
(d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)

Models trained without explicit unable-to-answer and follow-up clarification examples typically fail (c) + (d). The SolarHive corpus includes 16 such examples (10 unable-to-answer + 6 follow-up clarification) following the When2Call taxonomy.

Multi-Variant Deployment Validation (Final Run, May 2026) — E4B regression on When2Call (c)

End-to-end inference run on Colab Pro G4. This E4B BF16 merged variant loaded from a local cache (16.9 GB VRAM utilization, ~10 min runtime).

Project-held-out parity check: 5/5 Q&A + 4/5 tool = 9/10 on the 10-question set — matches the A4B family on the 9 deterministic questions; the single FAIL is the lenient multi-call probe (TQ5 — "Compare today's irradiance forecast across Ann Arbor, Phoenix, and Seattle", min_calls=2) where this variant emitted only 1 get_weather call. Notably, the E4B LoRA + base variant (same fine-tune, applied via Unsloth instead of merged) DOES chain 3 calls on the same probe and scores 10/10 — pattern reproducible across runs.
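The lenient multi-call criterion can be expressed as a simple count threshold. The helper below is a hypothetical sketch, not the project's harness; the call counts are those reported above, and the assumption that the LoRA + base variant's three calls were all get_weather is ours.

```python
def passes_multi_call(tool_calls, min_calls):
    """Lenient check: pass when at least `min_calls` tool calls were emitted."""
    return len(tool_calls) >= min_calls


# Call counts as reported above for TQ5 (min_calls=2).
# Assumption: the LoRA + base calls were get_weather; the card gives only the count.
merged_calls = ["get_weather"]
lora_base_calls = ["get_weather", "get_weather", "get_weather"]

print("merged:     ", passes_multi_call(merged_calls, min_calls=2))
print("LoRA + base:", passes_multi_call(lora_base_calls, min_calls=2))
```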

When2Call probes — measured 2/3 (final run May 2026):

Probe	E4B merged behavior	Score
(b) "current grid rate?"	✅ Correctly calls `get_grid_status`	PASS
(c) "How much will a 10 kW array produce today?"	❌ Auto-fills location and calls `get_solar_production` instead of asking back	FAIL
(d) "current AQI in Ann Arbor?"	✅ Genuinely disclaims (no fabrication, no tool call)	PASS
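The 2/3 score can be reproduced by comparing expected and observed behavior per probe. The behavior labels (`tool_call`, `follow_up`, `refusal`) and the `score` helper are hypothetical names chosen for this sketch; the observed values are transcribed from the table above.

```python
# Expected behavior per probe, as described in the When2Call section.
EXPECTED = {
    "b": ("tool_call", "get_grid_status"),
    "c": ("follow_up", None),
    "d": ("refusal", None),
}

# Observed E4B merged behavior, transcribed from the table above.
OBSERVED = {
    "b": ("tool_call", "get_grid_status"),
    "c": ("tool_call", "get_solar_production"),
    "d": ("refusal", None),
}


def score(expected, observed):
    """Return (passed, total, per-probe verdicts) using exact behavior match."""
    verdicts = {
        probe: "PASS" if observed.get(probe) == want else "FAIL"
        for probe, want in expected.items()
    }
    passed = sum(v == "PASS" for v in verdicts.values())
    return passed, len(expected), verdicts


passed, total, verdicts = score(EXPECTED, OBSERVED)
print(f"{passed}/{total}", verdicts)
```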

Cross-variant pattern: the E4B LoRA + base variant is inferred to score 2/3 by lossless equivalence with this merged variant (the merge step is lossless on weights, so the When2Call decision boundary is identical). The −1/3 W2C delta vs the A4B family (3/3 directly measured on A4B LoRA, inferred-lossless on A4B merged + NF4) is the empirical signature of size-vs-refusal scaling.

Honest finding — size-vs-refusal scaling is real, and was the pre-stated hypothesis. This E4B fine-tune regresses on (c) compared to the A4B LoRA baseline (which scores 3/3). The smaller model with less reasoning depth more readily auto-fills missing parameters when it should ask back — exactly the failure mode Ross et al. 2025 document at 9-67% rates in untrained community models. The fine-tune closes (b)+(d) at this size but doesn't fully close (c).

This was the expected outcome going in, per the official Google Gemma 4 Core docs "Parameter sizes and quantization": "Models with higher parameters and bit counts (higher precision) are generally more capable, but are more expensive to run." E4B's 8B total / 4.5B effective parameters / ~150M vision encoder vs A4B's 25.2B total / 3.8B active (MoE) / ~550M vision encoder reflect a deliberate ~3× capacity gap on the dimension that drives reasoning-heavy refusal/follow-up behavior. The validation confirms the documented scaling — not a defect, but architecture-aware deployment design.

Quantitative reinforcement from Unsloth's published Gemma 4 benchmarks: E4B scores 69.4% on MMLU Pro (vs 82.6% for 26B A4B, a 13.2 pp gap), 52.6% on MMMU Pro (vs 73.8%, a 21.2 pp gap), and 42.5% on AIME 2026 (vs 88.3%, a 45.8 pp gap). The AIME math-reasoning gap and the MMMU Pro multimodal-reasoning gap are consistent with the (c) When2Call regression observed here: the smaller model's published reasoning-benchmark deltas line up with its 2-of-3 When2Call score against the A4B baseline's 3-of-3. E4B is the right choice for the volume of well-specified, in-scope queries that dominate everyday community-energy interactions; A4B handles the harder reasoning edge cases.
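The percentage-point gaps quoted above follow directly from the paired scores; a short check, using only the numbers in the paragraph:

```python
# Scores in percent as quoted above: (E4B, 26B A4B).
benchmarks = {
    "MMLU Pro": (69.4, 82.6),
    "MMMU Pro": (52.6, 73.8),
    "AIME 2026": (42.5, 88.3),
}

# Gap in percentage points, rounded to one decimal place.
gaps = {name: round(a4b - e4b, 1) for name, (e4b, a4b) in benchmarks.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} pp")
```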

Deployment recommendation: Use this E4B variant for the volume of well-specified, in-scope queries (production estimates, grid pricing, maintenance guidance) where (b)-category routing dominates. Route under-specified or out-of-scope queries to the A4B cloud variant for correct refusal + follow-up behavior. A future fine-tune could increase the E4B follow-up clarification example count (currently 6) and unable-to-answer count (currently 10) to close the gap.

How to Use

Loading with transformers

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model = AutoModelForCausalLM.from_pretrained(
    "Truthseeker87/solarhive-e4b-ollama",  # This repo (merged safetensors)
    dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "Truthseeker87/solarhive-e4b-ollama",
    trust_remote_code=True,
)

Edge Deployment — use the GGUF repo

For Ollama or llama.cpp on a 16 GB CPU laptop, download the GGUF artifacts instead of trying to import these safetensors:

hf download Truthseeker87/solarhive-e4b-gguf \
  solarhive-e4b-q4_k_m.gguf Modelfile \
  --local-dir ./solarhive-gguf
cd ./solarhive-gguf
ollama create solarhive -f Modelfile
ollama run solarhive "What's the best time to run my dishwasher today?"

The solarhive-e4b-gguf repo ships the Q4_K_M GGUF (standard Q6_K-PLE recipe, 5.34 GB) plus a 992 MB mmproj-BF16.gguf for full multimodal via llama-server --mmproj.

Edge Deployment via Ollama `--experimental` (≥24 GB RAM only)

If you have ≥24 GB system RAM, you can experimentally import these safetensors directly via Ollama:

git clone https://huggingface.co/Truthseeker87/solarhive-e4b-ollama
cd solarhive-e4b-ollama
cat > Modelfile << 'EOF'
FROM .
SYSTEM "You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan. Use the available tools to get real-time data before answering. Be specific, reference actual data, and keep responses concise (3-5 sentences)."
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 4096
EOF
ollama create solarhive --experimental -f Modelfile
ollama run solarhive "What's the best time to run my dishwasher today?"

OOM warning: on 16 GB RAM, ollama create --experimental crashes around 44% blob processing as Ollama tries to materialize the full 16 GB BF16 model in memory during ingestion. Use the GGUF path above instead.
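The memory pressure follows from simple arithmetic on the weight size. The sketch below counts raw weights only, in decimal GB, and ignores the operating system and Ollama's own working buffers, so real headroom is smaller than shown.

```python
def weights_gb(n_params, bytes_per_param):
    """Approximate raw weight size in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9


TOTAL_PARAMS = 8.0e9  # "8B total" from the model table
BF16_BYTES = 2        # bfloat16 stores each weight in 2 bytes

bf16_gb = weights_gb(TOTAL_PARAMS, BF16_BYTES)
for ram_gb in (16, 24):
    headroom = ram_gb - bf16_gb
    print(f"{ram_gb} GB RAM: weights {bf16_gb:.1f} GB, headroom {headroom:.1f} GB")
```

On a 16 GB machine the weights alone consume the entire memory budget, which matches the ingestion failure described above.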

The official base (non-fine-tuned) E4B is also available as a pre-built GGUF on ollama.com/library/gemma4:e4b (9.6 GB, Q4_K_M). Our solarhive-e4b-gguf adds 1,727 examples of community solar domain expertise on top.

GGUF Conversion via llama.cpp (reproducibility recipe)

These safetensors are the source artifact for the GGUF deployment. Reproducible via llama.cpp tooling:

# Text tower → BF16 GGUF (~14 GB intermediate)
python convert_hf_to_gguf.py --outtype bf16 \
  --outfile solarhive-e4b-bf16.gguf \
  Truthseeker87/solarhive-e4b-ollama/

# Quantize text → Q4_K_M (shipped recipe, 5.34 GB; needs ≥30 GB RAM)
llama-quantize \
  solarhive-e4b-bf16.gguf solarhive-e4b-q4_k_m.gguf Q4_K_M

# Multimodal projector (vision SigLIP + audio Conformer, ~992 MB)
python convert_hf_to_gguf.py --mmproj --outtype bf16 \
  --outfile mmproj-solarhive-e4b-BF16.gguf \
  Truthseeker87/solarhive-e4b-ollama/

The shipped 5.34 GB Q4_K_M quant requires ≥30 GB RAM at quantization time (the Q6_K-PLE tensor needs a ~10.7 GB float32 buffer). To quantize on a 16 GB laptop instead, add --tensor-type per_layer_token_embd.weight=q4_0; it bypasses that buffer and yields a smaller (~4.6 GB) GGUF, validated quality-safe in development. See the solarhive_quantize_e4b.ipynb notebook for the high-RAM recipe.

Core Capabilities

1. Multimodal Visual Question Answering (3 Modes)

Available because the base Gemma 4 E4B vision encoder (~150M params) is preserved unmodified in these merged weights:

Mode	Input	Output
Sky Analysis	Sky photograph	Cloud coverage %, production forecast, storage recommendation
Panel Inspection	Panel photograph	Dirt/damage/shading detection, efficiency impact estimate
Neighborhood Assessment	Aerial/satellite image	Panel inventory, expansion priorities, shading analysis

2. Native Function Calling (5 Tools — all 3 keyed APIs wired)

Tool	API	Returns
`get_weather(location)`	OpenWeatherMap (`OWM_API_KEY`)	Temperature, clouds %, wind, humidity, sunrise/sunset
`get_solar_production(clouds_pct, temp_f)`	Open-Meteo GHI (keyless)	Production kW, efficiency %, GHI W/m², temp derating
`get_battery_state()`	Community BMS (sim)	State of charge, capacity, charging status
`get_grid_status()`	EIA Open Data (`EIA_API_KEY`)	Pricing period, rate/kWh, renewable %, CO2 intensity
`get_nrel_pvwatts_baseline()`	NREL PVWatts v8 (`NREL_API_KEY`)	Annual + current-month typical kWh + avg kW for the 72 kW array
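For illustration, the five tools could be declared as JSON-schema function definitions of the kind commonly passed to a chat template. Tool and argument names come from the table above; the argument types, the descriptions, and the `tool` helper are assumptions and may differ from what solarhive_inference.py actually declares.

```python
def tool(name, description, params=None):
    """Build a JSON-schema style function declaration (hypothetical helper)."""
    params = params or {}
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": list(params),
            },
        },
    }


# Argument types below are assumed, not taken from the project source.
TOOLS = [
    tool("get_weather", "Current weather for a location.",
         {"location": {"type": "string"}}),
    tool("get_solar_production", "Estimated output from cloud cover and temperature.",
         {"clouds_pct": {"type": "number"}, "temp_f": {"type": "number"}}),
    tool("get_battery_state", "Shared battery state of charge (simulated)."),
    tool("get_grid_status", "Grid pricing period and carbon intensity."),
    tool("get_nrel_pvwatts_baseline", "Typical production baseline for the array."),
]

print([t["function"]["name"] for t in TOOLS])
```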

3. Selective Tool Reasoning

The model decides when to call tools — it does not blindly invoke all of them:

"What time does peak pricing start?"
→ Calls: get_grid_status() only

"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()

"What are general maintenance tips for panels?"
→ Calls: none (answers from training knowledge)

Community Model

Parameter	Value
Location	Ann Arbor, Michigan (42.2808°N, 83.7430°W)
Community size	12 homes
Total panel capacity	72 kW
Shared battery storage	100 kWh
Grid region	MISO (Midcontinent Independent System Operator)

Technical Notes

Merged BF16 safetensors. Base + LoRA fused via Unsloth save_pretrained_merged("merged_16bit"). Loads with plain transformers.AutoModelForCausalLM.from_pretrained(...) — no PEFT or Unsloth dependency at inference time.
Vision tower frozen during fine-tune. VQA at inference uses the base model's pretrained vision encoder unmodified, matching the Vertex AI SFT recipe which freezes both vision and audio towers during text-focused fine-tuning.
Two-step tokenization at inference. Single-step tokenize=True crashes in transformers 5.5.x on messages without a content key (e.g., tool_calls messages). Always render text first (tokenize=False) then tokenize separately.
Sampling defaults. temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended Gemma 4 defaults).
Chat template. gemma-4 (per Unsloth Tip #1 for E2B/E4B). The gemma-4-thinking template is reserved for 26B/31B reasoning-class variants. The simpler template is more robust across downstream Ollama / llama.cpp runtimes that don't expose enable_thinking=False at the runtime layer.
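The two-step tokenization note above can be sketched as a small wrapper. `render_then_tokenize` is a hypothetical helper name, and `StubProcessor` is a stand-in with a made-up template format so the example runs without downloading weights; with the real model you would pass the `AutoProcessor` instance instead.

```python
def render_then_tokenize(processor, messages, **tokenize_kwargs):
    """Render the chat template to text, then tokenize that text separately."""
    prompt = processor.apply_chat_template(
        messages,
        tokenize=False,  # step 1: produce text only
        add_generation_prompt=True,
    )
    return processor(text=prompt, **tokenize_kwargs)  # step 2: tokenize


class StubProcessor:
    """Stand-in for AutoProcessor; the template format here is invented."""

    def apply_chat_template(self, messages, tokenize=False, add_generation_prompt=False):
        lines = [
            f"<{m['role']}> {m.get('content', m.get('tool_calls', ''))}"
            for m in messages
        ]
        if add_generation_prompt:
            lines.append("<model>")
        return "\n".join(lines)

    def __call__(self, text, **kwargs):
        # Whitespace split instead of a real tokenizer.
        return {"input_ids": [text.split()]}


messages = [
    {"role": "user", "content": "What's the current grid rate?"},
    # Assistant message with tool_calls and no content key.
    {"role": "assistant",
     "tool_calls": [{"function": {"name": "get_grid_status", "arguments": {}}}]},
]
inputs = render_then_tokenize(StubProcessor(), messages)
print(len(inputs["input_ids"][0]))
```

With the real processor the call would look like `render_then_tokenize(processor, messages, return_tensors="pt")`, followed by moving the result to the model's device.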

Limitations

Prototype scope. Tested on a single community model (12 homes, Ann Arbor, MI). Real-world deployment requires validation across diverse geographies and community sizes.
Smaller model, weaker refusal/follow-up. When2Call (c) regression vs the A4B baseline (2/3 vs 3/3 — see Multi-Variant Deployment Validation above). Route under-specified or out-of-scope queries to the A4B cloud variant for correct refusal + follow-up behavior.
Occasional capacity hallucination. The base model's prior occasionally surfaces "60 kW" instead of the correct 72 kW community capacity in direct (no-tool) responses. The tool-calling path (which queries actual capacity from get_nrel_pvwatts_baseline) avoids this.
External API dependence. Tool responses depend on Open-Meteo, OWM, EIA, and PVWatts availability with their respective rate limits.
Battery state is simulated. get_battery_state() is a deterministic in-memory simulator for demonstrations — real deployment requires integration with actual battery management systems.
Single-trial multi-variant validation. The May 2026 final-run project-held-out numbers are from one inference pass; a multi-trial bootstrap would strengthen the multi-call regression claim against temperature-1.0 stochasticity.
Memory. ~16 GB BF16 safetensors require ≥24 GB system RAM at load time — does not fit on consumer 16 GB laptops in this format. For 16 GB laptops, use the solarhive-e4b-gguf Q4_K_M variant.

Future Iteration — Multi-Token Prediction (MTP) Drafters

Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's final project-held-out check was captured. The numbers above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.

Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, Fast Inference from Transformers via Speculative Decoding, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under sampling as well as greedy decoding: a smaller drafter proposes γ candidate tokens, the target scores all γ in a single parallel forward pass, the accepted prefix is kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only the acceptance rate α, and therefore the wall-clock speedup, varies.
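The distribution-preservation property can be checked on a toy example. The sketch below implements the per-token rule from the speculative-sampling literature (accept a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized positive part of p − q) and computes the resulting output distribution exactly. The three-token vocabulary and the probabilities are illustrative, not measured from any Gemma model.

```python
def accept_prob(p, q, token):
    """Probability the target accepts a drafted token: min(1, p(x) / q(x))."""
    if q[token] == 0.0:
        return 0.0  # the drafter never proposes this token
    return min(1.0, p[token] / q[token])


def residual(p, q):
    """Distribution used on rejection: normalized max(0, p(x) - q(x))."""
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(raw)
    return [r / z for r in raw] if z > 0 else list(p)


def one_step_output_distribution(p, q):
    """Exact distribution of the token emitted by one draft-and-verify step."""
    accepted = [qi * accept_prob(p, q, i) for i, qi in enumerate(q)]
    reject_mass = 1.0 - sum(accepted)
    res = residual(p, q)
    return [a + reject_mass * r for a, r in zip(accepted, res)]


# Toy distributions over a 3-token vocabulary (illustrative numbers).
p = [0.6, 0.3, 0.1]  # target model
q = [0.2, 0.5, 0.3]  # drafter, deliberately mismatched

out = one_step_output_distribution(p, q)
print([round(x, 6) for x in out])
```

Even with a badly mismatched drafter, the emitted-token distribution equals the target distribution; the mismatch shows up only as a larger rejection mass.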

What Google released on May 5, 2026. Paired drafter checkpoints for all four IT-tuned Gemma 4 variants — gemma-4-E2B-it-assistant, gemma-4-E4B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-31B-it-assistant — discoverable via the google/gemma-4 Hugging Face collection and on Kaggle Models. The drafters share the input embedding table with their paired target and consume the target's last-layer activations (architecture per the MTP overview). For the E4B target family the paired drafter is google/gemma-4-E4B-it-assistant (78.8 M params). Google reports up to 3× decode speedup with no quality degradation on the headline 26B-A4B configuration and **2.2×** on Apple Silicon at batch sizes 4–8; per-variant E4B numbers were not enumerated in the announcement. Tested runtimes named in the blog: LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama.

Integration via Hugging Face Transformers is a plain-AutoModelForCausalLM two-line load plus one extra kwarg:

target    = AutoModelForCausalLM.from_pretrained("Truthseeker87/solarhive-e4b-ollama",        dtype=torch.bfloat16, ...)
assistant = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it-assistant",          dtype=torch.bfloat16, ...)
target.generate(**inputs, assistant_model=assistant)  # MTP enabled

The merged-safetensors load path on this repo is the cleanest E4B integration surface — no PEFT/Unsloth wrapping; the Hugging Face Transformers assistant_model= kwarg works directly.

Open question specific to this LoRA-merged BF16 target. Per the 2023 speculative-sampling guarantee, correctness is invariant to drafter quality — the target's verification step preserves the exact output distribution regardless of what the drafter proposes. What varies is acceptance rate α, since Google's released drafter was trained against the base gemma-4-E4B-it, not against this LoRA-merged target. Measured α at the edge BF16 tier is the planned post-hackathon contribution; a cloud-tier measurement against the A4B merged target is captured by the gated future-iteration cell in solarhive_inference.py §14.
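To show why α matters, the expected number of tokens produced per target forward pass can be computed from the closed form in Leviathan et al., (1 − α^(γ+1)) / (1 − α), which assumes each drafted token is accepted independently with probability α. The acceptance rates below are hypothetical; none are measured for this model, and wall-clock speedup additionally depends on the drafter's own cost.

```python
def expected_tokens_per_pass(alpha, gamma):
    """Expected tokens per target forward pass under i.i.d. acceptance rate alpha."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)  # every drafted token accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


# Hypothetical acceptance rates; none of these are measured for this model.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, gamma=4), 2))
```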

Companion Repositories

Model	Repository	Purpose
SolarHive 26B A4B LoRA	solarhive-26b-a4b-lora	Cloud inference with full multimodal + function calling (LoRA adapters)
SolarHive 26B A4B Merged	solarhive-26b-a4b-merged	Full BF16 cloud model (~48 GB) — production inference, no PEFT/Unsloth dep
SolarHive 26B A4B NF4	solarhive-26b-a4b-nf4	Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs
SolarHive E4B LoRA	solarhive-e4b-lora	E4B adapter weights (~200 MB) — apply over base via Unsloth
SolarHive E4B safetensors	This repo	Source safetensors for transformers research / GGUF conversion via llama.cpp
SolarHive E4B GGUF	solarhive-e4b-gguf	Edge deployment — Q4_K_M GGUF + mmproj for Ollama / llama.cpp on 16 GB CPU laptop. 10/10 project-held-out check.
SolarHive Dataset	solarhive-community-solar-multimodal	1,727 training examples (1,713 text + 14 image-grounded)
LiteRT-LM Python edge runtime	`solarhive_e4b_litert_v3.1.ipynb`	LiteRT Special Tech Track entry — runs upstream base `litert-community/gemma-4-E4B-it-litert-lm` `.litertlm` (3.66 GB) + SolarHive UX layer + on-device agentic loop with native Gemma 4 function calling. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream `gemma4` example module lands in `ai_edge_torch.generative.examples/`.
GitHub	the-gemma4-good-hackathon-solarhive	Full source code, training & quantization notebooks, data principles

Fine-Tuning Architecture — Text-Only on the Multimodal-Capable Corpus

The shipped fine-tune is text-only on the canonical solarhive-community-solar-multimodal corpus (1,727 rows = 1,713 text + 14 image-grounded). Image rows are skipped at the data-prep layer; the training pipeline pre-renders only text rows for TRL's default text collator. Multimodal fine-tuning is deferred post-hackathon — a real image corpus and a held-out VQA benchmark would be prerequisites; the dataset's image schema is preserved so a future multimodal fine-tune can re-enable image rows without changing the corpus.

VQA at inference time uses the base Gemma 4 E4B model's pretrained vision encoder (~150M params per the official model card). Our LoRA targets only the language-model linear layers (target=all-linear); the vision tower is not modified. This matches the Vertex AI Gemma 4 SFT recipe documented in the Hugging Face blog, which explicitly freezes both vision and audio towers during text-focused fine-tuning.

Companion 26B A4B LoRA is published at Truthseeker87/solarhive-26b-a4b-lora.

The dataset uses the project archive for its 14 image-grounded Q&A turns (7 Ann Arbor sky photos × 2 turns). Image-source planning pivoted twice: the SWIM corpora (NUS) were rejected for CC BY-NC licensing, and NREL SRRL was rejected because the legacy MIDC SkyCam image archive ended May 2017 (modern ASI-16 only exposes derived measurements). The shipped dataset uses the project archive only — fewer images, but every label is human-confirmed and every paired Q&A traces back to the same GHI / temperature-derating formula used elsewhere in the dataset.

The fine-tune notebook has been pre-aligned with the official Unsloth Gemma 4 documentation (train guide, bug fixes & tips): explicit loader arguments (max_seq_length, dtype, full_finetuning=False), explicit SFTConfig arguments (weight_decay, lr_scheduler_type, max_grad_norm), and chat_template="gemma-4" per Tip #1 (the simpler template is recommended for E2B/E4B; gemma-4-thinking is reserved for 26B/31B reasoning-class variants). The change makes the embedded chat template more robust across downstream Ollama / llama.cpp runtimes that don't expose enable_thinking=False at the runtime layer.

Citation

@misc{solarhive2026,
  title={SolarHive: AI-Powered Community Solar Energy Intelligence},
  author={Youshen Lim},
  year={2026},
  url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
  note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}

Dataset used to train Truthseeker87/solarhive-e4b-ollama

Papers for Truthseeker87/solarhive-e4b-ollama

When2Call: When (not) to Call Tools

Paper • 2504.18851 • Published Apr 26, 2025

Fast Inference from Transformers via Speculative Decoding

Paper • 2211.17192 • Published Nov 30, 2022

Evaluation results

Accuracy (self-reported): 1.000
Accuracy (self-reported): 0.800
