SolarHive E4B — BF16 Merged Safetensors
LoRA fine-tuned Gemma 4 E4B (8B), merged to 16-bit safetensors. Source artifact for direct transformers inference and llama.cpp convert_hf_to_gguf.py → Q4_K_M GGUF conversion (which powers Ollama + llama.cpp edge deployment via the solarhive-e4b-gguf companion repo).
For Ollama or llama.cpp edge deployment on a 16 GB CPU laptop, use the solarhive-e4b-gguf repo instead. It ships the Q4_K_M GGUF text variants (4.6 GB / 5.3 GB) plus the 992 MB mmproj-BF16.gguf companion (vision + audio), with Modelfiles ready for ollama create and a 10/10 score on the SolarHive 10-prompt parity benchmark.
The --experimental Ollama path documented previously causes ollama create to run out of memory on ≤16 GB RAM (the 16 GB BF16 safetensors blob does not fit during ingestion). On hardware with ≥24 GB RAM the experimental import works, but the GGUF pipeline (built using llama.cpp convert_hf_to_gguf.py + llama-quantize) is the recommended edge deployment path for everyone else.
This repository now serves three roles:
- Source for GGUF conversion via llama.cpp's convert_hf_to_gguf.py (text tower) and convert_hf_to_gguf.py --mmproj (vision + audio projector). See solarhive-e4b-gguf for the produced GGUF artifacts.
- Transformers-native multimodal use: load with AutoModelForCausalLM for full image + audio + text in Python (requires ≥24 GB RAM or A100-class GPU).
- Reference for further fine-tuning: extend the LoRA on additional data using Unsloth FastVisionModel.
Built for the Gemma 4 Good Hackathon (Google DeepMind x Kaggle).
| Property | Value |
|---|---|
| Base Model | google/gemma-4-e4b-it |
| Architecture | Dense + PLE — 8B total, 4.5B effective |
| Fine-Tuning | LoRA via Unsloth (BF16) |
| Training Data | 1,727 examples (solarhive-community-solar-multimodal) — text-only fine-tune; VQA at inference uses the base Gemma 4 vision encoder (~150M params), unmodified by our LoRA per the Vertex AI SFT recipe |
| Converged Loss | 0.9218 |
| Benchmark | 9/10 (5/5 domain Q&A + 4/5 tool calling) — May 2026 final run, multi-call regression on TQ5 (see Multi-Variant Deployment Validation below) |
| Training Time | 420 seconds (~7 minutes) |
| Compute | Google Colab Pro |
| License | MIT (adapters) / Gemma Terms (base model) |
Model Overview
SolarHive E4B is the edge companion to SolarHive 26B A4B. While the 26B model powers cloud inference with full multimodal VQA, the E4B model is optimized for local deployment via Ollama on consumer hardware.
Privacy-first: Running Gemma 4 locally means community energy data never leaves the neighborhood. No cloud dependency, no internet requirement, no data privacy concerns. A village in rural India, a suburb in Michigan, and a coastal town recovering from a hurricane all get the same intelligence.
This repository contains the fully merged model (base + LoRA baked together) — no separate base model download needed.
Training Details
| Parameter | Value |
|---|---|
| Method | LoRA via Unsloth FastVisionModel (BF16, RTX PRO 6000 96 GB) |
| LoRA rank | 16 |
| LoRA alpha | 16 |
| LoRA dropout | 0 |
| Target modules | All linear layers |
| Learning rate | 2e-4 |
| Optimizer | AdamW 8-bit |
| Warmup steps | 5 |
| Epochs | 3 |
| Max sequence length | 2048 |
| Precision | BF16 |
| Seed | 3407 |
| Trainable parameters | 41.2M / 8.0B (0.51%) |
Training Loss
| Metric | Value |
|---|---|
| Converged loss (last 20 steps) | 0.9218 |
| Final step loss | 0.0635 |
| Minimum loss | 0.0635 |
| Total steps | 324 |
| Training time | 420 seconds |
Canonical metric: the bolded Converged loss (last 20 steps) is the only smoothed convergence indicator. Final step and Minimum are single-batch point statistics — mini-batch loss is noisy step-to-step, so one easy batch can drop a point estimate well below the rolling-average trend.
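The difference between a smoothed indicator and a single-batch point statistic can be sketched as follows. This is a minimal illustration on a synthetic loss curve, not the real training log; the helper name `converged_loss` and the synthetic values are assumptions made for the example.

```python
from statistics import mean

def converged_loss(losses, window=20):
    """Mean of the last `window` per-step losses (the smoothed indicator)."""
    if not losses:
        raise ValueError("losses must be non-empty")
    return mean(losses[-window:])

# Synthetic, illustrative curve (NOT the real training log): a noisy plateau
# between 0.80 and 1.00, followed by one unusually easy final batch.
synthetic = [0.9 + 0.05 * ((i % 5) - 2) for i in range(19)] + [0.06]

smoothed = converged_loss(synthetic)   # rolling average over the last 20 steps
final_step = synthetic[-1]             # single-batch point statistic
minimum = min(synthetic)               # also a single-batch point statistic

print(f"smoothed={smoothed:.4f} final={final_step:.4f} min={minimum:.4f}")
```

In this toy curve one easy batch sets both the final-step and the minimum loss, while the 20-step average stays near the plateau, which is the pattern the table above reports.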
Training Data
Same canonical training corpus as the 26B A4B model — solarhive-community-solar-multimodal, 1,727 rows:
- 413 hand-crafted examples spanning 15+ US cities and 9 energy domains
- ~1,117 API-grounded examples from live Open-Meteo, PVWatts, OWM, and EIA data
- 183 tool-calling examples (positive, negative refusals, follow-up clarifications, failure-recovery)
- 14 image-grounded Q&A turns from 7 manually-labeled Ann Arbor sky photographs
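As a quick consistency check, the four components above can be summed to recover the stated corpus size. This sketch treats the approximate "~1,117" figure as exactly 1,117, which is an assumption; the dictionary keys are illustrative labels, not dataset column names.

```python
# Row counts as stated in the list above ("~1,117" taken as exactly 1,117).
corpus = {
    "hand_crafted": 413,
    "api_grounded": 1117,
    "tool_calling": 183,
    "image_grounded": 14,
}

total_rows = sum(corpus.values())
text_rows = total_rows - corpus["image_grounded"]

print(f"total={total_rows} text_only={text_rows}")
```

Under that assumption the components add up to the 1,727 rows quoted for the corpus, of which 1,713 are text rows.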
Hardware
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (102 GB GDDR7 total, 94.97 GB max usable per Unsloth)
- Platform: Google Colab Pro (G4 VM)
Benchmark Results
Domain Q&A (5/5)
| Question | Result |
|---|---|
| Solar production when humidity exceeds 80%? | Correct |
| Battery SOC threshold for grid export? | Correct |
| Home #3 underperforming 22% — diagnostic checklist? | Correct |
| Winter snow on panels — prioritize actions? | Correct |
| Grid frequency 59.8 Hz — microgrid implications? | Correct |
Note on benchmark history: the 5/5 Q&A above is from the initial 8-question validation harness used during fine-tune development. The canonical headline number is the May 2026 final-run multi-variant validation (10-question parity benchmark) — see below.
Tool inventory + inference-time When2Call validation
solarhive_inference.py exposes 5 tools to the model — all three keyed APIs (OWM_API_KEY, EIA_API_KEY, NREL_API_KEY) actively wired:
| Tool | API | Returns |
|---|---|---|
| get_weather(location) | OpenWeatherMap (OWM_API_KEY) | Temperature, clouds %, wind, humidity, sunrise/sunset |
| get_solar_production(clouds_pct, temp_f) | Open-Meteo GHI (keyless) | Production kW, efficiency %, GHI W/m², temp derating |
| get_battery_state() | Community BMS (sim) | State of charge, capacity, charging status |
| get_grid_status() | EIA Open Data (EIA_API_KEY) | Pricing period, rate/kWh, renewable %, CO2 intensity |
| get_nrel_pvwatts_baseline() | NREL PVWatts v8 (NREL_API_KEY) | Annual + current-month typical kWh + avg kW for the 72 kW array |
Tool results feed back as a 2-message sequence matching the training distribution: {"role": "assistant", "tool_calls": [...]} then {"role": "tool", "name": "<fn>", "content": json.dumps(result)} per call. Shared across the data-generation pipeline, the fine-tune SFT preprocessing layer, and the inference agentic loop — inference matches training distribution exactly.
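The 2-message sequence described above could be built roughly like this. It is a sketch only: the helper name `tool_round_trip` is hypothetical, the inner layout of each `tool_calls` entry is an assumption (the source shows only `[...]`), and the grid values are placeholders rather than real EIA data.

```python
import json

def tool_round_trip(tool_name, arguments, result):
    """Return the assistant tool-call message and the matching tool-result message."""
    assistant_msg = {
        "role": "assistant",
        # Assumed inner layout of one tool call; the source elides this as "[...]".
        "tool_calls": [
            {"type": "function", "function": {"name": tool_name, "arguments": arguments}}
        ],
    }
    tool_msg = {"role": "tool", "name": tool_name, "content": json.dumps(result)}
    return [assistant_msg, tool_msg]

messages = [{"role": "user", "content": "What's the current grid rate?"}]
# Placeholder result values, for illustration only.
messages += tool_round_trip(
    "get_grid_status", {}, {"pricing_period": "off-peak", "rate_per_kwh": 0.12}
)
```

Note that the assistant message carries no `content` key, only `tool_calls`; the tool message carries the JSON-encoded result as a string.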
When2Call probes. Three held-out probes validate 3 of the 4 failure-mode categories from Ross, H., Mahabaleshwarkar, A. S., & Suhara, Y. (2025). When2Call: When (not) to Call Tools. arXiv:2504.18851 — the paper documents 9–67% tool-hallucination rates on (c)+(d) in untrained community models:
- (b) "What's the current grid rate?" → expect get_grid_status call (well-specified, in-scope)
- (c) "How much will a 10 kW array produce today?" → expect follow-up question (does NOT auto-fill location default)
- (d) "What's the current air quality index in Ann Arbor?" → expect refusal + redirect (does NOT hallucinate a tool)
Models trained without explicit unable-to-answer and follow-up clarification examples typically fail (c) + (d). The SolarHive corpus includes 16 such examples (10 unable-to-answer + 6 follow-up clarification) following the When2Call taxonomy.
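One way to score the three probes is to classify each model turn as a tool call, a follow-up question, or a refusal, and compare that against the expectation for its category. The sketch below is an assumption-heavy illustration: `ModelTurn`, `classify`, and the "ends with a question mark" heuristic are hypothetical and are not taken from solarhive_inference.py, and the example turns are made up.

```python
from dataclasses import dataclass, field

@dataclass
class ModelTurn:
    text: str = ""
    tool_calls: list = field(default_factory=list)

def classify(turn):
    """Crude behavior label: any tool call wins, then a question, else a refusal."""
    if turn.tool_calls:
        return "tool_call"
    if turn.text.strip().endswith("?"):
        return "follow_up"
    return "refusal"

# Expected behavior per When2Call category, as listed in the probes above.
EXPECTED = {"b": "tool_call", "c": "follow_up", "d": "refusal"}

def score(turns):
    """Number of probes whose observed behavior matches the expectation."""
    return sum(classify(turns[key]) == want for key, want in EXPECTED.items())

# Made-up turns: (c) auto-fills a location and calls a tool instead of asking back.
example_turns = {
    "b": ModelTurn(tool_calls=["get_grid_status"]),
    "c": ModelTurn(tool_calls=["get_solar_production"]),
    "d": ModelTurn(text="I don't have a tool for air quality data."),
}
probe_score = score(example_turns)
```

A production harness would need a sturdier follow-up detector than a trailing question mark; the point here is only the mapping from category to expected behavior.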
Multi-Variant Deployment Validation (Final Run, May 2026) — E4B regression on When2Call (c)
End-to-end inference run on Colab Pro G4. This E4B BF16 merged variant loaded from a local cache (16.9 GB VRAM utilization, ~10 min runtime).
Parity benchmark: 5/5 Q&A + 4/5 tool = 9/10 on the 10-question set —
matches the A4B family
on the 9 deterministic questions; the single FAIL is the lenient
multi-call probe (TQ5 — "Compare today's irradiance forecast across
Ann Arbor, Phoenix, and Seattle", min_calls=2) where this variant
emitted only 1 get_weather call. Notably, the
E4B LoRA + base variant
(same fine-tune, applied via Unsloth instead of merged) DOES chain
3 calls on the same probe and scores 10/10 — pattern reproducible
across runs.
When2Call probes — measured 2/3 (final run May 2026):
| Probe | E4B merged behavior | Score |
|---|---|---|
| (b) "current grid rate?" | ✅ Correctly calls get_grid_status | PASS |
| (c) "How much will a 10 kW array produce today?" | ❌ Auto-fills location and calls get_solar_production instead of asking back | FAIL |
| (d) "current AQI in Ann Arbor?" | ✅ Genuinely disclaims (no fabrication, no tool call) | PASS |
Cross-variant pattern: the E4B LoRA + base variant is inferred (not measured) to score 2/3 by equivalence with this merged variant: the merge step is lossless on weights, so the When2Call decision boundary should be identical. The one-probe W2C gap (2/3 here vs 3/3 for the A4B family, measured directly on A4B LoRA and inferred for A4B merged + NF4) is the empirical signature of size-vs-refusal scaling.
Honest finding — size-vs-refusal scaling is real, and was the pre-stated hypothesis. This E4B fine-tune regresses on (c) compared to the A4B LoRA baseline (which scores 3/3). The smaller model with less reasoning depth more readily auto-fills missing parameters when it should ask back — exactly the failure mode Ross et al. 2025 document at 9-67% rates in untrained community models. The fine-tune closes (b)+(d) at this size but doesn't fully close (c).
This was the expected outcome going in, per the official Google Gemma 4 Core docs "Parameter sizes and quantization": "Models with higher parameters and bit counts (higher precision) are generally more capable, but are more expensive to run." E4B's 8B total / 4.5B effective parameters / ~150M vision encoder vs A4B's 25.2B total / 3.8B active (MoE) / ~550M vision encoder reflect a deliberate ~3× capacity gap on the dimension that drives reasoning-heavy refusal/follow-up behavior. The validation confirms the documented scaling — not a defect, but architecture-aware deployment design.
Quantitative reinforcement from Unsloth's published Gemma 4 benchmarks: E4B scores 69.4% on MMLU Pro (vs 26B A4B's 82.6%, a 13.2 pp gap), 52.6% on MMMU Pro (vs 73.8%, a 21.2 pp gap), and 42.5% on AIME 2026 (vs 88.3%, a 45.8 pp gap). The AIME math-reasoning gap and the MMMU Pro multimodal-reasoning gap are consistent with the (c) When2Call regression observed here: the smaller model's published reasoning-benchmark deficits line up with its 2/3 probe score against the A4B baseline's 3/3. E4B is the right choice for the volume of well-specified, in-scope queries that dominate everyday community-energy interactions; A4B handles the harder reasoning edge cases.
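The percentage-point gaps quoted above follow directly from the paired scores. A small check, using only the numbers stated in the paragraph (the dictionary layout is just for illustration):

```python
# (E4B score, 26B A4B score) in percent, as quoted above.
scores = {
    "MMLU Pro": (69.4, 82.6),
    "MMMU Pro": (52.6, 73.8),
    "AIME 2026": (42.5, 88.3),
}

# Gap in percentage points, rounded to one decimal to absorb float noise.
gaps = {name: round(a4b - e4b, 1) for name, (e4b, a4b) in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap} pp")
```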
Deployment recommendation: Use this E4B variant for the volume of well-specified, in-scope queries (production estimates, grid pricing, maintenance guidance) where (b)-category routing dominates. Route under-specified or out-of-scope queries to the A4B cloud variant for correct refusal + follow-up behavior. A future fine-tune could increase the E4B follow-up clarification example count (currently 6) and unable-to-answer count (currently 10) to close the gap.
How to Use
Loading with transformers
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
model = AutoModelForCausalLM.from_pretrained(
"Truthseeker87/solarhive-e4b-ollama", # This repo (merged safetensors)
dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
"Truthseeker87/solarhive-e4b-ollama",
trust_remote_code=True,
)
Edge Deployment — use the GGUF repo
For Ollama or llama.cpp on a 16 GB CPU laptop, download the GGUF artifacts instead of trying to import these safetensors:
hf download Truthseeker87/solarhive-e4b-gguf \
solarhive-e4b-q4_k_m.gguf Modelfile \
--local-dir ./solarhive-gguf
cd ./solarhive-gguf
ollama create solarhive -f Modelfile
ollama run solarhive "What's the best time to run my dishwasher today?"
The solarhive-e4b-gguf repo also includes the Standard Q4_K_M variant (Colab-produced, Q6_K PLE) and a 992 MB mmproj-BF16.gguf for full multimodal via llama-server --mmproj.
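Once the model has been created, it can also be queried from Python through Ollama's local REST API instead of the CLI. The sketch below assumes Ollama is listening on its default port 11434 and that the model was created under the name solarhive as above; the helper names `build_request` and `ask` are hypothetical, and the sampling options mirror the defaults stated elsewhere in this card.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def build_request(prompt, model="solarhive"):
    """Build a non-streaming generate request with the card's sampling defaults."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 1.0, "top_p": 0.95, "top_k": 64, "num_ctx": 4096},
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request to a running Ollama server and return the generated text."""
    with urllib.request.urlopen(build_request(prompt), timeout=120) as resp:
        return json.loads(resp.read())["response"]

# Example (requires a running Ollama server with the model created):
# print(ask("What's the best time to run my dishwasher today?"))
```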
Edge Deployment via Ollama --experimental (≥24 GB RAM only)
If you have ≥24 GB system RAM, you can experimentally import these safetensors directly via Ollama:
git clone https://huggingface.co/Truthseeker87/solarhive-e4b-ollama
cd solarhive-e4b-ollama
cat > Modelfile << 'EOF'
FROM .
SYSTEM "You are SolarHive, an AI energy advisor for a community of 12 homes with rooftop solar and shared battery storage in Ann Arbor, Michigan. Use the available tools to get real-time data before answering. Be specific, reference actual data, and keep responses concise (3-5 sentences)."
PARAMETER temperature 1.0
PARAMETER top_p 0.95
PARAMETER top_k 64
PARAMETER num_ctx 4096
EOF
ollama create solarhive --experimental -f Modelfile
ollama run solarhive "What's the best time to run my dishwasher today?"
OOM warning: on 16 GB RAM, ollama create --experimental crashes around 44% blob processing as Ollama tries to materialize the full 16 GB BF16 model in memory during ingestion. Use the GGUF path above instead.
The official base (non-fine-tuned) E4B is also available as a pre-built GGUF on ollama.com/library/gemma4:e4b (9.6 GB, Q4_K_M). Our solarhive-e4b-gguf adds 1,727 examples of community solar domain expertise on top.
GGUF Conversion via llama.cpp (reproducibility recipe)
These safetensors are the source artifact for the GGUF deployment. Reproducible via llama.cpp tooling:
# Text tower → BF16 GGUF (~14 GB intermediate)
python convert_hf_to_gguf.py --outtype bf16 \
--outfile solarhive-e4b-bf16.gguf \
Truthseeker87/solarhive-e4b-ollama/
# Quantize text → Q4_K_M with PLE override for 16 GB hardware (~4.6 GB)
llama-quantize \
--tensor-type per_layer_token_embd.weight=q4_0 \
solarhive-e4b-bf16.gguf solarhive-e4b-q4_k_m.gguf Q4_K_M
# Multimodal projector (vision SigLIP + audio Conformer, ~992 MB)
python convert_hf_to_gguf.py --mmproj --outtype bf16 \
--outfile mmproj-solarhive-e4b-BF16.gguf \
Truthseeker87/solarhive-e4b-ollama/
The standard Q4_K_M variant (without --tensor-type override, ~5.3 GB) requires ≥32 GB RAM at quantization time — see the solarhive_quantize_e4b.ipynb notebook for the high-RAM recipe.
Core Capabilities
1. Multimodal Visual Question Answering (3 Modes)
Available because the base Gemma 4 E4B vision encoder (~150M params) is preserved unmodified in these merged weights:
| Mode | Input | Output |
|---|---|---|
| Sky Analysis | Sky photograph | Cloud coverage %, production forecast, storage recommendation |
| Panel Inspection | Panel photograph | Dirt/damage/shading detection, efficiency impact estimate |
| Neighborhood Assessment | Aerial/satellite image | Panel inventory, expansion priorities, shading analysis |
2. Native Function Calling (5 Tools — all 3 keyed APIs wired)
| Tool | API | Returns |
|---|---|---|
| get_weather(location) | OpenWeatherMap (OWM_API_KEY) | Temperature, clouds %, wind, humidity, sunrise/sunset |
| get_solar_production(clouds_pct, temp_f) | Open-Meteo GHI (keyless) | Production kW, efficiency %, GHI W/m², temp derating |
| get_battery_state() | Community BMS (sim) | State of charge, capacity, charging status |
| get_grid_status() | EIA Open Data (EIA_API_KEY) | Pricing period, rate/kWh, renewable %, CO2 intensity |
| get_nrel_pvwatts_baseline() | NREL PVWatts v8 (NREL_API_KEY) | Annual + current-month typical kWh + avg kW for the 72 kW array |
3. Selective Tool Reasoning
The model decides when to call tools — it does not blindly invoke all of them:
"What time does peak pricing start?"
→ Calls: get_grid_status() only
"Should I run my pool heater now?"
→ Calls: get_weather() + get_solar_production() + get_battery_state() + get_grid_status()
"What are general maintenance tips for panels?"
→ Calls: none (answers from training knowledge)
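On the application side, the calls the model chooses to emit have to be dispatched to real functions. A minimal dispatch sketch follows; the stub implementations, their return values, and the `run_tool_calls` helper are assumptions for illustration and do not reproduce the code in solarhive_inference.py.

```python
import json

def get_grid_status():
    # Stub standing in for the EIA-backed tool; values are placeholders.
    return {"pricing_period": "off-peak", "rate_per_kwh": 0.12}

def get_battery_state():
    # Stub standing in for the simulated BMS; 100 kWh matches the community model.
    return {"state_of_charge_pct": 80, "capacity_kwh": 100}

TOOLS = {"get_grid_status": get_grid_status, "get_battery_state": get_battery_state}

def run_tool_calls(tool_calls):
    """Execute only the tools the model asked for and wrap results as tool messages."""
    results = []
    for call in tool_calls:
        name = call["name"]
        fn = TOOLS.get(name)
        if fn is None:
            payload = {"error": f"unknown tool: {name}"}
        else:
            payload = fn(**call.get("arguments", {}))
        results.append({"role": "tool", "name": name, "content": json.dumps(payload)})
    return results

# "What time does peak pricing start?" -> the model requests one tool only.
single = run_tool_calls([{"name": "get_grid_status", "arguments": {}}])
# "What are general maintenance tips for panels?" -> no tool calls at all.
none = run_tool_calls([])
```

Returning an error payload for an unknown tool name, rather than raising, keeps the loop alive if the model ever names a tool that does not exist.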
Community Model
| Parameter | Value |
|---|---|
| Location | Ann Arbor, Michigan (42.2808°N, 83.7430°W) |
| Community size | 12 homes |
| Total panel capacity | 72 kW |
| Shared battery storage | 100 kWh |
| Grid region | MISO (Midcontinent Independent System Operator) |
Technical Notes
- Merged BF16 safetensors. Base + LoRA fused via Unsloth save_pretrained_merged("merged_16bit"). Loads with plain transformers.AutoModelForCausalLM.from_pretrained(...), with no PEFT or Unsloth dependency at inference time.
- Vision tower frozen during fine-tune. VQA at inference uses the base model's pretrained vision encoder unmodified, matching the Vertex AI SFT recipe which freezes both vision and audio towers during text-focused fine-tuning.
- Two-step tokenization at inference. Single-step tokenize=True crashes in transformers 5.5.x on messages without a content key (e.g., tool_calls messages). Always render text first (tokenize=False) then tokenize separately.
- Sampling defaults. temperature=1.0, top_p=0.95, top_k=64 (Kaggle-recommended Gemma 4 defaults).
- Chat template. gemma-4 (per Unsloth Tip #1 for E2B/E4B). The gemma-4-thinking template is reserved for 26B/31B reasoning-class variants. The simpler template is more robust across downstream Ollama / llama.cpp runtimes that don't expose enable_thinking=False at the runtime layer.
Limitations
- Prototype scope. Tested on a single community model (12 homes, Ann Arbor, MI). Real-world deployment requires validation across diverse geographies and community sizes.
- Smaller model, weaker refusal/follow-up. When2Call (c) regression vs the A4B baseline (2/3 vs 3/3 — see Multi-Variant Deployment Validation above). Route under-specified or out-of-scope queries to the A4B cloud variant for correct refusal + follow-up behavior.
- Occasional capacity hallucination. The base model's prior occasionally surfaces "60 kW" instead of the correct 72 kW community capacity in direct (no-tool) responses. The tool-calling path (which queries actual capacity from get_nrel_pvwatts_baseline) avoids this.
- External API dependence. Tool responses depend on Open-Meteo, OWM, EIA, and PVWatts availability with their respective rate limits.
- Battery state is simulated. get_battery_state() is a deterministic in-memory simulator for demonstrations; real deployment requires integration with actual battery management systems.
- Single-trial multi-variant validation. The May 2026 final-run benchmark numbers are from one inference pass; a multi-trial bootstrap would strengthen the multi-call regression claim against temperature-1.0 stochasticity.
- Memory. ~16 GB BF16 safetensors require ≥24 GB system RAM at load time — does not fit on consumer 16 GB laptops in this format. For 16 GB laptops, use the solarhive-e4b-gguf Q4_K_M variant.
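The ~16 GB figure follows from parameter count times bytes per parameter. A rough back-of-the-envelope sketch, counting weights only (activations, KV cache, and loader overhead are ignored, which is an assumption), with a hypothetical helper name:

```python
def weight_footprint_gb(n_params, bytes_per_param):
    """Approximate weight storage in decimal gigabytes (weights only)."""
    return n_params * bytes_per_param / 1e9

# 8.0B total parameters stored as BF16 (2 bytes each).
bf16_gb = weight_footprint_gb(8.0e9, 2)

# Rough average bits per weight implied by the 4.6 GB Q4_K_M text GGUF.
q4_bits_per_weight = 4.6e9 * 8 / 8.0e9

print(f"BF16 weights: {bf16_gb:.1f} GB, Q4_K_M: ~{q4_bits_per_weight:.1f} bits/weight")
```

The weights alone already fill a 16 GB machine, which is why the card points 16 GB laptops to the quantized GGUF variant.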
Future Iteration — Multi-Token Prediction (MTP) Drafters
Not in the measured numbers above. Google announced Gemma 4 MTP drafters on May 5, 2026 (blog, overview, HF collection, Kaggle, @GoogleGemma) — after this artifact's final benchmark was captured. The benchmarks above reflect standard autoregressive decoding only. MTP integration is documented here as future iteration; no measured speedup is claimed in this release.
Theoretical foundation. Speculative decoding (Leviathan, Kalman & Matias, ICML 2023, arXiv:2211.17192) accelerates generation without changing the output distribution, under greedy and sampled decoding alike: a smaller drafter proposes γ candidate tokens, the target verifies all γ in a single parallel forward pass, accepted tokens are kept, and the first rejected token is resampled from a corrected distribution. The output distribution is preserved exactly regardless of drafter quality; only acceptance rate α, and therefore walltime speedup, varies.
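The accept-or-resample rule behind this guarantee can be illustrated on toy categorical distributions. The sketch below covers a single drafted token with made-up target and drafter distributions; it is a simulation of the sampling rule only, not of Gemma 4 or of the released drafters, and the function names are hypothetical.

```python
import random

def speculative_token(p, q, rng):
    """One draft-then-verify step. p: target distribution, q: drafter distribution."""
    drafted = rng.choices(range(len(q)), weights=q)[0]
    # Accept the drafted token with probability min(1, p/q).
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted, True
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0], False

def simulate(p, q, n=20000, seed=0):
    """Empirical output frequencies and acceptance rate over n independent steps."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_token(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n

target = [0.6, 0.3, 0.1]    # toy target distribution (assumption)
drafter = [0.2, 0.5, 0.3]   # deliberately poor toy drafter (assumption)
freqs, alpha = simulate(target, drafter)
print(freqs, alpha)
```

Even with a badly mismatched drafter, the empirical output frequencies track the target distribution; what suffers is the acceptance rate, whose expected value for these toy distributions is the sum of min(p_i, q_i) = 0.6.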
What Google released on May 5, 2026. Paired drafter checkpoints for all four IT-tuned Gemma 4 variants — gemma-4-E2B-it-assistant, gemma-4-E4B-it-assistant, gemma-4-26B-A4B-it-assistant, gemma-4-31B-it-assistant — discoverable via the google/gemma-4 Hugging Face collection and on Kaggle Models. The drafters share the input embedding table with their paired target and consume the target's last-layer activations (architecture per the MTP overview). For the E4B target family the paired drafter is google/gemma-4-E4B-it-assistant (78.8 M params). Google reports up to 3× decode speedup with no quality degradation on the headline 26B-A4B configuration and **2.2×** on Apple Silicon at batch sizes 4–8; per-variant E4B numbers were not enumerated in the announcement. Tested runtimes named in the blog: LiteRT-LM, MLX, Hugging Face Transformers, vLLM, SGLang, Ollama.
Integration via Hugging Face Transformers is a plain-AutoModelForCausalLM two-line load plus one extra kwarg:
target = AutoModelForCausalLM.from_pretrained("Truthseeker87/solarhive-e4b-ollama", dtype=torch.bfloat16, ...)
assistant = AutoModelForCausalLM.from_pretrained("google/gemma-4-E4B-it-assistant", dtype=torch.bfloat16, ...)
target.generate(**inputs, assistant_model=assistant) # MTP enabled
The merged-safetensors load path on this repo is the cleanest E4B integration surface — no PEFT/Unsloth wrapping; the Hugging Face Transformers assistant_model= kwarg works directly.
Open question specific to this LoRA-merged BF16 target. Per the 2023 speculative-sampling guarantee, correctness is invariant to drafter quality — the target's verification step preserves the exact output distribution regardless of what the drafter proposes. What varies is acceptance rate α, since Google's released drafter was trained against the base gemma-4-E4B-it, not against this LoRA-merged target. Measured α at the edge BF16 tier is the planned post-hackathon contribution; a cloud-tier measurement against the A4B merged target is captured by the gated future-iteration cell in solarhive_inference.py §14.
Companion Repositories
| Model | Repository | Purpose |
|---|---|---|
| SolarHive 26B A4B LoRA | solarhive-26b-a4b-lora | Cloud inference with full multimodal + function calling (LoRA adapters) |
| SolarHive 26B A4B Merged | solarhive-26b-a4b-merged | Full BF16 cloud model (~48 GB) — production inference, no PEFT/Unsloth dep |
| SolarHive 26B A4B NF4 | solarhive-26b-a4b-nf4 | Pre-quantized 4-bit cloud model for HF Spaces / 24 GB+ GPUs |
| SolarHive E4B LoRA | solarhive-e4b-lora | E4B adapter weights (~200 MB) — apply over base via Unsloth |
| SolarHive E4B safetensors | This repo | Source safetensors for transformers research / GGUF conversion via llama.cpp |
| SolarHive E4B GGUF | solarhive-e4b-gguf | Edge deployment — Q4_K_M GGUF + mmproj for Ollama / llama.cpp on 16 GB CPU laptop. 10/10 benchmark. |
| SolarHive Dataset | solarhive-community-solar-multimodal | 1,727 training examples (1,713 text + 14 image-grounded) |
| LiteRT-LM Python edge runtime | solarhive_e4b_litert_v3.1.ipynb | LiteRT Special Tech Track entry: runs upstream base litert-community/gemma-4-E4B-it-litert-lm .litertlm (3.66 GB) + SolarHive UX layer + on-device agentic loop with native Gemma 4 function calling. Q&A 8/8 on Colab Pro CPU + High-RAM. Fine-tuned LiteRT-LM bundle is a planned next iteration once upstream gemma4 example module lands in ai_edge_torch.generative.examples/. |
| GitHub | the-gemma4-good-hackathon-solarhive | Full source code, training & quantization notebooks, data principles |
Fine-Tuning Architecture — Text-Only on the Multimodal-Capable Corpus
The shipped fine-tune is text-only on the canonical
solarhive-community-solar-multimodal
corpus (1,727 rows = 1,713 text + 14 image-grounded). Image rows are
skipped at the data-prep layer; the training pipeline pre-renders only
text rows for TRL's default text collator. Multimodal fine-tuning is
deferred post-hackathon — a real image corpus and a held-out VQA
benchmark would be prerequisites; the dataset's image schema is
preserved so a future multimodal fine-tune can re-enable image rows
without changing the corpus.
VQA at inference time uses the base Gemma 4 E4B model's pretrained
vision encoder (~150M params per the official model card).
Our LoRA targets only the language-model linear layers
(target=all-linear); the vision tower is not modified. This matches
the Vertex AI Gemma 4 SFT recipe documented in the
Hugging Face blog, which
explicitly freezes both vision and audio towers during text-focused
fine-tuning.
Companion 26B A4B LoRA is published at
Truthseeker87/solarhive-26b-a4b-lora.
The dataset uses the project archive for its 14 image-grounded Q&A turns (7 Ann Arbor sky photos × 2 turns). Image-source planning pivoted twice: the SWIM corpora (NUS) were rejected for CC BY-NC licensing, and NREL SRRL was rejected because the legacy MIDC SkyCam image archive ended May 2017 (modern ASI-16 only exposes derived measurements). The shipped dataset uses the project archive only — fewer images, but every label is human-confirmed and every paired Q&A traces back to the same GHI / temperature-derating formula used elsewhere in the dataset.
The fine-tune notebook has been pre-aligned with the official Unsloth
Gemma 4 documentation
(train guide,
bug fixes & tips):
explicit loader arguments (max_seq_length, dtype,
full_finetuning=False), explicit SFTConfig arguments
(weight_decay, lr_scheduler_type, max_grad_norm), and
chat_template="gemma-4" per Tip #1 (the simpler template is
recommended for E2B/E4B; gemma-4-thinking is reserved for 26B/31B
reasoning-class variants). The change makes the embedded chat
template more robust across downstream Ollama / llama.cpp runtimes
that don't expose enable_thinking=False at the runtime layer.
Citation
@misc{solarhive2026,
title={SolarHive: AI-Powered Community Solar Energy Intelligence},
author={Youshen Lim},
year={2026},
url={https://github.com/youshen-lim/the-gemma4-good-hackathon-solarhive},
note={Gemma 4 Good Hackathon submission — Google DeepMind x Kaggle}
}
Links
- GitHub: youshen-lim/the-gemma4-good-hackathon-solarhive
- Kaggle: The Gemma 4 Good Hackathon
- Base Model: google/gemma-4-e4b-it
- Unsloth Gemma 4 docs: unsloth.ai/docs/models/gemma-4
Built with Gemma 4 in Ann Arbor, Michigan. May 2026.
Gemma is a trademark of Google LLC.
