variantassist.com · GitHub · License
VariantAssist Gemma 4 31B GGUF
VariantAssist Gemma 4 31B GGUF is the local-inference release of the VariantAssist Gemma 4 31B LoRA model. The files in this repository are produced by merging the VariantAssist LoRA adapter with Gemma 4 31B IT and converting/quantizing the merged model for llama.cpp-compatible runtimes.
VariantAssist is designed to support structured clinical genetic variant review. It is not a diagnostic device and must not replace a clinician, medical geneticist, laboratory director, or ACMG/AMP-trained reviewer.
Evaluation Protocol
All model scores below are evaluated after the VariantAssist 3-to-5 consensus procedure. For each variant, the model is first run three times. If all three runs return the same pathogenicity level, that level is accepted. If any run differs, two additional runs are performed; a result is accepted only if one pathogenicity level appears at least three times across the five runs. If no level reaches that threshold, the result is marked as no consensus and may be rerun.
Every variant in this benchmark reached consensus. In practical use, no-consensus cases have been observed at roughly 1 in 5000 variants.
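The 3-to-5 consensus rule described above can be sketched in a few lines. This is a minimal illustration, not the project's own implementation: `run_once` is a hypothetical callable standing in for one model run, and the labels `"P"`, `"LP"`, and `"VUS"` are placeholders (the real level names come from the response schema).

```python
from collections import Counter
from typing import Callable, Optional


def consensus_level(run_once: Callable[[], str]) -> Optional[str]:
    """Apply the 3-to-5 consensus rule to repeated runs for one variant.

    run_once is a hypothetical stand-in for a single inference call that
    returns one pathogenicity level. Returns None when no level is reached.
    """
    # Step 1: three initial runs; accept immediately if they all agree.
    levels = [run_once() for _ in range(3)]
    if len(set(levels)) == 1:
        return levels[0]

    # Step 2: two additional runs; accept a level seen at least three times.
    levels += [run_once() for _ in range(2)]
    level, count = Counter(levels).most_common(1)[0]
    return level if count >= 3 else None


def scripted(outputs):
    """Test helper (hypothetical): replay a fixed list of run results."""
    it = iter(outputs)
    return lambda: next(it)


print(consensus_level(scripted(["P", "LP", "P", "P", "LP"])))  # prints P
```

A `None` result corresponds to the no-consensus case, which the caller may choose to rerun.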
Available GGUF Files
| File | Size | Match | Quant | Role |
|---|---|---|---|---|
| VA-Gemma4-31B-UD-Q8_0.gguf | 31 GB | 86 | UDQ | Best current benchmark result |
| VA-Gemma4-31B-Q4_K_M.gguf | 18 GB | 85 | LQ | Practical default |
| VA-Gemma4-31B-Q8_0.gguf | 31 GB | 83 | LQ | Classic Q8 variant |
| VA-Gemma4-31B-UD-Q4_K_M.gguf | 18 GB | 82 | UDQ | Smaller UDQ variant |
| VA-Gemma4-31B-F16.gguf | 58 GB | 81 | F16 | Reference GGUF |
| VA-Gemma4-31B-BF16-00002-of-00002.gguf | 11 GB | - | BF16 | BF16 export shard |
| VA-Gemma4-31B-BF16-mmproj.gguf | 1.2 GB | - | MMProj | Not needed for text-only runs |
UDQ = Unsloth dynamic quantization. LQ = classic llama.cpp quantization. The Unsloth quantized variants were selected/validated on examples with the correct VariantAssist Level-1 input/output structure.
Benchmark Results
The ATP7B benchmark contains 100 Wilson disease variants with consensus labels from five independent expert annotations. The primary ground truth is strict majority consensus.
Reasoning budget is usually an important quality driver for classic quantized models. In this benchmark, the VariantAssist-tuned quantized runs improve accuracy while also reducing the reasoning-token budget compared with the original quantized baseline.
Current highlighted result:
- VariantAssist UD-Q8: 86/100 exact matches on the ATP7B benchmark.
- No strong errors in the selected released-model comparison.
- Expert-consensus reference: 15 average expert disagreements, equivalent to 85/100 agreement.
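As a rough illustration of how such figures can be derived, the sketch below computes a strict-majority label from a set of expert annotations and counts exact matches against it. The function names and the example labels are assumptions for illustration only; the actual scoring scripts and label set are defined in the benchmark archive.

```python
from collections import Counter
from typing import Optional, Sequence


def strict_majority(annotations: Sequence[str]) -> Optional[str]:
    """Return the label chosen by more than half of the annotators, else None."""
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None


def exact_matches(predicted: Sequence[str], truth: Sequence[str]) -> int:
    """Count variants whose predicted level equals the ground-truth level."""
    return sum(p == t for p, t in zip(predicted, truth))


# Placeholder labels; five annotations per variant as in the benchmark.
truth = [
    strict_majority(["P", "P", "P", "LP", "VUS"]),
    strict_majority(["LP", "LP", "LP", "LP", "P"]),
]
print(exact_matches(["P", "P"], truth))  # prints 1
```

With 100 variants, the exact-match count is directly the score out of 100.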
Prompts, Schema, And Reproducibility
Use the public prompt archive for reproducible evaluation:
- ATP7B prompt archive
- System instruction
- Response schema
- Annotation rules
- ATP7B benchmark ground truth
That archive contains the system prompt, schema, annotation rules, and per-variant prompts used for benchmark-style evaluation.
Runtime
Recommended runtime is llama-server from a recent llama.cpp build with Gemma 4 reasoning support.
Recommended server command:
llama-server \
-m /path/to/VA-Gemma4-31B-Q4_K_M.gguf \
--no-mmproj \
--jinja \
-ngl auto \
-c 32768 \
-fa on \
--swa-full \
-np 1 \
--cache-prompt \
--cache-reuse 256 \
--slot-prompt-similarity 0.10 \
--ctx-checkpoints 1 \
--checkpoint-every-n-tokens 4096 \
--cache-ram 2048 \
--kv-unified \
--cache-type-k f16 \
--cache-type-v f16 \
-b 2048 \
-ub 512 \
--no-cont-batching \
--perf \
--metrics \
--host 127.0.0.1 \
--port 8091 \
--reasoning on \
--reasoning-budget 8192 \
-t 24 \
-tb 24
Small-machine optimization:
-c 8192 --reasoning-budget 4096
What to change:
- -m: select the GGUF file.
- --host / --port: set your serving endpoint.
- -t / -tb: match your CPU thread budget.
- -c and --reasoning-budget: reduce on smaller machines if needed.
What to keep for VariantAssist Level-1 runs:
- --reasoning on: benchmarked runs use reasoning mode.
- --jinja: uses the Gemma chat template.
- --no-mmproj: this release is text-only.
- --cache-type-k f16 --cache-type-v f16: keeps KV cache quality stable.
- --no-cont-batching: keeps single-review behavior predictable.
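One way to keep the adjustable and fixed flags apart is to generate the command line from a small helper. The sketch below is an assumption about how a launcher script might look, not part of the release. It covers only the flags named in the two lists above and omits the remaining tuning flags from the recommended command. The defaults mirror the values shown there.

```python
import shlex
from typing import List

# Flags the text says to keep for VariantAssist Level-1 runs.
KEEP_FLAGS = [
    "--reasoning", "on",
    "--jinja",
    "--no-mmproj",
    "--cache-type-k", "f16",
    "--cache-type-v", "f16",
    "--no-cont-batching",
]


def server_args(model_path: str, host: str = "127.0.0.1", port: int = 8091,
                threads: int = 24, ctx: int = 32768,
                reasoning_budget: int = 8192) -> List[str]:
    """Build a llama-server argument list from the adjustable settings."""
    return [
        "llama-server",
        "-m", model_path,
        "--host", host, "--port", str(port),
        "-t", str(threads), "-tb", str(threads),
        "-c", str(ctx), "--reasoning-budget", str(reasoning_budget),
        *KEEP_FLAGS,
    ]


# Small-machine variant: reduced context and reasoning budget.
small = server_args("/path/to/VA-Gemma4-31B-Q4_K_M.gguf",
                    ctx=8192, reasoning_budget=4096)
print(shlex.join(small))
```

The resulting list can be passed to `subprocess.run` unchanged, which avoids shell-quoting issues with model paths.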
Reasoning should remain enabled for VariantAssist-style review. In our workflow, no-reasoning runs could generate shorter single responses, but were less reliable in the completed 3-to-5 consensus process and could require reruns.
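With the server running, a review request can be sent to its OpenAI-compatible chat endpoint. The following is a minimal sketch using only the Python standard library. It assumes the host and port from the recommended command, and the `system_prompt` and `user_prompt` arguments are placeholders for the system instruction and per-variant prompt from the public prompt archive. The function names are illustrative.

```python
import json
import urllib.request

# Assumed to match --host/--port in the recommended server command.
BASE_URL = "http://127.0.0.1:8091"


def build_chat_request(system_prompt: str, user_prompt: str) -> urllib.request.Request:
    """Prepare a POST request for the /v1/chat/completions endpoint."""
    payload = {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def review_variant(system_prompt: str, user_prompt: str) -> str:
    """Send one request and return the assistant message text."""
    request = build_chat_request(system_prompt, user_prompt)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Each call to `review_variant` corresponds to a single run; a consensus workflow would call it three to five times per variant.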
Intended Use
Use this release for:
- local-first VariantAssist review workflows;
- structured evidence synthesis for expert review;
- JSON-oriented draft outputs;
- reproducible local benchmarking with the public ATP7B prompt archive.
Out Of Scope
Do not use this model for:
- autonomous diagnosis;
- direct patient-facing medical advice;
- final ACMG/AMP classification without expert review;
- clinical interpretation outside the supplied evidence context;
- high-stakes clinical workflows without local validation.
Training Data
The full fine-tuning corpus is not distributed with this release because it may include clinical-context and literature-derived materials requiring separate privacy and licensing review. Public benchmark data, prompt templates, response schema, and de-identified examples are provided separately to support reproducible evaluation.
Links
- Website: https://variantassist.com/
- LoRA adapter: https://huggingface.co/LocusForge/VariantAssist-Gemma4-31B-LoRA
- Main project: https://github.com/LocusForge/VariantAssist
- Benchmark and prompts: https://github.com/LocusForge/VariantAssist-supplement/tree/main/benchmark
- Upstream base model: https://huggingface.co/google/gemma-4-31B-it
- License: Apache License 2.0
- Notice: NOTICE.md