Instructions to use Hypereum/HivemindEval with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Hypereum/HivemindEval with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Hypereum/HivemindEval")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("Hypereum/HivemindEval")
model = AutoModelForCausalLM.from_pretrained("Hypereum/HivemindEval", device_map="auto")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Notebooks
Google Colab
Kaggle
Local Apps Settings

vLLM

How to use Hypereum/HivemindEval with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Hypereum/HivemindEval"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Hypereum/HivemindEval",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/Hypereum/HivemindEval

SGLang

How to use Hypereum/HivemindEval with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Hypereum/HivemindEval" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Hypereum/HivemindEval",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Hypereum/HivemindEval" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Hypereum/HivemindEval",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use Hypereum/HivemindEval with Docker Model Runner:
```
docker model run hf.co/Hypereum/HivemindEval
```

HivemindEval v2.1

An open-weight verifier for compliance analysis. HivemindEval scores a compliance finding (JSON in, JSON out) on six quality dimensions for UK/EU regulatory frameworks — PSD2 SCA-RTS, UK GDPR / NHS DSPT, MOD JSP 440, Cyber Essentials Plus, DORA, and the EU AI Act — producing an overall 0–100 score, per-dimension scores with reasoning, and actionable improvement suggestions, in a fixed structured schema.

Built by Hypereum. Fine-tuned from Qwen/Qwen3-8B (Apache-2.0). Weights here are the full merged model in bfloat16 safetensors.

PROVISIONAL STATUS — read this first. Every benchmark number on this card is provisional: band gold is construction-and-adjudication based and numeric gold is AI-grader/human-range based at the low/mid band only. No number here has been validated by an independent human expert panel yet (that validation — ≥3 independent experts — is planned and pending). We make no claim that this model is the best, state of the art, or at expert parity. On our own frozen benchmark it is in a statistical tie at the top with a much larger open 72B model — including metrics where that model, and even a rubric-prompted base 8B, beat it. Details below, losses included.

What it does

Input: one compliance finding as JSON (fields: severity, category, title, description, evidence, regulatory_reference, remediation, plus optional deadline, effort_days).

Output: one JSON object:

{
  "overall_score": 0-100,
  "dimensions": {
    "accuracy":             {"score": 0-100, "reasoning": "..."},
    "completeness":         {"score": 0-100, "reasoning": "..."},
    "regulatory_alignment": {"score": 0-100, "reasoning": "..."},
    "actionability":        {"score": 0-100, "reasoning": "..."},
    "coherence":            {"score": 0-100, "reasoning": "..."},
    "evidence_quality":     {"score": 0-100, "reasoning": "..."}
  },
  "improvement_suggestions": ["..."]
}

Intended use

Scoring / ranking machine- or human-drafted compliance findings for UK/EU frameworks.
Quality gating and regression testing of compliance-analysis pipelines.
Deployments where analysis must stay in-jurisdiction on owned hardware: the model runs on a single 24 GB GPU (e.g. RTX 4090; observed ~40 tok/s single-stream, ~23.4 GB used with vLLM at --max-model-len 8192).

Out of scope

Not a general-purpose judge. It is trained for one input schema and one output schema in one domain. A rubric-prompted frontier model is the right tool for general LLM-as-a-judge work.
Not legal or compliance advice. Scores are decision support for human reviewers. Do not use its output as the sole basis for a compliance decision.
Not evaluated for jurisdictions beyond UK/EU frameworks listed above, non-English input, or free-text (non-schema) input.

Quickstart

The model expects its native interface: the system prompt below + the raw finding JSON as the user message, temperature 0, Qwen3 thinking disabled.

SYSTEM_PROMPT = (
    "You are HivemindEval, an expert evaluator of compliance findings for UK and EU "
    "regulated industries. Given a compliance finding (JSON), produce a structured "
    "evaluation with integer scores 0-100 for each dimension and actionable "
    "improvement suggestions. Be calibrated: poor findings score low, excellent ones "
    "score high. Vary your scores to reflect actual finding quality."
)

vLLM (recommended):

vllm serve Hypereum/HivemindEval --max-model-len 8192

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="-")
finding = open("finding.json").read()
r = client.chat.completions.create(
    model="Hypereum/HivemindEval",
    messages=[{"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": finding}],
    temperature=0, max_tokens=3000,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)
print(r.choices[0].message.content)

transformers:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Hypereum/HivemindEval")
model = AutoModelForCausalLM.from_pretrained("Hypereum/HivemindEval",
                                             torch_dtype=torch.bfloat16, device_map="auto")
messages = [{"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": finding}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 enable_thinking=False, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=3000, do_sample=False)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))

Outputs are verbose; budget ≥2,048 new tokens (short budgets truncate the JSON before it closes). Parse with the validation-gated parser in the companion repo (eval/schema.py). Enforce the schema at the serving layer (e.g. vLLM guided_json) for production use — observed validity is not a guarantee.

Evaluation

Benchmark. A frozen, contamination-gated benchmark of 230 synthetic-but-realistic compliance findings across the six frameworks (212 rubric-scorable; 18 auxiliary items excluded from scoring), built from tiered constructions (excellent / good / mediocre / terrible) plus four injected defect classes (wrong_jurisdiction, hallucinated, missing_requirement, band_edge). All 212 scored items are content-hash-proven disjoint from this model's training data (0 collisions). The full item set is committed by hash and withheld as our private validation instrument; a stratified 68-item public subset ships with per-item gold and every model's raw predictions, so the subset table below is exactly reproducible offline. See the benchmark dataset and the methodology doc in the companion repo.

Full item set (230): SHA-256 4f2dbff713ab6b56d2db0a8eb08a1296f86bca0229f55d8ca127a73d21e1a9c6
Published subset (68): SHA-256 b51e721e37df6e9917f44550ccb2ea2376839d1c7744818a1fd432fc12a4eee1 (same canonicalization)

Gold policy. Primary metrics are rank/band-based and monotonic-invariant. Band gold is label-source-independent (by-construction tier + blind adjudication by an independent arbiter model — never the source of any numeric label or training signal). Numeric gold exists only where validated sound: human ranges on graded probes and an LLM reference grader at the low/mid band; high-band numeric gold is excluded (PENDING_EXPERT). MAE vs low/mid numeric gold is a disclosed secondary metric that structurally favours this model (its training labels share provenance with that gold) — read it as a sanity check, never a ranking.

Comparison models were run rubric-prompted (a ~10.5k-token scoring rubric, the strongest general-model setup we have) at temperature 0 on identical items.

Full board — 212 items (open models)

model	tier concordance (95% CI)	excellent concordance	defect separation	Kendall τ	MAE low/mid (raw)	scored
Qwen2.5-72B-Instruct-AWQ (rubric)	0.985 [0.977–0.993]	0.990	0.602	0.801	12.64	212/212
HivemindEval v2.1 (native)	0.984 [0.973–0.993]	0.982	0.591	0.817	4.00	211/212
Qwen3-8B (rubric)	0.973 [0.959–0.985]	0.973	0.709	0.751	10.00	208/212
Llama-3.3-70B (rubric)	0.965 [0.950–0.980]	0.968	0.616	0.764	17.52	173/212
Qwen3-8B (native prompt, no fine-tune)	—	—	—	—	—	0/212

Read the tie honestly: the 72B edges HivemindEval on tier and excellent concordance; HivemindEval leads Kendall τ (0.817 vs 0.801 — τ has no recorded CI, unlike tier concordance). All tier-concordance gaps are within overlapping 95% CIs — a statistical tie, not a win.

Core-68 board — the published subset (adds the frontier model; reproducible from the shipped data)

model	tier concordance (95% CI)	excellent concordance	defect separation	Kendall τ	MAE low/mid	scored
Qwen3-8B (rubric)	0.991 [0.977–1.000]	0.988	0.639	0.788	9.32	67/68
Qwen2.5-72B-AWQ (rubric)	0.990 [0.980–1.000]	0.992	0.549	0.833	14.40	68/68
HivemindEval v2.1 (native)	0.979 [0.958–0.998]	0.976	0.615	0.843	4.30	67/68
claude-sonnet-5 (rubric)	0.976 [0.951–0.996]	0.975	0.562	0.777	3.35	66/68
Llama-3.3-70B (rubric)	0.964 [0.932–0.988]	0.963	0.667	0.798	21.00	56/68

Reproduce this table offline: python3 eval/score_benchmark.py in the companion repo recomputes every cell (including the bootstrap CIs, seed 7, 400 resamples) from the shipped gold + predictions, bit-for-bit.

Defect separation by class — where this model is weak (full board)

P(defect item scored below a clean good item); higher is better.

model	wrong_jurisdiction	hallucinated	missing_requirement	band_edge
Qwen3-8B (rubric)	1.00	0.59	0.65	0.60
Qwen2.5-72B-AWQ	0.89	0.59	0.46	0.46
Llama-3.3-70B	0.85	0.57	0.51	0.53
HivemindEval v2.1	0.90	0.67	0.42	0.36

Stated plainly: HivemindEval is last among scored models on overall defect separation (0.591), weakest on missing_requirement (0.42) and band_edge (0.36), and a rubric-prompted base Qwen3-8B beats it on wrong_jurisdiction (1.00 vs 0.90) — detecting a US-framework citation in a UK/EU context, a core capability for this domain. (On the core-68 board, claude-sonnet-5 scores 0.54 on wrong_jurisdiction — frontier models are not automatically strong here either.) If your primary need is defect detection rather than quality ranking, a rubric-prompted model is currently the better tool.

What the fine-tuning measurably buys: format + serving economics

On the same hardware and the same native prompt, base Qwen3-8B produced 0/212 parseable outputs (prose, not JSON); HivemindEval produced valid schema output on 211/212 benchmark items (99.5%; plus 12/12 on the acceptance-gate probes) — from a paragraph-length prompt instead of the ~10.5k-token rubric a general model needs (≈10× fewer prompt tokens per call, and the rubric does not even fit the 8k serving context above). It matches a self-served 72B's ranking while running on a 24 GB GPU instead of ≥46 GB-class hardware. This is a token/GPU-tier comparison, not an audited $-per-token benchmark, and format validity is a reliability property, not judgment quality — we report the two separately on purpose.

Acceptance gate (v2.1 final weights, 12 human-range-graded probes, 2048-token eval budget)

criterion	threshold	result	verdict
overall score variance	std > 8.0	23.33	PASS
max per-dimension variance	std > 5.0	31.9 (all six in 21.1–31.9)	PASS
quality spread (best−worst tier)	> 20.0	51.33	PASS
tier ordering	monotonic, 0 violations	excellent > good > mediocre > terrible	PASS
jurisdiction-defect probes	mean < 50.0	22.0	PASS
schema-valid outputs	—	12/12	—

Per-tier calibration (human gold ranges vs predicted mean): excellent 72.0 (gold floors 78–82 — the model under-scores genuinely excellent input), good 63.0, mediocre 35.0, terrible 20.7, jurisdiction-defect 22.0.

Version history — and a prior-defect disclosure

v2.0 was publicly visible on the Hub for a period beginning May 2026 and was later withdrawn; this release (v2.1) supersedes it. If you obtained v2.0 weights during that window, do not use them — they carry the rating-collapse defect described below (constant scores regardless of input). We disclose the following about v2.0 because it shaped how this model is validated:

An earlier version (v2.0) suffered complete rating collapse: it emitted the identical overall_score of 75 — and an identical per-dimension score vector — for every input, regardless of quality (std 0.0 across a graded probe set), while producing perfectly well-formed JSON. It was syntactically valid and semantically inert. The defect was caught in our internal validation program and root-caused to defective training labels (the labels carried a constant score vector rather than real judgments — no model trained on them could learn to discriminate). v2.1 is a full retrain on corrected labels and passes the anti-collapse acceptance gate above; the benchmark items used to evaluate it are hash-proven disjoint from its training data.

The general lesson we took, reflected throughout this card: format validity is not judgment quality. A model can emit flawless JSON and zero signal. Rank-based validation against constructed gold, with disclosed losses, is the floor we now hold ourselves to.

Limitations

Provisional, AI-anchored gold. Band gold is construction+adjudication-based; numeric gold at low/mid band comes from an LLM reference grader validated against human-graded probes. No ≥3-expert human panel has scored this model yet. Agreement with human experts (weighted κ, Krippendorff's α, ICC) is unmeasured, not merely unreported. Until that panel exists, every number here is provisional.
Statistical tie, not superiority. A clean self-served Qwen2.5-72B-AWQ ties or edges it on tier/excellent concordance (within CIs). Kendall τ (where it leads) has no recorded CI.
Defect detection is its weakest axis (see table above) — last among scored models overall; a rubric-prompted base 8B is better at spotting wrong-jurisdiction citations.
Conservative at the top of the range. Genuinely excellent findings tend to score ~72 vs human gold floors of 78–82. Treat high-60s/low-70s as potentially excellent-tier. High-band numeric calibration awaits expert labels.
Serving determinism. With vLLM greedy decoding, outputs are verdict-stable, not byte-stable (identical inputs can produce token-level jitter). Byte-level reproducibility on a serving stack is a documented non-goal.
Single-schema, single-domain. Six UK/EU frameworks, English, one input/output schema. Anything else is out of scope and unmeasured.
24 GB deployment is single-stream with minimal headroom; use a larger card for concurrency.
Structured-output validity was observed (211/212, 12/12), not grammar-guaranteed; enforce a schema at the serving layer for production.

Reproduction

Data: Hypereum/hivemind-eval-benchmark (68 items + gold + all six models' raw predictions).
Recompute the core-68 board offline (no GPU needed): python3 eval/score_benchmark.py --gold data/gold_subset.json --predictions-dir data/predictions in hypereum-innovations/hivemind-eval.
Regenerate predictions from these weights (24 GB GPU): python3 eval/run_benchmark.py --endpoint http://localhost:8000/v1 --model Hypereum/HivemindEval --out predictions/my-run.jsonl then score as above. Expect verdict-stable (not byte-identical) agreement with the shipped predictions; see the methodology doc, "Determinism bounds".

License

Citation

@software{hivemindeval2026,
  author  = {Hypereum},
  title   = {HivemindEval: an open-weight verifier for UK/EU compliance analysis},
  year    = {2026},
  url     = {https://huggingface.co/Hypereum/HivemindEval},
  note    = {Provisional benchmark results pending independent expert validation}
}

Contact

hypereum.tech — for the expert-validation programme (we are recruiting UK/EU compliance experts for the ≥3-reviewer blind panel), open an issue in the companion repo.

Downloads last month: 21

Safetensors

Model size

8B params

Tensor type

BF16

Model tree for Hypereum/HivemindEval

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-8B

Finetuned

(1964)

this model