Predicate — a promptable, calibrated criterion classifier (Granite-4.0-micro)

Describe any criterion in plain English, pass a piece of content, and get a calibrated 0–1 probability that the content satisfies it. One small probe, any criterion — no per-criterion training, no labelled examples per task.

score(content   = "Comparing you to two competitors. About ready to switch.",
      criterion = "the customer is at risk of churning")  →  0.97

Backbone: ibm-granite/granite-4.0-micro (Apache-2.0), pulled at runtime.
Probe: a frozen domain LoRA + a small MLP head reading the hidden state at layer 28 via the KV-pop primitive (prefill the content once, pop K criteria against it → cheap multi-criteria).
License: Apache-2.0. The LoRA, MLP head and calibration here are Apache-2.0; the Granite base is Apache-2.0. See LICENSE and NOTICE.

What's in this repo

lora_granite_L07_armA_r32mlp/ — the PEFT LoRA adapter (rank 32, attn+mlp).
base_probe_granite_L07_armA_L28_amortized_kvpop_calibrated.zip — the MLP head + isotonic calibration + manifest + load-time canary (the probe artifact).

How to run

Easiest — the Docker image (everything bundled; Granite pulls at first run):

docker run --gpus all -p 8088:8088 ghcr.io/nope-net/predicate-oss:0.1.0
# one content, many criteria, scored in a single content prefill:
curl -X POST localhost:8088/classify_multi_criteria -H 'content-type: application/json' \
  -d '{"content":"My brother has been anxious lately and I worry about him.",
       "criteria":["the speaker themselves is anxious",
                   "someone other than the speaker is anxious"]}'

From source — the inference code is Apache-2.0 and ships inside the Docker image (/app/predicate + /app/service). To run the probe yourself, use these weights with that code: Scorer.from_path("base_probe_granite_L07_armA_L28_amortized_kvpop_calibrated.zip") (point the manifest's lora_path at the LoRA in this repo). The artifact's canary verifies the hidden-state frame at load and refuses to serve on a numerical mismatch — e.g. a different GPU architecture (the canary here was generated on an RTX 4000 Ada; pack and serve on the same GPU family). Built under torch 2.6.0 / transformers 5.9.0.

What it reads well

Topic, sentiment, intent, stance
Subject / attribution — whose attribute is this? (self vs third party)
Framing — asserted vs quoted vs hypothetical; first-person voice
Multi-clause AND / OR, deontic, temporal conditions
Cross-language content (de / es / zh ≈ English)
Long documents (Granite reaches 128K context; the practical limit is GPU memory)
Calibrated + reproducible — same input → same score; safe to threshold, cache, monitor, A/B

What it is NOT for — please read

Predicate reads relatedness + instruction-following, not formal truth-relation. It is not a judge, a groundedness/hallucination detector, or a pragmatic-comprehension engine. For those, a small purpose-built model or a line of code beats it:

criterion type	use instead
scalar / quantifier logic ("not all", XOR, parity)	a rule, or a small NLI model
is a claim entailed / contradicted by messy evidence	a groundedness model (e.g. MiniCheck, an NLI encoder)
"is this factually true?" (world knowledge)	retrieval + a model that has the facts
exact numeric comparison	code

On vanilla zero-shot classification, small encoders (GLiClass, DeBERTa-zero-shot) are efficient peers at 0.4–0.6B params. Predicate's real edge is expressivity (arbitrary inference-time criteria, no retraining), calibration + reproducibility, context length, and the KV-pop cost wedge (many criteria amortized over one content prefill).

Honest numbers

Macro AUROC 0.890 across a 9-substrate / 6116-item internal slate (ocular, clean, held-out, cross-lingual, adversarial, ChaosNLI, compound-"vibe", codependence, implicature). Strong on ocular/cross-lingual/adversarial; weakest on implicature (the truth-relation wall — a structural limit, not a tuning gap). Encoders win narrow truth-relation tasks; Predicate wins universality + calibration + cost.

Non-claims

Predicate is a content classifier — not predictive, diagnostic, or therapeutic, and not a substitute for human judgment. Scores are signals for routing, ranking, and human review.

Project page, recipe, and benchmarks: https://labs.nope.net/predicate

Downloads last month: -

Model tree for nopenet/predicate-oss

Base model

ibm-granite/granite-4.0-micro

Adapter

(25)

this model