Watchly Smart-Match v2 (internal v11)

A fine-tuned 3-class NLI cross-encoder used by Watchly — a macOS watcher app — to decide at runtime whether a user's natural-language watch condition (e.g. "deploy succeeded", "my order shipped", "customer frustrated") is satisfied by the OCR text of a page snapshot.

What it's for

Smart-match runs as a layer-2 semantic gate after Watchly's deterministic rule engine. It only sees conditions the rule-drafter LLM routes to it: abstract / sentiment / state-event phrasing the rule engine can't compile to literal text-contains atoms. Numerical thresholds ("more than 100 errors"), state-change detection ("new email arrived"), and subjective conditions ("weather is nice") are handled elsewhere in Watchly's pipeline.

Architecture

Base: dleemiller/EttinX-nli-s — small NLI cross-encoder
Params: 68M (~261 MB safetensors)
Latency: ~20 ms per forward pass on Apple Silicon (M-series)
Output: 3-class NLI head — [contradiction, neutral, entailment]. Smart-match uses the entailment column (index 2).
Inputs: (condition, visible_text) pair. The page text is chunked into 300-char overlapping windows; entailment is max-pooled across chunks.
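
The chunk-and-pool step described under Inputs can be sketched as follows. This is a minimal illustration, not Watchly's implementation: the 300-character window comes from the description above, while the 150-character stride (50% overlap) and the function names `chunk_text` and `max_pooled_score` are assumptions made for the example.

```python
from typing import Callable, List


def chunk_text(text: str, window: int = 300, stride: int = 150) -> List[str]:
    """Split text into overlapping character windows.

    window=300 is the documented size; stride=150 is an assumed value,
    since the amount of overlap is not stated.
    """
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    if len(text) <= window:
        return [text]
    chunks = []
    start = 0
    while True:
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # this window already reaches the end of the text
        start += stride
    return chunks


def max_pooled_score(chunks: List[str], score_chunk: Callable[[str], float]) -> float:
    """Return the highest per-chunk score (max-pooling across chunks)."""
    return max(score_chunk(c) for c in chunks)
```

Max-pooling means a single strongly entailing window is enough to produce a high page-level score, which suits pages where the relevant status line is a small part of the OCR text.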

Training lineage

| Internal version | Description |
| --- | --- |
| v2 | Initial fine-tune on synthetic scenes corpus (~3000 cases) |
| v3 | + 465-row hard-negative patch (same-surface contrast) |
| v5 | + 240-row CLEAR-only curated round (Claude Haiku judge) |
| v6 | + 240-row topic/identity contrast (per-cluster scenarios) |
| v11 (this release) | + 768-row patch from 3 fresh adversarial holdouts (synonym positives + chrome-shortcut negatives) |

v11 was trained from v6 with 3 epochs at LR 5e-6, batch size 16. Patch shape: 384 contrast cases (3 sets × 128 Sonnet-generated adversarial scenarios) + 384 balanced replay from prior pools.
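
The patch shape (3 × 128 = 384 contrast cases plus 384 replay rows, 768 in total) can be illustrated with a small sampling sketch. The function `build_patch`, the pool names, and the reading of "balanced" as an equal share from each prior pool are all assumptions; the actual replay sampling may differ, for example by balancing on labels instead.

```python
import random
from typing import Dict, List, Sequence


def build_patch(contrast_sets: Sequence[List[dict]],
                replay_pools: Dict[str, List[dict]],
                seed: int = 0) -> List[dict]:
    """Mix new contrast cases 1:1 with replay rows drawn from prior pools."""
    rng = random.Random(seed)
    contrast = [row for cases in contrast_sets for row in cases]
    n_replay = len(contrast)
    # Assumption: "balanced" means an equal share from each prior pool.
    per_pool, remainder = divmod(n_replay, len(replay_pools))
    replay: List[dict] = []
    for i, (_, pool) in enumerate(sorted(replay_pools.items())):
        k = per_pool + (1 if i < remainder else 0)
        replay.extend(rng.sample(pool, k))  # raises ValueError if the pool is too small
    patch = contrast + replay
    rng.shuffle(patch)
    return patch
```

With three 128-case contrast sets and four hypothetical prior pools, this draws 96 replay rows per pool and returns 768 rows.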

Evaluation

Production smart-match in the Watchly app combines this cross-encoder with a runtime safety-guard layer:

Lexical-evidence guard (anchor stems must appear un-negated on page)
Polarity-contrast rescue (synonym TPs, predicate-stem-gated)
Future-pattern suppression
Existing danger-word + numeric-progress guards
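
To make the first guard concrete, here is a rough sketch of a lexical-evidence check. It is not Watchly's guard: the negation list, the three-token lookback, the prefix-based stem matching, and the requirement that every anchor stem be present are assumptions chosen for illustration.

```python
import re
from typing import Iterable

# Hypothetical negation cues; the ones Watchly uses are not documented here.
NEGATORS = {"not", "no", "never", "without", "isn't", "wasn't",
            "hasn't", "didn't", "cannot", "can't"}


def has_unnegated_evidence(page_text: str,
                           anchor_stems: Iterable[str],
                           lookback: int = 3) -> bool:
    """True if every anchor stem occurs at least once on the page with no
    negation cue in the `lookback` tokens immediately before it."""
    tokens = re.findall(r"[a-z0-9']+", page_text.lower())
    for stem in anchor_stems:
        stem = stem.lower()
        found = False
        for i, tok in enumerate(tokens):
            if tok.startswith(stem):
                preceding = tokens[max(0, i - lookback):i]
                if not NEGATORS.intersection(preceding):
                    found = True
                    break
        if not found:
            return False
    return True
```

Under these assumptions, "Your order has not shipped yet" fails the check for the stem "ship", while "Order #47291 — Shipped" passes.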

Numbers below include those guards.

| Suite | v6 (prior production) | v11 (this release) |
| --- | --- | --- |
| Production smart-match in-scope (75 cases) | 96.00% (0 FP) | 96.00% (0 FP) |
| Codex out-of-distribution (28) | 96.43% | 100.00% |
| Fresh holdout (40) | 92.50% | 92.50% |
| Adversarial big holdout v4 (truly held out, 128) | 69.53% | 74.21% |
| Adversarial big holdout v5 (truly held out, 128) | 75.00% | 75.00% |
| Synthetic v2 (1808) | 90.93% | 90.21% |

Zero false positives on the production smart-match path — the metric Watchly cares most about (no spurious watcher fires).
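
For the 75-case production suite, 96.00% accuracy means 72 correct decisions, and with 0 false positives the 3 errors must all be missed fires. The sketch below shows one way to compute those figures from fire/no-fire labels. The helper `score_suite` and the 40/35 positive/negative split are hypothetical; only the totals mirror the reported numbers.

```python
from typing import Sequence, Tuple


def score_suite(expected: Sequence[bool],
                predicted: Sequence[bool]) -> Tuple[float, int, int]:
    """Return (accuracy, false_positives, false_negatives) for fire/no-fire labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if len(expected) == 0:
        raise ValueError("empty suite")
    fp = sum(1 for e, p in zip(expected, predicted) if p and not e)
    fn = sum(1 for e, p in zip(expected, predicted) if e and not p)
    accuracy = (len(expected) - fp - fn) / len(expected)
    return accuracy, fp, fn


# Illustrative 75-case suite: the 40/35 split is made up for the example.
expected = [True] * 40 + [False] * 35
predicted = [True] * 37 + [False] * 3 + [False] * 35
accuracy, fp, fn = score_suite(expected, predicted)
print(f"accuracy={accuracy:.2%} fp={fp} fn={fn}")  # accuracy=96.00% fp=0 fn=3
```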

Usage

```python
from sentence_transformers import CrossEncoder
import numpy as np

model = CrossEncoder("alyssaxuu/watchly-sm-v2", max_length=512)

# Page text is chunked into 300-char windows and entailment is max-pooled
chunks = [
    "Order #47291 — Shipped\nThank you for your purchase from Bellroy!\nTracking: UPS 1Z999AA10123456784",
]
condition = "my order shipped"
raw = np.array(model.predict([(condition, c) for c in chunks], apply_softmax=True))
entail = float(np.max(raw[:, 2]))  # column 2 = entailment
# Production threshold: entail >= 0.50 → match (then runs through guard layer)
print(f"score={entail:.3f}")
```

In Watchly, the entailment score is then refined by the runtime guard layer described above before becoming a fire/no-fire decision.
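
A simplified view of that last step, as a sketch only: the 0.50 threshold is the production value quoted above, but the veto-only structure, the `should_fire` helper, and the example `no_future_phrasing` guard are assumptions. In particular, this sketch does not model the polarity-contrast rescue, which can presumably recover some cases the raw score would miss.

```python
import re
from typing import Callable, Sequence

ENTAIL_THRESHOLD = 0.50  # production threshold quoted in the usage example

# A guard takes (condition, page_text) and returns True if firing is allowed.
Guard = Callable[[str, str], bool]


def no_future_phrasing(condition: str, page_text: str) -> bool:
    """Hypothetical stand-in for future-pattern suppression."""
    return re.search(r"\b(will|scheduled to|expected to)\b", page_text.lower()) is None


def should_fire(entail_score: float,
                condition: str,
                page_text: str,
                guards: Sequence[Guard],
                threshold: float = ENTAIL_THRESHOLD) -> bool:
    """Fire only if the score clears the threshold and no guard vetoes."""
    if entail_score < threshold:
        return False
    return all(guard(condition, page_text) for guard in guards)
```

Treating guards as vetoes keeps the design biased toward missed fires over spurious ones, in line with the zero-false-positive priority stated in the evaluation section.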

Limitations

Adversarial chrome-shortcut OCR (page is on-topic but state is opposite, e.g. condition "layoffs announced" on a Shopify hiring page): the cross-encoder hits a ~75% ceiling on this distribution at the 68M-param scale. The runtime guard layer catches the worst confidence-locked failures; an ensemble with a deberta-v3-base co-classifier pushes held-out adversarial accuracy to ~84%, if the size/latency budget allows.
Synonym-only positives (page uses a different vocabulary than the condition's predicate, e.g. "finished" vs "Upload Complete"): the cross-encoder handles many but not all. The polarity-contrast rescue catches a meaningful fraction; the rest are accepted as missed-fires (preferred over spurious fires).
Numerical / quantitative claims are out of scope by design — routed to Watchly's deterministic rule engine.

License

Apache 2.0, matching the EttinX-nli-s base model's license.
