Watchly Smart-Match v2 (internal v11)

A fine-tuned 3-class NLI cross-encoder used by Watchly (a macOS watcher app) to decide at runtime whether a user's natural-language watch condition (e.g. "deploy succeeded", "my order shipped", "customer frustrated") is satisfied by the OCR text of a page snapshot.

What it's for

Smart-match runs as a layer-2 semantic gate after Watchly's deterministic rule engine. It only sees conditions the rule-drafter LLM routes to it: abstract / sentiment / state-event phrasing the rule engine can't compile to literal text-contains atoms. Numerical thresholds ("more than 100 errors"), state-change detection ("new email arrived"), and subjective conditions ("weather is nice") are handled elsewhere in Watchly's pipeline.

Architecture

  • Base: dleemiller/EttinX-nli-s, a small NLI cross-encoder
  • Params: 68M (~261 MB safetensors)
  • Latency: ~20 ms per forward pass on Apple Silicon (M-series)
  • Output: 3-class NLI head with classes [contradiction, neutral, entailment]. Smart-match uses the entailment column (index 2).
  • Inputs: (condition, visible_text) pair. The page text is chunked into 300-char overlapping windows; entailment is max-pooled across chunks.
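The chunk-and-pool step described in the Inputs bullet can be sketched as below. This is a minimal illustration, not Watchly's actual implementation: the 300-char window comes from the text above, but the 50-char overlap and the function names `chunk_text` and `max_pool_entailment` are assumptions chosen for the example.

```python
from typing import List, Sequence


def chunk_text(text: str, window: int = 300, overlap: int = 50) -> List[str]:
    """Split text into character windows of size `window`.

    Consecutive windows share `overlap` characters. The overlap size is an
    assumption; the model card only says the windows overlap.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if not text:
        return []
    step = window - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + window])
        if start + window >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def max_pool_entailment(chunk_probs: Sequence[Sequence[float]],
                        entail_index: int = 2) -> float:
    """Return the highest entailment probability across all chunks.

    Each row is [contradiction, neutral, entailment]; index 2 is entailment.
    An empty input yields 0.0 (an assumption for this sketch).
    """
    return max((row[entail_index] for row in chunk_probs), default=0.0)
```

Max-pooling means a single chunk that strongly entails the condition is enough to produce a high page-level score, regardless of how the other chunks score.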

Training lineage

| Internal version | Description |
|---|---|
| v2 | Initial fine-tune on synthetic scenes corpus (~3000 cases) |
| v3 | + 465-row hard-negative patch (same-surface contrast) |
| v5 | + 240-row CLEAR-only curated round (Claude Haiku judge) |
| v6 | + 240-row topic/identity contrast (per-cluster scenarios) |
| v11 (this release) | + 768-row patch from 3 fresh adversarial holdouts (synonym positives + chrome-shortcut negatives) |

v11 was trained from v6 with 3 epochs at LR 5e-6, batch size 16. Patch shape: 384 contrast cases (3 sets × 128 Sonnet-generated adversarial scenarios) + 384 balanced replay from prior pools.
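As a quick consistency check on the patch shape, the arithmetic can be written out as follows. The `PatchConfig` class is a hypothetical container for the numbers quoted above, and the step counts assume a single device with no gradient accumulation, which the model card does not state.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchConfig:
    """Hypothetical holder for the v11 patch numbers quoted in the text."""
    contrast_sets: int = 3
    cases_per_set: int = 128
    replay_rows: int = 384
    epochs: int = 3
    learning_rate: float = 5e-6
    batch_size: int = 16

    @property
    def contrast_rows(self) -> int:
        return self.contrast_sets * self.cases_per_set

    @property
    def total_rows(self) -> int:
        return self.contrast_rows + self.replay_rows

    @property
    def steps_per_epoch(self) -> int:
        # Assumes one device and no gradient accumulation.
        return math.ceil(self.total_rows / self.batch_size)

    @property
    def total_steps(self) -> int:
        return self.steps_per_epoch * self.epochs


cfg = PatchConfig()
print(cfg.contrast_rows, cfg.total_rows, cfg.steps_per_epoch, cfg.total_steps)
# prints: 384 768 48 144
```

Under those assumptions the whole patch amounts to 144 optimizer steps, a small nudge on top of v6 rather than a retrain.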

Evaluation

Production smart-match in the Watchly app combines this cross-encoder with a runtime safety-guard layer:

  • Lexical-evidence guard (anchor stems must appear un-negated on page)
  • Polarity-contrast rescue (synonym TPs, predicate-stem-gated)
  • Future-pattern suppression
  • Existing danger-word + numeric-progress guards

Numbers below include those guards.

| Suite | v6 (prior production) | v11 (this release) |
|---|---|---|
| Production smart-match in-scope (75 cases) | 96.00% (0 FP) | 96.00% (0 FP) |
| Codex out-of-distribution (28 cases) | 96.43% | 100.00% |
| Fresh holdout (40 cases) | 92.50% | 92.50% |
| Adversarial big holdout v4 (truly held out, 128 cases) | 69.53% | 74.21% |
| Adversarial big holdout v5 (truly held out, 128 cases) | 75.00% | 75.00% |
| Synthetic v2 (1808 cases) | 90.93% | 90.21% |

v11 keeps zero false positives on the production smart-match path, the metric Watchly cares most about (no spurious watcher fires).
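The relationship between the accuracy figure and the false-positive count can be illustrated with a small scoring helper. This is a sketch, not the evaluation harness used for the numbers above; the 40/35 split between positive and negative cases is invented purely so the example reproduces the 96.00% (0 FP) row on 75 cases.

```python
from typing import Sequence, Tuple


def score_suite(expected: Sequence[bool],
                predicted: Sequence[bool]) -> Tuple[float, int, int]:
    """Return (accuracy, false_positives, false_negatives) for fire/no-fire labels."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("suite is empty")
    correct = sum(e == p for e, p in zip(expected, predicted))
    false_pos = sum(p and not e for e, p in zip(expected, predicted))
    false_neg = sum(e and not p for e, p in zip(expected, predicted))
    return correct / len(expected), false_pos, false_neg


# Hypothetical 75-case suite: 40 should fire, 35 should not (split is assumed).
expected = [True] * 40 + [False] * 35
predicted = list(expected)
predicted[0] = predicted[1] = predicted[2] = False  # three missed fires

acc, fp, fn = score_suite(expected, predicted)
print(f"{acc:.2%} ({fp} FP, {fn} FN)")
# prints: 96.00% (0 FP, 3 FN)
```

On a 75-case suite, 96.00% corresponds to 72 correct decisions, so with zero false positives the three errors are all missed fires, which is the failure mode the Limitations section describes as preferred.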

Usage

from sentence_transformers import CrossEncoder
import numpy as np

model = CrossEncoder("alyssaxuu/watchly-sm-v2", max_length=512)

# In production the page text is chunked into 300-char windows and entailment
# is max-pooled across them; this short page fits in a single chunk.
chunks = [
    "Order #47291 - Shipped\nThank you for your purchase from Bellroy!\nTracking: UPS 1Z999AA10123456784",
]
condition = "my order shipped"
# predict() returns one row of [contradiction, neutral, entailment] probabilities per pair
raw = np.array(model.predict([(condition, c) for c in chunks], apply_softmax=True))
entail = float(np.max(raw[:, 2]))  # column 2 = entailment
# Production threshold: entail >= 0.50 -> match (then runs through guard layer)
print(f"score={entail:.3f}")

In Watchly, the entailment score is then refined by the runtime guard layer described above before becoming a fire/no-fire decision.
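One way to picture that refinement step is as a chain of functions that each adjust the score before the 0.50 threshold is applied. This is only a sketch under assumptions: the `Guard` signature, the `decide_fire` name, the order of refinement versus thresholding, and the toy `suppress_future_tense` guard are all hypothetical and do not describe Watchly's real guard layer.

```python
from typing import Callable, Iterable

# Hypothetical guard signature: (condition, page_text, score) -> refined score.
Guard = Callable[[str, str, float], float]

MATCH_THRESHOLD = 0.50  # production threshold quoted in the usage snippet


def decide_fire(condition: str, page_text: str, entail_score: float,
                guards: Iterable[Guard] = ()) -> bool:
    """Apply each guard in order, then threshold the refined score."""
    score = entail_score
    for guard in guards:
        score = guard(condition, page_text, score)
    return score >= MATCH_THRESHOLD


def suppress_future_tense(condition: str, page_text: str, score: float) -> float:
    """Toy stand-in for future-pattern suppression: zero the score if the
    page talks about something that 'will' happen."""
    return 0.0 if "will " in page_text.lower() else score
```

With this shape, a suppressing guard lowers the score and a rescuing guard (such as the polarity-contrast rescue) would raise it, so both kinds fit the same interface.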

Limitations

  • Adversarial chrome-shortcut OCR (page is on-topic but state is opposite, e.g. condition "layoffs announced" on a Shopify hiring page): the cross-encoder hits a ~75% ceiling on this distribution at the 68M-param scale. The runtime guard layer catches the worst confidence-locked failures; an ensemble with a deberta-v3-base co-classifier pushes the held-out adversarial accuracy to ~84% if size/latency budget allows.
  • Synonym-only positives (page uses a different vocabulary than the condition's predicate, e.g. "finished" vs "Upload Complete"): the cross-encoder handles many but not all. The polarity-contrast rescue catches a meaningful fraction; the rest are accepted as missed-fires (preferred over spurious fires).
  • Numerical / quantitative claims are out of scope by design; they are routed to Watchly's deterministic rule engine.

License

Apache 2.0, matching the EttinX-nli-s base model's license.
