What happens when you check whether your own detector detects what it claims

Published June 8, 2026

The objective-projection dataset ships with a field called applied_rules on every scene. It records, scene by scene, whether a rule-based detector thinks six "craft" features are present, such as a banned simile, a temporal anchor, or a materialized metaphor. It is convenient: deterministic, fast, bilingual. Until now, its accuracy was also undocumented.

So I ran the obvious test, and the result is worth publishing precisely because part of it is unflattering.

The check

I took 120 Turkish scenes from the dataset (100 sampled at random, plus 20 drawn to reinforce rare classes) and labeled each one by hand for the six features, blind to the detector's output. Then I compared. Positive class throughout means "the feature is present in the text."
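A two-stage draw like this can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the procedure actually used: the post does not say how the 20 reinforcement scenes were chosen, so `is_rare_candidate` is a hypothetical predicate, and `draw_sample`, the seed, and the toy scene IDs are invented for the example.

```python
import random
from typing import Callable, List, Sequence, Tuple


def draw_sample(
    scene_ids: Sequence[int],
    is_rare_candidate: Callable[[int], bool],  # hypothetical selection rule
    n_random: int = 100,
    n_boost: int = 20,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Return (random base sample, rare-class reinforcement sample)."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    base = rng.sample(list(scene_ids), n_random)
    already_chosen = set(base)
    # Reinforcement scenes come only from scenes not already in the base sample.
    pool = [s for s in scene_ids
            if s not in already_chosen and is_rare_candidate(s)]
    boost = rng.sample(pool, min(n_boost, len(pool)))
    return base, boost


# Toy usage: 300 scene IDs, with even IDs standing in for "rare-class candidates".
base, boost = draw_sample(range(300), lambda s: s % 2 == 0)
print(len(base), len(boost))
```

One consequence of this design is worth keeping in mind: the reinforcement scenes raise the share of rare classes in the labeled set, so any estimate of how common a feature is should arguably rest on the random 100 alone.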

The picture splits cleanly in two.

For surface, lexical features, the detector is fine. Simile detection matched the human labels perfectly (κ = 1.00), though on only two positive cases, so that is a strong direction on weak evidence rather than a settled fact. Two other rules, micro-focus and temporal anchor, looked excellent on precision (0.98 and 1.00), but that is an artifact: the human rater marked those features present in almost every scene (118/120 and 120/120). When nearly everything is positive, almost any flag counts as correct, so high precision is guaranteed and Cohen's κ carries no information. Those rules are better described as not evaluable on this sample than as good.
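The prevalence artifact is easy to reproduce. The sketch below scores a hypothetical detector that flags every one of the 120 scenes against the human prevalences reported above. It is not a claim about what the real detector flagged, and the function names are illustrative.

```python
from typing import Optional


def precision(tp: int, fp: int) -> Optional[float]:
    """Share of detector flags that the human rater confirmed."""
    flagged = tp + fp
    return tp / flagged if flagged else None


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> Optional[float]:
    """Cohen's kappa for one rule from its 2x2 detector-vs-human table."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    det_pos = (tp + fp) / n  # how often the detector says "present"
    hum_pos = (tp + fn) / n  # how often the human says "present"
    expected = det_pos * hum_pos + (1 - det_pos) * (1 - hum_pos)
    if expected == 1.0:
        return None  # no variation in either set of labels: kappa is undefined
    return (observed - expected) / (1 - expected)


# Hypothetical always-yes detector, scored against the reported human
# prevalences (118/120 for micro-focus, 120/120 for temporal anchor).
micro_focus = dict(tp=118, fp=2, fn=0, tn=0)
temporal_anchor = dict(tp=120, fp=0, fn=0, tn=0)

print(round(precision(118, 2), 2), cohen_kappa(**micro_focus))
print(precision(120, 0), cohen_kappa(**temporal_anchor))
```

Under that assumption, a detector that does no work at all still reaches precision 0.98 and 1.00, while κ comes out as 0 in the first case and undefined in the second. That is why high precision on these two rules says little about the detector.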

For features that require inference, it is a different story. The emotion-label rule over-fires badly: it flagged 12 scenes and the human rater confirmed only 2, a precision of 0.17. And the headline result: the materialized-metaphor rule scored κ ≈ 0.00, no better than chance. The detector is measuring "did a physical-parameter token leak into the prose," which is not the same construct as the literary feature a reader recognizes. That is an approach–construct mismatch, not something a bigger keyword list would fix.

Overall raw agreement across all 720 cells (120 scenes × 6 rules) was 81.5% — a number that, on its own, would have flattered the detector and hidden the two rules that actually fail.
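To see how a pooled agreement figure can mask a failing rule, here is a minimal sketch on invented numbers. The two tables are toy data (100 scenes per rule), not the study's confusion tables, and the rule names are placeholders.

```python
from fractions import Fraction

# Toy 2x2 tables as (tp, fp, fn, tn). Invented for illustration only.
TOY_TABLES = {
    "surface_rule": (40, 0, 0, 60),      # detector and human always agree
    "inference_rule": (4, 16, 16, 64),   # flags unrelated to the human labels
}


def agreement(tp, fp, fn, tn):
    """Raw share of scenes where detector and human give the same label."""
    return Fraction(tp + tn, tp + fp + fn + tn)


def kappa(tp, fp, fn, tn):
    """Cohen's kappa; assumes both label sets vary (expected agreement < 1)."""
    n = tp + fp + fn + tn
    observed = Fraction(tp + tn, n)
    det_pos, hum_pos = Fraction(tp + fp, n), Fraction(tp + fn, n)
    expected = det_pos * hum_pos + (1 - det_pos) * (1 - hum_pos)
    return (observed - expected) / (1 - expected)


def pooled_agreement(tables):
    """Raw agreement over all cells of all rules taken together."""
    agree = sum(tp + tn for tp, fp, fn, tn in tables.values())
    total = sum(sum(table) for table in tables.values())
    return Fraction(agree, total)


print("pooled agreement:", float(pooled_agreement(TOY_TABLES)))
for name, table in TOY_TABLES.items():
    print(name, float(agreement(*table)), float(kappa(*table)))
```

In this toy case the pooled agreement is 0.84, yet one of the two rules has κ of exactly 0: its 68% raw agreement is what chance alone would produce given how rarely both sides say "present". Reporting κ per rule is what exposes it.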

One thing I got wrong, and fixed

During the first labeling pass, my own criterion for "materialized metaphor" drifted — the threshold I was applying shifted over the session. I caught it, and re-labeled that one rule in a single time-distributed pass under the same definition. I am reporting this rather than quietly fixing it, because the reliability of a single-rater study depends on the rater being honest about exactly this kind of slip.

What changed in the dataset

Off the back of the check, three pieces of documentation now travel with the data:

A note on applied_rules stating plainly that these flags are deterministic detector output, not human-validated labels, with the reliability numbers attached.
A schema guide documenting an asymmetry I had not advertised: the English and Turkish halves use different record fields, and even encode the physical matrix differently (six-parameter keys for English, M/T/V/Δ/Ω/Ng for Turkish).
A category audit, which surfaced the sharpest structural finding of all: the two halves categorize on different axes. English categories are purely emotion; the Turkish half mixes emotion with scenario and genre. Of 300 Turkish scenes, 80% are categorized by scenario (pandemi, yapay zeka, bilim kurgu…), only 20% by emotion. A clean cross-language category mapping is therefore not honestly possible, and the audit says so instead of inventing one.

Limits, stated up front

This is a single-rater pilot. A second independent annotator would give a real inter-rater κ; there isn't one yet. The study is Turkish-only, so the flags on the 200 English scenes are entirely untested. Four of the six rules had badly skewed classes. And there is a conflict of interest worth naming: I built the detector and I am the rater. The headline finding runs against my own tool, so the incentive, if anything, ran the other way, but the conflict is real and I am disclosing it.

The actual point

A rule-based proxy is adequate for surface lexical features and unreliable for features that require reading for meaning. That is a boring, useful, falsifiable claim — the kind that is more valuable than a confident one. A companion write-up with the full confusion tables and the 120 human labels is in preparation.

If you work on writing-quality annotation and want to be the second rater, or to replicate any of this on the data, the dataset is openly available (HF DOI 10.57967/hf/8960, Zenodo archive 10.5281/zenodo.19511369, CC BY-NC-ND). Disagreement with the labels is the most useful thing that could happen to this study. The dataset lives at huggingface.co/datasets/leventbulut/objective-projection; the full methodology archive is at leventbulut.com.
