Upload folder using huggingface_hub

Files changed (11)

README.md +90 -0
config.json +31 -0
inference_rules.py +136 -0
model.safetensors +3 -0
score.py +95 -0
special_tokens_map.json +7 -0
threshold.json +4 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
training_args.bin +3 -0
vocab.txt +0 -0

README.md ADDED

	@@ -0,0 +1,90 @@

+---
+language: "no"
+license: apache-2.0
+base_model: NbAiLab/nb-bert-base
+tags:
+  - text-classification
+  - relevance-scoring
+  - procurement
+  - norwegian
+---
+# Menon nb-bert relevance scorer (v4)
+Binary classifier built on top of [`NbAiLab/nb-bert-base`](https://huggingface.co/NbAiLab/nb-bert-base).
+Scores procurement notices as **RELEVANT** or **NOT_RELEVANT** for Menon Economics' lead pipeline.
+## What's new in v4 vs v3
+- Trained only on the (translated) Norwegian description — no `tittel`, no `oppdragsgiver`, no portal/country features. Removes the v3 shortcut where the model learned client/country identity instead of project topic.
+- Near-duplicate negatives are downweighted via TF-IDF clustering, so templated doffin notices stop dominating the loss.
+- International positives are upweighted (`weight=2`).
+- Empty / placeholder / non-Norwegian descriptions are routed to `needs_review` instead of being scored.
+## Held-out test results (n = 1,214)
+| split | precision | recall | F1 |
+|---|---:|---:|---:|
+| **overall** | 0.76 | 0.89 | 0.82 |
+| **international subset** (n=8) | 0.86 | 1.00 | 0.92 |
+
+Threshold tuned on validation for recall ≥ 0.90: **0.2594** (saved in `threshold.json`).
+## Usage
+```python
+from score import score_lead
+# Norwegian input — gets a real score
+score_lead("Anskaffelse av samfunnsøkonomisk analyse for evaluering...")
+# → {"label": "RELEVANT", "score": 0.83, "threshold": 0.2594, "reason": "ok"}
+# Empty / placeholder / non-Norwegian input — routed to review, not scored
+score_lead("")
+# → {"label": "needs_review", "score": None, "reason": "empty"}
+score_lead("Se konkurransegrunnlag")
+# → {"label": "needs_review", "score": None, "reason": "too_short(len=22)"}
+score_lead("TRANSQ is a joint qualification system for transport suppliers.")
+# → {"label": "needs_review", "score": None, "reason": "non_norwegian(en)"}
+```
+## Important: input must be in Norwegian
+The model assumes incoming descriptions are already in Norwegian Bokmål.
+Production callers (the leads_scraper) translate non-Norwegian leads upstream.
+Anything detected as a language other than Norwegian, Danish, or Swedish is
+intentionally flagged as `needs_review` so a human can fetch a correct
+translation rather than the model returning a low-confidence guess.
+
+For one-off ad-hoc scoring of raw foreign text, translate it with any tool
+(DeepL / OpenAI / GPT / Google) **before** calling `score_lead`.
+
+Requires:
+- `transformers`, `torch`, `langdetect`
+- No API keys needed.
+## Files in this repo
+| file | purpose |
+|---|---|
+| `model.safetensors`, `config.json` | Model weights + config |
+| `tokenizer.json`, `vocab.txt`, etc. | Tokenizer |
+| `threshold.json` | Tuned decision threshold |
+| `inference_rules.py` | `needs_review()` gate (empty / short / placeholder / non-Norwegian) |
+| `score.py` | End-to-end scoring function (use this) |
+## Training data
+13,177 labeled procurement leads from doffin / mercell / TED / Nordisk ministerråd /
+hilma / FHF, with per-row weights encoding class balance + cluster dedup +
+international upweight. After filtering `needs_review` rows: 12,133 used for training.
+Stratified 80/10/10 split by `(Is_relevant, international)`.
+## Caveats
+- International subset is small (n=8 held out, per the results table). The 100% recall is encouraging but high-variance.
+- The `needs_review` gate is lenient toward text that `langdetect` labels as Danish or Swedish: those languages are mutually intelligible with Norwegian Bokmål and the model handles them fine, so they pass through to scoring instead of being flagged.
+- Production assumption: leads arrive translated. About 5–7 non-Norwegian leads/month historically slipped through — those will be routed to human review under v4.
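
The code for the stratified 80/10/10 split by `(Is_relevant, international)` described under "Training data" is not part of this commit. A minimal sketch of one way to do it, assuming each row is a dict carrying those two keys; `stratified_split`, the seed, and the rounding rule are illustrative choices, not the authors' implementation:

```python
import random
from collections import defaultdict


def stratified_split(rows, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test, keeping each stratum's proportions.

    A stratum is the pair (Is_relevant, international). Hypothetical helper.
    """
    strata = defaultdict(list)
    for row in rows:
        strata[(row["Is_relevant"], row["international"])].append(row)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for key in sorted(strata):  # sorted so the result is reproducible
        group = strata[key]
        rng.shuffle(group)
        n_train = round(fractions[0] * len(group))
        n_val = round(fractions[1] * len(group))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Toy data: 100 domestic negatives, 40 domestic positives, 10 international positives.
rows = (
    [{"Is_relevant": 0, "international": 0} for _ in range(100)]
    + [{"Is_relevant": 1, "international": 0} for _ in range(40)]
    + [{"Is_relevant": 1, "international": 1} for _ in range(10)]
)
train, val, test = stratified_split(rows)
print(len(train), len(val), len(test))  # 120 15 15
```

Splitting inside each stratum guarantees that the rare international positives appear in all three splits, which a plain random split of the whole table would not.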

config.json ADDED

	@@ -0,0 +1,31 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "directionality": "bidi",
+  "dtype": "float32",
+  "gradient_checkpointing": false,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 768,
+  "initializer_range": 0.02,
+  "intermediate_size": 3072,
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 512,
+  "model_type": "bert",
+  "num_attention_heads": 12,
+  "num_hidden_layers": 12,
+  "pad_token_id": 0,
+  "pooler_fc_size": 768,
+  "pooler_num_attention_heads": 12,
+  "pooler_num_fc_layers": 3,
+  "pooler_size_per_head": 128,
+  "pooler_type": "first_token_transform",
+  "position_embedding_type": "absolute",
+  "transformers_version": "4.57.6",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 119547
+}

inference_rules.py ADDED

	@@ -0,0 +1,136 @@

+"""
+Inference-time gate for the relevance scorer.
+PURPOSE
+-------
+Some procurement notices have descriptions like "Se konkurransegrunnlag" or are
+empty entirely. The model can't classify what isn't there. Instead of returning
+a (low-confidence) prediction, the pipeline returns "needs_review" so a human
+can fetch the missing content from the linked documents/website.
+USE
+---
+Same gate must be applied in:
+  1. Training-data filtering — drop rows where needs_review() is True.
+  2. Inference time         — skip the model call, return "needs_review".
+This keeps training and serving aligned.
+USAGE
+-----
+    from inference_rules import needs_review
+    flag, reason = needs_review(kort_beskrivelse)
+    if flag:
+        return {"label": "needs_review", "reason": reason}
+    # else: run the model
+"""
+import re
+from langdetect import DetectorFactory, detect
+DetectorFactory.seed = 0
+MIN_LEN = 30  # below this → needs_review
+# Languages we consider close enough to Norwegian Bokmål for the model.
+# - 'no' (Norwegian) is the obvious one.
+# - 'da' (Danish) is mutually intelligible with Norwegian; nb-bert-base handles it.
+# - 'sv' (Swedish) is close enough that langdetect often confuses it with Norwegian.
+# Anything else → routed to human review (production assumption: scraper translated).
+NORWEGIAN_READABLE = {"no", "da", "sv"}
+# Lowercased placeholder phrases (text == one of these, after strip+lower).
+PLACEHOLDER_PHRASES = {
+    "se tittel",
+    "se tittel.",
+    "tittelen sier vel alt",
+    "tittelen sier vel alt.",
+    "se konkurransegrunnlag",
+    "se konkurransegrunnlag.",
+    "se vedlegg",
+    "se vedlegg.",
+    "se dokumentene",
+    "se dokumentene.",
+    "se dokumentene som ble sendt på mail",
+    "se henvendelse på e-post",
+    "se henvendelse på epost",
+    "se nettside",
+    "se nettside.",
+    "se utlysning",
+    "se utlysning.",
+    "se anbudsdokumenter",
+    "se anbudsdokumenter.",
+    "rammeavtale",
+    "rammeavtale.",
+}
+# Substring patterns: short descriptions that *contain* these phrases also fail.
+PLACEHOLDER_PATTERNS = [
+    # "dokument(?:er|ene)" covers both "se dokumenter" and "se dokumentene".
+    re.compile(r"^se\s+(tittel|konkurransegrunnlag|vedlegg|dokument(?:er|ene)|nettside|utlysning|anbud|henvendelse)", re.IGNORECASE),
+    re.compile(r"tittelen sier vel alt", re.IGNORECASE),
+    re.compile(r"sjekk (dokumentene|vedlegg|nettsiden|websiden)", re.IGNORECASE),
+    re.compile(r"check (the doc|website|attachment)", re.IGNORECASE),
+    re.compile(r"read (website|the website|the doc)", re.IGNORECASE),
+]
+def needs_review(text):
+    """Return (True, reason) if the description should NOT be sent to the model.
+    Otherwise returns (False, "ok").
+    """
+    if text is None:
+        return True, "empty"
+    s = str(text).strip()
+    if s == "" or s.lower() == "nan":
+        return True, "empty"
+    if len(s) < MIN_LEN:
+        return True, f"too_short(len={len(s)})"
+    s_lower = s.lower().strip().rstrip(".").strip()
+    if s_lower in {p.rstrip(".") for p in PLACEHOLDER_PHRASES}:
+        return True, "placeholder_phrase"
+    for pat in PLACEHOLDER_PATTERNS:
+        if pat.search(s):
+            # Only fire as placeholder if the description is also short — a long
+            # description that *mentions* "se vedlegg" inside a real sentence is fine.
+            if len(s) < 80:
+                return True, f"placeholder_match({pat.pattern[:30]})"
+    # Last check: language. Production assumption is that the scraper has already
+    # translated foreign-language leads into Norwegian. Anything that arrives here
+    # in another language is unexpected — route to human review.
+    try:
+        lang = detect(s[:500])
+        if lang not in NORWEGIAN_READABLE:
+            return True, f"non_norwegian({lang})"
+    except Exception:
+        # Detection failure → fall through (assume Norwegian, don't block)
+        pass
+    return False, "ok"
+if __name__ == "__main__":
+    tests = [
+        "",
+        None,
+        "   ",
+        "Se konkurransegrunnlag",
+        "Se tittel.",
+        "Tittelen sier vel alt.",
+        "Rammeavtale",
+        "Sjekk dokumentene",
+        "Anskaffelse av samfunnsøkonomisk analyse for transportforskning innen evaluering.",
+        "Short text",
+        "Se vedlegg for full beskrivelse av kontraktens innhold inkludert alle leveranser.",  # long → ok
+        "TRANSQ is a joint qualification system for Scandinavian transport suppliers.",  # English → flagged
+        "Hilma on Suomen julkisten hankintojen ilmoituskanava ja keskitetty palvelu.",  # Finnish → flagged
+    ]
+    for t in tests:
+        flag, reason = needs_review(t)
+        print(f"{flag!s:<6} {reason:<35} | {t!r}")
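
The order of the checks in `needs_review()` determines which reason a caller sees: the length check runs before the placeholder checks, so any placeholder shorter than `MIN_LEN` is reported as `too_short`. The sketch below replays the non-language checks in the same order with trimmed copies of the lists. `gate_reason`, `PHRASES`, and `PATTERNS` are illustrative names, the `placeholder_match` reason is simplified (the real one appends the pattern text), and the `langdetect` step is left out so the block runs on the standard library alone:

```python
import re

MIN_LEN = 30            # same value as inference_rules.py
SHORT_PATTERN_MAX = 80  # patterns only fire below this length

# Trimmed copies of the real lists, for illustration only.
PHRASES = {"se konkurransegrunnlag", "se dokumentene som ble sendt på mail"}
PATTERNS = [re.compile(r"^se\s+(tittel|konkurransegrunnlag|vedlegg)", re.IGNORECASE)]


def gate_reason(text):
    """Replay the non-language checks of needs_review() in the same order."""
    s = "" if text is None else str(text).strip()
    if s == "" or s.lower() == "nan":
        return "empty"
    if len(s) < MIN_LEN:  # checked before any placeholder rule
        return f"too_short(len={len(s)})"
    if s.lower().rstrip(".").strip() in PHRASES:
        return "placeholder_phrase"
    if len(s) < SHORT_PATTERN_MAX and any(p.search(s) for p in PATTERNS):
        return "placeholder_match"
    return "ok"  # the real gate would now run langdetect


print(gate_reason("Se konkurransegrunnlag"))                           # too_short(len=22)
print(gate_reason("Se dokumentene som ble sendt på mail"))             # placeholder_phrase
print(gate_reason("Se vedlegg for nærmere beskrivelse av oppdraget"))  # placeholder_match
```

Either outcome routes the lead to review, so the ordering only matters to anyone who aggregates the `reason` strings.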

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:f6ca38993266759d2ffe53152b98810effb9714608f2bed64f43dff6141c845e
+size 711443456
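
As a sanity check, the LFS pointer size can be compared with the parameter count implied by `config.json`. The sketch assumes the standard `BertForSequenceClassification` layout with a two-label head (the config does not list `num_labels`, so 2 is an assumption) and that only trainable float32 tensors are stored:

```python
# Values copied from config.json in this commit.
hidden = 768
intermediate = 3072
layers = 12
vocab = 119547
max_positions = 512
type_vocab = 2
num_labels = 2  # assumption: binary head, not stated in config.json


def linear(n_in, n_out):
    """Parameter count of a dense layer with bias."""
    return n_in * n_out + n_out


layer_norm = 2 * hidden  # scale + shift

embeddings = (vocab + max_positions + type_vocab) * hidden + layer_norm
encoder_layer = (
    4 * linear(hidden, hidden)      # query, key, value, attention output
    + layer_norm
    + linear(hidden, intermediate)  # feed-forward up-projection
    + linear(intermediate, hidden)  # feed-forward down-projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total_params = embeddings + layers * encoder_layer + pooler + classifier
tensor_bytes = total_params * 4  # float32
lfs_size = 711443456             # "size" line of the LFS pointer

print(total_params)              # 177854978
print(lfs_size - tensor_bytes)   # 23544
```

The roughly 23 KB left over is about what a safetensors JSON header for some 200 tensors would occupy, so the file size is consistent with a float32 BERT-base model whose large multilingual vocabulary accounts for about half of the parameters.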

score.py ADDED

	@@ -0,0 +1,95 @@

+"""
+End-to-end inference for the Menon nb-bert relevance scorer (v4).
+PIPELINE
+--------
+    needs_review() gate  →  if flagged, return early
+    tokenize             →  run model  →  apply tuned threshold
+DESIGN NOTE
+-----------
+Translation is NOT done here. Production assumption: the leads_scraper
+has already translated foreign-language leads to Norwegian before they
+reach this model. Anything that arrives here in another language is
+flagged as `needs_review` (handled inside `inference_rules.py`).
+For ad-hoc scoring of raw foreign text, translate beforehand with any
+tool and feed the Norwegian version in.
+USAGE
+-----
+    from score import score_lead
+    result = score_lead(kort_beskrivelse_text)
+    # → {"label": "RELEVANT" / "NOT_RELEVANT" / "needs_review",
+    #    "score": 0.83, "threshold": 0.2594, "reason": "..."}
+REQUIRES
+--------
+    Packages: transformers, torch, langdetect
+"""
+import json
+from pathlib import Path
+import torch
+from transformers import AutoModelForSequenceClassification, AutoTokenizer
+from inference_rules import needs_review
+_MODEL_DIR = Path(__file__).parent
+_threshold = None
+_model = None
+_tokenizer = None
+def _lazy_load():
+    global _model, _tokenizer, _threshold
+    if _model is None:
+        _tokenizer = AutoTokenizer.from_pretrained(str(_MODEL_DIR))
+        _model = AutoModelForSequenceClassification.from_pretrained(str(_MODEL_DIR))
+        _model.eval()
+        with open(_MODEL_DIR / "threshold.json") as f:
+            _threshold = json.load(f)["threshold"]
+    return _model, _tokenizer, _threshold
+def score_lead(kort_beskrivelse: str, max_length: int = 256) -> dict:
+    """Score a single procurement description.
+    Returns a dict:
+      - {"label": "RELEVANT", "score": 0.83, "threshold": 0.26, "reason": "ok"}
+      - {"label": "NOT_RELEVANT", "score": 0.05, "threshold": 0.26, "reason": "ok"}
+      - {"label": "needs_review", "score": None, "reason": "<why>"}
+    """
+    # 1) Gate: empty / short / placeholder / non-Norwegian → don't run model
+    flag, reason = needs_review(kort_beskrivelse)
+    if flag:
+        return {"label": "needs_review", "score": None, "reason": reason}
+    # 2) Tokenize + run model
+    model, tokenizer, threshold = _lazy_load()
+    enc = tokenizer(
+        str(kort_beskrivelse),
+        truncation=True,
+        padding="max_length",
+        max_length=max_length,
+        return_tensors="pt",
+    )
+    with torch.no_grad():
+        logits = model(**enc).logits
+    score = torch.softmax(logits, dim=1)[0, 1].item()
+    label = "RELEVANT" if score >= threshold else "NOT_RELEVANT"
+    return {"label": label, "score": score, "threshold": threshold, "reason": "ok"}
+if __name__ == "__main__":
+    samples = [
+        "",  # → needs_review (empty)
+        "Se konkurransegrunnlag",  # → needs_review (too_short: the length check fires before the placeholder check)
+        "TRANSQ is a joint qualification system for transport suppliers.",  # → needs_review (English)
+        "Anskaffelse av samfunnsøkonomisk analyse for evaluering av transportforskningens næringseffekter.",  # Norwegian → score
+        "Levering av rengjøringstjenester i kommunale bygg etter rammeavtale.",  # Norwegian → score
+    ]
+    for s in samples:
+        print(f"{score_lead(s)}  ← {s!r}")
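
The decision step in `score_lead` is a two-class softmax followed by a comparison against the tuned threshold, not against 0.5. A dependency-free sketch of that step; `relevant_probability`, `decide`, and the logits are made up for illustration:

```python
import math

THRESHOLD = 0.25936875  # value stored in threshold.json


def relevant_probability(logits):
    """Softmax over [not_relevant, relevant] logits; returns P(relevant)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[1] / sum(exps)


def decide(logits, threshold=THRESHOLD):
    """Map a pair of logits to (label, score) using the tuned threshold."""
    score = relevant_probability(logits)
    label = "RELEVANT" if score >= threshold else "NOT_RELEVANT"
    return label, score


# Made-up logits chosen so that P(relevant) = 0.3: below 0.5, above the threshold.
label, score = decide([0.0, math.log(3 / 7)])
print(label, round(score, 2))  # RELEVANT 0.3
```

A lead with a 30% score is therefore still labeled `RELEVANT`, which is the intended effect of tuning the threshold for recall rather than accuracy.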

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

threshold.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "threshold": 0.25936875,
+  "target_recall": 0.9
+}
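
The tuning script that produced this threshold is not in the commit. One common way to obtain such a value, sketched here under that caveat, is to take the highest cut-off on the validation scores whose recall still meets `target_recall`. `tune_threshold` and the data are hypothetical:

```python
def tune_threshold(scores, labels, target_recall=0.9):
    """Return the highest score cut-off whose recall still meets the target.

    scores: predicted P(relevant) on the validation set
    labels: 1 for relevant, 0 otherwise
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("validation set has no positives")
    # Candidate thresholds are the observed scores, tried from high to low.
    for t in sorted(set(scores), reverse=True):
        hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        if hits / n_pos >= target_recall:
            # Recall only grows as t drops, so the first match is the highest.
            return t
    return None  # unreachable: the lowest score always gives recall 1.0


# Made-up validation scores: 10 positives followed by 5 negatives.
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1,
          0.35, 0.2, 0.15, 0.05, 0.02]
labels = [1] * 10 + [0] * 5
print(tune_threshold(scores, labels))  # 0.3
```

Choosing the highest qualifying cut-off keeps precision as high as the recall constraint allows. A threshold tuned this way on validation can land slightly under the target on a different sample, as with the 0.89 test recall reported in the README.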

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 1000000000000000019884624838656,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
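
The `model_max_length` value above is not a real limit. It matches `int(1e30)`, the placeholder that `transformers` appears to write when no maximum length was configured for the tokenizer. Truncation therefore depends on the explicit `max_length=256` that `score.py` passes, which stays within the 512 positions allowed by `config.json`. A quick check of both facts:

```python
SENTINEL = int(1e30)  # float 1e30 is not exactly 10**30, hence the odd trailing digits
MODEL_MAX_LENGTH = 1000000000000000019884624838656  # from tokenizer_config.json
MAX_POSITION_EMBEDDINGS = 512                        # from config.json
SCORE_PY_MAX_LENGTH = 256                            # default in score_lead()

print(SENTINEL == MODEL_MAX_LENGTH)                    # True
print(SCORE_PY_MAX_LENGTH <= MAX_POSITION_EMBEDDINGS)  # True
```

Callers that bypass `score.py` and tokenize on their own would need to pass `truncation=True` with a `max_length` of at most 512, since the tokenizer config alone will not cap the sequence length.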

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:dcf659b571fac9023b4b5ed3f24547e41ac4f08daeef5877944303fea38a0f88
+size 5841

vocab.txt ADDED

The diff for this file is too large to render. See raw diff