RozaA committed on
Commit
bc25b1d
·
verified ·
1 Parent(s): ff4349a

Upload folder using huggingface_hub

README.md ADDED
@@ -0,0 +1,90 @@
+ ---
+ language: no
+ license: apache-2.0
+ base_model: NbAiLab/nb-bert-base
+ tags:
+ - text-classification
+ - relevance-scoring
+ - procurement
+ - norwegian
+ ---
+
+ # Menon nb-bert relevance scorer (v4)
+
+ Binary classifier built on top of [`NbAiLab/nb-bert-base`](https://huggingface.co/NbAiLab/nb-bert-base).
+ Scores procurement notices as **RELEVANT** or **NOT_RELEVANT** for Menon Economics' lead pipeline.
+
+ ## What's new in v4 vs v3
+
+ - Trained only on the (translated) Norwegian description — no `tittel`, no `oppdragsgiver`, no portal/country features. Removes the v3 shortcut where the model learned client/country identity instead of project topic.
+ - Near-duplicate negatives are downweighted via TF-IDF clustering, so templated doffin notices stop dominating the loss.
+ - International positives are upweighted (`weight=2`).
+ - Empty / placeholder / non-Norwegian descriptions are routed to `needs_review` instead of being scored.
+
+ ## Held-out test results (n = 1,214)
+
+ | split | precision | recall | F1 |
+ |---|---:|---:|---:|
+ | **overall** | 0.76 | 0.89 | 0.82 |
+ | **international subset** (n=8) | 0.86 | 1.00 | 0.92 |
+
+ Threshold tuned on validation for recall ≥ 0.90: **0.2594** (saved in `threshold.json`).
+
+ ## Usage
+
+ ```python
+ from score import score_lead
+
+ # Norwegian input — gets a real score
+ score_lead("Anskaffelse av samfunnsøkonomisk analyse for evaluering...")
+ # → {"label": "RELEVANT", "score": 0.83, "threshold": 0.2594, "reason": "ok"}
+
+ # Empty / placeholder / non-Norwegian input — routed to review, not scored
+ score_lead("")
+ # → {"label": "needs_review", "score": None, "reason": "empty"}
+
+ score_lead("Se konkurransegrunnlag")
+ # → {"label": "needs_review", "score": None, "reason": "too_short(len=22)"}
+
+ score_lead("TRANSQ is a joint qualification system for transport suppliers.")
+ # → {"label": "needs_review", "score": None, "reason": "non_norwegian(en)"}
+ ```
+
+ ## Important: input must be in Norwegian
+
+ The model assumes incoming descriptions are already in Norwegian Bokmål.
+ Production callers (the leads_scraper) translate non-Norwegian leads upstream.
+ Anything that arrives in another language is intentionally flagged as
+ `needs_review` so a human can fetch a correct translation rather than the
+ model returning a low-confidence guess.
+
+ For one-off ad-hoc scoring of raw foreign text, translate it with any tool
+ (DeepL / OpenAI / GPT / Google) **before** calling `score_lead`.
+
+ Requires:
+ - `transformers`, `torch`, `langdetect`
+ - No API keys needed.
+
+ ## Files in this repo
+
+ | file | purpose |
+ |---|---|
+ | `model.safetensors`, `config.json` | Model weights + config |
+ | `tokenizer.json`, `vocab.txt`, etc. | Tokenizer |
+ | `threshold.json` | Tuned decision threshold |
+ | `inference_rules.py` | `needs_review()` gate (empty / short / placeholder / non-Norwegian) |
+ | `score.py` | End-to-end scoring function (use this) |
+
+ ## Training data
+
+ 13,177 labeled procurement leads from doffin / mercell / TED / Nordisk ministerråd /
+ hilma / FHF, with per-row weights encoding class balance + cluster dedup +
+ international upweight. After filtering `needs_review` rows: 12,133 used for training.
+
+ Stratified 80/10/10 split by `(Is_relevant, international)`.
+
+ ## Caveats
+
+ - International subset is small (~8 held-out positives). The 100% recall is encouraging but high-variance.
+ - The `needs_review` gate treats Danish- and Swedish-detected text leniently: `da` and `sv` are on the gate's allow-list, because both languages are close to Norwegian Bokmål and the model handles them fine, so such text is scored rather than flagged.
+ - Production assumption: leads arrive translated. Historically, about 5–7 non-Norwegian leads per month slipped through; under v4 those are routed to human review.
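Review note on the README's "Training data" section: the split is described as stratified 80/10/10 by `(Is_relevant, international)`, but the training code is not part of this commit. The following is a minimal sketch of such a split, assuming each row is a dict carrying those two fields. The function name `stratified_split`, the seed, and the toy rows are hypothetical and are not taken from the author's pipeline.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so every stratum keeps about the same proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    train, val, test = [], [], []
    for stratum_key in sorted(strata):
        group = strata[stratum_key]
        rng.shuffle(group)
        n_train = int(round(fractions[0] * len(group)))
        n_val = int(round(fractions[1] * len(group)))
        train.extend(group[:n_train])
        val.extend(group[n_train:n_train + n_val])
        test.extend(group[n_train + n_val:])  # remainder goes to test
    return train, val, test


# Hypothetical toy data: 1000 rows, 200 relevant, 20 of those international.
rows = [
    {"id": i, "Is_relevant": int(i % 5 == 0), "international": int(i % 50 == 0)}
    for i in range(1000)
]
train, val, test = stratified_split(
    rows, key=lambda r: (r["Is_relevant"], r["international"])
)
print(len(train), len(val), len(test))
print(sum(r["international"] for r in test), "international positives in test")
```

Splitting per stratum is what keeps a rare group such as international positives represented in every partition. With a group as small as the one reported in the README, the held-out count still ends up in the single digits, which matches the high-variance caveat above.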
config.json ADDED
@@ -0,0 +1,31 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "dtype": "float32",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.57.6",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 119547
+ }
inference_rules.py ADDED
@@ -0,0 +1,136 @@
+ """
+ Inference-time gate for the relevance scorer.
+
+ PURPOSE
+ -------
+ Some procurement notices have descriptions like "Se konkurransegrunnlag" or are
+ empty entirely. The model can't classify what isn't there. Instead of returning
+ a (low-confidence) prediction, the pipeline returns "needs_review" so a human
+ can fetch the missing content from the linked documents/website.
+
+ USE
+ ---
+ Same gate must be applied in:
+   1. Training-data filtering — drop rows where needs_review() is True.
+   2. Inference time — skip the model call, return "needs_review".
+
+ This keeps training and serving aligned.
+
+ USAGE
+ -----
+     from inference_rules import needs_review
+     flag, reason = needs_review(kort_beskrivelse)
+     if flag:
+         return {"label": "needs_review", "reason": reason}
+     # else: run the model
+ """
+
+ import re
+
+ from langdetect import DetectorFactory, detect
+
+ DetectorFactory.seed = 0
+
+ MIN_LEN = 30  # below this → needs_review
+
+ # Languages we consider close enough to Norwegian Bokmål for the model.
+ # - 'no' (Norwegian) is the obvious one.
+ # - 'da' (Danish) is mutually intelligible with Norwegian; nb-bert-base handles it.
+ # - 'sv' (Swedish) is close enough that langdetect often confuses it with Norwegian.
+ # Anything else → routed to human review (production assumption: scraper translated).
+ NORWEGIAN_READABLE = {"no", "da", "sv"}
+
+ # Lowercased placeholder phrases (text == one of these, after strip+lower).
+ PLACEHOLDER_PHRASES = {
+     "se tittel",
+     "se tittel.",
+     "tittelen sier vel alt",
+     "tittelen sier vel alt.",
+     "se konkurransegrunnlag",
+     "se konkurransegrunnlag.",
+     "se vedlegg",
+     "se vedlegg.",
+     "se dokumentene",
+     "se dokumentene.",
+     "se dokumentene som ble sendt på mail",
+     "se henvendelse på e-post",
+     "se henvendelse på epost",
+     "se nettside",
+     "se nettside.",
+     "se utlysning",
+     "se utlysning.",
+     "se anbudsdokumenter",
+     "se anbudsdokumenter.",
+     "rammeavtale",
+     "rammeavtale.",
+ }
+
+ # Substring patterns: short descriptions that *contain* these phrases also fail.
+ PLACEHOLDER_PATTERNS = [
+     re.compile(r"^se\s+(tittel|konkurransegrunnlag|vedlegg|dokumenter|nettside|utlysning|anbud|henvendelse)", re.IGNORECASE),
+     re.compile(r"tittelen sier vel alt", re.IGNORECASE),
+     re.compile(r"sjekk (dokumentene|vedlegg|nettsiden|websiden)", re.IGNORECASE),
+     re.compile(r"check (the doc|website|attachment)", re.IGNORECASE),
+     re.compile(r"read (website|the website|the doc)", re.IGNORECASE),
+ ]
+
+
+ def needs_review(text):
+     """Return (True, reason) if the description should NOT be sent to the model.
+
+     Otherwise returns (False, "ok").
+     """
+     if text is None:
+         return True, "empty"
+
+     s = str(text).strip()
+     if s == "" or s.lower() == "nan":
+         return True, "empty"
+
+     if len(s) < MIN_LEN:
+         return True, f"too_short(len={len(s)})"
+
+     s_lower = s.lower().strip().rstrip(".").strip()
+     if s_lower in {p.rstrip(".") for p in PLACEHOLDER_PHRASES}:
+         return True, "placeholder_phrase"
+
+     for pat in PLACEHOLDER_PATTERNS:
+         if pat.search(s):
+             # Only fire as placeholder if the description is also short — a long
+             # description that *mentions* "se vedlegg" inside a real sentence is fine.
+             if len(s) < 80:
+                 return True, f"placeholder_match({pat.pattern[:30]})"
+
+     # Last check: language. Production assumption is that the scraper has already
+     # translated foreign-language leads into Norwegian. Anything that arrives here
+     # in another language is unexpected — route to human review.
+     try:
+         lang = detect(s[:500])
+         if lang not in NORWEGIAN_READABLE:
+             return True, f"non_norwegian({lang})"
+     except Exception:
+         # Detection failure → fall through (assume Norwegian, don't block)
+         pass
+
+     return False, "ok"
+
+
+ if __name__ == "__main__":
+     tests = [
+         "",
+         None,
+         "   ",
+         "Se konkurransegrunnlag",
+         "Se tittel.",
+         "Tittelen sier vel alt.",
+         "Rammeavtale",
+         "Sjekk dokumentene",
+         "Anskaffelse av samfunnsøkonomisk analyse for transportforskning innen evaluering.",
+         "Short text",
+         "Se vedlegg for full beskrivelse av kontraktens innhold inkludert alle leveranser.",  # long → ok
+         "TRANSQ is a joint qualification system for Scandinavian transport suppliers.",  # English → flagged
+         "Hilma on Suomen julkisten hankintojen ilmoituskanava ja keskitetty palvelu.",  # Finnish → flagged
+     ]
+     for t in tests:
+         flag, reason = needs_review(t)
+         print(f"{flag!s:<6} {reason:<35} | {t!r}")
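Review note on check ordering in `needs_review()`: the length check runs before the exact-phrase check, and nearly every entry in `PLACEHOLDER_PHRASES` is shorter than `MIN_LEN`, so those inputs are reported as `too_short(...)` rather than `placeholder_phrase`. The sketch below mirrors only the empty, length, and exact-phrase checks in the same order, so it runs without `langdetect`. The function name `gate_without_langdetect` and the reduced phrase set are illustrative assumptions, not code from this commit.

```python
MIN_LEN = 30  # same value as MIN_LEN in inference_rules.py

# Subset of PLACEHOLDER_PHRASES from inference_rules.py, trailing dots removed.
PLACEHOLDERS = {
    "se tittel",
    "se konkurransegrunnlag",
    "se dokumentene som ble sendt på mail",
    "rammeavtale",
}


def gate_without_langdetect(text):
    """Mirror of the empty / length / exact-phrase checks, in the original order."""
    if text is None:
        return True, "empty"
    s = str(text).strip()
    if s == "" or s.lower() == "nan":
        return True, "empty"
    if len(s) < MIN_LEN:
        return True, f"too_short(len={len(s)})"
    if s.lower().rstrip(".").strip() in PLACEHOLDERS:
        return True, "placeholder_phrase"
    return False, "ok"


for sample in ["", "Se konkurransegrunnlag", "Se dokumentene som ble sendt på mail."]:
    print(gate_without_langdetect(sample), repr(sample))
```

Both outcomes still route the lead to review, so the ordering only affects the `reason` string. It matters mainly to anyone who aggregates reasons to see how often placeholder descriptions occur.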
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f6ca38993266759d2ffe53152b98810effb9714608f2bed64f43dff6141c845e
+ size 711443456
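Review note on file size: the LFS pointer's `size` can be sanity-checked against `config.json`. The sketch below counts the parameters of a standard BERT encoder with a pooler and a linear classification head, then compares four bytes per parameter (`"dtype": "float32"`) with the pointer size. Two assumptions are not stated in the commit: the head has two labels (the Transformers default when `num_labels` is absent), and the checkpoint stores exactly these tensors and nothing else.

```python
def bert_classifier_param_count(vocab_size, hidden, layers, intermediate,
                                max_positions, type_vocab, num_labels):
    """Count parameters of a standard BERT encoder + pooler + linear classifier."""
    layer_norm = 2 * hidden  # weight + bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two dense layers (hidden -> intermediate -> hidden), followed by a LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)
    pooler = hidden * hidden + hidden
    classifier = hidden * num_labels + num_labels
    return embeddings + layers * (attention + feed_forward) + pooler + classifier


params = bert_classifier_param_count(
    vocab_size=119547, hidden=768, layers=12, intermediate=3072,
    max_positions=512, type_vocab=2,
    num_labels=2,  # assumption: not present in config.json
)
weight_bytes = params * 4      # float32
file_bytes = 711443456         # size from the model.safetensors LFS pointer
print(params, weight_bytes, file_bytes - weight_bytes)
```

Under these assumptions the count is 177,854,978 parameters, or 711,419,912 bytes of weights. That leaves roughly 23.5 kB, which would be consistent with a safetensors metadata header, although this has not been checked against the actual file.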
score.py ADDED
@@ -0,0 +1,95 @@
+ """
+ End-to-end inference for the Menon nb-bert relevance scorer (v4).
+
+ PIPELINE
+ --------
+ needs_review() gate → if flagged, return early
+ tokenize → run model → apply tuned threshold
+
+ DESIGN NOTE
+ -----------
+ Translation is NOT done here. Production assumption: the leads_scraper
+ has already translated foreign-language leads to Norwegian before they
+ reach this model. Anything that arrives here in another language is
+ flagged as `needs_review` (handled inside `inference_rules.py`).
+
+ For ad-hoc scoring of raw foreign text, translate beforehand with any
+ tool and feed the Norwegian version in.
+
+ USAGE
+ -----
+     from score import score_lead
+     result = score_lead(kort_beskrivelse_text)
+     # → {"label": "RELEVANT" / "NOT_RELEVANT" / "needs_review",
+     #    "score": 0.83, "threshold": 0.2594, "reason": "..."}
+
+ REQUIRES
+ --------
+ Packages: transformers, torch, langdetect
+ """
+
+ import json
+ from pathlib import Path
+
+ import torch
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ from inference_rules import needs_review
+
+ _MODEL_DIR = Path(__file__).parent
+ _threshold = None
+ _model = None
+ _tokenizer = None
+
+
+ def _lazy_load():
+     global _model, _tokenizer, _threshold
+     if _model is None:
+         _tokenizer = AutoTokenizer.from_pretrained(str(_MODEL_DIR))
+         _model = AutoModelForSequenceClassification.from_pretrained(str(_MODEL_DIR))
+         _model.eval()
+         with open(_MODEL_DIR / "threshold.json") as f:
+             _threshold = json.load(f)["threshold"]
+     return _model, _tokenizer, _threshold
+
+
+ def score_lead(kort_beskrivelse: str, max_length: int = 256) -> dict:
+     """Score a single procurement description.
+
+     Returns a dict:
+       - {"label": "RELEVANT", "score": 0.83, "threshold": 0.26, "reason": "ok"}
+       - {"label": "NOT_RELEVANT", "score": 0.05, "threshold": 0.26, "reason": "ok"}
+       - {"label": "needs_review", "score": None, "reason": "<why>"}
+     """
+     # 1) Gate: empty / short / placeholder / non-Norwegian → don't run model
+     flag, reason = needs_review(kort_beskrivelse)
+     if flag:
+         return {"label": "needs_review", "score": None, "reason": reason}
+
+     # 2) Tokenize + run model
+     model, tokenizer, threshold = _lazy_load()
+     enc = tokenizer(
+         str(kort_beskrivelse),
+         truncation=True,
+         padding="max_length",
+         max_length=max_length,
+         return_tensors="pt",
+     )
+     with torch.no_grad():
+         logits = model(**enc).logits
+     score = torch.softmax(logits, dim=1)[0, 1].item()
+
+     label = "RELEVANT" if score >= threshold else "NOT_RELEVANT"
+     return {"label": label, "score": score, "threshold": threshold, "reason": "ok"}
+
+
+ if __name__ == "__main__":
+     samples = [
+         "",  # → needs_review (empty)
+         "Se konkurransegrunnlag",  # → needs_review (too short: 22 chars < MIN_LEN)
+         "TRANSQ is a joint qualification system for transport suppliers.",  # → needs_review (English)
+         "Anskaffelse av samfunnsøkonomisk analyse for evaluering av transportforskningens næringseffekter.",  # Norwegian → score
+         "Levering av rengjøringstjenester i kommunale bygg etter rammeavtale.",  # Norwegian → score
+     ]
+     for s in samples:
+         print(f"{score_lead(s)} ← {s!r}")
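Review note on the decision rule in `score_lead`: the label comes from comparing the softmax probability of class 1 with the tuned threshold, not from an argmax over the logits. The sketch below reproduces only that last step with the standard library, so the effect of a threshold below 0.5 can be seen without loading the model. The helper names and the example logits are hypothetical; the threshold value is the one stored in `threshold.json`.

```python
import math


def positive_probability(logits):
    """Softmax probability of class 1 for a two-logit output, computed stably."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def decide(logits, threshold):
    """Apply the same `score >= threshold` rule that score_lead uses."""
    score = positive_probability(logits)
    label = "RELEVANT" if score >= threshold else "NOT_RELEVANT"
    return {"label": label, "score": score, "threshold": threshold, "reason": "ok"}


THRESHOLD = 0.25936875  # value from threshold.json

# Hypothetical logits, chosen only to land on either side of the threshold.
print(decide([0.5, 0.0], THRESHOLD))  # class 0 has the larger logit, yet RELEVANT
print(decide([2.0, 0.0], THRESHOLD))  # NOT_RELEVANT
```

Because the threshold is well below 0.5, a lead is labeled `RELEVANT` even when the class 0 logit is somewhat larger than the class 1 logit. The cut-off sits at a logit difference of roughly -1.05, which is the recall-oriented behaviour the README describes.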
special_tokens_map.json ADDED
@@ -0,0 +1,7 @@
+ {
+   "cls_token": "[CLS]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
threshold.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "threshold": 0.25936875,
+   "target_recall": 0.9
+ }
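Review note on `threshold.json`: the file stores a threshold together with the recall target it was tuned for, but the tuning code is not part of this commit. One simple way to obtain such a value, shown here as a sketch rather than the author's procedure, is to take the highest threshold that still keeps the required share of validation positives. The function names and the validation scores below are hypothetical.

```python
import json
import math


def tune_threshold(scores, labels, target_recall=0.9):
    """Highest threshold t for which predicting `score >= t` reaches the target recall."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive example")
    # Keeping the top-k positive scores gives recall k / P, so k = ceil(target * P).
    # The small epsilon guards against floating-point noise in the product.
    needed = max(1, math.ceil(target_recall * len(positives) - 1e-9))
    return positives[needed - 1]


def recall_at(scores, labels, threshold):
    """Share of positives whose score is at or above the threshold."""
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    total = sum(1 for y in labels if y == 1)
    return hits / total


# Hypothetical validation scores: ten positives followed by five negatives.
val_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.35, 0.30, 0.10,
              0.45, 0.20, 0.15, 0.05, 0.02]
val_labels = [1] * 10 + [0] * 5

threshold = tune_threshold(val_scores, val_labels, target_recall=0.9)
print(json.dumps({"threshold": threshold, "target_recall": 0.9}))
print(recall_at(val_scores, val_labels, threshold))
```

A threshold chosen this way guarantees the recall target only on the validation set. The README's held-out recall of 0.89 against a target of 0.90 is the kind of small gap to expect when the same threshold is applied to a different sample.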
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
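Review note on `model_max_length`: the value `1000000000000000019884624838656` equals `int(1e30)`, which appears to be the placeholder Transformers writes when no real maximum length is recorded for a tokenizer. The tokenizer therefore will not truncate on its own, and callers need to pass an explicit `max_length`, as `score.py` does with 256. The sketch below shows one way to derive a safe truncation length from this value and from `max_position_embeddings` in `config.json`. The helper `effective_max_length` is hypothetical and not part of this repo.

```python
SENTINEL = int(1e30)  # the float 1e30 converted exactly to an integer


def effective_max_length(model_max_length, max_position_embeddings, requested=None):
    """Pick a truncation length that never exceeds what the model can encode."""
    limit = max_position_embeddings
    if model_max_length < SENTINEL:
        # A real tokenizer limit is recorded, so honour it.
        limit = min(limit, model_max_length)
    if requested is not None:
        limit = min(limit, requested)
    return limit


recorded = 1000000000000000019884624838656  # model_max_length from tokenizer_config.json
print(recorded == SENTINEL)
print(effective_max_length(recorded, 512))        # falls back to the position limit
print(effective_max_length(recorded, 512, 256))   # the value score.py passes by default
```

The default of 256 in `score_lead` stays inside the 512-position limit, so the placeholder is harmless there. It would only matter for a caller that tokenizes without `truncation=True` and an explicit `max_length`.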
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:dcf659b571fac9023b4b5ed3f24547e41ac4f08daeef5877944303fea38a0f88
+ size 5841
vocab.txt ADDED
The diff for this file is too large to render. See raw diff