Upload folder using huggingface_hub
- README.md +90 -0
- config.json +31 -0
- inference_rules.py +136 -0
- model.safetensors +3 -0
- score.py +95 -0
- special_tokens_map.json +7 -0
- threshold.json +4 -0
- tokenizer.json +0 -0
- tokenizer_config.json +58 -0
- training_args.bin +3 -0
- vocab.txt +0 -0
README.md
ADDED
@@ -0,0 +1,90 @@
---
language: no
license: apache-2.0
base_model: NbAiLab/nb-bert-base
tags:
- text-classification
- relevance-scoring
- procurement
- norwegian
---

# Menon nb-bert relevance scorer (v4)

Binary classifier built on top of [`NbAiLab/nb-bert-base`](https://huggingface.co/NbAiLab/nb-bert-base).
Scores procurement notices as **RELEVANT** or **NOT_RELEVANT** for Menon Economics' lead pipeline.

## What's new in v4 vs v3

- Trained only on the (translated) Norwegian description: no `tittel`, no `oppdragsgiver`, no portal/country features. This removes the v3 shortcut where the model learned client/country identity instead of project topic.
- Near-duplicate negatives are downweighted via TF-IDF clustering, so templated doffin notices stop dominating the loss.
- International positives are upweighted (`weight=2`).
- Empty / placeholder / non-Norwegian descriptions are routed to `needs_review` instead of being scored.

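The weighting described in the bullets above could look roughly like the sketch below. The README does not say how the weights were actually computed, so this is only an illustration under assumptions: the `1 / cluster_size` rule, the `sample_weights` helper, and the row keys (`relevant`, `international`, `cluster`) are hypothetical, and the class-balance term mentioned under "Training data" is left out.

```python
from collections import Counter


def sample_weights(rows, international_boost=2.0):
    """Hypothetical per-row weights; not the repo's actual training code.

    Each row is a dict with keys 'relevant' (bool), 'international' (bool)
    and 'cluster' (a near-duplicate cluster id, or None if the row is unique).
    """
    # Count how many negatives fall into each near-duplicate cluster.
    negative_cluster_sizes = Counter(
        r["cluster"] for r in rows
        if not r["relevant"] and r["cluster"] is not None
    )
    weights = []
    for r in rows:
        w = 1.0
        if not r["relevant"] and r["cluster"] is not None:
            # Assumption: a cluster of k templated negatives counts as one row.
            w /= negative_cluster_sizes[r["cluster"]]
        if r["relevant"] and r["international"]:
            # Matches the README's `weight=2` for international positives.
            w *= international_boost
        weights.append(w)
    return weights


rows = [
    {"relevant": False, "international": False, "cluster": "template-a"},
    {"relevant": False, "international": False, "cluster": "template-a"},
    {"relevant": False, "international": False, "cluster": "template-a"},
    {"relevant": False, "international": False, "cluster": None},
    {"relevant": True, "international": True, "cluster": None},
]
print(sample_weights(rows))
```

With a scheme like this, three copies of the same templated notice contribute as much to the loss as one unique negative, while each international positive counts twice.
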
## Held-out test results (n = 1,214)

| split | precision | recall | F1 |
|---|---:|---:|---:|
| **overall** | 0.76 | 0.89 | 0.82 |
| **international subset** (n=8) | 0.86 | 1.00 | 0.92 |

Threshold tuned on validation for recall ≥ 0.90: **0.2594** (saved in `threshold.json`).

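One common way to tune a threshold for a recall target is to pick the highest cut-off that still reaches the target on validation data. The sketch below illustrates that idea only; the exact procedure behind `threshold.json` is not documented here, and `tune_threshold`, `precision_recall_f1`, `f1`, and the toy scores are all hypothetical.

```python
def tune_threshold(scores, labels, target_recall=0.90):
    """Highest threshold t such that predicting `score >= t` reaches the target recall."""
    positives = sum(labels)
    if positives == 0:
        raise ValueError("need at least one positive example")
    # Try observed scores from highest to lowest; recall only grows as t drops.
    for t in sorted(set(scores), reverse=True):
        true_pos = sum(1 for s, y in zip(scores, labels) if y and s >= t)
        if true_pos / positives >= target_recall:
            return t
    raise AssertionError("unreachable: the lowest score always gives recall 1.0")


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f1(scores, labels, threshold):
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1(precision, recall)


# Toy validation set (made-up numbers, for illustration only).
val_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
val_labels = [1, 1, 0, 1, 0, 0]
threshold = tune_threshold(val_scores, val_labels, target_recall=0.90)
print(threshold, precision_recall_f1(val_scores, val_labels, threshold))
```

The same `f1` formula reproduces the F1 column of the table from its precision and recall columns: 0.76 and 0.89 give about 0.82, and 0.86 and 1.00 give about 0.92.
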
## Usage

```python
from score import score_lead

# Norwegian input: gets a real score
score_lead("Anskaffelse av samfunnsøkonomisk analyse for evaluering...")
# → {"label": "RELEVANT", "score": 0.83, "threshold": 0.2594, "reason": "ok"}

# Empty / placeholder / non-Norwegian input: routed to review, not scored
score_lead("")
# → {"label": "needs_review", "score": None, "reason": "empty"}

score_lead("Se konkurransegrunnlag")
# → {"label": "needs_review", "score": None, "reason": "too_short(len=22)"}

score_lead("TRANSQ is a joint qualification system for transport suppliers.")
# → {"label": "needs_review", "score": None, "reason": "non_norwegian(en)"}
```

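When scoring many leads at once, it can help to group the returned dicts by `label` so that `needs_review` items go to a person and the rest flow on automatically. The following is a minimal sketch under assumptions: `triage` is a hypothetical helper, and `fake_scorer` is a stand-in that only mimics the shape of `score_lead`'s return value so the example runs without the model files. In real use you would pass `score_lead` itself.

```python
from collections import defaultdict


def triage(descriptions, scorer):
    """Group (text, result) pairs by the label the scorer returns."""
    buckets = defaultdict(list)
    for text in descriptions:
        result = scorer(text)
        buckets[result["label"]].append((text, result))
    return dict(buckets)


def fake_scorer(text):
    """Stand-in for score_lead; the scores are invented for illustration."""
    if not text or not text.strip():
        return {"label": "needs_review", "score": None, "reason": "empty"}
    score = 0.83 if "analyse" in text.lower() else 0.05
    label = "RELEVANT" if score >= 0.2594 else "NOT_RELEVANT"
    return {"label": label, "score": score, "threshold": 0.2594, "reason": "ok"}


buckets = triage(
    [
        "",
        "Anskaffelse av samfunnsøkonomisk analyse for evaluering av næringseffekter.",
        "Levering av rengjøringstjenester i kommunale bygg etter rammeavtale.",
    ],
    fake_scorer,
)
for label, items in buckets.items():
    print(label, len(items))
```
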
## Important: input must be in Norwegian

The model assumes incoming descriptions are already in Norwegian Bokmål.
Production callers (the leads_scraper) translate non-Norwegian leads upstream.
Anything that arrives in another language is intentionally flagged as
`needs_review` so a human can fetch a correct translation rather than the
model returning a low-confidence guess.

For one-off ad-hoc scoring of raw foreign text, translate it with any tool
(DeepL / OpenAI / GPT / Google) **before** calling `score_lead`.

Requires:
- `transformers`, `torch`, `langdetect`
- No API keys needed.

## Files in this repo

| file | purpose |
|---|---|
| `model.safetensors`, `config.json` | Model weights + config |
| `tokenizer.json`, `vocab.txt`, etc. | Tokenizer |
| `threshold.json` | Tuned decision threshold |
| `inference_rules.py` | `needs_review()` gate (empty / short / placeholder / non-Norwegian) |
| `score.py` | End-to-end scoring function (use this) |

## Training data

13,177 labeled procurement leads from doffin / mercell / TED / Nordisk ministerråd /
hilma / FHF, with per-row weights encoding class balance + cluster dedup +
international upweight. After filtering out `needs_review` rows, 12,133 remain for
training and evaluation.

Stratified 80/10/10 split by `(Is_relevant, international)`.

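A stratified 80/10/10 split on a composite key can be sketched as below. This is an illustration, not the project's actual splitting code: `stratified_split`, the seed, and the toy rows are assumptions; only the key `(Is_relevant, international)` and the 80/10/10 fractions come from the text above.

```python
import random
from collections import defaultdict


def stratified_split(rows, key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so each stratum keeps the same proportions."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)

    train, val, test = [], [], []
    for group in strata.values():
        rng.shuffle(group)
        n_train = round(len(group) * fractions[0])
        n_val = round(len(group) * fractions[1])
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]  # remainder goes to test
    return train, val, test


# Toy data: 100 rows spread over the four (Is_relevant, international) strata.
toy_rows = (
    [{"Is_relevant": 0, "international": 0}] * 60
    + [{"Is_relevant": 1, "international": 0}] * 20
    + [{"Is_relevant": 0, "international": 1}] * 10
    + [{"Is_relevant": 1, "international": 1}] * 10
)
train, val, test = stratified_split(
    toy_rows, key=lambda r: (r["Is_relevant"], r["international"])
)
print(len(train), len(val), len(test))
```

Stratifying on the pair rather than on `Is_relevant` alone keeps the rare international positives represented in every split. Very small strata can still round to zero rows in validation or test, which is one reason the international test subset is so small.
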
## Caveats

- International subset is small (~8 held-out positives). The 100% recall is encouraging but high-variance.
- The `needs_review` gate is deliberately lenient towards text detected as Danish or Swedish: those languages are mutually intelligible with Norwegian Bokmål and the model handles them fine, so they pass through to scoring.
- Production assumption: leads arrive translated. Historically, about 5–7 non-Norwegian leads/month slipped through; under v4 those are routed to human review.
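To put a number on "high-variance", a confidence interval for a proportion can be computed. The sketch below uses the Wilson score interval, a standard choice for small samples, and assumes for illustration that the international result corresponds to 8 positives recalled out of 8. That count is an assumption read off the figures above, and `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Assumed counts: 8 of 8 international positives recalled.
low, high = wilson_interval(8, 8)
print(f"recall 95% interval: {low:.2f} to {high:.2f}")  # lower bound is about 0.68
```

Under that assumption, a perfect 8/8 is still compatible with a true recall well below 0.90, which is why the international figure should be read as indicative only.
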
config.json
ADDED
@@ -0,0 +1,31 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "transformers_version": "4.57.6",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 119547
}
inference_rules.py
ADDED
@@ -0,0 +1,136 @@
"""
Inference-time gate for the relevance scorer.

PURPOSE
-------
Some procurement notices have descriptions like "Se konkurransegrunnlag" or are
empty entirely. The model can't classify what isn't there. Instead of returning
a (low-confidence) prediction, the pipeline returns "needs_review" so a human
can fetch the missing content from the linked documents/website.

USE
---
Same gate must be applied in:
  1. Training-data filtering: drop rows where needs_review() is True.
  2. Inference time: skip the model call, return "needs_review".

This keeps training and serving aligned.

USAGE
-----
    from inference_rules import needs_review
    flag, reason = needs_review(kort_beskrivelse)
    if flag:
        return {"label": "needs_review", "reason": reason}
    # else: run the model
"""

import re

from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0

MIN_LEN = 30  # below this → needs_review

# Languages we consider close enough to Norwegian Bokmål for the model.
#   - 'no' (Norwegian) is the obvious one.
#   - 'da' (Danish) is mutually intelligible with Norwegian; nb-bert-base handles it.
#   - 'sv' (Swedish) is close enough that langdetect often confuses it with Norwegian.
# Anything else → routed to human review (production assumption: scraper translated).
NORWEGIAN_READABLE = {"no", "da", "sv"}

# Lowercased placeholder phrases (text == one of these, after strip+lower).
PLACEHOLDER_PHRASES = {
    "se tittel",
    "se tittel.",
    "tittelen sier vel alt",
    "tittelen sier vel alt.",
    "se konkurransegrunnlag",
    "se konkurransegrunnlag.",
    "se vedlegg",
    "se vedlegg.",
    "se dokumentene",
    "se dokumentene.",
    "se dokumentene som ble sendt på mail",
    "se henvendelse på e-post",
    "se henvendelse på epost",
    "se nettside",
    "se nettside.",
    "se utlysning",
    "se utlysning.",
    "se anbudsdokumenter",
    "se anbudsdokumenter.",
    "rammeavtale",
    "rammeavtale.",
}

# Substring patterns: short descriptions that *contain* these phrases also fail.
PLACEHOLDER_PATTERNS = [
    re.compile(r"^se\s+(tittel|konkurransegrunnlag|vedlegg|dokumenter|nettside|utlysning|anbud|henvendelse)", re.IGNORECASE),
    re.compile(r"tittelen sier vel alt", re.IGNORECASE),
    re.compile(r"sjekk (dokumentene|vedlegg|nettsiden|websiden)", re.IGNORECASE),
    re.compile(r"check (the doc|website|attachment)", re.IGNORECASE),
    re.compile(r"read (website|the website|the doc)", re.IGNORECASE),
]


def needs_review(text):
    """Return (True, reason) if the description should NOT be sent to the model.

    Otherwise returns (False, "ok").
    """
    if text is None:
        return True, "empty"

    s = str(text).strip()
    if s == "" or s.lower() == "nan":
        return True, "empty"

    if len(s) < MIN_LEN:
        return True, f"too_short(len={len(s)})"

    s_lower = s.lower().strip().rstrip(".").strip()
    if s_lower in {p.rstrip(".") for p in PLACEHOLDER_PHRASES}:
        return True, "placeholder_phrase"

    for pat in PLACEHOLDER_PATTERNS:
        if pat.search(s):
            # Only fire as placeholder if the description is also short; a long
            # description that *mentions* "se vedlegg" inside a real sentence is fine.
            if len(s) < 80:
                return True, f"placeholder_match({pat.pattern[:30]})"

    # Last check: language. Production assumption is that the scraper has already
    # translated foreign-language leads into Norwegian. Anything that arrives here
    # in another language is unexpected, so route it to human review.
    try:
        lang = detect(s[:500])
        if lang not in NORWEGIAN_READABLE:
            return True, f"non_norwegian({lang})"
    except Exception:
        # Detection failure → fall through (assume Norwegian, don't block)
        pass

    return False, "ok"


if __name__ == "__main__":
    tests = [
        "",
        None,
        " ",
        "Se konkurransegrunnlag",
        "Se tittel.",
        "Tittelen sier vel alt.",
        "Rammeavtale",
        "Sjekk dokumentene",
        "Anskaffelse av samfunnsøkonomisk analyse for transportforskning innen evaluering.",
        "Short text",
        "Se vedlegg for full beskrivelse av kontraktens innhold inkludert alle leveranser.",  # long → ok
        "TRANSQ is a joint qualification system for Scandinavian transport suppliers.",  # English → flagged
        "Hilma on Suomen julkisten hankintojen ilmoituskanava ja keskitetty palvelu.",  # Finnish → flagged
    ]
    for t in tests:
        flag, reason = needs_review(t)
        print(f"{flag!s:<6} {reason:<35} | {t!r}")
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f6ca38993266759d2ffe53152b98810effb9714608f2bed64f43dff6141c845e
size 711443456
score.py
ADDED
@@ -0,0 +1,95 @@
"""
End-to-end inference for the Menon nb-bert relevance scorer (v4).

PIPELINE
--------
    needs_review() gate → if flagged, return early
    tokenize → run model → apply tuned threshold

DESIGN NOTE
-----------
Translation is NOT done here. Production assumption: the leads_scraper
has already translated foreign-language leads to Norwegian before they
reach this model. Anything that arrives here in another language is
flagged as `needs_review` (handled inside `inference_rules.py`).

For ad-hoc scoring of raw foreign text, translate beforehand with any
tool and feed the Norwegian version in.

USAGE
-----
    from score import score_lead
    result = score_lead(kort_beskrivelse_text)
    # → {"label": "RELEVANT" / "NOT_RELEVANT" / "needs_review",
    #    "score": 0.83, "threshold": 0.2594, "reason": "..."}

REQUIRES
--------
    Packages: transformers, torch, langdetect
"""

import json
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from inference_rules import needs_review

_MODEL_DIR = Path(__file__).parent
_threshold = None
_model = None
_tokenizer = None


def _lazy_load():
    global _model, _tokenizer, _threshold
    if _model is None:
        _tokenizer = AutoTokenizer.from_pretrained(str(_MODEL_DIR))
        _model = AutoModelForSequenceClassification.from_pretrained(str(_MODEL_DIR))
        _model.eval()
        with open(_MODEL_DIR / "threshold.json") as f:
            _threshold = json.load(f)["threshold"]
    return _model, _tokenizer, _threshold


def score_lead(kort_beskrivelse: str, max_length: int = 256) -> dict:
    """Score a single procurement description.

    Returns a dict:
      - {"label": "RELEVANT", "score": 0.83, "threshold": 0.26, "reason": "ok"}
      - {"label": "NOT_RELEVANT", "score": 0.05, "threshold": 0.26, "reason": "ok"}
      - {"label": "needs_review", "score": None, "reason": "<why>"}
    """
    # 1) Gate: empty / short / placeholder / non-Norwegian → don't run model
    flag, reason = needs_review(kort_beskrivelse)
    if flag:
        return {"label": "needs_review", "score": None, "reason": reason}

    # 2) Tokenize + run model
    model, tokenizer, threshold = _lazy_load()
    enc = tokenizer(
        str(kort_beskrivelse),
        truncation=True,
        padding="max_length",
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**enc).logits
    score = torch.softmax(logits, dim=1)[0, 1].item()

    label = "RELEVANT" if score >= threshold else "NOT_RELEVANT"
    return {"label": label, "score": score, "threshold": threshold, "reason": "ok"}


if __name__ == "__main__":
    samples = [
        "",  # → needs_review (empty)
        "Se konkurransegrunnlag",  # → needs_review (placeholder)
        "TRANSQ is a joint qualification system for transport suppliers.",  # → needs_review (English)
        "Anskaffelse av samfunnsøkonomisk analyse for evaluering av transportforskningens næringseffekter.",  # Norwegian → score
        "Levering av rengjøringstjenester i kommunale bygg etter rammeavtale.",  # Norwegian → score
    ]
    for s in samples:
        print(f"{score_lead(s)} ← {s!r}")
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
threshold.json
ADDED
@@ -0,0 +1,4 @@
{
  "threshold": 0.25936875,
  "target_recall": 0.9
}
tokenizer.json
ADDED
The diff for this file is too large to render.
tokenizer_config.json
ADDED
@@ -0,0 +1,58 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.bin
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:dcf659b571fac9023b4b5ed3f24547e41ac4f08daeef5877944303fea38a0f88
size 5841
vocab.txt
ADDED
The diff for this file is too large to render.