temsa committed on
Commit fbc78d4 · verified · 1 Parent(s): 0ed1b40

Mark legacy repo as superseded by OpenMed-mLiteClinical-IrishPPSN-135M-v1

Files changed (3)
  1. NOTICE +12 -3
  2. README.md +20 -67
  3. pyproject.toml +3 -3
NOTICE CHANGED
@@ -1,17 +1,26 @@
- OpenMed PPSN Extension
+ OpenMed mLiteClinical Irish PPSN Extension
  Copyright 2026 Contributors

  This project includes fine-tuned/derived model artifacts from:
- - OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1 (Hugging Face)
+ - OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1 (Hugging Face)
  Declared license: Apache-2.0.

  This project uses evaluation/training data sources including:
  - nvidia/Nemotron-PII (Hugging Face dataset)
  Declared license: CC-BY-4.0.
+ - joelniklaus/mapa (Hugging Face dataset)
+ Declared license: refer to dataset card.
+ - unimelb-nlp/wikiann (Hugging Face dataset)
+ Declared license: refer to dataset card.
+ - DataikuNLP/kiji-pii-training-data (Hugging Face dataset)
+ Declared license: refer to dataset card.

  Attribution and links:
- - OpenMed base model: https://huggingface.co/OpenMed/OpenMed-PII-SuperClinical-Large-434M-v1
+ - OpenMed base model: https://huggingface.co/OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
  - Nemotron-PII dataset: https://huggingface.co/datasets/nvidia/Nemotron-PII
+ - MAPA dataset: https://huggingface.co/datasets/joelniklaus/mapa
+ - WikiANN dataset: https://huggingface.co/datasets/unimelb-nlp/wikiann
+ - KIJI PII training data: https://huggingface.co/datasets/DataikuNLP/kiji-pii-training-data

  If you redistribute models or checkpoints produced here, keep this NOTICE,
  retain upstream license notices, and provide dataset attribution where required.
README.md CHANGED
@@ -12,89 +12,42 @@ tags:
  - de-identification
  - ireland
  - ppsn
- - multilingual
+ - legacy
  base_model:
  - OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1
- datasets:
- - nvidia/Nemotron-PII
- - joelniklaus/mapa
- - unimelb-nlp/wikiann
- - DataikuNLP/kiji-pii-training-data
- model-index:
- - name: OpenMed-PPSN-mLiteClinical-v1
- results:
- - task:
- type: token-classification
- name: PPSN detection (Irish large eval)
- dataset:
- name: irish_ppsn_eval_large_v2
- type: custom
- metrics:
- - type: f1
- value: 0.8979
- name: Irish large F1
- - task:
- type: token-classification
- name: PPSN detection (multilingual gov + citizen + HSE)
- dataset:
- name: multilingual_ppsn_v1_all
- type: custom
- metrics:
- - type: f1
- value: 0.9704
- name: Multilingual suite F1
  ---

  # OpenMed-PPSN-mLiteClinical-v1

- Full token-classification checkpoint derived from `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1` with `B-PPSN` / `I-PPSN` support for Irish PPSN detection.
+ This repo remains available as a **legacy alias** for compatibility.

- ## What This Release Is
+ The canonical release for this model is now:

- - A full `transformers` checkpoint
- - Intended for PPSN masking with the custom `word_aligned` decoder
- - Tuned for Irish PPSN cases while retaining the base OpenMed multilingual PII labels
+ - `temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1`
+ - https://huggingface.co/temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1

- ## Recommended Inference Path
+ ## Status

- Use `inference_word_aligned.py`:
+ - Same model family and same intended use: Irish PPSN detection and masking
+ - New canonical repo has the clearer name, cleaned metadata, corrected attribution, and cleaner eval packaging
+ - Prefer the canonical repo for all new integrations, QA, and benchmarking

- ```bash
- python3 inference_word_aligned.py \
- --ppsn-min-score 0.4 \
- --text "My PPSN is 1234567TW and I need help with my housing grant." \
- --json
- ```
-
- ## Included Artifacts
+ ## Recommended Upgrade

- - Model files:
- - `model.safetensors`
- - `config.json`
- - `tokenizer.json`
- - `tokenizer_config.json`
- - `special_tokens_map.json`
- - `label_meta.json`
- - QA/inference files:
- - `inference_word_aligned.py`
- - `qa_config.json`
- - `pyproject.toml`
- - Eval artifacts in `eval/`
+ Use the canonical release instead of this legacy alias:

- ## Key Results
+ ```bash
+ python3 inference_word_aligned.py --model temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1 --ppsn-min-score 0.4 --text "My PPSN is 1234567TW and I need help with my housing grant." --json
+ ```

- - User raw regression F1: `0.8000`
- - QA regression v6 validated F1: `0.6667`
- - QA regression v8 F1: `0.7385`
- - Irish regression F1: `0.8000`
- - Irish large F1: `0.8979`
- - Multilingual suite F1: `0.9704`
- - Non-PPSN agreement vs base mLiteClinical: `1.0000`
+ ## Why The New Repo Exists

- ## Notes
+ The original `OpenMed-PPSN-mLiteClinical-v1` name was serviceable but vague. The canonical repo name makes the scope explicit:

- - The base `OpenMed/OpenMed-PII-mLiteClinical-Base-135M-v1` model has no PPSN label, so PPSN recall starts at zero until PPSN rows are added.
- - The recommended path for PPSN extraction is `word_aligned`, not the default token-aggregation path.
+ - `mLiteClinical`
+ - `IrishPPSN`
+ - `135M`
+ - `v1`

  ## License and Attribution

pyproject.toml CHANGED
@@ -1,7 +1,7 @@
  [project]
- name = "openmed-mliteclinical-ppsn"
- version = "0.1.0"
- description = "mLiteClinical PPSN token-classification release"
+ name = "openmed-ppsn-mliteclinical-v1-legacy"
+ version = "1.0.1"
+ description = "Legacy alias for the OpenMed mLiteClinical Irish PPSN release; prefer the canonical repo"
  requires-python = ">=3.10"
  readme = "README.md"
  license = { text = "Apache-2.0" }
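
Integrations that still reference the legacy alias could redirect to the canonical repo before invoking the inference script. The following is a minimal sketch, not part of this commit: `LEGACY_TO_CANONICAL`, `resolve_repo_id`, and `build_command` are hypothetical names, and the `temsa/` namespace on the legacy repo id is an assumption (the commit only gives the repo name `OpenMed-PPSN-mLiteClinical-v1`). The command-line flags are the ones shown in the updated README.

```python
# Hypothetical migration helper; names here are illustrative, not from the repo.
# Assumption: the legacy alias lives under the same "temsa/" namespace.
LEGACY_TO_CANONICAL = {
    "temsa/OpenMed-PPSN-mLiteClinical-v1": "temsa/OpenMed-mLiteClinical-IrishPPSN-135M-v1",
}


def resolve_repo_id(repo_id: str) -> str:
    """Return the canonical repo id for a legacy alias, or the id unchanged."""
    return LEGACY_TO_CANONICAL.get(repo_id, repo_id)


def build_command(repo_id: str, text: str, min_score: float = 0.4) -> list[str]:
    """Build the argument list for inference_word_aligned.py as shown in the README."""
    return [
        "python3",
        "inference_word_aligned.py",
        "--model",
        resolve_repo_id(repo_id),
        "--ppsn-min-score",
        str(min_score),
        "--text",
        text,
        "--json",
    ]
```

Passing the resulting list to `subprocess.run` would avoid shell quoting issues with the `--text` argument; ids that are not in the mapping pass through untouched.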