Jonnob committed on
Commit f421b42 · verified · 1 Parent(s): 08de523

Added more info still missing training data

Files changed (1):
  1. README.md +63 -17
README.md CHANGED
@@ -9,8 +9,7 @@ base_model:
 - answerdotai/ModernBERT-base
 ---
 
-
- # Model Card for NER_OCOD
 
 This model is designed to perform Named Entity Recognition on The OCOD dataset of offshore owned property in England and Wales.
 
@@ -19,7 +18,7 @@ This model is designed to perform Named Entity Recognition on The OCOD dataset o
 ### Model Description
 
 The OCOD dataset is a record of all property in England and Wales owned by companies incorporated outside the UK, and is regularly released by the Land Registry, an agency of the UK government. The issue with the OCOD dataset is that the property addresses are entered as free text, making it challenging to extract important details from the addresses. In addition, a single entry can contain more than one property, with some addresses containing hundreds of sub-properties, which adds to the challenge; see the table below for examples.
- As such the "NER_OCOD" model is designed to extract a list of standardised elements which can be normalised to one property per row.
 
 | Example | Address |
 |---------|---------|
@@ -46,7 +45,7 @@ The model has the following classes
 - **License:** GPL 3.0
 - **Finetuned from model:** ModernBERT
 
- ### Model Sources [optional]
 
 - **Repository:** https://huggingface.co/Jonnob/OCOD_NER
 - **Github:** https://github.com/JonnoB/enhance_ocod
@@ -59,16 +58,41 @@ The model is designed to be used as part of the enhance_ocod python library whic
 
 ### Direct Use
 
- Although it can be used to parse text string addresses the model is not designed to be used directly.
 
- ### Downstream Use
 
- TIdeally the model should be used as part of the enhance OCOD library, where a variety of support functions and scripts are available. for example parsing the entire history of the OCOD dataset can be done by running
 
- `python parse_ocod_history.py`
 
- from the 'scripts' folder of the repo.
 
 ### Out-of-Scope Use
 
@@ -76,19 +100,31 @@ The model is specifically trained on OCOD data and is not designed to be a gener
 
 ## Bias, Risks, and Limitations
 
- Whilst the model has been trained on both English and Welsh addresses, there were less Welsh addresses in the training data, in addition modernBERT was not pre-trained on Welsh, as such the model may under-perform on addresses written in Welsh.
 
- ### Recommendations
 
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
 
- {{ bias_recommendations | default("Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.", true)}}
 
- ## How to Get Started with the Model
 
- Use the code below to get started with the model.
 
- {{ get_started_code | default("[More Information Needed]", true)}}
 
 ## Training Details
 
@@ -161,7 +197,17 @@ model performance is given below
 
 ### Model Architecture and Objective
 
- {{ model_specs | default("[More Information Needed]", true)}}
 
 ### Compute Infrastructure
 
 
 - answerdotai/ModernBERT-base
 ---
 
+ # Model Card for OCOD_NER
 
 This model is designed to perform Named Entity Recognition on The OCOD dataset of offshore owned property in England and Wales.
 
 
 ### Model Description
 
 The OCOD dataset is a record of all property in England and Wales owned by companies incorporated outside the UK, and is regularly released by the Land Registry, an agency of the UK government. The issue with the OCOD dataset is that the property addresses are entered as free text, making it challenging to extract important details from the addresses. In addition, a single entry can contain more than one property, with some addresses containing hundreds of sub-properties, which adds to the challenge; see the table below for examples.
+ As such the "OCOD_NER" model is designed to extract a list of standardised elements which can be normalised to one property per row.
 
 | Example | Address |
 |---------|---------|
 
 - **License:** GPL 3.0
 - **Finetuned from model:** ModernBERT
 
+ ### Model Sources
 
 - **Repository:** https://huggingface.co/Jonnob/OCOD_NER
 - **Github:** https://github.com/JonnoB/enhance_ocod
 
 
 ### Direct Use
 
+ This model is designed for Named Entity Recognition (NER) on address data, extracting and classifying address components. The model can be used directly through Hugging Face's transformers library for token classification tasks.
+ 
+ **Primary Use Case:**
+ - Parsing and extracting structured components from address strings
+ - Identifying entities such as street numbers, street names, cities, postcodes, etc.
+ 
+ **Example Usage:**
+ 
+ ```python
+ from transformers import pipeline
+ 
+ # Load the model
+ nlp = pipeline(
+     "token-classification",
+     model="Jonnob/OCOD_NER",
+     aggregation_strategy="simple",
+     device=0,  # 0 = first GPU; use -1 (or omit) to run on CPU
+ )
+ 
+ # Parse a single address
+ address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
+ results = nlp(address)
+ ```
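With `aggregation_strategy="simple"`, the pipeline returns a list of dicts with `entity_group`, `word`, and `score` keys. A minimal sketch of collecting those into a single address record; note the entity group names below are illustrative placeholders, not necessarily the label set this model was trained with:

```python
# Collect token-classification pipeline output into one address record.
# NOTE: the entity group names in `sample` are illustrative placeholders,
# not necessarily the labels this model actually emits.
def to_record(entities):
    grouped = {}
    for ent in entities:
        grouped.setdefault(ent["entity_group"], []).append(ent["word"])
    # Join multi-span entities into single strings
    return {label: " ".join(words) for label, words in grouped.items()}

# Example shaped like real pipeline output (scores shortened)
sample = [
    {"entity_group": "unit_id", "word": "14a", "score": 0.99},
    {"entity_group": "street_number", "word": "14", "score": 0.98},
    {"entity_group": "street_name", "word": "Barnsbury Road", "score": 0.99},
    {"entity_group": "city", "word": "London", "score": 0.99},
    {"entity_group": "postcode", "word": "N1 1JU", "score": 0.99},
]
record = to_record(sample)
print(record)
```

A dict like this maps directly onto the one-property-per-row output the model is designed to support.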
 
+ ### Downstream Use
 
+ **Primary Integration: the enhance_ocod Library**
+ This model is primarily designed to be used as part of the enhance_ocod library, where specialised functions and scripts are available for processing property address data.
 
+ **OCOD-Specific Usage:**
+ For users working with OCOD datasets, the complete processing pipeline can be executed using:
+ ```bash
+ python parse_ocod_history.py
+ ```
+ from the 'scripts' folder of the repository. This handles the entire historical OCOD dataset with optimized batch processing.
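The one-property-per-row normalisation this pipeline feeds into can be illustrated with a toy expansion step. This is a simplified sketch under stated assumptions, not the enhance_ocod library's actual implementation; `expand_unit_range` is a hypothetical helper handling only one "Flats A-B" pattern:

```python
import re

def expand_unit_range(address_text):
    """Expand a 'Flats A-B, ...' entry into one address per unit.
    Hypothetical helper: the real library handles many more patterns."""
    m = re.search(r"[Ff]lats?\s+(\d+)\s*-\s*(\d+),\s*(.+)", address_text)
    if not m:
        # Not a range entry: already one property per row
        return [address_text]
    start, end, rest = int(m.group(1)), int(m.group(2)), m.group(3)
    return [f"Flat {n}, {rest}" for n in range(start, end + 1)]

# A single OCOD entry covering three sub-properties becomes three rows
rows = expand_unit_range("Flats 1-3, 10 High Street, London")
for row in rows:
    print(row)
```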
 
 ### Out-of-Scope Use
 
 
 
 ## Bias, Risks, and Limitations
 
+ Whilst the model has been trained on both English and Welsh addresses, there were fewer Welsh addresses in the training data. In addition, ModernBERT was not pre-trained on Welsh, so the model may under-perform on addresses written in Welsh. The model will almost certainly not work in any other language.
 
+ ## How to Get Started with the Model
 
+ Use the code below to get started with the model.
 
+ ```python
+ import torch
+ from transformers import pipeline
+ 
+ # Load the model
+ nlp = pipeline(
+     "token-classification",
+     model="Jonnob/OCOD_NER",
+     aggregation_strategy="simple",
+     device=0 if torch.cuda.is_available() else -1,  # GPU if available, else CPU
+ )
+ 
+ # Parse a single address
+ address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
+ results = nlp(address)
+ 
+ # Print extracted entities
+ for entity in results:
+     print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.2f})")
+ ```
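Each predicted entity carries a `score`; a short sketch of dropping low-confidence predictions before downstream use. The 0.5 threshold is an assumption for illustration, not a value recommended by the model authors:

```python
# Drop low-confidence predictions before using them downstream.
# The min_score=0.5 threshold is a placeholder; tune it on held-out data.
def filter_entities(entities, min_score=0.5):
    return [e for e in entities if e["score"] >= min_score]

# Example shaped like pipeline output (scores shortened)
sample = [
    {"entity_group": "street_name", "word": "Barnsbury Road", "score": 0.99},
    {"entity_group": "city", "word": "Lond", "score": 0.31},
]
kept = filter_entities(sample)
print(kept)
```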
 
 ## Training Details
 
 
 
 ### Model Architecture and Objective
 
+ **Architecture:**
+ - Base model: ModernBERT-base (answerdotai/ModernBERT-base)
+ - 22 transformer layers, 149 million parameters
+ - Bidirectional encoder-only architecture with modern improvements, including a native context length of up to 8,192 tokens
+ - Additional token classification head for NER fine-tuning
+ 
+ **Objective:**
+ - Fine-tuned for Named Entity Recognition (NER)
+ - Training objective: token-level classification with cross-entropy loss
+ - Designed to identify and classify named entities for address normalisation
 
 ### Compute Infrastructure