Added more info still missing training data

README.md
base_model:
- answerdotai/ModernBERT-base
---

# Model Card for OCOD_NER

This model is designed to perform Named Entity Recognition on the OCOD dataset of offshore-owned property in England and Wales.
### Model Description

The OCOD dataset is a record of all property in England and Wales owned by companies incorporated outside the UK, and is regularly released by the Land Registry, an agency of the UK government. The issue with the OCOD dataset is that property addresses are entered as free text, which makes extracting the important details of each address challenging. In addition, a single entry can contain more than one property, with some addresses containing hundreds of sub-properties, which adds to the challenge; see the table below for examples.

As such, the "OCOD_NER" model is designed to extract a list of standardised elements which can be normalised to one property per row.

| Example | Address |
|---------|---------|
- **License:** GPL 3.0
- **Finetuned from model:** ModernBERT

### Model Sources

- **Repository:** https://huggingface.co/Jonnob/OCOD_NER
- **Github:** https://github.com/JonnoB/enhance_ocod
### Direct Use

This model is designed for Named Entity Recognition (NER) on address data, extracting and classifying address components. The model can be used directly through Hugging Face's `transformers` library for token classification tasks.

**Primary Use Case:**

- Parsing and extracting structured components from address strings
- Identifying entities such as street numbers, street names, cities, postcodes, etc.

**Example Usage:**
```python
from transformers import pipeline

# Load the model
nlp = pipeline(
    "token-classification",
    model="Jonnob/OCOD_NER",
    aggregation_strategy="simple",
    device=0,  # GPU 0; use device=-1 for CPU
)

# Parse a single address
address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
results = nlp(address)
```
### Downstream Use

**Primary Integration: the enhance_ocod Library**

This model is primarily designed to be used as part of the enhance_ocod Python library, where specialised functions and scripts are available for processing property address data.

**OCOD-Specific Usage:**

For users working with OCOD datasets, the complete processing pipeline can be executed using:

```bash
python parse_ocod_history.py
```

from the `scripts` folder of the repository. This handles the entire historical OCOD dataset with optimised batch processing.
### Out-of-Scope Use

The model is specifically trained on OCOD data and is not designed to be a general-purpose address parser.

## Bias, Risks, and Limitations

Whilst the model has been trained on both English and Welsh addresses, there were fewer Welsh addresses in the training data. In addition, ModernBERT was not pre-trained on Welsh, so the model may under-perform on addresses written in Welsh. The model will almost certainly not work in any other language.
## How to Get Started with the Model

Use the code below to get started with the model.
```python
import torch
from transformers import pipeline

# Load the model
nlp = pipeline(
    "token-classification",
    model="Jonnob/OCOD_NER",
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available
)

# Parse a single address
address = "Flat 14a, 14 Barnsbury Road, London N1 1JU"
results = nlp(address)

# Print extracted entities
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.2f})")
```
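With `aggregation_strategy="simple"`, the pipeline returns a list of entity dictionaries. To move towards the one-property-per-row output described above, these can be grouped by label. A minimal sketch (the `group_entities` helper and the entity labels shown here are illustrative assumptions, not part of the library; check the model's config for the actual label set):

```python
from collections import defaultdict

def group_entities(entities):
    """Group aggregated pipeline output into a dict mapping each
    entity label to the list of matching text spans."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(ent["word"])
    return dict(grouped)

# Mock output shaped like the pipeline's aggregated predictions
# (labels are hypothetical examples, not the model's real tag set)
example = [
    {"entity_group": "unit_id", "word": "14a", "score": 0.99},
    {"entity_group": "street_number", "word": "14", "score": 0.98},
    {"entity_group": "street_name", "word": "Barnsbury Road", "score": 0.99},
    {"entity_group": "city", "word": "London", "score": 0.97},
    {"entity_group": "postcode", "word": "N1 1JU", "score": 0.99},
]

print(group_entities(example))
# {'unit_id': ['14a'], 'street_number': ['14'],
#  'street_name': ['Barnsbury Road'], 'city': ['London'],
#  'postcode': ['N1 1JU']}
```

Grouping by label keeps repeated labels (e.g. several unit numbers in one entry) together in a list, which is what makes expansion to one row per sub-property possible downstream.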
## Training Details
### Model Architecture and Objective

**Architecture:**

- Base model: ModernBERT-base (answerdotai/ModernBERT-base)
- 22 transformer layers, 149 million parameters
- Bidirectional encoder-only architecture with modern improvements:
  - Native context length: up to 8,192 tokens
- Additional token classification head for NER fine-tuning

**Objective:**

- Fine-tuned for Named Entity Recognition (NER)
- Training objective: token-level classification with cross-entropy loss
- Designed to identify and classify named entities for address normalisation
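The token-level cross-entropy objective listed above can be illustrated with a small self-contained sketch (toy logits and a two-class label set, not the model's actual labels or values):

```python
import math

def token_cross_entropy(logits, labels):
    """Mean cross-entropy over tokens: softmax each token's logits,
    then take -log of the probability assigned to the true label."""
    total = 0.0
    for scores, label in zip(logits, labels):
        exps = [math.exp(s) for s in scores]
        prob = exps[label] / sum(exps)
        total += -math.log(prob)
    return total / len(labels)

# Toy example: 3 tokens, 2 classes (e.g. O vs. street_name)
logits = [[2.0, 0.1], [0.2, 1.5], [3.0, -1.0]]
labels = [0, 1, 0]
loss = token_cross_entropy(logits, labels)  # ~0.133
```

During fine-tuning this loss is minimised over every token in the batch, pushing the classification head to assign high probability to each token's gold label.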
### Compute Infrastructure