# Model Card for OCOD_NER

This model performs Named Entity Recognition (NER) on the OCOD dataset of offshore-owned property in England and Wales.

## Model Details

### Model Description

The OCOD dataset is a record of all property in England and Wales owned by companies incorporated outside the UK, and is regularly released by HM Land Registry, an agency of the UK government. The difficulty with the OCOD dataset is that property addresses are entered as free text, which makes extracting the important elements of each address challenging. In addition, a single entry can contain more than one property, with some addresses containing hundreds of sub-properties, which adds to the challenge; see the table below for examples.

As such, the OCOD_NER model is designed to extract a list of standardised address elements which can then be normalised to one property per row.
| Example | Address |
|---|---|
| 1 | flat 6, chartfield house, babel road, london |
| 2 | 5 to 15 (odds only) babel road, london (w1 8ap) |
| 3 | 5 babel road, london and parking 3.5 w1 8ap |
The model recognises the following entity classes:
| Entity class | Description |
|---|---|
| Unit ID | Describes a sub-unit such as a flat number or parking space ID. Example One would have `6` and Example Three would have `3.5` as the unit ID. The unit ID is not always a number |
| Unit type | Describes the type of unit, if available. Example One would have `flat`, whilst Example Three would have `parking` |
| Building Name | Example One would have `Chartfield House`; the field would not be present for the other two examples |
| Street Number | The street number of the property, if available: `5 to 15` in Example Two and `5` in Example Three. The street number is not always a number |
| Street Name | Self-explanatory; would be `Babel Road` in all three examples |
| Number Filter | When multiple properties are included in an address, a filtering condition is often used, because in the UK odd and even numbers are often on opposite sides of the road, or a company may not own all the flats in an apartment block. Example Two would have `odd` |
| City | Self-explanatory; would be London for all three examples |
| Postcode | Self-explanatory. In almost all cases the postcode is in parentheses. In addition, UK postcodes follow a pattern which can be extracted using a regex, making them easy to label |
- Developed by: Jonathan Bourne
- Model type: Named Entity Recognition
- Language(s) (NLP): English, Welsh
- License: GPL 3.0
- Finetuned from model: answerdotai/ModernBERT-base

### Model Sources

- Repository: https://huggingface.co/Jonnob/OCOD_NER
- GitHub: https://github.com/JonnoB/enhance_ocod
- Paper: What's in the laundromat? Mapping and characterising offshore-owned residential property in London, doi: https://doi.org/10.1177/23998083231155483
## Uses

The model is designed to be used as part of the enhance_ocod Python library, which can be found at https://github.com/JonnoB/enhance_ocod

### Direct Use

This model is designed for Named Entity Recognition (NER) on address data, extracting and classifying address components. It can be used directly through Hugging Face's transformers library for token classification.

Primary use cases:
- Parsing and extracting structured components from address strings
- Identifying entities such as street numbers, street names, cities, postcodes, etc.
Example usage:

```python
from transformers import pipeline

# Load the model
nlp = pipeline(
    "token-classification",
    model="Jonnob/OCOD_NER",
    aggregation_strategy="simple",
    device=0,  # GPU index; set to -1 (or omit) to run on CPU
)

# Parse a single address (the model expects lower-cased input)
address = "Flat 14a, 14 Barnsbury Road, London N1 1JU".lower()
results = nlp(address)
```
### Downstream Use

Primary integration: this model is primarily designed to be used as part of the enhance_ocod library, where specialised functions and scripts are available for processing property address data.

OCOD-specific usage: for users working with OCOD datasets, the complete processing pipeline can be executed with

```bash
python parse_ocod_history.py
```

from the `scripts` folder of the repository. This handles the entire historical OCOD dataset with optimised batch processing.
### Out-of-Scope Use

The model is trained specifically on OCOD data and is not designed to be a general-purpose address parser. However, it is likely to work relatively well on other UK addresses, although this has not been tested.
## Bias, Risks, and Limitations

Whilst the model has been trained on both English and Welsh addresses, there were fewer Welsh addresses in the training data. In addition, ModernBERT was not pre-trained on Welsh, so the model may under-perform on addresses written in Welsh. The model will almost certainly not work in any other language.
## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import pipeline

# Load the model
nlp = pipeline(
    "token-classification",
    model="Jonnob/OCOD_NER",
    aggregation_strategy="simple",
    device=0 if torch.cuda.is_available() else -1,  # GPU if available
)

# Parse a single address (the model expects lower-cased input)
address = "Flat 14a, 14 Barnsbury Road, London N1 1JU".lower()
results = nlp(address)

# Print the extracted entities
for entity in results:
    print(f"{entity['entity_group']}: {entity['word']} (confidence: {entity['score']:.2f})")
```
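The pipeline output can then be collapsed into one field per entity class, which is the form needed for the one-property-per-row normalisation described above. A minimal sketch, assuming `aggregation_strategy="simple"` output; the `entity_group` labels shown are illustrative and the model's actual label names may differ:

```python
def group_entities(results: list[dict]) -> dict[str, str]:
    """Collapse token-classification pipeline output into one field per
    entity class, joining multiple spans of the same class with a space."""
    address: dict[str, str] = {}
    for entity in results:
        label = entity["entity_group"]
        if label in address:
            address[label] += " " + entity["word"]
        else:
            address[label] = entity["word"]
    return address

# Illustrative pipeline output for "flat 14a, 14 barnsbury road, london n1 1ju"
example_results = [
    {"entity_group": "unit_type", "word": "flat", "score": 0.99},
    {"entity_group": "unit_id", "word": "14a", "score": 0.98},
    {"entity_group": "street_number", "word": "14", "score": 0.97},
    {"entity_group": "street_name", "word": "barnsbury road", "score": 0.99},
    {"entity_group": "city", "word": "london", "score": 0.99},
    {"entity_group": "postcode", "word": "n1 1ju", "score": 0.99},
]
print(group_entities(example_results))
```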
## Training Details

### Training Data

This model was trained on the hand-labelled dataset, not the weakly-labelled dataset. This is because, despite being slightly outperformed by the weakly-supervised model, the hand-labelled model is significantly easier to reproduce and faster to train.

### Training Procedure

The training procedure can be found in the `mbert_train_configurable.py` script of the enhance_ocod repository.
#### Training Hyperparameters

| Parameter | Default Value | Description |
|---|---|---|
| Model Architecture | answerdotai/ModernBERT-base | Base model used for token classification |
| Number of Epochs | 6 | Training epochs (configurable via --num_epochs) |
| Batch Size | 16 | Per device train/eval batch size (configurable via --batch_size) |
| Learning Rate | 5e-5 | Learning rate (configurable via --learning_rate) |
| Max Sequence Length | 128 | Maximum input sequence length (configurable via --max_length) |
| Warmup Steps | 500 | Number of warmup steps for learning rate scheduler |
| Weight Decay | 0.01 | L2 regularization weight decay |
| Evaluation Strategy | epoch | Evaluation performed at the end of each epoch |
| Save Strategy | epoch | Model checkpoints saved at the end of each epoch |
| Save Total Limit | 1 | Maximum number of checkpoints to keep |
| Load Best Model at End | True | Load the best model based on evaluation metric |
| Metric for Best Model | f1 | F1 score used to determine best model |
| Logging Steps | 500 | Log training metrics every 500 steps |
| Pad to Multiple of | 8 | Padding strategy for efficient GPU utilization |
| Float32 Matmul Precision | medium | PyTorch tensor operation precision setting |
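The trainer-level settings in the table correspond roughly to a Hugging Face `TrainingArguments` configuration along these lines. This is a sketch, not the actual training script: the output directory name is a placeholder, and the max sequence length and pad-to-multiple-of settings are applied via the tokenizer and data collator rather than `TrainingArguments`.

```python
from transformers import TrainingArguments

# Mirrors the hyperparameter table above; "ocod_ner_out" is a placeholder path.
training_args = TrainingArguments(
    output_dir="ocod_ner_out",
    num_train_epochs=6,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    learning_rate=5e-5,
    warmup_steps=500,
    weight_decay=0.01,
    eval_strategy="epoch",       # "evaluation_strategy" on older transformers versions
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
    logging_steps=500,
)
```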
## Evaluation

The model was evaluated on 2,000 hand-labelled addresses randomly sampled from the OCOD February 2022 release.

### Testing Data, Factors & Metrics

#### Testing Data

The test set is the 2,000 hand-labelled addresses described above.

#### Metrics

The model was evaluated using the micro-averaged F1 score.

### Results

Model performance is given below.
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| building name | 0.86 | 0.90 | 0.88 | 383 |
| city | 1.00 | 0.97 | 0.99 | 947 |
| postcode | 1.00 | 1.00 | 1.00 | 768 |
| street name | 0.99 | 0.96 | 0.97 | 1029 |
| street number | 0.99 | 0.98 | 0.98 | 678 |
| unit id | 0.97 | 0.95 | 0.96 | 370 |
| unit type | 1.00 | 0.97 | 0.98 | 488 |
| micro avg | 0.98 | 0.97 | 0.97 | 4663 |
| macro avg | 0.97 | 0.96 | 0.97 | 4663 |
| weighted avg | 0.98 | 0.97 | 0.97 | 4663 |
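As a quick sanity check, the weighted averages in the table follow directly from the per-class scores and supports:

```python
# Per-class F1 and support, taken from the results table above
per_class = {
    "building name": (0.88, 383),
    "city": (0.99, 947),
    "postcode": (1.00, 768),
    "street name": (0.97, 1029),
    "street number": (0.98, 678),
    "unit id": (0.96, 370),
    "unit type": (0.98, 488),
}

total_support = sum(s for _, s in per_class.values())
weighted_f1 = sum(f1 * s for f1, s in per_class.values()) / total_support

print(total_support)          # 4663
print(round(weighted_f1, 2))  # 0.97
```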
### Model Architecture and Objective

Architecture:

- Base model: ModernBERT-base (answerdotai/ModernBERT-base)
- 22 transformer layers, 149 million parameters
- Bidirectional encoder-only architecture with modern improvements
- Native context length: up to 8,192 tokens
- Additional token classification head for NER fine-tuning

Objective:

- Fine-tuned for Named Entity Recognition (NER)
- Training objective: token-level classification with cross-entropy loss
- Designed to identify and classify named entities for address normalisation
### Compute Infrastructure

The model was trained using the lightning.ai platform.

#### Hardware

The model can be run on an L4 or T4 GPU and requires 16 GB of VRAM.

## Citation

Coming soon.

## Model Card Contact

For queries, please raise an issue on the GitHub repository: https://github.com/JonnoB/enhance_ocod