---
license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---

# BERT-NER Vessel Detection Model

## Model Description

This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.

### Key Features

- **Vessel Detection**: Identifies ship names in text (mapped to the MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (uses the ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model**: Retains the base model's PER and LOC detection; the MISC slot is repurposed for vessels

### Model Architecture

- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
  - **VESSEL entities** are mapped to the **MISC** slot (B-MISC, I-MISC)
  - **Organization entities** use the **ORG** slot (B-ORG, I-ORG)
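
The slot mapping above can be illustrated with a small BIO-tagging helper. This is a sketch, not the preprocessing code used for this model: the function name `to_bio`, the `SLOT_FOR_TYPE` table, and the word-index span format are assumptions made for illustration.

```python
# Hypothetical mapping from annotation types to the label slots of the base model.
SLOT_FOR_TYPE = {"VESSEL": "MISC", "ORG": "ORG"}


def to_bio(words, spans):
    """Return one BIO label per word.

    `spans` is a list of (start, end, entity_type) tuples in word indices,
    with `end` exclusive. This annotation format is an assumption.
    """
    labels = ["O"] * len(words)
    for start, end, entity_type in spans:
        slot = SLOT_FOR_TYPE[entity_type]
        labels[start] = f"B-{slot}"          # first word of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{slot}"          # remaining words of the entity
    return labels


words = "The vessel Thunder owned by Pacific Seafood Inc. was seized.".split()
spans = [(2, 3, "VESSEL"), (5, 8, "ORG")]
labels = to_bio(words, spans)
for word, label in zip(words, labels):
    print(f"{word}: {label}")
```

Under this scheme a vessel name is indistinguishable from any other MISC entity at the label level, which is why the usage examples rename MISC to VESSEL after prediction.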

## Model Performance

### Evaluation Metrics

- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000

*Note: Metrics are reported on the validation set. Real-world performance may vary.*
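
The card does not say whether these scores are token-level or entity-level. As a reference point for re-checking them on your own data, here is a minimal sketch of entity-level, exact-match scoring; the function name `entity_prf` and the `(start, end, label)` span format are assumptions, and the example spans are made up.

```python
def entity_prf(gold, predicted):
    """Exact-match precision, recall, and F1 over (start, end, label) spans."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative spans only: the second prediction has a wrong end offset,
# so it counts as both a false positive and a false negative.
gold = [(19, 31, "MISC"), (40, 59, "ORG")]
predicted = [(19, 31, "MISC"), (40, 55, "ORG")]
p, r, f = entity_prf(gold, predicted)
print(f"{p:.2f} {r:.2f} {f:.2f}")  # 0.50 0.50 0.50
```

Exact-match scoring is strict: a partially overlapping span earns no credit, so it tends to give lower numbers than token-level accuracy on the same predictions.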

### Example Predictions

| Text | Detected Entities |
|------|------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |

## Training Details

### Training Data

- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases

### Training Procedure

- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled
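
### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-vessel-ner",  # placeholder path, not taken from the original run
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,  # mixed precision; requires a CUDA GPU
)
# The 128-token limit is not a TrainingArguments option; apply it when tokenizing,
# e.g. tokenizer(..., truncation=True, max_length=128).
```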

## How to Use

### Direct Use

```python
from transformers import pipeline

# Load the model
ner = pipeline("ner", model="your-username/bert-vessel-ner", aggregation_strategy="simple")

# Example 1: Vessel detection
text = "The fishing vessel Hai Feng 718 was detained by authorities."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL

# Example 2: Mixed entities
text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output:
# Thunder -> MISC (1.00)  # VESSEL
# Pacific Seafood Inc -> ORG (0.98)  # Organization
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")

# Tokenize and predict
text = "The vessel Thunder was seized."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_ids = torch.argmax(predictions, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, pred_id in zip(tokens, predicted_ids[0]):
    if token not in ['[CLS]', '[SEP]', '[PAD]']:
        label = model.config.id2label[pred_id.item()]
        print(f"{token}: {label}")
```
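
The example above truncates input at 128 tokens, so anything past that point in a long article is never scored. One common workaround is to split the token ids into overlapping windows and run the model on each. The sketch below shows only the windowing arithmetic; the function name `sliding_windows` and the default overlap of 32 are assumptions, not settings documented for this model.

```python
def sliding_windows(token_ids, max_length=128, overlap=32, num_special=2):
    """Split token ids into overlapping windows.

    Each window leaves room for `num_special` special tokens ([CLS] and [SEP])
    so that it still fits within `max_length` once they are added.
    """
    window = max_length - num_special
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    if len(token_ids) <= window:
        return [list(token_ids)]

    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(list(token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break  # the last window reached the end of the input
        start += step
    return windows


# Stand-in ids for a 300-token document (no special tokens included).
ids = list(range(300))
chunks = sliding_windows(ids)
print([len(chunk) for chunk in chunks])  # [126, 126, 112]
```

Entities that fall in an overlap region are predicted twice, so the results need de-duplicating, for example by keeping the higher-scoring prediction. Fast tokenizers can also produce such windows directly when called with `return_overflowing_tokens=True` and a `stride`; check the tokenizer documentation for your installed version.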

### Post-Processing

**Note**: The model outputs VESSEL entities with the **MISC** label. You may want to rename them for clarity:

```python
# `ner` and `text` are defined in the Direct Use example
entities = ner(text)
for entity in entities:
    # Rename MISC to VESSEL for clarity
    if entity['entity_group'] == 'MISC':
        entity['entity_group'] = 'VESSEL'
    print(f"{entity['word']} -> {entity['entity_group']}")
```

## Limitations and Bias

### Known Limitations

1. **False Positives**: The model may occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher confidence threshold (0.98 or above) to reduce false positives.

2. **Multilingual Performance**: Although the model was trained on multiple languages, performance may vary by language. Results are best on English, Spanish, and Chinese.

3. **Domain Specificity**: The model was trained primarily on maritime crime and enforcement contexts. Performance may vary in other domains (e.g., commercial shipping, recreational boating).

4. **Synthetic Data**: The model was trained on synthetically generated data. Real-world performance may differ from the validation metrics.

### Recommendations

- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
  - High precision (fewer false positives): Use threshold ≥ 0.98
  - High recall (catch more vessels): Use threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data
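
The threshold and rule-based recommendations above can be combined in one filter over the pipeline output. This is a minimal sketch: the helper name `filter_vessels` and the suffix pattern are assumptions built from the examples in the list, and the sample scores are made up for illustration.

```python
import re

# Hypothetical pattern: corporate suffixes at the end of the entity text.
CORPORATE_SUFFIX = re.compile(r"\b(?:Inc|Co|Ltd)\.?$", re.IGNORECASE)


def filter_vessels(entities, threshold=0.98):
    """Keep MISC (vessel) entities that pass the score threshold and suffix rule."""
    kept = []
    for entity in entities:
        if entity["entity_group"] != "MISC":
            continue  # not a vessel prediction
        if entity["score"] < threshold:
            continue  # below the confidence threshold
        if CORPORATE_SUFFIX.search(entity["word"].strip()):
            continue  # looks like a company name
        kept.append(entity)
    return kept


# Illustrative entities in the shape returned with aggregation_strategy="simple".
sample = [
    {"word": "Thunder", "entity_group": "MISC", "score": 0.999},
    {"word": "Pacific Seafood Inc", "entity_group": "MISC", "score": 0.991},
    {"word": "Ocean Star", "entity_group": "MISC", "score": 0.93},
    {"word": "Pacific Seafood Inc", "entity_group": "ORG", "score": 0.98},
]
print([e["word"] for e in filter_vessels(sample)])                  # ['Thunder']
print([e["word"] for e in filter_vessels(sample, threshold=0.90)])  # ['Thunder', 'Ocean Star']
```

A suffix rule like this will also drop any genuine vessel whose name ends in one of these tokens, so it is worth validating against a sample of your own data before relying on it.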

## Training Data Sources

The model was trained on synthetically generated data from:
- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases

All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.

## Evaluation

### Test Set

- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score

### Performance by Entity Type

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |

*Note: Metrics are reported on the validation set. Real-world performance may vary.*

## Environmental Impact

- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (total ~10 minutes)
- **Carbon Emissions**: Minimal (short training duration)

## Citation

If you use this model, please cite:

```bibtex
@misc{bert-vessel-ner,
  title={BERT-NER Vessel Detection Model},
  author={Your Name},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}
}
```

## Model Card Contact

For questions or issues, please open an issue on the model repository.

## License

This model is licensed under Apache 2.0. The base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) is distributed under its own license; see its model card for the terms.