---
license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---
# BERT-NER Vessel Detection Model
## Model Description
This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.
### Key Features
- **Vessel Detection**: Identifies ship names in text (mapped to MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (uses ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model**: Maintains original PER, LOC, and other entity detection capabilities
### Model Architecture
- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
- **VESSEL entities** are mapped to **MISC** slot (B-MISC, I-MISC)
- **Organization entities** use **ORG** slot (B-ORG, I-ORG)
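For reference, the tag set can be written out as an explicit mapping. The id order below is an illustrative assumption; the authoritative mapping lives in `model.config.id2label` after loading the checkpoint:

```python
# BIO tag set used by the model; VESSEL spans reuse the MISC slot.
# Id order here is an assumption -- read model.config.id2label for the real one.
LABELS = [
    "O",
    "B-MISC", "I-MISC",  # vessels (mapped to the MISC slot)
    "B-PER", "I-PER",
    "B-ORG", "I-ORG",
    "B-LOC", "I-LOC",
]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in enumerate(LABELS)}
```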
## Model Performance
### Evaluation Metrics
- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000
*Note: Metrics reported on validation set. Real-world performance may vary.*
### Example Predictions
| Text | Detected Entities |
|------|------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |
## Training Details
### Training Data
- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases
### Training Procedure
- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled
### Training Configuration
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="bert-vessel-ner",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
)
# Note: the 128-token max sequence length is applied at tokenization time;
# TrainingArguments does not accept a max_length parameter.
```
## How to Use
### Direct Use
```python
from transformers import pipeline

# Load the model
ner = pipeline(
    "ner",
    model="your-username/bert-vessel-ner",
    aggregation_strategy="simple",
)

# Example 1: vessel detection
text = "The fishing vessel Hai Feng 718 was detained by authorities."
for entity in ner(text):
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL

# Example 2: mixed entities
text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."
for entity in ner(text):
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output:
# Thunder -> MISC (1.00)             # VESSEL
# Pacific Seafood Inc -> ORG (0.98)  # organization
```
### Advanced Usage
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")

# Tokenize and predict
text = "The vessel Thunder was seized."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_ids = torch.argmax(predictions, dim=-1)

# Decode predictions, skipping special tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predicted_ids[0]):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        label = model.config.id2label[pred_id.item()]
        print(f"{token}: {label}")
```
### Post-Processing
**Note**: The model outputs VESSEL entities as **MISC** labels. You may want to rename them for clarity:
```python
entities = ner(text)
for entity in entities:
    # Rename MISC to VESSEL for clarity
    if entity["entity_group"] == "MISC":
        entity["entity_group"] = "VESSEL"
    print(f"{entity['word']} -> {entity['entity_group']}")
```
## Limitations and Bias
### Known Limitations
1. **False Positives**: May occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher threshold (0.98+) to reduce false positives.
2. **Multilingual Performance**: While trained on multiple languages, performance may vary by language. Best results on English, Spanish, and Chinese.
3. **Domain Specificity**: Trained primarily on maritime crime and enforcement contexts. Performance may vary in other domains (e.g., commercial shipping, recreational boating).
4. **Synthetic Data**: Model was trained on synthetically generated data. Real-world performance may differ from validation metrics.
### Recommendations
- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
- High precision (fewer false positives): Use threshold ≥ 0.98
- High recall (catch more vessels): Use threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data
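The threshold and false-positive-filtering recommendations above can be combined into a small post-processing step. The threshold default and suffix list below are illustrative assumptions, not tuned values:

```python
# Sketch: keep only high-confidence MISC (vessel) spans and drop entities
# that carry a corporate suffix. Threshold and suffixes are illustrative.
ORG_SUFFIXES = ("Inc.", "Inc", "Co.", "Ltd.", "Ltd", "LLC", "GmbH", "S.A.")

def filter_vessels(entities, threshold=0.98):
    vessels = []
    for entity in entities:
        if entity["entity_group"] != "MISC":
            continue  # only vessel (MISC) spans are of interest here
        if entity["score"] < threshold:
            continue  # below the high-precision cutoff
        if any(entity["word"].endswith(suffix) for suffix in ORG_SUFFIXES):
            continue  # looks like a company name, likely a false positive
        vessels.append(entity)
    return vessels
```

Lowering `threshold` toward 0.90 trades precision for recall, per the guidance above.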
## Training Data Sources
The model was trained on synthetically generated data from:
- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases
All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.
## Evaluation
### Test Set
- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score
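Entity-level scoring (as opposed to per-token accuracy) compares predicted and gold BIO spans as whole units. The sketch below is a minimal, simplified stand-in for evaluation libraries such as seqeval; the card does not specify how its metrics were actually computed:

```python
# Extract (type, start, end) spans from a BIO tag sequence, then score
# predicted vs. gold span sets. Simplified: spans opened by a bare I- tag
# are ignored rather than recovered.
def bio_spans(tags):
    spans, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):
        closes = tag == "O" or tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != etype
        )
        if closes and start is not None:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans

def entity_f1(gold, pred):
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)  # spans counted correct only on exact type+boundary match
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```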
### Performance by Entity Type
| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |
*Note: Metrics on validation set. Real-world performance may vary.*
## Environmental Impact
- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (total ~10 minutes)
- **Carbon Emissions**: Minimal (short training duration)
## Citation
If you use this model, please cite:
```bibtex
@misc{bert-vessel-ner,
title={BERT-NER Vessel Detection Model},
author={Your Name},
year={2025},
howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}
}
```
## Model Card Contact
For questions or issues, please open an issue on the model repository.
## License
This model is licensed under Apache 2.0, same as the base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER).