File size: 7,838 Bytes
1e9240c
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
---

license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---


# BERT-NER Vessel Detection Model

## Model Description

This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.

### Key Features

- **Vessel Detection**: Identifies ship names in text (mapped to MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (uses ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model**: Maintains original PER, LOC, and other entity detection capabilities

### Model Architecture

- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
  - **VESSEL entities** are mapped to **MISC** slot (B-MISC, I-MISC)
  - **Organization entities** use **ORG** slot (B-ORG, I-ORG)

## Model Performance

### Evaluation Metrics

- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000

*Note: Metrics reported on validation set. Real-world performance may vary.*

### Example Predictions

| Text | Detected Entities |
|------|------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |

## Training Details

### Training Data

- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases

### Training Procedure

- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled

### Training Configuration

```python

TrainingArguments(

    num_train_epochs=3,

    per_device_train_batch_size=32,

    learning_rate=2e-5,

    weight_decay=0.01,

    max_length=128,

    fp16=True

)

```

## How to Use

### Direct Use

```python

from transformers import pipeline



# Load the model

ner = pipeline("ner", model="your-username/bert-vessel-ner", aggregation_strategy="simple")



# Example 1: Vessel detection

text = "The fishing vessel Hai Feng 718 was detained by authorities."

entities = ner(text)

for entity in entities:

    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")

# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL



# Example 2: Mixed entities

text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."

entities = ner(text)

for entity in entities:

    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")

# Output: 

# Thunder -> MISC (1.00)  # VESSEL

# Pacific Seafood Inc -> ORG (0.98)  # Organization

```

### Advanced Usage

```python

from transformers import AutoTokenizer, AutoModelForTokenClassification

import torch



# Load model and tokenizer

model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")

tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")



# Tokenize and predict

text = "The vessel Thunder was seized."

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)



with torch.no_grad():

    outputs = model(**inputs)

    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)

    predicted_ids = torch.argmax(predictions, dim=-1)



# Decode predictions

tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

for token, pred_id in zip(tokens, predicted_ids[0]):

    if token not in ['[CLS]', '[SEP]', '[PAD]']:

        label = model.config.id2label[pred_id.item()]

        print(f"{token}: {label}")

```

### Post-Processing

**Note**: The model outputs VESSEL entities as **MISC** labels. You may want to rename them for clarity:

```python

entities = ner(text)

for entity in entities:

    # Rename MISC to VESSEL for clarity

    if entity['entity_group'] == 'MISC':

        entity['entity_group'] = 'VESSEL'

    print(f"{entity['word']} -> {entity['entity_group']}")

```

## Limitations and Bias

### Known Limitations

1. **False Positives**: May occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher threshold (0.98+) to reduce false positives.

2. **Multilingual Performance**: While trained on multiple languages, performance may vary by language. Best results on English, Spanish, and Chinese.

3. **Domain Specificity**: Trained primarily on maritime crime and enforcement contexts. Performance may vary in other domains (e.g., commercial shipping, recreational boating).

4. **Synthetic Data**: Model was trained on synthetically generated data. Real-world performance may differ from validation metrics.

### Recommendations

- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
  - High precision (fewer false positives): Use threshold ≥ 0.98
  - High recall (catch more vessels): Use threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data

## Training Data Sources

The model was trained on synthetically generated data from:
- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases

All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.

## Evaluation

### Test Set

- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score

### Performance by Entity Type

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |

*Note: Metrics on validation set. Real-world performance may vary.*

## Environmental Impact

- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (total ~10 minutes)
- **Carbon Emissions**: Minimal (short training duration)

## Citation

If you use this model, please cite:

```bibtex

@misc{bert-vessel-ner,

  title={BERT-NER Vessel Detection Model},

  author={Your Name},

  year={2025},

  howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}

}

```

## Model Card Contact

For questions or issues, please open an issue on the model repository.

## License

This model is licensed under Apache 2.0, same as the base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER).