---

license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---


# BERT-NER Vessel Detection Model

## Model Description

This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.

### Key Features

- **Vessel Detection**: Identifies ship names in text (mapped to MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (uses ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model Labels**: Keeps the base model's label set, so PER and LOC detection is retained; the MISC slot is repurposed for vessels

### Model Architecture

- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
  - **VESSEL entities** are mapped to **MISC** slot (B-MISC, I-MISC)
  - **Organization entities** use **ORG** slot (B-ORG, I-ORG)

## Model Performance

### Evaluation Metrics

- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000

*Note: Metrics are reported on a held-out split of the synthetic data (see Evaluation). Real-world performance may differ.*

### Example Predictions

| Text | Detected Entities |
|------|------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |

## Training Details

### Training Data

- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases

### Training Procedure

- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled
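
The card does not say how word-level labels were aligned to BERT wordpieces during fine-tuning. A common convention, sketched here as an assumption rather than the authors' method, labels only the first wordpiece of each word and masks the rest with `-100` so the loss ignores them. The function name `align_labels` and the label ids in the example are hypothetical; `word_ids` has the shape returned by a fast tokenizer's `word_ids()` method after truncation to 128 tokens.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label id skipped by PyTorch's cross-entropy loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to wordpiece level (first-piece labeling).

    word_ids holds None for special tokens ([CLS], [SEP], padding) and
    otherwise the index of the source word for each wordpiece.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)          # special token
        elif word_id != previous:
            aligned.append(word_labels[word_id])  # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)          # continuation piece
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into three wordpieces.
example_word_ids = [None, 0, 1, 1, 1, 2, None]
example_labels = [0, 1, 0]  # illustrative ids, not the model's real mapping
print(align_labels(example_word_ids, example_labels))
```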

### Training Configuration

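```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./bert-vessel-ner",  # placeholder path; not stated in this card
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,  # requires a CUDA GPU
)
# The 128-token limit is not a TrainingArguments field. It is applied at
# tokenization time, e.g. tokenizer(..., truncation=True, max_length=128).
```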

## How to Use

### Direct Use

```python
from transformers import pipeline

# Load the model
ner = pipeline("ner", model="your-username/bert-vessel-ner", aggregation_strategy="simple")

# Example 1: Vessel detection
text = "The fishing vessel Hai Feng 718 was detained by authorities."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL

# Example 2: Mixed entities
text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output:
# Thunder -> MISC (1.00)  # VESSEL
# Pacific Seafood Inc -> ORG (0.98)  # Organization
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")

# Tokenize and predict
text = "The vessel Thunder was seized."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_ids = torch.argmax(predictions, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, pred_id in zip(tokens, predicted_ids[0]):
    if token not in ['[CLS]', '[SEP]', '[PAD]']:
        label = model.config.id2label[pred_id.item()]
        print(f"{token}: {label}")
```
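
The loop above prints one label per wordpiece. To get whole entities without the pipeline's aggregation, the wordpieces and BIO tags can be merged by hand. The sketch below is one simple way to do that, not the pipeline's exact algorithm: the helper name `group_entities` is hypothetical, it assumes BERT-style `##` continuation pieces, and it lets each word inherit the label of its first wordpiece. It also renames MISC to VESSEL, following the mapping described in this card.

```python
from typing import Dict, List, Optional

SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}
DISPLAY_NAMES = {"MISC": "VESSEL"}  # this card maps vessels to the MISC slot


def group_entities(tokens: List[str], labels: List[str]) -> List[Dict[str, str]]:
    """Merge wordpieces and BIO tags into a list of entity dicts."""
    entities: List[Dict[str, str]] = []
    current: Optional[Dict[str, str]] = None  # entity being built, if any
    for token, label in zip(tokens, labels):
        if token in SPECIAL_TOKENS:
            continue
        if token.startswith("##"):
            # Continuation piece: glue it to the open entity, else ignore it.
            if current is not None:
                current["word"] += token[2:]
            continue
        if label == "O":
            current = None
            continue
        prefix, _, tag = label.partition("-")
        tag = DISPLAY_NAMES.get(tag, tag)
        if prefix == "I" and current is not None and current["type"] == tag:
            current["word"] += " " + token   # extend the open entity
        else:
            current = {"word": token, "type": tag}  # start a new entity
            entities.append(current)
    return entities


# Hand-written tokens and labels for illustration (not real model output).
demo_tokens = ["[CLS]", "The", "vessel", "Thu", "##nder", "was", "seized", ".", "[SEP]"]
demo_labels = ["O", "O", "O", "B-MISC", "I-MISC", "O", "O", "O", "O"]
print(group_entities(demo_tokens, demo_labels))
```

With real model output, `tokens` and the decoded labels from the loop above can be passed in directly.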

### Post-Processing

**Note**: The model outputs VESSEL entities as **MISC** labels. You may want to rename them for clarity:

```python
# Assumes `ner` and `text` are defined as in the Direct Use example
entities = ner(text)
for entity in entities:
    # Rename MISC to VESSEL for clarity
    if entity['entity_group'] == 'MISC':
        entity['entity_group'] = 'VESSEL'
    print(f"{entity['word']} -> {entity['entity_group']}")
```

## Limitations and Bias

### Known Limitations

1. **False Positives**: May occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher threshold (0.98+) to reduce false positives.

2. **Multilingual Performance**: While trained on multiple languages, performance may vary by language. Best results on English, Spanish, and Chinese.

3. **Domain Specificity**: Trained primarily on maritime crime and enforcement contexts. Performance may vary in other domains (e.g., commercial shipping, recreational boating).

4. **Synthetic Data**: Model was trained on synthetically generated data. Real-world performance may differ from validation metrics.

### Recommendations

- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
  - High precision (fewer false positives): Use threshold ≥ 0.98
  - High recall (catch more vessels): Use threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data
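
The threshold and suffix recommendations above can be combined in a small filter over the pipeline output. This is a minimal sketch under stated assumptions: the function name `filter_vessels` and the suffix list are illustrative, the input dicts follow the shape returned by `pipeline("ner", aggregation_strategy="simple")`, and the suffix rule checks only the last word of each entity rather than any position.

```python
from typing import Any, Dict, List

# Corporate suffixes taken from the recommendation above; extend as needed.
ORG_SUFFIXES = {"inc", "co", "ltd"}


def filter_vessels(entities: List[Dict[str, Any]], threshold: float = 0.98) -> List[Dict[str, Any]]:
    """Keep MISC (vessel) entities that pass the score and suffix checks."""
    kept = []
    for entity in entities:
        if entity["entity_group"] != "MISC":
            continue  # not a vessel prediction
        if entity["score"] < threshold:
            continue  # below the confidence threshold
        words = entity["word"].replace(".", " ").split()
        if words and words[-1].lower() in ORG_SUFFIXES:
            continue  # looks like a company name, likely a false positive
        kept.append(entity)
    return kept


# Hand-written predictions for illustration (not real model output).
sample = [
    {"word": "Thunder", "entity_group": "MISC", "score": 0.99},
    {"word": "Pacific Seafood Inc", "entity_group": "MISC", "score": 0.99},
    {"word": "Sea Star", "entity_group": "MISC", "score": 0.93},
    {"word": "Pacific Seafood Inc", "entity_group": "ORG", "score": 0.98},
]
print([e["word"] for e in filter_vessels(sample)])
print([e["word"] for e in filter_vessels(sample, threshold=0.90)])
```

Lowering `threshold` trades precision for recall, as described in the list above.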

## Training Data Sources

The model was trained on synthetically generated data from:
- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases

All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.

## Evaluation

### Test Set

- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score

### Performance by Entity Type

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |

*Note: Metrics are computed on the held-out set described above. Real-world performance may differ.*
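
The card does not state how these scores were computed. A common convention for NER is entity-level exact match, where a prediction counts only if its span and type both match the gold annotation. The sketch below assumes that convention; the function name `entity_prf` and the `(sentence index, start, end, type)` tuple layout are hypothetical.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[int, int, int, str]  # (sentence index, start char, end char, type)


def entity_prf(gold: Set[Entity], pred: Set[Entity]) -> Dict[str, float]:
    """Entity-level precision, recall, and F1 with exact span and type match."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hand-written annotations for illustration: one ORG is mistaken for a vessel.
gold = {(0, 19, 31, "VESSEL"), (1, 0, 19, "ORG"), (2, 11, 18, "VESSEL")}
pred = {(0, 19, 31, "VESSEL"), (1, 0, 19, "VESSEL"), (2, 11, 18, "VESSEL")}
print(entity_prf(gold, pred))
```

Running the same function on a sample of real maritime text, rather than synthetic data, gives a more realistic estimate.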

## Environmental Impact

- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (total ~10 minutes)
- **Carbon Emissions**: Minimal (short training duration)

## Citation

If you use this model, please cite:

```bibtex
@misc{bert-vessel-ner,
  title={BERT-NER Vessel Detection Model},
  author={Your Name},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}
}
```

## Model Card Contact

For questions or issues, please open an issue on the model repository.

## License

This model is licensed under Apache 2.0. The base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) carries its own license terms; check its model card before redistributing.