---
license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---
# BERT-NER Vessel Detection Model
## Model Description
This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.
### Key Features
- **Vessel Detection**: Identifies ship names in text (mapped to the MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (mapped to the ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model**: Maintains original PER, LOC, and other entity detection capabilities
### Model Architecture
- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
- **VESSEL entities** are mapped to the **MISC** slot (B-MISC, I-MISC)
- **Organization entities** use the **ORG** slot (B-ORG, I-ORG)
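The slot mapping above can be expressed as a small helper that renames the generic BIO tags to their domain-specific meaning. The label list order below is illustrative; the authoritative mapping should always be read from `model.config.id2label`:

```python
# BIO label set used by the model (order is illustrative; read the real
# mapping from model.config.id2label after loading the checkpoint).
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC", "B-MISC", "I-MISC"]

# Domain-specific reading of the generic entity slots.
SLOT_MEANING = {
    "MISC": "VESSEL",  # ship names are emitted on the MISC slot
    "ORG": "ORG",      # maritime organizations keep the ORG slot
    "PER": "PER",
    "LOC": "LOC",
}

def interpret(label: str) -> str:
    """Map a raw BIO tag to its domain reading, e.g. 'B-MISC' -> 'B-VESSEL'."""
    if label == "O":
        return label
    prefix, _, slot = label.partition("-")
    return f"{prefix}-{SLOT_MEANING.get(slot, slot)}"

print(interpret("B-MISC"))  # B-VESSEL
print(interpret("I-ORG"))   # I-ORG
```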
## Model Performance
### Evaluation Metrics
- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000
*Note: Metrics reported on validation set. Real-world performance may vary.*
### Example Predictions
| Text | Detected Entities |
|------|------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |
## Training Details
### Training Data
- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases
### Training Procedure
- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled
### Training Configuration
```python
from transformers import TrainingArguments

# Note: the 128-token limit is applied at tokenization time;
# max_length is not a TrainingArguments parameter.
training_args = TrainingArguments(
    output_dir="bert-vessel-ner",
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
)
```
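Since tokenization splits words into subwords within the 128-token budget, word-level BIO labels must be expanded to subword tokens before training. A minimal sketch of this standard alignment step is shown below; the `word_ids` sequence is hand-written here to mimic tokenizer output (`tokenizer(...).word_ids()`), not produced by the real tokenizer:

```python
def align_labels_to_subwords(word_labels, word_ids, ignore_index=-100):
    """Expand word-level BIO label ids to subword tokens.

    word_ids mirrors the tokenizer's word_ids() output: None for special
    tokens, otherwise the index of the source word. Only the first subword
    of each word keeps its label; continuations and special tokens get
    ignore_index so they are excluded from the loss.
    """
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None or wid == previous:
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[wid])
        previous = wid
    return aligned

# Illustrative: "the Hai Feng 718" -> O B-MISC I-MISC I-MISC,
# where "718" splits into two subwords.
labels = [0, 7, 8, 8]                   # O, B-MISC, I-MISC, I-MISC (illustrative ids)
word_ids = [None, 0, 1, 2, 3, 3, None]  # [CLS], the, hai, feng, 71, ##8, [SEP]
print(align_labels_to_subwords(labels, word_ids))
# [-100, 0, 7, 8, 8, -100, -100]
```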
## How to Use
### Direct Use
```python
from transformers import pipeline

# Load the model
ner = pipeline("ner", model="your-username/bert-vessel-ner", aggregation_strategy="simple")

# Example 1: Vessel detection
text = "The fishing vessel Hai Feng 718 was detained by authorities."
for entity in ner(text):
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL

# Example 2: Mixed entities
text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."
for entity in ner(text):
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output:
# Thunder -> MISC (1.00)             # VESSEL
# Pacific Seafood Inc -> ORG (0.98)  # Organization
```
### Advanced Usage
```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")

# Tokenize and predict
text = "The vessel Thunder was seized."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_ids = torch.argmax(predictions, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred_id in zip(tokens, predicted_ids[0]):
    if token not in ("[CLS]", "[SEP]", "[PAD]"):
        label = model.config.id2label[pred_id.item()]
        print(f"{token}: {label}")
```
### Post-Processing
**Note**: The model outputs VESSEL entities as **MISC** labels. You may want to rename them for clarity:
```python
entities = ner(text)
for entity in entities:
    # Rename MISC to VESSEL for clarity
    if entity['entity_group'] == 'MISC':
        entity['entity_group'] = 'VESSEL'
    print(f"{entity['word']} -> {entity['entity_group']}")
```
## Limitations and Bias
### Known Limitations
1. **False Positives**: May occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher threshold (0.98+) to reduce false positives.
2. **Multilingual Performance**: While trained on multiple languages, performance may vary by language. Best results on English, Spanish, and Chinese.
3. **Domain Specificity**: Trained primarily on maritime crime and enforcement contexts. Performance may vary in other domains (e.g., commercial shipping, recreational boating).
4. **Synthetic Data**: Model was trained on synthetically generated data. Real-world performance may differ from validation metrics.
### Recommendations
- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
- High precision (fewer false positives): Use threshold ≥ 0.98
- High recall (catch more vessels): Use threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data
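The threshold and suffix-filter recommendations above can be combined into one post-processing pass over the pipeline output. The entity dicts below imitate the pipeline's aggregated output format but carry made-up scores, and the suffix list is an illustrative starting point, not an exhaustive rule set:

```python
# Suffixes that usually indicate a company rather than a vessel (illustrative).
ORG_SUFFIXES = ("Inc", "Co", "Ltd", "LLC", "GmbH")

def filter_vessels(entities, threshold=0.98):
    """Keep high-confidence MISC spans that do not look like company names."""
    vessels = []
    for ent in entities:
        if ent["entity_group"] != "MISC":
            continue
        if ent["score"] < threshold:
            continue
        # Rule-based guard against organization-like names.
        if any(ent["word"].rstrip(".").endswith(s) for s in ORG_SUFFIXES):
            continue
        vessels.append({**ent, "entity_group": "VESSEL"})
    return vessels

# Pipeline-style output with hypothetical scores.
sample = [
    {"word": "Hai Feng 718", "entity_group": "MISC", "score": 0.99},
    {"word": "Pacific Seafood Inc.", "entity_group": "MISC", "score": 0.95},
    {"word": "Thunder", "entity_group": "MISC", "score": 0.91},
]
print(filter_vessels(sample))  # only Hai Feng 718 survives at threshold 0.98
```

Lowering `threshold` to 0.90 trades precision for recall, as in the guidance above; the suffix guard then becomes the main defense against company names leaking into the VESSEL output.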
## Training Data Sources
The model was trained on synthetically generated data from:
- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases
All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.
## Evaluation
### Test Set
- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score
### Performance by Entity Type
| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |
*Note: Metrics on validation set. Real-world performance may vary.*
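For re-evaluating the model on your own data, entity-level precision, recall, and F1 (the metrics reported above) can be computed by exact-match comparison of predicted and gold spans. A minimal sketch, with hypothetical `(start, end, type)` spans rather than real model output:

```python
def entity_prf(gold, pred):
    """Entity-level precision/recall/F1 over exact (start, end, type) matches."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # spans that match exactly in position and type
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical character spans: (start, end, entity type)
gold = [(20, 32, "VESSEL"), (45, 64, "ORG")]
pred = [(20, 32, "VESSEL"), (45, 64, "ORG"), (70, 75, "VESSEL")]
p, r, f = entity_prf(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")  # P=0.67 R=1.00 F1=0.80
```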
## Environmental Impact
- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (total ~10 minutes)
- **Carbon Emissions**: Minimal (short training duration)
## Citation
If you use this model, please cite:
```bibtex
@misc{bert-vessel-ner,
title={BERT-NER Vessel Detection Model},
author={Your Name},
year={2025},
howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}
}
```
## Model Card Contact
For questions or issues, please open an issue on the model repository.
## License
This model is licensed under Apache 2.0, same as the base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER).