---
license: apache-2.0
base_model: dslim/bert-base-NER
tags:
- named-entity-recognition
- ner
- vessel-detection
- maritime
- multilingual
- bert
datasets:
- custom
language:
- en
- es
- zh
- fr
- pt
- ru
- multilingual
metrics:
- f1
- precision
- recall
pipeline_tag: token-classification
---

# BERT-NER Vessel Detection Model

## Model Description

This model is a fine-tuned version of [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER) for detecting **vessels (ships)** and **organizations** in maritime news articles and documents.

### Key Features

- **Vessel Detection**: Identifies ship names in text (mapped to the MISC slot)
- **Organization Detection**: Identifies maritime organizations, ship owners, operators, and related entities (uses the ORG slot)
- **Multilingual Support**: Trained on English, Spanish, Chinese, French, Portuguese, Russian, and other languages
- **Preserves Base Model**: Maintains the original PER, LOC, and other entity detection capabilities

### Model Architecture

- **Base Model**: `dslim/bert-base-NER`
- **Task**: Token Classification (Named Entity Recognition)
- **Labels**: O, B-PER, I-PER, B-ORG, I-ORG, B-LOC, I-LOC, B-MISC, I-MISC
- **VESSEL entities** are mapped to the **MISC** slot (B-MISC, I-MISC)
- **Organization entities** use the **ORG** slot (B-ORG, I-ORG)

## Model Performance

### Evaluation Metrics

- **Precision**: 1.0000
- **Recall**: 1.0000
- **F1 Score**: 1.0000

*Note: Metrics reported on validation set. Real-world performance may vary.*

### Example Predictions

| Text | Detected Entities |
|------|-------------------|
| "The fishing vessel Hai Feng 718 was detained by authorities." | Hai Feng 718 (VESSEL: 1.00) |
| "Coast guard seized the trawler Thunder near disputed waters." | Thunder (VESSEL: 1.00) |
| "Pacific Seafood Inc. announced quarterly earnings today." | Pacific Seafood Inc (ORG: 0.95) |
| "The vessel Thunder owned by Pacific Seafood Inc. was seized." | Thunder (VESSEL: 1.00), Pacific Seafood Inc (ORG: 0.98) |

## Training Details

### Training Data

- **Total Examples**: ~60,000 synthetic multilingual examples
- **VESSEL Examples**: ~20,000
- **ORG Examples**: ~40,000 (ship owners, operators, brands, retailers, importers, fishmeal plants, etc.)
- **Languages**: English, Spanish, Chinese, French, Portuguese, Russian, and others
- **Source**: Synthetically generated from maritime entity databases

### Training Procedure

- **Base Model**: `dslim/bert-base-NER`
- **Training Epochs**: 3
- **Batch Size**: 32
- **Learning Rate**: 2e-5
- **Max Sequence Length**: 128 tokens
- **Optimizer**: AdamW with weight decay 0.01
- **Mixed Precision**: FP16 enabled

### Training Configuration

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    fp16=True,
)
# Note: the 128-token maximum sequence length is applied at tokenization time;
# TrainingArguments does not accept a max_length parameter.
```

## How to Use

### Direct Use

```python
from transformers import pipeline

# Load the model
ner = pipeline(
    "ner",
    model="your-username/bert-vessel-ner",
    aggregation_strategy="simple",
)

# Example 1: Vessel detection
text = "The fishing vessel Hai Feng 718 was detained by authorities."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output: Hai Feng 718 -> MISC (1.00)  # MISC = VESSEL

# Example 2: Mixed entities
text = "The vessel Thunder owned by Pacific Seafood Inc. was seized."
entities = ner(text)
for entity in entities:
    print(f"{entity['word']} -> {entity['entity_group']} ({entity['score']:.2f})")
# Output:
# Thunder -> MISC (1.00)             # VESSEL
# Pacific Seafood Inc -> ORG (0.98)  # Organization
```

### Advanced Usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification
import torch

# Load model and tokenizer
model = AutoModelForTokenClassification.from_pretrained("your-username/bert-vessel-ner")
tokenizer = AutoTokenizer.from_pretrained("your-username/bert-vessel-ner")

# Tokenize and predict
text = "The vessel Thunder was seized."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_ids = torch.argmax(predictions, dim=-1)

# Decode predictions
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])
for token, pred_id in zip(tokens, predicted_ids[0]):
    if token not in ['[CLS]', '[SEP]', '[PAD]']:
        label = model.config.id2label[pred_id.item()]
        print(f"{token}: {label}")
```

### Post-Processing

**Note**: The model outputs VESSEL entities as **MISC** labels. You may want to rename them for clarity:

```python
entities = ner(text)
for entity in entities:
    # Rename MISC to VESSEL for clarity
    if entity['entity_group'] == 'MISC':
        entity['entity_group'] = 'VESSEL'
    print(f"{entity['word']} -> {entity['entity_group']}")
```

## Limitations and Bias

### Known Limitations

1. **False Positives**: May occasionally classify organization names as vessels if they resemble ship names (e.g., "Pacific Seafood Inc."). Use a higher threshold (0.98+) to reduce false positives.
2. **Multilingual Performance**: While trained on multiple languages, performance may vary by language. Best results are on English, Spanish, and Chinese.
3. **Domain Specificity**: Trained primarily on maritime crime and enforcement contexts.
   Performance may vary in other domains (e.g., commercial shipping, recreational boating).
4. **Synthetic Data**: The model was trained on synthetically generated data. Real-world performance may differ from validation metrics.

### Recommendations

- **Threshold Tuning**: Adjust the confidence threshold based on your use case:
  - High precision (fewer false positives): use a threshold ≥ 0.98
  - High recall (catch more vessels): use a threshold ≥ 0.90
- **Post-Processing**: Consider adding rules to filter obvious false positives (e.g., entities containing "Inc.", "Co.", "Ltd.")
- **Domain Adaptation**: For best results in specific domains, consider fine-tuning on domain-specific data

## Training Data Sources

The model was trained on synthetically generated data from:

- Maritime vessel databases
- Ship owner and operator registries
- Brand and retailer information
- Fishmeal plant and processor databases

All training data was synthetically generated using large language models (Gemini 2.5 Flash-Lite) to create realistic maritime news contexts.

## Evaluation

### Test Set

- **Size**: ~2,280 examples (10% of total data)
- **Distribution**: Balanced across languages and entity types
- **Metrics**: Precision, Recall, F1 Score

### Performance by Entity Type

| Entity Type | Precision | Recall | F1 |
|-------------|-----------|--------|-----|
| VESSEL (MISC) | 1.0000 | 1.0000 | 1.0000 |
| ORG | 1.0000 | 1.0000 | 1.0000 |

*Note: Metrics on validation set. Real-world performance may vary.*

## Environmental Impact

- **Hardware**: GPU (CUDA)
- **Training Time**: ~3 minutes per epoch (~10 minutes total)
- **Carbon Emissions**: Minimal (short training duration)

## Citation

If you use this model, please cite:

```bibtex
@misc{bert-vessel-ner,
  title={BERT-NER Vessel Detection Model},
  author={Your Name},
  year={2025},
  howpublished={\url{https://huggingface.co/your-username/bert-vessel-ner}}
}
```

## Model Card Contact

For questions or issues, please open an issue on the model repository.
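The threshold tuning and rule-based filtering suggested under Recommendations can be combined into one post-processing step. The sketch below is illustrative, not part of the released model: the entity dicts mimic the shape returned by the `transformers` NER pipeline with `aggregation_strategy="simple"`, and the suffix list and 0.98 default threshold are assumptions taken from the guidance above.

```python
# Organization markers used for the rule-based false-positive filter (assumed list).
ORG_SUFFIXES = ("Inc", "Inc.", "Co", "Co.", "Ltd", "Ltd.")

def filter_vessels(entities, threshold=0.98):
    """Keep high-confidence VESSEL (MISC) entities, dropping obvious organizations.

    `entities` is a list of dicts shaped like the NER pipeline output with
    aggregation_strategy="simple" (keys: entity_group, word, score).
    """
    vessels = []
    for entity in entities:
        if entity["entity_group"] != "MISC":  # VESSEL entities use the MISC slot
            continue
        if entity["score"] < threshold:       # high-precision cutoff (>= 0.98)
            continue
        word = entity["word"].strip()
        if word.endswith(ORG_SUFFIXES):       # rule out obvious organization names
            continue
        vessels.append(word)
    return vessels

# Example with pipeline-shaped output:
entities = [
    {"entity_group": "MISC", "word": "Thunder", "score": 0.99},
    {"entity_group": "MISC", "word": "Pacific Seafood Inc.", "score": 0.99},  # mislabeled org
    {"entity_group": "ORG", "word": "Pacific Seafood Inc", "score": 0.98},
    {"entity_group": "MISC", "word": "Hai Feng 718", "score": 0.91},          # below threshold
]
print(filter_vessels(entities))  # -> ['Thunder']
```

Lowering `threshold` to 0.90 trades precision for recall, as described above, and would additionally keep "Hai Feng 718" in this example.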
## License

This model is licensed under Apache 2.0, the same license as the base model [`dslim/bert-base-NER`](https://huggingface.co/dslim/bert-base-NER).