---
language: ja
tags:
- modernbert
- japanese
- emergency-call
- phase-detection
- boundary-detection
license: apache-2.0
datasets:
- custom
metrics:
- accuracy
- f1
---

# NEC-119 ModernBERT Phase & Boundary Detector

## Model Description

This model is fine-tuned from `sbintuitions/modernbert-ja-310m` to analyze transcripts of Japanese emergency calls (119).
It performs two tasks simultaneously:

1. **Phase Classification**: Classifies the current utterance into a conversation phase (INIT, LOC, INC, or SUP).
2. **Boundary Detection**: Detects phase boundaries in the conversation.
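
To make the two tasks concrete, the following sketch shows one way instances could be derived from a phase-annotated transcript. It assumes that the context is the concatenation of all preceding utterances and that a boundary is an utterance whose phase differs from that of the previous utterance. Neither assumption is confirmed by this card, and `build_instances` and the sample transcript are hypothetical.

```python
from typing import Dict, List, Tuple

# Label IDs as listed in the Phase Labels section of this card.
PHASE_TO_ID: Dict[str, int] = {"INIT": 0, "LOC": 1, "INC": 2, "SUP": 3}


def build_instances(transcript: List[Tuple[str, str]]) -> List[dict]:
    """Turn (utterance, phase) pairs into context/current-utterance instances.

    Assumption: boundary = 1 when the phase differs from the previous
    utterance's phase; the first utterance is never marked as a boundary.
    """
    instances = []
    for i, (utterance, phase) in enumerate(transcript):
        previous_phase = transcript[i - 1][1] if i > 0 else phase
        instances.append({
            "context": " ".join(text for text, _ in transcript[:i]),
            "current_utterance": utterance,
            "phase": PHASE_TO_ID[phase],
            "boundary": int(phase != previous_phase),
        })
    return instances


# Hypothetical, simplified transcript (not taken from the training data).
example = [
    ("119番です。火事ですか、救急ですか。", "INIT"),
    ("救急です。", "INIT"),
    ("住所を教えてください。", "LOC"),
    ("どうされましたか。", "INC"),
]
instances = build_instances(example)
for instance in instances:
    print(instance["phase"], instance["boundary"], instance["current_utterance"])
```

Each instance pairs a context with the utterance to classify, which mirrors the two-segment input used in the Usage section.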

## Training Details

- **Base Model**: sbintuitions/modernbert-ja-310m
- **Training Data**: 45,483 instances from Japanese emergency call transcripts
- **Validation Data**: 4,984 instances
- **Test Data**: 9,605 instances
- **Training Configuration**:
  - Epochs: 5
  - Batch Size: 16 (effective 32 with gradient accumulation)
  - Learning Rate: 1e-5
  - Max Sequence Length: 1024 tokens
  - Optimizer: AdamW
  - Scheduler: Cosine
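
The configuration above implies a particular number of optimizer steps and a particular learning-rate trajectory. The sketch below works through that arithmetic under stated assumptions: an accumulation factor of 2 (inferred from the batch size of 16 and the effective batch size of 32), no warmup, decay to zero, and a final partial batch that is kept rather than dropped. None of these is confirmed by this card, training frameworks differ in how they round step counts, and `cosine_lr` is a hypothetical helper.

```python
import math

TRAIN_INSTANCES = 45_483
MICRO_BATCH_SIZE = 16
EFFECTIVE_BATCH_SIZE = 32
EPOCHS = 5
PEAK_LR = 1e-5

# Assumption: accumulation factor inferred from the effective batch size.
accumulation_steps = EFFECTIVE_BATCH_SIZE // MICRO_BATCH_SIZE
# Assumption: the last, smaller batch of each epoch is kept.
micro_batches_per_epoch = math.ceil(TRAIN_INSTANCES / MICRO_BATCH_SIZE)
steps_per_epoch = math.ceil(micro_batches_per_epoch / accumulation_steps)
total_steps = steps_per_epoch * EPOCHS


def cosine_lr(step: int, total: int = total_steps, peak: float = PEAK_LR) -> float:
    """Cosine decay from `peak` to 0 (assumes no warmup phase)."""
    progress = min(max(step / total, 0.0), 1.0)
    return 0.5 * peak * (1.0 + math.cos(math.pi * progress))


print(accumulation_steps, steps_per_epoch, total_steps)
for step in (0, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step):.2e}")
```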

## Performance

### Test Set Results (After 1 epoch)
- **Phase Classification Accuracy**: 84.9%
- **Boundary Detection Accuracy**: 94.6%
- **Phase F1-Macro**: 0.813
- **Boundary F1**: 0.626
- **Both Correct Accuracy**: 81.8%
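
For reference, the sketch below shows how metrics of this kind are commonly computed from per-utterance predictions. The definitions used here (macro-averaged F1 over the four phases, F1 on the positive boundary class, and a joint accuracy that counts an instance only when both predictions are right) are assumptions about how the figures above were obtained, and the toy labels are invented for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of positions where prediction and label agree."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 score treating `label` as the positive class."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, label) for label in labels) / len(labels)


def both_correct(phase_true, phase_pred, boundary_true, boundary_pred) -> float:
    """Fraction of instances where phase and boundary are both right."""
    rows = zip(phase_true, phase_pred, boundary_true, boundary_pred)
    return sum(pt == pp and bt == bp for pt, pp, bt, bp in rows) / len(phase_true)


# Toy labels, invented for illustration only.
phase_true = [0, 0, 1, 1, 2, 2, 3, 3]
phase_pred = [0, 0, 1, 2, 2, 2, 3, 1]
boundary_true = [0, 0, 1, 0, 1, 0, 1, 0]
boundary_pred = [0, 0, 1, 1, 0, 0, 1, 0]

print("phase accuracy:", accuracy(phase_true, phase_pred))
print("boundary accuracy:", accuracy(boundary_true, boundary_pred))
print("phase macro-F1:", round(macro_f1(phase_true, phase_pred, [0, 1, 2, 3]), 3))
print("boundary F1:", round(f1_for_class(boundary_true, boundary_pred, 1), 3))
print("both correct:", both_correct(phase_true, phase_pred, boundary_true, boundary_pred))
```

Boundaries are typically rare compared with non-boundaries, which is one plausible reason a high boundary accuracy can coexist with a much lower boundary F1.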

## Usage

```python
from transformers import AutoTokenizer, AutoModel
import torch

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("your-username/nec119-modernbert-phase-boundary")
model = AutoModel.from_pretrained("your-username/nec119-modernbert-phase-boundary")
model.eval()  # disable dropout for inference

# Prepare input: the preceding conversation and the utterance to classify are encoded as a pair
context = "previous conversation text"
current_utterance = "current line to classify"
inputs = tokenizer(context, current_utterance, return_tensors="pt", max_length=1024, truncation=True)

# Get predictions
with torch.no_grad():
    outputs = model(**inputs)
    # Extract predictions from outputs
```
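
This card does not document how the phase and boundary outputs are exposed, so the following is only a sketch of the decoding step. It assumes the model yields four phase logits and a single boundary logit per input, already converted to plain Python floats (for example with `.tolist()`). `decode_prediction`, the 0.5 threshold, and the sample logits are all hypothetical.

```python
import math
from typing import List

# Label IDs as listed in the Phase Labels section of this card.
ID_TO_PHASE = {0: "INIT", 1: "LOC", 2: "INC", 3: "SUP"}


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(phase_logits: List[float], boundary_logit: float,
                      threshold: float = 0.5) -> dict:
    """Map raw logits to a phase name and a boundary decision.

    Assumption: the boundary head emits one logit that is passed through
    a sigmoid; a two-logit head would need a softmax instead.
    """
    phase_probs = softmax(phase_logits)
    phase_id = max(range(len(phase_probs)), key=phase_probs.__getitem__)
    boundary_prob = 1.0 / (1.0 + math.exp(-boundary_logit))
    return {
        "phase": ID_TO_PHASE[phase_id],
        "phase_prob": phase_probs[phase_id],
        "is_boundary": boundary_prob >= threshold,
        "boundary_prob": boundary_prob,
    }


# Made-up logits for illustration.
prediction = decode_prediction([0.2, 2.9, 0.4, -1.1], boundary_logit=1.3)
print(prediction)
```

Because boundaries are a minority class, the threshold is a natural knob to tune on the validation set rather than leaving it at 0.5.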

## Phase Labels
- **INIT (0)**: Initial phase
- **LOC (1)**: Location identification phase
- **INC (2)**: Incident details phase
- **SUP (3)**: Support/supplementary phase
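
Given per-utterance phase IDs, the label map above can be used to summarize a call as contiguous phase segments. The sketch below is one simple way to do that with run-length grouping; `phase_segments` and the sample predictions are hypothetical and not part of the released model.

```python
from itertools import groupby
from typing import List, Tuple

ID_TO_PHASE = {0: "INIT", 1: "LOC", 2: "INC", 3: "SUP"}


def phase_segments(phase_ids: List[int]) -> List[Tuple[str, int, int]]:
    """Collapse per-utterance phase IDs into (phase, start, end) runs.

    `start` and `end` are inclusive utterance indices.
    """
    segments = []
    position = 0
    for phase_id, run in groupby(phase_ids):
        length = sum(1 for _ in run)
        segments.append((ID_TO_PHASE[phase_id], position, position + length - 1))
        position += length
    return segments


# Made-up predictions for an eight-utterance call.
predicted_ids = [0, 0, 1, 1, 1, 2, 2, 3]
segments = phase_segments(predicted_ids)
for name, start, end in segments:
    print(f"{name}: utterances {start}-{end}")
```

The start index of every segment after the first corresponds to a phase change, so this view can also be compared against the boundary predictions as a consistency check.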

## Limitations

This model is specifically trained for Japanese emergency call transcripts and may not generalize well to other domains or conversation types.

## License

Apache 2.0