---
language:
- ja
license: mit
base_model: tohoku-nlp/bert-base-japanese-v3
tags:
- japanese
- keigo
- text-classification
- omotenashi
- hospitality
- bert
pipeline_tag: text-classification
---
# Keigo Evaluator – 敬語レベル分類モデル

A fine-tuned Japanese BERT model that classifies the politeness level (敬語レベル) of Japanese speech into four levels. Designed to evaluate whether an employee is speaking with appropriate keigo (敬語) and omotenashi (おもてなし) standards in a hospitality or service context.
## Intended Use

This model is the NLP component of an AI-powered service quality evaluation pipeline:

Voice Recording → Whisper ASR → Transcribed Text → This Model → Keigo Verdict
It is intended for:
- Evaluating employee speech quality in hospitality and customer service settings
- Automated Keigo compliance checking in call centres or hotel/restaurant environments
- Quality assurance systems for Japanese service staff training
## Labels

The model predicts one of four classes:

| Label | Level | Name | Description | Service Verdict |
|---|---|---|---|---|
| LABEL_0 | 1 | 最高敬語 | Highest honorific – sonkeigo dominant | ✅ Pass |
| LABEL_1 | 2 | 敬語 | Standard honorific – appropriate for most service contexts | ✅ Pass |
| LABEL_2 | 3 | 丁寧語 | Polite but not honorific – insufficient for hospitality | ❌ Fail |
| LABEL_3 | 4 | 普通語 | Casual / plain speech – inappropriate in service contexts | ❌ Fail |
## How to Use

### Installation

```shell
pip install transformers torch fugashi unidic-lite
```

Note: `unidic-lite` is required (not `ipadic`) – this model uses the UniDic dictionary for MeCab tokenization.
### Basic Usage

```python
from transformers import pipeline
import torch

classifier = pipeline(
    'text-classification',
    model='ishraq/keigo-evaluator',
    device=0 if torch.cuda.is_available() else -1
)

LEVEL_MAP = {
    'LABEL_0': {'level': 1, 'name': '最高敬語', 'passed': True},
    'LABEL_1': {'level': 2, 'name': '敬語', 'passed': True},
    'LABEL_2': {'level': 3, 'name': '丁寧語', 'passed': False},
    'LABEL_3': {'level': 4, 'name': '普通語', 'passed': False},
}

def evaluate_keigo(text: str) -> dict:
    result = classifier(text)[0]
    info = LEVEL_MAP[result['label']]
    return {
        'text': text,
        'level': info['level'],
        'level_name': info['name'],
        'confidence': round(result['score'], 3),
        'passed': info['passed'],
        'verdict': '✅ 適切な敬語です' if info['passed'] else '❌ 敬語レベルが不足しています'
    }

print(evaluate_keigo('いらっしゃいませ。本日はどのようなご用件でございますでしょうか？'))
# {'level': 1, 'level_name': '最高敬語', 'confidence': 0.91, 'passed': True, 'verdict': '✅ 適切な敬語です'}

print(evaluate_keigo('ちょっと待って'))
# {'level': 4, 'level_name': '普通語', 'confidence': 0.99, 'passed': False, 'verdict': '❌ 敬語レベルが不足しています'}
```
### Full Voice Pipeline (Whisper + Keigo Evaluator)

```python
import whisper
from transformers import pipeline
import torch

asr = whisper.load_model('medium')
classifier = pipeline(
    'text-classification',
    model='ishraq/keigo-evaluator',
    device=0 if torch.cuda.is_available() else -1
)

# LEVEL_MAP as defined in Basic Usage above.

def evaluate_recording(audio_path: str) -> dict:
    transcript = asr.transcribe(audio_path, language='ja')['text']
    result = classifier(transcript)[0]
    info = LEVEL_MAP[result['label']]
    return {
        'transcript': transcript,
        'level': info['level'],
        'level_name': info['name'],
        'confidence': round(result['score'], 3),
        'passed': info['passed'],
        'verdict': '✅ 適切な敬語です' if info['passed'] else '❌ 敬語レベルが不足しています'
    }

result = evaluate_recording('employee_call.mp3')
print(result)
```
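For QA reporting, per-utterance verdicts can be rolled up into a session-level summary. A minimal sketch, assuming each utterance has already been scored into result dicts like those returned by `evaluate_keigo` above (the `summarize_session` helper and the 0.60 review threshold are illustrative, not part of the model API):

```python
def summarize_session(results: list, review_threshold: float = 0.60) -> dict:
    """Aggregate per-utterance keigo verdicts into a session-level report.

    Each item in `results` is expected to look like the output of
    evaluate_keigo(): {'level': int, 'confidence': float, 'passed': bool, ...}.
    """
    total = len(results)
    passed = sum(1 for r in results if r['passed'])
    # Passing utterances with low confidence are borderline -> flag for human review.
    flagged = [r for r in results if r['passed'] and r['confidence'] < review_threshold]
    return {
        'utterances': total,
        'pass_rate': round(passed / total, 3) if total else 0.0,
        'worst_level': max((r['level'] for r in results), default=None),
        'needs_review': len(flagged),
    }

# Example with precomputed, illustrative per-utterance results:
session = [
    {'level': 1, 'confidence': 0.91, 'passed': True},
    {'level': 2, 'confidence': 0.55, 'passed': True},   # borderline pass
    {'level': 4, 'confidence': 0.99, 'passed': False},
]
print(summarize_session(session))
# {'utterances': 3, 'pass_rate': 0.667, 'worst_level': 4, 'needs_review': 1}
```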
## Training Details

### Dataset

KeiCO Corpus – a Japanese keigo classification corpus of 10,002 sentences labelled by politeness level and keigo type (sonkeigo / kenjōgo / teineigo) across a wide range of service situations, including greetings, apologies, meetings, and seasonal expressions.
| Level | Count | % |
|---|---|---|
| 1 – 最高敬語 | 2,584 | 25.8% |
| 2 – 敬語 | 2,044 | 20.4% |
| 3 – 丁寧語 | 2,692 | 26.9% |
| 4 – 普通語 | 2,682 | 26.8% |
The class distribution is close to balanced, so no oversampling or class weighting was applied.
### Training Hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 5 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 10% |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Scheduler | Linear warmup + decay |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
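The warmup ratio translates into a concrete step count. A quick back-of-the-envelope check, assuming the 85% train split of the 10,002-sentence corpus (exact counts depend on how the last partial batch is handled, so these figures are illustrative):

```python
import math

corpus_size = 10_002
train_size = int(corpus_size * 0.85)          # 85% train split -> 8501 examples
steps_per_epoch = math.ceil(train_size / 32)  # batch size 32 -> 266 steps
total_steps = steps_per_epoch * 5             # 5 epochs -> 1330 optimizer steps
warmup_steps = int(total_steps * 0.10)        # 10% warmup ratio -> 133 warmup steps

print(steps_per_epoch, total_steps, warmup_steps)
# 266 1330 133
```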
### Training Infrastructure
- Hardware: NVIDIA T4 GPU (Google Colab)
- Framework: PyTorch + Hugging Face Transformers
- Train / Val split: 85% / 15% stratified by label
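A stratified split keeps the four label proportions identical across train and validation. The card does not say which tooling performed the split, so this is a plain-Python sketch of the idea (in practice something like `sklearn.model_selection.train_test_split` with `stratify=` would do the same):

```python
import random
from collections import defaultdict

def stratified_split(examples, val_ratio=0.15, seed=42):
    """Split (text, label) pairs so each label keeps the same train/val ratio."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[1]].append(ex)

    rng = random.Random(seed)
    train, val = [], []
    for label, items in by_label.items():
        rng.shuffle(items)                      # shuffle within each label
        n_val = round(len(items) * val_ratio)   # 15% of THIS label goes to val
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    return train, val

# Tiny illustrative dataset: 20 sentences per level (labels 1-4).
data = [(f'sentence_{label}_{i}', label) for label in range(1, 5) for i in range(20)]
train, val = stratified_split(data)
print(len(train), len(val))  # 68 12 -> exactly 3 of each label held out
```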
## Evaluation Results
Sample inference results on held-out test sentences:
| Input | Predicted Level | Confidence | Verdict |
|---|---|---|---|
| 本日はお早いのですね。お散歩ですか？ | 2 – 敬語 | 0.598 | ✅ Pass |
| ご多用中にもかかわらず、お出でくださいました。 | 2 – 敬語 | 0.557 | ✅ Pass |
| お問い合わせいただいた商品が本日入荷しました。 | 3 – 丁寧語 | 0.740 | ❌ Fail |
| 今日は、どこに行く？ | 4 – 普通語 | 0.993 | ❌ Fail |
| 忙しいのに、よく来たね。 | 4 – 普通語 | 0.996 | ❌ Fail |
Casual speech (Level 4) is detected with near-perfect confidence. Borderline honorific sentences show appropriately lower confidence scores.
## Limitations

- The model evaluates transcribed text, not raw audio. Whisper transcription quality directly affects evaluation accuracy; the `medium` or `large` Whisper models are recommended for Japanese.
- Confidence scores below 0.60 on a passing result indicate borderline speech; consider flagging such cases for human review.
- The model classifies overall politeness level and does not identify specific keigo errors (e.g. incorrect verb conjugation).
- Accuracy may be lower for highly domain-specific speech such as medical or legal Japanese.
## Citation

If you use this model, please cite the KeiCO corpus and the base model:

- Base model: Tohoku NLP Lab, BERT-base Japanese v3
- Dataset: KeiCO Corpus – Japanese Keigo Classification Corpus
- Fine-tuned by: Ishraq (B-JET Ideathon 2026 – Smart Service Evaluator)