---
language:
- ja
license: mit
base_model: tohoku-nlp/bert-base-japanese-v3
tags:
- japanese
- keigo
- text-classification
- omotenashi
- hospitality
- bert
pipeline_tag: text-classification
---

# Keigo Evaluator – 敬語レベル分類モデル

A fine-tuned Japanese BERT model that classifies the politeness level (敬語レベル) of Japanese speech into four levels. It is designed to evaluate whether an employee is speaking with appropriate **Keigo (敬語)** and **Omotenashi (おもてなし)** standards in a hospitality or service context.

---

## Intended Use

This model is the NLP component of an AI-powered service quality evaluation pipeline:

```
Voice Recording → Whisper ASR → Transcribed Text → This Model → Keigo Verdict
```

It is intended for:
- Evaluating employee speech quality in hospitality and customer service settings
- Automated Keigo compliance checking in call centres or hotel/restaurant environments
- Quality assurance systems for Japanese service staff training

---

## Labels

The model predicts one of four classes:

| Label | Level | Name | Description | Service Verdict |
|---------|-------|----------|-----------------------------------------------------------|-----------------|
| LABEL_0 | 1 | 最高敬語 | Highest honorific; sonkeigo dominant | ✅ Pass |
| LABEL_1 | 2 | 敬語 | Standard honorific; appropriate for most service contexts | ✅ Pass |
| LABEL_2 | 3 | 丁寧語 | Polite but not honorific; insufficient for hospitality | ❌ Fail |
| LABEL_3 | 4 | 普通語 | Casual / plain speech; inappropriate in service contexts | ❌ Fail |

---

## How to Use

### Installation

```bash
pip install transformers torch fugashi unidic-lite
```

> **Note**: `unidic-lite` is required (not `ipadic`); this model uses the UniDic dictionary for MeCab tokenization.
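
If model loading fails with a MeCab or dictionary error, one of these packages is usually missing. A quick importability check (`missing_packages` is a small illustrative helper, not part of this repository):

```python
from importlib.util import find_spec

def missing_packages(names: list[str]) -> list[str]:
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]

# The tokenizer needs fugashi (MeCab bindings) plus the unidic_lite dictionary.
missing = missing_packages(["fugashi", "unidic_lite"])
if missing:
    print("Missing:", missing, "- run: pip install", " ".join(missing))
```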

### Basic Usage

```python
from transformers import pipeline
import torch

classifier = pipeline(
    'text-classification',
    model='ishraq/keigo-evaluator',
    device=0 if torch.cuda.is_available() else -1
)

LEVEL_MAP = {
    'LABEL_0': {'level': 1, 'name': '最高敬語', 'passed': True},
    'LABEL_1': {'level': 2, 'name': '敬語', 'passed': True},
    'LABEL_2': {'level': 3, 'name': '丁寧語', 'passed': False},
    'LABEL_3': {'level': 4, 'name': '普通語', 'passed': False},
}

def evaluate_keigo(text: str) -> dict:
    result = classifier(text)[0]
    info = LEVEL_MAP[result['label']]
    return {
        'text': text,
        'level': info['level'],
        'level_name': info['name'],
        'confidence': round(result['score'], 3),
        'passed': info['passed'],
        'verdict': '✅ 適切な敬語です' if info['passed'] else '❌ 敬語レベルが不足しています'
    }

print(evaluate_keigo('いらっしゃいませ。本日はどのようなご用件でございましょうか？'))
# {'level': 1, 'level_name': '最高敬語', 'confidence': 0.91, 'passed': True, 'verdict': '✅ 適切な敬語です'}

print(evaluate_keigo('ちょっと待って。'))
# {'level': 4, 'level_name': '普通語', 'confidence': 0.99, 'passed': False, 'verdict': '❌ 敬語レベルが不足しています'}
```
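
The pipeline also accepts a list of texts, which is faster for batch audits. Keeping the aggregation separate from the model call makes it easy to test; `summarize_batch` below is a sketch that assumes the raw pipeline output format (`{'label': ..., 'score': ...}`) and the `LEVEL_MAP` defined above:

```python
def summarize_batch(raw_results: list, level_map: dict) -> dict:
    """Aggregate raw text-classification outputs into a pass-rate report."""
    evaluated = [
        {
            'level': level_map[r['label']]['level'],
            'passed': level_map[r['label']]['passed'],
            'confidence': round(r['score'], 3),
        }
        for r in raw_results
    ]
    n_passed = sum(1 for e in evaluated if e['passed'])
    return {
        'results': evaluated,
        'pass_rate': round(n_passed / len(evaluated), 3) if evaluated else 0.0,
    }

# Usage with the classifier above:
# raw = classifier(['いらっしゃいませ。', 'ちょっと待って。'])
# print(summarize_batch(raw, LEVEL_MAP))
```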

### Full Voice Pipeline (Whisper + Keigo Evaluator)

```python
import whisper  # pip install openai-whisper
from transformers import pipeline
import torch

asr = whisper.load_model('medium')
classifier = pipeline(
    'text-classification',
    model='ishraq/keigo-evaluator',
    device=0 if torch.cuda.is_available() else -1
)

# LEVEL_MAP is the label-to-level mapping defined in Basic Usage above.

def evaluate_recording(audio_path: str) -> dict:
    transcript = asr.transcribe(audio_path, language='ja')['text']
    result = classifier(transcript)[0]
    info = LEVEL_MAP[result['label']]
    return {
        'transcript': transcript,
        'level': info['level'],
        'level_name': info['name'],
        'confidence': round(result['score'], 3),
        'passed': info['passed'],
        'verdict': '✅ 適切な敬語です' if info['passed'] else '❌ 敬語レベルが不足しています'
    }

result = evaluate_recording('employee_call.mp3')
print(result)
```
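
For QA dashboards it can help to roll several recordings up into one summary. A sketch operating on the dicts returned by `evaluate_recording` (the `shift_report` name is hypothetical):

```python
def shift_report(evaluations: list) -> dict:
    """Summarize per-recording keigo verdicts into a simple QA report."""
    total = len(evaluations)
    n_passed = sum(1 for e in evaluations if e['passed'])
    return {
        'recordings': total,
        'passed': n_passed,
        'failed': total - n_passed,
        'pass_rate': round(n_passed / total, 3) if total else 0.0,
    }

# report = shift_report([evaluate_recording(p) for p in ['call1.mp3', 'call2.mp3']])
```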

---

## Training Details

### Dataset

**KeiCO Corpus** – a Japanese keigo classification corpus of 10,002 sentences labelled by politeness level and keigo type (sonkeigo / kenjōgo / teineigo) across a wide range of service situations, including greetings (挨拶), apologies (謝り), meetings (会い), and seasonal expressions (季節).

| Level | Count | % |
|--------------|-------|-------|
| 1 – 最高敬語 | 2,584 | 25.8% |
| 2 – 敬語 | 2,044 | 20.4% |
| 3 – 丁寧語 | 2,692 | 26.9% |
| 4 – 普通語 | 2,682 | 26.8% |

The dataset is roughly balanced across the four levels, so no oversampling or class weighting was applied.
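
The distribution can be sanity-checked from the counts in the table above:

```python
# Class counts from the KeiCO table above.
counts = {'最高敬語': 2584, '敬語': 2044, '丁寧語': 2692, '普通語': 2682}

total = sum(counts.values())
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
imbalance = max(counts.values()) / min(counts.values())

print(total)                 # 10002
print(shares['敬語'])         # 20.4
print(round(imbalance, 2))   # 1.32: the largest class is only ~1.3x the smallest
```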

### Training Hyperparameters

| Parameter | Value |
|---------------------|-----------------------|
| Epochs | 5 |
| Batch size | 32 |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Warmup ratio | 10% |
| Max sequence length | 128 |
| Optimizer | AdamW |
| Scheduler | Linear warmup + decay |
| Gradient clipping | 1.0 |
| Loss | Cross-entropy |
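
The scheduler row ("linear warmup + decay") combined with the 10% warmup ratio corresponds to a learning-rate curve like the following sketch (`total_steps` depends on dataset and batch size; the function is illustrative, not the training code itself):

```python
PEAK_LR = 2e-5        # learning rate from the table
WARMUP_RATIO = 0.10   # warmup ratio from the table

def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR over the first 10% of steps, then linear decay to 0."""
    warmup_steps = int(total_steps * WARMUP_RATIO)
    if step < warmup_steps:
        return PEAK_LR * step / max(1, warmup_steps)
    remaining = max(1, total_steps - warmup_steps)
    return PEAK_LR * max(0.0, (total_steps - step) / remaining)

# Starts at 0, peaks at the end of warmup (step 100 of 1000), decays back to 0.
```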

### Training Infrastructure

- **Hardware**: NVIDIA T4 GPU (Google Colab)
- **Framework**: PyTorch + Hugging Face Transformers
- **Train / Val split**: 85% / 15%, stratified by label
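
A stratified 85/15 split can be reproduced along these lines (a plain-Python sketch; the actual run may equally have used a library utility such as scikit-learn's `train_test_split` with `stratify=`):

```python
import random
from collections import defaultdict

def stratified_split(examples: list, label_key: str = 'label',
                     val_fraction: float = 0.15, seed: int = 42):
    """Split examples into (train, val), preserving the per-label distribution."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex[label_key]].append(ex)
    rng = random.Random(seed)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

data = [{'label': i % 4, 'text': f'sentence {i}'} for i in range(160)]
train, val = stratified_split(data)
print(len(train), len(val))  # 136 24: an exact 85/15 split, balanced per label
```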

---

## Evaluation Results

Sample inference results on held-out test sentences:

| Input | Predicted Level | Confidence | Verdict |
|-------|-----------------|------------|---------|
| 本日はお早いのですね。お散歩ですか？ | 2 – 敬語 | 0.598 | ✅ Pass |
| ご多用中にもかかわらず、お出でくださいました。 | 2 – 敬語 | 0.557 | ✅ Pass |
| お問い合わせいただいた商品が本日入荷しました。 | 3 – 丁寧語 | 0.740 | ❌ Fail |
| 今日は、どこにいくの | 4 – 普通語 | 0.993 | ❌ Fail |
| 忙しいのに、また来たね。 | 4 – 普通語 | 0.996 | ❌ Fail |

Casual speech (Level 4) is detected with near-perfect confidence, while borderline honorific sentences show appropriately lower confidence scores.

---

## Limitations

- The model evaluates **transcribed text**, not raw audio. Whisper transcription quality directly affects evaluation accuracy; `whisper medium` or `whisper large` is recommended for Japanese.
- Confidence scores below **0.60** on a passing result indicate borderline speech; consider flagging these for human review.
- The model classifies overall politeness level and does **not** identify specific keigo errors (e.g. incorrect verb conjugation).
- Accuracy may be lower for highly domain-specific speech such as medical or legal Japanese.
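
The human-review rule from the second bullet can be encoded directly (a hypothetical helper operating on the dicts returned by `evaluate_keigo` / `evaluate_recording`):

```python
REVIEW_THRESHOLD = 0.60

def needs_review(evaluation: dict) -> bool:
    """Flag passing results whose low confidence suggests borderline keigo."""
    return evaluation['passed'] and evaluation['confidence'] < REVIEW_THRESHOLD

print(needs_review({'passed': True, 'confidence': 0.557}))   # True: borderline pass
print(needs_review({'passed': True, 'confidence': 0.910}))   # False: confident pass
print(needs_review({'passed': False, 'confidence': 0.99}))   # False: already a fail
```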

---

## Citation

If you use this model, please cite the KeiCO corpus and the base model:

```
Base model: Tohoku NLP Lab, BERT-base Japanese v3
Dataset: KeiCO Corpus – Japanese Keigo Classification Corpus
Fine-tuned by: Ishraq (B-JET Ideathon 2026 – Smart Service Evaluator)
```