KAi Toxicity Filter

A toxicity detection model specialized for Japanese text

Japanese Version

Model Overview

This model classifies Japanese text as toxic or non-toxic. It is based on tohoku-nlp/bert-base-japanese-v3 and fine-tuned for Japanese toxicity detection.

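Training Data

The model was trained on the following data:

inspection-ai/japanese-toxic-dataset (Apache 2.0)
- Source: https://github.com/inspection-ai/japanese-toxic-dataset
A custom dataset built specifically for KAi
Automatically generated hard negative samples
Automatically generated variations of toxic expressions (for class balancing)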

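Model Details

Base model: tohoku-nlp/bert-base-japanese-v3
Task: Binary classification (toxic/non-toxic)
Training method: Continuous-label learning (0.0-1.0) with MSE loss
Training data: 1,899 samples (train: 1,614 / validation: 285)
Epochs: 5
Learning rate: 2e-5 (linear decay)
Feature: Tuned for Japanese expressions through hard negative sampling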
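
The training script is not part of this card, so the following is only a minimal sketch of what an MSE objective over continuous labels computes. The function name `mse_loss` and the example scores are illustrative assumptions, not values from the actual training run.

```python
from typing import Sequence


def mse_loss(predictions: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted scores and continuous labels."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not predictions:
        raise ValueError("at least one sample is required")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical batch: labels are toxicity degrees in [0.0, 1.0], not hard 0/1 classes.
targets = [0.0, 0.5, 1.0]
predictions = [0.1, 0.5, 0.8]
loss = mse_loss(predictions, targets)
print(f"MSE loss: {loss:.4f}")
```

With continuous targets, a borderline expression labeled 0.5 pulls the predicted score toward the middle of the range instead of toward either extreme.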

Performance

Evaluation results on the validation set:

Accuracy: 86.32%
F1 Score: 70.68%
Precision: 72.31%
Recall: 69.12%

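Usage Example

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# The base model's tokenizer relies on MeCab; `pip install fugashi unidic-lite` is usually required
model_name = "b4c0n/KAi-Toxicity-Filter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "終わってる暴言"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities; index 1 is read as the toxic class
probs = torch.softmax(outputs.logits, dim=1)
toxic_prob = probs[0][1].item()

print(f"Toxic probability: {toxic_prob:.2%}")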

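Intended Use

This model was developed to detect and filter harmful content in Japanese text for KAi (KaisabaGroupAI).

Primary use cases:

Moderation of user-generated content
Safety filtering for conversational AI
Toxicity detection in Japanese social media content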

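Limitations

Specialized for short colloquial expressions; detection is limited for long texts and context-dependent toxicity
May produce misclassifications (false positives and false negatives)
Predictions may vary with cultural and regional context
May fail to detect new types of toxic expressions that are absent from the training data
Not suitable for automated censorship without human review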

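Ethical Considerations

⚠️ This model was trained on toxic content data. Please use it responsibly.

May flag legitimate expressions as toxic
Should not be used as the sole basis for content removal decisions
Regular human review is recommended
Consider freedom of expression when implementing automated filtering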

License

Apache 2.0

Acknowledgments

This model uses data from inspection-ai/japanese-toxic-dataset (Apache 2.0 License).

English

Model Description

This model classifies Japanese text as toxic or non-toxic. It is fine-tuned from tohoku-nlp/bert-base-japanese-v3 for Japanese toxicity detection tasks.

Training Data

This model was trained on:

inspection-ai/japanese-toxic-dataset (Apache 2.0)
- Source: https://github.com/inspection-ai/japanese-toxic-dataset
Custom dataset created specifically for KAi
Automatically generated hard negative samples
Automatically generated variations of toxic expressions (for class balancing)

Model Details

Base Model: tohoku-nlp/bert-base-japanese-v3
Task: Binary Text Classification (toxic/non-toxic)
Training Data: 1,899 samples (train: 1,614 / validation: 285)
Epochs: 5
Learning Rate: 2e-5 with linear decay
Training Method: Continuous-label learning (0.0-1.0) with MSE loss
Feature: Tuned for Japanese expressions through hard negative sampling
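
The card gives a peak learning rate of 2e-5 with linear decay but does not state the batch size, the number of steps, or whether warmup was used. The sketch below shows what a linear decay to zero looks like under those gaps: the batch size of 16, the absence of warmup, and the function name `linear_decay_lr` are all assumptions made for illustration.

```python
def linear_decay_lr(step: int, total_steps: int, peak_lr: float = 2e-5) -> float:
    """Learning rate at `step`, decaying linearly from `peak_lr` to 0.

    Assumes no warmup phase; the card does not say whether one was used.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    clamped = min(max(step, 0), total_steps)
    return peak_lr * (1.0 - clamped / total_steps)


# Hypothetical run length: 1,614 training samples, batch size 16 (assumed), 5 epochs.
steps_per_epoch = -(-1614 // 16)  # ceiling division
total_steps = steps_per_epoch * 5

for step in (0, total_steps // 2, total_steps):
    print(step, linear_decay_lr(step, total_steps))
```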

Performance

Evaluation results on the validation set:

Accuracy: 86.32%
F1 Score: 70.68%
Precision: 72.31%
Recall: 69.12%
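
F1 is the harmonic mean of precision and recall, so the reported figures can be cross-checked against each other. In the short sketch below, only the precision and recall values come from this card; the function name `f1_score` is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in this card, expressed as fractions.
precision = 0.7231
recall = 0.6912

f1 = f1_score(precision, recall)
print(f"F1 from reported precision/recall: {f1:.2%}")
```

The result agrees with the reported F1 of 70.68% to two decimal places.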

Usage

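from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# The base model's tokenizer relies on MeCab; `pip install fugashi unidic-lite` is usually required
model_name = "b4c0n/KAi-Toxicity-Filter"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()  # disable dropout for inference

text = "toxic expression"  # placeholder; pass Japanese text in practice
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# Gradients are not needed for inference
with torch.no_grad():
    outputs = model(**inputs)

# Convert logits to probabilities; index 1 is read as the toxic class
probs = torch.softmax(outputs.logits, dim=1)
toxic_prob = probs[0][1].item()

print(f"Toxic probability: {toxic_prob:.2%}")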
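
To make the last two steps of the example concrete, here is a plain-Python sketch of what a softmax over two logits computes. The logit values are made up for illustration, and reading index 1 as the toxic class follows the usage example above rather than any label mapping stated in the card.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a single row of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for one input: [non-toxic, toxic].
logits = [-0.4, 1.2]
probs = softmax(logits)
toxic_prob = probs[1]
print(f"Toxic probability: {toxic_prob:.2%}")
```

For two classes this is equivalent to applying a sigmoid to the difference between the two logits.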

Intended Use

This model was developed for KAi (KaisabaGroupAI) to detect and filter harmful content in Japanese text.

Primary Use Cases:

Content moderation for user-generated text
Safety filtering in conversational AI
Toxicity detection in Japanese social media content

Limitations

Optimized for short colloquial expressions; less reliable on long texts and context-dependent toxicity
May produce false positives and false negatives
Cultural and regional context may affect predictions
May fail to detect new types of toxic expressions that are absent from the training data
Not designed for automated censorship without human review
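
One common workaround for the long-text limitation is to score a document sentence by sentence and keep the highest score. The sketch below assumes a hypothetical `score_fn` that returns a toxic probability for one string (for example, a wrapper around the usage example above). The splitting rule and the max aggregation are illustrative choices, not part of this model, and they do not address context-dependent toxicity.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split text into sentences on Japanese/ASCII terminators and newlines."""
    pieces = re.findall(r"[^。！？!?\n]+[。！？!?]*", text)
    return [p.strip() for p in pieces if p.strip()]


def max_toxicity(text: str, score_fn: Callable[[str], float]) -> float:
    """Score each sentence separately and return the highest score."""
    sentences = split_sentences(text) or [text]
    return max(score_fn(s) for s in sentences)


# Stand-in scorer for demonstration only; replace with a real model call.
def dummy_score(sentence: str) -> float:
    return 0.9 if "暴言" in sentence else 0.1


long_text = "今日はいい天気ですね。\nそれは暴言です！ありがとう。"
print(split_sentences(long_text))
print(max_toxicity(long_text, dummy_score))
```

Taking the maximum means a single flagged sentence marks the whole text, which favors recall over precision; averaging would be the opposite trade-off.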

Ethical Considerations

⚠️ This model was trained on toxic content data. Please use responsibly.

The model may produce false positives affecting legitimate speech
Should not be used as the sole basis for content removal decisions
Regular human review is recommended
Consider freedom of expression when implementing automated filtering
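
The points above suggest keeping a human in the loop. A minimal sketch of one way to do that is to act automatically only on low scores and queue everything else for review. The thresholds `0.3` and `0.8` and the action names are hypothetical and would need tuning on real data.

```python
from enum import Enum


class Action(Enum):
    ALLOW = "allow"                      # publish without intervention
    REVIEW = "review"                    # publish, but queue for a human check
    HOLD_FOR_REVIEW = "hold_for_review"  # hide until a human decides


def triage(toxic_prob: float, low: float = 0.3, high: float = 0.8) -> Action:
    """Map a toxic probability to a moderation action; never deletes outright."""
    if not 0.0 <= toxic_prob <= 1.0:
        raise ValueError("toxic_prob must be in [0.0, 1.0]")
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if toxic_prob < low:
        return Action.ALLOW
    if toxic_prob < high:
        return Action.REVIEW
    return Action.HOLD_FOR_REVIEW


for p in (0.05, 0.55, 0.93):
    print(p, triage(p).value)
```

No branch removes content on the model's score alone; the strongest action only withholds a post until a person has looked at it.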

License

Apache 2.0

Citation

@misc{kai-toxicity-filter,
  author = {b4c0n},
  title = {KAi Toxicity Filter: Japanese Toxicity Detection Model},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/b4c0n/KAi-Toxicity-Filter}}
}

Acknowledgments

This model uses data from inspection-ai/japanese-toxic-dataset (Apache 2.0 License).


Model size: 0.1B params (Safetensors, tensor type F32)

