	---
	language:
	- ja
	license: apache-2.0
	base_model: tohoku-nlp/bert-base-japanese-v3
	library_name: transformers
	pipeline_tag: text-classification
	tags:
	- text-classification
	- toxicity-detection
	- japanese
	- bert
	- fine-tuned
	datasets:
	- inspection-ai/japanese-toxic-dataset
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	model-index:
	- name: KAi-Toxicity-Filter
	  results:
	  - task:
	      type: text-classification
	      name: Text Classification
	    dataset:
	      name: japanese-toxic-dataset
	      type: inspection-ai/japanese-toxic-dataset
	      split: validation
	    metrics:
	    - type: accuracy
	      value: 0.8632
	      name: Accuracy
	    - type: f1
	      value: 0.7068
	      name: F1 Score
	    - type: precision
	      value: 0.7231
	      name: Precision
	    - type: recall
	      value: 0.6912
	      name: Recall
	---

	# KAi Toxicity Filter

	日本語の有害表現検出に特化したモデル
	A model specialized in detecting toxic expressions in Japanese text

	---

	## 日本語版

	### モデル概要

	日本語テキストを有害/非有害に分類するモデルです。このモデルは`tohoku-nlp/bert-base-japanese-v3`をベースに、日本語の有害表現検出タスクでファインチューニングされています。

	### 学習データ

	以下のデータで学習されています：

	- inspection-ai/japanese-toxic-dataset (Apache 2.0)
	  - 出典: https://github.com/inspection-ai/japanese-toxic-dataset
	- KAi専用カスタムデータセット
	- 自動生成されたハードネガティブサンプル
	- 自動生成された有害表現バリエーション（バランス調整用）

	### モデル詳細

	- ベースモデル: tohoku-nlp/bert-base-japanese-v3
	- タスク: 二値分類（有害/非有害）
	- 学習手法: 連続値ラベル学習（0.0〜1.0）+ MSE Loss
	- データ数: 1,899サンプル（訓練: 1,614 / 検証: 285）
	- エポック数: 5
	- 学習率: 2e-5（線形減衰）
	- 特徴: ハードネガティブサンプリングによる日本語表現の最適化

	### 性能

	検証データ（285サンプル）での評価結果:

	- Accuracy: 86.32%
	- F1 Score: 70.68%
	- Precision: 72.31%
	- Recall: 69.12%

	### 使用例
	```python
	# 必要なパッケージ: pip install transformers torch fugashi unidic-lite
	# （ベースモデルのトークナイザーは fugashi + unidic-lite による MeCab 分かち書きを使用）
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	
	model_name = "b4c0n/KAi-Toxicity-Filter"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()  # 推論モード（Dropout を無効化）
	
	text = "終わってる暴言"
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	
	with torch.no_grad():  # 推論時は勾配計算が不要
	    outputs = model(**inputs)
	
	# 2クラスのロジットに softmax を適用（インデックス1が「有害」クラス）
	probs = torch.softmax(outputs.logits, dim=-1)
	toxic_prob = probs[0][1].item()
	
	print(f"有害確率: {toxic_prob:.2%}")
	```

	### 使用目的

	KAi (かい鯖グループAI) における日本語テキストの有害コンテンツ検出・フィルタリングのために開発されました。

	主な用途:
	- ユーザー生成コンテンツのモデレーション
	- 対話型AIの安全性フィルタリング
	- 日本語ソーシャルメディアコンテンツの有害性検出

	### 制限事項

	- 短い口語表現に特化しており、長文や文脈依存の有害性検出には限界があります
	- 誤検出（偽陽性/偽陰性）の可能性があります
	- 文化的・地域的文脈により判定が変わる可能性があります
	- 訓練データに含まれない新しいタイプの有害表現は検出できない場合があります
	- 人間のレビューなしの自動検閲には適していません

	### 倫理的配慮

	⚠️ このモデルは有害コンテンツデータで学習されています。責任を持って使用してください。

	- 正当な表現を誤検出する可能性があります
	- コンテンツ削除の唯一の判断基準として使用すべきではありません
	- 定期的な人間によるレビューを推奨します
	- 自動フィルタリング実装時は表現の自由を考慮してください

	### ライセンス

	Apache 2.0

	### 謝辞

	このモデルは [inspection-ai/japanese-toxic-dataset](https://github.com/inspection-ai/japanese-toxic-dataset) (Apache 2.0 License) のデータを使用しています。

	---

	## English

	### Model Description

	This model classifies Japanese text as toxic or non-toxic. It is fine-tuned from `tohoku-nlp/bert-base-japanese-v3` for Japanese toxicity detection.

	### Training Data

	This model was trained on:

	- inspection-ai/japanese-toxic-dataset (Apache 2.0)
	  - Source: https://github.com/inspection-ai/japanese-toxic-dataset
	- Custom dataset created specifically for KAi
	- Automatically generated hard negative samples
	- Automatically generated toxic variations (for class balancing)

	### Model Details

	- Base Model: tohoku-nlp/bert-base-japanese-v3
	- Task: Binary text classification (toxic / non-toxic)
	- Training Objective: Continuous labels (0.0 to 1.0) with MSE loss
	- Dataset Size: 1,899 samples (train: 1,614 / validation: 285)
	- Epochs: 5
	- Learning Rate: 2e-5 with linear decay
	- Special Feature: Hard negative sampling (non-toxic examples that resemble toxic ones) to better separate benign Japanese expressions from toxic ones
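The card states only that labels are continuous values in [0.0, 1.0], that the loss is MSE, and that training ran for 5 epochs at 2e-5 with linear decay. The sketch below is one plausible reading of that setup, not the authors' training code. It assumes the two-logit head implied by the usage snippet and applies MSE to the softmax toxic probability (the card does not say whether MSE is taken on probabilities or raw outputs). The schedule uses the same formula as `transformers.get_linear_schedule_with_warmup` with zero warmup; the warmup length and batch size are undocumented, so `BATCH_SIZE = 16` is a hypothetical value.

```python
import math

# Hyperparameters stated in the model card
BASE_LR = 2e-5
EPOCHS = 5
TRAIN_SAMPLES = 1_614

# Hypothetical: the batch size is not documented in the card
BATCH_SIZE = 16


def toxic_probability(logits: tuple[float, float]) -> float:
    """Softmax over [non-toxic, toxic] logits; returns the toxic-class probability."""
    z0, z1 = logits
    m = max(z0, z1)  # subtract the max for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


def soft_label_mse(batch_logits: list[tuple[float, float]], targets: list[float]) -> float:
    """Mean squared error between predicted toxic probability and a label in [0, 1].

    Assumption: MSE is applied to the softmax probability, not to raw logits.
    """
    errors = [(toxic_probability(lg) - y) ** 2 for lg, y in zip(batch_logits, targets)]
    return sum(errors) / len(errors)


def linear_decay_lr(step: int, total_steps: int, base_lr: float = BASE_LR) -> float:
    """Linear decay from base_lr to 0 over total_steps (no warmup assumed)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # depends on the hypothetical batch size
total_steps = steps_per_epoch * EPOCHS

if __name__ == "__main__":
    # A soft label such as 0.7 marks text as "fairly toxic" instead of forcing 0 or 1
    print(soft_label_mse([(0.0, 0.0), (-2.0, 2.0)], [0.7, 1.0]))
    print(linear_decay_lr(0, total_steps), linear_decay_lr(total_steps // 2, total_steps))
```

Soft labels let graded judgments (for example, mildly rude versus clearly abusive) contribute partial targets, which a hard 0/1 cross-entropy setup would collapse.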

	### Performance

	Evaluation results on the validation split (285 samples):

	- Accuracy: 86.32%
	- F1 Score: 70.68%
	- Precision: 72.31%
	- Recall: 69.12%
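The four figures are mutually consistent: F1 is the harmonic mean of precision and recall. Assuming the metrics refer to the toxic class over all 285 validation samples, the confusion-matrix counts below (TP = 47, FP = 18, FN = 21, TN = 199) reproduce all four rounded values. These counts are back-calculated for illustration, not published in the card. If they are right, about 24% of validation samples are toxic, which helps explain why accuracy sits well above F1.

```python
# Hypothetical confusion-matrix counts back-calculated from the reported metrics
# (not published in the card); positive class = toxic.
TP, FP, FN, TN = 47, 18, 21, 199


def metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Standard binary-classification metrics for the positive (toxic) class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "f1": f1, "precision": precision, "recall": recall}


if __name__ == "__main__":
    for name, value in metrics(TP, FP, FN, TN).items():
        print(f"{name}: {value:.4f}")
    # Prints 0.8632, 0.7068, 0.7231, 0.6912, matching the reported values
```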

	### Usage
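	```python
	# Requires: pip install transformers torch fugashi unidic-lite
	# (the base model's tokenizer uses MeCab via fugashi with the unidic-lite dictionary)
	from transformers import AutoTokenizer, AutoModelForSequenceClassification
	import torch
	
	model_name = "b4c0n/KAi-Toxicity-Filter"
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	model = AutoModelForSequenceClassification.from_pretrained(model_name)
	model.eval()  # inference mode (disables dropout)
	
	text = "終わってる暴言"  # the model expects Japanese input
	inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
	
	with torch.no_grad():  # gradients are not needed at inference time
	    outputs = model(**inputs)
	
	# Softmax over the two class logits; index 1 is the toxic class
	probs = torch.softmax(outputs.logits, dim=-1)
	toxic_prob = probs[0][1].item()
	
	print(f"Toxic probability: {toxic_prob:.2%}")
	```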
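Because the classifier head emits two logits, the softmax in the snippet above reduces to a logistic sigmoid of their difference: p_toxic = 1 / (1 + exp(-(z1 - z0))). The standalone check below uses plain Python instead of PyTorch and made-up logit values to show that the two forms agree, so the single margin z1 - z0 is enough if you want to log or threshold scores. Like the snippet, it assumes index 1 is the toxic class.

```python
import math


def softmax_toxic(z0: float, z1: float) -> float:
    """Toxic probability via two-class softmax (index 1 = toxic, as in the usage snippet)."""
    m = max(z0, z1)  # shift by the max to avoid overflow in exp
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


def sigmoid_margin(z0: float, z1: float) -> float:
    """Equivalent form: logistic sigmoid of the logit margin z1 - z0."""
    margin = z1 - z0
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    # Rewrite for negative margins so exp() never overflows
    e = math.exp(margin)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Illustrative logits only; real values come from outputs.logits[0]
    for z0, z1 in [(2.1, -1.3), (0.0, 0.0), (-0.4, 3.2)]:
        print(f"{softmax_toxic(z0, z1):.6f}  {sigmoid_margin(z0, z1):.6f}")
```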

	### Intended Use

	This model was developed for KAi (KaisabaGroupAI) to detect and filter harmful content in Japanese text.

	Primary Use Cases:
	- Content moderation for user-generated text
	- Safety filtering in conversational AI
	- Toxicity detection in Japanese social media content

	### Limitations

	- Optimized for short colloquial expressions; performance may be limited on long texts or context-dependent toxicity
	- May produce false positives and false negatives
	- Cultural and regional context may affect predictions
	- May fail to detect new types of toxic expressions not present in the training data
	- Not suitable for automated censorship without human review

	### Ethical Considerations

	⚠️ This model was trained on toxic content data. Please use responsibly.

	- The model may produce false positives affecting legitimate speech
	- Should not be used as the sole decision-maker for content removal
	- Regular human review is recommended
	- Consider freedom of expression when implementing automated filtering
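One way to follow this guidance in a moderation pipeline is to let the score route content to reviewers rather than delete it. The sketch below is a hypothetical policy, not something the card prescribes: `REVIEW_THRESHOLD`, `PRIORITY_THRESHOLD`, `Route`, and `route_content` are illustrative names and values to be tuned on your own labeled data, and `toxic_prob` is assumed to be the probability computed in the Usage snippet. No branch removes content automatically; a high score only raises review priority.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical thresholds; tune them on your own labeled data
REVIEW_THRESHOLD = 0.5
PRIORITY_THRESHOLD = 0.9


class Route(Enum):
    PUBLISH = "publish"                  # low score: no action
    HUMAN_REVIEW = "human_review"        # queue for a moderator
    PRIORITY_REVIEW = "priority_review"  # front of the queue; a human still decides


@dataclass
class Decision:
    text: str
    toxic_prob: float
    route: Route


def route_content(text: str, toxic_prob: float) -> Decision:
    """Map a toxicity probability to a review route; never removes content on its own."""
    if not 0.0 <= toxic_prob <= 1.0:
        raise ValueError(f"toxic_prob must be in [0, 1], got {toxic_prob}")
    if toxic_prob >= PRIORITY_THRESHOLD:
        route = Route.PRIORITY_REVIEW
    elif toxic_prob >= REVIEW_THRESHOLD:
        route = Route.HUMAN_REVIEW
    else:
        route = Route.PUBLISH
    return Decision(text, toxic_prob, route)


if __name__ == "__main__":
    # Scores are made up for illustration
    for text, p in [("おはようございます", 0.03), ("微妙な表現", 0.62), ("終わってる暴言", 0.97)]:
        d = route_content(text, p)
        print(f"{d.route.value:16s} {d.toxic_prob:.2f} {d.text}")
```

With a reported recall of 69.12%, roughly 31% of toxic validation samples were missed, so a PUBLISH route is not a guarantee of safety; periodically spot-checking low-score content fits the recommendation for regular human review.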

	### License

	Apache 2.0

	### Citation
	```bibtex
	@misc{kai-toxicity-filter,
	author = {b4c0n},
	title = {KAi Toxicity Filter: Japanese Toxicity Detection Model},
	year = {2025},
	publisher = {Hugging Face},
	howpublished = {\url{https://huggingface.co/b4c0n/KAi-Toxicity-Filter}}
	}
	```

	### Acknowledgments

	This model uses data from [inspection-ai/japanese-toxic-dataset](https://github.com/inspection-ai/japanese-toxic-dataset) (Apache 2.0 License).