	# Floxoris Harmony v1.2

	Floxoris Harmony v1.2 is a lightweight binary moderation model for fast toxicity detection in Russian and Ukrainian text.

This version continues fine-tuning from Floxoris Harmony v1.1. It focuses on reducing false positives on positive slang, compliments, and short praise phrases while keeping strong detection of insults and rude commands.

	Harmony v1.2 is designed for practical moderation systems where speed, low cost, and simple deployment matter.

	## What Is New In v1.2

	Harmony v1.2 improves the behavior of v1.1 in cases where friendly slang or praise could be incorrectly classified as toxic.

	Main focus:

	- safer handling of positive slang
	- fewer false positives on compliments
	- improved distinction between insult patterns and praise patterns
- more reliable `safe` predictions for praise such as `харош` ("nice one"), `ну ты харош` ("well, you're good"), and `ты красавчик` ("you're a star")
	- continued support for Russian and Ukrainian moderation
	- same lightweight binary output: `safe` / `toxic`

	This release is mainly an anti-false-positive patch for short praise and casual chat slang.

	## Model Task

	The model performs binary text classification:

| Class | Label |
|---|---|
| `0` | `safe` |
| `1` | `toxic` |

	The model answers:

	> Is this message safe or toxic?
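The answer comes from a softmax over the model's two logits followed by a threshold on the toxic probability, the same post-processing the Python usage example in this card performs. The sketch below reproduces that step in plain Python without PyTorch, assuming the logit order `[safe, toxic]` from the table above. The names `ID2LABEL`, `scores_from_logits`, and `label_from_logits` are illustrative and not part of the model's published interface; the model's own config may use different label names.

```python
import math

# Class indices from the table above (assumed logit order: [safe, toxic]).
ID2LABEL = {0: "safe", 1: "toxic"}


def scores_from_logits(logits: list[float]) -> tuple[float, float]:
    """Softmax over the two output logits; returns (safe_score, toxic_score)."""
    if len(logits) != 2:
        raise ValueError("expected exactly two logits: [safe, toxic]")
    # Subtract the largest logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return exps[0] / total, exps[1] / total


def label_from_logits(logits: list[float], threshold: float = 0.65) -> str:
    """Return 'toxic' only when the toxic probability reaches the threshold."""
    _, toxic_score = scores_from_logits(logits)
    return ID2LABEL[1] if toxic_score >= threshold else ID2LABEL[0]
```

Because the threshold is `0.65` rather than `0.5`, this is stricter than taking the argmax: a message whose toxic probability is `0.6` is still labeled `safe`.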

	## Intended Use

	Floxoris Harmony v1.2 is suitable for:

	- Telegram bot moderation
	- chat message filtering
	- community moderation tools
	- AI assistant safety checks
	- lightweight moderation APIs
	- first-stage toxicity detection
	- Russian/Ukrainian text moderation

	It works best as a fast first-pass classifier before more complex moderation logic.

	## Example Usage

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "floxoris/harmony-v1.2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()  # inference mode (from_pretrained already does this; explicit for clarity)

text = "ну ты харош"  # "well, you're good" (friendly praise)

inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128,
)

with torch.no_grad():
    outputs = model(**inputs)
    # Softmax over the two logits: index 0 = safe, index 1 = toxic.
    probs = torch.softmax(outputs.logits, dim=-1)[0]

safe_score = probs[0].item()
toxic_score = probs[1].item()

# Label as toxic only when the toxic probability reaches the threshold.
threshold = 0.65
label = "toxic" if toxic_score >= threshold else "safe"

print({
    "label": label,
    "safe_score": round(safe_score, 4),
    "toxic_score": round(toxic_score, 4),
})
```

	## Recommended Threshold

	Suggested default:

	```python
	TOXIC_THRESHOLD = 0.65
	```

	Suggested behavior:

| Toxic score | Action |
|---|---|
| `< 0.65` | allow |
| `0.65` to `< 0.88` | warn / review |
| `≥ 0.88` | delete / block |
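The score bands above can be expressed as a small lookup function. This is a minimal sketch: the action names `allow`, `review`, and `block` and the constant names are hypothetical, and the boundaries are written as half-open intervals so that every score, including values such as `0.645`, maps to exactly one action.

```python
# Band boundaries from the table above; treat them as tunable defaults.
REVIEW_THRESHOLD = 0.65  # same value as TOXIC_THRESHOLD
BLOCK_THRESHOLD = 0.88


def action_for(toxic_score: float) -> str:
    """Map a toxic probability in [0, 1] to a hypothetical moderation action."""
    if not 0.0 <= toxic_score <= 1.0:
        raise ValueError(f"toxic_score must be in [0, 1], got {toxic_score}")
    if toxic_score >= BLOCK_THRESHOLD:
        return "block"
    if toxic_score >= REVIEW_THRESHOLD:
        return "review"
    return "allow"
```

For example, the `0.82` toxic score in the API-style example in this card would map to `review`, not `block`.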

	## Training Focus

	Harmony v1.2 was fine-tuned from:

	```text
	floxoris/harmony-v1.1
	```

	The main training focus was:

	- positive slang
	- friendly short praise
	- safe casual phrases
	- Russian compliments
	- Ukrainian compliments
	- safe context examples
	- mild toxic phrases
	- toxic vs safe phrase contrast

	Examples of safe praise targeted in v1.2:

	```text
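харош                    ("nice one"; slang spelling of хорош)
хорош                    ("good one", "nice")
ну ты харош              ("well, you're good")
капец ты харош           ("damn, you're good")
ты красавчик             ("you're a star", lit. "handsome guy")
ты красава               ("you're awesome", slang)
ты молодец               ("well done")
ты лучший                ("you're the best")
ты мощный                ("you're impressive", lit. "powerful")
ты гений                 ("you're a genius")
ти молодець              (Ukrainian: "well done")
ти красень               (Ukrainian: "you're a star", lit. "handsome")
ти крутий                (Ukrainian: "you're cool")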
	```

	Examples of toxic phrases that should remain toxic:

	```text
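заткнись                 ("shut up")
отвали                   ("get lost")
закрой рот               ("shut your mouth")
ты тупой                 ("you're stupid")
ты дурак                 ("you're a fool")
ну ты даун               ("you're such a moron"; ableist slur)
замовкни                 (Ukrainian: "shut up")
відвали                  (Ukrainian: "get lost")
ти тупий                 (Ukrainian: "you're stupid")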
	```

	Examples of safe contrast phrases:

	```text
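закрой окно пожалуйста   ("close the window, please")
пошёл в магазин          ("went to the store")
тупой угол в геометрии   ("obtuse angle in geometry"; тупой also means "stupid")
не пиши пароль сюда      ("don't type your password here")
рот болит после стоматолога   ("my mouth hurts after the dentist")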
	```
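The card does not document the training data format. As a hypothetical illustration, the examples above can be stored as text/label records using the class indices from the Model Task section (`0` = safe, `1` = toxic) and serialized as JSON Lines, a common input format for fine-tuning and regression sets. Only a subset of each list is included, and the record layout is an assumption.

```python
import json

# Subsets of the example lists above; labels use 0 = safe, 1 = toxic.
SAFE_PRAISE = ["харош", "ну ты харош", "ты красавчик", "ти молодець"]
TOXIC = ["заткнись", "закрой рот", "ти тупий"]
SAFE_CONTRAST = ["закрой окно пожалуйста", "тупой угол в геометрии"]


def build_records() -> list[dict]:
    """Combine the example lists into labeled records (hypothetical layout)."""
    records = [{"text": t, "label": 0} for t in SAFE_PRAISE + SAFE_CONTRAST]
    records += [{"text": t, "label": 1} for t in TOXIC]
    return records


def to_jsonl(records: list[dict]) -> str:
    # ensure_ascii=False keeps Cyrillic readable instead of \uXXXX escapes.
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Keeping contrast pairs such as `закрой рот` / `закрой окно пожалуйста` in the same file makes it easy to check that a shared word (`закрой`, "close") does not by itself drive the label.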

	## Difference Between Versions

	\| Version \| Focus \|
	\|---\|---\|
	\| `harmony-v1` \| base lightweight toxicity classifier \|
	\| `harmony-v1.1` \| improved mild toxicity detection \|
	\| `harmony-v1.2` \| reduced false positives on praise and positive slang \|

	## Example Behavior

	Expected behavior:

	```text
	"ну ты харош" → safe
	"капец ты харош" → safe
	"ты красавчик" → safe
	"закрой окно пожалуйста" → safe
	"закрой рот" → toxic
	"ну ты даун" → toxic
	"заткнись" → toxic
	```
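The expected behavior above can double as a small regression check when comparing versions. This sketch assumes a hypothetical `classify(text) -> str` callable that returns `"safe"` or `"toxic"`, for example a wrapper around the usage code earlier in this card. It returns the mismatches instead of raising, so all failures are reported at once.

```python
from typing import Callable

# Expected labels copied from the list above.
EXPECTED = {
    "ну ты харош": "safe",
    "капец ты харош": "safe",
    "ты красавчик": "safe",
    "закрой окно пожалуйста": "safe",
    "закрой рот": "toxic",
    "ну ты даун": "toxic",
    "заткнись": "toxic",
}


def check_behavior(classify: Callable[[str], str]) -> list[tuple[str, str, str]]:
    """Return (text, expected, got) for every case the classifier gets wrong."""
    failures = []
    for text, expected in EXPECTED.items():
        got = classify(text)
        if got != expected:
            failures.append((text, expected, got))
    return failures
```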

	## API-Style Output Example

```json
{
  "model": "floxoris/harmony-v1.2",
  "text": "закрой рот",
  "label": "toxic",
  "class": 1,
  "safe_score": 0.18,
  "toxic_score": 0.82,
  "threshold": 0.65,
  "latency_ms": 3.1
}
```
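A response in this shape could be assembled as follows. This is a sketch under assumptions: the field names are copied from the JSON example, `score_fn` is a hypothetical callable returning `(safe_score, toxic_score)`, and the card does not say how `latency_ms` is measured, so here it is simply the wall-clock time of the scoring call.

```python
import time
from typing import Callable

MODEL_ID = "floxoris/harmony-v1.2"


def build_response(text: str, safe_score: float, toxic_score: float,
                   threshold: float = 0.65, latency_ms: float = 0.0) -> dict:
    """Build an API-style result dict with the fields shown in the example."""
    is_toxic = toxic_score >= threshold
    return {
        "model": MODEL_ID,
        "text": text,
        "label": "toxic" if is_toxic else "safe",
        "class": 1 if is_toxic else 0,
        "safe_score": round(safe_score, 4),
        "toxic_score": round(toxic_score, 4),
        "threshold": threshold,
        "latency_ms": round(latency_ms, 1),
    }


def timed_response(text: str,
                   score_fn: Callable[[str], tuple[float, float]],
                   threshold: float = 0.65) -> dict:
    """Time a hypothetical scoring call and wrap its output in a response."""
    start = time.perf_counter()
    safe_score, toxic_score = score_fn(text)
    latency_ms = (time.perf_counter() - start) * 1000
    return build_response(text, safe_score, toxic_score, threshold, latency_ms)
```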

	## Limitations

	- This is a binary classifier and does not separate toxicity categories.
	- It does not classify hate speech, threats, profanity, spam, or harassment separately.
	- It may still miss sarcasm, coded abuse, irony, or context-dependent toxicity.
	- It may produce false positives on unusual slang or jokes.
	- It may produce false negatives on creative insults.
	- Very short messages can be ambiguous.
	- Performance outside Russian and Ukrainian may be weaker.
	- It should not be the only moderation layer for high-stakes systems.

	Human review is recommended for important moderation decisions.

	## Recommended Production Setup

	For real-world moderation:

	1. Run Harmony v1.2 on each message.
2. Apply a decision threshold (recommended default: `0.65`).
3. Optionally, treat scores from `0.50` up to `0.65` as uncertain.
	4. Log false positives and false negatives.
	5. Periodically fine-tune future versions on real moderation mistakes.
	6. Combine with rule-based checks for extreme cases if necessary.
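The steps above can be sketched as a decision function plus a mistake log. Everything here is illustrative: `BLOCKED_SUBSTRINGS` stands in for whatever rule-based checks a deployment uses (step 6), the uncertain band is the optional `0.50` to `0.65` range from step 3, and the JSONL log of moderator corrections (steps 4 and 5) is an assumed format, not one the model requires.

```python
import json
from dataclasses import dataclass, asdict

THRESHOLD = 0.65      # step 2: recommended default
UNCERTAIN_LOW = 0.50  # step 3: lower edge of the optional uncertain band

# Step 6: hypothetical rule list; a real deployment would maintain its own.
BLOCKED_SUBSTRINGS = ("example-banned-term",)


@dataclass
class Decision:
    text: str
    toxic_score: float
    label: str       # "safe" or "toxic"
    uncertain: bool  # True when UNCERTAIN_LOW <= score < THRESHOLD
    rule_hit: bool   # True when a rule-based check forced "toxic"


def decide(text: str, toxic_score: float) -> Decision:
    """Combine the model score with simple substring rules."""
    rule_hit = any(s in text.lower() for s in BLOCKED_SUBSTRINGS)
    label = "toxic" if rule_hit or toxic_score >= THRESHOLD else "safe"
    uncertain = not rule_hit and UNCERTAIN_LOW <= toxic_score < THRESHOLD
    return Decision(text, toxic_score, label, uncertain, rule_hit)


def log_mistake(decision: Decision, correct_label: str, path: str) -> None:
    """Steps 4-5: append a moderator correction as one JSON line."""
    if correct_label == decision.label:
        return  # the model was right; nothing to log
    record = asdict(decision)
    record["correct_label"] = correct_label
    record["error_type"] = (
        "false_positive" if decision.label == "toxic" else "false_negative"
    )
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Recording the error type alongside the text makes it easy to pull out false positives, such as praise flagged as toxic, which is the kind of mistake v1.2 was tuned to reduce.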

	## Why This Model

	Floxoris Harmony v1.2 is built for projects that need:

	- fast inference
	- low hosting cost
	- compact model size
	- simple API deployment
	- Russian/Ukrainian moderation
	- practical binary toxicity detection
	- better behavior on casual slang

	It is especially useful for Telegram bots, small communities, AI tools, and lightweight moderation APIs.

	## Summary

	Floxoris Harmony v1.2 is a compact Russian/Ukrainian moderation model focused on fast binary toxicity detection.

	Compared to v1.1, this version is less likely to mark friendly praise like `харош`, `ну ты харош`, or `ты красавчик` as toxic, while preserving detection of direct insults and rude commands.