Floxoris Harmony v1.2

Floxoris Harmony v1.2 is a lightweight binary moderation model for fast toxicity detection in Russian and Ukrainian text.

This version continues fine-tuning from Floxoris Harmony v1.1. It focuses on reducing false positives on positive slang, compliments, and short praise phrases while preserving strong detection of insults and rude commands.

Harmony v1.2 is designed for practical moderation systems where speed, low cost, and simple deployment matter.

What Is New In v1.2

Harmony v1.2 improves the behavior of v1.1 in cases where friendly slang or praise could be incorrectly classified as toxic.

Main focus:

safer handling of positive slang
fewer false positives on compliments
improved distinction between insult patterns and praise patterns
stronger handling of phrases like харош, ну ты харош, ты красавчик
continued support for Russian and Ukrainian moderation
same lightweight binary output: safe / toxic

This release is mainly an anti-false-positive patch for short praise and casual chat slang.

Model Task

The model performs binary text classification:

Class	Label
`0`	`safe`
`1`	`toxic`

The model answers:

Is this message safe or toxic?

Intended Use

Floxoris Harmony v1.2 is suitable for:

Telegram bot moderation
chat message filtering
community moderation tools
AI assistant safety checks
lightweight moderation APIs
first-stage toxicity detection
Russian/Ukrainian text moderation

It works best as a fast first-pass classifier before more complex moderation logic.

Example Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "floxoris/harmony-v1.2"

# Load the tokenizer and the two-class sequence classifier.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

text = "ну ты харош"

# Tokenize a single message; inputs longer than 128 tokens are truncated.
inputs = tokenizer(
    text,
    return_tensors="pt",
    truncation=True,
    padding=True,
    max_length=128
)

# Run the forward pass without gradients and turn logits into probabilities.
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)[0]

# Index 0 is the safe class, index 1 is the toxic class.
safe_score = probs[0].item()
toxic_score = probs[1].item()

# Label the message toxic only when the toxic score reaches the threshold.
threshold = 0.65
label = "toxic" if toxic_score >= threshold else "safe"

print({
    "label": label,
    "safe_score": round(safe_score, 4),
    "toxic_score": round(toxic_score, 4)
})

Recommended Threshold

Suggested default:

TOXIC_THRESHOLD = 0.65

Suggested behavior:

Toxic score	Action
`< 0.65`	allow
`≥ 0.65 and < 0.88`	warn / review
`≥ 0.88`	delete / block
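
The table above can be applied as a small lookup on the toxic score. The following is a minimal sketch, not part of the model's API: the function name `action_for_score` and the action strings are assumptions chosen for illustration, and the two cut-offs are taken from the table.

```python
# Cut-offs taken from the suggested-behavior table.
REVIEW_FROM = 0.65  # toxic scores at or above this are warned / reviewed
BLOCK_FROM = 0.88   # toxic scores at or above this are deleted / blocked


def action_for_score(toxic_score: float) -> str:
    """Map a toxic probability to a moderation action (hypothetical helper)."""
    if not 0.0 <= toxic_score <= 1.0:
        raise ValueError(f"toxic_score must be in [0, 1], got {toxic_score}")
    if toxic_score >= BLOCK_FROM:
        return "block"
    if toxic_score >= REVIEW_FROM:
        return "review"
    return "allow"
```

Comparing against the lower bound of each band, rather than matching two-decimal ranges, means that a score such as 0.645 falls cleanly into "allow" instead of between bands.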

Training Focus

Harmony v1.2 was fine-tuned from:

floxoris/harmony-v1.1

The main training focus was:

positive slang
friendly short praise
safe casual phrases
Russian compliments
Ukrainian compliments
safe context examples
mild toxic phrases
toxic vs safe phrase contrast

Examples of safe praise targeted in v1.2:

харош
хорош
ну ты харош
капец ты харош
ты красавчик
ты красава
ты молодец
ты лучший
ты мощный
ты гений
ти молодець
ти красень
ти крутий

Examples of toxic phrases that should remain toxic:

заткнись
отвали
закрой рот
ты тупой
ты дурак
ну ты даун
замовкни
відвали
ти тупий

Examples of safe contrast phrases:

закрой окно пожалуйста
пошёл в магазин
тупой угол в геометрии
не пиши пароль сюда
рот болит после стоматолога

Difference Between Versions

Version	Focus
`harmony-v1`	base lightweight toxicity classifier
`harmony-v1.1`	improved mild toxicity detection
`harmony-v1.2`	reduced false positives on praise and positive slang

Example Behavior

Expected behavior:

"ну ты харош"             → safe
"капец ты харош"          → safe
"ты красавчик"            → safe
"закрой окно пожалуйста"  → safe
"закрой рот"              → toxic
"ну ты даун"              → toxic
"заткнись"                → toxic
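
The expected labels above can double as a small regression check when thresholds or model versions change. The sketch below assumes a hypothetical `classify` callable that takes a message and returns `"safe"` or `"toxic"` (for example, a wrapper around the usage code in this README); it is not something the model ships with.

```python
from typing import Callable, Dict, List, Tuple

# Expected labels copied from the list above.
EXPECTED: Dict[str, str] = {
    "ну ты харош": "safe",
    "капец ты харош": "safe",
    "ты красавчик": "safe",
    "закрой окно пожалуйста": "safe",
    "закрой рот": "toxic",
    "ну ты даун": "toxic",
    "заткнись": "toxic",
}


def find_regressions(
    classify: Callable[[str], str],
    expected: Dict[str, str] = EXPECTED,
) -> List[Tuple[str, str, str]]:
    """Return (text, expected_label, actual_label) for every mismatch."""
    failures = []
    for text, want in expected.items():
        got = classify(text)
        if got != want:
            failures.append((text, want, got))
    return failures
```

An empty result means every phrase received its expected label; any returned tuple points at a phrase worth inspecting before deployment.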

API-Style Output Example

{
  "model": "floxoris/harmony-v1.2",
  "text": "закрой рот",
  "label": "toxic",
  "class": 1,
  "safe_score": 0.18,
  "toxic_score": 0.82,
  "threshold": 0.65,
  "latency_ms": 3.1
}
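
A response of this shape could be assembled from the two scores roughly as follows. This is an illustrative sketch: `build_response` is a hypothetical helper, and how `latency_ms` is measured is left to the caller.

```python
import json


def build_response(text, safe_score, toxic_score, threshold=0.65,
                   latency_ms=None, model="floxoris/harmony-v1.2"):
    """Build an API-style payload from the model's two class probabilities."""
    is_toxic = toxic_score >= threshold
    response = {
        "model": model,
        "text": text,
        "label": "toxic" if is_toxic else "safe",
        "class": 1 if is_toxic else 0,
        "safe_score": round(safe_score, 4),
        "toxic_score": round(toxic_score, 4),
        "threshold": threshold,
    }
    # Latency is optional; include it only when the caller measured it.
    if latency_ms is not None:
        response["latency_ms"] = round(latency_ms, 1)
    return response


payload = build_response("закрой рот", 0.18, 0.82, latency_ms=3.1)
# ensure_ascii=False keeps Cyrillic text readable in the JSON output.
print(json.dumps(payload, ensure_ascii=False, indent=2))
```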

Limitations

This is a binary classifier and does not separate toxicity categories.
It does not classify hate speech, threats, profanity, spam, or harassment separately.
It may still miss sarcasm, coded abuse, irony, or context-dependent toxicity.
It may produce false positives on unusual slang or jokes.
It may produce false negatives on creative insults.
Very short messages can be ambiguous.
Performance outside Russian and Ukrainian may be weaker.
It should not be the only moderation layer for high-stakes systems.

Human review is recommended for important moderation decisions.

Recommended Production Setup

For real-world moderation:

Run Harmony v1.2 on each message.
Apply a threshold to the toxic score (recommended default: 0.65).
Optionally treat toxic scores from 0.50 up to the threshold as uncertain and route them to review.
Log false positives and false negatives.
Periodically fine-tune future versions on real moderation mistakes.
Combine with rule-based checks for extreme cases if necessary.
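
The steps above can be sketched as a small decision function. Everything named here is an assumption for illustration: `score_fn` stands in for a call to the model that returns the toxic score, `hard_rules` is a hypothetical list of substrings that always count as toxic, and the moderator's verdict is stored so that disagreements can be collected for later fine-tuning.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional

TOXIC_THRESHOLD = 0.65  # recommended default from this README
UNCERTAIN_FROM = 0.50   # lower edge of the optional uncertain band


@dataclass
class Decision:
    text: str
    toxic_score: float
    verdict: str                         # "safe", "uncertain" or "toxic"
    human_verdict: Optional[str] = None  # filled in later by a moderator


def moderate(text: str, score_fn: Callable[[str], float],
             hard_rules: Iterable[str] = ()) -> Decision:
    """Score one message and combine the score with rule-based checks."""
    score = score_fn(text)
    lowered = text.lower()
    if any(rule in lowered for rule in hard_rules):
        verdict = "toxic"      # rule-based override for extreme cases
    elif score >= TOXIC_THRESHOLD:
        verdict = "toxic"
    elif score >= UNCERTAIN_FROM:
        verdict = "uncertain"  # candidate for human review
    else:
        verdict = "safe"
    return Decision(text, score, verdict)


def disagreements(log: List[Decision]) -> List[Decision]:
    """Decisions where a moderator's verdict differs from the automatic one."""
    return [d for d in log
            if d.human_verdict is not None and d.human_verdict != d.verdict]
```

Keeping every `Decision` in a log and reviewing the output of `disagreements` periodically gives a concrete source of false positives and false negatives for future fine-tuning.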

Why This Model

Floxoris Harmony v1.2 is built for projects that need:

fast inference
low hosting cost
compact model size
simple API deployment
Russian/Ukrainian moderation
practical binary toxicity detection
better behavior on casual slang

It is especially useful for Telegram bots, small communities, AI tools, and lightweight moderation APIs.

Summary

Floxoris Harmony v1.2 is a compact Russian/Ukrainian moderation model focused on fast binary toxicity detection.

Compared to v1.1, this version is less likely to mark friendly praise like харош, ну ты харош, or ты красавчик as toxic, while preserving detection of direct insults and rude commands.