---

language:
- he
license: cc-by-sa-4.0
tags:
- text-classification
- profanity-detection
- toxicity
- hebrew
- bert
- alephbert
library_name: transformers
base_model: onlplab/alephbert-base
pipeline_tag: text-classification
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---


# OpenCensor-H1-Mini

**OpenCensor-H1-Mini** is a lightweight, efficient version of **OpenCensor-H1**, designed to detect profanity, toxicity, and offensive content in Hebrew text. It is fine-tuned from the `onlplab/alephbert-base` model.

## Model Details

- **Model Name:** OpenCensor-H1-Mini
- **Base Model:** `onlplab/alephbert-base`
- **Task:** Binary Classification (0 = Clean, 1 = Toxic/Profane)
- **Language:** Hebrew
- **Max Sequence Length:** 256 tokens (optimized for efficiency)

## Performance

| Metric | Score |
| :--- | :--- |
| **Accuracy** | 0.9826 |
| **F1-Score** | 0.9823 |
| **Precision** | 0.9812 |
| **Recall** | 0.9835 |

*Note: best decision threshold = 0.17 (applied in the usage example below).*

### Training Graphs

| Validation F1 | Threshold Analysis |
| :---: | :---: |
| ![Validation F1](valf1perepoch.png) | ![Thresholds](thresholdsperepoch.png) |

![Final Test Metrics](testmetrics.png)

## How to Use

You can use this model directly with the Hugging Face `transformers` library.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the model
model_id = "LikoKIko/OpenCensor-H1-Mini"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id).eval()

def predict(text):
    # Tokenize input
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256
    )

    # Predict
    with torch.no_grad():
        logits = model(**inputs).logits
        score = torch.sigmoid(logits).item()

    return {
        "text": text,
        "score": round(score, 4),
        "is_toxic": score >= 0.17  # Best threshold from evaluation
    }

# Example usage
text = "אני אוהב את כולם"  # "I love everyone"
print(predict(text))
```
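
The tokenizer call above also accepts a list of strings, so batching is straightforward. Below is a hypothetical `predict_batch` helper, a sketch only, assuming the model returns a single sigmoid logit per input, as the single-text function above does:

```python
def predict_batch(texts, threshold=0.17):
    # Tokenize a list of strings, padding the batch to its longest item
    inputs = tokenizer(
        texts,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=256
    )
    with torch.no_grad():
        logits = model(**inputs).logits          # assumed shape: (batch, 1)
        scores = torch.sigmoid(logits).squeeze(-1).tolist()
    return [
        {"text": t, "score": round(s, 4), "is_toxic": s >= threshold}
        for t, s in zip(texts, scores)
    ]

# "I love everyone", "Hello world"
print(predict_batch(["אני אוהב את כולם", "שלום עולם"]))
```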

## Training Info

The model was trained using an optimized pipeline featuring:
- **Gradient Accumulation:** Ensures stable training with larger effective batch sizes.
- **Smart Text Cleaning:** Removes noise while preserving Hebrew, English, and important symbols (`@#$%*`); a sketch of such a step appears below.
- **Dynamic Padding:** Pads each batch to an efficient token length based on the data's length distribution.
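
The preprocessing code itself is not published here, so the following is only a minimal sketch of a cleaning function consistent with the bullet above; the `clean_text` name and the exact regex are assumptions, not the released implementation.

```python
import re

def clean_text(text: str) -> str:
    # Keep Hebrew letters (U+0590-U+05FF), ASCII letters and digits,
    # whitespace, basic punctuation, and the symbols @#$%* noted above;
    # everything else is treated as noise and replaced with a space.
    noise = re.compile(r"[^\u0590-\u05FFa-zA-Z0-9\s.,!?@#$%*]")
    text = noise.sub(" ", text)
    # Collapse the whitespace runs left behind by the removal
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("שלום… ^^ @user"))  # -> "שלום @user"
```

For the dynamic-padding step, `transformers` provides `DataCollatorWithPadding`, which pads each batch to its longest member instead of a fixed maximum length.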

## License

CC-BY-SA-4.0

## Citation

```bibtex
@misc{opencensor-h1-mini,
  title = {OpenCensor-H1-Mini: Hebrew Profanity Detection Model},
  author = {LikoKIko},
  year = {2025},
  url = {https://huggingface.co/LikoKIko/OpenCensor-H1-Mini}
}
```