LikoKiko committed
Commit · 09dfad0
1 Parent(s): 939886c
Added H1 files and updated README (Full Release)
Browse files
- .gitattributes +1 -0
- README.md +106 -0
- bestthreshold.png +3 -0
- config.json +32 -0
- model.safetensors +3 -0
- special_tokens_map.json +7 -0
- testmetrics.png +3 -0
- tokenizer.json +0 -0
- tokenizer_config.json +58 -0
- valf1perepoch.png +3 -0
- vocab.txt +0 -0
.gitattributes
CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
README.md
CHANGED
@@ -1,3 +1,109 @@
---
language:
- he
license: cc-by-sa-4.0
tags:
- text-classification
- profanity-detection
- hebrew
- bert
- alephbert
library_name: transformers
base_model: onlplab/alephbert-base
datasets:
- custom
metrics:
- accuracy
- precision
- recall
- f1
---

# OpenCensor-Hebrew

This is a fine-tuned **AlephBERT** model that detects profanity in Hebrew text.

You give the model a Hebrew sentence. It returns:

- a score between **0 and 1**
- a yes/no flag (based on a cutoff you choose)

Meaning of the score:

- **0 = clean**, **1 = has profanity**
- Recommended cutoff from tests: **0.49** (you can change it)

![Best Threshold](bestthreshold.png)
![Test Metrics](testmetrics.png)
![Val F1 per Epoch](valf1perepoch.png)

## How to use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

KModel = "LikoKIko/OpenCensor-Hebrew"
KCutoff = 0.49  # best threshold from training
KMaxLen = 512   # number of tokens (not characters)

tokenizer = AutoTokenizer.from_pretrained(KModel)
model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()

text = "some hebrew text here"
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=KMaxLen)

with torch.inference_mode():
    score = torch.sigmoid(model(**inputs).logits).item()
KHasProfanity = int(score >= KCutoff)

print({"score": round(score, 4), "KHasProfanity": KHasProfanity})
```

Note: If the text is very long, it is truncated at `KMaxLen` tokens.
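If you need to check texts longer than 512 tokens, one simple option is to split the text into overlapping chunks, score each chunk, and keep the highest score. This is only a rough sketch (not part of the released code); the window and stride values below are illustrative choices.

```python
# Rough sketch: chunked scoring for long texts (not the official API of this repo).
# Splits the token IDs into overlapping windows, scores each window, keeps the max.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

KModel = "LikoKIko/OpenCensor-Hebrew"
KCutoff = 0.49
KWindow = 510   # leave room for [CLS] and [SEP]
KStride = 255   # ~50% overlap (illustrative choice)

tokenizer = AutoTokenizer.from_pretrained(KModel)
model = AutoModelForSequenceClassification.from_pretrained(KModel, num_labels=1).eval()

def score_long_text(text: str) -> float:
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    best = 0.0
    for start in range(0, max(len(ids), 1), KStride):
        chunk = tokenizer.decode(ids[start:start + KWindow])
        inputs = tokenizer(chunk, return_tensors="pt", truncation=True, max_length=512)
        with torch.inference_mode():
            best = max(best, torch.sigmoid(model(**inputs).logits).item())
        if start + KWindow >= len(ids):
            break
    return best

score = score_long_text("some long hebrew text here")
print({"score": round(score, 4), "KHasProfanity": int(score >= KCutoff)})
```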

## About this model

- Base: `onlplab/alephbert-base`
- Task: binary classification (clean / profanity)
- Language: Hebrew
- Max length: 512 tokens
- Training:
  - Batch size: 16
  - Epochs: 10
  - Learning rate: 0.00002
  - Loss: binary cross-entropy with logits (`BCEWithLogitsLoss`) with `pos_weight`, so the model pays more attention to the rare class. This helps when the dataset is imbalanced (a rough sketch follows this list).
  - Scheduler: linear warmup (10%)
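The full training script is not included in this commit. As an illustration of the `pos_weight` idea only, the weighted loss can be built from the class counts of the training split; the counts below are placeholders, not the real dataset statistics.

```python
# Illustrative sketch of a class-weighted loss (not the original training code).
import torch
import torch.nn as nn

num_clean = 8000     # placeholder: number of label-0 (clean) training examples
num_profane = 2000   # placeholder: number of label-1 (profanity) training examples

# pos_weight > 1 makes mistakes on the rarer positive class cost more.
pos_weight = torch.tensor([num_clean / num_profane])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(16, 1)                     # raw model outputs for a batch of 16
labels = torch.randint(0, 2, (16, 1)).float()   # 0 = clean, 1 = profanity
loss = criterion(logits, labels)
print(loss.item())
```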

### Results

- Test Accuracy: 0.9826
- Test Precision: 0.9812
- Test Recall: 0.9835
- Test F1: 0.9823
- Best threshold: 0.49

## Reproduce (training code)

This model was trained with a script that:

- Loads `onlplab/alephbert-base` with `num_labels=1`
- Tokenizes with `max_length=512` and pads to the max length
- Trains with AdamW, linear warmup, and mixed precision
- Tries cutoffs from `0.1` to `0.9` on the validation set and picks the one with the best F1 (sketched below)
- Saves the best checkpoint by validation F1, then reports test metrics
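The cutoff search could look roughly like the sketch below. It assumes you already have sigmoid scores and labels for the validation set; the arrays and step size here are illustrative, not taken from the real script.

```python
# Rough sketch: pick the cutoff in [0.1, 0.9] with the best validation F1
# (an illustrative reconstruction, not the original training script).
import numpy as np
from sklearn.metrics import f1_score

val_probs = np.array([0.05, 0.62, 0.48, 0.91, 0.30])   # placeholder sigmoid scores
val_labels = np.array([0, 1, 0, 1, 0])                  # placeholder ground-truth labels

best_cutoff, best_f1 = 0.5, -1.0
for cutoff in np.arange(0.10, 0.91, 0.01):
    f1 = f1_score(val_labels, (val_probs >= cutoff).astype(int), zero_division=0)
    if f1 > best_f1:
        best_cutoff, best_f1 = float(cutoff), f1

print({"best_cutoff": round(best_cutoff, 2), "best_f1": round(best_f1, 4)})
```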

## License

CC-BY-SA-4.0

## How to cite

```bibtex
@misc{opencensor-hebrew,
  title  = {OpenCensor-Hebrew: Hebrew Profanity Detection Model},
  author = {LikoKIko},
  year   = {2025},
  url    = {https://huggingface.co/LikoKIko/OpenCensor-Hebrew}
}
```
bestthreshold.png
ADDED
Git LFS Details
config.json
ADDED
@@ -0,0 +1,32 @@
{
  "_name_or_path": "onlplab/alephbert-base",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.39.1",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}
model.safetensors
ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:b7c15726bd22425a757957bd75f6a67cc889cbe7f0fe39465d07131f27bcdd41
size 503932924
special_tokens_map.json
ADDED
@@ -0,0 +1,7 @@
{
  "cls_token": "[CLS]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
testmetrics.png
ADDED
Git LFS Details
tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff
tokenizer_config.json
ADDED
@@ -0,0 +1,58 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "1": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "2": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "3": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "4": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": true,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": true,
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
valf1perepoch.png
ADDED
Git LFS Details
vocab.txt
ADDED
The diff for this file is too large to render. See raw diff