Oxidane committed · verified
Commit 4a819e4 · 1 Parent(s): 15d95be

Upload TMR AI Text Detector - RoBERTa-base with Focal Loss and Self-Hard-Negative training

README.md ADDED
@@ -0,0 +1,97 @@
+ ---
+ license: mit
+ base_model: roberta-base
+ tags:
+ - text-classification
+ - ai-detection
+ - roberta
+ - focal-loss
+ - raid-benchmark
+ language:
+ - en
+ pipeline_tag: text-classification
+ ---
+
+ # TMR: Target Mining RoBERTa - AI Text Detector
+
+ A robust AI-generated text detector based on RoBERTa-base, trained with Focal Loss and Self-Hard-Negative iterative mining on the RAID dataset.
+
+ ## Model Description
+
+ TMR (Target Mining RoBERTa) is designed to detect AI-generated text with high accuracy while keeping the false positive rate low. The model uses:
+
+ - **Architecture**: RoBERTa-base (125M parameters)
+ - **Loss Function**: Focal Loss (gamma=2.0, alpha=[0.85, 0.15]) to focus training on hard examples (see the sketch below)
+ - **Training Strategy**: Self-Hard-Negative (Self-HN) iterative mining
+ - **Training Data**: 50,000 stratified samples from RAID (45% human, 55% AI)
+
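+ With gamma=2.0, examples the model already classifies confidently contribute almost nothing to the loss, while the per-class alpha weights rebalance the two labels. The snippet below is a minimal PyTorch sketch of that loss, not the released training code; in particular, assigning alpha 0.85 to the human class and 0.15 to the AI class is an assumption, since the card only lists the values.
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+
+ def focal_loss(logits, targets, gamma=2.0, alpha=(0.85, 0.15)):
+     """Focal loss sketch for binary detection (0 = human, 1 = AI).
+
+     The alpha-to-class mapping is assumed; the model card only lists
+     alpha=[0.85, 0.15] without naming which class gets which weight.
+     """
+     ce = F.cross_entropy(logits, targets, reduction="none")   # per-example cross-entropy
+     p_t = torch.exp(-ce)                                      # probability of the true class
+     alpha_t = logits.new_tensor(alpha)[targets]               # per-example class weight
+     return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()
+ ```
+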
+ ## Performance
+
+ RAID held-out evaluation (100k samples, seed 999):
+
+ | Metric | Score |
+ |--------|-------|
+ | **AUROC** | **0.9972** |
+ | **Accuracy** | 97.30% |
+ | **FPR** | 2.27% |
+ | **FNR** | 2.71% |
+ | **F1 Score** | 0.9856 |
+
+ ## Usage
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ # Load model and tokenizer
+ model_path = "Oxidane/tmr-ai-text-detector"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForSequenceClassification.from_pretrained(model_path)
+ model.eval()
+
+ # Predict
+ text = "Your text here..."
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512, padding=True)
+
+ with torch.no_grad():
+     outputs = model(**inputs)
+     logits = outputs.logits
+     probs = torch.softmax(logits, dim=-1)
+
+ # Probability that the text is AI-generated (class index 1)
+ ai_probability = probs[0][1].item()
+ print(f"AI probability: {ai_probability:.4f}")
+
+ # Binary classification (threshold=0.5)
+ is_ai = ai_probability > 0.5
+ print(f"Prediction: {'AI-generated' if is_ai else 'Human-written'}")
+ ```
+
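+ The checkpoint should also work through the high-level `pipeline` helper. The sketch below is not part of the original card: because the uploaded config.json defines no `id2label` mapping, the pipeline is expected to return the generic `LABEL_0` / `LABEL_1` names, which are assumed here to correspond to human and AI respectively (matching the class index used above).
+
+ ```python
+ from transformers import pipeline
+
+ detector = pipeline("text-classification", model="Oxidane/tmr-ai-text-detector")
+
+ # truncation/max_length are forwarded to the tokenizer at call time
+ result = detector("Your text here...", truncation=True, max_length=512)[0]
+ print(result)  # e.g. {'label': 'LABEL_1', 'score': 0.98} -- LABEL_1 assumed to mean AI-generated
+ ```
+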
+ ## Training
+
+ Trained on the RAID dataset (ACL 2024) for 3 epochs with Focal Loss (gamma=2.0). The model uses Self-Hard-Negative mining: iteratively identifying human samples misclassified as AI, then retraining with these hard examples.
+
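+ The training script itself is not included in this upload, so the loop below is only an illustrative sketch of the Self-HN procedure described above; `train_one_round` and `predict_ai_probability` are hypothetical helpers, and the number of mining rounds is assumed.
+
+ ```python
+ # Sketch of Self-Hard-Negative (Self-HN) mining. Hypothetical helpers:
+ #   train_one_round(model, examples)       -> model fine-tuned with focal loss
+ #   predict_ai_probability(model, texts)   -> list of P(AI) scores
+ def self_hard_negative_training(model, train_set, rounds=3, threshold=0.5):
+     data = list(train_set)                  # examples: {"text": str, "label": int}
+     for _ in range(rounds):
+         model = train_one_round(model, data)
+         humans = [ex for ex in data if ex["label"] == 0]
+         probs = predict_ai_probability(model, [ex["text"] for ex in humans])
+         # Human texts the current model flags as AI are the hard negatives.
+         hard = [ex for ex, p in zip(humans, probs) if p > threshold]
+         data = data + hard                  # oversample them in the next round
+     return model
+ ```
+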
+ ## Limitations
+
+ - **Language**: Primarily trained on English text
+ - **Threshold**: Optimized for a decision threshold of 0.5; the reported metrics use this operating point
+
+ ## License
+
+ MIT License
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{tmr-ai-text-detector,
+   title={TMR: Target Mining RoBERTa for AI Text Detection},
+   author={Oxidane},
+   year={2025},
+   url={https://huggingface.co/Oxidane/tmr-ai-text-detector}
+ }
+ ```
+
+ ## Contact
+
+ For questions, contact me@oxidane.net
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+   "architectures": [
+     "RobertaForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "bos_token_id": 0,
+   "classifier_dropout": null,
+   "dtype": "float32",
+   "eos_token_id": 2,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "layer_norm_eps": 1e-05,
+   "max_position_embeddings": 514,
+   "model_type": "roberta",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 1,
+   "position_embedding_type": "absolute",
+   "transformers_version": "4.57.3",
+   "type_vocab_size": 1,
+   "use_cache": true,
+   "vocab_size": 50265
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a74ea30f5db6a1094510be903ce374426191addaf7d860b766077de246d32358
+ size 498612824
special_tokens_map.json ADDED
@@ -0,0 +1,51 @@
+ {
+   "bos_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "cls_token": {
+     "content": "<s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "eos_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "<mask>",
+     "lstrip": true,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "<pad>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "</s>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "<unk>",
+     "lstrip": false,
+     "normalized": true,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,62 @@
+ {
+   "add_prefix_space": false,
+   "added_tokens_decoder": {
+     "0": {
+       "content": "<s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "1": {
+       "content": "<pad>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "2": {
+       "content": "</s>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "3": {
+       "content": "<unk>",
+       "lstrip": false,
+       "normalized": true,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "50264": {
+       "content": "<mask>",
+       "lstrip": true,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "bos_token": "<s>",
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "<s>",
+   "eos_token": "</s>",
+   "errors": "replace",
+   "extra_special_tokens": {},
+   "mask_token": "<mask>",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_token": "<pad>",
+   "sep_token": "</s>",
+   "stride": 0,
+   "tokenizer_class": "RobertaTokenizer",
+   "trim_offsets": true,
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "<unk>"
+ }
training_args.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:b4f4276af6e212d3cd34f82392a73e76fe8f93b5d1c7ed0bb5ae976c974832b3
+ size 5905
vocab.json ADDED
The diff for this file is too large to render. See raw diff