Improve language tag

#1
by lbourdois - opened
Files changed (1)
  1. README.md +131 -118
README.md CHANGED
@@ -1,119 +1,132 @@
- ---
- language: en
- license: mit
- library_name: transformers
- pipeline_tag: text-generation
- tags:
- - text-generation
- - ai-detection
- - paraphrasing
- - originality
- - privacy
- datasets:
- - checkgpt
- base_model: Qwen/Qwen2.5-3B-Instruct
- model_type: causal-lm
- ---
-
- # AuthorMist Originality
-
- [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-AuthorMist-blue)](https://huggingface.co/authormist/originality)
- [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
-
- ## Overview
-
- AuthorMist Originality is a specialized language model designed to transform AI-generated text into more human-like writing while preserving the original meaning. This model was developed using reinforcement learning techniques to specifically evade AI text detection systems, with a focus on Originality.ai's detection algorithms.
-
- The model is based on Qwen2.5-3B Instruct and has been fine-tuned using Group Relative Policy Optimization (GRPO) with detector feedback as a reward signal. AuthorMist Originality demonstrates strong performance in reducing detectability across multiple AI text detection systems while maintaining high semantic similarity with the original text.
-
- ## Key Features
-
- - **Detector Evasion**: Trained specifically to evade Originality.ai's detection algorithms, with strong cross-detector generalization
- - **Meaning Preservation**: Maintains high semantic similarity (>0.94) with the original text
- - **Natural Output**: Produces fluent, coherent text that reads naturally
- - **Broad Applicability**: Effective across various domains including academic, technical, and creative writing
-
- ## Model Details
-
- - **Base Model**: Qwen2.5-3B Instruct
- - **Training Method**: Reinforcement Learning with Group Relative Policy Optimization (GRPO)
- - **Training Data**: 10,000 human-written abstracts from the CheckGPT dataset with corresponding AI-generated versions
- - **Domains Covered**: Computer Science, Humanities, Social Sciences, Physics, and more
- - **Text Length Support**: Optimized for texts ranging from 100 to 500 words
-
- ## Performance
-
- AuthorMist Originality demonstrates exceptional performance in evading AI text detection:
-
- - **Mean AUROC**: 0.49 across six major detection systems
- - **Mean F1-score**: 0.09 across all tested detectors
- - **Semantic Similarity**: >0.94 with original text
-
- The model shows particularly strong performance against:
- - Hello SimpleAI (AUROC: 0.07)
- - Sapling (AUROC: 0.13)
- - Winston.ai (AUROC: 0.35)
-
- ## Usage
-
- ```python
- from transformers import AutoModelForCausalLM, AutoTokenizer
-
- # Load model and tokenizer
- model_name = "authormist/authormist-originality"
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- model = AutoModelForCausalLM.from_pretrained(model_name)
-
- # Prepare input text
- ai_text = "Your AI-generated text here..."
- prompt = f"""Please paraphrase the following text to make it more human-like while preserving the original meaning:
-
- {ai_text}
-
- Paraphrased text:"""
-
- # Generate paraphrased text
- inputs = tokenizer(prompt, return_tensors="pt")
- outputs = model.generate(
-     inputs.input_ids,
-     max_new_tokens=512,
-     temperature=0.7,
-     top_p=0.9,
-     do_sample=True
- )
- paraphrased_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
- print(paraphrased_text.split("Paraphrased text:")[1].strip())
- ```
-
- ## Ethical Considerations
-
- AuthorMist Originality is released for research purposes to advance understanding of AI text detection limitations and privacy-preserving technologies. We acknowledge the dual-use nature of this technology and emphasize the following ethical considerations:
-
- 1. **Academic Integrity**: This model should not be used to misrepresent AI-generated content as human-written in academic settings where such distinctions are ethically relevant.
-
- 2. **Transparency**: We encourage users to maintain transparency about the use of AI assistance in content creation, even when using privacy-enhancing tools like AuthorMist.
-
- 3. **Privacy Protection**: The primary legitimate use case for this technology is protecting author privacy and preventing unfair discrimination against AI-assisted writing in contexts where such assistance is permissible.
-
- 4. **Research Value**: This model provides valuable insights into the limitations of current AI detection systems and contributes to the ongoing research dialogue about AI text detection and privacy.
-
- ## Citation
-
- If you use AuthorMist Originality in your research, please cite our paper:
-
- ```bibtex
- @article{authormist2025,
-   title={AuthorMist: Evading AI Text Detectors with Reinforcement Learning},
-   author={David, Isaac and Gervais, Arthur},
-   journal={arXiv preprint},
-   year={2025}
- }
- ```
-
- ## License
-
- This model is released under the [MIT License](https://opensource.org/licenses/MIT).
-
- ## Acknowledgments
-
+ ---
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ license: mit
+ library_name: transformers
+ pipeline_tag: text-generation
+ tags:
+ - text-generation
+ - ai-detection
+ - paraphrasing
+ - originality
+ - privacy
+ datasets:
+ - checkgpt
+ base_model: Qwen/Qwen2.5-3B-Instruct
+ model_type: causal-lm
+ ---
+
+ # AuthorMist Originality
+
+ [![Hugging Face](https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-AuthorMist-blue)](https://huggingface.co/authormist/originality)
+ [![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)
+
+ ## Overview
+
+ AuthorMist Originality is a specialized language model designed to transform AI-generated text into more human-like writing while preserving the original meaning. This model was developed using reinforcement learning techniques to specifically evade AI text detection systems, with a focus on Originality.ai's detection algorithms.
+
+ The model is based on Qwen2.5-3B Instruct and has been fine-tuned using Group Relative Policy Optimization (GRPO) with detector feedback as a reward signal. AuthorMist Originality demonstrates strong performance in reducing detectability across multiple AI text detection systems while maintaining high semantic similarity with the original text.
+
+ ## Key Features
+
+ - **Detector Evasion**: Trained specifically to evade Originality.ai's detection algorithms, with strong cross-detector generalization
+ - **Meaning Preservation**: Maintains high semantic similarity (>0.94) with the original text
+ - **Natural Output**: Produces fluent, coherent text that reads naturally
+ - **Broad Applicability**: Effective across various domains including academic, technical, and creative writing
+
+ ## Model Details
+
+ - **Base Model**: Qwen2.5-3B Instruct
+ - **Training Method**: Reinforcement Learning with Group Relative Policy Optimization (GRPO)
+ - **Training Data**: 10,000 human-written abstracts from the CheckGPT dataset with corresponding AI-generated versions
+ - **Domains Covered**: Computer Science, Humanities, Social Sciences, Physics, and more
+ - **Text Length Support**: Optimized for texts ranging from 100 to 500 words
+
+ ## Performance
+
+ AuthorMist Originality demonstrates exceptional performance in evading AI text detection:
+
+ - **Mean AUROC**: 0.49 across six major detection systems
+ - **Mean F1-score**: 0.09 across all tested detectors
+ - **Semantic Similarity**: >0.94 with original text
+
+ The model shows particularly strong performance against:
+ - Hello SimpleAI (AUROC: 0.07)
+ - Sapling (AUROC: 0.13)
+ - Winston.ai (AUROC: 0.35)
+
+ ## Usage
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ # Load model and tokenizer
+ model_name = "authormist/authormist-originality"
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForCausalLM.from_pretrained(model_name)
+
+ # Prepare input text
+ ai_text = "Your AI-generated text here..."
+ prompt = f"""Please paraphrase the following text to make it more human-like while preserving the original meaning:
+
+ {ai_text}
+
+ Paraphrased text:"""
+
+ # Generate paraphrased text
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(
+     inputs.input_ids,
+     max_new_tokens=512,
+     temperature=0.7,
+     top_p=0.9,
+     do_sample=True
+ )
+ paraphrased_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(paraphrased_text.split("Paraphrased text:")[1].strip())
+ ```
+
+ ## Ethical Considerations
+
+ AuthorMist Originality is released for research purposes to advance understanding of AI text detection limitations and privacy-preserving technologies. We acknowledge the dual-use nature of this technology and emphasize the following ethical considerations:
+
+ 1. **Academic Integrity**: This model should not be used to misrepresent AI-generated content as human-written in academic settings where such distinctions are ethically relevant.
+
+ 2. **Transparency**: We encourage users to maintain transparency about the use of AI assistance in content creation, even when using privacy-enhancing tools like AuthorMist.
+
+ 3. **Privacy Protection**: The primary legitimate use case for this technology is protecting author privacy and preventing unfair discrimination against AI-assisted writing in contexts where such assistance is permissible.
+
+ 4. **Research Value**: This model provides valuable insights into the limitations of current AI detection systems and contributes to the ongoing research dialogue about AI text detection and privacy.
+
+ ## Citation
+
+ If you use AuthorMist Originality in your research, please cite our paper:
+
+ ```bibtex
+ @article{authormist2025,
+   title={AuthorMist: Evading AI Text Detectors with Reinforcement Learning},
+   author={David, Isaac and Gervais, Arthur},
+   journal={arXiv preprint},
+   year={2025}
+ }
+ ```
+
+ ## License
+
+ This model is released under the [MIT License](https://opensource.org/licenses/MIT).
+
+ ## Acknowledgments
+
  We thank the developers of Qwen2.5 for the base model and the creators of the CheckGPT dataset for providing valuable training data.
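
The only substantive change in this diff is the `language` key of the README front matter, which goes from the scalar `en` to a list of thirteen three-letter codes (these appear to be ISO 639-3 identifiers). As a rough way to check such a change locally, the sketch below reads the `language` entry from a front matter block and confirms that every code has three lowercase letters. `front_matter_languages` is a hypothetical helper written for this illustration, not part of any Hugging Face tooling, and it assumes the flat layout shown in the diff rather than general YAML.

```python
import re

# Front matter as it appears on the "+" side of the diff (shortened to the
# keys that matter here; the trailing heading marks the start of the body).
README_HEAD = """---
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
license: mit
---

# AuthorMist Originality
"""


def front_matter_languages(text):
    """Return the values under `language:` in a leading front matter block.

    Hypothetical helper: it only handles `language: code` (scalar) or
    `language:` followed by `- code` items, which is an assumption about
    the layout, not a full YAML parser.
    """
    match = re.match(r"---\n(.*?)\n---\n", text, flags=re.DOTALL)
    if match is None:
        return []  # no front matter found
    languages = []
    in_language = False
    for line in match.group(1).splitlines():
        if line.startswith("language:"):
            in_language = True
            inline = line[len("language:"):].strip()
            if inline:  # scalar form, e.g. `language: en`
                return [inline]
            continue
        if in_language:
            stripped = line.strip()
            if stripped.startswith("- "):
                languages.append(stripped[2:].strip())
            else:
                break  # reached the next top-level key
    return languages


codes = front_matter_languages(README_HEAD)
all_three_letter = all(re.fullmatch(r"[a-z]{3}", code) for code in codes)
print(codes, all_three_letter)
```

Run against the old front matter (`language: en`), the same helper returns a single two-letter entry, so the three-letter check fails there. For anything beyond this flat layout, a real YAML parser would be the safer choice.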