---
language:
- en
tags:
- text-detoxification
- text2text-generation
- detoxification
- content-moderation
- toxicity-reduction
- llama
- gguf
- minibase
- medium-model
- 4096-context
license: apache-2.0
datasets:
- paradetox
metrics:
- toxicity-reduction
- semantic-similarity
- fluency
- latency
model-index:
- name: Detoxify-Medium
  results:
  - task:
      type: text-detoxification
      name: Toxicity Reduction
    dataset:
      type: paradetox
      name: ParaDetox
      config: toxic-neutral
      split: test
    metrics:
    - type: toxicity-reduction
      value: 0.178
      name: Average Toxicity Reduction
    - type: semantic-similarity
      value: 0.561
      name: Semantic to Expected
    - type: fluency
      value: 0.929
      name: Text Fluency
    - type: latency
      value: 160.2
      name: Average Latency (ms)
---

# Detoxify-Medium 🤖

<div align="center">

**A medium-sized, high-capacity text detoxification model for advanced toxicity removal while preserving meaning.**

[![Model Size](https://img.shields.io/badge/Model_Size-369MB-blue)](https://huggingface.co/)
[![Architecture](https://img.shields.io/badge/Architecture-LlamaForCausalLM-green)](https://huggingface.co/)
[![Context Window](https://img.shields.io/badge/Context-4096_Tokens-orange)](https://huggingface.co/)
[![License](https://img.shields.io/badge/License-Apache_2.0-yellow)](LICENSE)
[![Discord](https://img.shields.io/badge/Discord-Join_Community-5865F2)](https://discord.com/invite/BrJn4D2Guh)

*Built by [Minibase](https://minibase.ai) - Train and deploy small AI models from your browser.*
*Browse all of the models and datasets available on the [Minibase Marketplace](https://minibase.ai/wiki/Special:Marketplace).*

</div>

## 📋 Model Summary

**Minibase-Detoxify-Medium** is a medium-capacity language model fine-tuned specifically for advanced text detoxification tasks. It takes toxic or inappropriate text as input and generates cleaned, non-toxic versions while preserving the original meaning and intent as much as possible. With a 4,096-token context window and enhanced capacity, it excels at handling longer texts and more complex detoxification scenarios.

### Key Features
- ⚡ **Balanced Performance**: ~160 ms average response time
- 🎯 **High Fluency**: 92.9% well-formed output text
- 🧹 **Advanced Detoxification**: 17.8% average toxicity reduction
- 💾 **Medium Size**: 369 MB (GGUF Q8_0 quantized)
- 🔒 **Privacy-First**: Runs locally, no data sent to external servers
- 📏 **Extended Context**: 4,096-token context window (4x larger than Small)

## 🚀 Quick Start

### Local Inference (Recommended)

1. **Install llama.cpp** (if not already installed):
```bash
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make
```

2. **Download and run the model**:
```bash
# Download model files
wget https://huggingface.co/minibase/detoxify-medium/resolve/main/detoxify-medium-q8_0.gguf
wget https://huggingface.co/minibase/detoxify-medium/resolve/main/run_server.sh

# Make executable and run
chmod +x run_server.sh
./run_server.sh
```

3. **Make API calls**:
```python
import requests

# Detoxify text
response = requests.post("http://127.0.0.1:8000/completion", json={
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is fucking terrible!\n\nResponse: ",
    "max_tokens": 256,
    "temperature": 0.7
})

result = response.json()
print(result["content"])  # "This is really terrible!"
```

### Python Client

```python
from detoxify_inference import DetoxifyClient

# Initialize client
client = DetoxifyClient()

# Detoxify text
toxic_text = "This product is fucking amazing, no bullshit!"
clean_text = client.detoxify_text(toxic_text)

print(clean_text)  # "This product is really amazing, no kidding!"
```

## 📊 Benchmarks & Performance

### ParaDetox Dataset Results (1,011 samples)

| Metric | Score | Description |
|--------|-------|-------------|
| **Original Toxicity** | 0.196 (19.6%) | Mean toxicity of the input texts |
| **Final Toxicity** | 0.018 (1.8%) | Mean toxicity of the detoxified outputs |
| **Toxicity Reduction** | 0.196 → 0.018 | Toxicity scores reduced by ~91% |
| **Semantic to Expected** | 0.561 (56.1%) | Similarity to human expert rewrites |
| **Semantic to Original** | 0.625 (62.5%) | How much original meaning is preserved |
| **Fluency** | 0.929 (92.9%) | Quality of generated text structure |
| **Latency** | 160.2 ms | Average response time |
| **Throughput** | ~6 req/sec | Estimated requests per second |

### Dataset Breakdown

#### General Toxic Content (1,000 samples)
- **Toxicity Reduction**: 17.8%
- **Semantic Preservation**: 56.1%
- **Fluency**: 92.9%

#### High-Toxicity Content (11 samples)
- **Toxicity Reduction**: 31.3% ⭐ **Strong performance!**
- **Semantic Preservation**: 47.7%
- **Fluency**: 93.6%

### Comparison with Detoxify-Small

| Model | Context Window | Toxicity Reduction | Semantic Similarity | Latency | Size |
|-------|----------------|--------------------|---------------------|---------|------|
| **Detoxify-Medium** | **4,096 tokens** | **17.8%** | **56.1%** | **160 ms** | **369 MB** |
| Detoxify-Small | 1,024 tokens | 3.2% | 47.1% | 66 ms | 138 MB |

### Comparison with Baselines

| Model | Semantic Similarity | Toxicity Reduction | Fluency |
|-------|---------------------|--------------------|---------|
| **Detoxify-Medium** | **0.561** | **0.178** | **0.929** |
| Detoxify-Small | 0.471 | 0.032 | 0.919 |
| BART-base (ParaDetox) | 0.750 | ~0.15 | ~0.85 |
| Human Performance | 0.850 | ~0.25 | ~0.95 |

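The headline numbers above fit together; a quick arithmetic check, using only the values from the tables:

```python
# Benchmark figures taken from the tables above.
original_toxicity = 0.196
final_toxicity = 0.018
latency_ms = 160.2

# Absolute reduction: 0.178, i.e. the 17.8% figure quoted in Key Features.
absolute_reduction = original_toxicity - final_toxicity
print(f"absolute reduction: {absolute_reduction:.3f}")

# Relative reduction: roughly 91% of the original toxicity removed.
relative_reduction = absolute_reduction / original_toxicity
print(f"relative reduction: {relative_reduction:.0%}")

# Throughput estimate: ~6 requests/second at 160.2 ms per request.
throughput = 1000 / latency_ms
print(f"throughput: {throughput:.1f} req/sec")
```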
## 🏗️ Technical Details

### Model Architecture
- **Architecture**: LlamaForCausalLM
- **Parameters**: ~150M estimated (medium capacity)
- **Context Window**: 4,096 tokens (4x larger than Small)
- **Max Position Embeddings**: 8,192
- **Quantization**: GGUF (Q8_0 quantization)
- **File Size**: 369 MB
- **Memory Requirements**: 12 GB RAM minimum, 24 GB recommended

### Training Details
- **Base Model**: Custom-trained Llama architecture
- **Fine-tuning Dataset**: Curated toxic-neutral parallel pairs
- **Training Objective**: Instruction-following for detoxification
- **Optimization**: Quantized for edge deployment
- **Model Scale**: Medium capacity for enhanced performance

### System Requirements
- **OS**: Linux, macOS, Windows
- **RAM**: 12 GB minimum, 24 GB recommended
- **Storage**: 400 MB free space
- **Dependencies**: llama.cpp, Python 3.8+
- **GPU**: Optional but beneficial (NVIDIA RTX 30-series, Apple M2/M3)

## 📖 Usage Examples

### Basic Detoxification
```python
# Input:  "This is fucking awesome!"
# Output: "This is really awesome!"

# Input:  "You stupid idiot, get out of my way!"
# Output: "You silly person, please move aside!"
```

### Long-Form Text Detoxification
```python
# Input:  "This article is complete bullshit and the author is a fucking moron who doesn't know what they're talking about. The whole thing is garbage and worthless."
# Output: "This article is not well-founded and the author seems uninformed about the topic. The whole thing seems questionable."
```

### API Integration
```python
import requests

def detoxify_text(text: str) -> str:
    """Detoxify text using the Detoxify-Medium API."""
    prompt = f"Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: {text}\n\nResponse: "

    response = requests.post("http://127.0.0.1:8000/completion", json={
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7
    })

    return response.json()["content"]

# Usage
toxic_comment = "This product sucks donkey balls!"
clean_comment = detoxify_text(toxic_comment)
print(clean_comment)  # "This product is not very good!"
```

### Batch Processing
```python
import asyncio
import aiohttp

async def detoxify_batch(texts: list) -> list:
    """Process multiple texts concurrently."""
    async with aiohttp.ClientSession() as session:
        tasks = []
        for text in texts:
            prompt = f"Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: {text}\n\nResponse: "
            payload = {
                "prompt": prompt,
                "max_tokens": 256,
                "temperature": 0.7
            }
            tasks.append(session.post("http://127.0.0.1:8000/completion", json=payload))

        responses = await asyncio.gather(*tasks)
        return [await resp.json() for resp in responses]

# Process multiple comments
comments = [
    "This is fucking brilliant!",
    "You stupid moron!",
    "What the hell is wrong with you?"
]

# `await` is only valid inside a coroutine, so drive the batch with asyncio.run()
clean_comments = asyncio.run(detoxify_batch(comments))
```

## 🔧 Advanced Configuration

### Server Configuration
```bash
# GPU acceleration (macOS builds of llama.cpp use Metal automatically
# when layers are offloaded via --n-gpu-layers)
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096

# CPU-only (higher memory usage)
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 0 \
  --threads 8 \
  --ctx-size 4096

# Custom context window
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --ctx-size 2048 \
  --host 127.0.0.1 \
  --port 8000
```

### Temperature Settings
- **Low (0.1-0.3)**: Conservative detoxification, minimal changes
- **Medium (0.4-0.7)**: Balanced approach (recommended)
- **High (0.8-1.0)**: Creative detoxification, more aggressive changes

### Context Window Optimization
- **Full Context (4096)**: Best for long documents and complex detoxification
- **Medium Context (2048)**: Good balance of performance and capability
- **Short Context (1024)**: Faster inference for simple tasks

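The temperature presets above can be folded into a small request builder. This is an illustrative sketch: the preset names and the `build_payload` helper are not part of any API, and the payload shape simply mirrors the earlier `/completion` examples.

```python
# Illustrative presets based on the temperature guidance above;
# the names are hypothetical, not part of the server API.
PRESETS = {
    "conservative": 0.2,  # low: minimal changes
    "balanced": 0.7,      # medium: recommended default
    "creative": 0.9,      # high: more aggressive rewrites
}

def build_payload(text: str, preset: str = "balanced") -> dict:
    """Build the JSON body used by the earlier /completion examples."""
    prompt = (
        "Instruction: Rewrite the provided text to remove the toxicity.\n\n"
        f"Input: {text}\n\nResponse: "
    )
    return {"prompt": prompt, "max_tokens": 256, "temperature": PRESETS[preset]}

payload = build_payload("Some toxic comment", preset="conservative")
print(payload["temperature"])  # 0.2
```

Pass the resulting dict as the `json=` argument of `requests.post`, exactly as in the API Integration example.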
## 📚 Limitations & Biases

### Current Limitations
- **Vocabulary Scope**: Trained primarily on English toxic content
- **Context Awareness**: May not detect sarcasm or cultural context
- **Length Constraints**: Limited to a 4,096-token context window
- **Domain Specificity**: Optimized for general web content
- **Memory Requirements**: Higher RAM usage compared to smaller models

### Potential Biases
- **Cultural Context**: May not handle culture-specific expressions
- **Dialect Variations**: Limited exposure to regional dialects
- **Emerging Slang**: May not recognize the newest internet slang
- **Long-form Content**: May struggle with very complex or technical toxicity

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup
```bash
# Clone the repository
git clone https://github.com/minibase-ai/detoxify-medium
cd detoxify-medium

# Install dependencies
pip install -r requirements.txt

# Run tests
python -m pytest tests/
```

## 📜 Citation

If you use Detoxify-Medium in your research, please cite:

```bibtex
@misc{detoxify-medium-2025,
  title={Detoxify-Medium: A High-Capacity Text Detoxification Model},
  author={Minibase AI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/minibase/detoxify-medium}
}
```

## 📞 Contact & Community

- **Website**: [minibase.ai](https://minibase.ai)
- **Discord Community**: [Join our Discord](https://discord.com/invite/BrJn4D2Guh)
- **GitHub Issues**: [Report bugs or request features](https://github.com/minibase-ai/detoxify-medium/issues)
- **Email**: hello@minibase.ai

### Support
- 📖 **Documentation**: [docs.minibase.ai](https://docs.minibase.ai)
- 💬 **Community Forum**: [forum.minibase.ai](https://forum.minibase.ai)
- 🐛 **Bug Reports**: [GitHub Issues](https://github.com/minibase-ai/detoxify-medium/issues)

## 📋 License

This model is released under the [Apache License 2.0](LICENSE).

## 🙏 Acknowledgments

- **ParaDetox Dataset**: Used for benchmarking and evaluation
- **llama.cpp**: For efficient local inference
- **Hugging Face**: For model hosting and community
- **Our amazing community**: For feedback and contributions

---

<div align="center">

**Built with ❤️ by the Minibase team**

*Making AI safer and more accessible for everyone*

[🌟 Star us on GitHub](https://github.com/minibase-ai/detoxify-medium) • [📖 Read the docs](https://docs.minibase.ai) • [💬 Join our Discord](https://discord.com/invite/BrJn4D2Guh)

</div>