---
language:
- en
tags:
- text-detoxification
- text2text-generation
- detoxification
- content-moderation
- toxicity-reduction
- llama
- gguf
- minibase
- medium-model
- 4096-context
license: apache-2.0
datasets:
- paradetox
metrics:
- toxicity-reduction
- semantic-similarity
- fluency
- latency
model-index:
- name: Detoxify-Medium
  results:
  - task:
      type: text-detoxification
      name: Toxicity Reduction
    dataset:
      type: paradetox
      name: ParaDetox
      config: toxic-neutral
      split: test
    metrics:
    - type: toxicity-reduction
      value: 0.178
      name: Average Toxicity Reduction
    - type: semantic-similarity
      value: 0.561
      name: Semantic to Expected
    - type: fluency
      value: 0.929
      name: Text Fluency
    - type: latency
      value: 160.2
      name: Average Latency (ms)
---

# Detoxify-Medium

<div align="center">

**A medium-sized, high-capacity text detoxification model for advanced toxicity removal while preserving meaning.**

*Built by [Minibase](https://minibase.ai) - Train and deploy small AI models from your browser.*
*Browse all of the models and datasets available on the [Minibase Marketplace](https://minibase.ai/wiki/Special:Marketplace).*

</div>

## Model Summary

**Minibase-Detoxify-Medium** is a medium-capacity language model fine-tuned for text detoxification. It takes toxic or inappropriate text as input and generates a cleaned, non-toxic version while preserving the original meaning and intent as far as possible. With a 4,096-token context window, it can handle longer texts and more complex detoxification scenarios than Detoxify-Small.

### Key Features
- **Balanced Performance**: ~160ms average response time
- **High Fluency**: 92.9% well-formed output text
- **Advanced Detoxification**: 17.8% average toxicity reduction (absolute drop in toxicity score on ParaDetox)
- **Medium Size**: 369MB (GGUF Q8_0 quantized)
- **Privacy-First**: Runs locally, no data sent to external servers
- **Extended Context**: 4,096 token context window (4x that of Detoxify-Small)

## Quick Start

### Local Inference (Recommended)

1. **Install llama.cpp** (if not already installed):
   ```bash
   git clone https://github.com/ggerganov/llama.cpp
   # Recent llama.cpp versions build with CMake instead of make:
   #   cmake -B build && cmake --build build --config Release
   cd llama.cpp && make
   ```

2. **Download and run the model**:
   ```bash
   # Download model files
   wget https://huggingface.co/Minibase/Detoxify-Language-Medium/resolve/main/detoxify-medium-q8_0.gguf
   wget https://huggingface.co/Minibase/Detoxify-Language-Medium/resolve/main/detoxify_inference.py

   # Make executable and run
   # Note: run_server.sh is not fetched by the commands above; it must already
   # be present in the working directory.
   chmod +x run_server.sh
   ./run_server.sh
   ```

3. **Make API calls**:
   ```python
   import requests

   # Detoxify text
   response = requests.post("http://127.0.0.1:8000/completion", json={
       "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is fucking terrible!\n\nResponse: ",
       "max_tokens": 256,
       "temperature": 0.7
   })
   response.raise_for_status()  # Fail loudly if the server returned an error

   result = response.json()
   print(result["content"])  # e.g. "This is really terrible!"
   ```

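The request above hard-codes the prompt. As a minimal sketch, the same `Instruction / Input / Response` template can be wrapped in a small helper so every call formats its input the same way. The names `build_prompt` and `clean_completion` are hypothetical and not part of `detoxify_inference.py`. Stripping the input and cutting the output at a repeated `Instruction:` marker are assumptions about sensible post-processing, not documented model behavior.

```python
# Hypothetical helpers: neither function ships with the model repository.
INSTRUCTION = "Rewrite the provided text to remove the toxicity."


def build_prompt(text: str) -> str:
    """Format raw text into the prompt template used in the request above."""
    return f"Instruction: {INSTRUCTION}\n\nInput: {text.strip()}\n\nResponse: "


def clean_completion(content: str) -> str:
    """Trim whitespace and drop anything after a repeated 'Instruction:' marker.

    Assumption: if the model keeps generating past its answer, it starts a
    new template block; everything from that point on is discarded.
    """
    return content.split("\nInstruction:", 1)[0].strip()


prompt = build_prompt("  This is fucking terrible!  ")
print(repr(prompt))
print(clean_completion(" This is really terrible!\n\nInstruction: something else"))
```

The `prompt` string can be passed directly as the `"prompt"` field of the JSON payload shown in step 3.
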
### Python Client

```python
from detoxify_inference import DetoxifyClient

# Initialize client
client = DetoxifyClient()

# Detoxify text
toxic_text = "This product is fucking amazing, no bullshit!"
clean_text = client.detoxify_text(toxic_text)

print(clean_text)  # e.g. "This product is really amazing, no kidding!"
```

## Benchmarks & Performance

### ParaDetox Dataset Results (1,011 samples)

| Metric | Value | Description |
|--------|-------|-------------|
| Original Toxicity | 0.196 (19.6%) | Input toxicity level |
| Final Toxicity | 0.018 (1.8%) | Output toxicity level |
| **Toxicity Reduction** | **0.178 absolute (91% relative)** | **Drop in toxicity score from input to output** |
| **Semantic Similarity (Expected)** | **0.561 (56.1%)** | **Similarity to human expert rewrites** |
| **Semantic Similarity (Original)** | **0.625 (62.5%)** | **How much original meaning is preserved** |
| **Fluency** | **0.929 (92.9%)** | **Quality of generated text structure** |
| **Latency** | **160.2ms** | **Average response time** |
| **Throughput** | **~6 req/sec** | **Estimated requests per second** |

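The 17.8% figure quoted elsewhere in this card and the 91% figure in the table describe the same measurement in two ways. The short calculation below, using only the values from the table, shows how they relate. It also derives the throughput estimate, assuming requests are served one at a time.

```python
# Values copied from the benchmark table above.
original_toxicity = 0.196
final_toxicity = 0.018
avg_latency_ms = 160.2

# Absolute reduction: how many points the toxicity score dropped.
absolute_reduction = original_toxicity - final_toxicity

# Relative reduction: the drop as a fraction of the original score.
relative_reduction = absolute_reduction / original_toxicity

# Throughput estimate, assuming strictly sequential requests.
sequential_throughput = 1000.0 / avg_latency_ms

print(f"absolute reduction: {absolute_reduction:.3f}")
print(f"relative reduction: {relative_reduction:.1%}")
print(f"throughput: {sequential_throughput:.1f} req/sec")
```
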
### Dataset Breakdown

#### General Toxic Content (1,000 samples)
- **Toxicity Reduction**: 17.8%
- **Semantic Preservation**: 56.1%
- **Fluency**: 92.9%

#### High-Toxicity Content (11 samples)
- **Toxicity Reduction**: 31.3% (**Strong performance!**)
- **Semantic Preservation**: 47.7%
- **Fluency**: 93.6%

### Comparison with Detoxify-Small

| Model | Context Window | Toxicity Reduction | Semantic Similarity | Latency | Size |
|-------|----------------|-------------------|-------------------|---------|------|
| **Detoxify-Medium** | **4,096 tokens** | **17.8%** | **56.1%** | **160ms** | **369MB** |
| Detoxify-Small | 1,024 tokens | 3.2% | 47.1% | 66ms | 138MB |

**Key Improvements:**
- 4x larger context window
- 5.6x better toxicity reduction
- 19% better semantic preservation (relative)
- Trade-off: 2.7x larger model size and about 2.4x higher latency

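The multipliers listed above follow directly from the comparison table. As a quick check, the sketch below recomputes each one as the Medium value divided by the Small value; the dictionary keys are illustrative names, not fields of any published file.

```python
# Values copied from the comparison table; key names are illustrative.
medium = {
    "context": 4096,
    "toxicity_reduction": 0.178,
    "semantic_similarity": 0.561,
    "latency_ms": 160,
    "size_mb": 369,
}
small = {
    "context": 1024,
    "toxicity_reduction": 0.032,
    "semantic_similarity": 0.471,
    "latency_ms": 66,
    "size_mb": 138,
}

# Ratio of Medium to Small for every column.
ratios = {key: medium[key] / small[key] for key in medium}

for key, ratio in ratios.items():
    print(f"{key}: {ratio:.2f}x")
```

A ratio of about 1.19 for semantic similarity corresponds to the "19% better" figure, and the latency and size ratios quantify the cost of the larger model.
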
### Comparison with Baselines

| Model | Semantic Similarity | Toxicity Reduction | Fluency |
|-------|-------------------|-------------------|---------|
| **Detoxify-Medium** | **0.561** | **0.178** | **0.929** |
| Detoxify-Small | 0.471 | 0.032 | 0.919 |
| BART-base (ParaDetox) | 0.750 | ~0.15 | ~0.85 |
| Human Performance | 0.850 | ~0.25 | ~0.95 |

**Performance Notes:**
- **Semantic Similarity**: How well meaning is preserved
- **Toxicity Reduction**: How effectively toxicity is removed
- **Fluency**: Quality of generated text
- **Detoxify-Medium** improves on Detoxify-Small on every metric and exceeds the BART-base figures for toxicity reduction and fluency, while BART-base retains higher semantic similarity

## Technical Details

### Model Architecture
- **Architecture**: LlamaForCausalLM
- **Parameters**: 279M (medium capacity)
- **Context Window**: 4,096 tokens (4x larger than Small)
- **Max Position Embeddings**: 8,192
- **Quantization**: GGUF (Q8_0 quantization)
- **File Size**: 369MB
- **Memory Requirements**: 12GB RAM minimum, 24GB recommended

### Training Details
- **Base Model**: Custom-trained Llama architecture
- **Fine-tuning Dataset**: Curated toxic-neutral parallel pairs
- **Training Objective**: Instruction-following for detoxification
- **Optimization**: Quantized for edge deployment
- **Model Scale**: Medium capacity for enhanced performance

### System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| **Operating System** | Linux, macOS, Windows | Linux or macOS |
| **RAM** | 12GB | 24GB |
| **Storage** | 400MB free space | 1GB free space |
| **Python** | 3.8+ | 3.10+ |
| **Dependencies** | llama.cpp | llama.cpp, requests |
| **GPU** | Optional | NVIDIA RTX 30-series or Apple M2/M3 |

**Notes:**
- **CPU-only inference** is supported but slower
- **GPU acceleration** provides significant speed improvements
- **Apple Silicon** users get Metal acceleration automatically

## Usage Examples

### Basic Detoxification
```python
# Input: "This is fucking awesome!"
# Output: "This is really awesome!"

# Input: "You stupid idiot, get out of my way!"
# Output: "You silly person, please move aside!"
```

### Long-Form Text Detoxification
```python
# Input: "This article is complete bullshit and the author is a fucking moron who doesn't know what they're talking about. The whole thing is garbage and worthless."
# Output: "This article is not well-founded and the author seems uninformed about the topic. The whole thing seems questionable."
```

### API Integration
```python
import requests

def detoxify_text(text: str) -> str:
    """Detoxify text using the Detoxify-Medium API."""
    prompt = f"Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: {text}\n\nResponse: "

    response = requests.post("http://127.0.0.1:8000/completion", json={
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7
    }, timeout=60)
    response.raise_for_status()  # Fail loudly on HTTP errors

    return response.json()["content"].strip()

# Usage
toxic_comment = "This product sucks donkey balls!"
clean_comment = detoxify_text(toxic_comment)
print(clean_comment)  # e.g. "This product is not very good!"
```

### Batch Processing
```python
import asyncio
import aiohttp

async def detoxify_one(session: aiohttp.ClientSession, text: str) -> str:
    """Send one detoxification request and return the generated text."""
    prompt = f"Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: {text}\n\nResponse: "
    payload = {
        "prompt": prompt,
        "max_tokens": 256,
        "temperature": 0.7
    }
    # "async with" releases the connection once the body has been read
    async with session.post("http://127.0.0.1:8000/completion", json=payload) as resp:
        resp.raise_for_status()
        return (await resp.json())["content"]

async def detoxify_batch(texts: list) -> list:
    """Process multiple texts concurrently"""
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(detoxify_one(session, text) for text in texts))

# Process multiple comments
comments = [
    "This is fucking brilliant!",
    "You stupid moron!",
    "What the hell is wrong with you?"
]

# "await" is only valid inside a coroutine, so drive the batch with asyncio.run()
clean_comments = asyncio.run(detoxify_batch(comments))
```

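With an estimated throughput of about 6 requests per second, sending a large batch all at once may simply queue requests on the server. One possible refinement, sketched below, caps the number of in-flight requests with an `asyncio.Semaphore`. The helper `map_limited` and the stand-in worker `fake_detoxify` are hypothetical names used so the example runs without a server; in practice the worker would be a coroutine that posts to the `/completion` endpoint.

```python
import asyncio


async def map_limited(worker, items, limit=4):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        # Each call waits here until one of the `limit` slots is free.
        async with semaphore:
            return await worker(item)

    # gather() returns results in the same order as the input items.
    return await asyncio.gather(*(run_one(item) for item in items))


async def fake_detoxify(text: str) -> str:
    """Stand-in for a real request so the sketch runs offline."""
    await asyncio.sleep(0.01)
    return text.upper()


demo_results = asyncio.run(map_limited(fake_detoxify, ["a", "b", "c"], limit=2))
print(demo_results)
```

The default of 4 concurrent requests is an arbitrary starting point, not a measured optimum for this model.
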
## Advanced Configuration

### Server Configuration
```bash
# GPU acceleration (macOS; Metal-enabled builds use the GPU without an extra flag)
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096

# CPU-only (higher memory usage)
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 0 \
  --threads 8 \
  --ctx-size 4096

# Custom context window
llama-server \
  -m detoxify-medium-q8_0.gguf \
  --ctx-size 2048 \
  --host 127.0.0.1 \
  --port 8000
```

### Alternative: Use the macOS Application
```bash
# If using the provided macOS app bundle
cd /path/to/downloaded/model
./Minibase-detoxify-medium.app/Contents/MacOS/run_server
```

### Temperature Settings

| Temperature Range | Approach | Description |
|------------------|----------|-------------|
| **0.1-0.3** | Conservative | Minimal changes, preserves original style |
| **0.4-0.7** | **Balanced (Recommended)** | **Best trade-off between detoxification and naturalness** |
| **0.8-1.0** | Creative | More aggressive detoxification, may alter style |

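One way to apply the table is to map each approach to a single temperature and build the request payload from it. This is a sketch under stated assumptions: the function `payload_for_mode` is hypothetical, and the specific values (0.2, 0.7, 0.9) are picks from within each documented range rather than tuned recommendations.

```python
# Assumed representative temperature for each range in the table above.
TEMPERATURE_BY_MODE = {
    "conservative": 0.2,  # from the 0.1-0.3 range
    "balanced": 0.7,      # from the 0.4-0.7 range, as used in this card's examples
    "creative": 0.9,      # from the 0.8-1.0 range
}


def payload_for_mode(prompt: str, mode: str = "balanced", max_tokens: int = 256) -> dict:
    """Build a /completion payload using the temperature for the given mode."""
    if mode not in TEMPERATURE_BY_MODE:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(TEMPERATURE_BY_MODE)}")
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": TEMPERATURE_BY_MODE[mode],
    }


example_payload = payload_for_mode("Instruction: ...", mode="conservative")
print(example_payload["temperature"])
```
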
### Context Window Optimization

| Context Size | Use Case | Performance |
|--------------|----------|------------|
| **4,096 tokens** | **Long documents, complex detoxification** | **Best quality, slower processing** |
| **2,048 tokens** | **Balanced performance and quality** | **Good compromise (recommended)** |
| **1,024 tokens** | **Simple tasks, fast processing** | **Faster inference, adequate quality** |

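The context window has to hold both the prompt and the generated tokens, so a reasonable rule of thumb is to pick the smallest size from the table that fits their sum. The sketch below illustrates this; `pick_ctx_size` is a hypothetical helper, and the token counts are assumed to come from whatever tokenizer or estimate is available.

```python
# Context sizes listed in the table above, smallest first.
CONTEXT_SIZES = (1024, 2048, 4096)


def pick_ctx_size(prompt_tokens: int, max_new_tokens: int = 256) -> int:
    """Return the smallest listed context size that fits prompt plus output."""
    needed = prompt_tokens + max_new_tokens
    for size in CONTEXT_SIZES:
        if needed <= size:
            return size
    raise ValueError(
        f"{needed} tokens exceed the 4,096-token window; split the input first"
    )


for tokens in (500, 1000, 3000):
    print(tokens, "->", pick_ctx_size(tokens))
```

The returned value is what would be passed to `--ctx-size` when starting the server.
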
## Limitations & Biases

### Current Limitations

| Limitation | Description | Impact |
|------------|-------------|--------|
| **Vocabulary Scope** | Trained primarily on English toxic content | May not handle other languages effectively |
| **Context Awareness** | Limited detection of sarcasm or cultural context | May miss nuanced toxicity |
| **Length Constraints** | Limited to 4,096 token context window | Cannot process very long documents |
| **Domain Specificity** | Optimized for general web content | May perform differently on specialized domains |
| **Memory Requirements** | Higher RAM usage than smaller models | Requires more system resources |

### Potential Biases

| Bias Type | Description | Mitigation |
|-----------|-------------|------------|
| **Cultural Context** | May not handle culture-specific expressions | Use with awareness of cultural differences |
| **Dialect Variations** | Limited exposure to regional dialects | May not recognize regional toxic patterns |
| **Emerging Slang** | May not recognize newest internet slang | Regular model updates recommended |
| **Long-form Content** | May struggle with very complex toxicity | Break long content into smaller chunks |

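The suggestion to break long content into smaller chunks can be sketched as a sentence-aware splitter. This is one possible approach, not the method used by Minibase: `chunk_text` is a hypothetical helper, the sentence boundary regex is deliberately simple, and the word budget is only a rough proxy for tokens.

```python
import re


def chunk_text(text: str, max_words: int = 300) -> list:
    """Group whole sentences into chunks of at most roughly max_words words.

    A single sentence longer than max_words is kept intact as its own chunk.
    """
    # Split after ., ! or ? followed by whitespace (a simple heuristic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        words = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_text("One two. Three four. Five six.", max_words=4))
```

Each chunk can then be detoxified separately and the results joined. Context that spans chunk boundaries is lost, which may affect meaning preservation.
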
## Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

### Development Setup
```bash
# Clone the repository
git clone https://github.com/minibase-ai/detoxify-medium
cd detoxify-medium

# Install dependencies
pip install -r requirements.txt

# Run tests
python -m pytest tests/
```

## Citation

If you use Detoxify-Medium in your research, please cite:

```bibtex
@misc{detoxify-medium-2025,
  title={Detoxify-Medium: A High-Capacity Text Detoxification Model},
  author={Minibase AI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/Minibase/Detoxify-Language-Medium}
}
```

## Contact & Community

- **Website**: [minibase.ai](https://minibase.ai)
- **Discord Community**: [Join our Discord](https://discord.com/invite/BrJn4D2Guh)
- **Bug Reports & Feature Requests**: [Report bugs or request features on Discord](https://discord.com/invite/BrJn4D2Guh)
- **Email**: hello@minibase.ai

### Support
- **Documentation**: [help.minibase.ai](https://help.minibase.ai)
- **Community Forum**: [Join our Discord Community](https://discord.com/invite/BrJn4D2Guh)

## License

This model is released under the [Apache License 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Acknowledgments

- **ParaDetox Dataset**: Used for benchmarking and evaluation
- **llama.cpp**: For efficient local inference
- **Hugging Face**: For model hosting and community
- **Our amazing community**: For feedback and contributions

---

<div align="center">

**Built with ❤️ by the Minibase team**

*Making AI more accessible for everyone*

[Minibase Help Center](https://help.minibase.ai) • [Join our Discord](https://discord.com/invite/BrJn4D2Guh)

</div>