---
title: AI vs Human Document Classifier
emoji: 📄
colorFrom: indigo
colorTo: blue
sdk: gradio
sdk_version: 5.49.0
app_file: app.py
pinned: false
license: mit
---

# 🔎 AI vs Human — Document Classifier

This **Gradio Space** lets you upload a document (TXT, MD, HTML, or PDF) and predicts whether it was **AI-generated** or **Human-written**.

The app supports **long documents** by splitting them into overlapping 512‑token chunks and aggregating predictions to provide an overall document‑level probability.
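The overlapping-chunk idea can be sketched in plain Python. This is a minimal illustration, not the code in `app.py`: the function name `chunk_token_ids` is hypothetical, and it assumes `stride` means the number of tokens shared between neighbouring chunks (as in the variables table). A real tokenizer also reserves room for special tokens, which this sketch ignores.

```python
from typing import List


def chunk_token_ids(token_ids: List[int], max_length: int = 512, stride: int = 128) -> List[List[int]]:
    """Split token IDs into windows of at most max_length that overlap by stride tokens.

    Hypothetical helper for illustration; the app's own chunking may differ.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if not token_ids:
        return []

    step = max_length - stride  # how far each new window advances
    chunks = []
    start = 0
    while True:
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # this window already reached the end of the document
        start += step
    return chunks


if __name__ == "__main__":
    chunks = chunk_token_ids(list(range(1000)))
    print([len(c) for c in chunks])
```

With the defaults, each window advances by 384 tokens, so the last 128 tokens of one chunk reappear at the start of the next.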

---

## ✨ Features

✅ **Interactive Interface**
- Upload documents directly (TXT, MD, HTML, PDF)
- Displays clean probability bars for *AI‑generated* vs *Human‑written*  
- Shows a **confidence badge** (“Likely AI” / “Likely Human”) with traffic‑light colors  
- Separate **Basic** and **Advanced** tabs for simplicity
- A **Chunk Details** accordion with per‑chunk probabilities for deeper inspection

✅ **Configurable Parameters**
- Adjust `MAX_LENGTH` and `STRIDE` for token chunking  
- Choose aggregation method (`mean` or `max`) across chunks
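As a rough sketch of what the two aggregation options might compute, assuming each chunk yields one AI-generated probability (the function name `aggregate_ai_probability` is hypothetical and not taken from `app.py`):

```python
from statistics import fmean
from typing import Sequence


def aggregate_ai_probability(chunk_probs: Sequence[float], method: str = "mean") -> float:
    """Combine per-chunk AI probabilities into one document-level score.

    Illustrative only: 'mean' averages the chunks, 'max' keeps the most
    AI-like chunk.
    """
    if not chunk_probs:
        raise ValueError("need at least one chunk probability")
    if method == "mean":
        return fmean(chunk_probs)
    if method == "max":
        return max(chunk_probs)
    raise ValueError(f"unknown aggregation method: {method!r}")


if __name__ == "__main__":
    probs = [0.9, 0.7, 0.2]
    print(aggregate_ai_probability(probs, "mean"))
    print(aggregate_ai_probability(probs, "max"))
```

Under this reading, `mean` smooths over the whole document, while `max` flags a document as soon as any single chunk looks strongly AI-generated.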

✅ **Fully local**
- No Hub API calls beyond model loading
- Runs on CPU, GPU, or MPS automatically
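Automatic device selection is commonly done along the following lines with PyTorch. This is an assumption about the approach, not a copy of `app.py`; the helper name `pick_device` is hypothetical, and it falls back to `"cpu"` when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what is available.

    Hypothetical helper; prefers CUDA, then Apple MPS, then CPU.
    """
    try:
        import torch  # third-party; optional for this sketch
    except ImportError:
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"
    # Older PyTorch builds may not expose torch.backends.mps at all.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```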

---

## ⚙️ Environment Variables

You can configure your Space in **Settings → Variables**:

| Variable | Description | Default |
|-----------|--------------|----------|
| `MODEL_ID` | Hugging Face repo ID of your model | `bert-base-uncased` |
| `MAX_LENGTH` | Tokens per chunk | `512` |
| `STRIDE` | Overlap tokens between chunks | `128` |

Example:
```
MODEL_ID=your-username/bert-binclass
MAX_LENGTH=512
STRIDE=128
```
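One way an app could read these variables, with the defaults from the table above, is sketched here. The `Settings` class and `load_settings` function are hypothetical names used for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    model_id: str
    max_length: int
    stride: int


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read MODEL_ID, MAX_LENGTH and STRIDE, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return Settings(
        model_id=env.get("MODEL_ID", "bert-base-uncased"),
        max_length=int(env.get("MAX_LENGTH", "512")),
        stride=int(env.get("STRIDE", "128")),
    )


if __name__ == "__main__":
    print(load_settings())
```

Passing a plain dictionary instead of `os.environ` makes the function easy to try out without changing the real environment.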

---

## 🧠 Example Workflow

1. Train your binary classifier using `train.py` and push it to the Hub.  
2. Deploy this Space with your model:
   - Set the Space variable `MODEL_ID` to your repo.  
3. Upload any text file — the app will:
   - Chunk the text  
   - Run inference on each chunk  
   - Show probabilities like:

```
AI generated: 0.82
Human written: 0.18
```

The app also shows a color‑coded **confidence badge**.
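The output step might look roughly like this. The sketch assumes a single document-level AI probability and a 0.5 decision threshold for the badge label; both the threshold and the name `format_result` are assumptions, and badge colors are left out because the README does not specify them.

```python
from typing import Tuple


def format_result(p_ai: float, threshold: float = 0.5) -> Tuple[str, str]:
    """Return (probability text, badge label) for a document-level AI probability.

    The 0.5 threshold is an illustrative assumption, not the app's setting.
    """
    if not 0.0 <= p_ai <= 1.0:
        raise ValueError("p_ai must be between 0 and 1")
    text = f"AI generated: {p_ai:.2f}\nHuman written: {1.0 - p_ai:.2f}"
    badge = "Likely AI" if p_ai >= threshold else "Likely Human"
    return text, badge


if __name__ == "__main__":
    text, badge = format_result(0.82)
    print(text)
    print(badge)
```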

---

## 🚀 Run Locally

```bash
pip install -r requirements.txt
python app.py
```

Then open the Gradio URL shown in your terminal.

---

## 🖼️ UI Preview

> ![screenshot placeholder](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/gradio-placeholder.png)
>
> *Top: prediction and probabilities; bottom: per‑chunk details.*

---

## 🧩 Notes

- PDF parsing uses [`pypdf`](https://pypi.org/project/pypdf/); for better results or OCR, consider [`pymupdf`](https://pypi.org/project/PyMuPDF/) or [`unstructured`](https://github.com/Unstructured-IO/unstructured).
- The color scheme is based on the **Soft Indigo** theme for a calm, modern feel.
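For the PDF note above, text extraction for the supported file types could be sketched as follows. The function name `extract_text` and the HTML handling are assumptions for illustration; only the use of `pypdf` for PDFs comes from this README. `pypdf` is imported lazily, so it is only needed when a PDF is actually processed.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def extract_text(path: str) -> str:
    """Return plain text from a TXT, MD, HTML, or PDF file (illustrative helper)."""
    p = Path(path)
    suffix = p.suffix.lower()
    if suffix == ".pdf":
        from pypdf import PdfReader  # third-party, needed only for PDFs

        reader = PdfReader(str(p))
        # extract_text() can yield nothing for image-only pages; such pages need OCR.
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    raw = p.read_text(encoding="utf-8", errors="replace")
    if suffix in (".html", ".htm"):
        parser = _TextOnly()
        parser.feed(raw)
        parser.close()
        return "\n".join(parser.parts)
    return raw  # TXT and MD are passed through unchanged
```

Scanned PDFs typically come back empty from this path, which is where an OCR-capable tool such as those mentioned above would be needed.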

---

## 🪪 License

MIT — feel free to modify and re‑deploy.