tomerz14 committed
Commit e459f02 (verified) · 1 Parent(s): 4cf9509

Update README.md

Files changed (1): README.md (+84 -14)
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
- title: Binary Doc Classifier (Chunked)
  emoji: 📄
  colorFrom: indigo
- colorTo: purple
  sdk: gradio
  sdk_version: 4.44.0
  app_file: app.py
@@ -10,27 +10,97 @@ pinned: false
  license: mit
  ---

- # Binary Document Classifier Gradio Space

- This Space hosts a Gradio app for **binary text classification** on uploaded documents.
- It supports long documents by **chunking** (512-token windows with overlap) and aggregates
- chunk probabilities into a **document-level** prediction.

- ## Configuration

- Set the following **Space variables** in the UI (Settings → Variables):

- - `MODEL_ID` — your trained model repo (e.g., `your-username/bert-binclass`)
- - `MAX_LENGTH` — tokens per chunk (default: `512`)
- - `STRIDE` — overlap tokens between chunks (default: `128`)

- ## Local run

  ```bash
  pip install -r requirements.txt
  python app.py
  ```

- ## Notes

- - PDF extraction uses `pypdf` for simplicity.
 
  ---
+ title: AI vs Human Document Classifier
  emoji: 📄
  colorFrom: indigo
+ colorTo: blue
  sdk: gradio
  sdk_version: 4.44.0
  app_file: app.py
  pinned: false
  license: mit
  ---

+ # 🔎 AI vs Human Document Classifier

+ This **Gradio Space** lets you upload a document (TXT, MD, HTML, or PDF) and predicts whether it was **AI-generated** or **Human-written**.

+ The app supports **long documents** by splitting them into overlapping 512‑token chunks and aggregating predictions to provide an overall document‑level probability.
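
The windowing described above can be sketched in plain Python over a list of token IDs (a hypothetical illustration, not the Space's actual code in `app.py`; with Hugging Face tokenizers the same effect usually comes from `return_overflowing_tokens=True` together with a `stride`):

```python
def chunk_tokens(token_ids, max_length=512, stride=128):
    """Split a token-ID sequence into overlapping windows.

    Each window holds up to `max_length` tokens; consecutive windows
    overlap by `stride` tokens so no context is lost at chunk boundaries.
    """
    if max_length <= stride:
        raise ValueError("max_length must exceed stride")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # last window already covers the tail
    return chunks

# With the defaults, a 1000-token document yields windows starting at 0, 384, 768.
windows = chunk_tokens(list(range(1000)))
```

Each window is then classified independently before the per-chunk probabilities are aggregated.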
+ ---
+
+ ## ✨ Features
+
+ ✅ **Interactive Interface**
+ - Upload documents directly (TXT, MD, HTML, PDF)
+ - Displays clean probability bars for *AI‑generated* vs *Human‑written*
+ - Shows a **confidence badge** (“Likely AI” / “Likely Human”) with traffic‑light colors
+ - Separate **Basic** and **Advanced** tabs for simplicity
+ - A **Chunk Details** accordion with per‑chunk probabilities for deeper inspection
+
+ **Configurable Parameters**
+ - Adjust `MAX_LENGTH` and `STRIDE` for token chunking
+ - Choose aggregation method (`mean` or `max`) across chunks
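
The two aggregation choices can be pictured with a small Python sketch (hypothetical; the app's own implementation may differ): `mean` averages over all chunks, while `max` flags a document if any single chunk looks strongly AI-generated.

```python
def aggregate(chunk_probs, method="mean"):
    """Combine per-chunk AI probabilities into one document-level score."""
    if method == "mean":
        return sum(chunk_probs) / len(chunk_probs)
    if method == "max":
        return max(chunk_probs)
    raise ValueError(f"unknown method: {method}")

probs = [0.91, 0.40, 0.85]          # per-chunk P(AI) for a 3-chunk document
mean_score = aggregate(probs, "mean")  # ≈ 0.72
max_score = aggregate(probs, "max")    # 0.91
```

`mean` is more robust to a single noisy chunk; `max` is more sensitive when only part of a document was machine-written.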
+
+ **Fully local**
+ - No Hub API calls beyond model loading
+ - Runs on CPU, GPU, or MPS automatically
+
+ ---
+
+ ## ⚙️ Environment Variables
+
+ You can configure your Space in **Settings → Variables**:
+
+ | Variable | Description | Default |
+ |-----------|--------------|----------|
+ | `MODEL_ID` | Hugging Face repo ID of your model | `bert-base-uncased` |
+ | `MAX_LENGTH` | Tokens per chunk | `512` |
+ | `STRIDE` | Overlap tokens between chunks | `128` |
+
+ Example:
+ ```
+ MODEL_ID=your-username/bert-binclass
+ MAX_LENGTH=512
+ STRIDE=128
+ ```
+
+ ---
+
+ ## 🧠 Example Workflow
+
+ 1. Train your binary classifier using `train.py` and push it to the Hub.
+ 2. Deploy this Space with your model:
+    - Set the Space variable `MODEL_ID` to your repo.
+ 3. Upload any text file — the app will:
+    - Chunk the text
+    - Run inference on each chunk
+    - Show probabilities like:
+
+ ```
+ AI generated: 0.82
+ Human written: 0.18
+ ```
+
+ and a color‑coded **confidence badge**.
+
+ ---
+
+ ## 🚀 Run Locally

  ```bash
  pip install -r requirements.txt
  python app.py
  ```

+ Then open the Gradio URL shown in your terminal.
+
+ ---
+
+ ## 🖼️ UI Preview
+
+ > ![screenshot placeholder](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/gradio-placeholder.png)
+ >
+ > *Top: prediction and probabilities; bottom: per‑chunk details.*
+
+ ---
+
+ ## 🧩 Notes
+
+ - PDF parsing uses [`pypdf`](https://pypi.org/project/pypdf/); for better results or OCR, consider [`pymupdf`](https://pypi.org/project/PyMuPDF/) or [`unstructured`](https://github.com/Unstructured-IO/unstructured).
+ - The color scheme is based on the **Soft Indigo** theme for a calm, modern feel.
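
The extraction step for the supported upload formats can be sketched with the standard library alone (a hypothetical illustration: TXT/MD pass through, HTML is stripped of tags; the PDF branch is omitted here because the Space delegates it to `pypdf`, whose real API is `PdfReader(path).pages[i].extract_text()`):

```python
from html.parser import HTMLParser
from pathlib import Path

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(path):
    """Dispatch on file extension and return plain text."""
    suffix = Path(path).suffix.lower()
    raw = Path(path).read_text(encoding="utf-8", errors="ignore")
    if suffix in (".txt", ".md"):
        return raw
    if suffix in (".html", ".htm"):
        parser = _TextExtractor()
        parser.feed(raw)
        return " ".join(p.strip() for p in parser.parts if p.strip())
    raise ValueError(f"unsupported format: {suffix}")
```

Whatever extractor is used, the resulting plain text is what gets tokenized and chunked.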
+
+ ---
+
+ ## 🪪 License
+
+ MIT. Feel free to modify and re‑deploy.