tomerz14 committed
Commit e459f02 (verified) · 1 Parent(s): 4cf9509

Update README.md

Files changed (1): README.md (+84 -14)
README.md CHANGED
@@ -1,8 +1,8 @@
  ---
- title: Binary Doc Classifier (Chunked)
  emoji: 📄
  colorFrom: indigo
- colorTo: purple
  sdk: gradio
  sdk_version: 4.44.0
  app_file: app.py
@@ -10,27 +10,97 @@ pinned: false
  license: mit
  ---

- # Binary Document Classifier Gradio Space

- This Space hosts a Gradio app for **binary text classification** on uploaded documents.
- It supports long documents by **chunking** (512-token windows with overlap) and aggregates
- chunk probabilities into a **document-level** prediction.

- ## Configuration

- Set the following **Space variables** in the UI (Settings → Variables):

- - `MODEL_ID` — your trained model repo (e.g., `your-username/bert-binclass`)
- - `MAX_LENGTH` — tokens per chunk (default: `512`)
- - `STRIDE` — overlap tokens between chunks (default: `128`)

- ## Local run

  ```bash
  pip install -r requirements.txt
  python app.py
  ```

- ## Notes

- - PDF extraction uses `pypdf` for simplicity.
 
  ---
+ title: AI vs Human Document Classifier
  emoji: 📄
  colorFrom: indigo
+ colorTo: blue
  sdk: gradio
  sdk_version: 4.44.0
  app_file: app.py
  pinned: false
  license: mit
  ---

+ # 🔎 AI vs Human Document Classifier

+ This **Gradio Space** lets you upload a document (TXT, MD, HTML, or PDF) and predicts whether it was **AI-generated** or **Human-written**.

+ The app supports **long documents** by splitting them into overlapping 512‑token chunks and aggregating predictions to provide an overall document‑level probability.
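
The windowing described above can be sketched in plain Python over a list of token IDs (a hypothetical illustration, not the Space's actual code in `app.py`; with Hugging Face tokenizers the same effect usually comes from `return_overflowing_tokens=True` together with a `stride`):

```python
def chunk_tokens(token_ids, max_length=512, stride=128):
    """Split a token-ID sequence into overlapping windows.

    Each window holds up to `max_length` tokens; consecutive windows
    overlap by `stride` tokens so no context is lost at chunk boundaries.
    """
    if max_length <= stride:
        raise ValueError("max_length must exceed stride")
    step = max_length - stride
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # last window already covers the tail
    return chunks

# With the defaults, a 1000-token document yields windows starting at 0, 384, 768.
windows = chunk_tokens(list(range(1000)))
```

Each window is then classified independently before the per-chunk probabilities are aggregated.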
+ ---
+
+ ## ✨ Features
+
+ ✅ **Interactive Interface**
+ - Upload documents directly (TXT, MD, HTML, PDF)
+ - Displays clean probability bars for *AI‑generated* vs *Human‑written*
+ - Shows a **confidence badge** (“Likely AI” / “Likely Human”) with traffic‑light colors
+ - Separate **Basic** and **Advanced** tabs for simplicity
+ - A **Chunk Details** accordion with per‑chunk probabilities for deeper inspection
+
+ **Configurable Parameters**
+ - Adjust `MAX_LENGTH` and `STRIDE` for token chunking
+ - Choose aggregation method (`mean` or `max`) across chunks
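
The two aggregation choices can be pictured with a small Python sketch (hypothetical; the app's own implementation may differ): `mean` averages over all chunks, while `max` flags a document if any single chunk looks strongly AI-generated.

```python
def aggregate(chunk_probs, method="mean"):
    """Combine per-chunk AI probabilities into one document-level score."""
    if method == "mean":
        return sum(chunk_probs) / len(chunk_probs)
    if method == "max":
        return max(chunk_probs)
    raise ValueError(f"unknown method: {method}")

probs = [0.91, 0.40, 0.85]          # per-chunk P(AI) for a 3-chunk document
mean_score = aggregate(probs, "mean")  # ≈ 0.72
max_score = aggregate(probs, "max")    # 0.91
```

`mean` is more robust to a single noisy chunk; `max` is more sensitive when only part of a document was machine-written.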
+
+ **Fully local**
+ - No Hub API calls beyond model loading
+ - Runs on CPU, GPU, or MPS automatically
+
+ ---
+
+ ## ⚙️ Environment Variables
+
+ You can configure your Space in **Settings → Variables**:
+
+ | Variable | Description | Default |
+ |-----------|--------------|----------|
+ | `MODEL_ID` | Hugging Face repo ID of your model | `bert-base-uncased` |
+ | `MAX_LENGTH` | Tokens per chunk | `512` |
+ | `STRIDE` | Overlap tokens between chunks | `128` |
+
+ Example:
+ ```
+ MODEL_ID=your-username/bert-binclass
+ MAX_LENGTH=512
+ STRIDE=128
+ ```
+
+ ---
+
+ ## 🧠 Example Workflow
+
+ 1. Train your binary classifier using `train.py` and push it to the Hub.
+ 2. Deploy this Space with your model:
+    - Set the Space variable `MODEL_ID` to your repo.
+ 3. Upload any text file — the app will:
+    - Chunk the text
+    - Run inference on each chunk
+    - Show probabilities like:
+
+ ```
+ AI generated: 0.82
+ Human written: 0.18
+ ```
+
+ and a color‑coded **confidence badge**.
+
+ ---
+
+ ## 🚀 Run Locally

  ```bash
  pip install -r requirements.txt
  python app.py
  ```

+ Then open the Gradio URL shown in your terminal.
+
+ ---
+
+ ## 🖼️ UI Preview
+
+ > ![screenshot placeholder](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/gradio-placeholder.png)
+ >
+ > *Top: prediction and probabilities; bottom: per‑chunk details.*
+
+ ---
+
+ ## 🧩 Notes
+
+ - PDF parsing uses [`pypdf`](https://pypi.org/project/pypdf/); for better results or OCR, consider [`pymupdf`](https://pypi.org/project/PyMuPDF/) or [`unstructured`](https://github.com/Unstructured-IO/unstructured).
+ - The color scheme is based on the **Soft Indigo** theme for a calm, modern feel.
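
The extraction step for the supported upload formats can be sketched with the standard library alone (a hypothetical illustration: TXT/MD pass through, HTML is stripped of tags; the PDF branch is omitted here because the Space delegates it to `pypdf`, whose real API is `PdfReader(path).pages[i].extract_text()`):

```python
from html.parser import HTMLParser
from pathlib import Path

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def extract_text(path):
    """Dispatch on file extension and return plain text."""
    suffix = Path(path).suffix.lower()
    raw = Path(path).read_text(encoding="utf-8", errors="ignore")
    if suffix in (".txt", ".md"):
        return raw
    if suffix in (".html", ".htm"):
        parser = _TextExtractor()
        parser.feed(raw)
        return " ".join(p.strip() for p in parser.parts if p.strip())
    raise ValueError(f"unsupported format: {suffix}")
```

Whatever extractor is used, the resulting plain text is what gets tokenized and chunked.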
+
+ ---
+
+ ## 🪪 License
+
+ MIT. Feel free to modify and re‑deploy.