---
title: PDF Redaction API
emoji: 🔒
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
license: mit
---

# PDF Redaction API 🔒

Automatically redact sensitive information from PDF documents using Named Entity Recognition (NER).

## Features

- 🤖 **Powered by NER**: Uses state-of-the-art Named Entity Recognition
- 📄 **PDF Support**: Upload and process PDF documents
- 🎯 **Accurate Redaction**: Correctly positioned black rectangles over sensitive text
- 🚀 **Fast Processing**: Optimized OCR and NER pipeline
- 🔧 **Configurable**: Adjust DPI and filter entity types

## API Endpoints

### `POST /redact`

Upload a PDF file for redaction. The response is JSON (see Response Format) and includes a `job_id` for downloading the result.

**Parameters:**
- `file`: PDF file (required)
- `dpi`: OCR quality (default: 300)
- `entity_types`: Comma-separated entity types to redact (optional)

**Example using cURL:**

```bash
curl -X POST "https://your-space.hf.space/redact" \
  -F "file=@document.pdf" \
  -F "dpi=300"
```

**Example using Python:**

```python
import requests

url = "https://your-space.hf.space/redact"
params = {"dpi": 300}

# Open the PDF in a context manager so the file handle is always closed
with open("document.pdf", "rb") as pdf:
    response = requests.post(url, files={"file": pdf}, params=params)
response.raise_for_status()  # stop early on HTTP errors
result = response.json()

# Download redacted file
job_id = result["job_id"]
download_url = f"https://your-space.hf.space/download/{job_id}"
redacted_pdf = requests.get(download_url)
redacted_pdf.raise_for_status()

with open("redacted.pdf", "wb") as f:
    f.write(redacted_pdf.content)
```

### `GET /download/{job_id}`

Download the redacted PDF file.
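
The `redacted_file_url` field shown under Response Format is a relative path, so a client needs to combine it with the host of the Space. A minimal sketch, assuming the placeholder host used elsewhere in this README; the helper name `build_download_url` is hypothetical and not part of the API:

```python
from urllib.parse import urljoin

# Placeholder host used throughout this README; replace with your own Space.
BASE_URL = "https://your-space.hf.space"


def build_download_url(base_url: str, redacted_file_url: str) -> str:
    """Resolve the relative download path against the API host (hypothetical helper)."""
    return urljoin(base_url, redacted_file_url)


download_url = build_download_url(BASE_URL, "/download/uuid-here")
print(download_url)
```

Using `urljoin` avoids doubled or missing slashes when the base URL and the path are concatenated by hand.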

### `GET /health`

Check API health and model status.

### `GET /stats`

Get API statistics.

## Response Format

```json
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {
      "entity_type": "PER",
      "entity_text": "John Doe",
      "page": 1,
      "word_count": 2
    }
  ],
  "redacted_file_url": "/download/uuid-here"
}
```
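
A response in this shape can be summarized on the client side, for example to see how many entities of each type were redacted. The following is a minimal sketch that assumes only the fields shown above; the helper name `summarize_entities` is hypothetical and not part of the API:

```python
import json
from collections import Counter

# Sample payload copied from the response format above.
SAMPLE_RESPONSE = """
{
  "job_id": "uuid-here",
  "status": "completed",
  "message": "Successfully redacted 5 entities",
  "entities": [
    {"entity_type": "PER", "entity_text": "John Doe", "page": 1, "word_count": 2}
  ],
  "redacted_file_url": "/download/uuid-here"
}
"""


def summarize_entities(payload: dict) -> dict:
    """Count redacted entities per entity type (hypothetical helper)."""
    counts = Counter(e["entity_type"] for e in payload.get("entities", []))
    return dict(counts)


result = json.loads(SAMPLE_RESPONSE)
summary = summarize_entities(result)
print(summary)
```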

## Entity Types

Common entity types detected:
- `PER`: Person names
- `ORG`: Organizations
- `LOC`: Locations
- `DATE`: Dates
- `EMAIL`: Email addresses
- `PHONE`: Phone numbers
- And more...
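
Since `entity_types` is a comma-separated string, it can help to build the value from a list. This is a minimal sketch; the helper name `build_entity_types` is hypothetical, and upper-casing the labels is an assumption based on how the types are written above, not documented API behavior:

```python
def build_entity_types(types):
    """Join entity labels into a comma-separated string.

    Blank entries and duplicates are dropped; order is preserved.
    Upper-casing is an assumption based on the labels listed in this README.
    """
    seen = []
    for t in types:
        label = t.strip().upper()
        if label and label not in seen:
            seen.append(label)
    return ",".join(seen)


entity_types_value = build_entity_types(["PER", "email", "PER"])
print(entity_types_value)
```

The resulting string can then be passed as the `entity_types` parameter of `POST /redact`.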

## Local Development

### Prerequisites

- Python 3.10+
- Tesseract OCR
- Poppler utils

### Installation

```bash
# Install system dependencies (Ubuntu/Debian)
sudo apt-get install tesseract-ocr poppler-utils

# Install Python dependencies
pip install -r requirements.txt

# Run the server
python main.py
```

The API will be available at `http://localhost:7860`.

### Using Docker

```bash
# Build the image
docker build -t pdf-redaction-api .

# Run the container
docker run -p 7860:7860 pdf-redaction-api
```

## Configuration

Adjust the DPI parameter based on your needs:
- `150`: Fast processing, lower quality
- `300`: Recommended balance (default)
- `600`: High quality, slower processing
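
To get a feel for the trade-off, the arithmetic below estimates the image size of one page at each setting. It is only a sketch: it assumes each page is rasterized at the requested DPI before OCR and uses a US Letter page (8.5 x 11 inches) as the example; neither assumption is confirmed by this README, and `raster_size` is a hypothetical helper:

```python
def raster_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at the given DPI (hypothetical helper)."""
    return round(width_in * dpi), round(height_in * dpi)


# US Letter page, chosen only as an illustration.
for dpi in (150, 300, 600):
    w, h = raster_size(8.5, 11.0, dpi)
    print(f"{dpi} DPI: {w} x {h} px ({w * h / 1e6:.1f} megapixels)")
```

Doubling the DPI quadruples the pixel count per page, which is why higher settings are noticeably slower.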

## Limitations

- Maximum file size: Dependent on Space resources
- Processing time increases with page count and DPI
- Files are automatically cleaned up after processing

## Privacy

- Uploaded files are processed in-memory and deleted after redaction
- No data is stored permanently
- Use your own deployment for sensitive documents

## Credits

Built with:
- [FastAPI](https://fastapi.tiangolo.com/)
- [Transformers](https://huggingface.co/transformers/)
- [PyPDF](https://github.com/py-pdf/pypdf)
- [Tesseract OCR](https://github.com/tesseract-ocr/tesseract)

## License

MIT License - See LICENSE file for details