| --- |
| title: PDF Character Counter |
| emoji: π |
| colorFrom: red |
| colorTo: red |
| sdk: docker |
| app_port: 8501 |
| tags: |
| - streamlit |
| pinned: false |
| short_description: A simple app that counts characters including spaces in PDF |
| --- |
| |
| # PDF Character Counter |
|
|
| A simple PDF character counter that extracts the text from a PDF document and counts only the body content, spaces included. |
|
|
| Unlike many basic PDF counters, this tool automatically detects and removes: |
|
|
| * Page numbers |
| * Repeated headers |
| * Repeated footers |
| * Running chapter headers |
|
|
| This makes it useful for academic papers, theses, reports, assignments, and other documents where administrative text should not be included in the final character count. |
|
|
| ## Features |
|
|
| ✅ Accurate PDF text extraction |
|
|
| ✅ Automatic page number removal |
|
|
| ✅ Automatic header detection |
|
|
| ✅ Automatic footer detection |
|
|
| ✅ Running chapter header detection |
|
|
| ✅ Per-page character statistics |
|
|
| ✅ Optional page exclusion |
|
|
| ✅ Detailed diagnostics showing removed content |
|
|
| ## How It Works |
|
|
| The application processes PDFs in six steps: |
|
|
| 1. Extract all text blocks from every page. |
| 2. Detect recurring headers and footers based on position and repetition. |
| 3. Identify page numbers using pattern matching. |
| 4. Detect running chapter headers. |
| 5. Remove non-content elements. |
| 6. Count characters in the remaining text. |
|
|
| Only the cleaned content is included in the final count. |
|
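The final counting step can be sketched as follows. This is a minimal illustration, not the app's actual code: the function name `count_characters`, the `is_noise` predicate, and the choice to join the lines of a page with single spaces are all assumptions made for the example.

```python
from typing import Callable, Iterable


def count_characters(
    pages: Iterable[list[str]],
    is_noise: Callable[[str], bool],
) -> tuple[int, list[int]]:
    """Return (total, per-page counts) for the lines that are not noise.

    Spaces count as characters. Lines on a page are joined with a single
    space, which is an assumption of this sketch.
    """
    per_page = []
    for lines in pages:
        kept = [line.strip() for line in lines if not is_noise(line)]
        kept = [line for line in kept if line]  # drop lines that are now empty
        per_page.append(len(" ".join(kept)))
    return sum(per_page), per_page


pages = [
    ["My Thesis", "Hello world.", "1"],
    ["My Thesis", "Second page.", "2"],
]
noise = {"My Thesis", "1", "2"}  # hypothetical header and page numbers
total, per_page = count_characters(pages, lambda line: line.strip() in noise)
print(total, per_page)  # 24 [12, 12]
```

Passing the noise test in as a predicate keeps the counting logic independent of how headers, footers, and page numbers are detected.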
|
| ## What Gets Removed |
|
|
| ### Page Numbers |
|
|
| Examples: |
|
|
| ```text |
| Page 12 |
| 12 |
| 12 / 120 |
| 12 of 120 |
| Side 12 |
| 12 af 120 |
| ``` |
|
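One regular expression that covers the six forms listed above might look like the sketch below. The pattern and the helper name `is_page_number` are assumptions for illustration; the app's real patterns may differ. Note that a bare number such as `12` can also be genuine content, so a check like this is presumably combined with the position of the line on the page.

```python
import re

# Optional "Page"/"Side" label, a number, and an optional total
# written as "/ N", "of N", or "af N".
PAGE_NUMBER_RE = re.compile(
    r"^\s*(?:(?:page|side)\s+)?\d+(?:\s*(?:/|of|af)\s*\d+)?\s*$",
    re.IGNORECASE,
)


def is_page_number(line: str) -> bool:
    """True if the whole line looks like one of the page-number forms."""
    return PAGE_NUMBER_RE.match(line) is not None


samples = ["Page 12", "12", "12 / 120", "12 of 120", "Side 12", "12 af 120"]
print(all(is_page_number(s) for s in samples))  # True
print(is_page_number("Chapter 12"))  # False
```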
|
| ### Running Headers |
|
|
| Examples: |
|
|
| ```text |
| 2.1 Methods 12 |
| 4.3 Results iv |
| ``` |
|
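Lines of this shape (section number, title, trailing page number) can be matched with a pattern like the one sketched here. The regular expression and the name `is_running_header` are hypothetical, not taken from the app. The pattern is deliberately loose and would also match a body sentence that starts with a number and ends with one, so it only makes sense for lines already known to sit in the header region.

```python
import re

# Section number ("2.1"), a title, then a trailing page number that is
# either Arabic ("12") or a Roman numeral ("iv").
RUNNING_HEADER_RE = re.compile(
    r"^\s*\d+(?:\.\d+)*\s+\S.*?\s+(?:\d+|[ivxlcdm]+)\s*$",
    re.IGNORECASE,
)


def is_running_header(line: str) -> bool:
    """True if the line looks like '<section number> <title> <page number>'."""
    return RUNNING_HEADER_RE.match(line) is not None


print(is_running_header("2.1 Methods 12"))  # True
print(is_running_header("4.3 Results iv"))  # True
print(is_running_header("2.1 Methods"))     # False
```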
|
| ### Repeated Headers and Footers |
|
|
| Text that appears on multiple pages in the top or bottom region of the page is automatically detected and excluded. |
|
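Detection by position and repetition could work roughly like this. The sketch assumes each page is a list of blocks with vertical coordinates measured from the top of the page. The `Block` type, the 10% margin, and the "at least half of the pages" rule are illustrative assumptions, not the app's actual thresholds.

```python
import math
from collections import Counter
from typing import NamedTuple


class Block(NamedTuple):
    y0: float  # top edge of the block, measured from the top of the page
    y1: float  # bottom edge of the block
    text: str


def find_repeated(pages, page_height, margin=0.1, min_share=0.5):
    """Return (headers, footers): texts repeated in the top or bottom margin.

    A text qualifies when it sits entirely inside the margin band on at
    least `min_share` of the pages, and on at least two pages.
    """
    top, bottom = Counter(), Counter()
    for blocks in pages:
        seen_top, seen_bottom = set(), set()
        for block in blocks:
            key = " ".join(block.text.split())  # collapse whitespace
            if not key:
                continue
            if block.y1 <= page_height * margin:
                seen_top.add(key)
            elif block.y0 >= page_height * (1 - margin):
                seen_bottom.add(key)
        top.update(seen_top)  # count each text once per page
        bottom.update(seen_bottom)
    needed = max(2, math.ceil(len(pages) * min_share))
    headers = {text for text, n in top.items() if n >= needed}
    footers = {text for text, n in bottom.items() if n >= needed}
    return headers, footers


pages = [
    [
        Block(20, 40, "My Thesis"),
        Block(300, 320, f"Body text {i}"),
        Block(760, 780, "Example University"),
    ]
    for i in range(4)
]
headers, footers = find_repeated(pages, page_height=800)
print(headers, footers)  # {'My Thesis'} {'Example University'}
```

Text repeated in the middle of the page is never flagged, because only blocks inside the margin bands are counted.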
|
| ## Output |
|
|
| The tool returns: |
|
|
| * Total character count |
| * Character count per page |
| * Included text |
| * Detected headers |
| * Detected footers |
| * Detected running headers |
| * Detected page numbers |
| * Log of removed elements |
|
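As an illustration only, the items above could be grouped in a single result object. The class name `CountResult` and its field names are hypothetical; the app may structure its output differently.

```python
from dataclasses import dataclass, field


@dataclass
class CountResult:
    """One field per output item; all names are assumptions of this sketch."""

    total_characters: int = 0
    characters_per_page: list[int] = field(default_factory=list)
    included_text: str = ""
    headers: list[str] = field(default_factory=list)
    footers: list[str] = field(default_factory=list)
    running_headers: list[str] = field(default_factory=list)
    page_numbers: list[str] = field(default_factory=list)
    removal_log: list[str] = field(default_factory=list)


result = CountResult(total_characters=24, characters_per_page=[12, 12])
result.removal_log.append("page 1: removed page number '1'")
print(result.total_characters == sum(result.characters_per_page))  # True
```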
|
| ## Use Cases |
|
|
| * Academic assignments |
| * University theses |
| * Research papers |
| * Government reports |
| * Technical documentation |
| * Publication character limit verification |
|
|
| ## Technology |
|
|
| Built with: |
|
|
| * Python |
| * PyMuPDF (fitz) |
| * Regular Expressions |
| * Hugging Face Spaces |
|
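A minimal sketch of block extraction with PyMuPDF is shown below. `page.get_text("blocks")` returns tuples of the form `(x0, y0, x1, y1, text, block_no, block_type)`, where `block_type` 0 marks a text block. The helper names `text_blocks` and `extract_pages` are assumptions for this example and are not taken from the app.

```python
def text_blocks(raw_blocks):
    """Keep text blocks as (y0, y1, text); drop image blocks and empty text."""
    kept = []
    for x0, y0, x1, y1, text, block_no, block_type in raw_blocks:
        if block_type == 0 and text.strip():
            kept.append((y0, y1, text.strip()))
    return kept


def extract_pages(path):
    """Return a list of (page_height, blocks), one entry per page of the PDF."""
    import fitz  # PyMuPDF; imported here so text_blocks() works without it

    pages = []
    with fitz.open(path) as doc:
        for page in doc:
            blocks = text_blocks(page.get_text("blocks"))
            pages.append((page.rect.height, blocks))
    return pages


# text_blocks() can be tried on hand-written tuples without opening a PDF.
raw = [
    (0, 10, 100, 20, "Title\n", 0, 0),
    (0, 30, 100, 200, "<image>", 1, 1),  # image block, dropped
    (0, 300, 100, 320, "  \n", 2, 0),    # whitespace only, dropped
]
print(text_blocks(raw))  # [(10, 20, 'Title')]
```

Keeping the vertical coordinates alongside the text is what makes position-based header and footer detection possible later.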
|
| ## Limitations |
|
|
| * Scanned PDFs without embedded text require OCR before processing. |
| * Very unusual document layouts may affect automatic header/footer detection. |
| * Character counts are based on extracted text and may differ slightly from counts generated by word processors. |
|
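A simple way to spot pages that probably need OCR is to look for pages whose extracted text is empty. The helper below is a hypothetical sketch that works on text that has already been extracted, one string per page.

```python
def pages_without_text(page_texts: list[str]) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is blank."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


print(pages_without_text(["Intro text", "", "  \n"]))  # [2, 3]
```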
|
| ## License |
|
|
| MIT License |
|
|
| ## Author |
|
|
| Created by Daniel Hjerresen. |
|
|