metadata

title: PDF Character Counter
emoji: 🚀
colorFrom: red
colorTo: red
sdk: docker
app_port: 8501
tags:
  - streamlit
pinned: false
short_description: A simple app that counts characters including spaces in PDF

PDF Character Counter

A simple and reliable PDF character counting tool that extracts text from PDF documents and counts only the actual content.

Unlike many basic PDF counters, this tool automatically detects and removes:

Page numbers
Repeated headers
Repeated footers
Running chapter headers

This makes it useful for academic papers, theses, reports, assignments, and other documents where administrative text should not be included in the final character count.

Features

✅ Accurate PDF text extraction

✅ Automatic page number removal

✅ Automatic header detection

✅ Automatic footer detection

✅ Running chapter header detection

✅ Per-page character statistics

✅ Optional page exclusion

✅ Detailed diagnostics showing removed content

How It Works

The application processes PDFs in six steps:

1. Extract all text blocks from every page.
2. Detect recurring headers and footers based on position and repetition.
3. Identify page numbers using pattern matching.
4. Detect running chapter headers.
5. Remove non-content elements.
6. Count characters in the remaining text.

Only the cleaned content is included in the final count.
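
The first and last steps above (extraction and counting) could look roughly like the sketch below. It is not the app's actual code: the function names `extract_blocks` and `count_characters` are hypothetical, and the choice to count spaces but not line breaks is an assumption. The PyMuPDF calls used are `fitz.open`, `page.get_text("blocks")`, and `page.rect`.

```python
def extract_blocks(pdf_path):
    """Return one dict per page: its height and a list of (y0, y1, text) blocks."""
    import fitz  # PyMuPDF; imported lazily so count_characters works without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = []
            # get_text("blocks") yields (x0, y0, x1, y1, text, block_no, block_type)
            for x0, y0, x1, y1, text, _no, block_type in page.get_text("blocks"):
                if block_type == 0 and text.strip():  # 0 = text block, 1 = image block
                    blocks.append((y0, y1, text.strip()))
            pages.append({"height": page.rect.height, "blocks": blocks})
    return pages


def count_characters(text):
    """Count characters including spaces; line breaks are skipped (assumption)."""
    return len(text.replace("\r", "").replace("\n", ""))
```

Keeping the vertical coordinates of each block is what makes position-based header and footer detection possible in the later steps.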

What Gets Removed

Page Numbers

Examples:

Page 12
12
12 / 120
12 of 120
Side 12 (Danish for "Page 12")
12 af 120 (Danish for "12 of 120")
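
A pattern covering the formats listed above could be written as follows. This is an illustrative sketch: `PAGE_NUMBER_RE` and `is_page_number` are hypothetical names, and the app's real pattern may be stricter or broader.

```python
import re

# Hypothetical pattern for the English and Danish formats in the examples.
PAGE_NUMBER_RE = re.compile(
    r"""^\s*
    (?:(?:page|side)\s+)?      # optional "Page" / "Side" prefix
    \d+                        # the page number itself
    (?:\s*(?:/|of|af)\s*\d+)?  # optional "/ 120", "of 120", "af 120"
    \s*$""",
    re.IGNORECASE | re.VERBOSE,
)


def is_page_number(text):
    """Return True if a text block looks like a standalone page number."""
    return PAGE_NUMBER_RE.match(text) is not None
```

A bare number such as `12` also matches, so a pattern like this is safest when applied only to short blocks near the top or bottom of a page.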

Running Headers

Examples:

2.1 Methods 12
4.3 Results iv
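
Both examples share one shape: a section number, a title, and a trailing page number in Arabic or Roman numerals. A minimal sketch of a matcher for that shape is shown below. The names `ROMAN`, `RUNNING_HEADER_RE`, and `is_running_header` are hypothetical, and the app may use different rules.

```python
import re

# Roman numerals up to 3999; the lookahead rejects the empty string.
ROMAN = r"(?=[ivxlcdm])m{0,3}(?:cm|cd|d?c{0,3})(?:xc|xl|l?x{0,3})(?:ix|iv|v?i{0,3})"

RUNNING_HEADER_RE = re.compile(
    r"^\s*\d+(?:\.\d+)*\s+"          # section number such as "2.1" or "4"
    r"\S.*?\s+"                      # section title
    r"(?:\d+|" + ROMAN + r")\s*$",   # trailing page number, Arabic or Roman
    re.IGNORECASE,
)


def is_running_header(text):
    """Return True if a line looks like '<section> <title> <page number>'."""
    return RUNNING_HEADER_RE.match(text) is not None
```

A body sentence that happens to start with a number and end with one would also match, so in practice a check like this would likely be combined with the block's position on the page.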

Repeated Headers and Footers

Text that appears on multiple pages in the top or bottom regions of the document is automatically detected and excluded.
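
Detection by position and repetition can be sketched as below. The input format (a list of page dicts with `height` and `blocks`), the function name, and the thresholds `margin=0.10` and `min_pages=3` are all assumptions made for illustration, not the app's settings.

```python
from collections import Counter


def find_repeated_margin_text(pages, margin=0.10, min_pages=3):
    """Return (headers, footers): texts that recur in the top/bottom page margins.

    `pages` is a list of dicts: {"height": float, "blocks": [(y0, y1, text), ...]}.
    Coordinates follow PyMuPDF's convention: y grows downward from the top edge.
    """
    top, bottom = Counter(), Counter()
    for page in pages:
        seen_top, seen_bottom = set(), set()
        for y0, y1, text in page["blocks"]:
            key = " ".join(text.split())  # collapse whitespace before comparing
            if y1 <= page["height"] * margin:
                seen_top.add(key)
            elif y0 >= page["height"] * (1 - margin):
                seen_bottom.add(key)
        top.update(seen_top)        # count each text at most once per page
        bottom.update(seen_bottom)
    headers = {t for t, n in top.items() if n >= min_pages}
    footers = {t for t, n in bottom.items() if n >= min_pages}
    return headers, footers
```

Requiring a minimum number of pages keeps a one-off title or caption near the page edge from being treated as a header.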

Output

The tool returns:

Total character count
Character count per page
Included text
Detected headers
Detected footers
Detected running headers
Detected page numbers
Log of removed elements
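
One way to picture these outputs is as a single result object. The sketch below is hypothetical: `CountResult`, its field names, and `summarize` are not taken from the app. It also illustrates how optional page exclusion could feed into the per-page and total counts.

```python
from dataclasses import dataclass, field


@dataclass
class CountResult:
    """Hypothetical container mirroring the outputs listed above."""
    total_characters: int = 0
    characters_per_page: dict = field(default_factory=dict)  # page number -> count
    included_text: str = ""
    headers: list = field(default_factory=list)
    footers: list = field(default_factory=list)
    running_headers: list = field(default_factory=list)
    page_numbers: list = field(default_factory=list)
    removal_log: list = field(default_factory=list)


def summarize(cleaned_pages, excluded_pages=()):
    """Build a result from {page_number: cleaned_text}, skipping excluded pages."""
    result = CountResult()
    kept = []
    for number, text in sorted(cleaned_pages.items()):
        if number in excluded_pages:
            result.removal_log.append(f"page {number}: excluded by user")
            continue
        result.characters_per_page[number] = len(text)
        kept.append(text)
    result.included_text = "\n".join(kept)
    result.total_characters = sum(result.characters_per_page.values())
    return result
```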

Use Cases

Academic assignments
University theses
Research papers
Government reports
Technical documentation
Publication character limit verification

Technology

Built with:

Python
PyMuPDF (fitz)
Regular Expressions
Hugging Face Spaces

Limitations

Scanned PDFs without embedded text require OCR before processing.
Very unusual document layouts may affect automatic header/footer detection.
Character counts are based on extracted text and may differ slightly from counts generated by word processors.
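
As an illustration of the last point, extracted text can contain ligature characters and line breaks that a word processor would count differently. The sketch below compares a few counting conventions on the same string. `count_variants` is a hypothetical helper, and which convention the app follows is not stated here.

```python
import unicodedata


def count_variants(text):
    """Return character counts for one string under different conventions."""
    return {
        "raw": len(text),                                  # text exactly as extracted
        "nfkc": len(unicodedata.normalize("NFKC", text)),  # ligatures expanded to letters
        "without_line_breaks": len(text.replace("\n", "")),
    }
```

For example, a string containing the single-character ligature U+FB03 ("ffi") counts two characters more after NFKC normalization, because the ligature expands to three letters.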

License

MIT License

Author

Created by Daniel Hjerresen.