	---
	title: PDF Character Counter
	emoji: 🚀
	colorFrom: red
	colorTo: red
	sdk: docker
	app_port: 8501
	tags:
	- streamlit
	pinned: false
	short_description: A simple app that counts characters including spaces in PDF
	---

	# PDF Character Counter

	A simple and reliable PDF character counter that extracts the text from a PDF document and counts only the body content, spaces included.

	Unlike many basic PDF counters, this tool automatically detects and removes:

	* Page numbers
	* Repeated headers
	* Repeated footers
	* Running chapter headers

	This makes it useful for academic papers, theses, reports, assignments, and other documents where administrative text should not be included in the final character count.

	## Features

	✅ Accurate PDF text extraction

	✅ Automatic page number removal

	✅ Automatic header detection

	✅ Automatic footer detection

	✅ Running chapter header detection

	✅ Per-page character statistics

	✅ Optional page exclusion

	✅ Detailed diagnostics showing removed content

	## How It Works

	The application processes PDFs in six steps:

	1. Extract all text blocks from every page.
	2. Detect recurring headers and footers based on position and repetition.
	3. Identify page numbers using pattern matching.
	4. Detect running chapter headers.
	5. Remove non-content elements.
	6. Count characters in the remaining text.

	Only the cleaned content is included in the final count.
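The steps above could be sketched roughly as follows. This is not the app's actual source: the function names (`extract_blocks`, `count_characters`), the 10% margin, the `min_repeats` threshold, and the simplified page-number pattern are all assumptions made for illustration, and running chapter headers (step 4) are left out for brevity. The only PyMuPDF calls used are `fitz.open`, `page.get_text("blocks")`, and `page.rect`.

```python
import re
from collections import Counter

# Hypothetical pattern covering only bare numbers and "Page 12" / "Side 12".
PAGE_NUMBER = re.compile(r"^(?:page|side)?\s*\d+$", re.IGNORECASE)


def extract_blocks(pdf_path):
    """Step 1: return, per page, a list of (y0, y1, page_height, text) tuples."""
    import fitz  # PyMuPDF; imported lazily so the rest runs without it

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            height = page.rect.height
            blocks = [
                (b[1], b[3], height, b[4].strip())
                for b in page.get_text("blocks")
                if b[6] == 0  # block type 0 is text, 1 is an image
            ]
            pages.append(blocks)
    return pages


def count_characters(pages, margin=0.10, min_repeats=2):
    """Steps 2, 3, 5 and 6 on pre-extracted blocks; returns (total, per_page)."""

    def in_margin(y0, y1, height):
        return y1 <= height * margin or y0 >= height * (1 - margin)

    # Step 2: text that recurs in the top/bottom margin of several pages.
    # The inner set makes each text count at most once per page.
    seen = Counter(
        text
        for blocks in pages
        for text in {t for y0, y1, h, t in blocks if in_margin(y0, y1, h)}
    )
    repeated = {t for t, n in seen.items() if n >= min_repeats}

    per_page = []
    for blocks in pages:
        kept = [
            t
            for y0, y1, h, t in blocks
            # Steps 3 and 5: drop margin blocks that repeat or look like page numbers.
            if not (in_margin(y0, y1, h) and (t in repeated or PAGE_NUMBER.match(t)))
        ]
        # Step 6: count every character in the kept blocks, spaces included.
        per_page.append(sum(len(t) for t in kept))
    return sum(per_page), per_page
```

A call such as `count_characters(extract_blocks("thesis.pdf"))` (the file name is a placeholder) would return the total and the per-page counts. Restricting removal to the margin zones is a deliberate choice in this sketch, so that a bare number inside the body text is not mistaken for a page number.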

	## What Gets Removed

	### Page Numbers

	Examples:

	```text
	Page 12
	12
	12 / 120
	12 of 120
	Side 12
	12 af 120
	```
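A single regular expression is enough to cover the examples above. The pattern below is a sketch written for this README, not necessarily the one the app uses; `PAGE_NUMBER` and `is_page_number` are hypothetical names.

```python
import re

# Hypothetical pattern: an optional "Page"/"Side" label, a number, and an
# optional total introduced by "/", "of", or the Danish "af".
PAGE_NUMBER = re.compile(
    r"^\s*(?:(?:page|side)\s+)?\d+(?:\s*(?:/|of|af)\s*\d+)?\s*$",
    re.IGNORECASE,
)


def is_page_number(line: str) -> bool:
    """Return True if the whole line looks like a page number."""
    return PAGE_NUMBER.match(line) is not None
```

Because a bare number such as `12` can also occur in body text (a table cell, for instance), a pattern like this is probably best combined with a position check rather than applied to every line.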

	### Running Headers

	Examples:

	```text
	2.1 Methods 12
	4.3 Results iv
	```
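Lines of this shape (a section number, a title, and a trailing page number in Arabic or Roman numerals) could be recognized with a pattern like the one below. It is an illustrative assumption, as are the names `RUNNING_HEADER` and `parse_running_header`; the app's real rule may differ.

```python
import re

# Hypothetical pattern: "2.1", then a title, then an Arabic or Roman page number.
RUNNING_HEADER = re.compile(
    r"^\s*(?P<section>\d+(?:\.\d+)*)\s+(?P<title>.+?)\s+(?P<page>\d+|[ivxlcdm]+)\s*$",
    re.IGNORECASE,
)


def parse_running_header(line: str):
    """Return the parts of a running header as a dict, or None if no match."""
    match = RUNNING_HEADER.match(line)
    return match.groupdict() if match else None
```

The pattern is deliberately loose: a body sentence such as `3.2 We tested 40` would also match, so in practice it would presumably be applied only to text near the top or bottom of a page.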

	### Repeated Headers and Footers

	Text that appears on multiple pages in the top or bottom region of the page is automatically detected and excluded.
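One way to combine position and repetition is sketched below. The 10% zone, the `min_pages` threshold, and the digit masking in `normalize` are assumptions chosen for the example, not documented behavior of the app. Blocks are plain `(y0, y1, text)` tuples, which could be derived from PyMuPDF's `page.get_text("blocks")`.

```python
import re
from collections import Counter


def normalize(text):
    """Collapse whitespace and mask digits so 'Draft 3' and 'Draft 4' compare equal."""
    return re.sub(r"\d+", "#", " ".join(text.split())).lower()


def find_repeated(pages, page_height, zone=0.10, min_pages=2):
    """Return (headers, footers): normalized texts that recur in each zone.

    `pages` is a list of pages, each a list of (y0, y1, text) blocks.
    """
    top, bottom = Counter(), Counter()
    for blocks in pages:
        page_top, page_bottom = set(), set()
        for y0, y1, text in blocks:
            if y1 <= page_height * zone:
                page_top.add(normalize(text))
            elif y0 >= page_height * (1 - zone):
                page_bottom.add(normalize(text))
        # Sets ensure each text is counted at most once per page.
        top.update(page_top)
        bottom.update(page_bottom)
    headers = {t for t, n in top.items() if n >= min_pages}
    footers = {t for t, n in bottom.items() if n >= min_pages}
    return headers, footers
```

Text that repeats only in the middle of the page (a recurring table caption, say) is never counted here, because only blocks inside the two zones are considered.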

	## Output

	The tool returns:

	* Total character count
	* Character count per page
	* Included text
	* Detected headers
	* Detected footers
	* Detected running headers
	* Detected page numbers
	* Log of removed elements
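For readers who want to picture the result as a data structure, the items above map naturally onto a small container. The class below is purely hypothetical: `CountResult` and its field names are invented for this sketch and do not describe the app's internal types.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CountResult:
    """Hypothetical container mirroring the listed outputs."""

    per_page_counts: dict[int, int] = field(default_factory=dict)  # page -> characters
    included_text: str = ""
    headers: list[str] = field(default_factory=list)
    footers: list[str] = field(default_factory=list)
    running_headers: list[str] = field(default_factory=list)
    page_numbers: list[str] = field(default_factory=list)
    # Each log entry: (page number, kind of element, removed text).
    removal_log: list[tuple[int, str, str]] = field(default_factory=list)

    @property
    def total_characters(self) -> int:
        """Total is derived from the per-page counts so the two cannot disagree."""
        return sum(self.per_page_counts.values())
```

Deriving the total from the per-page counts, rather than storing it separately, also means that excluding a page only requires dropping its entry from `per_page_counts`.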

	## Use Cases

	* Academic assignments
	* University theses
	* Research papers
	* Government reports
	* Technical documentation
	* Verifying publication character limits

	## Technology

	Built with:

	* Python
	* PyMuPDF (fitz)
	* Streamlit
	* Regular Expressions
	* Hugging Face Spaces

	## Limitations

	* Scanned PDFs without embedded text require OCR before processing.
	* Very unusual document layouts may affect automatic header/footer detection.
	* Character counts are based on extracted text and may differ slightly from counts generated by word processors.
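Regarding the first limitation, a scanned page typically yields no extractable text, so it can be flagged before counting. The helper below is a minimal sketch under that assumption; `pages_without_text` and `min_chars` are hypothetical names, and the input is simply one extracted string per page (with PyMuPDF, for example, `[page.get_text() for page in doc]`).

```python
def pages_without_text(page_texts, min_chars=1):
    """Return 1-based page numbers whose extracted text is empty or nearly so."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]
```

Any page numbers returned are candidates for OCR; an empty list suggests every page has an embedded text layer.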

	## License

	MIT License

	## Author

	Created by Daniel Hjerresen.