taxdoc-preprocessor / extractor.py


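import fitz  # PyMuPDF
import os


def extract_text(file):
    """Return the text of an uploaded .pdf or .txt file.

    `file` is expected to expose its path on disk as `file.name`.
    """
    # Nothing uploaded: return an empty string.
    if not file:
        return ""

    file_ext = os.path.splitext(file.name)[1].lower()

    if file_ext == ".pdf":
        # Extract the text of every page and join the pages with newlines.
        with fitz.open(file.name) as doc:
            return "\n".join(page.get_text() for page in doc)

    elif file_ext == ".txt":
        with open(file.name, "r", encoding="utf-8") as f:
            return f.read()

    else:
        # Callers receive this message in place of extracted text.
        return "Unsupported file type"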
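
Because extract_text returns "Unsupported file type" as an ordinary string, a caller cannot easily tell that message apart from real document text. The sketch below is one possible variant, not part of this repository: the names extract_text_strict and demo are hypothetical, and it assumes the uploaded object only needs a .name attribute holding a path. It raises ValueError for unknown extensions and imports PyMuPDF lazily, so the .txt path works even where PyMuPDF is not installed.

```python
import os
import tempfile
from types import SimpleNamespace


def extract_text_strict(file):
    """Hypothetical variant of extract_text that raises on unsupported types."""
    if not file:
        return ""

    file_ext = os.path.splitext(file.name)[1].lower()

    if file_ext == ".pdf":
        # Imported here so that .txt handling does not require PyMuPDF.
        import fitz  # PyMuPDF
        with fitz.open(file.name) as doc:
            return "\n".join(page.get_text() for page in doc)

    if file_ext == ".txt":
        with open(file.name, "r", encoding="utf-8") as f:
            return f.read()

    # Signal the problem explicitly instead of returning it as text.
    raise ValueError(f"Unsupported file type: {file_ext!r}")


def demo():
    """Write a small .txt file and read it back through the extractor."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notes.txt")
        with open(path, "w", encoding="utf-8") as f:
            f.write("Box 1: wages")
        # Stand-in for an uploaded file: any object with a .name path.
        upload = SimpleNamespace(name=path)
        return extract_text_strict(upload)
```

A caller can then wrap the call in try/except ValueError and show its own message, keeping error handling separate from the extracted text.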