	---
	title: Web_page_data_html
	app_file: app.py
	sdk: gradio
	sdk_version: 5.47.2
	---
	Web Company Data Extractor
	==========================

	Overview
	--------
	This Hugging Face demo extracts basic company-related information from unstructured
	web pages. You enter a URL, and the app:

	1. Downloads the HTML
	2. Parses visible text
	3. Extracts useful company data indicators:
	   - Emails
	   - Phone numbers
	   - Possible addresses
	   - Social media profile links
	   - Page title (as a company name guess)
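The signal-detection step can be sketched with a few regular expressions. This is a minimal illustration, not the code in app.py: the pattern names, the `extract_signals` helper, and the list of social domains are all hypothetical, and address detection is left out because it is much harder to capture in one pattern.

```python
import re

# Hypothetical patterns for illustration; the app's real regexes may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|facebook|instagram)\.com/[^\s\"'<>]+"
)


def extract_signals(text: str, html: str = "") -> dict:
    """Return de-duplicated emails, phone numbers, and social profile links.

    `text` is the visible page text; `html` is the raw markup, searched for
    links because profile URLs usually live in href attributes.
    """

    def unique(items):
        # dict.fromkeys keeps first-seen order while dropping repeats.
        return list(dict.fromkeys(items))

    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(m.strip() for m in PHONE_RE.findall(text)),
        "social_links": unique(SOCIAL_RE.findall(html or text)),
    }
```

Patterns this loose will produce false positives (long digit runs can look like phone numbers, for example), which is in keeping with a rule-based prototype.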

	The output includes:
	- Truncated raw text extracted from the page
	- A structured JSON-like summary of detected signals
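As a rough sketch of how these two outputs might be assembled: the `build_output` helper, the `MAX_TEXT_CHARS` cutoff, and the `company_name_guess` key below are assumptions made for illustration, since the README does not state the real limit or field names.

```python
import json

MAX_TEXT_CHARS = 2000  # assumed cutoff; the app's actual limit is not documented


def build_output(visible_text: str, signals: dict, title: str) -> tuple[str, dict]:
    """Return (truncated text, summary dict) built from plain Python types."""
    truncated = visible_text[:MAX_TEXT_CHARS]
    if len(visible_text) > MAX_TEXT_CHARS:
        truncated += "..."  # mark that text was cut off
    # The page title doubles as a company name guess; empty titles become None.
    summary = {"company_name_guess": title.strip() or None, **signals}
    return truncated, summary


if __name__ == "__main__":
    text, summary = build_output("Example page text", {"emails": []}, "Example Corp")
    print(json.dumps(summary, indent=2))
```

Keeping the summary as a plain dictionary of strings and lists means it can be passed to `json.dumps` or a UI component without custom serialization.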

	This is a rule-based prototype for demonstration purposes. It does not replace
	professional-grade web data extraction or parsing libraries.


	Project Purpose
	---------------
	This app demonstrates how unstructured web pages can be converted into structured
	company data. It is useful for illustrating:

	- Web data extraction
	- Text parsing and cleaning
	- Company signal detection
	- Lightweight company data enrichment
	- A pattern for how ZoomInfo might bootstrap new data fields

	Files Included
	--------------
	- app.py – The main Gradio application
	- requirements.txt – Python dependencies
	- README.md – This file


	How to Use
	----------
	1. Enter a URL (e.g. https://www.microsoft.com)
	2. Click "Extract Company Data"
	3. Inspect:
	   - Cleaned visible text (truncated)
	   - The structured output fields


	Technical Notes
	---------------
	Uses:
	- requests – to fetch HTML
	- beautifulsoup4 – to parse DOM and extract visible text
	- regex patterns – for emails, phone numbers, and simple address detection
	- gradio – for UI

	Outputs are plain Python dictionaries and strings to avoid serialization issues.
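To illustrate what "extract visible text" means, here is a dependency-free sketch. Note the swap: the app uses beautifulsoup4, while this example uses the standard library's `html.parser` so it runs without installing anything. The `VisibleTextParser` class and `extract_visible_text` function are hypothetical names, not taken from app.py.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping tags whose content is never displayed."""

    HIDDEN_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._hidden_depth = 0   # > 0 while inside a hidden tag
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        if tag in self.HIDDEN_TAGS:
            self._hidden_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        if tag in self.HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            # Title text is kept separately as the company name guess.
            self.title += data
        elif self._hidden_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(html: str) -> tuple[str, str]:
    """Return (page title, visible text joined by single spaces)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title.strip(), " ".join(parser.chunks)
```

This only filters by tag name; it does not account for elements hidden through CSS or content injected by JavaScript.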


	Running Locally
	---------------
	pip install -r requirements.txt
	python app.py

	The app starts a local server and prints its URL in the terminal (by default, Gradio serves on http://127.0.0.1:7860).


	Limitations
	-----------
	This is not a complete production parser. It does not handle:
	- JavaScript-rendered pages
	- International address formats
	- Advanced company name extraction
	- Pagination or crawling
	- Rate limiting or proxy management

	However, it demonstrates the core concept:
	turning unstructured web data into structured company data signals.