---
title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---
Web Company Data Extractor
==========================
Overview
--------
This Hugging Face demo extracts basic company-related information from unstructured
web pages. You enter a URL, and the app:
1. Downloads the HTML
2. Parses visible text
3. Extracts useful company data indicators:
- Emails
- Phone numbers
- Possible addresses
- Social media profile links
- Page title (as a company name guess)
The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals
This is a rule-based prototype for demonstration purposes. It does not replace
professional-grade web data extraction or parsing libraries.
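
Steps 2 and 3 above (visible text and the page-title guess) can be sketched as follows. This is a minimal illustration, not the code in app.py: it swaps in the standard-library `html.parser` instead of beautifulsoup4 so it runs without dependencies, and the names `VisibleTextParser`, `extract_visible_text`, and `max_chars` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect visible text and the <title>, skipping script/style content."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []       # visible text fragments, in document order
        self.title = ""        # contents of <title>, used as a company name guess
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._skip_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html, max_chars=500):
    """Return (title, visible text truncated to max_chars characters)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)[:max_chars]


# Assumed sample page; acme.example is a placeholder domain.
sample = """<html><head><title>Acme Corp</title>
<style>body { color: red; }</style></head>
<body><h1>Welcome</h1><script>var x = 1;</script>
<p>Contact us at info@acme.example</p></body></html>"""

title, text = extract_visible_text(sample)
print(title)
print(text)
```

The title is kept separate from the body text here so it can serve as the company name guess; whether app.py does the same is an assumption.
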
Project Purpose
---------------
This app demonstrates how unstructured web pages can be converted into structured
company data. It is useful for illustrating:
- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields
Files Included
--------------
- app.py – The main Gradio application
- requirements.txt – Python dependencies
- README.txt – This file
How to Use
----------
1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
- Cleaned visible text (truncated)
- The structured output fields
Technical Notes
---------------
The app uses:
- requests – to fetch HTML
- beautifulsoup4 – to parse DOM and extract visible text
- regex patterns – for emails, phone numbers, and simple address detection
- gradio – for UI
Outputs are plain Python dictionaries and strings, which avoids JSON serialization issues when Gradio renders them.
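
The regex-based detection described above might look roughly like this. It is a sketch under stated assumptions: the patterns and the names `EMAIL_RE`, `PHONE_RE`, `SOCIAL_RE`, and `extract_signals` are illustrative rather than taken from app.py, the phone pattern only covers US-style numbers, and social links are searched in the raw HTML on the assumption that they live in `href` attributes rather than in visible text.

```python
import json
import re

# Illustrative patterns only; real-world data needs broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers such as (425) 555-0100 or 425-555-0100.
PHONE_RE = re.compile(r"(?<!\d)\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}(?!\d)")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|facebook|instagram)\.com/[^\s\"'<>]+"
)


def extract_signals(text, html=""):
    """Return a plain dict of de-duplicated, sorted signals."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(PHONE_RE.findall(text))),
        "social_links": sorted(set(SOCIAL_RE.findall(html))),
    }


# Assumed sample inputs; acme.example is a placeholder domain.
text = "Contact info@acme.example or sales@acme.example. Call (425) 555-0100."
html = '<a href="https://www.linkedin.com/company/acme">LinkedIn</a>'

signals = extract_signals(text, html)
# Only dicts, lists, and strings are used, so json.dumps needs no custom encoder.
print(json.dumps(signals, indent=2))
```

Returning sorted lists instead of sets keeps the result JSON-serializable and the output order stable between runs.
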
Running Locally
---------------
    pip install -r requirements.txt
    python app.py
The app starts a local Gradio server and prints its URL in the terminal (http://127.0.0.1:7860 by default).
Limitations
-----------
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management
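
To make the address limitation above concrete, here is a rough sketch of a simple US-style address pattern. The pattern, the name `find_addresses`, and both sample strings are hypothetical and not taken from app.py; they only show why a rule of this kind misses other formats.

```python
import re

# Hypothetical US-style pattern: street number, one to three capitalized
# words, then a common street suffix.
US_ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Way)\b"
)


def find_addresses(text):
    """Return every substring that matches the US-style pattern."""
    return US_ADDRESS_RE.findall(text)


us_text = "Visit us at 120 Main Street, Springfield, IL 62701."
de_text = "Besuchen Sie uns: Hauptstrasse 12, 10115 Berlin."

print(find_addresses(us_text))  # the US-style address is detected
print(find_addresses(de_text))  # the German-style address is missed
```

The German example puts the street name before the house number, so a pattern that expects a leading number finds nothing.
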
However, it demonstrates the core concept:
**turning unstructured web data into structured company data signals.**