title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
Web Company Data Extractor
Overview
This Hugging Face demo extracts basic company-related information from unstructured web pages. You enter a URL, and the app:
- Downloads the HTML
- Parses visible text
- Extracts useful company data indicators:
  - Emails
  - Phone numbers
  - Possible addresses
  - Social media profile links
  - Page title (as a company name guess)
The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals
This is a rule-based prototype for demonstration purposes. It does not replace professional-grade web data extraction or parsing libraries.
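As a rough illustration of the rule-based approach described above, the following minimal sketch builds a structured summary from page text and link targets. The regex patterns, the `SOCIAL_DOMAINS` list, the `extract_signals` function, and the dictionary keys are all assumptions made for this example; the patterns actually used in app.py are not shown in this file and may differ.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only (assumed, not taken from app.py).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose North American phone pattern: optional +1, then 3-3-4 digits.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
# Hypothetical list of social domains to look for in link targets.
SOCIAL_DOMAINS = ("linkedin.com", "twitter.com", "x.com", "facebook.com", "instagram.com")


def extract_signals(text, links, title=""):
    """Return a plain dict of company signals found in page text and links."""
    social = []
    for link in links:
        host = urlparse(link).netloc.lower()
        # Accept the bare domain or any subdomain such as www.linkedin.com.
        if any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS):
            social.append(link)
    return {
        "company_name_guess": title.strip(),
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(m.strip() for m in PHONE_RE.findall(text))),
        "social_links": social,
    }


sample_text = "Contact sales@example.com or call (425) 555-0100."
sample_links = [
    "https://www.linkedin.com/company/example",
    "https://example.com/about",
]
summary = extract_signals(sample_text, sample_links, title=" Example Corp ")
print(summary)
```

Deduplicating with `set` and sorting keeps the output stable between runs, which makes the summary easier to compare across pages.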
Project Purpose
This app demonstrates how unstructured web pages can be converted into structured company data. It is useful for illustrating:
- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields
Files Included
- app.py – The main Gradio application
- requirements.txt – Python dependencies
- README.txt – This file
How to Use
- Enter a URL (e.g. https://www.microsoft.com)
- Click "Extract Company Data"
- Inspect:
  - Cleaned visible text (truncated)
  - The structured output fields
Technical Notes
Uses:
- requests – to fetch HTML
- beautifulsoup4 – to parse DOM and extract visible text
- regex patterns – for emails, phone numbers, and simple address detection
- gradio – for UI
Outputs are plain Python dictionaries and strings to avoid serialization issues.
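The visible-text step can be sketched as follows. To stay dependency-free, this example swaps in the standard library's `html.parser` in place of beautifulsoup4, so it is not the app's actual code. The `VisibleTextParser` class, the `visible_text` function, and the `max_chars` default are hypothetical names and values chosen for illustration.

```python
import json
import re
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style>/<noscript> and records the <title>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._skip_depth == 0:
            self.chunks.append(data)


def visible_text(html, max_chars=2000):
    """Return (title, truncated visible text) as plain strings."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the text is easier to scan with regexes.
    text = re.sub(r"\s+", " ", " ".join(parser.chunks)).strip()
    return parser.title.strip(), text[:max_chars]


html = (
    "<html><head><title>Example Corp</title><style>p{color:red}</style></head>"
    "<body><script>var x = 1;</script><p>Hello   world</p><p>Contact us</p>"
    "</body></html>"
)
title, text = visible_text(html)
result = {"title": title, "text": text}
# Plain dicts and strings round-trip through JSON without custom encoders.
print(json.dumps(result))
```

Because the result holds only built-in types, it can be passed to `json.dumps` directly, which is the property the note above about plain dictionaries and strings relies on.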
Running Locally
pip install -r requirements.txt
python app.py
Gradio prints the local URL to the console when the app starts (by default http://127.0.0.1:7860).
Limitations
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management
However, it demonstrates the core concept: turning unstructured web data into structured company data signals.
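The address limitation can be made concrete with a small sketch. The pattern below is an assumed US-style street address regex written for this example, not the one used in app.py. It finds a typical US address but misses a German-style address in which the house number follows the street name.

```python
import re

# Assumed pattern: house number, one to four capitalized words, a street
# suffix, then an optional ", City, ST 12345" tail. Illustrative only.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\.?\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Way|Drive|Dr|Lane|Ln)\b\.?"
    r"(?:,\s*[A-Z][A-Za-z ]+,\s*[A-Z]{2}\s+\d{5}(?:-\d{4})?)?"
)


def find_addresses(text):
    """Return all substrings that look like US-style street addresses."""
    return ADDRESS_RE.findall(text)


us_hits = find_addresses("Visit us at 1 Microsoft Way, Redmond, WA 98052 today.")
de_hits = find_addresses("Musterstraße 12, 10115 Berlin")
print(us_hits)
print(de_hits)
```

Handling formats like the second one generally calls for locale-aware rules or a dedicated address-parsing library, which is outside the scope of this prototype.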