| title: Web_page_data_html | |
| app_file: app.py | |
| sdk: gradio | |
| sdk_version: 5.47.2 | |
| Web Company Data Extractor | |
| ========================== | |
| Overview | |
| -------- | |
| This Hugging Face demo extracts basic company-related information from unstructured | |
| web pages. You enter a URL, and the app: | |
| 1. Downloads the HTML | |
| 2. Parses visible text | |
| 3. Extracts useful company data indicators: | |
| - Emails | |
| - Phone numbers | |
| - Possible addresses | |
| - Social media profile links | |
| - Page title (as a company name guess) | |
| The output includes: | |
| - Truncated raw text extracted from the page | |
| - A structured JSON-like summary of detected signals | |
| This is a rule-based prototype for demonstration purposes. It does not replace | |
| professional-grade web data extraction or parsing libraries. | |
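The rule-based signal detection described above can be sketched roughly as follows. This is a minimal illustration, not the code in app.py: the regular expressions, the `SOCIAL_DOMAINS` list, the `extract_signals` function, and the output keys are all assumptions chosen for the example, and the phone and address patterns only cover simple North American style formats.

```python
import re
from urllib.parse import urlparse

# Illustrative patterns only (assumptions); the real expressions in app.py may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose North American style phone numbers, e.g. (425) 555-0100 or 425-555-0100.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
# Very rough street address: house number, capitalized words, common suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][A-Za-z]*\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way)\b"
)
# Hypothetical list of social media hosts to look for.
SOCIAL_DOMAINS = ("linkedin.com", "twitter.com", "x.com", "facebook.com",
                  "instagram.com", "youtube.com")


def is_social(url):
    """Return True if the URL's host is one of SOCIAL_DOMAINS or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)


def unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_signals(text, links, title=""):
    """Build a plain dictionary of company signals from page text and hrefs."""
    return {
        "company_name_guess": title.strip(),
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(PHONE_RE.findall(text)),
        "addresses": unique(ADDRESS_RE.findall(text)),
        "social_links": unique(link for link in links if is_social(link)),
    }


sample_text = (
    "Contact Acme Corp at info@acme.example or sales@acme.example. "
    "Call (425) 555-0100. Visit us at 1 Main Street, Springfield."
)
sample_links = [
    "https://www.linkedin.com/company/acme",
    "https://acme.example/about",
]
signals = extract_signals(sample_text, sample_links, title="Acme Corp")
```

Matching social links by hostname rather than by substring avoids false positives such as a host that merely ends in the same letters as a listed domain.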
| Project Purpose | |
| --------------- | |
| This app demonstrates how unstructured web pages can be converted into structured | |
| company data. It is useful for illustrating: | |
| - Web data extraction | |
| - Text parsing and cleaning | |
| - Company signal detection | |
| - Lightweight company data enrichment | |
| - A pattern for how ZoomInfo might bootstrap new data fields | |
| Files Included | |
| -------------- | |
| - app.py - The main Gradio application | |
| - requirements.txt | |
| - README.txt - This file | |
| How to Use | |
| ---------- | |
| 1. Enter a URL (e.g. https://www.microsoft.com) | |
| 2. Click "Extract Company Data" | |
| 3. Inspect: | |
| - Cleaned visible text (truncated) | |
| - The structured output fields | |
| Technical Notes | |
| --------------- | |
| Uses: | |
| - requests - to fetch HTML | |
| - beautifulsoup4 - to parse DOM and extract visible text | |
| - regex patterns - for emails, phone numbers, and simple address detection | |
| - gradio - for UI | |
| Outputs are plain Python dictionaries and strings to avoid serialization issues. | |
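To illustrate the "parse visible text, then truncate" step without any third-party dependency, the sketch below uses the standard library's `html.parser` as a stand-in for beautifulsoup4. It is not the app's implementation: the `VisibleTextParser` class, the `parse_page` function, the set of skipped tags, and the 2000-character default are all assumptions made for the example. It returns only plain strings, lists, and a dictionary, in line with the note above.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect the page title, link targets, and text outside non-visible tags."""

    # Tags whose contents are not shown to the reader (assumed list).
    SKIP_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title_parts = []
        self.links = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def parse_page(html, max_chars=2000):
    """Return title, truncated visible text, and hrefs as plain Python values."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the truncated text stays readable.
    text = " ".join(" ".join(parser.chunks).split())
    title = " ".join("".join(parser.title_parts).split())
    return {"title": title, "text": text[:max_chars], "links": parser.links}


sample_html = (
    "<html><head><title> Acme Corp </title><style>p{color:red}</style></head>"
    "<body><script>var x = 1;</script><p>Hello <b>world</b></p>"
    "<a href='https://www.linkedin.com/company/acme'>LinkedIn</a></body></html>"
)
page = parse_page(sample_html)
```

With beautifulsoup4, the same idea is usually expressed by removing the script and style elements from the parsed tree and then reading the remaining text.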
| Running Locally | |
| --------------- | |
| pip install -r requirements.txt | |
| python app.py | |
| Gradio prints a local URL when the app starts (by default http://127.0.0.1:7860); open it in a browser. | |
| Limitations | |
| ----------- | |
| This is not a complete production parser. It does not handle: | |
| - JavaScript-rendered pages | |
| - International address formats | |
| - Advanced company name extraction | |
| - Pagination or crawling | |
| - Rate limiting or proxy management | |
| However, it demonstrates the core concept: | |
| **turning unstructured web data into structured company data signals.** |