---
title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---
Web Company Data Extractor
==========================

Overview
--------
This Hugging Face demo extracts basic company-related information from unstructured
web pages. You enter a URL, and the app:

1. Downloads the HTML
2. Parses visible text
3. Extracts useful company data indicators:
   - Emails
   - Phone numbers
   - Possible addresses
   - Social media profile links
   - Page title (as a company name guess)
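
Steps 2 and 3 (visible text and the page title) can be sketched as follows. This is a minimal illustration, not the code in app.py: it swaps in the standard library's `html.parser` for BeautifulSoup so it runs without dependencies, and the names `VisibleTextParser` and `extract_visible_text` are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Hypothetical helper: collects the <title> and the visible text,
    skipping the contents of <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.chunks.append(text)


def extract_visible_text(html):
    """Return (title, visible_text) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


sample = (
    "<html><head><title>Acme Corp</title><style>p{color:red}</style></head>"
    "<body><p>Contact us</p><script>var x = 1;</script><p>today</p></body></html>"
)
title, text = extract_visible_text(sample)
print(title)
print(text)
```

The title is kept separate from the body text because it doubles as the company name guess.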

The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals

This is a rule-based prototype for demonstration purposes. It does not replace
professional-grade web data extraction or parsing libraries.


Project Purpose
---------------
This app demonstrates how unstructured web pages can be converted into structured
company data. It is useful for illustrating:

- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields

Files Included
--------------
- app.py        – The main Gradio application
- requirements.txt – Python dependencies
- README.txt    – This file


How to Use
----------
1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
   - Cleaned visible text (truncated)
   - The structured output fields


Technical Notes
---------------
The app uses:
- requests          – to fetch HTML
- beautifulsoup4    – to parse DOM and extract visible text
- regex patterns    – for emails, phone numbers, and simple address detection
- gradio            – for UI
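
To make the rule-based detection concrete, here is a small sketch of what such patterns might look like. The expressions are assumptions written for illustration and may differ from those in app.py. The phone and address patterns only target common North American formats, and `find_signals` is a hypothetical name.

```python
import re

# Illustrative patterns only; the expressions in app.py may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# North American style numbers such as (425) 555-0100 or +1 425-555-0100.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)

# Very rough US street address: house number, capitalized words, street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][A-Za-z]*\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Way|Drive|Dr)\b\.?"
)


def find_signals(text):
    """Return de-duplicated, sorted matches for each signal type."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(PHONE_RE.findall(text))),
        "addresses": sorted(set(ADDRESS_RE.findall(text))),
    }


sample = "Email info@example.com or call (425) 555-0100. Visit us at 1 Example Way, Redmond."
print(find_signals(sample))
```

Patterns this simple trade recall for readability: they will miss unusual formats and can produce false positives, which is consistent with the prototype scope described above.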

Outputs are plain Python dictionaries and strings to avoid serialization issues.
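
As a sketch of that output shape, the function below returns a truncated string and a dictionary built only from built-in types, so the result can be passed through `json.dumps` unchanged. The field names, the `build_summary` helper, and the 2000-character limit are assumptions for illustration, not values taken from app.py.

```python
import json

MAX_TEXT_CHARS = 2000  # assumed truncation limit, not taken from app.py


def build_summary(title, text, emails, phones, addresses, social_links):
    """Return (truncated_text, summary) using only str, list, dict, and None."""
    summary = {
        "company_name_guess": title or None,
        "emails": sorted(emails),          # sets become sorted lists
        "phones": sorted(phones),
        "addresses": sorted(addresses),
        "social_links": sorted(social_links),
    }
    return text[:MAX_TEXT_CHARS], summary


truncated, summary = build_summary(
    title="Acme Corp",
    text="x" * 5000,
    emails={"info@example.com"},
    phones=[],
    addresses=[],
    social_links=["https://www.linkedin.com/company/example"],
)
print(len(truncated))
print(json.dumps(summary, indent=2))
```

Converting sets to sorted lists before returning matters here, because `json.dumps` raises a `TypeError` on a `set`.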


Running Locally
---------------
    pip install -r requirements.txt
    python app.py

Gradio prints the local URL to the console when the app starts (by default
http://127.0.0.1:7860).


Limitations
-----------
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management

However, it demonstrates the core concept:
**turning unstructured web data into structured company data signals.**