title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2

Web Company Data Extractor

Overview

This Hugging Face demo extracts basic company-related information from unstructured web pages. You enter a URL, and the app:

  1. Downloads the HTML
  2. Parses visible text
  3. Extracts useful company data indicators:
    • Emails
    • Phone numbers
    • Possible addresses
    • Social media profile links
    • Page title (as a company name guess)
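Steps 2 and 3 can be sketched without any dependencies. The example below is a minimal illustration, not the code in app.py: it swaps in the standard library's html.parser for beautifulsoup4 so it runs on its own, and the names VisibleTextParser, HIDDEN_TAGS, and extract_visible_text are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is never shown to a reader.
HIDDEN_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects the page title and the text found outside hidden tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._hidden_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if not text:
            return
        if self._in_title:
            self.title += text
        elif self._hidden_depth == 0:
            self.chunks.append(text)


def extract_visible_text(html):
    """Return (title, visible_text) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)
```

Calling extract_visible_text on a downloaded page returns the title (the company name guess) and a single whitespace-normalized string that the later detection steps can scan.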

The output includes:

  • Truncated raw text extracted from the page
  • A structured JSON-like summary of detected signals
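The two outputs above could be assembled roughly as follows. This is a sketch under stated assumptions: the function name build_output, the 2000-character limit, and the field names in the summary are all hypothetical and may differ from what app.py actually returns.

```python
def build_output(visible_text, signals, max_chars=2000):
    """Return (truncated_text, summary) ready for display.

    `signals` is assumed to be a dict of detected values; the keys
    used here are illustrative, not taken from app.py.
    """
    truncated = visible_text[:max_chars]
    if len(visible_text) > max_chars:
        truncated += "..."  # mark that text was cut off

    summary = {
        "company_name_guess": signals.get("title", ""),
        # De-duplicate and sort so repeated runs give stable output.
        "emails": sorted(set(signals.get("emails", []))),
        "phones": sorted(set(signals.get("phones", []))),
        "addresses": sorted(set(signals.get("addresses", []))),
        "social_links": sorted(set(signals.get("social_links", []))),
    }
    return truncated, summary
```

Missing keys fall back to empty values, so the summary always has the same shape regardless of what was found on the page.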

This is a rule-based prototype for demonstration purposes. It does not replace professional-grade web data extraction or parsing libraries.
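To illustrate what "rule-based" can mean in practice, social media profile links may be detected by comparing each link's hostname against a fixed list of known networks. The sketch below assumes the links have already been collected from the page; SOCIAL_DOMAINS and find_social_links are hypothetical names, and the domain list is an example rather than the one used by the app.

```python
from urllib.parse import urlparse

# Assumption: an illustrative, incomplete list of social networks.
SOCIAL_DOMAINS = {
    "linkedin.com": "linkedin",
    "twitter.com": "twitter",
    "x.com": "twitter",
    "facebook.com": "facebook",
    "instagram.com": "instagram",
    "youtube.com": "youtube",
}


def find_social_links(hrefs):
    """Group absolute URLs by social network, based on hostname only."""
    found = {}
    for href in hrefs:
        # Relative links and mailto: links have no hostname and are skipped.
        host = (urlparse(href).hostname or "").lower()
        for domain, network in SOCIAL_DOMAINS.items():
            # Match the domain itself or any subdomain such as www.
            if host == domain or host.endswith("." + domain):
                found.setdefault(network, []).append(href)
                break
    return found
```

Matching on the parsed hostname rather than searching the raw URL string avoids false positives such as a lookalike domain that merely contains "linkedin.com" as a substring.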

Project Purpose

This app demonstrates how unstructured web pages can be converted into structured company data. It is useful for illustrating:

  • Web data extraction
  • Text parsing and cleaning
  • Company signal detection
  • Lightweight company data enrichment
  • A pattern for how ZoomInfo might bootstrap new data fields

Files Included

  • app.py – The main Gradio application
  • requirements.txt – Python dependencies
  • README.md – This file

How to Use

  1. Enter a URL (e.g. https://www.microsoft.com)
  2. Click "Extract Company Data"
  3. Inspect:
    • Cleaned visible text (truncated)
    • The structured output fields

Technical Notes

Uses:

  • requests – to fetch HTML
  • beautifulsoup4 – to parse DOM and extract visible text
  • regex patterns – for emails, phone numbers, and simple address detection
  • gradio – for UI
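The regex-based detection can be sketched as below. These patterns are assumptions written for illustration, not the ones in app.py: the phone pattern targets North American formatting and the address pattern only recognizes a number followed by capitalized words and a common street suffix.

```python
import re

# Simple email shape: local part, "@", domain with at least one dot.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# North American style numbers such as (425) 555-0100 or +1 425-555-0100.
PHONE_RE = re.compile(
    r"(?<!\d)(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}(?!\d)"
)

# Street number, one to four capitalized words, then a street suffix.
ADDRESS_RE = re.compile(
    r"\d{1,5}\s+(?:[A-Z][A-Za-z]*\s+){1,4}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Way)\b"
)


def detect_signals(text):
    """Return every email, phone, and address candidate found in the text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
        "addresses": ADDRESS_RE.findall(text),
    }
```

All groups are non-capturing, so findall returns the full matched strings. Patterns this simple will both miss valid values and flag some false positives, which is why the results are best treated as candidates.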

Outputs are plain Python dictionaries and strings to avoid serialization issues.
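One way to keep outputs serializable is to normalize intermediate values, such as sets used for de-duplication, into built-in types before returning them. The helper below is a hypothetical sketch (to_plain is not a name from app.py) and assumes that the elements of any set can be sorted against each other.

```python
import json


def to_plain(value):
    """Recursively convert a value into JSON-friendly built-in types."""
    if isinstance(value, dict):
        return {str(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    if isinstance(value, (set, frozenset)):
        # Sets have no JSON equivalent; emit a sorted list instead.
        return sorted(to_plain(v) for v in value)
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return str(value)  # fall back to text for anything else


raw = {"emails": {"b@x.example", "a@x.example"}, "phones": ("555-0100",)}
print(json.dumps(to_plain(raw)))
```

Passing the raw dictionary straight to json.dumps would raise a TypeError because of the set, while the normalized version serializes cleanly.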

Running Locally

pip install -r requirements.txt
python app.py

Gradio starts a local web server and prints its URL in the terminal (http://127.0.0.1:7860 by default). Open that URL in a browser to use the app.

Limitations

This is not a complete production parser. It does not handle:

  • JavaScript-rendered pages
  • International address formats
  • Advanced company name extraction
  • Pagination or crawling
  • Rate limiting or proxy management

However, it demonstrates the core concept: turning unstructured web data into structured company data signals.