---
title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---

Web Company Data Extractor
==========================

Overview
--------

This Hugging Face demo extracts basic company-related information from unstructured web pages.

You enter a URL, and the app:

1. Downloads the HTML
2. Parses visible text
3. Extracts useful company data indicators:
   - Emails
   - Phone numbers
   - Possible addresses
   - Social media profile links
   - Page title (as a company name guess)

The output includes:

- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals

This is a rule-based prototype for demonstration purposes. It does not replace professional-grade web data extraction or parsing libraries.

Project Purpose
---------------

This app demonstrates how unstructured web pages can be converted into structured company data. It is useful for illustrating:

- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields

Files Included
--------------

- app.py – The main Gradio application
- requirements.txt
- README.txt – This file

How to Use
----------

1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
   - Cleaned visible text (truncated)
   - The structured output fields

Technical Notes
---------------

Uses:

- requests – to fetch HTML
- beautifulsoup4 – to parse DOM and extract visible text
- regex patterns – for emails, phone numbers, and simple address detection
- gradio – for UI

Outputs are plain Python dictionaries and strings to avoid serialization issues.

Running Locally
---------------

    pip install -r requirements.txt
    python app.py

The app will run locally on a Gradio-generated URL.

Limitations
-----------

This is not a complete production parser.
It does not handle:

- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management

However, it demonstrates the core concept: **turning unstructured web data into structured company data signals.**
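
Illustrative Sketch
-------------------

The parse-and-extract steps described in the Overview can be sketched roughly as follows. This is not the code in app.py: the function name `extract_company_signals`, the regex patterns, the list of social domains, and the output keys are all assumptions made for illustration. To stay dependency-free, the sketch takes an HTML string instead of fetching a URL with requests, and it uses the standard-library `html.parser` in place of beautifulsoup4. Address detection is left out because this README does not describe the pattern used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical patterns; the regexes used in app.py are not shown in this README.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")
# Assumed list of social networks to look for.
SOCIAL_DOMAINS = ("linkedin.com", "twitter.com", "x.com", "facebook.com", "instagram.com")


class VisibleTextParser(HTMLParser):
    """Collects visible text, the <title> text, and link targets."""

    SKIP_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.title_parts = []
        self.links = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore script/style contents
        if self._in_title:
            self.title_parts.append(data)
        else:
            self.text_parts.append(data)


def is_social(href):
    """True if the link's host is one of SOCIAL_DOMAINS or a subdomain of one."""
    host = (urlparse(href).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in SOCIAL_DOMAINS)


def extract_company_signals(html, max_text_chars=500):
    """Return a plain dict of detected signals plus a truncated text preview."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the regexes see one clean line of text.
    text = " ".join(" ".join(parser.text_parts).split())
    title = " ".join("".join(parser.title_parts).split())
    return {
        "company_name_guess": title or None,
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(PHONE_RE.findall(text))),
        "social_links": sorted({h for h in parser.links if is_social(h)}),
        "text_preview": text[:max_text_chars],
    }


SAMPLE_HTML = """
<html><head><title>Example Corp</title><style>p {color: red}</style></head>
<body>
<p>Contact us at info@example.com or call +1 (425) 555-0100.</p>
<a href="https://www.linkedin.com/company/example-corp">LinkedIn</a>
<a href="/about">About</a>
<script>var tracker = "hidden@example.com";</script>
</body></html>
"""

result = extract_company_signals(SAMPLE_HTML)
print(result)
```

The result is a plain dictionary of strings and lists, in line with the note under Technical Notes about avoiding serialization issues. Social links are matched on the parsed hostname rather than by substring, so a host such as netflix.com is not mistaken for x.com.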