---
title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---
Web Company Data Extractor
==========================
Overview
--------
This Hugging Face demo extracts basic company-related information from unstructured
web pages. You enter a URL, and the app:
1. Downloads the HTML
2. Parses visible text
3. Extracts useful company data indicators:
- Emails
- Phone numbers
- Possible addresses
- Social media profile links
- Page title (as a company name guess)
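The rule-based detection in step 3 can be sketched with a few regular expressions. This is a minimal illustration, not the code in app.py: the patterns, the `extract_signals` name, and the dictionary keys are all assumptions, the phone pattern only covers North-American-style numbers, and address detection is left out.

```python
import re

# Illustrative patterns only (assumptions); the expressions used in app.py may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose North-American-style phone pattern: optional +1, then 3-3-4 digits.
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|facebook|instagram)\.com/[^\s\"'<>]+",
    re.IGNORECASE,
)


def extract_signals(text: str) -> dict:
    """Return de-duplicated matches for each signal type, in order of appearance."""

    def unique(matches):
        # dict.fromkeys keeps the first occurrence of each match and preserves order.
        return list(dict.fromkeys(m.strip() for m in matches))

    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(PHONE_RE.findall(text)),
        "social_links": unique(SOCIAL_RE.findall(text)),
    }


sample = (
    "Contact sales@example.com or call (425) 555-0100. "
    "Follow us: https://www.linkedin.com/company/example"
)
signals = extract_signals(sample)
print(signals)
```

Patterns this loose will produce false positives on real pages (for example, long digit runs that are not phone numbers), which is consistent with the prototype framing of this demo.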
The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals
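One way the two outputs might be assembled is shown below. The `build_output` helper, the 2000-character limit, and the field names are hypothetical and chosen only for illustration; the real app may use different names and limits.

```python
import json

MAX_CHARS = 2000  # hypothetical truncation limit, not taken from app.py


def build_output(text: str, signals: dict, title: str, max_chars: int = MAX_CHARS):
    """Return (truncated_text, summary) built only from strings, lists, and dicts."""
    truncated = text if len(text) <= max_chars else text[:max_chars] + "..."
    summary = {
        "company_name_guess": title,  # the page title, used as a rough guess
        "emails": list(signals.get("emails", [])),
        "phones": list(signals.get("phones", [])),
        "possible_addresses": list(signals.get("possible_addresses", [])),
        "social_links": list(signals.get("social_links", [])),
    }
    return truncated, summary


truncated, summary = build_output(
    "x" * 50, {"emails": ["info@example.com"]}, "Example Inc", max_chars=10
)
# Plain containers round-trip through JSON without custom encoders.
print(json.dumps(summary, indent=2))
```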
This is a rule-based prototype for demonstration purposes. It does not replace
professional-grade web data extraction or parsing libraries.
Project Purpose
---------------
This app demonstrates how unstructured web pages can be converted into structured
company data. It is useful for illustrating:
- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields
Files Included
--------------
- app.py – The main Gradio application
- requirements.txt – Python dependencies
- README.txt – This file
How to Use
----------
1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
- Cleaned visible text (truncated)
- The structured output fields
Technical Notes
---------------
Uses:
- requests – to fetch HTML
- beautifulsoup4 – to parse DOM and extract visible text
- regex patterns – for emails, phone numbers, and simple address detection
- gradio – for UI
Outputs are plain Python dictionaries and strings to avoid serialization issues.
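The visible-text step listed above can be illustrated without third-party packages. The sketch below swaps in the standard library's `html.parser` in place of beautifulsoup4 so that it runs anywhere; it is not how app.py does it. The `VisibleTextParser` class and the `visible_text` function are hypothetical names, and the set of skipped tags is an assumption.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect the page title and text outside script/style-like elements."""

    SKIP = {"script", "style", "noscript", "template"}  # assumed non-visible tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        stripped = data.strip()
        if self._in_title:
            self.title += stripped
        elif self._skip_depth == 0 and stripped:
            self.chunks.append(stripped)


def visible_text(html: str):
    """Return (title, text) as plain strings."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


page = (
    "<html><head><title>Example Inc</title><style>p{color:red}</style></head>"
    "<body><h1>About us</h1><script>var x = 1;</script>"
    "<p>Call us today.</p></body></html>"
)
title, text = visible_text(page)
print(title)
print(text)
```

Both return values are plain strings, in keeping with the note above about avoiding serialization issues.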
Running Locally
---------------
pip install -r requirements.txt
python app.py
The app starts a local Gradio server and prints its URL to the console (by default http://127.0.0.1:7860).
Limitations
-----------
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management
However, it demonstrates the core concept:
**turning unstructured web data into structured company data signals.**