---
title: Web_page_data_html
app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---
Web Company Data Extractor
==========================

Overview
--------
This Hugging Face demo extracts basic company-related information from unstructured
web pages. You enter a URL, and the app:

1. Downloads the HTML
2. Parses visible text
3. Extracts useful company data indicators:
   - Emails
   - Phone numbers
   - Possible addresses
   - Social media profile links
   - Page title (as a company name guess)
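
Step 2 and the page-title guess can be sketched as follows. This is a minimal illustration, not the code in app.py: it uses the standard library's `html.parser` as a dependency-free stand-in for beautifulsoup4, and the names `VisibleTextParser`, `extract_visible_text`, and `HIDDEN_TAGS` are hypothetical.

```python
from html.parser import HTMLParser

# Tags whose contents are never shown to the reader (an assumed, non-exhaustive set).
HIDDEN_TAGS = {"script", "style", "noscript", "template"}


class VisibleTextParser(HTMLParser):
    """Collects visible text and the <title> contents from an HTML string."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._hidden_depth = 0  # greater than 0 while inside a hidden tag
        self._in_title = False
        self.title_parts = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in HIDDEN_TAGS:
            self._hidden_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in HIDDEN_TAGS and self._hidden_depth > 0:
            self._hidden_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif self._hidden_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_visible_text(html):
    """Return (title, visible_text) with whitespace collapsed to single spaces."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    title = " ".join("".join(parser.title_parts).split())
    text = " ".join(" ".join(parser.text_parts).split())
    return title, text


sample = """<html><head><title>Acme Corp</title>
<style>body { color: red; }</style></head>
<body><h1>Welcome</h1><script>var x = 1;</script>
<p>Contact us at info@acme.example</p></body></html>"""

title, text = extract_visible_text(sample)
print(title)  # Acme Corp
print(text)   # Welcome Contact us at info@acme.example
```

The title is kept separate from the body text so it can serve as the company name guess without being counted twice.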

The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals
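
As a rough sketch of how those two outputs might be assembled: the function name `build_output`, the field names, and the 2000-character default below are assumptions for illustration, not values taken from app.py.

```python
def build_output(title, text, signals, max_chars=2000):
    """Return (truncated_text, summary) built only from plain str, list, and dict values."""
    # Cut long pages down and mark the cut so the reader knows text was dropped.
    truncated = text if len(text) <= max_chars else text[:max_chars] + "..."
    summary = {
        "company_name_guess": title,
        "emails": list(signals.get("emails", [])),
        "phones": list(signals.get("phones", [])),
        "addresses": list(signals.get("addresses", [])),
        "social_links": list(signals.get("social_links", [])),
    }
    return truncated, summary


truncated, summary = build_output(
    "Acme Corp",
    "x" * 50,
    {"emails": ["info@acme.example"]},
    max_chars=10,
)
print(truncated)          # xxxxxxxxxx...
print(summary["emails"])  # ['info@acme.example']
```

Signals that were not detected come back as empty lists, so the summary always has the same keys.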

This is a rule-based prototype for demonstration purposes. It does not replace
professional-grade web data extraction or parsing libraries.


Project Purpose
---------------
This app demonstrates how unstructured web pages can be converted into structured
company data. It is useful for illustrating:

- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields

Files Included
--------------
- app.py        – The main Gradio application
- requirements.txt – Python dependencies
- README.txt    – This file


How to Use
----------
1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
   - Cleaned visible text (truncated)
   - The structured output fields


Technical Notes
---------------
Uses:
- requests          – to fetch HTML
- beautifulsoup4    – to parse DOM and extract visible text
- regex patterns    – for emails, phone numbers, and simple address detection
- gradio            – for UI
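
The regex step might look like the sketch below. The patterns are illustrative assumptions, not the ones in app.py: the phone pattern covers only North American formats, the address pattern only simple US-style street addresses, and the social pattern only a handful of sites. The names `extract_signals` and `unique` are hypothetical.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# North American style numbers such as (425) 555-0100 or +1 425-555-0100.
PHONE_RE = re.compile(
    r"(?:\+?1[\s.-]?)?(?:\(\d{3}\)|\d{3})[\s.-]?\d{3}[\s.-]?\d{4}"
)
# Very rough street address: house number, one to three capitalized words, street type.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Way|Drive|Dr)\b"
)
# Profile links on a few well-known sites; can also be run on raw HTML to catch hrefs.
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|x|facebook|instagram|youtube)"
    r"\.com/[^\s\"'<>]+",
    re.IGNORECASE,
)


def unique(items):
    """Drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(items))


def extract_signals(text):
    """Return a dict of lists, one list per signal type."""
    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(PHONE_RE.findall(text)),
        "addresses": unique(ADDRESS_RE.findall(text)),
        "social_links": unique(SOCIAL_RE.findall(text)),
    }


sample = (
    "Reach Acme at info@acme.example or sales@acme.example. "
    "Call (425) 555-0100. Visit 1 Example Way, Redmond. "
    "Follow https://www.linkedin.com/company/acme or write to info@acme.example"
)
signals = extract_signals(sample)
print(signals["emails"])  # ['info@acme.example', 'sales@acme.example']
print(signals["phones"])  # ['(425) 555-0100']
```

Every group in the patterns is non-capturing, so `findall` returns whole matches rather than fragments.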

Outputs are plain Python dictionaries and strings to avoid serialization issues.
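
To illustrate why plain types matter: a `set` of matches cannot be passed to `json.dumps`, while a list can. The helper `to_plain` below is a hypothetical sketch of one way to normalize results before display, not a function from app.py.

```python
import json


def to_plain(value):
    """Recursively convert sets and tuples to lists so json.dumps accepts the result."""
    if isinstance(value, dict):
        return {str(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, (set, frozenset)):
        # Sorting gives stable output; assumes the members are mutually comparable.
        return sorted(to_plain(v) for v in value)
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    return value


raw = {"emails": {"info@acme.example"}, "title": "Acme Corp"}

try:
    json.dumps(raw)
    set_is_serializable = True
except TypeError:
    # json.dumps rejects sets, which is the kind of issue plain types avoid.
    set_is_serializable = False

plain = to_plain(raw)
encoded = json.dumps(plain, sort_keys=True)
print(set_is_serializable)  # False
print(encoded)  # {"emails": ["info@acme.example"], "title": "Acme Corp"}
```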


Running Locally
---------------
    pip install -r requirements.txt
    python app.py

When it starts, the app prints a local URL (Gradio's default is http://127.0.0.1:7860) that you can open in a browser.


Limitations
-----------
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management

However, it demonstrates the core concept:
**turning unstructured web data into structured company data signals.**