LianHP committed on
Commit
6942eab
·
verified ·
1 Parent(s): 530143f

Update README.md

Files changed (1)
  1. README.md +82 -0
README.md CHANGED
@@ -4,3 +4,85 @@ app_file: app.py
4
  sdk: gradio
5
  sdk_version: 5.47.2
6
  ---
7
+ Web Company Data Extractor
8
+ ==========================
9
+
10
+ Overview
11
+ --------
12
+ This Hugging Face demo extracts basic company-related information from unstructured
13
+ web pages. You enter a URL, and the app:
14
+
15
+ 1. Downloads the HTML
16
+ 2. Parses visible text
17
+ 3. Extracts useful company data indicators:
18
+ - Emails
19
+ - Phone numbers
20
+ - Possible addresses
21
+ - Social media profile links
22
+ - Page title (as a company name guess)
23
+
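Step 3 above can be sketched with a few regular expressions. This is a minimal illustration, not the patterns used in `app.py` (which this README does not show): the names `EMAIL_RE`, `PHONE_RE`, `SOCIAL_RE`, and `extract_signals` are hypothetical, the phone pattern only covers common US-style formats, and the list of social networks is an assumption.

```python
import re

# Illustrative patterns only; the app's real patterns are not shown in this README.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# US-style numbers such as (425) 555-0100 or 425-555-0100.
PHONE_RE = re.compile(r"\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}")
# Assumed set of networks; extend as needed.
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|facebook|instagram)\.com/[^\s\"'<>]+"
)


def extract_signals(text: str) -> dict:
    """Return de-duplicated matches for each signal type as sorted lists."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": sorted(set(PHONE_RE.findall(text))),
        "social_links": sorted(set(SOCIAL_RE.findall(text))),
    }


sample = (
    "Contact sales@example.com or call (425) 555-0100. "
    "Follow https://www.linkedin.com/company/example today."
)
signals = extract_signals(sample)
```

Returning sorted lists rather than sets keeps the result stable across runs and easy to display as JSON.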
24
+ The output includes:
25
+ - Truncated raw text extracted from the page
26
+ - A structured JSON-like summary of detected signals
27
+
28
+ This is a rule-based prototype for demonstration purposes. It does not replace
29
+ professional-grade web data extraction or parsing libraries.
30
+
31
+
32
+ Project Purpose
33
+ ---------------
34
+ This app demonstrates how unstructured web pages can be converted into structured
35
+ company data. It is useful for illustrating:
36
+
37
+ - Web data extraction
38
+ - Text parsing and cleaning
39
+ - Company signal detection
40
+ - Lightweight company data enrichment
41
+ - A pattern for how ZoomInfo might bootstrap new data fields
42
+
43
+ Files Included
44
+ --------------
45
+ - app.py – The main Gradio application
46
+ - requirements.txt – Python dependencies
47
+ - README.md – This file
48
+
49
+
50
+ How to Use
51
+ ----------
52
+ 1. Enter a URL (e.g. https://www.microsoft.com)
53
+ 2. Click "Extract Company Data"
54
+ 3. Inspect:
55
+ - Cleaned visible text (truncated)
56
+ - The structured output fields
57
+
58
+
59
+ Technical Notes
60
+ ---------------
61
+ Uses:
62
+ - requests – to fetch HTML
63
+ - beautifulsoup4 – to parse DOM and extract visible text
64
+ - regex patterns – for emails, phone numbers, and simple address detection
65
+ - gradio – for UI
66
+
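To illustrate what "extract visible text" means, here is a rough sketch that uses the standard library's `html.parser` as a stand-in for beautifulsoup4, so it runs without extra dependencies. It is not the code in `app.py`; the class `VisibleTextParser`, the function `visible_text`, and the 2000-character truncation limit are assumptions made for the example.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside <script>/<style>/<noscript> and capture <title>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0      # > 0 while inside a skipped element
        self._in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data.strip()
        elif self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str, max_chars: int = 2000) -> tuple[str, str]:
    """Return (page title, truncated visible text) for an HTML string."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)[:max_chars]
```

The title is kept separate from the body text because the app treats it as a company name guess rather than as page content.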
67
+ Outputs are plain Python dictionaries and strings to avoid serialization issues.
68
+
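The README does not say which serialization issues are meant. One plausible case is values such as sets, which `json.dumps` rejects. The helper below is a hypothetical sketch (`to_plain` is not a function from `app.py`) that reduces a result to plain dictionaries, lists, and strings before it is displayed.

```python
import json


def to_plain(value):
    """Recursively convert a value into JSON-friendly built-in types."""
    if isinstance(value, dict):
        return {str(k): to_plain(v) for k, v in value.items()}
    if isinstance(value, (set, frozenset)):
        # Sets are unordered and not JSON-serializable; emit a sorted list.
        return sorted((to_plain(v) for v in value), key=str)
    if isinstance(value, (list, tuple)):
        return [to_plain(v) for v in value]
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    # Fall back to the string form for anything else.
    return str(value)


summary = to_plain({"emails": {"b@example.com", "a@example.com"}, "title": "Example"})
encoded = json.dumps(summary)
```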
69
+
70
+ Running Locally
71
+ ---------------
72
+ pip install -r requirements.txt
73
+ python app.py
74
+
75
+ Gradio prints the local URL to the terminal (http://127.0.0.1:7860 by default); open it in a browser.
76
+
77
+
78
+ Limitations
79
+ -----------
80
+ This is not a complete production parser. It does not handle:
81
+ - JavaScript-rendered pages
82
+ - International address formats
83
+ - Advanced company name extraction
84
+ - Pagination or crawling
85
+ - Rate limiting or proxy management
86
+
87
+ However, it demonstrates the core concept:
88
+ **turning unstructured web data into structured company data signals.**