Update README.md

app_file: app.py
sdk: gradio
sdk_version: 5.47.2
---
Web Company Data Extractor
==========================

Overview
--------
This Hugging Face demo extracts basic company-related information from
unstructured web pages. You enter a URL, and the app:

1. Downloads the HTML
2. Parses the visible text
3. Extracts useful company data indicators:
   - Emails
   - Phone numbers
   - Possible addresses
   - Social media profile links
   - Page title (as a company name guess)

The output includes:
- Truncated raw text extracted from the page
- A structured JSON-like summary of detected signals

This is a rule-based prototype for demonstration purposes. It does not replace
professional-grade web data extraction or parsing libraries.

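The three steps above can be sketched with the standard library alone (the app itself uses requests and beautifulsoup4, as noted under Technical Notes). The helper names and regex patterns here are illustrative, not the app's actual code; step 1, the download, is omitted:

```python
import re
from html.parser import HTMLParser

class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping <script> and <style> contents."""
    def __init__(self):
        super().__init__()
        self.skip = 0
        self.chunks = []
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip:
            self.skip -= 1
    def handle_data(self, data):
        if not self.skip and data.strip():
            self.chunks.append(data.strip())

def extract_signals(html):
    """Steps 2 and 3: parse visible text, then apply simple regex detectors."""
    parser = VisibleTextParser()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return {
        "emails": re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text),
        "phones": re.findall(r"\+?\d[\d\s().-]{7,}\d", text),
        "text_preview": text[:500],  # truncated raw text
    }
```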
Project Purpose
---------------
This app demonstrates how unstructured web pages can be converted into
structured company data. It is useful for illustrating:

- Web data extraction
- Text parsing and cleaning
- Company signal detection
- Lightweight company data enrichment
- A pattern for how ZoomInfo might bootstrap new data fields

Files Included
--------------
- app.py – the main Gradio application
- requirements.txt – Python dependencies
- README.md – this file

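Given the libraries named under Technical Notes, the requirements.txt would likely look something like this (exact contents and any version pins are an assumption):

```
requests
beautifulsoup4
gradio
```
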
How to Use
----------
1. Enter a URL (e.g. https://www.microsoft.com)
2. Click "Extract Company Data"
3. Inspect:
   - Cleaned visible text (truncated)
   - The structured output fields

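The structured output fields might look roughly like the dictionary below. The field names and values are purely illustrative (a fictional company), not app.py's exact schema:

```python
# Hypothetical example of the structured summary the app could return.
result = {
    "company_name_guess": "Example Corp | Home",  # from the page <title>
    "emails": ["info@example.com"],
    "phones": ["+1 555 010 2030"],
    "possible_addresses": ["123 Main St, Springfield"],
    "social_links": ["https://www.linkedin.com/company/example"],
}
```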
Technical Notes
---------------
Uses:
- requests – to fetch HTML
- beautifulsoup4 – to parse the DOM and extract visible text
- regex patterns – for emails, phone numbers, and simple address detection
- gradio – for the UI

Outputs are plain Python dictionaries and strings to avoid serialization issues.

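As one illustration of the regex-based detection, social profile links can be picked out of the raw HTML with a pattern like the one below. This is a sketch; the patterns in app.py may differ and may cover more networks:

```python
import re

# Hypothetical social-link detector: scans raw HTML for URLs pointing at
# well-known social domains. Not the app's actual pattern.
SOCIAL_RE = re.compile(
    r'https?://(?:www\.)?'
    r'(?:linkedin\.com|twitter\.com|x\.com|facebook\.com|instagram\.com)'
    r'[^\s"\'<>]*'
)

def find_social_links(html):
    # De-duplicate and sort for stable output.
    return sorted(set(SOCIAL_RE.findall(html)))
```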
Running Locally
---------------
    pip install -r requirements.txt
    python app.py

The app will run locally on a Gradio-generated URL.

Limitations
-----------
This is not a complete production parser. It does not handle:
- JavaScript-rendered pages
- International address formats
- Advanced company name extraction
- Pagination or crawling
- Rate limiting or proxy management

However, it demonstrates the core concept:
**turning unstructured web data into structured company data signals.**