Spaces: LianHP/Web_page_data_html

LianHP committed on Dec 9, 2025

Commit 6942eab (verified) · 1 Parent(s): 530143f

Update README.md

README.md +82 -0

@@ -4,3 +4,85 @@ app_file: app.py
 sdk: gradio
 sdk_version: 5.47.2
 ---

+Web Company Data Extractor
+==========================
+
+Overview
+--------
+This Hugging Face demo extracts basic company-related information from unstructured
+web pages. You enter a URL, and the app:
+1. Downloads the HTML
+2. Parses visible text
+3. Extracts useful company data indicators:
+   - Emails
+   - Phone numbers
+   - Possible addresses
+   - Social media profile links
+   - Page title (as a company name guess)
+
+The output includes:
+- Truncated raw text extracted from the page
+- A structured JSON-like summary of detected signals
+
+This is a rule-based prototype for demonstration purposes. It does not replace
+professional-grade web data extraction or parsing libraries.
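
The signal detection described above can be sketched with a few regular expressions. This is a minimal illustration, not the code in app.py: the pattern definitions, the `extract_signals` name, and the output keys are all assumptions made for this example, and the phone pattern only targets North American style numbers.

```python
import re

# Illustrative patterns only; the regexes actually used by app.py are not shown in this README.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose North American style pattern, e.g. "(425) 555-0100" or "+1 425-555-0100".
PHONE_RE = re.compile(r"(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}")
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:linkedin|twitter|x|facebook|instagram|youtube)\.com/[^\s\"'<>]+",
    re.IGNORECASE,
)


def extract_signals(text: str) -> dict:
    """Return a plain dict of de-duplicated signals found in the text (hypothetical helper)."""

    def unique(matches):
        # dict.fromkeys keeps first-seen order while dropping repeats.
        return list(dict.fromkeys(matches))

    return {
        "emails": unique(EMAIL_RE.findall(text)),
        "phones": unique(PHONE_RE.findall(text)),
        "social_links": unique(SOCIAL_RE.findall(text)),
    }


if __name__ == "__main__":
    sample = (
        "Contact sales@example.com or call (425) 555-0100. "
        "Follow https://www.linkedin.com/company/example"
    )
    print(extract_signals(sample))
```

Patterns this loose will produce false positives (for example, any ten-digit number can look like a phone number), which is one reason the results are best read as signals rather than verified data.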
+
+Project Purpose
+---------------
+This app demonstrates how unstructured web pages can be converted into structured
+company data. It is useful for illustrating:
+- Web data extraction
+- Text parsing and cleaning
+- Company signal detection
+- Lightweight company data enrichment
+- A pattern for how ZoomInfo might bootstrap new data fields
+
+Files Included
+--------------
+- app.py            – The main Gradio application
+- requirements.txt  – Python dependencies
+- README.md         – This file
+
+How to Use
+----------
+1. Enter a URL (e.g. https://www.microsoft.com)
+2. Click "Extract Company Data"
+3. Inspect:
+   - Cleaned visible text (truncated)
+   - The structured output fields
+
+Technical Notes
+---------------
+Uses:
+- requests          – to fetch HTML
+- beautifulsoup4    – to parse DOM and extract visible text
+- regex patterns    – for emails, phone numbers, and simple address detection
+- gradio            – for UI
+Outputs are plain Python dictionaries and strings to avoid serialization issues.
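
As a rough sketch of the "visible text" step, the example below skips script and style content and captures the page title. To stay dependency-free it uses the standard library's `html.parser` instead of beautifulsoup4, so it is not how app.py does it. The `VisibleTextParser` class, the `visible_text` function, the `max_chars` limit, and the output keys are assumptions for illustration, and the HTML is assumed to be already downloaded.

```python
import json
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text outside <script>/<style>/<noscript> and captures <title>."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self.title = ""
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore anything inside skipped tags
        if self._in_title:
            self.title += data.strip()
        elif data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str, max_chars: int = 2000) -> dict:
    """Return the title guess and truncated visible text as a plain dict."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return {"company_name_guess": parser.title, "text": text[:max_chars]}


if __name__ == "__main__":
    page = (
        "<html><head><title>Example Corp</title><style>p{color:red}</style></head>"
        "<body><script>var x=1;</script><p>Hello <b>world</b></p></body></html>"
    )
    result = visible_text(page)
    # Plain dicts and strings serialize to JSON without custom encoders.
    print(json.dumps(result, indent=2))
```

Because the result contains only strings in a plain dictionary, it can be passed to `json.dumps` or returned to a UI component without custom serialization.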
+
+Running Locally
+---------------
+    pip install -r requirements.txt
+    python app.py
+When the app starts, Gradio prints the local URL to the console
+(http://127.0.0.1:7860 by default).
+
+Limitations
+-----------
+This is not a complete production parser. It does not handle:
+- JavaScript-rendered pages
+- International address formats
+- Advanced company name extraction
+- Pagination or crawling
+- Rate limiting or proxy management
+
+However, it demonstrates the core concept:
+**turning unstructured web data into structured company data signals.**