Hospital Data Processing Pipeline
This repository contains three main scripts that together form a pipeline for collecting, cleaning, and summarizing hospital website data.
1. crawl.py – Web Crawling
- Runs in parallel using Python's `multiprocessing.Pool` to speed up crawling.
- A premade list of hospital URLs, built with each site's robots.txt in mind, is defined in the file and determines which hospitals are crawled.
- A simple `python crawl.py` executes the program and populates "HospitalData".
Output: Raw, unprocessed hospital data in ./HospitalData/
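The parallel crawl can be sketched roughly as follows. The URL list, worker count, and the `url_to_filename`/`crawl_one`/`crawl_all` helpers are illustrative assumptions, not the actual contents of crawl.py:

```python
import os
import re
import urllib.request
from multiprocessing import Pool

# Hypothetical URL list; the real crawl.py ships its own robots.txt-aware list.
HOSPITAL_URLS = [
    "https://example-hospital-a.org",
    "https://example-hospital-b.org",
]

OUT_DIR = "HospitalData"

def url_to_filename(url: str) -> str:
    """Turn a URL into a filesystem-safe output filename."""
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".html"

def crawl_one(url: str) -> str:
    """Fetch one page and write the raw HTML into OUT_DIR; returns the path."""
    path = os.path.join(OUT_DIR, url_to_filename(url))
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            html = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        html = f"ERROR: {exc}"  # clean.py later discards files that recorded errors
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(html)
    return path

def crawl_all(urls):
    """Crawl all URLs in parallel with a small worker pool."""
    os.makedirs(OUT_DIR, exist_ok=True)
    with Pool(processes=4) as pool:
        return pool.map(crawl_one, urls)

# crawl_all(HOSPITAL_URLS)  # would populate ./HospitalData/
```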
2. clean.py – Data Cleaning
Files that produced errors during crawling, or that contain links from irrelevant sources, are removed in this step.
`python clean.py` executes it and populates "CleanedHospitalData".
Output: Cleaned hospital data in JSON format, stored in ./CleanedHospitalData/
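A minimal sketch of the cleaning pass, assuming two simple heuristics (an error marker left by the crawler and a hypothetical block-list of irrelevant link domains); the actual rules in clean.py may differ:

```python
import json
import os

RAW_DIR = "HospitalData"
CLEAN_DIR = "CleanedHospitalData"

# Hypothetical block-list of link domains treated as irrelevant sources.
IRRELEVANT_DOMAINS = ("facebook.com", "twitter.com", "doubleclick.net")

def is_clean(text: str) -> bool:
    """Reject files that recorded a crawl error or link to irrelevant sources."""
    if text.startswith("ERROR:"):
        return False
    return not any(domain in text for domain in IRRELEVANT_DOMAINS)

def clean_all():
    """Copy every clean raw file into CLEAN_DIR as a small JSON document."""
    os.makedirs(CLEAN_DIR, exist_ok=True)
    for name in os.listdir(RAW_DIR):
        with open(os.path.join(RAW_DIR, name), encoding="utf-8") as fh:
            text = fh.read()
        if not is_clean(text):
            continue  # drop error files and irrelevant sources
        out = os.path.join(CLEAN_DIR, os.path.splitext(name)[0] + ".json")
        with open(out, "w", encoding="utf-8") as fh:
            json.dump({"source": name, "content": text}, fh)

# clean_all()  # would populate ./CleanedHospitalData/
```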
3. summarize.py – Data Summarization
- Reads each `.json` file from `./CleanedHospitalData/`.
- Calls the OpenRouter API (via the `openai` client) using models such as `grok-4-fast`.
- Summarizes each hospital's content into keyword-focused outputs:
  - Hospital name as the heading.
  - Keywords about technologies offered.
  - Keywords about medical specialties.
  - Keywords about services provided.
- Ignores irrelevant details, warnings, or assumptions.
- Writes the summaries into `./SummarizedCleanedHospitalData/`.
An API token from OpenRouter must be stored in an "openrouteapi.txt" file in the same directory. A simple `python summarize.py` executes it and populates "SummarizedCleanedHospitalData", which is used as a knowledge base for the Lighthouse AI model.
Output: Concise hospital summaries in ./SummarizedCleanedHospitalData/
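The summarization call can be sketched against OpenRouter's OpenAI-compatible chat-completions endpoint. The exact model id, the prompt wording, and the use of raw `urllib` here (instead of the `openai` client the script actually uses) are assumptions for illustration:

```python
import json
import urllib.request

API_URL = "https://openrouter.ai/api/v1/chat/completions"
MODEL = "x-ai/grok-4-fast"  # assumed OpenRouter id for the grok-4-fast model

def build_prompt(hospital_name: str, content: str) -> str:
    """Keyword-focused prompt mirroring the output format described above."""
    return (
        f"Summarize the following hospital website text.\n"
        f"Use '{hospital_name}' as the heading, then list keywords for "
        f"technologies offered, medical specialties, and services provided. "
        f"Ignore irrelevant details, warnings, or assumptions.\n\n{content}"
    )

def summarize(hospital_name: str, content: str) -> str:
    """Send one cleaned document to OpenRouter and return the summary text."""
    with open("openrouteapi.txt", encoding="utf-8") as fh:
        token = fh.read().strip()
    payload = json.dumps({
        "model": MODEL,
        "messages": [
            {"role": "user", "content": build_prompt(hospital_name, content)}
        ],
    }).encode("utf-8")
    req = urllib.request.Request(
        API_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# summary = summarize("Example Hospital", cleaned_text)
# (summarize.py then writes each summary into ./SummarizedCleanedHospitalData/)
```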
The files need to be executed in the following order: crawl.py -> clean.py -> summarize.py
Steps for API key generation:
- visit https://openrouter.ai/settings/keys
- Create an account if you don't already have one, then generate an API key from OpenRouter.
Steps to execute:
- Create and activate a virtual environment
- pip install -r requirements.txt
- echo "YOUR_API_KEY" > openrouteapi.txt (replace YOUR_API_KEY with the key generated above)
- python crawl.py
- python clean.py
- python summarize.py