
Hospital Data Processing Pipeline

This repository contains three main scripts that together form a pipeline for collecting, cleaning, and summarizing hospital website data.

1. crawl.py – Web Crawling

  • Runs in parallel using Python’s multiprocessing.Pool to speed up crawling.

A premade list of hospital URLs is defined in the script itself and determines which hospitals are crawled; the list was compiled with each site's robots.txt in mind. Running python crawl.py executes the crawler and populates ./HospitalData/.

Output: Raw unprocessed hospital data in ./HospitalData/
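
The parallel crawl described above can be sketched roughly as follows. This is a minimal illustration, not the actual crawl.py: the URL list, the filename scheme, and the raw-HTML output format are assumptions for the example.

```python
import multiprocessing
import os
import re
import urllib.request

# Hypothetical sample entry; the real URL list lives inside crawl.py.
HOSPITAL_URLS = [
    "https://www.example-hospital.ch",
]

OUTPUT_DIR = "HospitalData"

def url_to_filename(url: str) -> str:
    """Derive a filesystem-safe output filename from a hospital URL."""
    return re.sub(r"[^A-Za-z0-9]+", "_", url).strip("_") + ".html"

def crawl_one(url: str) -> str:
    """Fetch one hospital page and store the raw response in OUTPUT_DIR."""
    path = os.path.join(OUTPUT_DIR, url_to_filename(url))
    with urllib.request.urlopen(url, timeout=30) as resp:
        raw = resp.read()
    with open(path, "wb") as f:
        f.write(raw)
    return path

if __name__ == "__main__":
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    # multiprocessing.Pool fans the URL list out across worker processes,
    # so slow sites do not block the rest of the crawl.
    with multiprocessing.Pool() as pool:
        pool.map(crawl_one, HOSPITAL_URLS)
```

Pool.map blocks until every worker has finished, so the script only exits once all raw files are on disk.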


2. clean.py – Data Cleaning

All files that produced errors during crawling, or that contain links from irrelevant sources, are removed in this step.

Running python clean.py executes the cleaning step and populates ./CleanedHospitalData/.

Output: Cleaned hospital data in JSON format, stored in ./CleanedHospitalData/
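
The filtering step can be sketched like this. The error markers and domain whitelist below are illustrative assumptions; clean.py's actual heuristics and file layout may differ.

```python
import json
import os

# Hypothetical heuristics, not clean.py's real rules.
ERROR_MARKERS = ("404 Not Found", "403 Forbidden", "Traceback")
ALLOWED_SUFFIXES = (".ch",)  # keep only links from relevant sources

def is_clean(record: dict) -> bool:
    """Reject records whose text contains error output or whose
    source URL does not come from an allowed domain."""
    text = record.get("text", "")
    url = record.get("url", "")
    if any(marker in text for marker in ERROR_MARKERS):
        return False
    domain = url.split("//")[-1].split("/")[0]
    return domain.endswith(ALLOWED_SUFFIXES)

def clean_file(src_path: str, dst_dir: str) -> None:
    """Copy a raw JSON record into dst_dir only if it passes the filters."""
    with open(src_path, encoding="utf-8") as f:
        record = json.load(f)
    if is_clean(record):
        dst = os.path.join(dst_dir, os.path.basename(src_path))
        with open(dst, "w", encoding="utf-8") as f:
            json.dump(record, f, ensure_ascii=False, indent=2)

if __name__ == "__main__":
    os.makedirs("CleanedHospitalData", exist_ok=True)
    for name in os.listdir("HospitalData"):
        if name.endswith(".json"):
            clean_file(os.path.join("HospitalData", name),
                       "CleanedHospitalData")
```

Records that fail either check are simply never copied, so ./CleanedHospitalData/ ends up holding only the files that passed both filters.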


3. summarize.py – Data Summarization

  • Reads each .json file from ./CleanedHospitalData/.
  • Calls the OpenRouter API (via openai client) using models such as grok-4-fast.
  • Summarizes the hospital’s content into keyword-focused outputs:
    • Hospital name as the heading.
    • Keywords about technologies offered.
    • Keywords about medical specialties.
    • Keywords about services provided.
  • Ignores irrelevant details, warnings, or assumptions.
  • Writes the summaries into ./SummarizedCleanedHospitalData/.

An OpenRouter API token must be stored in a file named "openrouteapi.txt" in the same directory. Running python summarize.py executes the script and populates ./SummarizedCleanedHospitalData/, which serves as the knowledge base for the Lighthouse AI model.

Output: Concise hospital summaries in ./SummarizedCleanedHospitalData/
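
The OpenRouter call made through the openai client can be sketched as below. The exact prompt wording, record layout, and model slug ("x-ai/grok-4-fast") are assumptions for the example; summarize.py may differ.

```python
import json
import os

def build_prompt(record: dict) -> str:
    """Turn one cleaned hospital record into a keyword-summary prompt.
    The wording here is an assumption, not summarize.py's exact prompt."""
    return (
        "Summarize the following hospital website content. "
        "Use the hospital name as the heading, then list keywords for "
        "technologies offered, medical specialties, and services provided. "
        "Ignore irrelevant details, warnings, or assumptions.\n\n"
        + record.get("text", "")
    )

def summarize(record: dict, client, model: str = "x-ai/grok-4-fast") -> str:
    """Send one record to OpenRouter via the openai-compatible client."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(record)}],
    )
    return resp.choices[0].message.content

if __name__ == "__main__":
    # OpenRouter exposes an OpenAI-compatible endpoint, so the stock
    # openai client works once base_url points at openrouter.ai.
    from openai import OpenAI

    with open("openrouteapi.txt") as f:
        api_key = f.read().strip()
    client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=api_key)

    os.makedirs("SummarizedCleanedHospitalData", exist_ok=True)
    for name in os.listdir("CleanedHospitalData"):
        with open(os.path.join("CleanedHospitalData", name),
                  encoding="utf-8") as f:
            record = json.load(f)
        out = os.path.join("SummarizedCleanedHospitalData",
                           name.replace(".json", ".txt"))
        with open(out, "w", encoding="utf-8") as f:
            f.write(summarize(record, client))
```

Each cleaned JSON file yields one summary text file, mirroring the one-file-per-hospital layout of the earlier stages.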


The scripts must be executed in the following order: crawl.py -> clean.py -> summarize.py

Steps for API key generation:

  1. Visit https://openrouter.ai/settings/keys
  2. Create an account if you do not already have one, then generate an API key.

Steps to execute:

  1. Create and activate a virtual environment.
  2. pip install -r requirements.txt
  3. echo "<your OpenRouter API key>" > openrouteapi.txt
  4. python crawl.py
  5. python clean.py
  6. python summarize.py