rdisipio's picture
min dedupl
858f20a

A newer version of the Gradio SDK is available: 6.15.2

Upgrade

Course Scraping System

This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts.

Prerequisites

  • Pipenv: All commands use pipenv run to ensure proper dependency management
  • Python 3.11+: Required for the project environment
  • Dev Dependencies: Scraping tools require development dependencies

Install Pipenv if you haven't already:

pip install pipenv

Install all dependencies (including scraping tools):

pipenv install --dev

Note: The scraping system uses dev dependencies (pyyaml, beautifulsoup4, requests, lxml) since it's intended for content management, not end-user functionality. Production deployments only need pipenv install for the core recommendation system.

Components

1. Individual Scraper (course_scraper.py)

Scrapes courses from a single platform for a specific topic.

pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10

2. Master Orchestrator (master_scraper.py)

Python script that reads YAML configuration and orchestrates multiple scraping runs.

pipenv run python scripts/master_scraper.py --dry-run
pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera

3. Bash Wrapper (scrape.sh)

Convenient bash interface for the master scraper.

./scripts/scrape.sh --dry-run
./scripts/scrape.sh --topic "python" --platform udemy

4. Configuration (config/scraping_config.yaml)

YAML file that defines what to scrape, from which platforms, and how many courses.

Quick Start

  1. Test the configuration:

    ./scripts/scrape.sh --dry-run
    
  2. Scrape specific topics:

    ./scripts/scrape.sh --topic "machine learning"
    
  3. Scrape from specific platform:

    ./scripts/scrape.sh --platform coursera
    
  4. Full scraping run:

    ./scripts/scrape.sh
    

Configuration

Edit config/scraping_config.yaml to customize:

  • Topics: What subjects to search for
  • Platforms: Which platforms to scrape (coursera, udemy, edx)
  • Counts: How many courses to get per topic/platform
  • LLM Processing: Whether to enhance data with LLM
  • Delays: Request timing between platforms

Example configuration:

defaults:
  count: 10
  process_llm: false

topics:
  - name: "machine learning"
    platforms:
      - name: "coursera"
        count: 15
      - name: "udemy"
        count: 10

Output

Scraped courses are saved to data/scraped_courses/raw_data/ with filenames like:

  • coursera_machine_learning_20250825_201532.json
  • udemy_python_programming_20250825_201600.json

Each file contains:

  • Metadata: Topic, platform, scraping timestamp, course count
  • Courses: Array of course objects with unique IDs, titles, descriptions, URLs, etc.

Features

βœ… Unique Course IDs: Each course gets a UUID for deduplication
βœ… Complete Descriptions: No more truncated text
βœ… Platform-Specific Files: Clean organization by platform
βœ… Configurable: Easy YAML configuration
βœ… Filtering: Run specific topics or platforms
βœ… Dry Run: Test configurations without scraping
βœ… Error Handling: Robust error reporting and recovery
βœ… Rate Limiting: Configurable delays between requests

Supported Platforms

  • Coursera: βœ… Working
  • Udemy: ⚠️ Often blocked (403 Forbidden)
  • edX: πŸ”§ Needs selector updates

Command Reference

Individual Scraper

pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm]

Master Scraper (Python)

pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER]

Master Scraper (Bash)

./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h]

Troubleshooting

  • 403 Forbidden: Platform is blocking requests (common with Udemy)
  • No courses found: Check if platform selectors need updating
  • Config errors: Validate YAML syntax and required fields
  • Module not found (pyyaml, beautifulsoup4, etc.): Install dev dependencies with pipenv install --dev

Deployment Notes

Production Environment

For end-user deployments (course recommendation system only):

pipenv install  # Core dependencies only

Development/Content Management Environment

For updating course database and scraping:

pipenv install --dev  # Includes scraping tools

The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality.