Course Scraping System
This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts.
Prerequisites
- Pipenv: All commands use pipenv run to ensure proper dependency management
- Python 3.11+: Required for the project environment
- Dev Dependencies: Scraping tools require development dependencies
Install Pipenv if you haven't already:
pip install pipenv
Install all dependencies (including scraping tools):
pipenv install --dev
Note: The scraping system uses dev dependencies (pyyaml, beautifulsoup4, requests, lxml) since it's intended for content management, not end-user functionality. Production deployments only need pipenv install for the core recommendation system.
Components
1. Individual Scraper (course_scraper.py)
Scrapes courses from a single platform for a specific topic.
pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10
2. Master Orchestrator (master_scraper.py)
Python script that reads YAML configuration and orchestrates multiple scraping runs.
pipenv run python scripts/master_scraper.py --dry-run
pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera
3. Bash Wrapper (scrape.sh)
Convenient bash interface for the master scraper.
./scripts/scrape.sh --dry-run
./scripts/scrape.sh --topic "python" --platform udemy
4. Configuration (config/scraping_config.yaml)
YAML file that defines what to scrape, from which platforms, and how many courses.
Quick Start
Test the configuration:
./scripts/scrape.sh --dry-run
Scrape specific topics:
./scripts/scrape.sh --topic "machine learning"
Scrape from specific platform:
./scripts/scrape.sh --platform coursera
Full scraping run:
./scripts/scrape.sh
Configuration
Edit config/scraping_config.yaml to customize:
- Topics: What subjects to search for
- Platforms: Which platforms to scrape (coursera, udemy, edx)
- Counts: How many courses to get per topic/platform
- LLM Processing: Whether to enhance data with LLM
- Delays: Request timing between platforms
Example configuration:
defaults:
  count: 10
  process_llm: false
topics:
  - name: "machine learning"
    platforms:
      - name: "coursera"
        count: 15
      - name: "udemy"
        count: 10
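To make the relationship between defaults and per-platform settings concrete, here is a minimal sketch of how such a configuration could be flattened into individual scraping jobs. It is not the actual master_scraper.py logic: the expand_jobs helper is hypothetical, the fallback from a platform entry to the defaults block is an assumption, and the dict stands in for what yaml.safe_load would return for the example above.

```python
from typing import Any

# Hypothetical in-memory form of the example configuration; in practice this
# would come from yaml.safe_load() on config/scraping_config.yaml.
CONFIG: dict[str, Any] = {
    "defaults": {"count": 10, "process_llm": False},
    "topics": [
        {
            "name": "machine learning",
            "platforms": [
                {"name": "coursera", "count": 15},
                {"name": "udemy", "count": 10},
            ],
        }
    ],
}


def expand_jobs(config: dict[str, Any]) -> list[dict[str, Any]]:
    """Flatten the config into one job per (topic, platform) pair.

    Assumption: a platform entry without its own value falls back to the
    matching key in the 'defaults' block.
    """
    defaults = config.get("defaults", {})
    jobs: list[dict[str, Any]] = []
    for topic in config.get("topics", []):
        for platform in topic.get("platforms", []):
            jobs.append(
                {
                    "topic": topic["name"],
                    "platform": platform["name"],
                    "count": platform.get("count", defaults.get("count", 10)),
                    "process_llm": platform.get(
                        "process_llm", defaults.get("process_llm", False)
                    ),
                }
            )
    return jobs


if __name__ == "__main__":
    for job in expand_jobs(CONFIG):
        print(job)
```

Each resulting job corresponds to one invocation of the individual scraper with --topic, --platform and --count.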
Output
Scraped courses are saved to data/scraped_courses/raw_data/ with filenames like:
coursera_machine_learning_20250825_201532.json
udemy_python_programming_20250825_201600.json
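The filenames appear to follow a platform_topic_timestamp pattern. A minimal sketch of how such a name could be built is shown here; the output_filename helper is hypothetical, and both the slug rule (lowercase, non-alphanumerics collapsed to underscores) and the YYYYMMDD_HHMMSS timestamp layout are inferred from the examples rather than taken from the scraper's code.

```python
import re
from datetime import datetime


def output_filename(platform: str, topic: str, when: datetime) -> str:
    """Build a name like coursera_machine_learning_20250825_201532.json.

    Assumption: the topic is lowercased and every run of characters other
    than a-z and 0-9 becomes a single underscore.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return f"{platform}_{slug}_{when:%Y%m%d_%H%M%S}.json"


if __name__ == "__main__":
    print(output_filename("coursera", "machine learning", datetime.now()))
```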
Each file contains:
- Metadata: Topic, platform, scraping timestamp, course count
- Courses: Array of course objects with unique IDs, titles, descriptions, URLs, etc.
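As an illustration of the two-part layout described above, the following sketch assembles a payload with a metadata block and a courses array. The key names (topic, platform, scraped_at, course_count, id, title, description, url) and the build_payload helper are hypothetical; the real files may use different field names.

```python
import json
import uuid
from datetime import datetime, timezone
from typing import Any


def build_payload(
    topic: str, platform: str, courses: list[dict[str, Any]]
) -> dict[str, Any]:
    """Wrap scraped courses in a metadata + courses structure.

    Assumption: each course gets a random UUID (uuid4) as its 'id'.
    """
    # Copy each course so the caller's dicts are not mutated.
    with_ids = [{"id": str(uuid.uuid4()), **course} for course in courses]
    return {
        "metadata": {
            "topic": topic,
            "platform": platform,
            "scraped_at": datetime.now(timezone.utc).isoformat(),
            "course_count": len(with_ids),
        },
        "courses": with_ids,
    }


if __name__ == "__main__":
    sample = [
        {
            "title": "Example course",
            "description": "Placeholder description.",
            "url": "https://example.com/course",
        }
    ]
    print(json.dumps(build_payload("machine learning", "coursera", sample), indent=2))
```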
Features
- ✅ Unique Course IDs: Each course gets a UUID for deduplication
- ✅ Complete Descriptions: No more truncated text
- ✅ Platform-Specific Files: Clean organization by platform
- ✅ Configurable: Easy YAML configuration
- ✅ Filtering: Run specific topics or platforms
- ✅ Dry Run: Test configurations without scraping
- ✅ Error Handling: Robust error reporting and recovery
- ✅ Rate Limiting: Configurable delays between requests
Supported Platforms
- Coursera: ✅ Working
- Udemy: ⚠️ Often blocked (403 Forbidden)
- edX: 🚧 Needs selector updates
Command Reference
Individual Scraper
pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm]
Master Scraper (Python)
pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER]
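For readers who want to see how the flags in this synopsis map onto a parser, here is a minimal argparse sketch. It only mirrors the documented options; the build_parser helper, the help strings, and the default config path are assumptions, not the actual contents of master_scraper.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented master scraper options."""
    parser = argparse.ArgumentParser(
        prog="master_scraper.py",
        description="Run scraping jobs defined in a YAML configuration.",
    )
    # Assumed default: the config path mentioned in the Configuration section.
    parser.add_argument(
        "--config",
        metavar="FILE",
        default="config/scraping_config.yaml",
        help="path to the YAML configuration file",
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="show what would be scraped without making requests",
    )
    parser.add_argument("--topic", metavar="FILTER", help="only run matching topics")
    parser.add_argument(
        "--platform", metavar="FILTER", help="only run matching platforms"
    )
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```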
Master Scraper (Bash)
./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h]
Troubleshooting
- 403 Forbidden: Platform is blocking requests (common with Udemy)
- No courses found: Check if platform selectors need updating
- Config errors: Validate YAML syntax and required fields
- Module not found (pyyaml, beautifulsoup4, etc.): Install dev dependencies with pipenv install --dev
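When chasing config errors, a small structural check on the parsed YAML can narrow things down before a full run. The sketch below is one possible approach, not the project's own validation: the validate_config helper is hypothetical, and the set of required fields (a name and a non-empty platforms list per topic, an optional positive integer count) is assumed from the example configuration.

```python
from typing import Any

# Platform names listed in the Configuration section.
SUPPORTED_PLATFORMS = {"coursera", "udemy", "edx"}


def validate_config(config: Any) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    if not isinstance(config, dict):
        return ["top level must be a mapping"]
    topics = config.get("topics")
    if not isinstance(topics, list) or not topics:
        return ["'topics' must be a non-empty list"]

    errors: list[str] = []
    for i, topic in enumerate(topics):
        if not isinstance(topic, dict) or not topic.get("name"):
            errors.append(f"topics[{i}]: missing 'name'")
            continue
        platforms = topic.get("platforms")
        if not isinstance(platforms, list) or not platforms:
            errors.append(f"topics[{i}]: 'platforms' must be a non-empty list")
            continue
        for j, platform in enumerate(platforms):
            where = f"topics[{i}].platforms[{j}]"
            if not isinstance(platform, dict):
                errors.append(f"{where}: must be a mapping")
                continue
            name = platform.get("name")
            if name not in SUPPORTED_PLATFORMS:
                errors.append(f"{where}: unknown platform {name!r}")
            count = platform.get("count")
            # bool is a subclass of int, so reject it explicitly.
            if count is not None and (
                isinstance(count, bool) or not isinstance(count, int) or count < 1
            ):
                errors.append(f"{where}: 'count' must be a positive integer")
    return errors


if __name__ == "__main__":
    bad = {"topics": [{"name": "ml", "platforms": [{"name": "skillshare"}]}]}
    for problem in validate_config(bad):
        print(problem)
```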
Deployment Notes
Production Environment
For end-user deployments (course recommendation system only):
pipenv install # Core dependencies only
Development/Content Management Environment
For updating course database and scraping:
pipenv install --dev # Includes scraping tools
The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality.