rdisipio's picture
min dedupl
858f20a
# Course Scraping System
This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts.
## Prerequisites
- **Pipenv**: All commands use `pipenv run` to ensure proper dependency management
- **Python 3.11+**: Required for the project environment
- **Dev Dependencies**: Scraping tools require development dependencies
Install Pipenv if you haven't already:
```bash
pip install pipenv
```
Install all dependencies (including scraping tools):
```bash
pipenv install --dev
```
> **Note**: The scraping system uses dev dependencies (`pyyaml`, `beautifulsoup4`, `requests`, `lxml`) since it's intended for content management, not end-user functionality. Production deployments only need `pipenv install` for the core recommendation system.
## Components
### 1. Individual Scraper (`course_scraper.py`)
Scrapes courses from a single platform for a specific topic.
```bash
pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10
```
### 2. Master Orchestrator (`master_scraper.py`)
Python script that reads YAML configuration and orchestrates multiple scraping runs.
```bash
pipenv run python scripts/master_scraper.py --dry-run
pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera
```
### 3. Bash Wrapper (`scrape.sh`)
Convenient bash interface for the master scraper.
```bash
./scripts/scrape.sh --dry-run
./scripts/scrape.sh --topic "python" --platform udemy
```
### 4. Configuration (`config/scraping_config.yaml`)
YAML file that defines what to scrape, from which platforms, and how many courses.
## Quick Start
1. **Test the configuration**:
```bash
./scripts/scrape.sh --dry-run
```
2. **Scrape specific topics**:
```bash
./scripts/scrape.sh --topic "machine learning"
```
3. **Scrape from specific platform**:
```bash
./scripts/scrape.sh --platform coursera
```
4. **Full scraping run**:
```bash
./scripts/scrape.sh
```
## Configuration
Edit `config/scraping_config.yaml` to customize:
- **Topics**: What subjects to search for
- **Platforms**: Which platforms to scrape (coursera, udemy, edx)
- **Counts**: How many courses to get per topic/platform
- **LLM Processing**: Whether to enhance data with LLM
- **Delays**: Request timing between platforms
Example configuration:
```yaml
defaults:
count: 10
process_llm: false
topics:
- name: "machine learning"
platforms:
- name: "coursera"
count: 15
- name: "udemy"
count: 10
```
## Output
Scraped courses are saved to `data/scraped_courses/raw_data/` with filenames like:
- `coursera_machine_learning_20250825_201532.json`
- `udemy_python_programming_20250825_201600.json`
Each file contains:
- **Metadata**: Topic, platform, scraping timestamp, course count
- **Courses**: Array of course objects with unique IDs, titles, descriptions, URLs, etc.
## Features
βœ… **Unique Course IDs**: Each course gets a UUID for deduplication
βœ… **Complete Descriptions**: No more truncated text
βœ… **Platform-Specific Files**: Clean organization by platform
βœ… **Configurable**: Easy YAML configuration
βœ… **Filtering**: Run specific topics or platforms
βœ… **Dry Run**: Test configurations without scraping
βœ… **Error Handling**: Robust error reporting and recovery
βœ… **Rate Limiting**: Configurable delays between requests
## Supported Platforms
- **Coursera**: βœ… Working
- **Udemy**: ⚠️ Often blocked (403 Forbidden)
- **edX**: πŸ”§ Needs selector updates
## Command Reference
### Individual Scraper
```bash
pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm]
```
### Master Scraper (Python)
```bash
pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER]
```
### Master Scraper (Bash)
```bash
./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h]
```
## Troubleshooting
- **403 Forbidden**: Platform is blocking requests (common with Udemy)
- **No courses found**: Check if platform selectors need updating
- **Config errors**: Validate YAML syntax and required fields
- **Module not found (pyyaml, beautifulsoup4, etc.)**: Install dev dependencies with `pipenv install --dev`
## Deployment Notes
### Production Environment
For end-user deployments (course recommendation system only):
```bash
pipenv install # Core dependencies only
```
### Development/Content Management Environment
For updating course database and scraping:
```bash
pipenv install --dev # Includes scraping tools
```
The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality.