# Course Scraping System

This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts.

## Prerequisites

- **Pipenv**: All commands use `pipenv run` to ensure proper dependency management
- **Python 3.11+**: Required for the project environment
- **Dev Dependencies**: Scraping tools require development dependencies

Install Pipenv if you haven't already: 
```bash
pip install pipenv
```

Install all dependencies (including scraping tools):
```bash
pipenv install --dev
```

> **Note**: The scraping system uses dev dependencies (`pyyaml`, `beautifulsoup4`, `requests`, `lxml`) since it's intended for content management, not end-user functionality. Production deployments only need `pipenv install` for the core recommendation system.

## Components

### 1. Individual Scraper (`course_scraper.py`)
Scrapes courses from a single platform for a specific topic.

```bash
pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10
```

### 2. Master Orchestrator (`master_scraper.py`) 
Python script that reads YAML configuration and orchestrates multiple scraping runs.

```bash
pipenv run python scripts/master_scraper.py --dry-run
pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera
```

### 3. Bash Wrapper (`scrape.sh`)
Convenient bash interface for the master scraper.

```bash
./scripts/scrape.sh --dry-run
./scripts/scrape.sh --topic "python" --platform udemy
```

### 4. Configuration (`config/scraping_config.yaml`)
YAML file that defines what to scrape, from which platforms, and how many courses.

## Quick Start

1. **Test the configuration**:
   ```bash
   ./scripts/scrape.sh --dry-run
   ```

2. **Scrape specific topics**:
   ```bash
   ./scripts/scrape.sh --topic "machine learning"
   ```

3. **Scrape from specific platform**:
   ```bash
   ./scripts/scrape.sh --platform coursera
   ```

4. **Full scraping run**:
   ```bash
   ./scripts/scrape.sh
   ```

## Configuration

Edit `config/scraping_config.yaml` to customize:

- **Topics**: What subjects to search for
- **Platforms**: Which platforms to scrape (coursera, udemy, edx)
- **Counts**: How many courses to get per topic/platform
- **LLM Processing**: Whether to enhance data with LLM
- **Delays**: Request timing between platforms

Example configuration:
```yaml
defaults:
  count: 10
  process_llm: false

topics:
  - name: "machine learning"
    platforms:
      - name: "coursera"
        count: 15
      - name: "udemy"
        count: 10
```

## Output

Scraped courses are saved to `data/scraped_courses/raw_data/` with filenames like:
- `coursera_machine_learning_20250825_201532.json`
- `udemy_python_programming_20250825_201600.json`

Each file contains:
- **Metadata**: Topic, platform, scraping timestamp, course count
- **Courses**: Array of course objects with unique IDs, titles, descriptions, URLs, etc.

## Features

✅ **Unique Course IDs**: Each course gets a UUID for deduplication  
✅ **Complete Descriptions**: No more truncated text  
✅ **Platform-Specific Files**: Clean organization by platform  
✅ **Configurable**: Easy YAML configuration  
✅ **Filtering**: Run specific topics or platforms  
✅ **Dry Run**: Test configurations without scraping  
✅ **Error Handling**: Robust error reporting and recovery  
✅ **Rate Limiting**: Configurable delays between requests  

## Supported Platforms

- **Coursera**: ✅ Working
- **Udemy**: ⚠️ Often blocked (403 Forbidden)  
- **edX**: 🔧 Needs selector updates

## Command Reference

### Individual Scraper
```bash
pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm]
```

### Master Scraper (Python)
```bash
pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER]
```

### Master Scraper (Bash)
```bash
./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h]
```

## Troubleshooting

- **403 Forbidden**: Platform is blocking requests (common with Udemy)
- **No courses found**: Check if platform selectors need updating
- **Config errors**: Validate YAML syntax and required fields
- **Module not found (pyyaml, beautifulsoup4, etc.)**: Install dev dependencies with `pipenv install --dev`

## Deployment Notes

### Production Environment
For end-user deployments (course recommendation system only):
```bash
pipenv install  # Core dependencies only
```

### Development/Content Management Environment  
For updating course database and scraping:
```bash
pipenv install --dev  # Includes scraping tools
```

The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality.