Spaces:
Runtime error
Runtime error
| # Course Scraping System | |
| This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts. | |
| ## Prerequisites | |
| - **Pipenv**: All commands use `pipenv run` to ensure proper dependency management | |
| - **Python 3.11+**: Required for the project environment | |
| - **Dev Dependencies**: Scraping tools require development dependencies | |
| Install Pipenv if you haven't already: | |
| ```bash | |
| pip install pipenv | |
| ``` | |
| Install all dependencies (including scraping tools): | |
| ```bash | |
| pipenv install --dev | |
| ``` | |
| > **Note**: The scraping system uses dev dependencies (`pyyaml`, `beautifulsoup4`, `requests`, `lxml`) since it's intended for content management, not end-user functionality. Production deployments only need `pipenv install` for the core recommendation system. | |
| ## Components | |
| ### 1. Individual Scraper (`course_scraper.py`) | |
| Scrapes courses from a single platform for a specific topic. | |
| ```bash | |
| pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10 | |
| ``` | |
| ### 2. Master Orchestrator (`master_scraper.py`) | |
| Python script that reads YAML configuration and orchestrates multiple scraping runs. | |
| ```bash | |
| pipenv run python scripts/master_scraper.py --dry-run | |
| pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera | |
| ``` | |
| ### 3. Bash Wrapper (`scrape.sh`) | |
| Convenient bash interface for the master scraper. | |
| ```bash | |
| ./scripts/scrape.sh --dry-run | |
| ./scripts/scrape.sh --topic "python" --platform udemy | |
| ``` | |
| ### 4. Configuration (`config/scraping_config.yaml`) | |
| YAML file that defines what to scrape, from which platforms, and how many courses. | |
| ## Quick Start | |
| 1. **Test the configuration**: | |
| ```bash | |
| ./scripts/scrape.sh --dry-run | |
| ``` | |
| 2. **Scrape specific topics**: | |
| ```bash | |
| ./scripts/scrape.sh --topic "machine learning" | |
| ``` | |
| 3. **Scrape from specific platform**: | |
| ```bash | |
| ./scripts/scrape.sh --platform coursera | |
| ``` | |
| 4. **Full scraping run**: | |
| ```bash | |
| ./scripts/scrape.sh | |
| ``` | |
| ## Configuration | |
| Edit `config/scraping_config.yaml` to customize: | |
| - **Topics**: What subjects to search for | |
| - **Platforms**: Which platforms to scrape (coursera, udemy, edx) | |
| - **Counts**: How many courses to get per topic/platform | |
| - **LLM Processing**: Whether to enhance data with LLM | |
| - **Delays**: Request timing between platforms | |
| Example configuration: | |
| ```yaml | |
| defaults: | |
| count: 10 | |
| process_llm: false | |
| topics: | |
| - name: "machine learning" | |
| platforms: | |
| - name: "coursera" | |
| count: 15 | |
| - name: "udemy" | |
| count: 10 | |
| ``` | |
| ## Output | |
| Scraped courses are saved to `data/scraped_courses/raw_data/` with filenames like: | |
| - `coursera_machine_learning_20250825_201532.json` | |
| - `udemy_python_programming_20250825_201600.json` | |
| Each file contains: | |
| - **Metadata**: Topic, platform, scraping timestamp, course count | |
| - **Courses**: Array of course objects with unique IDs, titles, descriptions, URLs, etc. | |
| ## Features | |
| β **Unique Course IDs**: Each course gets a UUID for deduplication | |
| β **Complete Descriptions**: No more truncated text | |
| β **Platform-Specific Files**: Clean organization by platform | |
| β **Configurable**: Easy YAML configuration | |
| β **Filtering**: Run specific topics or platforms | |
| β **Dry Run**: Test configurations without scraping | |
| β **Error Handling**: Robust error reporting and recovery | |
| β **Rate Limiting**: Configurable delays between requests | |
| ## Supported Platforms | |
| - **Coursera**: β Working | |
| - **Udemy**: β οΈ Often blocked (403 Forbidden) | |
| - **edX**: π§ Needs selector updates | |
| ## Command Reference | |
| ### Individual Scraper | |
| ```bash | |
| pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm] | |
| ``` | |
| ### Master Scraper (Python) | |
| ```bash | |
| pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER] | |
| ``` | |
| ### Master Scraper (Bash) | |
| ```bash | |
| ./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h] | |
| ``` | |
| ## Troubleshooting | |
| - **403 Forbidden**: Platform is blocking requests (common with Udemy) | |
| - **No courses found**: Check if platform selectors need updating | |
| - **Config errors**: Validate YAML syntax and required fields | |
| - **Module not found (pyyaml, beautifulsoup4, etc.)**: Install dev dependencies with `pipenv install --dev` | |
| ## Deployment Notes | |
| ### Production Environment | |
| For end-user deployments (course recommendation system only): | |
| ```bash | |
| pipenv install # Core dependencies only | |
| ``` | |
| ### Development/Content Management Environment | |
| For updating course database and scraping: | |
| ```bash | |
| pipenv install --dev # Includes scraping tools | |
| ``` | |
| The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality. | |