	# Course Scraping System

	This directory contains the orchestrated course scraping system that can collect course data from multiple platforms using configurable topics and counts.

	## Prerequisites

	- Pipenv: All commands use `pipenv run` so they execute inside the project's virtual environment with the locked dependencies
	- Python 3.11+: Required for the project environment
	- Dev Dependencies: Scraping tools require development dependencies

	Install Pipenv if you haven't already:
	```bash
	pip install pipenv
	```

	Install all dependencies (including scraping tools):
	```bash
	pipenv install --dev
	```

	> Note: The scraping system uses dev dependencies (`pyyaml`, `beautifulsoup4`, `requests`, `lxml`) since it's intended for content management, not end-user functionality. Production deployments only need `pipenv install` for the core recommendation system.

	## Components

	### 1. Individual Scraper (`course_scraper.py`)
	Scrapes courses from a single platform for a specific topic.

	```bash
	pipenv run python scripts/course_scraper.py --topic "machine learning" --platform coursera --count 10
	```

	### 2. Master Orchestrator (`master_scraper.py`)
	Python script that reads YAML configuration and orchestrates multiple scraping runs.

	```bash
	pipenv run python scripts/master_scraper.py --dry-run
	pipenv run python scripts/master_scraper.py --topic "AI" --platform coursera
	```
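As a rough illustration of what "orchestrating multiple scraping runs" could involve, the sketch below turns a list of jobs into `course_scraper.py` invocations using the flags documented in this README. The `Job` class, the `build_command` and `select_jobs` helpers, and the case-insensitive exact-match filter are assumptions made for the example, not the actual `master_scraper.py` code.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """One scraping run (hypothetical structure, not taken from master_scraper.py)."""
    topic: str
    platform: str
    count: int
    process_llm: bool = False


def build_command(job: Job) -> list[str]:
    """Build the individual-scraper command line for one job."""
    cmd = [
        "pipenv", "run", "python", "scripts/course_scraper.py",
        "--topic", job.topic,
        "--platform", job.platform,
        "--count", str(job.count),
    ]
    if job.process_llm:
        cmd.append("--process-llm")
    return cmd


def select_jobs(jobs, topic=None, platform=None):
    """Keep jobs matching the optional filters (assumed: case-insensitive exact match)."""
    def matches(value, wanted):
        return wanted is None or value.lower() == wanted.lower()

    return [j for j in jobs if matches(j.topic, topic) and matches(j.platform, platform)]


jobs = [
    Job("machine learning", "coursera", 15),
    Job("machine learning", "udemy", 10),
]

# Dry run: print the shell-quoted commands instead of executing them.
for job in select_jobs(jobs, platform="coursera"):
    print(shlex.join(build_command(job)))
```

Building each command as a list of arguments keeps topics that contain spaces intact if the list is later passed to `subprocess.run`.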

	### 3. Bash Wrapper (`scrape.sh`)
	Convenient bash interface for the master scraper.

	```bash
	./scripts/scrape.sh --dry-run
	./scripts/scrape.sh --topic "python" --platform udemy
	```

	### 4. Configuration (`config/scraping_config.yaml`)
	YAML file that defines what to scrape, from which platforms, and how many courses.

	## Quick Start

	1. Test the configuration:
	```bash
	./scripts/scrape.sh --dry-run
	```

	2. Scrape specific topics:
	```bash
	./scripts/scrape.sh --topic "machine learning"
	```

	3. Scrape from a specific platform:
	```bash
	./scripts/scrape.sh --platform coursera
	```

	4. Full scraping run:
	```bash
	./scripts/scrape.sh
	```

	## Configuration

	Edit `config/scraping_config.yaml` to customize:

	- Topics: What subjects to search for
	- Platforms: Which platforms to scrape (coursera, udemy, edx)
	- Counts: How many courses to get per topic/platform
	- LLM Processing: Whether to enhance data with LLM
	- Delays: Request timing between platforms

	Example configuration:
	```yaml
	defaults:
	  count: 10
	  process_llm: false

	topics:
	  - name: "machine learning"
	    platforms:
	      - name: "coursera"
	        count: 15
	      - name: "udemy"
	        count: 10
	```
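The following sketch shows one way per-platform settings might fall back to the `defaults` section. The dictionary mirrors what `yaml.safe_load` would return for the example configuration, with the udemy count omitted to exercise the fallback. The `expand_jobs` helper and the override rule are assumptions, not confirmed behavior of `master_scraper.py`.

```python
def expand_jobs(config: dict) -> list[dict]:
    """Flatten the config into one dict per (topic, platform) pair."""
    defaults = config.get("defaults", {})
    jobs = []
    for topic in config.get("topics", []):
        for platform in topic.get("platforms", []):
            jobs.append({
                "topic": topic["name"],
                "platform": platform["name"],
                # Assumed rule: a per-platform value overrides the default.
                "count": platform.get("count", defaults.get("count", 10)),
                "process_llm": platform.get(
                    "process_llm", defaults.get("process_llm", False)
                ),
            })
    return jobs


# Parsed form of the example configuration (udemy count removed on purpose).
config = {
    "defaults": {"count": 10, "process_llm": False},
    "topics": [
        {
            "name": "machine learning",
            "platforms": [
                {"name": "coursera", "count": 15},
                {"name": "udemy"},  # no count given, falls back to defaults
            ],
        }
    ],
}

for job in expand_jobs(config):
    print(job)
```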

	## Output

	Scraped courses are saved to `data/scraped_courses/raw_data/` with filenames like:
	- `coursera_machine_learning_20250825_201532.json`
	- `udemy_python_programming_20250825_201600.json`

	Each file contains:
	- Metadata: Topic, platform, scraping timestamp, course count
	- Courses: Array of course objects with unique IDs, titles, descriptions, URLs, etc.
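The naming pattern and file layout above can be sketched as follows. The filename format is inferred from the two example names. The key names in the payload (`metadata`, `scraped_at`, `course_count`, `courses`) and both helper functions are assumptions, since the README does not spell out the exact JSON schema.

```python
import json
import re
from datetime import datetime


def output_filename(platform: str, topic: str, scraped_at: datetime) -> str:
    """Build a name like coursera_machine_learning_20250825_201532.json."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    slug = re.sub(r"[^a-z0-9]+", "_", topic.lower()).strip("_")
    return f"{platform}_{slug}_{scraped_at:%Y%m%d_%H%M%S}.json"


def build_payload(platform, topic, scraped_at, courses):
    """Assemble the metadata + courses structure (key names are assumptions)."""
    return {
        "metadata": {
            "topic": topic,
            "platform": platform,
            "scraped_at": scraped_at.isoformat(),
            "course_count": len(courses),
        },
        "courses": courses,
    }


when = datetime(2025, 8, 25, 20, 15, 32)
print(output_filename("coursera", "machine learning", when))
# coursera_machine_learning_20250825_201532.json
print(json.dumps(build_payload("coursera", "machine learning", when, []), indent=2))
```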

	## Features

	- ✅ Unique Course IDs: Each course gets a UUID for deduplication
	- ✅ Complete Descriptions: Description text is saved in full, not truncated
	- ✅ Platform-Specific Files: Clean organization by platform
	- ✅ Configurable: Easy YAML configuration
	- ✅ Filtering: Run specific topics or platforms
	- ✅ Dry Run: Test configurations without scraping
	- ✅ Error Handling: Robust error reporting and recovery
	- ✅ Rate Limiting: Configurable delays between requests
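
The README does not say how the course UUIDs are generated. As one illustration of how an ID can support deduplication, the sketch below derives a deterministic UUID from the course URL with `uuid.uuid5`, so the same URL always maps to the same ID. The `course_id` and `deduplicate` helpers, the `url` and `id` keys, and the example URLs are all hypothetical.

```python
import uuid


def course_id(url: str) -> str:
    """Derive a stable ID from the course URL (assumed scheme, not the project's)."""
    normalized = url.strip().rstrip("/")
    return str(uuid.uuid5(uuid.NAMESPACE_URL, normalized))


def deduplicate(courses: list[dict]) -> list[dict]:
    """Keep the first occurrence of each course, keyed by its derived ID."""
    seen = set()
    unique = []
    for course in courses:
        cid = course_id(course["url"])
        if cid not in seen:
            seen.add(cid)
            unique.append({**course, "id": cid})
    return unique


courses = [
    {"title": "Intro to ML", "url": "https://example.com/learn/intro-ml"},
    {"title": "Intro to ML", "url": "https://example.com/learn/intro-ml/"},
    {"title": "Deep Learning", "url": "https://example.com/learn/deep-learning"},
]

for course in deduplicate(courses):
    print(course["id"], course["title"])
```

A random `uuid.uuid4` would also give each course a unique ID, but it would not reveal that two records describe the same course, which is why this sketch uses a derived ID instead.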

	## Supported Platforms

	- Coursera: ✅ Working
	- Udemy: ⚠️ Often blocked (403 Forbidden)
	- edX: 🔧 Needs selector updates

	## Command Reference

	### Individual Scraper
	```bash
	pipenv run python scripts/course_scraper.py --topic TOPIC --platform PLATFORM [--count N] [--process-llm]
	```

	### Master Scraper (Python)
	```bash
	pipenv run python scripts/master_scraper.py [--config FILE] [--dry-run] [--topic FILTER] [--platform FILTER]
	```

	### Master Scraper (Bash)
	```bash
	./scripts/scrape.sh [-c FILE] [-d] [-t TOPIC] [-p PLATFORM] [-h]
	```

	## Troubleshooting

	- 403 Forbidden: Platform is blocking requests (common with Udemy)
	- No courses found: Check if platform selectors need updating
	- Config errors: Validate YAML syntax and required fields
	- Module not found (pyyaml, beautifulsoup4, etc.): Install dev dependencies with `pipenv install --dev`
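
For the "Module not found" case, a quick check like the one below can show which scraping dependencies are missing from the current environment. It is a minimal sketch using only the standard library. The `missing_packages` helper is hypothetical, while the package-to-module mapping reflects the usual import names (`pyyaml` is imported as `yaml`, `beautifulsoup4` as `bs4`).

```python
import importlib.util

# Package name on PyPI -> module name used in `import` statements.
DEV_PACKAGES = {
    "pyyaml": "yaml",
    "beautifulsoup4": "bs4",
    "requests": "requests",
    "lxml": "lxml",
}


def missing_packages(packages: dict[str, str]) -> list[str]:
    """Return the package names whose modules cannot be found."""
    return [
        package
        for package, module in packages.items()
        if importlib.util.find_spec(module) is None
    ]


missing = missing_packages(DEV_PACKAGES)
if missing:
    print("Missing:", ", ".join(missing), "-> run `pipenv install --dev`")
else:
    print("All scraping dependencies are importable.")
```

Run it with `pipenv run python` so the check inspects the project's environment rather than the system interpreter.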

	## Deployment Notes

	### Production Environment
	For end-user deployments (course recommendation system only):
	```bash
	pipenv install # Core dependencies only
	```

	### Development/Content Management Environment
	For updating course database and scraping:
	```bash
	pipenv install --dev # Includes scraping tools
	```

	The scraping system is intentionally in dev dependencies since it's for content management, not end-user functionality.