# Recipe Scraper – FastAPI demo

A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined **ingredients + instructions** into a single vector (`recipe_emb`). Designed as a **test project**: simple to run locally, easy to extend.

---

## Features

* 🔧 **Sites**: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
* 🧱 **Unified text**: builds `recipe_text` from sections, or embeds `("ingredients","instructions") → recipe_emb`
* 🧠 **Embeddings**: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
* 🚀 **API trigger**: `POST /scrape` runs scraping in the background
* 👀 **Progress**: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
* 💾 **Output**: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)
---

## Project layout (essential bits)

```
backend/
  app.py
  data_minning/
    base_scraper.py                 # BaseRecipeScraper (+ StreamOptions)
    all_nigerian_recipe_scraper.py
    yummy_medley_scraper.py
    dto/recipe_doc.py
    soup_client.py
    utils/sanitization.py
```

Make sure every package dir has an `__init__.py`.
---

## Requirements

* Python 3.9+
* macOS/Linux (Windows should work too)
* (Optional) MongoDB/Atlas for `"mongo"` output

### Install

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# If you don’t have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" "pydantic==2.*" requests beautifulsoup4 \
  sentence-transformers numpy pymongo python-dotenv
```

Note the quotes around `pydantic==2.*` — without them, zsh tries to glob the `*`.

> If `uvicorn` isn’t found on your PATH, you can always run with `python3 -m uvicorn ...`.
## Environment variables

Create `.env` in the repo root (or export envs) as needed:

```dotenv
# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec   # your Atlas Search index name

# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu             # or cuda
```
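In code, these are typically read with `os.getenv`; a minimal sketch (the fallback defaults here are illustrative assumptions, not the project's actual values):

```python
import os

# Fallbacks are illustrative; set real values in .env or the environment.
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017")
MONGODB_DB = os.getenv("MONGODB_DB", "recipes")
MONGODB_COL = os.getenv("MONGODB_COL", "items")
ATLAS_INDEX = os.getenv("ATLAS_INDEX", "recipes_vec")
HF_MODEL = os.getenv("HF_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
HF_DEVICE = os.getenv("HF_DEVICE", "cpu")
```

Pair this with `python-dotenv`'s `load_dotenv()` at startup if you want `.env` picked up automatically.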
---

## Running the API

From the project root (the folder **containing** `backend/`):

```bash
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
```
---

## API

### POST `/scrape`

Trigger a scrape job (non-blocking). **Body** is a JSON object:

```json
{
  "site": "yummy",
  "limit": 50,            // optional
  "output_type": "json"   // or "mongo"
}
```

**Headers**

* `Content-Type: application/json`
* If enabled: `X-API-Key: <ADMIN_API_KEY>`
**curl example (JSON output):**

```bash
curl -X POST http://127.0.0.1:8080/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"site":"yummy","limit":20,"output_type":"json"}'
```

**Response**

```json
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
```

### GET `/jobs/{job_id}`

Check progress:

```bash
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
```

**Possible responses**

```json
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
```

### (Optional) GET `/jobs`

Return the whole in-memory job map (useful for debugging):

```bash
curl http://127.0.0.1:8080/jobs
```

> Note: jobs are stored in a process-local dict and cleared on server restart.
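The job map can be as simple as a module-level dict keyed by `job_id`. A hedged sketch (the helper names here are hypothetical, but the status shapes match the responses above):

```python
import uuid

JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "count": ...}

def new_job(site: str) -> str:
    """Register a queued job and return its id (e.g. 'yummy-a1b2c3d4')."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id

def job_status(job_id: str) -> dict:
    """Status payload for GET /jobs/{job_id}; unknown ids get {'status': 'unknown'}."""
    return JOBS.get(job_id, {"status": "unknown"})
```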
---

## Output modes

### `"json"`

Writes batches to a JSON sink (e.g., a newline-delimited file). Check the sink path configured in your `JsonArraySink`/`DualSink`.

Typical document shape:

```json
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
```
### `"mongo"`

Writes to `MONGODB_DB.MONGODB_COL`. Ensure your Atlas Vector Search index is created if you plan to query vectors.

**Atlas Vector Search index definition (single vector field)**

```json
{
  "fields": [
    { "type": "vector", "path": "recipe_emb", "numDimensions": 384, "similarity": "cosine" },
    { "type": "filter", "path": "needs_review" }
  ]
}
```

> The older `knnVector` mapping under `mappings.fields` belongs to `$search`/`knnBeta`. The `$vectorSearch` stage used below expects a **vectorSearch**-type index like the one above, and any field referenced in its `filter` must be indexed with `"type": "filter"`.
**Query example:**

```python
qvec = embedder.encode([query])[0]
# $vectorSearch needs a plain list of floats, not a numpy array
qvec = qvec.tolist() if hasattr(qvec, "tolist") else qvec
pipeline = [{
    "$vectorSearch": {
        "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
        "path": "recipe_emb",
        "queryVector": qvec,
        "numCandidates": 400,
        "limit": 10,
        "filter": { "needs_review": { "$ne": True } }
    }
}]
results = list(col.aggregate(pipeline))
```
---

## Embeddings (combined fields → one vector)

We embed **ingredients + instructions** into a single `recipe_emb`. Two supported patterns:

### A) Combine at embedding time

Configure:

```python
embedding_fields = [
    (("ingredients", "instructions"), "recipe_emb")
]
```

`_apply_embeddings` concatenates labeled sections:

```
Ingredients:
- ...
Instructions:
1. ...
```
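A concatenation helper in the spirit of `_apply_embeddings` could look like this (the helper name and exact labels are illustrative, not the project's actual code):

```python
def combine_sections(doc: dict, fields=("ingredients", "instructions")) -> str:
    """Join labeled sections into one embeddable string, skipping empty fields."""
    parts = []
    for field in fields:
        value = (doc.get(field) or "").strip()
        if value:
            parts.append(f"{field.capitalize()}:\n{value}")
    return "\n\n".join(parts)
```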
### B) Build `recipe_text` in `RecipeDoc.finalize()` and embed once

```python
self.recipe_text = "\n\n".join(
    [s for s in [
        f"Title:\n{self.title}" if self.title else "",
        f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
        f"Instructions:\n{self.instructions_text}" if self.instructions_text else ""
    ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
```
**HFEmbedder config (defaults):**

```dotenv
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
```
---

## CLI (optional but handy)

Create `run_scrape.py`:

```python
import argparse

from backend.services.data_minning.base_scraper import StreamOptions
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper

SCRAPERS = {
    "yummy": YummyMedleyScraper,
    "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()
    s = SCRAPERS[args.site]()
    # plug in your JSON or Mongo sink here
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))
    print(f"Saved {saved}")
```

Run:

```bash
python3 run_scrape.py --site yummy --limit 25
```
---

## Implementation notes

### `StreamOptions` (clean params)

```python
from dataclasses import dataclass
from typing import Optional, Callable

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
```

### Progress to `/jobs`

We pass a `progress_callback` that updates the job by `job_id`:

```python
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
```

Used as:

```python
saved = s.stream(
    sink=json_or_mongo_sink,
    options=StreamOptions(
        limit=body.limit,
        batch_size=body.limit,
        resume_file="recipes.resume",
        progress_callback=make_progress_cb(job_id),
    ),
)
```
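For reference, the core of `stream()` might look roughly like this — a sketch of the batching/limit/callback contract, not the project's actual implementation (`StreamOptions` is repeated so the block stands alone):

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class StreamOptions:  # same shape as defined above
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None

def stream(items: Iterable, sink: Callable[[list], None], options: StreamOptions) -> int:
    """Push items to the sink in batches; returns the number of items saved."""
    saved, batch = 0, []
    for item in items:
        if options.limit is not None and saved >= options.limit:
            break
        batch.append(item)
        saved += 1
        if options.progress_callback:
            options.progress_callback(saved)
        if len(batch) >= options.batch_size:
            sink(batch)
            batch = []
        if options.delay:
            time.sleep(options.delay)  # politeness delay between fetches
    if batch:  # flush the final partial batch
        sink(batch)
    return saved
```

This also explains the `'int' object is not iterable` pitfall below: callers get the saved count back, and live progress only flows through the callback.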
---

## Common pitfalls & fixes

* **`ModuleNotFoundError: No module named 'backend'`**
  Run with the module path:
  `python3 -m uvicorn backend.app:app --reload`
* **Uvicorn not found (`zsh: command not found: uvicorn`)**
  Use: `python3 -m uvicorn ...` or add `~/Library/Python/3.9/bin` to PATH.
* **`422 Unprocessable Entity` on `/scrape`**
  In Postman: Body → **raw → JSON** and send:
  `{"site":"yummy","limit":20,"output_type":"json"}`
* **Pydantic v2: “non-annotated attribute”**
  Keep globals like `JOBS = {}` **outside** `BaseModel` classes.
* **`'int' object is not iterable`**
  Don’t iterate `stream()`—it **returns** an `int`. Use the `progress_callback` if you need live updates.
* **`BackgroundTasks` undefined**
  Import it from FastAPI:
  `from fastapi import BackgroundTasks`
* **Too many commas in ingredients**
  Don’t `.join()` a **string**—only join if it’s a `list[str]`.
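A defensive normalizer for that last pitfall might look like this (the helper name and bullet format are illustrative):

```python
def ingredients_to_text(value) -> str:
    """Render a list of ingredients as bullet lines; pass strings through unchanged."""
    if isinstance(value, list):
        return "\n".join(f"- {item}" for item in value)
    return value or ""
```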
---

## Future ideas (nice-to-haves)

* Store jobs in Redis for persistence across restarts
* Add `started_at` / `finished_at` timestamps and durations to jobs
* Rate-limit per site; cool down if a scrape ran recently
* Switch to a task queue (Celery/RQ/BullMQ) if you need scale
* Add a `/search` endpoint that calls `$vectorSearch` in MongoDB