# Recipe Scraper – FastAPI demo
A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined **ingredients + instructions** into a single vector (`recipe_emb`). Designed as a **test project**—simple to run locally, easy to extend.
---
## Features
* 🔧 **Sites**: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
* 🧱 **Unified text**: builds `recipe_text` from sections, or embeds `("ingredients","instructions") → recipe_emb`
* 🧠 **Embeddings**: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
* 🚀 **API trigger**: `POST /scrape` runs scraping in the background
* 👀 **Progress**: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
* 💾 **Output**: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)
---
## Project layout (essential bits)
```
backend/
  app.py
  data_minning/
    base_scraper.py        # BaseRecipeScraper (+ StreamOptions)
    all_nigerian_recipe_scraper.py
    yummy_medley_scraper.py
    dto/recipe_doc.py
    soup_client.py
    utils/sanitization.py
```
Make sure every package dir has an `__init__.py`.
---
## Requirements
* Python 3.9+
* macOS/Linux (Windows should work too)
* (Optional) MongoDB/Atlas for `"mongo"` output
### Install
```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# If you don’t have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" pydantic==2.* requests beautifulsoup4 \
sentence-transformers numpy pymongo python-dotenv
```
> If `uvicorn` isn’t found on your PATH, you can always run with `python3 -m uvicorn ...`.
---
## Environment variables
Create `.env` in repo root (or export envs) as needed:
```dotenv
# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec # your Atlas Search index name
# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu # or cuda
```
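A minimal sketch of how these variables might be read at startup, assuming plain environment lookups. `Settings` and `load_settings` are hypothetical names, not part of the project, and the fallback values simply mirror the example `.env` above. If you rely on `python-dotenv`, call `load_dotenv()` before this runs.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the variables listed above."""
    mongodb_uri: Optional[str]
    mongodb_db: str
    mongodb_col: str
    atlas_index: str
    hf_model: str
    hf_device: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    # MONGODB_URI has no safe default; it only matters for output_type="mongo".
    return Settings(
        mongodb_uri=env.get("MONGODB_URI"),
        mongodb_db=env.get("MONGODB_DB", "recipes"),
        mongodb_col=env.get("MONGODB_COL", "items"),
        atlas_index=env.get("ATLAS_INDEX", "recipes_vec"),
        hf_model=env.get("HF_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        hf_device=env.get("HF_DEVICE", "cpu"),
    )
```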
---
## Running the API
From the project root (the folder **containing** `backend/`):
```bash
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
```
---
## API
### POST `/scrape`
Trigger a scrape job (non-blocking). **Body** is a JSON object; `limit` is optional, and `output_type` is either `"json"` or `"mongo"`:
```json
{
  "site": "yummy",
  "limit": 50,
  "output_type": "json"
}
```
**Headers**
* `Content-Type: application/json`
* If enabled: `X-API-Key: <ADMIN_API_KEY>`
**curl example (JSON output):**
```bash
curl -X POST http://127.0.0.1:8080/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{"site":"yummy","limit":20,"output_type":"json"}'
```
**Response**
```json
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
```
### GET `/jobs/{job_id}`
Check progress:
```bash
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
```
**Possible responses**
```json
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
```
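To wait for a job from a script rather than re-running `curl`, a small polling loop is enough. The sketch below is an assumption about how a client might do it: `wait_for_job` and its parameters are hypothetical, and the HTTP call is injected as `fetch_status` so the loop itself stays independent of any HTTP library.

```python
import time
from typing import Callable, Dict

# Statuses after which polling should stop (taken from the responses above).
TERMINAL = {"done", "error", "unknown"}


def wait_for_job(
    fetch_status: Callable[[str], Dict],
    job_id: str,
    interval: float = 1.0,
    max_polls: int = 60,
) -> Dict:
    """Poll until the job leaves the queued/running states or polls run out."""
    last: Dict = {"status": "unknown"}
    for _ in range(max_polls):
        last = fetch_status(job_id)
        if last.get("status") in TERMINAL:
            break
        time.sleep(interval)
    return last
```

With `requests` installed, `fetch_status` could be `lambda jid: requests.get(f"http://127.0.0.1:8080/jobs/{jid}").json()`.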
### (Optional) GET `/jobs`
Return the whole in-memory job map (useful for debugging):
```bash
curl http://127.0.0.1:8080/jobs
```
> Note: jobs are stored in a process-local dict and clear on server restart.
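
The process-local job map can be pictured roughly as follows. This is a sketch under assumptions, not the project's actual code: `create_job`, `run_job`, and `get_job` are hypothetical helpers, and the 8-character suffix only imitates the `yummy-a1b2c3d4` shape shown earlier. In the API, `run_job` would be scheduled through FastAPI's `BackgroundTasks.add_task`.

```python
import traceback
import uuid
from typing import Callable, Dict

JOBS: Dict[str, dict] = {}  # process-local; cleared on server restart


def create_job(site: str) -> str:
    """Register a queued job and return its id, e.g. 'yummy-a1b2c3d4'."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id


def run_job(job_id: str, scrape: Callable[[], int]) -> None:
    """Run `scrape` and record the outcome on the job entry."""
    JOBS[job_id]["status"] = "running"
    try:
        JOBS[job_id]["count"] = scrape()
        JOBS[job_id]["status"] = "done"
    except Exception:
        JOBS[job_id].update(status="error", error=traceback.format_exc())


def get_job(job_id: str) -> dict:
    """Return the job entry, or an 'unknown' status for ids we never saw."""
    return JOBS.get(job_id, {"status": "unknown"})
```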
---
## Output modes
### `"json"`
Writes batches to a JSON sink (e.g., newline-delimited file). Check the sink path configured in your `JsonArraySink`/`DualSink`.
Typical document shape (`recipe_emb` is present only when embeddings are enabled; `...` marks elided values):
```json
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]
}
```
```
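The project's `JsonArraySink`/`DualSink` are not shown here, so the following is only an illustration of the newline-delimited idea mentioned above. `NdjsonSink` and its `write_batch` method are hypothetical names; the real sink interface may differ.

```python
import json
from pathlib import Path
from typing import Iterable, Mapping


class NdjsonSink:
    """Hypothetical sink: appends one JSON document per line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def write_batch(self, docs: Iterable[Mapping]) -> int:
        """Append each document as one line and return how many were written."""
        written = 0
        with self.path.open("a", encoding="utf-8") as fh:
            for doc in docs:
                # ensure_ascii=False keeps non-ASCII recipe text readable on disk.
                fh.write(json.dumps(doc, ensure_ascii=False) + "\n")
                written += 1
        return written
```

Appending line by line means a crashed run still leaves every completed batch readable, which a single JSON array would not.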
### `"mongo"`
Writes to `MONGODB_DB.MONGODB_COL`. Ensure your Atlas Search index is created if you plan to query vectors.
**Atlas index mapping (single vector field)**
```json
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "recipe_emb": { "type": "knnVector", "dimensions": 384, "similarity": "cosine" }
    }
  }
}
```
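The index is sized at 384 because that is the output size of `all-MiniLM-L6-v2`, and `cosine` ranks neighbours by the angle between vectors rather than their length. As a rough illustration of that measure (plain Python, not how Atlas computes or scales its scores), a hypothetical `cosine_similarity` helper could look like this:

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)
```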
**Query example:**
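```python
import os

# Assumes `embedder` (your HFEmbedder), `col` (a pymongo Collection),
# and `query` (a str) already exist.
# $vectorSearch needs a plain list of floats, so convert in case
# encode() returns a NumPy array.
qvec = [float(x) for x in embedder.encode([query])[0]]
pipeline = [{
    "$vectorSearch": {
        "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
        "path": "recipe_emb",
        "queryVector": qvec,
        "numCandidates": 400,
        "limit": 10,
        "filter": {"needs_review": {"$ne": True}},
    }
}]
results = list(col.aggregate(pipeline))
```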
---
## Embeddings (combined fields → one vector)
We embed **ingredients + instructions** into a single `recipe_emb`. Two supported patterns:
### A) Combine at embedding time
Configure:
```python
embedding_fields = [
(("ingredients", "instructions"), "recipe_emb")
]
```
`_apply_embeddings` concatenates labeled sections:
```
Ingredients:
- ...
Instructions:
1. ...
```
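The project's `_apply_embeddings` is not reproduced here; as a sketch of the concatenation it is described as doing, a hypothetical `combine_fields` helper might look like the following. The label format (capitalized field name plus colon) and the blank line between sections are assumptions.

```python
from typing import Mapping, Sequence


def combine_fields(doc: Mapping[str, str], fields: Sequence[str]) -> str:
    """Join non-empty fields as labeled sections separated by a blank line."""
    parts = []
    for name in fields:
        text = (doc.get(name) or "").strip()
        if text:  # skip missing or empty sections so no bare label is embedded
            parts.append(f"{name.capitalize()}:\n{text}")
    return "\n\n".join(parts)
```

The combined string is what would be passed to the embedder to produce `recipe_emb`.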
### B) Build `recipe_text` in `RecipeDoc.finalize()` and embed once
```python
self.recipe_text = "\n\n".join(
[s for s in [
f"Title:\n{self.title}" if self.title else "",
f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
f"Instructions:\n{self.instructions_text}" if self.instructions_text else ""
] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
```
**HFEmbedder config (defaults):**
```dotenv
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
```
---
## CLI (optional but handy)
Create `run_scrape.py`:
```python
import argparse

# Adjust the package path if your layout differs
# (the project layout lists backend/data_minning/).
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper
from backend.services.data_minning.base_scraper import StreamOptions  # defined in base_scraper.py
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper

SCRAPERS = {
    "yummy": YummyMedleyScraper,
    "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=sorted(SCRAPERS), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    s = SCRAPERS[args.site]()
    # Replace `...` with your sink instance (JSON or Mongo).
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))
    print(f"Saved {saved}")
```
Run:
```bash
python3 run_scrape.py --site yummy --limit 25
```
---
## Implementation notes
### `StreamOptions` (clean params)
```python
from dataclasses import dataclass
from typing import Optional, Callable


@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
```
### Progress to `/jobs`
We pass a `progress_callback` that updates the job by `job_id`:
```python
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
```
Used as:
```python
saved = s.stream(
    sink=json_or_mongo_sink,
    options=StreamOptions(
        limit=body.limit,
        batch_size=body.limit,
        resume_file="recipes.resume",
        progress_callback=make_progress_cb(job_id),
    ),
)
```
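To make the callback contract concrete, here is a toy loop showing one way a `stream()`-style method could honour `limit`, `batch_size`, and `progress_callback`, then return an `int`. `stream_in_batches` and `write_batch` are hypothetical; the real `BaseRecipeScraper.stream` may report progress at a different granularity. Under this sketch's assumption of one report per flushed batch, a `batch_size` equal to `limit` yields a single update at the end.

```python
from typing import Callable, Iterable, List, Optional


def stream_in_batches(
    items: Iterable[dict],
    write_batch: Callable[[List[dict]], None],
    limit: Optional[int] = None,
    batch_size: int = 50,
    progress_callback: Optional[Callable[[int], None]] = None,
) -> int:
    """Flush items in batches, report the running total, return the final count."""
    saved = 0
    batch: List[dict] = []

    def flush() -> None:
        nonlocal saved
        if not batch:
            return
        write_batch(list(batch))
        saved += len(batch)
        batch.clear()
        if progress_callback:
            progress_callback(saved)  # running total, not the batch size

    for item in items:
        if limit is not None and saved + len(batch) >= limit:
            break
        batch.append(item)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final partial batch, if any
    return saved
```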
---
## Common pitfalls & fixes
* **`ModuleNotFoundError: No module named 'backend'`**
Run with module path:
`python3 -m uvicorn backend.app:app --reload`
* **Uvicorn not found (`zsh: command not found: uvicorn`)**
Use: `python3 -m uvicorn ...` or add `~/Library/Python/3.9/bin` to PATH.
* **`422 Unprocessable Entity` on `/scrape`**
In Postman: Body → **raw → JSON** and send:
`{"site":"yummy","limit":20,"output_type":"json"}`
* **Pydantic v2: “non-annotated attribute”**
Keep globals like `JOBS = {}` **outside** `BaseModel` classes.
* **`'int' object is not iterable`**
Don’t iterate `stream()`—it **returns** an `int`. Use the `progress_callback` if you need live updates.
* **`BackgroundTasks` undefined**
Import from FastAPI:
`from fastapi import BackgroundTasks`
* **Too many commas in ingredients**
Don’t `.join()` a **string**—only join if it’s a `list[str]`.
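
The last pitfall comes from `str.join` iterating over characters when it is handed a string. A small guard avoids it; `ingredients_to_text` is a hypothetical helper, and the `- item` bullet format is borrowed from the document shape shown under Output modes.

```python
from typing import List, Optional, Union


def ingredients_to_text(value: Optional[Union[str, List[str]]]) -> str:
    """Return a '- item' bullet list; strings pass through unchanged."""
    if value is None:
        return ""
    if isinstance(value, str):
        # Joining a str iterates its characters:
        # ", ".join("rice") == "r, i, c, e"
        return value.strip()
    return "\n".join(f"- {item.strip()}" for item in value if item and item.strip())
```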
---
## Future ideas (nice-to-haves)
* Store jobs in Redis for persistence across restarts
* Add `started_at` / `finished_at` timestamps and durations to jobs
* Rate-limit per site; cool-down if a scrape ran recently
* Switch to task queue (Celery/RQ/BullMQ) if you need scale
* Add `/search` endpoint that calls `$vectorSearch` in MongoDB
---