Recipe Scraper – FastAPI demo
A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined ingredients + instructions into a single vector (recipe_emb). Designed as a test project—simple to run locally, easy to extend.
Features
- 🔧 Sites: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
- 🧱 Unified text: builds `recipe_text` from sections, or embeds `("ingredients", "instructions")` → `recipe_emb`
- 🧠 Embeddings: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
- 🚀 API trigger: `POST /scrape` runs scraping in the background
- 👀 Progress: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
- 💾 Output: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)
Project layout (essential bits)
backend/
  app.py
  data_minning/
    base_scraper.py                  # BaseRecipeScraper (+ StreamOptions)
    all_nigerian_recipe_scraper.py
    yummy_medley_scraper.py
    dto/recipe_doc.py
    soup_client.py
  utils/sanitization.py
Make sure every package dir has an __init__.py.
Requirements
- Python 3.9+
- macOS/Linux (Windows should work too)
- (Optional) MongoDB/Atlas for `"mongo"` output
Install
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# If you don’t have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" pydantic==2.* requests beautifulsoup4 \
sentence-transformers numpy pymongo python-dotenv
If `uvicorn` isn't found on your PATH, you can always run it with `python3 -m uvicorn ...`.
Environment variables
Create .env in repo root (or export envs) as needed:
# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec # your Atlas Search index name
# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu # or cuda
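In production you would load these with `python-dotenv` (already in the install list). For illustration only, here is a minimal stdlib-only `.env` loader sketch; `load_env` is a hypothetical helper, and it only handles the simple `KEY=value # comment` lines shown above:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader (sketch; python-dotenv handles this more robustly).

    Skips blank lines and comments, strips trailing inline comments,
    and never overwrites variables that are already set.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # Drop trailing inline comments like "cpu # or cuda"
            os.environ.setdefault(key.strip(), value.split("#")[0].strip())
```

Note this sketch would mishandle values that legitimately contain `#`; that is one reason to prefer `python-dotenv`.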
Running the API
From the project root (the folder containing backend/):
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
API
POST /scrape
Trigger a scrape job (non-blocking). Body is a JSON object:
{
  "site": "yummy",
  "limit": 50,            // optional
  "output_type": "json"   // or "mongo"
}
Headers
- `Content-Type: application/json`
- If enabled: `X-API-Key: <ADMIN_API_KEY>`
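The README doesn't show how the key check is wired up; a framework-agnostic sketch might look like the following, where `check_api_key` and the `ADMIN_API_KEY` env var are assumptions (in FastAPI you would typically wrap this in a `Header`-based dependency):

```python
import hmac
import os

def check_api_key(headers: dict) -> bool:
    """Hypothetical helper: compare X-API-Key against ADMIN_API_KEY.

    Uses hmac.compare_digest for a constant-time comparison and
    rejects requests outright when no admin key is configured.
    """
    expected = os.environ.get("ADMIN_API_KEY", "")
    supplied = headers.get("X-API-Key", "")
    return bool(expected) and hmac.compare_digest(supplied, expected)
```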
curl example (JSON output):
curl -X POST http://127.0.0.1:8080/scrape \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{"site":"yummy","limit":20,"output_type":"json"}'
Response
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
GET /jobs/{job_id}
Check progress:
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
Possible responses
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
(Optional) GET /jobs
Return the whole in-memory job map (useful for debugging):
curl http://127.0.0.1:8080/jobs
Note: jobs are stored in a process-local dict and clear on server restart.
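A minimal sketch of such a process-local job map, assuming ids follow the `"<site>-<hex>"` pattern seen in the responses above (the helper names are hypothetical):

```python
import uuid

JOBS: dict = {}  # process-local; cleared on server restart

def create_job(site: str) -> str:
    """Register a queued job and return an id like 'yummy-a1b2c3d4'."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id

def job_status(job_id: str) -> dict:
    """Look up a job; unknown ids map to {'status': 'unknown'}."""
    return JOBS.get(job_id, {"status": "unknown"})
```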
Output modes
"json"
Writes batches to a JSON sink (e.g., newline-delimited file). Check the sink path configured in your JsonArraySink/DualSink.
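The actual sink classes live in the repo; as a rough sketch of what a newline-delimited sink could look like (class name and `write_batch` signature are assumptions, not the real `JsonArraySink` API):

```python
import json

class NdjsonSink:
    """Hypothetical sink: append one JSON document per line (NDJSON)."""

    def __init__(self, path: str):
        self.path = path

    def write_batch(self, docs) -> int:
        with open(self.path, "a", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
        return len(docs)
```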
Typical document shape:
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
"mongo"
Writes to MONGODB_DB.MONGODB_COL. Ensure your Atlas Search index is created if you plan to query vectors.
Atlas index mapping (single vector field)
{
  "mappings": {
    "dynamic": false,
    "fields": {
      "recipe_emb": { "type": "knnVector", "dims": 384, "similarity": "cosine" }
    }
  }
}
Query example:
qvec = embedder.encode([query])[0].tolist()  # BSON needs a plain list, not a numpy array
pipeline = [{
    "$vectorSearch": {
        "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
        "path": "recipe_emb",
        "queryVector": qvec,
        "numCandidates": 400,
        "limit": 10,
        "filter": {"needs_review": {"$ne": True}},
    }
}]
results = list(col.aggregate(pipeline))
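If you use `"json"` output instead of Mongo, there is no Atlas index to query; for a small local corpus you can rank documents against their stored `recipe_emb` yourself. A plain-Python sketch (no numpy dependency):

```python
import math

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors.

    Returns 0.0 when either vector has zero magnitude.
    """
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

You would then sort documents by `cosine(qvec, doc["recipe_emb"])` descending; this is only practical for small collections, which is why the Mongo path uses `$vectorSearch`.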
Embeddings (combined fields → one vector)
We embed ingredients + instructions into a single recipe_emb. Two supported patterns:
A) Combine at embedding time
Configure:
embedding_fields = [
    (("ingredients", "instructions"), "recipe_emb"),
]
_apply_embeddings concatenates labeled sections:
Ingredients:
- ...
Instructions:
1. ...
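As a sketch of what that concatenation step might look like (the function name is hypothetical; the repo's `_apply_embeddings` may differ in detail):

```python
def combine_sections(doc: dict, fields=("ingredients", "instructions")) -> str:
    """Concatenate labeled sections into one string to embed (pattern A sketch).

    Empty or missing fields are skipped so the label never appears
    without content underneath it.
    """
    parts = []
    for field in fields:
        text = (doc.get(field) or "").strip()
        if text:
            parts.append(f"{field.capitalize()}:\n{text}")
    return "\n\n".join(parts)
```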
B) Build recipe_text in RecipeDoc.finalize() and embed once
self.recipe_text = "\n\n".join(
    [s for s in [
        f"Title:\n{self.title}" if self.title else "",
        f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
        f"Instructions:\n{self.instructions_text}" if self.instructions_text else "",
    ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
HFEmbedder config (defaults):
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
CLI (optional but handy)
Create run_scrape.py:
from backend.services.data_minning.base_scraper import StreamOptions
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper

SCRAPERS = {
    "yummy": YummyMedleyScraper,
    "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    import argparse

    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    s = SCRAPERS[args.site]()
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))  # plug in your sink here
    print(f"Saved {saved}")
Run:
python3 run_scrape.py --site yummy --limit 25
Implementation notes
StreamOptions (clean params)
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
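To make the contract concrete, here is a simplified, self-contained sketch of how a scraper's `stream()` loop might honor these options (the real `BaseRecipeScraper.stream` also handles `resume_file` and error recovery; `urls`, `parse`, and `sink` are assumed stand-ins):

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StreamOptions:  # repeated here so the sketch runs on its own
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None

def stream(urls, parse, sink, options: StreamOptions) -> int:
    """Sketch: scrape up to `limit` urls, flushing batches to the sink."""
    saved, batch = 0, []
    for url in urls:
        if options.limit is not None and saved >= options.limit:
            break
        batch.append(parse(url))
        saved += 1
        if options.progress_callback:
            options.progress_callback(saved)
        if len(batch) >= options.batch_size:
            sink.write_batch(batch)
            batch = []
        time.sleep(options.delay)  # politeness delay between requests
    if batch:
        sink.write_batch(batch)  # flush the final partial batch
    return saved  # an int, not an iterable (see pitfalls below)
```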
Progress to /jobs
We pass a progress_callback that updates the job by job_id:
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
Used as:
saved = s.stream(
    sink=json_or_mongo_sink,
    options=StreamOptions(
        limit=body.limit,
        batch_size=body.limit,
        resume_file="recipes.resume",
        progress_callback=make_progress_cb(job_id),
    ),
)
Common pitfalls & fixes
- `ModuleNotFoundError: No module named 'backend'`: run with the module path: `python3 -m uvicorn backend.app:app --reload`
- Uvicorn not found (`zsh: command not found: uvicorn`): use `python3 -m uvicorn ...` or add `~/Library/Python/3.9/bin` to your PATH.
- `422 Unprocessable Entity` on `/scrape`: in Postman, set Body → raw → JSON and send `{"site":"yummy","limit":20,"output_type":"json"}`
- Pydantic v2 "non-annotated attribute": keep globals like `JOBS = {}` outside `BaseModel` classes.
- `'int' object is not iterable`: don't iterate `stream()`; it returns an `int`. Use the `progress_callback` if you need live updates.
- `BackgroundTasks` undefined: import it from FastAPI: `from fastapi import BackgroundTasks`
- Too many commas in ingredients: don't `.join()` a string; only join when it's a `list[str]`.
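The last pitfall above comes from Python iterating a string character by character, so `", ".join("rice")` yields `"r, i, c, e"`. A small defensive helper (the name is hypothetical) sidesteps it:

```python
def ingredients_to_text(ingredients) -> str:
    """Join only when given a list; a bare string passes through unchanged."""
    if isinstance(ingredients, str):
        return ingredients  # already formatted text; joining would split chars
    return "\n".join(f"- {item}" for item in ingredients)
```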
Future ideas (nice-to-haves)
- Store jobs in Redis for persistence across restarts
- Add `started_at`/`finished_at` timestamps and durations to jobs
- Rate-limit per site; cool down if a scrape ran recently
- Switch to a task queue (Celery/RQ/BullMQ) if you need scale
- Add a `/search` endpoint that calls `$vectorSearch` in MongoDB