# Recipe Scraper – FastAPI demo

A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined **ingredients + instructions** into a single vector (`recipe_emb`). Designed as a **test project**—simple to run locally, easy to extend.

---

## Features

* 🔧 **Sites**: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
* 🧱 **Unified text**: builds `recipe_text` from sections, or embeds `("ingredients", "instructions") → recipe_emb`
* 🧠 **Embeddings**: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
* 🚀 **API trigger**: `POST /scrape` runs scraping in the background
* 👀 **Progress**: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
* 💾 **Output**: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)

---

## Project layout (essential bits)

```
backend/
  app.py
  data_minning/
    base_scraper.py                  # BaseRecipeScraper (+ StreamOptions)
    all_nigerian_recipe_scraper.py
    yummy_medley_scraper.py
    dto/recipe_doc.py
  soup_client.py
  utils/sanitization.py
```

Make sure every package dir has an `__init__.py`.

---

## Requirements

* Python 3.9+
* macOS/Linux (Windows should work too)
* (Optional) MongoDB/Atlas for `"mongo"` output

### Install

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt

# If you don't have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" "pydantic==2.*" requests beautifulsoup4 \
  sentence-transformers numpy pymongo python-dotenv
```

> If `uvicorn` isn't found on your PATH, you can always run with `python3 -m uvicorn ...`.
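The "unified text" feature from the list above can be sketched as a small standalone helper (the name `build_recipe_text` is hypothetical; in this project the equivalent logic lives in `RecipeDoc.finalize()`, shown later):

```python
from typing import Optional


def build_recipe_text(title: Optional[str],
                      ingredients: Optional[str],
                      instructions: Optional[str]) -> str:
    """Join labeled, non-empty sections with a blank line between them."""
    sections = [
        f"Title:\n{title}" if title else "",
        f"Ingredients:\n{ingredients}" if ingredients else "",
        f"Instructions:\n{instructions}" if instructions else "",
    ]
    # Empty sections are dropped so the embedded text has no dangling labels.
    return "\n\n".join(s for s in sections if s)


text = build_recipe_text("Jollof Rice", "- 1 cup rice\n- 2 tbsp oil", None)
```

Embedding one combined string like this (rather than one vector per field) is what keeps the document down to a single `recipe_emb`.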
---

## Environment variables

Create `.env` in the repo root (or export the variables) as needed:

```dotenv
# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec   # your Atlas Search index name

# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu             # or cuda
```

---

## Running the API

From the project root (the folder **containing** `backend/`):

```bash
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
```

---

## API

### POST `/scrape`

Trigger a scrape job (non-blocking). **Body** is a JSON object:

```json
{
  "site": "yummy",
  "limit": 50,             // optional
  "output_type": "json"    // or "mongo"
}
```

**Headers**

* `Content-Type: application/json`
* If enabled: `X-API-Key: <your-key>`

**curl example (JSON output):**

```bash
curl -X POST http://127.0.0.1:8080/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"site":"yummy","limit":20,"output_type":"json"}'
```

**Response**

```json
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
```

### GET `/jobs/{job_id}`

Check progress:

```bash
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
```

**Possible responses**

```json
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
```

### (Optional) GET `/jobs`

Return the whole in-memory job map (useful for debugging):

```bash
curl http://127.0.0.1:8080/jobs
```

> Note: jobs are stored in a process-local dict and are cleared on server restart.

---

## Output modes

### `"json"`

Writes batches to a JSON sink (e.g., a newline-delimited file). Check the sink path configured in your `JsonArraySink`/`DualSink`. Typical document shape:

```json
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. 
Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
```

### `"mongo"`

Writes to `MONGODB_DB.MONGODB_COL`. Ensure your Atlas index is created if you plan to query vectors.

**Atlas Vector Search index (single vector field).** The `$vectorSearch` stage below requires an index of type `vectorSearch` (the older `knnVector` Search mapping only works with `$search`); `needs_review` is declared as a `filter` field so the query filter below can use it:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "recipe_emb",
      "numDimensions": 384,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "needs_review" }
  ]
}
```

**Query example:**

```python
import os

# .tolist(): sentence-transformers returns a numpy array,
# which pymongo can't serialize to BSON directly
qvec = embedder.encode([query])[0].tolist()

pipeline = [{
    "$vectorSearch": {
        "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
        "path": "recipe_emb",
        "queryVector": qvec,
        "numCandidates": 400,
        "limit": 10,
        # filter fields must be indexed as type "filter" (see mapping above)
        "filter": {"needs_review": {"$ne": True}},
    }
}]
results = list(col.aggregate(pipeline))
```

---

## Embeddings (combined fields → one vector)

We embed **ingredients + instructions** into a single `recipe_emb`. Two supported patterns:

### A) Combine at embedding time

Configure:

```python
embedding_fields = [
    (("ingredients", "instructions"), "recipe_emb")
]
```

`_apply_embeddings` concatenates labeled sections:

```
Ingredients:
- ...

Instructions:
1. ...
```

### B) Build `recipe_text` in `RecipeDoc.finalize()` and embed once

```python
self.recipe_text = "\n\n".join(
    [s for s in [
        f"Title:\n{self.title}" if self.title else "",
        f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
        f"Instructions:\n{self.instructions_text}" if self.instructions_text else "",
    ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
```

**HFEmbedder config (defaults):**

```dotenv
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
```

---

## CLI (optional but handy)

Create `run_scrape.py`:

```python
import argparse

from backend.services.data_minning.base_scraper import StreamOptions
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper

SCRAPERS = {
    "yummy": YummyMedleyScraper,
    "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    s = SCRAPERS[args.site]()
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))  # plug in your sink
    print(f"Saved {saved}")
```

Run:

```bash
python3 run_scrape.py --site yummy --limit 25
```

---

## Implementation notes

### `StreamOptions` (clean params)

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
```

### Progress to `/jobs`

We pass a `progress_callback` that updates the job by `job_id`:

```python
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
```

Used as:

```python
saved = s.stream(
    sink=json_or_mongo_sink,
    options=StreamOptions(
        limit=body.limit,
        batch_size=body.limit,
        resume_file="recipes.resume",
        progress_callback=make_progress_cb(job_id),
    ),
)
```

---
## Common pitfalls & fixes

* **`ModuleNotFoundError: No module named 'backend'`**
  Run with the module path: `python3 -m uvicorn backend.app:app --reload`
* **Uvicorn not found (`zsh: command not found: uvicorn`)**
  Use `python3 -m uvicorn ...` or add `~/Library/Python/3.9/bin` to PATH.
* **`422 Unprocessable Entity` on `/scrape`**
  In Postman: Body → **raw → JSON** and send: `{"site":"yummy","limit":20,"output_type":"json"}`
* **Pydantic v2: "non-annotated attribute"**
  Keep globals like `JOBS = {}` **outside** `BaseModel` classes.
* **`'int' object is not iterable`**
  Don't iterate `stream()`—it **returns** an `int`. Use the `progress_callback` if you need live updates.
* **`BackgroundTasks` undefined**
  Import it from FastAPI: `from fastapi import BackgroundTasks`
* **Too many commas in ingredients**
  Don't `.join()` a **string**—only join if it's a `list[str]`.

---

## Future ideas (nice-to-haves)

* Store jobs in Redis for persistence across restarts
* Add `started_at` / `finished_at` timestamps and durations to jobs
* Rate-limit per site; cool down if a scrape ran recently
* Switch to a task queue (Celery/RQ/BullMQ) if you need scale
* Add a `/search` endpoint that calls `$vectorSearch` in MongoDB

---
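The timestamps/durations idea above can be sketched with two hypothetical helpers around the in-memory `JOBS` dict (a Redis-backed store would simply replace the dict):

```python
import time
import uuid

JOBS: dict = {}  # in-memory job map, as used by the API above


def start_job(site: str) -> str:
    """Create a job entry and record when it started."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "running", "count": 0, "started_at": time.time()}
    return job_id


def finish_job(job_id: str, count: int) -> None:
    """Mark the job done and record finished_at plus a duration in seconds."""
    job = JOBS[job_id]
    job["status"] = "done"
    job["count"] = count
    job["finished_at"] = time.time()
    job["duration_s"] = round(job["finished_at"] - job["started_at"], 3)


job_id = start_job("yummy")
finish_job(job_id, 50)
```

`GET /jobs/{job_id}` could then return the extra fields unchanged, and the duration makes the rate-limit idea ("cool down if a scrape ran recently") a simple timestamp comparison.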