# Recipe Scraper – FastAPI demo

A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined **ingredients + instructions** into a single vector (`recipe_emb`). Designed as a **test project**: simple to run locally, easy to extend.

---

## Features

* 🔧 **Sites**: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
* 🧱 **Unified text**: builds `recipe_text` from sections, or embeds `("ingredients","instructions") → recipe_emb`
* 🧠 **Embeddings**: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
* 🚀 **API trigger**: `POST /scrape` runs scraping in the background
* 👀 **Progress**: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
* 💾 **Output**: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)
---

## Project layout (essential bits)

```
backend/
  app.py
  data_minning/
    base_scraper.py                 # BaseRecipeScraper (+ StreamOptions)
    all_nigerian_recipe_scraper.py
    yummy_medley_scraper.py
    dto/recipe_doc.py
    soup_client.py
    utils/sanitization.py
```

Make sure every package dir has an `__init__.py`.
---

## Requirements

* Python 3.9+
* macOS/Linux (Windows should work too)
* (Optional) MongoDB/Atlas for `"mongo"` output

### Install

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
# If you don’t have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" "pydantic==2.*" requests beautifulsoup4 \
  sentence-transformers numpy pymongo python-dotenv
```

Note the quotes around `pydantic==2.*` — without them, zsh tries to glob the `*`.

> If `uvicorn` isn’t found on your PATH, you can always run with `python3 -m uvicorn ...`.
## Environment variables

Create `.env` in the repo root (or export envs) as needed:

```dotenv
# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec   # your Atlas Search index name

# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu             # or cuda
```
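In code, these are typically read with `os.getenv`; a minimal sketch (the fallback defaults here are illustrative assumptions, not the project's actual values):

```python
import os

# Fallbacks are illustrative; set real values in .env or the environment.
MONGODB_URI = os.getenv("MONGODB_URI", "mongodb://localhost:27017")
MONGODB_DB = os.getenv("MONGODB_DB", "recipes")
MONGODB_COL = os.getenv("MONGODB_COL", "items")
ATLAS_INDEX = os.getenv("ATLAS_INDEX", "recipes_vec")
HF_MODEL = os.getenv("HF_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
HF_DEVICE = os.getenv("HF_DEVICE", "cpu")
```

Pair this with `python-dotenv`'s `load_dotenv()` at startup if you want `.env` picked up automatically.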
---

## Running the API

From the project root (the folder **containing** `backend/`):

```bash
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
```
---

## API

### POST `/scrape`

Trigger a scrape job (non-blocking). **Body** is a JSON object:

```json
{
  "site": "yummy",
  "limit": 50,            // optional
  "output_type": "json"   // or "mongo"
}
```

**Headers**

* `Content-Type: application/json`
* If enabled: `X-API-Key: <ADMIN_API_KEY>`
**curl example (JSON output):**

```bash
curl -X POST http://127.0.0.1:8080/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"site":"yummy","limit":20,"output_type":"json"}'
```

**Response**

```json
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
```

### GET `/jobs/{job_id}`

Check progress:

```bash
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
```

**Possible responses**

```json
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
```

### (Optional) GET `/jobs`

Return the whole in-memory job map (useful for debugging):

```bash
curl http://127.0.0.1:8080/jobs
```

> Note: jobs are stored in a process-local dict and cleared on server restart.
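The job map can be as simple as a module-level dict keyed by `job_id`. A hedged sketch (the helper names here are hypothetical, but the status shapes match the responses above):

```python
import uuid

JOBS: dict[str, dict] = {}  # job_id -> {"status": ..., "count": ...}

def new_job(site: str) -> str:
    """Register a queued job and return its id (e.g. 'yummy-a1b2c3d4')."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id

def job_status(job_id: str) -> dict:
    """Status payload for GET /jobs/{job_id}; unknown ids get {'status': 'unknown'}."""
    return JOBS.get(job_id, {"status": "unknown"})
```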
---

## Output modes

### `"json"`

Writes batches to a JSON sink (e.g., a newline-delimited file). Check the sink path configured in your `JsonArraySink`/`DualSink`.

Typical document shape:

```json
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
```
### `"mongo"`

Writes to `MONGODB_DB.MONGODB_COL`. Ensure your Atlas Vector Search index is created if you plan to query vectors.

**Atlas Vector Search index definition (single vector field)**

```json
{
  "fields": [
    { "type": "vector", "path": "recipe_emb", "numDimensions": 384, "similarity": "cosine" },
    { "type": "filter", "path": "needs_review" }
  ]
}
```

> The older `knnVector` mapping under `mappings.fields` belongs to `$search`/`knnBeta`. The `$vectorSearch` stage used below expects a **vectorSearch**-type index like the one above, and any field referenced in its `filter` must be indexed with `"type": "filter"`.
**Query example:**

```python
qvec = embedder.encode([query])[0]
# $vectorSearch needs a plain list of floats, not a numpy array
qvec = qvec.tolist() if hasattr(qvec, "tolist") else qvec
pipeline = [{
    "$vectorSearch": {
        "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
        "path": "recipe_emb",
        "queryVector": qvec,
        "numCandidates": 400,
        "limit": 10,
        "filter": { "needs_review": { "$ne": True } }
    }
}]
results = list(col.aggregate(pipeline))
```
---

## Embeddings (combined fields → one vector)

We embed **ingredients + instructions** into a single `recipe_emb`. Two supported patterns:

### A) Combine at embedding time

Configure:

```python
embedding_fields = [
    (("ingredients", "instructions"), "recipe_emb")
]
```

`_apply_embeddings` concatenates labeled sections:

```
Ingredients:
- ...
Instructions:
1. ...
```
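A concatenation helper in the spirit of `_apply_embeddings` could look like this (the helper name and exact labels are illustrative, not the project's actual code):

```python
def combine_sections(doc: dict, fields=("ingredients", "instructions")) -> str:
    """Join labeled sections into one embeddable string, skipping empty fields."""
    parts = []
    for field in fields:
        value = (doc.get(field) or "").strip()
        if value:
            parts.append(f"{field.capitalize()}:\n{value}")
    return "\n\n".join(parts)
```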
### B) Build `recipe_text` in `RecipeDoc.finalize()` and embed once

```python
self.recipe_text = "\n\n".join(
    [s for s in [
        f"Title:\n{self.title}" if self.title else "",
        f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
        f"Instructions:\n{self.instructions_text}" if self.instructions_text else ""
    ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
```
**HFEmbedder config (defaults):**

```dotenv
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
```
---

## CLI (optional but handy)

Create `run_scrape.py`:

```python
import argparse

from backend.services.data_minning.base_scraper import StreamOptions
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper

SCRAPERS = {
    "yummy": YummyMedleyScraper,
    "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()
    s = SCRAPERS[args.site]()
    # plug in your JSON or Mongo sink here
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))
    print(f"Saved {saved}")
```

Run:

```bash
python3 run_scrape.py --site yummy --limit 25
```
---

## Implementation notes

### `StreamOptions` (clean params)

```python
from dataclasses import dataclass
from typing import Optional, Callable

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
```

### Progress to `/jobs`

We pass a `progress_callback` that updates the job by `job_id`:

```python
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
```

Used as:

```python
saved = s.stream(
    sink=json_or_mongo_sink,
    options=StreamOptions(
        limit=body.limit,
        batch_size=body.limit,
        resume_file="recipes.resume",
        progress_callback=make_progress_cb(job_id),
    ),
)
```
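For reference, the core of `stream()` might look roughly like this — a sketch of the batching/limit/callback contract, not the project's actual implementation (`StreamOptions` is repeated so the block stands alone):

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

@dataclass
class StreamOptions:  # same shape as defined above
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None

def stream(items: Iterable, sink: Callable[[list], None], options: StreamOptions) -> int:
    """Push items to the sink in batches; returns the number of items saved."""
    saved, batch = 0, []
    for item in items:
        if options.limit is not None and saved >= options.limit:
            break
        batch.append(item)
        saved += 1
        if options.progress_callback:
            options.progress_callback(saved)
        if len(batch) >= options.batch_size:
            sink(batch)
            batch = []
        if options.delay:
            time.sleep(options.delay)  # politeness delay between fetches
    if batch:  # flush the final partial batch
        sink(batch)
    return saved
```

This also explains the `'int' object is not iterable` pitfall below: callers get the saved count back, and live progress only flows through the callback.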
---

## Common pitfalls & fixes

* **`ModuleNotFoundError: No module named 'backend'`**
  Run with the module path:
  `python3 -m uvicorn backend.app:app --reload`
* **Uvicorn not found (`zsh: command not found: uvicorn`)**
  Use: `python3 -m uvicorn ...` or add `~/Library/Python/3.9/bin` to PATH.
* **`422 Unprocessable Entity` on `/scrape`**
  In Postman: Body → **raw → JSON** and send:
  `{"site":"yummy","limit":20,"output_type":"json"}`
* **Pydantic v2: “non-annotated attribute”**
  Keep globals like `JOBS = {}` **outside** `BaseModel` classes.
* **`'int' object is not iterable`**
  Don’t iterate `stream()`—it **returns** an `int`. Use the `progress_callback` if you need live updates.
* **`BackgroundTasks` undefined**
  Import it from FastAPI:
  `from fastapi import BackgroundTasks`
* **Too many commas in ingredients**
  Don’t `.join()` a **string**—only join if it’s a `list[str]`.
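A defensive normalizer for that last pitfall might look like this (the helper name and bullet format are illustrative):

```python
def ingredients_to_text(value) -> str:
    """Render a list of ingredients as bullet lines; pass strings through unchanged."""
    if isinstance(value, list):
        return "\n".join(f"- {item}" for item in value)
    return value or ""
```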
---

## Future ideas (nice-to-haves)

* Store jobs in Redis for persistence across restarts
* Add `started_at` / `finished_at` timestamps and durations to jobs
* Rate-limit per site; cool down if a scrape ran recently
* Switch to a task queue (Celery/RQ/BullMQ) if you need scale
* Add a `/search` endpoint that calls `$vectorSearch` in MongoDB