Jesse Johnson
New commit for backend deployment: 2025-09-25_13-24-03
c59d808

Recipe Scraper – FastAPI demo

A tiny FastAPI service + CLI that scrapes recipe sites, normalizes data, and (optionally) embeds combined ingredients + instructions into a single vector (recipe_emb). Designed as a test project—simple to run locally, easy to extend.


Features

  • 🔧 Sites: yummy (YummyMedley), anr (All Nigerian Recipes)
  • 🧱 Unified text: builds recipe_text from sections, or embeds ("ingredients","instructions") → recipe_emb
  • 🧠 Embeddings: Hugging Face sentence-transformers via your HFEmbedder (default: all-MiniLM-L6-v2)
  • 🚀 API trigger: POST /scrape runs scraping in the background
  • 👀 Progress: GET /jobs/{job_id} (and optional GET /jobs) to check status
  • 💾 Output: output_type = "json" (local file) or "mongo" (MongoDB/Atlas)

Project layout (essential bits)

backend/
  app.py
  data_minning/
      base_scraper.py       # BaseRecipeScraper (+ StreamOptions)
      all_nigerian_recipe_scraper.py
      yummy_medley_scraper.py
      dto/recipe_doc.py
      soup_client.py
  utils/sanitization.py

Make sure every package dir has an __init__.py.


Requirements

  • Python 3.9+
  • macOS/Linux (Windows should work too)
  • (Optional) MongoDB/Atlas for "mongo" output

Install

python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt
# If you don’t have a requirements.txt, minimum:
pip install fastapi "uvicorn[standard]" pydantic==2.* requests beautifulsoup4 \
            sentence-transformers numpy pymongo python-dotenv

If uvicorn isn’t found on your PATH, you can always run with python3 -m uvicorn ....


Environment variables

Create .env in repo root (or export envs) as needed:



# For Mongo output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec  # your Atlas Search index name

# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu  # or cuda
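For reference, these settings can be read in Python with plain os.getenv (a minimal sketch; python-dotenv's load_dotenv() would populate os.environ from .env first, and the defaults below simply mirror the example values above):

```python
import os

# Defaults mirror the .env example above; MONGODB_URI has no safe default.
HF_MODEL = os.getenv("HF_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
HF_DEVICE = os.getenv("HF_DEVICE", "cpu")
MONGODB_DB = os.getenv("MONGODB_DB", "recipes")
MONGODB_COL = os.getenv("MONGODB_COL", "items")
ATLAS_INDEX = os.getenv("ATLAS_INDEX", "recipes_vec")
```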

Running the API

From the project root (the folder containing backend/):

python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080

API

POST /scrape

Trigger a scrape job (non-blocking). Body is a JSON object:

{
  "site": "yummy",
  "limit": 50,            // optional
  "output_type": "json"   // or "mongo"
}

Headers

  • Content-Type: application/json
  • If enabled: X-API-Key: <ADMIN_API_KEY>

curl example (JSON output):

curl -X POST http://127.0.0.1:8080/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"site":"yummy","limit":20,"output_type":"json"}'

Response

{ "job_id": "yummy-a1b2c3d4", "status": "queued" }

GET /jobs/{job_id}

Check progress:

curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4

Possible responses

{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
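The job_id format and status payloads above can be modeled with a small in-memory map. This is a sketch, not the actual implementation: create_job/job_status and the uuid-suffix scheme are assumptions inferred from the example id "yummy-a1b2c3d4".

```python
import uuid

# job_id -> {"status": ..., "count": ...}; process-local, cleared on restart
JOBS = {}

def create_job(site: str) -> str:
    # e.g. "yummy-a1b2c3d4": site prefix + short hex suffix (assumed scheme)
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id

def job_status(job_id: str) -> dict:
    # ids the server has never seen fall through to {"status": "unknown"}
    return JOBS.get(job_id, {"status": "unknown"})
```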

(Optional) GET /jobs

Return the whole in-memory job map (useful for debugging):

curl http://127.0.0.1:8080/jobs

Note: jobs are stored in a process-local dict and clear on server restart.


Output modes

"json"

Writes batches to a JSON sink (e.g., newline-delimited file). Check the sink path configured in your JsonArraySink/DualSink.

Typical document shape:

{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
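The actual JsonArraySink/DualSink classes aren't shown in this README; a minimal newline-delimited sink with the same batching idea might look like this (class and method names are assumptions, not the real interface):

```python
import json

class NdjsonSink:
    """Hypothetical sink: appends one JSON document per line (NDJSON)."""

    def __init__(self, path: str):
        self.path = path

    def write_batch(self, docs: list) -> int:
        # append mode so successive batches accumulate in one file
        with open(self.path, "a", encoding="utf-8") as f:
            for doc in docs:
                f.write(json.dumps(doc, ensure_ascii=False) + "\n")
        return len(docs)
```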

"mongo"

Writes to MONGODB_DB.MONGODB_COL. Ensure your Atlas Vector Search index is created if you plan to query vectors.

Atlas Vector Search index definition (single vector field). Note: the $vectorSearch stage below requires a vectorSearch-type index; the older knnVector Search mapping only works with knnBeta:

{
  "fields": [
    {
      "type": "vector",
      "path": "recipe_emb",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}

Query example:

qvec = embedder.encode([query])[0].tolist()  # queryVector must be a plain list, not a numpy array
pipeline = [{
  "$vectorSearch": {
    "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
    "path": "recipe_emb",
    "queryVector": qvec,
    "numCandidates": 400,
    "limit": 10,
    "filter": { "needs_review": { "$ne": True } }
  }
}]
results = list(col.aggregate(pipeline))

Embeddings (combined fields → one vector)

We embed ingredients + instructions into a single recipe_emb. Two supported patterns:

A) Combine at embedding time

Configure:

embedding_fields = [
  (("ingredients", "instructions"), "recipe_emb")
]

_apply_embeddings concatenates labeled sections:

Ingredients:
- ...

Instructions:
1. ...
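The concatenation step in pattern A can be sketched as a pure function (a simplified stand-in for _apply_embeddings; the real method also runs the embedder over the result):

```python
def build_embedding_text(doc: dict, fields=("ingredients", "instructions")) -> str:
    """Join the named fields into one labeled block, skipping empty sections."""
    sections = [
        f"{name.capitalize()}:\n{doc[name]}"
        for name in fields
        if doc.get(name)
    ]
    return "\n\n".join(sections)
```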

B) Build recipe_text in RecipeDoc.finalize() and embed once

self.recipe_text = "\n\n".join(
  [s for s in [
    f"Title:\n{self.title}" if self.title else "",
    f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
    f"Instructions:\n{self.instructions_text}" if self.instructions_text else ""
  ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]

HFEmbedder config (defaults):

HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu

CLI (optional but handy)

Create run_scrape.py:

from backend.data_minning.yummy_medley_scraper import YummyMedleyScraper
from backend.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper
from backend.data_minning.base_scraper import StreamOptions

SCRAPERS = {
  "yummy": YummyMedleyScraper,
  "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    import argparse
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    s = SCRAPERS[args.site]()
    # supply a real sink (e.g., your JsonArraySink) in place of ...
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))
    print(f"Saved {saved}")

Run:

python3 run_scrape.py --site yummy --limit 25

Implementation notes

StreamOptions (clean params)

from dataclasses import dataclass
from typing import Optional, Callable

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
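The resume_file semantics aren't spelled out in this README; one plausible interpretation is a plain-text log of already-scraped URLs, which a restarted run loads and skips (load_resume/mark_done are hypothetical helpers, not part of the actual codebase):

```python
def load_resume(path: str) -> set:
    """Return the set of URLs recorded so far; empty set if no file yet."""
    try:
        with open(path, encoding="utf-8") as f:
            return {line.strip() for line in f if line.strip()}
    except FileNotFoundError:
        return set()

def mark_done(path: str, url: str) -> None:
    """Append one finished URL so a restarted run can skip it."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(url + "\n")
```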

Progress to /jobs

We pass a progress_callback that updates the job by job_id:

def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb

Used as:

saved = s.stream(
  sink=json_or_mongo_sink,
  options=StreamOptions(
    limit=body.limit,
    batch_size=body.limit,
    resume_file="recipes.resume",
    progress_callback=make_progress_cb(job_id),
  ),
)

Common pitfalls & fixes

  • ModuleNotFoundError: No module named 'backend' Run with module path: python3 -m uvicorn backend.app:app --reload

  • Uvicorn not found (zsh: command not found: uvicorn) Use: python3 -m uvicorn ... or add ~/Library/Python/3.9/bin to PATH.

  • 422 Unprocessable Entity on /scrape In Postman: Body → raw → JSON and send: {"site":"yummy","limit":20,"output_type":"json"}

  • Pydantic v2: “non-annotated attribute” Keep globals like JOBS = {} outside BaseModel classes.

  • 'int' object is not iterable Don’t iterate stream()—it returns an int. Use the progress_callback if you need live updates.

  • BackgroundTasks undefined Import from FastAPI: from fastapi import BackgroundTasks

  • Too many commas in ingredients Don’t .join() a string—only join if it’s a list[str].
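The last pitfall can be guarded with a tiny normalizer (a sketch; the function name is hypothetical). Calling ", ".join() on a string joins its individual characters, which is exactly the "too many commas" bug:

```python
def to_text(value):
    """Join only when given a list of strings; pass strings through untouched."""
    if isinstance(value, list):
        return ", ".join(value)
    return value
```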


Future ideas (nice-to-haves)

  • Store jobs in Redis for persistence across restarts
  • Add started_at / finished_at timestamps and durations to jobs
  • Rate-limit per site; cool-down if a scrape ran recently
  • Switch to task queue (Celery/RQ/BullMQ) if you need scale
  • Add /search endpoint that calls $vectorSearch in MongoDB