# Recipe Scraper – FastAPI demo

A tiny FastAPI service + CLI that scrapes recipe sites, normalizes the data, and (optionally) embeds the combined **ingredients + instructions** into a single vector (`recipe_emb`). Designed as a **test project**: simple to run locally and easy to extend.

---

## Features

* 🔧 **Sites**: `yummy` (YummyMedley), `anr` (All Nigerian Recipes)
* 🧱 **Unified text**: builds `recipe_text` from sections, or embeds `("ingredients","instructions") → recipe_emb`
* 🧠 **Embeddings**: Hugging Face `sentence-transformers` via your `HFEmbedder` (default: `all-MiniLM-L6-v2`)
* 🚀 **API trigger**: `POST /scrape` runs scraping in the background
* 👀 **Progress**: `GET /jobs/{job_id}` (and optional `GET /jobs`) to check status
* 💾 **Output**: `output_type = "json"` (local file) or `"mongo"` (MongoDB/Atlas)

---

## Project layout (essential bits)

```
backend/
  app.py
  data_minning/
      base_scraper.py       # BaseRecipeScraper (+ StreamOptions)
      all_nigerian_recipe_scraper.py
      yummy_medley_scraper.py
      dto/recipe_doc.py
      soup_client.py
  utils/sanitization.py
```

Make sure every package dir has an `__init__.py`.

---

## Requirements

* Python 3.9+
* macOS/Linux (Windows should work too)
* (Optional) MongoDB/Atlas for `"mongo"` output

### Install

```bash
python3 -m venv .venv
source .venv/bin/activate

pip install --upgrade pip
pip install -r requirements.txt
# If you don't have a requirements.txt, the minimum is
# (the version spec is quoted so zsh does not try to glob the `*`):
pip install fastapi "uvicorn[standard]" "pydantic==2.*" requests beautifulsoup4 \
            sentence-transformers numpy pymongo python-dotenv
```

> If `uvicorn` isn’t found on your PATH, you can always run with `python3 -m uvicorn ...`.

---

## Environment variables

Create a `.env` file in the repo root (or export the variables in your shell) as needed:

```dotenv
# For output_type="mongo"
MONGODB_URI=mongodb+srv://user:pass@cluster/recipes?retryWrites=true&w=majority
MONGODB_DB=recipes
MONGODB_COL=items
ATLAS_INDEX=recipes_vec  # your Atlas Search index name

# Embeddings (HFEmbedder)
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu  # or cuda
```
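
As a minimal sketch of how these variables could be read on the Python side, the helper below collects them into one object. `Settings` and `load_settings` are hypothetical names, not part of the project, and the `MONGODB_DB` / `MONGODB_COL` fallbacks simply reuse the example values above (an assumption). If you use `python-dotenv`, call `load_dotenv()` first so the `.env` file is loaded into the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    # Hypothetical container; field names mirror the variables listed above.
    mongodb_uri: Optional[str]
    mongodb_db: str
    mongodb_col: str
    atlas_index: str
    hf_model: str
    hf_device: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read the variables above from `env`, with fallbacks for missing ones."""
    return Settings(
        mongodb_uri=env.get("MONGODB_URI"),  # only needed for output_type="mongo"
        mongodb_db=env.get("MONGODB_DB", "recipes"),  # assumed fallback
        mongodb_col=env.get("MONGODB_COL", "items"),  # assumed fallback
        atlas_index=env.get("ATLAS_INDEX", "recipes_vec"),
        hf_model=env.get("HF_MODEL", "sentence-transformers/all-MiniLM-L6-v2"),
        hf_device=env.get("HF_DEVICE", "cpu"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.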

---

## Running the API

From the project root (the folder **containing** `backend/`):

```bash
python3 -m uvicorn backend.app:app --reload --host 127.0.0.1 --port 8080
```


---

## API

### POST `/scrape`

Trigger a scrape job (non-blocking). The **body** is a JSON object; `limit` is optional, and `output_type` is either `"json"` or `"mongo"`:

```json
{
  "site": "yummy",
  "limit": 50,
  "output_type": "json"
}
```

**Headers**

* `Content-Type: application/json`
* If enabled: `X-API-Key: <ADMIN_API_KEY>`

**curl example (JSON output):**

```bash
curl -X POST http://127.0.0.1:8080/scrape \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"site":"yummy","limit":20,"output_type":"json"}'
```

**Response**

```json
{ "job_id": "yummy-a1b2c3d4", "status": "queued" }
```

### GET `/jobs/{job_id}`

Check progress:

```bash
curl http://127.0.0.1:8080/jobs/yummy-a1b2c3d4
```

**Possible responses**

```json
{ "status": "running", "count": 13 }
{ "status": "done", "count": 50 }
{ "status": "error", "error": "Traceback ..." }
{ "status": "unknown" }
```
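
The same trigger-then-poll flow can be scripted from Python. This is a rough standard-library sketch, not part of the project: `build_scrape_request` and `scrape_and_wait` are hypothetical helpers, and the base URL assumes the local uvicorn command shown earlier.

```python
import json
import time
import urllib.request

BASE_URL = "http://127.0.0.1:8080"  # assumed: the local dev server address


def build_scrape_request(site, limit=None, output_type="json", api_key=None):
    """Build (but do not send) the POST /scrape request."""
    body = {"site": site, "output_type": output_type}
    if limit is not None:
        body["limit"] = limit  # optional field, omitted when not set
    headers = {"Content-Type": "application/json"}
    if api_key:
        headers["X-API-Key"] = api_key  # only if the server enforces a key
    return urllib.request.Request(
        f"{BASE_URL}/scrape",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


def _get_json(req_or_url):
    with urllib.request.urlopen(req_or_url, timeout=10) as resp:
        return json.load(resp)


def scrape_and_wait(site, limit=None, poll_seconds=2.0, **kwargs):
    """Start a job, then poll /jobs/{job_id} until it leaves queued/running."""
    job = _get_json(build_scrape_request(site, limit, **kwargs))
    while True:
        status = _get_json(f"{BASE_URL}/jobs/{job['job_id']}")
        if status.get("status") not in ("queued", "running"):
            return status  # done, error, or unknown
        time.sleep(poll_seconds)
```

With the server running, `scrape_and_wait("yummy", limit=20, api_key="dev-key")` would return the final status document.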

### (Optional) GET `/jobs`

Return the whole in-memory job map (useful for debugging):

```bash
curl http://127.0.0.1:8080/jobs
```

> Note: jobs are stored in a process-local dict and clear on server restart.
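
To make the job lifecycle concrete, here is a minimal sketch of a process-local job map that produces the status documents shown above. The helper names (`new_job`, `run_job`, `get_job`) and the exact id format are assumptions for illustration; the real `app.py` may differ.

```python
import traceback
import uuid
from typing import Callable, Dict

JOBS: Dict[str, dict] = {}  # process-local: cleared whenever the server restarts


def new_job(site: str) -> str:
    """Register a queued job and return an id shaped like 'yummy-a1b2c3d4'."""
    job_id = f"{site}-{uuid.uuid4().hex[:8]}"
    JOBS[job_id] = {"status": "queued", "count": 0}
    return job_id


def run_job(job_id: str, work: Callable[[Callable[[int], None]], int]) -> None:
    """Run `work(progress_cb)` and record running / done / error in JOBS."""
    job = JOBS[job_id]
    job["status"] = "running"

    def progress(n: int) -> None:
        job["count"] = n  # live count, visible through GET /jobs/{job_id}

    try:
        job["count"] = work(progress)  # `work` returns the final saved count
        job["status"] = "done"
    except Exception:
        job["status"] = "error"
        job["error"] = traceback.format_exc()


def get_job(job_id: str) -> dict:
    """Return the job document, or the 'unknown' marker for missing ids."""
    return JOBS.get(job_id, {"status": "unknown"})
```

In a FastAPI handler, a function like `run_job` could be scheduled with `background_tasks.add_task(run_job, job_id, work)` so the request returns immediately.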

---

## Output modes

### `"json"`

Writes batches to a JSON sink (e.g., newline-delimited file). Check the sink path configured in your `JsonArraySink`/`DualSink`.

Typical document shape:

```json
{
  "title": "...",
  "url": "...",
  "source": "...",
  "category": "...",
  "ingredients": "- 1 cup rice\n- 2 tbsp oil\n...",
  "instructions": "1. Heat oil...\n\n2. Add rice...",
  "image_url": "...",
  "needs_review": false,
  "scraped_at": "2025-09-14 10:03:32.289232",
  "recipe_emb": [0.0123, -0.0456, ...]   // when embeddings enabled
}
```
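
For illustration, a newline-delimited JSON sink could look like the sketch below. `NdjsonSink` and its `write(batch)` method are hypothetical; the project's own `JsonArraySink` / `DualSink` may expose a different interface.

```python
import json
from pathlib import Path
from typing import Iterable, List, Mapping


class NdjsonSink:
    """Hypothetical sink that appends one JSON document per line."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def write(self, batch: Iterable[Mapping]) -> int:
        """Append every document in `batch`; return how many were written."""
        written = 0
        with self.path.open("a", encoding="utf-8") as fh:
            for doc in batch:
                # default=str keeps datetime values such as scraped_at serializable
                fh.write(json.dumps(doc, ensure_ascii=False, default=str) + "\n")
                written += 1
        return written


def read_ndjson(path: str) -> List[dict]:
    """Load the file back into a list of documents, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]
```

Appending line by line means a crashed run still leaves every completed batch readable, which is harder to guarantee with a single JSON array.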

### `"mongo"`

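Writes to the collection `MONGODB_COL` in the database `MONGODB_DB`. If you plan to query vectors with `$vectorSearch`, first create an Atlas Vector Search index (named by `ATLAS_INDEX`) on that collection.

**Atlas Vector Search index definition (single vector field)**

```json
{
  "fields": [
    { "type": "vector", "path": "recipe_emb", "numDimensions": 384, "similarity": "cosine" },
    { "type": "filter", "path": "needs_review" }
  ]
}
```

`numDimensions` must match the embedding model (384 for `all-MiniLM-L6-v2`). The `filter` entry is what allows a query to filter on `needs_review`.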

**Query example:**

```python
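import os

# `embedder` is your HFEmbedder instance; `col` is a pymongo Collection.
qvec = embedder.encode([query])[0]
pipeline = [{
  "$vectorSearch": {
    "index": os.getenv("ATLAS_INDEX", "recipes_vec"),
    "path": "recipe_emb",
    # BSON cannot encode NumPy arrays, so send plain Python floats.
    "queryVector": [float(x) for x in qvec],
    "numCandidates": 400,
    "limit": 10,
    "filter": { "needs_review": { "$ne": True } }
  }
}]
results = list(col.aggregate(pipeline))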
```

---

## Embeddings (combined fields → one vector)

We embed **ingredients + instructions** into a single `recipe_emb`. Two supported patterns:

### A) Combine at embedding time

Configure:

```python
embedding_fields = [
  (("ingredients", "instructions"), "recipe_emb")
]
```

`_apply_embeddings` concatenates labeled sections:

```
Ingredients:
- ...

Instructions:
1. ...
```
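
A rough sketch of that concatenation step follows. `build_embedding_text` is a hypothetical stand-in for what `_apply_embeddings` does internally; the real method may label or order sections differently.

```python
from typing import Mapping, Sequence


def build_embedding_text(doc: Mapping[str, object], fields: Sequence[str]) -> str:
    """Join the named fields into labeled sections, skipping empty ones."""
    parts = []
    for name in fields:
        value = doc.get(name)
        if isinstance(value, (list, tuple)):
            # Lists become one item per line before labeling.
            value = "\n".join(str(v) for v in value)
        text = str(value).strip() if value else ""
        if text:
            parts.append(f"{name.capitalize()}:\n{text}")
    return "\n\n".join(parts)
```

The resulting string is what would be passed to the embedder once per recipe to produce `recipe_emb`.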

### B) Build `recipe_text` in `RecipeDoc.finalize()` and embed once

```python
self.recipe_text = "\n\n".join(
  [s for s in [
    f"Title:\n{self.title}" if self.title else "",
    f"Ingredients:\n{self.ingredients_text}" if self.ingredients_text else "",
    f"Instructions:\n{self.instructions_text}" if self.instructions_text else ""
  ] if s]
)
# embedding_fields = [("recipe_text", "recipe_emb")]
```

**HFEmbedder config (defaults, set as environment variables):**

```dotenv
HF_MODEL=sentence-transformers/all-MiniLM-L6-v2
HF_DEVICE=cpu
```

---

## CLI (optional but handy)

Create `run_scrape.py`:

```python
import argparse

# Adjust these import paths to match your checkout
# (the layout above shows the package at backend/data_minning/).
from backend.services.data_minning.all_nigerian_recipe_scraper import AllNigerianRecipesScraper
from backend.services.data_minning.base_scraper import StreamOptions
from backend.services.data_minning.yummy_medley_scraper import YummyMedleyScraper

SCRAPERS = {
  "yummy": YummyMedleyScraper,
  "anr": AllNigerianRecipesScraper,
}

if __name__ == "__main__":
    p = argparse.ArgumentParser()
    p.add_argument("--site", choices=SCRAPERS.keys(), required=True)
    p.add_argument("--limit", type=int, default=50)
    args = p.parse_args()

    s = SCRAPERS[args.site]()
    # `sink=...` is a placeholder: pass your JSON or Mongo sink instance here.
    saved = s.stream(sink=..., options=StreamOptions(limit=args.limit))
    print(f"Saved {saved}")
```

Run:

```bash
python3 run_scrape.py --site yummy --limit 25
```

---

## Implementation notes

### `StreamOptions` (clean params)

```python
from dataclasses import dataclass
from typing import Optional, Callable

@dataclass
class StreamOptions:
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None
    progress_callback: Optional[Callable[[int], None]] = None
```
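
To show how these options might interact, here is a simplified, hypothetical streaming loop. It is a sketch under assumptions: the real `BaseRecipeScraper.stream()` takes a sink rather than a `write_batch` callable, and `resume_file` handling is left out entirely.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamOptions:  # repeated here so the sketch runs on its own
    delay: float = 0.3
    limit: Optional[int] = None
    batch_size: int = 50
    resume_file: Optional[str] = None  # not used in this sketch
    progress_callback: Optional[Callable[[int], None]] = None


def stream(items: Iterable[dict], write_batch: Callable[[List[dict]], None],
           options: StreamOptions) -> int:
    """Write items in batches and return the total number written."""
    saved = 0
    batch: List[dict] = []

    def flush() -> None:
        nonlocal saved, batch
        if batch:
            write_batch(batch)
            saved += len(batch)
            batch = []
            if options.progress_callback:
                options.progress_callback(saved)  # report after each batch

    for item in items:
        if options.limit is not None and saved + len(batch) >= options.limit:
            break  # stop once `limit` items are written or pending
        batch.append(item)
        if len(batch) >= options.batch_size:
            flush()
        if options.delay:
            time.sleep(options.delay)  # pause between items to go easy on the site
    flush()  # write the final partial batch
    return saved
```

Because progress is reported once per flushed batch, a smaller `batch_size` gives more frequent updates at the cost of more writes.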

### Progress to `/jobs`

We pass a `progress_callback` that updates the job by `job_id`:

```python
def make_progress_cb(job_id: str):
    def _cb(n: int):
        JOBS[job_id]["count"] = n
    return _cb
```

Used as:

```python
saved = s.stream(
  sink=json_or_mongo_sink,
  options=StreamOptions(
    limit=body.limit,
    batch_size=body.limit or 50,  # limit is optional; fall back to the default batch size
    resume_file="recipes.resume",
    progress_callback=make_progress_cb(job_id),
  ),
)
```

---

## Common pitfalls & fixes

* **`ModuleNotFoundError: No module named 'backend'`**
  Run with module path:
  `python3 -m uvicorn backend.app:app --reload`

* **Uvicorn not found (`zsh: command not found: uvicorn`)**
  Use `python3 -m uvicorn ...`, or add your user-level bin directory (e.g. `~/Library/Python/3.9/bin` on macOS) to PATH.

* **`422 Unprocessable Entity` on `/scrape`**
  In Postman: Body → **raw → JSON** and send:
  `{"site":"yummy","limit":20,"output_type":"json"}`

* **Pydantic v2: “non-annotated attribute”**
  Keep globals like `JOBS = {}` **outside** `BaseModel` classes.

* **`'int' object is not iterable`**
  Don't iterate over `stream()`; it **returns** an `int` (the number of saved items). Use the `progress_callback` if you need live updates.

* **`BackgroundTasks` undefined**
  Import from FastAPI:
  `from fastapi import BackgroundTasks`

* **Too many commas in ingredients**
  Don't call `.join()` on a **string**: it iterates character by character and puts the separator between every pair. Only join when the value is a `list[str]`.
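
The last pitfall is easy to reproduce: `", ".join("rice")` yields `"r, i, c, e"`. A small guard like the one below avoids it. `ingredients_to_text` is a hypothetical helper, and the `- ` bullet prefix is an assumption based on the document shape shown earlier.

```python
from typing import Iterable, Union


def ingredients_to_text(value: Union[str, Iterable[str], None]) -> str:
    """Return ingredients as text without re-joining an existing string."""
    if value is None:
        return ""
    if isinstance(value, str):
        # Already text: joining it would insert a separator between characters.
        return value.strip()
    # A list of items: emit one bullet per non-empty entry.
    return "\n".join(f"- {item.strip()}" for item in value if item and item.strip())
```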

---

## Future ideas (nice-to-haves)

* Store jobs in Redis for persistence across restarts
* Add `started_at` / `finished_at` timestamps and durations to jobs
* Rate-limit per site; cool-down if a scrape ran recently
* Switch to task queue (Celery/RQ/BullMQ) if you need scale
* Add `/search` endpoint that calls `$vectorSearch` in MongoDB

---