# SETUP — Preparing the Space's Data Files

The Space ships **without** trained model artefacts. They are too large and
too tightly coupled to the training pipeline for me to ship blindly. This
file explains what every artefact is, exactly where it goes, and gives a
working Python template for producing it from each of the three training
notebooks.

The Space needs two kinds of artefact (a metadata table and per-model embeddings), stored as three file patterns:

| Path                                | Purpose                                                                |
|-------------------------------------|-----------------------------------------------------------------------|
| `data/books_metadata.parquet`       | Title, author, category, rating, summary for every book               |
| `embeddings/{model}_book_emb.npy`   | Final book embedding matrix for each model, `(N, D) float32`           |
| `embeddings/{model}_book_ids.npy`   | The `book_id` string for each row of the embedding file `(N,)`        |

> **Why not `.pkl`?** Pickle files are slower to load, larger on disk, and
> carry an arbitrary-code-execution risk when loaded, which is a real concern
> for a public Space. `.npy` (numbers) and `.parquet` (tables) are the standard
> choices for this kind of artefact in the HF ecosystem, and NumPy and pandas
> read them natively.
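
One caveat is worth checking when you save the ID arrays: `np.save` falls back to pickle for arrays with `dtype=object`, and `np.load` refuses to read those unless `allow_pickle=True` is passed. The sketch below uses made-up book IDs and an in-memory buffer to show the difference between an object array and a fixed-width string array. It is only an illustration, not code taken from `app.py`.

```python
import io

import numpy as np


def round_trip(arr):
    """Save to an in-memory .npy buffer and load it back with pickle disabled."""
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    return np.load(buf, allow_pickle=False)


ids = ["b001", "b002", "b003"]  # hypothetical book IDs

# Fixed-width unicode array: stored as plain data, loads without pickle.
safe = round_trip(np.array(ids, dtype=str))

# Object array: np.save pickles the contents, so a pickle-free load fails.
try:
    round_trip(np.array(ids, dtype=object))
    object_array_loaded = True
except ValueError:
    object_array_loaded = False
```

If `object_array_loaded` ends up `False`, the ID file would only be readable with `allow_pickle=True`, which brings back the risk described above.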

---

## 1. `data/books_metadata.parquet`

A single table with one row per book, keyed by the `book_id` column (the script below writes it without a pandas index). Required columns:

| column     | dtype  | notes                                          |
|------------|--------|------------------------------------------------|
| `book_id`  | str    | The same string used in your JSON dataset       |
| `title`    | str    | Book title (Bangla or English)                  |
| `author`   | str    | Comma-joined if multiple authors                |
| `category` | str    | Primary category name                           |
| `rating`   | float  | Average rating 1–5; NaN allowed                 |
| `summary`  | str    | Short description; first ~240 chars are shown   |

Build it once from the raw JSON files in your HF dataset repo. You almost
certainly have these locally already from training; if not, download them from
<https://huggingface.co/datasets/DevnilMaster1/Bangla-Book-Recommendation-Dataset/tree/main>.

```python
# build_metadata.py
import json
from pathlib import Path

import pandas as pd

def load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

books      = load_json("book.json")
authors    = {str(a["author_id"]):    a for a in load_json("author.json")}
categories = {str(c["category_id"]): c for c in load_json("category.json")}
b2a        = load_json("book_to_author.json")
b2c        = load_json("book_to_category.json")

# book_id → list of author names
authors_per_book = {}
for e in b2a:
    authors_per_book.setdefault(str(e["book_id"]), []).append(
        authors.get(str(e["author_id"]), {}).get("author", "")
    )

# book_id → list of category names (we keep the first as the primary)
categories_per_book = {}
for e in b2c:
    categories_per_book.setdefault(str(e["book_id"]), []).append(
        categories.get(str(e["category_id"]), {}).get("category_name", "")
    )

rows = []
for b in books:
    bid = str(b["book_id"])
    rows.append({
        "book_id":  bid,
        "title":    b.get("book_name") or b.get("title", ""),
        "author":   ", ".join(filter(None, authors_per_book.get(bid, []))),
        "category": (categories_per_book.get(bid) or [""])[0],
        # Missing ratings become NaN so the column stays float even if every rating is absent.
        "rating":   float(b["rating"]) if b.get("rating") not in (None, "") else float("nan"),
        "summary":  b.get("summary", "") or b.get("description", ""),
    })

df = pd.DataFrame(rows)
# to_parquet does not create missing directories, and it needs pyarrow or fastparquet installed.
out_path = Path("data/books_metadata.parquet")
out_path.parent.mkdir(parents=True, exist_ok=True)
df.to_parquet(out_path, index=False)
print(f"Wrote {len(df):,} rows to {out_path}")
```

> **Adjust the JSON key names** (`book_name`, `author`, `category_name`,
> `rating`, `summary`, …) to match what your actual files use. Open one of
> the JSON files and check — different snapshots have used slightly different
> field names.
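
A wrong key name usually shows up as a column of empty strings rather than an error, so it can help to check the `rows` list before writing the parquet file. The helper below is a minimal sketch: `find_row_problems` is a hypothetical name, and the checks (unique `book_id`, non-empty title, rating within 1–5 or NaN) are assumptions drawn from the column table above, not rules enforced by the Space.

```python
import math

REQUIRED_KEYS = ("book_id", "title", "author", "category", "rating", "summary")


def find_row_problems(rows):
    """Return a list of human-readable problems found in the row dicts."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [k for k in REQUIRED_KEYS if k not in row]
        if missing:
            problems.append(f"row {i}: missing keys {missing}")
            continue
        bid = row["book_id"]
        if bid in seen:
            problems.append(f"row {i}: duplicate book_id {bid!r}")
        seen.add(bid)
        if not row["title"]:
            problems.append(f"row {i}: empty title (check the JSON key name)")
        rating = row["rating"]
        # None and NaN both mean "no rating" and are allowed.
        if rating is not None and not math.isnan(rating) and not 1 <= rating <= 5:
            problems.append(f"row {i}: rating {rating} outside 1-5")
    return problems
```

Calling `find_row_problems(rows)` just before `pd.DataFrame(rows)` and printing the first few entries is usually enough to spot a mismatched field name.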

---

## 2. Per-model embedding files

Each notebook produces one `*_book_emb.npy` and one `*_book_ids.npy`. Save
them at the **end** of training, after the model has converged (or after
loading the best checkpoint).

The cell below assumes you have:

- the trained `model` in eval mode
- the `data_loader` (or whatever object owns `entity_maps["book"]`, the
  `{book_id_str → internal_index}` dict that all three notebooks build)

### 2a. Two-Tower

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to your actual item-tower forward signature.
    # The point is to get the L2-normalised final book embedding for every book.
    book_emb = model.compute_all_item_embeddings()   # (num_books, 256) tensor
    book_emb = F.normalize(book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width unicode array; dtype=object would make np.save fall back to pickle.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create directories
np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
np.save("embeddings/two_tower_book_ids.npy", book_ids)
print(f"Two-Tower: saved {book_emb_np.shape}")
```

### 2b. LightGCN

```python
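import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the method name to whatever your notebook exposes for the propagated embeddings.
    final_user_emb, final_book_emb = model.get_final_embeddings()  # (Nu, 256), (Nb, 256)
    book_emb = F.normalize(final_book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width unicode array; dtype=object would make np.save fall back to pickle.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create directories
np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
np.save("embeddings/lightgcn_book_ids.npy", book_ids)
print(f"LightGCN: saved {book_emb_np.shape}")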
```

### 2c. HGNN

```python
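import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call and its arguments to your notebook's graph and feature objects.
    h = model.get_all_embeddings(graph, features)   # dict {ntype: (N_ntype, 64)}
    book_emb = F.normalize(h["book"], dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width unicode array; dtype=object would make np.save fall back to pickle.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create directories
np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
np.save("embeddings/hgnn_book_ids.npy", book_ids)
print(f"HGNN: saved {book_emb_np.shape}")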
```

> The Space *does* re-normalise on load, so the L2-normalisation step above
> is technically optional. It is kept so that the `.npy` you ship represents
> exactly the vectors used in your evaluation tables.
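
Before moving on, it may be worth confirming that each embedding file and its ID file actually line up. The function below is a sketch of such a check; `check_embedding_pair` is a hypothetical helper, and the conditions it tests (2-D matrix, one ID per row, float32 or float16, no NaN, unique IDs, no all-zero rows) are assumptions based on the shapes described in this file rather than checks the Space is known to perform.

```python
import numpy as np


def check_embedding_pair(emb, ids):
    """Return a list of problems with an (embeddings, ids) pair; empty means OK."""
    problems = []
    if emb.ndim != 2:
        problems.append(f"embeddings must be 2-D, got shape {emb.shape}")
        return problems
    if ids.shape != (emb.shape[0],):
        problems.append(f"ids shape {ids.shape} does not match {emb.shape[0]} rows")
    if emb.dtype not in (np.float32, np.float16):
        problems.append(f"unexpected dtype {emb.dtype}")
    if not np.isfinite(emb).all():
        problems.append("embeddings contain NaN or inf")
    if len(np.unique(ids)) != len(ids):
        problems.append("book_ids are not unique")
    # A zero vector has no direction, so it cannot be L2-normalised.
    if (np.linalg.norm(emb.astype(np.float32), axis=1) == 0).any():
        problems.append("at least one all-zero row (cannot be normalised)")
    return problems
```

A typical use would be to `np.load` the two files for one model, pass both arrays in, and print whatever comes back.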

---

## 3. Pushing to the Space

Your final Space repo should look like this:

```
your-space/
├── app.py
├── requirements.txt
├── README.md
├── SETUP.md
├── data/
│   └── books_metadata.parquet              ← drop yours here
└── embeddings/
    ├── two_tower_book_emb.npy              ← drop yours here
    ├── two_tower_book_ids.npy              ← drop yours here
    ├── lightgcn_book_emb.npy
    ├── lightgcn_book_ids.npy
    ├── hgnn_book_emb.npy
    └── hgnn_book_ids.npy
```

Files larger than 10 MB **must** be tracked with `git-lfs` (the Hub rejects them otherwise), which covers every embedding file listed here:

```bash
git lfs install
git lfs track "*.npy"
git lfs track "*.parquet"
git add .gitattributes
git add .
git commit -m "Add trained embeddings and metadata"
git push
```

If you would rather not commit the embeddings to the Space repo, an
alternative is to upload them to your existing HF *dataset* repo and have
`app.py` download them on startup using `huggingface_hub.hf_hub_download`.
Tell me if you want this variant — a 6-line change to `app.py`.
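
As a rough sketch of that variant: the code below assumes the artefacts were uploaded to the dataset repo linked in section 1 under the same relative paths used in the Space layout. Both of those are assumptions, as are the helper names `artefact_filenames` and `fetch_artefacts`. Only `hf_hub_download` is a real `huggingface_hub` function, and the actual change to `app.py` may look different.

```python
from pathlib import Path

# Assumption: artefacts live in the dataset repo linked in section 1.
DATASET_REPO = "DevnilMaster1/Bangla-Book-Recommendation-Dataset"
MODELS = ("two_tower", "lightgcn", "hgnn")


def artefact_filenames(models=MODELS):
    """List the repo-relative paths the Space expects, in a fixed order."""
    names = ["data/books_metadata.parquet"]
    for m in models:
        names += [f"embeddings/{m}_book_emb.npy", f"embeddings/{m}_book_ids.npy"]
    return names


def fetch_artefacts(repo_id=DATASET_REPO, download=None):
    """Download every artefact and return {repo_path: local_path}."""
    if download is None:
        # Imported lazily so this module loads even without huggingface_hub.
        from huggingface_hub import hf_hub_download

        def download(name):
            return hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")

    return {name: Path(download(name)) for name in artefact_filenames()}
```

Calling `fetch_artefacts()` once at startup returns local cache paths that can be passed to `np.load` and `pd.read_parquet`. The `download` parameter exists only so the function can be exercised without network access.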

---

## 4. Approximate sizes

| File                      | Shape           | Size (float32)       | Size (float16) |
|---------------------------|-----------------|----------------------|----------------|
| `two_tower_book_emb.npy`  | (127302, 256)   | ~130 MB              | ~65 MB         |
| `lightgcn_book_emb.npy`   | (127302, 256)   | ~130 MB              | ~65 MB         |
| `hgnn_book_emb.npy`       | (127302, 64)    | ~32 MB               | ~16 MB         |
| `books_metadata.parquet`  | (127302, 6)     | ~30–60 MB (on disk)  | n/a            |
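
The embedding figures follow directly from rows × dimensions × bytes per value. The snippet below reproduces that arithmetic for the shapes in the table, so you can redo it if your catalogue size or embedding width differs. `raw_bytes` is a hypothetical helper, and it ignores the small `.npy` header.

```python
import numpy as np

N_BOOKS = 127_302  # row count from the table above


def raw_bytes(n_rows, dim, dtype):
    """Bytes of array data in an .npy file, ignoring the small header."""
    return n_rows * dim * np.dtype(dtype).itemsize


for name, dim in [("two_tower", 256), ("lightgcn", 256), ("hgnn", 64)]:
    mb32 = raw_bytes(N_BOOKS, dim, np.float32) / 1e6
    mb16 = raw_bytes(N_BOOKS, dim, np.float16) / 1e6
    print(f"{name:10s} float32 {mb32:6.1f} MB   float16 {mb16:6.1f} MB")
```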

Casting to `float16` before saving roughly halves the embedding size with
negligible quality impact for cosine similarity:

```python
np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))
```

The Space upcasts to float32 on load, so this is safe.
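
To see how small the effect is on your own vectors, you can compare cosine similarities before and after a float16 round trip. The sketch below uses random unit vectors as a stand-in for real embeddings (an assumption, since real embeddings are not random) and mimics the upcast-and-renormalise step the Space is described as doing on load.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for real embeddings: 1000 unit vectors of width 256.
emb32 = rng.standard_normal((1000, 256)).astype(np.float32)
emb32 /= np.linalg.norm(emb32, axis=1, keepdims=True)

# Simulate "save as float16, then upcast and re-normalise on load".
emb16 = emb32.astype(np.float16).astype(np.float32)
emb16 /= np.linalg.norm(emb16, axis=1, keepdims=True)

# Compare cosine similarity of the first 10 vectors against every vector.
sims32 = emb32[:10] @ emb32.T
sims16 = emb16[:10] @ emb16.T
max_abs_diff = float(np.abs(sims32 - sims16).max())
print(f"largest change in cosine similarity: {max_abs_diff:.2e}")
```

Swapping `emb32` for an array loaded from one of your `.npy` files gives the same comparison on real data.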

---

## 5. Validating locally before pushing

```bash
pip install -r requirements.txt
python app.py
```

Open the URL Gradio prints. The startup log will show `[real]` or
`[synthetic]` next to each model — confirm all three say `[real]`. Pick a
few books you actually know from the catalogue and sanity-check the
recommendations: anything in the same author/series/category neighbourhood
is a good sign. Once it looks right locally, push to the Space.
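
The same sanity check can also be run straight from the `.npy` files, without starting the app. The function below is a minimal sketch of a cosine nearest-neighbour lookup. `top_k_similar` is a hypothetical helper and is not necessarily how `app.py` ranks books.

```python
import numpy as np


def top_k_similar(emb, ids, query_id, k=5):
    """Return the k (book_id, cosine) pairs closest to query_id, excluding itself."""
    emb = emb.astype(np.float32)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    matches = np.flatnonzero(ids == query_id)
    if matches.size == 0:
        raise KeyError(f"unknown book_id: {query_id!r}")
    q = matches[0]
    scores = emb @ emb[q]
    scores[q] = -np.inf               # never recommend the query book itself
    top = np.argsort(-scores)[:k]
    return [(str(ids[i]), float(scores[i])) for i in top]
```

Load one model's `*_book_emb.npy` and `*_book_ids.npy`, call `top_k_similar(emb, ids, some_book_id)`, and look the returned IDs up in `books_metadata.parquet` to read the titles.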