# SETUP: Preparing the Space's Data Files

The Space ships **without** trained model artefacts. They are too large and too tightly coupled to the training pipeline for me to ship blindly. This file explains what every artefact is, exactly where it goes, and gives a working Python template for producing it from each of the three training notebooks.

The Space needs two kinds of artefact: one metadata table, plus an embedding matrix and its matching ID array for each model.

| Path                                | Purpose                                                               |
|-------------------------------------|-----------------------------------------------------------------------|
| `data/books_metadata.parquet`       | Title, author, category, rating, summary for every book               |
| `embeddings/{model}_book_emb.npy`   | Final book vector for each model, `(N, D) float32`                    |
| `embeddings/{model}_book_ids.npy`   | The `book_id` string for each row of the embedding file, `(N,)`       |

> **Why not `.pkl`?** Pickle files are typically slower to load and larger on
> disk, and they carry an arbitrary-code-execution risk when loaded, which is
> a real concern for a public Space. `.npy` (numbers) and `.parquet` (tables)
> are the standard choices for this kind of artefact in the HF ecosystem and
> load quickly with NumPy and pandas.

---

## 1. `data/books_metadata.parquet`

A single table keyed by `book_id`. Required columns:

| column     | dtype  | notes                                          |
|------------|--------|------------------------------------------------|
| `book_id`  | str    | The same string used in your JSON dataset      |
| `title`    | str    | Book title (Bangla or English)                 |
| `author`   | str    | Comma-joined if multiple authors               |
| `category` | str    | Primary category name                          |
| `rating`   | float  | Average rating 1–5; NaN allowed                |
| `summary`  | str    | Short description; first ~240 chars are shown  |

Build it once from the raw JSON files in your HF dataset repo. You almost certainly have these locally already from training; if not, download them from that repo.
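If the raw files are not on disk, fetching them could look like the sketch below. It rests on assumptions that are not stated in this guide: the five JSON files sit at the root of the dataset repo under the names the build script reads, and `your-username/your-dataset` is only a placeholder repo id. The helper name `fetch_raw_json` is made up for illustration.

```python
# fetch_raw_json.py (illustrative helper, not part of the Space)
import shutil
from pathlib import Path

# Assumed file names: the ones build_metadata.py opens.
RAW_FILES = (
    "book.json",
    "author.json",
    "category.json",
    "book_to_author.json",
    "book_to_category.json",
)


def fetch_raw_json(repo_id, dest_dir=".", filenames=RAW_FILES, download=None):
    """Copy each raw JSON file from a HF dataset repo into dest_dir."""
    if download is None:
        # Imported lazily so the helper can be tested without network access.
        from huggingface_hub import hf_hub_download

        def download(name):
            # Returns the path of the file inside the local HF cache.
            return hf_hub_download(
                repo_id=repo_id, filename=name, repo_type="dataset"
            )

    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in filenames:
        cached = Path(download(name))
        target = dest / name
        if not target.exists():  # never overwrite a local copy
            shutil.copy(cached, target)
        copied.append(target)
    return copied


# Example call (placeholder repo id, replace with your own):
# fetch_raw_json("your-username/your-dataset")
```

Once the files are in the working directory, `build_metadata.py` reads them by name.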
```python
# build_metadata.py
import json
import os

import pandas as pd


def load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


books = load_json("book.json")
authors = {str(a["author_id"]): a for a in load_json("author.json")}
categories = {str(c["category_id"]): c for c in load_json("category.json")}
b2a = load_json("book_to_author.json")
b2c = load_json("book_to_category.json")

# book_id → list of author names
authors_per_book = {}
for e in b2a:
    authors_per_book.setdefault(str(e["book_id"]), []).append(
        authors.get(str(e["author_id"]), {}).get("author", "")
    )

# book_id → list of category names (we keep the first as the primary)
categories_per_book = {}
for e in b2c:
    categories_per_book.setdefault(str(e["book_id"]), []).append(
        categories.get(str(e["category_id"]), {}).get("category_name", "")
    )

rows = []
for b in books:
    bid = str(b["book_id"])
    rows.append({
        "book_id": bid,
        "title": b.get("book_name") or b.get("title", ""),
        "author": ", ".join(filter(None, authors_per_book.get(bid, []))),
        "category": (categories_per_book.get(bid) or [""])[0],
        # NaN (not None) keeps the column a float dtype even if ratings are missing
        "rating": float(b["rating"]) if b.get("rating") not in (None, "") else float("nan"),
        "summary": b.get("summary", "") or b.get("description", ""),
    })

df = pd.DataFrame(rows)
os.makedirs("data", exist_ok=True)  # to_parquet does not create the directory
df.to_parquet("data/books_metadata.parquet", index=False)
print(f"Wrote {len(df):,} rows to data/books_metadata.parquet")
```

> **Adjust the JSON key names** (`book_name`, `author`, `category_name`,
> `rating`, `summary`, …) to match what your actual files use. Open one of
> the JSON files and check: different snapshots have used slightly different
> field names.

---

## 2. Per-model embedding files

Each notebook produces one `*_book_emb.npy` and one `*_book_ids.npy`. Row `i` of the ID file names the book whose vector is row `i` of the embedding file, so the two must always be written together. Save them at the **end** of training, after the model has converged (or after loading the best checkpoint).
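To make the row-alignment requirement concrete, here is a minimal sketch of how a consumer could turn the two arrays into cosine-similarity recommendations. It is not the Space's actual code; the function name `top_k_similar` and the toy vectors are invented for illustration.

```python
import numpy as np


def top_k_similar(emb, ids, query_id, k=5):
    """Return the k books closest to query_id as (book_id, cosine) pairs."""
    if emb.shape[0] != len(ids):
        raise ValueError("embedding rows and book ids are misaligned")

    # L2-normalise so that a dot product equals cosine similarity.
    emb = emb.astype(np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)

    matches = np.flatnonzero(ids == query_id)
    if matches.size == 0:
        raise KeyError(query_id)
    q = matches[0]

    scores = unit @ unit[q]
    scores[q] = -np.inf  # never recommend the query book itself
    top = np.argsort(-scores)[:k]
    return [(str(ids[i]), float(scores[i])) for i in top]


# Toy data: three books in a 2-D embedding space.
toy_emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], dtype=np.float32)
toy_ids = np.array(["a", "b", "c"])
print(top_k_similar(toy_emb, toy_ids, "a", k=2))
```

If a row of the ID file were shifted by one, every recommendation would silently point at the wrong title, which is why the export cells derive both arrays from the same index range.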
Each export cell in 2a–2c assumes you have:

- the trained `model` in eval mode
- the `data_loader` (or whatever object owns `entity_maps["book"]`, the `{book_id_str → internal_index}` dict that all three notebooks build)

The IDs are saved as a plain string array (`dtype=str`) rather than an object array: NumPy pickles object arrays, and `np.load` refuses to read them unless `allow_pickle=True` is passed, which would bring back the risk described at the top of this file.

### 2a. Two-Tower

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to your actual item-tower forward signature.
    # The point is to get the L2-normalised final book embedding for every book.
    book_emb = model.compute_all_item_embeddings()   # (num_books, 256) tensor
    book_emb = F.normalize(book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → index} so that row i can be labelled with its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)  # np.save does not create the directory
np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
np.save("embeddings/two_tower_book_ids.npy", book_ids)
print(f"Two-Tower: saved {book_emb_np.shape}")
```

### 2b. LightGCN

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to whatever your model exposes for its propagated embeddings.
    final_user_emb, final_book_emb = model.get_final_embeddings()   # (Nu, 256), (Nb, 256)
    book_emb = F.normalize(final_book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → index} so that row i can be labelled with its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)  # np.save does not create the directory
np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
np.save("embeddings/lightgcn_book_ids.npy", book_ids)
print(f"LightGCN: saved {book_emb_np.shape}")
```

### 2c. HGNN

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to whatever your model exposes for per-node-type embeddings.
    h = model.get_all_embeddings(graph, features)   # dict {ntype: (N_ntype, 64)}
    book_emb = F.normalize(h["book"], dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str → index} so that row i can be labelled with its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)  # np.save does not create the directory
np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
np.save("embeddings/hgnn_book_ids.npy", book_ids)
print(f"HGNN: saved {book_emb_np.shape}")
```

> The Space *does* re-normalise on load, so the L2-normalisation step above
> is technically optional. It is kept so that the `.npy` you ship represents
> exactly the vectors used in your evaluation tables.

---

## 3. Pushing to the Space

Your final Space repo should look like this:

```
your-space/
├── app.py
├── requirements.txt
├── README.md
├── SETUP.md
├── data/
│   └── books_metadata.parquet      ← drop yours here
└── embeddings/
    ├── two_tower_book_emb.npy      ← drop yours here
    ├── two_tower_book_ids.npy      ← drop yours here
    ├── lightgcn_book_emb.npy
    ├── lightgcn_book_ids.npy
    ├── hgnn_book_emb.npy
    └── hgnn_book_ids.npy
```

Files larger than ~50 MB **must** be tracked with `git-lfs`:

```bash
git lfs install
git lfs track "*.npy"
git lfs track "*.parquet"
git add .gitattributes
git add .
git commit -m "Add trained embeddings and metadata"
git push
```

If you would rather not commit the embeddings to the Space repo, an alternative is to upload them to your existing HF *dataset* repo and have `app.py` download them on startup using `huggingface_hub.hf_hub_download`. Tell me if you want this variant; it is a 6-line change to `app.py`.

---

## 4. Approximate sizes

| File                      | Shape           | float32   | float16 |
|---------------------------|-----------------|-----------|---------|
| `two_tower_book_emb.npy`  | (127302, 256)   | ~130 MB   | ~65 MB  |
| `lightgcn_book_emb.npy`   | (127302, 256)   | ~130 MB   | ~65 MB  |
| `hgnn_book_emb.npy`       | (127302, 64)    | ~32 MB    | ~16 MB  |
| `books_metadata.parquet`  | (127302, 6)     | ~30–60 MB | n/a     |

The embedding sizes are simply rows × dimensions × bytes per value: 127302 × 256 × 4 bytes ≈ 130 MB for float32. Casting to `float16` before saving roughly halves the embedding size with negligible quality impact for cosine similarity:

```python
np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))
```

The Space upcasts to float32 on load, so this is safe.

---

## 5. Validating locally before pushing

```bash
pip install -r requirements.txt
python app.py
```

Open the URL Gradio prints. The startup log will show `[real]` or `[synthetic]` next to each model; confirm all three say `[real]`. Pick a few books you actually know from the catalogue and sanity-check the recommendations: anything in the same author/series/category neighbourhood is a good sign. Once it looks right locally, push to the Space.
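Before launching the app, a quick structural check of the files can catch misaligned or incomplete exports. The script below is a sketch under assumptions, not something the Space ships: the names `check_embeddings` and `check_space_files` are invented here, and the checks only cover what this guide specifies (required columns, matching row counts, finite float values, IDs present in the metadata).

```python
# check_artefacts.py (illustrative, not part of the Space)
from pathlib import Path

import numpy as np

REQUIRED_COLUMNS = ("book_id", "title", "author", "category", "rating", "summary")
MODELS = ("two_tower", "lightgcn", "hgnn")


def check_embeddings(emb, ids, known_ids):
    """Return a list of problems for one model; an empty list means OK."""
    problems = []
    if emb.ndim != 2:
        return [f"expected a 2-D matrix, got shape {emb.shape}"]
    if emb.shape[0] != len(ids):
        problems.append(f"{emb.shape[0]} embedding rows vs {len(ids)} ids")
    if not np.issubdtype(emb.dtype, np.floating):
        problems.append(f"non-float dtype {emb.dtype}")
    elif not np.isfinite(emb).all():
        problems.append("embedding contains NaN or inf")
    id_list = [str(i) for i in ids]
    if len(set(id_list)) != len(id_list):
        problems.append("duplicate book ids")
    missing = set(id_list) - set(known_ids)
    if missing:
        problems.append(f"{len(missing)} ids not present in the metadata table")
    return problems


def check_space_files(root="."):
    """Load every artefact under root and return {name: [problems]}."""
    import pandas as pd  # lazy import: only needed to read the parquet file

    root = Path(root)
    report = {}
    meta = pd.read_parquet(root / "data" / "books_metadata.parquet")
    report["metadata"] = [
        f"missing column {c}" for c in REQUIRED_COLUMNS if c not in meta.columns
    ]
    known = set(meta["book_id"].astype(str)) if "book_id" in meta.columns else set()

    for model in MODELS:
        emb_path = root / "embeddings" / f"{model}_book_emb.npy"
        ids_path = root / "embeddings" / f"{model}_book_ids.npy"
        try:
            emb = np.load(emb_path)
            ids = np.load(ids_path)  # allow_pickle defaults to False
        except (OSError, ValueError) as exc:
            # OSError: file missing; ValueError: e.g. a pickled object array
            report[model] = [f"could not load: {exc}"]
            continue
        report[model] = check_embeddings(emb, ids, known)
    return report


# Example: run from the Space repo root.
# for name, problems in check_space_files(".").items():
#     print(name, "OK" if not problems else problems)
```

An empty problem list for the metadata and for all three models does not prove the recommendations are good, only that the files are internally consistent; the manual check in the app is still the real test.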