# SETUP: Preparing the Space's Data Files

The Space ships **without** trained model artefacts. They are too large and
too tightly coupled to the training pipeline for me to ship blindly. This
file explains what every artefact is, exactly where it goes, and gives a
working Python template for producing it from each of the three training
notebooks.

The Space needs two artefact types: one metadata table, plus a pair of
embedding files (vectors and their ids) for each model:

| Path                              | Purpose                                                          |
|-----------------------------------|------------------------------------------------------------------|
| `data/books_metadata.parquet`     | Title, author, category, rating, summary for every book          |
| `embeddings/{model}_book_emb.npy` | Final L2-normalised book vector for each model, `(N, D)` float32 |
| `embeddings/{model}_book_ids.npy` | The `book_id` string for each row of the embedding file, `(N,)`  |

> **Why not `.pkl`?** Pickle files are slower to load, larger on disk, and
> carry an arbitrary-code-execution risk when loaded, which is a real concern
> for a public Space. `.npy` (numbers) and `.parquet` (tables) are the standard
> choices for this kind of artefact in the HF ecosystem, and NumPy and pandas
> read them quickly. Note that `.npy` only avoids pickle for non-object
> dtypes, so the id arrays should be saved as fixed-width strings rather than
> with `dtype=object`.

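To make the pickle point concrete, here is a minimal sketch of writing string ids to `.npy` and reading them back with unpickling disabled. The ids and the file name are made up for illustration; only `np.save` and `np.load` are real API.

```python
import os
import tempfile

import numpy as np

# Hypothetical ids, used only for illustration.
book_ids = ["b-101", "b-102", "b-2047"]

# dtype=str gives a fixed-width unicode array, which np.save writes as plain
# bytes. An array with dtype=object would be pickled instead.
ids_arr = np.array(book_ids, dtype=str)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_book_ids.npy")
    np.save(path, ids_arr)
    # allow_pickle=False is already the default for np.load; spelling it out
    # documents that loading never unpickles anything.
    loaded = np.load(path, allow_pickle=False)

print(loaded.dtype.kind, loaded.tolist())
```

If the same load were attempted on an object-dtype array, `np.load` would raise instead of silently unpickling, which is the behaviour you want on a public Space.
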
---

## 1. `data/books_metadata.parquet`

A single table with one row per book, keyed by the `book_id` column.
Required columns:

| column     | dtype | notes                                         |
|------------|-------|-----------------------------------------------|
| `book_id`  | str   | The same string used in your JSON dataset     |
| `title`    | str   | Book title (Bangla or English)                |
| `author`   | str   | Comma-joined if multiple authors              |
| `category` | str   | Primary category name                         |
| `rating`   | float | Average rating 1–5; NaN allowed               |
| `summary`  | str   | Short description; first ~240 chars are shown |

Build it once from the raw JSON files in your HF dataset repo. You almost
certainly have these locally already from training; if not, download them from
<https://huggingface.co/datasets/DevnilMaster1/Bangla-Book-Recommendation-Dataset/tree/main>.

```python
# build_metadata.py
import json
import os

import pandas as pd


def load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


books = load_json("book.json")
authors = {str(a["author_id"]): a for a in load_json("author.json")}
categories = {str(c["category_id"]): c for c in load_json("category.json")}
b2a = load_json("book_to_author.json")
b2c = load_json("book_to_category.json")

# book_id → list of author names
authors_per_book = {}
for e in b2a:
    authors_per_book.setdefault(str(e["book_id"]), []).append(
        authors.get(str(e["author_id"]), {}).get("author", "")
    )

# book_id → list of category names (we keep the first as the primary)
categories_per_book = {}
for e in b2c:
    categories_per_book.setdefault(str(e["book_id"]), []).append(
        categories.get(str(e["category_id"]), {}).get("category_name", "")
    )

rows = []
for b in books:
    bid = str(b["book_id"])
    rows.append({
        "book_id": bid,
        "title": b.get("book_name") or b.get("title", ""),
        "author": ", ".join(filter(None, authors_per_book.get(bid, []))),
        "category": (categories_per_book.get(bid) or [""])[0],
        # Missing or empty ratings become None, which pandas stores as NaN.
        "rating": float(b["rating"]) if b.get("rating") not in (None, "") else None,
        "summary": b.get("summary", "") or b.get("description", ""),
    })

df = pd.DataFrame(rows)
os.makedirs("data", exist_ok=True)  # to_parquet does not create missing directories
df.to_parquet("data/books_metadata.parquet", index=False)
print(f"Wrote {len(df):,} rows to data/books_metadata.parquet")
```

> **Adjust the JSON key names** (`book_name`, `author`, `category_name`,
> `rating`, `summary`, …) to match what your actual files use. Open one of
> the JSON files and check; different snapshots have used slightly different
> field names.

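Because the key names vary between snapshots, it can help to sanity-check the rows before writing the parquet file. The helper below is a sketch, not part of the Space: `check_rows` and `REQUIRED` are names I made up, and the 1 to 5 rating range is taken from the column table above (loosen it if your data uses 0 for unrated books).

```python
import math

# Column names from the "Required columns" table.
REQUIRED = ("book_id", "title", "author", "category", "rating", "summary")


def check_rows(rows):
    """Return a list of human-readable problems found in the metadata rows."""
    problems = []
    seen = set()
    for i, row in enumerate(rows):
        missing = [k for k in REQUIRED if k not in row]
        if missing:
            problems.append(f"row {i}: missing keys {missing}")
            continue
        bid = row["book_id"]
        if not isinstance(bid, str) or not bid:
            problems.append(f"row {i}: book_id must be a non-empty string")
        elif bid in seen:
            problems.append(f"row {i}: duplicate book_id {bid!r}")
        else:
            seen.add(bid)
        rating = row["rating"]
        # None and NaN are allowed; any other value must lie in [1, 5].
        if rating is not None and not math.isnan(rating) and not 1 <= rating <= 5:
            problems.append(f"row {i}: rating {rating} outside 1-5")
    return problems


# Made-up rows: one duplicate id and one out-of-range rating.
demo = [
    {"book_id": "1", "title": "A", "author": "X", "category": "C", "rating": 4.5, "summary": ""},
    {"book_id": "1", "title": "B", "author": "Y", "category": "C", "rating": None, "summary": ""},
    {"book_id": "2", "title": "C", "author": "Z", "category": "C", "rating": 7.0, "summary": ""},
]
problems = check_rows(demo)
for p in problems:
    print(p)
```

In `build_metadata.py` you would call `check_rows(rows)` just before `pd.DataFrame(rows)` and stop if the list is non-empty.
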
---

## 2. Per-model embedding files

Each notebook produces one `*_book_emb.npy` and one `*_book_ids.npy`. Save
them at the **end** of training, after the model has converged (or after
loading the best checkpoint).

The cells below assume you have:

- the trained `model` in eval mode
- the `data_loader` (or whatever object owns `entity_maps["book"]`, the
  `{book_id_str → internal_index}` dict that all three notebooks build)

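Each cell inverts that dict and walks the indices `0..N-1`, which only works if the internal indices are unique and contiguous. A small guard like the following makes that assumption explicit. `ordered_book_ids` is a hypothetical helper and `demo_map` stands in for `data_loader.entity_maps["book"]`.

```python
def ordered_book_ids(book_map):
    """Invert {book_id_str: internal_index} into a list ordered by index.

    Raises ValueError if the indices are not exactly 0..N-1, because row i of
    the embedding matrix could then not be matched to a single book_id.
    """
    idx_to_bid = {idx: bid for bid, idx in book_map.items()}
    n = len(book_map)
    if len(idx_to_bid) != n:
        raise ValueError("two book_ids share the same internal index")
    if set(idx_to_bid) != set(range(n)):
        raise ValueError("internal indices are not contiguous 0..N-1")
    return [idx_to_bid[i] for i in range(n)]


# Hypothetical map, standing in for data_loader.entity_maps["book"].
demo_map = {"b-7": 2, "b-3": 0, "b-9": 1}
print(ordered_book_ids(demo_map))
```

If this raises on your real map, the embedding rows and the id file would be misaligned, so it is worth fixing before saving anything.
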
### 2a. Two-Tower

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to your actual item-tower forward signature.
    # The point is to get the L2-normalised final book embedding for every book.
    book_emb = model.compute_all_item_embeddings()  # (num_books, 256) tensor
    book_emb = F.normalize(book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain unicode; dtype=object would make np.save use pickle.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
np.save("embeddings/two_tower_book_ids.npy", book_ids)
print(f"Two-Tower: saved {book_emb_np.shape}")
```

### 2b. LightGCN

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the method name if your LightGCN class exposes the propagated
    # embeddings differently.
    final_user_emb, final_book_emb = model.get_final_embeddings()  # (Nu, 256), (Nb, 256)
    book_emb = F.normalize(final_book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain unicode; dtype=object would make np.save use pickle.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
np.save("embeddings/lightgcn_book_ids.npy", book_ids)
print(f"LightGCN: saved {book_emb_np.shape}")
```

### 2c. HGNN

```python
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # graph and features are the same objects the notebook passes at training time.
    h = model.get_all_embeddings(graph, features)  # dict {ntype: (N_ntype, 64)}
    book_emb = F.normalize(h["book"], dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain unicode; dtype=object would make np.save use pickle.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
np.save("embeddings/hgnn_book_ids.npy", book_ids)
print(f"HGNN: saved {book_emb_np.shape}")
```

> The Space *does* re-normalise on load, so the L2-normalisation step above
> is technically optional. It is kept so that the `.npy` you ship represents
> exactly the vectors used in your evaluation tables.

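For intuition, re-normalising on load followed by a cosine lookup can be sketched as below. This is not the code in `app.py`; `normalise_rows` and `top_k_similar` are hypothetical names, and the tiny matrix stands in for a real `(N, D)` file.

```python
import numpy as np


def normalise_rows(emb):
    """Upcast to float32 and scale each row to unit L2 norm."""
    emb = np.asarray(emb, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    # Guard against all-zero rows so they stay zero instead of becoming NaN.
    return emb / np.maximum(norms, 1e-12)


def top_k_similar(emb, query_row, k=5):
    """Indices of the k rows most cosine-similar to emb[query_row], excluding itself."""
    scores = emb @ emb[query_row]  # dot product equals cosine for unit rows
    scores[query_row] = -np.inf    # never recommend the query book itself
    k = min(k, emb.shape[0] - 1)
    return np.argsort(-scores)[:k]


# Tiny hand-made matrix for illustration; real files are (N, D).
demo = np.array([[2.0, 0.0], [1.0, 0.1], [0.0, 3.0], [-1.0, 0.0]], dtype=np.float16)
unit = normalise_rows(demo)
print(top_k_similar(unit, query_row=0, k=2))
```

Because the rows are rescaled here anyway, vectors saved without normalisation would rank identically; saving them normalised just keeps the files consistent with what you evaluated.
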
---

## 3. Pushing to the Space

Your final Space repo should look like this:

```
your-space/
├── app.py
├── requirements.txt
├── README.md
├── SETUP.md
├── data/
│   └── books_metadata.parquet      ← drop yours here
└── embeddings/
    ├── two_tower_book_emb.npy      ← drop yours here
    ├── two_tower_book_ids.npy      ← drop yours here
    ├── lightgcn_book_emb.npy
    ├── lightgcn_book_ids.npy
    ├── hgnn_book_emb.npy
    └── hgnn_book_ids.npy
```

Files larger than ~10 MB **must** be tracked with `git-lfs`, or the Hub will
reject the push:

```bash
git lfs install
git lfs track "*.npy"
git lfs track "*.parquet"
git add .gitattributes
git add .
git commit -m "Add trained embeddings and metadata"
git push
```

If you would rather not commit the embeddings to the Space repo, an
alternative is to upload them to your existing HF *dataset* repo and have
`app.py` download them on startup using `huggingface_hub.hf_hub_download`.
Tell me if you want this variant; it is roughly a 6-line change to `app.py`.

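As a rough idea of what that variant could look like, here is a sketch under assumptions: `artefact_filenames` and `fetch_artefacts` are hypothetical helpers, `repo_id` is a placeholder for your dataset repo, and the files are assumed to sit in that repo under the same relative paths as in the tree above. Only `hf_hub_download` and its `repo_id`, `filename`, `repo_type` and `local_dir` parameters are real API.

```python
import os

MODELS = ("two_tower", "lightgcn", "hgnn")


def artefact_filenames(models=MODELS):
    """Relative paths of every file the Space expects, as listed in this guide."""
    names = ["data/books_metadata.parquet"]
    for m in models:
        names.append(f"embeddings/{m}_book_emb.npy")
        names.append(f"embeddings/{m}_book_ids.npy")
    return names


def fetch_artefacts(repo_id, local_root="."):
    """Download any artefact that is missing locally from a dataset repo on the Hub.

    repo_id is a placeholder such as "your-username/your-dataset".
    """
    # Imported lazily so artefact_filenames() works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    for name in artefact_filenames():
        if os.path.exists(os.path.join(local_root, name)):
            continue
        # With local_dir set, the file lands at local_root/name.
        hf_hub_download(
            repo_id=repo_id, filename=name, repo_type="dataset", local_dir=local_root
        )


print(artefact_filenames())
```

Calling `fetch_artefacts("your-username/your-dataset")` near the top of `app.py` would then populate `data/` and `embeddings/` before the rest of the app loads them.
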
---

## 4. Approximate sizes

| File                     | Shape         | float32   | float16 |
|--------------------------|---------------|-----------|---------|
| `two_tower_book_emb.npy` | (127302, 256) | ~130 MB   | ~65 MB  |
| `lightgcn_book_emb.npy`  | (127302, 256) | ~130 MB   | ~65 MB  |
| `hgnn_book_emb.npy`      | (127302, 64)  | ~32 MB    | ~16 MB  |
| `books_metadata.parquet` | (127302, 6)   | ~30–60 MB | n/a     |

Casting to `float16` before saving halves the embedding size. float16 keeps
roughly three significant decimal digits, so the effect on cosine similarity
is usually negligible for ranking:

```python
np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))
```

The Space upcasts to float32 on load, so this is safe.

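The figures in the table are simply rows × dimensions × bytes per value, and the float16 claim is easy to check yourself. The sketch below does both; it uses random unit vectors as a stand-in for real embeddings, so treat the drift it prints as indicative only, and `npy_payload_mb` is a name I made up.

```python
import numpy as np


def npy_payload_mb(n_rows, dim, dtype):
    """Raw array size in MB (10**6 bytes), ignoring the small .npy header."""
    return n_rows * dim * np.dtype(dtype).itemsize / 1e6


print(round(npy_payload_mb(127302, 256, np.float32), 1))  # 130.4
print(round(npy_payload_mb(127302, 64, np.float16), 1))   # 16.3

# Measure how far cosine scores move after a float16 round trip.
rng = np.random.default_rng(0)
emb = rng.standard_normal((1000, 64)).astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Simulate save-as-float16, then upcast and re-normalise on load.
emb16 = emb.astype(np.float16).astype(np.float32)
emb16 /= np.linalg.norm(emb16, axis=1, keepdims=True)

max_drift = float(np.abs(emb @ emb.T - emb16 @ emb16.T).max())
print(f"max cosine drift: {max_drift:.2e}")
```

To run the same check on your own vectors, replace the random `emb` with the array you are about to save.
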
---

## 5. Validating locally before pushing

```bash
pip install -r requirements.txt
python app.py
```

Open the URL Gradio prints. The startup log will show `[real]` or
`[synthetic]` next to each model; confirm all three say `[real]`. Pick a
few books you actually know from the catalogue and sanity-check the
recommendations: anything in the same author/series/category neighbourhood
is a good sign. Once it looks right locally, push to the Space.

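Before eyeballing recommendations, a quick structural check of each embedding/id pair against the metadata can catch misaligned files early. This is a sketch with made-up names (`check_alignment`) and toy in-memory data, not something the Space runs.

```python
import numpy as np


def check_alignment(emb, ids, metadata_ids):
    """Return a list of problems with one model's embedding/id pair."""
    problems = []
    if emb.ndim != 2:
        problems.append(f"embeddings must be 2-D, got shape {emb.shape}")
    if ids.ndim != 1 or ids.shape[0] != emb.shape[0]:
        problems.append(f"{emb.shape[0]} embedding rows but ids have shape {ids.shape}")
    if not np.isfinite(emb).all():
        problems.append("embeddings contain NaN or inf")
    id_set = set(ids.tolist())
    if len(id_set) != ids.size:
        problems.append("book ids are not unique")
    unknown = id_set - set(metadata_ids)
    if unknown:
        problems.append(f"{len(unknown)} id(s) absent from the metadata table")
    return problems


# Toy data: the id "zzz" has no row in the metadata.
emb = np.ones((3, 4), dtype=np.float32)
ids = np.array(["a", "b", "zzz"])
report = check_alignment(emb, ids, metadata_ids=["a", "b", "c"])
print(report)
```

With the real files, `emb` and `ids` would come from `np.load(...)` on each model's pair, and `metadata_ids` from the `book_id` column of `data/books_metadata.parquet`.
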