SETUP: Preparing the Space's Data Files
The Space ships without trained model artefacts. They are too large and too tightly coupled to the training pipeline for me to ship blindly. This file explains what every artefact is, exactly where it goes, and gives a working Python template for producing it from each of the three training notebooks.
There are two artefact types the Space needs:
| Path | Purpose |
|---|---|
| `data/books_metadata.parquet` | Title, author, category, rating, summary for every book |
| `embeddings/{model}_book_emb.npy` | Final L2-normalisable book vector for each model, shape (N, D), float32 |
| `embeddings/{model}_book_ids.npy` | The book_id string for each row of the embedding file, shape (N,) |
Why not `.pkl`? Pickle files are often slower to load and larger on disk, and they carry an arbitrary-code-execution risk when loaded, which is a real concern for a public Space. `.npy` (numbers) and `.parquet` (tables) are the standard choices for this kind of artefact in the HF ecosystem, and NumPy and pandas read them efficiently.
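To make the pickle concern concrete, here is a small sketch that uses only NumPy and an in-memory buffer. It suggests that a string array with a fixed-width Unicode dtype round-trips through `.npy` with `allow_pickle=False`, whereas an object-dtype array is pickled inside the file and is refused under the same setting. The helper name `roundtrip` and the sample IDs are made up for illustration.

```python
import io

import numpy as np


def roundtrip(arr):
    """Save to an in-memory .npy and reload with pickle disabled."""
    buf = io.BytesIO()
    np.save(buf, arr)
    buf.seek(0)
    return np.load(buf, allow_pickle=False)


ids = ["b001", "b002", "b003"]  # made-up IDs, for illustration only

# Fixed-width Unicode dtype: stored as plain data, no pickle involved.
safe = roundtrip(np.array(ids, dtype=str))

# Object dtype: NumPy pickles the elements, so the guarded load refuses it.
try:
    roundtrip(np.array(ids, dtype=object))
    object_array_loaded = True
except ValueError:
    object_array_loaded = False
```

If the ID files are meant to stay pickle-free, saving them with a string dtype rather than `dtype=object` is the safer choice.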
1. data/books_metadata.parquet
A single table with one row per book, keyed by the `book_id` column. Required columns:
| column | dtype | notes |
|---|---|---|
| `book_id` | str | The same string used in your JSON dataset |
| `title` | str | Book title (Bangla or English) |
| `author` | str | Comma-joined if multiple authors |
| `category` | str | Primary category name |
| `rating` | float | Average rating 1–5; NaN allowed |
| `summary` | str | Short description; first ~240 chars are shown |
Build it once from the raw JSON files in your HF dataset repo. You almost certainly have these locally already from training; if not, download them from https://huggingface.co/datasets/DevnilMaster1/Bangla-Book-Recommendation-Dataset/tree/main.
# build_metadata.py
import json
import os

import pandas as pd


def load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)


books = load_json("book.json")
authors = {str(a["author_id"]): a for a in load_json("author.json")}
categories = {str(c["category_id"]): c for c in load_json("category.json")}
b2a = load_json("book_to_author.json")
b2c = load_json("book_to_category.json")

# book_id -> list of author names
authors_per_book = {}
for e in b2a:
    authors_per_book.setdefault(str(e["book_id"]), []).append(
        authors.get(str(e["author_id"]), {}).get("author", "")
    )

# book_id -> list of category names (we keep the first as the primary)
categories_per_book = {}
for e in b2c:
    categories_per_book.setdefault(str(e["book_id"]), []).append(
        categories.get(str(e["category_id"]), {}).get("category_name", "")
    )

rows = []
for b in books:
    bid = str(b["book_id"])
    rows.append({
        "book_id": bid,
        "title": b.get("book_name") or b.get("title", ""),
        "author": ", ".join(filter(None, authors_per_book.get(bid, []))),
        "category": (categories_per_book.get(bid) or [""])[0],
        # Missing or empty ratings become None, which pandas stores as NaN.
        "rating": float(b["rating"]) if b.get("rating") not in (None, "") else None,
        "summary": b.get("summary", "") or b.get("description", ""),
    })

df = pd.DataFrame(rows)
os.makedirs("data", exist_ok=True)  # to_parquet does not create the folder
df.to_parquet("data/books_metadata.parquet", index=False)
print(f"Wrote {len(df):,} rows to data/books_metadata.parquet")
Adjust the JSON key names (`book_name`, `author`, `category_name`, `rating`, `summary`, …) to match what your actual files use. Open one of the JSON files and check; different snapshots have used slightly different field names.
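One way to cope with field names that differ between snapshots is a small lookup helper that tries several candidate keys in order. The sketch below is only an illustration: `first_present` is a hypothetical helper, and the two sample records are invented to imitate an older and a newer snapshot.

```python
def first_present(record, *keys, default=""):
    """Return the first non-empty value among `keys`, else `default`."""
    for key in keys:
        value = record.get(key)
        if value not in (None, ""):
            return value
    return default


# Made-up records imitating two snapshots with different field names.
old_snapshot = {"book_id": 1, "book_name": "Example A", "description": "Old-style text"}
new_snapshot = {"book_id": 2, "title": "Example B", "summary": "New-style text"}

titles = [first_present(b, "book_name", "title") for b in (old_snapshot, new_snapshot)]
summaries = [first_present(b, "summary", "description") for b in (old_snapshot, new_snapshot)]
```

Listing the candidate keys in one place makes it easy to add another spelling once you have inspected your own JSON files.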
2. Per-model embedding files
Each notebook produces one *_book_emb.npy and one *_book_ids.npy. Save
them at the end of training, after the model has converged (or after
loading the best checkpoint).
The cells below assume you have:
- the trained `model` in eval mode
- the `data_loader` (or whatever object owns `entity_maps["book"]`, the `{book_id_str → internal_index}` dict that all three notebooks build)
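All three export cells invert this dict to recover the book_id for each embedding row, which silently assumes the internal indices are unique and cover rows 0 to N-1. A minimal sketch of that inversion with explicit checks follows; `ordered_book_ids` is a hypothetical helper and the toy map merely stands in for `entity_maps["book"]`.

```python
def ordered_book_ids(book_map, num_rows):
    """Return book IDs ordered by internal index, checking rows 0..num_rows-1 are covered."""
    idx_to_bid = {idx: str(bid) for bid, idx in book_map.items()}
    if len(idx_to_bid) != len(book_map):
        raise ValueError("two book IDs share the same internal index")
    missing = [i for i in range(num_rows) if i not in idx_to_bid]
    if missing:
        raise ValueError(
            f"{len(missing)} embedding rows have no book ID, e.g. row {missing[0]}"
        )
    return [idx_to_bid[i] for i in range(num_rows)]


# Toy map standing in for data_loader.entity_maps["book"].
toy_map = {"b10": 0, "b20": 1, "b30": 2}
ids = ordered_book_ids(toy_map, num_rows=3)
```

Failing loudly here is cheaper than discovering later that recommendations point at the wrong titles.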
2a. Two-Tower
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to your actual item-tower forward signature.
    # The point is to get the L2-normalised final book embedding for every book.
    book_emb = model.compute_all_item_embeddings()  # (num_books, 256) tensor
    book_emb = F.normalize(book_emb, dim=-1)
book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str -> internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain Unicode; dtype=object would be pickled inside the .npy.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
np.save("embeddings/two_tower_book_ids.npy", book_ids)
print(f"Two-Tower: saved {book_emb_np.shape}")
2b. LightGCN
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    final_user_emb, final_book_emb = model.get_final_embeddings()  # (Nu, 256), (Nb, 256)
    book_emb = F.normalize(final_book_emb, dim=-1)
book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str -> internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain Unicode; dtype=object would be pickled inside the .npy.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
np.save("embeddings/lightgcn_book_ids.npy", book_ids)
print(f"LightGCN: saved {book_emb_np.shape}")
2c. HGNN
import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    h = model.get_all_embeddings(graph, features)  # dict {ntype: (N_ntype, 64)}
    book_emb = F.normalize(h["book"], dim=-1)
book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Invert {book_id_str -> internal_index} so row i maps back to its book_id.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str stores plain Unicode; dtype=object would be pickled inside the .npy.
book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)
np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
np.save("embeddings/hgnn_book_ids.npy", book_ids)
print(f"HGNN: saved {book_emb_np.shape}")
The Space does re-normalise on load, so the L2-normalisation step above is technically optional. It is kept so that the `.npy` you ship represents exactly the vectors used in your evaluation tables.
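For intuition about how an embedding file and its ID file fit together, here is a minimal NumPy sketch of re-normalising on load and ranking neighbours by cosine similarity. It is not the code in app.py; `l2_normalise` and `top_k_similar` are hypothetical names, and the toy arrays replace the real `.npy` files.

```python
import numpy as np


def l2_normalise(x, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def top_k_similar(emb, ids, query_id, k=5):
    """Return the k book IDs with the highest cosine similarity to query_id."""
    unit = l2_normalise(emb.astype(np.float32))
    q = int(np.flatnonzero(ids == query_id)[0])
    scores = unit @ unit[q]          # dot product of unit vectors = cosine
    scores[q] = -np.inf              # never recommend the query book itself
    best = np.argsort(-scores)[:k]
    return [str(ids[i]) for i in best]


# Toy data: b1 and b2 point the same way, b3 is orthogonal to both.
emb = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]], dtype=np.float32)
ids = np.array(["b1", "b2", "b3"])
neighbours = top_k_similar(emb, ids, "b1", k=2)
```

Because normalising an already unit-length matrix leaves it unchanged up to rounding, shipping pre-normalised vectors and normalising again on load do not conflict.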
3. Pushing to the Space
Your final Space repo should look like this:
your-space/
├── app.py
├── requirements.txt
├── README.md
├── SETUP.md
├── data/
│   └── books_metadata.parquet   ← drop yours here
└── embeddings/
    ├── two_tower_book_emb.npy   ← drop yours here
    ├── two_tower_book_ids.npy   ← drop yours here
    ├── lightgcn_book_emb.npy
    ├── lightgcn_book_ids.npy
    ├── hgnn_book_emb.npy
    └── hgnn_book_ids.npy
The Hugging Face Hub rejects pushes that contain files larger than 10 MB unless they are tracked with git-lfs, so track the artefacts before committing:
git lfs install
git lfs track "*.npy"
git lfs track "*.parquet"
git add .gitattributes
git add .
git commit -m "Add trained embeddings and metadata"
git push
If you would rather not commit the embeddings to the Space repo, an
alternative is to upload them to your existing HF dataset repo and have
app.py download them on startup using huggingface_hub.hf_hub_download.
Tell me if you want this variant; it is a 6-line change to app.py.
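A rough sketch of that download-on-startup variant is shown below. It assumes the files were uploaded to the dataset repo under the same relative paths as in the tree above. `fetch_artefacts` and `EMBEDDING_FILES` are hypothetical names, and the actual change to app.py may look different. The only library call used is `huggingface_hub.hf_hub_download` with its `repo_id`, `filename` and `repo_type` parameters.

```python
import os

EMBEDDING_FILES = [
    f"embeddings/{model}_book_{kind}.npy"
    for model in ("two_tower", "lightgcn", "hgnn")
    for kind in ("emb", "ids")
]


def fetch_artefacts(repo_id, filenames=EMBEDDING_FILES, download=None):
    """Return {filename: local path}, downloading files missing from the working tree."""
    if download is None:
        # Imported lazily so the function can be exercised without the library.
        from huggingface_hub import hf_hub_download

        def download(name):
            # Returns the path of the file in the local Hub cache.
            return hf_hub_download(repo_id=repo_id, filename=name, repo_type="dataset")

    return {
        name: name if os.path.exists(name) else download(name)
        for name in filenames
    }
```

At startup, app.py could call `fetch_artefacts("DevnilMaster1/Bangla-Book-Recommendation-Dataset")` and pass the returned paths to `np.load`. Files already committed to the Space are used as-is, so the same code works for both layouts.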
4. Approximate sizes
| File | Shape | float32 | float16 |
|---|---|---|---|
| `two_tower_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
| `lightgcn_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
| `hgnn_book_emb.npy` | (127302, 64) | ~32 MB | ~16 MB |
| `books_metadata.parquet` | (127302, 6) | ~30–60 MB | n/a |
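The embedding sizes follow directly from rows × dimensions × bytes per value. The snippet below reproduces that arithmetic in decimal megabytes; `npy_payload_mb` is a hypothetical helper, and the small `.npy` header is ignored.

```python
import numpy as np


def npy_payload_mb(rows, dims, dtype):
    """Raw array size in decimal megabytes, ignoring the small .npy header."""
    return rows * dims * np.dtype(dtype).itemsize / 1e6


N_BOOKS = 127_302  # row count from the table above

sizes = {
    (dims, np.dtype(dt).name): round(npy_payload_mb(N_BOOKS, dims, dt), 1)
    for dims in (256, 64)
    for dt in (np.float32, np.float16)
}
```

Swap in your own catalogue size and embedding width to estimate whether a file will need git-lfs before you train.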
Casting to float16 before saving roughly halves the embedding size with
negligible quality impact for cosine similarity:
np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))
The Space upcasts to float32 on load, so this is safe.
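If you want to check the float16 claim on your own vectors, a quick experiment along the following lines may help. It uses synthetic unit-length vectors as a stand-in for real book embeddings, so the measured error is only indicative; rerun it with your actual array before relying on it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic unit-length vectors standing in for real book embeddings.
emb32 = rng.standard_normal((1000, 64)).astype(np.float32)
emb32 /= np.linalg.norm(emb32, axis=1, keepdims=True)

# Simulate the save/load path: store as float16, upcast back to float32.
emb16 = emb32.astype(np.float16).astype(np.float32)

# Compare cosine scores of one query against every stored vector.
query = emb32[0]
exact = emb32 @ query
approx = emb16 @ query
max_abs_error = float(np.abs(exact - approx).max())
```

A maximum error far below the typical gap between neighbouring similarity scores suggests the ranking is unlikely to change in practice.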
5. Validating locally before pushing
pip install -r requirements.txt
python app.py
Open the URL Gradio prints. The startup log will show [real] or
[synthetic] next to each model; confirm all three say [real]. Pick a
few books you actually know from the catalogue and sanity-check the
recommendations: anything in the same author/series/category neighbourhood
is a good sign. Once it looks right locally, push to the Space.
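Before launching the app, a quick structural check of each embedding/ID pair can catch mismatches that the eye test would miss. The function below is a sketch under the file layout described in this document; `check_embedding_pair` is a hypothetical helper and is not part of app.py.

```python
import numpy as np


def check_embedding_pair(emb, ids, metadata_ids):
    """Return a list of problems; an empty list means the pair looks consistent."""
    if emb.ndim != 2 or ids.ndim != 1 or ids.shape[0] != emb.shape[0]:
        return [f"shape mismatch: embeddings {emb.shape}, ids {ids.shape}"]
    problems = []
    id_set = set(ids.tolist())
    if len(id_set) != ids.shape[0]:
        problems.append("duplicate book IDs")
    if not np.isfinite(emb).all():
        problems.append("embeddings contain NaN or inf")
    unknown = id_set - set(metadata_ids)
    if unknown:
        problems.append(f"{len(unknown)} book IDs are missing from the metadata table")
    return problems
```

With real files you would load the arrays via `np.load("embeddings/two_tower_book_emb.npy")` and `np.load("embeddings/two_tower_book_ids.npy")`, take `metadata_ids` from the `book_id` column of `pd.read_parquet("data/books_metadata.parquet")`, and repeat for the other two models.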