SETUP — Preparing the Space's Data Files

The Space ships without trained model artefacts. They are too large and too tightly coupled to the training pipeline for me to ship blindly. This file explains what every artefact is, exactly where it goes, and gives a working Python template for producing it from each of the three training notebooks.

The Space needs two kinds of artefact, a metadata table and a pair of embedding files per model, stored under three path patterns:

| Path | Purpose |
|---|---|
| `data/books_metadata.parquet` | Title, author, category, rating, summary for every book |
| `embeddings/{model}_book_emb.npy` | Final L2-normalisable book vector for each model `(N, D) float32` |
| `embeddings/{model}_book_ids.npy` | The `book_id` string for each row of the embedding file `(N,)` |

Why not .pkl? Loading a pickle can execute arbitrary code, which is a real concern for a public Space, and pickles are often slower to load and larger on disk for this kind of data. .npy (numeric arrays) and .parquet (tables) are plain data formats that NumPy and pandas read directly, and both are standard choices for this kind of artefact in the HF ecosystem.
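One caveat is worth checking for yourself: .npy is only pickle-free for numeric and fixed-width string dtypes. The sketch below uses made-up IDs and a temporary directory (both are illustrative assumptions) to show that a string array saved with `dtype=str` loads with `allow_pickle=False`, while an object array does not.

```python
import tempfile
from pathlib import Path

import numpy as np


def loads_without_pickle(path):
    """Return True if the .npy file can be read with pickle disabled."""
    try:
        np.load(path, allow_pickle=False)
        return True
    except ValueError:
        # NumPy raises ValueError for object arrays when allow_pickle=False.
        return False


with tempfile.TemporaryDirectory() as tmp:
    ids = ["b001", "b002", "b003"]  # made-up IDs for illustration

    str_path = Path(tmp) / "ids_str.npy"
    obj_path = Path(tmp) / "ids_obj.npy"
    np.save(str_path, np.array(ids, dtype=str))      # fixed-width Unicode
    np.save(obj_path, np.array(ids, dtype=object))   # pickled Python objects

    str_ok = loads_without_pickle(str_path)
    obj_ok = loads_without_pickle(obj_path)

print(f"dtype=str loads without pickle:    {str_ok}")
print(f"dtype=object loads without pickle: {obj_ok}")
```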

1. `data/books_metadata.parquet`

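A single table with one row per book, keyed by book_id. The build script below writes it with index=False, so book_id is a regular column rather than the pandas index. Required columns: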

| column | dtype | notes |
|---|---|---|
| `book_id` | str | The same string used in your JSON dataset |
| `title` | str | Book title (Bangla or English) |
| `author` | str | Comma-joined if multiple authors |
| `category` | str | Primary category name |
| `rating` | float | Average rating 1–5; NaN allowed |
| `summary` | str | Short description; first ~240 chars are shown |
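A quick way to confirm that a DataFrame matches this schema before writing it out. This is only a sketch: `check_metadata` is a hypothetical helper, not something the Space ships, and the sample rows are made up.

```python
import pandas as pd

REQUIRED_COLUMNS = ["book_id", "title", "author", "category", "rating", "summary"]


def check_metadata(df):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    if df["book_id"].duplicated().any():
        problems.append("duplicate book_id values")
    if not pd.api.types.is_float_dtype(df["rating"]):
        problems.append(f"rating should be float, got {df['rating'].dtype}")
    else:
        rated = df["rating"].dropna()  # NaN ratings are allowed
        if not rated.between(1, 5).all():
            problems.append("rating outside the 1-5 range")
    return problems


sample = pd.DataFrame([
    {"book_id": "1", "title": "Example", "author": "A. Author",
     "category": "Fiction", "rating": 4.5, "summary": "A made-up row."},
    {"book_id": "2", "title": "Another", "author": "",
     "category": "", "rating": float("nan"), "summary": ""},
])
print(check_metadata(sample))  # []
```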

Build it once from the raw JSON files in your HF dataset repo. You almost certainly have these locally already from training; if not, download them from https://huggingface.co/datasets/DevnilMaster1/Bangla-Book-Recommendation-Dataset/tree/main.

# build_metadata.py
# Builds data/books_metadata.parquet from the raw JSON files.
# Writing Parquet needs pyarrow or fastparquet installed alongside pandas.
import json
from pathlib import Path

import pandas as pd

def load_json(path):
    with open(path, encoding="utf-8") as f:
        return json.load(f)

books      = load_json("book.json")
authors    = {str(a["author_id"]):    a for a in load_json("author.json")}
categories = {str(c["category_id"]): c for c in load_json("category.json")}
b2a        = load_json("book_to_author.json")
b2c        = load_json("book_to_category.json")

# book_id → list of author names
authors_per_book = {}
for e in b2a:
    authors_per_book.setdefault(str(e["book_id"]), []).append(
        authors.get(str(e["author_id"]), {}).get("author", "")
    )

# book_id → list of category names (we keep the first as the primary)
categories_per_book = {}
for e in b2c:
    categories_per_book.setdefault(str(e["book_id"]), []).append(
        categories.get(str(e["category_id"]), {}).get("category_name", "")
    )

rows = []
for b in books:
    bid = str(b["book_id"])
    rating = b.get("rating")
    rows.append({
        "book_id":  bid,
        # "or" chains fall through on missing keys, None and empty strings alike
        "title":    b.get("book_name") or b.get("title") or "",
        "author":   ", ".join(filter(None, authors_per_book.get(bid, []))),
        "category": (categories_per_book.get(bid) or [""])[0],
        # None is stored as NaN once pandas builds the float column
        "rating":   float(rating) if rating not in (None, "") else None,
        "summary":  b.get("summary") or b.get("description") or "",
    })

df = pd.DataFrame(rows)

out_path = Path("data/books_metadata.parquet")
out_path.parent.mkdir(parents=True, exist_ok=True)  # to_parquet does not create data/
df.to_parquet(out_path, index=False)
print(f"Wrote {len(df):,} rows to {out_path}")

Adjust the JSON key names (book_name, author, category_name, rating, summary, …) to match what your actual files use. Open one of the JSON files and check — different snapshots have used slightly different field names.
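To see which field names a given snapshot actually uses, something like the following can help. It is a sketch that assumes each JSON file holds a list of objects (as the build script does); `key_counts` and `describe_json` are hypothetical helper names and the demo records are made up.

```python
import json
from collections import Counter


def key_counts(records):
    """Count how often each top-level key appears across a list of dicts."""
    counts = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


def describe_json(path):
    """Print the keys used in a JSON file that holds a list of objects."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for key, n in key_counts(records).most_common():
        print(f"{key:<20} present in {n:,} of {len(records):,} records")


# Made-up records standing in for book.json:
demo = [{"book_id": 1, "book_name": "X"}, {"book_id": 2, "title": "Y"}]
print(dict(key_counts(demo)))
```

Running `describe_json("book.json")` on the real file shows at a glance whether the title lives under book_name or title, and how many records lack a rating or summary.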

2. Per-model embedding files

Each notebook produces one *_book_emb.npy and one *_book_ids.npy. Save them at the end of training, after the model has converged (or after loading the best checkpoint).

The cell below assumes you have:

- the trained model in eval mode
- the data_loader (or whatever object owns entity_maps["book"], the {book_id_str → internal_index} dict that all three notebooks build)
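The templates below invert that dict and index it with 0..N-1, which only works if the internal indices are unique and contiguous. A small check along these lines can catch a mismatch early; `invert_entity_map` is a hypothetical helper and the example map is made up.

```python
def invert_entity_map(bid_to_idx):
    """Invert {book_id: index} and verify the indices are exactly 0..N-1."""
    idx_to_bid = {idx: bid for bid, idx in bid_to_idx.items()}
    if len(idx_to_bid) != len(bid_to_idx):
        raise ValueError("two book_ids share the same internal index")
    if sorted(idx_to_bid) != list(range(len(idx_to_bid))):
        raise ValueError("internal indices are not contiguous from 0")
    # Row i of the embedding matrix corresponds to element i of this list.
    return [idx_to_bid[i] for i in range(len(idx_to_bid))]


# Made-up map for illustration:
print(invert_entity_map({"b10": 1, "b7": 0, "b42": 2}))  # ['b7', 'b10', 'b42']
```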

2a. Two-Tower

import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to your actual item-tower forward signature.
    # The point is to get the L2-normalised final book embedding for every book.
    book_emb = model.compute_all_item_embeddings()   # (num_books, 256) tensor
    book_emb = F.normalize(book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width Unicode array; dtype=object would make np.save
# fall back to pickle, and np.load would then need allow_pickle=True.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create the directory
np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
np.save("embeddings/two_tower_book_ids.npy", book_ids)
print(f"Two-Tower: saved {book_emb_np.shape}")

2b. LightGCN

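import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call to whatever your LightGCN class exposes for the propagated embeddings.
    final_user_emb, final_book_emb = model.get_final_embeddings()  # (Nu, 256), (Nb, 256)
    book_emb = F.normalize(final_book_emb, dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width Unicode array; dtype=object would make np.save
# fall back to pickle, and np.load would then need allow_pickle=True.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create the directory
np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
np.save("embeddings/lightgcn_book_ids.npy", book_ids)
print(f"LightGCN: saved {book_emb_np.shape}")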

2c. HGNN

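import os

import numpy as np
import torch
import torch.nn.functional as F

model.eval()
with torch.no_grad():
    # Adapt the call and its arguments to your HGNN's actual forward signature.
    h = model.get_all_embeddings(graph, features)   # dict {ntype: (N_ntype, 64)}
    book_emb = F.normalize(h["book"], dim=-1)

book_emb_np = book_emb.cpu().numpy().astype(np.float32)

# Row i of the matrix belongs to the book whose internal index is i.
idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
# dtype=str gives a fixed-width Unicode array; dtype=object would make np.save
# fall back to pickle, and np.load would then need allow_pickle=True.
book_ids   = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

os.makedirs("embeddings", exist_ok=True)   # np.save does not create the directory
np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
np.save("embeddings/hgnn_book_ids.npy", book_ids)
print(f"HGNN: saved {book_emb_np.shape}")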

The Space does re-normalise on load, so the L2-normalisation step above is technically optional. It is kept so that the .npy you ship represents exactly the vectors used in your evaluation tables.
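To illustrate why a second normalisation is harmless, here is a NumPy sketch of row-wise L2 normalisation followed by a cosine top-k lookup. It is not the code in app.py; the function names are hypothetical and the embeddings are random stand-ins.

```python
import numpy as np


def l2_normalise(emb, eps=1e-12):
    """Scale each row to unit length; eps guards against all-zero rows."""
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, eps)


def top_k_similar(emb_unit, query_idx, k=5):
    """Indices of the k rows most cosine-similar to row query_idx."""
    scores = emb_unit @ emb_unit[query_idx]   # dot product equals cosine on unit rows
    scores[query_idx] = -np.inf               # exclude the query book itself
    return np.argsort(-scores)[:k]


rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 64)).astype(np.float32)   # synthetic stand-in

once = l2_normalise(emb)
twice = l2_normalise(once)
print(np.allclose(once, twice, atol=1e-6))   # normalising again changes nothing
print(top_k_similar(once, query_idx=0, k=3))
```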

3. Pushing to the Space

Your final Space repo should look like this:

your-space/
├── app.py
├── requirements.txt
├── README.md
├── SETUP.md
├── data/
│   └── books_metadata.parquet              ← drop yours here
└── embeddings/
    ├── two_tower_book_emb.npy              ← drop yours here
    ├── two_tower_book_ids.npy              ← drop yours here
    ├── lightgcn_book_emb.npy
    ├── lightgcn_book_ids.npy
    ├── hgnn_book_emb.npy
    └── hgnn_book_ids.npy

Hugging Face rejects pushes that contain files larger than 10 MB unless they are tracked with git-lfs, so track the binary artefacts before committing:

git lfs install
git lfs track "*.npy"
git lfs track "*.parquet"
git add .gitattributes
git add .
git commit -m "Add trained embeddings and metadata"
git push

If you would rather not commit the embeddings to the Space repo, an alternative is to upload them to your existing HF dataset repo and have app.py download them on startup using huggingface_hub.hf_hub_download. Tell me if you want this variant — a 6-line change to app.py.
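A rough sketch of what that download step could look like. It assumes the files are uploaded to the dataset repo linked in section 1 under the same relative paths the Space expects; `artefact_paths` and `fetch_missing_artefacts` are hypothetical names, not functions that exist in app.py.

```python
from pathlib import Path

# Assumptions: the repo id is the dataset repo linked in section 1, and the
# files were uploaded there under the same relative paths the Space expects.
DATASET_REPO = "DevnilMaster1/Bangla-Book-Recommendation-Dataset"
MODELS = ["two_tower", "lightgcn", "hgnn"]


def artefact_paths():
    """Relative paths of every file the Space needs."""
    paths = ["data/books_metadata.parquet"]
    for model in MODELS:
        paths.append(f"embeddings/{model}_book_emb.npy")
        paths.append(f"embeddings/{model}_book_ids.npy")
    return paths


def fetch_missing_artefacts(root="."):
    """Download any artefact that is not already present under root."""
    # Imported lazily so artefact_paths() stays usable without the package.
    from huggingface_hub import hf_hub_download

    for rel_path in artefact_paths():
        if (Path(root) / rel_path).exists():
            continue
        hf_hub_download(
            repo_id=DATASET_REPO,
            filename=rel_path,
            repo_type="dataset",
            local_dir=root,   # keeps the repo's folder structure under root
        )
```

Calling `fetch_missing_artefacts()` near the top of app.py, before any file is loaded, would be one way to wire this in; huggingface_hub would also need to be listed in requirements.txt if it is not already.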

4. Approximate sizes

| File | Shape | Size as float32 | Size as float16 |
|---|---|---|---|
| `two_tower_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
| `lightgcn_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
| `hgnn_book_emb.npy` | (127302, 64) | ~32 MB | ~16 MB |
| `books_metadata.parquet` | (127302, 6) | ~30–60 MB | n/a |
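The embedding sizes follow directly from rows × dimensions × bytes per element. A short sketch of that arithmetic, using the shapes from the table (`embedding_megabytes` is a hypothetical helper; 1 MB is taken as 10^6 bytes and the small .npy header is ignored):

```python
import numpy as np


def embedding_megabytes(num_rows, dim, dtype):
    """Raw array size in MB (10**6 bytes), ignoring the small .npy header."""
    return num_rows * dim * np.dtype(dtype).itemsize / 1e6


for name, dim in [("two_tower", 256), ("lightgcn", 256), ("hgnn", 64)]:
    f32 = embedding_megabytes(127_302, dim, np.float32)
    f16 = embedding_megabytes(127_302, dim, np.float16)
    print(f"{name:<10} float32 {f32:6.1f} MB   float16 {f16:6.1f} MB")
```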

Casting to float16 before saving roughly halves the embedding size with negligible quality impact for cosine similarity:

np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))

The Space upcasts to float32 on load, so this is safe.
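If you want to measure the effect on your own vectors rather than take it on trust, a check like this one compares cosine similarities before and after a float16 round trip. The embeddings here are random stand-ins, so the printed numbers say nothing about the real models; substitute your own book_emb_np.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 256)).astype(np.float32)   # synthetic stand-in
emb /= np.linalg.norm(emb, axis=1, keepdims=True)        # unit-length rows

# Round-trip through float16, then upcast and re-normalise.
emb_half = emb.astype(np.float16)
restored = emb_half.astype(np.float32)
restored /= np.linalg.norm(restored, axis=1, keepdims=True)

# Compare the cosine similarity of row 0 against every row.
exact = emb @ emb[0]
approx = restored @ restored[0]
max_abs_diff = float(np.max(np.abs(exact - approx)))

print(f"bytes: {emb.nbytes:,} -> {emb_half.nbytes:,}")
print(f"largest cosine difference: {max_abs_diff:.2e}")
```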

5. Validating locally before pushing

pip install -r requirements.txt
python app.py

Open the URL Gradio prints. The startup log will show [real] or [synthetic] next to each model — confirm all three say [real]. Pick a few books you actually know from the catalogue and sanity-check the recommendations: anything in the same author/series/category neighbourhood is a good sign. Once it looks right locally, push to the Space.
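A cheaper check that can run before launching the app is to verify each embedding/ID pair directly. The sketch below is one possible version; `check_embedding_pair` is a hypothetical helper and the arrays are synthetic. In practice you would load the real files with np.load and pass the book_id column of the metadata table as known_ids.

```python
import numpy as np


def check_embedding_pair(emb, ids, known_ids=None):
    """Return a list of problems found in one model's emb/ids pair."""
    problems = []
    if emb.ndim != 2:
        problems.append(f"embedding must be 2-D, got shape {emb.shape}")
        return problems
    if ids.shape != (emb.shape[0],):
        problems.append(f"{emb.shape[0]} embedding rows but ids shape {ids.shape}")
    if not np.isfinite(emb).all():
        problems.append("embedding contains NaN or inf")
    if len(set(ids.tolist())) != len(ids):
        problems.append("duplicate book_ids")
    if known_ids is not None:
        unknown = set(ids.tolist()) - set(known_ids)
        if unknown:
            problems.append(f"{len(unknown)} ids missing from the metadata table")
    return problems


# Synthetic example standing in for one model's files.
emb = np.ones((3, 4), dtype=np.float32)
ids = np.array(["1", "2", "3"])
print(check_embedding_pair(emb, ids, known_ids=["1", "2", "3"]))   # []
```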