	# SETUP — Preparing the Space's Data Files

	The Space ships without trained model artefacts. They are too large and
	too tightly coupled to the training pipeline for me to ship blindly. This
	file explains what every artefact is, exactly where it goes, and gives a
	working Python template for producing it from each of the three training
	notebooks.

	The Space needs two kinds of artefact: one metadata table, and a pair of
	embedding files (vectors plus their IDs) for each model.

	| Path | Purpose |
	|------|---------|
	| `data/books_metadata.parquet` | Title, author, category, rating, summary for every book |
	| `embeddings/{model}_book_emb.npy` | Final book vector for each book, shape `(N, D)`, `float32`; L2-normalised by the Space on load |
	| `embeddings/{model}_book_ids.npy` | The `book_id` string for each row of the embedding file, shape `(N,)` |

	> Why not `.pkl`? Unpickling can execute arbitrary code, which is a real
	> concern for a public Space, and pickle files are typically slower to load
	> and larger on disk than the alternatives. `.npy` (numeric arrays) and
	> `.parquet` (tables) are standard choices for this kind of artefact in the
	> HF ecosystem, and NumPy and pandas load them quickly.
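
To make the pickle point concrete, here is a minimal sketch (the file names are illustrative, not the Space's). It shows that a fixed-width Unicode array of IDs round-trips through `np.save` and `np.load` with `allow_pickle=False`, while an object-dtype array is pickled under the hood and is refused by the same safe load.

```python
import os
import tempfile

import numpy as np

ids = ["b001", "b002", "b003"]  # illustrative book_id strings

with tempfile.TemporaryDirectory() as tmp:
    safe_path = os.path.join(tmp, "ids_unicode.npy")
    pickled_path = os.path.join(tmp, "ids_object.npy")

    np.save(safe_path, np.array(ids, dtype=str))        # plain Unicode array, no pickle involved
    np.save(pickled_path, np.array(ids, dtype=object))  # object array, stored via pickle

    loaded = np.load(safe_path, allow_pickle=False)     # loads without unpickling anything

    try:
        np.load(pickled_path, allow_pickle=False)
        object_load_refused = False
    except ValueError:
        object_load_refused = True                      # NumPy refuses to unpickle

print(loaded.dtype.kind, object_load_refused)
```

In practice this means ID arrays are best saved with a string dtype rather than `dtype=object` if the loader is to stay pickle-free.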

	---

	## 1. `data/books_metadata.parquet`

	A single table keyed by `book_id`, which is stored as a regular column rather
	than as the DataFrame index. Required columns:

	| column | dtype | notes |
	|--------|-------|-------|
	| `book_id` | str | The same string used in your JSON dataset |
	| `title` | str | Book title (Bangla or English) |
	| `author` | str | Comma-joined if multiple authors |
	| `category` | str | Primary category name |
	| `rating` | float | Average rating 1–5; NaN allowed |
	| `summary` | str | Short description; first ~240 chars are shown |

	Build it once from the raw JSON files in your HF dataset repo. You almost
	certainly have these locally already from training; if not, download them from
	<https://huggingface.co/datasets/DevnilMaster1/Bangla-Book-Recommendation-Dataset/tree/main>.

	```python
	# build_metadata.py
	import json
	from pathlib import Path

	import pandas as pd

	def load_json(path):
	    with open(path, encoding="utf-8") as f:
	        return json.load(f)

	books = load_json("book.json")
	authors = {str(a["author_id"]): a for a in load_json("author.json")}
	categories = {str(c["category_id"]): c for c in load_json("category.json")}
	b2a = load_json("book_to_author.json")
	b2c = load_json("book_to_category.json")

	# book_id → list of author names
	authors_per_book = {}
	for e in b2a:
	    authors_per_book.setdefault(str(e["book_id"]), []).append(
	        authors.get(str(e["author_id"]), {}).get("author", "")
	    )

	# book_id → list of category names (we keep the first as the primary)
	categories_per_book = {}
	for e in b2c:
	    categories_per_book.setdefault(str(e["book_id"]), []).append(
	        categories.get(str(e["category_id"]), {}).get("category_name", "")
	    )

	rows = []
	for b in books:
	    bid = str(b["book_id"])
	    rows.append({
	        "book_id": bid,
	        "title": b.get("book_name") or b.get("title") or "",
	        "author": ", ".join(filter(None, authors_per_book.get(bid, []))),
	        "category": (categories_per_book.get(bid) or [""])[0],
	        # Missing ratings become NaN so the column stays float
	        "rating": float(b["rating"]) if b.get("rating") not in (None, "") else float("nan"),
	        "summary": b.get("summary") or b.get("description") or "",
	    })

	out_path = Path("data/books_metadata.parquet")
	out_path.parent.mkdir(parents=True, exist_ok=True)  # to_parquet does not create directories

	df = pd.DataFrame(rows)
	df.to_parquet(out_path, index=False)
	print(f"Wrote {len(df):,} rows to {out_path}")
	```

	> Adjust the JSON key names (`book_name`, `author`, `category_name`,
	> `rating`, `summary`, …) to match what your actual files use. Open one of
	> the JSON files and check — different snapshots have used slightly different
	> field names.
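
A quick way to do that check is to count which keys actually occur in a file. The sketch below assumes each file is a JSON array of objects, as the build script above does; `key_counts` and `report` are helper names introduced here for illustration.

```python
import json
from collections import Counter

def key_counts(records):
    """Count how often each key appears across a list of JSON objects."""
    counts = Counter()
    for rec in records:
        counts.update(rec.keys())
    return counts

def report(path):
    """Print every key found in the file and how many records carry it."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    for key, n in key_counts(records).most_common():
        print(f"{key:<20} present in {n}/{len(records)} records")

# Example (run next to your JSON files):
# report("book.json")
```

Keys that appear in only some records are the ones most likely to need a fallback in the build script.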

	---

	## 2. Per-model embedding files

	Each notebook produces one `{model}_book_emb.npy` and one
	`{model}_book_ids.npy`. Save them at the end of training, after the model
	has converged (or after loading the best checkpoint).

	Each cell below assumes you have:

	- the trained `model` in eval mode
	- the `data_loader` (or whatever object owns `entity_maps["book"]`, the
	  `{book_id_str → internal_index}` dict that all three notebooks build)
	- for HGNN only, the `graph` and `features` objects the model was trained on
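
The ID file is only correct if row `i` of the embedding matrix really belongs to the book with internal index `i`. The sketch below inverts a toy stand-in for `entity_maps["book"]` and fails loudly if the indices are not exactly `0..N-1`; `book_ids_in_row_order` is a hypothetical helper, not something the notebooks define.

```python
import numpy as np

def book_ids_in_row_order(book_map):
    """Invert {book_id_str: internal_index} so that position i holds the ID of row i."""
    n = len(book_map)
    if sorted(book_map.values()) != list(range(n)):
        raise ValueError("internal indices must be exactly 0..N-1 with no gaps or duplicates")
    idx_to_bid = {idx: bid for bid, idx in book_map.items()}
    return np.array([idx_to_bid[i] for i in range(n)], dtype=str)

# Toy stand-in for data_loader.entity_maps["book"]
toy_map = {"b-17": 2, "b-03": 0, "b-42": 1}
print(book_ids_in_row_order(toy_map))
```

Running the same check on the real map before saving catches off-by-one or padding-index surprises early.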

	### 2a. Two-Tower

	```python
	import os

	import numpy as np
	import torch
	import torch.nn.functional as F

	model.eval()
	with torch.no_grad():
	    # Adapt the call to your actual item-tower forward signature.
	    # The point is to get the L2-normalised final book embedding for every book.
	    book_emb = model.compute_all_item_embeddings()  # (num_books, 256) tensor
	    book_emb = F.normalize(book_emb, dim=-1)

	book_emb_np = book_emb.cpu().numpy().astype(np.float32)

	# Row i of the matrix belongs to the book whose internal index is i.
	idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
	# dtype=str gives a fixed-width Unicode array, so np.save never falls back to pickle.
	book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

	os.makedirs("embeddings", exist_ok=True)
	np.save("embeddings/two_tower_book_emb.npy", book_emb_np)
	np.save("embeddings/two_tower_book_ids.npy", book_ids)
	print(f"Two-Tower: saved {book_emb_np.shape}")
	```

	### 2b. LightGCN

	```python
	import os

	import numpy as np
	import torch
	import torch.nn.functional as F

	model.eval()
	with torch.no_grad():
	    # Adapt the method name to whatever your model exposes for the propagated embeddings.
	    final_user_emb, final_book_emb = model.get_final_embeddings()  # (Nu, 256), (Nb, 256)
	    book_emb = F.normalize(final_book_emb, dim=-1)

	book_emb_np = book_emb.cpu().numpy().astype(np.float32)

	# Row i of the matrix belongs to the book whose internal index is i.
	idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
	# dtype=str gives a fixed-width Unicode array, so np.save never falls back to pickle.
	book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

	os.makedirs("embeddings", exist_ok=True)
	np.save("embeddings/lightgcn_book_emb.npy", book_emb_np)
	np.save("embeddings/lightgcn_book_ids.npy", book_ids)
	print(f"LightGCN: saved {book_emb_np.shape}")
	```

	### 2c. HGNN

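	```python
	import os

	import numpy as np
	import torch
	import torch.nn.functional as F

	model.eval()
	with torch.no_grad():
	    # Adapt the call to your model; `graph` and `features` are the training-time inputs.
	    h = model.get_all_embeddings(graph, features)  # dict {ntype: (N_ntype, 64)}
	    book_emb = F.normalize(h["book"], dim=-1)

	book_emb_np = book_emb.cpu().numpy().astype(np.float32)

	# Row i of the matrix belongs to the book whose internal index is i.
	idx_to_bid = {v: k for k, v in data_loader.entity_maps["book"].items()}
	# dtype=str gives a fixed-width Unicode array, so np.save never falls back to pickle.
	book_ids = np.array([idx_to_bid[i] for i in range(book_emb_np.shape[0])], dtype=str)

	os.makedirs("embeddings", exist_ok=True)
	np.save("embeddings/hgnn_book_emb.npy", book_emb_np)
	np.save("embeddings/hgnn_book_ids.npy", book_ids)
	print(f"HGNN: saved {book_emb_np.shape}")
	```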

	> The Space does re-normalise on load, so the L2-normalisation step above
	> is technically optional. It is kept so that the `.npy` you ship represents
	> exactly the vectors used in your evaluation tables.
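
Normalising twice is harmless because L2-normalisation is idempotent. The NumPy sketch below illustrates that, along with the kind of cosine top-k lookup such embeddings are typically used for. It is an assumption about the general technique, not a copy of what `app.py` does; `l2_normalise` and `top_k_similar` are names introduced here.

```python
import numpy as np

def l2_normalise(emb, eps=1e-12):
    """Scale every row to unit length (rows with near-zero norm are left almost unchanged)."""
    emb = np.asarray(emb, dtype=np.float32)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    return emb / np.maximum(norms, eps)

def top_k_similar(emb, query_row, k=5):
    """Indices of the k rows most similar to query_row by cosine similarity, excluding itself."""
    unit = l2_normalise(emb)
    scores = unit @ unit[query_row]   # dot product of unit vectors = cosine similarity
    scores[query_row] = -np.inf       # never recommend the query book itself
    return np.argsort(-scores)[:k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 16)).astype(np.float32)  # random stand-in for real embeddings

once = l2_normalise(emb)
twice = l2_normalise(once)
print("normalising twice changes nothing:", np.allclose(once, twice, atol=1e-6))
print("neighbours of row 0:", top_k_similar(emb, 0, k=3))
```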

	---

	## 3. Pushing to the Space

	Your final Space repo should look like this:

	```
	your-space/
	├── app.py
	├── requirements.txt
	├── README.md
	├── SETUP.md
	├── data/
	│   └── books_metadata.parquet    ← drop yours here
	└── embeddings/
	    ├── two_tower_book_emb.npy    ← drop yours here
	    ├── two_tower_book_ids.npy    ← drop yours here
	    ├── lightgcn_book_emb.npy
	    ├── lightgcn_book_ids.npy
	    ├── hgnn_book_emb.npy
	    └── hgnn_book_ids.npy
	```

	The Hub rejects pushes that contain regular Git files larger than 10 MB, so
	the embedding and metadata files must be tracked with `git-lfs`:

	```bash
	git lfs install
	git lfs track "*.npy"
	git lfs track "*.parquet"
	git add .gitattributes
	git add .
	git commit -m "Add trained embeddings and metadata"
	git push
	```

	If you would rather not commit the embeddings to the Space repo, an
	alternative is to upload them to your existing HF dataset repo and have
	`app.py` download them on startup using `huggingface_hub.hf_hub_download`.
	Tell me if you want this variant — a 6-line change to `app.py`.
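
For orientation, the download-on-startup idea could look roughly like the sketch below. It is not the actual `app.py` change: `fetch_artefacts` and its `download` parameter are hypothetical, and it assumes the files were uploaded to the dataset repo under the same relative paths as in the tree above. `hf_hub_download` returns the path of the locally cached copy.

```python
REPO_ID = "DevnilMaster1/Bangla-Book-Recommendation-Dataset"  # dataset repo linked in section 1

# Assumed layout inside the dataset repo (mirrors the Space layout).
ARTEFACTS = [
    "data/books_metadata.parquet",
    "embeddings/two_tower_book_emb.npy",
    "embeddings/two_tower_book_ids.npy",
    "embeddings/lightgcn_book_emb.npy",
    "embeddings/lightgcn_book_ids.npy",
    "embeddings/hgnn_book_emb.npy",
    "embeddings/hgnn_book_ids.npy",
]

def fetch_artefacts(filenames=ARTEFACTS, repo_id=REPO_ID, download=None):
    """Return {filename: local path}. `download` can be swapped out for offline testing."""
    if download is None:
        # Imported lazily so the module still imports where huggingface_hub is absent.
        from huggingface_hub import hf_hub_download
        download = hf_hub_download
    return {
        name: download(repo_id=repo_id, filename=name, repo_type="dataset")
        for name in filenames
    }

# In app.py, before loading anything (needs network access on first start):
# paths = fetch_artefacts()
# emb = np.load(paths["embeddings/two_tower_book_emb.npy"])
```

The trade-off is a slower cold start in exchange for a smaller Space repo.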

	---

	## 4. Approximate sizes

	| File | Shape | Size as float32 | Size as float16 |
	|------|-------|-----------------|-----------------|
	| `two_tower_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
	| `lightgcn_book_emb.npy` | (127302, 256) | ~130 MB | ~65 MB |
	| `hgnn_book_emb.npy` | (127302, 64) | ~32 MB | ~16 MB |
	| `books_metadata.parquet` | (127302, 6) | ~30–60 MB | n/a |
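
The embedding sizes follow directly from rows × dimensions × bytes per value. A small sketch of that arithmetic, using the shapes from the table (`npy_megabytes` is a helper name introduced here for illustration):

```python
def npy_megabytes(rows, dims, bytes_per_value):
    """Approximate array payload in MB (10**6 bytes); the .npy header adds a negligible amount."""
    return rows * dims * bytes_per_value / 1e6

N_BOOKS = 127_302
for name, dims in [("two_tower", 256), ("lightgcn", 256), ("hgnn", 64)]:
    f32 = npy_megabytes(N_BOOKS, dims, 4)  # float32 = 4 bytes per value
    f16 = npy_megabytes(N_BOOKS, dims, 2)  # float16 = 2 bytes per value
    print(f"{name:<10} float32 ~ {f32:6.1f} MB   float16 ~ {f16:6.1f} MB")
```

The Parquet size cannot be derived this way because it depends on text length and compression.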

	Casting to `float16` before saving roughly halves the embedding size with
	negligible quality impact for cosine similarity:

	```python
	np.save("embeddings/two_tower_book_emb.npy", book_emb_np.astype(np.float16))
	```

	The Space upcasts to float32 on load, so this is safe.
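
If you want to check the "negligible impact" claim rather than take it on trust, a sketch like the one below compares cosine similarities before and after a float16 round trip. It uses random unit vectors as a stand-in, so treat the numbers as indicative only and rerun it on your real `book_emb_np`.

```python
import numpy as np

rng = np.random.default_rng(0)
emb32 = rng.normal(size=(2000, 256)).astype(np.float32)   # stand-in for real embeddings
emb32 /= np.linalg.norm(emb32, axis=1, keepdims=True)

# Simulate save-as-float16, then upcast and re-normalise on load.
emb16 = emb32.astype(np.float16).astype(np.float32)
emb16 /= np.linalg.norm(emb16, axis=1, keepdims=True)

# Cosine similarity of the first 50 books against the whole catalogue.
sims32 = emb32[:50] @ emb32.T
sims16 = emb16[:50] @ emb16.T

max_abs_diff = float(np.abs(sims32 - sims16).max())

# Column 0 of the ranking is the query itself, so column 1 is the top recommendation.
top1_32 = np.argsort(-sims32, axis=1)[:, 1]
top1_16 = np.argsort(-sims16, axis=1)[:, 1]
top1_agreement = float((top1_32 == top1_16).mean())

print(f"max |cos32 - cos16| = {max_abs_diff:.2e}")
print(f"top-1 agreement     = {top1_agreement:.0%}")
```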

	---

	## 5. Validating locally before pushing

	```bash
	pip install -r requirements.txt
	python app.py
	```

	Open the URL Gradio prints. The startup log will show `[real]` or
	`[synthetic]` next to each model — confirm all three say `[real]`. Pick a
	few books you actually know from the catalogue and sanity-check the
	recommendations: anything in the same author/series/category neighbourhood
	is a good sign. Once it looks right locally, push to the Space.
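
Before launching the app it can also help to check that each embedding file, its ID file, and the metadata table agree with one another. A minimal sketch, assuming the file layout described above; `check_model_artefacts` is a hypothetical helper and not part of the Space:

```python
import numpy as np

def check_model_artefacts(emb, ids, known_ids):
    """Return a list of problems found; an empty list means the pair looks consistent."""
    problems = []
    if emb.ndim != 2:
        problems.append(f"embedding array should be 2-D, got shape {emb.shape}")
    elif emb.shape[0] != ids.shape[0]:
        problems.append(f"{emb.shape[0]} embedding rows but {ids.shape[0]} ids")
    if not np.isfinite(emb).all():
        problems.append("embedding array contains NaN or inf")
    id_list = ids.tolist()
    if len(set(id_list)) != len(id_list):
        problems.append("duplicate book ids")
    missing = set(id_list) - set(known_ids)
    if missing:
        problems.append(f"{len(missing)} ids not present in the metadata table")
    return problems

# Usage against the real files (pass allow_pickle=True to np.load only if the
# ids were saved as an object array):
# import pandas as pd
# known = pd.read_parquet("data/books_metadata.parquet")["book_id"]
# for model in ("two_tower", "lightgcn", "hgnn"):
#     emb = np.load(f"embeddings/{model}_book_emb.npy")
#     ids = np.load(f"embeddings/{model}_book_ids.npy")
#     print(model, check_model_artefacts(emb, ids, known) or "ok")
```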