Build Guide: From Zero to Production
This guide explains how to build the entire project from scratch.
Quick Start (Already Built)
# 1. Create environment
conda env create -f environment.yml
conda activate book-rec
# 2. Validate data (check what's ready)
make data-validate
# 3. Start backend
make run # http://localhost:6006
# 4. Start frontend
cd web && npm install && npm run dev # http://localhost:5173
New Pipeline Commands
make data-pipeline # Run full pipeline (data + models)
make data-prep # Data processing only (no GPU training)
make data-validate # Check data quality
make train-models # Train ML models only
Full Build Pipeline
Overview
Raw Data (CSV)
  │
  ├── [1] Data Processing
  │     ├── books_data.csv → books_processed.csv
  │     ├── Books_rating.csv → rec/train,val,test.csv
  │     └── Reviews → review_chunks
  │
  ├── [2] Index Building
  │     ├── ChromaDB (Vector Index)
  │     └── BM25 (Sparse Index)
  │
  ├── [3] Model Training
  │     ├── ItemCF / UserCF / Swing (CPU)
  │     ├── YoutubeDNN (GPU)
  │     ├── SASRec (GPU)
  │     └── LGBMRanker (CPU)
  │
  └── [4] Service Startup
        └── FastAPI + React
Phase 1: Environment Setup
# Clone repo
git clone <repo-url>
cd book-rec-with-LLMs
# Create conda environment
conda env create -f environment.yml
conda activate book-rec
# Install frontend dependencies
cd web && npm install && cd ..
Phase 2: Data Preparation
2.1 Raw Data Requirements
Place in data/raw/:
- books_data.csv - Book metadata (title, author, description, categories)
- Books_rating.csv - User ratings (User_id, Id, review/score, review/time, review/text)
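Before running any stage, it can save time to confirm the raw CSVs expose the columns the pipeline expects. The sketch below is a generic header check, not project code; the required column names are taken from this guide.

```python
# Minimal sketch: check a raw CSV header against the columns the
# pipeline expects. Column names come from this guide; the helper
# itself is illustrative, not part of the project's scripts.
import csv
import io

REQUIRED = {
    "books_data.csv": {"title", "author", "description", "categories"},
    "Books_rating.csv": {"User_id", "Id", "review/score", "review/time", "review/text"},
}

def missing_columns(header, required):
    """Return the required columns absent from a CSV header row."""
    return sorted(required - set(header))

# Usage with an in-memory sample standing in for data/raw/Books_rating.csv
sample = io.StringIO("User_id,Id,review/score,review/time\nu1,b1,5,123\n")
header = next(csv.reader(sample))
print(missing_columns(header, REQUIRED["Books_rating.csv"]))
# ['review/text'] -> this file would fail validation
```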
2.2 Pipeline DAG (Execution Order)
Recommended: Use make data-pipeline or python scripts/run_pipeline.py; it defines the full DAG.
| Stage | Script | Purpose | Output |
|---|---|---|---|
| 1 | build_books_basic_info.py | Merge raw books + ratings | books_basic_info.csv |
| 2 | (download or manual merge) | From HuggingFace, or manual merge of basic_info + review_highlights | books_processed.csv |
| 3 | clean_data.py | HTML/encoding/whitespace cleanup | books_processed.csv (cleaned) |
| 4 | generate_emotions.py | Sentiment analysis (5 emotions) | +joy, sadness, fear, anger, surprise |
| 5 | generate_tags.py | TF-IDF keyword extraction | +tags column |
| 6 | chunk_reviews.py | Reviews → sentences | review_chunks.jsonl |
| 7 | split_rec_data.py | Leave-Last-Out time split | rec/train,val,test.csv |
| 8 | build_sequences.py | User history → sequences | rec/user_sequences.pkl |
Note: books_processed.csv may be pre-downloaded from HuggingFace. If building from scratch, merge books_basic_info.csv with review data and run extract_review_sentences.py first.
2.3 Script Details
Data Cleaning (clean_data.py)
- HTML: Remove tags, decode entities (&amp; → &)
- Encoding: Fix mojibake (UTF-8 corruption)
- Unicode: NFKC normalization
- Whitespace: Collapse multiple spaces/newlines
- URLs: Remove from text
Data Split (split_rec_data.py)
- Strategy: Leave-Last-Out (temporal split)
- Filter: Users with β₯3 interactions
- Output: train (oldest) → val (2nd last) → test (last)
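The split strategy above can be sketched as follows: per user, the last interaction by time goes to test, the second-to-last to val, and the rest to train, with users under 3 interactions dropped. Names are illustrative, not the split_rec_data.py interface.

```python
# Sketch of a Leave-Last-Out temporal split as described above.
from collections import defaultdict

def leave_last_out(interactions):
    """interactions: list of (user_id, item_id, timestamp) tuples."""
    by_user = defaultdict(list)
    for u, i, t in interactions:
        by_user[u].append((t, i))
    train, val, test = [], [], []
    for u, events in by_user.items():
        if len(events) < 3:              # filter: keep users with >=3 interactions
            continue
        events.sort()                    # chronological order
        train += [(u, i) for _, i in events[:-2]]
        val.append((u, events[-2][1]))   # second-to-last -> validation
        test.append((u, events[-1][1]))  # last -> test
    return train, val, test

logs = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u2", "x", 1)]
print(leave_last_out(logs))  # u2 is dropped (<3 interactions)
```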
Sequence Building (build_sequences.py)
- Format: Dict[user_id, List[item_id]]
- Padding: 0 reserved, IDs are 1-indexed
- Max length: 50 items (truncated)
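A minimal sketch of the format above: per-user chronological item lists with 1-indexed IDs (0 reserved for padding), truncated to the most recent 50 items. build_sequences.py is the actual script; the names here are illustrative.

```python
# Sketch of sequence building per the format above: Dict[user_id,
# List[item_id]], IDs 1-indexed (0 = padding), last MAX_LEN kept.
MAX_LEN = 50

def build_sequences(train_rows, item_map):
    """train_rows: (user_id, item_id) pairs in time order;
    item_map: raw item id -> 1-based integer id."""
    seqs = {}
    for user, item in train_rows:
        seqs.setdefault(user, []).append(item_map[item])
    # truncate: keep only the most recent MAX_LEN items per user
    return {u: s[-MAX_LEN:] for u, s in seqs.items()}

item_map = {"isbn_a": 1, "isbn_b": 2}
print(build_sequences([("u1", "isbn_a"), ("u1", "isbn_b")], item_map))
# -> {'u1': [1, 2]}
```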
# Run via unified pipeline
python scripts/run_pipeline.py --stage books
# Or manually
python scripts/data/clean_data.py --backup
python scripts/data/split_rec_data.py
python scripts/data/build_sequences.py
Script conventions: Use config.data_config for paths; scripts.utils.setup_script_logger() for logging.
Phase 3: Index Building
3.1 Vector Database (ChromaDB)
python scripts/data/init_dual_index.py
Output: data/chroma_db/ (222K book vectors)
3.2 Review Chunks Index (Small-to-Big)
python scripts/data/extract_review_sentences.py
Output: data/chroma_chunks/ (788K sentence vectors)
Phase 4: Model Training
4.1 Recall Models (CPU OK)
# Build ItemCF / UserCF / Swing / Popularity
python scripts/model/build_recall_models.py
Output: data/model/recall/itemcf.pkl, usercf.pkl, swing.pkl, popularity.pkl
Training Time (Apple Silicon CPU):
| Model | Time |
|---|---|
| ItemCF (direction-weighted) | ~2 min |
| UserCF | ~7 sec |
| Swing (optimized) | ~35 sec |
| Popularity | <1 sec |
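For intuition about what these CPU models compute, here is a minimal ItemCF co-occurrence sketch with cosine-style normalization. The project's direction-weighted variant adds ordering weights not shown here, and `itemcf` is an illustrative name, not the build_recall_models.py API.

```python
# Minimal ItemCF sketch: item-item similarity from co-occurrence in
# user histories, normalized by sqrt of item popularities. The
# project's direction-weighted version refines this.
import math
from collections import defaultdict

def itemcf(user_items):
    """user_items: dict of user -> list of items. Returns item-item similarity."""
    co = defaultdict(lambda: defaultdict(float))
    count = defaultdict(int)
    for items in user_items.values():
        for i in items:
            count[i] += 1                     # item popularity
        for a in items:
            for b in items:
                if a != b:
                    co[a][b] += 1.0           # co-occurrence in one history
    return {
        a: {b: v / math.sqrt(count[a] * count[b]) for b, v in row.items()}
        for a, row in co.items()
    }

sim = itemcf({"u1": ["a", "b"], "u2": ["a", "b", "c"]})
print(round(sim["a"]["b"], 3))  # 2 / sqrt(2*2) = 1.0
```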
4.2 YoutubeDNN (GPU Recommended)
# Train two-tower model
python scripts/model/train_youtube_dnn.py
Output: data/model/recall/youtube_dnn.pt
Training: ~50 epochs, batch size 2048, ~30 min on GPU
4.3 SASRec (GPU Recommended)
# Train sequence model
python scripts/model/train_sasrec.py
Output: data/model/recall/sasrec.pt
Training: ~30 epochs, ~20 min on GPU
4.4 LGBMRanker (LambdaRank)
# Train ranking model (hard negative sampling from recall results)
python scripts/model/train_ranker.py
Output: data/model/ranking/lgbm_ranker.txt
Training: ~16 min on CPU (20K users sampled, 4× hard negatives, 17 features)
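The hard-negative idea noted above can be sketched as follows: recalled candidates the user never interacted with are treated as hard negatives, sampled at a fixed ratio to positives (4× per this guide). Function and parameter names are illustrative, not the train_ranker.py interface.

```python
# Sketch of hard-negative sampling from recall results: items the
# recall stage surfaced but the user did not interact with are "hard"
# negatives for the ranker. Illustrative names, not project APIs.
import random

def sample_hard_negatives(positives, recalled, ratio=4, seed=0):
    """positives: set of interacted item ids; recalled: recall candidates."""
    rng = random.Random(seed)
    pool = [i for i in recalled if i not in positives]  # recalled but not clicked
    k = min(len(pool), ratio * len(positives))          # cap at ratio x positives
    return rng.sample(pool, k)

pos = {"b1"}
recall = ["b1", "b2", "b3", "b4", "b5", "b6"]
print(len(sample_hard_negatives(pos, recall)))  # 4 negatives for 1 positive
```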
Phase 5: Service Startup
Backend
make run
# or
uvicorn src.main:app --reload --port 6006
Startup Log:
Loading embedding model... # ~20s
Loaded 222003 documents # ~10s
BM25 Index built with 222005 docs # ~12s
Engines Initialized. # Ready!
Frontend
cd web
npm run dev
Access:
- Frontend: http://localhost:5173
- API Docs: http://localhost:6006/docs
Data Flow Summary
data/
├── raw/
│   ├── books_data.csv        # Original book metadata
│   └── Books_rating.csv      # Original ratings
├── books_basic_info.csv      # Processed book info
├── books_processed.csv       # Full processed data
├── chroma_db/                # Vector index (222K)
├── chroma_chunks/            # Review chunks (788K)
├── rec/
│   ├── train.csv             # 1.08M training records
│   ├── val.csv               # 168K validation
│   ├── test.csv              # 168K test
│   ├── user_sequences.pkl    # User history
│   └── item_map.pkl          # ISBN ↔ ID mapping
├── model/
│   ├── recall/
│   │   ├── itemcf.pkl        # ItemCF matrix (direction-weighted)
│   │   ├── usercf.pkl        # UserCF matrix
│   │   ├── swing.pkl         # Swing matrix
│   │   ├── popularity.pkl    # Popularity scores
│   │   ├── youtube_dnn.pt    # Two-tower model
│   │   └── sasrec.pt         # Sequence model
│   └── ranking/
│       └── lgbm_ranker.txt   # LGBMRanker (LambdaRank)
└── user_profiles.json        # User favorites
Training on GPU Server
If your local machine is slow, use AutoDL or another cloud GPU server:
# Sync to server
rsync -avz . user@server:/path/to/project
# On server
python scripts/model/train_youtube_dnn.py
python scripts/model/train_sasrec.py
# Sync back
rsync -avz user@server:/path/to/project/data/model ./data/
Minimal Local Run (Without Training)
If you only have raw data but no trained models:
- ItemCF/UserCF/Swing will work (CPU-trained on-demand)
- YoutubeDNN will be skipped (graceful degradation)
- SASRec features will be 0.0
- LGBMRanker needs to be trained or use recall-score fallback
The system will still run and remain functional, with reduced accuracy.
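One way to picture this graceful degradation: each model artifact is loaded only if its file exists, and a missing artifact disables that channel rather than crashing startup. The loader below is an assumed sketch of the pattern, not the project's actual service code; paths mirror the layout in this guide.

```python
# Sketch of optional model loading with graceful degradation: missing
# artifacts are registered as None so the engine can skip that channel.
# Illustrative pattern only, not the project's startup code.
from pathlib import Path
import pickle

MODEL_DIR = Path("data/model/recall")

def load_optional(name):
    """Load a pickled model if its file exists, else return None."""
    path = MODEL_DIR / name
    if not path.exists():
        print(f"[warn] {name} missing; channel disabled")
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

models = {m: load_optional(m) for m in ["itemcf.pkl", "usercf.pkl"]}
active = [m for m, obj in models.items() if obj is not None]
print(active)  # empty until the recall models are trained
```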
Last Updated: January 2026