
Build Guide: From Zero to Production

This guide explains how to build the entire project from scratch.


Quick Start (Already Built)

# 1. Create environment
conda env create -f environment.yml
conda activate book-rec

# 2. Validate data (check what's ready)
make data-validate

# 3. Start backend
make run  # http://localhost:6006

# 4. Start frontend
cd web && npm install && npm run dev  # http://localhost:5173

New Pipeline Commands

make data-pipeline   # Run full pipeline (data + models)
make data-prep       # Data processing only (no GPU training)
make data-validate   # Check data quality
make train-models    # Train ML models only

Full Build Pipeline

Overview

Raw Data (CSV)
     │
     ├── [1] Data Processing ─────────────────────────────┐
     │   ├── books_data.csv → books_processed.csv         │
     │   ├── Books_rating.csv → rec/train,val,test.csv    │
     │   └── Reviews → review_chunks                      │
     │                                                    │
     ├── [2] Index Building ──────────────────────────────┤
     │   ├── ChromaDB (Vector Index)                      │
     │   └── BM25 (Sparse Index)                          │
     │                                                    │
     ├── [3] Model Training ──────────────────────────────┤
     │   ├── ItemCF / UserCF / Swing (CPU)                │
     │   ├── YoutubeDNN (GPU)                             │
     │   ├── SASRec (GPU)                                 │
     │   └── LGBMRanker (CPU)                             │
     │                                                    │
     └── [4] Service Startup ─────────────────────────────┘
         └── FastAPI + React

Phase 1: Environment Setup

# Clone repo
git clone <repo-url>
cd book-rec-with-LLMs

# Create conda environment
conda env create -f environment.yml
conda activate book-rec

# Install frontend dependencies
cd web && npm install && cd ..

Phase 2: Data Preparation

2.1 Raw Data Requirements

Place in data/raw/:

  • books_data.csv - Book metadata (title, author, description, categories)
  • Books_rating.csv - User ratings (User_id, Id, review/score, review/time, review/text)
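
Before running the pipeline, it can help to verify that the raw files expose the expected columns. A minimal sketch: the Books_rating.csv columns come from the list above, while the books_data.csv column names are assumptions based on its description.

```python
import csv

# Expected columns per raw file. Books_rating.csv names are from the list
# above; books_data.csv names are assumptions and may need adjusting.
REQUIRED = {
    "data/raw/books_data.csv": {"Title", "authors", "description", "categories"},
    "data/raw/Books_rating.csv": {"User_id", "Id", "review/score", "review/time", "review/text"},
}

def check_columns(path: str, required: set[str]) -> set[str]:
    """Return the required columns missing from a CSV header row."""
    with open(path, newline="", encoding="utf-8") as f:
        header = set(next(csv.reader(f)))
    return required - header

# Usage:
# for path, cols in REQUIRED.items():
#     missing = check_columns(path, cols)
#     print(path, "OK" if not missing else f"missing: {sorted(missing)}")
```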

2.2 Pipeline DAG (Execution Order)

Recommended: use make data-pipeline or python scripts/run_pipeline.py, which defines the full DAG and runs the stages in order.

Stage  Script                      Purpose                            Output
1      build_books_basic_info.py   Merge raw books + ratings          books_basic_info.csv
2      (HF download or manual      Base processed table               books_processed.csv
       merge; see note below)
3      clean_data.py               HTML/encoding/whitespace cleanup   books_processed.csv (cleaned)
4      generate_emotions.py        Sentiment analysis (5 emotions)    +joy,sadness,fear,anger,surprise
5      generate_tags.py            TF-IDF keyword extraction          +tags column
6      chunk_reviews.py            Reviews → sentences                review_chunks.jsonl
7      split_rec_data.py           Leave-Last-Out time split          rec/train,val,test.csv
8      build_sequences.py          User history → sequences           rec/user_sequences.pkl

Note: books_processed.csv may be pre-downloaded from HuggingFace. If building from scratch, merge books_basic_info.csv with review data and run extract_review_sentences.py first.

2.3 Script Details

Data Cleaning (clean_data.py)

  • HTML: Remove tags, decode entities (&amp; → &)
  • Encoding: Fix mojibake (UTF-8 corruption)
  • Unicode: NFKC normalization
  • Whitespace: Collapse multiple spaces/newlines
  • URLs: Remove from text
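
The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the actual clean_data.py; in particular, the mojibake repair step is omitted here.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")       # HTML tags
URL_RE = re.compile(r"https?://\S+")  # URLs
WS_RE = re.compile(r"\s+")            # runs of whitespace/newlines

def clean_text(text: str) -> str:
    """Apply the cleanup steps listed above (minus mojibake repair)."""
    text = TAG_RE.sub(" ", text)                 # strip HTML tags
    text = html.unescape(text)                   # decode entities (&amp; -> &)
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization
    text = URL_RE.sub("", text)                  # remove URLs
    return WS_RE.sub(" ", text).strip()          # collapse whitespace

# clean_text("<p>Tom &amp; Jerry</p>\n\nsee https://example.com")
# -> "Tom & Jerry see"
```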

Data Split (split_rec_data.py)

  • Strategy: Leave-Last-Out (temporal split)
  • Filter: Users with ≥3 interactions
  • Output: train (oldest) → val (2nd last) → test (last)
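
The split logic above can be sketched as follows. This mirrors the described behavior; the real split_rec_data.py may differ in details such as timestamp tie-breaking.

```python
from collections import defaultdict

def leave_last_out(interactions):
    """Leave-Last-Out split sketch.

    `interactions` is a list of (user, item, timestamp) tuples. Each
    qualifying user's last event goes to test, second-to-last to val,
    and everything older to train.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))
    train, val, test = [], [], []
    for user, events in by_user.items():
        if len(events) < 3:          # filter: users with >= 3 interactions
            continue
        events.sort()                # oldest first
        train += [(user, item) for _, item in events[:-2]]
        val.append((user, events[-2][1]))   # second-to-last event
        test.append((user, events[-1][1]))  # last event
    return train, val, test
```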

Sequence Building (build_sequences.py)

  • Format: Dict[user_id, List[item_id]]
  • Padding: 0 reserved, IDs are 1-indexed
  • Max length: 50 items (truncated)
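
A sketch of the truncate-and-pad convention above. Left-padding (newest item at the end) is a common choice for sequence models like SASRec, but it is an assumption here; build_sequences.py may pad differently.

```python
MAX_LEN = 50  # max sequence length, from the note above
PAD_ID = 0    # 0 is reserved for padding; item IDs are 1-indexed

def pad_sequence(items: list[int], max_len: int = MAX_LEN) -> list[int]:
    """Keep the most recent max_len items, then left-pad with PAD_ID."""
    seq = items[-max_len:]
    return [PAD_ID] * (max_len - len(seq)) + seq

# pad_sequence([3, 7, 9], max_len=5) -> [0, 0, 3, 7, 9]
```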
# Run via unified pipeline
python scripts/run_pipeline.py --stage books

# Or manually
python scripts/data/clean_data.py --backup
python scripts/data/split_rec_data.py
python scripts/data/build_sequences.py

Script conventions: use config.data_config for paths and scripts.utils.setup_script_logger() for logging.


Phase 3: Index Building

3.1 Vector Database (ChromaDB)

python scripts/data/init_dual_index.py

Output: data/chroma_db/ (222K book vectors)

3.2 Review Chunks Index (Small-to-Big)

python scripts/data/extract_review_sentences.py

Output: data/chroma_chunks/ (788K sentence vectors)
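
The Small-to-Big idea: search over the small units (review sentences) for precision, then expand each hit to its parent book. A minimal sketch of the expansion step, assuming each chunk hit carries a parent "book_id" in its metadata (field name is an assumption; the real index lives in data/chroma_chunks/).

```python
def small_to_big(chunk_hits: list[dict], top_k: int = 5) -> list[str]:
    """Deduplicate sentence-level hits to parent book ids, best score first."""
    seen, books = set(), []
    for hit in sorted(chunk_hits, key=lambda h: h["score"], reverse=True):
        book = hit["book_id"]
        if book not in seen:
            seen.add(book)
            books.append(book)
        if len(books) == top_k:
            break
    return books

hits = [
    {"book_id": "B1", "score": 0.9},
    {"book_id": "B2", "score": 0.8},
    {"book_id": "B1", "score": 0.7},  # another sentence from the same book
]
# small_to_big(hits) -> ["B1", "B2"]
```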


Phase 4: Model Training

4.1 Recall Models (CPU OK)

# Build ItemCF / UserCF / Swing / Popularity
python scripts/model/build_recall_models.py

Output: data/model/recall/itemcf.pkl, usercf.pkl, swing.pkl, popularity.pkl

Training Time (Apple Silicon CPU):

Model                         Time
ItemCF (direction-weighted)   ~2 min
UserCF                        ~7 sec
Swing (optimized)             ~35 sec
Popularity                    <1 sec
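
For intuition, the core of ItemCF is a co-occurrence count normalized by item popularity. A plain cosine-style sketch; the real build_recall_models.py adds direction weighting and other refinements not shown here.

```python
import math
from collections import defaultdict
from itertools import combinations

def itemcf_similarity(user_items: dict[str, list[str]]) -> dict:
    """Basic ItemCF: sim(a, b) = co(a, b) / sqrt(count(a) * count(b))."""
    item_count = defaultdict(int)    # how many users touched each item
    co_count = defaultdict(float)    # how many users touched both items
    for items in user_items.values():
        for item in items:
            item_count[item] += 1
        for a, b in combinations(set(items), 2):
            co_count[(a, b)] += 1
            co_count[(b, a)] += 1
    sim = defaultdict(dict)
    for (a, b), c in co_count.items():
        sim[a][b] = c / math.sqrt(item_count[a] * item_count[b])
    return sim
```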

4.2 YoutubeDNN (GPU Recommended)

# Train two-tower model
python scripts/model/train_youtube_dnn.py

Output: data/model/recall/youtube_dnn.pt

Training: ~50 epochs, 2048 batch, ~30 min on GPU

4.3 SASRec (GPU Recommended)

# Train sequence model
python scripts/model/train_sasrec.py

Output: data/model/recall/sasrec.pt

Training: ~30 epochs, ~20 min on GPU

4.4 LGBMRanker (LambdaRank)

# Train ranking model (hard negative sampling from recall results)
python scripts/model/train_ranker.py

Output: data/model/ranking/lgbm_ranker.txt

Training: ~16 min on CPU (20K users sampled, 4Γ— hard negatives, 17 features)
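
Hard negatives for the ranker are items the recall stage surfaced but the user never interacted with, sampled at 4x per positive as noted above. A sketch of just the sampling step (feature building and the LightGBM fit are omitted; function and parameter names are illustrative).

```python
import random

def sample_hard_negatives(positives: set[str], recall_candidates: list[str],
                          ratio: int = 4, seed: int = 42) -> list[str]:
    """Draw up to ratio * len(positives) negatives from recall candidates
    the user did NOT interact with. Seeded for reproducibility."""
    rng = random.Random(seed)
    pool = [c for c in recall_candidates if c not in positives]
    n = min(len(pool), ratio * len(positives))
    return rng.sample(pool, n)
```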


Phase 5: Service Startup

Backend

make run
# or
uvicorn src.main:app --reload --port 6006

Startup Log:

Loading embedding model...           # ~20s
Loaded 222003 documents             # ~10s
BM25 Index built with 222005 docs   # ~12s
Engines Initialized.                # Ready!

Frontend

cd web
npm run dev

Access:

  • Backend API: http://localhost:6006
  • Frontend: http://localhost:5173

Data Flow Summary

data/
├── raw/
│   ├── books_data.csv          # Original book metadata
│   └── Books_rating.csv        # Original ratings
├── books_basic_info.csv        # Processed book info
├── books_processed.csv         # Full processed data
├── chroma_db/                  # Vector index (222K)
├── chroma_chunks/              # Review chunks (788K)
├── rec/
│   ├── train.csv               # 1.08M training records
│   ├── val.csv                 # 168K validation
│   ├── test.csv                # 168K test
│   ├── user_sequences.pkl      # User history
│   └── item_map.pkl            # ISBN → ID mapping
├── model/
│   ├── recall/
│   │   ├── itemcf.pkl          # ItemCF matrix (direction-weighted)
│   │   ├── usercf.pkl          # UserCF matrix
│   │   ├── swing.pkl           # Swing matrix
│   │   ├── popularity.pkl      # Popularity scores
│   │   ├── youtube_dnn.pt      # Two-tower model
│   │   └── sasrec.pt           # Sequence model
│   └── ranking/
│       └── lgbm_ranker.txt     # LGBMRanker (LambdaRank)
└── user_profiles.json          # User favorites

Training on GPU Server

If your local machine is too slow for GPU training, use a cloud GPU server (e.g. AutoDL):

# Sync to server
rsync -avz . user@server:/path/to/project

# On server
python scripts/model/train_youtube_dnn.py
python scripts/model/train_sasrec.py

# Sync back
rsync -avz user@server:/path/to/project/data/model ./data/

Minimal Local Run (Without Training)

If you only have raw data but no trained models:

  1. ItemCF/UserCF/Swing will work (CPU-trained on-demand)
  2. YoutubeDNN will be skipped (graceful degradation)
  3. SASRec features will be 0.0
  4. LGBMRanker must be trained first; otherwise ranking falls back to recall scores

The system will run with reduced accuracy but remain functional.


Last Updated: January 2026