# Build Guide: From Zero to Production
This guide explains how to build the entire project from scratch.
---
## Quick Start (Already Built)
```bash
# 1. Create environment
conda env create -f environment.yml
conda activate book-rec
# 2. Validate data (check what's ready)
make data-validate
# 3. Start backend
make run # http://localhost:6006
# 4. Start frontend
cd web && npm install && npm run dev # http://localhost:5173
```
### New Pipeline Commands
```bash
make data-pipeline # Run full pipeline (data + models)
make data-prep # Data processing only (no GPU training)
make data-validate # Check data quality
make train-models # Train ML models only
```
---
## Full Build Pipeline
### Overview
```
Raw Data (CSV)
│
├── [1] Data Processing ──────────────────────────┐
│     ├── books_data.csv → books_processed.csv    │
│     ├── Books_rating.csv → rec/train,val,test.csv│
│     └── Reviews → review_chunks                 │
│                                                 │
├── [2] Index Building ───────────────────────────┤
│     ├── ChromaDB (Vector Index)                 │
│     └── BM25 (Sparse Index)                     │
│                                                 │
├── [3] Model Training ───────────────────────────┤
│     ├── ItemCF / UserCF / Swing (CPU)           │
│     ├── YoutubeDNN (GPU)                        │
│     ├── SASRec (GPU)                            │
│     └── LGBMRanker (CPU)                        │
│                                                 │
└── [4] Service Startup ──────────────────────────┘
      └── FastAPI + React
```
---
## Phase 1: Environment Setup
```bash
# Clone repo
git clone <repo-url>
cd book-rec-with-LLMs
# Create conda environment
conda env create -f environment.yml
conda activate book-rec
# Install frontend dependencies
cd web && npm install && cd ..
```
---
## Phase 2: Data Preparation
### 2.1 Raw Data Requirements
Place in `data/raw/`:
- `books_data.csv` - Book metadata (title, author, description, categories)
- `Books_rating.csv` - User ratings (User_id, Id, review/score, review/time, review/text)
### 2.2 Pipeline DAG (Execution Order)
**Recommended**: Use `make data-pipeline` or `python scripts/run_pipeline.py` β€” it defines the full DAG.
| Stage | Script | Purpose | Output |
|:---:|:---|:---|:---|
| 1 | `build_books_basic_info.py` | Merge raw books + ratings | books_basic_info.csv |
| 2 | *books_processed.csv* | From HuggingFace or manual merge of basic_info + review_highlights | books_processed.csv |
| 3 | `clean_data.py` | HTML/encoding/whitespace cleanup | books_processed.csv (cleaned) |
| 4 | `generate_emotions.py` | Sentiment analysis (5 emotions) | +joy,sadness,fear,anger,surprise |
| 5 | `generate_tags.py` | TF-IDF keyword extraction | +tags column |
| 6 | `chunk_reviews.py` | Reviews β†’ sentences | review_chunks.jsonl |
| 7 | `split_rec_data.py` | Leave-Last-Out time split | rec/train,val,test.csv |
| 8 | `build_sequences.py` | User history β†’ sequences | rec/user_sequences.pkl |
**Note**: `books_processed.csv` may be pre-downloaded from HuggingFace. If building from scratch, merge `books_basic_info.csv` with review data and run `extract_review_sentences.py` first.
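The stages above run in a fixed linear order. A toy runner illustrating the resume-from-a-stage semantics (hypothetical sketch — the real `scripts/run_pipeline.py` may be organized differently):

```python
# Hypothetical stage names mirroring the table above; the real runner
# imports and executes each script rather than just collecting names.
STAGES = [
    "build_books_basic_info",
    "clean_data",
    "generate_emotions",
    "generate_tags",
    "chunk_reviews",
    "split_rec_data",
    "build_sequences",
]

def run_pipeline(from_stage=None):
    """Run stages in DAG order, optionally resuming from a given stage."""
    start = STAGES.index(from_stage) if from_stage else 0
    executed = []
    for name in STAGES[start:]:
        executed.append(name)  # placeholder for actually invoking the script
    return executed

print(run_pipeline(from_stage="split_rec_data"))
# -> ['split_rec_data', 'build_sequences']
```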
### 2.3 Script Details
#### Data Cleaning (`clean_data.py`)
- **HTML**: Remove tags, decode entities (`&amp;` β†’ `&`)
- **Encoding**: Fix mojibake (UTF-8 corruption)
- **Unicode**: NFKC normalization
- **Whitespace**: Collapse multiple spaces/newlines
- **URLs**: Remove from text
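The cleanup passes above can be sketched with the standard library alone (an illustration, not the actual `clean_data.py`; mojibake repair typically needs a dedicated tool and is omitted here):

```python
import html
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
WS_RE = re.compile(r"\s+")

def clean_text(text: str) -> str:
    """Apply the cleanup passes in order: tags, entities, Unicode, URLs, whitespace."""
    text = TAG_RE.sub(" ", text)                 # strip HTML tags
    text = html.unescape(text)                   # decode entities: &amp; -> &
    text = unicodedata.normalize("NFKC", text)   # Unicode normalization
    text = URL_RE.sub("", text)                  # drop URLs
    text = WS_RE.sub(" ", text).strip()          # collapse spaces/newlines
    return text

print(clean_text("<p>Great &amp; fun!\n\nSee https://example.com now</p>"))
# -> Great & fun! See now
```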
#### Data Split (`split_rec_data.py`)
- **Strategy**: Leave-Last-Out (chronological split)
- **Filter**: Users with β‰₯3 interactions
- **Output**: train (all but the last two interactions) → val (second-to-last) → test (last)
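The split strategy can be sketched in a few lines (a stdlib-only illustration of the idea, not the actual `split_rec_data.py`):

```python
from collections import defaultdict

def leave_last_out(interactions, min_interactions=3):
    """Chronological Leave-Last-Out split.

    interactions: iterable of (user_id, item_id, timestamp) tuples.
    Users with fewer than `min_interactions` events are dropped.
    Per user: oldest events -> train, second-to-last -> val, last -> test.
    """
    by_user = defaultdict(list)
    for user, item, ts in interactions:
        by_user[user].append((ts, item))

    train, val, test = [], [], []
    for user, events in by_user.items():
        if len(events) < min_interactions:
            continue
        events.sort()                         # oldest first
        for ts, item in events[:-2]:          # all but the last two
            train.append((user, item, ts))
        val.append((user, events[-2][1], events[-2][0]))
        test.append((user, events[-1][1], events[-1][0]))
    return train, val, test

rows = [("u1", "a", 1), ("u1", "b", 2), ("u1", "c", 3), ("u2", "x", 1)]
train, val, test = leave_last_out(rows)
# u2 has <3 interactions and is dropped; u1: a -> train, b -> val, c -> test
```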
#### Sequence Building (`build_sequences.py`)
- **Format**: `Dict[user_id, List[item_id]]`
- **Padding**: 0 reserved, IDs are 1-indexed
- **Max length**: 50 items (truncated)
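Under these conventions, building a model-ready sequence reduces to left-padding with 0 and truncating to the window size (a minimal sketch; keeping the most recent items on truncation is an assumption):

```python
def build_sequence(item_ids, max_len=50):
    """Left-pad / truncate a user's chronological item-ID list.

    IDs are assumed 1-indexed so that 0 can serve as the padding token.
    Keeps the most recent `max_len` items when truncating (assumption).
    """
    seq = item_ids[-max_len:]                  # keep the newest items
    return [0] * (max_len - len(seq)) + seq    # left-pad with 0

print(build_sequence([3, 7, 9], max_len=5))  # -> [0, 0, 3, 7, 9]
```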
```bash
# Run via unified pipeline
python scripts/run_pipeline.py --stage books
# Or manually
python scripts/data/clean_data.py --backup
python scripts/data/split_rec_data.py
python scripts/data/build_sequences.py
```
**Script conventions**: Use `config.data_config` for paths; `scripts.utils.setup_script_logger()` for logging.
---
## Phase 3: Index Building
### 3.1 Vector Database (ChromaDB)
```bash
python scripts/data/init_dual_index.py
```
**Output**: `data/chroma_db/` (222K book vectors)
### 3.2 Review Chunks Index (Small-to-Big)
```bash
python scripts/data/extract_review_sentences.py
```
**Output**: `data/chroma_chunks/` (788K sentence vectors)
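The "Small-to-Big" idea: index small sentence chunks for precise matching, but resolve every hit back to the parent book. A stdlib sketch of the chunking side, with hypothetical field names (the real `extract_review_sentences.py` and its embedding step may differ):

```python
import re

def chunk_reviews(reviews):
    """Split each review into sentence chunks that remember their parent book.

    reviews: list of dicts with (hypothetical) keys "book_id" and "text".
    At query time you search the small chunks but return the parent book.
    """
    for review in reviews:
        sentences = re.split(r"(?<=[.!?])\s+", review["text"].strip())
        for i, sent in enumerate(sentences):
            if sent:
                yield {
                    "chunk_id": f'{review["book_id"]}#{i}',
                    "parent_id": review["book_id"],   # back-pointer for "big" lookup
                    "sentence": sent,
                }

chunks = list(chunk_reviews([{"book_id": "B1", "text": "Loved it. A bit slow."}]))
# two sentence chunks, both pointing back to parent "B1"
```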
---
## Phase 4: Model Training
### 4.1 Recall Models (CPU OK)
```bash
# Build ItemCF / UserCF / Swing / Popularity
python scripts/model/build_recall_models.py
```
**Output**: `data/model/recall/itemcf.pkl`, `usercf.pkl`, `swing.pkl`, `popularity.pkl`
**Training Time** (Apple Silicon CPU):
| Model | Time |
|:---|:---|
| ItemCF (direction-weighted) | ~2 min |
| UserCF | ~7 sec |
| Swing (optimized) | ~35 sec |
| Popularity | <1 sec |
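The repo's ItemCF is direction-weighted; the sketch below shows only the plain co-occurrence/cosine core such models build on (an illustration, not the actual `build_recall_models.py` logic):

```python
from collections import defaultdict
from math import sqrt

def itemcf_similarity(user_items):
    """Item-item similarity from co-occurrence in user histories.

    user_items: dict user_id -> list of item_ids.
    Returns sim[i][j] = co(i, j) / sqrt(pop(i) * pop(j)).
    Direction weighting / hot-user penalties from the real model are omitted.
    """
    co = defaultdict(lambda: defaultdict(float))
    pop = defaultdict(int)
    for items in user_items.values():
        for i in items:
            pop[i] += 1
            for j in items:
                if i != j:
                    co[i][j] += 1.0
    return {i: {j: c / sqrt(pop[i] * pop[j]) for j, c in row.items()}
            for i, row in co.items()}

sim = itemcf_similarity({"u1": ["a", "b"], "u2": ["a", "b", "c"]})
# sim["a"]["b"] = 2 / sqrt(2 * 2) = 1.0
```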
### 4.2 YoutubeDNN (GPU Recommended)
```bash
# Train two-tower model
python scripts/model/train_youtube_dnn.py
```
**Output**: `data/model/recall/youtube_dnn.pt`
**Training**: ~50 epochs, 2048 batch, ~30 min on GPU
### 4.3 SASRec (GPU Recommended)
```bash
# Train sequence model
python scripts/model/train_sasrec.py
```
**Output**: `data/model/recall/sasrec.pt`
**Training**: ~30 epochs, ~20 min on GPU
### 4.4 LGBMRanker (LambdaRank)
```bash
# Train ranking model (hard negative sampling from recall results)
python scripts/model/train_ranker.py
```
**Output**: `data/model/ranking/lgbm_ranker.txt`
**Training**: ~16 min on CPU (20K users sampled, 4Γ— hard negatives, 17 features)
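Hard negative sampling from recall results can be sketched as follows: negatives are items the recall stage surfaced but the user never interacted with. The 4× ratio matches the figure above; the selection logic itself is an assumption, not the actual `train_ranker.py`:

```python
import random

def sample_hard_negatives(recalled, positives, ratio=4, seed=42):
    """Pick hard negatives for a ranker: recalled-but-not-interacted items.

    recalled: list of candidate item_ids from the recall stage.
    positives: set of item_ids the user actually interacted with.
    Returns up to `ratio * len(positives)` negatives.
    """
    rng = random.Random(seed)
    candidates = [i for i in recalled if i not in positives]
    k = min(len(candidates), ratio * len(positives))
    return rng.sample(candidates, k)

negs = sample_hard_negatives(["a", "b", "c", "d", "e"], {"a"}, ratio=4)
# up to 4 negatives drawn from {"b", "c", "d", "e"}
```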
---
## Phase 5: Service Startup
### Backend
```bash
make run
# or
uvicorn src.main:app --reload --port 6006
```
**Startup Log**:
```
Loading embedding model... # ~20s
Loaded 222003 documents # ~10s
BM25 Index built with 222005 docs # ~12s
Engines Initialized. # Ready!
```
### Frontend
```bash
cd web
npm run dev
```
**Access**:
- Frontend: http://localhost:5173
- API Docs: http://localhost:6006/docs
---
## Data Flow Summary
```
data/
β”œβ”€β”€ raw/
β”‚ β”œβ”€β”€ books_data.csv # Original book metadata
β”‚ └── Books_rating.csv # Original ratings
β”œβ”€β”€ books_basic_info.csv # Processed book info
β”œβ”€β”€ books_processed.csv # Full processed data
β”œβ”€β”€ chroma_db/ # Vector index (222K)
β”œβ”€β”€ chroma_chunks/ # Review chunks (788K)
β”œβ”€β”€ rec/
β”‚ β”œβ”€β”€ train.csv # 1.08M training records
β”‚ β”œβ”€β”€ val.csv # 168K validation
β”‚ β”œβ”€β”€ test.csv # 168K test
β”‚ β”œβ”€β”€ user_sequences.pkl # User history
β”‚ └── item_map.pkl # ISBN β†’ ID mapping
β”œβ”€β”€ model/
β”‚ β”œβ”€β”€ recall/
β”‚ β”‚ β”œβ”€β”€ itemcf.pkl # ItemCF matrix (direction-weighted)
β”‚ β”‚ β”œβ”€β”€ usercf.pkl # UserCF matrix
β”‚ β”‚ β”œβ”€β”€ swing.pkl # Swing matrix
β”‚ β”‚ β”œβ”€β”€ popularity.pkl # Popularity scores
β”‚ β”‚ β”œβ”€β”€ youtube_dnn.pt # Two-tower model
β”‚ β”‚ └── sasrec.pt # Sequence model
β”‚ └── ranking/
β”‚ └── lgbm_ranker.txt # LGBMRanker (LambdaRank)
└── user_profiles.json # User favorites
```
---
## Training on GPU Server
If your local machine is slow, train on a GPU server (e.g. AutoDL or another cloud provider):
```bash
# Sync to server
rsync -avz . user@server:/path/to/project
# On server
python scripts/model/train_youtube_dnn.py
python scripts/model/train_sasrec.py
# Sync back
rsync -avz user@server:/path/to/project/data/model ./data/
```
---
## Minimal Local Run (Without Training)
If you only have raw data but no trained models:
1. **ItemCF / UserCF / Swing** work (trained on demand on CPU)
2. **YoutubeDNN** is skipped (graceful degradation)
3. **SASRec features** default to 0.0
4. **LGBMRanker** must be trained, or the system falls back to recall scores

The system runs with reduced accuracy but remains functional.
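The graceful-degradation check can be sketched as a simple file-existence probe (the paths below mirror the data layout above but are hypothetical; the real code should read them from `config.data_config`):

```python
from pathlib import Path

# Hypothetical paths mirroring the Data Flow Summary above.
OPTIONAL_MODELS = {
    "youtube_dnn": Path("data/model/recall/youtube_dnn.pt"),
    "sasrec": Path("data/model/recall/sasrec.pt"),
    "lgbm_ranker": Path("data/model/ranking/lgbm_ranker.txt"),
}

def available_models():
    """Return which optional model files exist; missing ones are skipped."""
    return {name: path.exists() for name, path in OPTIONAL_MODELS.items()}

for name, ok in available_models().items():
    print(f"{name}: {'loaded' if ok else 'skipped (graceful degradation)'}")
```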
---
*Last Updated: January 2026*