# Refactoring Guide: Large Files
**Status**: Active
**Audience**: Engineers planning refactoring work
**Last Updated**: 2026-02-12
**Priority**: P2 (Quality improvement, not blocking)
---
## Overview
Several files in the codebase approach or exceed 300 lines. While not critical, breaking them into smaller modules improves:
- **Readability**: Easier to understand each component
- **Testability**: Smaller units = easier to test in isolation
- **Maintainability**: Changes have smaller blast radius
**Rule of Thumb**: Keep files under 300 lines and functions under 50 lines.
---
## Files Requiring Refactoring
| File | Lines | Complexity | Priority | Suggested Split |
|------|-------|------------|----------|-----------------|
| `src/ranking/features.py` | 470 | High | P0 | Extract feature groups into separate modules |
| `src/core/web_search.py` | 427 | Medium | P1 | Separate API client from result parsing |
| `src/vector_db.py` | 368 | High | P0 | Split hybrid search from small-to-big retrieval |
| `src/recall/sasrec_recall.py` | 337 | Medium | P2 | Extract embedding logic |
| `src/recall/embedding.py` | 333 | Medium | P2 | Extract training vs inference |
| `src/services/recommend_service.py` | 313 | High | P1 | Extract diversity reranking orchestration |
| `src/core/recommendation_orchestrator.py` | 293 | Medium | P2 | Extract query preprocessing |
| `src/core/metadata_store.py` | 288 | Low | P3 | Already well-structured (mostly SQL) |
---
## Refactoring Strategy: features.py (Example)
### Current Structure (470 lines)
```python
class FeatureEngineer:
    def __init__(self):
        # 20 lines
    def _load_sasrec_features(self):
        # 35 lines
    def _load_user_sequences(self):
        # 20 lines
    def load_base_data(self):
        # 30 lines
    def generate_features(self):
        # 180 lines (HUGE!)
        # - User stats (15 lines)
        # - Item stats (15 lines)
        # - SASRec features (40 lines)
        # - ItemCF features (30 lines)
        # - UserCF features (30 lines)
        # - Last-N similarity (50 lines)
    def extract_features(self):
        # 120 lines
```
### Proposed Refactoring
**New Structure**:
```
src/ranking/
├── features.py               # Orchestrator (100 lines)
├── feature_groups/
│   ├── __init__.py
│   ├── user_features.py      # User stats extraction (40 lines)
│   ├── item_features.py      # Item stats extraction (40 lines)
│   ├── sasrec_features.py    # SASRec embedding features (60 lines)
│   ├── cf_features.py        # ItemCF + UserCF features (70 lines)
│   └── sequence_features.py  # Last-N similarity (60 lines)
```
**Refactored `features.py`** (100 lines):
```python
from pathlib import Path

from src.ranking.feature_groups.user_features import UserFeatureExtractor
from src.ranking.feature_groups.item_features import ItemFeatureExtractor
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor
from src.ranking.feature_groups.cf_features import CFFeatureExtractor
from src.ranking.feature_groups.sequence_features import SequenceFeatureExtractor


class FeatureEngineer:
    """Orchestrates feature extraction from multiple sources."""

    def __init__(self, data_dir='data/rec', model_dir='data/model/recall'):
        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir)
        # Initialize sub-extractors
        self.user_extractor = UserFeatureExtractor(data_dir)
        self.item_extractor = ItemFeatureExtractor(data_dir)
        self.sasrec_extractor = SASRecFeatureExtractor(data_dir, model_dir)
        self.cf_extractor = CFFeatureExtractor(data_dir, model_dir)
        self.seq_extractor = SequenceFeatureExtractor(data_dir)

    def load_base_data(self):
        """Load data for all feature extractors."""
        self.user_extractor.load()
        self.item_extractor.load()
        self.sasrec_extractor.load()
        self.cf_extractor.load()
        self.seq_extractor.load()

    def generate_features(self, user_id, candidate_item, **overrides):
        """Generate feature vector by delegating to sub-extractors."""
        feats = {}
        # Delegate to each extractor
        feats.update(self.user_extractor.extract(user_id))
        feats.update(self.item_extractor.extract(candidate_item))
        feats.update(self.sasrec_extractor.extract(user_id, candidate_item, **overrides))
        feats.update(self.cf_extractor.extract(user_id, candidate_item))
        feats.update(self.seq_extractor.extract(user_id, candidate_item, **overrides))
        return feats
```
**Benefits**:
- Each extractor is <70 lines (easy to understand)
- Testable in isolation: `test_sasrec_features.py`
- Clear separation of concerns
- Easy to add new feature groups
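As an illustration, each sub-extractor could follow a common `load()`/`extract()` contract so the orchestrator can treat them uniformly. A minimal sketch of a hypothetical `UserFeatureExtractor` (the stat field names and the stubbed data below are illustrative, not taken from the actual codebase):

```python
from pathlib import Path


class UserFeatureExtractor:
    """Extracts per-user statistics (e.g. rating count, average rating)."""

    def __init__(self, data_dir="data/rec"):
        self.data_dir = Path(data_dir)
        self.user_stats = {}

    def load(self):
        # The real module would read from data_dir; stubbed here
        # with an in-memory dict for illustration.
        self.user_stats = {"A123": {"n_ratings": 42, "avg_rating": 4.1}}

    def extract(self, user_id):
        # Unknown users fall back to neutral defaults, so the
        # orchestrator never has to special-case cold-start users.
        stats = self.user_stats.get(user_id, {})
        return {
            "user_n_ratings": stats.get("n_ratings", 0),
            "user_avg_rating": stats.get("avg_rating", 0.0),
        }
```

Because each extractor owns its own `load()`, a unit test can exercise it with canned data and never touch the other feature groups.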
---
## Refactoring Checklist
**Before Refactoring**:
- [ ] Read existing code thoroughly
- [ ] Identify natural module boundaries (feature groups, API layers, etc.)
- [ ] Check test coverage (write tests if missing)
- [ ] Document current behavior (input/output examples)
**During Refactoring**:
- [ ] Create new modules with single responsibility
- [ ] Move code incrementally (one function at a time)
- [ ] Run tests after each move
- [ ] Update imports in dependent files
- [ ] Add docstrings to new modules
**After Refactoring**:
- [ ] Run full test suite
- [ ] Check code coverage (should not decrease)
- [ ] Update ARCHITECTURE.md if structure changed
- [ ] Benchmark performance (ensure no regression)
- [ ] Update this guide with lessons learned
---
## Refactoring Patterns
### Pattern 1: Extract Feature Groups
**When**: Large function with distinct sections (user features, item features, etc.)
**How**:
1. Identify groups: user stats, item stats, embeddings, CF scores
2. Create one class per group: `UserFeatureExtractor`, `ItemFeatureExtractor`
3. Each class has `load()` and `extract()` methods
4. Main class delegates to extractors
**Example**: `features.py` → `feature_groups/*`
---
### Pattern 2: Split API Client from Parser
**When**: File handles both HTTP requests and response parsing
**How**:
1. Create `api_client.py`: HTTP calls, retries, auth
2. Create `parsers.py`: Parse JSON responses into domain objects
3. Main file orchestrates: `client.fetch()` → `parser.parse()`
**Example**: `web_search.py` → `google_books_client.py` + `result_parser.py`
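The payoff of this split is that the parser half becomes a pure function testable with canned payloads, no network required. A sketch of what `parsers.py` might contain (the payload shape follows the Google Books `items[].volumeInfo` layout; `BookResult` and `parse_volumes` are illustrative names, not the actual API):

```python
from dataclasses import dataclass


@dataclass
class BookResult:
    title: str
    authors: list


def parse_volumes(payload):
    """Turn a Google-Books-style JSON payload into domain objects.

    No HTTP concerns here: the function takes an already-decoded
    dict, so tests can feed it fixtures directly.
    """
    results = []
    for item in payload.get("items", []):
        info = item.get("volumeInfo", {})
        results.append(BookResult(
            title=info.get("title", ""),
            authors=info.get("authors", []),
        ))
    return results
```

The client half then only worries about requests, retries, and auth, and hands its decoded JSON to the parser.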
---
### Pattern 3: Separate Training from Inference
**When**: File contains both model training code and inference code
**How**:
1. Create `trainer.py`: Fit model, save weights
2. Create `predictor.py`: Load weights, predict
3. Shared utilities in `model_utils.py`
**Example**: `sasrec_recall.py` → `sasrec_trainer.py` + `sasrec_predictor.py`
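The shape of the split can be sketched with a toy popularity model (all names here are illustrative; a real SASRec trainer/predictor is far more involved). The key property is that `Predictor` depends only on the saved artifact, never on training code:

```python
import json
from pathlib import Path


class Trainer:
    """Fits a toy popularity model and saves weights to disk."""

    def fit(self, interactions):
        # interactions: list of (user_id, item_id) pairs
        counts = {}
        for _, item in interactions:
            counts[item] = counts.get(item, 0) + 1
        self.weights = counts
        return self

    def save(self, path):
        Path(path).write_text(json.dumps(self.weights))


class Predictor:
    """Loads saved weights; knows nothing about training."""

    def __init__(self, path):
        self.weights = json.loads(Path(path).read_text())

    def predict(self, k=10):
        # Rank items by learned weight, descending
        ranked = sorted(self.weights, key=self.weights.get, reverse=True)
        return ranked[:k]
```

In production this means the serving path can import `predictor.py` without pulling in training-only dependencies (optimizers, training data loaders, etc.).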
---
### Pattern 4: Extract Helper Functions
**When**: Function >50 lines with reusable logic
**How**:
1. Identify helper logic (e.g., "compute similarity", "filter results")
2. Extract to `_helper_function()` or `utils.py`
3. Main function becomes high-level orchestration
**Example**:
```python
# Before (80 lines)
def recommend(self, user_id, k=50):
    # 1. Get history (15 lines)
    # 2. Query similar items (20 lines)
    # 3. Aggregate scores (25 lines)
    # 4. Filter & sort (20 lines)

# After (20 lines)
def recommend(self, user_id, k=50):
    history = self._get_user_history(user_id)
    similar_items = self._query_similar_items(history)
    scores = self._aggregate_scores(similar_items)
    return self._filter_and_sort(scores, k)
```
---
## Example Refactoring: vector_db.py
### Current (368 lines)
```python
class VectorDB:
    def __init__(self):
        # 20 lines
    def search(self):
        # 30 lines
    def hybrid_search(self):
        # 120 lines (complex RRF fusion logic)
    def small_to_big_search(self):
        # 80 lines (hierarchical retrieval)
    def _rrf_fusion(self):
        # 60 lines
    def _build_bm25_index(self):
        # 40 lines
```
### Proposed (3 files, each <150 lines)
**src/vector_db.py** (80 lines):
```python
from src.retrieval.hybrid_retriever import HybridRetriever
from src.retrieval.hierarchical_retriever import HierarchicalRetriever


class VectorDB:
    """Facade for multiple retrieval strategies."""

    def __init__(self):
        self.chroma = Chroma(...)  # dense vector store
        self.hybrid = HybridRetriever()
        self.hierarchical = HierarchicalRetriever()

    def search(self, query, k=10):
        """Dense search only."""
        return self.chroma.search(query, k=k)

    def hybrid_search(self, query, k=10, alpha=0.5):
        """BM25 + Dense fusion."""
        return self.hybrid.search(query, k, alpha)

    def small_to_big_search(self, query, k=10):
        """Sentence → Book hierarchical retrieval."""
        return self.hierarchical.search(query, k)
```
**src/retrieval/hybrid_retriever.py** (150 lines):
```python
class HybridRetriever:
    """Combines BM25 (sparse) + Dense (embeddings) via RRF."""

    def __init__(self):
        self.bm25 = self._build_bm25_index()
        self.dense_db = Chroma(...)

    def search(self, query, k, alpha):
        dense_results = self._dense_search(query, k)
        sparse_results = self._sparse_search(query, k)
        return self._rrf_fusion(dense_results, sparse_results, alpha)

    def _rrf_fusion(self, dense, sparse, alpha):
        # 60 lines (RRF logic)

    def _build_bm25_index(self):
        # 40 lines
```
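The core of `_rrf_fusion` is Reciprocal Rank Fusion: each document scores the sum of `1 / (c + rank)` over the ranked lists it appears in, so documents ranked well in both lists float to the top without any score normalization. A standalone sketch of a weighted variant (the `alpha` weighting mirrors the signature above; `c=60` is the constant conventionally used for RRF; not the project's actual implementation):

```python
def rrf_fusion(dense_ids, sparse_ids, alpha=0.5, c=60):
    """Weighted Reciprocal Rank Fusion over two ranked ID lists.

    Each list contributes 1 / (c + rank) per document; `alpha`
    weights the dense list against the sparse one.
    """
    scores = {}
    # Ranks are 1-based, so the top hit contributes 1 / (c + 1)
    for rank, doc in enumerate(dense_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + alpha / (c + rank)
    for rank, doc in enumerate(sparse_ids, start=1):
        scores[doc] = scores.get(doc, 0.0) + (1 - alpha) / (c + rank)
    # Return IDs sorted by fused score, best first
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF works on ranks rather than raw scores, BM25 and cosine similarities never need to share a scale, which is what keeps the fusion logic compact.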
**src/retrieval/hierarchical_retriever.py** (130 lines):
```python
class HierarchicalRetriever:
    """Small-to-Big (sentence → book) retrieval."""

    def search(self, query, k):
        # 1. Find matching sentences (child nodes)
        sentences = self.sentence_db.search(query, k=50)
        # 2. Map to parent books (ISBNs)
        book_isbns = self._map_to_parents(sentences)
        # 3. Return full book context
        return self._enrich_with_metadata(book_isbns, k)
```
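The `_map_to_parents` step is essentially an order-preserving dedup from sentence hits to parent ISBNs: a book matched by several sentences keeps its best (first) position. A sketch, assuming each hit carries its parent ISBN in a metadata dict (the exact field layout is hypothetical):

```python
def map_to_parents(sentence_hits):
    """Map ranked sentence-level hits to parent book ISBNs,
    deduplicated while preserving rank order."""
    seen = set()
    isbns = []
    for hit in sentence_hits:
        isbn = hit["metadata"]["isbn"]
        if isbn not in seen:
            seen.add(isbn)
            isbns.append(isbn)
    return isbns
```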
---
## Migration Strategy
### Incremental Refactoring (Safe)
**Week 1: Extract utilities**
- Move helper functions to `utils.py`
- No API changes
- Run tests
**Week 2: Create new modules**
- Add `feature_groups/` directory
- Copy (don't move) code to new modules
- Old code still works
**Week 3: Update imports**
- Change `features.py` to use new modules
- Mark old methods as `@deprecated`
- Update tests
**Week 4: Remove old code**
- Delete deprecated methods
- Update ARCHITECTURE.md
- Celebrate! 🎉
---
## Testing Strategy
### Before Refactoring: Capture Behavior
```python
# tests/test_features_baseline.py
import json

from src.ranking.features import FeatureEngineer


def test_feature_extraction_baseline():
    """Capture current behavior before refactoring."""
    fe = FeatureEngineer()
    fe.load_base_data()

    # Test on a known user/item pair
    feats = fe.generate_features("A123", "0123456789")

    # Save as baseline
    with open("tests/baseline_features.json", "w") as f:
        json.dump(feats, f)
```
### After Refactoring: Compare Output
```python
# tests/test_features_refactored.py
import json

from src.ranking.features import FeatureEngineer  # refactored version


def test_feature_extraction_matches_baseline():
    """Ensure refactored code produces identical output."""
    fe = FeatureEngineer()
    fe.load_base_data()
    feats = fe.generate_features("A123", "0123456789")

    # Load the baseline captured before the refactor
    with open("tests/baseline_features.json") as f:
        baseline = json.load(f)

    # Compare feature-by-feature
    for key in baseline:
        assert key in feats, f"Missing feature: {key}"
        assert abs(feats[key] - baseline[key]) < 1e-6, f"Feature {key} mismatch"
```
---
## Performance Considerations
### Refactoring Should Not Slow Things Down
**Before Refactoring**: Benchmark
```bash
python benchmarks/benchmark.py > baseline.txt
```
**After Refactoring**: Re-benchmark
```bash
python benchmarks/benchmark.py > refactored.txt
diff baseline.txt refactored.txt
```
**Acceptable Changes**:
- Latency: ±5% (small overhead from function calls OK)
- Memory: ±10% (object overhead from new classes OK)
**Red Flags**:
- Latency increase >10% → Investigate (unnecessary copies? Extra I/O?)
- Memory increase >20% → Investigate (duplicate data? Leaks?)
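Alongside diffing the benchmark script's output, a quick in-process latency check can flag regressions beyond the tolerance. A sketch (the helper name is made up; `timeit` measures wall-clock time, so keep `n` large enough to dampen noise):

```python
import timeit


def within_tolerance(old_fn, new_fn, n=1000, tolerance=0.05):
    """Return True if new_fn's total latency over n runs is within
    `tolerance` (e.g. 0.05 = +5%) of old_fn's."""
    old_t = timeit.timeit(old_fn, number=n)
    new_t = timeit.timeit(new_fn, number=n)
    return new_t <= old_t * (1 + tolerance)
```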
---
## Common Pitfalls
### Pitfall 1: Over-Refactoring
**Problem**: Creating too many tiny files (10-20 lines each)
**Solution**: Aim for 50-150 lines per file. Smaller isn't always better.
---
### Pitfall 2: Breaking API Compatibility
**Problem**: Refactoring changes function signatures, breaking existing code
**Solution**: Use deprecation warnings
```python
import warnings

def old_method(self, x):
    warnings.warn("old_method is deprecated, use new_method",
                  DeprecationWarning, stacklevel=2)
    return self.new_method(x)
```
---
### Pitfall 3: Not Updating Tests
**Problem**: Tests still import old paths
**Solution**: Update imports incrementally
```python
# Before
from src.ranking.features import FeatureEngineer
# After
from src.ranking.features import FeatureEngineer # Facade
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor # Specific
```
---
## When NOT to Refactor
- **Deadline pressure**: Refactoring during crunch time adds risk
- **No tests**: Write tests first, then refactor
- **Unclear requirements**: Understand the code before restructuring
- **"Just because"**: Only refactor if it improves readability/testability
**Rule of Thumb**: Refactor when you're touching the code for a feature/bug anyway. Don't refactor in isolation.
---
## Success Metrics
How do you know refactoring succeeded?
- [ ] Code is easier to understand (ask a teammate)
- [ ] Tests pass (same behavior)
- [ ] Performance unchanged (benchmark)
- [ ] Test coverage increased (easier to test small modules)
- [ ] Future changes are faster (measure time to implement next feature)
---
## Related Documentation
- [ARCHITECTURE.md](../ARCHITECTURE.md) — Overall system structure
- [DEVELOPMENT.md](DEVELOPMENT.md) — Development guidelines
- [ONBOARDING.md](../ONBOARDING.md) — New contributor guide
---
## Refactoring Backlog
Track refactoring work here:
| File | Priority | Assigned | Status | Target Date |
|------|----------|----------|--------|-------------|
| `ranking/features.py` | P0 | Unassigned | Not Started | TBD |
| `core/web_search.py` | P1 | Unassigned | Not Started | TBD |
| `vector_db.py` | P0 | Unassigned | Not Started | TBD |
---
**Last Updated**: 2026-02-12
**Next Review**: After v3.0 planning