Refactoring Guide: Large Files
Status: Active | Audience: Engineers planning refactoring work | Last Updated: 2026-02-12 | Priority: P2 (Quality improvement, not blocking)
Overview
Several files in the codebase exceed 300 lines. While not critical, breaking them into smaller modules improves:
- Readability: Easier to understand each component
- Testability: Smaller units = easier to test in isolation
- Maintainability: Changes have smaller blast radius
Industry Standard: Files <300 lines, functions <50 lines.
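To find which files breach the guideline, a small script is enough. This is a minimal sketch (the `src` default and the 300-line limit come from this guide; everything else is illustrative):

```python
"""Report Python files exceeding the 300-line guideline."""
from pathlib import Path

LINE_LIMIT = 300  # guideline from this doc

def oversized_files(root: str = "src") -> list:
    """Return (path, line_count) pairs over the limit, largest first."""
    results = []
    for path in Path(root).rglob("*.py"):
        count = len(path.read_text(encoding="utf-8").splitlines())
        if count > LINE_LIMIT:
            results.append((str(path), count))
    return sorted(results, key=lambda item: -item[1])

if __name__ == "__main__":
    for path, count in oversized_files():
        print(f"{count:5d}  {path}")
```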
Files Requiring Refactoring
| File | Lines | Complexity | Priority | Suggested Split |
|---|---|---|---|---|
| `src/ranking/features.py` | 470 | High | P0 | Extract feature groups into separate modules |
| `src/core/web_search.py` | 427 | Medium | P1 | Separate API client from result parsing |
| `src/vector_db.py` | 368 | High | P0 | Split hybrid search from small-to-big retrieval |
| `src/recall/sasrec_recall.py` | 337 | Medium | P2 | Extract embedding logic |
| `src/recall/embedding.py` | 333 | Medium | P2 | Separate training from inference |
| `src/services/recommend_service.py` | 313 | High | P1 | Extract diversity reranking orchestration |
| `src/core/recommendation_orchestrator.py` | 293 | Medium | P2 | Extract query preprocessing |
| `src/core/metadata_store.py` | 288 | Low | P3 | Already well-structured (mostly SQL) |
Refactoring Strategy: features.py (Example)
Current Structure (470 lines)
```python
class FeatureEngineer:
    def __init__(self):
        ...  # 20 lines
    def _load_sasrec_features(self):
        ...  # 35 lines
    def _load_user_sequences(self):
        ...  # 20 lines
    def load_base_data(self):
        ...  # 30 lines
    def generate_features(self):
        # 180 lines (HUGE!)
        # - User stats (15 lines)
        # - Item stats (15 lines)
        # - SASRec features (40 lines)
        # - ItemCF features (30 lines)
        # - UserCF features (30 lines)
        # - Last-N similarity (50 lines)
        ...
    def extract_features(self):
        ...  # 120 lines
```
Proposed Refactoring
New Structure:
```
src/ranking/
├── features.py               # Orchestrator (100 lines)
└── feature_groups/
    ├── __init__.py
    ├── user_features.py      # User stats extraction (40 lines)
    ├── item_features.py      # Item stats extraction (40 lines)
    ├── sasrec_features.py    # SASRec embedding features (60 lines)
    ├── cf_features.py        # ItemCF + UserCF features (70 lines)
    └── sequence_features.py  # Last-N similarity (60 lines)
```
Refactored features.py (100 lines):
```python
from pathlib import Path

from src.ranking.feature_groups.user_features import UserFeatureExtractor
from src.ranking.feature_groups.item_features import ItemFeatureExtractor
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor
from src.ranking.feature_groups.cf_features import CFFeatureExtractor
from src.ranking.feature_groups.sequence_features import SequenceFeatureExtractor


class FeatureEngineer:
    """Orchestrates feature extraction from multiple sources."""

    def __init__(self, data_dir='data/rec', model_dir='data/model/recall'):
        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir)
        # Initialize sub-extractors
        self.user_extractor = UserFeatureExtractor(data_dir)
        self.item_extractor = ItemFeatureExtractor(data_dir)
        self.sasrec_extractor = SASRecFeatureExtractor(data_dir, model_dir)
        self.cf_extractor = CFFeatureExtractor(data_dir, model_dir)
        self.seq_extractor = SequenceFeatureExtractor(data_dir)

    def load_base_data(self):
        """Load data for all sub-extractors."""
        self.user_extractor.load()
        self.item_extractor.load()
        self.sasrec_extractor.load()
        self.cf_extractor.load()
        self.seq_extractor.load()

    def generate_features(self, user_id, candidate_item, **overrides):
        """Generate a feature vector by delegating to sub-extractors."""
        feats = {}
        feats.update(self.user_extractor.extract(user_id))
        feats.update(self.item_extractor.extract(candidate_item))
        feats.update(self.sasrec_extractor.extract(user_id, candidate_item, **overrides))
        feats.update(self.cf_extractor.extract(user_id, candidate_item))
        feats.update(self.seq_extractor.extract(user_id, candidate_item, **overrides))
        return feats
```
Benefits:
- Each extractor is <70 lines (easy to understand)
- Testable in isolation (e.g. `test_sasrec_features.py`)
- Clear separation of concerns
- Easy to add new feature groups
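To make "testable in isolation" concrete, here is a hypothetical sketch: the `UserFeatureExtractor` below is a minimal stand-in that assumes the `load()`/`extract()` contract proposed above, with made-up feature names. It is not the real extractor, only a shape a unit test could target:

```python
class UserFeatureExtractor:
    """Minimal stand-in with the proposed load()/extract() contract."""

    def __init__(self, user_stats: dict):
        self._stats = user_stats  # in the real module this would load from data_dir
        self._loaded = False

    def load(self):
        self._loaded = True

    def extract(self, user_id: str) -> dict:
        assert self._loaded, "call load() first"
        stats = self._stats.get(user_id, {})
        return {"user_rating_count": stats.get("count", 0),
                "user_avg_rating": stats.get("avg", 0.0)}


def test_user_features_in_isolation():
    """No orchestrator, no other extractors: just this unit."""
    extractor = UserFeatureExtractor({"A123": {"count": 7, "avg": 4.2}})
    extractor.load()
    feats = extractor.extract("A123")
    assert feats["user_rating_count"] == 7
    # Unknown users fall back to defaults instead of raising
    assert extractor.extract("unknown")["user_avg_rating"] == 0.0
```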
Refactoring Checklist
Before Refactoring:
- Read existing code thoroughly
- Identify natural module boundaries (feature groups, API layers, etc.)
- Check test coverage (write tests if missing)
- Document current behavior (input/output examples)
During Refactoring:
- Create new modules with single responsibility
- Move code incrementally (one function at a time)
- Run tests after each move
- Update imports in dependent files
- Add docstrings to new modules
After Refactoring:
- Run full test suite
- Check code coverage (should not decrease)
- Update ARCHITECTURE.md if structure changed
- Benchmark performance (ensure no regression)
- Update this guide with lessons learned
Refactoring Patterns
Pattern 1: Extract Feature Groups
When: Large function with distinct sections (user features, item features, etc.)
How:
- Identify groups: user stats, item stats, embeddings, CF scores
- Create one class per group: `UserFeatureExtractor`, `ItemFeatureExtractor`
- Give each class `load()` and `extract()` methods
- Have the main class delegate to the extractors

Example: `features.py` → `feature_groups/*`
Pattern 2: Split API Client from Parser
When: File handles both HTTP requests and response parsing
How:
- Create `api_client.py`: HTTP calls, retries, auth
- Create `parsers.py`: parse JSON responses into domain objects
- Main file orchestrates: `client.fetch()` → `parser.parse()`

Example: `web_search.py` → `google_books_client.py` + `result_parser.py`
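The parser half of this split is pure functions with no I/O, which is what makes it easy to test. A hypothetical sketch follows; the `BookResult` fields and the Google-Books-style `volumeInfo` shape are illustrative assumptions, not the real `web_search.py` API:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BookResult:
    title: str
    authors: list = field(default_factory=list)
    isbn: Optional[str] = None


# parsers.py side: pure JSON-to-domain-object translation, no network calls
def parse_volume(item: dict) -> BookResult:
    info = item.get("volumeInfo", {})
    isbns = [i["identifier"] for i in info.get("industryIdentifiers", [])
             if i.get("type") == "ISBN_13"]
    return BookResult(title=info.get("title", ""),
                      authors=info.get("authors", []),
                      isbn=isbns[0] if isbns else None)

# api_client.py would own HTTP calls, retries, and auth, then hand raw
# JSON to the parser: parse_volume(client.fetch(query))
```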
Pattern 3: Separate Training from Inference
When: File contains both model training code and inference code
How:
- Create `trainer.py`: fit model, save weights
- Create `predictor.py`: load weights, predict
- Put shared utilities in `model_utils.py`

Example: `sasrec_recall.py` → `sasrec_trainer.py` + `sasrec_predictor.py`
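The key property of this split is that the predictor never imports training code, only the persisted weights. A toy sketch, assuming nothing about the real SASRec model (a popularity counter stands in for "weights", and JSON stands in for a real checkpoint format):

```python
import json
from pathlib import Path


class PopularityTrainer:
    """trainer.py side: fit a toy model and persist its weights."""

    def fit(self, interactions: list) -> dict:
        counts: dict = {}
        for _, item_id in interactions:
            counts[item_id] = counts.get(item_id, 0) + 1
        return counts

    def save(self, weights: dict, path: str):
        Path(path).write_text(json.dumps(weights))


class PopularityPredictor:
    """predictor.py side: load weights and predict; no training imports."""

    def __init__(self, path: str):
        self.weights = json.loads(Path(path).read_text())

    def top_k(self, k: int) -> list:
        return sorted(self.weights, key=self.weights.get, reverse=True)[:k]
```

At serving time only the predictor module (and its weight file) needs to be deployed.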
Pattern 4: Extract Helper Functions
When: Function >50 lines with reusable logic
How:
- Identify helper logic (e.g., "compute similarity", "filter results")
- Extract it to a `_helper_function()` or `utils.py`
- The main function becomes high-level orchestration
Example:
```python
# Before (80 lines)
def recommend(self, user_id, k=50):
    # 1. Get history (15 lines)
    # 2. Query similar items (20 lines)
    # 3. Aggregate scores (25 lines)
    # 4. Filter & sort (20 lines)
    ...

# After (20 lines)
def recommend(self, user_id, k=50):
    history = self._get_user_history(user_id)
    similar_items = self._query_similar_items(history)
    scores = self._aggregate_scores(similar_items)
    return self._filter_and_sort(scores, k)
```
Example Refactoring: vector_db.py
Current (368 lines)
```python
class VectorDB:
    def __init__(self):
        ...  # 20 lines
    def search(self):
        ...  # 30 lines
    def hybrid_search(self):
        ...  # 120 lines (complex RRF fusion logic)
    def small_to_big_search(self):
        ...  # 80 lines (hierarchical retrieval)
    def _rrf_fusion(self):
        ...  # 60 lines
    def _build_bm25_index(self):
        ...  # 40 lines
```
Proposed (3 files, each <150 lines)
src/vector_db.py (80 lines):
```python
from src.retrieval.hybrid_retriever import HybridRetriever
from src.retrieval.hierarchical_retriever import HierarchicalRetriever


class VectorDB:
    """Facade over multiple retrieval strategies."""

    def __init__(self):
        self.chroma = Chroma(...)  # dense-only client, as in the current file
        self.hybrid = HybridRetriever()
        self.hierarchical = HierarchicalRetriever()

    def search(self, query, k=10):
        """Dense search only."""
        return self.chroma.search(query, k=k)

    def hybrid_search(self, query, k=10, alpha=0.5):
        """BM25 + dense fusion."""
        return self.hybrid.search(query, k, alpha)

    def small_to_big_search(self, query, k=10):
        """Sentence → book hierarchical retrieval."""
        return self.hierarchical.search(query, k)
```
src/retrieval/hybrid_retriever.py (150 lines):
```python
class HybridRetriever:
    """Combines BM25 (sparse) and dense (embedding) results via RRF."""

    def __init__(self):
        self.bm25 = self._build_bm25_index()
        self.dense_db = Chroma(...)

    def search(self, query, k, alpha):
        dense_results = self._dense_search(query, k)
        sparse_results = self._sparse_search(query, k)
        return self._rrf_fusion(dense_results, sparse_results, alpha)

    def _rrf_fusion(self, dense, sparse, alpha):
        ...  # 60 lines (RRF logic)

    def _build_bm25_index(self):
        ...  # 40 lines
```
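The core of `_rrf_fusion` is small enough to sketch. This is a minimal weighted Reciprocal Rank Fusion, assuming each input is a list of doc ids ordered best-first; the `alpha` split between dense and sparse follows the `hybrid_search(alpha=...)` signature above, and `k0=60` is the common smoothing constant from the RRF literature (the real 60-line version presumably also carries scores and metadata):

```python
def rrf_fusion(dense: list, sparse: list, alpha: float = 0.5, k0: int = 60) -> list:
    """Fuse two ranked id lists with weighted Reciprocal Rank Fusion."""
    scores: dict = {}
    # Each list contributes 1/(k0 + rank), weighted by alpha vs (1 - alpha)
    for rank, doc_id in enumerate(dense, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + alpha / (k0 + rank)
    for rank, doc_id in enumerate(sparse, start=1):
        scores[doc_id] = scores.get(doc_id, 0.0) + (1 - alpha) / (k0 + rank)
    # Docs appearing in both lists accumulate score and rise to the top
    return sorted(scores, key=scores.get, reverse=True)
```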
src/retrieval/hierarchical_retriever.py (130 lines):
```python
class HierarchicalRetriever:
    """Small-to-big (sentence → book) retrieval."""

    def search(self, query, k):
        # 1. Find matching sentences (child nodes)
        sentences = self.sentence_db.search(query, k=50)
        # 2. Map to parent books (ISBNs)
        book_isbns = self._map_to_parents(sentences)
        # 3. Return full book context
        return self._enrich_with_metadata(book_isbns, k)
```
Migration Strategy
Incremental Refactoring (Safe)
Week 1: Extract utilities
- Move helper functions to `utils.py`
- No API changes
- Run tests

Week 2: Create new modules
- Add the `feature_groups/` directory
- Copy (don't move) code into the new modules
- Old code still works

Week 3: Update imports
- Change `features.py` to use the new modules
- Mark old methods as `@deprecated`
- Update tests

Week 4: Remove old code
- Delete deprecated methods
- Update ARCHITECTURE.md
- Celebrate! 🎉
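The standard library has no `@deprecated` decorator before Python 3.13 (where `warnings.deprecated` was added), so marking old methods during the migration window means writing a small one. A minimal sketch; the decorator name and message format are illustrative:

```python
import functools
import warnings


def deprecated(replacement: str):
    """Decorator that warns on each call to a deprecated function."""
    def wrap(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            # stacklevel=2 points the warning at the caller, not this wrapper
            warnings.warn(f"{func.__name__} is deprecated, use {replacement}",
                          DeprecationWarning, stacklevel=2)
            return func(*args, **kwargs)
        return inner
    return wrap


@deprecated("FeatureEngineer.generate_features")
def old_generate(user_id):
    return {"user_id": user_id}
```

Callers keep working through weeks 3-4, but test runs surface every remaining use of the old path.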
Testing Strategy
Before Refactoring: Capture Behavior
```python
# tests/test_features_baseline.py
import json

from src.ranking.features import FeatureEngineer


def test_feature_extraction_baseline():
    """Capture current behavior before refactoring."""
    fe = FeatureEngineer()
    fe.load_base_data()

    # Test on a known user/item pair
    feats = fe.generate_features("A123", "0123456789")

    # Save as the baseline
    with open("tests/baseline_features.json", "w") as f:
        json.dump(feats, f)
```
After Refactoring: Compare Output
```python
# tests/test_features_refactored.py
import json

from src.ranking.features import FeatureEngineer


def test_feature_extraction_matches_baseline():
    """Ensure the refactored code produces identical output."""
    fe = FeatureEngineer()  # New refactored version
    fe.load_base_data()
    feats = fe.generate_features("A123", "0123456789")

    # Load the baseline
    with open("tests/baseline_features.json") as f:
        baseline = json.load(f)

    # Compare feature-by-feature
    for key in baseline:
        assert key in feats, f"Missing feature: {key}"
        assert abs(feats[key] - baseline[key]) < 1e-6, f"Feature {key} mismatch"
```
Performance Considerations
Refactoring Should Not Slow Things Down
Before Refactoring: Benchmark
```shell
python benchmarks/benchmark.py > baseline.txt
```
After Refactoring: Re-benchmark
```shell
python benchmarks/benchmark.py > refactored.txt
diff baseline.txt refactored.txt
```
Acceptable Changes:
- Latency: Β±5% (small overhead from function calls OK)
- Memory: Β±10% (object overhead from new classes OK)
Red Flags:
- Latency increase >10% β Investigate (unnecessary copies? Extra I/O?)
- Memory increase >20% β Investigate (duplicate data? Leaks?)
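When a red flag appears, it helps to time one hot path in isolation rather than re-running the full benchmark. A minimal sketch using `timeit`, assuming both the old and new implementations are callable side by side (which is the case during the copy-don't-move migration window above); the helper name and thresholds are illustrative:

```python
import timeit


def compare_latency(old_fn, new_fn, *args, runs: int = 5, number: int = 100) -> float:
    """Return new/old latency ratio; > 1.10 breaches the 10% red-flag line."""
    # min() over repeats is the usual way to reduce scheduler noise
    old_t = min(timeit.repeat(lambda: old_fn(*args), repeat=runs, number=number))
    new_t = min(timeit.repeat(lambda: new_fn(*args), repeat=runs, number=number))
    return new_t / old_t
```

For example, `compare_latency(old_recommend, new_recommend, "A123")` on a known user pinpoints whether the regression lives in that code path at all.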
Common Pitfalls
Pitfall 1: Over-Refactoring
Problem: Creating too many tiny files (10-20 lines each)
Solution: Aim for 50-150 lines per file. Smaller isn't always better.
Pitfall 2: Breaking API Compatibility
Problem: Refactoring changes function signatures, breaking existing code
Solution: Use deprecation warnings
```python
import warnings

def old_method(self, x):
    warnings.warn("old_method is deprecated, use new_method", DeprecationWarning)
    return self.new_method(x)
```
Pitfall 3: Not Updating Tests
Problem: Tests still import old paths
Solution: Update imports incrementally
```python
# Before
from src.ranking.features import FeatureEngineer

# After
from src.ranking.features import FeatureEngineer  # Facade still works
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor  # Specific
```
When NOT to Refactor
- Deadline pressure: Refactoring during crunch time adds risk
- No tests: Write tests first, then refactor
- Unclear requirements: Understand the code before restructuring
- "Just because": Only refactor if it improves readability/testability
Rule of Thumb: Refactor when you're touching the code for a feature/bug anyway. Don't refactor in isolation.
Success Metrics
How do you know refactoring succeeded?
- Code is easier to understand (ask a teammate)
- Tests pass (same behavior)
- Performance unchanged (benchmark)
- Test coverage increased (easier to test small modules)
- Future changes are faster (measure time to implement next feature)
Related Documentation
- ARCHITECTURE.md β Overall system structure
- DEVELOPMENT.md β Development guidelines
- ONBOARDING.md β New contributor guide
Refactoring Backlog
Track refactoring work here:
| File | Priority | Assigned | Status | Target Date |
|---|---|---|---|---|
| `ranking/features.py` | P0 | Unassigned | Not Started | TBD |
| `core/web_search.py` | P1 | Unassigned | Not Started | TBD |
| `vector_db.py` | P0 | Unassigned | Not Started | TBD |
Last Updated: 2026-02-12 Next Review: After v3.0 planning