# Refactoring Guide: Large Files

**Status**: Active
**Audience**: Engineers planning refactoring work
**Last Updated**: 2026-02-12
**Priority**: P2 (Quality improvement, not blocking)

---

## Overview

Several files in the codebase exceed 300 lines. While not critical, breaking them into smaller modules improves:

- **Readability**: Each component is easier to understand
- **Testability**: Smaller units are easier to test in isolation
- **Maintainability**: Changes have a smaller blast radius

**Industry standard**: Files under 300 lines, functions under 50 lines.

---

## Files Requiring Refactoring

| File | Lines | Complexity | Priority | Suggested Split |
|------|-------|------------|----------|-----------------|
| `src/ranking/features.py` | 470 | High | P0 | Extract feature groups into separate modules |
| `src/core/web_search.py` | 427 | Medium | P1 | Separate API client from result parsing |
| `src/vector_db.py` | 368 | High | P0 | Split hybrid search from small-to-big retrieval |
| `src/recall/sasrec_recall.py` | 337 | Medium | P2 | Extract embedding logic |
| `src/recall/embedding.py` | 333 | Medium | P2 | Separate training from inference |
| `src/services/recommend_service.py` | 313 | High | P1 | Extract diversity reranking orchestration |
| `src/core/recommendation_orchestrator.py` | 293 | Medium | P2 | Extract query preprocessing |
| `src/core/metadata_store.py` | 288 | Low | P3 | Already well-structured (mostly SQL) |

---

## Refactoring Strategy: features.py (Example)

### Current Structure (470 lines)

```python
class FeatureEngineer:
    def __init__(self):               # 20 lines
    def _load_sasrec_features(self):  # 35 lines
    def _load_user_sequences(self):   # 20 lines
    def load_base_data(self):         # 30 lines
    def generate_features(self):      # 180 lines (HUGE!)
        # - User stats (15 lines)
        # - Item stats (15 lines)
        # - SASRec features (40 lines)
        # - ItemCF features (30 lines)
        # - UserCF features (30 lines)
        # - Last-N similarity (50 lines)
    def extract_features(self):       # 120 lines
```

### Proposed Refactoring

**New structure**:

```
src/ranking/
├── features.py                 # Orchestrator (100 lines)
├── feature_groups/
│   ├── __init__.py
│   ├── user_features.py        # User stats extraction (40 lines)
│   ├── item_features.py        # Item stats extraction (40 lines)
│   ├── sasrec_features.py      # SASRec embedding features (60 lines)
│   ├── cf_features.py          # ItemCF + UserCF features (70 lines)
│   └── sequence_features.py    # Last-N similarity (60 lines)
```

**Refactored `features.py`** (100 lines):

```python
from pathlib import Path

from src.ranking.feature_groups.user_features import UserFeatureExtractor
from src.ranking.feature_groups.item_features import ItemFeatureExtractor
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor
from src.ranking.feature_groups.cf_features import CFFeatureExtractor
from src.ranking.feature_groups.sequence_features import SequenceFeatureExtractor


class FeatureEngineer:
    """Orchestrates feature extraction from multiple sources."""

    def __init__(self, data_dir='data/rec', model_dir='data/model/recall'):
        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir)

        # Initialize sub-extractors
        self.user_extractor = UserFeatureExtractor(data_dir)
        self.item_extractor = ItemFeatureExtractor(data_dir)
        self.sasrec_extractor = SASRecFeatureExtractor(data_dir, model_dir)
        self.cf_extractor = CFFeatureExtractor(data_dir, model_dir)
        self.seq_extractor = SequenceFeatureExtractor(data_dir)

    def load_base_data(self):
        """Load all feature extractors."""
        self.user_extractor.load()
        self.item_extractor.load()
        self.sasrec_extractor.load()
        self.cf_extractor.load()
        self.seq_extractor.load()

    def generate_features(self, user_id, candidate_item, **overrides):
        """Generate a feature vector by delegating to sub-extractors."""
        feats = {}

        # Delegate to each extractor
        feats.update(self.user_extractor.extract(user_id))
        feats.update(self.item_extractor.extract(candidate_item))
        feats.update(self.sasrec_extractor.extract(user_id, candidate_item, **overrides))
        feats.update(self.cf_extractor.extract(user_id, candidate_item))
        feats.update(self.seq_extractor.extract(user_id, candidate_item, **overrides))

        return feats
```

**Benefits**:

- Each extractor is under 70 lines (easy to understand)
- Testable in isolation: `test_sasrec_features.py`
- Clear separation of concerns
- Easy to add new feature groups

---

## Refactoring Checklist

**Before refactoring**:

- [ ] Read the existing code thoroughly
- [ ] Identify natural module boundaries (feature groups, API layers, etc.)
- [ ] Check test coverage (write tests if missing)
- [ ] Document current behavior (input/output examples)

**During refactoring**:

- [ ] Create new modules with a single responsibility
- [ ] Move code incrementally (one function at a time)
- [ ] Run tests after each move
- [ ] Update imports in dependent files
- [ ] Add docstrings to new modules

**After refactoring**:

- [ ] Run the full test suite
- [ ] Check code coverage (should not decrease)
- [ ] Update ARCHITECTURE.md if the structure changed
- [ ] Benchmark performance (ensure no regression)
- [ ] Update this guide with lessons learned

---

## Refactoring Patterns

### Pattern 1: Extract Feature Groups

**When**: A large function with distinct sections (user features, item features, etc.)

**How**:

1. Identify groups: user stats, item stats, embeddings, CF scores
2. Create one class per group: `UserFeatureExtractor`, `ItemFeatureExtractor`
3. Give each class `load()` and `extract()` methods
4. Have the main class delegate to the extractors

**Example**: `features.py` → `feature_groups/*`

---

### Pattern 2: Split API Client from Parser

**When**: A file handles both HTTP requests and response parsing

**How**:

1. Create `api_client.py`: HTTP calls, retries, auth
2. Create `parsers.py`: Parse JSON responses into domain objects
3. Have the main file orchestrate: `client.fetch()` → `parser.parse()`

**Example**: `web_search.py` → `google_books_client.py` + `result_parser.py`

---

### Pattern 3: Separate Training from Inference

**When**: A file contains both model training code and inference code

**How**:

1. Create `trainer.py`: Fit the model, save weights
2. Create `predictor.py`: Load weights, predict
3. Put shared utilities in `model_utils.py`

**Example**: `sasrec_recall.py` → `sasrec_trainer.py` + `sasrec_predictor.py`

---

### Pattern 4: Extract Helper Functions

**When**: A function exceeds 50 lines and contains reusable logic

**How**:

1. Identify helper logic (e.g., "compute similarity", "filter results")
2. Extract it to `_helper_function()` or `utils.py`
3. The main function becomes high-level orchestration

**Example**:

```python
# Before (80 lines)
def recommend(self, user_id, k=50):
    # 1. Get history (15 lines)
    # 2. Query similar items (20 lines)
    # 3. Aggregate scores (25 lines)
    # 4. Filter & sort (20 lines)

# After (20 lines)
def recommend(self, user_id, k=50):
    history = self._get_user_history(user_id)
    similar_items = self._query_similar_items(history)
    scores = self._aggregate_scores(similar_items)
    return self._filter_and_sort(scores, k)
```

---

## Example Refactoring: vector_db.py

### Current (368 lines)

```python
class VectorDB:
    def __init__(self):             # 20 lines
    def search(self):               # 30 lines
    def hybrid_search(self):        # 120 lines (complex RRF fusion logic)
    def small_to_big_search(self):  # 80 lines (hierarchical retrieval)
    def _rrf_fusion(self):          # 60 lines
    def _build_bm25_index(self):    # 40 lines
```

### Proposed (3 files, each <150 lines)

**src/vector_db.py** (80 lines):

```python
from src.retrieval.hybrid_retriever import HybridRetriever
from src.retrieval.hierarchical_retriever import HierarchicalRetriever


class VectorDB:
    """Facade for multiple retrieval strategies."""

    def __init__(self):
        self.hybrid = HybridRetriever()
        self.hierarchical = HierarchicalRetriever()

    def search(self, query, k=10):
        """Dense search only."""
        return self.chroma.search(query, k=k)

    def hybrid_search(self, query, k=10, alpha=0.5):
        """BM25 + dense fusion."""
        return self.hybrid.search(query, k, alpha)

    def small_to_big_search(self, query, k=10):
        """Sentence → book hierarchical retrieval."""
        return self.hierarchical.search(query, k)
```

**src/retrieval/hybrid_retriever.py** (150 lines):

```python
class HybridRetriever:
    """Combines BM25 (sparse) + dense (embeddings) via RRF."""

    def __init__(self):
        self.bm25 = self._build_bm25_index()
        self.dense_db = Chroma(...)

    def search(self, query, k, alpha):
        dense_results = self._dense_search(query, k)
        sparse_results = self._sparse_search(query, k)
        return self._rrf_fusion(dense_results, sparse_results, alpha)

    def _rrf_fusion(self, dense, sparse, alpha):
        ...  # 60 lines (RRF logic)

    def _build_bm25_index(self):
        ...  # 40 lines
```

**src/retrieval/hierarchical_retriever.py** (130 lines):

```python
class HierarchicalRetriever:
    """Small-to-big (sentence → book) retrieval."""

    def search(self, query, k):
        # 1. Find matching sentences (child nodes)
        sentences = self.sentence_db.search(query, k=50)

        # 2. Map to parent books (ISBNs)
        book_isbns = self._map_to_parents(sentences)

        # 3. Return full book context
        return self._enrich_with_metadata(book_isbns, k)
```

---

## Migration Strategy

### Incremental Refactoring (Safe)

**Week 1: Extract utilities**

- Move helper functions to `utils.py`
- No API changes
- Run tests

**Week 2: Create new modules**

- Add the `feature_groups/` directory
- Copy (don't move) code to the new modules
- Old code still works

**Week 3: Update imports**

- Change `features.py` to use the new modules
- Mark old methods as `@deprecated`
- Update tests

**Week 4: Remove old code**

- Delete the deprecated methods
- Update ARCHITECTURE.md
- Celebrate!
🎉

---

## Testing Strategy

### Before Refactoring: Capture Behavior

```python
# tests/test_features_baseline.py
import json

def test_feature_extraction_baseline():
    """Capture current behavior before refactoring."""
    fe = FeatureEngineer()
    fe.load_base_data()

    # Test on a known user/item pair
    feats = fe.generate_features("A123", "0123456789")

    # Save as the baseline
    with open("tests/baseline_features.json", "w") as f:
        json.dump(feats, f)
```

### After Refactoring: Compare Output

```python
# tests/test_features_refactored.py
import json

def test_feature_extraction_matches_baseline():
    """Ensure the refactored code produces identical output."""
    fe = FeatureEngineer()  # New refactored version
    fe.load_base_data()
    feats = fe.generate_features("A123", "0123456789")

    # Load the baseline
    with open("tests/baseline_features.json") as f:
        baseline = json.load(f)

    # Compare feature by feature
    for key in baseline:
        assert key in feats, f"Missing feature: {key}"
        assert abs(feats[key] - baseline[key]) < 1e-6, f"Feature {key} mismatch"
```

---

## Performance Considerations

### Refactoring Should Not Slow Things Down

**Before refactoring**: Benchmark

```bash
python benchmarks/benchmark.py > baseline.txt
```

**After refactoring**: Re-benchmark

```bash
python benchmarks/benchmark.py > refactored.txt
diff baseline.txt refactored.txt
```

**Acceptable changes**:

- Latency: ±5% (small overhead from extra function calls is OK)
- Memory: ±10% (object overhead from new classes is OK)

**Red flags**:

- Latency increase >10% → Investigate (unnecessary copies? extra I/O?)
- Memory increase >20% → Investigate (duplicated data? leaks?)

---

## Common Pitfalls

### Pitfall 1: Over-Refactoring

**Problem**: Creating too many tiny files (10-20 lines each)

**Solution**: Aim for 50-150 lines per file. Smaller isn't always better.
---

### Pitfall 2: Breaking API Compatibility

**Problem**: Refactoring changes function signatures, breaking existing code

**Solution**: Use deprecation warnings

```python
import warnings

def old_method(self, x):
    warnings.warn("old_method is deprecated, use new_method",
                  DeprecationWarning, stacklevel=2)
    return self.new_method(x)
```

---

### Pitfall 3: Not Updating Tests

**Problem**: Tests still import the old paths

**Solution**: Update imports incrementally

```python
# Before
from src.ranking.features import FeatureEngineer

# After
from src.ranking.features import FeatureEngineer  # Facade
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor  # Specific extractor
```

---

## When NOT to Refactor

- **Deadline pressure**: Refactoring during crunch time adds risk
- **No tests**: Write tests first, then refactor
- **Unclear requirements**: Understand the code before restructuring
- **"Just because"**: Only refactor if it improves readability or testability

**Rule of thumb**: Refactor when you're touching the code for a feature or bug fix anyway. Don't refactor in isolation.

---

## Success Metrics

How do you know refactoring succeeded?
- [ ] Code is easier to understand (ask a teammate)
- [ ] Tests pass (same behavior)
- [ ] Performance is unchanged (benchmark)
- [ ] Test coverage increased (small modules are easier to test)
- [ ] Future changes are faster (measure time to implement the next feature)

---

## Related Documentation

- [ARCHITECTURE.md](../ARCHITECTURE.md) — Overall system structure
- [DEVELOPMENT.md](DEVELOPMENT.md) — Development guidelines
- [ONBOARDING.md](../ONBOARDING.md) — New contributor guide

---

## Refactoring Backlog

Track refactoring work here:

| File | Priority | Assigned | Status | Target Date |
|------|----------|----------|--------|-------------|
| `ranking/features.py` | P0 | Unassigned | Not Started | TBD |
| `core/web_search.py` | P1 | Unassigned | Not Started | TBD |
| `vector_db.py` | P0 | Unassigned | Not Started | TBD |

---

**Last Updated**: 2026-02-12
**Next Review**: After v3.0 planning