# Refactoring Guide: Large Files

**Status**: Active
**Audience**: Engineers planning refactoring work
**Last Updated**: 2026-02-12
**Priority**: P2 (Quality improvement, not blocking)

---

## Overview

Several files in the codebase approach or exceed 300 lines. While not critical, breaking them into smaller modules improves:

- **Readability**: Each component is easier to understand
- **Testability**: Smaller units are easier to test in isolation
- **Maintainability**: Changes have a smaller blast radius

**Industry Standard**: Files <300 lines, functions <50 lines.
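To find candidates mechanically, a short script can list files over the threshold. This is a sketch; the `src` root, the `.py` filter, and the function name are assumptions, not part of the codebase:

```python
from pathlib import Path

def find_long_files(root: str = "src", max_lines: int = 300) -> list[tuple[str, int]]:
    """Return (path, line_count) for Python files over max_lines, longest first."""
    hits = []
    for path in Path(root).rglob("*.py"):
        # Count lines without loading the whole file into memory
        with path.open(encoding="utf-8", errors="ignore") as f:
            n = sum(1 for _ in f)
        if n > max_lines:
            hits.append((str(path), n))
    return sorted(hits, key=lambda t: -t[1])
```

Running it periodically (or in CI) keeps the table below honest.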
---

## Files Requiring Refactoring

| File | Lines | Complexity | Priority | Suggested Split |
|------|-------|------------|----------|-----------------|
| `src/ranking/features.py` | 470 | High | P0 | Extract feature groups into separate modules |
| `src/core/web_search.py` | 427 | Medium | P1 | Separate API client from result parsing |
| `src/vector_db.py` | 368 | High | P0 | Split hybrid search from small-to-big retrieval |
| `src/recall/sasrec_recall.py` | 337 | Medium | P2 | Extract embedding logic |
| `src/recall/embedding.py` | 333 | Medium | P2 | Separate training from inference |
| `src/services/recommend_service.py` | 313 | High | P1 | Extract diversity reranking orchestration |
| `src/core/recommendation_orchestrator.py` | 293 | Medium | P2 | Extract query preprocessing |
| `src/core/metadata_store.py` | 288 | Low | P3 | Already well-structured (mostly SQL) |

---
## Refactoring Strategy: features.py (Example)

### Current Structure (470 lines)

```python
class FeatureEngineer:
    def __init__(self):
        # 20 lines
    def _load_sasrec_features(self):
        # 35 lines
    def _load_user_sequences(self):
        # 20 lines
    def load_base_data(self):
        # 30 lines
    def generate_features(self):
        # 180 lines (HUGE!)
        # - User stats (15 lines)
        # - Item stats (15 lines)
        # - SASRec features (40 lines)
        # - ItemCF features (30 lines)
        # - UserCF features (30 lines)
        # - Last-N similarity (50 lines)
    def extract_features(self):
        # 120 lines
```
### Proposed Refactoring

**New Structure**:

```
src/ranking/
├── features.py              # Orchestrator (100 lines)
├── feature_groups/
│   ├── __init__.py
│   ├── user_features.py     # User stats extraction (40 lines)
│   ├── item_features.py     # Item stats extraction (40 lines)
│   ├── sasrec_features.py   # SASRec embedding features (60 lines)
│   ├── cf_features.py       # ItemCF + UserCF features (70 lines)
│   └── sequence_features.py # Last-N similarity (60 lines)
```
**Refactored `features.py`** (100 lines):

```python
from pathlib import Path

from src.ranking.feature_groups.user_features import UserFeatureExtractor
from src.ranking.feature_groups.item_features import ItemFeatureExtractor
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor
from src.ranking.feature_groups.cf_features import CFFeatureExtractor
from src.ranking.feature_groups.sequence_features import SequenceFeatureExtractor


class FeatureEngineer:
    """Orchestrates feature extraction from multiple sources."""

    def __init__(self, data_dir='data/rec', model_dir='data/model/recall'):
        self.data_dir = Path(data_dir)
        self.model_dir = Path(model_dir)
        # Initialize sub-extractors
        self.user_extractor = UserFeatureExtractor(data_dir)
        self.item_extractor = ItemFeatureExtractor(data_dir)
        self.sasrec_extractor = SASRecFeatureExtractor(data_dir, model_dir)
        self.cf_extractor = CFFeatureExtractor(data_dir, model_dir)
        self.seq_extractor = SequenceFeatureExtractor(data_dir)

    def load_base_data(self):
        """Load data for all feature extractors."""
        self.user_extractor.load()
        self.item_extractor.load()
        self.sasrec_extractor.load()
        self.cf_extractor.load()
        self.seq_extractor.load()

    def generate_features(self, user_id, candidate_item, **overrides):
        """Generate a feature vector by delegating to sub-extractors."""
        feats = {}
        # Delegate to each extractor
        feats.update(self.user_extractor.extract(user_id))
        feats.update(self.item_extractor.extract(candidate_item))
        feats.update(self.sasrec_extractor.extract(user_id, candidate_item, **overrides))
        feats.update(self.cf_extractor.extract(user_id, candidate_item))
        feats.update(self.seq_extractor.extract(user_id, candidate_item, **overrides))
        return feats
```
**Benefits**:

- Each extractor is under 70 lines (easy to understand)
- Testable in isolation: `test_sasrec_features.py`
- Clear separation of concerns
- Easy to add new feature groups

---
## Refactoring Checklist

**Before Refactoring**:

- [ ] Read the existing code thoroughly
- [ ] Identify natural module boundaries (feature groups, API layers, etc.)
- [ ] Check test coverage (write tests if missing)
- [ ] Document current behavior (input/output examples)

**During Refactoring**:

- [ ] Create new modules with a single responsibility
- [ ] Move code incrementally (one function at a time)
- [ ] Run tests after each move
- [ ] Update imports in dependent files
- [ ] Add docstrings to new modules

**After Refactoring**:

- [ ] Run the full test suite
- [ ] Check code coverage (it should not decrease)
- [ ] Update ARCHITECTURE.md if the structure changed
- [ ] Benchmark performance (ensure no regression)
- [ ] Update this guide with lessons learned

---
## Refactoring Patterns

### Pattern 1: Extract Feature Groups

**When**: A large function with distinct sections (user features, item features, etc.)

**How**:
1. Identify groups: user stats, item stats, embeddings, CF scores
2. Create one class per group: `UserFeatureExtractor`, `ItemFeatureExtractor`
3. Give each class `load()` and `extract()` methods
4. Have the main class delegate to the extractors

**Example**: `features.py` → `feature_groups/*`
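A minimal sketch of one such extractor, assuming per-user stats are precomputed into a JSON file; the file name (`user_stats.json`) and the feature keys are hypothetical:

```python
import json
from pathlib import Path

class UserFeatureExtractor:
    """One feature group: per-user statistics with a load()/extract() interface."""

    def __init__(self, data_dir: str):
        self.data_dir = Path(data_dir)
        self.user_stats: dict = {}

    def load(self) -> None:
        # Load precomputed stats once, so extract() stays cheap per call
        stats_file = self.data_dir / "user_stats.json"
        if stats_file.exists():
            self.user_stats = json.loads(stats_file.read_text())

    def extract(self, user_id: str) -> dict:
        # Fall back to neutral defaults for unseen users
        stats = self.user_stats.get(user_id, {})
        return {
            "user_rating_count": stats.get("count", 0),
            "user_avg_rating": stats.get("avg", 0.0),
        }
```

Because the class holds no references to other feature groups, a `tests/test_user_features.py` can exercise it with a tiny fixture file.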
---

### Pattern 2: Split API Client from Parser

**When**: A file handles both HTTP requests and response parsing

**How**:
1. Create `api_client.py`: HTTP calls, retries, auth
2. Create `parsers.py`: parse JSON responses into domain objects
3. The main file orchestrates: `client.fetch()` → `parser.parse()`

**Example**: `web_search.py` → `google_books_client.py` + `result_parser.py`
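A sketch of the split, with the transport injected as a callable so the parser can be tested offline. The response shape and all names here are assumptions, not the real Google Books schema:

```python
import json

class ResultParser:
    """Turns a raw JSON response body into plain result dicts."""

    def parse(self, raw: str) -> list[dict]:
        payload = json.loads(raw)
        return [
            {"title": item.get("title", ""), "isbn": item.get("isbn", "")}
            for item in payload.get("items", [])
        ]

class SearchClient:
    """Owns transport concerns (HTTP, retries, auth); delegates parsing."""

    def __init__(self, fetch_fn, parser: ResultParser):
        self.fetch_fn = fetch_fn  # e.g. an HTTP-backed callable in production
        self.parser = parser

    def search(self, query: str) -> list[dict]:
        return self.parser.parse(self.fetch_fn(query))
```

Injecting `fetch_fn` means parser tests never touch the network, and retry/auth changes never touch parsing code.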
---

### Pattern 3: Separate Training from Inference

**When**: A file contains both model training code and inference code

**How**:
1. Create `trainer.py`: fit the model, save weights
2. Create `predictor.py`: load weights, predict
3. Put shared utilities in `model_utils.py`

**Example**: `sasrec_recall.py` → `sasrec_trainer.py` + `sasrec_predictor.py`
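The shape of the split, using a trivial popularity count as a stand-in for real SASRec training; the class names and file format are hypothetical:

```python
import json
from pathlib import Path

class Trainer:
    """Fits and saves; inference code never imports this module."""

    def fit(self, interactions: list[tuple[str, str]]) -> dict:
        # Count item occurrences as a stand-in for real model fitting
        counts: dict = {}
        for _user, item in interactions:
            counts[item] = counts.get(item, 0) + 1
        return counts

    def save(self, model: dict, path: str) -> None:
        Path(path).write_text(json.dumps(model))

class Predictor:
    """Loads saved weights and serves predictions; no training dependencies."""

    def __init__(self, path: str):
        self.model = json.loads(Path(path).read_text())

    def top_k(self, k: int) -> list[str]:
        return sorted(self.model, key=self.model.get, reverse=True)[:k]
```

The payoff: serving deploys only `Predictor` and its (lighter) dependencies, and training-code churn cannot break inference.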
---

### Pattern 4: Extract Helper Functions

**When**: A function over 50 lines with reusable logic

**How**:
1. Identify helper logic (e.g., "compute similarity", "filter results")
2. Extract it to a `_helper_function()` or `utils.py`
3. The main function becomes high-level orchestration

**Example**:

```python
# Before (80 lines)
def recommend(self, user_id, k=50):
    # 1. Get history (15 lines)
    # 2. Query similar items (20 lines)
    # 3. Aggregate scores (25 lines)
    # 4. Filter & sort (20 lines)

# After (20 lines)
def recommend(self, user_id, k=50):
    history = self._get_user_history(user_id)
    similar_items = self._query_similar_items(history)
    scores = self._aggregate_scores(similar_items)
    return self._filter_and_sort(scores, k)
```
---

## Example Refactoring: vector_db.py

### Current (368 lines)

```python
class VectorDB:
    def __init__(self):
        # 20 lines
    def search(self):
        # 30 lines
    def hybrid_search(self):
        # 120 lines (complex RRF fusion logic)
    def small_to_big_search(self):
        # 80 lines (hierarchical retrieval)
    def _rrf_fusion(self):
        # 60 lines
    def _build_bm25_index(self):
        # 40 lines
```
### Proposed (3 files, each under 150 lines)

**src/vector_db.py** (80 lines):

```python
from src.retrieval.hybrid_retriever import HybridRetriever
from src.retrieval.hierarchical_retriever import HierarchicalRetriever


class VectorDB:
    """Facade for multiple retrieval strategies."""

    def __init__(self):
        self.chroma = Chroma(...)  # dense-only index
        self.hybrid = HybridRetriever()
        self.hierarchical = HierarchicalRetriever()

    def search(self, query, k=10):
        """Dense search only."""
        return self.chroma.search(query, k=k)

    def hybrid_search(self, query, k=10, alpha=0.5):
        """BM25 + dense fusion."""
        return self.hybrid.search(query, k, alpha)

    def small_to_big_search(self, query, k=10):
        """Sentence → book hierarchical retrieval."""
        return self.hierarchical.search(query, k)
```
**src/retrieval/hybrid_retriever.py** (150 lines):

```python
class HybridRetriever:
    """Combines BM25 (sparse) + dense (embeddings) results via RRF."""

    def __init__(self):
        self.bm25 = self._build_bm25_index()
        self.dense_db = Chroma(...)

    def search(self, query, k, alpha):
        dense_results = self._dense_search(query, k)
        sparse_results = self._sparse_search(query, k)
        return self._rrf_fusion(dense_results, sparse_results, alpha)

    def _rrf_fusion(self, dense, sparse, alpha):
        # 60 lines (RRF logic)

    def _build_bm25_index(self):
        # 40 lines
```
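For reference, the core of an RRF fusion can be very small. The standard formulation scores each document as a sum of 1/(k + rank) over the result lists, with k = 60 as the common default; folding `alpha` in as a per-list weight is an assumption about how this codebase combines the two retrievers:

```python
def rrf_fusion(dense_ids: list[str], sparse_ids: list[str],
               alpha: float = 0.5, k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion with a dense/sparse weight.

    score(d) = alpha * 1/(k + rank_dense) + (1 - alpha) * 1/(k + rank_sparse)
    """
    scores: dict = {}
    for weight, ranked in ((alpha, dense_ids), (1.0 - alpha, sparse_ids)):
        for rank, doc_id in enumerate(ranked, start=1):  # ranks start at 1
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents appearing in both lists accumulate score from both, which is what pushes consensus results to the top.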
**src/retrieval/hierarchical_retriever.py** (130 lines):

```python
class HierarchicalRetriever:
    """Small-to-big (sentence → book) retrieval."""

    def search(self, query, k):
        # 1. Find matching sentences (child nodes)
        sentences = self.sentence_db.search(query, k=50)
        # 2. Map to parent books (ISBNs)
        book_isbns = self._map_to_parents(sentences)
        # 3. Return full book context
        return self._enrich_with_metadata(book_isbns, k)
```
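The `_map_to_parents` step is mostly deduplication that preserves rank order; a sketch, assuming a precomputed sentence-ID → ISBN mapping (the mapping's source is not specified here):

```python
def map_to_parents(sentence_ids: list[str], parent_of: dict) -> list[str]:
    """Collapse child (sentence) hits to unique parent ISBNs, keeping best-rank order."""
    seen: set = set()
    parents: list[str] = []
    for sent_id in sentence_ids:
        isbn = parent_of.get(sent_id)
        if isbn and isbn not in seen:  # skip unknown sentences and repeat parents
            seen.add(isbn)
            parents.append(isbn)
    return parents
```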
---

## Migration Strategy

### Incremental Refactoring (Safe)

**Week 1: Extract utilities**
- Move helper functions to `utils.py`
- No API changes
- Run tests

**Week 2: Create new modules**
- Add the `feature_groups/` directory
- Copy (don't move) code into the new modules
- Old code still works

**Week 3: Update imports**
- Change `features.py` to use the new modules
- Mark old methods as deprecated (emit `DeprecationWarning`)
- Update tests

**Week 4: Remove old code**
- Delete deprecated methods
- Update ARCHITECTURE.md
- Celebrate! 🎉

---
---

## Testing Strategy

### Before Refactoring: Capture Behavior

```python
# tests/test_features_baseline.py
import json

def test_feature_extraction_baseline():
    """Capture current behavior before refactoring."""
    fe = FeatureEngineer()
    fe.load_base_data()
    # Test on a known user/item pair
    feats = fe.generate_features("A123", "0123456789")
    # Save the result as the baseline
    with open("tests/baseline_features.json", "w") as f:
        json.dump(feats, f)
```
### After Refactoring: Compare Output

```python
# tests/test_features_refactored.py
import json

def test_feature_extraction_matches_baseline():
    """Ensure the refactored code produces identical output."""
    fe = FeatureEngineer()  # New refactored version
    fe.load_base_data()
    feats = fe.generate_features("A123", "0123456789")
    # Load the baseline
    with open("tests/baseline_features.json") as f:
        baseline = json.load(f)
    # Compare feature-by-feature (assumes numeric feature values)
    for key in baseline:
        assert key in feats, f"Missing feature: {key}"
        assert abs(feats[key] - baseline[key]) < 1e-6, f"Feature {key} mismatch"
```
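If some features are non-numeric (IDs, categorical flags), a comparison helper that applies the tolerance only to floats avoids false failures; a sketch:

```python
def features_match(actual: dict, expected: dict, tol: float = 1e-6) -> list[str]:
    """Return mismatch descriptions; an empty list means the dicts agree."""
    problems = []
    for key, exp in expected.items():
        if key not in actual:
            problems.append(f"missing: {key}")
        elif isinstance(exp, float):
            # Numeric comparison with tolerance for float drift
            if abs(actual[key] - exp) > tol:
                problems.append(f"numeric mismatch: {key}")
        elif actual[key] != exp:
            # Exact comparison for everything else
            problems.append(f"value mismatch: {key}")
    return problems
```

Returning a list (rather than asserting inside) lets one test report every mismatched feature at once.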
---

## Performance Considerations

### Refactoring Should Not Slow Things Down

**Before Refactoring**: Benchmark

```bash
python benchmarks/benchmark.py > baseline.txt
```

**After Refactoring**: Re-benchmark and compare

```bash
python benchmarks/benchmark.py > refactored.txt
diff baseline.txt refactored.txt
```

**Acceptable Changes**:
- Latency: ±5% (small overhead from extra function calls is OK)
- Memory: ±10% (object overhead from new classes is OK)

**Red Flags**:
- Latency increase >10% → investigate (unnecessary copies? extra I/O?)
- Memory increase >20% → investigate (duplicate data? leaks?)
---

## Common Pitfalls

### Pitfall 1: Over-Refactoring

**Problem**: Creating too many tiny files (10-20 lines each)

**Solution**: Aim for 50-150 lines per file. Smaller isn't always better.

---

### Pitfall 2: Breaking API Compatibility

**Problem**: Refactoring changes function signatures, breaking existing callers

**Solution**: Keep the old entry points and emit deprecation warnings

```python
import warnings

def old_method(self, x):
    warnings.warn(
        "old_method is deprecated, use new_method",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this wrapper
    )
    return self.new_method(x)
```
---

### Pitfall 3: Not Updating Tests

**Problem**: Tests still import the old paths

**Solution**: Update imports incrementally

```python
# Before
from src.ranking.features import FeatureEngineer

# After
from src.ranking.features import FeatureEngineer  # Facade
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor  # Specific
```
---

## When NOT to Refactor

- **Deadline pressure**: Refactoring during crunch time adds risk
- **No tests**: Write tests first, then refactor
- **Unclear requirements**: Understand the code before restructuring
- **"Just because"**: Only refactor if it improves readability or testability

**Rule of Thumb**: Refactor when you're touching the code for a feature or bug fix anyway. Don't refactor in isolation.
---

## Success Metrics

How do you know the refactoring succeeded?

- [ ] Code is easier to understand (ask a teammate)
- [ ] Tests pass (behavior is unchanged)
- [ ] Performance is unchanged (benchmark)
- [ ] Test coverage increased (small modules are easier to test)
- [ ] Future changes are faster (measure time to implement the next feature)
---

## Related Documentation

- [ARCHITECTURE.md](../ARCHITECTURE.md) — Overall system structure
- [DEVELOPMENT.md](DEVELOPMENT.md) — Development guidelines
- [ONBOARDING.md](../ONBOARDING.md) — New contributor guide
---

## Refactoring Backlog

Track refactoring work here:

| File | Priority | Assigned | Status | Target Date |
|------|----------|----------|--------|-------------|
| `ranking/features.py` | P0 | Unassigned | Not Started | TBD |
| `core/web_search.py` | P1 | Unassigned | Not Started | TBD |
| `vector_db.py` | P0 | Unassigned | Not Started | TBD |

---

**Last Updated**: 2026-02-12
**Next Review**: After v3.0 planning