Spaces:
Sleeping
Sleeping
File size: 14,160 Bytes
6ad997d |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 501 502 |
# Refactoring Guide: Large Files
**Status**: Active
**Audience**: Engineers planning refactoring work
**Last Updated**: 2026-02-12
**Priority**: P2 (Quality improvement, not blocking)
---
## Overview
Several files in the codebase exceed 300 lines. While not critical, breaking them into smaller modules improves:
- **Readability**: Easier to understand each component
- **Testability**: Smaller units = easier to test in isolation
- **Maintainability**: Changes have smaller blast radius
**Industry Standard**: Files <300 lines, functions <50 lines.
---
## Files Requiring Refactoring
| File | Lines | Complexity | Priority | Suggested Split |
|------|-------|------------|----------|-----------------|
| `src/ranking/features.py` | 470 | High | P0 | Extract feature groups into separate modules |
| `src/core/web_search.py` | 427 | Medium | P1 | Separate API client from result parsing |
| `src/vector_db.py` | 368 | High | P0 | Split hybrid search from small-to-big retrieval |
| `src/recall/sasrec_recall.py` | 337 | Medium | P2 | Extract embedding logic |
| `src/recall/embedding.py` | 333 | Medium | P2 | Extract training vs inference |
| `src/services/recommend_service.py` | 313 | High | P1 | Extract diversity reranking orchestration |
| `src/core/recommendation_orchestrator.py` | 293 | Medium | P2 | Extract query preprocessing |
| `src/core/metadata_store.py` | 288 | Low | P3 | Already well-structured (mostly SQL) |
---
## Refactoring Strategy: features.py (Example)
### Current Structure (470 lines)
```python
class FeatureEngineer:
def __init__():
# 20 lines
def _load_sasrec_features():
# 35 lines
def _load_user_sequences():
# 20 lines
def load_base_data():
# 30 lines
def generate_features():
# 180 lines (HUGE!)
# - User stats (15 lines)
# - Item stats (15 lines)
# - SASRec features (40 lines)
# - ItemCF features (30 lines)
# - UserCF features (30 lines)
# - Last-N similarity (50 lines)
def extract_features():
# 120 lines
```
### Proposed Refactoring
**New Structure**:
```
src/ranking/
βββ features.py # Orchestrator (100 lines)
βββ feature_groups/
β βββ __init__.py
β βββ user_features.py # User stats extraction (40 lines)
β βββ item_features.py # Item stats extraction (40 lines)
β βββ sasrec_features.py # SASRec embedding features (60 lines)
β βββ cf_features.py # ItemCF + UserCF features (70 lines)
β βββ sequence_features.py # Last-N similarity (60 lines)
```
**Refactored `features.py`** (100 lines):
```python
from src.ranking.feature_groups.user_features import UserFeatureExtractor
from src.ranking.feature_groups.item_features import ItemFeatureExtractor
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor
from src.ranking.feature_groups.cf_features import CFFeatureExtractor
from src.ranking.feature_groups.sequence_features import SequenceFeatureExtractor
class FeatureEngineer:
"""Orchestrates feature extraction from multiple sources."""
def __init__(self, data_dir='data/rec', model_dir='data/model/recall'):
self.data_dir = Path(data_dir)
self.model_dir = Path(model_dir)
# Initialize sub-extractors
self.user_extractor = UserFeatureExtractor(data_dir)
self.item_extractor = ItemFeatureExtractor(data_dir)
self.sasrec_extractor = SASRecFeatureExtractor(data_dir, model_dir)
self.cf_extractor = CFFeatureExtractor(data_dir, model_dir)
self.seq_extractor = SequenceFeatureExtractor(data_dir)
def load_base_data(self):
"""Load all feature extractors."""
self.user_extractor.load()
self.item_extractor.load()
self.sasrec_extractor.load()
self.cf_extractor.load()
self.seq_extractor.load()
def generate_features(self, user_id, candidate_item, **overrides):
"""Generate feature vector by delegating to sub-extractors."""
feats = {}
# Delegate to each extractor
feats.update(self.user_extractor.extract(user_id))
feats.update(self.item_extractor.extract(candidate_item))
feats.update(self.sasrec_extractor.extract(user_id, candidate_item, **overrides))
feats.update(self.cf_extractor.extract(user_id, candidate_item))
feats.update(self.seq_extractor.extract(user_id, candidate_item, **overrides))
return feats
```
**Benefits**:
- Each extractor is <70 lines (easy to understand)
- Testable in isolation: `test_sasrec_features.py`
- Clear separation of concerns
- Easy to add new feature groups
---
## Refactoring Checklist
**Before Refactoring**:
- [ ] Read existing code thoroughly
- [ ] Identify natural module boundaries (feature groups, API layers, etc.)
- [ ] Check test coverage (write tests if missing)
- [ ] Document current behavior (input/output examples)
**During Refactoring**:
- [ ] Create new modules with single responsibility
- [ ] Move code incrementally (one function at a time)
- [ ] Run tests after each move
- [ ] Update imports in dependent files
- [ ] Add docstrings to new modules
**After Refactoring**:
- [ ] Run full test suite
- [ ] Check code coverage (should not decrease)
- [ ] Update ARCHITECTURE.md if structure changed
- [ ] Benchmark performance (ensure no regression)
- [ ] Update this guide with lessons learned
---
## Refactoring Patterns
### Pattern 1: Extract Feature Groups
**When**: Large function with distinct sections (user features, item features, etc.)
**How**:
1. Identify groups: user stats, item stats, embeddings, CF scores
2. Create one class per group: `UserFeatureExtractor`, `ItemFeatureExtractor`
3. Each class has `load()` and `extract()` methods
4. Main class delegates to extractors
**Example**: `features.py` β `feature_groups/*`
---
### Pattern 2: Split API Client from Parser
**When**: File handles both HTTP requests and response parsing
**How**:
1. Create `api_client.py`: HTTP calls, retries, auth
2. Create `parsers.py`: Parse JSON responses into domain objects
3. Main file orchestrates: `client.fetch()` β `parser.parse()`
**Example**: `web_search.py` β `google_books_client.py` + `result_parser.py`
---
### Pattern 3: Separate Training from Inference
**When**: File contains both model training code and inference code
**How**:
1. Create `trainer.py`: Fit model, save weights
2. Create `predictor.py`: Load weights, predict
3. Shared utilities in `model_utils.py`
**Example**: `sasrec_recall.py` β `sasrec_trainer.py` + `sasrec_predictor.py`
---
### Pattern 4: Extract Helper Functions
**When**: Function >50 lines with reusable logic
**How**:
1. Identify helper logic (e.g., "compute similarity", "filter results")
2. Extract to `_helper_function()` or `utils.py`
3. Main function becomes high-level orchestration
**Example**:
```python
# Before (80 lines)
def recommend(self, user_id, k=50):
# 1. Get history (15 lines)
# 2. Query similar items (20 lines)
# 3. Aggregate scores (25 lines)
# 4. Filter & sort (20 lines)
# After (20 lines)
def recommend(self, user_id, k=50):
history = self._get_user_history(user_id)
similar_items = self._query_similar_items(history)
scores = self._aggregate_scores(similar_items)
return self._filter_and_sort(scores, k)
```
---
## Example Refactoring: vector_db.py
### Current (368 lines)
```python
class VectorDB:
def __init__():
# 20 lines
def search():
# 30 lines
def hybrid_search():
# 120 lines (complex RRF fusion logic)
def small_to_big_search():
# 80 lines (hierarchical retrieval)
def _rrf_fusion():
# 60 lines
def _build_bm25_index():
# 40 lines
```
### Proposed (3 files, each <150 lines)
**src/vector_db.py** (80 lines):
```python
from src.retrieval.hybrid_retriever import HybridRetriever
from src.retrieval.hierarchical_retriever import HierarchicalRetriever
class VectorDB:
"""Facade for multiple retrieval strategies."""
def __init__():
self.hybrid = HybridRetriever()
self.hierarchical = HierarchicalRetriever()
def search(query, k=10):
"""Dense search only."""
return self.chroma.search(query, k=k)
def hybrid_search(query, k=10, alpha=0.5):
"""BM25 + Dense fusion."""
return self.hybrid.search(query, k, alpha)
def small_to_big_search(query, k=10):
"""Sentence β Book hierarchical retrieval."""
return self.hierarchical.search(query, k)
```
**src/retrieval/hybrid_retriever.py** (150 lines):
```python
class HybridRetriever:
"""Combines BM25 (sparse) + Dense (embeddings) via RRF."""
def __init__():
self.bm25 = self._build_bm25_index()
self.dense_db = Chroma(...)
def search(query, k, alpha):
dense_results = self._dense_search(query, k)
sparse_results = self._sparse_search(query, k)
return self._rrf_fusion(dense_results, sparse_results, alpha)
def _rrf_fusion(dense, sparse, alpha):
# 60 lines (RRF logic)
def _build_bm25_index():
# 40 lines
```
**src/retrieval/hierarchical_retriever.py** (130 lines):
```python
class HierarchicalRetriever:
"""Small-to-Big (sentence β book) retrieval."""
def search(query, k):
# 1. Find matching sentences (child nodes)
sentences = self.sentence_db.search(query, k=50)
# 2. Map to parent books (ISBNs)
book_isbns = self._map_to_parents(sentences)
# 3. Return full book context
return self._enrich_with_metadata(book_isbns, k)
```
---
## Migration Strategy
### Incremental Refactoring (Safe)
**Week 1: Extract utilities**
- Move helper functions to `utils.py`
- No API changes
- Run tests
**Week 2: Create new modules**
- Add `feature_groups/` directory
- Copy (don't move) code to new modules
- Old code still works
**Week 3: Update imports**
- Change `features.py` to use new modules
- Mark old methods as `@deprecated`
- Update tests
**Week 4: Remove old code**
- Delete deprecated methods
- Update ARCHITECTURE.md
- Celebrate! π
---
## Testing Strategy
### Before Refactoring: Capture Behavior
```python
# tests/test_features_baseline.py
def test_feature_extraction_baseline():
"""Capture current behavior before refactoring."""
fe = FeatureEngineer()
fe.load_base_data()
# Test on known user/item
feats = fe.generate_features("A123", "0123456789")
# Save as baseline
import json
with open("tests/baseline_features.json", "w") as f:
json.dump(feats, f)
```
### After Refactoring: Compare Output
```python
# tests/test_features_refactored.py
def test_feature_extraction_matches_baseline():
"""Ensure refactored code produces identical output."""
fe = FeatureEngineer() # New refactored version
fe.load_base_data()
feats = fe.generate_features("A123", "0123456789")
# Load baseline
import json
with open("tests/baseline_features.json") as f:
baseline = json.load(f)
# Compare
for key in baseline:
assert key in feats, f"Missing feature: {key}"
assert abs(feats[key] - baseline[key]) < 1e-6, f"Feature {key} mismatch"
```
---
## Performance Considerations
### Refactoring Should Not Slow Things Down
**Before Refactoring**: Benchmark
```bash
python benchmarks/benchmark.py > baseline.txt
```
**After Refactoring**: Re-benchmark
```bash
python benchmarks/benchmark.py > refactored.txt
diff baseline.txt refactored.txt
```
**Acceptable Changes**:
- Latency: Β±5% (small overhead from function calls OK)
- Memory: Β±10% (object overhead from new classes OK)
**Red Flags**:
- Latency increase >10% β Investigate (unnecessary copies? Extra I/O?)
- Memory increase >20% β Investigate (duplicate data? Leaks?)
---
## Common Pitfalls
### Pitfall 1: Over-Refactoring
**Problem**: Creating too many tiny files (10-20 lines each)
**Solution**: Aim for 50-150 lines per file. Smaller isn't always better.
---
### Pitfall 2: Breaking API Compatibility
**Problem**: Refactoring changes function signatures, breaking existing code
**Solution**: Use deprecation warnings
```python
import warnings
def old_method(self, x):
warnings.warn("old_method is deprecated, use new_method", DeprecationWarning)
return self.new_method(x)
```
---
### Pitfall 3: Not Updating Tests
**Problem**: Tests still import old paths
**Solution**: Update imports incrementally
```python
# Before
from src.ranking.features import FeatureEngineer
# After
from src.ranking.features import FeatureEngineer # Facade
from src.ranking.feature_groups.sasrec_features import SASRecFeatureExtractor # Specific
```
---
## When NOT to Refactor
- **Deadline pressure**: Refactoring during crunch time adds risk
- **No tests**: Write tests first, then refactor
- **Unclear requirements**: Understand the code before restructuring
- **"Just because"**: Only refactor if it improves readability/testability
**Rule of Thumb**: Refactor when you're touching the code for a feature/bug anyway. Don't refactor in isolation.
---
## Success Metrics
How do you know refactoring succeeded?
- [ ] Code is easier to understand (ask a teammate)
- [ ] Tests pass (same behavior)
- [ ] Performance unchanged (benchmark)
- [ ] Test coverage increased (easier to test small modules)
- [ ] Future changes are faster (measure time to implement next feature)
---
## Related Documentation
- [ARCHITECTURE.md](../ARCHITECTURE.md) β Overall system structure
- [DEVELOPMENT.md](DEVELOPMENT.md) β Development guidelines
- [ONBOARDING.md](../ONBOARDING.md) β New contributor guide
---
## Refactoring Backlog
Track refactoring work here:
| File | Priority | Assigned | Status | Target Date |
|------|----------|----------|--------|-------------|
| `ranking/features.py` | P0 | Unassigned | Not Started | TBD |
| `core/web_search.py` | P1 | Unassigned | Not Started | TBD |
| `vector_db.py` | P0 | Unassigned | Not Started | TBD |
---
**Last Updated**: 2026-02-12
**Next Review**: After v3.0 planning
|