diff --git "a/Python/app.js" "b/Python/app.js"
--- "a/Python/app.js"
+++ "b/Python/app.js"
@@ -83,7 +83,7 @@ const MODULE_CONTENT = {
Python is a dynamically-typed, garbage-collected, interpreted language with a C-based runtime (CPython). Everything is an object — integers, functions, even classes. Understanding this object model is what separates beginners from professionals.
-

1. Data Structures for DS — Complete Reference

+

1. Data Structures — Complete Reference

@@ -96,450 +96,709 @@ const MODULE_CONTENT = {
Type | Mutable | Ordered | Hashable | Use Case
list | yes | yes | no | Sequential data, time series, feature lists
bytearray | yes | yes | no | Mutable binary buffers
-

2. Python Memory Model — What No One Teaches

+

2. Time Complexity — What Every Dev Must Know

Operation | list | dict | set
Lookup by index/key | O(1) | O(1) | —
Search (x in ...) | O(n) | O(1) | O(1)
Insert/Append | O(1) end, O(n) middle | O(1) | O(1)
Delete | O(n) | O(1) | O(1)
Sort | O(n log n) | — | —
Iteration | O(n) | O(n) | O(n)
+

Real-world impact: Checking if an item exists in a list of 1M elements = ~50ms. In a set = ~0.00005ms. That's 1,000,000x faster. Always use sets/dicts for membership testing.
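A quick sanity check you can run yourself — a minimal timeit sketch (absolute numbers vary by machine; the gap is the point):

```python
import timeit

data_list = list(range(1_000_000))
data_set = set(data_list)
target = 999_999  # worst case for the list: last element

# Ten membership tests each — the list scans, the set hashes
t_list = timeit.timeit(lambda: target in data_list, number=10)
t_set = timeit.timeit(lambda: target in data_set, number=10)

print(f"list: {t_list:.4f}s  set: {t_set:.6f}s  speedup: {t_list / t_set:,.0f}x")
```
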

+ +

3. Python Memory Model

-
⚡ Everything Is An Object
-
In Python, every value is an object on the heap. Variables are just references (pointers) to objects. a = [1, 2, 3] — the list lives on the heap; a is a name that points to it. b = a makes both point to the same list — no copy is made. This is called aliasing.
+
⚡ Everything Is An Object on the Heap
+
Variables are references (pointers), not boxes. a = [1,2,3] creates a list on the heap; a points to it. b = a makes both point to the same list. This is aliasing — the #1 source of bugs in beginner Python code.
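A minimal sketch of aliasing vs copying:

```python
a = [1, 2, 3]
b = a          # alias — b and a name the SAME list object
c = list(a)    # shallow copy — a new outer list

b.append(4)
print(a)               # [1, 2, 3, 4] — 'a' sees the change through the alias
print(c)               # [1, 2, 3]    — the copy is unaffected
print(a is b, a is c)  # True False
```
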
-

Reference Counting: Python uses reference counting + cyclic garbage collector. Each object tracks how many names point to it. When count hits 0, memory is freed immediately. del doesn't always free memory — it just decrements the reference count.

-

Integer Interning: Python caches integers from -5 to 256 and short strings. So a = 100; b = 100; a is b is True, but a = 1000; b = 1000; a is b may be False. Never use is for value comparison — always use ==.

-

Garbage Collection Generations: CPython has 3 generations (gen0, gen1, gen2). New objects start in gen0. Objects that survive a collection move to the next generation. Long-lived objects (gen2) are collected less frequently. Use gc.get_stats() to monitor.

+

Reference Counting: Each object tracks how many names reference it. When count = 0, freed immediately. del decrements the count, doesn't necessarily free memory.

+

Integer Interning: Python caches integers -5 to 256. So a = 100; b = 100; a is b → True. But a = 1000; b = 1000; a is b → may be False. Never use is for value comparison.
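A short demonstration — interning is a CPython implementation detail, and `int("...")` is used here only to defeat the compiler's constant folding:

```python
# CPython caches small ints (-5..256); int("...") forces fresh object creation
small_a, small_b = int("100"), int("100")
big_a, big_b = int("1000"), int("1000")

print(small_a is small_b)  # True  — both names point at the cached 100
print(big_a is big_b)      # False — two distinct 1000 objects
print(big_a == big_b)      # True  — values are equal; use == for comparison
```
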

+

Garbage Collection: 3 generations (gen0, gen1, gen2). New objects in gen0. Survivors promoted. Use gc.collect() after deleting large ML models.

-

3. Generators & Iterators — The Core of Pythonic Code

+

4. Generators & Iterators — The Heart of Python

-
🔄 Lazy Evaluation Is King
-
Generators produce values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items = ~8GB RAM. A generator of 1 billion items = ~100 bytes. The Iterator Protocol: any object with __iter__ and __next__ methods. Generators are just syntactic sugar for iterators.
+
🔄 Lazy Evaluation
+
yield suspends state, return terminates. A list of 1B items = ~8GB. A generator = ~100 bytes. The Iterator Protocol: any object with __iter__ + __next__. Generator expressions: (x**2 for x in range(10**9)) — O(1) memory.
-

yield vs return: return terminates the function. yield suspends it, saving the entire stack frame (local variables, instruction pointer). The next next() call resumes from where it left off.

-

yield from: Delegates to a sub-generator. yield from iterable is equivalent to for item in iterable: yield item but also forwards send() and throw() calls.

-

Generator Expressions: (x**2 for x in range(10**9)) — uses O(1) memory. List comprehension [x**2 for x in range(10**9)] — tries to allocate ~8GB. Always prefer generator expressions for large data.

+

yield from: Delegates to sub-generator. Forwards send() and throw(). Essential for building composable data pipelines.

+

send(): Two-way communication with generators (coroutines). value = yield result — both receives and produces values.
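A compact sketch tying yield, yield from, and send() together:

```python
def inner():
    received = yield "first"      # suspends here; resumes on next()/send()
    yield f"got {received}"

def outer():
    yield from inner()            # delegates — forwards send()/throw() too
    yield "done"

g = outer()
out = [next(g), g.send("hi"), next(g)]
print(out)  # ['first', 'got hi', 'done']
```

The send() call travels through the `yield from` delegation straight into `inner()`, which is exactly why `yield from` is more than a shorthand for a for-loop.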

-

4. Closures & First-Class Functions

-

Functions in Python are first-class objects — they can be passed as arguments, returned from other functions, and assigned to variables. A closure is a function that captures variables from its enclosing scope. This is the foundation of decorators, callbacks, and functional programming in Python.

+

5. Closures & First-Class Functions

+

Functions are first-class objects — passed as args, returned, assigned. A closure captures variables from enclosing scope. Foundation of decorators, callbacks, and functional programming.

-

5. Critical Python Gotchas

+

6. Critical Python Gotchas for Projects

-
⚠️ Mutable Default Arguments — #1 Python Trap
- def append_to(element, target=[]): — This default list is shared across ALL calls! Default arguments are evaluated ONCE at function definition time, not at call time. Fix: use target=None then if target is None: target = []. +
⚠️ The 5 Deadliest Python Traps
+ 1. Mutable Default Args: def f(x, lst=[]): — list shared across ALL calls. Fix: lst=None.
+ 2. Late Binding Closures: [lambda: i for i in range(5)] — all return 4! Fix: lambda i=i: i.
+ 3. Shallow Copy: list(a) copies outer list but shares inner objects.
+ 4. String Concatenation: s += "text" in a loop creates new string every time — O(n²). Use ''.join(parts).
+ 5. Circular Imports: Module A imports B, B imports A → ImportError. Fix: restructure or lazy import.
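Trap 3 in action — a minimal sketch showing that list(a) shares inner objects while copy.deepcopy() does not:

```python
import copy

a = [[1, 2], [3, 4]]
shallow = list(a)          # new outer list, SAME inner lists
deep = copy.deepcopy(a)    # fully independent copy

a[0].append(99)
print(shallow[0])  # [1, 2, 99] — shared inner object was mutated
print(deep[0])     # [1, 2]     — deep copy unaffected
```
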
-

Late Binding Closures: [lambda: i for i in range(5)] — all lambdas return 4! Variables in closures are looked up at call time, not definition time. Fix: [lambda i=i: i for i in range(5)].

-

Tuple Assignment Gotcha: a = ([1,2],); a[0] += [3] raises TypeError AND modifies the list! The += first mutates the list in-place (succeeds), then tries to reassign the tuple element (fails).
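A minimal reproduction — note the list really is mutated even though the statement raises:

```python
t = ([1, 2],)
err = None
try:
    t[0] += [3]        # in-place list extend succeeds, tuple item assignment fails
except TypeError as e:
    err = e

print(t)    # ([1, 2, 3],) — modified despite the exception
print(err)  # the TypeError from assigning to the tuple slot
```
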

-

6. collections Module — Power Tools

+

7. Error Handling for Production Projects

+
+
🛡️ Exception Hierarchy You Must Know
+
+ BaseException → Exception (catch this) → ValueError, TypeError, KeyError, FileNotFoundError, ConnectionError...
+ Rules: (1) Never catch bare except:. (2) Catch specific exceptions. (3) Use else for success path. (4) finally always runs. (5) Create custom exceptions for your project. +
+
+ +

8. collections Module — Power Tools

Class | Purpose | Project Use Case
defaultdict | Dict with default factory | Group data without KeyError: defaultdict(list)
Counter | Count hashable objects | Label distribution, word frequency
namedtuple | Lightweight immutable class | Return multiple named values
deque | Double-ended queue | Sliding window, BFS, ring buffer
ChainMap | Stack multiple dicts | Config layers: defaults → env → CLI
OrderedDict | Ordered dict (legacy; dicts are ordered 3.7+) | move_to_end() for LRU cache
-

7. itertools — Memory-Efficient Pipelines

+

9. itertools — Memory-Efficient Pipelines

Function | What It Does | Project Use
chain() | Concatenate iterables lazily | Merge data files
islice() | Slice any iterator | Take first N from generator
groupby() | Group consecutive elements | Process sorted logs by date
product() | Cartesian product | Hyperparameter grid
combinations() | All r-length combos | Feature interaction pairs
starmap() | map() with unpacked args | Apply function to paired data
accumulate() | Running accumulator | Cumulative sums, running max
tee() | Clone iterator N times | Multiple passes over stream
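A few of these combined in one runnable sketch (toy data, illustrative only):

```python
from itertools import chain, islice, groupby, accumulate

# chain + islice — merge sources lazily, take the first N
merged = chain([1, 2], [3, 4], [5, 6])
first_four = list(islice(merged, 4))       # [1, 2, 3, 4]

# groupby — consecutive runs (input must already be sorted by the key)
logs = [("2024-01-01", "a"), ("2024-01-01", "b"), ("2024-01-02", "c")]
by_date = {date: [entry for _, entry in grp]
           for date, grp in groupby(logs, key=lambda r: r[0])}

# accumulate — running totals
totals = list(accumulate([1, 2, 3, 4]))    # [1, 3, 6, 10]
```
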
-

8. String Internals & Formatting

-

f-strings (3.6+) are the fastest formatting method. They support expressions: f"{accuracy:.2%}" → "95.23%", f"{x=}" (3.8+) → "x=42" for debugging. Interning: Python interns string literals and identifiers. 'hello' is 'hello' is True because both point to the same interned object.

+

10. File I/O for Real Projects

Format | Read | Write | Best For
JSON | json.load(f) | json.dump(obj, f) | Configs, API responses
CSV | csv.DictReader(f) | csv.DictWriter(f) | Tabular data (small)
YAML | yaml.safe_load(f) | yaml.dump(obj, f) | Config files
Pickle | pickle.load(f) | pickle.dump(obj, f) | Python objects, models
Parquet | pd.read_parquet() | df.to_parquet() | Large DataFrames (fast)
SQLite | sqlite3.connect() | SQL queries | Local database
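A standard-library round-trip sketch for the JSON and CSV rows above (temp paths are illustrative):

```python
import csv
import json
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())

# JSON — configs, API responses
cfg = {"model": "rf", "epochs": 10}
(tmp / "config.json").write_text(json.dumps(cfg, indent=2))
loaded = json.loads((tmp / "config.json").read_text())

# CSV — small tabular data (note: DictReader yields strings)
with open(tmp / "rows.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "score"])
    w.writeheader()
    w.writerow({"name": "a", "score": 0.9})

with open(tmp / "rows.csv", newline="") as f:
    rows = list(csv.DictReader(f))
```
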
-

9. pathlib — Modern File Handling

-

Stop using os.path.join(). Use pathlib.Path — object-oriented, cross-platform, reads like English. Path('data') / 'train' / 'images' builds paths. path.glob('*.csv') finds files. path.read_text() reads without open().

+

11. pathlib — Modern File Handling

+

Stop using os.path.join(). Use pathlib.Path: Path('data') / 'train' / 'images'. Methods: .glob(), .read_text(), .mkdir(parents=True), .exists(), .suffix, .stem. Cross-platform, readable, powerful.
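A minimal sketch of those methods against a throwaway temp directory:

```python
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())
data_dir = root / "data" / "train"          # / operator builds paths
data_dir.mkdir(parents=True, exist_ok=True)

(data_dir / "a.csv").write_text("x,y\n1,2\n")
(data_dir / "b.txt").write_text("notes")

csv_files = sorted(data_dir.glob("*.csv"))  # pattern matching
print([p.name for p in csv_files])          # ['a.csv']
print(csv_files[0].suffix, csv_files[0].stem)  # .csv a
```
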

-

10. Virtual Environments

+

12. Virtual Environments & Dependency Management

Tool | Best For | Key Feature
venv | Simple projects | Built-in, lightweight
conda | DS/ML (C deps) | Handles CUDA, MKL, OpenCV
poetry | Modern packaging | Lock files, deterministic builds
uv | Speed | 10-100x faster than pip (Rust-based)
pip-tools | Requirements pinning | pip-compile for lock files
+ +

13. Project Structure Template

+
my_project/
├── src/
│   └── my_package/
│       ├── __init__.py
│       ├── data/          # Data loading & processing
│       ├── models/        # Model definitions
│       ├── training/      # Training loops
│       ├── evaluation/    # Metrics & evaluation
│       ├── serving/       # API endpoints
│       └── utils/         # Shared utilities
├── tests/
│   ├── conftest.py        # Shared fixtures
│   ├── test_data.py
│   └── test_models.py
├── configs/               # YAML/JSON configs
├── notebooks/             # EDA notebooks
├── scripts/               # CLI scripts
├── pyproject.toml         # Modern Python packaging
├── Dockerfile
├── Makefile               # Common commands
└── README.md
+ +

14. String Operations for Data Cleaning

+

f-strings (3.6+): f"{accuracy:.2%}" → "95.23%". f"{x=}" (3.8+) → "x=42" for debugging. f"{name!r}" → shows repr. regex: re.compile(pattern) for repeated use. re.sub() for cleaning. re.findall() for extraction. Always compile patterns used in loops.
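The formatting claims above, verified in three lines:

```python
accuracy = 0.9523
x = 42
name = "run-1"

print(f"{accuracy:.2%}")  # 95.23%
print(f"{x=}")            # x=42  (3.8+ debug syntax)
print(f"{name!r}")        # 'run-1' — repr with quotes
```
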

+ +

15. Command-Line Interface (CLI) Tools

+

argparse: Built-in CLI parsing. click: Decorator-based, more Pythonic. typer: Modern, uses type hints. Every production project needs a CLI for: training, evaluation, data processing, deployment scripts.

`, code: `
-

💻 Python Fundamentals — Code Examples

+

💻 Python Fundamentals — Project Code

-

1. Generators — Complete Patterns

-
1. Generator Pipeline — Process Any Size Data

import json

def read_jsonl(filepath):
    """Read JSON Lines file lazily — handles any size."""
    with open(filepath) as f:
        for line in f:
            yield json.loads(line.strip())

def filter_records(records, min_score=0.5):
    for rec in records:
        if rec.get('score', 0) >= min_score:
            yield rec

def batch(iterable, size=64):
    """Batch any iterable into fixed-size chunks."""
    from itertools import islice
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Compose into pipeline — still O(1) memory!
pipeline = batch(filter_records(read_jsonl("data.jsonl")), size=32)
for chunk in pipeline:
    process(chunk)  # Only 32 records in memory at a time
+ +

2. Coroutine Pattern — Running Statistics

+
def running_stats():
    """Coroutine that computes running mean & variance."""
    n = 0
    mean = 0.0
    M2 = 0.0
    while True:
        x = yield {'mean': mean, 'var': M2 / n if n > 0 else 0, 'n': n}
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)  # Welford's algorithm — numerically stable

stats = running_stats()
next(stats)       # Prime the coroutine
stats.send(10)    # {'mean': 10.0, 'var': 0, 'n': 1}
stats.send(20)    # {'mean': 15.0, 'var': 25.0, 'n': 2}
+ +

3. Custom Exception Hierarchy for Projects

+
import pandas as pd

# Define project-specific exceptions
class ProjectError(Exception):
    """Base exception for the project."""

class DataValidationError(ProjectError):
    def __init__(self, column, expected, actual):
        self.column = column
        super().__init__(
            f"Column '{column}': expected {expected}, got {actual}"
        )

class ModelNotTrainedError(ProjectError):
    pass

# Usage with proper error handling
def load_and_validate(path):
    try:
        df = pd.read_csv(path)
    except FileNotFoundError:
        raise DataValidationError("file", "exists", "missing")
    except pd.errors.EmptyDataError:
        raise DataValidationError("data", "non-empty", "empty file")
    else:
        print(f"Loaded {len(df)} rows")
        return df
    finally:
        print("Load attempt complete")
+ +

4. Closures & Mutable Default Trap

+
# ⚠️ THE #1 PYTHON BUG — Mutable default argument
def bad_append(item, lst=[]):  # List shared across ALL calls!
    lst.append(item)
    return lst

bad_append(1)  # [1]
bad_append(2)  # [1, 2] ← SURPRISE!

# ✅ CORRECT — use None sentinel
def good_append(item, lst=None):
    if lst is None:
        lst = []
    lst.append(item)
    return lst
-

3. collections In Action

-
from collections import defaultdict, Counter, namedtuple, deque +

5. collections in Action

+
from collections import defaultdict, Counter, deque

# defaultdict — group data without KeyError
samples_by_label = defaultdict(list)
for feat, label in zip(features, labels):
    samples_by_label[label].append(feat)

# Counter — class distribution + top-N
dist = Counter(y_train)
print(dist.most_common(3))
imbalance_ratio = dist.most_common()[0][1] / dist.most_common()[-1][1]

# deque — sliding window for streaming
window = deque(maxlen=5)
for val in data_stream:
    window.append(val)
    moving_avg = sum(window) / len(window)
-

4. Advanced Comprehensions & Unpacking

-
# Walrus operator (:=) — Assign + use in expression (3.8+) +

6. CLI Tool with argparse

+
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train ML model")
    parser.add_argument("--data", required=True, help="Path to data")
    parser.add_argument("--model", choices=["rf", "xgb", "lgbm"], default="rf")
    parser.add_argument("--epochs", type=int, default=10)
    parser.add_argument("--lr", type=float, default=0.001)
    parser.add_argument("--dry-run", action="store_true")
    args = parser.parse_args()

    print(f"Training {args.model} on {args.data}")
    # python train.py --data data.csv --model xgb --epochs 50

if __name__ == "__main__":
    main()
+ +

7. Advanced Comprehensions & Modern Python

+
# Walrus operator (:=) — assign + use (3.8+)
if (n := len(data)) > 1000:
    print(f"Large dataset: {n} samples")

# Dict merge (3.9+)
config = defaults | overrides

# match-case — Structural Pattern Matching (3.10+)
match command:
    case {"action": "train", "model": model_name}:
        train(model_name)
    case {"action": "predict", "data": path}:
        predict(path)
    case _:
        print("Unknown command")

# Extended unpacking
first, *middle, last = sorted(scores)

# Nested dict comprehension
metrics = {
    model: {metric: score for metric, score in results.items()}
    for model, results in all_results.items()
}
+ +

8. Regex for Data Cleaning

+
import re

# Compile patterns used repeatedly — avoids re-parsing on every call
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PHONE = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')

# Extract all emails from text
text = "Contact us: alice@example.com or bob@test.org"
emails = EMAIL.findall(text)

# Clean text for NLP
def clean_text(text):
    text = re.sub(r'http\S+', '', text)        # Remove URLs
    text = re.sub(r'[^a-zA-Z\s]', '', text)    # Keep only letters
    text = re.sub(r'\s+', ' ', text).strip()   # Normalize whitespace
    return text.lower()
+ +

9. Configuration Management

+
import json
import yaml
from pathlib import Path
from dataclasses import dataclass, asdict

@dataclass
class Config:
    model_name: str = "random_forest"
    learning_rate: float = 0.001
    batch_size: int = 32
    epochs: int = 100
    data_path: str = "data/train.csv"

    @classmethod
    def from_yaml(cls, path):
        with open(path) as f:
            return cls(**yaml.safe_load(f))

    def save(self, path):
        Path(path).write_text(json.dumps(asdict(self), indent=2))

config = Config.from_yaml("configs/experiment.yaml")
`, interview: `

🎯 Python Fundamentals — Interview Questions

-
Q1: What's the difference between a list and a tuple?

Answer: Lists are mutable, tuples immutable. Deeper: tuples are hashable (can be dict keys), use less memory (no over-allocation), and signal intent ("this shouldn't change"). Use tuples for (lat, lon) pairs, function return values, dict keys. Use lists for collections that grow.

-
Q2: How does Python's GIL affect DS workflows?

Answer: The GIL prevents true multi-threading for CPU-bound tasks. But NumPy, Pandas, and scikit-learn release the GIL during C-level computations. So vectorized operations ARE parallel at the C level. For pure Python CPU work, use multiprocessing. For I/O, threading works fine.

-
Q3: Explain shallow vs deep copy.

Answer: copy.copy() copies outer container but shares inner objects. copy.deepcopy() recursively copies everything. Real scenario: list of dicts (configs). Shallow copy means modifying one config modifies all. Pandas .copy() is deep by default — but df2 = df is NOT a copy.

-
Q4: What is the mutable default argument trap?

Answer: def f(x, lst=[]): — the default list is created ONCE at function definition and shared across all calls. So f(1); f(2) gives [1, 2] not [2]. Fix: use lst=None then if lst is None: lst = []. This is the #1 Python gotcha in interviews.

-
Q5: What are generators and why are they critical for large-scale data?

Answer: Generators yield values one at a time using yield, consuming O(1) memory. A list of 1B items = ~8GB. A generator = ~100 bytes. Critical for: reading large files, streaming data, batch training. yield from delegates to sub-generators. Generator expressions: (x for x in data).

-
Q6: Explain the LEGB scope rule.

Answer: Python resolves names in order: Local → Enclosing → Global → Built-in. This is why list = [1,2] breaks list(). Use nonlocal for enclosing scope, global for module scope.

-
Q7: How would you handle a 10GB CSV that doesn't fit in memory?

Answer: (1) pd.read_csv(chunksize=50000), (2) usecols=['needed'], (3) dtype={'col': 'int32'}, (4) Dask for lazy Pandas, (5) DuckDB for SQL on CSV with zero overhead, (6) Polars for fast out-of-core processing.

-
Q8: What's the time complexity of dict lookup vs list search?

Answer: Dict: O(1) via hash tables (open addressing). List: O(n) linear scan. Dict hashes the key to compute slot index, handles collisions via probing. Sets use the same mechanism. x in my_set is O(1) but x in my_list is O(n).

-
Q9: Explain Python's garbage collection.

Answer: Two mechanisms: (1) Reference counting — freed when count hits 0. (2) Cyclic GC — detects reference cycles (A→B→A). Runs on 3 generations. Long-lived objects collected less often. gc.collect() forces collection — useful after deleting large ML models.

-
Q10: What is __slots__ and when to use it?

Answer: By default, Python objects store attributes in a __dict__ (a dict per instance). __slots__ replaces this with a fixed-size array. Saves ~40% memory per instance. Use when creating millions of small objects (data points, nodes). Trade-off: can't add attributes dynamically.

+
Q1: List vs tuple — when to use which?

Answer: Tuples: immutable, hashable (dict keys), less memory. Lists: mutable, growable. Use tuples for fixed data (coordinates, config). Use lists for collections that change. Tuples signal "this shouldn't be modified."

+
Q2: How does Python's GIL affect DS?

Answer: GIL prevents multi-threading for CPU-bound Python. But NumPy/Pandas release the GIL during C operations. For pure Python CPU work → multiprocessing. For I/O → threading works. For data science, the GIL rarely matters.

+
Q3: Shallow vs deep copy?

Answer: copy.copy(): outer container copied, inner objects shared. copy.deepcopy(): everything copied recursively. Real trap: df2 = df is NOT a copy — it's aliasing. Use df.copy().

+
Q4: What is the mutable default argument trap?

Answer: def f(x, lst=[]): — default list created ONCE and shared. Fix: lst=None; if lst is None: lst = []. #1 Python interview gotcha.

+
Q5: Why are generators critical for large data?

Answer: O(1) memory. 1B items as list = 8GB. As generator = 100 bytes. Use for: file processing, streaming, batch training. yield from for composition.

+
Q6: Explain LEGB scope rule.

Answer: Name lookup order: Local → Enclosing → Global → Built-in. nonlocal for enclosing scope, global for module. list = [1] shadows built-in list().

+
Q7: How to handle a 10GB CSV?

Answer: (1) pd.read_csv(chunksize=N), (2) usecols=['needed'], (3) dtype={'col':'int32'}, (4) Dask, (5) DuckDB for SQL on CSV, (6) Polars for Rust-speed.

+
Q8: Dict lookup O(1) vs list search O(n)?

Answer: Dicts use hash tables. Key → hash → slot index. O(1) average. Lists scan linearly. x in set is O(1) but x in list is O(n). For 1M items: microseconds vs milliseconds.

+
Q9: Explain Python's garbage collection.

Answer: (1) Reference counting — freed at count=0. (2) Cyclic GC — detects A→B→A cycles. 3 generations. gc.collect() after deleting large models.

+
Q10: What is __slots__?

Answer: Replaces per-instance __dict__ with fixed array. ~40% memory savings. Use for millions of small objects. Trade-off: no dynamic attributes.
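A minimal sketch of the trade-off (exact byte savings vary by Python version):

```python
class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")          # fixed attribute set, no per-instance dict
    def __init__(self, x, y):
        self.x, self.y = x, y

p = PointSlots(1, 2)
try:
    p.z = 3                          # no __dict__ → cannot add attributes
except AttributeError as e:
    print("AttributeError:", e)

# Slotted instances skip the per-instance __dict__ entirely
print(hasattr(PointDict(1, 2), "__dict__"), hasattr(p, "__dict__"))  # True False
```
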

+
Q11: How do you structure a Python project?

Answer: src/package/ layout. pyproject.toml for config. tests/ with pytest. configs/ for YAML. Makefile for common commands. Separate data, models, training, serving.

+
Q12: What's the difference between is and ==?

Answer: == checks value equality. is checks identity (same memory). Use is only for singletons: x is None, x is True. Integer interning makes 256 is 256 True but 1000 is 1000 may be False.

` }, - "numpy": { - concepts: ` +"numpy": { + concepts: `

🔢 NumPy — Complete Deep Dive

-
⚡ Why NumPy Is 50-100x Faster Than Python Lists
-
Three reasons: (1) Contiguous memory — CPU cache-friendly, no pointer chasing. (2) Compiled C loops — operations run in C, not interpreted Python. (3) SIMD instructions — modern CPUs process 4-8 floats simultaneously (AVX).
+
⚡ Why NumPy Is 50-100x Faster
+
(1) Contiguous memory — CPU cache-friendly. (2) Compiled C loops. (3) SIMD instructions — 4-8 floats simultaneously. Python list: array of pointers to objects. NumPy: raw typed data in a block.

1. ndarray Internals

- - - - - + + + +
Feature | Python List | NumPy ndarray
Storage | Array of pointers to objects | Contiguous block of raw typed data
Type | Each element can differ | Homogeneous — all same dtype
Memory per int | ~28 bytes + pointer | 8 bytes (int64, no overhead)
Operations | Python loop (bytecode) | Compiled C/Fortran loops
SIMD | Not possible | CPU vector instructions
-

2. Memory Layout: C-Order vs Fortran-Order

+

2. Memory Layout & Strides

-
⚡ Performance-Critical Knowledge
-
C-order (row-major): Rows stored contiguously. Fortran-order (col-major): Columns stored contiguously. NumPy defaults to C-order. Iterating along the last axis is fastest (cache-friendly). Fortran-order preferred for LAPACK/BLAS operations.
+
🧠 Strides = The Secret Behind Views
+
Every ndarray has strides — bytes to jump in each dimension. For (3,4) float64: strides = (32, 8). Slicing creates views (no copy) by adjusting strides. arr[::2] doubles row stride. C-order (row-major): rows contiguous. Fortran-order: columns contiguous. Iterate along last axis for best performance.
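A short sketch you can run to see strides and views in action:

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float64)
print(arr.strides)        # (32, 8): 4 cols * 8 bytes per row step, 8 bytes per col step

view = arr[::2]           # every other row — a view, not a copy
print(view.strides)       # (64, 8): row stride doubled, no data moved
print(view.base is arr)   # True — the view shares arr's memory

view[0, 0] = 1.0
print(arr[0, 0])          # 1.0 — writing through the view changes arr
```
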
-

3. Strides: The Secret Behind Views

-

Every ndarray has a strides tuple — bytes to jump in each dimension. For a (3,4) float64 array: strides = (32, 8). Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride.

- -

4. Broadcasting Rules

+

3. Broadcasting Rules

-
🎯 Broadcasting Rules (Right to Left)
-
Two arrays are compatible when, for each trailing dimension: (1) Dimensions are equal, OR (2) One is 1. Example: (5,3,1) + (1,4) → shape (5,3,4). The (1,) dims are "stretched" virtually — no memory copied.
+
🎯 Rules (Right to Left)
+
Two arrays compatible when, for each trailing dim: dims are equal OR one is 1. (5,3,1) + (1,4) → (5,3,4). The "1" dims stretch virtually — no memory copied. Common: X - X.mean(axis=0) → (1000,5) - (5,) works!
-

5. Universal Functions (ufuncs)

-

Ufuncs are vectorized functions that operate element-wise. They support: .reduce() (fold along axis), .accumulate() (running total), .outer() (outer product), .at() (unbuffered in-place). Example: np.add.reduce(arr) = arr.sum() but works with custom ufuncs too.

+

4. Universal Functions (ufuncs)

+

Vectorized element-wise functions. Advanced methods: .reduce() (fold), .accumulate() (running total), .outer() (outer product), .at() (unbuffered in-place). Create custom with np.frompyfunc().
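The four advanced ufunc methods in one sketch:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])

print(np.add.reduce(arr))       # 10 — same as arr.sum()
print(np.add.accumulate(arr))   # [ 1  3  6 10] — running total
print(np.multiply.outer([1, 2], [10, 20]))  # [[10 20], [20 40]]

# .at() — unbuffered in-place update: repeated indices ALL take effect
counts = np.zeros(3, dtype=int)
np.add.at(counts, [0, 0, 2], 1)
print(counts)                   # [2 0 1]
```
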

-

6. Key dtype Choices for DS

+

5. dtype Selection for Projects

- - - - + + + + +
dtype | Bytes | When to Use
float32 | 4 | Deep learning, GPU (50% less memory than float64)
float64 | 8 | Default. Statistics, scientific computing
float16 | 2 | Mixed-precision inference
int32 | 4 | Indices, counts
int8 | 1 | Quantized models
bool | 1 | Masks for filtering
-

7. np.einsum — Einstein Summation

-

np.einsum can express any tensor operation: matrix multiply, trace, transpose, batch ops. Often faster than chaining NumPy functions because it avoids intermediate arrays.

+

6. np.einsum — One Function for All Tensor Ops

+

Einstein summation: express ANY tensor operation. Matrix multiply: 'ik,kj->ij'. Batch matmul: 'bij,bjk->bik'. Trace: 'ii->'. Often faster than chaining NumPy calls — avoids intermediate arrays.

-

8. Linear Algebra for ML

+

7. Linear Algebra for ML Projects

  • X.T @ X → Gram matrix (basis of linear regression)
  • np.linalg.svd(X) → PCA, dimensionality reduction
  • np.linalg.eigh(cov) → Covariance eigenvectors
  • np.linalg.norm(X, axis=1) → L2 norms for distances
  • np.linalg.lstsq(X, y) → Stable linear regression
  • np.linalg.inv() → AVOID! Use solve() instead (numerically stable)
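A sketch of why solve() is preferred over inv() — a random, typically well-conditioned matrix is assumed here:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50))
b = rng.standard_normal(50)

x_solve = np.linalg.solve(A, b)   # preferred: factorizes A, never forms A^-1
x_inv = np.linalg.inv(A) @ b      # works, but slower and less stable

print(np.allclose(A @ x_solve, b))  # True — solution actually satisfies Ax = b
print(np.allclose(x_solve, x_inv))  # True here; diverges when A is ill-conditioned
```
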
-

9. Random Number Generation (Modern API)

-

np.random.default_rng(42) is the modern way (NumPy 1.17+). Uses PCG64 algorithm — better statistical properties, thread-safe. Old np.random.seed(42) is global state, not thread-safe. Always use default_rng() in new code.

+

8. Random Number Generation

+

Modern: rng = np.random.default_rng(42) (NumPy 1.17+). PCG64 algorithm, thread-safe. Old np.random.seed(42) is global, not thread-safe. Always use default_rng() in projects.
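A minimal reproducibility sketch with the modern API:

```python
import numpy as np

rng1 = np.random.default_rng(42)
rng2 = np.random.default_rng(42)

a = rng1.standard_normal(5)
b = rng2.standard_normal(5)
print(np.array_equal(a, b))   # True — same seed, same stream

# Independent streams per component — no global state to clobber
train_rng = np.random.default_rng(1)
aug_rng = np.random.default_rng(2)
```
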

+ +

9. Image Processing with NumPy

+

Images are just 3D arrays: (height, width, channels). Crop: img[100:200, 50:150]. Resize: scipy.ndimage.zoom or Pillow. Normalize: img / 255.0. Augment: flip img[:, ::-1], rotate with scipy.ndimage.rotate. Foundation of all computer vision.

`, code: `
-

💻 NumPy Code Examples

+

💻 NumPy Project Code

-

1. Array Creation & Memory Inspection

+

1. Feature Engineering with Broadcasting

import numpy as np

# Z-score normalization
X = np.random.randn(1000, 5)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)   # (1000,5) - (5,) broadcasts

# Min-Max scaling
X_scaled = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-8)

# Pairwise Euclidean distance matrix
diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]  # (N,1,D) - (1,N,D)
dist_matrix = np.sqrt((diff ** 2).sum(axis=-1))   # (N,N)
-

3. Advanced Indexing & Boolean Masking

-
# Boolean mask — filter outliers (3 sigma rule) +

2. Boolean Masking & Advanced Indexing

+
# Remove outliers (3-sigma rule)
data = np.random.randn(10000)
clean = data[np.abs(data - data.mean()) < 3 * data.std()]

# np.where — conditional replacement
preds = np.array([0.3, 0.7, 0.1, 0.9])
labels = np.where(preds > 0.5, 1, 0)

# np.select — multiple conditions
conditions = [data < -1, data > 1]
choices = ['low', 'high']
category = np.select(conditions, choices, default='mid')

# Fancy indexing — sample without replacement
rng = np.random.default_rng(42)
idx = rng.choice(len(X), size=500, replace=False)
X_sample = X[idx]
+ +

3. einsum for Complex Operations

# Matrix multiply
C = np.einsum('ik,kj->ij', A, B)

# Batch matrix multiply (deep learning)
batch_result = np.einsum('bij,bjk->bik', batch_A, batch_B)

# Cosine similarity matrix
norms = np.linalg.norm(X, axis=1, keepdims=True)
X_normed = X / norms
sim = np.einsum('ij,kj->ik', X_normed, X_normed)
+ +

4. Implement Linear Regression from Scratch

+
# Normal equation: w = (X^T X)^(-1) X^T y +# Better: use lstsq for numerical stability +X_b = np.c_[np.ones((len(X), 1)), X] # Add bias column +w, residuals, rank, sv = np.linalg.lstsq(X_b, y, rcond=None) +y_pred = X_b @ w +mse = ((y - y_pred) ** 2).mean() +r2 = 1 - ((y - y_pred)**2).sum() / ((y - y.mean())**2).sum()
+ +

5. Memory-Mapped Files for Huge Data

# Process arrays larger than RAM big = np.memmap('huge.npy', dtype=np.float32, mode='w+', shape=(1000000, 100)) -subset = big[5000:6000] # Only reads 1000 rows from disk
+subset = big[5000:6000] # Only reads 1000 rows from disk -

6. Structured Arrays

-
# Mixed dtypes without Pandas overhead +# Structured arrays — mixed types without Pandas dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')]) -data = np.array([('Alice', 30, 95.5), ('Bob', 25, 87.3)], dtype=dt) -print(data['name']) # ['Alice' 'Bob'] -print(data['score'].mean()) # 91.4
+data = np.array([('Alice', 30, 95.5)], dtype=dt)
+ +

6. Implement PCA from Scratch

+
def pca(X, n_components): + # Center the data + X_centered = X - X.mean(axis=0) + # Covariance matrix + cov = X_centered.T @ X_centered / (len(X) - 1) + # Eigendecomposition + eigenvalues, eigenvectors = np.linalg.eigh(cov) + # Sort by largest eigenvalue + idx = eigenvalues.argsort()[::-1][:n_components] + components = eigenvectors[:, idx] + # Project data + X_pca = X_centered @ components + explained_var = eigenvalues[idx] / eigenvalues.sum() + return X_pca, explained_var, components
`, - interview: ` + interview: `

🎯 NumPy Interview Questions

-
Q1: Why is NumPy faster than Python lists?

Answer: (1) Contiguous memory — cache-friendly. (2) Compiled C loops. (3) SIMD instructions — 4-8 floats simultaneously. Together: 50-100x speedup.

-
Q2: View vs copy — what's the difference?

Answer: Views share data (slicing creates views). Copies duplicate. arr[::2] = view, arr[[0,2,4]] (fancy indexing) = copy. Check with np.shares_memory(a, b).

-
Q3: Explain broadcasting with example.

Answer: Compare shapes right-to-left. Dims must be equal or one must be 1. (3,1) + (1,4) → (3,4). No memory copied — strides adjusted internally. Gotcha: (3,) + (3,4) fails — reshape to (3,1) first.

-
Q4: What is axis=0 vs axis=1?

Answer: axis=0 = operate down rows (collapses rows). axis=1 = across columns (collapses columns). For (100,5): mean(axis=0) → (5,) per feature. mean(axis=1) → (100,) per sample.

-
Q5: How to implement PCA with NumPy?

Answer: Center: X_c = X - X.mean(0). Covariance: cov = X_c.T @ X_c / (n-1). Eigendecompose: vals, vecs = np.linalg.eigh(cov). Project: X_pca = X_c @ vecs[:,-k:]. Or use SVD directly.

-
Q6: np.dot vs @ vs einsum?

Answer: np.dot: confusing for 3D+. @: clean matrix multiply, broadcasts. einsum: most flexible. Use @ for readability, einsum for complex ops.

-
Q7: How to handle NaN values?

Answer: np.isnan(arr) detects. np.nanmean(arr) — nan-safe aggregation. Gotcha: np.nan == np.nan is False! IEEE 754 standard.

-
Q8: Explain C-order vs Fortran-order performance.

Answer: C-order stores rows contiguously. Iterating along last axis is fastest (cache-friendly). For column-heavy ops, Fortran can be faster. NumPy defaults to C. Convert with np.asfortranarray().

+
Q1: Why is NumPy faster than Python lists?

Answer: (1) Contiguous memory (cache-friendly). (2) Compiled C loops. (3) SIMD instructions. Together: 50-100x speedup.

+
Q2: View vs copy?

Answer: Slicing = view (shares data). Fancy indexing = copy. Check: np.shares_memory(a, b). Views are dangerous: modifying view modifies original.

+
Q3: Broadcasting rules?

Answer: Right-to-left: dims must equal or one is 1. (3,1) + (1,4) → (3,4). No memory copied. Gotcha: (3,) + (3,4) fails — reshape to (3,1).

+
Q4: axis=0 vs axis=1?

Answer: axis=0: operate down rows (collapse rows). axis=1: across columns (collapse columns). (100,5): mean(axis=0)→(5,). mean(axis=1)→(100,).

+
Q5: Implement PCA with NumPy?

Answer: Center, compute covariance, eigendecompose (eigh), sort by eigenvalue, project onto top-k eigenvectors. Or SVD directly.

+
Q6: np.dot vs @ vs einsum?

Answer: @: clean, broadcasts. np.dot: confusing for 3D+. einsum: most flexible, any tensor op. Use @ for readability.

+
Q7: How to handle NaN?

Answer: np.isnan() detects. np.nanmean() ignores NaN. Gotcha: NaN == NaN is False (IEEE 754).

+
Q8: C-order vs Fortran-order?

Answer: C: rows contiguous (default). Fortran: columns contiguous (LAPACK/BLAS). Iterate last axis for speed. Convert: np.asfortranarray().

` - }, +}, - "pandas": { - concepts: ` +"pandas": { + concepts: `

🐼 Pandas — Complete Deep Dive

⚡ DataFrame Internals — BlockManager
-
A DataFrame is NOT a 2D array. Internally, Pandas uses a BlockManager — columns of the same dtype are stored together in contiguous NumPy arrays (blocks). This is why column operations are fast (same block) but row iteration is slow (crosses blocks).
+
A DataFrame is NOT a 2D array. Uses BlockManager — same-dtype columns stored in contiguous blocks. Column operations: fast (same block). Row iteration: slow (crosses blocks). This is why df.iterrows() is 100x slower than vectorized ops.
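The iterrows-vs-vectorized claim above can be checked directly. A minimal sketch with a synthetic two-column frame (exact timings vary by machine, but the gap is consistently orders of magnitude):

```python
import time
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.arange(20_000), "b": np.arange(20_000)})

# Slow: iterrows crosses dtype blocks and materializes a Series per row
t0 = time.perf_counter()
total_slow = sum(row["a"] + row["b"] for _, row in df.iterrows())
slow = time.perf_counter() - t0

# Fast: column arithmetic stays inside contiguous same-dtype blocks
t0 = time.perf_counter()
total_fast = (df["a"] + df["b"]).sum()
fast = time.perf_counter() - t0

assert total_slow == total_fast
print(f"iterrows: {slow:.3f}s  vectorized: {fast:.5f}s")
```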
-

1. .loc vs .iloc — The Golden Rule

-
-
🎯 Never Confuse These
-
.loc = Label-based. Inclusive on both ends. .iloc = Integer position. Exclusive on end. df.loc[0:5] includes row 5. df.iloc[0:5] excludes row 5.
+

1. The Golden Rules

+
+
⚠️ 5 Rules That Prevent 90% of Pandas Bugs
+ 1. Use .loc (label) and .iloc (position) — never chain indexing.
+ 2. df.loc[0:5] includes 5. df.iloc[0:5] excludes 5.
+ 3. df[mask]['col'] = x creates copy. Use df.loc[mask, 'col'] = x.
+ 4. df2 = df is NOT a copy. Use df2 = df.copy().
+ 5. Always check df.dtypes and df.isna().sum() first.
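Rules 2-4 above, demonstrated on a tiny frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [-1, 2, -3, 4], "y": [10, 20, 30, 40]})

# Rule 2: .loc slices by label and is inclusive; .iloc is end-exclusive
assert len(df.loc[0:2]) == 3   # labels 0, 1, 2
assert len(df.iloc[0:2]) == 2  # positions 0, 1

# Rule 3: assign through .loc, never through chained indexing
df.loc[df["x"] < 0, "y"] = 0
assert df["y"].tolist() == [0, 20, 0, 40]

# Rule 4: plain assignment aliases; .copy() gives an independent frame
df2 = df.copy()
df2["y"] = 99
assert df["y"].tolist() == [0, 20, 0, 40]  # original untouched
```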
-

2. SettingWithCopyWarning — Finally Explained

-

Chained indexing (df[df.x > 0]['y'] = 5) may create a temporary copy. Fix: df.loc[df.x > 0, 'y'] = 5. In Pandas 2.0+, Copy-on-Write mode eliminates this entirely.

- -

3. GroupBy Split-Apply-Combine

-

The most powerful Pandas operation. (1) Split into groups, (2) Apply function to each, (3) Combine results. GroupBy is lazy — no computation until aggregation. Key methods: agg() (reduce), transform() (broadcast), filter() (keep/drop groups), apply() (flexible).

+

2. GroupBy — Split-Apply-Combine

+

Most powerful Pandas operation. (1) Split → (2) Apply function → (3) Combine results. GroupBy is lazy — no computation until aggregation. Key methods:

+ + + + + + +
MethodOutput ShapeUse Case
agg()Reduced (one row/group)Sum, mean, count per group
transform()Same as inputFill with group mean, normalize within group
filter()Subset of groupsKeep groups with N > 100
apply()FlexibleCustom function per group
-

4. Pandas 2.0 — Major Changes

+

3. Pandas 2.0 — Major Changes

- - + + -
FeatureBefore (1.x)After (2.0+)
BackendNumPy onlyApache Arrow backend option
Copy semanticsConfusingCopy-on-Write (explicit)
BackendNumPy onlyApache Arrow option
Copy semanticsConfusingCopy-on-Write
String dtypeobjectstring[pyarrow] (faster)
Nullable typesNaN for everythingpd.NA (proper null)
Index dtypesint64 defaultMatches data dtype
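Two of the 2.0 changes from the table, sketched (assumes pandas ≥ 2.0; Copy-on-Write becomes the only mode in pandas 3.0):

```python
import pandas as pd

# Copy-on-Write: mutating the parent no longer leaks into derived objects
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"a": [1, 2, 3]})
view = df["a"]
df.loc[0, "a"] = 99          # df is copied lazily; `view` keeps old data
assert view.tolist() == [1, 2, 3]

# Nullable dtypes: pd.NA instead of forcing float NaN for missing integers
s = pd.Series([1, None, 3], dtype="Int64")
assert s.dtype.name == "Int64"
assert s.isna().sum() == 1
```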
-

5. Polars vs Pandas

+

4. Polars vs Pandas

- - - + + - - + +
FeaturePandasPolars
Speed1x5-50x faster (Rust)
MemoryHigherLower (Arrow-native)
ParallelismSingle-threadedMulti-threaded by default
Speed1x5-50x (Rust)
ParallelismSingle-threadedMulti-threaded auto
APIEagerLazy + Eager
EcosystemMassiveGrowing
When to useEDA, legacy projectsLarge data, production pipelines
EcosystemMassiveGrowing fast
Use whenEDA, small-med data, legacyLarge data, production
-

6. Method Chaining

-

Fluent API style. More readable, no intermediate variables. Use .assign() instead of df['col'] = .... Use .pipe() for custom functions. Use .query() for readable filtering.

+

5. Merge/Join Patterns

+ + + + + +
MethodHowWhen
merge()SQL-style joins on columnsCombine tables on shared keys
join()Joins on indexIndex-based combining
concat()Stack along axisAppend rows/columns
+

Common pitfall: Merge produces more rows than expected = many-to-many join. Always check: len(merged) vs len(left).

-

7. Memory Optimization

+

6. Memory Optimization Strategies

- - + + - - + + +
StrategySavingsWhen to Use
Category dtype90%+Columns with few unique strings
StrategySavingsWhen
Category dtype90%+Few unique strings
Downcast numerics50-75%int64 → int32/int16
Sparse arrays80%+Columns mostly zeros/NaN
PyArrow backend30-50%String-heavy DataFrames
Sparse arrays80%+Mostly zeros/NaN
PyArrow backend30-50%String-heavy data
Read only needed columnsVariableusecols=['a','b']
-

8. Window Functions

-

.rolling(N) — fixed-size sliding window. .expanding() — cumulative from start. .ewm(span=N) — exponentially weighted. All support .mean(), .std(), .apply(func). Critical for time series feature engineering: lag features, moving averages, volatility.

+

7. Window Functions for Time Series

+

.rolling(N): fixed sliding window. .expanding(): cumulative. .ewm(span=N): exponentially weighted. All support .mean(), .std(), .apply(). Essential for: lag features, moving averages, volatility, Bollinger bands.
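The three window types side by side, on a toy price series (values are illustrative):

```python
import numpy as np
import pandas as pd

prices = pd.Series([100, 102, 101, 105, 107, 110], dtype=float)

ma3 = prices.rolling(3).mean()      # fixed window: NaN until 3 points exist
cum_max = prices.expanding().max()  # cumulative from the start
smooth = prices.ewm(span=3).mean()  # recent points weighted more heavily

assert np.isnan(ma3.iloc[1])
assert ma3.iloc[2] == (100 + 102 + 101) / 3
assert cum_max.iloc[-1] == 110

# Bollinger-style bands: rolling mean ± 2 rolling std
upper = ma3 + 2 * prices.rolling(3).std()
lower = ma3 - 2 * prices.rolling(3).std()
```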

+ +

8. Pivot Tables & Crosstab

+

df.pivot_table(values, index, columns, aggfunc) — summarize data by two categorical dimensions. pd.crosstab() — frequency table of two categorical columns. Essential for EDA and business reporting.

+ +

9. Method Chaining Pattern

+

Fluent API: .assign() instead of df['col']=. .pipe(func) for custom. .query('col > 5') for readable filters. No intermediate variables = cleaner, reproducible pipelines.

`, code: `
-

💻 Pandas Code Examples

+

💻 Pandas Project Code

-

1. Method Chaining — Production Pattern

+

1. Complete Data Loading & Cleaning Pipeline

import pandas as pd +import numpy as np -result = ( - pd.read_csv('sales.csv') - .rename(columns=str.lower) - .assign( - date=lambda df: pd.to_datetime(df['date']), - revenue=lambda df: df['price'] * df['quantity'] +def load_and_clean(path, config): + """Production data loading pipeline.""" + df = ( + pd.read_csv(path, usecols=config['columns'], + dtype=config.get('dtypes', None), + parse_dates=config.get('date_cols', [])) + .rename(columns=str.lower) + .drop_duplicates() + .assign( + date=lambda df: pd.to_datetime(df['date']), + revenue=lambda df: df['price'] * df['qty'] + ) + .query('revenue > 0') + .pipe(optimize_dtypes) ) - .query('revenue > 100') - .groupby('month') - .agg({'revenue': ['sum', 'mean', 'count']}) -)
+ return df

2. GroupBy — Beyond Basics

-
# Named aggregation (clean column names) +
# Named aggregation summary = df.groupby('category').agg( total=('revenue', 'sum'), avg_price=('price', 'mean'), - n_orders=('order_id', 'nunique') + n_orders=('order_id', 'nunique'), + top_product=('product', lambda x: x.value_counts().index[0]) ) -# Transform — broadcast back to original shape +# Transform — normalize within groups df['pct_of_group'] = df.groupby('cat')['rev'].transform( lambda x: x / x.sum() * 100 -)
+) -

3. Merge Patterns

-
# LEFT JOIN with indicator -merged = pd.merge(orders, customers, on='id', - how='left', indicator=True) -orphans = merged[merged['_merge'] == 'left_only']
- -

4. Time Series Operations

-
# Resample, rolling, lag features -daily = df.set_index('date') -weekly = daily['revenue'].resample('W').sum() -df['ma_7'] = df['revenue'].rolling(7).mean() -df['lag_1'] = df['revenue'].shift(1) -df['pct_chg'] = df['revenue'].pct_change()
- -

5. Memory Optimization

+# Filter — keep only groups with enough data +df_filtered = df.groupby('user').filter(lambda x: len(x) >= 5)
+ +

3. Time Series Feature Engineering

+
def create_time_features(df, date_col, target_col): + """Generate time series features for ML.""" + df = df.sort_values(date_col).copy() + + # Lag features + for lag in [1, 3, 7, 14, 30]: + df[f'lag_{lag}'] = df[target_col].shift(lag) + + # Rolling statistics + for window in [7, 14, 30]: + df[f'rolling_mean_{window}'] = df[target_col].rolling(window).mean() + df[f'rolling_std_{window}'] = df[target_col].rolling(window).std() + + # Date features + df['dayofweek'] = df[date_col].dt.dayofweek + df['month'] = df[date_col].dt.month + df['is_weekend'] = df['dayofweek'].isin([5, 6]).astype(int) + + # Percentage change + df['pct_change'] = df[target_col].pct_change() + + return df
+ +

4. Memory Optimization

def optimize_dtypes(df): + """Reduce DataFrame memory by 60-80%.""" + start_mem = df.memory_usage(deep=True).sum() / 1024**2 + for col in df.select_dtypes(['int']).columns: df[col] = pd.to_numeric(df[col], downcast='integer') for col in df.select_dtypes(['float']).columns: df[col] = pd.to_numeric(df[col], downcast='float') for col in df.select_dtypes(['object']).columns: - if df[col].nunique() / len(df) < 0.5: + if df[col].nunique() / len(df) < 0.5: df[col] = df[col].astype('category') - return df -# 800 MB → 200 MB typical savings
+ + end_mem = df.memory_usage(deep=True).sum() / 1024**2 + print(f"Memory: {start_mem:.1f}MB → {end_mem:.1f}MB ({100*(1-end_mem/start_mem):.0f}% reduction)") + return df
+ +

5. Merge with Validation

+
# LEFT JOIN with indicator for debugging +merged = pd.merge(orders, customers, on='customer_id', + how='left', indicator=True, validate='many_to_one') + +# Check for orphan records +orphans = merged[merged['_merge'] == 'left_only'] +print(f"Orphan orders: {len(orphans)}") + +# Multi-key merge +result = pd.merge(df1, df2, on=['date', 'product_id'], + how='inner', suffixes=('_actual', '_predicted'))
+ +

6. Pivot Table for Business Reporting

+
# Revenue by month and category +pivot = df.pivot_table( + values='revenue', + index=df['date'].dt.to_period('M'), + columns='category', + aggfunc=['sum', 'count'], + margins=True # Add totals row/column +) + +# Crosstab — frequency of two categorical columns +ct = pd.crosstab(df['region'], df['product'], normalize='index')
`, - interview: ` + interview: `

🎯 Pandas Interview Questions

-
Q1: SettingWithCopyWarning — cause and fix?

Answer: Chained indexing may modify a copy. Fix: df.loc[mask, 'col'] = val. Pandas 2.0+ Copy-on-Write: pd.options.mode.copy_on_write = True.

-
Q2: merge vs join vs concat?

Answer: merge(): SQL joins on columns. join(): joins on index. concat(): stack along axis. Use merge for column joins, concat for stacking.

-
Q3: apply vs map vs transform?

Answer: map(): Series element-wise. apply(): rows/columns. transform(): same shape output. All are slow — prefer vectorized operations.

-
Q4: GroupBy transform vs agg?

Answer: agg() reduces — one value per group. transform() broadcasts — same shape as input. Use transform for "fill with group mean" patterns.

-
Q5: What is MultiIndex?

Answer: Hierarchical indexing — multiple levels. Use for pivot tables, panel data (entity + time). Access with .xs() or tuple: df.loc[('A', 2023)]. Convert back with .reset_index().

-
Q6: Pandas vs Polars — when to choose?

Answer: Pandas: mature ecosystem, EDA, small-medium data. Polars: 5-50x faster (Rust), multi-threaded, lazy evaluation, better for large data and production pipelines. Polars for new projects with big data.

-
Q7: How to handle missing data in production?

Answer: (1) dropna(thresh=N), (2) fillna(method='ffill') for time series, (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.

+
Q1: SettingWithCopyWarning?

Answer: Chained indexing may silently modify a temporary copy instead of the original. Fix: df.loc[mask, 'col'] = val. Pandas 2.0+ Copy-on-Write eliminates this.

+
Q2: merge vs join vs concat?

Answer: merge: SQL joins on columns. join: on index. concat: stack along axis. Use merge for column joins, concat for appending.

+
Q3: apply vs map vs transform?

Answer: map: Series element-wise. apply: rows/columns. transform: same-shape output. All slow — prefer vectorized when possible.

+
Q4: GroupBy transform vs agg?

Answer: agg reduces. transform broadcasts back. Use transform for "fill with group mean" or "normalize within group" patterns.

+
Q5: How to handle missing data?

Answer: (1) dropna(thresh=N), (2) .ffill() for time series (fillna(method='ffill') is deprecated since pandas 2.1), (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.

+
Q6: Pandas vs Polars?

Answer: Polars: 5-50x faster (Rust), multi-threaded, lazy eval. Pandas: mature ecosystem, wide compatibility. New projects with big data → Polars.

+
Q7: What is MultiIndex?

Answer: Hierarchical indexing. Use for pivot tables, panel data. Access with .xs() or tuple. Reset with .reset_index().

+
Q8: How to optimize a 5GB DataFrame?

Answer: (1) Read only needed columns. (2) Downcast dtypes. (3) Category for strings. (4) Sparse for zeros. (5) PyArrow backend. (6) Process in chunks. Can reduce 5GB to 1GB.

` - }, +}, "visualization": { concepts: ` @@ -548,7 +807,7 @@ df['pct_chg'] = df['revenue'
⚡ The Grammar of Graphics
-
Leland Wilkinson's framework: Data (what to plot) + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart follows this.
+
Data + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart = this framework.

1. Choosing the Right Chart

@@ -558,104 +817,159 @@ df['pct_chg'] = df['revenue'Relationship?Scatter, Hexbin, RegressionSeaborn/Plotly Comparison?Bar, Grouped bar, ViolinSeaborn Trend over time?Line, Area chartPlotly/Matplotlib - Correlation matrix?HeatmapSeaborn + Correlation?HeatmapSeaborn Part of whole?Pie, Treemap, SunburstPlotly Geographic?Choropleth, MapboxPlotly/Folium - High-dimensional?Parallel coords, UMAPPlotly/UMAP + High-dimensional?Parallel coords, UMAPPlotly + ML results?Confusion matrix, ROC, SHAPSeaborn/SHAP

2. Matplotlib Architecture

-

Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure contains Axes (subplots). Each Axes has Axis objects. Always prefer OO API (fig, ax = plt.subplots()) over pyplot for production.

-

rcParams: Control global defaults. Set plt.rcParams['font.size'] = 14 once. Create a style file for consistency across all project figures. Use plt.style.use('seaborn-v0_8-whitegrid') for clean defaults.

+

Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure → Axes (subplots) → Axis objects. Always use OO API: fig, ax = plt.subplots().

+

rcParams: Global defaults. plt.rcParams['font.size'] = 14. Create style files for project consistency. plt.style.use('seaborn-v0_8-whitegrid').

3. Color Theory for Data

-
💡 Color Best Practices
- Sequential: viridis, plasma (one variable, low→high).
- Diverging: RdBu, coolwarm (center point matters).
+
💡 Color Guide
+ Sequential: viridis, plasma (low→high).
+ Diverging: RdBu, coolwarm (center matters).
Categorical: Set2, tab10 (distinct groups).
- Never use rainbow/jet — bad for colorblind users and perceptually non-uniform. + Never use rainbow/jet — bad for colorblind, perceptually non-uniform.

4. Seaborn — Statistical Visualization

-

Three API levels: Figure-level (relplot, catplot, displot — own figure), Axes-level (scatterplot, boxplot — on existing axes), Objects API (0.12+, composable). Seaborn auto-computes statistics (regression lines, confidence intervals, density estimates).

+

Three API levels: Figure-level (relplot, catplot, displot), Axes-level (scatterplot, boxplot), Objects API (0.12+). Auto-computes regression lines, confidence intervals, density estimates.

5. Plotly — Interactive Dashboards

-

JavaScript-powered charts with hover, zoom, selection. plotly.express for quick plots, plotly.graph_objects for full control. Integrates with Dash for production dashboards. Supports 3D, maps, and animations. Export to HTML for sharing.

+

JavaScript-powered: hover, zoom, selection. plotly.express for quick plots. plotly.graph_objects for control. Integrates with Dash for production dashboards. Supports 3D, maps, animations. Export to HTML.

-

6. Common Mistakes

+

6. Visualization for ML Projects

+ + + + + + + + + + + +
What to VisualizeChartWhy
Class distributionBar chartDetect imbalance
Feature distributionsHistogram/KDE gridFind skew, outliers
Feature correlationsHeatmap (triangular)Multicollinearity
Training curvesLine plot (loss/acc vs epoch)Detect overfit/underfit
Model comparisonBox plot of CV scoresCompare variance
Confusion matrixAnnotated heatmapError analysis
ROC curveLine plot + AUCThreshold selection
Feature importanceHorizontal barModel interpretation
SHAP valuesBeeswarm/waterfallIndividual predictions
+ +

7. Common Mistakes

`, code: `
-

💻 Visualization Code Examples

+

💻 Visualization Project Code

-

1. Matplotlib — Publication Quality

+

1. Publication-Quality Multi-Subplot Figure

import matplotlib.pyplot as plt import numpy as np -# Professional multi-subplot figure -fig, axes = plt.subplots(1, 3, figsize=(15, 5)) +# Professional style setup +plt.rcParams.update({ + 'font.size': 12, 'axes.titlesize': 14, + 'figure.facecolor': 'white', + 'axes.spines.top': False, 'axes.spines.right': False +}) + +fig, axes = plt.subplots(2, 2, figsize=(14, 10)) + +# Example data (synthetic, so the snippet runs standalone) +rng = np.random.default_rng(0) +data = rng.normal(size=1000) +x = np.linspace(0, 10, 100) +y, z = rng.normal(size=100), rng.normal(size=100) +y_mean, y_std = np.sin(x), np.full_like(x, 0.3) +categories, values, errors = ['A', 'B', 'C'], [3, 5, 4], [0.3, 0.5, 0.2] -# Distribution with mean line -data = np.random.randn(1000) -axes[0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white') -axes[0].axvline(data.mean(), color='red', linestyle='--', label='Mean') +# Distribution +axes[0,0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white') +axes[0,0].axvline(data.mean(), color='red', linestyle='--', label='Mean') +axes[0,0].set_title('Distribution') # Scatter with colormap -x, y = np.random.randn(2, 100) -scatter = axes[1].scatter(x, y, c=y, cmap='viridis', alpha=0.7) -plt.colorbar(scatter, ax=axes[1]) +sc = axes[0,1].scatter(x, y, c=z, cmap='viridis', alpha=0.7) +plt.colorbar(sc, ax=axes[0,1]) # Line with confidence interval -x = np.linspace(0, 10, 100) -axes[2].plot(x, np.sin(x), 'b-', linewidth=2) -axes[2].fill_between(x, np.sin(x)-0.3, np.sin(x)+0.3, alpha=0.2) +axes[1,0].plot(x, y_mean, 'b-', linewidth=2) +axes[1,0].fill_between(x, y_mean-y_std, y_mean+y_std, alpha=0.2) + +# Bar with error bars +axes[1,1].bar(categories, values, yerr=errors, capsize=5, color='coral') plt.tight_layout() plt.savefig('figure.png', dpi=300, bbox_inches='tight')
-

2. Seaborn — Statistical Plots

+

2. ML Evaluation Dashboard

import seaborn as sns +from sklearn.metrics import confusion_matrix, roc_curve, auc -# Correlation heatmap (upper triangle only) -fig, ax = plt.subplots(figsize=(10, 8)) -mask = np.triu(np.ones_like(df.corr(), dtype=bool)) -sns.heatmap(df.corr(), mask=mask, annot=True, - fmt='.2f', cmap='RdBu_r', center=0) +def plot_model_evaluation(y_true, y_pred, y_proba, model, feature_names): + fig, axes = plt.subplots(1, 3, figsize=(18, 5)) + + # Confusion Matrix + cm = confusion_matrix(y_true, y_pred) + sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0]) + axes[0].set_title('Confusion Matrix') + + # ROC Curve + fpr, tpr, _ = roc_curve(y_true, y_proba) + axes[1].plot(fpr, tpr, label=f'AUC={auc(fpr,tpr):.3f}') + axes[1].plot([0,1], [0,1], 'k--') + axes[1].set_title('ROC Curve') + axes[1].legend() + + # Feature Importance + importance = model.feature_importances_ + idx = importance.argsort() + axes[2].barh(np.array(feature_names)[idx], importance[idx]) + axes[2].set_title('Feature Importance') + + plt.tight_layout()
-# Pair plot — all relationships at once -sns.pairplot(df, hue='target', diag_kind='kde') +

3. Seaborn — EDA in One Call

+
# Pair plot — all relationships at once +sns.pairplot(df, hue='target', diag_kind='kde', + plot_kws={'alpha': 0.6}) -# Violin + strip — distribution + individual points -sns.violinplot(x='cat', y='val', data=df, inner=None, alpha=0.3) -sns.stripplot(x='cat', y='val', data=df, size=3, jitter=True)
+# Correlation heatmap (upper triangle) +mask = np.triu(np.ones_like(df.corr(), dtype=bool)) +sns.heatmap(df.corr(), mask=mask, annot=True, + fmt='.2f', cmap='RdBu_r', center=0)
-

3. Plotly — Interactive

+

4. Plotly — Interactive Dashboard

import plotly.express as px +from plotly.subplots import make_subplots +import plotly.graph_objects as go -# Animated scatter (like Gapminder) +# Animated scatter (Gapminder style) fig = px.scatter(df, x='gdp', y='life_exp', animation_frame='year', size='pop', color='continent', hover_name='country') -fig.show()
+ +# Training curves dashboard +fig = make_subplots(rows=1, cols=2, + subplot_titles=['Loss', 'Accuracy']) +fig.add_trace(go.Scatter(y=train_loss, name='Train Loss'), row=1, col=1) +fig.add_trace(go.Scatter(y=val_loss, name='Val Loss'), row=1, col=1) +fig.add_trace(go.Scatter(y=train_acc, name='Train Acc'), row=1, col=2) +fig.add_trace(go.Scatter(y=val_acc, name='Val Acc'), row=1, col=2) +fig.write_html('training_dashboard.html') `, interview: `

🎯 Visualization Interview Questions

-
Q1: When to use Matplotlib vs Seaborn vs Plotly?

Answer: Matplotlib: full control, publication figures. Seaborn: statistical EDA, beautiful defaults. Plotly: interactive dashboards, stakeholders. Rule: Seaborn for EDA, Matplotlib for papers, Plotly for stakeholders.

-
Q2: How to visualize high-dimensional data?

Answer: (1) PCA/t-SNE/UMAP to 2D, (2) Pair plots, (3) Parallel coordinates, (4) Correlation heatmap, (5) SHAP summary plots.

-
Q3: How to handle overplotting?

Answer: (1) alpha transparency, (2) hexbin, (3) 2D KDE, (4) random sampling, (5) Datashader for millions of points.

-
Q4: What makes good visualization for non-technical stakeholders?

Answer: Clear title stating conclusion, minimal chart junk, annotate key points, consistent color, one insight per chart. Tell a story — what action should they take?

-
Q5: Explain Figure vs Axes in Matplotlib.

Answer: Figure = entire canvas. Axes = single plot area. fig, axes = plt.subplots(2,2) = 4 plots. Always use OO API: ax.plot() not plt.plot().

-
Q6: How to make accessible visualizations?

Answer: Colorblind-safe palettes (viridis), don't rely on color alone, add shapes/patterns, sufficient contrast, alt text, large fonts (12pt+).

+
Q1: Matplotlib vs Seaborn vs Plotly?

Answer: Matplotlib: full control, papers. Seaborn: statistical EDA, beautiful. Plotly: interactive, stakeholders. Rule: Seaborn→EDA, Matplotlib→papers, Plotly→stakeholders.

+
Q2: How to visualize high-dimensional data?

Answer: (1) PCA/t-SNE/UMAP to 2D, (2) Pair plots, (3) Parallel coordinates, (4) Correlation heatmap, (5) SHAP plots.

+
Q3: Handle overplotting?

Answer: alpha, hexbin, 2D KDE, random sampling, Datashader for millions of points.

+
Q4: Good viz for non-technical audience?

Answer: Title states conclusion. One insight per chart. Annotate key points. Consistent color. Minimal chart junk. Tell a story.

+
Q5: Figure vs Axes?

Answer: Figure = canvas. Axes = plot area. fig, axes = plt.subplots(2,2). Use OO API: ax.plot() not plt.plot().

+
Q6: Accessible visualizations?

Answer: Colorblind palettes (viridis), shapes not just color, sufficient contrast, alt text, 12pt+ fonts.

+
Q7: How to visualize model performance?

Answer: Training curves (loss/acc vs epoch), confusion matrix (heatmap), ROC/AUC, feature importance (horizontal bars), SHAP for interpretability.

` }, @@ -664,100 +978,146 @@ fig.show()

🎯 Advanced Python — Complete Engineering Guide

-

1. Decorators — Beyond Basics

+

1. Decorators — Complete Patterns

⚡ Three Levels of Decorators
-
Level 1: Simple wrapper (timing, logging). Level 2: Decorator with arguments (factory pattern). Level 3: Class-based decorators with state. Always use functools.wraps to preserve function metadata (name, docstring, signature).
+
Level 1: Simple wrapper (timing, logging). Level 2: With arguments (factory). Level 3: Class-based with state. Always use functools.wraps.
+

Common patterns: Retry with exponential backoff, caching, rate limiting, authentication, input validation, deprecation warnings.
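Level 3 from the box above — a class-based decorator keeping state across calls (the call-counting use case is illustrative):

```python
import functools

class CallCounter:
    """Level-3 decorator: a class instance wraps the function and holds state."""
    def __init__(self, func):
        functools.update_wrapper(self, func)  # preserve name/docstring
        self.func = func
        self.calls = 0

    def __call__(self, *args, **kwargs):
        self.calls += 1
        return self.func(*args, **kwargs)

@CallCounter
def square(x):
    return x * x

assert square(3) == 9 and square(4) == 16
assert square.calls == 2
assert square.__name__ == "square"  # metadata preserved
```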

2. Context Managers

-

Managing resources reliably. with blocks guarantee cleanup even on errors. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: file handles, DB connections, GPU locks, temporary settings.

+

Guarantee resource cleanup. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: files, DB connections, GPU locks, temporary settings, timers.
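Both approaches side by side — a generator-based timer and a class-based exception suppressor (both names are illustrative; `contextlib.suppress` provides the second one built-in):

```python
import time
from contextlib import contextmanager

# Approach 2: generator-based — code before yield is __enter__, after is __exit__
@contextmanager
def timer(label):
    start = time.perf_counter()
    try:
        yield
    finally:
        print(f"{label}: {time.perf_counter() - start:.4f}s")

# Approach 1: class-based — explicit __enter__/__exit__
class Suppress:
    def __init__(self, *exc_types):
        self.exc_types = exc_types
    def __enter__(self):
        return self
    def __exit__(self, exc_type, exc, tb):
        # Returning True swallows the exception
        return exc_type is not None and issubclass(exc_type, self.exc_types)

with timer("sum"):
    total = sum(range(1_000_000))

with Suppress(ZeroDivisionError):
    1 / 0            # swallowed; execution continues
assert total == 499999500000
```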

-

3. Dataclasses vs namedtuple vs Pydantic

+

3. Dataclasses vs namedtuple vs Pydantic vs attrs

- - - - - - - - + + + + + +
FeaturenamedtupledataclassPydantic
Mutable✓ (default)✓ (v2)
Validation✗ (manual)✓ (automatic)
Default valuesLimited
Inheritance
JSON serializationManualManualBuilt-in
PerformanceFastestFastSlower (validation)
Use caseImmutable recordsData containersAPI models, configs
FeaturenamedtupledataclassPydanticattrs
Mutable✓ (v2)
Validation✓ (auto)✓ (validators)
JSON✓ (built-in)via cattrs
PerformanceFastestFastMediumFast
Use forRecordsData containersAPI modelsComplex classes

4. Type Hints — Complete Guide

-
🎯 Why Type Hints Matter
-
Type hints enable: IDE autocompletion, static analysis (mypy), self-documenting code, and runtime validation (Pydantic). Python doesn't enforce them at runtime — they're optional annotations checked by external tools.
+
🎯 Why Type Hints Matter for Projects
+
Enable: IDE autocompletion, mypy static analysis, self-documenting code, Pydantic validation. Python doesn't enforce at runtime — they're for tools and humans.
- - - - - - + + + + + +
HintMeaningExample
int, str, floatBasic typesdef f(x: int) -> str:
list[int]List of ints (3.9+)scores: list[int] = []
dict[str, Any]Dict with str keysconfig: dict[str, Any]
Optional[int]int or Nonex: int | None (3.10+)
Union[int, str]int or strid: int | str
Callable[[int], str]Function signatureCallbacks, decorators
TypeVar('T')Generic typeGeneric containers
dict[str, Any]Dict str keysconfig: dict[str, Any]
int | NoneOptional (3.10+)x: int | None = None
Callable[[int], str]Function typeCallbacks
TypeVarGenericGeneric containers
LiteralExact valuesLiteral['train','test']
TypedDictDict with typed keysJSON schemas
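The table's heavier hints in use — `TypedDict`, `Literal`, `TypeVar`, and `Callable` (checked by mypy, ignored at runtime; the names are illustrative):

```python
from typing import Callable, Literal, TypedDict, TypeVar

class SplitConfig(TypedDict):
    """JSON-like dict with typed keys — mypy flags wrong keys/values."""
    ratio: float
    mode: Literal["train", "test"]

T = TypeVar("T")

def first(items: list[T], default: T) -> T:
    """Generic: the return type matches the list's element type."""
    return items[0] if items else default

def apply_twice(f: Callable[[int], int], x: int) -> int:
    return f(f(x))

cfg: SplitConfig = {"ratio": 0.8, "mode": "train"}
assert first([1, 2, 3], 0) == 1
assert first([], 9) == 9
assert apply_twice(lambda n: n + 1, 0) == 2
```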
-

5. async/await — Concurrent Python

-

Async is for I/O-bound tasks (API calls, DB queries, file reads). NOT for CPU-bound work (use multiprocessing). The event loop manages coroutines cooperatively. asyncio.gather() runs multiple coroutines concurrently. aiohttp for async HTTP, asyncpg for async PostgreSQL.

+

5. async/await — Concurrent I/O

+

For I/O-bound tasks: API calls, DB queries, file reads. NOT for CPU (use multiprocessing). Event loop manages coroutines cooperatively. asyncio.gather() runs concurrently. Game changer: 100 API calls in ~1s vs 100s sequentially.
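The concurrency win above, sketched with asyncio.sleep standing in for network latency (real code would use aiohttp or similar):

```python
import asyncio
import time

async def fetch(i):
    """Stand-in for an API call: awaiting yields control to the event loop."""
    await asyncio.sleep(0.1)   # simulated network latency
    return i * i

async def main():
    # All 20 "requests" wait concurrently → ~0.1s total, not 2s sequential
    return await asyncio.gather(*(fetch(i) for i in range(20)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

assert len(results) == 20 and results[3] == 9
assert elapsed < 1.0   # concurrent, not 20 × 0.1s sequential
```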

-

6. Descriptors — How @property Works

-

A descriptor is any object implementing __get__, __set__, or __delete__. @property is a descriptor. They control attribute access at the class level. Used in Django ORM fields, SQLAlchemy columns, and dataclass fields.

+

6. Design Patterns for ML Projects

+ + + + + + + + + +
PatternUse CasePython Implementation
StrategySwap algorithmsPass function/class as argument
FactoryCreate objects by nameRegistry dict: models['rf']
ObserverTraining callbacksEvent system with hooks
PipelineData transformationsChain of fit→transform
SingletonModel cache, DB poolModule-level or metaclass
TemplateTraining loopABC with abstract methods
RegistryAuto-register modelsClass decorator + dict
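The Registry and Factory rows from the table combined — a class decorator fills a dict, and a factory creates models by name (a common pattern in ML codebases; the model classes here are toy placeholders):

```python
MODEL_REGISTRY = {}

def register(name):
    """Class decorator: auto-register each model under a string key."""
    def deco(cls):
        MODEL_REGISTRY[name] = cls
        return cls
    return deco

@register("linear")
class LinearModel:
    def predict(self, x):
        return 2 * x

@register("constant")
class ConstantModel:
    def predict(self, x):
        return 0

def make_model(name, **kwargs):
    """Factory: instantiate by name, e.g. from a config file."""
    return MODEL_REGISTRY[name](**kwargs)

model = make_model("linear")
assert sorted(MODEL_REGISTRY) == ["constant", "linear"]
assert model.predict(3) == 6
```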
-

7. Metaclasses

-

Classes are objects too. Metaclasses define how classes behave. type is the default metaclass. Use for: auto-registering subclasses (model registry), enforcing interface standards, singleton pattern. Most developers should use class decorators instead — metaclasses are a last resort.

+

7. Descriptors — How @property Works

+

Any object implementing __get__/__set__/__delete__. @property is a descriptor. Control attribute access at class level. Used in Django ORM, SQLAlchemy, dataclass fields.
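A minimal data descriptor — one `Positive` instance per class attribute validates every assignment (the same mechanism @property uses underneath):

```python
class Positive:
    """Descriptor: __set_name__ learns the attribute name, __set__ validates."""
    def __set_name__(self, owner, name):
        self.name = "_" + name
    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)
    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.name[1:]} must be positive")
        setattr(obj, self.name, value)

class Rectangle:
    width = Positive()    # class-level descriptors guard every instance
    height = Positive()
    def __init__(self, width, height):
        self.width = width      # routes through Positive.__set__
        self.height = height

r = Rectangle(3, 4)
assert r.width * r.height == 12
try:
    r.width = -1
except ValueError as e:
    assert "positive" in str(e)
```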

-

8. __slots__ for Memory Efficiency

-

By default, instances store attributes in __dict__. __slots__ replaces with a fixed tuple. Saves ~40% memory per instance. Use when creating millions of objects. Trade-off: can't add dynamic attributes. Especially useful for data-heavy classes.

+

8. Metaclasses — Advanced

+

Classes are objects. Metaclasses define how classes behave. type is the default. Use for: auto-registration, interface enforcement, singleton. Most should use class decorators instead.
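A minimal auto-registration sketch; REGISTRY, RegisterMeta, and the model classes are hypothetical names for illustration:

```python
REGISTRY = {}

class RegisterMeta(type):
    """Metaclass that records every concrete subclass by lowercase name."""
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        if bases:  # skip the abstract base itself
            REGISTRY[name.lower()] = cls
        return cls

class BaseModel(metaclass=RegisterMeta):
    pass

class RandomForest(BaseModel): ...
class XGBoost(BaseModel): ...

# look up a class by config string, no if/elif chain needed
model = REGISTRY['xgboost']()
```

The same effect is achievable with a class decorator or `__init_subclass__`, which is why metaclasses are usually a last resort.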

+ +

9. __slots__ for Memory Efficiency

+

Replaces __dict__ with fixed array. ~40% memory savings per instance. Use for millions of small objects. Trade-off: no dynamic attributes.
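A quick sketch of the trade-off. Exact memory savings vary by Python version, so this only demonstrates the no-`__dict__` behavior:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlottedPoint:
    __slots__ = ('x', 'y')  # fixed attribute layout, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Point(1, 2)
s = SlottedPoint(1, 2)
# p.z = 3 would work; s.z = 3 raises AttributeError
```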

+ +

10. Multiprocessing for CPU-Bound Work

+

multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. Each process has its own GIL. Share data via: multiprocessing.Queue, shared_memory, or serialize (pickle). Overhead: process creation ~100ms. Only use for expensive computations.
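A minimal ProcessPoolExecutor sketch, assuming a toy cpu_task function; real gains appear only when each task outweighs the ~100ms process-startup overhead:

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_task(n):
    # deliberately CPU-bound: sum of squares 0..n-1
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    # each worker is a separate process with its own interpreter and its own GIL
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_task, [10_000, 20_000, 30_000]))
    print(results)
```

The `__main__` guard is required on platforms that spawn (Windows, macOS), since workers re-import the module.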

`, code: `
-

💻 Advanced Python Code Examples

+

💻 Advanced Python Project Code

-

1. Production Decorator with Parameters

+

1. Production Decorator — Retry with Backoff

 from functools import wraps
 import time, logging
+import requests  # needed by fetch_data below

-def retry(max_attempts=3, delay=1.0):
-    """Decorator factory: retries on failure."""
+def retry(max_attempts=3, delay=1.0, exceptions=(Exception,)):
     def decorator(func):
         @wraps(func)
         def wrapper(*args, **kwargs):
             for attempt in range(max_attempts):
                 try:
                     return func(*args, **kwargs)
-                except Exception as e:
+                except exceptions as e:
                     if attempt == max_attempts - 1:
                         raise
-                    time.sleep(delay * (2 ** attempt))  # Exponential backoff
+                    wait = delay * (2 ** attempt)  # Exponential backoff
+                    logging.warning(f"Retry {attempt+1}/{max_attempts}: {e}, waiting {wait}s")
+                    time.sleep(wait)
         return wrapper
     return decorator

 @retry(max_attempts=3, delay=0.5)
 def fetch_data(url):
-    return requests.get(url).json()
+    return requests.get(url, timeout=10).json()
-

2. Dataclass with Validation

-
from dataclasses import dataclass, field -from typing import Optional +

2. Dataclass for ML Experiments

+
 from dataclasses import dataclass, field, asdict
+import json
+from datetime import datetime

 @dataclass
 class Experiment:
     name: str
+    model: str
     lr: float = 0.001
     epochs: int = 100
+    batch_size: int = 32
     tags: list[str] = field(default_factory=list)
+    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
+    metrics: dict = field(default_factory=dict)

     def __post_init__(self):
-        if self.lr <= 0:
-            raise ValueError("lr must be positive")
+        if self.lr <= 0: raise ValueError("lr must be positive")
+
+    def save(self, path):
+        with open(path, 'w') as f:
+            json.dump(asdict(self), f, indent=2)
+
+    @classmethod
+    def load(cls, path):
+        with open(path) as f:
+            return cls(**json.load(f))
+ +

3. Model Registry Pattern

+
MODEL_REGISTRY = {} + +def register_model(name): + def decorator(cls): + MODEL_REGISTRY[name] = cls + return cls + return decorator -exp = Experiment("bert-finetune", lr=3e-5, tags=["nlp"])
+@register_model("random_forest") +class RandomForestModel: + def train(self, X, y): ... -

3. async/await for Parallel API Calls

+@register_model("xgboost") +class XGBoostModel: + def train(self, X, y): ... + +# Create model by name from config +model = MODEL_REGISTRY[config["model_name"]]()
+ +

4. async — Parallel API Calls

 import asyncio
 import aiohttp
@@ -768,33 +1128,56 @@ exp = Experiment("bert-finetune", lr=
 async def fetch_all(urls):
     async with aiohttp.ClientSession() as session:
         tasks = [fetch(session, url) for url in urls]
-        return await asyncio.gather(*tasks)
+        return await asyncio.gather(*tasks, return_exceptions=True)

-# 100 API calls in ~1 second (vs 100 seconds sequentially)
+# 100 API calls in ~1 second vs 100 seconds
 results = asyncio.run(fetch_all(urls))
-

4. Type-Hinted Protocol (Duck Typing)

-
from typing import Protocol -import numpy as np +

5. Pydantic for API Data Validation

+
 from pydantic import BaseModel, Field, field_validator
+import numpy as np

-class Predictor(Protocol):
-    def predict(self, X: np.ndarray) -> np.ndarray: ...
-
-def evaluate(model: Predictor, X: np.ndarray, y: np.ndarray):
-    # Works with ANY object that has .predict()
-    preds = model.predict(X)
-    return (preds == y).mean()
+class PredictionRequest(BaseModel):
+    features: list[float] = Field(..., min_length=1)
+    model_name: str = "default"
+    threshold: float = Field(0.5, ge=0, le=1)
+
+    @field_validator('features')
+    @classmethod
+    def check_features(cls, v):
+        if any(np.isnan(x) for x in v):
+            raise ValueError("NaN not allowed")
+        return v
+
+# Auto-validates on creation
+req = PredictionRequest(features=[1.0, 2.0, 3.0])
+ +

6. Context Manager — Timer & GPU Lock

+
+from contextlib import contextmanager
+import time
+
+@contextmanager
+def timer(name="Block"):
+    start = time.perf_counter()
+    try:
+        yield
+    finally:
+        elapsed = time.perf_counter() - start
+        print(f"{name}: {elapsed:.4f}s")
+
+with timer("Training"):
+    model.fit(X_train, y_train)
`, interview: `

🎯 Advanced Python Interview Questions

-
Q1: Explain MRO (Method Resolution Order).

Answer: C3 Linearization algorithm for multiple inheritance. Access via ClassName.mro(). Ensures bases searched after subclasses, preserving definition order.

-
Q2: dataclass vs namedtuple vs Pydantic?

Answer: namedtuple: immutable, fastest. dataclass: mutable, flexible, no validation. Pydantic: auto-validation, JSON serialization, API models. Choose based on whether you need validation.

-
Q3: When to use async/await vs threading vs multiprocessing?

Answer: async: I/O-bound, many connections (1000s of API calls). threading: I/O-bound, simpler code. multiprocessing: CPU-bound (bypasses GIL). NumPy already releases GIL internally.

-
Q4: How does @property work internally?

Answer: It's a descriptor — implements __get__, __set__, __delete__. When you access obj.x, Python's attribute lookup finds the descriptor on the class and calls __get__.

-
Q5: Decorator with parameters pattern?

Answer: Three nested functions: (1) Factory takes params, returns decorator. (2) Decorator takes function, returns wrapper. (3) Wrapper executes logic. Use @wraps(func) always.

-
Q6: What is __slots__?

Answer: Replaces __dict__ with fixed-size array. Saves ~40% memory per instance. Can't add dynamic attributes. Use for millions of small objects.

-
Q7: Explain closures. Give a real use case.

Answer: A function that captures variables from enclosing scope. The captured variables survive after the enclosing function returns. Use case: factory functions, decorators, callbacks. Example: make_multiplier(3) returns a function that multiplies by 3.

+
Q1: Explain MRO.

Answer: C3 Linearization for multiple inheritance. ClassName.mro() shows order. Subclasses before bases, left-to-right.

+
Q2: dataclass vs Pydantic?

Answer: dataclass: no validation, fast, standard library. Pydantic: auto-validation, JSON serialization, API models. Use Pydantic for external data, dataclass for internal.

+
Q3: When async vs threading vs multiprocessing?

Answer: async: I/O-bound, 1000s connections. threading: I/O, simpler. multiprocessing: CPU-bound (bypasses GIL). NumPy releases GIL internally.

+
Q4: How does @property work?

Answer: It's a descriptor with __get__/__set__. Attribute access triggers descriptor protocol. Used for computed attributes and validation.

+
Q5: Decorator with parameters?

Answer: Three nested functions: factory(params) → decorator(func) → wrapper(*args). Use @wraps(func) always.

+
Q6: What is __slots__?

Answer: Fixed array instead of __dict__. ~40% less memory. No dynamic attributes. Use for millions of objects.

+
Q7: Explain closures with use case.

Answer: Function capturing enclosing scope variables. Use: factory functions, decorators, callbacks. make_multiplier(3) returns function multiplying by 3.

+
Q8: Design patterns in Python vs Java?

Answer: Python makes many patterns trivial: Strategy = pass a function. Singleton = module variable. Factory = dict of classes. Observer = list of callables. Python prefers simplicity.

` }, @@ -804,81 +1187,118 @@ results = asyncio.run(fetch_all(urls))

🤖 Scikit-learn — Complete ML Engineering

-
⚡ The Estimator API — Unified Interface
-
Estimators have fit(X, y). Transformers have transform(X). Predictors have predict(X). This consistency allows seamless swapping and composition via Pipelines.
+
⚡ The Estimator API
+
Estimators: fit(X, y). Transformers: transform(X). Predictors: predict(X). Consistency allows seamless swapping and composition via Pipelines.
-

1. Pipelines — Avoiding Data Leakage

+

1. Pipelines — The Foundation of Production ML

-
⚠️ The #1 ML Mistake
- Fitting a scaler on the ENTIRE dataset before splitting = data leakage. Test set statistics leak into training. Fix: put scaling INSIDE a Pipeline, which ensures fit only on training data during cross-validation. +
⚠️ Data Leakage — The #1 ML Mistake
+ Fitting scaler on ENTIRE dataset before split = test set info leaks into training. Fix: put ALL preprocessing inside Pipeline. Pipeline ensures fit only on training folds during CV.
-

2. ColumnTransformer — Different processing per column type

-

Real data has mixed types. ColumnTransformer applies different transformations to different column sets: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.

+

2. ColumnTransformer — Real-World Data

+

Real data has mixed types. ColumnTransformer applies different transformations per column set: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.

3. Custom Transformers

-

Inherit from BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives you fit_transform() for free. Use check_is_fitted(self) to validate state.

+

Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives fit_transform() free. Use check_is_fitted() for safety.

4. Cross-Validation Strategies

-Strategy | When to Use | Gotcha
-KFold | General purpose | Doesn't preserve class ratios
-StratifiedKFold | Classification (imbalanced) | Preserves class distribution
+Strategy | When | Key Point
+KFold | General | Doesn't preserve class ratios
+StratifiedKFold | Imbalanced classification | Preserves class distribution
 TimeSeriesSplit | Time-ordered data | Train always before test
 GroupKFold | Grouped data (patients) | Same group never in train+test
 LeaveOneOut | Very small datasets | N fits — very slow
 RepeatedStratifiedKFold | Robust estimation | Multiple random splits

5. Hyperparameter Tuning

 Method | Pros | Cons
-GridSearchCV | Exhaustive, simple | Exponential with params
-RandomizedSearchCV | Faster, continuous distributions | May miss optimal
-Optuna/BayesianOpt | Smart search, early stopping | More setup, dependency
-Halving*SearchCV | Successive halving, fast | Newer, less documented
+GridSearchCV | Exhaustive | Exponential with params
+RandomizedSearchCV | Faster, continuous dists | May miss optimal
+Optuna | Smart search, pruning | Extra dependency
+HalvingSearchCV | Successive halving | Newer, less docs
+ +

6. Complete ML Workflow

+
+
🎯 The Steps
+
+ 1. EDA → 2. Train/Val/Test split → 3. Build Pipeline (preprocess + model) → 4. Cross-validate multiple models → 5. Select best → 6. Tune hyperparameters → 7. Final evaluation on test set → 8. Save model → 9. Deploy +
+
+ +

7. Feature Engineering

+Transformer | Purpose
+PolynomialFeatures | Interaction & polynomial terms
+FunctionTransformer | Apply any function (log, sqrt)
+SplineTransformer | Non-linear feature basis
+KBinsDiscretizer | Bin continuous into categories
+TargetEncoder | Encode categoricals by target mean
-

6. Feature Engineering in sklearn

-

PolynomialFeatures, FunctionTransformer, SplineTransformer, KBinsDiscretizer. Chain with Pipeline for clean, leak-free preprocessing. Use make_column_selector to auto-select column types.

+

8. Model Selection Guide

+Data Size | Model | Why
+<1K rows | Logistic/SVM/KNN | Simple, less overfitting
+1K-100K | Random Forest, XGBoost | Best accuracy/speed tradeoff
+100K+ | XGBoost, LightGBM | Handles large data efficiently
+Very large | SGDClassifier/online | Incremental learning
+Tabular | Gradient Boosting | Almost always best for tabular
-

7. Model Selection Workflow

-

Train/Val/Test split → Cross-validate multiple models → Select best → Tune hyperparameters → Final evaluation on test set. Never tune on test data. Use cross_val_score for quick comparison, cross_validate for detailed metrics.

+

9. Handling Imbalanced Data

+Strategy | How
+class_weight='balanced' | Built-in for most models
+SMOTE | Synthetic oversampling (imblearn)
+Threshold tuning | Adjust decision threshold from 0.5
+Metrics | Use F1, Precision-Recall AUC (not accuracy)
+Ensemble | BalancedRandomForest
+ +

10. Model Persistence

+

joblib.dump(model, 'model.pkl') — faster than pickle for NumPy arrays. model = joblib.load('model.pkl'). Always save the entire pipeline (not just model) to include preprocessing. Version your models with timestamps.

`, code: `
-

💻 Scikit-learn Code Examples

+

💻 Scikit-learn Project Code

-

1. Production Pipeline with ColumnTransformer

+

1. Production Pipeline — Complete Template

 from sklearn.pipeline import Pipeline
-from sklearn.compose import ColumnTransformer
+from sklearn.compose import ColumnTransformer, make_column_selector
 from sklearn.preprocessing import StandardScaler, OneHotEncoder
 from sklearn.impute import SimpleImputer
 from sklearn.ensemble import RandomForestClassifier
-
-num_features = ['age', 'income', 'score']
-cat_features = ['gender', 'city']
+from sklearn.model_selection import cross_val_score

 preprocessor = ColumnTransformer([
     ('num', Pipeline([
         ('imputer', SimpleImputer(strategy='median')),
         ('scaler', StandardScaler())
-    ]), num_features),
+    ]), make_column_selector(dtype_include='number')),
+    ('cat', Pipeline([
         ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
-        ('encoder', OneHotEncoder(handle_unknown='ignore'))
-    ]), cat_features)
+        ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
+    ]), make_column_selector(dtype_include='object'))
 ])

 pipe = Pipeline([
     ('preprocessor', preprocessor),
-    ('classifier', RandomForestClassifier(n_estimators=100))
+    ('classifier', RandomForestClassifier(n_estimators=100, n_jobs=-1))
 ])

-pipe.fit(X_train, y_train)  # No data leakage!
+
+# No data leakage!
+scores = cross_val_score(pipe, X, y, cv=5, scoring='f1')
+print(f"F1: {scores.mean():.3f} ± {scores.std():.3f}")

2. Custom Transformer

 from sklearn.base import BaseEstimator, TransformerMixin
+from sklearn.utils.validation import check_is_fitted

 class OutlierClipper(BaseEstimator, TransformerMixin):
     def __init__(self, factor=1.5):
@@ -893,33 +1313,71 @@ pipe.fit(X_train, y_train)  # No data leakage!
         return self

     def transform(self, X):
+        check_is_fitted(self)
         return np.clip(X, self.lower_, self.upper_)
-

3. Hyperparameter Tuning with Optuna

+

3. Model Comparison Framework

+
+from sklearn.model_selection import cross_validate
+
+models = {
+    'Logistic': LogisticRegression(),
+    'RF': RandomForestClassifier(n_estimators=100),
+    'XGBoost': XGBClassifier(n_estimators=100),
+    'LightGBM': LGBMClassifier(n_estimators=100)
+}
+
+results = {}
+for name, model in models.items():
+    pipe = Pipeline([('prep', preprocessor), ('model', model)])
+    cv = cross_validate(pipe, X, y, cv=5,
+                        scoring=['accuracy', 'f1', 'roc_auc'], n_jobs=-1)
+    results[name] = {k: v.mean() for k, v in cv.items()}
+    print(f"{name}: F1={cv['test_f1'].mean():.3f}")
+
+pd.DataFrame(results).T.sort_values('test_f1', ascending=False)
+ +

4. Hyperparameter Tuning with Optuna

 import optuna

 def objective(trial):
     params = {
         'n_estimators': trial.suggest_int('n_estimators', 50, 500),
         'max_depth': trial.suggest_int('max_depth', 3, 15),
-        'learning_rate': trial.suggest_float('lr', 1e-3, 0.3, log=True)
+        'learning_rate': trial.suggest_float('lr', 1e-3, 0.3, log=True),
+        'subsample': trial.suggest_float('subsample', 0.6, 1.0)
     }
     model = XGBClassifier(**params)
-    score = cross_val_score(model, X, y, cv=5).mean()
+    score = cross_val_score(model, X, y, cv=5, scoring='f1').mean()
     return score

 study = optuna.create_study(direction='maximize')
 study.optimize(objective, n_trials=100)
+print(f"Best F1: {study.best_value:.3f}")
+print(f"Best params: {study.best_params}")

5. Save & Load Pipeline

+
+import joblib
+from datetime import datetime
+
+# Save entire pipeline (includes preprocessing!)
+version = datetime.now().strftime('%Y%m%d_%H%M')
+joblib.dump(pipe, f'models/pipeline_{version}.pkl')
+
+# Load and predict
+pipe = joblib.load('models/pipeline_20240315_1430.pkl')
+predictions = pipe.predict(new_data)  # Preprocessing included!
`, interview: `

🎯 Scikit-learn Interview Questions

-
Q1: What is data leakage? How to prevent it?

Answer: Info from test set influencing training. Common cause: fitting scaler on full data before split. Fix: put all preprocessing inside a Pipeline which ensures fit only on train folds during cross-validation.

-
Q2: Pipeline vs ColumnTransformer?

Answer: Pipeline: sequential steps (A→B→C). ColumnTransformer: parallel branches (different processing for different column types). Typically ColumnTransformer inside Pipeline.

-
Q3: When to use which cross-validation?

Answer: KFold: general. StratifiedKFold: imbalanced classes. TimeSeriesSplit: temporal. GroupKFold: grouped data (same patient never in both).

-
Q4: GridSearch vs RandomSearch vs Bayesian?

Answer: Grid: exhaustive but exponential. Random: better for many params, samples continuous distributions. Bayesian (Optuna): learns from previous trials, most efficient for expensive models.

-
Q5: How to create a custom transformer?

Answer: Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) (learn params, return self) and transform(X) (apply). TransformerMixin gives fit_transform() free.

-
Q6: Explain fit() vs transform() vs predict().

Answer: fit(): learn parameters from data. transform(): apply learned params to transform data. predict(): generate predictions. fit() is always on train, transform/predict on train+test.

+
Q1: What is data leakage?

Answer: Test set info influencing training. Common: fitting scaler before split. Fix: Pipeline ensures fit only on train folds.

+
Q2: Pipeline vs ColumnTransformer?

Answer: Pipeline: sequential (A→B→C). ColumnTransformer: parallel branches (different processing per column type). Usually CT inside Pipeline.

+
Q3: Which cross-validation when?

Answer: KFold: general. Stratified: imbalanced. TimeSeriesSplit: temporal. GroupKFold: grouped data.

+
Q4: Grid vs Random vs Bayesian?

Answer: Grid: exhaustive, exponential. Random: better for many params. Bayesian (Optuna): learns, most efficient for expensive models.

+
Q5: Custom transformer?

Answer: BaseEstimator + TransformerMixin. Implement fit(X,y) and transform(X). TransformerMixin gives fit_transform free.

+
Q6: How to handle imbalanced data?

Answer: (1) class_weight='balanced'. (2) SMOTE oversampling. (3) Adjust threshold. (4) Use F1/AUC not accuracy. (5) BalancedRandomForest.

+
Q7: When to use which model?

Answer: Tabular: gradient boosting (XGBoost/LightGBM). Small data: Logistic/SVM. Interpretability: Logistic/trees. Speed: LightGBM. Baseline: Random Forest.

+
Q8: fit() vs transform() vs predict()?

Answer: fit: learn params from data. transform: apply params. predict: generate predictions. fit on train only, transform/predict on both.

` }, @@ -930,133 +1388,205 @@ study.optimize(objective, n_trials=100)
⚡ PyTorch Philosophy: Define-by-Run
-
PyTorch builds the computational graph dynamically as operations execute (eager mode). This makes debugging natural — use print(), breakpoints, standard Python control flow. TensorFlow originally used static graphs (define-then-run).
+
PyTorch builds the computational graph dynamically as operations execute (eager mode). Debug with print(), breakpoints, standard Python control flow.

1. Tensors — The Foundation

-Concept | What It Is | Key Point
-Tensor | N-dimensional array | Like NumPy ndarray but GPU-capable
-requires_grad | Track operations for autograd | Only enable for learnable parameters
+Concept | What | Key Point
+Tensor | N-dimensional array | Like NumPy but GPU-capable
+requires_grad | Track for autograd | Only for learnable params
 device | CPU or CUDA | .to('cuda') moves to GPU
 .detach() | Stop gradient tracking | Use for inference/metrics
-.item() | Extract scalar value | Use for logging loss values
+.item() | Extract scalar | Use for logging loss
+.contiguous() | Ensure contiguous memory | Required after transpose/permute

2. Autograd — How Backpropagation Works

-
🧠 Computational Graph
-
When requires_grad=True, PyTorch records every operation in a directed acyclic graph (DAG). Each tensor stores its grad_fn — the function that created it. .backward() traverses this graph in reverse, computing gradients via the chain rule. The graph is destroyed after backward() (unless retain_graph=True).
+
🧠 Computational Graph (DAG)
+
When requires_grad=True, every operation is recorded. Each tensor stores grad_fn. .backward() traverses graph in reverse (chain rule). Graph destroyed after backward() unless retain_graph=True. Gradients ACCUMULATE — must optimizer.zero_grad() before each backward.
-

Gradient accumulation: By default, .backward() accumulates gradients. You MUST call optimizer.zero_grad() before each backward pass. This is intentional — allows gradient accumulation for larger effective batch sizes.

3. nn.Module — Building Blocks

-

Every model inherits nn.Module. Define layers in __init__, computation in forward(). model.parameters() returns all learnable weights. model.train() and model.eval() toggle BatchNorm/Dropout behavior. model.state_dict() saves/loads weights.

+

Every model inherits nn.Module. Layers in __init__, computation in forward(). model.train()/model.eval() toggle BatchNorm/Dropout. model.parameters() for optimizer. model.state_dict() for save/load. Use nn.Sequential for simple stacks, nn.ModuleList/nn.ModuleDict for dynamic architectures.

4. Training Loop — The Standard Pattern

-

Every PyTorch training follows: (1) Forward pass, (2) Compute loss, (3) optimizer.zero_grad(), (4) loss.backward(), (5) optimizer.step(). No magic — you write it explicitly. This gives full control over learning rate scheduling, gradient clipping, mixed precision, etc.

+

(1) Forward pass → (2) Compute loss → (3) optimizer.zero_grad() → (4) loss.backward() → (5) optimizer.step(). Add: gradient clipping, LR scheduling, mixed precision, logging, checkpointing.

5. Custom Datasets & DataLoaders

-

Dataset: override __len__ and __getitem__. DataLoader: wraps Dataset with batching, shuffling, multi-worker loading. Use num_workers > 0 for parallel data loading. pin_memory=True speeds up CPU→GPU transfer.

+

Dataset: override __len__ and __getitem__. DataLoader: batching, shuffling, multi-worker. num_workers>0 for parallel loading. pin_memory=True for faster GPU transfer. Use collate_fn for variable-length sequences.

-

6. Mixed Precision Training (AMP)

-

Use torch.cuda.amp for automatic mixed precision. Forward pass in float16 (2x faster on modern GPUs), gradients in float32 (numerical stability). GradScaler prevents underflow. Up to 2-3x speedup with minimal accuracy loss.

+

6. Learning Rate Scheduling

+Scheduler | Strategy | When
+StepLR | Decay every N epochs | Simple baseline
+CosineAnnealingLR | Cosine decay | Standard for vision
+OneCycleLR | Warmup + decay | Best for fast training
+ReduceLROnPlateau | Decay on stall | When loss plateaus
+LinearLR | Linear warmup | Transformer models
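As a plain-Python sketch (no torch dependency), this is the cosine formula behind CosineAnnealingLR; the eta_max/eta_min defaults here are illustrative, not library defaults:

```python
import math

def cosine_annealing_lr(epoch, t_max, eta_max=0.1, eta_min=0.0):
    """LR at a given epoch under cosine annealing:
    eta_min + (eta_max - eta_min) * (1 + cos(pi * epoch / t_max)) / 2"""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * epoch / t_max))

# LR decays smoothly from eta_max at epoch 0 to eta_min at epoch t_max
schedule = [round(cosine_annealing_lr(e, t_max=10), 4) for e in range(11)]
```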
-

7. Transfer Learning

-

Load pretrained model → Freeze base layers → Replace final layer → Fine-tune. model.requires_grad_(False) freezes all. Then unfreeze last N layers. Use smaller learning rate for pretrained layers.

+

7. Mixed Precision Training (AMP)

+

torch.cuda.amp: forward in float16 (2x faster), gradients in float32. GradScaler prevents underflow. 2-3x speedup. Standard practice for any GPU training.

-

8. Hook System for Debugging

-

Register hooks on modules: register_forward_hook, register_backward_hook. View intermediate activations, gradient magnitudes, feature maps. Essential for debugging vanishing/exploding gradients.

+

8. Transfer Learning Patterns

+

Load pretrained → Freeze base → Replace head → Fine-tune with smaller LR. Discriminative LR: lower LR for earlier layers. Progressive unfreezing: unfreeze layers one at a time. Both work better than fine-tuning everything at once.

9. Distributed Training (DDP)

-

DistributedDataParallel is the standard for multi-GPU training. Each GPU runs a copy of the model, gradients are averaged across GPUs (all-reduce). Near-linear scaling. Use torchrun to launch.

+

DistributedDataParallel: each GPU runs model copy, gradients averaged via all-reduce. Near-linear scaling. Use torchrun to launch. DistributedSampler for data splitting.

+ +

10. Debugging & Profiling

+Tool | Purpose
+register_forward_hook | View intermediate activations
+register_backward_hook | Monitor gradient magnitudes
+torch.profiler | GPU/CPU profiling
+torch.cuda.memory_summary() | GPU memory debugging
+detect_anomaly() | Find NaN/Inf sources
+ +

11. torch.compile (2.x)

+

JIT compiles model for 30-60% speedup. model = torch.compile(model). Uses TorchDynamo + Triton. Works on existing code. The future of PyTorch performance.

`, code: `
-

💻 PyTorch Code Examples

+

💻 PyTorch Project Code

-

1. Complete Training Loop

+

1. Complete Training Framework

import torch import torch.nn as nn - -class MLP(nn.Module): - def __init__(self, in_dim, hidden, out_dim): - super().__init__() - self.net = nn.Sequential( - nn.Linear(in_dim, hidden), - nn.ReLU(), - nn.Dropout(0.3), - nn.Linear(hidden, out_dim) - ) +from torch.utils.data import DataLoader + +class Trainer: + def __init__(self, model, optimizer, criterion, device='cuda'): + self.model = model.to(device) + self.optimizer = optimizer + self.criterion = criterion + self.device = device + self.history = {'train_loss': [], 'val_loss': []} - def forward(self, x): - return self.net(x) - -model = MLP(784, 256, 10).to('cuda') -optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) -criterion = nn.CrossEntropyLoss() - -for epoch in range(10): - model.train() - for X_batch, y_batch in train_loader: - X_batch = X_batch.to('cuda') - y_batch = y_batch.to('cuda') - - logits = model(X_batch) - loss = criterion(logits, y_batch) - - optimizer.zero_grad() - loss.backward() - torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) - optimizer.step()
- -

2. Custom Dataset

-
from torch.utils.data import Dataset, DataLoader - -class TabularDataset(Dataset): - def __init__(self, df, target_col): - self.X = torch.FloatTensor(df.drop(target_col, axis=1).values) - self.y = torch.LongTensor(df[target_col].values) + def train_epoch(self, loader): + self.model.train() + total_loss = 0 + for X, y in loader: + X, y = X.to(self.device), y.to(self.device) + self.optimizer.zero_grad() + loss = self.criterion(self.model(X), y) + loss.backward() + torch.nn.utils.clip_grad_norm_(self.model.parameters(), 1.0) + self.optimizer.step() + total_loss += loss.item() * len(X) + return total_loss / len(loader.dataset) + + @torch.no_grad() + def evaluate(self, loader): + self.model.eval() + total_loss = 0 + for X, y in loader: + X, y = X.to(self.device), y.to(self.device) + total_loss += self.criterion(self.model(X), y).item() * len(X) + return total_loss / len(loader.dataset) + + def fit(self, train_loader, val_loader, epochs, patience=5): + best_loss = float('inf') + wait = 0 + for epoch in range(epochs): + train_loss = self.train_epoch(train_loader) + val_loss = self.evaluate(val_loader) + self.history['train_loss'].append(train_loss) + self.history['val_loss'].append(val_loss) + print(f"Epoch {epoch+1}: train={train_loss:.4f} val={val_loss:.4f}") + if val_loss < best_loss: + best_loss = val_loss + torch.save(self.model.state_dict(), 'best_model.pt') + wait = 0 + else: + wait += 1 + if wait >= patience: + print("Early stopping!") + break
+ +

2. Custom Dataset for Any Tabular Data

+
class TabularDataset(torch.utils.data.Dataset): + def __init__(self, df, target, cat_cols=None, num_cols=None): + self.target = torch.FloatTensor(df[target].values) + self.num = torch.FloatTensor(df[num_cols].values) if num_cols else None + self.cat = torch.LongTensor(df[cat_cols].values) if cat_cols else None def __len__(self): - return len(self.X) + return len(self.target) def __getitem__(self, idx): - return self.X[idx], self.y[idx] - -loader = DataLoader(dataset, batch_size=64, shuffle=True, - num_workers=4, pin_memory=True)
+ x = {} + if self.num is not None: x['num'] = self.num[idx] + if self.cat is not None: x['cat'] = self.cat[idx] + return x, self.target[idx]
-

3. Mixed Precision Training

+

3. Mixed Precision + Gradient Accumulation

 from torch.cuda.amp import autocast, GradScaler

 scaler = GradScaler()
-for X, y in train_loader:
-    optimizer.zero_grad()
-    with autocast():  # Float16 forward pass
-        logits = model(X.cuda())
-        loss = criterion(logits, y.cuda())
-    scaler.scale(loss).backward()  # Scaled backward
-    scaler.step(optimizer)
-    scaler.update()
+accum_steps = 4  # Effective batch = batch_size × 4
+
+for i, (X, y) in enumerate(loader):
+    with autocast():  # Float16 forward pass
+        loss = criterion(model(X.cuda()), y.cuda()) / accum_steps
+    scaler.scale(loss).backward()
+
+    if (i + 1) % accum_steps == 0:
+        scaler.unscale_(optimizer)
+        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+        scaler.step(optimizer)
+        scaler.update()
+        optimizer.zero_grad()

4. Transfer Learning

 import torchvision.models as models

-# Load pretrained, freeze, replace head
 model = models.resnet50(weights='IMAGENET1K_V2')
 model.requires_grad_(False)  # Freeze all
-model.fc = nn.Linear(2048, 10)  # New trainable head
+model.fc = nn.Sequential(  # New trainable head
+    nn.Dropout(0.3),
+    nn.Linear(2048, 512),
+    nn.ReLU(),
+    nn.Linear(512, num_classes)
+)
+model.layer4.requires_grad_(True)  # Unfreeze last block so its LR group has an effect
+
+# Discriminative LR: lower for pretrained, higher for new head
+optimizer = torch.optim.AdamW([
+    {'params': model.layer4.parameters(), 'lr': 1e-5},
+    {'params': model.fc.parameters(), 'lr': 1e-3}
+])

5. Model Save/Load Best Practices

+
+# Save everything for resuming training
+checkpoint = {
+    'epoch': epoch,
+    'model_state': model.state_dict(),
+    'optimizer_state': optimizer.state_dict(),
+    'scheduler_state': scheduler.state_dict(),
+    'best_loss': best_loss,
+    'config': config
+}
+torch.save(checkpoint, 'checkpoint.pt')
+
+# Resume training
+ckpt = torch.load('checkpoint.pt', map_location=device)
+model.load_state_dict(ckpt['model_state'])
+optimizer.load_state_dict(ckpt['optimizer_state'])
`, interview: `

🎯 PyTorch Interview Questions

-
Q1: How does autograd work?

Answer: PyTorch records operations in a DAG when requires_grad=True. .backward() traverses the graph in reverse, computing gradients via chain rule. Graph is destroyed after backward (dynamic graph).

-
Q2: Why call optimizer.zero_grad()?

Answer: PyTorch accumulates gradients by default. Without zeroing, gradients from previous batch add to current. This is intentional — allows gradient accumulation for larger effective batches.

-
Q3: model.train() vs model.eval()?

Answer: train(): BatchNorm uses batch stats, Dropout is active. eval(): BatchNorm uses running stats, Dropout disabled. Always switch before training/inference.

-
Q4: .detach() vs with torch.no_grad()?

Answer: .detach(): creates a tensor that shares data but doesn't track gradients (single tensor). torch.no_grad(): context manager disabling gradient computation for all operations inside (saves memory during inference).

-
Q5: How to debug vanishing/exploding gradients?

Answer: (1) Register backward hooks to monitor gradient magnitudes. (2) Use torch.nn.utils.clip_grad_norm_. (3) Gradient histograms in TensorBoard. (4) Check if BatchNorm/LayerNorm is applied. (5) Try skip connections (ResNet idea).

-
Q6: DataLoader num_workers — how many?

Answer: Rule of thumb: num_workers = 4 * num_gpus. Too many = CPU overhead, too few = GPU starved. Use pin_memory=True for faster CPU→GPU transfer. Profile to find sweet spot.

+
Q1: How does autograd work?

Answer: Records ops in a DAG. .backward() traverses it in reverse via the chain rule. Graph freed after backward unless retain_graph=True. Dynamic = rebuilt each forward.
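A toy scalar version makes the "DAG + chain rule in reverse" idea concrete without requiring PyTorch (the Scalar class below is purely illustrative, not torch API):

```python
# Toy reverse-mode autodiff: record a DAG during the forward pass,
# then walk it backwards applying the chain rule.
# Illustrative sketch only; none of this is the real torch API.
class Scalar:
    def __init__(self, value, parents=()):
        self.value = value
        self.grad = 0.0
        self.parents = parents   # list of (parent_node, local_gradient)

    def __mul__(self, other):
        return Scalar(self.value * other.value,
                      parents=[(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Scalar(self.value + other.value,
                      parents=[(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        self.grad += upstream    # accumulate, exactly like .grad in torch
        for parent, local_grad in self.parents:
            parent.backward(upstream * local_grad)

x = Scalar(3.0)
y = Scalar(4.0)
z = x * y + x        # dz/dx = y + 1 = 5, dz/dy = x = 3
z.backward()
print(x.grad, y.grad)   # 5.0 3.0
```

Note the `+=` in backward: gradients accumulate along every path to a node, which is the same reason PyTorch needs zero_grad() between batches.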

+
Q2: Why zero_grad()?

Answer: Gradients accumulate. Without zeroing, previous batch adds to current. Intentional: enables gradient accumulation for larger effective batch.
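A minimal sketch of the accumulation behavior, assuming torch is available (values chosen for illustration):

```python
import torch

# Gradients accumulate in .grad until explicitly zeroed
w = torch.tensor([1.0], requires_grad=True)

(w * 2).sum().backward()
g1 = w.grad.item()        # 2.0 = d(2w)/dw

(w * 3).sum().backward()  # adds to the existing grad, does not overwrite
g2 = w.grad.item()        # 5.0 = 2.0 + 3.0

w.grad.zero_()            # what optimizer.zero_grad() does per parameter
g3 = w.grad.item()        # 0.0
```

Skipping zero_grad() for N mini-batches before optimizer.step() is exactly how gradient accumulation simulates an N-times-larger batch.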

+
Q3: .detach() vs torch.no_grad()?

Answer: detach(): single tensor, shares data. no_grad(): context manager for all ops inside, saves memory. Use no_grad() for inference.
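A quick sketch of the difference, assuming torch is available:

```python
import torch

x = torch.ones(3, requires_grad=True)

d = (x * 2).detach()   # one tensor cut from the graph (shares storage)
with torch.no_grad():  # context: nothing computed inside is tracked
    n = x * 2
t = x * 2              # normal op: still tracked

flags = (d.requires_grad, n.requires_grad, t.requires_grad)
print(flags)  # (False, False, True)
```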

+
Q4: How to debug vanishing gradients?

Answer: (1) Backward hooks for gradient magnitudes. (2) clip_grad_norm_. (3) TensorBoard histograms. (4) BatchNorm/LayerNorm. (5) Skip connections.

+
Q5: DataLoader num_workers?

Answer: Rule: 4 × num_gpus. Too many = CPU overhead. pin_memory=True for faster transfers. Profile to find sweet spot.

+
Q6: torch.compile vs eager?

Answer: compile JITs model via TorchDynamo+Triton. 30-60% faster. One line change. The future of PyTorch performance.

+
Q7: How to save/load models?

Answer: state_dict (weights only) vs full checkpoint (weights + optimizer + epoch). Use state_dict for inference, checkpoint for resuming.

+
Q8: Mixed precision — how and why?

Answer: autocast (fp16/bf16 forward) + GradScaler (scales the loss so fp16 gradients don't underflow; master weights stay fp32). 2-3x speedup. Minimal accuracy loss. Standard for GPU training.

` }, @@ -1066,60 +1596,62 @@ model.fc = nn.Linear(2048, 10🧠 TensorFlow & Keras — Complete Guide
-
⚡ TensorFlow 2.x Philosophy
-
TF2 defaults to eager execution (like PyTorch). @tf.function compiles to static graph for production speed. Keras is the official high-level API. TF handles the full ML lifecycle: training → saving → serving → monitoring.
+
⚡ TF2 = Eager by Default + @tf.function for Speed
+
TF2 defaults to eager mode (like PyTorch). @tf.function compiles to graph for production. Keras is the official API. TF handles full lifecycle: train → save → serve → monitor.
-

1. Three Ways to Build Models

+

1. Three Model APIs

- - - + + +
APIUse CaseFlexibility
SequentialSimple stack of layersLow (linear only)
FunctionalMulti-input/output, branchingMedium
SubclassingCustom forward logicHigh (most flexible)
SequentialLinear stackLow
FunctionalMulti-input/output, branchingMedium (recommended)
SubclassingCustom forward logicHigh
-

2. tf.data — The Data Pipeline

-

Build efficient input pipelines: tf.data.Dataset chains transformations lazily. Key methods: .map(), .batch(), .shuffle(), .prefetch(tf.data.AUTOTUNE). Prefetching overlaps data loading with model execution. Supports TFRecord files for large datasets.

+

2. tf.data Pipeline

+

Chains transformations lazily. .map(), .batch(), .shuffle(), .prefetch(AUTOTUNE). Prefetching overlaps loading with GPU execution. .cache() for small datasets. .interleave() for reading multiple files. TFRecord format for large datasets.

3. Callbacks — Training Hooks

- + - - - + + +
CallbackPurpose
ModelCheckpointSave best model (monitor val_loss)
ModelCheckpointSave best model
EarlyStoppingStop when metric plateaus
ReduceLROnPlateauReduce LR when stuck
TensorBoardVisualize training metrics
CSVLoggerLog metrics to CSV
LambdaCallbackCustom logic per epoch
TensorBoardVisualize metrics
CSVLoggerLog to CSV
LambdaCallbackCustom per-epoch logic
-

4. Custom Training with GradientTape

-

For full control: tf.GradientTape() records operations, then tape.gradient(loss, model.trainable_variables) computes gradients. Same pattern as PyTorch's manual loop. Use for: GANs, reinforcement learning, custom loss functions.

+

4. GradientTape — Custom Training

+

Record ops → compute gradients → apply. Use for: GANs, RL, custom losses, gradient penalty, multi-loss weighting. Same concept as PyTorch's manual loop.

-

5. SavedModel for Deployment

-

model.save('path') exports as SavedModel format — includes architecture, weights, and computation graph. Ready for TF Serving, TF Lite (mobile), TF.js (browser). Universal deployment format.

+

5. @tf.function — Production Speed

+

Trace Python → TF graph. Benefits: optimized execution, XLA, export. Gotchas: Python side effects only during tracing. Use tf.print() in graphs.

-

6. @tf.function — Graph Compilation

-

Decorating with @tf.function traces Python code into a TF graph. Benefits: optimized execution, XLA compilation, deployment. Gotchas: Python side effects only run during tracing, use tf.print() instead of print().

+

6. SavedModel — Universal Deployment

+

model.save('path') exports architecture + weights + computation. Ready for: TF Serving (production), TF Lite (mobile), TF.js (browser). One model, any platform.

-

7. TF vs PyTorch — When to Choose

+

7. Keras Tuner — Automated Hyperparameter Search

+

Build model function → Tuner searches space. Strategies: Random, Hyperband, Bayesian. Integrates with TensorBoard. Alternative to Optuna for Keras models.

+ +

8. TF vs PyTorch — Decision Guide

- - - - - - + + + + + +
AspectTensorFlowPyTorch
DeploymentTF Serving, TFLite, TF.jsTorchServe, ONNX
ResearchLess common nowDominant in papers
ProductionMature ecosystemCatching up fast
MobileTFLite (mature)PyTorch Mobile
DebuggingHarder (graph mode)Easier (eager by default)
Choose TF WhenChoose PyTorch When
Production deployment at scaleResearch & prototyping
Mobile (TFLite mature)Hugging Face ecosystem
TPU trainingGPU research
Edge devicesCustom architectures
Browser (TF.js)Academic papers
`, code: `
-

💻 TensorFlow Code Examples

+

💻 TensorFlow Project Code

-

1. Functional API Model

+

1. Functional API — Multi-Input Model

import tensorflow as tf from tensorflow import keras -# Multi-input model text_input = keras.Input(shape=(100,), name='text') num_input = keras.Input(shape=(5,), name='features') @@ -1128,7 +1660,9 @@ x1 = keras.layers.GlobalAveragePooling1D()(x1) x2 = keras.layers.Dense(32, activation='relu')(num_input) combined = keras.layers.Concatenate()([x1, x2]) -output = keras.layers.Dense(1, activation='sigmoid')(combined) +x = keras.layers.Dense(64, activation='relu')(combined) +x = keras.layers.Dropout(0.3)(x) +output = keras.layers.Dense(1, activation='sigmoid')(x) model = keras.Model(inputs=[text_input, num_input], outputs=output)

2. Training with Callbacks

@@ -1143,8 +1677,8 @@ model = keras.Model(inputs=[text_input, num_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', keras.metrics.AUC()]) -model.fit(X_train, y_train, epochs=50, validation_split=0.2, - callbacks=callbacks) +model.fit(X_train, y_train, epochs=50, + validation_split=0.2, callbacks=callbacks)

3. Custom Training Loop (GradientTape)

@tf.function @@ -1157,25 +1691,37 @@ model.fit(X_train, y_train, epochs=50, validation_sp return loss

4. tf.data Pipeline

-
# Efficient data pipeline with prefetching -dataset = ( +
dataset = ( tf.data.Dataset.from_tensor_slices((X, y)) .shuffle(10000) .batch(64) .map(lambda x, y: (augment(x), y), num_parallel_calls=tf.data.AUTOTUNE) - .prefetch(tf.data.AUTOTUNE) # Overlap loading + training + .prefetch(tf.data.AUTOTUNE) )
+ +

5. Custom Callback for Experiment Logging

+
class ExperimentLogger(keras.callbacks.Callback): + def __init__(self, log_path): + self.log_path = log_path + self.logs_data = [] + + def on_epoch_end(self, epoch, logs=None): + self.logs_data.append({'epoch': epoch, **logs}) + pd.DataFrame(self.logs_data).to_csv(self.log_path, index=False) + if logs['val_loss'] > logs['loss'] * 1.5: + print(f"⚠️ Possible overfitting at epoch {epoch}")
`, interview: `

🎯 TensorFlow Interview Questions

-
Q1: Sequential vs Functional vs Subclassing?

Answer: Sequential: linear stack. Functional: multi-input/output, shared layers. Subclassing: full Python control, custom forward. Use Functional for most real projects.

-
Q2: What does @tf.function do?

Answer: Compiles Python function into a TF graph. Faster execution, enables XLA optimization, required for SavedModel export. Gotcha: Python code only runs during tracing — side effects behave differently.

-
Q3: How does tf.data improve performance?

Answer: Chains transformations lazily. .prefetch(AUTOTUNE) overlaps data loading with GPU computation. .cache() stores in memory after first epoch. .interleave() reads multiple files concurrently.

-
Q4: EarlyStopping — what to monitor?

Answer: Usually val_loss. Set patience=5-10 (epochs without improvement). restore_best_weights=True reverts to best epoch. Combine with ReduceLROnPlateau for better convergence.

-
Q5: When to use GradientTape?

Answer: When Keras .fit() is too restrictive: GANs (two optimizers), RL (custom gradients), multi-loss weighting, gradient penalty, research experiments needing full control.

-
Q6: TF vs PyTorch — when to choose each?

Answer: TF: production deployment (TF Serving, TFLite), mobile apps, TPU training. PyTorch: research, prototyping, Hugging Face ecosystem. Both are converging in features.

+
Q1: Sequential vs Functional vs Subclassing?

Answer: Sequential: linear. Functional: multi-I/O, branching. Subclassing: full Python control. Use Functional for most projects.

+
Q2: What does @tf.function do?

Answer: Traces Python → TF graph. Faster, XLA, export. Gotcha: side effects only during tracing.

+
Q3: tf.data performance?

Answer: prefetch(AUTOTUNE) overlaps loading+training. cache() for small data. interleave() for multiple files.

+
Q4: EarlyStopping config?

Answer: monitor='val_loss', patience=5-10, restore_best_weights=True. Combine with ReduceLROnPlateau.

+
Q5: When GradientTape?

Answer: GANs, RL, custom gradients, multi-loss. When .fit() is too restrictive.

+
Q6: TF vs PyTorch?

Answer: TF: deployment (Serving, Lite, JS), mobile. PyTorch: research, HuggingFace. Both converging.

+
Q7: How to deploy TF model?

Answer: SavedModel → TF Serving (REST/gRPC), TFLite (mobile), TF.js (browser). Docker + TF Serving for production.

` }, @@ -1186,166 +1732,252 @@ dataset = (
⚡ Production = Reliability + Reproducibility + Observability
-
Production code must be tested (pytest), typed (mypy), logged (structured logging), packaged (pyproject.toml), containerized (Docker), and monitored (metrics/alerts). The gap between notebook code and production code is enormous.
+
Production code must be tested (pytest), typed (mypy), logged (structured), packaged (pyproject.toml), containerized (Docker), and monitored (metrics). The gap between notebook and production is enormous.

1. pytest — Professional Testing

- - - - - - + + + + + + +
FeaturePurposeExample
fixturesReusable test setup@pytest.fixture for test data
parametrizeRun same test with many inputs@pytest.mark.parametrize
conftest.pyShared fixtures across testsDB connections, mock data
monkeypatchOverride functions/env varsMock API calls
tmp_pathTemporary directoryTest file I/O without cleanup
markersTag tests (slow, gpu, integration)pytest -m "not slow"
fixturesReusable test setup@pytest.fixture
parametrizeMany inputs, same test@pytest.mark.parametrize
conftest.pyShared fixturesDB connections, mock data
monkeypatchOverride functions/envMock API calls
tmp_pathTemp directoryTest file I/O
markersTag testspytest -m "not slow"
coverageMeasure test coveragepytest --cov
-

2. Logging Best Practices

-
-
💡 Logging vs Print
- Never use print() in production. Use logging module: configurable levels (DEBUG/INFO/WARNING/ERROR), output to files, structured format, no performance cost when disabled. +

2. Testing ML Code

+
+
🎯 What to Test in ML
+
+ Unit: data transforms, feature engineering, loss functions.
+ Integration: full pipeline end-to-end.
+ Model: output shape, range, determinism with seed.
+ Data: schema validation, distribution shifts, missing-value patterns. +
+ +

3. Logging Best Practices

- - - - - - + + + + + +
LevelWhen to Use
DEBUGDetailed diagnostic (tensor shapes, intermediate values)
INFONormal events (training started, epoch complete)
WARNINGSomething unexpected but handled (missing feature, fallback)
ERRORSomething failed (model load error, API failure)
CRITICALSystem-level failure (out of memory, GPU crash)
LevelWhen
DEBUGTensor shapes, intermediate values
INFOTraining started, epoch complete
WARNINGUnexpected but handled (fallback used)
ERRORModel load failure, API error
CRITICALOOM, GPU crash
- -

3. Project Structure

-
project/ -├── src/ -│ └── mypackage/ -│ ├── __init__.py -│ ├── data/ -│ ├── models/ -│ ├── training/ -│ └── serving/ -├── tests/ -├── configs/ -├── pyproject.toml -├── Dockerfile -└── README.md
+

Never use print(). Use structured logging (JSON format) for production — parseable by log aggregators (ELK, Datadog).

4. FastAPI for Model Serving

-

Modern async web framework. Auto-generates OpenAPI docs. Type-validated requests via Pydantic. Use for: model inference APIs, data pipelines, webhook handlers. Deploy with Uvicorn + Docker. Add health checks and input validation.

+

Modern async framework. Auto-generates OpenAPI docs. Pydantic validation. Deploy with Uvicorn + Docker. Add: health checks, input validation, error handling, rate limiting, request logging.

+ +

5. Docker for ML

+

Containerize everything: Python, CUDA, dependencies. Multi-stage builds: builder (install) → runtime (slim). Pin versions. NVIDIA Container Toolkit for GPU. docker compose for multi-service (API + Redis + DB).

-

5. Docker for ML Projects

-

Containerize your entire environment: Python version, CUDA drivers, dependencies. Multi-stage builds: builder stage (install deps) → runtime stage (slim image). Use NVIDIA Container Toolkit for GPU access. Pin all dependency versions.

+

6. pyproject.toml — Modern Packaging

+

Replaces setup.py/cfg. Project metadata, dependencies, build system, tool configs (pytest, mypy, ruff). [project.optional-dependencies] for dev/test extras. pip install -e ".[dev]" for editable installs.
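A minimal sketch of such a file; the package name, versions, and tool sections are placeholders, not recommendations:

```toml
[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[project]
name = "mypackage"            # placeholder
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["numpy>=1.26", "scikit-learn>=1.4"]

[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "mypy", "ruff"]

# tool configs live in the same file
[tool.pytest.ini_options]
addopts = "--cov=src"

[tool.ruff]
line-length = 100
```

With this in place, pip install -e ".[dev]" pulls the dev extras alongside the editable install.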

-

6. Configuration Management

+

7. Configuration Management

- + -
ToolBest ForKey Feature
HydraML experimentsYAML configs, CLI overrides, multi-run
HydraML experimentsYAML, CLI overrides, multi-run
Pydantic SettingsApp configEnv var loading, validation
python-dotenvSimple projects.env file loading
dynaconfMulti-environmentdev/staging/prod configs
-

7. CI/CD for ML

-

Automate: linting (ruff/flake8), type checking (mypy), testing (pytest), building (Docker), deploying. Use GitHub Actions or GitLab CI. Add model validation gate: compare new model metrics against baseline before deployment.

+

8. CI/CD for ML

+

GitHub Actions: lint (ruff) → type check (mypy) → test (pytest) → build (Docker) → deploy. Add model validation gate: new model must beat baseline on test metrics before deployment.
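A sketch of such a workflow; the file path, job names, and image tag are illustrative:

```yaml
# .github/workflows/ci.yml (illustrative sketch)
name: ci
on: [push, pull_request]

jobs:
  checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -e ".[dev]"
      - run: ruff check src/ tests/      # lint
      - run: mypy src/                   # type check
      - run: pytest tests/ --cov=src     # test
      - run: docker build -t myapp:ci .  # build
```

The model validation gate would be one more step that loads the baseline metrics and fails the job if the candidate model does not beat them.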

-

8. Code Quality Tools

+

9. Code Quality Tools

- - - - + + + + + +
ToolPurpose
ruffFast linter + formatter (replaces black, isort, flake8)
mypyStatic type checking
pre-commitGit hooks for auto-formatting
pytest-covTest coverage measurement
ruffFast linter + formatter (replaces black, isort, flake8)
mypyStatic type checking
pre-commitGit hooks for auto-formatting
pytest-covTest coverage
banditSecurity linting
+ +

10. MLOps — Model Lifecycle

+ + + + + + + +
ToolPurpose
MLflowExperiment tracking, model registry
DVCData versioning (like Git for data)
Weights & BiasesExperiment tracking, visualization
EvidentlyData drift & model monitoring
Great ExpectationsData validation
+ +

11. Database for ML Projects

+ + + + + + +
DBUse CasePython Library
SQLiteLocal, small data, prototypingsqlite3 (built-in)
PostgreSQLProduction, ACID, JSONpsycopg2, SQLAlchemy
RedisCaching, queues, sessionsredis-py
MongoDBFlexible schema, documentspymongo
Pinecone/WeaviateVector search (embeddings)Official SDKs
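For the prototyping row, a self-contained sketch using the built-in sqlite3 module (table and column names are invented for illustration):

```python
import sqlite3

# In-memory SQLite: zero-setup feature-store prototype
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE features (customer_id INTEGER PRIMARY KEY, churn_score REAL)"
)
conn.executemany(
    "INSERT INTO features VALUES (?, ?)",
    [(1, 0.12), (2, 0.87), (3, 0.45)],
)
conn.commit()

high_risk = conn.execute(
    "SELECT customer_id FROM features WHERE churn_score > 0.5"
).fetchall()
print(high_risk)  # [(2,)]
```

Swapping the connection for psycopg2/SQLAlchemy keeps the same pattern on PostgreSQL in production.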
`, code: `
-

💻 Production Python Code Examples

+

💻 Production Python Project Code

-

1. pytest — ML Testing Patterns

+

1. pytest — Complete ML Testing

import pytest import numpy as np +# conftest.py — shared fixtures @pytest.fixture def sample_data(): + np.random.seed(42) X = np.random.randn(100, 10) y = np.random.randint(0, 2, 100) return X, y +@pytest.fixture +def trained_model(sample_data): + X, y = sample_data + model = RandomForestClassifier(n_estimators=10) + model.fit(X, y) + return model + +# Test multiple models with one function @pytest.mark.parametrize("model_cls", [ - LogisticRegression, - RandomForestClassifier, - GradientBoostingClassifier + LogisticRegression, RandomForestClassifier, GradientBoostingClassifier ]) -def test_model_fits(model_cls, sample_data): +def test_model_output(model_cls, sample_data): X, y = sample_data model = model_cls() model.fit(X, y) preds = model.predict(X) assert preds.shape == y.shape - assert set(preds).issubset({0, 1})
+ assert set(np.unique(preds)).issubset({0, 1}) + +# Test data pipeline ('pipeline' is a fixture from conftest.py; +# needs: from sklearn.model_selection import cross_val_score) +def test_pipeline_no_leakage(sample_data, pipeline): + X, y = sample_data + scores = cross_val_score(pipeline, X, y, cv=3) + assert all(0 <= s <= 1 for s in scores)

2. Structured Logging

-
import logging -import json +
import logging, json, sys class JSONFormatter(logging.Formatter): def format(self, record): - return json.dumps({ + log = { 'timestamp': self.formatTime(record), 'level': record.levelname, - 'message': record.getMessage(), - 'module': record.module - }) - -logger = logging.getLogger('ml_pipeline') -logger.setLevel(logging.INFO) -handler = logging.StreamHandler() -handler.setFormatter(JSONFormatter()) -logger.addHandler(handler) - -logger.info("Training complete", extra={'accuracy': 0.95})
- -

3. FastAPI Model Serving

-
from fastapi import FastAPI -from pydantic import BaseModel - -app = FastAPI(title="ML API") + 'module': record.module, + 'message': record.getMessage() + } + if record.exc_info: + log['exception'] = self.formatException(record.exc_info) + return json.dumps(log) + +def setup_logging(level=logging.INFO): + handler = logging.StreamHandler(sys.stdout) + handler.setFormatter(JSONFormatter()) + logging.root.handlers = [handler] + logging.root.setLevel(level) + +logger = logging.getLogger(__name__) +logger.info("Training started", extra={'model': 'xgb'})
+ +

3. FastAPI — Complete ML API

+
from fastapi import FastAPI, HTTPException +from pydantic import BaseModel, Field +import joblib, numpy as np + +app = FastAPI(title="ML Prediction API") +model = None + +@app.on_event("startup") +def load_model(): + global model + model = joblib.load("models/pipeline.pkl") class PredictRequest(BaseModel): - features: list[float] - model_name: str = "default" + features: list[float] = Field(..., min_length=1) + +class PredictResponse(BaseModel): + prediction: int + probability: float + model_version: str -@app.post("/predict") +@app.post("/predict", response_model=PredictResponse) async def predict(req: PredictRequest): - X = np.array(req.features).reshape(1, -1) - pred = model.predict(X) - return {"prediction": pred.tolist()} + try: + X = np.array(req.features).reshape(1, -1) + pred = model.predict(X)[0] + proba = model.predict_proba(X)[0].max() + return PredictResponse( + prediction=int(pred), probability=float(proba), + model_version="v2.1" + ) + except Exception as e: + raise HTTPException(500, detail=str(e)) @app.get("/health") async def health(): - return {"status": "healthy"}
+ return {"status": "healthy", "model_loaded": model is not None}

4. Dockerfile for ML

# Multi-stage build -FROM python:3.11-slim as builder +FROM python:3.11-slim AS builder COPY requirements.txt . -RUN pip install --no-cache-dir -r requirements.txt +RUN pip install --no-cache-dir --target=/deps -r requirements.txt FROM python:3.11-slim -COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11 +COPY --from=builder /deps /usr/local/lib/python3.11/site-packages COPY src/ /app/src/ COPY models/ /app/models/ WORKDIR /app -CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0"]
+EXPOSE 8000 +# slim images ship without curl; use Python's stdlib for the healthcheck +HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1 +CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0", "--port", "8000"] + +

5. Makefile for Project Commands

+
# Makefile — run from project root +.PHONY: install test lint train serve + +install: + pip install -e ".[dev]" + +test: + pytest tests/ -v --cov=src --cov-report=term-missing + +lint: + ruff check src/ tests/ + mypy src/ + +train: + python -m src.training.train --config configs/default.yaml + +serve: + uvicorn src.api:app --reload --port 8000
+ +

6. MLflow Experiment Tracking

+
import mlflow + +mlflow.set_experiment("customer_churn") +with mlflow.start_run(): + mlflow.log_params({"model": "xgb", "lr": 0.01}) + model.fit(X_train, y_train) + mlflow.log_metrics({"f1": f1, "auc": auc_score}) + mlflow.sklearn.log_model(pipeline, "model")
`, interview: `

🎯 Production Python Interview Questions

-
Q1: How do you test ML code?

Answer: (1) Unit tests: data transformations, feature engineering functions. (2) Integration tests: full pipeline end-to-end. (3) Model tests: output shape, range, determinism with seeds. (4) Data tests: schema validation, distribution checks. Use pytest fixtures for reusable test data.

-
Q2: print() vs logging — why?

Answer: Logging: configurable levels, file output, structured format, zero cost when disabled, thread-safe. Print: none of these. Production code must use logging for observability and debugging.

-
Q3: How to serve an ML model in production?

Answer: FastAPI/Flask for REST API. Docker for containerization. Load model at startup (not per request). Add health checks, input validation, error handling, logging, metrics. Use async for high throughput. Consider model registries (MLflow) for versioning.

-
Q4: What goes in pyproject.toml?

Answer: Project metadata, dependencies, build system, tool configs (pytest, mypy, ruff). Replaced setup.py/setup.cfg. Pin dependency versions for reproducibility. Use [project.optional-dependencies] for dev/test extras.

-
Q5: How to manage ML experiment configs?

Answer: Hydra: YAML configs with CLI overrides, multi-run sweeps. Store configs in version control. Never hardcode hyperparameters. Use config groups for model/data/training combos.

-
Q6: What is CI/CD for ML?

Answer: Automate: lint → type-check → test → build → deploy. Add model validation gate: new model must beat baseline on test metrics. Use GitHub Actions. Include data validation (Great Expectations) in pipeline.

+
Q1: How to test ML code?

Answer: Unit: transforms, features. Integration: full pipeline. Model: shape, range, determinism. Data: schema, distributions. Use pytest fixtures.

+
Q2: print() vs logging?

Answer: Logging: levels, file output, structured (JSON), zero cost when disabled, thread-safe. Print: none. Production = logging.

+
Q3: How to serve ML model?

Answer: FastAPI + Docker. Load model at startup. Add health checks, validation, error handling, logging. Async for throughput.

+
Q4: pyproject.toml vs setup.py?

Answer: pyproject.toml: modern standard, all tools in one file. Pin deps. Use optional deps for dev/test. pip install -e ".[dev]".

+
Q5: ML experiment configs?

Answer: Hydra: YAML + CLI overrides + multi-run sweeps. Version control configs. Never hardcode hyperparams.

+
Q6: CI/CD for ML?

Answer: lint → type-check → test → build → deploy. Model validation gate: must beat baseline. GitHub Actions + Docker.

+
Q7: How to handle model versioning?

Answer: MLflow model registry. DVC for data. Git for code. timestamp + metrics in model filename. A/B testing for rollout.

+
Q8: What is data drift?

Answer: Input distribution changes post-deployment. Detect: Evidently, statistical tests. Monitor: feature distributions, prediction distributions. Retrain trigger.

` }, @@ -1356,82 +1988,103 @@ CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0"]
⚡ The Optimization Hierarchy
-
1. Algorithm (O(n²) → O(n log n)) > 2. Data structures (list → set for lookups) > 3. Vectorization (NumPy) > 4. Compilation (Numba/Cython) > 5. Parallelization (multiprocessing/Dask) > 6. Hardware (GPU). Always start from the top.
+
1. Algorithm (O(n²)→O(n log n)) > 2. Data structures (list→set) > 3. Vectorization (NumPy) > 4. Compilation (Numba/Cython) > 5. Parallelization (multiprocessing) > 6. Hardware (GPU). Always start from the top.
-

1. Profiling — Measure Before Optimizing

+

1. Profiling — Measure First

- - - - - - - + + + + + + +
ToolTypeWhen to UseOverhead
cProfileFunction-levelFind slow functions~2x slowdown
line_profilerLine-by-lineFind slow lines in a functionHigher
Py-SpySampling profilerProduction profilingNear zero
tracemallocMemory allocationFind memory leaksLow
memory_profilerLine-by-line memoryFind memory-heavy linesHigh
scaleneCPU + Memory + GPUComprehensive profilingLow
ToolTypeWhenOverhead
cProfileFunction-levelFind slow functions~2x
line_profilerLine-by-lineOptimize hot functionHigher
Py-SpySamplingProduction profilingNear zero
tracemallocMemoryFind leaksLow
memory_profilerLine memoryMemory per lineHigh
scaleneCPU+Memory+GPUComprehensiveLow
-

2. The GIL and Parallelism

-

GIL prevents true multi-threading for CPU-bound Python code. But: NumPy, Pandas, and scikit-learn release the GIL during C operations. Solutions for parallelism:

+

2. The GIL — What Every Python Dev Must Know

+
+
🔒 Global Interpreter Lock
+
GIL prevents true multi-threading for CPU-bound Python. BUT: NumPy, Pandas, scikit-learn release the GIL during C operations. Python 3.13: experimental free-threaded CPython (no-GIL).
+
- - - - - - + + + + +
ToolBest ForHow
threadingI/O-bound (API calls, disk)GIL released during I/O waits
multiprocessingCPU-bound PythonSeparate processes, separate GIL
concurrent.futuresSimple parallel patternsThreadPool/ProcessPool executors
asyncioMany I/O operationsEvent loop, cooperative multitasking
joblibsklearn paralleln_jobs parameter
Task TypeSolutionWhy
I/O-boundasyncio / threadingGIL released during I/O
CPU-bound PythonmultiprocessingSeparate processes, separate GIL
CPU-bound NumPythreading OKNumPy releases GIL
Many tasksconcurrent.futuresSimple Pool interface

3. Numba — JIT Compilation

-

@numba.jit(nopython=True) compiles Python functions to machine code. Supports NumPy arrays and most math operations. 10-100x speedup for loops that can't be vectorized. @numba.vectorize creates custom ufuncs. @numba.cuda.jit runs on GPU.

+

@numba.jit(nopython=True): compile to machine code. 10-100x speedup for loops. Supports NumPy, math. @numba.vectorize: custom ufuncs. @cuda.jit: GPU kernels. Best for: tight loops that can't be vectorized.

-

4. Cython — C-Level Performance

-

Compiles Python to C extension modules. Add type declarations for massive speedups. Best for: tight loops, calling C libraries, CPython extensions. More setup than Numba but more control.

+

4. Dask — Parallel Computing

+

Pandas/NumPy API for data bigger than memory. dask.dataframe, dask.array, dask.delayed. Lazy execution. Task graph scheduler. Scales from laptop to cluster. Alternative: Polars for single-machine parallel.

-

5. Dask — Parallel Computing

-

Pandas-like API for datasets larger than memory. Key abstractions: dask.dataframe (parallel Pandas), dask.array (parallel NumPy), dask.delayed (custom parallelism). Uses a task scheduler to execute lazily. Scales from laptop to cluster.

+

5. Ray — Distributed ML

+

General-purpose distributed framework. Ray Tune (hyperparameter tuning), Ray Serve (model serving), Ray Data. Easier than Dask for ML. Used by OpenAI, Uber.

-

6. Ray — Distributed ML

-

General-purpose distributed framework. Ray Tune for hyperparameter tuning, Ray Serve for model serving, Ray Data for data processing. Easier than Dask for ML-specific workloads. Used by OpenAI, Uber, Ant Group.

- -

7. Memory Optimization

+

6. Memory Optimization

+

7. Caching Strategies

+ + + + + + + +
ToolScopeUse Case
@functools.lru_cacheIn-memory, functionExpensive computations
@functools.cacheUnbounded cachePure functions
joblib.MemoryDisk cacheData processing pipelines
RedisExternal cacheMulti-process, API responses
diskcachePure Python diskSimple persistent cache
+

8. Python 3.12-3.13 Performance

-

3.12: Faster interpreter (5-15% overall), better error messages, per-interpreter GIL (experimental). 3.13: Free-threaded CPython (no-GIL mode experimental), JIT compiler (experimental). The future of Python performance is exciting.

+

3.12: 5-15% faster, better errors, per-interpreter GIL. 3.13: Free-threaded (no-GIL experimental), JIT compiler (experimental). The future of Python performance is exciting.

+ +

9. Common Performance Anti-Patterns

+ + + + + + + + +
Anti-PatternFixSpeedup
for row in df.iterrows()Vectorized ops100-1000x
s += "text" in loop''.join(parts)100x
x in big_listx in big_set1000x
Python list of floatsNumPy array50-100x
Repeated imports inside a functionImport at topVariable
Not using built-inssum(), min()5-10x
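The list-vs-set row is easy to verify directly; a quick self-contained check (absolute timings vary by machine):

```python
import timeit

# Same data, same membership test: list scans O(n), set hashes O(1)
data_list = list(range(1_000_000))
data_set = set(data_list)

# Worst case for the list: the element lives at the very end
t_list = timeit.timeit(lambda: 999_999 in data_list, number=100)
t_set = timeit.timeit(lambda: 999_999 in data_set, number=100)
print(f"list: {t_list:.4f}s  set: {t_set:.6f}s")
```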
`, code: `

💻 Performance Code Examples

-

1. Profiling

-
import cProfile -import pstats +

1. Profiling Workflow

+
import cProfile, pstats -# Profile a function +# Profile and find bottlenecks with cProfile.Profile() as pr: - result = expensive_function(data) + result = expensive_pipeline(data) stats = pstats.Stats(pr) stats.sort_stats('cumulative') -stats.print_stats(10) # Top 10 functions +stats.print_stats(10) # Top 10 slow functions # Memory profiling import tracemalloc tracemalloc.start() -# ... do work ... +# ... process data ... snapshot = tracemalloc.take_snapshot() for stat in snapshot.statistics('filename')[:5]: print(stat)
-

2. Numba JIT — Vectorization Impossible

+

2. Numba JIT

import numba +import numpy as np @numba.jit(nopython=True) def pairwise_distance(X): @@ -1444,50 +2097,72 @@ snapshot = tracemalloc.take_snapshot() d += (X[i,k] - X[j,k]) ** 2 D[i,j] = D[j,i] = d ** 0.5 return D -# 100x faster than pure Python loops!
+# 100x faster than pure Python!
+ +

3. concurrent.futures — Parallel Processing

+
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor -

3. Dask for Large Data

+# CPU-bound: processes +with ProcessPoolExecutor(max_workers=8) as ex: + results = list(ex.map(process_chunk, data_chunks)) + +# I/O-bound: threads +with ThreadPoolExecutor(max_workers=32) as ex: + results = list(ex.map(fetch_url, urls))
+ +

4. Dask for Large Data

import dask.dataframe as dd -# Read 100GB of CSV files — lazy! +# Read 100GB of CSVs — lazy! ddf = dd.read_csv('data/*.csv') # Same Pandas API — but parallel result = ( ddf.groupby('category') - .agg({'revenue': 'sum', 'quantity': 'mean'}) - .compute() # Only here does execution happen + .agg({'revenue': 'sum', 'qty': 'mean'}) + .compute() # Only here does it execute )
-

4. concurrent.futures — Simple Parallelism

-
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor +

5. functools.lru_cache — Memoization

+
from functools import lru_cache -# CPU-bound: use ProcessPool -with ProcessPoolExecutor(max_workers=8) as executor: - results = list(executor.map(process_chunk, chunks)) +@lru_cache(maxsize=1024) +def expensive_feature(customer_id: int) -> dict: + # DB query, computation, etc. + return compute_features(customer_id) -# I/O-bound: use ThreadPool -with ThreadPoolExecutor(max_workers=32) as executor: - results = list(executor.map(fetch_url, urls))
+# First call: computes. Second call: instant from cache +print(expensive_feature.cache_info()) # hits, misses, size
-

5. __slots__ for Memory

+

6. __slots__ for Memory

class Point: __slots__ = ('x', 'y', 'z') def __init__(self, x, y, z): - self.x = x - self.y = y - self.z = z -# 1M instances: ~60MB vs ~160MB without __slots__
+ self.x, self.y, self.z = x, y, z + +# 1M instances: ~60MB vs ~160MB without __slots__ +points = [Point(i, i*2, i*3) for i in range(1_000_000)]
+ +

7. String Performance

+
# ❌ O(n²) — creates new string each iteration +result = "" +for word in words: + result += word + " " + +# ✅ O(n) — single allocation at end +result = " ".join(words)
`, interview: `

🎯 Performance Interview Questions

-
Q1: Why does Python have a GIL?

Answer: Simplifies reference counting (thread-safe without granular locks). Makes single-threaded code faster. Makes C extension integration easier. Python 3.13 has experimental free-threaded mode (no-GIL).

-
Q2: How to optimize a nested loop?

Answer: (1) Vectorize with NumPy (broadcast). (2) If too complex, use Numba JIT. (3) Cython for C-level types. (4) multiprocessing if iterations are independent.

-
Q3: Threading vs Multiprocessing?

Answer: Threading: I/O-bound (shared memory, low overhead). Multiprocessing: CPU-bound (separate memory, bypasses GIL). For downloading 1000 images → threads. For computing 1000 matrix operations → processes.

-
Q4: What is Numba?

Answer: JIT compiler that translates Python/NumPy to machine code using LLVM. @jit(nopython=True) for 10-100x speedup. Works best with: NumPy arrays, math operations, loops. Doesn't support: Pandas, string manipulation, most Python objects.

-
Q5: How to profile Python code?

Answer: cProfile: function-level (find slow functions). line_profiler: line-by-line. Py-Spy: sampling (production-safe). tracemalloc: memory. scalene: CPU+memory+GPU all-in-one. Always profile before optimizing.

-
Q6: Dask vs Ray vs Spark?

Answer: Dask: familiar Pandas/NumPy API, Python-native, scales well. Ray: ML-focused (tune, serve), lower-level control. Spark: JVM-based, best for very large (TB+) data, enterprise. For Python ML: Dask or Ray. For big data ETL: Spark.

+
Q1: Why the GIL?

Answer: Simplifies reference counting. Makes single-threaded faster. Easier C extensions. Python 3.13 has experimental no-GIL mode.

+
Q2: Optimize nested loop?

Answer: (1) NumPy vectorize. (2) Numba JIT. (3) Cython. (4) multiprocessing if independent.
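A minimal sketch of step (1), assuming NumPy is installed; the nested comprehension and the broadcast produce the same 1000x1000 multiplication table:

```python
import numpy as np

a = np.arange(1000)
b = np.arange(1000)

# ❌ Nested Python loop: ~1M interpreted iterations
slow = [[int(x) * int(y) for y in b] for x in a]

# ✅ Broadcasting: (1000, 1) * (1, 1000) is one C-level (1000, 1000) operation
fast = a[:, None] * b[None, :]

print(np.array_equal(np.array(slow), fast))  # True
```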

+
Q3: Threading vs multiprocessing?

Answer: Threading: I/O-bound (shared memory). Multiprocessing: CPU-bound (bypasses GIL). Downloads→threads. Matrix math→processes.

+
Q4: What is Numba?

Answer: JIT compiler: Python→machine code via LLVM. @jit(nopython=True). 10-100x for NumPy loops. No Pandas/strings.

+
Q5: How to profile Python?

Answer: cProfile: functions. line_profiler: lines. Py-Spy: production. tracemalloc: memory. scalene: all-in-one. Profile FIRST, optimize second.
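The cProfile step can be sketched with the stdlib alone (`hot_function` is a placeholder workload):

```python
import cProfile
import io
import pstats

def hot_function():
    # Placeholder for the code under investigation
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10):
    hot_function()
profiler.disable()

# Sort by cumulative time to surface the slow call paths first
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats('cumulative').print_stats(5)
print(stream.getvalue())
```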

+
Q6: Dask vs Ray vs Spark?

Answer: Dask: Pandas API, Python-native. Ray: ML-focused. Spark: JVM, TB+ data. Python ML: Dask/Ray. Big data ETL: Spark.

+
Q7: Top 3 Python performance tips?

Answer: (1) Use sets not lists for lookups. (2) NumPy not Python loops. (3) Generator expressions for memory. Bonus: lru_cache for expensive functions.
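Tips (1) and (3) in one runnable sketch:

```python
data_list = list(range(100_000))
data_set = set(data_list)

# (1) Membership: the list scans every element, the set hashes once
print(99_999 in data_set)    # True, O(1)
print(99_999 in data_list)   # True, O(n)

# (3) Generator expression: values are streamed, never materialized as a list
total = sum(x * x for x in range(10))
print(total)                 # 285

squares = [x * x for x in range(10)]  # list comprehension: all elements in memory
```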

+
Q8: How does lru_cache work?

Answer: Hash-based memoization. Args must be hashable. maxsize=None for unlimited. cache_info() shows hits/misses. Perfect for pure functions.

` } };