diff --git "a/Python/app.js" "b/Python/app.js" --- "a/Python/app.js" +++ "b/Python/app.js" @@ -71,12 +71,19 @@ const modules = [ } ]; + const MODULE_CONTENT = { "python-fundamentals": { concepts: `
-

Python Data Structures for DS

+

🐍 Python Fundamentals — Complete Deep Dive

+ +
+
⚡ Python Is Not What You Think
+
Python is a dynamically typed, garbage-collected language; its reference implementation, CPython, is an interpreter written in C. Everything is an object — integers, functions, even classes. Understanding this object model is what separates beginners from professionals.
+
+

1. Data Structures for DS — Complete Reference

@@ -85,36 +92,51 @@ const MODULE_CONTENT = { + +
TypeMutableOrderedHashableUse Case
listSequential data, time series, feature lists
setUnique values, membership testing O(1)
frozensetImmutable set, usable as dict keys
dequeO(1) append/pop both ends, sliding windows
bytesBinary data, serialization, network I/O
bytearrayMutable binary buffers
-

🧠 Python Memory Model — What No One Teaches You

+

2. Python Memory Model — What No One Teaches

⚡ Everything Is An Object
-
- In Python, every value is an object on the heap. Variables are just references (pointers) to objects. When you write a = [1, 2, 3], the list lives on the heap; a is a name that points to it. This is why b = a makes both point to the same list — no copy is made. -
+
In Python, every value is an object on the heap. Variables are just references (pointers) to objects. a = [1, 2, 3] — the list lives on the heap; a is a name that points to it. b = a makes both point to the same list — no copy is made. This is called aliasing.
-

Reference Counting: Python uses reference counting + cyclic garbage collector. Each object tracks how many names point to it. When count hits 0, memory is freed immediately. This is why del doesn't always free memory — it just decrements the reference count.

+

Reference Counting: CPython uses reference counting plus a cyclic garbage collector. Each object tracks how many references point to it. When the count hits 0, memory is freed immediately. del doesn't always free memory — it just decrements the reference count.

Integer Interning: Python caches integers from -5 to 256 and short strings. So a = 100; b = 100; a is b is True, but a = 1000; b = 1000; a is b may be False. Never use is for value comparison — always use ==.

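The exact cache bounds are a CPython implementation detail; this sketch sidesteps compile-time constant folding by building the large ints at runtime:

```python
a = 100
b = 100
print(a is b)          # True: both names hit the small-int cache (-5..256)

big_a = int('1000')    # created at runtime, outside the cache
big_b = int('1000')
print(big_a is big_b)  # False: equal values, distinct objects
print(big_a == big_b)  # True: == is the correct comparison
```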
+

Garbage Collection Generations: CPython has 3 generations (gen0, gen1, gen2). New objects start in gen0. Objects that survive a collection move to the next generation. Long-lived objects (gen2) are collected less frequently. Use gc.get_stats() to monitor.

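The generation machinery is easy to poke at with the gc module:

```python
import gc

print(len(gc.get_stats()))   # 3: one stats dict per generation
print(gc.get_threshold())    # allocation thresholds, e.g. (700, 10, 10)
for gen, stats in enumerate(gc.get_stats()):
    print(gen, stats['collections'], stats['collected'])
gc.collect()                 # force a full (gen2) collection
```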
+ +

3. Generators & Iterators — The Core of Pythonic Code

+
+
🔄 Lazy Evaluation Is King
+
Generators produce values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items = ~8GB RAM. A generator of 1 billion items = ~100 bytes. The Iterator Protocol: any object with __iter__ and __next__ methods. Generators are just syntactic sugar for iterators.
+
+

yield vs return: return terminates the function. yield suspends it, saving the entire stack frame (local variables, instruction pointer). The next next() call resumes from where it left off.

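A tiny demonstration of the suspended frame:

```python
def countdown(n):
    while n > 0:
        yield n     # suspends here; the frame (n, loop position) is saved
        n -= 1      # execution resumes here on the next next() call

g = countdown(3)
print(next(g))   # 3
print(next(g))   # 2, the local state survived the suspension
```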
+

yield from: Delegates to a sub-generator. yield from iterable is equivalent to for item in iterable: yield item but also forwards send() and throw() calls.

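A minimal yield from example, flattening one level of nesting:

```python
def flatten(nested):
    for sub in nested:
        yield from sub   # delegates; send() and throw() are forwarded too

print(list(flatten([[1, 2], [3], [4, 5]])))   # [1, 2, 3, 4, 5]
```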
+

Generator Expressions: (x**2 for x in range(10**9)) — uses O(1) memory. The list comprehension [x**2 for x in range(10**9)] tries to allocate ~8GB for the pointer array alone, plus the int objects themselves. Always prefer generator expressions for large data.

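The size difference is easy to verify: the generator below is created instantly and only computes what is actually consumed:

```python
import sys
from itertools import islice

squares = (x**2 for x in range(10**9))   # tiny object, nothing computed yet
print(sys.getsizeof(squares))            # on the order of a few hundred bytes
print(list(islice(squares, 5)))          # [0, 1, 4, 9, 16], only 5 computed
```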
+ +

4. Closures & First-Class Functions

+

Functions in Python are first-class objects — they can be passed as arguments, returned from other functions, and assigned to variables. A closure is a function that captures variables from its enclosing scope. This is the foundation of decorators, callbacks, and functional programming in Python.

-

collections Module — The Power Tools

+

5. Critical Python Gotchas

+
+
⚠️ Mutable Default Arguments — #1 Python Trap
+ def append_to(element, target=[]): — This default list is shared across ALL calls! Default arguments are evaluated ONCE at function definition time, not at call time. Fix: use target=None then if target is None: target = []. +
+

Late Binding Closures: [lambda: i for i in range(5)] — all lambdas return 4! Variables in closures are looked up at call time, not definition time. Fix: [lambda i=i: i for i in range(5)].

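Both variants side by side:

```python
funcs = [lambda: i for i in range(5)]      # 'i' is looked up at call time
print([f() for f in funcs])                # [4, 4, 4, 4, 4]

fixed = [lambda i=i: i for i in range(5)]  # default binds the current value
print([f() for f in fixed])                # [0, 1, 2, 3, 4]
```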
+

Tuple Assignment Gotcha: a = ([1,2],); a[0] += [3] raises TypeError AND modifies the list! The += first mutates the list in-place (succeeds), then tries to reassign the tuple element (fails).

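The gotcha, reproduced:

```python
t = ([1, 2],)
try:
    t[0] += [3]      # list.__iadd__ mutates in place and succeeds...
except TypeError:
    print("tuple item assignment failed")  # ...then t[0] = result fails
print(t)             # ([1, 2, 3],): mutated despite the TypeError
```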
+ +

6. collections Module — Power Tools

- +
ClassPurposeWhy It Matters in DS
defaultdictDict with default factoryGroup data without KeyError: defaultdict(list)
CounterCount hashable objectsLabel distribution: Counter(y_train)
namedtupleLightweight immutable classReturn multiple values with names, not indices
OrderedDictDict remembering insertion orderLegacy (dicts are ordered in 3.7+), but useful for move_to_end()
OrderedDictDict remembering insertion orderLegacy (dicts are ordered 3.7+), useful for move_to_end()
dequeDouble-ended queueSliding window computations, BFS algorithms
ChainMapStack multiple dictsLayer config: defaults → env → CLI overrides
-

itertools — Memory-Efficient Data Pipelines

-
-
🔄 Lazy Evaluation Is King
-
- itertools functions return iterators, not lists. They consume O(1) memory regardless of input size. This matters when processing millions of records. -
-
+

7. itertools — Memory-Efficient Pipelines

@@ -123,1686 +145,1352 @@ const MODULE_CONTENT = { + +
FunctionWhat It DoesDS Use Case
chain()Concatenate iterablesMerge multiple data files lazily
product()Cartesian productGenerate hyperparameter grid
combinations()All r-length combosFeature interaction pairs
starmap()map() with unpacked argsApply function to paired data
accumulate()Running total/custom accumulatorCumulative sums, running max
tee()Clone an iterator N timesMultiple passes over data stream
-

pathlib — Modern File Handling

-

Stop using os.path.join(). Use pathlib.Path — it's object-oriented, cross-platform, and reads like English:

- +

8. String Internals & Formatting

+

f-strings (3.6+) are the fastest formatting method. They support expressions: f"{accuracy:.2%}" → "95.23%", f"{x=}" (3.8+) → "x=42" for debugging. Interning: CPython interns identifier-like string literals, so 'hello' is 'hello' evaluates True — but this is an implementation detail (comparing literals with is now triggers a SyntaxWarning); always compare strings with ==.

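A few of these in action; sys.intern makes the interning explicit instead of relying on literal identity:

```python
import sys

accuracy = 0.9523
x = 42
print(f"{accuracy:.2%}")   # 95.23%
print(f"{x=}")             # x=42 (3.8+ debug form)

s1 = sys.intern('feature_name')
s2 = sys.intern('feature_name')
print(s1 is s2)            # True: one interned object
```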
-

Error Handling Patterns for Data Pipelines

-
-
⚠️ Never Do This
- Bare except: catches SystemExit and KeyboardInterrupt. Always catch specific exceptions. In DS pipelines, catch ValueError (bad data), FileNotFoundError (missing files), KeyError (missing columns). -
-

LBYL vs EAFP: Python prefers "Easier to Ask Forgiveness than Permission" (EAFP). Use try/except instead of checking conditions first. It's faster when exceptions are rare (which they usually are).

+

9. pathlib — Modern File Handling

+

Stop using os.path.join(). Use pathlib.Path — object-oriented, cross-platform, reads like English. Path('data') / 'train' / 'images' builds paths. path.glob('*.csv') finds files. path.read_text() reads without open().

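A short sketch with a hypothetical path:

```python
from pathlib import Path

p = Path('data') / 'train' / 'labels.csv'   # hypothetical layout
print(p.name)                          # labels.csv
print(p.stem)                          # labels
print(p.suffix)                        # .csv
print(p.with_suffix('.parquet').name)  # labels.parquet
```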
-

Virtual Environments — Dependency Isolation

+

10. Virtual Environments

- - - - - + + + + +
ToolBest ForCreateKey Feature
venvSimple projectspython -m venv envBuilt-in, lightweight
condaDS/ML (C dependencies)conda create -n myenv python=3.11Handles non-Python deps (CUDA, MKL)
poetryModern packagingpoetry initLock files, deterministic builds
uvSpeed (Rust-based)uv venv10-100x faster than pip
ToolBest ForKey Feature
venvSimple projectsBuilt-in, lightweight
condaDS/ML (C dependencies)Handles CUDA, MKL
poetryModern packagingLock files, deterministic builds
uvSpeed (Rust-based)10-100x faster than pip
-
- `, + `, code: `
-

💻 Essential Code Examples

- -

collections In Action

-
-from collections import defaultdict, Counter, namedtuple, deque +

💻 Python Fundamentals — Code Examples

-# defaultdict — Group samples by label (no KeyError!) +

1. Generators — Complete Patterns

+
import json + +# Basic generator — yields values lazily +def read_large_file(filepath): + with open(filepath) as f: + for line in f: + yield line.strip() +# Processes a 10GB file with O(1) memory! + +# Generator pipeline — compose transformations +def pipeline(filepath): + lines = read_large_file(filepath) + parsed = (json.loads(line) for line in lines) + filtered = (rec for rec in parsed if rec['score'] > 0.5) + return filtered # Still lazy! No work done yet + +# send() — coroutine pattern (advanced) +def running_average(): + total = count = 0 + avg = None + while True: + value = yield avg + total += value + count += 1 + avg = total / count + +ra = running_average() +next(ra) # Prime the coroutine +ra.send(10) # 10.0 +ra.send(20) # 15.0
+ +

2. Closures & Mutable Default Trap

+
# Closure — function capturing external state +def make_multiplier(factor): + def multiply(x): + return x * factor # 'factor' captured from enclosing scope + return multiply + +double = make_multiplier(2) +triple = make_multiplier(3) +print(double(5)) # 10 + +# ⚠️ MUTABLE DEFAULT ARGUMENT — THE #1 PYTHON BUG +# BAD: default list is shared across ALL calls! +def bad_append(item, lst=[]): + lst.append(item) + return lst +bad_append(1) # [1] +bad_append(2) # [1, 2] ← SURPRISE! + +# GOOD: use None sentinel +def good_append(item, lst=None): + if lst is None: + lst = [] + lst.append(item) + return lst
+ +

3. collections In Action

+
from collections import defaultdict, Counter, namedtuple, deque + +# defaultdict — Group samples by label samples_by_label = defaultdict(list) for feature, label in zip(features, labels): samples_by_label[label].append(feature) -# {'cat': [f1, f3], 'dog': [f2, f4]} — no if/else needed -# Counter — Class distribution analysis -y_train = [0, 1, 1, 0, 1, 2, 0, 1] +# Counter — Class distribution + arithmetic dist = Counter(y_train) -print(dist) # Counter({1: 4, 0: 3, 2: 1}) -print(dist.most_common(2)) # [(1, 4), (0, 3)] - -# namedtuple — Return multiple values with names -ModelResult = namedtuple('ModelResult', ['accuracy', 'precision', 'recall', 'f1']) -result = ModelResult(accuracy=0.95, precision=0.93, recall=0.91, f1=0.92) -print(result.accuracy) # 0.95 — much clearer than result[0] +print(dist.most_common(3)) +# Counter supports +, -, &, | operations! # deque — Sliding window for streaming data window = deque(maxlen=5) for value in data_stream: window.append(value) - moving_avg = sum(window) / len(window) -
- -

itertools for Data Pipelines

-
-from itertools import chain, islice, product, combinations - -# chain — Merge multiple data files lazily (no memory explosion) -def read_csv_lines(filepath): - with open(filepath) as f: -next(f) # skip header -yield from f - -all_data = chain( - read_csv_lines('jan.csv'), - read_csv_lines('feb.csv'), - read_csv_lines('mar.csv') -) -# Processes millions of lines with O(1) memory! - -# product — Generate hyperparameter grid -learning_rates = [0.001, 0.01, 0.1] -batch_sizes = [16, 32, 64] -dropouts = [0.1, 0.3, 0.5] -for lr, bs, do in product(learning_rates, batch_sizes, dropouts): - train_model(lr=lr, batch_size=bs, dropout=do) -# 27 combinations without nested loops - -# combinations — Feature interaction pairs -features = ['age', 'income', 'score', 'tenure'] -for f1, f2 in combinations(features, 2): - df[f'{f1}_x_{f2}'] = df[f1] * df[f2] -# Creates: age_x_income, age_x_score, ... (6 pairs) -
- -

pathlib — Modern File Management

-
-from pathlib import Path - -# Build paths naturally (cross-platform) -data_dir = Path('data') / 'processed' / 'v2' -data_dir.mkdir(parents=True, exist_ok=True) - -# Find all CSV files recursively -csv_files = list(data_dir.glob('**/*.csv')) -print(f"Found {len(csv_files)} CSV files") + moving_avg = sum(window) / len(window)
-# Parse file names without string hacking -for f in csv_files: - print(f"Name: {f.stem}, Extension: {f.suffix}, Parent: {f.parent}") - -# Read/write without open() -config = Path('config.json').read_text() -Path('output.txt').write_text('Results: 95.2% accuracy') -
- -

Advanced Comprehensions & Unpacking

-
-# Nested comprehension — Flatten list of lists -nested = [[1, 2], [3, 4], [5, 6]] -flat = [x for sublist in nested for x in sublist] -# [1, 2, 3, 4, 5, 6] - -# Dict comprehension — Invert a mapping -label_to_id = {'cat': 0, 'dog': 1, 'bird': 2} -id_to_label = {v: k for k, v in label_to_id.items()} - -# Set comprehension — Unique words from documents -docs = ["hello world", "world of python"] -vocab = {word for doc in docs for word in doc.split()} - -# Walrus operator (:=) — Assign + use in expression (3.8+) -if (n := len(data)) > 1000: +

4. Advanced Comprehensions & Unpacking

+
# Walrus operator (:=) — Assign + use in expression (3.8+) +if (n := len(data)) > 1000: print(f"Large dataset: {n} samples") -# Extended unpacking — Split data elegantly +# Extended unpacking first, *middle, last = sorted(scores) -print(f"Min: {first}, Max: {last}, Middle: {middle}") -
-

Robust Error Handling for Pipelines

-
-import logging -logger = logging.getLogger(__name__) - -def load_and_validate(filepath): - """Production-grade data loading with proper error handling.""" - try: -df = pd.read_csv(filepath) - except FileNotFoundError: -logger.error(f"File not found: {filepath}") -raise - except pd.errors.EmptyDataError: -logger.warning(f"Empty file: {filepath}") -return pd.DataFrame() - except pd.errors.ParserError as e: -logger.error(f"Parse error in {filepath}: {e}") -raise ValueError(f"Corrupted CSV: {filepath}") from e - - # Validate required columns - required = {'id', 'target', 'timestamp'} - missing = required - set(df.columns) - if missing: -raise KeyError(f"Missing columns: {missing}") - - return df -
-
- `, +# Dict merge (3.9+) +config = defaults | overrides # New in 3.9 + +# match-case (3.10+) — Structural Pattern Matching +match command: + case {"action": "train", "model": model_name}: + train(model_name) + case {"action": "predict", "data": path}: + predict(path)
+ `, interview: `
-

🎯 Interview Questions

- -
- Q1: What's the difference between a list and a tuple? When would you use each in DS? -

Answer: Lists are mutable, tuples immutable. But the deeper answer: tuples are hashable (can be dict keys), use less memory (no over-allocation), and signal intent ("this shouldn't change"). Use tuples for (lat, lon) pairs, function return values, dict keys for caching. Use lists for feature collections that grow.

-
- -
- Q2: How does Python's GIL affect data science workflows? -

Answer: The GIL prevents true multi-threading for CPU-bound tasks. But here's what most people miss: NumPy, Pandas, and scikit-learn release the GIL during C-level computations. So vectorized operations ARE parallel at the C level. For pure Python CPU work, use multiprocessing. For I/O (API calls, file reads), threading works fine because the GIL is released during I/O waits.

-
- -
- Q3: Explain the difference between is and ==. Why does this matter? -

Answer: == checks value equality (__eq__). is checks identity (same memory address). Python interns small integers (-5 to 256) and some strings, so 300 is 300 may be False. Always use == for values. Only use is for None checks: if x is None.

-
- -
- Q4: How would you handle a 10GB CSV that doesn't fit in memory? -

Answer: 5 strategies, from simplest to most powerful: (1) pd.read_csv(chunksize=50000) — process in batches, (2) usecols=['needed_cols'] — load only what you need, (3) dtype={'col': 'int32'} — use smaller types, (4) Dask — lazy Pandas-like API, (5) DuckDB — SQL on CSV files with zero memory overhead.

-
- -
- Q5: What's the time complexity of dict lookup vs list search? Why? -

Answer: Dict: O(1) average via hash tables (Python's dict uses open addressing). List: O(n) linear scan. Internally, dict hashes the key to compute a slot index, then handles collisions via probing. Sets use the same mechanism. This is why x in my_set is fast but x in my_list is slow.

-
- -
- Q6: Explain shallow vs deep copy. Give a real DS scenario where this matters. -

Answer: copy.copy() copies outer container but shares inner objects. copy.deepcopy() recursively copies everything. Real scenario: You have a list of dicts (config per experiment). Shallow copy means modifying one experiment's config changes all of them. Deep copy gives independent configs. Pandas .copy() is deep by default — but df2 = df is NOT a copy at all.

-
- -
- Q7: What is a defaultdict and when would you use it over a regular dict? -

Answer: defaultdict(factory) auto-creates default values for missing keys. Use defaultdict(list) to group items without if key not in dict checks. Use defaultdict(int) to count. It's cleaner and ~20% faster than dict.setdefault() for grouping operations in data processing.

-
- -
- Q8: What are generators and why are they critical for large-scale data processing? -

Answer: Generators yield values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items = ~8GB RAM. A generator of 1 billion items = ~100 bytes. Critical for: reading large files, streaming data, batch training. yield from delegates to sub-generators.

-
- -
- Q9: How would you remove duplicates from a list while preserving order? -

Answer: list(dict.fromkeys(my_list)) — uses dict's insertion-order guarantee (3.7+), runs in O(n). Old approach: seen = set(); [x for x in lst if not (x in seen or seen.add(x))]. For DataFrames: df.drop_duplicates(subset=['key_col']).

-
- -
- Q10: Explain Python's garbage collection mechanism. -

Answer: Two mechanisms: (1) Reference counting — each object has a count; freed when count hits 0. Immediate cleanup. (2) Cyclic garbage collector — detects reference cycles (A → B → A) that refcount can't handle. Runs periodically on generations (gen0, gen1, gen2). You can force it with gc.collect() — useful after deleting large ML models.

-
- -
- Q11: What's the difference between __str__ and __repr__? -

Answer: __str__ is for end users (readable), __repr__ is for developers (unambiguous, ideally eval-able). If only one is defined, implement __repr__ — Python falls back to it for str() too. In ML: __repr__ should show model params: LinearRegression(lr=0.01, reg=l2).

-
- -
- Q12: How does *args and **kwargs help in ML code? -

Answer: They enable flexible function signatures. *args: variable positional args (multiple datasets). **kwargs: variable keyword args (hyperparameters). Essential for: wrapper functions, decorators, scikit-learn's set_params(**params), and model.fit(X, y, **fit_params).

-
- -
- Q13: What are f-strings and why are they preferred over .format() and %? -

Answer: f-strings (3.6+) are fastest, most readable formatting. They support expressions: f"{accuracy:.2%}" → "95.23%", f"{x=}" (3.8+) → "x=42" for debugging. .format() is slower and more verbose. % formatting is legacy C-style. Always use f-strings in modern Python.

-
- -
- Q14: Explain the LEGB scope rule. -

Answer: Python resolves names in order: Local → Enclosing function → Global → Built-in. This is why you can accidentally shadow built-ins: list = [1,2] breaks list(). Use nonlocal to modify enclosing scope, global for module scope (but avoid globals in production code).

-
- -
- Q15: What's the difference between append() and extend()? -

Answer: append(x) adds x as a single element. extend(iterable) unpacks and adds each element. [1,2].append([3,4])[1,2,[3,4]]. [1,2].extend([3,4])[1,2,3,4]. Use extend() when merging feature lists; append() when adding one item to results.

-
-
- ` +

🎯 Python Fundamentals — Interview Questions

+
Q1: What's the difference between a list and a tuple?

Answer: Lists are mutable, tuples immutable. Deeper: tuples are hashable (can be dict keys), use less memory (no over-allocation), and signal intent ("this shouldn't change"). Use tuples for (lat, lon) pairs, function return values, dict keys. Use lists for collections that grow.

+
Q2: How does Python's GIL affect DS workflows?

Answer: The GIL prevents true multi-threading for CPU-bound tasks. But NumPy, Pandas, and scikit-learn release the GIL during C-level computations. So vectorized operations ARE parallel at the C level. For pure Python CPU work, use multiprocessing. For I/O, threading works fine.

+
Q3: Explain shallow vs deep copy.

Answer: copy.copy() copies outer container but shares inner objects. copy.deepcopy() recursively copies everything. Real scenario: list of dicts (configs). Shallow copy means modifying one config modifies all. Pandas .copy() is deep by default — but df2 = df is NOT a copy.

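The experiment-config scenario from the answer, made concrete (the config keys are hypothetical):

```python
import copy

configs = [{'lr': 0.01}, {'lr': 0.02}]
shallow = copy.copy(configs)     # new outer list, shared inner dicts
deep = copy.deepcopy(configs)    # fully independent

shallow[0]['lr'] = 99
print(configs[0]['lr'])   # 99: the "copy" mutated the original's dict
print(deep[0]['lr'])      # 0.01: unaffected
```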
+
Q4: What is the mutable default argument trap?

Answer: def f(x, lst=[]): — the default list is created ONCE at function definition and shared across all calls. So f(1); f(2) gives [1, 2] not [2]. Fix: use lst=None then if lst is None: lst = []. This is the #1 Python gotcha in interviews.

+
Q5: What are generators and why are they critical for large-scale data?

Answer: Generators yield values one at a time using yield, consuming O(1) memory. A list of 1B items = ~8GB. A generator = ~100 bytes. Critical for: reading large files, streaming data, batch training. yield from delegates to sub-generators. Generator expressions: (x for x in data).

+
Q6: Explain the LEGB scope rule.

Answer: Python resolves names in order: Local → Enclosing → Global → Built-in. This is why list = [1,2] breaks list(). Use nonlocal for enclosing scope, global for module scope.

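nonlocal in action:

```python
x = 'global'

def outer():
    x = 'enclosing'
    def inner():
        nonlocal x      # rebind the enclosing x rather than create a local
        x = 'rebound'
    inner()
    return x

print(outer())   # rebound
print(x)         # global: module scope untouched
```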
+
Q7: How would you handle a 10GB CSV that doesn't fit in memory?

Answer: (1) pd.read_csv(chunksize=50000), (2) usecols=['needed'], (3) dtype={'col': 'int32'}, (4) Dask for lazy Pandas, (5) DuckDB for SQL on CSV with zero overhead, (6) Polars for fast out-of-core processing.

+
Q8: What's the time complexity of dict lookup vs list search?

Answer: Dict: O(1) via hash tables (open addressing). List: O(n) linear scan. Dict hashes the key to compute slot index, handles collisions via probing. Sets use the same mechanism. x in my_set is O(1) but x in my_list is O(n).

+
Q9: Explain Python's garbage collection.

Answer: Two mechanisms: (1) Reference counting — freed when count hits 0. (2) Cyclic GC — detects reference cycles (A→B→A). Runs on 3 generations. Long-lived objects collected less often. gc.collect() forces collection — useful after deleting large ML models.

+
Q10: What is __slots__ and when to use it?

Answer: By default, Python objects store attributes in a __dict__ (a dict per instance). __slots__ replaces this with a fixed-size array. Saves ~40% memory per instance. Use when creating millions of small objects (data points, nodes). Trade-off: can't add attributes dynamically.
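A minimal sketch of the trade-off:

```python
class Free:
    def __init__(self, x, y):
        self.x, self.y = x, y     # attributes live in a per-instance __dict__

class Slotted:
    __slots__ = ('x', 'y')        # fixed attribute layout, no __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

print(hasattr(Free(1, 2), '__dict__'))     # True
print(hasattr(Slotted(1, 2), '__dict__'))  # False
try:
    Slotted(1, 2).z = 3
except AttributeError:
    print("no dynamic attributes on slotted instances")
```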

+ ` }, "numpy": { concepts: `
-

NumPy ndarray Fundamentals

+

🔢 NumPy — Complete Deep Dive

+ +
+
⚡ Why NumPy Is 50-100x Faster Than Python Lists
+
Three reasons: (1) Contiguous memory — CPU cache-friendly, no pointer chasing. (2) Compiled C loops — operations run in C, not interpreted Python. (3) SIMD instructions — modern CPUs process 4-8 floats simultaneously (AVX).
+
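A rough benchmark sketch; absolute numbers are machine-dependent, but the ndarray path is typically one to two orders of magnitude faster:

```python
import time
import numpy as np

n = 1_000_000
py = list(range(n))
arr = np.arange(n)

t0 = time.perf_counter()
py_doubled = [v * 2 for v in py]    # interpreted bytecode loop
t_py = time.perf_counter() - t0

t0 = time.perf_counter()
np_doubled = arr * 2                # one compiled, SIMD-friendly loop
t_np = time.perf_counter() - t0
print(f"list: {t_py:.4f}s  ndarray: {t_np:.4f}s")
```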
-

🧠 Why NumPy Is 50-100x Faster Than Python Lists

+

1. ndarray Internals

- - - - - + + + + +
FeaturePython ListNumPy ndarray
StorageArray of pointers to objects scattered in memoryContiguous block of raw typed data
TypeEach element can be different typeHomogeneous — all elements same dtype
OperationsPython loop (bytecode interpretation)Compiled C/Fortran loops
Memory~28 bytes per int + pointer overhead8 bytes per int64 (no overhead)
SIMDNot possibleUses CPU vector instructions (SSE/AVX)
StorageArray of pointers to objectsContiguous block of raw typed data
TypeEach element can differHomogeneous — all same dtype
OperationsPython loop (bytecode)Compiled C/Fortran loops
Memory~28 bytes per int + pointer8 bytes per int64 (no overhead)
SIMDNot possibleUses CPU vector instructions
-

Memory Layout: C-Order vs Fortran-Order

+

2. Memory Layout: C-Order vs Fortran-Order

⚡ Performance-Critical Knowledge
-
- C-order (row-major): Rows stored contiguously. arr[0,0], arr[0,1], arr[0,2], arr[1,0]...
- Fortran-order (col-major): Columns stored contiguously. arr[0,0], arr[1,0], arr[2,0], arr[0,1]...
- NumPy defaults to C-order. Iterating along the last axis is fastest (cache-friendly). Fortran-order is preferred when interfacing with LAPACK/BLAS (used internally by NumPy's linear algebra). -
+
C-order (row-major): Rows stored contiguously. Fortran-order (col-major): Columns stored contiguously. NumPy defaults to C-order. Iterating along the last axis is fastest (cache-friendly). Fortran-order preferred for LAPACK/BLAS operations.
-

Strides: The Secret Behind Views

-

Every ndarray has a strides tuple — bytes to jump in each dimension. For a (3,4) float64 array: strides = (32, 8) means jump 32 bytes for next row, 8 bytes for next column. Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride.

+

3. Strides: The Secret Behind Views

+

Every ndarray has a strides tuple — bytes to jump in each dimension. For a (3,4) float64 array: strides = (32, 8). Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride.

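Strides and view semantics, verified directly:

```python
import numpy as np

arr = np.zeros((3, 4), dtype=np.float64)
print(arr.strides)                  # (32, 8): 4 cols * 8 bytes, then 8 bytes
view = arr[::2]                     # every other row: a view, not a copy
print(view.strides)                 # (64, 8): row stride doubled
print(np.shares_memory(arr, view))  # True
view[0, 0] = 99.0
print(arr[0, 0])                    # 99.0: writing the view hit the original
```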
-

Broadcasting Rules — The Complete Picture

+

4. Broadcasting Rules

🎯 Broadcasting Rules (Right to Left)
-
- Two arrays are compatible when, for each trailing dimension: (1) Dimensions are equal, OR (2) One of them is 1.
- Example: (5, 3, 1) + (1, 4) → shape (5, 3, 4). The (1,) dims are "stretched" virtually — no memory is copied. -
+
Two arrays are compatible when, for each trailing dimension: (1) Dimensions are equal, OR (2) One is 1. Example: (5,3,1) + (1,4) → shape (5,3,4). The (1,) dims are "stretched" virtually — no memory copied.
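The rule, checked both ways:

```python
import numpy as np

a = np.ones((5, 3, 1))
b = np.ones((1, 4))
print((a + b).shape)   # (5, 3, 4)

try:
    np.ones((3,)) + np.ones((3, 4))   # trailing dims 3 vs 4: incompatible
except ValueError:
    print("reshape to (3, 1) to broadcast against (3, 4)")
```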
-

Key dtype Choices for DS

+

5. Universal Functions (ufuncs)

+

Ufuncs are vectorized functions that operate element-wise. They support: .reduce() (fold along axis), .accumulate() (running total), .outer() (outer product), .at() (unbuffered in-place). Example: np.add.reduce(arr) = arr.sum() but works with custom ufuncs too.

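The ufunc methods in one place:

```python
import numpy as np

arr = np.array([1, 2, 3, 4])
print(np.add.reduce(arr))         # 10, same result as arr.sum()
print(np.add.accumulate(arr))     # [ 1  3  6 10], a running total
print(np.multiply.outer([1, 2], [10, 20]))   # [[10 20] [20 40]]

idx = np.array([0, 0, 1])
vals = np.zeros(2)
np.add.at(vals, idx, 1.0)         # unbuffered: repeated indices accumulate
print(vals)                       # [2. 1.]
```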
+ +

6. Key dtype Choices for DS

- - - - - - + + + + + +
dtypeBytesRangeWhen to Use
float324±3.4e38Deep learning (GPU prefers this), 50% less memory
float648±1.8e308Default. Scientific computing, high-precision stats
int324±2.1 billionIndices, counts, most integer data
bool1True/FalseMasks for filtering
category (Pandas)VariesFinite setRepeated strings → 90% memory savings
dtypeBytesWhen to Use
float324Deep learning (GPU prefers this), 50% less memory
float648Default. Scientific computing, high-precision stats
int324Indices, counts, most integer data
float162Mixed-precision training, inference
bool1Masks for filtering
-

np.einsum — Einstein Summation (Power Tool)

-

np.einsum can express any tensor operation in one call: matrix multiply, trace, transpose, batch ops. Often faster than chaining NumPy functions because it avoids intermediate arrays.

+

7. np.einsum — Einstein Summation

+

np.einsum can express any tensor operation: matrix multiply, trace, transpose, batch ops. Often faster than chaining NumPy functions because it avoids intermediate arrays.

-

Linear Algebra for ML

+

8. Linear Algebra for ML

-
- `, + +

9. Random Number Generation (Modern API)

+

np.random.default_rng(42) is the modern way (NumPy 1.17+). Uses PCG64 algorithm — better statistical properties, thread-safe. Old np.random.seed(42) is global state, not thread-safe. Always use default_rng() in new code.

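The modern API in a few lines:

```python
import numpy as np

rng = np.random.default_rng(42)        # PCG64 bit generator underneath
x = rng.standard_normal((3, 2))
ints = rng.integers(0, 10, size=5)     # half-open interval [0, 10)

rng2 = np.random.default_rng(42)       # same seed, same stream
print(np.allclose(x, rng2.standard_normal((3, 2))))   # True
```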
+ `, code: `

💻 NumPy Code Examples

-

Array Creation & Inspection

-
-import numpy as np +

1. Array Creation & Memory Inspection

+
import numpy as np -# Create with specific dtypes for memory efficiency +# Memory-efficient creation X = np.random.randn(1000, 10).astype(np.float32) # 40KB vs 80KB -y = np.random.randint(0, 2, size=1000, dtype=np.int8) # 1KB vs 8KB - -# Inspect memory layout -print(f"Shape: {X.shape}") # (1000, 10) print(f"Strides: {X.strides}") # (40, 4) bytes -print(f"C-contiguous: {X.flags['C_CONTIGUOUS']}") # True -print(f"Memory: {X.nbytes / 1024:.1f} KB") # 39.1 KB -
- -

Broadcasting for Feature Normalization

-
-# Normalize each feature (mean=0, std=1) -X = np.random.randn(1000, 5) # 1000 samples, 5 features +print(f"Memory: {X.nbytes / 1024:.1f} KB")
-mean = X.mean(axis=0) # shape (5,) -std = X.std(axis=0) # shape (5,) +

2. Broadcasting for Feature Normalization

+
# Z-score normalization using broadcasting +X = np.random.randn(1000, 5) +X_norm = (X - X.mean(axis=0)) / X.std(axis=0) # (1000,5) - (5,) works! -X_normalized = (X - mean) / std # Broadcasting! (1000,5) - (5,) works +# Min-Max scaling +X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + 1e-8)
-# Min-Max scaling to [0, 1] -X_min = X.min(axis=0) -X_max = X.max(axis=0) -X_scaled = (X - X_min) / (X_max - X_min + 1e-8) # epsilon avoids /0 -
- -

Advanced Indexing & Boolean Masking

-
-# Boolean mask — filter outliers (3 sigma rule) +

3. Advanced Indexing & Boolean Masking

+
# Boolean mask — filter outliers (3 sigma rule) data = np.random.randn(10000) -mask = np.abs(data) < 3 # True where within 3 std devs -clean = data[mask] # Only non-outlier values -print(f"Removed {len(data) - len(clean)} outliers") - -# Fancy indexing — select specific rows/columns -X = np.random.randn(100, 10) -important_features = [0, 3, 7] # indices of best features -X_selected = X[:, important_features] # shape (100, 3) +clean = data[np.abs(data) < 3] # np.where — Conditional replacement -predictions = np.array([0.3, 0.7, 0.1, 0.9]) -labels = np.where(predictions > 0.5, 1, 0) # [0, 1, 0, 1] -
- -

np.einsum — One Function to Rule Them All

-
-A = np.random.randn(3, 4) -B = np.random.randn(4, 5) - -# Matrix multiply: C_ij = sum_k A_ik * B_kj -C = np.einsum('ik,kj->ij', A, B) # same as A @ B - -# Trace: sum of diagonal -trace = np.einsum('ii->', np.eye(4)) # 4.0 - -# Batch matrix multiply (common in deep learning) -batch_A = np.random.randn(32, 10, 20) # 32 matrices -batch_B = np.random.randn(32, 20, 5) -result = np.einsum('bij,bjk->bik', batch_A, batch_B) # (32,10,5) - -# Dot product of each row pair -X = np.random.randn(100, 768) # embeddings -similarity = np.einsum('ij,kj->ik', X, X) # cosine sim matrix -
- -

Linear Regression — The NumPy Way

-
-# Solve linear regression: y = Xβ + ε -# Normal equation: β = (X^T X)^{-1} X^T y -X = np.column_stack([np.ones(100), np.random.randn(100, 3)]) -y = np.random.randn(100) - -# Method 1: Direct (numerically unstable for large X) -beta = np.linalg.inv(X.T @ X) @ X.T @ y - -# Method 2: lstsq (stable, handles rank-deficient X) -beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None) - -# Method 3: SVD decomposition (most stable) -U, S, Vt = np.linalg.svd(X, full_matrices=False) -beta = Vt.T @ np.diag(1/S) @ U.T @ y -
- -

Memory-Mapped Files for Huge Arrays

-
-# Process arrays larger than RAM -# Create memory-mapped file -big_array = np.memmap('huge_data.npy', dtype=np.float32, - mode='w+', shape=(1000000, 100)) - -# Write data in chunks -for i in range(0, 1000000, 10000): - big_array[i:i+10000] = np.random.randn(10000, 100) - -# Read slices without loading the entire file -subset = big_array[5000:6000] # Only reads 1000 rows from disk -print(subset.mean()) -
-
- `, +preds = np.array([0.3, 0.7, 0.1, 0.9]) +labels = np.where(preds > 0.5, 1, 0) # [0, 1, 0, 1] + +# np.select — Multiple conditions +conditions = [data < -1, data > 1] +choices = ['low', 'high'] +category = np.select(conditions, choices, default='mid')
+ +

4. np.einsum — One Function to Rule Them All

+
# Matrix multiply +A = np.random.randn(3, 4) +B = np.random.randn(4, 5) +C = np.einsum('ik,kj->ij', A, B) # same as A @ B + +# Batch matrix multiply (deep learning) +batch_A = np.random.randn(32, 10, 20) +batch_B = np.random.randn(32, 20, 5) +batch_result = np.einsum('bij,bjk->bik', batch_A, batch_B) # (32, 10, 5) + +# Pairwise dot-product similarity (cosine similarity after L2-normalizing rows) +X = np.random.randn(100, 768) +sim = np.einsum('ij,kj->ik', X, X)
+ +

5. Memory-Mapped Files for Huge Arrays

+
# Process arrays larger than RAM +big = np.memmap('huge.npy', dtype=np.float32, + mode='w+', shape=(1000000, 100)) +subset = big[5000:6000] # Only reads 1000 rows from disk
+ +

6. Structured Arrays

+
# Mixed dtypes without Pandas overhead +dt = np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')]) +data = np.array([('Alice', 30, 95.5), ('Bob', 25, 87.3)], dtype=dt) +print(data['name']) # ['Alice' 'Bob'] +print(data['score'].mean()) # 91.4
+ `, interview: `

🎯 NumPy Interview Questions

- -
- Q1: Why is NumPy faster than Python lists for numerical operations? -

Answer: Three reasons: (1) Contiguous memory — CPU cache-friendly, no pointer chasing. (2) Compiled C loops — operations run in compiled C, not interpreted Python. (3) SIMD instructions — modern CPUs process 4-8 floats simultaneously (AVX). Together: 50-100x speedup.

-
- -
- Q2: What's the difference between a view and a copy? Why does it matter? -

Answer: Views share data (slicing creates views). Copies duplicate data. arr[::2] is a view — modifying it modifies the original. arr[[0,2,4]] (fancy indexing) is a copy. Views are fast and memory-efficient. Use np.shares_memory(a, b) to check. Always .copy() when you need independent data.

-
- -
- Q3: Explain broadcasting rules with an example. -

Answer: Compare shapes right-to-left. Dimensions must be equal or one must be 1. Example: (3,1) + (1,4)(3,4). Each (3,1) row is "stretched" to match 4 columns. No memory is actually copied — NumPy adjusts strides internally. Gotcha: (3,) + (3,4) fails — need to reshape to (3,1) first.

-
- -
- Q4: What is axis=0 vs axis=1? -

Answer: axis=0 = operate down rows (column-wise). axis=1 = across columns (row-wise). Think: axis=0 collapses rows, axis=1 collapses columns. For (100,5) array: mean(axis=0) → shape (5,) — one mean per feature. mean(axis=1) → shape (100,) — one mean per sample.

-
- -
- Q5: How would you implement PCA using only NumPy? -

Answer: (1) Center data: X_c = X - X.mean(axis=0), (2) Covariance: cov = X_c.T @ X_c / (n-1), (3) Eigendecomposition: vals, vecs = np.linalg.eigh(cov), (4) Sort by eigenvalue descending, (5) Project: X_pca = X_c @ vecs[:, -k:]. Alternatively use SVD directly: U, S, Vt = np.linalg.svd(X_c).

-
- -
- Q6: What's the difference between np.dot, np.matmul (@), and np.einsum? -

Answer: np.dot: flattens for 1D, matrix multiply for 2D, but confusing for higher dims. @ (matmul): clean matrix multiply, broadcasts over batch dims. einsum: most flexible — express any contraction. Use @ for readability, einsum for complex ops. Avoid np.dot for 3D+ arrays.

-
- -
- Q7: How do you handle NaN values in NumPy arrays? -

Answer: np.isnan(arr) detects NaNs. np.nanmean(arr), np.nanstd(arr) — nan-safe aggregations. Replace: arr[np.isnan(arr)] = 0. Gotcha: np.nan == np.nan is False! NaN poisons comparisons. This is IEEE 754 standard.

-
- -
- Q8: What's structured arrays and when would you use them over Pandas? -

Answer: Structured arrays have named fields with mixed dtypes: np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')]). Use when: (1) You need NumPy speed without Pandas overhead, (2) Interfacing with binary file formats (HDF5, FITS), (3) Processing millions of records where Pandas is too slow.

-
- -
- Q9: Explain the performance difference between C-order and Fortran-order. -

Answer: C-order stores rows contiguously; Fortran stores columns. Iterating along the last axis of C-order arrays is fastest because adjacent elements are in adjacent memory (cache-friendly). For column-heavy operations, Fortran order can be faster. NumPy defaults to C-order. np.asfortranarray() converts.

-
- -
- Q10: How would you vectorize a custom function that doesn't have a NumPy equivalent? -

Answer: Three options in order of speed: (1) np.vectorize(func) — convenience wrapper, NOT actually vectorized (still Python loops), (2) Rewrite using broadcasting + boolean masks, (3) Use @numba.jit(nopython=True) for true compiled speed. Always prefer option 2 when possible.

-
- -
- Q11: What's np.random.seed() vs np.random.RandomState vs np.random.default_rng()? -

Answer: np.random.seed(42): global state, not thread-safe. RandomState(42): isolated state, legacy. default_rng(42): modern (NumPy 1.17+), uses PCG64, thread-safe, better statistical properties. Always use default_rng() in new code.

-
- -
- Q12: How do you compute pairwise distances between all points efficiently? -

Answer: Use the expansion: ||a-b||² = ||a||² + ||b||² - 2a·b. Code: dists = np.sum(X**2, axis=1)[:,None] + np.sum(X**2, axis=1)[None,:] - 2 * X @ X.T. This avoids the O(n²×d) explicit loop and leverages BLAS matrix multiply. scipy.spatial.distance.cdist wraps this.

-
-
- ` +
Q1: Why is NumPy faster than Python lists?

Answer: (1) Contiguous memory — cache-friendly. (2) Compiled C loops. (3) SIMD instructions — 4-8 floats simultaneously. Together: 50-100x speedup.

+
Q2: View vs copy — what's the difference?

Answer: Views share data (slicing creates views). Copies duplicate. arr[::2] = view, arr[[0,2,4]] (fancy indexing) = copy. Check with np.shares_memory(a, b).

+
Q3: Explain broadcasting with example.

Answer: Compare shapes right-to-left. Dims must be equal or one must be 1. (3,1) + (1,4) → (3,4). No memory copied — strides adjusted internally. Gotcha: (3,) + (3,4) fails — reshape to (3,1) first.

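A minimal sketch of the rules above (array names are illustrative):

```python
import numpy as np

# (3,1) broadcasts against (1,4): compared right-to-left, each dim is equal or 1
a = np.arange(3).reshape(3, 1)   # shape (3, 1)
b = np.arange(4).reshape(1, 4)   # shape (1, 4)
assert (a + b).shape == (3, 4)

# (3,) + (3,4) fails: trailing dims 3 vs 4 don't match
try:
    np.zeros(3) + np.zeros((3, 4))
except ValueError:
    pass  # broadcasting error, as described above

# Fix: add an axis so the shapes align as (3,1) + (3,4)
assert (np.zeros(3)[:, None] + np.zeros((3, 4))).shape == (3, 4)
```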
+
Q4: What is axis=0 vs axis=1?

Answer: axis=0 = operate down rows (collapses rows). axis=1 = across columns (collapses columns). For (100,5): mean(axis=0) → (5,) per feature. mean(axis=1) → (100,) per sample.

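The shape rule above in two lines (a toy samples-by-features array):

```python
import numpy as np

X = np.arange(12, dtype=float).reshape(4, 3)  # 4 samples, 3 features

# axis=0 collapses rows -> one value per column (per feature)
assert X.mean(axis=0).shape == (3,)

# axis=1 collapses columns -> one value per row (per sample)
assert X.mean(axis=1).shape == (4,)
```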
+
Q5: How to implement PCA with NumPy?

Answer: Center: X_c = X - X.mean(0). Covariance: cov = X_c.T @ X_c / (n-1). Eigendecompose: vals, vecs = np.linalg.eigh(cov). Project: X_pca = X_c @ vecs[:,-k:]. Or use SVD directly.

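A minimal sketch of the steps in the answer (the `pca` helper is illustrative, not a library API):

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix."""
    X_c = X - X.mean(axis=0)                 # 1. center
    cov = X_c.T @ X_c / (len(X) - 1)         # 2. covariance
    vals, vecs = np.linalg.eigh(cov)         # 3. eigendecompose (ascending order)
    components = vecs[:, ::-1][:, :k]        # 4. top-k eigenvectors
    return X_c @ components                  # 5. project

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X_pca = pca(X, 2)
assert X_pca.shape == (200, 2)
```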
+
Q6: np.dot vs @ vs einsum?

Answer: np.dot: confusing for 3D+. @: clean matrix multiply, broadcasts. einsum: most flexible. Use @ for readability, einsum for complex ops.

+
Q7: How to handle NaN values?

Answer: np.isnan(arr) detects NaNs. np.nanmean(arr), np.nanstd(arr) — nan-safe aggregations. Replace with arr[np.isnan(arr)] = 0. Gotcha: np.nan == np.nan is False — NaN never compares equal, per the IEEE 754 standard.

+
Q8: Explain C-order vs Fortran-order performance.

Answer: C-order stores rows contiguously. Iterating along last axis is fastest (cache-friendly). For column-heavy ops, Fortran can be faster. NumPy defaults to C. Convert with np.asfortranarray().

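The layout difference is visible in the strides (byte steps per axis); shapes here are illustrative:

```python
import numpy as np

a = np.ones((1000, 1000))      # C-order by default (row-major)
f = np.asfortranarray(a)       # column-major copy

assert a.flags['C_CONTIGUOUS'] and not a.flags['F_CONTIGUOUS']
assert f.flags['F_CONTIGUOUS']

# Same logical values, different memory layout: for float64 (8 bytes),
# C-order steps 8 bytes along the last axis, F-order along the first.
assert a.strides == (8000, 8)
assert f.strides == (8, 8000)
```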
+ ` }, "pandas": { concepts: `
-

Pandas Core Concepts

+

🐼 Pandas — Complete Deep Dive

-

🧠 DataFrame Internals — What Actually Happens Under the Hood

-
⚡ BlockManager Architecture
-
- A DataFrame is NOT a 2D array. Internally, Pandas uses a BlockManager — columns of the same dtype are stored together in contiguous NumPy arrays (blocks). When you add a column of a different type, a new block is created. This is why column operations are fast (same block) but row iteration is slow (crosses blocks). -
+
⚡ DataFrame Internals — BlockManager
+
A DataFrame is NOT a 2D array. Internally, Pandas uses a BlockManager — columns of the same dtype are stored together in contiguous NumPy arrays (blocks). This is why column operations are fast (same block) but row iteration is slow (crosses blocks).
-

DataFrame vs Series

- - - - - - -
FeatureSeriesDataFrame
Dimensions1D labeled array2D labeled table
AnalogyA column in a spreadsheetThe entire spreadsheet
IndexSingle indexRow index + column index
Creationpd.Series([1,2,3])pd.DataFrame({'a': [1,2]})
- -

The .loc vs .iloc Decision Tree

+

1. .loc vs .iloc — The Golden Rule

-
🎯 Golden Rule
-
- .loc = Label-based. Use when you know column/row names. Inclusive on both ends.
- .iloc = Integer-based. Use when you know positions. Exclusive on end (like Python slicing).
- Gotcha: df.loc[0:5] includes row 5. df.iloc[0:5] excludes row 5. This trips up everyone. -
+
🎯 Never Confuse These
+
.loc = Label-based. Inclusive on both ends. .iloc = Integer position. Exclusive on end. df.loc[0:5] includes row 5. df.iloc[0:5] excludes row 5.
-

The SettingWithCopyWarning — Finally Explained

-

When you chain indexing (df[df.x > 0]['y'] = 5), Pandas may create a temporary copy. Your assignment modifies the copy, not the original. Fix: Always use .loc: df.loc[df.x > 0, 'y'] = 5. In Pandas 2.0+, Copy-on-Write mode eliminates this issue entirely.

+

2. SettingWithCopyWarning — Finally Explained

+

Chained indexing (df[df.x > 0]['y'] = 5) may create a temporary copy. Fix: df.loc[df.x > 0, 'y'] = 5. In Pandas 2.0+, Copy-on-Write mode eliminates this entirely.

+ +

3. GroupBy Split-Apply-Combine

+

The most powerful Pandas operation. (1) Split into groups, (2) Apply function to each, (3) Combine results. GroupBy is lazy — no computation until aggregation. Key methods: agg() (reduce), transform() (broadcast), filter() (keep/drop groups), apply() (flexible).

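A small sketch of the split-apply-combine methods on a toy frame (column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'dept': ['a', 'a', 'b', 'b', 'b'],
    'salary': [10, 20, 30, 40, 50],
})
g = df.groupby('dept')['salary']

agg = g.agg('mean')                  # reduce: one value per group
assert list(agg) == [15.0, 40.0]

tr = g.transform('mean')             # broadcast: same length as the input
assert list(tr) == [15.0, 15.0, 40.0, 40.0, 40.0]

kept = df.groupby('dept').filter(lambda grp: len(grp) >= 3)
assert len(kept) == 3                # only dept 'b' has enough rows
```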
-

GroupBy Split-Apply-Combine

-

GroupBy is the most powerful Pandas operation. It follows three steps: (1) Split data into groups, (2) Apply a function to each group independently, (3) Combine results. The key insight: GroupBy is lazy — no computation happens until you call an aggregation.

+

4. Pandas 2.0 — Major Changes

+ + + + + + + +
FeatureBefore (1.x)After (2.0+)
BackendNumPy onlyApache Arrow backend option
Copy semanticsConfusingCopy-on-Write (explicit)
String dtypeobjectstring[pyarrow] (faster)
Nullable typesNaN for everythingpd.NA (proper null)
Index dtypesint64 defaultMatches data dtype
+ +

5. Polars vs Pandas

+ + + + + + + + +
FeaturePandasPolars
Speed1x5-50x faster (Rust)
MemoryHigherLower (Arrow-native)
ParallelismSingle-threadedMulti-threaded by default
APIEagerLazy + Eager
EcosystemMassiveGrowing
When to useEDA, legacy projectsLarge data, production pipelines
-

Method Chaining — The Pandas Way

-

Fluent API style chains multiple operations. More readable, no intermediate variables, and enables .pipe() for custom functions. Use .assign() instead of df['col'] = ... for chainability.

+

6. Method Chaining

+

Fluent API style. More readable, no intermediate variables. Use .assign() instead of df['col'] = .... Use .pipe() for custom functions. Use .query() for readable filtering.

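A minimal sketch of `.pipe()` and `.query()` in a chain (the `add_revenue` helper and column names are illustrative):

```python
import pandas as pd

def add_revenue(df, price_col='price', qty_col='qty'):
    # custom step plugged into a chain via .pipe()
    return df.assign(revenue=df[price_col] * df[qty_col])

df = pd.DataFrame({'price': [5.0, 2.0, 9.0], 'qty': [1, 10, 3]})

result = (
    df
    .pipe(add_revenue)
    .query('revenue > 10')   # readable filtering, no boolean-mask noise
)
assert list(result['revenue']) == [20.0, 27.0]
```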
-

Memory Optimization Strategies

+

7. Memory Optimization

- - - - + + + +
StrategySavingsWhen to Use
Category dtype90%+Columns with few unique strings (gender, country)
Downcast numerics50-75%int64 to int32/int16 when range allows
Sparse arrays80%+Columns that are mostly zeros/NaN
Read in chunksN/AFiles larger than RAM
Category dtype90%+Columns with few unique strings
Downcast numerics50-75%int64 → int32/int16
Sparse arrays80%+Columns mostly zeros/NaN
PyArrow backend30-50%String-heavy DataFrames
-
- `, + +

8. Window Functions

+

.rolling(N) — fixed-size sliding window. .expanding() — cumulative from start. .ewm(span=N) — exponentially weighted. All support .mean(), .std(), .apply(func). Critical for time series feature engineering: lag features, moving averages, volatility.

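The three window types side by side on a tiny series (values illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

roll = s.rolling(2).mean()        # fixed window: NaN, 1.5, 2.5, 3.5
assert pd.isna(roll.iloc[0]) and roll.iloc[1] == 1.5

exp = s.expanding().mean()        # cumulative from the start
assert list(exp) == [1.0, 1.5, 2.0, 2.5]

ewm = s.ewm(span=2).mean()        # exponentially weighted: recent points dominate
assert ewm.iloc[-1] > exp.iloc[-1]
```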
+ `, code: `

💻 Pandas Code Examples

-

Method Chaining — Production Pattern

-
-import pandas as pd -import numpy as np +

1. Method Chaining — Production Pattern

+
import pandas as pd -# Method chaining — clean, readable data pipeline result = ( pd.read_csv('sales.csv') .rename(columns=str.lower) .assign( -date=lambda df: pd.to_datetime(df['date']), -revenue=lambda df: df['price'] * df['quantity'], -month=lambda df: df['date'].dt.month + date=lambda df: pd.to_datetime(df['date']), + revenue=lambda df: df['price'] * df['quantity'] ) .query('revenue > 100') .groupby('month') .agg({'revenue': ['sum', 'mean', 'count']}) - .sort_values(('revenue', 'sum'), ascending=False) -) -
+)
-

GroupBy — Beyond Basic Aggregation

-
-# Multi-aggregation with named columns +

2. GroupBy — Beyond Basics

+
# Named aggregation (clean column names) summary = df.groupby('category').agg( - total_sales=('revenue', 'sum'), + total=('revenue', 'sum'), avg_price=('price', 'mean'), - num_orders=('order_id', 'nunique'), - top_product=('product', lambda x: x.mode().iloc[0]) + n_orders=('order_id', 'nunique') ) -# Transform — apply function, keep original shape -df['pct_of_group'] = df.groupby('category')['revenue'].transform( +# Transform — broadcast back to original shape +df['pct_of_group'] = df.groupby('cat')['rev'].transform( lambda x: x / x.sum() * 100 -) +)
-# Filter — keep only groups meeting criteria -big_groups = df.groupby('category').filter( - lambda g: len(g) >= 10 -) -
+

3. Merge Patterns

+
# LEFT JOIN with indicator +merged = pd.merge(orders, customers, on='id', + how='left', indicator=True) +orphans = merged[merged['_merge'] == 'left_only']
-

Merge Patterns — SQL Joins in Pandas

-
-# LEFT JOIN with indicator to find unmatched -merged = pd.merge( - orders, customers, - on='customer_id', - how='left', - indicator=True # adds _merge column -) -orphan_orders = merged[merged['_merge'] == 'left_only'] - -# Merge on different column names -result = pd.merge( - df1, df2, - left_on='user_id', - right_on='id', - suffixes=('_orders', '_users') -) -
- -

Time Series Operations

-
-# Resample — change frequency +

4. Time Series Operations

+
# Resample, rolling, lag features daily = df.set_index('date') weekly = daily['revenue'].resample('W').sum() -monthly = daily['revenue'].resample('M').agg(['sum', 'mean', 'count']) - -# Rolling windows — moving averages df['ma_7'] = df['revenue'].rolling(7).mean() -df['ma_30'] = df['revenue'].rolling(30).mean() -df['expanding_mean'] = df['revenue'].expanding().mean() - -# Shift — create lag features for ML -df['prev_day'] = df['revenue'].shift(1) -df['pct_change'] = df['revenue'].pct_change() -
- -

Memory Optimization

-
-# Reduce DataFrame memory by 70%+ -def optimize_dtypes(df): - for col in df.select_dtypes(include=['int']).columns: -df[col] = pd.to_numeric(df[col], downcast='integer') - for col in df.select_dtypes(include=['float']).columns: -df[col] = pd.to_numeric(df[col], downcast='float') - for col in df.select_dtypes(include=['object']).columns: -if df[col].nunique() / len(df) < 0.5: - df[col] = df[col].astype('category') +df['lag_1'] = df['revenue'].shift(1) +df['pct_chg'] = df['revenue'].pct_change()
+ +

5. Memory Optimization

+
def optimize_dtypes(df): + for col in df.select_dtypes(['int']).columns: + df[col] = pd.to_numeric(df[col], downcast='integer') + for col in df.select_dtypes(['float']).columns: + df[col] = pd.to_numeric(df[col], downcast='float') + for col in df.select_dtypes(['object']).columns: + if df[col].nunique() / len(df) < 0.5: + df[col] = df[col].astype('category') return df - -# Before: 800 MB → After: 200 MB -df = optimize_dtypes(df) -print(df.memory_usage(deep=True).sum() / 1e6, "MB") -
-
- `, +# 800 MB → 200 MB typical savings
+ `, interview: `

🎯 Pandas Interview Questions

- -
- Q1: What causes SettingWithCopyWarning and how do you fix it? -

Answer: Chained indexing (df[mask]['col'] = val) may modify a copy, not the original. Fix: use df.loc[mask, 'col'] = val. In Pandas 2.0+, enable Copy-on-Write: pd.options.mode.copy_on_write = True. This makes all indexing return views until modification, then copies automatically.

-
- -
- Q2: How do you handle a 10GB CSV that doesn't fit in memory? -

Answer: 5 strategies: (1) pd.read_csv(chunksize=50000) — process in batches, (2) usecols=['needed_cols'] — load only what you need, (3) dtype={'col': 'int32'} — use smaller types, (4) Dask — lazy Pandas-like API, (5) DuckDB — SQL on CSV files with zero memory overhead. Polars is also excellent for out-of-core processing.

-
- -
- Q3: Explain the difference between merge, join, and concat. -

Answer: merge(): SQL-style joins on columns (most flexible). join(): joins on index (convenience wrapper). concat(): stack DataFrames along axis (union/append). Use merge for column-based joins, concat for stacking rows/columns. join is just merge with index.

-
- -
- Q4: What's the difference between apply, map, and applymap? -

Answer: map(): Series only, element-wise. apply(): works on rows/columns of DataFrame or elements of Series. applymap(): element-wise on entire DataFrame (renamed to map() in Pandas 2.1). Performance tip: all three are slow — prefer vectorized operations whenever possible.

-
- -
- Q5: How does GroupBy transform differ from agg? -

Answer: agg() reduces — returns one value per group (changes shape). transform() broadcasts — returns same shape as input. Example: df.groupby('dept')['salary'].transform('mean') fills every row with its department's average salary, while .agg('mean') returns one row per department.

-
- -
- Q6: What is MultiIndex and when would you use it? -

Answer: Hierarchical indexing — multiple levels of row/column labels. Use for: pivot table results, panel data (entity + time), groupby with multiple keys. Access with .xs() or tuple slicing: df.loc[('A', 2023)]. Convert back with .reset_index().

-
- -
- Q7: How do you handle missing data in production? -

Answer: Strategy depends on context: (1) dropna(thresh=N) — keep rows with at least N non-null values, (2) fillna(method='ffill') — forward fill for time series, (3) fillna(df.median()) — impute with median for ML, (4) interpolate(method='time') — time-weighted interpolation. Always check df.isna().sum() first.

-
- -
- Q8: What is the category dtype and when should you use it? -

Answer: Stores repeated strings as integer codes + lookup table. Use when a column has few unique values relative to total rows (e.g., 50 countries in 1M rows). Benefits: 90%+ memory savings, faster groupby. Gotcha: operations that create new values (like string concatenation) convert back to object dtype.

-
- -
- Q9: Pandas vs Polars vs DuckDB — when to use each? -

Answer: Pandas: best ecosystem, most tutorials, sufficient for <1GB. Polars: 10-100x faster, lazy evaluation, multi-threaded, no GIL issues — use for 1-100GB. DuckDB: SQL interface, out-of-core, great for analytical queries — use when SQL is more natural or data exceeds RAM.

-
- -
- Q10: How do you create lag features and rolling statistics for time series ML? -

Answer: df['lag_1'] = df['value'].shift(1) for lag features. df['rolling_mean_7'] = df['value'].rolling(7).mean() for rolling stats. df['ewm_mean'] = df['value'].ewm(span=7).mean() for exponential weighted. Always sort by time first, use groupby().shift() for multi-entity data to avoid data leakage.

-
-
- ` +
Q1: SettingWithCopyWarning — cause and fix?

Answer: Chained indexing may modify a copy. Fix: df.loc[mask, 'col'] = val. Pandas 2.0+ Copy-on-Write: pd.options.mode.copy_on_write = True.

+
Q2: merge vs join vs concat?

Answer: merge(): SQL joins on columns. join(): joins on index. concat(): stack along axis. Use merge for column joins, concat for stacking.

+
Q3: apply vs map vs transform?

Answer: map(): Series, element-wise. apply(): DataFrame rows/columns. transform(): returns output the same shape as the input. All three are slow with Python callables — prefer vectorized operations or built-in aggregation names ('mean', 'sum').

+
Q4: GroupBy transform vs agg?

Answer: agg() reduces — one value per group. transform() broadcasts — same shape as input. Use transform for "fill with group mean" patterns.

+
Q5: What is MultiIndex?

Answer: Hierarchical indexing — multiple levels. Use for pivot tables, panel data (entity + time). Access with .xs() or tuple: df.loc[('A', 2023)]. Convert back with .reset_index().

+
Q6: Pandas vs Polars — when to choose?

Answer: Pandas: mature ecosystem, EDA, small-medium data. Polars: 5-50x faster (Rust), multi-threaded, lazy evaluation, better for large data and production pipelines. Polars for new projects with big data.

+
Q7: How to handle missing data in production?

Answer: (1) dropna(thresh=N), (2) df.ffill() for time series (fillna(method='ffill') is deprecated in Pandas 2.x), (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.

+ ` }, - "visualization": { - concepts: ` +"visualization": { + concepts: `
-

Data Visualization Principles

+

📊 Data Visualization — Complete Guide

-

🧠 The Grammar of Graphics

-
⚡ Every Chart = Data + Aesthetics + Geometry
-
- Leland Wilkinson's framework: Data (what to plot), Aesthetics (x, y, color, size mappings), Geometry (bars, lines, points), Statistics (binning, smoothing), Coordinates (cartesian, polar), Facets (subplots). Seaborn and Plotly follow this pattern. Understanding it means you can build any chart. -
+
⚡ The Grammar of Graphics
+
Leland Wilkinson's framework: Data (what to plot) + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart follows this.
-

Choosing the Right Chart

+

1. Choosing the Right Chart

- - - - + + + + - + +
QuestionChart TypeLibrary
Distribution of one variable?Histogram, KDE, Box plotSeaborn
Relationship between two variables?Scatter, Hexbin, RegressionSeaborn/Plotly
Comparison across categories?Bar, Grouped bar, ViolinSeaborn
Trend over time?Line chart, Area chartPlotly/Matplotlib
Distribution?Histogram, KDE, Box, ViolinSeaborn
Relationship?Scatter, Hexbin, RegressionSeaborn/Plotly
Comparison?Bar, Grouped bar, ViolinSeaborn
Trend over time?Line, Area chartPlotly/Matplotlib
Correlation matrix?HeatmapSeaborn
Part of whole?Pie, Treemap, SunburstPlotly
Geographic data?Choropleth, Scatter mapboxPlotly/Folium
Geographic?Choropleth, MapboxPlotly/Folium
High-dimensional?Parallel coords, UMAPPlotly/UMAP
-

Matplotlib Architecture

-

Three layers: Backend (rendering engine), Artist (everything drawn), Scripting (pyplot). The Figure contains Axes (subplots). Each Axes has Axis objects. Always prefer the object-oriented API (fig, ax = plt.subplots()) over pyplot for production code.

+

2. Matplotlib Architecture

+

Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure contains Axes (subplots). Each Axes has Axis objects. Always prefer OO API (fig, ax = plt.subplots()) over pyplot for production.

+

rcParams: Control global defaults. Set plt.rcParams['font.size'] = 14 once. Create a style file for consistency across all project figures. Use plt.style.use('seaborn-v0_8-whitegrid') for clean defaults.

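A minimal sketch of the OO API plus global rcParams, assuming Matplotlib is installed (the Agg backend keeps it headless):

```python
import matplotlib
matplotlib.use('Agg')             # headless backend for scripts/CI
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 14    # set global defaults once
plt.rcParams['figure.dpi'] = 120

fig, ax = plt.subplots()          # OO API, as recommended above
ax.plot([0, 1], [0, 1])
ax.set_title('OO API demo')
assert plt.rcParams['font.size'] == 14
plt.close(fig)
```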
-

Seaborn — Statistical Visualization

-

Built on Matplotlib with statistical intelligence. Three API levels: Figure-level (relplot, catplot, displot — create their own figure), Axes-level (scatterplot, boxplot — plot on existing axes), Objects API (new in 0.12, more composable).

+

3. Color Theory for Data

+
+
💡 Color Best Practices
+ Sequential: viridis, plasma (one variable, low→high).
+ Diverging: RdBu, coolwarm (center point matters).
+ Categorical: Set2, tab10 (distinct groups).
+ Never use rainbow/jet — bad for colorblind users and perceptually non-uniform. +
+ +

4. Seaborn — Statistical Visualization

+

Three API levels: Figure-level (relplot, catplot, displot — own figure), Axes-level (scatterplot, boxplot — on existing axes), Objects API (0.12+, composable). Seaborn auto-computes statistics (regression lines, confidence intervals, density estimates).

-

Plotly — Interactive Dashboards

-

JavaScript-powered charts with hover, zoom, selection. plotly.express for quick plots, plotly.graph_objects for full control. Integrates with Dash for production dashboards. Supports 3D plots, maps, and animations.

+

5. Plotly — Interactive Dashboards

+

JavaScript-powered charts with hover, zoom, selection. plotly.express for quick plots, plotly.graph_objects for full control. Integrates with Dash for production dashboards. Supports 3D, maps, and animations. Export to HTML for sharing.

-

Common Visualization Mistakes

+

6. Common Mistakes

-
- `, + `, code: `

💻 Visualization Code Examples

-

Matplotlib — Publication-Quality Figures

-
-import matplotlib.pyplot as plt +

1. Matplotlib — Publication Quality

+
import matplotlib.pyplot as plt import numpy as np -# Professional figure setup +# Professional multi-subplot figure fig, axes = plt.subplots(1, 3, figsize=(15, 5)) -# Subplot 1: Distribution +# Distribution with mean line data = np.random.randn(1000) axes[0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white') -axes[0].set_title('Distribution', fontsize=14, fontweight='bold') axes[0].axvline(data.mean(), color='red', linestyle='--', label='Mean') -# Subplot 2: Scatter with colormap +# Scatter with colormap x, y = np.random.randn(2, 100) scatter = axes[1].scatter(x, y, c=y, cmap='viridis', alpha=0.7) plt.colorbar(scatter, ax=axes[1]) -# Subplot 3: Line with confidence interval +# Line with confidence interval x = np.linspace(0, 10, 100) -y = np.sin(x) -axes[2].plot(x, y, 'b-', linewidth=2) -axes[2].fill_between(x, y-0.3, y+0.3, alpha=0.2) +axes[2].plot(x, np.sin(x), 'b-', linewidth=2) +axes[2].fill_between(x, np.sin(x)-0.3, np.sin(x)+0.3, alpha=0.2) plt.tight_layout() -plt.savefig('figure.png', dpi=300, bbox_inches='tight') -
- -

Seaborn — Statistical Plots

-
-import seaborn as sns +plt.savefig('figure.png', dpi=300, bbox_inches='tight')
-# Pair plot — see all relationships at once -sns.pairplot(df, hue='target', diag_kind='kde', - plot_kws={'alpha': 0.6}) +

2. Seaborn — Statistical Plots

+
import seaborn as sns -# Correlation heatmap with annotations +# Correlation heatmap (upper triangle only) fig, ax = plt.subplots(figsize=(10, 8)) mask = np.triu(np.ones_like(df.corr(), dtype=bool)) -sns.heatmap(df.corr(), mask=mask, annot=True, fmt='.2f', - cmap='RdBu_r', center=0, square=True) +sns.heatmap(df.corr(), mask=mask, annot=True, + fmt='.2f', cmap='RdBu_r', center=0) -# Violin + strip plot — distribution + individual points -fig, ax = plt.subplots(figsize=(10, 6)) -sns.violinplot(x='category', y='value', data=df, inner=None, alpha=0.3) -sns.stripplot(x='category', y='value', data=df, size=3, jitter=True) -
+# Pair plot — all relationships at once +sns.pairplot(df, hue='target', diag_kind='kde') -

Plotly — Interactive Visualizations

-
-import plotly.express as px -import plotly.graph_objects as go +# Violin + strip — distribution + individual points +sns.violinplot(x='cat', y='val', data=df, inner=None, alpha=0.3) +sns.stripplot(x='cat', y='val', data=df, size=3, jitter=True)
-# Interactive scatter with hover info -fig = px.scatter(df, x='feature1', y='feature2', - color='target', size='importance', - hover_data=['name'], - title='Feature Analysis') +

3. Plotly — Interactive

+
import plotly.express as px -# Animated chart — data over time +# Animated scatter (like Gapminder) fig = px.scatter(df, x='gdp', y='life_exp', - animation_frame='year', - size='population', color='continent', - hover_name='country', - size_max=60) -fig.show() -
-
- `, - interview: ` + animation_frame='year', size='pop', + color='continent', hover_name='country') +fig.show()
+ `, + interview: `

🎯 Visualization Interview Questions

+
Q1: When to use Matplotlib vs Seaborn vs Plotly?

Answer: Matplotlib: full control, publication figures. Seaborn: statistical EDA, beautiful defaults. Plotly: interactive dashboards, web apps, 3D/maps. Rule: Seaborn for EDA, Matplotlib for papers, Plotly for stakeholders.

+
Q2: How to visualize high-dimensional data?

Answer: (1) PCA/t-SNE/UMAP to 2D, (2) Pair plots, (3) Parallel coordinates, (4) Correlation heatmap, (5) SHAP summary plots.

+
Q3: How to handle overplotting?

Answer: (1) alpha transparency, (2) hexbin, (3) 2D KDE, (4) random sampling, (5) Datashader for millions of points.

+
Q4: What makes good visualization for non-technical stakeholders?

Answer: Clear title stating conclusion, minimal chart junk, annotate key points, consistent color, one insight per chart. Tell a story — what action should they take?

+
Q5: Explain Figure vs Axes in Matplotlib.

Answer: Figure = entire canvas. Axes = single plot area. fig, axes = plt.subplots(2,2) = 4 plots. Always use OO API: ax.plot() not plt.plot().

+
Q6: How to make accessible visualizations?

Answer: Colorblind-safe palettes (viridis), don't rely on color alone, add shapes/patterns, sufficient contrast, alt text, large fonts (12pt+).

+
` +}, + +"advanced-python": { + concepts: ` +
+

🎯 Advanced Python — Complete Engineering Guide

-
- Q1: When would you use Matplotlib vs Seaborn vs Plotly? -

Answer: Matplotlib: full control, publication figures, custom layouts. Seaborn: statistical plots, quick EDA, beautiful defaults. Plotly: interactive dashboards, web apps, 3D/maps. Rule of thumb: Seaborn for EDA, Matplotlib for papers, Plotly for stakeholders.

-
- -
- Q2: How do you visualize high-dimensional data? -

Answer: (1) PCA/t-SNE/UMAP to 2D then scatter plot, (2) Pair plots for feature pairs, (3) Parallel coordinates, (4) Heatmap of correlation matrix, (5) SHAP summary plots for feature importance. For 100+ features, start with correlation heatmap to identify groups.

-
- -
- Q3: How do you handle overplotting in scatter plots? -

Answer: (1) Reduce alpha: alpha=0.1, (2) Hexbin plots: plt.hexbin(), (3) 2D KDE: sns.kdeplot(), (4) Random sampling for display, (5) Datashader for millions of points. The key is encoding density visually.

-
- -
- Q4: What makes a good visualization for non-technical stakeholders? -

Answer: (1) Clear title stating the conclusion, not the method, (2) Minimal chart junk — remove gridlines, borders, legends when obvious, (3) Annotate key data points directly, (4) Use color consistently and meaningfully, (5) Tell a story — what action should they take? Keep it to one insight per chart.

-
- -
- Q5: Explain the Figure and Axes API in Matplotlib. -

Answer: Figure is the entire window/canvas. Axes is a single plot area within the figure. fig, axes = plt.subplots(2,2) creates 4 plots. Always use the OO API for production — ax.plot() not plt.plot(). This gives you explicit control over which subplot you're modifying.

+

1. Decorators — Beyond Basics

+
+
⚡ Three Levels of Decorators
+
Level 1: Simple wrapper (timing, logging). Level 2: Decorator with arguments (factory pattern). Level 3: Class-based decorators with state. Always use functools.wraps to preserve function metadata (name, docstring, signature).
-
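Levels 1 and 2 sketched (the `timed` and `repeat` names are illustrative):

```python
import functools
import time

def timed(func):                       # level 1: simple wrapper
    @functools.wraps(func)             # preserve name/docstring/signature
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        wrapper.elapsed = time.perf_counter() - start
        return result
    return wrapper

def repeat(n):                         # level 2: decorator factory
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = None
            for _ in range(n):
                result = func(*args, **kwargs)
            return result
        return wrapper
    return decorator

@timed
def square(x):
    """Square a number."""
    return x * x

calls = []

@repeat(3)
def ping():
    calls.append(1)

ping()
assert square(3) == 9
assert square.__name__ == 'square'     # metadata preserved by wraps
assert len(calls) == 3
```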
- Q6: How do you make accessible visualizations? -

Answer: (1) Use colorblind-safe palettes (viridis, cividis), (2) Don't rely on color alone — add shapes/patterns, (3) Sufficient contrast ratios, (4) Alt text for web charts, (5) Large enough font sizes (12pt minimum). Test with colorblindness simulators.

-
-
- ` - }, +

2. Context Managers

+

Managing resources reliably. with blocks guarantee cleanup even on errors. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: file handles, DB connections, GPU locks, temporary settings.

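The generator-based approach sketched for the "temporary settings" use case (names illustrative):

```python
from contextlib import contextmanager

@contextmanager
def temporary_setting(config, key, value):
    """Temporarily override a dict entry, restoring it even on error."""
    old = config.get(key)
    config[key] = value
    try:
        yield config
    finally:
        config[key] = old          # cleanup guaranteed, even on exception

cfg = {'mode': 'prod'}
with temporary_setting(cfg, 'mode', 'debug'):
    assert cfg['mode'] == 'debug'
assert cfg['mode'] == 'prod'
```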
- "advanced-python": { - concepts: ` -
-

Advanced Python Engineering

+

3. Dataclasses vs namedtuple vs Pydantic

+ + + + + + + + + +
FeaturenamedtupledataclassPydantic
Mutable✓ (default)✓ (v2)
Validation✗ (manual)✓ (automatic)
Default valuesLimited
Inheritance
JSON serializationManualManualBuilt-in
PerformanceFastestFastSlower (validation)
Use caseImmutable recordsData containersAPI models, configs
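The stdlib two-thirds of the table sketched (Pydantic omitted here as a third-party dependency; class names are illustrative):

```python
from dataclasses import dataclass, field
from typing import NamedTuple

class PointNT(NamedTuple):          # immutable record
    x: float
    y: float

@dataclass
class PointDC:                      # mutable container with defaults
    x: float
    y: float = 0.0
    tags: list = field(default_factory=list)

p1 = PointNT(1.0, 2.0)
p2 = PointDC(1.0)
p2.y = 5.0                          # dataclasses allow mutation
assert p2.y == 5.0
try:
    p1.y = 5.0                      # namedtuples do not
except AttributeError:
    pass
```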
-

🧠 Professional Decorators — Beyond "Hello World"

+

4. Type Hints — Complete Guide

-
⚡ Closures & Wrappers
-
- Decorators are higher-order functions that modify behavior without changing code. Professional implementation tools: Use functools.wraps to preserve metadata (name, docstring), handle both positional and keyword arguments, and support decorators with parameters (factories). -
+
🎯 Why Type Hints Matter
+
Type hints enable: IDE autocompletion, static analysis (mypy), self-documenting code, and runtime validation (Pydantic). Python doesn't enforce them at runtime — they're optional annotations checked by external tools.
+ + + + + + + + + +
HintMeaningExample
int, str, floatBasic typesdef f(x: int) -> str:
list[int]List of ints (3.9+)scores: list[int] = []
dict[str, Any]Dict with str keysconfig: dict[str, Any]
Optional[int]int or Nonex: int | None (3.10+)
Union[int, str]int or strid: int | str
Callable[[int], str]Function signatureCallbacks, decorators
TypeVar('T')Generic typeGeneric containers
-

Context Managers (The Pythonic Way)

-

Managing resources (files, locks, DB connections) reliably. with blocks guarantee cleanup even on errors. Implementation options: (1) Class-based with __enter__ and __exit__, (2) Function-based with @contextlib.contextmanager and yield.

+

5. async/await — Concurrent Python

+

Async is for I/O-bound tasks (API calls, DB queries, file reads). NOT for CPU-bound work (use multiprocessing). The event loop manages coroutines cooperatively. asyncio.gather() runs multiple coroutines concurrently. aiohttp for async HTTP, asyncpg for async PostgreSQL.

-

Iterators & Generators — Memory Efficiency

-
-
💡 Why Generators?
- Generators use lazy evaluation. They produce values one at a time using yield, using constant memory O(1) regardless of dataset size. Ideal for processing huge datasets or infinite streams. -
+

6. Descriptors — How @property Works

+

A descriptor is any object implementing __get__, __set__, or __delete__. @property is a descriptor. They control attribute access at the class level. Used in Django ORM fields, SQLAlchemy columns, and dataclass fields.
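A minimal descriptor sketch: a hypothetical `Positive` validator that does what an `@property` setter would, but is reusable across classes:

```python
class Positive:
    """Data descriptor: runs validation on every assignment."""
    def __set_name__(self, owner, name):
        self.private = "_" + name   # where the value is actually stored
    def __get__(self, obj, objtype=None):
        if obj is None:
            return self             # accessed on the class itself
        return getattr(obj, self.private)
    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.private[1:]} must be positive")
        setattr(obj, self.private, value)

class Model:
    lr = Positive()  # one descriptor instance, shared by all Model objects

m = Model()
m.lr = 0.01          # routed through Positive.__set__
```

Accessing `m.lr` triggers `Positive.__get__` via the class-level attribute lookup, exactly the mechanism `@property` uses internally.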

-

Object-Oriented Design for Data Science

- - - - - - -
ConceptData Science Use Case
InheritanceBaseModel → LinearModel → LogisticRegression
Abstract Base ClassesDefining mandatory methods like fit()/predict()
PropertiesValidating input parameters (e.g., learning rate > 0)
Dunder Methods__call__ for making models callable, __getitem__ for datasets
+

7. Metaclasses

+

Classes are objects too. Metaclasses define how classes behave. type is the default metaclass. Use for: auto-registering subclasses (model registry), enforcing interface standards, singleton pattern. Most developers should use class decorators instead — metaclasses are a last resort.
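The model-registry use case mentioned above can be sketched in a few lines (class names are illustrative):

```python
class RegistryMeta(type):
    """Metaclass that auto-registers every concrete subclass by name."""
    registry = {}

    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        if bases:  # skip the abstract root class itself
            RegistryMeta.registry[name] = cls
        return cls

class BaseModel(metaclass=RegistryMeta):
    pass

class LinearModel(BaseModel):
    pass

class TreeModel(BaseModel):
    pass

# Subclasses registered themselves at class-creation time,
# without any explicit register() call
assert set(RegistryMeta.registry) == {"LinearModel", "TreeModel"}
```

The same effect is achievable with a class decorator or `__init_subclass__`, which is usually the better choice, as the text notes.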

-

Metaclasses & Dynamic Programming

-

Classes are objects too! Classes define how instances behave; Metaclasses define how classes behave. Useful for registry patterns (auto-registering models) or enforcement of interface standards across a codebase. type is the default metaclass.

-
- `, +

8. __slots__ for Memory Efficiency

+

By default, instances store attributes in a per-instance __dict__. Declaring __slots__ as a fixed tuple of attribute names replaces that dict with compact fixed storage, saving roughly 40% memory per instance. Use it when creating millions of objects. Trade-off: you can't add dynamic attributes. Especially useful for data-heavy classes.
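The trade-off in a minimal sketch (hypothetical `Point` classes):

```python
class PointDict:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:
    __slots__ = ("x", "y")  # fixed attribute names, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

slim, fat = PointSlots(1, 2), PointDict(1, 2)
assert not hasattr(slim, "__dict__")   # no dict allocated per instance
assert hasattr(fat, "__dict__")
try:
    slim.z = 3                         # dynamic attributes are rejected
except AttributeError:
    pass
```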

+
`, code: `

💻 Advanced Python Code Examples

-

The Production-Grade Decorator

-
-from functools import wraps -import time -import logging +

1. Production Decorator with Parameters

+
from functools import wraps +import time +import requests -def timer_with_logging(logger): +def retry(max_attempts=3, delay=1.0): + """Decorator factory: retries on failure.""" def decorator(func): -@wraps(func) -def wrapper(*args, **kwargs): - start = time.perf_counter() - try: - result = func(*args, **kwargs) - return result - finally: - duration = time.perf_counter() - start - logger.info(f"Executed {func.__name__} in {duration:.4f}s") -return wrapper + @wraps(func) + def wrapper(*args, **kwargs): + for attempt in range(max_attempts): + try: + return func(*args, **kwargs) + except Exception: + if attempt == max_attempts - 1: + raise + time.sleep(delay * (2 ** attempt)) # Exponential backoff + return wrapper return decorator -@timer_with_logging(logging.getLogger(__name__)) -def train_model(X, y): - # Simulate training - time.sleep(1.5) -
+@retry(max_attempts=3, delay=0.5) +def fetch_data(url): + return requests.get(url).json()
-

Custom Context Manager for GPU Lock

-
-from contextlib import contextmanager +

2. Dataclass with Validation

+
from dataclasses import dataclass, field +from typing import Optional -@contextmanager -def gpu_lock(device_id): - print(f"Acquiring lock for GPU {device_id}") - try: -yield f"GPU_{device_id}_CONTEXT" - finally: -print(f"Releasing GPU {device_id}") +@dataclass +class Experiment: + name: str + lr: float = 0.001 + epochs: int = 100 + tags: list[str] = field(default_factory=list) + + def __post_init__(self): + if self.lr <= 0: + raise ValueError("lr must be positive") -with gpu_lock(0) as ctx: - print(f"Training with {ctx}") -
+exp = Experiment("bert-finetune", lr=3e-5, tags=["nlp"])
-

ABC & Protocol — Enforcing Interfaces

-
-from abc import ABC, abstractmethod -from typing import Protocol +

3. async/await for Parallel API Calls

+
import asyncio +import aiohttp -class Predictor(Protocol): - def predict(self, X: np.ndarray) -> np.ndarray: ... +async def fetch(session, url): + async with session.get(url) as resp: + return await resp.json() -class BaseModel(ABC): - @abstractmethod - def fit(self, X, y): -pass +async def fetch_all(urls): + async with aiohttp.ClientSession() as session: + tasks = [fetch(session, url) for url in urls] + return await asyncio.gather(*tasks) -class MyModel(BaseModel): - def fit(self, X, y): -print("Fitting...") - - def predict(self, X): -return X @ self.weights -
+# 100 API calls in ~1 second (vs 100 seconds sequentially) +results = asyncio.run(fetch_all(urls))
-

Functional Pipelines with itertools

-
-import itertools +

4. Type-Hinted Protocol (Duck Typing)

+
from typing import Protocol +import numpy as np -# Process infinite stream in batches -def get_batches(stream, size): - it = iter(stream) - while True: -batch = list(itertools.islice(it, size)) -if not batch: break -yield batch - -# Data pipeline: chain -> filter -> map -> batch -processed = get_batches( - map(str.upper, filter(lambda x: len(x) > 5, stream)), - batch_size=64 -) -
-
- `, - interview: ` +class Predictor(Protocol): + def predict(self, X: np.ndarray) -> np.ndarray: ... + +def evaluate(model: Predictor, X: np.ndarray, y: np.ndarray): + # Works with ANY object that has .predict() + preds = model.predict(X) + return (preds == y).mean()
+ `, + interview: `

🎯 Advanced Python Interview Questions

+
Q1: Explain MRO (Method Resolution Order).

Answer: C3 Linearization algorithm for multiple inheritance. Access via ClassName.mro(). Guarantees each class appears before its bases, and that bases appear in the order they were listed in the class definition.
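A tiny diamond-inheritance sketch makes the answer concrete:

```python
class A: ...
class B(A): ...
class C(A): ...
class D(B, C): ...  # classic diamond: D inherits A via both B and C

# C3 linearization: D first, then its bases in definition order (B, C),
# then the shared ancestor A, then object
assert [c.__name__ for c in D.mro()] == ["D", "B", "C", "A", "object"]
```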

+
Q2: dataclass vs namedtuple vs Pydantic?

Answer: namedtuple: immutable, fastest. dataclass: mutable, flexible, no validation. Pydantic: auto-validation, JSON serialization, API models. Choose based on whether you need validation.

+
Q3: When to use async/await vs threading vs multiprocessing?

Answer: async: I/O-bound, many connections (1000s of API calls). threading: I/O-bound, simpler code. multiprocessing: CPU-bound (bypasses GIL). Note: NumPy releases the GIL inside its heavy C kernels, so threads can still speed up NumPy-bound work.

+
Q4: How does @property work internally?

Answer: It's a descriptor — implements __get__, __set__, __delete__. When you access obj.x, Python's attribute lookup finds the descriptor on the class and calls __get__.

+
Q5: Decorator with parameters pattern?

Answer: Three nested functions: (1) Factory takes params, returns decorator. (2) Decorator takes function, returns wrapper. (3) Wrapper executes logic. Use @wraps(func) always.

+
Q6: What is __slots__?

Answer: Replaces __dict__ with fixed-size array. Saves ~40% memory per instance. Can't add dynamic attributes. Use for millions of small objects.

+
Q7: Explain closures. Give a real use case.

Answer: A function that captures variables from enclosing scope. The captured variables survive after the enclosing function returns. Use case: factory functions, decorators, callbacks. Example: make_multiplier(3) returns a function that multiplies by 3.
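The `make_multiplier` example from the answer, sketched minimally:

```python
def make_multiplier(factor):
    """Factory: `factor` is captured by the inner function's closure."""
    def multiply(x):
        return x * factor  # free variable, looked up in the closure cell
    return multiply

triple = make_multiplier(3)
# `factor` survives even though make_multiplier has already returned:
assert triple(7) == 21
assert triple.__closure__[0].cell_contents == 3
```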

+
` +}, + +"sklearn": { + concepts: ` +
+

🤖 Scikit-learn — Complete ML Engineering

-
- Q1: What's the difference between __str__ and __repr__? -

Answer: __str__ is for end-users (informal, readable). __repr__ is for developers (detailed, unambiguous, "eval-able"). For data science, always implement __repr__ for models to show hyperparameters when printed.

-
- -
- Q2: Explain Python's MRO (Method Resolution Order). -

Answer: C3 Linearization algorithm. It determines the search order for methods in multiple inheritance. Access it via ClassName.mro(). Python ensures that bases are searched after their subclasses and the order of bases in the class definition is preserved.

-
- -
- Q3: How do you implement a Singleton pattern in Python? -

Answer: Several ways: (1) Overriding __new__, (2) Using a Metaclass (cleanest), (3) Module-level variables (simplest). Example with Metaclass: class Singleton(type): ... then class Database(metaclass=Singleton): ....

-
- -
- Q4: Decorators: How to handle @timer(unit='ms')? -

Answer: This is a decorator factory. You need three levels of functions: (1) Factory takes parameters and returns a decorator, (2) Decorator takes the function and returns a wrapper, (3) Wrapper takes args/kwargs and executes the logic.

-
- -
- Q5: What are *args and **kwargs and when to use them? -

Answer: *args collects positional arguments into a tuple. **kwargs collects keyword arguments into a dictionary. Crucial for wrapping functions, implementing decorators, or creating flexible API interfaces like Scikit-learn's __init__(**params).

+
+
⚡ The Estimator API — Unified Interface
+
Estimators have fit(X, y). Transformers have transform(X). Predictors have predict(X). This consistency allows seamless swapping and composition via Pipelines.
-
- Q6: Explain the difference between is and ==. -

Answer: == checks for equality (values are the same). is checks for identity (objects occupy the same memory address). Use is for Singletons like None or bool. Example: a = [1]; b = [1]; a == b is True, a is b is False.

+

1. Pipelines — Avoiding Data Leakage

+
+
⚠️ The #1 ML Mistake
+ Fitting a scaler on the ENTIRE dataset before splitting = data leakage. Test set statistics leak into training. Fix: put scaling INSIDE a Pipeline, which ensures fit only on training data during cross-validation.
-
- ` - }, - "sklearn": { - concepts: ` -
-

Scikit-learn & ML Engineering

-

🧠 The Estimator API — Unified Interface

-
-
⚡ Consistency is King
-
- Scikit-learn's brilliance lies in its interface consistency. Estimators have fit(X, y), Transformers have transform(X), and Predictors have predict(X). This design allows for seamless swapping of models and preprocessing steps. -
-
+

2. ColumnTransformer — Different processing per column type

+

Real data has mixed types. ColumnTransformer applies different transformations to different column sets: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.

-

Production Pipelines — Avoiding Data Leakage

-

A Pipeline bundles preprocessing and modeling into a single object. Crucial Benefit: It ensures that transformers are fit only on the training fold during cross-validation, preventing information from the validation set (like mean/std) from "leaking" into training. Always use pipelines in production.

+

3. Custom Transformers

+

Inherit from BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives you fit_transform() for free. Use check_is_fitted(self) to validate state.

-

ColumnTransformer — Heterogeneous Data

-

Most real-world data is a mix of types. ColumnTransformer allows you to apply different preprocessing pipelines to different columns (e.g., OneHotEncode categories, Scale numerics) and then concatenate them for the model.

+

4. Cross-Validation Strategies

+ + + + + + + + +
StrategyWhen to UseGotcha
KFoldGeneral purposeDoesn't preserve class ratios
StratifiedKFoldClassification (imbalanced)Preserves class distribution
TimeSeriesSplitTime-ordered dataTrain always before test
GroupKFoldGrouped data (patients)Same group never in train+test
LeaveOneOutVery small datasetsN fits — very slow
RepeatedStratifiedKFoldRobust estimationMultiple random splits
-

Model Evaluation Beyond Accuracy

+

5. Hyperparameter Tuning

- - - - - - + + + + +
MetricUse CaseScikit-learn Name
F1-ScoreImbalanced classification (Precision-Recall balance)f1_score
ROC-AUCProbability ranking / classifier qualityroc_auc_score
MSE / MAERegression error magnitudemean_squared_error
R2 ScoreVariance explained by modelr2_score
Log LossProbabilistic predictions confidencelog_loss
MethodProsCons
GridSearchCVExhaustive, simpleExponential with params
RandomizedSearchCVFaster, continuous distributionsMay miss optimal
Optuna/BayesianOptSmart search, early stoppingMore setup, dependency
Halving*SearchCVSuccessive halving, fastNewer, less documented
-

Cross-Validation Strategies

-

(1) K-Fold: standard, (2) Stratified K-Fold: for imbalanced data, (3) TimeSeriesSplit: for temporal data (preventing looking into the future), (4) GroupKFold: to ensure samples from the same group aren't split across train/test.

-
- `, +

6. Feature Engineering in sklearn

+

PolynomialFeatures, FunctionTransformer, SplineTransformer, KBinsDiscretizer. Chain with Pipeline for clean, leak-free preprocessing. Use make_column_selector to auto-select column types.
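A minimal sketch of chaining these transformers, assuming scikit-learn and NumPy are available (toy data, illustrative only):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, FunctionTransformer

# Hypothetical toy matrix: 2 samples, 2 features
X = np.array([[1.0, 2.0], [3.0, 4.0]])

pipe = Pipeline([
    ("log", FunctionTransformer(np.log1p)),                     # elementwise log(1+x)
    ("poly", PolynomialFeatures(degree=2, include_bias=False)), # x1, x2, x1^2, x1*x2, x2^2
])
X_out = pipe.fit_transform(X)
# 2 input features expand to 5 columns under degree-2 without bias
assert X_out.shape == (2, 5)
```

Because both steps live inside the Pipeline, any statistics they learn are fit only on training folds during cross-validation.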

+ +

7. Model Selection Workflow

+

Train/Val/Test split → Cross-validate multiple models → Select best → Tune hyperparameters → Final evaluation on test set. Never tune on test data. Use cross_val_score for quick comparison, cross_validate for detailed metrics.
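The quick-comparison step can be sketched with cross_val_score, assuming scikit-learn is available (toy synthetic data, illustrative model choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)

# Compare candidate models with the same 5-fold CV before any tuning
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in [
        ("logreg", LogisticRegression(max_iter=1000)),
        ("tree", DecisionTreeClassifier(random_state=0)),
    ]
}
```

Only after picking a winner here would you move on to hyperparameter tuning, keeping the test set untouched until the final evaluation.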

+
`, code: `

💻 Scikit-learn Code Examples

-

The Modular Pipeline Pattern

-
-from sklearn.pipeline import Pipeline +

1. Production Pipeline with ColumnTransformer

+
from sklearn.pipeline import Pipeline from sklearn.compose import ColumnTransformer from sklearn.preprocessing import StandardScaler, OneHotEncoder +from sklearn.impute import SimpleImputer from sklearn.ensemble import RandomForestClassifier -# Define preprocessing for different feature types -numeric_transformer = Pipeline(steps=[ - ('scaler', StandardScaler()) -]) - -categorical_transformer = Pipeline(steps=[ - ('onehot', OneHotEncoder(handle_unknown='ignore')) +num_features = ['age', 'income', 'score'] +cat_features = ['gender', 'city'] + +preprocessor = ColumnTransformer([ + ('num', Pipeline([ + ('imputer', SimpleImputer(strategy='median')), + ('scaler', StandardScaler()) + ]), num_features), + ('cat', Pipeline([ + ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), + ('encoder', OneHotEncoder(handle_unknown='ignore')) + ]), cat_features) ]) -preprocessor = ColumnTransformer(transformers=[ - ('num', numeric_transformer, numeric_features), - ('cat', categorical_transformer, categorical_features) -]) - -# Create full pipeline -clf = Pipeline(steps=[ +pipe = Pipeline([ ('preprocessor', preprocessor), ('classifier', RandomForestClassifier(n_estimators=100)) ]) +pipe.fit(X_train, y_train) # No data leakage!
-# Entire workflow in one object -clf.fit(X_train, y_train) -preds = clf.predict(X_test) -
- -

Custom Transformers — Industry Standard

-
-from sklearn.base import BaseEstimator, TransformerMixin - -class LogTransformer(BaseEstimator, TransformerMixin): - def __init__(self, columns=None): -self.columns = columns - - def fit(self, X, y=None): -return self +

2. Custom Transformer

+
from sklearn.base import BaseEstimator, TransformerMixin +import numpy as np +class OutlierClipper(BaseEstimator, TransformerMixin): + def __init__(self, factor=1.5): + self.factor = factor + + def fit(self, X, y=None): + Q1 = np.percentile(X, 25, axis=0) + Q3 = np.percentile(X, 75, axis=0) + IQR = Q3 - Q1 + self.lower_ = Q1 - self.factor * IQR + self.upper_ = Q3 + self.factor * IQR + return self + def transform(self, X):
- -

Hyperparameter Optimization (Advanced)

-
-from sklearn.model_selection import RandomizedSearchCV -from scipy.stats import randint - -param_dist = { - 'classifier__n_estimators': randint(50, 500), - 'classifier__max_depth': [5, 10, 20, None], - 'preprocessor__num__scaler': [StandardScaler(), RobustScaler()] -} + return np.clip(X, self.lower_, self.upper_)
-search = RandomizedSearchCV(clf, param_dist, n_iter=50, cv=3) -search.fit(X_train, y_train) +

3. Hyperparameter Tuning with Optuna

+
import optuna +from xgboost import XGBClassifier +from sklearn.model_selection import cross_val_score -print(f"Best Score: {search.best_score_}") -print(f"Best Params: {search.best_params_}") -
-
- `, - interview: ` +def objective(trial): + params = { + 'n_estimators': trial.suggest_int('n_estimators', 50, 500), + 'max_depth': trial.suggest_int('max_depth', 3, 15), + 'learning_rate': trial.suggest_float('lr', 1e-3, 0.3, log=True) + } + model = XGBClassifier(**params) + score = cross_val_score(model, X, y, cv=5).mean() + return score + +study = optuna.create_study(direction='maximize') +study.optimize(objective, n_trials=100)
+
`, + interview: `

🎯 Scikit-learn Interview Questions

+
Q1: What is data leakage? How to prevent it?

Answer: Info from test set influencing training. Common cause: fitting scaler on full data before split. Fix: put all preprocessing inside a Pipeline which ensures fit only on train folds during cross-validation.

+
Q2: Pipeline vs ColumnTransformer?

Answer: Pipeline: sequential steps (A→B→C). ColumnTransformer: parallel branches (different processing for different column types). Typically ColumnTransformer inside Pipeline.

+
Q3: When to use which cross-validation?

Answer: KFold: general. StratifiedKFold: imbalanced classes. TimeSeriesSplit: temporal. GroupKFold: grouped data (same patient never in both).

+
Q4: GridSearch vs RandomSearch vs Bayesian?

Answer: Grid: exhaustive but exponential. Random: better for many params, samples continuous distributions. Bayesian (Optuna): learns from previous trials, most efficient for expensive models.

+
Q5: How to create a custom transformer?

Answer: Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) (learn params, return self) and transform(X) (apply). TransformerMixin gives fit_transform() free.

+
Q6: Explain fit() vs transform() vs predict().

Answer: fit(): learn parameters from data. transform(): apply the learned parameters to transform data. predict(): generate predictions. fit() runs only on training data; transform()/predict() run on both train and test.

+
` +}, + +"pytorch": { + concepts: ` +
+

🔥 Deep Learning with PyTorch — Complete Guide

-
- Q1: Why use fit_transform on train but only transform on test? -

Answer: To prevent Data Leakage. Mean/variance for scaling must be learned ONLY from training data. Applying fit to test data uses future information about the test distribution, leading to overly optimistic results.

+
+
⚡ PyTorch Philosophy: Define-by-Run
+
PyTorch builds the computational graph dynamically as operations execute (eager mode). This makes debugging natural — use print(), breakpoints, standard Python control flow. TensorFlow originally used static graphs (define-then-run).
-
- Q2: When would you use predict_proba instead of predict? -

Answer: When you need the uncertainty of the model or need to adjust the decision threshold. For cost-sensitive problems (e.g., fraud), you might flag anything with >10% probability, rather than the default 50%.

-
+

1. Tensors — The Foundation

+ + + + + + + +
ConceptWhat It IsKey Point
TensorN-dimensional arrayLike NumPy ndarray but GPU-capable
requires_gradTrack operations for autogradOnly enable for learnable parameters
deviceCPU or CUDA.to('cuda') moves to GPU
.detach()Stop gradient trackingUse for inference/metrics
.item()Extract scalar valueUse for logging loss values
-
- Q3: Explain the bias-variance tradeoff in terms of Complexity. -

Answer: Underfitting (High Bias) happens when the model is too simple (e.g., linear on non-linear data). Overfitting (High Variance) happens when the model is too complex and captures noise. Regularization (Alpha/C parameters) is used to find the "sweet spot".

+

2. Autograd — How Backpropagation Works

+
+
🧠 Computational Graph
+
When requires_grad=True, PyTorch records every operation in a directed acyclic graph (DAG). Each tensor stores its grad_fn — the function that created it. .backward() traverses this graph in reverse, computing gradients via the chain rule. The graph is destroyed after backward() (unless retain_graph=True).
+

Gradient accumulation: By default, .backward() accumulates gradients. You MUST call optimizer.zero_grad() before each backward pass. This is intentional — allows gradient accumulation for larger effective batch sizes.
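The accumulation behavior described above can be sketched as follows, assuming PyTorch is installed (tiny illustrative model, hypothetical accum_steps value):

```python
import torch
import torch.nn as nn

model = nn.Linear(4, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
accum_steps = 4  # effective batch = 4 micro-batches

opt.zero_grad()
for step in range(accum_steps):
    x = torch.randn(8, 4)
    loss = model(x).pow(2).mean() / accum_steps  # scale so grads average
    loss.backward()   # gradients ADD into .grad across micro-batches
opt.step()            # one weight update for the accumulated gradient
opt.zero_grad()       # reset before the next accumulation cycle
```

Skipping the final zero_grad() would let this cycle's gradients leak into the next one, which is exactly the bug the text warns about.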

-
- Q4: How do you handle imbalanced datasets in Sklearn? -

Answer: (1) class_weight='balanced' inside estimators, (2) Stratified cross-validation, (3) Focus on Precision-Recall curves/AUC instead of Accuracy, (4) Resampling (using imblearn library which is Sklearn-compatible).

-
+

3. nn.Module — Building Blocks

+

Every model inherits nn.Module. Define layers in __init__, computation in forward(). model.parameters() returns all learnable weights. model.train() and model.eval() toggle BatchNorm/Dropout behavior. model.state_dict() saves/loads weights.

-
- Q5: What's the difference between L1 (Lasso) and L2 (Ridge) regularization? -

Answer: L1 adds absolute value penalty; it results in sparse models (coefficients become exactly zero), effectively performing feature selection. L2 adds squared penalty; it shrinks coefficients towards zero but rarely to zero, good for handling multicollinearity.

-
-
- ` - }, - "pytorch": { - concepts: ` -
-

PyTorch & Deep Learning Primitives

+

4. Training Loop — The Standard Pattern

+

Every PyTorch training follows: (1) Forward pass, (2) Compute loss, (3) optimizer.zero_grad(), (4) loss.backward(), (5) optimizer.step(). No magic — you write it explicitly. This gives full control over learning rate scheduling, gradient clipping, mixed precision, etc.

-

🧠 Computational Graphs & Autograd

-
-
⚡ Dynamic vs Static
-
- PyTorch uses Dynamic Computational Graphs (Define-by-Run). The graph is built on-the-fly as operations are performed. Autograd tracks every operation on tensors with requires_grad=True and automatically computes gradients using the chain rule during .backward(). -
-
+

5. Custom Datasets & DataLoaders

+

Dataset: override __len__ and __getitem__. DataLoader: wraps Dataset with batching, shuffling, multi-worker loading. Use num_workers > 0 for parallel data loading. pin_memory=True speeds up CPU→GPU transfer.

-

Tensors — The Heart of PyTorch

-

Tensors are multi-dimensional arrays (like NumPy) but with two superpowers: (1) GPU Acceleration (move to 'cuda' or 'mps'), (2) Automatic Differentiation. Bridging to NumPy is zero-copy for CPU tensors.

+

6. Mixed Precision Training (AMP)

+

Use torch.cuda.amp for automatic mixed precision. Forward pass in float16 (2x faster on modern GPUs), gradients in float32 (numerical stability). GradScaler prevents underflow. Up to 2-3x speedup with minimal accuracy loss.

-

Modular Architecture (nn.Module)

-

Every model in PyTorch inherits from nn.Module. You define parameters/layers in __init__ and the forward pass logic in forward(). This design promotes recursive composition — models can contain other modules.

+

7. Transfer Learning

+

Load pretrained model → Freeze base layers → Replace final layer → Fine-tune. model.requires_grad_(False) freezes all. Then unfreeze last N layers. Use smaller learning rate for pretrained layers.

-

Data Engineering: Dataset & DataLoader

- - - - - -
ComponentResponsibility
DatasetDefines HOW to load a single sample (__getitem__) and total count (__len__)
DataLoaderHandles batching, shuffling, multi-process loading, and memory pinning
TransformsOn-the-fly augmentation (cropping, flipping, normalizing)
+

8. Hook System for Debugging

+

Register hooks on modules: register_forward_hook, register_backward_hook. View intermediate activations, gradient magnitudes, feature maps. Essential for debugging vanishing/exploding gradients.
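A minimal forward-hook sketch, assuming PyTorch is installed (layer sizes are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # detach: don't track gradients
    return hook

# Capture the hidden layer's ReLU output during forward passes
handle = model[1].register_forward_hook(save_activation("hidden_relu"))
_ = model(torch.randn(3, 10))
handle.remove()  # always remove hooks when done debugging
```

The same pattern with register_full_backward_hook lets you log gradient magnitudes per layer to spot vanishing or exploding gradients.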

-

The Optimization Loop Essentials

-

Standard pattern: (1) Zero gradients, (2) Forward pass, (3) Compute Loss, (4) Backward pass (backprop), (5) Optimizer step. Don't forget model.train() and model.eval() to toggle dropout and batch norm behavior.

-
- `, +

9. Distributed Training (DDP)

+

DistributedDataParallel is the standard for multi-GPU training. Each GPU runs a copy of the model, gradients are averaged across GPUs (all-reduce). Near-linear scaling. Use torchrun to launch.

+
`, code: `

💻 PyTorch Code Examples

-

The Ultimate Training Boilerplate

-
-import torch +

1. Complete Training Loop

+
import torch import torch.nn as nn -import torch.optim as optim - -# Device agnostic code -device = torch.device("cuda" if torch.cuda.is_available() else "cpu") - -# 1. Define Architecture -class SimpleNet(nn.Module): - def __init__(self): -super().__init__() -self.flatten = nn.Flatten() -self.fc = nn.Sequential( - nn.Linear(28*28, 512), - nn.ReLU(), - nn.Dropout(0.2), - nn.Linear(512, 10) -) +class MLP(nn.Module): + def __init__(self, in_dim, hidden, out_dim): + super().__init__() + self.net = nn.Sequential( + nn.Linear(in_dim, hidden), + nn.ReLU(), + nn.Dropout(0.3), + nn.Linear(hidden, out_dim) + ) + def forward(self, x): -x = self.flatten(x) -return self.fc(x) + return self.net(x) -model = SimpleNet().to(device) -optimizer = optim.Adam(model.parameters(), lr=1e-3) +model = MLP(784, 256, 10).to('cuda') +optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4) criterion = nn.CrossEntropyLoss() -# 2. Training Loop -model.train() -for batch, (X, y) in enumerate(dataloader): - X, y = X.to(device), y.to(device) +for epoch in range(10): + model.train() + for X_batch, y_batch in train_loader: + X_batch = X_batch.to('cuda') + y_batch = y_batch.to('cuda') + + logits = model(X_batch) + loss = criterion(logits, y_batch) + + optimizer.zero_grad() + loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step()
+ +

2. Custom Dataset

+
from torch.utils.data import Dataset, DataLoader +import torch + +class TabularDataset(Dataset): + def __init__(self, df, target_col): + self.X = torch.FloatTensor(df.drop(target_col, axis=1).values) + self.y = torch.LongTensor(df[target_col].values) - # Zero -> Forward -> Backward -> Step - optimizer.zero_grad() - pred = model(X) - loss = criterion(pred, y) - loss.backward() - optimizer.step() -
- -

Custom Dataset Implementation

-
-from torch.utils.data import Dataset - -class ImageDataset(Dataset): - def __init__(self, annotations_file, img_dir, transform=None): -self.img_labels = pd.read_csv(annotations_file) -self.img_dir = img_dir -self.transform = transform - def __len__(self): -return len(self.img_labels) - + return len(self.X) + def __getitem__(self, idx): -img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0]) -image = read_image(img_path) -label = self.img_labels.iloc[idx, 1] -if self.transform: - image = self.transform(image) -return image, label -
+ return self.X[idx], self.y[idx] -

Transfer Learning — Freezing Layers

-
-from torchvision import models +loader = DataLoader(dataset, batch_size=64, shuffle=True, + num_workers=4, pin_memory=True)
-model = models.resnet18(pretrained=True) +

3. Mixed Precision Training

+
from torch.cuda.amp import autocast, GradScaler -# Freeze all weights -for param in model.parameters(): - param.requires_grad = False - -# Replace final head (newly initialized, so requires_grad=True) -num_ftrs = model.fc.in_features -model.fc = nn.Linear(num_ftrs, 2) - -model = model.to(device) -# Only model.fc.parameters() will be updated -optimizer = optim.SGD(model.fc.parameters(), lr=0.001) -
-
- `, - interview: ` +scaler = GradScaler() +for X, y in train_loader: + optimizer.zero_grad() + with autocast(): # Float16 forward pass + logits = model(X.cuda()) + loss = criterion(logits, y.cuda()) + scaler.scale(loss).backward() # Scaled backward + scaler.step(optimizer) + scaler.update()
+ +

4. Transfer Learning

+
import torchvision.models as models + +# Load pretrained, freeze, replace head +model = models.resnet50(weights='IMAGENET1K_V2') +model.requires_grad_(False) # Freeze all +model.fc = nn.Linear(2048, 10) # New trainable head
+
`, + interview: `

🎯 PyTorch Interview Questions

+
Q1: How does autograd work?

Answer: PyTorch records operations in a DAG when requires_grad=True. .backward() traverses the graph in reverse, computing gradients via chain rule. Graph is destroyed after backward (dynamic graph).

+
Q2: Why call optimizer.zero_grad()?

Answer: PyTorch accumulates gradients by default. Without zeroing, gradients from previous batch add to current. This is intentional — allows gradient accumulation for larger effective batches.

+
Q3: model.train() vs model.eval()?

Answer: train(): BatchNorm uses batch stats, Dropout is active. eval(): BatchNorm uses running stats, Dropout disabled. Always switch before training/inference.

+
Q4: .detach() vs with torch.no_grad()?

Answer: .detach(): creates a tensor that shares data but doesn't track gradients (single tensor). torch.no_grad(): context manager disabling gradient computation for all operations inside (saves memory during inference).

+
Q5: How to debug vanishing/exploding gradients?

Answer: (1) Register backward hooks to monitor gradient magnitudes. (2) Use torch.nn.utils.clip_grad_norm_. (3) Gradient histograms in TensorBoard. (4) Check if BatchNorm/LayerNorm is applied. (5) Try skip connections (ResNet idea).

+
Q6: DataLoader num_workers — how many?

Answer: Rule of thumb: num_workers = 4 * num_gpus. Too many = CPU overhead, too few = GPU starved. Use pin_memory=True for faster CPU→GPU transfer. Profile to find sweet spot.

+
` +}, + +"tensorflow": { + concepts: ` +
+

🧠 TensorFlow & Keras — Complete Guide

-
- Q1: Why is optimizer.zero_grad() necessary? -

Answer: By default, PyTorch accumulates gradients on every .backward() call. This is useful for RNNs or training with effectively larger batch sizes than memory allows. If you don't zero them out, gradients from previous batches will influence the current update, leading to incorrect training.

-
- -
- Q2: What is the difference between model.train() and model.eval()? -

Answer: They set the mode for specific layers. .train() enables Dropout and Batch Normalization (calculates stats for current batch). .eval() disables dropout and uses running averages for Batch Norm. Forgetting .eval() during testing will lead to inconsistent/bad predictions.

+
+
⚡ TensorFlow 2.x Philosophy
+
TF2 defaults to eager execution (like PyTorch). @tf.function compiles to static graph for production speed. Keras is the official high-level API. TF handles the full ML lifecycle: training → saving → serving → monitoring.
-
- Q3: Explain the role of torch.no_grad(). -

Answer: It's a context manager that disables gradient calculation. Use it during inference or validation to save memory and compute resources. It prevents the creation of the computational graph for those operations.

-
+

1. Three Ways to Build Models

+ + + + + +
APIUse CaseFlexibility
SequentialSimple stack of layersLow (linear only)
FunctionalMulti-input/output, branchingMedium
SubclassingCustom forward logicHigh (most flexible)
-
- Q4: PyTorch vs TensorFlow — technical tradeoffs? -

Answer: PyTorch (Dynamic graph) is more Pythonic, easier to debug with standard tools, and highly favored in research. TensorFlow (Static graph/Keras) historically had better deployment tools (TFLite, TFServing) and massive industry scale, though the gap has significantly narrowed with PyTorch 2.0 and TorchServe.

-
+

2. tf.data — The Data Pipeline

+

Build efficient input pipelines: tf.data.Dataset chains transformations lazily. Key methods: .map(), .batch(), .shuffle(), .prefetch(tf.data.AUTOTUNE). Prefetching overlaps data loading with model execution. Supports TFRecord files for large datasets.

-
- Q5: What is "Tensor Broadcasting" in PyTorch? -

Answer: Same as NumPy. If dimensions don't match, PyTorch virtually expands the smaller tensor to match the larger one, provided the shapes are compatible (trailing dimensions match or are 1). The expansion is done with stride tricks, so no values are actually copied in memory.

-
-
- ` - }, - "tensorflow": { - concepts: ` -
-

TensorFlow & Production DL

+

3. Callbacks — Training Hooks

+ + + + + + + + +
CallbackPurpose
ModelCheckpointSave best model (monitor val_loss)
EarlyStoppingStop when metric plateaus
ReduceLROnPlateauReduce LR when stuck
TensorBoardVisualize training metrics
CSVLoggerLog metrics to CSV
LambdaCallbackCustom logic per epoch
-

🧠 The Keras Ecosystem

-
-
⚡ User-First API
-
- Keras is TensorFlow's high-level API. It focuses on Developer Experience (DX) — minimizing the number of user actions for common use cases. tf.keras supports three ways to build models: (1) Sequential (simple stacks), (2) Functional (DAGs, multi-input/output), (3) Subclassing (full control). -
-
+

4. Custom Training with GradientTape

+

For full control: tf.GradientTape() records operations, then tape.gradient(loss, model.trainable_variables) computes gradients. Same pattern as PyTorch's manual loop. Use for: GANs, reinforcement learning, custom loss functions.

-

tf.data — Performance Pipelines

-

Loading data is often the bottleneck. tf.data.Dataset enables "ETL" pipelines: Extract (from disk/cloud), Transform (shuffle, batch, repeat), Load (map to GPU). Concepts like prefetch and interleave ensure the GPU is never waiting for the CPU.

+

5. SavedModel for Deployment

+

model.save('path') exports as SavedModel format — includes architecture, weights, and computation graph. Ready for TF Serving, TF Lite (mobile), TF.js (browser). Universal deployment format.

-

Static Graphs & tf.function

-

TensorFlow can convert Python code into a Static Computational Graph using @tf.function. This enables significant optimizations like constant folding and makes models exportable to environments without Python (C++, Java, JS).

+

6. @tf.function — Graph Compilation

+

Decorating with @tf.function traces Python code into a TF graph. Benefits: optimized execution, XLA compilation, deployment. Gotchas: Python side effects only run during tracing, use tf.print() instead of print().

-

Monitoring with TensorBoard

+

7. TF vs PyTorch — When to Choose

- - - - - + + + + + +
ComponentVisualized metric
ScalarsLoss/Accuracy curves in real-time
HistogramsWeights/Gradients distribution (checking for vanishing/exploding)
GraphsThe internal model architecture
ProjectorHigh-dimensional embeddings (t-SNE/PCA)
AspectTensorFlowPyTorch
DeploymentTF Serving, TFLite, TF.jsTorchServe, ONNX
ResearchLess common nowDominant in papers
ProductionMature ecosystemCatching up fast
MobileTFLite (mature)PyTorch Mobile
DebuggingHarder (graph mode)Easier (eager by default)
- -

Deployment Architecture (TFX)

-

TensorFlow Extended (TFX) is for end-to-end ML. Key components: TF Serving (for APIs), TF Lite (for mobile/edge), TFJS (for web browsers). TF Serving supports model versioning and A/B testing out of the box.

-
- `, +
`, code: `

💻 TensorFlow Code Examples

-

The Functional API Pattern

-
-import tensorflow as tf +

1. Functional API Model

+
import tensorflow as tf from tensorflow import keras -from tensorflow.keras import layers - -# Functional API — best for production model logic -inputs = keras.Input(shape=(784,)) -x = layers.Dense(64, activation="relu")(inputs) -x = layers.Dense(64, activation="relu")(x) -outputs = layers.Dense(10, activation="softmax")(x) - -model = keras.Model(inputs=inputs, outputs=outputs, name="mnist_model") - -model.compile( - loss=keras.losses.SparseCategoricalCrossentropy(), - optimizer=keras.optimizers.RMSprop(), - metrics=["accuracy"], -) -history = model.fit(x_train, y_train, batch_size=64, epochs=2, validation_split=0.2) -
- -

High-Performance Data Pipeline

-
-def load_and_preprocess(path, label): - image = tf.io.read_file(path) - image = tf.image.decode_jpeg(image, channels=3) - image = tf.image.resize(image, [224, 224]) - return image / 255.0, label - -dataset = tf.data.Dataset.from_tensor_slices((image_paths, labels)) -dataset = (dataset - .shuffle(1000) - .map(load_and_preprocess, num_parallel_calls=tf.data.AUTOTUNE) - .batch(32) - .prefetch(tf.data.AUTOTUNE) # Overlap training and preprocessing -) -
- -

Custom Layers & Training Loops

-
-# Custom Training Loop (GradientTape) -optimizer = tf.keras.optimizers.Adam() -loss_fn = tf.keras.losses.BinaryCrossentropy() - -@tf.function # Compiles to static graph for speed -def train_step(x, y): +# Multi-input model +text_input = keras.Input(shape=(100,), name='text') +num_input = keras.Input(shape=(5,), name='features') + +x1 = keras.layers.Embedding(10000, 64)(text_input) +x1 = keras.layers.GlobalAveragePooling1D()(x1) +x2 = keras.layers.Dense(32, activation='relu')(num_input) + +combined = keras.layers.Concatenate()([x1, x2]) +output = keras.layers.Dense(1, activation='sigmoid')(combined) +model = keras.Model(inputs=[text_input, num_input], outputs=output)
+ +

2. Training with Callbacks

+
callbacks = [ + keras.callbacks.ModelCheckpoint('best.keras', + monitor='val_loss', save_best_only=True), + keras.callbacks.EarlyStopping(patience=5, + restore_best_weights=True), + keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3), + keras.callbacks.TensorBoard(log_dir='./logs') +] + +model.compile(optimizer='adam', loss='binary_crossentropy', + metrics=['accuracy', keras.metrics.AUC()]) +model.fit(X_train, y_train, epochs=50, validation_split=0.2, + callbacks=callbacks)
+ +

3. Custom Training Loop (GradientTape)

+
@tf.function +def train_step(model, X, y, optimizer, loss_fn): with tf.GradientTape() as tape: -logits = model(x, training=True) -loss_value = loss_fn(y, logits) - - grads = tape.gradient(loss_value, model.trainable_weights) - optimizer.apply_gradients(zip(grads, model.trainable_weights)) - return loss_value -
-
- `, - interview: ` + preds = model(X, training=True) + loss = loss_fn(y, preds) + grads = tape.gradient(loss, model.trainable_variables) + optimizer.apply_gradients(zip(grads, model.trainable_variables)) + return loss
+ +

4. tf.data Pipeline

+
# Efficient data pipeline with prefetching +dataset = ( + tf.data.Dataset.from_tensor_slices((X, y)) + .shuffle(10000) + .batch(64) + .map(lambda x, y: (augment(x), y), + num_parallel_calls=tf.data.AUTOTUNE) + .prefetch(tf.data.AUTOTUNE) # Overlap loading + training +)
+ `, + interview: `

🎯 TensorFlow Interview Questions

- -
- Q1: What is tf.function and AutoGraph? -

Answer: tf.function is a decorator that converts a regular Python function into a TensorFlow static graph. AutoGraph is the internal tool that translates Python control flow (if, while) into TF graph ops. This allows for compiler-level optimizations and easy deployment without a Python environment.

-
- -
- Q2: Why use tf.data.AUTOTUNE? -

Answer: It allows TensorFlow to dynamically adjust the level of parallelism and buffer sizes based on your CPU/disk hardware. It ensures that data preprocessing (CPU) is always one step ahead of model training (GPU), preventing hardware starvation.

-
- -
- Q3: Functional API vs Sequential vs Subclassing? -

Answer: Sequential: purely linear stacks. Functional: most common for production, supports non-linear topology (shared layers, multiple inputs/outputs). Subclassing: full control over the forward pass, best for complex research/custom logic. Functional is generally preferred for its balance of power and debugging ease.

-
- -
- Q4: How do you prevent overfitting in TensorFlow? -

Answer: (1) EarlyStopping callback, (2) Dropout layers, (3) L1/L2 kernels regularizers, (4) Data augmentation (via tf.image or keras.layers), (5) Learning rate schedules via callbacks.ReduceLROnPlateau.

-
- -
- Q5: What is SavedModel format? -

Answer: The language-neutral, hermetic serialization format for TF models. It includes the model architecture, weights, and the computational graph (signatures). It is the standard format for TF Serving and TFLite conversion.

-
-
- ` - }, - "production": { - concepts: ` +
Q1: Sequential vs Functional vs Subclassing?

Answer: Sequential: linear stack. Functional: multi-input/output, shared layers. Subclassing: full Python control, custom forward. Use Functional for most real projects.

+
Q2: What does @tf.function do?

Answer: Traces a Python function into a TF graph. Faster execution, enables XLA optimization, and the traced concrete functions become the signatures stored in a SavedModel. Gotcha: Python code only runs during tracing — side effects behave differently.

+
Q3: How does tf.data improve performance?

Answer: Chains transformations lazily. .prefetch(AUTOTUNE) overlaps data loading with GPU computation. .cache() stores in memory after first epoch. .interleave() reads multiple files concurrently.

+
Q4: EarlyStopping — what to monitor?

Answer: Usually val_loss. Set patience=5-10 (epochs without improvement). restore_best_weights=True reverts to best epoch. Combine with ReduceLROnPlateau for better convergence.

+
Q5: When to use GradientTape?

Answer: When Keras .fit() is too restrictive: GANs (two optimizers), RL (custom gradients), multi-loss weighting, gradient penalty, research experiments needing full control.

+
Q6: TF vs PyTorch — when to choose each?

Answer: TF: production deployment (TF Serving, TFLite), mobile apps, TPU training. PyTorch: research, prototyping, Hugging Face ecosystem. Both are converging in features.

+ ` +}, + +"production": { + concepts: `
-

Production Python & MLOps

+

📦 Production Python — Complete Engineering Guide

-

🧠 FastAPI — The Modern Standard

-
⚡ High Performance APIs
-
- FastAPI is built on Starlette and Pydantic. It supports async/await for handling concurrent requests without blocking, uses type hints for automatic validation, and generates interactive OpenAPI (Swagger) documentation. It is the gold standard for serving ML models today. -
+
⚡ Production = Reliability + Reproducibility + Observability
+
Production code must be tested (pytest), typed (mypy), logged (structured logging), packaged (pyproject.toml), containerized (Docker), and monitored (metrics/alerts). The gap between notebook code and production code is enormous.
-

Pydantic & Data Validation

-

In production, you cannot trust input data. Pydantic enforces strict type checking and validation at runtime. If a JSON request arrives with a string instead of a float for a model feature, Pydantic catches it immediately and returns a clear error before the model even sees it.

+

1. pytest — Professional Testing

+ + + + + + + + +
FeaturePurposeExample
fixturesReusable test setup@pytest.fixture for test data
parametrizeRun same test with many inputs@pytest.mark.parametrize
conftest.pyShared fixtures across testsDB connections, mock data
monkeypatchOverride functions/env varsMock API calls
tmp_pathTemporary directoryTest file I/O without cleanup
markersTag tests (slow, gpu, integration)pytest -m "not slow"
+ +

2. Logging Best Practices

+
+
💡 Logging vs Print
+ Never use print() in production. Use the logging module: configurable levels (DEBUG/INFO/WARNING/ERROR), output to files, structured format, negligible cost when a level is disabled. +
+ + + + + + + +
LevelWhen to Use
DEBUGDetailed diagnostic (tensor shapes, intermediate values)
INFONormal events (training started, epoch complete)
WARNINGSomething unexpected but handled (missing feature, fallback)
ERRORSomething failed (model load error, API failure)
CRITICALSystem-level failure (out of memory, GPU crash)
-

The ML Model Serving Lifecycle

+

3. Project Structure

+
project/ +├── src/ +│ └── mypackage/ +│ ├── __init__.py +│ ├── data/ +│ ├── models/ +│ ├── training/ +│ └── serving/ +├── tests/ +├── configs/ +├── pyproject.toml +├── Dockerfile +└── README.md
+ +

4. FastAPI for Model Serving

+

Modern async web framework. Auto-generates OpenAPI docs. Type-validated requests via Pydantic. Use for: model inference APIs, data pipelines, webhook handlers. Deploy with Uvicorn + Docker. Add health checks and input validation.

+ +

5. Docker for ML Projects

+

Containerize your entire environment: Python version, CUDA drivers, dependencies. Multi-stage builds: builder stage (install deps) → runtime stage (slim image). Use NVIDIA Container Toolkit for GPU access. Pin all dependency versions.

+ +

6. Configuration Management

- - - - - + + + + +
StageResponsibilityTools
InitializationLoading model weights into memory (once)FastAPI Lifespan
InferencePreprocessing input and getting predictionNumPy/Pydantic
Post-processingFormatting prediction for the clientJSON/Protobuf
ObservabilityLogging latency, inputs, and driftPrometheus/ELK
ToolBest ForKey Feature
HydraML experimentsYAML configs, CLI overrides, multi-run
Pydantic SettingsApp configEnv var loading, validation
python-dotenvSimple projects.env file loading
dynaconfMulti-environmentdev/staging/prod configs
-

Dependency Management & Docker

-

Conda vs Pip: Pip is standard for Python; Conda is better for C-extensions/CUDA. Docker: Containerizing the environment ensures it "works on my machine" translates to "works in the cloud". Use lightweight base images (python:3.10-slim) to minimize security risks and build times.

+

7. CI/CD for ML

+

Automate: linting (ruff/flake8), type checking (mypy), testing (pytest), building (Docker), deploying. Use GitHub Actions or GitLab CI. Add model validation gate: compare new model metrics against baseline before deployment.

-

Testing ML Applications

-

(1) Unit tests: for preprocessing logic, (2) Integration tests: for the API endpoints, (3) Model Quality tests: ensuring the model meets a minimum accuracy threshold on a benchmark dataset before deployment.

-
- `, +

8. Code Quality Tools

+ + + + + + +
ToolPurpose
ruffFast linter + formatter (replaces black, isort, flake8)
mypyStatic type checking
pre-commitGit hooks for auto-formatting
pytest-covTest coverage measurement
+ `, code: `

💻 Production Python Code Examples

-

The FastAPI Model Server Pattern

-
-from fastapi import FastAPI, HTTPException -from pydantic import BaseModel -import joblib +

1. pytest — ML Testing Patterns

+
import pytest +import numpy as np -app = FastAPI(title="ML Model API") +@pytest.fixture +def sample_data(): + X = np.random.randn(100, 10) + y = np.random.randint(0, 2, 100) + return X, y -# 1. Prediction Schema -class PredictionInput(BaseModel): - feature_1: float - feature_2: float - category: str +@pytest.mark.parametrize("model_cls", [ + LogisticRegression, + RandomForestClassifier, + GradientBoostingClassifier +]) +def test_model_fits(model_cls, sample_data): + X, y = sample_data + model = model_cls() + model.fit(X, y) + preds = model.predict(X) + assert preds.shape == y.shape + assert set(preds).issubset({0, 1})
+ +

2. Structured Logging

+
import logging +import json + +class JSONFormatter(logging.Formatter): + def format(self, record): + return json.dumps({ + 'timestamp': self.formatTime(record), + 'level': record.levelname, + 'message': record.getMessage(), + 'module': record.module + }) + +logger = logging.getLogger('ml_pipeline') +logger.setLevel(logging.INFO) +handler = logging.StreamHandler() +handler.setFormatter(JSONFormatter()) +logger.addHandler(handler) + +logger.info("Training complete", extra={'accuracy': 0.95})
+ +

3. FastAPI Model Serving

+
from fastapi import FastAPI +from pydantic import BaseModel -# 2. Global Predictor Registry -model = None +app = FastAPI(title="ML API") -@app.on_event("startup") -def load_model(): - global model - model = joblib.load('model.joblib') +class PredictRequest(BaseModel): + features: list[float] + model_name: str = "default" @app.post("/predict") -async def predict(data: PredictionInput): - try: -features = [[data.feature_1, data.feature_2]] -prediction = model.predict(features) -return {"prediction": float(prediction[0])} - except Exception as e: -raise HTTPException(status_code=500, detail=str(e)) -
- -

Robust Logging Strategy

-
-import logging -import sys - -def get_logger(name): - logger = logging.getLogger(name) - logger.setLevel(logging.INFO) - - # JSON formatter for easier ELK/Splunk ingestion - handler = logging.StreamHandler(sys.stdout) - formatter = logging.Formatter( -'{"time":"%(asctime)s", "name":"%(name)s", "level":"%(levelname)s", "msg":"%(message)s"}' - ) - handler.setFormatter(formatter) - logger.addHandler(handler) - return logger -
- -

Docker Configuration (ML Specific)

-
-# Dockerfile for ML Service -FROM python:3.10-slim - -WORKDIR /app +async def predict(req: PredictRequest): + X = np.array(req.features).reshape(1, -1) + pred = model.predict(X) + return {"prediction": pred.tolist()} + +@app.get("/health") +async def health(): + return {"status": "healthy"}
+ +

4. Dockerfile for ML

+
# Multi-stage build +FROM python:3.11-slim as builder COPY requirements.txt . - -# Install dependencies without cache RUN pip install --no-cache-dir -r requirements.txt -COPY . . - -# Expose port and run server -EXPOSE 8000 -CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"] -
-
- `, - interview: ` +FROM python:3.11-slim +COPY --from=builder /usr/local/lib/python3.11 /usr/local/lib/python3.11 +COPY src/ /app/src/ +COPY models/ /app/models/ +WORKDIR /app +CMD ["uvicorn", "src.api:app", "--host", "0.0.0.0"]
+ `, + interview: `
-

🎯 Production Interview Questions

- -
- Q1: Why use FastAPI over Flask for ML models? -

Answer: (1) Native async support (handles concurrent requests better), (2) Automatically generates Swagger UI for testing, (3) Pydantic integration for data validation, (4) Significantly higher throughput (close to Go/Node.js levels), (5) Built-in support for WebSockets and background tasks.

-
- -
- Q2: How do you handle model versioning in production? -

Answer: (1) URL versioning (/v1/predict), (2) Model registry (MLflow/SageMaker) with aliases like production or staging, (3) Blue-green deployment — route traffic to the new version only after validation, (4) Embed the model version in the API response metadata for debugging.

-
- -
- Q3: What is "Dependency Hell" and how do you solve it? -

Answer: It occurs when multiple libraries require conflicting versions of the same dependency. Solved by: (1) Using virtual environments (venv/conda), (2) pinning exact versions in requirements.txt or poetry.lock, (3) Docker to isolate the entire OS environment.

-
- -
- Q4: How do you log inputs/outputs without violating privacy? -

Answer: (1) PII Masking: remove names/emails/IDs before logging, (2) Hash sensitive fields if they are needed for troubleshooting, (3) Separate logging of model metadata from raw data, (4) Use specialized monitoring tools like Arize or Whylogs for drift detection without full data capture.

-
- -
- Q5: What's the role of CI/CD in Machine Learning? -

Answer: Beyond standard code tests, ML CI/CD (MLOps) includes Data Validation (is the incoming data schema correct?), Model Validation (is accuracy >= 90%?), and automated deployment to staging for human-in-the-loop review.

-
-
- ` - }, - "optimization": { - concepts: ` +

🎯 Production Python Interview Questions

+
Q1: How do you test ML code?

Answer: (1) Unit tests: data transformations, feature engineering functions. (2) Integration tests: full pipeline end-to-end. (3) Model tests: output shape, range, determinism with seeds. (4) Data tests: schema validation, distribution checks. Use pytest fixtures for reusable test data.
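A minimal, dependency-free sketch of point (3), the model-quality gate — the threshold, labels, and predictions below are stand-ins, not from any real pipeline:

```python
def accuracy(y_true, y_pred):
    # Fraction of matching labels — enough for a deployment-gate sketch.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def test_model_meets_threshold():
    # Model-quality gate: fail CI if the candidate model drops below baseline.
    y_true = [0, 1, 1, 0, 1]
    y_pred = [0, 1, 1, 0, 0]   # stand-in predictions from the candidate model
    assert accuracy(y_true, y_pred) >= 0.75

test_model_meets_threshold()  # pytest would collect and run this automatically
```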

+
Q2: print() vs logging — why?

Answer: Logging: configurable levels, file output, structured format, negligible cost when disabled (the level check short-circuits before any formatting), thread-safe. Print: none of these. Production code must use logging for observability and debugging.

+
Q3: How to serve an ML model in production?

Answer: FastAPI/Flask for REST API. Docker for containerization. Load model at startup (not per request). Add health checks, input validation, error handling, logging, metrics. Use async for high throughput. Consider model registries (MLflow) for versioning.

+
Q4: What goes in pyproject.toml?

Answer: Project metadata, dependencies, build system, tool configs (pytest, mypy, ruff). Replaced setup.py/setup.cfg. Pin dependency versions for reproducibility. Use [project.optional-dependencies] for dev/test extras.
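A minimal sketch of such a file (the package name, versions, and tool settings are illustrative only):

```toml
[project]
name = "ml-pipeline"            # hypothetical package name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = [
    "numpy==1.26.4",            # pinned for reproducibility
]

[project.optional-dependencies]
dev = ["pytest", "pytest-cov", "mypy", "ruff"]

[build-system]
requires = ["setuptools>=68"]
build-backend = "setuptools.build_meta"

[tool.pytest.ini_options]
testpaths = ["tests"]

[tool.ruff]
line-length = 100
```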

+
Q5: How to manage ML experiment configs?

Answer: Hydra: YAML configs with CLI overrides, multi-run sweeps. Store configs in version control. Never hardcode hyperparameters. Use config groups for model/data/training combos.

+
Q6: What is CI/CD for ML?

Answer: Automate: lint → type-check → test → build → deploy. Add model validation gate: new model must beat baseline on test metrics. Use GitHub Actions. Include data validation (Great Expectations) in pipeline.

+ ` +}, + +"optimization": { + concepts: `
-

Python High Performance & Optimization

+

⚡ Performance & Optimization — Complete Guide

-

🧠 The GIL (Global Interpreter Lock) — Deep Dive

-
⚡ The Bottleneck of Python
-
- The GIL is a mutex that protects access to Python objects, preventing multiple native threads from executing Python bytecodes at once. Critical for DS: NumPy and Pandas release the GIL during C-level computations. Therefore, vectorized code IS truly parallel across CPU cores even with the GIL. -
+
⚡ The Optimization Hierarchy
+
1. Algorithm (O(n²) → O(n log n)) > 2. Data structures (list → set for lookups) > 3. Vectorization (NumPy) > 4. Compilation (Numba/Cython) > 5. Parallelization (multiprocessing/Dask) > 6. Hardware (GPU). Always start from the top.
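Step 2 of the hierarchy in miniature — the same membership query against a list and a set (sizes arbitrary; a sketch, not a benchmark):

```python
items = list(range(100_000))
as_set = set(items)          # one-time O(n) conversion

# Each `in` on the list scans linearly (O(n)); the set hashes once (O(1) average).
assert 99_999 in items and 99_999 in as_set
assert -1 not in items and -1 not in as_set
```

For code that does many membership tests, this one-line data-structure swap routinely beats any amount of micro-tuning further down the hierarchy.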
-

Profiling: Finding the Real Bottleneck

-

Never optimize without measuring. (1) cProfile: for function-level timing, (2) line_profiler: for line-by-line analysis in "hot" functions, (3) memory_profiler: to detect memory leaks and peak usage, (4) Py-Spy: a sampling profiler for zero-instrumentation production profiling.

- -

Numba — JIT Compilation for NumPy

-

Numba translates a subset of Python and NumPy code into fast machine code using LLVM. By simply adding @njit, you can achieve C/Fortran-like speeds for math-heavy loops that cannot be vectorized with pure NumPy.

- -

Concurrency Models in Python

+

1. Profiling — Measure Before Optimizing

- - - - + + + + + + +
ModelBest for...Mechanism
ThreadingI/O-bound (APIs, DBs)Concurrent but not parallel (GIL)
MultiprocessingCPU-bound (Training, Math)True parallelism (separate OS processes)
asyncioHigh-concurrency I/OSingle-threaded cooperative multitasking
ToolTypeWhen to UseOverhead
cProfileFunction-levelFind slow functions~2x slowdown
line_profilerLine-by-lineFind slow lines in a functionHigher
Py-SpySampling profilerProduction profilingNear zero
tracemallocMemory allocationFind memory leaksLow
memory_profilerLine-by-line memoryFind memory-heavy linesHigh
scaleneCPU + Memory + GPUComprehensive profilingLow
-

Vectorization & SIMD

-

Single Instruction, Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data points in one clock cycle. Modern NumPy leverages AVX-512 and MKL/OpenBLAS to ensure your a + b is as fast as the hardware allows.

- -

Cython — When All Else Fails

-

Cython is a superset of Python that compiles to C. It allows you to call C functions directly and use static typing. Use it for complex algorithms that require low-level memory control (e.g., custom tree models or graph algorithms).

-
- `, - code: ` -
-

💻 Performance & Optimization Code Examples

- -

Numba — JIT Speedups

-
-from numba import njit -import numpy as np +

2. The GIL and Parallelism

+

The GIL prevents true parallel multi-threading for CPU-bound Python code. But NumPy, Pandas, and scikit-learn release the GIL during C-level operations. Solutions for parallelism:

+ + + + + + + +
ToolBest ForHow
threadingI/O-bound (API calls, disk)GIL released during I/O waits
multiprocessingCPU-bound PythonSeparate processes, separate GIL
concurrent.futuresSimple parallel patternsThreadPool/ProcessPool executors
asyncioMany I/O operationsEvent loop, cooperative multitasking
joblibsklearn paralleln_jobs parameter
-# This loop is 100x slower in pure Python -@njit(parallel=True) -def monte_carlo_pi(nsamples): - acc = 0 - for i in range(nsamples): -x = np.random.random() -y = np.random.random() -if x**2 + y**2 < 1.0: - acc += 1 - return 4.0 * acc / nsamples -
+

3. Numba — JIT Compilation

+

@numba.jit(nopython=True) compiles Python functions to machine code. Supports NumPy arrays and most math operations. 10-100x speedup for loops that can't be vectorized. @numba.vectorize creates custom ufuncs. @numba.cuda.jit runs on GPU.

-

Multiprocessing for Data Prep

-
-from multiprocessing import Pool +

4. Cython — C-Level Performance

+

Compiles Python to C extension modules. Add type declarations for massive speedups. Best for: tight loops, calling C libraries, CPython extensions. More setup than Numba but more control.

-def heavy_image_prep(file_path): - # Complex transform logic here - return processed_img +

5. Dask — Parallel Computing

+

Pandas-like API for datasets larger than memory. Key abstractions: dask.dataframe (parallel Pandas), dask.array (parallel NumPy), dask.delayed (custom parallelism). Uses a task scheduler to execute lazily. Scales from laptop to cluster.

-# Use all available cores -if __name__ == '__main__': - with Pool() as p: -results = p.map(heavy_image_prep, all_files) -
+

6. Ray — Distributed ML

+

General-purpose distributed framework. Ray Tune for hyperparameter tuning, Ray Serve for model serving, Ray Data for data processing. Easier than Dask for ML-specific workloads. Used by OpenAI, Uber, Ant Group.

-

Memory Optimization with __slots__

-
-class Observation: - # Prevents creation of __dict__, saving significant RAM - __slots__ = ('timestamp', 'value', 'sensor_id') - - def __init__(self, ts, val, sid): -self.timestamp = ts -self.value = val -self.sensor_id = sid +

7. Memory Optimization

+ -# 1 million instances: ~60MB vs ~160MB without __slots__ -data = [Observation(i, i*1.1, 'S1') for i in range(1000000)] -
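One concrete memory lever, sketched with stdlib only (exact byte counts vary by Python version): generators hold O(1) state, while list comprehensions materialize every element.

```python
import sys

big_list = [i * i for i in range(100_000)]   # materializes all 100k ints
lazy_gen = (i * i for i in range(100_000))   # constant-size generator object

assert sys.getsizeof(lazy_gen) < 1_000       # a few hundred bytes
assert sys.getsizeof(big_list) > 100_000     # hundreds of KB for pointers alone
assert sum(lazy_gen) == sum(big_list)        # identical result either way
```

Prefer generators for one-pass streaming; keep lists when you need random access or multiple passes.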
-
- `, - interview: ` +

8. Python 3.12-3.13 Performance

+

3.12: Faster interpreter (5-15% overall), better error messages, per-interpreter GIL (experimental). 3.13: Free-threaded CPython (no-GIL mode experimental), JIT compiler (experimental). The future of Python performance is exciting.

+ `, + code: ` +
+

💻 Performance Code Examples

+ +

1. Profiling

+
import cProfile +import pstats + +# Profile a function +with cProfile.Profile() as pr: + result = expensive_function(data) + +stats = pstats.Stats(pr) +stats.sort_stats('cumulative') +stats.print_stats(10) # Top 10 functions + +# Memory profiling +import tracemalloc +tracemalloc.start() +# ... do work ... +snapshot = tracemalloc.take_snapshot() +for stat in snapshot.statistics('filename')[:5]: + print(stat)
+ +

2. Numba JIT — Vectorization Impossible

+
import numba
+import numpy as np
+
+@numba.jit(nopython=True)
+def pairwise_distance(X):
+    n = X.shape[0]
+    D = np.zeros((n, n))  # zeros, not empty — the diagonal is never written below
+    for i in range(n):
+        for j in range(i+1, n):
+            d = 0.0
+            for k in range(X.shape[1]):
+                d += (X[i,k] - X[j,k]) ** 2
+            D[i,j] = D[j,i] = d ** 0.5
+    return D
+# 100x faster than pure Python loops!
+ +

3. Dask for Large Data

+
import dask.dataframe as dd + +# Read 100GB of CSV files — lazy! +ddf = dd.read_csv('data/*.csv') + +# Same Pandas API — but parallel +result = ( + ddf.groupby('category') + .agg({'revenue': 'sum', 'quantity': 'mean'}) + .compute() # Only here does execution happen +)
+ +

4. concurrent.futures — Simple Parallelism

+
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor + +# CPU-bound: use ProcessPool +with ProcessPoolExecutor(max_workers=8) as executor: + results = list(executor.map(process_chunk, chunks)) + +# I/O-bound: use ThreadPool +with ThreadPoolExecutor(max_workers=32) as executor: + results = list(executor.map(fetch_url, urls))
+ +

5. __slots__ for Memory

+
class Point: + __slots__ = ('x', 'y', 'z') + def __init__(self, x, y, z): + self.x = x + self.y = y + self.z = z +# 1M instances: ~60MB vs ~160MB without __slots__
+
`, + interview: `

🎯 Performance Interview Questions

- -
- Q1: Why does Python have a GIL? -

Answer: It simplifies implementation by making the memory management (reference counting) thread-safe without needing granular locks. It also makes single-threaded code faster and C-extension integration easier. Removing it is difficult because it effectively requires a rewrite of the interpreter (see: "no-gil" Python 3.13 proposal).

-
- -
- Q2: How do you optimize a function with a nested loop? -

Answer: (1) Vectorize with NumPy (broadcast), (2) If logic is too complex for NumPy, use Numba JIT, (3) Use Cython if you need C-level types, (4) Use multiprocessing if the iterations are independent and CPU-bound.

-
- -
- Q3: Explain the "cProfile" overhead. -

Answer: cProfile is a deterministic profiler; it hooks into every function call. While very accurate, it adds significant overhead (sometimes 2x slowdown). For production systems, "Sampling Profilers" (like Py-Spy) are better as they only inspect the stack every few milliseconds, adding negligible overhead.

-
- -
- Q4: When is Threading faster than Multiprocessing? -

Answer: For I/O-bound tasks (Network/Disk). Threading has much lower overhead (shared memory) compared to Multiprocessing (separate memory spaces, requires serialization/pickling of data between processes). For downloading 1000 images, threads are superior.

-
- -
- Q5: What is "Cache Locality" and how does NumPy help? -

Answer: CPUs are fastest when accessing contiguous memory (Spatial Locality). NumPy's C-contiguous arrays ensure that when one value is loaded into the CPU cache, the next values are also loaded, minimizing "Cache Misses" compared to Python lists of scattered objects.

-
-
- ` - } +
Q1: Why does Python have a GIL?

Answer: Simplifies reference counting (thread-safe without granular locks). Makes single-threaded code faster. Makes C extension integration easier. Python 3.13 has experimental free-threaded mode (no-GIL).

+
Q2: How to optimize a nested loop?

Answer: (1) Vectorize with NumPy (broadcast). (2) If too complex, use Numba JIT. (3) Cython for C-level types. (4) multiprocessing if iterations are independent.
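Point (1) sketched on toy-sized data: the nested loop below computes pairwise absolute differences; broadcasting an (n, 1) view against an (n,) row pushes the same double loop into C.

```python
import numpy as np

x = np.arange(5, dtype=np.float64)

# Nested-loop version: O(n^2) interpreted iterations.
slow = [[abs(a - b) for b in x] for a in x]

# Vectorized version: broadcasting does the same double loop in C.
fast = np.abs(x[:, None] - x[None, :])

assert fast.shape == (5, 5)
assert (fast == np.array(slow)).all()
```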

+
Q3: Threading vs Multiprocessing?

Answer: Threading: I/O-bound (shared memory, low overhead). Multiprocessing: CPU-bound (separate memory, bypasses GIL). For downloading 1000 images → threads. For computing 1000 matrix operations → processes.

+
Q4: What is Numba?

Answer: JIT compiler that translates Python/NumPy to machine code using LLVM. @jit(nopython=True) for 10-100x speedup. Works best with: NumPy arrays, math operations, loops. Doesn't support: Pandas, string manipulation, most Python objects.

+
Q5: How to profile Python code?

Answer: cProfile: function-level (find slow functions). line_profiler: line-by-line. Py-Spy: sampling (production-safe). tracemalloc: memory. scalene: CPU+memory+GPU all-in-one. Always profile before optimizing.

+
Q6: Dask vs Ray vs Spark?

Answer: Dask: familiar Pandas/NumPy API, Python-native, scales well. Ray: ML-focused (tune, serve), lower-level control. Spark: JVM-based, best for very large (TB+) data, enterprise. For Python ML: Dask or Ray. For big data ETL: Spark.

+ ` +} }; - -// Render dashboard cards function renderDashboard() { const grid = document.getElementById('modulesGrid'); grid.innerHTML = modules.map(module => `