diff --git "a/Python/app.js" "b/Python/app.js" --- "a/Python/app.js" +++ "b/Python/app.js" @@ -71,12 +71,19 @@ const modules = [ } ]; + const MODULE_CONTENT = { "python-fundamentals": { concepts: `
| Type | Mutable | Ordered | Hashable | Use Case |
|---|---|---|---|---|
| list | ✓ | ✓ | ✗ | Sequential data, time series, feature lists |
| set | ✓ | ✗ | ✗ | Unique values, membership testing O(1) |
| frozenset | ✗ | ✗ | ✓ | Immutable set, usable as dict keys |
| deque | ✓ | ✓ | ✗ | O(1) append/pop both ends, sliding windows |
| bytes | ✗ | ✓ | ✓ | Binary data, serialization, network I/O |
| bytearray | ✓ | ✓ | ✗ | Mutable binary buffers |
Names vs Objects: with a = [1, 2, 3], the list lives on the heap; a is a name that points to it. b = a makes both names point to the same list — no copy is made. This is called aliasing.
Reference Counting: Python uses reference counting plus a cyclic garbage collector. Each object tracks how many names point to it; when the count hits 0, the memory is freed immediately. This is why del doesn't always free memory — it just decrements the reference count.
Integer Interning: Python caches integers from -5 to 256 and short strings. So a = 100; b = 100; a is b is True, but a = 1000; b = 1000; a is b may be False. Never use is for value comparison — always use ==.
Garbage Collection Generations: CPython has 3 generations (gen0, gen1, gen2). New objects start in gen0. Objects that survive a collection move to the next generation. Long-lived objects (gen2) are collected less frequently. Use gc.get_stats() to monitor.
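A runnable sketch of the aliasing, interning, and GC behavior above (CPython-specific; exact interning rules vary by version):

```python
import gc

a = [1, 2, 3]
b = a                      # aliasing: b is another name for the same list
assert b is a
b.append(4)
assert a == [1, 2, 3, 4]   # mutation through b is visible through a

# Small integers (-5..256) are cached, so equal values share an object
x, y = 100, 100
assert x is y
# For larger ints the result of `is` is implementation-defined — use == for values
assert 1000 == 1000

# gc.collect() forces a full collection and returns the number of
# unreachable objects it found
collected = gc.collect()
assert collected >= 0
```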
Generators yield values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items ≈ 8 GB of RAM; a generator over the same range ≈ 100 bytes. The Iterator Protocol: any object with __iter__ and __next__ methods — generators are syntactic sugar for iterators. yield vs return: return terminates the function; yield suspends it, saving the entire stack frame (local variables, instruction pointer). The next next() call resumes from where it left off.
yield from: Delegates to a sub-generator. yield from iterable is equivalent to for item in iterable: yield item but also forwards send() and throw() calls.
Generator Expressions: (x**2 for x in range(10**9)) — uses O(1) memory. List comprehension [x**2 for x in range(10**9)] — tries to allocate ~8GB. Always prefer generator expressions for large data.
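A minimal sketch of the memory difference and a typical batching generator (`read_in_batches` is an illustrative helper, not a library function):

```python
import sys

squares_list = [x * x for x in range(10_000)]
squares_gen = (x * x for x in range(10_000))

# The generator is a small fixed-size object; the list holds every element
assert sys.getsizeof(squares_gen) < sys.getsizeof(squares_list)

def read_in_batches(items, batch_size):
    """Yield fixed-size batches lazily — O(batch_size) memory."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                  # final partial batch
        yield batch

batches = list(read_in_batches(range(7), 3))
assert batches == [[0, 1, 2], [3, 4, 5], [6]]
```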
Functions in Python are first-class objects — they can be passed as arguments, returned from other functions, and assigned to variables. A closure is a function that captures variables from its enclosing scope. This is the foundation of decorators, callbacks, and functional programming in Python.
def append_to(element, target=[]): — this default list is shared across ALL calls! Default arguments are evaluated ONCE at function definition time, not at call time. Fix: use target=None, then if target is None: target = [].
Late Binding Closures: [lambda: i for i in range(5)] — all five lambdas return 4! Variables in closures are looked up at call time, not definition time. Fix: bind at definition time with [lambda i=i: i for i in range(5)].
Tuple Assignment Gotcha: a = ([1,2],); a[0] += [3] raises TypeError AND modifies the list! The += first mutates the list in-place (succeeds), then tries to reassign the tuple element (fails).
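All three gotchas above, demonstrated in one runnable sketch (function names are illustrative):

```python
# 1. Mutable default argument — shared across calls — and the idiomatic fix
def append_bad(element, target=[]):     # one list created at def time!
    target.append(element)
    return target

def append_good(element, target=None):  # fix: sentinel default
    if target is None:
        target = []
    target.append(element)
    return target

append_bad(1)
assert append_bad(2) == [1, 2]          # state leaked from the first call
append_good(1)
assert append_good(2) == [2]            # fresh list each call

# 2. Late-binding closures and the default-argument fix
late = [lambda: i for i in range(5)]
bound = [lambda i=i: i for i in range(5)]
assert [f() for f in late] == [4, 4, 4, 4, 4]
assert [f() for f in bound] == [0, 1, 2, 3, 4]

# 3. Tuple += gotcha: raises TypeError AND mutates the inner list
t = ([1, 2],)
try:
    t[0] += [3]
except TypeError:
    pass
assert t[0] == [1, 2, 3]                # in-place extend succeeded anyway
```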
| Class | Purpose | Why It Matters in DS |
|---|---|---|
| defaultdict | Dict with default factory | Group data without KeyError: defaultdict(list) |
| Counter | Count hashable objects | Label distribution: Counter(y_train) |
| namedtuple | Lightweight immutable class | Return multiple values with names, not indices |
| OrderedDict | Dict remembering insertion order | Legacy (dicts are ordered in 3.7+), but useful for move_to_end() |
| deque | Double-ended queue | Sliding window computations, BFS algorithms |
| ChainMap | Stack multiple dicts | Layer config: defaults → env → CLI overrides |
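The table above in action — one short, runnable example per class (data values are illustrative):

```python
from collections import ChainMap, Counter, defaultdict, deque, namedtuple

# defaultdict: group records without KeyError checks
by_label = defaultdict(list)
for label, value in [("a", 1), ("b", 2), ("a", 3)]:
    by_label[label].append(value)
assert dict(by_label) == {"a": [1, 3], "b": [2]}

# Counter: label distribution in one line
assert Counter(["cat", "dog", "cat"]) == {"cat": 2, "dog": 1}

# namedtuple: named return values instead of magic indices
Point = namedtuple("Point", ["lat", "lon"])
p = Point(52.5, 13.4)
assert p.lat == 52.5

# deque(maxlen=N): a size-3 sliding window with O(1) appends
window = deque(maxlen=3)
for x in range(5):
    window.append(x)
assert list(window) == [2, 3, 4]

# ChainMap: layered config — the first mapping that has a key wins
config = ChainMap({"lr": 0.1}, {"lr": 0.01, "epochs": 10})
assert config["lr"] == 0.1 and config["epochs"] == 10
```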
itertools functions return iterators, not lists. They consume O(1) memory regardless of input size. This matters when processing millions of records.
| Function | What It Does | DS Use Case |
|---|---|---|
| chain() | Concatenate iterables | Merge multiple data files lazily |
| product() | Cartesian product | Generate hyperparameter grid |
| combinations() | All r-length combos | Feature interaction pairs |
| starmap() | map() with unpacked args | Apply function to paired data |
| accumulate() | Running total/custom accumulator | Cumulative sums, running max |
| tee() | Clone an iterator N times | Multiple passes over data stream |
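A few of the table's entries as a runnable sketch (the hyperparameter names are hypothetical):

```python
from itertools import accumulate, chain, product

# chain: lazily merge multiple sources
assert list(chain([1, 2], [3], [4, 5])) == [1, 2, 3, 4, 5]

# product: hyperparameter grid — every (lr, batch_size) pair
grid = list(product([0.01, 0.1], [32, 64]))
assert grid == [(0.01, 32), (0.01, 64), (0.1, 32), (0.1, 64)]

# accumulate: running total by default, or a custom accumulator like max
assert list(accumulate([1, 2, 3, 4])) == [1, 3, 6, 10]
assert list(accumulate([3, 1, 4, 1], max)) == [3, 3, 4, 4]
```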
Stop using os.path.join(). Use pathlib.Path — it's object-oriented, cross-platform, and reads like English:
Path('data') / 'train' / 'images' → builds paths with the / operator
path.glob('*.csv') → find all CSV files
path.stem, path.suffix, path.parent → parse without regex
path.read_text() / path.write_text() → no need for open()
f-strings (3.6+) are the fastest formatting method. They support expressions: f"{accuracy:.2%}" → "95.23%", and f"{x=}" (3.8+) → "x=42" for debugging. Interning: Python interns string literals and identifiers, so 'hello' is 'hello' is True — both names point to the same interned object.
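A quick runnable sketch of the pathlib and f-string features ("data/train/images" is a hypothetical path — nothing is read from disk):

```python
from pathlib import Path

# Build paths with / — works the same on Windows and POSIX
p = Path("data") / "train" / "images" / "cat.png"
assert p.name == "cat.png"
assert p.stem == "cat" and p.suffix == ".png"
assert p.parent.name == "images"

# f-strings evaluate arbitrary expressions and apply format specs
accuracy = 0.9523
assert f"{accuracy:.2%}" == "95.23%"
x = 42
assert f"{x=}" == "x=42"       # 3.8+ debug syntax
```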
except: catches SystemExit and KeyboardInterrupt. Always catch specific exceptions. In DS pipelines, catch ValueError (bad data), FileNotFoundError (missing files), KeyError (missing columns).
LBYL vs EAFP: Python prefers "Easier to Ask Forgiveness than Permission" (EAFP). Use try/except instead of checking conditions first. It's faster when exceptions are rare (which they usually are).
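An EAFP-style sketch with the specific exceptions named above (`parse_price` is a hypothetical helper):

```python
def parse_price(record):
    """EAFP: attempt the conversion, handle the specific failures."""
    try:
        return float(record["price"])
    except KeyError:           # missing column
        return None
    except ValueError:         # bad data, e.g. "N/A"
        return None

assert parse_price({"price": "19.99"}) == 19.99
assert parse_price({}) is None
assert parse_price({"price": "N/A"}) is None
```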
| Tool | Best For | Create | Key Feature |
|---|---|---|---|
| venv | Simple projects | python -m venv env | Built-in, lightweight |
| conda | DS/ML (C dependencies) | conda create -n myenv python=3.11 | Handles non-Python deps (CUDA, MKL) |
| poetry | Modern packaging | poetry init | Lock files, deterministic builds |
| uv | Speed (Rust-based) | uv venv | 10-100x faster than pip |
Answer: Lists are mutable, tuples immutable. But the deeper answer: tuples are hashable (can be dict keys), use less memory (no over-allocation), and signal intent ("this shouldn't change"). Use tuples for (lat, lon) pairs, function return values, dict keys for caching. Use lists for feature collections that grow.
Answer: The GIL prevents true multi-threading for CPU-bound tasks. But here's what most people miss: NumPy, Pandas, and scikit-learn release the GIL during C-level computations. So vectorized operations ARE parallel at the C level. For pure Python CPU work, use multiprocessing. For I/O (API calls, file reads), threading works fine because the GIL is released during I/O waits.
What's the difference between is and ==, and why does it matter?
Answer: == checks value equality (__eq__). is checks identity (same memory address). Python interns small integers (-5 to 256) and some strings, so 300 is 300 may be False. Always use == for values. Only use is for None checks: if x is None.
Answer: 5 strategies, from simplest to most powerful: (1) pd.read_csv(chunksize=50000) — process in batches, (2) usecols=['needed_cols'] — load only what you need, (3) dtype={'col': 'int32'} — use smaller types, (4) Dask — lazy Pandas-like API, (5) DuckDB — SQL on CSV files with zero memory overhead.
Answer: Dict: O(1) average via hash tables (Python's dict uses open addressing). List: O(n) linear scan. Internally, dict hashes the key to compute a slot index, then handles collisions via probing. Sets use the same mechanism. This is why x in my_set is fast but x in my_list is slow.
Answer: copy.copy() copies outer container but shares inner objects. copy.deepcopy() recursively copies everything. Real scenario: You have a list of dicts (config per experiment). Shallow copy means modifying one experiment's config changes all of them. Deep copy gives independent configs. Pandas .copy() is deep by default — but df2 = df is NOT a copy at all.
Answer: defaultdict(factory) auto-creates default values for missing keys. Use defaultdict(list) to group items without if key not in dict checks. Use defaultdict(int) to count. It's cleaner and ~20% faster than dict.setdefault() for grouping operations in data processing.
Answer: Generators yield values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items = ~8GB RAM. A generator of 1 billion items = ~100 bytes. Critical for: reading large files, streaming data, batch training. yield from delegates to sub-generators.
Answer: list(dict.fromkeys(my_list)) — uses dict's insertion-order guarantee (3.7+), runs in O(n). Old approach: seen = set(); [x for x in lst if not (x in seen or seen.add(x))]. For DataFrames: df.drop_duplicates(subset=['key_col']).
Answer: Two mechanisms: (1) Reference counting — each object has a count; freed when count hits 0. Immediate cleanup. (2) Cyclic garbage collector — detects reference cycles (A → B → A) that refcount can't handle. Runs periodically on generations (gen0, gen1, gen2). You can force it with gc.collect() — useful after deleting large ML models.
What's the difference between __str__ and __repr__?
Answer: __str__ is for end users (readable); __repr__ is for developers (unambiguous, ideally eval-able). If you define only one, implement __repr__ — Python falls back to it for str() too. In ML: __repr__ should show model params, e.g. LinearRegression(lr=0.01, reg=l2).
How do *args and **kwargs help in ML code?
Answer: They enable flexible function signatures. *args: variable positional args (multiple datasets). **kwargs: variable keyword args (hyperparameters). Essential for: wrapper functions, decorators, scikit-learn's set_params(**params), and model.fit(X, y, **fit_params).
Answer: f-strings (3.6+) are fastest, most readable formatting. They support expressions: f"{accuracy:.2%}" → "95.23%", f"{x=}" (3.8+) → "x=42" for debugging. .format() is slower and more verbose. % formatting is legacy C-style. Always use f-strings in modern Python.
Answer: Python resolves names in order: Local → Enclosing function → Global → Built-in. This is why you can accidentally shadow built-ins: list = [1,2] breaks list(). Use nonlocal to modify enclosing scope, global for module scope (but avoid globals in production code).
What's the difference between append() and extend()?
Answer: append(x) adds x as a single element. extend(iterable) unpacks and adds each element. [1,2].append([3,4]) → [1,2,[3,4]]. [1,2].extend([3,4]) → [1,2,3,4]. Use extend() when merging feature lists; append() when adding one item to results.
Answer: def f(x, lst=[]): — the default list is created ONCE at function definition and shared across all calls. So f(1); f(2) gives [1, 2] not [2]. Fix: use lst=None then if lst is None: lst = []. This is the #1 Python gotcha in interviews.
What is __slots__ and when should you use it?
Answer: By default, Python objects store attributes in a __dict__ (one dict per instance). __slots__ replaces this with a fixed-size array, saving roughly 40% memory per instance. Use it when creating millions of small objects (data points, nodes). Trade-off: you can't add attributes dynamically.
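A small sketch of the __slots__ trade-off described above (class names are illustrative; exact byte counts vary by Python version):

```python
import sys

class PointDict:                      # normal class: per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointSlots:                     # slotted class: fixed attribute array
    __slots__ = ("x", "y")
    def __init__(self, x, y):
        self.x, self.y = x, y

d, s = PointDict(1, 2), PointSlots(1, 2)
assert hasattr(d, "__dict__")
assert not hasattr(s, "__dict__")     # no per-instance dict at all

# The slotted instance avoids the dict's overhead entirely
assert sys.getsizeof(s) <= sys.getsizeof(d) + sys.getsizeof(d.__dict__)

# Trade-off: dynamic attributes raise AttributeError
try:
    s.z = 3
except AttributeError:
    pass
else:
    raise AssertionError("expected AttributeError")
```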
| Feature | Python List | NumPy ndarray |
|---|---|---|
| Storage | Array of pointers to objects scattered in memory | Contiguous block of raw typed data |
| Type | Each element can be different type | Homogeneous — all elements same dtype |
| Operations | Python loop (bytecode interpretation) | Compiled C/Fortran loops |
| Memory | ~28 bytes per int + pointer overhead | 8 bytes per int64 (no overhead) |
| SIMD | Not possible | Uses CPU vector instructions (SSE/AVX) |
Memory Layout: C order stores rows contiguously (arr[0,0], arr[0,1], arr[0,2], arr[1,0], ...); Fortran order stores columns contiguously (arr[0,0], arr[1,0], arr[2,0], arr[0,1], ...). Every ndarray has a strides tuple — the bytes to jump in each dimension. For a (3,4) float64 array, strides = (32, 8): jump 32 bytes for the next row, 8 bytes for the next column. Slicing creates views (no copy) by adjusting strides — arr[::2] doubles the row stride.
Ufuncs are vectorized functions that operate element-wise. They support: .reduce() (fold along axis), .accumulate() (running total), .outer() (outer product), .at() (unbuffered in-place). Example: np.add.reduce(arr) = arr.sum() but works with custom ufuncs too.
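The strides, view-vs-copy, and ufunc-method claims above, verified in a short sketch:

```python
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)
# (3,4) float64: 32 bytes to the next row, 8 to the next column
assert arr.strides == (32, 8)

view = arr[::2]                    # slicing: a view with doubled row stride
assert view.strides == (64, 8)
assert np.shares_memory(arr, view)
view[0, 0] = 99.0
assert arr[0, 0] == 99.0           # modifying the view modifies the original

fancy = arr[[0, 2]]                # fancy indexing: always a copy
assert not np.shares_memory(arr, fancy)

# Ufunc methods: reduce folds, accumulate keeps the running result
assert np.add.reduce(np.array([1, 2, 3])) == 6
assert list(np.add.accumulate(np.array([1, 2, 3]))) == [1, 3, 6]
```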
| dtype | Bytes | Range | When to Use |
|---|---|---|---|
| float64 | 8 | ±1.8e308 | Default. Scientific computing, high-precision stats |
| float32 | 4 | ±3.4e38 | Deep learning (GPU prefers this), 50% less memory |
| float16 | 2 | ±65504 | Mixed-precision training, inference |
| int32 | 4 | ±2.1 billion | Indices, counts, most integer data |
| bool | 1 | True/False | Masks for filtering |
| category (Pandas) | Varies | Finite set | Repeated strings → 90% memory savings |
np.einsum can express any tensor operation in one call: matrix multiply, trace, transpose, batch ops. Often faster than chaining NumPy functions because it avoids intermediate arrays.
np.linalg.inv(X.T @ X) @ X.T @ y → normal equation (linear regression)
X.T @ X → Gram matrix (basis of linear regression)
U, S, Vt = np.linalg.svd(X) → PCA, dimensionality reduction
eigenvals, eigenvecs = np.linalg.eigh(cov) → covariance eigenvectors
np.linalg.norm(X, axis=1) → L2 norms for distance computation
np.linalg.lstsq(X, y) → stable linear regression (preferred over inv)
np.random.default_rng(42) is the modern way (NumPy 1.17+). It uses the PCG64 algorithm — better statistical properties, thread-safe. The old np.random.seed(42) is global state and not thread-safe. Always use default_rng() in new code.
Answer: Three reasons: (1) Contiguous memory — CPU cache-friendly, no pointer chasing. (2) Compiled C loops — operations run in compiled C, not interpreted Python. (3) SIMD instructions — modern CPUs process 4-8 floats simultaneously (AVX). Together: 50-100x speedup.
Answer: Views share data (slicing creates views). Copies duplicate data. arr[::2] is a view — modifying it modifies the original. arr[[0,2,4]] (fancy indexing) is a copy. Views are fast and memory-efficient. Use np.shares_memory(a, b) to check. Always .copy() when you need independent data.
Answer: Compare shapes right-to-left. Dimensions must be equal or one must be 1. Example: (3,1) + (1,4) → (3,4). Each (3,1) row is "stretched" to match 4 columns. No memory is actually copied — NumPy adjusts strides internally. Gotcha: (3,) + (3,4) fails — need to reshape to (3,1) first.
Answer: axis=0 = operate down rows (column-wise). axis=1 = across columns (row-wise). Think: axis=0 collapses rows, axis=1 collapses columns. For (100,5) array: mean(axis=0) → shape (5,) — one mean per feature. mean(axis=1) → shape (100,) — one mean per sample.
Answer: (1) Center data: X_c = X - X.mean(axis=0), (2) Covariance: cov = X_c.T @ X_c / (n-1), (3) Eigendecomposition: vals, vecs = np.linalg.eigh(cov), (4) Sort by eigenvalue descending, (5) Project: X_pca = X_c @ vecs[:, -k:]. Alternatively use SVD directly: U, S, Vt = np.linalg.svd(X_c).
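The five PCA steps above as a runnable sketch, cross-checked against SVD on random data (the data itself is synthetic, generated just for the check):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))

# (1) Center, (2) covariance, (3) eigendecompose, (4) sort, (5) project
X_c = X - X.mean(axis=0)
cov = X_c.T @ X_c / (len(X) - 1)
vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
k = 2
X_pca = X_c @ vecs[:, -k:]              # last k columns = top-k components
assert X_pca.shape == (100, k)

# Cross-check with SVD: singular values relate to eigenvalues via S^2/(n-1)
U, S, Vt = np.linalg.svd(X_c, full_matrices=False)
var_eigh = vals[-k:].sum()              # top-k explained variance (eigh)
var_svd = (S**2 / (len(X) - 1))[:k].sum()
assert np.isclose(var_eigh, var_svd)
```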
Answer: np.dot: flattens for 1D, matrix multiply for 2D, but confusing for higher dims. @ (matmul): clean matrix multiply, broadcasts over batch dims. einsum: most flexible — express any contraction. Use @ for readability, einsum for complex ops. Avoid np.dot for 3D+ arrays.
Answer: np.isnan(arr) detects NaNs. np.nanmean(arr), np.nanstd(arr) — nan-safe aggregations. Replace: arr[np.isnan(arr)] = 0. Gotcha: np.nan == np.nan is False! NaN poisons comparisons. This is IEEE 754 standard.
Answer: Structured arrays have named fields with mixed dtypes: np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')]). Use when: (1) You need NumPy speed without Pandas overhead, (2) Interfacing with binary file formats (HDF5, FITS), (3) Processing millions of records where Pandas is too slow.
Answer: C-order stores rows contiguously; Fortran stores columns. Iterating along the last axis of C-order arrays is fastest because adjacent elements are in adjacent memory (cache-friendly). For column-heavy operations, Fortran order can be faster. NumPy defaults to C-order. np.asfortranarray() converts.
Answer: Three options in order of speed: (1) np.vectorize(func) — convenience wrapper, NOT actually vectorized (still Python loops), (2) Rewrite using broadcasting + boolean masks, (3) Use @numba.jit(nopython=True) for true compiled speed. Always prefer option 2 when possible.
Answer: np.random.seed(42): global state, not thread-safe. RandomState(42): isolated state, legacy. default_rng(42): modern (NumPy 1.17+), uses PCG64, thread-safe, better statistical properties. Always use default_rng() in new code.
Answer: Use the expansion: ||a-b||² = ||a||² + ||b||² - 2a·b. Code: dists = np.sum(X**2, axis=1)[:,None] + np.sum(X**2, axis=1)[None,:] - 2 * X @ X.T. This avoids the O(n²×d) explicit loop and leverages BLAS matrix multiply. scipy.spatial.distance.cdist wraps this.
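The expansion above as a runnable sketch, verified against a direct norm computation (synthetic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))

# ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a·b — one BLAS matmul, no Python loop
sq = np.sum(X**2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
d2 = np.maximum(d2, 0)                  # clip tiny negatives from round-off
dists = np.sqrt(d2)

# Spot-check one pair against the explicit definition
i, j = 3, 17
assert np.isclose(dists[i, j], np.linalg.norm(X[i] - X[j]))
assert dists.shape == (50, 50)
```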
| Feature | Series | DataFrame |
|---|---|---|
| Dimensions | 1D labeled array | 2D labeled table |
| Analogy | A column in a spreadsheet | The entire spreadsheet |
| Index | Single index | Row index + column index |
| Creation | pd.Series([1,2,3]) | pd.DataFrame({'a': [1,2]}) |
df.loc[0:5] includes row 5. df.iloc[0:5] excludes row 5. This trips up everyone.
When you chain indexing (df[df.x > 0]['y'] = 5), Pandas may create a temporary copy. Your assignment modifies the copy, not the original. Fix: always use .loc: df.loc[df.x > 0, 'y'] = 5. In Pandas 2.0+, Copy-on-Write mode eliminates this issue entirely.
GroupBy is the most powerful Pandas operation. It follows three steps: (1) Split data into groups, (2) Apply a function to each group independently, (3) Combine results. The key insight: GroupBy is lazy — no computation happens until you call an aggregation. Key methods: agg() (reduce), transform() (broadcast), filter() (keep/drop groups), apply() (flexible).
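A minimal split-apply-combine sketch showing agg, transform, and filter side by side (the dept/salary data is made up):

```python
import pandas as pd

df = pd.DataFrame({
    "dept": ["eng", "eng", "sales", "sales"],
    "salary": [100, 120, 80, 90],
})

# agg: reduce — one row per group
means = df.groupby("dept")["salary"].agg("mean")
assert means["eng"] == 110 and means["sales"] == 85

# transform: broadcast — same shape as the input, ideal for group features
df["dept_mean"] = df.groupby("dept")["salary"].transform("mean")
assert list(df["dept_mean"]) == [110.0, 110.0, 85.0, 85.0]

# filter: keep or drop whole groups by a predicate
big = df.groupby("dept").filter(lambda g: g["salary"].mean() > 100)
assert set(big["dept"]) == {"eng"}
```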
| Feature | Before (1.x) | After (2.0+) |
|---|---|---|
| Backend | NumPy only | Apache Arrow backend option |
| Copy semantics | Confusing | Copy-on-Write (explicit) |
| String dtype | object | string[pyarrow] (faster) |
| Nullable types | NaN for everything | pd.NA (proper null) |
| Index dtypes | int64 default | Matches data dtype |
| Feature | Pandas | Polars |
|---|---|---|
| Speed | 1x | 5-50x faster (Rust) |
| Memory | Higher | Lower (Arrow-native) |
| Parallelism | Single-threaded | Multi-threaded by default |
| API | Eager | Lazy + Eager |
| Ecosystem | Massive | Growing |
| When to use | EDA, legacy projects | Large data, production pipelines |
Fluent API style chains multiple operations — more readable, no intermediate variables. Use .assign() instead of df['col'] = ... for chainability, .pipe() for custom functions, and .query() for readable filtering.
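A short method-chaining sketch using .query(), .assign(), and .pipe() together (`add_abs` is a hypothetical custom step):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, -2, 3, -4], "group": ["a", "a", "b", "b"]})

def add_abs(d):
    """Custom step kept in the chain via .pipe()."""
    return d.assign(abs_x=d["x"].abs())

result = (
    df
    .query("x > -3")                        # readable filtering
    .assign(double=lambda d: d["x"] * 2)    # chainable column creation
    .pipe(add_abs)                          # custom function in the chain
    .sort_values("x")
    .reset_index(drop=True)
)
assert list(result["x"]) == [-2, 1, 3]
assert list(result["double"]) == [-4, 2, 6]
```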
| Strategy | Savings | When to Use |
|---|---|---|
| Category dtype | 90%+ | Columns with few unique strings (gender, country) |
| Downcast numerics | 50-75% | int64 → int32/int16 when range allows |
| Sparse arrays | 80%+ | Columns that are mostly zeros/NaN |
| PyArrow backend | 30-50% | String-heavy DataFrames |
| Read in chunks | N/A | Files larger than RAM |
.rolling(N) — fixed-size sliding window. .expanding() — cumulative from start. .ewm(span=N) — exponentially weighted. All support .mean(), .std(), .apply(func). Critical for time series feature engineering: lag features, moving averages, volatility.
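The three window types plus a lag feature, in one runnable sketch (the series values are illustrative):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Fixed sliding window: the first N-1 entries are NaN
roll = s.rolling(3).mean()
assert roll.isna().sum() == 2
assert roll.iloc[2] == 2.0 and roll.iloc[4] == 4.0

# Cumulative from the start
assert list(s.expanding().mean()) == [1.0, 1.5, 2.0, 2.5, 3.0]

# Exponentially weighted: recent points dominate
ewm = s.ewm(span=3).mean()
assert ewm.iloc[-1] > s.expanding().mean().iloc[-1]

# Lag feature — shift by one period, avoiding leakage from the future
lag1 = s.shift(1)
assert pd.isna(lag1.iloc[0]) and lag1.iloc[1] == 1.0
```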
Answer: Chained indexing (df[mask]['col'] = val) may modify a copy, not the original. Fix: use df.loc[mask, 'col'] = val. In Pandas 2.0+, enable Copy-on-Write: pd.options.mode.copy_on_write = True. This makes all indexing return views until modification, then copies automatically.
Answer: 5 strategies: (1) pd.read_csv(chunksize=50000) — process in batches, (2) usecols=['needed_cols'] — load only what you need, (3) dtype={'col': 'int32'} — use smaller types, (4) Dask — lazy Pandas-like API, (5) DuckDB — SQL on CSV files with zero memory overhead. Polars is also excellent for out-of-core processing.
Answer: merge(): SQL-style joins on columns (most flexible). join(): joins on index (convenience wrapper). concat(): stack DataFrames along axis (union/append). Use merge for column-based joins, concat for stacking rows/columns. join is just merge with index.
Answer: map(): Series only, element-wise. apply(): works on rows/columns of DataFrame or elements of Series. applymap(): element-wise on entire DataFrame (renamed to map() in Pandas 2.1). Performance tip: all three are slow — prefer vectorized operations whenever possible.
Answer: agg() reduces — returns one value per group (changes shape). transform() broadcasts — returns same shape as input. Example: df.groupby('dept')['salary'].transform('mean') fills every row with its department's average salary, while .agg('mean') returns one row per department.
Answer: Hierarchical indexing — multiple levels of row/column labels. Use for: pivot table results, panel data (entity + time), groupby with multiple keys. Access with .xs() or tuple slicing: df.loc[('A', 2023)]. Convert back with .reset_index().
Answer: Strategy depends on context: (1) dropna(thresh=N) — keep rows with at least N non-null values, (2) fillna(method='ffill') — forward fill for time series, (3) fillna(df.median()) — impute with median for ML, (4) interpolate(method='time') — time-weighted interpolation. Always check df.isna().sum() first.
Answer: Stores repeated strings as integer codes + lookup table. Use when a column has few unique values relative to total rows (e.g., 50 countries in 1M rows). Benefits: 90%+ memory savings, faster groupby. Gotcha: operations that create new values (like string concatenation) convert back to object dtype.
Answer: Pandas: best ecosystem, most tutorials, sufficient for <1GB. Polars: 10-100x faster, lazy evaluation, multi-threaded, no GIL issues — use for 1-100GB. DuckDB: SQL interface, out-of-core, great for analytical queries — use when SQL is more natural or data exceeds RAM.
Answer: df['lag_1'] = df['value'].shift(1) for lag features. df['rolling_mean_7'] = df['value'].rolling(7).mean() for rolling stats. df['ewm_mean'] = df['value'].ewm(span=7).mean() for exponential weighted. Always sort by time first; use groupby().shift() for multi-entity data to avoid data leakage.
| Question | Chart Type | Library |
|---|---|---|
| Distribution of one variable? | Histogram, KDE, Box plot | Seaborn |
| Relationship between two variables? | Scatter, Hexbin, Regression | Seaborn/Plotly |
| Comparison across categories? | Bar, Grouped bar, Violin | Seaborn |
| Trend over time? | Line chart, Area chart | Plotly/Matplotlib |
| Correlation matrix? | Heatmap | Seaborn |
| Part of whole? | Pie, Treemap, Sunburst | Plotly |
| Geographic data? | Choropleth, Scatter mapbox | Plotly/Folium |
| High-dimensional? | Parallel coords, UMAP | Plotly/UMAP |
Three layers: Backend (rendering engine), Artist (everything drawn), Scripting (pyplot). The Figure contains Axes (subplots). Each Axes has Axis objects. Always prefer the object-oriented API (fig, ax = plt.subplots()) over pyplot for production code.
rcParams: Control global defaults. Set plt.rcParams['font.size'] = 14 once. Create a style file for consistency across all project figures. Use plt.style.use('seaborn-v0_8-whitegrid') for clean defaults.
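A minimal sketch of the object-oriented API and rcParams defaults described above (the Agg backend is used so the example runs headless; titles and data are illustrative):

```python
import matplotlib
matplotlib.use("Agg")              # headless rendering backend, no display
import matplotlib.pyplot as plt

plt.rcParams["font.size"] = 14     # set project-wide defaults once

# OO API: explicit Figure and Axes objects, not implicit pyplot state
fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].plot([1, 2, 3], [1, 4, 9])
axes[0].set_title("squares")
axes[1].hist([1, 1, 2, 3, 3, 3])
axes[1].set_title("counts")
fig.tight_layout()

assert len(fig.axes) == 2          # the Figure contains both Axes
assert axes[0].get_title() == "squares"
```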
Built on Matplotlib with statistical intelligence. Three API levels: Figure-level (relplot, catplot, displot — create their own figure), Axes-level (scatterplot, boxplot — plot on existing axes), Objects API (new in 0.12, more composable). Seaborn auto-computes statistics (regression lines, confidence intervals, density estimates).
JavaScript-powered charts with hover, zoom, selection. plotly.express for quick plots, plotly.graph_objects for full control. Integrates with Dash for production dashboards. Supports 3D, maps, and animations. Export to HTML for sharing.
Answer: Matplotlib: full control, publication figures, custom layouts. Seaborn: statistical plots, quick EDA, beautiful defaults. Plotly: interactive dashboards, web apps, 3D/maps. Rule of thumb: Seaborn for EDA, Matplotlib for papers, Plotly for stakeholders.
Answer: (1) PCA/t-SNE/UMAP to 2D then scatter plot, (2) Pair plots for feature pairs, (3) Parallel coordinates, (4) Heatmap of correlation matrix, (5) SHAP summary plots for feature importance. For 100+ features, start with correlation heatmap to identify groups.
Answer: (1) Reduce alpha: alpha=0.1, (2) Hexbin plots: plt.hexbin(), (3) 2D KDE: sns.kdeplot(), (4) Random sampling for display, (5) Datashader for millions of points. The key is encoding density visually.
Answer: (1) Clear title stating the conclusion, not the method, (2) Minimal chart junk — remove gridlines, borders, legends when obvious, (3) Annotate key data points directly, (4) Use color consistently and meaningfully, (5) Tell a story — what action should they take? Keep it to one insight per chart.
Answer: Figure is the entire window/canvas. Axes is a single plot area within the figure. fig, axes = plt.subplots(2,2) creates 4 plots. Always use the OO API for production — ax.plot() not plt.plot(). This gives you explicit control over which subplot you're modifying.
Answer: (1) Use colorblind-safe palettes (viridis, cividis), (2) Don't rely on color alone — add shapes/patterns, (3) Sufficient contrast ratios, (4) Alt text for web charts, (5) Large enough font sizes (12pt minimum). Test with colorblindness simulators.
| Feature | namedtuple | dataclass | Pydantic |
|---|---|---|---|
| Mutable | ✗ | ✓ (default) | ✓ (v2) |
| Validation | ✗ | ✗ (manual) | ✓ (automatic) |
| Default values | Limited | ✓ | ✓ |
| Inheritance | ✗ | ✓ | ✓ |
| JSON serialization | Manual | Manual | Built-in |
| Performance | Fastest | Fast | Slower (validation) |
| Use case | Immutable records | Data containers | API models, configs |
Use functools.wraps to preserve metadata (name, docstring), handle both positional and keyword arguments with *args/**kwargs, and support decorators with parameters (factories).
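A minimal sketch of a well-behaved decorator; the `calls` list is an illustration-only addition to show the wrapper actually runs:

```python
import functools

calls = []  # records each wrapped call, so we can observe the wrapper firing

def log_calls(func):
    """A minimal decorator; functools.wraps copies __name__, __doc__, etc."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append((func.__name__, args, kwargs))
        return func(*args, **kwargs)
    return wrapper

@log_calls
def add(a, b):
    """Add two numbers."""
    return a + b

result = add(2, b=3)
```

Without `@functools.wraps`, `add.__name__` would report `"wrapper"` and the docstring would be lost, which breaks introspection and tooling.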
| Hint | Meaning | Example |
|---|---|---|
| int, str, float | Basic types | def f(x: int) -> str: |
| list[int] | List of ints (3.9+) | scores: list[int] = [] |
| dict[str, Any] | Dict with str keys | config: dict[str, Any] |
| Optional[int] | int or None | x: int \| None (3.10+) |
| Union[int, str] | int or str | id: int \| str |
| Callable[[int], str] | Function signature | Callbacks, decorators |
| TypeVar('T') | Generic type | Generic containers |
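A short sketch showing several of these hints together (the `first` and `apply` helpers are hypothetical examples, not library functions):

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

def first(items: list[T]) -> Optional[T]:
    """Generic over T: returns the first element, or None if the list is empty."""
    return items[0] if items else None

def apply(f: Callable[[int], str], x: int) -> str:
    """Callable[[int], str] describes a function taking an int, returning a str."""
    return f(x)

a = first([3, 1, 2])
b = first([])
c = apply(lambda n: str(n * 2), 21)
```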
Managing resources (files, locks, DB connections) reliably. with blocks guarantee cleanup even on errors. Implementation options: (1) Class-based with __enter__ and __exit__, (2) Function-based with @contextlib.contextmanager and yield.
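A minimal sketch of the function-based approach, here used for the "temporary settings" use case (the `temporary_setting` helper is hypothetical):

```python
from contextlib import contextmanager

@contextmanager
def temporary_setting(config, key, value):
    """Override config[key] inside the block; restore it even if an error occurs."""
    old = config.get(key)  # sketch simplification: assumes the key already exists
    config[key] = value
    try:
        yield config
    finally:
        config[key] = old  # guaranteed cleanup, exactly what `with` promises

config = {"lr": 0.1}
with temporary_setting(config, "lr", 0.001):
    inside = config["lr"]  # 0.001 while the block is active
restored = config["lr"]    # back to 0.1 afterwards
```

The class-based equivalent would put the setup in `__enter__` and the `finally` logic in `__exit__`.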
Async is for I/O-bound tasks (API calls, DB queries, file reads). NOT for CPU-bound work (use multiprocessing). The event loop manages coroutines cooperatively. asyncio.gather() runs multiple coroutines concurrently. aiohttp for async HTTP, asyncpg for async PostgreSQL.
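A minimal sketch of cooperative concurrency with asyncio.gather; asyncio.sleep stands in for real network I/O:

```python
import asyncio

async def fetch(name, delay):
    """Simulates an I/O-bound call (e.g., an API request) with a sleep."""
    await asyncio.sleep(delay)
    return f"{name}:done"

async def main():
    # gather() runs the coroutines concurrently: total time is roughly
    # max(delays), not their sum, because waits overlap on the event loop.
    return await asyncio.gather(
        fetch("a", 0.01), fetch("b", 0.02), fetch("c", 0.01)
    )

results = asyncio.run(main())
```

gather() returns results in the order the coroutines were passed, regardless of completion order.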
Generators produce values lazily with yield, using constant O(1) memory regardless of dataset size. Ideal for processing huge datasets or infinite streams.
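The memory claim can be checked directly with sys.getsizeof; a sketch:

```python
import sys

def squares(n):
    """Lazily yields n squares; state is suspended between next() calls."""
    for i in range(n):
        yield i * i

gen = squares(1_000_000)
eager = [i * i for i in range(1_000_000)]

gen_bytes = sys.getsizeof(gen)     # small and constant, regardless of n
list_bytes = sys.getsizeof(eager)  # grows with n (the pointer array alone is ~8 MB)
total = sum(squares(5))            # 0 + 1 + 4 + 9 + 16
```

Note that getsizeof measures only the container itself, but that is exactly the point: the generator holds one suspended frame, the list holds a million references.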
A descriptor is any object implementing __get__, __set__, or __delete__. @property is a descriptor. They control attribute access at the class level. Used in Django ORM fields, SQLAlchemy columns, and dataclass fields.
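A minimal sketch of a validating descriptor, in the spirit of ORM fields (the `Positive` class is a hypothetical example):

```python
class Positive:
    """Descriptor enforcing value > 0, via __set_name__/__get__/__set__."""
    def __set_name__(self, owner, name):
        self.name = "_" + name  # store the value under a private attribute

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        if value <= 0:
            raise ValueError(f"{self.name[1:]} must be > 0")
        setattr(obj, self.name, value)

class Model:
    learning_rate = Positive()  # descriptor lives on the class

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate  # routed through Positive.__set__

m = Model(0.01)
ok = m.learning_rate
try:
    m.learning_rate = -1
    failed = False
except ValueError:
    failed = True
```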
| Concept | Data Science Use Case |
|---|---|
| Inheritance | BaseModel → LinearModel → LogisticRegression |
| Abstract Base Classes | Defining mandatory methods like fit()/predict() |
| Properties | Validating input parameters (e.g., learning rate > 0) |
| Dunder Methods | __call__ for making models callable, __getitem__ for datasets |
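A sketch of the two dunder patterns from the table; the classes are hypothetical toys, but the protocol is exactly what PyTorch's Dataset and callable models rely on:

```python
class ToyDataset:
    """__len__ and __getitem__ make an object sized, indexable, and iterable."""
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx] * 2  # pretend "transform" applied per sample

class Doubler:
    """__call__ makes an instance callable like a function."""
    def __call__(self, x):
        return x * 2

ds = ToyDataset([1, 2, 3])
model = Doubler()
```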
Classes are objects too. Metaclasses define how classes behave. type is the default metaclass. Use for: auto-registering subclasses (model registry), enforcing interface standards, singleton pattern. Most developers should use class decorators instead — metaclasses are a last resort.
By default, instances store attributes in __dict__. __slots__ replaces with a fixed tuple. Saves ~40% memory per instance. Use when creating millions of objects. Trade-off: can't add dynamic attributes. Especially useful for data-heavy classes.
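The trade-off is easy to demonstrate; a sketch with a hypothetical Point class:

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class SlottedPoint:
    __slots__ = ("x", "y")  # fixed attribute set, no per-instance __dict__
    def __init__(self, x, y):
        self.x, self.y = x, y

p, s = Point(1, 2), SlottedPoint(1, 2)
has_dict = hasattr(p, "__dict__")           # True: normal instances carry a dict
slotted_has_dict = hasattr(s, "__dict__")   # False: that dict is gone (the saving)
try:
    s.z = 3  # the trade-off: dynamic attributes are rejected
    dynamic_ok = True
except AttributeError:
    dynamic_ok = False
```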
Answer: namedtuple: immutable, fastest. dataclass: mutable, flexible, no validation. Pydantic: auto-validation, JSON serialization, API models. Choose based on whether you need validation.
Answer: async: I/O-bound, many connections (1000s of API calls). threading: I/O-bound, simpler code. multiprocessing: CPU-bound (bypasses GIL). NumPy already releases GIL internally.
Answer: It's a descriptor — implements __get__, __set__, __delete__. When you access obj.x, Python's attribute lookup finds the descriptor on the class and calls __get__.
Answer: Replaces __dict__ with fixed-size array. Saves ~40% memory per instance. Can't add dynamic attributes. Use for millions of small objects.
Answer: A function that captures variables from enclosing scope. The captured variables survive after the enclosing function returns. Use case: factory functions, decorators, callbacks. Example: make_multiplier(3) returns a function that multiplies by 3.
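The make_multiplier example from the answer, as runnable code:

```python
def make_multiplier(factor):
    """Returns a function that remembers `factor` via its closure."""
    def multiply(x):
        return x * factor  # `factor` survives after make_multiplier returns
    return multiply

times3 = make_multiplier(3)
times5 = make_multiplier(5)  # each closure captures its own `factor`
```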
What is the difference between __str__ and __repr__?
Answer: __str__ is for end-users (informal, readable). __repr__ is for developers (detailed, unambiguous, "eval-able"). For data science, always implement __repr__ for models to show hyperparameters when printed.
Answer: C3 Linearization algorithm. It determines the search order for methods in multiple inheritance. Access it via ClassName.mro(). Python ensures that bases are searched after their subclasses and the order of bases in the class definition is preserved.
Answer: Several ways: (1) Overriding __new__, (2) Using a Metaclass (cleanest), (3) Module-level variables (simplest). Example with Metaclass: class Singleton(type): ... then class Database(metaclass=Singleton): ....
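A minimal sketch of the metaclass approach named in the answer:

```python
class Singleton(type):
    """Metaclass: __call__ intercepts instantiation and caches one instance per class."""
    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super().__call__(*args, **kwargs)
        return cls._instances[cls]

class Database(metaclass=Singleton):
    def __init__(self):
        self.connected = True  # runs only once, on first instantiation

a = Database()
b = Database()  # returns the cached instance, __init__ is not re-run
```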
How do you write a decorator that takes arguments, like @timer(unit='ms')?
Answer: This is a decorator factory. You need three levels of functions: (1) Factory takes parameters and returns a decorator, (2) Decorator takes the function and returns a wrapper, (3) Wrapper takes args/kwargs and executes the logic.
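A sketch of the three-level pattern for a timer factory (the `timer` decorator and its `last` attribute are hypothetical, for illustration):

```python
import functools
import time

def timer(unit="s"):
    """Level 1, factory: captures `unit`, returns the real decorator."""
    def decorator(func):
        """Level 2, decorator: receives the function, returns the wrapper."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            """Level 3, wrapper: times the call, then forwards args/kwargs."""
            start = time.perf_counter()
            result = func(*args, **kwargs)
            elapsed = time.perf_counter() - start
            wrapper.last = elapsed * 1000 if unit == "ms" else elapsed
            return result
        return wrapper
    return decorator

@timer(unit="ms")
def square(x):
    return x * x

out = square(12)
```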
What are *args and **kwargs and when should you use them?
Answer: *args collects positional arguments into a tuple. **kwargs collects keyword arguments into a dictionary. Crucial for wrapping functions, implementing decorators, or creating flexible API interfaces like Scikit-learn's __init__(**params).
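A tiny sketch of both the collecting and the forwarding direction (the helper names are hypothetical):

```python
def collect(*args, **kwargs):
    """*args arrives as a tuple, **kwargs as a dict."""
    return args, kwargs

def passthrough(func, *args, **kwargs):
    """The wrapping pattern: forward everything unchanged to func."""
    return func(*args, **kwargs)

a, k = collect(1, 2, name="model", lr=0.1)
forwarded = passthrough(collect, 3, flag=True)
```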
What is the difference between is and ==?
Answer: == checks for equality (values are the same). is checks for identity (objects occupy the same memory address). Use is for singletons like None. Example: a = [1]; b = [1]; a == b is True, a is b is False.
Scikit-learn's consistent API: Estimators have fit(X, y), Transformers have transform(X), and Predictors have predict(X). This design allows for seamless swapping of models and preprocessing steps.
A Pipeline bundles preprocessing and modeling into a single object. Crucial Benefit: It ensures that transformers are fit only on the training fold during cross-validation, preventing information from the validation set (like mean/std) from "leaking" into training. Always use pipelines in production.
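A minimal sketch, assuming scikit-learn and NumPy are installed; the data is synthetic. During cross_val_score, the scaler is re-fit on each training fold only, which is exactly the leak prevention described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# Scaling lives INSIDE the pipeline, so each CV fold fits its own scaler
# on training data only; validation rows never influence the mean/std.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
scores = cross_val_score(pipe, X, y, cv=5)
```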
Inherit from BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives you fit_transform() for free. Use check_is_fitted(self) to validate state.
Most real-world data is a mix of types. ColumnTransformer allows you to apply different preprocessing pipelines to different columns (e.g., OneHotEncode categories, Scale numerics) and then concatenate them for the model.
| Strategy | When to Use | Gotcha |
|---|---|---|
| KFold | General purpose | Doesn't preserve class ratios |
| StratifiedKFold | Classification (imbalanced) | Preserves class distribution |
| TimeSeriesSplit | Time-ordered data | Train always before test |
| GroupKFold | Grouped data (patients) | Same group never in train+test |
| LeaveOneOut | Very small datasets | N fits — very slow |
| RepeatedStratifiedKFold | Robust estimation | Multiple random splits |
| Metric | Use Case | Scikit-learn Name |
|---|---|---|
| F1-Score | Imbalanced classification (Precision-Recall balance) | f1_score |
| ROC-AUC | Probability ranking / classifier quality | roc_auc_score |
| MSE / MAE | Regression error magnitude | mean_squared_error |
| R2 Score | Variance explained by model | r2_score |
| Log Loss | Probabilistic predictions confidence | log_loss |
| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive, simple | Exponential with params |
| RandomizedSearchCV | Faster, continuous distributions | May miss optimal |
| Optuna/BayesianOpt | Smart search, early stopping | More setup, dependency |
| Halving*SearchCV | Successive halving, fast | Newer, less documented |
(1) K-Fold: standard, (2) Stratified K-Fold: for imbalanced data, (3) TimeSeriesSplit: for temporal data (preventing looking into the future), (4) GroupKFold: to ensure samples from the same group aren't split across train/test.
PolynomialFeatures, FunctionTransformer, SplineTransformer, KBinsDiscretizer. Chain with Pipeline for clean, leak-free preprocessing. Use make_column_selector to auto-select column types.
Train/Val/Test split → Cross-validate multiple models → Select best → Tune hyperparameters → Final evaluation on test set. Never tune on test data. Use cross_val_score for quick comparison, cross_validate for detailed metrics.
Answer: Info from test set influencing training. Common cause: fitting scaler on full data before split. Fix: put all preprocessing inside a Pipeline which ensures fit only on train folds during cross-validation.
Answer: Pipeline: sequential steps (A→B→C). ColumnTransformer: parallel branches (different processing for different column types). Typically ColumnTransformer inside Pipeline.
Answer: KFold: general. StratifiedKFold: imbalanced classes. TimeSeriesSplit: temporal. GroupKFold: grouped data (same patient never in both).
Answer: Grid: exhaustive but exponential. Random: better for many params, samples continuous distributions. Bayesian (Optuna): learns from previous trials, most efficient for expensive models.
Answer: Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) (learn params, return self) and transform(X) (apply). TransformerMixin gives fit_transform() free.
Answer: fit(): learn parameters from data. transform(): apply learned params to transform data. predict(): generate predictions. fit() is always on train, transform/predict on train+test.
Why do you use fit_transform on train but only transform on test?
Answer: To prevent Data Leakage. Mean/variance for scaling must be learned ONLY from training data. Applying fit to test data uses future information about the test distribution, leading to overly optimistic results.
When should you use predict_proba instead of predict?
Answer: When you need the uncertainty of the model or need to adjust the decision threshold. For cost-sensitive problems (e.g., fraud), you might flag anything with >10% probability, rather than the default 50%.
| Concept | What It Is | Key Point |
|---|---|---|
| Tensor | N-dimensional array | Like NumPy ndarray but GPU-capable |
| requires_grad | Track operations for autograd | Only enable for learnable parameters |
| device | CPU or CUDA | .to('cuda') moves to GPU |
| .detach() | Stop gradient tracking | Use for inference/metrics |
| .item() | Extract scalar value | Use for logging loss values |
Answer: Underfitting (High Bias) happens when the model is too simple (e.g., linear on non-linear data). Overfitting (High Variance) happens when the model is too complex and captures noise. Regularization (Alpha/C parameters) is used to find the "sweet spot".
When requires_grad=True, PyTorch records every operation in a directed acyclic graph (DAG). Each tensor stores its grad_fn — the function that created it. .backward() traverses this graph in reverse, computing gradients via the chain rule. The graph is destroyed after backward() (unless retain_graph=True). Gradient accumulation: by default, .backward() accumulates gradients, so you MUST call optimizer.zero_grad() before each backward pass. This is intentional — it allows gradient accumulation for larger effective batch sizes.
Answer: (1) class_weight='balanced' inside estimators, (2) Stratified cross-validation, (3) Focus on Precision-Recall curves/AUC instead of Accuracy, (4) Resampling (using imblearn library which is Sklearn-compatible).
Every model inherits nn.Module. Define layers in __init__, computation in forward(). model.parameters() returns all learnable weights. model.train() and model.eval() toggle BatchNorm/Dropout behavior. model.state_dict() saves/loads weights.
Answer: L1 adds absolute value penalty; it results in sparse models (coefficients become exactly zero), effectively performing feature selection. L2 adds squared penalty; it shrinks coefficients towards zero but rarely to zero, good for handling multicollinearity.
Every PyTorch training loop follows: (1) Forward pass, (2) Compute loss, (3) optimizer.zero_grad(), (4) loss.backward(), (5) optimizer.step(). No magic — you write it explicitly. This gives full control over learning rate scheduling, gradient clipping, mixed precision, etc.
Autograd tracks every operation on tensors with requires_grad=True and automatically computes gradients using the chain rule during .backward().
Dataset: override __len__ and __getitem__. DataLoader: wraps Dataset with batching, shuffling, multi-worker loading. Use num_workers > 0 for parallel data loading. pin_memory=True speeds up CPU→GPU transfer.
Tensors are multi-dimensional arrays (like NumPy) but with two superpowers: (1) GPU Acceleration (move to 'cuda' or 'mps'), (2) Automatic Differentiation. Bridging to NumPy is zero-copy for CPU tensors.
Use torch.cuda.amp for automatic mixed precision. Forward pass in float16 (2x faster on modern GPUs), gradients in float32 (numerical stability). GradScaler prevents underflow. Up to 2-3x speedup with minimal accuracy loss.
Every model in PyTorch inherits from nn.Module. You define parameters/layers in __init__ and the forward pass logic in forward(). This design promotes recursive composition — models can contain other modules.
Load pretrained model → Freeze base layers → Replace final layer → Fine-tune. model.requires_grad_(False) freezes all. Then unfreeze last N layers. Use smaller learning rate for pretrained layers.
| Component | Responsibility |
|---|---|
| Dataset | Defines HOW to load a single sample (__getitem__) and total count (__len__) |
| DataLoader | Handles batching, shuffling, multi-process loading, and memory pinning |
| Transforms | On-the-fly augmentation (cropping, flipping, normalizing) |
Register hooks on modules: register_forward_hook, register_backward_hook. View intermediate activations, gradient magnitudes, feature maps. Essential for debugging vanishing/exploding gradients.
Standard pattern: (1) Zero gradients, (2) Forward pass, (3) Compute Loss, (4) Backward pass (backprop), (5) Optimizer step. Don't forget model.train() and model.eval() to toggle dropout and batch norm behavior.
DistributedDataParallel is the standard for multi-GPU training. Each GPU runs a copy of the model, gradients are averaged across GPUs (all-reduce). Near-linear scaling. Use torchrun to launch.
Answer: PyTorch records operations in a DAG when requires_grad=True. .backward() traverses the graph in reverse, computing gradients via chain rule. Graph is destroyed after backward (dynamic graph).
Answer: PyTorch accumulates gradients by default. Without zeroing, gradients from previous batch add to current. This is intentional — allows gradient accumulation for larger effective batches.
Answer: train(): BatchNorm uses batch stats, Dropout is active. eval(): BatchNorm uses running stats, Dropout disabled. Always switch before training/inference.
Answer: .detach(): creates a tensor that shares data but doesn't track gradients (single tensor). torch.no_grad(): context manager disabling gradient computation for all operations inside (saves memory during inference).
Answer: (1) Register backward hooks to monitor gradient magnitudes. (2) Use torch.nn.utils.clip_grad_norm_. (3) Gradient histograms in TensorBoard. (4) Check if BatchNorm/LayerNorm is applied. (5) Try skip connections (ResNet idea).
Answer: Rule of thumb: num_workers = 4 * num_gpus. Too many = CPU overhead, too few = GPU starved. Use pin_memory=True for faster CPU→GPU transfer. Profile to find sweet spot.
Why is optimizer.zero_grad() necessary?
Answer: By default, PyTorch accumulates gradients on every .backward() call. This is useful for RNNs or training with effectively larger batch sizes than memory allows. If you don't zero them out, gradients from previous batches will influence the current update, leading to incorrect training.
What do model.train() and model.eval() do?
Answer: They set the mode for specific layers. .train() enables Dropout and Batch Normalization (calculates stats for current batch). .eval() disables dropout and uses running averages for Batch Norm. Forgetting .eval() during testing will lead to inconsistent/bad predictions.
What is torch.no_grad()?
Answer: It's a context manager that disables gradient calculation. Use it during inference or validation to save memory and compute resources. It prevents the creation of the computational graph for those operations.
@tf.function compiles to a static graph for production speed. Keras is the official high-level API. TF handles the full ML lifecycle: training → saving → serving → monitoring.
| API | Use Case | Flexibility |
|---|---|---|
| Sequential | Simple stack of layers | Low (linear only) |
| Functional | Multi-input/output, branching | Medium |
| Subclassing | Custom forward logic | High (most flexible) |
Answer: PyTorch (Dynamic graph) is more Pythonic, easier to debug with standard tools, and highly favored in research. TensorFlow (Static graph/Keras) historically had better deployment tools (TFLite, TFServing) and massive industry scale, though the gap has significantly narrowed with PyTorch 2.0 and TorchServe.
Build efficient input pipelines: tf.data.Dataset chains transformations lazily. Key methods: .map(), .batch(), .shuffle(), .prefetch(tf.data.AUTOTUNE). Prefetching overlaps data loading with model execution. Supports TFRecord files for large datasets.
Answer: Same as NumPy. If dimensions don't match, PyTorch automatically expands the smaller tensor (by repeating values) to match the larger one, provided they are compatible (trailing dimensions match or are 1). This happens without actual memory copying.
| Callback | Purpose |
|---|---|
| ModelCheckpoint | Save best model (monitor val_loss) |
| EarlyStopping | Stop when metric plateaus |
| ReduceLROnPlateau | Reduce LR when stuck |
| TensorBoard | Visualize training metrics |
| CSVLogger | Log metrics to CSV |
| LambdaCallback | Custom logic per epoch |
tf.keras supports three ways to build models: (1) Sequential (simple stacks), (2) Functional (DAGs, multi-input/output), (3) Subclassing (full control).
For full control: tf.GradientTape() records operations, then tape.gradient(loss, model.trainable_variables) computes gradients. Same pattern as PyTorch's manual loop. Use for: GANs, reinforcement learning, custom loss functions.
Loading data is often the bottleneck. tf.data.Dataset enables "ETL" pipelines: Extract (from disk/cloud), Transform (shuffle, batch, repeat), Load (map to GPU). Concepts like prefetch and interleave ensure the GPU is never waiting for the CPU.
model.save('path') exports as SavedModel format — includes architecture, weights, and computation graph. Ready for TF Serving, TF Lite (mobile), TF.js (browser). Universal deployment format.
TensorFlow can convert Python code into a Static Computational Graph using @tf.function. This enables significant optimizations like constant folding and makes models exportable to environments without Python (C++, Java, JS).
Decorating with @tf.function traces Python code into a TF graph. Benefits: optimized execution, XLA compilation, deployment. Gotchas: Python side effects only run during tracing, use tf.print() instead of print().
| Component | Visualized Metric |
|---|---|
| Scalars | Loss/Accuracy curves in real-time |
| Histograms | Weights/Gradients distribution (checking for vanishing/exploding) |
| Graphs | The internal model architecture |
| Projector | High-dimensional embeddings (t-SNE/PCA) |
| Aspect | TensorFlow | PyTorch |
|---|---|---|
| Deployment | TF Serving, TFLite, TF.js | TorchServe, ONNX |
| Research | Less common now | Dominant in papers |
| Production | Mature ecosystem | Catching up fast |
| Mobile | TFLite (mature) | PyTorch Mobile |
| Debugging | Harder (graph mode) | Easier (eager by default) |
TensorFlow Extended (TFX) is for end-to-end ML. Key components: TF Serving (for APIs), TF Lite (for mobile/edge), TFJS (for web browsers). TF Serving supports model versioning and A/B testing out of the box.
What are tf.function and AutoGraph?
Answer: tf.function is a decorator that converts a regular Python function into a TensorFlow static graph. AutoGraph is the internal tool that translates Python control flow (if, while) into TF graph ops. This allows for compiler-level optimizations and easy deployment without a Python environment.
What does tf.data.AUTOTUNE do?
Answer: It allows TensorFlow to dynamically adjust the level of parallelism and buffer sizes based on your CPU/disk hardware. It ensures that data preprocessing (CPU) is always one step ahead of model training (GPU), preventing hardware starvation.
Answer: Sequential: purely linear stacks. Functional: most common for production, supports non-linear topology (shared layers, multiple inputs/outputs). Subclassing: full control over the forward pass, best for complex research/custom logic. Functional is generally preferred for its balance of power and debugging ease.
Answer: (1) EarlyStopping callback, (2) Dropout layers, (3) L1/L2 kernel regularizers, (4) Data augmentation (via tf.image or keras.layers), (5) Learning rate schedules via callbacks such as ReduceLROnPlateau.
Answer: The language-neutral, hermetic serialization format for TF models. It includes the model architecture, weights, and the computational graph (signatures). It is the standard format for TF Serving and TFLite conversion.
Answer: Compiles Python function into a TF graph. Faster execution, enables XLA optimization, required for SavedModel export. Gotcha: Python code only runs during tracing — side effects behave differently.
Answer: Chains transformations lazily. .prefetch(AUTOTUNE) overlaps data loading with GPU computation. .cache() stores in memory after first epoch. .interleave() reads multiple files concurrently.
Answer: Usually val_loss. Set patience=5-10 (epochs without improvement). restore_best_weights=True reverts to best epoch. Combine with ReduceLROnPlateau for better convergence.
Answer: When Keras .fit() is too restrictive: GANs (two optimizers), RL (custom gradients), multi-loss weighting, gradient penalty, research experiments needing full control.
Answer: TF: production deployment (TF Serving, TFLite), mobile apps, TPU training. PyTorch: research, prototyping, Hugging Face ecosystem. Both are converging in features.
FastAPI uses async/await for handling concurrent requests without blocking, uses type hints for automatic validation, and generates interactive OpenAPI (Swagger) documentation. It is the gold standard for serving ML models today.
In production, you cannot trust input data. Pydantic enforces strict type checking and validation at runtime. If a JSON request arrives with a string instead of a float for a model feature, Pydantic catches it immediately and returns a clear error before the model even sees it.
| Feature | Purpose | Example |
|---|---|---|
| fixtures | Reusable test setup | @pytest.fixture for test data |
| parametrize | Run same test with many inputs | @pytest.mark.parametrize |
| conftest.py | Shared fixtures across tests | DB connections, mock data |
| monkeypatch | Override functions/env vars | Mock API calls |
| tmp_path | Temporary directory | Test file I/O without cleanup |
| markers | Tag tests (slow, gpu, integration) | pytest -m "not slow" |
Never use print() in production. Use the logging module: configurable levels (DEBUG/INFO/WARNING/ERROR), output to files, structured format, no performance cost when disabled.
| Level | When to Use |
|---|---|
| DEBUG | Detailed diagnostic (tensor shapes, intermediate values) |
| INFO | Normal events (training started, epoch complete) |
| WARNING | Something unexpected but handled (missing feature, fallback) |
| ERROR | Something failed (model load error, API failure) |
| CRITICAL | System-level failure (out of memory, GPU crash) |
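A minimal sketch of logging with levels; the output is captured in a StringIO here purely so the behavior is observable, where production code would use a file or stream handler:

```python
import io
import logging

stream = io.StringIO()  # capture log output in memory, for demonstration only
logger = logging.getLogger("train")
logger.handlers.clear()  # avoid duplicate handlers on repeated runs
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.debug("tensor shape: (32, 128)")  # below INFO: filtered out at near-zero cost
logger.info("epoch 1 complete")          # emitted

output = stream.getvalue()
```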
Modern async web framework. Auto-generates OpenAPI docs. Type-validated requests via Pydantic. Use for: model inference APIs, data pipelines, webhook handlers. Deploy with Uvicorn + Docker. Add health checks and input validation.
Containerize your entire environment: Python version, CUDA drivers, dependencies. Multi-stage builds: builder stage (install deps) → runtime stage (slim image). Use NVIDIA Container Toolkit for GPU access. Pin all dependency versions.
| Stage | Responsibility | Tools |
|---|---|---|
| Initialization | Loading model weights into memory (once) | FastAPI Lifespan |
| Inference | Preprocessing input and getting prediction | NumPy/Pydantic |
| Post-processing | Formatting prediction for the client | JSON/Protobuf |
| Observability | Logging latency, inputs, and drift | Prometheus/ELK |
| Tool | Best For | Key Feature |
|---|---|---|
| Hydra | ML experiments | YAML configs, CLI overrides, multi-run |
| Pydantic Settings | App config | Env var loading, validation |
| python-dotenv | Simple projects | .env file loading |
| dynaconf | Multi-environment | dev/staging/prod configs |
Conda vs Pip: Pip is standard for Python; Conda is better for C-extensions/CUDA. Docker: Containerizing the environment ensures it "works on my machine" translates to "works in the cloud". Use lightweight base images (python:3.10-slim) to minimize security risks and build times.
Automate: linting (ruff/flake8), type checking (mypy), testing (pytest), building (Docker), deploying. Use GitHub Actions or GitLab CI. Add model validation gate: compare new model metrics against baseline before deployment.
(1) Unit tests: for preprocessing logic, (2) Integration tests: for the API endpoints, (3) Model Quality tests: ensuring the model meets a minimum accuracy threshold on a benchmark dataset before deployment.
| Tool | Purpose |
|---|---|
| ruff | Fast linter + formatter (replaces black, isort, flake8) |
| mypy | Static type checking |
| pre-commit | Git hooks for auto-formatting |
| pytest-cov | Test coverage measurement |
Answer: (1) Native async support (handles concurrent requests better), (2) Automatically generates Swagger UI for testing, (3) Pydantic integration for data validation, (4) Significantly higher throughput (close to Go/Node.js levels), (5) Built-in support for WebSockets and background tasks.
Answer: (1) URL versioning (/v1/predict), (2) Model registry (MLflow/SageMaker) with aliases like production or staging, (3) Blue-green deployment — route traffic to the new version only after validation, (4) Embed the model version in the API response metadata for debugging.
Answer: It occurs when multiple libraries require conflicting versions of the same dependency. Solved by: (1) Using virtual environments (venv/conda), (2) pinning exact versions in requirements.txt or poetry.lock, (3) Docker to isolate the entire OS environment.
Answer: (1) PII Masking: remove names/emails/IDs before logging, (2) Hash sensitive fields if they are needed for troubleshooting, (3) Separate logging of model metadata from raw data, (4) Use specialized monitoring tools like Arize or Whylogs for drift detection without full data capture.
Answer: Beyond standard code tests, ML CI/CD (MLOps) includes Data Validation (is the incoming data schema correct?), Model Validation (is accuracy >= 90%?), and automated deployment to staging for human-in-the-loop review.
Answer: (1) Unit tests: data transformations, feature engineering functions. (2) Integration tests: full pipeline end-to-end. (3) Model tests: output shape, range, determinism with seeds. (4) Data tests: schema validation, distribution checks. Use pytest fixtures for reusable test data.
Answer: Logging: configurable levels, file output, structured format, zero cost when disabled, thread-safe. Print: none of these. Production code must use logging for observability and debugging.
Answer: FastAPI/Flask for REST API. Docker for containerization. Load model at startup (not per request). Add health checks, input validation, error handling, logging, metrics. Use async for high throughput. Consider model registries (MLflow) for versioning.
Answer: Project metadata, dependencies, build system, tool configs (pytest, mypy, ruff). Replaced setup.py/setup.cfg. Pin dependency versions for reproducibility. Use [project.optional-dependencies] for dev/test extras.
Answer: Hydra: YAML configs with CLI overrides, multi-run sweeps. Store configs in version control. Never hardcode hyperparameters. Use config groups for model/data/training combos.
Answer: Automate: lint → type-check → test → build → deploy. Add model validation gate: new model must beat baseline on test metrics. Use GitHub Actions. Include data validation (Great Expectations) in pipeline.
Never optimize without measuring. (1) cProfile: for function-level timing, (2) line_profiler: for line-by-line analysis in "hot" functions, (3) memory_profiler: to detect memory leaks and peak usage, (4) Py-Spy: a sampling profiler for zero-instrumentation production profiling.
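A minimal sketch of function-level profiling with cProfile and pstats; the `slow_sum` hot spot is a hypothetical stand-in:

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Deliberately unvectorized loop: a stand-in for a 'hot' function."""
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Render the top entries sorted by cumulative time into a string report
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
```

The report lists per-function call counts and timings, which tells you where to point line_profiler next.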
Numba translates a subset of Python and NumPy code into fast machine code using LLVM. By simply adding @njit, you can achieve C/Fortran-like speeds for math-heavy loops that cannot be vectorized with pure NumPy.
| Model | Best For | Mechanism |
|---|---|---|
| Threading | I/O-bound (APIs, DBs) | Concurrent but not parallel (GIL) |
| Multiprocessing | CPU-bound (Training, Math) | True parallelism (separate OS processes) |
| asyncio | High-concurrency I/O | Single-threaded cooperative multitasking |
| Tool | Type | When to Use | Overhead |
|---|---|---|---|
| cProfile | Function-level | Find slow functions | ~2x slowdown |
| line_profiler | Line-by-line | Find slow lines in a function | Higher |
| Py-Spy | Sampling profiler | Production profiling | Near zero |
| tracemalloc | Memory allocation | Find memory leaks | Low |
| memory_profiler | Line-by-line memory | Find memory-heavy lines | High |
| scalene | CPU + Memory + GPU | Comprehensive profiling | Low |
Single Instruction, Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data points in a single instruction. Modern NumPy leverages SIMD-accelerated loops (e.g. AVX-512 where available) and BLAS backends (MKL/OpenBLAS) so that an elementwise a + b runs close to hardware speed.
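A small NumPy sketch contrasting the vectorized path with a Python-level loop; the array size is arbitrary:

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)
b = np.ones_like(a)

# Vectorized: one C-level loop over contiguous memory (SIMD-friendly)
c = a + b

# Equivalent Python-level loop touches one boxed object at a time
# and is typically orders of magnitude slower:
# c_loop = [x + y for x, y in zip(a, b)]
```

The vectorized form also avoids creating a million intermediate Python float objects, which is where much of the loop's cost goes.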
Cython is a superset of Python that compiles to C. It allows you to call C functions directly and use static typing. Use it for complex algorithms that require low-level memory control (e.g., custom tree models or graph algorithms).
GIL prevents true multi-threading for CPU-bound Python code. But: NumPy, Pandas, and scikit-learn release the GIL during C operations. Solutions for parallelism:
| Tool | Best For | How |
|---|---|---|
| threading | I/O-bound (API calls, disk) | GIL released during I/O waits |
| multiprocessing | CPU-bound Python | Separate processes, separate GIL |
| concurrent.futures | Simple parallel patterns | ThreadPool/ProcessPool executors |
| asyncio | Many I/O operations | Event loop, cooperative multitasking |
| joblib | sklearn parallel | n_jobs parameter |
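A minimal asyncio sketch of the event-loop row above; `fetch` stands in for a real awaitable I/O call such as an HTTP request:

```python
import asyncio

async def fetch(i):
    # Stand-in for awaitable I/O (e.g. an HTTP request)
    await asyncio.sleep(0.05)
    return i * 2

async def main():
    # All ten coroutines wait concurrently on a single thread;
    # the event loop switches between them at each await point
    return await asyncio.gather(*(fetch(i) for i in range(10)))

results = asyncio.run(main())
```

Total wall time is roughly one sleep (~50 ms), not ten, because the waits overlap.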
@numba.jit(nopython=True) compiles Python functions to machine code. Supports NumPy arrays and most math operations. 10-100x speedup for loops that can't be vectorized. @numba.vectorize creates custom ufuncs. @numba.cuda.jit runs on GPU.
Compiles Python to C extension modules. Add type declarations for massive speedups. Best for: tight loops, calling C libraries, CPython extensions. More setup than Numba but more control.
Pandas-like API for datasets larger than memory. Key abstractions: dask.dataframe (parallel Pandas), dask.array (parallel NumPy), dask.delayed (custom parallelism). Uses a task scheduler to execute lazily. Scales from laptop to cluster.
General-purpose distributed framework. Ray Tune for hyperparameter tuning, Ray Serve for model serving, Ray Data for data processing. Easier than Dask for ML-specific workloads. Used by OpenAI, Uber, Ant Group.
Python 3.12: faster interpreter (5-15% overall), better error messages, per-interpreter GIL (experimental). Python 3.13: free-threaded CPython (experimental no-GIL mode), JIT compiler (experimental). The future of Python performance is exciting.
`,
  code: `
Answer: It simplifies implementation by making the memory management (reference counting) thread-safe without needing granular locks. It also makes single-threaded code faster and C-extension integration easier. Removing it is difficult because it effectively requires a rewrite of the interpreter (see the experimental free-threaded "no-GIL" build in Python 3.13).
Answer: cProfile is a deterministic profiler; it hooks into every function call. While very accurate, it adds significant overhead (sometimes 2x slowdown). For production systems, "Sampling Profilers" (like Py-Spy) are better as they only inspect the stack every few milliseconds, adding negligible overhead.
Answer: CPUs are fastest when accessing contiguous memory (spatial locality). NumPy's C-contiguous arrays ensure that when one value is loaded into the CPU cache, the next values are also loaded, minimizing cache misses compared to Python lists of scattered objects.
Answer: (1) Vectorize with NumPy (broadcast). (2) If too complex, use Numba JIT. (3) Cython for C-level types. (4) multiprocessing if iterations are independent.
Answer: Threading: I/O-bound (shared memory, low overhead). Multiprocessing: CPU-bound (separate memory spaces, bypasses the GIL, but data passed between processes must be pickled). For downloading 1000 images → threads. For computing 1000 matrix operations → processes.
Answer: JIT compiler that translates Python/NumPy to machine code using LLVM. @jit(nopython=True) for 10-100x speedup. Works best with: NumPy arrays, math operations, loops. Doesn't support: Pandas, string manipulation, most Python objects.
Answer: cProfile: function-level (find slow functions). line_profiler: line-by-line. Py-Spy: sampling (production-safe). tracemalloc: memory. scalene: CPU+memory+GPU all-in-one. Always profile before optimizing.
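A self-contained tracemalloc sketch for spotting allocation hotspots; the `leaky` list simulates a memory-heavy line:

```python
import tracemalloc

tracemalloc.start()

snapshot_before = tracemalloc.take_snapshot()
leaky = [bytes(1024) for _ in range(1000)]  # ~1 MB of allocations
snapshot_after = tracemalloc.take_snapshot()

# Diff the two snapshots by source line to find where memory grew
top = snapshot_after.compare_to(snapshot_before, "lineno")
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()
```

The first entries of `top` point at the exact file and line responsible for the growth, which is what makes this the go-to stdlib tool for leak hunting.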
Answer: Dask: familiar Pandas/NumPy API, Python-native, scales well. Ray: ML-focused (tune, serve), lower-level control. Spark: JVM-based, best for very large (TB+) data, enterprise. For Python ML: Dask or Ray. For big data ETL: Spark.