diff --git "a/Python/app.js" "b/Python/app.js"
new file mode 100644
--- /dev/null
+++ "b/Python/app.js"
@@ -0,0 +1,1876 @@
+const modules = [
+  {
+    id: "python-fundamentals",
+    title: "Python Fundamentals for DS",
+    icon: "🐍",
+    category: "Foundations",
+    description: "Data structures, comprehensions, file I/O, virtual environments"
+  },
+  {
+    id: "numpy",
+    title: "NumPy & Scientific Computing",
+    icon: "🔢",
+    category: "Scientific",
+    description: "ndarrays, broadcasting, vectorization, linear algebra"
+  },
+  {
+    id: "pandas",
+    title: "Pandas & Data Manipulation",
+    icon: "🐼",
+    category: "Data Wrangling",
+    description: "DataFrames, groupby, pivot, time series, merging"
+  },
+  {
+    id: "visualization",
+    title: "Data Visualization",
+    icon: "📊",
+    category: "Visualization",
+    description: "Matplotlib, Seaborn, Plotly – from basics to publication-ready"
+  },
+  {
+    id: "advanced-python",
+    title: "Advanced Python",
+    icon: "🎯",
+    category: "Advanced",
+    description: "OOP, decorators, async, multiprocessing, type hints"
+  },
+  {
+    id: "sklearn",
+    title: "Python for ML (Scikit-learn)",
+    icon: "🤖",
+    category: "Machine Learning",
+    description: "Pipelines, transformers, cross-validation, hyperparameter tuning"
+  },
+  {
+    id: "pytorch",
+    title: "Deep Learning with PyTorch",
+    icon: "🔥",
+    category: "Deep Learning",
+    description: "Tensors, autograd, nn.Module, training loops, transfer learning"
+  },
+  {
+    id: "tensorflow",
+    title: "TensorFlow & Keras",
+    icon: "🧠",
+    category: "Deep Learning",
+    description: "Sequential/Functional API, callbacks, TensorBoard, deployment"
+  },
+  {
+    id: "production",
+    title: "Production Python",
+    icon: "📦",
+    category: "Engineering",
+    description: "Testing, packaging, logging, FastAPI for model serving"
+  },
+  {
+    id: "optimization",
+    title: "Performance & Optimization",
+    icon: "⚡",
+    category: "Optimization",
+    description: "Profiling, Numba, Cython, memory optimization, Dask"
+  }
+];
+
+const MODULE_CONTENT = {
"python-fundamentals": { + concepts: ` +
| Type | Mutable | Ordered | Hashable | Use Case |
|---|---|---|---|---|
| list | ✓ | ✓ | ✗ | Sequential data, time series, feature lists |
| tuple | ✗ | ✓ | ✓ | Fixed records, dict keys, DataFrame rows |
| dict | ✓ | ✓ (3.7+) | ✗ | Lookup tables, JSON, config, caches |
| set | ✓ | ✗ | ✗ | Unique values, membership testing O(1) |
| frozenset | ✗ | ✗ | ✓ | Immutable set, usable as dict keys |
| deque | ✓ | ✓ | ✗ | O(1) append/pop both ends, sliding windows |
Everything is an object: when you write a = [1, 2, 3], the list lives on the heap; a is a name that points to it. This is why b = a makes both names point to the same list – no copy is made.
Reference Counting: Python uses reference counting plus a cyclic garbage collector. Each object tracks how many names point to it; when the count hits 0, its memory is freed immediately. This is why del doesn't always free memory – it just decrements the reference count.
Integer Interning: Python caches integers from -5 to 256 and short strings. So a = 100; b = 100; a is b is True, but a = 1000; b = 1000; a is b may be False. Never use is for value comparison – always use ==.
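A quick sketch of the naming and interning behavior above (int("1000") is used just to force two distinct runtime objects; interning details are CPython-specific):

```python
# Interning: CPython caches ints -5..256, so equal small ints are one object.
a, b = 100, 100
assert a is b and a == b

# Larger ints created at runtime are distinct objects with equal values.
big1, big2 = int("1000"), int("1000")
assert big1 == big2
assert big1 is not big2      # identity differs -> never use `is` for values

# Names vs. objects: assignment never copies.
xs = [1, 2, 3]
ys = xs                      # second name, same list
ys.append(4)
assert xs == [1, 2, 3, 4]    # the change is visible through both names
```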
| Class | Purpose | Why It Matters in DS |
|---|---|---|
| defaultdict | Dict with default factory | Group data without KeyError: defaultdict(list) |
| Counter | Count hashable objects | Label distribution: Counter(y_train) |
| namedtuple | Lightweight immutable class | Return multiple values with names, not indices |
| OrderedDict | Dict remembering insertion order | Legacy (dicts are ordered in 3.7+), but useful for move_to_end() |
| deque | Double-ended queue | Sliding window computations, BFS algorithms |
| ChainMap | Stack multiple dicts | Layer config: defaults → env → CLI overrides |
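The first three rows of the table can be sketched with invented sample data:

```python
from collections import Counter, defaultdict, namedtuple

records = [("a", 1), ("b", 2), ("a", 3)]

# defaultdict(list): group values by key without KeyError checks
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)
assert dict(groups) == {"a": [1, 3], "b": [2]}

# Counter: label distribution in one call
labels = ["cat", "dog", "cat", "cat"]
assert Counter(labels).most_common(1) == [("cat", 3)]

# namedtuple: return named fields instead of magic indices
Point = namedtuple("Point", ["lat", "lon"])
p = Point(52.52, 13.40)
assert p.lat == 52.52
```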
itertools functions return iterators, not lists. They consume O(1) memory regardless of input size. This matters when processing millions of records.
| Function | What It Does | DS Use Case |
|---|---|---|
| chain() | Concatenate iterables | Merge multiple data files lazily |
| islice() | Slice any iterator | Take first N records from generator |
| groupby() | Group consecutive elements | Process sorted log entries by date |
| product() | Cartesian product | Generate hyperparameter grid |
| combinations() | All r-length combos | Feature interaction pairs |
| starmap() | map() with unpacked args | Apply function to paired data |
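A minimal illustration of three of these (the hyperparameter values are invented):

```python
from itertools import chain, islice, product

# chain(): merge iterables lazily - no concatenated list in memory
merged = chain([1, 2], (3, 4))

# islice(): take the first N items from any iterator
first_three = list(islice(merged, 3))
assert first_three == [1, 2, 3]

# product(): hyperparameter grid without nested loops
grid = list(product([0.01, 0.1], [32, 64]))
assert grid == [(0.01, 32), (0.01, 64), (0.1, 32), (0.1, 64)]
```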
Stop using os.path.join(). Use pathlib.Path – it's object-oriented, cross-platform, and reads like English:
- Path('data') / 'train' / 'images' → builds paths with the / operator
- path.glob('*.csv') → find all CSV files
- path.stem, path.suffix, path.parent → parse without regex
- path.read_text() / path.write_text() → no need for open()

A bare except: catches even SystemExit and KeyboardInterrupt. Always catch specific exceptions. In DS pipelines, catch ValueError (bad data), FileNotFoundError (missing files), KeyError (missing columns).
LBYL vs EAFP: Python prefers "Easier to Ask Forgiveness than Permission" (EAFP) over "Look Before You Leap" (LBYL). Use try/except instead of checking conditions first. It's faster when exceptions are rare (which they usually are).
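A short pathlib-plus-EAFP sketch; the data/train directory and config.txt are hypothetical names, so the except branch runs when they don't exist:

```python
from pathlib import Path

data_dir = Path("data") / "train"           # '/' builds paths cross-platform
csv_files = sorted(data_dir.glob("*.csv"))  # empty list if the dir is absent

# EAFP: attempt the read and handle the specific failure,
# instead of checking existence first (which can race).
try:
    text = (data_dir / "config.txt").read_text()
except FileNotFoundError:
    text = ""                               # sensible default for the demo

sample = Path("report.final.csv")
assert sample.suffix == ".csv"
assert sample.stem == "report.final"
```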
| Tool | Best For | Create | Key Feature |
|---|---|---|---|
| venv | Simple projects | python -m venv env | Built-in, lightweight |
| conda | DS/ML (C dependencies) | conda create -n myenv python=3.11 | Handles non-Python deps (CUDA, MKL) |
| poetry | Modern packaging | poetry init | Lock files, deterministic builds |
| uv | Speed (Rust-based) | uv venv | 10-100x faster than pip |
Answer: Lists are mutable, tuples immutable. But the deeper answer: tuples are hashable (can be dict keys), use less memory (no over-allocation), and signal intent ("this shouldn't change"). Use tuples for (lat, lon) pairs, function return values, dict keys for caching. Use lists for feature collections that grow.
+Answer: The GIL prevents true multi-threading for CPU-bound tasks. But here's what most people miss: NumPy, Pandas, and scikit-learn release the GIL during C-level computations. So vectorized operations ARE parallel at the C level. For pure Python CPU work, use multiprocessing. For I/O (API calls, file reads), threading works fine because the GIL is released during I/O waits.
Explain the difference between is and ==. Why does this matter?
+ Answer: == checks value equality (__eq__). is checks identity (same memory address). Python interns small integers (-5 to 256) and some strings, so 300 is 300 may be False. Always use == for values. Only use is for None checks: if x is None.
Answer: 5 strategies, from simplest to most powerful: (1) pd.read_csv(chunksize=50000) → process in batches, (2) usecols=['needed_cols'] → load only what you need, (3) dtype={'col': 'int32'} → use smaller types, (4) Dask → lazy Pandas-like API, (5) DuckDB → SQL on CSV files with zero memory overhead.
Answer: Dict: O(1) average via hash tables (Python's dict uses open addressing). List: O(n) linear scan. Internally, dict hashes the key to compute a slot index, then handles collisions via probing. Sets use the same mechanism. This is why x in my_set is fast but x in my_list is slow.
Answer: copy.copy() copies the outer container but shares inner objects. copy.deepcopy() recursively copies everything. Real scenario: you have a list of dicts (config per experiment). A shallow copy means modifying one experiment's config changes all of them. A deep copy gives independent configs. Pandas .copy() is deep by default – but df2 = df is NOT a copy at all.
Answer: defaultdict(factory) auto-creates default values for missing keys. Use defaultdict(list) to group items without if key not in dict checks. Use defaultdict(int) to count. It's cleaner and ~20% faster than dict.setdefault() for grouping operations in data processing.
Answer: Generators yield values one at a time using yield, consuming O(1) memory regardless of data size. A list of 1 billion items = ~8GB RAM. A generator of 1 billion items = ~100 bytes. Critical for: reading large files, streaming data, batch training. yield from delegates to sub-generators.
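As a sketch, a hypothetical batches() helper shows the O(1)-memory pattern (names invented for illustration):

```python
def batches(iterable, size):
    """Yield lists of `size` items at a time - only one batch lives in memory."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:                       # flush the final partial batch
        yield batch

# range() is itself lazy; we materialize only one batch at a time.
stream = batches(range(10), 4)
assert next(stream) == [0, 1, 2, 3]
assert [len(b) for b in stream] == [4, 2]
```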
Answer: list(dict.fromkeys(my_list)) – uses dict's insertion-order guarantee (3.7+), runs in O(n). Old approach: seen = set(); [x for x in lst if not (x in seen or seen.add(x))]. For DataFrames: df.drop_duplicates(subset=['key_col']).
Answer: Two mechanisms: (1) Reference counting – each object has a count; freed when the count hits 0. Immediate cleanup. (2) Cyclic garbage collector – detects reference cycles (A → B → A) that refcounting can't handle. Runs periodically on generations (gen0, gen1, gen2). You can force it with gc.collect() – useful after deleting large ML models.
What's the difference between __str__ and __repr__?
Answer: __str__ is for end users (readable), __repr__ is for developers (unambiguous, ideally eval-able). If only one is defined, implement __repr__ – Python falls back to it for str() too. In ML: __repr__ should show model params: LinearRegression(lr=0.01, reg=l2).
How do *args and **kwargs help in ML code?
Answer: They enable flexible function signatures. *args: variable positional args (multiple datasets). **kwargs: variable keyword args (hyperparameters). Essential for: wrapper functions, decorators, scikit-learn's set_params(**params), and model.fit(X, y, **fit_params).
Answer: f-strings (3.6+) are the fastest, most readable formatting. They support expressions: f"{accuracy:.2%}" → "95.23%", f"{x=}" (3.8+) → "x=42" for debugging. .format() is slower and more verbose. % formatting is legacy C-style. Always use f-strings in modern Python.
Answer: Python resolves names in order: Local → Enclosing function → Global → Built-in. This is why you can accidentally shadow built-ins: list = [1,2] breaks list(). Use nonlocal to modify enclosing scope, global for module scope (but avoid globals in production code).
What's the difference between append() and extend()?
Answer: append(x) adds x as a single element. extend(iterable) unpacks and adds each element. [1,2].append([3,4]) → [1,2,[3,4]]. [1,2].extend([3,4]) → [1,2,3,4]. Use extend() when merging feature lists; append() when adding one item to results.
| Feature | Python List | NumPy ndarray |
|---|---|---|
| Storage | Array of pointers to objects scattered in memory | Contiguous block of raw typed data |
| Type | Each element can be different type | Homogeneous โ all elements same dtype |
| Operations | Python loop (bytecode interpretation) | Compiled C/Fortran loops |
| Memory | ~28 bytes per int + pointer overhead | 8 bytes per int64 (no overhead) |
| SIMD | Not possible | Uses CPU vector instructions (SSE/AVX) |
C order (row-major) stores elements as arr[0,0], arr[0,1], arr[0,2], arr[1,0]...; Fortran order (column-major) stores arr[0,0], arr[1,0], arr[2,0], arr[0,1]... Every ndarray has a strides tuple – bytes to jump in each dimension. For a (3,4) float64 array: strides = (32, 8) means jump 32 bytes for the next row, 8 bytes for the next column. Slicing creates views (no copy) by adjusting strides. arr[::2] doubles the row stride.
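The strides arithmetic above can be verified directly:

```python
import numpy as np

arr = np.arange(12, dtype=np.float64).reshape(3, 4)
assert arr.strides == (32, 8)       # 4 cols * 8 bytes per row, 8 per column

view = arr[::2]                     # slicing adjusts strides: no copy
assert view.strides == (64, 8)      # row stride doubled
assert np.shares_memory(arr, view)

view[0, 0] = 99.0                   # writing the view writes the original
assert arr[0, 0] == 99.0
```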
| dtype | Bytes | Range | When to Use |
|---|---|---|---|
| float32 | 4 | ±3.4e38 | Deep learning (GPU prefers this), 50% less memory |
| float64 | 8 | ±1.8e308 | Default. Scientific computing, high-precision stats |
| int32 | 4 | ±2.1 billion | Indices, counts, most integer data |
| bool | 1 | True/False | Masks for filtering |
| category (Pandas) | Varies | Finite set | Repeated strings → 90% memory savings |
np.einsum can express any tensor operation in one call: matrix multiply, trace, transpose, batch ops. Often faster than chaining NumPy functions because it avoids intermediate arrays.
- np.linalg.inv(X.T @ X) @ X.T @ y → Normal equation (linear regression)
- U, S, Vt = np.linalg.svd(X) → PCA, dimensionality reduction
- eigenvals, eigenvecs = np.linalg.eigh(cov) → Covariance eigenvectors
- np.linalg.norm(X, axis=1) → L2 norms for distance computation

Answer: Three reasons: (1) Contiguous memory – CPU cache-friendly, no pointer chasing. (2) Compiled C loops – operations run in compiled C, not interpreted Python. (3) SIMD instructions – modern CPUs process 4-8 floats simultaneously (AVX). Together: 50-100x speedup.
Answer: Views share data (slicing creates views). Copies duplicate data. arr[::2] is a view – modifying it modifies the original. arr[[0,2,4]] (fancy indexing) is a copy. Views are fast and memory-efficient. Use np.shares_memory(a, b) to check. Always .copy() when you need independent data.
Answer: Compare shapes right-to-left. Dimensions must be equal or one must be 1. Example: (3,1) + (1,4) → (3,4). Each (3,1) column is "stretched" to match 4 columns. No memory is actually copied – NumPy adjusts strides internally. Gotcha: (3,) + (3,4) fails – reshape to (3,1) first.
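A minimal check of these rules, plus the standard column-wise standardization use case (random data invented for illustration):

```python
import numpy as np

col = np.arange(3).reshape(3, 1)    # shape (3, 1)
row = np.arange(4).reshape(1, 4)    # shape (1, 4)
grid = col + row                    # broadcasts to (3, 4)
assert grid.shape == (3, 4)
assert grid[2, 3] == 5

# Classic use: standardize features column-wise; (100,5) op (5,) broadcasts.
X = np.random.default_rng(0).normal(size=(100, 5))
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
assert np.allclose(Xs.mean(axis=0), 0.0, atol=1e-10)
```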
Answer: axis=0 = operate down rows (column-wise). axis=1 = across columns (row-wise). Think: axis=0 collapses rows, axis=1 collapses columns. For a (100,5) array: mean(axis=0) → shape (5,) → one mean per feature. mean(axis=1) → shape (100,) → one mean per sample.
Answer: (1) Center data: X_c = X - X.mean(axis=0), (2) Covariance: cov = X_c.T @ X_c / (n-1), (3) Eigendecomposition: vals, vecs = np.linalg.eigh(cov), (4) Sort by eigenvalue descending, (5) Project: X_pca = X_c @ vecs[:, -k:]. Alternatively use SVD directly: U, S, Vt = np.linalg.svd(X_c).
Answer: np.dot: inner product for 1D, matrix multiply for 2D, but confusing for higher dims. @ (matmul): clean matrix multiply, broadcasts over batch dims. einsum: most flexible – expresses any contraction. Use @ for readability, einsum for complex ops. Avoid np.dot for 3D+ arrays.
Answer: np.isnan(arr) detects NaNs. np.nanmean(arr), np.nanstd(arr) → NaN-safe aggregations. Replace: arr[np.isnan(arr)] = 0. Gotcha: np.nan == np.nan is False! NaN poisons comparisons. This is the IEEE 754 standard.
Answer: Structured arrays have named fields with mixed dtypes: np.dtype([('name', 'U10'), ('age', 'i4'), ('score', 'f8')]). Use when: (1) You need NumPy speed without Pandas overhead, (2) Interfacing with binary file formats (HDF5, FITS), (3) Processing millions of records where Pandas is too slow.
Answer: C-order stores rows contiguously; Fortran stores columns. Iterating along the last axis of C-order arrays is fastest because adjacent elements are in adjacent memory (cache-friendly). For column-heavy operations, Fortran order can be faster. NumPy defaults to C-order. np.asfortranarray() converts.
Answer: Three options, from slowest to fastest: (1) np.vectorize(func) – convenience wrapper, NOT actually vectorized (still Python loops), (2) Rewrite using broadcasting + boolean masks, (3) Use @numba.jit(nopython=True) for true compiled speed. Always prefer option 2 when possible.
Answer: np.random.seed(42): global state, not thread-safe. RandomState(42): isolated state, legacy. default_rng(42): modern (NumPy 1.17+), uses PCG64, thread-safe, better statistical properties. Always use default_rng() in new code.
Answer: Use the expansion: ||a-b||² = ||a||² + ||b||² - 2a·b. Code: dists = np.sum(X**2, axis=1)[:,None] + np.sum(X**2, axis=1)[None,:] - 2 * X @ X.T. This avoids the O(n²×d) explicit loop and leverages BLAS matrix multiply. scipy.spatial.distance.cdist wraps this.
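The expansion can be sketched as a function (the np.maximum clip is an addition here, guarding against tiny negative values from float error):

```python
import numpy as np

def pairwise_sq_dists(X):
    """||a-b||^2 via the expansion: one BLAS matmul, no Python loops."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.maximum(d2, 0.0)      # clip float-error negatives

X = np.random.default_rng(42).normal(size=(50, 3))
D2 = pairwise_sq_dists(X)

# Check one entry against the direct formula.
i, j = 7, 19
assert np.isclose(D2[i, j], np.sum((X[i] - X[j]) ** 2))
assert np.allclose(np.diag(D2), 0.0, atol=1e-9)
```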
| Feature | Series | DataFrame |
|---|---|---|
| Dimensions | 1D labeled array | 2D labeled table |
| Analogy | A column in a spreadsheet | The entire spreadsheet |
| Index | Single index | Row index + column index |
| Creation | pd.Series([1,2,3]) | pd.DataFrame({'a': [1,2]}) |
df.loc[0:5] includes row 5. df.iloc[0:5] excludes row 5. This trips up everyone.
+ When you chain indexing (df[df.x > 0]['y'] = 5), Pandas may create a temporary copy. Your assignment modifies the copy, not the original. Fix: Always use .loc: df.loc[df.x > 0, 'y'] = 5. In Pandas 2.0+, Copy-on-Write mode eliminates this issue entirely.
GroupBy is the most powerful Pandas operation. It follows three steps: (1) Split data into groups, (2) Apply a function to each group independently, (3) Combine results. The key insight: GroupBy is lazy – no computation happens until you call an aggregation.
Fluent API style chains multiple operations. More readable, no intermediate variables, and enables .pipe() for custom functions. Use .assign() instead of df['col'] = ... for chainability.
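A small chaining sketch with invented data (the dept/salary columns are illustrative only):

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["eng", "eng", "ops", "ops"],
    "salary": [100, 120, 80, 90],
})

summary = (
    df
    .assign(salary_k=lambda d: d["salary"] / 1000)   # chainable column add
    .query("salary >= 90")                           # filter without masks
    .groupby("dept", as_index=False)["salary_k"]
    .mean()
)
assert list(summary["dept"]) == ["eng", "ops"]
eng_mean = summary.loc[summary["dept"] == "eng", "salary_k"].iloc[0]
assert abs(eng_mean - 0.11) < 1e-9
```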
| Strategy | Savings | When to Use |
|---|---|---|
| Category dtype | 90%+ | Columns with few unique strings (gender, country) |
| Downcast numerics | 50-75% | int64 to int32/int16 when range allows |
| Sparse arrays | 80%+ | Columns that are mostly zeros/NaN |
| Read in chunks | N/A | Files larger than RAM |
Answer: Chained indexing (df[mask]['col'] = val) may modify a copy, not the original. Fix: use df.loc[mask, 'col'] = val. In Pandas 2.0+, enable Copy-on-Write: pd.options.mode.copy_on_write = True. This makes all indexing return views until modification, then copies automatically.
Answer: 5 strategies: (1) pd.read_csv(chunksize=50000) → process in batches, (2) usecols=['needed_cols'] → load only what you need, (3) dtype={'col': 'int32'} → use smaller types, (4) Dask → lazy Pandas-like API, (5) DuckDB → SQL on CSV files with zero memory overhead. Polars is also excellent for out-of-core processing.
Answer: merge(): SQL-style joins on columns (most flexible). join(): joins on index (convenience wrapper). concat(): stack DataFrames along axis (union/append). Use merge for column-based joins, concat for stacking rows/columns. join is just merge with index.
Answer: map(): Series only, element-wise. apply(): works on rows/columns of a DataFrame or elements of a Series. applymap(): element-wise on an entire DataFrame (renamed to map() in Pandas 2.1). Performance tip: all three are slow – prefer vectorized operations whenever possible.
Answer: agg() reduces – returns one value per group (changes shape). transform() broadcasts – returns same shape as input. Example: df.groupby('dept')['salary'].transform('mean') fills every row with its department's average salary, while .agg('mean') returns one row per department.
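A minimal sketch of the shape difference, with invented data:

```python
import pandas as pd

df = pd.DataFrame({
    "dept":   ["a", "a", "b"],
    "salary": [10, 30, 50],
})

# agg: one row per group (reduces)
per_dept = df.groupby("dept")["salary"].agg("mean")
assert per_dept["a"] == 20 and per_dept["b"] == 50

# transform: same length as the input (broadcasts back)
df["dept_mean"] = df.groupby("dept")["salary"].transform("mean")
assert list(df["dept_mean"]) == [20.0, 20.0, 50.0]

# Typical feature: deviation from the group mean
df["delta"] = df["salary"] - df["dept_mean"]
assert list(df["delta"]) == [-10.0, 10.0, 0.0]
```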
Answer: Hierarchical indexing – multiple levels of row/column labels. Use for: pivot table results, panel data (entity + time), groupby with multiple keys. Access with .xs() or tuple slicing: df.loc[('A', 2023)]. Convert back with .reset_index().
Answer: Strategy depends on context: (1) dropna(thresh=N) → keep rows with at least N non-null values, (2) df.ffill() → forward fill for time series, (3) fillna(df.median()) → impute with median for ML, (4) interpolate(method='time') → time-weighted interpolation. Always check df.isna().sum() first.
Answer: Stores repeated strings as integer codes + lookup table. Use when a column has few unique values relative to total rows (e.g., 50 countries in 1M rows). Benefits: 90%+ memory savings, faster groupby. Gotcha: operations that create new values (like string concatenation) convert back to object dtype.
Answer: Pandas: best ecosystem, most tutorials, sufficient for <1GB. Polars: 10-100x faster, lazy evaluation, multi-threaded, no GIL issues – use for 1-100GB. DuckDB: SQL interface, out-of-core, great for analytical queries – use when SQL is more natural or data exceeds RAM.
+Answer: df['lag_1'] = df['value'].shift(1) for lag features. df['rolling_mean_7'] = df['value'].rolling(7).mean() for rolling stats. df['ewm_mean'] = df['value'].ewm(span=7).mean() for exponential weighted. Always sort by time first, use groupby().shift() for multi-entity data to avoid data leakage.
| Question | Chart Type | Library |
|---|---|---|
| Distribution of one variable? | Histogram, KDE, Box plot | Seaborn |
| Relationship between two variables? | Scatter, Hexbin, Regression | Seaborn/Plotly |
| Comparison across categories? | Bar, Grouped bar, Violin | Seaborn |
| Trend over time? | Line chart, Area chart | Plotly/Matplotlib |
| Correlation matrix? | Heatmap | Seaborn |
| Part of whole? | Pie, Treemap, Sunburst | Plotly |
| Geographic data? | Choropleth, Scatter mapbox | Plotly/Folium |
Three layers: Backend (rendering engine), Artist (everything drawn), Scripting (pyplot). The Figure contains Axes (subplots). Each Axes has Axis objects. Always prefer the object-oriented API (fig, ax = plt.subplots()) over pyplot for production code.
Built on Matplotlib with statistical intelligence. Three API levels: Figure-level (relplot, catplot, displot – create their own figure), Axes-level (scatterplot, boxplot – plot on existing axes), Objects API (new in 0.12, more composable).
JavaScript-powered charts with hover, zoom, selection. plotly.express for quick plots, plotly.graph_objects for full control. Integrates with Dash for production dashboards. Supports 3D plots, maps, and animations.
Answer: Matplotlib: full control, publication figures, custom layouts. Seaborn: statistical plots, quick EDA, beautiful defaults. Plotly: interactive dashboards, web apps, 3D/maps. Rule of thumb: Seaborn for EDA, Matplotlib for papers, Plotly for stakeholders.
+Answer: (1) PCA/t-SNE/UMAP to 2D then scatter plot, (2) Pair plots for feature pairs, (3) Parallel coordinates, (4) Heatmap of correlation matrix, (5) SHAP summary plots for feature importance. For 100+ features, start with correlation heatmap to identify groups.
+Answer: (1) Reduce alpha: alpha=0.1, (2) Hexbin plots: plt.hexbin(), (3) 2D KDE: sns.kdeplot(), (4) Random sampling for display, (5) Datashader for millions of points. The key is encoding density visually.
Answer: (1) Clear title stating the conclusion, not the method, (2) Minimal chart junk – remove gridlines, borders, legends when obvious, (3) Annotate key data points directly, (4) Use color consistently and meaningfully, (5) Tell a story – what action should they take? Keep it to one insight per chart.
Answer: Figure is the entire window/canvas. Axes is a single plot area within the figure. fig, axes = plt.subplots(2,2) creates 4 plots. Always use the OO API for production – ax.plot(), not plt.plot(). This gives you explicit control over which subplot you're modifying.
Answer: (1) Use colorblind-safe palettes (viridis, cividis), (2) Don't rely on color alone – add shapes/patterns, (3) Sufficient contrast ratios, (4) Alt text for web charts, (5) Large enough font sizes (12pt minimum). Test with colorblindness simulators.
Decorators wrap a function to add behavior without changing its code. Use functools.wraps to preserve metadata (name, docstring), handle both positional and keyword arguments, and support decorators with parameters (factories).
+ Managing resources (files, locks, DB connections) reliably. with blocks guarantee cleanup even on errors. Implementation options: (1) Class-based with __enter__ and __exit__, (2) Function-based with @contextlib.contextmanager and yield.
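A function-based sketch using @contextlib.contextmanager; the timer() name and events list are invented for illustration:

```python
from contextlib import contextmanager
import time

events = []

@contextmanager
def timer(label):
    """Time a block; the finally clause runs even if the body raises."""
    start = time.perf_counter()
    try:
        yield                    # control passes to the with-block body here
    finally:
        events.append((label, time.perf_counter() - start))

with timer("load"):
    sum(range(1000))

try:
    with timer("fail"):
        raise ValueError("bad batch")
except ValueError:
    pass

labels = [label for label, _ in events]
assert labels == ["load", "fail"]    # cleanup ran in both cases
```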
Generators produce values lazily with yield, using constant memory O(1) regardless of dataset size. Ideal for processing huge datasets or infinite streams.
+ | Concept | Data Science Use Case |
|---|---|
| Inheritance | BaseModel → LinearModel → LogisticRegression |
| Abstract Base Classes | Defining mandatory methods like fit()/predict() |
| Properties | Validating input parameters (e.g., learning rate > 0) |
| Dunder Methods | __call__ for making models callable, __getitem__ for datasets |
Classes are objects too! Classes define how instances behave; Metaclasses define how classes behave. Useful for registry patterns (auto-registering models) or enforcement of interface standards across a codebase. type is the default metaclass.
What's the difference between __str__ and __repr__?
+ Answer: __str__ is for end-users (informal, readable). __repr__ is for developers (detailed, unambiguous, "eval-able"). For data science, always implement __repr__ for models to show hyperparameters when printed.
Answer: C3 Linearization algorithm. It determines the search order for methods in multiple inheritance. Access it via ClassName.mro(). Python ensures that bases are searched after their subclasses and the order of bases in the class definition is preserved.
Answer: Several ways: (1) Overriding __new__, (2) Using a Metaclass (cleanest), (3) Module-level variables (simplest). Example with Metaclass: class Singleton(type): ... then class Database(metaclass=Singleton): ....
How do you implement a decorator with parameters, like @timer(unit='ms')?
+ Answer: This is a decorator factory. You need three levels of functions: (1) Factory takes parameters and returns a decorator, (2) Decorator takes the function and returns a wrapper, (3) Wrapper takes args/kwargs and executes the logic.
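A sketch of the three levels for the hypothetical @timer(unit='ms') factory (the wrapper.last attribute is an invented way to expose the measured time):

```python
import functools
import time

def timer(unit="s"):
    """Level 1 - factory: captures `unit` in the closures below."""
    scale = {"s": 1.0, "ms": 1000.0}[unit]

    def decorator(func):                   # level 2: takes the function
        @functools.wraps(func)             # preserve __name__/__doc__
        def wrapper(*args, **kwargs):      # level 3: runs each call
            start = time.perf_counter()
            result = func(*args, **kwargs)
            wrapper.last = (time.perf_counter() - start) * scale
            return result
        wrapper.last = None
        return wrapper
    return decorator

@timer(unit="ms")
def square(x):
    """Square a number."""
    return x * x

assert square(4) == 16
assert square.__name__ == "square"         # wraps preserved the metadata
assert square.last is not None and square.last >= 0
```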
What are *args and **kwargs, and when should you use them?
+ Answer: *args collects positional arguments into a tuple. **kwargs collects keyword arguments into a dictionary. Crucial for wrapping functions, implementing decorators, or creating flexible API interfaces like Scikit-learn's __init__(**params).
Explain the difference between is and ==.
+ Answer: == checks for equality (values are the same). is checks for identity (objects occupy the same memory address). Use is for Singletons like None or bool. Example: a = [1]; b = [1]; a == b is True, a is b is False.
Estimators implement fit(X, y), Transformers have transform(X), and Predictors have predict(X). This design allows for seamless swapping of models and preprocessing steps.
+ A Pipeline bundles preprocessing and modeling into a single object. Crucial Benefit: It ensures that transformers are fit only on the training fold during cross-validation, preventing information from the validation set (like mean/std) from "leaking" into training. Always use pipelines in production.
Most real-world data is a mix of types. ColumnTransformer allows you to apply different preprocessing pipelines to different columns (e.g., OneHotEncode categories, Scale numerics) and then concatenate them for the model.
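A minimal Pipeline + ColumnTransformer sketch, assuming scikit-learn is installed; the toy dataset is invented for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

X = pd.DataFrame({
    "age":  [22, 35, 58, 44, 31, 27],
    "city": ["ber", "par", "ber", "rom", "par", "rom"],
})
y = [0, 1, 1, 1, 0, 0]

# Different preprocessing per column type, concatenated for the model.
pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

clf = Pipeline([("pre", pre), ("model", LogisticRegression())])
clf.fit(X, y)              # scaler/encoder statistics come from train data only
preds = clf.predict(X)
assert len(preds) == 6
assert set(preds) <= {0, 1}
```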
| Metric | Use Case | Scikit-learn Name |
|---|---|---|
| F1-Score | Imbalanced classification (Precision-Recall balance) | f1_score |
| ROC-AUC | Probability ranking / classifier quality | roc_auc_score |
| MSE / MAE | Regression error magnitude | mean_squared_error |
| R2 Score | Variance explained by model | r2_score |
| Log Loss | Probabilistic predictions confidence | log_loss |
(1) K-Fold: standard, (2) Stratified K-Fold: for imbalanced data, (3) TimeSeriesSplit: for temporal data (preventing looking into the future), (4) GroupKFold: to ensure samples from the same group aren't split across train/test.
Why do you call fit_transform on train but only transform on test?
+ Answer: To prevent Data Leakage. Mean/variance for scaling must be learned ONLY from training data. Applying fit to test data uses future information about the test distribution, leading to overly optimistic results.
When should you use predict_proba instead of predict?
+ Answer: When you need the uncertainty of the model or need to adjust the decision threshold. For cost-sensitive problems (e.g., fraud), you might flag anything with >10% probability, rather than the default 50%.
+Answer: Underfitting (High Bias) happens when the model is too simple (e.g., linear on non-linear data). Overfitting (High Variance) happens when the model is too complex and captures noise. Regularization (Alpha/C parameters) is used to find the "sweet spot".
+Answer: (1) class_weight='balanced' inside estimators, (2) Stratified cross-validation, (3) Focus on Precision-Recall curves/AUC instead of Accuracy, (4) Resampling (using imblearn library which is Sklearn-compatible).
Answer: L1 adds an absolute-value penalty; it results in sparse models (coefficients become exactly zero), effectively performing feature selection. L2 adds a squared penalty; it shrinks coefficients towards zero but rarely to zero, good for handling multicollinearity.
+Autograd tracks every operation on tensors with requires_grad=True and automatically computes gradients using the chain rule during .backward().
+ Tensors are multi-dimensional arrays (like NumPy) but with two superpowers: (1) GPU Acceleration (move to 'cuda' or 'mps'), (2) Automatic Differentiation. Bridging to NumPy is zero-copy for CPU tensors.
+ +Every model in PyTorch inherits from nn.Module. You define parameters/layers in __init__ and the forward pass logic in forward(). This design promotes recursive composition โ models can contain other modules.
| Component | Responsibility |
|---|---|
| Dataset | Defines HOW to load a single sample (__getitem__) and total count (__len__) |
| DataLoader | Handles batching, shuffling, multi-process loading, and memory pinning |
| Transforms | On-the-fly augmentation (cropping, flipping, normalizing) |
Standard pattern: (1) Zero gradients, (2) Forward pass, (3) Compute Loss, (4) Backward pass (backprop), (5) Optimizer step. Don't forget model.train() and model.eval() to toggle dropout and batch norm behavior.
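The five-step loop as a runnable sketch, assuming PyTorch is installed; the toy regression data and hyperparameters are invented:

```python
import torch
from torch import nn

torch.manual_seed(0)
X = torch.randn(64, 3)
true_w = torch.tensor([[1.0], [-2.0], [0.5]])
y = X @ true_w + 0.1 * torch.randn(64, 1)   # linear target plus noise

model = nn.Linear(3, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

model.train()                   # enable train-mode behavior
for _ in range(200):
    opt.zero_grad()             # (1) clear accumulated gradients
    pred = model(X)             # (2) forward pass
    loss = loss_fn(pred, y)     # (3) compute loss
    loss.backward()             # (4) backprop
    opt.step()                  # (5) optimizer update

model.eval()                    # eval-mode for inference
with torch.no_grad():           # no graph needed when not training
    final = loss_fn(model(X), y).item()
assert final < 0.05             # fit should approach the 0.01 noise floor
```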
Why is optimizer.zero_grad() necessary?
+ Answer: By default, PyTorch accumulates gradients on every .backward() call. This is useful for RNNs or training with effectively larger batch sizes than memory allows. If you don't zero them out, gradients from previous batches will influence the current update, leading to incorrect training.
What's the difference between model.train() and model.eval()?
+ Answer: They set the mode for specific layers. .train() enables Dropout and Batch Normalization (calculates stats for current batch). .eval() disables dropout and uses running averages for Batch Norm. Forgetting .eval() during testing will lead to inconsistent/bad predictions.
Explain torch.no_grad().
+ Answer: It's a context manager that disables gradient calculation. Use it during inference or validation to save memory and compute resources. It prevents the creation of the computational graph for those operations.
+Answer: PyTorch (Dynamic graph) is more Pythonic, easier to debug with standard tools, and highly favored in research. TensorFlow (Static graph/Keras) historically had better deployment tools (TFLite, TFServing) and massive industry scale, though the gap has significantly narrowed with PyTorch 2.0 and TorchServe.
+Answer: Same as NumPy. If dimensions don't match, PyTorch automatically expands the smaller tensor (by repeating values) to match the larger one, provided they are compatible (trailing dimensions match or are 1). This happens without actual memory copying.
+tf.keras supports three ways to build models: (1) Sequential (simple stacks), (2) Functional (DAGs, multi-input/output), (3) Subclassing (full control).
+ Loading data is often the bottleneck. tf.data.Dataset enables "ETL" pipelines: Extract (from disk/cloud), Transform (shuffle, batch, repeat), Load (map to GPU). Concepts like prefetch and interleave ensure the GPU is never waiting for the CPU.
TensorFlow can convert Python code into a Static Computational Graph using @tf.function. This enables significant optimizations like constant folding and makes models exportable to environments without Python (C++, Java, JS).
| Component | Visualized metric |
|---|---|
| Scalars | Loss/Accuracy curves in real-time |
| Histograms | Weights/Gradients distribution (checking for vanishing/exploding) |
| Graphs | The internal model architecture |
| Projector | High-dimensional embeddings (t-SNE/PCA) |
TensorFlow Extended (TFX) is for end-to-end ML. Key components: TF Serving (for APIs), TF Lite (for mobile/edge), TFJS (for web browsers). TF Serving supports model versioning and A/B testing out of the box.
What are tf.function and AutoGraph?
+ Answer: tf.function is a decorator that converts a regular Python function into a TensorFlow static graph. AutoGraph is the internal tool that translates Python control flow (if, while) into TF graph ops. This allows for compiler-level optimizations and easy deployment without a Python environment.
What is tf.data.AUTOTUNE?
+ Answer: It allows TensorFlow to dynamically adjust the level of parallelism and buffer sizes based on your CPU/disk hardware. It ensures that data preprocessing (CPU) is always one step ahead of model training (GPU), preventing hardware starvation.
+Answer: Sequential: purely linear stacks. Functional: most common for production, supports non-linear topology (shared layers, multiple inputs/outputs). Subclassing: full control over the forward pass, best for complex research/custom logic. Functional is generally preferred for its balance of power and debugging ease.
Answer: (1) EarlyStopping callback, (2) Dropout layers, (3) L1/L2 kernel regularizers, (4) Data augmentation (via tf.image or keras.layers), (5) Learning rate schedules via callbacks.ReduceLROnPlateau.
Answer: The language-neutral, hermetic serialization format for TF models. It includes the model architecture, weights, and the computational graph (signatures). It is the standard format for TF Serving and TFLite conversion.
FastAPI uses async/await to handle concurrent requests without blocking, uses type hints for automatic validation, and generates interactive OpenAPI (Swagger) documentation. It is the gold standard for serving ML models today.
In production, you cannot trust input data. Pydantic enforces strict type checking and validation at runtime. If a JSON request arrives with a string instead of a float for a model feature, Pydantic catches it immediately and returns a clear error before the model ever sees it.
| Stage | Responsibility | Tools |
|---|---|---|
| Initialization | Loading model weights into memory (once) | FastAPI Lifespan |
| Inference | Preprocessing input and getting prediction | NumPy/Pydantic |
| Post-processing | Formatting prediction for the client | JSON/Protobuf |
| Observability | Logging latency, inputs, and drift | Prometheus/ELK |
Conda vs Pip: Pip is standard for Python; Conda is better for C-extensions/CUDA. Docker: Containerizing the environment ensures it "works on my machine" translates to "works in the cloud". Use lightweight base images (python:3.10-slim) to minimize security risks and build times.
(1) Unit tests for preprocessing logic, (2) integration tests for the API endpoints, (3) model-quality tests ensuring the model meets a minimum accuracy threshold on a benchmark dataset before deployment.
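A unit-test sketch for the first layer, using only the standard library (the `normalize` helper is hypothetical preprocessing logic):

```python
def normalize(values, lo=0.0, hi=1.0):
    """Min-max scale a list of numbers into [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_min == v_max:
        return [lo for _ in values]  # avoid division by zero on constant input
    span = v_max - v_min
    return [lo + (v - v_min) / span * (hi - lo) for v in values]

# pytest-style unit tests for the preprocessing logic
def test_normalize_range():
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]

def test_normalize_constant_input():
    assert normalize([3.0, 3.0]) == [0.0, 0.0]

test_normalize_range()
test_normalize_constant_input()
```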
Why choose FastAPI for model serving?
Answer: (1) Native async support (handles concurrent requests better), (2) automatically generates Swagger UI for testing, (3) Pydantic integration for data validation, (4) significantly higher throughput (approaching Go/Node.js levels), (5) built-in support for WebSockets and background tasks.
How do you manage model versions behind an API?
Answer: (1) URL versioning (/v1/predict), (2) a model registry (MLflow/SageMaker) with aliases like production or staging, (3) blue-green deployment: route traffic to the new version only after validation, (4) embed the model version in the API response metadata for debugging.
What is dependency hell and how do you avoid it?
Answer: It occurs when multiple libraries require conflicting versions of the same dependency. Solved by: (1) using virtual environments (venv/conda), (2) pinning exact versions in requirements.txt or poetry.lock, (3) Docker, to isolate the entire OS environment.
How do you log safely when data contains PII?
Answer: (1) PII masking: remove names/emails/IDs before logging, (2) hash sensitive fields if they are needed for troubleshooting, (3) log model metadata separately from raw data, (4) use specialized monitoring tools like Arize or whylogs for drift detection without full data capture.
How does CI/CD differ for ML projects?
Answer: Beyond standard code tests, ML CI/CD (MLOps) includes data validation (is the incoming data schema correct?), model validation (is accuracy >= 90%?), and automated deployment to staging for human-in-the-loop review.
Never optimize without measuring. (1) cProfile for function-level timing, (2) line_profiler for line-by-line analysis in "hot" functions, (3) memory_profiler to detect memory leaks and peak usage, (4) py-spy, a sampling profiler for zero-instrumentation production profiling.
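A minimal profiling sketch using the standard library's cProfile and pstats (the `slow_sum` function is a deliberately unoptimized placeholder):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # Deliberately unvectorized loop so it dominates the profile.
    total = 0
    for i in range(n):
        total += i * i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(200_000)
profiler.disable()

# Print the five most expensive calls, sorted by cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print("slow_sum" in report)  # the hot function shows up in the report
```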
Numba translates a subset of Python and NumPy code into fast machine code using LLVM. By simply adding @njit, you can achieve C/Fortran-like speeds for math-heavy loops that cannot be vectorized with pure NumPy.
| Concurrency model | Best for | Mechanism |
|---|---|---|
| Threading | I/O-bound (APIs, DBs) | Concurrent but not parallel (GIL) |
| Multiprocessing | CPU-bound (Training, Math) | True parallelism (separate OS processes) |
| asyncio | High-concurrency I/O | Single-threaded cooperative multitasking |
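The I/O-bound row of the table can be sketched with only the standard library; here time.sleep stands in for a network call, and the URLs are hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(url):
    # Simulate an I/O wait; the GIL is released while sleeping,
    # so other threads make progress.
    time.sleep(0.2)
    return f"{url}: ok"

urls = [f"https://example.com/img{i}" for i in range(4)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(fake_download, urls))
elapsed = time.perf_counter() - start

# Four 0.2 s waits overlap, so wall time is ~0.2 s, not ~0.8 s.
print(len(results), round(elapsed, 2))
```

A CPU-bound loop would show no such speedup with threads; that is when the table's multiprocessing row applies.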
Single Instruction, Multiple Data (SIMD) allows a CPU to perform the same operation on multiple data points in one clock cycle. Modern NumPy leverages AVX-512 and MKL/OpenBLAS to ensure your a + b is as fast as the hardware allows.
Cython is a superset of Python that compiles to C. It allows you to call C functions directly and use static typing. Use it for complex algorithms that require low-level memory control (e.g., custom tree models or graph algorithms).
Why does Python keep the GIL?
Answer: It simplifies the interpreter by making memory management (reference counting) thread-safe without granular locks. It also keeps single-threaded code fast and C-extension integration simple. Removing it is difficult because it effectively requires reworking the interpreter (see PEP 703, the free-threaded "no-GIL" build shipped experimentally in Python 3.13).
How do you speed up a slow Python loop?
Answer: (1) Vectorize with NumPy (broadcasting), (2) if the logic is too complex for NumPy, use Numba JIT, (3) use Cython if you need C-level types, (4) use multiprocessing if the iterations are independent and CPU-bound.
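A sketch of step (1), assuming NumPy is installed: the loop and the broadcast expression compute the same values, but the vectorized form runs in a single compiled C loop.

```python
import numpy as np

values = np.arange(1_000, dtype=np.float64)

# Pure-Python loop: one interpreter round trip per element.
loop_result = []
for v in values:
    loop_result.append(v * 2.0 + 1.0)

# Vectorized: one broadcast expression, evaluated in C.
vec_result = values * 2.0 + 1.0

print(np.allclose(loop_result, vec_result))  # True
```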
Deterministic vs sampling profilers?
Answer: cProfile is a deterministic profiler; it hooks into every function call. While very accurate, it adds significant overhead (sometimes a 2x slowdown). For production systems, sampling profilers (like py-spy) are better: they only inspect the stack every few milliseconds, adding negligible overhead.
When should you prefer threading over multiprocessing?
Answer: For I/O-bound tasks (network/disk). Threading has much lower overhead (shared memory) than multiprocessing (separate memory spaces, requiring serialization/pickling of data between processes). For downloading 1000 images, threads are superior.
Why are NumPy arrays faster than Python lists?
Answer: CPUs are fastest when accessing contiguous memory (spatial locality). NumPy's C-contiguous arrays ensure that when one value is loaded into the CPU cache, its neighbors come along with it, minimizing cache misses compared to Python lists of scattered objects.