Python is a dynamically typed, garbage-collected, interpreted language with a C-based reference implementation (CPython). Everything is an object: integers, functions, even classes. Understanding this object model is what separates beginners from professionals.
+
+
+
1. Data Structures – Complete Reference

| Type | Mutable | Ordered | Hashable | Use Case |
|---|---|---|---|---|
| list | Yes | Yes | No | Sequential data, time series, feature lists |
| tuple | No | Yes | Yes (if elements are) | Fixed records, dict keys, DataFrame rows |
| dict | Yes | Yes (3.7+) | No | Lookup tables, JSON, config, caches |
| set | Yes | No | No | Unique values, membership testing O(1) |
| frozenset | No | No | Yes | Immutable set, usable as dict keys |
| deque | Yes | Yes | No | O(1) append/pop both ends, sliding windows |
| bytes | No | Yes | Yes | Binary data, serialization, network I/O |
| bytearray | Yes | Yes | No | Mutable binary buffers |
+
+
+
2. Time Complexity – What Every Dev Must Know

| Operation | list | dict | set |
|---|---|---|---|
| Lookup by index/key | O(1) | O(1) | N/A |
| Search (x in ...) | O(n) | O(1) | O(1) |
| Insert/Append | O(1) end, O(n) middle | O(1) | O(1) |
| Delete | O(n) | O(1) | O(1) |
| Sort | O(n log n) | N/A | N/A |
| Iteration | O(n) | O(n) | O(n) |
+
+
Real-world impact: Checking if an item exists in a list of 1M elements = ~50ms. In a set = ~0.00005ms. That's 1,000,000x faster. Always use sets/dicts for membership testing.
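A quick way to see this yourself (a minimal sketch; exact numbers depend on hardware):

import timeit

items = list(range(1_000_000))
as_set = set(items)

# Worst case: the element is absent, so the list scans all 1M entries.
list_time = timeit.timeit(lambda: -1 in items, number=10)
set_time = timeit.timeit(lambda: -1 in as_set, number=10)

print(f"list membership: {list_time:.4f}s for 10 lookups")
print(f"set membership:  {set_time:.6f}s for 10 lookups")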
+
+
3. Python Memory Model
+
+
⚡ Everything Is an Object on the Heap

Variables are references (pointers), not boxes. a = [1,2,3] creates a list on the heap; a points to it. b = a makes both names point to the same list. This is aliasing – the #1 source of bugs in beginner Python code.

Reference Counting: Each object tracks how many names reference it. When the count reaches 0, it is freed immediately. del decrements the count; it doesn't necessarily free memory.

Integer Interning: Python caches the integers -5 to 256. So a = 100; b = 100; a is b → True. But a = 1000; b = 1000; a is b → may be False. Never use is for value comparison.

Garbage Collection: 3 generations (gen0, gen1, gen2). New objects start in gen0; survivors are promoted. Use gc.collect() after deleting large ML models.
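A short sketch of aliasing and identity vs. equality (small-int caching behavior can vary slightly between CPython versions):

import copy

a = [1, 2, 3]
b = a                 # aliasing: both names point to the same list
b.append(4)
print(a)              # [1, 2, 3, 4] – a changed too

c = copy.copy(a)      # shallow copy: new outer list, shared inner objects
d = copy.deepcopy(a)  # deep copy: everything duplicated

x, y = 1000, 1000
print(x == y)         # True – same value
print(x is y)         # may be False – no interning above 256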
+
+
4. Generators & Iterators – The Heart of Python

Lazy Evaluation

yield suspends state, return terminates. A list of 1B items = ~8 GB; a generator = ~100 bytes. The iterator protocol: any object with __iter__ + __next__. Generator expressions: (x**2 for x in range(10**9)) → O(1) memory.

yield from: delegates to a sub-generator and forwards send() and throw(). Essential for building composable data pipelines.

send(): two-way communication with generators (coroutines). value = yield result both receives and produces values.
+
+
5. Closures & First-Class Functions
+
Functions are first-class objects: they can be passed as arguments, returned, and assigned. A closure captures variables from its enclosing scope. This is the foundation of decorators, callbacks, and functional programming.
+
+
6. Critical Python Gotchas for Projects
+
+
⚠️ The 5 Deadliest Python Traps
 1. Mutable Default Args: def f(x, lst=[]) → the list is shared across ALL calls. Fix: lst=None.
 2. Late Binding Closures: [lambda: i for i in range(5)] → all return 4! Fix: lambda i=i: i.
 3. Shallow Copy: list(a) copies the outer list but shares inner objects.
 4. String Concatenation: s += "text" in a loop creates a new string every time → O(n²). Use ''.join(parts).
 5. Circular Imports: module A imports B, B imports A → ImportError. Fix: restructure or lazy import.
 (Traps 1 and 2 are illustrated in the sketch below.)
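A minimal sketch of traps 1 and 2 and their standard fixes:

# Trap 1: mutable default argument – the default list is created once, at definition time.
def append_bad(x, lst=[]):
    lst.append(x)
    return lst

def append_good(x, lst=None):
    if lst is None:        # fresh list on every call
        lst = []
    lst.append(x)
    return lst

print(append_bad(1), append_bad(2))    # [1, 2] [1, 2] – shared!
print(append_good(1), append_good(2))  # [1] [2]

# Trap 2: late-binding closures – i is looked up when the lambda runs, not when it is defined.
bad = [lambda: i for i in range(5)]
good = [lambda i=i: i for i in range(5)]
print([f() for f in bad])   # [4, 4, 4, 4, 4]
print([f() for f in good])  # [0, 1, 2, 3, 4]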
+
+
+
7. Error Handling for Production Projects
+
+
🛡️ Exception Hierarchy You Must Know

BaseException → Exception (catch this) → ValueError, TypeError, KeyError, FileNotFoundError, ConnectionError...
Rules: (1) Never use a bare except:. (2) Catch specific exceptions. (3) Use else for the success path. (4) finally always runs. (5) Create custom exceptions for your project.
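A sketch of those rules in one function (DataValidationError and load_config are illustrative names, not from the original):

import json
import logging

class DataValidationError(Exception):
    """Project-specific exception (illustrative)."""

def load_config(path):
    try:
        with open(path) as f:
            config = json.load(f)
    except FileNotFoundError:
        logging.error("Config file missing: %s", path)
        raise
    except json.JSONDecodeError as exc:
        raise DataValidationError(f"Invalid JSON in {path}") from exc
    else:
        # Success path: only runs if no exception was raised.
        return config
    finally:
        # Always runs – cleanup, metrics, etc.
        logging.debug("Attempted to load %s", path)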
+
+
+
+
8. collections Module – Power Tools

| Class | Purpose | Project Use Case |
|---|---|---|
| defaultdict | Dict with default factory | Group data: defaultdict(list) |
| Counter | Count hashable objects | Label distribution, word frequency |
| namedtuple | Lightweight immutable class | Return multiple named values |
| deque | Double-ended queue | Sliding window, BFS, ring buffer |
| ChainMap | Stack multiple dicts | Config layers: defaults → env → CLI |
| OrderedDict | Ordered dict (legacy) | move_to_end() for LRU cache |
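A small sketch of the two most common ones (the records list is made up for illustration):

from collections import defaultdict, Counter

records = [("cat", 3), ("dog", 1), ("cat", 7), ("bird", 2)]

# defaultdict(list): group values by key without checking "if key in dict" first.
groups = defaultdict(list)
for label, value in records:
    groups[label].append(value)
print(dict(groups))           # {'cat': [3, 7], 'dog': [1], 'bird': [2]}

# Counter: frequency of hashable items.
labels = Counter(label for label, _ in records)
print(labels.most_common(2))  # [('cat', 2), ('dog', 1)]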
+
+
+
9. itertools – Memory-Efficient Pipelines

| Function | What It Does | Project Use |
|---|---|---|
| chain() | Concatenate iterables lazily | Merge data files |
| islice() | Slice any iterator | Take first N from a generator |
| groupby() | Group consecutive elements | Process sorted logs by date |
| product() | Cartesian product | Hyperparameter grid |
| combinations() | All r-length combos | Feature interaction pairs |
| starmap() | map() with unpacked args | Apply function to paired data |
| accumulate() | Running accumulator | Cumulative sums, running max |
| tee() | Clone iterator N times | Multiple passes over a stream |
+
+
+
10. File I/O for Real Projects

| Format | Read | Write | Best For |
|---|---|---|---|
| JSON | json.load(f) | json.dump(obj, f) | Configs, API responses |
| CSV | csv.DictReader(f) | csv.DictWriter(f) | Tabular data (small) |
| YAML | yaml.safe_load(f) | yaml.dump(obj, f) | Config files |
| Pickle | pickle.load(f) | pickle.dump(obj, f) | Python objects, models |
| Parquet | pd.read_parquet() | df.to_parquet() | Large DataFrames (fast) |
| SQLite | sqlite3.connect() | SQL queries | Local database |
+
+
+
11. pathlib – Modern File Handling
+
Stop using os.path.join(). Use pathlib.Path: Path('data') / 'train' / 'images'. Methods: .glob(), .read_text(), .mkdir(parents=True), .exists(), .suffix, .stem. Cross-platform, readable, powerful.
f-strings (3.6+): f"{accuracy:.2%}" → "95.23%". f"{x=}" (3.8+) → "x=42" for debugging. f"{name!r}" shows the repr. regex: re.compile(pattern) for repeated use, re.sub() for cleaning, re.findall() for extraction. Always compile patterns used in loops.
+
+
15. Command-Line Interface (CLI) Tools
+
argparse: Built-in CLI parsing. click: Decorator-based, more Pythonic. typer: Modern, uses type hints. Every production project needs a CLI for: training, evaluation, data processing, deployment scripts.
+
`,
+ code: `
+
+
💻 Python Fundamentals – Project Code
+
+
1. Generator Pipeline – Process Any Size Data
+
import json
from pathlib import Path
from itertools import islice

def read_jsonl(filepath):
    """Read a JSON Lines file lazily – handles any size."""
    with open(filepath) as f:
        for line in f:
            yield json.loads(line.strip())

def filter_records(records, min_score=0.5):
    for rec in records:
        if rec.get('score', 0) >= min_score:
            yield rec

def batch(iterable, size=64):
    """Batch any iterable into fixed-size chunks."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk

# Compose into a pipeline – still O(1) memory!
pipeline = batch(filter_records(read_jsonl("data.jsonl")), size=32)
for chunk in pipeline:
    process(chunk)  # Only 32 records in memory at a time (process() is project code)
+
+
2. Coroutine Pattern – Running Statistics
+
def running_stats():
    """Coroutine that computes a running mean & variance."""
    n = 0
    mean = 0.0
    M2 = 0.0
    while True:
        x = yield {'mean': mean, 'var': M2 / n if n > 0 else 0.0, 'n': n}
        n += 1
        delta = x - mean
        mean += delta / n
        M2 += delta * (x - mean)  # Welford's algorithm – numerically stable

stats = running_stats()
next(stats)       # Prime the coroutine
stats.send(10)    # {'mean': 10.0, 'var': 0.0, 'n': 1}
stats.send(20)    # {'mean': 15.0, 'var': 25.0, 'n': 2}
Q1: Lists vs tuples – when do you use each?
Answer: Tuples: immutable, hashable (usable as dict keys), less memory. Lists: mutable, growable. Use tuples for fixed data (coordinates, config); use lists for collections that change. Tuples signal "this shouldn't be modified."
+
Q2: How does Python's GIL affect DS?
Answer: GIL prevents multi-threading for CPU-bound Python. But NumPy/Pandas release the GIL during C operations. For pure Python CPU work → multiprocessing. For I/O → threading works. For data science, the GIL rarely matters.
+
Q3: Shallow vs deep copy?
Answer: copy.copy(): outer container copied, inner objects shared. copy.deepcopy(): everything copied recursively. Real trap: df2 = df is NOT a copy – it's aliasing. Use df.copy().
+
Q4: What is the mutable default argument trap?
Answer: def f(x, lst=[]) – the default list is created ONCE and shared. Fix: lst=None; if lst is None: lst = []. The #1 Python interview gotcha.
+
Q5: Why are generators critical for large data?
Answer: O(1) memory. 1B items as list = 8GB. As generator = 100 bytes. Use for: file processing, streaming, batch training. yield from for composition.
+
Q6: Explain LEGB scope rule.
Answer: Name lookup order: Local → Enclosing → Global → Built-in. nonlocal for enclosing scope, global for module. list = [1] shadows built-in list().
+
Q7: How to handle a 10GB CSV?
Answer: (1) pd.read_csv(chunksize=N), (2) usecols=['needed'], (3) dtype={'col':'int32'}, (4) Dask, (5) DuckDB for SQL on CSV, (6) Polars for Rust-speed.
+
Q8: Dict lookup O(1) vs list search O(n)?
Answer: Dicts use hash tables. Key → hash → slot index. O(1) average. Lists scan linearly. x in set is O(1) but x in list is O(n). For 1M items: microseconds vs milliseconds.
+
Q9: Explain Python's garbage collection.
Answer: (1) Reference counting – freed at count=0. (2) Cyclic GC – detects A→B→A cycles. 3 generations. gc.collect() after deleting large models.
+
Q10: What is __slots__?
Answer: Replaces per-instance __dict__ with fixed array. ~40% memory savings. Use for millions of small objects. Trade-off: no dynamic attributes.
+
Q11: How do you structure a Python project?
Answer:src/package/ layout. pyproject.toml for config. tests/ with pytest. configs/ for YAML. Makefile for common commands. Separate data, models, training, serving.
+
Q12: What's the difference between is and ==?
Answer:== checks value equality. is checks identity (same memory). Use is only for singletons: x is None, x is True. Integer interning makes 256 is 256 True but 1000 is 1000 may be False.
+
`
+ },
+
+"numpy": {
+ concepts: `
+
+
🔢 NumPy – Complete Deep Dive

⚡ Why NumPy Is 50-100x Faster

(1) Contiguous memory – CPU cache-friendly. (2) Compiled C loops. (3) SIMD instructions – 4-8 floats processed simultaneously. A Python list is an array of pointers to objects; a NumPy array is raw typed data in one block.
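A quick sketch of the difference (timings are illustrative and hardware-dependent):

import timeit
import numpy as np

xs = list(range(1_000_000))
arr = np.arange(1_000_000, dtype=np.float64)

loop_time = timeit.timeit(lambda: [x * 2.5 for x in xs], number=10)
vec_time = timeit.timeit(lambda: arr * 2.5, number=10)

print(f"Python loop:  {loop_time:.3f}s")
print(f"NumPy vector: {vec_time:.3f}s")  # typically tens of times faster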
+
+
+
1. ndarray Internals

| Feature | Python List | NumPy ndarray |
|---|---|---|
| Storage | Pointers to objects | Contiguous typed data |
| Memory per int | ~28 bytes + pointer | 8 bytes (int64) |
| Operations | Python loop | Compiled C/Fortran |
| SIMD | Impossible | CPU vector instructions |
+
+
+
2. Memory Layout & Strides
+
+
🧠 Strides = The Secret Behind Views

Every ndarray has strides – the number of bytes to jump in each dimension. For a (3,4) float64 array: strides = (32, 8). Slicing creates views (no copy) by adjusting strides; arr[::2] doubles the row stride. C order (row-major): rows contiguous. Fortran order: columns contiguous. Iterate along the last axis for best performance.
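A small sketch showing strides and the view/copy distinction:

import numpy as np

a = np.zeros((3, 4), dtype=np.float64)
print(a.strides)        # (32, 8): 4 elements * 8 bytes per row, 8 bytes per column step

view = a[::2]           # view: shares memory, only shape/strides change
print(view.strides)     # (64, 8): row stride doubled
view[0, 0] = 1.0
print(a[0, 0])          # 1.0 – the original changed too

copied = a[::2].copy()  # explicit copy: independent memory
copied[0, 1] = 5.0
print(a[0, 1])          # still 0.0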
+
+
+
3. Broadcasting Rules
+
+
🎯 Rules (Right to Left)

Two arrays are compatible when, for each trailing dimension, the dims are equal OR one of them is 1. (5,3,1) + (1,4) → (5,3,4). The size-1 dims stretch virtually – no memory is copied. Common pattern: X - X.mean(axis=0), i.e. (1000,5) - (5,), just works.
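A sketch of the rule in action:

import numpy as np

X = np.random.default_rng(0).normal(size=(1000, 5))

# (1000, 5) - (5,): the 1-D mean is broadcast across all 1000 rows.
X_centered = X - X.mean(axis=0)
print(X_centered.mean(axis=0).round(6))   # ~[0. 0. 0. 0. 0.]

a = np.ones((5, 3, 1))
b = np.ones((1, 4))
print((a + b).shape)                      # (5, 3, 4)

row_means = X.mean(axis=1)                # shape (1000,)
# X - row_means would fail; reshape to a column first:
X_row_centered = X - row_means[:, None]   # (1000, 5) - (1000, 1)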
np.linalg.norm(X, axis=1) – L2 norms for distances

np.linalg.lstsq(X, y) – stable linear regression

np.linalg.inv() – AVOID! Use solve() instead (numerically stable)
+
+
+
8. Random Number Generation
+
Modern: rng = np.random.default_rng(42) (NumPy 1.17+). PCG64 algorithm, thread-safe. Old np.random.seed(42) is global, not thread-safe. Always use default_rng() in projects.
+
+
9. Image Processing with NumPy
+
Images are just 3D arrays: (height, width, channels). Crop: img[100:200, 50:150]. Resize: scipy. Normalize: img / 255.0. Augment: flip img[:, ::-1], rotate with scipy.ndimage. Foundation of all computer vision.
Q3: Explain broadcasting rules.
Answer: Right-to-left: dims must be equal or one must be 1. (3,1) + (1,4) → (3,4). No memory copied. Gotcha: (3,) + (3,4) fails – reshape to (3,1).
+
Q4: axis=0 vs axis=1?
Answer: axis=0: operate down rows (collapse rows). axis=1: across columns (collapse columns). (100,5): mean(axis=0) → (5,). mean(axis=1) → (100,).
+
Q5: Implement PCA with NumPy?
Answer: Center, compute covariance, eigendecompose (eigh), sort by eigenvalue, project onto top-k eigenvectors. Or SVD directly.
+
Q6: np.dot vs @ vs einsum?
Answer: @: clean, broadcasts. np.dot: confusing for 3D+. einsum: most flexible, any tensor op. Use @ for readability.
+
Q7: How to handle NaN?
Answer:np.isnan() detects. np.nanmean() ignores NaN. Gotcha: NaN == NaN is False (IEEE 754).
+
Q8: C-order vs Fortran-order?
Answer: C: rows contiguous (default). Fortran: columns contiguous (LAPACK/BLAS). Iterate last axis for speed. Convert: np.asfortranarray().
+
`
+},
+
+"pandas": {
+ concepts: `
+
+
🐼 Pandas – Complete Deep Dive

⚡ DataFrame Internals – BlockManager

A DataFrame is NOT a 2D array. It uses a BlockManager: same-dtype columns are stored in contiguous blocks. Column operations are fast (same block); row iteration is slow (crosses blocks). This is why df.iterrows() is ~100x slower than vectorized ops.
+
+
+
1. The Golden Rules
+
+
⚠️ 5 Rules That Prevent 90% of Pandas Bugs
 1. Use .loc (label) and .iloc (position) – never chained indexing.
 2. df.loc[0:5] includes 5; df.iloc[0:5] excludes 5.
 3. df[mask]['col'] = x writes to a copy. Use df.loc[mask, 'col'] = x. (See the sketch below.)
 4. df2 = df is NOT a copy. Use df2 = df.copy().
 5. Always check df.dtypes and df.isna().sum() first.
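A minimal sketch of rule 3 (chained indexing vs a single .loc call):

import pandas as pd

df = pd.DataFrame({"score": [0.2, 0.8, 0.5], "label": ["a", "b", "c"]})
mask = df["score"] > 0.4

# Chained indexing: df[mask] may be a temporary copy, so the assignment can be lost
# (SettingWithCopyWarning, or an error under Copy-on-Write).
# df[mask]["label"] = "high"

# Correct: one .loc call addresses rows and column together.
df.loc[mask, "label"] = "high"
print(df)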
+
+
+
2. GroupBy – Split-Apply-Combine

The most powerful Pandas operation: (1) Split → (2) Apply a function → (3) Combine results. GroupBy is lazy – no computation happens until you aggregate. Key methods:

| Method | Output Shape | Use Case |
|---|---|---|
| agg() | Reduced (one row per group) | Sum, mean, count per group |
| transform() | Same as input | Fill with group mean, normalize within group |
| filter() | Subset of groups | Keep groups with N > 100 |
| apply() | Flexible | Custom function per group |
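A sketch of agg vs transform (the sales frame is made up for illustration):

import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "revenue": [10.0, 14.0, 7.0, None, 9.0],
})

# agg: one row per group
per_store = sales.groupby("store")["revenue"].agg(["mean", "count"])
print(per_store)

# transform: same length as the input – great for group-wise fills
sales["revenue_filled"] = (
    sales.groupby("store")["revenue"].transform(lambda s: s.fillna(s.mean()))
)
print(sales)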
+
+
+
3. Pandas 2.0 – Major Changes

| Feature | Before (1.x) | After (2.0+) |
|---|---|---|
| Backend | NumPy only | Apache Arrow option |
| Copy semantics | Confusing | Copy-on-Write |
| String dtype | object | string[pyarrow] (faster) |
| Nullable types | NaN for everything | pd.NA (proper null) |
+
+
+
4. Polars vs Pandas

| Feature | Pandas | Polars |
|---|---|---|
| Speed | 1x | 5-50x (Rust) |
| Parallelism | Single-threaded | Multi-threaded automatically |
| API | Eager | Lazy + eager |
| Ecosystem | Massive | Growing fast |
| Use when | EDA, small-to-medium data, legacy | Large data, production |
+
+
+
5. Merge/Join Patterns

| Method | How | When |
|---|---|---|
| merge() | SQL-style joins on columns | Combine tables on shared keys |
| join() | Joins on index | Index-based combining |
| concat() | Stack along an axis | Append rows/columns |

Common pitfall: a merge that produces more rows than expected = a many-to-many join. Always check len(merged) vs len(left).
+
+
6. Memory Optimization Strategies

| Strategy | Savings | When |
|---|---|---|
| Category dtype | 90%+ | Few unique strings |
| Downcast numerics | 50-75% | int64 → int32/int16 |
| Sparse arrays | 80%+ | Mostly zeros/NaN |
| PyArrow backend | 30-50% | String-heavy data |
| Read only needed columns | Variable | usecols=['a','b'] |
+
+
+
7. Window Functions for Time Series
+
.rolling(N): fixed sliding window. .expanding(): cumulative. .ewm(span=N): exponentially weighted. All support .mean(), .std(), .apply(). Essential for: lag features, moving averages, volatility, Bollinger bands.
+
+
8. Pivot Tables & Crosstab
+
df.pivot_table(values, index, columns, aggfunc) – summarize data by two categorical dimensions. pd.crosstab() – frequency table of two categorical columns. Essential for EDA and business reporting.
+
+
9. Method Chaining Pattern
+
Fluent API: .assign() instead of df['col'] = .... .pipe(func) for custom steps. .query('col > 5') for readable filters. No intermediate variables = cleaner, reproducible pipelines.
Q2: merge vs join vs concat?
Answer: merge: SQL joins on columns. join: on index. concat: stack along an axis. Use merge for column joins, concat for appending.
+
Q3: apply vs map vs transform?
Answer: map: Series element-wise. apply: rows/columns. transform: same-shape output. All slow β prefer vectorized when possible.
+
Q4: GroupBy transform vs agg?
Answer: agg reduces. transform broadcasts back. Use transform for "fill with group mean" or "normalize within group" patterns.
+
Q5: How to handle missing data?
Answer: (1) dropna(thresh=N), (2) fillna(method='ffill') for time series, (3) fillna(df.median()) for ML, (4) interpolate(method='time'). Always check df.isna().sum() first.
+
Q6: Pandas vs Polars?
Answer: Polars: 5-50x faster (Rust), multi-threaded, lazy eval. Pandas: mature ecosystem, wide compatibility. New projects with big data β Polars.
+
Q7: What is MultiIndex?
Answer: Hierarchical indexing. Use for pivot tables, panel data. Access with .xs() or tuple. Reset with .reset_index().
+
Q8: How to optimize a 5GB DataFrame?
Answer: (1) Read only needed columns. (2) Downcast dtypes. (3) Category for strings. (4) Sparse for zeros. (5) PyArrow backend. (6) Process in chunks. Can reduce 5GB to 1GB.
+
`
+},
+
+"visualization": {
+ concepts: `
+
+
📊 Data Visualization – Complete Guide

⚡ The Grammar of Graphics

Data + Aesthetics (x, y, color, size) + Geometry (bars, lines, points) + Statistics (binning, smoothing) + Coordinates (cartesian, polar) + Facets (subplots). Every chart = this framework.
+
+
+
1. Choosing the Right Chart

| Question | Chart Type | Library |
|---|---|---|
| Distribution? | Histogram, KDE, Box, Violin | Seaborn |
| Relationship? | Scatter, Hexbin, Regression | Seaborn/Plotly |
| Comparison? | Bar, Grouped bar, Violin | Seaborn |
| Trend over time? | Line, Area chart | Plotly/Matplotlib |
| Correlation? | Heatmap | Seaborn |
| Part of whole? | Pie, Treemap, Sunburst | Plotly |
| Geographic? | Choropleth, Mapbox | Plotly/Folium |
| High-dimensional? | Parallel coords, UMAP | Plotly |
| ML results? | Confusion matrix, ROC, SHAP | Seaborn/SHAP |
+
+
+
2. Matplotlib Architecture
+
Three layers: Backend (rendering), Artist (everything drawn), Scripting (pyplot). Figure → Axes (subplots) → Axis objects. Always use the OO API: fig, ax = plt.subplots().

rcParams: global defaults, e.g. plt.rcParams['font.size'] = 14. Create style files for project consistency: plt.style.use('seaborn-v0_8-whitegrid').
+
+
3. Color Theory for Data
+
+
💡 Color Guide
 Sequential: viridis, plasma (low → high).
 Diverging: RdBu, coolwarm (when the center matters).
 Categorical: Set2, tab10 (distinct groups).
 Never use rainbow/jet – bad for colorblind readers and perceptually non-uniform.
+
+
+
4. Seaborn β Statistical Visualization
+
Three API levels: Figure-level (relplot, catplot, displot), Axes-level (scatterplot, boxplot), Objects API (0.12+). Auto-computes regression lines, confidence intervals, density estimates.
+
+
5. Plotly β Interactive Dashboards
+
JavaScript-powered: hover, zoom, selection. plotly.express for quick plots. plotly.graph_objects for control. Integrates with Dash for production dashboards. Supports 3D, maps, animations. Export to HTML.
+
+
6. Visualization for ML Projects

| What to Visualize | Chart | Why |
|---|---|---|
| Class distribution | Bar chart | Detect imbalance |
| Feature distributions | Histogram/KDE grid | Find skew, outliers |
| Feature correlations | Heatmap (triangular) | Multicollinearity |
| Training curves | Line plot (loss/acc vs epoch) | Detect overfit/underfit |
| Model comparison | Box plot of CV scores | Compare variance |
| Confusion matrix | Annotated heatmap | Error analysis |
| ROC curve | Line plot + AUC | Threshold selection |
| Feature importance | Horizontal bar | Model interpretation |
| SHAP values | Beeswarm/waterfall | Individual predictions |
+
+
+
7. Common Mistakes
+
+
- Truncated y-axis exaggerating differences
- Pie charts for >5 categories – use a bar chart instead
- Rainbow/jet colormap – use viridis/cividis
- Overplotting – use alpha, hexbin, KDE, or datashader
- Missing labels, titles, units
- 3D charts without interaction – often misleading
- Not saving high-DPI figures – use dpi=300
+
+
`,
+ code: `
+
+
💻 Visualization Project Code
+
+
1. Publication-Quality Multi-Subplot Figure
+
import matplotlib.pyplot as plt
+import numpy as np
+
+# Professional style setup
+plt.rcParams.update({
+ 'font.size': 12, 'axes.titlesize': 14,
+ 'figure.facecolor': 'white',
+ 'axes.spines.top': False, 'axes.spines.right': False
+})
+
+fig, axes = plt.subplots(2, 2, figsize=(14, 10))
+
+# Distribution
+axes[0,0].hist(data, bins=30, alpha=0.7, color='steelblue', edgecolor='white')
+axes[0,0].axvline(data.mean(), color='red', linestyle='--', label='Mean')
+axes[0,0].set_title('Distribution')
+
+# Scatter with colormap
+sc = axes[0,1].scatter(x, y, c=z, cmap='viridis', alpha=0.7)
+plt.colorbar(sc, ax=axes[0,1])
+
+# Line with confidence interval
+axes[1,0].plot(x, y_mean, 'b-', linewidth=2)
+axes[1,0].fill_between(x, y_mean-y_std, y_mean+y_std, alpha=0.2)
+
+# Bar with error bars
+axes[1,1].bar(categories, values, yerr=errors, capsize=5, color='coral')
+
+plt.tight_layout()
+plt.savefig('figure.png', dpi=300, bbox_inches='tight')
Level 1: Simple wrapper (timing, logging). Level 2: With arguments (factory). Level 3: Class-based with state. Always use functools.wraps.
+
+
Common patterns: Retry with exponential backoff, caching, rate limiting, authentication, input validation, deprecation warnings.
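A minimal Level-2 sketch (a decorator factory with arguments; retry_on_error is an illustrative name):

import functools
import time

def retry_on_error(max_attempts=3, delay=1.0):
    """Decorator factory: retries the wrapped function with a fixed delay."""
    def decorator(func):
        @functools.wraps(func)              # preserves __name__, __doc__, etc.
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator

@retry_on_error(max_attempts=5, delay=0.5)
def flaky_download(url):
    ...  # network call that may fail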
+
+
2. Context Managers
+
Guarantee resource cleanup. Two approaches: (1) Class-based (__enter__/__exit__), (2) @contextlib.contextmanager with yield. Use for: files, DB connections, GPU locks, temporary settings, timers.
+
+
3. Dataclasses vs namedtuple vs Pydantic vs attrs

| Feature | namedtuple | dataclass | Pydantic | attrs |
|---|---|---|---|---|
| Mutable | No | Yes | Yes (v2) | Yes |
| Validation | No | No | Yes (automatic) | Yes (validators) |
| JSON | No | No | Yes (built-in) | via cattrs |
| Performance | Fastest | Fast | Medium | Fast |
| Use for | Records | Data containers | API models | Complex classes |
+
+
+
4. Type Hints – Complete Guide

🎯 Why Type Hints Matter for Projects

They enable IDE autocompletion, mypy static analysis, self-documenting code, and Pydantic validation. Python doesn't enforce them at runtime – they're for tools and humans.

| Hint | Meaning | Example |
|---|---|---|
| list[int] | List of ints (3.9+) | scores: list[int] = [] |
| dict[str, Any] | Dict with str keys | config: dict[str, Any] |
| int \| None | Optional (3.10+) | x: int \| None = None |
| Callable[[int], str] | Function type | Callbacks |
| TypeVar | Generic | Generic containers |
| Literal | Exact values | Literal['train','test'] |
| TypedDict | Dict with typed keys | JSON schemas |
+
+
+
5. async/await β Concurrent I/O
+
For I/O-bound tasks: API calls, DB queries, file reads. NOT for CPU (use multiprocessing). Event loop manages coroutines cooperatively. asyncio.gather() runs concurrently. Game changer: 100 API calls in ~1s vs 100s sequentially.
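A sketch of concurrent I/O with the standard library only (asyncio.sleep stands in for a real network call):

import asyncio

async def fetch(item_id: int) -> dict:
    # Simulated I/O (e.g. an API call); real code would await an HTTP client here.
    await asyncio.sleep(0.5)
    return {"id": item_id, "status": "ok"}

async def main():
    # 100 "calls" run concurrently: total time ~0.5s instead of ~50s sequentially.
    results = await asyncio.gather(*(fetch(i) for i in range(100)))
    print(len(results), results[0])

asyncio.run(main())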
+
+
6. Design Patterns for ML Projects

| Pattern | Use Case | Python Implementation |
|---|---|---|
| Strategy | Swap algorithms | Pass a function/class as an argument |
| Factory | Create objects by name | Registry dict: models['rf'] |
| Observer | Training callbacks | Event system with hooks |
| Pipeline | Data transformations | Chain of fit → transform |
| Singleton | Model cache, DB pool | Module-level or metaclass |
| Template | Training loop | ABC with abstract methods |
| Registry | Auto-register models | Class decorator + dict |
+
+
+
7. Descriptors β How @property Works
+
Any object implementing __get__/__set__/__delete__. @property is a descriptor. Control attribute access at class level. Used in Django ORM, SQLAlchemy, dataclass fields.
+
+
8. Metaclasses β Advanced
+
Classes are objects. Metaclasses define how classes behave. type is the default. Use for: auto-registration, interface enforcement, singleton. Most should use class decorators instead.
+
+
9. __slots__ for Memory Efficiency
+
Replaces __dict__ with fixed array. ~40% memory savings per instance. Use for millions of small objects. Trade-off: no dynamic attributes.
+
+
10. Multiprocessing for CPU-Bound Work
+
multiprocessing.Pool or concurrent.futures.ProcessPoolExecutor. Each process has its own GIL. Share data via: multiprocessing.Queue, shared_memory, or serialize (pickle). Overhead: process creation ~100ms. Only use for expensive computations.
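A minimal sketch of CPU-bound parallelism with the standard library (heavy_feature is an illustrative function):

from concurrent.futures import ProcessPoolExecutor

def heavy_feature(row: int) -> int:
    # Placeholder for an expensive, pure-Python computation.
    return sum(i * i for i in range(row))

if __name__ == "__main__":          # required with the spawn start method (Windows/macOS)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(heavy_feature, range(10_000, 10_008)))
    print(results)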
Q1: How does Python's MRO work?
Answer: C3 Linearization for multiple inheritance. ClassName.mro() shows the order. Subclasses before bases, left-to-right.
+
Q2: dataclass vs Pydantic?
Answer: dataclass: no validation, fast, standard library. Pydantic: auto-validation, JSON serialization, API models. Use Pydantic for external data, dataclass for internal.
Answer: It's a descriptor with __get__/__set__. Attribute access triggers descriptor protocol. Used for computed attributes and validation.
+
Q5: Decorator with parameters?
Answer: Three nested functions: factory(params) β decorator(func) β wrapper(*args). Use @wraps(func) always.
+
Q6: What is __slots__?
Answer: Fixed array instead of __dict__. ~40% less memory. No dynamic attributes. Use for millions of objects.
+
Q7: Explain closures with use case.
Answer: Function capturing enclosing scope variables. Use: factory functions, decorators, callbacks. make_multiplier(3) returns function multiplying by 3.
+
Q8: Design patterns in Python vs Java?
Answer: Python makes many patterns trivial: Strategy = pass a function. Singleton = module variable. Factory = dict of classes. Observer = list of callables. Python prefers simplicity.
+
`
+},
+
+"sklearn": {
+ concepts: `
+
+
🤖 Scikit-learn – Complete ML Engineering

⚡ The Estimator API

Estimators: fit(X, y). Transformers: transform(X). Predictors: predict(X). This consistency allows seamless swapping and composition via Pipelines.
+
+
+
1. Pipelines – The Foundation of Production ML
+
+
⚠️ Data Leakage – The #1 ML Mistake
 Fitting a scaler on the ENTIRE dataset before splitting = test-set information leaks into training. Fix: put ALL preprocessing inside a Pipeline. The Pipeline ensures fit happens only on the training folds during CV.
+
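A leakage-free sketch: the scaler is fit inside each CV training fold automatically.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

pipe = Pipeline([
    ("scaler", StandardScaler()),             # fit only on the training fold
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean(), scores.std())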
+
+
2. ColumnTransformer β Real-World Data
+
Real data has mixed types. ColumnTransformer applies different transformations per column set: StandardScaler for numerics, OneHotEncoder for categoricals, TfidfVectorizer for text. All in one pipeline.
+
+
3. Custom Transformers
+
Inherit BaseEstimator + TransformerMixin. Implement fit(X, y) and transform(X). TransformerMixin gives fit_transform() free. Use check_is_fitted() for safety.
+
+
4. Cross-Validation Strategies

| Strategy | When | Key Point |
|---|---|---|
| KFold | General | Doesn't preserve class ratios |
| StratifiedKFold | Imbalanced classification | Preserves class distribution |
| TimeSeriesSplit | Time-ordered data | Train always before test |
| GroupKFold | Grouped data (patients) | Same group never in train and test |
| RepeatedStratifiedKFold | Robust estimation | Multiple random splits |
+
+
+
5. Hyperparameter Tuning

| Method | Pros | Cons |
|---|---|---|
| GridSearchCV | Exhaustive | Exponential with params |
| RandomizedSearchCV | Faster, continuous distributions | May miss the optimum |
| Optuna | Smart search, pruning | Extra dependency |
| HalvingSearchCV | Successive halving | Newer, fewer docs |
+
+
+
6. Complete ML Workflow
+
+
🎯 The Steps

1. EDA → 2. Train/Val/Test split → 3. Build Pipeline (preprocess + model) → 4. Cross-validate multiple models → 5. Select best → 6. Tune hyperparameters → 7. Final evaluation on test set → 8. Save model → 9. Deploy
+
+
+
+
7. Feature Engineering

| Transformer | Purpose |
|---|---|
| PolynomialFeatures | Interaction & polynomial terms |
| FunctionTransformer | Apply any function (log, sqrt) |
| SplineTransformer | Non-linear feature basis |
| KBinsDiscretizer | Bin continuous values into categories |
| TargetEncoder | Encode categoricals by target mean |
+
+
+
8. Model Selection Guide

| Data Size | Model | Why |
|---|---|---|
| <1K rows | Logistic/SVM/KNN | Simple, less overfitting |
| 1K-100K | Random Forest, XGBoost | Best accuracy/speed tradeoff |
| 100K+ | XGBoost, LightGBM | Handles large data efficiently |
| Very large | SGDClassifier/online | Incremental learning |
| Tabular | Gradient Boosting | Almost always best for tabular |
+
+
+
9. Handling Imbalanced Data

| Strategy | How |
|---|---|
| class_weight='balanced' | Built-in for most models |
| SMOTE | Synthetic oversampling (imblearn) |
| Threshold tuning | Adjust the decision threshold from 0.5 |
| Metrics | Use F1, Precision-Recall AUC (not accuracy) |
| Ensemble | BalancedRandomForest |
+
+
+
10. Model Persistence
+
joblib.dump(model, 'model.pkl') – faster than pickle for NumPy arrays. model = joblib.load('model.pkl'). Always save the entire pipeline (not just the model) so preprocessing is included. Version your models with timestamps.
Q6: How do you handle imbalanced data?
Answer: (1) class_weight='balanced'. (2) SMOTE oversampling. (3) Adjust the threshold. (4) Use F1/AUC, not accuracy. (5) BalancedRandomForest.
+
Q7: When to use which model?
Answer: Tabular: gradient boosting (XGBoost/LightGBM). Small data: Logistic/SVM. Interpretability: Logistic/trees. Speed: LightGBM. Baseline: Random Forest.
+
Q8: fit() vs transform() vs predict()?
Answer: fit: learn params from data. transform: apply params. predict: generate predictions. fit on train only, transform/predict on both.
+
`
+},
+
+"pytorch": {
+ concepts: `
+
+
🔥 Deep Learning with PyTorch – Complete Guide

⚡ PyTorch Philosophy: Define-by-Run
+
PyTorch builds the computational graph dynamically as operations execute (eager mode). Debug with print(), breakpoints, standard Python control flow.
+
+
+
1. Tensors – The Foundation

| Concept | What | Key Point |
|---|---|---|
| Tensor | N-dimensional array | Like NumPy but GPU-capable |
| requires_grad | Track for autograd | Only for learnable params |
| device | CPU or CUDA | .to('cuda') moves to GPU |
| .detach() | Stop gradient tracking | Use for inference/metrics |
| .item() | Extract a scalar | Use for logging loss |
| .contiguous() | Ensure contiguous memory | Required after transpose/permute |
+
+
+
2. Autograd β How Backpropagation Works
+
+
🧠 Computational Graph (DAG)

When requires_grad=True, every operation is recorded and each tensor stores grad_fn. .backward() traverses the graph in reverse (chain rule). The graph is destroyed after backward() unless retain_graph=True. Gradients ACCUMULATE – you must call optimizer.zero_grad() before each backward.
+
+
+
3. nn.Module β Building Blocks
+
Every model inherits nn.Module. Layers in __init__, computation in forward(). model.train()/model.eval() toggle BatchNorm/Dropout. model.parameters() for optimizer. model.state_dict() for save/load. Use nn.Sequential for simple stacks, nn.ModuleList/nn.ModuleDict for dynamic architectures.
Dataset: override __len__ and __getitem__. DataLoader: batching, shuffling, multi-worker. num_workers>0 for parallel loading. pin_memory=True for faster GPU transfer. Use collate_fn for variable-length sequences.
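A compact sketch tying Dataset, DataLoader, and a training step together (the toy data and model are illustrative):

import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    def __init__(self, n=1024):
        self.X = torch.randn(n, 16)
        self.y = torch.randint(0, 2, (n,))

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
loader = DataLoader(ToyDataset(), batch_size=64, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

model.train()
for X_batch, y_batch in loader:
    optimizer.zero_grad()            # gradients accumulate otherwise
    loss = criterion(model(X_batch), y_batch)
    loss.backward()
    optimizer.step()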
+
+
6. Learning Rate Scheduling

| Scheduler | Strategy | When |
|---|---|---|
| StepLR | Decay every N epochs | Simple baseline |
| CosineAnnealingLR | Cosine decay | Standard for vision |
| OneCycleLR | Warmup + decay | Best for fast training |
| ReduceLROnPlateau | Decay on stall | When loss plateaus |
| LinearLR | Linear warmup | Transformer models |
+
+
+
7. Mixed Precision Training (AMP)
+
torch.cuda.amp: forward in float16 (2x faster), gradients in float32. GradScaler prevents underflow. 2-3x speedup. Standard practice for any GPU training.
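A minimal AMP sketch (assumes a CUDA device plus the model, loader, optimizer, and criterion from the training loop above):

import torch

scaler = torch.cuda.amp.GradScaler()
device = torch.device("cuda")
model.to(device)

for X_batch, y_batch in loader:
    X_batch, y_batch = X_batch.to(device), y_batch.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():      # forward pass runs in float16 where safe
        loss = criterion(model(X_batch), y_batch)
    scaler.scale(loss).backward()        # scaled to avoid fp16 gradient underflow
    scaler.step(optimizer)
    scaler.update()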
+
+
8. Transfer Learning Patterns
+
Load pretrained → Freeze base → Replace head → Fine-tune with a smaller LR. Discriminative LR: lower LR for earlier layers. Progressive unfreezing: unfreeze layers one at a time. Both work better than fine-tuning everything at once.
+
+
9. Distributed Training (DDP)
+
DistributedDataParallel: each GPU runs model copy, gradients averaged via all-reduce. Near-linear scaling. Use torchrun to launch. DistributedSampler for data splitting.
+
+
10. Debugging & Profiling

| Tool | Purpose |
|---|---|
| register_forward_hook | View intermediate activations |
| register_backward_hook | Monitor gradient magnitudes |
| torch.profiler | GPU/CPU profiling |
| torch.cuda.memory_summary() | GPU memory debugging |
| detect_anomaly() | Find NaN/Inf sources |
+
+
+
11. torch.compile (2.x)
+
JIT compiles model for 30-60% speedup. model = torch.compile(model). Uses TorchDynamo + Triton. Works on existing code. The future of PyTorch performance.
Q5: How do you choose num_workers for a DataLoader?
Answer: Rule of thumb: 4 × num_gpus. Too many = CPU overhead. pin_memory=True for faster transfers. Profile to find the sweet spot.
+
Q6: torch.compile vs eager?
Answer: compile JITs model via TorchDynamo+Triton. 30-60% faster. One line change. The future of PyTorch performance.
+
Q7: How to save/load models?
Answer: state_dict (weights only) vs full checkpoint (weights + optimizer + epoch). Use state_dict for inference, checkpoint for resuming.
+
Q8: Mixed precision β how and why?
Answer: autocast(fp16 forward) + GradScaler(fp32 grads). 2-3x speedup. Minimal accuracy loss. Standard for GPU training.
+
`
+},
+
+"tensorflow": {
+ concepts: `
+
+
TensorFlow & Keras – Complete Guide

⚡ TF2 = Eager by Default + @tf.function for Speed

TF2 defaults to eager mode (like PyTorch). @tf.function compiles to a graph for production. Keras is the official API. TF handles the full lifecycle: train → save → serve → monitor.
+
+
+
1. Three Model APIs

| API | Use Case | Flexibility |
|---|---|---|
| Sequential | Linear stack | Low |
| Functional | Multi-input/output, branching | Medium (recommended) |
| Subclassing | Custom forward logic | High |
+
+
+
2. tf.data Pipeline
+
Chains transformations lazily. .map(), .batch(), .shuffle(), .prefetch(AUTOTUNE). Prefetching overlaps loading with GPU execution. .cache() for small datasets. .interleave() for reading multiple files. TFRecord format for large datasets.
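A minimal tf.data sketch (synthetic tensors stand in for real data):

import tensorflow as tf

features = tf.random.normal((1000, 16))
labels = tf.random.uniform((1000,), maxval=2, dtype=tf.int32)

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)
    .map(lambda x, y: (tf.cast(x, tf.float32), y), num_parallel_calls=tf.data.AUTOTUNE)
    .batch(64)
    .prefetch(tf.data.AUTOTUNE)   # overlap data loading with GPU execution
)

for x_batch, y_batch in dataset.take(1):
    print(x_batch.shape, y_batch.shape)   # (64, 16) (64,)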
+
+
3. Callbacks – Training Hooks

| Callback | Purpose |
|---|---|
| ModelCheckpoint | Save the best model |
| EarlyStopping | Stop when a metric plateaus |
| ReduceLROnPlateau | Reduce LR when stuck |
| TensorBoard | Visualize metrics |
| CSVLogger | Log to CSV |
| LambdaCallback | Custom per-epoch logic |
+
+
+
4. GradientTape β Custom Training
+
Record ops → compute gradients → apply. Use for: GANs, RL, custom losses, gradient penalty, multi-loss weighting. Same concept as PyTorch's manual loop.
+
+
5. @tf.function β Production Speed
+
Traces Python → TF graph. Benefits: optimized execution, XLA, export. Gotchas: Python side effects run only during tracing. Use tf.print() inside graphs.
+
+
6. SavedModel β Universal Deployment
+
model.save('path') exports architecture + weights + computation. Ready for: TF Serving (production), TF Lite (mobile), TF.js (browser). One model, any platform.
+
+
7. Keras Tuner β Automated Hyperparameter Search
+
Build model function β Tuner searches space. Strategies: Random, Hyperband, Bayesian. Integrates with TensorBoard. Alternative to Optuna for Keras models.
📦 Production Python – Complete Engineering Guide
+
+
+
⚡ Production = Reliability + Reproducibility + Observability
+
Production code must be tested (pytest), typed (mypy), logged (structured), packaged (pyproject.toml), containerized (Docker), and monitored (metrics). The gap between notebook and production is enormous.
+
+
+
1. pytest – Professional Testing

| Feature | Purpose | Example |
|---|---|---|
| fixtures | Reusable test setup | @pytest.fixture |
| parametrize | Many inputs, same test | @pytest.mark.parametrize |
| conftest.py | Shared fixtures | DB connections, mock data |
| monkeypatch | Override functions/env | Mock API calls |
| tmp_path | Temp directory | Test file I/O |
| markers | Tag tests | pytest -m "not slow" |
| coverage | Measure test coverage | pytest --cov |
+
+
+
2. Testing ML Code
+
+
🎯 What to Test in ML

 Unit: data transforms, feature engineering, loss functions.
 Integration: full pipeline end-to-end.
 Model: output shape, range, determinism with a seed.
 Data: schema validation, distribution shifts, missing patterns.
+
+
+
+
3. Logging Best Practices
+
+
| Level | When |
|---|---|
| DEBUG | Tensor shapes, intermediate values |
| INFO | Training started, epoch complete |
| WARNING | Unexpected but handled (fallback used) |
| ERROR | Model load failure, API error |
| CRITICAL | OOM, GPU crash |

Never use print(). Use structured logging (JSON format) in production – parseable by log aggregators (ELK, Datadog).
+
+
4. FastAPI for Model Serving
+
Modern async framework. Auto-generates OpenAPI docs. Pydantic validation. Deploy with Uvicorn + Docker. Add: health checks, input validation, error handling, rate limiting, request logging.
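A minimal serving sketch along those lines (endpoint names and the predict_one helper are illustrative, not a prescribed layout):

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="Model API")

class PredictRequest(BaseModel):
    features: list[float]

class PredictResponse(BaseModel):
    score: float

def predict_one(features: list[float]) -> float:
    # Placeholder for model inference (e.g. a loaded joblib pipeline).
    return sum(features) / max(len(features), 1)

@app.get("/health")
def health() -> dict:
    return {"status": "ok"}

@app.post("/predict", response_model=PredictResponse)
def predict(req: PredictRequest) -> PredictResponse:
    if not req.features:
        raise HTTPException(status_code=422, detail="features must not be empty")
    return PredictResponse(score=predict_one(req.features))

# Run with: uvicorn main:app --host 0.0.0.0 --port 8000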
GitHub Actions: lint (ruff) → type check (mypy) → test (pytest) → build (Docker) → deploy. Add a model validation gate: a new model must beat the baseline on test metrics before deployment.
+
+
9. Code Quality Tools
+
+
| Tool | Purpose |
|---|---|
| ruff | Fast linter + formatter (replaces black, isort, flake8) |
| mypy | Static type checking |
| pre-commit | Git hooks for auto-formatting |
| pytest-cov | Test coverage |
| bandit | Security linting |
+
+
+
10. MLOps β Model Lifecycle
+
+
| Tool | Purpose |
|---|---|
| MLflow | Experiment tracking, model registry |
| DVC | Data versioning (like Git for data) |
| Weights & Biases | Experiment tracking, visualization |
| Evidently | Data drift & model monitoring |
| Great Expectations | Data validation |
+
+
+
11. Database for ML Projects
+
+
| DB | Use Case | Python Library |
|---|---|---|
| SQLite | Local, small data, prototyping | sqlite3 (built-in) |
| PostgreSQL | Production, ACID, JSON | psycopg2, SQLAlchemy |
| Redis | Caching, queues, sessions | redis-py |
| MongoDB | Flexible schema, documents | pymongo |
| Pinecone/Weaviate | Vector search (embeddings) | Official SDKs |
+
+
`,
+ code: `
+
+
💻 Production Python Project Code
+
+
1. pytest – Complete ML Testing
+
import numpy as np
import pytest
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# conftest.py – shared fixtures
@pytest.fixture
def sample_data():
    np.random.seed(42)
    X = np.random.randn(100, 10)
    y = np.random.randint(0, 2, 100)
    return X, y

@pytest.fixture
def trained_model(sample_data):
    X, y = sample_data
    model = RandomForestClassifier(n_estimators=10)
    model.fit(X, y)
    return model

# Test multiple models with one function
@pytest.mark.parametrize("model_cls", [
    LogisticRegression, RandomForestClassifier, GradientBoostingClassifier
])
def test_model_output(model_cls, sample_data):
    X, y = sample_data
    model = model_cls()
    model.fit(X, y)
    preds = model.predict(X)
    assert preds.shape == y.shape
    assert set(np.unique(preds)).issubset({0, 1})

# Test the data pipeline end-to-end (pipeline is assumed to be a fixture defined elsewhere)
def test_pipeline_no_leakage(sample_data, pipeline):
    X, y = sample_data
    scores = cross_val_score(pipeline, X, y, cv=3)
    assert all(0 <= s <= 1 for s in scores)
GIL prevents true multi-threading for CPU-bound Python. BUT: NumPy, Pandas, scikit-learn release the GIL during C operations. Python 3.13: experimental free-threaded CPython (no-GIL).
+
+
+
| Task Type | Solution | Why |
|---|---|---|
| I/O-bound | asyncio / threading | GIL released during I/O |
| CPU-bound Python | multiprocessing | Separate processes, separate GILs |
| CPU-bound NumPy | threading OK | NumPy releases the GIL |
| Many tasks | concurrent.futures | Simple Pool interface |
+
+
+
3. Numba β JIT Compilation
+
@numba.jit(nopython=True): compile to machine code. 10-100x speedup for loops. Supports NumPy, math. @numba.vectorize: custom ufuncs. @cuda.jit: GPU kernels. Best for: tight loops that can't be vectorized.
+
+
4. Dask β Parallel Computing
+
Pandas/NumPy API for data bigger than memory. dask.dataframe, dask.array, dask.delayed. Lazy execution. Task graph scheduler. Scales from laptop to cluster. Alternative: Polars for single-machine parallel.
+
+
5. Ray β Distributed ML
+
General-purpose distributed framework. Ray Tune (hyperparameter tuning), Ray Serve (model serving), Ray Data. Easier than Dask for ML. Used by OpenAI, Uber.
array module: For simple typed arrays (no NumPy overhead)
+
+
+
7. Caching Strategies
+
+
| Tool | Scope | Use Case |
|---|---|---|
| @functools.lru_cache | In-memory, per function | Expensive computations |
| @functools.cache | Unbounded cache | Pure functions |
| joblib.Memory | Disk cache | Data processing pipelines |
| Redis | External cache | Multi-process, API responses |
| diskcache | Pure-Python disk cache | Simple persistent cache |
+
+
+
8. Python 3.12-3.13 Performance
+
3.12: 5-15% faster, better errors, per-interpreter GIL. 3.13: Free-threaded (no-GIL experimental), JIT compiler (experimental). The future of Python performance is exciting.
+
+
9. Common Performance Anti-Patterns
+
+
| Anti-Pattern | Fix | Speedup |
|---|---|---|
| for row in df.iterrows() | Vectorized ops | 100-1000x |
| s += "text" in a loop | ''.join(parts) | 100x |
| x in big_list | x in big_set | 1000x |
| Python list of floats | NumPy array | 50-100x |
| Imports inside a hot function | Import at the top | Variable |
| Not using built-ins | sum(), min() | 5-10x |
+
+
`,
+ code: `
+
+
💻 Performance Code Examples
+
+
1. Profiling Workflow
+
import cProfile, pstats
+
+# Profile and find bottlenecks
+with cProfile.Profile() as pr:
+ result = expensive_pipeline(data)
+
+stats = pstats.Stats(pr)
+stats.sort_stats('cumulative')
+stats.print_stats(10) # Top 10 slow functions
+
+# Memory profiling
+import tracemalloc
+tracemalloc.start()
+# ... process data ...
+snapshot = tracemalloc.take_snapshot()
+for stat in snapshot.statistics('filename')[:5]:
+ print(stat)
+
+
2. Numba JIT
+
import numba
import numpy as np

@numba.jit(nopython=True)
def pairwise_distance(X):
    n = X.shape[0]
    D = np.empty((n, n))
    for i in range(n):
        D[i, i] = 0.0
        for j in range(i + 1, n):
            d = 0.0
            for k in range(X.shape[1]):
                d += (X[i, k] - X[j, k]) ** 2
            D[i, j] = D[j, i] = d ** 0.5
    return D
# ~100x faster than pure Python!
import dask.dataframe as dd
+
+# Read 100GB of CSVs β lazy!
+ddf = dd.read_csv('data/*.csv')
+
+# Same Pandas API β but parallel
+result = (
+ ddf.groupby('category')
+ .agg({'revenue': 'sum', 'qty': 'mean'})
+ .compute() # Only here does it execute
+)
+
+
5. functools.lru_cache β Memoization
+
from functools import lru_cache

@lru_cache(maxsize=1024)
def expensive_feature(customer_id: int) -> dict:
    # DB query, computation, etc.
    return compute_features(customer_id)

# First call: computes. Second call: instant from cache
print(expensive_feature.cache_info())  # hits, misses, size
+
+
6. __slots__ for Memory
+
class Point:
    __slots__ = ('x', 'y', 'z')
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

# 1M instances: ~60MB vs ~160MB without __slots__
points = [Point(i, i * 2, i * 3) for i in range(1_000_000)]
+
+
7. String Performance
+
# Bad: O(n²) – creates a new string each iteration
result = ""
for word in words:
    result += word + " "

# Good: O(n) – single allocation at the end
result = " ".join(words)
+
`,
+ interview: `
+
+
🎯 Performance Interview Questions
+
Q1: Why the GIL?
Answer: Simplifies reference counting. Makes single-threaded faster. Easier C extensions. Python 3.13 has experimental no-GIL mode.
Q6: Dask vs Ray vs Spark?
Answer: Dask: Pandas API, Python-native. Ray: ML-focused. Spark: JVM, TB+ data. Python ML: Dask/Ray. Big-data ETL: Spark.
+
Q7: Top 3 Python performance tips?
Answer: (1) Use sets not lists for lookups. (2) NumPy not Python loops. (3) Generator expressions for memory. Bonus: lru_cache for expensive functions.
+
Q8: How does lru_cache work?
Answer: Hash-based memoization. Args must be hashable. maxsize=None for unlimited. cache_info() shows hits/misses. Perfect for pure functions.