Front,Back
Overfitting,"When a model fits too closely to the training data, it learns patterns that are specific to the training set but not generalizable to new, unseen data. This results in poor performance on test data. It can be caused by using a model that is too complex or training for too many iterations."
Cross-Validation,"A technique for validating a model's performance by splitting the dataset into multiple partitions (folds). Each fold is used once as a test set while the remaining folds are used as the training set. This helps to assess how well the model generalizes across different data subsets and reduces the risk of overfitting."
Regularization Techniques,"L1 (Lasso), L2 (Ridge), Dropout (in neural networks) – Regularization techniques add penalties to the model to prevent overfitting. L1 adds a penalty based on the absolute values of coefficients, encouraging sparsity. L2 penalizes the sum of the squared coefficients, encouraging smaller weights. Dropout randomly disables neurons during training to prevent reliance on any particular neuron."
ROC AUC,"Receiver Operating Characteristic (ROC) is a curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The Area Under the Curve (AUC) measures the overall performance of a binary classification model. A higher AUC indicates better performance, as it signifies a higher true positive rate for a given false positive rate."
SMOTE,"Synthetic Minority Over-sampling Technique – An oversampling method used to address class imbalance in datasets. SMOTE generates synthetic examples of the minority class by interpolating between existing examples, making the class distribution more balanced and improving model performance on the minority class."
K-Nearest Neighbors (KNN),"A classification method that assigns a data point to the class most common among its 'K' nearest neighbors, based on distance metrics such as Euclidean distance. It is simple and effective for low-dimensional data but can be computationally expensive for large datasets."
PCA,"Principal Component Analysis – A dimensionality reduction technique that transforms high-dimensional data into a smaller set of uncorrelated components, known as principal components. PCA helps to reduce complexity while preserving the most important variance in the data. It is widely used for feature extraction and visualization."
"XGBoost, LightGBM, CatBoost","Gradient-boosting models based on decision trees. These methods improve accuracy by building models sequentially, each correcting the errors made by the previous one, minimizing a loss function via gradient descent on the ensemble. They are widely used for structured/tabular data because they handle missing values and feature interactions well."
SVM (Support Vector Machines),"Support Vector Machines are classifiers that aim to find a hyperplane that maximizes the margin between two classes. The hyperplane is defined by support vectors, which are the data points closest to it. SVM uses kernels to map data into higher-dimensional spaces to handle non-linearly separable data."
"RNN, LSTM, GRU","Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that is updated over time. Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks are RNN variants designed to mitigate the vanishing gradient problem and are particularly useful for tasks such as time series forecasting and natural language processing."
NumPy,"NumPy is a Python library for numerical computing, providing support for arrays, matrices, and mathematical functions. Its vectorized operations enable fast computation on large datasets, and it serves as the foundation for many data science and machine learning libraries in Python."
Pandas,"Pandas is a Python library for data manipulation and analysis, particularly for structured data such as tables (dataframes). It provides powerful tools for data cleaning, transformation, and analysis, and is commonly used for working with datasets from CSV and Excel files and SQL databases."
Apache Spark,"Apache Spark is a distributed computing system designed for big data processing. It can handle large volumes of data and run computations in parallel across a cluster of machines. Spark provides APIs for Java, Scala, Python, and R, and is commonly used for large-scale data processing tasks such as ETL and machine learning."
Dask,"Dask is a Python library for parallel and distributed computing. It extends the capabilities of Pandas and NumPy to larger-than-memory datasets by distributing tasks across multiple cores or machines. Dask is useful for scaling computations on datasets that don't fit into memory."
Python Generator,"A Python generator is a special type of iterator that allows iteration over large datasets without loading the entire dataset into memory. It uses the 'yield' keyword to return data one piece at a time, making it memory-efficient and suitable for working with large data streams."
Multiple Inheritance in Python,"Multiple inheritance in Python allows a class to inherit from more than one parent class. This can lead to complexity when several parents define the same method. Python uses the method resolution order (MRO) to determine which method to invoke, ensuring that a single, well-defined method is called."
Factory Pattern,"The Factory Pattern is a design pattern used to create objects without specifying the exact class of object to be created. Instead, a method or class is responsible for creating the object, which decouples object creation from the client code and makes the system more flexible and maintainable."
TensorFlow,"TensorFlow is an open-source deep learning framework developed by Google. It provides a comprehensive ecosystem for building, training, and deploying machine learning models at scale. TensorFlow is known for its flexibility, scalability, and support for both research and production environments."
PyTorch,"PyTorch is a deep learning framework developed by Facebook (Meta) that emphasizes dynamic computation graphs, making it more flexible and easier to debug than static-graph frameworks. It is particularly popular in research and for tasks that require quick iteration and experimentation."
Keras,"Keras is a high-level deep learning API written in Python, originally developed as an independent library and now integrated with TensorFlow. It provides simple interfaces for creating neural networks, making it easier for developers to quickly prototype deep learning models."
Scikit-learn,"Scikit-learn is a Python library for classical machine learning algorithms, offering a wide range of tools for regression, classification, clustering, and dimensionality reduction. It also provides utilities for model evaluation, feature selection, and data preprocessing."
AWS SageMaker,"AWS SageMaker is a fully managed service from Amazon Web Services for building, training, and deploying machine learning models at scale. It provides tools for every step of the ML workflow, including data preprocessing, model tuning, and deployment to production environments."
AWS EC2,"AWS EC2 (Elastic Compute Cloud) is a web service that provides scalable computing resources in the cloud. It allows users to launch virtual servers, known as instances, to run applications and workloads without managing the underlying hardware."
Flask / FastAPI,"Flask and FastAPI are lightweight Python frameworks for building web APIs. Flask is simple and flexible, while FastAPI is known for its performance, automatic API documentation, and validation of data types using Python's type hints."
TensorFlow Serving,"TensorFlow Serving is a system designed for serving TensorFlow models in production environments. It provides tools for loading, managing, and serving models for real-time inference, ensuring that they can be used efficiently in production applications."
Google Cloud AI Platform,"Google Cloud AI Platform is a fully managed service for building, training, and deploying machine learning models on Google Cloud. It supports a wide variety of ML frameworks and provides tools for model versioning, monitoring, and scaling."
Docker,"Docker is a tool that allows developers to package applications and their dependencies into containers, ensuring consistency across different environments. Containers are lightweight, portable, and can run on any system that supports Docker, making them ideal for deployment."
Kubernetes,"Kubernetes is an open-source container orchestration system that automates the deployment, scaling, and management of containerized applications. It manages clusters of containers, ensuring that applications keep running and can scale as needed."
Terraform,"Terraform is an Infrastructure as Code (IaC) tool that allows users to define and provision infrastructure resources using declarative configuration files. It supports multiple cloud providers and helps automate the process of infrastructure management."
Jenkins,"Jenkins is an open-source tool for continuous integration and continuous delivery (CI/CD). It automates the process of testing, building, and deploying software, enabling developers to detect and fix issues early in the development process."
Matplotlib,"Matplotlib is the most widely used Python library for creating static, animated, and interactive visualizations. It provides a wide variety of plotting functions and is often used for creating charts, graphs, and scientific plots."
Seaborn,"Seaborn is a Python data visualization library built on Matplotlib that simplifies the creation of complex, visually appealing statistical graphics. It includes built-in themes and functions that work easily with Pandas data structures."
Plotly,"Plotly is a Python library for creating interactive visualizations, including charts, maps, and dashboards. It allows users to create web-based visualizations that can be embedded in web pages or used for interactive data exploration."
Tableau,"Tableau is a data visualization tool for creating interactive dashboards and business intelligence reports. It is widely used for visual analytics, enabling users to explore and analyze data through drag-and-drop interfaces."
MLflow,"MLflow is an open-source platform for managing the end-to-end machine learning lifecycle, including experimentation, reproducibility, and deployment. It tracks experiment parameters and metrics, stores model artifacts, and manages model versions."
TensorBoard,"TensorBoard is a tool that provides visualizations for monitoring machine learning training. It allows users to view metrics such as loss and accuracy, inspect the model architecture, and visualize embeddings to gain insight into the model's behavior."
Google Colab,"Google Colab is a cloud-based Jupyter notebook environment provided by Google, which includes free access to GPUs and TPUs for machine learning tasks. It allows users to write and execute Python code in a collaborative environment."
Git,"Git is a version control system that tracks changes to files, allowing multiple developers to collaborate on projects. It enables branching, merging, and versioning, making it essential for modern software development workflows."
MVP (Minimum Viable Product),"The MVP is the simplest version of a product that contains the core features necessary to address the primary needs of early adopters. It is used to test the market, gather feedback, and validate the product concept before further development."
Time-to-Market,"Time-to-Market refers to the time it takes to bring a product or feature from initial concept to launch. It is an important metric in product development, as a faster time-to-market can provide a competitive advantage in rapidly changing markets."
Model Explainability,"Model Explainability refers to the ability to interpret and explain the predictions or behavior of a machine learning model in a way that is understandable to non-technical stakeholders. It is important for building trust in AI systems."
"SHAP, LIME","SHAP (SHapley Additive exPlanations) and LIME (Local Interpretable Model-agnostic Explanations) are techniques for explaining machine learning models. SHAP assigns an importance value to each feature based on Shapley values, while LIME approximates the model's behavior locally around a specific prediction."
Bias-Variance Trade-off,"The bias-variance trade-off is the balance between a model's complexity and its generalization ability. High bias (underfitting) means the model is too simple and misses important patterns, while high variance (overfitting) means the model is too complex and captures noise."
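The Python Generator card can be illustrated with a minimal, self-contained sketch; the function name `read_in_chunks` and the toy data are illustrative assumptions, not from the deck:

```python
def read_in_chunks(values, chunk_size=3):
    """Yield successive chunks of a sequence instead of copying it at once.

    With a real data stream (e.g. a file object read line by line), the
    same 'yield' pattern avoids loading the whole dataset into memory.
    """
    for start in range(0, len(values), chunk_size):
        # Nothing past this point runs until the caller asks for the next chunk.
        yield values[start:start + chunk_size]

# The generator is lazy: chunks are produced one at a time on demand.
chunks = list(read_in_chunks(list(range(7)), chunk_size=3))
print(chunks)  # [[0, 1, 2], [3, 4, 5], [6]]
```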
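The Multiple Inheritance card can be made concrete with a small sketch of the MRO in action; the class names here are invented for illustration:

```python
class Base:
    def greet(self):
        return "Base"

class Left(Base):
    def greet(self):
        return "Left"

class Right(Base):
    def greet(self):
        return "Right"

class Child(Left, Right):
    # Both parents define greet(); the MRO (C3 linearization) decides
    # which one Child inherits: Child -> Left -> Right -> Base -> object.
    pass

print([cls.__name__ for cls in Child.__mro__])
print(Child().greet())  # 'Left' — Left comes first in the MRO
```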
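The KNN card's majority-vote idea can be sketched in plain Python (standard library only); the function and the toy training points are assumptions for illustration, not a production implementation:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (point, label) pairs; distance is Euclidean.
    """
    # Sort training points by distance to the query, keep the k closest.
    neighbors = sorted(train, key=lambda pl: math.dist(pl[0], query))[:k]
    # Majority vote over the neighbors' labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0.0, 0.0), "a"), ((0.1, 0.2), "a"),
         ((5.0, 5.0), "b"), ((5.1, 4.9), "b")]
print(knn_predict(train, (0.2, 0.1), k=3))  # 'a'
```

Sorting the whole training set is O(n log n) per query, which is why the card notes KNN gets expensive on large datasets.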
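The Factory Pattern card can be illustrated with a short sketch; the exporter classes and `make_exporter` are hypothetical names chosen for the example:

```python
import json

class JSONExporter:
    def export(self, data):
        return json.dumps(data)

class CSVExporter:
    def export(self, data):
        return ",".join(str(v) for v in data.values())

def make_exporter(fmt):
    """Factory: callers ask for a format name, never a concrete class."""
    exporters = {"json": JSONExporter, "csv": CSVExporter}
    try:
        return exporters[fmt]()
    except KeyError:
        raise ValueError(f"unknown format: {fmt}") from None

# Client code stays decoupled from the concrete exporter classes.
exporter = make_exporter("csv")
print(exporter.export({"x": 1, "y": 2}))  # 1,2
```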