[ { "page_content": "User Guide\n1. Supervised learning\n1.1. Linear Models\n1.1.1. Ordinary Least Squares\n1.1.2. Ridge regression and classification\n1.1.3. Lasso\n1.1.4. Multi-task Lasso\n1.1.5. Elastic-Net\n1.1.6. Multi-task Elastic-Net\n1.1.7. Least Angle Regression\n1.1.8. LARS Lasso\n1.1.9. Orthogonal Matching Pursuit (OMP)\n1.1.10. Bayesian Regression\n1.1.11. Logistic regression\n1.1.12. Generalized Linear Models\n1.1.13. Stochastic Gradient Descent - SGD\n1.1.14. Perceptron\n1.1.15. Passive Aggressive Algorithms\n1.1.16. Robustness regression: outliers and modeling errors\n1.1.17. Quantile Regression\n1.1.18. Polynomial regression: extending linear models with basis functions\n1.2. Linear and Quadratic Discriminant Analysis\n1.2.1. Dimensionality reduction using Linear Discriminant Analysis\n1.2.2. Mathematical formulation of the LDA and QDA classifiers\n1.2.3. Mathematical formulation of LDA dimensionality reduction\n1.2.4. Shrinkage and Covariance Estimator\n1.2.5. Estimation algorithms\n1.3. Kernel ridge regression", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 0, "source": "scikit-learn-docs" } }, { "page_content": "1.2.4. Shrinkage and Covariance Estimator\n1.2.5. Estimation algorithms\n1.3. Kernel ridge regression\n1.4. Support Vector Machines\n1.4.1. Classification\n1.4.2. Regression\n1.4.3. Density estimation, novelty detection\n1.4.4. Complexity\n1.4.5. Tips on Practical Use\n1.4.6. Kernel functions\n1.4.7. Mathematical formulation\n1.4.8. Implementation details\n1.5. Stochastic Gradient Descent\n1.5.1. Classification\n1.5.2. Regression\n1.5.3. Online One-Class SVM\n1.5.4. Stochastic Gradient Descent for sparse data\n1.5.5. Complexity\n1.5.6. Stopping criterion\n1.5.7. Tips on Practical Use\n1.5.8. Mathematical formulation\n1.5.9. Implementation details\n1.6. Nearest Neighbors\n1.6.1. Unsupervised Nearest Neighbors\n1.6.2. Nearest Neighbors Classification\n1.6.3. Nearest Neighbors Regression\n1.6.4. 
Nearest Neighbor Algorithms\n1.6.5. Nearest Centroid Classifier\n1.6.6. Nearest Neighbors Transformer\n1.6.7. Neighborhood Components Analysis\n1.7. Gaussian Processes\n1.7.1. Gaussian Process Regression (GPR)", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 1, "source": "scikit-learn-docs" } }, { "page_content": "1.6.6. Nearest Neighbors Transformer\n1.6.7. Neighborhood Components Analysis\n1.7. Gaussian Processes\n1.7.1. Gaussian Process Regression (GPR)\n1.7.2. Gaussian Process Classification (GPC)\n1.7.3. GPC examples\n1.7.4. Kernels for Gaussian Processes\n1.8. Cross decomposition\n1.8.1. PLSCanonical\n1.8.2. PLSSVD\n1.8.3. PLSRegression\n1.8.4. Canonical Correlation Analysis\n1.9. Naive Bayes\n1.9.1. Gaussian Naive Bayes\n1.9.2. Multinomial Naive Bayes\n1.9.3. Complement Naive Bayes\n1.9.4. Bernoulli Naive Bayes\n1.9.5. Categorical Naive Bayes\n1.9.6. Out-of-core naive Bayes model fitting\n1.10. Decision Trees\n1.10.1. Classification\n1.10.2. Regression\n1.10.3. Multi-output problems\n1.10.4. Complexity\n1.10.5. Tips on practical use\n1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART\n1.10.7. Mathematical formulation\n1.10.8. Missing Values Support\n1.10.9. Minimal Cost-Complexity Pruning\n1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking\n1.11.1. Gradient-boosted trees", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "1.10.9. Minimal Cost-Complexity Pruning\n1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking\n1.11.1. Gradient-boosted trees\n1.11.2. Random forests and other randomized tree ensembles\n1.11.3. Bagging meta-estimator\n1.11.4. Voting Classifier\n1.11.5. Voting Regressor\n1.11.6. Stacked generalization\n1.11.7. AdaBoost\n1.12. Multiclass and multioutput algorithms\n1.12.1. Multiclass classification\n1.12.2. Multilabel classification\n1.12.3. 
Multiclass-multioutput classification\n1.12.4. Multioutput regression\n1.13. Feature selection\n1.13.1. Removing features with low variance\n1.13.2. Univariate feature selection\n1.13.3. Recursive feature elimination\n1.13.4. Feature selection using SelectFromModel\n1.13.5. Sequential Feature Selection\n1.13.6. Feature selection as part of a pipeline\n1.14. Semi-supervised learning\n1.14.1. Self Training\n1.14.2. Label Propagation\n1.15. Isotonic regression\n1.16. Probability calibration\n1.16.1. Calibration curves\n1.16.2. Calibrating a classifier", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "1.14.2. Label Propagation\n1.15. Isotonic regression\n1.16. Probability calibration\n1.16.1. Calibration curves\n1.16.2. Calibrating a classifier\n1.16.3. Usage\n1.17. Neural network models (supervised)\n1.17.1. Multi-layer Perceptron\n1.17.2. Classification\n1.17.3. Regression\n1.17.4. Regularization\n1.17.5. Algorithms\n1.17.6. Complexity\n1.17.7. Tips on Practical Use\n1.17.8. More control with warm_start\n2. Unsupervised learning\n2.1. Gaussian mixture models\n2.1.1. Gaussian Mixture\n2.1.2. Variational Bayesian Gaussian Mixture\n2.2. Manifold learning\n2.2.1. Introduction\n2.2.2. Isomap\n2.2.3. Locally Linear Embedding\n2.2.4. Modified Locally Linear Embedding\n2.2.5. Hessian Eigenmapping\n2.2.6. Spectral Embedding\n2.2.7. Local Tangent Space Alignment\n2.2.8. Multi-dimensional Scaling (MDS)\n2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)\n2.2.10. Tips on practical use\n2.3. Clustering\n2.3.1. Overview of clustering methods\n2.3.2. K-means\n2.3.3. Affinity Propagation\n2.3.4. Mean Shift", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 4, "source": "scikit-learn-docs" } }, { "page_content": "2.2.10. Tips on practical use\n2.3. Clustering\n2.3.1. Overview of clustering methods\n2.3.2. K-means\n2.3.3. 
Affinity Propagation\n2.3.4. Mean Shift\n2.3.5. Spectral clustering\n2.3.6. Hierarchical clustering\n2.3.7. DBSCAN\n2.3.8. HDBSCAN\n2.3.9. OPTICS\n2.3.10. BIRCH\n2.3.11. Clustering performance evaluation\n2.4. Biclustering\n2.4.1. Spectral Co-Clustering\n2.4.2. Spectral Biclustering\n2.4.3. Biclustering evaluation\n2.5. Decomposing signals in components (matrix factorization problems)\n2.5.1. Principal component analysis (PCA)\n2.5.2. Kernel Principal Component Analysis (kPCA)\n2.5.3. Truncated singular value decomposition and latent semantic analysis\n2.5.4. Dictionary Learning\n2.5.5. Factor Analysis\n2.5.6. Independent component analysis (ICA)\n2.5.7. Non-negative matrix factorization (NMF or NNMF)\n2.5.8. Latent Dirichlet Allocation (LDA)\n2.6. Covariance estimation\n2.6.1. Empirical covariance\n2.6.2. Shrunk Covariance\n2.6.3. Sparse inverse covariance\n2.6.4. Robust Covariance Estimation", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 5, "source": "scikit-learn-docs" } }, { "page_content": "2.6. Covariance estimation\n2.6.1. Empirical covariance\n2.6.2. Shrunk Covariance\n2.6.3. Sparse inverse covariance\n2.6.4. Robust Covariance Estimation\n2.7. Novelty and Outlier Detection\n2.7.1. Overview of outlier detection methods\n2.7.2. Novelty Detection\n2.7.3. Outlier Detection\n2.7.4. Novelty detection with Local Outlier Factor\n2.8. Density Estimation\n2.8.1. Density Estimation: Histograms\n2.8.2. Kernel Density Estimation\n2.9. Neural network models (unsupervised)\n2.9.1. Restricted Boltzmann machines\n3. Model selection and evaluation\n3.1. Cross-validation: evaluating estimator performance\n3.1.1. Computing cross-validated metrics\n3.1.2. Cross validation iterators\n3.1.3. A note on shuffling\n3.1.4. Cross validation and model selection\n3.1.5. Permutation test score\n3.2. Tuning the hyper-parameters of an estimator\n3.2.1. Exhaustive Grid Search\n3.2.2. Randomized Parameter Optimization\n3.2.3. 
Searching for optimal parameters with successive halving\n3.2.4. Tips for parameter search", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 6, "source": "scikit-learn-docs" } }, { "page_content": "3.2.2. Randomized Parameter Optimization\n3.2.3. Searching for optimal parameters with successive halving\n3.2.4. Tips for parameter search\n3.2.5. Alternatives to brute force parameter search\n3.3. Tuning the decision threshold for class prediction\n3.3.1. Post-tuning the decision threshold\n3.4. Metrics and scoring: quantifying the quality of predictions\n3.4.1. Which scoring function should I use?\n3.4.2. Scoring API overview\n3.4.3. The\nscoring\nparameter: defining model evaluation rules\n3.4.4. Classification metrics\n3.4.5. Multilabel ranking metrics\n3.4.6. Regression metrics\n3.4.7. Clustering metrics\n3.4.8. Dummy estimators\n3.5. Validation curves: plotting scores to evaluate models\n3.5.1. Validation curve\n3.5.2. Learning curve\n4. Metadata Routing\n4.1. Usage Examples\n4.1.1. Weighted scoring and fitting\n4.1.2. Weighted scoring and unweighted fitting\n4.1.3. Unweighted feature selection\n4.1.4. Different scoring and fitting weights\n4.2. API Interface\n4.3. Metadata Routing Support Status", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 7, "source": "scikit-learn-docs" } }, { "page_content": "4.1.3. Unweighted feature selection\n4.1.4. Different scoring and fitting weights\n4.2. API Interface\n4.3. Metadata Routing Support Status\n5. Inspection\n5.1. Partial Dependence and Individual Conditional Expectation plots\n5.1.1. Partial dependence plots\n5.1.2. Individual conditional expectation (ICE) plot\n5.1.3. Mathematical Definition\n5.1.4. Computation methods\n5.2. Permutation feature importance\n5.2.1. Outline of the permutation importance algorithm\n5.2.2. Relation to impurity-based importance in trees\n5.2.3. Misleading values on strongly correlated features\n6. 
Visualizations\n6.1. Available Plotting Utilities\n6.1.1. Display Objects\n7. Dataset transformations\n7.1. Pipelines and composite estimators\n7.1.1. Pipeline: chaining estimators\n7.1.2. Transforming target in regression\n7.1.3. FeatureUnion: composite feature spaces\n7.1.4. ColumnTransformer for heterogeneous data\n7.1.5. Visualizing Composite Estimators\n7.2. Feature extraction\n7.2.1. Loading features from dicts", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 8, "source": "scikit-learn-docs" } }, { "page_content": "7.1.4. ColumnTransformer for heterogeneous data\n7.1.5. Visualizing Composite Estimators\n7.2. Feature extraction\n7.2.1. Loading features from dicts\n7.2.2. Feature hashing\n7.2.3. Text feature extraction\n7.2.4. Image feature extraction\n7.3. Preprocessing data\n7.3.1. Standardization, or mean removal and variance scaling\n7.3.2. Non-linear transformation\n7.3.3. Normalization\n7.3.4. Encoding categorical features\n7.3.5. Discretization\n7.3.6. Imputation of missing values\n7.3.7. Generating polynomial features\n7.3.8. Custom transformers\n7.4. Imputation of missing values\n7.4.1. Univariate vs. Multivariate Imputation\n7.4.2. Univariate feature imputation\n7.4.3. Multivariate feature imputation\n7.4.4. Nearest neighbors imputation\n7.4.5. Keeping the number of features constant\n7.4.6. Marking imputed values\n7.4.7. Estimators that handle NaN values\n7.5. Unsupervised dimensionality reduction\n7.5.1. PCA: principal component analysis\n7.5.2. Random projections\n7.5.3. Feature agglomeration", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 9, "source": "scikit-learn-docs" } }, { "page_content": "7.5. Unsupervised dimensionality reduction\n7.5.1. PCA: principal component analysis\n7.5.2. Random projections\n7.5.3. Feature agglomeration\n7.6. Random Projection\n7.6.1. The Johnson-Lindenstrauss lemma\n7.6.2. Gaussian random projection\n7.6.3. 
Sparse random projection\n7.6.4. Inverse Transform\n7.7. Kernel Approximation\n7.7.1. Nystroem Method for Kernel Approximation\n7.7.2. Radial Basis Function Kernel\n7.7.3. Additive Chi Squared Kernel\n7.7.4. Skewed Chi Squared Kernel\n7.7.5. Polynomial Kernel Approximation via Tensor Sketch\n7.7.6. Mathematical Details\n7.8. Pairwise metrics, Affinities and Kernels\n7.8.1. Cosine similarity\n7.8.2. Linear kernel\n7.8.3. Polynomial kernel\n7.8.4. Sigmoid kernel\n7.8.5. RBF kernel\n7.8.6. Laplacian kernel\n7.8.7. Chi-squared kernel\n7.9. Transforming the prediction target (\ny\n)\n7.9.1. Label binarization\n7.9.2. Label encoding\n8. Dataset loading utilities\n8.1. Toy datasets\n8.1.1. Iris plants dataset\n8.1.2. Diabetes dataset", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 10, "source": "scikit-learn-docs" } }, { "page_content": "y\n)\n7.9.1. Label binarization\n7.9.2. Label encoding\n8. Dataset loading utilities\n8.1. Toy datasets\n8.1.1. Iris plants dataset\n8.1.2. Diabetes dataset\n8.1.3. Optical recognition of handwritten digits dataset\n8.1.4. Linnerrud dataset\n8.1.5. Wine recognition dataset\n8.1.6. Breast cancer Wisconsin (diagnostic) dataset\n8.2. Real world datasets\n8.2.1. The Olivetti faces dataset\n8.2.2. The 20 newsgroups text dataset\n8.2.3. The Labeled Faces in the Wild face recognition dataset\n8.2.4. Forest covertypes\n8.2.5. RCV1 dataset\n8.2.6. Kddcup 99 dataset\n8.2.7. California Housing dataset\n8.2.8. Species distribution dataset\n8.3. Generated datasets\n8.3.1. Generators for classification and clustering\n8.3.2. Generators for regression\n8.3.3. Generators for manifold learning\n8.3.4. Generators for decomposition\n8.4. Loading other datasets\n8.4.1. Sample images\n8.4.2. Datasets in svmlight / libsvm format\n8.4.3. Downloading datasets from the openml.org repository\n8.4.4. 
Loading from external datasets", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 11, "source": "scikit-learn-docs" } }, { "page_content": "8.4.2. Datasets in svmlight / libsvm format\n8.4.3. Downloading datasets from the openml.org repository\n8.4.4. Loading from external datasets\n9. Computing with scikit-learn\n9.1. Strategies to scale computationally: bigger data\n9.1.1. Scaling with instances using out-of-core learning\n9.2. Computational Performance\n9.2.1. Prediction Latency\n9.2.2. Prediction Throughput\n9.2.3. Tips and Tricks\n9.3. Parallelism, resource management, and configuration\n9.3.1. Parallelism\n9.3.2. Configuration switches\n10. Model persistence\n10.1. Workflow Overview\n10.1.1. Train and Persist the Model\n10.2. ONNX\n10.3.\nskops.io\n10.4.\npickle\n,\njoblib\n, and\ncloudpickle\n10.5. Security & Maintainability Limitations\n10.5.1. Replicating the training environment in production\n10.5.2. Serving the model artifact\n10.6. Summarizing the key points\n11. Common pitfalls and recommended practices\n11.1. Inconsistent preprocessing\n11.2. Data leakage\n11.2.1. How to avoid data leakage\n11.2.2. Data leakage during pre-processing", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 12, "source": "scikit-learn-docs" } }, { "page_content": "11.1. Inconsistent preprocessing\n11.2. Data leakage\n11.2.1. How to avoid data leakage\n11.2.2. Data leakage during pre-processing\n11.3. Controlling randomness\n11.3.1. Using\nNone\nor\nRandomState\ninstances, and repeated calls to\nfit\nand\nsplit\n11.3.2. Common pitfalls and subtleties\n11.3.3. General recommendations\n12. Dispatching\n12.1. Array API support (experimental)\n12.1.1. Example usage\n12.1.2. Support for\nArray\nAPI\n-compatible inputs\n12.1.3. Input and output array type handling\n12.1.4. Common estimator checks\n13. Choosing the right estimator\n14. External Resources, Videos and Talks\n14.1. The scikit-learn MOOC\n14.2. 
Videos\n14.3. New to Scientific Python?\n14.4. External Tutorials", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 13, "source": "scikit-learn-docs" } }, { "page_content": "9.2.\nComputational Performance\nFor some applications the performance (mainly latency and throughput at\nprediction time) of estimators is crucial. It may also be of interest to\nconsider the training throughput but this is often less important in a\nproduction setup (where it often takes place offline).\nWe will review here the orders of magnitude you can expect from a number of\nscikit-learn estimators in different contexts and provide some tips and\ntricks for overcoming performance bottlenecks.\nPrediction latency is measured as the elapsed time necessary to make a\nprediction (e.g. in microseconds). Latency is often viewed as a distribution\nand operations engineers often focus on the latency at a given percentile of\nthis distribution (e.g. the 90th percentile).\nPrediction throughput is defined as the number of predictions the software can\ndeliver in a given amount of time (e.g. in predictions per second).\nAn important aspect of performance optimization is also that it can hurt", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 0, "source": "scikit-learn-docs" } }, { "page_content": "deliver in a given amount of time (e.g. in predictions per second).\nAn important aspect of performance optimization is also that it can hurt\nprediction accuracy. Indeed, simpler models (e.g. 
linear instead of\nnon-linear, or with fewer parameters) often run faster but are not always able\nto take into account the same exact properties of the data as more complex ones.\n9.2.1.\nPrediction Latency\nOne of the most straightforward concerns when choosing a\nmachine learning toolkit is the latency at which predictions can be made in a\nproduction environment.\nThe main factors that influence prediction latency are:\nNumber of features\nInput data representation and sparsity\nModel complexity\nFeature extraction\nA final major factor is whether predictions are made in bulk or\none at a time.\n9.2.1.1.\nBulk versus Atomic mode\nIn general, making predictions in bulk (many instances at the same time) is", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 1, "source": "scikit-learn-docs" } }, { "page_content": "one at a time.\n9.2.1.1.\nBulk versus Atomic mode\nIn general, making predictions in bulk (many instances at the same time) is\nmore efficient for a number of reasons (branch predictability, CPU caching,\nlinear algebra library optimizations, etc.). In a setting\nwith few features, bulk mode is faster regardless of the estimator,\nand for some estimators by 1 to 2 orders of magnitude:\nTo benchmark different estimators for your case you can simply change the\nn_features\nparameter in this example:\nPrediction Latency\n. This should give\nyou an estimate of the order of magnitude of the prediction latency.\n9.2.1.2.\nConfiguring Scikit-learn for reduced validation overhead\nScikit-learn performs some validation on data that increases the overhead per\ncall to\npredict\nand similar functions. In particular, checking that\nfeatures are finite (not NaN or infinite) involves a full pass over the\ndata. 
If you ensure that your data is acceptable, you may suppress", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "features are finite (not NaN or infinite) involves a full pass over the\ndata. If you ensure that your data is acceptable, you may suppress\nthe finiteness check by setting the environment variable\nSKLEARN_ASSUME_FINITE to a non-empty string before importing\nscikit-learn, or configure it in Python with set_config.\nFor more control than these global settings, a config_context\nallows you to set this configuration within a specified context:\n>>> import sklearn\n>>> with sklearn.config_context(assume_finite=True):\n...     pass  # do learning/prediction here with reduced validation\nNote that this will affect all uses of assert_all_finite within the context.\n9.2.1.3.\nInfluence of the Number of Features\nWhen the number of features increases, so does the memory\nconsumption of each example: for a matrix of \\(M\\) instances\nwith \\(N\\) features, the space complexity is in \\(O(NM)\\).\nFrom a computing perspective it also means that the number of basic operations", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "instances with \\(N\\) features, the space complexity is in \\(O(NM)\\).\nFrom a computing perspective it also means that the number of basic operations\n(e.g., multiplications for vector-matrix products in linear models) increases\ntoo. 
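The effect of the number of features on per-call latency can be sketched as follows. This is a minimal illustration, not the benchmark from this page: the Ridge estimator, data sizes, and repeat counts are arbitrary choices.

```python
# Minimal sketch: per-call (atomic) prediction latency as the number of
# features grows. Estimator, sizes, and repeat counts are arbitrary.
import time

import numpy as np
from sklearn.linear_model import Ridge

latencies = {}
for n_features in (10, 100, 1000):
    rng = np.random.RandomState(0)
    X = rng.rand(200, n_features)
    y = rng.rand(200)
    model = Ridge().fit(X, y)
    one_row = X[:1]
    model.predict(one_row)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(100):
        model.predict(one_row)
    latencies[n_features] = (time.perf_counter() - start) / 100
print({n: "%.1f us" % (t * 1e6) for n, t in latencies.items()})
```

For small inputs the fixed per-call validation overhead can dominate, so the growth with the number of features only becomes clearly visible at larger sizes.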
Here is a graph of the evolution of the prediction latency with the\nnumber of features:\nOverall you can expect the prediction time to increase at least linearly with\nthe number of features (non-linear cases can happen depending on the global\nmemory footprint and estimator).\n9.2.1.4.\nInfluence of the Input Data Representation\nScipy provides sparse matrix data structures which are optimized for storing\nsparse data. The main feature of sparse formats is that zeros are not stored,\nso if your data is sparse it uses much less memory. A non-zero value in\na sparse (CSR or CSC) representation only takes, on average, one 32-bit integer position + the\n64-bit floating point value + an additional 32 bits per row or column in the matrix.", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 4, "source": "scikit-learn-docs" } }, { "page_content": "64-bit floating point value + an additional 32 bits per row or column in the matrix.\nUsing sparse input on a dense (or sparse) linear model can speed up prediction\nconsiderably, as only the non-zero-valued features affect the dot product\nand thus the model predictions. Hence if you have 100 non-zeros in a 1e6-dimensional\nspace, you only need 100 multiply-and-add operations instead of 1e6.\nCalculation over a dense representation, however, may leverage highly optimized\nvector operations and multithreading in BLAS, and tends to result in fewer CPU\ncache misses. 
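As a minimal illustration of this point (the sizes and the roughly 99% sparsity level below are arbitrary), a linear model fitted on dense data accepts the same data in CSR form and yields the same predictions while storing only the non-zeros:

```python
# Minimal sketch: the same linear model predicting from a dense array and
# from its CSR representation. Sizes and sparsity level are arbitrary.
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_dense = rng.rand(300, 500)
X_dense[X_dense < 0.99] = 0.0          # keep roughly 1% non-zeros
X_sparse = sparse.csr_matrix(X_dense)  # only non-zeros are stored
y = rng.rand(300)

model = SGDRegressor(max_iter=20, tol=None).fit(X_dense, y)
pred_dense = model.predict(X_dense)
pred_sparse = model.predict(X_sparse)  # sparse input is accepted directly
print("predictions match:", np.allclose(pred_dense, pred_sparse))
```

Whether the sparse path is actually faster depends on the sparsity level and the BLAS, as the surrounding text explains.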
So the sparsity should typically be quite high (at most 10% non-zeros,\nto be checked depending on the hardware) for the sparse input\nrepresentation to be faster than the dense input representation on a machine\nwith many CPUs and an optimized BLAS implementation.\nHere is sample code to test the sparsity of your input:\ndef sparsity_ratio(X):\n    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])\nprint(", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 5, "source": "scikit-learn-docs" } }, { "page_content": "def sparsity_ratio(X):\n    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])\nprint(\"input sparsity ratio:\", sparsity_ratio(X))\nAs a rule of thumb, if the sparsity ratio is greater\nthan 90% you can probably benefit from sparse formats. Check Scipy’s sparse\nmatrix formats documentation\nfor more information on how to build (or convert your data to) sparse matrix\nformats. Most of the time the CSR and CSC formats work best.\n9.2.1.5.\nInfluence of the Model Complexity\nGenerally speaking, when model complexity increases, both predictive power and\nlatency are expected to increase. Increasing predictive power is usually\ndesirable, but for many applications it is better not to increase\nprediction latency too much. We will now review this idea for different\nfamilies of supervised models.\nFor sklearn.linear_model (e.g. Lasso, ElasticNet,\nSGDClassifier/Regressor, Ridge & RidgeClassifier,", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 6, "source": "scikit-learn-docs" } }, { "page_content": "families of supervised models.\nFor sklearn.linear_model (e.g. 
Lasso, ElasticNet,\nSGDClassifier/Regressor, Ridge & RidgeClassifier,\nPassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression…) the\ndecision function applied at prediction time is the same (a dot product),\nso latency should be equivalent.\nHere is an example using SGDClassifier with the elasticnet\npenalty. The regularization strength is globally controlled by\nthe alpha parameter. With a sufficiently high alpha,\none can then increase the l1_ratio parameter of elasticnet to\nenforce various levels of sparsity in the model coefficients. Higher sparsity\nhere is interpreted as lower model complexity, as fewer coefficients are needed\nto describe the model fully. Of course sparsity in turn influences the prediction\ntime, as the sparse dot product takes time roughly proportional to the number of\nnon-zero coefficients.\nFor the sklearn.svm family of algorithms with a non-linear kernel,", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 7, "source": "scikit-learn-docs" } }, { "page_content": "non-zero coefficients.\nFor the sklearn.svm family of algorithms with a non-linear kernel,\nthe latency is tied to the number of support vectors (the fewer, the faster).\nLatency and throughput should (asymptotically) grow linearly with the number\nof support vectors in an SVC or SVR model. The kernel also influences the\nlatency, as it is used to compute the projection of the input vector once per\nsupport vector. In the following graph the nu parameter of\nNuSVR was used to influence the number of support vectors.\nFor sklearn.ensemble of trees (e.g. RandomForest, GBT,\nExtraTrees, etc.) the number of trees and their depth play the most\nimportant role. Latency and throughput should scale linearly with the number\nof trees. 
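That scaling with the number of trees can be sketched as follows (a minimal illustration; the dataset, tree counts, and repeat counts are arbitrary, not the settings used on this page):

```python
# Minimal sketch: bulk prediction cost for different numbers of trees.
# Dataset, tree counts, and repeat counts are arbitrary illustrations.
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = rng.rand(200)

timings = {}
for n_estimators in (10, 100):
    model = GradientBoostingRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    start = time.perf_counter()
    for _ in range(50):
        model.predict(X)  # every tree is evaluated for every instance
    timings[n_estimators] = (time.perf_counter() - start) / 50
print({n: "%.2f ms" % (t * 1e3) for n, t in timings.items()})
```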
In this case we directly used the\nn_estimators\nparameter of\nGradientBoostingRegressor\n.\nIn any case, be warned that decreasing model complexity can hurt accuracy as\nmentioned above. For instance a non-linearly separable problem can be handled", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 8, "source": "scikit-learn-docs" } }, { "page_content": "mentioned above. For instance a non-linearly separable problem can be handled\nwith a speedy linear model but prediction power will very likely suffer in\nthe process.\n9.2.1.6.\nFeature Extraction Latency\nMost scikit-learn models are usually pretty fast as they are implemented\neither with compiled Cython extensions or optimized computing libraries.\nOn the other hand, in many real-world applications the feature extraction\nprocess (i.e. turning raw data like database rows or network packets into\nnumpy arrays) governs the overall prediction time. For example on the Reuters\ntext classification task the whole preparation (reading and parsing SGML\nfiles, tokenizing the text and hashing it into a common vector space) takes\n100 to 500 times longer than the actual prediction code, depending on\nthe chosen model.\nIn many cases it is thus recommended to carefully time and profile your\nfeature extraction code, as it may be a good place to start optimizing when", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 9, "source": "scikit-learn-docs" } }, { "page_content": "In many cases it is thus recommended to carefully time and profile your\nfeature extraction code, as it may be a good place to start optimizing when\nyour overall latency is too high for your application.\n9.2.2.\nPrediction Throughput\nAnother important metric to care about when sizing production systems is\nthroughput, i.e. the number of predictions you can make in a given amount of\ntime. 
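A minimal sketch of measuring throughput (the model, data sizes, and call counts below are arbitrary illustrations):

```python
# Minimal sketch: throughput as predictions delivered per second in
# atomic (one-instance-per-call) mode. Model and sizes are arbitrary.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(500, 20)
y = rng.randint(0, 2, size=500)
model = LogisticRegression().fit(X, y)

n_calls = 200
start = time.perf_counter()
for i in range(n_calls):
    model.predict(X[i : i + 1])  # one instance per call
elapsed = time.perf_counter() - start
throughput = n_calls / elapsed
print("%.0f predictions/second (atomic mode)" % throughput)
```

Replacing the loop with a single `model.predict(X)` call shows the bulk-mode throughput discussed earlier, which is typically much higher.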
Here is a benchmark from the\nPrediction Latency\nexample that measures\nthis quantity for a number of estimators on synthetic data:\nThese throughputs are achieved on a single process. An obvious way to\nincrease the throughput of your application is to spawn additional instances\n(usually processes in Python because of the\nGIL\n) that share the\nsame model. One might also add machines to spread the load. A detailed\nexplanation on how to achieve this is beyond the scope of this documentation\nthough.\n9.2.3.\nTips and Tricks\n9.2.3.1.\nLinear algebra libraries", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 10, "source": "scikit-learn-docs" } }, { "page_content": "explanation on how to achieve this is beyond the scope of this documentation\nthough.\n9.2.3.\nTips and Tricks\n9.2.3.1.\nLinear algebra libraries\nAs scikit-learn relies heavily on Numpy/Scipy and linear algebra in general it\nmakes sense to take explicit care of the versions of these libraries.\nBasically, you ought to make sure that Numpy is built using an optimized\nBLAS\n/\nLAPACK\nlibrary.\nNot all models benefit from optimized BLAS and Lapack implementations. For\ninstance models based on (randomized) decision trees typically do not rely on\nBLAS calls in their inner loops, nor do kernel SVMs (\nSVC\n,\nSVR\n,\nNuSVC\n,\nNuSVR\n). 
On the other hand a linear model implemented with a\nBLAS DGEMM call (via\nnumpy.dot\n) will typically benefit hugely from a tuned\nBLAS implementation and lead to orders of magnitude speedup over a\nnon-optimized BLAS.\nYou can display the BLAS / LAPACK implementation used by your NumPy / SciPy /\nscikit-learn install with the following command:\npython -c", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 11, "source": "scikit-learn-docs" } }, { "page_content": "You can display the BLAS / LAPACK implementation used by your NumPy / SciPy /\nscikit-learn install with the following command:\npython -c \"import sklearn; sklearn.show_versions()\"\nOptimized BLAS / LAPACK implementations include:\nAtlas (needs hardware-specific tuning by rebuilding on the target machine)\nOpenBLAS\nMKL\nApple Accelerate and vecLib frameworks (OSX only)\nMore information can be found on the\nNumPy install page\nand in this\nblog post\nfrom Daniel Nouri, which has some nice step-by-step install instructions for\nDebian / Ubuntu.\n9.2.3.2.\nLimiting Working Memory\nSome calculations, when implemented using standard numpy vectorized operations,\ninvolve using a large amount of temporary memory. This may potentially exhaust\nsystem memory. Where computations can be performed in fixed-memory chunks, we\nattempt to do so, and allow the user to hint at the maximum size of this\nworking memory (defaulting to 1GB) using\nset_config\nor\nconfig_context\n. The following limits temporary working", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 12, "source": "scikit-learn-docs" } }, { "page_content": "working memory (defaulting to 1GB) using\nset_config\nor\nconfig_context\n. 
The following limits temporary working\nmemory to 128 MiB:\n>>> import sklearn\n>>> with sklearn.config_context(working_memory=128):\n...     pass  # do chunked work here\nAn example of a chunked operation adhering to this setting is\npairwise_distances_chunked\n, which facilitates computing\nrow-wise reductions of a pairwise distance matrix.\n9.2.3.3.\nModel Compression\nModel compression in scikit-learn only concerns linear models for the moment.\nIn this context it means that we want to control the model sparsity (i.e. the\nnumber of non-zero coordinates in the model vectors). It is generally a good\nidea to combine model sparsity with a sparse input data representation.\nHere is sample code that illustrates the use of the\nsparsify()\nmethod:\nclf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)\nclf.fit(X_train, y_train).sparsify()\nclf.predict(X_test)\nIn this example we prefer the\nelasticnet", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 13, "source": "scikit-learn-docs" } }, { "page_content": "clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)\nclf.fit(X_train, y_train).sparsify()\nclf.predict(X_test)\nIn this example we prefer the\nelasticnet\npenalty as it is often a good\ncompromise between model compactness and prediction power. One can also\nfurther tune the\nl1_ratio\nparameter (in combination with the\nregularization strength\nalpha\n) to control this tradeoff.\nA typical\nbenchmark\non synthetic data yields a >30% decrease in latency when both the model and\ninput are sparse (with 0.000024 and 0.027400 non-zero coefficients ratio\nrespectively). 
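To see concretely what sparsify() does, here is a minimal sketch; the data and the alpha/l1_ratio values are arbitrary and chosen only so that regularization zeroes out most coefficients:

```python
# Minimal sketch: sparsify() converts a fitted linear model's coef_ to a
# scipy.sparse matrix in place. Data and penalty settings are arbitrary.
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(200, 100)
y_train = rng.rand(200)

clf = SGDRegressor(penalty="elasticnet", alpha=1.0, l1_ratio=0.9,
                   max_iter=50, tol=None, random_state=0)
clf.fit(X_train, y_train)
clf.sparsify()  # coef_ is now stored in a scipy.sparse (CSR) matrix
print(type(clf.coef_).__name__, "-", clf.coef_.nnz, "non-zero coefficients")
```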
Your mileage may vary depending on the sparsity and size of your data and model.

Furthermore, sparsifying can be very useful to reduce the memory usage of predictive models deployed on production servers.

9.2.3.4. Model Reshaping

Model reshaping consists in selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input. This has several benefits. Firstly, it reduces the memory (and therefore time) overhead of the model itself. It also makes it possible to discard explicit feature selection components in a pipeline once we know which features to keep from a previous run. Finally, it can help reduce processing time and I/O usage upstream in the data access and feature extraction layers by not collecting and building features that are discarded by the model.
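A sketch of the idea, using SelectKBest purely as an illustrative selector: record which columns the fitted selection step actually keeps, then slice future inputs down to those columns instead of building all of them:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(f_classif, k=5).fit(X, y)
kept = selector.get_support(indices=True)  # column indices the model keeps

# downstream, only collect/build these columns instead of all 20
X_reshaped = X[:, kept]
```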
For instance, if the raw data come from a database, it is possible to write simpler and faster queries, or to reduce I/O usage, by making the queries return lighter records. At the moment, reshaping needs to be performed manually in scikit-learn. In the case of sparse input (particularly in CSR format), it is generally sufficient to not generate the relevant features, leaving their columns empty.

9.2.3.5. Links

- scikit-learn developer performance documentation
- SciPy sparse matrix formats documentation

9.3. Parallelism, resource management, and configuration

9.3.1. Parallelism

Some scikit-learn estimators and utilities parallelize costly operations using multiple CPU cores. Depending on the type of estimator, and sometimes the values of the constructor parameters, this is done either:

- with higher-level parallelism via joblib,
- with lower-level parallelism via OpenMP, used in C or Cython code, or
- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays.

The n_jobs parameter of estimators always controls the amount of parallelism managed by joblib (processes or threads, depending on the joblib backend). The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code, or by the BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn, is always controlled by environment variables or threadpoolctl as explained below. Note that some estimators can leverage all three kinds of
parallelism at different points of their training and prediction methods. We describe these three types of parallelism in the following subsections in more detail.

9.3.1.1. Higher-level parallelism with joblib

When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in parallel can be controlled via the n_jobs parameter.

Note: Where (and how) parallelization happens in the estimators using joblib by specifying n_jobs is currently poorly documented. Please help us by improving our docs and tackling issue 14228!

Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a process depends on the backend that it's using. scikit-learn generally relies on the loky backend, which is joblib's default backend. Loky is a multi-processing backend.
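As an example of the n_jobs parameter described above, a random forest trains its trees in joblib workers; n_jobs=2 below requests two workers (a minimal sketch with synthetic data, not taken from the original guide):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=2: joblib dispatches tree fitting to two workers
clf = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0).fit(X, y)
```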
When doing multi-processing, in order to avoid duplicating the memory in each process (which isn't reasonable with big datasets), joblib will create a memmap that all processes can share when the data is bigger than 1 MB.

In some specific cases (when the code that is run in parallel releases the GIL), scikit-learn will indicate to joblib that a multi-threading backend is preferable.

As a user, you may control the backend that joblib will use (regardless of what scikit-learn recommends) by using a context manager:

from joblib import parallel_backend

with parallel_backend('threading', n_jobs=2):
    ...  # Your scikit-learn code here

Please refer to joblib's docs for more details.

In practice, whether parallelism is helpful at improving runtime depends on many factors. It is usually a good idea to experiment rather than assuming that increasing the number of workers is always a good thing. In some cases it can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel (see oversubscription below).

9.3.1.2. Lower-level parallelism with OpenMP

OpenMP is used to parallelize code written in Cython or C, relying on multi-threading exclusively. By default, implementations using OpenMP will use as many threads as possible, i.e.
as many threads as logical cores.

You can control the exact number of threads that are used either:

- via the OMP_NUM_THREADS environment variable, for instance when running a python script:

  OMP_NUM_THREADS=4 python my_script.py

- or via threadpoolctl as explained by this piece of documentation.

9.3.1.3. Parallel NumPy and SciPy routines from numerical libraries

scikit-learn relies heavily on NumPy and SciPy, which internally call multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries such as MKL, OpenBLAS or BLIS.

You can control the exact number of threads used by BLAS for each library using environment variables, namely:

- MKL_NUM_THREADS sets the number of threads MKL uses,
- OPENBLAS_NUM_THREADS sets the number of threads OpenBLAS uses,
- BLIS_NUM_THREADS sets the number of threads BLIS uses.

Note that BLAS & LAPACK implementations can also be impacted by OMP_NUM_THREADS. To check whether this is the case in your environment, you can inspect how the number of threads effectively used by those libraries is affected when running the following command in a bash or zsh terminal for different values of OMP_NUM_THREADS:

OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy

Note: At the time of writing (2022), NumPy and SciPy packages which are distributed on pypi.org (i.e.
the ones installed via pip install) and on the conda-forge channel (i.e. the ones installed via conda install --channel conda-forge) are linked with OpenBLAS, while NumPy and SciPy packages shipped on the defaults conda channel from Anaconda.org (i.e. the ones installed via conda install) are linked by default with MKL.

9.3.1.4. Oversubscription: spawning too many threads

It is generally recommended to avoid using significantly more processes or threads than the number of CPUs on a machine. Oversubscription happens when a program is running too many threads at the same time.

Suppose you have a machine with 8 CPUs. Consider a case where you're running a GridSearchCV (parallelized with joblib) with n_jobs=8 over a HistGradientBoostingClassifier (parallelized with OpenMP). Each instance of HistGradientBoostingClassifier will spawn 8 threads (since you have 8 CPUs). That's a total of 8 * 8 = 64 threads, which leads to oversubscription of threads for physical CPU resources and thus to scheduling overhead.

Oversubscription can arise in the exact same fashion with parallelized routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.

Starting from joblib >= 0.14, when the loky backend is used (which is the default), joblib will tell its child processes to limit the number of threads they can use, so as to avoid oversubscription. In practice the heuristic that joblib uses is to tell the processes to use max_threads = n_cpus // n_jobs, via their corresponding environment variable.
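Besides environment variables and joblib's heuristic, thread counts can also be capped programmatically with threadpoolctl (a dependency of scikit-learn). A sketch that forces any BLAS-backed call to run single-threaded inside a block:

```python
import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.RandomState(0)
a = rng.rand(200, 200)

with threadpool_limits(limits=1, user_api="blas"):
    # BLAS-backed calls in this block (e.g. this matmul) use a single thread
    b = a @ a
```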
Back to our example from above: since the joblib backend of GridSearchCV is loky, each process will only be able to use 1 thread instead of 8, thus mitigating the oversubscription issue.

Note that:

- Manually setting one of the environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, or BLIS_NUM_THREADS) will take precedence over what joblib tries to do. The total number of threads will be n_jobs * _NUM_THREADS. Note that setting this limit will also impact your computations in the main process, which will only use _NUM_THREADS. Joblib exposes a context manager for finer control over the number of threads in its workers (see the joblib docs linked below).
- When joblib is configured to use the threading backend, there is no mechanism to avoid oversubscription when calling into parallel native libraries in the joblib-managed threads.
- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code always use threadpoolctl internally to automatically adapt the number of threads used by OpenMP and potentially nested BLAS calls, so as to avoid oversubscription.

You will find additional details about joblib's mitigation of oversubscription in the joblib documentation. You will find additional details about parallelism in numerical python libraries in this document from Thomas J.
Fan.

9.3.2. Configuration switches

9.3.2.1. Python API

sklearn.set_config and sklearn.config_context can be used to change parameters of the configuration which control aspects of parallelism.

9.3.2.2. Environment variables

These environment variables should be set before importing scikit-learn.

9.3.2.2.1. SKLEARN_ASSUME_FINITE

Sets the default value for the assume_finite argument of sklearn.set_config.

9.3.2.2.2. SKLEARN_WORKING_MEMORY

Sets the default value for the working_memory argument of sklearn.set_config.

9.3.2.2.3. SKLEARN_SEED

Sets the seed of the global random generator when running the tests, for reproducibility. Note that scikit-learn tests are expected to run deterministically with explicit seeding of their own independent RNG instances, instead of relying on the numpy or Python standard library RNG singletons, to make sure that test results are independent of the test execution order.
However, some tests might forget to use explicit seeding, and this variable is a way to control the initial state of the aforementioned singletons.

9.3.2.2.4. SKLEARN_TESTS_GLOBAL_RANDOM_SEED

Controls the seeding of the random number generator used in tests that rely on the global_random_seed fixture. All tests that use this fixture accept the contract that they should deterministically pass for any seed value from 0 to 99 included.

In nightly CI builds, the SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable is drawn randomly from the above range and all fixtured tests will run for that specific seed. The goal is to ensure that, over time, our CI will run all tests with different seeds while keeping the test duration of a single run of the full test suite limited.
This will check that the assertions of tests written to use this fixture are not dependent on a specific seed value. The range of admissible seed values is limited to [0, 99] because it is often not possible to write a test that can work for any possible seed and we want to avoid having tests that randomly fail on the CI.

Valid values for SKLEARN_TESTS_GLOBAL_RANDOM_SEED:

- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42": run tests with a fixed seed of 42
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42": run the tests with all seeds between 40 and 42 included
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all": run the tests with all seeds between 0 and 99 included. This can take a long time: only use for individual tests, not the full test suite!

If the variable is not set, then 42 is used as the global seed in a deterministic manner. This ensures that, by default, the scikit-learn test suite is as deterministic as possible, to avoid disrupting our friendly third-party package maintainers. Similarly, this variable should not be set in the CI config of pull requests, to make sure that our friendly contributors are not the first people to encounter a seed-sensitivity regression in a test unrelated to the changes of their own PR.
Only the scikit-learn maintainers who watch the results of the nightly builds are expected to be annoyed by this.

When writing a new test function that uses this fixture, please use the following command to make sure that it passes deterministically for all admissible seeds on your local machine:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name

9.3.2.2.5. SKLEARN_SKIP_NETWORK_TESTS

When this environment variable is set to a non-zero value, the tests that need network access are skipped. When this environment variable is not set, network tests are skipped.

9.3.2.2.6. SKLEARN_RUN_FLOAT32_TESTS

When this environment variable is set to '1', the tests using the global_dtype fixture are also run on float32 data. When this environment variable is not set, the tests are only run on float64 data.

9.3.2.2.7. SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES

When this environment variable is set to a non-zero value, the Cython compiler directive boundscheck is set to True. This is useful for finding segfaults.

9.3.2.2.8. SKLEARN_BUILD_ENABLE_DEBUG_SYMBOLS

When this environment variable is set to a non-zero value, debug symbols will be included in the compiled C extensions. Only debug symbols for POSIX systems are configured.

9.3.2.2.9. SKLEARN_PAIRWISE_DIST_CHUNK_SIZE

This sets the size of chunk to be used by the underlying PairwiseDistancesReductions implementations.
The default value is 256, which has been shown to be adequate on most machines. Users looking for the best performance might want to tune this variable using powers of 2, so as to get the best parallelism behavior for their hardware, especially with respect to their caches' sizes.

9.3.2.2.10. SKLEARN_WARNINGS_AS_ERRORS

This environment variable is used to turn warnings into errors in tests and in the documentation build. Some CI (Continuous Integration) builds set SKLEARN_WARNINGS_AS_ERRORS=1, for example to make sure that we catch deprecation warnings from our dependencies and adapt our code. To locally run with the same "warnings as errors" setting as in these CI builds, you can set SKLEARN_WARNINGS_AS_ERRORS=1.

By default, warnings are not turned into errors. This is the case if SKLEARN_WARNINGS_AS_ERRORS is unset, or SKLEARN_WARNINGS_AS_ERRORS=0.

This environment variable uses specific warning filters to ignore some warnings, since sometimes warnings originate from third-party libraries and there is not much we can do about it. You can see the warning filters in the _get_warnings_filters_info_list function in sklearn/utils/_testing.py.

Note that for the documentation build, SKLEARN_WARNINGS_AS_ERRORS=1 checks that the documentation build, in particular running examples, does not produce any warnings.
This is different from the -W sphinx-build argument, which catches syntax warnings in the rst files.

9.1. Strategies to scale computationally: bigger data

For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

9.1.1. Scaling with instances using out-of-core learning

Out-of-core (or "external memory") learning is a technique used to learn from data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this goal:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm

9.1.1.1. Streaming instances

Basically, 1. may be a reader that yields instances from files on a hard drive, a database, a network stream, etc. However, details on how to achieve this are beyond the scope of this documentation.

9.1.1.2. Extracting features

2. could be any relevant way to extract features among the different feature extraction methods supported by scikit-learn. However, when working with data that needs vectorization, and where the set of features or values is not known in advance, one should take explicit care. A good example is text classification, where unknown terms are likely to be found during training.
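A stateless, fixed-width vectorizer sidesteps the unknown-term problem: any term, seen or unseen, hashes into the same output space. A minimal sketch (the feature count below is illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["known terms at train time", "brand new unseen terms at test time"]

# stateless: no fit / vocabulary, so unseen terms still hash to valid columns
vec = HashingVectorizer(n_features=2**10)
X = vec.transform(docs)
```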
It is possible to use a stateful vectorizer if making multiple passes over the data is reasonable from an application point of view. Otherwise, one can turn up the difficulty by using a stateless feature extractor. Currently the preferred way to do this is to use the so-called hashing trick, as implemented by sklearn.feature_extraction.FeatureHasher for datasets with categorical variables represented as lists of Python dicts, or sklearn.feature_extraction.text.HashingVectorizer for text documents.

9.1.1.3. Incremental learning

Finally, for 3. we have a number of options inside scikit-learn. Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning, as it guarantees that at any given time there will be only a small number of instances in main memory. Choosing a good size for the mini-batch that
Choosing a good size for the mini-batch that\nbalances relevancy and memory footprint could involve some tuning\n[\n1\n]\n.\nHere is a list of incremental estimators for different tasks:\nClassification\nsklearn.naive_bayes.MultinomialNB\nsklearn.naive_bayes.BernoulliNB\nsklearn.linear_model.Perceptron\nsklearn.linear_model.SGDClassifier", "metadata": { "url": "https://scikit-learn.org/stable/computing/scaling_strategies.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "Classification\nsklearn.naive_bayes.MultinomialNB\nsklearn.naive_bayes.BernoulliNB\nsklearn.linear_model.Perceptron\nsklearn.linear_model.SGDClassifier\nsklearn.linear_model.PassiveAggressiveClassifier\nsklearn.neural_network.MLPClassifier\nRegression\nsklearn.linear_model.SGDRegressor\nsklearn.linear_model.PassiveAggressiveRegressor\nsklearn.neural_network.MLPRegressor\nClustering\nsklearn.cluster.MiniBatchKMeans\nsklearn.cluster.Birch\nDecomposition / feature Extraction\nsklearn.decomposition.MiniBatchDictionaryLearning\nsklearn.decomposition.IncrementalPCA\nsklearn.decomposition.LatentDirichletAllocation\nsklearn.decomposition.MiniBatchNMF\nPreprocessing\nsklearn.preprocessing.StandardScaler\nsklearn.preprocessing.MinMaxScaler\nsklearn.preprocessing.MaxAbsScaler\nFor classification, a somewhat important thing to note is that although a\nstateless feature extraction routine may be able to cope with new/unseen\nattributes, the incremental learner itself may be unable to cope with", "metadata": { "url": "https://scikit-learn.org/stable/computing/scaling_strategies.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "stateless feature extraction routine may be able to cope with new/unseen\nattributes, the incremental learner itself may be unable to cope with\nnew/unseen targets classes. 
In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter.

Another aspect to consider when choosing a proper algorithm is that not all of them put the same importance on each example over time. Namely, the Perceptron is still sensitive to badly labeled examples even after many examples, whereas the SGD* and PassiveAggressive* families are more robust to this kind of artifact. Conversely, the latter also tend to give less importance to remarkably different, yet properly labeled, examples when they come late in the stream, as their learning rate decreases over time.

9.1.1.4. Examples

Finally, we have a full-fledged example of Out-of-core classification of text documents. It is aimed at providing a starting point for people wanting to build out-of-core learning systems and demonstrates most of the notions discussed above. Furthermore, it also shows the evolution of the performance of different algorithms with the number of processed examples. Now looking at the computation time of the different parts, we see that the vectorization is much more expensive than learning itself.
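The incremental-learning contract described above — mini-batches via partial_fit, with every possible class declared on the first call — can be sketched as follows; the stream and batch size here are synthetic stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
all_classes = np.unique(y)  # must be known up front

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):  # stream of mini-batches of 100
    batch_X = X[start:start + 100]
    batch_y = y[start:start + 100]
    # classes= is required on the first call; harmless on later calls
    clf.partial_fit(batch_X, batch_y, classes=all_classes)
```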
From the different algorithms, MultinomialNB is the most expensive, but its overhead can be mitigated by increasing the size of the mini-batches (exercise: change minibatch_size to 100 and 10000 in the program and compare).

9.1.1.5. Notes

7. Dataset transformations

scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

Combining such transformers, either in parallel or in series, is covered in Pipelines and composite estimators. Pairwise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while Transforming the prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in scikit-learn.

7.1. Pipelines and composite estimators
7.1.1. Pipeline: chaining estimators
7.1.2. Transforming target in regression
7.1.3. FeatureUnion: composite feature spaces
7.1.4. ColumnTransformer for heterogeneous data
7.1.5.
Visualizing Composite Estimators
7.2. Feature extraction
7.2.1. Loading features from dicts
7.2.2. Feature hashing
7.2.3. Text feature extraction
7.2.4. Image feature extraction
7.3. Preprocessing data
7.3.1. Standardization, or mean removal and variance scaling
7.3.2. Non-linear transformation
7.3.3. Normalization
7.3.4. Encoding categorical features
7.3.5. Discretization
7.3.6. Imputation of missing values
7.3.7. Generating polynomial features
7.3.8. Custom transformers
7.4. Imputation of missing values
7.4.1. Univariate vs. Multivariate Imputation
7.4.2. Univariate feature imputation
7.4.3. Multivariate feature imputation
7.4.4. Nearest neighbors imputation
7.4.5. Keeping the number of features constant
7.4.6. Marking imputed values
7.4.7. Estimators that handle NaN values
7.5. Unsupervised dimensionality reduction
7.5.1. PCA: principal component analysis
7.5.2. Random projections
7.5.3. Feature agglomeration
7.6. Random Projection
7.6.1. The Johnson-Lindenstrauss lemma
7.6.2. Gaussian random projection
7.6.3. Sparse random projection
7.6.4. Inverse Transform
7.7. Kernel Approximation
7.7.1. Nystroem Method for Kernel Approximation
7.7.2. Radial Basis Function Kernel
7.7.3. Additive Chi Squared Kernel
7.7.4. Skewed Chi Squared Kernel
7.7.5. Polynomial Kernel Approximation via Tensor Sketch
7.7.6. Mathematical Details
7.8. Pairwise metrics, Affinities and Kernels
7.8.1.
Cosine similarity
7.8.2. Linear kernel
7.8.3. Polynomial kernel
7.8.4. Sigmoid kernel
7.8.5. RBF kernel
7.8.6. Laplacian kernel
7.8.7. Chi-squared kernel
7.9. Transforming the prediction target (y)
7.9.1. Label binarization
7.9.2. Label encoding

8.4. Loading other datasets

8.4.1. Sample images

Scikit-learn also embeds a couple of sample JPEG images published under a Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.

load_sample_images(): load sample images for image manipulation.
load_sample_image(image_name): load the numpy array of a single sample image.

Warning: The default coding of images is based on the uint8 dtype to spare memory. Often machine learning algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use matplotlib.pyplot.imshow, don't forget to scale to the range 0 - 1 as done in the following example.

8.4.2. Datasets in svmlight / libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line takes the form