[ { "page_content": "User Guide\n1. Supervised learning\n1.1. Linear Models\n1.1.1. Ordinary Least Squares\n1.1.2. Ridge regression and classification\n1.1.3. Lasso\n1.1.4. Multi-task Lasso\n1.1.5. Elastic-Net\n1.1.6. Multi-task Elastic-Net\n1.1.7. Least Angle Regression\n1.1.8. LARS Lasso\n1.1.9. Orthogonal Matching Pursuit (OMP)\n1.1.10. Bayesian Regression\n1.1.11. Logistic regression\n1.1.12. Generalized Linear Models\n1.1.13. Stochastic Gradient Descent - SGD\n1.1.14. Perceptron\n1.1.15. Passive Aggressive Algorithms\n1.1.16. Robustness regression: outliers and modeling errors\n1.1.17. Quantile Regression\n1.1.18. Polynomial regression: extending linear models with basis functions\n1.2. Linear and Quadratic Discriminant Analysis\n1.2.1. Dimensionality reduction using Linear Discriminant Analysis\n1.2.2. Mathematical formulation of the LDA and QDA classifiers\n1.2.3. Mathematical formulation of LDA dimensionality reduction\n1.2.4. Shrinkage and Covariance Estimator\n1.2.5. Estimation algorithms\n1.3. Kernel ridge regression", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 0, "source": "scikit-learn-docs" } }, { "page_content": "1.2.4. Shrinkage and Covariance Estimator\n1.2.5. Estimation algorithms\n1.3. Kernel ridge regression\n1.4. Support Vector Machines\n1.4.1. Classification\n1.4.2. Regression\n1.4.3. Density estimation, novelty detection\n1.4.4. Complexity\n1.4.5. Tips on Practical Use\n1.4.6. Kernel functions\n1.4.7. Mathematical formulation\n1.4.8. Implementation details\n1.5. Stochastic Gradient Descent\n1.5.1. Classification\n1.5.2. Regression\n1.5.3. Online One-Class SVM\n1.5.4. Stochastic Gradient Descent for sparse data\n1.5.5. Complexity\n1.5.6. Stopping criterion\n1.5.7. Tips on Practical Use\n1.5.8. Mathematical formulation\n1.5.9. Implementation details\n1.6. Nearest Neighbors\n1.6.1. Unsupervised Nearest Neighbors\n1.6.2. Nearest Neighbors Classification\n1.6.3. Nearest Neighbors Regression\n1.6.4. 
Nearest Neighbor Algorithms\n1.6.5. Nearest Centroid Classifier\n1.6.6. Nearest Neighbors Transformer\n1.6.7. Neighborhood Components Analysis\n1.7. Gaussian Processes\n1.7.1. Gaussian Process Regression (GPR)", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 1, "source": "scikit-learn-docs" } }, { "page_content": "1.6.6. Nearest Neighbors Transformer\n1.6.7. Neighborhood Components Analysis\n1.7. Gaussian Processes\n1.7.1. Gaussian Process Regression (GPR)\n1.7.2. Gaussian Process Classification (GPC)\n1.7.3. GPC examples\n1.7.4. Kernels for Gaussian Processes\n1.8. Cross decomposition\n1.8.1. PLSCanonical\n1.8.2. PLSSVD\n1.8.3. PLSRegression\n1.8.4. Canonical Correlation Analysis\n1.9. Naive Bayes\n1.9.1. Gaussian Naive Bayes\n1.9.2. Multinomial Naive Bayes\n1.9.3. Complement Naive Bayes\n1.9.4. Bernoulli Naive Bayes\n1.9.5. Categorical Naive Bayes\n1.9.6. Out-of-core naive Bayes model fitting\n1.10. Decision Trees\n1.10.1. Classification\n1.10.2. Regression\n1.10.3. Multi-output problems\n1.10.4. Complexity\n1.10.5. Tips on practical use\n1.10.6. Tree algorithms: ID3, C4.5, C5.0 and CART\n1.10.7. Mathematical formulation\n1.10.8. Missing Values Support\n1.10.9. Minimal Cost-Complexity Pruning\n1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking\n1.11.1. Gradient-boosted trees", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "1.10.9. Minimal Cost-Complexity Pruning\n1.11. Ensembles: Gradient boosting, random forests, bagging, voting, stacking\n1.11.1. Gradient-boosted trees\n1.11.2. Random forests and other randomized tree ensembles\n1.11.3. Bagging meta-estimator\n1.11.4. Voting Classifier\n1.11.5. Voting Regressor\n1.11.6. Stacked generalization\n1.11.7. AdaBoost\n1.12. Multiclass and multioutput algorithms\n1.12.1. Multiclass classification\n1.12.2. Multilabel classification\n1.12.3. 
Multiclass-multioutput classification\n1.12.4. Multioutput regression\n1.13. Feature selection\n1.13.1. Removing features with low variance\n1.13.2. Univariate feature selection\n1.13.3. Recursive feature elimination\n1.13.4. Feature selection using SelectFromModel\n1.13.5. Sequential Feature Selection\n1.13.6. Feature selection as part of a pipeline\n1.14. Semi-supervised learning\n1.14.1. Self Training\n1.14.2. Label Propagation\n1.15. Isotonic regression\n1.16. Probability calibration\n1.16.1. Calibration curves\n1.16.2. Calibrating a classifier", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "1.14.2. Label Propagation\n1.15. Isotonic regression\n1.16. Probability calibration\n1.16.1. Calibration curves\n1.16.2. Calibrating a classifier\n1.16.3. Usage\n1.17. Neural network models (supervised)\n1.17.1. Multi-layer Perceptron\n1.17.2. Classification\n1.17.3. Regression\n1.17.4. Regularization\n1.17.5. Algorithms\n1.17.6. Complexity\n1.17.7. Tips on Practical Use\n1.17.8. More control with warm_start\n2. Unsupervised learning\n2.1. Gaussian mixture models\n2.1.1. Gaussian Mixture\n2.1.2. Variational Bayesian Gaussian Mixture\n2.2. Manifold learning\n2.2.1. Introduction\n2.2.2. Isomap\n2.2.3. Locally Linear Embedding\n2.2.4. Modified Locally Linear Embedding\n2.2.5. Hessian Eigenmapping\n2.2.6. Spectral Embedding\n2.2.7. Local Tangent Space Alignment\n2.2.8. Multi-dimensional Scaling (MDS)\n2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)\n2.2.10. Tips on practical use\n2.3. Clustering\n2.3.1. Overview of clustering methods\n2.3.2. K-means\n2.3.3. Affinity Propagation\n2.3.4. Mean Shift", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 4, "source": "scikit-learn-docs" } }, { "page_content": "2.2.10. Tips on practical use\n2.3. Clustering\n2.3.1. Overview of clustering methods\n2.3.2. K-means\n2.3.3. 
Affinity Propagation\n2.3.4. Mean Shift\n2.3.5. Spectral clustering\n2.3.6. Hierarchical clustering\n2.3.7. DBSCAN\n2.3.8. HDBSCAN\n2.3.9. OPTICS\n2.3.10. BIRCH\n2.3.11. Clustering performance evaluation\n2.4. Biclustering\n2.4.1. Spectral Co-Clustering\n2.4.2. Spectral Biclustering\n2.4.3. Biclustering evaluation\n2.5. Decomposing signals in components (matrix factorization problems)\n2.5.1. Principal component analysis (PCA)\n2.5.2. Kernel Principal Component Analysis (kPCA)\n2.5.3. Truncated singular value decomposition and latent semantic analysis\n2.5.4. Dictionary Learning\n2.5.5. Factor Analysis\n2.5.6. Independent component analysis (ICA)\n2.5.7. Non-negative matrix factorization (NMF or NNMF)\n2.5.8. Latent Dirichlet Allocation (LDA)\n2.6. Covariance estimation\n2.6.1. Empirical covariance\n2.6.2. Shrunk Covariance\n2.6.3. Sparse inverse covariance\n2.6.4. Robust Covariance Estimation", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 5, "source": "scikit-learn-docs" } }, { "page_content": "2.6. Covariance estimation\n2.6.1. Empirical covariance\n2.6.2. Shrunk Covariance\n2.6.3. Sparse inverse covariance\n2.6.4. Robust Covariance Estimation\n2.7. Novelty and Outlier Detection\n2.7.1. Overview of outlier detection methods\n2.7.2. Novelty Detection\n2.7.3. Outlier Detection\n2.7.4. Novelty detection with Local Outlier Factor\n2.8. Density Estimation\n2.8.1. Density Estimation: Histograms\n2.8.2. Kernel Density Estimation\n2.9. Neural network models (unsupervised)\n2.9.1. Restricted Boltzmann machines\n3. Model selection and evaluation\n3.1. Cross-validation: evaluating estimator performance\n3.1.1. Computing cross-validated metrics\n3.1.2. Cross validation iterators\n3.1.3. A note on shuffling\n3.1.4. Cross validation and model selection\n3.1.5. Permutation test score\n3.2. Tuning the hyper-parameters of an estimator\n3.2.1. Exhaustive Grid Search\n3.2.2. Randomized Parameter Optimization\n3.2.3. 
Searching for optimal parameters with successive halving\n3.2.4. Tips for parameter search", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 6, "source": "scikit-learn-docs" } }, { "page_content": "3.2.2. Randomized Parameter Optimization\n3.2.3. Searching for optimal parameters with successive halving\n3.2.4. Tips for parameter search\n3.2.5. Alternatives to brute force parameter search\n3.3. Tuning the decision threshold for class prediction\n3.3.1. Post-tuning the decision threshold\n3.4. Metrics and scoring: quantifying the quality of predictions\n3.4.1. Which scoring function should I use?\n3.4.2. Scoring API overview\n3.4.3. The\nscoring\nparameter: defining model evaluation rules\n3.4.4. Classification metrics\n3.4.5. Multilabel ranking metrics\n3.4.6. Regression metrics\n3.4.7. Clustering metrics\n3.4.8. Dummy estimators\n3.5. Validation curves: plotting scores to evaluate models\n3.5.1. Validation curve\n3.5.2. Learning curve\n4. Metadata Routing\n4.1. Usage Examples\n4.1.1. Weighted scoring and fitting\n4.1.2. Weighted scoring and unweighted fitting\n4.1.3. Unweighted feature selection\n4.1.4. Different scoring and fitting weights\n4.2. API Interface\n4.3. Metadata Routing Support Status", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 7, "source": "scikit-learn-docs" } }, { "page_content": "4.1.3. Unweighted feature selection\n4.1.4. Different scoring and fitting weights\n4.2. API Interface\n4.3. Metadata Routing Support Status\n5. Inspection\n5.1. Partial Dependence and Individual Conditional Expectation plots\n5.1.1. Partial dependence plots\n5.1.2. Individual conditional expectation (ICE) plot\n5.1.3. Mathematical Definition\n5.1.4. Computation methods\n5.2. Permutation feature importance\n5.2.1. Outline of the permutation importance algorithm\n5.2.2. Relation to impurity-based importance in trees\n5.2.3. Misleading values on strongly correlated features\n6. 
Visualizations\n6.1. Available Plotting Utilities\n6.1.1. Display Objects\n7. Dataset transformations\n7.1. Pipelines and composite estimators\n7.1.1. Pipeline: chaining estimators\n7.1.2. Transforming target in regression\n7.1.3. FeatureUnion: composite feature spaces\n7.1.4. ColumnTransformer for heterogeneous data\n7.1.5. Visualizing Composite Estimators\n7.2. Feature extraction\n7.2.1. Loading features from dicts", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 8, "source": "scikit-learn-docs" } }, { "page_content": "7.1.4. ColumnTransformer for heterogeneous data\n7.1.5. Visualizing Composite Estimators\n7.2. Feature extraction\n7.2.1. Loading features from dicts\n7.2.2. Feature hashing\n7.2.3. Text feature extraction\n7.2.4. Image feature extraction\n7.3. Preprocessing data\n7.3.1. Standardization, or mean removal and variance scaling\n7.3.2. Non-linear transformation\n7.3.3. Normalization\n7.3.4. Encoding categorical features\n7.3.5. Discretization\n7.3.6. Imputation of missing values\n7.3.7. Generating polynomial features\n7.3.8. Custom transformers\n7.4. Imputation of missing values\n7.4.1. Univariate vs. Multivariate Imputation\n7.4.2. Univariate feature imputation\n7.4.3. Multivariate feature imputation\n7.4.4. Nearest neighbors imputation\n7.4.5. Keeping the number of features constant\n7.4.6. Marking imputed values\n7.4.7. Estimators that handle NaN values\n7.5. Unsupervised dimensionality reduction\n7.5.1. PCA: principal component analysis\n7.5.2. Random projections\n7.5.3. Feature agglomeration", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 9, "source": "scikit-learn-docs" } }, { "page_content": "7.5. Unsupervised dimensionality reduction\n7.5.1. PCA: principal component analysis\n7.5.2. Random projections\n7.5.3. Feature agglomeration\n7.6. Random Projection\n7.6.1. The Johnson-Lindenstrauss lemma\n7.6.2. Gaussian random projection\n7.6.3. 
Sparse random projection\n7.6.4. Inverse Transform\n7.7. Kernel Approximation\n7.7.1. Nystroem Method for Kernel Approximation\n7.7.2. Radial Basis Function Kernel\n7.7.3. Additive Chi Squared Kernel\n7.7.4. Skewed Chi Squared Kernel\n7.7.5. Polynomial Kernel Approximation via Tensor Sketch\n7.7.6. Mathematical Details\n7.8. Pairwise metrics, Affinities and Kernels\n7.8.1. Cosine similarity\n7.8.2. Linear kernel\n7.8.3. Polynomial kernel\n7.8.4. Sigmoid kernel\n7.8.5. RBF kernel\n7.8.6. Laplacian kernel\n7.8.7. Chi-squared kernel\n7.9. Transforming the prediction target (\ny\n)\n7.9.1. Label binarization\n7.9.2. Label encoding\n8. Dataset loading utilities\n8.1. Toy datasets\n8.1.1. Iris plants dataset\n8.1.2. Diabetes dataset", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 10, "source": "scikit-learn-docs" } }, { "page_content": "y\n)\n7.9.1. Label binarization\n7.9.2. Label encoding\n8. Dataset loading utilities\n8.1. Toy datasets\n8.1.1. Iris plants dataset\n8.1.2. Diabetes dataset\n8.1.3. Optical recognition of handwritten digits dataset\n8.1.4. Linnerrud dataset\n8.1.5. Wine recognition dataset\n8.1.6. Breast cancer Wisconsin (diagnostic) dataset\n8.2. Real world datasets\n8.2.1. The Olivetti faces dataset\n8.2.2. The 20 newsgroups text dataset\n8.2.3. The Labeled Faces in the Wild face recognition dataset\n8.2.4. Forest covertypes\n8.2.5. RCV1 dataset\n8.2.6. Kddcup 99 dataset\n8.2.7. California Housing dataset\n8.2.8. Species distribution dataset\n8.3. Generated datasets\n8.3.1. Generators for classification and clustering\n8.3.2. Generators for regression\n8.3.3. Generators for manifold learning\n8.3.4. Generators for decomposition\n8.4. Loading other datasets\n8.4.1. Sample images\n8.4.2. Datasets in svmlight / libsvm format\n8.4.3. Downloading datasets from the openml.org repository\n8.4.4. 
Loading from external datasets", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 11, "source": "scikit-learn-docs" } }, { "page_content": "8.4.2. Datasets in svmlight / libsvm format\n8.4.3. Downloading datasets from the openml.org repository\n8.4.4. Loading from external datasets\n9. Computing with scikit-learn\n9.1. Strategies to scale computationally: bigger data\n9.1.1. Scaling with instances using out-of-core learning\n9.2. Computational Performance\n9.2.1. Prediction Latency\n9.2.2. Prediction Throughput\n9.2.3. Tips and Tricks\n9.3. Parallelism, resource management, and configuration\n9.3.1. Parallelism\n9.3.2. Configuration switches\n10. Model persistence\n10.1. Workflow Overview\n10.1.1. Train and Persist the Model\n10.2. ONNX\n10.3.\nskops.io\n10.4.\npickle\n,\njoblib\n, and\ncloudpickle\n10.5. Security & Maintainability Limitations\n10.5.1. Replicating the training environment in production\n10.5.2. Serving the model artifact\n10.6. Summarizing the key points\n11. Common pitfalls and recommended practices\n11.1. Inconsistent preprocessing\n11.2. Data leakage\n11.2.1. How to avoid data leakage\n11.2.2. Data leakage during pre-processing", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 12, "source": "scikit-learn-docs" } }, { "page_content": "11.1. Inconsistent preprocessing\n11.2. Data leakage\n11.2.1. How to avoid data leakage\n11.2.2. Data leakage during pre-processing\n11.3. Controlling randomness\n11.3.1. Using\nNone\nor\nRandomState\ninstances, and repeated calls to\nfit\nand\nsplit\n11.3.2. Common pitfalls and subtleties\n11.3.3. General recommendations\n12. Dispatching\n12.1. Array API support (experimental)\n12.1.1. Example usage\n12.1.2. Support for\nArray\nAPI\n-compatible inputs\n12.1.3. Input and output array type handling\n12.1.4. Common estimator checks\n13. Choosing the right estimator\n14. External Resources, Videos and Talks\n14.1. The scikit-learn MOOC\n14.2. 
Videos\n14.3. New to Scientific Python?\n14.4. External Tutorials", "metadata": { "url": "https://scikit-learn.org/stable/user_guide.html", "chunk_index": 13, "source": "scikit-learn-docs" } }, { "page_content": "9.2.\nComputational Performance\nFor some applications the performance (mainly latency and throughput at\nprediction time) of estimators is crucial. It may also be of interest to\nconsider the training throughput but this is often less important in a\nproduction setup (where it often takes place offline).\nWe will review here the orders of magnitude you can expect from a number of\nscikit-learn estimators in different contexts and provide some tips and\ntricks for overcoming performance bottlenecks.\nPrediction latency is measured as the elapsed time necessary to make a\nprediction (e.g. in microseconds). Latency is often viewed as a distribution\nand operations engineers often focus on the latency at a given percentile of\nthis distribution (e.g. the 90th percentile).\nPrediction throughput is defined as the number of predictions the software can\ndeliver in a given amount of time (e.g. in predictions per second).\nAn important aspect of performance optimization is also that it can hurt", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 0, "source": "scikit-learn-docs" } }, { "page_content": "deliver in a given amount of time (e.g. in predictions per second).\nAn important aspect of performance optimization is also that it can hurt\nprediction accuracy. Indeed, simpler models (e.g. 
linear instead of\nnon-linear, or with fewer parameters) often run faster but are not always able\nto take into account the same exact properties of the data as more complex ones.\n9.2.1.\nPrediction Latency\nOne of the most straightforward concerns when choosing a\nmachine learning toolkit is the latency at which predictions can be made in a\nproduction environment.\nThe main factors that influence prediction latency are:\nNumber of features\nInput data representation and sparsity\nModel complexity\nFeature extraction\nA final major factor is whether predictions are made in bulk or\none at a time.\n9.2.1.1.\nBulk versus Atomic mode\nIn general, making predictions in bulk (many instances at the same time) is", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 1, "source": "scikit-learn-docs" } }, { "page_content": "one at a time.\n9.2.1.1.\nBulk versus Atomic mode\nIn general, making predictions in bulk (many instances at the same time) is\nmore efficient for a number of reasons (branch predictability, CPU caching,\nlinear algebra library optimizations, etc.). In a setting\nwith few features, bulk mode is faster regardless of the estimator,\nand for some estimators by 1 to 2 orders of magnitude:\nTo benchmark different estimators for your case you can simply change the\nn_features\nparameter in this example:\nPrediction Latency\n. This should give\nyou an estimate of the order of magnitude of the prediction latency.\n9.2.1.2.\nConfiguring Scikit-learn for reduced validation overhead\nScikit-learn performs some validation on data that increases the overhead per\ncall to\npredict\nand similar functions. In particular, checking that\nfeatures are finite (not NaN or infinite) involves a full pass over the\ndata. 
If you ensure that your data is acceptable, you may suppress", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "features are finite (not NaN or infinite) involves a full pass over the\ndata. If you ensure that your data is acceptable, you may suppress\nthe finiteness check by setting the environment variable\nSKLEARN_ASSUME_FINITE to a non-empty string before importing\nscikit-learn, or configure it in Python with set_config.\nFor more control than these global settings, a config_context\nallows you to set this configuration within a specified context:\n>>> import sklearn\n>>> with sklearn.config_context(assume_finite=True):\n...     pass  # do learning/prediction here with reduced validation\nNote that this will affect all uses of assert_all_finite within the context.\n9.2.1.3.\nInfluence of the Number of Features\nWhen the number of features increases, so does the memory\nconsumption of each example: for a matrix of \\(M\\) instances\nwith \\(N\\) features, the space complexity is in \\(O(NM)\\).\nFrom a computing perspective it also means that the number of basic operations", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "instances with \\(N\\) features, the space complexity is in \\(O(NM)\\).\nFrom a computing perspective it also means that the number of basic operations\n(e.g., multiplications for vector-matrix products in linear models) increases\ntoo. 
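The effect of the number of features on per-call latency can be sketched as follows. This is a minimal illustration, not the benchmark from this page: the Ridge estimator, data sizes, and repeat counts are arbitrary choices.

```python
# Minimal sketch: per-call (atomic) prediction latency as the number of
# features grows. Estimator, sizes, and repeat counts are arbitrary.
import time

import numpy as np
from sklearn.linear_model import Ridge

latencies = {}
for n_features in (10, 100, 1000):
    rng = np.random.RandomState(0)
    X = rng.rand(200, n_features)
    y = rng.rand(200)
    model = Ridge().fit(X, y)
    one_row = X[:1]
    model.predict(one_row)  # warm-up call, excluded from timing
    start = time.perf_counter()
    for _ in range(100):
        model.predict(one_row)
    latencies[n_features] = (time.perf_counter() - start) / 100
print({n: "%.1f us" % (t * 1e6) for n, t in latencies.items()})
```

For small inputs the fixed per-call validation overhead can dominate, so the growth with the number of features only becomes clearly visible at larger sizes.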
Here is a graph of the evolution of the prediction latency with the\nnumber of features:\nOverall you can expect the prediction time to increase at least linearly with\nthe number of features (non-linear cases can happen depending on the global\nmemory footprint and estimator).\n9.2.1.4.\nInfluence of the Input Data Representation\nScipy provides sparse matrix data structures which are optimized for storing\nsparse data. The main feature of sparse formats is that zeros are not stored,\nso if your data is sparse it uses much less memory. A non-zero value in\na sparse (CSR or CSC) representation only takes, on average, one 32-bit integer position + the\n64-bit floating point value + an additional 32 bits per row or column in the matrix.", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 4, "source": "scikit-learn-docs" } }, { "page_content": "64-bit floating point value + an additional 32 bits per row or column in the matrix.\nUsing sparse input on a dense (or sparse) linear model can speed up prediction\nconsiderably, as only the non-zero-valued features affect the dot product\nand thus the model predictions. Hence if you have 100 non-zeros in a 1e6-dimensional\nspace, you only need 100 multiply-and-add operations instead of 1e6.\nCalculation over a dense representation, however, may leverage highly optimized\nvector operations and multithreading in BLAS, and tends to result in fewer CPU\ncache misses. 
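As a minimal illustration of this point (the sizes and the roughly 99% sparsity level below are arbitrary), a linear model fitted on dense data accepts the same data in CSR form and yields the same predictions while storing only the non-zeros:

```python
# Minimal sketch: the same linear model predicting from a dense array and
# from its CSR representation. Sizes and sparsity level are arbitrary.
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_dense = rng.rand(300, 500)
X_dense[X_dense < 0.99] = 0.0          # keep roughly 1% non-zeros
X_sparse = sparse.csr_matrix(X_dense)  # only non-zeros are stored
y = rng.rand(300)

model = SGDRegressor(max_iter=20, tol=None).fit(X_dense, y)
pred_dense = model.predict(X_dense)
pred_sparse = model.predict(X_sparse)  # sparse input is accepted directly
print("predictions match:", np.allclose(pred_dense, pred_sparse))
```

Whether the sparse path is actually faster depends on the sparsity level and the BLAS, as the surrounding text explains.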
So the sparsity should typically be quite high (at most 10% non-zeros,\nto be checked depending on the hardware) for the sparse input\nrepresentation to be faster than the dense input representation on a machine\nwith many CPUs and an optimized BLAS implementation.\nHere is sample code to test the sparsity of your input:\ndef sparsity_ratio(X):\n    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])\nprint(", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 5, "source": "scikit-learn-docs" } }, { "page_content": "def sparsity_ratio(X):\n    return 1.0 - np.count_nonzero(X) / float(X.shape[0] * X.shape[1])\nprint(\"input sparsity ratio:\", sparsity_ratio(X))\nAs a rule of thumb, if the sparsity ratio is greater\nthan 90% you can probably benefit from sparse formats. Check Scipy’s sparse\nmatrix formats documentation\nfor more information on how to build (or convert your data to) sparse matrix\nformats. Most of the time the CSR and CSC formats work best.\n9.2.1.5.\nInfluence of the Model Complexity\nGenerally speaking, when model complexity increases, both predictive power and\nlatency are expected to increase. Increasing predictive power is usually\ndesirable, but for many applications it is better not to increase\nprediction latency too much. We will now review this idea for different\nfamilies of supervised models.\nFor sklearn.linear_model (e.g. Lasso, ElasticNet,\nSGDClassifier/Regressor, Ridge & RidgeClassifier,", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 6, "source": "scikit-learn-docs" } }, { "page_content": "families of supervised models.\nFor sklearn.linear_model (e.g. 
Lasso, ElasticNet,\nSGDClassifier/Regressor, Ridge & RidgeClassifier,\nPassiveAggressiveClassifier/Regressor, LinearSVC, LogisticRegression…) the\ndecision function applied at prediction time is the same (a dot product),\nso latency should be equivalent.\nHere is an example using SGDClassifier with the elasticnet\npenalty. The regularization strength is globally controlled by\nthe alpha parameter. With a sufficiently high alpha,\none can then increase the l1_ratio parameter of elasticnet to\nenforce various levels of sparsity in the model coefficients. Higher sparsity\nhere is interpreted as lower model complexity, as fewer coefficients are needed\nto describe the model fully. Of course sparsity in turn influences the prediction\ntime, as the sparse dot product takes time roughly proportional to the number of\nnon-zero coefficients.\nFor the sklearn.svm family of algorithms with a non-linear kernel,", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 7, "source": "scikit-learn-docs" } }, { "page_content": "non-zero coefficients.\nFor the sklearn.svm family of algorithms with a non-linear kernel,\nthe latency is tied to the number of support vectors (the fewer, the faster).\nLatency and throughput should (asymptotically) grow linearly with the number\nof support vectors in an SVC or SVR model. The kernel also influences the\nlatency, as it is used to compute the projection of the input vector once per\nsupport vector. In the following graph the nu parameter of\nNuSVR was used to influence the number of support vectors.\nFor sklearn.ensemble of trees (e.g. RandomForest, GBT,\nExtraTrees, etc.) the number of trees and their depth play the most\nimportant role. Latency and throughput should scale linearly with the number\nof trees. 
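That scaling with the number of trees can be sketched as follows (a minimal illustration; the dataset, tree counts, and repeat counts are arbitrary, not the settings used on this page):

```python
# Minimal sketch: bulk prediction cost for different numbers of trees.
# Dataset, tree counts, and repeat counts are arbitrary illustrations.
import time

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 10)
y = rng.rand(200)

timings = {}
for n_estimators in (10, 100):
    model = GradientBoostingRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(X, y)
    start = time.perf_counter()
    for _ in range(50):
        model.predict(X)  # every tree is evaluated for every instance
    timings[n_estimators] = (time.perf_counter() - start) / 50
print({n: "%.2f ms" % (t * 1e3) for n, t in timings.items()})
```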
In this case we directly used the\nn_estimators\nparameter of\nGradientBoostingRegressor\n.\nIn any case, be warned that decreasing model complexity can hurt accuracy as\nmentioned above. For instance a non-linearly separable problem can be handled", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 8, "source": "scikit-learn-docs" } }, { "page_content": "mentioned above. For instance a non-linearly separable problem can be handled\nwith a speedy linear model but prediction power will very likely suffer in\nthe process.\n9.2.1.6.\nFeature Extraction Latency\nMost scikit-learn models are usually pretty fast as they are implemented\neither with compiled Cython extensions or optimized computing libraries.\nOn the other hand, in many real-world applications the feature extraction\nprocess (i.e. turning raw data like database rows or network packets into\nnumpy arrays) governs the overall prediction time. For example on the Reuters\ntext classification task the whole preparation (reading and parsing SGML\nfiles, tokenizing the text and hashing it into a common vector space) takes\n100 to 500 times longer than the actual prediction code, depending on\nthe chosen model.\nIn many cases it is thus recommended to carefully time and profile your\nfeature extraction code, as it may be a good place to start optimizing when", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 9, "source": "scikit-learn-docs" } }, { "page_content": "In many cases it is thus recommended to carefully time and profile your\nfeature extraction code, as it may be a good place to start optimizing when\nyour overall latency is too high for your application.\n9.2.2.\nPrediction Throughput\nAnother important metric to care about when sizing production systems is\nthroughput, i.e. the number of predictions you can make in a given amount of\ntime. 
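A minimal sketch of measuring throughput (the model, data sizes, and call counts below are arbitrary illustrations):

```python
# Minimal sketch: throughput as predictions delivered per second in
# atomic (one-instance-per-call) mode. Model and sizes are arbitrary.
import time

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(500, 20)
y = rng.randint(0, 2, size=500)
model = LogisticRegression().fit(X, y)

n_calls = 200
start = time.perf_counter()
for i in range(n_calls):
    model.predict(X[i : i + 1])  # one instance per call
elapsed = time.perf_counter() - start
throughput = n_calls / elapsed
print("%.0f predictions/second (atomic mode)" % throughput)
```

Replacing the loop with a single `model.predict(X)` call shows the bulk-mode throughput discussed earlier, which is typically much higher.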
Here is a benchmark from the\nPrediction Latency\nexample that measures\nthis quantity for a number of estimators on synthetic data:\nThese throughputs are achieved on a single process. An obvious way to\nincrease the throughput of your application is to spawn additional instances\n(usually processes in Python because of the\nGIL\n) that share the\nsame model. One might also add machines to spread the load. A detailed\nexplanation on how to achieve this is beyond the scope of this documentation\nthough.\n9.2.3.\nTips and Tricks\n9.2.3.1.\nLinear algebra libraries", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 10, "source": "scikit-learn-docs" } }, { "page_content": "explanation on how to achieve this is beyond the scope of this documentation\nthough.\n9.2.3.\nTips and Tricks\n9.2.3.1.\nLinear algebra libraries\nAs scikit-learn relies heavily on Numpy/Scipy and linear algebra in general it\nmakes sense to take explicit care of the versions of these libraries.\nBasically, you ought to make sure that Numpy is built using an optimized\nBLAS\n/\nLAPACK\nlibrary.\nNot all models benefit from optimized BLAS and Lapack implementations. For\ninstance models based on (randomized) decision trees typically do not rely on\nBLAS calls in their inner loops, nor do kernel SVMs (\nSVC\n,\nSVR\n,\nNuSVC\n,\nNuSVR\n). 
On the other hand a linear model implemented with a\nBLAS DGEMM call (via\nnumpy.dot\n) will typically benefit hugely from a tuned\nBLAS implementation and lead to orders of magnitude speedup over a\nnon-optimized BLAS.\nYou can display the BLAS / LAPACK implementation used by your NumPy / SciPy /\nscikit-learn install with the following command:\npython -c", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 11, "source": "scikit-learn-docs" } }, { "page_content": "You can display the BLAS / LAPACK implementation used by your NumPy / SciPy /\nscikit-learn install with the following command:\npython -c \"import sklearn; sklearn.show_versions()\"\nOptimized BLAS / LAPACK implementations include:\nAtlas (needs hardware-specific tuning by rebuilding on the target machine)\nOpenBLAS\nMKL\nApple Accelerate and vecLib frameworks (OSX only)\nMore information can be found on the\nNumPy install page\nand in this\nblog post\nfrom Daniel Nouri, which has some nice step-by-step install instructions for\nDebian / Ubuntu.\n9.2.3.2.\nLimiting Working Memory\nSome calculations, when implemented using standard numpy vectorized operations,\ninvolve using a large amount of temporary memory. This may potentially exhaust\nsystem memory. Where computations can be performed in fixed-memory chunks, we\nattempt to do so, and allow the user to hint at the maximum size of this\nworking memory (defaulting to 1GB) using\nset_config\nor\nconfig_context\n. The following limits temporary working", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 12, "source": "scikit-learn-docs" } }, { "page_content": "working memory (defaulting to 1GB) using\nset_config\nor\nconfig_context\n. 
The following limits temporary working\nmemory to 128 MiB:\n>>> import sklearn\n>>> with sklearn.config_context(working_memory=128):\n...     pass  # do chunked work here\nAn example of a chunked operation adhering to this setting is\npairwise_distances_chunked\n, which facilitates computing\nrow-wise reductions of a pairwise distance matrix.\n9.2.3.3.\nModel Compression\nModel compression in scikit-learn only concerns linear models for the moment.\nIn this context it means that we want to control the model sparsity (i.e. the\nnumber of non-zero coordinates in the model vectors). It is generally a good\nidea to combine model sparsity with a sparse input data representation.\nHere is sample code that illustrates the use of the\nsparsify()\nmethod:\nclf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)\nclf.fit(X_train, y_train).sparsify()\nclf.predict(X_test)\nIn this example we prefer the\nelasticnet", "metadata": { "url": "https://scikit-learn.org/stable/computing/computational_performance.html", "chunk_index": 13, "source": "scikit-learn-docs" } }, { "page_content": "clf = SGDRegressor(penalty='elasticnet', l1_ratio=0.25)\nclf.fit(X_train, y_train).sparsify()\nclf.predict(X_test)\nIn this example we prefer the\nelasticnet\npenalty as it is often a good\ncompromise between model compactness and prediction power. One can also\nfurther tune the\nl1_ratio\nparameter (in combination with the\nregularization strength\nalpha\n) to control this tradeoff.\nA typical\nbenchmark\non synthetic data yields a >30% decrease in latency when both the model and\ninput are sparse (with 0.000024 and 0.027400 non-zero coefficients ratio\nrespectively). 
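To see concretely what sparsify() does, here is a minimal sketch; the data and the alpha/l1_ratio values are arbitrary and chosen only so that regularization zeroes out most coefficients:

```python
# Minimal sketch: sparsify() converts a fitted linear model's coef_ to a
# scipy.sparse matrix in place. Data and penalty settings are arbitrary.
import numpy as np
from scipy import sparse
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X_train = rng.rand(200, 100)
y_train = rng.rand(200)

clf = SGDRegressor(penalty="elasticnet", alpha=1.0, l1_ratio=0.9,
                   max_iter=50, tol=None, random_state=0)
clf.fit(X_train, y_train)
clf.sparsify()  # coef_ is now stored in a scipy.sparse (CSR) matrix
print(type(clf.coef_).__name__, "-", clf.coef_.nnz, "non-zero coefficients")
```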
Your mileage may vary depending on the sparsity and size of your data and model.

Furthermore, sparsifying can be very useful to reduce the memory usage of predictive models deployed on production servers.

9.2.3.4. Model Reshaping

Model reshaping consists in selecting only a portion of the available features to fit a model. In other words, if a model discards features during the learning phase we can then strip those from the input. This has several benefits. Firstly, it reduces the memory (and therefore time) overhead of the model itself. It also makes it possible to discard explicit feature selection components in a pipeline once we know which features to keep from a previous run. Finally, it can help reduce processing time and I/O usage upstream in the data access and feature extraction layers by not collecting and building features that are discarded by the model.
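A sketch of the idea, using SelectKBest purely as an illustrative selector: record which columns the fitted selection step actually keeps, then slice future inputs down to those columns instead of building all of them:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=100, n_features=20, n_informative=5,
                           random_state=0)

selector = SelectKBest(f_classif, k=5).fit(X, y)
kept = selector.get_support(indices=True)  # column indices the model keeps

# downstream, only collect/build these columns instead of all 20
X_reshaped = X[:, kept]
```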
For instance, if the raw data come from a database, it is possible to write simpler and faster queries, or to reduce I/O usage, by making the queries return lighter records. At the moment, reshaping needs to be performed manually in scikit-learn. In the case of sparse input (particularly in CSR format), it is generally sufficient to not generate the relevant features, leaving their columns empty.

9.2.3.5. Links

- scikit-learn developer performance documentation
- SciPy sparse matrix formats documentation

9.3. Parallelism, resource management, and configuration

9.3.1. Parallelism

Some scikit-learn estimators and utilities parallelize costly operations using multiple CPU cores. Depending on the type of estimator, and sometimes the values of the constructor parameters, this is done either:

- with higher-level parallelism via joblib,
- with lower-level parallelism via OpenMP, used in C or Cython code, or
- with lower-level parallelism via BLAS, used by NumPy and SciPy for generic operations on arrays.

The n_jobs parameter of estimators always controls the amount of parallelism managed by joblib (processes or threads, depending on the joblib backend). The thread-level parallelism managed by OpenMP in scikit-learn's own Cython code, or by the BLAS & LAPACK libraries used by NumPy and SciPy operations in scikit-learn, is always controlled by environment variables or threadpoolctl as explained below. Note that some estimators can leverage all three kinds of
parallelism at different points of their training and prediction methods. We describe these three types of parallelism in the following subsections in more detail.

9.3.1.1. Higher-level parallelism with joblib

When the underlying implementation uses joblib, the number of workers (threads or processes) that are spawned in parallel can be controlled via the n_jobs parameter.

Note: Where (and how) parallelization happens in the estimators using joblib by specifying n_jobs is currently poorly documented. Please help us by improving our docs and tackling issue 14228!

Joblib is able to support both multi-processing and multi-threading. Whether joblib chooses to spawn a thread or a process depends on the backend that it's using. scikit-learn generally relies on the loky backend, which is joblib's default backend. Loky is a multi-processing backend.
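As an example of the n_jobs parameter described above, a random forest trains its trees in joblib workers; n_jobs=2 below requests two workers (a minimal sketch with synthetic data, not taken from the original guide):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=2: joblib dispatches tree fitting to two workers
clf = RandomForestClassifier(n_estimators=20, n_jobs=2, random_state=0).fit(X, y)
```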
When doing multi-processing, in order to avoid duplicating the memory in each process (which isn't reasonable with big datasets), joblib will create a memmap that all processes can share when the data is bigger than 1 MB.

In some specific cases (when the code that is run in parallel releases the GIL), scikit-learn will indicate to joblib that a multi-threading backend is preferable.

As a user, you may control the backend that joblib will use (regardless of what scikit-learn recommends) by using a context manager:

from joblib import parallel_backend

with parallel_backend('threading', n_jobs=2):
    ...  # Your scikit-learn code here

Please refer to joblib's docs for more details.

In practice, whether parallelism is helpful at improving runtime depends on many factors. It is usually a good idea to experiment rather than assuming that increasing the number of workers is always a good thing. In some cases it can be highly detrimental to performance to run multiple copies of some estimators or functions in parallel (see oversubscription below).

9.3.1.2. Lower-level parallelism with OpenMP

OpenMP is used to parallelize code written in Cython or C, relying on multi-threading exclusively. By default, implementations using OpenMP will use as many threads as possible, i.e.
as many threads as logical cores.

You can control the exact number of threads that are used either:

- via the OMP_NUM_THREADS environment variable, for instance when running a python script:

  OMP_NUM_THREADS=4 python my_script.py

- or via threadpoolctl as explained by this piece of documentation.

9.3.1.3. Parallel NumPy and SciPy routines from numerical libraries

scikit-learn relies heavily on NumPy and SciPy, which internally call multi-threaded linear algebra routines (BLAS & LAPACK) implemented in libraries such as MKL, OpenBLAS or BLIS.

You can control the exact number of threads used by BLAS for each library using environment variables, namely:

- MKL_NUM_THREADS sets the number of threads MKL uses,
- OPENBLAS_NUM_THREADS sets the number of threads OpenBLAS uses,
- BLIS_NUM_THREADS sets the number of threads BLIS uses.

Note that BLAS & LAPACK implementations can also be impacted by OMP_NUM_THREADS. To check whether this is the case in your environment, you can inspect how the number of threads effectively used by those libraries is affected when running the following command in a bash or zsh terminal for different values of OMP_NUM_THREADS:

OMP_NUM_THREADS=2 python -m threadpoolctl -i numpy scipy

Note: At the time of writing (2022), NumPy and SciPy packages which are distributed on pypi.org (i.e.
the ones installed via pip install) and on the conda-forge channel (i.e. the ones installed via conda install --channel conda-forge) are linked with OpenBLAS, while NumPy and SciPy packages shipped on the defaults conda channel from Anaconda.org (i.e. the ones installed via conda install) are linked by default with MKL.

9.3.1.4. Oversubscription: spawning too many threads

It is generally recommended to avoid using significantly more processes or threads than the number of CPUs on a machine. Oversubscription happens when a program is running too many threads at the same time.

Suppose you have a machine with 8 CPUs. Consider a case where you're running a GridSearchCV (parallelized with joblib) with n_jobs=8 over a HistGradientBoostingClassifier (parallelized with OpenMP). Each instance of HistGradientBoostingClassifier will spawn 8 threads (since you have 8 CPUs). That's a total of 8 * 8 = 64 threads, which leads to oversubscription of threads for physical CPU resources and thus to scheduling overhead.

Oversubscription can arise in the exact same fashion with parallelized routines from MKL, OpenBLAS or BLIS that are nested in joblib calls.

Starting from joblib >= 0.14, when the loky backend is used (which is the default), joblib will tell its child processes to limit the number of threads they can use, so as to avoid oversubscription. In practice the heuristic that joblib uses is to tell the processes to use max_threads = n_cpus // n_jobs, via their corresponding environment variable.
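Besides environment variables and joblib's heuristic, thread counts can also be capped programmatically with threadpoolctl (a dependency of scikit-learn). A sketch that forces any BLAS-backed call to run single-threaded inside a block:

```python
import numpy as np
from threadpoolctl import threadpool_limits

rng = np.random.RandomState(0)
a = rng.rand(200, 200)

with threadpool_limits(limits=1, user_api="blas"):
    # BLAS-backed calls in this block (e.g. this matmul) use a single thread
    b = a @ a
```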
Back to our example from above: since the joblib backend of GridSearchCV is loky, each process will only be able to use 1 thread instead of 8, thus mitigating the oversubscription issue.

Note that:

- Manually setting one of the environment variables (OMP_NUM_THREADS, MKL_NUM_THREADS, OPENBLAS_NUM_THREADS, or BLIS_NUM_THREADS) will take precedence over what joblib tries to do. The total number of threads will be n_jobs * _NUM_THREADS. Note that setting this limit will also impact your computations in the main process, which will only use _NUM_THREADS. Joblib exposes a context manager for finer control over the number of threads in its workers (see the joblib docs linked below).
- When joblib is configured to use the threading backend, there is no mechanism to avoid oversubscription when calling into parallel native libraries in the joblib-managed threads.
- All scikit-learn estimators that explicitly rely on OpenMP in their Cython code always use threadpoolctl internally to automatically adapt the number of threads used by OpenMP and potentially nested BLAS calls, so as to avoid oversubscription.

You will find additional details about joblib's mitigation of oversubscription in the joblib documentation. You will find additional details about parallelism in numerical python libraries in this document from Thomas J.
Fan.

9.3.2. Configuration switches

9.3.2.1. Python API

sklearn.set_config and sklearn.config_context can be used to change parameters of the configuration which control aspects of parallelism.

9.3.2.2. Environment variables

These environment variables should be set before importing scikit-learn.

9.3.2.2.1. SKLEARN_ASSUME_FINITE

Sets the default value for the assume_finite argument of sklearn.set_config.

9.3.2.2.2. SKLEARN_WORKING_MEMORY

Sets the default value for the working_memory argument of sklearn.set_config.

9.3.2.2.3. SKLEARN_SEED

Sets the seed of the global random generator when running the tests, for reproducibility. Note that scikit-learn tests are expected to run deterministically with explicit seeding of their own independent RNG instances, instead of relying on the numpy or Python standard library RNG singletons, to make sure that test results are independent of the test execution order.
However, some tests might forget to use explicit seeding, and this variable is a way to control the initial state of the aforementioned singletons.

9.3.2.2.4. SKLEARN_TESTS_GLOBAL_RANDOM_SEED

Controls the seeding of the random number generator used in tests that rely on the global_random_seed fixture. All tests that use this fixture accept the contract that they should deterministically pass for any seed value from 0 to 99 included.

In nightly CI builds, the SKLEARN_TESTS_GLOBAL_RANDOM_SEED environment variable is drawn randomly from the above range and all fixtured tests will run for that specific seed. The goal is to ensure that, over time, our CI will run all tests with different seeds while keeping the test duration of a single run of the full test suite limited.
This will check that the assertions of tests written to use this fixture are not dependent on a specific seed value. The range of admissible seed values is limited to [0, 99] because it is often not possible to write a test that can work for any possible seed and we want to avoid having tests that randomly fail on the CI.

Valid values for SKLEARN_TESTS_GLOBAL_RANDOM_SEED:

- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="42": run tests with a fixed seed of 42
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="40-42": run the tests with all seeds between 40 and 42 included
- SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all": run the tests with all seeds between 0 and 99 included. This can take a long time: only use for individual tests, not the full test suite!

If the variable is not set, then 42 is used as the global seed in a deterministic manner. This ensures that, by default, the scikit-learn test suite is as deterministic as possible, to avoid disrupting our friendly third-party package maintainers. Similarly, this variable should not be set in the CI config of pull requests, to make sure that our friendly contributors are not the first people to encounter a seed-sensitivity regression in a test unrelated to the changes of their own PR.
Only the scikit-learn maintainers who watch the results of the nightly builds are expected to be annoyed by this.

When writing a new test function that uses this fixture, please use the following command to make sure that it passes deterministically for all admissible seeds on your local machine:

SKLEARN_TESTS_GLOBAL_RANDOM_SEED="all" pytest -v -k test_your_test_name

9.3.2.2.5. SKLEARN_SKIP_NETWORK_TESTS

When this environment variable is set to a non-zero value, the tests that need network access are skipped. When this environment variable is not set, network tests are skipped.

9.3.2.2.6. SKLEARN_RUN_FLOAT32_TESTS

When this environment variable is set to '1', the tests using the global_dtype fixture are also run on float32 data. When this environment variable is not set, the tests are only run on float64 data.

9.3.2.2.7. SKLEARN_ENABLE_DEBUG_CYTHON_DIRECTIVES

When this environment variable is set to a non-zero value, the Cython compiler directive boundscheck is set to True. This is useful for finding segfaults.

9.3.2.2.8. SKLEARN_BUILD_ENABLE_DEBUG_SYMBOLS

When this environment variable is set to a non-zero value, debug symbols will be included in the compiled C extensions. Only debug symbols for POSIX systems are configured.

9.3.2.2.9. SKLEARN_PAIRWISE_DIST_CHUNK_SIZE

This sets the size of chunk to be used by the underlying PairwiseDistancesReductions implementations.
The default value is 256, which has been shown to be adequate on most machines. Users looking for the best performance might want to tune this variable using powers of 2, so as to get the best parallelism behavior for their hardware, especially with respect to their caches' sizes.

9.3.2.2.10. SKLEARN_WARNINGS_AS_ERRORS

This environment variable is used to turn warnings into errors in tests and in the documentation build. Some CI (Continuous Integration) builds set SKLEARN_WARNINGS_AS_ERRORS=1, for example to make sure that we catch deprecation warnings from our dependencies and adapt our code. To locally run with the same "warnings as errors" setting as in these CI builds, you can set SKLEARN_WARNINGS_AS_ERRORS=1.

By default, warnings are not turned into errors. This is the case if SKLEARN_WARNINGS_AS_ERRORS is unset, or SKLEARN_WARNINGS_AS_ERRORS=0.

This environment variable uses specific warning filters to ignore some warnings, since sometimes warnings originate from third-party libraries and there is not much we can do about it. You can see the warning filters in the _get_warnings_filters_info_list function in sklearn/utils/_testing.py.

Note that for the documentation build, SKLEARN_WARNINGS_AS_ERRORS=1 checks that the documentation build, in particular running examples, does not produce any warnings.
This is different from the -W sphinx-build argument, which catches syntax warnings in the rst files.

9.1. Strategies to scale computationally: bigger data

For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

9.1.1. Scaling with instances using out-of-core learning

Out-of-core (or "external memory") learning is a technique used to learn from data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this goal:

1. a way to stream instances
2. a way to extract features from instances
3. an incremental algorithm

9.1.1.1. Streaming instances

Basically, 1. may be a reader that yields instances from files on a hard drive, a database, a network stream, etc. However, details on how to achieve this are beyond the scope of this documentation.

9.1.1.2. Extracting features

2. could be any relevant way to extract features among the different feature extraction methods supported by scikit-learn. However, when working with data that needs vectorization, and where the set of features or values is not known in advance, one should take explicit care. A good example is text classification, where unknown terms are likely to be found during training.
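A stateless, fixed-width vectorizer sidesteps the unknown-term problem: any term, seen or unseen, hashes into the same output space. A minimal sketch (the feature count below is illustrative):

```python
from sklearn.feature_extraction.text import HashingVectorizer

docs = ["known terms at train time", "brand new unseen terms at test time"]

# stateless: no fit / vocabulary, so unseen terms still hash to valid columns
vec = HashingVectorizer(n_features=2**10)
X = vec.transform(docs)
```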
It is possible to use a stateful vectorizer if making multiple passes over the data is reasonable from an application point of view. Otherwise, one can turn up the difficulty by using a stateless feature extractor. Currently the preferred way to do this is to use the so-called hashing trick, as implemented by sklearn.feature_extraction.FeatureHasher for datasets with categorical variables represented as lists of Python dicts, or sklearn.feature_extraction.text.HashingVectorizer for text documents.

9.1.1.3. Incremental learning

Finally, for 3. we have a number of options inside scikit-learn. Although not all algorithms can learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates. Actually, the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning, as it guarantees that at any given time there will be only a small number of instances in main memory. Choosing a good size for the mini-batch that
Choosing a good size for the mini-batch that\nbalances relevancy and memory footprint could involve some tuning\n[\n1\n]\n.\nHere is a list of incremental estimators for different tasks:\nClassification\nsklearn.naive_bayes.MultinomialNB\nsklearn.naive_bayes.BernoulliNB\nsklearn.linear_model.Perceptron\nsklearn.linear_model.SGDClassifier", "metadata": { "url": "https://scikit-learn.org/stable/computing/scaling_strategies.html", "chunk_index": 2, "source": "scikit-learn-docs" } }, { "page_content": "Classification\nsklearn.naive_bayes.MultinomialNB\nsklearn.naive_bayes.BernoulliNB\nsklearn.linear_model.Perceptron\nsklearn.linear_model.SGDClassifier\nsklearn.linear_model.PassiveAggressiveClassifier\nsklearn.neural_network.MLPClassifier\nRegression\nsklearn.linear_model.SGDRegressor\nsklearn.linear_model.PassiveAggressiveRegressor\nsklearn.neural_network.MLPRegressor\nClustering\nsklearn.cluster.MiniBatchKMeans\nsklearn.cluster.Birch\nDecomposition / feature Extraction\nsklearn.decomposition.MiniBatchDictionaryLearning\nsklearn.decomposition.IncrementalPCA\nsklearn.decomposition.LatentDirichletAllocation\nsklearn.decomposition.MiniBatchNMF\nPreprocessing\nsklearn.preprocessing.StandardScaler\nsklearn.preprocessing.MinMaxScaler\nsklearn.preprocessing.MaxAbsScaler\nFor classification, a somewhat important thing to note is that although a\nstateless feature extraction routine may be able to cope with new/unseen\nattributes, the incremental learner itself may be unable to cope with", "metadata": { "url": "https://scikit-learn.org/stable/computing/scaling_strategies.html", "chunk_index": 3, "source": "scikit-learn-docs" } }, { "page_content": "stateless feature extraction routine may be able to cope with new/unseen\nattributes, the incremental learner itself may be unable to cope with\nnew/unseen targets classes. 
In this case you have to pass all the possible classes to the first partial_fit call using the classes= parameter.

Another aspect to consider when choosing a proper algorithm is that not all of them put the same importance on each example over time. Namely, the Perceptron is still sensitive to badly labeled examples even after many examples, whereas the SGD* and PassiveAggressive* families are more robust to this kind of artifact. Conversely, the latter also tend to give less importance to remarkably different, yet properly labeled, examples when they come late in the stream, as their learning rate decreases over time.

9.1.1.4. Examples

Finally, we have a full-fledged example of Out-of-core classification of text documents. It is aimed at providing a starting point for people wanting to build out-of-core learning systems and demonstrates most of the notions discussed above. Furthermore, it also shows the evolution of the performance of different algorithms with the number of processed examples. Now looking at the computation time of the different parts, we see that the vectorization is much more expensive than learning itself.
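The incremental-learning contract described above — mini-batches via partial_fit, with every possible class declared on the first call — can be sketched as follows; the stream and batch size here are synthetic stand-ins:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
all_classes = np.unique(y)  # must be known up front

clf = SGDClassifier(random_state=0)
for start in range(0, len(X), 100):  # stream of mini-batches of 100
    batch_X = X[start:start + 100]
    batch_y = y[start:start + 100]
    # classes= is required on the first call; harmless on later calls
    clf.partial_fit(batch_X, batch_y, classes=all_classes)
```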
From the different algorithms, MultinomialNB is the most expensive, but its overhead can be mitigated by increasing the size of the mini-batches (exercise: change minibatch_size to 100 and 10000 in the program and compare).

9.1.1.5. Notes

7. Dataset transformations

scikit-learn provides a library of transformers, which may clean (see Preprocessing data), reduce (see Unsupervised dimensionality reduction), expand (see Kernel Approximation) or generate (see Feature extraction) feature representations.

Like other estimators, these are represented by classes with a fit method, which learns model parameters (e.g. mean and standard deviation for normalization) from a training set, and a transform method which applies this transformation model to unseen data. fit_transform may be more convenient and efficient for modelling and transforming the training data simultaneously.

Combining such transformers, either in parallel or in series, is covered in Pipelines and composite estimators. Pairwise metrics, Affinities and Kernels covers transforming feature spaces into affinity matrices, while Transforming the prediction target (y) considers transformations of the target space (e.g. categorical labels) for use in scikit-learn.

7.1. Pipelines and composite estimators
7.1.1. Pipeline: chaining estimators
7.1.2. Transforming target in regression
7.1.3. FeatureUnion: composite feature spaces
7.1.4. ColumnTransformer for heterogeneous data
7.1.5.
Visualizing Composite Estimators
7.2. Feature extraction
7.2.1. Loading features from dicts
7.2.2. Feature hashing
7.2.3. Text feature extraction
7.2.4. Image feature extraction
7.3. Preprocessing data
7.3.1. Standardization, or mean removal and variance scaling
7.3.2. Non-linear transformation
7.3.3. Normalization
7.3.4. Encoding categorical features
7.3.5. Discretization
7.3.6. Imputation of missing values
7.3.7. Generating polynomial features
7.3.8. Custom transformers
7.4. Imputation of missing values
7.4.1. Univariate vs. Multivariate Imputation
7.4.2. Univariate feature imputation
7.4.3. Multivariate feature imputation
7.4.4. Nearest neighbors imputation
7.4.5. Keeping the number of features constant
7.4.6. Marking imputed values
7.4.7. Estimators that handle NaN values
7.5. Unsupervised dimensionality reduction
7.5.1. PCA: principal component analysis
7.5.2. Random projections
7.5.3. Feature agglomeration
7.6. Random Projection
7.6.1. The Johnson-Lindenstrauss lemma
7.6.2. Gaussian random projection
7.6.3. Sparse random projection
7.6.4. Inverse Transform
7.7. Kernel Approximation
7.7.1. Nystroem Method for Kernel Approximation
7.7.2. Radial Basis Function Kernel
7.7.3. Additive Chi Squared Kernel
7.7.4. Skewed Chi Squared Kernel
7.7.5. Polynomial Kernel Approximation via Tensor Sketch
7.7.6. Mathematical Details
7.8. Pairwise metrics, Affinities and Kernels
7.8.1.
Cosine similarity
7.8.2. Linear kernel
7.8.3. Polynomial kernel
7.8.4. Sigmoid kernel
7.8.5. RBF kernel
7.8.6. Laplacian kernel
7.8.7. Chi-squared kernel
7.9. Transforming the prediction target (y)
7.9.1. Label binarization
7.9.2. Label encoding

8.4. Loading other datasets

8.4.1. Sample images

Scikit-learn also embeds a couple of sample JPEG images published under a Creative Commons license by their authors. Those images can be useful to test algorithms and pipelines on 2D data.

load_sample_images(): load sample images for image manipulation.
load_sample_image(image_name): load the numpy array of a single sample image.

Warning: The default coding of images is based on the uint8 dtype to spare memory. Often machine learning algorithms work best if the input is converted to a floating point representation first. Also, if you plan to use matplotlib.pyplot.imshow, don't forget to scale to the range 0 - 1 as done in the following example.

8.4.2. Datasets in svmlight / libsvm format

scikit-learn includes utility functions for loading datasets in the svmlight / libsvm format. In this format, each line takes the form