• The classes in the sklearn.feature_selection module can be used for feature selection/dimensionality reduction on sample sets, either to improve estimators’ accuracy scores or to boost their performance on very high-dimensional datasets.


  • The Python tutorials are written as Jupyter notebooks and run directly in Google Colab—a hosted notebook environment that requires no setup. Click the Run in Google Colab button.




  • Removing features with low variance


  • VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold.


  • By default, it removes all zero-variance features, i.e. features that have the same value in all samples.


  • Parameter: threshold (float, optional, default 0.0)


  • Features with a training-set variance lower than this threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.
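

  • As a sketch of how the threshold parameter can be used (the toy data below is illustrative, not from the original notebook): for boolean features the variance is p(1 - p), so to drop features that are constant in more than 80% of the samples one can set the threshold to 0.8 * (1 - 0.8).


  • 
    from sklearn.feature_selection import VarianceThreshold

    # Toy boolean dataset: the first column is 0 in five of the six samples
    X = [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]]

    # Remove features whose variance is below p(1 - p) with p = 0.8,
    # i.e. features that are (almost) constant in more than 80% of the samples
    sel = VarianceThreshold(threshold=(0.8 * (1 - 0.8)))
    sel.fit_transform(X)  # the first column is dropped, the other two are kept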


  • P-value: What is this p-value? The p-value is the probability of observing a correlation at least as strong as the one measured if the two variables were in fact unrelated. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant; a short SciPy sketch follows the conventions below.


  • By convention:


  • when the p-value is < 0.001, we say there is strong evidence that the correlation is significant;


  • when the p-value is < 0.05, there is moderate evidence that the correlation is significant;


  • when the p-value is < 0.1, there is weak evidence that the correlation is significant;


  • when the p-value is > 0.1, there is no evidence that the correlation is significant.
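

  • As a small illustration of these conventions (a sketch on synthetic data, not part of the original notebook), scipy.stats.pearsonr returns both the correlation coefficient and its p-value:


  • 
    import numpy as np
    from scipy import stats

    rng = np.random.RandomState(0)
    x = rng.normal(size=100)
    y = 2 * x + rng.normal(size=100)   # y is strongly related to x by construction

    corr, p_value = stats.pearsonr(x, y)
    print(corr, p_value)               # large correlation, p-value far below 0.001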


  • ANOVA: Analysis of Variance


  • The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:


  • F-test score: ANOVA takes as its null hypothesis that the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means a larger difference between the means.


  • P-value: the p-value tells how statistically significant our calculated F-test score is.


  • If our target variable (price) is strongly correlated with the variable we are analyzing, we expect ANOVA to return a sizeable F-test score and a small p-value.
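

  • A minimal sketch of a one-way ANOVA with SciPy (the three groups below are made-up numbers standing in for, say, prices grouped by a categorical variable):


  • 
    from scipy import stats

    group_a = [24000, 26500, 25800, 27200]   # illustrative prices for category A
    group_b = [31000, 29500, 33200, 30800]   # category B
    group_c = [45000, 47500, 44200, 46900]   # category C

    # A large F-test score with a small p-value suggests the group means really differ
    f_score, p_value = stats.f_oneway(group_a, group_b, group_c)
    print(f_score, p_value)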


  • 
    from sklearn.feature_selection import VarianceThreshold

    # The first and last columns are constant across samples (zero variance)
    X = [[0, 2, 0, 3], [0, 1, 4, 3], [0, 1, 1, 3]]

    sel = VarianceThreshold()

    sel.fit_transform(X)  # only the two middle columns are kept
    
  • Univariate feature selection


  • Univariate feature selection works by selecting the best features based on univariate statistical tests.


  • It can be seen as a preprocessing step to an estimator. Scikit-learn exposes feature selection routines as objects that implement the transform method:


  • SelectKBest removes all but the k highest scoring features


  • SelectPercentile removes all but a user-specified highest scoring percentage of features.


  • Using common univariate statistical tests for each feature: false positive rate SelectFpr, false discovery rate SelectFdr, or family-wise error SelectFwe.


  • GenericUnivariateSelect allows performing univariate feature selection with a configurable strategy. This makes it possible to select the best univariate selection strategy with a hyper-parameter search estimator, as sketched below.
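

  • For example (a sketch, not from the original notebook), the selector can be placed in a Pipeline and its param tuned with GridSearchCV alongside the downstream estimator:


  • 
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import GenericUnivariateSelect, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline

    X, y = load_iris(return_X_y=True)

    # Search over the percentile of features kept by the univariate selector
    pipe = Pipeline([
        ("select", GenericUnivariateSelect(f_classif, mode="percentile")),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    param_grid = {"select__param": [25, 50, 75, 100]}
    search = GridSearchCV(pipe, param_grid, cv=5).fit(X, y)
    print(search.best_params_)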


  • 
    class sklearn.feature_selection.SelectKBest(score_func=<function f_classif>, *, k=10)
     
    
  • chi2, f_classif, mutual_info_classif: scoring for classification


  • f_regression, mutual_info_regression: scoring for regression tasks (the F-value and the mutual information between each feature and the target, respectively).
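

  • As a rough sketch of the regression scorers (synthetic data, not from the original notebook):


  • 
    from sklearn.datasets import make_regression
    from sklearn.feature_selection import f_regression, mutual_info_regression

    # Synthetic regression problem where only 2 of the 5 features are informative
    X, y = make_regression(n_samples=200, n_features=5, n_informative=2, random_state=0)

    f_scores, p_values = f_regression(X, y)            # F-statistic and p-value per feature
    mi = mutual_info_regression(X, y, random_state=0)  # mutual information per feature
    print(f_scores, p_values, mi)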


  • 
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, SelectPercentile, SelectFpr, SelectFdr, SelectFwe, GenericUnivariateSelect
    from sklearn.feature_selection import chi2, f_classif, mutual_info_classif, f_regression, mutual_info_regression

    X, y = load_iris(return_X_y=True)
    X.shape  # (150, 4)

    # Keep the 2 features with the highest chi-squared scores
    X_new = SelectKBest(chi2, k=2).fit_transform(X, y)
    X_new.shape  # (150, 2)

    # Keep the highest scoring 50% of the features
    X_new = SelectPercentile(chi2, percentile=50).fit_transform(X, y)
    X_new.shape  # (150, 2)

    # Keep features whose p-values are below alpha (false positive rate)
    X_new = SelectFpr(chi2, alpha=0.05).fit_transform(X, y)
    X_new.shape

    # Keep features selected by the Benjamini-Hochberg false discovery rate procedure
    X_new = SelectFdr(chi2, alpha=0.05).fit_transform(X, y)
    X_new.shape

    # Keep features whose p-values pass a family-wise error (Bonferroni-style) correction
    X_new = SelectFwe(chi2, alpha=0.05).fit_transform(X, y)
    X_new.shape

    # Same as SelectPercentile(chi2, percentile=50), via the configurable interface
    X_new = GenericUnivariateSelect(chi2, mode='percentile', param=50).fit_transform(X, y)
    X_new.shape 
    
  • Recursive feature elimination
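

  • Given an external estimator that assigns weights to features (e.g. the coefficients of a linear model), recursive feature elimination (RFE) selects features by recursively considering smaller and smaller sets: the estimator is trained, the least important features are pruned, and the procedure is repeated until the desired number of features is reached. RFECV performs RFE in a cross-validation loop to find the optimal number of features.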


  • 
    from sklearn.datasets import make_friedman1
    from sklearn.feature_selection import RFECV
    from sklearn.svm import SVR

    X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

    # Eliminate one feature per step; the number of features to keep is chosen by 5-fold CV
    estimator = SVR(kernel="linear")
    selector = RFECV(estimator, step=1, cv=5)
    selector = selector.fit(X, y)

    selector.support_  # boolean mask of the selected features
    
  • ranking_ gives the feature ranking, such that ranking_[i] corresponds to the ranking position of the i-th feature. Selected (i.e., estimated best) features are assigned rank 1.


  • 
    selector.ranking_

  • Feature selection using SelectFromModel
    
  • SelectFromModel is a meta-transformer that can be used along with any estimator that has a coef_ or feature_importances_ attribute after fitting.


  • The features are considered unimportant and are removed if the corresponding coef_ or feature_importances_ values are below the provided threshold parameter.


  • Apart from specifying the threshold numerically, there are built-in heuristics for finding a threshold using a string argument.


  • Available heuristics are “mean”, “median” and float multiples of these like “0.1*mean”.


  • In combination with the threshold criteria, one can use the max_features parameter to set a limit on the number of features to select.
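

  • As an illustration of the string heuristics and max_features (a sketch on the iris data, separate from the diabetes example that follows):


  • 
    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    X, y = load_iris(return_X_y=True)

    sfm = SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=0),
        threshold="median",   # keep features whose importance is above the median importance
        max_features=2,       # and never keep more than two features
    )
    X_new = sfm.fit_transform(X, y)
    X_new.shape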


  • 
    print(__doc__)

    import matplotlib.pyplot as plt
    import numpy as np

    from sklearn.datasets import load_diabetes
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LassoCV

    # Load the diabetes dataset
    diabetes = load_diabetes()
    X = diabetes.data
    y = diabetes.target

    feature_names = diabetes.feature_names
    print(feature_names)

    # Fit LassoCV and use the absolute values of its coefficients as feature importances
    clf = LassoCV().fit(X, y)
    importance = np.abs(clf.coef_)
    print(importance) 
    
  • Now we want to select the two features which are the most important.


  • SelectFromModel() allows setting the threshold. Only the features whose coef_ is higher than the threshold will remain.


  • Here, we want to set the threshold slightly above the third highest coef_ calculated by LassoCV() from our data.


  • 
    # Set the threshold just above the third highest coefficient,
    # so that only the two most important features survive
    idx_third = importance.argsort()[-3]
    threshold = importance[idx_third] + 0.01

    # Names of the two most important features, for labelling the plot later
    idx_features = (-importance).argsort()[:2]
    name_features = np.array(feature_names)[idx_features]
    print('Selected features: {}'.format(name_features))

    sfm = SelectFromModel(clf, threshold=threshold)
    sfm.fit(X, y)
    X_transform = sfm.transform(X)
     
    
  • Plot the two most important features


  • 
    plt.title(
        "Features from diabets using SelectFromModel with "
        "threshold %0.3f." % sfm.threshold)
    feature1 = X_transform[:, 0]
    feature2 = X_transform[:, 1]
    plt.plot(feature1, feature2, 'r.')
    plt.xlabel("First feature: {}".format(name_features[0]))
    plt.ylabel("Second feature: {}".format(name_features[1]))
    plt.ylim([np.min(feature2), np.max(feature2)])
    plt.show() 
    
  • L1-based feature selection
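

  • Linear models penalized with the L1 norm have sparse solutions: many of their estimated coefficients are zero. They can therefore be used with SelectFromModel to keep only the non-zero coefficients; typical choices are Lasso for regression, and LogisticRegression or LinearSVC for classification.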


  • 
    from sklearn.svm import LinearSVC
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel

    X, y = load_iris(return_X_y=True)
    X.shape  # (150, 4)

    # An L1-penalized linear SVM produces sparse coefficients;
    # SelectFromModel keeps only the features with non-zero coefficients
    lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X, y)
    model = SelectFromModel(lsvc, prefit=True)
    X_new = model.transform(X)
    X_new.shape  # (150, 3)
    
  • Tree-based feature selection
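

  • Tree-based estimators (see the sklearn.tree and sklearn.ensemble modules) can be used to compute impurity-based feature importances, which in turn can be used with SelectFromModel to discard irrelevant features.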


  • 
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectFromModel

    X, y = load_iris(return_X_y=True)
    X.shape  # (150, 4)

    # Keep the features whose importance exceeds the default (mean importance) threshold
    clf = ExtraTreesClassifier(n_estimators=50).fit(X, y)
    model = SelectFromModel(clf, prefit=True)
    X_new = model.transform(X)
    X_new.shape  # typically (150, 2), depending on the fitted importances