candle.feature_selection_utils.select_features_by_variation

candle.feature_selection_utils.select_features_by_variation(data, variation_measure='var', threshold=None, portion=None, draw_histogram=False, bins=100, log=False)

Evaluates the variation of each feature and returns the indices of the features with the largest variation. Missing values are ignored when computing variation.

Parameters:
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

  • variation_measure (string) – metric used to evaluate feature variation. ‘var’ indicates variance; ‘std’ indicates standard deviation; ‘mad’ indicates median absolute deviation. Default is ‘var’.

  • threshold (float) – Features with a variation larger than threshold will be selected. Default is None.

  • portion (float) – float in the range [0, 1]; the portion of features to select, ranked by variation. The number of selected features is the smaller of int(portion * n_features) and the number of features with non-missing variations. Default is None. threshold and portion cannot both be set: at most one of them may be given a value other than None.

  • draw_histogram (bool) – whether to draw a histogram of feature variations. Default is False.

  • bins (int) – positive integer, the number of bins in the histogram. The default in the signature is 100; the number of bins used is capped at the number of features with non-missing variations.

  • log (bool) – whether the histogram should be drawn on a log scale. Default is False.
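The three variation measures can be illustrated with a minimal NumPy sketch. This is not the library's implementation: the helper name `feature_variation` is hypothetical, and the use of population variance and standard deviation (NumPy's default, ddof=0) and of an unscaled median absolute deviation are assumptions.

```python
import numpy as np


def feature_variation(data, variation_measure="var"):
    """Per-feature variation of a [n_samples, n_features] array, ignoring NaNs.

    Hypothetical helper for illustration only; not part of candle.
    """
    data = np.asarray(data, dtype=float)
    if variation_measure == "var":
        return np.nanvar(data, axis=0)   # population variance (ddof=0), an assumption
    if variation_measure == "std":
        return np.nanstd(data, axis=0)   # population standard deviation (ddof=0)
    if variation_measure == "mad":
        # Median of absolute deviations from the per-feature median, unscaled.
        median = np.nanmedian(data, axis=0)
        return np.nanmedian(np.abs(data - median), axis=0)
    raise ValueError("variation_measure must be 'var', 'std', or 'mad'")


# Three samples, three features; one missing value in the last feature.
data = np.array([[1.0, 5.0, 2.0],
                 [3.0, 5.0, np.nan],
                 [5.0, 5.0, 4.0]])

print(feature_variation(data, "var"))
print(feature_variation(data, "std"))
print(feature_variation(data, "mad"))
```

The constant second feature has zero variation under every measure, and the missing value in the third feature is skipped rather than propagated.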

Returns:

1-D numpy array containing the indices of the selected features. If both threshold and portion are None, an empty array is returned.
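The selection rule described above (threshold or portion, applied to the per-feature variations) can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not candle's code: the name `select_by_variation` is hypothetical, raising `ValueError` when both arguments are set is an assumption about error handling, and returning the indices in ascending order is also an assumption.

```python
import numpy as np


def select_by_variation(variation, threshold=None, portion=None):
    """Pick feature indices from a 1-D array of per-feature variations.

    Hypothetical helper for illustration only; not part of candle.
    """
    variation = np.asarray(variation, dtype=float)
    if threshold is not None and portion is not None:
        # Assumption: the real function may handle this case differently.
        raise ValueError("set at most one of threshold and portion")
    if threshold is None and portion is None:
        return np.array([], dtype=int)

    # Only features whose variation is not missing are candidates.
    valid = np.where(~np.isnan(variation))[0]
    v = variation[valid]

    if threshold is not None:
        return valid[v > threshold]

    # Smaller of int(portion * n_features) and the number of valid features.
    n_selected = min(int(portion * variation.size), v.size)
    top = np.argsort(-v, kind="stable")[:n_selected]
    return np.sort(valid[top])


variation = np.array([2.5, 0.0, np.nan, 1.0, 4.0])

print(select_by_variation(variation, threshold=0.5))  # variation strictly above 0.5
print(select_by_variation(variation, portion=0.4))    # top int(0.4 * 5) = 2 features
print(select_by_variation(variation))                 # neither set: empty array
```

With the real function, the equivalent calls would presumably pass the data matrix directly, for example `select_features_by_variation(data, variation_measure='var', portion=0.4)`.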