candle.data_utils.load_X_data2

candle.data_utils.load_X_data2(train_file, test_file, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=<class 'numpy.float32'>, seed=7102)

Load unlabeled training and testing data from the specified files. The training data is further split into training and validation partitions, and the corresponding training, validation, and testing pandas DataFrames are constructed. Columns to load can be selected or dropped, the order of rows can be shuffled, and the data can be rescaled. The testing partition is kept as loaded from its file; only the training partition is divided, into training and validation partitions. This function assumes that each file contains a header row with column names.
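The steps described above can be sketched with pandas alone. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name `load_and_split` is hypothetical, the validation rows are assumed to come from the end of the (optionally shuffled) training rows, only the training rows are shuffled, and scaling is left out.

```python
import io

import numpy as np
import pandas as pd


def load_and_split(train_src, test_src, drop_cols=None, shuffle=False,
                   validation_split=0.1, dtype=np.float32, seed=7102):
    """Hypothetical stand-in that mirrors the documented load/drop/split steps."""
    # header=0: the first row of each file holds the column names.
    df_train = pd.read_csv(train_src, header=0)
    df_test = pd.read_csv(test_src, header=0)

    # Drop the same named columns from both files.
    if drop_cols is not None:
        df_train = df_train.drop(columns=drop_cols)
        df_test = df_test.drop(columns=drop_cols)

    # Assumption: only the training rows are permuted, using a seeded generator.
    if shuffle:
        rng = np.random.default_rng(seed)
        df_train = df_train.iloc[rng.permutation(len(df_train))]

    # Assumption: the validation partition is taken from the tail of the training rows.
    n_val = int(len(df_train) * validation_split)
    n_train = len(df_train) - n_val
    X_train = df_train.iloc[:n_train].astype(dtype)
    X_val = df_train.iloc[n_train:].astype(dtype)
    X_test = df_test.astype(dtype)
    return X_train, X_val, X_test


# In-memory CSV text stands in for the training and testing files.
train_csv = io.StringIO(
    "id,f0,f1\n" + "\n".join(f"{i},{i}.0,{2 * i}.5" for i in range(10))
)
test_csv = io.StringIO("id,f0,f1\n0,1.0,2.0\n1,3.0,4.0\n")
X_train, X_val, X_test = load_and_split(train_csv, test_csv, drop_cols=["id"])
```

With ten training rows and the default `validation_split=0.1`, this sketch keeps nine rows for training and one for validation, while the two test rows pass through unsplit.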

Parameters:
  • train_file (filename) – Name of the file from which to load the training data.

  • test_file (filename) – Name of the file from which to load the testing data.

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True, the rows are re-ordered; if False, the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1 to 1]. ‘minmax’ : scales data to range [-1 to 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils; numpy.float32 in the signature above).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils; 7102 in the signature above).
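To make the `scaling` options more concrete, here is a small NumPy sketch of what 'maxabs' and 'std' scaling do column by column. The helper names `scale_maxabs` and `scale_std` are hypothetical, and the population standard deviation is an assumption; the library may compute these differently, and may fit the scaling on a different subset of the data than shown here.

```python
import numpy as np


def scale_maxabs(x):
    """Divide each column by its largest absolute value, giving values in [-1, 1]."""
    max_abs = np.max(np.abs(x), axis=0)
    max_abs[max_abs == 0] = 1.0  # leave all-zero columns unchanged
    return x / max_abs


def scale_std(x):
    """Shift each column to mean 0 and scale it to standard deviation 1."""
    std = x.std(axis=0)  # population standard deviation (assumption)
    std[std == 0] = 1.0  # avoid division by zero for constant columns
    return (x - x.mean(axis=0)) / std


data = np.array([[1.0, -4.0],
                 [2.0, 2.0],
                 [3.0, 8.0]])
maxabs_scaled = scale_maxabs(data)
std_scaled = scale_std(data)
```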

Returns:

  • X_train (pandas DataFrame) – Data for training loaded in a pandas DataFrame and pre-processed as specified.

  • X_val (pandas DataFrame) – Data for validation loaded in a pandas DataFrame and pre-processed as specified.

  • X_test (pandas DataFrame) – Data for testing loaded in a pandas DataFrame and pre-processed as specified.
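A call might look like the following sketch. It assumes the CANDLE library is installed and importable as `candle`, and that the three DataFrames come back in the order listed above. The wrapper name `load_partitions`, the column name `sample_id`, and the option values are hypothetical and chosen only for illustration.

```python
def load_partitions(train_path, test_path):
    """Call load_X_data2 with a few non-default options (hypothetical values)."""
    # Imported lazily so that defining this helper does not require CANDLE.
    from candle.data_utils import load_X_data2

    X_train, X_val, X_test = load_X_data2(
        train_path,
        test_path,
        drop_cols=["sample_id"],  # hypothetical column name
        shuffle=True,             # permute the rows
        scaling="maxabs",         # one of 'maxabs', 'minmax', 'std'
        validation_split=0.2,     # hold out 20% of the training rows
    )
    return X_train, X_val, X_test
```

Leaving `seed` at its default while enabling `shuffle` should make the row permutation repeatable across runs, assuming the generator is seeded as the parameter description indicates.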