candle.data_utils.load_csv_data


candle.data_utils.load_csv_data(train_path, test_path=None, sep=',', nrows=None, x_cols=None, y_cols=None, drop_cols=None, onehot_cols=None, n_cols=None, random_cols=False, shuffle=False, scaling=None, dtype=None, validation_split=None, return_dataframe=True, return_header=False, seed=7102)

Load data from the specified files. This function assumes that each file contains a header row with column names. Columns to load can be specified explicitly, selected at random, or reduced by dropping a subset, and the columns that hold features and labels can be named. Labels can be integers (for classification) or continuous values (for regression), and one-hot encoding can be applied to either features or labels. Row order can be shuffled and the data can be rescaled. If validation_split is specified, the training data is further split into training and validation partitions. pandas DataFrames are used to load and pre-process the data; they are returned if return_dataframe is True, otherwise only the values are returned.
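The loading and column-selection behaviour described above can be approximated with plain pandas. This is a minimal sketch, not the CANDLE implementation: the helper names `load_features_labels` and `_onehot` and the sample data are hypothetical, and the rule for choosing feature columns when x_cols is None is an assumption. It only mirrors the sep, nrows, x_cols, y_cols, drop_cols, and onehot_cols parameters.

```python
import io

import pandas as pd


def _onehot(frame, onehot_cols):
    """Hypothetical helper: one-hot encode the listed columns present in frame."""
    cols = [c for c in (onehot_cols or []) if c in frame.columns]
    # Each encoded column is replaced by one indicator column per category.
    return pd.get_dummies(frame, columns=cols) if cols else frame


def load_features_labels(source, sep=",", nrows=None, x_cols=None,
                         y_cols=None, drop_cols=None, onehot_cols=None):
    """Hypothetical helper: read a headered CSV and split it into features and labels."""
    # header=0 reflects the requirement that the first row holds column names.
    df = pd.read_csv(source, sep=sep, nrows=nrows, header=0)
    if drop_cols:
        df = df.drop(columns=drop_cols)
    y_cols = list(y_cols or [])
    if x_cols is None:
        # Assumption: with no explicit feature list, use every non-label column.
        x_cols = [c for c in df.columns if c not in y_cols]
    x_df = _onehot(df[list(x_cols)], onehot_cols)
    y_df = _onehot(df[y_cols], onehot_cols)
    return x_df, y_df


# Made-up sample file with a header row and ';' as the separator.
csv_text = "f1;f2;color;label\n1.0;10;red;0\n2.0;20;blue;1\n3.0;30;red;0\n"
x_df, y_df = load_features_labels(
    io.StringIO(csv_text),
    sep=";",
    y_cols=["label"],
    drop_cols=["f2"],
    onehot_cols=["color"],
)
print(x_df.columns.tolist())
print(y_df.columns.tolist())
```

Here the `color` column is expanded into one indicator column per category, `f2` is dropped before selection, and `label` is kept apart as the label frame.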

Parameters:
  • train_path – Name of the file to load the training data.

  • test_path – Name of the file to load the testing data. (Optional).

  • sep – Character used as column separator. (Default: ‘,’, comma separated values).

  • nrows (int) – Number of rows to load from the files. (Default: None, all the rows are used).

  • x_cols – List of columns to use as features. (Default: None).

  • y_cols – List of columns to use as labels. (Default: None).

  • drop_cols – List of columns to drop from the files being loaded. (Default: None, all the columns are used).

  • onehot_cols – List of columns to one-hot encode. (Default: None).

  • n_cols (int) – Number of columns to load from the files. (Default: None).

  • random_cols (boolean) – Boolean flag to indicate random selection of columns. If True, n_cols columns are selected at random; if False, the specified columns are used. (Default: False).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True, the rows are re-ordered; if False, the order in which rows are read is preserved. (Default: False, row order is not permuted).

  • scaling (string) –

    String describing the type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. (Default: None, no scaling).

    • maxabs: scales data to the range [-1, 1].

    • minmax: scales data to the range [-1, 1].

    • std: scales data to a normal variable with mean 0 and standard deviation 1.

  • dtype – Data type to use for the output pandas DataFrames. (Default: None).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: None, no validation partition is constructed).

  • return_dataframe (boolean) – Boolean flag to indicate that the pandas DataFrames used for data pre-processing are to be returned. (Default: True, pandas DataFrames are returned).

  • return_header (boolean) – Boolean flag to indicate whether the column headers are to be returned. (Default: False, no column headers are separately returned).

  • seed (int) – Value to initialize or re-seed the random number generator. (Default: DEFAULT_SEED defined in default_utils).
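To make the scaling, shuffle, validation_split, and seed options more concrete, the following NumPy sketch shows one way each could behave. It is an illustration under stated assumptions, not the library's code: `scale_columns` and `shuffle_and_split` are hypothetical helpers, scaling is applied per column, the minmax target range follows the [-1, 1] range stated above, the validation rows are taken from the end of the data, and NumPy's `default_rng` stands in for whatever generator the library seeds.

```python
import numpy as np


def scale_columns(x, scaling=None):
    """Hypothetical helper: rescale each column of a 2-D array."""
    x = np.asarray(x, dtype=float)
    if scaling is None:
        return x  # no scaling requested
    if scaling == "maxabs":
        # Divide by the largest absolute value per column: values lie in [-1, 1].
        return x / np.abs(x).max(axis=0)
    if scaling == "minmax":
        # Map each column's [min, max] linearly onto [-1, 1].
        lo, hi = x.min(axis=0), x.max(axis=0)
        return 2.0 * (x - lo) / (hi - lo) - 1.0
    if scaling == "std":
        # Subtract the column mean and divide by the column standard deviation.
        return (x - x.mean(axis=0)) / x.std(axis=0)
    raise ValueError(f"unknown scaling: {scaling!r}")


def shuffle_and_split(x, validation_split=None, shuffle=False, seed=7102):
    """Hypothetical helper: optionally permute rows, then hold out a fraction."""
    rng = np.random.default_rng(seed)  # same seed gives the same permutation
    idx = rng.permutation(len(x)) if shuffle else np.arange(len(x))
    x = x[idx]
    if validation_split is None:
        return x, None
    n_val = int(len(x) * validation_split)
    # Assumption: the validation partition is taken from the last rows.
    return x[: len(x) - n_val], x[len(x) - n_val:]


# Made-up data: four rows, two feature columns.
data = np.array([[1.0, -4.0], [2.0, 0.0], [3.0, 4.0], [4.0, 2.0]])

scaled = scale_columns(data, scaling="minmax")
train, val = shuffle_and_split(scaled, validation_split=0.25, shuffle=True)
print(train.shape, val.shape)
```

A column with constant values would cause a division by zero in this sketch; a real implementation needs to guard against that case.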

Returns:

Tuples of data features and labels are returned for the training, validation, and testing partitions, together with the column names (headers). The specific objects returned depend on the options selected.