candle.P1_utils.generate_gene_set_data


candle.P1_utils.generate_gene_set_data(data, genes, gene_name_type='entrez', gene_set_category='c6.all', metric='mean', standardize=False, data_dir='../../Data/examples/Gene_Sets/MSigDB.v7.0/')

This function generates genomic data summarized at the gene set level.

Parameters:
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

  • genes – 1-D array or list of gene names with a length of n_features. It indicates which gene a genomic feature belongs to.

  • gene_name_type (string) – the type of gene name used in genes. ‘entrez’ indicates Entrez gene ID and ‘symbols’ indicates HGNC gene symbol. Default is ‘entrez’.

  • gene_set_category (string) – the gene sets for which data will be calculated. ‘c2.cgp’ indicates gene sets affected by chemical and genetic perturbations; ‘c2.cp.biocarta’ indicates BioCarta gene sets; ‘c2.cp.kegg’ indicates KEGG gene sets; ‘c2.cp.pid’ indicates PID gene sets; ‘c2.cp.reactome’ indicates Reactome gene sets; ‘c5.bp’ indicates GO biological processes; ‘c5.cc’ indicates GO cellular components; ‘c5.mf’ indicates GO molecular functions; ‘c6.all’ indicates oncogenic signatures. Default is ‘c6.all’.

  • metric (string) – the way to calculate gene-set-level data. ‘mean’ calculates the mean of gene features belonging to the same gene set. ‘sum’ calculates the summation of gene features belonging to the same gene set. ‘max’ calculates the maximum of gene features. ‘min’ calculates the minimum of gene features. ‘abs_mean’ calculates the mean of absolute values. ‘abs_maximum’ calculates the maximum of absolute values. Default is ‘mean’.

  • standardize (bool) – whether to standardize features before calculation. Standardization transforms each feature to have zero mean and unit standard deviation. Default is False.

Returns:

A data frame of calculated gene-set-level data. Column names are the gene set names.
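
Example:

The following is a minimal sketch of the kind of aggregation described above, not the library's implementation. It does not call candle or read MSigDB files. The helper summarize_gene_sets, the GENE_SETS dictionary, and the toy gene names are hypothetical and exist only for illustration, and the use of the population standard deviation (ddof=0) for standardization is an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical gene-set membership, standing in for the MSigDB files
# that the real function reads from data_dir.
GENE_SETS = {
    "SET_A": ["TP53", "EGFR"],
    "SET_B": ["EGFR", "MYC"],
}

# One row-wise reducer per value of the `metric` parameter.
METRICS = {
    "mean": lambda x: x.mean(axis=1),
    "sum": lambda x: x.sum(axis=1),
    "max": lambda x: x.max(axis=1),
    "min": lambda x: x.min(axis=1),
    "abs_mean": lambda x: x.abs().mean(axis=1),
    "abs_maximum": lambda x: x.abs().max(axis=1),
}


def summarize_gene_sets(data, genes, gene_sets, metric="mean", standardize=False):
    """Illustrative helper: collapse [n_samples, n_features] to gene-set level."""
    data = pd.DataFrame(data).astype(float)
    genes = np.asarray(genes)
    if standardize:
        # Assumption: population standard deviation (ddof=0).
        data = (data - data.mean(axis=0)) / data.std(axis=0, ddof=0)
    result = {}
    for name, members in gene_sets.items():
        # Select every feature whose gene belongs to this gene set.
        mask = np.isin(genes, members)
        if mask.any():
            result[name] = METRICS[metric](data.loc[:, mask])
    return pd.DataFrame(result, index=data.index)


# Two samples, three features, one feature per gene.
data = np.array([[1.0, -2.0, 3.0],
                 [4.0, 5.0, -6.0]])
genes = ["TP53", "EGFR", "MYC"]

by_mean = summarize_gene_sets(data, genes, GENE_SETS, metric="mean")
by_abs_max = summarize_gene_sets(data, genes, GENE_SETS, metric="abs_maximum")
by_mean_std = summarize_gene_sets(data, genes, GENE_SETS, standardize=True)
print(by_mean)
```

In this sketch the result has one row per sample and one column per gene set, matching the description of the return value. A gene set with no matching features is skipped here; how the real function handles that case is not stated in this page.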