candle.P1_utils.coxen_multi_drug_gene_selection

candle.P1_utils.coxen_multi_drug_gene_selection(source_data, target_data, drug_response_data, drug_response_col, tumor_col, drug_col, prediction_power_measure='lm', num_predictive_gene=100, generalization_power_measure='ccc', num_generalizable_gene=50, union_of_single_drug_selection=False)

This function uses the COXEN approach to select genes for predicting the response of multiple drugs. It assumes that no data are missing. It works in one of three modes.

(1) If union_of_single_drug_selection is True, prediction_power_measure must be either ‘pearson’ or ‘mutual_info’, so the default ‘lm’ has to be overridden. The function runs coxen_single_drug_gene_selection for every drug with the given parameter settings and returns the union of the genes selected for each drug. The size of the selected gene set may be larger than num_generalizable_gene.

(2) If union_of_single_drug_selection is False and prediction_power_measure is ‘lm’, the function fits a linear model to the response of multiple drugs using the expression of one gene at a time, with the drugs one-hot encoded. The p-value associated with the coefficient of gene expression is used as the prediction power measure, according to which num_predictive_gene genes are selected. Then, among the predictive genes, num_generalizable_gene generalizable genes are selected.

(3) If union_of_single_drug_selection is False and prediction_power_measure is ‘pearson’ or ‘mutual_info’, the function ranks the genes, for each drug, according to their power to predict the response to that drug. It then forms the union of an equal number of predictive genes for every drug; the size of this union must be at least num_predictive_gene. Finally, num_generalizable_gene generalizable genes are selected.
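
The linear-model ranking in mode (2) can be illustrated with a minimal NumPy sketch. It is not the library's implementation: the helper name gene_abs_t and the toy data are hypothetical, and the sketch assumes an ordinary least-squares fit in which the full one-hot drug encoding acts as a per-drug intercept. It ranks genes by the absolute t-statistic of the gene coefficient rather than by the p-value. Under a two-sided t-test these two rankings agree, because every per-gene model has the same residual degrees of freedom.

```python
import numpy as np

def gene_abs_t(expr, response, drug_ids):
    """Return |t| of the gene coefficient in response ~ gene + one-hot(drug), per gene.

    Hypothetical helper for illustration only; not part of candle.
    """
    expr = np.asarray(expr, dtype=float)
    response = np.asarray(response, dtype=float)
    drugs, codes = np.unique(drug_ids, return_inverse=True)
    onehot = np.eye(len(drugs))[codes]          # one indicator column per drug
    t_abs = np.empty(expr.shape[1])
    for j in range(expr.shape[1]):
        # Design matrix: expression of gene j plus the drug indicators.
        X = np.column_stack([expr[:, j], onehot])
        beta, _, _, _ = np.linalg.lstsq(X, response, rcond=None)
        resid = response - X @ beta
        dof = X.shape[0] - X.shape[1]           # residual degrees of freedom
        sigma2 = resid @ resid / dof
        cov_beta = sigma2 * np.linalg.pinv(X.T @ X)
        t_abs[j] = abs(beta[0]) / np.sqrt(cov_beta[0, 0])
    return t_abs

# Toy data (assumed): 3 drugs, 20 tumors each, 5 genes; gene 2 drives the response.
rng = np.random.default_rng(0)
drug_ids = np.repeat(["drugA", "drugB", "drugC"], 20)
expr = rng.normal(size=(60, 5))
offsets = {"drugA": 0.0, "drugB": 1.0, "drugC": -1.0}
drug_effect = np.array([offsets[d] for d in drug_ids])
response = 2.0 * expr[:, 2] + drug_effect + 0.1 * rng.normal(size=60)

t_abs = gene_abs_t(expr, response, drug_ids)
num_predictive_gene = 2
# Largest |t| corresponds to the smallest p-value.
predictive = np.argsort(-t_abs)[:num_predictive_gene]
```

In this toy setup the gene that generates the response should rank first; the remaining slot is filled by whichever noise gene happens to score highest.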

Parameters:
  • source_data – pandas data frame of gene expressions of tumors, for which drug response is known. Its size is [n_source_samples, n_features].

  • target_data – pandas data frame of gene expressions of tumors, for which drug response needs to be predicted. Its size is [n_target_samples, n_features]. source_data and target_data must have the same set of features, in the same order.

  • drug_response_data – pandas data frame of drug response that must include a column of drug response values, a column of tumor IDs, and a column of drug IDs.

  • drug_response_col – non-negative integer or string. If integer, it is the column index of drug response in drug_response_data. If string, it is the column name of drug response.

  • tumor_col – non-negative integer or string. If integer, it is the column index of tumor IDs in drug_response_data. If string, it is the column name of tumor IDs.

  • drug_col – non-negative integer or string. If integer, it is the column index of drug IDs in drug_response_data. If string, it is the column name of drug IDs.

  • prediction_power_measure (string) – ‘pearson’ uses the absolute value of the Pearson correlation coefficient to measure the prediction power of a gene; ‘mutual_info’ uses mutual information to measure the prediction power of a gene; ‘lm’ uses a linear regression model to select predictive genes for multiple drugs. Default is ‘lm’.

  • num_predictive_gene (int) – the number of predictive genes to be selected.

  • generalization_power_measure (string) – ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. Default is ‘ccc’.

  • num_generalizable_gene (int) – the number of generalizable genes to be selected.

  • union_of_single_drug_selection (bool) – whether the final gene set should be the union of genes selected for every drug.
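
To make the generalization_power_measure options concrete, here is a rough sketch of how a gene's generalization power could be scored. It rests on assumptions that this page does not confirm: that the score follows the usual COXEN idea of comparing each gene's co-expression pattern with the other candidate genes in the source data against the same pattern in the target data, and that ‘ccc’ means Lin's concordance correlation coefficient computed with population variances. The function names ccc and generalization_scores are hypothetical, and excluding the self-correlation entry is a choice made for this sketch.

```python
import numpy as np

def ccc(x, y):
    """Lin's concordance correlation coefficient between two 1-D arrays."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2.0 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

def generalization_scores(source, target, measure="ccc"):
    """Score each gene by how well its co-expression pattern carries over.

    source, target: arrays of shape [n_samples, n_genes] with matching columns.
    Hypothetical helper for illustration only; not part of candle.
    """
    src_corr = np.corrcoef(np.asarray(source, dtype=float), rowvar=False)
    tgt_corr = np.corrcoef(np.asarray(target, dtype=float), rowvar=False)
    n_genes = src_corr.shape[0]
    scores = np.empty(n_genes)
    for i in range(n_genes):
        others = np.arange(n_genes) != i      # drop the trivial self-correlation
        a, b = src_corr[i, others], tgt_corr[i, others]
        if measure == "ccc":
            scores[i] = ccc(a, b)
        else:                                  # 'pearson'
            scores[i] = np.corrcoef(a, b)[0, 1]
    return scores

# Toy check (assumed data): identical source and target give perfect scores.
rng = np.random.default_rng(1)
source = rng.normal(size=(30, 4))
scores_same = generalization_scores(source, source.copy(), measure="ccc")

# Keep the genes with the highest scores.
num_generalizable_gene = 2
generalizable = np.argsort(-scores_same)[:num_generalizable_gene]
```

Unlike the Pearson coefficient, the concordance coefficient also penalizes shifts in mean and scale between the two patterns, which is one reason a caller might prefer ‘ccc’ over ‘pearson’.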

Returns:

1-D numpy array containing the indices of selected genes.
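
As a usage sketch, the returned indices can be applied to both data frames to keep only the selected genes. The example assumes that the indices are positional column indices into source_data and target_data, and that tumor IDs in drug_response_data match the row labels of source_data; neither point is stated on this page. The toy frames, the column names 'response', 'tumor', and 'drug', and the stand-in array are all hypothetical. The call itself is left as a comment because it requires the candle package.

```python
import numpy as np
import pandas as pd

genes = ["g0", "g1", "g2", "g3"]
# Toy inputs in the documented layout: rows are tumors, columns are genes.
source_data = pd.DataFrame(np.arange(12.0).reshape(3, 4),
                           index=["t1", "t2", "t3"], columns=genes)
target_data = pd.DataFrame(np.arange(8.0).reshape(2, 4),
                           index=["t4", "t5"], columns=genes)
# Long-format response table: one row per (tumor, drug) measurement.
drug_response_data = pd.DataFrame({
    "response": [0.2, 0.8, 0.5, 0.1],
    "tumor": ["t1", "t2", "t3", "t1"],
    "drug": ["drugA", "drugA", "drugB", "drugB"],
})

# selected = candle.P1_utils.coxen_multi_drug_gene_selection(
#     source_data, target_data, drug_response_data,
#     drug_response_col="response", tumor_col="tumor", drug_col="drug")
selected = np.array([0, 3])  # stand-in for the returned 1-D index array

# Subset both frames by position so they keep the same reduced feature set.
source_selected = source_data.iloc[:, selected]
target_selected = target_data.iloc[:, selected]
selected_names = source_data.columns[selected]
```

Subsetting both frames with the same index array preserves the requirement that source and target features stay aligned.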