The potential advantage of data merging was evaluated by means of a quantitative rate of correct classification

The gene signatures were built with supervised machine learning algorithms such as Support Vector Machines, unsupervised classification methods such as clustering, or statistical methods such as the Cox regression model and the likelihood ratio test. Such gene signatures consist of a list of genes, usually associated with weights that are used to compute a predictive score. Note, however, that different studies used different performance measures to this end, such as the percentage of concordance in classification. Gene signatures are also evaluated in terms of "robustness" and "reproducibility". Robustness is related to the sample size of the training set from which a gene signature is built and the size of the testing set on which it is validated. A predictor generated from a small training set could have a high prediction accuracy on the training set but may lose generalization power when it is validated on an independent testing set. Moreover, performance estimates obtained with a small testing set have high statistical error, i.e. they come with a large confidence interval. Reproducibility, on the other hand, means the convergence of results obtained from replicate experiments, possibly carried out in different labs and relying on different technologies. Reproducibility is assessed by cross-data set validation, i.e. the evaluation of a gene signature derived from one data set on a testing set originating from another study. In this work, we analyzed the potential benefits of merging data sets for prognostic application in breast cancer. Contrary to related work, we did not discretize the clinical follow-up information into good and poor outcome classes, a practice which results in loss of information. Instead, we directly used censored survival data to derive a gene signature that allows for the computation of a risk score from a patient's expression profile.
The risk score was based on the Cox proportional hazards model and was expected to be inversely related to the time to death or relapse. The basic design of our study is as follows. We used eight breast cancer microarray data sets from eight different studies. Each set had clinical follow-up information in the form of censored time-to-event data, the event being either "overall survival" or "relapse-free survival" or both. The goal was to extract a gene signature from a training set that can be used to predict disease outcome for patients in the testing set. The gene signature we used consisted of a set of genes plus corresponding Cox coefficients derived by univariate fitting of the expression values to the survival data on the training set. Gene signatures were built from the individual or merged data sets. The accuracy and robustness of prediction were evaluated by 10-fold cross-validation. Reproducibility as defined above was analyzed by training a signature on one or several complete data sets and testing its performance on complete independent validation sets. Data sets were merged in their original numerical representation using two different data integration methods: ComBat and Z-score normalization. Two signature performance measures were computed in each experiment: the time-dependent Receiver Operating Characteristic (ROC) Area Under the Curve (AUC) and the hazard ratio of the predicted risk scores relative to the survival data in the testing set.
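To make the merging and scoring steps concrete, the following is a minimal sketch in Python, not the implementation used in the study. It assumes that Z-score normalization is applied per gene within each data set before the sets are concatenated, and that the risk score is the sum of each signature gene's Cox coefficient times its normalized expression value. The function names, gene names, expression values and coefficients are all hypothetical; in the study the coefficients come from univariate Cox fits on the training set, which are not reproduced here.

```python
from statistics import mean, pstdev


def zscore_by_gene(dataset):
    """Standardize each gene within one data set to mean 0 and standard deviation 1.

    dataset maps a gene name to a list of expression values, one per patient.
    """
    normalized = {}
    for gene, values in dataset.items():
        mu, sd = mean(values), pstdev(values)
        # A gene with constant expression carries no information; map it to 0.
        normalized[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return normalized


def merge(datasets):
    """Concatenate patients across data sets, keeping only genes present in all."""
    shared = set.intersection(*(set(d) for d in datasets))
    return {g: [v for d in datasets for v in d[g]] for g in sorted(shared)}


def risk_scores(expression, coefficients):
    """Return one risk score per patient: sum over genes of coefficient * expression."""
    n_patients = len(next(iter(expression.values())))
    return [
        sum(coefficients[g] * expression[g][i] for g in coefficients)
        for i in range(n_patients)
    ]


# Toy data: two studies measured on very different numerical scales.
study_a = {"GENE1": [1.0, 2.0, 3.0], "GENE2": [10.0, 10.0, 40.0]}
study_b = {"GENE1": [100.0, 300.0], "GENE2": [5.0, 7.0], "GENE3": [0.1, 0.2]}

merged = merge([zscore_by_gene(study_a), zscore_by_gene(study_b)])

# Made-up Cox coefficients, standing in for values fitted on a training set.
coefficients = {"GENE1": 0.8, "GENE2": -0.5}
scores = risk_scores(merged, coefficients)
```

Normalizing within each study before concatenation puts the two scales on a common footing, so a single set of coefficients can be applied to all five patients. Under the Cox model a higher score corresponds to a higher hazard, and hence a shorter expected time to the event.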
