@misc{oai:ir.soken.ac.jp:00001677, author = {小森, 理 and コモリ, オサム and KOMORI, Osamu}, month = {2016-02-17}, note = {With the advent of the information age, a huge amount of data has been collected in laboratories and hospitals. It includes not only clinical data such as age, laboratory test values, and the size of internal organs, but also genomic data such as gene expression patterns, single nucleotide polymorphisms (SNPs), and the proteome. Based on this information, we want to predict as accurately as possible the condition (diseased or non-diseased) of a subject who comes to a hospital and undergoes clinical tests. However, it is often difficult to analyze this variety of medical data within a traditional statistical framework. Moreover, some criteria are particularly suited to the medical and clinical sciences. Hence, we have tried to develop a new statistical method that can deal with these data and provide useful information for discrimination, based on a criterion that is widely used by medical doctors and clinical researchers.  In medical and biological sciences, the receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) have gained in popularity. The ROC curve originated in signal detection theory, where it was used to measure and compare the performance of radar operators monitoring enemy warplanes. It has also been applied in psychology, and is now used in a variety of discrimination problems. Its appealing points are that the false positive rate (FPR) and the true positive rate (TPR) are both measured in the ROC curve, and that the curve is independent of the population prevalence of disease. FPR and 1-TPR express different aspects of classification performance, so it is important to report the two values separately when evaluating the quality of a classification. 
This independence from prevalence also makes the curve suitable for quantifying the inherent accuracy of classification, and this property distinguishes the AUC from other accuracy measures such as the error rate, the relative risk, or the odds ratio.  In this thesis, we have developed a new statistical method designed to optimize the AUC based on a boosting technique, which is widely used in the machine learning community. The method can deal with the usual low-dimensional settings as well as high-dimensional settings. The main concept of boosting is that a strong classifier (score function) is constructed by combining many "weak classifiers". A weak classifier is one whose discriminant ability is only slightly better than random guessing. The method includes an implicit procedure of marker selection in its boosting algorithm, and produces a score function after an appropriate number of iterations. The resulting score plots are shown to be useful for understanding how each marker is associated with the outcome variable, that is, the status of the subjects (non-diseased or diseased). Hence, our method places importance on classification accuracy as well as on the interpretation of the result. We have also extended this AUC-based boosting method to pAUCBoost, which focuses on the partial area under the ROC curve (pAUC), a measure that is more relevant in some clinical or medical situations.  In Chapter 1, we review accuracy measures other than the AUC and pAUC, which are also important in the clinical evaluation of markers; we investigate their properties and consider why the AUC and pAUC have become popular in recent years. In Chapter 2, we review the progress and development in the machine learning community, and characterize the properties of boosting from an objective viewpoint. In Chapter 3, we propose a new statistical method, termed AUCBoost, discuss its statistical properties, and demonstrate its utility. In Chapter 4, we focus on PSA data analysis. 
This is collaborative research with medical doctors at Keio University Hospital. PSA stands for prostate-specific antigen, a primary marker for prostate cancer. A subject with a PSA level above 4 ng/ml is usually recommended to undergo biopsy; however, the value is affected by age, the size of the prostate gland, and other clinical covariates. Hence, we consider an optimal combination of these markers, as well as their association with prostate cancer, using AUCBoost. As a result, we present a "nomogram", by which medical doctors can decide whether to perform a biopsy, taking into account PSA, age, the volume of the prostate gland, and the number of biopsies previously undergone. The key feature of this nomogram is that the cutoff points are determined so that the sensitivity is at least 95 percent. This idea is quite different from existing nomograms, which are based on the probability of having cancer, and is much more suitable for practical medical diagnosis. In Chapter 5, we extend AUCBoost to pAUCBoost, which focuses on the partial area under the ROC curve. We show that pAUCBoost is preferable to AUCBoost in some clinical situations. In Chapter 6, we describe our ongoing and future work. Finally, we close this thesis with acknowledgements to everyone who supported me during my hard but pleasant doctoral course. application/pdf, 総研大甲第1335号}, title = {Boosting Methods for Maximization of the Area under the ROC Curve and their Applications to Clinical Data}, year = {} }