Date of Award

2020

Degree Type

Open Access Dissertation

Degree Name

Computational Science Joint PhD with San Diego State University, PhD

Program

Institute of Mathematical Sciences

Advisor/Supervisor/Committee Chair

Juanjuan Fan

Dissertation or Thesis Committee Member

Ralph-Axel Müller

Dissertation or Thesis Committee Member

Barbara Bailey

Dissertation or Thesis Committee Member

John Angus

Terms of Use & License Information

Terms of Use for work posted in Scholarship@Claremont.

Rights Information

© 2020 Afrooz Jahedi

Keywords

Autism, Binary Classification Modeling, Missing Values, Multivariate Matching, R Package, Random Forest

Subject Categories

Neurosciences | Statistics and Probability

Abstract

Random Forest (RF) is a flexible, easy-to-use machine learning algorithm proposed by Leo Breiman in 2001 for building a predictor ensemble from a set of decision trees grown in randomly selected subspaces of the data. Its strong prediction accuracy has made it one of the most widely used algorithms in machine learning. In this dissertation, we use the random forest as the main building block for creating a proximity matrix for multivariate matching and for diagnostic classification, with autism research as an exemplary application. In observational studies, matching is used to optimize the balance between treatment groups. Although many matching algorithms can achieve this goal, matching faces its own challenges in some fields, and datasets with small sample sizes and limited control reservoirs are particularly prone to them. This problem may apply to many ongoing research fields, such as autism spectrum disorder (ASD). We are interested in eliminating the effect of undesirable variables using two types of algorithms: 1:k nearest matching and full matching. We therefore first introduce three different 1:k nearest matching algorithms and two full-matching-based methods, and compare group-wise matching with pairwise matching in terms of the balance and sample size they achieve. The proposed methods are applied to a data set from the Brain Development Imaging Lab (BDIL) at San Diego State University. Next, we introduce the iterMatch R package. This package finds a 1:1 matched subsample of the data that is balanced on all matching variables while incorporating missing values in an iterative manner. Otherwise, missing values in a dataset must be imputed, or only complete cases can be considered in matching. Losing data because of such limitations in a matching algorithm can decrease the power of the study and omit important information.
In addition to introducing the iterMatch package, we discuss tuning its input parameters using medium and large datasets from the Autism Brain Imaging Data Exchange (ABIDE). We then propose two mixed-effects random forest-based classification algorithms applicable to multi-site (clustered) data, using resting-state fMRI (rs-fMRI) and structural MRI (sMRI). While building the prediction model, these algorithms internally control for the random effect of site, a confounding factor, and the fixed effect of the phenotypic variable age. Beyond controlling for confounding variables, these algorithms remove the need for a separate dimension reduction algorithm for high-dimensional data such as functional connectivity, and do so in a non-linear fashion. We show that the proposed algorithms can achieve prediction accuracy over 80 percent on test data.
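
As an illustration of the 1:k nearest matching idea mentioned in the abstract, the following is a minimal sketch in Python. It is not the dissertation's algorithm and not the iterMatch package. It assumes a proximity matrix is already available (for a random forest, proximity between two units is commonly taken as the share of trees in which they land in the same terminal node) and applies a simple greedy rule without replacement. The function name `greedy_one_to_k_match` and the toy matrix are hypothetical.

```python
from typing import Dict, List, Sequence


def greedy_one_to_k_match(
    proximity: Sequence[Sequence[float]],
    treated: Sequence[int],
    controls: Sequence[int],
    k: int = 1,
) -> Dict[int, List[int]]:
    """Greedy 1:k nearest matching without replacement (illustrative only).

    proximity[i][j] is a similarity between units i and j, where a higher
    value means the two units are more alike.
    """
    available = set(controls)
    matches: Dict[int, List[int]] = {}
    for t in treated:
        # Rank the still-unmatched controls by similarity to t;
        # ties are broken by the smaller index so results are deterministic.
        ranked = sorted(available, key=lambda c: (-proximity[t][c], c))
        chosen = ranked[:k]
        if len(chosen) < k:
            # Control reservoir exhausted: remaining treated units stay unmatched.
            break
        matches[t] = chosen
        available.difference_update(chosen)
    return matches


# Hypothetical toy data: units 0-1 are "treated", units 2-5 are controls.
toy_proximity = [
    [1.0, 0.2, 0.9, 0.1, 0.6, 0.3],
    [0.2, 1.0, 0.8, 0.7, 0.1, 0.4],
    [0.9, 0.8, 1.0, 0.2, 0.3, 0.1],
    [0.1, 0.7, 0.2, 1.0, 0.5, 0.2],
    [0.6, 0.1, 0.3, 0.5, 1.0, 0.4],
    [0.3, 0.4, 0.1, 0.2, 0.4, 1.0],
]
pairs = greedy_one_to_k_match(toy_proximity, treated=[0, 1], controls=[2, 3, 4, 5], k=1)
print(pairs)
```

A greedy rule depends on the order in which treated units are processed: here unit 1 would prefer control 2, but unit 0 has already claimed it. This order dependence is one reason a limited control reservoir makes matching hard, and why full matching is often considered as an alternative.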

ISBN

9798662444980
