Date of Award

Spring 2024

Degree Type

Open Access Dissertation

Degree Name

Computational Science Joint PhD with San Diego State University, PhD

Program

School of Mathematical Sciences

Advisor/Supervisor/Committee Chair

Barbara Bailey

Dissertation or Thesis Committee Member

Chii-Dean Lin

Dissertation or Thesis Committee Member

Claudia Rangel-Escareño

Dissertation or Thesis Committee Member

John Angus

Terms of Use & License Information

Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License

Rights Information

© 2023 Nadia Bernardo Mendoza

Keywords

categorical variables, imputation methods, mice, amelia, missForest, hmisc, missing data, simulations

Subject Categories

Computer Sciences | Statistics and Probability

Abstract

Standard statistical analysis models typically rely on completely observed cases, excluding incomplete rows from the dataset. This approach poses particular challenges when the objective is to predict a rare outcome, especially when some of the observations with the rare outcome are incomplete. In such cases, the information available to support the model in predicting this event is reduced. Theoretically correct models may assign all instances to the majority class, achieving high accuracy, but fail to predict the rare cases, which are often the most interesting ones. Therefore, it is crucial to make the most of all available information, whether complete or not. This challenge is particularly relevant in epidemiological studies aimed at predicting the occurrence of a disease, which typically affects only a small percentage of the population. The primary inspiration for this work is drawn from a linkage-to-HIV-care trial conducted in rural Uganda. The trial focused on enhancing linkage to care for all subjects who tested positive for HIV. Our study is centered on improving accuracy when predicting new HIV cases, accounting for the fact that approximately half of the observations from HIV-positive subjects are incomplete. Imputation methods fill in educated guesses for the missing values in a dataset, enabling the use of all collected information without discarding any observations. Several options are available and, in this work, the popular and freely available imputation methods amelia, hmisc, mice and missForest are evaluated. Simulations were conducted with a variety of scenarios using the HIV data, and with synthetic data focusing on imputation for categorical variables.
The results enable us to suggest guidelines regarding the impact of imputation, considering dataset aspects such as the percentage of missing data, missingness mechanisms (MAR: missing at random; MCAR: missing completely at random; MNAR: missing not at random), number of variables in a dataset, number of classes in predictors and outcome, and initial prediction accuracy with complete case analysis (CCA). We focused on two- and three-class outcome variables, considering random forest and multinomial regression as analysis models. The main metrics assessed were imputation precision, overall prediction accuracy, sensitivity and imputation time. Oversampling or generating synthetic samples from the minority class, combined with undersampling the majority class, did not improve the prediction of new HIV-positive cases from the completely observed cases. On the other hand, increasing minority class information via imputation by any method resulted in improvements. Single imputations by missForest were closer to the observed values, but this did not lead to better predictive models. Although prediction accuracy of HIV status and sensitivity for predicting new HIV positives were higher after hmisc and mice imputations, in general, all methods performed similarly when considering synthetic datasets. In some settings, CCA with low percent missingness (up to 20% or 30%) may result in model accuracy similar to that of the complete or imputed datasets. With larger missingness, modeling after imputation is almost always superior to CCA. Results were not greatly affected by the missingness mechanism or by correlation levels between the predictors, when taking into consideration the other aspects of the incomplete dataset. Mice and missForest were the slowest to impute, but different parallel functionalities available in R proved to be efficient for speeding them up, substantially reducing imputation time even on a laptop with 4 cores.

ISBN

9798382305585
