Date of Award

Fall 2020

Degree Type

Open Access Dissertation

Degree Name

Computational Science Joint PhD with San Diego State University, PhD


Institute of Mathematical Sciences

Advisor/Supervisor/Committee Chair

Robert Edwards

Dissertation or Thesis Committee Member

Claudia Rangel

Dissertation or Thesis Committee Member

Anca Segall

Dissertation or Thesis Committee Member

Allon Percus

Terms of Use & License Information

Terms of Use for work posted in Scholarship@Claremont.

Rights Information

© Vito Adrian Cantu Alessio Robles, 2020 All rights reserved


As of October 2020, there are 18.6 × 10^15 DNA base pairs publicly available in the Sequence Read Archive, and this number is growing at an exponential rate. As DNA sequencing prices continue to drop, many research groups around the world have incorporated high-throughput sequencing into their research, giving us access to sequences from many distinct ecosystems. This has revolutionized the field of metagenomics, which aims to fully characterize all organisms and their interactions in a particular system. Nevertheless, the plethora of available data has made its analysis difficult: traditional techniques such as genome assembly or sequence alignment are bound to fail due to the high noise of metagenomes, or take an impractically long time due to their size. In this thesis, we explore those challenges and develop techniques to meet them. Chapter 1 serves as an introduction to the fields of metagenomics and machine learning and the applications where the two meet. Chapter 2 examines the different kinds of noise in sequencing datasets and presents PRINSEQ++, a multi-threaded C++ program for quality control of sequencing datasets. Chapter 3 describes the analysis of 63 metagenomic samples from children with "nodding syndrome" using Random Forest to give insights into the etiology of the disease. Chapter 4 explores the use of artificial neural networks to classify phage structural proteins derived from metagenomes.