Researcher ORCID Identifier

0009-0005-9195-3839

Graduation Year

2026

Date of Submission

4-2026

Document Type

Open Access Senior Thesis

Degree Name

Bachelor of Arts

Department

Mathematical Sciences

Second Department

Biology

Reader 1

Shibu Yooseph

Terms of Use & License Information

Terms of Use for work posted in Scholarship@Claremont.

Rights Information

© 2026 Matthew Q Jabro

Abstract

Soil harbors the most diverse microbial communities on Earth, yet whether predictable community types exist across biomes and whether taxonomic composition encodes habitat of origin remain open questions at global scale. This thesis addresses both questions by applying unsupervised clustering and supervised classification to transformed 16S ribosomal RNA (rRNA) amplicon profiles from two independent datasets: the global topsoil survey of Bahram et al. (193 samples) and the Earth Microbiome Project (EMP) soil subset of Thompson et al. (2,209 samples). Application of a sample clustering method based on a mixture of Gaussian Graphical Models (MixGGM) identified 19 clusters in the topsoil dataset and 10 in the EMP dataset; cluster assignments were significantly associated with biome type, geographic location, and physicochemical variables in both cases. Random forest classifiers trained on the same sample-taxa matrices achieved 61.8% test-set accuracy under grouped biome labels for the topsoil data and 98.2% for the EMP data. The performance gap likely reflects the underlying biology: major habitat boundaries (marine vs. terrestrial vs. freshwater) produce strong compositional contrasts, while within-terrestrial biome differences are subtler and driven by overlapping environmental gradients rather than discrete community turnover. The most predictive taxa, including Ellin516, Candidatus Udaeobacter, Anaeromyxobacter, Bacillus, and Paenibacillus, have known ecological associations consistent with the biome categories they discriminated. Taken together, these results demonstrate that genus-level 16S rRNA profiles carry sufficient information to recover recurring community types and predict habitat of origin, with predictive power scaling with the breadth of environmental contrast in the dataset.
