Graduation Year
2017
Document Type
Open Access Senior Thesis
Degree Name
Bachelor of Science
Department
Mathematics
Reader 1
Talithia Williams
Reader 2
Tanja Srebotnjak
Reader 3
Blake Hunter
Terms of Use & License Information
Rights Information
© 2017 Dylan K. Baker
Abstract
With the abundance of written information available online, it is useful to be able to automatically synthesize and extract meaningful information from text corpora. We present a novel method for visualizing relationships between documents in a text corpus. By using Latent Dirichlet Allocation to extract topics from the corpus, we create a graph whose nodes represent individual documents and whose edge weights indicate the distance between the documents' topic distributions. Edge lengths are then set using multidimensional scaling, so that more similar documents cluster together. Applying this method to several datasets, we demonstrate that these graphs are useful in visually representing high-dimensional document clustering in topic-space.
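As a rough illustration of the graph-construction step described in the abstract, the following minimal sketch computes pairwise distances between per-document topic distributions and stores them as edge weights. It rests on several assumptions not stated in this record: the topic distributions are hypothetical hand-written values standing in for the output of a Latent Dirichlet Allocation model, the Hellinger distance is just one reasonable choice of distance between distributions (the abstract does not name the measure the thesis uses), and the function names `hellinger` and `similarity_edges` are invented for this example. The multidimensional scaling step is not shown.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support. Chosen here as an assumption; the thesis
    may use a different distance.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of topics")
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


def similarity_edges(doc_topics):
    """Return {(doc_i, doc_j): distance} for every unordered document pair."""
    return {
        (a, b): hellinger(doc_topics[a], doc_topics[b])
        for a, b in combinations(sorted(doc_topics), 2)
    }


# Hypothetical topic distributions over three topics, standing in for
# the per-document output of an LDA model.
doc_topics = {
    "doc_a": [0.80, 0.15, 0.05],
    "doc_b": [0.75, 0.20, 0.05],
    "doc_c": [0.05, 0.10, 0.85],
}

edges = similarity_edges(doc_topics)

# Print edges from most similar (smallest distance) to least similar.
for (a, b), dist in sorted(edges.items(), key=lambda kv: kv[1]):
    print(f"{a} -- {b}: {dist:.3f}")
```

In this toy input, `doc_a` and `doc_b` put most of their weight on the same topic, so their edge has a much smaller weight than either document's edge to `doc_c`. A multidimensional scaling routine could then take the full distance matrix and place the nodes so that on-screen distances approximate these weights.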
Recommended Citation
Baker, Dylan, "The Document Similarity Network: A Novel Technique for Visualizing Relationships in Text Corpora" (2017). HMC Senior Theses. 100.
https://scholarship.claremont.edu/hmc_theses/100
Source Fulltext
https://www.math.hmc.edu/~dbaker/thesis/