Date of Award


Degree Type

Open Access Dissertation

Degree Name

Information Systems and Technology, PhD


Center for Information Systems and Technology

Dissertation or Thesis Committee Member

Yan Li

Dissertation or Thesis Committee Member

Samir Chatterjee

Dissertation or Thesis Committee Member

Gondy Leroy

Terms of Use & License Information

Terms of Use for work posted in Scholarship@Claremont.

Rights Information

© 2023 Jeffrey Harwell


big data, common crawl, machine learning, natural language processing, viewpoint diversity

Subject Categories

Systems Science


A fundamental requirement for Western democracy is an informed and engaged electorate with access to a wide range of viewpoints. However, concerns have arisen regarding how information technology affects the diversity of viewpoints available. In response to an increasingly polarized society and worries surrounding filter bubbles and algorithmic bias, this research presents a novel tool for constructing internet-based topical corpora and an algorithm tailored for viewpoint detection and the curation of diverse search results.
Following a comprehensive exploration of viewpoint diversity through the lenses of mass media, social psychology, and information retrieval, this dissertation presents an approach to operationalizing viewpoint diversity that is rooted in a cross-linguistic discourse analysis model. The proposed viewpoint detection algorithm uses context-specific sentence embedding features, sentiment features, and topic features. The algorithm was developed and refined on the Internet Argument Corpus (Abbott et al., 2016).
To assess the efficacy of the viewpoint detection algorithm, we need to construct topical corpora from online sources and identify the viewpoints expressed in these documents. To achieve this, we first develop a big data processing architecture for creating indexed corpora from the Common Crawl web archives. The architecture is instantiated as an automated tool that generates an intelligible topical corpus through a series of steps: processing, filtering, cleaning, and removing duplicate content. Using this tool, we processed approximately 1.2 billion web pages from the Common Crawl dataset, resulting in four distinct corpora. Each corpus includes around 1,000 relevant documents for a specific topic. The viewpoint diversity algorithm was then used to identify, for each of two opposing stances, the ten most relevant documents likely to represent that stance, yielding a collection of 20 documents per topic.
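The curation step described above (ten relevant documents per opposing stance, 20 per topic) can be illustrated with a minimal sketch. It is not the dissertation's implementation: the `ScoredDoc` type, its `relevance` and `stance` fields, and the `curate_balanced` function are hypothetical names introduced here, and the sketch assumes that each document already carries a relevance score and a predicted stance label from some upstream model.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ScoredDoc:
    doc_id: str
    relevance: float  # hypothetical relevance score; higher means more relevant
    stance: str       # hypothetical predicted stance label, e.g. "pro" or "con"


def curate_balanced(docs: List[ScoredDoc], per_stance: int = 10) -> Dict[str, List[ScoredDoc]]:
    """Group documents by predicted stance and keep the most relevant ones in each group."""
    by_stance: Dict[str, List[ScoredDoc]] = {}
    for doc in docs:
        by_stance.setdefault(doc.stance, []).append(doc)
    # Sort each stance group by descending relevance and truncate to per_stance items.
    return {
        stance: sorted(group, key=lambda d: d.relevance, reverse=True)[:per_stance]
        for stance, group in by_stance.items()
    }
```

With two stances and `per_stance=10`, the result holds at most 20 documents per topic. If one stance has fewer than ten candidates, the set is unbalanced, which is one way a curated list can fall short of viewpoint balance.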
A group of volunteers independently assessed and assigned viewpoints to each document, and then resolved any disagreements collaboratively. The viewpoint detection algorithm was evaluated against the resulting gold standard. It successfully curated a set of documents with balanced viewpoints for the evolution topic, demonstrating its ability to generalize from the Internet Argument Corpus to open internet documents. However, the algorithm did not generalize well for the abortion and gun control topics. We discuss the reasons behind these discrepancies and suggest potential solutions. In summary, this dissertation contributes to enhancing viewpoint diversity and transparency in online content, and it offers insights into the challenges of achieving this goal and potential ways to address them. Our research not only represents a step toward promoting transparency in managing viewpoints on the internet, but also provides tools that help researchers and practitioners make more effective use of large-scale textual data from the web.
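As an illustration of the evaluation against the volunteer gold standard mentioned above, the following sketch compares predicted stance labels with gold labels and reports how the gold labels are distributed across the curated set. The function name `evaluate_against_gold`, the dictionary layout, and the choice of simple accuracy are assumptions made for this example; the abstract does not state which metrics the dissertation uses.

```python
from collections import Counter
from typing import Dict, Tuple


def evaluate_against_gold(
    predicted: Dict[str, str], gold: Dict[str, str]
) -> Tuple[float, Counter]:
    """Return (accuracy, gold-label counts) over documents present in both mappings."""
    shared = [doc_id for doc_id in predicted if doc_id in gold]
    if not shared:
        raise ValueError("no documents in common between predictions and gold standard")
    correct = sum(predicted[doc_id] == gold[doc_id] for doc_id in shared)
    # The gold-label counts show whether the curated set is actually balanced.
    balance = Counter(gold[doc_id] for doc_id in shared)
    return correct / len(shared), balance
```

A curated set can score poorly on balance even when many individual predictions are correct, so it helps to look at the two numbers together.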