Date of Award

Spring 2021

Degree Type

Restricted to Claremont Colleges Dissertation

Degree Name

Management of Information System and Technology, PhD

Dissertation or Thesis Committee Member

Yan Li

Dissertation or Thesis Committee Member

Brian Hilton

Dissertation or Thesis Committee Member

Anthony Corso

Abstract

The massive volume of user-generated social media texts (SMT) presents new opportunities as well as challenges for psycholinguistic analysis aimed at understanding individual differences such as personality. SMT are often written informally and thus contain lexical variants such as nonstandard spellings, capitalizations, and abbreviations. These lexical variants are referred to as out-of-vocabulary (OOV) words. They are not captured in the standard dictionaries used by standard Natural Language Processing (NLP) tools. The literature indicates that these OOV words may include hidden linguistic patterns that reflect individual characteristics. These OOV-related linguistic patterns are not captured by the existing closed-vocabulary and open-vocabulary approaches. To address these issues, this dissertation develops two artifacts, following a design science research process model. The first artifact is an OOV-aware data curation process that focuses on capturing and categorizing OOV words. The evaluation of the first artifact demonstrates that it can capture more OOV words and is useful in analyzing SMT. The second artifact is an OOV-aware hybrid approach that integrates the closed-vocabulary and open-vocabulary approaches with expanded OOV categories and OOV words. The hybrid approach shows improved performance over existing approaches. This dissertation makes theoretical contributions by contributing additional OOV knowledge and a new method for psycholinguistic analysis of SMT. It also makes practical contributions by enabling psycholinguistic researchers and practitioners to exploit more psycholinguistic cues for tasks such as personality prediction.

DOI

10.5642/cguetd/217
