Date of Award
Spring 2021
Degree Type
Restricted to Claremont Colleges Dissertation
Degree Name
Management of Information System and Technology, PhD
Dissertation or Thesis Committee Member
Yan Li
Dissertation or Thesis Committee Member
Brian Hilton
Dissertation or Thesis Committee Member
Anthony Corso
Keywords
hybrid, out-of-vocabulary, personality, psycholinguistics, social media text
Abstract
The massive volume of user-generated social media texts (SMT) presents new opportunities as well as challenges for psycholinguistic analysis aimed at understanding individual differences such as personality. SMT are often written informally and thus contain lexical variants such as nonstandard spellings, capitalizations, and abbreviations. These lexical variants are referred to as out-of-vocabulary (OOV) words; they are not captured in the standard dictionaries used by standard Natural Language Processing (NLP) tools. The literature indicates that OOV words may carry hidden linguistic patterns that reflect individual characteristics, yet these OOV-related patterns are not captured by the existing closed-vocabulary and open-vocabulary approaches. To address these issues, this dissertation develops two artifacts, following a design science research process model. The first artifact is an OOV-aware data curation process that focuses on capturing and categorizing OOV words. Its evaluation demonstrates that it can capture more OOV words and is useful in analyzing SMT. The second artifact is an OOV-aware hybrid approach that integrates the closed-vocabulary and open-vocabulary approaches with expanded OOV categories and OOV words. The hybrid approach shows improved performance over existing approaches. This dissertation makes theoretical contributions by adding OOV knowledge and a new method for psycholinguistic analysis of SMT. It also makes practical contributions by enabling psycholinguistic researchers and practitioners to exploit more psycholinguistic cues for tasks such as personality prediction.
DOI
10.5642/cguetd/217
ISBN
9798515243937
Recommended Citation
Liu, Kun. (2021). Incorporate Out-of-Vocabulary Words for Psycholinguistic Analysis using Social Media Texts - An OOV-aware Data Curation Process and a Hybrid Approach. CGU Theses & Dissertations, 217. https://scholarship.claremont.edu/cgu_etd/217. doi: 10.5642/cguetd/217