Information Systems and Technology (CGU)
Computer Sciences | Databases and Information Systems | Medicine and Health Sciences
Current approaches to word sense disambiguation use (and often combine) various machine learning techniques. Most refer to characteristics of the ambiguity and its surrounding words and are based on thousands of examples. Unfortunately, developing large training sets is burdensome, and in response to this challenge, we investigate the use of symbolic knowledge for small datasets. A naïve Bayes classifier was trained for 15 words with 100 examples for each. Unified Medical Language System (UMLS) semantic types assigned to concepts found in the sentence and relationships between these semantic types form the knowledge base. The most frequent sense of a word served as the baseline. The effect of increasingly accurate symbolic knowledge was evaluated in nine experimental conditions. Performance was measured by accuracy based on 10-fold cross-validation. The best condition used only the semantic types of the words in the sentence. Accuracy was then on average 10% higher than the baseline; however, it varied from 8% deterioration to 29% improvement. To investigate this large variance, we performed several follow-up evaluations, testing additional algorithms (decision tree and neural network), and gold standards (per expert), but the results did not significantly differ. However, we noted a trend that the best disambiguation was found for words that were the least troublesome to the human evaluators. We conclude that neither algorithm nor individual human behavior cause these large differences, but that the structure of the UMLS Metathesaurus (used to represent senses of ambiguous words) contributes to inaccuracies in the gold standard, leading to varied performance of word sense disambiguation techniques.
© 2005 Elsevier B.V.
Leroy, Gondy and Rindflesch, Thomas C., "Effects of Information and Machine Learning Algorithms on Word Sense Disambiguation with Small Datasets" (2005). CGU Faculty Publications and Research. 92.