JAIT 2023 Vol.14(4): 796-802
doi: 10.12720/jait.14.4.796-802

Linguistic Driven Feature Selection for Text Classification as Stop Word Replacement

Daniel Schönle 1,*, Christoph Reich 1, and Djaffar Ould Abdeslam 2
1. Das Institut für Data Science, Cloud Computing und IT-Sicherheit (IDACUS), Furtwangen University, Furtwangen, Germany; Email: rch@hs-furtwangen.de (C.R.)
2. L’Institut de Recherche en Informatique, Mathématiques, Automatique et Signal (IRIMAS), Université de Haute Alsace, Mulhouse, France; Email: djafar.ould-abdeslam@uha.fr (D.O.A.)
*Correspondence: daniel.schoenle@hs-furtwangen.de (D.S.)

Manuscript received October 8, 2022; revised December 27, 2022; accepted January 30, 2023; published August 11, 2023.

Abstract—The common corpus optimization method “stop word removal” is based on the assumption that text tokens with a high occurrence frequency can be removed without affecting classification performance. This ignores linguistic information about sentence structure as well as the preferences of the classification technology. We propose the Weighted Unimportant Part-of-Speech Model (WUP-Model) for token removal in the pre-processing of text corpora. The weighted relevance of a token is determined using classification relevance and classification performance impact. The WUP-Model uses linguistic information (part of speech) as the grouping criterion. Analogous to stop word removal, we provide a set of irrelevant parts of speech (a WUP-Instance) for word removal. In a proof of concept, we created WUP-Instances for several classification algorithms. The evaluation showed significant advantages compared to classic stop word removal. The tree-based classifier improved by 65% in runtime and by 25% in performance. The performance of the other classifiers decreased by 0.2% to 2.4%, while their runtime improved by 4.4% to 24.7%. These results demonstrate the beneficial effects of the proposed WUP-Model.
 
Keywords—text classification, natural language processing, feature selection, linguistics
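
The idea described in the abstract, removing tokens by part of speech rather than by a fixed word list, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the function names `select_unimportant_pos` and `remove_by_pos`, the linear blend of the two scores, the `alpha` and `threshold` parameters, the Universal POS tag names, and all numeric scores are hypothetical and are not taken from the paper, whose actual weighting is not given in the abstract.

```python
from typing import Dict, Iterable, List, Set, Tuple

TaggedToken = Tuple[str, str]  # (token, part-of-speech tag)


def select_unimportant_pos(
    relevance: Dict[str, float],
    impact: Dict[str, float],
    alpha: float = 0.5,
    threshold: float = 0.2,
) -> Set[str]:
    """Return the POS tags whose weighted score falls below `threshold`.

    ASSUMPTION: the linear blend and both parameters are illustrative only;
    they stand in for the paper's weighting, which the abstract does not state.
    """
    return {
        pos
        for pos in relevance
        if alpha * relevance[pos] + (1 - alpha) * impact.get(pos, 0.0) < threshold
    }


def remove_by_pos(tokens: Iterable[TaggedToken], unimportant: Set[str]) -> List[str]:
    """Drop every token whose POS tag is in the `unimportant` set."""
    return [tok for tok, pos in tokens if pos not in unimportant]


# Hypothetical per-tag scores (made-up numbers, not results from the paper).
relevance = {"NOUN": 0.9, "VERB": 0.6, "DET": 0.05, "ADP": 0.1, "ADJ": 0.5}
impact = {"NOUN": 0.8, "VERB": 0.5, "DET": 0.0, "ADP": 0.1, "ADJ": 0.4}

# The resulting tag set plays the role of a WUP-Instance in this sketch.
wup_instance = select_unimportant_pos(relevance, impact)

# Input is assumed to be POS-tagged already by some external tagger.
sentence = [("the", "DET"), ("classifier", "NOUN"), ("runs", "VERB"),
            ("on", "ADP"), ("short", "ADJ"), ("texts", "NOUN")]
filtered = remove_by_pos(sentence, wup_instance)
print(sorted(wup_instance), filtered)
```

With these made-up scores, determiners and adpositions fall below the threshold and are removed, while nouns, verbs, and adjectives are kept. In practice the tag set would presumably be derived separately for each classification algorithm, as the abstract describes for WUP-Instances.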

Cite: Daniel Schönle, Christoph Reich, and Djaffar Ould Abdeslam, "Linguistic Driven Feature Selection for Text Classification as Stop Word Replacement," Journal of Advances in Information Technology, Vol. 14, No. 4, pp. 796-802, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.