JAIT 2023 Vol.14(4): 830-837
doi: 10.12720/jait.14.4.830-837

Part-of-Speech (POS) Tagging for Standard Brunei Malay: A Probabilistic and Neural-Based Approach

Izzati Mohaimin 1,*, Rosyzie A. Apong 1, and Ashrol R. Damit 2
1. School of Digital Science, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei Darussalam;
Email: rosyzie.apong@ubd.edu.bn (R.A.A.)
2. Faculty of Arts and Social Sciences, Universiti Brunei Darussalam, Bandar Seri Begawan, Brunei Darussalam;
Email: rahimy.damit@ubd.edu.bn (A.R.D.)
*Correspondence: 20m2052@ubd.edu.bn (I.M.)

Manuscript received October 8, 2022; revised December 28, 2022; accepted January 30, 2023; published August 17, 2023.

Abstract—As online information has increased over the years, text mining researchers have developed Natural Language Processing (NLP) tools to extract relevant and useful information from textual data such as online news articles. The Malay language is widely spoken, especially in the Southeast Asian region, but it lacks NLP resources such as Malay corpora and Part-of-Speech (POS) taggers. Existing NLP tools are mainly based on the Standard Malay of Malaysia and on Indonesian, and none exist for Bruneian Malay. We addressed this issue by designing a Standard Brunei Malay corpus consisting of over 114,000 lexical tokens, annotated using 17 Malay POS tagsets. Furthermore, we implemented two commonly used POS tagging techniques, Conditional Random Field (CRF) and Bi-directional Long Short-Term Memory (BLSTM), to develop Bruneian POS taggers and compared their performance. The results showed that both CRF and BLSTM models performed well in predicting POS tags on Bruneian texts. However, the CRF models outperformed the BLSTM models: CRF using all features achieved an F-Measure of 92.06% on news articles and 90.71% on crime articles. Adding a batch normalization layer to the BLSTM model architecture increased its performance by 7.13%. To further improve the BLSTM models, we suggest increasing the training data and experimenting with different hyperparameter settings. The findings also indicated that modelling BLSTM with fastText improved the POS prediction of Bruneian words.
 
Keywords—part-of-speech tagging, Conditional Random Field (CRF), Bi-directional Long Short-Term Memory (BLSTM), pre-trained word embeddings, batch normalization

Cite: Izzati Mohaimin, Rosyzie A. Apong, and Ashrol R. Damit, "Part-of-Speech (POS) Tagging for Standard Brunei Malay: A Probabilistic and Neural-Based Approach," Journal of Advances in Information Technology, Vol. 14, No. 4, pp. 830-837, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.