JAIT 2023 Vol.14(5): 1073-1081
doi: 10.12720/jait.14.5.1073-1081

Analysis of Language Model Role in Improving Machine Translation Accuracy for Extremely Low Resource Languages

Herry Sujaini 1,*, Samuel Cahyawijaya 2, and Arif B. Putra 1
1. Department of Informatics, University of Tanjungpura, Pontianak, Indonesia; Email: arifbpn@untan.ac.id (A.B.P.)
2. Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Kowloon, Hong Kong; Email: scahyawijaya@ust.hk (S.C.)
*Correspondence: hs@untan.ac.id (H.S.)

Manuscript received March 5, 2023; revised April 11, 2023; accepted June 25, 2023; published October 13, 2023.

Abstract—Several previous studies have suggested using statistical machine translation instead of neural machine translation for extremely low-resource languages. In our machine translation experiments, we translated texts from 12 regional languages into Indonesian. We increased translation accuracy for these 12 extremely low-resource languages by training the target-side language model on monolingual corpora of several sizes. Because many Indonesian sources are available, we added this monolingual Indonesian text to improve the model’s performance. Our study aims to analyze and evaluate how language models trained on different monolingual corpora affect machine translation accuracy. According to our experiments, enlarging the monolingual corpus does not always increase accuracy. Therefore, several experiments are needed to determine which monolingual corpus yields the best translation quality. Melayu Pontianak achieved the highest Bilingual Evaluation Understudy (BLEU) improvement: adding a monolingual corpus of 50–100K yielded an improvement of 2.15 BLEU points, the largest among the twelve languages tested.
 
Keywords—statistical machine translation, extremely low resource languages, monolingual corpus, language model, Indonesian

Cite: Herry Sujaini, Samuel Cahyawijaya, and Arif B. Putra, "Analysis of Language Model Role in Improving Machine Translation Accuracy for Extremely Low Resource Languages," Journal of Advances in Information Technology, Vol. 14, No. 5, pp. 1073-1081, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.