Emotion Classification in Sundanese Text Using LSTM and BERT Models
Keywords:
BERT, Emotion Classification, LSTM, Sundanese Language
Abstract
The Sundanese language, once spoken by 48 million individuals, has experienced a significant decline in speakers, losing 2 million in the past decade. This decline is attributed to weakened intergenerational transmission and the dominance of more widely used languages. The challenges in developing Natural Language Processing (NLP) tools for Sundanese stem from the lack of annotated corpora, trained language models, and adequate processing tools, complicating efforts to preserve and enhance the language's usability. This research aims to address these challenges by implementing emotion classification in Sundanese text using Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers (BERT) models. The study utilizes a dataset of annotated Sundanese tweets, applying preprocessing techniques such as cleansing, stopword removal, stemming, and tokenization to prepare the data for analysis. The results indicate that the BERT model significantly outperforms the LSTM model, achieving an accuracy of approximately 80% compared to LSTM's 70%. These findings highlight the potential of advanced NLP techniques in enhancing the understanding of emotional nuances in Sundanese communication and contribute to the revitalization of the language in the digital age.
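To make the described pipeline concrete, the sketch below outlines how cleansing, tokenization, and BERT-based emotion classification of a Sundanese tweet could fit together. It is a minimal illustration only: the checkpoint name, the emotion label set, and the cleansing rules are assumptions for demonstration, not the authors' exact configuration, and the classification head shown here would still need fine-tuning on the annotated Sundanese tweet dataset before its predictions are meaningful.

```python
# Minimal sketch of the preprocessing + BERT classification pipeline described
# in the abstract. The model name, label set, and cleansing rules below are
# assumptions for illustration, not the paper's exact setup.
import re

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed checkpoint; the abstract names no specific model
LABELS = ["anger", "fear", "joy", "sadness"]   # assumed emotion label set

def cleanse(text: str) -> str:
    """Basic cleansing: lowercase, strip URLs, mentions, hashtags, and non-letter characters."""
    text = text.lower()
    text = re.sub(r"https?://\S+|@\w+|#\w+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# Note: this attaches a freshly initialized classification head; it must be
# fine-tuned on the annotated Sundanese tweets before the outputs are useful.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

def predict_emotion(tweet: str) -> str:
    """Cleanse, tokenize, and classify a single Sundanese tweet."""
    inputs = tokenizer(cleanse(tweet), return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

# Example Sundanese tweet ("I am very happy today")
print(predict_emotion("abdi bingah pisan dinten ieu"))
```

An equivalent LSTM baseline would replace the transformer encoder with an embedding layer followed by a recurrent layer over the cleansed, stemmed, and tokenized sequence, which is the comparison the abstract reports (roughly 70% accuracy for LSTM versus roughly 80% for BERT).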