Enhanced Named Entity Recognition Algorithm for Filipino Cultural and Heritage Texts

Authors

  • Jhan Lou P Robantes, Pamantasang ng Lungsod ng Maynila
  • Andreo A Serrano, Pamantasang ng Lungsod ng Maynila

Keywords

Named Entity Recognition, Natural Language Processing, Filipino Corpus

Abstract

Named Entity Recognition (NER) is a core natural language processing task that extracts named entities from unstructured text and classifies them into predefined categories. Existing NER methods perform well in general domains but often struggle in specialized contexts such as Filipino cultural and historical texts, which have distinctive linguistic features and diverse naming conventions. This research presents an enhanced rule-based NER approach that addresses these challenges. The system uses a curated Corpus of Historical Filipino and Philippine English (COHFIE) as both training and evaluation data. It builds on pattern-learning methods, incorporates character and token features, and uses positive and negative example sets. To enrich classification, we use the International Committee for Documentation – Conceptual Reference Model (CIDOC-CRM), a cultural heritage framework, to categorize entities by their historical and cultural significance. Compared against existing Filipino models (calamanCy and RoBERTa Tagalog), the enhanced model shows improvement in identifying entities related to Filipino culture (CUL) and history (PER, ORG, LOC).
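
The abstract gives no implementation details, so the following is only a minimal sketch of how pattern selection over character and token features with positive and negative example sets might look. The feature names (CAP, DIGIT, PARTICLE), the particle list, the precision threshold, and all function names are assumptions made for illustration; they are not the authors' method or feature set.

```python
from typing import FrozenSet, List, Sequence, Tuple

Pattern = Tuple[str, ...]   # one required feature name per token position
Example = Sequence[str]     # a tokenized candidate span

# Illustrative linking words found inside Filipino names (assumption, not from the paper).
NAME_PARTICLES = {"ng", "de", "del", "y"}


def token_features(token: str) -> FrozenSet[str]:
    """Map a token to a small set of character- and token-level features."""
    feats = set()
    if token[:1].isupper():
        feats.add("CAP")        # character feature: initial capital
    if token.isdigit():
        feats.add("DIGIT")      # character feature: all digits
    if token.lower() in NAME_PARTICLES:
        feats.add("PARTICLE")   # token feature: known linking word
    return frozenset(feats)


def matches(pattern: Pattern, example: Example) -> bool:
    """A pattern matches when lengths agree and every token carries the required feature."""
    if len(pattern) != len(example):
        return False
    return all(req in token_features(tok) for req, tok in zip(pattern, example))


def precision(pattern: Pattern, positives: List[Example], negatives: List[Example]) -> float:
    """Share of matched examples that come from the positive set."""
    tp = sum(matches(pattern, ex) for ex in positives)
    fp = sum(matches(pattern, ex) for ex in negatives)
    return tp / (tp + fp) if tp + fp else 0.0


def select_patterns(candidates: List[Pattern], positives: List[Example],
                    negatives: List[Example], min_precision: float = 0.8) -> List[Pattern]:
    """Keep candidate patterns that rarely fire on negative examples."""
    return [p for p in candidates if precision(p, positives, negatives) >= min_precision]


# Toy example sets (hypothetical; not drawn from COHFIE).
positives = [["Lungsod", "ng", "Maynila"], ["Jose", "Rizal"]]
negatives = [["Noong", "Lunes"], ["Ang", "bayan"]]
candidates = [("CAP", "PARTICLE", "CAP"), ("CAP", "CAP")]

selected = select_patterns(candidates, positives, negatives)
print(selected)  # [('CAP', 'PARTICLE', 'CAP')]
```

In this toy run the bare CAP CAP pattern is rejected because it also fires on the negative example "Noong Lunes", which is the role a negative example set plays in filtering out over-general rules.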

Published

2025-01-06