An Enhancement of Jiang, Z., et al.’s Compression-Based Classification Algorithm Applied to News Article Categorization
Keywords:
Classification, Compression, News Article, Preprocessing, Unigrams
Abstract
This study enhances the compression-based classification algorithm of Jiang et al. by addressing its limited ability to detect semantic similarity between text documents. The proposed improvements are unigram extraction and optimized concatenation, which together remove the need to compress entire documents. Compressing only the extracted unigrams mitigates the sliding-window limitation inherent to gzip, improving both compression efficiency and similarity detection. The optimized concatenation strategy replaces direct concatenation with the union of the two documents' unigrams, which reduces redundancy and improves the accuracy of Normalized Compression Distance (NCD) calculations. Experimental results across datasets of varying sizes and complexities show an average accuracy improvement of 5.73%, with gains of up to 11% on datasets containing longer documents. These improvements are more pronounced on datasets with high label diversity and complex text structures. The method achieves these results while maintaining computational efficiency, making it suitable for resource-constrained environments. The study thus provides a robust, scalable approach to text classification in which lightweight preprocessing yields more efficient compression and, in turn, more accurate classification.
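The two ideas named in the abstract, compressing extracted unigrams rather than whole documents and using the union of unigrams in place of direct concatenation, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tokenization rule (lowercased alphanumeric runs), the first-seen ordering of unigrams, and the helper names `unigrams`, `union`, `clen`, and `ncd_unigram` are all assumptions made for illustration. The distance itself is the standard NCD, (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), with gzip as the compressor C.

```python
import gzip
import re


def unigrams(text):
    """Distinct lowercase word tokens in first-seen order (assumed tokenization)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return list(dict.fromkeys(tokens))


def union(a, b):
    """Order-preserving union of two unigram lists; shared words appear once."""
    return list(dict.fromkeys(a + b))


def clen(tokens):
    """Length in bytes of the gzip-compressed, space-joined token list."""
    return len(gzip.compress(" ".join(tokens).encode("utf-8")))


def ncd_unigram(text_x, text_y):
    """NCD computed over unigram sets, with the union replacing concatenation."""
    ux, uy = unigrams(text_x), unigrams(text_y)
    cx, cy = clen(ux), clen(uy)
    cxy = clen(union(ux, uy))  # union instead of ux + uy
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    a = "Stocks fell sharply as markets reacted to the rate decision."
    b = "The team won the championship game after a late goal."
    print(ncd_unigram(a, a))  # identical documents
    print(ncd_unigram(a, b))  # unrelated documents
```

Under these assumptions, two identical documents have a distance of exactly zero, because their union equals either unigram list. With direct concatenation the repeated copy would still cost some bytes to encode. In a classifier along the lines of Jiang et al., such a distance would feed a nearest-neighbour vote over labelled training documents.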
References
Bullock, M., Lechowski, M., & Mehra, R. (2024). New standards for a faster and more private internet. The Cloudflare Blog. Retrieved from https://blog.cloudflare.com/new-standards/
Dogra, V., Verma, S., Kavita, N., Chatterjee, P., Shafi, J., Choi, J., & Ijaz, M. F. (2022). A complete process of text classification system using state-of-the-art NLP models. Computational Intelligence and Neuroscience, 2022(1). https://doi.org/10.1155/2022/1883698
Gasparetto, A., Marcuzzo, M., Zangari, A., & Albarelli, A. (2022). A survey on text classification algorithms: From text to predictions. Information, 13(2), 83. https://doi.org/10.3390/info13020083
Hassan, S. U., Ahamed, J., & Ahmad, K. (2022). Analytics of machine learning-based algorithms for text classification. Sustainable Operations and Computers, 3, 238–248. https://doi.org/10.1016/j.susoc.2022.03.001
Jianan, G., Kehao, R., & Binwei, G. (2023). Deep learning-based text knowledge classification for whole-process engineering consulting standards. Journal of Engineering Research, 12(2), 61–71. https://doi.org/10.1016/j.jer.2023.07.011
Jiang, Z., Yang, M. Y. R., Tsirlin, M., Tang, R., & Lin, J. (2022). Less is more: Parameter-free text classification with Gzip. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2212.09410
Jiang, Z., Yang, M., Tsirlin, M., Tang, R., Dai, Y., & Lin, J. (2023). “Low-Resource” text classification: A parameter-free classification method with compressors. Findings of the Association for Computational Linguistics: ACL 2023. https://doi.org/10.18653/v1/2023.findings-acl.426
Jimoh, R. G., Adewole, K. S., Aderemi, T. E., & Balogun, A. O. (2021). Investigative study of unigram and bigram features for short message spam detection. Lecture Notes in Networks and Systems, 70–81. https://doi.org/10.1007/978-3-030-80216-5_6
Ozan, S. (2024). DNA sequence classification with compressors. arXiv (Cornell University). https://doi.org/10.48550/arxiv.2401.14025
Pascarella, A., Gianni, E., Abbondanza, M., Armonaite, K., Pitolli, F., Bertoli, M., ... & Tecchio, F. (2022). Normalized compression distance to measure cortico-muscular synchronization. Frontiers in Neuroscience, 16. https://doi.org/10.3389/fnins.2022.933391
Peters, H. (2023). Commentary: GZIP + KNN beats deep neural networks in text classification. Medium. Retrieved from https://medium.com/@heinrichpeters/commentary-gzip-knn-beats-deep-neural-networks-in-text-classification-f395c71283a6
Singh, Y. V., Naithani, P., Ansari, P., & Agnihotri, P. (2021). News classification system using machine learning approach. 2021 3rd International Conference on Advances in Computing, Communication Control and Networking (ICAC3N), 186–188. https://doi.org/10.1109/icac3n53548.2021.9725409
Volety, R. (2024). News classification techniques using NLP. Labellerr. Retrieved from https://www.labellerr.com/blog/news-classification-using-nlp/
Zhang, D., Li, J., Xie, Y., & Wulamu, A. (2023). Research on performance variations of classifiers with the influence of pre-processing methods for Chinese short text classification. PLOS ONE, 18(10), e0292582. https://doi.org/10.1371/journal.pone.0292582
Zhu, H., & Lei, L. (2022). The research trends of text classification studies (2000–2020): A bibliometric analysis. SAGE Open, 12(2). https://doi.org/10.1177/21582440221089963