CLASSIFICATION OF ARTICLES USING MACHINE LEARNING: CASE STUDY OF TRA VINH UNIVERSITY JOURNAL OF SCIENCE, VIETNAM
Abstract
Rapid technological development has led to a growing number of research works being submitted to journals and conferences. Submitting an article can nevertheless be challenging for authors, because submission systems cover a wide range of subjects (that of the Association for Computing Machinery, for example, lists 2,000 subjects) and the manuscript must be assigned to the appropriate subject area before submission. To address this issue, this article proposes an automatic solution that extracts information from scientific papers and categorizes them into relevant topics. The proposed approach combines pre-processing, extraction, vectorization, and classification, and compares three machine learning methods: support vector machines, Naïve Bayes, and decision trees. Experiments on a dataset of articles published in the Tra Vinh University Journal of Science show promising results. The support vector machine classifier, in particular, achieved an accuracy of over 75%, demonstrating its potential as a basis for an automatic classification system for scientific papers.
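The pipeline named in the abstract (pre-processing, vectorization, classification) can be illustrated with a minimal sketch. The code below is not the authors' implementation: it covers only one of the three methods (Naïve Bayes, in its multinomial form with add-one smoothing), uses plain word counts as the vector representation, and trains on a four-sentence toy corpus invented for illustration. The names `tokenize`, `NaiveBayesTextClassifier`, `TRAIN_DOCS`, and the two topic labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Pre-processing: lowercase the text and keep alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, documents, labels):
        # Number of training documents per topic (used for class priors).
        self.class_doc_counts = Counter(labels)
        # Vectorization: per-topic word-count tables (bag of words).
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_doc_counts):
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(self.class_doc_counts[label] / self.total_docs)
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy corpus invented for this sketch; it is NOT the journal dataset.
TRAIN_DOCS = [
    ("Rice yield and soil salinity in coastal farms", "agriculture"),
    ("Shrimp pond water quality and crop rotation", "agriculture"),
    ("Neural network model for image recognition", "computing"),
    ("Database index algorithm for query speed", "computing"),
]

model = NaiveBayesTextClassifier().fit(
    [text for text, _ in TRAIN_DOCS],
    [label for _, label in TRAIN_DOCS],
)

if __name__ == "__main__":
    print(model.predict("soil and water quality for rice farms"))
    print(model.predict("algorithm for neural network"))
```

A real system along the lines described in the abstract would add the extraction step (pulling the title, abstract, or keywords out of each manuscript), would likely weight terms rather than use raw counts, and would evaluate all three classifiers on held-out articles before choosing one.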