AI-BASED CLINICAL TEXT CLASSIFICATION FOR LUNG DISEASE DIAGNOSIS

Thi-Diem Truong; Thanh-Nghi Do

doi:10.35382/tvujs.15.3.2025.137

Authors

Thi-Diem Truong, An Giang University, Vietnam National University Ho Chi Minh City, Vietnam
Thanh-Nghi Do, College of Information and Communication Technology, Can Tho University, Vietnam

DOI:

https://doi.org/10.35382/tvujs.15.3.2025.137

Keywords:

clinical data, electronic medical records, lung disease diagnosis, machine learning, text classification

Abstract

Lung diseases pose a significant challenge to global healthcare, and rising incidence and mortality rates underscore the need for more accurate and efficient diagnostic methods. Although artificial intelligence has shown considerable promise in improving diagnostic accuracy, research on applying natural language processing to Vietnamese clinical texts for lung disease classification remains limited. This study addresses that gap through two contributions. First, it introduces a novel clinical dataset covering 12 categories of lung diseases, derived from electronic health records at An Giang Provincial General Hospital, Vietnam. Second, it presents a comparative evaluation of text representation techniques, including traditional methods (bag-of-words and term frequency-inverse document frequency) and modern embeddings (Word2Vec, GloVe, FastText, BERT). These representations are combined with multiple machine learning models to assess classification performance. Experimental results show that traditional representations consistently outperform modern embeddings on Vietnamese clinical texts. Notably, the combination of bag-of-words with the light gradient boosting machine achieves the highest classification accuracy, 86.26%. These findings offer practical guidance on selecting appropriate natural language processing techniques for Vietnamese medical text analysis and underscore the potential of cost-effective artificial intelligence solutions in resource-limited healthcare settings.
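
To make the two traditional representations named in the abstract concrete, the following is a minimal sketch of bag-of-words and TF-IDF vectorization in plain Python. It is not the authors' pipeline: the whitespace tokenizer, the smoothed IDF formula, the function names, and the toy English notes are all assumptions made for illustration. Vietnamese text would typically need a dedicated word segmenter before this step.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace (an assumption for this sketch;
    # real Vietnamese clinical text would need proper word segmentation).
    return text.lower().split()


def build_vocabulary(docs):
    # Sorted list of the distinct tokens seen across all documents.
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def bag_of_words(doc, vocab):
    # Raw term counts, one entry per vocabulary term, in vocabulary order.
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tf_idf(docs, vocab):
    # Raw term frequency times a smoothed inverse document frequency:
    #   idf(t) = ln((1 + N) / (1 + df(t))) + 1
    # This smoothing is one common variant, assumed here for illustration.
    n_docs = len(docs)
    token_sets = [set(tokenize(d)) for d in docs]
    df = {t: sum(t in toks for toks in token_sets) for t in vocab}
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    return [
        [count * idf[term] for count, term in zip(bag_of_words(d, vocab), vocab)]
        for d in docs
    ]


# Hypothetical toy notes, not taken from the study's dataset.
notes = [
    "cough fever chest pain",
    "chronic cough wheezing",
    "fever chest pain cough cough",
]

vocab = build_vocabulary(notes)
bow = [bag_of_words(n, vocab) for n in notes]
weights = tf_idf(notes, vocab)
```

In this sketch, a term that appears in every note (such as "cough") receives the minimum IDF of 1, while a term confined to one note (such as "chronic") is weighted more heavily. Either matrix could then be passed to a classifier such as a gradient boosting model.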