Evaluation of different text representation techniques and distance metrics using KNN for document classification


Format: Article
Language: Spanish
Published: Editorial Tecnológica de Costa Rica (publisher), 2020
Subjects:
KNN
Online access: https://revistas.tec.ac.cr/index.php/tec_marcha/article/view/5022
https://hdl.handle.net/2238/12052
Summary: Nowadays, text data is a fundamental part of databases around the world, and one of the biggest challenges has been the extraction of meaningful information from large sets of text. The existing literature on text classification is extensive; however, during the last 25 years statistical methods (where similarity functions are applied over vectors of words) have achieved good results in many areas of text mining. Additionally, several models have been proposed to achieve dimensionality reduction and incorporate the semantic factor, such as topic modelling. In this paper we evaluate different text representation techniques, including traditional bag of words and topic modelling. The evaluation is done by testing different combinations of text representations and text distance metrics (Cosine, Jaccard and Kullback-Leibler divergence) using K-Nearest-Neighbors, in order to determine the effectiveness of using topic modelling representations for dimensionality reduction when classifying text. The results show that the simplest version of bag of words combined with the Jaccard similarity outperformed the rest of the combinations in most cases. A statistical test showed that the accuracy values obtained when using supervised Latent Dirichlet Allocation representations, combined with the relative entropy metric, were not significantly different from those obtained with traditional text classification techniques. LDA managed to abstract thousands of words into fewer than 60 topics for the main set of experiments. Additional experiments suggest that topic modelling can perform better when used for short text documents or when the number of topics (dimensions) is increased at the moment of generating the model.
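The evaluation described in the summary (KNN voting over document similarities such as Cosine or Jaccard, computed on bag-of-words representations) can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the toy training documents, labels, and function names are invented for demonstration.

```python
from collections import Counter
import math

def jaccard(a, b):
    # Jaccard similarity between two token lists, treated as sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(doc_tokens, train, k=3, sim=jaccard):
    # train: list of (tokens, label) pairs; majority vote among
    # the k training documents most similar to the query document
    ranked = sorted(train, key=lambda tl: sim(doc_tokens, tl[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy corpus with two classes (hypothetical data for illustration)
train = [
    ("the match ended in a goal".split(), "sports"),
    ("the team won the league".split(), "sports"),
    ("stocks fell as markets closed".split(), "finance"),
    ("the bank raised interest rates".split(), "finance"),
]
print(knn_classify("the team scored a late goal".split(), train))
```

Swapping `sim=jaccard` for a cosine (over `Counter` vectors) or Kullback-Leibler-based distance (over topic distributions) reproduces the kind of representation/metric combinations the paper compares.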