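Evaluación del uso de distintas métricas de distancia de texto en un algoritmo agregado para la imputación de valores faltantes mediante clasificación [Evaluation of the use of different text distance metrics in an aggregate algorithm for missing-value imputation through classification]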

Graduation Project (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.

Main Author:	Mena-Arias, José Andrés
Other Authors:	Calvo-Valverde, Luis Alexánder
Format:	Thesis
Language:	Spanish
Published:	Instituto Tecnológico de Costa Rica 2018
Subjects:	Databases; Methods; Categories; Latent Dirichlet modeling; Research Subject Categories::TECHNOLOGY::Information technology::Computer engineering
Online Access:	https://hdl.handle.net/2238/9658

id	RepoTEC9658
recordtype	dspace
spelling	RepoTEC9658 2023-05-04T14:28:25Z 2018-03-20T17:22:37Z 2017 info:eu-repo/semantics/masterThesis spa application/pdf
abstract	Missing values are a widespread problem in databases, with causes ranging from hardware malfunctions to non-mandatory form fields. Data imputation is the use of some method to find plausible values for those that are missing. When a missing value can be inferred from a text attribute, imputation can be treated as a text classification problem: text documents are assigned to categories that represent the plausible missing values. This in turn requires measuring how similar one text value is to another. The literature on this kind of problem is extensive; over the last 25 years, statistical methods (where similarity functions are applied to word vectors) have achieved good results in many areas of text mining [38]. In recent years, topic modeling has also emerged as a promising alternative to existing methods, because it achieves dimensionality reduction and incorporates semantics when classifying documents [30]. This project evaluates traditional data representation techniques and similarity metrics (word vectors, Cosine and Jaccard) against topic modeling techniques and the comparison of probability distributions (Latent Dirichlet Allocation and Kullback-Leibler divergence).
A statistical analysis is applied to the results of several experiments that used these metrics, both individually and combined, to classify data sets of text documents. At a high level, the results show that the accuracy scores achieved with document representations obtained through Latent Dirichlet Allocation, combined with the relative entropy metric, were statistically similar to those obtained with traditional text classification techniques. For the main set of experiments, topic modeling abstracts thousands of words into fewer than 60 topics. The results also highlight drawbacks, areas for improvement, and potential scenarios where such models could perform better.
institution	Tecnológico de Costa Rica
collection	Repositorio TEC
language	Spanish
topic	Databases; Methods; Categories; Latent Dirichlet modeling; Research Subject Categories::TECHNOLOGY::Information technology::Computer engineering
description	Graduation Project (Master's in Computing), Instituto Tecnológico de Costa Rica, Escuela de Ingeniería en Computación, 2017.
author2	Calvo-Valverde, Luis Alexánder
format	Thesis
author	Mena-Arias, José Andrés
title	Evaluación del uso de distintas métricas de distancia de texto en un algoritmo agregado para la imputación de valores faltantes mediante clasificación
publisher	Instituto Tecnológico de Costa Rica
publishDate	2018
url	https://hdl.handle.net/2238/9658
_version_	1796142610635554816
score	12.041648
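
The abstract in this record names three ways of comparing text: Cosine and Jaccard similarity over word vectors, and Kullback-Leibler divergence (relative entropy) over topic distributions. The record does not describe the thesis's actual algorithm, so the following is only a minimal Python sketch of those three measures and of how a missing category could be imputed from the most similar labeled text. All function names, the whitespace tokenizer, and the nearest-neighbor rule are assumptions made for illustration; the topic distributions passed to `kl_divergence` are assumed to come from an already fitted topic model such as Latent Dirichlet Allocation.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Jaccard similarity between two sets of tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def kl_divergence(p, q, eps=1e-12):
    """Relative entropy D_KL(p || q) of two equal-length discrete distributions.

    Terms with p_i == 0 contribute nothing; eps guards against division by zero
    when q_i == 0 (an assumption of this sketch, not a method from the thesis).
    """
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def impute_by_nearest_text(query, labeled_rows):
    """Return the label of the row whose text is most cosine-similar to query.

    labeled_rows is a list of (text, label) pairs with known labels.
    """
    q = Counter(tokenize(query))
    best_label, best_score = None, -1.0
    for text, label in labeled_rows:
        score = cosine_similarity(q, Counter(tokenize(text)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, given the hypothetical rows `[("laser printer toner cartridge", "office"), ("fresh orange juice", "grocery")]`, calling `impute_by_nearest_text("black toner cartridge", rows)` selects the label of the first row, because only that row shares words with the query. Note that Cosine and Jaccard are similarities (higher means closer), while Kullback-Leibler divergence is asymmetric and lower means closer, so a classifier built on it would pick the minimum instead of the maximum.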
