A study of checkpointing in large scale training of deep neural networks

Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementations of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
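
To make the checkpoint-restart technique concrete, the sketch below shows how training state is typically saved and restored in PyTorch, one of the three frameworks the paper studies. It is a minimal illustration, not the paper's experimental setup: the toy model, checkpoint file name, and epoch count are hypothetical.

    # Minimal checkpoint-restart sketch in PyTorch (illustrative assumptions only).
    import os
    import torch
    import torch.nn as nn

    model = nn.Linear(10, 1)                      # toy model standing in for a real DNN
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    ckpt_path = "checkpoint.pt"                   # hypothetical checkpoint file
    start_epoch = 0

    # Restart: if a checkpoint exists, restore model, optimizer, and epoch counter.
    if os.path.exists(ckpt_path):
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        start_epoch = state["epoch"] + 1

    for epoch in range(start_epoch, 5):
        # One synthetic training step per epoch, just to have state that evolves.
        x, y = torch.randn(32, 10), torch.randn(32, 1)
        loss = nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Checkpoint: serialize everything needed to resume after a failure.
        torch.save({"model": model.state_dict(),
                    "optimizer": optimizer.state_dict(),
                    "epoch": epoch}, ckpt_path)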


Main Authors: Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M
Format: Article
Language: English
Published: arXiv.org, 2023
Subjects: DEEP LEARNING; RESILIENCE; NEURAL NETWORKS; HIGH PERFORMANCE COMPUTING
Online Access: http://hdl.handle.net/11056/26772
https://doi.org/10.48550/arXiv.2012.00825
id RepoUNACR26772
affiliation Universidad Nacional, Costa Rica, Sede Regional Brunca, Campus Pérez Zeledón
date_issued 2021-03-29
resource_type http://purl.org/coar/resource_type/c_816b
rights Open access; Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)
file_format application/pdf
institution Universidad Nacional de Costa Rica
collection Repositorio UNA-Costa Rica
language English
topic APRENDIZAJE PROFUNDO
RESILIENCIA
REDES NEURONALES
COMPUTACIÓN DE ALTO RENDIMIENTO
DEEP LEARNING
RESILIENCE
NEURAL NETWORKS
HIGH PERFORMANCE COMPUTING
description Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementations of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
format Article
author Rojas, Elvis
Kahira, Albert Njoroge
Meneses, Esteban
Bautista-Gomez, Leonardo
Badia, Rosa M
title A study of checkpointing in large scale training of deep neural networks
publisher arXiv.org
publishDate 2023
url http://hdl.handle.net/11056/26772
https://doi.org/10.48550/arXiv.2012.00825