
Distributed Training

This project contains scripts/modules for distributed training.
Given the size of current deep learning models, datasets, and training methodologies, waiting for a model to train on a single GPU can feel like waiting for an infant to take its first steps.

Let’s cut to the chase.
In this repository I try to simplify the concepts and (a few) implementations of distributed training.

Introduction

There are generally two ways to distribute computation across multiple devices:

- Data parallelism: replicate the model on every device and feed each replica a different slice of the data, synchronizing gradients between replicas.
- Model parallelism: split a single model across devices, so each device holds and computes only part of it.

Before going ahead, there are two cases we need to think about first:

Certain concepts and implementations have been adapted directly from other sources; you will find them reproduced here.

PyTorch Distributed Training

PyTorch has two ways to split models and data across multiple GPUs: nn.DataParallel and nn.DistributedDataParallel.

nn.DataParallel: single-process and multi-threaded, restricted to the GPUs of one machine. It is a one-line change, but it is usually the slower option because of Python's GIL and the per-iteration scatter/gather of inputs and outputs.

nn.DistributedDataParallel: one process per GPU, working across one or many machines. Each process holds its own model replica and optimizer, and gradients are averaged with an all-reduce, which sidesteps the GIL and scales better.
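
For contrast, here is a minimal sketch of the nn.DataParallel route (the model is a placeholder, not this repository's code); the multi-process nn.DistributedDataParallel route is what the Implementation section below walks through:

    import torch
    import torch.nn as nn

    # Placeholder model standing in for whatever this repository trains.
    model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

    if torch.cuda.device_count() > 1:
        # Single process: each batch is scattered across the visible GPUs
        # and the outputs are gathered back on the default device.
        model = nn.DataParallel(model)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device)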

Implementation

Driver script: on execution of main.py in either module, the script launches a process for every GPU.
Each process needs to know which GPU to use and where it ranks amongst all the processes that are running (a minimal driver sketch appears after the Calculated line below).

Parameters: --nodes (number of machines), --gpus (GPUs per machine) and --epochs (number of training epochs), matching the Usage section below.

Calculated: the world size (nodes × gpus, i.e. the total number of processes) and each process's global rank.
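
A minimal sketch of such a driver, assuming the --nodes/--gpus/--epochs flags from the Usage section below plus a hypothetical --nr flag for this machine's index; the repository's actual argument handling may differ:

    import argparse
    import os

    import torch.multiprocessing as mp

    def train(gpu, args):
        # Global rank of this process: machine index * GPUs per machine + local GPU index.
        rank = args.nr * args.gpus + gpu
        ...  # per-process training, sketched under "Training" below

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--nodes", type=int, default=1)   # number of machines
        parser.add_argument("--gpus", type=int, default=1)    # GPUs per machine
        parser.add_argument("--nr", type=int, default=0)      # index of this machine (hypothetical flag)
        parser.add_argument("--epochs", type=int, default=5)
        args = parser.parse_args()

        # Calculated: total number of processes across all machines.
        args.world_size = args.nodes * args.gpus

        # Rendezvous address of the rank-0 process (use the rank-0 machine's
        # address instead of localhost when training on multiple nodes).
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"

        # Launch one process per GPU on this machine.
        mp.spawn(train, nprocs=args.gpus, args=(args,))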

Training: each process joins the process group, pins its GPU, wraps the model in nn.DistributedDataParallel and reads its own shard of the data through a DistributedSampler.
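
A sketch of what each spawned worker typically does with nn.DistributedDataParallel; the model and dataset are placeholders, not this repository's code, and MASTER_ADDR/MASTER_PORT are assumed to have been set by the driver above:

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP
    from torch.utils.data import DataLoader, TensorDataset
    from torch.utils.data.distributed import DistributedSampler

    def train(gpu, args):
        rank = args.nr * args.gpus + gpu
        dist.init_process_group(backend="nccl", world_size=args.world_size, rank=rank)
        torch.cuda.set_device(gpu)

        # Placeholder model and data; substitute the repository's own.
        model = DDP(nn.Linear(128, 10).cuda(gpu), device_ids=[gpu])
        dataset = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))

        # Each process reads a different, non-overlapping shard of the dataset.
        sampler = DistributedSampler(dataset, num_replicas=args.world_size, rank=rank)
        loader = DataLoader(dataset, batch_size=32, sampler=sampler)

        criterion = nn.CrossEntropyLoss().cuda(gpu)
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

        for epoch in range(args.epochs):
            sampler.set_epoch(epoch)  # reshuffle the shards every epoch
            for x, y in loader:
                x, y = x.cuda(gpu, non_blocking=True), y.cuda(gpu, non_blocking=True)
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                loss.backward()  # gradients are all-reduced across processes here
                optimizer.step()

        dist.destroy_process_group()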

Drawbacks:
Here we try to train one model on multiple GPUs instead of one.
As elegant and simple as the approach seems, there are a few pitfalls.

Problem:

Solution:

An alternative to nn.DistributedDataParallel is Nvidia’s Apex, which adds mixed precision.

Mixed Precision: the use of lower-precision operations (float16 and bfloat16) in a model during training to make it run faster and use less memory. Using mixed precision can improve performance by more than 3 times on modern GPUs and 60% on TPUs [1].
The weights remain at 32-bit, whereas other quantities such as the loss and gradients are computed at 16-bit. More on that here
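
A minimal sketch of Apex's amp API under that scheme; the model and data are placeholders, and opt_level "O1" keeps FP32 master weights while running most ops in FP16:

    import torch
    import torch.nn as nn
    from apex import amp

    model = nn.Linear(128, 10).cuda()          # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

    # Patch the model and optimizer for mixed precision: weights stay in FP32,
    # most forward/backward ops run in FP16, and dynamic loss scaling guards
    # against gradient underflow.
    model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

    x = torch.randn(32, 128).cuda()
    y = torch.randint(0, 10, (32,)).cuda()
    loss = nn.functional.cross_entropy(model(x), y)

    # Backward through the scaled loss so small FP16 gradients are not flushed to zero.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()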

Usage

    $ python main.py --nodes 1 --gpus 2 --epochs 5
    $ python main.py --nodes 2 --gpus 2 --epochs 5

Distributed Training with Nvidia’s Apex

Concept

Nvidia’s Apex is the best alternative to traditional distributed training for the following reasons:

Changes as compared to torch
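
A hedged sketch of what those changes usually look like relative to the plain-torch worker above, based on Apex's documented amp and apex.parallel.DistributedDataParallel API (not necessarily this repository's exact diff):

    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from apex import amp
    from apex.parallel import DistributedDataParallel as ApexDDP

    def train(gpu, args):
        rank = args.nr * args.gpus + gpu
        dist.init_process_group(backend="nccl", world_size=args.world_size, rank=rank)
        torch.cuda.set_device(gpu)

        model = nn.Linear(128, 10).cuda(gpu)   # placeholder model
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

        # Change 1: initialize amp before wrapping the model for distribution.
        model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

        # Change 2: use Apex's DistributedDataParallel instead of torch's
        # (it infers the device from the model, so no device_ids argument).
        model = ApexDDP(model)

        x = torch.randn(32, 128).cuda(gpu)
        y = torch.randint(0, 10, (32,)).cuda(gpu)
        loss = nn.functional.cross_entropy(model(x), y)

        # Change 3: backward through the scaled loss instead of loss.backward().
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()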

Usage

    $ python main.py --nodes 1 --gpus 2 --epochs 5
    $ python main.py --nodes 2 --gpus 2 --epochs 5