Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes

Peng Sun et al., SenseTime, 2019

Communication backend: GradientFlow

  • ring-based allreduce
  • mixed-precision training
  • computation/communication overlap
  • lazy allreduce: fusing multiple communication operations into one (see the sketch after this list)
  • coarse-grained sparse communication (CSC): only transmitting important gradient chunks

and also, to preserve accuracy under sparsification,

  • momentum SGD correction
  • warm-up dense training
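A minimal sketch of the lazy-allreduce idea referenced above: per-layer gradients are buffered during the backward pass and flushed as one fused collective once the buffer is large enough. The threshold value, class name, and the stand-in allreduce function are assumptions for illustration, not details from the paper.

```python
import numpy as np

FUSION_THRESHOLD = 4 * 1024 * 1024  # flush after ~4M buffered elements (assumed value)

class LazyAllreduce:
    """Buffer per-layer gradients and issue one fused allreduce per flush."""

    def __init__(self, allreduce_fn, threshold=FUSION_THRESHOLD):
        self.allreduce_fn = allreduce_fn   # collective that sums a flat buffer across workers
        self.threshold = threshold
        self.pending = []                  # (name, gradient) pairs waiting to be fused
        self.buffered = 0

    def add_gradient(self, name, grad):
        self.pending.append((name, grad))
        self.buffered += grad.size
        if self.buffered >= self.threshold:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        # Fuse all pending gradients into one contiguous buffer -> one collective call
        flat = np.concatenate([g.ravel() for _, g in self.pending])
        reduced = self.allreduce_fn(flat)
        # Scatter the reduced values back into the original gradient tensors
        offset = 0
        for _, g in self.pending:
            g[...] = reduced[offset:offset + g.size].reshape(g.shape)
            offset += g.size
        self.pending.clear()
        self.buffered = 0

# Usage with a fake 4-worker "sum" standing in for a real ring allreduce:
lazy = LazyAllreduce(allreduce_fn=lambda buf: buf * 4)
```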

DGC (Deep Gradient Compression) naively performs fine-grained (i.e., element-wise) top-k selection of gradients. Because each node selects a different set of element indices, the sparse results must be exchanged with an allgather, so the communication volume grows linearly with the number of nodes.
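A toy single-process simulation of why element-wise top-k forces an allgather (worker count, gradient size, and k are arbitrary): each worker picks its own indices, so the union of (index, value) pairs must be gathered by everyone, and the exchanged data scales with the number of workers.

```python
import numpy as np

def topk_sparsify(grad, k):
    """Element-wise (fine-grained) top-k by magnitude: return (indices, values)."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return idx, grad[idx]

num_workers, grad_size, k = 8, 10_000, 100
rng = np.random.default_rng(0)
grads = [rng.standard_normal(grad_size) for _ in range(num_workers)]

# Each worker selects a *different* set of indices...
sparse = [topk_sparsify(g, k) for g in grads]

# ...so the sparse (index, value) pairs must be allgathered: every worker ends up
# holding num_workers * k entries, i.e. the volume grows linearly with #nodes.
gathered_indices = np.concatenate([idx for idx, _ in sparse])
gathered_values = np.concatenate([val for _, val in sparse])
print(gathered_values.size)  # 800 = num_workers * k
```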

CSC modifies this with coarse-grained sparsification: gradients are partitioned into chunks, the L1-norm of each chunk is allreduced, and every node then selects the same top chunks and allreduces only those chunks. This removes the allgather entirely and keeps the communication volume independent of the number of nodes.
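A toy single-process simulation of the coarse-grained scheme (chunk size, worker count, and number of transmitted chunks are illustrative; the collectives are simulated as plain sums): because the per-chunk L1 norms are themselves allreduced, every worker derives the same chunk ranking, so only the selected chunks need an ordinary allreduce and no allgather is required.

```python
import numpy as np

def csc_select_chunks(worker_grads, chunk_size, num_chunks_to_send):
    """Simulate coarse-grained sparse communication across workers."""
    chunks = [g.reshape(-1, chunk_size) for g in worker_grads]   # partition into chunks

    # Step 1: allreduce the per-chunk L1 norms (a small dense collective, simulated as a sum).
    global_l1 = sum(np.abs(c).sum(axis=1) for c in chunks)

    # Step 2: every worker ranks chunks by the *same* allreduced norms,
    # so all workers agree on which chunks to transmit -- no allgather needed.
    selected = np.argsort(global_l1)[-num_chunks_to_send:]

    # Step 3: allreduce only the selected chunks (simulated as an element-wise sum).
    reduced_chunks = sum(c[selected] for c in chunks)
    return selected, reduced_chunks

rng = np.random.default_rng(0)
workers = [rng.standard_normal(10_000) for _ in range(8)]
sel, reduced = csc_select_chunks(workers, chunk_size=100, num_chunks_to_send=10)
print(sel.shape, reduced.shape)  # (10,) (10, 100)
```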