Last year, Microsoft Research unveiled the Computational Network Toolkit (CNTK), a unified computational network framework that describes deep neural networks as a series of computational steps via a directed graph. By combining CNTK with Microsoft’s Azure GPU Lab, Microsoft has a distributed GPU platform that the community can use to advance AI research. Since the launch of CNTK last year, the MSR team has significantly improved machine learning efficiency with Azure GPU Lab. In fact, in our benchmarks CNTK now delivers the most efficient distributed computational performance, ahead of Google’s TensorFlow and the other toolkits we tested.
For mission-critical AI research, we believe efficiency and performance should be among the most important design criteria. A number of deep learning toolkits are available, ranging from Torch, Theano and Caffe to the recently open-sourced toolkits from Google and IBM. We compared CNTK with four popular toolkits, focusing on raw computational efficiency. The comparison uses simulated data with an effective mini-batch size of 8192, large enough to fully utilize all GPUs. With a fully connected 4-layer neural network (see our benchmark scripts), the number of frames each toolkit can process per second is illustrated in the chart. We include two configurations on a single Linux machine, with 1 and 4 GPUs (Nvidia K40) respectively. We also report our 8-GPU CNTK speed on Azure GPU Lab, measured on 2 Linux machines (2 x 4 GPUs) identical to the one used in the baseline benchmark. For distributed deep learning (4 GPUs or 8 GPUs), CNTK compares favorably in computational efficiency with all the toolkits we tested. CNTK can also easily scale beyond 8 GPUs across multiple machines with superior distributed system performance.
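To make the metric concrete, here is a minimal Python sketch of how frames per second and multi-GPU scaling efficiency could be computed from a timed run. It is not taken from our benchmark scripts: the function names, the `step` callable and the notion of "scaling efficiency" as measured throughput divided by ideal linear throughput are assumptions made for illustration only.

```python
import time


def frames_per_second(minibatch_size, num_minibatches, elapsed_seconds):
    """Throughput in frames/s: total frames processed divided by wall time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return minibatch_size * num_minibatches / elapsed_seconds


def measure_throughput(step, minibatch_size, num_minibatches):
    """Time `num_minibatches` calls of `step` and return frames/s.

    `step` is a hypothetical zero-argument callable that processes one
    mini-batch (for example, one forward/backward pass in some toolkit).
    """
    start = time.perf_counter()
    for _ in range(num_minibatches):
        step()
    elapsed = time.perf_counter() - start
    return frames_per_second(minibatch_size, num_minibatches, elapsed)


def scaling_efficiency(fps_multi_gpu, fps_single_gpu, num_gpus):
    """Fraction of ideal linear speed-up achieved (1.0 means perfect scaling)."""
    if fps_single_gpu <= 0 or num_gpus <= 0:
        raise ValueError("fps_single_gpu and num_gpus must be positive")
    return fps_multi_gpu / (fps_single_gpu * num_gpus)
```

With the mini-batch size of 8192 used above, `frames_per_second(8192, n, t)` gives the throughput for `n` mini-batches processed in `t` seconds, and `scaling_efficiency` compares a 4-GPU or 8-GPU figure against the single-GPU figure for the same toolkit.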