Microsoft’s massive FPGA-based AI platform achieves real-time processing at data centre scale

Microsoft today revealed more about the technology which will eventually power the robots which will crush our skulls beneath their heels.

Called Project Brainwave, Microsoft’s cloud-based AI platform is powered by Intel’s new 14 nm Stratix 10 FPGAs and is able to deliver a sustained 39.5 Teraflops, running each request in under one millisecond. This high performance and ultra-low latency lets Microsoft deliver real-time AI, which is becoming increasingly important as cloud infrastructure processes live data streams, whether they are search queries, videos, sensor streams, or interactions with users.

By attaching high-performance FPGAs directly to their datacenter network, Microsoft can serve DNNs as hardware microservices, where a DNN can be mapped to a pool of remote FPGAs and called by a server with no software in the loop. This system architecture both reduces latency, since the CPU does not need to process incoming requests, and allows very high throughput, with the FPGA processing requests as fast as the network can stream them.
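The “hardware microservice” idea above can be sketched in a few lines: a model is pinned to a pool of network-attached FPGAs, and any server dispatches requests straight to the pool. This is a hypothetical illustration only — the class and endpoint names are invented, not Microsoft’s actual API.

```python
# Hypothetical sketch (not Microsoft's actual API): a DNN served as a
# hardware microservice is mapped to a pool of remote FPGA endpoints,
# and callers dispatch to the pool with no local CPU inference in the loop.
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class FpgaPool:
    """A named DNN model mapped onto a pool of remote FPGA endpoints."""
    model: str
    endpoints: list
    _rr: object = field(init=False, repr=False)

    def __post_init__(self):
        self._rr = cycle(self.endpoints)  # round-robin across the pool

    def infer(self, request):
        # In the real system this would be a network send to the FPGA;
        # here we just record which endpoint would handle the request.
        endpoint = next(self._rr)
        return {"model": self.model, "endpoint": endpoint, "input": request}

pool = FpgaPool("gru-large", ["fpga-01", "fpga-02", "fpga-03"])
print(pool.infer("query-A")["endpoint"])  # fpga-01
print(pool.infer("query-B")["endpoint"])  # fpga-02
```

Because the FPGAs sit directly on the datacenter network, scaling out is a matter of adding endpoints to the pool rather than adding CPU-attached accelerators to each server.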

Project Brainwave uses a powerful “soft” DNN processing unit (or DPU) synthesized onto commercially available FPGAs. It combines the FPGAs’ hard ASIC digital signal processing blocks with synthesizable logic to provide a greater and more optimized number of functional units. Using a number of custom techniques, it can achieve performance comparable to – or greater than – many hard-coded DPU chips.

To help developers make use of all this power, Project Brainwave incorporates a software stack designed to support the wide range of popular deep learning frameworks. It already supports Microsoft Cognitive Toolkit and Google’s TensorFlow, with plans to support many others.

The system is architected to deliver high actual performance across a wide range of complex models, with batch-free execution, and can handle complex, memory-intensive models such as LSTMs in real time.
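A quick back-of-envelope calculation shows why batch-free execution matters for real-time serving: with batching, a request can sit waiting for the batch to fill before compute even starts. The batch size, arrival gap, and compute time below are illustrative assumptions, not figures from Microsoft.

```python
# Illustrative back-of-envelope (assumed numbers, not Microsoft's): why
# batch-free execution helps latency for live request streams.
def batched_worst_case_ms(batch_size, arrival_gap_ms, compute_ms):
    # The first request in a batch waits for the remaining (batch_size - 1)
    # arrivals, then the whole batch is computed together.
    return (batch_size - 1) * arrival_gap_ms + compute_ms

def batch_free_ms(compute_ms):
    return compute_ms  # each request runs as soon as it arrives

print(batched_worst_case_ms(batch_size=32, arrival_gap_ms=0.5, compute_ms=1.0))  # 16.5
print(batch_free_ms(1.0))  # 1.0
```

Even with batching improving raw throughput, the waiting term dominates worst-case latency, which is why a batch-free design can keep every request under a millisecond.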

Even on early Stratix 10 silicon, Microsoft demonstrated the ported Project Brainwave system running a large GRU model—five times larger than ResNet-50—with no batching, and achieved record-setting performance. The demo used Microsoft’s custom 8-bit floating point format (“ms-fp8”), which does not suffer accuracy losses (on average) across a range of models.
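To make the idea of an 8-bit float concrete, here is a decoder for a generic minifloat with 1 sign bit, a 4-bit exponent, and a 3-bit mantissa. Microsoft has not published the exact ms-fp8 layout, so the field widths and bias below are illustrative assumptions, not the real format.

```python
# Generic 8-bit floating-point decoder (1 sign, 4 exponent, 3 mantissa
# bits, bias 7). The actual ms-fp8 field layout is unpublished; this is
# an assumed, illustrative encoding only.
def decode_fp8(byte):
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> 3) & 0xF      # 4-bit exponent field
    man = byte & 0x7             # 3-bit mantissa field
    bias = 7
    if exp == 0:                 # subnormal: no implicit leading 1
        return sign * (man / 8) * 2.0 ** (1 - bias)
    return sign * (1 + man / 8) * 2.0 ** (exp - bias)

# 0b0_0111_000: sign 0, biased exponent 7 (i.e. 2^0), mantissa 0 -> 1.0
print(decode_fp8(0b00111000))  # 1.0
print(decode_fp8(0b00111100))  # 1.5
```

The appeal of such narrow formats is that each multiplier and adder gets far smaller, so many more functional units fit on the same FPGA fabric, which is where much of the headline throughput comes from.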

They showed Stratix 10 sustaining 39.5 Teraflops on this large GRU, running each request in under one millisecond. At that level of performance, the Brainwave architecture sustains execution of over 130,000 compute operations per cycle, driven by one macro-instruction being issued every 10 cycles. Running on Stratix 10, Project Brainwave achieved unprecedented levels of demonstrated real-time AI performance on extremely challenging models, with today’s performance just a starting point.

Microsoft plans to bring Project Brainwave to Azure in 2018 so any customer can gain access to the technology, allowing them to run their most complex deep learning models at record-setting performance, and bring Armageddon one step closer.

Read more detail about the technology at Microsoft here.