To make it possible to train big models efficiently on just a modest cluster, Microsoft recently released the Distributed Machine Learning Toolkit (DMTK), which contains both algorithmic and system innovations. The goal is to make big data research more scalable, efficient and flexible.
The toolkit, available now on GitHub, is designed for distributed machine learning, that is, using multiple computers in parallel to solve a complex problem. It contains a parameter server-based programming framework, which makes machine learning tasks on big data highly scalable, efficient and flexible. It also contains two distributed machine learning algorithms, which can be used to train the fastest and largest topic model and the largest word-embedding model in the world.
The toolkit offers rich and easy-to-use APIs that lower the barrier to distributed machine learning, so researchers and developers can focus on the core machine learning concerns: data, model and training.
The current version of DMTK includes the following components (more components will be added in future versions):
• DMTK Framework: a flexible framework that supports a unified interface for data parallelization, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency.
• LightLDA: an extremely fast and scalable topic model algorithm, with an O(1) Gibbs sampler and an efficient distributed implementation.
• Distributed (Multi-sense) Word Embedding: a distributed version of the (multi-sense) word embedding algorithm.
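To give a feel for how a sampler can draw a topic in O(1) time, here is a minimal sketch of the alias method, a standard technique for constant-time sampling from a fixed discrete distribution after linear-time setup. It illustrates the general idea only and is not taken from the LightLDA code; the function names `build_alias_table` and `sample` are hypothetical.

```python
import random


def build_alias_table(probs):
    """Build an alias table (Vose's method) in O(n) for a distribution that sums to 1."""
    n = len(probs)
    scaled = [p * n for p in probs]           # rescale so the average bucket mass is 1
    prob = [0.0] * n                          # chance of keeping bucket i's own outcome
    alias = [0] * n                           # fallback outcome for bucket i
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        lo = small.pop()
        hi = large.pop()
        prob[lo] = scaled[lo]
        alias[lo] = hi                        # bucket lo is topped up with mass from hi
        scaled[hi] = scaled[hi] + scaled[lo] - 1.0
        (small if scaled[hi] < 1.0 else large).append(hi)
    for i in large + small:                   # leftovers hold exactly one unit of mass
        prob[i] = 1.0
        alias[i] = i
    return prob, alias


def sample(prob, alias, rng=random):
    """Draw one outcome in O(1): pick a bucket uniformly, then flip a biased coin."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]
```

Building the table costs O(n), but each later draw touches only one bucket, so the per-sample cost does not grow with the number of topics as long as the table can be reused across many draws.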
Machine learning researchers and practitioners can also build their own distributed machine learning algorithms on top of our framework with small modifications to their existing single-machine algorithms.
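The parameter-server pattern behind this idea can be sketched in a few lines. The example below is a single-process toy, not DMTK's actual API: the names `ParameterServer`, `pull`, `push`, `worker_gradient` and `train` are hypothetical, and the workers run one after another rather than on separate machines. It shows how an ordinary single-machine gradient step for linear regression needs only a pull before the computation and a push after it.

```python
class ParameterServer:
    """Toy in-process stand-in for a parameter server holding the shared model."""

    def __init__(self, dim, lr):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        # Workers fetch a copy of the current global parameters.
        return list(self.weights)

    def push(self, grad):
        # Workers send a gradient; the server applies the update.
        for i, g in enumerate(grad):
            self.weights[i] -= self.lr * g


def worker_gradient(weights, shard):
    """Single-machine logic: mean squared-error gradient on one data shard."""
    grad = [0.0] * len(weights)
    for x, y in shard:
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(shard)
    return grad


def train(shards, dim, lr=0.2, rounds=100):
    """Data parallelism in miniature: each shard plays the role of one worker."""
    server = ParameterServer(dim, lr)
    for _ in range(rounds):
        for shard in shards:
            weights = server.pull()
            server.push(worker_gradient(weights, shard))
    return server.weights


# Usage: fit y = 2*x + 1 from data split across two simulated workers.
data = [([float(x), 1.0], 2.0 * x + 1.0) for x in range(-2, 4)]
shards = [data[0::2], data[1::2]]
weights = train(shards, dim=2)
```

In a real deployment the pull and push calls would cross the network and the workers would run concurrently; the point of the sketch is only that `worker_gradient` is unchanged from what a single-machine implementation would use.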