Microsoft yesterday released a new library that helps data scientists be more productive on Apache Spark. The Microsoft Machine Learning library for Apache Spark (MMLSpark) is intended to increase the rate of experimentation and to let you apply cutting-edge machine learning techniques to very large datasets. It provides simplified, consistent APIs for handling different types of data, such as text or categoricals. With this new library, you can simply pass the data to the model, and the library takes care of the rest. It also allows you to easily change the feature space and algorithm without having to re-code the pipeline. The capabilities of MMLSpark include:
- DNN featurization: Using a pre-trained model is a great approach when you’re constrained by time or the amount of labeled data. You can use pre-trained state-of-the-art neural networks such as ResNet to extract high-order features from images in a scalable manner, and then pass these features to traditional ML models, such as logistic regression or decision forests.
- Training on a GPU node: Sometimes your problem is so domain-specific that a pre-trained model is not suitable, and you need to train your own DNN model. You can use Spark worker nodes to pre-process and condense large datasets before DNN training, then feed the data to a GPU VM for accelerated DNN training, and finally broadcast the trained model to the worker nodes for scalable scoring.
- Scalable image processing pipelines: For a complete end-to-end workflow for image processing, DNN integration is not enough. Typically, you have to pre-process your images so they have the correct shape and normalization, before passing them to DNN models. In MMLSpark, you can use OpenCV-based image transformations to read in and prepare your data.
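To make the pre-processing step in the last item concrete, here is a minimal sketch in plain Python of what "correct shape and normalization" can mean in practice. It does not use MMLSpark or OpenCV; the function names (`resize_nearest`, `normalize`, `preprocess`), the 4x4 target size, and the mean and standard deviation of 0.5 are all assumptions chosen for illustration, and a real pipeline would rely on the library's own image transformations.

```python
from typing import List

# Grayscale image as rows of pixel intensities in the range 0-255.
Image = List[List[float]]


def resize_nearest(image: Image, out_h: int, out_w: int) -> Image:
    """Resize with nearest-neighbour sampling so every image has the same shape."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def normalize(image: Image, mean: float, std: float) -> Image:
    """Scale pixels to [0, 1], then shift by mean and divide by std."""
    return [[(p / 255.0 - mean) / std for p in row] for row in image]


def preprocess(image: Image, size: int = 4, mean: float = 0.5, std: float = 0.5) -> Image:
    """Hypothetical helper: resize to size x size, then normalize."""
    return normalize(resize_nearest(image, size, size), mean, std)


# Two images with different shapes end up with the same shape and value range.
small = [[0, 255], [255, 0]]
wide = [[0, 64, 128, 192, 255, 255] for _ in range(3)]
batch = [preprocess(img) for img in (small, wide)]
```

After this step every image in `batch` is a 4x4 grid, which is the kind of uniform input a DNN expects. In a Spark setting the same per-image function would be applied across the worker nodes rather than in a local loop.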